Feature Extraction and Matching in Amharic Document Image Collections

Letta, Adane (2011) Feature Extraction and Matching in Amharic Document Image Collections. Masters thesis, Addis Ababa University.


Abstract

The ubiquity of digital computers and the boom of the Internet and the World Wide Web have resulted in a massive information explosion across the world. Many types of information are uploaded to the Internet, such as text documents, document images and other multimedia files. Document images facilitate office automation by preserving scanned documents in a document image database. However, retrieving information from a document image database is a difficult task for organizations because efficient retrieval schemes are lacking. To overcome this challenge, researchers have attempted recognition-based and recognition-free retrieval approaches. Recognition-based retrieval first applies optical character recognition (OCR) to convert document images into text and then performs text retrieval using search engines. The recognition-free approach, on the other hand, attempts to search and retrieve directly from document images, relying on image features. Because of the limitations of OCR systems, recognition-based retrieval is not effective. Hence, different researchers have attempted to develop document image retrieval systems without explicit recognition. In addition, attempts have been made to develop an effective Amharic document image retrieval system. As a continuation of that work, the current study explores and designs feature extraction and matching schemes that are insensitive to word variants, to differences in font type, size and style, and to degradation. To this end, eight feature extraction methods and four matching techniques are tested. Of the four matching schemes, dynamic time warping is insensitive to differences in font type, size and style. The eight feature extraction techniques are tested for performance individually, and the features are then combined systematically following a best stepwise feature selection method. The results show that combined features perform better than individual features. Using the best-performing matching algorithm, stemming is performed in the image domain to handle word variants, and promising experimental results are obtained for word variants. The explored matching, feature extraction and stemming techniques are integrated with the previous Amharic document image retrieval system and tested on noisy document images. In the experiments, the current system outperforms the previous attempts. Finally, relevant conclusions are drawn and recommendations are offered for future investigation.
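
The abstract names dynamic time warping as the matching scheme that tolerates font differences. As a rough illustration only, the sketch below shows the textbook form of dynamic time warping applied to two word images represented as sequences of per-column feature vectors. The function name dtw_distance, the Euclidean local cost and the normalization by the combined sequence length are assumptions made for this example; the abstract does not state which features, local cost or normalization the thesis uses.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Length-normalized DTW distance between two feature-vector sequences.

    Hypothetical helper for illustration: each element of `a` and `b` is
    assumed to be the feature vector of one column of a word image.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: Euclidean distance between the two column vectors
            # (an assumption; other local costs are possible).
            d = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Normalize so that words of different widths remain comparable.
    return cost[n][m] / (n + m)
```

Because the alignment may repeat columns, a word rendered wider (for example in a larger font) can still align closely with a narrower rendering of the same word, which is the kind of insensitivity to size the abstract describes.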

Item Type: Thesis (Masters)
Subjects: P Language and Literature > P Philology. Linguistics
P Language and Literature > PL Languages and literatures of Eastern Asia, Africa, Oceania
Divisions: Africana
Depositing User: Selom Ghislain
Date Deposited: 18 Jun 2018 12:08
Last Modified: 18 Jun 2018 12:08
URI: http://thesisbank.jhia.ac.ke/id/eprint/4424
