Retrieval from Real-Life Amharic Document Images

Asnake, Biniam (2012) Retrieval from Real-Life Amharic Document Images. Masters thesis, Addis Ababa University.

PDF (Retrieval from Real-Life Amharic Document Images)
Biniam, Asnake.pdf - Accepted Version
Restricted to Repository staff only

Abstract

A large body of real-life documents written in Ethiopic script contains vital information and knowledge about history, culture, economy, politics, religion and science. This knowledge ought to be shared, and advances in technology and in research on Information Retrieval (IR), Artificial Intelligence (AI) and related fields create the need to digitize these documents and make them available for public use. The two major approaches to retrieving information from document images are recognition-based (optical character recognition, OCR) and recognition-free (document image retrieval without explicit recognition, DIR). The first approach is a long-term process, is error-prone and performs poorly on noisy documents, whereas document image retrieval without explicit recognition is the preferred approach. A few studies have developed recognition-free document image retrieval systems that extract information from document images relying on image features only. These systems are highly affected by noise in real-life documents, which results from paper aging, folding, and scanning and printing errors. In this study, an attempt is made to integrate effective noise reduction and thresholding techniques to enhance the effectiveness of the system in searching within real-life document images. This study also improves the online searching process of the system by accepting multiple query terms and retrieving documents in a recall-oriented manner, achieving 77.33% F-measure. Combinations of three noise reduction techniques (median, adaptive median and Wiener filters) and three thresholding techniques (Otsu’s, Niblack’s and Sauvola’s) are tested on printed real-life documents affected by low, medium, high and very high noise. Performance analysis shows that the best-performing combination of denoising and thresholding techniques is Wiener filtering with Otsu thresholding.
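To make the denoise-then-threshold pipeline concrete, the following is a minimal sketch and not the thesis implementation. It pairs a 3x3 median filter with Otsu thresholding. The median filter is used here only because it is the shortest of the three named filters to write out; the abstract reports Wiener filtering as the best performer. The function names (median_filter_3x3, otsu_threshold, binarize) and the list-of-rows image representation with gray levels 0-255 are assumptions made for illustration.

```python
from statistics import median


def median_filter_3x3(image):
    """Replace each pixel by the median of its 3x3 neighbourhood.

    Windows are clipped at the image border, so edge pixels use fewer values.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = int(median(window))
    return out


def otsu_threshold(pixels):
    """Return the gray level t that maximizes between-class variance.

    Pixels <= t form the dark class, pixels > t the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the dark class so far
    sum0 = 0  # sum of gray levels of the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image, t):
    """Map pixels <= t to 1 (ink) and all others to 0 (background)."""
    return [[1 if p <= t else 0 for p in row] for row in image]
```

A typical call sequence would be to denoise first, flatten the result into a single list for otsu_threshold, and then pass the returned level to binarize. Niblack's and Sauvola's methods differ in that they compute a separate threshold per local window rather than one global level.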
Finally, the performance of the system is evaluated before and after the integration of the selected preprocessing techniques, with an average overall performance of 82.37% F-measure registered on documents with low, medium, high and very high levels of noise. The major challenge is segmentation error: the current system either treats multiple separate words as one because of noise, or splits a single word into multiple words when, after noise removal, the space between characters of that word is large enough for the segmentation algorithm to treat it as a word gap (that is, it reaches the segmentation threshold value).
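The two segmentation failure modes can be illustrated with a small sketch. The abstract does not describe the thesis's segmentation algorithm, so the code below assumes a common approach: a text line is reduced to a vertical projection profile (ink pixels per column), and a run of empty columns at least as long as a gap threshold is taken as a word boundary. The function name segment_words and the parameter gap_threshold are hypothetical.

```python
def segment_words(column_ink, gap_threshold):
    """Split a text line into words using its vertical projection profile.

    column_ink[i] is the number of ink pixels in column i. A run of at least
    gap_threshold consecutive empty columns is treated as a word boundary.
    Returns a list of (start, end) column ranges, with end exclusive.
    """
    words = []
    start = None     # first column of the current word
    last_ink = None  # most recent column that contained ink
    for i, ink in enumerate(column_ink):
        if ink == 0:
            continue
        if start is None:
            start = i
        elif i - last_ink - 1 >= gap_threshold:
            # The empty run before this column is wide enough: close the word.
            words.append((start, last_ink + 1))
            start = i
        last_ink = i
    if start is not None:
        words.append((start, last_ink + 1))
    return words


clean = [3, 2, 0, 4, 0, 0, 0, 5, 1]   # two words separated by a 3-column gap
noisy = [3, 2, 0, 4, 0, 1, 0, 5, 1]   # a noise speck sits inside that gap

two_words = segment_words(clean, gap_threshold=3)
merged = segment_words(noisy, gap_threshold=3)   # noise bridges the gap
split = segment_words(clean, gap_threshold=1)    # threshold too small
```

In this toy profile, the noise speck in the gap causes the two words to be merged into one, while a threshold smaller than the spacing between characters splits the first word in two. This mirrors the trade-off described in the abstract between noisy and denoised input.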

Item Type: Thesis (Masters)
Subjects: P Language and Literature > PL Languages and literatures of Eastern Asia, Africa, Oceania
Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science
Z Bibliography. Library Science. Information Resources > ZA Information resources
Divisions: Africana
Depositing User: Selom Ghislain
Date Deposited: 19 Sep 2018 13:43
Last Modified: 19 Sep 2018 13:43
URI: http://thesisbank.jhia.ac.ke/id/eprint/5415
