Amharic Character Recognition System for Printed Real-Life Documents

Birhanu, Abay Teshager (2008) Amharic Character Recognition System for Printed Real-Life Documents. Masters thesis, Addis Ababa University.

Abstract

Optical Character Recognition (OCR) is an area of research and development in which a system is made to recognize characters from printed documents. Cultural considerations and an enormous flood of printed documents have motivated the development of OCR across the world. Unlike for other scripts, OCR development for Amharic characters began in 1997 at SISA (School of Information Studies for Africa). Some progress has been made in recognizing various types of machine-printed, typewritten and handwritten Amharic documents. However, Amharic character recognition still requires considerable further research. There is a need to improve its performance on real-life documents such as the ‘Addis Zemen’ Amharic newspaper, the Bible, the ‘Federal Negarit Gazeta’ and the work of fiction ‘Fiker Eskemekabir’, which exhibit a number of artifacts (mode of writing, condition of the input page, printing process, quality of paper, presence of extraneous markings, resolution and quality of scanning, etc.) that affect the performance of the recognizer. This study therefore investigates OCR technology for degraded real-life Amharic documents. For recognition to succeed, robust techniques for detecting and removing various types of noise are investigated and validated. To test the applicability of algorithms and approaches to the problem at hand, the MATLAB Image Processing Toolbox and a neural network classifier built with the MATLAB Neural Network Toolbox are used. Wiener adaptive filtering for noise removal, Otsu global thresholding for binarizing the digitized image, linear interpolation for normalization and hit-and-miss morphological analysis for thinning are found to work very well for the problem of interest. The performance of the line segmenter is found to be 100%.
The segmentation rates for basic and labialized characters are 98.28% and 100% respectively on the training character sets, and 98.55% and 100% respectively on the testing character sets. An artificial neural network is implemented to classify the generated features. The neural network is trained with eight samples taken from real-life documents, and the developed system is likewise tested on real-life documents. An average recognition rate of 96.87% is observed for test sets drawn from the training sets, and a recognition rate of 11.40% for the new test sets. The segmentation algorithm used in the current study works reasonably well for basic and labialized characters, but it fails to segment the special character |v|, punctuation marks and numbers. In general, the test results show that the performance of the system is greatly affected by the similarity in shape among Amharic characters and by the effect of applying noise removal to clean highly degraded document images. These challenges call for further exploration of shape-invariant feature extraction techniques and of advanced noise detection and removal algorithms. Based on the results, further research areas are also recommended.
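
As an illustration of the binarization step named in the abstract, the following is a minimal sketch of Otsu global thresholding in plain Python. It is not the thesis's implementation, which according to the abstract used MATLAB toolboxes. The function names `otsu_threshold` and `binarize`, the flat list of 8-bit gray levels as input, and the convention that dark pixels are ink are all assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels with value <= t form one class and pixels with value > t the other.
    `pixels` is assumed to be a flat sequence of integers in 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    w0 = 0    # pixel count of the class at or below the threshold
    sum0 = 0  # intensity sum of that class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels at or below t yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is already in the lower class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map dark pixels (assumed to be ink) to 1 and light pixels to 0."""
    return [1 if p <= threshold else 0 for p in pixels]


# Hypothetical toy input: 50 dark "ink" pixels and 50 light "paper" pixels.
sample = [10] * 50 + [200] * 50
t = otsu_threshold(sample)
mask = binarize(sample, t)
```

Because the threshold is computed once from the whole-page histogram, a global method of this kind is cheap, but it can struggle on pages with uneven illumination or staining.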

Item Type: Thesis (Masters)
Subjects: P Language and Literature > PL Languages and literatures of Eastern Asia, Africa, Oceania
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QA Mathematics > QA76 Computer software
Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science
Z Bibliography. Library Science. Information Resources > ZA Information resources
Divisions: Africana
Depositing User: Selom Ghislain
Date Deposited: 19 Sep 2018 13:45
Last Modified: 19 Sep 2018 13:45
URI: http://thesisbank.jhia.ac.ke/id/eprint/5412
