Amharic-English Script Identification in Real-Life Document Images

Abebayehu, Samuel (2012) Amharic-English Script Identification in Real-Life Document Images. Masters thesis, Addis Ababa University.

Abstract

Computer technology has enabled people to process, store, retrieve and disseminate information with great flexibility and ease. As a result, vast amounts of information are being digitized. Digital libraries are currently digitizing printed documents in order to offer more people access to larger document collections, and at far greater speed, than physical libraries can. This in turn has created the need for effective document image processing systems, which has led to a number of studies on Optical Character Recognition (OCR) and Document Image Retrieval (DIR) systems. The emergence of English as a universal language has resulted in multi-script documents in many nations that use their own scripts. This poses a serious challenge for traditional document image processing systems, which can process only documents prepared in a single script. To address this issue, a number of studies have been conducted on script identification and various techniques have been reported. Ethiopia faces the same situation: many historical, legal, newspaper and business documents are prepared using two scripts (English and Amharic). Even though many studies have been conducted on document image processing systems for Amharic, only one study has addressed script identification for Amharic-English documents. That study pioneered the subject and proposed feature extraction techniques for Amharic-English script identification. The present research continues that work, aiming to improve the performance of the previously proposed system on Real-Life document images. Real-Life document images pose a wide range of challenges, the two main ones being printing variation (font type, size, etc.) and noise. To this end, the present research investigates four noise removal techniques and 11 feature extraction techniques.
Experiments conducted on clean and Real-Life documents showed that DBF (an adaptive noise removal technique) is effective in suppressing noise while keeping the features intact. In addition, the combination of word-level features selected by the forward sequential feature selection method proved effective, showing low sensitivity to noise, font type and word length variation. More importantly, the experiments were conducted without any normalization of the variations (size, space, etc.) that are common in Real-Life documents, and promising results were obtained. Finally, recommendations are made on issues that need further investigation.
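
The forward sequential feature selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not name the classifier, the scoring scheme or the features, so the nearest-centroid classifier, the leave-one-out scoring, the function names and the toy word-level feature values below are all assumptions chosen only to show the greedy selection loop.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]


def nearest_centroid_accuracy(X: Sequence[Row], y: Sequence[str],
                              features: List[int]) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumed stand-in scorer; the thesis may use a different classifier.
    Only the feature columns listed in `features` are used.
    """
    correct = 0
    for i in range(len(X)):
        sums, counts = {}, {}
        # Build class centroids from every sample except the held-out one.
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1
        # Assign the held-out sample to the closest centroid.
        best_label, best_dist = None, float("inf")
        for label, acc in sums.items():
            dist = sum((X[i][f] - acc[k] / counts[label]) ** 2
                       for k, f in enumerate(features))
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += best_label == y[i]
    return correct / len(X)


def forward_select(X: Sequence[Row], y: Sequence[str],
                   score: Callable[[Sequence[Row], Sequence[str], List[int]], float]
                   ) -> Tuple[List[int], float]:
    """Greedily add the feature that most improves the score; stop when none does."""
    selected: List[int] = []
    remaining = list(range(len(X[0])))
    best_score = 0.0
    while remaining:
        top_score, top_feature = max(
            ((score(X, y, selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining feature improves the current subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical word-level feature vectors (three made-up features per word image).
X = [[0.10, 5.0, 0.3], [0.20, 1.0, 0.9], [0.15, 3.0, 0.1],
     [0.90, 4.9, 0.2], [0.80, 1.2, 0.8], [0.85, 3.1, 0.4]]
y = ["amharic", "amharic", "amharic", "english", "english", "english"]

selected, best_score = forward_select(X, y, nearest_centroid_accuracy)
print(selected, best_score)
```

On this toy data only the first column separates the two classes, so the loop keeps that feature and stops once adding another one no longer raises the score. The same loop works with any scoring function that accepts a candidate feature subset.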

Item Type:	Thesis (Masters)
Subjects:	P Language and Literature > PE English; P Language and Literature > PL Languages and literatures of Eastern Asia, Africa, Oceania; Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science
Divisions:	Africana
Depositing User:	Selom Ghislain
Date Deposited:	19 Oct 2018 06:09
Last Modified:	19 Oct 2018 06:09
URI:	http://thesisbank.jhia.ac.ke/id/eprint/6957
