Page Segmentation in Amharic Document Image Collections

Assefa, Gedion (2013) Page Segmentation in Amharic Document Image Collections. Masters thesis, Addis Ababa University.

PDF (Page Segmentation in Amharic Document Image Collections)
Gedion, Assefa.pdf - Accepted Version
Restricted to Repository staff only

Abstract

The advancement and accessibility of digital computers and the introduction of the Internet and the World Wide Web have resulted in a massive information explosion all over the world. Large numbers of handwritten, typewritten and printed documents contain a wealth of information and knowledge in different areas. To make the information and knowledge embedded in these documents accessible to the public, it is desirable to digitize and organize such documents and to develop retrieval systems for them. In response to this need, researchers are moving towards a recognition-free approach, since optical character recognition (OCR) engines have various limitations. Research has been conducted to develop an Amharic document image retrieval (DIR) system without explicit recognition, which retrieves information from document images relying on image features only. However, the effectiveness of the system is highly affected by segmentation errors at the word level. Moreover, the system does not work on real-life document images in which images, graphics, logos, tables, etc. are embedded. This study attempts to integrate an effective page segmentation technique that can work on documents containing images, graphics, tables, etc., and to improve word-level segmentation. Accordingly, the following page segmentation algorithms are tested: Hough transform, Connected Components (CC), Horizontal Run Length Smoothing (HRLS), Dilation and Watershed. The performance evaluation showed that the integration of CC and Dilation is the best combination. It achieved an average Match Score of 0.865 on document images with different noise levels, 0.93 on typewritten documents, 0.97 on documents containing pictures, 0.97 on documents containing tables and 0.45 on handwritten documents (‘kum tshihuf’). On average, F-Measure increased by 2.34% on document images with different noise levels.
Degraded features of old documents, the thinness of typewritten characters and font size variation had a great impact on the performance of the system, which needs further attention in future research.
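
The abstract does not describe how CC and Dilation were combined, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes a binarized page stored as a list of rows of 0/1 pixels, a horizontal structuring element, and 8-connectivity; the function names and the `radius` parameter are hypothetical. Dilation closes the small gaps between neighbouring characters, so that connected component labelling then yields one region per word rather than one per character.

```python
from collections import deque


def dilate_horizontal(img, radius):
    """Binary dilation with a 1 x (2*radius + 1) horizontal structuring element."""
    height, width = len(img), len(img[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            lo, hi = max(0, x - radius), min(width, x + radius + 1)
            # A pixel becomes foreground if any pixel in its window is foreground.
            if any(img[y][lo:hi]):
                out[y][x] = 1
    return out


def connected_components(img):
    """Return bounding boxes (top, left, bottom, right) of 8-connected regions."""
    height, width = len(img), len(img[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not img[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, left = min(top, cy), min(left, cx)
                bottom, right = max(bottom, cy), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def segment_words(img, radius=1):
    """Assumed pipeline: dilate first, then label components as word candidates.

    The boxes describe the dilated regions, so they can be slightly wider
    than the original ink.
    """
    return connected_components(dilate_horizontal(img, radius))


# Toy page: four one-pixel-wide "characters" forming two "words".
page = [
    [1, 0, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1, 0, 1],
]
print(len(connected_components(page)))  # character-level regions
print(segment_words(page, radius=1))    # word-level bounding boxes
```

On this toy page, labelling without dilation finds the four separate characters, while dilating first merges each pair into a single word-level region. On real scans the structuring element size would have to be tuned to the font size and inter-word spacing, which is consistent with the abstract's remark that font size variation affects performance.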

Item Type:	Thesis (Masters)
Subjects:	P Language and Literature > PL Languages and literatures of Eastern Asia, Africa, Oceania; Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science
Divisions:	Africana
Depositing User:	Selom Ghislain
Date Deposited:	13 Jul 2018 12:31
Last Modified:	13 Jul 2018 12:31
URI:	http://thesisbank.jhia.ac.ke/id/eprint/7380
