Linguistically Motivated Amharic IR (LM-IR)

Demelash, Biruk (2013) Linguistically Motivated Amharic IR (LM-IR). Masters thesis, Addis Ababa University.

PDF (Linguistically Motivated Amharic IR (LM-IR))
Biruk's.pdf - Accepted Version
Restricted to Repository staff only

Abstract

Information retrieval (IR) is an essential tool for knowledge acquisition in every society. The challenge of designing effective IR for Amharic lies in linguistic characteristics that are specific to the language. Detailed studies of Amharic point to two core features that make it difficult to apply IR models that are effective for English: the syllabic nature of the writing system and the morphological nature of word formation. These characteristics produce extensive morphological variation and linguistic ambiguity, which is why applying existing IR models causes document silence and noise during retrieval. Adopted statistical preprocessing models fail to give enough attention to these core characteristics of the language. This research therefore attempts to develop a new Linguistic Analyzer (LA) for word preprocessing, using morphosyntactic analysis (MSA) to resolve the challenges of linguistic ambiguity and linguistic variation. Morphological variation has been a major challenge for Amharic IR systems because it causes document silence during retrieval. This research addresses the problem by introducing an incremental index file structure. Incremental indexing can store linguistic inflections related to gender, number, tense, and other forms. This indexing structure helps maintain precision while increasing the recall of the retrieval system.
The LA preprocessor was built using 74,000 words found in the Amharic Bible. When the newly designed LA was used to preprocess 5,000 words, it achieved a performance of 82%. On the same test, the statistical preprocessor with stemming delivered a maximum of only 30%. The LM-IR, built on top of the LA, has an incremental indexing file structure and delivers an average F-measure of 83%. It was possible to maintain a recall of 88% while precision did not fall below 76%. The comparison of the LA and the statistical word preprocessor shows a significant difference in effectiveness; the LA approach therefore benefits Amharic IR design. In addition, the incremental indexing structure prevents the semantic loss on index words that occurs with statistical index structures, and it helps increase recall and precision at the same time. This research also shows that it is possible to design Amharic IR using linguistic techniques. Further research, especially on the searching component of the linguistic approach to Amharic IR, would therefore yield even better results.
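
The abstract does not specify how the incremental index is laid out. As a minimal sketch under that caveat, one way to keep inflections in an index is to key postings by root and store each occurrence's surface form and grammatical features alongside it, so that a search can match broadly on the root or narrowly on the exact inflection. Every name below (`Analysis`, `TOY_ANALYZER`, `InflectionIndex`, `f_measure`) is hypothetical, and the analyzer is a lookup table of English placeholder words standing in for the thesis's LA, not real Amharic morphology.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """Output of a hypothetical linguistic analyzer for one surface word."""
    root: str
    features: frozenset  # e.g. {("number", "plural")}


# Toy stand-in for the LA: English placeholders, NOT real Amharic morphology.
TOY_ANALYZER = {
    "book": Analysis("book", frozenset({("number", "singular")})),
    "books": Analysis("book", frozenset({("number", "plural")})),
    "wrote": Analysis("write", frozenset({("tense", "past")})),
    "writes": Analysis("write", frozenset({("tense", "present")})),
}


def analyze(word):
    # Unknown words fall back to themselves with no features.
    return TOY_ANALYZER.get(word, Analysis(word, frozenset()))


class InflectionIndex:
    """Root-keyed index whose postings keep the inflection features."""

    def __init__(self):
        # root -> list of (doc_id, surface form, features)
        self.postings = defaultdict(list)

    def add(self, doc_id, text):
        for word in text.lower().split():
            a = analyze(word)
            self.postings[a.root].append((doc_id, word, a.features))

    def search(self, word, exact_inflection=False):
        a = analyze(word.lower())
        hits = self.postings.get(a.root, [])
        if exact_inflection:
            # Narrow the match to postings with identical features.
            hits = [h for h in hits if h[2] == a.features]
        return sorted({doc_id for doc_id, _, _ in hits})


def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


index = InflectionIndex()
index.add(1, "she wrote books")
index.add(2, "he writes a book")

print(index.search("books"))                         # match on root "book"
print(index.search("books", exact_inflection=True))  # plural form only
print(round(f_measure(0.76, 0.88), 3))
```

In this toy setup, root matching returns both documents for "books", while exact-inflection matching returns only the document containing the plural form. With precision 0.76 and recall 0.88, the balanced F-measure works out to about 0.82; since the abstract gives 76% as a floor on precision, this appears compatible with the reported 83% average.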

Item Type: Thesis (Masters)
Subjects: P Language and Literature > PL Languages and literatures of Eastern Asia, Africa, Oceania
Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science
Divisions: Africana
Depositing User: Selom Ghislain
Date Deposited: 25 Sep 2018 09:36
Last Modified: 25 Sep 2018 09:36
URI: http://thesisbank.jhia.ac.ke/id/eprint/5541