Automatic Thesaurus Construction for Amharic Text Retrieval

Gezmu, Andargachew Mekonnen (2009) Automatic Thesaurus Construction for Amharic Text Retrieval. Masters thesis, Addis Ababa University.

Abstract

Thesauri have been used for literary composition since their inception in 1852, but nowadays their primary use is in information retrieval. They are among the crucial components of retrieval systems, where they are typically used to enhance indexing and to expand queries during searching. Even though Amharic has been a written language for a couple of centuries and huge volumes of Amharic electronic documents have accumulated, not much has been done towards the development of effective and efficient Amharic retrieval systems. This research generates a thesaurus automatically for text retrieval in order to support the development of such a system. The automatic thesaurus generation system is based on the WORDSPACE model, which is derived from the inverted file index by applying the Random Projection algorithm for dimensionality reduction. A nearest neighbor clustering algorithm is then employed to generate the thesaurus from the constructed WORDSPACE model. Experimentation of the system on Amharic Bible documents gives an encouraging result. The accuracy of the automatically generated thesaurus is evaluated on a random sample of ten terms, for which the system has an accuracy of 58%. To further investigate its applicability to Amharic information retrieval, the thesaurus is integrated into an IR system for query expansion. The retrieval system is tested with and without the thesaurus in order to show the improvement in retrieval effectiveness. Performance analysis shows that recall is higher when the thesaurus is used: average recall is 73.34% with thesaurus-based query expansion and 37.29% without it.
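
The pipeline described in the abstract (inverted index, random projection, nearest neighbor lookup, query expansion, recall) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function names, the dense +1/-1 projection matrix, the use of cosine similarity, and the toy English corpus are all assumptions made for illustration; the thesis itself works on Amharic Bible documents and may differ in every one of these choices.

```python
import math
import random
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency} (whitespace tokenization)."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def random_projection(index, n_docs, dim, seed=0):
    """Reduce each term's document vector to `dim` dimensions.

    Assumption: every document gets a random row of +1/-1 signs, and a
    term vector is the frequency-weighted sum of the rows of the
    documents it occurs in.
    """
    rng = random.Random(seed)
    signs = [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(n_docs)]
    space = {}
    for term, postings in index.items():
        vec = [0.0] * dim
        for doc_id, tf in postings.items():
            row = signs[doc_id]
            for k in range(dim):
                vec[k] += tf * row[k]
        space[term] = vec
    return space


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_terms(space, term, k=3):
    """Return the k terms whose reduced vectors are most similar to `term`."""
    scored = [(cosine(space[term], vec), other)
              for other, vec in space.items() if other != term]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def expand_query(space, query_terms, k=2):
    """Append each query term's k nearest thesaurus terms, without duplicates."""
    expanded = list(query_terms)
    for term in query_terms:
        if term in space:
            for related in nearest_terms(space, term, k):
                if related not in expanded:
                    expanded.append(related)
    return expanded


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    relevant = set(relevant)
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 0.0


# Toy corpus (placeholder data, not from the thesis).
docs = [
    "king queen palace",
    "king queen crown",
    "river water fish",
    "river boat water",
]
index = build_inverted_index(docs)
space = random_projection(index, n_docs=len(docs), dim=64)
print(expand_query(space, ["king"], k=1))
```

In this sketch, terms that occur in the same documents receive identical reduced vectors and therefore rank as each other's nearest neighbors, which is the intuition behind using such a space as a thesaurus for query expansion.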

Item Type: Thesis (Masters)
Uncontrolled Keywords: Amharic Thesaurus, WORDSPACE, Information Retrieval (IR)
Subjects: P Language and Literature > P Philology. Linguistics
P Language and Literature > PL Languages and literatures of Eastern Asia, Africa, Oceania
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QA Mathematics > QA76 Computer software
Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science
Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4050 Electronic information resources
Divisions: Africana
Depositing User: Selom Ghislain
Date Deposited: 16 Aug 2018 09:48
Last Modified: 16 Aug 2018 09:48
URI: http://thesisbank.jhia.ac.ke/id/eprint/4758
