Concept-based Amharic Documents Similarity (CADS)

Wordofa, Addisalem Abera (2013) Concept-based Amharic Documents Similarity (CADS). Masters thesis, Addis Ababa University.

Abstract

Similarity measures are important in NLP applications such as search engines, information extraction and document classification. Such applications have been implemented for the Amharic language. However, most of them rely on simple matching techniques or probabilistic methods to measure similarity. These approaches do not always accurately capture conceptual relatedness as judged by humans. Some studies try to consider the semantic nature of a document but do not handle the ambiguity of words. In this research, we propose Concept-based Amharic Documents Similarity (CADS), built on an Amharic WordNet (AmhWordNet). The objective of this research is to implement an effective document similarity measure that accounts for polysemy, synonymy and the semantic relationships between words. The main components of the proposed system (CADS) are AmhWordNet and the Concept-based Similarity Measure (CSM). CSM consists of Word Sense Disambiguation (WSD), Concept Tree Extraction and Semantic Similarity Measure modules. AmhWordNet is used as input during concept tree extraction and to implement the WSD module. The extracted concept tree, together with the WSD module, is used to find the semantic similarity between words. The word similarity scores are used to compute sentence similarity. Finally, document similarity is computed from the sentence similarities. The performance of CADS is evaluated using precision, recall and F-measure. CADS without WSD (CADSWoWSD), Pointwise Mutual Information (PMI), Jaccard and Cosine similarity measures are also implemented so that the five systems can be compared. According to the results of the experiment we conducted, the proposed system performs better than the existing ones.
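The Jaccard and Cosine baselines and the evaluation metrics named in the abstract have standard textbook definitions, sketched below in Python for orientation. This is not the thesis implementation: the abstract does not describe its tokenization, term weighting or relevance judgments, so the function names, the whitespace-tokenized inputs and the raw term-frequency weighting are all assumptions made for illustration. CADS itself (AmhWordNet, WSD and concept tree extraction) is not reproduced here.

```python
import math
from collections import Counter


def jaccard(a_tokens, b_tokens):
    """Jaccard similarity: shared distinct tokens over all distinct tokens."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0  # assumption: two empty documents score 0
    return len(a & b) / len(a | b)


def cosine(a_tokens, b_tokens):
    """Cosine similarity over raw term-frequency vectors (assumed weighting)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def precision_recall_f(predicted, relevant):
    """Precision, recall and balanced F-measure for a set of predicted items."""
    predicted, relevant = set(predicted), set(relevant)
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical placeholder tokens; real input would be preprocessed Amharic text.
doc_a = ["w1", "w2", "w3"]
doc_b = ["w2", "w3", "w4"]
print(jaccard(doc_a, doc_b))
print(cosine(doc_a, doc_b))
print(precision_recall_f({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"}))
```

Both baselines compare surface tokens only, so two documents that express the same concept with different words share no tokens and score zero. That is the gap the abstract says a concept-based measure is meant to address.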

Item Type: Thesis (Masters)
Uncontrolled Keywords: Word Sense Disambiguation, Concept Tree Extraction, Amharic WordNet, Concept-based Similarity Measure.
Subjects: P Language and Literature > P Philology. Linguistics
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QA Mathematics > QA76 Computer software
Divisions: Africana
Depositing User: Selom Ghislain
Date Deposited: 19 Jun 2018 13:10
Last Modified: 19 Jun 2018 13:10
URI: http://thesisbank.jhia.ac.ke/id/eprint/4443
