Yirdaw, Eyob Delele (2011) Topic-based Amharic Text Summarization. Masters thesis, Addis Ababa University.
Abstract
Automatic text summarization is important in today’s information age, where vast amounts of information are produced for consumption. Ethiopia is no exception: the country has seen steady growth in digital content intended for the general public. Compared with work on major international languages, text summarization for Ethiopia’s local languages in general, and for Amharic in particular, is still at an early stage of development. More work is therefore needed to meet present and future demand for high-quality information extracted from large collections of data in a timely manner.
This thesis investigates the problem of building a concept-based single-document Amharic text summarization system. Because local languages like Amharic lack extensive linguistic resources, we propose to build the summarizer with a statistical approach known as topic modeling, specifically probabilistic latent semantic analysis (PLSA). The proposed algorithms are language and domain independent and hence can also be used for other local languages. We show that a principled use of the term-by-concept matrix that results from a PLSA model can help produce summaries that capture the main topics of a document. We propose six algorithms that explore the use of this matrix. All of them share two steps. In the first step, keywords of the document are selected using the term-by-concept matrix. In the second step, the sentences that best contain the keywords are selected for inclusion in the summary. To take advantage of the kind of text we experiment with (news articles), the algorithms always select the first sentence of the document for inclusion in the summary.
We evaluated the proposed algorithms for precision/recall on summaries at extraction rates of 20%, 25% and 30%. The best results achieved are 0.45511 at 20%, 0.48499 at 25% and 0.52012 at 30%. We also compared our systems, on our Amharic data set, with previous topic-modeling-based summarization methods developed for other languages. Our results show that the proposed algorithms perform better at all extraction rates.
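The two common steps described in the abstract can be illustrated with a minimal Python sketch. It is not one of the six algorithms from the thesis, whose details are not given in this record. It assumes that a term-by-concept matrix (rows are terms, columns are concepts, entries are weights such as P(term | concept)) has already been obtained from a fitted PLSA model. The function names `select_keywords` and `summarize`, the `per_concept` parameter, the whitespace tokenization, the keyword-count sentence score, and the round-up rule for the extraction rate are all assumptions made for illustration.

```python
import math
import numpy as np

def select_keywords(term_by_concept, vocab, per_concept=2):
    """Step 1 (assumed rule): take the highest-weighted terms of each concept."""
    keywords = set()
    for z in range(term_by_concept.shape[1]):
        # Indices of the terms with the largest weights in column z.
        top = np.argsort(term_by_concept[:, z])[::-1][:per_concept]
        keywords.update(vocab[i] for i in top)
    return keywords

def summarize(sentences, keywords, rate=0.25):
    """Step 2 (assumed rule): keep the sentences containing the most keywords."""
    # Number of sentences to extract; rounding up is an assumption.
    budget = max(1, math.ceil(rate * len(sentences)))
    # Score = number of distinct keywords found in the sentence.
    scores = [len(keywords & set(s.split())) for s in sentences]
    chosen = {0}  # the first sentence is always included (news articles)
    ranked = sorted(range(1, len(sentences)), key=lambda i: (-scores[i], i))
    for i in ranked:
        if len(chosen) >= budget:
            break
        chosen.add(i)
    # Return the selected sentences in document order.
    return [sentences[i] for i in sorted(chosen)]

# Toy data (hypothetical, in English for readability).
vocab = ["rain", "flood", "vote", "party", "city"]
term_by_concept = np.array([
    [0.40, 0.05],
    [0.35, 0.05],
    [0.05, 0.45],
    [0.05, 0.35],
    [0.15, 0.10],
])
sentences = [
    "the city met today",
    "rain and flood hit the city",
    "people stayed home",
    "the party lost the vote",
]
keywords = select_keywords(term_by_concept, vocab)
summary = summarize(sentences, keywords, rate=0.5)
```

Real Amharic input would need proper sentence splitting and tokenization in place of `str.split`, and the thesis may weight or combine concepts differently from this sketch.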
Item Type: | Thesis (Masters) |
---|---|
Uncontrolled Keywords: | Amharic Text Summarization, Keyword Approach, Probabilistic Latent Semantic Analysis. |
Subjects: | P Language and Literature > PL Languages and literatures of Eastern Asia, Africa, Oceania; Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QA Mathematics > QA76 Computer software |
Divisions: | Africana |
Depositing User: | Selom Ghislain |
Date Deposited: | 04 Oct 2018 12:17 |
Last Modified: | 04 Oct 2018 12:17 |
URI: | http://thesisbank.jhia.ac.ke/id/eprint/6728 |