Design of Local Web Content Observatory System

Shawo, Gashaw Tsegaye (2015) Design of Local Web Content Observatory System. Masters thesis, Addis Ababa University.

PDF (Design of Local Web Content Observatory System)
Gashaw, Tsegaye.pdf - Accepted Version
Restricted to Repository staff only


Abstract

The amount of information on the Web and the number of Internet users are both growing rapidly. Web content is becoming more multilingual and covers increasingly diverse subjects. For a particular group or country, it is very difficult to know how much Web content is published, in which languages, and on which subjects. Knowing the status of the local Web content of a country or culture is critically important for making informed decisions on policy and strategy design for the development of a multilingual and multicultural Web. This research therefore aims to design a local Web content observatory system that periodically measures and reports the qualitative and quantitative content of different domains. The system consists mainly of the following components: the crawler, content extractor, statistical tracker, language identifier, Web document categorizer and report generator. The crawler downloads documents; the language identifier then detects the language of each crawled Web document and inserts the detected language into a database. The statistical tracker monitors the crawler and records statistical data. The Web document categorizer assigns the collected documents to the selected subject categories. The report generator provides statistical information about the detected languages and the distribution of Web documents per language across the selected sets of domains. To test and evaluate the system, we selected all domains hosted under the .et domain. About two thousand seed URLs under the .et domain were used, and the crawler collected 263,031 Web documents. The language identifier achieved an accuracy rate of 98.67%. To demonstrate the effectiveness of the Web document categorizer, precision, recall and F-measure tests were conducted: for English documents, the average precision was 91.7%, recall 97.2% and F-measure 94.25%; for Amharic documents, precision was 91.7%, recall 87.85% and F-measure 86.65%. The average accuracy rate of the statistical tracker is 98.72%.
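The abstract does not say how the categorizer's average precision, recall and F-measure were computed. As a minimal sketch, assuming the figures are macro-averages over per-category scores, the Python below shows the standard definitions. The category names and counts are hypothetical and are not taken from the thesis. Under this assumption, the averaged F-measure need not equal the harmonic mean of the averaged precision and recall.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Per-category confusion counts (hypothetical values, not from the thesis)."""
    tp: int  # documents correctly assigned to the category
    fp: int  # documents wrongly assigned to the category
    fn: int  # documents of the category that were missed


def precision(c: Counts) -> float:
    # Share of documents assigned to the category that truly belong to it.
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Counts) -> float:
    # Share of documents belonging to the category that were found.
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f_measure(c: Counts) -> float:
    # Harmonic mean of precision and recall for one category.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def macro_average(values: list[float]) -> float:
    # Unweighted mean over categories (assumed averaging scheme).
    return sum(values) / len(values)


# Hypothetical categories and counts, for illustration only.
counts = {
    "news": Counts(tp=90, fp=10, fn=0),
    "sports": Counts(tp=40, fp=0, fn=10),
}

avg_p = macro_average([precision(c) for c in counts.values()])
avg_r = macro_average([recall(c) for c in counts.values()])
avg_f = macro_average([f_measure(c) for c in counts.values()])

# Harmonic mean of the averaged precision and recall, for comparison with avg_f.
f_of_averages = 2 * avg_p * avg_r / (avg_p + avg_r)

print(f"avg precision={avg_p:.4f} recall={avg_r:.4f} F={avg_f:.4f}")
print(f"F computed from averaged P and R={f_of_averages:.4f}")
```

The two F values differ in this toy example, so a reported average F-measure cannot in general be checked by recombining the reported average precision and recall.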

Item Type: Thesis (Masters)
Uncontrolled Keywords: Information Retrieval, Crawler, Language Identification, Web Document Categorization, Local Web content Observatory System
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QA Mathematics > QA76 Computer software
Divisions: Africana
Depositing User: Selom Ghislain
Date Deposited: 01 Nov 2018 09:37
Last Modified: 01 Nov 2018 09:37
URI: http://thesisbank.jhia.ac.ke/id/eprint/7297
