Design of a Dark Web Crawler and Offline Language Identifier for Amharic Documents

Wondyifraw, Daniel Adenew (2016) Design of a Dark Web Crawler and Offline Language Identifier for Amharic Documents. Masters thesis, Addis Ababa University.

PDF (Design of a Dark Web Crawler and Offline Language Identifier for Amharic Documents)
Daniel, Adenew.pdf - Accepted Version
Restricted to Repository staff only

Abstract

Searching for documents on the Internet has become a daily practice for personal and business purposes. Although billions of websites are readily available, many more are not easily accessible. In contrast to the surface web, or Clearnet, content inside the TOR network, which requires specific software, configuration, or authorization to access from the public Internet, is commonly referred to as the DarkWeb. Because of this, Clearnet search engines such as Google cannot search its content. Contrary to what the name “DarkWeb” suggests, the DarkWeb contains useful and legal information that can be used in day-to-day activities; its darkness refers to the content being hidden from Clearnet search engines. We therefore propose a design for a Dark Web crawler to discover and give insight into DarkNet content, particularly on the TOR network. We also propose integrating a language identifier component to identify Amharic content. These proposals address gaps observed in related research on DarkWeb crawling and content analysis: earlier studies were conducted on small datasets covering only specific types of DarkWeb sites and did not consider content available in the Amharic language. The main objective of this thesis is to design an architecture for a Dark Web crawler for the TOR network. The proposed architecture is composed of recursive, lightweight crawler threads using Fork-Join parallelism; a concurrent persistence storage manager with persistent media access; a URL queue for tracking links; a download manager to download dark web content; an HTML compressor to compress HTML text; and an offline language identifier component. The prototype, referred to as the Dark Web crawler, is developed in the Java programming language. We tested the performance of the proposed design using the downloaded web documents and the crawled information.
We collected over 13,000 hidden services and 67,602 dark web URLs and downloaded 56,304 DarkNet web sites, resulting in 800 MB of data. The Google search engine was used to evaluate results for a selected number of data sets, using page title and meta-tag as parameters. Of the 30 selected data sets, 7 are Amharic. The results are promising: the proposed system found all of them, whereas the Google search engine was not able to find any of them in its top 10 returned results.
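
The crawler design summarized in the abstract can be illustrated with a minimal Java sketch. It is not the thesis's implementation: the class name `ForkJoinCrawlSketch`, the `LinkFetcher` interface (a stand-in for the download manager), and the `.onion` names in the demo are hypothetical, and the crawl runs over an in-memory link graph instead of the TOR network. The `ethiopicRatio` method is an assumed, much simpler substitute for the offline language identifier: it only measures how much of a text is written in Ethiopic script, which Amharic shares with other languages such as Tigrinya, so it is a script check and not full language identification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class ForkJoinCrawlSketch {

    /** Hypothetical stand-in for the download manager: returns a page's outgoing links. */
    interface LinkFetcher {
        List<String> fetchLinks(String url);
    }

    /** One task per URL; links not seen before are forked as child tasks. */
    static final class CrawlTask extends RecursiveAction {
        private final String url;
        private final LinkFetcher fetcher;
        private final Set<String> visited;

        CrawlTask(String url, LinkFetcher fetcher, Set<String> visited) {
            this.url = url;
            this.fetcher = fetcher;
            this.visited = visited;
        }

        @Override
        protected void compute() {
            List<CrawlTask> children = new ArrayList<>();
            for (String link : fetcher.fetchLinks(url)) {
                // add() on a concurrent set is atomic, so each URL is claimed by one task only.
                if (visited.add(link)) {
                    children.add(new CrawlTask(link, fetcher, visited));
                }
            }
            // Fork every child task and wait until all of them have completed.
            invokeAll(children);
        }
    }

    /** Crawls from the seed URL and returns every URL that was reached. */
    static Set<String> crawl(String seed, LinkFetcher fetcher) {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        visited.add(seed);
        ForkJoinPool pool = new ForkJoinPool();
        try {
            pool.invoke(new CrawlTask(seed, fetcher, visited));
        } finally {
            pool.shutdown();
        }
        return visited;
    }

    /** Fraction of the letters in the text that fall in the Ethiopic Unicode block. */
    static double ethiopicRatio(String text) {
        int letters = 0;
        int ethiopic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip spaces, digits, and punctuation
            }
            letters++;
            if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.ETHIOPIC) {
                ethiopic++;
            }
        }
        return letters == 0 ? 0.0 : (double) ethiopic / letters;
    }

    public static void main(String[] args) {
        // Illustrative link graph; the names are made up.
        Map<String, List<String>> graph = Map.of(
                "a.onion", List.of("b.onion", "c.onion"),
                "b.onion", List.of("c.onion", "a.onion"),
                "c.onion", List.of("d.onion"));
        Set<String> found = crawl("a.onion", url -> graph.getOrDefault(url, List.of()));
        System.out.println(found.size() + " URLs reached: " + found);

        // Three Ethiopic letters written as Unicode escapes (an Amharic greeting).
        System.out.println(ethiopicRatio("\u1230\u120B\u121D"));
    }
}
```

In this sketch the concurrent visited set plays the role of the URL tracking described in the abstract; a persistent store and a real fetcher that routes requests through TOR would replace the in-memory pieces in an actual crawler.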

Item Type:	Thesis (Masters)
Uncontrolled Keywords:	Dark Web, Dark Net, Dark Web crawler, DarkNet Crawler, Fork-Join, Hidden Services, Google, .Onion, TOR
Subjects:	P Language and Literature > PL Languages and literatures of Eastern Asia, Africa, Oceania; Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QA Mathematics > QA76 Computer software; T Technology > T Technology (General)
Divisions:	Africana
Depositing User:	Selom Ghislain
Date Deposited:	25 Sep 2018 11:58
Last Modified:	25 Sep 2018 11:58
URI:	http://thesisbank.jhia.ac.ke/id/eprint/5722
