Final Proposal

KIGALI INSTITUTE OF SCIENCE AND TECHNOLOGY INSTITUT DES SCIENCES ET DE TECHNOLOGIE DE KIGALI

Avenue de l'Armée, B.P. 3900 Kigali, Rwanda

FACULTY OF ENGINEERING

DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY

A PROJECT REPORT

ON

Submitted by ISHIMWE MUHUMUZA Emma Marie

REG.NO: GS20090098

Under the Guidance of Prof. Santhi KUMARAN

Submitted in partial fulfilment of the requirements for the award of

BACHELOR OF SCIENCE DEGREE IN COMPUTER ENGINEERING

February 2012

“WEB ENCYCLOPEDIA INFO-BOX QUICK EXTRACTION”

PROJECT ID: CEIT/FT/12/13

ABSTRACT

With the growth in mobile device use, the cell phone has become one of the world's most widely used tools for making calls and sending messages, yet its use for searching for information remains limited. This study mainly discusses the extraction of web information through cell phones.

Conventional web information extraction is mostly based on DOM tree and HTML tag analysis. Building on those techniques and rules, the study proposes the development of an SMS application that enables quick information extraction from a web encyclopedia.

The general idea of the application is that the user sends an SMS request to search for specific information; data is then extracted from a web page based on a URL assigned to the keyword. After the data has been transformed into a format readable by any cell phone, the user easily gets trustworthy and up-to-date information.

DECLARATION

I, ISHIMWE MUHUMUZA Emma Marie, hereby declare that the work presented in this research paper is original. No one has ever presented it at Kigali Institute of Science and Technology or elsewhere for any award. Wherever other work was consulted, it has been cited and included in the list of references. I therefore declare this work to be wholly mine.

Emma Marie ISHIMWE MUHUMUZA

Signature:

KIGALI INSTITUTE OF SCIENCE AND TECHNOLOGY INSTITUT DES SCIENCES ET DE TECHNOLOGIE DE KIGALI

Avenue de l'Armée, B.P. 3900 Kigali, Rwanda

FACULTY OF ENGINEERING

DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY

CERTIFICATE

This is to certify that the project work entitled “Web Encyclopedia Info-Box Quick Extraction” is a record of the original bona fide work done by ISHIMWE MUHUMUZA Emma Marie (REG.No: GS20090098) in partial fulfilment of the requirements for the award of the Bachelor of Science Degree in Computer Engineering of Kigali Institute of Science and Technology, during the Academic Year 2012.

…..........................................
Supervisor: Prof. Santhi KUMARAN

…..........................................
Head of the Department: Jonathan MWAKIJELE

Submitted for the Project Examination held at KIST on ………………………………...

LIST OF FIGURES

Fig. 1.1 Gantt chart

LIST OF ABBREVIATIONS

SMS: Short Message Service

ICT: Information and Communication Technology

IT: Information Technology

MUC: Message Understanding Conference

HTML: HyperText Markup Language

DOM: Document Object Model

API: Application Programming Interface

NLP: Natural Language Processing

Table of Contents

ABSTRACT
DECLARATION
CERTIFICATE
LIST OF FIGURES
LIST OF ABBREVIATIONS
Table of Contents
CHAPTER ONE: GENERAL INTRODUCTION
  1.1. Introduction
  1.2 Background
  1.3 Problem Statement
  1.4 Objectives of the project
    1.4.1 General Objective
    1.4.2 Specific objectives
  1.5 Scope of the study
  1.5 Project interest
    1.5.1 Individual interests
    1.5.2 Academic interest
    1.5.3 Public interest
  1.6 Organization of the study
  1.7 Gantt chart
  1.8 Expected Results
  1.9 Conclusion
CHAPTER TWO: LITERATURE REVIEW
  2.1 Introduction
  2.2 Current State
    2.2.1 Extraction rules
    2.2.2 Mechanisms of data extraction
    2.2.3 Verifying the Extracted Data
  2.3 Terms and technologies
    2.3.1 Encyclopedia
    2.3.2 Info-Box
    2.3.3 DOM tree
    2.3.4 Data Extraction
    2.3.5 Cell phone
    2.3.6 Web wrapper
    2.3.7 Python
    2.3.8 Django
  2.4 Proposed methodology
  2.5 System requirements
    2.5.1 Software requirements
    2.5.2 Hardware requirements
  2.6 Conclusion
References
  Books and Articles
  Internet Sources

CHAPTER ONE: GENERAL INTRODUCTION

1.1. Introduction

As humanity evolves, events multiply and mark the change from one generation to the next. Some events become known to a large number of people, while others are more or less ignored. Through those events, some people become known and are called celebrities because of what they have discovered, their life experiences or their talent for entertaining society. Among them are celebrities known for political or entertainment reasons, such as heads of state, musicians and movie actors.

The fact that those celebrities are well known prompts people to search for specific information about them. Everyone tries to find the easiest and cheapest way to get information, which sometimes costs money.

When it comes to paying for information, people pay for local information content in some form: for their local print newspaper, for an application on their mobile device or for access to special information online. [1]

The usual source of trustworthy information is a continuously updated encyclopedia that holds information about high-profile personalities; the alternative is to wait for television or radio programmes to talk about them.

With those sources, people risk missing the information they need because the services are sometimes time-consuming.

The rise of mobile phones has made such tasks easier by providing real-time information to a large number of people, as cell phone owners are increasingly likely to use their phones to get all kinds of information. [2]

The easiest and quickest way to extract information is therefore an SMS application that lets people search a web encyclopedia with a single SMS: the user sends the request from a cell phone and receives a response SMS containing the requested information.

In this research, the Python programming language and the Django framework will be used to develop a tool that collects and extracts data from the web encyclopedia and transforms it into a format readable by a cell phone as a text message. The resulting data can then be sent to the cell phone user as an SMS.
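
The transformation into a text message can be sketched as follows. This is only a minimal illustration, not the project's implementation: the function name `to_sms`, the "label: value" layout and the sample fields are assumptions made for the example. The only external fact relied on is that a single SMS in the standard GSM 7-bit alphabet carries at most 160 characters.

```python
SMS_LIMIT = 160  # characters that fit in one GSM 7-bit SMS


def to_sms(fields, limit=SMS_LIMIT):
    """Join (label, value) pairs into one line and trim it to a single SMS.

    `fields` is a list of (label, value) tuples; the layout is hypothetical.
    """
    text = "; ".join(f"{label}: {value}" for label, value in fields)
    if len(text) <= limit:
        return text
    # Leave room for the three dots that signal truncation.
    return text[: limit - 3].rstrip() + "..."


# Hypothetical extracted fields for a fictional person.
reply = to_sms([("Name", "Example Person"), ("Nationality", "Rwandan")])
```

A reply that is too long is cut rather than split into several messages, which keeps the one-request, one-response model described above.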

1.2 Background

People have long obtained information from sources such as newspapers, television and radio, applications on their mobile devices, or special information online. But most of those sources do not provide adequate information, and they are not accessible to everyone.

For extracting data from the internet, some researchers have created tools that extract data from web sites and transform it into a structured data format. The resulting data can then be used to build new applications without having to deal with unstructured data. [3]

Those ways of getting information lack some features because of how their algorithms are structured. The Web Encyclopedia Info-Box Quick Extraction application therefore proposes a new extraction approach.

1.3 Problem Statement

Most people own a cell phone, but they cannot get information quickly from their mobile devices without spending time and money to extract what they need.

What would be the easiest way for all cell phone users, whether on a sophisticated or a basic cell phone, to retrieve trustworthy information in real time and at the lowest price?

1.4 Objectives of the project

1.4.1 General Objective

The general objective of this research is to develop an application that is suitable for everyone and that helps people quickly search, through their cell phones, for information about their preferred celebrities.

1.4.2 Specific objectives

The specific objectives of this research are:

1. To identify the need for SMS-based applications.

2. To design and implement quick information extraction using a cell phone.

3. To test and validate web information extraction on mobile devices.

4. To extend the use of the cell phone beyond making calls and sending and receiving messages.

1.5 Scope of the study

The study is limited both in subject matter and geographically.

Although a lot of information can be accessed in different areas, this research is limited to brief information about various celebrities. That information is:

Names, date of birth, nationality, origin, political party, religion, spouse, and genres and occupations.

Geographically, the application is intended for each and every cell phone user.

1.5 Project interest

1.5.1 Individual interests

The first interest in this research is to improve knowledge of the Python programming language and the Django framework and to fill some gaps faced in the ICT industry, thus contributing to ICT development.

The second interest is to apply the practical domain skills and engineering approaches learned at school to solving problems.

1.5.2 Academic interest

1. The project helps the student to deepen the necessary understanding of the Python programming language.

2. The project helps students to demonstrate what they have learned during their university studies by serving as exemplary people.

1.5.3 Public interest

1. The project enables cell phone users to make a quick search via their mobile devices and get trustworthy information.

2. The project helps the developer to become known in the ICT industry and to show their capabilities in solving problems in society, especially in the information technology domain.

1.6 Organization of the study

This research project report consists of the following chapters:

Chapter one: General introduction

The first chapter introduces the aspects of the study. It presents the problem statement, which describes the current situation in the area that the study aims to improve and states the problem explicitly. It also describes the objectives of the study, the scope of the study, the project interest and the organization of the study.

Chapter Two: Literature review

The second chapter is the literature review, which describes various theories relating to the project work.

Chapter three: Research methodology

The third chapter covers the research methodology, which describes the waterfall model as well as the principles of software engineering.

Chapter four: System analysis and design

The fourth chapter covers the system design, which describes the data modeling, use cases and sequence diagrams.

Chapter five: System implementation, testing and results

This chapter covers the implementation of the SMS application and finally presents the results of testing the implemented system.

Conclusion and recommendation:

This is the last part of the research report. It presents the conclusions and recommendations drawn from this research project.

1.7 Gantt chart

Fig.1.1 Gantt chart

1.8 Expected Results

With the Web Encyclopedia Info-Box Quick Extraction application, cell phone users will be able to easily receive accurate and up-to-date information from a web encyclopedia based on their request.

A cell phone user will send an SMS request to search for specific information, and the data will then be extracted from a web page based on a URL assigned to the keyword. After the data has been transformed into a format readable by any cell phone, the user will be able to read the text.
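
The step in which a URL is assigned to the keyword can be illustrated with a small lookup table. The sketch below is an assumption about how such a mapping might look; the table `KEYWORD_URLS`, the helper names and the `encyclopedia.example` address are placeholders invented for the example, not part of the proposed system.

```python
# Hypothetical keyword-to-URL table; a real deployment would hold one entry
# per supported celebrity, probably in a database rather than in code.
KEYWORD_URLS = {
    "example person": "http://encyclopedia.example/wiki/Example_Person",
}


def normalise(keyword):
    """Lower-case the keyword and collapse runs of whitespace."""
    return " ".join(keyword.lower().split())


def resolve_url(sms_body, table=KEYWORD_URLS):
    """Return the URL assigned to the keyword in the SMS, or None if unknown."""
    return table.get(normalise(sms_body))


url = resolve_url("  Example   PERSON ")
missing = resolve_url("someone else")
```

Normalising the keyword before the lookup means that differences in capitalisation or spacing in the SMS do not cause a request to fail.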

1.9 Conclusion

This chapter has covered the general introduction of the project. It introduced the project and described the background of how information has been extracted before; the background led to the statement of the problem, the objectives of the project and the limitations of the study.

The interest of this study has been discussed, and the timeline the researcher will follow to accomplish every task has been presented.

Finally, this section has given a brief summary of the whole chapter.

CHAPTER TWO: LITERATURE REVIEW

2.1 Introduction

Nowadays, cell phone usage has deeply penetrated society. The cell phone has become a daily tool not only for making calls and sending messages; its rise has also altered the environment of local news and information. Mobile devices have thus become one of the most quickly adopted consumer goods. [4]

The majority of those who own a cell phone can get some kind of local news and information on their mobile devices. [5]

The growth in cell phone use has brought with it a growing use of new applications, although the adoption of applications is not as rapid as that of cell phones themselves.

In the research conducted, only a small number of cell phone owners report having applications that help them get information or news about their local community. [6]

There is a tremendous amount of information available on the Web, but much of that information is not in a form that can be easily used by other applications. [7]

2.2 Current State

During the past decade, information extraction has been extensively studied, yielding many research results and systems. Since the late 1980s, through the Message Understanding Conference (MUC), many information extraction systems have been successfully developed and quantitatively evaluated. [8]

Information sources can be classified into three main types: free text, structured text and semi-structured text.

Originally, extraction systems focused on free-text extraction.

Natural Language Processing (NLP) techniques were developed to extract this type of unrestricted, unregulated information; they employ the syntactic and semantic characteristics of the language to generate the extraction rules.

Structured information usually comes from databases, which provide rigid or well-defined formats; it is therefore easy to extract through a query language such as Structured Query Language (SQL).

The other type is semi-structured information, which falls between free text and structured information. Web pages are a typical example of semi-structured information. Some papers focus on extracting text information from web pages. [9]

We now review various theories relating to information extraction.

2.2.1 Extraction rules

A wrapper is software used to enable a semi-structured Web source to be queried as if it were a

database. These are sources where there is no explicit structure or schema, but there is an implicit

underlying structure. Even text sources, such as email messages, have some structure in the heading

that can be exploited to extract the date, sender, addressee, title, and body of the messages.

Other sources, such as online catalogs, have a very regular structure that can be exploited to extract

the data automatically. [10]
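
The email case mentioned above can be made concrete with Python's standard `email` package, which already exploits the header structure. The sketch below is illustrative only: the function name `wrap_email`, the field names of the returned record and the sample message are assumptions, and the "title" of the message is taken to be its Subject header.

```python
from email import message_from_string

# A made-up message used only for illustration.
RAW = """\
From: alice@example.org
To: bob@example.org
Date: Mon, 13 Feb 2012 10:00:00 +0200
Subject: Meeting

See you at noon.
"""


def wrap_email(raw):
    """Expose an email as a database-like record by reading its headers."""
    msg = message_from_string(raw)
    return {
        "date": msg["Date"],
        "sender": msg["From"],
        "addressee": msg["To"],
        "title": msg["Subject"],
        # For a simple, non-multipart message the payload is the body text.
        "body": msg.get_payload().strip(),
    }


record = wrap_email(RAW)
```

Once the message is wrapped in this way, each field can be queried by name, which is the sense in which a wrapper lets a semi-structured source be treated like a database.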

2.2.2 Mechanisms of data extraction

Lixto, a wrapper generation tool, offers two basic mechanisms of data extraction: tree extraction and string extraction.

1. Tree extraction

For tree extraction, elements are identified with their corresponding tree paths and possibly some

properties of the elements themselves. This does not necessarily identify a single element.

A plain tree path is a sequence of consecutive nodes in a sub-tree of an HTML tree. In an

incompletely specified tree path, stars may be used instead of element names. For simplicity,

incompletely specified tree paths are referred to as tree paths. The semantics of a tree path applied to

a tree region of an HTML page is defined as the set of matched elements. [10]
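
The idea of an incompletely specified tree path can be illustrated with the path syntax of Python's `xml.etree.ElementTree`, in which `*` likewise stands for any element name. This is an analogy under stated assumptions, not Lixto's own path language, and the sample page is invented and kept well-formed so that an XML parser can read it.

```python
import xml.etree.ElementTree as ET

# A small, well-formed page invented for the example.
PAGE = """<html><body>
<table>
  <tr><td>Born</td><td>1 January 1970</td></tr>
  <tr><td>Nationality</td><td>Rwandan</td></tr>
</table>
<div><p>Unrelated text</p></div>
</body></html>"""

root = ET.fromstring(PAGE)

# Fully specified path: every element name on the way is given.
plain = [td.text for td in root.findall("./body/table/tr/td")]

# Incompletely specified path: the star matches any child of <body>.
starred = [td.text for td in root.findall("./body/*/tr/td")]
```

Both paths return four cells here, which shows that a tree path yields a set of matched elements rather than a single one.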

2. String extraction

The second extraction method relies on strings. In the HTML parse tree, strings are represented by the text of content leaves. However, a string is associated with every node of the parse tree and is available as the value of the attribute element text. String extraction has to be used, for example, when extracting the access codes of the phone numbers in the example page lixto.html. [11]
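
String extraction of an access code can be sketched with a regular expression applied to the text of an element. The pattern and the sample number below are assumptions chosen for illustration (an international prefix written as `+` followed by one to three digits); they are not taken from the Lixto example page.

```python
import re

# Assumed format: "+<1-3 digit access code> <rest of the number>".
ACCESS_CODE = re.compile(r"\+(\d{1,3})\s")


def access_code(element_text):
    """Return the access code found in the text of an element, or None."""
    match = ACCESS_CODE.search(element_text)
    return match.group(1) if match else None


code = access_code("Tel: +250 788 000 000")
none_found = access_code("no phone number here")
```

The access code is only part of a text leaf, so no tree path can isolate it; a string pattern is needed inside the element that the tree path has selected.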

2.2.3 Verifying the Extracted Data

A problem that has been largely ignored in extracting data from web sites is that sites change, and they change often. Kushmerick [12] addressed the wrapper verification problem by monitoring a set of generic features, such as the density of numeric characters within a field, but this approach only detects certain types of changes. In contrast, another approach addresses the problem by applying machine learning techniques to learn a set of patterns that describe the information being extracted from each of the relevant fields. Since the information for even a single field can vary considerably, the system learns the statistical distribution of the patterns for each field. Wrappers can be verified by comparing the patterns of the data returned with the learned statistical distribution. When a significant difference is found, an operator can be notified, or the wrapper repair process can be launched automatically.
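
The verification idea can be sketched in a strongly simplified form. The code below is not the algorithm of the cited work: the coarse token patterns, the use of total variation distance as the measure of difference and the sample values are all assumptions made to keep the example short.

```python
import re
from collections import Counter


def pattern(value):
    """Collapse a field value into a coarse pattern, e.g. '12 May 1980' -> '9 A 9'."""
    value = re.sub(r"\d+", "9", value)        # any run of digits
    value = re.sub(r"[A-Za-z]+", "A", value)  # any run of letters
    return value


def distribution(values):
    """Relative frequency of each pattern in a non-empty list of field values."""
    counts = Counter(pattern(v) for v in values)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}


def drift(learned, observed):
    """Total variation distance: 0.0 for identical distributions, 1.0 for disjoint ones."""
    keys = set(learned) | set(observed)
    return 0.5 * sum(abs(learned.get(k, 0) - observed.get(k, 0)) for k in keys)


# Hypothetical "date of birth" values seen while the wrapper worked correctly.
learned = distribution(["12 May 1980", "3 June 1975", "21 March 1990"])
healthy = drift(learned, distribution(["7 July 1969"]))
broken = drift(learned, distribution(["<td>Born</td>", "n/a"]))
```

A wrapper whose output drifts above some chosen threshold would be flagged for the operator; the threshold itself would have to be tuned on real data.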

Based on all of these theories of web information extraction, we will improve on existing extraction tools by developing a new SMS application that makes web information extraction as quick as possible: the user sends a single SMS from a cell phone and receives a response SMS containing the requested information.

2.3 Terms and technologies

2.3.1 Encyclopedia

An encyclopedia (also spelled encyclopaedia or encyclopædia) is a type of reference work, a

compendium holding a summary of information from either all branches of knowledge or a particular

branch of knowledge. Encyclopedias are divided into articles or entries, which are usually accessed

alphabetically by article name. Encyclopedia entries are longer and more detailed than those in most

dictionaries. [13]

2.3.2 Info-Box

Generally, info-box templates are templates that provide standardized information across related articles.

In this study, an info-box is a fixed-format box, placed under the person's picture in the top right-hand corner of an article, that consistently presents brief information about that person.
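
Because the info-box has a fixed format, its rows can be read into label and value pairs. The sketch below uses Python's standard `html.parser` module and assumes that the box is an HTML table whose `class` attribute contains `infobox` and whose rows pair a `<th>` label with a `<td>` value; that markup, the class name `InfoBoxParser` and the sample data are assumptions for illustration.

```python
from html.parser import HTMLParser


class InfoBoxParser(HTMLParser):
    """Collect <th>label</th><td>value</td> rows from a table of class 'infobox'."""

    def __init__(self):
        super().__init__()
        self.in_box = False
        self.cell = None   # "th" or "td" while inside such a cell of the box
        self.label = ""
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "table" and "infobox" in (dict(attrs).get("class") or ""):
            self.in_box = True
        elif self.in_box and tag in ("th", "td"):
            self.cell = tag
            if tag == "th":
                self.label = ""
            else:
                self.fields[self.label] = ""

    def handle_data(self, data):
        if self.cell == "th":
            self.label += data
        elif self.cell == "td":
            self.fields[self.label] += data

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_box = False
        elif tag == "th":
            self.label = self.label.strip()
            self.cell = None
        elif tag == "td" and self.cell == "td":
            self.fields[self.label] = self.fields[self.label].strip()
            self.cell = None


# Invented sample markup for a fictional person.
HTML = """
<table class="infobox vcard">
  <tr><th>Born</th><td>1 January 1970</td></tr>
  <tr><th>Nationality</th><td>Rwandan</td></tr>
</table>
"""
parser = InfoBoxParser()
parser.feed(HTML)
```

After `feed`, `parser.fields` holds one entry per row of the box. Nested tables and rows without a label are not handled in this sketch.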

2.3.3 DOM tree

The Document Object Model (DOM) is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents; the DOM tree is the resulting tree of objects. Aspects of the DOM tree may be addressed and manipulated within the syntax of the programming language in use. The public interface of the DOM is specified in its application programming interface (API). [14]
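
As a brief illustration of addressing and manipulating a DOM tree from a programming language, the sketch below uses `xml.dom.minidom`, a DOM implementation in the Python standard library. The sample document is invented and well-formed, since `minidom` parses XML rather than arbitrary HTML.

```python
from xml.dom.minidom import parseString

# Invented sample document.
doc = parseString(
    "<html><body><p id='intro'>Hello</p><p>World</p></body></html>"
)

# Address nodes through the DOM API.
paragraphs = doc.getElementsByTagName("p")
texts = [p.firstChild.data for p in paragraphs]
first_id = paragraphs[0].getAttribute("id")

# Manipulate the tree through the same API: replace the second text node.
paragraphs[1].firstChild.data = "Kigali"
updated = doc.getElementsByTagName("p")[1].firstChild.data
```

The method names `getElementsByTagName` and `getAttribute` come from the DOM interface itself, so the same calls exist in other languages that implement it.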

2.3.4 Data Extraction

Data extraction is the act or process of retrieving data out of data sources for further data processing

or data storage. The import into the intermediate extracting system is thus usually followed by data

transformation. [15]

Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports and spool files. Extracting data from these unstructured sources has grown into a considerable technical challenge, whereas historically data extraction had to deal with changes in physical hardware formats.

The majority of current data extraction deals with extracting data from unstructured data sources and from different software formats. This growing practice of extracting data from the web is referred to as web scraping.

2.3.5 Cell phone

A cell phone (also known as a cellular phone, mobile phone and a hand phone) is a device that can

make and receive telephone calls as well as send and receive text messages over a radio link whilst

moving around a wide geographic area.

It does so by connecting to a cellular network provided by a mobile phone operator, allowing access

to the public telephone network. By contrast, a cordless telephone is used only within the short range

of a single, private base station. [16]

2.3.6 Web wrapper

A web wrapper is a tool used to extract information from the web, given only a set of general rules describing the data domain. It cleanly separates site-independent and site-specific knowledge from the execution implementation.

Site-independent knowledge is expressed in user-supplied domain rules, while site-specific knowledge is expressed in automatically generated context-free grammars that describe site structures. [17]

A wrapper can also be written by hand to extract information in a particular format.

2.3.7 Python

Python is a programming language that lets you work more quickly and integrate your systems more

effectively.

Python runs on Windows, Linux/Unix, Mac OS X, and has been ported to the Java and .NET virtual

machines. [18]

2.3.8 Django

Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design. [19]

It lets you build high-performing, elegant web applications quickly.

2.4 Proposed methodology

In this study we will detail some of the possible methodologies to be used for developing the SMS application. The research process will then proceed in the following steps:

1. Collecting the data by means of questionnaires

2. Analysis of the data

3. Generalisation and interpretation of the data

4. Presentation of results and write-up of the conclusions reached.

2.5 System requirements

Our application will enable users:

To make a request using their cell phone

To get up-to-date information in real time

To get trustworthy information in real time

2.5.1 Software requirements

The application will be developed using:

Python programming language

Django Framework

Web wrappers as web extraction systems

MySQL as the database

Ubuntu 11.10 as the operating system

The Python programming language and the Django framework will be used to develop a tool for collecting and extracting data from the web encyclopedia, using web wrappers as one of the web extraction systems.

Semantic search will be used to drive the information extraction based on the meaning of the given keyword, and the requested information will be stored in the database, where it is easy to retrieve through a query language such as Structured Query Language (SQL).
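
Storing extracted fields and retrieving them with SQL can be sketched as follows. The proposal names MySQL as the database; so that the example runs without a server, it uses the `sqlite3` module from the Python standard library as a stand-in. The table layout and the helper names `store` and `lookup` are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL database of the proposal.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE infobox ("
    " keyword TEXT, field TEXT, value TEXT,"
    " PRIMARY KEY (keyword, field))"
)


def store(keyword, fields):
    """Save the extracted fields for a keyword, replacing older values."""
    # INSERT OR REPLACE is SQLite syntax; MySQL offers REPLACE INTO instead.
    conn.executemany(
        "INSERT OR REPLACE INTO infobox VALUES (?, ?, ?)",
        [(keyword, field, value) for field, value in fields.items()],
    )


def lookup(keyword):
    """Return the stored fields for a keyword as a dictionary."""
    rows = conn.execute(
        "SELECT field, value FROM infobox WHERE keyword = ? ORDER BY field",
        (keyword,),
    )
    return dict(rows.fetchall())


# Hypothetical data for a fictional person.
store("example person", {"Born": "1 January 1970", "Nationality": "Rwandan"})
cached = lookup("example person")
```

Keeping a copy in the database means that a repeated request for the same keyword can be answered with a query instead of a new extraction from the web, at the cost of having to refresh the stored values when the encyclopedia changes.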

2.5.2 Hardware requirements

For the application to be used efficiently, the user must have a cell phone capable of sending and receiving text messages.

2.6 Conclusion

In this chapter, we discussed various theories related to the study and defined some of the terms used in the project description.

We also discussed the proposed methodology to be used to collect the data and the steps to be followed during the research process. Finally, we stated the system requirements by specifying the software requirements as well as the hardware requirements.

References

Books and Articles

[1] A. Mitchell, Deputy Director, Project for Excellence in Journalism. “How mobile devices are changing community information environments”, March 14, 2011.

[2] K. Purcell, Associate Director-Research, Pew Internet Project. http://www.stateofthemedia.org/2011/Mobile-survey. Accessed on January 9, 2012.

[3] C. Hsu and M. Dung. “Generating finite-state transducers for semi-structured data extraction from the web”, Article, 23(8):521–538, 1998.

[4] Pew Research Center's Project for Excellence in Journalism and Internet & American Life Project, in partnership with the Knight Foundation. Local Information Survey, January 12-25, 2011.

[5] A. Survey, Technical Report 945, Norwegian Computing Center, Oslo, Norway, July 1999.

[6] W. Cohen. Recognizing structure in web pages using similarity queries. In Proc. of the 16th National Conference on Artificial Intelligence (AAAI-1999), pages 59–66, 1999.

[7] A. Craig Knoblock, University of Southern California and Fetch Technologies, 2009.

[8] L. Eikvil. Information Extraction from World Wide Web – A Survey. Technical Report 945, Norwegian Computing Center, Oslo, Norway, July 1999.

[9] Y. Zhang et al. (Eds.): APWeb 2008, LNCS 4976, pp. 383–394, 2008.

[10] A. Craig Knoblock, University of Southern California and Fetch Technologies, December 6, 1998.

[11] T. Goan, N. Benson, and O. Etzioni. A grammar inference algorithm for the World Wide Web. In Proc. of the AAAI Spring Symposium on Machine Learning in Information Access, 1996.

[12] N. Kushmerick. Regression testing for wrapper maintenance. In Proc. of the 16th National Conference on Artificial Intelligence (AAAI-1999), pages 74–79, 1999.

[17] M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction, 1999.

Internet Sources

[13] http://en.wikipedia.org/wiki/Encyclopedia Accessed on February 10, 2012

[14] “Document Object Model (DOM)”, http://www.w3.org/:W3C. Accessed on February 11, 2012.

[15] http://en.wikipedia.org/wiki/Data_extraction. Accessed on February 12, 2012

[16] http://en.wikipedia.org/wiki/Mobile_phone. Accessed on February 12, 2012

[19] http://www.boddie.org.uk/python/HTML.html. Accessed on February 13, 2012
