KIGALI INSTITUTE OF SCIENCE AND TECHNOLOGY INSTITUT DES SCIENCES ET DE TECHNOLOGIE DE KIGALI
Avenue de l'Armée, B.P. 3900 Kigali, Rwanda
FACULTY OF ENGINEERING
DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY
A PROJECT REPORT
ON
Submitted by ISHIMWE MUHUMUZA Emma Marie
REG.NO: GS20090098
Under the Guidance of Prof. Santhi KUMARAN
Submitted in partial fulfilment of the requirements for the award of
BACHELOR OF SCIENCE DEGREE IN COMPUTER ENGINEERING
February 2012
“WEB ENCYCLOPEDIA INFO-BOX QUICK EXTRACTION”
PROJECT ID: CEIT/FT/12/13
ABSTRACT
With the growth in mobile device use, the cell phone has become one of the world's most widely used
tools for making calls and sending messages, but it is still rarely used for searching for information.
This study discusses the extraction of web information through cell phones.
Web information extraction is usually based on DOM tree and HTML tag analysis. Building on
those techniques and rules, the study proposes the development of an
SMS application that enables quick information extraction from a web encyclopedia.
The general idea is that a user sends an SMS request for specific information; the data
is then extracted from the web page at the URL assigned to the keyword. After the data has been
transformed into a format readable by any cell phone, the user receives trustworthy and up-to-date information.
DECLARATION
I, ISHIMWE MUHUMUZA Emma Marie, hereby declare that the work presented in this research
paper is original. It has never been presented at Kigali Institute of Science and Technology or
elsewhere for any award. All consulted works are cited and included in the list of
references. I therefore declare this work to be wholly mine.
Emma Marie ISHIMWE MUHUMUZA
Signature:
CERTIFICATE
This is to certify that the Project Work entitled “Web Encyclopedia Info-Box Quick Extraction” is
a record of the original bona fide work done by ISHIMWE MUHUMUZA Emma Marie
(REG.No: GS20090098) in partial fulfilment of the requirements for the award of the Bachelor of
Science Degree in Computer Engineering of Kigali Institute of Science and Technology, during the
Academic Year 2012.
Supervisor: Prof. Santhi KUMARAN ….........................................
Head of the Department: Jonathan MWAKIJELE ….........................................
Submitted for the Project Examination held at KIST on ………………………………...
LIST OF FIGURES
Fig. 1.1 Gantt chart
LIST OF ABBREVIATIONS
SMS: Short Message Service
ICT: Information and Communication Technology
IT: Information Technology
MUC: Message Understanding Conference
HTML: HyperText Markup Language
DOM: Document Object Model
API: Application Programming Interface
NLP: Natural Language Processing
Table of Contents
ABSTRACT
DECLARATION
CERTIFICATE
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER ONE: GENERAL INTRODUCTION
1.1. Introduction
1.2 Background
1.3 Problem Statement
1.4 Objectives of the project
1.4.1 General Objective
1.4.2 Specific objectives
1.5 Scope of the study
1.5 Project interest
1.5.1 Individual interests
1.5.2 Academic interest
1.5.3 Public interest
1.6 Organization of the study
1.7 Gantt chart
1.8 Expected Results
1.9 Conclusion
CHAPTER TWO: LITERATURE REVIEW
2.1 Introduction
2.2 Current State
2.2.1 Extraction rules
2.2.2 Mechanisms of data extraction
2.2.3 Verifying the Extracted Data
2.3 Terms and technologies
2.3.1 Encyclopedia
2.3.2 Info-Box
2.3.3 DOM tree
2.3.4 Data Extraction
2.3.5 Cell phone
2.3.6 Web wrapper
2.3.7 Python
2.3.8 Django
2.4 Proposed methodology
2.5 System requirements
2.5.1 Software requirements
2.5.2 Hardware requirements
2.6 Conclusion
References
Books and Articles
Internet Sources
CHAPTER ONE: GENERAL INTRODUCTION
1.1. Introduction
As humanity evolves, events accumulate and mark the change from one generation to the next. Some
events become known to a large number of people, while others are more or less ignored. Through
those events, some people become known as celebrities because of what they have discovered,
their experiences in life or their talent for entertaining society. Among them are
celebrities known for political or entertainment reasons, such as heads of state, musicians, movie
actors and so on.
The fact that these celebrities are well known prompts people to search for specific information about them.
Everyone looks for the easiest and cheapest way to get that information, which sometimes costs
money.
When it comes to paying for information, people pay for local information content in some form: it
can be for their local print newspaper, for an application on their mobile device or for access to
special information online. [1]
The most common source of trustworthy information is a continuously updated encyclopedia that
holds information about high-profile personalities; the alternative is to wait for
television or radio programs to talk about them.
With these sources, people risk missing the information they need because the services
are sometimes time-consuming.
The rise of mobile phones has made things easier by providing real-time information to a large
number of people, as cell phone owners are increasingly likely to use their phones to get all
kinds of information. [2]
The easiest and quickest way of extracting information is therefore an SMS application that
lets people search a web encyclopedia with a single SMS sent
from a cell phone and receive a response SMS containing the requested information.
In this research, the Python programming language and the Django framework will be used to develop a tool
for collecting and extracting data from the web encyclopedia and transforming it into a format readable
by a cell phone as a text message. The resulting data can then be sent to the cell phone user in the
form of an SMS.
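As a minimal sketch of the transformation step, assuming the extracted data arrives as a list of (label, value) pairs, the Python code below flattens it into one text message. The function name `format_sms`, the 160-character limit (the length of a single GSM text message) and the sample record are illustrative assumptions, not part of the implemented system.

```python
SMS_MAX_CHARS = 160  # assumed limit: length of one single-part GSM text message


def format_sms(fields, limit=SMS_MAX_CHARS):
    """Join 'Label: value' pairs into one line and trim it to the SMS limit."""
    parts = ["%s: %s" % (label, value) for label, value in fields if value]
    text = "; ".join(parts)
    if len(text) > limit:
        # Leave room for a marker showing that the message was shortened.
        text = text[: limit - 3].rstrip() + "..."
    return text


# Hypothetical record, used only to show the shape of the output.
sample = [("Name", "Jane Doe"), ("Born", "1 January 1970"), ("Nationality", "Rwandan")]
print(format_sms(sample))
```

Empty fields are skipped so that a missing info-box entry does not waste characters in the reply.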
1.2 Background
For a long time, people have obtained information from different sources such as newspapers,
television and radio, applications on their mobile devices, or special online
information services. But most of those sources do not provide adequate information, and they are not
accessible to everyone.
When it comes to extracting data from the internet, some researchers have created tools for extracting data from
web sites and transforming it into a structured data format. The resulting data can then be used to
build new applications without having to deal with unstructured data. [3]
Those ways of getting information lack some features because of how their algorithms are structured.
The Web Encyclopedia Info-Box Quick Extraction application therefore proposes a new extraction
approach.
1.3 Problem Statement
Most people own a cell phone, but they cannot get information quickly from their mobile
devices without spending time and money to extract what they need.
What would be the easiest way for all cell phone users, whether they have a sophisticated
or a basic cell phone, to retrieve trustworthy information in real time and at the lowest price?
1.4 Objectives of the project
1.4.1 General Objective
The general objective of this research is to develop an application that is suitable for everyone and that
helps people quickly search for information about their preferred celebrities through their cell
phones.
1.4.2 Specific objectives
The specific objectives of this research are:
1. To identify the need for SMS-based applications.
2. To design and implement quick information extraction using a cell phone.
3. To test and validate web information extraction on mobile devices.
4. To extend the use of the cell phone beyond making calls and sending/receiving messages.
1.5 Scope of the study
The scope of the study is defined in terms of content as well as geography.
Although a great deal of information can be accessed in different areas, this research is limited
to brief information about various celebrities. That information is:
Names, Date of Birth, Nationality, Origin, Political party, Religion, Spouse, and Genres &
Occupations.
Geographically, the application is intended for every cell phone user.
1.5 Project interest
1.5.1 Individual interests
The first interest in this research is to improve knowledge of the Python programming language and
the Django framework and to fill some gaps faced in the ICT industry, thus contributing to ICT
development.
The second interest is to apply the practical skills and engineering approaches learnt at
school to solving problems.
1.5.2 Academic interest
1. The project helps the student to advance the necessary understanding of the Python programming
language.
2. The project helps students to demonstrate what they have learnt during their university studies
by serving as an example to others.
1.5.3 Public interest
1. The project enables cell phone users to make a quick search via their mobile devices and get
trustworthy information.
2. The project helps the developer to become known in the ICT industry and to show her capabilities in
solving problems in society, especially in the Information Technology domain.
1.6 Organization of the study
This research project report consists of the following chapters:
Chapter one: General introduction
The first chapter introduces the study. It presents the problem statement, which
describes the current situation that the study aims to improve and states
the problem explicitly. It also describes the objectives of the study, its scope, the project
interest and the organization of the study.
Chapter Two: Literature review
The second chapter is the literature review, which describes various theories relating to the project
work.
Chapter three: Research methodology
The third chapter covers the research methodology, which describes the waterfall model as well
as the principles of software engineering.
Chapter four: System analysis and design
The fourth chapter covers the system design, which describes data modeling, use cases and sequence
diagrams.
Chapter five: System implementation, testing and results
This chapter covers the implementation of the SMS application and presents the results
of testing the implemented system.
Conclusion and recommendation:
This is the last part of the research report. It presents the conclusions and recommendations drawn
from this research project.
1.7 Gantt chart
Fig.1.1 Gantt chart
1.8 Expected Results
With the Web Encyclopedia Info-Box Quick Extraction application, cell phone users will be able to
easily receive accurate and up-to-date information from a web encyclopedia based on their request.
A cell phone user will send an SMS request for specific information, and the data will then be
extracted from the web page at the URL assigned to the keyword. After the data has been transformed into a
format readable by any cell phone, the user will be able to read the text.
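The keyword-to-URL step can be pictured with a small sketch. It assumes a lookup table that maps each normalised keyword to an encyclopedia URL; the table `KEYWORD_URLS`, the helper names and the `example.org` addresses are hypothetical and only illustrate the idea.

```python
# Hypothetical keyword table; a real deployment would keep this in a database.
KEYWORD_URLS = {
    "jane doe": "http://encyclopedia.example.org/wiki/Jane_Doe",
    "john smith": "http://encyclopedia.example.org/wiki/John_Smith",
}


def normalise_keyword(sms_text):
    """Lower-case the request and collapse repeated whitespace."""
    return " ".join(sms_text.lower().split())


def resolve_url(sms_text, table=KEYWORD_URLS):
    """Return the URL assigned to the keyword, or None if it is unknown."""
    return table.get(normalise_keyword(sms_text))


print(resolve_url("  Jane   DOE "))
```

Normalising the text first means that differences in case or spacing in the incoming SMS do not prevent a match.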
1.9 Conclusion
This chapter gave the general introduction to the project. It introduced the project and described
the background of how information has been extracted so far, which leads to the
statement of the problem, the objectives of the project and the limitations of the study.
The interest of this study was discussed, and the timeline for accomplishing every task of the
project was presented.
Finally, the chapter closes with this brief summary.
CHAPTER TWO: LITERATURE REVIEW
2.1 Introduction
Nowadays, cell phone usage has penetrated deeply into society. The cell phone has become a daily tool not
only for making calls and sending messages; its rise has also altered the environment of
local news and information. Mobile devices have thus become one of the most quickly adopted
consumer goods. [4]
The majority of those who own a cell phone can get some kind of local news and information on
their mobile devices. [5]
The growth in cell phone use has brought with it a growing use of new applications, although the
adoption of applications is not as rapid as that of cell phones themselves.
In the research cited, only a small number of cell phone owners report having applications that help
them get information or news about their local community. [6]
There is a tremendous amount of information available on the Web, but much of that information is
not in a form that can be easily used by other applications. [7]
2.2 Current State
During the past decade, information extraction has been extensively studied, producing many research
results as well as working systems. Since the late 1980s, through the Message Understanding
Conference (MUC), many information extraction systems have been successfully developed and
quantitatively evaluated. [8]
Information sources can be classified into three main types: free text, structured text
and semi-structured text.
Originally, extraction systems focused on free text.
Natural Language Processing (NLP) techniques were developed to extract this type of unrestricted,
unregulated information; they use the syntactic and semantic characteristics of the language
to generate the extraction rules.
Structured information usually comes from databases, which provide rigid or well-defined
formats of information. It is therefore easy to extract through a query language such as
Structured Query Language (SQL).
The third type is semi-structured information, which falls between free text and structured
information. Web pages are a typical example of semi-structured information. Some papers focus
on extracting text information from web pages. [9]
We now review various theories relating to information extraction.
2.2.1 Extraction rules
A wrapper is software used to enable a semi-structured Web source to be queried as if it were a
database. These are sources where there is no explicit structure or schema, but there is an implicit
underlying structure. Even text sources, such as email messages, have some structure in the heading
that can be exploited to extract the date, sender, addressee, title, and body of the messages.
Other sources, such as online catalogs, have a very regular structure that can be exploited to extract
the data automatically. [10]
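The email case mentioned above can be illustrated with Python's standard `email` package. This is only a sketch of exploiting the heading structure of a message; the sample message and the function name `extract_email_fields` are made up for the example.

```python
from email import message_from_string

# Made-up plain-text message used only for illustration.
RAW_MESSAGE = """\
From: alice@example.org
To: bob@example.org
Subject: Meeting notes
Date: Mon, 06 Feb 2012 10:15:00 +0200

The notes are attached.
"""


def extract_email_fields(raw):
    """Pull the heading fields and the body out of a plain-text email."""
    msg = message_from_string(raw)
    return {
        "date": msg["Date"],
        "sender": msg["From"],
        "addressee": msg["To"],
        "title": msg["Subject"],
        "body": msg.get_payload().strip(),
    }


fields = extract_email_fields(RAW_MESSAGE)
print(fields["sender"], "->", fields["addressee"])
```

The parser plays the role of a wrapper here: the message has no declared schema, but its heading lines follow a regular pattern that can be queried like database columns.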
2.2.2 Mechanisms of data extraction
Lixto offers two basic mechanisms of data extraction: Tree extraction and String extraction.
1. Tree extraction
For tree extraction, elements are identified with their corresponding tree paths and possibly some
properties of the elements themselves. This does not necessarily identify a single element.
A plain tree path is a sequence of consecutive nodes in a sub-tree of an HTML tree. In an
incompletely specified tree path, stars may be used instead of element names. For simplicity,
incompletely specified tree paths are referred to as tree paths. The semantics of a tree path applied to
a tree region of an HTML page is defined as the set of matched elements. [10]
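The idea of matching an incompletely specified tree path can be sketched in Python with the standard `html.parser` module. This is not Lixto's own implementation or syntax; the class `PathCollector`, the functions `matches` and `tree_extract`, and the list-of-names pattern format are assumptions made for illustration.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record, for every element, the list of tag names from the root down to it."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append(list(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def matches(path, pattern):
    """True if the last len(pattern) nodes of path fit the pattern ('*' = any name)."""
    if len(pattern) > len(path):
        return False
    tail = path[len(path) - len(pattern):]
    return all(p == "*" or p == t for p, t in zip(pattern, tail))


def tree_extract(html, pattern):
    """Return the path of every element matched by the tree path pattern."""
    collector = PathCollector()
    collector.feed(html)
    return [p for p in collector.paths if matches(p, pattern)]


page = "<table><tr><th>Born</th><td>1970</td></tr><tr><td>Rwanda</td></tr></table>"
print(len(tree_extract(page, ["table", "*", "td"])))
```

The pattern `["table", "*", "td"]` matches both `td` cells in the sample page, which shows why a tree path identifies a set of elements rather than a single one.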
2. String extraction
The second extraction method relies on strings. In the HTML parse tree, strings are represented by the
text of content leaves. However, a string is associated with every node of the parse tree and is available as the
value of the attribute element text. String extraction has to be used, for example, when extracting the access codes of
the phone numbers in lixto.html. [11]
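String extraction of this kind is commonly done with regular expressions. The sketch below assumes, purely for illustration, that phone numbers are written as a plus sign, a dialling code and the rest of the number; the pattern, the function name `extract_access_codes` and the sample text are not taken from Lixto.

```python
import re

# Assumed layout for illustration: "+<dialling code> <rest of number>".
PHONE_PATTERN = re.compile(r"\+(\d{1,3})[ -](\d[\d -]*\d)")


def extract_access_codes(text):
    """Return the leading dialling code of every phone number found in the text."""
    return [match.group(1) for match in PHONE_PATTERN.finditer(text)]


leaf_text = "Office: +250 788 000 000, Fax: +43 1 0000000"
print(extract_access_codes(leaf_text))
```

Tree extraction alone could only return the whole text leaf; the regular expression isolates the part of the string that is actually wanted.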
2.2.3 Verifying the Extracted Data
A problem that has been largely ignored in extracting data from web sites is that sites change, and
they change often. Kushmerick [12] addressed the wrapper verification problem by monitoring a set
of generic features, such as the density of numeric characters within a field, but this approach only
detects certain types of changes. In contrast, other work addresses the problem by applying machine learning
techniques to learn a set of patterns that describe the information being extracted from each of
the relevant fields. Since the information for even a single field can vary considerably, the system
learns the statistical distribution of the patterns for each field. Wrappers can be verified by comparing
the patterns of the data returned to the learned statistical distribution. When a significant difference is
found, an operator can be notified or the wrapper repair process can be launched automatically.
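A very small sketch of feature-based verification is given below. It follows the spirit of the generic feature mentioned above (density of numeric characters in a field) and compares new values with statistics learned from known-good values. It does not reproduce either published method; the function names, the three-standard-deviation tolerance and the sample dates are assumptions.

```python
from statistics import mean, stdev


def digit_density(value):
    """Fraction of characters in the field value that are digits."""
    return sum(ch.isdigit() for ch in value) / len(value) if value else 0.0


def learn_profile(training_values):
    """Summarise known-good field values by the mean and spread of their digit density."""
    densities = [digit_density(v) for v in training_values]
    return mean(densities), stdev(densities)


def looks_valid(value, profile, tolerance=3.0):
    """Accept a value whose density lies within `tolerance` standard deviations."""
    avg, spread = profile
    # A small floor on the spread avoids rejecting everything when training data is uniform.
    return abs(digit_density(value) - avg) <= tolerance * max(spread, 0.01)


# Hypothetical dates of birth extracted while the wrapper was known to work.
profile = learn_profile(["12 May 1961", "3 June 1975", "28 October 1955"])
print(looks_valid("7 March 1980", profile))          # plausible date field
print(looks_valid("Click here to log in", profile))  # text from a changed page layout
```

A value that fails the check would be the trigger for notifying an operator or starting wrapper repair.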
Based on all of these theories related to web information extraction, we will improve the
extraction tool by developing a new SMS application that makes web information
extraction as quick as possible: the user sends a single SMS from a cell phone and receives a response SMS
containing the requested information.
2.3 Terms and technologies
2.3.1 Encyclopedia
An encyclopedia (also spelled encyclopaedia or encyclopædia) is a type of reference work, a
compendium holding a summary of information from either all branches of knowledge or a particular
branch of knowledge. Encyclopedias are divided into articles or entries, which are usually accessed
alphabetically by article name. Encyclopedia entries are longer and more detailed than those in most
dictionaries. [13]
2.3.2 Info-Box
Generally, Info-box templates are templates that provide standardized information across related
articles.
In this study, an info-box is a fixed-format box under the person’s picture in the top right-hand
corner of an article that consistently presents brief information about that person.
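A minimal sketch of reading such a box with Python's standard `html.parser` is shown below. It assumes that the info-box is an HTML table whose `class` attribute contains `infobox` and whose rows pair a `th` label with a `td` value; that markup, the class `InfoBoxParser` and the sample page are assumptions for illustration.

```python
from html.parser import HTMLParser


class InfoBoxParser(HTMLParser):
    """Collect (label, value) rows from tables whose class contains 'infobox'."""

    def __init__(self):
        super().__init__()
        self.in_box = False
        self.cell = None      # "th" or "td" while inside a cell, else None
        self.label = ""
        self.value = ""
        self.rows = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "infobox" in classes:
            self.in_box = True
        elif self.in_box and tag == "tr":
            self.label, self.value = "", ""
        elif self.in_box and tag in ("th", "td"):
            self.cell = tag

    def handle_data(self, data):
        if self.cell == "th":
            self.label += data
        elif self.cell == "td":
            self.value += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.cell = None
        elif tag == "tr" and self.in_box and self.label.strip() and self.value.strip():
            self.rows.append((self.label.strip(), self.value.strip()))
        elif tag == "table":
            self.in_box = False


# Made-up page fragment; nested tables are not handled in this sketch.
page = (
    '<table class="infobox vcard">'
    "<tr><th>Born</th><td>1 January 1970</td></tr>"
    "<tr><th>Nationality</th><td><a href='#'>Rwandan</a></td></tr>"
    "</table>"
)
parser = InfoBoxParser()
parser.feed(page)
print(parser.rows)
```

Text inside links is kept because the parser accumulates all character data while a cell is open, regardless of inner tags.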
2.3.3 DOM tree
The DOM is a cross-platform and language-independent convention for representing and interacting
with objects in HTML, XHTML and XML documents. The nodes of the DOM tree may be addressed
and manipulated within the syntax of the programming language in use. The public interface is
specified in its application programming interface (API). [14]
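As a short illustration of addressing and manipulating a DOM tree from a programming language, the sketch below uses `xml.dom.minidom` from the Python standard library. The sample document is made up, and it is well-formed XML because `minidom` does not parse arbitrary HTML.

```python
from xml.dom.minidom import parseString

# Made-up, well-formed document used only for illustration.
doc = parseString(
    "<html><body><h1>Jane Doe</h1><p id='bio'>Singer</p></body></html>"
)

# Address nodes through the standard DOM interface.
heading = doc.getElementsByTagName("h1")[0]
print(heading.firstChild.data)       # text node below <h1>
print(heading.parentNode.tagName)    # the enclosing <body> element

# Manipulate the tree: set an attribute and append a new element.
paragraph = doc.getElementsByTagName("p")[0]
paragraph.setAttribute("class", "summary")
extra = doc.createElement("p")
extra.appendChild(doc.createTextNode("Born 1970"))
paragraph.parentNode.appendChild(extra)
print(len(doc.getElementsByTagName("p")))
```

The same method names (`getElementsByTagName`, `createElement`, `appendChild`) appear in DOM bindings for other languages, which is what makes the convention language independent.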
2.3.4 Data Extraction
Data extraction is the act or process of retrieving data out of data sources for further data processing
or data storage. The import into the intermediate extracting system is thus usually followed by data
transformation. [15]
Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text,
mainframe reports, spool files, etc. Extracting data from these unstructured sources has grown into a
considerable technical challenge, whereas historically data extraction has had to deal with changes in
physical hardware formats.
The majority of current data extraction deals with extracting data from unstructured data sources
and from different software formats. This growing practice of extracting data from the web is referred
to as web scraping.
2.3.5 Cell phone
A cell phone (also known as a cellular phone, mobile phone and a hand phone) is a device that can
make and receive telephone calls as well as send and receive text messages over a radio link whilst
moving around a wide geographic area.
It does so by connecting to a cellular network provided by a mobile phone operator, allowing access
to the public telephone network. By contrast, a cordless telephone is used only within the short range
of a single, private base station. [16]
2.3.6 Web wrapper
A web wrapper is a tool used to extract information from the web using only a set of general rules
describing the data domain. It cleanly separates site-independent and site-specific knowledge
from the execution implementation.
Site-independent knowledge is expressed in user-supplied domain rules, while site-specific
knowledge is expressed in automatically generated context-free grammars that describe site
structures. [17]
A wrapper can also be written manually to extract a particular format of information.
2.3.7 Python
Python is a programming language that lets you work more quickly and integrate your systems more
effectively.
Python runs on Windows, Linux/Unix, Mac OS X, and has been ported to the Java and .NET virtual
machines. [18]
2.3.8 Django
Django is a high-level Python web framework that encourages rapid development and clean,
pragmatic design. [19]
It lets you build high-performing, elegant web applications quickly.
2.4 Proposed methodology
In this study we detail the methodology to be used for developing the
SMS application. The research process will be carried out in the following steps:
1. Collecting the data by means of questionnaires
2. Analysis of the data
3. Generalisation and interpretation of the data
4. Presentation of results and write-up of the conclusions reached.
2.5 System requirements
Our application will enable users:
To make a request using their cell phone
To get up-to-date information in real time
To get trustworthy information in real time
2.5.1 Software requirements
The application will be developed using:
Python programming language
Django Framework
Web wrappers as web extraction systems
MySQL as the database
Ubuntu 11.10 as the operating system
The Python programming language and the Django framework will be used to develop a tool for collecting
and extracting data from the web encyclopedia, using web wrappers as one of the web extraction
systems.
Semantic search will be used to drive the information extraction based on the meaning of the
given keyword. The requested information will be stored in the database, where it is easy to
retrieve through a query language such as Structured Query Language (SQL).
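The storage and retrieval step might look like the following sketch. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a database server; the table name `infobox`, its columns and the sample rows are hypothetical.

```python
import sqlite3

# sqlite3 stands in for MySQL here so that the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE infobox (keyword TEXT, field TEXT, value TEXT, "
    "PRIMARY KEY (keyword, field))"
)

# Made-up extracted fields for one keyword.
rows = [
    ("jane doe", "Born", "1 January 1970"),
    ("jane doe", "Nationality", "Rwandan"),
]
conn.executemany("INSERT INTO infobox VALUES (?, ?, ?)", rows)


def lookup(keyword):
    """Return the stored (field, value) pairs for one keyword."""
    cursor = conn.execute(
        "SELECT field, value FROM infobox WHERE keyword = ? ORDER BY field",
        (keyword,),
    )
    return cursor.fetchall()


print(lookup("jane doe"))
```

Parameter placeholders (`?`) keep the keyword taken from an incoming SMS out of the SQL text itself, which matters because that keyword is user input.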
2.5.2 Hardware requirements
For the application to be used efficiently, the user must have a cell phone capable of
sending and receiving text messages.
2.6 Conclusion
In this chapter, we discussed various theories related to the study and defined some terms
and technologies used in the project description.
We also discussed the proposed methodology to be used in order to collect the
data and the steps to be followed during the research process. Finally, we stated the system
requirements by specifying the software requirements as well as the hardware requirements.
References
Books and Articles
[1] A. Mitchell, Deputy Director, Project for Excellence in Journalism. “How mobile devices are
changing community information environments”, March 14, 2011.
[2] K. Purcell, Associate Director-Research, Pew Internet Project.
http://www.stateofthemedia.org/2011/Mobile-survey. Accessed on January 9, 2012.
[3] C. Hsu and M. Dung. “Generating finite-state transducers for semi-structured data extraction from
the web”, Article, 23(8):521–538, 1998.
[4] Pew Research Center's Project for Excellence in Journalism and Internet & American
Life Project in partnership with the Knight Foundation, January 12-25, 2011 Local Information
Survey.
[5] A. Survey, Technical Report 945, Norwegian Computing Center, Oslo, Norway, July 1999.
[6] W. Cohen. Recognizing structure in web pages using similarity queries. In Proc. of the 16th
National Conference on Artificial Intelligence AAAI-1999, pages 59–66, 1999.
[7] Craig A. Knoblock, University of Southern California and Fetch Technologies, 2009.
[8] L. Eikvil. Information Extraction from World Wide Web – A Survey, Technical Report
945, Norwegian Computing Center, Oslo, Norway, July 1999.
[9] Y. Zhang et al. (Eds.): APWeb 2008, LNCS 4976, pp. 383–394, 2008.
[10] Craig A. Knoblock, University of Southern California and Fetch Technologies, December 6,
1998.
[11] T. Goan, N. Benson, and O. Etzioni. A grammar inference algorithm for the World Wide Web.
In Proc. of the AAAI Spring Symposium on Machine Learning in Information Access, 1996.
[12] N. Kushmerick. Regression testing for wrapper maintenance. In Proc. of the 16th National
Conference on Artificial Intelligence AAAI-1999, pages 74–79, 1999.
[17] M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information
extraction, 1999.
Internet Sources
[13] http://en.wikipedia.org/wiki/Encyclopedia Accessed on February 10, 2012
[14] “Document Object Model (DOM)”, http://www.w3.org/:W3C. Accessed on February 11, 2012.
[15] http://en.wikipedia.org/wiki/Data_extraction. Accessed on February 12, 2012
[16] http://en.wikipedia.org/wiki/Mobile_phone. Accessed on February 12, 2012
[19] http://www.boddie.org.uk/python/HTML.html. Accessed on February 13, 2012