Supporting the Automatic Construction of Entity Aware
Search Engines
Lorenzo Blanco, Valter Crescenzi,
Paolo Merialdo, Paolo Papotti
Dipartimento di Informatica e Automazione, Università degli Studi Roma Tre
Introduction
• A huge number of web sites publish pages based on data stored in databases
• Each of these pages often contains information about a single instance of a conceptual entity
Example: the BasketballPlayer entity, with attributes name, birthdate, college, weight, and height
Example sources: http://www.nba.com/ and http://sports.espn.go.com/
• We developed a system that:
- takes as input a small set of sample pages from distinct web sites
- automatically discovers pages containing data about other instances of the conceptual entity exemplified by the input samples
Overall Approach
• Given a small set of sample pages:
- crawl the web sites of the sample pages to gather other pages offering the same type of information
- extract a set of keywords that describe the underlying entity
- do:
  - launch web searches to find other sources with pages that contain instances of the target entity
  - analyze the results to filter out irrelevant pages
  - crawl the new sources to gather new pages
- while new pages are found
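The loop above can be sketched over a toy in-memory web. All data, helper names, and the `min_overlap` threshold here (`TOY_WEB`, `crawl_site`, `web_search`) are hypothetical stand-ins for the real crawler, keyword extractor, and search engine:

```python
# Toy "web": each page belongs to a site and exposes a set of terms.
TOY_WEB = {
    "nba.com/p1":  {"site": "nba.com",  "terms": {"height", "weight", "ppg"}},
    "nba.com/p2":  {"site": "nba.com",  "terms": {"height", "weight", "ppg"}},
    "espn.com/p1": {"site": "espn.com", "terms": {"height", "weight", "apg"}},
    "blog.com/p1": {"site": "blog.com", "terms": {"news", "gossip"}},
}

def crawl_site(site):
    """Stand-in for the site crawler: return every page of a site."""
    return {url for url, page in TOY_WEB.items() if page["site"] == site}

def web_search():
    """Stand-in for a keyword search: every known page is a candidate."""
    return set(TOY_WEB)

def discover(sample_urls, min_overlap=2):
    # 1. Crawl the web sites of the sample pages.
    pages = set()
    for url in sample_urls:
        pages |= crawl_site(TOY_WEB[url]["site"])
    # 2. Entity description: terms shared by all known pages (simplified).
    description = set.intersection(*(TOY_WEB[u]["terms"] for u in pages))
    # 3. Search, filter, and crawl new sources while new pages are found.
    while True:
        new = set()
        for url in web_search() - pages:
            if len(TOY_WEB[url]["terms"] & description) >= min_overlap:
                new |= crawl_site(TOY_WEB[url]["site"])
        new -= pages
        if not new:
            break
        pages |= new
    return pages
```

Starting from the single sample `nba.com/p1`, the loop accepts the espn.com source (two shared terms) and rejects the blog (no shared terms).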
Instance Identifiers
Alan Anderson, Mike Doucet, Ricky Dixon, Quentin Leday, Jarrett Lee, …
Site Crawler
• Goal: given one sample page, crawl its site to discover as many pages as possible that offer the same kind of information
• A crawling algorithm scans the web site looking for pages that share the structure of the input sample page
• The crawler also computes a set of strings representing meaningful identifiers for the entity instances (e.g. the athletes' names)
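A minimal sketch of the structure-driven crawl, assuming a toy link graph in which each page's layout is summarized by a hypothetical `structure` fingerprint: the crawler follows links from the sample page and keeps only the pages whose structure matches the sample's.

```python
# Toy site: outgoing links plus a (hypothetical) structure fingerprint.
SITE = {
    "/player/1": {"links": ["/player/2", "/index"], "structure": "div-b-i"},
    "/player/2": {"links": ["/player/3"],           "structure": "div-b-i"},
    "/player/3": {"links": [],                      "structure": "div-b-i"},
    "/index":    {"links": ["/player/3", "/about"], "structure": "ul-li-a"},
    "/about":    {"links": [],                      "structure": "p-p"},
}

def crawl(sample_url):
    """Visit the site from the sample page; keep structurally matching pages."""
    target = SITE[sample_url]["structure"]
    visited, frontier = set(), [sample_url]
    while frontier:
        url = frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        frontier.extend(SITE[url]["links"])
    return {url for url in visited if SITE[url]["structure"] == target}
```

Index pages are traversed (they lead to more player pages) but filtered out of the result, since their structure differs from the sample's.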
Crawling the seed sites
Crawler: intuition
• Given a sample page, the system explores the site structure looking for pages that act as indexes to "similar" pages
• The similarity between pages is measured by analyzing their structure
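One plausible way to make the structural similarity concrete (an assumption for illustration, not necessarily the system's exact metric) is to collect the root-to-leaf tag paths of each page and compare the two path sets with the Jaccard coefficient:

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collect the set of tag paths (root to each opened tag) of a page."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))
    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

def tag_paths(html):
    collector = PathCollector()
    collector.feed(html)
    return collector.paths

def similarity(page_a, page_b):
    """Jaccard coefficient of the two pages' tag-path sets."""
    a, b = tag_paths(page_a), tag_paths(page_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```

Two pages generated from the same template yield identical path sets (similarity 1.0), while unrelated layouts score close to 0.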
Extraction of the entity description
• On a web site, different instances of the same conceptual entity are likely to share a characterizing set of keywords
• These keywords usually appear in the page template
Example: template terms highlighted on a sample player page: PPG, RPG, APG, EFF, Born, Height, Weight, College, Years Pro (together with navigational terms such as photos, Buy, E-mail)
• For each known web site we extract a set of keywords from its template
• The entity description is a set of keywords built by combining these sets
• We favour the most frequent terms
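The combination step can be sketched as a simple vote: each site template contributes its keywords once, and the description keeps the terms appearing in at least a fraction of the templates (the 50% threshold is an illustrative assumption):

```python
from collections import Counter

def entity_description(template_keywords_per_site, min_fraction=0.5):
    """Keep terms that appear in at least min_fraction of the site templates."""
    counts = Counter()
    for keywords in template_keywords_per_site:
        counts.update(set(keywords))  # each site votes once per term
    n = len(template_keywords_per_site)
    return {term for term, c in counts.items() if c / n >= min_fraction}
```

Site-specific noise (navigation labels that occur on only one site) is voted out, while attribute names shared across sites survive.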
Template Extraction: intuition
• To extract the terms of the template of a set of pages (from the same web site), the system analyzes the frequencies of the tokens (inspired by Arasu & Garcia-Molina, SIGMOD 2003)
page 1:
<HTML> <DIV><A>Home</A><P>Sport!</P> </DIV> <DIV> <B>Weight</B><I>97</I> </DIV> <DIV> <B>Height</B><I>180</I> </DIV> <DIV> <B>Profile</B> <SPAN>The career ... </SPAN> </DIV></HTML>

page 2:
<HTML> <DIV><A>Home</A><P>Sport!</P> </DIV> <DIV> <B>Weight</B><I>136</I> </DIV> <DIV> <B>Height</B><I>212</I> </DIV> <DIV> <B>Profile</B> <SPAN>Giant... Height... </SPAN> </DIV></HTML>

Occurrences of "Height": page 1: /html/body/div[3]/b; page 2: /html/body/div[3]/b and /html/body/div[4]/span
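A minimal frequency-based sketch in the spirit of the cited approach (simplified: it compares plain token counts rather than full occurrence paths): a token is considered part of the template only if it occurs the same number of times in every page of the site.

```python
from collections import Counter

def template_tokens(pages):
    """Tokens with identical occurrence counts across all pages of a site."""
    per_page = [Counter(page.split()) for page in pages]
    shared = set(per_page[0])
    for counts in per_page[1:]:
        shared &= set(counts)
    # A template token should occur equally often on every page.
    return {t for t in shared if len({c[t] for c in per_page}) == 1}
```

Note that "Height" is dropped by this naive version when it also occurs in the free text of one page, which is exactly why tracking where a token occurs (e.g. via XPaths) matters, not just how often.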
Launching searches on the Web
• For each entity identifier, the system launches one web search to discover new target pages
• To focus the searches, the query includes the entity description
• Example query: "Michael Jordan" (identifier) + pts height weight min ast (entity description)
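Query construction can be sketched as follows; the quoting of the identifier and the term ordering are illustrative assumptions:

```python
def build_query(identifier, description):
    """Instance identifier (quoted) plus the entity-description terms."""
    return f'"{identifier}" ' + " ".join(sorted(description))
```

For example, `build_query("Michael Jordan", {"pts", "ast", "min"})` yields a query combining the quoted name with the description terms.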
• We compute and check the template of each result
• Pages whose template contains terms matching the keywords of the entity description are considered instances of the entity
• Only a percentage of the terms needs to match
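The percentage-based filter can be sketched as follows; the 60% threshold is an assumption for illustration:

```python
def is_entity_page(template_terms, description, min_match=0.6):
    """Accept a result when enough description terms appear in its template."""
    return len(description & template_terms) / len(description) >= min_match
```

A player page whose template covers most description terms passes; a page sharing no terms with the description is filtered out.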
Experiments
• We ran experiments to analyze the approach. We focused on the sport domain, looking for pages containing data about the following entities:
- Basketball player
- Soccer player
- Hockey player
- Golf player
• We chose the sport domain because it is easy to:
- interpret the published data
- evaluate the precision of the results
Experiments: extracted entity descriptions
• All the terms can reasonably represent attribute names for the corresponding player entity
Experiments: using entity descriptions
• % of terms used in the filtering of the Google results vs. recall & precision
• 500 pages from 10 soccer web sites
• Google returned about 15,000 pages distributed over 4,000 distinct web sites
Experiments: pages found
• “Hockey player” entity
• 2 iterations of the cycle
• > 12,000 pages found
• > 5,000 distinct instances
Related work
• Our method is inspired by DIPRE (S. Brin, WebDB, 1998)
• Focused crawlers (S. Chakrabarti et al., Computer Networks, 1999)
- Typically rely on text classifiers to determine the relevance of the visited pages to the target topic
- There are analogies, but we look for pages containing instances of an entity
• CIMPLE (A. Doan et al., SIGIR, 2006)
- Builds a platform to support the information needs of a virtual community
- An expert is needed to provide relevant sources and design the E-R model of the domain of interest
Conclusions and future work
• We populated an entity-aware search engine for sport fans, using the facilities of Google Co-op: http://flint.dia.uniroma3.it/ (Demo section)
• To improve the entity description, we are working on a probabilistic model that dynamically computes a weight for the terms of the page templates
• We are investigating automatic wrapping techniques to extract, mine, and integrate data from the web pages collected by the proposed approach
Thank you!