Date post: | 07-May-2015 |
Upload: | thanh-tran |
KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association
INSTITUTE OF APPLIED INFORMATICS AND FORMAL DESCRIPTION METHODS (AIFB)
www.kit.edu
Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration
Daniel M. Herzig, Thanh Tran
WWW 2012
Agenda
Motivation
Problem Definition
Existing Solutions
Our Approach
Entity Relevance Model (ERM)
Ranking
On-The-Fly alignment
Experiments
Conclusion
Daniel M. Herzig - Institute AIFB
Company running a movie shopping website
[Figure: the company's dataset Ds behind the movie shopping website]
Users search the website via forms. Each search request is executed internally as a structured query.
Screenshot of http://www.imdb.com/search/title
[Figure: structured query qs (e.g. SQL, SPARQL) over Ds, in the IMDb vocabulary (i:): ?x type i:movie; ?x i:directors Steven Spielberg; ?x i:year 1982]
Company discovers the plethora of Linked Data available on the Web and identifies Data Sources beneficial for its business
[Figure: Ds and qs next to the Linked Data cloud. Linked Data on the Web: http://richard.cyganiak.de/2007/10/lod/]
Zero Star Mugs!
Problems of Data Integration arise…
qs does not return results
No links, no integration
No knowledge about the external data schema
External data might change often
Problem Definition
Find relevant entities in a set of target datasets Dt, given a source dataset Ds and a structured entity query qs adhering to the vocabulary of Ds.
[Figure: the structured entity query qs is posed against the source dataset Ds; which entities in the target datasets Dt1, Dt2 are relevant is unknown (?)]
Problem Setting
Data model is a labeled directed graph
Directly related to RDF
RDF specifics, e.g. blank nodes, are omitted
Entity query: SPARQL BGP query with one select variable
Entity queries are the most frequent type of web search queries (Pound et al., WWW 2010)
Web Data scenario: data exhibits heterogeneity on the schema level and the data level
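The problem setting above can be made concrete with a small sketch. It assumes a star-shaped entity query in which every pattern has the single select variable ?x as its subject and a constant as its object; the tiny dataset and the helper name `evaluate_entity_query` are hypothetical and only serve as illustration.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Hypothetical miniature of a source dataset Ds: a labeled directed graph
# stored as (subject, predicate, object) edges.
DS = [
    Triple("m1", "type", "i:movie"),
    Triple("m1", "i:directors", "Steven Spielberg"),
    Triple("m1", "i:year", "1982"),
    Triple("m2", "type", "i:movie"),
    Triple("m2", "i:directors", "Steven Spielberg"),
    Triple("m2", "i:year", "1993"),
]

# Entity query qs: each pattern is (predicate, object), with ?x as the implicit subject.
QS = [("type", "i:movie"), ("i:directors", "Steven Spielberg"), ("i:year", "1982")]

def evaluate_entity_query(graph, patterns):
    """Return the subjects that satisfy every (predicate, object) pattern."""
    edges = {(t.subject, t.predicate, t.obj) for t in graph}
    subjects = {t.subject for t in graph}
    return sorted(s for s in subjects if all((s, p, o) in edges for p, o in patterns))

print(evaluate_entity_query(DS, QS))
```

Run against a target dataset with a different vocabulary (e.g. db:director instead of i:directors), the same patterns match nothing, which is the situation the talk addresses.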
Heterogeneous Web Data
Schema-level: actors vs. starring
Data-level: Steven Spielberg vs. Spielberg, Steven
Varying number of attributes per entity
[Figure: example entities described in three vocabularies]
Amazon (a:), entity ea: type a:Movie; a:Title Munich; a:Directors Steven Spielberg; a:Actors Daniel Craig, Eric Bana; a:Binding DVD; a:ReleaseDate 2005
IMDb (i:), entity ei: type i:movie; i:title E.T. (1994); i:directors Spielberg, Steven (I); i:actors Coyote, Peter; i:producer Spielberg, Steven (I)
DBpedia (db:), entity ed: type db:Film; rdfs:label 1941 (film); db:director db:Steven_Spielberg; db:starring db:John_Candy_(actor)
Aim: Integrate External Data into the Search Process
[Figure: the query qs over Ds should also retrieve entities from the target datasets Dt]
Keyword Search
Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009)
Query rewriting based on up-front data integration
Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003)
Existing Strategies – Keyword Search
Transform qs into keyword query
Match against bag-of-words representation of entities
Bridges schema differences by neglecting the structure
Baseline 1 (KW), IR baseline using Semplore (Lucene)
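A minimal sketch of the keyword strategy, under stated assumptions: plain term overlap stands in for the Semplore/Lucene ranking used in the actual baseline, and the helper names (`tokens`, `to_keyword_query`, `overlap_score`) are hypothetical. The query and entities follow the example on this slide.

```python
import re

def tokens(text):
    """Lowercase the text and split camelCase and punctuation into plain words."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return re.findall(r"[a-z0-9]+", spaced.lower())

def strip_prefix(term):
    """Drop a vocabulary prefix such as 'a:' or 'i:' (assumes values contain no other colon)."""
    return term.split(":", 1)[-1]

def to_keyword_query(patterns):
    """Flatten (predicate, value) patterns into a list of keywords, ignoring structure."""
    words = []
    for predicate, value in patterns:
        words += tokens(strip_prefix(predicate)) + tokens(strip_prefix(value))
    return words

def bag_of_words(entity):
    """Bag-of-words representation of an entity given as (predicate, value) pairs."""
    return {w for predicate, value in entity
            for w in tokens(strip_prefix(predicate)) + tokens(strip_prefix(value))}

def overlap_score(query_words, entity):
    """Number of distinct query keywords that occur in the entity's bag of words."""
    bag = bag_of_words(entity)
    return sum(1 for w in set(query_words) if w in bag)

qs = [("a:Directors", "Rainer Werner Fassbinder"),
      ("a:TheatricalReleaseDate", "1982"),
      ("type", "a:Movie")]
e1 = [("i:title", "Veronika Voss"), ("i:director", "Rainer Werner Fassbinder"), ("i:released", "1982")]
e2 = [("i:title", "Schindlers Liste (1994)"), ("i:director", "Spielberg, Steven (I)"), ("type", "i:movie")]

keywords = to_keyword_query(qs)
print(" ".join(keywords))
print(overlap_score(keywords, e1), overlap_score(keywords, e2))
```

Because structure is dropped, e2 still collects matches on generic words such as "type" and "movie", which illustrates the imprecision of this baseline.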
[Figure: keyword search example]
(1) Amazon query (a:): ?x type a:Movie; ?x a:Directors "Rainer Werner Fassbinder"; ?x a:TheatricalReleaseDate 1982
(2) Keyword query: directors rainer werner fassbinder theatrical release date 1982 type movie
(3) Matching against IMDb (i:) entities as bags of words:
e1 (i:title Veronika Voss; i:director Rainer Werner Fassbinder; i:released 1982): title veronika voss director rainer werner fassbinder released 1982
e2 (type i:movie; i:title Schindlers Liste (1994); i:director Spielberg, Steven (I)): title schindlers liste 1994 director spielberg steven i type movie
Existing Strategies – Query Rewriting
Create mappings using ontology alignment tools (Falcon AO)
Rewrite query using the mappings, omit missing mappings, replace constants with variables
Reduces the search space; keyword search is performed on top
Baseline 2 (QR), database-style baseline
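The rewriting step can be sketched as follows. This is an illustration under assumptions, not the baseline's actual implementation: the mapping table is hard-coded here rather than produced by an alignment tool, `type` is treated as shared vocabulary, and the function name `rewrite` is hypothetical.

```python
# Hypothetical mappings, of the kind an ontology alignment tool might produce.
MAPPINGS = {
    "a:Directors": "db:director",
    "a:Title": "db:name",
    "a:Actor": "db:starring",
}

def rewrite(patterns, mappings):
    """Rewrite (predicate, constant) patterns into target-vocabulary triple patterns.

    Predicates without a mapping are omitted, and every constant is replaced
    by a fresh variable, so the rewritten query only constrains the structure.
    """
    rewritten = []
    fresh = 0
    for predicate, _constant in patterns:
        if predicate == "type":          # assumed to be shared across vocabularies
            target = "type"
        elif predicate in mappings:
            target = mappings[predicate]
        else:
            continue                     # missing mapping: drop the pattern
        fresh += 1
        rewritten.append(("?x", target, f"?v{fresh}"))
    return rewritten

qs = [("type", "a:Movie"),
      ("a:Directors", "Rainer Werner Fassbinder"),
      ("a:TheatricalReleaseDate", "1982")]
print(rewrite(qs, MAPPINGS))
```

The rewritten query only narrows the candidates to entities that have the mapped attributes; the constants are then matched by keyword search over that reduced set.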
[Figure: query rewriting example]
Ontology alignment tool: takes the Amazon schema and the DBpedia schema and produces mappings such as a:Directors = db:director, a:Title = db:name, a:Actor = db:starring, …
Amazon query (a:): ?x type a:Movie; ?x a:Directors "Rainer Werner Fassbinder"; ?x a:TheatricalReleaseDate 1982
Rewritten DBpedia query (db:): ?x db:director ?y; ?x type ?z
Contributions
(1) Novel approach for querying heterogeneous Web data sources
No upfront data integration necessary
Uses an Entity Relevance Model (ERM) for ranking and for computing mappings on the fly
(2) Implementation of the approach
Construction of an ERM and its usage for alignment and ranking
Best-effort algorithm for creating mappings during runtime
(3) Large-scale evaluation with 3 real-world datasets
Experiments show that our approach exceeds the KW and QR baselines by 120% and 54%, respectively, in terms of Mean Average Precision.
Overview of our Approach
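[Figure: overview of the approach]
qs is run against Ds to obtain the results Rs
Relevance feedback: an Entity Relevance Model is built from Rs, a model leveraging the structure of the data
A keyword query retrieves candidate entities et from the target datasets Dt (keyword search to cross vocabulary mismatches)
Matching and ranking of the candidates et against the model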
Entity Relevance Model (ERM)
Based on the Structured Relevance Model (Lavrenko et al. 2007)
Entity Relevance Model:
Query specific model
Captures structure and content of relevant results
Composite model consisting of language models weighted by occurrence.
Based on Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007)
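A minimal sketch of such a composite model, under assumptions that go beyond the slide text: each field holds a maximum-likelihood unigram language model over the attribute's values, and "weighted by occurrence" is read as the fraction of result entities that carry the attribute. The function names and the single-valued entity dictionaries are hypothetical.

```python
from collections import Counter, defaultdict

def language_model(texts):
    """Maximum-likelihood unigram model over the words of the given values."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def build_erm(results):
    """One field per attribute seen in the results: an occurrence weight plus a language model."""
    values = defaultdict(list)
    for entity in results:
        for attribute, value in entity.items():
            values[attribute].append(value)
    n = len(results)
    return {a: {"weight": len(v) / n, "lm": language_model(v)} for a, v in values.items()}

# Results Rs = {e1, e2} of a seed query, modeled on the Fassbinder example of the talk.
e1 = {"type": "Film", "director": "Rainer Werner Fassbinder", "label": "World on Wires",
      "released": "1973", "starring": "Klaus Löwitsch"}
e2 = {"type": "Film", "director": "Rainer Werner Fassbinder", "label": "Veronika Voss",
      "released": "1982", "language": "German", "starring": "Barbara Valentin"}

erm = build_erm([e1, e2])
print(erm["language"]["weight"], erm["released"]["lm"])
```

The model is query specific: it is rebuilt from the results of each seed query, so its fields and word distributions reflect only what is relevant to that query.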
ERM (2)
[Figure: qs yields the results Rs = {e1, e2}, from which the ERM is built]
e1: type Film; director Rainer Werner Fassbinder; label World on Wires; released 1973; starring Klaus Löwitsch
e2: type Film; director Rainer Werner Fassbinder; label Veronika Voss; released 1982; language German; starring Barbara Valentin
Modelling Target Entities
Modeled the same way as ERM
Language Model for each attribute
[Figure: IMDb (i:) entity ei: type i:movie; i:title E.T. (1994); i:directors Spielberg, Steven (I); i:actors Coyote, Peter; i:producer Spielberg, Steven (I)]
Ranking
Idea:
Rank candidate entities according to their similarity to ERM
Note: An alignment between ERM and et is needed
If no mapping is available, use the maximum cross entropy (max H).
[Formula with annotated terms: boosting of seed query attributes; frequency of as; cross entropy]
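The exact ranking formula is not part of the extracted slide text, so the following is only a sketch that combines the three annotated ingredients in one plausible way: a weighted sum over the ERM fields of boost × field frequency × cross entropy, where a lower score means higher similarity. The function name, the default b = 2.0, and all numbers are assumptions; the cross entropies are taken as already computed.

```python
def score(field_weights, seed_attributes, entropies, max_h, b=2.0):
    """Weighted sum of cross entropies over the ERM fields (lower is better).

    field_weights   -- frequency of each ERM field a_s
    seed_attributes -- attributes present in the seed query (boosted by b)
    entropies       -- cross entropy per ERM field for one candidate entity
    max_h           -- value used when no mapping exists for a field
    """
    total = 0.0
    for field, weight in field_weights.items():
        boost = b if field in seed_attributes else 1.0
        h = entropies.get(field, max_h)  # no mapping available: fall back to max H
        total += boost * weight * h
    return total

field_weights = {"director": 1.0, "released": 1.0, "language": 0.5}
seed = {"director", "released"}
candidates = {
    "et1": {"director": 1.0, "released": 2.0},  # hypothetical cross entropies
    "et2": {"director": 4.0},
}

ranking = sorted(candidates, key=lambda e: score(field_weights, seed, candidates[e], max_h=10.0))
print(ranking)
```

Unmapped fields are penalized with max H rather than skipped, so a candidate that covers more of the model's fields tends to rank higher.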
On The Fly Alignment
Which target attribute at corresponds to the ERM field as (as ~ at)?
Idea:
Compare all language models of et to a field of the ERM using cross entropy H.
Establish a mapping if the lowest value of H is below a threshold t.
Worst case: n · r comparisons
n, r are usually small
Allows reuse of computed cross entropies for subsequent ranking
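The alignment idea can be sketched as below. Assumptions are labeled: flooring missing probabilities at a small epsilon is a simplification standing in for proper smoothing, the threshold t = 3.0 and the example distributions are made up, and the function names are hypothetical.

```python
import math

def cross_entropy(p, q, epsilon=1e-6):
    """H(p, q) = -sum_w p(w) * log q(w); q is floored at epsilon to avoid log(0)."""
    return -sum(pw * math.log(max(q.get(w, 0.0), epsilon)) for w, pw in p.items())

def align(erm_fields, target_lms, t):
    """Map each ERM field to the target attribute with the lowest cross entropy, if below t.

    Returns the mappings and all computed cross entropies, so the latter
    can be reused for ranking.
    """
    mappings, entropies = {}, {}
    for a_s, lm_s in erm_fields.items():
        scored = {a_t: cross_entropy(lm_s, lm_t) for a_t, lm_t in target_lms.items()}
        entropies[a_s] = scored
        if not scored:
            continue
        best = min(scored, key=scored.get)
        if scored[best] < t:
            mappings[a_s] = best
    return mappings, entropies

erm_fields = {
    "director": {"rainer": 1 / 3, "werner": 1 / 3, "fassbinder": 1 / 3},
    "released": {"1982": 1.0},
    "language": {"german": 1.0},
}
target_lms = {  # language models of one candidate entity et
    "i:directors": {"fassbinder": 0.5, "rainer": 0.25, "werner": 0.25},
    "i:title": {"veronika": 0.5, "voss": 0.5},
    "i:year": {"1982": 1.0},
}

mappings, entropies = align(erm_fields, target_lms, t=3.0)
print(mappings)
```

Every ERM field is compared with every attribute of the candidate, which is where the worst-case number of comparisons comes from; the field "language" finds no attribute below the threshold and stays unmapped.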
EXPERIMENTS
Datasets
Three real-world, heterogeneous Web datasets:
(1) DBpedia 3.5.1, structured representation of Wikipedia
(2) IMDb, information about movies
(3) Amazon, information about DVD/Videos
Datasets (2) and (3) were crawled and transformed to RDF; provided by L3S
Ground Truth
Goal is to find relevant entities in the target datasets
The seed query qs is manually rewritten to obtain the relevant entities in the target datasets.
3 query sets each with 23 corresponding entity BGP SPARQL queries
[Figure: one seed query expressed in each of the three vocabularies]
Amazon (a:): ?x type a:Movie; ?x a:Directors "Rainer Werner Fassbinder"; ?x a:TheatricalReleaseDate 1982
DBpedia (db:): ?x type db:Film; ?x db:director db:Rainer_Werner_Fassbinder; ?x db:released 1982
IMDb (i:): ?x type i:movie; ?x i:directors "Fassbinder, Rainer Werner"; ?x i:year 1982
IR Experiments
Baseline KW – Keyword Search
Baseline QR – Query Rewriting
Three configurations of ERM:
ERM – computes alignments on the fly
ERMa – uses pre-computed alignments only
ERMq – uses pre-computed alignments and creates mappings on top
Six different retrieval settings.
Results (1) – Mean Average Precision
ERM improves over KW by 120% and over QR by 54%
ERMa performs slightly better than ERM
ERMq performs best.
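For reference, Mean Average Precision can be computed as in the following sketch. It shows the standard definition of the metric; the rankings and relevance sets are invented toy data, not results from the experiments.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where a relevant entity appears."""
    hits, precision_sum = 0, 0.0
    for rank, entity in enumerate(ranked, start=1):
        if entity in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Average of the per-query average precision over all (ranking, relevant) pairs."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Toy data: two queries with their ranked results and ground-truth entities.
runs = [
    (["a", "b", "c"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
print(mean_average_precision(runs))
```

A relative improvement such as "by 120%" then refers to (MAP_new − MAP_old) / MAP_old.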
Results (2) – On The Fly Alignment
Pooled mappings for n = 115k entities
Average Precision = 0.7, Average Recall = 0.3 for relevant entities
Pearson correlation ρ(MAP, Precision-Rel) = 0.98
Results (3) – Parameter and Runtime Analysis
Analysis of the model parameters
Sensitivity of retrieval performance in terms of MAP for varying parameter configurations
Runtime analysis
Execution takes less than 13 s on average
Can be improved by moving tasks (e.g. computation of language models) to index time.
Conclusion
Novel approach for searching entities in a target dataset Dt with a structured query qs adhering to the vocabulary of Ds.
Entity Relevance Model used for ranking and creating mappings during runtime.
Experiments showed that our approach is effective and exceeds the baselines substantially.
ACKNOWLEDGEMENTS: We thank our colleagues Philipp Sorg and Günter Ladwig for helpful discussions. Also, we thank Julien Gaugaz and the L3S Research Center for providing us their versions of the IMDb and Amazon datasets. This work was supported by the German Federal Ministry of Education and Research (BMBF) under the iGreen project (grant 01IA08005K).
Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration
Daniel M. Herzig, Thanh Tran
Institute AIFB, Karlsruhe Institute of Technology, Germany
THANK YOU!
Execution Process of our Approach
Run qs against Ds to obtain results Rs
Build ERM from Rs
Obtain candidate entities et
Compare et to ERM
Rank et according to similarity to ERM
Runtime Analysis
Average execution time less than 13 sec for the parameter setting used in the IR experiments.
Increasing parameter c (i.e. reducing the number of fields of the ERM) improves runtime performance
Our implementation performed some tasks at runtime, which can be moved to index time
Improvements are easily possible
Parameter Analysis
Model is robust in certain parameter ranges
Boosting b: Beneficial for similar datasets, less so for diverse ones
Pruning c: Small effect on effectiveness, larger effect on efficiency
Boosting Parameter b
If attribute as is present in the seed query, the boosting parameter is set to b, in order to increase its influence during ranking.
Alignment
Compare the language models (probability distributions) of ERM and et by cross entropy
Related Work (excerpt)
Keyword Search
Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009)
Query rewriting
Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003)
Our approach is based on
Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007)
Madhavan et al.: Web-scale Data Integration: You can afford to pay as you go. In: CIDR. (2007)