Date posted: 11-May-2015
Category: Education
Uploaded by: thanh-tran
Thanh Tran, AIFB Institute, KIT, [email protected] KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association
Summary Models for Routing Keywords to Linked Data Sources
Thanh Tran, Lei Zhang, Rudi Studer
AIFB Institute, KIT
Agenda
Introduction
  Opportunities & challenges
  Contributions
Problem Definition
  LOD Data
  Keyword Query Answer
  Keyword Query Routing
Summary Models
  Keyword sets
  Element-level vs. schema-level vs. source-level Summary
Validity of Results vs. complexity
  Theo. / Exp. Results
Conclusions
Semantic Data
- 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links (as of 09-2010)
- Plus other data (e.g. LON, ontologies, RDFa)
- Increasing rapidly: more data, more links
Opportunities
“Articles from awarded researchers at Stanford”
• Freebase contains data about people
• DBPedia contains information about awards
• DBLP contains bibliographic data
More data, more links:
• More complex information needs
• More precise results
• More integrated results
Problems
“Articles from awarded researchers at Stanford”
$\mathit{prizes}(x, \text{Turing Award}) \land \mathit{worksAt}(x, y) \land \mathit{name}(y, \text{Stanford}) \land \mathit{publication}(x, z)$
USABILITY: Formulating queries is a hard task!
• Which data sources?
• Which schema elements?
SCALABILITY: Processing queries is expensive!
• Process against all data sources?
Large number of unknown & irrelevant sources! What is in there? What is relevant?
Keyword Query Routing
Given information needs expressed as sets of keywords, are there “corresponding answers” in linked data, and what combinations of data sources can be used to produce them?
Identify valid combinations of sources using keywords
Present schema elements for the user to formulate a query
Let the user choose a combination of sources
Process only relevant combinations of sources
Contributions
Introduce the novel problem of keyword query routing.
Introduce various summary models, which aim to compactly represent the search space.
Investigate the resulting trade-offs between result quality and efficiency through theoretical analysis and practical experiments using publicly available linked data sources.
Propose the multi-level relationship graph to capture the search space of keyword query routing.
LOD Element-level Graph
[Figure: element-level graph spanning the sources DBLP, Freebase, and DBPedia. Nodes: per1, per2, per3, per4, uni1, prize1, prize2, pub1, pub2, pub3. Literals: Stanford University, John McCarthy, Turing Award, John Smith, Music Award. Edge labels: name, label, title, author, employ, prizes, sameAs.]
Web data is modeled as a set of interlinked data graphs
Each data graph represents a source
Element-level graph vs. schema-level graph vs. source-level graph
LOD Schema-level Graph
[Figure: schema-level graph spanning the sources DBLP, Freebase, and DBPedia. Classes: Author, University, Person (twice), Prize, Written Work, Article. Edge labels: author, employ, sameAs, prizes.]
LOD Source-level Graph
[Figure: source-level graph. Nodes: DBLP, Freebase, DBPedia. Edge labels: sameAs, author.]
“Corresponding” Answers
), dD,Q,F,R(q ji
User information need: “stanford article award”
[Figure: the element-level graph over DBLP, Freebase, and DBPedia (nodes per1 to per4, uni1, prize1, prize2, pub1 to pub3), with an additional Article node and a type edge.]
Problem Definition
A keyword query result (also called a Steiner graph) is a subgraph of the union of the data- and schema-level graphs that contains a matching element for every keyword, with these elements pairwise connected by a path.
A d-max Steiner graph is a Steiner graph in which every path between keyword elements has length d-max or less.
Keyword query routing: compute valid sets of data sources, called keyword routing plans. A plan is valid if its sources produce non-empty keyword query results.
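The definitions above can be made concrete with a brute-force validity check on a toy graph. This is only a sketch: the element-to-source assignment, the edges, the keyword labels, and the helper names `distances` and `is_valid_plan` are assumptions for illustration, and the exhaustive search is not the algorithm from the paper.

```python
from collections import deque
from itertools import combinations, product

# Toy data loosely following the running example. Which element lives in
# which source, and which elements are linked, is assumed for illustration.
SOURCE = {"uni1": "Freebase", "per1": "Freebase",
          "per2": "DBLP", "pub1": "DBLP",
          "per3": "DBPedia", "prize1": "DBPedia"}
KEYWORDS = {"uni1": {"stanford", "university"},
            "per1": {"john", "mccarthy"},
            "per2": {"john", "mccarthy"},
            "pub1": {"article"},
            "per3": {"john", "mccarthy"},
            "prize1": {"turing", "award"}}
EDGES = [("per1", "uni1"),    # employ
         ("per1", "per2"),    # sameAs
         ("per2", "per3"),    # sameAs
         ("per2", "pub1"),    # author
         ("per3", "prize1")]  # prizes


def distances(start, allowed):
    """BFS hop counts from start, visiting only elements in `allowed`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for a, b in EDGES:
            for u, v in ((a, b), (b, a)):  # edges are treated as undirected
                if u == node and v in allowed and v not in dist:
                    dist[v] = dist[node] + 1
                    queue.append(v)
    return dist


def is_valid_plan(plan, keywords, d_max):
    """True if the sources in `plan` alone yield a d-max Steiner graph."""
    allowed = {e for e, s in SOURCE.items() if s in plan}
    matches = [[e for e in allowed if k in KEYWORDS[e]] for k in keywords]
    for choice in product(*matches):  # one matching element per keyword
        if all(distances(a, allowed).get(b, d_max + 1) <= d_max
               for a, b in combinations(choice, 2)):
            return True
    return False
```

With this toy data, the plan {Freebase, DBLP, DBPedia} is valid for the keywords stanford, article, award when d-max is 4, but not when d-max is 3, because uni1 and prize1 are four hops apart.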
A Valid Keyword Routing Plan
), dD,Q,F,R(q ji
User information need: “stanford article award”
[Figure: the element-level graph over DBLP, Freebase, and DBPedia (nodes per1 to per4, uni1, prize1, prize2, pub1 to pub3), with an additional Article node and a type edge.]
The Search Space
Multi-level inter-relationship graphs capture the entire search space
• Relationships between elements and between different levels
Search space is too large!
• Naïve solution not applicable: apply existing approaches to keyword search for computing Steiner graphs
• Steiner graphs might span several linked sources
• Search space grows exponentially with the number of sources and their associated links
Keyword Sets
[Figure: the element-level graph over DBLP, Freebase, and DBPedia, annotated with one keyword set per source. Keywords shown: Stanford, University, John, McCarthy, Turing, Award, Smith, Music.]
One keyword set for every data source
Elements stand for distinct keywords mentioned in a source
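A minimal sketch of how such keyword sets could be derived from labels. The triples in `LABELS`, the assignment of elements to sources, and the whitespace tokenization are assumptions for illustration, and `keyword_sets` is a hypothetical helper name.

```python
from collections import defaultdict

# (source, element, label) triples; the source assignment is assumed.
LABELS = [("Freebase", "uni1", "Stanford University"),
          ("Freebase", "per1", "John McCarthy"),
          ("DBLP", "per2", "John McCarthy"),
          ("DBPedia", "per3", "John McCarthy"),
          ("DBPedia", "prize1", "Turing Award"),
          ("DBPedia", "per4", "John Smith"),
          ("DBPedia", "prize2", "Music Award")]


def keyword_sets(labels):
    """Map each source to the set of distinct lowercased keywords it mentions."""
    sets = defaultdict(set)
    for source, _element, label in labels:
        sets[source].update(label.lower().split())
    return dict(sets)
```

Note that a keyword set records only that a source mentions a keyword; it keeps no information about how the mentioning elements are connected.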
Element-level Keyword-Element Relationship Graph (E-KERG)
[Figure: the element-level graph over DBLP, Freebase, and DBPedia with its keyword sets (Stanford, University, John, McCarthy, Turing, Award, Smith, Music), and an E-KERG excerpt with keyword-elements for uni1, per1, per2, per3, per4, prize1, prize2, and pub4; keywords shown include John and Award.]
A keyword-element captures a keyword k and the data element mentioning k
A relationship between two keyword-elements exists iff there is a path between their associated data elements
In a d-max KERG, the paths to be considered have length d-max or less
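The construction can be sketched as follows. The toy edges and keyword assignments are assumptions, `reachable` and `build_e_kerg` are hypothetical helper names, and two keyword-elements on the same data element are treated as related through a path of length 0.

```python
from collections import deque
from itertools import combinations

# Toy element-level graph; which elements are linked is assumed.
EDGES = [("per1", "uni1"), ("per1", "per2"),
         ("per2", "per3"), ("per3", "prize1")]
KEYWORDS = {"uni1": ["stanford"], "per1": ["mccarthy"],
            "per2": ["mccarthy"], "per3": ["mccarthy"],
            "prize1": ["award"]}


def reachable(start, d_max):
    """Elements within d_max hops of start (edges treated as undirected)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == d_max:
            continue  # do not expand beyond the distance bound
        for a, b in EDGES:
            for u, v in ((a, b), (b, a)):
                if u == node and v not in seen:
                    seen[v] = seen[node] + 1
                    queue.append(v)
    return set(seen)


def build_e_kerg(d_max):
    """Return (keyword-elements, relationships) of a d_max E-KERG."""
    nodes = [(k, e) for e, ks in KEYWORDS.items() for k in ks]
    near = {e: reachable(e, d_max) for e in KEYWORDS}
    edges = {frozenset((a, b)) for a, b in combinations(nodes, 2)
             if b[1] in near[a[1]]}
    return nodes, edges
```

On this toy graph, stanford at uni1 and award at prize1 are four hops apart, so they are related in the 4-max E-KERG but not in the 2-max E-KERG.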
Schema-level Keyword-Element Relationship Graph (S-KERG)
[Figure: the element-level graph over DBLP, Freebase, and DBPedia with its keyword sets, and a KERG excerpt in which the keyword-elements for uni1, per1, per2, per3, per4, prize1, prize2, and pub4 are grouped under the classes University, Person, Author, Article, Person, and Prize.]
A keyword-element captures a keyword k and the schema element which contains some instances (data elements) mentioning k
A relationship between two keyword-elements exists if there is a path between some instances of their associated schema elements
Groups elements (relationships) when they capture the same keyword in the same class (the same keyword relationship between the same pair of classes)
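One way to picture the grouping is as a projection of element-level relationships onto classes. In this sketch the element-level relationships in `E_EDGES`, the class assignment in `CLASS_OF`, and the helper name `build_s_kerg` are assumptions; class names are not qualified by source here, which a real summary would likely need.

```python
# Element-level relationships between (keyword, element) pairs; assumed excerpt.
E_EDGES = [(("mccarthy", "per1"), ("stanford", "uni1")),
           (("john", "per3"), ("award", "prize1")),
           (("john", "per4"), ("award", "prize2"))]

# Hypothetical class of each data element.
CLASS_OF = {"uni1": "University", "per1": "Person", "per3": "Person",
            "per4": "Person", "prize1": "Prize", "prize2": "Prize"}


def build_s_kerg(e_edges, class_of):
    """Replace each data element by its class and merge duplicates."""
    nodes, edges = set(), set()
    for (k1, e1), (k2, e2) in e_edges:
        n1, n2 = (k1, class_of[e1]), (k2, class_of[e2])
        nodes.update((n1, n2))
        edges.add(frozenset((n1, n2)))
    return nodes, edges
```

Here the two element-level relationships between john and award collapse into a single relationship between (john, Person) and (award, Prize), which is where the summary saves space.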
Data-Source-level Keyword-Element Relationship Graph (D-KERG)
[Figure: the element-level graph over DBLP, Freebase, and DBPedia with its keyword sets, and a KERG excerpt in which the keyword-elements for uni1, per1, per2, per3, per4, prize1, prize2, and pub4 are grouped by source; class labels shown: University, Person, Author, Article, Person, Prize.]
A keyword-element captures a keyword k and the source which contains some instances (data elements) mentioning k
A relationship between two keyword-elements exists if there is a path between some instances of their associated sources
Groups elements (relationships) when they capture the same keyword in the same source (the same keyword relationship between the same pair of sources)
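A sketch of how routing plans might be read off a D-KERG. The element-level relationships in `E_EDGES`, the source assignment in `SOURCE_OF`, and the helper names are assumptions, and the exhaustive enumeration below stands in for whatever plan computation the paper actually uses.

```python
from itertools import combinations, product

# Element-level relationships between (keyword, element) pairs; assumed.
E_EDGES = [(("stanford", "uni1"), ("mccarthy", "per1")),
           (("stanford", "uni1"), ("mccarthy", "per2")),
           (("stanford", "uni1"), ("article", "pub1")),
           (("mccarthy", "per1"), ("mccarthy", "per2")),
           (("mccarthy", "per1"), ("article", "pub1")),
           (("mccarthy", "per2"), ("article", "pub1"))]
SOURCE_OF = {"uni1": "Freebase", "per1": "Freebase",
             "per2": "DBLP", "pub1": "DBLP"}


def build_d_kerg(e_edges, source_of):
    """Replace each data element by its source and merge duplicates."""
    return {frozenset(((k1, source_of[e1]), (k2, source_of[e2])))
            for (k1, e1), (k2, e2) in e_edges}


def routing_plans(d_kerg, keywords):
    """Source sets whose keyword-elements cover the query and are pairwise related."""
    nodes = {n for edge in d_kerg for n in edge}
    options = [[n for n in nodes if n[0] == k] for k in keywords]
    plans = set()
    for choice in product(*options):  # one keyword-element per keyword
        if all(frozenset((a, b)) in d_kerg
               for a, b in combinations(choice, 2)):
            plans.add(frozenset(src for _k, src in choice))
    return plans
```

For the keywords stanford, mccarthy, article this toy D-KERG yields the single plan {Freebase, DBLP}; a keyword that no source mentions yields no plan at all.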
Theoretical Results
When Steiner graphs can be found for the keyword query K in the data, a corresponding keyword routing plan can be found in the KERG.
The keyword routing plans derived from the summary are not necessarily valid, i.e., there might be no corresponding Steiner graph in the data.
Detailed results, algorithms, and complexity results are in the paper!
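A small counterexample illustrates why a plan from the summary may be invalid. Everything here is hypothetical: a single source "S" holds three disconnected element pairs, the keywords alpha, beta, gamma are placeholders, and `has_clique` uses pairwise relatedness of chosen keyword-elements as a stand-in for the existence of a Steiner graph.

```python
from itertools import combinations, product

# Hypothetical source "S" with three disconnected element pairs.
E_EDGES = [(("alpha", "e1"), ("beta", "e2")),
           (("beta", "e3"), ("gamma", "e4")),
           (("alpha", "e5"), ("gamma", "e6"))]


def has_clique(edges, keywords):
    """True if one keyword-element per keyword can be chosen so that
    every pair of chosen keyword-elements is related."""
    related = {frozenset(e) for e in edges}
    nodes = {n for e in edges for n in e}
    options = [[n for n in nodes if n[0] == k] for k in keywords]
    return any(all(frozenset(pair) in related
                   for pair in combinations(choice, 2))
               for choice in product(*options))


# Source-level summary: every data element is replaced by its source "S".
D_EDGES = [((k1, "S"), (k2, "S")) for (k1, _e1), (k2, _e2) in E_EDGES]

query = ["alpha", "beta", "gamma"]
found_in_summary = has_clique(D_EDGES, query)  # the summary suggests plan {S}
found_in_data = has_clique(E_EDGES, query)     # no connected answer exists
```

The summary merges e1 and e5, e2 and e3, and e4 and e6 into one keyword-element each, so the three separate pairs look like a triangle. Going the other way, every related pair at the element level stays related after merging, which matches the first statement on the slide.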
Experiments
Chunk of the BTC dataset containing 10M RDF triples from 154 sources, linked via 500K mappings
30 manually crafted, valid multi-data-source keyword queries, i.e., queries that produce non-empty keyword answers and involve more than 2 sources
Example keywords: Town River America Beijing Conference Database 2007
Validity
P@k measures the percentage of plans that are valid out of the top-k plans
• P@5 up to 100% for E-KERG (dmax = 4); P@5 for KS only 6%
More valid plans were computed when a higher value was used for dmax
• dmax = 3 seems to be a good trade-off
Queries with a larger number of keywords resulted in lower precision
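The measure itself is simple to compute. In this sketch, `precision_at_k` is a hypothetical helper, the plan names are placeholders, and dividing by the number of plans actually returned (when fewer than k exist) is an assumption, since the slides do not say how that case is handled.

```python
def precision_at_k(ranked_plans, is_valid, k):
    """Fraction of the top-k ranked plans for which is_valid(plan) is true."""
    top = ranked_plans[:k]
    if not top:
        return 0.0
    return sum(1 for plan in top if is_valid(plan)) / len(top)


# Placeholder ranking of five plans, two of which are valid.
ranked = ["A", "B", "C", "D", "E"]
valid = {"A", "C"}
p_at_5 = precision_at_k(ranked, lambda plan: plan in valid, 5)
p_at_1 = precision_at_k(ranked, lambda plan: plan in valid, 1)
```

In this example two of the top five plans are valid, so P@5 is 0.4, while P@1 is 1.0 because the top-ranked plan is valid.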
[Figure: two plots of P@5 (0.0 to 1.0) for E-KERG, D-KERG, S-KERG, and KS: one against the number of keywords |K| (2 to 5), one against dmax (0 to 4).]
Performance
Times increased with higher values for dmax
• Sharply for E-KERG and S-KERG
• Relatively stable for D-KERG
Times increased with the number of keywords
• All models except D-KERG performed poorly on complex queries
• E-KERG needed more than 100s for queries with more than 2 keywords
• Time for D-KERG was no more than 10ms on average
[Figure: two plots of query processing time in ms (log scale, 1 to 1,000,000) for S-KERG, D-KERG, KS, and E-KERG: one against dmax (0 to 4), one against |K| (2 to 5).]
Conclusions
Keyword query routing helps users without knowledge of linked data and schemas to find combinations of sources that contain answers corresponding to their needs
Summarizing relationships is essential for dealing with the large-scale linked data Web (E-KERG performed poorly, requiring more than 100s for complex queries)
Summarizing at the level of sources (D-KERG) represents the most practical trade-off: it produces results in less than 10ms, of which every second one was valid
However, validity is still low for complex queries (<30% with 4 keywords)
Baseline approaches for a novel problem
• Further improve validity and consider relevance!
• Combine keyword query routing with source and structured query processing to compute final results!
Thanks for Your Attention!
Institute AIFB, KIT