Date post: | 07-Jan-2017 |
Category: |
Science |
Upload: | filip-ilievski |
LOTUS: Linked Open Text UnleaShed
Authors
Marieke van Erp
Stefan Schlobach
Wouter Beek
Filip Ilievski
Laurens Rietveld
Consuming LD
Finding relevant LD resources based on natural text
Central for application areas: Information Retrieval, Named Entity Linking
“central indices (e.g. Sindice) have disappeared”
“HELR : The Harvard Environmental Law Review”
Let’s play a game ...
Find Linked Open Data resources for:
How do we find relevant resources on the Semantic Web today ?
1) Dereference
literals are not dereferenceable by definition
2) SPARQL endpoints
find resources only in an explicitly stated set of datasets
exact or substring/regex matching
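To make the SPARQL option concrete, here is a minimal sketch of the kind of label lookup one might send to a single endpoint. The helper name `build_label_query` and the choice of `rdfs:label` as the property to search are assumptions for illustration; they are not taken from the slides. The helper only builds the query string, since any endpoint it is sent to covers just its own datasets.

```python
def build_label_query(text: str, exact: bool = True, limit: int = 10) -> str:
    """Build a SPARQL 1.1 query that looks up resources by label text.

    Hypothetical helper for illustration: it returns the query string and
    does not contact any endpoint.
    """
    # Escape backslashes and double quotes so the text is a valid SPARQL string.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    if exact:
        # Exact matching: the label must equal the query text.
        condition = f'FILTER(STR(?label) = "{escaped}")'
    else:
        # Substring matching, made case-insensitive with LCASE.
        condition = f'FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{escaped}")))'
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?resource ?label WHERE {\n"
        "  ?resource rdfs:label ?label .\n"
        f"  {condition}\n"
        "}\n"
        f"LIMIT {limit}"
    )


print(build_label_query("Harvard Environmental Law Review", exact=False))
```

Neither variant tolerates abbreviations, spelling variants, or reordered words, which is the limitation the slides point at.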
Sometimes we do this ...
“COLD means using SPARQL on centralized endpoints”
Summarising: Findability is a problem on SW today
We need:
● a single entry point to the Linked Open Data cloud
● to find resources based on approximate text matching
Towards the Findability problem of the SW
We need:
● a single entry point to the Linked Open Data → LOD Laundromat
● to find resources based on approximate text matching → LOTUS
#1 LOD Laundromat
Infrastructure that washes other people’s dirty data and republishes it as RDF
Central entry point to the Linked Open Data cloud
657,885 documents
38,606,408,433 statements
#1 LOD Laundromat
... can be simultaneously queried from the Laundromat Wardrobe
#2 LOTUS
Full-text lookup index over LOD Laundromat
Finds resources based on associated natural text
Inspired by application areas: IR and NED
LOTUS’ approach
Text2Literal mapping (and onwards to documents and resources)
for described resources (with at least one associated literal)
that contain natural text (numbers and dates are not findable)
through a rich string approximation model (substring, phonetic, synonym matching, TF-IDF scoring, match granularity).
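As an illustration of one ingredient of the approximation model listed above, the following is a minimal sketch of TF-IDF scoring over a handful of literals. It is a simplified stand-in and not LOTUS's actual scoring: the function names, the tokenizer, and the exact IDF formula are assumptions made for this example.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_rank(query: str, literals: List[str]) -> List[str]:
    """Return the literals that share a term with the query, best match first.

    Score = sum over shared terms of (term frequency in the literal) * IDF,
    with IDF = log(1 + N / df). This formula is an illustrative choice.
    """
    docs = [Counter(tokenize(lit)) for lit in literals]
    n = len(docs)
    query_terms = set(tokenize(query))
    scored = []
    for lit, counts in zip(literals, docs):
        score = 0.0
        for term in query_terms:
            if term not in counts:
                continue
            # Document frequency: number of literals containing the term.
            df = sum(1 for d in docs if term in d)
            score += counts[term] * math.log(1 + n / df)
        if score > 0:
            scored.append((score, lit))
    # Highest score first; ties keep their input order.
    return [lit for _, lit in sorted(scored, key=lambda pair: -pair[0])]


literals = [
    "Harvard Environmental Law Review",
    "Harvard University",
    "Law Review",
]
print(tfidf_rank("Harvard Environmental Law Review", literals))
```

Rare terms such as "Environmental" contribute more to the score than terms shared by many literals, which is what pushes the full journal title to the top.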
Implementation - index builder
Index fields
subject
predicate
object string
user langtag
document ID
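The five index fields can be pictured as one record per literal. The sketch below is only an illustration of that shape: the class name `IndexEntry`, the attribute names, the example URI, and the document identifier are all hypothetical and not taken from LOTUS itself.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexEntry:
    """One indexed literal together with the triple and document it came from."""
    subject: str            # URI of the described resource
    predicate: str          # URI of the property linking subject and literal
    string: str             # lexical form of the object literal
    langtag: Optional[str]  # language tag supplied by the data publisher, if any
    doc_id: str             # identifier of the source document


entry = IndexEntry(
    subject="http://example.org/journal/helr",  # hypothetical URI
    predicate="http://www.w3.org/2000/01/rdf-schema#label",
    string="HELR : The Harvard Environmental Law Review",
    langtag="en",
    doc_id="doc-0001",  # hypothetical document identifier
)

# A plain dict is a convenient form for handing the record to an indexer.
record = asdict(entry)
print(record)
```

Keeping the subject and document ID next to the string is what allows a text match to lead onwards to resources and documents.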
Implementation - Public Interface
Query modes
PHRASE: substring matching
phrase(“Harvard Environmental Law Review”)
TERMS: lookup a set of terms
terms(“HELR. Harvard ELR Environmental Law Review”)
*optionally, supply a langtag:
phrase(“Harvard Environmental Law Review”, “en”)
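The difference between the two query modes can be sketched over an in-memory list of literals. This is a toy model, not the LOTUS API: the functions below only borrow the names `phrase` and `terms` from the slide, and two behaviours are assumptions, namely that `terms` matches when at least one query term occurs, and that the langtag filter is an exact, case-insensitive comparison.

```python
import re
from typing import List, Optional, Sequence, Tuple

# A literal is modelled as (string, langtag); the langtag may be None.
Literal = Tuple[str, Optional[str]]


def _tokens(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def _lang_ok(tag: Optional[str], wanted: Optional[str]) -> bool:
    """No requested langtag means no filtering (assumed exact match otherwise)."""
    return wanted is None or (tag is not None and tag.lower() == wanted.lower())


def phrase(query: str, literals: Sequence[Literal],
           langtag: Optional[str] = None) -> List[str]:
    """Literals containing the whole query as a case-insensitive substring."""
    needle = query.lower()
    return [s for s, tag in literals
            if needle in s.lower() and _lang_ok(tag, langtag)]


def terms(query: str, literals: Sequence[Literal],
          langtag: Optional[str] = None) -> List[str]:
    """Literals sharing at least one term with the query (assumed semantics)."""
    wanted = set(_tokens(query))
    return [s for s, tag in literals
            if wanted & set(_tokens(s)) and _lang_ok(tag, langtag)]


literals = [
    ("The Harvard Environmental Law Review", "en"),
    ("Harvard University", "en"),
    ("Negen straatjes (Amsterdam), 9 straatjes", "nl"),
]
print(phrase("Harvard Environmental Law Review", literals))
print(terms("HELR. Harvard ELR Environmental Law Review", literals))
print(phrase("straatjes", literals, "en"))
```

In this model `terms` trades precision for recall: it also returns "Harvard University", which `phrase` rejects.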
LOTUS v1.0 statistics
12,018,939,378 literals
5,319,790,836 natural text literals
474.77 GB disk
56 hours
Preliminary Evaluation
191 local monuments, manually extracted from Dutch tour guide
List of 231 scientific journals from a Norwegian Social Sciences Data Services website
Preliminary Evaluation
Text queries for which we find at least one resource
Lookup | Local monuments | Scientific journals | Overall %
Number of queries | 191 | 231 |
in DBpedia (via SPARQL) | 53 | 77 | 30.8%
in DBpedia (via LOTUS phrase) | 165 | 182 | 82.2%
in LOD (via LOTUS phrase) | 168 | 216 | 91.0%
in LOD (via LOTUS terms) | 188 | 231 | 99.3%
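The "Overall %" figures appear to be the number of queries with at least one result, summed over both query sets, divided by all 422 queries (191 + 231). The short check below reproduces the reported percentages under that reading; the variable and function names are chosen for this example only.

```python
from typing import Dict, Tuple

# Number of text queries per set: (local monuments, scientific journals).
TOTALS: Tuple[int, int] = (191, 231)

# Queries with at least one result, per lookup method, as reported on the slide.
FOUND: Dict[str, Tuple[int, int]] = {
    "DBpedia (via SPARQL)": (53, 77),
    "DBpedia (via LOTUS phrase)": (165, 182),
    "LOD (via LOTUS phrase)": (168, 216),
    "LOD (via LOTUS terms)": (188, 231),
}


def overall_percentage(found: Tuple[int, int], totals: Tuple[int, int]) -> float:
    """Share of all queries with at least one result, rounded to one decimal."""
    return round(100 * sum(found) / sum(totals), 1)


for method, found in FOUND.items():
    print(f"{method}: {overall_percentage(found, TOTALS)}%")
```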
LOTUS v1.0
Start towards a natural text index over the LOD cloud
5.3B indexed literals can be looked up
Query modes for approximate matching
Accessible through web frontend and API
Current work (LOTUS v1.1)
Add langtags through automatic language detection
Extract knowledge base information from URIs
Extract meaning of formatting convention from URIs
Add conjunctive & fuzzy query modes
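What conjunctive and fuzzy matching could look like is sketched below. The slides do not define either mode, so the semantics here are assumptions: `conjunctive` requires every query term to occur in the literal in any order, and `fuzzy` requires every query term to be similar to some term of the literal, using `difflib.SequenceMatcher` from the Python standard library with an arbitrarily chosen threshold of 0.8.

```python
import re
from difflib import SequenceMatcher
from typing import List, Sequence


def _tokens(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def conjunctive(query: str, literals: Sequence[str]) -> List[str]:
    """Literals that contain every query term, in any order (assumed semantics)."""
    wanted = set(_tokens(query))
    return [lit for lit in literals
            if wanted and wanted <= set(_tokens(lit))]


def fuzzy(query: str, literals: Sequence[str], threshold: float = 0.8) -> List[str]:
    """Literals where each query term resembles some literal term.

    Similarity is SequenceMatcher.ratio(); the threshold is an illustrative value.
    """
    def close(a: str, b: str) -> bool:
        return SequenceMatcher(None, a, b).ratio() >= threshold

    wanted = _tokens(query)
    hits = []
    for lit in literals:
        lit_terms = _tokens(lit)
        if wanted and all(any(close(q, t) for t in lit_terms) for q in wanted):
            hits.append(lit)
    return hits


literals = ["Harvard Environmental Law Review", "Harvard University"]
print(conjunctive("Law Harvard", literals))
print(fuzzy("Havard Enviromental", literals))
```

Both misspelled query terms in the last call still reach the journal title, which exact or substring matching would miss.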
Future work
Evaluation of precision
◎ task-specific (IR, NED)
Integration of structured and unstructured data
Relevance and ranking
http://lotus.lodlaundromat.org
Thanks! We would love to hear your comments and suggestions!
LOD Laundromat cleans and republishes LOD, making it reachable via a single access point.
LOTUS finds LOD Laundromat resources based on natural text.
Appendices
LOTUS vs Sindice
Sindice | LOTUS
Relate URIs and literals to documents | Relate URIs, literals and documents to each other
Accepts URIs which can be dereferenceable or have a SPARQL endpoint | Accepts any type of data
Partially incorrect datasets are excluded | Partially incorrect datasets are included
Relies on original URI availability | Original URI can be ‘down’
30M URIs & 45M literals | 3,700M URIs & 5,320M literals
You will have a bad time finding these via SPARQL
“National Socialist German Workers' Party Foreign Organisation”
“The NSDAP/AO was the Foreign Organization of the National Socialist German Workers Party (NSDAP).”
“De 9 straatjes”
“Negen straatjes (Amsterdam), 9 straatjes”@nl
“Shopping guide: negen straatjes”@nl-NL
You will have a bad time finding these in SPARQL
"1375 W Lake Street"
"1501 W. Randolph St."
"29 North 7th Street"
"Fritz-Pregl-Str. 5"@en
"33-35 Stoke Newington Road"
"Trompsingel 27"
"226 Broadway, 2nd Floor"
"Shinbo Building, 402-22, B1 Seogyo-dong, Mapo-gu"
Preliminary Evaluation (recall)
% of DBpedia resources in top 100 results
Query mode | Local monuments | Scientific journals
LOTUS phrase | 70.48% | 24.83%
LOTUS terms | 67.19% | 22.33%
Preliminary Evaluation
Measured recall of:
◎ NIL entities from CoNLL/AIDA
◎ Local monuments
◎ List of scientific journals
LOTUS v1.1
Index fields: object + predicate + subject string, user + auto langtag, document ID, predicate, subject
Query modes: term-based query, phrase-based query, conjunctive query, fuzzy query (+ language tag matching)