Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Date post: 07-Jan-2017
Upload: filip-ilievski
Page 1: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

LOTUS: Linked Open Text UnleaShed

Page 2: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Authors

Marieke van Erp
Stefan Schlobach
Wouter Beek
Filip Ilievski
Laurens Rietveld

Page 3: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Consuming LD

Finding relevant LD resources based on natural text

Central for application areas: Information Retrieval, Named Entity Linking

“central indices (e.g. Sindice) have disappeared”

Page 4: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Let’s play a game ...

Find Linked Open Data resources for:

“HELR : The Harvard Environmental Law Review”

Page 5: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

How do we find relevant resources on the Semantic Web today?

1) Dereference
● literals are not dereferenceable by definition

2) SPARQL endpoints
● find resources only in an explicitly stated set of datasets
● exact or substring/regex matching
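To make the limits of option 2 concrete, here is a minimal sketch of the kind of query one has to write per endpoint. The helper names and the choice of `rdfs:label` as the only predicate are illustrative assumptions, not something the slides prescribe. Each query still has to be sent to one explicitly chosen endpoint at a time.

```python
def escape_sparql_string(text: str) -> str:
    """Escape backslashes and double quotes for use in a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_label_query(text: str, exact: bool = True, limit: int = 10) -> str:
    """Build a SPARQL SELECT that looks up resources by label (hypothetical helper).

    exact=True  -> the label must equal the text
    exact=False -> the label must contain the text (case-insensitive substring)
    """
    literal = escape_sparql_string(text)
    if exact:
        condition = f'FILTER(STR(?label) = "{literal}")'
    else:
        # CONTAINS, LCASE and STR are standard SPARQL 1.1 functions.
        condition = f'FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{literal}")))'
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?resource ?label WHERE {\n"
        "  ?resource rdfs:label ?label .\n"
        f"  {condition}\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(build_label_query("Harvard Environmental Law Review", exact=False))
```

Neither variant finds a resource whose label is spelled, abbreviated, or ordered differently from the query text, which is the gap that approximate text matching is meant to close.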

Page 6: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Sometimes we do this ...

Page 7: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

“COLD means using SPARQL on centralized endpoints”

Page 8: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Summarising: Findability is a problem on SW today

We need:
● a single entry point to the Linked Open Data cloud
● to find resources based on approximate text matching

Page 9: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Towards the Findability problem of the SW

We need:
● a single entry point to the Linked Open Data → LOD Laundromat
● to find resources based on approximate text matching → LOTUS

Page 10: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

#1 LOD Laundromat

Infrastructure that washes other people’s dirty data and republishes it as RDF

Central entry point to the Linked Open Data cloud

Page 11: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

#1 LOD Laundromat

657,885 documents

38,606,408,433 statements

... can be simultaneously queried from the Laundromat Wardrobe

Page 12: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

#2 LOTUS

Full-text lookup index over LOD Laundromat

Finds resources based on associated natural text

Inspired by application areas: IR and NED

Page 13: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

LOTUS’ approach

Text2Literal mapping (and onwards to documents and resources)

for described resources (with at least one associated literal)

that contain natural text (numbers and dates are not findable)

through a rich string approximation model (substring, phonetic, synonym matching, TF-IDF scoring, match granularity).
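As an illustration of the TF-IDF scoring named in the approximation model, the following is a minimal in-memory sketch. It is not the LOTUS implementation: the tokenizer, the smoothed IDF formula, and the function names are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_search(query, literals):
    """Rank literals against a query using a simple TF-IDF score.

    Returns (score, literal) pairs, best match first. Literals that
    share no term with the query are left out.
    """
    docs = [Counter(tokenize(lit)) for lit in literals]
    n_docs = len(docs)

    # Document frequency: in how many literals does each term occur?
    df = Counter()
    for doc in docs:
        df.update(doc.keys())

    query_terms = set(tokenize(query))
    scored = []
    for literal, doc in zip(literals, docs):
        length = sum(doc.values()) or 1
        score = 0.0
        for term in query_terms:
            if term in doc:
                tf = doc[term] / length
                # Smoothed IDF (an assumption): rare terms weigh more.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += tf * idf
        if score > 0:
            scored.append((score, literal))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


if __name__ == "__main__":
    sample = [
        "Harvard Environmental Law Review",
        "Harvard University",
        "Environmental Science Journal",
    ]
    for score, literal in tfidf_search("Harvard Environmental Law Review", sample):
        print(f"{score:.3f}  {literal}")
```

In this toy ranking the literal that covers all query terms scores highest, while literals sharing only a common term such as "Harvard" fall lower.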

Page 14: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Implementation - index builder

Page 15: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Implementation - index builder

Page 16: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Index fields

● subject
● predicate
● object string
● user langtag
● document ID
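The five index fields can be pictured as one record per literal-bearing statement. The sketch below is an assumption-laden illustration: the class name, the `make_entry` helper, the example URIs and document ID, and the natural-text heuristic are all hypothetical. The heuristic only mirrors the slide's statement that numbers and dates are not findable.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexEntry:
    """One indexed literal, using the five fields listed on the slide."""
    subject: str
    predicate: str
    string: str              # the object string (lexical form of the literal)
    langtag: Optional[str]   # user-supplied language tag, if any
    doc_id: str              # identifier of the source document


# Strings made only of digits, whitespace and date/time punctuation.
_NUMERIC_OR_DATE = re.compile(r"^[\d\s.,:/+\-TZ]*$")


def is_natural_text(value: str) -> bool:
    """Heuristic (an assumption): has a letter and is not a number or date."""
    has_letter = any(ch.isalpha() for ch in value)
    return has_letter and not _NUMERIC_OR_DATE.match(value)


def make_entry(subject, predicate, obj_string, langtag, doc_id):
    """Return an IndexEntry, or None when the literal is not natural text."""
    if not is_natural_text(obj_string):
        return None
    return IndexEntry(subject, predicate, obj_string, langtag, doc_id)


if __name__ == "__main__":
    print(make_entry(
        "http://example.org/helr",
        "http://www.w3.org/2000/01/rdf-schema#label",
        "Harvard Environmental Law Review",
        "en",
        "doc-0001",
    ))
```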

Page 17: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Implementation - Public Interface

Page 18: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Implementation - Public Interface

Page 19: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Implementation - Public Interface

Page 20: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Query modes

PHRASE: substring matching
phrase(“Harvard Environmental Law Review”)

TERMS: lookup a set of terms
terms(“HELR. Harvard ELR Environmental Law Review”)

* optionally, supply a langtag:
phrase(“Harvard Environmental Law Review”, “en”)
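The difference between the two query modes can be sketched over a small in-memory list. This is not the LOTUS API: the sample literals are invented, and treating `terms` as "match if any term occurs" is an assumption, since the slides only say it looks up a set of terms.

```python
import re

# Hypothetical sample of (literal, langtag) pairs; None means no language tag.
LITERALS = [
    ("Harvard Environmental Law Review", "en"),
    ("HELR", None),
    ("Harvard Law School", "en"),
    ("Revue de droit", "fr"),
]


def _tokens(text):
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def _lang_ok(langtag, wanted):
    """Accept every literal when no langtag is requested."""
    return wanted is None or langtag == wanted


def phrase(query, langtag=None, literals=LITERALS):
    """PHRASE mode: the whole query must occur as a substring."""
    needle = query.lower()
    return [text for text, lang in literals
            if _lang_ok(lang, langtag) and needle in text.lower()]


def terms(query, langtag=None, literals=LITERALS):
    """TERMS mode (assumed disjunctive): at least one query term must occur."""
    wanted = _tokens(query)
    return [text for text, lang in literals
            if _lang_ok(lang, langtag) and wanted & _tokens(text)]


if __name__ == "__main__":
    print(phrase("Harvard Environmental Law Review"))
    print(terms("HELR. Harvard ELR Environmental Law Review"))
    print(phrase("Harvard Environmental Law Review", "en"))
```

Under these assumptions the phrase query returns only the full title, while the terms query also reaches the abbreviation "HELR" and other literals that share a single term.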

Page 21: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

LOTUS v1.0 statistics

12,018,939,378 literals

5,319,790,836 natural text literals

474.77 GB disk

56 hours

Page 22: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Preliminary Evaluation

191 local monuments, manually extracted from a Dutch tour guide

List of 231 scientific journals from a Norwegian Social Sciences Data Services website

Page 27: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Preliminary Evaluation

Text queries for which we find at least one resource

Lookup | Local Monuments | Scientific journals | Overall %
Total queries | 191 | 231 |
in DBpedia (via SPARQL) | 53 | 77 | 30.8%
in DBpedia (via LOTUS phrase) | 165 | 182 | 82.2%
in LOD (via LOTUS phrase) | 168 | 216 | 91.0%
in LOD (via LOTUS terms) | 188 | 231 | 99.3%
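The Overall % column follows directly from the per-dataset counts: matched queries from both lists divided by the 422 queries in total. The short check below uses only the numbers from the slide. The variable and function names are illustrative.

```python
# Number of text queries per dataset, as reported on the slide.
TOTALS = {"monuments": 191, "journals": 231}

# Queries with at least one result: (local monuments, scientific journals).
FOUND = {
    "DBpedia (via SPARQL)": (53, 77),
    "DBpedia (via LOTUS phrase)": (165, 182),
    "LOD (via LOTUS phrase)": (168, 216),
    "LOD (via LOTUS terms)": (188, 231),
}


def overall_percent(found_monuments, found_journals):
    """Share of all queries, across both lists, with at least one result."""
    total = sum(TOTALS.values())  # 191 + 231 = 422
    return round(100 * (found_monuments + found_journals) / total, 1)


if __name__ == "__main__":
    for lookup, (monuments, journals) in FOUND.items():
        print(f"{lookup}: {overall_percent(monuments, journals)}%")
```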

Page 28: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

LOTUS v1.0

Start towards a natural text index over the LOD cloud

5.3B indexed literals can be looked up

Query modes for approximate matching

Accessible through web frontend and API

Page 29: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Current work (LOTUS v1.1)

Add langtags through automatic language detection

Extract knowledge base information from URIs

Extract meaning of formatting convention from URIs

Add conjunctive & fuzzy query modes

Page 30: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Future work

Evaluation of precision
◎ task-specific (IR, NED)

Integration of structured and unstructured data

Relevance and ranking

Page 31: Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Page 32: Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Page 33: Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Page 34: Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Page 35: Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Page 36: Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Page 38: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Thanks! We would love to hear your comments and suggestions!

Page 39: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

http://lotus.lodlaundromat.org

LOD Laundromat cleans and republishes LOD, making it reachable via a single access point.

LOTUS finds LOD Laundromat resources based on natural text.

Page 41: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Appendices

Page 42: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

LOTUS vs Sindice

Sindice | LOTUS
Relate URIs and literals to documents | Relate URIs, literals and documents to each other
Accepts URIs that are dereferenceable or have a SPARQL endpoint | Accepts any type of data
Partially incorrect datasets are excluded | Partially incorrect datasets are included
Relies on original URI availability | Original URI can be ‘down’
30M URIs & 45M literals | 3,700M URIs & 5,320M literals

Page 43: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

You will have a bad time finding these via SPARQL

“National Socialist German Workers' Party Foreign Organisation”
“The NSDAP/AO was the Foreign Organization of the National Socialist German Workers Party (NSDAP).”

“De 9 straatjes”
“Negen straatjes (Amsterdam), 9 straatjes”@nl
“Shopping guide: negen straatjes”@nl-NL

Page 44: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

You will have a bad time finding these in SPARQL

"1375 W Lake Street"
"1501 W. Randolph St."
"29 North 7th Street"
"Fritz-Pregl-Str. 5"@en
"33-35 Stoke Newington Road"
"Trompsingel 27"
"226 Broadway, 2nd Floor"
"Shinbo Building, 402-22, B1 Seogyo-dong, Mapo-gu"

Page 45: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Preliminary Evaluation (recall)

% of DBpedia resources in top 100 results

Query mode | Local Monuments | Scientific journals
LOTUS phrase | 70.48% | 24.83%
LOTUS terms | 67.19% | 22.33%

Page 46: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

Preliminary Evaluation

Measured recall of:
◎ NIL entities from CoNLL/AIDA
◎ Local monuments
◎ List of scientific journals

Page 47: Lotus: Linked Open Text UnleaShed - ISWC COLD '15

LOTUS v1.1

Index fields
● object + predicate + subject string
● user + auto langtag
● document ID
● predicate
● subject

Query modes
● Term-based query
● Phrase-based query
● Conjunctive query
● Fuzzy query
● + language tag matching
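The slides list conjunctive and fuzzy query modes for v1.1 without defining them. One plausible reading is sketched below: conjunctive means every query term must occur, and fuzzy means every query term must be close to some token in the literal. The similarity measure (`difflib.SequenceMatcher`) and the 0.8 threshold are assumptions made for this illustration, not details from the talk.

```python
import re
from difflib import SequenceMatcher


def _tokens(text):
    """Lowercased word tokens, in order of appearance."""
    return re.findall(r"\w+", text.lower())


def conjunctive_match(query, literal):
    """True when the query has terms and every one of them occurs in the literal."""
    query_terms = set(_tokens(query))
    return bool(query_terms) and query_terms <= set(_tokens(literal))


def fuzzy_match(query, literal, threshold=0.8):
    """True when every query term is similar enough to some literal token.

    Similarity is difflib's ratio in [0, 1]; the threshold is an assumption.
    """
    literal_tokens = _tokens(literal)

    def close(term):
        return any(
            SequenceMatcher(None, term, token).ratio() >= threshold
            for token in literal_tokens
        )

    query_terms = _tokens(query)
    return bool(query_terms) and all(close(term) for term in query_terms)


if __name__ == "__main__":
    title = "Harvard Environmental Law Review"
    print(conjunctive_match("law review", title))
    print(conjunctive_match("law journal", title))
    print(fuzzy_match("Harvrd Enviromental", title))
```

A fuzzy mode of this kind tolerates misspellings such as "Enviromental", which neither phrase nor term lookup would match.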

