
Gerhard Weikum Max Planck Institute for Informatics mpi-inf.mpg.de/~weikum

Date post: 06-Jan-2016
Page 1: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Gerhard Weikum
Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/

Knowledge Harvesting from Text and Web Sources

Part 3: Knowledge Linking

Page 2: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Quiz Time

How many days do you need to visit all Shangri-La places on this planet?

Answer: 365

Source: geonames.org

3-2

Page 4: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Linked Data: RDF Triples on the Web

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

30 billion triples, 500 million links

3-4

Page 5: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Linked RDF Triples on the Web

[Figure: RDF graph linking entities across Web data sources]

Nodes:
• dbpedia.org/resource/Ennio_Morricone
• imdb.com/name/nm0910607/
• imdb.com/title/tt0361748/
• yago/wikicategory:ItalianComposer
• yago/wordnet:Artist109812338
• yago/wordnet:Actor109765278
• dbpedia.org/resource/Rome
• rdf.freebase.com/ns/en.rome
• data.nytimes.com/51688803696189142301
• geonames.org/5134301/city_of_rome (Coord: N 43° 12' 46'' W 75° 27' 20'')

Edge labels: owl:sameAs (three links), dbpprop:citizenOf, rdf:type (twice), rdf:subclassOf (twice), prop:actedIn, prop:composedMusicFor

3-5

Page 6: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Linked RDF Triples on the Web

[Figure: the RDF graph of the previous slide, with the Freebase node now rdf.freebase.com/ns/en.rome_ny and question marks attached to the owl:sameAs links]

Referential data quality?
• Hand-crafted sameAs links?
• Generated sameAs links?

3-6

Page 7: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

RDF Entities on the Web http://sig.ma

3-7

Page 9: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Entity-Name Ambiguity http://sameas.org

3-9

Page 10: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Entities in HTML http://sindice.com

3-10

Page 11: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Entity Markup in HTML: Towards Standardized Microformats

http://schema.org/

3-11

Page 13: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Web Page in Standard HTML

http://schema.org/

Jane Doe
<img src="janedoe.jpg" />
Professor
20341 Whitworth Institute
405 Whitworth
Seattle WA 98052
(425) 123-4567
<a href="mailto:[email protected]">[email protected]</a>
Jane's home page:
<a href="http://www.janedoe.com">janedoe.com</a>
Graduate students:
<a href="http://www.xyz.edu/students/alicejones.html">Alice Jones</a>
<a href="http://www.xyz.edu/students/bobsmith.html">Bob Smith</a>

3-13

Page 14: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Web Page in HTML with Microdata

<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <img src="janedoe.jpg" itemprop="image" />
  <span itemprop="jobTitle">Professor</span>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">
      20341 Whitworth Institute
      405 N. Whitworth
    </span>
    <span itemprop="addressLocality">Seattle</span>,
    <span itemprop="addressRegion">WA</span>
    <span itemprop="postalCode">98052</span>
  </div>
  <span itemprop="telephone">(425) 123-4567</span>
  <a href="mailto:[email protected]" itemprop="email">
    [email protected]</a>
  Jane's home page:
  <a href="http://www.janedoe.com" itemprop="url">janedoe.com</a>
  Graduate students:
  <a href="http://www.xyz.edu/students/alicejones.html" itemprop="colleague">
    Alice Jones</a>
  <a href="http://www.xyz.edu/students/bobsmith.html" itemprop="colleague">
    Bob Smith</a>
</div>
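To illustrate why the microdata attributes matter, here is a minimal sketch of how a consumer could pull (property, value) pairs out of such a page using only Python's standard `html.parser`. The class `ItempropCollector`, the helper `extract_itemprops`, and the shortened sample page are illustrative assumptions; a real microdata processor also handles nested items, `itemref`, and more value types.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "meta", "link", "input"}

class ItempropCollector(HTMLParser):
    """Collects (itemprop, value) pairs; nested itemscope items are flattened."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._stack = []  # one (itemprop or None, text parts) entry per open tag

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        prop = a.get("itemprop")
        if tag in VOID_TAGS:
            if prop:  # e.g. <img>: the value is the src attribute
                self.pairs.append((prop, a.get("src")))
            return
        if prop and tag == "a":  # links: the value is the href attribute
            self.pairs.append((prop, a.get("href")))
            prop = None
        if "itemscope" in a:  # nested item: report its inner properties instead
            prop = None
        self._stack.append((prop, []))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        prop, parts = self._stack.pop()
        if prop:  # text-valued property: normalize whitespace
            self.pairs.append((prop, " ".join("".join(parts).split())))

    def handle_data(self, data):
        if self._stack and self._stack[-1][0]:
            self._stack[-1][1].append(data)

def extract_itemprops(html):
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs

SAMPLE = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <img src="janedoe.jpg" itemprop="image" />
  <span itemprop="jobTitle">Professor</span>
  <a href="http://www.janedoe.com" itemprop="url">janedoe.com</a>
</div>
"""

for prop, value in extract_itemprops(SAMPLE):
    print(prop, "=", value)
```

The plain HTML version of the page yields nothing here, since it carries no `itemprop` attributes; that is the gap the markup closes.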

http://schema.org/

3-14

Page 15: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Web-of-Data vs. Web-of-Contents

3-15

Critical for knowledge linkage: entity name ambiguity

more structured data combined with text boosted by knowledge harvesting methods

Page 16: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Embedding RDFa in Web Contents

May 2, 2011

Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th.

<html … May 2, 2011
<div typeof=event:music>
  <span id="Maestro_Morricone">Maestro Morricone
    <a rel="sameAs" resource="dbpedia…/Ennio_Morricone "/></span>
  …
  <span property="event:location">Smetana Hall</span>
  …
  <span property="rdf:type" resource="yago:performance">The concert</span> will feature …
  <span property="event:date" content="14-07-2011"></span>July 1
</div>

RDF data and Web contents need to be interconnected
RDFa & microformats provide the mechanism

Need ways of creating more embedded RDF triples!

3-16

Page 17: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Outline

...

Motivation

Entity-Name Disambiguation

Mapping Questions into Queries

Entity Linkage

Wrap-up

3-17

Page 18: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Named-Entity Disambiguation

Harry fought with you know who. He defeats the dark lord.

Three NLP tasks:

1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger)

2) co-reference resolution: link to preceding NP (trained classifier over linguistic features)

3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB)

Candidate entities: Harry Potter, Dirty Harry, Prince Harry of England, Lord Voldemort, The Who (band)

3-18

Page 19: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Named Entity Disambiguation

Input text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."

KB: mentions (surface names) and their candidate entities (meanings):
• Sergio means Sergio_Leone or Serge_Gainsbourg
• Ennio means Ennio_Antonelli or Ennio_Morricone
• Eli means Eli_(bible), ExtremeLightInfrastructure, or Eli_Wallach
• Ecstasy means Ecstasy_(drug) or Ecstasy_of_Gold
• trilogy means Star_Wars_Trilogy, Lord_of_the_Rings, or Dollars_Trilogy
• …

Which entity does each mention refer to?

[Figure: mentions linked to entity nodes: Eli (bible), Eli Wallach, Dollars Trilogy, Lord of the Rings, Star Wars Trilogy, Benny Andersson, Benny Goodman, Ecstasy of Gold, Ecstasy (drug)]

3-19

Page 20: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Mention-Entity Graph

Input text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."

Candidate entities (from KB+Stats):
• Eli: Eli (bible), Eli Wallach
• Ecstasy: Ecstasy of Gold, Ecstasy (drug)
• trilogy: Dollars Trilogy, Lord of the Rings, Star Wars

Weighted undirected graph with two types of nodes (mentions and entities)

Popularity(m,e):
• freq(e|m)
• length(e)
• #links(e)

Similarity(m,e):
• cos/Dice/KL (context(m), context(e))
• contexts as bag-of-words or language model: words, bigrams, phrases
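As a rough illustration of the Similarity(m,e) edge weight, the sketch below computes the cosine between bag-of-words contexts. The token lists for the mention context and for the two candidate entities are made-up toy data, not taken from any KB; real systems would use weighted keyphrases or language models as listed above.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words contexts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Context of the mention "Eli" (stop words removed by hand).
context_m = ["sergio", "ennio", "eli", "role", "ecstasy", "scene",
             "graveyard", "highlight", "trilogy", "western", "films"]

# Hypothetical entity contexts, for illustration only.
entity_contexts = {
    "Eli Wallach": ["american", "actor", "western", "films", "role", "tuco"],
    "Eli (bible)": ["high", "priest", "shiloh", "books", "samuel"],
}

for entity, ctx in entity_contexts.items():
    print(entity, round(cosine(context_m, ctx), 3))
```

With these toy contexts the actor scores higher than the biblical figure because three context words overlap; Dice or KL divergence could be swapped in for `cosine` without changing the surrounding logic.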

3-20

Page 21: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Mention-Entity Graph

Input text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."

Candidate entities (from KB+Stats):
• Eli: Eli (bible), Eli Wallach
• Ecstasy: Ecstasy of Gold, Ecstasy (drug)
• trilogy: Dollars Trilogy, Lord of the Rings, Star Wars

Weighted undirected graph with two types of nodes (mentions and entities)

Popularity(m,e):
• freq(e|m)
• length(e)
• #links(e)

Similarity(m,e):
• cos/Dice/KL (context(m), context(e))

Joint mapping of all mentions

3-21

Page 22: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Mention-Entity Graph

Input text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."

Candidate entities (from KB+Stats):
• Eli: Eli (bible), Eli Wallach
• Ecstasy: Ecstasy of Gold, Ecstasy (drug)
• trilogy: Dollars Trilogy, Lord of the Rings, Star Wars

Weighted undirected graph with two types of nodes (mentions and entities)

Popularity(m,e):
• freq(m,e|m)
• length(e)
• #links(e)

Similarity(m,e):
• cos/Dice/KL (context(m), context(e))

Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)

3-22

Page 23: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Mention-Entity Graph

Input text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."

Candidate entities (from KB+Stats): Eli (bible), Eli Wallach, Ecstasy of Gold, Ecstasy (drug), Dollars Trilogy, Lord of the Rings, Star Wars

Weighted undirected graph with two types of nodes (mentions and entities)

Popularity(m,e):
• freq(m,e|m)
• length(e)
• #links(e)

Similarity(m,e):
• cos/Dice/KL (context(m), context(e))

Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)

Types of the coherent entities (basis for dist(types)):
• Eli Wallach: American Jews, film actors, artists, Academy Award winners
• Ecstasy of Gold: Metallica songs, Ennio Morricone songs, artifacts, soundtrack music
• Dollars Trilogy: spaghetti westerns, film trilogies, movies, artifacts

3-23

Page 24: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Mention-Entity Graph

Input text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."

Candidate entities (from KB+Stats): Eli (bible), Eli Wallach, Ecstasy of Gold, Ecstasy (drug), Dollars Trilogy, Lord of the Rings, Star Wars

Weighted undirected graph with two types of nodes (mentions and entities)

Popularity(m,e):
• freq(m,e|m)
• length(e)
• #links(e)

Similarity(m,e):
• cos/Dice/KL (context(m), context(e))

Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)

Links of the coherent entities (basis for overlap(links)):
• Eli Wallach: http://.../wiki/Dollars_Trilogy, http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/Clint_Eastwood, http://.../wiki/Honorary_Academy_Award
• Ecstasy of Gold: http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/Metallica, http://.../wiki/Bellagio_(casino), http://.../wiki/Ennio_Morricone
• Dollars Trilogy: http://.../wiki/Sergio_Leone, http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/For_a_Few_Dollars_More, http://.../wiki/Ennio_Morricone

3-24

Page 25: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Mention-Entity Graph

Input text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."

Candidate entities (from KB+Stats): Eli (bible), Eli Wallach, Ecstasy of Gold, Ecstasy (drug), Dollars Trilogy, Lord of the Rings, Star Wars

Weighted undirected graph with two types of nodes (mentions and entities)

Popularity(m,e):
• freq(m,e|m)
• length(e)
• #links(e)

Similarity(m,e):
• cos/Dice/KL (context(m), context(e))

Coherence(e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)

Anchor words of the coherent entities (basis for overlap(anchor words)):
• Ecstasy of Gold: Metallica on Morricone tribute, Bellagio water fountain show, Yo-Yo Ma, Ennio Morricone composition
• Eli Wallach: The Magnificent Seven, The Good, the Bad, and the Ugly, Clint Eastwood, University of Texas at Austin
• Dollars Trilogy: For a Few Dollars More, The Good, the Bad, and the Ugly, Man with No Name trilogy, soundtrack by Ennio Morricone

3-25

Page 26: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Different Approaches

Combine Popularity, Similarity, and Coherence Features (Cucerzan: EMNLP'07, Milne/Witten: CIKM'08):
• for sim (context(m), context(e)): consider surrounding mentions and their candidate entities
• use their types, links, anchors as features of context(m)
• set m-e edge weights accordingly
• use greedy methods for solution

Collective Learning with Probabilistic Factor Graphs (Chakrabarti et al.: KDD'09):
• model P[m|e] by similarity and P[e1|e2] by coherence
• consider likelihood of P[m1 … mk | e1 … ek]
• factorize by all m-e pairs and e1-e2 pairs
• use hill-climbing, LP, etc. for solution

3-26

Page 27: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Joint Mapping

• Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB
• Compute high-likelihood mapping (ML or MAP) or dense subgraph such that each m is connected to exactly one e (or at most one e)

[Figure: mention-entity graph with weighted mention-entity and entity-entity edges]
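The joint-mapping objective can be made concrete with a brute-force sketch: enumerate every way of assigning exactly one candidate entity to each mention and keep the assignment with the highest total of mention-entity weights plus pairwise coherence weights. The function name `best_joint_mapping` and all edge weights below are invented for illustration; exhaustive search is only feasible for toy inputs, which is why the approaches in this part use greedy or approximate inference.

```python
from itertools import combinations, product

def best_joint_mapping(me_weight, coherence):
    """Exhaustively pick one entity per mention, maximizing the total edge weight."""
    mentions = list(me_weight)
    best, best_score = None, float("-inf")
    for choice in product(*(me_weight[m] for m in mentions)):
        score = sum(me_weight[m][e] for m, e in zip(mentions, choice))
        score += sum(coherence.get(frozenset(pair), 0)
                     for pair in combinations(choice, 2))
        if score > best_score:
            best, best_score = dict(zip(mentions, choice)), score
    return best, best_score

# Toy mention-entity weights (popularity + similarity), illustrative only.
ME = {
    "Eli": {"Eli (bible)": 50, "Eli Wallach": 30},
    "Ecstasy": {"Ecstasy (drug)": 60, "Ecstasy of Gold": 20},
    "trilogy": {"Star Wars": 40, "Dollars Trilogy": 30},
}
# Toy entity-entity coherence weights, illustrative only.
COH = {
    frozenset({"Eli Wallach", "Ecstasy of Gold"}): 50,
    frozenset({"Eli Wallach", "Dollars Trilogy"}): 80,
    frozenset({"Ecstasy of Gold", "Dollars Trilogy"}): 90,
}

mapping, score = best_joint_mapping(ME, COH)
print(mapping, score)
```

In this toy setting each mention's most popular candidate loses: the three less popular but mutually coherent entities reach a total of 300, against 150 for the popularity-only choice.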

3-27

Page 28: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Mention-Entity Popularity Weights

Collect hyperlink anchor-text / link-target pairs from
• Wikipedia redirects
• Wikipedia links between articles
• Interwiki links between Wikipedia editions
• Web links pointing to Wikipedia articles
• …
Build statistics to estimate P[entity | name]

Need dictionary with entities' names:
• full names: Arnold Alois Schwarzenegger, Los Angeles, Microsoft Corporation
• short names: Arnold, Arnie, Mr. Schwarzenegger, New York, Microsoft, …
• nicknames & aliases: Terminator, City of Angels, Evil Empire, …
• acronyms: LA, UCLA, MS, MSFT
• role names: the Austrian action hero, Californian governor, the CEO of MS, …
… plus gender info (useful for resolving pronouns in context): Bill and Melinda met at MS. They fell in love and he kissed her.

[Milne/Witten 2008, Spitkovsky/Chang 2012]
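The statistics step can be sketched as simple relative frequencies over (anchor text, link target) pairs. The function name `estimate_prior` and the handful of pairs below are assumptions for illustration; in practice the counts come from millions of Wikipedia and Web links.

```python
from collections import Counter, defaultdict

def estimate_prior(anchor_pairs):
    """Estimate P[entity | name] as the relative frequency of link targets per anchor text."""
    counts = defaultdict(Counter)
    for name, entity in anchor_pairs:
        counts[name][entity] += 1
    return {
        name: {entity: c / sum(targets.values()) for entity, c in targets.items()}
        for name, targets in counts.items()
    }

# Invented anchor-text / link-target pairs, for illustration only.
PAIRS = (
    [("Arnie", "Arnold_Schwarzenegger")] * 3
    + [("Arnie", "Arnold_Palmer")]
    + [("LA", "Los_Angeles")] * 4
    + [("LA", "Louisiana")]
)

prior = estimate_prior(PAIRS)
print(prior["Arnie"])
print(prior["LA"])
```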

3-28

Page 29: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Mention-Entity Similarity Edges

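Precompute characteristic keyphrases q for each entity e: anchor texts or noun phrases in the page of e with high PMI:

\[ \mathrm{weight}(q,e) = \log \frac{\mathrm{freq}(q,e)}{\mathrm{freq}(q)\,\mathrm{freq}(e)} \]

Match keyphrase q of candidate e in context of mention m. The first factor captures the extent of partial matches, the second the weight of matched words:

\[ \mathrm{score}(q \mid e) \sim \frac{\#\,\text{matching words}}{\text{length of } \mathrm{cover}(q)} \left( \frac{\sum_{w \in \mathrm{cover}(q)} \mathrm{weight}(w \mid e)}{\sum_{w \in q} \mathrm{weight}(w \mid e)} \right)^{1+\gamma} \]

Example: keyphrase "Metallica tribute to Ennio Morricone" matched in the text "The Ecstasy piece was covered by Metallica on the Morricone tribute album."

Compute overall similarity of context(m) and candidate e:

\[ \mathrm{score}(e \mid m) \sim \sum_{q \,\in\, \mathrm{keyphrases}(e) \text{ in } \mathrm{context}(m)} \mathrm{score}(q)\; \mathrm{dist}(\mathrm{cover}(q), m) \]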

3-29

Page 30: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Entity-Entity Coherence Edges

Precompute overlap of incoming links for entities e1 and e2:

\[ \text{mw-coh}(e_1,e_2) \sim 1 - \frac{\log \max(|in(e_1)|, |in(e_2)|) - \log |in(e_1) \cap in(e_2)|}{\log |E| - \log \min(|in(e_1)|, |in(e_2)|)} \]

Alternatively compute overlap of anchor texts for e1 and e2:

\[ \text{ngram-coh}(e_1,e_2) \sim \frac{|ngrams(e_1) \cap ngrams(e_2)|}{|ngrams(e_1) \cup ngrams(e_2)|} \]

or overlap of keyphrases, or similarity of bag-of-words, or …

Optionally combine with type distance of e1 and e2 (e.g., Jaccard index for type instances)

For special types of e1 and e2 (locations, people, etc.) use spatial or temporal distance
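Both coherence measures are short enough to sketch directly. The function names, the clamping of negative values to 0, and the handling of an empty overlap are choices made for this sketch; the in-link sets and the total number of entities are toy values.

```python
import math

def mw_coherence(in1, in2, num_entities):
    """Link-overlap coherence from the in-link sets of two entities."""
    overlap = len(in1 & in2)
    if overlap == 0:
        return 0.0  # no shared in-links: treat as unrelated (sketch assumption)
    numerator = math.log(max(len(in1), len(in2))) - math.log(overlap)
    denominator = math.log(num_entities) - math.log(min(len(in1), len(in2)))
    if denominator == 0:
        return 1.0  # degenerate case: both sets contain every entity
    return max(0.0, 1.0 - numerator / denominator)

def ngram_coherence(ngrams1, ngrams2):
    """Jaccard overlap of the n-gram (anchor text) sets of two entities."""
    union = ngrams1 | ngrams2
    return len(ngrams1 & ngrams2) / len(union) if union else 0.0

# Toy in-link sets (page ids) in a KB with 100 entities.
in_e1 = {1, 2, 3, 4}
in_e2 = {3, 4, 5, 6}
print(mw_coherence(in_e1, in_e2, 100))
print(ngram_coherence({"a", "b", "c"}, {"b", "c", "d"}))
```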

3-30

Page 31: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Coherence Graph Algorithm

• Compute dense subgraph to maximize min weighted degree among entity nodes, such that each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search

[J. Hoffart et al.: EMNLP'11]

[Figure: mention-entity graph with edge weights; node values shown: 140, 180, 50, 470, 145, 230]
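The greedy approximation can be sketched as follows. This is a simplified reading of the bullets above, not the published algorithm: it repeatedly removes the entity with the smallest weighted degree, never removes a mention's last remaining candidate, and omits the bookkeeping of alternative solutions and the final local search. All names and weights are illustrative.

```python
def weighted_degree(entity, me, ee, alive):
    """Sum of mention-entity edges plus coherence edges to entities still in the graph."""
    degree = sum(cands[entity] for cands in me.values() if entity in cands)
    degree += sum(w for pair, w in ee.items() if entity in pair and pair <= alive)
    return degree

def greedy_disambiguate(me, ee):
    cands = {m: set(es) for m, es in me.items()}
    alive = set().union(*cands.values())
    while True:
        alive = set().union(*cands.values())
        # An entity may go only if no mention would lose its last candidate.
        removable = [e for e in alive
                     if all(len(c) > 1 or e not in c for c in cands.values())]
        if not removable:
            break
        weakest = min(removable, key=lambda e: (weighted_degree(e, me, ee, alive), e))
        for c in cands.values():
            c.discard(weakest)
    # If a mention still has several candidates, keep the strongest one.
    return {m: max(sorted(c), key=lambda e: weighted_degree(e, me, ee, alive))
            for m, c in cands.items()}

# Toy weights, illustrative only.
ME = {
    "Eli": {"Eli (bible)": 50, "Eli Wallach": 30},
    "Ecstasy": {"Ecstasy (drug)": 60, "Ecstasy of Gold": 20},
    "trilogy": {"Star Wars": 40, "Dollars Trilogy": 30},
}
EE = {
    frozenset({"Eli Wallach", "Ecstasy of Gold"}): 50,
    frozenset({"Eli Wallach", "Dollars Trilogy"}): 80,
    frozenset({"Ecstasy of Gold", "Dollars Trilogy"}): 90,
}

print(greedy_disambiguate(ME, EE))
```

On this toy graph the removal order is Star Wars, Eli (bible), Ecstasy (drug), which leaves the coherent film-related entities.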

3-31

Page 32: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Coherence Graph Algorithm

• Compute dense subgraph to maximize min weighted degree among entity nodes, such that each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search

[J. Hoffart et al.: EMNLP'11]

[Figure: graph after a removal step; node values shown: 140, 180, 50, 470, 145, 230 and 140, 170, 470, 145, 210]

3-32

Page 33: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Coherence Graph Algorithm

• Compute dense subgraph to maximize min weighted degree among entity nodes, such that each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search

[J. Hoffart et al.: EMNLP'11]

[Figure: graph after a further removal step; node values shown: 140, 170, 460, 145, 210 and 120, 460, 145, 210]

3-33

Page 34: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Coherence Graph Algorithm

• Compute dense subgraph to maximize min weighted degree among entity nodes, such that each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search

[J. Hoffart et al.: EMNLP'11]

[Figure: remaining graph; node values shown: 120, 380, 145, 210]

3-34

Page 35: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Alternative: Random Walks

• for each mention run random walks with restart (like personalized PageRank with jumps to start mention(s))
• rank candidate entities by stationary visiting probability
• very efficient, decent accuracy

[Figure: mention-entity graph with edge weights and the transition probabilities derived from them]
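A random walk with restart can be sketched with plain power iteration. The graph, its weights, the restart probability of 0.15, and the fixed number of iterations are all assumptions of this sketch; transition probabilities are edge weights normalized per node, and the restart always jumps back to the start mention.

```python
def build_graph(edges):
    """Undirected weighted graph as nested dicts: graph[u][v] = weight."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

def random_walk_with_restart(graph, start, restart=0.15, iterations=100):
    """Approximate the stationary visiting probabilities by power iteration."""
    prob = {node: 0.0 for node in graph}
    prob[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for u, neighbors in graph.items():
            total = sum(neighbors.values())
            for v, w in neighbors.items():
                nxt[v] += (1.0 - restart) * prob[u] * w / total
        nxt[start] += restart  # restart mass (the probabilities sum to 1)
        prob = nxt
    return prob

# Toy graph: nodes prefixed "m:" are mentions, the rest are entities.
GRAPH = build_graph([
    ("m:Ecstasy", "Ecstasy (drug)", 60),
    ("m:Ecstasy", "Ecstasy of Gold", 20),
    ("m:Eli", "Eli Wallach", 30),
    ("m:trilogy", "Dollars Trilogy", 30),
    ("Ecstasy of Gold", "Dollars Trilogy", 90),
    ("Ecstasy of Gold", "Eli Wallach", 50),
    ("Dollars Trilogy", "Eli Wallach", 80),
])

visit = random_walk_with_restart(GRAPH, "m:Ecstasy")
candidates = ["Ecstasy (drug)", "Ecstasy of Gold"]
for entity in sorted(candidates, key=visit.get, reverse=True):
    print(entity, round(visit[entity], 3))
```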

3-35

Page 36: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

AIDA: Accurate Online Disambiguation

http://www.mpi-inf.mpg.de/yago-naga/aida/

3-36

Page 38: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

http://www.mpi-inf.mpg.de/yago-naga/aida/

AIDA: Very Difficult Example

3-38

Page 44: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Some NED Online Tools

J. Hoffart et al.: EMNLP 2011, VLDB 2011
https://d5gate.ag5.mpi-sb.mpg.de/webaida/

P. Ferragina, U. Scaiella: CIKM 2010
http://tagme.di.unipi.it/

R. Isele, C. Bizer: VLDB 2012
http://spotlight.dbpedia.org/demo/index.html

Reuters Open Calais
http://viewer.opencalais.com/

S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009
http://www.cse.iitb.ac.in/soumen/doc/CSAW/

D. Milne, I. Witten: CIKM 2008
http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/

perhaps more

Some use the Stanford NER tagger for detecting mentions:
http://nlp.stanford.edu/software/CRF-NER.shtml

3-44

Page 45: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

NED: Experimental Evaluation

Benchmark:
• Extended CoNLL 2003 dataset: 1400 newswire articles
• originally annotated with mention markup (NER), now with NED mappings to Yago and Freebase
• difficult texts:
  … Australia beats India … → Australian_Cricket_Team
  … White House talks to Kremlin … → President_of_the_USA
  … EDS made a contract with … → HP_Enterprise_Services

Results:
• Best: AIDA method with prior+sim+coh + robustness test
• 82% precision @100% recall, 87% mean average precision
• Comparison to other methods: see paper

J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011
http://www.mpi-inf.mpg.de/yago-naga/aida/

3-45

Page 46: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Ongoing Research & Remaining Challenges

• More efficient graph algorithms (multicore, etc.)

• Short and difficult texts:
  • tweets, headlines, etc.
  • fictional texts: novels, song lyrics, etc.
  • incoherent texts

• Disambiguation beyond entity names:
  • coreferences: pronouns, paraphrases, etc.
  • common nouns, verbal phrases (general WSD)

• Leverage deep-parsing structures, leverage semantic types
  Example: Page played Kashmir on his Gibson (dependency labels in the figure: subj, obj, mod)

• Allow mentions of unknown entities, mapped to null

• Structured Web data: tables and lists

3-46

Page 47: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

General Word Sense Disambiguation

Question: Which song writers covered ballads written by the Stones?

Candidate word senses:
• {songwriter, composer}
• {cover, perform}
• {cover, report, treat}
• {cover, help out}

3-47

Page 48: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Handling Out-of-Wikipedia Entities

Input text: "Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song."

Candidate entities:
• wikipedia.org/Nick_Cave
• wikipedia.org/Good_Luck_Cave
• last.fm/Nick_Cave/Hallelujah
• wikipedia/Hallelujah_(L_Cohen)
• wikipedia/Hallelujah_Chorus
• last.fm/Nick_Cave/O_Children
• wikipedia/Children_(2011 film)
• last.fm/Nick_Cave/Weeping_Song
• wikipedia.org/Weeping_(song)

3-48

Page 49: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Handling Out-of-Wikipedia Entities

Input text: "Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song."

Candidate entities with their keyphrases:
• wikipedia.org/Nick_Cave: Bad Seeds, No More Shall We Part, Murder Songs
• wikipedia.org/Good_Luck_Cave: Gunung Mulu National Park, Sarawak Chamber, largest underground chamber
• last.fm/Nick_Cave/Hallelujah: eerie violin, Bad Seeds, No More Shall We Part
• wikipedia/Hallelujah_(L_Cohen): Leonard Cohen, Rufus Wainwright, Shrek and Fiona
• wikipedia/Hallelujah_Chorus: Messiah oratorio, George Frideric Handel
• last.fm/Nick_Cave/O_Children: Nick Cave & Bad Seeds, Harry Potter 7 movie, haunting choir
• wikipedia/Children_(2011 film): South Korean film
• last.fm/Nick_Cave/Weeping_Song: Nick Cave, Murder Songs, P.J. Harvey, Nick and Blixa duet
• wikipedia.org/Weeping_(song): Dan Heymann, apartheid system

3-49

Page 50: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Handling Out-of-Wikipedia Entities

• Characterize all entities (and mentions) by sets of keyphrases
• Entity coherence then becomes keyphrase overlap, with no need for href link data
• For each mention add a "self" candidate: out-of-KB entity with keyphrases computed by Web search

Efficient comparison of two keyphrase sets: two-stage hashing, using min-hash sketches and LSH

For entities e, f with phrase weights \varphi:

\[ \mathrm{KORE}(e,f) = \frac{\sum_{p \in e,\, q \in f} \mathrm{PO}(p,q)^2 \cdot \min(\varphi_e(p), \varphi_f(q))}{\sum_{p \in e} \varphi_e(p) + \sum_{q \in f} \varphi_f(q)} \]

For phrases p, q with word weights \gamma:

\[ \mathrm{PO}(p,q) = \frac{\sum_{w \in p \cap q} \min(\gamma_p(w), \gamma_q(w))}{\sum_{w \in p \cup q} \max(\gamma_p(w), \gamma_q(w))} \]

[J. Hoffart et al.: CIKM'12]
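The keyphrase-overlap measure can be sketched directly from the two formulas, computed exactly rather than with the min-hash/LSH speed-up mentioned above. The data layout (phrases as word tuples mapped to phrase weights, word weights as dicts, a missing word counting as weight 0) and the toy keyphrases are assumptions of this sketch.

```python
def phrase_overlap(p, q, gamma_p, gamma_q):
    """PO(p, q): weighted Jaccard overlap of the words of two phrases."""
    words_p, words_q = set(p), set(q)
    numerator = sum(min(gamma_p[w], gamma_q[w]) for w in words_p & words_q)
    denominator = sum(
        max(gamma_p.get(w, 0.0) if w in words_p else 0.0,
            gamma_q.get(w, 0.0) if w in words_q else 0.0)
        for w in words_p | words_q
    )
    return numerator / denominator if denominator else 0.0

def kore(phrases_e, phrases_f, gamma_e, gamma_f):
    """KORE(e, f) for two entities given as {phrase tuple: phrase weight} dicts."""
    numerator = sum(
        phrase_overlap(p, q, gamma_e, gamma_f) ** 2 * min(phi_p, phi_q)
        for p, phi_p in phrases_e.items()
        for q, phi_q in phrases_f.items()
    )
    denominator = sum(phrases_e.values()) + sum(phrases_f.values())
    return numerator / denominator if denominator else 0.0

# Toy keyphrases with invented weights.
E = {("bad", "seeds"): 1.0, ("murder", "songs"): 0.5}
F = {("bad", "seeds"): 0.8, ("eerie", "violin"): 0.4}
UNIT = {w: 1.0 for phrase in list(E) + list(F) for w in phrase}

print(round(kore(E, F, UNIT, UNIT), 4))
```

Because partially overlapping phrases contribute through PO, two entities can be related without sharing any identical keyphrase, which is what makes the measure usable for out-of-KB entities without link data.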

3-50

Page 51: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Variants of NED at Web Scale

Tools can map short text onto entities in a few seconds

• How to run this on a big batch of 1 million input texts?
  partition inputs across distributed machines, organize dictionary appropriately, …
  exploit cross-document contexts

• How to deal with inputs from different time epochs?
  consider time-dependent contexts, map to entities of proper epoch (e.g. harvested from Wikipedia history)

• How to handle Web-scale inputs (100 million pages) restricted to a set of interesting entities? (e.g. tracking politicians and companies)

3-51

Page 52: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Outline

...

Motivation

Entity-Name Disambiguation

Mapping Questions into Queries

Entity Linkage

Wrap-up

3-52

Page 53: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Word Sense Disambiguation for Question-to-Query Translation

Question: "Who played in Casablanca and was married to a writer born in Rome?"

Translation with WSD into SPARQL:

Select ?p Where {
  ?p type person .
  ?p actedIn Casablanca_(film) .
  ?p isMarriedTo ?w .
  ?w type writer .
  ?w bornIn Rome . }

The SPARQL query is evaluated over the KB to produce the answer (query variables: ?p, ?w)

3-53

QA system DEANNA [M. Yahya et al.: EMNLP'12]

www.mpi-inf.mpg.de/yago-naga/deanna/

Page 54: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

DEANNA in a Nutshell

Question → DEANNA → SPARQL → KB → Answers

DEANNA pipeline: Phrase detection → Phrase mapping → Dependency detection → Joint disambiguation → Query generation

3-54

Page 58: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

DEANNA Components

Question → DEANNA → SPARQL → KB → Answers

DEANNA components: Phrase detection → Phrase mapping → Dependency detection → Joint disambiguation → Query generation

3-58

Page 59: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Phrase Detection

Detected phrases: Who, played, played in, Casablanca, married, married to, was married to, a writer

Concepts (entities & classes): dictionary-based

Concept / Phrase:
• Casablanca / Casablanca
• Casablanca / Casablanca, Morocco
• Casablanca_(film) / Casablanca the film
• Casablanca_(film) / Casablanca
• … / …

Relations: mainly use ReVerb [Fader et al.: EMNLP'11]: V | VP | VW*P
Example: … was/VBD married/VBN to/TO a/DT …
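The relation-phrase pattern V | VP | VW*P can be approximated with a regular expression over coarse POS classes. This is a simplified sketch, not the ReVerb implementation: the mapping from Penn Treebank tags to the classes V, W, and P is a rough assumption, consecutive verbs are merged via `V+`, and ReVerb's lexical constraint is left out. The input is assumed to be already POS-tagged.

```python
import re

def tag_class(tag):
    """Map a Penn Treebank tag to V (verb), P (prep/particle/to), W (other word), or O."""
    if tag.startswith("VB"):
        return "V"
    if tag in {"IN", "TO", "RP"}:
        return "P"
    if tag.startswith(("NN", "JJ", "RB", "PRP")) or tag == "DT":
        return "W"
    return "O"

# V | VP | VW*P, with runs of verbs merged into one V group.
RELATION_PATTERN = re.compile(r"V+(?:W*P)?")

def relation_phrases(tagged_tokens):
    """Return candidate relation phrases from a list of (word, POS tag) pairs."""
    classes = "".join(tag_class(tag) for _, tag in tagged_tokens)
    return [
        " ".join(word for word, _ in tagged_tokens[m.start():m.end()])
        for m in RELATION_PATTERN.finditer(classes)
    ]

TAGGED = [("Who", "WP"), ("was", "VBD"), ("married", "VBN"),
          ("to", "TO"), ("a", "DT"), ("writer", "NN")]
print(relation_phrases(TAGGED))
```

The regex only keeps the longest match at each position, whereas the phrase-detection step above deliberately keeps the overlapping candidates "married", "married to", and "was married to" for later disambiguation.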

3-59

Page 61: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Phrase Mapping

Phrases: Casablanca, played, played in

Candidate semantic items: e:White_House, e:Casablanca, e:Casablanca_(film), e:Played_(film), r:actedIn, r:hasMusicalRole

Concepts (entities & classes): dictionary-based

Concept / Phrase:
• Casablanca / Casablanca
• Casablanca / Casablanca, Morocco
• Casablanca_(film) / Casablanca the film
• Casablanca_(film) / Casablanca
• Played_(film) / Played

Relations: dictionary-based

Relation / Phrase:
• actedIn / acted in
• actedIn / played in
• hasMusicalRole / plays
• hasMusicalRole / mastered

3-61

Page 63: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Dependency Detection

Look for specific patterns in dependency graph [de Marneffe et al.: LREC'06]

Dependency path: writer --partmod--> born --prep--> in --pobj--> Rome

Resulting q-node q1 with phrases: a writer, was born, born, Rome

Candidate semantic items: c:writer, r:bornInPlace, r:bornOnDate, e:Max_Born, e:Born_(film), e:Sydne_Rome, e:Rome

3-63

Page 64: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Disambiguation Graph

q-nodes: q1, q2, q3

Phrase nodes: Who, played, played in, Casablanca, married, married to, was married to, a writer, was born, born, Rome

Semantic nodes: c:person, r:actedIn, r:hasMusicalRole, e:Played_(film), e:Casablanca, e:Casablanca_(film), e:White_House, e:Married_(series), c:married_person, r:isMarriedTo, c:writer, r:bornInPlace, r:bornOnDate, e:Max_Born, e:Born_(film), e:Sydne_Rome, e:Rome

3-64

Page 66: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Joint Disambiguation - ILP

• ILP: Integer Linear Programming
• maximize
\[ \alpha \sum_{i,j} w_{i,j} Y_{i,j} + \beta \sum_{k,l} v_{k,l} Z_{k,l} + \dots \]
• subject to: no token in multiple phrases, triples observe type constraints, …

3-66

Page 67: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Joint Disambiguation – Objective

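\[ \alpha \sum_{i,j} w_{i,j} Y_{i,j} + \beta \sum_{k,l} v_{k,l} Z_{k,l} \]

[Figure: disambiguation graph for q-node q1 with phrase nodes (a writer, was born, born, Rome) and semantic nodes (c:writer, r:bornInPlace, r:bornOnDate, e:Max_Born, e:Born_(film), e:Sydne_Rome, e:Rome)]

Labels in the figure: q-nodes, phrase nodes, semantic nodes, similarity edges, coherence edges, prior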

3-67

Page 68: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Joint Disambiguation – Objective

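\[ \alpha \sum_{i,j} w_{i,j} Y_{i,j} + \beta \sum_{k,l} v_{k,l} Z_{k,l} \]

[Figure: disambiguation graph for q-node q1 with phrase nodes (a writer, was born, born, Rome) and semantic nodes (c:writer, r:bornInPlace, r:bornOnDate, e:Max_Born, e:Born_(film), e:Sydne_Rome, e:Rome)]

Labels in the figure: q-nodes, phrase nodes, semantic nodes, similarity edges, coherence edges, coherence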

3-68

Page 69: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

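Joint Disambiguation – Constraints

A phrase node can be assigned to only one semantic node:

Phrase node a: Casablanca
Semantic nodes: 1 = e:White_House, 2 = e:Casablanca, 3 = e:Casablanca_(film)
Assignment variables: Y_{a,1}, Y_{a,2}, Y_{a,3}, of which at most one may be set:

\[ Y_{a,1} + Y_{a,2} + Y_{a,3} \le 1 \]

Objective:

\[ \alpha \sum_{i,j} w_{i,j} Y_{i,j} + \beta \sum_{k,l} v_{k,l} Z_{k,l} \]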

3-69

Page 70: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Joint Disambiguation – Constraints

• Classes translate to type-constrained variables
• Every semantic triple should have a class to join & project!

Example: person actedIn Casablanca_(film) becomes ?x type person . ?x actedIn Casablanca_(film)

[Figure: q-node q1 with phrase nodes (a writer, was born, Rome) and semantic nodes (c:writer, e:The_Writer (magazine), r:bornInPlace, r:bornOnDate, e:Sydne_Rome, e:Rome)]

3-70

Page 72: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Structured Query Generation

SELECT ?p WHERE {
  ?w type writer .
  ?w bornIn Rome .
  ?p type person .
  ?p actedIn Casablanca_(film) .
  ?p isMarriedTo ?w }

q-nodes: q1, q2, q3

Selected mappings after disambiguation:
• a writer → c:writer
• was born → r:bornIn
• Rome → e:Rome
• Who → c:person
• played in → r:actedIn
• Casablanca → e:Casablanca_(film)
• was married to → r:isMarriedTo

3-72

Page 73: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Outline

...

Entity-Name Disambiguation

Motivation

Wrap-up

Mapping Questions into Queries

Entity Linkage

3-73

Page 74: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Entity Linkage for the Web of Data

owl:sameAs

rdf.freebase.com/ns/en.rome_ny

owl:sameAs

owl:sameAs

data.nytimes.com/51688803696189142301

Coord

geonames.org/5134301/city_of_rome

N 43° 12' 46'' W 75° 27' 20''

dbpprop:citizenOf

dbpedia.org/resource/Rome

rdf:type

rdfs:subClassOf

yago/wordnet:Actor109765278

rdf:type

rdfs:subClassOf

yago/wikicategory:ItalianComposer

yago/wordnet:Artist109812338

prop:actedIn

imdb.com/name/nm0910607/

prop:composedMusicFor

imdb.com/title/tt0361748/

dbpedia.org/resource/Ennio_Morricone

sameAs links? Where? How?

30 billion triples, 500 million links

3-74

Page 75: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Record Linkage (Entity Resolution)

Authors | Affiliation | Title | Venue
Susan B. Davidson, Peter Buneman, Yi Chen | University of Pennsylvania | Issues in … | Int. Conf. on Very Large Data Bases
O.P. Buneman, S. Davison, Y. Chen | U Penn | Issues in … | VLDB Conf.
Y. Davidson, S. Chen | Penn Station | Issues in … | XLDB Conference
P. Baumann, S. Davidson, Cheng Y. | Penn State | Issues in … | PVLDB

(records 1, 2, 3, …, N)

Sean Penn

Halbert L. Dunn: Record Linkage. American Journal of Public Health, 1946.
H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science, 1959.
I.P. Fellegi, A.B. Sunter: A Theory of Record Linkage. J. of American Statistical Assoc., 1969.

Find equivalence classes of entities, and records, based on:
• similarity of values (edit distance, n-gram overlap, etc.)
• joint agreement of linkage

→ similarity joins, grouping/clustering, collective learning, etc.
→ often domain-specific customization (similarity measures etc.)
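The two value-similarity measures named above can be sketched in a few lines. This is a plain textbook version (Levenshtein distance and Jaccard overlap of character bigrams), not a customized domain-specific measure; the function names are chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def ngrams(s: str, n: int = 2) -> set:
    """Set of lower-cased character n-grams of s."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_overlap(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of the two n-gram sets, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

For the records on this slide, `edit_distance("S. Davison", "S. Davidson")` is 1 and `ngram_overlap("VLDB", "PVLDB")` is 0.75. Value similarity alone cannot tell that "Penn Station" is not a university, which is why joint agreement of linkage matters.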

3-75

Page 76: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Entity Linkage via Markov Logic


prob. / uncertain rules:
sameTitle(x,y) ∧ sameAuths(x,y) ∧ sameVenue(x,y) ⇒ sameAs(x,y)
sameTitle(x,y) ∧ sameAuths(x,y) ∧ sameAffil(x,y) ⇒ sameAs(x,y)
overlapAuths(x,y) ∧ sameAffil(x,y) ⇒ sameAuths(x,y)
sameAs(rec1.auth1, rec2.auth1) [0.2]
sameAs(rec1.auth1, rec2.auth2) [0.9]
…

• specify in Markov Logic or as factor graph
• generate MRF (or …) and solve by MCMC (or …)

(Singla/Domingos: ICDM’06, Hall/Sutton/McCallum: KDD’08)

3-76

Page 77: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

sameAs-Link Test across Sources

LOD source 1 LOD source 2

sameAs ?

ei ej

similarity: sim(ei, ej)

neighborhoods: N(ei), N(ej)

coherence: coh(x, y) for x ∈ N(ei), y ∈ N(ej)

$\mathrm{sameAs}(e_i, e_j) \Leftarrow \mathrm{sim}(e_i, e_j) \ge \ldots \,\wedge\, \sum_{x \in N(e_i),\, y \in N(e_j)} \mathrm{coh}(x, y) \ge \ldots$

record linkage problem

3-77

Page 78: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

sameAs-Link Generation across Sources

LOD source 1 LOD source 2 LOD source 3

ek sameAs ?

ej sameAs ?

sameAs ?

ei

3-78

Page 79: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

sameAs-Link Generation across Sources

LOD source 1 LOD source 2 LOD source 3

ek sameAs ?

ej sameAs ?

sameAs ?

ei

sim(ei, ej): likelihood of being equivalent, mapped to [-1,1]
coh(x, y): likelihood of being mentioned together, mapped to [-1,1]

0-1 decision variables: Xij … Xjk … Xik …

objective function:

$\sum_{i,j} \Big( X_{ij}\, \mathrm{sim}(e_i, e_j) + X_{ij} \sum_{x \in N_i,\, y \in N_j} \mathrm{coh}(x, y) \Big) + \sum_{j,k} (\ldots) + \sum_{i,k} (\ldots) = \max!$

constraints:

$\sum_j X_{ij} \le 1$ for all $i$ …
$(1 - X_{ij}) + (1 - X_{jk}) \ge (1 - X_{ik})$ for all $i, j, k$ …

• Joint Mapping
• ILP model or prob. factor graph or …
• Use your favorite solver
• How? At Web scale ???
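To make the 0-1 program concrete, here is a brute-force sketch over a handful of hypothetical entities. The entities, the `gain` values (standing in for sim plus summed neighborhood coherence), and the default of -1 for unlisted pairs are all assumptions for illustration. Enumerating all assignments is exponential, which is exactly why Web scale calls for an ILP solver or an approximation.

```python
from itertools import combinations, permutations, product

# Hypothetical entities as (source, name); source s3 contributes two entities.
entities = [("s1", "a"), ("s2", "b"), ("s3", "c"), ("s3", "d")]

# Hypothetical gain per candidate pair: sim(e_i, e_j) plus summed coherence
# of the two neighborhoods. Unlisted pairs are treated as -1.
gain = {
    frozenset({("s1", "a"), ("s2", "b")}): 0.8,
    frozenset({("s2", "b"), ("s3", "c")}): 0.6,
    frozenset({("s1", "a"), ("s3", "c")}): -0.2,
    frozenset({("s1", "a"), ("s3", "d")}): 0.3,
}

# Only pairs from different sources are sameAs candidates.
pairs = [frozenset(p) for p in combinations(entities, 2) if p[0][0] != p[1][0]]


def feasible(x):
    """Check both constraint families for a 0-1 assignment x over pairs."""
    # At most one sameAs partner per entity within each other source.
    for e in entities:
        per_source = {}
        for p in pairs:
            if x[p] and e in p:
                (other,) = p - {e}
                per_source[other[0]] = per_source.get(other[0], 0) + 1
        if any(count > 1 for count in per_source.values()):
            return False
    # Transitivity: (1 - X_ij) + (1 - X_jk) >= (1 - X_ik).
    for i, j, k in permutations(entities, 3):
        x_ij = x.get(frozenset({i, j}), 0)
        x_jk = x.get(frozenset({j, k}), 0)
        x_ik = x.get(frozenset({i, k}), 0)
        if (1 - x_ij) + (1 - x_jk) < (1 - x_ik):
            return False
    return True


def solve():
    """Enumerate all 0-1 assignments and keep the best feasible one."""
    best, best_score = None, float("-inf")
    for bits in product((0, 1), repeat=len(pairs)):
        x = dict(zip(pairs, bits))
        if not feasible(x):
            continue
        total = sum(gain.get(p, -1.0) for p in pairs if x[p])
        if total > best_score:
            best, best_score = x, total
    return {p for p in pairs if best[p]}, best_score
```

In this toy instance the optimum links a, b and c, even though the a-c pair has negative gain on its own: transitivity forces it once a-b and b-c are both accepted.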

3-79

Page 80: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Similarity Flooding

Graph with record / entity pairs as nodes (sameAs candidates) and edges connecting related pairs:

R(x,y) and S(u,w) and sameAs candidates (x,u), (y,w) ⇒ edge between (x,u) and (y,w)

• Node weights: belief strength in sameAs(x,u)
• Edge weights: degree of relatedness

Iterate until convergence:
• propagate node weights to neighbors
• new node weight is linear combination of inputs

Related to belief propagation algorithms, label propagation, etc.
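The iteration can be sketched as follows. This is one simple variant chosen for illustration (new weight = initial weight + current weight + weighted neighbor weights, then normalized by the maximum); the function name, the toy graph, and the stopping tolerance are assumptions, and the published algorithm offers several fixpoint formulas. The sketch assumes at least one positive initial weight.

```python
def similarity_flooding(initial, edges, iterations=50, tol=1e-6):
    """Propagate node weights along edges until they stop changing.

    initial: {node: starting belief in sameAs for that candidate pair}
    edges:   {node: [(neighbor, edge_weight), ...]}
    """
    weights = dict(initial)
    for _ in range(iterations):
        updated = {}
        for node in weights:
            incoming = sum(w * weights[nbr] for nbr, w in edges.get(node, []))
            # Linear combination of the node's inputs.
            updated[node] = initial[node] + weights[node] + incoming
        top = max(updated.values())
        updated = {n: v / top for n, v in updated.items()}  # normalize to [0, 1]
        delta = max(abs(updated[n] - weights[n]) for n in weights)
        weights = updated
        if delta < tol:
            break
    return weights


# Toy graph: candidates (x,u) and (y,w) are connected because R(x,y) and
# S(u,w) hold; candidate (x,w) has no related pair.
initial = {("x", "u"): 0.5, ("y", "w"): 0.5, ("x", "w"): 0.5}
edges = {
    ("x", "u"): [(("y", "w"), 1.0)],
    ("y", "w"): [(("x", "u"), 1.0)],
}
result = similarity_flooding(initial, edges)
```

All three candidates start with the same belief, but the two connected pairs reinforce each other and end at 1.0, while the isolated pair drops to about one third.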

3-80

Page 81: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Blocking of Match Candidates

Avoid computing O(n²) similarities between records / entities

• Group potentially matching records
• Run more accurate & more expensive method per group, at risk of missing some matches

• Iterative Blocking: distribute found matches to other blocks, then repeat per-block runs
• Multi-type Joint Resolution: blocks of different record types (author, venue, etc.) propagate matches to other types, then repeat runs

    Name      Zip    Email
1   John Doe  49305  jdoe@yahoo
2   John Doe  94305  jdoe@gmail
3   Jon Foe   94305  jdoe@yahoo
4   Jane Foe  12345  jane@msn
5   Jane Fog  12345  jane@msn

Group by zip code: {1,4,5} and {2,3} → sameAs(4,5), sameAs(2,3)

Group by 1st char of last name: {1,2} and {3,4,5} → sameAs(1,2), sameAs(4,5)
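The last-name blocking on the five example records can be sketched like this. The match rule `is_match` is a hypothetical stand-in for the "more accurate & more expensive method" (here: identical name, or identical zip and email); the slide does not specify one.

```python
from collections import defaultdict
from itertools import combinations

# The five example records: id -> (name, zip, email).
records = {
    1: ("John Doe", "49305", "jdoe@yahoo"),
    2: ("John Doe", "94305", "jdoe@gmail"),
    3: ("Jon Foe", "94305", "jdoe@yahoo"),
    4: ("Jane Foe", "12345", "jane@msn"),
    5: ("Jane Fog", "12345", "jane@msn"),
}


def block_by(key):
    """Group record ids by a cheap blocking key."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    return list(blocks.values())


def is_match(a, b):
    """Hypothetical match rule: same name, or same zip and same email."""
    (name_a, zip_a, mail_a), (name_b, zip_b, mail_b) = records[a], records[b]
    return name_a == name_b or (zip_a == zip_b and mail_a == mail_b)


def match_within_blocks(blocks):
    """Compare only pairs that share a block; count the comparisons made."""
    matches, comparisons = set(), 0
    for block in blocks:
        for a, b in combinations(block, 2):
            comparisons += 1
            if is_match(a, b):
                matches.add((a, b))
    return matches, comparisons


def last_initial(rec):
    """Blocking key: first character of the last name."""
    return rec[0].split()[-1][0]


blocks = block_by(last_initial)
matches, comparisons = match_within_blocks(blocks)
```

Blocking by last-name initial yields the blocks {1,2} and {3,4,5}, so only 4 of the 10 possible pairs are compared, and the sketch finds sameAs(1,2) and sameAs(4,5). Iterative blocking would then pass these matches on to blocks built from another key and repeat.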

3-81

Page 82: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Iterative Blocking for Joint Resolution with Multiple Entity Types [Whang et al. 2012]

Publications Authors Venues

heuristics for constructing efficient execution plans exploiting "influence graph"

after round 1

after round 2

3-82

Page 83: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

RiMOM Method

Risk Minimization Based Ontology Matching Method for joint matching of concepts (entities, classes) & properties (relations)

[Juanzi Li et al.: TKDE‘09]

Strategies using variety of matching criteria:
• Linguistic-based:
  • edit distance
  • context vector distance
  • …
• Structure-based:
  • similarity flooding
  • …

keg.cs.tsinghua.edu.cn/project/RiMOM/

3-83

Page 84: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

COMA++ Framework [E. Rahm et al.]

• Joint schema alignment and entity matching
• Comprehensive architecture with many plug-ins for customizing to specific application
• Blocked matchers parallelizable on Map-Reduce platform

dbs.uni-leipzig.de/Research/coma.html/

3-84

Page 85: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

PARIS Method [F. Suchanek et al. 2012]

webdam.inria.fr/paris/

Probabilistic Alignment of Relations, Instances, and Schema: joint reasoning on sameEntity, sameRelation, sameClass with direct probabilistic assessment

$P[\mathit{literal}_1 \equiv \mathit{literal}_2] = \ldots$ (same constant value)

$P[r_1 \subseteq r_2] = \ldots$ (sub-relation)

$P[e_1 \equiv e_2] = \ldots$ (same entity)

$P[c_1 \subseteq c_2] = \ldots$ (sub-class)

Matching entities of DBpedia with YAGO: 90% precision, 73% recall, after 4 iterations, 5 h run-time

Iterate through probabilistic equations
Empirically converges to fixpoint

3-85

Page 86: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

PARIS Method [F. Suchanek et al. 2012]

webdam.inria.fr/paris/

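$P[\mathit{literal}_1 \equiv \mathit{literal}_2] = \ldots$ based on similarity and co-occurrence

same entity, if relations were already aligned:

$P[x \equiv y] = 1 - \prod_{r(x,u),\, r(y,w)} \big(1 - \mathrm{fun}(r^{-1})\, P[u \equiv w]\big)$

considering negative evidence, this is multiplied by:

$\prod_{r(x,u)} \Big(1 - \mathrm{fun}(r) \prod_{r(y,w)} \big(1 - P[u \equiv w]\big)\Big)$

where $\mathrm{fun}(r) = \frac{\#x:\ \exists y:\ r(x,y)}{\#x,y:\ r(x,y)}$ is the degree to which $r$ is a function

Example: $P[\text{Shangri-La} \equiv \text{Zhongdian}] = \ldots\ \mathrm{fun}(\mathit{bornIn}^{-1})\, P[\text{Jet Li} \equiv \text{Li Lianjie}]$

3-86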
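A small sketch of the positive-evidence part of the same-entity equation, assuming relations are already aligned by name and that functionality is measured on the first knowledge base only. The toy facts, the given `p_eq` values for objects, and the function names are assumptions for illustration; the negative-evidence factor is left out.

```python
def fun(facts):
    """Degree to which a relation is a function:
    number of distinct subjects divided by number of (subject, object) facts."""
    return len({s for s, _ in facts}) / len(facts)


def inverse(facts):
    """Facts of the inverse relation r^-1."""
    return {(o, s) for s, o in facts}


def p_same(x, y, kb1, kb2, p_eq):
    """P[x = y] = 1 - prod over r(x,u), r(y,w) of (1 - fun(r^-1) * P[u = w])."""
    not_same = 1.0
    for r, facts1 in kb1.items():
        inv_fun = fun(inverse(facts1))  # assumption: measured on kb1 only
        for s1, u in facts1:
            if s1 != x:
                continue
            for s2, w in kb2.get(r, ()):
                if s2 != y:
                    continue
                not_same *= 1.0 - inv_fun * p_eq.get((u, w), 0.0)
    return 1.0 - not_same


# Hypothetical toy knowledge bases: relation name -> set of (subject, object).
kb1 = {
    "bornIn": {("a1", "Rome"), ("a2", "Rome")},
    "wrote": {("a1", "Book1"), ("a2", "Book2")},
}
kb2 = {
    "bornIn": {("b1", "Roma")},
    "wrote": {("b1", "Book_1")},
}
# Assumed, already computed equivalence probabilities of the objects.
p_eq = {("Rome", "Roma"): 1.0, ("Book1", "Book_1"): 0.9}

p_a1_b1 = p_same("a1", "b1", kb1, kb2, p_eq)  # shares birthplace and book
p_a2_b1 = p_same("a2", "b1", kb1, kb2, p_eq)  # shares only the birthplace
```

Here `bornIn` has inverse functionality 0.5 (two people share one birthplace), so a shared birthplace is only weak evidence, while `wrote` has inverse functionality 1.0. The results are about 0.95 for a1/b1 and 0.5 for a2/b1.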

Page 87: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

PARIS Method [F. Suchanek et al. 2012]

webdam.inria.fr/paris/

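sub-relation $P[s \subseteq r]$:

if entities were already resolved:

$P[s \subseteq r] = \frac{\#x,u:\ s(x,u) \wedge r(x,u)}{\#x,u:\ s(x,u)}$

with same-entity probabilities:

$P[s \subseteq r] = \frac{\sum_{s(x,u)} \big(1 - \prod_{r(y,w)} (1 - P[x \equiv y]\, P[u \equiv w])\big)}{\sum_{s(x,u)} \big(1 - \prod_{y,w} (1 - P[x \equiv y]\, P[u \equiv w])\big)}$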

3-87

Page 88: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

PARIS Method [F. Suchanek et al. 2012]

webdam.inria.fr/paris/

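same entity revisited, with sub-relation probability:

$P[x \equiv y] = 1 - \prod_{s(x,u),\, r(y,w)} \big(1 - P[s \supseteq r]\, \mathrm{fun}(s^{-1})\, P[u \equiv w]\big) \big(1 - P[s \subseteq r]\, \mathrm{fun}(r^{-1})\, P[u \equiv w]\big)$

considering negative evidence, this is multiplied by:

$\prod_{s(x,u),\, r(y,w)} \Big(1 - P[s \supseteq r]\, \mathrm{fun}(s) \prod_{r(y,w)} (1 - P[u \equiv w])\Big) \Big(1 - P[s \subseteq r]\, \mathrm{fun}(r) \prod_{r(y,w)} (1 - P[u \equiv w])\Big)$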

3-88

Page 89: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

PARIS Method [F. Suchanek et al. 2012]

webdam.inria.fr/paris/

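sub-class $P[c \subseteq d]$:

if entities were already resolved:

$P[c \subseteq d] = \frac{\#x:\ \mathrm{type}(x,c) \wedge \mathrm{type}(x,d)}{\#x:\ \mathrm{type}(x,c)}$

with same-entity probabilities:

$P[c \subseteq d] = \frac{\sum_{x:\, \mathrm{type}(x,c)} \big(1 - \prod_{y:\, \mathrm{type}(y,d)} (1 - P[x \equiv y])\big)}{\#x:\ \mathrm{type}(x,c)}$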
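The sub-class estimate with same-entity probabilities can be sketched directly from the equation: for every instance of c, compute the probability that it equals at least one instance of d, then average. The instance lists and the `p_eq` values are hypothetical.

```python
def p_subclass(c_members, d_members, p_eq):
    """P[c subclass-of d]: expected fraction of c's instances that
    have an equivalent instance in d."""
    if not c_members:
        return 0.0
    total = 0.0
    for x in c_members:
        none_match = 1.0
        for y in d_members:
            none_match *= 1.0 - p_eq.get((x, y), 0.0)
        total += 1.0 - none_match  # probability that x matches some y
    return total / len(c_members)


# Hypothetical instances of class c (first KB) and class d (second KB).
c_members = ["x1", "x2"]
d_members = ["y1", "y2", "y3"]
# Assumed same-entity probabilities from the entity-alignment step.
p_eq = {("x1", "y1"): 1.0, ("x2", "y2"): 0.5}

p_c_in_d = p_subclass(c_members, d_members, p_eq)
```

With these numbers x1 certainly has a counterpart in d and x2 has one with probability 0.5, so the estimate is 0.75. When all probabilities are 0 or 1, this reduces to the counting formula for already resolved entities.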

3-89

Page 90: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Partitioned MLN Method [V. Rastogi et al. 2011]

• Use Markov Logic Network for entity resolution
• Partition MLN with replication of nodes so that:
  • Each node has its neighborhood in the same partition

Repeat
• local computation: run MLN inference via MCMC on each partition (in parallel)
• message passing: exchange beliefs (on sameAs) among partitions with overlapping node sets
Until convergence

R1: sim(x,y) ⇒ sameAuthor(x,y)
R2: sim(x,y) ∧ coAuthor(x,a) ∧ coAuthor(y,b) ∧ sameAuthor(a,b) ⇒ sameAuthor(x,y)

3-90

Page 91: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

LINDA: Linked Data Alignment at Scale [C. Böhm et al. 2012]

• uses context sim and joint inference to process sameAs matrix with transitivity and other constraints
• alternates between setting sameAs and recomputing sim
• puts promising candidate pairs in priority queue
• queue is partitioned and processing parallelized

[Figure: LINDA architecture with nodes 1 … n, each holding a partition of the input queue Q (Q-part) and of the input entity graph G (G-part), and a result matrix X over entities e1 … em; steps: (1) accept, (2) notify, (3) update, (4) register queue updates.]

Experiment with BTC+ dataset:
• 3 billion quads
• 345 million triples
• 95 million URIs

Result after 30 h run-time:
• 12.3 million sameAs
• 66% precision
• > 80% for DBpedia-YAGO

3-91

Page 92: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Cross-Lingual Linking

Source: Z. Wang et al.: WWW‘12

+ simpler than monolingual: natural equivalences, interwiki links
− harder than monolingual: different terminologies & structures

Z. Wang et al. WWW‘12: factor-graph learning → 200,000 sameAs
T. Nguyen et al. VLDB‘12: sim features & LSI → infobox mappings

en.wikipedia.org: 3.5 million articles
baike.baidu.com: 4 million articles

3-92

Page 93: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Challenges Remaining

Entity linkage is at the heart of semantic data integration!
More than 50 years of research, still some way to go!

Benchmarks:
• OAEI Ontology Alignment & Instance Matching: oaei.ontologymatching.org
• TAC KBP Entity Linking: www.nist.gov/tac/2012/KBP/
• TREC Knowledge Base Acceleration: trec-kba.org

• Highly related entities with ambiguous names: George W. Bush (jun.) vs. George H.W. Bush (sen.)

• Out-of-Wikipedia entities with sparse context

• Enterprise data (perhaps combined with Web2.0 data)

• Entities with very noisy context (in social media)

• Records with complex DB / XML / OWL schemas

3-93

Page 94: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

TREC Task: Knowledge Base Acceleration

http://trec-kba.org

Goal: assist Wikipedia / KB editors
• recommend key citations as evidence of truth
• recommend infobox structure and categories
• recommend entity links and external links

3-94

Page 95: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

TREC Task: Knowledge Base Acceleration

http://trec-kba.org

3-95

Page 96: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Outline

...

Entity-Name Disambiguation

Motivation

Wrap-up

Mapping Questions into Queries

Entity Linkage

3-96

Page 97: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Take-Home Lessons

Web of Linked Data is great:
• hundreds of KBs with 30 billion triples and 500 million links
• mostly reference data, dynamic maintenance is bottleneck
• connection with Web of Contents needs improvement

Entity detection and disambiguation is key:
• for creating sameAs links in text (RDFa, microformats)
• for machine reading, semantic authoring, knowledge base acceleration, …

NED methods come close to human quality:
• combine popularity, similarity, and coherence
• extend towards general WSD (e.g. for QA)

Linking entities across KBs is advancing:
• Integrated methods for aligning entities, classes and relations

3-97

Page 98: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Open Problems and Grand Challenges

Automatic and continuously maintained sameAs links for Web of Linked Data with high accuracy & coverage

Combine algorithms and crowdsourcing for NED & ER:
• with active learning, minimizing human effort or cost/accuracy

Robust disambiguation of entities, relations and classes:
• Relevant for question answering & question-to-query translation
• Key building block for KB building and maintenance

Entity name disambiguation in difficult situations:
• Short and noisy texts about long-tail entities in social media

3-98

Page 99: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

End of Part 3

Questions?

3-99

Page 100: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Recommended Readings: Disambiguation

• J. Hoffart, M. A. Yosef, I. Bordino, et al.: Robust Disambiguation of Named Entities in Text. EMNLP 2011
• J. Hoffart et al.: KORE: Keyphrase Overlap Relatedness for Entity Disambiguation. CIKM 2012
• R.C. Bunescu, M. Pasca: Using Encyclopedic Knowledge for Named Entity Disambiguation. EACL 2006
• S. Cucerzan: Large-Scale Named Entity Disambiguation Based on Wikipedia Data. EMNLP 2007
• D.N. Milne, I.H. Witten: Learning to link with Wikipedia. CIKM 2008
• S. Kulkarni et al.: Collective annotation of Wikipedia entities in web text. KDD 2009
• G. Limaye et al.: Annotating and Searching Web Tables Using Entities, Types and Relationships. PVLDB 2010
• A. Rahman, V. Ng: Coreference Resolution with World Knowledge. ACL 2011
• L. Ratinov et al.: Local and Global Algorithms for Disambiguation to Wikipedia. ACL 2011
• M. Dredze et al.: Entity Disambiguation for Knowledge Base Population. COLING 2010
• P. Ferragina, U. Scaiella: TAGME: on-the-fly annotation of short text fragments. CIKM 2010
• X. Han, L. Sun, J. Zhao: Collective entity linking in web text: a graph-based method. SIGIR 2011
• M. Tsagkias, M. de Rijke, W. Weerkamp: Linking Online News and Social Media. WSDM 2011
• J. Du et al.: Towards High-Quality Semantic Entity Detection over Online Forums. SocInfo 2011
• V.I. Spitkovsky, A.X. Chang: A Cross-Lingual Dictionary for English Wikipedia Concepts. LREC 2012
• J.R. Finkel, T. Grenager, C. Manning: Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. ACL 2005
• V. Ng: Supervised Noun Phrase Coreference Research: The First Fifteen Years. ACL 2010
• S. Singh, A. Subramanya, F.C.N. Pereira, A. McCallum: Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models. ACL 2011
• T. Lin et al.: No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities. EMNLP 2012
• A. Rahman, V. Ng: Inducing Fine-Grained Semantic Classes via Hierarchical Classification. COLING 2010
• X. Ling, D.S. Weld: Fine-Grained Entity Recognition. AAAI 2012
• R. Navigli: Word sense disambiguation: A survey. ACM Comput. Surv. 41(2), 2009
• M. Yahya et al.: Natural Language Questions for the Web of Data. EMNLP 2012
• S. Shekarpour: Automatically Transforming Keyword Queries to SPARQL on Large-Scale KBs. ISWC 2011

3-100

Page 101: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Recommended Readings: Linked Data and Entity Linkage

• T. Heath, C. Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan&Claypool, 2011
• A. Hogan et al.: An empirical survey of Linked Data conformance. J. Web Sem. 14, 2012
• H. Glaser, A. Jaffri, I.C. Millard: Managing Co-Reference on the Semantic Web. LDOW 2009
• J. Volz, C. Bizer, M. Gaedke, G. Kobilarov: Discovering and Maintaining Links on the Web of Data. ISWC 2009
• F. Naumann, M. Herschel: An Introduction to Duplicate Detection. Morgan&Claypool, 2010
• H. Köpcke et al.: Learning-Based Approaches for Matching Web Data Entities. IEEE Internet Computing 2010
• H. Köpcke et al.: Evaluation of entity resolution approaches on real-world match problems. PVLDB 2010
• S. Melnik, H. Garcia-Molina, E. Rahm: Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. ICDE 2002
• S. Chaudhuri, V. Ganti, R. Motwani: Robust Identification of Fuzzy Duplicates. ICDE 2005
• S.E. Whang et al.: Entity Resolution with Iterative Blocking. SIGMOD 2009
• S.E. Whang, H. Garcia-Molina: Joint Entity Resolution. ICDE 2012
• L. Kolb, A. Thor, E. Rahm: Load Balancing for MapReduce-based Entity Resolution. ICDE 2012
• J. Li, J. Tang, Y. Li, Q. Luo: RiMOM: A dynamic multistrategy ontology alignment framework. TKDE 21(8), 2009
• P. Singla, P. Domingos: Entity Resolution with Markov Logic. ICDM 2006
• I. Bhattacharya, L. Getoor: Collective Entity Resolution in Relational Data. TKDD 1(1), 2007
• R. Hall, C.A. Sutton, A. McCallum: Unsupervised deduplication using cross-field dependencies. KDD 2008
• V. Rastogi, N. Dalvi, M. Garofalakis: Large-Scale Collective Entity Matching. PVLDB 2011
• F. Suchanek et al.: PARIS: Probabilistic Alignment of Relations, Instances, and Schema. PVLDB 2012
• Z. Wang, J. Li, Z. Wang, J. Tang: Cross-lingual knowledge linking across wiki knowledge bases. WWW 2012
• T. Nguyen et al.: Multilingual Schema Matching for Wikipedia Infoboxes. PVLDB 2012
• A. Hogan et al.: Scalable and distributed methods for entity matching. J. Web Sem. 10, 2012
• C. Böhm et al.: LINDA: Distributed Web-of-Data-Scale Entity Matching. CIKM 2012
• J. Wang, T. Kraska, M. Franklin, J. Feng: CrowdER: Crowdsourcing Entity Resolution. PVLDB 2012

3-101

Page 102: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Knowledge Harvesting: Overall Take-Home Lessons

KBs are great opportunity in the big-data era:
• revive old AI vision, make it real & large-scale!
• challenging, but high pay-off

Strong success story on entities and classes

Many opportunities remaining:
• temporal knowledge, spatial, visual, commonsense
• vertical domains: health, music, travel, …

Good progress on relational facts
Methods for open-domain relation discovery

Search and ranking:
• Combine facts (SPO triples) with witness text
• Extend SPARQL, LMs for ranking, UI unclear

Entity linking:
• From names in text to entities in KB
• sameAs between entities in different KBs / DBs

1-102

Page 103: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

Knowledge Harvesting: Research Opportunities & Challenges

Explore & exploit synergies between semantic, statistical, & social Web methods: statistical evidence + logical consistency + wisdom of the crowd!

For DB / AI / IR / NLP / Web researchers:
• efficiency & scalability
• consistency constraints & reasoning
• search and ranking
• deep linguistic patterns & statistics
• text (& speech) disambiguation
• killer app for uncertain data management
• knowledge-base life-cycle
and more

1-103

Page 104: Gerhard  Weikum Max Planck Institute  for Informatics mpi-inf.mpg.de/~weikum

de: vielen Dank

en: thank you

fr: Merci beaucoup

es: muchas gracias

cmn: 非常谢谢你

ru: Большое спасибо

tib: ཐུགས་རྗེ་ཆེ་།

yue: 唔該

wu: 谢谢侬

expression of gratitude

dai: ขอบคุณ

3-104
