César de Pablo Sanchez
Stat rosa pristina nomine,
nomina nuda tenemus
("The rose of old remains only in its name; we hold naked names.")
1. Task definition: KBP and EL
2. System description
3. Results
4. Conclusions
TAC-KBP 2010 - Combining Similarities and Regression Classifiers for Entity Linking
Overview of previous work
Drug Drug Interactions
Relation extraction
Anaphora resolution
OPINATOR - Opinion Mining
Sentiment-loaded dictionaries
Sentiment classification
Opinion summarization
Search/Navigation
Knowledge acquisition: list candidates for the Greek elections in June.
What party does Tsipras represent?
How old is he?
What does Syriza mean?
How old is Samaras?
1. Task definition: KBP and EL
2. System description
3. Results
4. Conclusions
Knowledge Base Population
César de Pablo, Juan Perea, Paloma Martínez
● KB built from a Wikipedia dump (2008): title, name, type, id; wiki text; several facts as [name, value] pairs
● 1.3 million English newswire documents, published between 1994 and 2008
● 488,240 web pages
IE = KBP?
● Accurate extraction of facts – not annotation
● Learn facts from a corpus – repetition is not important, but it helps confidence
● Asserting wrong information is bad
● Scalability
● Provenance

QA = KBP?
● Slots are fixed but targets change
● Leverage knowledge from the KB
● Global resolution – ground information to the KB
● Avoid contradictions
● Detect novel information
Task at TAC-KBP
● Entity Linking – grounding entity mentions in documents to KB entries
● Slot Filling – learning attributes about target entities

Task 1: Slot Filling
Task 2: Entity Linking
Entity Linking: Example
For a name string and a document, determine which entity in the KB, if any, is being referred to by the name string.

<query id="EL006455"><name>Reserve Bank</name><docid>eng-NG-31-100316-11150589</docid><entity>E0700143</entity></query>
<query id="EL06472"><name>Reserve Bank</name><docid>eng-NG-31-142262-10040510</docid><entity>E0421510</entity></query>

… E0421510: Reserve Bank of Australia … E0700143: Reserve Bank of India …
NIL
Entity Linking: Challenges
Focus on confusable entities:
● Ambiguous names: Reserve Bank, Alan Jackson, Fonda
● Multiple name variants: Saddam Hussain, Saddam Hussein
● Acronym expansion: CDC, AZ
● Variety of cases: Centre for Disease Control, European Centre for Disease Control, AZ, Arizona, Astra Zeneca
● Pilot task – entity linking without text support
● Identify missing entities – then cluster (2011)
Entity Linking: Evaluation
Name mention – document pairs
● Accuracy (micro) = num correct / num queries
● Accuracy (macro) = average of per-entity accuracies, grouping queries by entity (2009)

set         queries   NIL    genre        % NIL
eval 2009   3904      2229   news         0.571
train 2010  1500      426    web          0.284
eval 2010   2250      1230   news + web   0.547
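The two accuracy measures above can be sketched as follows; the function names and the toy data are illustrative, not part of the official TAC scorer.

```python
# Micro vs. macro accuracy over entity-linking queries.
# golds/preds hold KB ids or "NIL"; entities holds the gold entity
# each query is grouped under for the macro measure.

def micro_accuracy(golds, preds):
    """num correct / num queries."""
    correct = sum(g == p for g, p in zip(golds, preds))
    return correct / len(golds)

def macro_accuracy(golds, preds, entities):
    """Average the per-entity accuracies (queries grouped by gold entity)."""
    per_entity = {}
    for g, p, e in zip(golds, preds, entities):
        hits, total = per_entity.get(e, (0, 0))
        per_entity[e] = (hits + (g == p), total + 1)
    return sum(h / t for h, t in per_entity.values()) / len(per_entity)
```

Macro accuracy weights every entity equally, so it penalizes systems that only do well on a few frequent entities.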
uc3m EL system
● Supervised architecture
● Use similarities between objects or parts of them – avoids a wide feature vector

1) Candidate Entity Retrieval
2) Candidate Filtering
3) Validation (NIL classification)
1) Candidate Retrieval
● Each KB article is indexed using Lucene, using several indexes and fields:
  ● ALIAS – names plus aliases extracted from wiki slots: alias, abbreviation, website, etc.
  ● NER – named entities extracted from text: <id, ne, text>
  ● KB – entity slots <id, [(slot_name, slot_value)]>
  ● WIKIPEDIA – anchorList, category, redirect, outlinks, inlinks
● Each EL query is transformed into several Lucene queries – result: a [KB name, score] list
1) Candidate Retrieval: Example
● EL query: [Michael Jordan, eng-NG-31-100316-11150589]
● Lucene queries:
  ● name=Michael AND name=Jordan
  ● alias=Michael AND alias=Jordan
  ● abbr=Michael AND abbr=Jordan
● For each query, a ranked candidate list:
  ● [EL0989789, Michael Jordan, 25.00]
  ● [EL6565356, Michael B. Jordan, 25.00]
  ● [EL6565356, Michael I. Jordan, 25.00]
  ● [EL6565356, Michael-Hakim Jordan, 25.00]
  ● [EL6565356, Jordan, 20.00]
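The retrieval step can be sketched as below, with a small in-memory dictionary standing in for the Lucene indexes; the entity ids and the scoring rule are made up for illustration and are not the system's actual ranking function.

```python
# Toy stand-in for the per-field Lucene indexes: field -> {entity_id: name}.
KB_INDEX = {
    "name": {"EL0989789": "Michael Jordan",
             "EL1111111": "Michael I. Jordan"},
    "alias": {"EL0989789": "MJ"},
}

def retrieve_candidates(query_name, index):
    """Return [entity_id, entry_name, score] rows whose indexed field
    contains every token of the query name (AND semantics, as in the
    field queries above)."""
    tokens = query_name.lower().split()
    results = []
    for field, entries in index.items():
        for entity_id, entry_name in entries.items():
            entry_tokens = entry_name.lower().split()
            if all(t in entry_tokens for t in tokens):
                # crude score: reward shorter (more specific) entries
                score = len(tokens) / len(entry_tokens)
                results.append([entity_id, entry_name, round(score, 2)])
    return sorted(results, key=lambda r: -r[2])
```

A real Lucene index would return TF-IDF-style scores like the 25.00 values above; the point here is only the query fan-out and the merged [id, name, score] candidate list.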
2) Candidate Filtering
● Classification problem: decide whether (EL query + text, KB name + wiki text) is a good match
● In fact, rank candidates by prediction confidence
● Use similarity scores as features – normalized and unnormalized
● Use a cost-sensitive classifier
● Best results: model trees with linear regression leaves
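A minimal sketch of the ranking step: score each (query, candidate) feature vector with a linear model — a stand-in for the model trees with linear regression leaves — and sort by confidence. The feature names and weights are illustrative, not learned values from the system.

```python
# Illustrative weights; a trained model tree would pick a different linear
# model per leaf, but each leaf scores candidates the same way.
WEIGHTS = {"retrieval_score": 0.4, "context_sim": 0.4, "name_equal": 0.2}

def score(features):
    """Linear prediction confidence for one (query, candidate) pair."""
    return sum(WEIGHTS[k] * v for k, v in features.items())

def rank_candidates(candidates):
    """candidates: list of (entity_id, feature dict) -> sorted by confidence."""
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)
```

Ranking by regression confidence (rather than hard classification) lets the top candidate be passed on to the validation step even when no candidate is a clearly good match.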
Features
● Index-based scores: sim(EL query, KB entry), directly from the initial retrieval
● Context-similarity scores: sim(document, wiki text) or sim(document, slots)
● Name-similarity scores: sim(EL query name, KB entry name) – more expensive: equal, QcontainsE, EcontainsQ, Jaro, Jaro-Winkler, SLIM (based on SecondString)
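The name-similarity features can be sketched as below: the equal and containment tests plus a plain Jaro score. Jaro-Winkler and SLIM (from the SecondString library) are omitted here, and the feature names are this sketch's, not the system's.

```python
def jaro(s, t):
    """Jaro string similarity in [0, 1]."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    match_dist = max(ls, lt) // 2 - 1
    s_matched, t_matched = [False] * ls, [False] * lt
    matches = 0
    for i, c in enumerate(s):                  # count matching characters
        lo, hi = max(0, i - match_dist), min(i + match_dist + 1, lt)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == c:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    k = trans = 0                              # count transpositions
    for i in range(ls):
        if s_matched[i]:
            while not t_matched[k]:
                k += 1
            if s[i] != t[k]:
                trans += 1
            k += 1
    trans //= 2
    return (matches / ls + matches / lt + (matches - trans) / matches) / 3

def name_features(q, e):
    """Name-similarity feature vector for query name q and KB entry name e."""
    return {
        "equal": float(q == e),
        "q_contains_e": float(e in q),
        "e_contains_q": float(q in e),
        "jaro": jaro(q, e),
    }
```

For the "Reserve Bank" example, EcontainsQ fires for both "Reserve Bank of India" and "Reserve Bank of Australia", which is why these features alone cannot disambiguate and context similarity is still needed.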
3) Validation
● Classification: is the selected candidate good enough, or NIL?
● Positive examples – correct candidate examples
● Negative examples – top-ranked entities for queries that have no link in the KB
● Balanced dataset
● Best classifier: logistic regression
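The NIL decision reduces to thresholding a logistic model's probability. Below is a sketch with hand-set weights standing in for the trained logistic regression; the two features and their values are illustrative.

```python
import math

# Illustrative coefficients, not trained values from the system.
BIAS, W_CONF, W_NAME = -2.0, 3.0, 1.5

def nil_or_link(entity_id, confidence, name_sim):
    """Accept the top-ranked candidate if the logistic probability of a
    correct link reaches 0.5, otherwise answer NIL."""
    z = BIAS + W_CONF * confidence + W_NAME * name_sim
    p = 1.0 / (1.0 + math.exp(-z))
    return entity_id if p >= 0.5 else "NIL"
```

With over half of the 2010 evaluation queries being NIL, getting this threshold right matters as much as the ranking itself.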
EL results – main
● Influence of domain?
● GPE entities are particularly difficult

             news   web    news+web   Highest   Median
750   ORG    0.69   0.67   0.67       0.85      0.68
749   GPE    0.52   0.53   0.51       0.80      0.60
751   PER    0.82   0.76   0.85       0.96      0.85
2250  ALL    0.67   0.65   0.68       0.87      0.69

             news   web    news+web   Highest   Median
2250  ALL    0.67   0.65   0.68       0.87      0.69
1020  noNIL  0.51   0.59   0.49
1230  NIL    0.81   0.70   0.82

EL results – pilot w/o text
● Including name similarity scores helped

             news (main)   news   +n-sim NIL   +n-sim all
2250  ALL    0.67          0.58   0.66         0.70
1020  noNIL  0.51          0.35   0.40         0.47
1230  NIL    0.81          0.77   0.88         0.88
EL systems comparison
● Prior on link probability / popularity (Stanford-UBC 2009, LCC 2010, Microsoft 2011)
● Learning-to-rank algorithms: ListNet (CUNY 2011)
● Expand queries: acronym expansion / coreference (NUS 2011)
● Unsupervised system – entity co-occurrence + PageRank (WebTLab 2010)
● Inductive EL – first cluster, then link (LCC 2011)
● Collective entity linking (Microsoft 2011)
Conclusion
● Supervised EL system
● Influence of training size – beware of the training data distribution
● Consider name similarities, even for reranking
● Improve initial candidate retrieval
● Perform collective entity linking
● Efficiency?

Related tasks
● Cluster documents mentioning entities
● Entity coreference – document and cross-document
● Add missing links between Wikipedia pages
● Link entities to matching Wikipedia articles