César de Pablo Sanchez
Stat rosa pristina nomine,
nomina nuda tenemus
("The rose of old remains only in its name; we hold naked names.")
1. Task definition: KBP and EL
2. System description
3. Results
4. Conclusions
TAC-KBP 2010 - Combining Similarities and Regression Classifiers for Entity Linking
Overview of previous work
Drug Drug Interactions
Relation extraction
Anaphora resolution
OPINATOR - Opinion Mining
Sentiment-loaded dictionaries
Sentiment classification
Opinion summarization
Search/Navigation
Knowledge acquisition: list candidates for the Greek elections in June.
What party does Tsipras represent?
How old is he?
What does Syriza mean?
How old is Samaras?
1. Task definition: KBP and EL
2. System description
3. Results
4. Conclusions
Knowledge Base Population
César de Pablo, Juan Perea, Paloma Martínez
● KB built from a Wikipedia dump (2008): title, name, type, id; wiki text; several facts as [name, value] pairs
● 1.3 million English newswire documents, published between 1994 and 2008
● 488,240 web pages
IE = KBP?
● Accurate extraction of facts – not annotation
● Learn facts from a corpus – repetition is not important, but it helps confidence
● Asserting wrong information is bad
● Scalability
● Provenance

QA = KBP?
● Slots are fixed but targets change
● Leverage knowledge from the KB
● Global resolution – ground information to the KB
● Avoid contradictions
● Detect novel information
Task at TAC-KBP
● Entity Linking – grounding entity mentions in documents to KB entries
● Slot Filling – learning attributes about target entities

Task 1: Slot Filling
Task 2: Entity Linking
Entity Linking: Example
For a name string and a document, determine which entity in the KB, if any, is being referred to by the name string.

<query id="EL006455"><name>Reserve Bank</name><docid>eng-NG-31-100316-11150589</docid><entity>E0700143</entity></query>
<query id="EL06472"><name>Reserve Bank</name><docid>eng-NG-31-142262-10040510</docid><entity>E0421510</entity></query>

… E0421510: Reserve Bank of Australia … E0700143: Reserve Bank of India …
NIL
Entity Linking: Challenges
Focus on confusable entities:
● Ambiguous names: Reserve Bank, Alan Jackson, Fonda
● Multiple name variants: Saddam Hussain, Saddam Hussein
● Acronym expansion: CDC, AZ
● Variety of cases: Centre for Disease Control, European Centre for Disease Control, AZ, Arizona, Astra Zeneca
● Pilot task – entity linking without text support
● Identify missing entities – then cluster (2011)
Entity Linking: Evaluation
Name mention – document pairs
● Accuracy (micro) = num correct / num queries
● Accuracy (macro) = average of per-entity accuracies, grouping queries by entity (2009)

set         queries   NIL    genre        % NIL
eval 2009   3904      2229   news         0.571
train 2010  1500      426    web          0.284
eval 2010   2250      1230   news + web   0.547
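The two accuracy measures above can be sketched as follows; the function names and the toy data are illustrative, not part of the official TAC scorer.

```python
# Micro vs. macro accuracy over entity-linking queries.
# golds/preds hold KB ids or "NIL"; entities holds the gold entity
# each query is grouped under for the macro measure.

def micro_accuracy(golds, preds):
    """num correct / num queries."""
    correct = sum(g == p for g, p in zip(golds, preds))
    return correct / len(golds)

def macro_accuracy(golds, preds, entities):
    """Average the per-entity accuracies (queries grouped by gold entity)."""
    per_entity = {}
    for g, p, e in zip(golds, preds, entities):
        hits, total = per_entity.get(e, (0, 0))
        per_entity[e] = (hits + (g == p), total + 1)
    return sum(h / t for h, t in per_entity.values()) / len(per_entity)
```

Macro accuracy weights every entity equally, so it penalizes systems that only do well on a few frequent entities.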
uc3m EL system
● Supervised architecture
● Use similarities between objects or parts of them – avoids a wide feature vector

1) Candidate Entity Retrieval
2) Candidate Filtering
3) Validation (NIL classification)
1) Candidate Retrieval
● Each KB article is indexed using Lucene, using several indexes and fields:
  ● ALIAS – names plus aliases extracted from wiki slots: alias, abbreviation, website, etc.
  ● NER – named entities extracted from text: <id, ne, text>
  ● KB – entity slots <id, [(slot_name, slot_value)]>
  ● WIKIPEDIA – anchorList, category, redirect, outlinks, inlinks
● Each EL query is transformed into several Lucene queries – result: a [KB name, score] list
1) Candidate Retrieval: Example
● EL query: [Michael Jordan, eng-NG-31-100316-11150589]
● Lucene queries:
  ● name=Michael AND name=Jordan
  ● alias=Michael AND alias=Jordan
  ● abbr=Michael AND abbr=Jordan
● For each query, a ranked candidate list:
  ● [EL0989789, Michael Jordan, 25.00]
  ● [EL6565356, Michael B. Jordan, 25.00]
  ● [EL6565356, Michael I. Jordan, 25.00]
  ● [EL6565356, Michael-Hakim Jordan, 25.00]
  ● [EL6565356, Jordan, 20.00]
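The retrieval step can be sketched as below, with a small in-memory dictionary standing in for the Lucene indexes; the entity ids and the scoring rule are made up for illustration and are not the system's actual ranking function.

```python
# Toy stand-in for the per-field Lucene indexes: field -> {entity_id: name}.
KB_INDEX = {
    "name": {"EL0989789": "Michael Jordan",
             "EL1111111": "Michael I. Jordan"},
    "alias": {"EL0989789": "MJ"},
}

def retrieve_candidates(query_name, index):
    """Return [entity_id, entry_name, score] rows whose indexed field
    contains every token of the query name (AND semantics, as in the
    field queries above)."""
    tokens = query_name.lower().split()
    results = []
    for field, entries in index.items():
        for entity_id, entry_name in entries.items():
            entry_tokens = entry_name.lower().split()
            if all(t in entry_tokens for t in tokens):
                # crude score: reward shorter (more specific) entries
                score = len(tokens) / len(entry_tokens)
                results.append([entity_id, entry_name, round(score, 2)])
    return sorted(results, key=lambda r: -r[2])
```

A real Lucene index would return TF-IDF-style scores like the 25.00 values above; the point here is only the query fan-out and the merged [id, name, score] candidate list.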
2) Candidate Filtering
● Classification problem: decide whether (EL query + text, KB name + wiki text) is a good match
● In fact, rank candidates by prediction confidence
● Use similarity scores as features – normalized and unnormalized
● Use a cost-sensitive classifier
● Best results: model trees with linear regression leaves
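A minimal sketch of the ranking step: score each (query, candidate) feature vector with a linear model — a stand-in for the model trees with linear regression leaves — and sort by confidence. The feature names and weights are illustrative, not learned values from the system.

```python
# Illustrative weights; a trained model tree would pick a different linear
# model per leaf, but each leaf scores candidates the same way.
WEIGHTS = {"retrieval_score": 0.4, "context_sim": 0.4, "name_equal": 0.2}

def score(features):
    """Linear prediction confidence for one (query, candidate) pair."""
    return sum(WEIGHTS[k] * v for k, v in features.items())

def rank_candidates(candidates):
    """candidates: list of (entity_id, feature dict) -> sorted by confidence."""
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)
```

Ranking by regression confidence (rather than hard classification) lets the top candidate be passed on to the validation step even when no candidate is a clearly good match.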
Features
● Index-based scores: sim(EL query, KB entry), directly from the initial retrieval
● Context-similarity scores: sim(document, wiki text) or sim(document, slots)
● Name-similarity scores: sim(EL query name, KB entry name) – more expensive: equal, QcontainsE, EcontainsQ, Jaro, Jaro-Winkler, SLIM (based on SecondString)
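The name-similarity features can be sketched as below: the equal and containment tests plus a plain Jaro score. Jaro-Winkler and SLIM (from the SecondString library) are omitted here, and the feature names are this sketch's, not the system's.

```python
def jaro(s, t):
    """Jaro string similarity in [0, 1]."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    match_dist = max(ls, lt) // 2 - 1
    s_matched, t_matched = [False] * ls, [False] * lt
    matches = 0
    for i, c in enumerate(s):                  # count matching characters
        lo, hi = max(0, i - match_dist), min(i + match_dist + 1, lt)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == c:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    k = trans = 0                              # count transpositions
    for i in range(ls):
        if s_matched[i]:
            while not t_matched[k]:
                k += 1
            if s[i] != t[k]:
                trans += 1
            k += 1
    trans //= 2
    return (matches / ls + matches / lt + (matches - trans) / matches) / 3

def name_features(q, e):
    """Name-similarity feature vector for query name q and KB entry name e."""
    return {
        "equal": float(q == e),
        "q_contains_e": float(e in q),
        "e_contains_q": float(q in e),
        "jaro": jaro(q, e),
    }
```

For the "Reserve Bank" example, EcontainsQ fires for both "Reserve Bank of India" and "Reserve Bank of Australia", which is why these features alone cannot disambiguate and context similarity is still needed.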
3) Validation
● Classification: is the selected candidate good enough, or NIL?
● Positive examples – correct candidate examples
● Negative examples – top-ranked entities for queries that have no link in the KB
● Balanced dataset
● Best classifier: logistic regression
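The NIL decision reduces to thresholding a logistic model's probability. Below is a sketch with hand-set weights standing in for the trained logistic regression; the two features and their values are illustrative.

```python
import math

# Illustrative coefficients, not trained values from the system.
BIAS, W_CONF, W_NAME = -2.0, 3.0, 1.5

def nil_or_link(entity_id, confidence, name_sim):
    """Accept the top-ranked candidate if the logistic probability of a
    correct link reaches 0.5, otherwise answer NIL."""
    z = BIAS + W_CONF * confidence + W_NAME * name_sim
    p = 1.0 / (1.0 + math.exp(-z))
    return entity_id if p >= 0.5 else "NIL"
```

With over half of the 2010 evaluation queries being NIL, getting this threshold right matters as much as the ranking itself.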
EL results – main
● Influence of domain?
● GPE entities are particularly difficult

             news   web    news+web   Highest   Median
750   ORG    0.69   0.67   0.67       0.85      0.68
749   GPE    0.52   0.53   0.51       0.80      0.60
751   PER    0.82   0.76   0.85       0.96      0.85
2250  ALL    0.67   0.65   0.68       0.87      0.69

             news   web    news+web   Highest   Median
2250  ALL    0.67   0.65   0.68       0.87      0.69
1020  noNIL  0.51   0.59   0.49
1230  NIL    0.81   0.70   0.82

EL results – pilot w/o text
● Including name similarity scores helped

             news (main)   news   +n-sim NIL   +n-sim all
2250  ALL    0.67          0.58   0.66         0.70
1020  noNIL  0.51          0.35   0.40         0.47
1230  NIL    0.81          0.77   0.88         0.88
EL systems comparison
● Prior on link probability / popularity (Stanford-UBC 2009, LCC 2010, Microsoft 2011)
● Learning-to-rank algorithms: ListNet (CUNY 2011)
● Expand queries: acronym expansion / coreference (NUS 2011)
● Unsupervised system – entity co-occurrence + PageRank (WebTLab 2010)
● Inductive EL – first cluster, then link (LCC 2011)
● Collective entity linking (Microsoft 2011)
Conclusion
● Supervised EL system
● Influence of training size – beware of the training data distribution
● Consider name similarities, even for reranking
● Improve initial candidate retrieval
● Perform collective entity linking
● Efficiency?

Related tasks
● Cluster documents mentioning entities
● Entity coreference – document and cross-document
● Add missing links between Wikipedia pages
● Link entities to matching Wikipedia articles