Overview of TAC-KBP2017 13 Languages Entity Discovery and Linking

Heng Ji, Xiaoman Pan, Boliang Zhang, Joel Nothman, James Mayfield, Paul McNamee and Cash Costello

[email protected]

Thanks to KBP2016 Organizing Committee

Overview Paper: http://nlp.cs.rpi.edu/kbp2017.pdf

Goals and The Task

Cross-lingual Entity Discovery and Linking

Where are We Now: Awesome as Usual

§ Great participation (24 teams)
§ Improved Quality
  § Almost perfect linking accuracy for linkable mentions (?)
  § Almost perfect NIL clustering (?)
  § Chinese EDL 4% better than English EDL
§ Improved Portability
  § 5 types of entities → 16,000 types
  § 1-3 languages → 3,000 languages
  § Scarce KBs (GeoNames, World Factbook, Name List)
§ Improved Scalability
  § 90,000 documents

The Tasks

• Input
  o A set of multi-lingual text documents (main task: English, Chinese and Spanish)
• Output
  o Document ID, mention ID, head, offsets
  o Entity type: GPE, ORG, PER, LOC, FAC
  o Mention type: name, nominal
  o Reference KB link entity ID, or NIL cluster ID
  o Confidence value
• A new pilot study on 10 low-resource languages
  o Polish, Chechen, Albanian, Swahili, Kannada, Yoruba, Northern Sotho, Nepali, Kikuyu and Somali
  o No NIL clustering
  o No FAC
  o No Nominal
  o KB: 03/05/16 Wikipedia dump instead of BaseKB
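
The output fields listed on this slide can be pictured as one record per mention. The sketch below is only an illustration of that structure: the class name, field names, the 0-1 confidence range and the example values are assumptions, and it is not the official TAC-KBP submission format.

```python
from dataclasses import dataclass
from typing import Optional

# Entity and mention types as listed on the slide (main tri-lingual task).
ENTITY_TYPES = {"GPE", "ORG", "PER", "LOC", "FAC"}
MENTION_TYPES = {"name", "nominal"}


@dataclass
class MentionRecord:
    """One system output row; all field names are hypothetical."""
    doc_id: str
    mention_id: str
    head: str                    # mention head string
    start: int                   # character offsets in the document
    end: int
    entity_type: str
    mention_type: str
    kb_id: Optional[str]         # reference KB entity ID, if linkable
    nil_cluster: Optional[str]   # NIL cluster ID, if not linkable
    confidence: float            # assumed here to lie in [0, 1]

    def validate(self) -> None:
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")
        if self.mention_type not in MENTION_TYPES:
            raise ValueError(f"unknown mention type: {self.mention_type}")
        if self.start > self.end:
            raise ValueError("start offset must not exceed end offset")
        # A mention is either linked to the KB or placed in a NIL cluster.
        if (self.kb_id is None) == (self.nil_cluster is None):
            raise ValueError("exactly one of kb_id / nil_cluster must be set")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence outside the assumed [0, 1] range")


# Example with made-up identifiers.
record = MentionRecord("doc_001", "m_17", "Addis Ababa", 120, 130,
                       "GPE", "name", kb_id="kb_0001", nil_cluster=None,
                       confidence=0.92)
record.validate()
```

For the 10-language pilot the same record would apply with FAC, nominal mentions and NIL cluster IDs left unused, since the slide excludes them.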

Evaluation Measures

• CEAFmC+: end-to-end metric for extraction, linking and clustering
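
To make the clustering part of this metric concrete, here is a minimal sketch of plain mention-based CEAF (CEAFm): gold and system clusters are aligned one-to-one so that the number of shared mentions is maximal, and precision and recall are that overlap divided by the number of system and gold mentions. This is a simplified illustration, not the official scorer: it uses brute-force search (only practical for a handful of clusters), and it assumes each mention is already encoded as a single hashable item, for example a tuple of span, type and link, which is roughly what the typed "C+" variant requires for two mentions to count as matching.

```python
from itertools import permutations


def ceaf_m(gold_clusters, sys_clusters):
    """Return (precision, recall, F1) for mention-based CEAF.

    Brute-force alignment; suitable only for small toy inputs.
    """
    gold = [set(c) for c in gold_clusters]
    system = [set(c) for c in sys_clusters]

    # Pad the shorter side with empty clusters so that every permutation
    # describes a complete one-to-one alignment.
    n = max(len(gold), len(system))
    gold += [set() for _ in range(n - len(gold))]
    system += [set() for _ in range(n - len(system))]

    # Best total overlap over all one-to-one alignments.
    best_overlap = max(
        sum(len(g & s) for g, s in zip(gold, perm))
        for perm in permutations(system)
    )

    n_gold = sum(len(c) for c in gold)
    n_sys = sum(len(c) for c in system)
    precision = best_overlap / n_sys if n_sys else 0.0
    recall = best_overlap / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the system puts mention m3 into the wrong cluster.
gold = [{"m1", "m2", "m3"}, {"m4"}]
system = [{"m1", "m2"}, {"m3", "m4"}]
p, r, f = ceaf_m(gold, system)
```

In the toy example the best alignment shares 3 of 4 mentions, so precision, recall and F1 are all 0.75. A production scorer would replace the permutation search with an assignment algorithm such as Kuhn-Munkres.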

Data Annotation and Resources

• Tri-lingual EDL details in LDC talk and resource overview paper (Getman et al., 2017)
• 10 Languages Pilot (Silver-standard+ prepared by RPI and JHU Chinese Rooms, adjudicated annotations by five annotators)
• Tools and Reading List
  o http://nlp.cs.rpi.edu/kbp/2017/tools.html
  o http://nlp.cs.rpi.edu/kbp/2017/elreading.html

Window 1 Tri-lingual EDL (part of Cold-Start++ KBP) Participants

Window 1 Tri-lingual EDL (part of Cold-Start++ KBP) Performance (Top team = TinkerBell)

Window 2 Tri-lingual EDL Participants (Top team = TAI)

Window 2 Tri-lingual EDL Performance (top team = TAI)

• Is Tri-lingual EDL Solved?
  o Almost perfect linking accuracy for linkable mentions (75.9 vs. 76.1)
  o Almost perfect NIL clustering (67.8 vs. 67.4)
• perfect name/nominal coreference + cross-doc clustering

Comparison on Three Languages

Best F-score   Extraction   Extraction+Linking   Extraction+Linking+Clustering
English        81.1%        68.4%                66.3%
Chinese        77.3%        71.0%                70.4%
Spanish        76.7%        65.0%                64.8%
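
Reading the three columns as a pipeline, the difference between adjacent columns shows how many F-score points are lost when linking and then clustering are added. The sketch below only redoes that arithmetic on the numbers in the table; the variable and function names are illustrative, and since each cell is a best score that may come from a different system, the differences are indicative rather than per-system losses.

```python
# Best F-scores from the table: (extraction, +linking, +linking+clustering).
best_f = {
    "English": (81.1, 68.4, 66.3),
    "Chinese": (77.3, 71.0, 70.4),
    "Spanish": (76.7, 65.0, 64.8),
}


def stage_drops(scores):
    """Absolute F-score points lost when adding linking, then clustering."""
    extraction, linking, clustering = scores
    return round(extraction - linking, 1), round(linking - clustering, 1)


for language, scores in best_f.items():
    link_drop, cluster_drop = stage_drops(scores)
    print(f"{language}: {link_drop} points lost at linking, "
          f"{cluster_drop} at clustering")

# End-to-end gap between Chinese and English.
chinese_vs_english = round(best_f["Chinese"][2] - best_f["English"][2], 1)
```

The end-to-end gap comes out at 4.1 points, which matches the "Chinese EDL 4% better than English EDL" remark earlier in the deck; most of it arises at the linking stage, where English loses 12.7 points and Chinese 6.3.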

10 Languages EDL Pilot Participants

• RPI (organizer): 10 languages
• JHU HLT-COE (co-organizer): 5 languages
• IBM: 10 languages

10 Languages EDL Pilot Top Performance

Data                                 Language         Name Tagging   Name Tagging+Linking
Gold (from Reflex or LORELEI)        Chechen          55.4%          52.6%
                                     Somali           78.5%          56.0%
                                     Yoruba           49.5%          35.6%
Silver+ (from Chinese Rooms)         Albanian         75.9%          57.0%
                                     Kannada          58.4%          44.0%
                                     Nepali           65.0%          50.8%
                                     Polish           63.4%          45.3%
                                     Swahili          74.2%          65.3%
Silver (~consistency instead of F)   Kikuyu           88.7%          88.7%
                                     Northern Sotho   90.8%          85.5%
                                     All              74.8%          65.9%

• Agreement between Silver+ and Gold is between 72%-85%

What’s New and What Works

(Secret Weapons)

Joint Modeling

• Joint Mention Extraction and Linking (Sil et al., 2013)
  o MSRA team (Luo et al., 2017) designed a single CRF model for joint name tagging and entity linking and achieved a 1.3% name tagging F-score gain
• Joint Word and Entity Embeddings (Cao et al., 2017)
  o CMU (Ma et al., 2017) and RPI (Zhang et al., 2017b)

Return of Supervised Models: Name Tagging

• Rich resources for English, Chinese and Spanish
  o 2009–2017 annotations: EDL for 1,500+ documents and EL for 5,000+ query entities
  o ACE, CoNLL, OntoNotes, ERE, LORELEI, …
• Supervised models have become popular again
• Name tagging
  o Distributional semantic features are more effective than symbolic semantic features (Celebi and Ozgur, 2017)
  o Combining them significantly enhanced both the quality and the robustness to noise for low-resource languages (Zhang et al., 2017)
• Select the training data which is most similar to the evaluation set (Zhao et al., 2017; Bernier-Colborne et al., 2017)
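
The last bullet, choosing training data that resembles the evaluation set, can be sketched with a very simple similarity measure. The slides do not say how the cited systems measured similarity, so everything below is an assumption made for illustration: documents are compared by bag-of-words cosine similarity against the pooled evaluation text, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_training_docs(train_docs, eval_docs, k):
    """Keep the k training documents most similar to the evaluation set."""
    eval_profile = Counter()
    for doc in eval_docs:
        eval_profile.update(bag_of_words(doc))
    ranked = sorted(
        train_docs,
        key=lambda doc: cosine(bag_of_words(doc), eval_profile),
        reverse=True,
    )
    return ranked[:k]


train = [
    "president visited capital city",
    "team won football match",
    "parliament passed budget law",
]
evaluation = ["president addressed parliament in capital"]
selected = select_training_docs(train, evaluation, k=2)
```

With these toy inputs the two politics-related documents are kept and the sports document is dropped. A real system would more likely compare embeddings, topics or genre labels than raw token counts.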

Incorporate Non-traditional Linguistic Knowledge to make DNN more robust to noise

• Zhang et al., 2017

Return of Supervised Models: Entity Linking

• (Sil et al., 2017; Moreno and Grau, 2017; Yang et al., 2017) returned to supervised models to rank candidate entities for entity linking
• The new neural entity linker designed by IBM (Sil et al., 2017) achieved higher entity linking accuracy than state-of-the-art on the KBP2010 dataset
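
Ranking candidate entities with a supervised model boils down to scoring each candidate from features and picking the best one, with a fallback to NIL. The sketch below shows only that skeleton. It is not any cited team's system: the feature names, the hand-set weights (which a supervised model would learn from annotated data), the NIL threshold and the KB identifiers are all hypothetical.

```python
def score(features, weights):
    """Linear score: weighted sum of the candidate's feature values."""
    return sum(weights[name] * value for name, value in features.items())


def rank_candidates(candidates, weights, nil_threshold):
    """Return (best KB ID or "NIL", candidates sorted by descending score)."""
    ranked = sorted(
        candidates,
        key=lambda c: score(c["features"], weights),
        reverse=True,
    )
    if not ranked or score(ranked[0]["features"], weights) < nil_threshold:
        return "NIL", ranked
    return ranked[0]["kb_id"], ranked


# Placeholder weights; in a supervised linker these are learned.
weights = {"prior": 1.0, "name_sim": 2.0, "context": 3.0}

# Candidates for the mention "Georgia" in a text about Atlanta.
candidates = [
    {"kb_id": "kb:Georgia_(country)",
     "features": {"prior": 0.6, "name_sim": 1.0, "context": 0.1}},
    {"kb_id": "kb:Georgia_(U.S._state)",
     "features": {"prior": 0.4, "name_sim": 1.0, "context": 0.8}},
]

best, ranked = rank_candidates(candidates, weights, nil_threshold=1.0)
```

Here the context feature outweighs the popularity prior, so the U.S. state is ranked first. Neural linkers such as the one mentioned above replace the hand-written features and linear score with learned representations, but the rank-then-threshold shape is the same.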

Cross-lingual Common Semantic Space

• Common Space (Zhang et al., 2017)
• Zero-shot Transfer Learning (Sil et al., 2017)

Remaining Challenges

A Typical Neural Name Tagger

Duplicability Problem about DNN

§ Many teams (Zhao et al., 2017; Bernier-Colborne et al., 2017; Zhang et al., 2017b; Li et al., 2017; Mendes et al., 2017; Yang et al., 2017) trained this framework
  § the same training data (KBP2015 and KBP2016 EDL corpora)
  § the same set of features (word and entity embeddings)
§ Very different results
  § ranked at the 1st, 2nd, 4th, 11th, 15th, 16th, 21st
  § mention extraction F-score gap between the best system and the worst system is about 24%
§ Reasons?
  § hyper-parameter tuning?
  § additional training data? dictionaries? embedding learning?
§ Solutions
  § Submit and share systems
  § More qualitative analysis

Domain Gap

Name Tagger F-score   Trained from Chinese-Room News   Trained from Wikipedia Markups
Albanian              75.9%                            54.9%
Kannada               58.4%                            32.3%
Nepali                65.0%                            31.9%
Polish                55.7%                            63.4%
Swahili               74.2%                            66.4%

• Topic/Domain selection is more important than the size of data
• Tested on news, with ground truth adjudicated from annotations by five annotators through two Chinese Rooms

Glass-Ceiling of Chinese Room

• 72%-85% agreement with Gold-Standard for various languages
• What NIs can do but non-native speakers cannot:
  • ORGs, especially abbreviations, e.g., ኢህወዴግ (Ethiopian People's Liberation Front); ኮብራ (Cobra)
  • Uncommon persons, e.g., ባባ መዳን (Baba Medan)
• Generally low recall

Russian Name Tagging

• Reaching the glass ceiling of what non-native speakers can understand about foreign languages; difficult to do error analysis and understand remaining challenges
• Need to incorporate language-specific resources and features
• Move human labor from data annotation to interface development to some extent

Background Knowledge Discovery

• Requires deep background knowledge discovery from English Wikipedia and large English corpora: surface lexical/embedding features are not enough
  o Before 2000, the regional capital of Oromia was Addis Ababa, also known as "Finfinne".
  o Oromo Liberation Front: The armed Oromo units in the Chercher Mountains were adopted as the military wing of the organization, the Oromo Liberation Army or OLA.
  o Jimma Horo may refer to: Jimma Horo, East Welega, former woreda (district) in East Welega Zone, Oromia Region, Ethiopia; Jimma Horo, Kelem Welega, current woreda (district) in Kelem Welega Zone, Oromia Region, Ethiopia
  o Somali (Somali region) != Somalia != Somaliland
    • The Ethiopian Somali Regional State (Somali: Dawlada Deegaanka Soomaalida Itoobiya) is the easternmost of the nine ethnic divisions (kililoch) of Ethiopia.
    • Somalia, officially the Federal Republic of Somalia (Somali: Jamhuuriyadda Federaalka Soomaaliya), is a country located in the Horn of Africa.
    • Somaliland (Somali: Somaliland), officially the Republic of Somaliland (Somali: Jamhuuriyadda Somaliland), is a self-declared state internationally recognised as an autonomous region of Somalia.

Looking Ahead

Multi-Media EDL

Multi-Media EDL

• How to build a common cross-media schema?
• What type of entity mentions should we focus on?
• How much inference is needed? NYC?

Streaming Mode

• Perform extraction, linking and clustering in real time
• Dynamically adjust measures and construct/update KB
• Clustering must be more efficient than agglomerative clustering techniques that require O(n²) space and time
• Smarter collective inference strategy is required to take advantage of evidence in both local context and global context
• Encourage imitation learning, incremental learning, reinforcement learning
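
One way to picture clustering that avoids the O(n²) cost is a single-pass scheme in which each incoming mention is assigned to a cluster by a dictionary lookup instead of being compared with every earlier mention. The sketch below is a deliberately naive illustration of that idea and not a method proposed in the slides: the class name is hypothetical, and using the exact normalized name plus entity type as the cluster key is an assumption that a real system would replace with linking evidence or approximate matching.

```python
class StreamingClusterer:
    """Assign mentions to clusters one at a time via a hash lookup."""

    def __init__(self):
        self._key_to_cluster = {}   # (normalized name, type) -> cluster index
        self.clusters = []          # cluster index -> list of mention IDs

    @staticmethod
    def _key(name, entity_type):
        # Case-fold and collapse whitespace; a crude stand-in for real
        # name normalization.
        return (" ".join(name.lower().split()), entity_type)

    def add(self, mention_id, name, entity_type):
        """Add one mention and return the index of its cluster."""
        key = self._key(name, entity_type)
        cluster_id = self._key_to_cluster.get(key)
        if cluster_id is None:
            # First time this key is seen: open a new cluster.
            cluster_id = len(self.clusters)
            self._key_to_cluster[key] = cluster_id
            self.clusters.append([])
        self.clusters[cluster_id].append(mention_id)
        return cluster_id


clusterer = StreamingClusterer()
clusterer.add("m1", "Addis Ababa", "GPE")
clusterer.add("m2", "addis  ababa", "GPE")   # joins m1's cluster
clusterer.add("m3", "Oromia", "GPE")         # opens a new cluster
```

Each mention costs one dictionary lookup on average, so time and space grow linearly with the number of mentions. The price is that name variants and ambiguous names are not handled, which is where the collective inference mentioned above would come in.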

Extended Entity Types

• Extend the number of entity types from five to thousands, so EDL can be utilized to enhance other NLP tasks such as Machine Translation
• 1,000 entity types have a clean schema and enough entities in Wikipedia; the English tokens in Wikipedia with these entity types account for 10% of the vocabulary

Resources and Evaluation

• Prepare lots of development and test sets in lots of languages, as gold-standard to validate and measure our research progress
• Submit systems instead of results

EDL Systems, Data and Resources

• Resources and Tools
  o http://nlp.cs.rpi.edu/kbp/2017/tools.html
• Re-trainable RPI Cross-lingual EDL Systems for 282 Languages:
  o API: http://blender02.cs.rpi.edu:3300/elisa_ie/api
  o Data, resources and trained models: http://nlp.cs.rpi.edu/wikiann/
  o Demos: http://blender02.cs.rpi.edu:3300/elisa_ie
  o Heatmap demos: http://blender02.cs.rpi.edu:3300/elisa_ie/heatmap
• Share yours!

Thank you for a wonderful decade!

Cross-lingual Entity Discovery and Linking

§ http://blender02.cs.rpi.edu:3300/elisa_ie/heatmap

Where We Have Been

Grow with DEFT       2006-2011                             2012-2017
Mention Extraction   Human (most)                          Automatic
NIL Clustering       None                                  64 methods
Foreign Languages    Chinese (5%-10% lower than English)   System for 282 languages (Chinese/Spanish comparable to/outperform English); research toward 3,000 languages
Document Size        -                                     500 → 90,000 documents
Genre                News, web blog                        News, Discussion Forum, Web blog, Tweets
Entity Types         PER, GPE, ORG                         PER, GPE, ORG, LOC, FAC, hundreds of fine-grained types for typing
Mention Types        Name or all concepts (most)           Name, Nominal, Pronoun (for BeST)
KB                   Wikipedia                             Freebase → List only
Training Data        20,000 queries (entity mentions)      500 → 0 documents; unsupervised linking comparable to supervised linking
#(Good) Papers       62                                    110 (new KBP track at ACL); 6 tutorials at top conferences

Technical Term EDL Examples

• P=69.6%, R=61.2%, F=65.1% on English
• Mandarin and Russian Examples

English                      Mandarin       Russian
Intermediate value theorem   介值定理       Теорема о промежуточном значении
p-adic number                p进数          P-адичне число
Virtual memory               虚拟内存       Виртуальная память
Nonlinear filter             非线性滤波器   Нелинейный фильтр
Visual odometry              视觉测距       Визуальная одометрия
Wandering set                游荡集         Неблуждающее множество
Photon                       光子           Фотон
Support vector machine       支持向量机     Метод опорных векторов
Neuroscience                 神经科学       Нейронауки
Heavy water                  重水           Тяжёлая вода
Bus (computing)              总线           Шина

Many are Interesting and Useful for MT

Most Challenging Types for MT              # English entities in Wikipedia   Examples
Quantities                                 7,992                             "30 kilometros" to "30 kilometers"
Dates                                      962,838                           "21 enero 2004" to "january 21, 2004"
English Cognates (e.g., technical terms)   20,365                            "метод опорных векторов" to "support vector machine"
Specified disaster words                                                     "地震" to "earthquake"
Person Titles                              37,722                            "Bosh Vazir" to "prime minister"
Colors                                     27,678                            "màu xanh da trời" to "blue"
Holidays                                   2,358                             "день матері" to "mothers day"

Background Knowledge Discovery

• EPRDF = OPDO + ANDM + SEPDM + TPLF
  • EPRDF: Ethiopian People's Revolutionary Democratic Front, also called Ehadig.
  • OPDO: Oromo Peoples' Democratic Organization
  • ANDM: Amhara National Democratic Movement
  • SEPDM: Southern Ethiopian People's Democratic Movement
  • TPLF: Tigrayan People's Liberation Front, also called Weyane or Second Weyane, perhaps because there was a rebellion group called Woyane/Weyane in the Tigray province in 1943
• Qeerroo is not an organization although it has its own website:
  • The overwhelming belief is that its leaders are handpicked by the TPLF puppet-masters, and the new generation of Oromo youth – known as the ‘Qeerroo’ – have seen that it is business as usual after the latest reform.
  • The Qeerroo, also called the Qubee generation, first emerged in 1991 with the participation of the Oromo Liberation Front (OLF) in the transitional government of Ethiopia. In 1992 the Tigrayan-led minority regime pushed the OLF out of government and the activist networks of Qeerroo gradually blossomed as a form of Oromummaa or Oromo nationalism.
  • Today the Qeerroo are made up of Oromo youth. These are predominantly students from elementary school to university, organising collective action through social media. It is not clear what kind of relationship exists between the group and the OLF. But the Qeerroo clearly articulate that the OLF should replace the Tigrayan-led regime and recognise the Front as the origin of Oromo nationalism.

Progress from Window 1 to Window 2

Best F-score   Extraction   Extraction+Linking   Extraction+Linking+Clustering
Window 1       68.8%        56.0%                54.3%
Window 2       76.7%        67.8%                67.4%

