Two-stage Named Entity Recognition using averaged perceptrons

Lars Buitinck, Maarten Marx
Information and Language Processing Systems, Informatics Institute
University of Amsterdam

17th Int'l Conf. on Applications of NLP to Information Systems
Buitinck, Marx Two-stage NER
Named Entity Recognition
- Find names in text and classify them as belonging to persons, locations, organizations, events, products or “miscellaneous”
- Use machine learning
Named Entity Recognition for Dutch
- State-of-the-art algorithm for Dutch by Desmet and Hoste (2011): voting classifiers with a GA to train weights
- Good training sets are just becoming available
- Many practitioners retrain the Stanford CRF-NER tagger
Overview
- Realize that NER is two problems in one: recognition and classification
- Pipeline solution with two classifiers
- Use custom feature sets for each
- Do not use a precompiled list of names (a “gazetteer”)
- Work at the sentence level (because of how the training sets are set up)
Recognition stage
- Token-level task: is a token the Beginning of, Inside, or Outside any entity name?
- Features:
  - Word window w_{i-2}, ..., w_{i+2}
  - POS tags for words in the window
  - Conjunctions of words and POS tags in the window, e.g. (w_{i-1}, p_{i-1})
  - Capitalization of tokens in the window
  - (Character) prefixes and suffixes of w_i and w_{i-1}
  - Regular expressions for digits, Roman numerals and punctuation
Classification stage
- Don’t do this at the token level; we know the entity spans!
- Input is the list of tokens considered an entity by the recognition stage
- Features:
  - The tokens we got from recognition
  - The four surrounding tokens
  - Their prefixes and suffixes up to length four
  - The capitalization pattern, as a string over the alphabet (L|U|O)*
  - The occurrence of capitalized tokens, digits and dashes in the entire sentence
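A minimal sketch of span-level feature extraction for this stage, including the (L|U|O)* capitalization pattern. The function and feature names are hypothetical; only the feature *kinds* come from the slide.

```python
def cap_pattern(tokens):
    """Capitalization pattern over the alphabet L/U/O: one symbol per
    token (Lower-initial, Upper-initial, Other). Illustrative sketch."""
    out = []
    for t in tokens:
        if t[:1].isupper():
            out.append("U")
        elif t[:1].islower():
            out.append("L")
        else:
            out.append("O")
    return "".join(out)

def entity_features(sentence, start, end):
    """Features for the candidate entity span sentence[start:end];
    a sketch with hypothetical feature names."""
    span = sentence[start:end]
    feats = [f"tok={t}" for t in span]          # the recognized tokens
    # The four surrounding tokens (two on each side, where present)
    for j in (start - 2, start - 1, end, end + 1):
        if 0 <= j < len(sentence):
            feats.append(f"ctx[{j - start}]={sentence[j]}")
    # Pre- and suffixes up to length four of the span tokens
    for t in span:
        for k in range(1, 5):
            feats += [f"pre{k}={t[:k]}", f"suf{k}={t[-k:]}"]
    feats.append(f"cap={cap_pattern(span)}")    # e.g. "UU" for a two-token name
    return feats
```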
Learning algorithm
- Use an averaged perceptron for both stages
- Learns an approximation of the max-margin solution (linear SVM)
- 40 training iterations
- Used the LBJ machine learning toolkit
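The paper used the LBJ (Learning Based Java) toolkit; the following is a minimal Python sketch of a multiclass averaged perceptron over sparse binary features, to show the idea, not to reproduce that implementation. The lazy-averaging trick accumulates `(update time - 1) * delta` so the final averaged weights can be recovered in one pass.

```python
from collections import defaultdict

class AveragedPerceptron:
    """Minimal multiclass averaged perceptron; an illustrative
    sketch, not the LBJ implementation used in the paper."""

    def __init__(self, classes):
        self.classes = list(classes)
        self.w = defaultdict(float)       # current weights, keyed (class, feature)
        self.totals = defaultdict(float)  # accumulator for averaging
        self.t = 0                        # example counter

    def score(self, feats, c):
        return sum(self.w[(c, f)] for f in feats)

    def predict(self, feats):
        return max(self.classes, key=lambda c: self.score(feats, c))

    def fit(self, data, iterations=40):   # 40 iterations, as on the slide
        for _ in range(iterations):
            for feats, y in data:
                self.t += 1
                guess = self.predict(feats)
                if guess != y:            # standard perceptron update
                    for f in feats:
                        self._update((y, f), +1.0)
                        self._update((guess, f), -1.0)
        # Replace final weights by their average over all time steps
        for k in list(self.w):
            self.w[k] -= self.totals[k] / max(self.t, 1)

    def _update(self, k, delta):
        self.w[k] += delta
        self.totals[k] += (self.t - 1) * delta
```

Averaging the weight vector over all updates is what gives the method its SVM-like stability: late, noisy updates are down-weighted relative to weights that held for most of training.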
Evaluation
- Aim for F1 score, as defined in the CoNLL 2002 shared task on NER
- Two corpora: CoNLL 2002 and a subset of SoNaR (courtesy of Desmet and Hoste)
- Compare against the Stanford tagger and Desmet and Hoste's algorithm
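The CoNLL metric is entity-level, not token-level: a predicted entity counts as correct only when both its boundaries and its type match the gold annotation exactly. A small sketch of that computation (the `(start, end, type)` triple representation is an assumption of this example):

```python
def conll_f1(gold, pred):
    """Entity-level F1 in the style of the CoNLL 2002 shared task:
    an entity is correct only if span and type match exactly.
    gold/pred are collections of (start, end, type) triples."""
    gold, pred = set(gold), set(pred)
    if not gold or not pred:
        return 0.0
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

This strictness is why a wrong entity *type* on a perfectly found span still costs both precision and recall.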
Results on CoNLL 2002
- 309,686 tokens containing 19,901 names, four categories
- 65% training, 22% validation and 12% test sets
- Stanford achieves F1 = 74.72; the "miscellaneous" category is hard (< 0.7)
- We achieve F1 = 75.14; the "organization" category is hard
Results on SoNaR
- New, large corpus with manual annotations
- Used a 200k-token subset of a preliminary version, with three-fold cross-validation
- State of the art is Desmet and Hoste (2011) with F1 = 84.44
- The best individual classifier from that paper (a CRF) gets 83.77
- Our system: 83.56
- Here, the “product” and “miscellaneous” categories are hard
Conclusion
- Near state-of-the-art performance from simple learners with good feature sets
- No gazetteers, so the system should be fairly reusable
- (Side conclusion: SoNaR is more easily learnable than CoNLL)
Future work
- Being integrated into UvA's xTAS text analysis pipeline
- Used to find entities in the Dutch Hansard corpus (forthcoming) and to link entities to Wikipedia
- The full SoNaR corpus is now available; a new evaluation is needed