Two-stage Named Entity Recognition using averaged perceptrons

Lars Buitinck, Maarten Marx
Information and Language Processing Systems, Informatics Institute
University of Amsterdam

17th Int'l Conf. on Applications of NLP to Information Systems
Buitinck, Marx Two-stage NER
Named Entity Recognition
- Find names in text and classify them as belonging to persons, locations, organizations, events, products or “miscellaneous”
- Use machine learning
Named Entity Recognition for Dutch
- State-of-the-art algorithm for Dutch by Desmet and Hoste (2011): voting classifiers with a GA to train weights
- Good training sets are just becoming available
- Many practitioners retrain the Stanford CRF-NER tagger
Overview
- Realize that NER is two problems in one: recognition and classification
- Pipeline solution with two classifiers
- Use custom feature sets for each
- Do not use a precompiled list of names (a “gazetteer”)
- Work at the sentence level (because of how the training sets are set up)
Recognition stage
- Token-level task: is a token the Beginning of, Inside, or Outside any entity name?
- Features:
  - Word window w_{i-2}, ..., w_{i+2}
  - POS tags for words in the window
  - Conjunctions of words and POS tags in the window, e.g. (w_{i-1}, p_{i-1})
  - Capitalization of tokens in the window
  - (Character) prefixes and suffixes of w_i and w_{i-1}
  - Regular expressions for digits, Roman numerals and punctuation
Classification stage
- Don’t do this at the token level; we know the entity spans!
- Input is the list of tokens considered an entity by the recognition stage
- Features:
  - The tokens we got from recognition
  - The four surrounding tokens
  - Their prefixes and suffixes up to length four
  - The capitalization pattern, as a string over the alphabet (L|U|O)*
  - The occurrence of capitalized tokens, digits and dashes in the entire sentence
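A minimal sketch of span-level feature extraction for this stage, including the (L|U|O)* capitalization pattern. The function and feature names are hypothetical; only the feature *kinds* come from the slide.

```python
def cap_pattern(tokens):
    """Capitalization pattern over the alphabet L/U/O: one symbol per
    token (Lower-initial, Upper-initial, Other). Illustrative sketch."""
    out = []
    for t in tokens:
        if t[:1].isupper():
            out.append("U")
        elif t[:1].islower():
            out.append("L")
        else:
            out.append("O")
    return "".join(out)

def entity_features(sentence, start, end):
    """Features for the candidate entity span sentence[start:end];
    a sketch with hypothetical feature names."""
    span = sentence[start:end]
    feats = [f"tok={t}" for t in span]          # the recognized tokens
    # The four surrounding tokens (two on each side, where present)
    for j in (start - 2, start - 1, end, end + 1):
        if 0 <= j < len(sentence):
            feats.append(f"ctx[{j - start}]={sentence[j]}")
    # Pre- and suffixes up to length four of the span tokens
    for t in span:
        for k in range(1, 5):
            feats += [f"pre{k}={t[:k]}", f"suf{k}={t[-k:]}"]
    feats.append(f"cap={cap_pattern(span)}")    # e.g. "UU" for a two-token name
    return feats
```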
Learning algorithm
- Use an averaged perceptron for both stages
- Learns an approximation of the max-margin solution (linear SVM)
- 40 training iterations
- Used the LBJ machine learning toolkit
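The paper used the LBJ (Learning Based Java) toolkit; the following is a minimal Python sketch of a multiclass averaged perceptron over sparse binary features, to show the idea, not to reproduce that implementation. The lazy-averaging trick accumulates `(update time - 1) * delta` so the final averaged weights can be recovered in one pass.

```python
from collections import defaultdict

class AveragedPerceptron:
    """Minimal multiclass averaged perceptron; an illustrative
    sketch, not the LBJ implementation used in the paper."""

    def __init__(self, classes):
        self.classes = list(classes)
        self.w = defaultdict(float)       # current weights, keyed (class, feature)
        self.totals = defaultdict(float)  # accumulator for averaging
        self.t = 0                        # example counter

    def score(self, feats, c):
        return sum(self.w[(c, f)] for f in feats)

    def predict(self, feats):
        return max(self.classes, key=lambda c: self.score(feats, c))

    def fit(self, data, iterations=40):   # 40 iterations, as on the slide
        for _ in range(iterations):
            for feats, y in data:
                self.t += 1
                guess = self.predict(feats)
                if guess != y:            # standard perceptron update
                    for f in feats:
                        self._update((y, f), +1.0)
                        self._update((guess, f), -1.0)
        # Replace final weights by their average over all time steps
        for k in list(self.w):
            self.w[k] -= self.totals[k] / max(self.t, 1)

    def _update(self, k, delta):
        self.w[k] += delta
        self.totals[k] += (self.t - 1) * delta
```

Averaging the weight vector over all updates is what gives the method its SVM-like stability: late, noisy updates are down-weighted relative to weights that held for most of training.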
Evaluation
- Aim for F1 score, as defined in the CoNLL 2002 shared task on NER
- Two corpora: CoNLL 2002 and a subset of SoNaR (courtesy of Desmet and Hoste)
- Compare against the Stanford tagger and Desmet and Hoste's algorithm
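The CoNLL metric is entity-level, not token-level: a predicted entity counts as correct only when both its boundaries and its type match the gold annotation exactly. A small sketch of that computation (the `(start, end, type)` triple representation is an assumption of this example):

```python
def conll_f1(gold, pred):
    """Entity-level F1 in the style of the CoNLL 2002 shared task:
    an entity is correct only if span and type match exactly.
    gold/pred are collections of (start, end, type) triples."""
    gold, pred = set(gold), set(pred)
    if not gold or not pred:
        return 0.0
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

This strictness is why a wrong entity *type* on a perfectly found span still costs both precision and recall.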
Results on CoNLL 2002
- 309,686 tokens containing 19,901 names, four categories
- 65% training, 22% validation and 12% test sets
- Stanford achieves F1 = 74.72; the "miscellaneous" category is hard (< 0.7)
- We achieve F1 = 75.14; the "organization" category is hard
Results on SoNaR
- New, large corpus with manual annotations
- Used a 200k-token subset of a preliminary version, with three-fold cross-validation
- State of the art is Desmet and Hoste (2011) with F1 = 84.44
- The best individual classifier from that paper (a CRF) gets 83.77
- Our system: 83.56
- Here, the “product” and “miscellaneous” categories are hard
Conclusion
- Near state-of-the-art performance from simple learners with good feature sets
- No gazetteers, so the system should be fairly reusable
- (Side conclusion: SoNaR is more easily learnable than CoNLL)
Future work
- Being integrated into UvA's xTAS text analysis pipeline
- Used to find entities in the Dutch Hansard corpus (forthcoming) and to link entities to Wikipedia
- The full SoNaR corpus is now available; a new evaluation is needed