Named Entity Recognition based on three different
machine learning techniques
Zornitsa Kozareva
JRC Workshop, September 27, 2005
Research Group on Language Processing and Information Systems
Outline
Named Entity Recognition
  task definition
  applications
Machine learning approach
Classifier combination
Feature description and experimental evaluation
  for NE detection
  for NE classification
NERUA at GeoCLEF
Conclusions and future work
Named Entity Recognition – task definition
Identification of proper names in text, using the BIO scheme
  B starts an entity
  I continues the entity
  O marks words outside any entity
Classification into a predefined set of categories
  Person names
  Organizations (companies, governmental organizations, etc.)
  Locations (cities, countries, etc.)
  Miscellaneous (movie titles, sport events, etc.)
Adam_B-PER Smith_I-PER works_O for_O IBM_B-ORG ,_O London_B-LOC ._O
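The tagged sentence above can be turned back into a list of entities by grouping each B tag with the I tags that follow it. The sketch below is only an illustration of the BIO scheme; the function names `parse_tagged` and `bio_to_entities` are hypothetical and not part of NERUA.

```python
from typing import List, Tuple

def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last underscore, so ',_O' gives (',', 'O').
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

def bio_to_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged words into (entity text, category) tuples."""
    entities = []
    current_words, current_type = [], None
    for word, tag in pairs:
        if tag.startswith("B-"):
            # A B tag always opens a new entity, closing any open one.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            current_words.append(word)
        else:
            # O tag (or a stray I tag): close the open entity, if any.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

sentence = "Adam_B-PER Smith_I-PER works_O for_O IBM_B-ORG ,_O London_B-LOC ._O"
entities = bio_to_entities(parse_tagged(sentence))
print(entities)  # [('Adam Smith', 'PER'), ('IBM', 'ORG'), ('London', 'LOC')]
```

Detection corresponds to finding the B/I/O boundaries; classification corresponds to the PER/ORG/LOC/MISC suffix attached to them.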
Named Entity Recognition – applications
Information Extraction
Question Answering
Document classification
Automatic indexing of books
Increasing the accuracy of Internet search results
  (e.g. the location Clinton, South Carolina vs. President Clinton)
Machine learning approach
Given:
  NER task
  tagged corpus
Select classification methods
  Memory-based learning
  Maximum Entropy
  Hidden Markov Models
Construct a set of characteristics (features) for
  the detection phase
  the classification phase
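[Figure: NERUA architecture. Text enters a Detection stage (HMM and TiMBL, combined by voting), then a Classification stage (HMM, TiMBL and MXE, combined by voting), and leaves as NER-tagged text.]
Source: NERUA: sistema de detección y clasificación de entidades utilizando aprendizaje automático, Ferrández et al.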
Classification method 1
Memory-based learning (k-nearest neighbours)
toolkit
  TiMBL package
time performance
  quick training phase
  slow during testing
features
  various types of features
  irrelevant features impede performance
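The trade-offs listed above follow from how a k-nearest-neighbour classifier works. A minimal sketch in plain Python is given below; it does not use TiMBL, and the overlap distance, the toy memory and the function names are assumptions chosen for illustration.

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of feature positions where two instances disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)

def knn_classify(memory, instance, k=1):
    """Label an instance by majority vote among its k nearest stored examples."""
    # "Training" is just storing examples, so it is quick; every test
    # instance is compared against the whole memory, so testing is slow.
    nearest = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy memory (not TiMBL data):
# (previous word, anchor word is capitalized, next word) -> BIO tag
memory = [
    (("for", True, ","), "B"),
    (("Mr.", True, "Smith"), "B"),
    (("Adam", True, "works"), "I"),
    (("Smith", False, "for"), "O"),
    (("works", False, "IBM"), "O"),
]

print(knn_classify(memory, ("for", True, "."), k=1))  # B
```

Because every feature mismatch counts equally in this unweighted distance, an irrelevant feature can push the right neighbours away, which is one way to read the last bullet of the slide.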
Classification method 2
Maximum Entropy
toolkit
  MaxEnt
time performance
  slow training phase
  slow testing phase
feature management
  string, missing values
Classification method 3
Hidden Markov Models
toolkit
  ICOPOST
time performance
  quick training phase
  quick testing phase
feature management
  cannot handle as many features as the other two methods
  needs corpus or label transformation
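To see why an HMM tagger is limited in the features it can use, it helps to look at decoding: the model only has tag-to-tag transition probabilities and tag-to-word emission probabilities, so any extra information has to be folded into the words or the tags. Below is a minimal Viterbi sketch for a first-order HMM. It is not ICOPOST; the probabilities are made-up toy values and the `unk` smoothing constant is an assumption.

```python
def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Most probable tag sequence for words under a first-order HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               tag chosen for word i-1 on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], unk), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, unk), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Toy parameters (illustrative values only, not estimated from any corpus).
tags = ["B", "I", "O"]
start_p = {"B": 0.5, "I": 0.0, "O": 0.5}
trans_p = {
    "B": {"B": 0.1, "I": 0.5, "O": 0.4},
    "I": {"B": 0.1, "I": 0.3, "O": 0.6},
    "O": {"B": 0.3, "I": 0.0, "O": 0.7},
}
emit_p = {
    "B": {"Adam": 0.5, "IBM": 0.5},
    "I": {"Smith": 1.0},
    "O": {"works": 0.5, "for": 0.5},
}

print(viterbi(["Adam", "Smith", "works"], tags, start_p, trans_p, emit_p))
# ['B', 'I', 'O']
```

Both training (counting) and decoding are cheap, which matches the quick training and testing phases noted on the slide.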
Classifier combination
Majority voting: give each classifier one vote

CL 1   CL 2   CL 3   Vote
PER    PER    PER    PER
ORG    LOC    ORG    ORG
PER    LOC    LOC    LOC
PER    ORG    MISC   …
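The voting table can be reproduced in a few lines. The slide leaves the three-way disagreement in the last row open, so the sketch below assumes a simple fallback to one designated classifier; that tie-breaking rule and the `fallback_index` parameter are assumptions, not something the slides specify.

```python
from collections import Counter

def majority_vote(labels, fallback_index=0):
    """Return the label most classifiers agree on.

    If no label wins outright, fall back to the classifier at
    fallback_index (an assumption; the slide leaves this case open).
    """
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return labels[fallback_index]
    return counts[0][0]

rows = [
    ("PER", "PER", "PER"),
    ("ORG", "LOC", "ORG"),
    ("PER", "LOC", "LOC"),
    ("PER", "ORG", "MISC"),  # three-way disagreement
]
votes = [majority_vote(row) for row in rows]
print(votes)  # ['PER', 'ORG', 'LOC', 'PER']
```

In practice the fallback would usually point at the classifier with the best individual score on held-out data.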
Features for NE detection
Contextual
  anchor word (i.e. the word to be classified)
  words in a [-3,…,+3] window
Orthographic
  capitalization at positions 0 and [-3,…,+3]
  whole anchor word in capitals (e.g. IBM)
  position of the anchor word in the sentence
Substring extraction
  2- and 3-letter substrings from the left and right sides of the anchor word
Gazetteer list
  word at position 0, +1, +2, +3 seen in the list
Trigger word list
  word at position 0 or [-3,…,+3] seen in the list
Source: Using Language Resource Independent Detection for Spanish NER, Kozareva et al., RANLP'05
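As a rough illustration of how the detection features listed above could be computed for one anchor word, consider the sketch below. The feature names, the `<PAD>` placeholder for positions outside the sentence, and the toy gazetteer and trigger sets are all assumptions; the actual encoding used by the authors is not given in the slides.

```python
def detection_features(tokens, i, gazetteer, triggers, window=3):
    """Build a feature dict for the anchor word tokens[i]."""
    def word_at(pos):
        return tokens[pos] if 0 <= pos < len(tokens) else "<PAD>"

    anchor = tokens[i]
    feats = {"position": i, "all_caps": anchor.isupper()}
    for off in range(-window, window + 1):
        w = word_at(i + off)
        feats[f"w[{off}]"] = w                         # contextual
        feats[f"cap[{off}]"] = w[:1].isupper()         # orthographic
        feats[f"trig[{off}]"] = w.lower() in triggers  # trigger word list
        if off >= 0:
            feats[f"gaz[{off}]"] = w in gazetteer      # gazetteer, positions 0..+3
    for n in (2, 3):                                   # substring extraction
        feats[f"prefix{n}"] = anchor[:n]
        feats[f"suffix{n}"] = anchor[-n:]
    return feats

tokens = "Adam Smith works for IBM , London .".split()
feats = detection_features(
    tokens, 4, gazetteer={"London", "IBM"}, triggers={"for", "mr."}
)
print(feats["w[-1]"], feats["all_caps"], feats["suffix2"], feats["gaz[2]"])
# for True BM True
```

Each dictionary would then be flattened into one training instance per token, labelled with its B, I or O tag.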
Results for NE detection
Spanish           B       I       BIO
TMB-ALL           94.81   86.45   92.56
TMB-CO (1)        94.62   86.14   92.34
TMB-COS (2)       94.72   86.48   92.51
HMM (3)           93.19   82.33   90.29
Voting (1,2,3)    95.07   87.17   92.96

Portuguese        B       I       BIO
TMB-CO            82.91   68.53   78.41
TMB-COS           81.65   63.80   76.20
HMM               72.93   59.81   68.53
Voting            83.32   69.09   78.86

Numbers in parentheses mark the classifiers combined in the voting row.

Data size         Train    Test
Sp tokens         264715   51533
Sp entities       18794    3558
Pt tokens         68597    22624
Pt entities       3094     1013
Features for NE classification
Contextual
  whole entity
  first word of the entity
  second word of the entity, if present
  words around the entity in a [-3,…,+3] window
Orthographic
  position of the anchor word in the sentence
  capital, lowercase or other symbol
Gazetteer list
  part of the entity in the list
  whole entity in the list
  whole entity not in any of these lists
Trigger lists
  anchor word
  words in a [-1,+1] window
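Unlike detection, classification works on a whole entity that has already been delimited. The sketch below shows one possible reading of the feature list above. It assumes the entity is given as a token span, that gazetteers are keyed by a hypothetical list name, and that the [-1,+1] trigger window refers to the words immediately before and after the entity; none of these details are confirmed by the slides.

```python
def classification_features(tokens, start, end, gazetteers, triggers, window=3):
    """Feature dict for the already-detected entity tokens[start:end]."""
    def word_at(pos):
        return tokens[pos] if 0 <= pos < len(tokens) else "<PAD>"

    def shape(word):
        first = word[:1]
        return "capital" if first.isupper() else "lower" if first.islower() else "other"

    entity = tokens[start:end]
    whole = " ".join(entity)
    feats = {
        "entity": whole,                                   # contextual
        "first": entity[0],
        "second": entity[1] if len(entity) > 1 else "<NONE>",
        "position": start,                                 # orthographic
        "shape": shape(entity[0]),
        "trig[0]": entity[0].lower() in triggers,          # trigger lists
        "trig[-1]": word_at(start - 1).lower() in triggers,
        "trig[+1]": word_at(end).lower() in triggers,
    }
    for off in range(1, window + 1):                       # words around the entity
        feats[f"left[{off}]"] = word_at(start - off)
        feats[f"right[{off}]"] = word_at(end - 1 + off)
    for name, entries in gazetteers.items():               # gazetteer lists
        feats[f"whole_in_{name}"] = whole in entries
        feats[f"part_in_{name}"] = any(w in entries for w in entity)
    feats["in_no_list"] = not any(whole in e for e in gazetteers.values())
    return feats

tokens = "Adam Smith works for IBM , London .".split()
gazetteers = {"PER": {"Adam", "Smith"}, "LOC": {"London"}}
feats = classification_features(tokens, 0, 2, gazetteers, triggers={"works"})
print(feats["second"], feats["right[1]"], feats["part_in_PER"], feats["in_no_list"])
# Smith works True True
```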
Results for NE classification
F-score for Spanish classification

Classification    LOC     MISC    ORG     PER
MxE24 (1)         77.81   57.49   78.83   85.41
TMB24             75.49   53.19   77.44   83.89
MxE25             78.27   58.22   78.64   85.60
TMB25 (2)         75.15   52.94   77.79   85.36
HMM (3)           71.15   45.69   72.95   70.20
Voting (1,2,3)    78.46   57.00   78.93   86.52

Numbers in parentheses mark the classifiers combined in the voting row.
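For readers unfamiliar with the measure, a per-category F-score can be computed as sketched below. The slides do not state how entities were matched, so this example assumes the balanced F1 with exact span matching; the span tuples and the function name `f_score` are illustrative only.

```python
def f_score(gold, predicted, category):
    """Precision, recall and F1 for one category.

    gold and predicted are sets of (start, end, category) tuples;
    an entity counts as correct only on an exact match (an assumption).
    """
    gold_c = {e for e in gold if e[2] == category}
    pred_c = {e for e in predicted if e[2] == category}
    correct = len(gold_c & pred_c)
    precision = correct / len(pred_c) if pred_c else 0.0
    recall = correct / len(gold_c) if gold_c else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = {(0, 2, "PER"), (4, 5, "ORG"), (6, 7, "LOC")}
predicted = {(0, 2, "PER"), (4, 5, "LOC"), (6, 7, "LOC")}

for cat in ("PER", "ORG", "LOC"):
    p, r, f = f_score(gold, predicted, cat)
    print(cat, round(p, 2), round(r, 2), round(f, 2))
# PER 1.0 1.0 1.0
# ORG 0.0 0.0 0.0
# LOC 0.5 1.0 0.67
```

Mislabelling IBM as a location hurts twice: ORG loses recall and LOC loses precision.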
NERUA at GeoCLEF
Language          Run             Result
English           IRn+NERUA       34.95
                  IRn+Dramneri    29.77
Spanish-English   IRn+NERUA       26.06
                  IRn+Dramneri    23.65
The English runs directly used the feature sets constructed for Spanish
NERUA outperformed the rule-based system Dramneri, although both consulted the same gazetteer and trigger word lists
NERUA took more processing time
University of Alicante at GeoCLEF 2005, Ferrández et al., CLEF’05
Conclusions and future work
We found a language-resource-independent feature set for NE detection
  92.96% of Spanish entities
  78.86% of Portuguese entities
Classifier combination has improved NE classification
Good coverage over PER, LOC and ORG classes is maintained
Machine learning systems may outperform rule-based systems; however, they need more processing time and hand-labeled resources, which are not available for all languages
Future work
Find discriminative features for the MISC class
Resolve NER by relying on unlabeled data
Divide the four categories into more detailed ones
Adapt the system to other languages
Study ways of automatic gazetteer construction
Thank you for your attention! Questions?