Page 1

Named Entity Recognition based on three different machine learning techniques

Zornitsa Kozareva
zkozareva@dlsi.ua.es

JRC Workshop, September 27, 2005

Research Group on Language Processing and Information Systems (gPLSI)

Page 2

Outline

- Named Entity Recognition
  - task definition
  - applications
- Machine learning approach
- Classifier combination
- Feature description and experimental evaluation
  - for NE detection
  - for NE classification
- NERUA at GeoCLEF
- Conclusions and future work

Page 3

Named Entity Recognition – task definition

- Identification of proper names in text, using the BIO scheme
  - B: starts an entity
  - I: continues the entity
  - O: word outside any entity
- Classification into a predefined set of categories
  - Person names
  - Organizations (companies, governmental organizations, etc.)
  - Locations (cities, countries, etc.)
  - Miscellaneous (movie titles, sport events, etc.)

Example: Adam_B-PER Smith_I-PER works_O for_O IBM_B-ORG ,_O London_B-LOC ._O
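
The BIO tags in the example sentence can be grouped back into entities with a few lines of code. The following is a minimal sketch, not part of the original slides: the function name `bio_to_entities` is hypothetical, and the `word_TAG` input format is assumed from the example line.

```python
def bio_to_entities(tagged_tokens):
    """Group word_TAG tokens into (entity_text, category) pairs.

    Assumes each token looks like 'Adam_B-PER' or 'works_O', as in the
    slide's example; this helper is hypothetical, not NERUA code.
    """
    entities = []
    current_words, current_type = [], None
    for token in tagged_tokens:
        # Split on the last underscore so words containing '_' survive.
        word, tag = token.rsplit("_", 1)
        if tag.startswith("B-"):
            # A B tag always opens a new entity; close any open one first.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # An I tag of the same category continues the open entity.
            current_words.append(word)
        else:
            # O (or an inconsistent I tag) closes the open entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


sentence = "Adam_B-PER Smith_I-PER works_O for_O IBM_B-ORG ,_O London_B-LOC ._O"
print(bio_to_entities(sentence.split()))
# [('Adam Smith', 'PER'), ('IBM', 'ORG'), ('London', 'LOC')]
```

Treating an I tag that does not match the open entity as a boundary is a design choice of this sketch; other decoders promote such a tag to B instead.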

Page 4

Named Entity Recognition – applications

- Information Extraction
- Question Answering
- Document classification
- Automatic indexing of books
- Increase accuracy of Internet search results (the location Clinton, South Carolina vs. President Clinton)

Page 6

Machine learning approach

- Given:
  - NER task
  - tagged corpus
- Select classification methods
  - Memory-based learning
  - Maximum Entropy
  - Hidden Markov Models
- Construct set of characteristics
  - detection phase
  - classification phase

Page 7

[Figure: NERUA architecture. Text -> Detection (HMM, TiMBL) -> Voting -> Classification (HMM, TiMBL, MxE) -> Voting -> NER text]

NERUA: sistema de detección y clasificación de entidades utilizando aprendizaje automático, Ferrández et al.

Page 8

Classification method 1

- Memory-based learning (k-nearest neighbours)
- Toolkit: TiMBL package
- Time performance
  - quick training phase
  - slow during testing
- Features
  - various types of features
  - irrelevant features impede performance
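
The time profile noted for memory-based learning follows from how a nearest-neighbour classifier works: training only stores the examples, while every prediction compares the new instance against the whole memory. The sketch below illustrates this with a plain overlap distance over symbolic features. It does not call TiMBL or reproduce its feature weighting; the class name and the toy instances are assumptions made for illustration.

```python
from collections import Counter


class MemoryBasedClassifier:
    """Toy k-nearest-neighbour classifier over symbolic features."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []  # list of (feature_tuple, label) pairs

    def fit(self, instances, labels):
        # "Training" only stores the examples, hence the quick training phase.
        self.memory = list(zip(instances, labels))
        return self

    @staticmethod
    def overlap_distance(a, b):
        # Number of feature positions whose values differ.
        return sum(x != y for x, y in zip(a, b))

    def predict(self, instance):
        # Every stored example is compared at test time, hence slow testing.
        ranked = sorted(
            self.memory,
            key=lambda example: self.overlap_distance(instance, example[0]),
        )
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


# Hypothetical instances: (previous word, anchor word, is_capitalized)
train_x = [("for", "IBM", True), (",", "London", True), ("Smith", "works", False)]
train_y = ["B-ORG", "B-LOC", "O"]
clf = MemoryBasedClassifier(k=1).fit(train_x, train_y)
print(clf.predict(("for", "Microsoft", True)))  # B-ORG
```

Because every feature position counts equally in this distance, an uninformative feature can pull the wrong neighbours closer, which is one way irrelevant features impede performance.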

Page 9

Classification method 2

- Maximum Entropy
- Toolkit: MaxEnt
- Time performance
  - slow training phase
  - slow testing phase
- Feature management
  - string, missing values
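
For reference, a maximum entropy classifier is usually written as the conditional model below. This is the textbook formulation, not a formula taken from the slides; the feature functions f_i and weights \lambda_i are notation introduced here for illustration.

```latex
p(y \mid x) = \frac{\exp\left(\sum_{i} \lambda_i f_i(x, y)\right)}
                   {\sum_{y'} \exp\left(\sum_{i} \lambda_i f_i(x, y')\right)}
```

The weights \lambda_i are typically fitted by iterative optimisation over the training corpus, which is consistent with the slow training phase noted for this method.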

Page 10

Classification method 3

- Hidden Markov Models
- Toolkit: ICOPOST
- Time performance
  - quick training phase
  - quick testing phase
- Feature management
  - cannot handle as many features as the other two methods
  - needs corpus or label transformation

Page 12

Classifier combination

Majority voting: give each classifier one vote

CL 1   CL 2   CL 3   Vote
PER    PER    PER    PER
ORG    LOC    ORG    ORG
PER    LOC    LOC    LOC
PER    ORG    MISC
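
Majority voting over the three classifier outputs can be sketched as follows. This is an illustrative sketch rather than the NERUA implementation: the function name is hypothetical, and returning `None` when no label wins is an assumption, since the slide gives no vote for the row where all three classifiers disagree.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most classifiers, or None on a tie.

    Returning None for ties is an assumption of this sketch; the slide
    does not say how a three-way disagreement is resolved.
    """
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no label has a strict plurality
    return counts[0][0]


# The four rows shown on the slide (CL 1, CL 2, CL 3).
rows = [
    ("PER", "PER", "PER"),
    ("ORG", "LOC", "ORG"),
    ("PER", "LOC", "LOC"),
    ("PER", "ORG", "MISC"),
]
for row in rows:
    print(row, "->", majority_vote(row))
```

A real system needs some policy for the tie case, for example falling back to the classifier that scored best on held-out data; which policy to use is left open here.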

Page 14

Features for NE detection

- Contextual
  - anchor word (i.e. the word to be classified)
  - words in a [-3,…,+3] window
- Orthographic
  - capitalization at position 0, [-3,…,+3]
  - whole anchor word in capitals (e.g. IBM)
  - position of anchor word in a sentence
- Substring extraction
  - 2- and 3-letter substrings from the left and right side of the anchor word
- Gazetteer list
  - word at position 0, +1, +2, +3 seen in the list
- Trigger word list
  - word at position 0, [-3,…,+3] seen in the list

Using Language Resource Independent Detection for Spanish NER, Kozareva et al., RANLP'05
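
The detection feature groups listed on this slide can be assembled per token roughly as follows. This is a sketch under stated assumptions: the function name, the feature names, the `<PAD>` symbol for positions outside the sentence, and the exact-match lookup in the gazetteer and trigger sets are all hypothetical choices, not details given in the slides.

```python
def detection_features(tokens, i, gazetteer=frozenset(), triggers=frozenset(), window=3):
    """Build a feature dict for the anchor word tokens[i].

    Feature names and the padding symbol are hypothetical; only the
    feature groups themselves come from the slide.
    """
    pad = "<PAD>"

    def word_at(offset):
        j = i + offset
        return tokens[j] if 0 <= j < len(tokens) else pad

    anchor = tokens[i]
    features = {
        "position": i,                 # position of the anchor word in the sentence
        "all_caps": anchor.isupper(),  # whole anchor word in capitals, e.g. IBM
    }
    for offset in range(-window, window + 1):
        word = word_at(offset)
        features[f"word[{offset}]"] = word                               # contextual
        features[f"cap[{offset}]"] = word != pad and word[:1].isupper()  # orthographic
        features[f"trigger[{offset}]"] = word in triggers                # trigger word list
    for n in (2, 3):
        features[f"prefix{n}"] = anchor[:n]   # letters from the left side
        features[f"suffix{n}"] = anchor[-n:]  # letters from the right side
    for offset in range(0, window + 1):
        features[f"gaz[{offset}]"] = word_at(offset) in gazetteer        # gazetteer list
    return features


tokens = ["Adam", "Smith", "works", "for", "IBM", ",", "London", "."]
feats = detection_features(tokens, 4, gazetteer={"London"}, triggers={"for"})
print(feats["word[0]"], feats["all_caps"], feats["prefix2"], feats["gaz[2]"])
```

A dictionary keeps the sketch readable; a toolkit that expects fixed-width instances would need the values written out in a fixed column order.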

Page 15

Results for NE detection

Spanish          B       I       BIO
TMB-ALL          94.81   86.45   92.56
TMB-CO (1)       94.62   86.14   92.34
TMB-COS (2)      94.72   86.48   92.51
HMM (3)          93.19   82.33   90.29
Voting (1,2,3)   95.07   87.17   92.96

Portuguese       B       I       BIO
TMB-CO           82.91   68.53   78.41
TMB-COS          81.65   63.80   76.20
HMM              72.93   59.81   68.53
Voting           83.32   69.09   78.86

Data size        Train    Test
Sp tokens        264715   51533
Sp entities      18794    3558
Pt tokens        68597    22624
Pt entities      3094     1013

Page 17

Features for NE classification

- Contextual
  - whole entity
  - first word of the entity
  - second word of the entity, if present
  - words around the entity in a [-3,…,+3] window
- Orthographic
  - position of anchor word in a sentence
  - capital, lowercase or other symbol
- Gazetteer list
  - part of entity in the list
  - whole entity in the list
  - whole entity is not in any of these lists
- Trigger lists
  - anchor word
  - words in a [-1,+1] window
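
The three gazetteer features for classification (part of the entity in a list, whole entity in a list, whole entity in none of the lists) could be computed along these lines. The sketch is illustrative only: the function name, the feature names, the dictionary layout of the gazetteers and the lowercase matching are assumptions, not details from the slides.

```python
def gazetteer_features(entity_words, gazetteers):
    """Compute gazetteer membership features for one detected entity.

    gazetteers maps a list name to a set of lowercase entries; this
    layout is hypothetical and chosen only for the example.
    """
    whole = " ".join(entity_words).lower()
    features = {}
    for name, entries in gazetteers.items():
        # Whole entity found in this list.
        features[f"whole_in_{name}"] = whole in entries
        # At least one word of the entity found in this list.
        features[f"part_in_{name}"] = any(w.lower() in entries for w in entity_words)
    # Whole entity is not in any of the lists.
    features["whole_in_no_list"] = not any(
        features[f"whole_in_{name}"] for name in gazetteers
    )
    return features


gazetteers = {"loc": {"london", "south carolina"}, "per": {"adam"}}
print(gazetteer_features(["Adam", "Smith"], gazetteers))
print(gazetteer_features(["London"], gazetteers))
```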

Page 18

Results for NE classification

Classification   LOC     MISC    ORG     PER
MxE24 (1)        77.81   57.49   78.83   85.41
TMB24            75.49   53.19   77.44   83.89
MxE25            78.27   58.22   78.64   85.60
TMB25 (2)        75.15   52.94   77.79   85.36
HMM (3)          71.15   45.69   72.95   70.20
Voting (1,2,3)   78.46   57.00   78.93   86.52

F-score for Spanish classification
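
The slides do not define the F-score; the sketch below assumes the usual balanced F-measure (harmonic mean of precision and recall) computed from entity counts. The function name and the counts in the usage line are hypothetical and are not taken from the reported results.

```python
def f_score(true_positives, false_positives, false_negatives, beta=1.0):
    """F-measure from raw counts; beta=1.0 gives the balanced F1 score."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 8 correct entities, 2 spurious, 2 missed.
print(round(100 * f_score(8, 2, 2), 2))  # 80.0
```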

Page 20

NERUA at GeoCLEF

Language          Run            Result
English           IRn+NERUA      34.95
                  IRn+Dramneri   29.77
Spanish-English   IRn+NERUA      26.06
                  IRn+Dramneri   23.65

- For English, the feature sets constructed for Spanish were used directly
- NERUA outperformed the rule-based system Dramneri, although both consulted the same gazetteer and trigger word lists
- NERUA took more processing time

University of Alicante at GeoCLEF 2005, Ferrández et al., CLEF'05

Page 21

Conclusions and future work

- We found a language resource independent feature set for NE detection
  - 92.96% of Spanish entities
  - 78.86% of Portuguese entities
- Classifier combination has improved NE classification
- Good coverage over PER, LOC and ORG classes is maintained
- Machine learning systems may outperform rule-based systems; however, they need more processing time and hand-labeled resources, which are not available for all languages

Page 22

Future work

- Find discriminative features for the MISC class
- Tackle NER by relying on unlabeled data
- Divide the four categories into more detailed ones
- Adapt the system for other languages
- Study ways of automatic gazetteer construction

Page 23

Thank you for your attention! Questions?