Page 1: Term Informativeness for Named Entity Detection Jason D. M. Rennie MIT Tommi Jaakkola MIT.

Term Informativeness for Named Entity Detection

Jason D. M. Rennie, MIT

Tommi Jaakkola, MIT

Page 2

Information Extraction

President Bush signed the Central America Free Trade Agreement into law Tuesday…

Who What When

Page 3

Named Entity Detection

President Bush signed the Central America Free Trade Agreement into law Tuesday, hailing the seven-nation pact as an open-door policy that will benefit U.S. exporters and seed prosperity and democracy in Central America and the Dominican Republic.

Page 4

Informal Communication

• Other Sources of Information
  – E-mail
  – Web Bulletin Boards
  – Mailing Lists

• More specialized, up-to-date information

• But, harder to extract

Page 5

IE for Informal Comm.

SUBJECT: Two New Ipswich Seafood Joints to Open Soon.

ALL HOUNDS ON DECK! #1 Across from the new HS, at the old White Cap Seafood is a renovated new joint and the sign says "Salt Box". I suspect they are opening soon; they look ready. Lets hope its great as there is too much 'just average' around here. #2: In the…

Page 6

NED for Informal Comm.

Subject: finale harvard square

has anyone been to the recently opened finale in harvard square?

Page 7

Restaurant Bulletin Board

• Gathered from a Restaurant BBoard
  – 6 sets of ~100 posts
  – 132 threads
  – Applied Ratnaparkhi’s POS tagger
  – Hand-labeled each token In/Out of restaurant name

Page 8

Detecting Named Entities

[Diagram with the labels: Named Entity, Informative, Bursty]

Page 9

Quantifying Informativeness

[Figure: Document 1, Document 2, Document 3, with the example words “the”, “clandestine”, and “Brazil”]

Page 10

A Little History…

Z-measure [Brookes,1968]

Inverse Doc. Freq. [Jones,1973]

xI [Bookstein & Swanson, 1974]

Residual IDF [Church & Gale, 1995]

Gain [Papineni, 2001]

Page 11

Main Idea

• Informative words are:
  – Rare (IDF)
  – Modal (Mixture Score)

• Rarity and Modality are independent qualities

• We quantify informativeness using a product of IDF and Mixture Score
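A minimal sketch of the rarity half of the score, assuming the common definition of IDF as the negative log of the fraction of documents that contain a term. The natural log, the function names, and the toy documents are assumptions for illustration, not taken from the slides; the Mixture Score is treated here as a number computed elsewhere.

```python
import math

def idf(term, documents):
    """Inverse document frequency: -log(fraction of documents containing term)."""
    doc_freq = sum(1 for doc in documents if term in doc)
    if doc_freq == 0:
        raise ValueError(f"{term!r} occurs in no document")
    return -math.log(doc_freq / len(documents))

def informativeness(idf_score, mixture_score):
    """Combine rarity and modality as a product, as the slide describes."""
    return idf_score * mixture_score

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["the", "finale", "dessert"],
    ["the", "ribs"],
    ["the", "finale"],
    ["the", "tacos"],
]
print(idf("the", docs))   # a term in every document has IDF 0
print(idf("ribs", docs))  # a term in 1 of 4 documents has IDF log(4)
```

As a consistency check on this deck's own tables: "sichaun" has a Mixture Score of 99.62 and an IDF*Mixture score of 379.97, which implies an IDF of about 3.81.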

Page 12

Binomial Distribution
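The formula on this slide did not survive extraction. As a sketch, assuming the standard binomial model in which a term occurs x times in a document of n tokens with a single per-token rate theta, the log-probability and the maximum-likelihood rate could be written as below. The function names are hypothetical.

```python
import math

def binomial_log_pmf(x, n, theta):
    """log P(x | n, theta) = log C(n, x) + x log(theta) + (n - x) log(1 - theta).

    Assumes 0 < theta < 1 and 0 <= x <= n.
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_choose + x * math.log(theta) + (n - x) * math.log(1 - theta)

def binomial_mle(counts, lengths):
    """Maximum-likelihood rate when one theta is shared by all documents."""
    return sum(counts) / sum(lengths)

# Hypothetical counts of one term in three 100-token documents.
theta_hat = binomial_mle([7, 0, 4], [100, 100, 100])
print(theta_hat, binomial_log_pmf(7, 100, theta_hat))
```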

Page 13

Term Frequency Distributions

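[Figure: per-document occurrence counts for “the” and “Brazil”. Values recovered from the slide, in extraction order: 7, 0, 4, 0, 8, 0, 5, 5, 6, 0]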

Page 14

Mixture Models

[Figure: mixture model illustration. Labels recovered from the slide, in extraction order: 0.1%, 5%, 10%, 0, 5, 90%]

Page 15

Modality

• Modal words fit a mixture much better than a single binomial

• We separately fit the binomial and mixture models to each term frequency distribution

• We quantify modality by comparing the fitness of the two models

Page 16

Learning Mixture Parameters

Use gradient descent to learn λ, θ₁, θ₂
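A rough sketch of what such a fit could look like, assuming a two-component binomial mixture with mixing weight lam and per-token rates theta1 and theta2. The slides do not give the update rule, so everything here is an assumption: the parameters are passed through a sigmoid so they stay in (0, 1), the code climbs the log-likelihood (equivalent to gradient descent on its negative), and a simple step-halving rule keeps each step from making the fit worse. Function names and the toy counts are hypothetical.

```python
import numpy as np

def _log_sigmoid(z):
    # log(1 / (1 + exp(-z))), computed stably
    return -np.logaddexp(0.0, -z)

def _component_logs(params, x, n):
    """Per-document log of each weighted component (binomial coefficient dropped)."""
    a, b1, b2 = params
    log_p1 = _log_sigmoid(a) + x * _log_sigmoid(b1) + (n - x) * _log_sigmoid(-b1)
    log_p2 = _log_sigmoid(-a) + x * _log_sigmoid(b2) + (n - x) * _log_sigmoid(-b2)
    return log_p1, log_p2

def log_likelihood(params, x, n):
    log_p1, log_p2 = _component_logs(params, x, n)
    return float(np.logaddexp(log_p1, log_p2).sum())

def gradient(params, x, n):
    """Gradient of the log-likelihood with respect to the unconstrained parameters."""
    log_p1, log_p2 = _component_logs(params, x, n)
    resp = np.exp(log_p1 - np.logaddexp(log_p1, log_p2))  # P(component 1 | document)
    lam, theta1, theta2 = np.exp(_log_sigmoid(params))
    return np.array([
        np.sum(resp - lam),
        np.sum(resp * (x - n * theta1)),
        np.sum((1.0 - resp) * (x - n * theta2)),
    ])

def fit_mixture(counts, lengths, steps=200, step_size=1.0):
    x = np.asarray(counts, dtype=float)
    n = np.asarray(lengths, dtype=float)
    # Start the two components at double and half the pooled rate.
    rate = np.clip(x.sum() / n.sum(), 1e-6, 0.25)
    logit = lambda p: np.log(p / (1.0 - p))
    params = np.array([0.0, logit(2.0 * rate), logit(rate / 2.0)])
    ll = log_likelihood(params, x, n)
    for _ in range(steps):
        grad = gradient(params, x, n)
        step = step_size
        while step > 1e-12:
            candidate = params + step * grad
            candidate_ll = log_likelihood(candidate, x, n)
            if candidate_ll >= ll:  # accept only steps that do not hurt the fit
                params, ll = candidate, candidate_ll
                break
            step /= 2.0
        else:
            break  # no acceptable step found; stop early
    lam, theta1, theta2 = np.exp(_log_sigmoid(params))
    return {"lam": float(lam), "theta1": float(theta1),
            "theta2": float(theta2), "log_likelihood": ll}

# Hypothetical bursty term: absent from most 100-token documents, frequent in a few.
print(fit_mixture([0, 0, 0, 5, 0, 0, 4, 0, 0, 6], [100] * 10))
```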

Page 17

Comparing Fitness

• Use log-odds to compare fitness of the two models
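One way to read the log-odds comparison, sketched under the assumption that the score is the log of the ratio between the mixture likelihood and the single-binomial likelihood of a term's counts. The binomial coefficients are identical in both models, so they cancel and are omitted. Function names, parameter values, and counts are hypothetical.

```python
import math

def binomial_ll(counts, lengths, theta):
    """Log-likelihood of the counts under one shared rate (coefficients omitted)."""
    return sum(x * math.log(theta) + (n - x) * math.log(1 - theta)
               for x, n in zip(counts, lengths))

def mixture_ll(counts, lengths, lam, theta1, theta2):
    """Log-likelihood under a two-component mixture (coefficients omitted)."""
    total = 0.0
    for x, n in zip(counts, lengths):
        a = math.log(lam) + x * math.log(theta1) + (n - x) * math.log(1 - theta1)
        b = math.log(1 - lam) + x * math.log(theta2) + (n - x) * math.log(1 - theta2)
        top = max(a, b)
        total += top + math.log(math.exp(a - top) + math.exp(b - top))
    return total

def log_odds_score(counts, lengths, lam, theta1, theta2):
    """Positive when the mixture explains the counts better than one binomial."""
    pooled = sum(counts) / sum(lengths)
    return (mixture_ll(counts, lengths, lam, theta1, theta2)
            - binomial_ll(counts, lengths, pooled))

# Hypothetical bursty term and hand-picked (not fitted) mixture parameters.
bursty = [0, 0, 0, 5, 0, 0, 4, 0, 0, 6]
print(log_odds_score(bursty, [100] * 10, lam=0.3, theta1=0.05, theta2=0.001))
```

When both components share the pooled rate, the mixture collapses to the single binomial and the score is zero, so the score measures only what the second component adds.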

Page 18

Top Mixture Score Words

Token     Score    In restaurant name / total occurrences
sichaun   99.62    31/52
fish      50.59    7/73
was       48.79    0/483
speed     44.69    16/19
tacos     43.77    4/19

Page 19

Independence

[Diagram: Rareness (IDF) ? Modality (Mixture Score)]

Page 20

Correlation Coefficient

Score Pair      Corr. Coefficient
IDF/Mixture     -0.0139
IDF/RIDF         0.4113
Mixture/RIDF     0.7380

Page 21

Top Words Overlap Plot

• Two sorted lists
  – Sorted by IDF
  – Sorted by Mixture Score

• Look at % overlap among top N in both lists

• Plot % overlap as we vary N

• Independent scores would produce line along diagonal
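The procedure in the bullets above can be sketched as follows. This assumes both scores are available as dictionaries over the same vocabulary and that percent overlap at N means the share of the top-N words common to both rankings; the function name and toy scores are hypothetical. For two unrelated rankings of V words, the expected number of shared words in the top N is N²/V, so the expected percent overlap is 100·N/V, which is the diagonal the slide refers to.

```python
def overlap_curve(scores_a, scores_b):
    """Return [(N, percent overlap of the two top-N sets)] for N = 1..V."""
    ranked_a = sorted(scores_a, key=scores_a.get, reverse=True)
    ranked_b = sorted(scores_b, key=scores_b.get, reverse=True)
    top_a, top_b = set(), set()
    curve = []
    for n, (word_a, word_b) in enumerate(zip(ranked_a, ranked_b), start=1):
        top_a.add(word_a)
        top_b.add(word_b)
        # Recomputing the intersection each time is quadratic, but keeps the sketch simple.
        curve.append((n, 100.0 * len(top_a & top_b) / n))
    return curve

# Hypothetical scores for a four-word vocabulary, ranked in opposite orders.
idf_scores = {"w1": 4.0, "w2": 3.0, "w3": 2.0, "w4": 1.0}
mix_scores = {"w1": 1.0, "w2": 2.0, "w3": 3.0, "w4": 4.0}
for n, pct in overlap_curve(idf_scores, mix_scores):
    print(n, round(pct, 1))
```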

Page 22

Overlap Plot

[Plot: Percent Overlap (y-axis) against # Top Words (x-axis); curves: IDF/Mixture and IDF/RIDF]

Page 23

Top IDF*Mixture Words

Token     Score     In restaurant name / total occurrences
sichaun   379.97    31/52
villa     197.08    10/11
tokyo     191.72    7/11
ribs      181.57    0/13
speed     156.23    16/19

Page 24

Intro to NED Experiments

• Task: Identify Restaurant Names

• Use standard NED features (capitalization, punctuation, POS) as “Baseline”

• Add informativeness score as an additional feature

• Use F1 Breakeven as performance metric
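A small sketch of the metric, assuming the usual reading of breakeven as the point where precision equals recall. That happens when the number of tokens predicted positive equals the number of truly positive tokens, so the sketch labels the top-K scored tokens as positive, with K the number of true positives in the data. The function name and the toy scores are hypothetical, and ties between scores are not handled specially.

```python
def f1_breakeven(scores, labels):
    """F1 at the threshold where precision equals recall.

    scores: classifier confidence per token (higher means more likely a name).
    labels: 1 if the token is inside a restaurant name, else 0.
    """
    k = sum(labels)
    if k == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(label for _, label in ranked[:k])
    precision = true_pos / k  # k tokens predicted positive
    recall = true_pos / k     # k tokens actually positive
    if true_pos == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical scores for four tokens, two of which are in a restaurant name.
print(f1_breakeven([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0]))
```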

Page 25

NED Experiments

Feature Set     F1 Breakeven
Baseline        55.0%
IDF             56.0%
Mixture         56.0%
IDF,Mixture     56.9%
Residual IDF    57.4%
IDF*RIDF        58.5%
IDF*Mixture     59.3%

(Higher is better.)

Page 26

Summary

• Traditional syntax-based features are not enough for IE in e-mail & bulletin boards

• We used term occurrence statistics to construct an informativeness score (IDF*Mixture)

• We found IDF*Mixture to be useful for identifying topic-centric words and named entities

Page 27

Discussion

• Phrases

• Foreign languages, Speech

• Co-reference resolution, context tracking

• Collaborative filtering

