Self-adjustable bootstrapping for Named Entity set expansion

Sushant Narsale (JHU), Satoshi Sekine (NYU)
Lexical Knowledge from Ngrams, July 30th, 2009
Nail: Set (NE list) Expansion using bootstrapping
Expand Named Entity Sets for 150 Named Entity Categories
[Diagram: n-grams feed into self-adjustable bootstrapping]
Our Task
• Input: Seeds for 150 Named Entity categories
• Output: More examples like the seeds
• Motivation
  – "Creating lists of Named Entities on the Web is critical for query analysis, document categorization and ad matching" (Web-Scale Distributional Similarity and Entity Set Expansion, Pantel et al.)
Examples from 3 of the 150 categories
Awards (1091): AAASS/Orbis Books Prize, Abd-el-Tif prize, Abel Prize, Academy Award, ACM Turing Award, Adalbert Stifter Prize, Adriano Gonzalez Leon Biennial Novel Prize, Aga Khan Prize for Fiction, Agatha Award, Agatha Awards, AIA Gold Medal

Academic (215): Aboriginal Studies, Accounting, Actuarial Science and Statistics, Administration of Justice, Administrative and Policy Studies, African Studies, Africana Studies, American Cultures, American Studies, Anatomy, Anesthesiology, Anthropology

Title (8): Mr., Mr, Mister, Mrs., Mrs, Miss, Ms., Ms
The 150-category Named Entity hierarchy (figure)
Bootstrapping
• Get more of the similar
  – A set of names (e.g. Presidents): Clinton, Bush, Putin, Chirac
  – They must share something...
• They share the same contexts in texts:
  – "President * said yesterday", "of President * in", "President * , the", "President * , who"
• The contexts may be shared by other Presidents:
  – Yeltsin, Zemin, Hussein, Obama
• We need a scoring function to score the candidates
• We need to set the number of contexts/examples to learn
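A toy sketch of this loop in Python, using raw co-occurrence counts as a stand-in for the scoring formulas described later (the window size, thresholds and helper names here are illustrative assumptions, not the system's actual implementation):

```python
from collections import Counter

def contexts_of(tokens, targets, w=2):
    """Yield (left, right) token windows around tokens that are known targets."""
    for i, tok in enumerate(tokens):
        if tok in targets:
            yield (tuple(tokens[max(0, i - w):i]), tuple(tokens[i + 1:i + 1 + w]))

def slot_fillers(tokens, context, w=2):
    """Yield the tokens sitting in the * slot of a (left, right) context."""
    left, right = context
    for i in range(len(tokens)):
        if (tuple(tokens[max(0, i - w):i]) == left
                and tuple(tokens[i + 1:i + 1 + w]) == right):
            yield tokens[i]

def bootstrap(seeds, sentences, n_contexts=20, n_new=10, iterations=2):
    """Alternate between learning shared contexts and harvesting new names."""
    targets = set(seeds)
    for _ in range(iterations):
        # Step 1: rank contexts by how often the current targets occur in them.
        ctx = Counter()
        for sent in sentences:
            ctx.update(contexts_of(sent, targets))
        best = [c for c, _ in ctx.most_common(n_contexts)]
        # Step 2: rank candidates by how often they fill a good context's slot.
        cand = Counter()
        for sent in sentences:
            for c in best:
                cand.update(t for t in slot_fillers(sent, c) if t not in targets)
        targets |= {t for t, _ in cand.most_common(n_new)}
    return targets

# sents = [["President", x, "said", "yesterday"]
#          for x in ["Clinton", "Bush", "Putin", "Yeltsin", "Obama"]]
# bootstrap({"Clinton", "Bush"}, sents) also picks up Putin, Yeltsin, Obama.
```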
Problem
• Different NE categories need different parameter settings in bootstrapping
  – "Academic" has a small number of strong contexts ("Department of ... at")
  – "Company" has a large number of weak contexts ("... went bankrupt", "... hires")
  – "Award" has a strong suffix feature ("... Award" / "... Prize")
  – "Nationality" has a specific length (1), while "Book" has a wide length variation
Self-Adjustable Bootstrapping

• We need to find the best parameter setting for each category
• Idea: Bootstrapping + a Machine Learning approach
  – Use 80% of the seeds for training (train-data) and 20% of the seeds to optimize the scoring functions and thresholds (dev-data)
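A minimal sketch of that idea, assuming a `run_bootstrap(train_seeds, setting)` function that returns a ranked candidate list and a `trr(ranking, dev_seeds)` scorer (defined on a later slide); both names are hypothetical:

```python
import itertools
import random

def select_parameters(seeds, settings, run_bootstrap, trr):
    """Split seeds 80/20 and keep the setting that best recovers the dev 20%."""
    seeds = list(seeds)
    random.shuffle(seeds)
    cut = int(0.8 * len(seeds))
    train, dev = seeds[:cut], seeds[cut:]
    # The dev seeds are never given to the bootstrapper; a good setting is
    # one whose ranked output places them near the top.
    return max(settings, key=lambda s: trr(run_bootstrap(train, s), dev))

# A hypothetical grid mirroring the parameters on the following slides:
# scoring formulas #1-#5, number of contexts, and feature weights.
SETTINGS = [
    {"ctx_formula": c, "tgt_formula": t, "n_contexts": n,
     "affix_weight": a, "length_weight": l}
    for c, t, n, a, l in itertools.product(
        range(1, 6), range(1, 6), (50, 200), (100, 700), (10, 50, 100))
]
```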
Our Approach
• Parameters
  1. Context
     • Formulas to score contexts and targets
     • Number of contexts to be used
  2. Suffix/Prefix
     • e.g. Suffix=Awards for award categories
  3. Length
     • A bias on the lengths of the retrieved entity set
• Weighted linear interpolation of the three functions (sketched below)
• Optimization function: Total Reciprocal Rank (TRR)
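The weighted linear interpolation might look like the following sketch; the weights and the three per-feature scorers are the quantities tuned on the dev-data (all names here are hypothetical):

```python
def combined_score(candidate, context_score, affix_score, length_score,
                   w_context=1.0, w_affix=0.5, w_length=0.2):
    """Weighted linear interpolation of the three feature scores.

    Each *_score argument maps a candidate string to a number; the three
    weights are among the parameters optimized on the dev seeds.
    """
    return (w_context * context_score(candidate)
            + w_affix * affix_score(candidate)
            + w_length * length_score(candidate))
```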
1. Scoring Formulas

• Scoring targets:
  1. Fi / CF
  2. Ft / log(CF)
  3. Fi * log(Ft) / CF
  4. log(Fi) * Ft / CF
  5. log(Fi) * Ft / log(CF)
• Scoring contexts:
  1. Fi / CF
  2. Ft / log(CF)
  3. Fi * log(Ft) / CF
  4. log(Fi) * Ft / CF
  5. log(Fi) * Ft / log(CF)

Fi = co-occurrence frequency of the target and the context
Ft = number of target types co-occurring with the context
CF = corpus frequency of the context

We observed that different scoring formulas work best for different categories.
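Transcribed as code, with the aggregation over contexts as an assumption (the slides do not say how per-context values are combined):

```python
from math import log

# The five scoring formulas as functions of Fi (co-occurrence frequency of
# the target and the context), Ft (number of target types co-occurring with
# the context) and CF (corpus frequency of the context). Assumes CF > 1 and
# Fi, Ft >= 1 so the logs are defined and nonzero where divided by.
FORMULAS = {
    1: lambda Fi, Ft, CF: Fi / CF,
    2: lambda Fi, Ft, CF: Ft / log(CF),
    3: lambda Fi, Ft, CF: Fi * log(Ft) / CF,
    4: lambda Fi, Ft, CF: log(Fi) * Ft / CF,
    5: lambda Fi, Ft, CF: log(Fi) * Ft / log(CF),
}

def score_target(target, contexts, stats, formula=FORMULAS[4]):
    """Score a target by summing a formula over the contexts it occurs in.

    stats[(target, ctx)] is assumed to hold an (Fi, Ft, CF) triple; the
    summation is an illustrative choice, not necessarily the system's.
    """
    return sum(formula(*stats[(target, ctx)])
               for ctx in contexts if (target, ctx) in stats)
```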
2. Prefix/Suffix
Award: Academy Award, American Book Awards, Filmfare Awards, BAFTA Awards, Batty Weber Prize, Booker Prize, Cameos Prize, Carnegie Prize, World Cup, Edgar Award
  Learned features: S=Award (19%), S=Prize (16%)

Lake: Aberdeen Lake, White Rock Lake, Tucker Lake, Summersville Lake, Belmont Lake, Lake Monroe, Lake Nakuwa, Lake Muhlenberg, Lake Lacanau, Lake Columbia
  Learned features: P=Lake (47%), S=Lake (30%)

Bridge: Albert Bridge, George Washington Bridge, Auckland Harbor Bridge, Benjamin Franklin Bridge, Yokohama Bay Bridge, Walter Taylor Bridge
  Learned features: S=Bridge (70%), S=bridge (8%)

Bird: African Cuckoo, Afep Pigeon, Owl, Acorn Woodpecker, Penguin, Hawk, Eagle, Parrot, Crow
  Learned features: N/A (no dominant prefix/suffix)

(P = prefix token, S = suffix token)
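A sketch of how such prefix/suffix statistics can be read off the seeds (the percentage threshold for treating an affix as a usable feature would itself be tuned on dev-data):

```python
from collections import Counter

def affix_features(seeds, top_k=2):
    """Return the most frequent first/last tokens with their seed fractions.

    Output triples look like ('S', 'Bridge', 0.70) for "70% of seeds end in
    Bridge", mirroring the table above (P = prefix, S = suffix).
    """
    counts = Counter()
    for name in seeds:
        tokens = name.split()
        counts[("P", tokens[0])] += 1
        counts[("S", tokens[-1])] += 1
    return [(kind, tok, n / len(seeds))
            for (kind, tok), n in counts.most_common(top_k)]

# affix_features(["Albert Bridge", "Acosta Bridge", "Tower Bridge"])
# -> [('S', 'Bridge', 1.0), ('P', 'Albert', 0.33...)]
```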
3. Length
• Set a bias on the length of retrieved entities based on the distribution of lengths over the seed words.
[Chart: fraction of seeds (y-axis, 0–1) by entity length in tokens (x-axis, 1–5), built up category by category: Nationality, then Bird, then Book, then Academic, Award, Conference, Money_Form, Lake and Airport. Nationality's mass is concentrated at length 1, while Book varies widely.]
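One plausible way to turn the seed length distribution into the bias the slide describes (the exact functional form of the bias is not specified, so this is an assumption):

```python
from collections import Counter

def length_bias(seeds):
    """Map each length (in tokens) to the fraction of seeds with that length."""
    lengths = Counter(len(s.split()) for s in seeds)
    return {n: c / len(seeds) for n, c in lengths.items()}

# For Nationality the mass sits almost entirely at length 1; for Book it is
# spread out. A candidate can then be weighted by how typical its length is:
#   score = bias.get(len(candidate.split()), 0.0)
```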
Optimization Function
• TRR (Total Reciprocal Rank)
  – We want a higher score for parameter settings that retrieve our held-out dev examples at the top of the retrieved set.
  – TRR = sum over held-out examples of 1 / rank(example)

If the held-out examples are retrieved at ranks 1, 2 and 8:
  Score = 1/1 + 1/2 + 1/8 = 1.625
If they are retrieved at ranks 2, 3, 4 and 6:
  Score = 1/2 + 1/3 + 1/4 + 1/6 = 1.25
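TRR is straightforward to compute from a ranked output list and the held-out examples; this sketch reproduces the two worked examples above:

```python
def total_reciprocal_rank(ranking, held_out):
    """Sum of 1/rank over the held-out examples that appear in the ranking."""
    rank_of = {item: i + 1 for i, item in enumerate(ranking)}
    return sum(1.0 / rank_of[x] for x in held_out if x in rank_of)

# Held-out items at ranks 1, 2, 8    -> 1/1 + 1/2 + 1/8 = 1.625
# Held-out items at ranks 2, 3, 4, 6 -> 1/2 + 1/3 + 1/4 + 1/6 = 1.25
```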
Experiment
• Data
  – The dataset consists of seeds for all 150 NE categories
  – The number of seeds varies from 20 to 20,000, extracted from Wikipedia list pages and other list pages (Sekine et al., 2004)
  – Examples:

    Category   #Seeds   Examples
    Academic   214      Artificial Intelligence, Asia-Pacific Studies, Biochemistry
    Airport    1054     A.P. Hill Army Airfield, Aberdeen Airport, Afton Municipal Airport
    Bridge     1174     10th Avenue Bridge, 23rd Street Viaduct, Acosta Bridge

• Program
  – An n-gram search engine over Wikipedia
  – 1.7 billion tokens and 1.2 billion 7-grams
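A toy version of the wildcard query such an n-gram search engine answers (a real index over 1.2 billion 7-grams would use sorted files or an inverted index, not a linear scan):

```python
def match_wildcard(ngram_counts, pattern):
    """Collect fillers of the '*' slot in a pattern like ('President', '*', 'said').

    ngram_counts maps n-gram tuples to corpus frequencies; '*' matches
    exactly one token.
    """
    slot = pattern.index("*")
    fillers = {}
    for gram, freq in ngram_counts.items():
        if len(gram) == len(pattern) and all(
                p == "*" or p == g for p, g in zip(pattern, gram)):
            fillers[gram[slot]] = fillers.get(gram[slot], 0) + freq
    return fillers

# match_wildcard({("President", "Putin", "said"): 42}, ("President", "*", "said"))
# -> {'Putin': 42}
```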
Optimization Result
• Different parameter settings give the best results for different categories
Setting                 | Ctx. scoring | Tgt. scoring | Context   | P/Suffix     | Length       | Academic (43) | Airport (210) | Bridge (235)
                        | formula      | formula      | threshold | feat. weight | feat. weight |               |               |
Best for Academic       | #2           | #2           | 200       | 100          | 10           | 0.90          | 1.09          | 0.26
Best for Airport        | #5           | #4           | 50        | 700          | 10           | 0.24          | 1.56          | 0.57
Best for Bridge         | #5           | #4           | 200       | 700          | 50           | 0.24          | 1.22          | 0.77
All combined (baseline) | #2           | #5           | 50        | 100          | 100          | 0.83          | 1.47          | 0.56

The last row is the single setting that is best for all categories combined, used as the baseline.
Results
              Our Method                 Baseline
              Rec.   Prec.   F           Rec.   Prec.   F
Academic      71     61      66 (+11)    61     50      55
Airport       23     61      33 (+0)     23     60      33
Bridge        16     47      24 (+10)    9      29      14

Recall: percentage of held-out seed examples found in the top 2,000 retrieved.
Precision: percentage of correct targets in a random sample of 100 from the top 2,000.
Future Work
• More features
  – Phrase clustering
  – Genre information
  – Longer dependencies
• Better optimization
• Start with a smaller number of seeds
• Other targets (e.g. relations)
• Make a tool (like Google Sets)
Using Phrase Clusters

Matching %   Total Matches   Category         Cluster ID
92%          237             Airport          287
91%          23              Incident         548
83%          24              Occasions        474
70%          72              Facility_Other   769
64%          11566           Titles           545
10%          7252            Flora            950
10%          103009          City             441
9.7%         11272           Religion         464
9.4%         3107            Planet           332
9.5%         2883            Train            326
Airport Cluster

• 1301 phrases containing "airport" in Cluster #287, e.g.:
  – Chicago 's O'Hare Airport
  – Ben Gurion International Airport
  – Little Rock National Airport
  – London 's Heathrow airport
  – Austin airport
  – Burbank airport
  – London 's Heathrow Airport
  – Memphis airport
  – La Guardia airport
  – Corpus Christi International Airport
  – Boston 's Logan Airport
  – Cincinnati/Northern Kentucky International Airport
  – Sea-Tac airport
Conclusion
• A solution to the problem that different methods work best for different categories
• A large dictionary of Named Entities in 150 categories