Features, Formalized
Stephen Mayhew, Hyung Sul Kim
Outline
• What are features?
• How are they defined for NLP tasks in general?
• How are they defined specifically for relation extraction? (kernel methods)
What are features?
Feature Extraction Pipeline
1. Define Feature Generation Functions (FGFs)
2. Apply FGFs to data to make a lexicon
3. Translate examples into feature space
4. Learning with vectors
Feature Generation Functions
• When we say 'features', we are often actually talking about FGFs.
• An FGF defines a relation over the instance space.
• For example, let the instance be the phrase 'little brown cow'. The relation containsWord(w) is active (= 1) three times: containsWord(little), containsWord(brown), containsWord(cow).
Feature Generation Functions
Let X be the instance space and let R be an enumerable collection of relations on X. A Feature Generation Function is a mapping
Φ : X → 2^R
that maps each x ∈ X to the set of all relations in R that are satisfied by x.
Common notation for an FGF: Φ(x) = { r ∈ R : r(x) = 1 }
Feature Generation Functions
Example: "Gregor Samsa woke from troubled dreams."
Let R = { isCap(·), hasLen4(·), endsWithS(·) }
Define an FGF over R and apply it to the instance:
isCap(Gregor), isCap(Samsa), hasLen4(woke), hasLen4(from), endsWithS(dreams)
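To make this concrete, here is a minimal Python sketch of an FGF built from the three relations above; the helper names and the dictionary-based structure are illustrative, not part of the original formulation.

```python
# Minimal sketch of a Feature Generation Function (FGF).
# The relations isCap, hasLen4, endsWithS come from the slide;
# everything else is an illustrative choice.

def isCap(w):      return w[0].isupper()
def hasLen4(w):    return len(w) == 4
def endsWithS(w):  return w.endswith("s")

RELATIONS = {"isCap": isCap, "hasLen4": hasLen4, "endsWithS": endsWithS}

def fgf(tokens):
    """Map an instance (a token sequence) to its set of active, grounded features."""
    return {f"{name}({w})" for w in tokens for name, rel in RELATIONS.items() if rel(w)}

tokens = "Gregor Samsa woke from troubled dreams".split()
print(sorted(fgf(tokens)))
# ['endsWithS(dreams)', 'hasLen4(from)', 'hasLen4(woke)', 'isCap(Gregor)', 'isCap(Samsa)']
```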
Feature Extraction Pipeline
1. Define Feature Generation Functions (FGFs)
2. Apply FGFs to data to make a lexicon
3. Translate examples into feature space
4. Learning with vectors
Lexicon
Apply our FGF to all input data. This creates grounded features and indexes them:
…
3534: hasWord(stark)
3535: hasWord(stamp)
3536: hasWord(stampede)
3537: hasWord(starlight)
…
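A minimal sketch of the lexicon step, assuming instances are token lists and a single hasWord FGF; the indexing scheme (first seen, first indexed) is an illustrative choice.

```python
# Build a lexicon: a stable mapping from grounded features to integer indices.

def fgf(tokens):
    return {f"hasWord({w})" for w in tokens}

def build_lexicon(corpus):
    lexicon = {}
    for tokens in corpus:
        for feature in sorted(fgf(tokens)):
            if feature not in lexicon:
                lexicon[feature] = len(lexicon)   # next free index
    return lexicon

corpus = [["In", "the", "stark", "starlight"], ["the", "stampede"]]
lexicon = build_lexicon(corpus)
# e.g. {'hasWord(In)': 0, 'hasWord(stark)': 1, 'hasWord(starlight)': 2, ...}
```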
Feature Extraction Pipeline
1. Define Feature Generation Functions (FGFs)
2. Apply FGFs to data to make a lexicon
3. Translate examples into feature space
4. Learning with vectors
Translate examples into feature space
From the lexicon:
…
98: hasWord(In)
…
241: hasWord(the)
…
3534: hasWord(stark)
3535: hasWord(stamp)
3536: hasWord(stampede)
3537: hasWord(starlight)
…
"In the stark starlight" → <98, 241, 3534, 3537>
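A minimal sketch of this translation step, using a tiny hard-coded lexicon excerpt with the indices shown above; names are illustrative.

```python
# Translate an example into feature space: look up each active grounded feature
# in the lexicon and keep the sorted list of indices (a sparse binary vector).

lexicon = {
    "hasWord(In)": 98,
    "hasWord(the)": 241,
    "hasWord(stark)": 3534,
    "hasWord(stamp)": 3535,
    "hasWord(stampede)": 3536,
    "hasWord(starlight)": 3537,
}

def to_feature_space(tokens, lexicon):
    active = {f"hasWord({w})" for w in tokens}
    # Features never seen at lexicon-building time are simply dropped.
    return sorted(lexicon[f] for f in active if f in lexicon)

print(to_feature_space("In the stark starlight".split(), lexicon))
# [98, 241, 3534, 3537]
```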
Feature Extraction Pipeline
1. Define Feature Generation Functions (FGFs)
2. Apply FGFs to data to make a lexicon
3. Translate examples into feature space
4. Learning with vectors
Easy.
Feature Extraction Pipeline - Testing
1. FGFs are already defined
2. Lexicon is already defined
3. Translate examples into feature space
4. Learning with vectors
No surprises here.
Structured Pipeline - Training
1. Define Feature Generation Functions (FGFs)
   (Note: in this case the FGF is defined over pairs (x, y): Φ(x, y))
2. Apply FGFs to data to make a lexicon
3. Translate examples into feature space
4. Learning with vectors
Exactly the same as before!
Structured Pipeline - Testing
Remember, the FGF is Φ(x, y).
Now we don't have a gold y to use, but the idea is very similar: for every possible y we create features.
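As a rough illustration of "create features for every possible y", here is a sketch under the simplifying assumption of a small multiclass label set and a linear score over Φ(x, y); the label set, feature template, and weights are all invented for the example.

```python
# Sketch: at test time, generate features for every candidate output y
# and pick the highest-scoring one. The multiclass setting is an assumption
# made for illustration; the weights would come from training.

def joint_fgf(tokens, y):
    """Phi(x, y): conjoin each input feature with the candidate label."""
    return {f"{y}&hasWord({w})" for w in tokens}

def predict(tokens, labels, weights):
    def score(y):
        return sum(weights.get(f, 0.0) for f in joint_fgf(tokens, y))
    return max(labels, key=score)

weights = {"POSITIVE&hasWord(good)": 1.2, "NEGATIVE&hasWord(bad)": 1.5}
print(predict("a good movie".split(), ["POSITIVE", "NEGATIVE"], weights))
# POSITIVE
```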
Automatic Feature Generation
Two ways to look at this:
1. Creating an FGF: this is a black art, not even intuitive for humans to do.
2. Choosing the best subset of a closed set: this is possible, and algorithms exist.
Exploiting Syntactico-Semantic Structures for Relation Extraction (Chan and Roth, ACL 2011)
Before doing the hard task of relation classification, apply some easy heuristics to recognize:
• Premodifiers: [the [Seattle] Zoo]
• Possessives: [[California's] Governor]
• Prepositions: [officials] in [California]
• Formulaics: [Medford], [Massachusetts]
These four structures cover 80% of the mention pairs (in ACE 2004).
Kernels for Relation Extraction
Hyung Sul Kim
Kernel Tricks
• A few slides here are borrowed from the ACL 2012 tutorial on kernels in NLP by Moschitti.
All We Need is
• K(x1, x2) = ϕ(x1) · ϕ(x2)
• Computing K(x1, x2) is often possible without explicitly mapping x to ϕ(x)
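A quick numeric check of this identity for one concrete kernel, the homogeneous quadratic kernel K(x1, x2) = (x1 · x2)^2, whose explicit feature map lists all pairwise products; the example is illustrative and not from the slides.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

def k_implicit(x1, x2):
    """The same kernel computed without ever building phi(x)."""
    return float(np.dot(x1, x2)) ** 2

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([0.5, -1.0, 2.0])

print(np.dot(phi(x1), phi(x2)))   # 20.25
print(k_implicit(x1, x2))         # 20.25
```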
Linear Kernels with Features (Zhou et al., 2005)
• Pairwise binary SVM training
• Features:
  • Words
  • Entity types
  • Mention level
  • Overlap
  • Base phrase chunking
  • Dependency tree
  • Parse tree
  • Semantic resources
Word Features
Feature | Description | Example
WM1 | bag-of-words in M1 | {they}
HM1 | head word of M1 | they
WM2 | bag-of-words in M2 | {their, children}
HM2 | head word of M2 | children
HM12 | combination of HM1 and HM2 | <they, children>
WBNULL | when no word in between | 0
WBFL | the only word in between, when exactly one word in between | 0
WBF | first word in between, when at least two words in between | do
WBL | last word in between, when at least two words in between | put
WBO | other words in between (excluding first and last), when at least three words in between | not
BM1F | first word before M1 | 0
BM1L | second word before M1 | 0
AM2F | first word after M2 | in
AM2L | second word after M2 | a
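A minimal sketch of how the words-in-between features above could be computed, assuming mentions are given as (start, end) token spans; this follows the table's definitions but is not Zhou et al.'s implementation.

```python
# Extract the "words in between" features from the table above.
# tokens: list of words; m1, m2: (start, end) token spans with m1 before m2.

def words_between_features(tokens, m1, m2):
    between = tokens[m1[1]:m2[0]]          # words strictly between M1 and M2
    feats = {}
    if not between:
        feats["WBNULL"] = 1
    elif len(between) == 1:
        feats[f"WBFL={between[0]}"] = 1
    else:
        feats[f"WBF={between[0]}"] = 1
        feats[f"WBL={between[-1]}"] = 1
        for w in between[1:-1]:            # only fires with three or more words in between
            feats[f"WBO={w}"] = 1
    return feats

tokens = "they do not put their children in a".split()
print(words_between_features(tokens, (0, 1), (4, 6)))
# {'WBF=do': 1, 'WBL=put': 1, 'WBO=not': 1}
```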
Entity Types, Mention Level, Overlap
Feature | Description | Example 1 | Example 2
ET12 | combination of mention entity types (PER, ORG, FAC, LOC, GPE) | <PER, PER> | <GPE, LOC>
ML12 | combination of mention levels (NAME, NOMINAL, PRONOUN) | <PRO, NOM> | <NAM, NAM>
#MB | number of other mentions in between | 0 | 0
#WB | number of words in between | 3 | 0
M1>M2 | 1 if M2 is included in M1 | 0 | 1
M1<M2 | 1 if M1 is included in M2 | 0 | 0
Base Phrase Chunking
Feature | Description | Example
CPHBNULL | when no phrase in between | 0
CPHBFL | the only phrase head, when only one phrase in between | 0
CPHBF | first phrase head in between, when at least two phrases in between | JAPAN
CPHBL | last phrase head in between, when at least two phrase heads in between | KILLED
CPHBO | other phrase heads in between (excluding first and last), when at least three phrases in between | 0
CPHBM1F | first phrase head before M1 | 0
CPHBM1L | second phrase head before M1 | 0
CPHAM2F | first phrase head after M2 | 0
CPHAM2L | second phrase head after M2 | 0
Performance of Features (F1 Measure)
[Bar chart: F1 measure as feature groups are added cumulatively: Words, + Entity Type, + Mention Level, + Overlap, + Chunking, + Dependency Tree, + Parse Tree, + Semantic Resources; y-axis from 0 to 60]
Performance Comparison
Year | Authors | Method | F-Measure
2005 | Zhou et al. | Linear Kernels with Handcrafted Features | 55.5
Syntactic Kernels (Zhao and Grishman, 2005)
• Syntactic kernel: a composite of 5 kernels
  • Argument kernel
  • Bigram kernel
  • Link sequence kernel
  • Dependency path kernel
  • Local dependency kernel
Bigram Kernel
• All unigrams and bigrams in the text from M1 to M2
Unigram | Bigram
they | they do
do | do not
not | not put
put | put their
their | their children
children |
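A minimal sketch of collecting the unigrams and bigrams over the span from M1 to M2; illustrative only.

```python
# Unigrams and bigrams over the token span from M1 to M2 (inclusive).

def unigrams_and_bigrams(span_tokens):
    unigrams = list(span_tokens)
    bigrams = [(a, b) for a, b in zip(span_tokens, span_tokens[1:])]
    return unigrams, bigrams

span = "they do not put their children".split()
unis, bis = unigrams_and_bigrams(span)
# unis: ['they', 'do', 'not', 'put', 'their', 'children']
# bis:  [('they', 'do'), ('do', 'not'), ('not', 'put'),
#        ('put', 'their'), ('their', 'children')]
```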
Dependency Path Kernel
Example: "That's because Israel was expected to retaliate against Hezbollah forces in areas controlled by Syrian troops."
Performance Comparison
Year | Authors | Method | F-Measure
2005 | Zhou et al. | Linear Kernels with Handcrafted Features | 55.5
2005 | Zhao and Grishman | Syntactic Kernels (Composite of 5 Kernels) | 70.35
Composite Kernel (Zhang et al., 2006)
• Composite of two kernels:
  • Entity kernel (linear kernel with entity-related features given by the ACE datasets)
  • Convolution tree kernel (Collins and Duffy, 2001)
• Two ways to combine the two kernels:
  • Linear combination
  • Polynomial expansion
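A minimal sketch of the two composition schemes, assuming kernel functions k_ent and k_tree are already available (and normalized); the mixing weight and the polynomial degree are illustrative assumptions, not the exact settings of Zhang et al. (2006).

```python
# Two ways to combine an entity kernel and a convolution tree kernel.
# k_ent and k_tree are assumed given; alpha is a mixing weight (illustrative value).

def linear_combination(k_ent, k_tree, alpha=0.4):
    return lambda r1, r2: alpha * k_ent(r1, r2) + (1 - alpha) * k_tree(r1, r2)

def polynomial_expansion(k_ent, k_tree, alpha=0.4, degree=2):
    # Expand the entity kernel polynomially before mixing (the degree is an assumption).
    return lambda r1, r2: alpha * (k_ent(r1, r2) + 1) ** degree + (1 - alpha) * k_tree(r1, r2)
```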
Convolution Tree Kernel (Collins and Duffy, 2001)
[Figure: an example parse tree and its subtrees]
K(x1, x2) can be computed efficiently, in O(|x1| · |x2|) time.
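An illustrative sketch of the recursive computation behind this kernel (the Δ function summed over node pairs), assuming trees are nested (label, children...) tuples; it omits memoization and normalization and is not the authors' implementation.

```python
# Compact sketch of the Collins-Duffy convolution tree kernel.
# Trees are nested tuples: (label, child1, child2, ...); a leaf child is a plain string.
# lam is the decay factor.

def nodes(tree):
    """Yield all internal nodes of the tree."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from nodes(child)

def label(node):
    return node[0] if isinstance(node, tuple) else node

def production(node):
    return (node[0], tuple(label(c) for c in node[1:]))

def delta(n1, n2, lam):
    if production(n1) != production(n2):
        return 0.0
    if all(not isinstance(c, tuple) for c in n1[1:]):   # pre-terminal node
        return lam
    result = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        result *= 1.0 + delta(c1, c2, lam)
    return result

def tree_kernel(t1, t2, lam=0.4):
    """K(t1, t2): decayed count of matching subtrees, summed over all node pairs."""
    return sum(delta(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

t1 = ("NP", ("DT", "the"), ("NN", "zoo"))
t2 = ("NP", ("DT", "the"), ("NN", "park"))
print(tree_kernel(t1, t2))   # 0.96: the shared DT->the subtree plus the shared NP production
```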
Relation Instance Spaces
[Figure: candidate relation instance spaces (parse-tree spans for a mention pair), with F-measures of 51.3, 61.9, 59.2, and 60.4]
Performance Comparison
Year | Authors | Method | F-Measure
2005 | Zhou et al. | Linear Kernels with Handcrafted Features | 55.5
2005 | Zhao and Grishman | Syntactic Kernels (Composite of 5 Kernels) | 70.35
2006 | Zhang et al. | Entity Kernel + Convolution Tree Kernel | 72.1
Context-Sensitive Tree Kernel (Zhou et al., 2007)
• Motivational example: "John and Mary got married". The relation is expressed by context outside the two mentions; such cases are called the predicate-linked category (10%).
• PT: 63.6; Context-Sensitive Tree Kernel: 73.2
Performance Comparison
Year | Authors | Method | F-Measure
2005 | Zhou et al. | Linear Kernels with Handcrafted Features | 55.5
2005 | Zhao and Grishman | Syntactic Kernels (Composite of 5 Kernels) | 70.35
2006 | Zhang et al. | Entity Kernel + Convolution Tree Kernel | 72.1
2007 | Zhou et al. | (Zhou et al., 2005) + Context-Sensitive Tree Kernel | 75.8
Best Kernel (Nguyen et al., 2009)
• Use multiple kernels on:
  • Constituent trees
  • Dependency trees
  • Sequential structures
• Design 5 different kernel composites from 4 tree kernels and 6 sequence kernels
Convolution Tree Kernels on 4 Special Trees
[Figure: four tree structures (PET, DW, GR, GRW) with individual F-measures of 68.9, 56.3, 60.2, and 58.5]
PET + GR = 70.5
DW + GR = 61.8
Word Sequence Kernels on 6 Special Sequences
SK1. Sequence of terminals (lexical words) in the PET, e.g., T2-LOC washington , U.S. T1-PER officials
SK2. Sequence of part-of-speech (POS) tags in the PET, e.g., T2-LOC NN , NNP T1-PER NNS
SK3. Sequence of grammatical relations in the PET, e.g., T2-LOC pobj , nn T1-PER nsubj
SK4. Sequence of words in the DW, e.g., Washington T2-LOC In working T1-PER officials GPE U.S.
SK5. Sequence of grammatical relations in the GR, e.g., pobj T2-LOC prep ROOT T1-PER nsubj GPE nn
SK6. Sequence of POS tags in the DW, e.g., NN T2-LOC IN VBP T1-PER NNS GPE NNP
[Figure: individual F-measures of the six sequence kernels: 61.0, 60.8, 61.6, 59.7, 59.8, 59.7]
SK1 + SK2 + SK3 + SK4 + SK5 + SK6 = 69.8
Word Sequence Kernels (Cancedda et al., 2003)
• Extended sequence kernels
• Map to high-dimensional spaces using every subsequence
• Penalties for:
  • common subsequences (using IDF)
  • longer subsequences
  • non-contiguous subsequences
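An illustrative (deliberately brute-force) word-subsequence kernel in this spirit: every length-n subsequence is a feature, down-weighted by a decay factor raised to the length of the span it covers, which penalizes longer and gappier matches. The IDF weighting of common subsequences is omitted here, and efficient versions use dynamic programming rather than enumeration; all names are illustrative.

```python
from collections import Counter
from itertools import combinations

def subseq_weights(tokens, n, lam):
    """Map each length-n subsequence to the sum of lam**span over its occurrences."""
    weights = Counter()
    for idx in combinations(range(len(tokens)), n):
        span = idx[-1] - idx[0] + 1                   # total span, counting gaps
        weights[tuple(tokens[i] for i in idx)] += lam ** span
    return weights

def word_sequence_kernel(s, t, n=2, lam=0.5):
    """Dot product in the (implicit) space of gap-weighted length-n subsequences."""
    ws, wt = subseq_weights(s, n, lam), subseq_weights(t, n, lam)
    return sum(ws[u] * wt[u] for u in ws if u in wt)

s = "officials in washington".split()
t = "officials working in washington".split()
print(word_sequence_kernel(s, t))
```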
Performance Comparison
Year | Authors | Method | F-Measure
2005 | Zhou et al. | Linear Kernels with Handcrafted Features | 55.5
2005 | Zhao and Grishman | Syntactic Kernels (Composite of 5 Kernels) | 70.35
2006 | Zhang et al. | Entity Kernel + Convolution Tree Kernel | 72.1
2007 | Zhou et al. | (Zhou et al., 2005) + Context-Sensitive Tree Kernel | 75.8
2009 | Nguyen et al. | Multiple Tree Kernels + Multiple Sequence Kernels | 71.5
Note on (Zhang et al., 2006): F-measure 68.9 in our settings.
Note on (Zhou et al., 2007): "Such heuristics expand the tree and remove unnecessary information allowing a higher improvement on RE. They are tuned on the target RE task so although the result is impressive, we cannot use it to compare with pure automatic learning approaches, such as our models."
Topic Kernel (Wang et al., 2011)
• Use Wikipedia infoboxes to learn topics of relations (analogous to topics of words) based on co-occurrences
Topic | Top Relations
Topic 1 | active_years_end_date, career_end, final_year, retired
Topic 2 | commands, part_of, battles, notable_commanders
Topic 3 | influenced, school_tradition, notable_ideas, main_interests
Topic 4 | destinations, end, through, post_town
Topic 5 | prizes, award, academy_awards, highlights
Topic 6 | inflow, outflow, length, maxdepth
Topic 7 | after, successor, ending_terminus
Topic 8 | college, almamater, education
…
Overview
Performance Comparison
Year | Authors | Method | F-Measure
2005 | Zhou et al. | Linear Kernels with Handcrafted Features | 55.5
2005 | Zhao and Grishman | Syntactic Kernels (Composite of 5 Kernels) | 70.35
2006 | Zhang et al. | Entity Kernel + Convolution Tree Kernel | 72.1
2007 | Zhou et al. | (Zhou et al., 2005) + Context-Sensitive Tree Kernel | 75.8
2009 | Nguyen et al. | Multiple Tree Kernels + Multiple Sequence Kernels | 71.5
2011 | Wang et al. | Entity Features + Word Features + Dependency Path + Topic Kernels | 73.24