Feature Selection
Advanced Statistical Methods in NLP
Ling 572, January 24, 2012
Roadmap
Feature representations:
- Features in attribute-value matrices
- Motivation: text classification
Managing features:
- General approaches
- Feature selection techniques
- Feature scoring measures
- Alternative feature weighting
- Chi-squared feature selection
Representing Input: Attribute-Value Matrix

            f1: Currency   f2: Country   ...   fm: Date   Label
x1 = Doc1        1              1        0.3       0      Spam
x2 = Doc2        1              1        1.75      1      Spam
...
xn = Doc4        0              0        0         2      NotSpam

Choosing features:
- Define features, e.g. with feature templates
- Instantiate features
- Perform dimensionality reduction
Weighting features: increase/decrease feature importance
- Global feature weighting: weight a whole column
- Local feature weighting: weight an individual cell, conditioned on the instance
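As a concrete illustration of the attribute-value representation above, here is a minimal sketch in Python; the feature names, values, and documents are made-up placeholders, not part of the original example.

```python
# Minimal sketch (hypothetical data): each document is a row of the
# attribute-value matrix; each column is one instantiated feature.
docs = [
    ("Doc1", {"currency": 1, "country": 1, "date": 0}, "Spam"),
    ("Doc2", {"currency": 1, "country": 1, "date": 1}, "Spam"),
    ("Doc4", {"currency": 0, "country": 0, "date": 2}, "NotSpam"),
]

features = ["currency", "country", "date"]                      # f1 .. fm
matrix = [[fvals.get(f, 0) for f in features] for _, fvals, _ in docs]
labels = [label for _, _, label in docs]

print(matrix)   # [[1, 1, 0], [1, 1, 1], [0, 0, 2]]
print(labels)   # ['Spam', 'Spam', 'NotSpam']
```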
Feature Selection Example
Task: text classification
Feature template definition: Word – just one template
Feature instantiation: words from training (and test?) data
Feature selection: stopword removal – remove the top K (~100) highest-frequency words
- Words like: the, a, have, is, to, for, ...
Feature weighting: apply tf*idf feature weighting
- tf = term frequency; idf = inverse document frequency
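A minimal sketch of the selection step above (removing the K most frequent word types as stopwords), assuming a toy tokenized corpus; K is tiny here only because the corpus is.

```python
from collections import Counter

# Hypothetical training corpus: one string per document.
corpus = ["the offer is for you", "the meeting is at noon", "claim the prize for free"]

K = 2  # in practice roughly 100; kept small for the toy corpus
freq = Counter(w for doc in corpus for w in doc.split())
stopwords = {w for w, _ in freq.most_common(K)}   # highest-frequency words, e.g. {'the', 'is'}

filtered = [[w for w in doc.split() if w not in stopwords] for doc in corpus]
print(filtered)
```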
The Curse of Dimensionality
Think of the instances as vectors of features: # of features = # of dimensions
Number of features potentially enormous: # of words in a corpus continues to increase with corpus size
High dimensionality is problematic:
- Leads to data sparseness: hard to create a valid model, hard to predict and generalize – think kNN
- Leads to high computational cost
- Leads to difficulty with estimation/learning: more dimensions -> more samples needed to learn the model
Breaking the Curse
Dimensionality reduction:
- Produce a representation with fewer dimensions, but with comparable performance
- More formally, given an original feature set r, create a new set r' with |r'| < |r| and comparable performance
Functionally, many ML algorithms do not scale well:
- Expensive: training and testing cost
- Poor prediction: overfitting, sparseness
Dimensionality Reduction
Given an initial feature set r, create a feature set r' s.t. |r'| < |r|
Approaches:
- r' the same for all classes (aka global) vs. r' different for each class (aka local)
- Feature selection/filtering vs. feature mapping (aka extraction)
Feature Selection
Feature selection: r' is a subset of r
How can we pick features?
Extrinsic 'wrapper' approaches:
- For each subset of features: build and evaluate a classifier for the task
- Pick the subset of features with the best performance
Intrinsic 'filtering' methods:
- Use some intrinsic (statistical?) measure
- Pick the features with the highest scores
Feature Selection
Wrapper approach:
- Pros: easy to understand and implement; clear relationship between selected features and task performance
- Cons: computationally intractable – 2^|r| subsets * (training + testing); specific to task and classifier; ad hoc
Filtering approach:
- Pros: theoretical basis; less task- and classifier-specific
- Cons: doesn't always boost task performance
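A sketch of the generic filtering pattern: score every feature once with some intrinsic measure and keep the top k. The scorer here is plain document frequency on toy data; any of the measures introduced below could be substituted.

```python
def select_top_k(features, score, k):
    """Generic filter: rank features by an intrinsic score, keep the best k."""
    return set(sorted(features, key=score, reverse=True)[:k])

# Toy corpus: each document is the set of terms it contains (made-up data).
docs = [{"free", "offer", "meeting"}, {"meeting", "noon"}, {"free", "prize"}]
vocab = set().union(*docs)

def doc_freq(term):
    return sum(term in d for d in docs)

print(select_top_k(vocab, doc_freq, k=2))   # {'free', 'meeting'}
```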
Feature Mapping
Feature mapping (extraction) approaches:
- Features in r' represent combinations/transformations of features in r
- Example: many words are near-synonyms, but are treated as unrelated
  - Map them to a new concept representing all of them: big, large, huge, gigantic, enormous -> concept of 'bigness'
Examples:
- Term classes, e.g. class-based n-grams, derived from term clusters
- Dimensions in Latent Semantic Analysis (LSA/LSI): the result of Singular Value Decomposition (SVD) on the term-document matrix; produces the 'closest' rank-r' approximation of the original
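A minimal sketch of the LSA idea using NumPy's SVD on a tiny, made-up term-document count matrix; truncating to the top r' singular values yields the closest (least-squares) rank-r' approximation mentioned above.

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents (made-up values).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2                                           # number of latent dimensions kept
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]     # best rank-r approximation of A
doc_vectors = np.diag(s[:r]) @ Vt[:r, :]        # documents in the r-dimensional latent space

print(np.round(A_r, 2))
print(np.round(doc_vectors, 2))
```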
Feature Mapping
Pros:
- Data-driven
- Theoretical basis – guarantees on matrix similarity
- Not bound by the initial feature space
Cons:
- Some ad-hoc factors, e.g. # of dimensions
- Resulting feature space can be hard to interpret
Feature Filtering
Filtering approaches:
- Apply some scoring method to features to rank their informativeness or importance w.r.t. some class
- Fairly fast and classifier-independent
- Many different measures: mutual information, information gain, chi-squared, etc.
Feature Scoring Measures
Basic Notation, Distributions
Assume a binary representation of terms and classes:
- tk: term in T; ci: class in C
- P(tk): proportion of documents in which tk appears
- P(ci): proportion of documents of class ci
- Binary, so P(!tk) = 1 - P(tk) and P(!ci) = 1 - P(ci)
Setting Up
A 2x2 contingency table of document counts for term tk and class ci:

        !ci   ci
!tk      a    b
tk       c    d
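A small sketch (toy data, hypothetical helper name) of filling the a, b, c, d cells from a labeled collection: count documents by whether they contain tk and whether they belong to ci.

```python
# Toy labeled collection: (set of terms in the document, class label).
docs = [({"free", "offer"}, "Spam"),
        ({"meeting", "noon"}, "NotSpam"),
        ({"free", "prize"}, "Spam"),
        ({"offer", "meeting"}, "NotSpam")]

def contingency(term, cls, docs):
    """Return (a, b, c, d): document counts for (!t,!c), (!t,c), (t,!c), (t,c)."""
    a = b = c = d = 0
    for terms, label in docs:
        has_t, in_c = term in terms, label == cls
        if not has_t and not in_c:
            a += 1
        elif not has_t and in_c:
            b += 1
        elif has_t and not in_c:
            c += 1
        else:
            d += 1
    return a, b, c, d

print(contingency("free", "Spam", docs))   # (2, 0, 0, 2)
```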
Feature Selection Functions
Question: what makes a good feature?
Perspective: the best features are those that are most DIFFERENTLY distributed across classes
I.e. the best features are those that most effectively differentiate between classes
Term Selection Functions: DF
Document frequency (DF): the number of documents in which tk appears
Applying DF: remove terms with DF below some threshold
Intuition: very rare terms won't help with categorization, or are not useful globally
Pros: easy to implement, scalable
Cons: ad hoc; low-DF terms can be 'topical'
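A minimal sketch of DF-based filtering on a toy corpus: count how many documents each term occurs in, then drop terms below a threshold (the threshold and data are placeholders).

```python
from collections import Counter

# Toy corpus: each document is the set of terms it contains.
docs = [{"free", "offer"}, {"free", "meeting"}, {"free", "prize"}, {"noon"}]

df = Counter(term for doc in docs for term in doc)   # document frequency per term
min_df = 2
kept = {t for t, n in df.items() if n >= min_df}

print(kept)   # {'free'}
```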
Term Selection Functions: MI
Pointwise Mutual Information (MI): MI(t,c) = log [ P(t,c) / (P(t) P(c)) ]
MI = 0 if t and c are independent
Issue: can be heavily influenced by the marginals – a problem when comparing terms of differing frequencies
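A sketch of estimating pointwise MI from the 2x2 counts above, where (with this table's orientation) P(t,c) ≈ d/N, P(t) ≈ (c+d)/N, and P(c) ≈ (b+d)/N; the example counts are made up.

```python
import math

def pmi(a, b, c, d):
    """Pointwise MI of term t and class c from 2x2 counts:
    a = (!t,!c), b = (!t,c), c = (t,!c), d = (t,c)."""
    n = a + b + c + d
    p_tc = d / n              # P(t, c)
    p_t = (c + d) / n         # P(t)
    p_c = (b + d) / n         # P(c)
    return math.log(p_tc / (p_t * p_c))

print(pmi(40, 10, 10, 40))   # > 0: term and class positively associated
print(pmi(25, 25, 25, 25))   # = 0: independent
```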
Term Selection Functions: IG
Information Gain
Intuition: transmitting Y, how many bits can we save if we know X?
IG(Y,X) = H(Y) - H(Y|X)
Information Gain: Derivation
From F. Xia, ‘11
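The derivation on the original slide did not survive extraction; as a stand-in, here is a small sketch that computes IG(Y,X) = H(Y) - H(Y|X) directly from joint counts of a binary feature X and class Y (toy numbers).

```python
import math
from collections import defaultdict

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(joint):
    """IG(Y, X) = H(Y) - H(Y|X), where joint maps (x, y) -> count."""
    n = sum(joint.values())
    y_counts, x_counts = defaultdict(float), defaultdict(float)
    for (x, y), v in joint.items():
        y_counts[y] += v
        x_counts[x] += v
    h_y = entropy([v / n for v in y_counts.values()])
    h_y_given_x = 0.0
    for xx, n_x in x_counts.items():
        cond = [v / n_x for (x, _), v in joint.items() if x == xx]
        h_y_given_x += (n_x / n) * entropy(cond)
    return h_y - h_y_given_x

# Toy counts: X = term present (1) / absent (0), Y = class label.
counts = {(1, "Spam"): 40, (1, "NotSpam"): 10, (0, "Spam"): 10, (0, "NotSpam"): 40}
print(round(information_gain(counts), 3))   # 0.278
```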
More Feature Selection
GSS coefficient:
NGL coefficient: (N = # of docs)
Chi-square:
From F. Xia, '11
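The formulas themselves were lost in the slide-to-text conversion. As a hedged reconstruction, the usual definitions in the text-categorization literature, written in terms of the 2x2 counts a, b, c, d (N = a + b + c + d), are:

```latex
\mathrm{GSS}(t_k, c_i) = P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)
                       = \frac{ad - bc}{N^2}

\mathrm{NGL}(t_k, c_i) = \frac{\sqrt{N}\,(ad - bc)}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}

\chi^2(t_k, c_i) = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
```

NGL is essentially the signed square root of the chi-square statistic, and GSS is the chi-square numerator in probability form with the denominator dropped; the versions on the original slide may differ in minor detail.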
More Term Selection
Relevancy score:
Odds ratio:
From F. Xia, '11
Global Selection
The previous measures compute class-specific scores.
What if you want to filter across ALL classes? Compute an aggregate measure across classes:
- Sum:
- Average:
- Max:
From F. Xia, '11
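The aggregation formulas were also lost in extraction; in their usual form (a hedged reconstruction), for a per-class score f(tk, ci):

```latex
f_{\mathrm{sum}}(t_k) = \sum_{i=1}^{|C|} f(t_k, c_i), \qquad
f_{\mathrm{avg}}(t_k) = \sum_{i=1}^{|C|} P(c_i)\, f(t_k, c_i), \qquad
f_{\mathrm{max}}(t_k) = \max_{i} f(t_k, c_i)
```

The average is often taken as the P(ci)-weighted average, as written here; an unweighted mean over classes is also used.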
What's the Best?
Answer: it depends on...
- Classifiers
- Type of data
- ...
According to (Yang and Pedersen, 1997): {OR, NGL, GSS} > {X2_max, IG_sum} > {#_avg} >> {MI}
- On text classification tasks
- Using kNN
From F. Xia, '11
Feature Weighting
For text classification, typical weights include:
- Binary: weights in {0,1}
- Term frequency (tf): # of occurrences of tk in document di
- Inverse document frequency (idf): dfk = # of docs in which tk appears; N = # of docs; idf = log(N / (1 + dfk))
- tf*idf = tf * idf
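A small sketch of tf*idf weighting on a toy corpus, using the slide's idf = log(N / (1 + dfk)) variant; the documents are made up.

```python
import math
from collections import Counter

# Toy corpus: one list of tokens per document.
docs = [["free", "offer", "free"], ["meeting", "noon"], ["free", "prize"]]
N = len(docs)

df = Counter(term for doc in docs for term in set(doc))     # document frequency
idf = {t: math.log(N / (1 + n)) for t, n in df.items()}     # idf as defined on the slide

def tfidf(doc):
    tf = Counter(doc)                                        # raw term frequency in this doc
    return {t: tf[t] * idf[t] for t in tf}

print(tfidf(docs[0]))   # {'free': 0.0, 'offer': log(3/2) ~ 0.405}
```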
Chi Square
Tests for the presence/absence of a relation between random variables
Bivariate analysis: tests 2 random variables
Can test the strength of the relationship
(Strictly speaking) doesn't test the direction
Chi Square Example
Can gender predict shoe choice?
A: male/female (features); B: shoe choice (classes: {sandal, sneaker, ...})

          sandal   sneaker   leather shoe   boot   other
Male         6        17          13          9      5
Female      13         5           7         16      9

Due to F. Xia
Comparing Distributions
Observed distribution (O):

          sandal   sneaker   leather shoe   boot   other
Male         6        17          13          9      5
Female      13         5           7         16      9

Expected distribution (E):

          sandal   sneaker   leather shoe   boot   other   Total
Male        9.5       11          10        12.5     7       50
Female      9.5       11          10        12.5     7       50
Total       19        22          20         25     14      100

Due to F. Xia
Computing Chi Square
Expected value for a cell = row_total * column_total / table_total
X2 = (6 - 9.5)^2/9.5 + (17 - 11)^2/11 + ... = 14.026
Calculating X2
- Tabulate the contingency table of observed values O
- Compute row and column totals
- Compute the table of expected values E, given the row/column totals, assuming no association
- Compute X2
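The steps above, sketched in code on the shoe example; it reproduces the X2 value from the previous slide.

```python
# Observed counts from the shoe example: rows = gender, columns = shoe type.
observed = [[6, 17, 13, 9, 5],      # Male
            [13, 5, 7, 16, 9]]      # Female

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence: row_total * column_total / table_total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

print(round(chi2, 3))   # 14.027 (the slide rounds this to 14.026)
```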
For 2x2 Table
O:
        !ci   ci
!tk      a    b
tk       c    d

E:
        !ci             ci              Total
!tk     (a+b)(a+c)/N    (a+b)(b+d)/N    a+b
tk      (c+d)(a+c)/N    (c+d)(b+d)/N    c+d
Total   a+c             b+d             N
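For the 2x2 case the cell-by-cell sum collapses to a standard closed form (not shown on the original slide) in terms of a, b, c, d:

```latex
\chi^2 = \sum_{\text{cells}} \frac{(O - E)^2}{E}
       = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}, \qquad N = a + b + c + d
```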
X2 Test
Test whether the random variables are independent
Null hypothesis: the 2 R.V.s are independent
Compute the X2 statistic
Compute the degrees of freedom: df = (# rows - 1)(# cols - 1); shoe example: df = (2-1)(5-1) = 4
Look up the probability of the X2 statistic value in a X2 table
If the probability is low – below some significance level – we can reject the null hypothesis
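A small check of the test on the shoe example (X2 ≈ 14.03, df = 4), using SciPy's chi-square survival function in place of a printed X2 table; SciPy is an assumed extra dependency here.

```python
from scipy.stats import chi2

x2, df = 14.03, 4
p = chi2.sf(x2, df)     # P(X^2 >= 14.03) under the null hypothesis of independence
print(round(p, 4))      # ~ 0.0072

alpha = 0.05
print("reject H0" if p < alpha else "fail to reject H0")   # reject H0
```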
Requirements for X2 Test
- Events are assumed independent and from the same distribution
- Outcomes must be mutually exclusive
- Use raw frequencies, not percentages
- Sufficient values per cell: > 5