INTRODUCTION TO Machine Learning
ETHEM ALPAYDIN © The MIT Press, 2004
Edited for CS 536 Fall 2005 – Rutgers University, Ahmed Elgammal
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml
Lecture Slides for CHAPTER 6: Dimensionality Reduction
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Why Reduce Dimensionality?
1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of observing the feature
4. Simpler models are more robust on small datasets
5. More interpretable; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions
Feature Selection vs Extraction
Feature selection: choose k < d important features, ignoring the remaining d – k (subset selection algorithms).
Feature extraction: project the original xi, i = 1,...,d dimensions to new k < d dimensions zj, j = 1,...,k.
Examples: principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA).
Subset Selection
There are 2^d subsets of d features.
Forward search: add the best feature at each step.
  The set of features F is initially Ø.
  At each iteration, find the best new feature: j = argmin_i E(F ∪ {xi})
  Add xj to F if E(F ∪ {xj}) < E(F)
This is a hill-climbing O(d²) algorithm.
Backward search: start with all features and remove one at a time, if possible.
Floating search: add k, remove l.
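The forward-search loop above can be sketched as follows; a minimal Python illustration, where `error` is a hypothetical stand-in for the validation error E of a feature subset (not part of the slides):

```python
def forward_search(d, error):
    """Greedy forward selection: start from the empty set and, at each
    step, add the feature whose inclusion minimizes the error E.
    `error` maps a feature subset (a set of indices) to a validation error."""
    F = set()
    best_err = error(F)
    while len(F) < d:
        # Evaluate E(F ∪ {x_i}) for every feature not yet selected
        candidates = [(error(F | {i}), i) for i in range(d) if i not in F]
        err_j, j = min(candidates)
        if err_j < best_err:       # add x_j only if it improves E(F)
            F.add(j)
            best_err = err_j
        else:
            break                  # hill-climbing: stop at a local optimum
    return F, best_err

# Toy error where only features 0 and 2 are informative
useful = {0, 2}
err = lambda F: 1.0 - 0.4 * len(F & useful) + 0.01 * len(F - useful)
selected, e = forward_search(5, err)
```

Each of the (at most) d iterations evaluates up to d candidate subsets, which is where the O(d²) cost quoted above comes from.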
Principal Components Analysis (PCA)
Find a low-dimensional space such that when x is projected there, information loss is minimized.
The projection of x on the direction of w is z = wᵀx.
Find w such that Var(z) is maximized:
Var(z) = Var(wᵀx) = E[(wᵀx – wᵀµ)²]
       = E[(wᵀx – wᵀµ)(wᵀx – wᵀµ)]
       = E[wᵀ(x – µ)(x – µ)ᵀw]
       = wᵀE[(x – µ)(x – µ)ᵀ]w = wᵀΣw
where Var(x) = E[(x – µ)(x – µ)ᵀ] = Σ
Maximize Var(z) subject to ||w|| = 1:

max_w1 w1ᵀΣw1 – α(w1ᵀw1 – 1)

Σw1 = αw1, that is, w1 is an eigenvector of Σ. Choose the one with the largest eigenvalue for Var(z) to be max.
Second principal component: max Var(z2), s.t. ||w2|| = 1 and orthogonal to w1:

max_w2 w2ᵀΣw2 – α(w2ᵀw2 – 1) – β(w2ᵀw1 – 0)

Σw2 = αw2, that is, w2 is another eigenvector of Σ, and so on.
What PCA does

z = Wᵀ(x – m)

where the columns of W are the eigenvectors of Σ and m is the sample mean. This centers the data at the origin and rotates the axes.
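A minimal NumPy sketch of this procedure (the toy data are illustrative): eigendecompose the sample covariance, keep the top-k eigenvectors as the columns of W, and project z = Wᵀ(x – m).

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the sample covariance.
    X is N x d; returns the sample mean m, the d x k matrix W whose
    columns are the top-k eigenvectors of Sigma, and z = W^T (x - m)."""
    m = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)           # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending order (symmetric)
    order = np.argsort(eigvals)[::-1]         # sort descending by eigenvalue
    W = eigvecs[:, order[:k]]
    Z = (X - m) @ W                           # center the data, then rotate
    return m, W, Z

rng = np.random.default_rng(0)
# Anisotropic Gaussian: most variance lies along the first axis
X = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.2])
m, W, Z = pca(X, k=2)
```

Here the first column of W should come out close to the first coordinate axis, and the projected variances satisfy Var(z1) ≥ Var(z2).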
How to choose k?

Proportion of Variance (PoV) explained:

PoV = (λ1 + λ2 + … + λk) / (λ1 + λ2 + … + λk + … + λd)

where the λi are sorted in descending order. Typically, stop at PoV > 0.9.
A scree graph plots PoV vs k; stop at the "elbow".
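The PoV rule can be sketched in NumPy (the eigenvalues below are made up for illustration):

```python
import numpy as np

def choose_k(eigvals, threshold=0.9):
    """Smallest k such that PoV = (λ1+...+λk)/(λ1+...+λd) exceeds
    the threshold; assumes eigvals are sorted in descending order."""
    pov = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(pov, threshold) + 1)

lambdas = np.array([4.0, 3.0, 1.5, 1.0, 0.5])   # sorted descending, total 10
k = choose_k(lambdas)       # PoV per k: 0.40, 0.70, 0.85, 0.95, 1.0
```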
Principal Component Analysis (PCA)

Given a set of points {x1, x2, …, xN}, xi ∈ Rᵈ, we are looking for a linear projection: a linear combination of m orthogonal basis vectors (the columns of A), with m << d:

x ≈ A·c,  c ∈ Rᵐ

that is, x ≈ c1·a1 + c2·a2 + … + cm·am.

What is the projection that minimizes the reconstruction error?

E = ∑i ||xi – A·ci||²
Principal Component Analysis (PCA)

Given a set of points {x1, x2, …, xN}, xi ∈ Rᵈ:
1. Center the points: compute µ = (1/N) ∑i xi
2. Form P = [x1 – µ, x2 – µ, …, xN – µ] ∈ R^(d×N)
3. Compute the covariance matrix Q = PPᵀ
4. Compute the eigenvectors of Q: Q·ek = λk·ek
The eigenvectors are the orthogonal basis we are looking for.
Singular Value Decomposition

SVD: if A is a real m×n matrix, then there exist orthogonal matrices U (m×m) and V (n×n) such that

UᵀAV = Σ = diag(σ1, σ2, …, σp),  p = min{m, n}

equivalently, A = UΣVᵀ (A_{m×n} = U_{m×m} Σ_{m×n} Vᵀ_{n×n}).

Singular values: the non-negative square roots of the eigenvalues of AᵀA, denoted σi, i = 1,…,n. AᵀA is symmetric ⇒ its eigenvalues and the singular values are real. Singular values are arranged in decreasing order.

AᵀA = (UΣVᵀ)ᵀ(UΣVᵀ) = VΣᵀUᵀUΣVᵀ = VΣᵀΣVᵀ = VΣ²Vᵀ

so AᵀA·V = V·Σ², i.e. each column v of V satisfies AᵀA·v = λ·v with λ = σ².
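These identities can be checked numerically; a small NumPy sketch (the random matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))            # real m x n matrix, m = 4, n = 3

U, s, Vt = np.linalg.svd(A)            # s holds sigma_1 >= ... >= sigma_p
Sigma = np.zeros((4, 3))
Sigma[:3, :3] = np.diag(s)             # p = min{m, n} = 3 singular values

# A = U Σ V^T reconstructs A (up to floating-point error)
assert np.allclose(A, U @ Sigma @ Vt)

# Singular values are the non-negative square roots of the eigenvalues
# of A^T A, arranged in decreasing order
eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(s, np.sqrt(np.maximum(eig, 0.0)))
assert np.all(s[:-1] >= s[1:])
```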
SVD for PCA

SVD can be used to efficiently compute the basis: write P = UΣVᵀ; then

PPᵀ = (UΣVᵀ)(UΣVᵀ)ᵀ = UΣVᵀVΣᵀUᵀ = UΣΣᵀUᵀ = UΣ²Uᵀ

so PPᵀ·U = U·Σ², i.e. each column u of U satisfies PPᵀ·u = λ·u.

The columns of U are the eigenvectors (the basis).
Most important thing to notice: distance in the eigen-space approximates distance in the original space:

||xi – xj|| ≈ ||ci – cj||
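A NumPy sketch of this shortcut (toy data; it is useful when d is large, since the d×d matrix PPᵀ never needs to be formed explicitly):

```python
import numpy as np

def pca_svd(X, k):
    """PCA basis via SVD of the centered data matrix.
    With P = [x1-mu, ..., xN-mu] (d x N) and P = U S V^T, we get
    P P^T = U S^2 U^T, so the columns of U are the eigenvectors of P P^T."""
    mu = X.mean(axis=0)
    P = (X - mu).T                          # d x N centered data matrix
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return mu, U[:, :k], s[:k] ** 2         # basis and eigenvalues of P P^T

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5)) * np.array([3.0, 1.0, 1.0, 0.5, 0.1])
mu, U, lam = pca_svd(X, k=2)

# Agrees (up to sign) with the top eigenvector of the scatter matrix P P^T
P = (X - mu).T
w = np.linalg.eigh(P @ P.T)[1][:, -1]       # eigh: last column = largest
```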
PCA

Most important thing to notice: distance in the eigen-space approximates distance in the original space:

||xi – xj|| ≈ ||ci – cj||

where c = Aᵀx and x ≈ A·c, with x ∈ Rᵈ, c ∈ Rᵐ, m << d.
Eigenface

Use PCA and subspace projection to perform face recognition: describe a face as a linear combination of face basis images.
Matthew Turk and Alex Pentland, "Eigenfaces for Recognition", 1991.
Face Recognition - Eigenface
MIT Media Lab – Face Recognition demo page: http://vismod.media.mit.edu/vismod/demos/facerec/
Factor Analysis
Find a small number of factors z which, when combined, generate x:

xi – µi = vi1·z1 + vi2·z2 + ... + vik·zk + εi

where zj, j = 1,...,k are the latent factors with E[zj] = 0, Var(zj) = 1, Cov(zi, zj) = 0, i ≠ j;
εi are the noise sources with Var(εi) = ψi, Cov(εi, εj) = 0, i ≠ j, Cov(εi, zj) = 0;
and vij are the factor loadings.
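A NumPy sketch of this generative model (the loadings and noise variances below are made-up illustrations): sampling x = µ + Vz + ε implies Cov(x) = VVᵀ + diag(ψ), which a large sample covariance should approximate.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, N = 4, 2, 20000

V = rng.normal(size=(d, k))               # factor loadings v_ij
psi = np.array([0.1, 0.2, 0.1, 0.3])      # noise variances Var(eps_i) = psi_i
mu = np.zeros(d)

# Generative model: x - mu = V z + eps, with z ~ N(0, I), independent noise
Z = rng.normal(size=(N, k))               # E[z_j]=0, Var(z_j)=1, Cov=0
eps = rng.normal(size=(N, d)) * np.sqrt(psi)
X = mu + Z @ V.T + eps

# Implied covariance of x: V V^T + diag(psi)
S_model = V @ V.T + np.diag(psi)
S_sample = np.cov(X, rowvar=False)
```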
PCA vs FA

PCA goes from x to z:  z = Wᵀ(x – µ)
FA goes from z to x:  x – µ = Vz + ε
Factor Analysis
In FA, factors zj are stretched, rotated and translated to generate x
Multidimensional Scaling
Given pairwise distances between N points, dij, i, j = 1,...,N, place the points on a low-dimensional map such that the distances are preserved.
z = g(x | θ). Find θ that minimizes the Sammon stress:

E(θ | X) = ∑_{r,s} ( ||z^r – z^s|| – ||x^r – x^s|| )² / ||x^r – x^s||²
         = ∑_{r,s} ( ||g(x^r | θ) – g(x^s | θ)|| – ||x^r – x^s|| )² / ||x^r – x^s||²
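The stress itself is straightforward to compute; a NumPy sketch (toy data; a map that preserves all pairwise distances has zero stress):

```python
import numpy as np

def sammon_stress(X, Z):
    """Sammon stress: sum over point pairs (r, s) of
    (||z_r - z_s|| - ||x_r - x_s||)^2 / ||x_r - x_s||^2."""
    E, N = 0.0, len(X)
    for r in range(N):
        for s in range(r + 1, N):
            dx = np.linalg.norm(X[r] - X[s])
            dz = np.linalg.norm(Z[r] - Z[s])
            E += (dz - dx) ** 2 / dx ** 2
    return E

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 5))

Z_same = X.copy()                    # identity map: all distances preserved
Z_rand = rng.normal(size=(10, 2))    # arbitrary 2-D placement
```

An MDS algorithm would adjust the low-dimensional placements (the parameters θ of g) to drive this stress down.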
Map of Europe by MDS
Map from CIA – The World Factbook: http://www.cia.gov/
Linear Discriminant Analysis

Find a low-dimensional space such that when x is projected, classes are well-separated.
Find w that maximizes

J(w) = (m1 – m2)² / (s1² + s2²)

where, with rᵗ = 1 for class 1 and rᵗ = 0 for class 2,

m1 = ∑t wᵀxᵗ·rᵗ / ∑t rᵗ    s1² = ∑t (wᵀxᵗ – m1)²·rᵗ

and m2, s2² are defined analogously using 1 – rᵗ.
Between-class scatter:

(m1 – m2)² = (wᵀm1 – wᵀm2)²
           = wᵀ(m1 – m2)(m1 – m2)ᵀw
           = wᵀS_B·w   where S_B = (m1 – m2)(m1 – m2)ᵀ

Within-class scatter:

s1² = ∑t (wᵀxᵗ – m1)²·rᵗ
    = ∑t wᵀ(xᵗ – m1)(xᵗ – m1)ᵀw·rᵗ = wᵀS1·w
      where S1 = ∑t rᵗ(xᵗ – m1)(xᵗ – m1)ᵀ

s1² + s2² = wᵀS_W·w   where S_W = S1 + S2
Fisher's Linear Discriminant

Find w that maximizes

J(w) = wᵀS_B·w / wᵀS_W·w = |wᵀ(m1 – m2)|² / wᵀS_W·w

LDA solution:

w = c·S_W⁻¹(m1 – m2)

Parametric solution, when p(x | Ci) ~ N(µi, Σ):

w = Σ⁻¹(µ1 – µ2)
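The closed-form LDA solution can be sketched in NumPy (the two Gaussian classes are illustrative):

```python
import numpy as np

def fisher_w(X1, X2):
    """Fisher's discriminant direction w ∝ S_W^{-1}(m1 - m2),
    where S_W = S1 + S2 is the within-class scatter."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S1 + S2, m1 - m2)

rng = np.random.default_rng(5)
# Two unit-covariance Gaussians separated along the first axis
X1 = rng.normal(size=(300, 3)) + np.array([2.0, 0.0, 0.0])
X2 = rng.normal(size=(300, 3)) - np.array([2.0, 0.0, 0.0])
w = fisher_w(X1, X2)

# Projections on w should be well separated between the two classes
p1, p2 = X1 @ w, X2 @ w
```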
K > 2 Classes

Within-class scatter:

S_W = ∑_{i=1}^{K} S_i   where S_i = ∑t r_iᵗ(xᵗ – m_i)(xᵗ – m_i)ᵀ

Between-class scatter:

S_B = ∑_{i=1}^{K} N_i(m_i – m)(m_i – m)ᵀ   where m = (1/K) ∑_{i=1}^{K} m_i

Find W that maximizes

J(W) = |WᵀS_B·W| / |WᵀS_W·W|

The solution is given by the largest eigenvectors of S_W⁻¹S_B; S_B has maximum rank K – 1.
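A NumPy sketch of the multiclass case (toy data with K = 3 classes in d = 4 dimensions, so at most K − 1 = 2 useful directions):

```python
import numpy as np

def lda_multiclass(X, y, k):
    """Multiclass LDA: W is given by the leading eigenvectors of
    S_W^{-1} S_B, where S_B has rank at most K - 1."""
    d = X.shape[1]
    m = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)               # within-class scatter S_i
        Sb += len(Xc) * np.outer(mc - m, mc - m)    # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]          # descending eigenvalues
    return eigvecs[:, order[:k]].real

rng = np.random.default_rng(6)
means = np.array([[0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]], dtype=float)
X = np.vstack([rng.normal(size=(100, 4)) + mu for mu in means])
y = np.repeat([0, 1, 2], 100)
W = lda_multiclass(X, y, k=2)
Z = X @ W                                           # projected data
```

Note this weights S_B by the class sample counts N_i, one common convention; the class means in the projected space should remain well separated.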
Separating Style and Content

Objective: decompose two factors using linear methods.
Content: which character. Style: which font.
"Bilinear models": J. Tenenbaum and W. Freeman, "Separating Style and Content with Bilinear Models", Neural Computation 2000.
Figures from J. Tenenbaum and W. Freeman, "Separating Style and Content with Bilinear Models", Neural Computation 2000.
Bilinear Models

Symmetric bilinear model:

y^{sc} = ∑_{i,j} w_{ij}·a_i^s·b_j^c
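A NumPy sketch of symmetric bilinear synthesis, written in the per-component form y_k = ∑_{i,j} w_{ijk}·a_i^s·b_j^c used by Tenenbaum and Freeman (the dimensions and random tensors are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
I, J, K = 3, 2, 5                 # style dim, content dim, observation dim

W = rng.normal(size=(I, J, K))    # interaction weights w_ijk
a = rng.normal(size=I)            # style vector a^s (e.g. one font)
b = rng.normal(size=J)            # content vector b^c (e.g. one character)

# Symmetric bilinear synthesis: y_k = sum_ij w_ijk * a_i * b_j
y = np.einsum('ijk,i,j->k', W, a, b)

# Bilinearity: the output is linear in the style vector when the
# content is held fixed (and vice versa)
y_scaled = np.einsum('ijk,i,j->k', W, 2 * a, b)
```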
Bilinear Models

Asymmetric bilinear model: use style-dependent basis vectors:

y^{sc} = A^s·b^c

Head pose as style factor, person as content.
Person as style factor, pose as content.