Dimensionality Reduction: PCA
Machine Learning – CSE446
David Wadden (slides provided by Carlos Guestrin)
University of Washington
Feb 22, 2017
©Carlos Guestrin 2005-2017
Dimensionality reduction
- Input data may have thousands or millions of dimensions!
  - e.g., text data has one dimension per word in the vocabulary
- Dimensionality reduction: represent data with fewer dimensions
  - easier learning – fewer parameters
  - visualization – hard to visualize more than 3D or 4D
  - discover the “intrinsic dimensionality” of data
    - high-dimensional data that is truly lower dimensional
Lower dimensional projections
- Rather than picking a subset of the features, we can create new features that are combinations of existing features
- Let’s see this in the unsupervised setting
  - just X, but no Y
Linear projection and reconstruction
[Figure: 2-D data points in the (x[1], x[2]) plane projected onto a 1-dimensional line z; reconstruction: knowing only z, what was (x[1], x[2])?]
Principal component analysis – basic idea

- Project D-dimensional data into a K-dimensional space while preserving information:
  - e.g., project a space of 10,000 words into 3 dimensions
  - e.g., project 3-D into 2-D
- Choose the projection with minimum reconstruction error
Linear projections, a review
- Project a point into a (lower dimensional) space:
  - point: x_i = (x_i[1],…,x_i[D])
  - select a basis – a set of basis vectors – (u_1,…,u_K)
    - we consider an orthonormal basis: u_i · u_i = 1, and u_i · u_j = 0 for i ≠ j
  - select a center – x̄, which defines the offset of the space
  - the best coordinates in the lower dimensional space are given by the dot-products (z_i[1],…,z_i[K]), where

    z_i[j] = (x_i - \bar{x}) \cdot u_j
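A minimal NumPy sketch of this projection and reconstruction, assuming illustrative data and an orthonormal basis built via QR (none of these names or sizes come from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))       # N=100 points, D=5 dimensions (illustrative)
x_bar = X.mean(axis=0)              # the center of the projection space

# K=2 orthonormal basis vectors u_1, u_2 as columns of U (via QR factorization)
U, _ = np.linalg.qr(rng.normal(size=(5, 2)))

Z = (X - x_bar) @ U                 # coordinates: z_i[j] = (x_i - x_bar) . u_j
X_hat = x_bar + Z @ U.T             # reconstruction knowing only the z's
```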
PCA finds projection that minimizes reconstruction error

- Given N data points: x_i = (x_i[1],…,x_i[D]), i = 1…N
- Will represent each point as a projection:

    \hat{x}_i = \bar{x} + \sum_{j=1}^{K} z_i[j]\, u_j

  where

    \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \quad\text{and}\quad z_i[j] = (x_i - \bar{x}) \cdot u_j

- PCA: given K < D, find (u_1,…,u_K) minimizing the reconstruction error:

    \mathrm{error}_K = \sum_{i=1}^{N} (x_i - \hat{x}_i)^2
Understanding the reconstruction error

- Given K < D, find (u_1,…,u_K) minimizing the reconstruction error:

    \mathrm{error}_K = \sum_{i=1}^{N} (x_i - \hat{x}_i)^2, \quad\text{where}\quad \hat{x}_i = \bar{x} + \sum_{j=1}^{K} z_i[j]\, u_j \quad\text{and}\quad z_i[j] = (x_i - \bar{x}) \cdot u_j

- Note that x_i can be represented exactly by a D-dimensional projection:

    x_i = \bar{x} + \sum_{j=1}^{D} z_i[j]\, u_j

- Rewriting the error:

    \mathrm{error}_K = \sum_{i=1}^{N} (x_i - \hat{x}_i)^2
      = \sum_{i=1}^{N} \left[ \bar{x} + \sum_{j=1}^{D} z_i[j]\, u_j - \left( \bar{x} + \sum_{j=1}^{K} z_i[j]\, u_j \right) \right]^2
      = \sum_{i=1}^{N} \left[ \sum_{j=K+1}^{D} z_i[j]\, u_j \right]^2
      = \sum_{i=1}^{N} \left[ \sum_{j=K+1}^{D} z_i[j]\, u_j \cdot u_j\, z_i[j] + 2 \sum_{j=K+1}^{D} \sum_{\ell > j} z_i[j]\, u_j \cdot u_\ell\, z_i[\ell] \right]
      = \sum_{i=1}^{N} \sum_{j=K+1}^{D} (z_i[j])^2

  since the basis is orthonormal (u_j · u_j = 1 and u_j · u_ℓ = 0 for ℓ ≠ j), the cross terms vanish
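This identity is easy to check numerically. A hedged sketch (all names and sizes are illustrative): build a full orthonormal basis, reconstruct from the first K coordinates, and compare the two expressions for the error:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 50, 4, 2
X = rng.normal(size=(N, D))
x_bar = X.mean(axis=0)

U, _ = np.linalg.qr(rng.normal(size=(D, D)))  # full orthonormal basis u_1..u_D
Z = (X - x_bar) @ U                           # all D coordinates z_i[j]
X_hat = x_bar + Z[:, :K] @ U[:, :K].T         # keep only the first K terms

direct = np.sum((X - X_hat) ** 2)             # error_K as defined
via_z = np.sum(Z[:, K:] ** 2)                 # sum of discarded (z_i[j])^2
assert np.isclose(direct, via_z)
```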
Reconstruction error and covariance matrix
The covariance matrix of the data, and its entries:

    \Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T,
    \qquad
    \sigma_{m\ell} = \frac{1}{N} \sum_{i=1}^{N} (x_i[m] - \bar{x}[m])(x_i[\ell] - \bar{x}[\ell])

Rewriting the reconstruction error in terms of \Sigma:

    \mathrm{error}_K = \sum_{i=1}^{N} \sum_{j=K+1}^{D} \left[ u_j \cdot (x_i - \bar{x}) \right]^2
      = \sum_{i=1}^{N} \sum_{j=K+1}^{D} u_j^T (x_i - \bar{x})(x_i - \bar{x})^T u_j
      = \sum_{j=K+1}^{D} u_j^T \left[ \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T \right] u_j
      = N \sum_{j=K+1}^{D} u_j^T \Sigma\, u_j
Minimizing reconstruction error and eigenvectors

- Minimizing the reconstruction error is equivalent to picking an (ordered) orthonormal basis (u_1,…,u_D) minimizing:

    \mathrm{error}_K = N \sum_{j=K+1}^{D} u_j^T \Sigma\, u_j

- Eigenvector:

    \Sigma u = \lambda u

- Minimizing the reconstruction error is equivalent to picking (u_{K+1},…,u_D) to be the eigenvectors with the smallest eigenvalues
Basic PCA algorithm

- Start from the N × D data matrix X (one row per datapoint)
- Recenter: subtract the mean from each row of X
  - X_c ← X − X̄
- Compute the covariance matrix:
  - Σ ← (1/N) X_c^T X_c
- Find the eigenvectors and eigenvalues of Σ
- Principal components: the k eigenvectors with the highest eigenvalues (a code sketch follows below)
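A minimal NumPy sketch of the algorithm above (the function name is mine; assumed shapes: X is N × D, one row per datapoint):

```python
import numpy as np

def pca_eig(X, k):
    """Top-k principal components via the covariance eigendecomposition."""
    X_c = X - X.mean(axis=0)              # recenter: subtract the mean
    Sigma = (X_c.T @ X_c) / len(X)        # covariance matrix, D x D
    evals, evecs = np.linalg.eigh(Sigma)  # eigh: eigensolver for symmetric matrices
    order = np.argsort(evals)[::-1]       # sort eigenvalues, largest first
    return evals[order[:k]], evecs[:, order[:k]]
```

np.linalg.eigh is the appropriate solver here because Σ is symmetric; it returns eigenvalues in ascending order, hence the re-sort.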
PCA example
    \hat{x}_i = \bar{x} + \sum_{j=1}^{K} z_i[j]\, u_j
PCA example – reconstruction
- only used the first principal component:

    \hat{x}_i = \bar{x} + \sum_{j=1}^{K} z_i[j]\, u_j \quad (\text{here } K = 1)
Eigenfaces [Turk, Pentland ’91]
- Input images: [figure]
- Principal components: [figure]
Eigenfaces reconstruction
- Each image corresponds to adding 8 principal components: [figure]
Scaling up
- The covariance matrix can be really big!
  - Σ is D × D
  - with, say, only 10,000 features, finding eigenvectors is already very slow…
- Use the singular value decomposition (SVD)
  - finds the top K eigenvectors
  - great implementations available, e.g., scipy.linalg.svd (see the sketch below)
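One hedged route for the large-D case (my suggestion; the slide names only scipy.linalg.svd): scipy.sparse.linalg.svds computes just the top-k singular triplets of the centered data matrix, so the D × D covariance matrix is never formed. Sizes below are illustrative:

```python
import numpy as np
from scipy.sparse.linalg import svds  # truncated SVD: only k singular triplets

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2000))          # N=200 points, D=2000 features
X_c = X - X.mean(axis=0)                  # recenter; no covariance matrix needed

W, s, Vt = svds(X_c, k=3)                 # top-3 singular triplets
order = np.argsort(s)[::-1]               # svds does not guarantee ordering
W, s, Vt = W[:, order], s[order], Vt[order]
# Rows of Vt are the principal components; the covariance eigenvalues are
# recoverable as s**2 / len(X).
```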
SVD

- Write X = W S V^T
  - X ← data matrix, one row per datapoint
  - W ← weight matrix, one row per datapoint – the coordinates of x_i in eigenspace
  - S ← singular value matrix, a diagonal matrix
    - in our setting, each entry is an eigenvalue λ_j
  - V^T ← singular vector matrix
    - in our setting, each row is an eigenvector v_j
PCA using SVD algorithm

- Start from the N × D data matrix X
- Recenter: subtract the mean from each row of X
  - X_c ← X − X̄
- Call an SVD algorithm on X_c – ask for the top k singular vectors
- Principal components: the k singular vectors with the highest singular values (rows of V^T)
  - Coefficients become: z_i[j] = w_i[j] s_j, i.e., Z = W S (since z_i[j] = (x_i − x̄) · u_j and X_c = W S V^T give Z = X_c V = W S); a sketch follows below
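A minimal sketch of this algorithm using scipy.linalg.svd (the function name and shapes are my assumptions; X is N × D):

```python
import numpy as np
from scipy.linalg import svd

def pca_svd(X, k):
    """Top-k principal components and projection coefficients via SVD."""
    x_bar = X.mean(axis=0)
    W, s, Vt = svd(X - x_bar, full_matrices=False)  # X_c = W S V^T
    Z = W[:, :k] * s[:k]              # coefficients: Z = W S, as noted above
    return x_bar, Vt[:k], Z           # mean, components (rows of Vt), coordinates
```

Reconstruction is then x_hat = x_bar + Z @ Vt[:k], matching \hat{x}_i = \bar{x} + \sum_j z_i[j]\, u_j.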
What you need to know
- Dimensionality reduction
  - why and when it’s important
- Simple feature selection
- Principal component analysis
  - minimizing reconstruction error
  - relationship to the covariance matrix and eigenvectors
  - using SVD