Dimensionality Reduction (continued)
Lecture 25
David Sontag New York University
Slides adapted from Carlos Guestrin and Luke Zettlemoyer
Basic PCA algorithm
• Start from the m-by-n data matrix X
• Recenter: subtract the mean from each row of X
  – Xc ← X − X̄
• Compute the covariance matrix:
  – Σ ← (1/m) Xc^T Xc
• Find the eigenvectors and eigenvalues of Σ
• Principal components: the k eigenvectors with the highest eigenvalues
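As a minimal sketch of this recipe in Python/NumPy (the function name and return values are our own, not from the slides):

import numpy as np

def pca(X, k):
    # X: m-by-n data matrix, one row per data point.
    x_bar = X.mean(axis=0)                    # mean of each feature
    Xc = X - x_bar                            # recenter: subtract mean from each row
    Sigma = (Xc.T @ Xc) / X.shape[0]          # covariance matrix (n by n)
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    U = eigvecs[:, order[:k]]                 # top-k principal components as columns
    return U, x_bar, eigvals[order[:k]]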
Linear projections, a review
• Project a point into a (lower-dimensional) space:
  – point: x = (x_1, …, x_n)
  – select a basis: a set of unit-length basis vectors (u_1, …, u_k)
    • we consider an orthonormal basis: u_i·u_i = 1, and u_i·u_j = 0 for i ≠ j
  – select a center x̄, which defines the offset of the space
  – the best coordinates in the lower-dimensional space are given by dot products: (z_1, …, z_k), with z_i = (x − x̄)·u_i
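A small sketch of these projections (assuming NumPy, with the orthonormal basis vectors stored as the columns of U; names are illustrative):

import numpy as np

def project(x, x_bar, U):
    # coordinates z_i = (x - x_bar) . u_i in the lower-dimensional space
    return U.T @ (x - x_bar)

def reconstruct(z, x_bar, U):
    # map coordinates back to the original space
    return x_bar + U @ z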
PCA finds the projection that minimizes reconstruction error
• Given m data points: x^i = (x^i_1, …, x^i_n), i = 1…m
• Will represent each point as a projection:
    \hat{x}^i = \bar{x} + \sum_{j=1}^{k} z^i_j u_j,  where \bar{x} = \frac{1}{m}\sum_{i=1}^{m} x^i and z^i_j = (x^i - \bar{x}) \cdot u_j
• PCA: Given k < n, find (u_1, …, u_k) minimizing the reconstruction error:
    error_k = \sum_{i=1}^{m} \left\| x^i - \hat{x}^i \right\|^2
[Figure: 2-D data points with the first principal component u_1 drawn in the (x_1, x_2) plane]
Understanding the reconstruction error
• Note that x^i can be represented exactly by an n-dimensional projection:
    x^i = \bar{x} + \sum_{j=1}^{n} z^i_j u_j
• Rewriting the error:
error_k = \sum_{i=1}^{m} \left\| x^i - \Big(\bar{x} + \sum_{j=1}^{k} z^i_j u_j\Big) \right\|^2

        = \sum_{i=1}^{m} \left\| \Big(\bar{x} + \sum_{j=1}^{n} z^i_j u_j\Big) - \Big(\bar{x} + \sum_{j=1}^{k} z^i_j u_j\Big) \right\|^2

        = \sum_{i=1}^{m} \left\| \sum_{j=k+1}^{n} z^i_j u_j \right\|^2

        = \sum_{i=1}^{m} \sum_{j=k+1}^{n} z^i_j\, u_j \cdot u_j\, z^i_j \;+\; \sum_{i=1}^{m} \sum_{j=k+1}^{n} \sum_{\substack{l=k+1 \\ l \neq j}}^{n} z^i_j\, u_j \cdot u_l\, z^i_l

        = \sum_{i=1}^{m} \sum_{j=k+1}^{n} (z^i_j)^2

(The cross terms vanish because the basis is orthonormal: u_j · u_l = 0 for l ≠ j and u_j · u_j = 1. Here z^i_j = (x^i − x̄) · u_j; each coordinate is a linear function of the input, z_j = w^{(j)}_0 + \sum_i w^{(j)}_i x_i.)
Error is the sum of squared weights for dimensions that have been cut!
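A quick numeric check of this fact on made-up data (a sketch; variable names are ours):

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 50, 5, 2
X = rng.normal(size=(m, n))
x_bar = X.mean(axis=0)
Xc = X - x_bar

# Any full orthonormal basis gives the identity; use the PCA basis from an SVD.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
U = Vt.T                                  # columns u_1, ..., u_n
Z = Xc @ U                                # z_ij = (x_i - x_bar) . u_j

recon_k = x_bar + Z[:, :k] @ U[:, :k].T   # keep only the first k dimensions
error_k = np.sum((X - recon_k) ** 2)
assert np.allclose(error_k, np.sum(Z[:, k:] ** 2))   # sum of squared cut coordinates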
Reconstruction error and covariance matrix
error_k = \sum_{i=1}^{m} \sum_{j=k+1}^{n} (z^i_j)^2

        = \sum_{i=1}^{m} \sum_{j=k+1}^{n} u_j^T (x^i - \bar{x})(x^i - \bar{x})^T u_j

        = \sum_{j=k+1}^{n} u_j^T \Big[ \sum_{i=1}^{m} (x^i - \bar{x})(x^i - \bar{x})^T \Big] u_j

so that

error_k = \sum_{j=k+1}^{n} u_j^T \Sigma\, u_j

(using z^i_j = u_j^T (x^i − x̄); the bracketed matrix is m·Σ, and the constant factor m only rescales the error, so it does not change the minimizing u_j).
Thus, to minimize the reconstruction error we want to minimize
    \sum_{j=k+1}^{n} u_j^T \Sigma u_j
Recall that to maximize the variance we want to maximize
    \sum_{j=1}^{k} u_j^T \Sigma u_j
These are equivalent, since for any orthonormal basis
    \sum_{j=1}^{k} u_j^T \Sigma u_j + \sum_{j=k+1}^{n} u_j^T \Sigma u_j = \sum_{j=1}^{n} u_j^T \Sigma u_j = \mathrm{trace}(\Sigma),
which is a constant that does not depend on the choice of basis.
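A small numeric check of this equivalence (illustrative data and names):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / X.shape[0]

eigvals, U = np.linalg.eigh(Sigma)
U = U[:, ::-1]                            # columns ordered by decreasing eigenvalue
k = 3
captured = sum(U[:, j] @ Sigma @ U[:, j] for j in range(k))
error    = sum(U[:, j] @ Sigma @ U[:, j] for j in range(k, Sigma.shape[0]))
assert np.isclose(captured + error, np.trace(Sigma))   # trace identity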
Dimensionality reduction with PCA
• In high-dimensional problems, data usually lies near a linear subspace, as noise introduces small variability.
• Only keep data projections onto the principal components with large eigenvalues.
• Can ignore the components of lesser significance: you might lose some information, but if the eigenvalues are small, you don't lose much.
[Figure: bar chart of the variance (%) explained by each of PC1 through PC10]
[Slide from Aarti Singh]
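One common way to pick k from the eigenvalues is to keep enough components to explain most of the variance; a sketch (the 95% threshold is illustrative, not from the slides):

import numpy as np

def choose_k(eigvals, target=0.95):
    # smallest k whose top eigenvalues explain at least `target` of the variance
    eigvals = np.sort(eigvals)[::-1]
    explained = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(explained, target) + 1)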
PCA example
[Figure: example data, its projection onto the first principal component, and its reconstruction]
What’s the difference between the first eigenvector and linear regression?
Suppose we have data { (x,y) }
[Figure: three fits to the same data: least-squares regression predicting y from x, regression predicting x from y, and PCA, whose first eigenvector minimizes perpendicular (reconstruction) distance to the line]
[Pictures from “Cerebral Mastication” blog]
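A sketch contrasting the two on synthetic 2-D data (the noise level and seed are arbitrary): regression minimizes vertical errors, while the first principal component minimizes perpendicular distance, so the two slopes generally differ.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.5 * x + 0.1 * rng.normal(size=200)
D = np.column_stack([x, y])
Dc = D - D.mean(axis=0)

slope_reg = (Dc[:, 0] @ Dc[:, 1]) / (Dc[:, 0] @ Dc[:, 0])   # least squares, y from x
eigvals, eigvecs = np.linalg.eigh(Dc.T @ Dc / len(Dc))
u1 = eigvecs[:, -1]                                         # largest-eigenvalue direction
slope_pca = u1[1] / u1[0]                                   # slope of the first PC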
Eigenfaces [Turk, Pentland '91]
• [Figure: input face images and the learned principal components ("eigenfaces")]
Eigenfaces reconstruction
• Each image corresponds to adding together the principal components:
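A sketch of that reconstruction (assuming the top principal components are stored as the rows of `eigenfaces` and images are flattened to vectors; names are illustrative):

import numpy as np

def reconstruct_face(face, mean_face, eigenfaces, k):
    # approximate one flattened image using the top-k eigenfaces
    weights = eigenfaces[:k] @ (face - mean_face)   # project onto each eigenface
    return mean_face + eigenfaces[:k].T @ weights   # add the weighted components back up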
Scaling up
• The covariance matrix can be really big!
  – Σ is n by n
  – 10,000 features can be common!
  – finding eigenvectors is very slow…
• Use the singular value decomposition (SVD)
  – finds the top k eigenvectors
  – great implementations available, e.g., Matlab's svd
SVD
• Write X = W S V^T
  – X ← data matrix, one row per data point
  – W ← weight matrix, one row per data point: the coordinates of x^i in the eigenspace
  – S ← singular value matrix, diagonal
    • in our setting each singular value corresponds to an eigenvalue λ_j of Σ (for centered data, λ_j = s_j² / m)
  – V^T ← singular vector matrix
    • in our setting each row is an eigenvector v_j
PCA using SVD algorithm
• Start from the m-by-n data matrix X
• Recenter: subtract the mean from each row of X
  – Xc ← X − X̄
• Call an SVD routine on Xc
  – ask for the top k singular vectors
• Principal components: the k singular vectors with the highest singular values (rows of V^T)
  – Coefficients: project each point onto the new vectors
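A minimal sketch of this recipe using NumPy's SVD (function name and return values are our own):

import numpy as np

def pca_svd(X, k):
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                                      # recenter
    W, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = W S V^T
    components = Vt[:k]                                 # top-k right singular vectors (rows)
    Z = Xc @ components.T                               # coefficients: project each point
    return components, Z, s[:k]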
Non-linear methods
• Latent feature extraction: combinations of observed features provide a more efficient representation and capture underlying relations that govern the data.
  – E.g., ego, personality, and intelligence are hidden attributes that characterize human behavior, instead of survey questions; topics (sports, science, news, etc.) instead of documents.
  – Often may not have a physical meaning.
• Linear: Principal Component Analysis (PCA), Factor Analysis, Independent Component Analysis (ICA)
• Nonlinear: Laplacian Eigenmaps, ISOMAP, Local Linear Embedding (LLE)
Slide from Aarti Singh
[Figure: the "swiss roll", an example of data lying on a non-linear manifold]
Latent Dirichlet allocation
Probabilistic topic models
(Blei, Ng, Jordan, JMLR '03)
[Figure: example topics as word distributions, e.g. {gene 0.04, dna 0.02, genetic 0.01, …}, {life 0.02, evolve 0.01, organism 0.01, …}, {brain 0.04, neuron 0.02, nerve 0.01, …}, {data 0.02, number 0.02, computer 0.01, …}, together with a document, its topic proportions, and its topic assignments]
Figure 1: The intuitions behind latent Dirichlet allocation. We assume that some number of "topics," which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative; they are not fit from real data. See Figure 2 for topics fit from data.
… model assumes the documents arose. (The interpretation of LDA as a probabilistic model is fleshed out below in Section 2.1.)
We formally define a topic to be a distribution over a fixed vocabulary. For example, the genetics topic has words about genetics with high probability and the evolutionary biology topic has words about evolutionary biology with high probability. We assume that these topics are specified before any data has been generated.¹ Now for each document in the collection, we generate the words in a two-stage process.
1. Randomly choose a distribution over topics.
2. For each word in the document
(a) Randomly choose a topic from the distribution over topics in step #1.
(b) Randomly choose a word from the corresponding distribution over the vocabulary.
This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics with different proportions (step #1); each word in each document …
¹ Technically, the model assumes that the topics are generated first, before the documents.
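A toy sketch of this two-stage generative process (the vocabulary, topics, and Dirichlet hyperparameter below are made up for illustration, not taken from the paper):

import numpy as np

rng = np.random.default_rng(3)
vocab = ["gene", "dna", "brain", "neuron", "data", "computer"]
topics = np.array([[.4, .4, .05, .05, .05, .05],    # a "genetics"-like topic
                   [.05, .05, .4, .4, .05, .05],    # a "neuroscience"-like topic
                   [.05, .05, .05, .05, .4, .4]])   # a "data analysis"-like topic

def generate_document(n_words, alpha=0.5):
    theta = rng.dirichlet(alpha * np.ones(len(topics)))   # 1. choose topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)              # 2a. choose a topic assignment
        words.append(rng.choice(vocab, p=topics[z]))      # 2b. choose a word from that topic
    return theta, words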
[Graphical model for LDA: per-document topic proportions θ_d, per-word topic assignments z_1d, …, z_Nd, and topic distributions β_1, …, β_T]
Probabilistic topic models
(Blei, Ng, Jordan, JMLR '03)
• [Figure: graphical model for Latent Dirichlet Allocation (LDA) applied to a triage note]
• Topic word distributions β_1, β_2, …, β_T, e.g.:
  – pna .0100, cough .0095, pneumonia .0090, cxr .0085, levaquin .0060, …
  – sore throat .05, swallow .0092, voice .0080, fevers .0075, ear .0016, …
  – cellulitis .0105, swelling .0100, redness .0055, lle .0050, fevers .0045, …
• Inference yields a low-dimensional representation θ_d: the distribution of topics for the note, e.g. Pneumonia 0.50, Common cold 0.49, Diabetes 0.01
Example of learned representation
• Paraphrased note: "Patient has URI [upper respiratory infection] symptoms like cough, runny nose, ear pain. Denies fevers. History of seasonal allergies."
• Inferred topic distribution: mostly Allergy and Cold/URI, with a small remainder of Other.
[Figure: pie chart of the inferred topic distribution over Allergy, Cold/URI, and Other]
What you need to know
• Dimensionality reduction: why and when it's important
• Simple feature selection
• Regularization as a type of feature selection
• Principal component analysis
  – minimizing reconstruction error
  – relationship to the covariance matrix and eigenvectors
  – using SVD