
Linear Dimensionality Reduction

Rad Lab Machine Learning Workshop

UC Berkeley

August 24, 2007

Percy Liang

Lots of high-dimensional data...

face images

Zambian President Levy Mwanawasa has won a second term in office in an election his challenger Michael Sata accused him of rigging, official results showed on Monday.

According to media reports, a pair of hackers said on Saturday that the Firefox Web browser, commonly perceived as the safer and more customizable alternative to market leader Internet Explorer, is critically flawed. A presentation on the flaw was shown during the ToorCon hacker conference in San Diego.

documents

gene expression data

MEG readings

Goal: find a useful representation of data


In many real applications, we are confronted with various types of high-dimensional data. The goal of dimensionality reduction is to convert this data into a lower-dimensional representation more amenable to visualization or further processing by machine learning algorithms.

Basic idea of linear dimensionality reduction

Represent each face as a high-dimensional vector $\mathbf{x} \in \mathbb{R}^{361}$

$\mathbf{z} = \mathbf{U}^\top \mathbf{x}$

$\mathbf{z} \in \mathbb{R}^{10}$

How do we choose U?


All five of the methods we will present fall under this framework. A high-dimensional data point (for example, a face image) is mapped via a linear projection into a lower-dimensional point. An important question to bear in mind is whether a linear projection even makes sense. There are several ways to optimize U based on the nature of the data.

Motivation and context

Why do dimensionality reduction?

• Scientific: understand structure of data (visualization)

• Computational: compress data for time/space efficiency

• Statistical: fewer dimensions allow better generalization

• Direct: model normal data for detecting outliers

Related topics in this course:

• Feature selection

• Clustering

• Nonlinear dimensionality reduction (visualization)


There are several reasons one might want to do dimensionality reduction. We have already seen two ways of effectively reducing the dimensionality of data (feature selection and clustering) and will see nonlinear techniques later today.

Outline

• Methods

– Principal component analysis (PCA)

– Canonical correlation analysis (CCA)

– Linear discriminant analysis (LDA)

– Non-negative matrix factorization (NMF)

– Independent component analysis (ICA)

• Case studies

• Extensions, related methods, summary

Methods / Principal component analysis (PCA)

PCA: first principal component

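Input data:

$$\mathbf{X}_{d \times n} = (\mathbf{x}_1 \;\cdots\; \mathbf{x}_n)$$

Objective: maximize variance of projected data

$$
\begin{aligned}
&= \max_{\|\mathbf{u}\|=1} E(\mathbf{u}^\top \mathbf{x})^2 \\
&= \max_{\|\mathbf{u}\|=1} \frac{1}{n} \sum_{i=1}^{n} (\mathbf{u}^\top \mathbf{x}_i)^2 \\
&= \max_{\|\mathbf{u}\|=1} \frac{1}{n} \|\mathbf{u}^\top \mathbf{X}\|^2 \\
&= \max_{\|\mathbf{u}\|=1} \mathbf{u}^\top \left( \frac{1}{n} \mathbf{X}\mathbf{X}^\top \right) \mathbf{u} \\
&= \text{largest eigenvalue of } \mathbf{C} \overset{\text{def}}{=} \frac{1}{n} \mathbf{X}\mathbf{X}^\top
\end{aligned}
$$

($\mathbf{C}_{d \times d}$ is the covariance matrix)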

PCA is the most basic and standard of the linear dimensionality reduction techniques. Let us start out by reducing the dimensionality to 1, which is all the machinery we need for reducing dimensionality to arbitrary r. To simplify the equations, assume the data is centered at 0. Notation: $\mathbf{X}$ = matrix, $\mathbf{x}$ = vector, $x$ = scalar. Another interpretation of PCA is that we want to minimize the reconstruction error

$$\sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{u}\mathbf{u}^\top \mathbf{x}_i\|^2.$$

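Derivation for one principal component PCA

The first principal component can be expressed as an optimization problem (this is the variational formulation). We can remove the norm constraint by explicitly normalizing in the objective:

$$\max_{\|\mathbf{u}\|=1} \|\mathbf{u}^\top \mathbf{X}\|^2 = \max_{\mathbf{u}} \frac{\mathbf{u}^\top \mathbf{X}\mathbf{X}^\top \mathbf{u}}{\mathbf{u}^\top \mathbf{u}}$$

Let $(\lambda_1, \mathbf{u}_1), \dots, (\lambda_d, \mathbf{u}_d)$ be the eigenvalues and eigenvectors of $\mathbf{X}\mathbf{X}^\top$, ordered so that $\lambda_1$ is the largest. Each vector $\mathbf{u}$ can be expanded in the eigenvector basis as $\mathbf{u} = \sum_i a_i \mathbf{u}_i$, so we have the following equivalent optimization problem:

$$\max_{\mathbf{a}} \frac{\left(\sum_i a_i \mathbf{u}_i\right)^\top \mathbf{X}\mathbf{X}^\top \left(\sum_i a_i \mathbf{u}_i\right)}{\left(\sum_i a_i \mathbf{u}_i\right)^\top \left(\sum_i a_i \mathbf{u}_i\right)}$$

Using the fact that $\mathbf{u}_i^\top \mathbf{u}_j = 1$ if $i = j$ (and 0 otherwise) and $\mathbf{X}\mathbf{X}^\top \mathbf{u}_i = \lambda_i \mathbf{u}_i$, we simplify to the following:

$$\max_{\mathbf{a}} \frac{\sum_i a_i^2 \lambda_i}{\sum_i a_i^2}.$$

If we think of the normalized $a_i^2$'s as specifying a distribution over the eigenvectors, the above quantity is a weighted average of the eigenvalues. It is maximized when all the mass is placed on the eigenvector with the largest eigenvalue, which is $\mathbf{a} = (1, 0, 0, \dots)$, corresponding to $\mathbf{u} = \mathbf{u}_1$.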

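Another way to see the eigenvalue solution is in terms of Lagrange multipliers. We start out with the same constrained optimization problem. (Note that replacing $\|\mathbf{u}\| = 1$ with $\|\mathbf{u}\|^2 \le 1$ does not affect the solution. Why?)

$$\text{maximize } \|\mathbf{u}^\top \mathbf{X}\|^2 \text{ subject to } \|\mathbf{u}\|^2 \le 1.$$

We construct the Lagrangian:

$$L(\mathbf{u}, \lambda) = \|\mathbf{u}^\top \mathbf{X}\|^2 - \lambda(\|\mathbf{u}\|^2 - 1).$$

The objective is convex but is being maximized, so this is not a convex optimization problem; taking the gradient with respect to $\mathbf{u}$ and setting it to zero gives a necessary condition for optimality:

$$\nabla L(\mathbf{u}, \lambda) = 2(\mathbf{X}\mathbf{X}^\top)\mathbf{u} - 2\lambda\mathbf{u} = 0.$$

Rewriting the above expression reveals that it is just an eigenvalue problem:

$$(\mathbf{X}\mathbf{X}^\top)\mathbf{u} = \lambda\mathbf{u}.$$

Every unit eigenvector satisfies this condition, and its objective value equals its eigenvalue, so the maximum is attained at the eigenvector with the largest eigenvalue.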


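Equivalence to minimizing reconstruction error

We will show that maximizing the variance along the principal component is equivalent to minimizing the reconstruction error, i.e., the sum of the squares of the perpendicular distances from the component:

$$
\begin{aligned}
\text{Reconstruction error} &= \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{u}\mathbf{u}^\top \mathbf{x}_i\|^2 \\
&= \sum_{i=1}^{n} \left( \|\mathbf{x}_i\|^2 - (\mathbf{u}^\top \mathbf{x}_i)^2 \right) \\
&= \text{constant} - \sum_{i=1}^{n} (\mathbf{u}^\top \mathbf{x}_i)^2 \\
&= \text{constant} - \|\mathbf{u}^\top \mathbf{X}\|^2
\end{aligned}
$$

Note that $\mathbf{u} \perp (\mathbf{x}_i - \mathbf{u}\mathbf{u}^\top \mathbf{x}_i)$ because

$$\mathbf{u}^\top (\mathbf{x}_i - \mathbf{u}\mathbf{u}^\top \mathbf{x}_i) = 0 \quad \text{(using } \|\mathbf{u}\| = 1\text{)},$$

so the second line follows from Pythagoras's theorem. These derivations show that minimizing reconstruction error is the same as maximizing variance.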
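The identity above can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming synthetic centered data; the helper names `projected_variance` and `reconstruction_error` are illustrative and not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200
# Synthetic data with a different spread per coordinate (an assumption for this demo).
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X = rng.normal(size=(d, n)) * scales[:, None]
X = X - X.mean(axis=1, keepdims=True)   # center the data at 0, as the notes assume


def projected_variance(u, X):
    """||u^T X||^2: sum of squared projections onto the unit vector u."""
    return float(np.sum((u @ X) ** 2))


def reconstruction_error(u, X):
    """sum_i ||x_i - u u^T x_i||^2 for a unit vector u."""
    residual = X - np.outer(u, u @ X)
    return float(np.sum(residual ** 2))


total = float(np.sum(X ** 2))                 # the "constant": sum_i ||x_i||^2
eigvals, eigvecs = np.linalg.eigh(X @ X.T)    # eigenvalues in ascending order
u1 = eigvecs[:, -1]                           # first principal component

u_rand = rng.normal(size=d)
u_rand /= np.linalg.norm(u_rand)              # an arbitrary unit direction

for name, u in [("u1", u1), ("random", u_rand)]:
    print(name, projected_variance(u, X), reconstruction_error(u, X))
```

For any unit vector the two printed numbers sum to `total`, and the top eigenvector has the largest projected variance and therefore the smallest reconstruction error.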


First r principal components

Compute r principal components (eigenvectors of C):

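$$\mathbf{U}_{d \times r} = (\mathbf{u}_1 \;\cdots\; \mathbf{u}_r)$$

New representation:

$$\mathbf{z} = \mathbf{U}^\top \mathbf{x} \quad (z_j = \mathbf{u}_j^\top \mathbf{x} \text{ for } j = 1, \dots, r)$$

Reconstruction:

$$\mathbf{x} \approx \mathbf{U}\mathbf{z} = \sum_{j=1}^{r} z_j \mathbf{u}_j$$

Multiple data points:

$$\mathbf{X}_{d \times n} \approx \mathbf{U}_{d \times r} \mathbf{Z}_{r \times n}$$

$$(\mathbf{x}_1 \;\cdots\; \mathbf{x}_n) \approx (\mathbf{u}_1 \;\cdots\; \mathbf{u}_r)\,(\mathbf{z}_1 \;\cdots\; \mathbf{z}_n)$$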

We compute the first r eigenvectors of the covariance matrix C, which takes $O(rd^2)$ time (after $O(nd^2)$ to first compute C). To get a new representation, we project x with $\mathbf{U}^\top$ to get new coordinates z. To reconstruct x, we take a linear combination of the principal components $\mathbf{U} = (\mathbf{u}_1, \dots, \mathbf{u}_r)$ with coefficients $\mathbf{z} = (z_1, \dots, z_r)$. Note that if r = d, then the reconstruction is exact. In that case, U is an orthogonal matrix ($\mathbf{U}\mathbf{U}^\top = \mathbf{I}$), so $\mathbf{U}^\top = \mathbf{U}^{-1}$.
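The project-and-reconstruct steps can be sketched as follows. This is a hedged illustration in Python with NumPy, not the workshop's own code: the function names `pca_fit`, `pca_project`, and `pca_reconstruct` are hypothetical, and the mean is subtracted explicitly where the slides simply assume centered data.

```python
import numpy as np


def pca_fit(X, r):
    """X is d x n (columns are data points).

    Returns the mean (d x 1), the top-r eigenvectors U (d x r) of the
    covariance matrix, and all eigenvalues sorted in descending order.
    """
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    C = (Xc @ Xc.T) / X.shape[1]              # covariance matrix, d x d
    eigvals, eigvecs = np.linalg.eigh(C)      # ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort descending
    return mu, eigvecs[:, order[:r]], eigvals[order]


def pca_project(X, mu, U):
    """z = U^T x for every column of X."""
    return U.T @ (X - mu)


def pca_reconstruct(Z, mu, U):
    """x ~ U z, adding the mean back."""
    return U @ Z + mu


rng = np.random.default_rng(1)
X = rng.normal(size=(4, 50)) * np.array([4.0, 2.0, 1.0, 0.5])[:, None]

# r = d: reconstruction is exact because U is orthogonal.
mu, U_full, eigvals = pca_fit(X, r=4)
X_exact = pca_reconstruct(pca_project(X, mu, U_full), mu, U_full)

# r = 2: the squared error equals n times the sum of the discarded eigenvalues.
mu, U2, _ = pca_fit(X, r=2)
X_approx = pca_reconstruct(pca_project(X, mu, U2), mu, U2)
err = float(np.sum((X - X_approx) ** 2))
print(err, X.shape[1] * eigvals[2:].sum())
```

Forming the full d x d covariance matrix is the route the notes describe; for very large d other routes (for example an SVD of X) may be preferable, but that is outside what the slides cover.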

Eigen-faces [Turk, 1991]

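• Each $\mathbf{x}_i$ is a face image, which is a vector in $\mathbb{R}^d$

• $d$ = number of pixels

• $x_{ji}$ = intensity of the $j$-th pixel in image $i$

$$\mathbf{X}_{d \times n} \approx \mathbf{U}_{d \times r} \mathbf{Z}_{r \times n}$$

[Figure: the face images (columns of X) are approximated by eigen-face images (columns of U) combined with coefficients $(\mathbf{z}_1 \;\cdots\; \mathbf{z}_n)$.]

The $\mathbf{z}_i$ are a more “meaningful” representation of faces than the $\mathbf{x}_i$.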


An application of PCA is image analysis. Here, the principal components (eigenvectors) are images that resemble faces. Note that the representation is sensitive to rotation and translation. One can store z as a compressed version of the original image x. The z can be used as features in classification (although LDA might be more suitable).

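Latent Semantic Analysis [Deerwester, 1990]

• Each $\mathbf{x}_i$ is a bag of words, which is a vector in $\mathbb{R}^d$

• $d$ = number of words in the vocabulary

• $x_{ji}$ = frequency of word $j$ in document $i$

$$\mathbf{X}_{d \times n} \approx \mathbf{U}_{d \times r} \mathbf{Z}_{r \times n}$$

$$
\begin{pmatrix}
\text{stocks:} & 2 & \cdots & 0 \\
\text{chairman:} & 4 & \cdots & 1 \\
\text{the:} & 8 & \cdots & 7 \\
\cdots & \vdots & \cdots & \vdots \\
\text{wins:} & 0 & \cdots & 2 \\
\text{game:} & 1 & \cdots & 3
\end{pmatrix}
\approx
\begin{pmatrix}
0.4 & \cdots & -0.001 \\
0.8 & \cdots & 0.03 \\
0.01 & \cdots & 0.04 \\
\vdots & \cdots & \vdots \\
0.002 & \cdots & 2.3 \\
0.003 & \cdots & 1.9
\end{pmatrix}
(\mathbf{z}_1 \;\cdots\; \mathbf{z}_n)
$$

How to measure similarity between two documents?

$\mathbf{z}_1^\top \mathbf{z}_2$ is probably better than $\mathbf{x}_1^\top \mathbf{x}_2$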


Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing, is an application of PCA to categorical data. LSA is often used in information retrieval. Eigen-documents try to capture “semantics”: an eigen-document contains related words. But how do we interpret negative frequencies? Other methods such as probabilistic LSA, Latent Dirichlet Allocation, or non-negative matrix factorization may lead to more interpretable results.

How many principal components?

• Similar to question of “How many clusters?”

• Magnitude of the eigenvalues indicates the fraction of variance captured.

• Eigenvalues on a face image dataset:

[Plot: eigenvalue $\lambda_i$ (vertical axis, ticks from 287.1 to 1353.2) versus component index $i$ (horizontal axis, 2 to 11).]

• Eigenvalues typically drop off sharply, so don’t need that many.

• Of course variance isn’t everything...


The total variance is the sum of all the eigenvalues, which is just the trace of the covariance matrix (sum of diagonal entries). For typical data sets, the eigenvalues decay rapidly.
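One common way to turn this observation into a rule of thumb is to keep the smallest r whose eigenvalues account for a chosen fraction of the total variance. The sketch below is an illustration in Python with NumPy; the function names and the threshold values are assumptions, not something the slides prescribe.

```python
import numpy as np


def explained_variance_fraction(eigvals):
    """Cumulative fraction of total variance captured by the top-k eigenvalues."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending
    return np.cumsum(eigvals) / eigvals.sum()


def choose_r(eigvals, threshold=0.95):
    """Smallest r whose top-r eigenvalues capture at least `threshold` of the variance."""
    frac = explained_variance_fraction(eigvals)
    r = int(np.searchsorted(frac, threshold)) + 1
    return min(r, len(frac))


# Toy eigenvalues (hypothetical): cumulative fractions are 0.6, 0.9, 1.0.
toy = [6.0, 3.0, 1.0]
print(explained_variance_fraction(toy))
print(choose_r(toy, threshold=0.85), choose_r(toy, threshold=0.95))

# The sum of the eigenvalues equals the trace of the covariance matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))
X = X - X.mean(axis=1, keepdims=True)
C = X @ X.T / X.shape[1]
lam = np.linalg.eigvalsh(C)
print(lam.sum(), np.trace(C))
```

As the slide warns, variance is not everything, so a threshold like this is only a starting point.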

PCA references

• Schütze: Distributional POS tagging
http://ucrel.lancs.ac.uk/acl/E/E95/E95-1020.pdf

• Ando, Zhang: A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
http://www-cs-students.stanford.edu/%7etzhang/papers/jmlr05%5fsemisup.pdf


Methods / Canonical correlation analysis (CCA)

Motivation for CCA [Hotelling, 1936]

Often, each data point consists of two views:

• Image retrieval: for each image, have the following:

– x: Pixels (or other visual features)

– y: Text around the image

• Genomics: for each gene, have the following:

– x: Gene expression in DNA microarray

– y: Chemical reactions catalyzed in metabolic pathways

Goal: reduce the dimensionality of the two views jointly


Each data point is a vector, which can be broken up into two parts (possibly of different dimensionality), which we call x and y. What happens if PCA is run on the concatenated data points $\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}$?

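From variance to correlation

Input data: $(\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_n, \mathbf{y}_n)$

PCA:
find $\mathbf{u}$ to maximize variance $E(\mathbf{u}^\top \mathbf{x})^2$
find $\mathbf{v}$ to maximize variance $E(\mathbf{v}^\top \mathbf{y})^2$

CCA: find $(\mathbf{u}, \mathbf{v})$ to maximize correlation $\mathrm{corr}(\mathbf{u}^\top \mathbf{x}, \mathbf{v}^\top \mathbf{y})$

Correspondence between x and y data points denoted by shade of gray

Result:
CCA solution: green arrows
PCA solution: black arrows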


Here, CCA chooses directions so that dark points have large projected coordinates and light points have small projected coordinates, both for the x view and the y view, causing them to be correlated. PCA ignores the relationship between x and y entirely.

Properties of CCA

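Objective: maximize correlation between projected views

$$= \max_{\mathbf{u},\mathbf{v}} \mathrm{corr}(\mathbf{u}^\top \mathbf{x}, \mathbf{v}^\top \mathbf{y}) = \max_{\mathbf{u},\mathbf{v}} \frac{\mathrm{cov}(\mathbf{u}^\top \mathbf{x}, \mathbf{v}^\top \mathbf{y})}{\sqrt{\mathrm{var}(\mathbf{u}^\top \mathbf{x})}\,\sqrt{\mathrm{var}(\mathbf{v}^\top \mathbf{y})}}$$

Reduces to a generalized eigenvalue problem.

Remember correlation ≠ covariance.

• If $\mathbf{x} = \mathbf{A}\mathbf{y}$, then any $(\mathbf{u}, \mathbf{v})$ with $\mathbf{v} = \mathbf{A}^\top \mathbf{u}$ is optimal (correlation 1), since then $\mathbf{u}^\top \mathbf{x} = \mathbf{v}^\top \mathbf{y}$

• If x and y are independent, then any $(\mathbf{u}, \mathbf{v})$ is optimal (correlation 0)

• Canonical correlation is invariant to affine transformations of x, y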


Correlation is defined as covariance divided by standard deviations. If we remove the entire denominator, then we end up with partial least squares (PLS). If we remove just $\sqrt{\mathrm{var}(\mathbf{v}^\top \mathbf{y})}$, then we have multiple linear regression (MLR). To get some intuition about CCA, consider degenerate cases (x and y are totally dependent or independent).

CCA objective function

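Objective: maximize correlation between projected views

$$
\begin{aligned}
&= \max_{\mathbf{u},\mathbf{v}} \mathrm{corr}(\mathbf{u}^\top \mathbf{x}, \mathbf{v}^\top \mathbf{y}) = \max_{\mathbf{u},\mathbf{v}} \frac{\mathrm{cov}(\mathbf{u}^\top \mathbf{x}, \mathbf{v}^\top \mathbf{y})}{\sqrt{\mathrm{var}(\mathbf{u}^\top \mathbf{x})}\,\sqrt{\mathrm{var}(\mathbf{v}^\top \mathbf{y})}} \\
&= \max_{\widehat{\mathrm{var}}(\mathbf{u}^\top \mathbf{x}) = \widehat{\mathrm{var}}(\mathbf{v}^\top \mathbf{y}) = 1} \widehat{\mathrm{cov}}(\mathbf{u}^\top \mathbf{x}, \mathbf{v}^\top \mathbf{y}) \\
&= \max_{\|\mathbf{u}^\top \mathbf{X}\| = \|\mathbf{v}^\top \mathbf{Y}\| = 1} \sum_{i=1}^{n} (\mathbf{u}^\top \mathbf{x}_i)(\mathbf{v}^\top \mathbf{y}_i) \\
&= \max_{\|\mathbf{u}^\top \mathbf{X}\| = \|\mathbf{v}^\top \mathbf{Y}\| = 1} \mathbf{u}^\top \mathbf{X}\mathbf{Y}^\top \mathbf{v} \\
&= \text{largest generalized eigenvalue } \lambda \text{ given by}
\end{aligned}
$$

$$
\begin{pmatrix} 0 & \mathbf{X}\mathbf{Y}^\top \\ \mathbf{Y}\mathbf{X}^\top & 0 \end{pmatrix}
\begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix}
= \lambda
\begin{pmatrix} \mathbf{X}\mathbf{X}^\top & 0 \\ 0 & \mathbf{Y}\mathbf{Y}^\top \end{pmatrix}
\begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix},
$$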

which reduces to an ordinary eigenvalue problem.


Derivation for CCA (one component)

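The first pair of canonical directions is given by solving the following optimization problem:

$$\text{maximize } \mathbf{u}^\top \mathbf{X}\mathbf{Y}^\top \mathbf{v} \text{ subject to } \mathbf{u}^\top \mathbf{X}\mathbf{X}^\top \mathbf{u} \le 1 \text{ and } \mathbf{v}^\top \mathbf{Y}\mathbf{Y}^\top \mathbf{v} \le 1.$$

This is a constrained optimization problem, so we construct the Lagrangian:

$$L(\mathbf{u}, \mathbf{v}, \lambda_u, \lambda_v) = \mathbf{u}^\top \mathbf{X}\mathbf{Y}^\top \mathbf{v} - \lambda_u(\mathbf{u}^\top \mathbf{X}\mathbf{X}^\top \mathbf{u} - 1) - \lambda_v(\mathbf{v}^\top \mathbf{Y}\mathbf{Y}^\top \mathbf{v} - 1).$$

The objective is bilinear rather than concave, so this is not a convex optimization problem; taking the gradient with respect to v (top row) and u (bottom row) and setting them to zero gives a necessary condition for optimality:

$$
\nabla L(\mathbf{u}, \mathbf{v}, \lambda_u, \lambda_v) =
\begin{pmatrix}
(\mathbf{Y}\mathbf{X}^\top)\mathbf{u} - 2\lambda_v(\mathbf{Y}\mathbf{Y}^\top)\mathbf{v} \\
(\mathbf{X}\mathbf{Y}^\top)\mathbf{v} - 2\lambda_u(\mathbf{X}\mathbf{X}^\top)\mathbf{u}
\end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \end{pmatrix}.
$$

Left-multiplying the top row by $\mathbf{v}^\top$ and the bottom row by $\mathbf{u}^\top$ and subtracting results in

$$\lambda_u \mathbf{u}^\top (\mathbf{X}\mathbf{X}^\top)\mathbf{u} = \lambda_v \mathbf{v}^\top (\mathbf{Y}\mathbf{Y}^\top)\mathbf{v}.$$

Since $\mathbf{u}^\top (\mathbf{X}\mathbf{X}^\top)\mathbf{u} = \mathbf{v}^\top (\mathbf{Y}\mathbf{Y}^\top)\mathbf{v} = 1$ at the optimum, $\lambda_u = \lambda_v$; define $\lambda \overset{\text{def}}{=} 2\lambda_u = 2\lambda_v$. Now we can write the optimality conditions as a generalized eigenvalue problem:

$$
\begin{pmatrix} 0 & \mathbf{X}\mathbf{Y}^\top \\ \mathbf{Y}\mathbf{X}^\top & 0 \end{pmatrix}
\begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix}
= \lambda
\begin{pmatrix} \mathbf{X}\mathbf{X}^\top & 0 \\ 0 & \mathbf{Y}\mathbf{Y}^\top \end{pmatrix}
\begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix}.
$$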
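One way to solve this generalized eigenvalue problem numerically is to whiten each view and take an SVD of the whitened cross-covariance. The sketch below (Python with NumPy) follows that route; the function name `cca_first_pair`, the small ridge term `reg`, and the synthetic two-view data are assumptions made for illustration, not part of the original slides.

```python
import numpy as np


def cca_first_pair(X, Y, reg=1e-8):
    """First pair of canonical directions.

    X is dx x n and Y is dy x n, with paired columns. Returns (u, v, corr)
    where u^T X X^T u = v^T Y Y^T v = 1 (up to the tiny ridge term `reg`).
    """
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    Cxx = X @ X.T + reg * np.eye(X.shape[0])   # ridge keeps Cholesky well defined
    Cyy = Y @ Y.T + reg * np.eye(Y.shape[0])
    Cxy = X @ Y.T
    Lx = np.linalg.cholesky(Cxx)               # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)               # Cyy = Ly Ly^T
    # Whitened cross-covariance M = Lx^{-1} Cxy Ly^{-T}.
    N = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, N.T).T
    A, s, Bt = np.linalg.svd(M)
    u = np.linalg.solve(Lx.T, A[:, 0])         # map back: u = Lx^{-T} a
    v = np.linalg.solve(Ly.T, Bt[0])           # map back: v = Ly^{-T} b
    return u, v, float(s[0])


# Two noisy views of one shared latent signal (synthetic demo data).
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=n)
X = np.outer([1.0, -2.0, 0.5], latent) + 0.1 * rng.normal(size=(3, n))
Y = np.outer([0.7, 1.5], latent) + 0.1 * rng.normal(size=(2, n))

u, v, corr = cca_first_pair(X, Y)
print(corr, np.corrcoef(u @ X, v @ Y)[0, 1])
```

The top singular value of the whitened cross-covariance is the largest generalized eigenvalue λ, which is the correlation between the two projected views.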


CCA references

• Hardoon, et al.: Canonical correlation analysis; An overview with application to learning methods
http://eprints.ecs.soton.ac.uk/10658/01/TR%5fCSD%5f03%5f02.pdf
Applies kernel CCA on two views (one of images and one of text) for image retrieval.

• Yamanishi, et al.: Heterogeneous data comparison and gene selection with kernel canonical correlation analysis
http://cg.ensmp.fr/%7evert/publi/04kmcbbook/heterogeneous.pdf

• Matlab code for CCA
http://www.imt.liu.se/%7emagnus/cca/


Methods / Linear discriminant analysis (LDA)

Motivation for LDA [Fisher, 1936]

What is the best linear projection?

PCA solution


Interclass variance is the sum of squared distances between pairs of points in different classes; intraclass variance is the same sum over pairs of points in the same class.

Motivation for LDA [Fisher, 1936]

What is the best linear projection with these labels?

PCA solution LDA solution

Goal: reduce the dimensionality given labels

Idea: want projection to maximize overall interclass variance relative to intraclass variance



LDA objective function

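Objective: maximize

$$\frac{\text{interclass variance}}{\text{intraclass variance}} = \frac{\text{total variance}}{\text{intraclass variance}} - 1$$

Total variance: $\frac{1}{n} \sum_i (\mathbf{u}^\top (\mathbf{x}_i - \boldsymbol{\mu}))^2$

Mean of all points: $\boldsymbol{\mu} = \frac{1}{n} \sum_i \mathbf{x}_i$

Intraclass variance: $\frac{1}{n} \sum_i (\mathbf{u}^\top (\mathbf{x}_i - \boldsymbol{\mu}_{y_i}))^2$

Mean of points in class $y$: $\boldsymbol{\mu}_y = \frac{1}{|\{i : y_i = y\}|} \sum_{i : y_i = y} \mathbf{x}_i$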

Reduces to a generalized eigenvalue problem.


The total variance is the sum of interclass variance and intraclass variance.

LDA derivation

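Global mean: $\boldsymbol{\mu} = \frac{1}{n} \sum_i \mathbf{x}_i$, with $\mathbf{X}_g = (\mathbf{x}_1 - \boldsymbol{\mu}, \dots, \mathbf{x}_n - \boldsymbol{\mu})$

Class mean: $\boldsymbol{\mu}_y = \frac{1}{|\{i : y_i = y\}|} \sum_{i : y_i = y} \mathbf{x}_i$, with $\mathbf{X}_c = (\mathbf{x}_1 - \boldsymbol{\mu}_{y_1}, \dots, \mathbf{x}_n - \boldsymbol{\mu}_{y_n})$

Objective: maximize

$$\frac{\text{total variance}}{\text{intraclass variance}} = \frac{\text{interclass variance}}{\text{intraclass variance}} + 1$$

$$
\begin{aligned}
&= \max_{\mathbf{u}} \frac{\sum_{i=1}^{n} (\mathbf{u}^\top (\mathbf{x}_i - \boldsymbol{\mu}))^2}{\sum_{i=1}^{n} (\mathbf{u}^\top (\mathbf{x}_i - \boldsymbol{\mu}_{y_i}))^2} \\
&= \max_{\|\mathbf{u}^\top \mathbf{X}_c\| = 1} \sum_{i=1}^{n} (\mathbf{u}^\top (\mathbf{x}_i - \boldsymbol{\mu}))^2 \\
&= \max_{\|\mathbf{u}^\top \mathbf{X}_c\| = 1} \mathbf{u}^\top \mathbf{X}_g \mathbf{X}_g^\top \mathbf{u} \\
&= \text{largest generalized eigenvalue } \lambda \text{ given by}
\end{aligned}
$$

$$(\mathbf{X}_g \mathbf{X}_g^\top)\mathbf{u} = \lambda (\mathbf{X}_c \mathbf{X}_c^\top)\mathbf{u}.$$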
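The generalized eigenvalue problem above can be solved with a Cholesky factor of the intraclass scatter matrix. The following Python/NumPy sketch is an illustration under assumptions: the function name `lda_direction`, the ridge term `reg`, and the two-class synthetic data are not from the slides.

```python
import numpy as np


def lda_direction(X, y, reg=1e-8):
    """Direction u maximizing total variance / intraclass variance.

    X is d x n, y holds n class labels. Returns (unit vector u, ratio lambda).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    Xg = X - X.mean(axis=1, keepdims=True)          # subtract global mean
    Xc = X.copy()
    for label in np.unique(y):
        mask = (y == label)
        Xc[:, mask] -= X[:, mask].mean(axis=1, keepdims=True)   # subtract class mean
    Sg = Xg @ Xg.T
    Sc = Xc @ Xc.T + reg * np.eye(X.shape[0])       # ridge keeps Sc invertible
    L = np.linalg.cholesky(Sc)                      # Sc = L L^T
    # Symmetric matrix M = L^{-1} Sg L^{-T}; its eigenvalues solve Sg u = lambda Sc u.
    N = np.linalg.solve(L, Sg)
    M = np.linalg.solve(L, N.T).T
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)
    u = np.linalg.solve(L.T, eigvecs[:, -1])        # map back: u = L^{-T} w
    return u / np.linalg.norm(u), float(eigvals[-1])


# Two classes: large spread along axis 0, class separation along axis 1.
rng = np.random.default_rng(0)
m = 200
class0 = np.vstack([5.0 * rng.normal(size=m), -1.0 + 0.3 * rng.normal(size=m)])
class1 = np.vstack([5.0 * rng.normal(size=m), +1.0 + 0.3 * rng.normal(size=m)])
X = np.hstack([class0, class1])
y = np.array([0] * m + [1] * m)

u_lda, ratio = lda_direction(X, y)

Xg = X - X.mean(axis=1, keepdims=True)
u_pca = np.linalg.eigh(Xg @ Xg.T)[1][:, -1]         # top principal component
print("LDA:", u_lda, "PCA:", u_pca, "ratio:", ratio)
```

On this data PCA picks the high-variance axis 0, while LDA picks axis 1, where the classes are actually separated.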


Summary so far

Framework: $\mathbf{z} \approx \mathbf{U}^\top \mathbf{x}$

Criteria for choosing U thus far:

• PCA: maximize variance

• CCA: maximize correlation

• LDA: maximize the ratio (interclass variance) / (intraclass variance)

All these methods reduce to solving generalized eigenvalue problems.

Next (NMF, ICA): more complex criteria for U.


Methods / Non-negative matrix factorization (NMF)

Motivation for NMF [Paatero, ’94; Lee, ’99]

Back to basic PCA setting (single view, no labels)

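$$\mathbf{X}_{d \times n} \approx \mathbf{U}_{d \times r} \mathbf{Z}_{r \times n}$$

$$(\mathbf{x}_1 \;\cdots\; \mathbf{x}_n) \approx (\mathbf{u}_1 \;\cdots\; \mathbf{u}_r)\,(\mathbf{z}_1 \;\cdots\; \mathbf{z}_n)$$

• Data is not just any arbitrary real vector: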

– Text modeling: each document is a vector of term frequencies

– Gene expression: each gene is a vector of expression profiles

– Collaborative filtering: each user is a vector of movie ratings

• Each basis vector ui is an “eigen-document/eigen-gene/eigen-user”

• Would like U and Z to have only non-negative entries so that we can interpret each point as a combination of prototypes

Goal: reduce the dimensionality given non-negativity constraints


Non-negative constraints can improve visualization and interpretability. For example, in gene expression, a basis vector should represent a (soft) subset of experiments rather than an arbitrary linear combination.

PCA versus NMF

$$\mathbf{x} \approx \sum_{j=1}^{r} z_j \mathbf{u}_j$$

• Sum of basis vectors must be (positively) additive ($z_j \ge 0$)

• The basis vectors $\mathbf{u}_j$ tend to be sparse

• NMF recovers a parts-based representation of x, whereas PCA recovers a holistic representation

• Caveat for images: sparsity depends on proper alignment (remember, the representation is still a bag of pixels)


We can see a qualitative difference between the basis vectors recovered by PCA and those recovered by NMF.

NMF machinery

• Objectives to minimize (all entries in X, U, Z non-negative)

– Frobenius norm (same as PCA but with non-negativity constraints):

$$\|\mathbf{X} - \mathbf{U}\mathbf{Z}\|_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{d} \left( X_{ji} - (\mathbf{U}\mathbf{Z})_{ji} \right)^2$$

– KL divergence:

$$\mathrm{KL}(\mathbf{X} \,\|\, \mathbf{U}\mathbf{Z}) = \sum_{i=1}^{n} \sum_{j=1}^{d} \left( X_{ji} \log \frac{X_{ji}}{(\mathbf{U}\mathbf{Z})_{ji}} - X_{ji} + (\mathbf{U}\mathbf{Z})_{ji} \right)$$

• Algorithm

– Hard non-convex optimization problem: could get stuck in local minima, need to worry about initialization

– Simple/fast multiplicative update rule [Lee & Seung ’99, ’01]

• Relationship to other methods

– Vector quantization: z is 1 in exactly one component j (and 0 elsewhere)

– Probabilistic latent semantic analysis (pLSA, also called pLSI): equivalent to 2nd objective

– Latent Dirichlet Allocation: Bayesian version of pLSI


Unlike PCA, CCA, LDA, there are several reasonable objective functions that we could use. Optimization is hard and requires an iterative algorithm that converges to local optima.
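For the Frobenius objective, the multiplicative updates of Lee & Seung can be sketched in a few lines. This Python/NumPy version is an illustration under assumptions: the function name `nmf_multiplicative`, the random initialization, the iteration count, and the small constant `eps` added to the denominators are choices made here, not details given in the slides.

```python
import numpy as np


def nmf_multiplicative(X, r, n_iter=200, eps=1e-10, seed=0):
    """Approximate non-negative X (d x n) as U (d x r) times Z (r x n).

    Uses multiplicative updates for the squared Frobenius error. Returns
    U, Z and the error after each iteration.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    U = rng.random((d, r)) + eps      # random non-negative initialization
    Z = rng.random((r, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Elementwise updates keep every entry non-negative.
        Z *= (U.T @ X) / (U.T @ U @ Z + eps)
        U *= (X @ Z.T) / (U @ Z @ Z.T + eps)
        errors.append(float(np.linalg.norm(X - U @ Z) ** 2))
    return U, Z, errors


# Synthetic non-negative data with an exact rank-3 factorization.
rng = np.random.default_rng(42)
U_true = rng.random((8, 3))
Z_true = rng.random((3, 30))
X = U_true @ Z_true

U, Z, errors = nmf_multiplicative(X, r=3)
print(errors[0], errors[-1])
```

Because the problem is non-convex, a different `seed` can lead to a different local optimum; running from several initializations and keeping the best result is a common workaround.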

NMF references

• Lee, Seung: Learning the parts of objects with nonnegative matrix factorization.
http://www.seas.upenn.edu/%7eddlee/Papers/nmf.pdf

• Hoyer: Non-negative Matrix Factorization with Sparseness Constraints
http://jmlr.csail.mit.edu/papers/volume5/hoyer04a/hoyer04a.pdf

• Lin: Projected Gradient Methods for Non-negative Matrix Factorization
http://www.csie.ntu.edu.tw/%7ecjlin/papers/pgradnmf.pdf

• Srebro: Learning with Matrix Factorizations
http://people.csail.mit.edu/nati/Publications/thesis.pdf

• Buntine, Jakulin: Discrete Principal Component Analysis
http://cosco.hiit.fi/search/MPCA/buntineDPCA.pdf

• Matlab code for NMF
http://www.broad.mit.edu/mpr/publications/projects/NMF/nmf.m


Methods / Independent component analysis (ICA)

Motivation for ICA [Herault & Jutten, ’86]

x = Uz

Cocktail party problem: d people, d microphones, n time steps

Assume:
• people are speaking independently (z)
• acoustics mix linearly through an invertible U

Goal: find transformation $\mathbf{U}^{-1}$ that makes components of $\mathbf{z} = \mathbf{U}^{-1}\mathbf{x}$ as independent as possible


PCA versus ICA

[Figure: three panels showing the original signal, the PCA solution, and the ICA solution.]

ICA finds independent components; it doesn’t work if the data is Gaussian.


ICA algorithm

$$\mathbf{x} = \mathbf{U}\mathbf{z}$$

• Preprocessing: whiten data X with PCA so that components are uncorrelated

• Find $\mathbf{U}^{-1}$ to maximize independence of $\mathbf{z} = \mathbf{U}^{-1}\mathbf{x}$

• How to measure independence? Mutual information, negentropy, non-Gaussianity (e.g., kurtosis)

• Hard non-convex optimization

• Methods for solving: fastICA, kernelICA, ProDenICA
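The whiten-then-maximize-non-Gaussianity recipe can be sketched with a small fixed-point iteration in the style of FastICA, using kurtosis (a cubic nonlinearity) as the measure of non-Gaussianity. This Python/NumPy code is a simplified illustration, not the reference fastICA implementation; the function names, iteration count, and synthetic uniform sources are assumptions.

```python
import numpy as np


def whiten(X):
    """Center X (d x n) and transform it so its covariance is the identity."""
    Xc = X - X.mean(axis=1, keepdims=True)
    C = Xc @ Xc.T / Xc.shape[1]
    eigvals, E = np.linalg.eigh(C)
    W_white = E @ np.diag(eigvals ** -0.5) @ E.T
    return W_white @ Xc, W_white


def sym_decorrelate(W):
    """Make the rows of W orthonormal: W <- (W W^T)^{-1/2} W."""
    s, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(s ** -0.5) @ E.T @ W


def ica_kurtosis(X, n_iter=200, seed=0):
    """Returns estimated sources Z (d x n) and the unmixing matrix (estimate of U^{-1})."""
    Xw, W_white = whiten(X)
    d, n = Xw.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(d, d)))
    for _ in range(n_iter):
        Y = W @ Xw
        # Kurtosis-based fixed-point step on whitened data: w <- E[x (w^T x)^3] - 3 w.
        W = sym_decorrelate((Y ** 3) @ Xw.T / n - 3.0 * W)
    return W @ Xw, W @ W_white


# Two independent non-Gaussian (uniform) sources, mixed linearly.
rng = np.random.default_rng(0)
n = 5000
S = rng.uniform(-1.0, 1.0, size=(2, n))
U_mix = np.array([[1.0, 0.5], [0.3, 1.0]])
X = U_mix @ S

Z, unmixing = ica_kurtosis(X)
# Absolute correlation between each recovered component and each true source.
abs_corr = np.abs(np.corrcoef(np.vstack([Z, S]))[:2, 2:])
print(abs_corr)
```

Sources are recovered only up to permutation, sign, and scale, which is why the comparison uses absolute correlations rather than the raw signals.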


ICA references

• Hyvärinen, Karhunen, Oja: Independent Component Analysis (introduction chapter)
http://www.cis.hut.fi/projects/ica/book/intro.pdf

• Hyvärinen, Oja: Independent Component Analysis: Algorithms and Applications
http://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• Bach, Jordan: Kernel Independent Component Analysis
http://cmm.ensmp.fr/%7ebach/kernelICA-jmlr.pdf

• Some ICA software
http://www.tsi.enst.fr/icacentral/algos.html

• ICA code for Matlab
http://www.cis.hut.fi/projects/ica/fastica/


Case studies

Network anomaly detection [Lakhina, ’05]

Input data: traffic flow on each link in the network during each time interval

Model assumption: traffic is sum of flows along a few “paths”

Apply PCA: each principal component intuitively represents a path

Anomaly: when traffic data deviates from first few principal components
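The anomaly criterion can be illustrated by scoring each time step with the squared distance of its traffic vector from the subspace spanned by the first r principal components. The sketch below (Python with NumPy) uses a made-up six-link, two-path network; the function name `pca_residual_scores`, the routing matrix, and the injected spike are all hypothetical and do not reproduce the setup of Lakhina et al.

```python
import numpy as np


def pca_residual_scores(X, r):
    """X is links x time steps (d x n). Returns the squared residual per time step
    after projecting onto the first r principal components."""
    Xc = X - X.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
    U = eigvecs[:, -r:]                       # top-r principal components
    residual = Xc - U @ (U.T @ Xc)
    return np.sum(residual ** 2, axis=0)


rng = np.random.default_rng(0)
n_steps = 200
# Hypothetical routing: which of the 6 links each of the 2 paths traverses.
routing = np.array([[1, 0],
                    [1, 0],
                    [1, 1],
                    [0, 1],
                    [0, 1],
                    [0, 1]], dtype=float)
path_volume = rng.normal(loc=10.0, scale=1.0, size=(2, n_steps))
traffic = routing @ path_volume + 0.05 * rng.normal(size=(6, n_steps))

anomaly_time = 120
traffic[0, anomaly_time] += 3.0               # inject a spike on a single link

scores = pca_residual_scores(traffic, r=2)
print(int(np.argmax(scores)), float(scores.max()))
```

Normal traffic lies close to the two-dimensional subspace spanned by the paths, so the residual stays near the noise level except at the time step with the injected spike.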


Multi-task learning [Ando & Zhang, ’05]

Setup:

• Have a set of related tasks (classify documents for various users)

• Each task has a classifier (weights of a linear classifier)

• Want to share structure between classifiers

One step of their procedure: given a set of classifiers $\mathbf{x}_1, \dots, \mathbf{x}_n$, run PCA to identify shared structure:

$$\mathbf{X} = (\mathbf{x}_1 \;\cdots\; \mathbf{x}_n) \approx \mathbf{U}\mathbf{Z}$$

Each data point is a linear classifier

Each principal component is an eigen-classifier


Unsupervised POS tagging [Schütze, ’95]

Part-of-speech (POS) tagging task:

Input: I like reducing the dimensionality of data .

Output: NOUN VERB VERB(-ING) DET NOUN PREP NOUN .

Each data point is (the context distribution of) a word.

Key idea: words appearing in similar contexts tend to have the same POS tags; so cluster using the contexts of each word type

Issue: contexts are too sparse

Solution: run PCA first, then cluster using new representation


Brain imaging

Data: EEG/MEG/fMRI readings (the signal s)

Goal: separate signals into sources

One solution: ICA

Another solution: CCA [Borga, ’02]

The two views are the signals s at adjacent time steps:

$(\mathbf{x}_1, \mathbf{y}_1) = (\mathbf{s}(1), \mathbf{s}(2))$
$(\mathbf{x}_2, \mathbf{y}_2) = (\mathbf{s}(2), \mathbf{s}(3))$
$(\mathbf{x}_3, \mathbf{y}_3) = (\mathbf{s}(3), \mathbf{s}(4))$
. . .

More robust and faster than ICA

Extensions, related methods, summary

Extensions

• Non-linear: can use kernel trick

• Sparsity: either in U or z

• Robustness: be insensitive to outliers

• Online: update basis vectors U with additional data

• Probabilistic (factor analysis):

– Handle missing data

– Estimate uncertainty

– Natural way to incorporate in a larger model


Other linear methods

• Partial least squares (PLS)

– Find directions of maximum covariance

– Generalized eigenvalue problem

• Multiple linear regression (MLR):

– Find directions of minimum squared error

– Generalized eigenvalue problem

• Random projections:

– Randomly project data onto O(log n) dimensions

– Pairwise distances preserved with high probability
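The distance-preservation claim for random projections can be checked empirically. The sketch below (Python with NumPy) uses a Gaussian random matrix scaled by 1/sqrt(k); the function names, the dimensions, and the choice of a Gaussian matrix are assumptions for illustration, since the slide does not specify a construction.

```python
import numpy as np


def random_projection(X, k, seed=0):
    """Project the columns of X (d x n) onto k random directions.

    Entries of the projection matrix are N(0, 1/k), so squared lengths are
    preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(k, X.shape[0])) / np.sqrt(k)
    return R @ X


def pairwise_distances(X):
    """Euclidean distances between all pairs of columns of X."""
    sq = np.sum(X ** 2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.sqrt(np.maximum(D2, 0.0))


rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20))               # 20 points in 1000 dimensions
Z = random_projection(X, k=400)               # the same points in 400 dimensions

iu = np.triu_indices(X.shape[1], k=1)         # each unordered pair once
ratios = pairwise_distances(Z)[iu] / pairwise_distances(X)[iu]
print(ratios.min(), ratios.max())
```

The ratios of projected to original distances cluster around 1, and the spread shrinks as k grows; the projection never looks at the data, which is what makes it cheap.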


Curtain call

PCA: find subspace that captures most variance in data; eigenvalue problem

CCA: find pair of subspaces that captures most correlation; generalized eigenvalue problem

LDA: find subspace that maximizes the ratio (interclass variance) / (intraclass variance); generalized eigenvalue problem

NMF: find subspace that minimizes reconstruction error for non-negative data; non-trivial optimization problem

ICA: find subspace where sources are independent; non-trivial optimization problem


General references

• Shawe-Taylor and Cristianini: Kernel Methods for Pattern Analysis
http://www.kernel-methods.net

• de Bie, et al.: Eigenproblems in Pattern Recognition
http://www.ofai.at/%7eroman.rosipal/Papers/eig%5fbook04.pdf

• Borga, et al.: A unified approach to PCA, PLS, MLR and CCA
http://www.cvl.isy.liu.se/ScOut/TechRep/Papers/LiTHISYR1992.pdf

• Tutorial on component analysis for vision
http://eccv2006.tugraz.at/tutorials.html

