Linear Dimensionality Reduction
Rad Lab Machine Learning Workshop
UC Berkeley
August 24, 2007
Percy Liang
Lots of high-dimensional data...
• face images
• documents, e.g.:
– "Zambian President Levy Mwanawasa has won a second term in office in an election his challenger Michael Sata accused him of rigging, official results showed on Monday."
– "According to media reports, a pair of hackers said on Saturday that the Firefox Web browser, commonly perceived as the safer and more customizable alternative to market leader Internet Explorer, is critically flawed. A presentation on the flaw was shown during the ToorCon hacker conference in San Diego."
• gene expression data
• MEG readings
Goal: find a useful representation of data
In many real applications, we are confronted with various types of high-dimensional data. The goal of dimensionality reduction is to convert this data into a lower-dimensional representation more amenable to visualization or further processing by machine learning algorithms.
Basic idea of linear dimensionality reduction
Represent each face as a high-dimensional vector $\mathbf{x} \in \mathbb{R}^{361}$
$\mathbf{z} = \mathbf{U}^\top \mathbf{x}$
$\mathbf{z} \in \mathbb{R}^{10}$
How do we choose U?
All five of the methods we will present fall under this framework. A high-dimensional data point (for example, a face image) is mapped via a linear projection into a lower-dimensional point. An important question to bear in mind is whether a linear projection even makes sense. There are several ways to optimize $\mathbf{U}$ based on the nature of the data.
Motivation and context
Why do dimensionality reduction?
• Scientific: understand structure of data (visualization)
• Computational: compress data for time/space efficiency
• Statistical: fewer dimensions allow better generalization
• Direct: model normal data for detecting outliers
Related topics in this course:
• Feature selection
• Clustering
• Nonlinear dimensionality reduction (visualization)
There are several reasons one might want to do dimensionality reduction. We have already seen two ways of effectively reducing the dimensionality of data (feature selection and clustering) and will see nonlinear techniques later today.
Outline
• Methods
– Principal component analysis (PCA)
– Canonical correlation analysis (CCA)
– Linear discriminant analysis (LDA)
– Non-negative matrix factorization (NMF)
– Independent component analysis (ICA)
• Case studies
• Extensions, related methods, summary
Methods / Principal component analysis (PCA)
PCA: first principal component
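Input data:
$\mathbf{X}_{d \times n} = (\mathbf{x}_1 \dots \mathbf{x}_n)$
Objective: maximize variance of projected data
$$
\begin{aligned}
&\max_{\|\mathbf{u}\|=1} E(\mathbf{u}^\top \mathbf{x})^2 \\
&= \max_{\|\mathbf{u}\|=1} \frac{1}{n} \sum_{i=1}^n (\mathbf{u}^\top \mathbf{x}_i)^2 \\
&= \max_{\|\mathbf{u}\|=1} \frac{1}{n} \|\mathbf{u}^\top \mathbf{X}\|^2 \\
&= \max_{\|\mathbf{u}\|=1} \mathbf{u}^\top \left( \frac{1}{n} \mathbf{X}\mathbf{X}^\top \right) \mathbf{u} \\
&= \text{largest eigenvalue of } \mathbf{C} \stackrel{\text{def}}{=} \frac{1}{n} \mathbf{X}\mathbf{X}^\top
\end{aligned}
$$
($\mathbf{C}_{d \times d}$ is the covariance matrix)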
PCA is the most basic and standard of the linear dimensionality reduction techniques. Let us start by reducing the dimensionality to 1, which is all the machinery we need for reducing the dimensionality to an arbitrary $r$. To simplify the equations, assume the data is centered at 0. Notation: bold uppercase $\mathbf{X}$ = matrix, bold lowercase $\mathbf{x}$ = vector, plain $x$ = scalar. Another interpretation of PCA is that we want to minimize the reconstruction error $\sum_{i=1}^n \|\mathbf{x}_i - \mathbf{u}\mathbf{u}^\top \mathbf{x}_i\|^2$.
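As a rough numerical illustration of the objective above, the following sketch computes the first principal component of synthetic centered data and compares its projected variance with that of random unit directions. The data, the dimensions, and all variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200

# Synthetic data (an assumption): each coordinate has a different scale.
X = rng.standard_normal((d, n)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])[:, None]
X = X - X.mean(axis=1, keepdims=True)   # center the data at 0

C = X @ X.T / n                         # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
u1 = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue

# Variance of the data projected onto u1.
projected_variance = np.mean((u1 @ X) ** 2)

# Variance captured by 1000 random unit directions, for comparison.
trials = rng.standard_normal((1000, d))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
best_random = np.max(np.mean((trials @ X) ** 2, axis=1))

print(projected_variance, eigvals[-1], best_random)
```

The projected variance along `u1` should match the largest eigenvalue of `C`, and no random direction should exceed it.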
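Derivation for one principal component
The first principal component can be expressed as an optimization problem (this is the variational formulation). We can remove the norm constraint by explicitly normalizing in the objective:
$$
\max_{\|\mathbf{u}\|=1} \|\mathbf{u}^\top \mathbf{X}\|^2 = \max_{\mathbf{u}} \frac{\mathbf{u}^\top \mathbf{X}\mathbf{X}^\top \mathbf{u}}{\mathbf{u}^\top \mathbf{u}}
$$
Let $(\lambda_1, \mathbf{u}_1), \dots, (\lambda_d, \mathbf{u}_d)$ be the eigenvalues and orthonormal eigenvectors of $\mathbf{X}\mathbf{X}^\top$, with $\lambda_1$ the largest. Each vector $\mathbf{u}$ can be expanded in this eigenbasis as $\mathbf{u} = \sum_i a_i \mathbf{u}_i$, so we have the following equivalent optimization problem:
$$
\max_{a} \frac{\left(\sum_i a_i \mathbf{u}_i\right)^\top \mathbf{X}\mathbf{X}^\top \left(\sum_i a_i \mathbf{u}_i\right)}{\left(\sum_i a_i \mathbf{u}_i\right)^\top \left(\sum_i a_i \mathbf{u}_i\right)}
$$
Using the fact that $\mathbf{u}_i^\top \mathbf{u}_j = 1$ if $i = j$ (and 0 otherwise) and $\mathbf{X}\mathbf{X}^\top \mathbf{u}_i = \lambda_i \mathbf{u}_i$, we simplify to the following:
$$
\max_{a} \frac{\sum_i a_i^2 \lambda_i}{\sum_i a_i^2}.
$$
If we think of the normalized $a_i^2$'s as specifying a distribution over the eigenvectors, the above quantity is clearly maximized when all the mass is placed on the eigenvector with the largest eigenvalue, which is $a = (1, 0, 0, \dots)$, corresponding to $\mathbf{u} = \mathbf{u}_1$.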
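Another way to see the eigenvalue solution is in terms of Lagrange multipliers. We start out with the same constrained optimization problem. (Note that replacing $\|\mathbf{u}\| = 1$ with $\|\mathbf{u}\|^2 \le 1$ does not affect the solution. Why?)
$$
\text{maximize } \|\mathbf{u}^\top \mathbf{X}\|^2 \text{ subject to } \|\mathbf{u}\|^2 \le 1.
$$
We construct the Lagrangian:
$$
L(\mathbf{u}, \lambda) = \|\mathbf{u}^\top \mathbf{X}\|^2 - \lambda(\|\mathbf{u}\|^2 - 1).
$$
Taking the gradient with respect to $\mathbf{u}$ and setting it to zero gives a necessary condition for optimality (we are maximizing a convex function, so this is not a convex optimization problem and the condition alone is not sufficient):
$$
\nabla_{\mathbf{u}} L(\mathbf{u}, \lambda) = 2(\mathbf{X}\mathbf{X}^\top)\mathbf{u} - 2\lambda \mathbf{u} = 0.
$$
Rewriting the above expression reveals that it is just an eigenvalue problem:
$$
(\mathbf{X}\mathbf{X}^\top)\mathbf{u} = \lambda \mathbf{u}.
$$
Every eigenvector satisfies this condition. A unit eigenvector attains the objective value $\mathbf{u}^\top \mathbf{X}\mathbf{X}^\top \mathbf{u} = \lambda$, so the maximizer is the eigenvector with the largest eigenvalue.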
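Equivalence to minimizing reconstruction error
We will show that maximizing the variance along the principal component is equivalent to minimizing the reconstruction error, i.e., the sum of the squares of the perpendicular distances from the data points to the component:
$$
\begin{aligned}
\text{Reconstruction error} &= \sum_{i=1}^n \|\mathbf{x}_i - \mathbf{u}\mathbf{u}^\top \mathbf{x}_i\|^2 \\
&= \sum_{i=1}^n \left( \|\mathbf{x}_i\|^2 - (\mathbf{u}^\top \mathbf{x}_i)^2 \right) \\
&= \text{constant} - \sum_{i=1}^n (\mathbf{u}^\top \mathbf{x}_i)^2 \\
&= \text{constant} - \|\mathbf{u}^\top \mathbf{X}\|^2
\end{aligned}
$$
Note that $\mathbf{u} \perp (\mathbf{x}_i - \mathbf{u}\mathbf{u}^\top \mathbf{x}_i)$ because, for $\|\mathbf{u}\| = 1$,
$$
\mathbf{u}^\top (\mathbf{x}_i - \mathbf{u}\mathbf{u}^\top \mathbf{x}_i) = 0,
$$
so the second line follows from Pythagoras's theorem. These derivations show that minimizing reconstruction error is the same as maximizing variance.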
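The identity "reconstruction error = constant − projected sum of squares" can be checked numerically for any unit vector. The sketch below does this on random data; the data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 50))          # d = 4, n = 50 (arbitrary sizes)
X = X - X.mean(axis=1, keepdims=True)     # center the data

u = rng.standard_normal(4)
u /= np.linalg.norm(u)                    # any unit-norm direction

residual = X - np.outer(u, u) @ X         # x_i - u u^T x_i, for every column i
reconstruction_error = np.sum(residual ** 2)
constant = np.sum(X ** 2)                 # sum_i ||x_i||^2, independent of u
captured = np.sum((u @ X) ** 2)           # ||u^T X||^2

print(reconstruction_error, constant - captured)
```

The two printed numbers should agree up to floating-point error, and `u @ residual` should be numerically zero, which is the orthogonality used in the Pythagoras step.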
First r principal components
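Compute $r$ principal components (eigenvectors of $\mathbf{C}$):
$\mathbf{U}_{d \times r} = (\mathbf{u}_1 \dots \mathbf{u}_r)$
New representation:
$\mathbf{z} = \mathbf{U}^\top \mathbf{x}$ ($z_j = \mathbf{u}_j^\top \mathbf{x}$ for $j = 1, \dots, r$)
Reconstruction:
$\mathbf{x} \approx \mathbf{U}\mathbf{z} = \sum_{j=1}^r z_j \mathbf{u}_j$
Multiple data points:
$\mathbf{X}_{d \times n} \approx \mathbf{U}_{d \times r} \mathbf{Z}_{r \times n}$
$(\mathbf{x}_1 \dots \mathbf{x}_n) \approx (\mathbf{u}_1 \dots \mathbf{u}_r)(\mathbf{z}_1 \dots \mathbf{z}_n)$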
We compute the first $r$ eigenvectors of the covariance matrix $\mathbf{C}$, which takes $O(rd^2)$ time (after $O(nd^2)$ to first compute $\mathbf{C}$). To get a new representation, we project $\mathbf{x}$ with $\mathbf{U}^\top$ to get new coordinates $\mathbf{z}$. To reconstruct $\mathbf{x}$, we take a linear combination of the principal components $\mathbf{U} = (\mathbf{u}_1, \dots, \mathbf{u}_r)$ with coefficients $\mathbf{z} = (z_1, \dots, z_r)$. Note that if $r = d$, then the reconstruction is exact. In that case, $\mathbf{U}$ is an orthogonal matrix ($\mathbf{U}\mathbf{U}^\top = \mathbf{I}$), so $\mathbf{U}^\top = \mathbf{U}^{-1}$.
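The projection and reconstruction steps can be sketched as follows. The helper `pca` is a hypothetical name introduced for this example, and it uses a full eigendecomposition for simplicity rather than an algorithm that computes only the top $r$ eigenvectors.

```python
import numpy as np

def pca(X, r):
    """Return (U, Z) for centered data X of shape d x n.

    U (d x r) holds the top-r eigenvectors of the covariance matrix,
    Z (r x n) holds the new coordinates z_i = U^T x_i.
    """
    C = X @ X.T / X.shape[1]
    eigvals, eigvecs = np.linalg.eigh(C)   # ascending eigenvalues
    U = eigvecs[:, ::-1][:, :r]            # reorder to descending, keep r columns
    return U, U.T @ X

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 100))
X -= X.mean(axis=1, keepdims=True)

U2, Z2 = pca(X, 2)                         # lossy: r = 2 < d
U6, Z6 = pca(X, 6)                         # r = d: reconstruction is exact
err2 = np.linalg.norm(X - U2 @ Z2)
err6 = np.linalg.norm(X - U6 @ Z6)
print(err2, err6)
```

With `r = 6 = d` the reconstruction error should be zero up to rounding, consistent with $\mathbf{U}$ being orthogonal in that case.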
Eigen-faces [Turk, 1991]
• Each $\mathbf{x}_i$ is a face image, which is a vector in $\mathbb{R}^d$
• $d$ = number of pixels
• $x_{ji}$ = intensity of the $j$-th pixel in image $i$
$\mathbf{X}_{d \times n} \approx \mathbf{U}_{d \times r} \mathbf{Z}_{r \times n}$
(face images) $\approx$ (eigen-face images) $(\mathbf{z}_1 \dots \mathbf{z}_n)$
$\mathbf{z}_i$ is a more "meaningful" representation of faces than $\mathbf{x}_i$.
An application of PCA is image analysis. Here, the principal components (eigenvectors) are images that resemble faces. Note that the representation is sensitive to rotation and translation. One can store $\mathbf{z}$ as a compressed version of the original image $\mathbf{x}$. The $\mathbf{z}$ can be used as features in classification (although LDA might be more suitable).
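Latent Semantic Analysis [Deerwester, 1990]
• Each $\mathbf{x}_i$ is a bag of words, which is a vector in $\mathbb{R}^d$
• $d$ = number of words in the vocabulary
• $x_{ji}$ = frequency of word $j$ in document $i$
$\mathbf{X}_{d \times n} \approx \mathbf{U}_{d \times r} \mathbf{Z}_{r \times n}$
$$
\begin{array}{r}
\text{stocks:} \\ \text{chairman:} \\ \text{the:} \\ \cdots \\ \text{wins:} \\ \text{game:}
\end{array}
\begin{pmatrix}
2 & \cdots & 0 \\
4 & \cdots & 1 \\
8 & \cdots & 7 \\
\vdots & & \vdots \\
0 & \cdots & 2 \\
1 & \cdots & 3
\end{pmatrix}
\approx
\begin{pmatrix}
0.4 & \cdots & -0.001 \\
0.8 & \cdots & 0.03 \\
0.01 & \cdots & 0.04 \\
\vdots & & \vdots \\
0.002 & \cdots & 2.3 \\
0.003 & \cdots & 1.9
\end{pmatrix}
(\mathbf{z}_1 \dots \mathbf{z}_n)
$$
How to measure similarity between two documents?
$\mathbf{z}_1^\top \mathbf{z}_2$ is probably better than $\mathbf{x}_1^\top \mathbf{x}_2$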
Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing, is an application of PCA to categorical data. LSA is often used in information retrieval. Eigen-documents try to capture "semantics": an eigen-document contains related words. But how do we interpret negative frequencies? Other methods such as probabilistic LSA, Latent Dirichlet Allocation, or non-negative matrix factorization may lead to more interpretable results.
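To illustrate why $\mathbf{z}_1^\top \mathbf{z}_2$ can be a better similarity than $\mathbf{x}_1^\top \mathbf{x}_2$, here is a toy sketch. The six-word vocabulary and the counts are invented for this example, and the decomposition uses an uncentered truncated SVD, which is a common way to compute LSA but an assumption relative to the centered PCA described earlier.

```python
import numpy as np

# Rows: words; columns: documents. All counts are made up for illustration.
vocab = ["stocks", "market", "chairman", "wins", "game", "team"]
X = np.array([
    [2, 0, 1, 0, 0, 0],   # stocks
    [0, 2, 1, 0, 0, 0],   # market
    [0, 0, 1, 0, 0, 0],   # chairman
    [0, 0, 0, 2, 0, 1],   # wins
    [0, 0, 0, 1, 1, 1],   # game
    [0, 0, 0, 0, 2, 1],   # team
], dtype=float)

# Rank-2 representation: columns of U_r play the role of eigen-documents.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_r = U[:, :2]
Z = U_r.T @ X

x_sim = X[:, 0] @ X[:, 1]     # documents 0 and 1 share no words
z_sim = Z[:, 0] @ Z[:, 1]     # but both are about finance
z_cross = Z[:, 0] @ Z[:, 3]   # document 3 is about sports
print(x_sim, z_sim, z_cross)
```

Documents 0 and 1 have zero word overlap, yet their reduced representations are similar because document 2 links "stocks" and "market"; the finance and sports documents remain dissimilar.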
How many principal components?
• Similar to question of "How many clusters?"
• Magnitude of eigenvalues indicates fraction of variance captured.
• Eigenvalues on a face image dataset:
[Plot: eigenvalue $\lambda_i$ (vertical axis, ticks from 287.1 to 1353.2) against index $i$ (horizontal axis, $i = 2, \dots, 11$)]
• Eigenvalues typically drop off sharply, so don't need that many.
• Of course variance isn't everything...
The total variance is the sum of all the eigenvalues, which is just the trace of the covariance matrix (sum of diagonal entries). For typical data sets, the eigenvalues decay rapidly.
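One common heuristic, sketched below, is to keep the smallest number of components whose eigenvalues account for a chosen fraction of the total variance. The 95% threshold and the synthetic data with geometrically decaying scales are assumptions for illustration, not a recommendation from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 8, 500
scales = 2.0 ** -np.arange(d)               # rapidly decaying standard deviations
X = rng.standard_normal((d, n)) * scales[:, None]
X -= X.mean(axis=1, keepdims=True)

C = X @ X.T / n
eigvals = np.linalg.eigvalsh(C)[::-1]       # descending order

# Fraction of total variance captured by the first k components, k = 1..d.
fraction = np.cumsum(eigvals) / np.trace(C)

# Smallest r whose cumulative fraction reaches the (assumed) 95% threshold.
r = int(np.searchsorted(fraction, 0.95) + 1)
print(fraction.round(3), r)
```

The sum of the eigenvalues equals the trace of `C`, so `fraction` ends at 1.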
PCA references
• Schutze: Distributional POS tagging
http://ucrel.lancs.ac.uk/acl/E/E95/E95-1020.pdf
• Ando, Zhang: A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
http://www-cs-students.stanford.edu/%7etzhang/papers/jmlr05%5fsemisup.pdf
Methods / Canonical correlation analysis (CCA)
Motivation for CCA [Hotelling, 1936]
Often, each data point consists of two views:
• Image retrieval: for each image, have the following:
– x: Pixels (or other visual features)
– y: Text around the image
• Genomics: for each gene, have the following:
– x: Gene expression in DNA microarray
– y: Chemical reactions catalyzed in metabolic pathways
Goal: reduce the dimensionality of the two views jointly
Each data point is a vector, which can be broken up into two parts (possibly of different dimensionality), which we call $\mathbf{x}$ and $\mathbf{y}$. What happens if PCA is run on the concatenated data points $\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}$?
From variance to correlation
Input data: $(\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_n, \mathbf{y}_n)$
PCA:
find $\mathbf{u}$ to maximize variance $E(\mathbf{u}^\top \mathbf{x})^2$
find $\mathbf{v}$ to maximize variance $E(\mathbf{v}^\top \mathbf{y})^2$
CCA: find $(\mathbf{u}, \mathbf{v})$ to maximize correlation $\operatorname{corr}(\mathbf{u}^\top \mathbf{x}, \mathbf{v}^\top \mathbf{y})$
[Figure: the $\mathbf{x}$ and $\mathbf{y}$ data points; correspondence between $\mathbf{x}$ and $\mathbf{y}$ data points denoted by shade of gray]
Result: [Figure: CCA solution shown as green arrows, PCA solution as black arrows]
Here, CCA chooses directions so that dark points have large projected coordinates and light points have small projected coordinates, both for the $\mathbf{x}$ view and the $\mathbf{y}$ view, causing them to be correlated. PCA ignores the relationship between $\mathbf{x}$ and $\mathbf{y}$ entirely.
Properties of CCA
Objective: maximize correlation between projected views
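$$
\max_{\mathbf{u},\mathbf{v}} \operatorname{corr}(\mathbf{u}^\top \mathbf{x}, \mathbf{v}^\top \mathbf{y}) = \max_{\mathbf{u},\mathbf{v}} \frac{\operatorname{cov}(\mathbf{u}^\top \mathbf{x}, \mathbf{v}^\top \mathbf{y})}{\sqrt{\operatorname{var}(\mathbf{u}^\top \mathbf{x})} \sqrt{\operatorname{var}(\mathbf{v}^\top \mathbf{y})}}
$$
Reduces to a generalized eigenvalue problem.
Remember correlation $\ne$ covariance.
• If $\mathbf{x} = \mathbf{A}\mathbf{y}$, then any $(\mathbf{u}, \mathbf{v})$ with $\mathbf{v} = \mathbf{A}^\top \mathbf{u}$ is optimal (correlation 1), since then $\mathbf{u}^\top \mathbf{x} = \mathbf{v}^\top \mathbf{y}$
• If $\mathbf{x}$ and $\mathbf{y}$ are independent, then any $(\mathbf{u}, \mathbf{v})$ is optimal (correlation 0)
• Canonical correlation is invariant to invertible affine transformations of $\mathbf{x}, \mathbf{y}$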
Correlation is defined as covariance divided by standard deviations. If we remove the entire denominator, then we end up with partial least squares (PLS). If we remove just $\sqrt{\operatorname{var}(\mathbf{v}^\top \mathbf{y})}$, then we have multiple linear regression (MLR). To get some intuition about CCA, consider degenerate cases ($\mathbf{x}$ and $\mathbf{y}$ are totally dependent or independent).
CCA objective function
Objective: maximize correlation between projected views
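$$
\begin{aligned}
&\max_{\mathbf{u},\mathbf{v}} \operatorname{corr}(\mathbf{u}^\top \mathbf{x}, \mathbf{v}^\top \mathbf{y}) = \max_{\mathbf{u},\mathbf{v}} \frac{\operatorname{cov}(\mathbf{u}^\top \mathbf{x}, \mathbf{v}^\top \mathbf{y})}{\sqrt{\operatorname{var}(\mathbf{u}^\top \mathbf{x})} \sqrt{\operatorname{var}(\mathbf{v}^\top \mathbf{y})}} \\
&= \max_{\widehat{\operatorname{var}}(\mathbf{u}^\top \mathbf{x}) = \widehat{\operatorname{var}}(\mathbf{v}^\top \mathbf{y}) = 1} \widehat{\operatorname{cov}}(\mathbf{u}^\top \mathbf{x}, \mathbf{v}^\top \mathbf{y}) \\
&= \max_{\|\mathbf{u}^\top \mathbf{X}\| = \|\mathbf{v}^\top \mathbf{Y}\| = 1} \sum_{i=1}^n (\mathbf{u}^\top \mathbf{x}_i)(\mathbf{v}^\top \mathbf{y}_i) \\
&= \max_{\|\mathbf{u}^\top \mathbf{X}\| = \|\mathbf{v}^\top \mathbf{Y}\| = 1} \mathbf{u}^\top \mathbf{X}\mathbf{Y}^\top \mathbf{v} \\
&= \text{largest generalized eigenvalue } \lambda \text{ given by}
\end{aligned}
$$
$$
\begin{pmatrix} 0 & \mathbf{X}\mathbf{Y}^\top \\ \mathbf{Y}\mathbf{X}^\top & 0 \end{pmatrix}
\begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix}
= \lambda
\begin{pmatrix} \mathbf{X}\mathbf{X}^\top & 0 \\ 0 & \mathbf{Y}\mathbf{Y}^\top \end{pmatrix}
\begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix},
$$
which reduces to an ordinary eigenvalue problem.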
Derivation for CCA (one component)
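The first canonical component is given by solving the following optimization problem:
$$
\text{maximize } \mathbf{u}^\top \mathbf{X}\mathbf{Y}^\top \mathbf{v} \text{ subject to } \mathbf{u}^\top \mathbf{X}\mathbf{X}^\top \mathbf{u} \le 1 \text{ and } \mathbf{v}^\top \mathbf{Y}\mathbf{Y}^\top \mathbf{v} \le 1.
$$
This is a constrained optimization problem, so we construct the Lagrangian (the factors of $\tfrac{1}{2}$ only simplify the gradient):
$$
L(\mathbf{u}, \mathbf{v}, \lambda_u, \lambda_v) = \mathbf{u}^\top \mathbf{X}\mathbf{Y}^\top \mathbf{v} - \frac{\lambda_u}{2}(\mathbf{u}^\top \mathbf{X}\mathbf{X}^\top \mathbf{u} - 1) - \frac{\lambda_v}{2}(\mathbf{v}^\top \mathbf{Y}\mathbf{Y}^\top \mathbf{v} - 1).
$$
Taking the gradient with respect to $\mathbf{u}$ and $\mathbf{v}$ and setting it to zero gives a necessary condition for optimality (the objective is bilinear, so this is not a convex optimization problem and the condition alone is not sufficient):
$$
\nabla L(\mathbf{u}, \mathbf{v}, \lambda_u, \lambda_v) =
\begin{pmatrix}
(\mathbf{X}\mathbf{Y}^\top)\mathbf{v} - \lambda_u (\mathbf{X}\mathbf{X}^\top)\mathbf{u} \\
(\mathbf{Y}\mathbf{X}^\top)\mathbf{u} - \lambda_v (\mathbf{Y}\mathbf{Y}^\top)\mathbf{v}
\end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \end{pmatrix}.
$$
Left-multiplying the top row by $\mathbf{u}^\top$ and the bottom row by $\mathbf{v}^\top$ and subtracting results in
$$
\lambda_u \mathbf{u}^\top (\mathbf{X}\mathbf{X}^\top) \mathbf{u} = \lambda_v \mathbf{v}^\top (\mathbf{Y}\mathbf{Y}^\top) \mathbf{v}.
$$
Since $\mathbf{u}^\top (\mathbf{X}\mathbf{X}^\top) \mathbf{u} = \mathbf{v}^\top (\mathbf{Y}\mathbf{Y}^\top) \mathbf{v} = 1$ at the optimum, $\lambda_u = \lambda_v \stackrel{\text{def}}{=} \lambda$. Now we can write the optimality conditions as a generalized eigenvalue problem:
$$
\begin{pmatrix} 0 & \mathbf{X}\mathbf{Y}^\top \\ \mathbf{Y}\mathbf{X}^\top & 0 \end{pmatrix}
\begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix}
= \lambda
\begin{pmatrix} \mathbf{X}\mathbf{X}^\top & 0 \\ 0 & \mathbf{Y}\mathbf{Y}^\top \end{pmatrix}
\begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix}.
$$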
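One way to reduce the generalized eigenvalue problem to an ordinary one is to whiten each view and take a singular value decomposition of the whitened cross-covariance. The sketch below follows that route on synthetic two-view data; the function names `inv_sqrt` and `cca_first_pair`, the data, and the absence of regularization are all assumptions of this example.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    lam, E = np.linalg.eigh(S)
    return E @ np.diag(lam ** -0.5) @ E.T

def cca_first_pair(X, Y):
    """First canonical pair for centered X (dx x n) and Y (dy x n).

    Returns (u, v, rho) with u^T X X^T u = v^T Y Y^T v = 1 and
    rho = u^T X Y^T v, the largest canonical correlation.
    """
    Sx, Sy = inv_sqrt(X @ X.T), inv_sqrt(Y @ Y.T)
    A, s, Bt = np.linalg.svd(Sx @ X @ Y.T @ Sy)
    return Sx @ A[:, 0], Sy @ Bt[0], s[0]

rng = np.random.default_rng(4)
n = 1000
t = rng.standard_normal(n)                 # latent signal shared by both views
X = np.vstack([t + 0.1 * rng.standard_normal(n),
               rng.standard_normal(n)])
Y = np.vstack([rng.standard_normal(n),
               -t + 0.1 * rng.standard_normal(n),
               rng.standard_normal(n)])
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

u, v, rho = cca_first_pair(X, Y)
empirical = np.corrcoef(u @ X, v @ Y)[0, 1]
print(rho, empirical)
```

The returned `rho` should equal the empirical correlation of the two projections, and `(u, v, rho)` should satisfy the top block row of the generalized eigenvalue problem, $\mathbf{X}\mathbf{Y}^\top \mathbf{v} = \lambda \mathbf{X}\mathbf{X}^\top \mathbf{u}$.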
CCA references
• Hardoon, et al.: Canonical correlation analysis; An overview with application to learning methods
http://eprints.ecs.soton.ac.uk/10658/01/TR%5fCSD%5f03%5f02.pdf
Applies kernel CCA on two views (one of images and one of text) for image retrieval.
• Yamanishi, et al.: Heterogeneous data comparison and gene selection with kernel canonical correlation analysis
http://cg.ensmp.fr/%7evert/publi/04kmcbbook/heterogeneous.pdf
• Matlab code for CCA
http://www.imt.liu.se/%7emagnus/cca/
Methods / Linear discriminant analysis (LDA)
Motivation for LDA [Fisher, 1936]
What is the best linear projection?
[Figure: PCA solution]
What is the best linear projection with these labels?
[Figure: PCA solution versus LDA solution]
Goal: reduce the dimensionality given labels
Idea: want projection to maximize overall interclass variance relative to intraclass variance
Interclass variance is the sum of squared distances between pairs of points in different classes; intraclass variance is the same sum over pairs of points in the same class.
LDA objective function
Objective: maximize $\dfrac{\text{interclass variance}}{\text{intraclass variance}} = \dfrac{\text{total variance}}{\text{intraclass variance}} - 1$
Total variance: $\frac{1}{n} \sum_i (\mathbf{u}^\top (\mathbf{x}_i - \boldsymbol{\mu}))^2$
Mean of all points: $\boldsymbol{\mu} = \frac{1}{n} \sum_i \mathbf{x}_i$
Intraclass variance: $\frac{1}{n} \sum_i (\mathbf{u}^\top (\mathbf{x}_i - \boldsymbol{\mu}_{y_i}))^2$
Mean of points in class $y$: $\boldsymbol{\mu}_y = \frac{1}{|\{i : y_i = y\}|} \sum_{i : y_i = y} \mathbf{x}_i$
Reduces to a generalized eigenvalue problem.
The total variance is the sum of interclass variance and intraclass variance.
LDA derivation
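Global mean: $\boldsymbol{\mu} = \frac{1}{n} \sum_i \mathbf{x}_i$, $\mathbf{X}_g = (\mathbf{x}_1 - \boldsymbol{\mu}, \dots, \mathbf{x}_n - \boldsymbol{\mu})$
Class mean: $\boldsymbol{\mu}_y = \frac{1}{|\{i : y_i = y\}|} \sum_{i : y_i = y} \mathbf{x}_i$, $\mathbf{X}_c = (\mathbf{x}_1 - \boldsymbol{\mu}_{y_1}, \dots, \mathbf{x}_n - \boldsymbol{\mu}_{y_n})$
Objective: maximize $\dfrac{\text{total variance}}{\text{intraclass variance}} = \dfrac{\text{interclass variance}}{\text{intraclass variance}} + 1$
$$
\begin{aligned}
&\max_{\mathbf{u}} \frac{\sum_{i=1}^n (\mathbf{u}^\top (\mathbf{x}_i - \boldsymbol{\mu}))^2}{\sum_{i=1}^n (\mathbf{u}^\top (\mathbf{x}_i - \boldsymbol{\mu}_{y_i}))^2} \\
&= \max_{\|\mathbf{u}^\top \mathbf{X}_c\| = 1} \sum_{i=1}^n (\mathbf{u}^\top (\mathbf{x}_i - \boldsymbol{\mu}))^2 \\
&= \max_{\|\mathbf{u}^\top \mathbf{X}_c\| = 1} \mathbf{u}^\top \mathbf{X}_g \mathbf{X}_g^\top \mathbf{u} \\
&= \text{largest generalized eigenvalue } \lambda \text{ given by}
\end{aligned}
$$
$$
(\mathbf{X}_g \mathbf{X}_g^\top) \mathbf{u} = \lambda (\mathbf{X}_c \mathbf{X}_c^\top) \mathbf{u}.
$$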
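The generalized eigenvalue problem above can be sketched in a few lines. In the synthetic two-class data below (an assumption of this example), the first coordinate has high variance but no class information, while the second has low variance and separates the classes, so PCA and LDA should pick nearly orthogonal directions. The function name `lda_direction` is hypothetical, and solving via `np.linalg.solve` followed by `np.linalg.eig` is one simple choice that assumes $\mathbf{X}_c \mathbf{X}_c^\top$ is invertible.

```python
import numpy as np

def lda_direction(X, y):
    """Top LDA direction for data X (d x n) with integer labels y (length n).

    Solves (Xg Xg^T) u = lambda (Xc Xc^T) u and returns (u, lambda) for the
    largest lambda, i.e. the ratio of total to intraclass variance along u.
    """
    Xg = X - X.mean(axis=1, keepdims=True)           # subtract global mean
    Xc = X.astype(float).copy()
    for label in np.unique(y):                       # subtract class means
        mask = y == label
        Xc[:, mask] -= X[:, mask].mean(axis=1, keepdims=True)
    St, Sw = Xg @ Xg.T, Xc @ Xc.T
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, St))
    top = np.argmax(eigvals.real)
    return eigvecs[:, top].real, eigvals[top].real

rng = np.random.default_rng(5)
n = 200
y = np.repeat([0, 1], n)
X = np.vstack([5.0 * rng.standard_normal(2 * n),             # uninformative, high variance
               0.3 * rng.standard_normal(2 * n) + 2.0 * y])  # informative, low variance

u_lda, ratio = lda_direction(X, y)
Xg = X - X.mean(axis=1, keepdims=True)
u_pca = np.linalg.eigh(Xg @ Xg.T)[1][:, -1]
print(u_pca, u_lda, ratio)
```

Here `u_pca` should align with the first axis and `u_lda` with the second, mirroring the "PCA solution versus LDA solution" picture.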
Summary so far
Framework: $\mathbf{z} \approx \mathbf{U}^\top \mathbf{x}$
Criteria for choosing $\mathbf{U}$ thus far:
• PCA: maximize variance
• CCA: maximize correlation
• LDA: maximize $\dfrac{\text{interclass variance}}{\text{intraclass variance}}$
All these methods reduce to solving generalized eigenvalue problems.
Next (NMF, ICA): more complex criteria for $\mathbf{U}$.
Methods / Non-negative matrix factorization (NMF)
Motivation for NMF [Paatero, ’94; Lee, ’99]
Back to basic PCA setting (single view, no labels)
$\mathbf{X}_{d \times n} \approx \mathbf{U}_{d \times r} \mathbf{Z}_{r \times n}$
$(\mathbf{x}_1 \dots \mathbf{x}_n) \approx (\mathbf{u}_1 \dots \mathbf{u}_r)(\mathbf{z}_1 \dots \mathbf{z}_n)$
• Data is not just any arbitrary real vector:
– Text modeling: each document is a vector of term frequencies
– Gene expression: each gene is a vector of expression profiles
– Collaborative filtering: each user is a vector of movie ratings
• Each basis vector $\mathbf{u}_i$ is an "eigen-document/eigen-gene/eigen-user"
• Would like $\mathbf{U}$ and $\mathbf{Z}$ to have only non-negative entries so that we can interpret each point as a combination of prototypes
Goal: reduce the dimensionality given non-negativity constraints
Non-negativity constraints can improve visualization and interpretability. For example, in gene expression, a basis vector should represent a (soft) subset of experiments rather than an arbitrary linear combination.
PCA versus NMF
$\mathbf{x} \approx \sum_{j=1}^r z_j \mathbf{u}_j$
• Sum of basis vectors must be (positively) additive ($z_j \ge 0$)
• The basis vectors $\mathbf{u}_i$ tend to be sparse
• NMF recovers a parts-based representation of $\mathbf{x}$ whereas PCA recovers a holistic representation
• Caveat for images: sparsity depends on proper alignment (remember, representation is still a bag of pixels)
We can see a qualitative difference between the basis vectors recovered by PCA and those recovered by NMF.
NMF machinery
• Objectives to minimize (all entries in $\mathbf{X}, \mathbf{U}, \mathbf{Z}$ non-negative)
– Frobenius norm (same as PCA but with non-negativity constraints):
$\|\mathbf{X} - \mathbf{U}\mathbf{Z}\|_F^2 = \sum_{i=1}^n \sum_{j=1}^d (X_{ji} - (\mathbf{U}\mathbf{Z})_{ji})^2$
– KL divergence:
$\mathrm{KL}(\mathbf{X} \| \mathbf{U}\mathbf{Z}) = \sum_{i=1}^n \sum_{j=1}^d \left( X_{ji} \log \frac{X_{ji}}{(\mathbf{U}\mathbf{Z})_{ji}} - X_{ji} + (\mathbf{U}\mathbf{Z})_{ji} \right)$
• Algorithm
– Hard non-convex optimization problem: could get stuck in local minima, need to worry about initialization
– Simple/fast multiplicative update rule [Lee & Seung ’99, ’01]
• Relationship to other methods
– Vector quantization: each $\mathbf{z}_i$ is 1 in exactly one component and 0 elsewhere
– Probabilistic latent semantic analysis: equivalent to 2nd objective
– Latent Dirichlet Allocation: Bayesian version of pLSI
Unlike for PCA, CCA, and LDA, there are several reasonable objective functions that we could use. Optimization is hard and requires an iterative algorithm that converges only to a local optimum.
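For the Frobenius objective, the multiplicative updates of Lee & Seung can be sketched as below. The function name `nmf`, the random initialization, the fixed iteration count, and the small constant `eps` guarding against division by zero are assumptions of this example rather than details from the slides.

```python
import numpy as np

def nmf(X, r, n_iter=200, seed=0, eps=1e-12):
    """Approximate non-negative X (d x n) by U (d x r) times Z (r x n).

    Uses multiplicative updates for the squared Frobenius error. Because
    every update multiplies by a non-negative ratio, U and Z stay non-negative.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    U = rng.random((d, r)) + 0.1
    Z = rng.random((r, n)) + 0.1
    errors = []
    for _ in range(n_iter):
        Z *= (U.T @ X) / (U.T @ U @ Z + eps)
        U *= (X @ Z.T) / (U @ Z @ Z.T + eps)
        errors.append(np.linalg.norm(X - U @ Z) ** 2)
    return U, Z, errors

# Synthetic non-negative data of rank 3 (an assumption for illustration).
rng = np.random.default_rng(9)
X = rng.random((8, 3)) @ rng.random((3, 40))
U, Z, errors = nmf(X, r=3)
print(errors[0], errors[-1])
```

The recorded error should never increase from one iteration to the next, but the solution it settles on depends on the initialization, in line with the local-optimum caveat above.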
NMF references
• Lee, Seung: Learning the parts of objects with nonnegative matrix factorization.
http://www.seas.upenn.edu/%7eddlee/Papers/nmf.pdf
• Hoyer: Non-negative Matrix Factorization with Sparseness Constraints
http://jmlr.csail.mit.edu/papers/volume5/hoyer04a/hoyer04a.pdf
• Lin: Projected Gradient Methods for Non-negative Matrix Factorization
http://www.csie.ntu.edu.tw/%7ecjlin/papers/pgradnmf.pdf
• Srebro: Learning with Matrix Factorizations
http://people.csail.mit.edu/nati/Publications/thesis.pdf
• Buntine, Jakulin: Discrete Principal Component Analysis
http://cosco.hiit.fi/search/MPCA/buntineDPCA.pdf
• Matlab code for NMF
http://www.broad.mit.edu/mpr/publications/projects/NMF/nmf.m
Methods / Independent component analysis (ICA)
Motivation for ICA [Herault & Jutten, ’86]
x = Uz
Cocktail party problem:
$d$ people, $d$ microphones, $n$ time steps
Assume: people are speaking independently ($\mathbf{z}$),
acoustics mix linearly through an invertible $\mathbf{U}$
$\mathbf{X}$ = [Figure: the mixed signals recorded by the microphones]
Goal: find transformation $\mathbf{U}^{-1}$ that makes components of $\mathbf{z} = \mathbf{U}^{-1}\mathbf{x}$ as independent as possible
PCA versus ICA
[Figure panels: original signal, PCA solution, ICA solution]
ICA finds independent components; it doesn't work if the data is Gaussian:
[Figure: Gaussian data, marked "?"]
ICA algorithm
$\mathbf{x} = \mathbf{U}\mathbf{z}$
• Preprocessing: whiten data $\mathbf{X}$ with PCA so that components are uncorrelated
• Find $\mathbf{U}^{-1}$ to maximize independence of $\mathbf{z} = \mathbf{U}^{-1}\mathbf{x}$
• How to measure independence? mutual information, negentropy, non-Gaussianity (e.g., kurtosis)
• Hard non-convex optimization
• Methods for solving: fastICA, kernelICA, ProDenICA
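The two steps above (whiten with PCA, then search for maximally non-Gaussian directions) can be sketched with a kurtosis-based fixed-point iteration in the style of fastICA. This is a simplified illustration written for this document, not the reference fastICA implementation; the function names, the deflation scheme, the stopping rule, and the synthetic sources are assumptions.

```python
import numpy as np

def whiten(X):
    """Center X (d x n) and rescale it so that its covariance is the identity."""
    Xc = X - X.mean(axis=1, keepdims=True)
    lam, E = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
    W = (E / np.sqrt(lam)).T          # rows are e_i^T / sqrt(lambda_i)
    return W @ Xc

def fastica_kurtosis(X, n_iter=200, seed=0):
    """Estimate d independent sources from mixtures X (d x n), one at a time."""
    rng = np.random.default_rng(seed)
    Xw = whiten(X)
    d = Xw.shape[0]
    B = np.zeros((d, d))              # rows: unmixing directions in whitened space
    for k in range(d):
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            # Fixed-point step for kurtosis: E[x (w^T x)^3] - 3 w.
            w_new = (Xw * (w @ Xw) ** 3).mean(axis=1) - 3 * w
            # Deflation: stay orthogonal to directions already found.
            w_new -= B[:k].T @ (B[:k] @ w_new)
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1) < 1e-10
            w = w_new
            if converged:
                break
        B[k] = w
    return B @ Xw

rng = np.random.default_rng(6)
n = 20000
S = np.vstack([rng.uniform(-1, 1, n), rng.laplace(size=n)])  # non-Gaussian sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])                       # invertible mixing matrix
S_hat = fastica_kurtosis(A @ S)

# Absolute correlation between each true source (rows) and each estimate (columns).
corr = np.abs(np.corrcoef(np.vstack([S, S_hat]))[:2, 2:])
print(corr.round(3))
```

Each true source should be strongly correlated with exactly one recovered component; the order and sign of the components are arbitrary, which is inherent to ICA.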
ICA references
• Hyvarinen, Karhunen, Oja: Independent Component Analysis (introduction chapter)
http://www.cis.hut.fi/projects/ica/book/intro.pdf
• Hyvarinen, Oja: Independent Component Analysis: Algorithms and Applications
http://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf
• Bach, Jordan: Kernel Independent Component Analysis
http://cmm.ensmp.fr/%7ebach/kernelICA-jmlr.pdf
• Some ICA software
http://www.tsi.enst.fr/icacentral/algos.html
• ICA code for Matlab
http://www.cis.hut.fi/projects/ica/fastica/
Case studies
Network anomaly detection [Lakhina, ’05]
Input data: traffic flow on each link in the network during each time interval
Model assumption: traffic is sum of flows along a few "paths"
Apply PCA: each principal component intuitively represents a path
Anomaly: when traffic data deviates from first few principal components
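The idea of flagging deviations from the first few principal components can be sketched on synthetic traffic. This is a simplified illustration, not the procedure of [Lakhina, ’05]: the traffic model, the number of paths, the injected anomaly at time step 123, and the use of the largest residual as the detection rule are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n_links, n_steps, n_paths = 20, 300, 3

# Normal traffic: each link carries a mix of a few path flows, plus small noise.
paths = rng.random((n_links, n_paths))
flows = rng.random((n_paths, n_steps)) * 10
X = paths @ flows + 0.05 * rng.standard_normal((n_links, n_steps))

# Inject an anomaly at one time step (hypothetical event).
X[:, 123] += 3.0 * rng.standard_normal(n_links)

Xc = X - X.mean(axis=1, keepdims=True)
eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T / n_steps)
U = eigvecs[:, -n_paths:]                   # first few principal components

# Anomaly score: squared norm of the part of each column outside span(U).
residual = Xc - U @ (U.T @ Xc)
score = np.sum(residual ** 2, axis=0)
detected = int(np.argmax(score))
print(detected)
```

Time steps that follow the low-dimensional model have a small residual, so the injected anomaly stands out by a wide margin.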
Multi-task learning [Ando & Zhang, ’05]
Setup:
• Have a set of related tasks (classify documents for various users)
• Each task has a classifier (weights of a linear classifier)
• Want to share structure between classifiers
One step of their procedure: given a set of classifiers $\mathbf{x}_1, \dots, \mathbf{x}_n$, run PCA to identify shared structure:
$\mathbf{X} = (\mathbf{x}_1 \dots \mathbf{x}_n) \approx \mathbf{U}\mathbf{Z}$
Each data point is a linear classifier
Each principal component is an eigen-classifier
Unsupervised POS tagging [Schutze, ’95]
Part-of-speech (POS) tagging task:
Input: I like reducing the dimensionality of data .
Output: NOUN VERB VERB(-ING) DET NOUN PREP NOUN .
Each data point is (the context distribution of) a word.
Key idea: words appearing in similar contexts tend to have the same POS tags; so cluster using the contexts of each word type
Issue: contexts are too sparse
Solution: run PCA first, then cluster using new representation
Brain imaging
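$\mathbf{s}$ = [Figure: multichannel signal readings over time]
Data: EEG/MEG/fMRI readings
Goal: separate signals into sources
One solution: ICA
Another solution: CCA [Borga, ’02]
The two views are the signals $\mathbf{s}$ at adjacent time steps:
$(\mathbf{x}_1, \mathbf{y}_1) = (\mathbf{s}(1), \mathbf{s}(2))$
$(\mathbf{x}_2, \mathbf{y}_2) = (\mathbf{s}(2), \mathbf{s}(3))$
$(\mathbf{x}_3, \mathbf{y}_3) = (\mathbf{s}(3), \mathbf{s}(4))$
. . .
More robust and faster than ICA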
Extensions, related methods, summary
Extensions
• Non-linear: can use kernel trick
• Sparsity: either in $\mathbf{U}$ or $\mathbf{z}$
• Robustness: be insensitive to outliers
• Online: update basis vectors U with additional data
• Probabilistic (factor analysis):
– Handle missing data
– Estimate uncertainty
– Natural way to incorporate in a larger model
Other linear methods
• Partial least squares (PLS)
– Find directions of maximum covariance
– Generalized eigenvalue problem
• Multiple linear regression (MLR):
– Find directions of minimum squared error
– Generalized eigenvalue problem
• Random projections:
– Randomly project data onto O(log n) dimensions
– Pairwise distances preserved with high probability
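The random-projection claim can be checked empirically with a Gaussian projection matrix. In the sketch below, the target dimension `k = 300` is picked by hand for illustration (the slide only states that $O(\log n)$ dimensions suffice), and the data and helper name `pairwise_sq_dists` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
d, n, k = 1000, 50, 300                      # original dim, points, projected dim

X = rng.standard_normal((d, n))
R = rng.standard_normal((k, d)) / np.sqrt(k) # scaled so squared norms are preserved on average
Z = R @ X

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all columns of A."""
    sq = np.sum(A ** 2, axis=0)
    return sq[:, None] + sq[None, :] - 2 * A.T @ A

iu = np.triu_indices(n, k=1)                 # each unordered pair once
ratio = pairwise_sq_dists(Z)[iu] / pairwise_sq_dists(X)[iu]
print(ratio.min(), ratio.max())
```

All ratios of projected to original squared distances should stay close to 1, and the spread shrinks as `k` grows.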
Curtain call
PCA: find subspace that captures most variance in data; eigenvalue problem
CCA: find pair of subspaces that captures most correlation; generalized eigenvalue problem
LDA: find subspace that maximizes $\dfrac{\text{interclass variance}}{\text{intraclass variance}}$; generalized eigenvalue problem
NMF: find subspace that minimizes reconstruction error for non-negative data; non-trivial optimization problem
ICA: find subspace where sources are independent; non-trivial optimization problem
General references
• Shawe-Taylor and Cristianini: Kernel Methods for Pattern Analysis
http://www.kernel-methods.net
• de Bie, et al.: Eigenproblems in Pattern Recognition
http://www.ofai.at/%7eroman.rosipal/Papers/eig%5fbook04.pdf
• Borga, et al.: A unified approach to PCA, PLS, MLR and CCA
http://www.cvl.isy.liu.se/ScOut/TechRep/Papers/LiTHISYR1992.pdf
• Tutorial on component analysis for vision
http://eccv2006.tugraz.at/tutorials.html