
Factorization, Principal Component Analysis and Singular Value Decomposition

Volker Tresp, Summer 2017

1

Recall: Multiple Linear Regression

• Many inputs and one output

y ≈ Xw

2

Multivariate Linear Regression: Linear Regression with Several Outputs

• Many inputs and many outputs

Y ≈ XW

3

Unknown Inputs

• Now we assume that the inputs X are also unknown

• We change the notation and write A = X and B = W^T and get

Y = AB^T

• Example: each row of Y corresponds to a user, each column of Y corresponds to a movie, and y_{i,j} is the rating of user i for movie j

• Thus the i-th row of A describes the latent attributes or latent factors of the user associated with the i-th row, and the j-th row of B describes the latent attributes or latent factors of the movie associated with the j-th column

4

Cost Function

• A least-squares cost function becomes

\[
\sum_{(i,j)\in R} \Big( y_{i,j} - \sum_{k=1}^{r} a_{i,k} b_{j,k} \Big)^2 \;+\; \lambda \sum_{i=1}^{N} \sum_{k=1}^{r} a_{i,k}^2 \;+\; \lambda \sum_{j=1}^{M} \sum_{k=1}^{r} b_{j,k}^2
\]

Here R is the set of existing ratings, r (rank) is the number of latent factors, and λ is a regularization parameter

• Note that the cost function ignores movies which have not been rated yet and treats them as missing

• A and B are found via stochastic gradient descent

• After convergence, we can predict for any user and any movie:

\[
y_{i,j} = \sum_{k=1}^{r} a_{i,k} b_{j,k}
\]

5
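The gradient updates follow directly from this cost function. Below is a minimal NumPy sketch of the stochastic gradient descent step, assuming the ratings are given as (i, j, y) triples; the function name, learning rate, and epoch count are illustrative choices, not part of the original slides.

```python
import numpy as np

def factorize_sgd(ratings, N, M, r=10, lam=0.1, eta=0.01, epochs=20, seed=0):
    """Minimal SGD sketch for Y ~ A B^T, using only the observed ratings.

    ratings: list of (i, j, y_ij) triples, i.e. the set R of existing ratings.
    """
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.standard_normal((N, r))   # latent factors of the users (rows of A)
    B = 0.1 * rng.standard_normal((M, r))   # latent factors of the movies (rows of B)
    for _ in range(epochs):
        rng.shuffle(ratings)
        for i, j, y in ratings:
            err = y - A[i] @ B[j]           # residual on one observed entry
            a_i = A[i].copy()
            # gradient steps on the regularized squared error
            A[i] += eta * (err * B[j] - lam * a_i)
            B[j] += eta * (err * a_i - lam * B[j])
    return A, B

# after convergence, the predicted rating of user i for movie j is A[i] @ B[j]
```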

Symmetry of the Factorization

• Matrix factorization was the most important component in the winning entries in the Netflix competition: rows are users, columns are movies, and y_{i,j} is the rating user i gave for movie j

• Note that the i-th row of A contains the latent factors of user i and the j-th row of B contains the latent factors of movie j (symmetry of the decomposition!)

6

Factorization of the Design Matrix

• So far we started with Y ≈ XW, i.e., we factorized the output matrix

• In other applications it makes sense to factorize the design matrix X as

X ≈ AB^T

• This is a form of dimensionality reduction, if r < M

• As we will see later, a classifier with A as design matrix can give better results than a classifier with X as design matrix. Example: X has many columns and is extremely sparse, whereas A might have a small number of columns and is non-sparse

7

As an Autoencoder

• The optimal a_i can be written as a function of all attributes of i, i.e., of x_{i,:}

• In matrix form

A = XV

• Then

X ≈ XVB^T

or

x_i ≈ BV^T x_i

• Thus if X is complete, we can learn the factorization via an autoencoder

8
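As an illustration of this view, a linear autoencoder with encoder V and decoder B can be trained by gradient descent when X is complete. The following NumPy sketch, including its step size and initialization, is an assumption made for illustration, not code from the lecture.

```python
import numpy as np

def linear_autoencoder(X, r=5, eta=1e-3, steps=1000, seed=0):
    """Minimal sketch: learn X ~ X V B^T by gradient descent (X complete)."""
    N, M = X.shape
    rng = np.random.default_rng(seed)
    V = 0.01 * rng.standard_normal((M, r))   # encoder: A = X V
    B = 0.01 * rng.standard_normal((M, r))   # decoder: X_hat = A B^T
    for _ in range(steps):
        A = X @ V                  # latent codes
        E = X - A @ B.T            # reconstruction error
        # gradients of the squared Frobenius error
        grad_B = -2 * E.T @ A
        grad_V = -2 * X.T @ E @ B
        B -= eta * grad_B
        V -= eta * grad_V
    return V, B
```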

PCA

• The factorization approach as described is not unique, and it has only recently been used in machine learning

• More traditional is the factorization via a principal component analysis (PCA)

• With A → Z and B → V we get

X ≈ Z_r V_r^T

• The i-th row of Z_r contains the r principal components of i (the new name for the latent factors). With r = min(M, N) the factorization is without error. With r < min(M, N) it is an approximation

• The columns of V are orthonormal

• The decomposition is unique and is optimal for any r with respect to the cost function

\[
\sum_{i,j} \Big( x_{i,j} - \sum_{k=1}^{r} z_{i,k} v_{j,k} \Big)^2
\]

9

• We will now derive the solution

Dimensionality Reduction

• We want to compress the M-dimensional x to an r-dimensional z using a linear transformation

• We want x to be reconstructed from z as well as possible, in the mean squared error sense, over all data points x_i:

\[
\sum_i (x_i - V_r z_i)^T (x_i - V_r z_i)
\]

where V_r is an M × r matrix.

10

First Component

• Let’s first look at r = 1 and we want to find the vector v

• Without loss of generality, we assume that ‖v‖ = 1

• The reconstruction error for a particular x_i is given by

\[
(x_i - \hat{x}_i)^T (x_i - \hat{x}_i) = (x_i - v z_i)^T (x_i - v z_i).
\]

The optimal z_i is then (see figure)

\[
z_i = v^T x_i
\]

Thus we get

\[
\hat{x}_i = v v^T x_i
\]

11

Computing the First Principal Vector

• So what is v? We are looking for a v that minimizes the reconstruction error over all data points. We use the Lagrange parameter λ to guarantee length 1

\[
L = \sum_{i=1}^{N} (v v^T x_i - x_i)^T (v v^T x_i - x_i) + \lambda (v^T v - 1)
\]
\[
= \sum_{i=1}^{N} x_i^T v v^T v v^T x_i + x_i^T x_i - x_i^T v v^T x_i - x_i^T v v^T x_i + \lambda (v^T v - 1)
\]
\[
= \sum_{i=1}^{N} x_i^T x_i - x_i^T v v^T x_i + \lambda (v^T v - 1)
\]

12

Computing the First Principal Vector

• The first term does not depend on v. We take the derivative with respect to v and obtain for the second term

\[
\frac{\partial}{\partial v}\, x_i^T v v^T x_i = \frac{\partial}{\partial v} (v^T x_i)^T (v^T x_i) = 2 \Big( \frac{\partial}{\partial v} v^T x_i \Big) (v^T x_i) = 2\, x_i (v^T x_i) = 2\, x_i (x_i^T v) = 2\, (x_i x_i^T)\, v
\]

and for the last term

\[
\lambda \frac{\partial}{\partial v} v^T v = 2 \lambda v
\]

• We set the derivative to zero and get

\[
\sum_{i=1}^{N} x_i x_i^T v = \lambda v
\]

13

or in matrix form

\[
\Sigma v = \lambda v \qquad \text{where } \Sigma = X^T X
\]

• Recall that the Lagrangian is maximized with respect to λ

• Thus the first principal vector v is the first eigenvector of Σ (the one with the largest eigenvalue)

• z_i = v^T x_i is called the first principal component of x_i
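As a numerical sanity check of this result, power iteration on Σ = X^T X converges to the eigenvector with the largest eigenvalue, i.e., the first principal vector. A minimal NumPy sketch (names and iteration count are illustrative):

```python
import numpy as np

def first_principal_vector(X, iters=200):
    """Power iteration on Sigma = X^T X to find the leading eigenvector."""
    Sigma = X.T @ X
    v = np.random.default_rng(0).standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = Sigma @ v
        v /= np.linalg.norm(v)      # keep ||v|| = 1
    return v

# first principal component of each data point: z = X @ v
```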

Computing all Principal Vectors

• The second principal vector is given by the second eigenvector of Σ and so on

• For a rank-r approximation we get

z_i = V_r^T x_i

• Here, the columns of V_r are orthonormal and correspond to the r eigenvectors of Σ with the largest eigenvalues

• The optimal reconstruction is

x̂_i = V_r z_i

14
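Putting the derivation together, a minimal NumPy sketch that computes V_r from the r leading eigenvectors of Σ = X^T X (uncentered data, as in the derivation above; function names are illustrative):

```python
import numpy as np

def pca_fit(X, r):
    """Rank-r PCA via the eigendecomposition of Sigma = X^T X."""
    Sigma = X.T @ X
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:r]         # indices of the r largest
    V_r = eigvecs[:, order]                       # M x r, orthonormal columns
    return V_r

def pca_transform(X, V_r):
    return X @ V_r              # rows are the principal components z_i = V_r^T x_i

def pca_reconstruct(Z, V_r):
    return Z @ V_r.T            # rows are the reconstructions x_hat_i = V_r z_i
```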

PCA Applications

15

Classification and Regression

• First perform a PCA of X and then use z_i instead of x_i as input to the classifier, where

z_i = V_r^T x_i

16
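For example, a hypothetical pipeline could look as follows; the use of scikit-learn's LogisticRegression, the data arrays, and the pca_fit helper from the earlier sketch are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X_train, y_train, X_test are assumed to be given NumPy arrays
V_r = pca_fit(X_train, r=20)            # principal vectors from the training inputs
Z_train = X_train @ V_r                 # z_i = V_r^T x_i as classifier inputs
Z_test = X_test @ V_r

clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
y_pred = clf.predict(Z_test)
```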

Similarity and Novelty

• A distance measure (Euclidean distance) based on the principal components is often more meaningful than a distance measure calculated in the original space

• Novelty detection / outlier detection: We calculate the reconstruction of a new vector x and compute

\[
\| x - V_r V_r^T x \| = \| V_{-r}^T x \|
\]

If this distance is large, then the new input is unusual, i.e., it might be an outlier

• Here V_{-r} contains the M − r eigenvectors v_{r+1}, ..., v_M of Σ

17
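A small sketch of this novelty score; V_r is assumed to be computed from the training data, e.g., with the pca_fit sketch above:

```python
import numpy as np

def novelty_score(x, V_r):
    """Norm of the reconstruction error ||x - V_r V_r^T x|| for a new vector x."""
    x_hat = V_r @ (V_r.T @ x)
    return np.linalg.norm(x - x_hat)

# inputs with a large score relative to the training data are flagged as outliers
```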

PCA Example: Handwritten Digits

18

Data Set

• 130 handwritten digits "3" (in total: 658): significant differences in style

• The images have 16 × 16 grey-valued pixels. Each input vector x consists of the 256 grey values of the pixels: applying a linear classifier to the original pixels gives bad results

19

Visualisation

• We see the first two principal vectors v1, v2

• v1 prolongs the lower portion of the “3”

• v2 modulates thickness

20

Visualisation: Reconstruction

• For different values of the principal components z_1 and z_2 the reconstructed image is shown:

x̂ = m + z_1 v_1 + z_2 v_2

• m is a mean vector that was subtracted before the PCA was performed and is now added again. m contains the 256 mean pixel values averaged over all samples

21

Eigenfaces

22

Data Set

• PCA for face recognition

• http://vismod.media.mit.edu/vismod/demos/facerec/basic.html

• 7562 images from 3000 persons

• x_i contains the pixel values of the i-th image. Obviously it does not make sense to build a classifier directly on the 256 × 256 = 65536 pixel values

• Eigenfaces were calculated based on 128 images (the training set); "eigenfaces" might sound cooler than "principal vectors"!

• For recognition on test images, the first r = 20 principal components are used

• Almost every person had at least 2 images; many persons had images with varying facial expressions, different hair styles, different beards, ...

23

Similarity Search based on Principal Components

• The upper left image is the test image. Based on the Euclidean distance in PCA-space, the other 15 images were classified as nearest neighbors. All 15 images came from the correct person, although the database contained more than 7562 images!

• Thus, the distance is evaluated as

‖z − z_i‖

24
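A minimal sketch of such a nearest-neighbor search in PCA space (array names are illustrative assumptions):

```python
import numpy as np

def nearest_neighbors(z_query, Z_database, k=15):
    """Indices of the k database images closest to the query in PCA space."""
    dists = np.linalg.norm(Z_database - z_query, axis=1)   # ||z - z_i|| for all i
    return np.argsort(dists)[:k]
```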

Recognition Rate

• 200 pictures were selected randomly from the test set. In 96% of all cases the nearest neighbor was the correct person

25

Modular Eigenspaces

• The method can also be applied to facial features such as eigeneyes, eigennoses, and eigenmouths

• Analysis of human eye movements also showed that humans concentrate on these local features as well

26

Automatically Finding the Facial Features

• The modular methods require an automatic way of finding the facial features (eyes, nose, mouth)

• One defines rectangular windows that are indexed by the central pixel in the window

• One computes the anomaly of the image window at all locations; the detector was trained on a feature class (e.g., left eyes) using a rank-10 PCA. Where the anomaly is minimal, the feature (eye) is detected (a sketch follows after this slide)

\[
\mathrm{AN}_{\text{left eye}}(z_{pos_k}) = \| z_{pos_k} - \hat{z}_{pos_k} \|^2
\]

• In the following images, brightness encodes the anomaly

27
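A rough sketch of such an anomaly map: the reconstruction error of every image window in the eigenfeature subspace spanned by V_r (e.g., a rank-10 basis fitted on training eye windows). The window size, the dense scan, and the omitted mean-centering are simplifying assumptions for illustration.

```python
import numpy as np

def anomaly_map(image, V_r, win=(16, 16)):
    """Reconstruction error of every image window in the eigenfeature subspace."""
    H, W = image.shape
    h, w = win
    scores = np.full((H - h + 1, W - w + 1), np.inf)
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            x = image[i:i + h, j:j + w].ravel()       # window as a vector
            x_hat = V_r @ (V_r.T @ x)                  # projection onto the subspace
            scores[i, j] = np.sum((x - x_hat) ** 2)    # anomaly AN(pos)
    return scores

# the detected feature location is the position with the smallest anomaly:
# i_best, j_best = np.unravel_index(np.argmin(scores), scores.shape)
```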

Input Image

28

Distances

29

Detection

30

Training Templates

31

Typical Detections

32

Detection Rate

• The next plot shows the performance of the left-eye detector based on AN_left eye (labeled as DFFS) with rank 1 and with rank 10. Also shown are the results for simple template matching (distance to the mean left-eye image, labeled SSD)

• Definition of detection: the global optimum is below a threshold value α and is within 5 pixels of the correct location. Detection rate = recall = P(pred = 1 | y = 1)

• Definition of false alarm: the global optimum is below a threshold value α and is outside of 5 pixels of the correct location. Specificity = P(pred = 0 | y = 0)

• In the curves, α is varied. DFFS(10) reaches a correct detection rate of 94% at a false alarm rate of 6%. This means that in 94% of all cases where a left eye has been detected in the image, it was detected at the right location, and in 6% of all cases where a left eye has been detected in the image, it was detected at the wrong location

33

Robustness

• A potential advantage of the eigenfeature layer is the ability to overcome the shortcomings of the standard eigenface method. A pure eigenface recognition system can be fooled by gross variations in the input image (hats, beards, etc.)

• The first row of the figure above shows additional testing views of 3 individuals in the above dataset of 45. These test images are indicative of the type of variations which can lead to false matches: a hand near the face, a painted face, and a beard

• The second row in the figure above shows the nearest matches found based on a standard eigenface classification. None of the 3 matches corresponds to the correct individual

• On the other hand, the third row shows the nearest matches based on the eyes and nose features, and results in correct identification in each case. This simple example illustrates the advantage of a modular representation in disambiguating false eigenface matches

34

PCA with Centered Data

• Sometimes the mean is subtracted first

\[
x_{i,j} \leftarrow x_{i,j} - m_j, \qquad \text{where} \quad m_j = \frac{1}{N} \sum_{i=1}^{N} x_{i,j}
\]

• X now contains the centered data

• Centering is recommended when data are approximately Gaussian distributed

35

PCA with Centered Data (cont’d)

• Let

\[
\hat{X} = X V_r V_r^T
\]

then

\[
\hat{x}_i = m + \sum_{l=1}^{r} v_l z_{i,l}
\]

with m = (m_1, . . . , m_M)^T and

\[
z_{i,l} = v_l^T x_i
\]

36
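A minimal sketch of PCA with centered data and reconstruction with the mean added back (plain NumPy; function names are illustrative):

```python
import numpy as np

def pca_centered(X, r):
    """Center the data, fit rank-r PCA, return mean, principal vectors, components."""
    m = X.mean(axis=0)                          # per-column means m_j
    Xc = X - m                                  # centered data
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
    V_r = eigvecs[:, np.argsort(eigvals)[::-1][:r]]
    Z = Xc @ V_r                                # z_{i,l} = v_l^T (x_i - m)
    return m, V_r, Z

def reconstruct(m, V_r, Z):
    return m + Z @ V_r.T                        # x_hat_i = m + sum_l v_l z_{i,l}
```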

PCA and Singular Value Decomposition

37

Singular Value Decomposition (SVD)

• Any N × M matrix X can be factored as

X = U D V^T

where U and V are both orthonormal matrices. U is an N × N matrix and V is an M × M matrix

• D is an N × M diagonal matrix with diagonal entries (singular values) d_i ≥ 0, i = 1, . . . , r, with r = min(M, N)

• The u_j (columns of U) are the left singular vectors

• The v_j (columns of V) are the right singular vectors

• The d_j are the singular values

38
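In NumPy the decomposition is available directly via np.linalg.svd; the sketch below only checks the reconstruction, with illustrative matrix shapes:

```python
import numpy as np

X = np.random.default_rng(0).standard_normal((6, 4))   # N = 6, M = 4

U, d, Vt = np.linalg.svd(X, full_matrices=True)   # U: N x N, d: singular values, Vt = V^T
D = np.zeros_like(X)
np.fill_diagonal(D, d)                            # D: N x M diagonal matrix

assert np.allclose(X, U @ D @ Vt)                 # X = U D V^T
```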

Covariance Matrix and Kernel Matrix

• We get for the empirical covariance matrix

\[
\Sigma = \frac{1}{N} X^T X = \frac{1}{N} V D^T U^T U D V^T = \frac{1}{N} V D^T D V^T = \frac{1}{N} V D_V V^T
\]

with D_V = D^T D

• And for the empirical kernel matrix

\[
K = \frac{1}{M} X X^T = \frac{1}{M} U D V^T V D^T U^T = \frac{1}{M} U D D^T U^T = \frac{1}{M} U D_U U^T
\]

with D_U = D D^T

• With

\[
\Sigma V = \frac{1}{N} V D_V \qquad K U = \frac{1}{M} U D_U
\]

one sees that the columns of V are the eigenvectors of Σ and the columns of U are the eigenvectors of K; the eigenvalues are the diagonal entries of D_V and D_U, respectively (up to the factors 1/N and 1/M)

39

• Apparent by now: the columns of V are both the principal vectors and the eigenvectors!

More Expressions

• The SVD is

X = U D V^T

from which we get

X = U U^T X

X = X V V^T

40

Reduced Rank

• In the SVD, the d_i are ordered: d_1 ≥ d_2 ≥ d_3 ≥ . . . ≥ d_r. In many cases one can neglect the d_i with i > r and one obtains a rank-r approximation. Let D_r be a diagonal matrix with the corresponding entries. Then we get the approximation

\[
\hat{X} = U_r D_r V_r^T
\]
\[
\hat{X} = U_r U_r^T X
\]
\[
\hat{X} = X V_r V_r^T
\]

where U_r contains the first r columns of U and, correspondingly, V_r contains the first r columns of V

41

Best Approximation

• The approximation above is the best rank-r approximation with respect to the squared error (Frobenius norm). The approximation error is

\[
\sum_{i=1}^{N} \sum_{j=1}^{M} (x_{i,j} - \hat{x}_{i,j})^2 = \sum_{j=r+1}^{\min(M,N)} d_j^2
\]

42
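A quick numerical check of the rank-r truncation and of this error identity (shapes and the chosen rank are illustrative):

```python
import numpy as np

X = np.random.default_rng(1).standard_normal((8, 5))
U, d, Vt = np.linalg.svd(X, full_matrices=False)

r = 2
X_hat = U[:, :r] @ np.diag(d[:r]) @ Vt[:r, :]            # rank-r approximation U_r D_r V_r^T

frobenius_error = np.sum((X - X_hat) ** 2)
assert np.isclose(frobenius_error, np.sum(d[r:] ** 2))   # equals the sum of discarded d_j^2
```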

Factors for Rows and Columns

• Recall that in the Netflix example on matrix factorization with X ≈ AB^T, the rows of A contained the factors for the users and the rows of B contained the factors for the movies

• In the PCA, the factors for the entities associated with the rows are the rows of Z_r = X V_r, and the factors for the entities associated with the columns are the rows of

\[
T_r = X^T U_r
\]

• With this definition,

\[
\hat{X} = Z_r D_r^{-1} T_r^T
\]

since

\[
Z_r D_r^{-1} T_r^T = X V_r D_r^{-1} U_r^T X = U_r D_r V_r^T V_r D_r^{-1} U_r^T U_r D_r V_r^T = U_r D_r V_r^T
\]

43
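A short numerical check of these row and column factors (shapes and the rank are illustrative):

```python
import numpy as np

X = np.random.default_rng(2).standard_normal((7, 4))
U, d, Vt = np.linalg.svd(X, full_matrices=False)
r = 2
U_r, D_r, V_r = U[:, :r], np.diag(d[:r]), Vt[:r, :].T

Z_r = X @ V_r                 # row factors (e.g., users, documents)
T_r = X.T @ U_r               # column factors (e.g., movies, terms)

assert np.allclose(Z_r, U_r @ D_r)        # Z_r = U_r D_r
assert np.allclose(T_r, V_r @ D_r)        # T_r = V_r D_r
assert np.allclose(Z_r @ np.linalg.inv(D_r) @ T_r.T,
                   U_r @ D_r @ V_r.T)     # X_hat = Z_r D_r^{-1} T_r^T
```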

LSA: Similarities Between Documents

44

Feature Vectors for Documents

• Given a collection of N documents and M keywords

• X is the term-frequency (tf) matrix; x_{i,j} indicates how often word j occurred in document i

• Some classifiers use this representation as inputs

• On the other hand, two documents might discuss similar topics (be "semantically similar") without using the same key words

• By doing a PCA we can find document representations as rows of Z_r = X V_r and term representations as rows of T_r = X^T U_r

• This is known as Latent Semantic Analysis (LSA)

45
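A compact sketch of LSA on a toy term-frequency matrix; the documents, vocabulary handling, and rank are invented for illustration and only loosely follow the example on the next slides:

```python
import numpy as np

# toy tf matrix: rows = documents, columns = terms
docs = ["human computer interaction", "user interface system",
        "graph trees minors", "graph minors survey"]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

U, dvals, Vt = np.linalg.svd(X, full_matrices=False)
r = 2
Z_r = X @ Vt[:r, :].T          # document representations (rows)
T_r = X.T @ U[:, :r]           # term representations (rows)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# semantic similarity between documents 0 and 1 in LSA space
print(cosine(Z_r[0], Z_r[1]))
```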

Simple Example

• In total 9 sentences (documents):

– 5 documents on human-computer interaction (c1 - c5)

– 4 texts on mathematical graph theory (m1 - m4)

• The 12 key words are shown in italics

46

tf-Matrix and Word Correlations

• The tf-matrix X

• Based on the original data, the Pearson correlation between human and user is negative, although one would assume a large semantic correlation

47

Singular Value Decomposition

• Decomposition X = U D V^T

48

Approximation with r = 2 and Word Correlations

• Reconstruction X̂ with r = 2

• Shown is X̂^T

• Based on X̂ the correlation between human and user is almost one! The similarity between human and minors is strongly negative (as it should be)

• In document m4 ("Graph minors: a survey") the word survey, which is in the original document, gets a smaller value than the term trees, which was not in the document originally

49

Document Correlations in the Original and the Reconstructed Data

• Top: document correlations in the original data X: the average correlation between documents in the c-class is almost zero

• Bottom: in X̂ there is a strong correlation between documents in the same class and a strong negative correlation across document classes

50

Applications of LSA

• LSA similarity often corresponds to the human perception of document or word similarity

• There are commercial applications in the evaluation of term papers

• There are indications that search engine providers like Google and Yahoo use LSA for the ranking of pages and to filter out spam (spam is unusual, novel)

51

Illustration

• The next slides illustrate LSA, where the horizontal axis stands for the word index and the vertical axis stands for the word count

• If we consider word counts as functions of the index (functions as infinite-dimensional vectors), then the LSA (and the PCA) performs function smoothing

• The columns of V would then define the basis functions (note that in LSA the columns would be orthonormal, in contrast to the situation displayed in the plot)

• The columns of V define patterns

• If, as shown, the columns of V have limited support, they define different subspaces

52

Extensions

• Factorization approaches are part of many machine learning solutions

• An autoencoder, as used in deep learning, is closely related

• Factorization can be generalized to more dimensions:

• For example, a 3-way array (tensor) X with dimensions subject, predicate, object. The tensor has a one where the triple is known to exist and a zero otherwise. Then we can approximate (PARAFAC [PARAllel FACtors]; a sketch follows after this slide)

\[
x_{i,j,l} \approx \sum_{k=1}^{r} a_{i,k} b_{j,k} c_{l,k}
\]

Here A contains the latent factors of the subject, B of the object, and C of the predicate

• If the entries of X are nonnegative (for example, they represent counts), it sometimes improves interpretability of the latent factors to enforce that the factor matrices are nonnegative as well (nonnegative matrix factorization (NMF), probabilistic LSA (pLSA), latent Dirichlet allocation (LDA))

53
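A minimal sketch of the PARAFAC reconstruction and a crude gradient-descent fit on a dense tensor; all names, the dense representation, and the fitting scheme are illustrative simplifications (real knowledge-graph tensors are sparse and use specialized solvers):

```python
import numpy as np

def parafac_reconstruct(A, B, C):
    """x_hat[i, j, l] = sum_k A[i, k] * B[j, k] * C[l, k]."""
    return np.einsum('ik,jk,lk->ijl', A, B, C)

def parafac_fit(X, r=5, eta=0.01, steps=500, seed=0):
    """Crude (block) gradient descent on the squared reconstruction error."""
    n1, n2, n3 = X.shape
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.standard_normal((n1, r))
    B = 0.1 * rng.standard_normal((n2, r))
    C = 0.1 * rng.standard_normal((n3, r))
    for _ in range(steps):
        E = parafac_reconstruct(A, B, C) - X           # error tensor
        A -= eta * np.einsum('ijl,jk,lk->ik', E, B, C)  # scaled gradient w.r.t. A
        B -= eta * np.einsum('ijl,ik,lk->jk', E, A, C)  # scaled gradient w.r.t. B
        C -= eta * np.einsum('ijl,ik,jk->lk', E, A, B)  # scaled gradient w.r.t. C
    return A, B, C
```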

Extensions: Improving Multivariate Linear Regression

• Consider that we have several input dimensions and several output dimensions

• It makes sense to require that the latent factors are calculated from the input representation alone, but that this mapping itself is derived by also taking the training outputs into account

• The next two slides show some possible architectures

• In classical analysis this is done via partial least squares (PLS) or via a canonical correlation analysis (CCA)

54

Relationships to Clustering and Classification

• In multiclass classification an object is assigned to one out of several classes. In the ground truth the object belongs to exactly one class. The classifier might only look at a subset of the inputs

• Clustering is identical to classification, except that the class labels are unknown in the training data

• In multi-label classification an object can be assigned to several classes. This means that also in the ground truth the object can belong to more than one class. Each class might only look at an individual subset of the inputs. Let I_k be the set of inputs affiliated with class k

• Factor analysis and topic models are related to multi-label classification where the class labels (latent factors) are unknown in the training set (this interpretation works best with non-negative approaches like NMF, pLSI, LDA)

• This also leads to interpretable similarity. Consider a term-document matrix:

55

– Two documents i and i′ are semantically similar if their topic profiles are similar, i.e., z_i ≈ z_{i′}

– Two terms j and j′ are semantically similar if they appear in the same sets I_k, i.e., if their input-set profiles are similar (since T_r = X^T U_r = V_r D_r) (again, this interpretation works best with non-negative approaches like NMF, pLSI, LDA)

• Related in data mining: subspace clustering and frequent item set mining (the input-set profiles I_k are the frequent item sets)