Week 4, Lecture 7 - Dimensionality Reduction: PCA and NMF

Post on 23-Aug-2020



1 / 36

Week 4, Lecture 7 - Dimensionality Reduction: PCA and NMF

Aaron Meyer

2 / 36

Outline

- Administrative Issues
- Decomposition methods
  - Factor analysis
  - Principal components analysis
  - Non-negative matrix factorization

3 / 36

Dealing with many variables

- So far we've largely concentrated on cases in which we have relatively large numbers of measurements for a few variables
  - This is frequently referred to as n > p
- Two other extremes are important:
  - Many observations and many variables
  - Many variables but few observations (p > n)

4 / 36

Dealing with many variables

- Usually when we're dealing with many variables, we don't have a great understanding of how they relate to each other
  - E.g. if gene X is high, we can't be sure that gene Y will be too
- If we had these relationships, we could reduce the data
  - E.g. if we had variables to tell us it's 3 pm in Los Angeles, we don't need one to say it's daytime

5 / 36

Dimensionality Reduction

Generate a low-dimensional encoding of a high-dimensional space

Purposes:
- Data compression / visualization
- Robustness to noise and uncertainty
- Potentially easier to interpret

Bonus: Many of the other methods from the class can be applied after dimensionality reduction with little or no adjustment!

6 / 36

Matrix Factorization

Many (most?) dimensionality reduction methods involve matrix factorization

Basic Idea: Find two (or more) matrices whose product best approximates the original matrix

Low-rank approximation to the original N × M matrix:

X ≈ WHᵀ

where W is N × R, H is M × R, and R ≪ min(N, M).
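As a sketch of the idea (illustrative, not part of the lecture code), a truncated SVD in NumPy yields exactly this kind of rank-R factorization, and it is the best rank-R approximation in the least-squares sense:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, R = 50, 30, 3

# Construct a matrix with rank exactly R
X = rng.normal(size=(N, R)) @ rng.normal(size=(R, M))

# Truncated SVD gives the best rank-R approximation in the least-squares sense
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = U[:, :R] * s[:R]   # N x R
Ht = Vt[:R, :].T       # M x R

X_approx = W @ Ht.T
error = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
print(error)  # essentially zero, since X was built with rank R
```

Storing W and Ht takes R(N + M) numbers instead of NM, which is the sense in which factorization is compression.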

7 / 36

Matrix Factorization

Generalization of many methods (e.g., SVD, QR, CUR, TruncatedSVD, etc.)

8 / 36

Aside - What should R be?

X ≈ WHᵀ

where W is N × R, H is M × R, and R ≪ min(N, M).
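One common heuristic for picking R (an illustration, not prescribed by the slides) is to keep the smallest number of components that explains a target fraction of the variance:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca = PCA().fit(X)  # keep all components
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest R whose components explain at least 95% of the variance
R = int(np.searchsorted(cumvar, 0.95)) + 1
print(R, cumvar)
```

Plotting cumvar against the component index (a "scree plot") is the usual way to eyeball this threshold.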

9 / 36

Matrix Factorization

Matrix factorization is also compression

Figure: http://www.aaronschlegel.com/image-compression-principal-component-analysis/

10 / 36

Factor Analysis

Matrix factorization is also compression

Figure: http://www.aaronschlegel.com/image-compression-principal-component-analysis/

11 / 36

Factor Analysis

Matrix factorization is also compression

Figure: http://www.aaronschlegel.com/image-compression-principal-component-analysis/

12 / 36

Examples from bioengineering

Process control
- Large bioreactor runs may be recorded in a database, along with a variety of measurements from those runs
- We may be interested in how those different runs varied, and how each factor relates to one another
- Plotting a compressed version of that data can indicate when an anomalous change is present

13 / 36

Examples from bioengineering

Mutational processes
- Anytime multiple contributory factors give rise to a phenomenon, matrix factorization can separate them out
- Will talk about this in greater detail

Cell heterogeneity
- Enormous interest in understanding how cells are similar or different
- The answer to this can vary in millions of different ways
- But cells often follow programs

14 / 36

Principal Components Analysis

Application of matrix factorization
- Each principal component (PC) is a linear combination of uncorrelated attributes / features
- Ordered in terms of variance
- The kth PC is orthogonal to all previous PCs
- Reduces dimensionality while maintaining maximal variance
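These properties can be verified numerically; a small sketch (using the iris data for convenience, not part of the slides):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=4)
scores = pca.fit_transform(X)

# The loadings are orthonormal: components_ @ components_.T is the identity
G = pca.components_ @ pca.components_.T
print(np.allclose(G, np.eye(4)))

# Score variance decreases from PC1 to PC4, per the ordering property
var = scores.var(axis=0)
print(var)
```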

15 / 36

Principal Components Analysis

BOARD

16 / 36

Methods to calculate PCA

- Iterative computation
  - More robust with high numbers of variables
  - Slower to calculate
- NIPALS (non-linear iterative partial least squares)
  - Able to efficiently calculate a few PCs at once
  - Breaks down for high numbers of variables (large p)
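For illustration, the NIPALS iteration for a single PC can be sketched in a few lines (a simplified, hypothetical implementation, not the lecture's code): alternate between estimating loadings from scores and scores from loadings until the scores stop changing.

```python
import numpy as np

def nipals_first_pc(X, n_iter=500, tol=1e-12):
    """Extract the first PC of X (column-centered inside) by NIPALS iteration."""
    X = X - X.mean(axis=0)
    t = X[:, 0].copy()                # scores: start from any nonzero column
    for _ in range(n_iter):
        p = X.T @ t / (t @ t)         # update loadings from scores
        p /= np.linalg.norm(p)
        t_new = X @ p                 # update scores from loadings
        if np.linalg.norm(t_new - t) < tol:
            t = t_new
            break
        t = t_new
    return t, p

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
t, p = nipals_first_pc(X)

# Cross-check against the dominant right singular vector of the centered data
Xc = X - X.mean(axis=0)
v1 = np.linalg.svd(Xc, full_matrices=False)[2][0]
print(abs(p @ v1))  # close to 1: same direction up to sign
```

Subsequent PCs come from repeating this on the deflated matrix X - t pᵀ, which is why NIPALS can cheaply compute just the first few components.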

17 / 36

Practical Notes

PCA
- Implemented within sklearn.decomposition.PCA
- PCA.fit_transform(X) fits the model to X, then provides the data in principal component space
- PCA.components_ provides the "loadings matrix", or directions of maximum variance
- PCA.explained_variance_ provides the amount of variance explained by each component

18 / 36

PCA

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

X = iris.data
y = iris.target
target_names = iris.target_names

pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

# Print PC1 loadings (first row of the components x features matrix)
print(pca.components_[0, :])
# ...

19 / 36

PCA

# ...
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

# Print PC1 loadings
print(pca.components_[0, :])

# Print PC1 scores
print(X_r[:, 0])

# Percentage of variance explained by each component
print(pca.explained_variance_ratio_)
# [ 0.92461621  0.05301557]

20 / 36

PCA

21 / 36

Non-negative matrix factorization

Like PCA, except the coefficients in the linear combination must be non-negative
- Forcing positive coefficients implies an additive combination of basis parts to reconstruct the whole
- Generally leads to zeros for factors that don't contribute
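A quick sketch of these properties with scikit-learn (synthetic data, illustrative only): data built from additive non-negative parts is recovered with strictly non-negative factors.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Synthetic non-negative data built from 3 additive parts
W_true = rng.uniform(size=(100, 3))
H_true = rng.uniform(size=(3, 20))
X = W_true @ H_true

nmf = NMF(n_components=3, init='nndsvda', max_iter=1000, random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_

rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(W.min() >= 0 and H.min() >= 0)  # factors are non-negative
print(rel_err)
```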

22 / 36

Non-negative matrix factorization

The answer you get will always depend on the error metric, starting point, and search method

BOARD

23 / 36

What is significant about this?

- The update rule is multiplicative instead of additive
- If the initial values for W and H are non-negative, then W and H can never become negative
  - This guarantees a non-negative factorization
- Will converge to a local optimum
  - Therefore the starting point matters

24 / 36

Non-negative matrix factorization

The answer you get will always depend on the error metric, starting point, and search method
- Another approach is to find the gradient across all the variables in the matrix
  - Called coordinate descent, and is usually faster
  - Not going to go through the implementation
  - Will also converge to a local optimum
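A small experiment (illustrative only) showing the dependence on the starting point with scikit-learn's coordinate-descent solver: different random initializations reach similarly good reconstructions, but the factors themselves need not match.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3)) @ rng.uniform(size=(3, 20))

# Same data and solver, but three different random starting points
errs, Ws = [], []
for seed in range(3):
    nmf = NMF(n_components=3, init='random', solver='cd',
              max_iter=2000, random_state=seed)
    W = nmf.fit_transform(X)
    errs.append(np.linalg.norm(X - W @ nmf.components_) / np.linalg.norm(X))
    Ws.append(W)

print(errs)  # reconstruction errors are all small and similar...
# ...but the factors can differ between starting points
# (at minimum by a permutation and rescaling of the components)
print(np.allclose(Ws[0], Ws[1]))
```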

25 / 36

NMF application: Netflix

Suppose Alice rates Inception 4 stars. What led to this rating?

26 / 36

NMF application: Netflix

480,000 users × 17,700 movies. Test data: most recent ratings.

What do you think is done with missing values?
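Typically the missing entries are simply excluded from the loss, so the factorization is fit only to observed ratings. A toy sketch of that idea (hypothetical, not the Netflix Prize code), using gradient descent over the observed entries of a masked matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, R = 30, 20, 2

# Ground-truth low-rank "taste" structure; only ~half the entries are observed
U_true = rng.normal(size=(n_users, R))
V_true = rng.normal(size=(n_movies, R))
ratings = U_true @ V_true.T
mask = rng.uniform(size=ratings.shape) < 0.5

# Gradient descent on squared error over the OBSERVED entries only
U = rng.normal(scale=0.1, size=(n_users, R))
V = rng.normal(scale=0.1, size=(n_movies, R))
lr = 0.01
for _ in range(5000):
    err = mask * (U @ V.T - ratings)   # residuals on observed cells only
    U_grad = err @ V
    V_grad = err.T @ U
    U -= lr * U_grad
    V -= lr * V_grad

# The fit generalizes to the held-out (unobserved) entries
test_rmse = np.sqrt(np.mean((U @ V.T - ratings)[~mask] ** 2))
print(test_rmse)
```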

27 / 36

NMF application: Netflix

More generally, additional baseline predictors include:
- A factor that allows Alice's rating to (linearly) depend on the (square root of the) number of days since her first rating. (For example, have you ever noticed that you become a harsher critic over time?)
- A factor that allows Alice's rating to depend on the number of days since the movie's first rating by anyone. (If you're one of the first people to watch it, maybe it's because you're a huge fan and really excited to see it on DVD, so you'll tend to rate it higher.)
- A factor that allows Alice's rating to depend on the number of people who have rated Inception. (Maybe Alice is a hipster who hates being part of the crowd.)
- A factor that allows Alice's rating to depend on the movie's overall rating.
- (Plus a bunch of others.)
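A minimal sketch of the simplest such baselines (global mean plus per-user and per-movie offsets; synthetic data, illustrative only) shows how much error the biases alone can remove before any factorization happens:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies = 100, 40

# Synthetic ratings driven mostly by per-user and per-movie biases
mu = 3.5
b_user = rng.normal(scale=0.5, size=n_users)
b_movie = rng.normal(scale=0.5, size=n_movies)
ratings = (mu + b_user[:, None] + b_movie[None, :]
           + rng.normal(scale=0.2, size=(n_users, n_movies)))

# Baseline predictor: global mean plus estimated user and movie offsets
mu_hat = ratings.mean()
b_user_hat = ratings.mean(axis=1) - mu_hat
b_movie_hat = ratings.mean(axis=0) - mu_hat
pred = mu_hat + b_user_hat[:, None] + b_movie_hat[None, :]

rmse_baseline = np.sqrt(np.mean((ratings - pred) ** 2))
rmse_global = np.sqrt(np.mean((ratings - mu_hat) ** 2))
print(rmse_baseline, rmse_global)  # biases alone remove most of the error
```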

28 / 36

NMF application: Netflix

And, in fact, modeling these biases turned out to be fairly important: in their paper describing their final solution to the Netflix Prize, Bell and Koren write that:

Of the numerous new algorithmic contributions, I would like to highlight one – those humble baseline predictors (or biases), which capture main effects in the data. While the literature mostly concentrates on the more sophisticated algorithmic aspects, we have learned that an accurate treatment of main effects is probably at least as significant as coming up with modeling breakthroughs.

29 / 36

NMF application: Netflix

30 / 36

NMF application: Mutational Processes in Cancer

Figure: Helleday et al, Nat Rev Gen, 2014

31 / 36

NMF application: Mutational Processes in Cancer

Figure: Helleday et al, Nat Rev Gen, 2014

32 / 36

NMF application: Mutational Processes in Cancer

Figure: Alexandrov et al, Cell Rep, 2013

33 / 36

NMF application: Mutational Processes in Cancer

Figure: Alexandrov et al, Cell Rep, 2013

34 / 36

Practical Notes - NMF

- Implemented within sklearn.decomposition.NMF
  - n_components: number of components
  - init: how to initialize the search
  - solver: 'cd' for coordinate descent, or 'mu' for multiplicative update
  - l1_ratio: can regularize fit
- Provides:
  - NMF.components_: components x features matrix
  - Returns transformed data through NMF.fit_transform()
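Putting those pieces together, a short usage sketch (synthetic data; parameter choices are illustrative, not recommendations):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.uniform(size=(40, 3)) @ rng.uniform(size=(3, 15))

nmf = NMF(n_components=3, init='nndsvd', solver='cd',
          l1_ratio=0.0, max_iter=1000, random_state=0)
W = nmf.fit_transform(X)   # transformed data: samples x components
H = nmf.components_        # components x features matrix

print(W.shape, H.shape)
print(nmf.reconstruction_err_)  # Frobenius-norm reconstruction error
```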

35 / 36

Summary

PCA
- Preserves the covariation within a dataset
- Therefore mostly preserves axes of maximal variation
- Number of components can vary; in practice more than 2 or 3 are rarely helpful

NMF
- Explains the dataset through factoring into two non-negative matrices
- Much more stable and well-specified reconstruction when the assumptions are appropriate
- Excellent for separating out additive factors

36 / 36

Closing

As always, selection of the appropriate method depends upon the question being asked.