
LECTURE 08: DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

Transcript
Page 1: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8443 – Pattern RecognitionECE 8527 – Introduction to Machine Learning and Pattern Recognition

LECTURE 08: DIMENSIONALITY,PRINCIPAL COMPONENTS ANALYSIS

• Objectives:Data ConsiderationsComputational ComplexityOverfittingPrincipal Components AnalysisFisher Linear Discriminant AnalysisMultiple Discriminant AnalysisExamples

• Resources:J.S.: DimensionalityC.A.: DimensionalityS.S.: PCA and Factor AnalysisJava PR AppletW.P.: FisherDTREG: LDAS.S.: DFA

Page 2: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 2

Probability of Error

• Feature vectors typically have dimensions greater than 50.

• Classification accuracy depends upon the dimensionality and the amount of training data.

• Consider the case of two classes, multivariate normal with the same covariance. The Bayes error is:

  P(error) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2} \, du, \qquad \lim_{r \to \infty} P(error) = 0

  where r is the Mahalanobis distance between the class means:

  r^2 = (\mu_1 - \mu_2)^t \Sigma^{-1} (\mu_1 - \mu_2)

• If the features are independent, then \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2) and:

  r^2 = \sum_{i=1}^{d} \left( \frac{\mu_{i1} - \mu_{i2}}{\sigma_i} \right)^2
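A minimal numerical sketch of these formulas (not from the original slides), assuming equal priors; the function name bayes_error_equal_cov and the example values are illustrative:

```python
# Bayes error for two equal-covariance Gaussians:
# P(error) = (1/sqrt(2*pi)) * integral from r/2 to infinity of exp(-u^2/2) du.
import numpy as np
from scipy.stats import norm

def bayes_error_equal_cov(mu1, mu2, sigma):
    """Return the Mahalanobis separation r and the corresponding Bayes error."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    r2 = diff @ np.linalg.inv(sigma) @ diff      # r^2 = (mu1-mu2)^t Sigma^-1 (mu1-mu2)
    r = np.sqrt(r2)
    return r, norm.sf(r / 2.0)                   # upper-tail integral of the standard normal

# Independent features: r^2 reduces to sum(((mu_i1 - mu_i2) / sigma_i)^2).
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
sigma = np.diag([1.0, 4.0])
r, p_err = bayes_error_equal_cov(mu1, mu2, sigma)
print(f"r = {r:.3f}, P(error) = {p_err:.4f}")
```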

Page 3: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 3

Dimensionality and Training Data Size

• The most useful features are the ones for which the difference between the means is large relative to the standard deviation.

• Too many features can lead to a decrease in performance.

• Fusing of different types of information, referred to as feature fusion, is a good application for Principal Components Analysis (PCA).

• Increasing the feature vector dimension can significantly increase the memory (e.g., the number of elements in the covariance matrix grows as the square of the dimension of the feature vector) and computational complexity.

• Good rule of thumb: 10 independent data samples for every parameter to be estimated (see the sketch below).

• For practical systems, such as speech recognition, even this simple rule can result in a need for vast amounts of data.
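A rough illustration of how quickly this rule grows with feature dimension for a two-class Gaussian classifier with full covariances; this sketch is not from the lecture, and gaussian_param_count is a hypothetical helper:

```python
# Count free parameters of a Gaussian classifier and apply the
# "10 independent samples per parameter" rule of thumb.
def gaussian_param_count(d, n_classes, full_covariance=True):
    """d-dimensional mean plus a covariance per class (d(d+1)/2 if full, d if diagonal)."""
    cov_params = d * (d + 1) // 2 if full_covariance else d
    return n_classes * (d + cov_params)

for d in (10, 50, 100):
    params = gaussian_param_count(d, n_classes=2)
    print(f"d={d:4d}: {params:6d} parameters -> ~{10 * params:7d} training samples")
```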

Page 4: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 4

Computational Complexity

• “Big Oh” notation used to describe complexity:

  if f(x) = 2 + 3x + 4x², then f(x) has computational complexity O(x²)

• Recall the discriminant function for the multivariate normal case:

  g(x) = -\frac{1}{2}(x - \hat{\mu})^t \hat{\Sigma}^{-1}(x - \hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)

  The per-term estimation costs are O(nd) (sample mean), O(nd²) (sample covariance and its inverse), O(1), O(nd²), and O(n) (prior); see the sketch at the end of this slide.

• Watch those constants of proportionality (e.g., O(nd²)).

• If the number of data samples is inadequate, we can experience overfitting (which implies poor generalization).

• Hence, later in the course, we will study ways to control generalization and to smooth estimates of key parameters such as the mean and covariance (see textbook).

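The complexity split above can be made concrete with a small sketch (illustrative code, not from the course; the function names are mine): the covariance-dependent quantities are computed once offline, and each test vector then costs O(d²):

```python
# Offline training vs. O(d^2) online evaluation of the Gaussian log-discriminant.
import numpy as np

def train_gaussian(X, prior):
    """Offline: estimate mu (O(nd)) and Sigma (O(nd^2)); cache its inverse and log-determinant."""
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    sigma_inv = np.linalg.inv(sigma)
    _, logdet = np.linalg.slogdet(sigma)
    return mu, sigma_inv, logdet, np.log(prior)

def g(x, mu, sigma_inv, logdet, log_prior):
    """Online: g(x) = -1/2 (x-mu)^t Sigma^-1 (x-mu) - d/2 ln(2 pi) - 1/2 ln|Sigma| + ln P(omega)."""
    diff = x - mu
    return (-0.5 * diff @ sigma_inv @ diff
            - 0.5 * mu.size * np.log(2 * np.pi)
            - 0.5 * logdet
            + log_prior)
```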

Page 5: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 5

Overfitting

• It is common that the number of available samples is inadequate to train a complex classifier. Alternatives:

Reduce the number of parameters (e.g., assume diagonal covariances)

Assume all classes have the same covariance (“pooled covariance”)

Better estimate of covariance (e.g., use Bayesian parameter estimate)

Pseudo-Bayesian estimate:

  \hat{\Sigma} = \lambda \Sigma_0 + (1 - \lambda)\hat{\Sigma}

Regularized discriminant analysis (shrinkage):

  \Sigma_i(\alpha) = \frac{(1-\alpha)\, n_i \hat{\Sigma}_i + \alpha\, n \hat{\Sigma}}{(1-\alpha)\, n_i + \alpha\, n}, \quad 0 \le \alpha \le 1, \qquad \text{or} \qquad \Sigma(\beta) = (1-\beta)\hat{\Sigma} + \beta I, \quad 0 \le \beta \le 1
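A small sketch of the two shrinkage estimates above, with α and β as interpolation weights in [0, 1]; the function names are illustrative, not from the lecture:

```python
# Shrink a class covariance toward the pooled covariance, or a covariance toward the identity.
import numpy as np

def shrink_toward_pooled(sigma_i, n_i, sigma_pooled, n, alpha):
    """Sigma_i(alpha) = ((1-alpha) n_i Sigma_i + alpha n Sigma) / ((1-alpha) n_i + alpha n)."""
    return (((1 - alpha) * n_i * sigma_i + alpha * n * sigma_pooled)
            / ((1 - alpha) * n_i + alpha * n))

def shrink_toward_identity(sigma, beta):
    """Sigma(beta) = (1 - beta) Sigma + beta I."""
    return (1 - beta) * sigma + beta * np.eye(sigma.shape[0])
```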

Page 6: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 6

Component Analysis

• Previously introduced as a “whitening transformation”.

• Component analysis is a technique that combines features to reduce the dimension of the feature space.

• Linear combinations are simple to compute and tractable.

• Project a high dimensional space onto a lower dimensional space.

• Three classical approaches for finding the optimal transformation:

Principal Components Analysis (PCA): projection that best represents the data in a least-square sense.

Multiple Discriminant Analysis (MDA): projection that best separates the data in a least-squares sense.

Independent Component Analysis (ICA): projection that minimizes the mutual information of the components.

Page 7: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 7

Principal Component Analysis

• Consider representing a set of n d-dimensional samples x1,…,xn by a single vector, x0.

• Define a squared-error criterion:

  J_0(x_0) = \sum_{k=1}^{n} \| x_0 - x_k \|^2

• It is easy to show that the solution to this problem is given by:

  x_0 = m = \frac{1}{n} \sum_{k=1}^{n} x_k

• The sample mean is a zero-dimensional representation of the data set.

• Consider a one-dimensional solution in which we project the data onto a line running through the sample mean:

  x = m + a e

  where e is a unit vector in the direction of this line, and a is a scalar representing the distance of any point from the mean.

• We can write the squared-error criterion as:

  J_1(a_1, \ldots, a_n, e) = \sum_{k=1}^{n} \| (m + a_k e) - x_k \|^2

Page 8: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 8

Minimizing Squared Error

• Note that \|e\| = 1 (the norm of the unit vector is 1), so the criterion expands as:

  J_1(a_1, \ldots, a_n, e) = \sum_{k=1}^{n} a_k^2 \|e\|^2 - 2\sum_{k=1}^{n} a_k e^t (x_k - m) + \sum_{k=1}^{n} \| x_k - m \|^2

• Differentiate with respect to a_k and obtain:

  a_k = e^t (x_k - m)

• The geometric interpretation is that we obtain a least-squares solution by projecting the vector, x, onto a line in the direction of e that passes through the sample mean.

• But what is the best direction for e?
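A tiny numerical check (illustrative, not part of the slides) that a_k = e^t(x_k - m) is indeed the least-squares choice for a fixed unit vector e:

```python
# Perturbing the optimal coefficients a_k can only increase the squared error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                  # five 3-dimensional samples
m = X.mean(axis=0)
e = np.array([1.0, 0.0, 0.0])                # any unit vector
a = (X - m) @ e                              # a_k = e^t (x_k - m)
best = np.sum(((m + np.outer(a, e)) - X) ** 2)
worse = np.sum(((m + np.outer(a + 0.1, e)) - X) ** 2)
assert best <= worse
print(best, worse)
```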

Page 9: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 9

Scatter Matrix

• Define a scatter matrix, S:

  S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^t

  This should look familiar; it is (n-1) times the sample covariance matrix.

• If we substitute our solution for a_k into our expression for the squared error:

  J_1(e) = \sum_{k=1}^{n} a_k^2 - 2\sum_{k=1}^{n} a_k^2 + \sum_{k=1}^{n} \| x_k - m \|^2
         = -\sum_{k=1}^{n} \left[ e^t (x_k - m) \right]^2 + \sum_{k=1}^{n} \| x_k - m \|^2
         = -\sum_{k=1}^{n} e^t (x_k - m)(x_k - m)^t e + \sum_{k=1}^{n} \| x_k - m \|^2
         = -e^t S e + \sum_{k=1}^{n} \| x_k - m \|^2

Page 10: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 10

Minimization Using Lagrange Multipliers

• The vector, e, that minimizes J_1 also maximizes e^t S e.

• Use Lagrange multipliers to maximize e^t S e subject to the constraint \|e\| = 1.

• Let \lambda be the undetermined multiplier, and differentiate:

  u = e^t S e - \lambda (e^t e - 1)

  with respect to e, to obtain:

  \frac{\partial u}{\partial e} = 2 S e - 2 \lambda e

• Set to zero and solve:

  S e = \lambda e

• It follows that to maximize e^t S e we want to select an eigenvector corresponding to the largest eigenvalue of the scatter matrix.

• In other words, the best one-dimensional projection of the data (in the least mean-squared error sense) is the projection of the data onto a line through the sample mean in the direction of the eigenvector of the scatter matrix having the largest eigenvalue (hence the name Principal Component).

• For the Gaussian case, the eigenvectors are the principal axes of the hyperellipsoidally shaped support region!

• Let’s work some examples (class-independent and class-dependent PCA).
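A compact class-independent PCA sketch along the lines described above (illustrative code, not distributed with the course): build the scatter matrix, take the eigenvector with the largest eigenvalue, and project the data onto that direction:

```python
# Project data onto the principal eigenvector of the scatter matrix.
import numpy as np

def pca_first_component(X):
    """Return the sample mean, the principal direction e, and the coefficients a_k."""
    m = X.mean(axis=0)
    Xc = X - m
    S = Xc.T @ Xc                              # scatter matrix: sum_k (x_k - m)(x_k - m)^t
    eigvals, eigvecs = np.linalg.eigh(S)       # S is symmetric, so eigh applies
    e = eigvecs[:, np.argmax(eigvals)]         # eigenvector with the largest eigenvalue
    a = Xc @ e                                 # a_k = e^t (x_k - m)
    return m, e, a

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.5], [1.5, 1.0]], size=200)
m, e, a = pca_first_component(X)
X_hat = m + np.outer(a, e)                     # one-dimensional reconstruction x_k ~ m + a_k e
print("mean squared reconstruction error:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```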

Page 11: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 11

Discriminant Analysis

• Discriminant analysis seeks directions that are efficient for discrimination.

• Consider the problem of projecting data from d dimensions onto a line with the hope that we can optimize the orientation of the line to minimize error.

• Consider a set of n d-dimensional samples x_1, \ldots, x_n: n_1 in the subset D_1 labeled \omega_1 and n_2 in the subset D_2 labeled \omega_2.

• Define a linear combination of x:

  y = w^t x

  and a corresponding set of n samples y_1, \ldots, y_n divided into Y_1 and Y_2.

• Our challenge is to find w that maximizes separation.

• This can be done by considering the ratio of the between-class scatter to the within-class scatter.

Page 12: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 12

Separation of the Means and Scatter

• Define a sample mean for class i:

  m_i = \frac{1}{n_i} \sum_{x \in D_i} x

• The sample mean for the projected points is:

  \tilde{m}_i = \frac{1}{n_i} \sum_{y \in Y_i} y = \frac{1}{n_i} \sum_{x \in D_i} w^t x = w^t m_i

  The sample mean for the projected points is just the projection of the mean (which is expected since this is a linear transformation).

• It follows that the distance between the projected means is:

  |\tilde{m}_1 - \tilde{m}_2| = |w^t (m_1 - m_2)|

• Define a scatter for the projected samples:

  \tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2

• An estimate of the variance of the pooled data is:

  \frac{1}{n} (\tilde{s}_1^2 + \tilde{s}_2^2)

  and (\tilde{s}_1^2 + \tilde{s}_2^2) is called the within-class scatter.

Page 13: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 13

Fisher Linear Discriminant and Scatter

• The Fisher linear discriminant maximizes the criterion:

  J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}

• Define a scatter for class i, S_i:

  S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t

• The total within-class scatter, S_W, is:

  S_W = S_1 + S_2

• We can write the scatter for the projected samples as:

  \tilde{s}_i^2 = \sum_{x \in D_i} (w^t x - w^t m_i)^2 = \sum_{x \in D_i} w^t (x - m_i)(x - m_i)^t w = w^t S_i w

• Therefore, the sum of the scatters can be written as:

  \tilde{s}_1^2 + \tilde{s}_2^2 = w^t S_W w

Page 14: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 14

Separation of the Projected Means

• The separation of the projected means obeys:

  (\tilde{m}_1 - \tilde{m}_2)^2 = (w^t m_1 - w^t m_2)^2 = w^t (m_1 - m_2)(m_1 - m_2)^t w = w^t S_B w

  where the between-class scatter, S_B, is given by:

  S_B = (m_1 - m_2)(m_1 - m_2)^t

• S_W is the within-class scatter and is proportional to the covariance of the pooled data.

• S_B, the between-class scatter, is symmetric and positive semidefinite, but because it is the outer product of two vectors, its rank is at most one.

• This implies that for any w, S_B w is in the direction of m_1 - m_2.

• The criterion function, J(w), can be written as:

  J(w) = \frac{w^t S_B w}{w^t S_W w}

Page 15: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 15

Linear Discriminant Analysis

• This ratio is well known as the generalized Rayleigh quotient and has the property that the vector, w, that maximizes J(w) must satisfy:

  S_B w = \lambda S_W w

• The solution is:

  w = S_W^{-1} (m_1 - m_2)

• This is Fisher’s linear discriminant, also known as the canonical variate.

• This solution maps the d-dimensional problem to a one-dimensional problem (in this case).

• From Chapter 2, when the conditional densities, p(x|\omega_i), are multivariate normal with equal covariances, the optimal decision boundary is given by:

  w^t x + w_0 = 0, \qquad \text{where} \quad w = \Sigma^{-1} (\mu_1 - \mu_2)

  and w_0 is related to the prior probabilities.

• The computational complexity is dominated by the calculation of the within-class scatter and its inverse, an O(d²n) calculation. But this is done offline!

• Let’s work some examples (class-independent PCA and LDA).
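A minimal sketch of the two-class Fisher solution above, assuming NumPy and two data matrices X1 and X2 whose rows are samples (illustrative, not the course's reference code):

```python
# Two-class Fisher linear discriminant: w = S_W^-1 (m1 - m2).
import numpy as np

def fisher_discriminant(X1, X2):
    """Return the projection direction w and the two projected class means."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)               # class scatter matrices
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2                               # within-class scatter
    w = np.linalg.solve(Sw, m1 - m2)           # solves S_W w = (m1 - m2)
    return w, w @ m1, w @ m2

rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=100)
X2 = rng.multivariate_normal([2.0, 1.0], np.eye(2), size=100)
w, m1_tilde, m2_tilde = fisher_discriminant(X1, X2)
print("w =", w, "projected means:", m1_tilde, m2_tilde)
```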

Page 16: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 16

Multiple Discriminant Analysis

• For the c-class problem in a d-dimensional space, the natural generalization involves c-1 discriminant functions.

• The within-class scatter is defined as:

  S_W = \sum_{i=1}^{c} S_i = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i)(x - m_i)^t

• Define a total mean vector, m:

  m = \frac{1}{n} \sum_{x} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i

  and a total scatter matrix, S_T, by:

  S_T = \sum_{x} (x - m)(x - m)^t

• The total scatter is related to the within-class scatter (derivation omitted):

  S_T = S_W + S_B, \qquad \text{where} \quad S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t

• We have c-1 discriminant functions of the form:

  y_i = w_i^t x, \quad i = 1, 2, \ldots, c-1, \qquad \text{or} \quad y = W^t x

Page 17: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 17

Multiple Discriminant Analysis (Cont.)

• The criterion function is:

  J(W) = \frac{|W^t S_B W|}{|W^t S_W W|}

• The solution to maximizing J(W) is once again found via an eigenvalue decomposition:

  |S_B - \lambda_i S_W| = 0 \quad \text{and} \quad (S_B - \lambda_i S_W) w_i = 0

• Because S_B is the sum of c matrices of rank one or less, and because only c-1 of these are independent, S_B is of rank c-1 or less.

• An excellent presentation on applications of LDA can be found at PCA Fails!
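A sketch of the multi-class solution above using SciPy's generalized symmetric eigensolver; the assumptions here (an integer label vector, keeping the top c-1 eigenvectors, the function name mda) are mine, not the lecture's:

```python
# Multiple discriminant analysis: solve S_B w = lambda S_W w and keep the leading eigenvectors.
import numpy as np
from scipy.linalg import eigh

def mda(X, labels, n_components=None):
    classes = np.unique(labels)
    d = X.shape[1]
    m = X.mean(axis=0)                            # total mean
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xi = X[labels == c]
        mi = Xi.mean(axis=0)
        Sw += (Xi - mi).T @ (Xi - mi)             # within-class scatter
        Sb += len(Xi) * np.outer(mi - m, mi - m)  # between-class scatter
    # Generalized symmetric eigenproblem S_B w = lambda S_W w; eigh returns ascending eigenvalues.
    eigvals, eigvecs = eigh(Sb, Sw)
    order = np.argsort(eigvals)[::-1]
    k = n_components if n_components is not None else len(classes) - 1
    return eigvecs[:, order[:k]]                  # columns are the discriminant directions w_i

# Usage: W = mda(X, labels); Y = X @ W projects onto at most c-1 discriminant directions.
```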

Page 18: LECTURE  08:  DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS

ECE 8527: Lecture 08, Slide 18

Summary

• The “curse of dimensionality.”
• Dimensionality and training data size.
• Overfitting can be avoided by using weighted combinations of the pooled covariances and individual covariances.
• Types of component analysis.
• Principal component analysis: represents the data by minimizing the squared error (representing data in directions of greatest variance).
• Example of class-independent and class-dependent analysis.
• Insight into the important dimensions of your problem.
• Introduced Component Analysis.
• Defined a criterion that maximizes discrimination.
• Derived the solution to the two-class problem.
• Generalized this solution to c classes.
• Compared LDA and PCA on some interesting data sets.

