Regularized Discriminant Analysis and Reduced-Rank LDA
Jia Li
Department of Statistics
The Pennsylvania State University
Email: [email protected]
http://www.stat.psu.edu/∼jiali
Regularized Discriminant Analysis
- A compromise between LDA and QDA.
- Shrink the separate covariances of QDA toward a common covariance, as in LDA.
- Regularized covariance matrices:

  $$\Sigma_k(\alpha) = \alpha\,\Sigma_k + (1-\alpha)\,\Sigma\,.$$

- The quadratic discriminant function δ_k(x) is defined using the shrunken covariance matrices Σ_k(α).
- The parameter α ∈ [0, 1] controls the complexity of the model: α = 1 gives QDA and α = 0 gives LDA. A sketch of the shrinkage step is given below.
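A minimal NumPy sketch of the shrinkage step. The function name `regularized_covariances` and the use of class-proportion weights for the pooled covariance are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def regularized_covariances(X, y, alpha):
    """Shrink each per-class covariance toward the pooled covariance:
    Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma."""
    classes = np.unique(y)
    n = len(y)
    # Per-class sample covariances (the QDA ingredient).
    covs = {k: np.cov(X[y == k], rowvar=False) for k in classes}
    # Common covariance, here weighted by class proportions (the LDA ingredient).
    pooled = sum((np.sum(y == k) / n) * covs[k] for k in classes)
    return {k: alpha * covs[k] + (1 - alpha) * pooled for k in classes}
```

With `alpha = 0` every class uses the common covariance (LDA); with `alpha = 1` each class keeps its own covariance (QDA).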
Computations for LDA
- Discriminant function:

  $$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log\pi_k\,.$$
- Eigen-decomposition of Σ_k: Σ_k = U_k D_k U_k^T, where D_k is diagonal with elements d_{kl}, l = 1, 2, ..., p, and U_k is p × p orthonormal.
- We then have

  $$(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) = [U_k^T(x-\mu_k)]^T D_k^{-1}[U_k^T(x-\mu_k)] = [D_k^{-1/2}U_k^T(x-\mu_k)]^T[D_k^{-1/2}U_k^T(x-\mu_k)]\,.$$

- $\log|\Sigma_k| = \sum_l \log d_{kl}$.
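These identities translate directly into code. A sketch, assuming NumPy; the helper name `discriminant_score` is hypothetical:

```python
import numpy as np

def discriminant_score(x, mu_k, Sigma_k, pi_k):
    """delta_k(x) computed via the eigen-decomposition Sigma_k = U_k D_k U_k^T."""
    d, U = np.linalg.eigh(Sigma_k)        # eigenvalues d_kl and eigenvectors U_k
    z = (U.T @ (x - mu_k)) / np.sqrt(d)   # D_k^{-1/2} U_k^T (x - mu_k)
    # log|Sigma_k| = sum_l log d_kl, and the quadratic form is z^T z.
    return -0.5 * np.sum(np.log(d)) - 0.5 * z @ z + np.log(pi_k)
```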
- LDA, with Σ = UDU^T:
  - Sphere the data: D^{-1/2}U^T X → X* and D^{-1/2}U^T µ_k → µ_k*.
  - For the transformed data and class centroids, classify x* to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities π_k (see the sketch below).
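A minimal sketch of this rule, assuming the rows of `X` are observations; the helper name `lda_predict` is illustrative:

```python
import numpy as np

def lda_predict(X, centroids, Sigma, priors):
    """Sphere with Sigma = U D U^T, then assign each point to the closest
    sphered centroid after adjusting for the class priors.
    Returns class indices 0, ..., K-1."""
    d, U = np.linalg.eigh(Sigma)
    S = U / np.sqrt(d)                    # right-multiplying by S applies D^{-1/2} U^T
    Xs, Ms = X @ S, centroids @ S         # sphered data X* and centroids mu_k*
    # Squared distance to each centroid, with the prior correction -2 log pi_k.
    d2 = ((Xs[:, None, :] - Ms[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2 - 2.0 * np.log(priors)[None, :], axis=1)
```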
The geometric illustration of LDA. Left: original data in the two classes; the ellipses represent the two estimated covariance matrices. Right: the class-mean-removed data and the estimated common covariance matrix.
The geometric illustration of LDA. Left: the sphered, mean-removed data. Right: the sphered data in the two classes, the sphered means, and the decision boundary.
Reduced-Rank LDA
Binary classification
- The decision boundary is given by the following linear equation:

  $$\log\frac{\pi_1}{\pi_2} - \frac{1}{2}(\mu_1+\mu_2)^T\Sigma^{-1}(\mu_1-\mu_2) + x^T\Sigma^{-1}(\mu_1-\mu_2) = 0\,.$$

- Only the projection of X on the direction Σ^{-1}(µ_1 − µ_2) matters.
- If the data are sphered, only the projection of X* on µ_1* − µ_2* is needed (a code sketch follows).
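A small sketch of this reduction; the names `w` (for the direction) and `c` (for the threshold) are chosen here for illustration:

```python
import numpy as np

def binary_lda_direction(mu1, mu2, Sigma, pi1, pi2):
    """For two classes, only x^T w matters, with w = Sigma^{-1}(mu1 - mu2)."""
    w = np.linalg.solve(Sigma, mu1 - mu2)            # Sigma^{-1}(mu1 - mu2)
    c = 0.5 * (mu1 + mu2) @ w - np.log(pi1 / pi2)    # boundary: x^T w = c
    return w, c                                      # classify to class 1 iff x @ w > c
```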
- Suppose the data are sphered.
- The subspace spanned by the K centroids has rank at most K − 1; denote it by H_{K−1}.
- The data can be viewed in H_{K−1} without losing any information.
- When K > 3, we might want to find a subspace H_L ⊆ H_{K−1} that is optimal for LDA in some sense.
Optimization Criterion
- Fisher's optimization criterion: the projected centroids should be spread out as much as possible relative to the variance.
- Find the linear combination Z = a^T X such that the between-class variance is maximized relative to the within-class variance, where a = (a_1, a_2, ..., a_p)^T.
- Assume the within-class covariance matrix of X is W, i.e., the common covariance matrix of the classes.
- The between-class covariance matrix is B. Suppose µ_k is a column vector denoting the mean vector of class k:

  $$\mu = \sum_{k=1}^K \pi_k\,\mu_k\,, \qquad B = \sum_{k=1}^K \pi_k(\mu_k-\mu)(\mu_k-\mu)^T\,.$$

  Note that π_k is the proportion of class-k samples in the entire data set.
- For the linear combination Z, the between-class variance is a^T B a and the within-class variance is a^T W a.
- Fisher's optimization becomes

  $$\max_a \frac{a^T B a}{a^T W a}\,.$$
- Eigen-decomposition of W: W = V_W D_W V_W^T.
- W = (W^{1/2})^T W^{1/2}, where W^{1/2} = D_W^{1/2} V_W^T.
- Define b = W^{1/2} a, so that a = W^{-1/2} b. The optimization becomes

  $$\max_b \frac{b^T (W^{-1/2})^T B\, W^{-1/2}\, b}{b^T b}\,.$$

- Define B* = (W^{-1/2})^T B W^{-1/2}.
- Eigen-decomposition of B*: B* = V* D_B V*^T, where V* = (v_1*, v_2*, ..., v_p*).
- The maximization is achieved by b = v_1*, the first eigenvector of B*.
- Similarly, one can find the next direction b_2 = v_2*, which is orthogonal to b_1 = v_1* and maximizes b_2^T B* b_2 / b_2^T b_2.
- Since a = W^{-1/2} b, converting back to the original problem gives a_l = W^{-1/2} v_l*.
- The a_l (also denoted v_l in the textbook) are referred to as discriminant coordinates or canonical variates.
- Summary of obtaining the discriminant coordinates (a code sketch follows this list):
  - Find the centroids of all the classes.
  - Find the between-class covariance matrix B using the centroid vectors.
  - Find the within-class covariance matrix W, i.e., Σ.
  - By eigen-decomposition,

    $$W = (W^{1/2})^T W^{1/2} = (D_W^{1/2} V_W^T)^T D_W^{1/2} V_W^T\,.$$

  - Compute

    $$B^* = (W^{-1/2})^T B\, W^{-1/2} = D_W^{-1/2} V_W^T B\, V_W D_W^{-1/2}\,.$$

  - Eigen-decomposition of B*: B* = V* D_B V*^T.
  - The discriminant coordinates are a_l = W^{-1/2} v_l*.
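The steps above as a NumPy sketch. The function name and the input conventions (centroids as rows of `mu`, class proportions in `priors`) are assumptions made for illustration:

```python
import numpy as np

def discriminant_coordinates(mu, priors, W):
    """Discriminant coordinates a_l = W^{-1/2} v_l*, given the class
    centroids mu (K x p), class proportions priors (length K), and the
    within-class covariance W (p x p)."""
    mu_bar = priors @ mu                          # overall mean
    C = mu - mu_bar                               # centered centroids
    B = (C * priors[:, None]).T @ C               # between-class covariance
    dW, VW = np.linalg.eigh(W)                    # W = V_W D_W V_W^T
    W_inv_half = VW / np.sqrt(dW)                 # V_W D_W^{-1/2}, i.e. (W^{1/2})^{-1}
    B_star = W_inv_half.T @ B @ W_inv_half        # D_W^{-1/2} V_W^T B V_W D_W^{-1/2}
    dB, V_star = np.linalg.eigh(B_star)           # eigenvectors v_l*
    order = np.argsort(dB)[::-1]                  # largest eigenvalue first
    return W_inv_half @ V_star[:, order]          # columns are the a_l
```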
Simulation
- Three classes with equal prior probabilities 1/3.
- The input is two dimensional.
- The class-conditional density of X is a normal distribution.
- The common covariance matrix is

  $$\Sigma = \begin{pmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{pmatrix}.$$

- The three mean vectors are

  $$\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} -3 \\ 2 \end{pmatrix}, \quad \mu_3 = \begin{pmatrix} -1 \\ -3 \end{pmatrix}.$$

- A total of 450 samples is drawn, with 150 in each class, for training.
- Another set of 450 samples, 150 per class, is drawn for testing.
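This setup is easy to reproduce. A sketch (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)                   # arbitrary seed
mus = np.array([[0.0, 0.0], [-3.0, 2.0], [-1.0, -3.0]])
Sigma = np.eye(2)                                # common covariance

def draw_set(rng, n_per_class=150):
    X = np.vstack([rng.multivariate_normal(m, Sigma, n_per_class) for m in mus])
    y = np.repeat([1, 2, 3], n_per_class)
    return X, y

X_train, y_train = draw_set(rng)                 # 450 training samples
X_test, y_test = draw_set(rng)                   # 450 test samples
```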
The scatter plot of the test data. Red: class 1. Blue: class 2. Magenta: class 3.
LDA Result
- Priors: π_1 = π_2 = π_3 = 150/450 = 0.3333.
- The three mean vectors are

  $$\mu_1 = \begin{pmatrix} -0.0757 \\ -0.0034 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} -2.8310 \\ 1.9847 \end{pmatrix}, \quad \mu_3 = \begin{pmatrix} -0.9992 \\ -2.9005 \end{pmatrix}.$$

- Estimated covariance matrix:

  $$\Sigma = \begin{pmatrix} 0.9967 & 0.0020 \\ 0.0020 & 1.0263 \end{pmatrix}.$$

- Decision boundaries:
  - Between class 1 (red) and class 2 (blue): 5.9480 + 2.7684 X_1 − 1.9427 X_2 = 0.
  - Between class 1 (red) and class 3 (magenta): 4.5912 + 0.9209 X_1 + 2.8211 X_2 = 0.
  - Between class 2 (blue) and class 3 (magenta): −1.3568 − 1.8475 X_1 + 4.7639 X_2 = 0.
Classification error rate on the test data set: 7.78%.
Discriminant Coordinates
- Between-class covariance matrix:

  $$B = \begin{pmatrix} 1.3111 & -1.3057 \\ -1.3057 & 4.0235 \end{pmatrix}.$$

- Within-class covariance matrix:

  $$W = \begin{pmatrix} 0.9967 & 0.0020 \\ 0.0020 & 1.0263 \end{pmatrix}.$$

- $$W^{1/2} = \begin{pmatrix} -0.0686 & -1.0108 \\ 0.9960 & -0.0676 \end{pmatrix}.$$

- $$B^* = (W^{-1/2})^T B\, W^{-1/2} = \begin{pmatrix} 3.7361 & 1.4603 \\ 1.4603 & 1.5050 \end{pmatrix}.$$
- Eigen-decomposition of B*: B* = V* D_B V*^T, with

  $$V^* = \begin{pmatrix} 0.8964 & 0.4432 \\ 0.4432 & -0.8964 \end{pmatrix}, \quad D_B = \begin{pmatrix} 4.4582 & 0 \\ 0 & 0.7830 \end{pmatrix}.$$
- The two discriminant coordinates are

  $$v_1 = W^{-1/2} v_1^* = \begin{pmatrix} -0.0668 & 0.9994 \\ -0.9848 & -0.0678 \end{pmatrix}\begin{pmatrix} 0.8964 \\ 0.4432 \end{pmatrix} = \begin{pmatrix} 0.3831 \\ -0.9128 \end{pmatrix},$$

  $$v_2 = W^{-1/2} v_2^* = \begin{pmatrix} -0.9255 \\ -0.3757 \end{pmatrix}.$$

- Project the data onto v_1 and classify using only this 1-D data.
- The projected data are x_i^T v_1, i = 1, ..., N.
Solid line: first DC. Dashed line: second DC.
Projection on the First DC
Projection of the training data on the first discriminant coordinate.
- Perform LDA on the projected data.
- The classification rule is

  $$G(x) = \begin{cases} 1 & -1.4611 \le x^T v_1 \le 1.1195 \\ 2 & x^T v_1 \le -1.4611 \\ 3 & x^T v_1 \ge 1.1195 \end{cases}$$

  (a code sketch follows).
- Error rate on the test data: 12.67%.
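A sketch of this 1-D rule, with the thresholds taken from above:

```python
import numpy as np

v1 = np.array([0.3831, -0.9128])      # first discriminant coordinate

def classify_on_first_dc(X):
    """Project rows of X onto v1 and apply the interval rule; returns 1, 2, or 3."""
    z = X @ v1
    return np.where(z <= -1.4611, 2, np.where(z >= 1.1195, 3, 1))
```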
Principal Component Direction
- Find the covariance matrix of X, or do a singular value decomposition of the mean-removed X, to find the principal component directions.
- Denote the covariance matrix by T:

  $$T = \begin{pmatrix} 2.3062 & -1.3066 \\ -1.3066 & 5.0542 \end{pmatrix}.$$

- Eigen-decomposition of T = V_T D_T V_T^T:

  $$V_T = \begin{pmatrix} 0.3710 & -0.9286 \\ -0.9286 & -0.3710 \end{pmatrix}, \quad D_T = \begin{pmatrix} 5.5762 & 0 \\ 0 & 1.7842 \end{pmatrix}.$$
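A sketch of computing the principal component directions this way:

```python
import numpy as np

def pc_directions(X):
    """Principal component directions via eigen-decomposition of the
    total covariance T of the data."""
    T = np.cov(X, rowvar=False)          # total covariance matrix T
    dT, VT = np.linalg.eigh(T)           # T = V_T D_T V_T^T
    order = np.argsort(dT)[::-1]         # largest variance first
    return VT[:, order], dT[order]
```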
Solid line: first PCD. Dashed line: second PCD.
Results Based on the First PC
Projection of the data on the first PC. The boundaries between classes are shown.
- Perform LDA on the projected data.
- The classification rule is

  $$G(x) = \begin{cases} 1 & -1.4592 \le x^T v_1 \le 1.1489 \\ 2 & x^T v_1 \le -1.4592 \\ 3 & x^T v_1 \ge 1.1489 \end{cases}$$

  where v_1 here denotes the first principal component direction.
- Error rate on the test data: 13.11%.
Comparison
- It is generally true that T = B + W (see the numerical check below).
- For the given example, W ≈ I, and the true within-class covariance matrix is I.
- Ideally, for this example, both the discriminant coordinates and the principal component directions are simply the eigenvectors of B.
- In general, discriminant coordinates and principal component directions are different.
- To compute PC directions, class information is not needed; hence PCs have more flexible applications.
- For classification, DCs tend to be better.
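As a quick sanity check, the matrices reported earlier satisfy T ≈ B + W up to rounding:

```python
import numpy as np

# Matrices as reported in the slides above.
B = np.array([[1.3111, -1.3057], [-1.3057, 4.0235]])
W = np.array([[0.9967, 0.0020], [0.0020, 1.0263]])
T = np.array([[2.3062, -1.3066], [-1.3066, 5.0542]])
print(np.allclose(B + W, T, atol=5e-3))   # True: T = B + W up to rounding
```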
A New Simulation
- Change the common covariance matrix Σ to

  $$\Sigma = \begin{pmatrix} 4.0898 & -0.8121 \\ -0.8121 & 0.5900 \end{pmatrix}.$$

- The scatter plot of the test data set is shown.
LDA Result
The classification boundaries obtained by LDA. The error rate for the test data is 6%.
DCs and PC Directions
The solid line indicates the first DC or PC; the dashed line indicates the second DC or PC. Left: discriminant coordinates. Right: principal component directions.
Projection on 1-D
The LDA results obtained using the data projected onto the first discriminant coordinate and the first principal component direction. Left: projection on the first DC (test-set error rate: 7.78%). Right: projection on the first PCD (test-set error rate: 32.44%).