
Multivariate Analysis preprocessing techniques and PCA

Prof. Dr. Anselmo E de Oliveira

anselmo.quimica.ufg.br

[email protected]

Preprocessing Techniques

• Redundant/Constant Variables

– The data set should be checked for constant and redundant variables before attempting pattern recognition, since pattern recognition relies on the variance in the data.

– Redundant variables are usually detected by examining the correlation matrix for high correlation coefficients.

Preprocessing Techniques

– correlation matrix

$$\mathbf{C} = \frac{1}{N-1}\,\mathbf{X}^{\mathsf{T}}\mathbf{X}$$

where $\mathbf{X}$ is autoscaled

• Ex:

$$\mathbf{C} = \begin{pmatrix} 1 & 0.05695 & -0.39479 \\ 0.05695 & 1 & 0.77886 \\ -0.39479 & 0.77886 & 1 \end{pmatrix}$$
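A minimal Matlab sketch of this computation (the 4×3 matrix x0 is hypothetical, invented only to make the lines runnable):

> x0 = [0.61 1.03 2.1; 0.54 0.96 1.9; 0.21 0.51 2.5; 0.78 1.38 1.7];  % hypothetical data, samples in rows
> N  = size(x0,1);                                               % number of samples
> xa = (x0 - repmat(mean(x0),[N,1])) ./ repmat(std(x0),[N,1]);   % autoscale
> C  = (1/(N-1))*(xa'*xa);                                       % correlation matrix
> corrcoef(x0)                                                   % same result, built in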

Preprocessing Techniques

• Translation

– The purpose of translation is to change the position of the data set with respect to the axes

– Mean-centering

Preprocessing Techniques

– Mean-centering

• $x_{ik}$, the kth measurement of the ith sample:

$$x'_{ik} = x_{ik} - \bar{x}_k, \qquad \bar{x}_k = \frac{1}{NP}\sum_{i=1}^{NP} x_{ik}$$

where NP = total number of samples.

Preprocessing Techniques

– Mean-centering

• 𝐗 → 𝐗′

• 𝐗′ variables are now referred to as features

• Data matrix:

$$\mathbf{X} = \begin{pmatrix} 0.61 & 1.03 \\ 0.54 & 0.96 \\ 0.21 & 0.51 \\ 0.78 & 1.38 \end{pmatrix}$$

sample   [Fe2+] /ppm   [Cl-] /ppm
  1         0.61          1.03
  2         0.54          0.96
  3         0.21          0.51
  4         0.78          1.38

Preprocessing Techniques

– Mean-centering

$$\mathbf{X} = \begin{pmatrix} 0.61 & 1.03 \\ 0.54 & 0.96 \\ 0.21 & 0.51 \\ 0.78 & 1.38 \end{pmatrix} \longrightarrow \mathbf{X}' = \begin{pmatrix} 0.07 & 0.06 \\ 0.00 & -0.01 \\ -0.33 & -0.46 \\ 0.24 & 0.41 \end{pmatrix}$$

[Figure: scatter plots of [Cl-] /ppm vs [Fe2+] /ppm. Left: raw data, with the mean point $\bar{x} = (0.54, 0.97)$ marked. Right: the same points after mean-centering, with the mean now at the origin.]
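The worked example can be reproduced in Matlab in two lines (x is the 4×2 data matrix from the slide):

> x   = [0.61 1.03; 0.54 0.96; 0.21 0.51; 0.78 1.38];  % [Fe2+] and [Cl-] /ppm
> xcm = x - repmat(mean(x),[4,1]);   % subtract the column means (0.5350, 0.9700)
> xcm                                % matches X' above, up to the slide's rounding of the means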

Preprocessing Techniques

• Normalization makes the lengths of all data vectors in the data set the same; that is, the sum of the squares of the elements of a data vector is the same for every sample in the entire data set:

$$\sum_{k=1}^{NV} x_{ik}^2 = c_i$$

– A familiar example is normalizing vectors to unit length.
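A sketch of normalization to unit length (the 3×2 matrix x is hypothetical):

> x   = [3 4; 1 2; 5 0];                  % hypothetical data, one sample per row
> len = sqrt(sum(x.^2,2));                % Euclidean length of each data vector
> xn  = x ./ repmat(len,[1,size(x,2)]);   % now sum(xn.^2,2) is 1 for every sample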

Preprocessing Techniques

• Scaling

– The notion of scaling is intuitive.

– In the absence of any a priori information, the data should be scaled so as to put all the variables on an equal footing in terms of their variance.

– Range scaling

$$x'_{ik} = \frac{x_{ik} - x_k^{\min}}{x_k^{\max} - x_k^{\min}}, \qquad 0.0 \le x'_{ik} \le 1.0$$
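A Matlab sketch, using the bond-length/heat-of-formation data from the next slide:

> x  = [0.96 79.7; 1.43 32.2; 2.03 10.8; 1.71 18.9; 1.13 35.5; 1.29 7.0];  % data from the example below
> NP = size(x,1);
> xr = (x - repmat(min(x),[NP,1])) ./ repmat(max(x)-min(x),[NP,1]);  % each column now spans [0,1]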

Preprocessing Techniques

– Range scaling

$$\mathbf{X} = \begin{pmatrix} 0.96 & 79.7 \\ 1.43 & 32.2 \\ 2.03 & 10.8 \\ 1.71 & 18.9 \\ 1.13 & 35.5 \\ 1.29 & 7.0 \end{pmatrix}$$

Molecule   Bond length /Å   ΔfH /cal g-1
H2O             0.96             79.7
SO2             1.43             32.2
SiCl4           2.03             10.8
AsF3            1.71             18.9
N2O             1.13             35.5
BF3             1.29              7.0

Preprocessing Techniques

– Range scaling

[Figure: raw data plotted as bond length /Å against ΔfH /cal g-1, with both axes drawn on the same 0-80 scale; the bond-length variation is invisible at this scale.]

Preprocessing Techniques

– Range scaling

$$\mathbf{X} = \begin{pmatrix} 0.96 & 79.7 \\ 1.43 & 32.2 \\ 2.03 & 10.8 \\ 1.71 & 18.9 \\ 1.13 & 35.5 \\ 1.29 & 7.0 \end{pmatrix} \longrightarrow \mathbf{X}' = \begin{pmatrix} 0.00 & 1.00 \\ 0.44 & 0.35 \\ 1.00 & 0.05 \\ 0.70 & 0.16 \\ 0.16 & 0.39 \\ 0.31 & 0.00 \end{pmatrix}$$

Preprocessing Techniques

– Range scaling

[Figure: range-scaled data, bond length against ΔfH /cal g-1, with both axes now spanning 0-1; the spread of the two variables is comparable.]

Preprocessing Techniques

– Range scaling: outlier

$$\mathbf{X} = \begin{pmatrix} 0.96 & 79.7 \\ 6.43 & 32.2 \\ 2.03 & 10.8 \\ 1.71 & 18.9 \\ 1.13 & 35.5 \\ 1.29 & 7.0 \end{pmatrix} \longrightarrow \mathbf{X}' = \begin{pmatrix} 0.00 & 1.00 \\ 1.00 & 0.35 \\ 0.20 & 0.05 \\ 0.14 & 0.16 \\ 0.03 & 0.39 \\ 0.06 & 0.00 \end{pmatrix}$$

Preprocessing Techniques

– Range scaling: outlier

[Figure: raw data including the outlier (bond length 6.43 Å), bond length against ΔfH /cal g-1 on the 0-80 scale.]

Preprocessing Techniques

– Range scaling: outlier

[Figure: range-scaled data including the outlier; because the outlier defines the range, the remaining bond-length values are squeezed into a narrow band near 0.]

Preprocessing Techniques

• Autoscaling removes any inadvertent weighting that arises due to arbitrary units, but is not as sensitive to outliers as range scaling.

Preprocessing Techniques

– Autoscaling to unit variance refers to mean-centering followed by division by the standard deviation, $s_k$, on a variable-by-variable basis:

$$x'_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}$$

and

$$s_k = \left[\frac{1}{NP-1}\sum_{i=1}^{NP}\left(x_{ik} - \bar{x}_k\right)^2\right]^{1/2}$$
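A minimal Matlab sketch, reusing the example data (zscore from the Statistics Toolbox gives the same result):

> x  = [0.96 79.7; 1.43 32.2; 2.03 10.8; 1.71 18.9; 1.13 35.5; 1.29 7.0];  % example data reused
> NP = size(x,1);
> xa = (x - repmat(mean(x),[NP,1])) ./ repmat(std(x),[NP,1]);  % std uses the NP-1 denominator, matching s_k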

Preprocessing Techniques

– The other version of autoscaling in common use is

$$x'_{ik} = \frac{x_{ik} - \bar{x}_k}{\left[\sum_{i=1}^{NP}\left(x_{ik} - \bar{x}_k\right)^2\right]^{1/2}}$$

Preprocessing Techniques

• No outlier / • Outlier present

[Figure: autoscaled data, bond length against ΔfH /cal g-1, without the outlier (left) and with it (right).]

Preprocessing Techniques

• Feature Weighting

– Provides a measure of the discriminating ability of a variable in terms of category separation.

– Improves classification results by “stretching” the feature space, i.e., scaling each axis according to its overall feature weight, $w_k$.

Preprocessing Techniques

• Feature Weighting

– A very discriminating feature will yield widely separated distributions; conversely, the distributions will be very nearly superimposed for a poor one.

[Figure: frequency histogram of the data in two categories, I and II, for variable $x_k$, with the intercategory separation $V_{inter}$ and intracategory spreads $V_{intra}$ indicated.]

Preprocessing Techniques

• Feature Weighting

– The variance weight for these categories, $w_k(I,II)$, is calculated as the ratio of the intercategory variance to the sum of the intracategory variances:

$$w_k(I,II) = \frac{\dfrac{1}{N_I}\sum x_I^2 + \dfrac{1}{N_{II}}\sum x_{II}^2 - \dfrac{2}{N_I N_{II}}\sum x_I \sum x_{II}}{\dfrac{1}{N_I}\sum\left(x_I - \bar{x}_I\right)^2 + \dfrac{1}{N_{II}}\sum\left(x_{II} - \bar{x}_{II}\right)^2}$$

Such a value can be computed, for each feature, over all $N_J = \frac{1}{2}\,N_{\mathrm{cat}}\left(N_{\mathrm{cat}} - 1\right)$ pairs of categories.

Preprocessing Techniques

– It is desirable to compute a single, overall rating of each feature's discriminating ability, $w_k$:

$$w_k = \left[\prod_{j=1}^{N_J} w_k(j)\right]^{1/N_J}$$

In this way, $w_k$ is always greater than or equal to 1.0, even for a feature with no discriminating ability.

Preprocessing Techniques

– Alternatively, the Fisher weight replaces the numerator with the (squared) difference of the category means:

$$w_k(I,II) = \frac{\left(\bar{x}_I - \bar{x}_{II}\right)^2}{\dfrac{1}{N_I}\sum\left(x_I - \bar{x}_I\right)^2 + \dfrac{1}{N_{II}}\sum\left(x_{II} - \bar{x}_{II}\right)^2}$$

– Since the Fisher weight may actually go to zero for a nondiscriminating feature, the overall weight is calculated as an arithmetic mean:

$$w_k = \frac{1}{N_J}\sum_{j=1}^{N_J} w_k(j)$$

Preprocessing Techniques

– In an application, either the Fisher or the variance weights would be calculated and each axis scaled by these values to produce a new set of features $x'_{ik}$:

$$x'_{ik} = w_k\,x_{ik}$$
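A sketch of Fisher weighting for two categories (the matrices xI and xII are hypothetical; the denominator uses the population 1/N variances, following the formula above):

> xI  = [0.61 1.03; 0.54 0.96; 0.78 1.38];   % hypothetical category I, samples in rows
> xII = [0.21 0.51; 0.25 0.48; 0.30 0.55];   % hypothetical category II
> NI = size(xI,1); NII = size(xII,1);
> vI  = sum((xI  - repmat(mean(xI), [NI, 1])).^2)/NI;    % intracategory variances
> vII = sum((xII - repmat(mean(xII),[NII,1])).^2)/NII;
> wk = (mean(xI) - mean(xII)).^2 ./ (vI + vII);          % Fisher weight, one per variable
> xw = [xI; xII] .* repmat(wk,[NI+NII,1]);               % stretched features x' = w_k * x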

Preprocessing Techniques


“The weighted features are summarized in Table 2. The variance and Fisher weights indicate that the early eluting compounds, represented as feature PK1, are very important in discriminating between the two classes. This becomes obvious on visual examination of the chromatographic profiles. Other components, however, are also important in a Fisher and variance weight sense, and this is not readily apparent by visual examination.”


Preprocessing Techniques

• Rotation

In general, a set of coordinate axes may be rotated through an angle to change the relative orientation of a set of points to the axes:

$$\mathbf{Y} = \mathbf{V}\mathbf{X}$$

X: original data in column-matrix form
V: transformation matrix
Y: new coordinates in column-matrix form

Preprocessing Techniques

• Rotation

– 2D

$$\mathbf{V} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$$
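For example, rotating the mean-centered iron/chloride data through 30° (a sketch; the angle is an arbitrary choice):

> theta = pi/6;                                        % 30 degrees
> V = [cos(theta) sin(theta); -sin(theta) cos(theta)];
> X = [0.07 0.00 -0.33 0.24; 0.06 -0.01 -0.46 0.41];   % X' from the earlier slide, in column form
> Y = V*X;                                             % coordinates in the rotated axes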

Preprocessing Techniques

• Eigenvector Rotation

As a preprocessing step, it is extremely useful to rotate all the axes involved in an n-dimensional data set so that the first new axis corresponds to the direction of the greatest variance in the data, and each successive axis represents the maximum residual variance.

Preprocessing Techniques

• Eigenvector Rotation

For autoscaled data, the correlation matrix, C, is given by

$$\mathbf{C} = \frac{1}{N-1}\,\mathbf{X}^{\mathsf{T}}\mathbf{X}$$

where N is the number of samples (rows) in X.

Preprocessing Techniques

• Eigenvector Rotation

$$\mathbf{Y} = \mathbf{X}\mathbf{V} \quad\Rightarrow\quad \mathbf{Y}^{\mathsf{T}} = \mathbf{V}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}} \quad\Rightarrow\quad \mathbf{Y}^{\mathsf{T}}\mathbf{Y} = \mathbf{V}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{V} \quad\Rightarrow\quad \frac{1}{N-1}\,\mathbf{Y}^{\mathsf{T}}\mathbf{Y} = \mathbf{V}^{\mathsf{T}}\mathbf{C}\mathbf{V} = \mathbf{\Lambda}$$

where $\mathbf{\Lambda}$ is a diagonal matrix.

Preprocessing Techniques

• Eigenvector Rotation

In other words, this is an eigenvector problem, in which we look for the vectors in $\mathbb{R}^n$ that, when transformed by C, are converted into scalar multiples of themselves:

$$\mathbf{C}\mathbf{V} = \lambda\mathbf{V} \quad\Rightarrow\quad \mathbf{C}\mathbf{V} - \lambda\mathbf{V} = \mathbf{0} \quad\Rightarrow\quad \left(\mathbf{C} - \lambda\mathbf{I}\right)\mathbf{V} = \mathbf{0}$$

Here $\lambda$ is a scalar whose solutions are the diagonal elements of $\mathbf{\Lambda}$.

Preprocessing Techniques

• Eigenvector Rotation

This problem has a nontrivial solution when

$$\left|\mathbf{C} - \lambda\mathbf{I}\right| = 0$$

This equation is solved for its roots $\lambda$, which are the eigenvalues, that is, the variances associated with each new axis. Once the solutions for $\lambda$ are known, they can be substituted back to solve for the column vectors of V.

Preprocessing Techniques

• Eigenvector Rotation

The coordinate axes are now the eigenvectors, which are linear combinations of the original variables. The coordinates Y are referred to as scores, and the elements of the eigenvectors (the eigenvector coefficients) are called the loadings. One might think of an eigenvector coefficient as indicating how much a variable is “loaded into” an eigenvector, that is, the magnitude of that variable's contribution in comprising the eigenvector.

Preprocessing Techniques

• Eigenvector Rotation

– Data reduction: the discarding of factors that do not contain significant information about the data.

– %Var: the percent variance retained after dimensionality reduction,

$$\%Var = \frac{\sum_{i=1}^{NC}\lambda_i}{\sum_{i=1}^{NV}\lambda_i} \times 100$$

where NC is the number of components retained.
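A sketch of %Var in Matlab, using the example correlation matrix from earlier (NC = 2 is an assumed choice):

> C = [1 0.05695 -0.39479; 0.05695 1 0.77886; -0.39479 0.77886 1];  % example correlation matrix
> lambda = sort(eig(C),'descend');        % eigenvalues, largest variance first
> NC = 2;                                 % assumed number of retained components
> pvar = 100*sum(lambda(1:NC))/sum(lambda)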

Preprocessing Techniques

• Eigenvector Rotation

– Signal/noise enhancement: the chemical information will be contained in the first one or more eigenvectors having the largest variances, and the noise contained in the eigenvectors of smallest eigenvalue may be discarded.

– Eigenvector (score) plots are extremely useful to display n-dimensional data, because they provide a way to preserve the maximum amount of information in a two-dimensional projection.

Principal Component Analysis (PCA)

Preprocessing Techniques

• Eigenvector Rotation

– Principal Component Analysis (PCA) refers to the diagonalization of the covariance matrix, COV.

$$x'_{ik} = x_{ik} - \bar{x}_k$$

$$\mathbf{COV} = \frac{1}{N-1}\,\mathbf{X}'^{\mathsf{T}}\mathbf{X}'$$

• Matlab: cov(xcm)

Preprocessing Techniques

– PCA

• It is important to note how to get back and forth computationally between scores and data. The scores can be computed as

$$y_{ik} = \sum_{m=1}^{NV}\left(x_{im} - \bar{x}_m\right)v_{mk}$$

where the sum runs over the NV variables.

Matlab > [V,lambda] = eig(cov(xcm))
       > scores = xcm*V;

Preprocessing Techniques

– PCA

• Given the score 𝑦𝑖, the original datum 𝑥𝑖𝑘 can be obtained as

$$x_{ik} = \bar{x}_k + \sum_{m=1}^{NV} y_{im}v_{km}$$

Matlab > xcm = scores*V';
       > x = xcm + repmat(mean(x),[size(xcm,1),1]);
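Because V is orthogonal, the two steps invert each other exactly; a quick round-trip check (a sketch, reusing the mean-centering example data):

> x = [0.61 1.03; 0.54 0.96; 0.21 0.51; 0.78 1.38];   % example data reused
> xcm = x - repmat(mean(x),[size(x,1),1]);
> [V,lambda] = eig(cov(xcm));
> scores = xcm*V;
> xrec = scores*V' + repmat(mean(x),[size(x,1),1]);
> max(abs(xrec(:) - x(:)))                 % ~0, up to floating-point error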

Preprocessing Techniques

– PCA

• Scores and loadings: the coordinates Y are referred to as scores, and the elements of the eigenvectors (the eigenvector coefficients) are called the loadings.

• Matlab: [V,lambda] = eig(cov(x))
          [V,lambda] = eig(cov(xcm))
          [V,lambda] = eig(cov(xa))

Preprocessing Techniques

– PCA

• PCA is a technique for reducing the amount of data when correlation is present.

• The idea behind PCA is to find principal components PC1, PC2, ..., PCn that are linear combinations of the original variables describing each specimen, X1, X2, ..., Xn, i.e.

$$PC_1 = V_{11}X_1 + V_{12}X_2 + \cdots + V_{1n}X_n$$
$$PC_2 = V_{21}X_1 + V_{22}X_2 + \cdots + V_{2n}X_n$$

etc.

Preprocessing Techniques

– PCA

• Ex: Q1EstatDesc.xls contains 3 variables (books, attend, and grade) and 40 cases.

  – Books: X1
  – Attend: X2
  – Grade: X3

• Mean-centering

• [V,lambda] = eig(cov(xcm))

• The principal components are chosen so that the first principal component (PC1) accounts for most of the variance in the data set, PC2 accounts for the next largest variation, and so on.

> lambda
> somavar = sum(diag(lambda))
> varpc = 100*diag(lambda)/somavar

Preprocessing Techniques

– PCA

• V =

    0.4902   0.1619  -0.8564
   -0.8684   0.0062  -0.4959
   -0.0750   0.9868   0.1436

• Here eig returned the eigenvalues in ascending order, so the last column of V carries the largest variance:

• PC1 = -0.8564X1 - 0.4959X2 + 0.1436X3 (third column of V)

• PC2 = 0.1619X1 + 0.0062X2 + 0.9868X3 (second column)

• PC3 = 0.4902X1 - 0.8684X2 - 0.0750X3 (first column)
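Since eig returns the smallest eigenvalue first, it can be convenient to reorder the columns once so that column 1 is PC1 (a sketch; xcm is any mean-centered data matrix, e.g. from the example above):

> [V,lambda] = eig(cov(xcm));
> [lam,idx]  = sort(diag(lambda),'descend');   % largest eigenvalue first
> Vs = V(:,idx);                 % Vs(:,1) now holds the PC1 loadings
> scores = xcm*Vs;               % scores(:,1) = PC1, scores(:,2) = PC2, ...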

Preprocessing Techniques

– PCA

• PC1 × PC2

» The amount of total variance expressed by the first two principal components: 93.7379 + 3.9393 = 97.6772%

> scores = xcm*V

» PC1 = -0.8564X1 - 0.4959X2 + 0.1436X3

» PC2 = 0.1619X1 + 0.0062X2 + 0.9868X3

Preprocessing Techniques

– PCA

• PC1xPC2

> plot(scores(:,3),scores(:,2),'*'); grid on;

> xlabel('PC1'); ylabel('PC2')

Preprocessing Techniques

– PCA

• PC1xPC2

[Figure: PC1 × PC2 score plot of the 40 samples; PC1 runs from about -20 to 15 and PC2 from about -4 to 3, and sample 16 lies at the far left along PC1.]

Preprocessing Techniques

– PCA

• PC1 × PC2 score plot

» PC1 = -0.8564X1 - 0.4959X2 + 0.1436X3

» PC2 = 0.1619X1 + 0.0062X2 + 0.9868X3

» Sample 16, with mean-centered values X1 = 16.4072, X2 = 9.2933, X3 = -1.8962:

• Score value on PC1
  = -0.8564*16.4072 - 0.4959*9.2933 + 0.1436*(-1.8962)
  = -18.9320

• Score value on PC2
  = 0.1619*16.4072 + 0.0062*9.2933 + 0.9868*(-1.8962)
  = 0.8428

» A sample's position on the plot is read as right or left along PC1 and up or down along PC2, i.e., in the upper-right, upper-left, lower-right, or lower-left quadrant.
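The same arithmetic can be checked in Matlab (sample 16's mean-centered values and the loadings are taken from the slides above):

> xc16 = [16.4072 9.2933 -1.8962];          % mean-centered X1, X2, X3 for sample 16
> v1 = [-0.8564; -0.4959; 0.1436];          % PC1 loadings
> v2 = [ 0.1619;  0.0062; 0.9868];          % PC2 loadings
> xc16*v1                                   % -> -18.9320
> xc16*v2                                   % -> 0.8428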

Preprocessing Techniques

– PCA

• PC1 loadings

» plot(V(:,3),'*-'), grid on
» xlabel('variable'); ylabel('PC1 loadings')

[Figure: PC1 loadings (-0.8564, -0.4959, 0.1436) plotted against variable number 1-3.]

Preprocessing Techniques

– PCA scores

[Figure: PC1 × PC2 score plots for autoscaled data (left; PC1 about -2 to 2, PC2 about -1 to 1.5) and mean-centered data (right; PC1 about -20 to 15, PC2 about -4 to 3).]

Preprocessing Techniques

– PCA loadings

[Figure: PC1 loadings plotted against variable number for autoscaled data (left) and mean-centered data (right).]

