
Multivariate Analysis preprocessing techniques and PCA

Prof. Dr. Anselmo E de Oliveira

anselmo.quimica.ufg.br

[email protected]

Preprocessing Techniques

• Redundant/Constant Variables

– The data set should be checked for constant and redundant variables before attempting pattern recognition, since pattern recognition relies on the variance in the data.

– Redundant variables are usually detected by examining the correlation matrix for high correlation coefficients.

Preprocessing Techniques

– correlation matrix

$$\mathbf{C} = \frac{1}{N-1}\,\mathbf{X}^{\mathsf{T}}\mathbf{X}$$

where $\mathbf{X}$ is autoscaled

• Ex:

$$\mathbf{C} = \begin{pmatrix} 1 & 0.05695 & -0.39479 \\ 0.05695 & 1 & 0.77886 \\ -0.39479 & 0.77886 & 1 \end{pmatrix}$$
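A minimal Matlab sketch of this computation (the 4×3 matrix x0 is hypothetical, invented only to make the lines runnable):

> x0 = [0.61 1.03 2.1; 0.54 0.96 1.9; 0.21 0.51 2.5; 0.78 1.38 1.7];  % hypothetical data, samples in rows
> N  = size(x0,1);                                               % number of samples
> xa = (x0 - repmat(mean(x0),[N,1])) ./ repmat(std(x0),[N,1]);   % autoscale
> C  = (1/(N-1))*(xa'*xa);                                       % correlation matrix
> corrcoef(x0)                                                   % same result, built in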

Preprocessing Techniques

• Translation

– The purpose of translation is to change the position of the data set with respect to the axes

– Mean-centering

Preprocessing Techniques

– Mean-centering

• $x_{ik}$, the kth measurement of the ith sample:

$$x'_{ik} = x_{ik} - \bar{x}_k, \qquad \bar{x}_k = \frac{1}{NP}\sum_{i=1}^{NP} x_{ik}$$

where NP = total number of samples.

Preprocessing Techniques

– Mean-centering

• 𝐗 → 𝐗′

• 𝐗′ variables are now referred to as features

• Data matrix:

$$\mathbf{X} = \begin{pmatrix} 0.61 & 1.03 \\ 0.54 & 0.96 \\ 0.21 & 0.51 \\ 0.78 & 1.38 \end{pmatrix}$$

sample   [Fe2+] /ppm   [Cl-] /ppm
  1         0.61          1.03
  2         0.54          0.96
  3         0.21          0.51
  4         0.78          1.38

Preprocessing Techniques

– Mean-centering

$$\mathbf{X} = \begin{pmatrix} 0.61 & 1.03 \\ 0.54 & 0.96 \\ 0.21 & 0.51 \\ 0.78 & 1.38 \end{pmatrix} \longrightarrow \mathbf{X}' = \begin{pmatrix} 0.07 & 0.06 \\ 0.00 & -0.01 \\ -0.33 & -0.46 \\ 0.24 & 0.41 \end{pmatrix}$$

[Figure: scatter plots of [Cl-] /ppm vs [Fe2+] /ppm. Left: raw data, with the mean point $\bar{x} = (0.54, 0.97)$ marked. Right: the same points after mean-centering, with the mean now at the origin.]
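The worked example can be reproduced in Matlab in two lines (x is the 4×2 data matrix from the slide):

> x   = [0.61 1.03; 0.54 0.96; 0.21 0.51; 0.78 1.38];  % [Fe2+] and [Cl-] /ppm
> xcm = x - repmat(mean(x),[4,1]);   % subtract the column means (0.5350, 0.9700)
> xcm                                % matches X' above, up to the slide's rounding of the means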

Preprocessing Techniques

• Normalization makes the lengths of all data vectors in the data set the same; that is, the sum of the squares of the elements of a data vector is the same for every sample in the entire data set:

$$\sum_{k=1}^{NV} x_{ik}^2 = c_i$$

– A familiar example is normalizing vectors to unit length.
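A sketch of normalization to unit length (the 3×2 matrix x is hypothetical):

> x   = [3 4; 1 2; 5 0];                  % hypothetical data, one sample per row
> len = sqrt(sum(x.^2,2));                % Euclidean length of each data vector
> xn  = x ./ repmat(len,[1,size(x,2)]);   % now sum(xn.^2,2) is 1 for every sample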

Preprocessing Techniques

• Scaling

– The notion of scaling is intuitive.

– In the absence of any a priori information, the data should be scaled so as to put all the variables on an equal footing in terms of their variance.

– Range scaling

$$x'_{ik} = \frac{x_{ik} - x_k^{\min}}{x_k^{\max} - x_k^{\min}}, \qquad 0.0 \le x'_{ik} \le 1.0$$
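A Matlab sketch, using the bond-length/heat-of-formation data from the next slide:

> x  = [0.96 79.7; 1.43 32.2; 2.03 10.8; 1.71 18.9; 1.13 35.5; 1.29 7.0];  % data from the example below
> NP = size(x,1);
> xr = (x - repmat(min(x),[NP,1])) ./ repmat(max(x)-min(x),[NP,1]);  % each column now spans [0,1]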

Preprocessing Techniques

– Range scaling

$$\mathbf{X} = \begin{pmatrix} 0.96 & 79.7 \\ 1.43 & 32.2 \\ 2.03 & 10.8 \\ 1.71 & 18.9 \\ 1.13 & 35.5 \\ 1.29 & 7.0 \end{pmatrix}$$

Molecule   Bond length /Å   ΔfH /cal g-1
H2O             0.96             79.7
SO2             1.43             32.2
SiCl4           2.03             10.8
AsF3            1.71             18.9
N2O             1.13             35.5
BF3             1.29              7.0

Preprocessing Techniques

– Range scaling

[Figure: raw data plotted as bond length /Å against ΔfH /cal g-1, with both axes drawn on the same 0-80 scale; the bond-length variation is invisible at this scale.]

Preprocessing Techniques

– Range scaling

$$\mathbf{X} = \begin{pmatrix} 0.96 & 79.7 \\ 1.43 & 32.2 \\ 2.03 & 10.8 \\ 1.71 & 18.9 \\ 1.13 & 35.5 \\ 1.29 & 7.0 \end{pmatrix} \longrightarrow \mathbf{X}' = \begin{pmatrix} 0.00 & 1.00 \\ 0.44 & 0.35 \\ 1.00 & 0.05 \\ 0.70 & 0.16 \\ 0.16 & 0.39 \\ 0.31 & 0.00 \end{pmatrix}$$

Preprocessing Techniques

– Range scaling

[Figure: range-scaled data, bond length against ΔfH /cal g-1, with both axes now spanning 0-1; the spread of the two variables is comparable.]

Preprocessing Techniques

– Range scaling: outlier

$$\mathbf{X} = \begin{pmatrix} 0.96 & 79.7 \\ 6.43 & 32.2 \\ 2.03 & 10.8 \\ 1.71 & 18.9 \\ 1.13 & 35.5 \\ 1.29 & 7.0 \end{pmatrix} \longrightarrow \mathbf{X}' = \begin{pmatrix} 0.00 & 1.00 \\ 1.00 & 0.35 \\ 0.20 & 0.05 \\ 0.14 & 0.16 \\ 0.03 & 0.39 \\ 0.06 & 0.00 \end{pmatrix}$$

Preprocessing Techniques

– Range scaling: outlier

[Figure: raw data including the outlier (bond length 6.43 Å), bond length against ΔfH /cal g-1 on the 0-80 scale.]

Preprocessing Techniques

– Range scaling: outlier

[Figure: range-scaled data including the outlier; because the outlier defines the range, the remaining bond-length values are squeezed into a narrow band near 0.]

Preprocessing Techniques

• Autoscaling removes any inadvertent weighting that arises due to arbitrary units, but is not as sensitive to outliers as range scaling.

Preprocessing Techniques

– Autoscaling to unit variance refers to mean-centering followed by division by the standard deviation, $s_k$, on a variable-by-variable basis:

$$x'_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}$$

and

$$s_k = \left[\frac{1}{NP-1}\sum_{i=1}^{NP}\left(x_{ik} - \bar{x}_k\right)^2\right]^{1/2}$$
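A minimal Matlab sketch, reusing the example data (zscore from the Statistics Toolbox gives the same result):

> x  = [0.96 79.7; 1.43 32.2; 2.03 10.8; 1.71 18.9; 1.13 35.5; 1.29 7.0];  % example data reused
> NP = size(x,1);
> xa = (x - repmat(mean(x),[NP,1])) ./ repmat(std(x),[NP,1]);  % std uses the NP-1 denominator, matching s_k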

Preprocessing Techniques

– The other version of autoscaling in common use is

$$x'_{ik} = \frac{x_{ik} - \bar{x}_k}{\left[\sum_{i=1}^{NP}\left(x_{ik} - \bar{x}_k\right)^2\right]^{1/2}}$$

Preprocessing Techniques

• No outlier / • Outlier present

[Figure: autoscaled data, bond length against ΔfH /cal g-1, without the outlier (left) and with it (right).]

Preprocessing Techniques

• Feature Weighting

– Provides a measure of the discriminating ability of a variable in terms of category separation.

– Improves classification results by “stretching” the feature space, i.e., scaling each axis according to its overall feature weight, $w_k$.

Preprocessing Techniques

• Feature Weighting

– A very discriminating feature will yield widely separated distributions; conversely, the distributions will be very nearly superimposed for a poor one.

[Figure: frequency histogram of the data in two categories, I and II, for variable $x_k$, with the intercategory separation $V_{inter}$ and intracategory spreads $V_{intra}$ indicated.]

Preprocessing Techniques

• Feature Weighting

– The variance weight for these categories, $w_k(I,II)$, is calculated as the ratio of the intercategory variance to the sum of the intracategory variances:

$$w_k(I,II) = \frac{\dfrac{1}{N_I}\sum x_I^2 + \dfrac{1}{N_{II}}\sum x_{II}^2 - \dfrac{2}{N_I N_{II}}\sum x_I \sum x_{II}}{\dfrac{1}{N_I}\sum\left(x_I - \bar{x}_I\right)^2 + \dfrac{1}{N_{II}}\sum\left(x_{II} - \bar{x}_{II}\right)^2}$$

Such a value can be computed, for each feature, over all $N_J = \frac{1}{2}\,N_{\mathrm{cat}}\left(N_{\mathrm{cat}} - 1\right)$ pairs of categories.

Preprocessing Techniques

– It is desirable to compute a single, overall rating of each feature's discriminating ability, $w_k$:

$$w_k = \left[\prod_{j=1}^{N_J} w_k(j)\right]^{1/N_J}$$

In this way, $w_k$ is always greater than or equal to 1.0, even for a feature with no discriminating ability.

Preprocessing Techniques

– Alternatively, the Fisher weight replaces the numerator with the (squared) difference of the category means:

$$w_k(I,II) = \frac{\left(\bar{x}_I - \bar{x}_{II}\right)^2}{\dfrac{1}{N_I}\sum\left(x_I - \bar{x}_I\right)^2 + \dfrac{1}{N_{II}}\sum\left(x_{II} - \bar{x}_{II}\right)^2}$$

– Since the Fisher weight may actually go to zero for a nondiscriminating feature, the overall weight is calculated as an arithmetic mean:

$$w_k = \frac{1}{N_J}\sum_{j=1}^{N_J} w_k(j)$$

Preprocessing Techniques

– In an application, either the Fisher or the variance weights would be calculated and each axis scaled by these values to produce a new set of features $x'_{ik}$:

$$x'_{ik} = w_k\,x_{ik}$$
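A sketch of Fisher weighting for two categories (the matrices xI and xII are hypothetical; the denominator uses the population 1/N variances, following the formula above):

> xI  = [0.61 1.03; 0.54 0.96; 0.78 1.38];   % hypothetical category I, samples in rows
> xII = [0.21 0.51; 0.25 0.48; 0.30 0.55];   % hypothetical category II
> NI = size(xI,1); NII = size(xII,1);
> vI  = sum((xI  - repmat(mean(xI), [NI, 1])).^2)/NI;    % intracategory variances
> vII = sum((xII - repmat(mean(xII),[NII,1])).^2)/NII;
> wk = (mean(xI) - mean(xII)).^2 ./ (vI + vII);          % Fisher weight, one per variable
> xw = [xI; xII] .* repmat(wk,[NI+NII,1]);               % stretched features x' = w_k * x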

Preprocessing Techniques


“The weighted features are summarized in Table 2. The variance and Fisher weights indicate that the early eluting compounds, represented as feature PK1, are very important in discriminating between the two classes. This becomes obvious on visual examination of the chromatographic profiles. Other components, however, are also important in a Fisher and variance weight sense, and this is not readily apparent by visual examination.”


Preprocessing Techniques

• Rotation

In general, a set of coordinate axes may be rotated through an angle to change the relative orientation of a set of points to the axes:

$$\mathbf{Y} = \mathbf{V}\mathbf{X}$$

X: original data in column-matrix form
V: transformation matrix
Y: new coordinates in column-matrix form

Preprocessing Techniques

• Rotation

– 2D

$$\mathbf{V} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$$
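For example, rotating the mean-centered iron/chloride data through 30° (a sketch; the angle is an arbitrary choice):

> theta = pi/6;                                        % 30 degrees
> V = [cos(theta) sin(theta); -sin(theta) cos(theta)];
> X = [0.07 0.00 -0.33 0.24; 0.06 -0.01 -0.46 0.41];   % X' from the earlier slide, in column form
> Y = V*X;                                             % coordinates in the rotated axes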

Preprocessing Techniques

• Eigenvector Rotation

As a preprocessing step, it is extremely useful to rotate all the axes involved in an n-dimensional data set so that the first new axis corresponds to the direction of the greatest variance in the data, and each successive axis represents the maximum residual variance.

Preprocessing Techniques

• Eigenvector Rotation

For autoscaled data, the correlation matrix, C, is given by

$$\mathbf{C} = \frac{1}{N-1}\,\mathbf{X}^{\mathsf{T}}\mathbf{X}$$

where N is the number of samples (rows) in X.

Preprocessing Techniques

• Eigenvector Rotation

$$\mathbf{Y} = \mathbf{X}\mathbf{V} \quad\Rightarrow\quad \mathbf{Y}^{\mathsf{T}} = \mathbf{V}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}} \quad\Rightarrow\quad \mathbf{Y}^{\mathsf{T}}\mathbf{Y} = \mathbf{V}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{V} \quad\Rightarrow\quad \frac{1}{N-1}\,\mathbf{Y}^{\mathsf{T}}\mathbf{Y} = \mathbf{V}^{\mathsf{T}}\mathbf{C}\mathbf{V} = \mathbf{\Lambda}$$

where $\mathbf{\Lambda}$ is a diagonal matrix.

Preprocessing Techniques

• Eigenvector Rotation

In other words, this is an eigenvector problem, in which we look for the vectors in $\mathbb{R}^n$ that, when transformed by C, are converted into scalar multiples of themselves:

$$\mathbf{C}\mathbf{V} = \lambda\mathbf{V} \quad\Rightarrow\quad \mathbf{C}\mathbf{V} - \lambda\mathbf{V} = \mathbf{0} \quad\Rightarrow\quad \left(\mathbf{C} - \lambda\mathbf{I}\right)\mathbf{V} = \mathbf{0}$$

Here $\lambda$ is a scalar whose solutions are the diagonal elements of $\mathbf{\Lambda}$.

Preprocessing Techniques

• Eigenvector Rotation

This problem has a nontrivial solution when

$$\left|\mathbf{C} - \lambda\mathbf{I}\right| = 0$$

This equation is solved for its roots $\lambda$, which are the eigenvalues, that is, the variances associated with each new axis. Once the solutions for $\lambda$ are known, they can be substituted back to solve for the column vectors of V.

Preprocessing Techniques

• Eigenvector Rotation

The coordinate axes are now the eigenvectors, which are linear combinations of the original variables. The coordinates Y are referred to as scores, and the elements of the eigenvectors (the eigenvector coefficients) are called the loadings. One might think of an eigenvector coefficient as indicating how much a variable is “loaded into” an eigenvector, that is, the magnitude of that variable's contribution in comprising the eigenvector.

Preprocessing Techniques

• Eigenvector Rotation

– Data reduction: the discarding of factors that do not contain significant information about the data.

– %Var: the percent variance retained after dimensionality reduction,

$$\%Var = \frac{\sum_{i=1}^{NC}\lambda_i}{\sum_{i=1}^{NV}\lambda_i} \times 100$$

where NC is the number of components retained.
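A sketch of %Var in Matlab, using the example correlation matrix from earlier (NC = 2 is an assumed choice):

> C = [1 0.05695 -0.39479; 0.05695 1 0.77886; -0.39479 0.77886 1];  % example correlation matrix
> lambda = sort(eig(C),'descend');        % eigenvalues, largest variance first
> NC = 2;                                 % assumed number of retained components
> pvar = 100*sum(lambda(1:NC))/sum(lambda)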

Preprocessing Techniques

• Eigenvector Rotation

– Signal/noise enhancement: the chemical information will be contained in the first one or more eigenvectors having the largest variances, and the noise contained in the eigenvectors of smallest eigenvalue may be discarded.

– Eigenvector (score) plots are extremely useful to display n-dimensional data, because they provide a way to preserve the maximum amount of information in a two-dimensional projection.

Principal Component Analysis (PCA)

Preprocessing Techniques

• Eigenvector Rotation

– Principal Component Analysis (PCA) refers to the diagonalization of the covariance matrix, COV.

$$x'_{ik} = x_{ik} - \bar{x}_k$$

$$\mathbf{COV} = \frac{1}{N-1}\,\mathbf{X}'^{\mathsf{T}}\mathbf{X}'$$

• Matlab: cov(xcm)

Preprocessing Techniques

– PCA

• It is important to note how to get back and forth computationally between scores and data. The scores can be computed as

$$y_{ik} = \sum_{m=1}^{NV}\left(x_{im} - \bar{x}_m\right)v_{mk}$$

where the sum runs over the NV variables.

Matlab > [V,lambda] = eig(cov(xcm))
       > scores = xcm*V;

Preprocessing Techniques

– PCA

• Given the score 𝑦𝑖, the original datum 𝑥𝑖𝑘 can be obtained as

$$x_{ik} = \bar{x}_k + \sum_{m=1}^{NV} y_{im}v_{km}$$

Matlab > xcm = scores*V';
       > x = xcm + repmat(mean(x),[size(xcm,1),1]);
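Because V is orthogonal, the two steps invert each other exactly; a quick round-trip check (a sketch, reusing the mean-centering example data):

> x = [0.61 1.03; 0.54 0.96; 0.21 0.51; 0.78 1.38];   % example data reused
> xcm = x - repmat(mean(x),[size(x,1),1]);
> [V,lambda] = eig(cov(xcm));
> scores = xcm*V;
> xrec = scores*V' + repmat(mean(x),[size(x,1),1]);
> max(abs(xrec(:) - x(:)))                 % ~0, up to floating-point error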

Preprocessing Techniques

– PCA

• Scores and loadings: the coordinates Y are referred to as scores, and the elements of the eigenvectors (the eigenvector coefficients) are called the loadings.

• Matlab: [V,lambda] = eig(cov(x))
          [V,lambda] = eig(cov(xcm))
          [V,lambda] = eig(cov(xa))

Preprocessing Techniques

– PCA

• PCA is a technique for reducing the amount of data when correlation is present.

• The idea behind PCA is to find principal components PC1, PC2, ..., PCn that are linear combinations of the original variables describing each specimen, X1, X2, ..., Xn, i.e.

$$PC_1 = V_{11}X_1 + V_{12}X_2 + \cdots + V_{1n}X_n$$
$$PC_2 = V_{21}X_1 + V_{22}X_2 + \cdots + V_{2n}X_n$$

etc.

Preprocessing Techniques

– PCA

• Ex: Q1EstatDesc.xls contains 3 variables (books, attend, and grade) and 40 cases.

  – Books: X1
  – Attend: X2
  – Grade: X3

• Mean-centering

• [V,lambda] = eig(cov(xcm))

• The principal components are chosen so that the first principal component (PC1) accounts for most of the variance in the data set, PC2 accounts for the next largest variation, and so on.

> lambda
> somavar = sum(diag(lambda))
> varpc = 100*diag(lambda)/somavar

Preprocessing Techniques

– PCA

• V =

    0.4902   0.1619  -0.8564
   -0.8684   0.0062  -0.4959
   -0.0750   0.9868   0.1436

• Here eig returned the eigenvalues in ascending order, so the last column of V carries the largest variance:

• PC1 = -0.8564X1 - 0.4959X2 + 0.1436X3 (third column of V)

• PC2 = 0.1619X1 + 0.0062X2 + 0.9868X3 (second column)

• PC3 = 0.4902X1 - 0.8684X2 - 0.0750X3 (first column)
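Since eig returns the smallest eigenvalue first, it can be convenient to reorder the columns once so that column 1 is PC1 (a sketch; xcm is any mean-centered data matrix, e.g. from the example above):

> [V,lambda] = eig(cov(xcm));
> [lam,idx]  = sort(diag(lambda),'descend');   % largest eigenvalue first
> Vs = V(:,idx);                 % Vs(:,1) now holds the PC1 loadings
> scores = xcm*Vs;               % scores(:,1) = PC1, scores(:,2) = PC2, ...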

Preprocessing Techniques

– PCA

• PC1 × PC2

» The amount of total variance expressed by the first two principal components: 93.7379 + 3.9393 = 97.6772%

> scores = xcm*V

» PC1 = -0.8564X1 - 0.4959X2 + 0.1436X3

» PC2 = 0.1619X1 + 0.0062X2 + 0.9868X3

Preprocessing Techniques

– PCA

• PC1xPC2

> plot(scores(:,3),scores(:,2),'*'); grid on;

> xlabel('PC1'); ylabel('PC2')

Preprocessing Techniques

– PCA

• PC1xPC2

[Figure: PC1 × PC2 score plot of the 40 samples; PC1 runs from about -20 to 15 and PC2 from about -4 to 3, and sample 16 lies at the far left along PC1.]

Preprocessing Techniques

– PCA

• PC1 × PC2 score plot

» PC1 = -0.8564X1 - 0.4959X2 + 0.1436X3

» PC2 = 0.1619X1 + 0.0062X2 + 0.9868X3

» Sample 16, with mean-centered values X1 = 16.4072, X2 = 9.2933, X3 = -1.8962:

• Score value on PC1
  = -0.8564*16.4072 - 0.4959*9.2933 + 0.1436*(-1.8962)
  = -18.9320

• Score value on PC2
  = 0.1619*16.4072 + 0.0062*9.2933 + 0.9868*(-1.8962)
  = 0.8428

» A sample's position on the plot is read as right or left along PC1 and up or down along PC2, i.e., in the upper-right, upper-left, lower-right, or lower-left quadrant.
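The same arithmetic can be checked in Matlab (sample 16's mean-centered values and the loadings are taken from the slides above):

> xc16 = [16.4072 9.2933 -1.8962];          % mean-centered X1, X2, X3 for sample 16
> v1 = [-0.8564; -0.4959; 0.1436];          % PC1 loadings
> v2 = [ 0.1619;  0.0062; 0.9868];          % PC2 loadings
> xc16*v1                                   % -> -18.9320
> xc16*v2                                   % -> 0.8428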

Preprocessing Techniques

– PCA

• PC1 loadings

» plot(V(:,3),'*-'), grid on
» xlabel('variable'); ylabel('PC1 loadings')

[Figure: PC1 loadings (-0.8564, -0.4959, 0.1436) plotted against variable number 1-3.]

Preprocessing Techniques

– PCA scores

[Figure: PC1 × PC2 score plots for autoscaled data (left; PC1 about -2 to 2, PC2 about -1 to 1.5) and mean-centered data (right; PC1 about -20 to 15, PC2 about -4 to 3).]

Preprocessing Techniques

– PCA loadings

[Figure: PC1 loadings plotted against variable number for autoscaled data (left) and mean-centered data (right).]

