CLASSIFICATION
Periodic Table of Elements
1789 Lavoisier
1869 Mendeleev
Measures of similarity
i) distance
ii) angular (correlation)
d_kl = ||x'_k − x'_l||

[Figure: two objects x'_k and x'_l plotted in the two-dimensional variable space (Var 1, Var 2). The difference between the object vectors is defined as the Euclidean distance between the objects, d_kl; the angle between the vectors gives the angular (correlation) measure.]
Measuring similarity
Distance
i) Euclidean
ii) Minkowski ("Manhattan", "taxicab")
iii) Mahalanobis (correlated variables)
[Figure: Euclidean distance between two points p1 and p2 in the (X1, X2) plane.]
Euclidean:  D_kl = [ Σ_{j=1}^{m} (x_kj − x_lj)^2 ]^{1/2}

Manhattan:  D_kl = Σ_{j=1}^{m} |x_kj − x_lj|
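As a sketch, the two metrics in plain Python (function names are illustrative, not from the slides):

```python
import math

def euclidean(xk, xl):
    """D_kl = sqrt( sum_j (x_kj - x_lj)^2 )"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xk, xl)))

def manhattan(xk, xl):
    """D_kl = sum_j |x_kj - x_lj|"""
    return sum(abs(a - b) for a, b in zip(xk, xl))

print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(manhattan([0.0, 0.0], [3.0, 4.0]))  # 7.0
```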
Distance
Classification using distance:
Nearest neighbor(s) define the membership of an object.
KNN (K nearest neighbors)
K = 1
K = 3
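A minimal KNN sketch in Python (the helper `knn_classify` is illustrative), assuming Euclidean distance and a majority vote among the K nearest training objects:

```python
import math
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Assign x_new to the majority class among its k nearest
    training objects (Euclidean distance in variable space)."""
    dist = [(math.dist(x, x_new), y) for x, y in zip(X_train, y_train)]
    neighbours = sorted(dist)[:k]
    votes = Counter(y for _, y in neighbours)
    return votes.most_common(1)[0][0]

X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
y = ["A", "A", "B", "B"]
print(knn_classify(X, y, [0.2, 0.1], k=1))  # A
print(knn_classify(X, y, [4.8, 5.1], k=3))  # B
```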
Classification
[Figure: two classes in the (X1, X2) plane.]
X1 and X2 are uncorrelated, cov(X1, X2) = 0 for both subsets (classes)
=> KNN can be used to measure similarity
Classification
[Figure: classes 1–4 in the (X1, X2) plane with PC1 and PC2 axes.]
Univariate classification can NOT provide a good separation between class 1 and class 2; bivariate classification (KNN) provides separation. For class 3 and class 4, PC analysis provides excellent separation on PC2.
Classification
[Figure: two elongated, correlated point swarms in the (X1, X2) plane.]
X1 and X2 are correlated, cov(X1, X2) ≠ 0 for both "classes" (high X1 => high X2). KNN fails, but PC analysis provides the correct classification.
Classification
Cluster methods like KNN (K nearest neighbors) use all the data in the calculation of distances.
Drawback: No separation of noise from information
Cure: Use scores from major PCs
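A sketch of the cure, assuming PCA via SVD (the function `pc_scores` is illustrative): compute scores on the major PCs and run KNN on those scores instead of the raw variables.

```python
import numpy as np

def pc_scores(X, n_components):
    """Mean-centre X and project it onto the loadings of the first
    n_components principal components (via SVD)."""
    X = np.asarray(X, dtype=float)
    x_bar = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - x_bar, full_matrices=False)
    P = Vt[:n_components].T        # loadings, M x A
    T = (X - x_bar) @ P            # scores,  N x A
    return T, P, x_bar

# toy data: two noisy variables driven by one latent direction
rng = np.random.default_rng(0)
t = rng.normal(size=20)
X = np.column_stack([t, t]) + 0.01 * rng.normal(size=(20, 2))
T, P, x_bar = pc_scores(X, 1)
```

KNN distances computed on `T` then use only the systematic part of the variation; the noise stays in the discarded minor components.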
VARIABLE CORRELATION AND SIMILARITY BETWEEN OBJECTS
CORRELATION&SIMILARITY
Variable space
[Figure: objects plotted in the variable space (Var 1, Var 2).]
CORRELATION&SIMILARITY
Variable space
[Figure: one PC line fitted through each class (PC class 1, PC class 2) in the variable space (Var 1, Var 2).]
SUPERVISED COMPARISON (SIMCA)
CORRELATION-SIMILARITY
Variable space
[Figure: all objects modelled together by PC1 and PC2 in the variable space (Var 1, Var 2).]
UNSUPERVISED COMPARISON (PCA)
CORRELATION&SIMILARITY
[Figure: in the variable space (Var 1, Var 2), an object vector x'_k is decomposed into the class model contribution x'_c plus the residual e'_k.]
CORRELATION&SIMILARITY
Unsupervised:
PCA - score plot
Fuzzy clustering
Supervised:
SIMCA
CORRELATION-SIMILARITY
Characterisation and Correlation of crude oils….Kvalheim et al. (1985) Anal. Chem.
CORRELATION&SIMILARITY
[Figure: chromatograms of Sample 1, Sample 2, …, Sample N.]
CORRELATION&SIMILARITY
SCORE PLOT
[Figure: score plot of samples 1–14 in the (t1, t2) plane (PC1 vs PC2).]
Soft Independent Modelling of
Class Analogies (SIMCA)
SIMCA
Data (variance) = Model (covariance pattern) + Residuals (unique variance, noise)

Model <=> angular correlation; Residuals <=> distance
SIMCA
Data matrix X with elements x_ki: rows = objects, columns = variables 1, 2, …, M.
Training set (reference set): objects 1, 2, …, N, grouped into Class 1, Class 2, …, Class Q.
Test set: unassigned objects N+1, …, N+N'.

Class — group of similar objects. Object — sample, individual. Variable — feature, characteristic, attribute.
SIMCA
Data matrix of peak areas x_ki: rows = chromatograms, columns = peaks 1, 2, …, M.
Training set (reference set): chromatograms 1, 2, …, N, grouped by oil field 1, …, Q.
Test set: new samples N+1, …, N+N'.
PC MODELS
[Figure: objects 1, 2, 3 in variable space, modelled by the class mean alone (left) and by the mean plus one PC direction p1 (right).]

Zero-component model:  x_ki = x̄_i + e_ki,   or in vector form  x'_k = x̄' + e'_k
One-component model:   x_ki = x̄_i + t_k p'_i + e_ki,   or  x'_k = x̄' + t_k p' + e'_k
PC MODELS
[Figure: objects in variable space modelled by the mean plus two PC directions p1 and p2.]

Two-component model:  x_ki = x̄_i + t_k1 p'_1i + t_k2 p'_2i + e_ki,   or  x'_k = x̄' + t_k1 p'_1 + t_k2 p'_2 + e'_k
PRINCIPAL COMPONENT CLASS MODEL
X_c = X̄_c + T_c P'_c + E_c
(X̄_c + T_c P'_c: information/structure; E_c: noise)

x_ki^c = x̄_i^c + Σ_{a=1}^{A} t_ka^c p'_ai^c + e_ki^c

k = 1, 2, …, N (object, sample); i = 1, 2, …, M (variable); a = 1, 2, …, A (principal component); c = 1, 2, …, C (class)
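A minimal sketch of fitting one class model by SVD, assuming mean-centring and orthonormal loadings (the function name is illustrative):

```python
import numpy as np

def fit_class_model(Xq, A):
    """Fit X = x_bar + T P' + E for one class with A principal
    components, via SVD of the mean-centred class data."""
    Xq = np.asarray(Xq, dtype=float)
    x_bar = Xq.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xq - x_bar, full_matrices=False)
    T = U[:, :A] * s[:A]       # scores (information/structure)
    P = Vt[:A].T               # loadings
    E = Xq - x_bar - T @ P.T   # residuals (noise)
    return x_bar, T, P, E
```

For data that is exactly rank A about its mean, the residual matrix E vanishes; for real data E carries the unique variance and noise.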
PC MODELS
[Figure: 8 × 5 grids showing three diagonal deletion patterns of matrix elements.]
Deletion pattern for matrix elements in the leave-one-group-of-elements-out-at-a-time cross-validation procedure developed by Wold.
CROSS VALIDATING PC MODELS
[Figure: 8 × 10 data matrix with one group of elements marked for deletion.]

E_A = X − x̄ − Σ_{a=1}^{A} t_a p'_a

i) Calculate scores and loadings for PC a+1, t_{a+1} and p'_{a+1}, excluding the elements in one group.
ii) Predict values for the excluded elements: ê_{ki,a+1} = t_{k,a+1} p'_{a+1,i}.
iii) Sum (e_{ki,a} − ê_{ki,a+1})^2 over the excluded elements.
iv) Repeat i)–iii) for all the other groups of elements.
v) Compare Σ_{k,i} (e_{ki,a} − ê_{ki,a+1})^2 with Σ_{k,i} e_{ki,a}^2, adjusted for degrees of freedom.
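A hedged sketch of PRESS-based component selection. To keep the code short it leaves out whole objects (rows) rather than Wold's groups of single matrix elements, so it is a simplified stand-in for the procedure above, not a faithful implementation:

```python
import numpy as np

def press_per_component(X, max_A):
    """Leave-one-object-out PRESS for models with 1..max_A components.
    Simplification: whole objects are deleted, not element groups."""
    X = np.asarray(X, dtype=float)
    press = np.zeros(max_A)
    for k in range(X.shape[0]):
        Xtr = np.delete(X, k, axis=0)   # training set without object k
        x_bar = Xtr.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xtr - x_bar, full_matrices=False)
        e = X[k] - x_bar                # residual after the mean
        for A in range(1, max_A + 1):
            P = Vt[:A].T
            r = e - P @ (P.T @ e)       # residual after A components
            press[A - 1] += float(r @ r)
    return press
```

One keeps adding components as long as the prediction error PRESS keeps dropping appreciably.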
Residual Standard Deviation (RSD)
[Figure: one-component PC model (PC 1) with class limits s_max at p = 0.05 and p = 0.01; s_0 is the mean RSD of the class.]
Mean RSD of class q:  s_0^2 = Σ_{i=1}^{M} Σ_{k=1}^{N_q} e_ki^2 / [(M − A)(N_q − A − 1)]

RSD of object k:  s_k^2 = Σ_{i=1}^{M} e_ki^2 / (M − A)
[Figure: the class box around PC 1: score limits t_min − ½s_t and t_max + ½s_t (t_lower, t_upper) and residual limit s_max.]
CLASSIFICATION OF A NEW OBJECT
i) Fit object to the class model
ii) Compare residual distance of object to the class model with the average residual distance of objects used to obtain the class (F-test)
CLASSIFICATION OF A NEW OBJECT

i) Fit the object to the class model. This defines  e'_{k,0} = x'_k − x̄'_q.

For a = 1, 2, …, A:
  t_ka = ê'_{k,a−1} p_a / ||p_a||
  ê'_{k,a} = ê'_{k,a−1} − t_ka p'_a / ||p_a||

Calculate the residual variance of the object:
  s_k^2 = Σ_{i=1}^{M} ê_{ki,A}^2 / (M − A_q)

ii) Compare the residual distance of the object to the class model with the average residual distance s_0 of the objects used to obtain the class (F-test):
  s_k^2 / s_0^2 <= F_critical => k ∈ class q
  s_k^2 / s_0^2 > F_critical => k ∉ class q
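A sketch of steps i)–ii), assuming orthonormal loadings P from an SVD-fitted class model (so the component loop collapses to one projection); `F_crit` would come from an F-table, and all names are illustrative:

```python
import numpy as np

def classify_new_object(x_new, x_bar, P, s0_sq, F_crit):
    """Fit a new object to the class model (x_bar, P with orthonormal
    columns), compute its residual variance s_k^2 and F-test it
    against the class residual variance s_0^2."""
    M, A = P.shape
    e = np.asarray(x_new, dtype=float) - x_bar  # residual after the mean
    t = P.T @ e                                 # scores on the class loadings
    e = e - P @ t                               # residual after A components
    sk_sq = float(e @ e) / (M - A)
    F = sk_sq / s0_sq
    return F <= F_crit, F

# hypothetical one-component class model along variable 1
x_bar = np.zeros(3)
P = np.array([[1.0], [0.0], [0.0]])
member, F = classify_new_object([5.0, 0.05, 0.05], x_bar, P,
                                s0_sq=0.01, F_crit=4.0)
print(member, F)  # True 0.25
```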
Objects outside the model

Detection of atypical objects
[Figure: class box around PC 1 with residual limit RSD_max and score limits t_min − ½s_t and t_max + ½s_t; object k lies above the residual limit, object l outside the score range.]
Object k: s_k > RSD_max => k is outside the class.
Object l: t_l is outside the "normal area" {t_min − ½s_t, t_max + ½s_t} => calculate the residual distance s_l to the extreme point; s_l > RSD_max => l is outside the class.
Detection of outliers
1. Score plots
2. DIXON-TESTS on each LATENT VARIABLE,
3. Normal plots of scores for each LATENT VARIABLE
4. Test of residuals, F-test (class model)
Q = |t_max − t_{max−1}| / (t_max − t_min) > Q_critical => outlier
MODELLING POWER
DISCRIMINATION POWER

MODELLING POWER
The variables' contribution to the class model q (intra-class variation):

MP_i^q = 1 − s_{i,A}^q / s_{i,0}^q

MP_i = 1.0 => variable i is completely explained by the class model.
MP_i = 0.0 => variable i does NOT contribute to the class model.
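A sketch of MP, assuming an SVD-fitted class model and omitting the degree-of-freedom corrections used in the slide formulas (so the values are slightly biased but the interpretation is the same):

```python
import numpy as np

def modelling_power(Xq, A):
    """MP_i = 1 - s_{i,A}/s_{i,0}: residual standard deviation of
    variable i after the A-component class model, relative to its
    standard deviation about the class mean (no DOF correction)."""
    Xc = np.asarray(Xq, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:A].T                  # orthonormal loadings
    E = Xc - Xc @ P @ P.T         # residuals after A components
    s_iA = np.sqrt((E ** 2).mean(axis=0))
    s_i0 = np.sqrt((Xc ** 2).mean(axis=0))
    return 1.0 - s_iA / s_i0
```

Variables 1 and 2 below follow the latent direction (MP near 1); variable 3 is pure noise (MP near 0).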
DISCRIMINATION POWER
The variables' ability to separate two class models (inter-class variation):

DP_i^{r,q} = 1.0 => no discrimination power
DP_i^{r,q} > 3–4 => "good" discrimination power
(DP_i^{r,q})^2 = [ s_ri^2(q) + s_qi^2(r) ] / [ s_ri^2(r) + s_qi^2(q) ]

where s_ri^2(q) is the residual variance of variable i when the objects of class r are fitted to the class-q model.
[Figure: objects k (class q) and l (class r) with residual distances s_k(q), s_k(r), s_l(q), s_l(r) to the class-q and class-r models.]
SEPARATION BETWEEN CLASSES

Worst ratio:  min_l |s_l(q) / s_l(r)|,  l ∈ r

Class distance:  d_{qr}^2 = [ s_q^2(r) + s_r^2(q) ] / [ s_{0,q}^2 + s_{0,r}^2 ]

d_{qr} > 3–4 => "good separation"
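A sketch of DP under the same SVD-based class models: cross-fitted residual variances over self-fitted ones, per variable (all function names are illustrative):

```python
import numpy as np

def _fit(X, A):
    """Mean and A orthonormal loadings of one class (via SVD)."""
    X = np.asarray(X, dtype=float)
    x_bar = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - x_bar, full_matrices=False)
    return x_bar, Vt[:A].T

def _res_var(X, x_bar, P):
    """Per-variable residual variance of objects X fitted to (x_bar, P)."""
    E = np.asarray(X, dtype=float) - x_bar
    E = E - E @ P @ P.T
    return (E ** 2).mean(axis=0)

def discrimination_power(Xq, Xr, A):
    """DP_i = sqrt[(s_ri^2(q) + s_qi^2(r)) / (s_ri^2(r) + s_qi^2(q))]."""
    mq, Pq = _fit(Xq, A)
    mr, Pr = _fit(Xr, A)
    num = _res_var(Xr, mq, Pq) + _res_var(Xq, mr, Pr)
    den = _res_var(Xr, mr, Pr) + _res_var(Xq, mq, Pq)
    return np.sqrt(num / den)
```

In the test below the classes differ only in variable 1 (mean shift), so DP is large for variable 1 and near 1 for the pure-noise variable 3.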
POLISHED CLASSES
1) Remove “outliers”
2) Remove variables with both low MP (< 0.3–0.4) and low DP (< 2–3)
How does SIMCA differ from other multivariate methods?
i) Models systematic intra-class variation (angular correlation)
ii) Assuming a normally distributed population, the residuals can be
used to decide class membership (F-test)!
iii) “Closed” models
iv) Considers correlation, important for large data sets
v) SIMCA separates noise from systematic (predictive) variation
in each class
Linear Discriminant Analysis (LDA)
Separating surface
• New classes?
• Outliers
• Asymmetric case?
• Looking for dissimilarities
MISSING DATA
[Figure: two class models f_1(x1, x2) and f_2(x1, x2) in the (x1, x2) plane; objects with missing values are marked "?".]
WHEN DOES SIMCA WORK?
1. Similarity between objects in the same class (homogeneous data).
2. Some variables relevant for the problem in question (MP, DP).
3. At least 5 objects and 3 variables.
ALGORITHM FOR SIMCA MODELLING
1. Read raw data.
2. Pretreatment of data (square root, normalise, and more).
3. Variable weighting (standardise).
4. Select subset/class; evaluation of subsets.
5. Cross-validated PC model.
6. Outliers? If yes, remove them and remodel.
7. More classes? If yes, return to step 4.
8. Eliminate variables with low modelling and discrimination power => "polished" subsets. Remodel? If yes, return to step 5.
9. Fit new objects.