CLASSIFICATION

Date post: 06-Jan-2016
Upload: tallys

Transcript
Page 1: CLASSIFICATION

CLASSIFICATION

Page 2: CLASSIFICATION

Periodic Table of Elements

Page 3: CLASSIFICATION

1789 Lavoisier

1869 Mendeleev

Page 4: CLASSIFICATION

Measures of similarity

i) distance

ii) angular (correlation)

Page 5: CLASSIFICATION

d_kl = || x_k^T − x_l^T ||

Two objects, x_k^T and x_l^T, plotted in the two-dimensional variable space (Var 1, Var 2). The difference between the object vectors is defined as the Euclidean distance between the objects, d_kl. The angle between the two object vectors gives the angular (correlation) measure of similarity.

Page 6: CLASSIFICATION

Measuring similarity

Distance

i) Euclidean
ii) Minkowski ("Manhattan", "taxi")
iii) Mahalanobis (correlated variables)

Page 7: CLASSIFICATION

Figure: two points p1 and p2 in the (X1, X2) plane, joined by their Euclidean distance.

Euclidean:  D_kl = ( Σ_{j=1}^{m} (x_kj − x_lj)^2 )^{1/2}

Manhattan:  D_kl = Σ_{j=1}^{m} | x_kj − x_lj |
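The two distance formulas above can be sketched in a few lines of Python (my own illustration, not part of the original slides):

```python
import math

def euclidean(x_k, x_l):
    # D_kl = ( sum_j (x_kj - x_lj)^2 )^(1/2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_k, x_l)))

def manhattan(x_k, x_l):
    # D_kl = sum_j |x_kj - x_lj|
    return sum(abs(a - b) for a, b in zip(x_k, x_l))

p1, p2 = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p1, p2))  # 5.0
print(manhattan(p1, p2))  # 7.0
```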

Page 8: CLASSIFICATION

Classification using distance:

Nearest neighbor(s) define the membership of an object.

KNN (K nearest neighbors)

K = 1

K = 3
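A minimal KNN classifier in the spirit of this slide — nearest neighbours vote on the membership of a new object (toy data and function names are my own):

```python
import math
from collections import Counter

def knn_classify(x, training, k=3):
    """Assign x to the majority class among its k nearest training objects.

    training is a list of (vector, class_label) pairs; Euclidean distance."""
    neighbours = sorted(training, key=lambda obj: math.dist(x, obj[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
            ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify((1, 1), training, k=1))  # A
print(knn_classify((4, 5), training, k=3))  # B
```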

Page 9: CLASSIFICATION

Classification

X1

X2

X1 and X2 are uncorrelated, cov(X1, X2) = 0, for both subsets (classes)

=> can use KNN to measure similarity

Page 10: CLASSIFICATION

Classification

X1

X2 PC1PC2

Class 3

Class 4

Class 1

Class 2

Univariate classification can NOT provide a good separation between class 1 and class 2; bivariate classification (KNN) does. For class 3 and class 4, PC analysis provides excellent separation on PC2.

Page 11: CLASSIFICATION

Classification

X1

X2

X1 and X2 are correlated, cov(X1, X2) ≠ 0, for both "classes" (high X1 => high X2). KNN fails, but PC analysis provides the correct classification.

Page 12: CLASSIFICATION

Classification

Cluster methods like KNN (K nearest neighbors) use all the data in the calculation of distances.

Drawback: No separation of noise from information

Cure: Use scores from major PCs
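The cure above — computing distances on the scores from the major PCs instead of on the raw variables — could be sketched like this (numpy-based illustration; the helper name `pc_scores` is my own):

```python
import numpy as np

def pc_scores(X, n_components):
    """Mean-centre X, then project onto the leading principal components."""
    Xc = X - X.mean(axis=0)
    # rows of Vt are the loading vectors p_a, ordered by explained variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # score matrix T

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
T = pc_scores(X, 2)
print(T.shape)  # (10, 2)
```

KNN distances computed on T rather than on X then ignore the minor components, which carry mostly noise.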

Page 13: CLASSIFICATION

VARIABLE CORRELATIONAND

SIMILARITYBETWEEN OBJECTS

Page 14: CLASSIFICATION

CORRELATION&SIMILARITY

Variable space

Var 1

Var 2

Page 15: CLASSIFICATION

CORRELATION&SIMILARITY

Variable space

Var 1

Var 2PCclass 2

PCclass 1

SUPERVISED COMPARISON (SIMCA)

Page 16: CLASSIFICATION

CORRELATION-SIMILARITY

Variable space

Var 1

Var 2 PC1PC2

UNSUPERVISED COMPARISON (PCA)

Page 17: CLASSIFICATION

CORRELATION&SIMILARITY

eTk

xcT

xTk

Var 2

Var 1Variable Space

Page 18: CLASSIFICATION

CORRELATION&SIMILARITY

Unsupervised:

PCA - score plot

Fuzzy clustering

Supervised:

SIMCA

Page 19: CLASSIFICATION

CORRELATION-SIMILARITY

Map of the sampling area (scale 0–30 km).

Characterisation and Correlation of crude oils…, Kvalheim et al. (1985), Anal. Chem.

Page 20: CLASSIFICATION

CORRELATION&SIMILARITY

Sample 1

Sample 2

Sample N

Page 21: CLASSIFICATION

CORRELATION&SIMILARITY

SCORE PLOT

Figure: score plot of samples 1–14 in the (t1, t2) plane (PC1 vs PC2).

Page 22: CLASSIFICATION

Soft Independent Modelling of

Class Analogies (SIMCA)

Page 23: CLASSIFICATION

SIMCA

Data (variance) = Model (covariance pattern) + Residuals (unique variance, noise)

The model part carries the angular correlation; the residual part carries the distance.

Page 24: CLASSIFICATION

SIMCA

Data matrix X (elements x_ki): rows are objects 1, 2, 3, …, N, N+1, …, N+N′; columns are variables 1, 2, 3, 4, …, M.

Objects 1…N form the training set (reference set), grouped into Class 1, Class 2, …, Class Q. Objects N+1…N+N′ are unassigned objects (test set).

Class — group of similar objects
Object — sample, individual
Variable — feature, characteristic, attribute

Page 25: CLASSIFICATION

SIMCA

The same layout for the oil example: rows are chromatograms 1…N (training set / reference set, grouped into Oil field 1, Oil field 2, …, Oil field Q) and N+1…N+N′ (new samples, the test set); the variables are peak areas x_ki, i = 1…M.

Page 26: CLASSIFICATION

PC MODELS

Figure: objects 1, 2, 3 projected onto the class mean x̄ (zero-component model) and onto the first principal component p1 (one-component model).

Zero-component model:
x_ki = x̄_i + e_ki        (element form)
x′_k = x̄′ + e′_k         (vector form)

One-component model:
x_ki = x̄_i + t_k p′_i + e_ki
x′_k = x̄′ + t_k p′ + e′_k

Page 27: CLASSIFICATION

PC MODELS

Figure: objects 1, 2, 3 projected onto the plane spanned by p1 and p2 (two-component model).

One-component model:
x_ki = x̄_i + t_k p′_i + e_ki

Two-component model:
x′_k = x̄′ + t_k1 p′_1 + t_k2 p′_2 + e′_k
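The decomposition into mean, score-loading terms, and residual can be checked numerically — a sketch with numpy's SVD, not the original software:

```python
import numpy as np

def pc_model(X, A):
    """Decompose X into mean + sum_a t_a p'_a + E, as on the slide."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:A].T      # loadings, one unit-norm column per component
    T = Xc @ P        # scores t_ka
    E = Xc - T @ P.T  # residuals e_ki
    return x_bar, T, P, E

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))
x_bar, T, P, E = pc_model(X, 2)
# reconstruction: x_k = x_bar + sum_a t_ka p'_a + e_k holds exactly
print(np.allclose(X, x_bar + T @ P.T + E))  # True
```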

Page 28: CLASSIFICATION

PRINCIPAL COMPONENT CLASS MODEL

X_c = X̄_c + T_c P′_c + E_c

X̄_c + T_c P′_c : information (structure)
E_c : noise

Element form:
x_ki^c = x̄_i^c + Σ_{a=1}^{A_c} t_ka^c p_ai^c + e_ki^c

k = 1, 2, …, N   (object, sample)
i = 1, 2, …, M   (variable)
a = 1, 2, …, A   (principal component)
c = 1, 2, …, C   (class)

Page 29: CLASSIFICATION

PC MODELS

Figure: the elements of an 8 × 5 data matrix divided into diagonal groups, one group deleted at a time.

Deletion pattern for objects in the leave-one-group-of-elements-out-at-a-time cross-validation procedure developed by Wold.

Page 30: CLASSIFICATION

CROSS VALIDATING PC MODELS

Residual matrix after a components:

E(a) = X − x̄ − Σ_{a′=1}^{a} t_a′ p′_a′

i) Calculate scores and loadings for PC a+1, t_{a+1} and p′_{a+1}, excluding the elements in one group.
ii) Predict values for the excluded elements: ê_ki,a+1 = t_k,a+1 p′_a+1,i
iii) Sum (e_ki,a − ê_ki,a+1)^2 over the excluded elements.
iv) Repeat i)–iii) for all the other groups of elements.
v) Compare Σ_k Σ_i (e_ki,a − ê_ki,a+1)^2 with Σ_k Σ_i e_ki,a^2. Adjust for degrees of freedom.

Page 31: CLASSIFICATION

Figure: a 1-component PC model (PC 1) with maximum residual limits S_max at p = 0.05 and p = 0.01.

Page 32: CLASSIFICATION

Residual Standard Deviation (RSD)

Figure: the class envelope around PC 1, with object residuals bounded by S_max and the mean class RSD S_0.

Mean RSD of class q:
s_0^2 = Σ_{i=1}^{M} Σ_{k=1}^{N_q} e_ki^2 / ((M − A)(N_q − A − 1))

RSD of object k:
s_k^2 = Σ_{i=1}^{M} e_ki^2 / (M − A)
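Given a residual matrix E (one row per object, one column per variable) from a class model with A components, the two RSD formulas translate directly (an illustrative sketch; function names are my own):

```python
import numpy as np

def object_rsd(E, A):
    # s_k^2 = sum_i e_ki^2 / (M - A), one value per object
    M = E.shape[1]
    return np.sqrt((E ** 2).sum(axis=1) / (M - A))

def class_rsd(E, A):
    # s_0^2 = sum_i sum_k e_ki^2 / ((M - A)(N_q - A - 1))
    N, M = E.shape
    return np.sqrt((E ** 2).sum() / ((M - A) * (N - A - 1)))

E = np.array([[ 0.1, -0.2,  0.05],
              [ 0.0,  0.1, -0.1 ],
              [ 0.2,  0.0,  0.1 ],
              [-0.1,  0.1,  0.0 ]])
print(object_rsd(E, A=1))        # one RSD per object
print(float(class_rsd(E, A=1)))  # mean RSD of the class
```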

Page 33: CLASSIFICATION

Figure: class model along PC 1. The score range is bounded by t_lower = t_min − ½ s_t and t_upper = t_max + ½ s_t; the residual distance is bounded by s_max.

Page 34: CLASSIFICATION

CLASSIFICATION OF A NEW OBJECT

i) Fit object to the class model

ii) Compare residual distance of object to the class model with the average residual distance of objects used to obtain the class (F-test)

Page 35: CLASSIFICATION

CLASSIFICATION OF A NEW OBJECT

i) Fit the object to the class model. This defines the residual vector

e′_k,0 = x′_k − x̄′_q

For a = 1, 2, …, A:

t_a = e′_k,a−1 p_a / ||p_a||^2
e′_k,a = e′_k,a−1 − t_a p′_a

Calculate the residual standard deviation of the object:

s_k^2 = Σ_{i=1}^{M} e_ki^2 / (M − A)

ii) Compare the residual distance of the object to the class model with the average residual distance of the objects used to obtain the class (F-test on s_k^2 / s_0^2):

s_k^2 / s_0^2 <= F_critical => k ∈ class q
s_k^2 / s_0^2 > F_critical => k ∉ class q
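Steps i) and ii) can be sketched as follows (assumed toy model with unit-norm loadings; the F_critical threshold of 4.0 is an arbitrary placeholder, not a tabulated value):

```python
import numpy as np

def fit_to_class(x, x_bar, P):
    """Project a new object onto the class model; return scores and residual."""
    e = x - x_bar                  # e'_{k,0} = x'_k - x_bar'_q
    t = np.empty(P.shape[1])
    for a in range(P.shape[1]):    # for a = 1, ..., A
        t[a] = e @ P[:, a]         # t_a = e' p_a (columns of P are unit-norm)
        e = e - t[a] * P[:, a]     # e'_a = e'_{a-1} - t_a p'_a
    return t, e

def f_ratio(e, A, s0):
    M = e.size
    sk2 = (e ** 2).sum() / (M - A)  # s_k^2
    return sk2 / s0 ** 2            # compare with F_critical

# toy class model: zero mean, one loading along variable 1 (assumed values)
x_bar = np.zeros(3)
P = np.array([[1.0], [0.0], [0.0]])
t, e = fit_to_class(np.array([2.0, 0.1, -0.1]), x_bar, P)
print(t)                                # score along p_1
print(f_ratio(e, A=1, s0=0.2) <= 4.0)   # True: object accepted into the class
```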

Page 36: CLASSIFICATION

Figure: the same class envelope along PC 1 (t_lower to t_upper, residual limit s_max), now with objects falling outside the model.

Page 37: CLASSIFICATION

Detection of atypical objects

Figure: class model along PC 1 with residual limit RSD_max and the normal score range {t_min − ½ s_t, t_max + ½ s_t}; object k lies above the residual limit, object l beyond the score range.

Object k: s_k > RSD_max => k is outside the class.

Object l: t_l is outside the "normal area" {t_min − ½ s_t, t_max + ½ s_t} => calculate the residual distance to the extreme point; if s_l > RSD_max => l is outside the class.

Page 38: CLASSIFICATION

Detection of outliers

1. Score plots
2. Dixon tests on each latent variable:
   Q = | t_max − t_max−1 | / ( t_max − t_min ), compared with Q_critical
3. Normal plots of scores for each latent variable
4. Test of residuals, F-test (class model)
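The Dixon test of item 2 compares the gap between the extreme score and its nearest neighbour with the total score range (a sketch; the toy scores are my own, and Q_critical must be taken from a Dixon table for the given n and significance level):

```python
def dixon_q(scores):
    """Dixon's Q for the largest score on a latent variable:
    Q = (gap to the nearest neighbour) / (total range)."""
    s = sorted(scores)
    return (s[-1] - s[-2]) / (s[-1] - s[0])

t1 = [-1.2, -0.8, -0.1, 0.3, 0.9, 4.5]
print(round(dixon_q(t1), 3))  # 0.632 -> t = 4.5 is a candidate outlier
# compare with the tabulated Q_critical for n = 6 at the chosen p-level
```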

Page 39: CLASSIFICATION

MODELLING POWER

DISCRIMINATION POWER

Page 40: CLASSIFICATION

MODELLING POWER

The variable's contribution to the class model q (intra-class variation):

MP_i^q = 1 − s_i,A^q / s_i,0^q

MP_i = 1.0 => variable i is completely explained by the class model.
MP_i = 0.0 => variable i does NOT contribute to the class model.

Page 41: CLASSIFICATION

DISCRIMINATION POWER

The variables' ability to separate two class models (inter-class variation):

DP_i^{r,q} = [ ( s_i^r(q)^2 + s_i^q(r)^2 ) / ( s_i^q(q)^2 + s_i^r(r)^2 ) ]^{1/2}

where s_i^r(q) is the residual standard deviation of variable i when the objects of class r are fitted to the class-q model.

DP_i^{r,q} = 1.0 => no discrimination power
DP_i^{r,q} > 3–4 => "good" discrimination power
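Discrimination power can be estimated by cross-fitting two classes to each other's PC models (an illustrative sketch under assumed toy data; helper names are my own, and mean squared residuals stand in for the slide's residual variances):

```python
import numpy as np

def class_model(X, A):
    # class mean plus A unit-norm loading vectors
    x_bar = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - x_bar, full_matrices=False)
    return x_bar, Vt[:A].T

def residuals(X, model):
    # residual matrix E after fitting the objects in X to a class model
    x_bar, P = model
    Xc = X - x_bar
    return Xc - (Xc @ P) @ P.T

def discrimination_power(Xq, Xr, A):
    # DP_i = sqrt( (s_i^r(q)^2 + s_i^q(r)^2) / (s_i^q(q)^2 + s_i^r(r)^2) )
    mq, mr = class_model(Xq, A), class_model(Xr, A)
    msd = lambda E: (E ** 2).mean(axis=0)  # per-variable mean squared residual
    num = msd(residuals(Xr, mq)) + msd(residuals(Xq, mr))
    den = msd(residuals(Xq, mq)) + msd(residuals(Xr, mr))
    return np.sqrt(num / den)

# toy classes: both spread mainly along variable 2, separated along variable 0
rng = np.random.default_rng(2)
n = 20
Xq = np.column_stack([5 + 0.1 * rng.normal(size=n),
                      0.1 * rng.normal(size=n),
                      3 * rng.normal(size=n)])
Xr = np.column_stack([-5 + 0.1 * rng.normal(size=n),
                      0.1 * rng.normal(size=n),
                      3 * rng.normal(size=n)])
dp = discrimination_power(Xq, Xr, A=1)
print(dp[0] > 3)  # variable 0 discriminates strongly between the classes
```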

Page 42: CLASSIFICATION

SEPARATION BETWEEN CLASSES

Figure: objects k and l with residual distances s_k(q), s_l(q) to class model q and s_k(r), s_l(r) to class model r.

Worst ratio:  min_l | s_l(q) / s_l(r) |,  l ∈ r

Class distance:

d_{r,q} = [ ( s_q(r)^2 + s_r(q)^2 ) / ( s_0(q)^2 + s_0(r)^2 ) ]^{1/2}

d_{r,q} > 3–4 => "good separation"

Page 43: CLASSIFICATION

POLISHED CLASSES

1) Remove “outliers”

2) Remove variables with both low MP (< 0.3–0.4) and low DP (< 2–3)

Page 44: CLASSIFICATION

How does SIMCA differ from other multivariate methods?

i) Models systematic intra-class variation (angular correlation)

ii) Assuming a normally distributed population, the residuals can be used to decide class membership (F-test)!

iii) "Closed" models

iv) Considers correlation, important for large data sets

v) SIMCA separates noise from systematic (predictive) variation in each class

Page 45: CLASSIFICATION

Linear Discriminant Analysis (LDA)

Separating surface

• New classes?
• Outliers?
• Asymmetric case?
• Looking for dissimilarities

Page 46: CLASSIFICATION

MISSING DATA

Figure: missing values (?) in the variables x1, x2 can be predicted from the class models f1(x1, x2) and f2(x1, x2).

Page 47: CLASSIFICATION

WHEN DOES SIMCA WORK?

1. Similarity between objects in the same class, homogeneous data.
2. Some relevant variables for the problem in question (MP, DP).
3. At least 5 objects and 3 variables.

Page 48: CLASSIFICATION

ALGORITHM FOR SIMCA MODELLING

1. Read raw data.
2. Pretreatment of data (square root, normalise, and more).
3. Standardise.
4. Select subset/class; evaluate the subsets.
5. Variable weighting.
6. Fit a cross-validated PC model.
7. Outliers? If yes, remodel.
8. "Polished" subsets: eliminate variables with low modelling and discrimination power.
9. More classes? If yes, return to step 4.
10. Fit new objects.

