CLASSIFICATION
Periodic Table of Elements
1789 Lavoisier
1869 Mendeleev
Measures of similarity
i) distance
ii) angular (correlation)
d_kl = ||x'_k − x'_l||

[Figure: two objects x'_k and x'_l plotted in the two-dimensional variable space (Var 1, Var 2). The difference between the object vectors is defined as the Euclidean distance between the objects, d_kl; the angle between the vectors gives the angular (correlation) measure.]
Measuring similarity
Distance
i) Euclidean
ii) Minkowski ("Manhattan", "taxicab")
iii) Mahalanobis (correlated variables)
[Figure: Euclidean distance between two points p1 and p2 in the (X1, X2) plane.]
Euclidean:  D_kl = [ Σ_{j=1}^{m} (x_kj − x_lj)^2 ]^{1/2}

Manhattan:  D_kl = Σ_{j=1}^{m} |x_kj − x_lj|
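As a sketch, the two metrics in plain Python (function names are illustrative, not from the slides):

```python
import math

def euclidean(xk, xl):
    """D_kl = sqrt( sum_j (x_kj - x_lj)^2 )"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xk, xl)))

def manhattan(xk, xl):
    """D_kl = sum_j |x_kj - x_lj|"""
    return sum(abs(a - b) for a, b in zip(xk, xl))

print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(manhattan([0.0, 0.0], [3.0, 4.0]))  # 7.0
```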
Distance
Classification using distance:
Nearest neighbor(s) define the membership of an object.
KNN (K nearest neighbors)
K = 1
K = 3
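A minimal KNN sketch in Python (the helper `knn_classify` is illustrative), assuming Euclidean distance and a majority vote among the K nearest training objects:

```python
import math
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Assign x_new to the majority class among its k nearest
    training objects (Euclidean distance in variable space)."""
    dist = [(math.dist(x, x_new), y) for x, y in zip(X_train, y_train)]
    neighbours = sorted(dist)[:k]
    votes = Counter(y for _, y in neighbours)
    return votes.most_common(1)[0][0]

X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
y = ["A", "A", "B", "B"]
print(knn_classify(X, y, [0.2, 0.1], k=1))  # A
print(knn_classify(X, y, [4.8, 5.1], k=3))  # B
```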
Classification
[Figure: two classes in the (X1, X2) plane.]
X1 and X2 are uncorrelated, cov(X1, X2) = 0 for both subsets (classes)
=> KNN can be used to measure similarity
Classification
[Figure: classes 1–4 in the (X1, X2) plane with PC1 and PC2 axes.]
Univariate classification can NOT provide a good separation between class 1 and class 2; bivariate classification (KNN) provides separation. For class 3 and class 4, PC analysis provides excellent separation on PC2.
Classification
[Figure: two elongated, correlated point swarms in the (X1, X2) plane.]
X1 and X2 are correlated, cov(X1, X2) ≠ 0 for both "classes" (high X1 => high X2). KNN fails, but PC analysis provides the correct classification.
Classification
Cluster methods like KNN (K nearest neighbors) use all the data in the calculation of distances.
Drawback: No separation of noise from information
Cure: Use scores from major PCs
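A sketch of the cure, assuming PCA via SVD (the function `pc_scores` is illustrative): compute scores on the major PCs and run KNN on those scores instead of the raw variables.

```python
import numpy as np

def pc_scores(X, n_components):
    """Mean-centre X and project it onto the loadings of the first
    n_components principal components (via SVD)."""
    X = np.asarray(X, dtype=float)
    x_bar = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - x_bar, full_matrices=False)
    P = Vt[:n_components].T        # loadings, M x A
    T = (X - x_bar) @ P            # scores,  N x A
    return T, P, x_bar

# toy data: two noisy variables driven by one latent direction
rng = np.random.default_rng(0)
t = rng.normal(size=20)
X = np.column_stack([t, t]) + 0.01 * rng.normal(size=(20, 2))
T, P, x_bar = pc_scores(X, 1)
```

KNN distances computed on `T` then use only the systematic part of the variation; the noise stays in the discarded minor components.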
VARIABLE CORRELATION AND SIMILARITY BETWEEN OBJECTS
CORRELATION&SIMILARITY
Variable space
[Figure: objects plotted in the variable space (Var 1, Var 2).]
CORRELATION&SIMILARITY
Variable space
[Figure: one PC line fitted through each class (PC class 1, PC class 2) in the variable space (Var 1, Var 2).]
SUPERVISED COMPARISON (SIMCA)
CORRELATION-SIMILARITY
Variable space
[Figure: all objects modelled together by PC1 and PC2 in the variable space (Var 1, Var 2).]
UNSUPERVISED COMPARISON (PCA)
CORRELATION&SIMILARITY
[Figure: in the variable space (Var 1, Var 2), an object vector x'_k is decomposed into the class model contribution x'_c plus the residual e'_k.]
CORRELATION&SIMILARITY
Unsupervised:
PCA - score plot
Fuzzy clustering
Supervised:
SIMCA
CORRELATION-SIMILARITY
Characterisation and Correlation of crude oils….Kvalheim et al. (1985) Anal. Chem.
CORRELATION&SIMILARITY
[Figure: chromatograms of Sample 1, Sample 2, …, Sample N.]
CORRELATION&SIMILARITY
SCORE PLOT
[Figure: score plot of samples 1–14 in the (t1, t2) plane (PC1 vs PC2).]
Soft Independent Modelling of
Class Analogies (SIMCA)
SIMCA
Data (variance) = Model (covariance pattern) + Residuals (unique variance, noise)

Model <=> angular correlation; Residuals <=> distance
SIMCA
Data matrix X with elements x_ki: rows = objects, columns = variables 1, 2, …, M.
Training set (reference set): objects 1, 2, …, N, grouped into Class 1, Class 2, …, Class Q.
Test set: unassigned objects N+1, …, N+N'.

Class — group of similar objects. Object — sample, individual. Variable — feature, characteristic, attribute.
SIMCA
Data matrix of peak areas x_ki: rows = chromatograms, columns = peaks 1, 2, …, M.
Training set (reference set): chromatograms 1, 2, …, N, grouped by oil field 1, …, Q.
Test set: new samples N+1, …, N+N'.
PC MODELS
[Figure: objects 1, 2, 3 in variable space, modelled by the class mean alone (left) and by the mean plus one PC direction p1 (right).]

Zero-component model:  x_ki = x̄_i + e_ki,   or in vector form  x'_k = x̄' + e'_k
One-component model:   x_ki = x̄_i + t_k p'_i + e_ki,   or  x'_k = x̄' + t_k p' + e'_k
PC MODELS
[Figure: objects in variable space modelled by the mean plus two PC directions p1 and p2.]

Two-component model:  x_ki = x̄_i + t_k1 p'_1i + t_k2 p'_2i + e_ki,   or  x'_k = x̄' + t_k1 p'_1 + t_k2 p'_2 + e'_k
PRINCIPAL COMPONENT CLASS MODEL
X_c = X̄_c + T_c P'_c + E_c
(X̄_c + T_c P'_c: information/structure; E_c: noise)

x_ki^c = x̄_i^c + Σ_{a=1}^{A} t_ka^c p'_ai^c + e_ki^c

k = 1, 2, …, N (object, sample); i = 1, 2, …, M (variable); a = 1, 2, …, A (principal component); c = 1, 2, …, C (class)
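A minimal sketch of fitting one class model by SVD, assuming mean-centring and orthonormal loadings (the function name is illustrative):

```python
import numpy as np

def fit_class_model(Xq, A):
    """Fit X = x_bar + T P' + E for one class with A principal
    components, via SVD of the mean-centred class data."""
    Xq = np.asarray(Xq, dtype=float)
    x_bar = Xq.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xq - x_bar, full_matrices=False)
    T = U[:, :A] * s[:A]       # scores (information/structure)
    P = Vt[:A].T               # loadings
    E = Xq - x_bar - T @ P.T   # residuals (noise)
    return x_bar, T, P, E
```

For data that is exactly rank A about its mean, the residual matrix E vanishes; for real data E carries the unique variance and noise.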
PC MODELS
[Figure: 8 × 5 grids showing three diagonal deletion patterns of matrix elements.]
Deletion pattern for matrix elements in the leave-one-group-of-elements-out-at-a-time cross-validation procedure developed by Wold.
CROSS VALIDATING PC MODELS
[Figure: 8 × 10 data matrix with one group of elements marked for deletion.]

E_A = X − x̄ − Σ_{a=1}^{A} t_a p'_a

i) Calculate scores and loadings for PC a+1, t_{a+1} and p'_{a+1}, excluding the elements in one group.
ii) Predict values for the excluded elements: ê_{ki,a+1} = t_{k,a+1} p'_{a+1,i}.
iii) Sum (e_{ki,a} − ê_{ki,a+1})^2 over the excluded elements.
iv) Repeat i)–iii) for all the other groups of elements.
v) Compare Σ_{k,i} (e_{ki,a} − ê_{ki,a+1})^2 with Σ_{k,i} e_{ki,a}^2, adjusted for degrees of freedom.
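A hedged sketch of PRESS-based component selection. To keep the code short it leaves out whole objects (rows) rather than Wold's groups of single matrix elements, so it is a simplified stand-in for the procedure above, not a faithful implementation:

```python
import numpy as np

def press_per_component(X, max_A):
    """Leave-one-object-out PRESS for models with 1..max_A components.
    Simplification: whole objects are deleted, not element groups."""
    X = np.asarray(X, dtype=float)
    press = np.zeros(max_A)
    for k in range(X.shape[0]):
        Xtr = np.delete(X, k, axis=0)   # training set without object k
        x_bar = Xtr.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xtr - x_bar, full_matrices=False)
        e = X[k] - x_bar                # residual after the mean
        for A in range(1, max_A + 1):
            P = Vt[:A].T
            r = e - P @ (P.T @ e)       # residual after A components
            press[A - 1] += float(r @ r)
    return press
```

One keeps adding components as long as the prediction error PRESS keeps dropping appreciably.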
Residual Standard Deviation (RSD)
[Figure: one-component PC model (PC 1) with class limits s_max at p = 0.05 and p = 0.01; s_0 is the mean RSD of the class.]
Mean RSD of class q:  s_0^2 = Σ_{i=1}^{M} Σ_{k=1}^{N_q} e_ki^2 / [(M − A)(N_q − A − 1)]

RSD of object k:  s_k^2 = Σ_{i=1}^{M} e_ki^2 / (M − A)
[Figure: the class box around PC 1: score limits t_min − ½s_t and t_max + ½s_t (t_lower, t_upper) and residual limit s_max.]
CLASSIFICATION OF A NEW OBJECT
i) Fit object to the class model
ii) Compare residual distance of object to the class model with the average residual distance of objects used to obtain the class (F-test)
CLASSIFICATION OF A NEW OBJECT

i) Fit the object to the class model. This defines  e'_{k,0} = x'_k − x̄'_q.

For a = 1, 2, …, A:
  t_ka = ê'_{k,a−1} p_a / ||p_a||
  ê'_{k,a} = ê'_{k,a−1} − t_ka p'_a / ||p_a||

Calculate the residual variance of the object:
  s_k^2 = Σ_{i=1}^{M} ê_{ki,A}^2 / (M − A_q)

ii) Compare the residual distance of the object to the class model with the average residual distance s_0 of the objects used to obtain the class (F-test):
  s_k^2 / s_0^2 <= F_critical => k ∈ class q
  s_k^2 / s_0^2 > F_critical => k ∉ class q
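A sketch of steps i)–ii), assuming orthonormal loadings P from an SVD-fitted class model (so the component loop collapses to one projection); `F_crit` would come from an F-table, and all names are illustrative:

```python
import numpy as np

def classify_new_object(x_new, x_bar, P, s0_sq, F_crit):
    """Fit a new object to the class model (x_bar, P with orthonormal
    columns), compute its residual variance s_k^2 and F-test it
    against the class residual variance s_0^2."""
    M, A = P.shape
    e = np.asarray(x_new, dtype=float) - x_bar  # residual after the mean
    t = P.T @ e                                 # scores on the class loadings
    e = e - P @ t                               # residual after A components
    sk_sq = float(e @ e) / (M - A)
    F = sk_sq / s0_sq
    return F <= F_crit, F

# hypothetical one-component class model along variable 1
x_bar = np.zeros(3)
P = np.array([[1.0], [0.0], [0.0]])
member, F = classify_new_object([5.0, 0.05, 0.05], x_bar, P,
                                s0_sq=0.01, F_crit=4.0)
print(member, F)  # True 0.25
```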
Objects outside the model

Detection of atypical objects
[Figure: class box around PC 1 with residual limit RSD_max and score limits t_min − ½s_t and t_max + ½s_t; object k lies above the residual limit, object l outside the score range.]
Object k: s_k > RSD_max => k is outside the class.
Object l: t_l is outside the "normal area" {t_min − ½s_t, t_max + ½s_t} => calculate the residual distance s_l to the extreme point; s_l > RSD_max => l is outside the class.
Detection of outliers
1. Score plots
2. DIXON-TESTS on each LATENT VARIABLE,
3. Normal plots of scores for each LATENT VARIABLE
4. Test of residuals, F-test (class model)
Q = |t_max − t_{max−1}| / (t_max − t_min) > Q_critical => outlier
MODELLING POWER
DISCRIMINATION POWER

MODELLING POWER
The variables' contribution to the class model q (intra-class variation):

MP_i^q = 1 − s_{i,A}^q / s_{i,0}^q

MP_i = 1.0 => variable i is completely explained by the class model.
MP_i = 0.0 => variable i does NOT contribute to the class model.
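A sketch of MP, assuming an SVD-fitted class model and omitting the degree-of-freedom corrections used in the slide formulas (so the values are slightly biased but the interpretation is the same):

```python
import numpy as np

def modelling_power(Xq, A):
    """MP_i = 1 - s_{i,A}/s_{i,0}: residual standard deviation of
    variable i after the A-component class model, relative to its
    standard deviation about the class mean (no DOF correction)."""
    Xc = np.asarray(Xq, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:A].T                  # orthonormal loadings
    E = Xc - Xc @ P @ P.T         # residuals after A components
    s_iA = np.sqrt((E ** 2).mean(axis=0))
    s_i0 = np.sqrt((Xc ** 2).mean(axis=0))
    return 1.0 - s_iA / s_i0
```

Variables 1 and 2 below follow the latent direction (MP near 1); variable 3 is pure noise (MP near 0).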
DISCRIMINATION POWER
The variables' ability to separate two class models (inter-class variation):

DP_i^{r,q} = 1.0 => no discrimination power
DP_i^{r,q} > 3–4 => "good" discrimination power
(DP_i^{r,q})^2 = [ s_ri^2(q) + s_qi^2(r) ] / [ s_ri^2(r) + s_qi^2(q) ]

where s_ri^2(q) is the residual variance of variable i when the objects of class r are fitted to the class-q model.
[Figure: objects k (class q) and l (class r) with residual distances s_k(q), s_k(r), s_l(q), s_l(r) to the class-q and class-r models.]
SEPARATION BETWEEN CLASSES

Worst ratio:  min_l |s_l(q) / s_l(r)|,  l ∈ r

Class distance:  d_{qr}^2 = [ s_q^2(r) + s_r^2(q) ] / [ s_{0,q}^2 + s_{0,r}^2 ]

d_{qr} > 3–4 => "good separation"
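A sketch of DP under the same SVD-based class models: cross-fitted residual variances over self-fitted ones, per variable (all function names are illustrative):

```python
import numpy as np

def _fit(X, A):
    """Mean and A orthonormal loadings of one class (via SVD)."""
    X = np.asarray(X, dtype=float)
    x_bar = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - x_bar, full_matrices=False)
    return x_bar, Vt[:A].T

def _res_var(X, x_bar, P):
    """Per-variable residual variance of objects X fitted to (x_bar, P)."""
    E = np.asarray(X, dtype=float) - x_bar
    E = E - E @ P @ P.T
    return (E ** 2).mean(axis=0)

def discrimination_power(Xq, Xr, A):
    """DP_i = sqrt[(s_ri^2(q) + s_qi^2(r)) / (s_ri^2(r) + s_qi^2(q))]."""
    mq, Pq = _fit(Xq, A)
    mr, Pr = _fit(Xr, A)
    num = _res_var(Xr, mq, Pq) + _res_var(Xq, mr, Pr)
    den = _res_var(Xr, mr, Pr) + _res_var(Xq, mq, Pq)
    return np.sqrt(num / den)
```

In the test below the classes differ only in variable 1 (mean shift), so DP is large for variable 1 and near 1 for the pure-noise variable 3.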
POLISHED CLASSES
1) Remove “outliers”
2) Remove variables with both low MP (< 0.3–0.4) and low DP (< 2–3)
How does SIMCA differ from other multivariate methods?
i) Models systematic intra-class variation (angular correlation)
ii) Assuming a normally distributed population, the residuals can be
used to decide class membership (F-test)!
iii) “Closed” models
iv) Considers correlation, important for large data sets
v) SIMCA separates noise from systematic (predictive) variation
in each class
Linear Discriminant Analysis (LDA)
Separating surface
• New classes?
• Outliers
• Asymmetric case?
• Looking for dissimilarities
MISSING DATA
[Figure: two class models f_1(x1, x2) and f_2(x1, x2) in the (x1, x2) plane; objects with missing values are marked "?".]
WHEN DOES SIMCA WORK?
1. Similarity between objects in the same class (homogeneous data).
2. Some variables relevant for the problem in question (MP, DP).
3. At least 5 objects and 3 variables.
ALGORITHM FOR SIMCA MODELLING
1. Read raw data.
2. Pretreatment of data (square root, normalise, and more).
3. Variable weighting (standardise).
4. Select subset/class; evaluation of subsets.
5. Cross-validated PC model.
6. Outliers? If yes, remove them and remodel.
7. More classes? If yes, return to step 4.
8. Eliminate variables with low modelling and discrimination power => "polished" subsets. Remodel? If yes, return to step 5.
9. Fit new objects.