Date post: 23-Jan-2021
COMPSTAT 2010: Clustering with Mixed Type Variables and Determination of Cluster Numbers. Hana Řezanková, Dušan Húsek, Tomáš Löster. University of Economics, Prague; ICS, Academy of Sciences of the Czech Republic.

COMPSTAT 2010 1

Clustering with Mixed Type Variables and Determination of Cluster Numbers

Hana Řezanková, Dušan Húsek, Tomáš Löster

University of Economics, Prague
ICS, Academy of Sciences of the Czech Republic


Outline

Motivation
Methods for clustering with mixed type variables
Implementation in software packages
Proposal of new criteria for cluster evaluation
Application
Conclusion


Motivation

Task: We are looking for groups of similar objects (e.g. respondents), i.e. we will concentrate on the problem of object clustering.

The objects are characterized by both quantitative and qualitative (nominal) variables (e.g. respondent opinions, numbers of actions).

The number of clusters is unknown in advance, i.e. we must also determine an appropriate number of clusters.


Methods for clustering with mixed type variables

Using a specialized dissimilarity measure (Gower's coefficient, cluster variability based) and application of agglomerative hierarchical cluster analysis (AHCA)

Clustering objects separately with quantitative and qualitative variables and combining the results by cluster-based similarity partitioning algorithm (CSPA)

Latent class models
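
As an illustration of the first approach, the following is a minimal sketch of Gower's dissimilarity for mixed data, assuming its common unweighted form: range-normalised absolute differences for quantitative variables, simple matching for nominal variables, and a plain average over all variables. The function name `gower_dissimilarity` and the toy data are hypothetical and not taken from the slides.

```python
import numpy as np

def gower_dissimilarity(num, cat):
    """Pairwise Gower dissimilarity (unweighted form, an assumption).

    num: (n, p) array of quantitative variables
    cat: (n, q) array of nominal codes
    """
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    # Range-normalise each quantitative variable; constant columns contribute 0.
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / ranges   # shape (n, n, p)
    # Simple matching for nominal variables: 0 if equal, 1 otherwise.
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)   # shape (n, n, q)
    total = d_num.sum(axis=2) + d_cat.sum(axis=2)
    return total / (num.shape[1] + cat.shape[1])

# Toy data: one quantitative and two nominal variables for three objects.
num = np.array([[1.0], [3.0], [5.0]])
cat = np.array([["a", "x"], ["a", "y"], ["b", "y"]])
D = gower_dissimilarity(num, cat)
```

A matrix of this kind, converted to condensed form, can be passed to an agglomerative routine such as `scipy.cluster.hierarchy.linkage` to carry out the AHCA step.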


Implementation in software packages

Specialized dissimilarity measures
- are not implemented for AHCA

Clustering objects with qualitative variables
- is implemented only rarely (disagreement coef.)

Cluster-based similarity partitioning algorithm
- is not implemented but it could be realized

LC Cluster models (Latent GOLD)

Log-likelihood distance measure between clusters
- implemented in two-step cluster analysis (SPSS)
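
To indicate how the cluster-based similarity partitioning algorithm could be realized, here is a hedged sketch of its first step only: building the co-association (similarity) matrix from several partitions of the same objects. In the usual description of CSPA this matrix is then handed to a graph partitioner; that step is omitted here. The function name and the two label vectors are hypothetical.

```python
import numpy as np

def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster."""
    partitions = np.asarray(partitions)                        # shape (r, n)
    same = partitions[:, :, None] == partitions[:, None, :]    # shape (r, n, n)
    return same.mean(axis=0)

# Hypothetical results of two separate clusterings of four objects.
labels_quant = [0, 0, 1, 1]   # from the quantitative variables
labels_qual = [0, 1, 1, 1]    # from the qualitative variables
S = co_association([labels_quant, labels_qual])
```

Any similarity-based clustering method can then be applied to `S` (or to the dissimilarity `1 - S`) to obtain the combined partition.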


Implementation in software packages

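Log-likelihood distance measure between clusters
- implemented in two-step cluster analysis (SPSS)

$$D_{h,h'} = \xi_h + \xi_{h'} - \xi_{\langle h,h' \rangle}$$

$$\xi_g = -n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$

$$H_{gl} = -\sum_{u=1}^{K_l} \frac{n_{glu}}{n_g} \ln \frac{n_{glu}}{n_g} \quad \ldots \text{ entropy}$$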
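
The log-likelihood distance can be sketched in a few lines. The code below is an illustration under stated assumptions, not the SPSS implementation: n_g is the cluster size, s_l^2 the variance of the l-th quantitative variable in the whole data set (passed in as `total_var`), s_gl^2 its variance within cluster g (computed here as the population variance, which is an assumption), and H_gl the entropy of the l-th qualitative variable within the cluster. All function names are hypothetical.

```python
import numpy as np

def entropy(codes):
    """H = -sum_u p_u ln p_u for one nominal variable."""
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def xi(num, cat, total_var):
    """xi_g = -n_g * (sum_l 0.5*ln(s_l^2 + s_gl^2) + sum_l H_gl) for one cluster."""
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    n_g = num.shape[0]
    cluster_var = num.var(axis=0)            # s_gl^2, population variance (assumption)
    cont = 0.5 * np.log(np.asarray(total_var) + cluster_var).sum()
    disc = sum(entropy(cat[:, l]) for l in range(cat.shape[1]))
    return -n_g * (cont + disc)

def loglik_distance(num_h, cat_h, num_k, cat_k, total_var):
    """D = xi_h + xi_h' - xi_<h,h'>, where <h,h'> is the merged cluster."""
    merged_num = np.vstack([num_h, num_k])
    merged_cat = np.vstack([cat_h, cat_k])
    return (xi(num_h, cat_h, total_var) + xi(num_k, cat_k, total_var)
            - xi(merged_num, merged_cat, total_var))

# Two pure clusters that differ only in the category of one nominal variable.
d = loglik_distance([[0.0], [0.0]], [["a"], ["a"]],
                    [[0.0], [0.0]], [["b"], ["b"]],
                    total_var=[1.0])
```

In this toy case both clusters have zero entropy and zero within-cluster variance, so the distance is driven entirely by the entropy of the merged cluster, 4 ln 2.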


Implementation in software packages

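Log-likelihood distance measure between objects
- implemented in two-step cluster analysis (SPSS)

$$D_{h,h'} = \xi_h + \xi_{h'} - \xi_{\langle h,h' \rangle}$$

$$\xi_g = -n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$

$$D(\mathbf{x}_i, \mathbf{x}_j) = -\xi_{\langle \mathbf{x}_i, \mathbf{x}_j \rangle}$$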


Evaluation criteria implemented in software packages

BIC (Bayesian Information Criterion) and AIC (Akaike Information Criterion)
- implemented in two-step cluster analysis (SPSS)

$$I_{\mathrm{BIC}} = -2 \sum_{g=1}^{k} \xi_g + w_k \ln(n)$$

$$I_{\mathrm{AIC}} = -2 \sum_{g=1}^{k} \xi_g + 2 w_k$$

$$w_k = k \left( 2 m^{(1)} + \sum_{l=1}^{m^{(2)}} (K_l - 1) \right)$$

... minimum

Used only for the initial estimation of the number of clusters
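
A small sketch of how the two criteria follow from the per-cluster values ξ_g is given below. The function names and the numbers in the example are hypothetical; `m_quant` stands for the number of quantitative variables and `categories` for the list of category counts K_l of the qualitative variables.

```python
import math

def n_parameters(k, m_quant, categories):
    """w_k = k * (2*m1 + sum_l (K_l - 1))."""
    return k * (2 * m_quant + sum(K - 1 for K in categories))

def bic(xis, w_k, n):
    """I_BIC = -2 * sum_g xi_g + w_k * ln(n)."""
    return -2.0 * sum(xis) + w_k * math.log(n)

def aic(xis, w_k):
    """I_AIC = -2 * sum_g xi_g + 2 * w_k."""
    return -2.0 * sum(xis) + 2.0 * w_k

# Hypothetical two-cluster solution: one quantitative variable and two
# qualitative variables with 2 and 3 categories, n = 100 objects.
w_2 = n_parameters(k=2, m_quant=1, categories=[2, 3])
bic_2 = bic([-10.0, -20.0], w_2, n=100)
aic_2 = aic([-10.0, -20.0], w_2)
```

Both criteria add a penalty that grows linearly in k to the same goodness-of-fit term, so they differ only in how strongly additional clusters are penalised.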


Proposed evaluation criteria

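Within-cluster variability for k clusters:

$$V_W(k) = -\sum_{g=1}^{k} \xi_g = \sum_{g=1}^{k} n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$

Variability of the whole data set:

$$V_T = V_W(1) = n \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(2 s_l^2\right) + \sum_{l=1}^{m^{(2)}} H_l \right)$$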


Proposed evaluation criteria

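Within-cluster variability for k clusters:

$$V_W(k) = -\sum_{g=1}^{k} \xi_g = \sum_{g=1}^{k} n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$

Difference:

$$V_{\mathrm{diff}}(k) = V_W(k-1) - V_W(k)$$

It should be maximal for the suitable number of clusters.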


Evaluation criteria modified for qualitative variables

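1. Uncertainty index (R-square (RSQ) index)

$$I_U(k) = \frac{V_B(k)}{V_T} = \frac{V_T - V_W(k)}{V_T} = \frac{V_W(1) - V_W(k)}{V_W(1)}$$

where V_W(k) is the within-cluster variability for k clusters and V_T = V_W(1) is the variability of the whole data set.

2. Semipartial uncertainty index (optimal number of clusters: minimum)

$$I_{SPU}(k) = I_U(k+1) - I_U(k)$$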


Evaluation criteria modified for qualitative variables

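3. Pseudo F (Calinski and Harabasz) index: PSF (SAS), CHF (SYSTAT)

$$I_{CHFU}(k) = \frac{(n-k)\, V_B(k)}{(k-1)\, V_W(k)} = \frac{(n-k)\left(V_W(1) - V_W(k)\right)}{(k-1)\, V_W(k)}$$

4. Pseudo T-squared statistic: PST2 (SAS), PTS (SYSTAT)

$$I_{PTSU}(k) = \frac{\xi_h + \xi_{h'} - \xi_{\langle h,h' \rangle}}{-\left(\xi_h + \xi_{h'}\right) / \left(n_h + n_{h'} - 2\right)}$$

where h and h' are the two clusters being merged and $-\xi_g$ is the variability within cluster g.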


Evaluation criteria modified for qualitative variables

[Figure: SYSTAT output]


Evaluation criteria modified for qualitative variables

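5. Modified Davies and Bouldin (DB) index

$$I_{DB}(k) = \frac{1}{k} \sum_{h=1}^{k} \max_{h' \ne h} \frac{s_{D,h} + s_{D,h'}}{D_{h,h'}}$$

$$I_{DBU}(k) = \frac{1}{k} \sum_{h=1}^{k} \max_{h' \ne h} \frac{-\left(\xi_h + \xi_{h'}\right)}{\xi_h + \xi_{h'} - \xi_{\langle h,h' \rangle}}$$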


Evaluation criteria modified for qualitative variables

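6. Dunn's index

$$I_D(k) = \min_{1 \le h \le k} \left\{ \min_{1 \le h' \le k,\; h' \ne h} \left\{ \frac{D_{h,h'}}{\max_{1 \le g \le k} \mathrm{diam}_g} \right\} \right\}$$

$$D_{h,h'} = \min_{\mathbf{x}_i \in C_h,\; \mathbf{x}_j \in C_{h'}} D(\mathbf{x}_i, \mathbf{x}_j)$$

$$\mathrm{diam}_g = \max_{\mathbf{x}_i, \mathbf{x}_j \in C_g} D(\mathbf{x}_i, \mathbf{x}_j)$$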
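
Dunn's index needs only a matrix of pairwise object distances and the cluster labels. The sketch below assumes a precomputed symmetric distance matrix `D` (for mixed-type data this could be a log-likelihood or Gower distance) and at least one cluster with two or more distinct objects, since otherwise the largest diameter is zero. The function name and the toy data are hypothetical.

```python
import numpy as np

def dunn_index(D, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # diam_g: largest distance between two objects of the same cluster.
    max_diam = max(D[np.ix_(labels == g, labels == g)].max() for g in clusters)
    # D_{h,h'}: smallest distance between objects of two different clusters.
    min_sep = min(D[np.ix_(labels == h, labels == hp)].min()
                  for i, h in enumerate(clusters) for hp in clusters[i + 1:])
    return min_sep / max_diam

# Four objects on a line at positions 0, 1, 10, 11, split into two clusters.
x = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(x[:, None] - x[None, :])
value = dunn_index(D, [0, 0, 1, 1])
```

Larger values indicate compact, well-separated clusters, so the number of clusters with the highest index would be preferred.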


Modified evaluation criteria

Cluster variability based on the variance and Gini's coefficient of mutability

$$G(k) = \sum_{g=1}^{k} G_g$$

$$G_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} G_{gl} \right)$$

$$G_{gl} = 1 - \sum_{u=1}^{K_l} \left( \frac{n_{glu}}{n_g} \right)^2 \quad \ldots \text{ Gini's coefficient of mutability}$$

$$I_{\mathrm{BGC}} = 2 \sum_{g=1}^{k} G_g + w_k \ln(n)$$
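
The only change relative to the entropy-based measure is the term used for the qualitative variables. The following sketch contrasts the two on the same nominal variable; the function names are hypothetical.

```python
import numpy as np

def category_shares(codes):
    """Relative frequencies n_glu / n_g of the categories of one nominal variable."""
    _, counts = np.unique(codes, return_counts=True)
    return counts / counts.sum()

def entropy(codes):
    """H = -sum_u p_u ln p_u."""
    p = category_shares(codes)
    return float(-(p * np.log(p)).sum())

def gini_mutability(codes):
    """G = 1 - sum_u p_u^2 (Gini's coefficient of mutability)."""
    p = category_shares(codes)
    return float(1.0 - (p ** 2).sum())

balanced = ["a", "a", "b", "b"]
pure = ["a", "a", "a", "a"]
g_balanced, h_balanced = gini_mutability(balanced), entropy(balanced)
g_pure, h_pure = gini_mutability(pure), entropy(pure)
```

Both measures are zero for a cluster in which every object falls into the same category and grow as the categories become more evenly mixed; they differ in scale, which is why the resulting variability values are not directly comparable between the two families of criteria.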


Evaluation criteria modified for qualitative variables

1. Tau index (RSQ index)

$$I_{\tau}(k) = \frac{V_B}{V_T} = \frac{V_T - V_W}{V_T} = \frac{G(1) - G(k)}{G(1)}$$

2. Semipartial tau index (optimal number of clusters: minimum)

$$I_{SP\tau}(k) = I_{\tau}(k+1) - I_{\tau}(k)$$


Application to a real data file

Data from a questionnaire survey (for the participants of the chemistry seminar)

7 qualitative and 1 quantitative (count) variables

Two-step cluster analysis for clustering of respondents (experiments for the numbers of clusters from 2 to 4)

LC Cluster model (experiments for the numbers of clusters from 2 to 6); the quantitative variable was recoded to 5 categories


Application to a real data file

Criteria based on the entropy (TSCA in SPSS)

                              Number of clusters
Measure                            1        2        3        4
Within-cluster variability    273.92   241.17   206.39   186.51
Variability difference             -    32.75    34.78    19.88
I_U                                0     0.12     0.25     0.32
I_SPU                           0.12     0.13     0.07        -
I_CHFU                             0     6.52     7.69     7.19
I_BIC                         590.85   568.41   541.88   545.15
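
The entries of the entropy-based table can be recomputed from its first row. The sketch below does so under two assumptions that are not stated in the slides: a sample size of n = 50, inferred from the reported I_CHFU values, and 11 free parameters per cluster, inferred from the reported I_BIC values. It also uses the relation that the goodness-of-fit term of I_BIC equals twice the within-cluster variability.

```python
import math

# Within-cluster variability V_W(k), copied from the entropy-based table.
within = {1: 273.92, 2: 241.17, 3: 206.39, 4: 186.51}
v_total = within[1]      # V_T = V_W(1)
n = 50                   # ASSUMPTION: inferred from the reported I_CHFU values
w_1 = 11                 # ASSUMPTION: parameters per cluster, inferred from I_BIC

def variability_difference(k):
    return within[k - 1] - within[k]

def uncertainty_index(k):
    return (v_total - within[k]) / v_total

def pseudo_f(k):
    return (n - k) * (v_total - within[k]) / ((k - 1) * within[k])

def bic(k):
    # I_BIC = 2 * V_W(k) + w_k * ln(n), with w_k = k * w_1
    return 2 * within[k] + k * w_1 * math.log(n)

for k in (2, 3, 4):
    print(k,
          round(variability_difference(k), 2),
          round(uncertainty_index(k), 2),
          round(pseudo_f(k), 2),
          round(bic(k), 2))
```

Under these assumptions the recomputed values agree with the table up to rounding of the inputs, and the variability difference, the pseudo F index and I_BIC all point to three clusters.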


Application to a real data file

Criteria based on Gini's coefficient (TSCA in SPSS)

                              Number of clusters
Measure                            1        2        3        4
Within-cluster variability    185.41   162.57   137.83   127.86
Variability difference             -    22.84    24.74     9.97
I_τ                                0     0.12     0.26     0.31
I_SPτ                           0.12     0.13     0.05        -
I_CHFτ                             0     6.74     8.11     6.90
I_BGC                         413.85   411.20   404.75   427.84


Application to a real data file

Comparison of BIC

                       Number of clusters
Method                     1         2         3         4
Two-step CA           590.85    568.41    541.88    545.15
LC Cluster Model     1397.01   1059.24   1019.18   1036.90


Conclusion

If the distance between objects, the distance between clusters, the within-cluster variability and the total variability are defined for the case when objects are characterized by mixed-type variables, then the evaluation criteria for quantitative variables can be modified.

One possibility is to apply the log-likelihood distance measure based on the entropy.

Another possibility is to use the analogous measure based on Gini's coefficient.


Thank you for your attention

