8/10/2019 Business Intelligence & Data Mining-8
Clustering
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters

Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes

Typical applications
As a stand-alone tool to get insight into the data distribution
As a preprocessing step for other algorithms (detecting outliers/noise, selecting interesting subspaces, assessing clustering tendency)
General Applications of Clustering
Pattern recognition
Spatial data analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data mining
Image processing
Economic science (especially market research)
WWW
Document classification
Cluster weblog data to discover groups of similar access patterns
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to
their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters
should be clustered along continent faults
What Is Good Clustering?
A good clustering method will produce clusters with
high intra-class similarity
low inter-class similarity

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

The quality of a clustering result depends on both the similarity/dissimilarity measure used by the method and its logic.
Requirements of Clustering Algorithms
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Ability to deal with noise and outliers
Ability to cope with high dimensionality
Interpretability and usability
Dissimilarity Metric
Dissimilarity/similarity metric: dissimilarity is expressed in terms of a distance function, which is typically a metric: d(i, j)

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.

Weights can be associated with different variables based on the application and data semantics.
Data Structures
Data matrix (n objects by p variables):

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Dissimilarity matrix (symmetric mode):

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
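As a minimal sketch (assuming numeric interval-scaled data and Euclidean distance; the function names below are illustrative, not from any particular library), the dissimilarity matrix in symmetric mode can be derived from the data matrix like this:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two p-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(data):
    """Lower-triangular part of the symmetric matrix: d(i, j) for j < i."""
    n = len(data)
    return [[euclidean(data[i], data[j]) for j in range(i)] for i in range(n)]

data = [[1.0, 2.0], [4.0, 6.0], [1.0, 1.0]]   # data matrix: 3 objects, p = 2
D = dissimilarity_matrix(data)
# D[1][0] = d(2,1) = 5.0, D[2][0] = d(3,1) = 1.0
```

Only the lower triangle is stored, since d(i,j) = d(j,i) and the diagonal is zero.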
Data Types of Attributes
Interval-scaled variables
Binary variables
Nominal and ordinal variables
Ratio variables
Similarity and Dissimilarity Between
Objects
Distances are normally used to measure the similarity
or dissimilarity between two data objects
Some popular ones include the Minkowski distance:

\[
d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}
\]

where \(i = (x_{i1}, x_{i2}, \dots, x_{ip})\) and \(j = (x_{j1}, x_{j2}, \dots, x_{jp})\) are two p-dimensional data objects, and q is a positive integer.

If q = 1, d is the Manhattan distance:

\[
d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
\]
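A minimal sketch of the Minkowski distance in plain Python (the function name is illustrative); q = 1 recovers the Manhattan distance and q = 2 the Euclidean distance:

```python
def minkowski(i, j, q):
    """Minkowski distance between two p-dimensional data objects."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
manhattan = minkowski(i, j, 1)   # q = 1: 3 + 4 + 0 = 7.0
euclid = minkowski(i, j, 2)      # q = 2: sqrt(9 + 16 + 0) = 5.0
```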
Similarity and Dissimilarity Between
Objects (Cont.)
If q = 2, d is the Euclidean distance:

\[
d(i,j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 }
\]

Properties:
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
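The four metric properties can be checked numerically on a small point set; a sketch (the point set is an arbitrary illustration):

```python
import itertools
import math

def euclidean(i, j):
    """Euclidean distance (the q = 2 Minkowski distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0), (1.0, 1.0)]
for i, j, k in itertools.product(points, repeat=3):
    assert euclidean(i, j) >= 0                  # non-negativity
    assert euclidean(i, i) == 0                  # identity
    assert euclidean(i, j) == euclidean(j, i)    # symmetry
    # triangle inequality (small epsilon for floating-point rounding)
    assert euclidean(i, j) <= euclidean(i, k) + euclidean(k, j) + 1e-12
```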
Interval-valued variables
Normalize the data:

Calculate the mean absolute deviation:

\[
s_f = \frac{1}{n}\left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right)
\]

where

\[
m_f = \frac{1}{n}\left( x_{1f} + x_{2f} + \cdots + x_{nf} \right)
\]

Calculate the standardized measurement (z-score):

\[
z_{if} = \frac{x_{if} - m_f}{s_f}
\]

The mean absolute deviation is used more often in clustering because it is more robust than the standard deviation.
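A sketch of this standardization for one variable f (the function name is illustrative):

```python
def standardize(column):
    """Z-scores for one variable, using the mean absolute deviation s_f
    rather than the standard deviation (more robust to outliers)."""
    n = len(column)
    m_f = sum(column) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in column) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in column]

z = standardize([2.0, 4.0, 6.0, 8.0])
# m_f = 5, s_f = (3 + 1 + 1 + 3) / 4 = 2  ->  z = [-1.5, -0.5, 0.5, 1.5]
```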
Binary Variables
A contingency table for binary data (counts over the p binary variables; rows = object i, columns = object j):

         1     0    sum
   1     a     b    a+b
   0     c     d    c+d
  sum   a+c   b+d    p

Mismatch coefficient:

\[
d(i,j) = \frac{b + c}{a + b + c + d}
\]
Dissimilarity between Binary Variables
Example
Let the values M, Y, and P be set to 1, and the values F and N be set to 0.
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
\[
d(\mathrm{jack}, \mathrm{mary}) = \frac{2}{7} = 0.29
\]
\[
d(\mathrm{jack}, \mathrm{jim}) = \frac{2}{7} = 0.29
\]
\[
d(\mathrm{jim}, \mathrm{mary}) = \frac{4}{7} = 0.57
\]
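Encoding the table as 0/1 vectors (M, Y, P → 1; F, N → 0) reproduces these values; a sketch:

```python
def mismatch(i, j):
    """Mismatch coefficient d(i,j) = (b + c) / (a + b + c + d):
    the fraction of the p binary variables on which i and j disagree."""
    return sum(x != y for x, y in zip(i, j)) / len(i)

# columns: Gender, Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = (1, 1, 0, 1, 0, 0, 0)
mary = (0, 1, 0, 1, 0, 1, 0)
jim  = (1, 1, 1, 0, 0, 0, 0)

print(round(mismatch(jack, mary), 2))  # 0.29  (2 mismatches out of 7)
print(round(mismatch(jack, jim), 2))   # 0.29
print(round(mismatch(jim, mary), 2))   # 0.57  (4 mismatches out of 7)
```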
Nominal Variables
A generalization of the binary variable in that it can
take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching

\[
d(i,j) = \frac{p - m}{p}
\]

where m is the number of matches and p is the total number of variables.

Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.
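A sketch of Method 1 (the example attribute values are illustrative):

```python
def simple_matching(i, j):
    """Simple matching dissimilarity d(i,j) = (p - m) / p for nominal data."""
    p = len(i)                                # total number of variables
    m = sum(x == y for x, y in zip(i, j))     # number of matching states
    return (p - m) / p

d = simple_matching(("red", "small", "round"), ("red", "large", "round"))
# m = 2 matches out of p = 3 variables  ->  d = 1/3
```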
Ordinal Variables
An ordinal variable can be discrete or continuous
order is important, e.g., rank
Can be treated like interval-scaled:

replace \(x_{if}\) by its rank \(r_{if} \in \{1, \dots, M_f\}\)

map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

\[
z_{if} = \frac{r_{if} - 1}{M_f - 1}
\]

compute the dissimilarity using methods for interval-scaled variables
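A sketch of the rank-then-rescale mapping (the level names are an illustrative example):

```python
def ordinal_to_interval(values, levels):
    """Replace each ordinal value by its rank r_if in {1, ..., M_f},
    then map onto [0, 1] via z_if = (r_if - 1) / (M_f - 1)."""
    rank = {level: r for r, level in enumerate(levels, start=1)}
    M_f = len(levels)
    return [(rank[v] - 1) / (M_f - 1) for v in values]

z = ordinal_to_interval(["bronze", "gold", "silver"],
                        levels=["bronze", "silver", "gold"])
# ranks 1, 3, 2 with M_f = 3  ->  z = [0.0, 1.0, 0.5]
```

The resulting z values can be fed directly into any interval-scaled distance, e.g. Euclidean.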
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as \(Ae^{Bt}\) or \(Ae^{-Bt}\)

Methods:
treat them like interval-scaled variables (not a good choice! why?)
apply a logarithmic transformation: \(y_{if} = \log(x_{if})\)
treat them as continuous ordinal data
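The effect of the log transform can be seen on a small example (the readings below are hypothetical values growing roughly like \(Ae^{Bt}\)):

```python
import math

# hypothetical ratio-scaled readings, growing roughly exponentially
readings = [1.0, 2.7, 7.4, 20.1, 54.6]
transformed = [math.log(x) for x in readings]
# after y_if = log(x_if) the values are roughly evenly spaced,
# so interval-scaled distance measures behave sensibly on them
```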
Heuristic Solutions to Clustering
Exhaustive enumeration is computationally prohibitive: even for small problem sizes (e.g., n = 25 objects, m = 5 clusters), the number of possible partitions evaluates to 2,436,684,974,110,751.

Partitioning algorithms: construct various partitions and then evaluate them by some criterion.

Hierarchical algorithms: partition the data into a nested sequence of partitions. There are two approaches:
Start with n clusters (where n is the number of objects) and iteratively merge pairs of clusters: agglomerative algorithms.
Start by considering all the objects to be in one cluster and iteratively split one cluster into two at each step: divisive algorithms.
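The partition count above is a Stirling number of the second kind, which can be checked with the standard recurrence; a sketch:

```python
def stirling2(n, m):
    """Number of ways to partition n objects into m nonempty clusters
    (Stirling number of the second kind), via the recurrence
    S(n, m) = m * S(n-1, m) + S(n-1, m-1)."""
    S = [[0] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = j * S[i - 1][j] + S[i - 1][j - 1]
    return S[n][m]

print(stirling2(25, 5))  # 2436684974110751, the count quoted above
```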
Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n objects into a set of k clusters

Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimum: exhaustively enumerate all partitions
Heuristic methods: the k-means and k-medoids algorithms
k-means (MacQueen, 1967): each cluster is represented by the center of the cluster
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in
4 steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no new assignments.
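The four steps can be sketched in plain Python (a minimal illustration, assuming Euclidean distance and random initial seeds; not a production implementation):

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid,
    recompute centroids as cluster means, repeat until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: k initial seed points
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]         # step 3: nearest seed point
        if new_assign == assign:               # step 4: stop when no change
            break
        assign = new_assign
        for c in range(k):                     # step 2: centroid = mean point
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members)
                                     for d in zip(*members))
    return assign, centroids

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
assign, centroids = kmeans(points, k=2)
```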
The K-Means Clustering Method

Example

[Figure: four scatter plots (axes 0 to 10) showing successive k-means iterations on the same data set.]
Comments on the K-Means Method

Strength
Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations.
Normally, k, t << n.
Variations of the K-Means Method

A few variants of k-means differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means

Handling categorical data: k-modes (Huang, 1998)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters

Handling a mixture of categorical and numerical data: the k-prototype method
The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids, 1987)
starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well for large data sets

CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): randomized sampling
Focusing + spatial data structure (Ester et al., 1995)
PAM (Partitioning Around Medoids)
(1987)
PAM (Kaufman and Rousseeuw, 1987)

Uses real objects to represent the clusters:

1. Select k representative objects arbitrarily.
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih.
3. For each pair of i and h, if TC_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object.
4. Repeat steps 2-3 until there is no change.
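A greedy sketch of the swap loop (a simplification that accepts any improving swap immediately rather than the best one per pass; Euclidean distance and the point set are illustrative assumptions):

```python
import math

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Swap a medoid i for a non-medoid h whenever the swapping cost
    TC_ih < 0; repeat until a full pass makes no change."""
    medoids = points[:k]                     # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for i in list(medoids):
            for h in points:
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                tc_ih = total_cost(points, candidate) - total_cost(points, medoids)
                if tc_ih < 0:                # the swap improves the clustering
                    medoids = candidate
                    improved = True
    return medoids

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (0.5, 0.5)]
meds = pam(points, k=2)
```

Because medoids are actual data objects, the result is less sensitive to outliers than a mean-based centroid.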
Hierarchical Clustering
Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.
[Figure: objects a, b, c, d, e merged step by step (a+b, d+e, then c+de, finally all five). Read left to right (Step 0 to Step 4) for agglomerative clustering (AGNES); read right to left for divisive clustering (DIANA).]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
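The merge loop can be sketched as follows (a minimal illustration with Euclidean distances and a stop-at-k parameter standing in for the termination condition; not the S-Plus implementation):

```python
import math

def single_link_agnes(points, k_stop=1):
    """Agglomerative clustering with single-link distance: start from
    n singleton clusters, repeatedly merge the closest pair of clusters."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > k_stop:
        # single link: cluster distance = distance of the closest member pair
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(math.dist(p, q)
                                      for p in clusters[ij[0]]
                                      for q in clusters[ij[1]]))
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
clusters, merges = single_link_agnes(points, k_stop=2)
```

With k_stop=1 the loop runs until all objects belong to a single cluster, as on the slide; the recorded merges form the dendrogram.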
[Figure: three scatter plots (axes 0 to 10) showing AGNES progressively merging clusters.]
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g.,
Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own
[Figure: three scatter plots (axes 0 to 10) showing DIANA progressively splitting one cluster into smaller ones.]