Partitioning Algorithms: Basic Concepts

Partition n objects into k clusters
Optimize the chosen partitioning criterion
  Example: minimize the squared error

Squared error of a cluster, where m_i is the mean (centroid) of C_i:
  Error(C_i) = \sum_{p \in C_i} d(p, m_i)^2

Squared error of a clustering:
  Error = \sum_{i=1}^{k} Error(C_i) = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2
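As a minimal sketch (not part of the original slides), the squared-error criterion can be computed directly from these definitions. The helpers below assume Euclidean distance and points stored as plain tuples; later snippets reuse them.

    def sq_dist(p, q):
        """Squared Euclidean distance d(p, q)^2 between two points."""
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def centroid(cluster):
        """Mean (centroid) m_i of a cluster of points."""
        n = len(cluster)
        return tuple(sum(coords) / n for coords in zip(*cluster))

    def cluster_error(cluster, center):
        """Error(C_i): sum over p in C_i of d(p, m_i)^2."""
        return sum(sq_dist(p, center) for p in cluster)

    def clustering_error(clusters):
        """Error: sum over i of Error(C_i), with each m_i recomputed as the mean."""
        return sum(cluster_error(c, centroid(c)) for c in clusters)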
Example of Squared Error of a Cluster

[Figure: points P1, P2, P3 and centroid m_i plotted on a 10 x 10 grid]

C_i = {P1, P2, P3}
P1 = (3, 7), P2 = (2, 3), P3 = (7, 5), m_i = (4, 5)

d(P1, m_i)^2 = (3-4)^2 + (7-5)^2 = 5
d(P2, m_i)^2 = (2-4)^2 + (3-5)^2 = 8
d(P3, m_i)^2 = (7-4)^2 + (5-5)^2 = 9

Error(C_i) = 5 + 8 + 9 = 22
Example of Squared Error of a Cluster (cont.)

[Figure: points P4, P5, P6 and centroid m_j plotted on a 10 x 10 grid]

C_j = {P4, P5, P6}
P4 = (4, 6), P5 = (5, 5), P6 = (3, 4), m_j = (4, 5)

d(P4, m_j)^2 = (4-4)^2 + (6-5)^2 = 1
d(P5, m_j)^2 = (5-4)^2 + (5-5)^2 = 1
d(P6, m_j)^2 = (3-4)^2 + (4-5)^2 = 2

Error(C_j) = 1 + 1 + 2 = 4, much smaller than Error(C_i) = 22: C_j is the tighter cluster.
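As a quick check (again, an illustration rather than slide content), the helpers above reproduce both example errors:

    clusters = [[(3, 7), (2, 3), (7, 5)],    # C_i
                [(4, 6), (5, 5), (3, 4)]]    # C_j
    print(cluster_error(clusters[0], centroid(clusters[0])))   # 22.0
    print(cluster_error(clusters[1], centroid(clusters[1])))   # 4.0
    print(clustering_error(clusters))                          # 26.0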
Partitioning Algorithms: Basic Concepts

Global optimum: examine all possible partitions
  about k^n possible partitions -- too expensive!
Heuristic methods: k-means and k-medoids
  k-means (MacQueen '67): each cluster is represented by the center (mean) of the cluster
  k-medoids (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects (the medoid) in the cluster
K-means

Initialization:
  Arbitrarily choose k objects as the initial cluster centers (centroids)
Iterate until no change:
  For each object O_i:
    Calculate the distances between O_i and the k centroids
    (Re)assign O_i to the cluster whose centroid is closest to O_i
  Update the cluster centroids based on the current assignment
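A minimal runnable sketch of this loop (an illustration, not the slides' code; it reuses sq_dist and centroid from the earlier snippet and, as one arbitrary choice, takes the first k objects as initial centroids):

    def k_means(objects, k, max_iter=100):
        """Lloyd-style k-means: assign to the nearest centroid, then recompute means."""
        centroids = [tuple(o) for o in objects[:k]]   # arbitrary initial centers
        assignment = None
        for _ in range(max_iter):
            # (Re)assign each object to the cluster with the closest centroid
            new_assignment = [min(range(k), key=lambda i: sq_dist(o, centroids[i]))
                              for o in objects]
            if new_assignment == assignment:          # no change: converged
                break
            assignment = new_assignment
            # Update centroids based on the current assignment
            for i in range(k):
                members = [o for o, a in zip(objects, assignment) if a == i]
                if members:
                    centroids[i] = centroid(members)
        return assignment, centroids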
k-Means Clustering Method

[Figure: four 10 x 10 scatter plots showing one k-means iteration -- the
current clusters with their cluster means, objects relocated to the nearest
mean, and the resulting new clusters]
Example

For simplicity, one-dimensional objects and k = 2.
Objects: 1, 2, 5, 6, 7
K-means (see the run below):
  Randomly select 5 and 6 as the initial centroids
  => two clusters {1, 2, 5} and {6, 7}; mean_C1 = 8/3, mean_C2 = 6.5
  => {1, 2} and {5, 6, 7}; mean_C1 = 1.5, mean_C2 = 6
  => no change.
Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
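Running the k_means sketch above on this example reaches the same final partition (starting from centroids 1 and 2 rather than the slide's 5 and 6, since the sketch seeds from the first k objects; the fixed point is the same here):

    objects = [(1,), (2,), (5,), (6,), (7,)]
    assignment, centers = k_means(objects, k=2)
    print(assignment)   # [0, 0, 1, 1, 1] -> clusters {1, 2} and {5, 6, 7}
    print(centers)      # [(1.5,), (6.0,)]
    print(clustering_error([[(1,), (2,)], [(5,), (6,), (7,)]]))   # 2.5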
Variations of the k-Means Method

Aspects in which variants of k-means differ:
  Selection of the initial k centroids
    E.g., choose the k farthest points
  Dissimilarity calculations
    E.g., use Manhattan distance (see the sketch below)
  Strategies to calculate cluster means
    E.g., update the means incrementally
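For instance, a Manhattan-distance dissimilarity is a one-line swap for sq_dist in the sketches above (an illustration, not from the slides):

    def manhattan(p, q):
        # d(p, q) = sum of |p_a - q_a| over all attributes a
        return sum(abs(a - b) for a, b in zip(p, q))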
Strengths of the k-Means Method

Relatively efficient for large data sets:
  O(tkn), where n is the number of objects, k the number of clusters, and
  t the number of iterations; normally k, t << n
Often terminates at a local optimum; the global optimum may be found using
techniques such as deterministic annealing and genetic algorithms
Weaknesses of the k-Means Method

Applicable only when the mean is defined; what about categorical data?
  -> k-modes algorithm
Unable to handle noisy data and outliers
  -> k-medoids algorithms
Need to specify k, the number of clusters, in advance
  -> hierarchical and density-based algorithms
k-modes Algorithm

Handling categorical data: k-modes (Huang '98)
Replaces the means of clusters with modes
Given n records in a cluster, the mode is the record made up of the most
frequent attribute values:

  age     income  student  credit_rating
  <=30    high    no       fair
  <=30    high    no       excellent
  31…40   high    no       fair
  >40     medium  no       fair
  >40     low     yes      fair
  >40     low     yes      excellent
  31…40   low     yes      excellent
  <=30    medium  no       fair
  <=30    low     yes      fair
  >40     medium  yes      fair
  <=30    medium  yes      excellent
  31…40   medium  no       excellent
  31…40   high    yes      fair

In this example cluster, mode = (<=30, medium, yes, fair)
Uses new dissimilarity measures to deal with categorical objects
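A quick sketch of the mode computation (an illustration; each attribute's most frequent value is taken independently, column by column):

    from collections import Counter

    def mode_record(records):
        """Most frequent value of each attribute, computed per column."""
        return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

    records = [
        ("<=30", "high", "no", "fair"),       ("<=30", "high", "no", "excellent"),
        ("31…40", "high", "no", "fair"),      (">40", "medium", "no", "fair"),
        (">40", "low", "yes", "fair"),        (">40", "low", "yes", "excellent"),
        ("31…40", "low", "yes", "excellent"), ("<=30", "medium", "no", "fair"),
        ("<=30", "low", "yes", "fair"),       (">40", "medium", "yes", "fair"),
        ("<=30", "medium", "yes", "excellent"),
        ("31…40", "medium", "no", "excellent"),
        ("31…40", "high", "yes", "fair"),
    ]
    print(mode_record(records))   # ('<=30', 'medium', 'yes', 'fair')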
A Problem of k-Means: Sensitive to Outliers

Outlier: an object with extremely large (or small) values
May substantially distort the distribution of the data

[Figure: a cluster mean dragged away from the cluster by a single outlier]
k-Medoids Clustering Method

k-medoids: find k representative objects, called medoids

  PAM (Partitioning Around Medoids, 1987)
  CLARA (Kaufmann & Rousseeuw, 1990)
  CLARANS (Ng & Han, 1994): randomized sampling

[Figure: two 10 x 10 plots comparing k-means and k-medoids on the same data]
PAM (Partitioning Around Medoids) (1987)

PAM (Kaufman and Rousseeuw, 1987)
  Arbitrarily choose k objects as the initial medoids
  Until no change, do:
    (Re)assign each object to the cluster with the nearest medoid
    Improve the quality of the k-medoids (randomly select a non-medoid
    object O_random and compute the total cost of swapping a medoid with
    O_random)
Works for small data sets (e.g., 100 objects in 5 clusters)
Not efficient for medium and large data sets
Swapping Cost

For each pair of a medoid m and a non-medoid object h, measure whether h
would serve better than m as a medoid.

Use the squared-error criterion:
  E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2

Compute E_h - E_m
  Negative: the swap brings a benefit
Choose the swap with the minimum cost
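A direct (if naive) way to evaluate one candidate swap, reusing the earlier helpers; this sketch recomputes E from scratch rather than using PAM's incremental case analysis on the next slide:

    def assign_to_medoids(objects, medoids):
        """Partition the objects among their nearest medoids."""
        clusters = [[] for _ in medoids]
        for o in objects:
            i = min(range(len(medoids)), key=lambda i: sq_dist(o, medoids[i]))
            clusters[i].append(o)
        return clusters

    def error_for_medoids(objects, medoids):
        """E = sum over clusters of d(p, m_i)^2, with m_i the medoid."""
        clusters = assign_to_medoids(objects, medoids)
        return sum(cluster_error(c, m) for c, m in zip(clusters, medoids))

    def swap_cost(objects, medoids, m, h):
        """E_h - E_m: negative means swapping medoid m for object h helps."""
        swapped = [h if x == m else x for x in medoids]
        return (error_for_medoids(objects, swapped)
                - error_for_medoids(objects, medoids))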
Four Swapping Cases

When a medoid m is to be swapped with a non-medoid object h, check each of
the other non-medoid objects j:

If j is in the cluster of m, reassign j:
  Case 1: j is closer to some other medoid k than to h; after swapping m
  and h, j relocates to the cluster represented by k
  Case 2: j is closer to h than to any other medoid k; after swapping m
  and h, j joins the cluster represented by h
If j is in the cluster of some other medoid k (not m), compare k with h:
  Case 3: j is closer to k than to h; after swapping m and h, j remains in
  the cluster represented by k
  Case 4: j is closer to h than to k; after swapping m and h, j moves to
  the cluster represented by h
PAM Clustering: Total Swapping Cost

TC_mh = \sum_j C_jmh

[Figure: four 10 x 10 plots illustrating the four cases for an object j,
medoids m and k, and the candidate h]

Case 1: C_jmh = d(j, k) - d(j, m) >= 0
Case 2: C_jmh = d(j, h) - d(j, m)  (may be positive or negative)
Case 3: C_jmh = d(j, k) - d(j, k) = 0
Case 4: C_jmh = d(j, h) - d(j, k) < 0
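Putting the pieces together, a compact PAM sketch (illustrative only; it plugs in the naive swap_cost above, based on the squared-error criterion from the Swapping Cost slide, in place of the case-by-case TC_mh computation, and seeds the medoids arbitrarily from the first k objects):

    def pam(objects, k):
        """PAM sketch: greedily apply the best-improving medoid/non-medoid swap."""
        medoids = [tuple(o) for o in objects[:k]]   # arbitrary initial medoids
        while True:
            # Find the swap (m, h) with the minimum (most negative) cost
            best = min(((swap_cost(objects, medoids, m, h), m, h)
                        for m in medoids
                        for h in objects if tuple(h) not in medoids),
                       key=lambda t: t[0])
            cost, m, h = best
            if cost >= 0:        # no swap improves the clustering: done
                break
            medoids = [tuple(h) if x == m else x for x in medoids]
        return medoids, assign_to_medoids(objects, medoids)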
Complexity of PAM

Arbitrarily choose k objects as the initial medoids                O(1)
Until no change, do:                                               O((n-k)^2 * k) per iteration
  (Re)assign each object to the cluster with the nearest medoid    O((n-k) * k)
  Improve the quality of the k-medoids:                            O((n-k)^2 * k)
    For each pair of medoid m and non-medoid object h              (n-k) * k pairs
      Calculate the swapping cost TC_mh = \sum_j C_jmh             O(n-k) each
Strength and Weakness of PAM

PAM is more robust than k-means in the presence of outliers because a
medoid is less influenced by outliers or other extreme values than a mean.
PAM works efficiently for small data sets but does not scale well to large
data sets:
  O(k(n-k)^2) per iteration, where n is the number of data objects and k
  the number of clusters
Can we find the medoids faster?
CLARA (Clustering Large Applications) (1990)

CLARA (Kaufmann and Rousseeuw, 1990)
Built into statistical analysis packages, such as S+
Draws multiple samples of the data set, applies PAM to each sample, and
returns the best clustering as the output
Handles larger data sets than PAM (e.g., 1,000 objects in 10 clusters)
Efficiency and effectiveness depend on the sampling
CLARA - Algorithm

Set mincost to MAXIMUM;
Repeat q times                          // draw q samples
    Create S by drawing s objects randomly from D;
    Generate the set of medoids M from S by applying the PAM algorithm;
    Compute cost(M, D);
    If cost(M, D) < mincost
        mincost = cost(M, D);
        bestset = M;
    Endif;
Endrepeat;
Return bestset;
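The same loop in runnable form, as a sketch reusing pam and error_for_medoids from above; the defaults for q (number of samples) and s (sample size) are arbitrary choices for illustration:

    import random

    def clara(objects, k, q=5, s=40):
        """CLARA sketch: run PAM on q random samples, keep the best medoid set."""
        mincost, bestset = float("inf"), None
        for _ in range(q):
            sample = random.sample(objects, min(s, len(objects)))
            medoids, _ = pam(sample, k)                  # PAM on the sample only
            cost = error_for_medoids(objects, medoids)   # evaluate on all of D
            if cost < mincost:
                mincost, bestset = cost, medoids
        return bestset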
Complexity of CLARA

Set mincost to MAXIMUM;                                        O(1)
Repeat q times
    Create S by drawing s objects randomly from D;             O(1)
    Generate the set of medoids M from S by applying PAM;      O((s-k)^2 * k)
    Compute cost(M, D);                                        O((n-k) * k)
    If cost(M, D) < mincost
        mincost = cost(M, D); bestset = M;                     O(1)
    Endif;
Endrepeat;
Return bestset;

Per sample: O((s-k)^2 * k + (n-k) * k)
Strengths and Weaknesses of CLARA

Strength:
  Handles larger data sets than PAM (e.g., 1,000 objects in 10 clusters)
Weaknesses:
  Efficiency depends on the sample size
  A good clustering based on samples does not necessarily represent a good
  clustering of the whole data set if the samples are biased
CLARANS ("Randomized" CLARA) (1994)

CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han '94)
CLARANS draws a sample of the solution space dynamically
A solution is a set of k medoids
The solution space contains \binom{n}{k} solutions in total
The solution space can be represented by a graph where every node is a
potential solution, i.e., a set of k medoids
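To get a feel for the size of this space (a quick calculation, not on the slides), consider the 1,000-objects, 10-clusters setting used earlier:

    import math
    # Number of distinct medoid sets for n = 1,000 and k = 10
    print(math.comb(1000, 10))   # about 2.6e23 -- exhaustive search is hopeless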
Graph Abstraction

Every node is a potential solution (a set of k medoids)
Every node is associated with a squared error
Two nodes are adjacent if they differ by exactly one medoid
Every node has k(n-k) adjacent nodes:
  each of the k medoids can be replaced by any of the n-k non-medoids

[Figure: node {O_1, O_2, ..., O_k} with neighbors {O_{k+1}, O_2, ..., O_k}
through {O_n, O_2, ..., O_k}: n-k neighbors per medoid, k(n-k) per node]
Graph Abstraction: CLARANS

Start with a randomly selected node and randomly check at most m
(maxneighbor) of its neighbors.
If a better adjacent node is found, move to that node and continue;
otherwise, the current node is a local optimum; re-start with another
randomly selected node to search for another local optimum.
When h (numlocal) local optima have been found, return the best result as
the overall result.
CLARANS

[Figure: the CLARANS search -- from a current node C, compare with no more
than maxneighbor randomly chosen neighbors N; descend while a better
neighbor is found, record the local minimum, repeat numlocal times, and
keep the best node]
CLARANS - Algorithm

Set mincost to MAXIMUM;
For i = 1 to h do                       // find h local optima
    Randomly select a node as the current node C in the graph;
    j = 1;                              // counter of neighbors checked
    Repeat
        Randomly select a neighbor N of C;
        If cost(N, D) < cost(C, D)
            Assign N as the current node C;
            j = 1;
        Else
            j++;
        Endif;
    Until j > m                         // m = maxneighbor
    Update mincost and bestnode with C if cost(C, D) < mincost;
End for;
Return bestnode;
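A runnable sketch of this search (illustrative; nodes are tuples of medoids, a random neighbor swaps one medoid for a random non-medoid, and the cost is the squared-error criterion via error_for_medoids from above, with random imported earlier):

    def clarans(objects, k, numlocal=2, maxneighbor=100):
        """CLARANS sketch: randomized descent, restarted numlocal times."""
        mincost, bestnode = float("inf"), None
        for _ in range(numlocal):
            current = tuple(random.sample(objects, k))   # random starting node
            j = 1
            while j <= maxneighbor:
                # Random neighbor: replace one medoid with a random non-medoid
                i = random.randrange(k)
                h = random.choice([o for o in objects if o not in current])
                neighbor = current[:i] + (h,) + current[i + 1:]
                if (error_for_medoids(objects, list(neighbor))
                        < error_for_medoids(objects, list(current))):
                    current, j = neighbor, 1             # move and reset counter
                else:
                    j += 1
            cost = error_for_medoids(objects, list(current))  # local optimum
            if cost < mincost:
                mincost, bestnode = cost, current
        return bestnode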
Graph Abstraction (k-means, k-modes, k-medoids)

Each vertex is a set of k representative objects (means, modes, or medoids)
Each iteration produces a new set of k representative objects with lower
overall dissimilarity
Iterations correspond to a hill-descent process in a landscape (graph) of
vertices
Comparison with PAM

PAM searches for the minimum in the graph (landscape): at each step, all
adjacent vertices are examined, and the one with the deepest descent is
chosen as the next set of k medoids; the search continues until a minimum
is reached.
For large n and k (e.g., n = 1,000, k = 10), examining all k(n-k) adjacent
vertices is time-consuming; PAM is inefficient for large data sets.

CLARANS vs. PAM:
  For large and medium data sets, CLARANS is clearly much more efficient
  than PAM.
  Even for small data sets, CLARANS outperforms PAM significantly: when
  n = 80, CLARANS is 5 times faster than PAM, while the cluster quality is
  the same.
Comparison with CLARA

CLARANS vs. CLARA:
  CLARANS is always able to find clusterings of better quality than those
  found by CLARA, though CLARANS may use much more time than CLARA.
  When the time used is the same, CLARANS is still better than CLARA.
Hierarchies of Co-expressed Genes and Coherent Patterns

The interpretation of co-expressed genes and coherent patterns mainly
depends on the domain knowledge.
A Subtle Situation

To split or not to split? That is the question.

[Figure: group A and a candidate split into subgroups A1 and A2]