k-medoid clustering with genetic algorithm

Post on 23-Feb-2016


Wei-Ming Chen, 2012.12.06

k-medoid clustering with genetic algorithm

Outline

k-medoids clustering: famous works
GCA: clustering with the aid of a genetic algorithm
Clustering genetic algorithm: also judges the number of clusters
Conclusion

What is k-medoid clustering?

Proposed in 1987 (L. Kaufman and P.J. Rousseeuw)

There are N points in the space
k points are chosen as centers (medoids)
The other points are classified into the k groups
Which k points should be chosen to minimize the sum of the distances from each point to its medoid?
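The objective can be written down directly. A minimal sketch, assuming 1-D points and absolute distance (`kmedoids_cost` is an illustrative name, not from the slides):

```python
def kmedoids_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

# Two obvious clusters around 2 and 11: choosing those as medoids
# gives a total cost of 1 + 0 + 1 + 1 + 0 + 1 = 4.
print(kmedoids_cost([1, 2, 3, 10, 11, 12], [2, 11]))
```

The clustering problem is to search over all k-subsets of the N points for the set minimizing this cost.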

Difficulty

The problem is NP-hard
Genetic algorithms can be applied


Partitioning Around Medoids (PAM)

Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley

Group the N data points into k sets
In every generation, consider every pair (Oi, Oj), where Oi is a medoid and Oj is not; if replacing Oi with Oj would reduce the total distance, perform the swap

Computation time: O(k(N-k)^2) [one generation]
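One PAM generation can be sketched as follows (a 1-D illustration with absolute distance; `pam_one_pass` is an assumed name, and each cost evaluation scans all N points, giving the O(k(N-k)^2) behavior per pass):

```python
def pam_one_pass(points, medoids):
    """One PAM generation: for every pair (Oi, Oj) with Oi a medoid and
    Oj a non-medoid, perform the swap whenever it lowers the total cost."""
    def cost(meds):
        return sum(min(abs(p - m) for m in meds) for p in points)
    meds = list(medoids)
    for i in range(len(meds)):          # each medoid position Oi
        for oj in points:               # each candidate replacement Oj
            if oj in meds:
                continue
            trial = meds[:i] + [oj] + meds[i + 1:]
            if cost(trial) < cost(meds):
                meds = trial            # keep the improving swap
    return meds
```

Starting from a poor medoid set such as [1, 3] on the data [1, 2, 3, 10, 11, 12], a single pass already moves one medoid into the second cluster.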

Clustering LARge Applications (CLARA)

Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley

Reduce the calculation time
Only select s data points out of the original N
s = 40 + 2k seems to be a good choice
Computation time: O(ks^2 + k(N-k)) [one generation]

Clustering Large Applications based upon RANdomized Search (CLARANS)

Ng, R., & Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th international conference on very large databases, Santiago, Chile (pp. 144–155)

Do not try all pairs (Oi, Oj)
Try max(0.0125 * k(N-k), 250) different Oj for each Oi

Computation time: O(N^2) [one generation]


GCA

Lucasius, C. B., Dane, A. D., & Kateman, G. (1993). On k-medoid clustering of large data sets with the aid of a genetic algorithm: Background, feasibility and comparison. Analytica Chimica Acta, 282, 647–669.

Chromosome encoding

N data points, clustered into k groups
Problem size (string length) = k (the number of groups)
Each location of the string is an integer in 1..N (a medoid)

Initialization

Each string in the population uniquely encodes a candidate solution of the target problem

Randomly choose the candidates

Selection

Select the M worst individuals in the population and discard them

Crossover

Select some individuals to reproduce M new individuals
Building-block-like crossover
Mutation

Crossover

For example, k = 3, p1 = 2 3 7, p2 = 4 8 2 (a subscript marks which parent each gene came from)

1. Mix p1 and p2: Q = 2₁ 3₁ 7₁ 4₂ 8₂ 2₂
Randomly scramble: Q = 4₂ 2₂ 2₁ 8₂ 7₁ 3₁
2. Add new material: the first k elements may be changed: Q = 5 2₂ 7 8₂ 7₁ 3₁
3. Randomly scramble again: Q = 2₂ 7₁ 7 3₁ 5 8₂
4. The offspring are read off as the first k distinct values from the left and from the right: C1 = 2 7 3, C2 = 8 5 3
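The four steps can be sketched in Python. This is a loose reconstruction: the rate `p_new` for the "add new material" step and the function name are assumptions, not taken from the paper.

```python
import random

def gca_crossover(p1, p2, n_points, rng, p_new=0.3):
    """GCA-style crossover on two medoid strings of length k."""
    k = len(p1)
    pool = list(p1) + list(p2)            # 1. mix the two parents
    rng.shuffle(pool)                     #    ...and randomly scramble
    for i in range(k):                    # 2. add new material: the first
        if rng.random() < p_new:          #    k slots may be overwritten
            pool[i] = rng.randrange(1, n_points + 1)
    rng.shuffle(pool)                     # 3. randomly scramble again
    def first_k_distinct(seq):            # 4. read off k distinct medoids
        out = []
        for g in seq:
            if g not in out:
                out.append(g)
            if len(out) == k:
                break
        return out
    return first_k_distinct(pool), first_k_distinct(reversed(pool))
```

Both children are valid chromosomes (distinct medoid indices), and each inherits genes from both parents plus occasional fresh material.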

Experiment

Under the limit of NFE < 100,000
N = 1000, k = 15

Experiment

GCA versus Random search

Experiment

GCA versus CLARA (k = 15)

Experiment

GCA versus CLARA (k = 50)

Experiment

Paper’s conclusion

GCA can handle both large values of k and small values of k

GCA outperforms CLARA, especially when k is a large value

GCA lends itself excellently to parallelization
GCA can be combined with CLARA to obtain a hybrid search system with better performance


Motivation

In some cases, we do not actually know the number of clusters

What if we only know an upper limit on the number of clusters?

Hruschka, E. R., & Ebecken, N. F. F. (2003). A genetic algorithm for cluster analysis. Intelligent Data Analysis, 7, 15–25.

Fitness function

a(i): the average distance from individual i to the other individuals in the same cluster

d(i, C): the average distance from individual i to the individuals in another cluster C

b(i): the smallest of the d(i, C) values over all other clusters C

Fitness function

Silhouette fitness: the average of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all individuals

This value will be high when the a(i) values are small and the b(i) values are high
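The fitness can be sketched directly from these definitions (a 1-D illustration with absolute distance; real data would use a full distance matrix):

```python
def silhouette_fitness(points, labels):
    """Average silhouette width s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    n = len(points)
    total = 0.0
    for i in range(n):
        # a(i): mean distance to the other members of i's own cluster
        same = [abs(points[i] - points[j]) for j in range(n)
                if j != i and labels[j] == labels[i]]
        a = sum(same) / len(same) if same else 0.0
        # b(i): smallest mean distance d(i, C) to any other cluster C
        by_cluster = {}
        for j in range(n):
            if labels[j] != labels[i]:
                by_cluster.setdefault(labels[j], []).append(
                    abs(points[i] - points[j]))
        b = min(sum(d) / len(d) for d in by_cluster.values())
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / n
```

A tight, well-separated labeling scores close to 1; mixing the two natural clusters drives the score down, which is exactly what the GA's selection pressure exploits.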

Chromosome encoding

N data points, clustered into at most k groups
Problem size (string length) = N + 1; the last position stores the number of clusters
Each of the first N locations is an integer in 1..k (which cluster the point belongs to)

Genotype1: 22345123453321454552 5

To avoid the following problem, where two labelings of the same partition produce meaningless children:

Genotype2: 2|2222|111113333344444 4
Genotype3: 4|4444|333335555511111 4
Child2: 2 4444 111113333344444 4
Child3: 4 2222 333335555511111 5

a consistent renumbering algorithm relabels Genotype1 as: 11234512342215343441 5
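The renumbering step is small enough to show in full: clusters are relabeled in order of first appearance, so any two genotypes encoding the same partition become the identical string.

```python
def renumber(genotype):
    """Relabel clusters by order of first appearance (consistent algorithm)."""
    mapping = {}
    # setdefault assigns the next unused label the first time a cluster
    # id is seen, and reuses it on every later occurrence
    return [mapping.setdefault(g, len(mapping) + 1) for g in genotype]
```

Applied to the Genotype1 string above, this reproduces the slide's result, and relabeled copies of the same partition collapse to one canonical form.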

Initialization

Population size = 20
The first genotype represents two clusters, the second three clusters, the third four clusters, ..., and the last one 21 clusters

Selection

Roulette wheel selection
Since the silhouette lies in [-1, 1], the fitness values are first normalized to nonnegative values

Crossover

Uniform crossover does not work
Use the Grouping Genetic Algorithm (GGA), proposed by Falkenauer (1998)

First, two strings are selected:
A = 1123245125432533424
B = 1212332124423221321

Randomly select groups of A to preserve (for example, groups 2 and 3)

Crossover

A = 1123245125432533424
B = 1212332124423221321
C = 0023200020032033020 (groups 2 and 3 copied from A)

Check the groups of B left unchanged by the copied positions and place them in C:

C = 0023200024432033020

The other child D is formed from the groups of B, without the group actually placed in C, plus the unchanged groups of A:

D = 1212332120023221321

The remaining objects (whose alleles are zeros) are assigned to the nearest cluster
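The construction of the first child can be sketched as follows. This is one interpretation of the slides (`gga_child` is an assumed name, and "unchanged groups of B" is read as groups whose positions do not overlap the genes copied from A):

```python
def gga_child(a, b, keep_from_a):
    """First GGA child: copy the selected groups from parent A, keep the
    groups of B that do not overlap them, and leave the rest as 0
    ("unassigned") for a later nearest-cluster repair step."""
    c = [g if g in keep_from_a else 0 for g in a]
    claimed = {i for i, g in enumerate(c) if g != 0}
    for group in set(b):
        positions = {i for i, g in enumerate(b) if g == group}
        if positions.isdisjoint(claimed):   # group of B survives intact
            for i in positions:
                c[i] = b[i]
    return c
```

On the slide's example (preserving groups 2 and 3 of A), only group 4 of B avoids the claimed positions, reproducing the string C above.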

Mutation

Two mutation operators:
1. Randomly choose a group and place all of its objects into the remaining cluster that has the nearest centroid
2. Divide a randomly selected group into two new ones

Both change the genotype in the smallest possible way
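Operator 1 can be sketched as follows (a 1-D illustration; in the GA the dissolved group is chosen at random, here it is passed explicitly for clarity):

```python
def centroid(pts):
    return sum(pts) / len(pts)

def dissolve(points, labels, victim):
    """Mutation operator 1: remove cluster `victim`, sending each of its
    points to the remaining cluster with the nearest centroid."""
    rest = sorted(set(labels) - {victim})
    cents = {g: centroid([p for p, l in zip(points, labels) if l == g])
             for g in rest}
    return [l if l != victim else min(rest, key=lambda g: abs(p - cents[g]))
            for p, l in zip(points, labels)]
```

Dissolving the middle cluster of three splits its points between the neighbors, leaving every other label untouched, which is what "the smallest possible change" asks for.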

Experiment

4 test problems (N = 75, 200, 699, 150)

Experiment

Ruspini data (N = 75)

Paper’s conclusion

There is no need to know the number of groups in advance
Finds the correct answer on four different test problems
Succeeds with only a small population size


Conclusion

Genetic algorithms are an acceptable method for clustering problems
The crossover must be designed carefully
Maybe EDAs can be applied
Some theses? Or final projects!