Post on 14-Dec-2015
Clustering under Constraints with Genetic Algorithms
by
Albert Ali Salah
Stanislav Redman
Gabriella Kovacs
Outline
• Definition of the problem
• Background on genetic algorithms
• Case study: Workgroup assignment
• Results
Clustering under Constraints
• N multi-dimensional data items
• A bunch of soft constraints
• (A bunch of hard constraints)
• The problem: Clustering the data points so that the hard constraints are satisfied, and the soft constraints are optimized.
Constrained Clustering
• Constrained clustering is an unsupervised learning technique, where some data items are known to be in the same cluster, and some are known to be in different clusters.
• Clustering under constraints is an optimization problem (I saw Karp in the elevator, and he said it’s probably NP-complete)
Genetic Algorithms
• A GA is essentially a heuristic random search tool
• Has no rigorous mathematical principle, no one knows why it works
• Used frequently in soft constraint optimization, rarely in clustering
Details You All Know
• Solutions are ‘coded’ into simple, DNA-like structures called chromosomes
• A fitness function is supplied to evaluate the quality of solutions
• The algorithm works on a population of individuals
• There is a Genetic Algorithm package written for the object-oriented Dolphin Smalltalk environment
Genetic Algorithm Flowchart
Initial Population → End Criteria Reached?
No → Selection → Cross-over → Mutation → New Population → End Criteria Reached?
Yes → Output Best Individual
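The flowchart can be sketched as a short loop. This is a minimal illustration, not the Dolphin Smalltalk package mentioned earlier: the function name `run_ga`, the fitness-proportional selection scheme, and the use of a fixed generation count as the end criterion are assumptions made for the example. The default parameter values are the ones listed on the GA parameters slide.

```python
import random


def run_ga(fitness, n_bits, pop_size=100, generations=30,
           p_cross=0.4, p_mut=0.001, rng=None):
    """Generic GA loop over bit-string chromosomes (lists of 0/1).

    pop_size is assumed to be even so that parents pair up cleanly.
    """
    rng = rng or random.Random()
    # Initial population: random bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    # End criterion (assumed): a fixed number of generations.
    for _ in range(generations):
        # Selection: fitness-proportional sampling (an assumption; the
        # slides do not say which selection scheme was used).
        weights = [max(fitness(ind), 0.0) + 1e-12 for ind in population]
        parents = rng.choices(population, weights=weights, k=pop_size)
        # Cross-over: single point, applied to consecutive pairs.
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_bits)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            children += [a, b]
        # Mutation: flip each bit independently with probability p_mut.
        for child in children:
            for i in range(n_bits):
                if rng.random() < p_mut:
                    child[i] ^= 1
        # New population; remember the best individual seen so far.
        population = children
        best = max(population + [best], key=fitness)
    return best


def onemax(bits):
    """Toy fitness in [0, 1]: the fraction of bits set to 1."""
    return sum(bits) / len(bits)


best = run_ga(onemax, n_bits=20, rng=random.Random(0))
print(best, onemax(best))
```

The toy `onemax` fitness only exercises the loop; for the case study it would be replaced by the composite fitness over group assignments.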
Case Study: Santa Fe
• Aim: Cluster people such that:
– Groups are balanced in number of students
– Each group consists of people with similar interests
– Each group has some people with basic skills
– Each group possesses enough knowledge in its areas of interest
Problem 1: Representation
• A good GA representation is:
– unambiguous
– short (k bits means a search space of size 2^k)
– smooth with respect to fitness landscape
– robust to mutations
– free of preferential bias
– simple to decode
• 01101001010010101001010…
• 01101001010010101001010…
Representation
Three bits code the group number
The position indicates the student number
1 2 3 4…
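Decoding this representation can be sketched as follows. The function name `decode`, the most-significant-bit-first reading of each 3-bit gene, and numbering groups from 0 are assumptions made for the example; the slide only states that three bits code the group and the position codes the student.

```python
def decode(chromosome, bits_per_student=3):
    """Return a list whose entry i is the group number (0-7) of student i.

    chromosome is a string of '0'/'1' characters; each consecutive run of
    bits_per_student characters is read as a binary number, MSB first.
    """
    if len(chromosome) % bits_per_student:
        raise ValueError("chromosome length must be a multiple of bits_per_student")
    groups = []
    for start in range(0, len(chromosome), bits_per_student):
        gene = chromosome[start:start + bits_per_student]
        groups.append(int(gene, 2))
    return groups


# First twelve bits of the example string: 011 010 010 100
print(decode("011010010100"))  # [3, 2, 2, 4]
```

With three bits per student every bit pattern maps to one of eight groups, so no chromosome is invalid, which is one way to meet the "unambiguous" and "simple to decode" criteria.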
Problem 2: Fitness
• A good fitness function is:
– between 0 (awful) and 1 (optimal)
– a correct ordering of individuals with respect to their closeness to the optimal solution
– informative, and indicative of relative fitness
– pragmatic about the boundary conditions
– simple and fast to calculate
Composite Fitness
• Assume there are n different, possibly independent fitness criteria. Let f1, f2, …, fn be the individual fitness functions that order the solutions according to individual criteria. The total fitness function is
$f = \sum_{i=1}^{n} \alpha_i f_i$
where $\alpha_i$ are coefficients to be determined
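A minimal sketch of the combination step, assuming the total fitness is a weighted sum of the individual terms. The function name `composite_fitness` and the default of equal coefficients that sum to 1 are illustrative choices, not something the slides specify beyond "equal coefficients" in the GA parameters.

```python
def composite_fitness(terms, coefficients=None):
    """Weighted sum of individual fitness terms f_1..f_n.

    If no coefficients are given, equal weights 1/n are used, which keeps
    the total in [0, 1] whenever every term is in [0, 1].
    """
    if coefficients is None:
        coefficients = [1.0 / len(terms)] * len(terms)
    if len(coefficients) != len(terms):
        raise ValueError("need exactly one coefficient per fitness term")
    return sum(a * f for a, f in zip(coefficients, terms))


print(composite_fitness([0.5, 1.0]))              # equal weights
print(composite_fitness([0.5, 1.0], [1.0, 0.0]))  # only the first term counts
```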
f1 : Interest Term
N : number of students
M : number of groups
S : number of interests
$p_i$ : interest vector of student i
$g_j$ : mean interest vector of group j
$\delta_{ij}$ : Kronecker delta
$f_1 = \frac{9SN - \sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} (p_i - g_j)^2}{9SN}$
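A sketch of the interest term, under the reading f1 = (9SN − Σ_i Σ_j δ_ij (p_i − g_j)²) / (9SN). Two assumptions are made here: δ_ij is taken to be 1 when student i belongs to group j and 0 otherwise, and (p_i − g_j)² is taken as the squared Euclidean distance between the two vectors. The function name `interest_fitness` and the list-based data layout are hypothetical.

```python
def interest_fitness(interests, assignment):
    """Interest term f1.

    interests[i]  : interest vector p_i of student i (one score per interest)
    assignment[i] : group index of student i
    """
    n = len(interests)
    s = len(interests[0])
    # Collect the members of each group.
    members = {}
    for i, group in enumerate(assignment):
        members.setdefault(group, []).append(i)
    total = 0.0
    for idx in members.values():
        # g_j: mean interest vector of the group.
        mean = [sum(interests[i][k] for i in idx) / len(idx) for k in range(s)]
        # Squared distance of every member from the group mean.
        for i in idx:
            total += sum((interests[i][k] - mean[k]) ** 2 for k in range(s))
    norm = 9 * s * n
    return (norm - total) / norm


data = [[1, 4], [1, 4], [4, 1], [4, 1]]
print(interest_fitness(data, [0, 0, 1, 1]))  # like-minded students together
print(interest_fitness(data, [0, 1, 0, 1]))  # opposite students mixed
```

Even the deliberately bad second grouping in the example still scores well above 0.5, because the normalization 9SN is the worst case for every student in every dimension.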
Problem with f1
• 9SN is too large a normalization factor: all decent individuals (with small distances from the mean) will have f1 very close to 1.
• General solution: in $\frac{\mathrm{dist}_{\max} - \mathrm{dist}}{\mathrm{dist}_{\max}}$, replace $\mathrm{dist}_{\max}$ with $z \cdot \mathrm{average\_dist}$
$f_1 = 1 - \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} (p_i - g_j)^2}{0.8\,SN}$
f2 : Balance Term
N : number of students
M : number of groups
$n_j$ : number of students in group j
$f_2 = \frac{N^2 - \sum_{j=1}^{M} \left(n_j - \frac{N}{M}\right)^2}{N^2}$
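A sketch of the balance term, reading the formula as f2 = (N² − Σ_j (n_j − N/M)²) / N². Under that reading, the group sizes of the nearest-neighbor result reported later (one group of 45 names and seven groups of one) reproduce the first fitness term listed there, 0,37352071. The function name `balance_fitness` is hypothetical.

```python
def balance_fitness(group_sizes):
    """Balance term f2 from the list of group sizes n_1..n_M."""
    n = sum(group_sizes)   # N: number of students
    m = len(group_sizes)   # M: number of groups
    ideal = n / m          # perfectly balanced group size N/M
    spread = sum((size - ideal) ** 2 for size in group_sizes)
    return (n * n - spread) / (n * n)


# Nearest-neighbor result: one group of 45 and seven singletons.
print(round(balance_fitness([45, 1, 1, 1, 1, 1, 1, 1]), 8))  # 0.37352071
```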
f3 : Basic Skills Term
M : number of groups
B : number of basic skills
$b_{ik}$ : kth skill of student i
$\delta_{ij}$ : Kronecker delta
$f_3 = \frac{9MB - \sum_{j=1}^{M} \sum_{k=1}^{B} \left(4 - \max_i(\delta_{ij} b_{ik})\right)^2}{9MB}$
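A sketch of the basic skills term, reading the formula as f3 = (9MB − Σ_j Σ_k (4 − best_jk)²) / (9MB), where best_jk is the highest level of skill k among the members of group j. Taking the slide's "arg max" to mean that best value is an assumption, as are the 1 to 4 skill scale, the function name `basic_skills_fitness`, and the data layout.

```python
def basic_skills_fitness(skills, assignment, n_groups):
    """Basic skills term f3.

    skills[i][k]  : level of student i in basic skill k (assumed 1..4)
    assignment[i] : group index (0..n_groups-1) of student i
    """
    n_skills = len(skills[0])
    penalty = 0.0
    for j in range(n_groups):
        members = [i for i, group in enumerate(assignment) if group == j]
        for k in range(n_skills):
            # Best level of skill k in group j; an empty group counts as 0,
            # which is what max_i(delta_ij * b_ik) gives literally.
            best = max((skills[i][k] for i in members), default=0)
            penalty += (4 - best) ** 2
    norm = 9 * n_groups * n_skills
    return (norm - penalty) / norm


skills = [[4, 1], [1, 4], [1, 1], [1, 1]]
# Group 0 holds both experts; group 1 has only beginners.
print(basic_skills_fitness(skills, [0, 0, 1, 1], n_groups=2))  # 0.5
```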
f4 : Knowledge Term
M : number of groups
S : number of interests
$h_{ik}$ : kth knowledge term of student i
$\delta_{ij}$ : Kronecker delta
$\gamma_{jk}$ : 1 if kth interest term is among the first three interests of group j, 0 otherwise.
$f_4 = \frac{27M - \sum_{j=1}^{M} \sum_{k=1}^{S} \gamma_{jk} \left(4 - \max_i(\delta_{ij} h_{ik})\right)^2}{27M}$
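A sketch of the knowledge term, under the reading that each group is penalized only on its three top interests: f4 = (27M − Σ_j Σ_{k in top three of j} (4 − best_jk)²) / (27M), with best_jk the highest knowledge level for subject k among the members of group j. That reading, the 1 to 4 knowledge scale, the function name `knowledge_fitness`, and the `top_interests` argument are assumptions made for the example.

```python
def knowledge_fitness(knowledge, top_interests, assignment):
    """Knowledge term f4.

    knowledge[i][k]  : knowledge level of student i in subject k (assumed 1..4)
    top_interests[j] : indices of the three top interests of group j
    assignment[i]    : group index of student i
    """
    n_groups = len(top_interests)
    penalty = 0.0
    for j, subjects in enumerate(top_interests):
        members = [i for i, group in enumerate(assignment) if group == j]
        for k in subjects:
            # Best knowledge of subject k available inside group j.
            best = max((knowledge[i][k] for i in members), default=0)
            penalty += (4 - best) ** 2
    norm = 27 * n_groups  # 3 interests per group, worst case 9 each
    return (norm - penalty) / norm


knowledge = [[4, 4, 4, 1], [1, 1, 1, 4]]
top_interests = [[0, 1, 2], [0, 1, 3]]
print(knowledge_fitness(knowledge, top_interests, [0, 1]))
```

In the example, group 0 has an expert for each of its three interests, while group 1 covers only one of its three, so the penalty comes entirely from group 1.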
GA parameters
• Population size: 100
• Generations: 30
• Crossover probability: 0.4 (single point)
• Mutation probability: 0.001
• Equal coefficients
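To get a feel for what these settings imply, the arithmetic below works out the chromosome length and the amount of search performed. It assumes 52 students (the number of names in the result listings) and the 3-bit-per-student representation; the variable names are illustrative.

```python
n_students = 52          # assumption: names counted in the result listings
bits_per_student = 3     # three bits code the group number
population_size = 100
generations = 30
p_mutation = 0.001

# Length of one chromosome and size of the search space it spans.
chromosome_bits = n_students * bits_per_student
search_space = 2 ** chromosome_bits

# Bits exposed to mutation in each generation, and the expected
# number of bit flips among them.
bits_per_generation = chromosome_bits * population_size
expected_flips = bits_per_generation * p_mutation

# Upper bound on fitness evaluations over the whole run.
evaluations = population_size * generations

print(chromosome_bits, bits_per_generation, evaluations)
print(expected_flips)
print(len(str(search_space)), "decimal digits in the search-space size")
```

Under these assumptions a run looks at a few thousand candidates out of 2^156, with roughly 15 to 16 bit flips per generation across the whole population.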
Some entertaining facts about the dataset
Basic skills
              Average  Experts  Beginners
Mathematics   2.83     9        4
Programming   2.75     14       11
English       3.10     19       1
Statistics    2.87     8        1
Interests (bar chart, axis 0.0 to 3.5; categories in the order shown): Self-organization, Computer Science, Multi-Agent Systems, Evolution, Biology, Neural Nets & Simulation, Information Theory, Economics, Optimization, Cognitive Science, Physics, Social Networks, Psychology, Neuroscience, Philosophy, Anthropology, Quantum Consciousness
Knowledge (bar chart, axis 0.0 to 3.0; categories in the order shown): Computer Science, Evolution, Physics, Optimization, Multi-Agent Systems, Neural Nets & Simulation, Biology, Self-organization, Information Theory, Economics, Philosophy, Cognitive Science, Psychology, Social Networks, Neuroscience, Anthropology, Quantum Consciousness
TOP 10 knowledge-seeking people
Irina
Anton
Mourad
Zoltan
Anukool
Angel
Lyudmila
Mianlai
Aaron
Arthur
TOP 10 knowledgeable people
Anton
Louise
Arndt
Angel
Suzanne
Mark
Nilanjana
Wojciech
Albert
Aaron
Some serious results
Clustering of interest vectors with
• Nearest neighbor
• Furthest neighbor
• Average linkage
• Ward linkage
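The four linkage criteria differ only in how they measure the distance d(r, s) between two clusters r and s. The sketch below writes each one out in plain Python, following the formulas given with each method's results and assuming Euclidean distance between interest vectors. The function names are hypothetical, and this is an illustration of the criteria, not the tool actually used to produce the results.

```python
import math


def dist(x, y):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest_neighbor(r, s):
    """Smallest distance between any member of r and any member of s."""
    return min(dist(x, y) for x in r for y in s)


def furthest_neighbor(r, s):
    """Largest distance between any member of r and any member of s."""
    return max(dist(x, y) for x in r for y in s)


def average_linkage(r, s):
    """Mean distance over all pairs with one member from each cluster."""
    return sum(dist(x, y) for x in r for y in s) / (len(r) * len(s))


def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    return [sum(column) / len(cluster) for column in zip(*cluster)]


def ward_linkage(r, s):
    """n_r * n_s * (centroid distance)^2 / (n_r + n_s)."""
    n_r, n_s = len(r), len(s)
    return n_r * n_s * dist(centroid(r), centroid(s)) ** 2 / (n_r + n_s)


r = [[0.0], [1.0]]
s = [[3.0], [5.0]]
for criterion in (nearest_neighbor, furthest_neighbor, average_linkage, ward_linkage):
    print(criterion.__name__, criterion(r, s))
```

Agglomerative clustering repeatedly merges the two clusters with the smallest d(r, s). Nearest neighbor tends to chain points into one large cluster, which is consistent with the single group of 45 in the nearest-neighbor result.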
Nearest neighbor
$d(r,s) = \min(\mathrm{dist}(x_{ri}, x_{sj})),\ i \in (1, \ldots, n_r),\ j \in (1, \ldots, n_s)$
FITNESS TERMS: 0,37352071 0,847012823 0,722222222 0,916006652
GROUP 1: Natalia, Nilanjana, Angel, Arndt, Alexander, Wojciech, Frederic, Jason, Gerard, Ferenc, Sergey, Milica, Zoltan, Bartlomiej, Aaron, Pau, Sergey, Jasper, Matthew, Mark, Eva, Volodymyr, Victor, Oleksiy, Anukool, Hilary, Lyudmila, Alex, Vaclav, Anton, Mourad, Nicholas, Arthur, Carolyn, Stanislav, Denis, Suzanne, Albert, Lisa, Vadim, Pavel, Sergiy, Valentin, Mianlai, Gordan
Interests: Self-organization (2,98) Evolution (2,8) Computer Science (2,78)
GROUP 2: Louise
Interests: Anthropology (4) Biology (4) Cognitive Science (4)
GROUP 3: Tatyana
Interests: Cognitive Science (4) Computer Science (4) Information Theory (4)
GROUP 4: Gabriella
Interests: Computer Science (4) Information Theory (4) Optimization (4)
GROUP 5: Ana-Maria
Interests: Social Networks (4) Cognitive Science (3) Multi-Agent Systems (3)
GROUP 6: Angelica
Interests: Cognitive Science (4) Computer Science (4) Multi-Agent Systems (4)
GROUP 7: Christophe
Interests: Cognitive Science (4) Neural Nets & Simulation (4) Psychology (4)
GROUP 8: Irina
Interests: Cognitive Science (4) Computer Science (4) Information Theory (4)
Furthest neighbor
$d(r,s) = \max(\mathrm{dist}(x_{ri}, x_{sj})),\ i \in (1, \ldots, n_r),\ j \in (1, \ldots, n_s)$
FITNESS TERMS: 0,926035503 0,887127441 0,958333333 0,964728892
GROUP 1: Hilary, Angel, Mark, Mourad, Jason
Interests: Psychology (3,8) Evolution (3,6) Anthropology (3,2)
GROUP 2: Bartlomiej, Louise, Alexander, Matthew, Valentin, Angelica, Victor
Interests: Evolution (3,43) Multi-Agent Systems (3,29) Social Networks (3,29)
GROUP 3: Suzanne, Aaron, Alex, Arndt, Wojciech
Interests: Evolution (3,57) Biology (3,2929) Self-organization (3,14285714)
GROUP 4: Lisa, Gerard
Interests: Social Networks (4) Cognitive Science (3) Multi-Agent Systems (3)
GROUP 5: Sergiy, Albert, Christophe
Interests: Information Theory (2,625) Physics (2,625) Self-organization (2,625)
GROUP 6: Natalia, Nilanjana, Lyudmila, Vaclav, Anton, Frederic, Arthur, Ferenc, Stanislav, Milica, Denis, Sergey, Jasper, Pavel, Mianlai, Volodymyr, Gabriella, Oleksiy, Anukool
Interests: Cognitive Science (4) Computer Science (4) Multi-Agent Systems (4)
GROUP 7: Pau, Vadim, Ana-Maria, Eva, Nicholas, Sergey, Gordan
Interests: Cognitive Science (3,33) Neural Nets & Simulation (3,33) Biology (3)
GROUP 8: Irina, Zoltan, Tatyana, Carolyn
Interests: Quantum Consciousness (3,75) Cognitive Science (3,5) Computer Science (3,5)
Average linkage
$d(r,s) = \frac{1}{n_r n_s} \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} \mathrm{dist}(x_{ri}, x_{sj})$
FITNESS TERMS: 0,821745562 0,879219281 0,902777778 0,951247491
GROUP 1: Natalia, Nilanjana, Angel, Wojciech, Frederic, Jason, Ferenc, Milica, Aaron, Sergey, Jasper, Mark, Volodymyr, Gabriella, Oleksiy, Hilary, Lyudmila, Vaclav, Anton, Mourad, Arthur, Stanislav, Denis, Suzanne, Pavel, Mianlai
Interests: Self-organization (3,15) Multi-Agent Systems (3,04) Computer Science (3)
GROUP 2: Anukool
Interests: Computer Science (4) Neuroscience (4) Optimization (4)
GROUP 3: Bartlomiej, Lisa, Alexander, Matthew, Valentin, Gerard, Victor
Interests: Evolution (3,57) Biology (3,29) Self-organization (3,14)
GROUP 4: Ana-Maria
Interests: Social Networks (4) Cognitive Science (3) Multi-Agent Systems (3)
GROUP 5: Pau, Alex, Arndt, Vadim, Eva, Nicholas, Sergey, Gordan
Interests: Information Theory (2,625) Physics (2,625) Self-organization (2,625)
GROUP 6: Angelica, Louise
Interests: Cognitive Science (4) Computer Science (4) Multi-Agent Systems (4)
GROUP 7: Sergiy, Albert, Christophe
Interests: Cognitive Science (3,33) Neural Nets & Simulation (3,333) Biology (3)
GROUP 8: Irina, Zoltan, Tatyana, Carolyn
Interests: Quantum Consciousness (3,75) Cognitive Science (3,5) Computer Science (3,5)
Ward linkage
$d(r,s) = \frac{n_r n_s \, \mathrm{dist}(\bar{x}_r, \bar{x}_s)^2}{n_r + n_s}$
FITNESS TERMS: 0,968195266 0,891630074 0,972222222 0,965034915
GROUP 1: Lisa, Alex, Arndt, Frederic, Gerard
Interests: Self-organization (3,6) Biology (3,4) Evolution (3,4)
GROUP 2: Pau, Vadim, Ana-Maria, Eva, Nicholas, Sergey, Gabriella, Gordan
Interests: Physics (2,625) Self-organization (2,625) Computer Science (2,5)
GROUP 3: Bartlomiej, Matthew, Valentin, Alexander
Interests: Economics (3,25) Evolution (3,25) Biology (3)
GROUP 4: Louise, Mianlai, Volodymyr, Victor, Angelica
Interests: Computer Science (4) Multi-Agent Systems (4) Self-organization (3,8)
GROUP 5: Sergiy, Albert, Christophe
Interests: Cognitive Science (3,33) Neural Nets & Simulation (3,33) Biology (3)
GROUP 6: Stanislav, Natalia, Denis, Sergey, Vaclav, Anton, Pavel, Ferenc, Milica, Oleksiy
Interests: Computer Science (3,4) Neural Nets & Simulation (3,4) Economics (3,3)
GROUP 7: Irina, Zoltan, Tatyana, Carolyn
Interests: Quantum Consciousness (3,75) Cognitive Science (3,5) Computer Science (3,5)
GROUP 8: Hilary, Lyudmila, Nilanjana, Angel, Wojciech, Mourad, Jason, Arthur, Suzanne, Aaron, Jasper, Mark, Anukool
Interests: Biology (3,38) Evolution (3,38) Self-organization (3,23)
Genetic algorithm
FITNESS TERMS: 0,988905325 0,845403674 0,989583333 0,981469795
GROUP 1: Arndt, Tatyana, Mianlai, Sergey, Zoltan
Interests: Self-organization (4) Neural Nets & Simulation (3,6) Physics (3,4)
GROUP 2: Denis, Pau, Alex, Ana-Maria, Lisa, Vadim, Sergiy, Eva, Milica
Interests: Computer Science (2,56) Neural Nets & Simulation (2,56) Evolution (2,44)
GROUP 3: Stanislav, Natalia, Nilanjana, Gordan, Mourad, Gerard, Ferenc, Victor, Valentin, Oleksiy
Interests: Computer Science (3,1) Multi-Agent Systems (3,1) Self-organization (2,9)
GROUP 4: Suzanne, Lyudmila, Angel, Wojciech, Mark, Anton, Nicholas
Interests: Self-organization (3,43) Evolution (3,14) Psychology (3)
GROUP 5: Christophe, Aaron, Hilary, Albert, Alexander, Frederic
Interests: Cognitive Science (3) Biology (2,83) Evolution (2,67)
GROUP 6: Bartlomiej, Sergey, Jasper, Vaclav, Pavel, Gabriella
Interests: Economics (3,33) Self-organization (3) Computer Science (2,67)
GROUP 7: Matthew, Angelica, Louise, Arthur
Interests: Biology (3,75) Evolution (3,5) Self-organization (3,5)
GROUP 8: Anukool, Irina, Jason, Volodymyr, Carolyn
Interests: Computer Science (3,2) Information Theory (3,2) Philosophy (3,2)
Comparison of results
              Nearest Neighbour  Furthest Neighbour  Average Linkage  Ward Linkage  GA
Balance       0,37               0,93                0,82             0,97          0,99
Interests     0,85               0,89                0,88             0,89          0,85
Basic Skills  0,72               0,96                0,90             0,97          0,99
Knowledge     0,92               0,96                0,95             0,97          0,98
GOOD BYE, CSSS 2002