Unsupervised Learning
Javier Bejar
LSI - FIB
Term 2012/2013
Outline
1 Introduction
2 Algorithms for unsupervised learning
3 Hierarchical algorithms
4 Concept Formation
5 Partitional algorithms
    Model/Prototype Based Clustering
    Density/Grid Based Clustering
    Graph Based Clustering
    Unsupervised Neural Networks
Introduction
Unsupervised Learning
Learning can be done in a supervised or an unsupervised way
There is a strong bias in the machine learning community towards supervised learning
But many concepts are learned in an unsupervised way
The discovery of new concepts is always unsupervised
Unsupervised Learning
Goals:
Summarization: to obtain a representation that describes an unlabeled dataset
Understanding: to discover concepts inside the data
These tasks are difficult because the discovery process is biased by context
Different answers can be valid depending on the discovery goal or the domain
There are few criteria to validate the results
Knowledge representation: unstructured (partitions/clusters) or relational (hierarchies)
Algorithms for unsupervised learning
Unsupervised Learning
Learning by discovering predefined structures
For example: probability distributions/models using parametric or non-parametric estimation
It is assumed that the data are embedded in an N-dimensional space that has a similarity/dissimilarity function defined
Bias:
Examples are more related to the nearest examples than to the farthest
Look for compact groups that are maximally separated from each other
Related areas: statistics, machine learning, graph theory, fuzzy set theory, physics
Algorithms for unsupervised learning
Two main strategies:
Hierarchical algorithms
Examples are usually organized as a binary tree
Usually there is no explicit division into groups
Partitional algorithms
Only a partition of the dataset is obtained
Hierarchical algorithms
Based on graph theory
The examples form a fully connected graph
Similarity defines the length of the edges
The clustering is decided using connectivity criteria
Based on matrix algebra
A distance matrix is calculated from the examples
The clustering is computed using the distance matrix
The distance matrix is updated after each step (different updating criteria)
Graphs
Single Linkage, Complete Linkage, MST
Divisive, Agglomerative
Matrices
Johnson algorithm
Different update criteria (single link, complete link, centroid, minimum variance)
Computational cost
$O(n_{inst}^3 \times n_{dims})$
Agglomerative Graph Algorithm
Algorithm: Agglomerative graph algorithm

Compute the distance/similarity matrix
repeat
    Find the pair of examples with the highest similarity
    Add an edge to the graph corresponding to this pair
    if the agglomeration criterion holds then
        Merge the clusters the pair belongs to
    end
until only one cluster remains
Single linkage = the new edge is between two disconnected subgraphs
Complete linkage = the new edge creates a clique with all the nodes of both subgraphs
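To make the procedure concrete, here is a minimal Python sketch of the agglomerative graph scheme under the single-linkage criterion, with union-find bookkeeping for the connected components; the function name, the early stop at n_clusters and the toy points are illustrative choices, not from the slides.

```python
import itertools
import numpy as np

def single_link_graph_clustering(X, n_clusters):
    """Agglomerative graph clustering, single-linkage flavour: edges are
    added shortest-first, and an edge between two disconnected subgraphs
    merges their clusters (tracked with union-find)."""
    parent = list(range(len(X)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    edges = sorted(itertools.combinations(range(len(X)), 2),
                   key=lambda e: np.linalg.norm(X[e[0]] - X[e[1]]))
    components = len(X)
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:                        # edge joins two disconnected subgraphs
            parent[ra] = rb
            components -= 1
            if components == n_clusters:    # stop early instead of at one cluster
                break
    return [find(i) for i in range(len(X))]

X = np.array([[0.0, 0.0], [0.0, 1.0], [0.5, 0.5], [5.0, 5.0], [5.0, 6.0]])
print(single_link_graph_clustering(X, n_clusters=2))
```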
Hierarchical algorithms - Graphs
Distance matrix over five examples:

    2   3   4   5
1   6   8   2   7
2       1   5   3
3          10   9
4               4

[Figure: the sequence of graphs over nodes 1-5 as edges are added, with the leaf orderings of the resulting Single Link and Complete Link dendrograms]
Agglomerative Johnson algorithm
Algorithm: Agglomerative Johnson algorithm
Compute the distance/similarity matrix
repeat
    Find the pair of groups/examples with the highest similarity
    Merge the pair of groups/examples
    Delete the rows and columns corresponding to the pair of groups/examples
    Add a new row and column with the new distances to the new group
until the matrix has one element
Single linkage = the new distance is the distance between the nearest examples
Complete linkage = the new distance is the distance between the farthest examples
Average/centroid linkage = the new distance is the distance between the centroids
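A naive numpy sketch of this matrix scheme with a single-link update may help; swapping the np.minimum line changes the update criterion (np.maximum gives complete link). The function name and the merge-history format are illustrative.

```python
import numpy as np

def johnson(D, link=np.minimum):
    """Johnson-style agglomeration on a distance matrix.

    D: (n, n) symmetric distance matrix. `link` combines two rows of the
    matrix when their clusters merge (np.minimum = single link,
    np.maximum = complete link). Returns (cluster_a, cluster_b, distance)
    merge triples."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)        # a cluster never merges with itself
    clusters = {i: [i] for i in range(len(D))}
    merges = []
    while len(clusters) > 1:
        active = sorted(clusters)
        sub = D[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]
        merges.append((clusters[i], clusters[j], D[i, j]))
        # fold row j into row i with the linkage rule, then retire j
        D[i, :] = link(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        D[j, :] = D[:, j] = np.inf
        clusters[i] = clusters[i] + clusters.pop(j)
    return merges

# the 5-point distance matrix used in the examples of this section
D = np.array([[0, 6, 8, 2, 7], [6, 0, 1, 5, 3], [8, 1, 0, 10, 9],
              [2, 5, 10, 0, 4], [7, 3, 9, 4, 0]])
for a, b, d in johnson(D):
    print(a, b, d)
```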
Hierarchical algorithms - Matrices
Initial matrix:

    2   3   4   5
1   6   8   2   7
2       1   5   3
3          10   9
4               4

After merging {2,3}:

     2,3   4    5
1     7    2    7
2,3       7.5   6
4               4

After merging {1,4}:

     1,4    5
2,3  7.25   6
1,4        5.5

After merging {1,4,5}:

     1,4,5
2,3  6.725
Hierarchical algorithms - Example
[Figure: a two-dimensional example dataset (x1, x2) and the dendrograms obtained with the Single Link, Complete Link, Median, Centroid and Ward criteria]
Hierarchical algorithms - Shortcomings
A partition of the data is not given; it has to be decided a posteriori
Some undesirable and strange behaviours can appear (chaining, inversions), distorting the results
The dendrogram is not a practical representation for large amounts of data
Its computational cost is high
Concept Formation

Other hierarchical algorithms - Concept Formation
Learning has an incremental nature (experience is acquired from continuous observation, not all at once)
Concepts are learned together with their relationships (polythetic hierarchies of concepts)
Search in the space of hierarchies
An objective function measures the utility of the learned structure
The updating of the structure is performed by a set of conceptual operators
The result depends on the order of the examples
Concept Formation - COBWEB (Fisher, 1987)
Based on ideas from cognitive psychology
Learning is incremental
Concepts are organized in a hierarchy
Concepts are organized around a prototype and described probabilistically
The hierarchical concept representation is modified via cognitive operators
Builds the hierarchy top-down
Four conceptual operators
Uses a heuristic measure to find the basic level (category utility)
COBWEB - Category utility
Category utility is defined for a set of categories
It biases the search towards categories with high intra-class similarity and low inter-class similarity
It is maximized by the categories at the basic level (the preferred level for prediction)
These classes maximize the predictivity of their attributes
COBWEB - Category utility
Intra-class similarity: $P(A_i = V_{ij} \mid C_k)$
Maximize → most of the examples in the class share this value for this attribute
Inter-class similarity: $P(C_k \mid A_i = V_{ij})$
Maximize → fewer examples from other classes share this value for this attribute
Maximize the trade-off between the two measures for a given set of categories:

$$\sum_{k=1}^{K}\sum_{i=1}^{I}\sum_{j=1}^{J} P(A_i = V_{ij})\, P(A_i = V_{ij} \mid C_k)\, P(C_k \mid A_i = V_{ij})$$
COBWEB - Category utility
Using Bayes' theorem this becomes:

$$\sum_{k=1}^{K} P(C_k) \sum_{i=1}^{I}\sum_{j=1}^{J} P(A_i = V_{ij} \mid C_k)^2$$

$\sum_{i=1}^{I}\sum_{j=1}^{J} P(A_i = V_{ij} \mid C_k)^2$ represents the expected number of attribute values that can be correctly predicted for a class
We look for a partition that increases this number compared to the baseline with no partition:

$$\sum_{i=1}^{I}\sum_{j=1}^{J} P(A_i = V_{ij})^2$$
COBWEB - Category utility
Category utility for qualitative attributes, for a set of K categories $\{C_1, \ldots, C_K\}$:

$$CU = \frac{\sum_{k=1}^{K} P(C_k)\left[\sum_{i=1}^{I}\sum_{j=1}^{J} P(A_i = V_{ij} \mid C_k)^2 - \sum_{i=1}^{I}\sum_{j=1}^{J} P(A_i = V_{ij})^2\right]}{K}$$

Category utility for quantitative attributes (Gaussian distributions):

$$CU = \frac{\sum_{k=1}^{K} P(C_k)\left[\sum_{i=1}^{I}\frac{1}{\sigma_{ik}} - \sum_{i=1}^{I}\frac{1}{\sigma_{ip}}\right]}{K}$$

where $\sigma_{ik}$ is the standard deviation of attribute $i$ in class $k$ and $\sigma_{ip}$ the one in the parent node
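As a worked illustration, a small Python function for the qualitative CU formula; the helper names and the toy shape/color partitions are assumptions for the example, not from the slides.

```python
from collections import Counter

def category_utility(partition):
    """Category utility for qualitative attributes.

    partition: list of clusters, each a list of examples; each example is a
    tuple of attribute values. Implements the CU formula above."""
    n_total = sum(len(c) for c in partition)
    n_attrs = len(partition[0][0])
    all_examples = [x for c in partition for x in c]

    def sum_sq(examples):
        # sum over attributes and values of P(A_i = V_ij | this set)^2
        s = 0.0
        for i in range(n_attrs):
            counts = Counter(x[i] for x in examples)
            s += sum((c / len(examples)) ** 2 for c in counts.values())
        return s

    baseline = sum_sq(all_examples)            # the no-partition term
    cu = 0.0
    for cluster in partition:
        p_ck = len(cluster) / n_total
        cu += p_ck * (sum_sq(cluster) - baseline)
    return cu / len(partition)

# toy data: a clean split of (shape, color) scores higher than a mixed one
good = [[("square", "black"), ("square", "black")],
        [("circle", "white"), ("circle", "white")]]
bad = [[("square", "black"), ("circle", "white")],
       [("square", "black"), ("circle", "white")]]
print(category_utility(good) > category_utility(bad))  # True
```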
Probabilistic hierarchy
[Figure: a probabilistic concept hierarchy; each node stores P(C) and the conditional probabilities P(V|C) of every value of the attributes Shape (triangle, square, circle) and Color (black, white)]
Algorithm
Incremental insertion of each example in the hierarchy
Look for the path from the root that puts the example in a leaf
Decide at each level how to modify the hierarchy (which operator to apply) to maximize CU, and descend the tree recursively
Operators
Incorporate: Put the example inside an existing class
New class: Create a new class at this level
Merge: two concepts are merged and the example is incorporated into the new class
Split (divide): a concept is replaced by its children
Split - Merge
[Figure: merge joins two sibling concepts into a single concept; split replaces a concept by its children]
COBWEB
Procedure: COBWEB (x: example, H: hierarchy), a depth-first limited search

Update the father with the new example
if we are at a leaf then
    Create a new level with this example
else
    Compute the CU of incorporating the example into each class
    Save the two best CUs
    Compute the CU of merging the two best classes
    Compute the CU of splitting the best class
    Compute the CU of creating a new class with the example
    Make the recursive call with the best choice
end
Partitional algorithms
Finding the optimal partition of N objects into K groups is NP-hard
Model/prototype based algorithms (K-means, Gaussian Mixture Models, Fuzzy C-means, Leader algorithm, ...)
Density based algorithms
Grid based algorithms
Graph theory based algorithms (spectral clustering)
Unsupervised neural networks
Model/Prototype Based Clustering
K-means
We assume that the shape of the clusters is hyperspherical
An iterative algorithm assigns each example to one of K groups (K is a parameter)
Hill-climbing search
Optimization criterion: the square error, i.e. minimize the distance of each example to the centroid of its class

$$\text{Distortion} = \sum_{k=1}^{K}\sum_{i \in C_k} \| x_i - \mu_k \|^2$$

The algorithm converges to a local minimum
K-means
Algorithm: K-means (X: examples, k: integer)

Generate k prototypes with the first k examples
Assign the remaining n-k examples to their nearest prototype
SumD = sum of squared example-prototype distances
repeat
    Recalculate the prototypes
    Reassign the examples to their nearest prototype
    SumI = SumD
    SumD = sum of squared example-prototype distances
until SumI - SumD < ε
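A runnable numpy sketch of this loop; random initialization is used here instead of the first k examples, and the function name and defaults are illustrative.

```python
import numpy as np

def kmeans(X, k, eps=1e-6, rng=None):
    """Plain K-means following the pseudocode above (random initialization
    instead of the first k examples, to soften order sensitivity)."""
    rng = np.random.default_rng(rng)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initial prototypes
    prev = np.inf
    while True:
        # assignment step: nearest prototype for every example
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        distortion = d2[np.arange(len(X)), labels].sum()
        if prev - distortion < eps:          # until SumI - SumD < eps
            return labels, mu
        prev = distortion
        # update step: recompute each centroid (keep the old one if empty)
        for j in range(k):
            if np.any(labels == j):
                mu[j] = X[labels == j].mean(0)

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
labels, centers = kmeans(X, k=2, rng=0)
```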
K-means
[Figure: snapshots of K-means iterations on a two-cluster dataset; examples switch between clusters 1 and 2 as the prototypes are updated]
K-means - practical problems
The algorithm is sensitive to the initialization (running the algorithm from several random initializations can be a good idea)
Finding the value of k is not an easy problem (experimentation with different values is needed)
You can obtain a solution even if the classes are not hyperspherical (some classes may be split)
There is no guarantee about the quality of the solution
Mixture Decomposition - EM algorithm
We assume that the data are drawn from a mixture of probability distributions (usually Gaussians); we look for the parameters of the distributions that best explain the data
The model of the data is:

$$P(x \mid \theta) = \sum_{i=1}^{K} w_i\, P(x \mid \theta_i, w_i)$$

where $K$ is the number of clusters and $\sum_{i=1}^{K} w_i = 1$
The membership of an example is a probability distribution
Mixture Decomposition - EM algorithm
The goal is to estimate the parameters of the distribution that describes each class (e.g. means and standard deviations)
The algorithm maximizes the likelihood of the distribution with respect to the dataset
It iterates over two steps:
Expectation: compute a function that assigns a degree of membership of each instance to each of the K probability distributions
Maximization: re-estimate the parameters of the distributions to maximize the memberships
EM Algorithm (K Gaussian)
For the Gaussian case:
$$P(x \mid \vec{\mu}, \Sigma) = \sum_{i=1}^{K} P(w_i)\, P(x \mid \vec{\mu}_i, \Sigma_i, w_i)$$

where $\vec{\mu}$ are the vectors of means and $\Sigma$ the covariance matrices
EM Algorithm (K Gaussian)
The computations depend on the assumptions that we make about the attributes (independent or not, same $\sigma$, ...)
If the attributes are independent: $\mu_i$ and $\sigma_i$ have to be computed for each class (O(k) parameters; model: hyperspheres or ellipsoids parallel to the coordinate axes)
If the attributes are not independent: $\mu_i$, $\sigma_i$ and $\sigma_{ij}$ have to be computed for each class (O(k²) parameters; model: hyper-ellipsoids not parallel to the coordinate axes)
EM Algorithm (K Gaussian)
For the case of A independent attributes:

$$P(x \mid \vec{\mu}_i, \Sigma_i, w_i) = \prod_{j=1}^{A} P(x \mid \mu_{ij}, \sigma_{ij}, w_i)$$

The model to fit is:

$$P(x \mid \vec{\mu}, \vec{\sigma}) = \sum_{i=1}^{K} P(w_i) \prod_{j=1}^{A} P(x \mid \mu_{ij}, \sigma_{ij}, w_i)$$
EM Algorithm (K Gaussian)
The update of the parameters in the maximization step is:
$$\mu_i = \frac{\sum_{k=1}^{N} P(w_i \mid x_k, \vec{\mu}, \vec{\sigma})\, x_k}{\sum_{k=1}^{N} P(w_i \mid x_k, \vec{\mu}, \vec{\sigma})}$$

$$\sigma_i = \frac{\sum_{k=1}^{N} P(w_i \mid x_k, \vec{\mu}, \vec{\sigma})\,(x_k - \mu_i)^2}{\sum_{k=1}^{N} P(w_i \mid x_k, \vec{\mu}, \vec{\sigma})}$$

$$P(w_i) = \frac{1}{N}\sum_{k=1}^{N} P(w_i \mid x_k, \vec{\mu}, \vec{\sigma})$$
A set of K initial distributions $N(\mu_i, \sigma_i)$ is generated; $\mu_i$ and $\sigma_i$ are vectors with the mean and the variance of each attribute
We repeat until convergence:
1 Expectation: compute the membership of each instance to each probability distribution, usually via the log-likelihood of the distribution. Each instance gets a weight depending on the probability assigned in the previous step: $w_{x_j,i} = \log(P(x_j \mid N(\mu_i, \sigma_i)))$ (MLE)
2 Maximization: recompute the parameters using the weights from the previous step and obtain the new $\mu_i$ and $\sigma_i$ for each distribution
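A minimal numpy/scipy sketch of the two steps for a one-dimensional mixture of K Gaussians; it computes normalized responsibilities with Bayes' rule rather than raw log weights, and all names and defaults are illustrative.

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_1d(x, k, iters=50, rng=None):
    """EM for a 1-D mixture of K Gaussians, sketching the scheme above.

    Returns means mu, standard deviations sigma and priors p after `iters`
    expectation/maximization rounds."""
    rng = np.random.default_rng(rng)
    mu = rng.choice(x, size=k, replace=False)        # initial distributions
    sigma = np.full(k, x.std())
    p = np.full(k, 1.0 / k)
    for _ in range(iters):
        # Expectation: memberships P(w_i | x_k) via Bayes' rule
        dens = p * norm.pdf(x[:, None], mu, sigma)   # shape (N, K)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # Maximization: weighted re-estimation of the parameters
        w = resp.sum(axis=0)                         # effective counts
        mu = (resp * x[:, None]).sum(axis=0) / w
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / w)
        p = w / len(x)
    return mu, sigma, p

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 0.5, 200)])
print(em_gaussian_1d(x, k=2, rng=0))
```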
EM algorithm - Comments
K-means is a particular case of this algorithm
The main advantage is that we obtain memberships as probabilities (soft assignments)
Using different probability distributions we can find different kinds of structures
Incremental algorithms: Neighbourhood relationship
What the algorithms up to this point have in common is that they are not incremental
Incrementality allows updating a model with new data without starting from scratch
These algorithms use a neighbourhood relationship defined from a similarity/distance function
This neighbourhood determines which instances belong to the same group
Examples: Nearest Neighbour, Mutual Neighbour
Nearest Neighbour/Leader Algorithm
Algorithm: Leader algorithm (X: examples, D: double)

Generate a prototype with the first example
while there are examples do
    e = current example
    d = distance of e to the nearest prototype
    if d ≤ D then
        Introduce the example into that class
        Recompute the prototype
    else
        Create a new prototype with this example
    end
end
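A one-pass Python sketch of the algorithm; interpreting "recompute the prototype" as an incremental running mean is my assumption, and the names are illustrative.

```python
import numpy as np

def leader(stream, D):
    """Leader algorithm sketch: one pass over the data with threshold D.

    Keeps a running-mean prototype per cluster; returns the prototypes and
    the cluster index assigned to each example."""
    prototypes, counts, labels = [], [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if prototypes:
            d = [np.linalg.norm(x - p) for p in prototypes]
            j = int(np.argmin(d))
        if not prototypes or d[j] > D:
            prototypes.append(x.copy())      # new prototype for this example
            counts.append(1)
            labels.append(len(prototypes) - 1)
        else:
            counts[j] += 1
            prototypes[j] += (x - prototypes[j]) / counts[j]  # incremental mean
            labels.append(j)
    return prototypes, labels

protos, labels = leader([[0, 0], [0.5, 0], [5, 5], [5.2, 4.9]], D=2.0)
print(len(protos))  # 2 clusters
```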
Nearest Neighbour
[Figure: examples arriving one at a time are assigned to clusters 1-3; a new prototype is created whenever an example is farther than D from all existing prototypes]
Fuzzy Clustering
Fuzzy clustering relaxes the hard partition constraint of K-means
Each instance has a degree of membership to each partition
A new optimization function is introduced:

$$L = \sum_{k=1}^{K}\sum_{i=1}^{N} \delta(C_k, x_i)^b\, \|x_i - \mu_k\|^2$$

where $\sum_{k=1}^{K} \delta(C_k, x_i) = 1$ and $b$ is a blending factor
This is an advantage over other algorithms when clusters overlap
Fuzzy Clustering - Fuzzy C-means
Fuzzy C-means is the best-known fuzzy clustering algorithm; it is the fuzzy version of K-means
The algorithm optimizes the objective function in a similar way
The cluster centers are updated as:

$$\mu_j = \frac{\sum_{i=1}^{N} \delta(C_j, x_i)^b\, x_i}{\sum_{i=1}^{N} \delta(C_j, x_i)^b}$$

and the memberships as:

$$\delta(C_j, x_i) = \frac{(1/d_{ij})^{1/(b-1)}}{\sum_{k=1}^{K}(1/d_{ik})^{1/(b-1)}}, \qquad d_{ij} = \|x_i - \mu_j\|^2$$
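These two updates can be alternated directly; a minimal numpy sketch, assuming random initial memberships and a fixed number of iterations (names and defaults are illustrative):

```python
import numpy as np

def fuzzy_cmeans(X, k, b=2.0, iters=100, rng=None):
    """Fuzzy C-means sketch with blending factor b > 1.

    Alternates the center and membership updates shown above; returns the
    centers and the (N, K) membership matrix."""
    rng = np.random.default_rng(rng)
    u = rng.random((len(X), k))
    u /= u.sum(axis=1, keepdims=True)       # memberships sum to 1 per example
    for _ in range(iters):
        w = u ** b
        mu = (w.T @ X) / w.sum(axis=0)[:, None]               # center update
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances
        d = np.maximum(d, 1e-12)                              # avoid division by zero
        inv = (1.0 / d) ** (1.0 / (b - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)              # membership update
    return mu, u

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3])
centers, memberships = fuzzy_cmeans(X, k=2, rng=0)
```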
Fuzzy Clustering
Other membership and distance functions can be used
Different functions have specific purposes, such as detecting specific shapes in the data (lines, rectangles, ...)
This algorithm is widely used in image recognition
Density/Grid Based Clustering
Density estimation
The number of groups is not decided beforehand
We are looking for regions with high density of examples
We are not limited to a predefined set of shapes (non-parametric model)
Different approaches:
Space partitioning (multidimensional grid)
Multidimensional histograms (we look for high-density regions in fewer dimensions)
Usually more suited to datasets with low dimensionality (e.g. geographical data)
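A small grid-based sketch along these lines: bin the data in a regular grid, keep the dense cells, and join adjacent dense cells into clusters via connected components. The bins/min_points thresholds and the function name are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def grid_density_clusters(X, bins=20, min_points=5):
    """Grid-based density clustering sketch for low-dimensional data.

    Partitions the bounding box into a regular grid, keeps cells with at
    least `min_points` examples, and labels connected dense regions."""
    hist, edges = np.histogramdd(X, bins=bins)
    dense = hist >= min_points
    cell_labels, n_clusters = ndimage.label(dense)   # connected dense regions
    # map every example to the label of its cell (0 = sparse cell / noise)
    idx = np.stack([
        np.clip(np.searchsorted(e, X[:, j], side="right") - 1, 0, bins - 1)
        for j, e in enumerate(edges)], axis=1)
    labels = cell_labels[tuple(idx.T)]
    return labels, n_clusters

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 6])
labels, k = grid_density_clusters(X)
print(k)  # typically 2 dense regions
```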
Density estimation - Space partitioning / Multidimensional grids

[Figures: examples of clustering by space partitioning and by multidimensional grids]
Graph Based Clustering
Based on graph theory
We create different kinds of graphs from the dataset (MST, Voronoi, Delaunay, ...)
We define consistency criteria for the edges of the graph (inconsistent edges are deleted)
The result is a set of unconnected components
Two advantages: we do not need to know the number of classes, and we do not look for a specific model (any shape is possible)
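A compact sketch of the MST variant of this idea; the consistency criterion used here (cut edges longer than a multiple of the mean MST edge length) is one simple choice among many, and the names are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clusters(X, factor=2.0):
    """MST-based graph clustering sketch.

    Builds the minimum spanning tree of the complete distance graph and
    deletes the edges considered inconsistent: here, those longer than
    `factor` times the mean MST edge length. The connected components
    that remain are the clusters."""
    D = squareform(pdist(X))
    mst = minimum_spanning_tree(D).toarray()
    edge_lengths = mst[mst > 0]
    mst[mst > factor * edge_lengths.mean()] = 0      # cut inconsistent edges
    n, labels = connected_components(mst, directed=False)
    return n, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 8])
n_clusters, labels = mst_clusters(X)
print(n_clusters)
```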
Spectral Clustering
Spectral graph theory relates properties of a graph to the eigenvalues and eigenvectors of its adjacency or Laplacian matrix
Spectral clustering uses spectral properties of the distance matrix
The distance matrix represents a graph that connects the examples:
Complete graph
Neighbourhood graph (different definitions)
From the diagonalization of this matrix several clustering algorithms can be defined
Spectral Clustering
We start with the similarity matrix $W$ of a dataset (complete or not)
This matrix represents the similarity graph of the instances
The degree of a vertex is defined as:

$$d_i = \sum_{j=1}^{n} w_{ij}$$

We define the degree matrix $D$ as the diagonal matrix with values $d_1, d_2, \ldots, d_n$
We can define different Laplace matrices:
Unnormalized: $L = D - W$
Normalized: $L_{sym} = D^{-1/2} L D^{-1/2}$, or also $L_{rw} = D^{-1} L$
Spectral Clustering
We can cluster a dataset following these steps:
1 Compute the Laplace matrix from the similarity matrix
2 Compute the first K eigenvectors of the Laplace matrix (those with the smallest eigenvalues)
3 Use the eigenvectors as new data points
4 Apply K-means as the clustering algorithm
We are embedding the dataset in a space with fewer dimensions using the neighbourhood relations among the data
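These steps map almost line by line onto numpy; the sketch below uses a Gaussian similarity on a complete graph, the unnormalized Laplacian and a tiny inline K-means, all of which are illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def spectral_clustering(X, k, sigma=1.0, rng=0):
    """Unnormalized spectral clustering sketch (L = D - W), following the
    steps above; a small K-means loop runs on the spectral embedding."""
    W = np.exp(-squareform(pdist(X, "sqeuclidean")) / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                 # no self-similarity edges
    L = np.diag(W.sum(axis=1)) - W           # unnormalized Laplacian
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    emb = vecs[:, :k]                        # eigenvectors as new data points
    g = np.random.default_rng(rng)
    mu = emb[g.choice(len(emb), k, replace=False)]
    for _ in range(100):                     # K-means on the embedding
        labels = ((emb[:, None] - mu[None]) ** 2).sum(-1).argmin(1)
        mu = np.array([emb[labels == j].mean(0) if np.any(labels == j)
                       else mu[j] for j in range(k)])
    return labels

X = np.vstack([np.random.randn(40, 2) * 0.3,
               np.random.randn(40, 2) * 0.3 + [3.0, 0.0]])
print(np.bincount(spectral_clustering(X, k=2)))
```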
Unsupervised Neural Networks
Self-organizing maps are an unsupervised neural network method
Can be seen as an on-line constrained version of K-means
The data is transformed to fit in a 1-d or 2-d mesh
The nodes of this mesh are the prototypes
Self-Organizing Maps
To build the map we have to decide the size and shape of the mesh (rectangular/hexagonal)
Each node of the mesh is a multidimensional prototype of p features
Algorithm: Self-Organizing Maps algorithm

Distribute the initial prototypes regularly on the mesh
for a predefined number of iterations do
    foreach example x_i do
        Find the nearest prototype (m_j)
        Determine the neighborhood M of m_j
        foreach prototype m_k ∈ M do
            m_k = m_k + α(x_i − m_k)
        end
    end
end
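A compact numpy sketch of this training loop, with the shrinking learning rate and neighborhood described below; the Gaussian neighborhood weighting, mesh size and iteration counts are illustrative assumptions.

```python
import numpy as np

def som(X, rows=8, cols=8, iters=2000, rng=0):
    """Self-organizing map sketch on a rectangular mesh.

    alpha and the neighborhood radius both shrink during training;
    neighbors are weighted by their grid distance to the winning node."""
    g = np.random.default_rng(rng)
    protos = g.random((rows, cols, X.shape[1]))           # mesh of prototypes
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1).astype(float)
    for t in range(iters):
        alpha = 1.0 - t / iters                           # learning rate 1 -> 0
        radius = max(1.0, (rows / 2) * (1 - t / iters))   # shrinking neighborhood
        x = X[g.integers(len(X))]
        # best-matching unit: the prototype nearest to the example
        d = ((protos - x) ** 2).sum(-1)
        bmu = np.unravel_index(d.argmin(), d.shape)
        # Gaussian weight over grid distance to the BMU
        gd2 = ((grid - np.array(bmu)) ** 2).sum(-1)
        h = np.exp(-gd2 / (2 * radius ** 2))
        protos += alpha * h[..., None] * (x - protos)     # m_k += a(x - m_k)
    return protos

X = np.random.rand(500, 3)        # e.g. RGB colors
mesh = som(X)
```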
Self-Organizing Maps
During the iterations the mesh is transformed to get closer to the data while maintaining the two-dimensional relationship between prototypes
The performance of the algorithm depends on the learning rate α, which is usually decreased from 1 to 0 during the iterations
The neighborhood of a prototype is defined by the adjacency of the cells and the distance between prototypes
The number of neighbors used in the update is decreased during the iterations, from a predefined number down to 1 (only the prototype nearest to the observation)
Different variations of the algorithm weight the update according to the distance between prototypes