Unsupervised Learning
Javier Bejar
LSI - FIB
Term 2012/2013
Outline
1 Introduction
2 Algorithms for unsupervised learning
3 Hierarchical algorithms
4 Concept Formation
5 Partitional algorithms
    Model/Prototype Based Clustering
    Density/Grid Based Clustering
    Graph Based Clustering
    Unsupervised Neural Networks
Introduction
Unsupervised Learning
Learning can be done in a supervised or an unsupervised way
There is a strong bias in the machine learning community towards supervised learning
But many concepts are learned in an unsupervised way
The discovery of new concepts is always unsupervised
Unsupervised Learning
Goals:
Summarization: to obtain a representation that describes an unlabeled dataset
Understanding: to discover concepts inside the data
These tasks are difficult because the discovery process is biased by context
Different answers can be valid depending on the discovery goal or the domain
There are few criteria to validate the results
Knowledge representation: unstructured (partitions/clusters) or relational (hierarchies)
Algorithms for unsupervised learning
Unsupervised Learning
Learning by discovering predefined structures
For example: probability distributions/models using parametric or non-parametric estimation
It is assumed that the data are embedded in an N-dimensional space that has a similarity/dissimilarity function defined
Bias:
Examples are more related to the nearest examples than to the farthest
Look for compact groups that are maximally separated from each other
Related areas: statistics, machine learning, graph theory, fuzzy set theory, physics
Algorithms for unsupervised learning
Two main strategies:
Hierarchical algorithms
Examples are usually organized as a binary tree
Usually there is no explicit division into groups
Partitional algorithms
Only a partition of the dataset is obtained
Hierarchical algorithms
Based on graph theory
The examples form a fully connected graph
Similarity defines the length of the edges
The clustering is decided using connectivity criteria
Based on matrix algebra
A distance matrix is calculated from the examples
The clustering is computed using the distance matrix
The distance matrix is updated after each step (different updating criteria)
Graphs
Single Linkage, Complete Linkage, MST
Divisive, Agglomerative
Matrices
Johnson algorithm
Different update criteria (single link, complete link, centroid, minimum variance)
Computational cost
$O(n_{inst}^3 \times n_{dims})$
Agglomerative Graph Algorithm
Algorithm: Agglomerative graph algorithm

Compute the distance/similarity matrix
repeat
    Find the pair of examples with the highest similarity
    Add an edge to the graph corresponding to this pair
    if the agglomeration criterion holds then
        Merge the clusters the pair belongs to
    end
until only one cluster remains
Single linkage = the new edge is between two disconnected subgraphs
Complete linkage = the new edge creates a clique with all the nodes of both subgraphs
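To make the procedure concrete, here is a minimal Python sketch of the agglomerative graph scheme under the single-linkage criterion, with union-find bookkeeping for the connected components; the function name, the early stop at n_clusters and the toy points are illustrative choices, not from the slides.

```python
import itertools
import numpy as np

def single_link_graph_clustering(X, n_clusters):
    """Agglomerative graph clustering, single-linkage flavour: edges are
    added shortest-first, and an edge between two disconnected subgraphs
    merges their clusters (tracked with union-find)."""
    parent = list(range(len(X)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    edges = sorted(itertools.combinations(range(len(X)), 2),
                   key=lambda e: np.linalg.norm(X[e[0]] - X[e[1]]))
    components = len(X)
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:                        # edge joins two disconnected subgraphs
            parent[ra] = rb
            components -= 1
            if components == n_clusters:    # stop early instead of at one cluster
                break
    return [find(i) for i in range(len(X))]

X = np.array([[0.0, 0.0], [0.0, 1.0], [0.5, 0.5], [5.0, 5.0], [5.0, 6.0]])
print(single_link_graph_clustering(X, n_clusters=2))
```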
Hierarchical algorithms - Graphs
Distance matrix over five examples:

    2   3   4   5
1   6   8   2   7
2       1   5   3
3          10   9
4               4

[Figure: the sequence of graphs over nodes 1-5 as edges are added, with the leaf orderings of the resulting Single Link and Complete Link dendrograms]
Agglomerative Johnson algorithm
Algorithm: Agglomerative Johnson algorithm
Compute the distance/similarity matrix
repeat
    Find the pair of groups/examples with the highest similarity
    Merge the pair of groups/examples
    Delete the rows and columns corresponding to the pair of groups/examples
    Add a new row and column with the new distances to the new group
until the matrix has one element
Single linkage = the new distance is the distance between the nearest examples
Complete linkage = the new distance is the distance between the farthest examples
Average/centroid linkage = the new distance is the distance between the centroids
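A naive numpy sketch of this matrix scheme with a single-link update may help; swapping the np.minimum line changes the update criterion (np.maximum gives complete link). The function name and the merge-history format are illustrative.

```python
import numpy as np

def johnson(D, link=np.minimum):
    """Johnson-style agglomeration on a distance matrix.

    D: (n, n) symmetric distance matrix. `link` combines two rows of the
    matrix when their clusters merge (np.minimum = single link,
    np.maximum = complete link). Returns (cluster_a, cluster_b, distance)
    merge triples."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)        # a cluster never merges with itself
    clusters = {i: [i] for i in range(len(D))}
    merges = []
    while len(clusters) > 1:
        active = sorted(clusters)
        sub = D[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]
        merges.append((clusters[i], clusters[j], D[i, j]))
        # fold row j into row i with the linkage rule, then retire j
        D[i, :] = link(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        D[j, :] = D[:, j] = np.inf
        clusters[i] = clusters[i] + clusters.pop(j)
    return merges

# the 5-point distance matrix used in the examples of this section
D = np.array([[0, 6, 8, 2, 7], [6, 0, 1, 5, 3], [8, 1, 0, 10, 9],
              [2, 5, 10, 0, 4], [7, 3, 9, 4, 0]])
for a, b, d in johnson(D):
    print(a, b, d)
```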
Hierarchical algorithms - Matrices
Initial matrix:

    2   3   4   5
1   6   8   2   7
2       1   5   3
3          10   9
4               4

After merging {2,3}:

     2,3   4    5
1     7    2    7
2,3       7.5   6
4               4

After merging {1,4}:

     1,4    5
2,3  7.25   6
1,4        5.5

After merging {1,4,5}:

     1,4,5
2,3  6.725
Hierarchical algorithms - Example
[Figure: a two-dimensional example dataset (x1, x2) and the dendrograms obtained with the Single Link, Complete Link, Median, Centroid and Ward criteria]
Hierarchical algorithms - Shortcomings
A partition of the data is not given; it has to be decided a posteriori
Some undesirable and strange behaviours can appear (chaining, inversions), distorting the results
The dendrogram is not a practical representation for large amounts of data
Its computational cost is high
Concept Formation

Other hierarchical algorithms - Concept Formation
Learning has an incremental nature (experience is acquired from continuous observation, not all at once)
Concepts are learned together with their relationships (polythetic hierarchies of concepts)
Search in the space of hierarchies
An objective function measures the utility of the learned structure
The updating of the structure is performed by a set of conceptual operators
The result depends on the order of the examples
Concept Formation - COBWEB (Fisher, 1987)
Based on ideas from cognitive psychology
Learning is incremental
Concepts are organized in a hierarchy
Concepts are organized around a prototype and described probabilistically
The hierarchical concept representation is modified via cognitive operators
Builds the hierarchy top-down
Four conceptual operators
Uses a heuristic measure to find the basic level (category utility)
COBWEB - Category utility
Category utility is defined for a set of categories
It biases the search towards categories with high intra-class similarity and low inter-class similarity
It is maximized by the categories at the basic level (the preferred level for prediction)
These classes maximize the predictivity of their attributes
COBWEB - Category utility
Intra-class similarity: $P(A_i = V_{ij} \mid C_k)$
Maximize → most of the examples in the class share this value for this attribute
Inter-class similarity: $P(C_k \mid A_i = V_{ij})$
Maximize → fewer examples from other classes share this value for this attribute
Maximize the trade-off between the two measures for a given set of categories:

$$\sum_{k=1}^{K}\sum_{i=1}^{I}\sum_{j=1}^{J} P(A_i = V_{ij})\, P(A_i = V_{ij} \mid C_k)\, P(C_k \mid A_i = V_{ij})$$
COBWEB - Category utility
Using Bayes' theorem this becomes:

$$\sum_{k=1}^{K} P(C_k) \sum_{i=1}^{I}\sum_{j=1}^{J} P(A_i = V_{ij} \mid C_k)^2$$

$\sum_{i=1}^{I}\sum_{j=1}^{J} P(A_i = V_{ij} \mid C_k)^2$ represents the expected number of attribute values that can be correctly predicted for a class
We look for a partition that increases this number compared to the baseline with no partition:

$$\sum_{i=1}^{I}\sum_{j=1}^{J} P(A_i = V_{ij})^2$$
COBWEB - Category utility
Category utility for qualitative attributes, for a set of K categories $\{C_1, \ldots, C_K\}$:

$$CU = \frac{\sum_{k=1}^{K} P(C_k)\left[\sum_{i=1}^{I}\sum_{j=1}^{J} P(A_i = V_{ij} \mid C_k)^2 - \sum_{i=1}^{I}\sum_{j=1}^{J} P(A_i = V_{ij})^2\right]}{K}$$

Category utility for quantitative attributes (Gaussian distributions):

$$CU = \frac{\sum_{k=1}^{K} P(C_k)\left[\sum_{i=1}^{I}\frac{1}{\sigma_{ik}} - \sum_{i=1}^{I}\frac{1}{\sigma_{ip}}\right]}{K}$$

where $\sigma_{ik}$ is the standard deviation of attribute $i$ in class $k$ and $\sigma_{ip}$ the one in the parent node
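As a worked illustration, a small Python function for the qualitative CU formula; the helper names and the toy shape/color partitions are assumptions for the example, not from the slides.

```python
from collections import Counter

def category_utility(partition):
    """Category utility for qualitative attributes.

    partition: list of clusters, each a list of examples; each example is a
    tuple of attribute values. Implements the CU formula above."""
    n_total = sum(len(c) for c in partition)
    n_attrs = len(partition[0][0])
    all_examples = [x for c in partition for x in c]

    def sum_sq(examples):
        # sum over attributes and values of P(A_i = V_ij | this set)^2
        s = 0.0
        for i in range(n_attrs):
            counts = Counter(x[i] for x in examples)
            s += sum((c / len(examples)) ** 2 for c in counts.values())
        return s

    baseline = sum_sq(all_examples)            # the no-partition term
    cu = 0.0
    for cluster in partition:
        p_ck = len(cluster) / n_total
        cu += p_ck * (sum_sq(cluster) - baseline)
    return cu / len(partition)

# toy data: a clean split of (shape, color) scores higher than a mixed one
good = [[("square", "black"), ("square", "black")],
        [("circle", "white"), ("circle", "white")]]
bad = [[("square", "black"), ("circle", "white")],
       [("square", "black"), ("circle", "white")]]
print(category_utility(good) > category_utility(bad))  # True
```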
Probabilistic hierarchy
[Figure: a probabilistic concept hierarchy; each node stores P(C) and the conditional probabilities P(V|C) of every value of the attributes Shape (triangle, square, circle) and Color (black, white)]
Algorithm
Incremental insertion of each example in the hierarchy
Look for the path from the root that puts the example in a leaf
Decide at each level how to modify the hierarchy (which operator to apply) to maximize CU, and descend the tree recursively
Operators
Incorporate: Put the example inside an existing class
New class: Create a new class at this level
Merge: two concepts are merged and the example is incorporated into the new class
Split (divide): a concept is replaced by its children
Split - Merge
[Figure: merge joins two sibling concepts into a single concept; split replaces a concept by its children]
COBWEB
Procedure: COBWEB (x: example, H: hierarchy), a depth-first limited search

Update the father with the new example
if we are at a leaf then
    Create a new level with this example
else
    Compute the CU of incorporating the example into each class
    Save the two best CUs
    Compute the CU of merging the two best classes
    Compute the CU of splitting the best class
    Compute the CU of creating a new class with the example
    Make the recursive call with the best choice
end
Partitional algorithms
Finding the optimal partition of N objects into K groups is NP-hard
Model/prototype based algorithms (K-means, Gaussian Mixture Models, Fuzzy C-means, Leader algorithm, ...)
Density based algorithms
Grid based algorithms
Graph theory based algorithms (spectral clustering)
Unsupervised neural networks
Model/Prototype Based Clustering
K-means
We assume that the shape of the clusters is hyperspherical
An iterative algorithm assigns each example to one of K groups (K is a parameter)
Hill-climbing search
Optimization criterion: the square error, i.e. minimize the distance of each example to the centroid of its class

$$\text{Distortion} = \sum_{k=1}^{K}\sum_{i \in C_k} \| x_i - \mu_k \|^2$$

The algorithm converges to a local minimum
K-means
Algorithm: K-means (X: examples, k: integer)

Generate k prototypes with the first k examples
Assign the remaining n-k examples to their nearest prototype
SumD = sum of squared example-prototype distances
repeat
    Recalculate the prototypes
    Reassign the examples to their nearest prototype
    SumI = SumD
    SumD = sum of squared example-prototype distances
until SumI - SumD < ε
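A runnable numpy sketch of this loop; random initialization is used here instead of the first k examples, and the function name and defaults are illustrative.

```python
import numpy as np

def kmeans(X, k, eps=1e-6, rng=None):
    """Plain K-means following the pseudocode above (random initialization
    instead of the first k examples, to soften order sensitivity)."""
    rng = np.random.default_rng(rng)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initial prototypes
    prev = np.inf
    while True:
        # assignment step: nearest prototype for every example
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        distortion = d2[np.arange(len(X)), labels].sum()
        if prev - distortion < eps:          # until SumI - SumD < eps
            return labels, mu
        prev = distortion
        # update step: recompute each centroid (keep the old one if empty)
        for j in range(k):
            if np.any(labels == j):
                mu[j] = X[labels == j].mean(0)

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
labels, centers = kmeans(X, k=2, rng=0)
```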
K-means
[Figure: snapshots of K-means iterations on a two-cluster dataset; examples switch between clusters 1 and 2 as the prototypes are updated]
K-means - practical problems
The algorithm is sensitive to the initialization (running the algorithm from several random initializations can be a good idea)
Finding the value of k is not an easy problem (experimentation with different values is needed)
You can obtain a solution even if the classes are not hyperspherical (some classes may be split)
There is no guarantee about the quality of the solution
Mixture Decomposition - EM algorithm
We assume that the data are drawn from a mixture of probability distributions (usually Gaussians); we look for the parameters of the distributions that best explain the data
The model of the data is:

$$P(x \mid \theta) = \sum_{i=1}^{K} w_i\, P(x \mid \theta_i, w_i)$$

where $K$ is the number of clusters and $\sum_{i=1}^{K} w_i = 1$
The membership of an example is a probability distribution
Mixture Decomposition - EM algorithm
The goal is to estimate the parameters of the distribution that describes each class (e.g. means and standard deviations)
The algorithm maximizes the likelihood of the distribution with respect to the dataset
It iterates over two steps:
Expectation: compute a function that assigns a degree of membership of each instance to each of the K probability distributions
Maximization: re-estimate the parameters of the distributions to maximize the memberships
EM Algorithm (K Gaussian)
For the Gaussian case:
$$P(x \mid \vec{\mu}, \Sigma) = \sum_{i=1}^{K} P(w_i)\, P(x \mid \vec{\mu}_i, \Sigma_i, w_i)$$

where $\vec{\mu}$ are the vectors of means and $\Sigma$ the covariance matrices
EM Algorithm (K Gaussian)
The computations depend on the assumptions that we make about the attributes (independent or not, same $\sigma$, ...)
If the attributes are independent: $\mu_i$ and $\sigma_i$ have to be computed for each class (O(k) parameters; model: hyperspheres or ellipsoids parallel to the coordinate axes)
If the attributes are not independent: $\mu_i$, $\sigma_i$ and $\sigma_{ij}$ have to be computed for each class (O(k²) parameters; model: hyper-ellipsoids not parallel to the coordinate axes)
EM Algorithm (K Gaussian)
For the case of A independent attributes:

$$P(x \mid \vec{\mu}_i, \Sigma_i, w_i) = \prod_{j=1}^{A} P(x \mid \mu_{ij}, \sigma_{ij}, w_i)$$

The model to fit is:

$$P(x \mid \vec{\mu}, \vec{\sigma}) = \sum_{i=1}^{K} P(w_i) \prod_{j=1}^{A} P(x \mid \mu_{ij}, \sigma_{ij}, w_i)$$
EM Algorithm (K Gaussian)
The update of the parameters in the maximization step is:
$$\mu_i = \frac{\sum_{k=1}^{N} P(w_i \mid x_k, \vec{\mu}, \vec{\sigma})\, x_k}{\sum_{k=1}^{N} P(w_i \mid x_k, \vec{\mu}, \vec{\sigma})}$$

$$\sigma_i = \frac{\sum_{k=1}^{N} P(w_i \mid x_k, \vec{\mu}, \vec{\sigma})\,(x_k - \mu_i)^2}{\sum_{k=1}^{N} P(w_i \mid x_k, \vec{\mu}, \vec{\sigma})}$$

$$P(w_i) = \frac{1}{N}\sum_{k=1}^{N} P(w_i \mid x_k, \vec{\mu}, \vec{\sigma})$$
A set of K initial distributions $N(\mu_i, \sigma_i)$ is generated; $\mu_i$ and $\sigma_i$ are vectors with the mean and the variance of each attribute
We repeat until convergence:
1 Expectation: compute the membership of each instance to each probability distribution, usually via the log-likelihood of the distribution. Each instance gets a weight depending on the probability assigned in the previous step: $w_{x_j,i} = \log(P(x_j \mid N(\mu_i, \sigma_i)))$ (MLE)
2 Maximization: recompute the parameters using the weights from the previous step and obtain the new $\mu_i$ and $\sigma_i$ for each distribution
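A minimal numpy/scipy sketch of the two steps for a one-dimensional mixture of K Gaussians; it computes normalized responsibilities with Bayes' rule rather than raw log weights, and all names and defaults are illustrative.

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_1d(x, k, iters=50, rng=None):
    """EM for a 1-D mixture of K Gaussians, sketching the scheme above.

    Returns means mu, standard deviations sigma and priors p after `iters`
    expectation/maximization rounds."""
    rng = np.random.default_rng(rng)
    mu = rng.choice(x, size=k, replace=False)        # initial distributions
    sigma = np.full(k, x.std())
    p = np.full(k, 1.0 / k)
    for _ in range(iters):
        # Expectation: memberships P(w_i | x_k) via Bayes' rule
        dens = p * norm.pdf(x[:, None], mu, sigma)   # shape (N, K)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # Maximization: weighted re-estimation of the parameters
        w = resp.sum(axis=0)                         # effective counts
        mu = (resp * x[:, None]).sum(axis=0) / w
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / w)
        p = w / len(x)
    return mu, sigma, p

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 0.5, 200)])
print(em_gaussian_1d(x, k=2, rng=0))
```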
EM algorithm - Comments
K-means is a particular case of this algorithm
The main advantage is that we obtain memberships as probabilities (soft assignments)
Using different probability distributions we can find different kinds of structures
Incremental algorithms: Neighbourhood relationship
What the algorithms up to this point have in common is that they are not incremental
Incrementality allows updating a model with new data without starting from scratch
These algorithms use a neighbourhood relationship defined from a similarity/distance function
This neighbourhood determines which instances belong to the same group
Examples: Nearest Neighbour, Mutual Neighbour
Nearest Neighbour/Leader Algorithm
Algorithm: Leader algorithm (X: examples, D: double)

Generate a prototype with the first example
while there are examples do
    e = current example
    d = distance of e to the nearest prototype
    if d ≤ D then
        Introduce the example into that class
        Recompute the prototype
    else
        Create a new prototype with this example
    end
end
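A one-pass Python sketch of the algorithm; interpreting "recompute the prototype" as an incremental running mean is my assumption, and the names are illustrative.

```python
import numpy as np

def leader(stream, D):
    """Leader algorithm sketch: one pass over the data with threshold D.

    Keeps a running-mean prototype per cluster; returns the prototypes and
    the cluster index assigned to each example."""
    prototypes, counts, labels = [], [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if prototypes:
            d = [np.linalg.norm(x - p) for p in prototypes]
            j = int(np.argmin(d))
        if not prototypes or d[j] > D:
            prototypes.append(x.copy())      # new prototype for this example
            counts.append(1)
            labels.append(len(prototypes) - 1)
        else:
            counts[j] += 1
            prototypes[j] += (x - prototypes[j]) / counts[j]  # incremental mean
            labels.append(j)
    return prototypes, labels

protos, labels = leader([[0, 0], [0.5, 0], [5, 5], [5.2, 4.9]], D=2.0)
print(len(protos))  # 2 clusters
```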
Nearest Neighbour
[Figure: examples arriving one at a time are assigned to clusters 1-3; a new prototype is created whenever an example is farther than D from all existing prototypes]
Fuzzy Clustering
Fuzzy clustering relaxes the hard partition constraint of K-means
Each instance has a degree of membership to each partition
A new optimization function is introduced:

$$L = \sum_{k=1}^{K}\sum_{i=1}^{N} \delta(C_k, x_i)^b\, \|x_i - \mu_k\|^2$$

where $\sum_{k=1}^{K} \delta(C_k, x_i) = 1$ and $b$ is a blending factor
This is an advantage over other algorithms when clusters overlap
Fuzzy Clustering - Fuzzy C-means
Fuzzy C-means is the best-known fuzzy clustering algorithm; it is the fuzzy version of K-means
The algorithm optimizes the objective function in a similar way
The cluster centers are updated as:

$$\mu_j = \frac{\sum_{i=1}^{N} \delta(C_j, x_i)^b\, x_i}{\sum_{i=1}^{N} \delta(C_j, x_i)^b}$$

and the memberships as:

$$\delta(C_j, x_i) = \frac{(1/d_{ij})^{1/(b-1)}}{\sum_{k=1}^{K}(1/d_{ik})^{1/(b-1)}}, \qquad d_{ij} = \|x_i - \mu_j\|^2$$
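These two updates can be alternated directly; a minimal numpy sketch, assuming random initial memberships and a fixed number of iterations (names and defaults are illustrative):

```python
import numpy as np

def fuzzy_cmeans(X, k, b=2.0, iters=100, rng=None):
    """Fuzzy C-means sketch with blending factor b > 1.

    Alternates the center and membership updates shown above; returns the
    centers and the (N, K) membership matrix."""
    rng = np.random.default_rng(rng)
    u = rng.random((len(X), k))
    u /= u.sum(axis=1, keepdims=True)       # memberships sum to 1 per example
    for _ in range(iters):
        w = u ** b
        mu = (w.T @ X) / w.sum(axis=0)[:, None]               # center update
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances
        d = np.maximum(d, 1e-12)                              # avoid division by zero
        inv = (1.0 / d) ** (1.0 / (b - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)              # membership update
    return mu, u

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3])
centers, memberships = fuzzy_cmeans(X, k=2, rng=0)
```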
Fuzzy Clustering
Other membership and distance functions can be used
Different functions have specific purposes, such as detecting specific shapes in the data (lines, rectangles, ...)
This algorithm is widely used in image recognition
Density/Grid Based Clustering
Density estimation
The number of groups is not decided beforehand
We are looking for regions with high density of examples
We are not limited to a predefined set of shapes (non-parametric model)
Different approaches:
Space partitioning (multidimensional grid)
Multidimensional histograms (we look for high-density regions in fewer dimensions)
Usually more suited to datasets with low dimensionality (e.g. geographical data)
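A small grid-based sketch along these lines: bin the data in a regular grid, keep the dense cells, and join adjacent dense cells into clusters via connected components. The bins/min_points thresholds and the function name are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def grid_density_clusters(X, bins=20, min_points=5):
    """Grid-based density clustering sketch for low-dimensional data.

    Partitions the bounding box into a regular grid, keeps cells with at
    least `min_points` examples, and labels connected dense regions."""
    hist, edges = np.histogramdd(X, bins=bins)
    dense = hist >= min_points
    cell_labels, n_clusters = ndimage.label(dense)   # connected dense regions
    # map every example to the label of its cell (0 = sparse cell / noise)
    idx = np.stack([
        np.clip(np.searchsorted(e, X[:, j], side="right") - 1, 0, bins - 1)
        for j, e in enumerate(edges)], axis=1)
    labels = cell_labels[tuple(idx.T)]
    return labels, n_clusters

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 6])
labels, k = grid_density_clusters(X)
print(k)  # typically 2 dense regions
```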
Density estimation - Space partitioning / Multidimensional grids

[Figures: examples of clustering by space partitioning and by multidimensional grids]
Graph Based Clustering
Based on graph theory
We create different kinds of graphs from the dataset (MST, Voronoi, Delaunay, ...)
We define consistency criteria for the edges of the graph (inconsistent edges are deleted)
The result is a set of unconnected components
Two advantages: we do not need to know the number of classes, and we do not look for a specific model (any shape is possible)
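A compact sketch of the MST variant of this idea; the consistency criterion used here (cut edges longer than a multiple of the mean MST edge length) is one simple choice among many, and the names are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clusters(X, factor=2.0):
    """MST-based graph clustering sketch.

    Builds the minimum spanning tree of the complete distance graph and
    deletes the edges considered inconsistent: here, those longer than
    `factor` times the mean MST edge length. The connected components
    that remain are the clusters."""
    D = squareform(pdist(X))
    mst = minimum_spanning_tree(D).toarray()
    edge_lengths = mst[mst > 0]
    mst[mst > factor * edge_lengths.mean()] = 0      # cut inconsistent edges
    n, labels = connected_components(mst, directed=False)
    return n, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 8])
n_clusters, labels = mst_clusters(X)
print(n_clusters)
```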
Spectral Clustering
Spectral graph theory relates properties of a graph to the eigenvalues and eigenvectors of its adjacency or Laplacian matrix
Spectral clustering uses spectral properties of the distance matrix
The distance matrix represents a graph that connects the examples:
Complete graph
Neighbourhood graph (different definitions)
From the diagonalization of this matrix several clustering algorithms can be defined
Spectral Clustering
We start with the similarity matrix $W$ of a dataset (complete or not)
This matrix represents the similarity graph of the instances
The degree of a vertex is defined as:

$$d_i = \sum_{j=1}^{n} w_{ij}$$

We define the degree matrix $D$ as the diagonal matrix with values $d_1, d_2, \ldots, d_n$
We can define different Laplace matrices:
Unnormalized: $L = D - W$
Normalized: $L_{sym} = D^{-1/2} L D^{-1/2}$, or also $L_{rw} = D^{-1} L$
Spectral Clustering
We can cluster a dataset following these steps:
1 Compute the Laplace matrix from the similarity matrix
2 Compute the first K eigenvectors of the Laplace matrix (those with the smallest eigenvalues)
3 Use the eigenvectors as new data points
4 Apply K-means as the clustering algorithm
We are embedding the dataset in a space with fewer dimensions using the neighbourhood relations among the data
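These steps map almost line by line onto numpy; the sketch below uses a Gaussian similarity on a complete graph, the unnormalized Laplacian and a tiny inline K-means, all of which are illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def spectral_clustering(X, k, sigma=1.0, rng=0):
    """Unnormalized spectral clustering sketch (L = D - W), following the
    steps above; a small K-means loop runs on the spectral embedding."""
    W = np.exp(-squareform(pdist(X, "sqeuclidean")) / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                 # no self-similarity edges
    L = np.diag(W.sum(axis=1)) - W           # unnormalized Laplacian
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    emb = vecs[:, :k]                        # eigenvectors as new data points
    g = np.random.default_rng(rng)
    mu = emb[g.choice(len(emb), k, replace=False)]
    for _ in range(100):                     # K-means on the embedding
        labels = ((emb[:, None] - mu[None]) ** 2).sum(-1).argmin(1)
        mu = np.array([emb[labels == j].mean(0) if np.any(labels == j)
                       else mu[j] for j in range(k)])
    return labels

X = np.vstack([np.random.randn(40, 2) * 0.3,
               np.random.randn(40, 2) * 0.3 + [3.0, 0.0]])
print(np.bincount(spectral_clustering(X, k=2)))
```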
Unsupervised Neural Networks
Self-organizing maps are an unsupervised neural network method
Can be seen as an on-line constrained version of K-means
The data is transformed to fit in a 1-d or 2-d mesh
The nodes of this mesh are the prototypes
Self-Organizing Maps
To build the map we have to decide the size and shape of the mesh (rectangular/hexagonal)
Each node of the mesh is a multidimensional prototype of p features
Algorithm: Self-Organizing Maps algorithm

Distribute the initial prototypes regularly on the mesh
for a predefined number of iterations do
    foreach example x_i do
        Find the nearest prototype (m_j)
        Determine the neighborhood M of m_j
        foreach prototype m_k ∈ M do
            m_k = m_k + α(x_i − m_k)
        end
    end
end
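A compact numpy sketch of this training loop, with the shrinking learning rate and neighborhood described below; the Gaussian neighborhood weighting, mesh size and iteration counts are illustrative assumptions.

```python
import numpy as np

def som(X, rows=8, cols=8, iters=2000, rng=0):
    """Self-organizing map sketch on a rectangular mesh.

    alpha and the neighborhood radius both shrink during training;
    neighbors are weighted by their grid distance to the winning node."""
    g = np.random.default_rng(rng)
    protos = g.random((rows, cols, X.shape[1]))           # mesh of prototypes
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1).astype(float)
    for t in range(iters):
        alpha = 1.0 - t / iters                           # learning rate 1 -> 0
        radius = max(1.0, (rows / 2) * (1 - t / iters))   # shrinking neighborhood
        x = X[g.integers(len(X))]
        # best-matching unit: the prototype nearest to the example
        d = ((protos - x) ** 2).sum(-1)
        bmu = np.unravel_index(d.argmin(), d.shape)
        # Gaussian weight over grid distance to the BMU
        gd2 = ((grid - np.array(bmu)) ** 2).sum(-1)
        h = np.exp(-gd2 / (2 * radius ** 2))
        protos += alpha * h[..., None] * (x - protos)     # m_k += a(x - m_k)
    return protos

X = np.random.rand(500, 3)        # e.g. RGB colors
mesh = som(X)
```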
Self-Organizing Maps
During the iterations the mesh is transformed to get closer to the data while maintaining the two-dimensional relationship between prototypes
The performance of the algorithm depends on the learning rate α, which is usually decreased from 1 to 0 during the iterations
The neighborhood of a prototype is defined by the adjacency of the cells and the distance between prototypes
The number of neighbors used in the update is decreased during the iterations, from a predefined number down to 1 (only the prototype nearest to the observation)
Different variations of the algorithm weight the update according to the distance between prototypes