
Introduction to Data Clustering 1

Francesco Masulli

DIBRIS - Dip. Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi, University of Genova, ITALY
&
S.H.R.O. - Sbarro Institute for Cancer Research and Molecular Medicine
Temple University, Philadelphia, PA, USA
email: [email protected]

ML-CI 2016


Outline

1 Introduction

2 Partitioning Methods

3 Parametric/Statistical clustering

4 Hard Clustering


Machine Learning

In 1959, Arthur Samuel defined Machine Learning as a "Field of study that gives computers the ability to learn without being explicitly programmed".
Machine learning is about the construction and study of systems (learners) that can learn from data.
For example, a learner could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders (generalization).


Machine Learning

In machine learning and pattern recognition, a pattern is a data point (or instance or condition), represented by a vector of characteristics or attributes or features.
From a statistical viewpoint a pattern is a random vector or a multivariate random variable.


Clustering Problem

Let us consider a set of labels and a set of unlabeled patterns. The classification problem concerns the assignment of a label to each pattern in such a way that similar patterns will share the same label.
There are two principal approaches to solve the classification problem: for the first approach a set of labeled instances (training set) is supposed to be available for the design of the classifier, while for the second approach the available instances are unlabeled.
In the first case we deal with the supervised classification problem, while in the second case we deal with unsupervised classification or clustering problems.


The Concept of Clustering

Greek philosopher Plato (∼ 400 BC): grouping objects based on their similar properties (categorization). Statesman dialogue: http://www.gutenberg.org/files/1738/1738-h/1738-h.htm

Approach further explored and systematized by Aristotle (∼ 350 BC): differences between classes and objects. Categories treatise: http://classics.mit.edu/Aristotle/categories.html


The Concept of Clustering

Principles of grouping (Gestalt psychologists: M. Wertheimer, K. Koffka, W. Köhler, ∼1930):

> Law of Proximity: perception tends to group stimuli that are close together as part of the same object, and stimuli that are far apart as two separate objects.

> Law of Similarity: perception lends itself to seeing stimuli that physically resemble each other as part of the same object, and stimuli that are different as part of a different object.

> Law of Good Continuation: people tend to perceive each object as a single uninterrupted object.

> Laws of Closure, of Good Form, of Common Fate, etc.


The Concept of Clustering: Informal Definition of Clustering

To find a structure in the given data by aggregating it into categories (or clusters).
Data belonging to a cluster are more similar to data of that cluster than to data of other clusters.
The aim of clustering methods is to group patterns on the basis of a similarity (or dissimilarity) criterion, where groups (or clusters) are sets of similar patterns.


What is a Cluster?

The notion of what constitutes a cluster is not well defined

From (Steinbach et al, 2002)

Clustering: not well-posed problem (Hadamard, 1923)


Regularization: Ill-Posed Problems [Hadamard, 1923] [KEC01, HAYK09]

A problem is well posed [Hadamard, 1923] when a solution

exists
is unique
depends continuously on the initial data (i.e., robustness against noise)

Many (especially inverse) practical problems turned out to be ill-posed.
E.g., differentiation is an ill-posed problem because its solution does not depend continuously on the data.
→ robot vision: inverse problem, ill-posed


Regularization: Ill-Posed Problems (Hadamard, 1923)

To solve an ill-posed problem we try to regularize it by introducing generic constraints that will restrict the space of solutions in an appropriate way.
The character of the constraints depends on a priori knowledge of the solution.
The constraints enable the calculation of admissible solutions, called regularized solutions, out of the other (perhaps an infinite number of) possible solutions.


Principle: Occam's razor (William of Occam, 1285-1349)

"we should prefer simpler models to more complex models"
"this preference should be traded off against the extent to which the model fits the data" (Bishop, 1996)


Clustering Algorithms: Clustering task

Unsupervised data analysis using clustering algorithms provides a useful tool to explore data structures.
Clustering methods have been addressed in many contexts and disciplines such as data mining, document retrieval, image segmentation and pattern classification (Jain, 2009; Xu, 2005).


Clustering Problem

We deal with unsupervised classification when:
labeling is very expensive or infeasible
the available labeling is ambiguous
we need to improve our understanding of the nature of patterns
we want to reduce the amount of data (information) to transmit


Operational Definition of Clustering

Features to use (usually given, sometimes selected), their attributes (binary, discrete, continuous) and scales.
REMARK: a good choice of features can lead to a good quality of clustering performance.
Clustering paradigm:

Hierarchical
Partitive
Vicinity

(Dis-)Similarity measures (indexes): (generalized-)Euclidean distance, correlation, Hamming, Jaccard

An optimization procedure (when requested)


Clustering Paradigms

Hierarchical clustering is able to find structures which can be further divided into substructures, and so on recursively. The result is a hierarchical structure of groups known as a dendrogram (Jain, 1999; Sneath, 1973; Ward, 1963).


Clustering Paradigms

Partitive/central clustering tries to obtain a single partition of the data, often based on the optimization of an appropriate objective function.
A good cluster is one for which the distances between the points and the cluster centroid are small (cluster compactness).


Clustering paradigms

Vicinity (connectivity) clustering: a good cluster is one in which each point shares the same cluster label as its nearest neighbor ⇒ it can represent any cluster shape that is an arbitrary manifold in the data space (Shared Nearest Neighbor Clustering (Jarvis et al., 1973; Ertoz et al., 2013), Spectral clustering (Filippone et al., 2008)).


Clustering Algorithms: Representation and similarity measures

Crucial aspects in clustering are pattern representation and the similarity measure:

Each pattern is usually represented by a set of features of the system under study. It is very important to notice that a good choice of representation of patterns can lead to improvements in clustering performance. Whether it is possible to choose an appropriate set of features depends on the system under study.
Once a representation is fixed it is possible to choose an appropriate similarity measure among patterns. The most popular dissimilarity measure for metric representations is the distance or distortion, for instance the Euclidean distance (Duda&Hart, 1973).


Clustering: Similarity measures

Direction cosine for continuous-valued patterns:

$\cos\theta = \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\|\mathbf{x}\|\,\|\mathbf{y}\|}$

If vectors x and y are unitary → cos θ = C:
if C = 1 (full agreement) → y = a x
if C = 0 → x ⊥ y
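A minimal sketch (not from the slides; assumes NumPy) of the direction cosine:

```python
import numpy as np

def direction_cosine(x, y):
    """Cosine of the angle between two continuous-valued patterns."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# C = 1 for proportional vectors (full agreement), C = 0 for orthogonal ones
print(direction_cosine([1, 2, 3], [2, 4, 6]))   # 1.0
print(direction_cosine([1, 0], [0, 1]))         # 0.0
```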


Clustering: Similarity measures

Euclidean distance for continuous-valued patterns:

$E(\mathbf{x}, \mathbf{y}) \equiv \|\mathbf{x} - \mathbf{y}\| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

$\|\mathbf{x} - \mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - 2\, \mathbf{x} \cdot \mathbf{y}$


Clustering: Similarity measures

Minkowski distance for continuous-valued patterns:

$M(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^{\lambda} \right)^{1/\lambda}$

λ = 1 → city-block distance or Manhattan distance or $\ell_1$ distance
λ = 2 → Euclidean distance or $\ell_2$ distance
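A minimal sketch (assumes NumPy) of the Minkowski distance, which recovers the Manhattan and Euclidean distances for λ = 1 and λ = 2:

```python
import numpy as np

def minkowski(x, y, lam):
    """Minkowski distance; lam=1 gives the city-block (l1) distance,
    lam=2 the Euclidean (l2) distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** lam) ** (1.0 / lam))

print(minkowski([0, 0], [3, 4], 1))   # 7.0  (Manhattan)
print(minkowski([0, 0], [3, 4], 2))   # 5.0  (Euclidean)
```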


Clustering: Similarity measures

Generalized Hamming distance, for ordered sets with discrete-valued elements (binary, characters, etc.): the number of elements that differ, e.g.

x = ( p, a, t, t, e, r, n )
y = ( w, e, s, t, e, r, n )

The sequences differ in the first three positions, so H(x, y) = 3.
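A small illustrative sketch of the Hamming distance on the example above:

```python
def hamming(x, y):
    """Generalized Hamming distance: number of positions where the
    elements of two equal-length sequences differ."""
    assert len(x) == len(y)
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming("pattern", "western"))   # 3
```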


Clustering: Similarity measures

Jaccard index, for sets:

$d(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Distance of categorical variables:

$T(x, y) = \delta(x, y) = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{otherwise} \end{cases}$

The notation δ(·, ·) is Dirac's delta.
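A small illustrative sketch (plain Python) of the Jaccard index and of the categorical distance δ:

```python
def jaccard(A, B):
    """Jaccard index between two sets: |A ∩ B| / |A ∪ B|."""
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)

def delta(x, y):
    """Distance for categorical variables: 0 if equal, 1 otherwise."""
    return 0 if x == y else 1

print(jaccard({1, 2, 3}, {2, 3, 4}))   # 0.5
print(delta("red", "blue"))            # 1
```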


Clustering: Similarity measures

Gaussian Kernel similarity function for continuous-valued patterns:

$W(\mathbf{x}, \mathbf{y}) = \exp\left( -\frac{M(\mathbf{x}, \mathbf{y})}{2\sigma^2} \right)$

where
M(x, y) is a distance
σ is the spread of the kernel
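A minimal sketch (assumes NumPy) of the Gaussian kernel similarity; here M(x, y) is taken, as one common choice, to be the squared Euclidean distance, and the negative exponent is what makes W decrease with distance:

```python
import numpy as np

def gaussian_similarity(x, y, sigma):
    """Gaussian kernel similarity, taking M(x, y) as the squared
    Euclidean distance (one common choice for the distance M)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    M = np.sum((x - y) ** 2)
    return float(np.exp(-M / (2.0 * sigma ** 2)))

print(gaussian_similarity([0, 0], [0, 0], sigma=1.0))   # 1.0 (identical patterns)
print(gaussian_similarity([0, 0], [3, 4], sigma=1.0))   # close to 0
```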


REMARK: Expectation

In probability theory, the expectation (or expected value, mathematical expectation, EV, mean, or the first moment) of a random variable is the weighted average of all possible values that this random variable can take on.
The weights used in computing this average correspond to the probabilities in the case of a discrete random variable, or to densities in the case of a continuous random variable.
From a rigorous theoretical standpoint, the expected value is the integral of the random variable with respect to its probability measure.


REMARK: Expectation

Definition (Expectation of a discrete random variable)

Suppose a discrete random variable x can take value x₁ with probability p₁, value x₂ with probability p₂, and so on, up to value x_k with probability p_k. Then the expectation of this random variable x is defined as:

$E(x) = \sum_{i=1}^{k} x_i\, p_i$

Definition (Expectation of a univariate continuous random variable)

If the probability distribution of x admits a probability density function p(x), then the expected value can be computed as

$E(x) = \int_{-\infty}^{+\infty} x\, p(x)\, dx$


REMARK: Expectation

REMARK: for multivariate random variables we have:

$E(\mathbf{x}) = \sum_{i=1}^{k} \mathbf{x}_i\, p_i$ when x is a multivariate discrete random variable

$E(\mathbf{x}) = \int_{-\infty}^{+\infty} \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x}$ when x is a multivariate continuous random variable


REMARK: Expectation

Theorem (Law of the Unconscious Statistician)

Let g(x) be a function of a random variable x. We know the probability distribution of x but we do not know explicitly the distribution of g(x). The expected value of g(x) is then

$E[g(x)] = \sum_{i=1}^{k} g(x_i)\, p_i$ when x is a multivariate discrete random variable

$E[g(x)] = \int_{-\infty}^{+\infty} g(x)\, p(x)\, dx$ when x is a multivariate continuous random variable


REMARK: Univariate Normal Distribution

Univariate Normal Distribution:

$p(x) = N(\mu, \sigma^2) \equiv \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2} \quad (1)$

where:
σ² variance; σ standard deviation
expectation of x or mean: $E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx \equiv \mu$
$E[(x - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2\, p(x)\, dx \equiv \sigma^2$


REMARK: Univariate Normal Distribution

Univariate Normal Distribution:
99.7% of samples are in the interval |x − µ| ≤ 3σ
95% of samples are in the interval |x − µ| ≤ 2σ
68% of samples are in the interval |x − µ| ≤ σ


Mahalanobis distance: Multivariate normal distribution

$p(\mathbf{x}) = N(\boldsymbol{\mu}, \Sigma) \equiv \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t\, \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right], \quad (2)$

where:

d dimensionality of the feature space, (x − µ)^t transpose of (x − µ),
µ ≡ E[x] mean vector,
Σ ≡ E[(x − µ)(x − µ)^t] covariance matrix
σ_ij = E[(x_i − µ_i) × (x_j − µ_j)] element ij of the matrix Σ
σ_ii variance of x_i; σ_ij covariance of x_i and x_j
σ_ij = 0 ⟺ x_i and x_j statistically independent
Σ symmetric and positive semidefinite (i.e. z^t Σ z ≥ 0)
Σ^{−1} inverse of Σ; |Σ| determinant of Σ.

Mahalanobis distance:

$M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^t\, \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$
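A minimal sketch (assumes NumPy) of the Mahalanobis distance of a pattern from the sample mean, using the sample covariance matrix; the data values are illustrative:

```python
import numpy as np

# Toy sample: mean and covariance estimated from the data itself.
X = np.array([[2.0, 1.0], [3.0, 2.0], [4.0, 3.0], [5.0, 5.0], [6.0, 5.0]])
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)          # sample covariance matrix

def mahalanobis(x, mu, Sigma):
    """Mahalanobis distance M(x, mu) = sqrt((x - mu)^t Sigma^-1 (x - mu))."""
    diff = np.asarray(x, dtype=float) - mu
    return float(np.sqrt(diff @ np.linalg.inv(Sigma) @ diff))

print(mahalanobis([4.0, 3.0], mu, Sigma))
print(mahalanobis([4.0, 0.0], mu, Sigma))  # farther once correlation is accounted for
```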


Mahalanobis distance: Multivariate normal distribution [figure]


Clustering Algorithms: Example 1 [figure]


Clustering Algorithms: Example 2

Representation of a text document with a word-vector.
Bag-Of-Words format: each document is represented by the set of its word frequencies (ignoring the position of words in the document) and the categories that it belongs to.
The purpose of the format is to enable efficient execution of algorithms such as clustering, learning, classification, visualization, etc.
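A minimal sketch (plain Python) of the Bag-Of-Words representation; the documents and vocabulary are illustrative:

```python
from collections import Counter

# Each document is reduced to its word frequencies, ignoring word positions.
docs = ["the cat sat on the mat", "the dog sat on the log"]
bows = [Counter(d.split()) for d in docs]

vocabulary = sorted(set().union(*bows))                 # shared feature space
vectors = [[bow.get(w, 0) for w in vocabulary] for bow in bows]

print(vocabulary)
print(vectors)   # one frequency vector per document, ready for clustering
```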


Partitions: Codevectors

Let X = {x₁, . . . , xₙ} be a data set composed of n patterns for which every xᵢ ∈ R^d.
The codebook (or set of centroids) V is defined as the set V = {v₁, . . . , v_c}, typically with c ≪ n. Each element vᵢ ∈ R^d is called a codevector (or centroid or prototype).
The Voronoi region Rᵢ of the codevector vᵢ is the set of vectors in R^d for which vᵢ is the nearest codevector:

$R_i = \left\{ \mathbf{z} \in \mathbb{R}^d \;\middle|\; i = \arg\min_j \|\mathbf{z} - \mathbf{v}_j\|^2 \right\}. \quad (3)$

It is possible to prove that each Voronoi region is convex (Linde, 1980) and the boundaries of the regions are linear segments.


Partitions: Voronoi set

The Voronoi set (or tassel, region, polyhedron) πᵢ of the codevector vᵢ is the subset of elements of X for which the codevector vᵢ is the nearest codevector:

$\pi_i = \left\{ \mathbf{x} \in X \;\middle|\; i = \arg\min_j \|\mathbf{x} - \mathbf{v}_j\|^2 \right\} \quad (4)$

that is, the set of vectors belonging to Rᵢ.
A partition of R^d induced by all Voronoi regions is called a Voronoi tessellation or Dirichlet tessellation.
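A minimal sketch (assumes NumPy) of Eq. 4: the Voronoi set of each codevector is the subset of data points for which it is the nearest codevector; the data and codevectors are illustrative:

```python
import numpy as np

def voronoi_sets(X, V):
    """Return the Voronoi set pi_i of each codevector v_i (Eq. 4)."""
    X, V = np.asarray(X, dtype=float), np.asarray(V, dtype=float)
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # squared distances
    nearest = d2.argmin(axis=1)                               # winning codevector per point
    return [X[nearest == i] for i in range(len(V))]

X = np.array([[0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.1, 0.9]])
V = np.array([[0.0, 0.0], [1.0, 1.0]])
for i, pi in enumerate(voronoi_sets(X, V)):
    print(f"pi_{i}:", pi.tolist())
```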


Vector quantization: Example

While the clustering approach is descriptive, vector quantization is predictive.

Sequence of images, 512×512 pixels of 8 bits each
consider sub-images of 4×4 pixels
we can represent a sub-image as a vector

$\mathbf{x}_k = (x_k^1, \cdots, x_k^{16})$


Vector quantization

In the feature space the sub-images will aggregate in clusters.

The center $\mathbf{y}_j$ of a cluster j represents all the elements of the cluster and is called a codevector.

Codebook:

$\mathbf{y}_1 = (y_1^1, \cdots, y_1^{16})$
$\mathbf{y}_2 = (y_2^1, \cdots, y_2^{16})$
· · ·
$\mathbf{y}_c = (y_c^1, \cdots, y_c^{16})$


Vector quantization

When we have to transmit a sub-image $\mathbf{x}_k$, we will send in its place the index l, where

$l = \arg\min_j \|\mathbf{x}_k - \mathbf{y}_j\|$ (WTA rule)

In this way we will transmit an amount of information of order $\log_2 c$ instead of $\log_2(512 \times 512 \times 8)$.
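A minimal sketch (assumes NumPy) of the WTA encoding step: each sub-image is replaced by the index of its nearest codevector and reconstructed from the codebook on the receiving side; the toy codebook is illustrative:

```python
import numpy as np

def encode(subimages, codebook):
    """WTA rule: send only the index of the nearest codevector per sub-image."""
    X = np.asarray(subimages, dtype=float)
    Y = np.asarray(codebook, dtype=float)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def decode(indices, codebook):
    """Reconstruct each sub-image as its codevector."""
    return np.asarray(codebook)[indices]

codebook = np.array([[0.0] * 16, [128.0] * 16, [255.0] * 16])   # c = 3 toy codevectors
subimages = np.array([[10.0] * 16, [120.0] * 16, [250.0] * 16])
idx = encode(subimages, codebook)
print(idx, decode(idx, codebook)[:, 0])
```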


Vector quantization - Color Images

Color models:

RGB: additive color model; red, green, and blue light are added together in various ways to reproduce a broad array of colors.
CMYK (or process color, four color): subtractive color model used in color printing. It refers to the four inks used in some color printing: cyan, magenta, yellow, and key (black).
CcMmYK (or CMYKLcLm): six-color subtractive color model used in some inkjet printers optimized for photo printing. CMYK model (cyan, magenta, yellow, and key) + light cyan (c) and light magenta (m).
etc.

[Figure: RGB (additive) and CMYK (subtractive)]


Vector quantization - Color Images

RGB color images can be coded using 3 intensity matrices (R, G, B) of 512×512 pixels of 8 bits each;
each pixel will be a vector of 3 gray levels $p_{ij} = (r_{ij}, g_{ij}, b_{ij})$;
a sub-image is a vector $\mathbf{x}_k = (r_k^1, g_k^1, b_k^1, \cdots, r_k^{16}, g_k^{16}, b_k^{16})$;
the rest of the previous discussion remains unchanged.


Lagrange Multipliers: General case

Theorem
If a scalar field f(x₁, · · · , xₙ) has a relative extremum when it is subject to m constraints (m < n), say g₁(x₁, · · · , xₙ) = 0, · · · , gₘ(x₁, · · · , xₙ) = 0, the constrained optimization problem (CP) can be solved through the unconstrained optimization of

$L \equiv f(x_1, \cdots, x_n) - \sum_{i=1}^{m} \lambda_i\, g_i(x_1, \cdots, x_n).$

Definition
L is called the Lagrangian and the λᵢ are called the Lagrange multipliers.


Lagrange Multipliers: General case

$\nabla f(x_1, \cdots, x_n) = \sum_{i=1}^{m} \lambda_i\, \nabla g_i(x_1, \cdots, x_n)$
$g_1(x_1, \cdots, x_n) = 0$
...
$g_m(x_1, \cdots, x_n) = 0$

Note: m + n equations, m + n unknown quantities.


Lagrange Multipliers: Example

Find the extreme values of

z = x y

subject to the condition x + y = 1.

Solution:

f(x, y) = x y
g(x, y) = x + y − 1 = 0
∇f = λ ∇g
x + y = 1


Lagrange Multipliers: Example

y = λ
x = λ
x + y = 1

x = y = λ = 1/2

f_max(x, y) = 1/4


Picard Iteration (Recktenwald, 2000)

DEF: Fixed Point
A function g(x) is said to have a fixed point p if g(p) = p. In other words, if the value you put into the function is exactly the same value that you get out.
Solving the equation f(x) = g(x) − x = 0 is identical to finding the fixed point of g(x) AND the zero of f(x). So, we are dealing with another possible method for finding the root of a one-variable equation.

DEF: Fixed Point Iteration
The iteration process is pₙ = g(pₙ₋₁) for n = 1, 2, 3, . . . . This process is also called Picard iteration, functional iteration, or repeated substitution.


Picard Iteration (Recktenwald, 1998)

Finding the root of f(x) = log(x + 4) − x on [0, 2], i.e., f(x) = 0, is equivalent to finding the fixed point of g(x) = log(x + 4) on [0, 2].

 n | p_n     g(p_n)  | p_n     g(p_n)  | p_n     g(p_n)
 0 | 0.0000  0.6020  | 1.0000  0.6990  | 2.0000  0.7782
 1 | 0.6020  0.6629  | 0.6990  0.6720  | 0.7782  0.6793
 2 | 0.6629  0.6686  | 0.6720  0.6695  | 0.6793  0.6702
 3 | 0.6686  0.6691  | 0.6695  0.6693  | 0.6702  0.6693
 4 | 0.6691  0.6692  | 0.6693  0.6692  | 0.6693  0.6693
 5 | 0.6692  0.6692  | 0.6692  0.6692  | 0.6693  0.6692
 6 | 0.6692  0.6692  | 0.6692  0.6692  | 0.6692  0.6692


Picard Iteration (Recktenwald, 1998) [REC00]

Uniqueness: The Fixed Point Theorem
If g is continuous on [a, b] and g(x) ∈ [a, b] for all x ∈ [a, b], then g has a fixed point in [a, b].
In addition, if 0 < |g′(x)| < 1 for all x ∈ [a, b], then g has a unique fixed point in [a, b].

Convergence Criteria for Picard Iteration
The iteration process pₙ = g(pₙ₋₁) for n = 1, 2, 3, . . . will converge to a unique solution for any initial value p₀ in [a, b] if g′ exists on (a, b) and 0 < |g′(x)| < 1 for all x ∈ [a, b].


Picard Iteration (Recktenwald, 1998)

In general, for multivalued functions, each iteration of the Picard iteration method is composed of two (or more) steps:

Step 1: a subset of variables is kept fixed and the optimization is performed with respect to the remaining variables.

Step 2: the role of the fixed and moving variables is swapped.

The optimization algorithm stops when the variables change less than a fixed threshold.
A more general framework named "Alternating Cluster Estimation" is presented in T. A. Runkler, J. C. Bezdek, Alternating Cluster Estimation: A New Tool for Clustering and Function Approximation, IEEE Transactions on Fuzzy Systems, vol. 7, no. 4, 377-393, 1999 [RUN99].


Parametric Clustering (Duda, 73)

Let X = {x_h | h = 1, ..., n} be the set of unlabeled instances (training set), and V = {v_i | i = 1, ..., c} be the set of centers of the clusters (or classes) ω_i. Following a parametric learning approach, we make the following assumptions:

1 The instances come from a known number c of classes ω_i, i ∈ {1, ..., c}.

2 The a priori probabilities P(ω_i), i.e. the probabilities of drawing patterns of class ω_i from X, are known.

3 The forms of the class-conditional probability densities p(x | ω_i, Θ_i) (i.e. the probability density of instance x_h inside class ω_i) are known, ∀i.

Θ_i is the unknown vector of parameters of the class-conditional probability densities.
Note that the third assumption reduces the clustering problem to the problem of estimating the vector Θ_i (parametric learning).


Parametric Clustering

In this setting, we assume that instances are obtained by selecting a class ω_i and then selecting a pattern x according to the probability law p(x | ω_i, Θ_i), i.e.:

$p(\mathbf{x} \mid \Theta) = \sum_{i=1}^{c} p(\mathbf{x} \mid \omega_i, \Theta_i)\, P(\omega_i), \quad (5)$

where Θ = (Θ₁, ..., Θ_c).
A density function of this form is called a mixture density (Duda73); the p(x_h | ω_i, Θ_i) are called the component densities, and the P(ω_i) are called the mixing parameters.


Parametric Clustering

A well known parametric statistics method for estimating the parameter vector Θ is based on maximum likelihood (Duda73). It assumes that the parameter vector Θ is fixed but unknown. The likelihood of the training set X is the conditional density

$p(X \mid \Theta) = \prod_{h=1}^{n} p(\mathbf{x}_h \mid \Theta) \quad (6)$

or also:

$p(X \mid \Theta) = \prod_{h=1}^{n} \sum_{i=1}^{c} p(\mathbf{x}_h \mid \omega_i, \Theta_i)\, P(\omega_i), \quad (7)$

Its log is:

$\log p(X \mid \Theta) = \sum_{h=1}^{n} \log \sum_{i=1}^{c} p(\mathbf{x}_h \mid \omega_i, \Theta_i)\, P(\omega_i), \quad (8)$

Then the maximum likelihood estimate $\hat{\Theta}$ is that value of Θ that maximizes the likelihood of the observed training set X (or its log).


Parametric Clustering

If p(X | Θ) is a differentiable function of Θ, we can obtain the following conditions for the maximum-likelihood estimate $\hat{\Theta}_i$:

$\sum_{h=1}^{n} P\!\left(\omega_i \mid \mathbf{x}_h, \hat{\Theta}\right) \nabla_{\Theta_i} \log p\!\left(\mathbf{x}_h \mid \omega_i, \hat{\Theta}_i\right) = 0 \quad \forall i. \quad (9)$

Constraints:

$P(\omega_i) \geq 0 \quad (10)$

$\sum_{i=1}^{c} P(\omega_i) = 1 \quad (11)$


Parametric Clustering

Let us now assume that the component densities are multivariate normal, i.e.:

$p\!\left(\mathbf{x}_h \mid \omega_i, \hat{\Theta}_i\right) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x}_h - \mathbf{v}_i)^t\, \Sigma_i^{-1} (\mathbf{x}_h - \mathbf{v}_i) \right], \quad (12)$

where:

d dimensionality of the feature space,
v_i mean vector,
Σ_i ≡ E[(x_h − v_i)(x_h − v_i)^t] covariance matrix,
(x_h − v_i)^t transpose of x_h − v_i, Σ_i^{−1} inverse of Σ_i,
and |Σ_i| determinant of Σ_i.


Parametric Clustering

The local-maximum-likelihood estimates are:

$\mathbf{v}_i = \frac{\sum_{h=1}^{n} P\!\left(\omega_i \mid \mathbf{x}_h, \hat{\Theta}_i\right) \mathbf{x}_h}{\sum_{h=1}^{n} P\!\left(\omega_i \mid \mathbf{x}_h, \hat{\Theta}_i\right)} \quad (13)$

$\Sigma_i = \frac{\sum_{h=1}^{n} P\!\left(\omega_i \mid \mathbf{x}_h, \hat{\Theta}_i\right) (\mathbf{x}_h - \mathbf{v}_i)(\mathbf{x}_h - \mathbf{v}_i)^t}{\sum_{h=1}^{n} P\!\left(\omega_i \mid \mathbf{x}_h, \hat{\Theta}_i\right)} \quad (14)$

$P\!\left(\omega_i \mid \mathbf{x}_h, \hat{\Theta}_i\right) = \frac{|\Sigma_i|^{-1/2} \exp\!\left[-\frac{1}{2} (\mathbf{x}_h - \mathbf{v}_i)^t \Sigma_i^{-1} (\mathbf{x}_h - \mathbf{v}_i)\right] P(\omega_i)}{\sum_{j=1}^{c} |\Sigma_j|^{-1/2} \exp\!\left[-\frac{1}{2} (\mathbf{x}_h - \mathbf{v}_j)^t \Sigma_j^{-1} (\mathbf{x}_h - \mathbf{v}_j)\right] P(\omega_j)} \quad (15)$


Parametric Clustering

The Eqs. in the previous slide can be interpreted as the basis of a gradient ascent or hill-climbing procedure for maximizing the likelihood (Picard iteration); each cycle is composed of two (or more) steps:

Step 1: a subset of variables is kept fixed and the optimization is performed with respect to the remaining variables.

Step 2: the role of the fixed and moving variables is swapped.

The optimization algorithm stops when the variables change less than a fixed threshold.
A Picard iteration can start with Eq. 15, using initial estimates to evaluate $P(\omega_i \mid \mathbf{x}_h, \hat{\Theta}_i)$, and then use Eqs. 13 and 14 to update the other estimates; this cycle is repeated until the variations are less than an assigned threshold.


Parametric Clustering

The inversion of Σ_i is quite time consuming, and moreover it may be ill-conditioned.
Like all hill-climbing procedures, the results depend upon the starting point, and therefore there is the possibility of multiple solutions.


The K-Means Algorithm

We can notice that in Eq. 15 the probability $P(\omega_i \mid \mathbf{x}_h, \hat{\Theta}_i)$ is large when the squared Mahalanobis distance

$M^2(\mathbf{x}_h, \mathbf{v}_i) \equiv (\mathbf{x}_h - \mathbf{v}_i)^t\, \Sigma_i^{-1} (\mathbf{x}_h - \mathbf{v}_i) \quad (16)$

is small.


The K-Means Algorithm

This observation is the rationale of the K-Means, Hard C-Means (HCM), C-Means, or Basic Isodata algorithm (Duda73) [Isodata stands for Iterative Self-Organizing Data Analysis Techniques], which is based on the following approximation:

$P\!\left(\omega_i \mid \mathbf{x}_h, \hat{\Theta}_i\right) = \begin{cases} 1 & \text{if } E_i(\mathbf{x}_h) = \min_{1 \le j \le c} E_j(\mathbf{x}_h) \\ 0 & \text{otherwise} \end{cases} \quad (17)$

where $E_j(\mathbf{x}_h)$ is the local cost function or distortion, usually assumed to be the squared Euclidean distance

$E_j(\mathbf{x}_h) \equiv \| \mathbf{x}_h - \mathbf{y}_j \|^2 \quad (18)$

Starting from the finite data set X, this algorithm iteratively moves the k codevectors to the arithmetic mean of their Voronoi sets $\{\pi_i\}_{i=1,\dots,k}$.


K-Means (Lloyd, 1957)
Equivalent foundation of the K-Means algorithm: Lloyd's algorithm, a.k.a. Voronoi iteration or relaxation

Theorem

A necessary condition for a codebook V to minimize the Empirical Quantization Error (Gersho, 1992) or Expectation of Distortion or K-Means functional or K-Means objective function (denoted as E(X) or as ⟨E⟩):

$E(X) = \sum_{i=1}^{c} \sum_{\mathbf{x} \in \pi_i} \|\mathbf{x} - \mathbf{v}_i\|^2 = \sum_{ih} u_{ih} \|\mathbf{x}_h - \mathbf{v}_i\|^2, \quad \text{with } u_{ih} = \begin{cases} 1 & \text{if } \mathbf{x}_h \in \pi_i \\ 0 & \text{otherwise} \end{cases} \quad (19)$

is that each codevector $\mathbf{v}_i$ fulfills the centroid condition.

In the case of a finite data set X and with the Euclidean distance, the centroid condition reduces to $\mathbf{v}_i = \frac{1}{|\pi_i|} \sum_{\mathbf{x} \in \pi_i} \mathbf{x}$.


K-Means (Lloyd, 1957): Algorithm

K-Means is made up of the following steps:

1 choose the number k of clusters;
2 initialize the codebook V with vectors randomly picked from X, or with random vectors in the minimum hyperbox containing the full data set, i.e., as a combination of data points:

$\mathbf{v}_j = \sum_{h=1}^{n} \gamma_{jh}\, \mathbf{x}_h, \quad (20)$

with coefficients $\gamma_{jh} \in [0, 1]$;
3 compute the Voronoi set π_i associated to each codevector v_i;
4 move each codevector to the mean of its Voronoi set;
5 return to step 3 if any codevector has changed, otherwise return the codebook.
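A minimal sketch (assumes NumPy) following the steps above, with the codebook initialized from data points (step 2):

```python
import numpy as np

def k_means(X, k, seed=0, max_iter=100):
    """Lloyd iteration: initialize the codebook from the data, assign
    Voronoi sets, move each codevector to the mean of its Voronoi set."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    V = X[rng.choice(len(X), k, replace=False)].copy()      # step 2: codevectors picked from X
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                           # step 3: Voronoi sets
        V_new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else V[i]
                          for i in range(k)])                # step 4: move to the means
        if np.allclose(V_new, V):                            # step 5: stop when nothing moves
            break
        V = V_new
    return V, labels

# Toy data: two blobs around (0, 0) and (1, 1)
X = np.vstack([np.random.default_rng(1).normal(m, 0.1, size=(20, 2)) for m in (0.0, 1.0)])
codebook, labels = k_means(X, k=2)
print(codebook)
```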


K-Means (Lloyd, 1957) [figure slides]


K-Means (Lloyd, 1957): pros

At each iteration of the algorithm a codebook is found and a Voronoi tessellation of the input space is provided.
It is guaranteed that after each iteration the quantization error does not increase.
At the end of the algorithm a local minimum of the quantization error is obtained.
K-Means can be viewed as an Expectation-Maximization algorithm, ensuring convergence after a finite number of steps (Bishop, 1996).
Different distances lead to different invariance properties, as in the case of the Mahalanobis distance which produces invariance on ellipsoids (Duda&Hart, 1973).


K-Means (Lloyd, 1957): cons

Local minima of E(X) make the method dependent on initialization, and the average is sensitive to outliers (Duda&Hart, 1973).


K-Means (Lloyd, 1957): cons

In order to overcome this problem:

Heuristics, e.g., split and merge algorithms (Isodata Algorithm).

Local search techniques based on a regularization framework (adding constraints on the solution, i.e. minimization of a modified risk functional). E.g., Isodata and fuzzy clustering paradigms: Fuzzy C-Means (FCM) (Bezdek, 1981), Deterministic Annealing (DA) (Rose, 1990), Possibilistic Clustering (Krishnapuram & Keller, 1993, 1996), and Graded Possibilistic Clustering (Masulli & Rovetta, 2003).

Global search techniques, e.g., minimization of E(X) using Simulated Annealing (Bogus et al., 1999), or Evolutionary Computing (Fogel, 1993; Bezdek et al., 1994; Tseng&Yang, 1997; Egan, 1998; Kuncheva et al., 1998; Hall et al., 1999; Masulli et al., 1999).


K-Means (Lloyd, 1957): cons

The number of clusters to find must be provided, and this can be done only using some a priori information or an additional validity criterion.
K-Means can deal only with clusters with a spherically symmetrical point distribution, since Euclidean distances of patterns from centroids are computed, leading to a spherical invariance.
The approximation in Eq. 17 making u_ih ∈ {0, 1} is often too strong, while, by contrast, in real cases some objects show non-zero similarity degrees to different classes.


Isodata Algorithm: Iterative Self-Organizing Data Analysis Techniques

Heuristic algorithm which allows the number of clusters to be automatically adjusted during the iteration by merging similar clusters and splitting clusters with large standard deviations.
The Isodata algorithm is more flexible than the K-means method, but the user has to choose empirically many more parameters.
In the next slides we'll give an example of the Isodata Algorithm from http://fourier.eng.hmc.edu/e161/lectures/classification/node13.html


Isodata Algorithm: Iterative Self-Organizing Data Analysis Techniques

Given a set of samples $\{\mathbf{x}_i,\ i = 1, 2, \dots, N\}$ (where each $\mathbf{x}^{(i)} = [x_1^{(i)}, \dots, x_n^{(i)}]^T$ is a column vector representing a point in the n-dimensional feature space):

K = number of clusters desired;

I = maximum number of iterations allowed;

P = maximum number of pairs of clusters which can be merged;

Θ_N = threshold value for the minimum number of samples (cardinality) each cluster can have (used for discarding clusters);

Θ_S = threshold value for the standard deviation (used for the split operation);

Θ_C = threshold value for pairwise distances (used for the merge operation).


Isodata Algorithm: Algorithm

Step 1. Arbitrarily choose k (not necessarily equal to K) initial cluster centers from the data set $\{\mathbf{x}_i,\ i = 1, 2, \dots, N\}$.

Step 2. Assign each of the N data-points to the closest cluster center: $\mathbf{x} \in \omega_j$ if $D_L(\mathbf{x}, \mathbf{v}_j) = \min\{D_L(\mathbf{x}, \mathbf{v}_i),\ i = 1, \dots, k\}$.

Step 3. Discard clusters with fewer than Θ_N members, i.e., if for any j, N_j < Θ_N, then discard ω_j and k ← k − 1.

Step 4. Update each cluster center: $\mathbf{v}_j = \frac{1}{N_j} \sum_{\mathbf{x} \in \omega_j} \mathbf{x}$, (j = 1, · · · , k).

Step 5. Compute the average distance D_j of the data-points in cluster ω_j from their cluster center:

$D_j = \frac{1}{N_j} \sum_{\mathbf{x} \in \omega_j} D_L(\mathbf{x}, \mathbf{v}_j)$, (j = 1, ..., k).


Isodata Algorithm: Algorithm

Step 6. Compute the overall average distance $\overline{D}$ of the data-points from their respective cluster centers:

$\overline{D} = \frac{1}{N} \sum_{j=1}^{k} N_j D_j$

Step 7. If k ≤ K/2 (too few clusters), go to Step 8; else if k > 2K (too many clusters), go to Step 11; else go to Step 14. (Steps 8 through 10 are for the split operation, Steps 11 through 13 are for the merge operation.)


Step 8. First step to split. Find the standard deviation vector σ_j = [σ_1^(j), ..., σ_n^(j)]^T for each cluster:

σ_i^(j) = √( (1/N_j) ∑_{x ∈ ω_j} (x_i − v_i^(j))² ),

where v_i^(j) is the i-th component of v_j and σ_i^(j) is the standard deviation of the data-points in ω_j along the i-th coordinate axis; N_j is the number of data-points in ω_j.

Step 9. Find the maximum component of each σ_j and denote it by σ_max^(j); do it for each j = 1, ..., k.


Step 10. If for any σ_max^(j), (j = 1, ..., k), all of the following are true:

σ_max^(j) > ΘS;  D_j > D;  N_j > 2ΘN,

then split v_j into two new cluster centers v_j^+ and v_j^− by adding ±δ to the component of v_j corresponding to σ_max^(j), where δ can be ασ_max^(j) for some α > 0. Then delete v_j and let k ← k + 1. Go to Step 2. Else go to Step 14.

Step 11. First step to merge. Compute the pairwise distances D_ij between every two cluster centers: D_ij = D_L(v_i, v_j), and arrange these k(k − 1)/2 distances in ascending order.

Step 12. Find no more than P smallest D_ij's which are also smaller than ΘC and keep them in ascending order: D_{i1 j1} ≤ D_{i2 j2} ≤ ... ≤ D_{iP jP}


Step 13. Perform pairwise merge: for l = 1, ..., P, do the following: if neither v_{il} nor v_{jl} has been used in this iteration, then merge them to form a new center:

v = (N_{il} v_{il} + N_{jl} v_{jl}) / (N_{il} + N_{jl}).

Delete v_{il} and v_{jl} and let k ← k − 1. Go to Step 2.

Step 14. Terminate if the maximum number of iterations I is reached. Otherwise go to Step 2.
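
A minimal NumPy sketch of the procedure above may help fix the ideas; it compresses the assign / discard / split / merge logic into one loop. The default parameter values, the single split or merge pass per iteration and the fallback used for empty clusters are illustrative assumptions, not part of the original formulation:

import numpy as np

def isodata(X, K=3, I=20, P=2, theta_N=5, theta_S=1.0, theta_C=1.5, alpha=0.5, seed=0):
    """Simplified ISODATA sketch (illustrative defaults, not the full original procedure)."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), K, replace=False)].astype(float)     # Step 1: initial centers
    for _ in range(I):                                            # Step 14: at most I iterations
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                 # Step 2: closest center
        keep = [j for j in range(len(V)) if (labels == j).sum() >= theta_N]
        V = V[keep]                                               # Step 3: discard small clusters
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        N = np.array([(labels == j).sum() for j in range(len(V))])
        V = np.array([X[labels == j].mean(axis=0) if N[j] else V[j] for j in range(len(V))])  # Step 4
        Dj = np.array([d[labels == j, j].mean() if N[j] else 0.0 for j in range(len(V))])     # Step 5
        D = np.average(Dj, weights=N)                             # Step 6: overall average distance
        if len(V) <= K / 2:                                       # Step 7: too few clusters -> split
            new_V = []
            for j in range(len(V)):                               # Steps 8-10
                sigma = X[labels == j].std(axis=0) if N[j] else np.zeros(X.shape[1])
                i_max = sigma.argmax()
                if sigma[i_max] > theta_S and Dj[j] > D and N[j] > 2 * theta_N:
                    delta = alpha * sigma[i_max]                  # split along the widest axis
                    v_plus, v_minus = V[j].copy(), V[j].copy()
                    v_plus[i_max] += delta
                    v_minus[i_max] -= delta
                    new_V += [v_plus, v_minus]
                else:
                    new_V.append(V[j])
            V = np.array(new_V)
        elif len(V) > 2 * K:                                      # Step 7: too many clusters -> merge
            pairs = sorted((np.linalg.norm(V[i] - V[j]), i, j)
                           for i in range(len(V)) for j in range(i + 1, len(V)))
            pairs = [p for p in pairs if p[0] < theta_C][:P]      # Steps 11-12
            used, merged = set(), []
            for _, i, j in pairs:                                 # Step 13: weighted pairwise merge
                if i in used or j in used:
                    continue
                merged.append((N[i] * V[i] + N[j] * V[j]) / (N[i] + N[j]))
                used |= {i, j}
            V = np.array([V[j] for j in range(len(V)) if j not in used] + merged)
    return V

On well-separated data the number of returned centers typically settles near K; the thresholds ΘN, ΘS and ΘC control how aggressively clusters are discarded, split and merged.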


K-Medoids Algorithm

The k-medoids algorithm is a clustering algorithm related to the k-means algorithm.
In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars).
The set of medoids M is chosen so as to minimize the functional:

F(X) = ∑_{j=1}^{k} ∑_{x ∈ π_j} | x − m_j |,   (21)

where m_j is selected from the x ∈ π_j with the aim to minimize F(X).
It is more robust to noise and outliers as compared to k-means because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.


The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm (Theodoridis & Koutroumbas, 2006), and it is as follows (a sketch in code is given after the list):

(1) Initialize: randomly select k of the n data points as the medoids

(2) Associate each data point to the closest medoid ("closest" here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance)

(3) For each medoid m

(3.1) For each non-medoid data point o

(3.1.1) Swap m and o and compute the total cost of the configuration

(4) Select the configuration with the lowest cost.

(5) Repeat steps 2 to 5 until there is no change in the medoids.
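
A small Python sketch of this swap-based search, assuming Euclidean distance and a naive recomputation of the total cost for every candidate swap (function and parameter names are illustrative):

import numpy as np

def total_cost(X, medoids):
    """Sum of distances of every point to its closest medoid (sum of dissimilarities)."""
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), k, replace=False))      # (1) random initial medoids
    best = total_cost(X, medoids)
    improved = True
    while improved:                                           # (5) repeat until no change
        improved = False
        for mi in range(k):                                   # (3) each medoid position
            for o in range(len(X)):                           # (3.1) each non-medoid data point
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[mi] = o                             # (3.1.1) swap and evaluate the cost
                cost = total_cost(X, candidate)
                if cost < best:                               # (4) keep the cheapest configuration
                    best, medoids, improved = cost, candidate, True
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return medoids, d.argmin(axis=1)                          # (2) final assignment to the medoids

Each candidate swap re-evaluates all point-to-medoid distances, which is why PAM in this plain form is usually applied to data sets of moderate size.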


Image Segmentation / Binarization

We associate to each pixel a square subimage centered on it, plus a possible set of features extracted by some image processing operator (augmented feature vector): x_ij = (x_(i−k)(j−k), ..., x_(i+k)(j+k), f_1, ..., f_r).
We use a small number of codevectors and we assign a false color to each cluster for image segmentation.
If we segment using only two codevectors we obtain a binarized image (see the sketch after the figure below).

[Figure: two original images and their binarized versions]
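
A possible sketch of the binarization case, using batch K-Means on per-pixel window features; the window size, the number of iterations and the omission of the extra features f_1, ..., f_r are illustrative simplifications:

import numpy as np

def binarize(img, k=1, n_codevectors=2, iters=20, seed=0):
    """Cluster per-pixel window features with K-Means and map each cluster to a false color."""
    rng = np.random.default_rng(seed)
    pad = np.pad(img.astype(float), k, mode='edge')
    H, W = img.shape
    # augmented feature vector: the (2k+1)x(2k+1) window centered on each pixel
    feats = np.array([pad[i:i + 2*k + 1, j:j + 2*k + 1].ravel()
                      for i in range(H) for j in range(W)])
    V = feats[rng.choice(len(feats), n_codevectors, replace=False)]   # initial codevectors
    for _ in range(iters):                                            # batch K-Means
        d = np.linalg.norm(feats[:, None, :] - V[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        V = np.array([feats[labels == c].mean(axis=0) for c in range(n_codevectors)])
    shades = np.linspace(0, 255, n_codevectors)                       # one false color per cluster
    return shades[labels].reshape(H, W)

With n_codevectors = 2 the returned image contains only two grey levels, i.e. a binarized image; a few more codevectors give a false-color segmentation.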


Image Segmentation

Robustness to noise: a robustness measure for an estimator is the breakdown point, defined as the fraction of outliers able to corrupt the estimate.

[Figure: original MRI image, the same image with 7% Gaussian noise added, and the resulting segmented image]


Batch K-Means (Lloyd, 1957)

The term batch means that at each step the algorithm takes into account the whole data set to update the codevectors. When the cardinality n of the data set X is very high (e.g., several hundreds of thousands) the batch procedure is computationally expensive.
An on-line update has been introduced leading to the on-line K-Means algorithm (Linde, 1980; MacQueen, 1967). At each step, this method simply randomly picks an input pattern and updates its nearest codevector, ensuring that the scheduling of the updating coefficient is adequate to allow convergence and consistency.


On-Line K-Means / Vector Quantization

1 Initialize the codebook with small codevectors at the center of the hyperbox of data
2 Winner-Takes-All (WTA): randomly pick an input x and select its nearest codevector v_j (the winner)
3 Adapt the winner codevector:

∆v_j = ε(t)(x − v_j)   (22)

4 Stochastic Approximation (Robbins-Monro, 1951): the learning rate is a decreasing function of time,

ε = ε(t)   (23)

PROS:

minimizes the K-Means functional → approximates K-Means solutions

adapts to a changing environment.

CONS:

degenerate codevectors at the end of learning that represent no examples (waste of resources)
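
A compact sketch of the on-line update of Eqs. 22-23; the 1/t decay is one possible Robbins-Monro schedule and the initialization around the data mean is an illustrative choice:

import numpy as np

def online_kmeans(X, k, epochs=5, eps0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    V = X.mean(axis=0) + 0.01 * rng.standard_normal((k, X.shape[1]))  # small codevectors near the center
    t = 1
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            x = X[i]                                             # randomly pick an input pattern
            winner = np.argmin(np.linalg.norm(V - x, axis=1))    # Winner-Takes-All
            eps = eps0 / t                                       # eps(t): Robbins-Monro schedule
            V[winner] += eps * (x - V[winner])                   # Eq. 22: adapt the winner only
            t += 1
    return V

Units that never win stay close to their initial position, which is exactly the degeneration problem listed among the CONS above.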

[Figures: successive snapshots of the codevectors during on-line K-Means / Vector Quantization training]

On-Line K-Means / Vector Quantization

Solutions:

Conscience mechanism (DeSieno, 1988) [DES88]. Determine the winner as:

s(x) = arg min_{v_j ∈ V} (f_j ‖x − v_j‖)   (24)

where f_j is the frequency of past wins of unit j.

Self Organizing Maps [Kohonen, 1981];

Fuzzy Learning Vector Quantization (Bezdek, 1995).
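
A sketch of the frequency-weighted winner selection of Eq. 24; the running exponential estimate of the win frequencies f_j is an illustrative choice, not part of the original formulation:

import numpy as np

def conscience_winner(x, V, f, gamma=0.999):
    """Winner selection with a conscience: distances are weighted by past win frequencies."""
    dist = np.linalg.norm(V - x, axis=1)
    winner = np.argmin(f * dist)          # Eq. 24: frequency-weighted distance
    f *= gamma                            # running (exponential) estimate of the win frequency
    f[winner] += 1.0 - gamma
    return winner

# f is typically initialized to 1/k for every unit, e.g. f = np.full(len(V), 1.0 / len(V))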


Self Organizing Maps (Kohonen, 1981)[KoH90]

Kohonen, T., Automatic formation of topological maps of patterns in a self-organizing system. In Oja, E. and Simula, O., editors, Proceedings of 2SCIA, Scand. Conference on Image Analysis, pages 214-220, Helsinki, Finland. Suomen Hahmontunnistustutkimuksen Seura r.y., 1981.

A Self Organizing Map (SOM), also known as Self Organizing Feature Map (SOFM), represents data by means of codevectors organized on a grid with fixed topology.
Codevectors move to adapt to the input distribution, but adaptation is propagated along the grid also to neighboring codevectors, according to a given propagation or neighborhood function. This effectively constrains the evolution of the codevectors.


Grid topologies may differ. We consider a two-dimensional, square-mesh topology.
The distance on the grid is used to determine how strongly a codevector is adapted when the unit a_ij is the winner.
The metric used on a rectangular grid is the Manhattan distance, for which the distance between two elements r = (r_1, r_2) and s = (s_1, s_2) is:

d_rs = |r_1 − s_1| + |r_2 − s_2|.   (25)


Self Organizing Maps (Kohonen, 1981) - Algorithm

1 Initialize the codebook V with small codevectors at the center of the hyperbox of data
2 Initialize the set C of connections to form the rectangular grid of dimension n_1 × n_2
3 Initialize t = 0
4 Randomly pick an input x from X
5 Determine the winner:

s(x) = arg min_{v_j ∈ V} ‖x − v_j‖   (26)

6 Adapt each codevector:

∆v_j = ε(t) h(d_rs)(x − v_j)   (27)

where h is a decreasing function of d, e.g.: h(d_rs) = exp(−d_rs² / (2σ²(t)))
7 Increment t
8 If t < t_max go to step 4
9 End.


σ(t) and ε(t) are decreasing functions of t, e.g. (Ritter, 1991):

σ(t) = σ_i (σ_f / σ_i)^(t / t_max),   ε(t) = ε_i (ε_f / ε_i)^(t / t_max),

where σ_i, ε_i are the initial values and σ_f, ε_f the final values of the functions σ(t) and ε(t).
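
Putting the algorithm and the schedules above together, a possible Python sketch (grid size, t_max and the schedule endpoints are illustrative parameters):

import numpy as np

def som(X, n1, n2, t_max=10000, eps_i=0.5, eps_f=0.01, sigma_i=3.0, sigma_f=0.5, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.array([(r1, r2) for r1 in range(n1) for r2 in range(n2)])      # unit positions on the grid
    V = X.mean(axis=0) + 0.01 * rng.standard_normal((n1 * n2, X.shape[1]))   # small codevectors near the center
    for t in range(t_max):
        frac = t / t_max
        eps = eps_i * (eps_f / eps_i) ** frac            # eps(t), exponential schedule (Ritter, 1991)
        sigma = sigma_i * (sigma_f / sigma_i) ** frac    # sigma(t)
        x = X[rng.integers(len(X))]                      # step 4: random input
        s = np.argmin(np.linalg.norm(V - x, axis=1))     # step 5: winner, Eq. 26
        d_grid = np.abs(grid - grid[s]).sum(axis=1)      # Manhattan distance on the grid, Eq. 25
        h = np.exp(-d_grid**2 / (2 * sigma**2))          # neighborhood function
        V += eps * h[:, None] * (x - V)                  # step 6: adapt every codevector, Eq. 27
    return V, grid

After training, each pattern can be mapped to the grid position of its winning unit, which is how the map is used for visualization and, as discussed below, for clustering.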

[Figures: evolution of the SOM codevectors during training; calibration of the map; test with a uniform (noise) distribution]

Self Organizing Maps (Kohonen, 1981) - Using SOM for clustering

The method was originally devised as a tool for embedding multidimensional data into typically two-dimensional spaces, for data visualization.
Since then, it has also been frequently used as a clustering method, which was originally not considered appropriate because of the constraints imposed by the topology.


Neural Gas (Martinetz, 1993) [MAR93]

This technique resembles the SOM in the sense that not only the winner codevector is adapted.
It is different in that codevectors are not constrained to be on a grid, and the adaptation of the codevectors near the winner is controlled by a criterion based on distance ranks.
Each time a pattern x is presented, all the codevectors v_j are ranked according to their distance to x (the closest obtains the lowest rank).


ρ_j : rank of the distance between x and the codevector v_j

Update rule:

∆v_j = ε(t) h_λ(ρ_j)(x − v_j)   (28)

with
ε(t) ∈ [0, 1] gradually lowered as t increases;
h_λ(ρ_j) a function decreasing with ρ_j with a characteristic decay λ; usually h_λ(ρ_j) = exp(−ρ_j / λ).


1 Initialize the codebook V by randomly picking from X
2 Initialize the time parameter t = 0
3 Randomly pick an input x from X
4 Order all elements v_j of V according to their distance to x, obtaining the ranks ρ_j
5 Adapt the codevectors according to Eq. 28
6 Increase the time parameter t = t + 1
7 If t < t_max go to step 3.
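
A direct transcription of these steps in Python; the exponential decay schedules for ε(t) and λ(t) are illustrative choices:

import numpy as np

def neural_gas(X, k, t_max=10000, eps_i=0.5, eps_f=0.01, lam_i=None, lam_f=0.1, seed=0):
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), k, replace=False)].astype(float)   # step 1: codebook picked from X
    lam_i = lam_i if lam_i is not None else k / 2.0             # illustrative initial decay
    for t in range(t_max):                                      # steps 2, 6, 7
        frac = t / t_max
        eps = eps_i * (eps_f / eps_i) ** frac
        lam = lam_i * (lam_f / lam_i) ** frac
        x = X[rng.integers(len(X))]                             # step 3: random input
        dist = np.linalg.norm(V - x, axis=1)
        rho = np.argsort(np.argsort(dist))                      # step 4: distance ranks rho_j
        h = np.exp(-rho / lam)                                  # h_lambda(rho_j)
        V += eps * h[:, None] * (x - V)                         # step 5: update rule, Eq. 28
    return V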


The Capture Effect Neural Network [Firenze et al. 1994] [FIR94]

The Capture Effect Neural Network (CENN) [Firenze et al. 1994] is a self-organizing neural network able to take into account the local characteristics of the point distribution (adaptive resolution clustering).
CENN combines standard competitive self-organization of the weight-vectors [Kohonen95] with a non-linear mechanism of adaptive local modulation of the receptive fields (RF) of the neurons (Capture Effect).


Training step

The learning phase of CENN is composed of the training step, performing a vector quantization of the data, and the labeling step, where the prototypes obtained by the previous step are grouped in order to obtain robust clusters.
In the training step an initial abundant quantity of neurons n_i = {w_i, r_i} is assumed and initialized with randomly chosen weight vectors w_i (representing centers of sub-clusters), and large radii r_i (r_i = R_0) of the receptive fields RF_i (modeled by Gaussian functions γ).
The radius of a Gaussian RF is defined as the radius of an α-cut of the RF itself.


Then the data set is presented to CENN and the following learning formulas are applied:

∆w_i = η_w (x_k − w_i) γ(d_i(x_k)) / ∑_l γ(d_l(x_k)),   (29)

∆r_i = η_r (d_i(x_k) − r_i) exp(−d_i(x_k)/p),  and ∆r_i = 0 if d_i(x_k) ≥ R_0,   (30)

where η_w and η_r are learning rates, d_i(x_k) = ‖x_k − w_i‖ is the Euclidean distance of points to weight vectors, and the parameter p is defined as:

p ≡ <d_i(x_k)> / (D ln 10),   (31)

assuming D as the dimension of the feature space.
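
A sketch of a single training update (Eqs. 29-31). The Gaussian form assumed for γ (using the current radius r_i), the averaging over the units in Eq. 31, and the learning-rate values are assumptions made for illustration:

import numpy as np

def cenn_training_step(x, W, r, R0, eta_w=0.05, eta_r=0.05):
    """One CENN training update for a single pattern x (W: weight vectors, r: RF radii)."""
    # W and r are float arrays; r is initialized to R0 for every neuron
    d = np.linalg.norm(W - x, axis=1)                            # d_i(x) = ||x - w_i||
    gamma = np.exp(-d**2 / (2 * r**2))                           # assumed Gaussian receptive fields
    W += eta_w * (gamma / gamma.sum())[:, None] * (x - W)        # Eq. 29
    D = W.shape[1]
    p = d.mean() / (D * np.log(10))                              # Eq. 31 (average taken over the units here)
    active = d < R0                                              # Eq. 30: no radius update if d_i >= R0
    r[active] += eta_r * (d[active] - r[active]) * np.exp(-d[active] / p)
    return W, r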

Page 118: Introduction to Data Clustering 1 - unige.it...Introduction Partitioning Methods Parametric/Statistical clustering Hard Clustering Introduction to Data Clustering 1 Francesco Masulli

IntroductionPartitioning Methods

Parametric/Statistical clusteringHard Clustering

Labeling step

The labeling step

discards any neuron n_q with r_q = R_0, i.e. discards neurons not representing elements of the training set, and

then pairs of neurons n_p and n_q will receive the same label (i.e. their associated clusters are merged) if

‖w_p − w_q‖ < (r_p + r_q) σ,  σ ∈ (0, 1),   (32)

i.e., if they have (partially) overlapping RFs. The parameter σ is named the degree of overlapping.
This process obtains c groups of neurons G_j, j ∈ [1, c]. We can then define, for the cluster related to the j-th group, its center and radius as:

y_j ≡ <w_•>_{G_j},   r_j ≡ <r_•>_{G_j}.   (33)
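
The merging of Eq. 32 amounts to finding connected components among neurons with overlapping receptive fields; a possible sketch, assuming that neurons with r = R_0 have already been discarded and with an illustrative value for σ:

import numpy as np

def label_neurons(W, r, sigma=0.5):
    """Group neurons whose receptive fields overlap (Eq. 32) and compute Eq. 33 per group."""
    n = len(W)
    labels = np.arange(n)                           # each surviving neuron starts in its own group
    for p in range(n):
        for q in range(p + 1, n):
            if np.linalg.norm(W[p] - W[q]) < (r[p] + r[q]) * sigma:
                labels[labels == labels[q]] = labels[p]    # merge the two groups
    groups = np.unique(labels)
    centers = np.array([W[labels == g].mean(axis=0) for g in groups])   # y_j: group averages
    radii = np.array([r[labels == g].mean() for g in groups])           # r_j: average radii
    return labels, centers, radii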


A remaining isolated neuron n_q is called associable to a group G_j if

‖w_q − w_•‖ < (r_q + r_•)   (34)

at least for one neuron n_• of the group G_j.
For such neurons associable to one or more groups, the following completion rule of the labeling step is applied: an isolated neuron n_q, associable to different groups, is assigned to the i-th group if and only if

i = arg ∨_j (r_{G_j} − r_q)  ∀j.   (35)


Operational phase

In the operational phase an unknown vector x will be classified by exploiting the winner-take-all (WTA) rule in the following way:

x ∈ (j-th cluster)  ⇐⇒  h = arg ∨_i ( ‖x − w_i‖ / r_i ),  with w_h ∈ G_j.   (36)


It is worth noting that, after the learning phase:

the distribution of the prototypes in the feature space approaches the optimal vector quantization scheme of the distribution of input data, that is, approximates the mixture probability density function;
the radial size of the RF of each neuron reaches a stable value which is strongly related to the spatial density of the input data locally around the weight-vector (that is, the center of the RF).
