Introduction | Partitioning Methods | Parametric/Statistical clustering | Hard Clustering
Introduction to Data Clustering 1
Francesco Masulli
DIBRIS - Dip. Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi, University of Genova, ITALY
& S.H.R.O. - Sbarro Institute for Cancer Research and Molecular Medicine
Temple University, Philadelphia, PA, USA. Email: [email protected]
ML-CI 2016
Francesco Masulli Introduction to Data Clustering 1
Outline
1 Introduction
2 Partitioning Methods
3 Parametric/Statistical clustering
4 Hard Clustering
Machine Learning
In 1959, Arthur Samuel defined Machine Learning as a "Field of study that gives computers the ability to learn without being explicitly programmed".

Machine learning is about the construction and study of systems (learners) that can learn from data.

For example, a learner could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders (generalization).
Machine Learning
In machine learning and pattern recognition, a pattern is a data point (or instance or condition), represented by a vector of characteristics or attributes or features.

From a statistical viewpoint, a pattern is a random vector or a multivariate random variable.
Clustering Problem
Let us consider a set of labels and a set of unlabeled patterns. The classification problem concerns the assignment of a label to each pattern in such a way that similar patterns will share the same label.

There are two principal approaches to solve the classification problem: in the first approach a set of labeled instances (training set) is supposed to be available for the design of the classifier, while in the second approach the available instances are unlabeled.

In the first case we deal with the supervised classification problem, while in the second case we deal with unsupervised classification or clustering problems.
The Concept of Clustering
Greek philosopher Plato (∼ 400 BC): grouping objects based on their similar properties (categorization).
Statesman dialogue: http://www.gutenberg.org/files/1738/1738-h/1738-h.htm

Approach further explored and systematized by Aristotle (∼ 350 BC): differences between classes and objects.
Categories treatise: http://classics.mit.edu/Aristotle/categories.html
The Concept of Clustering
Principles of grouping (Gestalt psychologists: M. Wertheimer, K. Koffka, W. Köhler, ∼1930):

> Law of Proximity: perception tends to group stimuli that are close together as part of the same object, and stimuli that are far apart as two separate objects.

> Law of Similarity: perception lends itself to seeing stimuli that physically resemble each other as part of the same object, and stimuli that are different as part of a different object.

> Law of Good Continuation: people tend to perceive each object as a single uninterrupted object.

> Laws of Closure, of Good Form, of Common Fate, etc.
The Concept of Clustering - Informal Definition of Clustering

To find a structure in given data that will be aggregated into some categories (or clusters).
Data belonging to a cluster are more similar to data of that cluster than to data of other clusters.
The aim of clustering methods is to group patterns on the basis of a similarity (or dissimilarity) criterion, where groups (or clusters) are sets of similar patterns.
What is a Cluster?
The notion of what constitutes a cluster is not well defined.

[Figure from (Steinbach et al., 2002)]

Clustering is not a well-posed problem, in the sense of Hadamard (1923).
Regularization - Ill-Posed Problems [Hadamard, 1923] [KEC01, HAYK09]

A problem is well posed [Hadamard, 1923] when a solution:
exists
is unique
depends continuously on the initial data (i.e., robustness against noise)

Many practical problems (especially inverse problems) turn out to be ill-posed.
E.g., differentiation is an ill-posed problem because its solution does not depend continuously on the data.
--> robot vision: inverse problem, ill-posed
Regularization - Ill-Posed Problems (Hadamard, 1923)

To solve an ill-posed problem we try to regularize it by introducing generic constraints that will restrict the space of solutions in an appropriate way.
The character of the constraints depends on a priori knowledge of the solution.
The constraints enable the calculation of admissible solutions, called regularized solutions, out of the other (perhaps an infinite number of) possible solutions.
Principle: Occam's razor (William of Occam, 1285-1349)

"we should prefer simpler models to more complex models"
"this preference should be traded off against the extent to which the model fits the data" (Bishop, 1996)
Clustering Algorithms - Clustering task

Unsupervised data analysis using clustering algorithms provides a useful tool to explore data structures.
Clustering methods have been addressed in many contexts and disciplines, such as data mining, document retrieval, image segmentation and pattern classification (Jain, 2009; Xu, 2005).
Clustering Problem
We deal with unsupervised classification when:
labeling is very expensive or infeasible
the available labeling is ambiguous
we need to improve our understanding of the nature of patterns
we want to reduce the amount of data (information) to transmit
Operational Definition of Clustering

Features to use (usually they are given, sometimes we select them), their attributes (binary, discrete, continuous) and scales.
REMARK: A good choice of features can lead to a good quality of clustering performance.
Clustering paradigm:
Hierarchical
Partitive
Vicinity
(Dis-)similarity measures (indexes): (generalized-)Euclidean distance, correlation, Hamming, Jaccard
An optimization procedure (when required)
Clustering Paradigms
Hierarchical clustering is able to find structures which can be further divided into substructures, and so on recursively. The result is a hierarchical structure of groups known as a dendrogram. (Jain, 1999; Sneath, 1973; Ward, 1963)
Clustering Paradigms
Partitive/central clustering tries to obtain a single partition of the data, often based on the optimization of an appropriate objective function.
In a good cluster, the distances between the points and the cluster centroid are small (cluster compactness).
Clustering Paradigms

Vicinity (connectivity) clustering: a good cluster is one in which each point shares the same cluster label as its nearest neighbor ⇒ it can represent any cluster shape, that is, an arbitrary manifold in the data space (Shared Nearest Neighbor Clustering (Jarvis et al., 1973; Ertoz et al., 2013), Spectral clustering (Filippone et al., 2008)).
Clustering Algorithms - Representation and similarity measures

Crucial aspects in clustering are pattern representation and the similarity measure:

Each pattern is usually represented by a set of features of the system under study. It is very important to notice that a good choice of representation of patterns can lead to improvements in clustering performance. Whether it is possible to choose an appropriate set of features depends on the system under study.

Once a representation is fixed it is possible to choose an appropriate similarity measure among patterns. The most popular dissimilarity measure for metric representations is the distance or distortion, for instance the Euclidean distance (Duda & Hart, 1973).
Clustering - Similarity measures

Direction cosine for continuous-valued patterns:

cos θ = <x, y> / (‖x‖ ‖y‖)

If vectors x and y are unitary → cos θ = <x, y> ≡ C:
if C = 1 (full agreement) → y = a x
if C = 0 → x ⊥ y
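These relations can be checked with a short sketch in Python (the function name is illustrative):

```python
import math

def direction_cosine(x, y):
    """Cosine of the angle between x and y: <x, y> / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# y = 2x (full agreement) -> C = 1
print(direction_cosine([1.0, 2.0], [2.0, 4.0]))   # 1.0
# orthogonal vectors -> C = 0
print(direction_cosine([1.0, 0.0], [0.0, 3.0]))   # 0.0
```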
ClusteringSimilarity measures
Euclidean distance for continuous-valued patterns:

E(x, y) ≡ ‖x − y‖ = √( ∑_{i=1}^n (x_i − y_i)² )

‖x − y‖² = ‖x‖² + ‖y‖² − 2 x · y
Clustering - Similarity measures

Minkowski distance for continuous-valued patterns:

M(x, y) = ( ∑_{i=1}^n |x_i − y_i|^λ )^{1/λ}

λ = 1 → city-block distance or Manhattan distance or l1 distance
λ = 2 → Euclidean distance or l2 distance
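A minimal sketch of the Minkowski family (the function name is illustrative; the absolute value makes the formula valid for odd λ as well):

```python
def minkowski(x, y, lam):
    """Minkowski distance: (sum_i |x_i - y_i|^lam)^(1/lam)."""
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1.0 / lam)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))  # city-block (l1): 7.0
print(minkowski(x, y, 2))  # Euclidean (l2): 5.0
```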
Clustering - Similarity measures

Generalized Hamming distance, for ordered sets with discrete-valued elements (binary, characters, etc.): the number of differing elements, e.g.

x = ( p, a, t, t, e, r, n )
y = ( w, e, s, t, e, r, n )
      ↑  ↑  ↑
H(x, y) = 3
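The example above can be reproduced with a one-line sketch:

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

print(hamming("pattern", "western"))  # 3
```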
Clustering - Similarity measures

Jaccard index, for sets (a similarity in [0, 1]; the corresponding Jaccard distance is 1 minus this value):

J(A, B) = |A ∩ B| / |A ∪ B|

Distance of categorical variables:

T(x, y) = δ(x, y) = { 0 if x = y; 1 otherwise }

The notation δ(·, ·) here denotes a mismatch indicator (the complement of the Kronecker delta).
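Both measures are straightforward to sketch (function names are illustrative):

```python
def jaccard_index(A, B):
    """|A ∩ B| / |A ∪ B| for finite sets; a similarity in [0, 1]."""
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)

def categorical_mismatch(x, y):
    """0 if x == y, 1 otherwise (simple distance for categorical values)."""
    return 0 if x == y else 1

print(jaccard_index({1, 2, 3}, {2, 3, 4}))  # 2 shared out of 4 total -> 0.5
print(categorical_mismatch("red", "red"))   # 0
print(categorical_mismatch("red", "blue"))  # 1
```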
Clustering - Similarity measures

Gaussian kernel similarity function for continuous-valued patterns:

W(x, y) = exp( −M(x, y)² / (2σ²) )

where
M(x, y) is a distance
σ is the spread of the kernel
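A sketch with the Euclidean distance as M(x, y) (the standard RBF form; the function name is illustrative):

```python
import math

def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) similarity: exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

print(gaussian_kernel([0.0, 0.0], [0.0, 0.0], 1.0))  # identical points -> 1.0
print(gaussian_kernel([0.0], [2.0], 1.0))            # exp(-2) ≈ 0.1353
```

The similarity decays smoothly from 1 (identical patterns) toward 0 as the distance grows; σ controls how fast.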
REMARK: Expectation
In probability theory, the expectation (or expected value, mathematical expectation, EV, mean, or first moment) of a random variable is the weighted average of all possible values that this random variable can take on.
The weights used in computing this average correspond to the probabilities in the case of a discrete random variable, or to densities in the case of a continuous random variable.
From a rigorous theoretical standpoint, the expected value is the integral of the random variable with respect to its probability measure.
REMARK: Expectation
Definition (Expectation of a discrete random variable)
Suppose discrete random variable x can take value x1 with probability p1, value x2 with probability p2, and so on, up to value xk with probability pk. Then the expectation of this random variable x is defined as:

E(x) = ∑_{i=1}^k x_i p_i

Definition (Expectation of a univariate continuous random variable)
If the probability distribution of x admits a probability density function p(x), then the expected value can be computed as:

E(x) = ∫_{−∞}^{+∞} x p(x) dx
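The discrete definition, checked on a fair six-sided die (all p_i = 1/6, an assumed example):

```python
# E[x] = sum_i x_i p_i for a fair die: (1 + 2 + ... + 6) / 6 = 3.5
values = [1, 2, 3, 4, 5, 6]
probs = [1.0 / 6] * 6
E = sum(x * p for x, p in zip(values, probs))
print(E)  # 3.5
```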
REMARK: Expectation
REMARK: for multivariate random variables we have:

E(x) = ∑_{i=1}^k x_i p_i   when x is a multivariate discrete random variable

E(x) = ∫_{−∞}^{+∞} x p(x) dx   when x is a multivariate continuous random variable
REMARK: Expectation
Theorem (Law of the Unconscious Statistician)
Let g(x) be a function of a random variable x. We know the probability distribution of x but we do not know explicitly the distribution of g(x). The expected value of g(x) is then:

E[g(x)] = ∑_{i=1}^k g(x_i) p_i   when x is a multivariate discrete random variable

E[g(x)] = ∫_{−∞}^{+∞} g(x) p(x) dx   when x is a multivariate continuous random variable
REMARK: Univariate Normal Distribution
Univariate Normal Distribution:

p(x) = N(µ, σ²) ≡ (1 / (σ √(2π))) e^{−(1/2)((x − µ)/σ)²}   (1)

where:
σ² variance; σ standard deviation
expectation of x, or mean: E[x] = ∫_{−∞}^{∞} x p(x) dx ≡ µ
variance: E[(x − µ)²] = ∫_{−∞}^{∞} (x − µ)² p(x) dx ≡ σ²
REMARK: Univariate Normal Distribution
Univariate Normal Distribution:
99.7% of samples are in the interval |x − µ| ≤ 3σ
95% of samples are in the interval |x − µ| ≤ 2σ
68% of samples are in the interval |x − µ| ≤ σ
Mahalanobis distance - Multivariate normal distribution:

p(x) = N(µ, Σ) ≡ (1 / ((2π)^{d/2} |Σ|^{1/2})) exp[ −(1/2) (x − µ)^t Σ^{−1} (x − µ) ]   (2)

where:
d dimensionality of the feature space; (x − µ)^t transpose of (x − µ)
µ ≡ E[x] mean vector
Σ ≡ E[(x − µ)(x − µ)^t] covariance matrix
σ_ij = E[(x_i − µ_i)(x_j − µ_j)] ij-th element of matrix Σ
σ_ii variance of x_i; σ_ij covariance of x_i and x_j
σ_ij = 0 ⟺ x_i and x_j statistically independent
Σ symmetric and positive semidefinite (i.e. z^t Σ z ≥ 0)
Σ^{−1} inverse of Σ; |Σ| determinant of Σ

Mahalanobis distance:

M(x, µ) = √( (x − µ)^t Σ^{−1} (x − µ) )
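A minimal sketch of the Mahalanobis distance (the function name is illustrative):

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """M(x, mu) = sqrt((x - mu)^t Sigma^{-1} (x - mu))."""
    d = np.asarray(x) - np.asarray(mu)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# With Sigma = I the Mahalanobis distance reduces to the Euclidean distance.
m1 = mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2))
print(m1)  # 5.0
# A large variance along a coordinate shrinks distances in that direction.
m2 = mahalanobis([3.0, 4.0], [0.0, 0.0], np.diag([9.0, 1.0]))
print(m2)  # sqrt(9/9 + 16/1) = sqrt(17)
```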
Mahalanobis distance - Multivariate normal distribution (figure)
Clustering Algorithms - Example 1 (figure)
Clustering Algorithms - Example 2

Representation of a text document with a word-vector. Bag-Of-Words format: each document is represented by the set of its word frequencies (ignoring the position of words in the document) and the categories that it belongs to.
The purpose of the format is to enable efficient execution of algorithms such as clustering, learning, classification, visualization, etc.
Partitions - Codevectors

Let X = {x_1, ..., x_n} be a data set composed of n patterns, with every x_i ∈ R^d.
The codebook (or set of centroids) V is defined as the set V = {v_1, ..., v_c}, typically with c ≪ n. Each element v_i ∈ R^d is called a codevector (or centroid or prototype).
The Voronoi region R_i of the codevector v_i is the set of vectors in R^d for which v_i is the nearest codevector:

R_i = { z ∈ R^d | i = argmin_j ‖z − v_j‖² }   (3)

It is possible to prove that each Voronoi region is convex (Linde, 1980) and that the boundaries of the regions are linear segments (pieces of hyperplanes).
Partitions - Voronoi set

The Voronoi set (or cell, region, polyhedron) π_i of the codevector v_i is the subset of elements of X for which v_i is the nearest codevector:

π_i = { x ∈ X | i = argmin_j ‖x − v_j‖² }   (4)

that is, the set of vectors belonging to R_i.
The partition of R^d induced by all Voronoi regions is called a Voronoi tessellation or Dirichlet tessellation.
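Eq. 4 amounts to a nearest-codevector assignment, sketched here for a toy data set (names and data are illustrative):

```python
def voronoi_sets(X, V):
    """Partition data X by nearest codevector in V (squared Euclidean distance)."""
    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    sets = {i: [] for i in range(len(V))}
    for x in X:
        i = min(range(len(V)), key=lambda j: d2(x, V[j]))  # argmin_j ||x - v_j||^2
        sets[i].append(x)
    return sets

X = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
V = [(0.0, 0.0), (5.0, 5.0)]
sets = voronoi_sets(X, V)
print(sets)  # {0: [(0.0, 0.0), (0.2, 0.1)], 1: [(5.0, 5.0), (5.1, 4.9)]}
```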
Vector quantization - Example

While the clustering approach is descriptive, vector quantization is predictive.

Sequence of images, 512×512 pixels of 8 bits each
consider sub-images of 4×4 pixels
we can represent a sub-image as a vector x_k = (x_k^1, ..., x_k^16)
Vector quantization
In the feature space the sub-images will aggregate in clusters.

The center y_j of a cluster j represents all elements of the cluster and is called a codevector.

Codebook:
y_1 = (y_1^1, ..., y_1^16)
y_2 = (y_2^1, ..., y_2^16)
...
y_c = (y_c^1, ..., y_c^16)
Vector quantization
When we have to transmit a sub-image x_k, we send in its place the index j, where

j = argmin_l ‖x_k − y_l‖   (WTA rule)

In this way we transmit an amount of information of order log₂ c instead of log₂(512 × 512 × 8).
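A quick check of the per-sub-image saving, assuming (illustratively) a codebook of c = 256 codevectors:

```python
import math

# Sending a codebook index instead of a raw 4x4 sub-image (16 pixels x 8 bits).
c = 256                                  # assumed codebook size
raw_bits_per_subimage = 16 * 8           # 128 bits for the raw sub-image
index_bits = math.ceil(math.log2(c))     # 8 bits for the index
print(raw_bits_per_subimage, index_bits) # 128 8
```

The price paid is distortion: the receiver reconstructs y_j, not x_k.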
Vector quantization - Color Images

Color models:
RGB: additive color model; red, green, and blue light are added together in various ways to reproduce a broad array of colors.
CMYK (or process color, four color): subtractive color model used in color printing. It refers to the four inks used in some color printing: cyan, magenta, yellow, and key (black).
CcMmYK (or CMYKLcLm): six-color subtractive color model used in some inkjet printers optimized for photo printing. CMYK model (cyan, magenta, yellow, and key) + light cyan (c) and light magenta (m).
etc.

[Figures: RGB (additive), CMYK (subtractive)]
Vector quantization - Color Images
RGB color images can be coded using 3 intensity matrices (R, G, B) of 512×512 pixels of 8 bits each;
each pixel will be a vector of 3 intensity levels p_ij = (r_ij, g_ij, b_ij);
a sub-image is a vector x_k = (r_k^1, g_k^1, b_k^1, ..., r_k^16, g_k^16, b_k^16);
the rest of the previous discussion remains unchanged.
Lagrange Multipliers - General case

Theorem
If a scalar field f(x1, ..., xn) has a relative extremum when it is subject to m constraints (m < n), say
g1(x1, ..., xn) = 0, ..., gm(x1, ..., xn) = 0,
the constrained optimization problem (CP) can be solved through the unconstrained optimization of

L ≡ f(x1, ..., xn) − ∑_{i=1}^m λ_i g_i(x1, ..., xn).

Definition
L is called the Lagrangian and the λ_i are called the Lagrange multipliers.
Lagrange Multipliers - General case

∇f(x1, ..., xn) = ∑_{i=1}^m λ_i ∇g_i(x1, ..., xn)

g1(x1, ..., xn) = 0
...
gm(x1, ..., xn) = 0

note: m + n equations, m + n unknown quantities
Lagrange Multipliers - Example

Find the extreme values of

z = x y

subject to the condition x + y = 1.

Solution:
f(x, y) = x y
g(x, y) = x + y − 1 = 0
∇f = λ ∇g
x + y = 1
Lagrange Multipliers - Example

y = λ
x = λ
x + y = 1

hence x = y = λ = 1/2 and f_max(x, y) = 1/4.
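A quick numerical check of the worked example (a grid search substituting y = 1 − x; purely illustrative):

```python
# Maximize f(x, y) = x*y on the constraint x + y = 1: with y = 1 - x the
# problem becomes 1-D, and the Lagrange solution predicts x = y = 1/2, f = 1/4.
best_x = max((i / 10000 for i in range(10001)), key=lambda x: x * (1 - x))
print(best_x, best_x * (1 - best_x))  # 0.5 0.25
```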
Picard Iteration (Recktenwald, 2000)

DEF: Fixed Point
A function g(x) is said to have a fixed point p if g(p) = p. In other words, the value you put into the function is exactly the same value that you get out.
Solving the equation f(x) = g(x) − x = 0 is identical to finding the fixed point of g(x) AND the zero of f(x). So we are dealing with another possible method for finding the root of a one-variable equation.

DEF: Fixed Point Iteration
The iteration process is p_n = g(p_{n−1}) for n = 1, 2, 3, .... This process is also called Picard iteration, functional iteration, or repeated substitution.
Picard Iteration (Recktenwald, 1998)

Finding the root of f(x) = log(x + 4) − x on [0, 2], i.e., f(x) = 0, is equivalent to finding the fixed point of g(x) = log(x + 4) on [0, 2]. Three starting points p0 = 0, 1, 2 (the tabulated values correspond to the base-10 logarithm):

 n | p_n     g(p_n)  | p_n     g(p_n)  | p_n     g(p_n)
 0 | 0.0000  0.6020  | 1.0000  0.6990  | 2.0000  0.7782
 1 | 0.6020  0.6629  | 0.6990  0.6720  | 0.7782  0.6793
 2 | 0.6629  0.6686  | 0.6720  0.6695  | 0.6793  0.6702
 3 | 0.6686  0.6691  | 0.6695  0.6693  | 0.6702  0.6693
 4 | 0.6691  0.6692  | 0.6693  0.6692  | 0.6693  0.6693
 5 | 0.6692  0.6692  | 0.6692  0.6692  | 0.6693  0.6692
 6 | 0.6692  0.6692  | 0.6692  0.6692  | 0.6692  0.6692
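The iteration tabulated above can be reproduced with a short sketch (base-10 log, as the tabulated values imply; the function name is illustrative):

```python
import math

def picard(g, p0, tol=1e-6, max_iter=100):
    """Fixed-point (Picard) iteration p_n = g(p_{n-1})."""
    p = p0
    for _ in range(max_iter):
        p_next = g(p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p

g = lambda x: math.log10(x + 4)   # fixed point of g = root of log(x + 4) - x
roots = [picard(g, p0) for p0 in (0.0, 1.0, 2.0)]
print([round(r, 4) for r in roots])   # each ≈ 0.6692
```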
Picard Iteration (Recktenwald, 2000) [REC00]

Uniqueness: The Fixed Point Theorem
If g is continuous on [a, b] and g(x) ∈ [a, b] for all x ∈ [a, b], then g has a fixed point in [a, b].
In addition, if 0 < |g′(x)| < 1 for all x ∈ [a, b], then g has a unique fixed point in [a, b].

Convergence Criteria for Picard Iteration
The iteration process p_n = g(p_{n−1}) for n = 1, 2, 3, ... will converge to a unique solution for any initial value p0 in [a, b] if g′ exists on (a, b) and 0 < |g′(x)| < 1 for all x ∈ [a, b].
Picard Iteration (Recktenwald, 1998)

In general, for multivalued functions, each iteration of the Picard iteration method is composed of two (or more) steps:

Step 1: A subset of variables is kept fixed and the optimization is performed with respect to the remaining variables.
Step 2: The roles of the fixed and moving variables are swapped.

The optimization algorithm stops when the variables change less than a fixed threshold.
A more general framework named "Alternating Cluster Estimation" is presented in T. A. Runkler, J. C. Bezdek, Alternating Cluster Estimation: A New Tool for Clustering and Function Approximation, IEEE Transactions on Fuzzy Systems, vol. 7, no. 4, pp. 377-393, 1999 [RUN99].
Parametric Clustering (Duda, 73)

Let X = {x_h | h = 1, ..., n} be the set of unlabeled instances (training set), and V = {v_i | i = 1, ..., c} be the set of centers of the clusters (or classes) ω_i. Following a parametric learning approach, we make the following assumptions:

1. The instances come from a known number c of classes ω_i, i ∈ {1, ..., c}.
2. The a priori probabilities P(ω_i), i.e. the probabilities of drawing patterns of class ω_i from X, are known.
3. The forms of the class-conditional probability densities p(x | ω_i, Θ_i) (i.e. the probability density of instance x_h inside class ω_i) are known, ∀i.

Θ_i is the unknown vector of parameters of the class-conditional probability densities.
Note that the third assumption reduces the clustering problem to the problem of estimating the vectors Θ_i (parametric learning).
Parametric Clustering
In this setting, we assume that instances are obtained by selecting a class ω_i and then selecting a pattern x according to the probability law p(x | ω_i, Θ_i), i.e.:

p(x | Θ) = ∑_{i=1}^c p(x | ω_i, Θ_i) P(ω_i)   (5)

where Θ = (Θ_1, ..., Θ_c).
A density function of this form is called a mixture density (Duda73); the p(x | ω_i, Θ_i) are called the component densities, and the P(ω_i) are called the mixing parameters.
Parametric Clustering
A well-known parametric statistics method for estimating the parameter vector Θ is maximum likelihood (Duda73). It assumes that the parameter vector Θ is fixed but unknown. The likelihood of the training set X is the conditional density

p(X | Θ) = ∏_{h=1}^n p(x_h | Θ)   (6)

or also:

p(X | Θ) = ∏_{h=1}^n ∑_{i=1}^c p(x_h | ω_i, Θ_i) P(ω_i)   (7)

Its log is:

log p(X | Θ) = ∑_{h=1}^n log ∑_{i=1}^c p(x_h | ω_i, Θ_i) P(ω_i)   (8)

Then the maximum-likelihood estimate Θ̂ is the value of Θ that maximizes the likelihood of the observed training set X (or its log).
Parametric Clustering
If p(X | Θ) is a differentiable function of Θ, we can obtain the following conditions for the maximum-likelihood estimates Θ̂_i:

∑_{h=1}^n P(ω_i | x_h, Θ̂) ∇_{Θ_i} log p(x_h | ω_i, Θ̂_i) = 0   ∀i   (9)

Constraints:

P(ω_i) ≥ 0   (10)

∑_{i=1}^c P(ω_i) = 1   (11)
Parametric Clustering
Let us assume now that the component densities are multivariate normal, i.e.:

p(x_h | ω_i, Θ̂_i) = (1 / ((2π)^{d/2} |Σ_i|^{1/2})) exp[ −(1/2) (x_h − v_i)^t Σ_i^{−1} (x_h − v_i) ]   (12)

where:
d dimensionality of the feature space
v_i mean vector
Σ_i ≡ E[(x_h − v_i)(x_h − v_i)^t] covariance matrix
(x_h − v_i)^t transpose of x_h − v_i; Σ_i^{−1} inverse of Σ_i; |Σ_i| determinant of Σ_i
Parametric Clustering
The local-maximum-likelihood estimates are:

v̂_i = ∑_{h=1}^n P(ω_i | x_h, Θ̂) x_h / ∑_{h=1}^n P(ω_i | x_h, Θ̂)   (13)

Σ̂_i = ∑_{h=1}^n P(ω_i | x_h, Θ̂) (x_h − v̂_i)(x_h − v̂_i)^t / ∑_{h=1}^n P(ω_i | x_h, Θ̂)   (14)

P(ω_i | x_h, Θ̂) = |Σ̂_i|^{−1/2} exp[ −(1/2)(x_h − v̂_i)^t Σ̂_i^{−1} (x_h − v̂_i) ] P(ω_i) / ∑_{j=1}^c |Σ̂_j|^{−1/2} exp[ −(1/2)(x_h − v̂_j)^t Σ̂_j^{−1} (x_h − v̂_j) ] P(ω_j)   (15)
Parametric Clustering
The Eqs. in the previous slide can be interpreted as the basis of a gradient ascent or hill-climbing procedure for maximizing the likelihood (Picard iteration); each cycle is composed of two (or more) steps:

Step 1: A subset of variables is kept fixed and the optimization is performed with respect to the remaining variables.
Step 2: The roles of the fixed and moving variables are swapped.

The optimization algorithm stops when the variables change less than a fixed threshold.
A Picard iteration can start with Eq. 15, using initial estimates to evaluate P(ω_i | x_h, Θ̂), then use Eqs. 13 and 14 to update the other estimates, and repeat this cycle until the variations are less than an assigned threshold.
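The cycle above can be sketched for the 1-D case with diagonal (scalar) variances; a minimal illustration of the alternating iteration of Eqs. 13-15, not the general multivariate procedure (function name and initialization are assumptions):

```python
import numpy as np

def em_gmm_1d(x, c, n_iter=50):
    """Alternating (Picard) iteration for a 1-D Gaussian mixture:
    the E-step evaluates the posteriors P(omega_i | x_h) (Eq. 15),
    the M-step updates means, variances and priors (Eqs. 13, 14)."""
    mu = np.quantile(x, (np.arange(c) + 0.5) / c)  # spread initial means over the data
    var = np.full(c, np.var(x))
    pri = np.full(c, 1.0 / c)
    for _ in range(n_iter):
        # E-step: posterior membership of each point in each class
        dens = np.exp(-0.5 * (x[None, :] - mu[:, None]) ** 2 / var[:, None])
        dens /= np.sqrt(2 * np.pi * var[:, None])
        post = pri[:, None] * dens
        post /= post.sum(axis=0, keepdims=True)
        # M-step: posterior-weighted means, variances and priors
        w = post.sum(axis=1)
        mu = (post @ x) / w
        var = (post * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / w
        pri = w / len(x)
    return mu, var, pri

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(8.0, 1.0, 300)])
mu, var, pri = em_gmm_1d(x, 2)
print(np.sort(mu))   # means recovered near 0 and 8
```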
Parametric Clustering
The inversion of Σ_i is quite time consuming, and moreover it may be ill-conditioned.
Like all hill-climbing procedures, the results depend on the starting point, and therefore there is the possibility of multiple solutions.
The K-Means Algorithm
We can notice that in Eq. 15 the probability P(ω_i | x_h, Θ̂) is large when the squared Mahalanobis distance

M²(x_h, v_i) ≡ (x_h − v_i)^t Σ_i^{−1} (x_h − v_i)   (16)

is small.
The K-Means Algorithm
This observation is the rationale of the K-Means algorithm, also known as Hard C-Means (HCM), C-Means, or Basic Isodata (Duda73) [Isodata stands for Iterative Self-Organizing Data Analysis Techniques], which is based on the following approximation:

P(ω_i | x_h, Θ̂) = { 1 if E_i(x_h) = min_{1≤j≤c} E_j(x_h); 0 otherwise }   (17)

where E_j(x_h) is the local cost function or distortion and is usually assumed to be the squared Euclidean distance

E_j(x_h) ≡ ‖x_h − v_j‖²   (18)

Starting from the finite data set X, this algorithm iteratively moves the k codevectors to the arithmetic means of their Voronoi sets {π_i}_{i=1,...,k}.
K-Means (Lloyd, 1957) - Equivalent foundation of the K-Means algorithm: Lloyd's algorithm, a.k.a. Voronoi iteration or relaxation
Theorem
A necessary condition for a codebook V to minimize the Empirical Quantization Error (Gersho, 1992), or Expectation of Distortion, or K-Means functional, or K-Means objective function (denoted as E(X) or as <E>):

E(X) = ∑_{i=1}^{c} ∑_{x∈πi} ||x − vi||² = ∑_{ih} uih ||xh − vi||²,  with uih = { 1 if xh ∈ πi; 0 otherwise }   (19)

is that each codevector vi fulfills the centroid condition.

In the case of a finite data set X and with the Euclidean distance, the centroid condition reduces to vi = (1/|πi|) ∑_{x∈πi} x.
K-Means (Lloyd, 1957) - Algorithm
K-Means is made up of the following steps:
1 choose the number k of clusters;
2 initialize the codebook V with vectors randomly picked from X, or with random vectors in the minimum hyperbox containing the full data set, i.e., as a combination of data points:

vj = ∑_{h=1}^{n} γjh xh,   (20)

with coefficients γjh ∈ [0,1];
3 compute the Voronoi set πi associated to each codevector vi;
4 move each codevector to the mean of its Voronoi set;
5 return to step 3 if any codevector has changed, otherwise return the codebook.
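The steps above can be sketched with NumPy as follows; the function name and the two-blob example data are illustrative assumptions, not from the slides:

```python
import numpy as np

def k_means(X, k, max_iter=100, rng=None):
    """Batch K-Means (Lloyd): alternate Voronoi assignment and centroid update."""
    rng = np.random.default_rng(rng)
    # Step 2: initialize the codebook with k distinct vectors picked from the data set.
    V = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Step 3: Voronoi sets -- each point is assigned to its nearest codevector.
        labels = np.argmin(((X[:, None, :] - V[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 4: move each codevector to the mean of its Voronoi set.
        new_V = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else V[i]
                          for i in range(k)])
        # Step 5: stop when no codevector has changed.
        if np.allclose(new_V, V):
            break
        V = new_V
    return V, labels

# Two well-separated blobs: the codevectors converge to the blob means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
V, labels = k_means(X, k=2, rng=0)
```

On this data the two returned codevectors end up close to (0, 0) and (5, 5), the means of the two blobs.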
K-Means (Lloyd, 1957)

[Figure slides illustrating the algorithm.]
K-Means (Lloyd, 1957) - Pros
At each iteration of the algorithm a codebook is found and a Voronoi tessellation of the input space is provided.
It is guaranteed that after each iteration the quantization error does not increase.
At the end of the algorithm a local minimum of the quantization error is obtained.
K-Means can be viewed as an Expectation-Maximization algorithm, ensuring convergence after a finite number of steps (Bishop, 1996).
Different distances lead to different invariance properties, as in the case of the Mahalanobis distance, which produces invariance on ellipsoids (Duda & Hart, 1973).
K-Means (Lloyd, 1957) - Cons
Local minima of E(X) make the method dependent on initialization, and the average is sensitive to outliers (Duda & Hart, 1973).
K-Means (Lloyd, 1957) - Cons
In order to overcome this problem:
Heuristics, e.g., split-and-merge algorithms (Isodata algorithm).
Local search techniques based on a regularization framework (adding constraints on the solution, i.e., minimization of a modified risk functional), e.g., Isodata and fuzzy clustering paradigms: Fuzzy C-Means (FCM) (Bezdek, 1981), Deterministic Annealing (DA) (Rose, 1990), Possibilistic Clustering (Krishnapuram & Keller, 1993, 1996), and Graded Possibilistic Clustering (Masulli & Rovetta, 2003).
Global search techniques, e.g., minimization of E(X) using Simulated Annealing (Bogus et al., 1999) or Evolutionary Computing (Fogel, 1993; Bezdek et al., 1994; Tseng & Yang, 1997; Egan, 1998; Kuncheva et al., 1998; Hall et al., 1999; Masulli et al., 1999).
K-Means (Lloyd, 1957) - Cons
The number of clusters to find must be provided, and this can be done only using some a priori information or an additional validity criterion.
K-Means can deal only with clusters with spherically symmetric point distributions, since the Euclidean distances of patterns from centroids are computed, leading to spherical invariance.
The approximation in Eq. 17, making uih ∈ {0,1}, is often too strong, while, by contrast, in real cases some objects show non-zero similarity degrees to different classes.
Isodata Algorithm - Iterative Self-Organizing Data Analysis Techniques
A heuristic algorithm which allows the number of clusters to be automatically adjusted during the iteration by merging similar clusters and splitting clusters with large standard deviations.
The Isodata algorithm is more flexible than the K-Means method, but the user has to choose many more parameters empirically.
In the next slides we give an example of the Isodata algorithm from http://fourier.eng.hmc.edu/e161/lectures/classification/node13.html
Isodata Algorithm - Iterative Self-Organizing Data Analysis Techniques
Given a set of samples {xi, i = 1, 2, ..., N} (where each x(i) = [x1(i), ..., xn(i)]ᵀ is a column vector representing a point in the n-dimensional feature space):
K = number of clusters desired;
I = maximum number of iterations allowed;
P = maximum number of pairs of clusters which can be merged;
ΘN = threshold value for the minimum number of samples (cardinality) each cluster can have (used for discarding clusters);
ΘS = threshold value for standard deviation (used for the split operation);
ΘC = threshold value for pairwise distances (used for the merge operation).
Isodata Algorithm - Algorithm
Step 1. Arbitrarily choose k (not necessarily equal to K) initial cluster centers from the data set {xi, i = 1, 2, ..., N}.
Step 2. Assign each of the N data-points to the closest cluster center: x ∈ ωj if DL(x, vj) = min {DL(x, vi), i = 1, ..., k}.
Step 3. Discard clusters with fewer than ΘN members, i.e., if for any j, Nj < ΘN, then discard ωj and let k ← k − 1.
Step 4. Update each cluster center: vj = (1/Nj) ∑_{x∈ωj} x, (j = 1, ..., k).
Step 5. Compute the average distance Dj of the data-points in cluster ωj from their cluster center:

Dj = (1/Nj) ∑_{x∈ωj} DL(x, vj), (j = 1, ..., k)
Isodata Algorithm - Algorithm
Step 6. Compute the overall average distance D of the data-points from their respective cluster centers:

D = (1/N) ∑_{j=1}^{k} Nj Dj

Step 7. If k ≤ K/2 (too few clusters), go to Step 8; else if k ≥ 2K (too many clusters), go to Step 11; else go to Step 14. (Steps 8 through 10 are for the split operation, Steps 11 through 13 are for the merge operation.)
Isodata Algorithm - Algorithm
Step 8. First step of the split operation. Find the standard deviation vector σj = [σ1(j), ..., σn(j)]ᵀ for each cluster:

σi(j) = sqrt( (1/Nj) ∑_{x∈ωj} (xi − vi(j))² )

where vi(j) is the i-th component of vj, σi(j) is the standard deviation of the data-points in ωj along the i-th coordinate axis, and Nj is the number of data-points in ωj.

Step 9. Find the maximum component of each σj and denote it by σmax(j); do this for each j = 1, ..., k.
Isodata Algorithm - Algorithm
Step 10. If for any σmax(j), (j = 1, ..., k), all of the following are true:

σmax(j) > ΘS;   Dj > D;   Nj > 2ΘN

then split vj into two new cluster centers vj+ and vj− by adding ±δ to the component of vj corresponding to σmax(j), where δ can be α σmax(j) for some α > 0. Then delete vj and let k ← k + 1. Go to Step 2. Else go to Step 14.

Step 11. First step of the merge operation. Compute the pairwise distances Dij between every two cluster centers, Dij = DL(vi, vj), and arrange these k(k − 1)/2 distances in ascending order.

Step 12. Find no more than P smallest Dij's which are also smaller than ΘC and keep them in ascending order: Di1j1 ≤ Di2j2 ≤ · · · ≤ DiPjP
Isodata Algorithm - Algorithm
Step 13. Perform pairwise merges: for l = 1, ..., P, do the following. If neither vil nor vjl has been used in this iteration, merge them to form a new center:

v = (Nil vil + Njl vjl) / (Nil + Njl)

Delete vil and vjl and let k ← k − 1. Go to Step 2.

Step 14. Terminate if the maximum number of iterations I is reached; otherwise go to Step 2.
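The procedure above can be condensed into a simplified NumPy sketch; it uses the Euclidean distance, performs at most one merge per iteration, and omits several refinements of the full Isodata procedure. The function name and all parameter defaults are illustrative assumptions:

```python
import numpy as np

def isodata(X, K, I=20, theta_N=2, theta_S=1.0, theta_C=1.0, alpha=0.5, rng=None):
    """Simplified Isodata sketch: assign, discard, update, then split or merge."""
    rng = np.random.default_rng(rng)
    V = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(I):
        # Step 2: assign each point to its closest cluster center.
        labels = np.argmin(np.linalg.norm(X[:, None] - V[None], axis=2), axis=1)
        # Step 3: discard clusters with fewer than theta_N members.
        keep = [j for j in range(len(V)) if (labels == j).sum() >= theta_N]
        V = V[keep]
        labels = np.argmin(np.linalg.norm(X[:, None] - V[None], axis=2), axis=1)
        # Step 4: update each cluster center to the mean of its members.
        V = np.array([X[labels == j].mean(axis=0) for j in range(len(V))])
        if len(V) <= K / 2:
            # Steps 8-10: split clusters with a large per-axis standard deviation.
            new_V = []
            for j in range(len(V)):
                pts = X[labels == j]
                sigma = pts.std(axis=0)
                i_max = int(np.argmax(sigma))
                if sigma[i_max] > theta_S and len(pts) > 2 * theta_N:
                    delta = np.zeros(X.shape[1])
                    delta[i_max] = alpha * sigma[i_max]
                    new_V += [V[j] + delta, V[j] - delta]
                else:
                    new_V.append(V[j])
            V = np.array(new_V)
        elif len(V) >= 2 * K and len(V) > 1:
            # Steps 11-13: merge the closest pair if its distance is below theta_C.
            D = np.linalg.norm(V[:, None] - V[None], axis=2)
            D[np.diag_indices(len(V))] = np.inf
            i, j = map(int, np.unravel_index(np.argmin(D), D.shape))
            if D[i, j] < theta_C:
                ni, nj = (labels == i).sum(), (labels == j).sum()
                merged = (ni * V[i] + nj * V[j]) / (ni + nj)
                V = np.vstack([np.delete(V, [i, j], axis=0), [merged]])
    return V
```

The sketch runs for a fixed number of iterations rather than testing convergence, which is one of the simplifications with respect to the full algorithm.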
K-Medoids Algorithm
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm.
In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars).
A necessary condition for the set of medoids M to minimize the functional

F(X) = ∑_{j=1}^{k} ∑_{x∈πj} |x − mj|,   (21)

is that each mj is selected from the x ∈ πj so as to minimize F(X).
It is more robust to noise and outliers than k-means, because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.
K-Medoids Algorithm - Algorithm
The most common realization of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm (Theodoridis & Koutroumbas, 2006), which is as follows:
(1) Initialize: randomly select k of the n data points as the medoids.
(2) Associate each data point to the closest medoid ("closest" here is defined using any valid distance metric, most commonly the Euclidean, Manhattan, or Minkowski distance).
(3) For each medoid m:
(3.1) For each non-medoid data point o:
(3.1.1) Swap m and o and compute the total cost of the configuration.
(4) Select the configuration with the lowest cost.
(5) Repeat steps 2 to 4 until there is no change in the medoids.
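A compact sketch of the PAM steps above, using Euclidean dissimilarities; the function name and test data are illustrative assumptions:

```python
import numpy as np

def pam(X, k, rng=None):
    """Partitioning Around Medoids (PAM) sketch with Euclidean dissimilarity."""
    rng = np.random.default_rng(rng)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)      # pairwise dissimilarities
    medoids = list(rng.choice(n, size=k, replace=False))  # (1) random initialization

    def cost(meds):
        # Total dissimilarity of every point to its closest medoid.
        return D[:, meds].min(axis=1).sum()

    while True:
        best_cost, best_meds = cost(medoids), medoids
        # (3) try swapping each medoid m with each non-medoid point o
        for mi, m in enumerate(medoids):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids[:mi] + [o] + medoids[mi + 1:]
                c = cost(trial)
                if c < best_cost:
                    best_cost, best_meds = c, trial
        if best_meds == medoids:  # (5) stop when the medoids no longer change
            break
        medoids = best_meds       # (4) keep the lowest-cost configuration
    labels = D[:, medoids].argmin(axis=1)
    return np.array(medoids), labels
```

Since the cost strictly decreases at each accepted swap and the number of configurations is finite, the loop always terminates.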
Image Segmentation / Binarization
We associate to each pixel a square subimage centered on it, plus a possible set of features extracted by some image processing operator (augmented feature vector):
xij = (x(i−k)(j−k), · · · , x(i+k)(j+k), f1, · · · , fr)
We use a small number of codevectors and assign a false color to each cluster for image segmentation.
If we segment using only two codevectors, we obtain a binarized image.
[Figures: two examples of an original image and its binarized version.]
Image Segmentation
Robustness to noise:
A robustness measure for an estimator is the breakdown point, defined as the fraction of outliers able to corrupt the estimation.
[Figures: original MRI image; image with +7% Gaussian noise; segmented image.]
Image Segmentation

[Figure slide.]
Batch K-Means (Lloyd, 1957)
The term batch means that at each step the algorithm takes into account the whole data set to update the codevectors. When the cardinality n of the data set X is very high (e.g., several hundreds of thousands), the batch procedure is computationally expensive.
An on-line update has been introduced, leading to the on-line K-Means algorithm (Linde, 1980; MacQueen, 1967). At each step, this method simply picks a random input pattern and updates its nearest codevector, ensuring that the scheduling of the updating coefficient is adequate to allow convergence and consistency.
On-Line K-Means / Vector Quantization
1 Initialize small codevectors at the center of the hyperbox of the data
2 Winner-Takes-All (WTA)
3 Adapt the winner codevector:

∆vj = ε(t)(x − vj)   (22)

4 Stochastic approximation (Robbins-Monro, 1951):

ε = ε(t)   (23)

PROS:
minimizes the K-Means functional → approximates K-Means solutions;
adapts to a changing environment.
CONS:
degenerate codevectors at the end of the learning that represent no examples (wasting of resources).
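A minimal sketch of the on-line update, assuming a simple 1/t learning-rate schedule (the particular schedule, function name, and defaults are assumptions, not prescribed by the slides):

```python
import numpy as np

def online_k_means(X, k, t_max=5000, eps0=0.5, rng=None):
    """On-line K-Means: WTA update of the nearest codevector, one pattern at a time."""
    rng = np.random.default_rng(rng)
    # Initialize small codevectors at the center of the hyperbox of the data.
    center = (X.min(axis=0) + X.max(axis=0)) / 2
    V = center + 0.01 * rng.standard_normal((k, X.shape[1]))
    for t in range(t_max):
        x = X[rng.integers(len(X))]           # randomly pick an input pattern
        j = np.argmin(((V - x) ** 2).sum(1))  # Winner-Takes-All
        eps = eps0 / (1 + t)                  # decreasing Robbins-Monro-style schedule
        V[j] += eps * (x - V[j])              # Eq. 22: move the winner toward x
    return V
```

The schedule satisfies the usual stochastic-approximation conditions (∑ε(t) = ∞, ∑ε(t)² < ∞).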
Online K-Means / Vector Quantization

[Figure slides illustrating the algorithm.]
On-Line K-Means / Vector Quantization
Solutions:
Conscience mechanism (DeSieno, 1988) [DES88]. Determine the winner as:

s(x) = arg min_{vj∈V} (fj ||x − vj||)   (24)

where fj is the frequency of past winnings of unit j.
Self Organizing Maps (Kohonen, 1981);
Fuzzy Learning Vector Quantization (Bezdek, 1995).
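The conscience mechanism of Eq. 24 can be sketched as follows, taking the multiplicative form literally; initializing the win counts to one (an assumption, to avoid zero frequencies) keeps the formula well defined:

```python
import numpy as np

def conscience_winner(x, V, wins):
    """Conscience-style WTA: bias the competition by the past winning frequency."""
    dists = np.linalg.norm(V - x, axis=1)
    freq = wins / max(wins.sum(), 1)   # f_j: fraction of past wins of unit j
    j = int(np.argmin(freq * dists))   # Eq. 24: s(x) = argmin_j f_j * ||x - v_j||
    wins[j] += 1                       # record the win
    return j
```

A unit that keeps winning accumulates a large fj, until eventually a far, rarely winning unit is selected instead; this is what prevents degenerate codevectors that represent no examples.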
Self Organizing Maps (Kohonen, 1981) [KoH90]
Kohonen, T., Automatic formation of topological maps of patterns in a self-organizing system. In Oja, E. and Simula, O., editors, Proceedings of 2SCIA, Scand. Conference on Image Analysis, pages 214-220, Helsinki, Finland. Suomen Hahmontunnistustutkimuksen Seura r.y., 1981.
A Self Organizing Map (SOM), also known as Self Organizing Feature Map (SOFM), represents data by means of codevectors organized on a grid with fixed topology.
Codevectors move to adapt to the input distribution, but the adaptation is also propagated along the grid to neighboring codevectors, according to a given propagation or neighborhood function. This effectively constrains the evolution of the codevectors.
Self Organizing Maps (Kohonen, 1981)
Grid topologies may differ. We consider a two-dimensional, square-mesh topology.
The distance on the grid is used to determine how strongly a codevector is adapted when the unit aij is the winner.
The metric used on a rectangular grid is the Manhattan distance, for which the distance between two elements r = (r1, r2) and s = (s1, s2) is:

drs = |r1 − s1| + |r2 − s2|.   (25)
Self Organizing Maps (Kohonen, 1981) - Algorithm
1 Initialize the codebook V with small codevectors at the center of the hyperbox of the data
2 Initialize the set C of connections to form the rectangular grid of dimension n1 × n2
3 Initialize t = 0
4 Randomly pick an input x from X
5 Determine the winner:

s(x) = arg min_{vj∈V} ||x − vj||   (26)

6 Adapt each codevector:

∆vj = ε(t) h(drs)(x − vj)   (27)

where h is a decreasing function of d, e.g.: h(drs) = exp(−drs² / 2σ²(t))
7 Increment t
8 If t < tmax go to step 4
9 End.
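The loop above can be sketched with NumPy, using exponentially decreasing schedules for ε(t) and σ(t); all numeric defaults and the function name are illustrative assumptions:

```python
import numpy as np

def som(X, n1, n2, t_max=3000, rng=None):
    """SOM sketch: square grid, Manhattan grid distance, Gaussian neighborhood."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    center = (X.min(axis=0) + X.max(axis=0)) / 2
    V = center + 0.01 * rng.standard_normal((n1 * n2, d))            # step 1
    grid = np.array([(i, j) for i in range(n1) for j in range(n2)])  # step 2
    eps_i, eps_f, sig_i, sig_f = 0.5, 0.01, max(n1, n2) / 2, 0.5
    for t in range(t_max):                                           # steps 3-8
        x = X[rng.integers(len(X))]                                  # step 4
        s = np.argmin(((V - x) ** 2).sum(1))                         # step 5: winner
        # Manhattan distance on the grid between the winner and every unit (Eq. 25)
        d_grid = np.abs(grid - grid[s]).sum(1)
        eps = eps_i * (eps_f / eps_i) ** (t / t_max)                 # decreasing schedules
        sig = sig_i * (sig_f / sig_i) ** (t / t_max)
        h = np.exp(-d_grid ** 2 / (2 * sig ** 2))                    # neighborhood function
        V += eps * h[:, None] * (x - V)                              # step 6: adapt all units
    return V.reshape(n1, n2, d)
```

Note that every unit is adapted at each step, with a strength that decays with its Manhattan distance from the winner on the grid.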
Self Organizing Maps (Kohonen, 1981)
σ(t) and ε(t) are decreasing functions of t, e.g. (Ritter, 1991):

σ(t) = σi (σf / σi)^{t/tmax},   ε(t) = εi (εf / εi)^{t/tmax}

where σi, εi are the initial values and σf, εf are the final values of σ(t) and ε(t), respectively.
Self Organizing Maps (Kohonen, 1981)

[Figure slides: SOM training; calibration; test with uniform distribution (noise).]
Self Organizing Maps (Kohonen, 1981) - Using SOM for clustering
The method was originally devised as a tool for embedding multidimensional data into typically two-dimensional spaces, for data visualization.
Since then, it has also been frequently used as a clustering method, which was originally not considered appropriate because of the constraints imposed by the topology.
Neural Gas (Martinetz, 1993) [MAR93]
This technique resembles the SOM in the sense that not only the winner codevector is adapted.
It differs in that the codevectors are not constrained to lie on a grid, and the adaptation of the codevectors near the winner is controlled by a criterion based on distance ranks.
Each time a pattern x is presented, all the codevectors vj are ranked according to their distance to x (the closest obtains the lowest rank).
Neural Gas (Martinetz, 1993) - Algorithm
ρj: rank of the distance between x and the codevector vj.
Update rule:

∆vj = ε(t) hλ(ρj)(x − vj)   (28)

with:
ε(t) ∈ [0,1], gradually lowered as t increases;
hλ(ρj), a function decreasing with ρj with a characteristic decay λ; usually hλ(ρj) = exp(−ρj/λ).
Neural Gas (Martinetz, 1993) - Algorithm
1 Initialize the codebook V by randomly picking from X
2 Initialize the time parameter t = 0
3 Randomly pick an input x from X
4 Order all elements vj of V according to their distance to x, obtaining the ρj
5 Adapt the codevectors according to Eq. 28
6 Increase the time parameter t = t + 1
7 If t < tmax go to step 3.
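A sketch of Neural Gas following these steps, with exponentially decaying ε(t) and λ(t) (the particular schedules, defaults, and function name are assumptions):

```python
import numpy as np

def neural_gas(X, k, t_max=3000, eps_i=0.5, eps_f=0.01, lam_i=None, lam_f=0.1, rng=None):
    """Neural Gas sketch: rank-based adaptation h_lambda(rho) = exp(-rho / lambda)."""
    rng = np.random.default_rng(rng)
    lam_i = lam_i or k / 2
    V = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # step 1
    for t in range(t_max):                                          # steps 2-7
        x = X[rng.integers(len(X))]                                 # step 3
        dists = ((V - x) ** 2).sum(1)
        rho = np.argsort(np.argsort(dists))  # step 4: rank of each codevector (closest = 0)
        eps = eps_i * (eps_f / eps_i) ** (t / t_max)
        lam = lam_i * (lam_f / lam_i) ** (t / t_max)
        V += eps * np.exp(-rho / lam)[:, None] * (x - V)  # step 5: Eq. 28
    return V
```

Unlike the SOM, the adaptation strength depends only on the distance rank ρj, not on any grid topology.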
The Capture Effect Neural Network (Firenze et al., 1994) [FIR94]
The Capture Effect Neural Network (CENN) (Firenze et al., 1994) is a self-organizing neural network able to take into account the local characteristics of the point distribution (adaptive resolution clustering).
CENN combines standard competitive self-organization of the weight vectors (Kohonen, 1995) with a non-linear mechanism of adaptive local modulation of the receptive fields (RF) of the neurons (capture effect).
The Capture Effect Neural Network (Firenze et al., 1994) - Training step
The learning phase of CENN is composed of the training step, performing a vector quantization of the data, and the labeling step, where the prototypes obtained by the previous step are grouped in order to obtain robust clusters.
In the training step, an initial abundant quantity of neurons ni = {wi, ri} is assumed and initialized with randomly chosen weight vectors wi (representing centers of sub-clusters) and large radii ri (ri = R0) of the receptive fields RFi (modeled by Gaussian functions γ).
The radius of a Gaussian RF is defined as the radius of an α-cut of the RF itself.
The Capture Effect Neural Network (Firenze et al., 1994) - Training step
Then the data set is presented to CENN and the following learning formulas are applied:

∆wi = ηw (xk − wi) γ(di(xk)) / ∑_l γ(dl(xk)),   (29)

∆ri = { ηr (di(xk) − ri) exp(−di(xk)/p) if di(xk) < R0; 0 if di(xk) ≥ R0 },   (30)

where ηw and ηr are learning rates, di(xk) = ||xk − wi|| is the Euclidean distance of the points from the weight vectors, and the parameter p is defined as:

p ≡ <di(xk)> / (D ln 10),   (31)

assuming D as the dimension of the feature space.
The Capture Effect Neural Network (Firenze et al., 1994) - Labeling step
The labeling step:
discards any neuron nq with rq = R0, i.e., neurons not representing elements of the training set;
then assigns the same label to couples of neurons np and nq (i.e., merges their associated clusters) if

||wp − wq|| < (rp + rq) σ,  σ ∈ (0, 1),   (32)

i.e., if they have (partially) overlapping RFs. The parameter σ is named the degree of overlapping.
This process obtains c groups of neurons Gj, j ∈ [1, c]. We can then define, for the cluster related to the j-th group, its center and radius as:

yj ≡ <w•>_{Gj},   rj ≡ <r•>_{Gj}.   (33)
The Capture Effect Neural Network (Firenze et al., 1994) - Labeling step
A remaining isolated neuron nq is called associable to a group Gj if

||wq − w•|| < (rq + r•)   (34)

for at least one neuron n• of group Gj.
For neurons associable to one or more groups, the following completion rule of the labeling step is applied: an isolated neuron nq, associable to different groups, is assigned to the i-th group if and only if

i = arg ∨_j (rGj − rq)   ∀j.   (35)
The Capture Effect Neural Network (Firenze et al., 1994) - Operational phase
In the operational phase, an unknown vector x is classified by exploiting the winner-take-all (WTA) rule in the following way:

x ∈ (j-th cluster) ⟺ h = arg ∨_i (||x − wi|| / ri),  wh ∈ Gj.   (36)
The Capture Effect Neural Network (Firenze et al., 1994) - Operational phase
It is worth noting that, after the learning phase:
the distribution of the prototypes in the feature space approaches the optimal vector quantization scheme of the distribution of the input data, i.e., it approximates the mixture probability density function;
the radial size of the RF of each neuron reaches a stable value which is strongly related to the spatial density of the input data locally around the weight vector (i.e., the center of the RF).