Introduction | Partitioning Methods | Parametric/Statistical clustering | Hard Clustering
Introduction to Data Clustering 1
Francesco Masulli
DIBRIS - Dip. Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi, University of Genova, ITALY
& S.H.R.O. - Sbarro Institute for Cancer Research and Molecular Medicine
Temple University, Philadelphia, PA, USA. Email: [email protected]
ML-CI 2016
Francesco Masulli Introduction to Data Clustering 1
Outline
1 Introduction
2 Partitioning Methods
3 Parametric/Statistical clustering
4 Hard Clustering
Machine Learning
In 1959, Arthur Samuel defined Machine Learning as a "Field of study that gives computers the ability to learn without being explicitly programmed".

Machine learning is about the construction and study of systems (learners) that can learn from data.

For example, a learner could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders (generalization).
Machine Learning
In machine learning and pattern recognition, a pattern is a data point (or instance or condition), represented by a vector of characteristics or attributes or features.

From a statistical viewpoint, a pattern is a random vector or a multivariate random variable.
Clustering Problem
Let us consider a set of labels and a set of unlabeled patterns. The classification problem concerns the assignment of a label to each pattern in such a way that similar patterns will share the same label.

There are two principal approaches to solve the classification problem: in the first approach a set of labeled instances (training set) is supposed to be available for the design of the classifier, while in the second approach the available instances are unlabeled.

In the first case we deal with the supervised classification problem, while in the second case we deal with unsupervised classification or clustering problems.
The Concept of Clustering
Greek philosopher Plato (∼ 400 BC): grouping objects based on their similar properties (categorization).
Statesman dialogue: http://www.gutenberg.org/files/1738/1738-h/1738-h.htm

Approach further explored and systematized by Aristotle (∼ 350 BC): differences between classes and objects.
Categories treatise: http://classics.mit.edu/Aristotle/categories.html
The Concept of Clustering
Principles of grouping (Gestalt psychologists: M. Wertheimer, K. Koffka, W. Köhler, ∼1930):

> Law of Proximity: perception tends to group stimuli that are close together as part of the same object, and stimuli that are far apart as two separate objects.

> Law of Similarity: perception lends itself to seeing stimuli that physically resemble each other as part of the same object, and stimuli that are different as part of a different object.

> Law of Good Continuation: people tend to perceive each object as a single uninterrupted object.

> Laws of Closure, of Good Form, of Common Fate, etc.
The Concept of Clustering - Informal Definition of Clustering

To find a structure in given data that will be aggregated into some categories (or clusters).
Data belonging to a cluster are more similar to data of that cluster than to data of other clusters.
The aim of clustering methods is to group patterns on the basis of a similarity (or dissimilarity) criterion, where groups (or clusters) are sets of similar patterns.
What is a Cluster?
The notion of what constitutes a cluster is not well defined.

[Figure from (Steinbach et al., 2002)]

Clustering is not a well-posed problem, in the sense of Hadamard (1923).
Regularization - Ill-Posed Problems [Hadamard, 1923] [KEC01, HAYK09]

A problem is well posed [Hadamard, 1923] when a solution:
exists
is unique
depends continuously on the initial data (i.e., robustness against noise)

Many practical problems (especially inverse problems) turn out to be ill-posed.
E.g., differentiation is an ill-posed problem because its solution does not depend continuously on the data.
--> robot vision: inverse problem, ill-posed
Regularization - Ill-Posed Problems (Hadamard, 1923)

To solve an ill-posed problem we try to regularize it by introducing generic constraints that will restrict the space of solutions in an appropriate way.
The character of the constraints depends on a priori knowledge of the solution.
The constraints enable the calculation of admissible solutions, called regularized solutions, out of the other (perhaps an infinite number of) possible solutions.
Principle: Occam's razor (William of Occam, 1285-1349)

"we should prefer simpler models to more complex models"
"this preference should be traded off against the extent to which the model fits the data" (Bishop, 1996)
Clustering Algorithms - Clustering task

Unsupervised data analysis using clustering algorithms provides a useful tool to explore data structures.
Clustering methods have been addressed in many contexts and disciplines, such as data mining, document retrieval, image segmentation and pattern classification (Jain, 2009; Xu, 2005).
Clustering Problem
We deal with unsupervised classification when:
labeling is very expensive or infeasible
the available labeling is ambiguous
we need to improve our understanding of the nature of patterns
we want to reduce the amount of data (information) to transmit
Operational Definition of Clustering

Features to use (usually they are given, sometimes we select them), their attributes (binary, discrete, continuous) and scales.
REMARK: A good choice of features can lead to a good quality of clustering performance.
Clustering paradigm:
Hierarchical
Partitive
Vicinity
(Dis-)similarity measures (indexes): (generalized-)Euclidean distance, correlation, Hamming, Jaccard
An optimization procedure (when required)
Clustering Paradigms
Hierarchical clustering is able to find structures which can be further divided into substructures, and so on recursively. The result is a hierarchical structure of groups known as a dendrogram. (Jain, 1999; Sneath, 1973; Ward, 1963)
Clustering Paradigms
Partitive/central clustering tries to obtain a single partition of the data, often based on the optimization of an appropriate objective function.
In a good cluster, the distances between the points and the cluster centroid are small (cluster compactness).
Clustering Paradigms

Vicinity (connectivity) clustering: a good cluster is one in which each point shares the same cluster label as its nearest neighbor ⇒ it can represent any cluster shape, that is, an arbitrary manifold in the data space (Shared Nearest Neighbor Clustering (Jarvis et al., 1973; Ertoz et al., 2013), Spectral clustering (Filippone et al., 2008)).
Clustering Algorithms - Representation and similarity measures

Crucial aspects in clustering are pattern representation and the similarity measure:

Each pattern is usually represented by a set of features of the system under study. It is very important to notice that a good choice of representation of patterns can lead to improvements in clustering performance. Whether it is possible to choose an appropriate set of features depends on the system under study.

Once a representation is fixed it is possible to choose an appropriate similarity measure among patterns. The most popular dissimilarity measure for metric representations is the distance or distortion, for instance the Euclidean distance (Duda & Hart, 1973).
Clustering - Similarity measures

Direction cosine for continuous-valued patterns:

cos θ = <x, y> / (‖x‖ ‖y‖)

If vectors x and y are unitary → cos θ = <x, y> ≡ C:
if C = 1 (full agreement) → y = a x
if C = 0 → x ⊥ y
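These relations can be checked with a short sketch in Python (the function name is illustrative):

```python
import math

def direction_cosine(x, y):
    """Cosine of the angle between x and y: <x, y> / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# y = 2x (full agreement) -> C = 1
print(direction_cosine([1.0, 2.0], [2.0, 4.0]))   # 1.0
# orthogonal vectors -> C = 0
print(direction_cosine([1.0, 0.0], [0.0, 3.0]))   # 0.0
```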
ClusteringSimilarity measures
Euclidean distance for continuous-valued patterns:

E(x, y) ≡ ‖x − y‖ = √( ∑_{i=1}^n (x_i − y_i)² )

‖x − y‖² = ‖x‖² + ‖y‖² − 2 x · y
Clustering - Similarity measures

Minkowski distance for continuous-valued patterns:

M(x, y) = ( ∑_{i=1}^n |x_i − y_i|^λ )^{1/λ}

λ = 1 → city-block distance or Manhattan distance or l1 distance
λ = 2 → Euclidean distance or l2 distance
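A minimal sketch of the Minkowski family (the function name is illustrative; the absolute value makes the formula valid for odd λ as well):

```python
def minkowski(x, y, lam):
    """Minkowski distance: (sum_i |x_i - y_i|^lam)^(1/lam)."""
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1.0 / lam)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))  # city-block (l1): 7.0
print(minkowski(x, y, 2))  # Euclidean (l2): 5.0
```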
Clustering - Similarity measures

Generalized Hamming distance, for ordered sets with discrete-valued elements (binary, characters, etc.): the number of differing elements, e.g.

x = ( p, a, t, t, e, r, n )
y = ( w, e, s, t, e, r, n )
      ↑  ↑  ↑
H(x, y) = 3
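The example above can be reproduced with a one-line sketch:

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

print(hamming("pattern", "western"))  # 3
```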
Clustering - Similarity measures

Jaccard index, for sets (a similarity in [0, 1]; the corresponding Jaccard distance is 1 minus this value):

J(A, B) = |A ∩ B| / |A ∪ B|

Distance of categorical variables:

T(x, y) = δ(x, y) = { 0 if x = y; 1 otherwise }

The notation δ(·, ·) here denotes a mismatch indicator (the complement of the Kronecker delta).
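Both measures are straightforward to sketch (function names are illustrative):

```python
def jaccard_index(A, B):
    """|A ∩ B| / |A ∪ B| for finite sets; a similarity in [0, 1]."""
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)

def categorical_mismatch(x, y):
    """0 if x == y, 1 otherwise (simple distance for categorical values)."""
    return 0 if x == y else 1

print(jaccard_index({1, 2, 3}, {2, 3, 4}))  # 2 shared out of 4 total -> 0.5
print(categorical_mismatch("red", "red"))   # 0
print(categorical_mismatch("red", "blue"))  # 1
```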
Clustering - Similarity measures

Gaussian kernel similarity function for continuous-valued patterns:

W(x, y) = exp( −M(x, y)² / (2σ²) )

where
M(x, y) is a distance
σ is the spread of the kernel
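A sketch with the Euclidean distance as M(x, y) (the standard RBF form; the function name is illustrative):

```python
import math

def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) similarity: exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

print(gaussian_kernel([0.0, 0.0], [0.0, 0.0], 1.0))  # identical points -> 1.0
print(gaussian_kernel([0.0], [2.0], 1.0))            # exp(-2) ≈ 0.1353
```

The similarity decays smoothly from 1 (identical patterns) toward 0 as the distance grows; σ controls how fast.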
REMARK: Expectation
In probability theory, the expectation (or expected value, mathematical expectation, EV, mean, or first moment) of a random variable is the weighted average of all possible values that this random variable can take on.
The weights used in computing this average correspond to the probabilities in the case of a discrete random variable, or to densities in the case of a continuous random variable.
From a rigorous theoretical standpoint, the expected value is the integral of the random variable with respect to its probability measure.
REMARK: Expectation
Definition (Expectation of a discrete random variable)
Suppose discrete random variable x can take value x1 with probability p1, value x2 with probability p2, and so on, up to value xk with probability pk. Then the expectation of this random variable x is defined as:

E(x) = ∑_{i=1}^k x_i p_i

Definition (Expectation of a univariate continuous random variable)
If the probability distribution of x admits a probability density function p(x), then the expected value can be computed as:

E(x) = ∫_{−∞}^{+∞} x p(x) dx
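The discrete definition, checked on a fair six-sided die (all p_i = 1/6, an assumed example):

```python
# E[x] = sum_i x_i p_i for a fair die: (1 + 2 + ... + 6) / 6 = 3.5
values = [1, 2, 3, 4, 5, 6]
probs = [1.0 / 6] * 6
E = sum(x * p for x, p in zip(values, probs))
print(E)  # 3.5
```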
REMARK: Expectation
REMARK: for multivariate random variables we have:

E(x) = ∑_{i=1}^k x_i p_i   when x is a multivariate discrete random variable

E(x) = ∫_{−∞}^{+∞} x p(x) dx   when x is a multivariate continuous random variable
REMARK: Expectation
Theorem (Law of the Unconscious Statistician)
Let g(x) be a function of a random variable x. We know the probability distribution of x but we do not know explicitly the distribution of g(x). The expected value of g(x) is then:

E[g(x)] = ∑_{i=1}^k g(x_i) p_i   when x is a multivariate discrete random variable

E[g(x)] = ∫_{−∞}^{+∞} g(x) p(x) dx   when x is a multivariate continuous random variable
REMARK: Univariate Normal Distribution
Univariate Normal Distribution:

p(x) = N(µ, σ²) ≡ (1 / (σ √(2π))) e^{−(1/2)((x − µ)/σ)²}   (1)

where:
σ² variance; σ standard deviation
expectation of x, or mean: E[x] = ∫_{−∞}^{∞} x p(x) dx ≡ µ
variance: E[(x − µ)²] = ∫_{−∞}^{∞} (x − µ)² p(x) dx ≡ σ²
REMARK: Univariate Normal Distribution
Univariate Normal Distribution:
99.7% of samples are in the interval |x − µ| ≤ 3σ
95% of samples are in the interval |x − µ| ≤ 2σ
68% of samples are in the interval |x − µ| ≤ σ
Mahalanobis distance - Multivariate normal distribution:

p(x) = N(µ, Σ) ≡ (1 / ((2π)^{d/2} |Σ|^{1/2})) exp[ −(1/2) (x − µ)^t Σ^{−1} (x − µ) ]   (2)

where:
d dimensionality of the feature space; (x − µ)^t transpose of (x − µ)
µ ≡ E[x] mean vector
Σ ≡ E[(x − µ)(x − µ)^t] covariance matrix
σ_ij = E[(x_i − µ_i)(x_j − µ_j)] ij-th element of matrix Σ
σ_ii variance of x_i; σ_ij covariance of x_i and x_j
σ_ij = 0 ⟺ x_i and x_j statistically independent
Σ symmetric and positive semidefinite (i.e. z^t Σ z ≥ 0)
Σ^{−1} inverse of Σ; |Σ| determinant of Σ

Mahalanobis distance:

M(x, µ) = √( (x − µ)^t Σ^{−1} (x − µ) )
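A minimal sketch of the Mahalanobis distance (the function name is illustrative):

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """M(x, mu) = sqrt((x - mu)^t Sigma^{-1} (x - mu))."""
    d = np.asarray(x) - np.asarray(mu)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# With Sigma = I the Mahalanobis distance reduces to the Euclidean distance.
m1 = mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2))
print(m1)  # 5.0
# A large variance along a coordinate shrinks distances in that direction.
m2 = mahalanobis([3.0, 4.0], [0.0, 0.0], np.diag([9.0, 1.0]))
print(m2)  # sqrt(9/9 + 16/1) = sqrt(17)
```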
Mahalanobis distance - Multivariate normal distribution (figure)
Clustering Algorithms - Example 1 (figure)
Clustering Algorithms - Example 2

Representation of a text document with a word-vector. Bag-Of-Words format: each document is represented by the set of its word frequencies (ignoring the position of words in the document) and the categories that it belongs to.
The purpose of the format is to enable efficient execution of algorithms such as clustering, learning, classification, visualization, etc.
Partitions - Codevectors

Let X = {x_1, ..., x_n} be a data set composed of n patterns, with every x_i ∈ R^d.
The codebook (or set of centroids) V is defined as the set V = {v_1, ..., v_c}, typically with c ≪ n. Each element v_i ∈ R^d is called a codevector (or centroid or prototype).
The Voronoi region R_i of the codevector v_i is the set of vectors in R^d for which v_i is the nearest codevector:

R_i = { z ∈ R^d | i = argmin_j ‖z − v_j‖² }   (3)

It is possible to prove that each Voronoi region is convex (Linde, 1980) and that the boundaries of the regions are linear segments (pieces of hyperplanes).
Partitions - Voronoi set

The Voronoi set (or cell, region, polyhedron) π_i of the codevector v_i is the subset of elements of X for which v_i is the nearest codevector:

π_i = { x ∈ X | i = argmin_j ‖x − v_j‖² }   (4)

that is, the set of vectors belonging to R_i.
The partition of R^d induced by all Voronoi regions is called a Voronoi tessellation or Dirichlet tessellation.
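Eq. 4 amounts to a nearest-codevector assignment, sketched here for a toy data set (names and data are illustrative):

```python
def voronoi_sets(X, V):
    """Partition data X by nearest codevector in V (squared Euclidean distance)."""
    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    sets = {i: [] for i in range(len(V))}
    for x in X:
        i = min(range(len(V)), key=lambda j: d2(x, V[j]))  # argmin_j ||x - v_j||^2
        sets[i].append(x)
    return sets

X = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
V = [(0.0, 0.0), (5.0, 5.0)]
sets = voronoi_sets(X, V)
print(sets)  # {0: [(0.0, 0.0), (0.2, 0.1)], 1: [(5.0, 5.0), (5.1, 4.9)]}
```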
Vector quantization - Example

While the clustering approach is descriptive, vector quantization is predictive.

Sequence of images, 512×512 pixels of 8 bits each
consider sub-images of 4×4 pixels
we can represent a sub-image as a vector x_k = (x_k^1, ..., x_k^16)
Vector quantization
In the feature space the sub-images will aggregate in clusters.

The center y_j of a cluster j represents all elements of the cluster and is called a codevector.

Codebook:
y_1 = (y_1^1, ..., y_1^16)
y_2 = (y_2^1, ..., y_2^16)
...
y_c = (y_c^1, ..., y_c^16)
Vector quantization
When we have to transmit a sub-image x_k, we send in its place the index j, where

j = argmin_l ‖x_k − y_l‖   (WTA rule)

In this way we transmit an amount of information of order log₂ c instead of log₂(512 × 512 × 8).
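A quick check of the per-sub-image saving, assuming (illustratively) a codebook of c = 256 codevectors:

```python
import math

# Sending a codebook index instead of a raw 4x4 sub-image (16 pixels x 8 bits).
c = 256                                  # assumed codebook size
raw_bits_per_subimage = 16 * 8           # 128 bits for the raw sub-image
index_bits = math.ceil(math.log2(c))     # 8 bits for the index
print(raw_bits_per_subimage, index_bits) # 128 8
```

The price paid is distortion: the receiver reconstructs y_j, not x_k.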
Vector quantization - Color Images

Color models:
RGB: additive color model; red, green, and blue light are added together in various ways to reproduce a broad array of colors.
CMYK (or process color, four color): subtractive color model used in color printing. It refers to the four inks used in some color printing: cyan, magenta, yellow, and key (black).
CcMmYK (or CMYKLcLm): six-color subtractive color model used in some inkjet printers optimized for photo printing. CMYK model (cyan, magenta, yellow, and key) + light cyan (c) and light magenta (m).
etc.

[Figures: RGB (additive), CMYK (subtractive)]
Vector quantization - Color Images
RGB color images can be coded using 3 intensity matrices (R, G, B) of 512×512 pixels of 8 bits each;
each pixel will be a vector of 3 intensity levels p_ij = (r_ij, g_ij, b_ij);
a sub-image is a vector x_k = (r_k^1, g_k^1, b_k^1, ..., r_k^16, g_k^16, b_k^16);
the rest of the previous discussion remains unchanged.
Lagrange Multipliers - General case

Theorem
If a scalar field f(x1, ..., xn) has a relative extremum when it is subject to m constraints (m < n), say
g1(x1, ..., xn) = 0, ..., gm(x1, ..., xn) = 0,
the constrained optimization problem (CP) can be solved through the unconstrained optimization of

L ≡ f(x1, ..., xn) − ∑_{i=1}^m λ_i g_i(x1, ..., xn).

Definition
L is called the Lagrangian and the λ_i are called the Lagrange multipliers.
Lagrange Multipliers - General case

∇f(x1, ..., xn) = ∑_{i=1}^m λ_i ∇g_i(x1, ..., xn)

g1(x1, ..., xn) = 0
...
gm(x1, ..., xn) = 0

note: m + n equations, m + n unknown quantities
Lagrange Multipliers - Example

Find the extreme values of

z = x y

subject to the condition x + y = 1.

Solution:
f(x, y) = x y
g(x, y) = x + y − 1 = 0
∇f = λ ∇g
x + y = 1
Lagrange Multipliers - Example

y = λ
x = λ
x + y = 1

hence x = y = λ = 1/2 and f_max(x, y) = 1/4.
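A quick numerical check of the worked example (a grid search substituting y = 1 − x; purely illustrative):

```python
# Maximize f(x, y) = x*y on the constraint x + y = 1: with y = 1 - x the
# problem becomes 1-D, and the Lagrange solution predicts x = y = 1/2, f = 1/4.
best_x = max((i / 10000 for i in range(10001)), key=lambda x: x * (1 - x))
print(best_x, best_x * (1 - best_x))  # 0.5 0.25
```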
Picard Iteration (Recktenwald, 2000)

DEF: Fixed Point
A function g(x) is said to have a fixed point p if g(p) = p. In other words, the value you put into the function is exactly the same value that you get out.
Solving the equation f(x) = g(x) − x = 0 is identical to finding the fixed point of g(x) AND the zero of f(x). So we are dealing with another possible method for finding the root of a one-variable equation.

DEF: Fixed Point Iteration
The iteration process is p_n = g(p_{n−1}) for n = 1, 2, 3, .... This process is also called Picard iteration, functional iteration, or repeated substitution.
Picard Iteration (Recktenwald, 1998)

Finding the root of f(x) = log(x + 4) − x on [0, 2], i.e., f(x) = 0, is equivalent to finding the fixed point of g(x) = log(x + 4) on [0, 2]. Three starting points p0 = 0, 1, 2 (the tabulated values correspond to the base-10 logarithm):

 n | p_n     g(p_n)  | p_n     g(p_n)  | p_n     g(p_n)
 0 | 0.0000  0.6020  | 1.0000  0.6990  | 2.0000  0.7782
 1 | 0.6020  0.6629  | 0.6990  0.6720  | 0.7782  0.6793
 2 | 0.6629  0.6686  | 0.6720  0.6695  | 0.6793  0.6702
 3 | 0.6686  0.6691  | 0.6695  0.6693  | 0.6702  0.6693
 4 | 0.6691  0.6692  | 0.6693  0.6692  | 0.6693  0.6693
 5 | 0.6692  0.6692  | 0.6692  0.6692  | 0.6693  0.6692
 6 | 0.6692  0.6692  | 0.6692  0.6692  | 0.6692  0.6692
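The iteration tabulated above can be reproduced with a short sketch (base-10 log, as the tabulated values imply; the function name is illustrative):

```python
import math

def picard(g, p0, tol=1e-6, max_iter=100):
    """Fixed-point (Picard) iteration p_n = g(p_{n-1})."""
    p = p0
    for _ in range(max_iter):
        p_next = g(p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p

g = lambda x: math.log10(x + 4)   # fixed point of g = root of log(x + 4) - x
roots = [picard(g, p0) for p0 in (0.0, 1.0, 2.0)]
print([round(r, 4) for r in roots])   # each ≈ 0.6692
```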
Picard Iteration (Recktenwald, 2000) [REC00]

Uniqueness: The Fixed Point Theorem
If g is continuous on [a, b] and g(x) ∈ [a, b] for all x ∈ [a, b], then g has a fixed point in [a, b].
In addition, if 0 < |g′(x)| < 1 for all x ∈ [a, b], then g has a unique fixed point in [a, b].

Convergence Criteria for Picard Iteration
The iteration process p_n = g(p_{n−1}) for n = 1, 2, 3, ... will converge to a unique solution for any initial value p0 in [a, b] if g′ exists on (a, b) and 0 < |g′(x)| < 1 for all x ∈ [a, b].
Picard Iteration (Recktenwald, 1998)

In general, for multivalued functions, each iteration of the Picard iteration method is composed of two (or more) steps:

Step 1: A subset of variables is kept fixed and the optimization is performed with respect to the remaining variables.
Step 2: The roles of the fixed and moving variables are swapped.

The optimization algorithm stops when the variables change less than a fixed threshold.
A more general framework named "Alternating Cluster Estimation" is presented in T. A. Runkler, J. C. Bezdek, Alternating Cluster Estimation: A New Tool for Clustering and Function Approximation, IEEE Transactions on Fuzzy Systems, vol. 7, no. 4, pp. 377-393, 1999 [RUN99].
Parametric Clustering (Duda, 73)

Let X = {x_h | h = 1, ..., n} be the set of unlabeled instances (training set), and V = {v_i | i = 1, ..., c} be the set of centers of the clusters (or classes) ω_i. Following a parametric learning approach, we make the following assumptions:

1. The instances come from a known number c of classes ω_i, i ∈ {1, ..., c}.
2. The a priori probabilities P(ω_i), i.e. the probabilities of drawing patterns of class ω_i from X, are known.
3. The forms of the class-conditional probability densities p(x | ω_i, Θ_i) (i.e. the probability density of instance x_h inside class ω_i) are known, ∀i.

Θ_i is the unknown vector of parameters of the class-conditional probability densities.
Note that the third assumption reduces the clustering problem to the problem of estimating the vectors Θ_i (parametric learning).
Parametric Clustering
In this setting, we assume that instances are obtained by selecting a class ω_i and then selecting a pattern x according to the probability law p(x | ω_i, Θ_i), i.e.:

p(x | Θ) = ∑_{i=1}^c p(x | ω_i, Θ_i) P(ω_i)   (5)

where Θ = (Θ_1, ..., Θ_c).
A density function of this form is called a mixture density (Duda73); the p(x | ω_i, Θ_i) are called the component densities, and the P(ω_i) are called the mixing parameters.
Parametric Clustering
A well-known parametric statistics method for estimating the parameter vector Θ is maximum likelihood (Duda73). It assumes that the parameter vector Θ is fixed but unknown. The likelihood of the training set X is the conditional density

p(X | Θ) = ∏_{h=1}^n p(x_h | Θ)   (6)

or also:

p(X | Θ) = ∏_{h=1}^n ∑_{i=1}^c p(x_h | ω_i, Θ_i) P(ω_i)   (7)

Its log is:

log p(X | Θ) = ∑_{h=1}^n log ∑_{i=1}^c p(x_h | ω_i, Θ_i) P(ω_i)   (8)

Then the maximum-likelihood estimate Θ̂ is the value of Θ that maximizes the likelihood of the observed training set X (or its log).
Parametric Clustering
If p(X | Θ) is a differentiable function of Θ, we can obtain the following conditions for the maximum-likelihood estimates Θ̂_i:

∑_{h=1}^n P(ω_i | x_h, Θ̂) ∇_{Θ_i} log p(x_h | ω_i, Θ̂_i) = 0   ∀i   (9)

Constraints:

P(ω_i) ≥ 0   (10)

∑_{i=1}^c P(ω_i) = 1   (11)
Parametric Clustering
Let us assume now that the component densities are multivariate normal, i.e.:

p(x_h | ω_i, Θ̂_i) = (1 / ((2π)^{d/2} |Σ_i|^{1/2})) exp[ −(1/2) (x_h − v_i)^t Σ_i^{−1} (x_h − v_i) ]   (12)

where:
d dimensionality of the feature space
v_i mean vector
Σ_i ≡ E[(x_h − v_i)(x_h − v_i)^t] covariance matrix
(x_h − v_i)^t transpose of x_h − v_i; Σ_i^{−1} inverse of Σ_i; |Σ_i| determinant of Σ_i
Parametric Clustering
The local-maximum-likelihood estimates are:

v̂_i = ∑_{h=1}^n P(ω_i | x_h, Θ̂) x_h / ∑_{h=1}^n P(ω_i | x_h, Θ̂)   (13)

Σ̂_i = ∑_{h=1}^n P(ω_i | x_h, Θ̂) (x_h − v̂_i)(x_h − v̂_i)^t / ∑_{h=1}^n P(ω_i | x_h, Θ̂)   (14)

P(ω_i | x_h, Θ̂) = |Σ̂_i|^{−1/2} exp[ −(1/2)(x_h − v̂_i)^t Σ̂_i^{−1} (x_h − v̂_i) ] P(ω_i) / ∑_{j=1}^c |Σ̂_j|^{−1/2} exp[ −(1/2)(x_h − v̂_j)^t Σ̂_j^{−1} (x_h − v̂_j) ] P(ω_j)   (15)
Parametric Clustering
The Eqs. in the previous slide can be interpreted as the basis of a gradient ascent or hill-climbing procedure for maximizing the likelihood (Picard iteration); each cycle is composed of two (or more) steps:

Step 1: A subset of variables is kept fixed and the optimization is performed with respect to the remaining variables.
Step 2: The roles of the fixed and moving variables are swapped.

The optimization algorithm stops when the variables change less than a fixed threshold.
A Picard iteration can start with Eq. 15, using initial estimates to evaluate P(ω_i | x_h, Θ̂), then use Eqs. 13 and 14 to update the other estimates, and repeat this cycle until the variations are less than an assigned threshold.
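The cycle above can be sketched for the 1-D case with diagonal (scalar) variances; a minimal illustration of the alternating iteration of Eqs. 13-15, not the general multivariate procedure (function name and initialization are assumptions):

```python
import numpy as np

def em_gmm_1d(x, c, n_iter=50):
    """Alternating (Picard) iteration for a 1-D Gaussian mixture:
    the E-step evaluates the posteriors P(omega_i | x_h) (Eq. 15),
    the M-step updates means, variances and priors (Eqs. 13, 14)."""
    mu = np.quantile(x, (np.arange(c) + 0.5) / c)  # spread initial means over the data
    var = np.full(c, np.var(x))
    pri = np.full(c, 1.0 / c)
    for _ in range(n_iter):
        # E-step: posterior membership of each point in each class
        dens = np.exp(-0.5 * (x[None, :] - mu[:, None]) ** 2 / var[:, None])
        dens /= np.sqrt(2 * np.pi * var[:, None])
        post = pri[:, None] * dens
        post /= post.sum(axis=0, keepdims=True)
        # M-step: posterior-weighted means, variances and priors
        w = post.sum(axis=1)
        mu = (post @ x) / w
        var = (post * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / w
        pri = w / len(x)
    return mu, var, pri

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(8.0, 1.0, 300)])
mu, var, pri = em_gmm_1d(x, 2)
print(np.sort(mu))   # means recovered near 0 and 8
```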
Parametric Clustering
The inversion of Σ_i is quite time consuming, and moreover it may be ill-conditioned.
Like all hill-climbing procedures, the results depend on the starting point, and therefore there is the possibility of multiple solutions.
The K-Means Algorithm
We can notice that in Eq. 15 the probability P(ω_i | x_h, Θ̂) is large when the squared Mahalanobis distance

M²(x_h, v_i) ≡ (x_h − v_i)^t Σ_i^{−1} (x_h − v_i)   (16)

is small.
The K-Means Algorithm
This observation is the rationale of the K-Means algorithm, also known as Hard C-Means (HCM), C-Means, or Basic Isodata (Duda73) [Isodata stands for Iterative Self-Organizing Data Analysis Techniques], which is based on the following approximation:

P(ω_i | x_h, Θ̂) = { 1 if E_i(x_h) = min_{1≤j≤c} E_j(x_h); 0 otherwise }   (17)

where E_j(x_h) is the local cost function or distortion and is usually assumed to be the squared Euclidean distance

E_j(x_h) ≡ ‖x_h − v_j‖²   (18)

Starting from the finite data set X, this algorithm iteratively moves the k codevectors to the arithmetic means of their Voronoi sets {π_i}_{i=1,...,k}.
K-Means (Lloyd, 1957) - Equivalent foundation of the K-Means algorithm: Lloyd's algorithm, a.k.a. Voronoi iteration or relaxation
Theorem
A necessary condition for a codebook V to minimize the Empirical Quantization Error (Gersho, 1992), or Expectation of Distortion, or K-Means functional, or K-Means objective function (denoted as E(X) or as <E>):

E(X) = ∑_{i=1}^{c} ∑_{x∈πi} ||x − vi||² = ∑_{ih} uih ||xh − vi||²,  with uih = { 1 if xh ∈ πi; 0 otherwise }   (19)

is that each codevector vi fulfills the centroid condition.

In the case of a finite data set X and with the Euclidean distance, the centroid condition reduces to vi = (1/|πi|) ∑_{x∈πi} x.
K-Means (Lloyd, 1957) - Algorithm
K-Means is made up of the following steps:
1 choose the number k of clusters;
2 initialize the codebook V with vectors randomly picked from X, or with random vectors in the minimum hyperbox containing the full data set, i.e., as a combination of data points:

vj = ∑_{h=1}^{n} γjh xh,   (20)

with coefficients γjh ∈ [0,1];
3 compute the Voronoi set πi associated to each codevector vi;
4 move each codevector to the mean of its Voronoi set;
5 return to step 3 if any codevector has changed, otherwise return the codebook.
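The steps above can be sketched with NumPy as follows; the function name and the two-blob example data are illustrative assumptions, not from the slides:

```python
import numpy as np

def k_means(X, k, max_iter=100, rng=None):
    """Batch K-Means (Lloyd): alternate Voronoi assignment and centroid update."""
    rng = np.random.default_rng(rng)
    # Step 2: initialize the codebook with k distinct vectors picked from the data set.
    V = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Step 3: Voronoi sets -- each point is assigned to its nearest codevector.
        labels = np.argmin(((X[:, None, :] - V[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 4: move each codevector to the mean of its Voronoi set.
        new_V = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else V[i]
                          for i in range(k)])
        # Step 5: stop when no codevector has changed.
        if np.allclose(new_V, V):
            break
        V = new_V
    return V, labels

# Two well-separated blobs: the codevectors converge to the blob means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
V, labels = k_means(X, k=2, rng=0)
```

On this data the two returned codevectors end up close to (0, 0) and (5, 5), the means of the two blobs.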
K-Means (Lloyd, 1957)

[Figure slides illustrating the algorithm.]
K-Means (Lloyd, 1957) - Pros
At each iteration of the algorithm a codebook is found and a Voronoi tessellation of the input space is provided.
It is guaranteed that after each iteration the quantization error does not increase.
At the end of the algorithm a local minimum of the quantization error is obtained.
K-Means can be viewed as an Expectation-Maximization algorithm, ensuring convergence after a finite number of steps (Bishop, 1996).
Different distances lead to different invariance properties, as in the case of the Mahalanobis distance, which produces invariance on ellipsoids (Duda & Hart, 1973).
K-Means (Lloyd, 1957) - Cons
Local minima of E(X) make the method dependent on initialization, and the average is sensitive to outliers (Duda & Hart, 1973).
K-Means (Lloyd, 1957) - Cons
In order to overcome this problem:
Heuristics, e.g., split-and-merge algorithms (Isodata algorithm).
Local search techniques based on a regularization framework (adding constraints on the solution, i.e., minimization of a modified risk functional), e.g., Isodata and fuzzy clustering paradigms: Fuzzy C-Means (FCM) (Bezdek, 1981), Deterministic Annealing (DA) (Rose, 1990), Possibilistic Clustering (Krishnapuram & Keller, 1993, 1996), and Graded Possibilistic Clustering (Masulli & Rovetta, 2003).
Global search techniques, e.g., minimization of E(X) using Simulated Annealing (Bogus et al., 1999) or Evolutionary Computing (Fogel, 1993; Bezdek et al., 1994; Tseng & Yang, 1997; Egan, 1998; Kuncheva et al., 1998; Hall et al., 1999; Masulli et al., 1999).
K-Means (Lloyd, 1957) - Cons
The number of clusters to find must be provided, and this can be done only using some a priori information or an additional validity criterion.
K-Means can deal only with clusters with spherically symmetric point distributions, since the Euclidean distances of patterns from centroids are computed, leading to spherical invariance.
The approximation in Eq. 17, making uih ∈ {0,1}, is often too strong, while, by contrast, in real cases some objects show non-zero similarity degrees to different classes.
Isodata Algorithm - Iterative Self-Organizing Data Analysis Techniques
A heuristic algorithm which allows the number of clusters to be automatically adjusted during the iteration by merging similar clusters and splitting clusters with large standard deviations.
The Isodata algorithm is more flexible than the K-Means method, but the user has to choose many more parameters empirically.
In the next slides we give an example of the Isodata algorithm from http://fourier.eng.hmc.edu/e161/lectures/classification/node13.html
Isodata Algorithm - Iterative Self-Organizing Data Analysis Techniques
Given a set of samples {xi, i = 1, 2, ..., N} (where each x(i) = [x1(i), ..., xn(i)]ᵀ is a column vector representing a point in the n-dimensional feature space):
K = number of clusters desired;
I = maximum number of iterations allowed;
P = maximum number of pairs of clusters which can be merged;
ΘN = threshold value for the minimum number of samples (cardinality) each cluster can have (used for discarding clusters);
ΘS = threshold value for standard deviation (used for the split operation);
ΘC = threshold value for pairwise distances (used for the merge operation).
Isodata Algorithm - Algorithm
Step 1. Arbitrarily choose k (not necessarily equal to K) initial cluster centers from the data set {xi, i = 1, 2, ..., N}.
Step 2. Assign each of the N data-points to the closest cluster center: x ∈ ωj if DL(x, vj) = min {DL(x, vi), i = 1, ..., k}.
Step 3. Discard clusters with fewer than ΘN members, i.e., if for any j, Nj < ΘN, then discard ωj and let k ← k − 1.
Step 4. Update each cluster center: vj = (1/Nj) ∑_{x∈ωj} x, (j = 1, ..., k).
Step 5. Compute the average distance Dj of the data-points in cluster ωj from their cluster center:

Dj = (1/Nj) ∑_{x∈ωj} DL(x, vj), (j = 1, ..., k)
Isodata Algorithm - Algorithm
Step 6. Compute the overall average distance D of the data-points from their respective cluster centers:

D = (1/N) ∑_{j=1}^{k} Nj Dj

Step 7. If k ≤ K/2 (too few clusters), go to Step 8; else if k ≥ 2K (too many clusters), go to Step 11; else go to Step 14. (Steps 8 through 10 are for the split operation, Steps 11 through 13 are for the merge operation.)
Isodata Algorithm - Algorithm
Step 8. First step of the split operation. Find the standard deviation vector σj = [σ1(j), ..., σn(j)]ᵀ for each cluster:

σi(j) = sqrt( (1/Nj) ∑_{x∈ωj} (xi − vi(j))² )

where vi(j) is the i-th component of vj, σi(j) is the standard deviation of the data-points in ωj along the i-th coordinate axis, and Nj is the number of data-points in ωj.

Step 9. Find the maximum component of each σj and denote it by σmax(j); do this for each j = 1, ..., k.
Isodata Algorithm - Algorithm
Step 10. If for any σmax(j), (j = 1, ..., k), all of the following are true:

σmax(j) > ΘS;   Dj > D;   Nj > 2ΘN

then split vj into two new cluster centers vj+ and vj− by adding ±δ to the component of vj corresponding to σmax(j), where δ can be α σmax(j) for some α > 0. Then delete vj and let k ← k + 1. Go to Step 2. Else go to Step 14.

Step 11. First step of the merge operation. Compute the pairwise distances Dij between every two cluster centers, Dij = DL(vi, vj), and arrange these k(k − 1)/2 distances in ascending order.

Step 12. Find no more than P smallest Dij's which are also smaller than ΘC and keep them in ascending order: Di1j1 ≤ Di2j2 ≤ · · · ≤ DiPjP
Isodata Algorithm - Algorithm
Step 13. Perform pairwise merges: for l = 1, ..., P, do the following. If neither vil nor vjl has been used in this iteration, merge them to form a new center:

v = (Nil vil + Njl vjl) / (Nil + Njl)

Delete vil and vjl and let k ← k − 1. Go to Step 2.

Step 14. Terminate if the maximum number of iterations I is reached; otherwise go to Step 2.
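The procedure above can be condensed into a simplified NumPy sketch; it uses the Euclidean distance, performs at most one merge per iteration, and omits several refinements of the full Isodata procedure. The function name and all parameter defaults are illustrative assumptions:

```python
import numpy as np

def isodata(X, K, I=20, theta_N=2, theta_S=1.0, theta_C=1.0, alpha=0.5, rng=None):
    """Simplified Isodata sketch: assign, discard, update, then split or merge."""
    rng = np.random.default_rng(rng)
    V = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(I):
        # Step 2: assign each point to its closest cluster center.
        labels = np.argmin(np.linalg.norm(X[:, None] - V[None], axis=2), axis=1)
        # Step 3: discard clusters with fewer than theta_N members.
        keep = [j for j in range(len(V)) if (labels == j).sum() >= theta_N]
        V = V[keep]
        labels = np.argmin(np.linalg.norm(X[:, None] - V[None], axis=2), axis=1)
        # Step 4: update each cluster center to the mean of its members.
        V = np.array([X[labels == j].mean(axis=0) for j in range(len(V))])
        if len(V) <= K / 2:
            # Steps 8-10: split clusters with a large per-axis standard deviation.
            new_V = []
            for j in range(len(V)):
                pts = X[labels == j]
                sigma = pts.std(axis=0)
                i_max = int(np.argmax(sigma))
                if sigma[i_max] > theta_S and len(pts) > 2 * theta_N:
                    delta = np.zeros(X.shape[1])
                    delta[i_max] = alpha * sigma[i_max]
                    new_V += [V[j] + delta, V[j] - delta]
                else:
                    new_V.append(V[j])
            V = np.array(new_V)
        elif len(V) >= 2 * K and len(V) > 1:
            # Steps 11-13: merge the closest pair if its distance is below theta_C.
            D = np.linalg.norm(V[:, None] - V[None], axis=2)
            D[np.diag_indices(len(V))] = np.inf
            i, j = map(int, np.unravel_index(np.argmin(D), D.shape))
            if D[i, j] < theta_C:
                ni, nj = (labels == i).sum(), (labels == j).sum()
                merged = (ni * V[i] + nj * V[j]) / (ni + nj)
                V = np.vstack([np.delete(V, [i, j], axis=0), [merged]])
    return V
```

The sketch runs for a fixed number of iterations rather than testing convergence, which is one of the simplifications with respect to the full algorithm.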
K-Medoids Algorithm
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm.
In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars).
A necessary condition for the set of medoids M to minimize the functional

F(X) = ∑_{j=1}^{k} ∑_{x∈πj} |x − mj|,   (21)

is that each mj is selected from the x ∈ πj so as to minimize F(X).
It is more robust to noise and outliers than k-means, because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.
K-Medoids Algorithm - Algorithm
The most common realization of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm (Theodoridis & Koutroumbas, 2006), which is as follows:
(1) Initialize: randomly select k of the n data points as the medoids.
(2) Associate each data point to the closest medoid ("closest" here is defined using any valid distance metric, most commonly the Euclidean, Manhattan, or Minkowski distance).
(3) For each medoid m:
(3.1) For each non-medoid data point o:
(3.1.1) Swap m and o and compute the total cost of the configuration.
(4) Select the configuration with the lowest cost.
(5) Repeat steps 2 to 4 until there is no change in the medoids.
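A compact sketch of the PAM steps above, using Euclidean dissimilarities; the function name and test data are illustrative assumptions:

```python
import numpy as np

def pam(X, k, rng=None):
    """Partitioning Around Medoids (PAM) sketch with Euclidean dissimilarity."""
    rng = np.random.default_rng(rng)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)      # pairwise dissimilarities
    medoids = list(rng.choice(n, size=k, replace=False))  # (1) random initialization

    def cost(meds):
        # Total dissimilarity of every point to its closest medoid.
        return D[:, meds].min(axis=1).sum()

    while True:
        best_cost, best_meds = cost(medoids), medoids
        # (3) try swapping each medoid m with each non-medoid point o
        for mi, m in enumerate(medoids):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids[:mi] + [o] + medoids[mi + 1:]
                c = cost(trial)
                if c < best_cost:
                    best_cost, best_meds = c, trial
        if best_meds == medoids:  # (5) stop when the medoids no longer change
            break
        medoids = best_meds       # (4) keep the lowest-cost configuration
    labels = D[:, medoids].argmin(axis=1)
    return np.array(medoids), labels
```

Since the cost strictly decreases at each accepted swap and the number of configurations is finite, the loop always terminates.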
Image Segmentation / Binarization
We associate to each pixel a square subimage centered on it, plus a possible set of features extracted by some image processing operator (augmented feature vector):
xij = (x(i−k)(j−k), · · · , x(i+k)(j+k), f1, · · · , fr)
We use a small number of codevectors and assign a false color to each cluster for image segmentation.
If we segment using only two codevectors, we obtain a binarized image.
[Figures: two examples of an original image and its binarized version.]
Image Segmentation
Robustness to noise:
A robustness measure for an estimator is the breakdown point, defined as the fraction of outliers able to corrupt the estimation.
[Figures: original MRI image; image with +7% Gaussian noise; segmented image.]
Image Segmentation

[Figure slide.]
Batch K-Means (Lloyd, 1957)
The term batch means that at each step the algorithm takes into account the whole data set to update the codevectors. When the cardinality n of the data set X is very high (e.g., several hundreds of thousands), the batch procedure is computationally expensive.
An on-line update has been introduced, leading to the on-line K-Means algorithm (Linde, 1980; MacQueen, 1967). At each step, this method simply picks a random input pattern and updates its nearest codevector, ensuring that the scheduling of the updating coefficient is adequate to allow convergence and consistency.
On-Line K-Means / Vector Quantization
1 Initialize small codevectors at the center of the hyperbox of the data
2 Winner-Takes-All (WTA)
3 Adapt the winner codevector:

∆vj = ε(t)(x − vj)   (22)

4 Stochastic approximation (Robbins-Monro, 1951):

ε = ε(t)   (23)

PROS:
minimizes the K-Means functional → approximates K-Means solutions;
adapts to a changing environment.
CONS:
degenerate codevectors at the end of the learning that represent no examples (wasting of resources).
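A minimal sketch of the on-line update, assuming a simple 1/t learning-rate schedule (the particular schedule, function name, and defaults are assumptions, not prescribed by the slides):

```python
import numpy as np

def online_k_means(X, k, t_max=5000, eps0=0.5, rng=None):
    """On-line K-Means: WTA update of the nearest codevector, one pattern at a time."""
    rng = np.random.default_rng(rng)
    # Initialize small codevectors at the center of the hyperbox of the data.
    center = (X.min(axis=0) + X.max(axis=0)) / 2
    V = center + 0.01 * rng.standard_normal((k, X.shape[1]))
    for t in range(t_max):
        x = X[rng.integers(len(X))]           # randomly pick an input pattern
        j = np.argmin(((V - x) ** 2).sum(1))  # Winner-Takes-All
        eps = eps0 / (1 + t)                  # decreasing Robbins-Monro-style schedule
        V[j] += eps * (x - V[j])              # Eq. 22: move the winner toward x
    return V
```

The schedule satisfies the usual stochastic-approximation conditions (∑ε(t) = ∞, ∑ε(t)² < ∞).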
Online K-Means / Vector Quantization

[Figure slides illustrating the algorithm.]
On-Line K-Means / Vector Quantization
Solutions:
Conscience mechanism (DeSieno, 1988) [DES88]. Determine the winner as:

s(x) = arg min_{vj∈V} (fj ||x − vj||)   (24)

where fj is the frequency of past winnings of unit j.
Self Organizing Maps (Kohonen, 1981);
Fuzzy Learning Vector Quantization (Bezdek, 1995).
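The conscience mechanism of Eq. 24 can be sketched as follows, taking the multiplicative form literally; initializing the win counts to one (an assumption, to avoid zero frequencies) keeps the formula well defined:

```python
import numpy as np

def conscience_winner(x, V, wins):
    """Conscience-style WTA: bias the competition by the past winning frequency."""
    dists = np.linalg.norm(V - x, axis=1)
    freq = wins / max(wins.sum(), 1)   # f_j: fraction of past wins of unit j
    j = int(np.argmin(freq * dists))   # Eq. 24: s(x) = argmin_j f_j * ||x - v_j||
    wins[j] += 1                       # record the win
    return j
```

A unit that keeps winning accumulates a large fj, until eventually a far, rarely winning unit is selected instead; this is what prevents degenerate codevectors that represent no examples.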
Self Organizing Maps (Kohonen, 1981) [KoH90]
Kohonen, T., Automatic formation of topological maps of patterns in a self-organizing system. In Oja, E. and Simula, O., editors, Proceedings of 2SCIA, Scand. Conference on Image Analysis, pages 214-220, Helsinki, Finland. Suomen Hahmontunnistustutkimuksen Seura r.y., 1981.
A Self Organizing Map (SOM), also known as Self Organizing Feature Map (SOFM), represents data by means of codevectors organized on a grid with fixed topology.
Codevectors move to adapt to the input distribution, but the adaptation is also propagated along the grid to neighboring codevectors, according to a given propagation or neighborhood function. This effectively constrains the evolution of the codevectors.
Self Organizing Maps (Kohonen, 1981)
Grid topologies may differ. We consider a two-dimensional, square-mesh topology.
The distance on the grid is used to determine how strongly a codevector is adapted when the unit aij is the winner.
The metric used on a rectangular grid is the Manhattan distance, for which the distance between two elements r = (r1, r2) and s = (s1, s2) is:

drs = |r1 − s1| + |r2 − s2|.   (25)
Self Organizing Maps (Kohonen, 1981) - Algorithm
1 Initialize the codebook V with small codevectors at the center of the hyperbox of the data
2 Initialize the set C of connections to form the rectangular grid of dimension n1 × n2
3 Initialize t = 0
4 Randomly pick an input x from X
5 Determine the winner:

s(x) = arg min_{vj∈V} ||x − vj||   (26)

6 Adapt each codevector:

∆vj = ε(t) h(drs)(x − vj)   (27)

where h is a decreasing function of d, e.g.: h(drs) = exp(−drs² / 2σ²(t))
7 Increment t
8 If t < tmax go to step 4
9 End.
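The loop above can be sketched with NumPy, using exponentially decreasing schedules for ε(t) and σ(t); all numeric defaults and the function name are illustrative assumptions:

```python
import numpy as np

def som(X, n1, n2, t_max=3000, rng=None):
    """SOM sketch: square grid, Manhattan grid distance, Gaussian neighborhood."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    center = (X.min(axis=0) + X.max(axis=0)) / 2
    V = center + 0.01 * rng.standard_normal((n1 * n2, d))            # step 1
    grid = np.array([(i, j) for i in range(n1) for j in range(n2)])  # step 2
    eps_i, eps_f, sig_i, sig_f = 0.5, 0.01, max(n1, n2) / 2, 0.5
    for t in range(t_max):                                           # steps 3-8
        x = X[rng.integers(len(X))]                                  # step 4
        s = np.argmin(((V - x) ** 2).sum(1))                         # step 5: winner
        # Manhattan distance on the grid between the winner and every unit (Eq. 25)
        d_grid = np.abs(grid - grid[s]).sum(1)
        eps = eps_i * (eps_f / eps_i) ** (t / t_max)                 # decreasing schedules
        sig = sig_i * (sig_f / sig_i) ** (t / t_max)
        h = np.exp(-d_grid ** 2 / (2 * sig ** 2))                    # neighborhood function
        V += eps * h[:, None] * (x - V)                              # step 6: adapt all units
    return V.reshape(n1, n2, d)
```

Note that every unit is adapted at each step, with a strength that decays with its Manhattan distance from the winner on the grid.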
Self Organizing Maps (Kohonen, 1981)
σ(t) and ε(t) are decreasing functions of t, e.g. (Ritter, 1991):

σ(t) = σi (σf / σi)^{t/tmax},   ε(t) = εi (εf / εi)^{t/tmax}

where σi, εi are the initial values and σf, εf are the final values of σ(t) and ε(t), respectively.
Self Organizing Maps (Kohonen, 1981)

[Figure slides: SOM training; calibration; test with uniform distribution (noise).]
Self Organizing Maps (Kohonen, 1981) - Using SOM for clustering
The method was originally devised as a tool for embedding multidimensional data into typically two-dimensional spaces, for data visualization.
Since then, it has also been frequently used as a clustering method, which was originally not considered appropriate because of the constraints imposed by the topology.
Neural Gas (Martinetz, 1993) [MAR93]
This technique resembles the SOM in the sense that not only the winner codevector is adapted.
It differs in that the codevectors are not constrained to lie on a grid, and the adaptation of the codevectors near the winner is controlled by a criterion based on distance ranks.
Each time a pattern x is presented, all the codevectors vj are ranked according to their distance to x (the closest obtains the lowest rank).
Neural Gas (Martinetz, 1993) - Algorithm
ρj: rank of the distance between x and the codevector vj.
Update rule:

∆vj = ε(t) hλ(ρj)(x − vj)   (28)

with:
ε(t) ∈ [0,1], gradually lowered as t increases;
hλ(ρj), a function decreasing with ρj with a characteristic decay λ; usually hλ(ρj) = exp(−ρj/λ).
Neural Gas (Martinetz, 1993) - Algorithm
1 Initialize the codebook V by randomly picking from X
2 Initialize the time parameter t = 0
3 Randomly pick an input x from X
4 Order all elements vj of V according to their distance to x, obtaining the ρj
5 Adapt the codevectors according to Eq. 28
6 Increase the time parameter t = t + 1
7 If t < tmax go to step 3.
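A sketch of Neural Gas following these steps, with exponentially decaying ε(t) and λ(t) (the particular schedules, defaults, and function name are assumptions):

```python
import numpy as np

def neural_gas(X, k, t_max=3000, eps_i=0.5, eps_f=0.01, lam_i=None, lam_f=0.1, rng=None):
    """Neural Gas sketch: rank-based adaptation h_lambda(rho) = exp(-rho / lambda)."""
    rng = np.random.default_rng(rng)
    lam_i = lam_i or k / 2
    V = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # step 1
    for t in range(t_max):                                          # steps 2-7
        x = X[rng.integers(len(X))]                                 # step 3
        dists = ((V - x) ** 2).sum(1)
        rho = np.argsort(np.argsort(dists))  # step 4: rank of each codevector (closest = 0)
        eps = eps_i * (eps_f / eps_i) ** (t / t_max)
        lam = lam_i * (lam_f / lam_i) ** (t / t_max)
        V += eps * np.exp(-rho / lam)[:, None] * (x - V)  # step 5: Eq. 28
    return V
```

Unlike the SOM, the adaptation strength depends only on the distance rank ρj, not on any grid topology.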
The Capture Effect Neural Network (Firenze et al., 1994) [FIR94]
The Capture Effect Neural Network (CENN) (Firenze et al., 1994) is a self-organizing neural network able to take into account the local characteristics of the point distribution (adaptive resolution clustering).
CENN combines standard competitive self-organization of the weight vectors (Kohonen, 1995) with a non-linear mechanism of adaptive local modulation of the receptive fields (RF) of the neurons (capture effect).
The Capture Effect Neural Network (Firenze et al., 1994) - Training step
The learning phase of CENN is composed of the training step, performing a vector quantization of the data, and the labeling step, where the prototypes obtained by the previous step are grouped in order to obtain robust clusters.
In the training step, an initial abundant quantity of neurons ni = {wi, ri} is assumed and initialized with randomly chosen weight vectors wi (representing centers of sub-clusters) and large radii ri (ri = R0) of the receptive fields RFi (modeled by Gaussian functions γ).
The radius of a Gaussian RF is defined as the radius of an α-cut of the RF itself.
The Capture Effect Neural Network (Firenze et al., 1994) - Training step
Then the data set is presented to CENN and the following learning formulas are applied:

∆wi = ηw (xk − wi) γ(di(xk)) / ∑_l γ(dl(xk)),   (29)

∆ri = { ηr (di(xk) − ri) exp(−di(xk)/p) if di(xk) < R0; 0 if di(xk) ≥ R0 },   (30)

where ηw and ηr are learning rates, di(xk) = ||xk − wi|| is the Euclidean distance of the points from the weight vectors, and the parameter p is defined as:

p ≡ <di(xk)> / (D ln 10),   (31)

assuming D as the dimension of the feature space.
The Capture Effect Neural Network (Firenze et al., 1994) - Labeling step
The labeling step:
discards any neuron nq with rq = R0, i.e., neurons not representing elements of the training set;
then assigns the same label to couples of neurons np and nq (i.e., merges their associated clusters) if

||wp − wq|| < (rp + rq) σ,  σ ∈ (0, 1),   (32)

i.e., if they have (partially) overlapping RFs. The parameter σ is named the degree of overlapping.
This process obtains c groups of neurons Gj, j ∈ [1, c]. We can then define, for the cluster related to the j-th group, its center and radius as:

yj ≡ <w•>_{Gj},   rj ≡ <r•>_{Gj}.   (33)
The Capture Effect Neural Network (Firenze et al., 1994) - Labeling step
A remaining isolated neuron nq is called associable to a group Gj if

||wq − w•|| < (rq + r•)   (34)

for at least one neuron n• of group Gj.
For neurons associable to one or more groups, the following completion rule of the labeling step is applied: an isolated neuron nq, associable to different groups, is assigned to the i-th group if and only if

i = arg ∨_j (rGj − rq)   ∀j.   (35)
The Capture Effect Neural Network (Firenze et al., 1994) - Operational phase
In the operational phase, an unknown vector x is classified by exploiting the winner-take-all (WTA) rule in the following way:

x ∈ (j-th cluster) ⟺ h = arg ∨_i (||x − wi|| / ri),  wh ∈ Gj.   (36)
The Capture Effect Neural Network (Firenze et al., 1994) - Operational phase
It is worth noting that, after the learning phase:
the distribution of the prototypes in the feature space approaches the optimal vector quantization scheme of the distribution of the input data, i.e., it approximates the mixture probability density function;
the radial size of the RF of each neuron reaches a stable value which is strongly related to the spatial density of the input data locally around the weight vector (i.e., the center of the RF).