Gaussian mean-shift algorithms

Miguel Á. Carreira-Perpiñán
Dept. of Computer Science & Electrical Engineering, OGI/OHSU
http://www.csee.ogi.edu/~miguel
Gaussian mean-shift (GMS) as an EM algorithm

• Gaussian mean-shift (GMS) as an EM algorithm
• Acceleration strategies for GMS image segmentation
• Gaussian blurring mean-shift (GBMS)
Gaussian mixtures, kernel density estimates
Given a dataset X = {x_n}_{n=1}^N ⊂ R^D, define a Gaussian kernel density estimate with bandwidth σ:

    p(x) = (1/N) ∑_{n=1}^N K(‖(x − x_n)/σ‖²),    K(t) ∝ e^{−t/2}.

• Other kernels:
  – Epanechnikov: K(t) ∝ 1 − t for t ∈ [0, 1), and 0 for t ≥ 1 (finite support).
  – Student's t: K(t) ∝ (1 + t/α)^{−(α+D)/2}.
• Useful way to represent the density of a data set.
• Extremely popular in machine learning and statistics.
• Objective here: find modes of p(x).
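For concreteness, here is a minimal NumPy sketch of this kde (not part of the original slides); the function name and the use of the unnormalised Gaussian kernel K(t) ∝ e^{−t/2} are my own choices.

    import numpy as np

    def gaussian_kde(x, X, sigma):
        """Evaluate the isotropic Gaussian kde p(x) (up to the kernel's
        normalisation constant) at a single point x, given data X of shape (N, D)."""
        d2 = np.sum(((x - X) / sigma) ** 2, axis=1)   # ||(x - x_n)/sigma||^2 for every n
        return np.exp(-0.5 * d2).mean()               # (1/N) sum_n K(.), K(t) ∝ e^{-t/2}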
Why find modes? Nonparametric clustering
[Figure: a 1D density p(x) whose modes define clusters C1, C2, C3.]

• Modes ↔ clusters.
• Assign each point x to a mode.
• Nonparametric:
  – no model for the clusters' shape
  – no predetermined number of clusters.
Why find modes? Multivalued mappings
[Figure: joint density p(x, y) (a Gaussian mixture) and the conditional p(x|y = y0), whose modes x0, x1, x2 are the images of y0.]

• Given a density model for (x, y), represent a multivalued mapping y → x by the modes of the conditional distribution p(x|y). Map y0 to {x0, x1, x2}.
• p(x, y) is a Gaussian mixture ⇒ p(x|y) is a Gaussian mixture.
Mode-finding algorithms
• Idea: start a hill-climbing algorithm from every centroid. This typically finds all modes in practice (Carreira-Perpiñán '00).
• Hill-climbing algorithms:
  – Gradient ascent
  – Newton's method
  – etc.
  – Fixed-point iteration: the mean-shift algorithm.
    It is based on ideas by Fukunaga & Hostetler '75 (also Cheng '95, Carreira-Perpiñán '00, Comaniciu & Meer '02, etc.).
The mean-shift algorithm
It is obtained by reorganising the stationary-point equation ∇p(x) = 0 as a fixed-point iteration x^(τ+1) = f(x^(τ)). For example, for an isotropic kde p(x) = (1/N) ∑_{n=1}^N K(‖(x − x_n)/σ‖²):

    ∇p(x) = (1/N) ∑_{n=1}^N K′(‖(x − x_n)/σ‖²) (2/σ²)(x − x_n)

          = (2/(Nσ²)) [∑_{n=1}^N K′(‖(x − x_n)/σ‖²)] x − (2/(Nσ²)) ∑_{n=1}^N K′(‖(x − x_n)/σ‖²) x_n = 0

    ⇒ x = ∑_{n=1}^N [ K′(‖(x − x_n)/σ‖²) / ∑_{n′=1}^N K′(‖(x − x_n′)/σ‖²) ] x_n.
The mean-shift algorithm (cont.)
Gaussian, full covariance:

    f(x) = ( ∑_{n=1}^N p(n|x) Σ_n^{−1} )^{−1} ∑_{n=1}^N p(n|x) Σ_n^{−1} x_n

Gaussian, isotropic:

    f(x) = ∑_{n=1}^N p(n|x) x_n,    p(n|x) = e^{−½‖(x − x_n)/σ‖²} / ∑_{n′=1}^N e^{−½‖(x − x_n′)/σ‖²}

Other kernel, isotropic:

    f(x) = ∑_{n=1}^N [ K′(‖(x − x_n)/σ‖²) / ∑_{n′=1}^N K′(‖(x − x_n′)/σ‖²) ] x_n.

• f^∞ = f ∘ f ∘ ··· maps points in R^D to stationary points of p.
• It is a particularly simple algorithm, with no step size.
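The isotropic Gaussian case above translates directly into code. The following NumPy sketch is my own illustration: gms_step computes p(n|x) and the weighted mean f(x), and seek_mode iterates it until the update falls below a tolerance (the log-sum-exp stabilisation is an implementation detail, not something the slides prescribe).

    import numpy as np

    def gms_step(x, X, sigma):
        """One Gaussian mean-shift step f(x) for the isotropic kde:
        posteriors p(n|x), then the weighted mean of the data."""
        logw = -0.5 * np.sum(((x - X) / sigma) ** 2, axis=1)  # log of unnormalised p(n|x)
        w = np.exp(logw - logw.max())                         # stabilised exponentials
        p = w / w.sum()                                       # p(n|x)
        return p @ X                                          # f(x) = sum_n p(n|x) x_n

    def seek_mode(x, X, sigma, tol=1e-3, max_iter=1000):
        """Iterate x <- f(x) until the update is smaller than tol."""
        for _ in range(max_iter):
            x_new = gms_step(x, X, sigma)
            if np.linalg.norm(x_new - x) < tol:
                return x_new
            x = x_new
        return x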
Gaussian mean-shift (GMS) as a clustering algorithm
• x_n and x_m are in the same cluster if they converge to the same mode.
• Nonparametric clustering, able to deal with complex cluster shapes; σ determines the number of clusters.
• Popular in computer vision: segmentation, tracking (Comaniciu & Meer).

Segmentation example (one data point x = (i, j, I) per pixel, where (i, j) = location and I = intensity or colour).
Pseudocode: GMS clustering
    for n ∈ {1, ..., N}                                For each data point
        x ← x_n                                        Starting point
        repeat                                         Iteration loop
            ∀n: p(n|x) ← exp(−½‖(x − x_n)/σ‖²) / ∑_{n′=1}^N exp(−½‖(x − x_n′)/σ‖²)    Post. prob. (E step)
            x ← ∑_{n=1}^N p(n|x) x_n                   Update x (M step)
        until x's update < tol
        z_n ← x                                        Mode
    end
    connected-components({z_n}_{n=1}^N, tol)           Clusters
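A runnable NumPy sketch of this clustering loop, assuming the same tolerance is used for convergence and for linking modes; the greedy linking below is a stand-in for the connected-components step, not the authors' implementation.

    import numpy as np

    def gms_cluster(X, sigma, tol=1e-3, max_iter=1000):
        """Gaussian mean-shift clustering: run the fixed-point iteration from every
        data point, then link modes that lie within tol of each other."""
        N = X.shape[0]
        modes = np.empty_like(X)
        for n in range(N):
            x = X[n].copy()
            for _ in range(max_iter):
                logw = -0.5 * np.sum(((x - X) / sigma) ** 2, axis=1)
                p = np.exp(logw - logw.max())
                p /= p.sum()                        # p(n|x)      (E step)
                x_new = p @ X                       # weighted mean (M step)
                if np.linalg.norm(x_new - x) < tol:
                    break
                x = x_new
            modes[n] = x_new
        # Stand-in for connected-components: greedily merge nearby modes.
        labels = np.empty(N, dtype=int)
        centres = []
        for n in range(N):
            for c, z in enumerate(centres):
                if np.linalg.norm(modes[n] - z) < tol:
                    labels[n] = c
                    break
            else:
                labels[n] = len(centres)
                centres.append(modes[n])
        return labels, np.array(centres)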
GMS is an EM algorithm
Define the following artificial maximum-likelihood problem (Carreira-Perpiñán & Williams '03):
• Model: a Gaussian mixture that moves as a rigid body ({π_n, x_n, Σ_n}_{n=1}^N fixed; the only parameter is x):

    q(y|x) = ∑_{n=1}^N π_n |2πΣ_n|^{−½} exp( −½ (y − (x_n − x))ᵀ Σ_n^{−1} (y − (x_n − x)) ).

• Data: just one point at the origin, y = 0.

Then the resulting expectation-maximisation (EM) algorithm that maximises the log-likelihood wrt the parameter x (the translation of the GM) is formally identical to the GMS step:
• E step: update the posterior probabilities p(n|x).
• M step: update the parameter x.
GMS is an EM algorithm (cont.)
The M step maximises the following lower bound on log p:

    ℓ(x) = log p(x^(τ)) − Q(x^(τ)|x^(τ)) + Q(x|x^(τ))

with ℓ(x^(τ)) = log p(x^(τ)) and ℓ(x) ≤ log p(x) ∀x ∈ R^D, where

    Q(x|x^(τ)) = constant − (1/(2σ²)) ∑_{n=1}^N p(n|x^(τ)) ‖x − x_n‖²

is the expected complete-data log-likelihood from the E step. Setting ∇_x Q(x|x^(τ)) = 0 gives x = ∑_{n=1}^N p(n|x^(τ)) x_n, i.e. exactly the GMS update.
Non-Gaussian mean-shift is a GEM algorithm
• Consider a mixture of non-Gaussian kernels.
• The derivation as a maximum-likelihood problem is analogous, but now the M step has no closed-form solution for x.
• However, the M step can be solved iteratively with a certain fixed-point iteration.
• Taking just one step of this fixed-point iteration gives an inexact M step and thus a generalised EM (GEM) algorithm, which is formally identical to the mean-shift algorithm (Carreira-Perpiñán '06).
Convergence properties of mean-shift
From the properties of (G)EM algorithms (Dempster et al. '77) we expect, generally speaking:
• Convergence to a mode from almost any starting point (it can also converge to saddle points and minima).
• Monotonic increase of p at every step (Jensen's inequality).
• Convergence of linear order.

However, particular mean-shift algorithms (for particular kernels) may show different properties. We analyse GMS in detail (isotropic case: Σ_n = σ²I).
Convergence properties of GMS
Taylor-expand the fixed-point mapping f around a mode x*:

    x^(τ+1) = f(x^(τ)) = x* + J(x*)(x^(τ) − x*) + O(‖x^(τ) − x*‖²)
    ⇒ x^(τ+1) − x* ≈ J(x*)(x^(τ) − x*)

where the Jacobian of f is

    J(x) = (1/σ²) ∑_{m=1}^M p(m|x)(µ_m − f(x))(µ_m − f(x))ᵀ

(here the mixture has M components with means µ_m; for the kde, M = N and µ_m = x_n). At a mode, J(x*) = (1/σ²) × (local covariance at x*).

Thus the convergence is linear, with rate

    r = lim_{τ→∞} ‖x^(τ+1) − x*‖ / ‖x^(τ) − x*‖ = λ_max(J(x*)) < 1.
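As an illustration of the rate formula, the following sketch (my own, assuming x_star is a mode already found by mean-shift on the kde, so that µ_m = x_n and f(x*) = x*) estimates r = λ_max(J(x*)) numerically.

    import numpy as np

    def gms_rate(x_star, X, sigma):
        """Estimate the linear convergence rate r = lambda_max(J(x*)) of GMS at a
        mode x*, using J(x*) = (1/sigma^2) sum_n p(n|x*)(x_n - x*)(x_n - x*)^T."""
        logw = -0.5 * np.sum(((x_star - X) / sigma) ** 2, axis=1)
        p = np.exp(logw - logw.max())
        p /= p.sum()                                  # p(n|x*)
        diff = X - x_star                             # (N, D)
        J = (p[:, None] * diff).T @ diff / sigma**2   # local covariance / sigma^2
        return np.linalg.eigvalsh(J).max()            # largest eigenvalue = rate r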
Convergence properties of GMS (cont.)
Special cases depending on σ:
• σ → 0 or σ → ∞ (practically uninteresting): r → 0, superlinear.
• σ at a mode merge: r = 1, sublinear (awfully slow).
• Intermediate σ (the practically useful range): r close to 1 (slow).

[Figure: left, kernel density estimates of a 1D dataset for σ = 0.2, 0.4, 1, 4; right, the convergence rate r as a function of σ (log scale).]
Convergence properties of GMS (cont.)
• The GMS iterates follow the local principal component at the mode and lie in the interior of the convex hull of the data points.
• Smooth path: the angle between consecutive steps lies in (−π/2, π/2) (Comaniciu & Meer '02).
• The GMS step size is not optimal along the GMS search direction.
Number of modes of a Gaussian mixture
In principle, one would expect N points to produce at most N modes (additional modes being an artifact of the kernel). But, in 1D:
• A Gaussian mixture with N components has at most N modes (Carreira-Perpiñán & Williams '03).
• A mixture of non-Gaussian kernels can have more than N modes, e.g. for K(x) ∝ 1/(1 + e^{x/10}) or for the Epanechnikov kernel.

[Figure: 1D kernel density estimates with these kernels showing more modes than components.]
Number of modes of a Gaussian mixture (cont.)
In D ≥ 2, even a mixture of isotropic Gaussians can have more than N modes (Carreira-Perpiñán & Williams '03).

[Figure: a 2D isotropic Gaussian mixture with more modes than components.]

However, in practice the number of modes is much smaller than N because components coalesce. This may be one reason why, in practice, GMS produces better segmentations than other kernels.
Convergence domains of GMS
In general, the convergence domains (the geometric locus of points that converge to each mode) are nonconvex, can be disconnected, and can take quite fancy shapes.
Convergence domains of GMS (cont.)
The boundary between some domains can be fractal.

These peculiar domains are probably undesirable for clustering, but the effect is small (it seems confined to the cluster boundaries).
Gaussian mean-shift as an EM algorithm: summary
• GMS is an EM algorithm.
• Non-Gaussian mean-shift is a GEM algorithm.
• GMS converges to a mode from almost any starting point.
• Convergence is linear (occasionally superlinear or sublinear) and slow in practice.
• The iterates approach a mode along its local principal component.
• Gaussian mixtures and kernel density estimates can have more modes than components (but this seems rare).
• The convergence domains of GMS can be fractal.
Acceleration strategies for GMS image segmentation
• Gaussian mean-shift (GMS) as an EM algorithm
• Acceleration strategies for GMS image segmentation
• Gaussian blurring mean-shift (GBMS)
Computational bottlenecks of GMS
GMS is slow: O(kN²D). Computational bottlenecks:
• B1: large average number of iterations, k ∼ 100 (linear convergence).
• B2: large cost per iteration, ∼ 2ND multiplications:
  – E step: ND to obtain p(n|x^(τ)).
  – M step: ND to obtain x^(τ+1).

Acceleration techniques must address B1 and/or B2. We propose four strategies:
• MS1: spatial discretisation
• MS2: spatial neighbourhood
• MS3: sparse EM
• MS4: EM–Newton
Evaluation of strategies
• Goal: achieve the same segmentation as GMS. Visual evaluation of the segmentation is not enough; we compute the segmentation error wrt the GMS segmentation (= number of pixels misclustered, as a % of the whole image).
• We report running times in normalised iterations (1 normalised iteration = 1 iteration of exact GMS) to ensure independence from implementation details.
• No pre- or postprocessing of clusters (e.g. removal of small clusters).

Why are there segmentation errors at all?
• The accelerated scheme may not converge to a mode of p(x).
• x_n may converge to a different mode than with exact GMS.
MS1: spatial discretisation

• Many different pixels converging to the same mode travel almost identical paths. We discretise the spatial domain by subdividing every pixel (i, j) into n×n cells; points projecting to the same cell share the same fate (see the sketch after this list).

[Figure: a pixel grid subdivided into cells, with mean-shift paths snapping to visited cells.]

• This works because paths in D dimensions are well approximated by their 2D projection onto the spatial domain.
• We stop iterating once we hit an already visited cell ⇒ massively reduces the total number of iterations.
• We first run a set of pixels uniformly distributed over the image (this finds all modes quickly).
• Converges to a mode.
• Segmentation error → 0 as n increases.
• Addresses B1. Parameter: n = 1, 2, 3, ... (discretisation level).
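A rough sketch of how such a cell cache might look (my own reading of MS1, not the authors' code): cells are indexed by the spatial coordinates scaled by n, a path stops as soon as it hits a cached cell, and the initial uniform sweep mentioned above is omitted for brevity.

    import numpy as np

    def ms1_cluster(X, sigma, n=2, tol=1e-3, max_iter=1000):
        """MS1-style sketch: X has one row (i, j, ...) per pixel. The spatial plane is
        split into cells of side 1/n pixel; once an iterate lands in a cell whose final
        mode is already known, the whole path inherits that mode."""
        N = X.shape[0]
        cell_mode = {}                     # cell index -> mode label
        modes = []
        labels = np.empty(N, dtype=int)
        for m in range(N):
            x = X[m].copy()
            path_cells, label = [], None
            for _ in range(max_iter):
                cell = tuple(np.floor(x[:2] * n).astype(int))   # 2D spatial projection
                if cell in cell_mode:                           # visited: share its fate
                    label = cell_mode[cell]
                    break
                path_cells.append(cell)
                logw = -0.5 * np.sum(((x - X) / sigma) ** 2, axis=1)
                p = np.exp(logw - logw.max())
                p /= p.sum()
                x_new = p @ X
                if np.linalg.norm(x_new - x) < tol:
                    break
                x = x_new
            if label is None:              # converged: match against known modes
                for c, z in enumerate(modes):
                    if np.linalg.norm(x_new - z) < tol:
                        label = c
                        break
                else:
                    label = len(modes)
                    modes.append(x_new)
            for cell in path_cells:        # cache the fate of every cell on the path
                cell_mode[cell] = label
            labels[m] = label
        return labels, np.array(modes)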
MS2: spatial neighbourhood

Approximate the E and M steps with a subset of the data points (rather than all N), namely a neighbourhood N(x) in the spatial domain (not the range domain). Finding spatial neighbours is free (unlike finding neighbours in the full space):

    p(n|x^(τ)) = exp(−½‖(x^(τ) − x_n)/σ‖²) / ∑_{n′∈N(x^(τ))} exp(−½‖(x^(τ) − x_n′)/σ‖²)

    x^(τ+1) = ∑_{n∈N(x^(τ))} p(n|x^(τ)) x_n

• Does not converge to a mode.
• Segmentation error → 0 as e increases.
• Addresses B2. Parameter: e ∈ (0, 1] (fraction of the data set used as neighbours).
MS3: sparse EM

Sparse EM (Neal & Hinton '98): coordinate ascent in the space of (x, p), where p are the posterior probabilities; this maximises the free energy F(p, x) = log p(x) − D(p ‖ p(n|x)) and hence also p(x).
• Run partial E steps frequently: update p(n|x) only for n ∈ S; S = the plausible set (nearest neighbours), kept constant over partial E steps. FAST.
• Run full E steps infrequently: update all p(n|x) and S. SLOW.

We choose S to contain as many neighbours as necessary to account for a total probability ∑_{n∈S} p(n|x) ≥ 1 − ε ∈ (0, 1]. Thus the fraction of data used, e, varies after each full step. Using a fixed e produced worse results.
• Converges to a mode no matter how S is chosen; computational savings if few full steps are taken.
• Segmentation error → 0 as ε decreases.
• Addresses B2. Parameter: ε ∈ [0, 1) (probability mass not in S).
MS4: EM–Newton

Start with EM steps, which quickly increase p. Switch to Newton steps when EM slows down (reverting to EM if a Newton step is bad). Specifically, try a Newton step when ‖x^(τ) − x^(τ−1)‖ < θ. Computing the Newton step x_N yields the EM step x_EM for free:

    Gradient:     g(x) = (p(x)/σ²) ∑_{n=1}^N p(n|x)(x_n − x) = (p(x)/σ²)(x_EM − x)

    Hessian:      H(x) = (p(x)/σ²) ( −I + (1/σ²) ∑_{n=1}^N p(n|x)(x_n − x)(x_n − x)ᵀ )

    Newton step:  x_N = x − H(x)^{−1} g(x).

• Converges to a mode with quadratic order.
• Segmentation error → 0 as θ decreases.
• Addresses B1. Parameter: θ > 0 (minimum EM step length), relative to σ.
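A rough NumPy sketch of this scheme (my own reading; the acceptance test that a Newton step must increase p(x), and the θ·σ threshold, are assumptions): the common positive factor p(x)/σ² is dropped from g and H since it cancels in the Newton step.

    import numpy as np

    def ms4_steps(x, X, sigma):
        """Compute the EM (mean-shift) step and the Newton step from the same
        posteriors. g and H below drop the common positive factor p(x)/sigma^2,
        which cancels in x_N = x - H^{-1} g."""
        N, D = X.shape
        logw = -0.5 * np.sum(((x - X) / sigma) ** 2, axis=1)
        p = np.exp(logw - logw.max())
        p /= p.sum()                                              # p(n|x)
        diff = X - x                                              # x_n - x
        g = p @ diff                                              # proportional to the gradient
        H = (p[:, None] * diff).T @ diff / sigma**2 - np.eye(D)   # proportional to the Hessian
        x_em = x + g                                              # EM step: sum_n p(n|x) x_n
        x_newton = x - np.linalg.solve(H, g)                      # Newton step
        return x_em, x_newton

    def ms4_update(x, X, sigma, prev_step, theta):
        """Try a Newton step only when EM has slowed down (prev_step < theta*sigma),
        and keep it only if it increases the density; otherwise take the EM step."""
        kde = lambda z: np.exp(-0.5 * np.sum(((z - X) / sigma) ** 2, axis=1)).mean()
        x_em, x_newton = ms4_steps(x, X, sigma)
        if prev_step < theta * sigma and kde(x_newton) > kde(x):
            return x_newton
        return x_em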
Computational cost
    Strategy                       Cost per iteration relative to exact GMS
    MS1: spatial discretisation    1
    MS2: spatial neighbourhood     e
    MS3: sparse EM                 2 if full step, e if partial step
    MS4: EM–Newton                 1 if EM step; 1 + (D+1)/4 if Newton step;
                                   3/2 + (D+1)/4 if EM step after a failed Newton step

e ∈ (0, 1] is the fraction of the data set used (the neighbourhood for MS2, the plausible set for MS3).
Using iterations normalised wrt GMS allows direct comparison of the numbers of iterations in the experiments.
Experimental results with image segmentation
• Dataset: x_n = (i_n, j_n, I_n) (greyscale) or x_n = (i_n, j_n, L*_n, u*_n, v*_n) (colour), where (i, j) is the pixel's spatial position. Prescale I or (L*, u*, v*) so that the same σ applies to all dimensions (in pixels).
• Best segmentations are obtained for large bandwidths: σ ≈ (1/5) × (image side).
• We study all strategies on different images over a range of σ.
• Convergence tolerance tol = 10⁻³ pixels.

[Figure: for a 100×100 image, the number of clusters (left) and the total number of GMS iterations (right) as functions of σ ∈ [8, 24].]
Experimental results with image segmentation (cont.)
Segmentation results for each method under its optimal parameter value, for σ = 12 (number of modes, segmentation error wrt GMS, total number of iterations):

    GMS:  5 modes,                iterations = 823937
    MS1:  5 modes, error 1.62%,   iterations = 33791
    MS2:  5 modes, error 0.02%,   iterations = 324694
    MS3:  5 modes, error 0.00%,   iterations = 340095
    MS4:  5 modes, error 1.98%,   iterations = 141904

[Figure: the resulting segmentations, per-pixel iteration maps, and histograms of the number of iterations per pixel for each method.]
Experimental results with image segmentation (cont.)
Optimal parameter value for each method (the value that minimises the iterations ratio subject to achieving an error < 3%) as a function of σ.

[Figure: optimal n (MS1), e (MS2), ε (MS3) and θ (MS4) as functions of σ ∈ [8, 24].]
Experimental results with image segmentation (cont.)
Clustering error (percent) and number of iterations for each method under its optimal parameter value, as a function of σ.

[Figure: error (left) and total iterations (right) vs σ ∈ [8, 24] for GMS and MS1–MS4.]
Experimental results with image segmentation (cont.)
Clustering error (percent) and computational cost for each method as a function of its parameter.

[Figure: for each method (MS1: n, MS2: e, MS3: ε, MS4: θ) and for σ = 8, 12, 16, 20, 24, the error and the iteration ratio wrt GMS as functions of the parameter.]
Experimental results with image segmentation (cont.)
Plot of all iterates for all starting pixels: most cells in MS1 are empty.

[Figure: left, all GMS iterates over the image plane; centre, the proportion of cells visited by MS1 (less than nN out of n²N) as a function of n; right, the average number of iterations per pixel for MS1 as a function of the image side √N, for n = 1, ..., 6.]
Acceleration strategies: summary

• With parameters set optimally, all methods can obtain nearly the same segmentation as GMS with significant speedups. Near-optimal parameter values can easily be set in advance for three of the methods, and somewhat more heuristically for the remaining one.
• MS1 (spatial discretisation): best strategy; 10×–100× speedup (an average of only k = 2–4 iterations per pixel) with very small error, controlled by the discretisation level.
• MS4 (EM–Newton): second best; 1.5×–6× speedup.
• MS2, MS3 (neighbourhood methods): 1×–3× speedup; MS3 attains the lowest (near-zero) error, while MS2 can give unacceptably large errors for a suboptimal setting of its parameter (the neighbourhood size). The neighbourhood methods are less effective because GMS needs very large neighbourhoods (comparable to the data set size for the larger bandwidths).
Acceleration strategies: summary (cont.)
Other methods:
• Approximate algorithms for neighbour search (e.g. kd-trees): less effective, because GMS needs large neighbourhoods.
• Other optimisation algorithms (e.g. quasi-Newton): need to ensure the first steps are EM steps.
• Fast Gauss transform: can be combined with our methods.

Extensions:
• Adaptive and non-isotropic bandwidths: all 4 methods.
• Non-image data: MS3 and MS4 readily applicable. Is it possible to extend MS1 to higher dimensions?
Gaussian blurring mean-shift (GBMS)
• Gaussian mean-shift (GMS) as an EM algorithm
• Acceleration strategies for GMS image segmentation
• Gaussian blurring mean-shift (GBMS)
Gaussian blurring mean-shift (GBMS)
• This is the algorithm that Fukunaga & Hostetler really proposed, but it has gone largely unnoticed.
• Same iterative scheme, but now the data points themselves move at each iteration. Result: a sequence of progressively shrunk datasets X^(0), X^(1), ... converging to a single, all-points-coincident cluster.

[Figure: snapshots of the dataset X^(τ) for τ = 0, ..., 11, progressively collapsing into clusters.]
Pseudocode: GBMS clustering
    repeat                                             Iteration loop
        for m ∈ {1, ..., N}                            For each data point
            ∀n: p(n|x_m) ← exp(−½‖(x_m − x_n)/σ‖²) / ∑_{n′=1}^N exp(−½‖(x_m − x_n′)/σ‖²)
            y_m ← ∑_{n=1}^N p(n|x_m) x_n               One GMS step
        end
        ∀m: x_m ← y_m                                  Update the whole dataset
    until stop                                         Stopping criterion
    connected-components({x_n}_{n=1}^N, tol)           Clusters
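A compact NumPy sketch of this loop (my own illustration); it uses a plain small-update check in place of the entropy-based stopping criterion described on the next slides.

    import numpy as np

    def gbms(X, sigma, max_iter=100, tol=1e-5):
        """Gaussian blurring mean-shift: every point takes one GMS step with respect
        to the CURRENT dataset, then the whole dataset is replaced at once."""
        X = X.copy()
        for _ in range(max_iter):
            d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2) / sigma**2
            W = np.exp(-0.5 * d2)                  # w_{mn} = exp(-1/2 ||(x_m - x_n)/sigma||^2)
            P = W / W.sum(axis=1, keepdims=True)   # row m holds p(n|x_m)
            X_new = P @ X                          # y_m = sum_n p(n|x_m) x_n
            if np.abs(X_new - X).mean() < tol:     # simple stopping rule (see next slides)
                return X_new
            X = X_new
        return X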
Stopping criterion for GBMS
• GBMS converges to an all-points-coincident cluster (Cheng '95). Proof idea: the diameter of the data set decreases at least geometrically.
• However, GBMS shows two phases:
  – Phase 1: points merge into clusters of coincident points (a few iterations); we want to stop here.
  – Phase 2: the clusters keep approaching and merging (a few to hundreds of iterations); this slowly erases the clustering structure.
• We need a stopping criterion (as opposed to a convergence criterion) that stops just after phase 1.
• Simply checking ‖X^(τ) − X^(τ−1)‖ < tol does not work, because the points are always moving.
• Instead, consider the histogram of the updates {e_n^(τ)}_{n=1}^N, with e_n^(τ) = ‖x_n^(τ) − x_n^(τ−1)‖. Though the histograms change as points move, in phase 2 the entropy H does not change (the histogram bins change their positions but not their values).
Stopping criterion for GBMS (cont.)
[Figure: dataset snapshots and histograms of the per-point updates e_n^(τ) for τ = 0, ..., 20; after the clusters form (phase 1), the histogram values, and hence the entropy, stop changing.]
Stopping criterion for GBMS (cont.)
Stopping criterion:

    ( |H(e^(τ+1)) − H(e^(τ))| < 10⁻⁸ )  OR  ( (1/N) ∑_{n=1}^N e_n^(τ+1) < tol )

[Figure: the average update (left) and the update entropy (right) as functions of τ; the average update does not become small enough, but the entropy stops changing after phase 1.]
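A sketch of this criterion (my own; the number of histogram bins and the function names are assumptions, since the slides do not specify how the histogram is binned).

    import numpy as np

    def update_entropy(e, bins=20):
        """Entropy of the histogram of per-point update lengths e_n."""
        counts, _ = np.histogram(e, bins=bins)
        q = counts / counts.sum()
        q = q[q > 0]
        return -(q * np.log(q)).sum()

    def gbms_should_stop(X_prev, X_curr, H_prev, tol=1e-3, ent_tol=1e-8):
        """Stop when the update-histogram entropy stops changing OR the mean update
        is already very small. Returns (stop, H_curr) so the caller can carry H over."""
        e = np.linalg.norm(X_curr - X_prev, axis=1)   # e_n = ||x_n^(tau) - x_n^(tau-1)||
        H_curr = update_entropy(e)
        stop = (H_prev is not None and abs(H_curr - H_prev) < ent_tol) or e.mean() < tol
        return stop, H_curr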
Convergence rate of GBMS
GBMS shrinks a Gaussian cluster towards its mean with cubic convergence rate. Proof: work with a continuous pdf,

    p(x) = (1/N) ∑_{n=1}^N K(‖(x − x_n)/σ‖²)   ⇒   p(x) = ∫_{R^D} q(y) K(x − y) dy.

Each point x of the new dataset is

    x̃ = ∫_{R^D} p(y|x) y dy = E{y|x}  ∀x ∈ R^D,   with  p(y|x) = K(x − y) q(y) / p(x).

Considering 1D w.l.o.g. (the dimensions decouple):

    q(x) = N(x; 0, s²) ⇒ p(x) = N(x; 0, s² + σ²),   and the new dataset has density q̃(x) = N(x; 0, (rs)²)

with r = 1 / (1 + (σ/s)²) ∈ (0, 1).
Convergence rate of GBMS (cont.)

Recurrence relation for the standard deviation s^(τ):

    s^(τ+1) = s^(τ) / (1 + (σ/s^(τ))²)   for s^(0) > 0,

which converges to 0 with cubic order, i.e. lim_{τ→∞} |s^(τ+1) − s^(∞)| / |s^(τ) − s^(∞)|³ < ∞. Effectively, σ increases at every step.

[Figure: a 2D Gaussian cluster at τ = 0, ..., 5; its extent shrinks as 7.0, 4.2, 1.5, 0.1, 4·10⁻⁵, 2·10⁻¹⁵.]

The cluster shrinks without changing its orientation or mean. The principal direction shrinks more slowly relative to the other directions.
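A quick numerical check of the recurrence (my own illustration, with σ = 0.5 and s^(0) = 1 chosen arbitrarily): s decreases slowly at first and then collapses abruptly, as expected from the cubic rate.

    # Iterate the recurrence s <- s / (1 + (sigma/s)^2) from s = 1 with sigma = 0.5
    # and print the standard deviation at each step.
    s, sigma = 1.0, 0.5
    for tau in range(8):
        print(tau, s)
        s = s / (1.0 + (sigma / s) ** 2)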
Convergence rate of GBMS (cont.)
In summary, a Gaussian distribution remains Gaussian with the same mean, but each principal axis decreases cubically. This explains the practical behaviour of GBMS:
1. Clusters collapse extremely fast (clustering).
2. After a few iterations only the local principal component survives ⇒ temporary linearly-shaped clusters (denoising).

[Figure: a dataset at τ = 1, 2, 3, with clusters collapsing onto line-like structures.]
Convergence rate of GBMS (cont.)
Number of GBMS iterations τ necessary to achieve s^(τ) < tol, as a function of the bandwidth σ, for s^(0) = 1:

[Figure: τ vs σ ∈ [0, 2] on a log scale, for tol = 10⁻³ and tol = 10⁻¹⁰.]

Note that GMS converges with linear order (p = 1), thus requiring many iterations to converge to a mode.
Connection with spectral clustering
GBMS pseudocode (inner loop):

    for m ∈ {1, ..., N}                                For each data point
        ∀n: p(n|x_m) ← exp(−½‖(x_m − x_n)/σ‖²) / ∑_{n′=1}^N exp(−½‖(x_m − x_n′)/σ‖²)
        y_m ← ∑_{n=1}^N p(n|x_m) x_n                   One GMS step
    end
    ∀m: x_m ← y_m                                      Update the whole dataset

Same pseudocode in matrix form:

    W = ( exp(−½‖(x_m − x_n)/σ‖²) )_{nm}               Gaussian affinity matrix
    D = diag( ∑_{n=1}^N w_nm )                         Degree (normalising) matrix
    X ← X W D⁻¹                                        Update the whole dataset
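In code, the matrix-form update is one line of linear algebra. The sketch below (my own, storing X as a D×N matrix of column vectors to match X ← X W D⁻¹) is the form that the next slide reads as a repeated product with the random-walk matrix.

    import numpy as np

    def gbms_matrix_step(X, sigma):
        """One GBMS step in matrix form, with X a D x N matrix of column vectors:
        X <- X W D^{-1}. Column m of W D^{-1} holds the posteriors p(n|x_m)."""
        diff = X[:, :, None] - X[:, None, :]                      # pairwise differences
        W = np.exp(-0.5 * np.sum(diff ** 2, axis=0) / sigma**2)   # Gaussian affinities w_{nm}
        P = W / W.sum(axis=0, keepdims=True)                      # W D^{-1}: columns sum to 1
        return X @ P                                              # new x_m = sum_n p(n|x_m) x_n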
Connection with spectral clustering (cont.)
• GBMS can be written as repeated products X ← XP with the random-walk matrix P = WD⁻¹, equivalent to the graph Laplacian (W: Gaussian affinities, D: degree matrix, P: posterior probabilities p(n|x_m)).
• In phase 1, P changes quickly as the points cluster.
• In phase 2, P is almost constant (and perfectly blocky), so GBMS implicitly extracts its leading eigenvectors (power method), like spectral clustering.
• Thus GBMS is much faster than computing eigenvectors (about 5 matrix-vector products are enough).
• Actually, since P is a positive matrix it has a single leading eigenvector (Perron–Frobenius theorem) with constant components, so eventually all points collapse.
Connection with spectral clustering (cont.)
Leading 7 eigenvectors u_1, ..., u_7 ∈ R^N and leading 20 eigenvalues µ_1, ..., µ_20 ∈ (0, 1] of P = (p(n|x_m))_{nm}:

[Figure: eigenvectors u_2, ..., u_7 and the eigenvalue spectrum µ_n at iterations τ = 0, 2, 4, 6, 11.]
Accelerated GBMS algorithm
• At each iteration, replace the clusters already formed with a single point whose mass equals the number of points in the cluster.
• The algorithm is now an alternation of GMS blurring steps and connected-component reduction steps.
• Equivalent to the original GBMS, but faster.
• The effective number of points N^(τ) (and thus the computation) decreases very quickly.
• Computational cost, with k₁ = average number of iterations per point for GMS and k₂ = number of iterations for GBMS and accelerated GBMS:

    GMS: 2N²Dk₁      GBMS: (3/2)N²Dk₂      Accelerated GBMS: (3/2)D ∑_{τ=1}^{k₂} (N^(τ−1))²
Experiments with image segmentation
[Figure: original image, GMS segmentation and GBMS segmentation for the "cameraman" and "hand" images.]

Number of iterations:

    Image                  GMS               GBMS            Accelerated GBMS
    cameraman (124×124)    71.5 (σ = 24.2)   18 (σ = 20.3)   4.6 (σ = 20.3)
    hand (137×110)         36.4 (σ = 33)     14 (σ = 24)     4.8 (σ = 24)
Experiments with image segmentation (cont.)
Effective number of points N^(τ) in accelerated GBMS:

[Figure: N^(τ) vs τ for the cameraman and hand images, dropping rapidly within the first few iterations.]

Accelerated GBMS is:
• 2×–4× faster than GBMS
• 5×–60× faster than GMS
Experiments with image segmentation (cont.)
[Figure: left, GMS: number of clusters C and iterations vs σ, with segmentations at σ = 8.4 (C = 6), σ = 10.1 (C = 4), σ = 18.3 (C = 2); right, GBMS and accelerated GBMS: number of clusters, iterations and speedup vs σ, with segmentations at σ = 8.1 (C = 6), σ = 8.8 (C = 4), σ = 15.8 (C = 2).]
Experiments with image segmentation (cont.)
• Segmentations of similar quality to those of GMS, but faster.
• Computational cost O(kN²). Possible further accelerations:
  – Fast Gauss transform
  – Sparse affinities W:
    · truncated Gaussian kernel
    · proximity graph (k nearest neighbours, etc.)
• The bandwidth σ is set by the user; good values:
  – GMS: σ ≈ (1/5) × (image side)
  – GBMS: somewhat smaller.
  One can also try values of σ on a scaled-down version of the image (fast) and then scale σ up.
GBMS: summary
• We have turned a neglected algorithm originally due to Fukunaga & Hostetler into a practical one by introducing a reliable stopping criterion and an acceleration.
• Fast convergence (cubic order for Gaussian clusters).
• Connection with spectral clustering (GBMS interleaves refining the affinities with power iterations).
• Excellent segmentation results, faster than GMS and spectral clustering; the only parameter is σ (which controls the granularity).
• Very simple to implement.
• We hope to see it more widely applied.