Gaussian mean-shift algorithms

Miguel Á. Carreira-Perpiñán
Dept. of Computer Science & Electrical Engineering, OGI/OHSU
http://www.csee.ogi.edu/~miguel
Gaussian mean-shift (GMS) as an EM algorithm

• Gaussian mean-shift (GMS) as an EM algorithm
• Acceleration strategies for GMS image segmentation
• Gaussian blurring mean-shift (GBMS)
Gaussian mixtures, kernel density estimates
Given a dataset X = {x_n}_{n=1}^N ⊂ R^D, define a Gaussian kernel density estimate with bandwidth σ:

    p(x) = (1/N) ∑_{n=1}^N K(‖(x − x_n)/σ‖²),    K(t) ∝ e^{−t/2}.

• Other kernels:
  – Epanechnikov: K(t) ∝ 1 − t for t ∈ [0, 1), and 0 for t ≥ 1 (finite support).
  – Student's t: K(t) ∝ (1 + t/α)^{−(α+D)/2}.
• Useful way to represent the density of a data set.
• Extremely popular in machine learning and statistics.
• Objective here: find modes of p(x).
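For concreteness, here is a minimal NumPy sketch of this kde (not part of the original slides); the function name and the use of the unnormalised Gaussian kernel K(t) ∝ e^{−t/2} are my own choices.

    import numpy as np

    def gaussian_kde(x, X, sigma):
        """Evaluate the isotropic Gaussian kde p(x) (up to the kernel's
        normalisation constant) at a single point x, given data X of shape (N, D)."""
        d2 = np.sum(((x - X) / sigma) ** 2, axis=1)   # ||(x - x_n)/sigma||^2 for every n
        return np.exp(-0.5 * d2).mean()               # (1/N) sum_n K(.), K(t) ∝ e^{-t/2}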
Why find modes? Nonparametric clustering
[Figure: a 1D density p(x) whose modes define clusters C1, C2, C3.]

• Modes ↔ clusters.
• Assign each point x to a mode.
• Nonparametric:
  – no model for the clusters' shape
  – no predetermined number of clusters.
Why find modes? Multivalued mappings
[Figure: joint density p(x, y) (a Gaussian mixture) and the conditional p(x|y = y0), whose modes x0, x1, x2 are the images of y0.]

• Given a density model for (x, y), represent a multivalued mapping y → x by the modes of the conditional distribution p(x|y). Map y0 to {x0, x1, x2}.
• p(x, y) is a Gaussian mixture ⇒ p(x|y) is a Gaussian mixture.
Mode-finding algorithms
• Idea: start a hill-climbing algorithm from every centroid. This typically finds all modes in practice (Carreira-Perpiñán '00).
• Hill-climbing algorithms:
  – Gradient ascent
  – Newton's method
  – etc.
  – Fixed-point iteration: the mean-shift algorithm.
    It is based on ideas by Fukunaga & Hostetler '75 (also Cheng '95, Carreira-Perpiñán '00, Comaniciu & Meer '02, etc.).
The mean-shift algorithm
It is obtained by reorganising the stationary-point equation ∇p(x) = 0 as a fixed-point iteration x^(τ+1) = f(x^(τ)). For example, for an isotropic kde p(x) = (1/N) ∑_{n=1}^N K(‖(x − x_n)/σ‖²):

    ∇p(x) = (1/N) ∑_{n=1}^N K′(‖(x − x_n)/σ‖²) (2/σ²)(x − x_n)

          = (2/(Nσ²)) [∑_{n=1}^N K′(‖(x − x_n)/σ‖²)] x − (2/(Nσ²)) ∑_{n=1}^N K′(‖(x − x_n)/σ‖²) x_n = 0

    ⇒ x = ∑_{n=1}^N [ K′(‖(x − x_n)/σ‖²) / ∑_{n′=1}^N K′(‖(x − x_n′)/σ‖²) ] x_n.
The mean-shift algorithm (cont.)
Gaussian, full covariance:

    f(x) = ( ∑_{n=1}^N p(n|x) Σ_n^{−1} )^{−1} ∑_{n=1}^N p(n|x) Σ_n^{−1} x_n

Gaussian, isotropic:

    f(x) = ∑_{n=1}^N p(n|x) x_n,    p(n|x) = e^{−½‖(x − x_n)/σ‖²} / ∑_{n′=1}^N e^{−½‖(x − x_n′)/σ‖²}

Other kernel, isotropic:

    f(x) = ∑_{n=1}^N [ K′(‖(x − x_n)/σ‖²) / ∑_{n′=1}^N K′(‖(x − x_n′)/σ‖²) ] x_n.

• f^∞ = f ∘ f ∘ ··· maps points in R^D to stationary points of p.
• It is a particularly simple algorithm, with no step size.
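The isotropic Gaussian case above translates directly into code. The following NumPy sketch is my own illustration: gms_step computes p(n|x) and the weighted mean f(x), and seek_mode iterates it until the update falls below a tolerance (the log-sum-exp stabilisation is an implementation detail, not something the slides prescribe).

    import numpy as np

    def gms_step(x, X, sigma):
        """One Gaussian mean-shift step f(x) for the isotropic kde:
        posteriors p(n|x), then the weighted mean of the data."""
        logw = -0.5 * np.sum(((x - X) / sigma) ** 2, axis=1)  # log of unnormalised p(n|x)
        w = np.exp(logw - logw.max())                         # stabilised exponentials
        p = w / w.sum()                                       # p(n|x)
        return p @ X                                          # f(x) = sum_n p(n|x) x_n

    def seek_mode(x, X, sigma, tol=1e-3, max_iter=1000):
        """Iterate x <- f(x) until the update is smaller than tol."""
        for _ in range(max_iter):
            x_new = gms_step(x, X, sigma)
            if np.linalg.norm(x_new - x) < tol:
                return x_new
            x = x_new
        return x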
Gaussian mean-shift (GMS) as a clustering algorithm
• x_n and x_m are in the same cluster if they converge to the same mode.
• Nonparametric clustering, able to deal with complex cluster shapes; σ determines the number of clusters.
• Popular in computer vision: segmentation, tracking (Comaniciu & Meer).

Segmentation example (one data point x = (i, j, I) per pixel, where (i, j) = location and I = intensity or colour).
Pseudocode: GMS clustering
    for n ∈ {1, ..., N}                                For each data point
        x ← x_n                                        Starting point
        repeat                                         Iteration loop
            ∀n: p(n|x) ← exp(−½‖(x − x_n)/σ‖²) / ∑_{n′=1}^N exp(−½‖(x − x_n′)/σ‖²)    Post. prob. (E step)
            x ← ∑_{n=1}^N p(n|x) x_n                   Update x (M step)
        until x's update < tol
        z_n ← x                                        Mode
    end
    connected-components({z_n}_{n=1}^N, tol)           Clusters
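A runnable NumPy sketch of this clustering loop, assuming the same tolerance is used for convergence and for linking modes; the greedy linking below is a stand-in for the connected-components step, not the authors' implementation.

    import numpy as np

    def gms_cluster(X, sigma, tol=1e-3, max_iter=1000):
        """Gaussian mean-shift clustering: run the fixed-point iteration from every
        data point, then link modes that lie within tol of each other."""
        N = X.shape[0]
        modes = np.empty_like(X)
        for n in range(N):
            x = X[n].copy()
            for _ in range(max_iter):
                logw = -0.5 * np.sum(((x - X) / sigma) ** 2, axis=1)
                p = np.exp(logw - logw.max())
                p /= p.sum()                        # p(n|x)      (E step)
                x_new = p @ X                       # weighted mean (M step)
                if np.linalg.norm(x_new - x) < tol:
                    break
                x = x_new
            modes[n] = x_new
        # Stand-in for connected-components: greedily merge nearby modes.
        labels = np.empty(N, dtype=int)
        centres = []
        for n in range(N):
            for c, z in enumerate(centres):
                if np.linalg.norm(modes[n] - z) < tol:
                    labels[n] = c
                    break
            else:
                labels[n] = len(centres)
                centres.append(modes[n])
        return labels, np.array(centres)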
GMS is an EM algorithm
Define the following artificial maximum-likelihood problem (Carreira-Perpiñán & Williams '03):
• Model: a Gaussian mixture that moves as a rigid body ({π_n, x_n, Σ_n}_{n=1}^N fixed; the only parameter is x):

    q(y|x) = ∑_{n=1}^N π_n |2πΣ_n|^{−½} exp( −½ (y − (x_n − x))ᵀ Σ_n^{−1} (y − (x_n − x)) ).

• Data: just one point at the origin, y = 0.

Then the resulting expectation-maximisation (EM) algorithm that maximises the log-likelihood wrt the parameter x (the translation of the GM) is formally identical to the GMS step:
• E step: update the posterior probabilities p(n|x).
• M step: update the parameter x.
GMS is an EM algorithm (cont.)
The M step maximises the following lower bound on log p:

    ℓ(x) = log p(x^(τ)) − Q(x^(τ)|x^(τ)) + Q(x|x^(τ))

with ℓ(x^(τ)) = log p(x^(τ)) and ℓ(x) ≤ log p(x) ∀x ∈ R^D, where

    Q(x|x^(τ)) = constant − (1/(2σ²)) ∑_{n=1}^N p(n|x^(τ)) ‖x − x_n‖²

is the expected complete-data log-likelihood from the E step. Setting ∇_x Q(x|x^(τ)) = 0 gives x = ∑_{n=1}^N p(n|x^(τ)) x_n, i.e. exactly the GMS update.
Non-Gaussian mean-shift is a GEM algorithm
• Consider a mixture of non-Gaussian kernels.
• The derivation as a maximum-likelihood problem is analogous, but now the M step has no closed-form solution for x.
• However, the M step can be solved iteratively with a certain fixed-point iteration.
• Taking just one step of this fixed-point iteration gives an inexact M step and thus a generalised EM (GEM) algorithm, which is formally identical to the mean-shift algorithm (Carreira-Perpiñán '06).
Convergence properties of mean-shift
From the properties of (G)EM algorithms (Dempster et al. '77) we expect, generally speaking:
• Convergence to a mode from almost any starting point (it can also converge to saddle points and minima).
• Monotonic increase of p at every step (Jensen's inequality).
• Convergence of linear order.

However, particular mean-shift algorithms (for particular kernels) may show different properties. We analyse GMS in detail (isotropic case: Σ_n = σ²I).
Convergence properties of GMS
Taylor-expand the fixed-point mapping f around a mode x*:

    x^(τ+1) = f(x^(τ)) = x* + J(x*)(x^(τ) − x*) + O(‖x^(τ) − x*‖²)
    ⇒ x^(τ+1) − x* ≈ J(x*)(x^(τ) − x*)

where the Jacobian of f is

    J(x) = (1/σ²) ∑_{m=1}^M p(m|x)(µ_m − f(x))(µ_m − f(x))ᵀ

(here the mixture has M components with means µ_m; for the kde, M = N and µ_m = x_n). At a mode, J(x*) = (1/σ²) × (local covariance at x*).

Thus the convergence is linear, with rate

    r = lim_{τ→∞} ‖x^(τ+1) − x*‖ / ‖x^(τ) − x*‖ = λ_max(J(x*)) < 1.
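As an illustration of the rate formula, the following sketch (my own, assuming x_star is a mode already found by mean-shift on the kde, so that µ_m = x_n and f(x*) = x*) estimates r = λ_max(J(x*)) numerically.

    import numpy as np

    def gms_rate(x_star, X, sigma):
        """Estimate the linear convergence rate r = lambda_max(J(x*)) of GMS at a
        mode x*, using J(x*) = (1/sigma^2) sum_n p(n|x*)(x_n - x*)(x_n - x*)^T."""
        logw = -0.5 * np.sum(((x_star - X) / sigma) ** 2, axis=1)
        p = np.exp(logw - logw.max())
        p /= p.sum()                                  # p(n|x*)
        diff = X - x_star                             # (N, D)
        J = (p[:, None] * diff).T @ diff / sigma**2   # local covariance / sigma^2
        return np.linalg.eigvalsh(J).max()            # largest eigenvalue = rate r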
Convergence properties of GMS (cont.)
Special cases depending on σ:
• σ → 0 or σ → ∞ (practically uninteresting): r → 0, superlinear.
• σ at a mode merge: r = 1, sublinear (awfully slow).
• Intermediate σ (the practically useful range): r close to 1 (slow).

[Figure: left, kernel density estimates of a 1D dataset for σ = 0.2, 0.4, 1, 4; right, the convergence rate r as a function of σ (log scale).]
Convergence properties of GMS (cont.)
• The GMS iterates follow the local principal component at the mode and lie in the interior of the convex hull of the data points.
• Smooth path: the angle between consecutive steps lies in (−π/2, π/2) (Comaniciu & Meer '02).
• The GMS step size is not optimal along the GMS search direction.
Number of modes of a Gaussian mixture
In principle, one would expect N points to produce at most N modes (additional modes being an artifact of the kernel). But, in 1D:
• A Gaussian mixture with N components has at most N modes (Carreira-Perpiñán & Williams '03).
• A mixture of non-Gaussian kernels can have more than N modes, e.g. for K(x) ∝ 1/(1 + e^{x/10}) or for the Epanechnikov kernel.

[Figure: 1D kernel density estimates with these kernels showing more modes than components.]
Number of modes of a Gaussian mixture (cont.)
In D ≥ 2, even a mixture of isotropic Gaussians can have more than N modes (Carreira-Perpiñán & Williams '03).

[Figure: a 2D isotropic Gaussian mixture with more modes than components.]

However, in practice the number of modes is much smaller than N because components coalesce. This may be one reason why, in practice, GMS produces better segmentations than other kernels.
Convergence domains of GMS
In general, the convergence domains (the geometric locus of points that converge to each mode) are nonconvex, can be disconnected, and can take quite fancy shapes.
Convergence domains of GMS (cont.)
The boundary between some domains can be fractal.

These peculiar domains are probably undesirable for clustering, but the effect is small (it seems confined to the cluster boundaries).
Gaussian mean-shift as an EM algorithm: summary
• GMS is an EM algorithm.
• Non-Gaussian mean-shift is a GEM algorithm.
• GMS converges to a mode from almost any starting point.
• Convergence is linear (occasionally superlinear or sublinear) and slow in practice.
• The iterates approach a mode along its local principal component.
• Gaussian mixtures and kernel density estimates can have more modes than components (but this seems rare).
• The convergence domains of GMS can be fractal.
Acceleration strategies for GMS image segmentation
• Gaussian mean-shift (GMS) as an EM algorithm
• Acceleration strategies for GMS image segmentation
• Gaussian blurring mean-shift (GBMS)
Computational bottlenecks of GMS
GMS is slow: O(kN²D). Computational bottlenecks:
• B1: large average number of iterations, k ∼ 100 (linear convergence).
• B2: large cost per iteration, ∼ 2ND multiplications:
  – E step: ND to obtain p(n|x^(τ)).
  – M step: ND to obtain x^(τ+1).

Acceleration techniques must address B1 and/or B2. We propose four strategies:
• MS1: spatial discretisation
• MS2: spatial neighbourhood
• MS3: sparse EM
• MS4: EM–Newton
Evaluation of strategies
• Goal: achieve the same segmentation as GMS. Visual evaluation of the segmentation is not enough; we compute the segmentation error wrt the GMS segmentation (= number of pixels misclustered, as a % of the whole image).
• We report running times in normalised iterations (1 normalised iteration = 1 iteration of exact GMS) to ensure independence from implementation details.
• No pre- or postprocessing of clusters (e.g. removal of small clusters).

Why are there segmentation errors at all?
• The accelerated scheme may not converge to a mode of p(x).
• x_n may converge to a different mode than with exact GMS.
MS1: spatial discretisation

• Many different pixels converging to the same mode travel almost identical paths. We discretise the spatial domain by subdividing every pixel (i, j) into n×n cells; points projecting to the same cell share the same fate (see the sketch after this list).

[Figure: a pixel grid subdivided into cells, with mean-shift paths snapping to visited cells.]

• This works because paths in D dimensions are well approximated by their 2D projection onto the spatial domain.
• We stop iterating once we hit an already visited cell ⇒ massively reduces the total number of iterations.
• We first run a set of pixels uniformly distributed over the image (this finds all modes quickly).
• Converges to a mode.
• Segmentation error → 0 as n increases.
• Addresses B1. Parameter: n = 1, 2, 3, ... (discretisation level).
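A rough sketch of how such a cell cache might look (my own reading of MS1, not the authors' code): cells are indexed by the spatial coordinates scaled by n, a path stops as soon as it hits a cached cell, and the initial uniform sweep mentioned above is omitted for brevity.

    import numpy as np

    def ms1_cluster(X, sigma, n=2, tol=1e-3, max_iter=1000):
        """MS1-style sketch: X has one row (i, j, ...) per pixel. The spatial plane is
        split into cells of side 1/n pixel; once an iterate lands in a cell whose final
        mode is already known, the whole path inherits that mode."""
        N = X.shape[0]
        cell_mode = {}                     # cell index -> mode label
        modes = []
        labels = np.empty(N, dtype=int)
        for m in range(N):
            x = X[m].copy()
            path_cells, label = [], None
            for _ in range(max_iter):
                cell = tuple(np.floor(x[:2] * n).astype(int))   # 2D spatial projection
                if cell in cell_mode:                           # visited: share its fate
                    label = cell_mode[cell]
                    break
                path_cells.append(cell)
                logw = -0.5 * np.sum(((x - X) / sigma) ** 2, axis=1)
                p = np.exp(logw - logw.max())
                p /= p.sum()
                x_new = p @ X
                if np.linalg.norm(x_new - x) < tol:
                    break
                x = x_new
            if label is None:              # converged: match against known modes
                for c, z in enumerate(modes):
                    if np.linalg.norm(x_new - z) < tol:
                        label = c
                        break
                else:
                    label = len(modes)
                    modes.append(x_new)
            for cell in path_cells:        # cache the fate of every cell on the path
                cell_mode[cell] = label
            labels[m] = label
        return labels, np.array(modes)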
MS2: spatial neighbourhood

Approximate the E and M steps with a subset of the data points (rather than all N), namely a neighbourhood N(x) in the spatial domain (not the range domain). Finding spatial neighbours is free (unlike finding neighbours in the full space):

    p(n|x^(τ)) = exp(−½‖(x^(τ) − x_n)/σ‖²) / ∑_{n′∈N(x^(τ))} exp(−½‖(x^(τ) − x_n′)/σ‖²)

    x^(τ+1) = ∑_{n∈N(x^(τ))} p(n|x^(τ)) x_n

• Does not converge to a mode.
• Segmentation error → 0 as e increases.
• Addresses B2. Parameter: e ∈ (0, 1] (fraction of the data set used as neighbours).
MS3: sparse EM

Sparse EM (Neal & Hinton '98): coordinate ascent in the space of (x, p), where p are the posterior probabilities; this maximises the free energy F(p, x) = log p(x) − D(p ‖ p(n|x)) and hence also p(x).
• Run partial E steps frequently: update p(n|x) only for n ∈ S; S = the plausible set (nearest neighbours), kept constant over partial E steps. FAST.
• Run full E steps infrequently: update all p(n|x) and S. SLOW.

We choose S to contain as many neighbours as necessary to account for a total probability ∑_{n∈S} p(n|x) ≥ 1 − ε ∈ (0, 1]. Thus the fraction of data used, e, varies after each full step. Using a fixed e produced worse results.
• Converges to a mode no matter how S is chosen; computational savings if few full steps are taken.
• Segmentation error → 0 as ε decreases.
• Addresses B2. Parameter: ε ∈ [0, 1) (probability mass not in S).
MS4: EM–Newton

Start with EM steps, which quickly increase p. Switch to Newton steps when EM slows down (reverting to EM if a Newton step is bad). Specifically, try a Newton step when ‖x^(τ) − x^(τ−1)‖ < θ. Computing the Newton step x_N yields the EM step x_EM for free:

    Gradient:     g(x) = (p(x)/σ²) ∑_{n=1}^N p(n|x)(x_n − x) = (p(x)/σ²)(x_EM − x)

    Hessian:      H(x) = (p(x)/σ²) ( −I + (1/σ²) ∑_{n=1}^N p(n|x)(x_n − x)(x_n − x)ᵀ )

    Newton step:  x_N = x − H(x)^{−1} g(x).

• Converges to a mode with quadratic order.
• Segmentation error → 0 as θ decreases.
• Addresses B1. Parameter: θ > 0 (minimum EM step length), relative to σ.
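A rough NumPy sketch of this scheme (my own reading; the acceptance test that a Newton step must increase p(x), and the θ·σ threshold, are assumptions): the common positive factor p(x)/σ² is dropped from g and H since it cancels in the Newton step.

    import numpy as np

    def ms4_steps(x, X, sigma):
        """Compute the EM (mean-shift) step and the Newton step from the same
        posteriors. g and H below drop the common positive factor p(x)/sigma^2,
        which cancels in x_N = x - H^{-1} g."""
        N, D = X.shape
        logw = -0.5 * np.sum(((x - X) / sigma) ** 2, axis=1)
        p = np.exp(logw - logw.max())
        p /= p.sum()                                              # p(n|x)
        diff = X - x                                              # x_n - x
        g = p @ diff                                              # proportional to the gradient
        H = (p[:, None] * diff).T @ diff / sigma**2 - np.eye(D)   # proportional to the Hessian
        x_em = x + g                                              # EM step: sum_n p(n|x) x_n
        x_newton = x - np.linalg.solve(H, g)                      # Newton step
        return x_em, x_newton

    def ms4_update(x, X, sigma, prev_step, theta):
        """Try a Newton step only when EM has slowed down (prev_step < theta*sigma),
        and keep it only if it increases the density; otherwise take the EM step."""
        kde = lambda z: np.exp(-0.5 * np.sum(((z - X) / sigma) ** 2, axis=1)).mean()
        x_em, x_newton = ms4_steps(x, X, sigma)
        if prev_step < theta * sigma and kde(x_newton) > kde(x):
            return x_newton
        return x_em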
Computational cost
    Strategy                       Cost per iteration relative to exact GMS
    MS1: spatial discretisation    1
    MS2: spatial neighbourhood     e
    MS3: sparse EM                 2 if full step, e if partial step
    MS4: EM–Newton                 1 if EM step; 1 + (D+1)/4 if Newton step;
                                   3/2 + (D+1)/4 if EM step after a failed Newton step

e ∈ (0, 1] is the fraction of the data set used (the neighbourhood for MS2, the plausible set for MS3).
Using iterations normalised wrt GMS allows direct comparison of the numbers of iterations in the experiments.
Experimental results with image segmentation
• Dataset: x_n = (i_n, j_n, I_n) (greyscale) or x_n = (i_n, j_n, L*_n, u*_n, v*_n) (colour), where (i, j) is the pixel's spatial position. Prescale I or (L*, u*, v*) so that the same σ applies to all dimensions (in pixels).
• Best segmentations are obtained for large bandwidths: σ ≈ (1/5) × (image side).
• We study all strategies on different images over a range of σ.
• Convergence tolerance tol = 10⁻³ pixels.

[Figure: for a 100×100 image, the number of clusters (left) and the total number of GMS iterations (right) as functions of σ ∈ [8, 24].]
Experimental results with image segmentation (cont.)
Segmentation results for each method under its optimal parameter value, for σ = 12 (number of modes, segmentation error wrt GMS, total number of iterations):

    GMS:  5 modes,                iterations = 823937
    MS1:  5 modes, error 1.62%,   iterations = 33791
    MS2:  5 modes, error 0.02%,   iterations = 324694
    MS3:  5 modes, error 0.00%,   iterations = 340095
    MS4:  5 modes, error 1.98%,   iterations = 141904

[Figure: the resulting segmentations, per-pixel iteration maps, and histograms of the number of iterations per pixel for each method.]
Experimental results with image segmentation (cont.)
Optimal parameter value for each method (the value that minimises the iterations ratio subject to achieving an error < 3%) as a function of σ.

[Figure: optimal n (MS1), e (MS2), ε (MS3) and θ (MS4) as functions of σ ∈ [8, 24].]
Experimental results with image segmentation (cont.)
Clustering error (percent) and number of iterations for each method under its optimal parameter value, as a function of σ.

[Figure: error (left) and total iterations (right) vs σ ∈ [8, 24] for GMS and MS1–MS4.]
Experimental results with image segmentation (cont.)
Clustering error (percent) and computational cost for each method as a function of its parameter.

[Figure: for each method (MS1: n, MS2: e, MS3: ε, MS4: θ) and for σ = 8, 12, 16, 20, 24, the error and the iteration ratio wrt GMS as functions of the parameter.]
Experimental results with image segmentation (cont.)
Plot of all iterates for all starting pixels: most cells in MS1 are empty.

[Figure: left, all GMS iterates over the image plane; centre, the proportion of cells visited by MS1 (less than nN out of n²N) as a function of n; right, the average number of iterations per pixel for MS1 as a function of the image side √N, for n = 1, ..., 6.]
Acceleration strategies: summary

• With parameters set optimally, all methods can obtain nearly the same segmentation as GMS with significant speedups. Near-optimal parameter values can easily be set in advance for three of the methods, and somewhat more heuristically for the remaining one.
• MS1 (spatial discretisation): best strategy; 10×–100× speedup (an average of only k = 2–4 iterations per pixel) with very small error, controlled by the discretisation level.
• MS4 (EM–Newton): second best; 1.5×–6× speedup.
• MS2, MS3 (neighbourhood methods): 1×–3× speedup; MS3 attains the lowest (near-zero) error, while MS2 can give unacceptably large errors for a suboptimal setting of its parameter (the neighbourhood size). The neighbourhood methods are less effective because GMS needs very large neighbourhoods (comparable to the data set size for the larger bandwidths).
Acceleration strategies: summary (cont.)
Other methods:
• Approximate algorithms for neighbour search (e.g. kd-trees): less effective, because GMS needs large neighbourhoods.
• Other optimisation algorithms (e.g. quasi-Newton): need to ensure the first steps are EM steps.
• Fast Gauss transform: can be combined with our methods.

Extensions:
• Adaptive and non-isotropic bandwidths: all 4 methods.
• Non-image data: MS3 and MS4 readily applicable. Is it possible to extend MS1 to higher dimensions?
Gaussian blurring mean-shift (GBMS)
• Gaussian mean-shift (GMS) as an EM algorithm
• Acceleration strategies for GMS image segmentation
• Gaussian blurring mean-shift (GBMS)
Gaussian blurring mean-shift (GBMS)
• This is the algorithm that Fukunaga & Hostetler really proposed, but it has gone largely unnoticed.
• Same iterative scheme, but now the data points themselves move at each iteration. Result: a sequence of progressively shrunk datasets X^(0), X^(1), ... converging to a single, all-points-coincident cluster.

[Figure: snapshots of the dataset X^(τ) for τ = 0, ..., 11, progressively collapsing into clusters.]
Pseudocode: GBMS clustering
    repeat                                             Iteration loop
        for m ∈ {1, ..., N}                            For each data point
            ∀n: p(n|x_m) ← exp(−½‖(x_m − x_n)/σ‖²) / ∑_{n′=1}^N exp(−½‖(x_m − x_n′)/σ‖²)
            y_m ← ∑_{n=1}^N p(n|x_m) x_n               One GMS step
        end
        ∀m: x_m ← y_m                                  Update the whole dataset
    until stop                                         Stopping criterion
    connected-components({x_n}_{n=1}^N, tol)           Clusters
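A compact NumPy sketch of this loop (my own illustration); it uses a plain small-update check in place of the entropy-based stopping criterion described on the next slides.

    import numpy as np

    def gbms(X, sigma, max_iter=100, tol=1e-5):
        """Gaussian blurring mean-shift: every point takes one GMS step with respect
        to the CURRENT dataset, then the whole dataset is replaced at once."""
        X = X.copy()
        for _ in range(max_iter):
            d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2) / sigma**2
            W = np.exp(-0.5 * d2)                  # w_{mn} = exp(-1/2 ||(x_m - x_n)/sigma||^2)
            P = W / W.sum(axis=1, keepdims=True)   # row m holds p(n|x_m)
            X_new = P @ X                          # y_m = sum_n p(n|x_m) x_n
            if np.abs(X_new - X).mean() < tol:     # simple stopping rule (see next slides)
                return X_new
            X = X_new
        return X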
Stopping criterion for GBMS
• GBMS converges to an all-points-coincident cluster (Cheng '95). Proof idea: the diameter of the data set decreases at least geometrically.
• However, GBMS shows two phases:
  – Phase 1: points merge into clusters of coincident points (a few iterations); we want to stop here.
  – Phase 2: the clusters keep approaching and merging (a few to hundreds of iterations); this slowly erases the clustering structure.
• We need a stopping criterion (as opposed to a convergence criterion) that stops just after phase 1.
• Simply checking ‖X^(τ) − X^(τ−1)‖ < tol does not work, because the points are always moving.
• Instead, consider the histogram of the updates {e_n^(τ)}_{n=1}^N, with e_n^(τ) = ‖x_n^(τ) − x_n^(τ−1)‖. Though the histograms change as points move, in phase 2 the entropy H does not change (the histogram bins change their positions but not their values).
Stopping criterion for GBMS (cont.)
[Figure: dataset snapshots and histograms of the per-point updates e_n^(τ) for τ = 0, ..., 20; after the clusters form (phase 1), the histogram values, and hence the entropy, stop changing.]
Stopping criterion for GBMS (cont.)
Stopping criterion:

    ( |H(e^(τ+1)) − H(e^(τ))| < 10⁻⁸ )  OR  ( (1/N) ∑_{n=1}^N e_n^(τ+1) < tol )

[Figure: the average update (left) and the update entropy (right) as functions of τ; the average update does not become small enough, but the entropy stops changing after phase 1.]
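A sketch of this criterion (my own; the number of histogram bins and the function names are assumptions, since the slides do not specify how the histogram is binned).

    import numpy as np

    def update_entropy(e, bins=20):
        """Entropy of the histogram of per-point update lengths e_n."""
        counts, _ = np.histogram(e, bins=bins)
        q = counts / counts.sum()
        q = q[q > 0]
        return -(q * np.log(q)).sum()

    def gbms_should_stop(X_prev, X_curr, H_prev, tol=1e-3, ent_tol=1e-8):
        """Stop when the update-histogram entropy stops changing OR the mean update
        is already very small. Returns (stop, H_curr) so the caller can carry H over."""
        e = np.linalg.norm(X_curr - X_prev, axis=1)   # e_n = ||x_n^(tau) - x_n^(tau-1)||
        H_curr = update_entropy(e)
        stop = (H_prev is not None and abs(H_curr - H_prev) < ent_tol) or e.mean() < tol
        return stop, H_curr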
Convergence rate of GBMS
GBMS shrinks a Gaussian cluster towards its mean with cubic convergence rate. Proof: work with a continuous pdf,

    p(x) = (1/N) ∑_{n=1}^N K(‖(x − x_n)/σ‖²)   ⇒   p(x) = ∫_{R^D} q(y) K(x − y) dy.

Each point x of the new dataset is

    x̃ = ∫_{R^D} p(y|x) y dy = E{y|x}  ∀x ∈ R^D,   with  p(y|x) = K(x − y) q(y) / p(x).

Considering 1D w.l.o.g. (the dimensions decouple):

    q(x) = N(x; 0, s²) ⇒ p(x) = N(x; 0, s² + σ²),   and the new dataset has density q̃(x) = N(x; 0, (rs)²)

with r = 1 / (1 + (σ/s)²) ∈ (0, 1).
Convergence rate of GBMS (cont.)

Recurrence relation for the standard deviation s^(τ):

    s^(τ+1) = s^(τ) / (1 + (σ/s^(τ))²)   for s^(0) > 0,

which converges to 0 with cubic order, i.e. lim_{τ→∞} |s^(τ+1) − s^(∞)| / |s^(τ) − s^(∞)|³ < ∞. Effectively, σ increases at every step.

[Figure: a 2D Gaussian cluster at τ = 0, ..., 5; its extent shrinks as 7.0, 4.2, 1.5, 0.1, 4·10⁻⁵, 2·10⁻¹⁵.]

The cluster shrinks without changing its orientation or mean. The principal direction shrinks more slowly relative to the other directions.
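A quick numerical check of the recurrence (my own illustration, with σ = 0.5 and s^(0) = 1 chosen arbitrarily): s decreases slowly at first and then collapses abruptly, as expected from the cubic rate.

    # Iterate the recurrence s <- s / (1 + (sigma/s)^2) from s = 1 with sigma = 0.5
    # and print the standard deviation at each step.
    s, sigma = 1.0, 0.5
    for tau in range(8):
        print(tau, s)
        s = s / (1.0 + (sigma / s) ** 2)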
Convergence rate of GBMS (cont.)
In summary, a Gaussian distribution remains Gaussian with the same mean, but each principal axis decreases cubically. This explains the practical behaviour of GBMS:
1. Clusters collapse extremely fast (clustering).
2. After a few iterations only the local principal component survives ⇒ temporary linearly-shaped clusters (denoising).

[Figure: a dataset at τ = 1, 2, 3, with clusters collapsing onto line-like structures.]
Convergence rate of GBMS (cont.)
Number of GBMS iterations τ necessary to achieve s^(τ) < tol, as a function of the bandwidth σ, for s^(0) = 1:

[Figure: τ vs σ ∈ [0, 2] on a log scale, for tol = 10⁻³ and tol = 10⁻¹⁰.]

Note that GMS converges with linear order (p = 1), thus requiring many iterations to converge to a mode.
Connection with spectral clustering
GBMS pseudocode (inner loop):

    for m ∈ {1, ..., N}                                For each data point
        ∀n: p(n|x_m) ← exp(−½‖(x_m − x_n)/σ‖²) / ∑_{n′=1}^N exp(−½‖(x_m − x_n′)/σ‖²)
        y_m ← ∑_{n=1}^N p(n|x_m) x_n                   One GMS step
    end
    ∀m: x_m ← y_m                                      Update the whole dataset

Same pseudocode in matrix form:

    W = ( exp(−½‖(x_m − x_n)/σ‖²) )_{nm}               Gaussian affinity matrix
    D = diag( ∑_{n=1}^N w_nm )                         Degree (normalising) matrix
    X ← X W D⁻¹                                        Update the whole dataset
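In code, the matrix-form update is one line of linear algebra. The sketch below (my own, storing X as a D×N matrix of column vectors to match X ← X W D⁻¹) is the form that the next slide reads as a repeated product with the random-walk matrix.

    import numpy as np

    def gbms_matrix_step(X, sigma):
        """One GBMS step in matrix form, with X a D x N matrix of column vectors:
        X <- X W D^{-1}. Column m of W D^{-1} holds the posteriors p(n|x_m)."""
        diff = X[:, :, None] - X[:, None, :]                      # pairwise differences
        W = np.exp(-0.5 * np.sum(diff ** 2, axis=0) / sigma**2)   # Gaussian affinities w_{nm}
        P = W / W.sum(axis=0, keepdims=True)                      # W D^{-1}: columns sum to 1
        return X @ P                                              # new x_m = sum_n p(n|x_m) x_n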
Connection with spectral clustering (cont.)
• GBMS can be written as repeated products X ← XP with the random-walk matrix P = WD⁻¹, equivalent to the graph Laplacian (W: Gaussian affinities, D: degree matrix, P: posterior probabilities p(n|x_m)).
• In phase 1, P changes quickly as the points cluster.
• In phase 2, P is almost constant (and perfectly blocky), so GBMS implicitly extracts its leading eigenvectors (power method), like spectral clustering.
• Thus GBMS is much faster than computing eigenvectors (about 5 matrix-vector products are enough).
• Actually, since P is a positive matrix it has a single leading eigenvector (Perron–Frobenius theorem) with constant components, so eventually all points collapse.
Connection with spectral clustering (cont.)
Leading 7 eigenvectors u_1, ..., u_7 ∈ R^N and leading 20 eigenvalues µ_1, ..., µ_20 ∈ (0, 1] of P = (p(n|x_m))_{nm}:

[Figure: eigenvectors u_2, ..., u_7 and the eigenvalue spectrum µ_n at iterations τ = 0, 2, 4, 6, 11.]
Accelerated GBMS algorithm
• At each iteration, replace the clusters already formed with a single point whose mass equals the number of points in the cluster.
• The algorithm is now an alternation of GMS blurring steps and connected-component reduction steps.
• Equivalent to the original GBMS, but faster.
• The effective number of points N^(τ) (and thus the computation) decreases very quickly.
• Computational cost, with k₁ = average number of iterations per point for GMS and k₂ = number of iterations for GBMS and accelerated GBMS:

    GMS: 2N²Dk₁      GBMS: (3/2)N²Dk₂      Accelerated GBMS: (3/2)D ∑_{τ=1}^{k₂} (N^(τ−1))²
Experiments with image segmentation
[Figure: original image, GMS segmentation and GBMS segmentation for the "cameraman" and "hand" images.]

Number of iterations:

    Image                  GMS               GBMS            Accelerated GBMS
    cameraman (124×124)    71.5 (σ = 24.2)   18 (σ = 20.3)   4.6 (σ = 20.3)
    hand (137×110)         36.4 (σ = 33)     14 (σ = 24)     4.8 (σ = 24)
Experiments with image segmentation (cont.)
Effective number of points N^(τ) in accelerated GBMS:

[Figure: N^(τ) vs τ for the cameraman and hand images, dropping rapidly within the first few iterations.]

Accelerated GBMS is:
• 2×–4× faster than GBMS
• 5×–60× faster than GMS
Experiments with image segmentation (cont.)
[Figure: left, GMS: number of clusters C and iterations vs σ, with segmentations at σ = 8.4 (C = 6), σ = 10.1 (C = 4), σ = 18.3 (C = 2); right, GBMS and accelerated GBMS: number of clusters, iterations and speedup vs σ, with segmentations at σ = 8.1 (C = 6), σ = 8.8 (C = 4), σ = 15.8 (C = 2).]
Experiments with image segmentation (cont.)
• Segmentations of similar quality to those of GMS, but faster.
• Computational cost O(kN²). Possible further accelerations:
  – Fast Gauss transform
  – Sparse affinities W:
    · truncated Gaussian kernel
    · proximity graph (k nearest neighbours, etc.)
• The bandwidth σ is set by the user; good values:
  – GMS: σ ≈ (1/5) × (image side)
  – GBMS: somewhat smaller.
  One can also try values of σ on a scaled-down version of the image (fast) and then scale σ up.
GBMS: summary
• We have turned a neglected algorithm originally due to Fukunaga & Hostetler into a practical one by introducing a reliable stopping criterion and an acceleration.
• Fast convergence (cubic order for Gaussian clusters).
• Connection with spectral clustering (GBMS interleaves refining the affinities with power iterations).
• Excellent segmentation results, faster than GMS and spectral clustering; the only parameter is σ (which controls the granularity).
• Very simple to implement.
• We hope to see it more widely applied.