Unsupervised Learning
Sourangshu Bhattacharya
Clustering
• Dataset: 𝑥_1, …, 𝑥_N, where each 𝑥_i ∈ 𝑅^D.
• Goal: partition the data into 𝐾 clusters.
• Criterion: intra-cluster distances should be smaller than inter-cluster distances.
• Many clustering schemes:
  • K-means, K-medoids, etc.
  • Hierarchical clustering.
  • Spectral clustering.
K-means clustering
• Each cluster 𝑘 is represented by a center 𝜇_k ∈ 𝑅^D.
• Let 𝛾_nk ∈ {0,1} denote whether data point 𝑛 is assigned to cluster 𝑘: 𝛾_nk = 1 if 𝑥_n belongs to cluster 𝑘, and 𝛾_nk = 0 otherwise.
• Distortion measure:
  𝐽 = ∑_{n=1}^{N} ∑_{k=1}^{K} 𝛾_nk ‖𝑥_n − 𝜇_k‖²
K-means clustering
• We need to find both 𝛾_nk and 𝜇_k.
• Optimize alternately: update 𝛾_nk with 𝜇_k fixed, then update 𝜇_k with 𝛾_nk fixed.
K-means algorithm
• Initialize {𝜇_1, …, 𝜇_K} randomly.
• Iterate till convergence:
  • E-step: 𝛾_nk = 1 if 𝑘 = argmin_j ‖𝑥_n − 𝜇_j‖², and 𝛾_nk = 0 otherwise.
  • M-step: 𝜇_k = ∑_n 𝛾_nk 𝑥_n / ∑_n 𝛾_nk
• Complexity: 𝑂(𝑁𝐾) distance computations per iteration (a code sketch follows below).
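
As an illustration of the alternating E-step/M-step updates above, here is a minimal NumPy sketch (the function name kmeans and all variable names are my own, not from the lecture):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate hard assignments (E-step) and mean updates (M-step)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]  # initialize centers at K random data points
    for _ in range(n_iters):
        # E-step: assign each point to its nearest center
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K) squared distances
        assign = d2.argmin(axis=1)
        # M-step: recompute each center as the mean of its assigned points
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # stop when the centers no longer move
            break
        mu = new_mu
    return mu, assign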
K-means
K-means algorithm
• It is easy to check that:
  • E-step: minimizes 𝐽 w.r.t. 𝛾_nk keeping 𝜇_k fixed.
  • M-step: minimizes 𝐽 w.r.t. 𝜇_k keeping 𝛾_nk fixed.
• The iteration will converge:
  • 𝐽 decreases (or stays the same) in each step, and any strict decrease is by at least a certain minimum amount, since there are only finitely many possible assignments.
  • The minimum possible value of 𝐽 is zero, so 𝐽 is bounded below.
  • The algorithm therefore converges, in general to a local minimum of 𝐽.
K-medoids
• Objective function:
  𝐽 = ∑_{n=1}^{N} ∑_{k=1}^{K} 𝛾_nk 𝐷(𝑥_n, 𝜇_k)
• Medoids: 𝜇_k ∈ {𝑥_1, …, 𝑥_N}.
• E-step: 𝛾_nk = 1 if 𝑘 = argmin_j 𝐷(𝑥_n, 𝜇_j), and 𝛾_nk = 0 otherwise.
• M-step: set 𝜇_k to the 𝑥_j that minimizes ∑_{n=1}^{N} 𝛾_nk 𝐷(𝑥_n, 𝑥_j) over all 𝑥_j.
K-medoids
• Complexity: 𝑂(𝑁²) per iteration, since the M-step compares points within each cluster pairwise.
• Convergence: the objective value converges, but the medoids themselves can oscillate.
• Advantage: works on non-vectorial datasets, since only pairwise dissimilarities 𝐷(·,·) are needed (a code sketch follows below).
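
A minimal sketch of this scheme over a precomputed dissimilarity matrix (kmedoids and all names below are my own; an illustration, not the lecture's reference code):

import numpy as np

def kmedoids(D, K, n_iters=100, seed=0):
    """K-medoids on a precomputed N x N dissimilarity matrix D."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    medoids = rng.choice(N, K, replace=False)  # indices of the current medoids
    for _ in range(n_iters):
        # E-step: assign each point to its nearest medoid
        assign = D[:, medoids].argmin(axis=1)
        # M-step: within each cluster, pick the member minimizing total dissimilarity
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.where(assign == k)[0]
            if len(members) > 0:
                costs = D[np.ix_(members, members)].sum(axis=0)
                new_medoids[k] = members[costs.argmin()]
        if np.array_equal(new_medoids, medoids):  # medoids stopped changing
            break
        medoids = new_medoids
    return medoids, assign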
Mixture of Gaussians
• Let 𝑧 ∈ {0,1}^K be a discrete latent variable such that ∑_k 𝑧_k = 1.
• 𝑧_k = 1 selects the cluster (mixture component) from which the data point is generated.
• There are 𝐾 Gaussian distributions: 𝒩(𝑥|𝜇_1, Σ_1), …, 𝒩(𝑥|𝜇_K, Σ_K).
Mixture of Gaussians
• Given a data point 𝑥:
  𝑃(𝑥) = ∑_{k=1}^{K} 𝜋_k 𝒩(𝑥|𝜇_k, Σ_k)
• Where: 𝜋_k = 𝑃(𝑧_k = 1)
Generative Procedure
• Select 𝑧 with probabilities 𝜋_k, i.e. 𝑃(𝑧_k = 1) = 𝜋_k.
• Hence: 𝑃(𝑧) = ∏_{k=1}^{K} 𝜋_k^{𝑧_k}
• Given 𝑧, generate 𝑥 according to the conditional distribution:
  𝑃(𝑥 | 𝑧_k = 1) = 𝒩(𝑥|𝜇_k, Σ_k)
• Hence: 𝑃(𝑥 | 𝑧) = ∏_{k=1}^{K} 𝒩(𝑥|𝜇_k, Σ_k)^{𝑧_k}
Generative Procedure
• Joint distribution:
  𝑃(𝑥, 𝑧) = 𝑝(𝑧) 𝑝(𝑥|𝑧) = ∏_{k=1}^{K} [𝜋_k 𝒩(𝑥|𝜇_k, Σ_k)]^{𝑧_k}
• Marginal:
  𝑝(𝑥) = ∑_z 𝑝(𝑥, 𝑧) = ∑_{k=1}^{K} 𝜋_k 𝒩(𝑥|𝜇_k, Σ_k)
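
A minimal sketch of this two-stage sampling procedure in NumPy (the mixture parameters below are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component mixture in 2-D (made-up parameters)
pi = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])

def sample_gmm(n):
    # Step 1: pick the component, z ~ Categorical(pi)
    z = rng.choice(len(pi), size=n, p=pi)
    # Step 2: draw x | z from the selected Gaussian
    X = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return X, z

X, z = sample_gmm(500)  # 500 points drawn from the mixture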
Posterior distribution
• Posterior probability of 𝑧_k = 1 given 𝑥 (the responsibility of component 𝑘 for 𝑥):
  𝛾(𝑧_k) ≡ 𝑃(𝑧_k = 1 | 𝑥) = 𝜋_k 𝒩(𝑥|𝜇_k, Σ_k) / ∑_{j=1}^{K} 𝜋_j 𝒩(𝑥|𝜇_j, Σ_j)
Example
Max-likelihood
• Let 𝐷 = {𝑥_1, …, 𝑥_N}.
• Likelihood function:
  𝑃(𝐷 | 𝝅, 𝝁, 𝚺) = ∏_{n=1}^{N} ∑_{k=1}^{K} 𝜋_k 𝒩(𝑥_n|𝜇_k, Σ_k)
• Log-likelihood:
  ln 𝑃(𝐷 | 𝝅, 𝝁, 𝚺) = ∑_{n=1}^{N} ln ( ∑_{k=1}^{K} 𝜋_k 𝒩(𝑥_n|𝜇_k, Σ_k) )
• Maximize log-likelihood w.r.t. 𝝅, 𝝁 and 𝚺.
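
As a sketch, this log-likelihood can be evaluated directly with SciPy (gmm_log_likelihood is my own name; pi, mu, Sigma are arrays shaped as in the sampling example above):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    # ln P(D | pi, mu, Sigma) = sum_n ln( sum_k pi_k N(x_n | mu_k, Sigma_k) )
    K = len(pi)
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                            for k in range(K)])  # (N, K) weighted component densities
    return np.log(dens.sum(axis=1)).sum()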
KKT conditions
• Differentiating w.r.t. 𝜇_k and setting to zero:
  0 = − ∑_{n=1}^{N} 𝛾(𝑧_nk) Σ_k^{-1} (𝑥_n − 𝜇_k),  with 𝛾(𝑧_nk) = 𝜋_k 𝒩(𝑥_n|𝜇_k, Σ_k) / ∑_j 𝜋_j 𝒩(𝑥_n|𝜇_j, Σ_j)
• Multiplying by Σ_k (assumed non-singular) and rearranging:
  𝜇_k = (1/𝑁_k) ∑_{n=1}^{N} 𝛾(𝑧_nk) 𝑥_n
• Where:
  𝑁_k = ∑_{n=1}^{N} 𝛾(𝑧_nk)
KKT conditions
• Similarly, differentiating w.r.t. Σ_k:
  Σ_k = (1/𝑁_k) ∑_{n=1}^{N} 𝛾(𝑧_nk) (𝑥_n − 𝜇_k)(𝑥_n − 𝜇_k)^T
• For 𝜋_k, enforce ∑_k 𝜋_k = 1 with a Lagrange multiplier 𝜆; the Lagrangian is:
  ln 𝑃(𝐷 | 𝝅, 𝝁, 𝚺) + 𝜆 ( ∑_{k=1}^{K} 𝜋_k − 1 )
KKT conditions
• Setting the derivative of the Lagrangian w.r.t. 𝜋_k to zero:
  0 = ∑_{n=1}^{N} 𝒩(𝑥_n|𝜇_k, Σ_k) / ∑_j 𝜋_j 𝒩(𝑥_n|𝜇_j, Σ_j) + 𝜆
• Multiplying with 𝜋_k and adding over 𝑘: 𝜆 = −𝑁.
• Hence: 𝜋_k = 𝑁_k / 𝑁
• Where: 𝑁_k = ∑_{n=1}^{N} 𝛾(𝑧_nk), as before.
Expectation Maximization (EM) Algorithm
• Initialize 𝜇_k, Σ_k and 𝜋_k.
• E-step: compute the responsibilities with the current parameters:
  𝛾(𝑧_nk) = 𝜋_k 𝒩(𝑥_n|𝜇_k, Σ_k) / ∑_{j=1}^{K} 𝜋_j 𝒩(𝑥_n|𝜇_j, Σ_j)
• M-step: re-estimate the parameters using the current responsibilities:
  𝜇_k = (1/𝑁_k) ∑_n 𝛾(𝑧_nk) 𝑥_n
  Σ_k = (1/𝑁_k) ∑_n 𝛾(𝑧_nk) (𝑥_n − 𝜇_k)(𝑥_n − 𝜇_k)^T
  𝜋_k = 𝑁_k / 𝑁,  where 𝑁_k = ∑_n 𝛾(𝑧_nk)
• Repeat the above two steps till ln 𝑃(𝐷 | 𝝅, 𝝁, 𝚺) converges (a code sketch follows below).
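
A compact NumPy/SciPy sketch of these E and M updates (em_gmm and all names are my own; an illustration, not the lecture's reference implementation):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialize mu_k, Sigma_k, pi_k
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: responsibilities gamma(z_nk)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                                for k in range(K)])          # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update parameters from the responsibilities
        Nk = gamma.sum(axis=0)                                # (K,)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # Check convergence of the log-likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma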
Example
General EM Algorithm
• Incomplete-data log-likelihood of a latent variable model:
  ln 𝑝(𝑋|𝜃) = ln ∑_Z 𝑝(𝑋, 𝑍|𝜃)
• The summation sits inside the logarithm, which makes direct maximization difficult.
• The complete-data log-likelihood, ln 𝑝(𝑋, 𝑍|𝜃), is more tractable.
General EM Algorithm
• Define the expectation of the complete-data log-likelihood w.r.t. the posterior 𝑝(𝑍|𝑋, 𝜃^old):
  Q(𝜃, 𝜃^old) = ∑_Z 𝑝(𝑍|𝑋, 𝜃^old) ln 𝑝(𝑋, 𝑍|𝜃)
• Maximize w.r.t. 𝜃:
  𝜃^new = argmax_𝜃 Q(𝜃, 𝜃^old)
General EM algorithm
• Initialize 𝜃^old.
• Iterate till convergence:
  • E-step: evaluate the posterior 𝑝(𝑍|𝑋, 𝜃^old) and form Q(𝜃, 𝜃^old).
  • M-step: 𝜃^new = argmax_𝜃 Q(𝜃, 𝜃^old); then set 𝜃^old ← 𝜃^new.
For Gaussian Mixtures:
• E-step: evaluate the responsibilities 𝛾(𝑧_nk) using the current parameters.
• M-step: maximize
  Q = ∑_{n=1}^{N} ∑_{k=1}^{K} 𝛾(𝑧_nk) { ln 𝜋_k + ln 𝒩(𝑥_n|𝜇_k, Σ_k) },
  which gives the update equations above.
Relation to K-means
• Let Σ_k = 𝜖𝐼.
• Hence:
  𝒩(𝑥|𝜇_k, 𝜖𝐼) = (1 / (2𝜋𝜖)^{D/2}) exp( −‖𝑥 − 𝜇_k‖² / 2𝜖 )
• Giving:
  𝛾(𝑧_nk) = 𝜋_k exp(−‖𝑥_n − 𝜇_k‖² / 2𝜖) / ∑_j 𝜋_j exp(−‖𝑥_n − 𝜇_j‖² / 2𝜖)
• In the limit 𝜖 → 0, the component with the smallest ‖𝑥_n − 𝜇_k‖ dominates, so its 𝛾(𝑧_nk) becomes 1 and the rest become 0 (hard assignment).
Relation to K-means
• Setting 𝑟_nk = 𝛾(𝑧_nk): as 𝜖 → 0, the dominant 𝜖-dependent term of the expected complete-data log-likelihood is
  −(1/2𝜖) ∑_{n=1}^{N} ∑_{k=1}^{K} 𝑟_nk ‖𝑥_n − 𝜇_k‖²,
  so maximizing it is equivalent to minimizing the K-means distortion measure 𝐽 (a small numerical illustration follows below).
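
A tiny numerical illustration of this hardening of the responsibilities as 𝜖 shrinks (the squared distances below are made-up numbers, and equal mixing weights are assumed):

import numpy as np

# Squared distances of one point x_n to three centers (made-up numbers)
d2 = np.array([1.0, 2.0, 4.0])

for eps in (1.0, 0.1, 0.01):
    # gamma_k proportional to exp(-d2_k / (2*eps)) with equal pi_k
    logits = -d2 / (2 * eps)
    gamma = np.exp(logits - logits.max())
    gamma /= gamma.sum()
    print(eps, np.round(gamma, 4))
# As eps -> 0, the responsibility of the nearest center approaches 1.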
EM Analysis
• For any latent variable model with latent variables 𝑍, and for any distribution 𝑞(𝑍):
• The following decomposition holds:
  ln 𝑝(𝑋|𝜃) = 𝓛(𝑞, 𝜃) + KL(𝑞 ‖ 𝑝)
• Where:
  𝓛(𝑞, 𝜃) = ∑_Z 𝑞(𝑍) ln [ 𝑝(𝑋, 𝑍|𝜃) / 𝑞(𝑍) ]
  KL(𝑞 ‖ 𝑝) = − ∑_Z 𝑞(𝑍) ln [ 𝑝(𝑍|𝑋, 𝜃) / 𝑞(𝑍) ]
EM Analysis
• Since the KL divergence is ≥ 0: 𝓛(𝑞, 𝜃) ≤ ln 𝑝(𝑋|𝜃), i.e. 𝓛 is a lower bound on the log-likelihood.
• The bound is tight when the two distributions are the same, i.e. 𝑞(𝑍) = 𝑝(𝑍|𝑋, 𝜃), which makes KL(𝑞 ‖ 𝑝) = 0.