Unsupervised Learning
Sourangshu Bhattacharya
Clustering
• Dataset: 𝑥_1, …, 𝑥_N, where each 𝑥_i ∈ 𝑅^D.
• Goal: partition the data into 𝐾 clusters.
• Criterion: intra-cluster distances should be smaller than inter-cluster distances.
• Many clustering schemes:
  • K-means, K-medoids, etc.
  • Hierarchical clustering.
  • Spectral clustering.
K-means clustering
• Each cluster 𝑘 is represented by a center 𝜇_k ∈ 𝑅^D.
• Let 𝛾_nk ∈ {0,1} denote whether data point 𝑛 is assigned to cluster 𝑘: 𝛾_nk = 1 if 𝑥_n belongs to cluster 𝑘, and 𝛾_nk = 0 otherwise.
• Distortion measure:
  𝐽 = ∑_{n=1}^{N} ∑_{k=1}^{K} 𝛾_nk ‖𝑥_n − 𝜇_k‖²
K-means clustering
• We need to find both 𝛾_nk and 𝜇_k.
• Optimize alternately: update 𝛾_nk with 𝜇_k fixed, then update 𝜇_k with 𝛾_nk fixed.
K-means algorithm
• Initialize {𝜇_1, …, 𝜇_K} randomly.
• Iterate till convergence:
  • E-step: 𝛾_nk = 1 if 𝑘 = argmin_j ‖𝑥_n − 𝜇_j‖², and 𝛾_nk = 0 otherwise.
  • M-step: 𝜇_k = ∑_n 𝛾_nk 𝑥_n / ∑_n 𝛾_nk
• Complexity: 𝑂(𝑁𝐾) distance computations per iteration (a code sketch follows below).
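
As an illustration of the alternating E-step/M-step updates above, here is a minimal NumPy sketch (the function name kmeans and all variable names are my own, not from the lecture):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate hard assignments (E-step) and mean updates (M-step)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]  # initialize centers at K random data points
    for _ in range(n_iters):
        # E-step: assign each point to its nearest center
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K) squared distances
        assign = d2.argmin(axis=1)
        # M-step: recompute each center as the mean of its assigned points
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # stop when the centers no longer move
            break
        mu = new_mu
    return mu, assign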
K-means
K-means algorithm
• It is easy to check that:
  • E-step: minimizes 𝐽 w.r.t. 𝛾_nk keeping 𝜇_k fixed.
  • M-step: minimizes 𝐽 w.r.t. 𝜇_k keeping 𝛾_nk fixed.
• The iteration will converge:
  • 𝐽 decreases (or stays the same) in each step, and any strict decrease is by at least a certain minimum amount, since there are only finitely many possible assignments.
  • The minimum possible value of 𝐽 is zero, so 𝐽 is bounded below.
  • The algorithm therefore converges, in general to a local minimum of 𝐽.
K-medoids
• Objective function:
  𝐽 = ∑_{n=1}^{N} ∑_{k=1}^{K} 𝛾_nk 𝐷(𝑥_n, 𝜇_k)
• Medoids: 𝜇_k ∈ {𝑥_1, …, 𝑥_N}.
• E-step: 𝛾_nk = 1 if 𝑘 = argmin_j 𝐷(𝑥_n, 𝜇_j), and 𝛾_nk = 0 otherwise.
• M-step: set 𝜇_k to the 𝑥_j that minimizes ∑_{n=1}^{N} 𝛾_nk 𝐷(𝑥_n, 𝑥_j) over all 𝑥_j.
K-medoids
• Complexity: 𝑂(𝑁²) per iteration, since the M-step compares points within each cluster pairwise.
• Convergence: the objective value converges, but the medoids themselves can oscillate.
• Advantage: works on non-vectorial datasets, since only pairwise dissimilarities 𝐷(·,·) are needed (a code sketch follows below).
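
A minimal sketch of this scheme over a precomputed dissimilarity matrix (kmedoids and all names below are my own; an illustration, not the lecture's reference code):

import numpy as np

def kmedoids(D, K, n_iters=100, seed=0):
    """K-medoids on a precomputed N x N dissimilarity matrix D."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    medoids = rng.choice(N, K, replace=False)  # indices of the current medoids
    for _ in range(n_iters):
        # E-step: assign each point to its nearest medoid
        assign = D[:, medoids].argmin(axis=1)
        # M-step: within each cluster, pick the member minimizing total dissimilarity
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.where(assign == k)[0]
            if len(members) > 0:
                costs = D[np.ix_(members, members)].sum(axis=0)
                new_medoids[k] = members[costs.argmin()]
        if np.array_equal(new_medoids, medoids):  # medoids stopped changing
            break
        medoids = new_medoids
    return medoids, assign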
Mixture of Gaussians
• Let 𝑧 ∈ {0,1}^K be a discrete latent variable such that ∑_k 𝑧_k = 1.
• 𝑧_k = 1 selects the cluster (mixture component) from which the data point is generated.
• There are 𝐾 Gaussian distributions: 𝒩(𝑥|𝜇_1, Σ_1), …, 𝒩(𝑥|𝜇_K, Σ_K).
Mixture of Gaussians
• Given a data point 𝑥:
  𝑃(𝑥) = ∑_{k=1}^{K} 𝜋_k 𝒩(𝑥|𝜇_k, Σ_k)
• Where: 𝜋_k = 𝑃(𝑧_k = 1)
Generative Procedure
• Select 𝑧 with probabilities 𝜋_k, i.e. 𝑃(𝑧_k = 1) = 𝜋_k.
• Hence: 𝑃(𝑧) = ∏_{k=1}^{K} 𝜋_k^{𝑧_k}
• Given 𝑧, generate 𝑥 according to the conditional distribution:
  𝑃(𝑥 | 𝑧_k = 1) = 𝒩(𝑥|𝜇_k, Σ_k)
• Hence: 𝑃(𝑥 | 𝑧) = ∏_{k=1}^{K} 𝒩(𝑥|𝜇_k, Σ_k)^{𝑧_k}
Generative Procedure
• Joint distribution:
  𝑃(𝑥, 𝑧) = 𝑝(𝑧) 𝑝(𝑥|𝑧) = ∏_{k=1}^{K} [𝜋_k 𝒩(𝑥|𝜇_k, Σ_k)]^{𝑧_k}
• Marginal:
  𝑝(𝑥) = ∑_z 𝑝(𝑥, 𝑧) = ∑_{k=1}^{K} 𝜋_k 𝒩(𝑥|𝜇_k, Σ_k)
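
A minimal sketch of this two-stage sampling procedure in NumPy (the mixture parameters below are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component mixture in 2-D (made-up parameters)
pi = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])

def sample_gmm(n):
    # Step 1: pick the component, z ~ Categorical(pi)
    z = rng.choice(len(pi), size=n, p=pi)
    # Step 2: draw x | z from the selected Gaussian
    X = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return X, z

X, z = sample_gmm(500)  # 500 points drawn from the mixture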
Posterior distribution
• Posterior probability of 𝑧_k = 1 given 𝑥 (the responsibility of component 𝑘 for 𝑥):
  𝛾(𝑧_k) ≡ 𝑃(𝑧_k = 1 | 𝑥) = 𝜋_k 𝒩(𝑥|𝜇_k, Σ_k) / ∑_{j=1}^{K} 𝜋_j 𝒩(𝑥|𝜇_j, Σ_j)
Example
Max-likelihood
• Let 𝐷 = {𝑥_1, …, 𝑥_N}.
• Likelihood function:
  𝑃(𝐷 | 𝝅, 𝝁, 𝚺) = ∏_{n=1}^{N} ∑_{k=1}^{K} 𝜋_k 𝒩(𝑥_n|𝜇_k, Σ_k)
• Log-likelihood:
  ln 𝑃(𝐷 | 𝝅, 𝝁, 𝚺) = ∑_{n=1}^{N} ln ( ∑_{k=1}^{K} 𝜋_k 𝒩(𝑥_n|𝜇_k, Σ_k) )
• Maximize log-likelihood w.r.t. 𝝅, 𝝁 and 𝚺.
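
As a sketch, this log-likelihood can be evaluated directly with SciPy (gmm_log_likelihood is my own name; pi, mu, Sigma are arrays shaped as in the sampling example above):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    # ln P(D | pi, mu, Sigma) = sum_n ln( sum_k pi_k N(x_n | mu_k, Sigma_k) )
    K = len(pi)
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                            for k in range(K)])  # (N, K) weighted component densities
    return np.log(dens.sum(axis=1)).sum()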
KKT conditions
• Differentiating w.r.t. 𝜇_k and setting to zero:
  0 = − ∑_{n=1}^{N} 𝛾(𝑧_nk) Σ_k^{-1} (𝑥_n − 𝜇_k),  with 𝛾(𝑧_nk) = 𝜋_k 𝒩(𝑥_n|𝜇_k, Σ_k) / ∑_j 𝜋_j 𝒩(𝑥_n|𝜇_j, Σ_j)
• Multiplying by Σ_k (assumed non-singular) and rearranging:
  𝜇_k = (1/𝑁_k) ∑_{n=1}^{N} 𝛾(𝑧_nk) 𝑥_n
• Where:
  𝑁_k = ∑_{n=1}^{N} 𝛾(𝑧_nk)
KKT conditions
• Similarly, differentiating w.r.t. Σ_k:
  Σ_k = (1/𝑁_k) ∑_{n=1}^{N} 𝛾(𝑧_nk) (𝑥_n − 𝜇_k)(𝑥_n − 𝜇_k)^T
• For 𝜋_k, enforce ∑_k 𝜋_k = 1 with a Lagrange multiplier 𝜆; the Lagrangian is:
  ln 𝑃(𝐷 | 𝝅, 𝝁, 𝚺) + 𝜆 ( ∑_{k=1}^{K} 𝜋_k − 1 )
KKT conditions
• Setting the derivative of the Lagrangian w.r.t. 𝜋_k to zero:
  0 = ∑_{n=1}^{N} 𝒩(𝑥_n|𝜇_k, Σ_k) / ∑_j 𝜋_j 𝒩(𝑥_n|𝜇_j, Σ_j) + 𝜆
• Multiplying with 𝜋_k and adding over 𝑘: 𝜆 = −𝑁.
• Hence: 𝜋_k = 𝑁_k / 𝑁
• Where: 𝑁_k = ∑_{n=1}^{N} 𝛾(𝑧_nk), as before.
Expectation Maximization (EM) Algorithm
• Initialize 𝜇_k, Σ_k and 𝜋_k.
• E-step: compute the responsibilities with the current parameters:
  𝛾(𝑧_nk) = 𝜋_k 𝒩(𝑥_n|𝜇_k, Σ_k) / ∑_{j=1}^{K} 𝜋_j 𝒩(𝑥_n|𝜇_j, Σ_j)
• M-step: re-estimate the parameters using the current responsibilities:
  𝜇_k = (1/𝑁_k) ∑_n 𝛾(𝑧_nk) 𝑥_n
  Σ_k = (1/𝑁_k) ∑_n 𝛾(𝑧_nk) (𝑥_n − 𝜇_k)(𝑥_n − 𝜇_k)^T
  𝜋_k = 𝑁_k / 𝑁,  where 𝑁_k = ∑_n 𝛾(𝑧_nk)
• Repeat the above two steps till ln 𝑃(𝐷 | 𝝅, 𝝁, 𝚺) converges (a code sketch follows below).
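
A compact NumPy/SciPy sketch of these E and M updates (em_gmm and all names are my own; an illustration, not the lecture's reference implementation):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialize mu_k, Sigma_k, pi_k
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: responsibilities gamma(z_nk)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                                for k in range(K)])          # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update parameters from the responsibilities
        Nk = gamma.sum(axis=0)                                # (K,)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # Check convergence of the log-likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma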
Example
General EM Algorithm
• Incomplete-data log-likelihood of a latent variable model:
  ln 𝑝(𝑋|𝜃) = ln ∑_Z 𝑝(𝑋, 𝑍|𝜃)
• The summation sits inside the logarithm, which makes direct maximization difficult.
• The complete-data log-likelihood, ln 𝑝(𝑋, 𝑍|𝜃), is more tractable.
General EM Algorithm
• Define the expectation of the complete-data log-likelihood w.r.t. the posterior 𝑝(𝑍|𝑋, 𝜃^old):
  Q(𝜃, 𝜃^old) = ∑_Z 𝑝(𝑍|𝑋, 𝜃^old) ln 𝑝(𝑋, 𝑍|𝜃)
• Maximize w.r.t. 𝜃:
  𝜃^new = argmax_𝜃 Q(𝜃, 𝜃^old)
General EM algorithm
• Initialize 𝜃^old.
• Iterate till convergence:
  • E-step: evaluate the posterior 𝑝(𝑍|𝑋, 𝜃^old) and form Q(𝜃, 𝜃^old).
  • M-step: 𝜃^new = argmax_𝜃 Q(𝜃, 𝜃^old); then set 𝜃^old ← 𝜃^new.
For Gaussian Mixtures:
• E-step: evaluate the responsibilities 𝛾(𝑧_nk) using the current parameters.
• M-step: maximize
  Q = ∑_{n=1}^{N} ∑_{k=1}^{K} 𝛾(𝑧_nk) { ln 𝜋_k + ln 𝒩(𝑥_n|𝜇_k, Σ_k) },
  which gives the update equations above.
Relation to K-means
• Let Σ_k = 𝜖𝐼.
• Hence:
  𝒩(𝑥|𝜇_k, 𝜖𝐼) = (1 / (2𝜋𝜖)^{D/2}) exp( −‖𝑥 − 𝜇_k‖² / 2𝜖 )
• Giving:
  𝛾(𝑧_nk) = 𝜋_k exp(−‖𝑥_n − 𝜇_k‖² / 2𝜖) / ∑_j 𝜋_j exp(−‖𝑥_n − 𝜇_j‖² / 2𝜖)
• In the limit 𝜖 → 0, the component with the smallest ‖𝑥_n − 𝜇_k‖ dominates, so its 𝛾(𝑧_nk) becomes 1 and the rest become 0 (hard assignment).
Relation to K-means
• Setting 𝑟_nk = 𝛾(𝑧_nk): as 𝜖 → 0, the dominant 𝜖-dependent term of the expected complete-data log-likelihood is
  −(1/2𝜖) ∑_{n=1}^{N} ∑_{k=1}^{K} 𝑟_nk ‖𝑥_n − 𝜇_k‖²,
  so maximizing it is equivalent to minimizing the K-means distortion measure 𝐽 (a small numerical illustration follows below).
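
A tiny numerical illustration of this hardening of the responsibilities as 𝜖 shrinks (the squared distances below are made-up numbers, and equal mixing weights are assumed):

import numpy as np

# Squared distances of one point x_n to three centers (made-up numbers)
d2 = np.array([1.0, 2.0, 4.0])

for eps in (1.0, 0.1, 0.01):
    # gamma_k proportional to exp(-d2_k / (2*eps)) with equal pi_k
    logits = -d2 / (2 * eps)
    gamma = np.exp(logits - logits.max())
    gamma /= gamma.sum()
    print(eps, np.round(gamma, 4))
# As eps -> 0, the responsibility of the nearest center approaches 1.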
EM Analysis
• For any latent variable model with latent variables 𝑍, and for any distribution 𝑞(𝑍):
• The following decomposition holds:
  ln 𝑝(𝑋|𝜃) = 𝓛(𝑞, 𝜃) + KL(𝑞 ‖ 𝑝)
• Where:
  𝓛(𝑞, 𝜃) = ∑_Z 𝑞(𝑍) ln [ 𝑝(𝑋, 𝑍|𝜃) / 𝑞(𝑍) ]
  KL(𝑞 ‖ 𝑝) = − ∑_Z 𝑞(𝑍) ln [ 𝑝(𝑍|𝑋, 𝜃) / 𝑞(𝑍) ]
EM Analysis
• Since the KL divergence is ≥ 0: 𝓛(𝑞, 𝜃) ≤ ln 𝑝(𝑋|𝜃), i.e. 𝓛 is a lower bound on the log-likelihood.
• The bound is tight when the two distributions are the same, i.e. 𝑞(𝑍) = 𝑝(𝑍|𝑋, 𝜃), which makes KL(𝑞 ‖ 𝑝) = 0.