Clustering Techniques
Berlin Chen 2005
References
1. Introduction to Machine Learning, Chapter 7
2. Modern Information Retrieval, Chapters 5, 7
3. Foundations of Statistical Natural Language Processing, Chapter 14
4. "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," Jeff A. Bilmes, U.C. Berkeley TR-97-021
IR ndash Berlin Chen 2
Clustering
• Place similar objects in the same group and assign dissimilar objects to different groups
  – Word clustering
    • Neighbor overlap: words occur with similar left and right neighbors (such as "in" and "on")
  – Document clustering
    • Documents with similar topics or concepts are put together
• But clustering cannot give a comprehensive description of the object
  – How to label objects shown on the visual display?
• Regarded as a kind of semiparametric learning approach
  – Allows a mixture of distributions to be used for estimating the input samples (a parametric model for each group of samples)
IR ndash Berlin Chen 3
Clustering vs Classification
• Classification is supervised and requires a set of labeled training instances for each group (class)
  – Learning with a teacher
• Clustering is unsupervised and learns without a teacher to provide the labeling information of the training data set
  – Also called automatic or unsupervised classification
IR ndash Berlin Chen 4
Types of Clustering Algorithms
• Two types of structures produced by clustering algorithms
  – Flat or non-hierarchical clustering
  – Hierarchical clustering
• Flat clustering
  – Simply consists of a certain number of clusters, and the relation between clusters is often undetermined
  – Measurement: construction error minimization or probabilistic optimization
• Hierarchical clustering
  – A hierarchy with the usual interpretation that each node stands for a subclass of its mother node
    • The leaves of the tree are the single objects
    • Each node represents the cluster that contains all the objects of its descendants
  – Measurement: similarities of instances
IR ndash Berlin Chen 5
Hard Assignment vs Soft Assignment
• Another important distinction between clustering algorithms is whether they perform soft or hard assignment
• Hard assignment
  – Each object is assigned to one and only one cluster
• Soft assignment (probabilistic approach)
  – Each object may be assigned to multiple clusters
  – An object $x_i$ has a probability distribution $P(\cdot \mid x_i)$ over clusters $c_j$, where $P(c_j \mid x_i)$ is the probability that $x_i$ is a member of $c_j$
  – Somewhat more appropriate in many tasks such as NLP, IR, …
IR ndash Berlin Chen 6
Hard Assignment vs Soft Assignment (cont)
• Hierarchical clustering usually adopts hard assignment
• In flat clustering, both types of assignment are common
IR ndash Berlin Chen 7
Summarized Attributes of Clustering Algorithms
• Hierarchical clustering
  – Preferable for detailed data analysis
  – Provides more information than flat clustering
  – No single best algorithm (each algorithm is optimal only for some applications)
  – Less efficient than flat clustering (minimally, an n × n matrix of similarity coefficients has to be computed)
IR ndash Berlin Chen 8
Summarized Attributes of Clustering Algorithms (cont)
• Flat clustering
  – Preferable if efficiency is a consideration or data sets are very large
  – K-means is conceptually the simplest method and should probably be tried first on new data, because its results are often sufficient
  – K-means assumes a simple Euclidean representation space, and so cannot be used for many data sets, e.g., nominal data like colors (or samples with features of different scales)
  – The EM algorithm is the most flexible choice: it can accommodate definitions of clusters and allocation of objects based on complex probabilistic models
    • Its extensions can be used to handle topological/hierarchical orders of samples
      – Probabilistic Latent Semantic Analysis (PLSA), Topic Mixture Model (TMM), etc.
IR ndash Berlin Chen 9
Hierarchical Clustering
IR ndash Berlin Chen 10
Hierarchical Clustering
• Can be done in either a bottom-up or top-down manner
  – Bottom-up (agglomerative)
    • Start with individual objects and group the most similar ones
      – E.g., those with the minimum distance apart, using $\mathrm{sim}(x, y) = \frac{1}{1 + d(x, y)}$ (distance measures will be discussed later on)
    • The procedure terminates when one cluster containing all objects has been formed
  – Top-down (divisive)
    • Start with all objects in a group and divide them into groups so as to maximize within-group similarity
IR ndash Berlin Chen 11
Hierarchical Agglomerative Clustering (HAC)
• A bottom-up approach
• Assume a similarity measure for determining the similarity of two objects
• Start with each object in a separate cluster, and then repeatedly join the two clusters that have the greatest similarity, until only one cluster survives
• The history of merging/clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 12
HAC (cont)
• Algorithm (annotations from the original figure)
  – Initialization (for tree leaves): each object is a cluster
  – At each step, the two most similar clusters are merged as a new cluster, and the original two clusters are removed
  – Repeat until the desired cluster number is reached
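A minimal sketch of the HAC loop just described, assuming length-normalized object vectors and using the cosine between cluster mean vectors as the similarity; the function name and interface are illustrative, not from the original slides.

```python
import numpy as np

def hac(X, target_k=1):
    """Hierarchical agglomerative clustering sketch.
    X: (n, d) array of length-normalized object vectors.
    Returns the merge history as a list of (cluster_a, cluster_b) pairs."""
    clusters = {i: [i] for i in range(len(X))}   # initialization: each object is a cluster
    history = []
    while len(clusters) > target_k:
        ids = list(clusters)
        best, best_sim = None, -np.inf
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                i, j = ids[a], ids[b]
                # cosine similarity between the two cluster mean vectors (for illustration)
                si = X[clusters[i]].mean(axis=0)
                sj = X[clusters[j]].mean(axis=0)
                sim = si @ sj / (np.linalg.norm(si) * np.linalg.norm(sj))
                if sim > best_sim:
                    best, best_sim = (i, j), sim
        i, j = best
        clusters[max(clusters) + 1] = clusters.pop(i) + clusters.pop(j)  # merged as a new cluster
        history.append((i, j))                                           # the original two are removed
    return history
```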
IR ndash Berlin Chen 13
Distance Metrics
• Euclidean distance (L2 norm)
  $$L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
  – Make sure that all attributes/dimensions have the same scale (or the same variance)
• L1 norm (city-block distance)
  $$L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$$
• Cosine similarity (transformed to a distance by subtracting from 1)
  $$1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}, \quad \text{ranged between 0 and 1}$$
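A quick numeric check of the three measures above (a sketch; the vectors are arbitrary examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])

l2 = np.sqrt(np.sum((x - y) ** 2))                                 # Euclidean (L2) distance
l1 = np.sum(np.abs(x - y))                                         # L1 / city-block distance
cos_dist = 1.0 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))   # 1 - cosine similarity

print(l2, l1, cos_dist)
```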
IR ndash Berlin Chen 14
Measures of Cluster Similarity
• Especially for the bottom-up approaches
• Single-link clustering
  – The similarity between two clusters is the similarity of the two closest objects in the clusters
  – Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity:
    $$\mathrm{sim}(c_i, c_j) = \max_{\vec{x} \in c_i,\, \vec{y} \in c_j} \mathrm{sim}(\vec{x}, \vec{y})$$
  – Elongated clusters are achieved (cf. the minimal spanning tree)
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
• Complete-link clustering
  – The similarity between two clusters is the similarity of their two most dissimilar members:
    $$\mathrm{sim}(c_i, c_j) = \min_{\vec{x} \in c_i,\, \vec{y} \in c_j} \mathrm{sim}(\vec{x}, \vec{y})$$
  – Sphere-shaped clusters are achieved
  – Preferable for most IR and NLP applications
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
(Figure: single-link vs. complete-link clustering on the same data)
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
• Group-average agglomerative clustering
  – A compromise between single-link and complete-link clustering
  – The similarity between two clusters is the average similarity between members
  – If the objects are represented as length-normalized vectors and the similarity measure is the cosine, there exists a fast algorithm for computing the average similarity:
    $$\mathrm{sim}(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x} \cdot \vec{y} \quad \text{(length-normalized vectors)}$$
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
• Group-average agglomerative clustering (cont)
  – The average similarity SIM between vectors in a cluster $c_j$ is defined as
    $$\mathrm{SIM}(c_j) = \frac{1}{|c_j|\,(|c_j|-1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \mathrm{sim}(\vec{x}, \vec{y}) = \frac{1}{|c_j|\,(|c_j|-1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \vec{x} \cdot \vec{y}$$
  – The sum of the members in a cluster $c_j$: $\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$
  – Express $\mathrm{SIM}(c_j)$ in terms of $\vec{s}(c_j)$:
    $$\vec{s}(c_j) \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} = |c_j|\,(|c_j|-1)\,\mathrm{SIM}(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} = |c_j|\,(|c_j|-1)\,\mathrm{SIM}(c_j) + |c_j|$$
    (each $\vec{x} \cdot \vec{x} = 1$ for length-normalized vectors)
    $$\therefore\ \mathrm{SIM}(c_j) = \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|\,(|c_j|-1)}$$
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
• Group-average agglomerative clustering (cont)
  – When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance:
    $$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$
  – The average similarity for their union will be
    $$\mathrm{SIM}(c_i \cup c_j) = \frac{\big(\vec{s}(c_i)+\vec{s}(c_j)\big)\cdot\big(\vec{s}(c_i)+\vec{s}(c_j)\big) - \big(|c_i|+|c_j|\big)}{\big(|c_i|+|c_j|\big)\big(|c_i|+|c_j|-1\big)}$$
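A sketch of the constant-time merge update implied by the formula above, assuming length-normalized vectors; only the sum vector s(c) and the size |c| of each cluster are needed (names are illustrative).

```python
import numpy as np

def sim_of_union(s_i, n_i, s_j, n_j):
    """Average pairwise cosine similarity of the merged cluster c_i ∪ c_j,
    computed from the cluster sum vectors and cluster sizes alone."""
    s = s_i + s_j            # s(c_New) = s(c_i) + s(c_j)
    n = n_i + n_j            # |c_New| = |c_i| + |c_j|
    return (s @ s - n) / (n * (n - 1))

# usage sketch: for length-normalized rows of two clusters A and B of a matrix X
# s_a, n_a = X[A].sum(axis=0), len(A)
# s_b, n_b = X[B].sum(axis=0), len(B)
# sim_of_union(s_a, n_a, s_b, n_b)
```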
IR ndash Berlin Chen 20
Example Word Clustering
• Words (objects) are described and clustered using a set of features and values
  – E.g., the left and right neighbors of tokens of words
  – "be" has least similarity with the other 21 words
  – (In the dendrogram, higher nodes correspond to decreasing similarity)
IR ndash Berlin Chen 21
Divisive Clustering
• A top-down approach
• Start with all objects in a single cluster
• At each iteration, select the least coherent cluster and split it
• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved
• The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure
• How to split a cluster
  – It is also a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.:
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
• Algorithm (annotations from the original figure)
  – At each iteration, split the least coherent cluster
  – Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
• Start out with a partition based on randomly selected seeds (one seed per cluster), and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)
• Problems associated with non-hierarchical clustering
  – When to stop? (e.g., judged by group-average similarity, likelihood, mutual information)
  – What is the right number of clusters (k−1 → k → k+1)?
    • Hierarchical clustering also has to face this problem
• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm
IR ndash Berlin Chen 26
The K-means Algorithm
• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Defines clusters by the center of mass of their members
• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution):
      $$\mathcal{X} = \{\vec{x}^t\}_{t=1}^{N} \ \xrightarrow{\ F\ }\ \{\vec{m}_j\}_{j=1}^{k}$$
      ($\vec{m}_j$: cluster centroid or reference vector / code word / code vector; the index $j$ serves as the code for $\vec{x}^t$)
  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., $\mathrm{Dim}(\vec{x}^t)=24$ → $k=2^8$
    • A compression rate of 3
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
• Total reconstruction error:
  $$E\big(\{\vec{m}_i\}_{i=1}^{k} \mid \mathcal{X}\big) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \,\big\|\vec{x}^t - \vec{m}_i\big\|^2,
  \qquad b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t - \vec{m}_i\| = \min_j \|\vec{x}^t - \vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$
  ($b_i^t$ acts as the label of sample $\vec{x}^t$)
  – $b_i^t$ and $\vec{m}_i$ are unknown
  – $b_i^t$ depends on $\vec{m}_i$, and this optimization problem cannot be solved analytically
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
• Initialization
  – A set of initial cluster centers $\{\vec{m}_i\}_{i=1}^{k}$ is needed
• Recursion
  – Assign each object $\vec{x}^t$ to the cluster whose center is closest:
    $$b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t - \vec{m}_i\| = \min_j \|\vec{x}^t - \vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$
  – Then re-compute the center of each cluster as the centroid or mean (average) of its members:
    $$\vec{m}_i = \frac{\sum_{t=1}^{N} b_i^t\,\vec{x}^t}{\sum_{t=1}^{N} b_i^t}$$
    • Alternatively, the medoid can be used as the cluster center (a medoid is one of the objects in the cluster)
• These two steps are repeated until $\vec{m}_i$ stabilizes
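A compact sketch of the two-step recursion above (assignment, then centroid re-estimation); the function name and defaults are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)]        # initial cluster centers
    for _ in range(n_iter):
        # assignment step: pick the closest center for each object
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # re-estimation step: each center becomes the mean of its assigned objects
        new_m = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else m[i]
                          for i in range(k)])
        if np.allclose(new_m, m):                            # centers have stabilized
            break
        m = new_m
    return m, labels
```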
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
• Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
• Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
• Example 2
  – (Figure: document clusters labeled government, finance, sports, research, name)
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
• The choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean $\vec{m}$ of all data and generate the k initial centers $\vec{m}_i$ by adding small random vectors to the mean ($\vec{m} \pm \vec{\delta}$)
  – Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the complete set
• Poor seeds will result in sub-optimal clustering
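Two of the seeding strategies listed above, sketched with illustrative helper names (random pick, and global mean plus small random perturbations):

```python
import numpy as np

def seeds_random(X, k, rng):
    return X[rng.choice(len(X), size=k, replace=False)]

def seeds_mean_plus_noise(X, k, rng, scale=1e-2):
    m = X.mean(axis=0)                                   # mean of all data
    delta = rng.normal(scale=scale, size=(k, X.shape[1]))
    return m + delta                                     # k centers: m ± small random vectors
```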
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
• How to break ties when there are several centers with the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly
• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with l = log₂k (k clusters); the nodes on the hypercube can then be fed to, e.g., a linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray
  – M → 2M clusters at each iteration: starting from the global mean, each mean is split (global mean → cluster 1/cluster 2 means → four Gaussians with parameters μ, Σ, ω, and so on; figure annotations)
IR ndash Berlin Chen 35
The EM Algorithm
• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions:
    $$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta_k)\, P(c_k), \qquad \pi_k = P(c_k)$$
  – Continuous case, e.g., a mixture of Gaussians (a Mixture Gaussian HMM, or a Mixture of Gaussians):
    $$P(\vec{x}_i \mid c_k, \Theta_k) = \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k)\Big)$$
  – Likelihood function for the data samples $\mathcal{X} = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$ are independent and identically distributed (i.i.d.):
    $$P(\mathcal{X} \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta_k)\, P(c_k)$$
  – Classification: assign $\vec{x}_i$ to the cluster with the largest posterior probability
    $$\max_k P(c_k \mid \vec{x}_i, \Theta) = \max_k \frac{P(\vec{x}_i \mid c_k, \Theta_k)\, P(c_k)}{P(\vec{x}_i \mid \Theta)} \ \Rightarrow\ \max_k P(\vec{x}_i \mid c_k, \Theta_k)\, P(c_k)$$
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
• Hard assignment
  – State S1:
    P(B|S1) = 2/4 = 0.5
    P(W|S1) = 2/4 = 0.5
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
• Soft assignment
  – Posterior probabilities of the four observations for State S1 and State S2:
    (0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)
  – P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
  – P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
  – P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 ≈ 0.27
  – P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 ≈ 0.73
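The soft counts above can be verified with a few lines (a sketch; the four rows are the per-observation posteriors for S1 and S2, and the B/W flags follow the slide):

```python
import numpy as np

post = np.array([[0.7, 0.3],   # observation 1 (B)
                 [0.4, 0.6],   # observation 2 (W)
                 [0.9, 0.1],   # observation 3 (B)
                 [0.5, 0.5]])  # observation 4 (W)
is_black = np.array([True, False, True, False])

for s, name in enumerate(["S1", "S2"]):
    denom = post[:, s].sum()
    print(f"P(B|{name}) = {post[is_black, s].sum() / denom:.2f},",
          f"P(W|{name}) = {post[~is_black, s].sum() / denom:.2f}")
# prints P(B|S1)=0.64, P(W|S1)=0.36, P(B|S2)=0.27, P(W|S2)=0.73
```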
IR ndash Berlin Chen 39
The EM Algorithm (cont)
• E-step (Expectation)
  – Derive the complete-data likelihood function. With $\mathcal{X} = \{\vec{x}_1, \ldots, \vec{x}_n\}$ and a cluster assignment $\mathcal{C} = \{c_{k_1}, \ldots, c_{k_n}\}$ (where $\vec{x}_i$ is generated by cluster $c_{k_i}$):
    $$\begin{aligned}
    P(\mathcal{X} \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta)
    = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \\
    &= \Big[\sum_{k_1=1}^{K} P(\vec{x}_1 \mid c_{k_1}, \Theta) P(c_{k_1} \mid \Theta)\Big] \times \cdots \times \Big[\sum_{k_n=1}^{K} P(\vec{x}_n \mid c_{k_n}, \Theta) P(c_{k_n} \mid \Theta)\Big] \\
    &= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta)
    = \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
    &= \sum_{\mathcal{C}} P(\mathcal{X}, \mathcal{C} \mid \Theta)
    \end{aligned}$$
    where $P(\mathcal{X}, \mathcal{C} \mid \Theta)$ is the complete-data likelihood function
  – How many kinds of $\mathcal{C}$? $K^n$ kinds
  – Note (a product of sums expands into a sum of products):
    $$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + \cdots + a_{2M}) \cdots (a_{T1} + \cdots + a_{TM})
    = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$
IR ndash Berlin Chen 40
The EM Algorithm (cont)
• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete-data likelihood $L_{CM}$ with respect to the hidden/latent variable $\mathcal{C}$, conditioned on the known data $\mathcal{X}$ and the current estimate $\hat{\Theta}$:
    $$\Phi(\Theta, \hat{\Theta}) = E_{\mathcal{C}}\big[\log P(\mathcal{X}, \mathcal{C} \mid \Theta) \,\big|\, \mathcal{X}, \hat{\Theta}\big]
    = \sum_{\mathcal{C}} P(\mathcal{C} \mid \mathcal{X}, \hat{\Theta})\, \log P(\mathcal{X}, \mathcal{C} \mid \Theta)$$
    ($\hat{\Theta}$ is known, $\Theta$ is unknown)
  – Maximize the log-likelihood function $\log P(\mathcal{X} \mid \Theta)$ by maximizing the expectation of the log complete-data likelihood, i.e., $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model: increasing $\Phi(\Theta, \hat{\Theta})$ over $\Phi(\hat{\Theta}, \hat{\Theta})$ guarantees $\log P(\mathcal{X} \mid \Theta) \ge \log P(\mathcal{X} \mid \hat{\Theta})$
IR ndash Berlin Chen 41
The EM Algorithm (cont)
• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$:
    $$\begin{aligned}
    \Phi(\Theta, \hat{\Theta}) &= \sum_{\mathcal{C}} P(\mathcal{C} \mid \mathcal{X}, \hat{\Theta})\, \log P(\mathcal{X}, \mathcal{C} \mid \Theta)
    = \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \Big[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \hat{\Theta})\Big] \log \Big[\prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta)\Big] \\
    &= \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \Big[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \hat{\Theta})\Big] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \Theta) \\
    &= \sum_{k=1}^{K} \sum_{i=1}^{n} \Big\{\sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \Big[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \hat{\Theta})\Big]\Big\} \log P(\vec{x}_i, c_k \mid \Theta),
    \qquad \delta_{k, k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases} \\
    &= \sum_{k=1}^{K} \sum_{i=1}^{n} P(c_k \mid \vec{x}_i, \hat{\Theta}) \, \log P(\vec{x}_i, c_k \mid \Theta)
    \qquad \text{(see the next slide)} \\
    &= \sum_{k=1}^{K} \sum_{i=1}^{n} P(c_k \mid \vec{x}_i, \hat{\Theta}) \, \log\big[P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)\big]
    \end{aligned}$$
IR ndash Berlin Chen 42
The EM Algorithm (cont)
– Note that, for a fixed pair $(i, k)$:
  $$\begin{aligned}
  \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \hat{\Theta})
  &= P(c_k \mid \vec{x}_i, \hat{\Theta}) \prod_{\substack{j=1 \\ j \neq i}}^{n} \Big[\sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \hat{\Theta})\Big] \\
  &= P(c_k \mid \vec{x}_i, \hat{\Theta}) \cdot 1 = P(c_k \mid \vec{x}_i, \hat{\Theta})
  \end{aligned}$$
  because $\vec{x}_i$ can only be aligned to $c_k$ (the $\delta_{k,k_i}$ term fixes $k_i = k$), and each bracketed sum over the posteriors of the other samples equals 1
– Note again that a product of sums expands into a sum of products:
  $$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + \cdots + a_{1M}) \cdots (a_{T1} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$
IR ndash Berlin Chen 43
The EM Algorithm (cont)
• E-step (Expectation)
  – The auxiliary function can also be divided into two parts:
    $$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$
    where
    $$\Phi_a(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \hat{\Theta})\, \log P(c_k \mid \Theta)
    = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta})}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \hat{\Theta})\, P(c_l \mid \hat{\Theta})}\, \log P(c_k \mid \Theta)$$
    (the auxiliary function for the mixture weights)
    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta})}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \hat{\Theta})\, P(c_l \mid \hat{\Theta})}\, \log P(\vec{x}_i \mid c_k, \Theta)$$
    (the auxiliary function for the cluster distributions)
IR ndash Berlin Chen 44
The EM Algorithm (cont)
• M-step (Maximization)
  – Remember that a function F can be maximized under a constraint by applying a Lagrange multiplier
  – Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$ with the constraint $\sum_{j=1}^{N} y_j = 1$. By applying a Lagrange multiplier $l$:
    $$\bar{F} = \sum_{j=1}^{N} w_j \log y_j + l\Big(1 - \sum_{j=1}^{N} y_j\Big)$$
    $$\frac{\partial \bar{F}}{\partial y_j} = \frac{w_j}{y_j} - l = 0 \ \Rightarrow\ w_j = l\, y_j,\ \forall j
    \ \Rightarrow\ l = \sum_{j=1}^{N} w_j
    \ \Rightarrow\ \therefore\ \hat{y}_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$$
    (Note: $\frac{\partial \log y_j}{\partial y_j} = \frac{1}{y_j}$)
IR ndash Berlin Chen 45
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians), under the constraint $\sum_{k} P(c_k \mid \Theta) = 1$:
    $$\Phi_a(\Theta, \hat{\Theta}) + l\Big(1 - \sum_{k=1}^{K} P(c_k \mid \Theta)\Big)
    = \sum_{k=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta})}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \hat{\Theta})\, P(c_l \mid \hat{\Theta})}\, \log P(c_k \mid \Theta) + l\Big(1 - \sum_{k=1}^{K} P(c_k \mid \Theta)\Big)$$
  – This has the form of the previous slide with $w_k = \sum_{i=1}^{n} P(c_k \mid \vec{x}_i, \hat{\Theta})$ and $y_k = P(c_k \mid \Theta)$, so
    $$\hat{P}(c_k \mid \Theta) = \hat{\pi}_k = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
    = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta})}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \hat{\Theta})\, P(c_l \mid \hat{\Theta})}}
    {\sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \hat{\Theta})\, P(c_{k'} \mid \hat{\Theta})}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \hat{\Theta})\, P(c_l \mid \hat{\Theta})}}
    = \frac{1}{n} \sum_{i=1}^{n} P(c_k \mid \vec{x}_i, \hat{\Theta})$$
  – $w_k$ is the expected number of times that the samples $\vec{x}_i$ fall in class $c_k$
IR ndash Berlin Chen 46
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances:
    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta})}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \hat{\Theta})\, P(c_l \mid \hat{\Theta})}\, \log P(\vec{x}_i \mid c_k, \Theta)$$
    where
    $$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\!\Big(-\tfrac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k)\Big)$$
    $$\log P(\vec{x}_i \mid c_k, \Theta) = -\tfrac{m}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k)$$
  – Let $w_{ik} = \frac{P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta})}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \hat{\Theta})\, P(c_l \mid \hat{\Theta})}$. Then
    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Big[-\tfrac{1}{2}\log|\hat{\Sigma}_k| - \tfrac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D\Big]$$
    where D is a constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$ (using $\frac{d}{d\vec{x}}(\vec{x}^T C \vec{x}) = (C + C^T)\vec{x}$; $\hat{\Sigma}_k^{-1}$ is symmetric here):
    $$\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k}
    = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0
    \ \Rightarrow\ \sum_{i=1}^{n} w_{ik} (\vec{x}_i - \hat{\vec{\mu}}_k) = \vec{0}$$
    $$\Rightarrow\ \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
    = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta})}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \hat{\Theta})\, P(c_l \mid \hat{\Theta})}\, \vec{x}_i}
    {\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta})}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \hat{\Theta})\, P(c_l \mid \hat{\Theta})}}$$
  – $\sum_{i} w_{ik}$ is the expected number of times that the samples $\vec{x}_i$ fall in class $c_k$
IR ndash Berlin Chen 48
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$, using the matrix derivatives
    $\frac{d}{dX}\det(X) = \det(X)\,(X^{-1})^T$ and $\frac{d}{dX}\big(\vec{a}^T X^{-1} \vec{b}\big) = -X^{-1} \vec{a}\, \vec{b}^T X^{-1}$ ($\hat{\Sigma}_k$ is symmetric here):
    $$\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k}
    = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \Big[\hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1}\Big] = 0$$
    $$\Rightarrow\ \Big(\sum_{i=1}^{n} w_{ik}\Big)\, \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T$$
    $$\Rightarrow\ \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
    = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta})}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \hat{\Theta})\, P(c_l \mid \hat{\Theta})}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}
    {\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta})}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \hat{\Theta})\, P(c_l \mid \hat{\Theta})}}$$
IR ndash Berlin Chen 49
The EM Algorithm (cont)
• The initial cluster distributions can be estimated using the K-means algorithm
• The procedure terminates when the likelihood function $P(\mathcal{X} \mid \Theta)$ converges, or when the maximum number of iterations is reached
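A compact sketch of the EM updates derived above for a mixture of Gaussians: the E-step computes the posteriors w_ik, and the M-step re-estimates the priors, means, and covariances. The K-means-style initialization is replaced here by a simple random pick; all names and defaults are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]                 # could instead come from K-means
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: w[i, k] = P(c_k | x_i, Theta_hat)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                                for k in range(K)])
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means, and covariances
        Nk = w.sum(axis=0)                                  # expected counts per cluster
        pi = Nk / n
        mu = (w.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (w[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, sigma
```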
IR ndash Berlin Chen 50
Hierarchical Document Organization
• Explore the probabilistic latent topical information
  – The TMM/PLSA approach
• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
• Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map
• When a cluster has many documents, we can further analyze it into another map on the next layer
• Two-dimensional tree structure for organized topics:
  $$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \Big[\sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l)\Big]$$
  $$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)},
  \qquad E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Big[-\frac{dist(T_k, T_l)^2}{2\sigma^2}\Big]$$
  $$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
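A small sketch of how the two-layer generation probability above can be computed with matrices; all names are illustrative (P_T_given_D holds P(T_k|D_i), P_w_given_T holds P(w_j|T_l), and coords are the 2-D map positions of the topics).

```python
import numpy as np

def topic_smoothing(coords, sigma=1.0):
    """P(T_l | Y_k): Gaussian smoothing of topics by their distance on the map."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    E = np.exp(-d2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return E / E.sum(axis=1, keepdims=True)           # rows: k, columns: l

def word_given_doc(P_T_given_D, P_w_given_T, P_Tl_given_Yk):
    """P(w_j | D_i) = sum_k P(T_k|D_i) * sum_l P(T_l|Y_k) P(w_j|T_l)."""
    smoothed_word_dist = P_Tl_given_Yk @ P_w_given_T  # (K topics) x (J words)
    return P_T_given_D @ smoothed_word_dist           # (N docs) x (J words)
```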
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection:
  $$L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i)\, \log P(w_j \mid D_i)
  = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i)\, \log\Big\{\sum_{k=1}^{K} P(T_k \mid D_i) \Big[\sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l)\Big]\Big\}$$
  – EM training can be performed with the update formulas
  $$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i=1}^{N} c(w_{j'}, D_i)\, P(T_k \mid w_{j'}, D_i)},
  \qquad
  \hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$
  where
  $$P(T_k \mid w_j, D_i) = \frac{\Big[\sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k)\Big] P(T_k \mid D_i)}
  {\sum_{k'=1}^{K} \Big[\sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'})\Big] P(T_{k'} \mid D_i)}$$
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
• Criterion for topic word selection:
  $$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i=1}^{N} c(w_j, D_i)\, \big[1 - P(T_k \mid D_i)\big]}$$
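The selection score above in a few lines (a sketch; `counts` is the term-by-document count matrix c(w_j, D_i) and `P_T_given_D` holds P(T_k|D_i)):

```python
import numpy as np

def topic_word_score(counts, P_T_given_D):
    """S(w_j, T_k) = sum_i c(w_j,D_i) P(T_k|D_i) / sum_i c(w_j,D_i) [1 - P(T_k|D_i)]."""
    num = counts @ P_T_given_D             # (J words) x (K topics)
    den = counts @ (1.0 - P_T_given_D)
    return num / np.maximum(den, 1e-12)    # guard against division by zero
```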
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
• Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
• Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
• Self-Organizing Map (SOM)
  – A recursive regression process
  – Input vector (input layer): $\vec{x} = [x_1, x_2, \ldots, x_n]^T$
  – Weight vector of map unit i (mapping layer): $\vec{m}_i = [m_{i1}, m_{i2}, \ldots, m_{in}]^T$
  – Update rule:
    $$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}), i}(t)\, \big[\vec{x}(t) - \vec{m}_i(t)\big]$$
    where the winner unit is $c(\vec{x}) = \arg\min_{i'} \|\vec{x} - \vec{m}_{i'}\|$ with $\|\vec{x} - \vec{m}_{i'}\| = \sqrt{\sum_{n} (x_n - m_{i'n})^2}$, and the neighborhood function is
    $$h_{c(\vec{x}), i}(t) = \alpha(t)\, \exp\!\Big(-\frac{\|\vec{r}_i - \vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)}\Big)$$
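A sketch of one SOM training sweep implementing the recursion above; the linear decay schedules for the learning rate α(t) and the neighborhood width σ(t) are illustrative assumptions, as are the names.

```python
import numpy as np

def som_train(X, grid_w, grid_h, n_epochs=20, alpha0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    coords = np.array([(i, j) for i in range(grid_w) for j in range(grid_h)], float)
    M = rng.normal(size=(grid_w * grid_h, X.shape[1]))        # weight vectors m_i
    T = n_epochs * len(X)
    t = 0
    for _ in range(n_epochs):
        for x in rng.permutation(X):
            c = np.argmin(np.linalg.norm(x - M, axis=1))      # winner unit c(x)
            alpha = alpha0 * (1 - t / T)                      # assumed decay schedule
            sigma = sigma0 * (1 - t / T) + 1e-3
            h = alpha * np.exp(-((coords - coords[c]) ** 2).sum(axis=1) / (2 * sigma ** 2))
            M += h[:, None] * (x - M)                         # m_i(t+1) = m_i(t) + h [x - m_i(t)]
            t += 1
    return M, coords
```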
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
• Results (table: Model, Iterations, distWithin, distBetween, comparing TMM and SOM)
• Evaluation metric:
  $$R_{dist} = \frac{dist_{Between}}{dist_{Within}}$$
  where
  $$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)},
  \quad
  f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & \text{if } T_{r(i)} \neq T_{r(j)} \\ 0 & \text{otherwise} \end{cases},
  \quad
  C_{Between}(i, j) = \begin{cases} 1 & \text{if } T_{r(i)} \neq T_{r(j)} \\ 0 & \text{otherwise} \end{cases}$$
  $$dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)},
  \quad
  f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & \text{if } T_{r(i)} = T_{r(j)} \\ 0 & \text{otherwise} \end{cases},
  \quad
  C_{Within}(i, j) = \begin{cases} 1 & \text{if } T_{r(i)} = T_{r(j)} \\ 0 & \text{otherwise} \end{cases}$$
  $$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
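A sketch of the R_dist evaluation above: the average map distance between documents whose reference topics differ (Between) divided by the average for documents whose reference topics agree (Within); names are illustrative.

```python
import numpy as np

def r_dist(doc_xy, ref_topic):
    """doc_xy: (D, 2) map coordinates of documents; ref_topic: (D,) reference topic labels."""
    D = len(doc_xy)
    between, n_between, within, n_within = 0.0, 0, 0.0, 0
    for i in range(D):
        for j in range(i + 1, D):
            d = np.sqrt(((doc_xy[i] - doc_xy[j]) ** 2).sum())   # dist_Map(i, j)
            if ref_topic[i] != ref_topic[j]:
                between += d; n_between += 1
            else:
                within += d; n_within += 1
    return (between / n_between) / (within / n_within)
```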
IR ndash Berlin Chen 2
Clustering
bull Place similar objects in the same group and assign dissimilar objects to different groupsndash Word clustering
bull Neighbor overlap words occur with the similar left and right neighbors (such as in and on)
ndash Document clusteringbull Documents with the similar topics or concepts are put
together
bull But clustering cannot give a comprehensive description of the objectndash How to label objects shown on the visual display
bull Regarded as a kind of semiparametric learning approachndash Allow a mixture of distributions to be used for estimating the
input samples (a parametric model for each group of samples)
IR ndash Berlin Chen 3
Clustering vs Classification
bull Classification is supervised and requires a set of labeled training instances for each group (class)ndash Learning with a teacher
bull Clustering is unsupervised and learns without a teacher to provide the labeling information of the training data set
ndash Also called automatic or unsupervised classification
IR ndash Berlin Chen 4
Types of Clustering Algorithms
bull Two types of structures produced by clustering algorithmsndash Flat or non-hierarchical clusteringndash Hierarchical clustering
bull Flat clusteringndash Simply consisting of a certain number of clusters and the relation
between clusters is often undeterminedndash Measurement construction error minimization or probabilistic
optimizationbull Hierarchical clustering
ndash A hierarchy with usual interpretation that each node stands for a subclass of its motherrsquos node
bull The leaves of the tree are the single objectsbull Each node represents the cluster that contains all the objects
of its descendantsndash Measurement similarities of instances
IR ndash Berlin Chen 5
Hard Assignment vs Soft Assignment
bull Another important distinction between clustering algorithms is whether they perform soft or hard assignment
bull Hard Assignmentndash Each object is assigned to one and only one cluster
bull Soft Assignment (probabilistic approach)ndash Each object may be assigned to multiple clustersndash An object has a probability distribution over
clusters where is the probability that is a member of
ndash Is somewhat more appropriate in many tasks such as NLP IR hellip
ix ( )ixP sdotjc
jc( )ji cxP ix
IR ndash Berlin Chen 6
Hard Assignment vs Soft Assignment (cont)
bull Hierarchical clustering usually adopts hard assignment
bull While in flat clustering both types of assignments are common
IR ndash Berlin Chen 7
Summarized Attributes of Clustering Algorithms bull Hierarchical Clustering
ndash Preferable for detailed data analysis
ndash Provide more information than flat clustering
ndash No single best algorithm (each of the algorithms only optimal for some applications)
ndash Less efficient than flat clustering (minimally have to compute n x nmatrix of similarity coefficients)
IR ndash Berlin Chen 8
Summarized Attributes of Clustering Algorithms (cont)
bull Flat Clusteringndash Preferable if efficiency is a consideration or data sets are very
large
ndash K-means is the conceptually method and should probably be used on a new data because its results are often sufficient
ndash K-means assumes a simple Euclidean representation space and so cannot be used for many data sets eg nominal data like colors (or samples with features of different scales)
ndash The EM algorithm is the most choice It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
bull Its extensions can be used to handle topologicalhierarchical orders of samples
ndash Probabilistic Latent Semantic Analysis (PLSA) Topic Mixture Model (TMM) etc
IR ndash Berlin Chen 9
Hierarchical Clustering
IR ndash Berlin Chen 10
Hierarchical Clustering
bull Can be in either bottom-up or top-down mannersndash Bottom-up (agglomerative)
bull Start with individual objects and grouping the most similar ones
ndash Eg with the minimum distance apart
bull The procedure terminates when one cluster containing all objects has been formed
ndash Top-down (divisive)bull Start with all objects in a group and divide them into groups
so as to maximize within-group similarity
( ) ( )yxdyxsim
11
+=
凝集的
分裂的
distance measures willbe discussed later on
IR ndash Berlin Chen 11
Hierarchical Agglomerative Clustering (HAC)
bull A bottom-up approach
bull Assume a similarity measure for determining the similarity of two objects
bull Start with all objects in a separate cluster and then repeatedly joins the two clusters that have the most similarity until there is one only cluster survived
bull The history of mergingclustering forms a binary tree or hierarchy
IR ndash Berlin Chen 12
HAC (cont)
bull Algorithm
cluster number
Initialization (for tree leaves)Each object is a cluster
merged as a new cluster
The original two clusters are removed
IR ndash Berlin Chen 13
Distance Metrics
bull Euclidian Distance (L2 norm)
ndash Make sure that all attributesdimensions have the same scale (orthe same variance)
bull L1 Norm (City-block distance)
bull Cosine Similarity (transform to a distance by subtracting from 1)
2
12 )()( i
m
ii yxyxL minus=sum
=
rr
sum=
minus=m
iii yxyxL
11 )( rr
yxyxrr
rr
sdotminus
bull1 ranged between 0 and 1
IR ndash Berlin Chen 14
Measures of Cluster Similaritybull Especially for the bottom-up approaches
bull Single-link clusteringndash The similarity between two clusters is the similarity of the two
closest objects in the clusters
ndash Search over all pairs of objects that are from the two differentclusters and select the pair with the greatest similarity
ndash Elongated clusters are achieved
Ci Cj
greatest similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= max
cf the minimal spanning tree
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 3
Clustering vs Classification
bull Classification is supervised and requires a set of labeled training instances for each group (class)ndash Learning with a teacher
bull Clustering is unsupervised and learns without a teacher to provide the labeling information of the training data set
ndash Also called automatic or unsupervised classification
IR ndash Berlin Chen 4
Types of Clustering Algorithms
bull Two types of structures produced by clustering algorithmsndash Flat or non-hierarchical clusteringndash Hierarchical clustering
bull Flat clusteringndash Simply consisting of a certain number of clusters and the relation
between clusters is often undeterminedndash Measurement construction error minimization or probabilistic
optimizationbull Hierarchical clustering
ndash A hierarchy with usual interpretation that each node stands for a subclass of its motherrsquos node
bull The leaves of the tree are the single objectsbull Each node represents the cluster that contains all the objects
of its descendantsndash Measurement similarities of instances
IR ndash Berlin Chen 5
Hard Assignment vs Soft Assignment
bull Another important distinction between clustering algorithms is whether they perform soft or hard assignment
bull Hard Assignmentndash Each object is assigned to one and only one cluster
bull Soft Assignment (probabilistic approach)ndash Each object may be assigned to multiple clustersndash An object has a probability distribution over
clusters where is the probability that is a member of
ndash Is somewhat more appropriate in many tasks such as NLP IR hellip
ix ( )ixP sdotjc
jc( )ji cxP ix
IR ndash Berlin Chen 6
Hard Assignment vs Soft Assignment (cont)
bull Hierarchical clustering usually adopts hard assignment
bull While in flat clustering both types of assignments are common
IR ndash Berlin Chen 7
Summarized Attributes of Clustering Algorithms bull Hierarchical Clustering
ndash Preferable for detailed data analysis
ndash Provide more information than flat clustering
ndash No single best algorithm (each of the algorithms only optimal for some applications)
ndash Less efficient than flat clustering (minimally have to compute n x nmatrix of similarity coefficients)
IR ndash Berlin Chen 8
Summarized Attributes of Clustering Algorithms (cont)
bull Flat Clusteringndash Preferable if efficiency is a consideration or data sets are very
large
ndash K-means is the conceptually method and should probably be used on a new data because its results are often sufficient
ndash K-means assumes a simple Euclidean representation space and so cannot be used for many data sets eg nominal data like colors (or samples with features of different scales)
ndash The EM algorithm is the most choice It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
bull Its extensions can be used to handle topologicalhierarchical orders of samples
ndash Probabilistic Latent Semantic Analysis (PLSA) Topic Mixture Model (TMM) etc
IR ndash Berlin Chen 9
Hierarchical Clustering
IR ndash Berlin Chen 10
Hierarchical Clustering
bull Can be in either bottom-up or top-down mannersndash Bottom-up (agglomerative)
bull Start with individual objects and grouping the most similar ones
ndash Eg with the minimum distance apart
bull The procedure terminates when one cluster containing all objects has been formed
ndash Top-down (divisive)bull Start with all objects in a group and divide them into groups
so as to maximize within-group similarity
( ) ( )yxdyxsim
11
+=
凝集的
分裂的
distance measures willbe discussed later on
IR ndash Berlin Chen 11
Hierarchical Agglomerative Clustering (HAC)
bull A bottom-up approach
bull Assume a similarity measure for determining the similarity of two objects
bull Start with all objects in a separate cluster and then repeatedly joins the two clusters that have the most similarity until there is one only cluster survived
bull The history of mergingclustering forms a binary tree or hierarchy
IR ndash Berlin Chen 12
HAC (cont)
bull Algorithm
cluster number
Initialization (for tree leaves)Each object is a cluster
merged as a new cluster
The original two clusters are removed
IR ndash Berlin Chen 13
Distance Metrics
bull Euclidian Distance (L2 norm)
ndash Make sure that all attributesdimensions have the same scale (orthe same variance)
bull L1 Norm (City-block distance)
bull Cosine Similarity (transform to a distance by subtracting from 1)
2
12 )()( i
m
ii yxyxL minus=sum
=
rr
sum=
minus=m
iii yxyxL
11 )( rr
yxyxrr
rr
sdotminus
bull1 ranged between 0 and 1
IR ndash Berlin Chen 14
Measures of Cluster Similaritybull Especially for the bottom-up approaches
bull Single-link clusteringndash The similarity between two clusters is the similarity of the two
closest objects in the clusters
ndash Search over all pairs of objects that are from the two differentclusters and select the pair with the greatest similarity
ndash Elongated clusters are achieved
Ci Cj
greatest similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= max
cf the minimal spanning tree
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
• Words (objects) are described and clustered using a set of features and values
  – E.g., the left and right neighbors of tokens of words
"be" has the least similarity with the other 21 words; higher nodes indicate decreasing similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure
• How to split a cluster?
  – Splitting is itself a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.:
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
• Algorithm
  – Repeat: split the least coherent cluster, generating two new clusters and removing the original one, until a predefined criterion is met (a minimal sketch follows)
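A possible sketch of this procedure, using K-means with k = 2 as the splitting operation and group-average self-similarity as the coherence measure (scikit-learn's KMeans is assumed available; length-normalized rows and the function names are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

def coherence(cluster):
    """Group-average self-similarity of a cluster (length-normalized rows assumed)."""
    s, n = cluster.sum(axis=0), len(cluster)
    return (s @ s - n) / (n * (n - 1)) if n > 1 else 1.0

def divisive(X, num_clusters):
    clusters = [X]                                   # start with all objects in a single cluster
    while len(clusters) < num_clusters:
        worst = min(range(len(clusters)), key=lambda i: coherence(clusters[i]))
        target = clusters.pop(worst)                 # select the least coherent cluster
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(target)
        clusters += [target[labels == 0], target[labels == 1]]   # two new sub-clusters
    return clusters
```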
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)
• Problems associated with non-hierarchical clustering
  – When to stop? (criteria: group-average similarity, likelihood, mutual information; hierarchical clustering also has to face this problem)
  – What is the right number of clusters? (k−1 → k → k+1)
• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
– E.g., color quantization
  • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors)
  • A compression ratio of 3
$$F:\ \mathcal{X}=\{x^t\}_{t=1}^{N}\ \xrightarrow{\ \text{encode}\ }\ \mathcal{M}=\{m_j\}_{j=1}^{k}$$
where m_j is the cluster centroid (reference vector, code word, code vector) and each x^t is mapped to the index j of its nearest centroid; e.g., Dim(x^t) = 24 → k = 2^8
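A small sketch of the encoding step (the codebook here is random, only to illustrate the mapping; in practice it would come from K-means/LBG as described below):

```python
import numpy as np

def vq_encode(X, codebook):
    """Map each continuous vector to the index of its nearest reference (code) vector."""
    d = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)   # squared distances to all centroids
    return d.argmin(axis=1)                                         # code-word index per input vector

# e.g. pixels (24-bit RGB) quantized with k = 2**8 = 256 code vectors
pixels = np.random.randint(0, 256, size=(1000, 3)).astype(float)
codebook = pixels[np.random.choice(len(pixels), 256, replace=False)]  # random codebook, illustration only
indices = vq_encode(pixels, codebook)
reconstructed = codebook[indices]                                     # decode: 8 bits/pixel instead of 24
```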
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
– m_i and b_i^t are unknown
– b_i^t depends on m_i, and this optimization problem cannot be solved analytically
Total reconstruction error:
$$E\big(\{m_i\}_{i=1}^{k}\ \big|\ X\big) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t\,\left\|x^t - m_i\right\|^2,\qquad b_i^t = \begin{cases}1 & \text{if } \left\|x^t - m_i\right\| = \min_j\left\|x^t - m_j\right\|\\ 0 & \text{otherwise}\end{cases}$$
(b_i^t acts as the label of sample x^t)
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
$$\{m_i\}_{i=1}^{k}\quad\text{(initial centers)}$$
$$b_i^t = \begin{cases}1 & \text{if } \left\|x^t - m_i\right\| = \min_j\left\|x^t - m_j\right\|\\ 0 & \text{otherwise}\end{cases}\qquad\qquad m_i = \frac{\sum_{t=1}^{N} b_i^t\, x^t}{\sum_{t=1}^{N} b_i^t}$$
These two steps are repeated until the m_i stabilize.
IR ndash Berlin Chen 29
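A compact sketch of these two steps (illustrative, not the slide's own listing):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]          # initialization: k seeds
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                               # assign each object to the closest center
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
                                for i in range(k)])             # re-compute centroids of the members
        if np.allclose(new_centers, centers):                   # stop when the m_i stabilize
            break
        centers = new_centers
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d.argmin(axis=1)
```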
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
(Figure: example document clusters such as government, finance, sports, research, name)
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
– Pick at random
– Calculate the mean of all data and generate k initial centers by adding a small random vector to the mean
– Project the data onto the first principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each interval as an initial center
– Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
  • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the complete set
bull Poor seeds will result in sub-optimal clustering
(m_i = mean ± δ, where δ is a small random vector)
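Sketches of two of the seeding strategies above (illustrative; the buckshot variant assumes SciPy's average-linkage routine as a stand-in for group-average agglomerative clustering, and that the sample yields k non-empty clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def seeds_from_mean(X, k, scale=0.01, seed=0):
    """k initial centers: the global mean plus a small random vector (mean ± delta)."""
    rng = np.random.default_rng(seed)
    return X.mean(axis=0) + scale * rng.standard_normal((k, X.shape[1]))

def seeds_buckshot(X, k, seed=0):
    """Buckshot-style seeding: cluster a random sample of size about sqrt(n) with
    average-linkage agglomerative clustering and use the resulting centroids as seeds."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), max(k, int(np.sqrt(len(X)))), replace=False)]
    labels = fcluster(linkage(sample, method="average"), t=k, criterion="maxclust")
    return np.array([sample[labels == i].mean(axis=0) for i in range(1, k + 1)])
```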
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
• How to break ties when several centers are at the same distance from an object?
  – Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with l = log₂k (k clusters); the nodes on the hypercube can then feed a linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
(Figure: the global mean is split into Cluster 1 and Cluster 2 means, each further split into components with parameters (μ, Σ, ω); M → 2M clusters at each iteration)
IR ndash Berlin Chen 35
The EM Algorithm
• A soft version of the K-means algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
(Mixture figure: priors π_1 = P(c_1), π_2 = P(c_2), …, π_K = P(c_K) and component densities P(x_i|c_1), P(x_i|c_2), …, P(x_i|c_K))

$$P(x_i\mid\Theta) = \sum_{k=1}^{K} P(x_i\mid c_k,\Theta)\,P(c_k\mid\Theta)$$

Continuous case (a mixture of Gaussians, cf. a mixture-Gaussian HMM):
$$P(x_i\mid c_k,\Theta) = \frac{1}{(2\pi)^{m/2}\left|\Sigma_k\right|^{1/2}}\exp\!\left(-\frac{1}{2}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)\right)$$

Likelihood function for the data samples X = {x_1, x_2, …, x_n}, where the x_i are independent and identically distributed (i.i.d.):
$$P(X\mid\Theta) = \prod_{i=1}^{n} P(x_i\mid\Theta) = \prod_{i=1}^{n}\sum_{k=1}^{K} P(x_i\mid c_k,\Theta)\,P(c_k\mid\Theta)$$

Classification:
$$P(c_k\mid x_i,\Theta) = \frac{P(x_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{P(x_i\mid\Theta)},\qquad k^{*} = \arg\max_k\,P(x_i\mid c_k,\Theta)\,P(c_k\mid\Theta)$$
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B|S1) = 2/4 = 0.5
P(W|S1) = 2/4 = 0.5
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
                     State S1   State S2
observation 1 (B):     0.7        0.3
observation 2 (W):     0.4        0.6
observation 3 (B):     0.9        0.1
observation 4 (W):     0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 ≈ 0.27
P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 ≈ 0.73
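The same soft counts can be reproduced in a few lines (the B/W labels of the four observations are inferred from the numerators above and are an assumption of this sketch):

```python
# Soft-count maximum likelihood estimation: each observation contributes its
# posterior weight for a state to that state's symbol counts.
obs = ["B", "W", "B", "W"]                        # observed symbols (inferred labeling)
post_s1 = [0.7, 0.4, 0.9, 0.5]                    # P(S1 | observation), as in the table
post_s2 = [0.3, 0.6, 0.1, 0.5]                    # P(S2 | observation)

def soft_estimate(symbol, posts):
    num = sum(p for o, p in zip(obs, posts) if o == symbol)
    return num / sum(posts)

print(soft_estimate("B", post_s1))   # (0.7 + 0.9) / 2.5 = 0.64
print(soft_estimate("W", post_s1))   # (0.4 + 0.5) / 2.5 = 0.36
print(soft_estimate("B", post_s2))   # (0.3 + 0.1) / 1.5 ≈ 0.27
print(soft_estimate("W", post_s2))   # (0.6 + 0.5) / 1.5 ≈ 0.73
```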
IR ndash Berlin Chen 39
The EM Algorithm (cont)
• E-step (Expectation)
  – Derive the complete-data likelihood function
$$P(X\mid\Theta) = \prod_{i=1}^{n}P(x_i\mid\Theta) = \prod_{i=1}^{n}\sum_{k=1}^{K}P(x_i\mid c_k,\Theta)\,P(c_k\mid\Theta)$$
$$= \Big[\sum_{k_1=1}^{K}P(x_1\mid c_{k_1},\Theta)P(c_{k_1}\mid\Theta)\Big]\times\cdots\times\Big[\sum_{k_n=1}^{K}P(x_n\mid c_{k_n},\Theta)P(c_{k_n}\mid\Theta)\Big]$$
$$= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\Big[P(x_1,c_{k_1}\mid\Theta)\,P(x_2,c_{k_2}\mid\Theta)\cdots P(x_n,c_{k_n}\mid\Theta)\Big] = \sum_{C}P(X,C\mid\Theta)$$
where
$$P(X,C\mid\Theta) = \prod_{i=1}^{n}P(x_i,c_{k_i}\mid\Theta) = \prod_{i=1}^{n}P(x_i\mid c_{k_i},\Theta)\,P(c_{k_i}\mid\Theta)$$
is the complete-data likelihood function, with the hidden assignments
$$C = \{c_{k_1}, c_{k_2}, \ldots, c_{k_n}\}\quad\text{aligned with}\quad X = \{x_1, x_2, \ldots, x_n\}$$
How many kinds of C? K^n kinds.
Note (expansion of a product of sums):
$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+\cdots+a_{2M})\cdots(a_{T1}+\cdots+a_{TM}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\ \prod_{t=1}^{T} a_{t k_t}$$
IR ndash Berlin Chen 40
The EM Algorithm (cont)
• E-step (Expectation)
  – Define the auxiliary function Φ(Θ, Θ̂) as the expectation of the log complete-data likelihood function, taken with respect to the hidden/latent variable C conditioned on the known data X and the current parameters Θ (Θ: known, Θ̂: unknown parameters to be estimated):
$$\Phi(\Theta,\hat\Theta) = E_C\big[\log P(X,C\mid\hat\Theta)\ \big|\ X,\Theta\big] = \sum_{C} P(C\mid X,\Theta)\,\log P(X,C\mid\hat\Theta)$$
  – Maximize the log likelihood function log P(X|Θ̂) by maximizing the expectation of the log complete-data likelihood function:
$$\Phi(\Theta,\hat\Theta) - \Phi(\Theta,\Theta)\ \le\ \log P(X\mid\hat\Theta) - \log P(X\mid\Theta)$$
  • We have shown this property when deriving the HMM-based retrieval model
IR ndash Berlin Chen 41
The EM Algorithm (cont)
• E-step (Expectation)
  – The auxiliary function Φ(Θ, Θ̂):
$$\Phi(\Theta,\hat\Theta) = \sum_{C}P(C\mid X,\Theta)\,\log P(X,C\mid\hat\Theta)$$
$$= \sum_{k_1=1}^{K}\cdots\sum_{k_n=1}^{K}\left[\prod_{j=1}^{n}P(c_{k_j}\mid x_j,\Theta)\right]\left[\sum_{i=1}^{n}\log P(x_i,c_{k_i}\mid\hat\Theta)\right]$$
$$= \sum_{k=1}^{K}\sum_{i=1}^{n}\left\{\sum_{k_1=1}^{K}\cdots\sum_{k_n=1}^{K}\left[\prod_{j=1}^{n}P(c_{k_j}\mid x_j,\Theta)\right]\delta(k,k_i)\right\}\log P(x_i,c_k\mid\hat\Theta)$$
$$= \sum_{k=1}^{K}\sum_{i=1}^{n}P(c_k\mid x_i,\Theta)\,\log P(x_i,c_k\mid\hat\Theta)\qquad\text{(see next slide)}$$
$$= \sum_{k=1}^{K}\sum_{i=1}^{n}P(c_k\mid x_i,\Theta)\,\log\big[P(c_k\mid\hat\Theta)\,P(x_i\mid c_k,\hat\Theta)\big]$$
$$= \sum_{k=1}^{K}\sum_{i=1}^{n}P(c_k\mid x_i,\Theta)\log P(c_k\mid\hat\Theta) + \sum_{k=1}^{K}\sum_{i=1}^{n}P(c_k\mid x_i,\Theta)\log P(x_i\mid c_k,\hat\Theta)$$
where
$$\delta(k,k_i) = \begin{cases}1 & \text{if } k = k_i\\ 0 & \text{otherwise}\end{cases}$$
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
$$\sum_{k_1=1}^{K}\cdots\sum_{k_n=1}^{K}\left[\prod_{j=1}^{n}P(c_{k_j}\mid x_j,\Theta)\right]\delta(k,k_i) = P(c_k\mid x_i,\Theta)\prod_{\substack{j=1\\ j\neq i}}^{n}\left[\sum_{k_j=1}^{K}P(c_{k_j}\mid x_j,\Theta)\right] = P(c_k\mid x_i,\Theta)\cdot 1 = P(c_k\mid x_i,\Theta)$$
since each remaining inner sum equals 1; through δ(k, k_i), x_i can only be aligned to c_k, and the other assignments marginalize out.
Note (expansion of a product of sums):
$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = (a_{11}+\cdots+a_{1M})(a_{21}+\cdots+a_{2M})\cdots(a_{T1}+\cdots+a_{TM}) = \sum_{k_1=1}^{M}\cdots\sum_{k_T=1}^{M}\ \prod_{t=1}^{T} a_{t k_t}$$
IR ndash Berlin Chen 43
The EM Algorithm (cont)
• E-step (Expectation)
  – The auxiliary function can also be divided into two parts:
$$\Phi(\Theta,\hat\Theta) = \Phi_a(\Theta,\hat\Theta) + \Phi_b(\Theta,\hat\Theta)$$
where
$$\Phi_a(\Theta,\hat\Theta) = \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k\mid x_i,\Theta)\,\log P(c_k\mid\hat\Theta) = \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(x_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(x_i\mid c_l,\Theta)P(c_l\mid\Theta)}\,\log P(c_k\mid\hat\Theta)$$
is the auxiliary function for the mixture weights, and
$$\Phi_b(\Theta,\hat\Theta) = \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(x_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(x_i\mid c_l,\Theta)P(c_l\mid\Theta)}\,\log P(x_i\mid c_k,\hat\Theta)$$
is the auxiliary function for the cluster distributions.
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
• Maximize a function F with a constraint by applying a Lagrange multiplier
Suppose
$$F(y) = \sum_{j=1}^{N} w_j\log y_j\qquad\text{subject to}\qquad \sum_{j=1}^{N} y_j = 1$$
By applying a Lagrange multiplier l,
$$F(y,l) = \sum_{j=1}^{N} w_j\log y_j + l\left(\sum_{j=1}^{N} y_j - 1\right)$$
$$\frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + l = 0\ \Rightarrow\ y_j = -\frac{w_j}{l}\ \ \forall j\ \Rightarrow\ l = -\sum_{j=1}^{N} w_j$$
$$\therefore\ \hat y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$$
(Note: ∂(w_j log y_j)/∂y_j = w_j / y_j)
IR ndash Berlin Chen 45
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize Φ_a(Θ, Θ̂), the auxiliary function for the mixture weights (or priors for Gaussians):
$$\Phi_a(\Theta,\hat\Theta) + l\left(\sum_{k=1}^{K}\hat P(c_k) - 1\right) = \sum_{k=1}^{K}\sum_{i=1}^{n}\frac{P(x_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l'=1}^{K}P(x_i\mid c_{l'},\Theta)P(c_{l'}\mid\Theta)}\log\hat P(c_k) + l\left(\sum_{k=1}^{K}\hat P(c_k) - 1\right)$$
Applying the Lagrange-multiplier result with w_k = Σ_i P(c_k|x_i, Θ) and y_k = P̂(c_k):
$$\hat\pi_k = \hat P(c_k) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} = \frac{\sum_{i=1}^{n}\dfrac{P(x_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(x_i\mid c_l,\Theta)P(c_l\mid\Theta)}}{\sum_{k'=1}^{K}\sum_{i=1}^{n}\dfrac{P(x_i\mid c_{k'},\Theta)P(c_{k'}\mid\Theta)}{\sum_{l=1}^{K}P(x_i\mid c_l,\Theta)P(c_l\mid\Theta)}} = \frac{1}{n}\sum_{i=1}^{n}\frac{P(x_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(x_i\mid c_l,\Theta)P(c_l\mid\Theta)}$$
where w_k is the expected number of times that the samples x_i fall in class c_k.
IR ndash Berlin Chen 46
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize Φ_b(Θ, Θ̂), the auxiliary function for the (multivariate) Gaussian means and variances:
$$\Phi_b(\Theta,\hat\Theta) = \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(x_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(x_i\mid c_l,\Theta)P(c_l\mid\Theta)}\,\log P(x_i\mid c_k,\hat\Theta)$$
with
$$P(x_i\mid c_k,\hat\Theta) = \frac{1}{(2\pi)^{m/2}\left|\hat\Sigma_k\right|^{1/2}}\exp\!\left(-\frac{1}{2}(x_i-\hat\mu_k)^T\hat\Sigma_k^{-1}(x_i-\hat\mu_k)\right)$$
Let
$$w_{ik} = \frac{P(x_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(x_i\mid c_l,\Theta)P(c_l\mid\Theta)},\qquad \log P(x_i\mid c_k,\hat\Theta) = -\frac{m}{2}\log 2\pi - \frac{1}{2}\log\left|\hat\Sigma_k\right| - \frac{1}{2}(x_i-\hat\mu_k)^T\hat\Sigma_k^{-1}(x_i-\hat\mu_k)$$
Then
$$\Phi_b(\Theta,\hat\Theta) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log\left|\hat\Sigma_k\right| - \frac{1}{2}(x_i-\hat\mu_k)^T\hat\Sigma_k^{-1}(x_i-\hat\mu_k)\right] + D$$
where D is a constant.
IR ndash Berlin Chen 47
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize Φ_b(Θ, Θ̂) with respect to μ̂_k:
$$\frac{\partial\Phi_b}{\partial\hat\mu_k} = \sum_{i=1}^{n} w_{ik}\,\hat\Sigma_k^{-1}(x_i-\hat\mu_k) = 0$$
(using d(x^T C x)/dx = (C + C^T)x, with Σ̂_k, and hence Σ̂_k^{-1}, symmetric)
$$\Rightarrow\ \hat\mu_k = \frac{\sum_{i=1}^{n} w_{ik}\,x_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n}\dfrac{P(x_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(x_i\mid c_l,\Theta)P(c_l\mid\Theta)}\,x_i}{\sum_{i=1}^{n}\dfrac{P(x_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(x_i\mid c_l,\Theta)P(c_l\mid\Theta)}}$$
where Σ_i w_{ik} is the expected number of times that the samples x_i fall in class c_k.
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize Φ_b(Θ, Θ̂) with respect to Σ̂_k:
$$\Phi_b(\Theta,\hat\Theta) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log\left|\hat\Sigma_k\right| - \frac{1}{2}(x_i-\hat\mu_k)^T\hat\Sigma_k^{-1}(x_i-\hat\mu_k)\right] + D$$
Setting the derivative with respect to Σ̂_k to zero,
$$\frac{\partial\Phi_b}{\partial\hat\Sigma_k} = \sum_{i=1}^{n} w_{ik}\left[-\frac{1}{2}\hat\Sigma_k^{-1} + \frac{1}{2}\hat\Sigma_k^{-1}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T\hat\Sigma_k^{-1}\right] = 0$$
$$\Rightarrow\ \sum_{i=1}^{n} w_{ik}\,\hat\Sigma_k = \sum_{i=1}^{n} w_{ik}\,(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T$$
$$\Rightarrow\ \hat\Sigma_k = \frac{\sum_{i=1}^{n} w_{ik}\,(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n}\dfrac{P(x_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(x_i\mid c_l,\Theta)P(c_l\mid\Theta)}\,(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T}{\sum_{i=1}^{n}\dfrac{P(x_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(x_i\mid c_l,\Theta)P(c_l\mid\Theta)}}$$
(using d det(X)/dX = det(X)·(X^{-1})^T and d(a^T X^{-1} b)/dX = −X^{-1} a b^T X^{-1}, with Σ̂_k symmetric)
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
• The procedure terminates when the likelihood function P(X|Θ) converges or the maximum number of iterations is reached
IR ndash Berlin Chen 50
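A self-contained sketch of the resulting EM procedure for a mixture of Gaussians (illustrative; for brevity the K-means initialization mentioned above is replaced by random seeds, and a small ridge is added to keep the covariances invertible):

```python
import numpy as np

def gmm_em(X, K, n_iter=100, tol=1e-6, seed=0):
    """E-step: compute w_ik = P(c_k | x_i, Theta); M-step: re-estimate priors, means, covariances."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]                 # K-means centroids would be used in practice
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(m)] * K)
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities
        dens = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(sigma[k])
            norm = 1.0 / np.sqrt(((2 * np.pi) ** m) * np.linalg.det(sigma[k]))
            dens[:, k] = pi[k] * norm * np.exp(-0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff))
        ll = np.log(dens.sum(axis=1)).sum()                  # log P(X | Theta)
        w = dens / dens.sum(axis=1, keepdims=True)           # w_ik
        # M-step: closed-form updates derived above
        Nk = w.sum(axis=0)
        pi = Nk / n                                          # hat pi_k
        mu = (w.T @ X) / Nk[:, None]                         # hat mu_k
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (w[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(m)
        if ll - prev_ll < tol:                               # terminate when the likelihood converges
            break
        prev_ll = ll
    return pi, mu, sigma, w
```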
Hierarchical Document Organization
• Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
• Related documents fall in the same cluster, and the relationships among clusters are reflected by their distances on the map
• When a cluster contains many documents, it can be further analyzed into another map on the next layer
Two-dimensional Tree Structure
for Organized Topics
$$P(w_j\mid D_i) = \sum_{k=1}^{K} P(T_k\mid D_i)\left[\sum_{l=1}^{K} P(T_l\mid Y_k)\,P(w_j\mid T_l)\right]$$
$$P(T_l\mid Y_k) = \frac{E(T_k,T_l)}{\sum_{s=1}^{K} E(T_k,T_s)},\qquad E(T_k,T_l) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\right]$$
$$dist(T_i,T_j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}$$
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
$$L_T = \sum_{i=1}^{N}\sum_{j=1}^{J_i} c(w_j,D_i)\,\log P(w_j\mid D_i) = \sum_{i=1}^{N}\sum_{j=1}^{J_i} c(w_j,D_i)\,\log\left\{\sum_{k=1}^{K}P(T_k\mid D_i)\left[\sum_{l=1}^{K}P(T_l\mid Y_k)\,P(w_j\mid T_l)\right]\right\}$$
EM update formulas:
$$\hat P(w_j\mid T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i=1}^{N} c(w_{j'},D_i)\,P(T_k\mid w_{j'},D_i)},\qquad \hat P(T_k\mid D_i) = \frac{\sum_{j=1}^{J} c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{c(D_i)}$$
where
$$P(T_k\mid w_j,D_i) = \frac{\left[\sum_{l=1}^{K}P(w_j\mid T_l)\,P(T_l\mid T_k)\right]P(T_k\mid D_i)}{\sum_{k'=1}^{K}\left[\sum_{l=1}^{K}P(w_j\mid T_l)\,P(T_l\mid T_{k'})\right]P(T_{k'}\mid D_i)}$$
IR ndash Berlin Chen 52
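A rough sketch of one EM iteration for this two-layer model (the array names and shapes are assumptions for illustration only; `P_T_Y` plays the role of the map-neighbourhood smoothing term, treating P(T_l|Y_k) and P(T_l|T_k) in the formulas above as the same quantity):

```python
import numpy as np

def tmm_em_step(counts, P_w_T, P_T_D, P_T_Y):
    """One EM iteration for the two-layer topic model sketched above.
    counts: (N, J) term counts c(w_j, D_i)
    P_w_T:  (K, J) P(w_j | T_l)
    P_T_D:  (N, K) P(T_k | D_i)
    P_T_Y:  (K, K) P(T_l | Y_k), map-neighbourhood smoothing weights"""
    # smoothed word distribution of each topic: sum_l P(T_l | Y_k) P(w_j | T_l)
    smoothed = P_T_Y @ P_w_T                                   # (K, J)
    # E-step: P(T_k | w_j, D_i) proportional to smoothed[k, j] * P(T_k | D_i)
    post = P_T_D[:, :, None] * smoothed[None, :, :]            # (N, K, J)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step
    weighted = counts[:, None, :] * post                       # c(w_j, D_i) P(T_k | w_j, D_i)
    new_P_w_T = weighted.sum(axis=0)                           # (K, J)
    new_P_w_T /= new_P_w_T.sum(axis=1, keepdims=True) + 1e-12
    new_P_T_D = weighted.sum(axis=2)                           # (N, K)
    new_P_T_D /= counts.sum(axis=1, keepdims=True) + 1e-12
    return new_P_w_T, new_P_T_D
```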
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k\mid D_i)}{\sum_{i=1}^{N} c(w_j,D_i)\left[1 - P(T_k\mid D_i)\right]}$$
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
• Self-Organizing Map (SOM) – a recursive regression process
Input vector (Input Layer): x = [x_1, x_2, …, x_n]^T
Weight vector of map node i (Mapping Layer): m_i = [m_{i1}, m_{i2}, …, m_{in}]^T
$$m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,\big[x(t) - m_i(t)\big]$$
where
$$c(x) = \arg\min_{i'}\left\|x - m_{i'}\right\|,\qquad \left\|x - m_{i'}\right\| = \sqrt{\sum_{n}(x_n - m_{i'n})^2}$$
$$h_{c(x),i}(t) = \alpha(t)\,\exp\!\left(-\frac{\left\|r_i - r_{c(x)}\right\|^2}{2\,\sigma^2(t)}\right)$$
IR ndash Berlin Chen 56
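A minimal SOM training loop following the update rule above (the linear decay schedules for α(t) and σ(t) are assumptions for illustration):

```python
import numpy as np

def som_train(X, grid_w, grid_h, n_iter=1000, alpha0=0.5, sigma0=2.0, seed=0):
    """Train a 2-D SOM: M holds the weight vectors m_i, coords the map positions r_i."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    coords = np.array([[i, j] for i in range(grid_w) for j in range(grid_h)], dtype=float)
    M = rng.standard_normal((grid_w * grid_h, dim))            # weight vectors m_i
    for t in range(n_iter):
        x = X[rng.integers(n)]                                  # pick an input vector x(t)
        c = np.argmin(((x - M) ** 2).sum(axis=1))               # winner: c(x) = argmin_i ||x - m_i||
        alpha = alpha0 * (1.0 - t / n_iter)                     # decaying learning rate alpha(t)
        sigma = sigma0 * (1.0 - t / n_iter) + 1e-3              # decaying neighbourhood width sigma(t)
        d2 = ((coords - coords[c]) ** 2).sum(axis=1)            # ||r_i - r_c(x)||^2 on the map
        h = alpha * np.exp(-d2 / (2.0 * sigma ** 2))            # neighbourhood function h_{c(x),i}(t)
        M += h[:, None] * (x - M)                               # m_i(t+1) = m_i(t) + h [x(t) - m_i(t)]
    return M, coords
```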
Hierarchical Document Organization (cont)
bull Results
Table: comparison of TMM and SOM by number of training iterations, average within-cluster distance (dist_Within), and average between-cluster distance (dist_Between) on the map.
$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$
$$dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)},\qquad dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}$$
where
$$f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & \text{if } T_{r_i}\neq T_{r_j}\\ 0 & \text{otherwise}\end{cases},\qquad C_{Between}(i,j) = \begin{cases}1 & \text{if } T_{r_i}\neq T_{r_j}\\ 0 & \text{otherwise}\end{cases}$$
$$f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & \text{if } T_{r_i} = T_{r_j}\\ 0 & \text{otherwise}\end{cases},\qquad C_{Within}(i,j) = \begin{cases}1 & \text{if } T_{r_i} = T_{r_j}\\ 0 & \text{otherwise}\end{cases}$$
$$dist_{Map}(i,j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}$$
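A direct sketch of this evaluation (illustrative names; labels[i] stands for the topic T_{r_i} assigned to document i and map_xy[i] for its map coordinates):

```python
import numpy as np

def organization_quality(map_xy, labels):
    """R = dist_Between / dist_Within for documents placed at map coordinates map_xy."""
    map_xy, labels = np.asarray(map_xy, float), np.asarray(labels)
    D = len(labels)
    within, n_within, between, n_between = 0.0, 0, 0.0, 0
    for i in range(D):
        for j in range(i + 1, D):
            d = np.sqrt(((map_xy[i] - map_xy[j]) ** 2).sum())   # dist_Map(i, j)
            if labels[i] == labels[j]:
                within, n_within = within + d, n_within + 1
            else:
                between, n_between = between + d, n_between + 1
    dist_within = within / max(n_within, 1)
    dist_between = between / max(n_between, 1)
    return dist_between / dist_within
```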
IR ndash Berlin Chen 4
Types of Clustering Algorithms
bull Two types of structures produced by clustering algorithmsndash Flat or non-hierarchical clusteringndash Hierarchical clustering
bull Flat clusteringndash Simply consisting of a certain number of clusters and the relation
between clusters is often undeterminedndash Measurement construction error minimization or probabilistic
optimizationbull Hierarchical clustering
ndash A hierarchy with usual interpretation that each node stands for a subclass of its motherrsquos node
bull The leaves of the tree are the single objectsbull Each node represents the cluster that contains all the objects
of its descendantsndash Measurement similarities of instances
IR ndash Berlin Chen 5
Hard Assignment vs Soft Assignment
bull Another important distinction between clustering algorithms is whether they perform soft or hard assignment
bull Hard Assignmentndash Each object is assigned to one and only one cluster
bull Soft Assignment (probabilistic approach)ndash Each object may be assigned to multiple clustersndash An object has a probability distribution over
clusters where is the probability that is a member of
ndash Is somewhat more appropriate in many tasks such as NLP IR hellip
ix ( )ixP sdotjc
jc( )ji cxP ix
IR ndash Berlin Chen 6
Hard Assignment vs Soft Assignment (cont)
bull Hierarchical clustering usually adopts hard assignment
bull While in flat clustering both types of assignments are common
IR ndash Berlin Chen 7
Summarized Attributes of Clustering Algorithms bull Hierarchical Clustering
ndash Preferable for detailed data analysis
ndash Provide more information than flat clustering
ndash No single best algorithm (each of the algorithms only optimal for some applications)
ndash Less efficient than flat clustering (minimally have to compute n x nmatrix of similarity coefficients)
IR ndash Berlin Chen 8
Summarized Attributes of Clustering Algorithms (cont)
bull Flat Clusteringndash Preferable if efficiency is a consideration or data sets are very
large
ndash K-means is the conceptually method and should probably be used on a new data because its results are often sufficient
ndash K-means assumes a simple Euclidean representation space and so cannot be used for many data sets eg nominal data like colors (or samples with features of different scales)
ndash The EM algorithm is the most choice It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
bull Its extensions can be used to handle topologicalhierarchical orders of samples
ndash Probabilistic Latent Semantic Analysis (PLSA) Topic Mixture Model (TMM) etc
IR ndash Berlin Chen 9
Hierarchical Clustering
IR ndash Berlin Chen 10
Hierarchical Clustering
bull Can be in either bottom-up or top-down mannersndash Bottom-up (agglomerative)
bull Start with individual objects and grouping the most similar ones
ndash Eg with the minimum distance apart
bull The procedure terminates when one cluster containing all objects has been formed
ndash Top-down (divisive)bull Start with all objects in a group and divide them into groups
so as to maximize within-group similarity
( ) ( )yxdyxsim
11
+=
凝集的
分裂的
distance measures willbe discussed later on
IR ndash Berlin Chen 11
Hierarchical Agglomerative Clustering (HAC)
bull A bottom-up approach
bull Assume a similarity measure for determining the similarity of two objects
bull Start with all objects in a separate cluster and then repeatedly joins the two clusters that have the most similarity until there is one only cluster survived
bull The history of mergingclustering forms a binary tree or hierarchy
IR ndash Berlin Chen 12
HAC (cont)
bull Algorithm
cluster number
Initialization (for tree leaves)Each object is a cluster
merged as a new cluster
The original two clusters are removed
IR ndash Berlin Chen 13
Distance Metrics
bull Euclidian Distance (L2 norm)
ndash Make sure that all attributesdimensions have the same scale (orthe same variance)
bull L1 Norm (City-block distance)
bull Cosine Similarity (transform to a distance by subtracting from 1)
2
12 )()( i
m
ii yxyxL minus=sum
=
rr
sum=
minus=m
iii yxyxL
11 )( rr
yxyxrr
rr
sdotminus
bull1 ranged between 0 and 1
IR ndash Berlin Chen 14
Measures of Cluster Similaritybull Especially for the bottom-up approaches
bull Single-link clusteringndash The similarity between two clusters is the similarity of the two
closest objects in the clusters
ndash Search over all pairs of objects that are from the two differentclusters and select the pair with the greatest similarity
ndash Elongated clusters are achieved
Ci Cj
greatest similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= max
cf the minimal spanning tree
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 5
Hard Assignment vs Soft Assignment
bull Another important distinction between clustering algorithms is whether they perform soft or hard assignment
bull Hard Assignmentndash Each object is assigned to one and only one cluster
bull Soft Assignment (probabilistic approach)ndash Each object may be assigned to multiple clustersndash An object has a probability distribution over
clusters where is the probability that is a member of
ndash Is somewhat more appropriate in many tasks such as NLP IR hellip
ix ( )ixP sdotjc
jc( )ji cxP ix
IR ndash Berlin Chen 6
Hard Assignment vs Soft Assignment (cont)
bull Hierarchical clustering usually adopts hard assignment
bull While in flat clustering both types of assignments are common
IR ndash Berlin Chen 7
Summarized Attributes of Clustering Algorithms bull Hierarchical Clustering
ndash Preferable for detailed data analysis
ndash Provide more information than flat clustering
ndash No single best algorithm (each of the algorithms only optimal for some applications)
ndash Less efficient than flat clustering (minimally have to compute n x nmatrix of similarity coefficients)
IR ndash Berlin Chen 8
Summarized Attributes of Clustering Algorithms (cont)
bull Flat Clusteringndash Preferable if efficiency is a consideration or data sets are very
large
ndash K-means is the conceptually method and should probably be used on a new data because its results are often sufficient
ndash K-means assumes a simple Euclidean representation space and so cannot be used for many data sets eg nominal data like colors (or samples with features of different scales)
ndash The EM algorithm is the most choice It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
bull Its extensions can be used to handle topologicalhierarchical orders of samples
ndash Probabilistic Latent Semantic Analysis (PLSA) Topic Mixture Model (TMM) etc
IR ndash Berlin Chen 9
Hierarchical Clustering
IR ndash Berlin Chen 10
Hierarchical Clustering
bull Can be in either bottom-up or top-down mannersndash Bottom-up (agglomerative)
bull Start with individual objects and grouping the most similar ones
ndash Eg with the minimum distance apart
bull The procedure terminates when one cluster containing all objects has been formed
ndash Top-down (divisive)bull Start with all objects in a group and divide them into groups
so as to maximize within-group similarity
( ) ( )yxdyxsim
11
+=
凝集的
分裂的
distance measures willbe discussed later on
IR ndash Berlin Chen 11
Hierarchical Agglomerative Clustering (HAC)
bull A bottom-up approach
bull Assume a similarity measure for determining the similarity of two objects
bull Start with all objects in a separate cluster and then repeatedly joins the two clusters that have the most similarity until there is one only cluster survived
bull The history of mergingclustering forms a binary tree or hierarchy
IR ndash Berlin Chen 12
HAC (cont)
bull Algorithm
cluster number
Initialization (for tree leaves)Each object is a cluster
merged as a new cluster
The original two clusters are removed
IR ndash Berlin Chen 13
Distance Metrics
bull Euclidian Distance (L2 norm)
ndash Make sure that all attributesdimensions have the same scale (orthe same variance)
bull L1 Norm (City-block distance)
bull Cosine Similarity (transform to a distance by subtracting from 1)
2
12 )()( i
m
ii yxyxL minus=sum
=
rr
sum=
minus=m
iii yxyxL
11 )( rr
yxyxrr
rr
sdotminus
bull1 ranged between 0 and 1
IR ndash Berlin Chen 14
Measures of Cluster Similaritybull Especially for the bottom-up approaches
bull Single-link clusteringndash The similarity between two clusters is the similarity of the two
closest objects in the clusters
ndash Search over all pairs of objects that are from the two differentclusters and select the pair with the greatest similarity
ndash Elongated clusters are achieved
Ci Cj
greatest similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= max
cf the minimal spanning tree
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)
ndash Remember that a function F can be maximized under a constraint by applying a Lagrange multiplier

Suppose that F = \sum_{j=1}^{N} w_j \log y_j with the constraint \sum_{j=1}^{N} y_j = 1. By applying a Lagrange multiplier l:

F' = \sum_{j=1}^{N} w_j \log y_j + l\left(1 - \sum_{j=1}^{N} y_j\right)

\frac{\partial F'}{\partial y_j} = \frac{w_j}{y_j} - l = 0 \;\Rightarrow\; w_j = l\,y_j,\ \forall j
\;\Rightarrow\; l = \sum_{j=1}^{N} w_j \quad (\text{summing over } j \text{ and using the constraint } \sum_{j} y_j = 1)

\therefore\ \hat{y}_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}

Note: \frac{\partial \log y_j}{\partial y_j} = \frac{1}{y_j}
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)
ndash Maximize \Phi_a(\hat\Theta,\Theta), the auxiliary function for the mixture weights (or priors for Gaussians), subject to \sum_{k=1}^{K} P(c_k|\hat\Theta) = 1:

\hat\Phi_a(\hat\Theta,\Theta) = \Phi_a(\hat\Theta,\Theta) + l\left(1 - \sum_{k=1}^{K} P(c_k|\hat\Theta)\right)
  = \sum_{k=1}^{K}\left[\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)\right]\log P(c_k|\hat\Theta) + l\left(1 - \sum_{k=1}^{K} P(c_k|\hat\Theta)\right)

With w_k = \sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta) and y_k = P(c_k|\hat\Theta), this has exactly the form of the previous slide, so

\hat{P}(c_k|\hat\Theta) = \hat\pi_k = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
  = \frac{\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)}{\sum_{k'=1}^{K}\sum_{i=1}^{n} P(c_{k'}|\vec{x}_i,\Theta)}
  = \frac{1}{n}\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)
  = \frac{1}{n}\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}

w_k: the expected number of times that the samples \vec{x}_i fall in class c_k
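These two quantities, the posterior P(c_k|\vec{x}_i,\Theta) and the re-estimated priors, map directly onto code. A minimal NumPy/SciPy sketch is given below; the function and variable names (e_step, update_priors, resp) are illustrative choices, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, priors, means, covs):
    """Posterior P(c_k | x_i, Theta) for every sample i and cluster k."""
    n, K = X.shape[0], len(priors)
    resp = np.zeros((n, K))
    for k in range(K):
        # P(x_i | c_k, Theta) * P(c_k | Theta)
        resp[:, k] = priors[k] * multivariate_normal.pdf(X, means[k], covs[k])
    resp /= resp.sum(axis=1, keepdims=True)    # normalize over k (Bayes' rule)
    return resp

def update_priors(resp):
    """M-step for the mixture weights: pi_k = (1/n) * sum_i P(c_k | x_i, Theta)."""
    return resp.mean(axis=0)
```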
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)
ndash Maximize \Phi_b(\hat\Theta,\Theta), the auxiliary function for the (multivariate) Gaussian means and variances:

\Phi_b(\hat\Theta,\Theta) = \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}\,\log P(\vec{x}_i|c_k,\hat\Theta)

where

P(\vec{x}_i|c_k,\hat\Theta) = \frac{1}{(2\pi)^{m/2}\,|\hat\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(\vec{x}_i-\hat{\vec\mu}_k)^T\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)\right)

\log P(\vec{x}_i|c_k,\hat\Theta) = -\frac{m}{2}\log(2\pi) - \frac{1}{2}\log|\hat\Sigma_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec\mu}_k)^T\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)

Let w_{ik} = \frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}. Then

\Phi_b(\hat\Theta,\Theta) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat\Sigma_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec\mu}_k)^T\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)\right] + D,\qquad D\ \text{a constant}
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)
ndash Maximize \Phi_b(\hat\Theta,\Theta) with respect to \hat{\vec\mu}_k:

\Phi_b(\hat\Theta,\Theta) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat\Sigma_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec\mu}_k)^T\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)\right] + D

\frac{\partial\Phi_b(\hat\Theta,\Theta)}{\partial\hat{\vec\mu}_k} = -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\cdot 2\,\hat\Sigma_k^{-1}(\hat{\vec\mu}_k-\vec{x}_i) = 0
\;\Rightarrow\; \sum_{i=1}^{n} w_{ik}\,\vec{x}_i = \hat{\vec\mu}_k\sum_{i=1}^{n} w_{ik}

\;\Rightarrow\; \hat{\vec\mu}_k = \frac{\sum_{i=1}^{n} w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}\cdot\vec{x}_i}{\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}}

Note: \frac{\partial(\vec{x}^T C\,\vec{x})}{\partial\vec{x}} = (C + C^T)\,\vec{x}, and \hat\Sigma_k^{-1} is symmetric here

w_{ik}: the expected number of times that \vec{x}_i falls in class c_k
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)
ndash Maximize \Phi_b(\hat\Theta,\Theta) with respect to \hat\Sigma_k:

\Phi_b(\hat\Theta,\Theta) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat\Sigma_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec\mu}_k)^T\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)\right] + D

\frac{\partial\Phi_b(\hat\Theta,\Theta)}{\partial\hat\Sigma_k}
 = -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\left[\hat\Sigma_k^{-1} - \hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^T\hat\Sigma_k^{-1}\right] = 0

\Rightarrow\; \sum_{i=1}^{n} w_{ik}\,\hat\Sigma_k^{-1} = \sum_{i=1}^{n} w_{ik}\,\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^T\hat\Sigma_k^{-1}

\Rightarrow\; \sum_{i=1}^{n} w_{ik}\,\hat\Sigma_k = \sum_{i=1}^{n} w_{ik}\,(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^T
\qquad(\text{pre- and post-multiplying by }\hat\Sigma_k)

\Rightarrow\; \hat\Sigma_k = \frac{\sum_{i=1}^{n} w_{ik}\,(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^T}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}\,(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^T}{\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}}

Notes: \frac{\partial\det(X)}{\partial X} = \det(X)\,(X^{-1})^T,\qquad
\frac{\partial(\vec{a}^T X^{-1}\vec{b})}{\partial X} = -X^{-1}\vec{a}\,\vec{b}^T X^{-1}\ (X\ \text{symmetric}),\qquad
\hat\Sigma_k\ \text{is symmetric here}
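The closed-form updates for the Gaussian means and covariances can likewise be written compactly. A hedged sketch, complementing the e_step/update_priors sketch above; the min_var regularizer is an added assumption to keep each \hat\Sigma_k invertible, not part of the derivation:

```python
import numpy as np

def update_means_covs(X, resp, min_var=1e-6):
    """M-step for the Gaussian parameters, given resp[i, k] = P(c_k | x_i, Theta)."""
    n, d = X.shape
    K = resp.shape[1]
    Nk = resp.sum(axis=0)                      # sum_i w_ik, per cluster
    means = (resp.T @ X) / Nk[:, None]         # mu_k = sum_i w_ik x_i / sum_i w_ik
    covs = np.zeros((K, d, d))
    for k in range(K):
        diff = X - means[k]                    # (x_i - mu_k)
        covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
        covs[k] += min_var * np.eye(d)         # keep Sigma_k invertible (assumption)
    return means, covs
```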
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihood function P(X|\Theta) has converged or the maximum number of iterations is reached
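Putting the pieces together, one possible EM loop with K-means initialization and a log-likelihood convergence test is sketched below. It reuses the e_step, update_priors, and update_means_covs helpers sketched earlier; the use of scikit-learn's KMeans is only one convenient choice for the initialization step, not something prescribed by the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.cluster import KMeans

def log_likelihood(X, priors, means, covs):
    # log P(X | Theta) = sum_i log sum_k P(c_k) P(x_i | c_k)
    dens = np.column_stack([p * multivariate_normal.pdf(X, m, c)
                            for p, m, c in zip(priors, means, covs)])
    return np.log(dens.sum(axis=1)).sum()

def em_gmm(X, K, max_iter=100, tol=1e-4):
    # Initialize the cluster distributions from a hard K-means partition
    labels = KMeans(n_clusters=K, n_init=10).fit(X).labels_
    resp = np.eye(K)[labels]                       # one-hot responsibilities
    priors = update_priors(resp)
    means, covs = update_means_covs(X, resp)
    prev = -np.inf
    for _ in range(max_iter):
        resp = e_step(X, priors, means, covs)      # E-step
        priors = update_priors(resp)               # M-step
        means, covs = update_means_covs(X, resp)
        ll = log_likelihood(X, priors, means, covs)
        if ll - prev < tol:                        # stop when P(X|Theta) has converged
            break
        prev = ll
    return priors, means, covs, resp
```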
IR ndash Berlin Chen 50
Hierarchical Document Organization

bull Explore the Probabilistic Latent Topical Information
ndash TMM/PLSA approach

bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

bull Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map

bull When a cluster has many documents, we can further analyze it into another map on the next layer

(Figure: two-dimensional tree structure for organized topics)

P(w_j|D_i) = \sum_{k=1}^{K} P(T_k|D_i)\left[\sum_{l=1}^{K} P(T_l|Y_k)\,P(w_j|T_l)\right]

P(T_l|Y_k) = \frac{E(T_k,T_l)}{\sum_{s=1}^{K} E(T_k,T_s)},\qquad
E(T_k,T_l) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\right],\qquad
dist(T_i,T_j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}

where Y_k denotes the neighborhood of topic T_k on the map, and (x_i, y_i) are the map coordinates of topic T_i
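Under the model as reconstructed above, P(w_j|D_i) is a product of small matrices. A minimal NumPy sketch; the matrix names, shapes, and the topic_neighborhood/word_given_doc function names are assumptions for illustration:

```python
import numpy as np

def topic_neighborhood(coords, sigma):
    """P(T_l | Y_k) from the topics' 2-D map coordinates (rows of `coords`)."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)   # dist(T_k, T_l)^2
    E = np.exp(-d2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return E / E.sum(axis=1, keepdims=True)      # normalize over l for each k

def word_given_doc(P_T_given_D, P_w_given_T, P_Tl_given_Yk):
    """P(w_j | D_i) = sum_k P(T_k | D_i) * sum_l P(T_l | Y_k) P(w_j | T_l)."""
    smoothed = P_Tl_given_Yk @ P_w_given_T       # (K x K) @ (K x V) -> K x V
    return P_T_given_D @ smoothed                # (N x K) @ (K x V) -> N x V
```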
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
L_T = \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log P(w_j|D_i)
    = \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log\left\{\sum_{k=1}^{K} P(T_k|D_i)\left[\sum_{l=1}^{K} P(T_l|Y_k)\,P(w_j|T_l)\right]\right\}

with re-estimation formulas

\hat{P}(w_j|T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k|w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i=1}^{N} c(w_{j'},D_i)\,P(T_k|w_{j'},D_i)}

\hat{P}(T_k|D_i) = \frac{\sum_{j=1}^{J} c(w_j,D_i)\,P(T_k|w_j,D_i)}{c(D_i)}

where

P(T_k|w_j,D_i) = \frac{\left[\sum_{l=1}^{K} P(w_j|T_l)\,P(T_l|Y_k)\right] P(T_k|D_i)}{\sum_{k'=1}^{K}\left[\sum_{l=1}^{K} P(w_j|T_l)\,P(T_l|Y_{k'})\right] P(T_{k'}|D_i)}

(c(w_j,D_i): count of term w_j in document D_i; c(D_i): total term count of D_i)
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
S(w_j,T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k|D_i)}{\sum_{i=1}^{N} c(w_j,D_i)\,\left[1 - P(T_k|D_i)\right]}
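A small sketch of this scoring criterion, assuming a document-by-vocabulary count matrix and the P(T_k|D_i) estimates from training; all names are illustrative:

```python
import numpy as np

def topic_word_score(counts, P_T_given_D, k):
    """S(w_j, T_k) for every word j; counts is the N x V term-count matrix c(w_j, D_i)."""
    p_k = P_T_given_D[:, k]                      # P(T_k | D_i) for each document i
    num = counts.T @ p_k                         # sum_i c(w_j, D_i) P(T_k | D_i)
    den = counts.T @ (1.0 - p_k)                 # sum_i c(w_j, D_i) [1 - P(T_k | D_i)]
    return num / den
```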
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organizing Map (SOM) ndash A recursive regression process

Input vector (input layer): \vec{x} = [x_1, x_2, \ldots, x_n]^T
Weight vector of map unit i (mapping layer): \vec{m}_i = [m_{i1}, m_{i2}, \ldots, m_{in}]^T

Update rule:
\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),i}(t)\,\left[\vec{x}(t) - \vec{m}_i(t)\right]

where the winning (best-matching) unit is
c(\vec{x}) = \arg\min_{i'} \lVert \vec{x} - \vec{m}_{i'} \rVert,\qquad
\lVert \vec{x} - \vec{m}_{i'} \rVert^2 = \sum_{n'} (x_{n'} - m_{i'n'})^2

and the neighborhood function is
h_{c(\vec{x}),i}(t) = \alpha(t)\,\exp\left(-\frac{\lVert \vec{r}_i - \vec{r}_{c(\vec{x})} \rVert^2}{2\,\sigma^2(t)}\right)

(\vec{r}_i: the map coordinate of unit i; \alpha(t): learning rate; \sigma(t): neighborhood width)
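One recursion step of the SOM update rule might be coded as below. The exponential decay schedules for \alpha(t) and \sigma(t) are an assumption for illustration, since the slides do not specify them:

```python
import numpy as np

def som_step(weights, coords, x, t, alpha0=0.5, sigma0=2.0, tau=1000.0):
    """One SOM update: weights is (units x n) map weights, coords the units' 2-D grid positions."""
    alpha = alpha0 * np.exp(-t / tau)                      # decaying learning rate (illustrative schedule)
    sigma = sigma0 * np.exp(-t / tau)                      # decaying neighborhood width (illustrative)
    winner = np.argmin(((weights - x) ** 2).sum(axis=1))   # c(x) = argmin_i ||x - m_i||
    d2 = ((coords - coords[winner]) ** 2).sum(axis=1)      # ||r_i - r_c(x)||^2 on the map
    h = alpha * np.exp(-d2 / (2 * sigma ** 2))             # neighborhood function h_{c(x),i}(t)
    weights += h[:, None] * (x - weights)                  # m_i(t+1) = m_i(t) + h [x - m_i(t)]
    return weights
```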
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
(Results table: Model, Iterations, dist_Within, dist_Between, comparing TMM and SOM)
R_{dist} = \frac{dist_{Between}}{dist_{Within}}

dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)}

where
f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j), & \text{if } T_{r_i} \ne T_{r_j} \\ 0, & \text{otherwise} \end{cases}
\qquad
C_{Between}(i,j) = \begin{cases} 1, & \text{if } T_{r_i} \ne T_{r_j} \\ 0, & \text{otherwise} \end{cases}

dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}

where
f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j), & \text{if } T_{r_i} = T_{r_j} \\ 0, & \text{otherwise} \end{cases}
\qquad
C_{Within}(i,j) = \begin{cases} 1, & \text{if } T_{r_i} = T_{r_j} \\ 0, & \text{otherwise} \end{cases}

dist_{Map}(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}

(T_{r_i}: the topic to which document i is assigned; (x_i, y_i): the map coordinates of that topic)
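The dist_Within, dist_Between, and R measures defined above can be computed directly from the documents' topic assignments and their topics' map coordinates. A minimal sketch; names are illustrative:

```python
import numpy as np

def map_separation(coords, topics):
    """dist_Within, dist_Between and R = dist_Between / dist_Within.
    coords[i] is the 2-D map position of document i's topic, topics[i] its topic label."""
    D = len(topics)
    within, n_within, between, n_between = 0.0, 0, 0.0, 0
    for i in range(D):
        for j in range(i + 1, D):
            d = np.sqrt(((coords[i] - coords[j]) ** 2).sum())   # dist_Map(i, j)
            if topics[i] == topics[j]:
                within += d; n_within += 1
            else:
                between += d; n_between += 1
    dist_within = within / n_within
    dist_between = between / n_between
    return dist_within, dist_between, dist_between / dist_within
```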
IR ndash Berlin Chen 6
Hard Assignment vs Soft Assignment (cont)
bull Hierarchical clustering usually adopts hard assignment
bull While in flat clustering both types of assignments are common
IR ndash Berlin Chen 7
Summarized Attributes of Clustering Algorithms bull Hierarchical Clustering
ndash Preferable for detailed data analysis
ndash Provide more information than flat clustering
ndash No single best algorithm (each of the algorithms only optimal for some applications)
ndash Less efficient than flat clustering (minimally have to compute n x nmatrix of similarity coefficients)
IR ndash Berlin Chen 8
Summarized Attributes of Clustering Algorithms (cont)
bull Flat Clusteringndash Preferable if efficiency is a consideration or data sets are very
large
ndash K-means is the conceptually method and should probably be used on a new data because its results are often sufficient
ndash K-means assumes a simple Euclidean representation space and so cannot be used for many data sets eg nominal data like colors (or samples with features of different scales)
ndash The EM algorithm is the most choice It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
bull Its extensions can be used to handle topologicalhierarchical orders of samples
ndash Probabilistic Latent Semantic Analysis (PLSA) Topic Mixture Model (TMM) etc
IR ndash Berlin Chen 9
Hierarchical Clustering
IR ndash Berlin Chen 10
Hierarchical Clustering
bull Can be in either bottom-up or top-down mannersndash Bottom-up (agglomerative)
bull Start with individual objects and grouping the most similar ones
ndash Eg with the minimum distance apart
bull The procedure terminates when one cluster containing all objects has been formed
ndash Top-down (divisive)bull Start with all objects in a group and divide them into groups
so as to maximize within-group similarity
( ) ( )yxdyxsim
11
+=
凝集的
分裂的
distance measures willbe discussed later on
IR ndash Berlin Chen 11
Hierarchical Agglomerative Clustering (HAC)
bull A bottom-up approach
bull Assume a similarity measure for determining the similarity of two objects
bull Start with all objects in a separate cluster and then repeatedly joins the two clusters that have the most similarity until there is one only cluster survived
bull The history of mergingclustering forms a binary tree or hierarchy
IR ndash Berlin Chen 12
HAC (cont)
bull Algorithm
cluster number
Initialization (for tree leaves)Each object is a cluster
merged as a new cluster
The original two clusters are removed
IR ndash Berlin Chen 13
Distance Metrics
bull Euclidian Distance (L2 norm)
ndash Make sure that all attributesdimensions have the same scale (orthe same variance)
bull L1 Norm (City-block distance)
bull Cosine Similarity (transform to a distance by subtracting from 1)
2
12 )()( i
m
ii yxyxL minus=sum
=
rr
sum=
minus=m
iii yxyxL
11 )( rr
yxyxrr
rr
sdotminus
bull1 ranged between 0 and 1
IR ndash Berlin Chen 14
Measures of Cluster Similaritybull Especially for the bottom-up approaches
bull Single-link clusteringndash The similarity between two clusters is the similarity of the two
closest objects in the clusters
ndash Search over all pairs of objects that are from the two differentclusters and select the pair with the greatest similarity
ndash Elongated clusters are achieved
Ci Cj
greatest similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= max
cf the minimal spanning tree
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 7
Summarized Attributes of Clustering Algorithms bull Hierarchical Clustering
ndash Preferable for detailed data analysis
ndash Provide more information than flat clustering
ndash No single best algorithm (each of the algorithms only optimal for some applications)
ndash Less efficient than flat clustering (minimally have to compute n x nmatrix of similarity coefficients)
IR ndash Berlin Chen 8
Summarized Attributes of Clustering Algorithms (cont)
bull Flat Clusteringndash Preferable if efficiency is a consideration or data sets are very
large
ndash K-means is the conceptually method and should probably be used on a new data because its results are often sufficient
ndash K-means assumes a simple Euclidean representation space and so cannot be used for many data sets eg nominal data like colors (or samples with features of different scales)
ndash The EM algorithm is the most choice It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
bull Its extensions can be used to handle topologicalhierarchical orders of samples
ndash Probabilistic Latent Semantic Analysis (PLSA) Topic Mixture Model (TMM) etc
IR ndash Berlin Chen 9
Hierarchical Clustering
IR ndash Berlin Chen 10
Hierarchical Clustering
bull Can be in either bottom-up or top-down mannersndash Bottom-up (agglomerative)
bull Start with individual objects and grouping the most similar ones
ndash Eg with the minimum distance apart
bull The procedure terminates when one cluster containing all objects has been formed
ndash Top-down (divisive)bull Start with all objects in a group and divide them into groups
so as to maximize within-group similarity
( ) ( )yxdyxsim
11
+=
凝集的
分裂的
distance measures willbe discussed later on
IR ndash Berlin Chen 11
Hierarchical Agglomerative Clustering (HAC)
bull A bottom-up approach
bull Assume a similarity measure for determining the similarity of two objects
bull Start with all objects in a separate cluster and then repeatedly joins the two clusters that have the most similarity until there is one only cluster survived
bull The history of mergingclustering forms a binary tree or hierarchy
IR ndash Berlin Chen 12
HAC (cont)
bull Algorithm
cluster number
Initialization (for tree leaves)Each object is a cluster
merged as a new cluster
The original two clusters are removed
IR ndash Berlin Chen 13
Distance Metrics
bull Euclidian Distance (L2 norm)
ndash Make sure that all attributesdimensions have the same scale (orthe same variance)
bull L1 Norm (City-block distance)
bull Cosine Similarity (transform to a distance by subtracting from 1)
2
12 )()( i
m
ii yxyxL minus=sum
=
rr
sum=
minus=m
iii yxyxL
11 )( rr
yxyxrr
rr
sdotminus
bull1 ranged between 0 and 1
IR ndash Berlin Chen 14
Measures of Cluster Similaritybull Especially for the bottom-up approaches
bull Single-link clusteringndash The similarity between two clusters is the similarity of the two
closest objects in the clusters
ndash Search over all pairs of objects that are from the two differentclusters and select the pair with the greatest similarity
ndash Elongated clusters are achieved
Ci Cj
greatest similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= max
cf the minimal spanning tree
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 8
Summarized Attributes of Clustering Algorithms (cont)
bull Flat Clusteringndash Preferable if efficiency is a consideration or data sets are very
large
ndash K-means is the conceptually method and should probably be used on a new data because its results are often sufficient
ndash K-means assumes a simple Euclidean representation space and so cannot be used for many data sets eg nominal data like colors (or samples with features of different scales)
ndash The EM algorithm is the most choice It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
bull Its extensions can be used to handle topologicalhierarchical orders of samples
ndash Probabilistic Latent Semantic Analysis (PLSA) Topic Mixture Model (TMM) etc
IR ndash Berlin Chen 9
Hierarchical Clustering
IR ndash Berlin Chen 10
Hierarchical Clustering
bull Can be in either bottom-up or top-down mannersndash Bottom-up (agglomerative)
bull Start with individual objects and grouping the most similar ones
ndash Eg with the minimum distance apart
bull The procedure terminates when one cluster containing all objects has been formed
ndash Top-down (divisive)bull Start with all objects in a group and divide them into groups
so as to maximize within-group similarity
( ) ( )yxdyxsim
11
+=
凝集的
分裂的
distance measures willbe discussed later on
IR ndash Berlin Chen 11
Hierarchical Agglomerative Clustering (HAC)
bull A bottom-up approach
bull Assume a similarity measure for determining the similarity of two objects
bull Start with all objects in a separate cluster and then repeatedly joins the two clusters that have the most similarity until there is one only cluster survived
bull The history of mergingclustering forms a binary tree or hierarchy
IR ndash Berlin Chen 12
HAC (cont)
bull Algorithm
cluster number
Initialization (for tree leaves)Each object is a cluster
merged as a new cluster
The original two clusters are removed
IR ndash Berlin Chen 13
Distance Metrics
bull Euclidian Distance (L2 norm)
ndash Make sure that all attributesdimensions have the same scale (orthe same variance)
bull L1 Norm (City-block distance)
bull Cosine Similarity (transform to a distance by subtracting from 1)
2
12 )()( i
m
ii yxyxL minus=sum
=
rr
sum=
minus=m
iii yxyxL
11 )( rr
yxyxrr
rr
sdotminus
bull1 ranged between 0 and 1
IR ndash Berlin Chen 14
Measures of Cluster Similaritybull Especially for the bottom-up approaches
bull Single-link clusteringndash The similarity between two clusters is the similarity of the two
closest objects in the clusters
ndash Search over all pairs of objects that are from the two differentclusters and select the pair with the greatest similarity
ndash Elongated clusters are achieved
Ci Cj
greatest similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= max
cf the minimal spanning tree
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
– Single-link measure
– Complete-link measure
– Group-average measure
• How to split a cluster?
– It is also a clustering task (finding two sub-clusters)
– Any clustering algorithm can be used for the splitting operation, e.g.:
• Bottom-up (agglomerative) algorithms
• Non-hierarchical clustering algorithms (e.g., K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
• Algorithm
[Figure: divisive clustering pseudocode – at each step, split the least coherent cluster, generate two new clusters, and remove the original one]
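One possible sketch of a single divisive step, under the assumption that coherence is measured by the group-average cosine similarity and that the split is done with 2-means (here via scikit-learn's KMeans; any clustering algorithm could be substituted):

```python
import numpy as np
from sklearn.cluster import KMeans

def coherence(cluster):
    # Group-average pairwise cosine similarity of a cluster (lower = less coherent)
    X = cluster / np.linalg.norm(cluster, axis=1, keepdims=True)
    n = len(X)
    if n < 2:
        return 1.0
    S = X @ X.T
    return (S.sum() - np.trace(S)) / (n * (n - 1))

def divisive_step(clusters):
    # Select the least coherent cluster and split it into two with 2-means
    worst = min(range(len(clusters)), key=lambda i: coherence(clusters[i]))
    target = clusters.pop(worst)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(target)
    clusters.extend([target[labels == 0], target[labels == 1]])
    return clusters

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    data = np.vstack([rng.normal(loc=c, size=(10, 2)) for c in ([0, 0], [5, 5], [0, 5])])
    clusters = [data]                 # start with all objects in a single cluster
    while len(clusters) < 3:          # predefined criterion: 3 clusters
        clusters = divisive_step(clusters)
    print([len(c) for c in clusters])
```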
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
– In a multi-pass manner (recursion/iterations)
• Problems associated with non-hierarchical clustering
– When to stop? (e.g., based on group-average similarity, likelihood, mutual information)
– What is the right number of clusters? (k−1 → k → k+1)
• Hierarchical clustering also has to face this problem
• Algorithms introduced here
– The K-means algorithm
– The EM algorithm
IR ndash Berlin Chen 26
The K-means Algorithm
• Also called Linde-Buzo-Gray (LBG) in signal processing
– A hard clustering algorithm
– Define clusters by the center of mass of their members
• The K-means algorithm also can be regarded as
– A kind of vector quantization
• Map from a continuous space (high resolution) to a discrete space (low resolution)
– E.g., color quantization
• 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors)
• A compression rate of 3

$X = \{x^t\}_{t=1}^{N} \; \xrightarrow{\;\text{encode (index)}\;} \; F(X) = \{m_j\}_{j=1}^{k}$, where $m_j$ is the cluster centroid (reference vector, code word, code vector);  Dim(x^t) = 24 → k = 2^8
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
– Total reconstruction error:

$E\left( \{m_i\}_{i=1}^{k} \mid X \right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \, \| x^t - m_i \|^2$, where $b_i^t = \begin{cases} 1 & \text{if } \| x^t - m_i \| = \min_j \| x^t - m_j \| \\ 0 & \text{otherwise} \end{cases}$

– $b_i^t$ (the label) and $m_i$ are unknown
– $b_i^t$ depends on $m_i$, and this optimization problem cannot be solved analytically
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
• Initialization
– A set of initial cluster centers $\{m_i\}_{i=1}^{k}$ is needed
• Recursion
– Assign each object $x^t$ to the cluster whose center is closest:

$b_i^t = \begin{cases} 1 & \text{if } \| x^t - m_i \| = \min_j \| x^t - m_j \| \\ 0 & \text{otherwise} \end{cases}$

– Then re-compute the center of each cluster as the centroid or mean (average) of its members:

$m_i = \frac{\sum_{t=1}^{N} b_i^t \, x^t}{\sum_{t=1}^{N} b_i^t}$

• Using the medoid as the cluster center (a medoid is one of the objects in the cluster) is an alternative
• These two steps are repeated until $m_i$ stabilizes
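The two alternating steps translate directly into code. A minimal sketch, assuming Euclidean distance, numpy, and random seeding (all names are mine):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial seeds: pick at random
    for _ in range(n_iter):
        # Assignment step: label t gets the index of the closest center m_i
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean (centroid) of its members
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):                # centers have stabilized
            break
        centers = new_centers
    return centers, labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in ([0, 0], [6, 0], [3, 5])])
    centers, labels = kmeans(X, k=3)
    print(centers)
```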
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
• Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
• Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
• Example 2
[Figure: document clustering example with clusters labeled government, finance, sports, research, name]
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
• Choice of initial cluster centers (seeds) is important
– Pick at random
– Calculate the mean $m$ of all data and generate the k initial centers $m_i$ by adding a small random vector $\pm\delta$ to the mean
– Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $m_i$
– Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
• E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the complete set
• Poor seeds will result in sub-optimal clustering
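A rough sketch of buckshot-style seeding, under the assumptions that scipy's average-linkage routine stands in for group-average agglomerative clustering and that the sample size is ⌈√N⌉:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def buckshot_seeds(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n_sample = int(np.ceil(np.sqrt(len(X))))                  # sample of size sqrt(N)
    sample = X[rng.choice(len(X), size=n_sample, replace=False)]
    Z = linkage(sample, method="average")                     # group-average agglomerative clustering
    labels = fcluster(Z, t=k, criterion="maxclust")           # cut the dendrogram into (at most) k groups
    return np.array([sample[labels == c].mean(axis=0) for c in np.unique(labels)])

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(loc=m, size=(200, 2)) for m in ([0, 0], [8, 0], [4, 7])])
    seeds = buckshot_seeds(X, k=3)
    print(seeds)       # use these as the initial centers for K-means
```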
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
• How to break ties in case there are several centers with the same distance from an object?
– Randomly assign the object to one of the candidate clusters
– Or perturb objects slightly
• Applications of the K-means algorithm
– Clustering
– Vector quantization
– A preprocessing stage before classification or regression
• Map from the original space to an l-dimensional space/hypercube, with l = log2(k) (k clusters); the nodes on the hypercube feed a linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
• E.g., the LBG algorithm
– By Linde, Buzo, and Gray
– M → 2M centers at each iteration: the global mean is split into the cluster 1 and cluster 2 means, and those in turn into (μ11, Σ11, ω11), (μ12, Σ12, ω12), (μ13, Σ13, ω13), (μ14, Σ14, ω14), and so on

[Figure: successive binary splitting of cluster means in the LBG algorithm]
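A simplified sketch of the splitting idea (my own reading of it): every current center is split into m ± δ, the doubled codebook is refined with a few K-means passes, and the process repeats until k centers are obtained.

```python
import numpy as np

def refine(X, centers, n_iter=10):
    # Standard K-means refinement of a given set of centers
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
                            for i in range(len(centers))])
    return centers

def lbg(X, k, delta=1e-3):
    centers = X.mean(axis=0, keepdims=True)        # start from the global mean (M = 1)
    while len(centers) < k:
        centers = np.vstack([centers - delta, centers + delta])   # M -> 2M by splitting
        centers = refine(X, centers)
    return centers

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(loc=m, size=(100, 2)) for m in ([0, 0], [5, 0], [0, 5], [5, 5])])
    print(lbg(X, k=4))
```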
IR ndash Berlin Chen 35
The EM Algorithm
• A soft version of the K-means algorithm
– Each object could be the member of multiple clusters
– Clustering as estimating a mixture of (continuous) probability distributions

[Figure: an observation $\vec{x}_i$ generated by a mixture of K clusters with priors $\pi_k = P(c_k)$ and component densities $P(\vec{x}_i \mid c_k)$ — a mixture Gaussian HMM, or a mixture of Gaussians]

$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)$

Continuous case:
$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$

Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$ are independent, identically distributed (i.i.d.):
$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)$

Classification:
$P(c_k \mid \vec{x}_i, \Theta) = \frac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)}, \qquad \max_k P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)$
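To ground the notation, a small sketch (mine, using numpy) that evaluates the mixture likelihood P(x|Θ) and the posterior P(c_k|x, Θ) used for classification:

```python
import numpy as np

def gaussian(x, mu, sigma):
    # Multivariate normal density P(x | c_k, Theta)
    m = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt(((2 * np.pi) ** m) * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

def mixture_density(x, priors, mus, sigmas):
    # P(x | Theta) = sum_k P(x | c_k, Theta) P(c_k | Theta)
    return sum(p * gaussian(x, mu, sig) for p, mu, sig in zip(priors, mus, sigmas))

def posteriors(x, priors, mus, sigmas):
    # P(c_k | x, Theta) = P(x | c_k, Theta) P(c_k | Theta) / P(x | Theta)
    joint = np.array([p * gaussian(x, mu, sig) for p, mu, sig in zip(priors, mus, sigmas)])
    return joint / joint.sum()

if __name__ == "__main__":
    priors = [0.6, 0.4]
    mus = [np.zeros(2), np.array([3.0, 3.0])]
    sigmas = [np.eye(2), np.eye(2)]
    x = np.array([2.5, 2.0])
    post = posteriors(x, priors, mus, sigmas)
    print(post, "-> classified as cluster", post.argmax())
```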
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
• Hard Assignment
State S1:
P(B|S1) = 2/4 = 0.5
P(W|S1) = 2/4 = 0.5
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
• Soft Assignment
– Four observations (B, W, B, W) with soft assignments to states S1 and S2:

State S1   State S2
0.7        0.3
0.4        0.6
0.9        0.1
0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27
P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
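The soft-assignment estimates above can be reproduced in a few lines (a sketch; the observation sequence B, W, B, W and the per-state weights are read off the table):

```python
# Observations and their soft assignments (state S1, state S2)
obs     = ["B", "W", "B", "W"]
weights = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]

for s, name in enumerate(["S1", "S2"]):
    total = sum(w[s] for w in weights)      # expected count of visits to the state
    for symbol in ["B", "W"]:
        soft = sum(w[s] for o, w in zip(obs, weights) if o == symbol)
        print(f"P({symbol}|{name}) = {soft:.1f}/{total:.1f} = {soft / total:.2f}")
```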
IR ndash Berlin Chen 39
The EM Algorithm (cont)
• E–step (Expectation)
– Derive the complete data likelihood function

$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta) \right]$
$= \left[ \sum_{k_1=1}^{K} P(\vec{x}_1 \mid c_{k_1}, \Theta) P(c_{k_1} \mid \Theta) \right] \times \cdots \times \left[ \sum_{k_n=1}^{K} P(\vec{x}_n \mid c_{k_n}, \Theta) P(c_{k_n} \mid \Theta) \right]$
$= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta) P(c_{k_i} \mid \Theta) \right]$
$= \sum_{C} P(X, C \mid \Theta)$

where $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, $C = \{c_{k_1}, c_{k_2}, \ldots, c_{k_n}\}$, and $P(X, C \mid \Theta)$ is the complete data likelihood function.

How many kinds of C? $K^n$ kinds.

Note:
$\prod_{t=1}^{T} \left[ \sum_{k=1}^{M} a_{t k} \right] = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$
IR ndash Berlin Chen 40
The EM Algorithm (cont)
• E–step (Expectation)
– Define the auxiliary function $\Phi(\hat{\Theta}, \Theta)$ as the expectation of the log complete data likelihood function $L_{CM}$ with respect to the hidden/latent variable C, conditioned on the known data $(X, \Theta)$:

$\Phi(\hat{\Theta}, \Theta) = E_C\left[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \right] = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})$

($\Theta$: known/current estimate; $\hat{\Theta}$: unknown/new estimate)

– Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\hat{\Theta}, \Theta)$
• We have shown this property when deriving the HMM-based retrieval model:

$Q(\hat{\Theta}, \Theta) - Q(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$
IR ndash Berlin Chen 41
The EM Algorithm (cont)
• E–step (Expectation)
– The auxiliary function $\Phi(\hat{\Theta}, \Theta)$:

$\Phi(\hat{\Theta}, \Theta) = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta})$
$= \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \log \left[ \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \hat{\Theta}) P(c_{k_i} \mid \hat{\Theta}) \right]$
$= \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \log \left[ P(\vec{x}_i \mid c_{k_i}, \hat{\Theta}) P(c_{k_i} \mid \hat{\Theta}) \right]$
$= \sum_{k=1}^{K} \sum_{i=1}^{n} \left\{ \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \right\} \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta}) P(c_k \mid \hat{\Theta}) \right]$
$= \sum_{k=1}^{K} \sum_{i=1}^{n} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta}) P(c_k \mid \hat{\Theta}) \right]$   (see the next slide)
$= \sum_{k=1}^{K} \sum_{i=1}^{n} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{k=1}^{K} \sum_{i=1}^{n} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})$

where $P(C \mid X, \Theta) = \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)$ and $\delta_{k, k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}$
IR ndash Berlin Chen 42
The EM Algorithm (cont)
– Note that

$\sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right]$
$= \left[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{j=1, j \neq i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \cdot P(c_k \mid \vec{x}_i, \Theta)$
$= \left[ \prod_{j=1, j \neq i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \cdot P(c_k \mid \vec{x}_i, \Theta)$
$= 1 \cdot P(c_k \mid \vec{x}_i, \Theta) = P(c_k \mid \vec{x}_i, \Theta)$

since $\sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) = 1$ for every j, and $\vec{x}_i$ can only be aligned to $c_k$ (the term selected by $\delta_{k, k_i}$).

Note:
$\prod_{t=1}^{T} \left[ \sum_{k=1}^{M} a_{t k} \right] = (a_{11} + \cdots + a_{1M})(a_{21} + \cdots + a_{2M}) \cdots (a_{T1} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$
IR ndash Berlin Chen 43
The EM Algorithm (cont)
• E–step (Expectation)
– The auxiliary function can also be divided into two parts:

$\Phi(\hat{\Theta}, \Theta) = \Phi_a(\hat{\Theta}, \Theta) + \Phi_b(\hat{\Theta}, \Theta)$, where

$\Phi_a(\hat{\Theta}, \Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})$
(auxiliary function for the mixture weights)

$\Phi_b(\hat{\Theta}, \Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$
(auxiliary function for the cluster distributions)
IR ndash Berlin Chen 44
The EM Algorithm (cont)
• M-step (Maximization)
– Remember that
• Maximize a function F with a constraint by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$ with the constraint $\sum_{j=1}^{N} y_j = 1$.

By applying a Lagrange multiplier $l$:
$F = \sum_{j=1}^{N} w_j \log y_j + l \left( 1 - \sum_{j=1}^{N} y_j \right)$
$\frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} - l = 0 \;\Rightarrow\; w_j = l \, y_j, \;\forall j \;\Rightarrow\; l = \sum_{j=1}^{N} w_j$
$\therefore \; \hat{y}_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$

(Note: $\frac{\partial \log y_j}{\partial y_j} = \frac{1}{y_j}$)
IR ndash Berlin Chen 45
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize $\Phi_a(\hat{\Theta}, \Theta)$ (the auxiliary function for the mixture weights, or the priors for Gaussians)

$\Phi_a'(\hat{\Theta}, \Theta) = \Phi_a(\hat{\Theta}, \Theta) + l \left( 1 - \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) \right) = \sum_{k=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta}) + l \left( 1 - \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) \right)$

With $w_k = \sum_{i=1}^{n} P(c_k \mid \vec{x}_i, \Theta)$ (the expected number of times that the samples fall in class $c_k$) and $y_k = P(c_k \mid \hat{\Theta})$, the Lagrange-multiplier result of the previous slide gives

$\hat{\pi}_k = \hat{P}(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} = \frac{\sum_{i=1}^{n} P(c_k \mid \vec{x}_i, \Theta)}{n} = \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)}$
IR ndash Berlin Chen 46
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize $\Phi_b(\hat{\Theta}, \Theta)$ (the auxiliary function for the (multivariate) Gaussian means and variances)

$\Phi_b(\hat{\Theta}, \Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$

where $P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2} |\hat{\Sigma}_k|^{1/2}} \exp\!\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)$

Let $w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)}$ and

$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$

Then
$\Phi_b(\hat{\Theta}, \Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D \right]$, where D is a constant.
IR ndash Berlin Chen 47
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize $\Phi_b(\hat{\Theta}, \Theta)$ with respect to $\hat{\vec{\mu}}_k$

$\Phi_b(\hat{\Theta}, \Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D \right]$

$\frac{\partial \Phi_b}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \, \hat{\Sigma}_k^{-1} (\hat{\vec{\mu}}_k - \vec{x}_i) = 0$

$\Rightarrow \sum_{i=1}^{n} w_{ik} \, \hat{\vec{\mu}}_k = \sum_{i=1}^{n} w_{ik} \, \vec{x}_i$

$\Rightarrow \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik} \, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)} \, \vec{x}_i}{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)}}$

($\sum_{i=1}^{n} w_{ik}$: the expected number of times that the samples fall in class $c_k$)

(Note: $\frac{d}{d\vec{x}} \left( \vec{x}^T C \vec{x} \right) = (C + C^T)\vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here)
IR ndash Berlin Chen 48
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize $\Phi_b(\hat{\Theta}, \Theta)$ with respect to $\hat{\Sigma}_k$

$\Phi_b(\hat{\Theta}, \Theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D \right]$

$\frac{\partial \Phi_b}{\partial \hat{\Sigma}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0$

$\Rightarrow \sum_{i=1}^{n} w_{ik} \, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik} \, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1}$

$\Rightarrow$ (pre- and post-multiplying by $\hat{\Sigma}_k$)  $\hat{\Sigma}_k \sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T$

$\Rightarrow \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)}}$

(Notes: $\frac{d}{dX} \det(X) = \det(X) \cdot (X^{-1})^T$ and $\frac{d}{dX} \left( \vec{a}^T X^{-1} \vec{b} \right) = -X^{-1} \vec{a} \vec{b}^T X^{-1}$ for symmetric X; $\hat{\Sigma}_k$ is symmetric here)
IR ndash Berlin Chen 49
The EM Algorithm (cont)
• The initial cluster distributions can be estimated using the K-means algorithm
• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
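Putting the E-step and the M-step re-estimation formulas of the preceding slides together, one training loop might be sketched as follows (assumptions: numpy, full covariances with a small ridge for numerical stability, and random initial means in place of a K-means initialization):

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    n, m = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(n, size=K, replace=False)]             # initial means
    sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(m)] * K)   # initial covariances
    priors = np.full(K, 1.0 / K)                              # initial mixture weights P(c_k)
    for _ in range(n_iter):
        # E-step: responsibilities w_ik = P(c_k | x_i, Theta)
        joint = np.empty((n, K))
        for k in range(K):
            diff = X - mus[k]
            inv = np.linalg.inv(sigmas[k])
            norm = 1.0 / np.sqrt(((2 * np.pi) ** m) * np.linalg.det(sigmas[k]))
            joint[:, k] = priors[k] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
        w = joint / joint.sum(axis=1, keepdims=True)
        log_likelihood = np.log(joint.sum(axis=1)).sum()      # monitor P(X | Theta) for convergence
        # M-step: re-estimate priors, means, and covariances from the soft counts
        Nk = w.sum(axis=0)                                    # expected number of samples in c_k
        priors = Nk / n
        mus = (w.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (w[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(m)
    return priors, mus, sigmas, log_likelihood

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(loc=c, size=(150, 2)) for c in ([0, 0], [4, 4])])
    priors, mus, sigmas, ll = em_gmm(X, K=2)
    print(priors, mus, ll)
```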
IR ndash Berlin Chen 50
Hierarchical Document Organization
• Explore the probabilistic latent topical information
– TMM/PLSA approach
• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
• Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map
• When a cluster has many documents, we can further analyze it into another map on the next layer

Two-dimensional tree structure for organized topics:

$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k) P(w_j \mid T_l) \right]$

$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right], \qquad dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$
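A sketch of how P(w_j|D_i) would be evaluated under these definitions (an illustration with made-up topic coordinates and distributions; P(T_l|Y_k) is the Gaussian neighborhood weight on the 2-D map):

```python
import numpy as np

def neighborhood(coords, sigma=1.0):
    # P(T_l | Y_k): Gaussian weight of topic l in the neighborhood of topic k on the map
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    E = np.exp(-d2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return E / E.sum(axis=1, keepdims=True)          # rows: k, columns: l

def word_given_doc(P_T_given_D, P_w_given_T, P_T_given_Y):
    # P(w_j | D_i) = sum_k P(T_k | D_i) * sum_l P(T_l | Y_k) P(w_j | T_l)
    smoothed_topics = P_T_given_Y @ P_w_given_T      # (K topics) x (J words)
    return P_T_given_D @ smoothed_topics             # (N docs) x (J words)

if __name__ == "__main__":
    coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 2x2 topic map
    rng = np.random.default_rng(6)
    P_w_given_T = rng.dirichlet(np.ones(10), size=4)       # K=4 topics, J=10 words
    P_T_given_D = rng.dirichlet(np.ones(4), size=3)        # N=3 documents
    P = word_given_doc(P_T_given_D, P_w_given_T, neighborhood(coords))
    print(P.sum(axis=1))     # each row sums to 1: a proper distribution over words
```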
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
– EM training can be performed

$L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k) P(w_j \mid T_l) \right] \right\}$

$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i) \, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i=1}^{N} c(w_{j'}, D_i) \, P(T_k \mid w_{j'}, D_i)}$

$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i) \, P(T_k \mid w_j, D_i)}{c(D_i)}$

where

$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l) P(T_l \mid Y_k) \right] P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left[ \sum_{l=1}^{K} P(w_j \mid T_l) P(T_l \mid Y_{k'}) \right] P(T_{k'} \mid D_i)}$
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
• Criterion for topic word selection:

$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i) \, P(T_k \mid D_i)}{\sum_{i=1}^{N} c(w_j, D_i) \left[ 1 - P(T_k \mid D_i) \right]}$
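As a small sketch (array names are mine), the selection score can be computed for every word–topic pair directly from the term counts c(w_j, D_i) and the trained P(T_k|D_i):

```python
import numpy as np

def topic_word_score(counts, P_T_given_D):
    # counts: (N docs) x (J words) term counts c(w_j, D_i)
    # P_T_given_D: (N docs) x (K topics)
    # S(w_j, T_k) = sum_i c(w_j,D_i) P(T_k|D_i) / sum_i c(w_j,D_i) (1 - P(T_k|D_i))
    in_topic = counts.T @ P_T_given_D                 # (J, K)
    out_topic = counts.T @ (1.0 - P_T_given_D)        # (J, K)
    return in_topic / (out_topic + 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    counts = rng.integers(0, 5, size=(6, 12)).astype(float)   # 6 docs, 12 words
    P_T_given_D = rng.dirichlet(np.ones(3), size=6)           # 3 topics
    S = topic_word_score(counts, P_T_given_D)
    print(S.argmax(axis=0))    # the highest-scoring (most representative) word per topic
```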
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
• Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
• Example (cont.)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
• Self-Organization Map (SOM)
– A recursive regression process

[Figure: SOM network with an input layer and a mapping layer; each mapping-layer node i has a weight vector $\vec{m}_i = [m_{i1}, m_{i2}, \ldots, m_{in}]^T$ and the input vector is $\vec{x} = [x_1, x_2, \ldots, x_n]^T$]

$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}), i}(t) \left[ \vec{x}(t) - \vec{m}_i(t) \right]$

where the winning (best-matching) node is
$c(\vec{x}) = \arg\min_{i'} \| \vec{x} - \vec{m}_{i'} \|, \qquad \| \vec{x} - \vec{m}_{i'} \|^2 = \sum_{n} (x_n - m_{i'n})^2$

and the neighborhood function is
$h_{c(\vec{x}), i}(t) = \alpha(t) \exp\!\left( -\frac{\| \vec{r}_i - \vec{r}_{c(\vec{x})} \|^2}{2\sigma^2(t)} \right)$
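A minimal sketch of this recursive regression, assuming a rectangular map, numpy, and exponentially decaying α(t) and σ(t) (all names are mine):

```python
import numpy as np

def train_som(X, grid=(4, 4), n_iter=1000, alpha0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([[r, c] for r in range(rows) for c in range(cols)], dtype=float)
    W = rng.normal(size=(rows * cols, X.shape[1]))            # weight vectors m_i
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        winner = np.linalg.norm(x - W, axis=1).argmin()       # c(x) = argmin_i ||x - m_i||
        alpha = alpha0 * np.exp(-t / n_iter)                  # learning rate alpha(t)
        sigma = sigma0 * np.exp(-t / n_iter)                  # neighborhood radius sigma(t)
        d2 = ((coords - coords[winner]) ** 2).sum(axis=1)     # ||r_i - r_c(x)||^2 on the map
        h = alpha * np.exp(-d2 / (2 * sigma ** 2))            # neighborhood function h_{c(x),i}(t)
        W += h[:, None] * (x - W)                             # m_i(t+1) = m_i(t) + h [x - m_i(t)]
    return W, coords

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    X = np.vstack([rng.normal(loc=m, size=(200, 3)) for m in ([0, 0, 0], [4, 4, 0], [0, 4, 4])])
    W, coords = train_som(X)
    print(W.shape)
```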
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
• Results

[Table: average within-cluster distance (distWithin) and between-cluster distance (distBetween) on the map, compared for the TMM model (at several numbers of training iterations) and the SOM model]

$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$

$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & \text{if } T_{r(i)} \neq T_{r(j)} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i, j) = \begin{cases} 1 & \text{if } T_{r(i)} \neq T_{r(j)} \\ 0 & \text{otherwise} \end{cases}$

$dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}, \qquad f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & \text{if } T_{r(i)} = T_{r(j)} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i, j) = \begin{cases} 1 & \text{if } T_{r(i)} = T_{r(j)} \\ 0 & \text{otherwise} \end{cases}$

where $dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$
IR ndash Berlin Chen 9
Hierarchical Clustering
IR ndash Berlin Chen 10
Hierarchical Clustering
bull Can be in either bottom-up or top-down mannersndash Bottom-up (agglomerative)
bull Start with individual objects and grouping the most similar ones
ndash Eg with the minimum distance apart
bull The procedure terminates when one cluster containing all objects has been formed
ndash Top-down (divisive)bull Start with all objects in a group and divide them into groups
so as to maximize within-group similarity
( ) ( )yxdyxsim
11
+=
凝集的
分裂的
distance measures willbe discussed later on
IR ndash Berlin Chen 11
Hierarchical Agglomerative Clustering (HAC)
bull A bottom-up approach
bull Assume a similarity measure for determining the similarity of two objects
bull Start with all objects in a separate cluster and then repeatedly joins the two clusters that have the most similarity until there is one only cluster survived
bull The history of mergingclustering forms a binary tree or hierarchy
IR ndash Berlin Chen 12
HAC (cont)
bull Algorithm
cluster number
Initialization (for tree leaves)Each object is a cluster
merged as a new cluster
The original two clusters are removed
IR ndash Berlin Chen 13
Distance Metrics
bull Euclidian Distance (L2 norm)
ndash Make sure that all attributesdimensions have the same scale (orthe same variance)
bull L1 Norm (City-block distance)
bull Cosine Similarity (transform to a distance by subtracting from 1)
2
12 )()( i
m
ii yxyxL minus=sum
=
rr
sum=
minus=m
iii yxyxL
11 )( rr
yxyxrr
rr
sdotminus
bull1 ranged between 0 and 1
IR ndash Berlin Chen 14
Measures of Cluster Similaritybull Especially for the bottom-up approaches
bull Single-link clusteringndash The similarity between two clusters is the similarity of the two
closest objects in the clusters
ndash Search over all pairs of objects that are from the two differentclusters and select the pair with the greatest similarity
ndash Elongated clusters are achieved
Ci Cj
greatest similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= max
cf the minimal spanning tree
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 10
Hierarchical Clustering
bull Can be in either bottom-up or top-down mannersndash Bottom-up (agglomerative)
bull Start with individual objects and grouping the most similar ones
ndash Eg with the minimum distance apart
bull The procedure terminates when one cluster containing all objects has been formed
ndash Top-down (divisive)bull Start with all objects in a group and divide them into groups
so as to maximize within-group similarity
( ) ( )yxdyxsim
11
+=
凝集的
分裂的
distance measures willbe discussed later on
IR ndash Berlin Chen 11
Hierarchical Agglomerative Clustering (HAC)
bull A bottom-up approach
bull Assume a similarity measure for determining the similarity of two objects
bull Start with all objects in a separate cluster and then repeatedly joins the two clusters that have the most similarity until there is one only cluster survived
bull The history of mergingclustering forms a binary tree or hierarchy
IR ndash Berlin Chen 12
HAC (cont)
bull Algorithm
cluster number
Initialization (for tree leaves)Each object is a cluster
merged as a new cluster
The original two clusters are removed
IR ndash Berlin Chen 13
Distance Metrics
bull Euclidian Distance (L2 norm)
ndash Make sure that all attributesdimensions have the same scale (orthe same variance)
bull L1 Norm (City-block distance)
bull Cosine Similarity (transform to a distance by subtracting from 1)
2
12 )()( i
m
ii yxyxL minus=sum
=
rr
sum=
minus=m
iii yxyxL
11 )( rr
yxyxrr
rr
sdotminus
bull1 ranged between 0 and 1
IR ndash Berlin Chen 14
Measures of Cluster Similaritybull Especially for the bottom-up approaches
bull Single-link clusteringndash The similarity between two clusters is the similarity of the two
closest objects in the clusters
ndash Search over all pairs of objects that are from the two differentclusters and select the pair with the greatest similarity
ndash Elongated clusters are achieved
Ci Cj
greatest similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= max
cf the minimal spanning tree
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
• Self-Organizing Map (SOM) – a recursive regression process

Input layer: input vector $\vec{x}=[x_1, x_2, \ldots, x_n]^{T}$

Mapping layer: each map unit $i$ carries a weight vector $\vec{m}_i=[m_{i1}, m_{i2}, \ldots, m_{in}]^{T}$ (e.g. $\vec{m}_1=[m_{11}, m_{12}, \ldots, m_{1n}]^{T}$)

$$\vec{m}_i(t+1)=\vec{m}_i(t)+h_{c(\vec{x}),i}(t)\left[\vec{x}(t)-\vec{m}_i(t)\right]$$

where the winning (best-matching) unit is

$$c(\vec{x})=\arg\min_{i'}\left\|\vec{x}-\vec{m}_{i'}\right\|,\qquad \left\|\vec{x}-\vec{m}_{i'}\right\|^{2}=\sum_{n}\left(x_n-m_{i'n}\right)^{2}$$

and the neighborhood function is

$$h_{c(\vec{x}),i}(t)=\alpha(t)\,\exp\!\left(-\frac{\left\|\vec{r}_{c(\vec{x})}-\vec{r}_i\right\|^{2}}{2\sigma^{2}(t)}\right)$$
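A minimal sketch of this recursive regression on a small rectangular map; the decaying learning-rate and neighborhood-width schedules are assumptions, not from the slides:

```python
import numpy as np

def train_som(X, grid=(4, 4), n_iter=1000, alpha0=0.5, sigma0=2.0, seed=0):
    """SOM training: pull the winner and its map neighbors toward x(t)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    M = rng.normal(size=(rows * cols, X.shape[1]))        # weight vectors m_i

    for t in range(n_iter):
        x = X[rng.integers(len(X))]                       # x(t): a randomly drawn sample
        winner = np.argmin(((M - x) ** 2).sum(1))         # c(x) = argmin_i ||x - m_i||
        alpha = alpha0 * (1 - t / n_iter)                 # decaying learning rate (assumed schedule)
        sigma = sigma0 * (1 - t / n_iter) + 1e-3          # decaying neighborhood width
        d2 = ((coords - coords[winner]) ** 2).sum(1)      # ||r_c(x) - r_i||^2 on the map
        h = alpha * np.exp(-d2 / (2 * sigma ** 2))        # h_{c(x),i}(t)
        M += h[:, None] * (x - M)                         # m_i(t+1) = m_i(t) + h [x - m_i(t)]
    return M, coords
```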
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
• Results

Comparison of TMM and SOM (table columns: Model, Iterations, $dist_{Within}$, $dist_{Between}$), summarized by the ratio

$$R_{dist}=\frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Between}(i,j)},\qquad
f_{Between}(i,j)=\begin{cases}dist_{Map}(i,j) & T_{r_i}\ne T_{r_j}\\ 0 & \text{otherwise}\end{cases},\qquad
C_{Between}(i,j)=\begin{cases}1 & T_{r_i}\ne T_{r_j}\\ 0 & \text{otherwise}\end{cases}$$

$$dist_{Within}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Within}(i,j)},\qquad
f_{Within}(i,j)=\begin{cases}dist_{Map}(i,j) & T_{r_i}=T_{r_j}\\ 0 & \text{otherwise}\end{cases},\qquad
C_{Within}(i,j)=\begin{cases}1 & T_{r_i}=T_{r_j}\\ 0 & \text{otherwise}\end{cases}$$

$$dist_{Map}(i,j)=\left(x_i-x_j\right)^{2}+\left(y_i-y_j\right)^{2}$$
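A sketch of how this evaluation could be computed, given each document's map position and its reference topic label $T_{r_i}$ (function and variable names are illustrative):

```python
import numpy as np

def map_separation(coords, labels):
    """R = dist_Between / dist_Within for documents placed on a 2-D map.

    coords : (D, 2) map positions (x, y) of the D documents.
    labels : (D,) reference topic label T_{r_i} of each document.
    """
    coords = np.asarray(coords, float)
    within_sum = within_cnt = between_sum = between_cnt = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = ((coords[i] - coords[j]) ** 2).sum()      # dist_Map(i, j)
            if labels[i] == labels[j]:
                within_sum += d; within_cnt += 1
            else:
                between_sum += d; between_cnt += 1
    dist_within = within_sum / max(within_cnt, 1)
    dist_between = between_sum / max(between_cnt, 1)
    return dist_between / max(dist_within, 1e-12)
```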
IR ndash Berlin Chen 12
HAC (cont)
bull Algorithm
cluster number
Initialization (for tree leaves)Each object is a cluster
merged as a new cluster
The original two clusters are removed
IR ndash Berlin Chen 13
Distance Metrics
bull Euclidian Distance (L2 norm)
ndash Make sure that all attributesdimensions have the same scale (orthe same variance)
bull L1 Norm (City-block distance)
bull Cosine Similarity (transform to a distance by subtracting from 1)
2
12 )()( i
m
ii yxyxL minus=sum
=
rr
sum=
minus=m
iii yxyxL
11 )( rr
yxyxrr
rr
sdotminus
bull1 ranged between 0 and 1
IR ndash Berlin Chen 14
Measures of Cluster Similaritybull Especially for the bottom-up approaches
bull Single-link clusteringndash The similarity between two clusters is the similarity of the two
closest objects in the clusters
ndash Search over all pairs of objects that are from the two differentclusters and select the pair with the greatest similarity
ndash Elongated clusters are achieved
Ci Cj
greatest similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= max
cf the minimal spanning tree
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 13
Distance Metrics
bull Euclidian Distance (L2 norm)
ndash Make sure that all attributesdimensions have the same scale (orthe same variance)
bull L1 Norm (City-block distance)
bull Cosine Similarity (transform to a distance by subtracting from 1)
2
12 )()( i
m
ii yxyxL minus=sum
=
rr
sum=
minus=m
iii yxyxL
11 )( rr
yxyxrr
rr
sdotminus
bull1 ranged between 0 and 1
IR ndash Berlin Chen 14
Measures of Cluster Similaritybull Especially for the bottom-up approaches
bull Single-link clusteringndash The similarity between two clusters is the similarity of the two
closest objects in the clusters
ndash Search over all pairs of objects that are from the two differentclusters and select the pair with the greatest similarity
ndash Elongated clusters are achieved
Ci Cj
greatest similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= max
cf the minimal spanning tree
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize $\Phi_b(\hat\Theta,\Theta)$ with respect to $\hat\Sigma_k$:

$$\Phi_b(\hat\Theta,\Theta)=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\Big[-\frac{1}{2}\log|\hat\Sigma_k|-\frac{1}{2}(\vec{x}_i-\hat{\vec\mu}_k)^T\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)+D\Big]$$

$$\frac{\partial\Phi_b}{\partial\hat\Sigma_k}=\sum_{i=1}^{n}w_{ik}\Big[-\frac{1}{2}\hat\Sigma_k^{-1}+\frac{1}{2}\,\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^T\hat\Sigma_k^{-1}\Big]=0$$

$$\Rightarrow\;\sum_{i=1}^{n}w_{ik}\,\hat\Sigma_k^{-1}=\sum_{i=1}^{n}w_{ik}\,\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^T\hat\Sigma_k^{-1}$$

Pre- and post-multiplying both sides by $\hat\Sigma_k$:

$$\Rightarrow\;\hat\Sigma_k\sum_{i=1}^{n}w_{ik}=\sum_{i=1}^{n}w_{ik}(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^T\;\Rightarrow\;\hat\Sigma_k=\frac{\sum_{i=1}^{n}w_{ik}(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^T}{\sum_{i=1}^{n}w_{ik}}=\frac{\sum_{i=1}^{n}\dfrac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^T}{\sum_{i=1}^{n}\dfrac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}}$$

(Matrix calculus used here: $\dfrac{d\,[\det(X)]}{dX}=\det(X)\cdot(X^{-1})^T$ and $\dfrac{d\,(\vec{a}^TX^{-1}\vec{b})}{dX}=-X^{-1}\vec{a}\,\vec{b}^TX^{-1}$, with $\hat\Sigma_k$ symmetric here.)

IR – Berlin Chen 49
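And the matching numpy sketch for the covariance re-estimation (again illustrative; the reg ridge term is an assumption of mine to keep each Σ_k invertible, it is not on the slide):

```python
import numpy as np

def update_covariances(X, w, means, reg=1e-6):
    """M-step for the covariances:
    Sigma_k = sum_i w_ik (x_i - mu_k)(x_i - mu_k)^T / sum_i w_ik."""
    n, d = X.shape
    K = w.shape[1]
    Nk = w.sum(axis=0)
    covs = np.empty((K, d, d))
    for k in range(K):
        diff = X - means[k]                                # (n, d) deviations from mu_k
        covs[k] = (w[:, k, None] * diff).T @ diff / Nk[k]  # weighted outer products
        covs[k] += reg * np.eye(d)                         # keep Sigma_k well-conditioned
    return covs
```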
The EM Algorithm (cont)
• The initial cluster distributions can be estimated using the K-means algorithm
• The procedure terminates when the likelihood function $P(\mathsf{X}|\Theta)$ converges or the maximum number of iterations is reached

IR – Berlin Chen 50
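Putting the pieces together, a minimal end-to-end sketch of this EM procedure for a Gaussian mixture, with K-means initialization and a log-likelihood stopping test, might look as follows. This is my own illustrative code: it assumes scipy is available, that each initial K-means cluster gets at least two points, and it handles degenerate covariances only with a small ridge.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, tol=1e-6, reg=1e-6):
    """EM for a K-component Gaussian mixture (sketch of the updates derived above)."""
    n, d = X.shape
    # Initialization: run K-means and use its clusters for the initial distributions.
    means, labels = kmeans2(X, K, minit='points')
    priors = np.array([(labels == k).mean() for k in range(K)])
    covs = np.array([np.cov(X[labels == k].T) + reg * np.eye(d) for k in range(K)])

    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities w[i, k] = P(c_k | x_i, Theta)
        dens = np.column_stack([
            priors[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
            for k in range(K)])
        w = dens / dens.sum(axis=1, keepdims=True)

        # M-step: priors, means, covariances (the three closed-form updates above)
        Nk = w.sum(axis=0)
        priors = Nk / n
        means = (w.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (w[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)

        # Termination: stop when log P(X | Theta) (under the E-step parameters) converges.
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return priors, means, covs
```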
Hierarchical Document Organization
• Explore the probabilistic latent topical information
  – TMM/PLSA approach
• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
• Related documents fall in the same cluster, and the relationships among the clusters are reflected by their distances on the map
• When a cluster has many documents, we can further analyze it into another map on the next layer

(Figure: two-dimensional tree structure for the organized topics)

$$P(w_j|D_i)=\sum_{k=1}^{K}P(T_k|D_i)\Big[\sum_{l=1}^{K}P(w_j|T_l)\,P(T_l|Y_k)\Big]$$

$$P(T_l|Y_k)=\frac{E(T_k,T_l)}{\sum_{s=1}^{K}E(T_k,T_s)},\qquad E(T_k,T_l)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\Big]$$

$$dist(T_i,T_j)=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2}$$

IR – Berlin Chen 51
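A small numpy sketch of how the map-smoothed topic mixture above turns into P(w_j | D_i). The function and array names are my own, and the 1/(√(2π)σ) prefactor is dropped because it cancels in the normalization:

```python
import numpy as np

def topic_neighborhood(coords, sigma=1.0):
    """P(T_l | Y_k) from the topics' 2-D map coordinates.
    coords: (K, 2) array of (x, y) positions of the K topics on the map."""
    diff = coords[:, None, :] - coords[None, :, :]     # (K, K, 2) pairwise differences
    dist2 = (diff ** 2).sum(axis=-1)                   # squared map distances dist(T_k, T_l)^2
    E = np.exp(-dist2 / (2.0 * sigma ** 2))            # E(T_k, T_l) up to a constant factor
    return E / E.sum(axis=1, keepdims=True)            # row k holds P(T_l | Y_k)

def word_given_doc(P_T_given_D, P_w_given_T, coords, sigma=1.0):
    """P(w_j | D_i) = sum_k P(T_k|D_i) * sum_l P(w_j|T_l) P(T_l|Y_k).
    P_T_given_D: (N, K); P_w_given_T: (K, J)."""
    P_Tl_given_Yk = topic_neighborhood(coords, sigma)  # (K, K)
    smoothed = P_Tl_given_Yk @ P_w_given_T             # (K, J): map-smoothed topic unigrams
    return P_T_given_D @ smoothed                      # (N, J) word probabilities per document
```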
Hierarchical Document Organization (cont)
• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
  – EM training can be performed

$$L_T=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\,\log P(w_j|D_i)=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\,\log\Big\{\sum_{k=1}^{K}P(T_k|D_i)\Big[\sum_{l=1}^{K}P(w_j|T_l)\,P(T_l|Y_k)\Big]\Big\}$$

$$\hat P(w_j|T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k|w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i=1}^{N}c(w_{j'},D_i)\,P(T_k|w_{j'},D_i)},\qquad \hat P(T_k|D_i)=\frac{\sum_{j=1}^{J}c(w_j,D_i)\,P(T_k|w_j,D_i)}{c(D_i)}$$

where

$$P(T_k|w_j,D_i)=\frac{\Big[\sum_{l=1}^{K}P(w_j|T_l)\,P(T_l|Y_k)\Big]P(T_k|D_i)}{\sum_{k'=1}^{K}\Big[\sum_{l=1}^{K}P(w_j|T_l)\,P(T_l|Y_{k'})\Big]P(T_{k'}|D_i)}$$

and $c(D_i)$ is the total count of terms observed in document $D_i$.

IR – Berlin Chen 52
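One EM iteration for these TMM updates can be sketched directly from the formulas. This is an unoptimized illustration with my own array names; it keeps the full (N, K, J) posterior array in memory, which is only reasonable for small collections:

```python
import numpy as np

def tmm_em_step(counts, P_w_given_T, P_T_given_D, P_Tl_given_Yk, eps=1e-12):
    """One EM iteration of the TMM/PLSA map model.
    counts: (N, J) term counts c(w_j, D_i); P_w_given_T: (K, J);
    P_T_given_D: (N, K); P_Tl_given_Yk: (K, K)."""
    smoothed = P_Tl_given_Yk @ P_w_given_T                    # (K, J): sum_l P(w_j|T_l) P(T_l|Y_k)

    # E-step: posterior P(T_k | w_j, D_i), stored as an (N, K, J) array
    post = P_T_given_D[:, :, None] * smoothed[None, :, :]
    post /= post.sum(axis=1, keepdims=True) + eps

    # M-step: re-estimate P(w_j | T_k) and P(T_k | D_i)
    weighted = counts[:, None, :] * post                      # c(w_j, D_i) P(T_k | w_j, D_i)
    new_P_w_given_T = weighted.sum(axis=0)                    # (K, J)
    new_P_w_given_T /= new_P_w_given_T.sum(axis=1, keepdims=True) + eps
    new_P_T_given_D = weighted.sum(axis=2)                    # (N, K)
    new_P_T_given_D /= counts.sum(axis=1, keepdims=True) + eps  # divide by c(D_i)
    return new_P_w_given_T, new_P_T_given_D
```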
Hierarchical Document Organization (cont)
• Criterion for topic word selecting:

$$S(w_j,T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k|D_i)}{\sum_{i'=1}^{N}c(w_j,D_{i'})\,\big[1-P(T_k|D_{i'})\big]}$$

IR – Berlin Chen 53
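In matrix form this score is just two weighted count matrices divided elementwise; a short numpy sketch (names are mine):

```python
import numpy as np

def topic_word_score(counts, P_T_given_D, eps=1e-12):
    """S(w_j, T_k): in-topic vs. out-of-topic weighted counts of each word.
    counts: (N, J) term counts; P_T_given_D: (N, K). Returns a (K, J) score matrix."""
    inside = P_T_given_D.T @ counts                 # sum_i c(w_j, D_i) P(T_k | D_i)
    outside = (1.0 - P_T_given_D).T @ counts        # sum_i c(w_j, D_i) [1 - P(T_k | D_i)]
    return inside / (outside + eps)
```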
Hierarchical Document Organization (cont)
• Example

IR – Berlin Chen 54
Hierarchical Document Organization (cont)
• Example (cont)

IR – Berlin Chen 55
Hierarchical Document Organization (cont)
• Self-Organizing Map (SOM) – a recursive regression process

(Figure: a two-layer network with an input layer and a mapping layer; each map unit $i$ carries a weight vector $\vec{m}_i=[m_{i1},m_{i2},\ldots,m_{in}]^T$, and the input vector is $\vec{x}=[x_1,x_2,\ldots,x_n]^T$.)

$$\vec{m}_i(t+1)=\vec{m}_i(t)+h_{c(\vec{x}),i}(t)\big[\vec{x}(t)-\vec{m}_i(t)\big]$$

where

$$c(\vec{x})=\arg\min_{i'}\|\vec{x}-\vec{m}_{i'}\|,\qquad \|\vec{x}-\vec{m}_{i'}\|=\sqrt{\sum_{n}(x_n-m_{i'n})^2}$$

$$h_{c(\vec{x}),i}(t)=\alpha(t)\cdot\exp\Big(-\frac{\|\vec{r}_i-\vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)}\Big)$$

IR – Berlin Chen 56
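A compact online-training sketch of this SOM update (my own illustration; the linear decay schedules for α(t) and σ(t) are assumptions, since the slide does not fix them):

```python
import numpy as np

def som_train(X, grid_w, grid_h, n_iter=1000, alpha0=0.5, sigma0=2.0, seed=0):
    """Online SOM training following the recursive regression update above.
    X: (N, n) input vectors; the map is a grid_w x grid_h sheet of units."""
    rng = np.random.default_rng(seed)
    n_units = grid_w * grid_h
    # Map coordinates r_i of each unit, and randomly initialized weight vectors m_i.
    coords = np.array([(ix, iy) for ix in range(grid_w) for iy in range(grid_h)], float)
    m = rng.normal(size=(n_units, X.shape[1]))

    for t in range(n_iter):
        x = X[rng.integers(len(X))]                        # pick an input x(t)
        c = np.argmin(((x - m) ** 2).sum(axis=1))          # winner: c(x) = argmin_i ||x - m_i||
        alpha = alpha0 * (1.0 - t / n_iter)                # decaying learning rate alpha(t)
        sigma = sigma0 * (1.0 - t / n_iter) + 1e-3         # decaying neighborhood width sigma(t)
        d2 = ((coords - coords[c]) ** 2).sum(axis=1)       # ||r_i - r_c(x)||^2 on the map
        h = alpha * np.exp(-d2 / (2.0 * sigma ** 2))       # neighborhood function h_{c(x),i}(t)
        m += h[:, None] * (x - m)                          # m_i(t+1) = m_i(t) + h [x(t) - m_i(t)]
    return m, coords
```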
Hierarchical Document Organization (cont)
• Results
  – Comparison of the TMM and SOM models (table columns: Model, Iterations, distWithin, distBetween)
$$R_{Dist}=\frac{dist_{Between}}{dist_{Within}}$$

$$dist_{Between}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Between}(i,j)},\qquad dist_{Within}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Within}(i,j)}$$

where

$$f_{Between}(i,j)=\begin{cases}dist_{Map}(i,j)&\text{if }T_{r_i}\neq T_{r_j}\\0&\text{otherwise}\end{cases},\qquad C_{Between}(i,j)=\begin{cases}1&\text{if }T_{r_i}\neq T_{r_j}\\0&\text{otherwise}\end{cases}$$

$$f_{Within}(i,j)=\begin{cases}dist_{Map}(i,j)&\text{if }T_{r_i}=T_{r_j}\\0&\text{otherwise}\end{cases},\qquad C_{Within}(i,j)=\begin{cases}1&\text{if }T_{r_i}=T_{r_j}\\0&\text{otherwise}\end{cases}$$

$$dist_{Map}(i,j)=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2}$$
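These distance statistics translate almost line-for-line into code; a brief sketch (my own, assuming every document has a map position and an assigned topic, and that both within- and between-topic pairs exist):

```python
import numpy as np

def distance_ratio(coords, topics):
    """R_Dist = dist_Between / dist_Within for documents placed on a 2-D map.
    coords: (D, 2) map positions (x_i, y_i); topics: (D,) assigned topic T_{r_i} per document."""
    D = len(coords)
    within, n_within, between, n_between = 0.0, 0, 0.0, 0
    for i in range(D):
        for j in range(i + 1, D):
            d = np.sqrt(((coords[i] - coords[j]) ** 2).sum())   # dist_Map(i, j)
            if topics[i] == topics[j]:
                within += d; n_within += 1
            else:
                between += d; n_between += 1
    dist_within = within / n_within
    dist_between = between / n_between
    return dist_between / dist_within
```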
IR ndash Berlin Chen 14
Measures of Cluster Similaritybull Especially for the bottom-up approaches
bull Single-link clusteringndash The similarity between two clusters is the similarity of the two
closest objects in the clusters
ndash Search over all pairs of objects that are from the two differentclusters and select the pair with the greatest similarity
ndash Elongated clusters are achieved
Ci Cj
greatest similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= max
cf the minimal spanning tree
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 15
Measures of Cluster Similarity (cont)
bull Complete-link clusteringndash The similarity between two clusters is the similarity of their two
most dissimilar members
ndash Sphere-shaped clusters are achieved
ndash Preferable for most IR and NLP applications
Ci Cj
least similarity
( ) ( )yxsimccsimji cycxji
rrrrisinisin
= min
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 16
Measures of Cluster Similarity (cont)
single link
complete link
IR ndash Berlin Chen 17
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clusteringndash A compromise between single-link and complete-link clustering
ndash The similarity between two clusters is the average similarity between members
ndash If the objects are represented as length-normalized vectors and the similarity measure is the cosine
bull There exists an fast algorithm for computing the average similarity
( ) ( ) yxyxyxyxyxsim
rrrr
rrrrrr
sdot=sdot
== cos
length-normalized vectors
Ci Cj
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\Big[-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\mu}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\mu}_k)\Big]+D$$

Using $\dfrac{d\,\det(X)}{dX}=\det(X)\,(X^{-1})^{T}$ and $\dfrac{d\,(\vec{a}^{T}X^{-1}\vec{b})}{dX}=-X^{-1}\vec{a}\,\vec{b}^{T}X^{-1}$ (here $\hat{\Sigma}_k$ is symmetric):

$$\frac{\partial\Phi_b}{\partial\hat{\Sigma}_k}
=-\frac{1}{2}\sum_{i=1}^{n}w_{ik}\Big[\hat{\Sigma}_k^{-1}-\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\mu}_k)(\vec{x}_i-\hat{\mu}_k)^{T}\hat{\Sigma}_k^{-1}\Big]=0$$

$$\Rightarrow\;\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k^{-1}=\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\mu}_k)(\vec{x}_i-\hat{\mu}_k)^{T}\hat{\Sigma}_k^{-1}$$

(pre- and post-multiplying both sides by $\hat{\Sigma}_k$)

$$\Rightarrow\;\hat{\Sigma}_k\sum_{i=1}^{n}w_{ik}=\sum_{i=1}^{n}w_{ik}\,(\vec{x}_i-\hat{\mu}_k)(\vec{x}_i-\hat{\mu}_k)^{T}$$

$$\Rightarrow\;\hat{\Sigma}_k=\frac{\sum_{i=1}^{n}w_{ik}\,(\vec{x}_i-\hat{\mu}_k)(\vec{x}_i-\hat{\mu}_k)^{T}}{\sum_{i=1}^{n}w_{ik}}
=\frac{\sum_{i=1}^{n}P(c_k\mid\vec{x}_i,\Theta)\,(\vec{x}_i-\hat{\mu}_k)(\vec{x}_i-\hat{\mu}_k)^{T}}{\sum_{i=1}^{n}P(c_k\mid\vec{x}_i,\Theta)}$$
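Putting the three closed-form updates (mixture weights, means, covariances) together, here is a compact NumPy sketch of the M-step; the function name and array layout are mine, not from the slides:

```python
import numpy as np

def m_step(X, resp):
    """Closed-form GMM M-step. resp[i, k] = P(c_k | x_i, Theta) from the E-step."""
    n, _ = X.shape
    Nk = resp.sum(axis=0)                       # expected count of samples per cluster
    weights = Nk / n                            # new mixture weights pi_k
    means = (resp.T @ X) / Nk[:, None]          # new means mu_k
    covs = []
    for k in range(resp.shape[1]):
        diff = X - means[k]
        covs.append((resp[:, k, None] * diff).T @ diff / Nk[k])   # new covariances Sigma_k
    return weights, means, np.array(covs)
```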
IR ndash Berlin Chen 49
The EM Algorithm (cont)
• The initial cluster distributions can be estimated using the K-means algorithm
• The procedure terminates when the likelihood function $P(\mathbf{X}\mid\Theta)$ has converged or a maximum number of iterations is reached
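For reference, this whole recipe (K-means initialization, likelihood-based convergence, an iteration cap) is also available off the shelf. The snippet below uses scikit-learn's GaussianMixture purely as an illustration of the same procedure on synthetic data, not as something the slides prescribe:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),       # two synthetic clusters
               rng.normal(4.0, 1.0, (200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type='full',
                      init_params='kmeans',           # K-means seeds the clusters
                      tol=1e-4, max_iter=100,         # convergence threshold / iteration cap
                      random_state=0).fit(X)

resp = gmm.predict_proba(X)                           # soft assignments P(c_k | x_i)
print(gmm.weights_, gmm.n_iter_)                      # learned priors, EM iterations used
```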
IR ndash Berlin Chen 50
Hierarchical Document Organization
• Explore the probabilistic latent topical information
– TMM/PLSA approach
• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
• Related documents are placed in the same cluster, and the relationships among the clusters are reflected in their distances on the map
• When a cluster contains many documents, it can be further analyzed into another map on the next layer
(Figure: two-dimensional tree structure for organized topics)

$$P(w_j\mid D_i)=\sum_{k=1}^{K}P(T_k\mid D_i)\Big[\sum_{l=1}^{K}P(w_j\mid T_l)\,P(T_l\mid Y_k)\Big]$$

$$P(T_l\mid Y_k)=\frac{E(T_k,T_l)}{\sum_{s=1}^{K}E(T_k,T_s)},\qquad
E(T_k,T_l)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\frac{dist(T_k,T_l)^{2}}{2\sigma^{2}}\Big]$$

$$dist(T_i,T_j)=\sqrt{(x_i-x_j)^{2}+(y_i-y_j)^{2}}$$
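A small sketch of how these pieces fit together in code; the array names, the NumPy layout, and the normalization details are my assumptions, the slides only give the formulas:

```python
import numpy as np

def topic_neighborhood(coords, sigma):
    """P(T_l | Y_k): Gaussian neighborhood between topics on the 2-D map,
    normalized over l for each k. coords: (K, 2) topic positions."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)       # dist(T_k, T_l)^2
    E = np.exp(-d2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return E / E.sum(axis=1, keepdims=True)

def word_given_doc(P_T_given_D, P_w_given_T, P_T_given_Y):
    """P(w_j | D_i) = sum_k P(T_k|D_i) [ sum_l P(w_j|T_l) P(T_l|Y_k) ]."""
    smoothed = P_T_given_Y @ P_w_given_T      # (K, V): neighborhood-smoothed topic-word model
    return P_T_given_D @ smoothed             # (N, V): per-document word distributions
```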
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
– EM training can be performed

$$L_T=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\log P(w_j\mid D_i)
=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\log\Big\{\sum_{k=1}^{K}P(T_k\mid D_i)\Big[\sum_{l=1}^{K}P(w_j\mid T_l)\,P(T_l\mid Y_k)\Big]\Big\}$$

$$\hat{P}(w_j\mid T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i=1}^{N}c(w_{j'},D_i)\,P(T_k\mid w_{j'},D_i)}$$

$$\hat{P}(T_k\mid D_i)=\frac{\sum_{j=1}^{J}c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{c(D_i)}$$

where

$$P(T_k\mid w_j,D_i)=\frac{\Big[\sum_{l=1}^{K}P(w_j\mid T_l)\,P(T_l\mid Y_k)\Big]P(T_k\mid D_i)}
{\sum_{k'=1}^{K}\Big[\sum_{l=1}^{K}P(w_j\mid T_l)\,P(T_l\mid Y_{k'})\Big]P(T_{k'}\mid D_i)}$$
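The posterior P(T_k | w_j, D_i) used in these update formulas can be computed with the same arrays as the earlier sketch; again, the names and the dense (documents x words x topics) layout are my own simplification and would need a sparser formulation for large collections:

```python
import numpy as np

def tmm_posterior(P_T_given_D, P_w_given_T, P_T_given_Y):
    """P(T_k | w_j, D_i), returned as an (N docs, V words, K topics) array."""
    smoothed = P_T_given_Y @ P_w_given_T                         # [k, j] = sum_l P(w_j|T_l) P(T_l|Y_k)
    joint = P_T_given_D[:, None, :] * smoothed.T[None, :, :]     # [i, j, k]: numerator of the posterior
    return joint / joint.sum(axis=2, keepdims=True)              # normalize over topics k
```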
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
• Criterion for selecting topic words

$$S(w_j,T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k\mid D_i)}{\sum_{i=1}^{N}c(w_j,D_i)\,\big[1-P(T_k\mid D_i)\big]}$$
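In code this score is just two matrix products; a short sketch under the same naming assumptions as the earlier snippets:

```python
import numpy as np

def topic_word_score(counts, P_T_given_D):
    """S(w_j, T_k) for every word/topic pair.
    counts: (N, V) term counts c(w_j, D_i); P_T_given_D: (N, K)."""
    inside = counts.T @ P_T_given_D               # sum_i c(w_j, D_i) P(T_k | D_i)
    outside = counts.T @ (1.0 - P_T_given_D)      # sum_i c(w_j, D_i) (1 - P(T_k | D_i))
    return inside / outside
```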
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
• Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
• Example (cont.)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
• Self-Organizing Map (SOM) – a recursive regression process

(Figure: input layer and two-dimensional mapping layer)

Input vector: $\vec{x}=[x_1,x_2,\ldots,x_n]^{T}$; weight vector of map unit $i$: $\vec{m}_i=[m_{i1},m_{i2},\ldots,m_{in}]^{T}$

$$\vec{m}_i(t+1)=\vec{m}_i(t)+h_{c(\vec{x}),i}(t)\,[\vec{x}(t)-\vec{m}_i(t)]$$

where the winning (best-matching) unit is
$$c(\vec{x})=\arg\min_{i'}\|\vec{x}-\vec{m}_{i'}\|,\qquad \|\vec{x}-\vec{m}_{i'}\|^{2}=\sum_{n}(x_n-m_{i'n})^{2}$$

and the neighborhood function is
$$h_{c(\vec{x}),i}(t)=\alpha(t)\exp\Big(-\frac{\|\vec{r}_{c(\vec{x})}-\vec{r}_i\|^{2}}{2\sigma^{2}(t)}\Big)$$
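A single SOM update step following the equations above; the function signature and the use of callable schedules for α(t) and σ(t) are my choices, not part of the slides:

```python
import numpy as np

def som_step(m, x, grid, t, alpha, sigma):
    """One recursive-regression step of the SOM.
    m: (U, n) unit weight vectors, x: (n,) input, grid: (U, 2) unit positions on the map."""
    winner = np.argmin(((m - x) ** 2).sum(axis=1))         # c(x) = argmin_i ||x - m_i||
    d2 = ((grid - grid[winner]) ** 2).sum(axis=1)          # ||r_c(x) - r_i||^2 on the map
    h = alpha(t) * np.exp(-d2 / (2.0 * sigma(t) ** 2))     # neighborhood h_{c(x),i}(t)
    return m + h[:, None] * (x - m)                        # m_i(t+1)

# e.g. shrinking schedules: alpha = lambda t: 0.5 * 0.99 ** t, sigma = lambda t: 2.0 * 0.99 ** t
```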
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
• Results

(Results table comparing the TMM and SOM models: columns are Model, Iterations, distWithin, distBetween.)

The evaluation measure is
$$R_{dist}=\frac{dist_{Between}}{dist_{Within}}$$

$$dist_{Between}=\frac{\sum_{i=1}^{|D|}\sum_{j=i+1}^{|D|}f_{Between}(i,j)}{\sum_{i=1}^{|D|}\sum_{j=i+1}^{|D|}C_{Between}(i,j)},\qquad
dist_{Within}=\frac{\sum_{i=1}^{|D|}\sum_{j=i+1}^{|D|}f_{Within}(i,j)}{\sum_{i=1}^{|D|}\sum_{j=i+1}^{|D|}C_{Within}(i,j)}$$

where

$$f_{Between}(i,j)=\begin{cases}dist_{Map}(i,j)&T_{r(i)}\neq T_{r(j)}\\0&\text{otherwise}\end{cases}\qquad
C_{Between}(i,j)=\begin{cases}1&T_{r(i)}\neq T_{r(j)}\\0&\text{otherwise}\end{cases}$$

$$f_{Within}(i,j)=\begin{cases}dist_{Map}(i,j)&T_{r(i)}=T_{r(j)}\\0&\text{otherwise}\end{cases}\qquad
C_{Within}(i,j)=\begin{cases}1&T_{r(i)}=T_{r(j)}\\0&\text{otherwise}\end{cases}$$

$$dist_{Map}(i,j)=\sqrt{(x_i-x_j)^{2}+(y_i-y_j)^{2}}$$
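The within/between evaluation reduces to averaging pairwise map distances over same-topic and different-topic document pairs; a short sketch (my own arrangement of the formulas above, with hypothetical input arrays):

```python
import numpy as np

def map_separation(coords, labels):
    """R_dist = dist_Between / dist_Within.
    coords: (D, 2) document positions on the map; labels: (D,) assigned topic of each document."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))                      # dist_Map(i, j)
    same = labels[:, None] == labels[None, :]
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)     # count each unordered pair once (j > i)
    dist_within = dist[same & upper].mean()
    dist_between = dist[~same & upper].mean()
    return dist_between / dist_within
```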
IR ndash Berlin Chen 18
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
ndash The average similarity SIM between vectors in a cluster cj is defined as
ndash The sum of members in a cluster cj
ndash Express in terms of
( ) ( ) ( ) ( )sum sumsum sumisin
neisinisin
neisin
sdotminus
=minus
=j jj j cx
xycyjjcx
xycyjj
j yxcc
yxsimcc
cSIMr
rrrr
rrr
rrrr
11
11
( ) sumisin
=jcx
j xcsr
rr
( )jcSIM ( )jcsr
( ) ( ) ( )( ) ( )( ) ( )
( ) ( ) ( )( )1
1
1
minus
minussdot=there4
+minus=
sdot+minus=
sdot=sdot=sdot
sum
sum sumsum
isin
isin isinisin
jj
jjj
j
jjjj
cxjjj
cx cyj
cxjj
ccccscs
cSIM
ccSIMcc
xxcSIMcc
yxcsxcscs
j
j jj
rr
rr
rrrrrr
r
r rr
=1
length-normalized vector
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 19
Measures of Cluster Similarity (cont)
bull Group-average agglomerative clustering (cont)
-As merging two clusters ci and cj the cluster sum vectors and are known in advance
ndash The average similarity for their union will be
( )icsr ( )jcsr
( )( ) ( )( ) ( ) ( )( ) ( )
( )( )1
minus++
+minus+sdot+
=cup
jiji
jijiji
ji
cccccccscscscs
ccSIMrrrr
( ) ( ) ( ) jiNewjiNew ccccscscs +=+= rrr
ic jc
ji cc +
Ci Cj ( )jcsr( )ics
r
IR ndash Berlin Chen 20
Example Word Clustering
bull Words (objects) are described and clustered using a set of features and valuesndash Eg the left and right neighbors of tokens of words
ldquoberdquo has least similarity with the other 21 words
higher nodesdecreasingof similarity
IR ndash Berlin Chen 21
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
• Self-Organizing Map (SOM) – a recursive regression process
  – Input layer: input vector \(\vec{x}=[x_1,x_2,\ldots,x_n]^T\)
  – Mapping layer: each node \(i\) has a weight vector \(\vec{m}_i=[m_{i1},m_{i2},\ldots,m_{in}]^T\)
\[
\vec{m}_i(t+1)=\vec{m}_i(t)+h_{c(\vec{x}),i}(t)\,\big[\vec{x}(t)-\vec{m}_i(t)\big]
\]
where the winning node is
\[
c(\vec{x})=\arg\min_{i'}\;\|\vec{x}-\vec{m}_{i'}\|,
\qquad \|\vec{x}-\vec{m}_{i'}\|=\sqrt{\sum_{n}(x_n-m_{i'n})^2}
\]
and the neighborhood function is
\[
h_{c(\vec{x}),i}(t)=\alpha(t)\,\exp\Big(-\frac{\|\vec{r}_i-\vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)}\Big)
\]

IR – Berlin Chen 56
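A minimal sketch of the SOM recursion above; the names som_step, grid, alpha and sigma are illustrative, and the schedules for shrinking α(t) and σ(t) are left to the caller.

```python
import numpy as np

def som_step(x_t, weights, grid, alpha, sigma):
    """One SOM update: find the winning node and pull its neighborhood toward x(t).

    x_t:     (d,)   current input vector x(t)
    weights: (M, d) weight vectors m_i(t) of the M map nodes
    grid:    (M, 2) fixed map coordinates r_i of the nodes
    alpha:   learning rate alpha(t); sigma: neighborhood width sigma(t)
    """
    # Winner: c(x) = argmin_i || x - m_i ||.
    c = np.argmin(((weights - x_t) ** 2).sum(axis=1))
    # Neighborhood kernel h_{c(x),i}(t) = alpha(t) exp(-||r_i - r_c||^2 / (2 sigma^2)).
    d2 = ((grid - grid[c]) ** 2).sum(axis=1)
    h = alpha * np.exp(-d2 / (2.0 * sigma ** 2))
    # Recursive regression: m_i(t+1) = m_i(t) + h_{c(x),i}(t) [x(t) - m_i(t)].
    return weights + h[:, None] * (x_t - weights)

# In training, alpha(t) and sigma(t) are typically decreased over the iterations,
# and som_step is called once per randomly drawn training vector.
```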
Hierarchical Document Organization (cont)
• Results

[Results table: Model | Iterations | distWithin | distBetween, comparing the TMM and SOM models]
\[
R_{\mathrm{Dist}}=\frac{\mathrm{dist}_{Between}}{\mathrm{dist}_{Within}}
\]
\[
\mathrm{dist}_{Between}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)},
\qquad
\mathrm{dist}_{Within}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}
\]
where
\[
f_{Between}(i,j)=\begin{cases} \mathrm{dist}_{Map}(i,j) & T_{r_i}\neq T_{r_j}\\ 0 & \text{otherwise}\end{cases},
\qquad
C_{Between}(i,j)=\begin{cases} 1 & T_{r_i}\neq T_{r_j}\\ 0 & \text{otherwise}\end{cases}
\]
\[
f_{Within}(i,j)=\begin{cases} \mathrm{dist}_{Map}(i,j) & T_{r_i}= T_{r_j}\\ 0 & \text{otherwise}\end{cases},
\qquad
C_{Within}(i,j)=\begin{cases} 1 & T_{r_i}= T_{r_j}\\ 0 & \text{otherwise}\end{cases}
\]
\[
\mathrm{dist}_{Map}(i,j)=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2}
\]
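To make the evaluation measure concrete, the sketch below (map_distance_ratio is an assumed name) computes R_Dist from document map coordinates and their topic assignments.

```python
import numpy as np

def map_distance_ratio(xy, topic):
    """R = dist_Between / dist_Within for documents placed on a 2-D map.

    xy:    (D, 2) map coordinates (x_i, y_i) of the D documents
    topic: (D,)   index T_{r_i} of the topic/cluster each document belongs to
    """
    diff = xy[:, None, :] - xy[None, :, :]
    dist_map = np.sqrt((diff ** 2).sum(axis=-1))       # dist_Map(i, j)
    same = topic[:, None] == topic[None, :]            # T_{r_i} == T_{r_j}
    i_idx, j_idx = np.triu_indices(len(xy), k=1)       # each pair (i, j) with j > i, once
    pair_dist = dist_map[i_idx, j_idx]
    pair_same = same[i_idx, j_idx]
    dist_within = pair_dist[pair_same].mean()          # average within-cluster distance
    dist_between = pair_dist[~pair_same].mean()        # average between-cluster distance
    return dist_between / dist_within

# Illustrative usage:
#   R = map_distance_ratio(np.random.rand(50, 2), np.random.randint(0, 4, size=50))
```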
Divisive Clustering
bull A top-down approach
bull Start with all objects in a single cluster
bull At each iteration select the least coherent cluster and split it
bull Continue the iterations until a predefined criterion (eg the cluster number) is achieved
bull The history of clustering forms a binary tree or hierarchy
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 22
Divisive Clustering (cont)
bull To select the least coherent cluster the measures used in bottom-up clustering (eg HAC) can be used again herendash Single link measurendash Complete-link measurendash Group-average measure
bull How to split a clusterndash Also is a clustering task (finding two sub-clusters)ndash Any clustering algorithm can be used for the splitting operation
egbull Bottom-up (agglomerative) algorithmsbull Non-hierarchical clustering algorithms (eg K-means)
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 23
Divisive Clustering (cont)
bull Algorithm
split the least coherent cluster
Generate two new clusters and remove the original one
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta,\hat{\Theta})$ as the expectation of the log complete-data likelihood function $L_{CM}$ with respect to the hidden/latent variable $\mathbf{C}$, conditioned on the known data $\mathbf{X}$ and the current parameters $\Theta$
  – Maximize the log likelihood function by maximizing the expectation of the log complete-data likelihood function
    • We have shown this property when deriving the HMM-based retrieval model

$\Phi(\Theta,\hat{\Theta}) = E_{\mathbf{C}}\big[\log P(\mathbf{X},\mathbf{C}|\hat{\Theta})\,\big|\,\mathbf{X},\Theta\big] = \sum_{\mathbf{C}} P(\mathbf{C}|\mathbf{X},\Theta)\,\log P(\mathbf{X},\mathbf{C}|\hat{\Theta})$
(here $\Theta$, the current estimate, is known; $\hat{\Theta}$ is the unknown parameter set to be optimized, as is $\log P(\mathbf{X}|\hat{\Theta})$)

$\Phi(\Theta,\hat{\Theta})-\Phi(\Theta,\Theta)\;\le\;\log P(\mathbf{X}|\hat{\Theta})-\log P(\mathbf{X}|\Theta)$, so increasing $\Phi$ guarantees a non-decreasing likelihood
IR ndash Berlin Chen 41
The EM Algorithm (cont)
• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta,\hat{\Theta})$:

$\Phi(\Theta,\hat{\Theta}) = \sum_{\mathbf{C}} P(\mathbf{C}|\mathbf{X},\Theta)\,\log P(\mathbf{X},\mathbf{C}|\hat{\Theta})$
$= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\Big[\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\Big]\,\log\prod_{i=1}^{n} P(\vec{x}_i,c_{k_i}|\hat{\Theta})$   (i.i.d. samples: $P(\mathbf{C}|\mathbf{X},\Theta)=\prod_{j=1}^{n}P(c_{k_j}|\vec{x}_j,\Theta)$)
$= \sum_{k_1=1}^{K}\cdots\sum_{k_n=1}^{K}\Big[\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\Big]\sum_{i=1}^{n}\log P(\vec{x}_i,c_{k_i}|\hat{\Theta})$
$= \sum_{i=1}^{n}\sum_{k_1=1}^{K}\cdots\sum_{k_n=1}^{K}\Big[\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\Big]\Big[\sum_{k=1}^{K}\delta_{k,k_i}\,\log P(\vec{x}_i,c_{k}|\hat{\Theta})\Big]$
$= \sum_{k=1}^{K}\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)\,\log P(\vec{x}_i,c_k|\hat{\Theta})$
$= \sum_{k=1}^{K}\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)\,\log\big[P(\vec{x}_i|c_k,\hat{\Theta})\,P(c_k|\hat{\Theta})\big]$
$= \sum_{k=1}^{K}\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)\,\log P(c_k|\hat{\Theta}) + \sum_{k=1}^{K}\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)\,\log P(\vec{x}_i|c_k,\hat{\Theta})$

where $\delta_{k,k_i}=\begin{cases}1 & \text{if } k=k_i\\ 0 & \text{otherwise}\end{cases}$

See next slide for the marginalization step.
IR ndash Berlin Chen 42
The EM Algorithm (cont)
– Note that (for a given $i$, the sample $\vec{x}_i$ can only be aligned to one cluster $c_{k_i}$ in each assignment $\mathbf{C}$):

$\sum_{k_1=1}^{K}\cdots\sum_{k_n=1}^{K}\Big[\prod_{j=1}^{n}P(c_{k_j}|\vec{x}_j,\Theta)\Big]\sum_{k=1}^{K}\delta_{k,k_i}\,\log P(\vec{x}_i,c_{k}|\hat{\Theta})$
$= \sum_{k=1}^{K}\log P(\vec{x}_i,c_k|\hat{\Theta})\;P(c_k|\vec{x}_i,\Theta)\prod_{j=1,\,j\ne i}^{n}\Big[\sum_{k_j=1}^{K}P(c_{k_j}|\vec{x}_j,\Theta)\Big]$
$= \sum_{k=1}^{K}P(c_k|\vec{x}_i,\Theta)\,\log P(\vec{x}_i,c_k|\hat{\Theta})$

because every factor $\sum_{k_j=1}^{K}P(c_{k_j}|\vec{x}_j,\Theta)=1$.

(The same product-of-sums identity as in the earlier note is used: $\prod_{t=1}^{T}\sum_{k=1}^{M}a_{tk}=\sum_{k_1=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T}a_{t\,k_t}$.)
IR ndash Berlin Chen 43
The EM Algorithm (cont)
• E-step (Expectation)
  – The auxiliary function can also be divided into two parts:

$\Phi(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + \Phi_b(\Theta,\hat{\Theta})$, where

$\Phi_a(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k|\vec{x}_i,\Theta)\,\log P(c_k|\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K}\dfrac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\,\log P(c_k|\hat{\Theta})$
(the auxiliary function for the mixture weights)

$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k|\vec{x}_i,\Theta)\,\log P(\vec{x}_i|c_k,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K}\dfrac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\,\log P(\vec{x}_i|c_k,\hat{\Theta})$
(the auxiliary function for the cluster distributions)

using $P(c_k|\vec{x}_i,\Theta)=\dfrac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{P(\vec{x}_i|\Theta)}=\dfrac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}$
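A minimal sketch of the E-step quantity used in both terms above, the posterior (responsibility) $P(c_k|\vec{x}_i,\Theta)$, assuming Gaussian components and that SciPy is available; the toy parameters below are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """E-step sketch: responsibilities P(c_k | x_i, Theta) =
    P(x_i|c_k) P(c_k) / sum_l P(x_i|c_l) P(c_l)."""
    n, K = len(X), len(weights)
    joint = np.zeros((n, K))
    for k in range(K):
        joint[:, k] = weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
    return joint / joint.sum(axis=1, keepdims=True)

# Toy usage (parameters and data are made up for illustration)
X = np.array([[0.1, 0.0], [2.9, 3.2], [1.5, 1.4]])
resp = e_step(X, np.array([0.5, 0.5]),
              [np.zeros(2), np.full(2, 3.0)], [np.eye(2), np.eye(2)])
print(resp)  # each row sums to 1
```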
IR ndash Berlin Chen 44
The EM Algorithm (cont)
• M-step (Maximization)
  – Remember that a function F can be maximized under a constraint by applying a Lagrange multiplier

By applying the Lagrange multiplier $l$:
Suppose that $F=\sum_{j=1}^{N} w_j\,\log y_j$, with the constraint $\sum_{j=1}^{N} y_j = 1$
$\Rightarrow\ \tilde{F}=\sum_{j=1}^{N} w_j\,\log y_j + l\Big(\sum_{j=1}^{N} y_j - 1\Big)$
$\dfrac{\partial\tilde{F}}{\partial y_j}=0 \;\Rightarrow\; \dfrac{w_j}{y_j}+l=0 \;\Rightarrow\; y_j=-\dfrac{w_j}{l},\ \forall j$
$\sum_{j=1}^{N} y_j = 1 \;\Rightarrow\; -\dfrac{\sum_{j=1}^{N}w_j}{l}=1 \;\Rightarrow\; l=-\sum_{j=1}^{N}w_j$
$\therefore\ \hat{y}_j=\dfrac{w_j}{\sum_{j'=1}^{N}w_{j'}}$

(Note: $\dfrac{\partial \log y_j}{\partial y_j}=\dfrac{1}{y_j}$)
IR ndash Berlin Chen 45
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize $\Phi_a(\Theta,\hat{\Theta})$, the auxiliary function for the mixture weights (or priors for Gaussians), under the constraint $\sum_{k=1}^{K}P(c_k|\hat{\Theta})=1$:

$\Phi_a^{l}(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + l\Big(\sum_{k=1}^{K}P(c_k|\hat{\Theta})-1\Big) = \sum_{k=1}^{K}\sum_{i=1}^{n}\dfrac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l'=1}^{K}P(\vec{x}_i|c_{l'},\Theta)\,P(c_{l'}|\Theta)}\,\log P(c_k|\hat{\Theta}) + l\Big(\sum_{k=1}^{K}P(c_k|\hat{\Theta})-1\Big)$

Applying the previous result with $w_k=\sum_{i=1}^{n}P(c_k|\vec{x}_i,\Theta)$ (the expected number of times that the samples fall in class $c_k$) and $y_k=P(c_k|\hat{\Theta})$:

$\hat{\pi}_k=\hat{P}(c_k)=P(c_k|\hat{\Theta})=\dfrac{w_k}{\sum_{k'=1}^{K}w_{k'}} = \dfrac{\sum_{i=1}^{n}P(c_k|\vec{x}_i,\Theta)}{\sum_{k'=1}^{K}\sum_{i=1}^{n}P(c_{k'}|\vec{x}_i,\Theta)} = \dfrac{1}{n}\sum_{i=1}^{n}\dfrac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}$
IR ndash Berlin Chen 46
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances:

$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K}\dfrac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\,\log P(\vec{x}_i|c_k,\hat{\Theta})$

Let $w_{ik}=\dfrac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}$ and, for a Gaussian component,

$P(\vec{x}_i|c_k,\hat{\Theta})=\dfrac{1}{(2\pi)^{m/2}|\hat{\Sigma}_k|^{1/2}}\exp\Big(-\tfrac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\Big)$

$\log P(\vec{x}_i|c_k,\hat{\Theta})=-\tfrac{m}{2}\log(2\pi)-\tfrac{1}{2}\log|\hat{\Sigma}_k|-\tfrac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)$

so that

$\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\Big[-\tfrac{1}{2}\log|\hat{\Sigma}_k|-\tfrac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)+D\Big]$, where $D$ is a constant.
IR ndash Berlin Chen 47
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ (as on the previous slide) with respect to $\hat{\vec{\mu}}_k$:

$\dfrac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\vec{\mu}}_k}=-\tfrac{1}{2}\sum_{i=1}^{n}w_{ik}\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1)=0$
$\Rightarrow\ \sum_{i=1}^{n}w_{ik}\,\vec{x}_i=\Big(\sum_{i=1}^{n}w_{ik}\Big)\hat{\vec{\mu}}_k$
$\Rightarrow\ \hat{\vec{\mu}}_k=\dfrac{\sum_{i=1}^{n}w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n}w_{ik}} = \dfrac{\sum_{i=1}^{n}\dfrac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\,\vec{x}_i}{\sum_{i=1}^{n}\dfrac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}} = \dfrac{\sum_{i=1}^{n}P(c_k|\vec{x}_i,\Theta)\,\vec{x}_i}{\sum_{i=1}^{n}P(c_k|\vec{x}_i,\Theta)}$

(using $\dfrac{d}{d\vec{x}}\big(\vec{x}^{T}C\vec{x}\big)=(C+C^{T})\vec{x}$; $\hat{\Sigma}_k^{-1}$ is symmetric here)

$\sum_{i=1}^{n}w_{ik}$ is the expected number of times that the samples fall in class $c_k$.
IR ndash Berlin Chen 48
The EM Algorithm (cont)
• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$:

$\dfrac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\Sigma}_k}=-\tfrac{1}{2}\sum_{i=1}^{n}w_{ik}\Big[\hat{\Sigma}_k^{-1}-\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\Big]=0$
$\Rightarrow\ \sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k^{-1}=\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}$
$\Rightarrow\ \Big(\sum_{i=1}^{n}w_{ik}\Big)\hat{\Sigma}_k=\sum_{i=1}^{n}w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}$   (multiplying by $\hat{\Sigma}_k$ on both sides)
$\Rightarrow\ \hat{\Sigma}_k=\dfrac{\sum_{i=1}^{n}w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n}w_{ik}} = \dfrac{\sum_{i=1}^{n}P(c_k|\vec{x}_i,\Theta)\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n}P(c_k|\vec{x}_i,\Theta)}$

(using $\dfrac{d}{dX}\det(X)=\det(X)\,(X^{-1})^{T}$ and $\dfrac{d}{dX}\big(\vec{a}^{T}X^{-1}\vec{b}\big)=-X^{-1}\vec{a}\,\vec{b}^{T}X^{-1}$; $\hat{\Sigma}_k$ is symmetric here)
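A minimal NumPy sketch of the three closed-form re-estimates just derived (mixture weights, means, and covariances), assuming the responsibilities resp[i, k] = P(c_k|x_i, Θ) come from an E-step such as the earlier sketch. Combining the two steps in a loop and monitoring the likelihood gives the iterative procedure summarized on the next slide.

```python
import numpy as np

def m_step(X, resp):
    """M-step sketch: closed-form updates of the mixture weights, means and
    covariances from the E-step responsibilities resp[i, k] = P(c_k|x_i, Theta)."""
    n, d = X.shape
    Nk = resp.sum(axis=0)                    # expected counts per cluster
    weights = Nk / n                         # new priors P(c_k)
    means = (resp.T @ X) / Nk[:, None]       # new means mu_k
    covs = []
    for k in range(resp.shape[1]):
        diff = X - means[k]
        covs.append((resp[:, k, None] * diff).T @ diff / Nk[k])  # new Sigma_k
    return weights, means, covs
```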
IR ndash Berlin Chen 49
The EM Algorithm (cont)
• The initial cluster distributions can be estimated using the K-means algorithm
• The procedure terminates when the likelihood function P(X|Θ) has converged or the maximum number of iterations is reached
IR ndash Berlin Chen 50
Hierarchical Document Organization
• Explore the probabilistic latent topical information
  – TMM/PLSA approach
• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
• Related documents fall in the same cluster, and the relationships among clusters are reflected by their distance on the map
• When a cluster has many documents, we can further analyze it into another map on the next layer
[Figure: two-dimensional tree structure for the organized topics]

$P(w_j|D_i)=\sum_{k=1}^{K}P(T_k|D_i)\Big[\sum_{l=1}^{K}P(T_l|Y_k)\,P(w_j|T_l)\Big]$

$P(T_l|Y_k)=\dfrac{E(T_k,T_l)}{\sum_{s=1}^{K}E(T_k,T_s)}$, with $E(T_k,T_l)=\dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\dfrac{dist(T_k,T_l)^{2}}{2\sigma^{2}}\Big]$ and $dist(T_i,T_j)=\sqrt{(x_i-x_j)^{2}+(y_i-y_j)^{2}}$
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
  – EM training can be performed

$L_T=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\,\log P(w_j|D_i)=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\,\log\Big\{\sum_{k=1}^{K}P(T_k|D_i)\Big[\sum_{l=1}^{K}P(T_l|Y_k)\,P(w_j|T_l)\Big]\Big\}$

$\hat{P}(w_j|T_k)=\dfrac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k|w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i=1}^{N}c(w_{j'},D_i)\,P(T_k|w_{j'},D_i)}$,  $\hat{P}(T_k|D_i)=\dfrac{\sum_{j=1}^{J}c(w_j,D_i)\,P(T_k|w_j,D_i)}{c(D_i)}$

where

$P(T_k|w_j,D_i)=\dfrac{\Big[\sum_{l=1}^{K}P(w_j|T_l)\,P(T_l|Y_k)\Big]P(T_k|D_i)}{\sum_{k'=1}^{K}\Big\{\Big[\sum_{l=1}^{K}P(w_j|T_l)\,P(T_l|Y_{k'})\Big]P(T_{k'}|D_i)\Big\}}$
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
• Criterion for topic word selection

$S(w_j,T_k)=\dfrac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k|D_i)}{\sum_{i=1}^{N}c(w_j,D_i)\,\big[1-P(T_k|D_i)\big]}$
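A small sketch of the topic-word selection score above; the count matrix, topic posteriors, and function name are my own illustrative choices.

```python
import numpy as np

def topic_word_score(counts, topic_post):
    """Topic-word selection score sketch:
    S(w_j, T_k) = sum_i c(w_j, D_i) P(T_k|D_i) / sum_i c(w_j, D_i) (1 - P(T_k|D_i)).
    counts: (N docs, J words) term counts c(w_j, D_i)
    topic_post: (N docs, K topics) posteriors P(T_k|D_i)."""
    num = counts.T @ topic_post            # (J, K): soft counts inside the topic
    den = counts.T @ (1.0 - topic_post)    # (J, K): soft counts outside the topic
    return num / den

# Toy usage: 3 documents, 4 words, 2 topics (numbers made up)
counts = np.array([[2, 0, 1, 0],
                   [1, 3, 0, 0],
                   [0, 0, 2, 4]])
topic_post = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.1, 0.9]])
print(topic_word_score(counts, topic_post))
```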
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
• Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
• Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
• Self-Organizing Map (SOM)
  – A recursive regression process

[Figure: a mapping layer of nodes, each with a weight vector $\vec{m}_i=[m_{i1},m_{i2},\ldots,m_{in}]^{T}$, fully connected to an input layer that receives the input vector $\vec{x}=[x_1,x_2,\ldots,x_n]^{T}$]

$\vec{m}_i(t+1)=\vec{m}_i(t)+h_{c(\vec{x}),i}(t)\,\big[\vec{x}(t)-\vec{m}_i(t)\big]$

where
$c(\vec{x})=\arg\min_{i'}\,\|\vec{x}-\vec{m}_{i'}\|$, with $\|\vec{x}-\vec{m}_{i'}\|=\sqrt{\sum_{n}(x_n-m_{i'n})^{2}}$
$h_{c(\vec{x}),i}(t)=\alpha(t)\,\exp\Big(-\dfrac{\|\vec{r}_i-\vec{r}_{c(\vec{x})}\|^{2}}{2\sigma^{2}(t)}\Big)$
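A minimal sketch of the SOM recursive regression above; the grid size, the linearly decaying learning rate α(t), and the neighborhood-width schedule σ(t) are assumptions made for illustration, not prescriptions from the slides.

```python
import numpy as np

def som_train(X, grid_shape=(5, 5), n_iter=1000, alpha0=0.5, sigma0=2.0):
    """SOM sketch: m_i(t+1) = m_i(t) + h_{c(x),i}(t) * (x(t) - m_i(t))."""
    rng = np.random.default_rng(0)
    n_nodes = grid_shape[0] * grid_shape[1]
    weights = rng.random((n_nodes, X.shape[1]))               # weight vectors m_i
    # Fixed 2-D grid coordinates r_i of the map nodes
    coords = np.array([(r, c) for r in range(grid_shape[0])
                               for c in range(grid_shape[1])], dtype=float)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                           # pick an input x(t)
        winner = np.argmin(((weights - x) ** 2).sum(axis=1))  # c(x): best-matching node
        alpha = alpha0 * (1 - t / n_iter)                     # decaying learning rate
        sigma = sigma0 * (1 - t / n_iter) + 1e-3              # decaying neighborhood width
        grid_d2 = ((coords - coords[winner]) ** 2).sum(axis=1)
        h = alpha * np.exp(-grid_d2 / (2 * sigma ** 2))       # neighborhood function h_{c(x),i}
        weights += h[:, None] * (x - weights)                 # update all m_i
    return weights, coords

# Usage: organize random 10-D vectors on a 5x5 map
weights, coords = som_train(np.random.rand(300, 10))
```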
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
• Results

[Table: within-map distance (distWithin) and between-map distance (distBetween) versus the number of training iterations, for the TMM and SOM models]
$R_{dist}=\dfrac{Dist_{Between}}{Dist_{Within}}$

$Dist_{Between}=\dfrac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Between}(i,j)}$,
$f_{Between}(i,j)=\begin{cases}dist_{Map}(i,j) & T_{r(i)}\ne T_{r(j)}\\ 0 & \text{otherwise}\end{cases}$,
$C_{Between}(i,j)=\begin{cases}1 & T_{r(i)}\ne T_{r(j)}\\ 0 & \text{otherwise}\end{cases}$

$Dist_{Within}=\dfrac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Within}(i,j)}$,
$f_{Within}(i,j)=\begin{cases}dist_{Map}(i,j) & T_{r(i)}=T_{r(j)}\\ 0 & \text{otherwise}\end{cases}$,
$C_{Within}(i,j)=\begin{cases}1 & T_{r(i)}=T_{r(j)}\\ 0 & \text{otherwise}\end{cases}$

where $dist_{Map}(i,j)=\sqrt{(x_i-x_j)^{2}+(y_i-y_j)^{2}}$ is the distance between documents $i$ and $j$ on the map, and $T_{r(i)}$ is the topic (cluster) that document $i$ belongs to.
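A small sketch of the evaluation measure $R_{dist}=Dist_{Between}/Dist_{Within}$ for documents placed on a 2-D map; the coordinates and topic labels below are made up for illustration.

```python
import numpy as np

def distance_ratio(coords, topics):
    """R = Dist_Between / Dist_Within: average map distance over pairs of
    documents in different topics, divided by the average over pairs in the
    same topic."""
    D = len(coords)
    between, n_between, within, n_within = 0.0, 0, 0.0, 0
    for i in range(D):
        for j in range(i + 1, D):
            d = np.sqrt(((coords[i] - coords[j]) ** 2).sum())  # dist_Map(i, j)
            if topics[i] != topics[j]:
                between += d
                n_between += 1
            else:
                within += d
                n_within += 1
    return (between / n_between) / (within / n_within)

# Toy usage: 4 documents on a map, two topics (all values made up)
coords = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 3.0], [3.5, 3.0]])
topics = np.array([0, 0, 1, 1])
print(distance_ratio(coords, topics))  # larger R means better topic separation
```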
IR ndash Berlin Chen 24
Non-Hierarchical Clustering
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 25
Non-hierarchical Clustering
bull Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partitionndash In a multi-pass manner (recursioniterations)
bull Problems associated with non-hierarchical clusteringndash When to stopndash What is the right number of clusters
bull Algorithms introduced herendash The K-means algorithmndash The EM algorithm
group average similarity likelihood mutual information
k-1 rarr k rarr k+1
Hierarchical clustering also has to face this problem
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 26
The K-means Algorithm
bull Also called Linde-Buzo-Gray (LBG) in signal processingndash A hard clustering algorithmndash Define clusters by the center of mass of their members
bull The K-means algorithm also can be regarded as ndash A kind of vector quantization
bull Map from a continuous space (high resolution) to a discrete space (low resolution)
ndash Eg color quantizationbull 24 bitspixel (16 million colors) rarr 8 bitspixel (256 colors)bull A compression rate of 3
vectorcode wordcode vectorreferenceor centriodcluster
1
index 1
j
kjj
jnt
t
m
mF xX== =⎯⎯ rarr⎯= Dim(xt)=24 rarr k=28
IR ndash Berlin Chen 27
The K-means Algorithm (cont)
ndash and are unknownndash depends on and this optimization problem
can not be solved analytically
( )⎪⎩
⎪⎨⎧ minus=minus
=minus= sum sum= =
=otherwise 0
minif 1 where
errortion Reconstruc Total2
1 11
jt
jit
ti
N
t
k
ii
tti
kii bbE
mxmxmxXm
tib
imtib
im
label
IR ndash Berlin Chen 28
The K-means Algorithm (cont)
bull Initializationndash A set of initial cluster centers is needed
bull Recursionndash Assign each object to the cluster whose center is closest
ndash Then re-compute the center of each cluster as the centroid or mean (average) of its members
bull Using the medoid as the cluster center (a medoid is one of the objects in the cluster)
kii 1=m
sum
sum
=
= sdot=
Nt
ti
tNt
ti
i bb
1
1 xm
tx
⎪⎩
⎪⎨⎧ minus=minus
=otherwise 0
minif 1 jt
jit
tib
mxmx
These two steps are repeated until stabilizesim
IR ndash Berlin Chen 29
The K-means Algorithm (cont)
bull Algorithm
IR ndash Berlin Chen 30
The K-means Algorithm (cont)
bull Example 1
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 31
The K-means Algorithm (cont)
bull Example 2
governmentfinancesports
research
name
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
$$
\begin{aligned}
P(X|\Theta) &= \prod_{i=1}^{n} P(\vec{x}_i|\Theta)
= \prod_{i=1}^{n}\sum_{k=1}^{K} P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)\\
&= \Big[\textstyle\sum_{k=1}^{K} P(\vec{x}_1|c_k,\Theta)P(c_k|\Theta)\Big]\times
\Big[\textstyle\sum_{k=1}^{K} P(\vec{x}_2|c_k,\Theta)P(c_k|\Theta)\Big]\times\cdots\times
\Big[\textstyle\sum_{k=1}^{K} P(\vec{x}_n|c_k,\Theta)P(c_k|\Theta)\Big]\\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\;\prod_{i=1}^{n} P(\vec{x}_i|c_{k_i},\Theta)\,P(c_{k_i}|\Theta)
= \sum_{C} P(X,C|\Theta)
\end{aligned}
$$
where $C=\{c_{k_1},c_{k_2},\ldots,c_{k_n}\}$ is one assignment of the samples $\{\vec{x}_1,\ldots,\vec{x}_n\}$ to clusters, and
$$P(X,C|\Theta)=\prod_{i=1}^{n} P(\vec{x}_i|c_{k_i},\Theta)\,P(c_{k_i}|\Theta)$$
is the complete-data likelihood function.

How many kinds of $C$? $K^{n}$ kinds.

Note (a product of sums expands into a sum over all index tuples):
$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk}
= (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+\cdots+a_{2M})\cdots(a_{T1}+\cdots+a_{TM})
= \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T} a_{t k_t}$$
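A tiny numeric check of the product-of-sums identity above (NumPy assumed; T and M are kept small so the $K^n$-style enumeration stays cheap):

```python
import itertools
import numpy as np

T, M = 3, 4
a = np.random.default_rng(0).random((T, M))

prod_of_sums = np.prod(a.sum(axis=1))
sum_over_tuples = sum(np.prod([a[t, k] for t, k in enumerate(ks)])
                      for ks in itertools.product(range(M), repeat=T))
print(np.isclose(prod_of_sums, sum_over_tuples))  # True
```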
IR ndash Berlin Chen 40
The EM Algorithm (cont)
• E-step (Expectation)
– Define the auxiliary function Φ(Θ, Θ̂) as the expectation of the log complete-data likelihood function L_CM with respect to the hidden/latent variable C, conditioned on the known data X (and the current parameters Θ)
– Maximize the log-likelihood function by maximizing the expectation of the log complete-data likelihood function
• We have shown this property when deriving the HMM-based retrieval model
$$\Phi(\Theta,\hat\Theta) = E_{C}\big[L_{CM}\big]
= E_{C}\big[\log P(X,C|\hat\Theta)\,\big|\,X,\Theta\big]
= \sum_{C} P(C|X,\Theta)\,\log P(X,C|\hat\Theta)
= \sum_{C}\frac{P(X,C|\Theta)}{P(X|\Theta)}\,\log P(X,C|\hat\Theta)$$
Here $\Theta$ (and hence $\log P(X|\Theta)$) is known and $\hat\Theta$ is unknown; maximizing the auxiliary function over $\hat\Theta$ cannot decrease the log-likelihood:
$$\Phi(\Theta,\hat\Theta)-\Phi(\Theta,\Theta) \;\le\; \log P(X|\hat\Theta)-\log P(X|\Theta)$$
IR ndash Berlin Chen 41
The EM Algorithm (cont)
• E-step (Expectation)
– The auxiliary function Φ(Θ, Θ̂):
$$
\begin{aligned}
\Phi(\Theta,\hat\Theta)
&= \sum_{C} P(C|X,\Theta)\,\log P(X,C|\hat\Theta)\\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\Big[\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\Big]\,
\log\prod_{i=1}^{n} P(\vec{x}_i,c_{k_i}|\hat\Theta)\\
&= \sum_{k_1=1}^{K}\cdots\sum_{k_n=1}^{K}\Big[\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\Big]
\sum_{i=1}^{n}\log P(\vec{x}_i,c_{k_i}|\hat\Theta)\\
&= \sum_{k=1}^{K}\sum_{i=1}^{n}
\Big\{\sum_{k_1=1}^{K}\cdots\sum_{k_n=1}^{K}\Big[\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\Big]\,\delta(k,k_i)\Big\}
\log P(\vec{x}_i,c_k|\hat\Theta)\\
&= \sum_{k=1}^{K}\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)\,\log P(\vec{x}_i,c_k|\hat\Theta)
\qquad\text{(see the next slide)}\\
&= \sum_{k=1}^{K}\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)\,\log P(c_k|\hat\Theta)
\;+\; \sum_{k=1}^{K}\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)\,\log P(\vec{x}_i|c_k,\hat\Theta)
\end{aligned}
$$
where
$$\delta(k,k_i)=\begin{cases}1 & \text{if } k = k_i\\ 0 & \text{otherwise}\end{cases}$$
IR ndash Berlin Chen 42
The EM Algorithm (cont)
– Note that, because $\sum_{k_j=1}^{K} P(c_{k_j}|\vec{x}_j,\Theta)=1$ for every $j$, summing out all assignment indices except the one for $\vec{x}_i$ (which can only be aligned to $c_{k_i}$) leaves just the posterior of $\vec{x}_i$:
$$
\begin{aligned}
\sum_{k_1=1}^{K}\cdots\sum_{k_n=1}^{K}\Big[\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\Big]\,\delta(k,k_i)
&= P(c_k|\vec{x}_i,\Theta)\prod_{\substack{j=1\\ j\ne i}}^{n}\Big[\sum_{k_j=1}^{K} P(c_{k_j}|\vec{x}_j,\Theta)\Big]\\
&= P(c_k|\vec{x}_i,\Theta)\cdot 1\cdots 1
= P(c_k|\vec{x}_i,\Theta)
\end{aligned}
$$
(again using $\prod_{t=1}^{T}\sum_{k=1}^{M}a_{tk}=\sum_{k_1=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T}a_{tk_t}$)
IR ndash Berlin Chen 43
The EM Algorithm (cont)
• E-step (Expectation)
– The auxiliary function can also be divided into two parts:
$$\Phi(\Theta,\hat\Theta)=\Phi_a(\Theta,\hat\Theta)+\Phi_b(\Theta,\hat\Theta)$$
where
$$\Phi_a(\Theta,\hat\Theta)
= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k|\vec{x}_i,\Theta)\,\log P(c_k|\hat\Theta)
= \sum_{i=1}^{n}\sum_{k=1}^{K}
\frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\,\log P(c_k|\hat\Theta)$$
is the auxiliary function for the mixture weights, and
$$\Phi_b(\Theta,\hat\Theta)
= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k|\vec{x}_i,\Theta)\,\log P(\vec{x}_i|c_k,\hat\Theta)
= \sum_{i=1}^{n}\sum_{k=1}^{K}
\frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\,\log P(\vec{x}_i|c_k,\hat\Theta)$$
is the auxiliary function for the cluster distributions.
IR ndash Berlin Chen 44
The EM Algorithm (cont)
• M-step (Maximization)
– Remember that a function F can be maximized under a constraint by applying a Lagrange multiplier:
Suppose $F=\sum_{j=1}^{N} w_j\log y_j$ is to be maximized subject to the constraint $\sum_{j=1}^{N} y_j = 1$. By applying a Lagrange multiplier $l$:
$$\tilde F = \sum_{j=1}^{N} w_j\log y_j + l\Big(1-\sum_{j=1}^{N} y_j\Big)$$
$$\frac{\partial\tilde F}{\partial y_j}=\frac{w_j}{y_j}-l=0
\;\Rightarrow\; y_j=\frac{w_j}{l}\;\;\forall j
\;\Rightarrow\; l=\sum_{j=1}^{N} w_j
\;\Rightarrow\; \hat y_j=\frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$$
(Note: $\partial\log y_j/\partial y_j = 1/y_j$.)
IR ndash Berlin Chen 45
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize Φ_a(Θ, Θ̂):
Applying the Lagrange multiplier $l$ to the constraint $\sum_{k=1}^{K} P(c_k|\hat\Theta)=1$:
$$\tilde\Phi_a(\Theta,\hat\Theta)
= \Phi_a(\Theta,\hat\Theta)+l\Big(1-\sum_{k=1}^{K}P(c_k|\hat\Theta)\Big)
= \sum_{k=1}^{K}\sum_{i=1}^{n}
\frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l'=1}^{K}P(\vec{x}_i|c_{l'},\Theta)\,P(c_{l'}|\Theta)}\,\log P(c_k|\hat\Theta)
+ l\Big(1-\sum_{k=1}^{K}P(c_k|\hat\Theta)\Big)$$
This has exactly the form of the previous slide with $y_k = P(c_k|\hat\Theta)$ and $w_k=\sum_{i=1}^{n}P(c_k|\vec{x}_i,\Theta)$, so the re-estimated mixture weights (or priors, for Gaussians) are
$$\hat P(c_k|\hat\Theta)=\hat\pi_k
= \frac{\sum_{i=1}^{n}P(c_k|\vec{x}_i,\Theta)}{\sum_{k'=1}^{K}\sum_{i=1}^{n}P(c_{k'}|\vec{x}_i,\Theta)}
= \frac{\sum_{i=1}^{n}P(c_k|\vec{x}_i,\Theta)}{n}
= \frac{w_k}{n}$$
where $w_k$ is the expected number of times the samples $\vec{x}_i$ fall in class $c_k$.
IR ndash Berlin Chen 46
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize Φ_b(Θ, Θ̂):
This is the auxiliary function for the (multivariate) Gaussian means and variances:
$$\Phi_b(\Theta,\hat\Theta)
= \sum_{i=1}^{n}\sum_{k=1}^{K}
\frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\,\log P(\vec{x}_i|c_k,\hat\Theta)$$
Let
$$w_{ik}=\frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}
\qquad\text{and}\qquad
P(\vec{x}_i|c_k,\hat\Theta)=\frac{1}{(2\pi)^{m/2}|\hat\Sigma_k|^{1/2}}
\exp\!\Big(-\tfrac12(\vec{x}_i-\hat{\vec\mu}_k)^{T}\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)\Big)$$
so that
$$\log P(\vec{x}_i|c_k,\hat\Theta)
= -\tfrac{m}{2}\log 2\pi-\tfrac12\log|\hat\Sigma_k|
-\tfrac12(\vec{x}_i-\hat{\vec\mu}_k)^{T}\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)$$
and
$$\Phi_b(\Theta,\hat\Theta)
= \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\Big[-\tfrac12\log|\hat\Sigma_k|
-\tfrac12(\vec{x}_i-\hat{\vec\mu}_k)^{T}\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)\Big] + D,
\qquad D\ \text{a constant}$$
IR ndash Berlin Chen 47
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize Φ_b(Θ, Θ̂) with respect to μ̂_k:
$$\frac{\partial\Phi_b(\Theta,\hat\Theta)}{\partial\hat{\vec\mu}_k}
= -\tfrac12\sum_{i=1}^{n} w_{ik}\cdot 2\,\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)\,(-1)=0
\;\Rightarrow\; \sum_{i=1}^{n} w_{ik}\,\vec{x}_i=\hat{\vec\mu}_k\sum_{i=1}^{n} w_{ik}$$
$$\Rightarrow\;
\hat{\vec\mu}_k=\frac{\sum_{i=1}^{n} w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)\,\vec{x}_i}{\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)}$$
(using $\dfrac{d(\vec{x}^{T}C\vec{x})}{d\vec{x}}=(C+C^{T})\vec{x}$, with $\hat\Sigma_k^{-1}$ symmetric; $\sum_i w_{ik}$ is again the expected number of times the samples fall in class $c_k$)
IR ndash Berlin Chen 48
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize Φ_b(Θ, Θ̂) with respect to Σ̂_k:
Using $\dfrac{d\,\det(X)}{dX}=\det(X)\,(X^{-1})^{T}$ and $\dfrac{d(\vec{a}^{T}X^{-1}\vec{b})}{dX}=-X^{-1}\vec{a}\,\vec{b}^{T}X^{-1}$, with $\hat\Sigma_k$ symmetric:
$$\frac{\partial\Phi_b(\Theta,\hat\Theta)}{\partial\hat\Sigma_k}
= \sum_{i=1}^{n} w_{ik}\Big[-\tfrac12\,\hat\Sigma_k^{-1}
+\tfrac12\,\hat\Sigma_k^{-1}(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^{T}\hat\Sigma_k^{-1}\Big]=0$$
$$\Rightarrow\; \hat\Sigma_k^{-1}\sum_{i=1}^{n} w_{ik}
= \hat\Sigma_k^{-1}\Big[\sum_{i=1}^{n} w_{ik}(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^{T}\Big]\hat\Sigma_k^{-1}
\;\Rightarrow\; \hat\Sigma_k\sum_{i=1}^{n} w_{ik}
= \sum_{i=1}^{n} w_{ik}(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^{T}$$
$$\Rightarrow\; \hat\Sigma_k
= \frac{\sum_{i=1}^{n} w_{ik}(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^{T}}{\sum_{i=1}^{n} w_{ik}}
= \frac{\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)\,(\vec{x}_i-\hat{\vec\mu}_k)(\vec{x}_i-\hat{\vec\mu}_k)^{T}}{\sum_{i=1}^{n} P(c_k|\vec{x}_i,\Theta)}$$
IR ndash Berlin Chen 49
The EM Algorithm (cont)
• The initial cluster distributions can be estimated using the K-means algorithm
• The procedure terminates when the likelihood function P(X|Θ) has converged or a maximum number of iterations is reached (a sketch of the resulting EM loop follows below)
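Putting the E-step posteriors and the M-step re-estimation formulas of the previous slides together, a minimal NumPy sketch of the resulting EM loop for a Gaussian mixture (assumptions: data `X` of shape (n, m), initial means `init_mu` e.g. from K-means, full covariances, and a small ridge term for numerical stability):

```python
import numpy as np

def gaussian_pdf_all(X, mu, Sigma):
    """N(x_i; mu, Sigma) evaluated for every row of X."""
    m = X.shape[1]
    diff = X - mu
    sol = np.linalg.solve(Sigma, diff.T).T
    quad = np.einsum("ij,ij->i", diff, sol)
    norm = np.sqrt(((2 * np.pi) ** m) * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, init_mu, n_iter=100, tol=1e-6):
    n, m = X.shape
    K = len(init_mu)
    pi = np.full(K, 1.0 / K)                     # mixture weights (priors)
    mu = init_mu.copy()                          # e.g. K-means centers
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(m)] * K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities w_ik = P(c_k | x_i, Theta)
        joint = np.stack([pi[k] * gaussian_pdf_all(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        ll = np.log(joint.sum(axis=1)).sum()     # log P(X | Theta)
        w = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, covariances
        Nk = w.sum(axis=0)                       # expected counts per cluster
        pi = Nk / n
        mu = (w.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (w[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(m)
        if ll - prev_ll < tol:                   # likelihood converged
            break
        prev_ll = ll
    return pi, mu, Sigma
```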
IR ndash Berlin Chen 50
Hierarchical Document Organization
• Explore the probabilistic latent topical information
– TMM/PLSA approach
• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
• Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map
• When a cluster has many documents, we can further analyze it into another map on the next layer
(Figure: two-dimensional tree structure for organized topics)
$$P(w_j|D_i) = \sum_{k=1}^{K} P(T_k|D_i)\Big[\sum_{l=1}^{K} P(T_l|Y_k)\,P(w_j|T_l)\Big]$$
$$P(T_l|Y_k) = \frac{E(T_k,T_l)}{\sum_{s=1}^{K} E(T_k,T_s)},\qquad
E(T_k,T_l) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\Big[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\Big],\qquad
dist(T_i,T_j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}$$
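A minimal NumPy sketch of how a word's probability under a document is smoothed over neighbouring topics on the map, following the equations above (the arrays `PT_D` of shape (K, N) for P(T_k|D_i), `Pw_T` of shape (V, K) for P(w_j|T_l), and the topic map coordinates `coords` of shape (K, 2) are hypothetical names introduced for this example):

```python
import numpy as np

def topic_neighbor_weights(coords, sigma=1.0):
    """P(T_l | Y_k): Gaussian weight by map distance, normalized over l."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)    # dist(T_k, T_l)^2
    E = np.exp(-d2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return E / E.sum(axis=1, keepdims=True)                          # rows: k, cols: l

def word_given_doc(PT_D, Pw_T, coords, sigma=1.0):
    """P(w_j|D_i) = sum_k P(T_k|D_i) * sum_l P(T_l|Y_k) P(w_j|T_l)."""
    PT_Y = topic_neighbor_weights(coords, sigma)      # (K, K)
    Pw_Y = Pw_T @ PT_Y.T                              # (V, K): sum_l P(w|T_l) P(T_l|Y_k)
    return Pw_Y @ PT_D                                # (V, N)
```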
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
– EM training can be performed
$$L_T = \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log P(w_j|D_i)
= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,
\log\Big\{\sum_{k=1}^{K} P(T_k|D_i)\Big[\sum_{l=1}^{K} P(T_l|Y_k)\,P(w_j|T_l)\Big]\Big\}$$
EM re-estimation formulas:
$$\hat P(w_j|T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k|w_j,D_i)}
{\sum_{j'=1}^{J}\sum_{i=1}^{N} c(w_{j'},D_i)\,P(T_k|w_{j'},D_i)},
\qquad
\hat P(T_k|D_i) = \frac{\sum_{j=1}^{J} c(w_j,D_i)\,P(T_k|w_j,D_i)}{c(D_i)}$$
where
$$P(T_k|w_j,D_i) =
\frac{\Big[\sum_{l=1}^{K} P(w_j|T_l)\,P(T_l|Y_k)\Big]\cdot P(T_k|D_i)}
{\sum_{k'=1}^{K}\Big[\sum_{l=1}^{K} P(w_j|T_l)\,P(T_l|Y_{k'})\Big]\cdot P(T_{k'}|D_i)}$$
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
• Criterion for topic word selection:
$$S(w_j,T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k|D_i)}
{\sum_{i=1}^{N} c(w_j,D_i)\,\big[1-P(T_k|D_i)\big]}$$
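A minimal sketch of the selection score (NumPy assumed; the count matrix `counts` of shape (J, N) holding c(w_j, D_i) and `PT_D` of shape (K, N) are hypothetical names for this example):

```python
import numpy as np

def topic_word_score(counts, PT_D):
    """S(w_j, T_k): ratio of topic-weighted to non-topic-weighted term counts."""
    num = counts @ PT_D.T              # (J, K): sum_i c(w_j, D_i) P(T_k|D_i)
    den = counts @ (1.0 - PT_D).T      # (J, K): sum_i c(w_j, D_i) (1 - P(T_k|D_i))
    return num / den
```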
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
• Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
• Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
• Self-Organizing Map (SOM) – a recursive regression process
(Figure: input layer and mapping layer)
Input vector $\vec{x}=[x_1,x_2,\ldots,x_n]^{T}$; weight vector of map unit $i$: $\vec{m}_i=[m_{i1},m_{i2},\ldots,m_{in}]^{T}$

Update rule:
$$\vec{m}_i(t+1)=\vec{m}_i(t)+h_{c(\vec{x}),i}(t)\,\big[\vec{x}(t)-\vec{m}_i(t)\big]$$
where the winning (best-matching) unit is
$$c(\vec{x})=\arg\min_{i'}\lVert\vec{x}-\vec{m}_{i'}\rVert,\qquad
\lVert\vec{x}-\vec{m}_{i'}\rVert=\sqrt{\textstyle\sum_{n'}(x_{n'}-m_{i'n'})^{2}}$$
and the neighbourhood function is
$$h_{c(\vec{x}),i}(t)=\alpha(t)\,\exp\!\Big(-\frac{\lVert\vec{r}_i-\vec{r}_{c(\vec{x})}\rVert^{2}}{2\sigma^{2}(t)}\Big)$$
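A minimal NumPy sketch of one SOM training step under the update rule above (assumptions: weight matrix `W` of shape (units, n), unit map coordinates `R` of shape (units, 2), and plain scalars `alpha`, `sigma` standing in for α(t) and σ(t)):

```python
import numpy as np

def som_step(W, R, x, alpha=0.1, sigma=1.0):
    """One recursive-regression update m_i(t+1) = m_i(t) + h_{c(x),i} [x - m_i(t)]."""
    c = np.argmin(((x - W) ** 2).sum(axis=1))                  # winning unit c(x)
    h = alpha * np.exp(-((R - R[c]) ** 2).sum(axis=1) / (2 * sigma ** 2))
    return W + h[:, None] * (x - W)
```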
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
• Results
(Table: comparison of TMM and SOM — columns Model, Iterations, dist_Within, dist_Between)
The evaluation metric is the ratio
$$R_{dist}=\frac{dist_{Between}}{dist_{Within}}$$
where
$$dist_{Between}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}
{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)},
\qquad
dist_{Within}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}
{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}$$
with
$$f_{Between}(i,j)=\begin{cases} dist_{Map}(i,j) & T_{r_i}\ne T_{r_j}\\ 0 & \text{otherwise}\end{cases}
\qquad
C_{Between}(i,j)=\begin{cases} 1 & T_{r_i}\ne T_{r_j}\\ 0 & \text{otherwise}\end{cases}$$
$$f_{Within}(i,j)=\begin{cases} dist_{Map}(i,j) & T_{r_i}= T_{r_j}\\ 0 & \text{otherwise}\end{cases}
\qquad
C_{Within}(i,j)=\begin{cases} 1 & T_{r_i}= T_{r_j}\\ 0 & \text{otherwise}\end{cases}$$
$$dist_{Map}(i,j)=\sqrt{(x_i-x_j)^{2}+(y_i-y_j)^{2}}$$
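A minimal sketch of the evaluation ratio (NumPy assumed; `xy` of shape (D, 2) holding each document's map coordinates and `labels` holding its reference topic T_{r_i} are hypothetical names for this example):

```python
import numpy as np

def r_dist(xy, labels):
    """R_dist = dist_Between / dist_Within over all document pairs (i < j)."""
    D = len(xy)
    between, within, n_between, n_within = 0.0, 0.0, 0, 0
    for i in range(D):
        for j in range(i + 1, D):
            d = np.linalg.norm(xy[i] - xy[j])          # dist_Map(i, j)
            if labels[i] != labels[j]:
                between, n_between = between + d, n_between + 1
            else:
                within, n_within = within + d, n_within + 1
    return (between / n_between) / (within / n_within)
```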
IR ndash Berlin Chen 32
The K-means Algorithm (cont)
bull Choice of initial cluster centers (seeds) is important
ndash Pick at randomndash Calculate the mean of all data and generate k initial centers
by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)
divide it range into k equal interval and take the mean of data in each group as the initial center
ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects
bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set
bull Poor seeds will result in sub-optimal clustering
im
im
δm plusmni
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 33
The K-means Algorithm (cont)
bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters
ndash Or perturb objects slightly
bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression
bull Map from the original space to l-dimensional spacehypercube
l=log2k (k clusters)Nodes on the hypercube
A linear classifier
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 34
The K-means Algorithm (cont)
bull Eg the LBG algorithmndash By Linde Buzo and Gray
Global mean Cluster 1 mean
Cluster 2mean
μ11Σ11ω11μ12Σ12ω12
μ13Σ13ω13 μ14Σ14ω14
Mrarr2M at each iteration
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 35
The EM Algorithmbull A soft version of the K-mean algorithm
ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability
distributions
sumixr
( )1cxP i( )11 cP=π
( )22 cP=π
( )KK cP=π
( )2cxP i
( )Ki cxP
( ) ( ) ( )sum=
=ΘK
kkkii cPcxPxP
1
ΘΘ rr
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ= minus
kikT
ki
kmki xxcxP μμ
π
rrrrr 1
21exp
2
1Θ
Continuous caseLikelihood function fordata samples
( ) ( )
( ) ( ) cPcxP
xPP
n
i
K
kkki
n
ii
i
iiprod sum
prod
= =
=
=
=
1 1
1
ΘΘ
ΘΘ
r
rX
A Mixture Gaussian HMM(or A Mixture of Gaussians)
xxx nr
Krr 21=X
( ) ( ) ( )( )
( ) ( )ΘΘmax
ΘΘmax max
tionclassifica
kkik
i
kki
kikk
cPcx
xPcPcx
xcP
r
r
rr
=
Θ=Θ
(iid) ddistributey identicallt independen are sixr xxx n
rL
rr 21=X
IR ndash Berlin Chen 36
The EM Algorithm (cont)
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
(Figure: two-dimensional tree structure for organized topics)

P(w_j|D_i) = \sum_{k=1}^{K} P(T_k|D_i)\left[ \sum_{l=1}^{K} P(T_l|Y_k)\,P(w_j|T_l) \right]

P(T_l|Y_k) = \frac{E(T_k,T_l)}{\sum_{s=1}^{K} E(T_k,T_s)},\qquad E(T_k,T_l) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[ -\frac{dist(T_k,T_l)^2}{2\sigma^2} \right]

dist(T_i,T_j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}
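A small sketch (NumPy; the matrix names P_T_D, P_w_T and the topic grid coordinates are hypothetical) of how P(w_j|D_i) can be computed with this Gaussian neighborhood smoothing over the 2-D topic map:

    import numpy as np

    def topic_neighborhood(coords, sigma=1.0):
        # P(T_l | Y_k): Gaussian smoothing over topic positions on the map
        d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
        E = np.exp(-d2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
        return E / E.sum(axis=1, keepdims=True)                  # rows k, columns l

    def word_given_doc(P_T_D, P_w_T, coords, sigma=1.0):
        # P(w_j|D_i) = sum_k P(T_k|D_i) * sum_l P(T_l|Y_k) P(w_j|T_l)
        P_T_Y = topic_neighborhood(coords, sigma)                # (K, K)
        smoothed = P_T_Y @ P_w_T                                 # (K, V)
        return P_T_D @ smoothed                                  # (N, V)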
IR – Berlin Chen 51
Hierarchical Document Organization (cont)
• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
  – EM training can be performed

    L_T = \sum_{i=1}^{N}\sum_{j=1}^{J_n} c(w_j,D_i)\log P(w_j|D_i)
        = \sum_{i=1}^{N}\sum_{j=1}^{J_n} c(w_j,D_i)\log\left\{ \sum_{k=1}^{K} P(T_k|D_i)\left[ \sum_{l=1}^{K} P(T_l|Y_k)\,P(w_j|T_l) \right] \right\}

    \hat{P}(w_j|T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k|w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i=1}^{N} c(w_{j'},D_i)\,P(T_k|w_{j'},D_i)},\qquad \hat{P}(T_k|D_i) = \frac{\sum_{j=1}^{J} c(w_j,D_i)\,P(T_k|w_j,D_i)}{c(D_i)}

    where

    P(T_k|w_j,D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j|T_l)\,P(T_l|Y_k) \right] P(T_k|D_i)}{\sum_{k'=1}^{K}\left[ \sum_{l=1}^{K} P(w_j|T_l)\,P(T_l|Y_{k'}) \right] P(T_{k'}|D_i)}
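One EM iteration of this training could look roughly as follows (a dense NumPy sketch under the same hypothetical naming; a real implementation would stream over documents rather than build the full N×K×V posterior):

    import numpy as np

    def tmm_em_step(counts, P_T_D, P_w_T, P_T_Y):
        # counts: (N, V) term counts c(w_j, D_i); P_T_D: (N, K); P_w_T: (K, V); P_T_Y: (K, K)
        smoothed = P_T_Y @ P_w_T                                  # sum_l P(T_l|Y_k) P(w_j|T_l)
        post = P_T_D[:, :, None] * smoothed[None, :, :]           # E-step: P(T_k|w_j,D_i), (N, K, V)
        post /= post.sum(axis=1, keepdims=True)
        weighted = counts[:, None, :] * post                      # c(w_j,D_i) P(T_k|w_j,D_i)
        P_w_T_new = weighted.sum(axis=0)                          # M-step: new P(w_j|T_k)
        P_w_T_new /= P_w_T_new.sum(axis=1, keepdims=True)
        P_T_D_new = weighted.sum(axis=2) / counts.sum(axis=1, keepdims=True)   # new P(T_k|D_i)
        return P_T_D_new, P_w_T_new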
IR – Berlin Chen 52
Hierarchical Document Organization (cont)
• Criterion for Topic Word Selecting:

    S(w_j,T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k|D_i)}{\sum_{i=1}^{N} c(w_j,D_i)\,[1 - P(T_k|D_i)]}
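This score is straightforward to compute in matrix form (NumPy sketch, hypothetical names as before):

    import numpy as np

    def topic_word_scores(counts, P_T_D):
        # S(w_j, T_k) = sum_i c(w_j,D_i) P(T_k|D_i) / sum_i c(w_j,D_i) [1 - P(T_k|D_i)]
        num = counts.T @ P_T_D                    # (V, K)
        den = counts.T @ (1.0 - P_T_D)            # (V, K)
        return num / den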
IR – Berlin Chen 53
Hierarchical Document Organization (cont)
• Example (figure)
IR – Berlin Chen 54
Hierarchical Document Organization (cont)
• Example (cont) (figure)
IR – Berlin Chen 55
Hierarchical Document Organization (cont)
• Self-Organizing Map (SOM) – a recursive regression process
  (Figure: input layer and mapping layer)
  – Input vector: \vec{x} = [x_1, x_2, \ldots, x_n]^T
  – Weight vector of map unit i: \vec{m}_i = [m_{i1}, m_{i2}, \ldots, m_{in}]^T
  – Update rule: \vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),i}(t)\,[\vec{x}(t) - \vec{m}_i(t)]
    where the winning unit is c(\vec{x}) = \arg\min_{i'} \|\vec{x} - \vec{m}_{i'}\|, with \|\vec{x} - \vec{m}_{i'}\| = \sqrt{\sum_n (x_n - m_{i'n})^2},
    and the neighborhood function is h_{c(\vec{x}),i}(t) = \alpha(t)\exp\!\left( -\frac{\|\vec{r}_i - \vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)} \right)
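A minimal sketch of one SOM update step (NumPy; the exponential decay schedules for alpha(t) and sigma(t) are an assumption, not specified on the slide):

    import numpy as np

    def som_step(weights, grid, x, t, alpha0=0.5, sigma0=2.0, tau=1000.0):
        # weights: (units, n) map weight vectors; grid: (units, 2) unit coordinates r_i
        alpha = alpha0 * np.exp(-t / tau)                         # learning rate alpha(t)
        sigma = sigma0 * np.exp(-t / tau)                         # neighborhood width sigma(t)
        c = np.argmin(np.linalg.norm(weights - x, axis=1))        # winning unit c(x)
        h = alpha * np.exp(-np.sum((grid - grid[c]) ** 2, axis=1) / (2 * sigma ** 2))
        weights += h[:, None] * (x - weights)                     # m_i(t+1) = m_i(t) + h [x - m_i(t)]
        return weights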
IR – Berlin Chen 56
Hierarchical Document Organization (cont)
• Results
  (Table: comparison of TMM and SOM – columns: Model, Iterations, dist_Within, dist_Between)

    R_{dist} = \frac{dist_{Between}}{dist_{Within}}

    dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)},\quad
    f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r(i)} \neq T_{r(j)} \\ 0 & \text{otherwise} \end{cases},\quad
    C_{Between}(i,j) = \begin{cases} 1 & T_{r(i)} \neq T_{r(j)} \\ 0 & \text{otherwise} \end{cases}

    dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)},\quad
    f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r(i)} = T_{r(j)} \\ 0 & \text{otherwise} \end{cases},\quad
    C_{Within}(i,j) = \begin{cases} 1 & T_{r(i)} = T_{r(j)} \\ 0 & \text{otherwise} \end{cases}

    where dist_{Map}(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
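The evaluation measure can be computed directly from the document positions and cluster labels on the map (NumPy sketch; names are hypothetical):

    import numpy as np

    def organization_quality(coords, labels):
        # coords: (D, 2) map positions of documents; labels: (D,) topic index T_r(i)
        D = len(labels)
        within, n_within, between, n_between = 0.0, 0, 0.0, 0
        for i in range(D):
            for j in range(i + 1, D):
                d = np.sqrt(((coords[i] - coords[j]) ** 2).sum())   # dist_Map(i, j)
                if labels[i] == labels[j]:
                    within += d; n_within += 1
                else:
                    between += d; n_between += 1
        dist_within = within / n_within
        dist_between = between / n_between
        return dist_between / dist_within                           # R_dist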
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 37
Maximum Likelihood Estimation
bull Hard Assignment
State S1
P(B| S1)=24=05
P(W| S1)=24=05
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 38
Maximum Likelihood Estimation
bull Soft Assignment
State S1 State S2
07 03
04 06
09 01
05 05
P(B| S1)=(07+09)(07+04+09+05)
=1625=064
P(B| S1)=(04+05)(07+04+09+05)
=0925=036
P(B| S2)=(03+01)(03+06+01+05)
=0415=027
P(B| S2)=(06+05)(03+06+01+05)
=01115=073
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 39
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Derive the complete data likelihood function
( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )
( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]
( )[ ]( )[ ]
( )[ ]sum
sum sumsum
sum sumsum
sum sum prodsum
sum sum prodsum
prod sumprod
=
=
=
=
=
++times
times++=
==
= ==
= ==
= = ==
= = ==
= ==
CCX
X
Θ
Θ
Θ
Θ
ΘΘ
ΘΘΘΘ
ΘΘΘΘ
ΘΘΘΘ
1 121
1
1 121
1
1 1 11
1 1 11
11
1111
1 11
21
2
21
2
2
2
P
cxcxcxP
cxcxcxP
cxP
cPcxP
cPcxPcPcxP
cPcxPcPcxP
cPcxP xPP
K
k
K
kknkk
K
k
K
k
K
kknkk
K
k
K
k
K
k
n
iki
K
k
K
k
K
k
n
ikki
K
k
KKnn
KK
n
i
K
kkki
n
ii
i n
n
i n
n
i n
i
i n
ii
i
ii
rL
rrL
rL
rrL
rL
rL
rr
Lrr
rr
nn kkkk
nn
ccccxxxx
121
121
minus== minus
L
rrL
rr
CX
the complete data likelihood function
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
likelihood function
)kinds( ofkindsmany How
nKC
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 40
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the
log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data
ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function
bull We have shown this property when deriving the HMM-based retrieval model
( )ΘΘΦ
( )ΘX
( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum
sum
=
=
==Φ
C
C
XCXC
CXX
CX
CXXC
CX
ΘlogΘΘ
ΘlogΘ
ΘloglogΘΘΘΘ
PP
P
PP
PELE CM
( )Θlog XP( )ΘΘΦ
known unknown
( )ΘΘΦ ( )Θlog XP
( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two
( ) ( ) ( )
( ) ( ) ( )( ) ( )
( ) ( )( ) ( )( ) ( )
( )
( ) ( ) ( )( ) ( )( ) ( )
( )sum sumsum
sum sum
sum sumsum
sum sum
sum sum
= =
=
= =
= =
=
= =
= =
=
=Φ
=
=
=Φ
Φ+Φ=Φ
n
i
K
kkiK
llli
kki
n
i
K
kkiikb
n
i
K
kkK
llli
kki
n
i
K
kk
i
kki
n
i
K
kkika
ba
cxPcPcxP
cPcxP
cxPxcP
cPcPcxP
cPcxP
cPxP
cPcxP
cPxcP
1 1
1
1 1
1 1
1
1 1
1 1
ΘlogΘΘ
ΘΘ
ΘlogΘΘΘ
ΘlogΘΘ
ΘΘ
ΘlogΘ
ΘΘ
ΘlogΘΘΘ
whereΘΘΘΘΘΘ
r
r
r
rr
r
r
r
r
r
auxiliary function for mixture weights
auxiliary function for cluster distributions
IR ndash Berlin Chen 44
The EM Algorithm (cont)
bull M-step (Maximization)ndash Remember that
bull Maximize a function F with a constraint by applying Lagrange multiplier
sum
sumsumsum
sum sumsum
=
===
= ==
=there4
minus=rArrminus=
forallminus=rArr=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛minus+=rArr=
N
jj
jj
N
jj
N
jj
N
jj
j
j
j
j
j
N
j
N
jjjj
N
jjj
w
wy
wwy
jyw
yw
yF
yywFywF
1
111
1 11
0ˆ
1logˆlog that Suppose
Multiplier Lagrange applyingBy
ll
ll
l
l
partpart
Constraint
jj
j
yyy 1logNote
=part
part
IR ndash Berlin Chen 45
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ
( ) ( ) ( )( ) ( )
( ) ( )( ) ( ) 1ΘΘlog
ΘΘ
ΘΘ
1ΘΘΘΘΘ
11 1
1
1
⎟⎠
⎞⎜⎝
⎛minus+=
⎟⎠
⎞⎜⎝
⎛minus+Φ=Φ
sumsum sumsum
sum
== =
=
=
K
kk
K
k
n
ikK
llli
kki
K
kkaa
cPlcPcPcxP
cPcxP
cPl
r
r
kw ky
( )
( ) ( )( ) ( )( ) ( )( ) ( )
( ) ( )( ) ( )
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
ΘΘ
Θˆ
1
1
1 1
1
1
1
1
n
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
cPcxP
w
wcP
n
iK
llli
kki
K
k
n
iK
llli
kki
n
iK
llli
kki
K
kk
kkk
sumsum
sum sumsum
sumsum
sum
=
=
= =
=
=
=
=
====rArr
r
r
r
r
r
r
π
auxiliary function for mixture weights (or priors for Gaussians)
kii
k cxr classin falls that timesofnumber expected the r
IR ndash Berlin Chen 46
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ
( ) ( ) ( )( ) ( )
( )sum sumsum= =
=
=Φn
i
K
kkiK
llli
kkib cxP
cPcxP
cPcxP
1 1
1
ΘlogΘΘ
ΘΘ ΘΘ r
r
r
auxiliary function for (multivariate) Gaussian Means and Variances
( )( )
( ) ( )⎟⎠⎞
⎜⎝⎛ minusΣminusminus
Σ=Θ minus
kiT
ki
kmki xxcxP
kμμ
π
rrrrr 1
21exp
2
1
( ) ( )( ) ( )
and ΘΘ
ΘΘLet
1
sum=
= K
llli
kkiik
cPcxP
cPcxPw
r
r ( )( ) ( ) ( )ki
Tkik
ki
xxm
cxP
kμμπ rrrr
r
minusΣminusminusΣminussdotminus
=Θ
minus1
21log2
12log2
log
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1log21 ΘΘ μμ rrrr
constant
IR ndash Berlin Chen 47
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to
( ) ( ) ( )( )
( ) ( )( ) ( )( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sum
=
=
=
=
=
=
=
minus
ΘΘ
ΘΘ
sdotΘΘ
ΘΘ
=sdot
=rArr
=minusminusΣsdotsdotsdotminus=part
Φpart
n
iK
llli
kki
n
iiK
llli
kki
n
iik
n
iiik
k
n
ikikik
k
b
cPcxP
cPcxP
xcPcxP
cPcxP
w
xw
xw
1
1
1
1
1
1
1
1
ˆ
01ˆˆ221 ˆ
ΘΘ
r
r
r
r
r
r
r
rrr
μ
μμ
( )ΘΘbΦ kμr
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( )here symmetric is and
)(
1minus
+=
kΣd
d xCCxCxx T
T
ki
ik
cxr
classin falls that timesofnumber expected the
r
IR ndash Berlin Chen 48
The EM Algorithm (cont)
bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ
( )[ ]
here symmetric is and
)det(det
kΣd
d TXXX
X minussdot=
kΣ
( ) ( ) ( )sum sum= =
minus +⎥⎦⎤
⎢⎣⎡ minusΣminus+Σminus=Φ
n
i
K
kkik
T
kikikb Dxxw1 1
1
ˆˆˆ2
1ˆlog21 ΘΘ μμ rrrr
( ) ( )( )
( )( )
( )( )
( )( )
( )( )( ) ( )( ) ( )
( )( )
( ) ( )( ) ( )
sumsum
sumsum
sum
sum
sumsum
sumsum
sumsum
sum
=
=
=
=
=
=
==
=
minusminus
=
minus
=
minusminus
=
minus
=
minusminusminusminus
ΘΘ
ΘΘ
minusminussdotΘΘ
ΘΘ
=minusminussdot
=ΣrArr
minusminussdot=ΣsdotrArr
ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr
ΣminusminusΣsdot=ΣsdotrArr
=⎥⎦⎤
⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=
ΣpartΦpart
n
iK
llli
kki
n
i
T
kikiK
llli
kki
n
iik
n
i
T
kikiik
k
n
i
T
kikiik
n
ikik
k
n
ik
T
kikikkik
n
ikkkik
n
ik
T
kikikik
n
ikik
n
ik
T
kikikkkkikk
b
cPcxP
cPcxP
xxcPcxP
cPcxP
w
xxw
xxww
xxww
xxww
xxw
1
1
1
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1111
ˆˆ
ˆˆˆ
ˆˆˆ
ˆˆˆˆˆˆˆˆˆ
ˆˆˆˆˆ
0ˆˆˆˆˆ2
1 ˆΘΘ
r
r
rrrr
r
r
rrrr
rrrr
rrrr
rrrr
rrrr
μμμμ
μμ
μμ
μμ
μμ
111 )( minusminusminus
minus= XabXX
bXa TT
dd
IR ndash Berlin Chen 49
The EM Algorithm (cont)
bull The initial cluster distributions can be estimated using the K-means algorithm
bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached
( )ΘXP
IR ndash Berlin Chen 50
Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information
ndash TMMPLSA approach
bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map
bull When a cluster has many documents we can further analyze it into an other map on the next layer
Two-dimensional Tree Structure
for Organized Topics
( ) ( ) ( ) ( )sum ⎥⎦
⎤⎢⎣
⎡sum=
= =
K
k
K
lljklikij TwPYTPDTPDwP
1 1
( ) ( )⎥⎥⎦
⎤
⎢⎢⎣
⎡minus= 2
2
2exp
21
σσπlk
klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=
( ) ( )( )sum
=
=
K
sks
klkl
TTE
TTEYTP
1
IR ndash Berlin Chen 51
Hierarchical Document Organization (cont)
bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
ndash EM training can be performed
( ) ( )
( ) ( ) ( ) ( )⎭⎬⎫
⎩⎨⎧sum ⎥
⎦
⎤⎢⎣
⎡sumsum sum=
sum sum=
= == =
= =
K
k
K
lljklik
N
i
J
nij
ijN
i
J
nijT
TwPYTPDTPDwc
DwPDwcL
1 11 1
1 1
log
log
( )( ) ( )
( ) ( )sum sum
sum=
=prime =primeprimeprimeprimeprime
=J
j
N
iijkij
N
iijkij
kjDwTPDwc
DwTPDwcTwP
1 1
1
|
||ˆ
( )( ) ( )
( ) |
|ˆ 1
i
J
jijkij
ik Dc
DwTPDwcDTP
sum= =
where
( )( ) ( ) ( )
( ) ( ) ( )sum⎭⎬⎫
⎩⎨⎧
sdot⎥⎦
⎤⎢⎣
⎡sum
sdot⎥⎦
⎤⎢⎣
⎡sum
=prime
=primeprime
=primeprimeprimeprime
=K
kik
K
lkllj
ikK
lkllj
ijk
DTPTTPTwP
DTPTTPTwPDwTP
1 1
1
|||
||||
IR ndash Berlin Chen 52
Hierarchical Document Organization (cont)
bull Criterion for Topic Word Selecting
( )( ) ( )
( ) ( )sum minus
sum=
=primeprimeprime
=N
iikij
N
iikij
kjDTPDwc
DTPDwcTwS
1
1
]|1[
|
IR ndash Berlin Chen 53
Hierarchical Document Organization (cont)
bull Example
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 41
The EM Algorithm (cont)bull Endashstep (Expectation)
ndash The auxiliary function ( )ΘΘΦ
( ) ( )( ) ( )( )( ) ( )
( ) ( )
( ) ( )
( )[ ] ( )
( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum
sum sum
sum sum
sum sum
sum sum sum prod
sum sum prodsum
sum sumprod
sum prodprod
sum
= = = =
= =
= =
= =
= = = =
= = ==
= ==
= ==
+=
=
=
=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎦
⎤⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=Φ
m
k
n
i
m
k
n
ikijkkjk
m
k
n
ikkijk
m
k
n
ikijk
m
k
n
iikki
m
k
n
i ccc
n
jjkkkki
ccc
m
k
n
jjkki
n
ikk
cccki
n
i
n
jjk
ccc
n
iki
n
j j
kj
cxPxcPcPxcP
cPcxPxcP
cxPxcP
xcPcxP
xcPcxP
xcPcxP
cxPxcP
cxPxP
cxP
PP
P
jj
j
j
nkkk
ji
nkkk
ji
nkkk
ij
nkkk
i
j
1 1 1 1
1 1
1 1
1 1
1 1 1
1 11
11
11
ΘlogΘΘlogΘ
ΘΘlogΘ
ΘlogΘ
ΘΘlog
ΘΘlog
ΘΘlog
ΘlogΘ
ΘlogΘ
Θ
ΘlogΘΘ
ΘΘ
21
21
21
21
rrr
rr
rr
rr
rr
rr
rr
rr
r
K
K
K
K
C
C
C
C
C
CXX
CX
δ
δ
⎩⎨⎧ =
=otherwise 0
if 1
kk ikk i
δ
See Next Slide
IR ndash Berlin Chen 42
The EM Algorithm (cont)
ndash Note that
( )
( )[ ]
( )[ ]
( ) ( )
( )
( )Θ
Θ1
ΘΘ
Θ
Θ
Θ
1
1
1 1
1 1 1 1
1 1 1 1
1
1 2
1 2
21
ik
ik
n
ijj
m
cikkk
n
ijj
m
kjk
m
c
m
c
m
c
n
jjkkk
m
c
m
c
m
c
n
jjkkk
ccc
n
jjkkk
xcP
xcP
xcPxcP
xcP
xcP
xcP
ik
ii
j
j
k k nk
ji
k k nk
ji
nkkk
ji
r
r
rr
rL
rL
r
K
=
⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡
⎥⎥⎦
⎤
⎢⎢⎣
⎡=
=
=
⎥⎦
⎤⎢⎣
⎡
prod
sumprod sum
sum sum sum prod
sum sum sum prod
sum prod
ne=
=ne= =
= = = =
= = = =
= =
δ
δ
δ
δC
( )( )( ) ( )
sum sum sum prod
prod sum
= = = =
= =
=
+++++++++=M
k
M
k
M
k
T
t tk
TMTTMM
T
t
M
k tk
T t
t t
a
aaaaaaaaa
a
1 1 1 1
212222111211
1 1
1 2
Note
ki cx toaligned beonly can r
IR ndash Berlin Chen 43
The EM Algorithm (cont)
• E-step (Expectation)
– The auxiliary function can also be divided into two parts:

$$\Phi(\hat\Theta,\Theta)=\Phi_a(\hat\Theta,\Theta)+\Phi_b(\hat\Theta,\Theta)$$

where

$$
\Phi_a(\hat\Theta,\Theta)=\sum_{i=1}^{n}\sum_{k=1}^{K}\log P(c_k|\hat\Theta)\,P(c_k|x_i,\Theta)
=\sum_{i=1}^{n}\sum_{k=1}^{K}\log P(c_k|\hat\Theta)\,
\frac{P(x_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(x_i|c_l,\Theta)P(c_l|\Theta)}
$$
(auxiliary function for the mixture weights)

$$
\Phi_b(\hat\Theta,\Theta)=\sum_{i=1}^{n}\sum_{k=1}^{K}\log P(x_i|c_k,\hat\Theta)\,
\frac{P(x_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(x_i|c_l,\Theta)P(c_l|\Theta)}
$$
(auxiliary function for the cluster distributions)

IR – Berlin Chen 44
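Both $\Phi_a$ and $\Phi_b$ weight their log terms by the same posterior $P(c_k|x_i,\Theta)$. A sketch of that shared "responsibility" computation for a Gaussian mixture is shown below; the toy parameters are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch of the posterior that both Phi_a and Phi_b reuse,
#   w_ik = P(x_i|c_k,Theta) P(c_k|Theta) / sum_l P(x_i|c_l,Theta) P(c_l|Theta).
def responsibilities(x, priors, means, covs):
    like = np.column_stack([multivariate_normal.pdf(x, mean=means[k], cov=covs[k])
                            for k in range(len(priors))])      # P(x_i | c_k, Theta)
    joint = like * priors                                       # times P(c_k | Theta)
    return joint / joint.sum(axis=1, keepdims=True)             # normalise over clusters

x = np.array([[0.0, 0.0], [4.1, 3.9], [0.5, -0.2]])
priors = np.array([0.5, 0.5])
means = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]
print(responsibilities(x, priors, means, covs))   # each row sums to 1
```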
The EM Algorithm (cont)
• M-step (Maximization)
– Remember that: maximize a function $F$ under a constraint by applying a Lagrange multiplier

By applying a Lagrange multiplier $l$, suppose that
$$F(y)=\sum_{j=1}^{N}w_j\log y_j \quad\text{with the constraint}\quad \sum_{j=1}^{N}y_j=1$$
$$\Rightarrow\;\tilde F(y)=\sum_{j=1}^{N}w_j\log y_j+l\left(1-\sum_{j=1}^{N}y_j\right)$$
$$\frac{\partial \tilde F}{\partial y_j}=\frac{w_j}{y_j}-l=0
\;\Rightarrow\; w_j=l\,y_j,\;\forall j
\;\Rightarrow\; l=\sum_{j=1}^{N}w_j$$
$$\therefore\;\hat y_j=\frac{w_j}{\sum_{j'=1}^{N}w_{j'}}$$

Note: $\dfrac{\partial \log y_j}{\partial y_j}=\dfrac{1}{y_j}$

IR – Berlin Chen 45
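A quick numeric illustration of this closed-form solution (the weights below are arbitrary): the simplex point $w_j/\sum_{j'} w_{j'}$ scores at least as high as any random feasible point.

```python
import numpy as np

# Small numeric illustration (assumed weights) of the Lagrange-multiplier result:
# for F(y) = sum_j w_j log y_j with sum_j y_j = 1, the maximiser is y_j = w_j / sum_j' w_j'.
rng = np.random.default_rng(0)
w = np.array([3.0, 1.0, 6.0])

def F(y):
    return np.sum(w * np.log(y))

y_hat = w / w.sum()                     # closed-form solution from the slide
print("F(y_hat)    =", F(y_hat))

# every other point on the probability simplex scores lower
for _ in range(5):
    y = rng.dirichlet(np.ones(3))
    print("F(random y) =", F(y), "<=", F(y_hat))
```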
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize $\Phi_a(\hat\Theta,\Theta)$ subject to $\sum_{k=1}^{K}P(c_k|\hat\Theta)=1$:

$$
\tilde\Phi_a(\hat\Theta,\Theta)=\Phi_a(\hat\Theta,\Theta)+l\left(1-\sum_{k=1}^{K}P(c_k|\hat\Theta)\right)
=\sum_{k=1}^{K}\sum_{i=1}^{n}\log P(c_k|\hat\Theta)\,
\frac{P(x_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(x_i|c_l,\Theta)P(c_l|\Theta)}
+l\left(1-\sum_{k=1}^{K}P(c_k|\hat\Theta)\right)
$$

Here $P(c_k|\hat\Theta)$ plays the role of $y_k$ and the accumulated posterior plays the role of $w_k$, so

$$
\hat P(c_k|\hat\Theta)=\hat\pi_k
=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(x_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(x_i|c_l,\Theta)P(c_l|\Theta)}}
      {\displaystyle\sum_{k'=1}^{K}\sum_{i=1}^{n}\frac{P(x_i|c_{k'},\Theta)P(c_{k'}|\Theta)}{\sum_{l=1}^{K}P(x_i|c_l,\Theta)P(c_l|\Theta)}}
=\frac{1}{n}\sum_{i=1}^{n}P(c_k|x_i,\Theta)
$$

(auxiliary function for the mixture weights, i.e. the priors of the Gaussians)

$\sum_{i=1}^{n}P(c_k|x_i,\Theta)$: the expected number of times $x_i$ falls in class $c_k$

IR – Berlin Chen 46
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize $\Phi_b(\hat\Theta,\Theta)$

$$
\Phi_b(\hat\Theta,\Theta)=\sum_{i=1}^{n}\sum_{k=1}^{K}\log P(x_i|c_k,\hat\Theta)\,
\frac{P(x_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(x_i|c_l,\Theta)P(c_l|\Theta)}
$$
(auxiliary function for the multivariate Gaussian means and variances)

For a multivariate Gaussian,
$$
P(x_i|c_k,\hat\Theta)=\frac{1}{(2\pi)^{m/2}\,|\hat\Sigma_k|^{1/2}}
\exp\!\left(-\frac{1}{2}(x_i-\hat\mu_k)^T\hat\Sigma_k^{-1}(x_i-\hat\mu_k)\right)
$$

Let
$$
w_{ik}=\frac{P(x_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(x_i|c_l,\Theta)P(c_l|\Theta)}
\quad\text{and}\quad
\log P(x_i|c_k,\hat\Theta)=-\frac{m}{2}\log 2\pi-\frac{1}{2}\log|\hat\Sigma_k|
-\frac{1}{2}(x_i-\hat\mu_k)^T\hat\Sigma_k^{-1}(x_i-\hat\mu_k)
$$

Then
$$
\Phi_b(\hat\Theta,\Theta)=\sum_{i=1}^{n}\sum_{k=1}^{K}
w_{ik}\left[-\frac{1}{2}\log|\hat\Sigma_k|-\frac{1}{2}(x_i-\hat\mu_k)^T\hat\Sigma_k^{-1}(x_i-\hat\mu_k)\right]+D
$$
where $D$ is a constant.

IR – Berlin Chen 47
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize $\Phi_b(\hat\Theta,\Theta)$ with respect to $\hat\mu_k$

$$
\Phi_b(\hat\Theta,\Theta)=\sum_{i=1}^{n}\sum_{k=1}^{K}
w_{ik}\left[-\frac{1}{2}\log|\hat\Sigma_k|-\frac{1}{2}(x_i-\hat\mu_k)^T\hat\Sigma_k^{-1}(x_i-\hat\mu_k)\right]+D
$$

$$
\frac{\partial\Phi_b}{\partial\hat\mu_k}
=\sum_{i=1}^{n}w_{ik}\,\hat\Sigma_k^{-1}\left(x_i-\hat\mu_k\right)=0
\;\Rightarrow\;
\sum_{i=1}^{n}w_{ik}\,x_i=\hat\mu_k\sum_{i=1}^{n}w_{ik}
$$

$$
\Rightarrow\;\hat\mu_k=\frac{\sum_{i=1}^{n}w_{ik}\,x_i}{\sum_{i=1}^{n}w_{ik}}
=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(x_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(x_i|c_l,\Theta)P(c_l|\Theta)}\,x_i}
      {\displaystyle\sum_{i=1}^{n}\frac{P(x_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(x_i|c_l,\Theta)P(c_l|\Theta)}}
$$

Matrix calculus note: $\dfrac{d\,(x^T C x)}{d\,x}=(C+C^T)x$, and $C=\hat\Sigma_k^{-1}$ is symmetric here.

$\sum_{i=1}^{n}w_{ik}$: the expected number of times $x_i$ falls in class $c_k$

IR – Berlin Chen 48
The EM Algorithm (cont)
• M-step (Maximization)
– Maximize $\Phi_b(\hat\Theta,\Theta)$ with respect to $\hat\Sigma_k$

$$
\Phi_b(\hat\Theta,\Theta)=\sum_{i=1}^{n}\sum_{k=1}^{K}
w_{ik}\left[-\frac{1}{2}\log|\hat\Sigma_k|-\frac{1}{2}(x_i-\hat\mu_k)^T\hat\Sigma_k^{-1}(x_i-\hat\mu_k)\right]+D
$$

$$
\frac{\partial\Phi_b}{\partial\hat\Sigma_k}
=\sum_{i=1}^{n}w_{ik}\left[-\frac{1}{2}\hat\Sigma_k^{-1}
+\frac{1}{2}\hat\Sigma_k^{-1}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T\hat\Sigma_k^{-1}\right]=0
$$

$$
\Rightarrow\;\hat\Sigma_k^{-1}\sum_{i=1}^{n}w_{ik}
=\hat\Sigma_k^{-1}\left[\sum_{i=1}^{n}w_{ik}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T\right]\hat\Sigma_k^{-1}
\;\Rightarrow\;
\hat\Sigma_k\sum_{i=1}^{n}w_{ik}=\sum_{i=1}^{n}w_{ik}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T
$$

$$
\Rightarrow\;\hat\Sigma_k=\frac{\sum_{i=1}^{n}w_{ik}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T}{\sum_{i=1}^{n}w_{ik}}
=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(x_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(x_i|c_l,\Theta)P(c_l|\Theta)}\,(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T}
      {\displaystyle\sum_{i=1}^{n}\frac{P(x_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(x_i|c_l,\Theta)P(c_l|\Theta)}}
$$

Matrix calculus notes: $\dfrac{d\,\det(X)}{d\,X}=\det(X)\,(X^{-1})^T$ and $\dfrac{d\,(a^T X^{-1} b)}{d\,X}=-X^{-1}ab^T X^{-1}$, with $\hat\Sigma_k$ symmetric here.

IR – Berlin Chen 49
The EM Algorithm (cont)
• The initial cluster distributions can be estimated using the K-means algorithm
• The procedure terminates when the likelihood function $P(X|\Theta)$ converges or the maximum number of iterations is reached

IR – Berlin Chen 50
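Putting the E-step and the three M-step re-estimation formulas together gives the usual EM loop for a Gaussian mixture. The sketch below is one possible wiring, with an assumed random initialization in place of K-means and an assumed convergence threshold.

```python
import numpy as np

# Compact sketch of a full EM loop for a Gaussian mixture, using the M-step
# formulas above (priors, means, covariances) and a log-likelihood convergence test.
def gaussian_pdf(x, mean, cov):
    d = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum("nd,de,ne->n", diff, inv, diff)
    return np.exp(expo) / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))

def em_gmm(x, k, iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n, d = x.shape
    means = x[rng.choice(n, k, replace=False)]           # crude init (K-means would be better)
    covs = np.array([np.cov(x.T) + 1e-6 * np.eye(d)] * k)
    priors = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(iters):
        # E-step: responsibilities w_ik = P(c_k | x_i, Theta)
        like = np.column_stack([gaussian_pdf(x, means[j], covs[j]) for j in range(k)])
        joint = like * priors
        ll = np.log(joint.sum(axis=1)).sum()              # log P(X | Theta)
        w = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means and covariances
        nk = w.sum(axis=0)
        priors = nk / n
        means = (w.T @ x) / nk[:, None]
        for j in range(k):
            diff = x - means[j]
            covs[j] = (w[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol:                            # likelihood converged
            break
        prev_ll = ll
    return priors, means, covs, ll

pts = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])
print(em_gmm(pts, 2)[0])                                  # mixture weights near [0.5, 0.5]
```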
Hierarchical Document Organization
• Explore the probabilistic latent topical information
– TMM/PLSA approach
• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
• Related documents are in the same cluster, and the relationships among the clusters have to do with their distance on the map
• When a cluster has many documents, we can further analyze it into another map on the next layer

(Figure: two-dimensional tree structure for organized topics)

$$P(w_j|D_i)=\sum_{k=1}^{K}P(T_k|D_i)\left[\sum_{l=1}^{K}P(T_l|T_k)\,P(w_j|T_l)\right]$$

$$E(T_k,T_l)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\right],\qquad
dist(T_i,T_j)=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2}$$

$$P(T_l|T_k)=\frac{E(T_k,T_l)}{\sum_{s=1}^{K}E(T_k,T_s)}$$

IR – Berlin Chen 51
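A sketch of the topic-neighborhood term: given assumed 2-D coordinates for the $K$ topics on the map, $E(T_k,T_l)$ is a Gaussian of the map distance and $P(T_l|T_k)$ is its normalization over $l$.

```python
import numpy as np

# Sketch (assumed coordinate layout) of the topic-neighbourhood probabilities on the map.
def topic_transition_matrix(coords, sigma=1.0):
    diff = coords[:, None, :] - coords[None, :, :]          # pairwise coordinate differences
    dist2 = (diff ** 2).sum(axis=-1)                        # dist(T_k, T_l)^2
    e = np.exp(-dist2 / (2.0 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)   # E(T_k, T_l)
    return e / e.sum(axis=1, keepdims=True)                 # row k holds P(T_l | T_k)

# topics laid out on a 2 x 2 map
coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(topic_transition_matrix(coords))    # nearby topics get larger probability mass
```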
Hierarchical Document Organization (cont)
• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
– EM training can be performed

$$
L_T=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\,\log P(w_j|D_i)
=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\,
\log\left\{\sum_{k=1}^{K}P(T_k|D_i)\left[\sum_{l=1}^{K}P(T_l|T_k)\,P(w_j|T_l)\right]\right\}
$$

$$
\hat P(w_j|T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k|w_j,D_i)}
{\sum_{j'=1}^{J}\sum_{i=1}^{N}c(w_{j'},D_i)\,P(T_k|w_{j'},D_i)}
,\qquad
\hat P(T_k|D_i)=\frac{\sum_{j=1}^{J}c(w_j,D_i)\,P(T_k|w_j,D_i)}{|D_i|}
$$

where

$$
P(T_k|w_j,D_i)=\frac{\left[\sum_{l=1}^{K}P(w_j|T_l)\,P(T_l|T_k)\right]P(T_k|D_i)}
{\sum_{k'=1}^{K}\left[\sum_{l=1}^{K}P(w_j|T_l)\,P(T_l|T_{k'})\right]P(T_{k'}|D_i)}
$$

IR – Berlin Chen 52
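These updates can be wired into one EM iteration. The sketch below assumes a small count matrix and the array conventions noted in the comments; it re-estimates $P(w_j|T_k)$ and $P(T_k|D_i)$ while keeping the map-induced $P(T_l|T_k)$ fixed.

```python
import numpy as np

# Hedged sketch of one TMM/PLSA-style EM iteration for the two-layer map, assuming:
#   counts[i, j] = c(w_j, D_i)    p_w_t[j, l] = P(w_j | T_l)
#   p_t_t[l, k]  = P(T_l | T_k)   p_t_d[i, k] = P(T_k | D_i)
def tmm_em_step(counts, p_w_t, p_t_t, p_t_d):
    n_docs, n_terms = counts.shape
    n_topics = p_t_d.shape[1]
    smoothed = p_w_t @ p_t_t                       # sum_l P(w_j|T_l) P(T_l|T_k) -> (J, K)
    new_w_t_num = np.zeros((n_terms, n_topics))
    new_t_d = np.zeros_like(p_t_d)
    for i in range(n_docs):
        # E-step: P(T_k | w_j, D_i) for every term of document D_i
        post = smoothed * p_t_d[i]
        post /= post.sum(axis=1, keepdims=True)
        weighted = counts[i][:, None] * post       # c(w_j, D_i) P(T_k | w_j, D_i)
        new_w_t_num += weighted                    # accumulate numerator of P(w_j | T_k)
        new_t_d[i] = weighted.sum(axis=0) / counts[i].sum()   # P(T_k | D_i) update
    new_w_t = new_w_t_num / new_w_t_num.sum(axis=0, keepdims=True)
    return new_w_t, new_t_d

counts = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]])         # 2 documents, 3 terms
rng = np.random.default_rng(0)
p_w_t = rng.dirichlet(np.ones(3), size=2).T                    # (J=3, K=2)
p_t_t = rng.dirichlet(np.ones(2), size=2).T                    # (K, K), columns sum to 1
p_t_d = rng.dirichlet(np.ones(2), size=2)                      # (N=2, K=2)
print(tmm_em_step(counts, p_w_t, p_t_t, p_t_d))
```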
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 54
Hierarchical Document Organization (cont)
bull Example (cont)
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 55
Hierarchical Document Organization (cont)
bull Self-Organization Map (SOM) ndash A recursive regression process
[ ]Tnmmmm 121111 =
(Mapping Layer
Input Layer
[ ]Tnxxxx 21=Input Vector
[ ]Tniiii mmmm 21 =
Weight Vector
)]()()[()()1( )( tmtxthtmtm iixcii minus+=+
ii
mxxc primeprime
minus= minarg)(
where( )sum minus=minus primeprime n nini mxmx 2
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛ minusminus=
)(2exp)()( 2
2
)()( t
rrtth xci
ixc σα
imx
ii mx minus
imprime
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where
IR ndash Berlin Chen 56
Hierarchical Document Organization (cont)
bull Results
20604100SOM1917540194773020650201916510
TMM
distBetweendistWithinIterationsModel
Within
BetweenDist dist
distR =
sumsum
sumsum
= +=
= +==D
i
D
ijBetween
D
i
D
ijBetween
Between
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ ne
=otherwise
TTijdistjif jrirMap
Between 0
)()(
( ) ( )22)( jijiMap yyxxijdist minus+minus=
⎩⎨⎧ ne
= 0
1 )(
otherwise
TTjiC jrir
Between
sumsum
sumsum
= +=
= +== D
i
D
ijWithin
D
i
D
ijWithin
Within
jiC
jifdist
1 1
1 1
)(
)(⎩⎨⎧ =
= 0
)()(
otherwise
TTijdistjif jrirMap
Within
⎪⎩
⎪⎨⎧ =
= 0
1 )(
otherwise
TTjiC jrir
Within
where