Clustering Techniques

Berlin Chen 2005

References
1. Introduction to Machine Learning, Chapter 7
2. Modern Information Retrieval, Chapters 5 and 7
3. Foundations of Statistical Natural Language Processing, Chapter 14
4. Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," U.C. Berkeley TR-97-021

IR – Berlin Chen

Clustering

• Place similar objects in the same group and assign dissimilar objects to different groups
  – Word clustering
    • Neighbor overlap: words that occur with similar left and right neighbors (such as "in" and "on")
  – Document clustering
    • Documents with similar topics or concepts are put together
• But clustering cannot give a comprehensive description of the objects
  – How should the objects shown on the visual display be labeled?
• Regarded as a kind of semiparametric learning approach
  – Allows a mixture of distributions to be used for estimating the input samples (a parametric model for each group of samples)

Clustering vs Classification

• Classification is supervised and requires a set of labeled training instances for each group (class)
  – Learning with a teacher
• Clustering is unsupervised and learns without a teacher to provide the labeling information of the training data set
  – Also called automatic or unsupervised classification

Types of Clustering Algorithms

• Two types of structures are produced by clustering algorithms
  – Flat or non-hierarchical clustering
  – Hierarchical clustering
• Flat clustering
  – Simply consists of a certain number of clusters, and the relation between clusters is often undetermined
  – Measurement: construction error minimization or probabilistic optimization
• Hierarchical clustering
  – A hierarchy with the usual interpretation that each node stands for a subclass of its mother's node
    • The leaves of the tree are the single objects
    • Each node represents the cluster that contains all the objects of its descendants
  – Measurement: similarities of instances

Hard Assignment vs Soft Assignment

• Another important distinction between clustering algorithms is whether they perform soft or hard assignment
• Hard assignment
  – Each object is assigned to one and only one cluster
• Soft assignment (probabilistic approach)
  – Each object may be assigned to multiple clusters
  – An object $x_i$ has a probability distribution $P(\cdot \mid x_i)$ over clusters $c_j$, where $P(c_j \mid x_i)$ is the probability that $x_i$ is a member of $c_j$
  – Is somewhat more appropriate in many tasks such as NLP, IR, …

Hard Assignment vs Soft Assignment (cont)

• Hierarchical clustering usually adopts hard assignment
• In flat clustering, both types of assignment are common

Summarized Attributes of Clustering Algorithms

• Hierarchical clustering
  – Preferable for detailed data analysis
  – Provides more information than flat clustering
  – No single best algorithm (each of the algorithms is optimal only for some applications)
  – Less efficient than flat clustering (at a minimum, an $n \times n$ matrix of similarity coefficients has to be computed)

Summarized Attributes of Clustering Algorithms (cont)

• Flat clustering
  – Preferable if efficiency is a consideration or data sets are very large
  – K-means is the conceptually simplest method and should probably be used first on a new data set because its results are often sufficient
  – K-means assumes a simple Euclidean representation space and so cannot be used for many data sets, e.g., nominal data like colors (or samples with features of different scales)
  – In such cases the EM algorithm is the method of choice. It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
    • Its extensions can be used to handle topological/hierarchical orders of samples
      – Probabilistic Latent Semantic Analysis (PLSA), Topic Mixture Model (TMM), etc.

Hierarchical Clustering

• Can be performed in either a bottom-up or a top-down manner
  – Bottom-up (agglomerative)
    • Start with individual objects and group the most similar ones
      – E.g., those with the minimum distance apart (distance measures will be discussed later on)

$$ sim(x, y) = \frac{1}{1 + d(x, y)} $$

    • The procedure terminates when one cluster containing all objects has been formed
  – Top-down (divisive)
    • Start with all objects in one group and divide them into groups so as to maximize within-group similarity

Hierarchical Agglomerative Clustering (HAC)

• A bottom-up approach
• Assume a similarity measure for determining the similarity of two objects
• Start with each object in a separate cluster and then repeatedly join the two clusters that are most similar, until only one cluster survives
• The history of merging/clustering forms a binary tree or hierarchy

HAC (cont)

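• Algorithm

[Figure: HAC pseudocode (not preserved in this transcript). Annotations: initialization for the tree leaves, where each object is a cluster; the loop runs over the cluster number; the two selected clusters are merged as a new cluster; the original two clusters are removed.]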
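
The agglomerative procedure described above can be sketched as follows. This is a naive illustration, not the pseudocode from the original slide: the names `hac` and `single_link_sim` are hypothetical, the objects are assumed to be one-dimensional numbers, and similarity is taken to be 1 / (1 + distance).

```python
from itertools import combinations

def hac(points, cluster_sim):
    """Naive bottom-up clustering (illustrative sketch).

    Returns the merge history as a list of (cluster_a, cluster_b) pairs,
    where each cluster is a tuple of point indices.
    """
    clusters = [(i,) for i in range(len(points))]  # each object is a cluster
    history = []
    while len(clusters) > 1:
        # select the pair of clusters with the greatest similarity
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: cluster_sim(points, pair[0], pair[1]))
        clusters.remove(a)       # the original two clusters are removed
        clusters.remove(b)
        clusters.append(a + b)   # and merged as a new cluster
        history.append((a, b))
    return history

def single_link_sim(points, a, b):
    """Similarity of the two closest members (assumed: sim = 1 / (1 + d))."""
    return max(1.0 / (1.0 + abs(points[i] - points[j])) for i in a for j in b)

merges = hac([0.0, 0.1, 5.0, 5.2, 9.0], single_link_sim)
for a, b in merges:
    print(a, "+", b)
```

The merge history is exactly the binary tree mentioned above: every entry is an internal node whose two children are the merged clusters. Scanning all cluster pairs at every step is simple but slow, so this form is only suitable for small examples.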

Distance Metrics

• Euclidean distance ($L_2$ norm)

$$ L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} $$

  – Make sure that all attributes/dimensions have the same scale (or the same variance)
• $L_1$ norm (city-block distance)

$$ L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i| $$

• Cosine similarity (transformed to a distance by subtracting it from 1)

$$ 1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} $$

  – Ranges between 0 and 1 (for vectors with non-negative components)
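
As a small illustration of the three measures above, the sketch below computes them for plain Python lists. The function names are chosen here for illustration only, and the vectors are assumed to be non-zero and of equal length.

```python
import math

def l2(x, y):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1(x, y):
    """City-block (L1) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """One minus the cosine similarity; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

x, y = [3.0, 0.0], [0.0, 4.0]
print(l2(x, y), l1(x, y), cosine_distance(x, y))
```

For these two orthogonal vectors the L2 distance is 5, the L1 distance is 7, and the cosine distance is 1, its largest value for non-negative vectors.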

Measures of Cluster Similarity

• Especially for the bottom-up approaches
• Single-link clustering
  – The similarity between two clusters is the similarity of the two closest objects in the clusters
  – Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity

$$ sim(c_i, c_j) = \max_{\vec{x} \in c_i,\ \vec{y} \in c_j} sim(\vec{x}, \vec{y}) $$

  – Elongated clusters are achieved (cf. the minimal spanning tree)

Measures of Cluster Similarity (cont)

• Complete-link clustering
  – The similarity between two clusters is the similarity of their two most dissimilar members

$$ sim(c_i, c_j) = \min_{\vec{x} \in c_i,\ \vec{y} \in c_j} sim(\vec{x}, \vec{y}) $$

  – Sphere-shaped clusters are achieved
  – Preferable for most IR and NLP applications
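
The two cluster-similarity measures differ only in taking the maximum or the minimum over cross-cluster pairs. A minimal sketch, assuming one-dimensional objects and the similarity 1 / (1 + distance) (both are assumptions made for this example):

```python
def sim(x, y):
    """Assumed object similarity: 1 / (1 + distance)."""
    return 1.0 / (1.0 + abs(x - y))

def single_link(ci, cj):
    """Similarity of the two closest objects across the clusters."""
    return max(sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj):
    """Similarity of the two most dissimilar objects across the clusters."""
    return min(sim(x, y) for x in ci for y in cj)

ci, cj = [0.0, 1.0], [3.0, 7.0]
print(single_link(ci, cj), complete_link(ci, cj))
```

Here single link is decided by the pair (1, 3) and complete link by the pair (0, 7), so the complete-link similarity is never larger than the single-link one.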

Measures of Cluster Similarity (cont)

[Figure: single-link vs. complete-link clustering]

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering
  – A compromise between single-link and complete-link clustering
  – The similarity between two clusters is the average similarity between members
  – If the objects are represented as length-normalized vectors and the similarity measure is the cosine
    • There exists a fast algorithm for computing the average similarity

$$ sim(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x} \cdot \vec{y} \quad \text{(length-normalized vectors)} $$

Measures of Cluster Similarity (cont)

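• Group-average agglomerative clustering (cont)
  – The average similarity SIM between vectors in a cluster $c_j$ is defined as

$$ SIM(c_j) = \frac{1}{|c_j|(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} sim(\vec{x}, \vec{y}) = \frac{1}{|c_j|(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \vec{x} \cdot \vec{y} $$

  – The sum of members in a cluster $c_j$

$$ \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x} $$

  – Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$

$$
\begin{aligned}
\vec{s}(c_j) \cdot \vec{s}(c_j) &= \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} \\
&= |c_j|(|c_j| - 1)\, SIM(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} \\
&= |c_j|(|c_j| - 1)\, SIM(c_j) + |c_j| \qquad (\vec{x} \cdot \vec{x} = 1 \text{ for a length-normalized vector}) \\
\therefore\ SIM(c_j) &= \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|(|c_j| - 1)}
\end{aligned}
$$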

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)
  – When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

$$ \vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j| $$

  – The average similarity for their union will be

$$ SIM(c_i \cup c_j) = \frac{\big(\vec{s}(c_i) + \vec{s}(c_j)\big) \cdot \big(\vec{s}(c_i) + \vec{s}(c_j)\big) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)} $$
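
The shortcut above can be checked numerically. The sketch below compares the sum-vector formula for the union of two clusters against a brute-force average over all ordered pairs of distinct members; the helper names and the example vectors are made up for this illustration.

```python
import math
from itertools import permutations

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def sum_vector(cluster):
    """s(c): component-wise sum of the members of a cluster."""
    return [sum(col) for col in zip(*cluster)]

def avg_sim_bruteforce(cluster):
    """Average x . y over all ordered pairs of distinct members."""
    pairs = list(permutations(cluster, 2))
    return sum(dot(x, y) for x, y in pairs) / len(pairs)

def avg_sim_fast(s, size):
    """SIM = (s . s - |c|) / (|c| (|c| - 1)), valid for unit-length members."""
    return (dot(s, s) - size) / (size * (size - 1))

ci = [normalize(v) for v in ([1, 0, 1], [2, 1, 0])]
cj = [normalize(v) for v in ([0, 1, 1], [1, 1, 1], [3, 0, 1])]

# merging only needs the two stored sum vectors and the two cluster sizes
s_new = [a + b for a, b in zip(sum_vector(ci), sum_vector(cj))]
fast = avg_sim_fast(s_new, len(ci) + len(cj))
slow = avg_sim_bruteforce(ci + cj)
print(fast, slow)
```

The fast version costs one dot product per candidate merge, whereas the brute-force version is quadratic in the size of the merged cluster.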

Example Word Clustering

• Words (objects) are described and clustered using a set of features and values
  – E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of 22 clustered words. Higher nodes indicate decreasing similarity; "be" has the least similarity with the other 21 words.]

Divisive Clustering

• A top-down approach
• Start with all objects in a single cluster
• At each iteration, select the least coherent cluster and split it
• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved
• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure
• How to split a cluster
  – Splitting is also a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm

[Figure: divisive clustering pseudocode (not preserved in this transcript). Annotations: split the least coherent cluster; generate two new clusters and remove the original one.]
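
A minimal sketch of the top-down loop, under assumptions that are not in the slides: objects are one-dimensional numbers, the least coherent cluster is taken to be the one with the largest diameter (a complete-link style measure), and a cluster is split at its largest gap. All function names are hypothetical.

```python
def diameter(cluster):
    """Assumed coherence measure: a larger diameter means less coherent."""
    return max(cluster) - min(cluster)

def split_at_largest_gap(cluster):
    """Assumed splitting rule: cut a sorted 1-D cluster at its widest gap."""
    pts = sorted(cluster)
    cut = max(range(1, len(pts)), key=lambda i: pts[i] - pts[i - 1])
    return pts[:cut], pts[cut:]

def divisive(points, k):
    """Split top-down until k clusters exist (or nothing can be split)."""
    clusters = [sorted(points)]            # start with a single cluster
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        worst = max(splittable, key=diameter)   # the least coherent cluster
        clusters.remove(worst)                  # remove the original one
        clusters.extend(split_at_largest_gap(worst))  # add two new clusters
    return clusters

result = divisive([1.0, 1.2, 5.0, 5.1, 9.0, 9.3], 3)
print(sorted(result))
```

Any of the cluster measures listed earlier could replace `diameter`, and any two-way clustering routine (for instance K-means with two centers) could replace the gap-based split.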

Non-Hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)
• Problems associated with non-hierarchical clustering
  – When to stop (criteria: group-average similarity, likelihood, mutual information)
  – What is the right number of clusters ($k-1 \rightarrow k \rightarrow k+1$); hierarchical clustering also has to face this problem
• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Defines clusters by the center of mass of their members
• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • A map from a continuous space (high resolution) to a discrete space (low resolution)

$$ \mathcal{X} = \{\mathbf{x}^t\}_{t=1}^{N} \ \xrightarrow{\ F\ }\ \text{index } j, \qquad \{\mathbf{m}_j\}_{j=1}^{k} $$

      where each $\mathbf{m}_j$ is a cluster centroid (also called a reference vector, code vector, or code word)
  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors): $\mathrm{Dim}(\mathbf{x}^t) = 24 \rightarrow k = 2^8$
    • A compression rate of 3

The K-means Algorithm (cont)

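• Total reconstruction error

$$ E\big(\{\mathbf{m}_i\}_{i=1}^{k} \mid \mathcal{X}\big) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \,\|\mathbf{x}^t - \mathbf{m}_i\|^2, \qquad \text{where } b_i^t = \begin{cases} 1 & \text{if } \|\mathbf{x}^t - \mathbf{m}_i\| = \min_j \|\mathbf{x}^t - \mathbf{m}_j\| \\ 0 & \text{otherwise} \end{cases} $$

  – The labels $b_i^t$ and the centers $\mathbf{m}_i$ are unknown
  – $b_i^t$ depends on $\mathbf{m}_i$, and this optimization problem cannot be solved analytically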

The K-means Algorithm (cont)

• Initialization
  – A set of initial cluster centers $\{\mathbf{m}_i\}_{i=1}^{k}$ is needed
• Recursion
  – Assign each object $\mathbf{x}^t$ to the cluster whose center is closest

$$ b_i^t = \begin{cases} 1 & \text{if } \|\mathbf{x}^t - \mathbf{m}_i\| = \min_j \|\mathbf{x}^t - \mathbf{m}_j\| \\ 0 & \text{otherwise} \end{cases} $$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$ \mathbf{m}_i = \frac{\sum_{t=1}^{N} b_i^t\, \mathbf{x}^t}{\sum_{t=1}^{N} b_i^t} $$

    • Or use the medoid as the cluster center (a medoid is one of the objects in the cluster)
• These two steps are repeated until $\mathbf{m}_i$ stabilizes

The K-means Algorithm (cont)

• Algorithm

[Figure: K-means pseudocode (not preserved in this transcript)]
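
The two alternating steps (assign each object to its closest center, then recompute each center as the mean of its members) can be sketched as below. This is an illustrative implementation rather than the pseudocode from the slide; the function name `kmeans`, the `max_iter` cap, and the rule of keeping the old center for an empty cluster are assumptions.

```python
import math

def kmeans(xs, centers, max_iter=100):
    """Hard-assignment K-means on a list of equal-length tuples."""
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        # assignment step: b_i^t = 1 for the closest center, 0 otherwise
        groups = [[] for _ in centers]
        for x in xs:
            i = min(range(len(centers)),
                    key=lambda idx: math.dist(x, centers[idx]))
            groups[i].append(x)
        # update step: each center becomes the mean of its members
        # (assumption: an empty cluster keeps its previous center)
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:   # the centers have stabilized
            break
        centers = new_centers
    return centers, groups

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, groups = kmeans(data, [(0.0, 0.0), (1.0, 0.0)])
print(centers)
```

With these seeds the centers move to (0, 1) and (10, 1) after one update and stay there on the next pass, which ends the loop.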

The K-means Algorithm (cont)

• Example 1

[Figure not preserved in this transcript]

• Example 2

[Figure not preserved in this transcript; surviving labels: government, finance, sports, research, name]

The K-means Algorithm (cont)

• Choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean $\mathbf{m}$ of all data and generate the $k$ initial centers $\mathbf{m}_i$ by adding a small random vector to the mean: $\mathbf{m}_i = \mathbf{m} \pm \boldsymbol{\delta}$
  – Project the data onto the principal component (first eigenvector), divide its range into $k$ equal intervals, and take the mean of the data in each group as the initial center $\mathbf{m}_i$
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set
• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb the objects slightly
• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an $l$-dimensional space/hypercube, with $l = \log_2 k$ ($k$ clusters); the clusters become nodes on the hypercube, which can then be fed to a linear classifier

The K-means Algorithm (cont)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray

[Figure: LBG codebook splitting. The global mean is split into the cluster 1 mean and the cluster 2 mean, which are split again into {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}; M → 2M at each iteration.]

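The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

[Figure: a mixture model (a mixture of Gaussians, or a mixture-Gaussian HMM). The observation $\vec{x}_i$ is generated by summing the component densities $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \ldots, P(\vec{x}_i \mid c_K)$ weighted by $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$.]

$$ P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) $$

• Continuous case (Gaussian components, $m$-dimensional $\vec{x}_i$)

$$ P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right) $$

• Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)

$$ P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) $$

• Classification

$$ \arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) $$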


Maximum Likelihood Estimation

• Hard Assignment

[Figure: four observations, two B and two W, all assigned to state S1]

P(B|S1) = 2/4 = 0.5
P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

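• Soft Assignment

[Figure: four observations softly assigned to states S1 and S2 with weights (S1, S2) of (0.7, 0.3), (0.4, 0.6), (0.9, 0.1), and (0.5, 0.5)]

P(B|S1) = (0.7+0.9) / (0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
P(W|S1) = (0.4+0.5) / (0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
P(B|S2) = (0.3+0.1) / (0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27
P(W|S2) = (0.6+0.5) / (0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73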
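
The soft-assignment estimates above are weighted relative frequencies, which the short sketch below recomputes. The pairing of each observation with B or W is inferred from the numerators on the slide, and the function name `soft_mle` is hypothetical.

```python
# (symbol, weight for S1, weight for S2) for the four observations;
# the symbol of each observation is inferred from the sums on the slide
obs = [("B", 0.7, 0.3), ("W", 0.4, 0.6), ("B", 0.9, 0.1), ("W", 0.5, 0.5)]

def soft_mle(obs, symbol, state):
    """Weighted relative frequency of `symbol` in `state` ("S1" or "S2")."""
    idx = 1 if state == "S1" else 2
    total = sum(o[idx] for o in obs)
    return sum(o[idx] for o in obs if o[0] == symbol) / total

for state in ("S1", "S2"):
    for symbol in ("B", "W"):
        print(f"P({symbol}|{state}) = {soft_mle(obs, symbol, state):.2f}")
```

Setting every weight to 0 or 1 turns the same function into the hard-assignment estimate, where each observation counts fully for exactly one state.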

The EM Algorithm (cont)

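• E-step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \\
&= \big[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \big] \times \cdots \times \big[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \big] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
$$

    where $X = \vec{x}_1 \vec{x}_2 \cdots \vec{x}_{n-1} \vec{x}_n$, $C = c_{k_1} c_{k_2} \cdots c_{k_{n-1}} c_{k_n}$, $P(X \mid \Theta)$ is the likelihood function, and $P(X, C \mid \Theta)$ is the complete data likelihood function
  – How many kinds of $C$ are there? $K^n$ kinds
  – Note:

$$ \prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t} $$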

The EM Algorithm (cont)

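• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$

$$ \Phi(\Theta, \hat{\Theta}) = E_C\big[ L_{CM} \mid X, \Theta \big] = E_C\big[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \big] = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) $$

    where $\Theta$ is known and $\hat{\Theta}$ is unknown
  – Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$ Q(\Theta, \hat{\Theta}) - Q(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta) $$

      (here $Q$ denotes the auxiliary function $\Phi$)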

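The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \right] \log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i, k} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \big[ P(c_k \mid \hat{\Theta})\, P(\vec{x}_i \mid c_k, \hat{\Theta}) \big] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

    where

$$ \delta_{k_i, k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases} $$

    The term in braces reduces to $P(c_k \mid \vec{x}_i, \Theta)$ (see next slide)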

The EM Algorithm (cont)

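  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{\substack{j=1 \\ j \neq i}}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{\substack{j=1 \\ j \neq i}}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

    because $\vec{x}_i$ can only be aligned to $c_k$ (the delta removes the sum over $k_i$) and each remaining inner sum equals 1
  – Note:

$$ \prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t} $$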

The EM Algorithm (cont)

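• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$ \Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta}) $$

    where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

    is the auxiliary function for the mixture weights, and

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

    is the auxiliary function for the cluster distributions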

The EM Algorithm (cont)

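• M-step (Maximization)
  – Remember that
    • A function $F$ can be maximized under a constraint by applying a Lagrange multiplier

$$
\begin{aligned}
&\text{Suppose that } F = \sum_{j=1}^{N} w_j \log y_j \text{ with the constraint } \sum_{j=1}^{N} y_j = 1 \\
&\text{By applying a Lagrange multiplier } \ell: \quad \hat{F} = \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
&\frac{\partial \hat{F}}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0 \ \Rightarrow\ w_j = -\ell\, y_j \quad \forall j \\
&\sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j \ \Rightarrow\ \ell = -\sum_{j=1}^{N} w_j \\
&\therefore\ y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

    • Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$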

The EM Algorithm (cont)

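• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

$$ \Rightarrow\ \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\sum_{k'=1}^{K} \sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}} = \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} $$

    where $w_k$ is the expected number of times that the $\vec{x}_i$ fall in class $c_k$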

The EM Algorithm (cont)

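• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta}) $$

$$ P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\, |\hat{\Sigma}_k|^{1/2}} \exp\!\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right) $$

  – Let

$$ w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \quad \text{and} \quad \log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) $$

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D $$

    where $D$ is a constant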

The EM Algorithm (cont)

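• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D $$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow\ \hat{\vec{\mu}}_k &= \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

    where the denominator $\sum_i w_{ik}$ is the expected number of times that the $\vec{x}_i$ fall in class $c_k$
  – Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\,\vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here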

The EM Algorithm (cont)

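• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D $$

  – Note: $\dfrac{d \det(X)}{dX} = \det(X) \cdot X^{-T}$ and $\dfrac{d(a^T X^{-1} b)}{dX} = -X^{-T} a\, b^T X^{-T}$, and $\hat{\Sigma}_k$ is symmetric here

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{|\hat{\Sigma}_k|} \cdot |\hat{\Sigma}_k| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
\Rightarrow\ \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} &= \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
\Rightarrow\ \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow\ \hat{\Sigma}_k \cdot \sum_{i=1}^{n} w_{ik} &= \sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
\Rightarrow\ \hat{\Sigma}_k &= \frac{\sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$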

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm
• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
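
To make the E-step and M-step concrete, here is a minimal sketch for a mixture of one-dimensional Gaussians. It simplifies the multivariate derivation to scalar variances; the function names, the toy data, the initial parameters, and the fixed number of 20 iterations are assumptions made for this example.

```python
import math

def gauss(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_step(xs, pis, mus, variances):
    """One EM iteration for a 1-D Gaussian mixture."""
    K, n = len(pis), len(xs)
    # E-step: w[i][k] = P(c_k | x_i, Theta), the posterior of each cluster
    w = []
    for x in xs:
        joint = [pis[k] * gauss(x, mus[k], variances[k]) for k in range(K)]
        total = sum(joint)
        w.append([j / total for j in joint])
    # M-step: closed-form updates for weights, means, and variances
    nk = [sum(w[i][k] for i in range(n)) for k in range(K)]
    new_pis = [nk[k] / n for k in range(K)]
    new_mus = [sum(w[i][k] * xs[i] for i in range(n)) / nk[k] for k in range(K)]
    new_vars = [
        sum(w[i][k] * (xs[i] - new_mus[k]) ** 2 for i in range(n)) / nk[k]
        for k in range(K)
    ]
    return new_pis, new_mus, new_vars

def log_likelihood(xs, pis, mus, variances):
    """log P(X | Theta) for i.i.d. samples."""
    return sum(
        math.log(sum(p * gauss(x, m, v) for p, m, v in zip(pis, mus, variances)))
        for x in xs
    )

xs = [0.8, 1.0, 1.3, 0.9, 5.1, 4.8, 5.3, 5.0]
params = ([0.5, 0.5], [0.0, 6.0], [1.0, 1.0])   # initial (pis, mus, variances)
history = [log_likelihood(xs, *params)]
for _ in range(20):
    params = em_step(xs, *params)
    history.append(log_likelihood(xs, *params))
print(params[1], history[-1])
```

The recorded log-likelihood never decreases from one iteration to the next, which is the property the auxiliary function is designed to guarantee. A practical implementation would also guard against a variance collapsing toward zero and would stop once the improvement falls below a threshold.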

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

$$ P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] $$

$$ P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right], \qquad dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} $$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
• Related documents are in the same cluster, and the relationships among the clusters are reflected by the distance on the map
• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]
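
The neighborhood weights in the model above depend only on where the topics sit on the two-dimensional map. A minimal sketch, assuming a Gaussian kernel over Euclidean map distance that is normalized over all topics; the function name and the 2 x 2 grid are made up for this illustration.

```python
import math

def neighbor_weights(coords, sigma):
    """Return rows[k][l], a Gaussian neighborhood weight normalized over l."""
    def kernel(a, b):
        d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2   # squared map distance
        return math.exp(-d2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    rows = []
    for a in coords:
        es = [kernel(a, b) for b in coords]
        total = sum(es)
        rows.append([v / total for v in es])
    return rows

# four topics placed on a 2 x 2 map (hypothetical layout)
P = neighbor_weights([(0, 0), (0, 1), (1, 0), (1, 1)], sigma=1.0)
print([round(p, 3) for p in P[0]])
```

Each row sums to one, and topics that are closer on the map receive larger weights, so a document's topic mass is shared mostly with neighboring topics.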

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$ L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\} $$

  – EM training can be performed

$$ \hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})} $$

$$ \hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)} $$

    where

$$ P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'}) \right] P(T_{k'} \mid D_i) \right\}} $$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$ S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \big[ 1 - P(T_k \mid D_{i'}) \big]} $$

Hierarchical Document Organization (cont)

• Example

[Figures not preserved in this transcript]

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  – A recursive regression process

[Figure: an input layer fully connected to a mapping layer. Input vector $x = [x_1, x_2, \ldots, x_n]^T$; weight vector of map unit $i$: $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, e.g. $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$.]

$$ m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,[x(t) - m_i(t)] $$

    where

$$ c(x) = \arg\min_i \|x - m_i\|, \qquad \|x - m_i\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2} $$

$$ h_{c(x),i}(t) = \alpha(t) \exp\!\left( -\frac{\|r_i - r_{c(x)}\|^2}{2\sigma^2(t)} \right) $$
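
One step of the SOM update rule above can be sketched as follows: find the best-matching unit, then pull every unit toward the input with a strength that decays with its map distance from that unit. The function name `som_step`, the three-unit map, and the fixed `alpha` and `sigma` values are assumptions for this example.

```python
import math

def som_step(weights, coords, x, alpha, sigma):
    """Apply m_i(t+1) = m_i(t) + h_{c(x),i}(t) [x(t) - m_i(t)] once."""
    # best-matching unit c(x) = argmin_i ||x - m_i||
    c = min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
    updated = []
    for m, r in zip(weights, coords):
        # neighborhood function: largest at the winner, decaying on the map
        h = alpha * math.exp(-math.dist(r, coords[c]) ** 2 / (2 * sigma ** 2))
        updated.append([mj + h * (xj - mj) for mj, xj in zip(m, x)])
    return c, updated

weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]   # weight vectors m_i
coords = [(0, 0), (1, 0), (2, 0)]                # unit positions r_i on the map
c, new_weights = som_step(weights, coords, [1.0, 2.0], alpha=0.5, sigma=1.0)
print(c, new_weights)
```

In training, `alpha` and `sigma` would normally shrink over time, so early steps organize the map globally and later steps fine-tune individual units.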

Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

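$$ R_{dist} = \frac{dist_{Between}}{dist_{Within}} $$

where

$$ dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases} $$

$$ dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}, \qquad f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases} $$

$$ dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} $$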

ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN 
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice


IR – Berlin Chen 2

Clustering

• Place similar objects in the same group and assign dissimilar objects to different groups
  – Word clustering
    • Neighbor overlap: words that occur with similar left and right neighbors (such as "in" and "on")
  – Document clustering
    • Documents with similar topics or concepts are put together

• But clustering cannot give a comprehensive description of the objects
  – How should the objects shown on the visual display be labeled?

• Regarded as a kind of semiparametric learning approach
  – Allows a mixture of distributions to be used for estimating the input samples (a parametric model for each group of samples)

Clustering vs. Classification

• Classification is supervised and requires a set of labeled training instances for each group (class)
  – Learning with a teacher

• Clustering is unsupervised and learns without a teacher providing labeling information for the training data set
  – Also called automatic or unsupervised classification

Types of Clustering Algorithms

• Two types of structures are produced by clustering algorithms
  – Flat or non-hierarchical clustering
  – Hierarchical clustering

• Flat clustering
  – Simply consists of a certain number of clusters; the relation between clusters is often undetermined
  – Measurement: reconstruction error minimization or probabilistic optimization

• Hierarchical clustering
  – A hierarchy with the usual interpretation that each node stands for a subclass of its mother node
    • The leaves of the tree are the single objects
    • Each node represents the cluster that contains all the objects of its descendants
  – Measurement: similarities of instances

Hard Assignment vs. Soft Assignment

• Another important distinction between clustering algorithms is whether they perform soft or hard assignment

• Hard assignment
  – Each object is assigned to one and only one cluster

• Soft assignment (probabilistic approach)
  – Each object may be assigned to multiple clusters
  – An object $x_i$ has a probability distribution $P(\cdot \mid x_i)$ over clusters $c_j$, where $P(c_j \mid x_i)$ is the probability that $x_i$ is a member of $c_j$
  – Is somewhat more appropriate in many tasks such as NLP, IR, …

Hard Assignment vs. Soft Assignment (cont.)

• Hierarchical clustering usually adopts hard assignment

• In flat clustering, both types of assignment are common

Summarized Attributes of Clustering Algorithms

• Hierarchical Clustering
  – Preferable for detailed data analysis
  – Provides more information than flat clustering
  – No single best algorithm (each algorithm is optimal only for some applications)
  – Less efficient than flat clustering (at minimum, an n × n matrix of similarity coefficients has to be computed)

Summarized Attributes of Clustering Algorithms (cont.)

• Flat Clustering
  – Preferable if efficiency is a consideration or data sets are very large
  – K-means is the conceptually simplest method and should probably be used first on a new data set because its results are often sufficient
  – K-means assumes a simple Euclidean representation space and so cannot be used for many data sets, e.g., nominal data like colors (or samples with features of different scales)
  – The EM algorithm is the most flexible choice: it can accommodate the definition of clusters and the allocation of objects based on complex probabilistic models
    • Its extensions can be used to handle topological/hierarchical orders of samples
      – Probabilistic Latent Semantic Analysis (PLSA), Topic Mixture Model (TMM), etc.

IR ndash Berlin Chen 9

Hierarchical Clustering

Hierarchical Clustering

• Can be performed in either a bottom-up or a top-down manner
  – Bottom-up (agglomerative)
    • Start with individual objects and group the most similar ones
      – E.g., those with the minimum distance apart
    • The procedure terminates when one cluster containing all objects has been formed
  – Top-down (divisive)
    • Start with all objects in one group and divide them into groups so as to maximize within-group similarity

• Similarity can be derived from a distance (distance measures will be discussed later on):

$$sim(x, y) = \frac{1}{1 + d(x, y)}$$

Hierarchical Agglomerative Clustering (HAC)

• A bottom-up approach

• Assume a similarity measure for determining the similarity of two objects

• Start with each object in a separate cluster and then repeatedly join the two most similar clusters until only one cluster survives

• The history of merging/clustering forms a binary tree or hierarchy

HAC (cont.)

• Algorithm

[Figure: HAC pseudocode. Annotations: cluster number; initialization (for tree leaves), where each object is a cluster; merged as a new cluster; the original two clusters are removed]
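
The annotated steps (initialize one cluster per object, merge the most similar pair as a new cluster, remove the original two) can be sketched as follows. This is a minimal illustration, not the slide's own pseudocode: the function names `hac` and `sim_1d`, the use of single-link similarity between clusters, and the one-dimensional sample data are all assumptions made for the example.

```python
from itertools import combinations

def hac(points, sim):
    """Bottom-up clustering; returns the merge history as (id_a, id_b, new_id)."""
    # Initialization (tree leaves): each object is its own cluster.
    clusters = {i: [p] for i, p in enumerate(points)}
    next_id = len(points)
    history = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the greatest similarity
        # (single-link is assumed here: similarity of the two closest members).
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: max(sim(x, y)
                                 for x in clusters[pair[0]]
                                 for y in clusters[pair[1]]),
        )
        # Merge them as a new cluster; the original two clusters are removed.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        history.append((a, b, next_id))
        next_id += 1
    return history

def sim_1d(x, y):
    # Similarity derived from distance: sim = 1 / (1 + d).
    return 1.0 / (1.0 + abs(x - y))

merges = hac([0.0, 0.1, 5.0, 5.2, 9.0], sim_1d)
print(merges)
```

The merge history is the binary tree: n objects always produce n − 1 merges, and each new id stands for an internal node.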

Distance Metrics

• Euclidean distance (L2 norm)

$$L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

  – Make sure that all attributes/dimensions have the same scale (or the same variance)

• L1 norm (city-block distance)

$$L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$$

• Cosine similarity (transform to a distance by subtracting from 1)

$$1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} \quad \text{(ranged between 0 and 1)}$$
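
The three measures can be written directly from their definitions. The sketch below is only an illustration; the function names and the two sample vectors are assumptions for the example, not part of the slides.

```python
import math

def l2_distance(x, y):
    # Euclidean distance: square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    # City-block distance: sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    # One minus the cosine of the angle between the two vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

x, y = [3.0, 0.0], [0.0, 4.0]
print(l2_distance(x, y))      # 5.0
print(l1_distance(x, y))      # 7.0
print(cosine_distance(x, y))  # 1.0 (orthogonal vectors)
```

The cosine distance stays within [0, 1] only when all vector components are non-negative, as is typical for term-weight vectors.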

Measures of Cluster Similarity

• Especially for the bottom-up approaches

• Single-link clustering
  – The similarity between two clusters is the similarity of the two closest objects in the clusters
  – Search over all pairs of objects that come from the two different clusters and select the pair with the greatest similarity

$$sim(c_i, c_j) = \max_{\vec{x} \in c_i,\ \vec{y} \in c_j} sim(\vec{x}, \vec{y})$$

  – Elongated clusters are achieved (cf. the minimal spanning tree)

[Figure: two clusters C_i and C_j joined through the pair of objects with the greatest similarity]

Measures of Cluster Similarity (cont.)

• Complete-link clustering
  – The similarity between two clusters is the similarity of their two most dissimilar members

$$sim(c_i, c_j) = \min_{\vec{x} \in c_i,\ \vec{y} \in c_j} sim(\vec{x}, \vec{y})$$

  – Sphere-shaped clusters are achieved
  – Preferable for most IR and NLP applications

[Figure: two clusters C_i and C_j compared through the pair of objects with the least similarity]
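
The single-link and complete-link measures differ only in taking the maximum or the minimum over all cross-cluster pairs. A small sketch, assuming one-dimensional objects and the hypothetical similarity `sim_1d` derived from distance as 1 / (1 + d):

```python
def single_link(ci, cj, sim):
    # Similarity of the two closest objects (greatest pairwise similarity).
    return max(sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj, sim):
    # Similarity of the two most dissimilar members (least pairwise similarity).
    return min(sim(x, y) for x in ci for y in cj)

def sim_1d(x, y):
    return 1.0 / (1.0 + abs(x - y))

ci, cj = [0.0, 1.0], [2.0, 5.0]
print(single_link(ci, cj, sim_1d))    # 0.5, from the pair (1.0, 2.0)
print(complete_link(ci, cj, sim_1d))  # 1/6, from the pair (0.0, 5.0)
```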

Measures of Cluster Similarity (cont.)

[Figure: clusterings produced by single link and by complete link]

Measures of Cluster Similarity (cont.)

• Group-average agglomerative clustering
  – A compromise between single-link and complete-link clustering
  – The similarity between two clusters is the average similarity between members
  – If the objects are represented as length-normalized vectors and the similarity measure is the cosine

$$sim(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x} \cdot \vec{y} \quad \text{(length-normalized vectors)}$$

    • There exists a fast algorithm for computing the average similarity

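Measures of Cluster Similarity (cont.)

• Group-average agglomerative clustering (cont.)
  – The average similarity SIM between vectors in a cluster $c_j$ is defined as

$$SIM(c_j) = \frac{1}{|c_j|\,(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} sim(\vec{x}, \vec{y}) = \frac{1}{|c_j|\,(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \vec{x} \cdot \vec{y}$$

  – The sum of members in a cluster $c_j$

$$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$

  – Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$

$$\begin{aligned}
\vec{s}(c_j) \cdot \vec{s}(c_j) &= \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} \\
&= |c_j|\,(|c_j| - 1)\, SIM(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} \\
&= |c_j|\,(|c_j| - 1)\, SIM(c_j) + |c_j| \qquad (\vec{x} \cdot \vec{x} = 1 \text{ for a length-normalized vector}) \\
\therefore\ SIM(c_j) &= \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|\,(|c_j| - 1)}
\end{aligned}$$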

Measures of Cluster Similarity (cont.)

• Group-average agglomerative clustering (cont.)
  – When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

$$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$

  – The average similarity for their union will be

$$SIM(c_i \cup c_j) = \frac{\left(\vec{s}(c_i) + \vec{s}(c_j)\right) \cdot \left(\vec{s}(c_i) + \vec{s}(c_j)\right) - \left(|c_i| + |c_j|\right)}{\left(|c_i| + |c_j|\right)\left(|c_i| + |c_j| - 1\right)}$$

[Figure: clusters C_i and C_j with their sum vectors s(c_i) and s(c_j)]
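
The benefit of the sum-vector form is that the average similarity of a merged cluster needs only the two stored sums and sizes, not all member pairs. The sketch below compares it with a brute-force average over all pairs; the helper names and the four sample vectors are assumptions chosen for illustration.

```python
import math

def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def vec_sum(cluster):
    # s(c): component-wise sum of the member vectors.
    return [sum(col) for col in zip(*cluster)]

def avg_sim_brute(cluster):
    # Average of x . y over all ordered pairs with x != y.
    n = len(cluster)
    total = sum(dot(x, y)
                for i, x in enumerate(cluster)
                for j, y in enumerate(cluster) if i != j)
    return total / (n * (n - 1))

def avg_sim_fast(sum_vec, size):
    # SIM(c) = (s(c) . s(c) - |c|) / (|c| (|c| - 1)), valid for unit-length members.
    return (dot(sum_vec, sum_vec) - size) / (size * (size - 1))

ci = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0])]
cj = [normalize(v) for v in ([0.0, 1.0], [1.0, 2.0])]

# Merging only needs the two sum vectors and the two sizes.
s_new = [a + b for a, b in zip(vec_sum(ci), vec_sum(cj))]
size_new = len(ci) + len(cj)

fast = avg_sim_fast(s_new, size_new)
brute = avg_sim_brute(ci + cj)
print(fast, brute)
```

Both values agree up to floating-point rounding, while the fast form costs one dot product instead of a pass over all pairs.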

Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values
  – E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of 22 words; "be" has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont.)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure

• How to split a cluster
  – This is also a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont.)

• Algorithm

[Figure: divisive clustering pseudocode. Annotations: split the least coherent cluster; generate two new clusters and remove the original one]

IR ndash Berlin Chen 24

Non-Hierarchical Clustering

Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  – When to stop (e.g., based on group-average similarity, likelihood, or mutual information)
  – What is the right number of clusters (k−1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Defines clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Maps from a continuous space (high resolution) to a discrete space (low resolution)

$$X = \{\vec{x}^t\}_{t=1}^{N} \ \xrightarrow{\ \text{index } j\ }\ F = \{\vec{m}_j\}_{j=1}^{k}$$

      where $\vec{m}_j$ is the cluster centroid (also called reference vector, code vector, or code word)
  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., Dim($\vec{x}^t$) = 24 → k = 2^8
    • A compression rate of 3

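The K-means Algorithm (cont.)

• Total reconstruction error

$$E\left(\{\vec{m}_i\}_{i=1}^{k} \mid X\right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \,\|\vec{x}^t - \vec{m}_i\|^2, \quad \text{where } b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t - \vec{m}_i\| = \min_j \|\vec{x}^t - \vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$

  – $b_i^t$ (the label) and $\vec{m}_i$ are unknown
  – $b_i^t$ depends on $\vec{m}_i$, and this optimization problem cannot be solved analytically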

The K-means Algorithm (cont.)

• Initialization
  – A set of initial cluster centers $\{\vec{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  – Assign each object $\vec{x}^t$ to the cluster whose center is closest

$$b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t - \vec{m}_i\| = \min_j \|\vec{x}^t - \vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$\vec{m}_i = \frac{\sum_{t=1}^{N} b_i^t \,\vec{x}^t}{\sum_{t=1}^{N} b_i^t}$$

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

These two steps are repeated until $\vec{m}_i$ stabilizes

The K-means Algorithm (cont.)

• Algorithm
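
A minimal sketch of the two alternating steps (assign each object to the closest center, then re-compute each center as the mean of its members) is given below. It is an illustration only: the function name `kmeans`, the stopping rule (labels unchanged), the explicitly supplied initial centers, and the four sample points are assumptions, not the pseudocode shown on the slide.

```python
def kmeans(points, centers, max_iter=100):
    """Hard clustering: returns (centers, labels) once the assignment is stable."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: label each point with the index of the closest center.
        new_labels = [
            min(range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            for p in points
        ]
        if new_labels == labels:
            break  # no label changed, so the centers have stabilized
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)  # [[1.0, 1.5], [8.5, 8.0]]
print(labels)   # [0, 0, 1, 1]
```

Ties are broken here by taking the lowest cluster index, which is one simple choice among several.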

The K-means Algorithm (cont.)

• Example 1

The K-means Algorithm (cont.)

• Example 2

[Figure: word clusters labeled government, finance, sports, research, and name]

The K-means Algorithm (cont.)

• The choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean $\vec{m}$ of all data and generate the k initial centers $\vec{m}_i$ by adding a small random vector to the mean: $\vec{m}_i = \vec{m} \pm \vec{\delta}$
  – Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $\vec{m}_i$
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of that of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont.)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb the objects slightly

• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with l = log2 k (k clusters)

[Figure: the clusters become nodes on the hypercube, followed by a linear classifier]

The K-means Algorithm (cont.)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray

[Figure: the global mean is split into the Cluster 1 mean and the Cluster 2 mean, which are split again into four clusters with parameters {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}; M → 2M at each iteration]

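The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

[Figure: a mixture model in which the component densities $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \ldots, P(\vec{x}_i \mid c_K)$ are weighted by $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$ and summed: a mixture Gaussian HMM (or a mixture of Gaussians)]

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

• Continuous case

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k)\right)$$

• Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ (the $\vec{x}_i$'s are independent and identically distributed, i.i.d.)

$$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

• Classification

$$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$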
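
To make the mixture density and the classification rule concrete, the sketch below evaluates them for a one-dimensional mixture of two Gaussians. The one-dimensional simplification, the function names, and the parameter values are assumptions made for the example.

```python
import math

def gaussian_pdf(x, mean, var):
    # One-dimensional Gaussian density (the m = 1 case of the formula above).
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def joint(x, priors, means, variances):
    # P(x | c_k) P(c_k) for every component k.
    return [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]

def mixture_density(x, priors, means, variances):
    # P(x) = sum over k of P(x | c_k) P(c_k)
    return sum(joint(x, priors, means, variances))

def posteriors(x, priors, means, variances):
    # Soft assignment: P(c_k | x) for every component k.
    j = joint(x, priors, means, variances)
    total = sum(j)
    return [v / total for v in j]

def classify(x, priors, means, variances):
    # Hard decision: the denominator P(x) does not depend on k, so it is dropped.
    j = joint(x, priors, means, variances)
    return max(range(len(j)), key=j.__getitem__)

priors, means, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
post = posteriors(1.0, priors, means, variances)
label = classify(1.0, priors, means, variances)
print(post, label)
```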

IR ndash Berlin Chen 36

The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment

  State S1
  P(B | S1) = 2/4 = 0.5
  P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

  Sample   State S1   State S2
  1        0.7        0.3
  2        0.4        0.6
  3        0.9        0.1
  4        0.5        0.5

  P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64
  P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36
  P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 = 0.27
  P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 = 0.73
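
The soft-assignment estimates are weighted relative frequencies: each sample contributes its membership weight instead of a count of one. The sketch below reproduces the arithmetic; the colors of the four samples (B, W, B, W) are inferred from which weights appear in the numerators above, and the data layout and function name are assumptions.

```python
# (color, weight for state S1, weight for state S2)
samples = [("B", 0.7, 0.3), ("W", 0.4, 0.6), ("B", 0.9, 0.1), ("W", 0.5, 0.5)]

def soft_mle(samples, state):
    """Return (P(B | state), P(W | state)); state is 1 for S1 and 2 for S2."""
    total = sum(s[state] for s in samples)
    black = sum(s[state] for s in samples if s[0] == "B")
    return black / total, (total - black) / total

p_b_s1, p_w_s1 = soft_mle(samples, 1)
p_b_s2, p_w_s2 = soft_mle(samples, 2)
print(round(p_b_s1, 2), round(p_w_s1, 2))  # 0.64 0.36
print(round(p_b_s2, 2), round(p_w_s2, 2))  # 0.27 0.73
```

With weights of exactly 0 or 1 the same function reduces to the hard-assignment count ratio.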

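The EM Algorithm (cont.)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}$$

    where $X = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_{n-1}, \vec{x}_n$ and $C = c_{k_1}, c_{k_2}, \ldots, c_{k_{n-1}}, c_{k_n}$; $P(X \mid \Theta)$ is the likelihood function and $P(X, C \mid \Theta)$ is the complete data likelihood function

  – How many kinds of C are there? ($K^n$ kinds)

Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$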
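
The product-of-sums identity in the note, and the count of $K^n$ alignments, can be checked numerically on a tiny case. The matrix `a` below is an arbitrary example with T = 3 rows and M = 2 columns; all names are assumptions for illustration.

```python
import math
from itertools import product

# a[t][k]: an arbitrary T x M table (T = 3, M = 2).
a = [[1, 2], [3, 4], [5, 6]]
T, M = len(a), len(a[0])

# Left-hand side: product over t of the row sums.
lhs = math.prod(sum(row) for row in a)

# Right-hand side: sum over every assignment (k_1, ..., k_T) of the product a[t][k_t].
assignments = list(product(range(M), repeat=T))
rhs = sum(math.prod(a[t][k] for t, k in enumerate(ks)) for ks in assignments)

print(lhs, rhs)          # 231 231
print(len(assignments))  # 8, i.e. M ** T
```

The number of assignments grows exponentially, which is why the E-step works with per-sample posteriors rather than enumerating every C.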

The EM Algorithm (cont.)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable C, conditioned on the known data $(X, \Theta)$ ($\Theta$ is known; $\hat{\Theta}$ is unknown)

$$\Phi(\Theta, \hat{\Theta}) = E_C\left[ L_{CM} \right] = E_C\left[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \right] = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})$$

  – Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$

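The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \right] \log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left\{ \left[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i, k} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \right] \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(c_k \mid \hat{\Theta})\, P(\vec{x}_i \mid c_k, \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}$$

    where

$$\delta_{k_i, k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}$$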

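The EM Algorithm (cont.)

  – Note that

$$\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{\substack{j=1 \\ j \neq i}}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{\substack{j=1 \\ j \neq i}}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{\substack{j=1 \\ j \neq i}}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}$$

    ($\delta_{k_i, k}$ means that $\vec{x}_i$ can only be aligned to $c_k$)

Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$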

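The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

    where

$$\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \qquad \text{(auxiliary function for mixture weights)}
\end{aligned}$$

$$\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \qquad \text{(auxiliary function for cluster distributions)}
\end{aligned}$$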

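The EM Algorithm (cont.)

• M-step (Maximization)
  – Remember that
    • A function F can be maximized under a constraint by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$, with the constraint $\sum_{j=1}^{N} y_j = 1$.

By applying a Lagrange multiplier $\ell$:

$$\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \ \Rightarrow\ y_j = -\frac{w_j}{\ell} \quad \forall j \\
\sum_{j=1}^{N} y_j &= -\frac{\sum_{j=1}^{N} w_j}{\ell} = 1 \ \Rightarrow\ \ell = -\sum_{j=1}^{N} w_j \\
\therefore\ y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$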
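
The closed-form result $y_j = w_j / \sum_{j'} w_{j'}$ can be sanity-checked numerically: no other point on the simplex should give a larger value of $F$. The sketch below does this with random candidates; the weights, the seed, and the number of trials are arbitrary assumptions.

```python
import math
import random

def objective(w, y):
    # F = sum over j of w_j * log(y_j)
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

w = [2.0, 1.0, 1.0]
y_star = [wj / sum(w) for wj in w]  # Lagrange-multiplier solution
best = objective(w, y_star)

rng = random.Random(0)
candidates = []
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in w]
    candidates.append([r / sum(raw) for r in raw])  # satisfies sum(y) = 1

never_beaten = all(objective(w, y) <= best + 1e-12 for y in candidates)
print(y_star)        # [0.5, 0.25, 0.25]
print(never_beaten)  # True
```

A random search is not a proof, but it illustrates the same pattern that the M-step uses for the mixture weights.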

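The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}$$

$$\Rightarrow\ \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k=1}^{K} w_k} = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle \sum_{k=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}} = \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

    ($w_k$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$)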

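The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

$$P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\, |\hat{\Sigma}_k|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)\right)$$

Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

and

$$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

Then

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D \qquad (D \text{ is a constant})$$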

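The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0$$

$$\Rightarrow\ \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}$$

    (the denominator is the expected number of times that $\vec{x}_i$ falls in class $c_k$)

Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\,\vec{x}$, and $\Sigma_k^{-1}$ is symmetric here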

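The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
&\Rightarrow\ \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}$$

Notes ($\Sigma_k$ is symmetric here):

$$\frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \qquad \frac{d\,(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-T}\, \vec{a}\, \vec{b}^T X^{-T}$$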

IR ndash Berlin Chen 49

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(\mathcal{X} \mid \Theta)$ has converged or the maximum number of iterations is reached
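The EM loop for a mixture of Gaussians can be sketched as follows. This is a minimal illustration for the one-dimensional case only (scalar variances instead of covariance matrices), written with the Python standard library. The function names, the `tol` and `min_var` parameters, and the choice to start every component from the global variance are assumptions made for this example, not part of the slides.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, weights, means, variances):
    """log P(X | Theta) of the data under the current mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in xs
    )


def em_gmm_1d(xs, initial_means, max_iter=100, tol=1e-6, min_var=1e-6):
    """EM for a 1-D Gaussian mixture, started from given means
    (for example the centers found by K-means)."""
    n, k = len(xs), len(initial_means)
    means = list(initial_means)
    weights = [1.0 / k] * k
    # Assumption: every component starts from the global variance.
    g_mean = sum(xs) / n
    g_var = max(sum((x - g_mean) ** 2 for x in xs) / n, min_var)
    variances = [g_var] * k
    history = [log_likelihood(xs, weights, means, variances)]
    for _ in range(max_iter):
        # E-step: w_ik = posterior probability of component k given x_i.
        resp = []
        for x in xs:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weights, means and variances from the w_ik.
        for j in range(k):
            w_k = sum(r[j] for r in resp)
            weights[j] = w_k / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / w_k
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / w_k
            variances[j] = max(var, min_var)  # guard against collapse
        history.append(log_likelihood(xs, weights, means, variances))
        # Terminate when the log-likelihood has (numerically) converged.
        if history[-1] - history[-2] < tol:
            break
    return weights, means, variances, history


if __name__ == "__main__":
    random.seed(0)
    data = ([random.gauss(0.0, 1.0) for _ in range(200)]
            + [random.gauss(10.0, 1.0) for _ in range(200)])
    w, mu, var, hist = em_gmm_1d(data, [1.0, 8.0])
    print("weights:", w, "means:", mu, "iterations:", len(hist) - 1)
```

The sketch does not handle a component whose total responsibility drops to zero; a practical implementation would re-seed or drop such a component.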

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  - TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters correspond to the distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]
$$

$$
P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)},
\qquad
E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]
$$

$$
dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$
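As a small illustration of the formulas above, the neighborhood probabilities $P(T_l \mid Y_k)$ and the resulting word probability $P(w_j \mid D_i)$ can be computed as below. This is a sketch under assumptions: the topics are taken to sit at explicit 2-D map coordinates, and the function names and argument layouts are hypothetical.

```python
import math


def neighborhood_probs(coords, sigma):
    """Return a table whose row k holds P(T_l | Y_k) for every topic l.

    coords: list of (x, y) map positions, one per topic (assumed layout).
    sigma:  width of the Gaussian neighborhood function E.
    """
    table = []
    for xk, yk in coords:
        e_row = []
        for xl, yl in coords:
            dist_sq = (xk - xl) ** 2 + (yk - yl) ** 2
            e_row.append(math.exp(-dist_sq / (2.0 * sigma ** 2))
                         / (math.sqrt(2.0 * math.pi) * sigma))
        total = sum(e_row)
        table.append([e / total for e in e_row])  # normalize over s
    return table


def word_given_doc(p_topic_given_doc, neighborhood, p_word_given_topic):
    """P(w_j | D_i) = sum_k P(T_k | D_i) * sum_l P(T_l | Y_k) * P(w_j | T_l)."""
    return sum(
        p_k * sum(p_lk * p_w for p_lk, p_w in zip(neighborhood[k], p_word_given_topic))
        for k, p_k in enumerate(p_topic_given_doc)
    )


if __name__ == "__main__":
    grid = [(0, 0), (0, 1), (1, 0), (1, 1)]  # a 2 x 2 topic map
    table = neighborhood_probs(grid, sigma=1.0)
    print(word_given_doc([0.7, 0.1, 0.1, 0.1], table, [0.2, 0.0, 0.0, 0.0]))
```

A small `sigma` makes each row of the table close to a one-hot vector, so the model approaches an ordinary topic mixture; a larger `sigma` shares probability mass with neighboring topics on the map.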

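Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
    &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}
$$

  - EM training can be performed

$$
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}
{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}
$$

$$
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}
$$

where

$$
P(T_k \mid w_j, D_i) =
\frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid T_k) \right] \cdot P(T_k \mid D_i)}
{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid T_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}
$$

and $P(T_l \mid T_k)$ denotes the neighborhood probability written $P(T_l \mid Y_k)$ in the model definition; $c(w_j, D_i)$ is the count of word $w_j$ in document $D_i$, and $c(D_i)$ is the total word count of $D_i$.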

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}
{\sum_{i'=1}^{N} c(w_j, D_{i'})\, [1 - P(T_k \mid D_{i'})]}
$$

Hierarchical Document Organization (cont)

• Example [figures not preserved in this transcript]

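Hierarchical Document Organization (cont)

• Self-Organizing Map (SOM)
  - A recursive regression process

[Figure: an input layer receiving the input vector $\vec{x} = [x_1, x_2, \dots, x_n]^T$, and a mapping layer whose unit $i$ holds the weight vector $\vec{m}_i = [m_{i,1}, m_{i,2}, \dots, m_{i,n}]^T$ (e.g. $\vec{m}_1 = [m_{1,1}, m_{1,2}, \dots, m_{1,n}]^T$)]

$$
\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),i}(t)\, [\vec{x}(t) - \vec{m}_i(t)]
$$

where

$$
c(\vec{x}) = \arg\min_i \|\vec{x} - \vec{m}_i\|,
\qquad
\|\vec{x} - \vec{m}_i\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}
$$

$$
h_{c(\vec{x}),i}(t) = \alpha(t) \exp\!\left( -\frac{\|\vec{r}_i - \vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)} \right)
$$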
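One update of the SOM recursion can be sketched as follows. This is an illustrative sketch only: `som_step` and its argument layout are assumptions, the learning rate `alpha` and width `sigma` are passed in as the values of $\alpha(t)$ and $\sigma(t)$ at the current step, and `positions` holds the map coordinates $\vec{r}_i$ of the units.

```python
import math


def som_step(weights, positions, x, alpha, sigma):
    """Apply one SOM update for input vector x.

    weights:   list of weight vectors m_i (one per map unit)
    positions: list of map coordinates r_i (one per map unit)
    Returns (index of the winning unit c(x), list of updated weight vectors).
    """
    # Winner: the unit whose weight vector is closest to x.
    winner = min(
        range(len(weights)),
        key=lambda i: sum((xn - mn) ** 2 for xn, mn in zip(x, weights[i])),
    )
    updated = []
    for m_i, r_i in zip(weights, positions):
        # Neighborhood function h_{c(x),i}: Gaussian in map distance.
        map_dist_sq = sum((a - b) ** 2 for a, b in zip(r_i, positions[winner]))
        h = alpha * math.exp(-map_dist_sq / (2.0 * sigma ** 2))
        # Move m_i towards x by the fraction h.
        updated.append([mn + h * (xn - mn) for xn, mn in zip(x, m_i)])
    return winner, updated


if __name__ == "__main__":
    units = [[0.0, 0.0], [1.0, 1.0]]
    coords = [(0, 0), (1, 0)]
    print(som_step(units, coords, [0.9, 0.9], alpha=0.5, sigma=1.0))
```

In training, this step would be repeated over the inputs while `alpha` and `sigma` are gradually decreased.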

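Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
$$

where

$$
dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)},
\qquad
dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}
$$

$$
f_{Between}(i, j) =
\begin{cases}
dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\qquad
C_{Between}(i, j) =
\begin{cases}
1 & T_{r,i} \ne T_{r,j} \\
0 & \text{otherwise}
\end{cases}
$$

$$
f_{Within}(i, j) =
\begin{cases}
dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\qquad
C_{Within}(i, j) =
\begin{cases}
1 & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
$$

$$
dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$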
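The ratio $R_{Dist}$ can be computed directly from the definitions, as in the sketch below. It assumes each document has a map position and a reference topic label; the function name `dist_ratio` and the input format are hypothetical.

```python
import math


def dist_ratio(coords, labels):
    """Average between-topic map distance divided by the average
    within-topic map distance, over all document pairs i < j.

    coords: list of (x, y) map positions, one per document
    labels: list of reference topic labels, one per document
    """
    between_sum = within_sum = 0.0
    between_pairs = within_pairs = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1])
            if labels[i] != labels[j]:
                between_sum += d
                between_pairs += 1
            else:
                within_sum += d
                within_pairs += 1
    # Assumes at least one pair of each kind and a non-zero within distance.
    dist_between = between_sum / between_pairs
    dist_within = within_sum / within_pairs
    return dist_between / dist_within


if __name__ == "__main__":
    positions = [(0, 0), (0, 1), (3, 0), (3, 1)]
    topics = ["a", "a", "b", "b"]
    print(dist_ratio(positions, topics))
```

A larger ratio means that documents of different reference topics lie farther apart on the map relative to documents of the same topic.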


Clustering vs. Classification

• Classification is supervised and requires a set of labeled training instances for each group (class)
  - Learning with a teacher

• Clustering is unsupervised and learns without a teacher to provide the labeling information of the training data set
  - Also called automatic or unsupervised classification

Types of Clustering Algorithms

• Two types of structures produced by clustering algorithms
  - Flat or non-hierarchical clustering
  - Hierarchical clustering

• Flat clustering
  - Simply consists of a certain number of clusters; the relation between clusters is often undetermined
  - Measurement: construction error minimization or probabilistic optimization

• Hierarchical clustering
  - A hierarchy with the usual interpretation that each node stands for a subclass of its mother's node
    • The leaves of the tree are the single objects
    • Each node represents the cluster that contains all the objects of its descendants
  - Measurement: similarities of instances

Hard Assignment vs. Soft Assignment

• Another important distinction between clustering algorithms is whether they perform soft or hard assignment

• Hard Assignment
  - Each object is assigned to one and only one cluster

• Soft Assignment (probabilistic approach)
  - Each object may be assigned to multiple clusters
  - An object $x_i$ has a probability distribution $P(\cdot \mid x_i)$ over clusters $c_j$, where $P(c_j \mid x_i)$ is the probability that $x_i$ is a member of $c_j$
  - Is somewhat more appropriate in many tasks such as NLP, IR, ...

Hard Assignment vs. Soft Assignment (cont)

• Hierarchical clustering usually adopts hard assignment

• In flat clustering, both types of assignment are common

Summarized Attributes of Clustering Algorithms

• Hierarchical Clustering
  - Preferable for detailed data analysis
  - Provides more information than flat clustering
  - No single best algorithm (each algorithm is optimal only for some applications)
  - Less efficient than flat clustering (at a minimum, an n × n matrix of similarity coefficients has to be computed)

Summarized Attributes of Clustering Algorithms (cont)

• Flat Clustering
  - Preferable if efficiency is a consideration or data sets are very large
  - K-means is the conceptually simplest method and should probably be used first on a new data set, because its results are often sufficient
  - K-means assumes a simple Euclidean representation space, and so cannot be used for many data sets, e.g., nominal data like colors (or samples with features of different scales)
  - The EM algorithm is the method of choice. It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
    • Its extensions can be used to handle topological/hierarchical orders of samples
      - Probabilistic Latent Semantic Analysis (PLSA), Topic Mixture Model (TMM), etc.

Hierarchical Clustering

• Can be performed in either a bottom-up or a top-down manner
  - Bottom-up (agglomerative)
    • Start with individual objects and group the most similar ones
      - E.g., those with the minimum distance apart, where $sim(x, y) = \dfrac{1}{1 + d(x, y)}$ (distance measures will be discussed later on)
    • The procedure terminates when one cluster containing all objects has been formed
  - Top-down (divisive)
    • Start with all objects in a group and divide them into groups so as to maximize within-group similarity

Hierarchical Agglomerative Clustering (HAC)

• A bottom-up approach

• Assume a similarity measure for determining the similarity of two objects

• Start with each object in a separate cluster and then repeatedly join the two clusters that are most similar, until only one cluster survives

• The history of merging/clustering forms a binary tree or hierarchy

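HAC (cont)

• Algorithm [pseudocode figure not preserved; its annotations are listed below]
  - Initialization (for tree leaves): each object is a cluster
  - Cluster number: tracked as clusters are merged
  - The two most similar clusters are merged as a new cluster
  - The original two clusters are removed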
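The steps annotated above (every object starts as a leaf cluster, the closest pair is merged into a new cluster, the two originals are removed) can be sketched as a naive procedure. This is an illustrative O(n^3) sketch rather than the slide's own pseudocode; it works with distances instead of similarities, and the function name, the `linkage` argument, and the merge-record format are assumptions.

```python
def hac(points, distance, linkage="single"):
    """Naive hierarchical agglomerative clustering.

    points:   list of objects
    distance: function(a, b) -> non-negative number
    linkage:  "single" (closest pair) or "complete" (farthest pair)
    Returns the merge history as (cluster_id_a, cluster_id_b, distance);
    leaves have ids 0..n-1 and each merge creates the next free id.
    """
    clusters = {i: [i] for i in range(len(points))}  # id -> member indices
    next_id = len(points)
    pick = min if linkage == "single" else max
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                d = pick(distance(points[i], points[j])
                         for i in clusters[a] for j in clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        # Merge a and b into a new cluster; the originals are removed.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


if __name__ == "__main__":
    data = [0.0, 1.0, 5.0, 6.0, 20.0]
    for step in hac(data, lambda p, q: abs(p - q), linkage="single"):
        print(step)
```

The merge history is exactly the binary tree mentioned earlier: each record names the two children of a new internal node.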

Distance Metrics

• Euclidean distance (L2 norm)

$$
L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}
$$

  - Make sure that all attributes/dimensions have the same scale (or the same variance)

• L1 norm (city-block distance)

$$
L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|
$$

• Cosine similarity (transformed to a distance by subtracting it from 1)

$$
1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\, |\vec{y}|}
$$

  - Ranges between 0 and 1 when the feature values are non-negative
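The three measures can be written directly from their definitions. The following is a small sketch with assumed function names; vectors are plain Python sequences of equal length.

```python
import math


def l2_distance(x, y):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def l1_distance(x, y):
    """City-block (L1) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_distance(x, y):
    """One minus the cosine similarity; both vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


if __name__ == "__main__":
    u, v = (0.0, 0.0), (3.0, 4.0)
    print(l2_distance(u, v), l1_distance(u, v))
    print(cosine_distance((1.0, 0.0), (0.0, 1.0)))
```

Because L1 and L2 depend on the scale of each dimension, features are usually rescaled before these distances are compared across attributes.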

Measures of Cluster Similarity

• Especially for the bottom-up approaches

• Single-link clustering
  - The similarity between two clusters is the similarity of the two closest objects in the clusters
  - Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity

$$
sim(c_i, c_j) = \max_{\vec{x} \in c_i,\ \vec{y} \in c_j} sim(\vec{x}, \vec{y})
$$

  - Tends to produce elongated clusters (cf. the minimal spanning tree)

Measures of Cluster Similarity (cont)

• Complete-link clustering
  - The similarity between two clusters is the similarity of their two most dissimilar members (the pair with the least similarity)

$$
sim(c_i, c_j) = \min_{\vec{x} \in c_i,\ \vec{y} \in c_j} sim(\vec{x}, \vec{y})
$$

  - Tends to produce sphere-shaped clusters
  - Preferable for most IR and NLP applications

Measures of Cluster Similarity (cont)

[Figure: clusters produced by single link vs. complete link]

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering
  - A compromise between single-link and complete-link clustering
  - The similarity between two clusters is the average similarity between members
  - If the objects are represented as length-normalized vectors and the similarity measure is the cosine, there exists a fast algorithm for computing the average similarity

$$
sim(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\, |\vec{y}|} = \vec{x} \cdot \vec{y}
\quad \text{(for length-normalized vectors)}
$$

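Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)
  - The average similarity SIM between vectors in a cluster $c_j$ is defined as

$$
SIM(c_j) = \frac{1}{|c_j|\,(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \ne \vec{x}}} sim(\vec{x}, \vec{y})
= \frac{1}{|c_j|\,(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \ne \vec{x}}} \vec{x} \cdot \vec{y}
$$

  - The sum of the members in a cluster $c_j$:

$$
\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}
$$

  - Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$:

$$
\begin{aligned}
\vec{s}(c_j) \cdot \vec{s}(c_j) &= \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} \\
&= |c_j|\,(|c_j| - 1)\, SIM(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} \\
&= |c_j|\,(|c_j| - 1)\, SIM(c_j) + |c_j|
\qquad (\vec{x} \cdot \vec{x} = 1 \text{ for a length-normalized vector})
\end{aligned}
$$

$$
\therefore\quad SIM(c_j) = \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|\,(|c_j| - 1)}
$$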

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)
  - When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

$$
\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|
$$

  - The average similarity for their union will be

$$
SIM(c_i \cup c_j) = \frac{(\vec{s}(c_i) + \vec{s}(c_j)) \cdot (\vec{s}(c_i) + \vec{s}(c_j)) - (|c_i| + |c_j|)}
{(|c_i| + |c_j|)\,(|c_i| + |c_j| - 1)}
$$
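The shortcut can be checked numerically: for length-normalized vectors, the average pairwise similarity computed from the cluster sum vectors should equal the direct double sum. The helper names below are assumptions made for this sketch.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return tuple(a / norm for a in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def vector_sum(vectors):
    """Cluster sum vector s(c): component-wise sum of the members."""
    return tuple(sum(column) for column in zip(*vectors))


def sim_direct(cluster):
    """Average similarity over all ordered pairs x != y (quadratic time)."""
    n = len(cluster)
    total = sum(dot(cluster[a], cluster[b])
                for a in range(n) for b in range(n) if a != b)
    return total / (n * (n - 1))


def sim_from_sum(s, size):
    """Average similarity from the sum vector: (s.s - |c|) / (|c| (|c| - 1))."""
    return (dot(s, s) - size) / (size * (size - 1))


def sim_of_union(s_i, size_i, s_j, size_j):
    """Average similarity of the merged cluster, using only sums and sizes."""
    s_new = tuple(a + b for a, b in zip(s_i, s_j))
    return sim_from_sum(s_new, size_i + size_j)


if __name__ == "__main__":
    c_i = [normalize(v) for v in [(1, 0, 0), (1, 1, 0)]]
    c_j = [normalize(v) for v in [(0, 1, 0), (1, 1, 1), (0, 1, 2)]]
    print(sim_direct(c_i + c_j))
    print(sim_of_union(vector_sum(c_i), len(c_i), vector_sum(c_j), len(c_j)))
```

Evaluating a candidate merge this way costs time proportional to the vector dimension, independent of the cluster sizes.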

Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values
  - E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of 22 words. "be" has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity.]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  - Single-link measure
  - Complete-link measure
  - Group-average measure

• How to split a cluster
  - This is also a clustering task (finding two sub-clusters)
  - Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm [pseudocode figure not preserved; its annotations are listed below]
  - Split the least coherent cluster
  - Generate two new clusters and remove the original one

Non-Hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  - In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  - When to stop (e.g., judged by group average similarity, likelihood, or mutual information)
  - What is the right number of clusters (k-1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
  - The K-means algorithm
  - The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  - A hard clustering algorithm
  - Defines clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  - A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)
  - E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors)
    • A compression rate of 3

$$
F:\ \mathcal{X} = \{\mathbf{x}^t\}_{t=1}^{N} \ \longrightarrow\ \text{index } j, \qquad \{\mathbf{m}_j\}_{j=1}^{k}
$$

where each $\mathbf{m}_j$ is a reference vector (also called a code vector, code word, or cluster centroid). In the color example, $\mathrm{Dim}(\mathbf{x}^t) = 24$ → $k = 2^8$.

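The K-means Algorithm (cont)

Total reconstruction error:

$$
E\big(\{\mathbf{m}_i\}_{i=1}^{k} \mid \mathcal{X}\big) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t\, \|\mathbf{x}^t - \mathbf{m}_i\|^2,
\qquad \text{where }
b_i^t =
\begin{cases}
1 & \text{if } \|\mathbf{x}^t - \mathbf{m}_i\| = \min_j \|\mathbf{x}^t - \mathbf{m}_j\| \\
0 & \text{otherwise}
\end{cases}
$$

  - The labels $b_i^t$ and the centers $\mathbf{m}_i$ are unknown
  - $b_i^t$ depends on $\mathbf{m}_i$, and this optimization problem cannot be solved analytically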

The K-means Algorithm (cont)

• Initialization
  - A set of initial cluster centers $\{\mathbf{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  - Assign each object $\mathbf{x}^t$ to the cluster whose center is closest

$$
b_i^t =
\begin{cases}
1 & \text{if } \|\mathbf{x}^t - \mathbf{m}_i\| = \min_j \|\mathbf{x}^t - \mathbf{m}_j\| \\
0 & \text{otherwise}
\end{cases}
$$

  - Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$
\mathbf{m}_i = \frac{\sum_{t=1}^{N} b_i^t\, \mathbf{x}^t}{\sum_{t=1}^{N} b_i^t}
$$

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

  - These two steps are repeated until $\mathbf{m}_i$ stabilizes
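The two alternating steps can be sketched as follows. This is a minimal illustration, assuming points are tuples of numbers and the initial centers are supplied by the caller; the function names are hypothetical. Ties are broken in favor of the lowest center index, and an empty cluster simply keeps its previous center, both of which are simplifications chosen for this example.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, initial_centers, max_iter=100):
    """Return (centers, labels, total reconstruction error)."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: label each point with its closest center.
        new_labels = [
            min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        if new_labels == labels:  # labels stable, so the centers are stable
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    error = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, error


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

The returned error is the total reconstruction error defined earlier, so runs started from different seeds can be compared by this value.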

The K-means Algorithm (cont)

• Algorithm [figure not preserved]

• Example 1 [figure not preserved]

• Example 2 [figure not preserved; cluster labels: government, finance, sports, research, name]

The K-means Algorithm (cont)

• The choice of initial cluster centers (seeds) is important
  - Pick at random
  - Calculate the mean $\mathbf{m}$ of all data and generate the k initial centers $\mathbf{m}_i$ by adding a small random vector to the mean: $\mathbf{m}_i = \mathbf{m} \pm \boldsymbol{\delta}$
  - Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $\mathbf{m}_i$
  - Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  - Randomly assign the object to one of the candidate clusters
  - Or perturb objects slightly

• Applications of the K-means algorithm
  - Clustering
  - Vector quantization
  - A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with $l = \log_2 k$ for k clusters (the clusters become nodes on the hypercube), e.g., as input to a linear classifier

The K-means Algorithm (cont)

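• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray
  – The number of clusters doubles at each iteration ($M \rightarrow 2M$)

[Figure: the global mean is split into the cluster 1 mean and the cluster 2 mean; after the next split the four clusters carry the parameters $\{\mu_{11}, \Sigma_{11}, \omega_{11}\}$, $\{\mu_{12}, \Sigma_{12}, \omega_{12}\}$, $\{\mu_{13}, \Sigma_{13}, \omega_{13}\}$, $\{\mu_{14}, \Sigma_{14}, \omega_{14}\}$]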
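The doubling scheme can be sketched as follows. This is only an illustration under stated assumptions: each center is split into two copies shifted by `±eps`, a fixed number of K-means passes refines them, and only the means are tracked (the covariances and weights shown in the figure are not estimated). The names `lbg`, `refine`, and `eps` are invented for this sketch.

```python
import math

def refine(points, centers, iters=20):
    """A few K-means passes: nearest-center assignment, then mean update."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
            groups[nearest].append(x)
        # An empty group keeps its previous center (assumption).
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return centers

def lbg(points, target, eps=0.01):
    """Grow the set of centers 1 -> 2 -> 4 -> ... until it has at least `target` entries."""
    dim = len(points[0])
    centers = [tuple(sum(x[d] for x in points) / len(points) for d in range(dim))]  # global mean
    while len(centers) < target:
        # Split every center into two slightly shifted copies (M -> 2M).
        centers = [tuple(v + s * eps for v in c) for c in centers for s in (+1, -1)]
        centers = refine(points, centers)
    return centers
```

Because the number of centers doubles on every pass, `target` is effectively rounded up to the next power of two.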

IR ndash Berlin Chen 35

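The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

[Figure: a mixture Gaussian HMM (or a mixture of Gaussians): the sample $\vec{x}_i$ is scored by the component densities $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \ldots, P(\vec{x}_i \mid c_K)$, which are weighted by $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$ and summed]

$$
P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
$$

Continuous case ($m$ is the dimension of $\vec{x}_i$):

$$
P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, \lvert \Sigma_k \rvert^{1/2}}
\exp\!\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)
$$

Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.):

$$
P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta)
= \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
$$

Classification:

$$
\arg\max_k P(c_k \mid \vec{x}_i, \Theta)
= \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)}
= \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
$$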
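To make the mixture density and the classification rule concrete, the following Python sketch evaluates them for a one-dimensional Gaussian mixture. The restriction to one dimension and the function names (`gaussian_pdf`, `posteriors`, `classify`) are simplifications chosen here, not part of the slides.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posteriors(x, weights, means, variances):
    """P(c_k | x, Theta) for every component k of the mixture."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)              # P(x | Theta), the mixture density at x
    return [j / total for j in joint]

def classify(x, weights, means, variances):
    """Index of the component with the largest posterior probability."""
    post = posteriors(x, weights, means, variances)
    return max(range(len(post)), key=post.__getitem__)
```

The posteriors are the soft memberships of the sample; `classify` turns them into a hard decision by taking the arg max.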

IR ndash Berlin Chen 36

The EM Algorithm (cont)

IR ndash Berlin Chen 37

Maximum Likelihood Estimation

• Hard assignment

State S1:

P(B | S1) = 2/4 = 0.5

P(W | S1) = 2/4 = 0.5

IR ndash Berlin Chen 38

Maximum Likelihood Estimation

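• Soft assignment

Sample   State S1   State S2
B        0.7        0.3
W        0.4        0.6
B        0.9        0.1
W        0.5        0.5

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64

P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36

P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 = 0.27

P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 = 0.73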
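The soft-assignment estimates are just ratios of accumulated weights, which a short Python sketch makes explicit. The data layout (a list of `(symbol, {state: weight})` pairs) and the name `soft_ml_estimates` are assumptions made for this example.

```python
def soft_ml_estimates(samples):
    """Return {(symbol, state): P(symbol | state)} from fractional counts."""
    totals = {}   # state -> total weight assigned to that state
    counts = {}   # (symbol, state) -> weight of that symbol in that state
    for symbol, weights in samples:
        for state, w in weights.items():
            totals[state] = totals.get(state, 0.0) + w
            counts[(symbol, state)] = counts.get((symbol, state), 0.0) + w
    return {key: w / totals[key[1]] for key, w in counts.items()}

samples = [
    ("B", {"S1": 0.7, "S2": 0.3}),
    ("W", {"S1": 0.4, "S2": 0.6}),
    ("B", {"S1": 0.9, "S2": 0.1}),
    ("W", {"S1": 0.5, "S2": 0.5}),
]
estimates = soft_ml_estimates(samples)
```

With weights restricted to 0 and 1 the same function reproduces the hard-assignment counts.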

IR ndash Berlin Chen 39

The EM Algorithm (cont)

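• E-step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(X \mid \Theta)
&= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta)
 = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \\
&\quad \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
$$

where $X = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_{n-1}, \vec{x}_n$ and $C = c_{k_1}, c_{k_2}, \ldots, c_{k_{n-1}}, c_{k_n}$. $P(X \mid \Theta)$ is the likelihood function and $P(X, C \mid \Theta)$ is the complete data likelihood function.

How many kinds of $C$ are there? ($K^n$ kinds)

Note:

$$
\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk}
= (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM})
= \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}
$$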

IR ndash Berlin Chen 40

The EM Algorithm (cont)

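• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$ ($\Theta$ is known, $\hat{\Theta}$ is unknown)

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta})
&= E_C\!\left[ L_{CM} \mid X, \Theta \right]
 = E_C\!\left[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \right] \\
&= \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})
\end{aligned}
$$

  – Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$
\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)
$$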

IR ndash Berlin Chen 41

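The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta})
&= \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \log \left[ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i, k} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta})
 + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

where

$$
\delta_{k_i, k} =
\begin{cases}
1 & \text{if } k_i = k \\
0 & \text{otherwise}
\end{cases}
$$

The term in braces equals $P(c_k \mid \vec{x}_i, \Theta)$ (see next slide).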

IR ndash Berlin Chen 42

The EM Algorithm (cont)

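  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{j=1,\, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1,\, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1,\, j \ne i}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

The factor $\delta_{k_i, k}$ means that $\vec{x}_i$ can only be aligned to $c_k$.

Note:

$$
\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk}
= (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM})
= \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}
$$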

IR ndash Berlin Chen 43

The EM Algorithm (cont)

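• E-step (Expectation)
  – The auxiliary function can also be divided into two parts:

$$
\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})
$$

where the auxiliary function for the mixture weights is

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta})
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta})
 = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

and the auxiliary function for the cluster distributions is

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta})
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$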

IR ndash Berlin Chen 44

The EM Algorithm (cont)

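• M-step (Maximization)
  – Remember that
    • A function $F$ can be maximized under a constraint by applying a Lagrange multiplier

Suppose that

$$
F = \sum_{j=1}^{N} w_j \log y_j, \qquad \text{with the constraint } \sum_{j=1}^{N} y_j = 1
$$

By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; y_j = -\frac{w_j}{\ell} \quad \forall j \\
\sum_{j=1}^{N} y_j &= -\frac{\sum_{j=1}^{N} w_j}{\ell} = 1 \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore \; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$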

IR ndash Berlin Chen 45

The EM Algorithm (cont)

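• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta})
&= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k}
 + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

Here $w_k$ is the expected number of times that the samples $\vec{x}_i$ fall in class $c_k$.

$$
\Rightarrow \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
       {\displaystyle \sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$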

IR ndash Berlin Chen 46

The EM Algorithm (cont)

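• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
$$

$$
P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\, \lvert \hat{\Sigma}_k \rvert^{1/2}}
\exp\!\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)
$$

Let

$$
w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

and

$$
\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)
$$

Then

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

where $D$ is a constant that does not depend on $\hat{\vec{\mu}}_k$ or $\hat{\Sigma}_k$.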

IR ndash Berlin Chen 47

The EM Algorithm (cont)

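• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k}
&= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow \hat{\vec{\mu}}_k
&= \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}
        {\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

Here $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.

Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here.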

IR ndash Berlin Chen 48

The EM Algorithm (cont)

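• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k}
&= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{\lvert \hat{\Sigma}_k \rvert} \cdot \lvert \hat{\Sigma}_k \rvert \cdot \hat{\Sigma}_k^{-1}
 - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
\Rightarrow \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1}
&= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
\Rightarrow \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k
&= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow \hat{\Sigma}_k \sum_{i=1}^{n} w_{ik}
&= \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
\Rightarrow \hat{\Sigma}_k
&= \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}
        {\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

Notes ($\hat{\Sigma}_k$ is symmetric here):

$$
\frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \qquad
\frac{d (\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1} \vec{a}\, \vec{b}^T X^{-1}
$$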

IR ndash Berlin Chen 49

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
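Putting the E-step and the three M-step updates together gives the loop sketched below. It is a simplified illustration for one-dimensional data, not the lecture's implementation: the name `em_gmm_1d`, the stopping tolerance `tol`, and the variance floor `min_var` (added to avoid a collapsing component) are assumptions of this sketch, and the initial parameters are taken as given (for example from K-means).

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, weights, means, variances, max_iter=200, tol=1e-9, min_var=1e-6):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances, log-likelihood)."""
    weights, means, variances = list(weights), list(means), list(variances)
    prev = -math.inf
    loglik = prev
    for _ in range(max_iter):
        # E-step: resp[i][k] = w_ik = P(c_k | x_i, Theta)
        resp, loglik = [], 0.0
        for x in data:
            joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(weights, means, variances)]
            total = sum(joint)                    # P(x_i | Theta)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        # M-step: re-estimate priors, means and variances from the soft counts
        for k in range(len(weights)):
            nk = sum(r[k] for r in resp)          # expected number of samples in c_k
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, min_var)      # floor is an assumption of this sketch
        if loglik - prev < tol:                   # likelihood has converged
            break
        prev = loglik
    return weights, means, variances, loglik
```

The log-likelihood reported is the one computed in the last E-step, so it corresponds to the parameters before the final update.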

IR ndash Berlin Chen 50

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

$$
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]
$$

$$
P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad
E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\!\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]
$$

$$
dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by the distances on the map

• When a cluster has many documents, it can be further analyzed into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]
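The neighborhood-smoothed mixture can be sketched as follows, assuming each topic has known coordinates on the two-dimensional map. The function names and the data layout (`coords[k]` holding the map position of topic $T_k$) are hypothetical and chosen only for this illustration.

```python
import math

def neighborhood_probs(coords, sigma=1.0):
    """Return probs[k][l] = P(T_l | Y_k) from the map coordinates of the topics."""
    def e(k, l):
        d = math.dist(coords[k], coords[l])       # Euclidean distance on the map
        return math.exp(-d * d / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    n_topics = len(coords)
    return [[e(k, l) / sum(e(k, s) for s in range(n_topics)) for l in range(n_topics)]
            for k in range(n_topics)]

def word_prob(p_topic_given_doc, p_word_given_topic, probs):
    """P(w_j | D_i) = sum_k P(T_k | D_i) * sum_l P(T_l | Y_k) * P(w_j | T_l)."""
    n_topics = len(probs)
    return sum(p_topic_given_doc[k]
               * sum(probs[k][l] * p_word_given_topic[l] for l in range(n_topics))
               for k in range(n_topics))
```

A smaller `sigma` concentrates each row of `probs` on the topic itself, which brings the model closer to an ordinary topic mixture without neighborhood smoothing.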

IR ndash Berlin Chen 51

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}
$$

  – EM training can be performed

$$
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}
$$

$$
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}
$$

where

$$
P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] P(T_k \mid D_i)}
{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid Y_{k'}) \right] P(T_{k'} \mid D_i) \right\}}
$$

IR ndash Berlin Chen 52

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}
$$

IR ndash Berlin Chen 53

Hierarchical Document Organization (cont)

bull Example

IR ndash Berlin Chen 54

Hierarchical Document Organization (cont)

bull Example (cont)

IR ndash Berlin Chen 55

Hierarchical Document Organization (cont)

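• Self-Organization Map (SOM)
  – A recursive regression process

[Figure: an input layer receiving the input vector $\vec{x} = [x_1, x_2, \ldots, x_n]^T$ and a mapping layer whose unit $i$ holds the weight vector $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$ (e.g., $\vec{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$); the weight vector $\vec{m}_i$ is moved toward $\vec{x}$ along $\vec{x} - \vec{m}_i$ to give $\vec{m}_i'$]

$$
\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),\, i}(t) \left[ \vec{x}(t) - \vec{m}_i(t) \right]
$$

$$
c(\vec{x}) = \arg\min_i \lVert \vec{x} - \vec{m}_i \rVert,
\qquad \text{where } \lVert \vec{x} - \vec{m}_i \rVert = \sqrt{\sum_{n} (x_n - m_{i,n})^2}
$$

$$
h_{c(\vec{x}),\, i}(t) = \alpha(t) \exp\!\left( -\frac{\lVert \vec{r}_i - \vec{r}_{c(\vec{x})} \rVert^2}{2\sigma^2(t)} \right)
$$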
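A single SOM update can be sketched as below. The sketch assumes that `positions[i]` gives the location $\vec{r}_i$ of unit $i$ on the map and that the learning rate and neighborhood width for the current step are passed in as `alpha` and `sigma`; the function name `som_step` is made up for this example.

```python
import math

def som_step(x, weights, positions, alpha, sigma):
    """One SOM update; returns the winning unit c(x) and the new weight vectors."""
    # Best-matching unit: c(x) = argmin_i ||x - m_i||
    c = min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
    updated = []
    for m, r in zip(weights, positions):
        # Neighborhood function h = alpha * exp(-||r_i - r_c||^2 / (2 sigma^2))
        h = alpha * math.exp(-math.dist(r, positions[c]) ** 2 / (2 * sigma ** 2))
        # m_i(t+1) = m_i(t) + h * (x - m_i(t))
        updated.append(tuple(mi + h * (xi - mi) for mi, xi in zip(m, x)))
    return c, updated
```

In a full training run, `alpha` and `sigma` would typically be decreased over time, which is what the dependence on $t$ in $\alpha(t)$ and $\sigma(t)$ expresses.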

IR ndash Berlin Chen 56

Hierarchical Document Organization (cont)

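• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
$$

where

$$
dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)},
\qquad
dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}
$$

$$
f_{Between}(i, j) =
\begin{cases}
dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\qquad
C_{Between}(i, j) =
\begin{cases}
1 & T_{r,i} \ne T_{r,j} \\
0 & \text{otherwise}
\end{cases}
$$

$$
f_{Within}(i, j) =
\begin{cases}
dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\qquad
C_{Within}(i, j) =
\begin{cases}
1 & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
$$

$$
dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$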
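The evaluation ratio averages map distances over document pairs, separately for pairs with different and with equal topic labels. The sketch below assumes that $T_{r,i}$ denotes a topic label attached to document $i$ and that every document has known map coordinates; both the interpretation and the name `distance_ratio` are assumptions of this example.

```python
import math

def distance_ratio(positions, labels):
    """R_Dist = (mean map distance between topics) / (mean map distance within topics)."""
    between = within = 0.0
    n_between = n_within = 0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = math.dist(positions[i], positions[j])   # dist_Map(i, j)
            if labels[i] != labels[j]:
                between += d
                n_between += 1
            else:
                within += d
                n_within += 1
    # Assumes both kinds of pairs exist and the within-topic mean is non-zero.
    return (between / n_between) / (within / n_within)
```

A larger ratio means that documents sharing a label sit closer together on the map than documents with different labels.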


IR ndash Berlin Chen 4

Types of Clustering Algorithms

• Two types of structures produced by clustering algorithms
  – Flat or non-hierarchical clustering
  – Hierarchical clustering

• Flat clustering
  – Simply consists of a certain number of clusters, and the relation between clusters is often undetermined
  – Measurement: construction error minimization or probabilistic optimization

• Hierarchical clustering
  – A hierarchy with the usual interpretation that each node stands for a subclass of its mother's node
    • The leaves of the tree are the single objects
    • Each node represents the cluster that contains all the objects of its descendants
  – Measurement: similarities of instances

IR ndash Berlin Chen 5

Hard Assignment vs Soft Assignment

• Another important distinction between clustering algorithms is whether they perform soft or hard assignment

• Hard assignment
  – Each object is assigned to one and only one cluster

• Soft assignment (probabilistic approach)
  – Each object may be assigned to multiple clusters
  – An object $x_i$ has a probability distribution $P(\cdot \mid x_i)$ over clusters $c_j$, where $P(c_j \mid x_i)$ is the probability that $x_i$ is a member of $c_j$
  – Is somewhat more appropriate in many tasks such as NLP, IR, …

IR ndash Berlin Chen 6

Hard Assignment vs Soft Assignment (cont)

• Hierarchical clustering usually adopts hard assignment

• In flat clustering, by contrast, both types of assignment are common

IR ndash Berlin Chen 7

Summarized Attributes of Clustering Algorithms

• Hierarchical clustering
  – Preferable for detailed data analysis
  – Provides more information than flat clustering
  – No single best algorithm (each algorithm is optimal only for some applications)
  – Less efficient than flat clustering (at a minimum, an n × n matrix of similarity coefficients has to be computed)

IR ndash Berlin Chen 8

Summarized Attributes of Clustering Algorithms (cont)

• Flat clustering
  – Preferable if efficiency is a consideration or data sets are very large
  – K-means is conceptually the simplest method and should probably be used first on a new data set, because its results are often sufficient
  – K-means assumes a simple Euclidean representation space, and so cannot be used for many data sets, e.g., nominal data like colors (or samples with features of different scales)
  – The EM algorithm is the method of choice. It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
    • Its extensions can be used to handle topological/hierarchical orders of samples
      – Probabilistic Latent Semantic Analysis (PLSA), Topic Mixture Model (TMM), etc.

IR ndash Berlin Chen 9

Hierarchical Clustering

IR ndash Berlin Chen 10

Hierarchical Clustering

• Can be performed in either a bottom-up or a top-down manner
  – Bottom-up (agglomerative)
    • Start with individual objects and group the most similar ones
      – E.g., those with the minimum distance apart, using $sim(x, y) = \dfrac{1}{1 + d(x, y)}$ (distance measures will be discussed later on)
    • The procedure terminates when one cluster containing all objects has been formed
  – Top-down (divisive)
    • Start with all objects in one group and divide them into groups so as to maximize within-group similarity

IR ndash Berlin Chen 11

Hierarchical Agglomerative Clustering (HAC)

• A bottom-up approach

• Assume a similarity measure for determining the similarity of two objects

• Start with each object in a separate cluster, then repeatedly join the two clusters that are most similar until only one cluster survives

• The history of merging/clustering forms a binary tree or hierarchy

IR ndash Berlin Chen 12

HAC (cont)

• Algorithm

[Figure: HAC pseudocode with the annotations: cluster number; initialization (for tree leaves), each object is a cluster; merged as a new cluster; the original two clusters are removed]

IR ndash Berlin Chen 13
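The annotations above can be sketched as a short Python routine. This is a minimal illustration, not the slide's own pseudocode: the function name `hac`, the use of single-link similarity between clusters, and the toy 1-D data are all assumptions made for the example.

```python
from itertools import combinations


def hac(objects, sim):
    """Naive agglomerative clustering (assumed single-link); returns the merge history."""
    # Initialization (tree leaves): each object is its own cluster.
    clusters = {i: [obj] for i, obj in enumerate(objects)}
    next_id = len(objects)  # running cluster number for newly created nodes
    history = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the greatest similarity.
        a, b = max(
            combinations(clusters, 2),
            key=lambda p: max(sim(x, y) for x in clusters[p[0]] for y in clusters[p[1]]),
        )
        # Merge them as a new cluster; the original two clusters are removed.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        history.append((a, b, next_id))
        next_id += 1
    return history


# Hypothetical toy data: points on a line, similarity 1 / (1 + distance).
points = [0.0, 0.1, 1.0, 1.2, 5.0]
similarity = lambda x, y: 1.0 / (1.0 + abs(x - y))
merges = hac(points, similarity)
```

Each entry of `merges` records which two cluster ids were joined and the id of the new node, so the list as a whole encodes the binary tree. The pairwise search is cubic overall, which is acceptable for a sketch but not for large collections.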

Distance Metrics

• Euclidean distance (L2 norm)
$$L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
– Make sure that all attributes/dimensions have the same scale (or the same variance)

• L1 norm (city-block distance)
$$L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$$

• Cosine similarity (transform to a distance by subtracting from 1)
$$1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$$
– Ranges between 0 and 1 (for vectors with non-negative components)
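The three measures can be written directly from their definitions. The sketch below is illustrative only; the function names and the sample vectors are assumptions, and the cosine distance assumes neither vector is all zeros.

```python
import math


def l2(x, y):
    """Euclidean distance (L2 norm of the difference)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def l1(x, y):
    """City-block distance (L1 norm of the difference)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_distance(x, y):
    """One minus the cosine similarity; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm


# Hypothetical example: two orthogonal vectors.
x, y = [3.0, 0.0], [0.0, 4.0]
d_l2, d_l1, d_cos = l2(x, y), l1(x, y), cosine_distance(x, y)
```

For these orthogonal vectors the cosine is 0, so the cosine distance reaches its maximum of 1 for non-negative data, while a vector compared with itself gives a distance of 0.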

Measures of Cluster Similarity

• Especially for the bottom-up approaches

• Single-link clustering
– The similarity between two clusters is the similarity of the two closest objects in the clusters
– Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity
$$\mathrm{sim}(c_i, c_j) = \max_{\vec{x} \in c_i,\ \vec{y} \in c_j} \mathrm{sim}(\vec{x}, \vec{y})$$
– Elongated clusters are achieved (cf. the minimal spanning tree)

Measures of Cluster Similarity (cont)

• Complete-link clustering
– The similarity between two clusters is the similarity of their two most dissimilar members (the pair with the least similarity)
$$\mathrm{sim}(c_i, c_j) = \min_{\vec{x} \in c_i,\ \vec{y} \in c_j} \mathrm{sim}(\vec{x}, \vec{y})$$
– Sphere-shaped clusters are achieved
– Preferable for most IR and NLP applications
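The single-link and complete-link measures differ only in taking the maximum or the minimum over all cross-cluster pairs. A small sketch, with hypothetical function names and 1-D toy clusters chosen for the illustration:

```python
def single_link(ci, cj, sim):
    """Similarity of the two closest objects, one from each cluster."""
    return max(sim(x, y) for x in ci for y in cj)


def complete_link(ci, cj, sim):
    """Similarity of the two most dissimilar objects, one from each cluster."""
    return min(sim(x, y) for x in ci for y in cj)


# Hypothetical toy clusters on a line, with similarity 1 / (1 + distance).
similarity = lambda x, y: 1.0 / (1.0 + abs(x - y))
ci, cj = [0.0, 1.0], [2.0, 5.0]
s_single = single_link(ci, cj, similarity)      # decided by the pair (1.0, 2.0)
s_complete = complete_link(ci, cj, similarity)  # decided by the pair (0.0, 5.0)
```

The complete-link value can never exceed the single-link value for the same pair of clusters, which is why complete link resists the chaining that produces elongated single-link clusters.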

Measures of Cluster Similarity (cont)

[Figure: single-link versus complete-link clusterings]

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering
– A compromise between single-link and complete-link clustering
– The similarity between two clusters is the average similarity between members
– If the objects are represented as length-normalized vectors and the similarity measure is the cosine
• There exists a fast algorithm for computing the average similarity
$$\mathrm{sim}(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x} \cdot \vec{y} \quad \text{(length-normalized vectors)}$$

Measures of Cluster Similarity (cont)

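• Group-average agglomerative clustering (cont)

– The average similarity SIM between vectors in a cluster $c_j$ is defined as
$$SIM(c_j) = \frac{1}{|c_j|(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \mathrm{sim}(\vec{x}, \vec{y}) = \frac{1}{|c_j|(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \vec{x} \cdot \vec{y}$$

– The sum of members in a cluster $c_j$
$$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$

– Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$
$$\begin{aligned}
\vec{s}(c_j) \cdot \vec{s}(c_j) &= \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} \\
&= |c_j|(|c_j| - 1)\, SIM(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} \\
&= |c_j|(|c_j| - 1)\, SIM(c_j) + |c_j| \qquad (\vec{x} \cdot \vec{x} = 1 \text{ for a length-normalized vector}) \\
\therefore\ SIM(c_j) &= \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|(|c_j| - 1)}
\end{aligned}$$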

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

– When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance
$$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$

– The average similarity for their union will be
$$SIM(c_i \cup c_j) = \frac{\left(\vec{s}(c_i) + \vec{s}(c_j)\right) \cdot \left(\vec{s}(c_i) + \vec{s}(c_j)\right) - \left(|c_i| + |c_j|\right)}{\left(|c_i| + |c_j|\right)\left(|c_i| + |c_j| - 1\right)}$$
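A quick numerical check of the shortcut: the sketch below compares the average pairwise cosine computed over all ordered pairs with the value obtained from the cluster sum vectors alone. The helper names and the example vectors are assumptions made for the illustration.

```python
import math
from itertools import permutations


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sim_naive(cluster):
    """Average of x . y over all ordered pairs of distinct members."""
    n = len(cluster)
    total = sum(dot(x, y) for x, y in permutations(cluster, 2))
    return total / (n * (n - 1))


def sum_vector(cluster):
    """Component-wise sum of the member vectors, s(c)."""
    return [sum(col) for col in zip(*cluster)]


def sim_fast(s, size):
    """SIM computed from the sum vector s and the cluster size only."""
    return (dot(s, s) - size) / (size * (size - 1))


# Hypothetical length-normalized clusters.
ci = [normalize(v) for v in ([1, 0, 1], [1, 1, 0])]
cj = [normalize(v) for v in ([0, 1, 1], [2, 1, 0], [1, 1, 1])]

# Merging only needs the two sum vectors and the two sizes.
s_new = [a + b for a, b in zip(sum_vector(ci), sum_vector(cj))]
merged_fast = sim_fast(s_new, len(ci) + len(cj))
merged_naive = sim_naive(ci + cj)
```

The two values agree up to floating-point rounding, while the fast version costs one dot product per candidate merge instead of a sum over all member pairs.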

Example Word Clustering

• Words (objects) are described and clustered using a set of features and values
– E.g., the left and right neighbors of tokens of words

[Figure: hierarchy produced by the word clustering. “be” has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
– Single-link measure
– Complete-link measure
– Group-average measure

• How to split a cluster
– Splitting is also a clustering task (finding two sub-clusters)
– Any clustering algorithm can be used for the splitting operation, e.g.,
• Bottom-up (agglomerative) algorithms
• Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm (pseudocode figure; the surviving annotations are listed)
– Split the least coherent cluster
– Generate two new clusters and remove the original one
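The divisive loop can be sketched as follows. This is an illustration under stated assumptions rather than the slide's pseudocode: coherence is measured with the group-average similarity, and `split_in_two` is a hypothetical splitter that seeds two sub-clusters with the least similar pair of members.

```python
def coherence(cluster, sim):
    """Group-average similarity; singletons count as perfectly coherent."""
    n = len(cluster)
    if n < 2:
        return float("inf")
    total = sum(sim(cluster[i], cluster[j]) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def split_in_two(cluster, sim):
    """Hypothetical splitter: seed with the least similar pair, then assign
    every object to the seed it is more similar to."""
    pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
    s1, s2 = min(pairs, key=lambda p: sim(*p))
    left = [x for x in cluster if sim(x, s1) >= sim(x, s2)]
    right = [x for x in cluster if sim(x, s1) < sim(x, s2)]
    return left, right


def divisive(objects, sim, k):
    """Split the least coherent cluster until k clusters exist."""
    clusters = [list(objects)]
    while len(clusters) < k:
        worst = min(clusters, key=lambda c: coherence(c, sim))
        if len(worst) < 2:
            break  # only singletons are left, nothing to split
        # Generate two new clusters and remove the original one.
        clusters.remove(worst)
        clusters.extend(split_in_two(worst, sim))
    return clusters


# Hypothetical toy data on a line, with similarity 1 / (1 + distance).
similarity = lambda x, y: 1.0 / (1.0 + abs(x - y))
result = divisive([0.0, 0.2, 0.4, 5.0, 5.2, 9.0], similarity, k=3)
```

Any other splitter, such as K-means with two centers, could be substituted for `split_in_two` without changing the outer loop.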


Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
– In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
– When to stop (e.g., judged by group-average similarity, likelihood, or mutual information)
– What is the right number of clusters (k−1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
– The K-means algorithm
– The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
– A hard clustering algorithm
– Defines clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
– A kind of vector quantization
• Map from a continuous space (high resolution) to a discrete space (low resolution)
$$X = \{\vec{x}^t\}_{t=1}^{N} \ \xrightarrow{\ \text{index } j\ }\ F = \{\vec{m}_j\}_{j=1}^{k}$$
where each $\vec{m}_j$ is called a cluster centroid, reference vector, code vector, or code word
– E.g., color quantization
• 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., Dim($\vec{x}^t$) = 24 → k = 2^8
• A compression rate of 3

The K-means Algorithm (cont)

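• Total reconstruction error
$$E\left(\{\vec{m}_i\}_{i=1}^{k} \mid X\right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \left\| \vec{x}^t - \vec{m}_i \right\|^2, \quad \text{where } b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t - \vec{m}_i\| = \min_j \|\vec{x}^t - \vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$
– $b_i^t$ (the label) and $\vec{m}_i$ are unknown
– $b_i^t$ depends on $\vec{m}_i$, and this optimization problem cannot be solved analytically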

The K-means Algorithm (cont)

• Initialization
– A set of initial cluster centers $\{\vec{m}_i\}_{i=1}^{k}$ is needed

• Recursion
– Assign each object $\vec{x}^t$ to the cluster whose center is closest
$$b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t - \vec{m}_i\| = \min_j \|\vec{x}^t - \vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$
– Then re-compute the center of each cluster as the centroid or mean (average) of its members
$$\vec{m}_i = \frac{\sum_{t=1}^{N} b_i^t\, \vec{x}^t}{\sum_{t=1}^{N} b_i^t}$$
• Or use the medoid as the cluster center (a medoid is one of the objects in the cluster)

– These two steps are repeated until $\vec{m}_i$ stabilizes
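The two alternating steps can be sketched in a few lines of Python. The function name `kmeans`, the tuple representation of points, and the choice to keep an empty cluster's old center are assumptions made for this illustration.

```python
def kmeans(points, centers, max_iter=100):
    """Plain K-means on tuples; returns the final centers and member lists."""
    for _ in range(max_iter):
        # Assignment step: each point goes to the closest center (b_i^t = 1).
        groups = [[] for _ in centers]
        for x in points:
            i = min(
                range(len(centers)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])),
            )
            groups[i].append(x)
        # Update step: each center becomes the mean of its members.
        # Assumption: an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:  # centers have stabilized
            break
        centers = new_centers
    return centers, groups


# Hypothetical toy data: two well-separated pairs of points.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, groups = kmeans(points, [(0, 0), (10, 0)])
```

On this toy data the centers settle after one update, at the midpoints of the two pairs. The result in general depends on the initial centers passed in.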

The K-means Algorithm (cont)

• Algorithm [pseudocode figure not reproduced]

The K-means Algorithm (cont)

• Example 1 [figure not reproduced]

The K-means Algorithm (cont)

• Example 2 [figure not reproduced; labels appearing in the figure: government, finance, sports, research, name]

The K-means Algorithm (cont)

• Choice of initial cluster centers (seeds) is important
– Pick at random
– Calculate the mean $\vec{m}$ of all data and generate k initial centers $\vec{m}_i$ by adding a small random vector to the mean: $\vec{m}_i = \vec{m} \pm \vec{\delta}$
– Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $\vec{m}_i$
– Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
• E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering
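As an illustration of the second seeding option, the sketch below perturbs the global mean with small random offsets. The function name, the uniform perturbation, and the `scale` and `seed` parameters are assumptions; the slide does not specify how the random vector is drawn.

```python
import random


def seeds_around_mean(points, k, scale=0.01, seed=0):
    """Generate k initial centers as the data mean plus a small random vector."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Assumption: each coordinate is perturbed uniformly within +/- scale.
    return [tuple(m + rng.uniform(-scale, scale) for m in mean) for _ in range(k)]


# Hypothetical data with mean (1.0, 2.0).
data = [(0.0, 0.0), (2.0, 4.0)]
initial_centers = seeds_around_mean(data, k=3)
```

The resulting centers are nearly identical, so the first few K-means iterations are what actually pull them apart.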

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
– Randomly assign the object to one of the candidate clusters
– Or perturb objects slightly

• Applications of the K-means algorithm
– Clustering
– Vector quantization
– A preprocessing stage before classification or regression
• Map from the original space to an l-dimensional space/hypercube, with l = log2 k (k clusters)

[Figure labels: nodes on the hypercube; a linear classifier]

The K-means Algorithm (cont)

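• E.g., the LBG algorithm
– By Linde, Buzo, and Gray
– The number of clusters doubles at each iteration (M → 2M)

[Figure: the global mean is split into the cluster 1 mean and the cluster 2 mean; after the next split the four clusters carry the parameters {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}]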

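The EM Algorithm

• A soft version of the K-means algorithm
– Each object could be a member of multiple clusters
– Clustering as estimating a mixture of (continuous) probability distributions

[Figure: a mixture model (a mixture Gaussian HMM, or a mixture of Gaussians) in which the component densities $P(\vec{x}_i|c_1), P(\vec{x}_i|c_2), \ldots, P(\vec{x}_i|c_K)$ are weighted by the priors $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$ and summed]

$$P(\vec{x}_i|\Theta) = \sum_{k=1}^{K} P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)$$

Continuous case:
$$P(\vec{x}_i|c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$$

Likelihood function for data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent, identically distributed (i.i.d.):
$$P(X|\Theta) = \prod_{i=1}^{n} P(\vec{x}_i|\Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)$$

Classification:
$$\max_k P(c_k|\vec{x}_i, \Theta) = \max_k \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{P(\vec{x}_i|\Theta)} = \max_k P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)$$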
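The mixture density and the classification rule can be illustrated in one dimension. The sketch below assumes scalar observations and scalar variances (rather than the multivariate form on the slide); the function names and parameter values are hypothetical.

```python
import math


def gaussian(x, mu, var):
    """Univariate normal density, the 1-D special case of P(x | c_k, Theta)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def posteriors(x, priors, mus, variances):
    """P(c_k | x, Theta) for every component k, via Bayes' rule."""
    joint = [p * gaussian(x, m, v) for p, m, v in zip(priors, mus, variances)]
    total = sum(joint)  # this is the mixture density P(x | Theta)
    return [j / total for j in joint]


# Hypothetical two-component mixture with equal priors.
post = posteriors(0.9, priors=[0.5, 0.5], mus=[0.0, 2.0], variances=[1.0, 1.0])
# Classification: pick the component with the largest posterior.
label = max(range(len(post)), key=post.__getitem__)
```

Because the denominator is the same for every component, comparing the joint terms gives the same label as comparing the posteriors, which is the simplification shown in the last equality above.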

The EM Algorithm (cont)


Maximum Likelihood Estimation

• Hard Assignment
– State S1

P(B | S1) = 2/4 = 0.5

P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation

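• Soft Assignment

Sample | Color | Weight for state S1 | Weight for state S2
1 | B | 0.7 | 0.3
2 | W | 0.4 | 0.6
3 | B | 0.9 | 0.1
4 | W | 0.5 | 0.5

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6 / 2.5 = 0.64

P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9 / 2.5 = 0.36

P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4 / 1.5 = 0.27

P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1 / 1.5 = 0.73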
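The soft-assignment estimates are ratios of weighted counts, which the following sketch reproduces. The helper name `soft_estimate` is hypothetical, and the color of each sample is inferred from which weights appear in the numerators above.

```python
def soft_estimate(weights, colors):
    """Weighted relative frequency of each color under one state."""
    total = sum(weights)
    return {
        c: sum(w for w, col in zip(weights, colors) if col == c) / total
        for c in sorted(set(colors))
    }


# Per-sample weights for state S1; the weights for S2 are the complements.
weights_s1 = [0.7, 0.4, 0.9, 0.5]
colors = ["B", "W", "B", "W"]

p_s1 = soft_estimate(weights_s1, colors)
p_s2 = soft_estimate([1 - w for w in weights_s1], colors)
```

With every weight set to 1 for a single state, the same function reduces to the hard-assignment counts.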

The EM Algorithm (cont)

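• E-step (Expectation)
– Derive the complete data likelihood function

$$\begin{aligned}
P(X|\Theta) &= \prod_{i=1}^{n} P(\vec{x}_i|\Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta) \\
&= \left[ P(\vec{x}_1|c_1, \Theta) P(c_1|\Theta) + \cdots + P(\vec{x}_1|c_K, \Theta) P(c_K|\Theta) \right] \times \cdots \times \left[ P(\vec{x}_n|c_1, \Theta) P(c_1|\Theta) + \cdots + P(\vec{x}_n|c_K, \Theta) P(c_K|\Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i|c_{k_i}, \Theta)\, P(c_{k_i}|\Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i}|\Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n}|\Theta) \\
&= \sum_{C} P(X, C|\Theta)
\end{aligned}$$

where $X = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_{n-1}, \vec{x}_n$ and $C = c_{k_1}, c_{k_2}, \ldots, c_{k_{n-1}}, c_{k_n}$. $P(X|\Theta)$ is the likelihood function and $P(X, C|\Theta)$ is the complete data likelihood function. How many kinds of $C$ are there? $K^n$ kinds.

Note:
$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$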
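The product-of-sums identity in the note, and the count of possible label sequences, can be checked by brute force on a small example. The matrix `a` below is arbitrary illustrative data, not taken from the slides.

```python
import math
from itertools import product

# Hypothetical T x M table of terms a[t][k], with T = 4 samples and M = 3 classes.
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 2]]
T, M = len(a), len(a[0])

# Left-hand side: the product over t of the sum over k.
lhs = math.prod(sum(row) for row in a)

# Right-hand side: the sum over every label sequence (k_1, ..., k_T)
# of the product of the selected terms.
assignments = list(product(range(M), repeat=T))
rhs = sum(math.prod(a[t][k] for t, k in enumerate(ks)) for ks in assignments)
```

The number of sequences grows as M to the power T, which is why the sum over all complete assignments is never enumerated in practice and the factorized form is used instead.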

The EM Algorithm (cont)

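• E-step (Expectation)
– Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$
$$\Phi(\Theta, \hat{\Theta}) = E_C\left[ L_{CM} \mid X, \Theta \right] = E_C\left[ \log P(X, C|\hat{\Theta}) \mid X, \Theta \right] = \sum_{C} P(C|X, \Theta) \log P(X, C|\hat{\Theta}) = \sum_{C} \frac{P(X, C|\Theta)}{P(X|\Theta)} \log P(X, C|\hat{\Theta})$$
(in $\Phi(\Theta, \hat{\Theta})$, the first argument $\Theta$ is known and the second argument $\hat{\Theta}$ is unknown)

– Maximize the log likelihood function $\log P(X|\hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$

• We have shown this property when deriving the HMM-based retrieval model
$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X|\hat{\Theta}) - \log P(X|\Theta)$$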

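The EM Algorithm (cont)

• E-step (Expectation)
– The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C|\Theta)}{P(X|\Theta)} \log P(X, C|\hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j}|\Theta)}{P(\vec{x}_j|\Theta)} \right] \log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i}|\hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i}|\hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i}|\hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i, k} \log P(\vec{x}_i, c_k|\hat{\Theta}) \prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j, \Theta) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k|\hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k|\vec{x}_i, \Theta) \log P(\vec{x}_i, c_k|\hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k|\vec{x}_i, \Theta) \log \left[ P(\vec{x}_i|c_k, \hat{\Theta})\, P(c_k|\hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k|\vec{x}_i, \Theta) \log P(c_k|\hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k|\vec{x}_i, \Theta) \log P(\vec{x}_i|c_k, \hat{\Theta})
\end{aligned}$$

where
$$\delta_{k_i, k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}$$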

The EM Algorithm (cont)

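– Note that

$$\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j, \Theta)
&= \left[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{\substack{j=1 \\ j \neq i}}^{n} P(c_{k_j}|\vec{x}_j, \Theta) \right] P(c_k|\vec{x}_i, \Theta) \\
&= \left[ \prod_{\substack{j=1 \\ j \neq i}}^{n} \sum_{k_j=1}^{K} P(c_{k_j}|\vec{x}_j, \Theta) \right] P(c_k|\vec{x}_i, \Theta) \\
&= \left[ \prod_{\substack{j=1 \\ j \neq i}}^{n} 1 \right] P(c_k|\vec{x}_i, \Theta) \\
&= P(c_k|\vec{x}_i, \Theta)
\end{aligned}$$

(because of $\delta_{k_i, k}$, $\vec{x}_i$ can only be aligned to $c_k$)

Note (the identity used in the second step):
$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$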

The EM Algorithm (cont)

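• E-step (Expectation)
– The auxiliary function can also be divided into two parts

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

where

$$\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k|\vec{x}_i, \Theta) \log P(c_k|\hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k|\Theta)}{P(\vec{x}_i|\Theta)} \log P(c_k|\hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l, \Theta)\, P(c_l|\Theta)} \log P(c_k|\hat{\Theta}) \qquad \text{(auxiliary function for mixture weights)}
\end{aligned}$$

$$\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k|\vec{x}_i, \Theta) \log P(\vec{x}_i|c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l, \Theta)\, P(c_l|\Theta)} \log P(\vec{x}_i|c_k, \hat{\Theta}) \qquad \text{(auxiliary function for cluster distributions)}
\end{aligned}$$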

The EM Algorithm (cont)

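• M-step (Maximization)
– Remember that
• A function F can be maximized under a constraint by applying a Lagrange multiplier

Suppose that
$$F = \sum_{j=1}^{N} w_j \log y_j, \qquad \text{with the constraint } \sum_{j=1}^{N} y_j = 1$$

By applying a Lagrange multiplier $\ell$:
$$\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \ \Rightarrow\ w_j = -\ell\, y_j \quad \forall j \\
&\Rightarrow\ \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell \ \Rightarrow\ \ell = -\sum_{j=1}^{N} w_j \\
\therefore\ y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$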
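A numerical sanity check of the closed-form solution: the sketch below compares F at the normalized weights with F at randomly drawn points on the probability simplex. The weights, the sampling scheme, and the helper names are assumptions made for the illustration; random search is only a spot check, not a proof.

```python
import math
import random


def objective(w, y):
    """F = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))


# Hypothetical non-negative weights.
w = [2.0, 1.0, 0.5, 0.5]

# Closed-form maximizer from the Lagrange derivation: y_j = w_j / sum(w).
y_star = [wj / sum(w) for wj in w]

rng = random.Random(0)


def random_simplex_point(n):
    """A random positive vector rescaled to sum to 1."""
    r = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(r)
    return [v / total for v in r]


# No random feasible point should beat the closed-form solution.
best_random = max(objective(w, random_simplex_point(len(w))) for _ in range(1000))
f_star = objective(w, y_star)
```

The same normalization is what turns the expected counts into the mixture-weight estimates in the M-step.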

The EM Algorithm (cont)

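• M-step (Maximization)
– Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k|\hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l, \Theta)\, P(c_l|\Theta)} \right] \log P(c_k|\hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k|\hat{\Theta}) - 1 \right)
\end{aligned}$$

with the bracketed sum playing the role of $w_k$ and $P(c_k|\hat{\Theta})$ the role of $y_k$ ($w_k$ is the expected number of times that the samples $\vec{x}_i$ fall in class $c_k$)

$$\Rightarrow\ \hat{\pi}_k = P(c_k|\hat{\Theta}) = \frac{w_k}{\sum_{k=1}^{K} w_k}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l, \Theta)\, P(c_l|\Theta)}}{\displaystyle \sum_{k=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l, \Theta)\, P(c_l|\Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l, \Theta)\, P(c_l|\Theta)}$$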

The EM Algorithm (cont)

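• M-step (Maximization)
– Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l, \Theta)\, P(c_l|\Theta)} \log P(\vec{x}_i|c_k, \hat{\Theta})$$

$$P(\vec{x}_i|c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\, |\hat{\Sigma}_k|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)$$

Let
$$w_{ik} = \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l, \Theta)\, P(c_l|\Theta)}$$
and
$$\log P(\vec{x}_i|c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

Then
$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$
where $D$ is a constant.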

The EM Algorithm (cont)

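• M-step (Maximization)
– Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow\ \hat{\vec{\mu}}_k &= \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l, \Theta)\, P(c_l|\Theta)} \cdot \vec{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l, \Theta)\, P(c_l|\Theta)}}
\end{aligned}$$

(the denominator $\sum_i w_{ik}$ is the expected number of times that the samples $\vec{x}_i$ fall in class $c_k$)

Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here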

The EM Algorithm (cont)

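• M-step (Maximization)
– Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{|\hat{\Sigma}_k|} \cdot |\hat{\Sigma}_k| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
&\Rightarrow\ \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l, \Theta)\, P(c_l|\Theta)} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k, \Theta)\, P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l, \Theta)\, P(c_l|\Theta)}}
\end{aligned}$$

Notes: $\dfrac{d \det(X)}{dX} = \det(X) \cdot X^{-T}$ and $\dfrac{d(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-T} \vec{a}\, \vec{b}^T X^{-T}$; $\hat{\Sigma}_k$ is symmetric here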
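The three update formulas (mixture weights, means, covariances) can be put together in a short EM loop. The sketch below is a one-dimensional simplification with scalar variances instead of covariance matrices; the function names, the toy data, and the initial parameters are assumptions made for the illustration.

```python
import math


def gaussian(x, mu, var):
    """Univariate normal density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_step(data, priors, mus, variances):
    """One EM iteration for a 1-D Gaussian mixture.

    Returns the updated parameters and the log-likelihood of the data
    under the parameters that were passed in."""
    K, n = len(priors), len(data)
    # E-step: w[i][k] = P(c_k | x_i, Theta).
    w, loglik = [], 0.0
    for x in data:
        joint = [priors[k] * gaussian(x, mus[k], variances[k]) for k in range(K)]
        total = sum(joint)
        loglik += math.log(total)
        w.append([j / total for j in joint])
    # M-step: expected counts, then weights, means, and variances.
    nk = [sum(w[i][k] for i in range(n)) for k in range(K)]
    new_priors = [nk[k] / n for k in range(K)]
    new_mus = [sum(w[i][k] * data[i] for i in range(n)) / nk[k] for k in range(K)]
    new_vars = [
        sum(w[i][k] * (data[i] - new_mus[k]) ** 2 for i in range(n)) / nk[k]
        for k in range(K)
    ]
    return new_priors, new_mus, new_vars, loglik


# Hypothetical toy data: two groups, around 0 and around 5.
data = [-0.2, 0.1, 0.3, -0.1, 4.8, 5.1, 5.3, 4.9]
params = ([0.5, 0.5], [0.0, 1.0], [1.0, 1.0])  # assumed initial guess
logliks = []
for _ in range(20):
    *params, ll = em_step(data, *params)
    logliks.append(ll)
priors, mus, variances = params
```

The recorded log-likelihoods never decrease from one iteration to the next, which is the property that the auxiliary function guarantees. A production implementation would also guard against a variance collapsing towards zero, which this sketch does not do.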

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X|\Theta)$ has converged or the maximum number of iterations is reached

Hierarchical Document Organization

• Explore the probabilistic latent topical information
– TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by the distances on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]

$$P(w_j|D_i) = \sum_{k=1}^{K} P(T_k|D_i) \left[ \sum_{l=1}^{K} P(T_l|Y_k)\, P(w_j|T_l) \right]$$

$$P(T_l|Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{\mathrm{dist}(T_k, T_l)^2}{2\sigma^2} \right]$$

$$\mathrm{dist}(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
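The neighborhood distribution P(T_l | Y_k) depends only on the map coordinates of the topics and on σ. A minimal sketch, assuming a hypothetical 2 × 2 grid of topic positions and σ = 1:

```python
import math


def neighborhood(coords, sigma):
    """table[k][l] = P(T_l | Y_k), from Gaussian-weighted map distances."""

    def e(a, b):
        # Gaussian function of the Euclidean distance between two map positions.
        d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
        return math.exp(-d2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    table = []
    for ck in coords:
        row = [e(ck, cl) for cl in coords]
        z = sum(row)  # normalizer: sum over s of E(T_k, T_s)
        table.append([v / z for v in row])
    return table


# Hypothetical map positions (x, y) of four topics on a 2 x 2 grid.
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
P = neighborhood(coords, sigma=1.0)
```

Each row sums to 1, and topics that are closer on the map receive more of the probability mass, which is what ties neighboring clusters together. The constant factor in front of the exponential cancels in the normalization.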

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j|D_i) = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k|D_i) \left[ \sum_{l=1}^{K} P(T_l|Y_k)\, P(w_j|T_l) \right] \right\}$$

– EM training can be performed

$$\hat{P}(w_j|T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k|w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k|w_{j'}, D_{i'})}$$

$$\hat{P}(T_k|D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k|w_j, D_i)}{c(D_i)}$$

where

$$P(T_k|w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j|T_l)\, P(T_l|Y_k) \right] P(T_k|D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l=1}^{K} P(w_j|T_l)\, P(T_l|Y_{k'}) \right] P(T_{k'}|D_i) \right\}}$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k|D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k|D_{i'}) \right]}$$

Hierarchical Document Organization (cont)

• Example [figure not reproduced]

Hierarchical Document Organization (cont)

• Example (cont) [figure not reproduced]

Hierarchical Document Organization (cont)

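• Self-Organizing Map (SOM)
– A recursive regression process

[Figure: an input layer holding the input vector $\vec{x} = [x_1, x_2, \ldots, x_n]^T$ and a mapping layer in which each unit $i$ has a weight vector $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, e.g., $\vec{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$]

$$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}), i}(t) \left[ \vec{x}(t) - \vec{m}_i(t) \right]$$

$$c(\vec{x}) = \arg\min_i \left\| \vec{x} - \vec{m}_i \right\|, \qquad \text{where } \left\| \vec{x} - \vec{m}_i \right\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

$$h_{c(\vec{x}), i}(t) = \alpha(t) \exp\left( -\frac{\left\| \vec{r}_i - \vec{r}_{c(\vec{x})} \right\|^2}{2\sigma^2(t)} \right)$$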
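One step of the update rule can be sketched as follows. The function name `som_step`, the 1-D map layout in `grid`, and the constant values used for α(t) and σ(t) are assumptions made for the illustration.

```python
import math


def som_step(weights, grid, x, alpha, sigma):
    """One SOM update: find the winner c(x), then pull every unit towards x."""
    # Winner: the unit whose weight vector is closest to the input.
    c = min(
        range(len(weights)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, weights[i])),
    )
    updated = []
    for i, m in enumerate(weights):
        # Neighborhood function h, based on the map distance between unit i and the winner.
        r2 = sum((a - b) ** 2 for a, b in zip(grid[i], grid[c]))
        h = alpha * math.exp(-r2 / (2 * sigma ** 2))
        updated.append([mi + h * (xi - mi) for mi, xi in zip(m, x)])
    return c, updated


# Hypothetical map: three units on a line, each with a 2-D weight vector.
weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
grid = [(0,), (1,), (2,)]
winner, new_weights = som_step(weights, grid, x=[1.2, 0.9], alpha=0.5, sigma=1.0)
```

The winner moves the most, and its map neighbors move by a smaller amount. In a full training run both α(t) and σ(t) would typically shrink over time, which this single step does not show.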

Hierarchical Document Organization (cont)

• Results

Model | Iterations | dist_Between / dist_Within
TMM | 10 | 1.9165
TMM | 20 | 2.0650
TMM | 30 | 1.9477
TMM | 40 | 1.9175
SOM | 100 | 2.0604

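$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

$$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$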
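The ratio can be computed by averaging map distances separately over document pairs with different labels and pairs with the same label. The sketch below is illustrative: the function name, the map positions, and the labels (standing in for T_r) are hypothetical, and it assumes at least one pair of each kind exists.

```python
import math
from itertools import combinations


def distance_ratio(positions, labels):
    """R_Dist: mean between-class map distance over mean within-class map distance."""
    between, within = [], []
    for i, j in combinations(range(len(positions)), 2):
        d = math.dist(positions[i], positions[j])  # Euclidean distance on the map
        (within if labels[i] == labels[j] else between).append(d)
    # Assumption: both lists are non-empty.
    return (sum(between) / len(between)) / (sum(within) / len(within))


# Hypothetical map positions of four documents and their reference labels.
positions = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = ["a", "a", "b", "b"]
ratio = distance_ratio(positions, labels)
```

A larger ratio means that documents sharing a label sit closer together on the map than documents with different labels.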


Page 5: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 5

Hard Assignment vs Soft Assignment

• Another important distinction between clustering algorithms is whether they perform soft or hard assignment

• Hard assignment
  - Each object is assigned to one and only one cluster

• Soft assignment (probabilistic approach)
  - Each object may be assigned to multiple clusters
  - An object $x_i$ has a probability distribution $P(\cdot \mid x_i)$ over clusters $c_j$, where $P(c_j \mid x_i)$ is the probability that $x_i$ is a member of $c_j$
  - Somewhat more appropriate in many tasks such as NLP and IR

Hard Assignment vs Soft Assignment (cont)

• Hierarchical clustering usually adopts hard assignment

• In flat clustering, both types of assignment are common

Summarized Attributes of Clustering Algorithms

• Hierarchical clustering
  - Preferable for detailed data analysis
  - Provides more information than flat clustering
  - No single best algorithm (each algorithm is optimal only for some applications)
  - Less efficient than flat clustering (at minimum, an n × n matrix of similarity coefficients has to be computed)

Summarized Attributes of Clustering Algorithms (cont)

• Flat clustering
  - Preferable if efficiency is a consideration or data sets are very large
  - K-means is the conceptually simplest method and should probably be used first on a new data set, because its results are often sufficient
  - K-means assumes a simple Euclidean representation space and so cannot be used for many data sets, e.g., nominal data like colors (or samples with features of different scales)
  - The EM algorithm is the method of choice. It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
    • Its extensions can be used to handle topological/hierarchical orders of samples
      - Probabilistic Latent Semantic Analysis (PLSA), Topic Mixture Model (TMM), etc.

Hierarchical Clustering

• Can proceed in either a bottom-up or a top-down manner
  - Bottom-up (agglomerative)
    • Start with individual objects and group the most similar ones
      - E.g., those with the minimum distance apart, using $\mathrm{sim}(x, y) = \dfrac{1}{1 + d(x, y)}$ (distance measures will be discussed later on)
    • The procedure terminates when one cluster containing all objects has been formed
  - Top-down (divisive)
    • Start with all objects in one group and divide them into groups so as to maximize within-group similarity

Hierarchical Agglomerative Clustering (HAC)

• A bottom-up approach

• Assume a similarity measure for determining the similarity of two objects

• Start with each object in a separate cluster, then repeatedly join the two most similar clusters until only one cluster survives

• The history of merging/clustering forms a binary tree or hierarchy

HAC (cont)

• Algorithm (annotations recovered from the slide's pseudocode figure)
  - Initialization (for tree leaves): each object is a cluster
  - Loop until the cluster number is one: the two most similar clusters are merged as a new cluster
  - The original two clusters are removed
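The merge loop above can be sketched as follows. This is a minimal illustration rather than the slide's own pseudocode: the function names (`hac`, `single_link`, `complete_link`), the toy one-dimensional objects, and the use of sim = 1/(1 + d) with Euclidean d are assumptions made for the example.

```python
import math

def similarity(x, y):
    """sim(x, y) = 1 / (1 + d(x, y)), with d the Euclidean distance (assumed here)."""
    return 1.0 / (1.0 + math.dist(x, y))

def single_link(ci, cj):
    """Cluster similarity = similarity of the two closest members."""
    return max(similarity(x, y) for x in ci for y in cj)

def complete_link(ci, cj):
    """Cluster similarity = similarity of the two most dissimilar members."""
    return min(similarity(x, y) for x in ci for y in cj)

def hac(objects, cluster_sim=single_link):
    """Return the merge history as a list of (cluster_a, cluster_b, similarity)."""
    clusters = [(obj,) for obj in objects]  # initialization: each object is a cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the greatest similarity.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cluster_sim(clusters[a], clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        history.append((clusters[a], clusters[b], s))
        merged = clusters[a] + clusters[b]
        # Remove the two original clusters and add the merged one.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return history

objects = [(0.0,), (1.0,), (5.0,), (6.5,)]  # hypothetical toy data
history = hac(objects)
history_complete = hac(objects, cluster_sim=complete_link)
```

The history has one entry per merge, so reading it from first to last traces the binary tree from the leaves to the root. The pairwise search makes this sketch cubic in the number of objects, which is acceptable only for small illustrations.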

Distance Metrics

• Euclidean distance (L2 norm)
  $$L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
  - Make sure that all attributes/dimensions have the same scale (or the same variance)

• L1 norm (city-block distance)
  $$L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$$

• Cosine similarity (transform to a distance by subtracting from 1)
  $$1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$$ (ranges between 0 and 1)
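As a quick illustration of the three measures, the sketch below implements them for plain tuples. The function names and the sample vectors are chosen for the example only.

```python
import math

def l2_distance(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1_distance(x, y):
    """City-block distance: sum of absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine_distance(x, y):
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / (norm_x * norm_y)

x, y = (3.0, 0.0), (0.0, 4.0)  # hypothetical sample vectors
d2 = l2_distance(x, y)       # 5.0
d1 = l1_distance(x, y)       # 7.0
dc = cosine_distance(x, y)   # 1.0, since the vectors are orthogonal
```

For vectors with non-negative components (such as term-frequency vectors) the cosine distance stays within [0, 1]; with negative components it can reach 2.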

Measures of Cluster Similarity

• Especially for the bottom-up approaches

• Single-link clustering
  - The similarity between two clusters is the similarity of the two closest objects in the clusters
  - Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity
    $$\mathrm{sim}(c_i, c_j) = \max_{\vec{x} \in c_i,\ \vec{y} \in c_j} \mathrm{sim}(\vec{x}, \vec{y})$$
  - Elongated clusters are achieved (cf. the minimal spanning tree)

[Figure: clusters $C_i$ and $C_j$ linked by the pair with the greatest similarity]

Measures of Cluster Similarity (cont)

• Complete-link clustering
  - The similarity between two clusters is the similarity of their two most dissimilar members
    $$\mathrm{sim}(c_i, c_j) = \min_{\vec{x} \in c_i,\ \vec{y} \in c_j} \mathrm{sim}(\vec{x}, \vec{y})$$
  - Sphere-shaped clusters are achieved
  - Preferable for most IR and NLP applications

[Figure: clusters $C_i$ and $C_j$ linked by the pair with the least similarity]

Measures of Cluster Similarity (cont)

[Figure: single-link vs. complete-link clustering]

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering
  - A compromise between single-link and complete-link clustering
  - The similarity between two clusters is the average similarity between members
  - If the objects are represented as length-normalized vectors and the similarity measure is the cosine, there exists a fast algorithm for computing the average similarity
    $$\mathrm{sim}(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x} \cdot \vec{y} \quad \text{(length-normalized vectors)}$$

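Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)
  - The average similarity SIM between vectors in a cluster $c_j$ is defined as
    $$SIM(c_j) = \frac{1}{|c_j|\,(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \mathrm{sim}(\vec{x}, \vec{y}) = \frac{1}{|c_j|\,(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \vec{x} \cdot \vec{y}$$
  - The sum of members in a cluster $c_j$
    $$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$
  - Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$
    $$
    \begin{aligned}
    \vec{s}(c_j) \cdot \vec{s}(c_j) &= \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} \\
    &= |c_j|\,(|c_j| - 1)\, SIM(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} \\
    &= |c_j|\,(|c_j| - 1)\, SIM(c_j) + |c_j| \qquad (\vec{x} \cdot \vec{x} = 1 \text{ for a length-normalized vector}) \\
    \therefore\ SIM(c_j) &= \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|\,(|c_j| - 1)}
    \end{aligned}
    $$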

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)
  - When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance
    $$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$
  - The average similarity for their union will be
    $$SIM(c_i \cup c_j) = \frac{\big(\vec{s}(c_i) + \vec{s}(c_j)\big) \cdot \big(\vec{s}(c_i) + \vec{s}(c_j)\big) - \big(|c_i| + |c_j|\big)}{\big(|c_i| + |c_j|\big)\big(|c_i| + |c_j| - 1\big)}$$

[Figure: clusters $C_i$ and $C_j$ with their sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$]
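To make the shortcut concrete, the sketch below compares the sum-vector formula with a brute-force average over all ordered pairs. The helper names and the two small clusters are hypothetical; the vectors are length-normalized before use, as the derivation requires.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(a * a for a in v))
    return tuple(a / length for a in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sum_vector(cluster):
    """s(c): component-wise sum of the cluster members."""
    return tuple(sum(column) for column in zip(*cluster))

def sim_from_sum(s, size):
    """SIM(c) = (s . s - |c|) / (|c| (|c| - 1)) for unit-length members."""
    return (dot(s, s) - size) / (size * (size - 1))

def sim_brute_force(cluster):
    """Average of x . y over all ordered pairs of distinct members."""
    n = len(cluster)
    total = sum(dot(cluster[a], cluster[b])
                for a in range(n) for b in range(n) if a != b)
    return total / (n * (n - 1))

# Hypothetical clusters, normalized to unit length.
c_i = [normalize(v) for v in [(1.0, 0.0), (1.0, 1.0)]]
c_j = [normalize(v) for v in [(0.0, 1.0), (1.0, 2.0), (2.0, 1.0)]]

s_i, s_j = sum_vector(c_i), sum_vector(c_j)
s_new = tuple(a + b for a, b in zip(s_i, s_j))   # s(c_New) = s(c_i) + s(c_j)
n_new = len(c_i) + len(c_j)                      # |c_New| = |c_i| + |c_j|

fast = sim_from_sum(s_new, n_new)
slow = sim_brute_force(c_i + c_j)
```

The fast path needs only the two stored sum vectors and sizes, so evaluating a candidate merge costs time proportional to the vector dimension instead of the number of member pairs.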

Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values
  - E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of 22 words; higher nodes correspond to decreasing similarity; "be" has the least similarity with the other 21 words]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  - Single-link measure
  - Complete-link measure
  - Group-average measure

• How to split a cluster
  - Splitting is itself a clustering task (finding two sub-clusters)
  - Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm (annotations recovered from the slide's pseudocode figure)
  - Split the least coherent cluster
  - Generate two new clusters and remove the original one

Non-Hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  - In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  - When to stop (e.g., judged by group-average similarity, likelihood, or mutual information)
  - What is the right number of clusters (k-1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
  - The K-means algorithm
  - The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  - A hard clustering algorithm
  - Defines clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  - A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)
      $$X = \{\vec{x}^t\}_{t=1}^{N} \ \longrightarrow\ F = \{\vec{m}_j\}_{j=1}^{k}$$
      where $j$ is the index and $\vec{m}_j$ is the cluster centroid (also called reference vector, code vector, or code word)
  - E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., $\mathrm{Dim}(\vec{x}^t) = 24 \rightarrow k = 2^8$
    • A compression rate of 3

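The K-means Algorithm (cont)

• Total reconstruction error
  $$E\big(\{\vec{m}_i\}_{i=1}^{k} \mid X\big) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t\, \|\vec{x}^t - \vec{m}_i\|^2, \qquad \text{where } b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t - \vec{m}_i\| = \min_j \|\vec{x}^t - \vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$
  - $b_i^t$ (the label) and $\vec{m}_i$ are unknown
  - $b_i^t$ depends on $\vec{m}_i$, and this optimization problem cannot be solved analytically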

The K-means Algorithm (cont)

• Initialization
  - A set of initial cluster centers $\{\vec{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  - Assign each object $\vec{x}^t$ to the cluster whose center is closest
    $$b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t - \vec{m}_i\| = \min_j \|\vec{x}^t - \vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$
  - Then re-compute the center of each cluster as the centroid or mean (average) of its members
    $$\vec{m}_i = \frac{\sum_{t=1}^{N} b_i^t\, \vec{x}^t}{\sum_{t=1}^{N} b_i^t}$$
    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)
  - These two steps are repeated until $\vec{m}_i$ stabilizes

The K-means Algorithm (cont)

• Algorithm
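The slide's pseudocode figure did not survive extraction. A minimal sketch of the two alternating steps (assignment and centroid update), together with the total reconstruction error, might look like the following. The function names, the tie-breaking rule (lowest index wins), the handling of empty clusters, and the toy data are assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(points, centers):
    """Hard labels: index of the closest center for each point (ties go to the lowest index)."""
    return [min(range(len(centers)), key=lambda i: squared_distance(x, centers[i]))
            for x in points]

def update(points, labels, centers):
    """Recompute each center as the mean of its members; an empty cluster keeps its old center."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [x for x, lab in zip(points, labels) if lab == i]
        if not members:
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centers

def reconstruction_error(points, labels, centers):
    """E = sum_t ||x^t - m_label(t)||^2."""
    return sum(squared_distance(x, centers[lab]) for x, lab in zip(points, labels))

def k_means(points, centers, max_iter=100):
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:   # centers have stabilized
            break
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical toy data with two obvious groups and two hand-picked seeds.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, labels = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
error = reconstruction_error(points, labels, centers)
```

Each pass can only lower or keep the reconstruction error, which is why the loop terminates, but the result is a local optimum that depends on the seeds.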

The K-means Algorithm (cont)

• Example 1

The K-means Algorithm (cont)

• Example 2

[Figure: clusters labeled government, finance, sports, research, name]

The K-means Algorithm (cont)

• The choice of initial cluster centers (seeds) is important
  - Pick at random
  - Calculate the mean of all data and generate the k initial centers $\vec{m}_i$ by adding a small random vector to the mean ($\vec{m}_i \pm \vec{\delta}$)
  - Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $\vec{m}_i$
  - Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  - Randomly assign the object to one of the candidate clusters
  - Or perturb objects slightly

• Applications of the K-means algorithm
  - Clustering
  - Vector quantization
  - A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with $l = \log_2 k$ (k clusters)

[Figure: the k clusters become nodes on the hypercube, which feed a linear classifier]

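The K-means Algorithm (cont)

• E.g., the LBG algorithm
  - By Linde, Buzo, and Gray
  - The number of clusters doubles at each iteration (M → 2M)

[Figure: the global mean is split into the cluster 1 mean and the cluster 2 mean, and so on, giving components {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}]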

The EM Algorithm

• A soft version of the K-means algorithm
  - Each object could be a member of multiple clusters
  - Clustering as estimating a mixture of (continuous) probability distributions

• A mixture of Gaussians (or a mixture-Gaussian HMM) with mixture weights $\pi_k = P(c_k)$, $k = 1, \dots, K$
  $$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$
  - Continuous case
    $$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{\sqrt{(2\pi)^m\, |\Sigma_k|}} \exp\!\Big(-\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k)\Big)$$

• Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)
  $$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

• Classification
  $$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

Maximum Likelihood Estimation

• Hard assignment
  - State S1: P(B | S1) = 2/4 = 0.5, P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation (cont)

• Soft assignment
  - Each observation is shared between State S1 and State S2 with the following weights

    Observation   State S1   State S2
    B             0.7        0.3
    W             0.4        0.6
    B             0.9        0.1
    W             0.5        0.5

  - P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6 / 2.5 = 0.64
  - P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9 / 2.5 = 0.36
  - P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4 / 1.5 = 0.27
  - P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1 / 1.5 = 0.73
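The soft-assignment arithmetic can be reproduced with a few lines of code. This is only a sketch: the function name `soft_mle` and the data layout are assumptions, while the weights and the B/W labels are the ones implied by the sums in the estimates.

```python
# Observed symbols and their soft state memberships (one dict per observation).
observations = ["B", "W", "B", "W"]
soft_weights = [
    {"S1": 0.7, "S2": 0.3},
    {"S1": 0.4, "S2": 0.6},
    {"S1": 0.9, "S2": 0.1},
    {"S1": 0.5, "S2": 0.5},
]

def soft_mle(observations, weights, state):
    """P(symbol | state): weighted count of the symbol divided by the total weight of the state."""
    total = sum(w[state] for w in weights)
    mass = {}
    for symbol, w in zip(observations, weights):
        mass[symbol] = mass.get(symbol, 0.0) + w[state]
    return {symbol: m / total for symbol, m in mass.items()}

p_s1 = soft_mle(observations, soft_weights, "S1")  # B: 1.6/2.5, W: 0.9/2.5
p_s2 = soft_mle(observations, soft_weights, "S2")  # B: 0.4/1.5, W: 1.1/1.5
```

Hard assignment is the special case in which every weight is 0 or 1, so the same function then reduces to ordinary relative-frequency counting.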

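The EM Algorithm (cont)

• E-step (Expectation)
  - Derive the complete data likelihood function

$$
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \\
&= \big[P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta)\big] \times \cdots \times \big[P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta)\big] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \dots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
$$

    where $X = \vec{x}_1 \vec{x}_2 \cdots \vec{x}_{n-1} \vec{x}_n$ and $C = c_{k_1} c_{k_2} \cdots c_{k_{n-1}} c_{k_n}$; $P(X \mid \Theta)$ is the likelihood function and $P(X, C \mid \Theta)$ is the complete data likelihood function

  - Note:
    $$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$
  - How many kinds of $C$ are there? $K^n$ kinds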
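The product-of-sums identity in the note, and the count of K^n hidden assignments, can be checked numerically by brute-force enumeration. The matrix `a` below is made-up data (T = 4 rows, M = 3 columns) and the function names are illustrative.

```python
import itertools
import math

def product_of_sums(a):
    """Left-hand side: prod_t sum_k a[t][k]."""
    return math.prod(sum(row) for row in a)

def sum_of_products(a):
    """Right-hand side: sum over every assignment (k_1, ..., k_T) of prod_t a[t][k_t].

    Also returns how many assignments were enumerated.
    """
    num_choices = len(a[0])
    total = 0.0
    count = 0
    for ks in itertools.product(range(num_choices), repeat=len(a)):
        total += math.prod(a[t][k] for t, k in enumerate(ks))
        count += 1
    return total, count

a = [[0.2, 0.5, 0.3],
     [0.1, 0.6, 0.3],
     [0.7, 0.2, 0.1],
     [0.4, 0.4, 0.2]]

lhs = product_of_sums(a)
rhs, num_assignments = sum_of_products(a)   # num_assignments is 3 ** 4
```

The enumeration grows exponentially with the number of samples, which is exactly why the auxiliary function is manipulated algebraically rather than by summing over every assignment C.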

The EM Algorithm (cont)

• E-step (Expectation)
  - Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $X$ and the known parameters $\Theta$ ($\hat{\Theta}$ is unknown)
    $$\Phi(\Theta, \hat{\Theta}) = E_C\big[L_{CM}\big] = E_C\big[\log P(X, C \mid \hat{\Theta}) \,\big|\, X, \Theta\big] = \sum_{C} P(C \mid X, \Theta)\, \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)}\, \log P(X, C \mid \hat{\Theta})$$
  - Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model
      $$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$

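The EM Algorithm (cont)

• E-step (Expectation)
  - The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} P(C \mid X, \Theta)\, \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \Big[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\Big] \log \Big[\prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta})\Big] \\
&= \sum_{C} \Big[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\Big] \Big[\sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta})\Big] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, k_i}\, \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \Big\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Big\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(the braced sum equals } P(c_k \mid \vec{x}_i, \Theta)\text{; see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log \big[P(c_k \mid \hat{\Theta})\, P(\vec{x}_i \mid c_k, \hat{\Theta})\big] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

    where $\delta_{k, k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}$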

The EM Algorithm (cont)

  - Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \Big[\prod_{\substack{j=1 \\ j \neq i}}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta)\Big]\, P(c_k \mid \vec{x}_i, \Theta) \\
&= \Big[\prod_{\substack{j=1 \\ j \neq i}}^{n} 1\Big]\, P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

    because $\delta_{k, k_i}$ means that $\vec{x}_i$ can only be aligned to $c_k$, and each remaining sum $\sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta)$ equals 1

  - Note:
    $$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$

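The EM Algorithm (cont)

• E-step (Expectation)
  - The auxiliary function can also be divided into two parts
    $$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$
    where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log P(c_k \mid \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)}\, \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}\, \log P(c_k \mid \hat{\Theta}) \qquad \text{(auxiliary function for mixture weights)} \\
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}\, \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \qquad \text{(auxiliary function for cluster distributions)}
\end{aligned}
$$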

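The EM Algorithm (cont)

• M-step (Maximization)
  - Remember that
    • A function F can be maximized under a constraint by applying a Lagrange multiplier

$$
\begin{aligned}
&\text{Suppose that } F = \sum_{j=1}^{N} w_j \log y_j, \quad \text{with the constraint } \sum_{j=1}^{N} y_j = 1 \\
&\text{By applying a Lagrange multiplier } \ell: \quad \hat{F} = \sum_{j=1}^{N} w_j \log y_j + \ell \Big(\sum_{j=1}^{N} y_j - 1\Big) \\
&\frac{\partial \hat{F}}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0 \ \Rightarrow\ w_j = -\ell\, y_j \quad \forall j \\
&\sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j \ \Rightarrow\ \ell = -\sum_{j=1}^{N} w_j \\
&\therefore\ y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

  - Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$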
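A numerical sanity check of the Lagrange result can be done by comparing the closed-form solution with randomly drawn points on the probability simplex. The weights, the seed, and the function names below are arbitrary choices for the illustration.

```python
import math
import random

def objective(w, y):
    """F = sum_j w_j log y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

def lagrange_solution(w):
    """Closed-form maximizer under sum_j y_j = 1: y_j = w_j / sum_j w_j."""
    total = sum(w)
    return [wj / total for wj in w]

w = [2.0, 1.0, 1.0]                 # hypothetical non-negative weights
y_star = lagrange_solution(w)       # [0.5, 0.25, 0.25]

random.seed(0)
best_is_lagrange = True
for _ in range(1000):
    raw = [random.random() + 1e-9 for _ in w]
    y = [r / sum(raw) for r in raw]          # a random point on the simplex
    if objective(w, y) > objective(w, y_star) + 1e-12:
        best_is_lagrange = False
```

No random candidate should beat the closed-form point, which is the same normalization that turns expected counts into mixture weights in the M-step.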

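The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians), subject to $\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) = 1$

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \Big(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\Big) \\
&= \sum_{k=1}^{K} \underbrace{\Big[\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}\Big]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \Big(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\Big)
\end{aligned}
$$

$$
\Rightarrow\ \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle\sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

  - Each term of the sum is the expected number of times that $\vec{x}_i$ falls in class $c_k$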

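The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances
    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}\, \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$
  - With the Gaussian density
    $$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{\sqrt{(2\pi)^m\, |\Sigma_k|}} \exp\!\Big(-\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k)\Big)$$
  - Let
    $$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \quad \text{and} \quad \log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$
  - Then
    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Big[-\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)\Big] + D$$
    where $D$ is a constant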

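The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$
    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Big[-\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)\Big] + D$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow\ \hat{\vec{\mu}}_k &= \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

  - Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\Sigma_k^{-1}$ is symmetric here
  - $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$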

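The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$
    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Big[-\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)\Big] + D$$

$$
\begin{aligned}
&\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \Big[\frac{1}{|\hat{\Sigma}_k|} \cdot |\hat{\Sigma}_k| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1}\Big] = 0 \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
&\Rightarrow\ \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

  - Notes ($\Sigma_k$ is symmetric here):
    $$\frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \qquad \frac{d\,(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1}\, \vec{a}\, \vec{b}^T X^{-1}$$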

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
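Putting the E-step and the three M-step updates together, a one-dimensional sketch of the procedure could look like this. It is a simplified scalar version of the multivariate formulas, with hand-picked initial parameters instead of a K-means initialization; the variance floor, the tolerance, the function names, and the toy data are assumptions for the example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, weights, means, variances):
    """log P(X | Theta) = sum_i log sum_k P(x_i | c_k) P(c_k)."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in xs
    )

def em_step(xs, weights, means, variances, var_floor=1e-6):
    n, num_k = len(xs), len(weights)
    # E-step: posteriors w_ik = P(c_k | x_i, Theta) under the current parameters.
    resp = []
    for x in xs:
        joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(num_k)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate mixture weights, means, and variances from the posteriors.
    new_w, new_m, new_v = [], [], []
    for k in range(num_k):
        nk = sum(resp[i][k] for i in range(n))
        mean_k = sum(resp[i][k] * xs[i] for i in range(n)) / nk
        var_k = sum(resp[i][k] * (xs[i] - mean_k) ** 2 for i in range(n)) / nk
        new_w.append(nk / n)
        new_m.append(mean_k)
        new_v.append(max(var_k, var_floor))   # guard against a collapsing variance
    return new_w, new_m, new_v

def fit_gmm(xs, weights, means, variances, max_iter=100, tol=1e-8):
    history = [log_likelihood(xs, weights, means, variances)]
    for _ in range(max_iter):
        weights, means, variances = em_step(xs, weights, means, variances)
        history.append(log_likelihood(xs, weights, means, variances))
        if history[-1] - history[-2] < tol:   # likelihood has converged
            break
    return weights, means, variances, history

xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]           # hypothetical toy data
weights, means, variances, history = fit_gmm(
    xs, weights=[0.5, 0.5], means=[0.0, 6.0], variances=[1.0, 1.0])
```

On this toy data the two components settle near 1.0 and 5.0 with roughly equal weights. A production implementation would work with log densities to avoid underflow when a point lies far from every component.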

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  - TMM/PLSA approach
    $$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \Big[\sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l)\Big]$$
    $$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Big[-\frac{\mathrm{dist}(T_k, T_l)^2}{2\sigma^2}\Big], \qquad \mathrm{dist}(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map

• When a cluster has many documents, it can be further analyzed into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
  $$L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i)\, \log P(w_j \mid D_i) = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i)\, \log \Big\{ \sum_{k=1}^{K} P(T_k \mid D_i) \Big[\sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l)\Big] \Big\}$$
  - EM training can be performed
    $$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$
    $$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$
    where
    $$P(T_k \mid w_j, D_i) = \frac{\Big[\sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k)\Big]\, P(T_k \mid D_i)}{\sum_{k'=1}^{K} \Big\{ \Big[\sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'})\Big]\, P(T_{k'} \mid D_i) \Big\}}$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection
  $$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'})\, \big[1 - P(T_k \mid D_{i'})\big]}$$

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

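Hierarchical Document Organization (cont)

• Self-Organizing Map (SOM)
  - A recursive regression process
    $$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),\, i}(t)\, \big[\vec{x}(t) - \vec{m}_i(t)\big]$$
    $$c(\vec{x}) = \arg\min_i \|\vec{x} - \vec{m}_i\|, \qquad \text{where } \|\vec{x} - \vec{m}_i\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$
    $$h_{c(\vec{x}),\, i}(t) = \alpha(t)\, \exp\!\Big(-\frac{\|\vec{r}_i - \vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)}\Big)$$
  - Input layer: input vector $\vec{x} = [x_1, x_2, \dots, x_n]^T$
  - Mapping layer: weight vectors $\vec{m}_i = [m_{i,1}, m_{i,2}, \dots, m_{i,n}]^T$

[Figure: input layer connected to the mapping layer; the difference $\vec{x} - \vec{m}_i$ moves $\vec{m}_i$ to its updated position $\vec{m}_i'$]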
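One update of the recursive regression can be sketched as below. The 2 × 2 map, the grid positions standing in for the $\vec{r}_i$, the fixed values of α and σ, and the function names are assumptions for illustration; in practice both α(t) and σ(t) would shrink over time.

```python
import math

def best_matching_unit(x, weights):
    """c(x) = argmin_i ||x - m_i||."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))

def neighborhood(r_c, r_i, alpha, sigma):
    """h = alpha * exp(-||r_i - r_c||^2 / (2 sigma^2))."""
    return alpha * math.exp(-math.dist(r_i, r_c) ** 2 / (2.0 * sigma ** 2))

def som_step(x, weights, positions, alpha, sigma):
    """Move every weight vector toward x, scaled by its map distance to the winner."""
    c = best_matching_unit(x, weights)
    new_weights = []
    for m_i, r_i in zip(weights, positions):
        h = neighborhood(positions[c], r_i, alpha, sigma)
        new_weights.append(tuple(m + h * (xn - m) for m, xn in zip(m_i, x)))
    return c, new_weights

# Hypothetical 2 x 2 map: grid positions r_i and initial weight vectors m_i.
positions = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
x = (0.9, 0.8)

c, new_weights = som_step(x, weights, positions, alpha=0.5, sigma=1.0)
```

The winning unit moves the most (by a factor α), and units farther away on the map move less, which is what makes neighboring map nodes end up representing similar inputs.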

Hierarchical Document Organization (cont)

• Results

    Model   Iterations   dist_Between / dist_Within
    TMM     10           1.9165
    TMM     20           2.0650
    TMM     30           1.9477
    TMM     40           1.9175
    SOM     100          2.0604

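$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad
f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad
C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}, \qquad
f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad
C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$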
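The evaluation ratio can be computed directly from map coordinates and topic labels. In this sketch `map_xy[i]` is assumed to be the map position of the cluster holding document i and `topic[i]` its reference topic label (our reading of $T_{r,i}$); the four-document data set is made up.

```python
import math
from itertools import combinations

def distance_ratio(map_xy, topic):
    """R_Dist = (mean map distance between different-topic pairs)
              / (mean map distance between same-topic pairs)."""
    between_sum = within_sum = 0.0
    between_count = within_count = 0
    for i, j in combinations(range(len(topic)), 2):
        d = math.dist(map_xy[i], map_xy[j])
        if topic[i] != topic[j]:
            between_sum += d
            between_count += 1
        else:
            within_sum += d
            within_count += 1
    dist_between = between_sum / between_count
    dist_within = within_sum / within_count
    return dist_between / dist_within

# Hypothetical data: two documents per topic, placed on a small map.
map_xy = [(0, 0), (0, 1), (3, 0), (3, 1)]
topic = ["a", "a", "b", "b"]

ratio = distance_ratio(map_xy, topic)
```

A larger ratio means that documents of different topics sit farther apart on the map than documents of the same topic. The sketch assumes at least one same-topic pair at nonzero map distance, otherwise the within-topic average would be zero.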

ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN 
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 6: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 6

Hard Assignment vs. Soft Assignment (cont.)

• Hierarchical clustering usually adopts hard assignment

• In flat clustering, both types of assignment are common

Summarized Attributes of Clustering Algorithms

• Hierarchical Clustering
  - Preferable for detailed data analysis
  - Provides more information than flat clustering
  - No single best algorithm (each algorithm is optimal only for some applications)
  - Less efficient than flat clustering (at a minimum, an n x n matrix of similarity coefficients has to be computed)

Summarized Attributes of Clustering Algorithms (cont.)

• Flat Clustering
  - Preferable if efficiency is a consideration or data sets are very large
  - K-means is the conceptually simplest method and should probably be used on a new data set because its results are often sufficient
  - K-means assumes a simple Euclidean representation space and so cannot be used for many data sets, e.g., nominal data like colors (or samples with features of different scales)
  - The EM algorithm is the method of choice in such cases. It can accommodate definition of clusters and allocation of objects based on complex probabilistic models
    • Its extensions can be used to handle topological/hierarchical orders of samples
      - E.g., Probabilistic Latent Semantic Analysis (PLSA), Topic Mixture Model (TMM), etc.

Hierarchical Clustering

• Can be carried out in either a bottom-up or a top-down manner
  - Bottom-up (agglomerative)
    • Start with individual objects and group the most similar ones
      - E.g., those with the minimum distance apart
    • The procedure terminates when one cluster containing all objects has been formed
  - Top-down (divisive)
    • Start with all objects in one group and divide them into groups so as to maximize within-group similarity

• A distance d can be turned into a similarity (distance measures will be discussed later on)

$$\mathrm{sim}(x, y) = \frac{1}{1 + d(x, y)}$$

Hierarchical Agglomerative Clustering (HAC)

• A bottom-up approach

• Assume a similarity measure for determining the similarity of two objects

• Start with each object in a separate cluster, then repeatedly join the two most similar clusters until only one cluster remains

• The history of merging/clustering forms a binary tree or hierarchy

HAC (cont.)

• Algorithm
  [Algorithm listing not recovered; surviving annotations:]
  - Initialization (for tree leaves): each object is a cluster
  - Cluster number
  - The two selected clusters are merged as a new cluster
  - The original two clusters are removed
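The annotated steps (each object starts as a cluster, the two most similar clusters are merged, the originals are removed) can be sketched in Python as follows. This is a minimal illustration, not the slide's own listing: the function names `sim`, `single_link` and `hac` are hypothetical, the similarity 1 / (1 + d) with Euclidean d is assumed, and single link is used only as a default cluster similarity.

```python
import math

def sim(x, y):
    """Assumed similarity derived from Euclidean distance: 1 / (1 + d(x, y))."""
    return 1.0 / (1.0 + math.dist(x, y))

def single_link(points, a, b):
    """Cluster similarity = similarity of the two closest members."""
    return max(sim(points[i], points[j]) for i in a for j in b)

def hac(points, cluster_sim=single_link):
    """Return the merge history as a list of (members_a, members_b) index lists."""
    clusters = {i: [i] for i in range(len(points))}  # initialization: tree leaves
    next_id = len(points)                            # number given to the next new cluster
    history = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        # search all pairs of current clusters for the greatest similarity
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                s = cluster_sim(points, clusters[a], clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        history.append((clusters[a], clusters[b]))
        clusters[next_id] = clusters[a] + clusters[b]  # merged as a new cluster
        del clusters[a], clusters[b]                   # the original two are removed
        next_id += 1
    return history

points = [(0.0,), (0.1,), (1.0,), (1.2,), (5.0,)]
history = hac(points)
```

The merge history is exactly the binary tree mentioned earlier: each entry records the two children joined at one internal node. The pairwise search makes this sketch cubic in the number of objects, which is acceptable only for small inputs.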

Distance Metrics

• Euclidean distance (L2 norm)

$$L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

  - Make sure that all attributes/dimensions have the same scale (or the same variance)

• L1 norm (city-block distance)

$$L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$$

• Cosine similarity (transform to a distance by subtracting from 1)

$$1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$$

  - Ranges between 0 and 1 when all vector components are non-negative
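The three measures listed above can be written directly from their definitions. The sketch below is illustrative only; the function names are hypothetical and plain tuples stand in for feature vectors.

```python
import math

def l2_distance(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    """City-block distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 minus the cosine of the angle between x and y (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)
```

For example, `l2_distance((0, 0), (3, 4))` is 5.0 while `l1_distance((0, 0), (3, 4))` is 7, and two orthogonal vectors have cosine distance 1.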

Measures of Cluster Similarity

• Especially for the bottom-up approaches

• Single-link clustering
  - The similarity between two clusters is the similarity of the two closest objects in the clusters
  - Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity
  - Elongated clusters are achieved (cf. the minimal spanning tree)

$$\mathrm{sim}(c_i, c_j) = \max_{\vec{x} \in c_i,\ \vec{y} \in c_j} \mathrm{sim}(\vec{x}, \vec{y})$$

[Figure: clusters C_i and C_j joined by their pair of greatest similarity]

Measures of Cluster Similarity (cont.)

• Complete-link clustering
  - The similarity between two clusters is the similarity of their two most dissimilar members
  - Sphere-shaped clusters are achieved
  - Preferable for most IR and NLP applications

$$\mathrm{sim}(c_i, c_j) = \min_{\vec{x} \in c_i,\ \vec{y} \in c_j} \mathrm{sim}(\vec{x}, \vec{y})$$

[Figure: clusters C_i and C_j joined by their pair of least similarity]
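Single link and complete link differ only in whether the maximum or the minimum is taken over the cross-cluster pairs. A minimal sketch, assuming the object-level similarity 1 / (1 + d) with Euclidean d (the function names are hypothetical):

```python
import math

def sim(x, y):
    """Assumed object similarity: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(x, y))

def single_link(ci, cj):
    """Similarity of the two closest members (greatest pairwise similarity)."""
    return max(sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj):
    """Similarity of the two most dissimilar members (least pairwise similarity)."""
    return min(sim(x, y) for x in ci for y in cj)

ci = [(0.0, 0.0), (1.0, 0.0)]
cj = [(2.0, 0.0), (5.0, 0.0)]
```

With these two clusters the closest pair is one unit apart and the farthest pair five units apart, so `single_link(ci, cj)` is 1/2 and `complete_link(ci, cj)` is 1/6.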

Measures of Cluster Similarity (cont.)

[Figure: the same data clustered with single link and with complete link]

Measures of Cluster Similarity (cont.)

• Group-average agglomerative clustering
  - A compromise between single-link and complete-link clustering
  - The similarity between two clusters is the average similarity between members
  - If the objects are represented as length-normalized vectors and the similarity measure is the cosine
    • There exists a fast algorithm for computing the average similarity

$$\mathrm{sim}(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x} \cdot \vec{y} \qquad \text{(length-normalized vectors)}$$

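Measures of Cluster Similarity (cont.)

• Group-average agglomerative clustering (cont.)

  - The average similarity SIM between vectors in a cluster c_j is defined as

$$\mathrm{SIM}(c_j) = \frac{1}{|c_j|\,(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \mathrm{sim}(\vec{x}, \vec{y}) = \frac{1}{|c_j|\,(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \vec{x} \cdot \vec{y}$$

  - The sum of members in a cluster c_j

$$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$

  - Express SIM(c_j) in terms of s(c_j)

$$\begin{aligned}
\vec{s}(c_j) \cdot \vec{s}(c_j) &= \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} \\
&= |c_j|\,(|c_j| - 1)\,\mathrm{SIM}(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} \\
&= |c_j|\,(|c_j| - 1)\,\mathrm{SIM}(c_j) + |c_j| \qquad (\vec{x} \cdot \vec{x} = 1 \text{ for a length-normalized vector}) \\
\therefore\ \mathrm{SIM}(c_j) &= \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|\,(|c_j| - 1)}
\end{aligned}$$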

Measures of Cluster Similarity (cont.)

• Group-average agglomerative clustering (cont.)

  - When merging two clusters c_i and c_j, the cluster sum vectors s(c_i) and s(c_j) are known in advance

$$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$

  - The average similarity for their union will be

$$\mathrm{SIM}(c_i \cup c_j) = \frac{\left( \vec{s}(c_i) + \vec{s}(c_j) \right) \cdot \left( \vec{s}(c_i) + \vec{s}(c_j) \right) - \left( |c_i| + |c_j| \right)}{\left( |c_i| + |c_j| \right) \left( |c_i| + |c_j| - 1 \right)}$$

[Figure: clusters C_i and C_j with their sum vectors s(c_i) and s(c_j)]
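The saving from the sum-vector formulation can be checked numerically: for length-normalized vectors, the average similarity of a merged cluster computed from the two sum vectors should agree with the brute-force average over all ordered pairs. The sketch below is illustrative; all function names are hypothetical.

```python
import math

def normalize(v):
    """Scale v to unit length so that v . v = 1."""
    n = math.sqrt(sum(a * a for a in v))
    return tuple(a / n for a in v)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def sum_vector(cluster):
    """s(c): component-wise sum of the members of a cluster."""
    return tuple(sum(col) for col in zip(*cluster))

def avg_sim_brute(cluster):
    """Average of x . y over all ordered pairs with x != y (quadratic cost)."""
    n = len(cluster)
    total = sum(dot(x, y)
                for i, x in enumerate(cluster)
                for j, y in enumerate(cluster) if i != j)
    return total / (n * (n - 1))

def avg_sim_merged(s_i, n_i, s_j, n_j):
    """SIM(c_i U c_j) from the two sum vectors and cluster sizes only."""
    s_new = tuple(a + b for a, b in zip(s_i, s_j))
    n_new = n_i + n_j
    return (dot(s_new, s_new) - n_new) / (n_new * (n_new - 1))

ci = [normalize((1.0, 0.0)), normalize((1.0, 1.0))]
cj = [normalize((0.0, 1.0)), normalize((1.0, 2.0))]
fast = avg_sim_merged(sum_vector(ci), len(ci), sum_vector(cj), len(cj))
slow = avg_sim_brute(ci + cj)
```

`fast` and `slow` agree up to floating-point rounding, but `avg_sim_merged` needs only one dot product per candidate merge, whatever the cluster sizes.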

Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values
  - E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of 22 words; higher nodes indicate decreasing similarity; "be" has the least similarity with the other 21 words]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont.)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  - Single-link measure
  - Complete-link measure
  - Group-average measure

• How to split a cluster
  - Splitting is itself a clustering task (finding two sub-clusters)
  - Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont.)

• Algorithm
  [Algorithm listing not recovered; surviving annotations:]
  - Split the least coherent cluster
  - Generate two new clusters and remove the original one
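A minimal sketch of the divisive loop follows. It is not the slide's listing: coherence is assumed to be the group-average similarity with sim = 1 / (1 + d), and the split is a deliberately simple stand-in (seed with the two most distant members, then assign every member to the nearer seed) for "any clustering algorithm". All function names are hypothetical, and distinct points are assumed.

```python
import math

def sim(x, y):
    return 1.0 / (1.0 + math.dist(x, y))

def coherence(cluster):
    """Group-average similarity; clusters with fewer than two members cannot be split."""
    if len(cluster) < 2:
        return float("inf")
    pairs = [(x, y) for i, x in enumerate(cluster) for y in cluster[i + 1:]]
    return sum(sim(x, y) for x, y in pairs) / len(pairs)

def split_in_two(cluster):
    """Assumed splitting rule: seeds are the two most distant members."""
    pairs = [(x, y) for i, x in enumerate(cluster) for y in cluster[i + 1:]]
    a, b = max(pairs, key=lambda p: math.dist(p[0], p[1]))
    left = [p for p in cluster if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in cluster if math.dist(p, a) > math.dist(p, b)]
    return left, right

def divisive(points, k):
    clusters = [list(points)]            # start with all objects in one cluster
    while len(clusters) < k:             # stopping criterion: the cluster number
        worst = min(clusters, key=coherence)
        if len(worst) < 2:
            break                        # nothing left to split
        clusters.remove(worst)           # remove the original cluster
        left, right = split_in_two(worst)
        clusters.extend([left, right])   # generate two new clusters
    return clusters

result = divisive([(0.0,), (1.0,), (10.0,), (11.0,), (30.0,)], 3)
```

Replacing `split_in_two` with a 2-means run or with an agglomerative pass stopped at two clusters gives the other variants mentioned above.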

Non-Hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  - In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  - When to stop (e.g., based on group-average similarity, likelihood, or mutual information)
  - What is the right number of clusters (k-1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
  - The K-means algorithm
  - The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  - A hard clustering algorithm
  - Define clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  - A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)
  - E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors)
    • A compression rate of 3

$$X = \{\vec{x}^t\}_{t=1}^{N} \ \xrightarrow{\ F\ }\ \text{index } j \in \{1, \ldots, k\}, \qquad \{\vec{m}_j\}_{j=1}^{k}$$

  - m_j: cluster centroid, also called reference vector, code vector, or code word
  - Dim(x^t) = 24 → k = 2^8

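The K-means Algorithm (cont.)

• Total reconstruction error

$$E\left( \{\vec{m}_i\}_{i=1}^{k} \mid X \right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \left\| \vec{x}^t - \vec{m}_i \right\|^2, \qquad \text{where } b_i^t = \begin{cases} 1 & \text{if } \left\| \vec{x}^t - \vec{m}_i \right\| = \min_j \left\| \vec{x}^t - \vec{m}_j \right\| \\ 0 & \text{otherwise} \end{cases}$$

  - b_i^t (the label) and m_i are unknown
  - b_i^t depends on m_i, and this optimization problem cannot be solved analytically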

The K-means Algorithm (cont.)

• Initialization
  - A set of initial cluster centers {m_i}, i = 1, ..., k, is needed

• Recursion
  - Assign each object x^t to the cluster whose center is closest

$$b_i^t = \begin{cases} 1 & \text{if } \left\| \vec{x}^t - \vec{m}_i \right\| = \min_j \left\| \vec{x}^t - \vec{m}_j \right\| \\ 0 & \text{otherwise} \end{cases}$$

  - Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$\vec{m}_i = \frac{\sum_{t=1}^{N} b_i^t\, \vec{x}^t}{\sum_{t=1}^{N} b_i^t}$$

    • Alternatively, the medoid can be used as the cluster center (a medoid is one of the objects in the cluster)

  - These two steps are repeated until m_i stabilizes

The K-means Algorithm (cont.)

• Algorithm
  [Algorithm listing not recovered]
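Since the listing itself did not survive, the initialization and the two recursion steps described for K-means can be sketched as below. This is a plain reading of those steps, not the slide's listing; `kmeans` and `reconstruction_error` are hypothetical names, and keeping the old center for an empty cluster is an assumption the slides do not address.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Hard-assignment K-means; returns (centers, labels)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # assignment step: label each object with its closest center
        labels = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
                  for p in points]
        # update step: each center becomes the mean of its members
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        if new_centers == centers:       # centers have stabilized
            break
        centers = new_centers
    return centers, labels

def reconstruction_error(points, centers, labels):
    """Total squared distance between each object and its assigned center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, labels = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
error = reconstruction_error(points, centers, labels)
```

Each of the two steps can only lower (or keep) the total reconstruction error, which is why the loop settles, although possibly at a local optimum that depends on the seeds.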

The K-means Algorithm (cont.)

• Example 1
  [Figure not recovered]

The K-means Algorithm (cont.)

• Example 2
  [Figure not recovered; surviving cluster labels: government, finance, sports, research, name]

The K-means Algorithm (cont.)

• Choice of initial cluster centers (seeds) is important
  - Pick at random
  - Calculate the mean m of all data and generate k initial centers m_i by adding a small random vector to the mean (m_i = m ± δ)
  - Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center m_i
  - Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering
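The second seeding option (the global mean plus a small random vector) is easy to sketch. The function name, the uniform perturbation, and the default `delta` are assumptions made for illustration; the slides only say that a small random vector is added.

```python
import random

def seeds_around_mean(points, k, delta=0.01, rng=None):
    """Generate k initial centers as the data mean plus a small random offset."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # each coordinate is perturbed by at most delta (assumed uniform noise)
    return [tuple(m + rng.uniform(-delta, delta) for m in mean) for _ in range(k)]

seeds = seeds_around_mean([(0.0, 0.0), (2.0, 4.0)], k=3)
```

Here the data mean is (1.0, 2.0), so all three seeds lie within 0.01 of that point in every coordinate.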

The K-means Algorithm (cont.)

• How to break ties when several centers are at the same distance from an object
  - Randomly assign the object to one of the candidate clusters
  - Or perturb objects slightly

• Applications of the K-means algorithm
  - Clustering
  - Vector quantization
  - A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with l = log2 k (k clusters)

[Figure: the clusters as nodes on the hypercube, followed by a linear classifier]

The K-means Algorithm (cont.)

• E.g., the LBG algorithm
  - By Linde, Buzo, and Gray
  - M → 2M at each iteration

[Figure: the global mean is split into a cluster 1 mean and a cluster 2 mean; after the next split the components are {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}]

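The EM Algorithm

• A soft version of the K-means algorithm
  - Each object could be a member of multiple clusters
  - Clustering as estimating a mixture of (continuous) probability distributions

[Figure: a mixture Gaussian HMM (or a mixture of Gaussians); the component densities P(x_i | c_1), P(x_i | c_2), ..., P(x_i | c_K) are weighted by π_1 = P(c_1), π_2 = P(c_2), ..., π_K = P(c_K) and summed]

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

• Continuous case

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{\sqrt{(2\pi)^m\, |\Sigma_k|}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$$

• Likelihood function for the data samples X = {x_1, x_2, ..., x_n}, where the x_i's are independent and identically distributed (i.i.d.)

$$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

• Classification

$$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$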

The EM Algorithm (cont.)

[Figure not recovered]

Maximum Likelihood Estimation

• Hard Assignment

State S1 (four samples, two B and two W):

P(B | S1) = 2/4 = 0.5

P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

Sample   State S1   State S2
B        0.7        0.3
W        0.4        0.6
B        0.9        0.1
W        0.5        0.5

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6 / 2.5 = 0.64

P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9 / 2.5 = 0.36

P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4 / 1.5 = 0.27

P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1 / 1.5 = 0.73
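The soft-assignment estimates are weighted relative frequencies, which a few lines of Python can reproduce. The B/W label of each sample is inferred from which weights appear in the numerators (0.7 and 0.9 for B in state S1), and the names below are hypothetical.

```python
# Observed symbol of each sample (inferred from the numerators of the estimates)
observations = ["B", "W", "B", "W"]
# Soft membership of each sample in the two states
weights = {
    "S1": [0.7, 0.4, 0.9, 0.5],
    "S2": [0.3, 0.6, 0.1, 0.5],
}

def soft_mle(symbol, state):
    """Weighted count of `symbol` in `state` divided by the state's total weight."""
    w = weights[state]
    numerator = sum(wi for wi, obs in zip(w, observations) if obs == symbol)
    return numerator / sum(w)

estimates = {(sym, st): round(soft_mle(sym, st), 2)
             for st in ("S1", "S2") for sym in ("B", "W")}
```

Rounded to two decimals, `estimates` holds 0.64 and 0.36 for S1 and 0.27 and 0.73 for S2, and within each state the two probabilities sum to 1.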

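The EM Algorithm (cont.)

• E-step (Expectation)
  - Derive the complete data likelihood function

$$\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}$$

  - Here X = {x_1, x_2, ..., x_n} and C = {c_{k_1}, c_{k_2}, ..., c_{k_n}}; P(X | Θ) is the likelihood function and P(X, C | Θ) is the complete data likelihood function
  - How many kinds of C are there? (K^n kinds)

Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$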

The EM Algorithm (cont.)

• E-step (Expectation)
  - Define the auxiliary function Φ(Θ, Θ̂) as the expectation of the log complete likelihood function L_CM with respect to the hidden/latent variable C, conditioned on the known data (X, Θ); Θ is known and Θ̂ is unknown

$$\Phi(\Theta, \hat{\Theta}) = E_C\left[ L_{CM} \mid X, \Theta \right] = E_C\left[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \right] = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})$$

  - Maximize the log likelihood function log P(X | Θ̂) by maximizing the expectation of the log complete likelihood function Φ(Θ, Θ̂)
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$

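The EM Algorithm (cont.)

• E-step (Expectation)
  - The auxiliary function Φ(Θ, Θ̂)

$$\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \log \left[ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, k_i} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \right] \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}$$

where

$$\delta_{k, k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}$$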

The EM Algorithm (cont.)

  - Note that

$$\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[ \prod_{\substack{j=1 \\ j \neq i}}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{\substack{j=1 \\ j \neq i}}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}$$

  - Because of δ, x_i can only be aligned to c_k

Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$

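The EM Algorithm (cont.)

• E-step (Expectation)
  - The auxiliary function can also be divided into two parts

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

where

$$\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \qquad \text{(auxiliary function for mixture weights)}
\end{aligned}$$

$$\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \qquad \text{(auxiliary function for cluster distributions)}
\end{aligned}$$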

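The EM Algorithm (cont.)

• M-step (Maximization)
  - Remember that
    • A function F can be maximized under a constraint by applying a Lagrange multiplier

Suppose that

$$F = \sum_{j=1}^{N} w_j \log y_j, \qquad \text{with the constraint } \sum_{j=1}^{N} y_j = 1$$

By applying a Lagrange multiplier ℓ:

$$\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \ \Rightarrow\ y_j = -\frac{w_j}{\ell} \quad \forall j \\
\sum_{j=1}^{N} y_j &= -\frac{1}{\ell} \sum_{j=1}^{N} w_j = 1 \ \Rightarrow\ \ell = -\sum_{j=1}^{N} w_j \\
\therefore\ y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}$$

Note:

$$\frac{\partial \log y_j}{\partial y_j} = \frac{1}{y_j}$$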

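The EM Algorithm (cont.)

• M-step (Maximization)
  - Maximize Φ_a(Θ, Θ̂), the auxiliary function for mixture weights (or priors for Gaussians)

$$\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}$$

$$\Rightarrow\ \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}} = \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

  - Each term of the sum over i is the expected number of times that x_i falls in class c_k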

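The EM Algorithm (cont.)

• M-step (Maximization)
  - Maximize Φ_b(Θ, Θ̂), the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

$$P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{\sqrt{(2\pi)^m\, |\hat{\Sigma}_k|}} \exp\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)$$

Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

and

$$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

Then, with D a constant,

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$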

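The EM Algorithm (cont.)

• M-step (Maximization)
  - Maximize Φ_b(Θ, Θ̂) with respect to μ̂_k

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0$$

$$\Rightarrow\ \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}$$

  - w_ik is the expected number of times that x_i falls in class c_k

Note:

$$\frac{d\,(\vec{x}^T C \vec{x})}{d \vec{x}} = (C + C^T)\, \vec{x}, \qquad \text{and } \hat{\Sigma}_k^{-1} \text{ is symmetric here}$$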

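The EM Algorithm (cont.)

• M-step (Maximization)
  - Maximize Φ_b(Θ, Θ̂) with respect to Σ̂_k

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
\Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} &= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
\Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow\ \hat{\Sigma}_k \sum_{i=1}^{n} w_{ik} &= \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
\Rightarrow\ \hat{\Sigma}_k &= \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}$$

Note (Σ̂_k is symmetric here):

$$\frac{d \det(X)}{d X} = \det(X) \cdot X^{-T}, \qquad \frac{d\,(\vec{a}^T X^{-1} \vec{b})}{d X} = -X^{-T}\, \vec{a}\, \vec{b}^T X^{-T}$$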

The EM Algorithm (cont.)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function P(X | Θ) has converged or the maximum number of iterations is reached
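The E-step weights and the M-step updates for the priors, means and variances can be put together in a short sketch. To stay dependency-free it is restricted to one-dimensional Gaussians, so each covariance matrix reduces to a scalar variance; the function names, the tolerance, and the small variance floor are assumptions added for illustration and are not part of the derivation.

```python
import math

def gaussian(x, mu, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, mus, variances, priors, max_iter=200, tol=1e-9):
    mus, variances, priors = list(mus), list(variances), list(priors)
    n, K = len(xs), len(mus)
    prev_ll = None
    for _ in range(max_iter):
        # E-step: w[i][k] = P(c_k | x_i, Theta) under the current parameters
        w, ll = [], 0.0
        for x in xs:
            joint = [priors[k] * gaussian(x, mus[k], variances[k]) for k in range(K)]
            total = sum(joint)
            ll += math.log(total)
            w.append([j / total for j in joint])
        # terminate when the log-likelihood has converged
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
        # M-step: closed-form updates from the weighted statistics
        for k in range(K):
            wk = sum(w[i][k] for i in range(n))
            priors[k] = wk / n
            mus[k] = sum(w[i][k] * xs[i] for i in range(n)) / wk
            var = sum(w[i][k] * (xs[i] - mus[k]) ** 2 for i in range(n)) / wk
            variances[k] = max(var, 1e-6)  # assumed floor to avoid a collapsing component
    return mus, variances, priors

xs = [-1.1, -1.0, -0.9, 4.9, 5.0, 5.1]
mus, variances, priors = em_gmm_1d(xs, [-0.5, 4.0], [1.0, 1.0], [0.5, 0.5])
```

On this toy data the two components settle near means -1 and 5 with equal priors. In practice the initial means would come from a K-means run, as noted above, rather than being chosen by hand.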

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information
  - TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by the distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]$$

$$E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{\mathrm{dist}(T_k, T_l)^2}{2 \sigma^2} \right], \qquad \mathrm{dist}(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

$$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}$$
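The neighborhood probabilities P(T_l | Y_k) depend only on the map coordinates of the topics and on σ, so they can be tabulated once before training. A minimal sketch, assuming the Gaussian form of E(T_k, T_l) over the Euclidean map distance; the function name and the example grid are hypothetical.

```python
import math

def neighborhood_probs(coords, sigma):
    """Return a K x K table whose entry [k][l] is P(T_l | Y_k)."""
    def E(k, l):
        d = math.dist(coords[k], coords[l])  # Euclidean distance on the map
        return math.exp(-d * d / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

    K = len(coords)
    table = []
    for k in range(K):
        norm = sum(E(k, s) for s in range(K))  # normalizer over all topics
        table.append([E(k, l) / norm for l in range(K)])
    return table

# Four topics placed on a 2 x 2 map (illustrative coordinates)
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
table = neighborhood_probs(coords, sigma=1.0)
```

Each row of `table` sums to 1, and a topic gives the most weight to itself and progressively less to topics that lie farther away on the map.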

Hierarchical Document Organization (cont.)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}$$

  - EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

where

$$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'}) \right] P(T_{k'} \mid D_i) \right\}}$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\,P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'})\,[1 - P(T_k \mid D_{i'})]}
$$

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)

- A recursive regression process

Input layer, input vector: $\vec{x} = [x_1, x_2, \ldots, x_n]^T$

Mapping layer, weight vector of unit $i$: $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$ (e.g. $\vec{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$)

$$
\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),i}(t)\,[\vec{x}(t) - \vec{m}_i(t)]
$$

where

$$
c(\vec{x}) = \arg\min_{i}\,\lVert\vec{x} - \vec{m}_i\rVert,
\qquad
\lVert\vec{x} - \vec{m}_i\rVert = \sqrt{\sum_{n}(x_n - m_{i,n})^2}
$$

$$
h_{c(\vec{x}),i}(t) = \alpha(t)\exp\left(-\frac{\lVert\vec{r}_i - \vec{r}_{c(\vec{x})}\rVert^2}{2\sigma^2(t)}\right)
$$
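One SOM update step, following the update rule and neighbourhood function above, might look like the sketch below. The function names, the one-dimensional map layout, and the values of `alpha` and `sigma` are hypothetical; in practice both would decay with the iteration count $t$.

```python
import math

def euclid(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def som_step(weights, positions, x, alpha, sigma):
    """Apply one SOM update.

    weights   -- list of weight vectors m_i
    positions -- list of map coordinates r_i, one per unit
    x         -- input vector
    Returns the index c(x) of the best-matching unit and the updated weights.
    """
    # Best-matching unit: c(x) = argmin_i ||x - m_i||
    c = min(range(len(weights)), key=lambda i: euclid(x, weights[i]))
    new_weights = []
    for m, r in zip(weights, positions):
        # Neighbourhood function h_{c(x),i}(t)
        h = alpha * math.exp(-euclid(r, positions[c]) ** 2 / (2.0 * sigma ** 2))
        # m_i(t+1) = m_i(t) + h * (x - m_i(t))
        new_weights.append([mi + h * (xi - mi) for mi, xi in zip(m, x)])
    return c, new_weights

weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
positions = [(0.0,), (1.0,), (2.0,)]        # hypothetical 1-D map
x = [1.2, 0.8]
c, updated = som_step(weights, positions, x, alpha=0.5, sigma=1.0)
```

Every unit moves toward the input, but units far from the winner on the map move less because their neighbourhood value is smaller.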

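Hierarchical Document Organization (cont)

• Results

| Model | Iterations | $dist_{Between} / dist_{Within}$ |
|-------|------------|----------------------------------|
| TMM   | 10         | 1.9165                           |
| TMM   | 20         | 2.0650                           |
| TMM   | 30         | 1.9477                           |
| TMM   | 40         | 1.9175                           |
| SOM   | 100        | 2.0604                           |

$$
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
$$

where

$$
dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)},
\qquad
dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}
$$

$$
f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
dist_{Map}(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$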


Summarized Attributes of Clustering Algorithms

• Hierarchical Clustering

- Preferable for detailed data analysis

- Provides more information than flat clustering

- No single best algorithm (each algorithm is optimal only for some applications)

- Less efficient than flat clustering (at minimum, an n × n matrix of similarity coefficients has to be computed)

Summarized Attributes of Clustering Algorithms (cont)

• Flat Clustering

- Preferable if efficiency is a consideration or data sets are very large

- K-means is the conceptually simplest method and should probably be used first on a new data set, because its results are often sufficient

- K-means assumes a simple Euclidean representation space, and so cannot be used for many data sets, e.g. nominal data like colors (or samples with features of different scales)

- The EM algorithm is the method of choice in such cases. It can accommodate definition of clusters and allocation of objects based on complex probabilistic models

• Its extensions can be used to handle topological/hierarchical orders of samples

- Probabilistic Latent Semantic Analysis (PLSA), Topic Mixture Model (TMM), etc.

Hierarchical Clustering

• Can proceed in either a bottom-up or a top-down manner

- Bottom-up (agglomerative)

• Start with individual objects and group the most similar ones

- E.g. those with the minimum distance apart (distance measures will be discussed later on)

$$
sim(x, y) = \frac{1}{1 + d(x, y)}
$$

• The procedure terminates when one cluster containing all objects has been formed

- Top-down (divisive)

• Start with all objects in one group and divide them into groups so as to maximize within-group similarity

Hierarchical Agglomerative Clustering (HAC)

• A bottom-up approach

• Assume a similarity measure for determining the similarity of two objects

• Start with each object in a separate cluster, then repeatedly join the two most similar clusters until only one cluster remains

• The history of merging/clustering forms a binary tree or hierarchy

HAC (cont)

• Algorithm (annotations on the algorithm figure)

- Initialization (for tree leaves): each object is a cluster

- A running cluster number indexes the clusters

- The two most similar clusters are merged as a new cluster

- The original two clusters are removed
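The annotated steps can be sketched as a naive agglomerative loop. This is an illustrative sketch, not the algorithm figure from the slide: the function names, the cubic-time pair search, and the 1-D similarity `sim_1d` (built from $sim = 1/(1+d)$) are assumptions made for the example.

```python
def sim_1d(x, y):
    """Similarity derived from a distance: sim = 1 / (1 + d)."""
    return 1.0 / (1.0 + abs(x - y))

def hac(objects, sim, linkage=max):
    """Naive HAC. linkage=max gives single link, linkage=min gives complete link.

    Returns the merge history as (cluster_a, cluster_b, new_cluster) id triples.
    """
    # Initialization (tree leaves): each object is its own cluster
    clusters = {i: [i] for i in range(len(objects))}
    next_id = len(objects)                      # running cluster number
    history = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                s = linkage(sim(objects[i], objects[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        # Merge as a new cluster and remove the original two clusters
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        history.append((a, b, next_id))
        next_id += 1
    return history

history = hac([0.0, 0.1, 5.0, 5.2, 9.0], sim_1d)
```

The merge history is exactly the binary tree: leaves are ids `0..n-1`, and each triple names the two children and their new parent.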

Distance Metrics

• Euclidean distance (L2 norm)

$$
L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m}(x_i - y_i)^2}
$$

- Make sure that all attributes/dimensions have the same scale (or the same variance)

• L1 norm (city-block distance)

$$
L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m}\lvert x_i - y_i\rvert
$$

• Cosine similarity (transformed to a distance by subtracting from 1)

$$
1 - \frac{\vec{x}\cdot\vec{y}}{\lvert\vec{x}\rvert\,\lvert\vec{y}\rvert}
$$

- Ranges between 0 and 1 for vectors with non-negative components
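The three measures translate directly into code. The sketch below uses plain Python lists; the function names are hypothetical.

```python
import math

def l2_distance(x, y):
    """Euclidean distance (L2 norm of the difference)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    """City-block distance (L1 norm of the difference)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 minus the cosine similarity of x and y (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

d2 = l2_distance([0, 0], [3, 4])
d1 = l1_distance([0, 0], [3, 4])
dc_orthogonal = cosine_distance([1, 0], [0, 1])
dc_parallel = cosine_distance([1, 2], [2, 4])
```

Note that the cosine distance ignores vector length: `[1, 2]` and `[2, 4]` are at distance zero even though their Euclidean distance is not.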

Measures of Cluster Similarity

• Especially for the bottom-up approaches

• Single-link clustering

- The similarity between two clusters is the similarity of the two closest objects in the clusters

- Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity

$$
sim(c_i, c_j) = \max_{\vec{x}\in c_i,\ \vec{y}\in c_j} sim(\vec{x}, \vec{y})
$$

- Elongated clusters are obtained (cf. the minimal spanning tree)

Measures of Cluster Similarity (cont)

• Complete-link clustering

- The similarity between two clusters is the similarity of their two most dissimilar members (the pair with the least similarity)

$$
sim(c_i, c_j) = \min_{\vec{x}\in c_i,\ \vec{y}\in c_j} sim(\vec{x}, \vec{y})
$$

- Sphere-shaped clusters are obtained

- Preferable for most IR and NLP applications

Measures of Cluster Similarity (cont)

[Figure: the same objects clustered with single link and with complete link]

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering

- A compromise between single-link and complete-link clustering

- The similarity between two clusters is the average similarity between members

- If the objects are represented as length-normalized vectors and the similarity measure is the cosine, there exists a fast algorithm for computing the average similarity

$$
sim(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x}\cdot\vec{y}}{\lvert\vec{x}\rvert\,\lvert\vec{y}\rvert} = \vec{x}\cdot\vec{y}
\quad\text{(length-normalized vectors)}
$$

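Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

- The average similarity SIM between vectors in a cluster $c_j$ is defined as

$$
SIM(c_j) = \frac{1}{\lvert c_j\rvert(\lvert c_j\rvert - 1)}\sum_{\vec{x}\in c_j}\ \sum_{\vec{y}\in c_j,\ \vec{y}\ne\vec{x}} sim(\vec{x}, \vec{y})
         = \frac{1}{\lvert c_j\rvert(\lvert c_j\rvert - 1)}\sum_{\vec{x}\in c_j}\ \sum_{\vec{y}\in c_j,\ \vec{y}\ne\vec{x}} \vec{x}\cdot\vec{y}
$$

- The sum of the members in a cluster $c_j$

$$
\vec{s}(c_j) = \sum_{\vec{x}\in c_j}\vec{x}
$$

- Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$

$$
\begin{aligned}
\vec{s}(c_j)\cdot\vec{s}(c_j) &= \sum_{\vec{x}\in c_j}\vec{x}\cdot\vec{s}(c_j) = \sum_{\vec{x}\in c_j}\sum_{\vec{y}\in c_j}\vec{x}\cdot\vec{y} \\
 &= \lvert c_j\rvert(\lvert c_j\rvert - 1)\,SIM(c_j) + \sum_{\vec{x}\in c_j}\vec{x}\cdot\vec{x} \\
 &= \lvert c_j\rvert(\lvert c_j\rvert - 1)\,SIM(c_j) + \lvert c_j\rvert
 \qquad (\vec{x}\cdot\vec{x} = 1 \text{ for a length-normalized vector}) \\
\therefore\ SIM(c_j) &= \frac{\vec{s}(c_j)\cdot\vec{s}(c_j) - \lvert c_j\rvert}{\lvert c_j\rvert(\lvert c_j\rvert - 1)}
\end{aligned}
$$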

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

- When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

$$
\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad \lvert c_{New}\rvert = \lvert c_i\rvert + \lvert c_j\rvert
$$

- The average similarity for their union will be

$$
SIM(c_i \cup c_j) = \frac{(\vec{s}(c_i) + \vec{s}(c_j))\cdot(\vec{s}(c_i) + \vec{s}(c_j)) - (\lvert c_i\rvert + \lvert c_j\rvert)}{(\lvert c_i\rvert + \lvert c_j\rvert)(\lvert c_i\rvert + \lvert c_j\rvert - 1)}
$$
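The shortcut can be checked numerically against the definition. In the sketch below, `sim_brute_force` averages all pairwise cosines directly, while `merged_sim` uses only the two cluster sum vectors and sizes; the function names and the example vectors are hypothetical.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vector_sum(cluster):
    """s(c): component-wise sum of the cluster members."""
    return [sum(col) for col in zip(*cluster)]

def sim_brute_force(cluster):
    """Average of x . y over all ordered pairs with x != y (quadratic in cluster size)."""
    n = len(cluster)
    total = sum(dot(x, y)
                for i, x in enumerate(cluster)
                for j, y in enumerate(cluster) if i != j)
    return total / (n * (n - 1))

def sim_from_sum(s, size):
    """SIM(c) = (s . s - |c|) / (|c| (|c| - 1)) for length-normalized members."""
    return (dot(s, s) - size) / (size * (size - 1))

def merged_sim(s_i, n_i, s_j, n_j):
    """SIM(c_i U c_j) computed from the two sum vectors and sizes only."""
    s_new = [a + b for a, b in zip(s_i, s_j)]
    return sim_from_sum(s_new, n_i + n_j)

c_i = [normalize([1.0, 0.0]), normalize([1.0, 1.0])]
c_j = [normalize([0.0, 1.0]), normalize([1.0, 2.0])]
fast = merged_sim(vector_sum(c_i), len(c_i), vector_sum(c_j), len(c_j))
slow = sim_brute_force(c_i + c_j)
```

The fast version costs one vector addition and one dot product per candidate merge, which is why the sum vectors are kept with each cluster.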

Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values

- E.g. the left and right neighbors of tokens of words

[Figure: dendrogram of the clustered words. Higher nodes indicate decreasing similarity; "be" has the least similarity with the other 21 words]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g. the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g. HAC) can be used again here

- Single-link measure

- Complete-link measure

- Group-average measure

• How to split a cluster

- Splitting is itself a clustering task (finding two sub-clusters)

- Any clustering algorithm can be used for the splitting operation, e.g.

• Bottom-up (agglomerative) algorithms

• Non-hierarchical clustering algorithms (e.g. K-means)

Divisive Clustering (cont)

• Algorithm (annotations on the algorithm figure)

- Split the least coherent cluster

- Generate two new clusters and remove the original one

Non-Hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition

- In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering

- When to stop? (criteria: group-average similarity, likelihood, mutual information)

- What is the right number of clusters? (k-1 → k → k+1; hierarchical clustering also has to face this problem)

• Algorithms introduced here

- The K-means algorithm

- The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing

- A hard clustering algorithm

- Defines clusters by the center of mass of their members

• The K-means algorithm can also be regarded as

- A kind of vector quantization

• Map from a continuous space (high resolution) to a discrete space (low resolution)

$$
\mathbf{X} = \{\vec{x}^t\}_{t=1}^{N} \ \xrightarrow{\ F\ }\ \text{index } j \in \{1, \ldots, k\},\quad \{\vec{m}_j\}_{j=1}^{k}
$$

where $\vec{m}_j$ is the cluster centroid (also called reference vector, code vector, or code word)

- E.g. color quantization

• 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e. $Dim(\vec{x}^t) = 24 \rightarrow k = 2^8$

• A compression rate of 3

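The K-means Algorithm (cont)

Total reconstruction error:

$$
E\left(\{\vec{m}_i\}_{i=1}^{k} \mid \mathbf{X}\right) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t\,\lVert\vec{x}^t - \vec{m}_i\rVert^2,
\qquad\text{where}\quad
b_i^t = \begin{cases} 1 & \text{if } \lVert\vec{x}^t - \vec{m}_i\rVert = \min_j \lVert\vec{x}^t - \vec{m}_j\rVert \\ 0 & \text{otherwise} \end{cases}
$$

- The labels $b_i^t$ and the centers $\vec{m}_i$ are unknown

- $b_i^t$ depends on $\vec{m}_i$, and this optimization problem cannot be solved analytically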

The K-means Algorithm (cont)

• Initialization

- A set of initial cluster centers $\{\vec{m}_i\}_{i=1}^{k}$ is needed

• Recursion

- Assign each object $\vec{x}^t$ to the cluster whose center is closest

$$
b_i^t = \begin{cases} 1 & \text{if } \lVert\vec{x}^t - \vec{m}_i\rVert = \min_j \lVert\vec{x}^t - \vec{m}_j\rVert \\ 0 & \text{otherwise} \end{cases}
$$

- Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$
\vec{m}_i = \frac{\sum_{t=1}^{N} b_i^t\,\vec{x}^t}{\sum_{t=1}^{N} b_i^t}
$$

• Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

These two steps are repeated until $\vec{m}_i$ stabilizes
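The two alternating steps can be sketched as follows. The function names and the example points are hypothetical; the loop stops when the assignments no longer change, which implies that the centers have stabilized. Keeping the old center for an empty cluster is an assumption of this sketch, not something the slides specify.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, init_centers, max_iter=100):
    """Plain K-means with caller-supplied initial centers."""
    centers = [list(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each object to the cluster whose center is closest
        new_labels = [
            min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points
        ]
        if new_labels == labels:       # assignments unchanged: centers are stable
            break
        labels = new_labels
        # Step 2: re-compute each center as the mean of its members
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:                # assumption: an empty cluster keeps its old center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, init_centers=[(0, 0), (10, 10)])
```

Each pass can only lower (or keep) the total reconstruction error, so the loop terminates, though possibly at a local optimum that depends on the seeds.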

The K-means Algorithm (cont)

• Algorithm

The K-means Algorithm (cont)

• Example 1

The K-means Algorithm (cont)

• Example 2

[Figure: document clusters labeled government, finance, sports, research, name]

The K-means Algorithm (cont)

• The choice of initial cluster centers (seeds) is important

- Pick at random

- Calculate the mean $\vec{m}$ of all data and generate the k initial centers $\vec{m}_i$ by adding a small random vector to the mean: $\vec{m} \pm \vec{\delta}$

- Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $\vec{m}_i$

- Or use another method, such as a hierarchical clustering algorithm on a subset of the objects

• E.g. the Buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object

- Randomly assign the object to one of the candidate clusters

- Or perturb objects slightly

• Applications of the K-means algorithm

- Clustering

- Vector quantization

- A preprocessing stage before classification or regression

• Map from the original space to an l-dimensional space/hypercube, with $l = \log_2 k$ (k clusters); the clusters become nodes on the hypercube, which are then fed to, e.g., a linear classifier

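The K-means Algorithm (cont)

• E.g. the LBG algorithm

- By Linde, Buzo and Gray

[Figure: starting from the global mean, the codebook is split into the cluster 1 mean and the cluster 2 mean, and then into $\{\mu_{11}, \Sigma_{11}, \omega_{11}\}$, $\{\mu_{12}, \Sigma_{12}, \omega_{12}\}$, $\{\mu_{13}, \Sigma_{13}, \omega_{13}\}$, $\{\mu_{14}, \Sigma_{14}, \omega_{14}\}$; M → 2M at each iteration]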

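The EM Algorithm

• A soft version of the K-means algorithm

- Each object could be a member of multiple clusters

- Clustering as estimating a mixture of (continuous) probability distributions

A mixture Gaussian HMM (or a mixture of Gaussians), with mixture weights $\pi_1 = P(c_1),\ \pi_2 = P(c_2),\ \ldots,\ \pi_K = P(c_K)$ and component densities $P(\vec{x}_i \mid c_1),\ P(\vec{x}_i \mid c_2),\ \ldots,\ P(\vec{x}_i \mid c_K)$:

$$
P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)
$$

Continuous case:

$$
P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\,\lvert\Sigma_k\rvert^{1/2}}\exp\left(-\frac{1}{2}(\vec{x}_i - \vec{\mu}_k)^T\,\Sigma_k^{-1}\,(\vec{x}_i - \vec{\mu}_k)\right)
$$

Likelihood function for the data samples $\mathbf{X} = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$ are independent and identically distributed (i.i.d.):

$$
P(\mathbf{X} \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n}\sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)
$$

Classification:

$$
\max_k P(c_k \mid \vec{x}_i, \Theta) = \max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \max_k P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)
$$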

The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B | S1) = 2/4 = 0.5

P(W | S1) = 2/4 = 0.5

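Maximum Likelihood Estimation

• Soft Assignment

| Sample | State S1 | State S2 |
|--------|----------|----------|
| 1      | 0.7      | 0.3      |
| 2      | 0.4      | 0.6      |
| 3      | 0.9      | 0.1      |
| 4      | 0.5      | 0.5      |

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6 / 2.5 = 0.64

P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9 / 2.5 = 0.36

P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4 / 1.5 = 0.27

P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1 / 1.5 = 0.73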
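The soft-assignment arithmetic can be reproduced with a few lines of code. The pairing of symbols with posteriors below (samples 1 and 3 are B, samples 2 and 4 are W) is inferred from the sums in the estimates above, and the function name is hypothetical.

```python
# (symbol, posterior of state S1, posterior of state S2) for each sample;
# the symbol-to-row pairing is inferred from the worked sums.
samples = [("B", 0.7, 0.3), ("W", 0.4, 0.6), ("B", 0.9, 0.1), ("W", 0.5, 0.5)]

def soft_mle(samples, state):
    """Estimate P(symbol | state) from fractional (soft) counts.

    state -- 0 for S1, 1 for S2
    """
    totals = {}
    for symbol, *posteriors in samples:
        totals[symbol] = totals.get(symbol, 0.0) + posteriors[state]
    norm = sum(totals.values())
    return {symbol: count / norm for symbol, count in totals.items()}

p_s1 = soft_mle(samples, 0)
p_s2 = soft_mle(samples, 1)
```

With hard assignment each sample would contribute a count of exactly 0 or 1 to a state; here each sample contributes its posterior probability instead.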

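The EM Algorithm (cont)

• E-step (Expectation)

- Derive the complete data likelihood function

$$
\begin{aligned}
P(\mathbf{X} \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n}\sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta) \\
 &= \left[P(\vec{x}_1 \mid c_1, \Theta)P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta)P(c_K \mid \Theta)\right] \times \cdots \\
 &\qquad \times \left[P(\vec{x}_n \mid c_1, \Theta)P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta)P(c_K \mid \Theta)\right] \\
 &= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\,P(c_{k_i} \mid \Theta) \\
 &= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
 &= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
 &= \sum_{\mathbf{C}} P(\mathbf{X}, \mathbf{C} \mid \Theta)
\end{aligned}
$$

where $\mathbf{X} = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_{n-1}, \vec{x}_n$ and $\mathbf{C} = c_{k_1}, c_{k_2}, \ldots, c_{k_{n-1}}, c_{k_n}$. $P(\mathbf{X} \mid \Theta)$ is the likelihood function and $P(\mathbf{X}, \mathbf{C} \mid \Theta)$ is the complete data likelihood function.

Note:

$$
\prod_{t=1}^{T}\sum_{k=1}^{M} a_{t,k} = (a_{1,1} + a_{1,2} + \cdots + a_{1,M})(a_{2,1} + a_{2,2} + \cdots + a_{2,M})\cdots(a_{T,1} + a_{T,2} + \cdots + a_{T,M})
 = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\ \prod_{t=1}^{T} a_{t,k_t}
$$

How many kinds of $\mathbf{C}$ are there? ($K^n$ kinds)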
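The product-of-sums identity in the note, and the count of $K^n$ assignments, can be checked by brute-force enumeration on a small example. The function names and the table `a` are hypothetical.

```python
import itertools
import math

def product_of_sums(a):
    """prod_t sum_k a[t][k]"""
    return math.prod(sum(row) for row in a)

def sum_of_products(a):
    """sum over all index tuples (k_1, ..., k_T) of prod_t a[t][k_t]"""
    n_rows, n_cols = len(a), len(a[0])
    return sum(
        math.prod(a[t][ks[t]] for t in range(n_rows))
        for ks in itertools.product(range(n_cols), repeat=n_rows)
    )

a = [[1, 2], [3, 4], [5, 6]]          # T = 3 samples, M = 2 clusters
lhs = product_of_sums(a)
rhs = sum_of_products(a)
n_assignments = len(list(itertools.product(range(2), repeat=3)))
```

The enumeration grows exponentially with the number of samples, which is why EM never sums over the complete assignments explicitly.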

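The EM Algorithm (cont)

• E-step (Expectation)

- Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $\mathbf{C}$, conditioned on the known data $(\mathbf{X}, \Theta)$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= E[L_{CM}] = E_{\mathbf{C}}\left[\log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) \mid \mathbf{X}, \Theta\right] \\
 &= \sum_{\mathbf{C}} P(\mathbf{C} \mid \mathbf{X}, \Theta)\,\log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) \\
 &= \sum_{\mathbf{C}} \frac{P(\mathbf{X}, \mathbf{C} \mid \Theta)}{P(\mathbf{X} \mid \Theta)}\,\log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta})
\end{aligned}
$$

Here $\Theta$ is known and $\hat{\Theta}$ is unknown.

- Maximize the log likelihood function $\log P(\mathbf{X} \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$

• We have shown this property when deriving the HMM-based retrieval model

$$
\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(\mathbf{X} \mid \hat{\Theta}) - \log P(\mathbf{X} \mid \Theta)
$$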

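The EM Algorithm (cont)

• E-step (Expectation)

- The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{\mathbf{C}} P(\mathbf{C} \mid \mathbf{X}, \Theta)\,\log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) \\
 &= \sum_{\mathbf{C}} \left[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\right]\log\prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
 &= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\left[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\right]\sum_{i=1}^{n}\log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
 &= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \sum_{i=1}^{n}\sum_{k=1}^{K}\delta_{k_i,k}\,\log P(\vec{x}_i, c_k \mid \hat{\Theta})\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \\
 &= \sum_{i=1}^{n}\sum_{k=1}^{K}\log P(\vec{x}_i, c_k \mid \hat{\Theta})\left\{\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k_i,k}\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\right\}
 \quad\text{(see next slide)} \\
 &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\,\log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
 &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\,\log\left[P(\vec{x}_i \mid c_k, \hat{\Theta})\,P(c_k \mid \hat{\Theta})\right] \\
 &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\,\log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\,\log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

where

$$
\delta_{k_i,k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}
$$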

The EM Algorithm (cont)

– Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[ \prod_{j=1,\, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= 1 \cdot P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

because $\delta_{k, k_i}$ means that $\vec{x}_i$ can only be aligned to $c_k$, and each remaining sum over $k_j$ equals 1.

Note:

$$
(a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM})
= \prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk}
= \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}
$$

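The EM Algorithm (cont)

• E-step (Expectation)
– The auxiliary function can also be divided into two parts

$$
\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})
$$

where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta})
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta})
 = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

(auxiliary function for mixture weights)

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta})
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

(auxiliary function for cluster distributions)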

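The EM Algorithm (cont)

• M-step (Maximization)
– Remember that
• Maximize a function $F$ with a constraint by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$, with the constraint $\sum_{j=1}^{N} y_j = 1$.

By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
\sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$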
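The Lagrange-multiplier result above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names `maximize_weighted_log` and `objective` and the sample weights are illustrative assumptions.

```python
import math

def maximize_weighted_log(w):
    """Closed-form maximizer of F = sum_j w_j * log(y_j) subject to sum_j y_j = 1.

    This is the result y_j = w_j / sum_j' w_j' derived with the Lagrange multiplier.
    """
    total = sum(w)
    return [wj / total for wj in w]

def objective(w, y):
    """Evaluate F = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

# Hypothetical weights, chosen only for illustration.
w = [2.0, 1.0, 1.0]
y_star = maximize_weighted_log(w)

# Another point on the simplex; it should not score higher than y_star.
y_other = [0.4, 0.3, 0.3]
```

Comparing `objective(w, y_star)` with `objective(w, y_other)` shows that the normalized weights score at least as high as the alternative point, as the derivation predicts.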

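The EM Algorithm (cont)

• M-step (Maximization)
– Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta})
&= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)}}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

$$
\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)}}{\sum_{k'=1}^{K} \sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_{k'}, \Theta) P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)}
$$

Each term of the sum over $i$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.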

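The EM Algorithm (cont)

• M-step (Maximization)
– Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
$$

with the $m$-dimensional Gaussian density

$$
P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2} \left| \Sigma_k \right|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)
$$

Let

$$
w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)}
$$

and

$$
\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)
$$

Then

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

where $D$ is a constant.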

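The EM Algorithm (cont)

• M-step (Maximization)
– Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k}
&= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow\; \hat{\vec{\mu}}_k
&= \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)} \cdot \vec{x}_i}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)}}
\end{aligned}
$$

Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\vec{x}$, and $\Sigma_k^{-1}$ is symmetric here.

$w_{ik}$: the expected number of times that $\vec{x}_i$ falls in class $c_k$.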

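The EM Algorithm (cont)

• M-step (Maximization)
– Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k}
&= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \left| \hat{\Sigma}_k \right|^{-1} \cdot \left| \hat{\Sigma}_k \right| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow\; \hat{\Sigma}_k \sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
&\Rightarrow\; \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)}}
\end{aligned}
$$

Notes ($\Sigma_k$ is symmetric here):

$$
\frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \qquad
\frac{d (\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-T} \vec{a}\, \vec{b}^T X^{-T}
$$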

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
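The E-step and M-step updates derived on the preceding slides can be put together in a short loop. The sketch below is an illustrative simplification rather than the lecture's implementation: it handles one-dimensional data only (so each covariance matrix reduces to a scalar variance), takes the initial parameters as arguments instead of running K-means, and adds a variance floor `var_floor` that is an assumption of this example to avoid degenerate components. The names `gauss` and `em_gmm_1d` are hypothetical.

```python
import math

def gauss(x, mu, var):
    """Univariate Gaussian density N(x; mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, mus, variances, priors, max_iter=100, tol=1e-8, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture; returns (means, variances, priors, log-likelihood)."""
    mus, variances, priors = list(mus), list(variances), list(priors)
    n, K = len(xs), len(mus)
    prev_ll = -math.inf
    ll = -math.inf
    for _ in range(max_iter):
        # E-step: w[i][k] = P(c_k | x_i, Theta), the posterior under the current model.
        w, ll = [], 0.0
        for x in xs:
            joint = [priors[k] * gauss(x, mus[k], variances[k]) for k in range(K)]
            px = sum(joint)                      # P(x_i | Theta)
            ll += math.log(px)
            w.append([j / px for j in joint])
        # Stop when the log-likelihood no longer improves noticeably.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: closed-form updates for priors, means, and variances.
        for k in range(K):
            wk = sum(w[i][k] for i in range(n))
            priors[k] = wk / n
            mus[k] = sum(w[i][k] * xs[i] for i in range(n)) / wk
            var = sum(w[i][k] * (xs[i] - mus[k]) ** 2 for i in range(n)) / wk
            variances[k] = max(var, var_floor)   # floor is an assumption of this sketch
    return mus, variances, priors, ll

# Toy data with two well-separated groups (made up for illustration).
xs = [0.0, 0.2, -0.2, 5.0, 5.2, 4.8]
mus, variances, priors, ll = em_gmm_1d(xs, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
```

On this toy data the two means should settle near the two groups and the priors near one half each. For multivariate data the same loop applies with the full-covariance updates, at the cost of matrix inversion in the density.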

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information
– TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k) P(w_j \mid T_l) \right]
$$

$$
E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right], \qquad
dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$

$$
P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}
$$
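To make the neighborhood weighting concrete, the sketch below computes the matrix of $P(T_l \mid Y_k)$ values from topic coordinates on the map, following the Gaussian form of $E(T_k, T_l)$ and its normalization over all topics. The function name `topic_neighborhood`, the 2x2 grid, and the value of `sigma` are assumptions made for illustration only.

```python
import math

def topic_neighborhood(coords, sigma):
    """Return P where P[k][l] = P(T_l | Y_k) for topics placed at map coordinates.

    coords: list of (x, y) positions, one per topic.
    sigma:  width of the Gaussian neighborhood function.
    """
    P = []
    for (xk, yk) in coords:
        row = []
        for (xl, yl) in coords:
            d2 = (xk - xl) ** 2 + (yk - yl) ** 2          # dist(T_k, T_l)^2
            row.append(math.exp(-d2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma))
        total = sum(row)                                   # sum_s E(T_k, T_s)
        P.append([e / total for e in row])
    return P

# Hypothetical 2x2 topic map.
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
P = topic_neighborhood(grid, sigma=1.0)
```

Each row of `P` is a probability distribution over topics, and a topic gives the most weight to itself and progressively less to topics farther away on the map.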

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k) P(w_j \mid T_l) \right] \right\}
\end{aligned}
$$

– EM training can be performed

$$
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i) P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'}) P(T_k \mid w_{j'}, D_{i'})}
$$

$$
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i) P(T_k \mid w_j, D_i)}{c(D_i)}
$$

where

$$
P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l) P(T_l \mid Y_k) \right] P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'}) P(T_{l'} \mid Y_{k'}) \right] P(T_{k'} \mid D_i) \right\}}
$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i) P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}
$$

Hierarchical Document Organization (cont)

• Example [two figure slides; images not recoverable]

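Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
– A recursive regression process

[Figure: an input layer connected to a two-dimensional mapping layer]

Input vector: $\vec{x} = [x_1, x_2, \ldots, x_n]^T$

Weight vector: $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$

$$
\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),i}(t) \left[ \vec{x}(t) - \vec{m}_i(t) \right]
$$

$$
c(\vec{x}) = \arg\min_{i'} \left\| \vec{x} - \vec{m}_{i'} \right\|, \quad \text{where } \left\| \vec{x} - \vec{m}_{i'} \right\| = \sqrt{\sum_{n} (x_n - m_{i',n})^2}
$$

$$
h_{c(\vec{x}),i}(t) = \alpha(t) \exp\left( -\frac{\left\| \vec{r}_i - \vec{r}_{c(\vec{x})} \right\|^2}{2\sigma^2(t)} \right)
$$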
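One SOM update can be sketched directly from the three formulas above: find the winning unit $c(\vec{x})$, compute the neighborhood factor for every unit from its map position, and move each weight vector toward the input. This is an illustrative sketch only; `som_step`, the two-unit map, and the fixed values of `alpha` and `sigma` (which would normally decay with $t$) are assumptions of the example.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def som_step(x, weights, positions, alpha, sigma):
    """Apply one SOM update for input x.

    weights:   list of weight vectors m_i (one per map unit).
    positions: list of map coordinates r_i (one per map unit).
    Returns (index of the winning unit, list of updated weight vectors).
    """
    # Winner: the unit whose weight vector is closest to x.
    c = min(range(len(weights)), key=lambda i: squared_distance(x, weights[i]))
    updated = []
    for i, m in enumerate(weights):
        # Neighborhood factor h_{c(x), i}: largest for the winner, decays with map distance.
        h = alpha * math.exp(-squared_distance(positions[i], positions[c]) / (2.0 * sigma ** 2))
        updated.append([mi + h * (xi - mi) for mi, xi in zip(m, x)])
    return c, updated

# Hypothetical two-unit map.
weights = [[0.0, 0.0], [1.0, 1.0]]
positions = [(0, 0), (0, 1)]
winner, new_weights = som_step([0.9, 0.9], weights, positions, alpha=0.5, sigma=1.0)
```

The winner moves halfway toward the input (because `alpha` is 0.5), while its map neighbor moves by a smaller fraction determined by the Gaussian factor.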

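Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           19165
TMM     20           20650
TMM     30           19477
TMM     40           19175
SOM     100          20604

$$
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
$$

where

$$
dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i,j)}, \qquad
dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i,j)}
$$

$$
f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
dist_{Map}(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$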
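The evaluation ratio can be computed with a single pass over all document pairs. The sketch below assumes that $T_{r,i}$ denotes a reference topic label for document $i$ and that each document has an $(x, y)$ position on the map; both readings, along with the name `r_dist` and the toy data, are assumptions of this example rather than details confirmed by the slides.

```python
import math

def r_dist(map_positions, labels):
    """Ratio of mean between-class map distance to mean within-class map distance.

    map_positions: list of (x, y) map coordinates, one per document.
    labels:        list of reference topic labels, one per document (assumed meaning of T_r).
    """
    between_sum = within_sum = 0.0
    between_count = within_count = 0
    D = len(map_positions)
    for i in range(D):
        for j in range(i + 1, D):
            (xi, yi), (xj, yj) = map_positions[i], map_positions[j]
            d = math.hypot(xi - xj, yi - yj)      # dist_Map(i, j)
            if labels[i] != labels[j]:
                between_sum += d
                between_count += 1
            else:
                within_sum += d
                within_count += 1
    dist_between = between_sum / between_count
    dist_within = within_sum / within_count
    return dist_between / dist_within

# Toy map: two documents per label, labels placed far apart.
positions = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = ["a", "a", "b", "b"]
ratio = r_dist(positions, labels)
```

A larger ratio means documents with different labels sit farther apart on the map than documents sharing a label. The sketch does not guard against a zero within-class distance, which would need handling on real data.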



Summarized Attributes of Clustering Algorithms (cont)

• Flat Clustering
– Preferable if efficiency is a consideration or data sets are very large

– K-means is conceptually the simplest method and should probably be used first on a new data set, because its results are often sufficient

– K-means assumes a simple Euclidean representation space and so cannot be used for many data sets, e.g., nominal data like colors (or samples with features of different scales)

– The EM algorithm is the most flexible choice. It can accommodate definition of clusters and allocation of objects based on complex probabilistic models

• Its extensions can be used to handle topological/hierarchical orders of samples

– Probabilistic Latent Semantic Analysis (PLSA), Topic Mixture Model (TMM), etc.

Hierarchical Clustering

• Can be done in either a bottom-up or a top-down manner
– Bottom-up (agglomerative)
• Start with individual objects and group the most similar ones
– E.g., those with the minimum distance apart, using $sim(x, y) = \dfrac{1}{1 + d(x, y)}$ (distance measures will be discussed later on)
• The procedure terminates when one cluster containing all objects has been formed

– Top-down (divisive)
• Start with all objects in a group and divide them into groups so as to maximize within-group similarity

Hierarchical Agglomerative Clustering (HAC)

• A bottom-up approach

• Assume a similarity measure for determining the similarity of two objects

• Start with each object in a separate cluster and then repeatedly join the two clusters that are most similar, until only one cluster survives

• The history of merging/clustering forms a binary tree or hierarchy

HAC (cont)

• Algorithm [pseudocode figure not recoverable; its annotations follow]
– Cluster number
– Initialization (for tree leaves): each object is a cluster
– The most similar pair is merged as a new cluster
– The original two clusters are removed
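The annotated steps (initialize each object as a cluster, merge the most similar pair, remove the two originals) can be sketched as a naive loop. This is not the pseudocode from the slide, which did not survive extraction; it is an illustrative version with a brute-force pair search, and the names `hac`, `single_link`, and `sim_1d` plus the toy data are assumptions of the example.

```python
def single_link(ci, cj, sim):
    """Cluster similarity as the similarity of the two closest members."""
    return max(sim(x, y) for x in ci for y in cj)

def sim_1d(x, y):
    """Similarity of two numbers derived from their distance: 1 / (1 + d)."""
    return 1.0 / (1.0 + abs(x - y))

def hac(objects, sim, cluster_sim):
    """Naive agglomerative clustering; returns the merge history.

    Each history entry is (id_a, id_b, new_id). Leaves get ids 0..n-1 and
    every merge creates the next unused id, so the history encodes a binary tree.
    """
    clusters = {i: [obj] for i, obj in enumerate(objects)}  # each object is a cluster
    next_id = len(objects)
    merges = []
    while len(clusters) > 1:
        ids = list(clusters)
        best = None
        # Search all pairs of current clusters for the most similar pair.
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                s = cluster_sim(clusters[ids[a]], clusters[ids[b]], sim)
                if best is None or s > best[0]:
                    best = (s, ids[a], ids[b])
        _, i, j = best
        # Merge into a new cluster and remove the original two.
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        merges.append((i, j, next_id))
        next_id += 1
    return merges

merges = hac([0.0, 0.1, 5.0, 5.2], sim_1d, single_link)
```

On the toy data the two nearby pairs merge first and the two resulting clusters merge last. The pair search makes this version cubic or worse in the number of objects, so it is suitable only for small examples.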

Distance Metrics

• Euclidean Distance (L2 norm)

$$
L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}
$$

– Make sure that all attributes/dimensions have the same scale (or the same variance)

• L1 Norm (City-block distance)

$$
L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} \left| x_i - y_i \right|
$$

• Cosine Similarity (transform to a distance by subtracting from 1)

$$
1 - \frac{\vec{x} \cdot \vec{y}}{\left| \vec{x} \right| \left| \vec{y} \right|}
$$

– Ranges between 0 and 1 when all vector components are non-negative (e.g., term weights)
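The three measures translate directly into code. The following is a minimal sketch using plain Python lists as vectors; the function names are chosen for this example.

```python
import math

def l2_distance(x, y):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    """City-block (L1) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

d2 = l2_distance([0.0, 0.0], [3.0, 4.0])
d1 = l1_distance([0.0, 0.0], [3.0, 4.0])
dc_orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
dc_parallel = cosine_distance([1.0, 2.0], [2.0, 4.0])
```

Note that the cosine distance ignores vector length: the two parallel vectors above are at distance (numerically close to) zero even though their L2 distance is not.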

Measures of Cluster Similarity

• Especially for the bottom-up approaches

• Single-link clustering
– The similarity between two clusters is the similarity of the two closest objects in the clusters

– Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity

– Elongated clusters are achieved (cf. the minimal spanning tree)

$$
sim(c_i, c_j) = \max_{\vec{x} \in c_i,\, \vec{y} \in c_j} sim(\vec{x}, \vec{y})
$$

Measures of Cluster Similarity (cont)

• Complete-link clustering
– The similarity between two clusters is the similarity of their two most dissimilar members

– Sphere-shaped clusters are achieved

– Preferable for most IR and NLP applications

$$
sim(c_i, c_j) = \min_{\vec{x} \in c_i,\, \vec{y} \in c_j} sim(\vec{x}, \vec{y})
$$
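The two linkage rules differ only in taking the maximum or the minimum over cross-cluster pairs. A small sketch, with the similarity function `sim_1d` and the one-dimensional toy clusters being assumptions of the example:

```python
def sim_1d(x, y):
    """Similarity of two numbers derived from their distance: 1 / (1 + d)."""
    return 1.0 / (1.0 + abs(x - y))

def single_link(ci, cj, sim):
    """Similarity of the two closest objects, one from each cluster."""
    return max(sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj, sim):
    """Similarity of the two most dissimilar objects, one from each cluster."""
    return min(sim(x, y) for x in ci for y in cj)

ci = [0.0, 1.0]
cj = [3.0, 5.0]
s_single = single_link(ci, cj, sim_1d)      # closest pair is (1.0, 3.0)
s_complete = complete_link(ci, cj, sim_1d)  # farthest pair is (0.0, 5.0)
```

Because single link only needs one close pair, it can chain clusters together, which is the source of the elongated shapes mentioned above; complete link requires every cross pair to be reasonably similar.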

Measures of Cluster Similarity (cont)

[Figure: clusterings produced by single link and by complete link]

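Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering
– A compromise between single-link and complete-link clustering

– The similarity between two clusters is the average similarity between members

– If the objects are represented as length-normalized vectors and the similarity measure is the cosine
• There exists a fast algorithm for computing the average similarity

$$
sim(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\left| \vec{x} \right| \left| \vec{y} \right|} = \vec{x} \cdot \vec{y} \qquad \text{(length-normalized vectors)}
$$

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

– The average similarity SIM between vectors in a cluster $c_j$ is defined as

$$
SIM(c_j) = \frac{1}{\left| c_j \right| \left( \left| c_j \right| - 1 \right)} \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j,\, \vec{y} \ne \vec{x}} sim(\vec{x}, \vec{y})
= \frac{1}{\left| c_j \right| \left( \left| c_j \right| - 1 \right)} \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j,\, \vec{y} \ne \vec{x}} \vec{x} \cdot \vec{y}
$$

– The sum of members in a cluster $c_j$

$$
\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}
$$

– Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$

$$
\begin{aligned}
\vec{s}(c_j) \cdot \vec{s}(c_j) &= \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} \\
&= \left| c_j \right| \left( \left| c_j \right| - 1 \right) SIM(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} \\
&= \left| c_j \right| \left( \left| c_j \right| - 1 \right) SIM(c_j) + \left| c_j \right| \qquad (\vec{x} \cdot \vec{x} = 1 \text{ for a length-normalized vector}) \\
\therefore\; SIM(c_j) &= \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - \left| c_j \right|}{\left| c_j \right| \left( \left| c_j \right| - 1 \right)}
\end{aligned}
$$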

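Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

– When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

$$
\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad \left| c_{New} \right| = \left| c_i \right| + \left| c_j \right|
$$

– The average similarity for their union will be

$$
SIM(c_i \cup c_j) = \frac{\left( \vec{s}(c_i) + \vec{s}(c_j) \right) \cdot \left( \vec{s}(c_i) + \vec{s}(c_j) \right) - \left( \left| c_i \right| + \left| c_j \right| \right)}{\left( \left| c_i \right| + \left| c_j \right| \right) \left( \left| c_i \right| + \left| c_j \right| - 1 \right)}
$$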
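The shortcut can be verified against the brute-force definition. The sketch below computes the average similarity of a merged cluster both ways, from all pairwise dot products and from the two cluster sum vectors, on made-up length-normalized vectors. All function names and data are assumptions of this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so that cosine similarity equals the dot product."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def avg_sim_bruteforce(cluster):
    """Average pairwise similarity over all ordered pairs of distinct members."""
    n = len(cluster)
    total = sum(dot(x, y)
                for a, x in enumerate(cluster)
                for b, y in enumerate(cluster) if a != b)
    return total / (n * (n - 1))

def sum_vector(cluster):
    """Cluster sum vector s(c)."""
    return [sum(column) for column in zip(*cluster)]

def avg_sim_from_sum(s, size):
    """SIM(c) = (s . s - |c|) / (|c| (|c| - 1)) for length-normalized members."""
    return (dot(s, s) - size) / (size * (size - 1))

def merged_avg_sim(s_i, n_i, s_j, n_j):
    """Average similarity of the union, using only the two sum vectors and sizes."""
    s_new = [a + b for a, b in zip(s_i, s_j)]
    return avg_sim_from_sum(s_new, n_i + n_j)

ci = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0])]
cj = [normalize(v) for v in ([0.0, 1.0], [1.0, 2.0], [2.0, 1.0])]
fast = merged_avg_sim(sum_vector(ci), len(ci), sum_vector(cj), len(cj))
slow = avg_sim_bruteforce(ci + cj)
```

Both values agree up to floating-point error. The sum-vector version costs time proportional to the vector dimension per candidate merge, instead of a pass over all member pairs.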

Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values
– E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of 22 words. “be” has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
– Single-link measure
– Complete-link measure
– Group-average measure

• How to split a cluster
– Splitting is also a clustering task (finding two sub-clusters)
– Any clustering algorithm can be used for the splitting operation, e.g.,
• Bottom-up (agglomerative) algorithms
• Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm [pseudocode figure not recoverable; its annotations follow]
– Split the least coherent cluster
– Generate two new clusters and remove the original one

Non-Hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
– In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
– When to stop (criteria: group-average similarity, likelihood, mutual information)
– What is the right number of clusters (k-1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
– The K-means algorithm
– The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
– A hard clustering algorithm
– Define clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
– A kind of vector quantization
• Map from a continuous space (high resolution) to a discrete space (low resolution)

$$
F: X = \{\vec{x}^t\}_{t=1}^{N} \longrightarrow \text{index } j, \qquad \{\vec{m}_j\}_{j=1}^{k}
$$

where $\vec{m}_j$ is the cluster centroid (also called reference vector, code word, or code vector)

– E.g., color quantization
• 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., $Dim(\vec{x}^t) = 24 \rightarrow k = 2^8$
• A compression rate of 3

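The K-means Algorithm (cont)

Total reconstruction error:

$$
E\left( \{\vec{m}_i\}_{i=1}^{k} \mid X \right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \left\| \vec{x}^t - \vec{m}_i \right\|^2,
\quad \text{where } b_i^t = \begin{cases} 1 & \text{if } \left\| \vec{x}^t - \vec{m}_i \right\| = \min_j \left\| \vec{x}^t - \vec{m}_j \right\| \\ 0 & \text{otherwise} \end{cases}
$$

– $b_i^t$ (the label) and $\vec{m}_i$ are unknown
– $b_i^t$ depends on $\vec{m}_i$, and this optimization problem cannot be solved analytically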

IR ndash Berlin Chen 28

The K-means Algorithm (cont)

• Initialization
  – A set of initial cluster centers $\{m_i\}_{i=1}^{k}$ is needed

• Recursion
  – Assign each object $x^t$ to the cluster whose center is closest

$$ b_i^t = \begin{cases} 1 & \text{if } \lVert x^t - m_i \rVert = \min_j \lVert x^t - m_j \rVert \\ 0 & \text{otherwise} \end{cases} $$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$ m_i = \frac{\sum_{t=1}^{N} b_i^t \cdot x^t}{\sum_{t=1}^{N} b_i^t} $$

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

These two steps are repeated until $m_i$ stabilizes
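The two alternating steps can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions: the function names, the random choice of seeds, the rule of keeping the old center when a cluster becomes empty, and the deterministic tie-breaking (lowest index wins) are choices made for the example, not part of the slides.

```python
import random

def squared_distance(x, m):
    return sum((a - b) ** 2 for a, b in zip(x, m))

def k_means(samples, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(samples, k)  # initialization: k randomly picked seeds
    labels = None
    for _ in range(max_iters):
        # Assignment step: b_i^t = 1 for the closest center m_i
        new_labels = [min(range(k), key=lambda i: squared_distance(x, centers[i]))
                      for x in samples]
        if new_labels == labels:  # assignments (and hence centers) are stable
            break
        labels = new_labels
        # Update step: m_i = mean of the members of cluster i
        for i in range(k):
            members = [x for x, label in zip(samples, labels) if label == i]
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    # Total reconstruction error E
    error = sum(squared_distance(x, centers[label])
                for x, label in zip(samples, labels))
    return centers, labels, error

samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels, error = k_means(samples, k=2)
```

On this small data set every choice of two seeds leads to the same partition; in general the result depends on the seeds.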

The K-means Algorithm (cont)

• Algorithm [Figure: the K-means algorithm]

The K-means Algorithm (cont)

• Example 1 [Figure]

The K-means Algorithm (cont)

• Example 2 [Figure; cluster labels: government, finance, sports, research, name]

The K-means Algorithm (cont)

• Choice of initial cluster centers (seeds) $m_i$ is important

  – Pick at random
  – Calculate the mean $m$ of all data and generate k initial centers $m_i$ by adding a small random vector to the mean ($m \pm \delta$)
  – Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when there are several centers at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly

• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with l = log2 k (k clusters)
    • The clusters become nodes on the hypercube, on which, e.g., a linear classifier can be applied

The K-means Algorithm (cont)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray

[Figure: the global mean is split into a cluster 1 mean and a cluster 2 mean, which are split again into the components {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}; M → 2M at each iteration]
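The M → 2M doubling can be sketched as below. This is a simplified illustration, assuming that each center is split into the pair m + δ and m − δ with a fixed scalar offset `delta` and that a fixed number of K-means passes follows every split; the original LBG algorithm uses a distortion-based stopping rule, which is omitted here. All names are hypothetical.

```python
def squared_distance(x, m):
    return sum((a - b) ** 2 for a, b in zip(x, m))

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def refine(samples, centers, iterations=20):
    """Plain K-means passes starting from the given centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for x in samples:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(x, centers[i]))
            groups[nearest].append(x)
        # an empty group keeps its previous center (assumption)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def lbg(samples, num_splits, delta=0.01):
    centers = [mean(samples)]  # start from the global mean (M = 1)
    for _ in range(num_splits):
        # M -> 2M: replace every center m by m + delta and m - delta
        centers = [tuple(c + sign * delta for c in m)
                   for m in centers for sign in (+1, -1)]
        centers = refine(samples, centers)
    return centers

samples = [(0.0,), (1.0,), (10.0,), (11.0,)]
two_centers = lbg(samples, num_splits=1)
four_centers = lbg(samples, num_splits=2)
```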

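The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be the member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

[Figure: a mixture of Gaussians (or a mixture Gaussian HMM); each sample $\vec{x}_i$ is scored by the component densities $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \dots, P(\vec{x}_i \mid c_K)$, weighted by $\pi_1 = P(c_1), \pi_2 = P(c_2), \dots, \pi_K = P(c_K)$ and summed]

$$ P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) $$

• Continuous case

$$ P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, \lvert \Sigma_k \rvert^{1/2}} \exp\!\Big( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \Big) $$

• Likelihood function for data samples $X = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)

$$ P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) $$

• Classification

$$ \max_k P(c_k \mid \vec{x}_i, \Theta) = \max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) $$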
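The mixture density, the data log-likelihood, and the classification rule can be sketched as follows. To stay within the standard library, the sketch assumes diagonal covariance matrices (each component is described by a mean vector and a vector of per-dimension variances); the slides use full covariance matrices. Function and parameter names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, variances):
    """Diagonal-covariance Gaussian density: a product of 1-D densities."""
    density = 1.0
    for xi, mu, var in zip(x, mean, variances):
        density *= math.exp(-0.5 * (xi - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return density

def posteriors(x, priors, means, variances):
    """P(c_k | x, Theta) for every cluster k."""
    joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
    evidence = sum(joint)  # P(x | Theta)
    return [j / evidence for j in joint]

def log_likelihood(samples, priors, means, variances):
    """log P(X | Theta) for i.i.d. samples."""
    return sum(math.log(sum(p * gaussian_pdf(x, m, v)
                            for p, m, v in zip(priors, means, variances)))
               for x in samples)

def classify(x, priors, means, variances):
    """Index k maximizing P(x | c_k) P(c_k); the evidence cancels out."""
    post = posteriors(x, priors, means, variances)
    return max(range(len(post)), key=lambda k: post[k])

priors = [0.5, 0.5]
means = [(0.0, 0.0), (4.0, 4.0)]
variances = [(1.0, 1.0), (1.0, 1.0)]
```

Unlike the hard K-means labels, `posteriors` returns a soft membership for every cluster; `classify` only reports the largest one.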

The EM Algorithm (cont)


Maximum Likelihood Estimation

• Hard Assignment

[Figure: four balls, two black (B) and two white (W), all assigned to state S1]

P(B|S1) = 2/4 = 0.5

P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

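• Soft Assignment

[Figure: four balls, each assigned to states S1 and S2 with the following weights]

Ball   State S1   State S2
B      0.7        0.3
W      0.4        0.6
B      0.9        0.1
W      0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64

P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36

P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27

P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73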
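The soft-assignment estimates can be checked with a short computation. The sketch below uses fractional counts: each observation contributes its membership weight to a state instead of a full count. The function name and the data layout are assumptions made for the example.

```python
def soft_mle(observations, memberships):
    """Estimate P(symbol | state) from fractional (soft) counts."""
    num_states = len(memberships[0])
    estimates = []
    for s in range(num_states):
        total = sum(w[s] for w in memberships)  # expected number of items in state s
        probs = {}
        for symbol, w in zip(observations, memberships):
            probs[symbol] = probs.get(symbol, 0.0) + w[s] / total
        estimates.append(probs)
    return estimates

balls = ["B", "W", "B", "W"]
weights = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]
estimates = soft_mle(balls, weights)

# Hard assignment is the special case where every weight is 0 or 1
hard = soft_mle(balls, [(1.0,), (1.0,), (1.0,), (1.0,)])
```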

The EM Algorithm (cont)

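• E-step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \Big[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \Big] \qquad \text{(likelihood function)} \\
&= \big[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \dots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \big] \times \dots \times \big[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \dots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \big] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \dots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
$$

where $X = \vec{x}_1 \vec{x}_2 \cdots \vec{x}_{n-1} \vec{x}_n$, $C = c_{k_1} c_{k_2} \cdots c_{k_{n-1}} c_{k_n}$, and $P(X, C \mid \Theta)$ is the complete data likelihood function.

How many kinds of $C$ are there? ($K^n$ kinds)

Note:

$$ \prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k} = (a_{11} + a_{12} + \dots + a_{1M})(a_{21} + a_{22} + \dots + a_{2M}) \cdots (a_{T1} + a_{T2} + \dots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t} $$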

The EM Algorithm (cont)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$

$$ \Phi(\Theta, \hat{\Theta}) = E[L_{CM}] = E_C\big[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \big] = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) $$

    ($\Theta$: known; $\hat{\Theta}$: unknown)

  – Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$ \Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta) $$

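The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \Big[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Big] \log \Big[ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \Big] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \Big[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Big] \Big[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \Big] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \Big[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Big] \Big[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i, k} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \Big] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \Big\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Big\} \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \big[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \big] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

where

$$ \delta_{k_i, k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases} $$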

The EM Algorithm (cont)

  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \Big[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Big] P(c_k \mid \vec{x}_i, \Theta) \\
&= \Big[ \prod_{j=1, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \Big] P(c_k \mid \vec{x}_i, \Theta) \\
&= \Big[ \prod_{j=1, j \ne i}^{n} 1 \Big] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

    ($\vec{x}_i$ can only be aligned to $c_k$)

Note:

$$ \prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k} = (a_{11} + a_{12} + \dots + a_{1M})(a_{21} + a_{22} + \dots + a_{2M}) \cdots (a_{T1} + a_{T2} + \dots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t} $$

The EM Algorithm (cont)

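• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$ \Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta}) $$

where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \qquad \text{(auxiliary function for mixture weights)}
\end{aligned}
$$

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \qquad \text{(auxiliary function for cluster distributions)}
\end{aligned}
$$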

The EM Algorithm (cont)

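• M-step (Maximization)
  – Remember that
    • A function $F$ can be maximized under a constraint by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$, with the constraint $\sum_{j=1}^{N} y_j = 1$.

By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \Big( \sum_{j=1}^{N} y_j - 1 \Big) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
\Rightarrow\; \sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j = -\ell \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$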

The EM Algorithm (cont)

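• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \Big( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \Big) \\
&= \sum_{k=1}^{K} \underbrace{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \Big( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \Big)
\end{aligned}
$$

$$
\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\sum_{k'=1}^{K} \sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

    (each summand is the expected number of times that $\vec{x}_i$ falls in class $c_k$)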

The EM Algorithm (cont)

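• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta}) $$

$$ P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, \lvert \Sigma_k \rvert^{1/2}} \exp\!\Big( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \Big) $$

Let

$$ w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} $$

and

$$ \log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) $$

Then

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Big[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \Big] + D $$

where $D$ is a constant.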

The EM Algorithm (cont)

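• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Big[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \Big] + D $$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} &= -\frac{1}{2} \cdot 2 \cdot \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow\; \hat{\vec{\mu}}_k &= \frac{\sum_{i=1}^{n} w_{ik} \cdot \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

    ($w_{ik}$: the expected number of times that $\vec{x}_i$ falls in class $c_k$)

Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here.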

The EM Algorithm (cont)

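• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Big[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \Big] + D $$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \Big[ \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \Big] = 0 \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} &= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
\Rightarrow\; \hat{\Sigma}_k &= \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

Notes:

$$ \frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \quad \text{and } \hat{\Sigma}_k \text{ is symmetric here} $$

$$ \frac{d(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1} \vec{a}\, \vec{b}^T X^{-1} \quad \text{(for symmetric } X\text{)} $$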

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
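The E-step and M-step derived above can be put together in a compact loop. The sketch below is restricted to one-dimensional data, so each covariance matrix reduces to a scalar variance; the variance floor `var_floor`, the tolerance, and all names are assumptions added for the example and do not come from the slides. The initial parameters are passed in explicitly (they could come from a K-means run, as suggested above).

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(samples, priors, means, variances,
              max_iters=200, tol=1e-9, var_floor=1e-6):
    priors, means, variances = list(priors), list(means), list(variances)
    n, K = len(samples), len(means)
    previous = -math.inf
    log_lik = previous
    for _ in range(max_iters):
        # E-step: w[i][k] = P(c_k | x_i, Theta) under the current parameters
        w, log_lik = [], 0.0
        for x in samples:
            joint = [priors[k] * normal_pdf(x, means[k], variances[k])
                     for k in range(K)]
            total = sum(joint)  # P(x_i | Theta)
            log_lik += math.log(total)
            w.append([j / total for j in joint])
        if log_lik - previous < tol:  # likelihood has converged
            break
        previous = log_lik
        # M-step: re-estimate priors, means, and variances from soft counts
        for k in range(K):
            n_k = sum(w[i][k] for i in range(n))
            priors[k] = n_k / n
            means[k] = sum(w[i][k] * samples[i] for i in range(n)) / n_k
            var = sum(w[i][k] * (samples[i] - means[k]) ** 2 for i in range(n)) / n_k
            variances[k] = max(var, var_floor)  # assumed safeguard against collapse
    return priors, means, variances, log_lik

samples = [0.0, 0.2, -0.2, 0.1, -0.1, 5.0, 5.2, 4.8, 5.1, 4.9]
priors, means, variances, log_lik = em_gmm_1d(
    samples, priors=[0.5, 0.5], means=[1.0, 4.0], variances=[1.0, 1.0])
```

The returned `log_lik` is the log-likelihood computed in the last E-step. For widely separated data the component densities can underflow; a practical implementation would work with log-densities instead.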

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

$$ P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \Big[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \Big] $$

$$ P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Big[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \Big], \qquad dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} $$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map [Figure: two-dimensional tree structure for organized topics]

• Related documents are in the same cluster, and the relationships among the clusters are reflected by their distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \Big\{ \sum_{k=1}^{K} P(T_k \mid D_i) \Big[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \Big] \Big\}
\end{aligned}
$$

  – EM training can be performed

$$ \hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})} $$

$$ \hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)} $$

where

$$ P(T_k \mid w_j, D_i) = \frac{\Big[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \Big] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \Big\{ \Big[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'}) \Big] \cdot P(T_{k'} \mid D_i) \Big\}} $$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$ S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \big[ 1 - P(T_k \mid D_{i'}) \big]} $$

Hierarchical Document Organization (cont)

• Example [Figure]

Hierarchical Document Organization (cont)

• Example (cont) [Figure]

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  – A recursive regression process

[Figure: an input layer holding the input vector $x = [x_1, x_2, \dots, x_n]^T$ and a mapping layer whose units hold the weight vectors $m_i = [m_{i,1}, m_{i,2}, \dots, m_{i,n}]^T$, e.g., $m_1 = [m_{1,1}, m_{1,2}, \dots, m_{1,n}]^T$]

$$ m_i(t+1) = m_i(t) + h_{c(x),i}(t)\, [x(t) - m_i(t)] $$

where

$$ c(x) = \arg\min_{i'} \lVert x - m_{i'} \rVert, \qquad \lVert x - m_{i'} \rVert = \sqrt{\sum_{n} (x_n - m_{i',n})^2} $$

$$ h_{c(x),i}(t) = \alpha(t) \exp\!\Big( -\frac{\lVert r_i - r_{c(x)} \rVert^2}{2\sigma^2(t)} \Big) $$
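One step of the recursive update can be sketched as follows. The sketch assumes that `positions` holds the map coordinates $r_i$ of the units and that the learning rate and the neighborhood width are passed in as plain numbers for the current time step; all names are hypothetical.

```python
import math

def som_step(weights, positions, x, alpha, sigma):
    """Apply one SOM update for input x; returns (winner index, new weights)."""
    # Winner c(x): the unit whose weight vector is closest to the input
    c = min(range(len(weights)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, weights[i])))
    updated = []
    for m_i, r_i in zip(weights, positions):
        # Neighborhood function h_{c(x),i}: decays with map distance to the winner
        grid_sq = sum((a - b) ** 2 for a, b in zip(r_i, positions[c]))
        h = alpha * math.exp(-grid_sq / (2.0 * sigma ** 2))
        # m_i(t+1) = m_i(t) + h * (x - m_i(t))
        updated.append([m + h * (xv - m) for m, xv in zip(m_i, x)])
    return c, updated

weights = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
positions = [(0, 0), (0, 1), (0, 2)]  # unit coordinates on the map
winner, new_weights = som_step(weights, positions, x=[1.2, 0.8], alpha=0.5, sigma=1.0)
```

The winner moves furthest toward the input, and its map neighbors follow with a strength that decreases with their distance on the map.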

Hierarchical Document Organization (cont)

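• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$ R_{Dist} = \frac{dist_{Between}}{dist_{Within}} $$

where

$$ dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases} $$

$$ dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}, \qquad f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases} $$

$$ dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} $$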


Hierarchical Clustering


Hierarchical Clustering

• Can be performed in either a bottom-up or a top-down manner
  – Bottom-up (agglomerative)
    • Start with individual objects and group the most similar ones
      – E.g., those with the minimum distance apart, using $sim(x, y) = \dfrac{1}{1 + d(x, y)}$ (distance measures will be discussed later on)
    • The procedure terminates when one cluster containing all objects has been formed
  – Top-down (divisive)
    • Start with all objects in a group and divide them into groups so as to maximize within-group similarity

Hierarchical Agglomerative Clustering (HAC)

• A bottom-up approach

• Assume a similarity measure for determining the similarity of two objects

• Start with each object in a separate cluster and then repeatedly join the two clusters that are most similar, until only one cluster remains

• The history of merging/clustering forms a binary tree or hierarchy

HAC (cont)

• Algorithm [Figure: the HAC algorithm; annotations:]
  – Cluster number
  – Initialization (for tree leaves): each object is a cluster
  – The two most similar clusters are merged as a new cluster
  – The original two clusters are removed
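The merge loop annotated above can be sketched as follows. This is a simple quadratic-search illustration, not the pseudocode from the slide figure: the similarity $1/(1+d)$ with Euclidean $d$, the single-link default, and the function names are assumptions made for the example.

```python
import math

def similarity(x, y):
    """sim(x, y) = 1 / (1 + d(x, y)) with Euclidean distance d."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + d)

def single_link(ci, cj):
    """Similarity of the two closest members (assumed default measure)."""
    return max(similarity(x, y) for x in ci for y in cj)

def hac(objects, cluster_sim=single_link):
    clusters = [[o] for o in objects]  # initialization: each object is a cluster
    merges = []  # merge history; it defines the binary tree
    while len(clusters) > 1:
        # find the pair of clusters with the greatest similarity
        a, b = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda pair: cluster_sim(clusters[pair[0]], clusters[pair[1]]))
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]  # merged as a new cluster
        # the original two clusters are removed
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges

objects = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
history = hac(objects)
```

Passing a different `cluster_sim` function (for example a complete-link or group-average measure) changes the shape of the resulting hierarchy without changing the loop.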

Distance Metrics

• Euclidean distance (L2 norm)

  $$L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

  – Make sure that all attributes/dimensions have the same scale (or the same variance)

• L1 norm (city-block distance)

  $$L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$$

• Cosine similarity (transformed to a distance by subtracting it from 1)

  $$1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$$

  – Ranges between 0 and 1 when all vector components are non-negative
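A minimal sketch of the three measures above, assuming vectors are plain Python sequences of equal length. The function names are chosen for illustration only.

```python
import math


def euclidean(x, y):
    # L2 norm of the difference vector
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def city_block(x, y):
    # L1 norm of the difference vector
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_distance(x, y):
    # 1 minus the cosine similarity; assumes neither vector is all zeros
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)
```

For example, `euclidean([0, 0], [3, 4])` gives 5.0 and `city_block([0, 0], [3, 4])` gives 7, while two orthogonal vectors have a cosine distance of 1.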

Measures of Cluster Similarity

• Especially for the bottom-up approaches

• Single-link clustering
  – The similarity between two clusters is the similarity of the two closest objects in the clusters
  – Search over all pairs of objects drawn from the two different clusters and select the pair with the greatest similarity

    $$sim(c_i, c_j) = \max_{\vec{x} \in c_i,\ \vec{y} \in c_j} sim(\vec{x}, \vec{y})$$

  – Elongated clusters are obtained (cf. the minimal spanning tree)

Measures of Cluster Similarity (cont)

• Complete-link clustering
  – The similarity between two clusters is the similarity of their two most dissimilar members (the pair with the least similarity)

    $$sim(c_i, c_j) = \min_{\vec{x} \in c_i,\ \vec{y} \in c_j} sim(\vec{x}, \vec{y})$$

  – Sphere-shaped clusters are obtained
  – Preferable for most IR and NLP applications

Measures of Cluster Similarity (cont)

[Figure: clusterings of the same data produced by single link and by complete link]
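The two cluster-similarity measures differ only in whether the maximum or the minimum pairwise similarity is taken. A small sketch, with hypothetical function names and a 1-D similarity chosen only for the example:

```python
def single_link(ci, cj, sim):
    # similarity of the two closest objects (greatest pairwise similarity)
    return max(sim(x, y) for x in ci for y in cj)


def complete_link(ci, cj, sim):
    # similarity of the two most dissimilar members (least pairwise similarity)
    return min(sim(x, y) for x in ci for y in cj)


def sim_1d(x, y):
    # illustrative similarity for numbers: 1 / (1 + distance)
    return 1.0 / (1.0 + abs(x - y))


s_single = single_link([0, 1], [2, 5], sim_1d)      # closest pair is (1, 2)
s_complete = complete_link([0, 1], [2, 5], sim_1d)  # farthest pair is (0, 5)
```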

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering
  – A compromise between single-link and complete-link clustering
  – The similarity between two clusters is the average similarity between members
  – If the objects are represented as length-normalized vectors and the similarity measure is the cosine, there exists a fast algorithm for computing the average similarity

    $$sim(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x} \cdot \vec{y} \quad \text{(length-normalized vectors)}$$

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

  – The average similarity SIM between vectors in a cluster $c_j$ is defined as

    $$SIM(c_j) = \frac{1}{|c_j|(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j,\ \vec{y} \ne \vec{x}} sim(\vec{x}, \vec{y}) = \frac{1}{|c_j|(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j,\ \vec{y} \ne \vec{x}} \vec{x} \cdot \vec{y}$$

  – The sum of the members in a cluster $c_j$

    $$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$

  – Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$

    $$\begin{aligned}
    \vec{s}(c_j) \cdot \vec{s}(c_j) &= \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} \\
    &= |c_j|(|c_j| - 1)\, SIM(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} \\
    &= |c_j|(|c_j| - 1)\, SIM(c_j) + |c_j| \qquad (\vec{x} \cdot \vec{x} = 1 \text{ for a length-normalized vector}) \\
    \therefore\ SIM(c_j) &= \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|(|c_j| - 1)}
    \end{aligned}$$

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

  – When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

    $$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$

  – The average similarity for their union will be

    $$SIM(c_i \cup c_j) = \frac{\left(\vec{s}(c_i) + \vec{s}(c_j)\right) \cdot \left(\vec{s}(c_i) + \vec{s}(c_j)\right) - \left(|c_i| + |c_j|\right)}{\left(|c_i| + |c_j|\right)\left(|c_i| + |c_j| - 1\right)}$$
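The shortcut can be checked numerically: the average similarity computed from the cluster sum vectors should agree with the brute-force average over all ordered pairs of distinct members. The sketch below assumes length-normalized vectors stored as Python lists; all function names and the toy vectors are illustrative.

```python
import math


def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sum_vector(cluster):
    # s(c): component-wise sum of the cluster members
    return [sum(col) for col in zip(*cluster)]


def avg_sim_from_sum(s, size):
    # SIM(c) = (s.s - |c|) / (|c| (|c| - 1)), valid for length-normalized members
    return (dot(s, s) - size) / (size * (size - 1))


def avg_sim_brute(cluster):
    # direct average over all ordered pairs of distinct members
    n = len(cluster)
    total = sum(dot(x, y)
                for i, x in enumerate(cluster)
                for j, y in enumerate(cluster) if i != j)
    return total / (n * (n - 1))


def merged_avg_sim(s_i, n_i, s_j, n_j):
    # SIM(c_i U c_j), using only the two sum vectors and the two sizes
    s_new = [a + b for a, b in zip(s_i, s_j)]
    return avg_sim_from_sum(s_new, n_i + n_j)


ci = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0])]
cj = [normalize(v) for v in ([0.0, 1.0], [1.0, 2.0])]
fast = merged_avg_sim(sum_vector(ci), len(ci), sum_vector(cj), len(cj))
slow = avg_sim_brute(ci + cj)
```

The fast version needs only one dot product per candidate merge, which is the point of keeping the sum vectors around.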

Example Word Clustering

• Words (objects) are described and clustered using a set of features and values
  – E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of 22 words. “be” has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure

• How to split a cluster
  – This is also a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm
  [Pseudocode figure not recovered. Its annotations: split the least coherent cluster; generate two new clusters and remove the original one]

Non-Hierarchical Clustering


Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  – When to stop (possible criteria: group average similarity, likelihood, mutual information)
  – What is the right number of clusters (k-1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Defines clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Maps from a continuous space (high resolution) to a discrete space (low resolution)

      $$X = \{\vec{x}^t\}_{t=1}^{N} \xrightarrow{\;F\;} \text{index } j \in \{1, \dots, k\}, \quad \text{with } \{\vec{m}_j\}_{j=1}^{k}$$

      where each $\vec{m}_j$ is called a reference vector, code word, code vector, or cluster centroid
  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., $\dim(\vec{x}^t) = 24 \rightarrow k = 2^8$
    • A compression rate of 3

The K-means Algorithm (cont)

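• Total reconstruction error

  $$E\left(\{\vec{m}_i\}_{i=1}^{k} \mid X\right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \,\|\vec{x}^t - \vec{m}_i\|^2, \quad \text{where } b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t - \vec{m}_i\| = \min_j \|\vec{x}^t - \vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$

  – The labels $b_i^t$ and the centers $\vec{m}_i$ are unknown
  – $b_i^t$ depends on $\vec{m}_i$, and this optimization problem cannot be solved analytically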

The K-means Algorithm (cont)

• Initialization
  – A set of initial cluster centers $\{\vec{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  – Assign each object $\vec{x}^t$ to the cluster whose center is closest

    $$b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t - \vec{m}_i\| = \min_j \|\vec{x}^t - \vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

    $$\vec{m}_i = \frac{\sum_{t=1}^{N} b_i^t \,\vec{x}^t}{\sum_{t=1}^{N} b_i^t}$$

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

  – These two steps are repeated until $\vec{m}_i$ stabilizes

The K-means Algorithm (cont)

• Algorithm
  [Pseudocode figure not recovered]
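The K-means recursion (assign each object to the closest center, then re-compute each center as the mean of its members, until nothing changes) can be sketched in plain Python. This is an illustration rather than the pseudocode from the slide: the function names, the random seeding, the `max_iter` cap, and the choice to keep the old center of an empty cluster are all assumptions.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # seeds picked at random
    labels = None
    for _ in range(max_iter):
        # assignment step: label each point with its closest center
        new_labels = [min(range(k), key=lambda i: sq_dist(p, centers[i]))
                      for p in points]
        if new_labels == labels:  # assignments are stable, so centers are too
            break
        labels = new_labels
        # update step: each center becomes the mean of its members
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, k=2)
```

On this toy data the two well-separated groups end up in different clusters. Which group receives label 0 depends on the random seeds.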

The K-means Algorithm (cont)

• Example 1
  [Figure not recovered]

The K-means Algorithm (cont)

• Example 2
  [Figure not recovered; surviving labels: government, finance, sports, research, name]

The K-means Algorithm (cont)

• The choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean $\vec{m}$ of all data and generate k initial centers $\vec{m}_i$ by adding a small random vector to the mean: $\vec{m}_i = \vec{m} \pm \vec{\delta}$
  – Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $\vec{m}_i$
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly

• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with $l = \log_2 k$ (k clusters)
    • The clusters become nodes on the hypercube, which can then be fed to, e.g., a linear classifier

The K-means Algorithm (cont)

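• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray
  – The number of clusters doubles at each iteration (M → 2M)

[Figure: the global mean is split into the cluster 1 mean and the cluster 2 mean; after the next split each of the four clusters has its own parameters $\{\mu_{11}, \Sigma_{11}, \omega_{11}\}$, $\{\mu_{12}, \Sigma_{12}, \omega_{12}\}$, $\{\mu_{13}, \Sigma_{13}, \omega_{13}\}$, $\{\mu_{14}, \Sigma_{14}, \omega_{14}\}$]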

The EM Algorithm

• A soft version of the K-means algorithm
  – Each object can be a member of multiple clusters
  – Clustering is viewed as estimating a mixture of (continuous) probability distributions, e.g., a mixture of Gaussians (as in a mixture-Gaussian HMM)

    $$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta), \qquad \pi_k = P(c_k)$$

  – Continuous case: each cluster is a multivariate Gaussian over m-dimensional samples

    $$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k)\right)$$

  – Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$, where the $\vec{x}_i$ are independent and identically distributed (i.i.d.)

    $$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

  – Classification

    $$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

The EM Algorithm (cont)


Maximum Likelihood Estimation

• Hard assignment

  [Figure: four samples, two labeled B and two labeled W, all assigned to state S1]

  P(B|S1) = 2/4 = 0.5

  P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

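• Soft assignment

  [Figure: four samples (B, W, B, W), each assigned softly to states S1 and S2]

  Sample   Weight for S1   Weight for S2
  1 (B)    0.7             0.3
  2 (W)    0.4             0.6
  3 (B)    0.9             0.1
  4 (W)    0.5             0.5

  P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64

  P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36

  P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27

  P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73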
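The soft-assignment estimates can be reproduced with a few lines of Python. The sketch assumes, as the arithmetic on the slide implies, that the samples weighted 0.7 and 0.9 for S1 are B and the other two are W. The function name `soft_ml_estimates` is hypothetical.

```python
def soft_ml_estimates(observations, weights):
    """Weighted relative frequencies: P(symbol | state) under soft assignment.

    observations: one symbol per sample;
    weights: the weight with which each sample belongs to the state.
    """
    total = sum(weights)
    counts = {}
    for symbol, w in zip(observations, weights):
        counts[symbol] = counts.get(symbol, 0.0) + w  # fractional count
    return {symbol: c / total for symbol, c in counts.items()}


samples = ["B", "W", "B", "W"]
p_s1 = soft_ml_estimates(samples, [0.7, 0.4, 0.9, 0.5])
p_s2 = soft_ml_estimates(samples, [0.3, 0.6, 0.1, 0.5])
```

With all weights set to 1 the same function reduces to the hard-assignment estimate, 2/4 = 0.5 for each symbol.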

The EM Algorithm (cont)

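• E-step (Expectation)
  – Derive the complete data likelihood function

    $$\begin{aligned}
    P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
    &= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
    &= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
    &= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
    &= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \dots, \vec{x}_n, c_{k_n} \mid \Theta) \\
    &= \sum_{C} P(X, C \mid \Theta)
    \end{aligned}$$

    where $X = \vec{x}_1 \vec{x}_2 \cdots \vec{x}_{n-1} \vec{x}_n$ and $C = c_{k_1} c_{k_2} \cdots c_{k_{n-1}} c_{k_n}$ is one alignment of clusters to samples; $P(X \mid \Theta)$ is the likelihood function and $P(X, C \mid \Theta)$ is the complete data likelihood function

  – How many kinds of C are there? $K^n$ kinds

  – Note: the expansion uses the identity

    $$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t,k} = (a_{1,1} + a_{1,2} + \cdots + a_{1,M})(a_{2,1} + a_{2,2} + \cdots + a_{2,M}) \cdots (a_{T,1} + a_{T,2} + \cdots + a_{T,M}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t,k_t}$$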
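The product-of-sums identity behind this expansion, and the count of $K^n$ alignments, can be checked by brute force on a tiny example. The 3 x 3 table `a` below is made-up illustration data, with `a[t][k]` standing in for $P(\vec{x}_t \mid c_k, \Theta)\, P(c_k \mid \Theta)$.

```python
from itertools import product
from math import isclose, prod

# a[t][k]: hypothetical per-sample, per-cluster terms (T = 3 samples, M = 3 clusters)
a = [[0.2, 0.5, 0.3],
     [0.1, 0.6, 0.3],
     [0.7, 0.2, 0.1]]
T, M = len(a), len(a[0])

# left-hand side: product over samples of the sum over clusters
lhs = prod(sum(row) for row in a)

# right-hand side: sum over every alignment (k_1, ..., k_T) of the product of its terms
alignments = list(product(range(M), repeat=T))
rhs = sum(prod(a[t][k] for t, k in enumerate(ks)) for ks in alignments)
```

Here `len(alignments)` is $M^T = 27$, which is why the sum over C is never enumerated in practice.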

The EM Algorithm (cont)

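• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable C, conditioned on the known data X and the current parameters $\Theta$

    $$\Phi(\Theta, \hat{\Theta}) = E[L_{CM}] = E_{C \mid X, \Theta}\left[\log P(X, C \mid \hat{\Theta})\right] = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})$$

    where $\Theta$ (the current estimate) is known and $\hat{\Theta}$ (the new estimate) is unknown

  – Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function, $\Phi(\Theta, \hat{\Theta})$

    $$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$

    • We have shown this property when deriving the HMM-based retrieval model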

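The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

    $$\begin{aligned}
    \Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) \\
    &= \sum_{C} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \right] \log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
    &= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
    &= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, k_i} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
    &= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
    &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(the braced term equals } P(c_k \mid \vec{x}_i, \Theta) \text{; see next slide)} \\
    &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(c_k \mid \hat{\Theta})\, P(\vec{x}_i \mid c_k, \hat{\Theta}) \right] \\
    &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
    \end{aligned}$$

    where

    $$\delta_{k, k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}$$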

The EM Algorithm (cont)

  – Note that

    $$\begin{aligned}
    \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
    &= \left[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{j=1,\, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
    &= \left[ \prod_{j=1,\, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
    &= \left[ \prod_{j=1,\, j \ne i}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
    &= P(c_k \mid \vec{x}_i, \Theta)
    \end{aligned}$$

    • Because of $\delta_{k, k_i}$, $\vec{x}_i$ can only be aligned to $c_k$
    • The second step uses the identity $\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t,k} = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t,k_t}$

The EM Algorithm (cont)

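• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

    $$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

    where

    $$\begin{aligned}
    \Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) \\
    &= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
    &= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
    \end{aligned}$$

    is the auxiliary function for the mixture weights, and

    $$\begin{aligned}
    \Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
    &= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
    \end{aligned}$$

    is the auxiliary function for the cluster distributions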

The EM Algorithm (cont)

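• M-step (Maximization)
  – Remember that
    • A function F can be maximized under a constraint by applying a Lagrange multiplier

    Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$, with the constraint $\sum_{j=1}^{N} y_j = 1$. By applying a Lagrange multiplier $\ell$:

    $$\begin{aligned}
    \hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
    \frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \ \Rightarrow\ w_j = -\ell\, y_j \quad \forall j \\
    \sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j = -\ell \ \Rightarrow\ \ell = -\sum_{j=1}^{N} w_j \\
    \therefore\ y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
    \end{aligned}$$

    Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$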

The EM Algorithm (cont)

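• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians)

    $$\begin{aligned}
    \hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
    &= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
    \end{aligned}$$

    $$\Rightarrow\ \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\sum_{k'=1}^{K} \sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}} = \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

    • $w_k$ is the expected number of times that a sample $\vec{x}_i$ falls in class $c_k$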

The EM Algorithm (cont)

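• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

    with

    $$P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\, |\hat{\Sigma}_k|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)\right)$$

    Let

    $$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

    and

    $$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

    so that

    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D \right]$$

    where $D = -\frac{m}{2} \log(2\pi)$ is a constant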

The EM Algorithm (cont)

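• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D \right]$$

    $$\begin{aligned}
    \frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} &= \sum_{i=1}^{n} w_{ik} \cdot \left(-\frac{1}{2}\right) \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
    \Rightarrow\ \hat{\vec{\mu}}_k &= \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
    \end{aligned}$$

    • Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\Sigma_k^{-1}$ is symmetric here
    • The denominator $\sum_{i} w_{ik}$ is the expected number of times that a sample $\vec{x}_i$ falls in class $c_k$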

The EM Algorithm (cont)

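• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D \right]$$

    $$\begin{aligned}
    \frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
    \Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} &= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
    \Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
    \Rightarrow\ \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
    \Rightarrow\ \hat{\Sigma}_k &= \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
    \end{aligned}$$

    • Note: $\dfrac{d \det(X)}{dX} = \det(X) \cdot X^{-T}$ and $\dfrac{d(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-T}\, \vec{a}\, \vec{b}^T X^{-T}$, and $\Sigma_k$ is symmetric here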

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
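The E-step and M-step updates for mixture weights, means, and variances can be put together in a compact sketch. To stay dependency-free it is restricted to one-dimensional Gaussians (so $\Sigma_k$ becomes a scalar variance), which is a simplification of the multivariate case on the slides. The function names, the variance floor `1e-6`, and the stopping tolerance are assumptions made for this example.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(xs, means, variances, priors, max_iter=200, tol=1e-9):
    """EM for a 1-D Gaussian mixture; returns (means, variances, priors, log_likelihood)."""
    means, variances, priors = list(means), list(variances), list(priors)
    n, K = len(xs), len(means)
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E-step: w[i][k] = P(c_k | x_i, Theta) under the current parameters
        joint = [[priors[k] * gaussian_pdf(x, means[k], variances[k])
                  for k in range(K)] for x in xs]
        ll = sum(math.log(sum(row)) for row in joint)  # log P(X | Theta)
        w = [[p / sum(row) for p in row] for row in joint]
        # M-step: re-estimate priors, means and variances from the posteriors
        for k in range(K):
            wk = sum(w[i][k] for i in range(n))  # expected count for cluster k
            priors[k] = wk / n
            means[k] = sum(w[i][k] * xs[i] for i in range(n)) / wk
            var = sum(w[i][k] * (xs[i] - means[k]) ** 2 for i in range(n)) / wk
            variances[k] = max(var, 1e-6)  # assumed floor to avoid a zero variance
        # terminate when the likelihood has converged
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return means, variances, priors, ll


data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.1, 4.9, 5.2]
est_means, est_vars, est_priors, log_lik = em_gmm_1d(
    data, means=[1.0, 4.0], variances=[1.0, 1.0], priors=[0.5, 0.5])
```

In practice the initial means could come from a K-means run rather than being set by hand, and the multivariate case replaces the scalar variance update with the outer-product covariance update.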

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

    $$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]$$

    $$P(T_l \mid Y_k) = \frac{E(T_l, T_k)}{\sum_{s=1}^{K} E(T_s, T_k)}, \qquad E(T_l, T_k) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]$$

    $$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are placed in the same cluster, and the relationships among the clusters are reflected by their distances on the map

• When a cluster contains many documents, it can be further analyzed into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

  $$L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}$$

  – EM training can be performed

    $$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

    $$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

    where

    $$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'}) \right] P(T_{k'} \mid D_i) \right\}}$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

  $$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'})\, [1 - P(T_k \mid D_{i'})]}$$

Hierarchical Document Organization (cont)

• Example
  [Figure not recovered]

Hierarchical Document Organization (cont)

• Example (cont)
  [Figure not recovered]

Hierarchical Document Organization (cont)

• Self-Organizing Map (SOM)
  – A recursive regression process

    $$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),i}(t)\, [\vec{x}(t) - \vec{m}_i(t)]$$

    where

    $$c(\vec{x}) = \arg\min_i \|\vec{x} - \vec{m}_i\|, \qquad \|\vec{x} - \vec{m}_i\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

    $$h_{c(\vec{x}),i}(t) = \alpha(t) \exp\left( -\frac{\|\vec{r}_i - \vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)} \right)$$

  – Input layer: input vector $\vec{x} = [x_1, x_2, \dots, x_n]^T$
  – Mapping layer: each unit i holds a weight vector $\vec{m}_i = [m_{i,1}, m_{i,2}, \dots, m_{i,n}]^T$
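One SOM update can be sketched as follows: find the best-matching unit $c(\vec{x})$, then pull every weight vector toward the input by an amount that decays with its map distance from that unit. The function name `som_step`, the two-unit map, and the fixed values of `alpha` and `sigma` (standing in for $\alpha(t)$ and $\sigma(t)$ at one time step) are assumptions for illustration.

```python
import math


def som_step(x, weights, coords, alpha, sigma):
    """One SOM update; returns (index of best-matching unit, new weight vectors).

    weights[i]: weight vector m_i; coords[i]: map position r_i of unit i.
    """
    # c(x): unit whose weight vector is closest to the input
    c = min(range(len(weights)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, weights[i])))
    new_weights = []
    for i, m in enumerate(weights):
        # neighborhood function h_{c(x),i}: Gaussian in map distance to the winner
        d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[c]))
        h = alpha * math.exp(-d2 / (2.0 * sigma ** 2))
        # m_i(t+1) = m_i(t) + h * (x - m_i(t))
        new_weights.append([mi + h * (xi - mi) for xi, mi in zip(x, m)])
    return c, new_weights


bmu, updated = som_step(x=[1.0, 0.8],
                        weights=[[0.0, 0.0], [1.0, 1.0]],
                        coords=[(0, 0), (0, 1)],
                        alpha=0.5, sigma=1.0)
```

In a full training loop `alpha` and `sigma` would typically shrink over time, so that early updates organize the map globally and later ones fine-tune it.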

Hierarchical Document Organization (cont)

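• Results

  Model   Iterations   dist_Between / dist_Within
  TMM     10           1.9165
  TMM     20           2.0650
  TMM     30           1.9477
  TMM     40           1.9175
  SOM     100          2.0604

  $$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

  where

  $$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

  $$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

  $$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

  $$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$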


Page 10: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

Hierarchical Clustering

• Can be performed in either a bottom-up or a top-down manner
  – Bottom-up (agglomerative)
    • Start with individual objects and group the most similar ones, e.g., those with the minimum distance apart, where similarity is derived from distance as

      $$sim(x, y) = \frac{1}{1 + d(x, y)}$$

    • The procedure terminates when one cluster containing all objects has been formed
  – Top-down (divisive)
    • Start with all objects in one group and divide them into groups so as to maximize within-group similarity

(Distance measures are discussed later on.)

Hierarchical Agglomerative Clustering (HAC)

• A bottom-up approach

• Assume a similarity measure for determining the similarity of two objects

• Start with each object in a separate cluster, then repeatedly join the two clusters that are most similar until only one cluster survives

• The history of merging/clustering forms a binary tree or hierarchy

HAC (cont.)

• Algorithm

[Figure: HAC pseudocode. Annotations: cluster number; initialization (for tree leaves): each object is a cluster; the two selected clusters are merged as a new cluster; the original two clusters are removed.]
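The merge loop described above can be sketched in a few lines of Python. This is a naive illustration rather than the slide's own pseudocode: the function name `hac`, the `linkage` parameter, and the 1-D similarity `sim_1d` are assumptions made for the example.

```python
from itertools import combinations

def hac(points, sim, linkage=max):
    """Naive agglomerative clustering; returns the merge history (a, b, new_id)."""
    # Initialization (tree leaves): each object is its own cluster.
    clusters = {i: [p] for i, p in enumerate(points)}
    next_id = len(points)
    history = []
    while len(clusters) > 1:
        def cluster_sim(a, b):
            # linkage=max gives single link, linkage=min gives complete link.
            return linkage(sim(x, y) for x in clusters[a] for y in clusters[b])
        # Pick the pair of clusters with the greatest similarity.
        a, b = max(combinations(clusters, 2), key=lambda ab: cluster_sim(*ab))
        # Merge them as a new cluster and remove the original two.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        history.append((a, b, next_id))
        next_id += 1
    return history

def sim_1d(x, y):
    # Hypothetical similarity for 1-D points: sim = 1 / (1 + distance).
    return 1.0 / (1.0 + abs(x - y))

history = hac([0.0, 0.1, 1.0, 5.0], sim_1d, linkage=max)
```

The returned history is the binary merge tree: each entry names the two clusters that were joined and the identifier of the new cluster. The loop recomputes all pairwise similarities at every step, which is simple but slow for large collections.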

Distance Metrics

• Euclidean distance ($L_2$ norm)

  $$L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

  – Make sure that all attributes/dimensions have the same scale (or the same variance)

• $L_1$ norm (city-block distance)

  $$L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$$

• Cosine similarity (transformed to a distance by subtracting it from 1)

  $$1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$$

  – Ranges between 0 and 1 when all vector components are non-negative (e.g., term-weight vectors)
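As a small illustration, the three measures can be written directly from their definitions. The function names are chosen for this sketch only, and vectors are assumed to be equal-length sequences of numbers.

```python
import math

def l2_distance(x, y):
    # Euclidean distance: square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    # City-block distance: summed absolute differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    # One minus the cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms

d2 = l2_distance([0, 0], [3, 4])
d1 = l1_distance([0, 0], [3, 4])
dc = cosine_distance([1, 0], [0, 1])
```

Note that the cosine distance ignores vector length: `[1, 2]` and `[2, 4]` are at distance (numerically close to) zero, whereas their Euclidean distance is not.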

Measures of Cluster Similarity

• Especially for the bottom-up approaches

• Single-link clustering
  – The similarity between two clusters is the similarity of the two closest objects in the clusters
  – Search over all pairs of objects that come from the two different clusters and select the pair with the greatest similarity

    $$sim(c_i, c_j) = \max_{\vec{x} \in c_i,\ \vec{y} \in c_j} sim(\vec{x}, \vec{y})$$

  – Elongated clusters are obtained (cf. the minimal spanning tree)

[Figure: clusters $C_i$ and $C_j$ linked by the pair with the greatest similarity]

Measures of Cluster Similarity (cont.)

• Complete-link clustering
  – The similarity between two clusters is the similarity of their two most dissimilar members

    $$sim(c_i, c_j) = \min_{\vec{x} \in c_i,\ \vec{y} \in c_j} sim(\vec{x}, \vec{y})$$

  – Sphere-shaped clusters are obtained
  – Preferable for most IR and NLP applications

[Figure: clusters $C_i$ and $C_j$ linked by the pair with the least similarity]

Measures of Cluster Similarity (cont.)

[Figure: the same data clustered with single link and with complete link]

Measures of Cluster Similarity (cont.)

• Group-average agglomerative clustering
  – A compromise between single-link and complete-link clustering
  – The similarity between two clusters is the average similarity between members
  – If the objects are represented as length-normalized vectors and the similarity measure is the cosine, there exists a fast algorithm for computing the average similarity

    $$sim(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x} \cdot \vec{y} \quad \text{(length-normalized vectors)}$$

[Figure: clusters $C_i$ and $C_j$]

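Measures of Cluster Similarity (cont.)

• Group-average agglomerative clustering (cont.)
  – The average similarity SIM between vectors in a cluster $c_j$ is defined as

    $$SIM(c_j) = \frac{1}{|c_j|(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} sim(\vec{x}, \vec{y}) = \frac{1}{|c_j|(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \vec{x} \cdot \vec{y}$$

  – The sum of the members in a cluster $c_j$

    $$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$

  – Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$

    $$\begin{aligned}
    \vec{s}(c_j) \cdot \vec{s}(c_j) &= \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} \\
    &= |c_j|(|c_j| - 1)\, SIM(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} \\
    &= |c_j|(|c_j| - 1)\, SIM(c_j) + |c_j| \qquad (\vec{x} \cdot \vec{x} = 1 \text{ for a length-normalized vector}) \\
    \therefore\ SIM(c_j) &= \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|(|c_j| - 1)}
    \end{aligned}$$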

Measures of Cluster Similarity (cont.)

• Group-average agglomerative clustering (cont.)
  – When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

    $$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$

  – The average similarity for their union will be

    $$SIM(c_i \cup c_j) = \frac{\left(\vec{s}(c_i) + \vec{s}(c_j)\right) \cdot \left(\vec{s}(c_i) + \vec{s}(c_j)\right) - \left(|c_i| + |c_j|\right)}{\left(|c_i| + |c_j|\right)\left(|c_i| + |c_j| - 1\right)}$$

[Figure: clusters $C_i$ and $C_j$ with their sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$]
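The shortcut can be checked numerically. The sketch below compares a brute-force average over all ordered pairs with the value obtained from the two cluster sum vectors alone; the helper names and the small 2-D example vectors are assumptions for illustration, and all vectors are length-normalized as the derivation requires.

```python
import math
from itertools import permutations

def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def avg_sim_bruteforce(cluster):
    # Average of x . y over all ordered pairs of distinct members.
    n = len(cluster)
    return sum(dot(x, y) for x, y in permutations(cluster, 2)) / (n * (n - 1))

def sum_vector(cluster):
    return [sum(col) for col in zip(*cluster)]

def merged_avg_sim(s_i, n_i, s_j, n_j):
    # SIM(c_i U c_j) from the sum vectors and sizes only.
    s_new = [a + b for a, b in zip(s_i, s_j)]
    size = n_i + n_j
    return (dot(s_new, s_new) - size) / (size * (size - 1))

c_i = [normalize([1, 0]), normalize([1, 1])]
c_j = [normalize([0, 1]), normalize([1, 2])]
brute = avg_sim_bruteforce(c_i + c_j)
fast = merged_avg_sim(sum_vector(c_i), len(c_i), sum_vector(c_j), len(c_j))
```

The brute-force version touches every pair of members, while the sum-vector version needs one vector addition and one dot product per candidate merge, which is what makes group-average clustering practical.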

Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values
  – E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of 22 words. "be" has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity.]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the number of clusters) is met

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont.)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure

• How to split a cluster
  – Splitting is itself a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont.)

• Algorithm

[Figure: divisive clustering pseudocode. Annotations: split the least coherent cluster; generate two new clusters and remove the original one.]


Non-Hierarchical Clustering

Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  – When to stop (e.g., judged by group-average similarity, likelihood, or mutual information)
  – What is the right number of clusters ($k-1 \rightarrow k \rightarrow k+1$); hierarchical clustering also has to face this problem

• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Define clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)

      $$F: \mathcal{X} = \{\vec{x}^t\}_{t=1}^{N} \longrightarrow \text{index } j \in \{1, \dots, k\}, \quad \text{with reference vectors } \{\vec{m}_j\}_{j=1}^{k}$$

    • $\vec{m}_j$ is called the reference vector, code vector, code word, or cluster centroid
  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., $Dim(\vec{x}^t) = 24 \rightarrow k = 2^8$
    • A compression rate of 3

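The K-means Algorithm (cont.)

• Total reconstruction error

  $$E\left(\{\vec{m}_i\}_{i=1}^{k} \mid \mathcal{X}\right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \left\| \vec{x}^t - \vec{m}_i \right\|^2, \quad \text{where } b_i^t = \begin{cases} 1 & \text{if } \left\| \vec{x}^t - \vec{m}_i \right\| = \min_j \left\| \vec{x}^t - \vec{m}_j \right\| \\ 0 & \text{otherwise} \end{cases}$$

  – The labels $b_i^t$ and the centers $\vec{m}_i$ are unknown
  – $b_i^t$ depends on $\vec{m}_i$, and this optimization problem cannot be solved analytically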

The K-means Algorithm (cont.)

• Initialization
  – A set of initial cluster centers $\{\vec{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  – Assign each object $\vec{x}^t$ to the cluster whose center is closest

    $$b_i^t = \begin{cases} 1 & \text{if } \left\| \vec{x}^t - \vec{m}_i \right\| = \min_j \left\| \vec{x}^t - \vec{m}_j \right\| \\ 0 & \text{otherwise} \end{cases}$$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

    $$\vec{m}_i = \frac{\sum_{t=1}^{N} b_i^t\, \vec{x}^t}{\sum_{t=1}^{N} b_i^t}$$

    • Alternatively, the medoid can be used as the cluster center (a medoid is one of the objects in the cluster)

• These two steps are repeated until $\vec{m}_i$ stabilizes

The K-means Algorithm (cont.)

• Algorithm

[Figure: K-means pseudocode]
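Since the pseudocode figure did not survive, here is a minimal Python sketch of the two alternating steps. It is an illustration under stated assumptions, not the slide's own listing: the function name `kmeans`, the `max_iter` cap, and the choice to leave an empty cluster's center unchanged are all choices made for this example.

```python
def kmeans(points, centers, max_iter=100):
    """Plain K-means on tuples/lists of numbers; returns (centers, labels)."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each object goes to the closest center (b_i^t = 1).
        labels = [
            min(range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))
            for x in points
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # assumption: keep an empty cluster's center
        if new_centers == centers:  # centers have stabilized
            break
        centers = new_centers
    return centers, labels

points = [(0, 0), (0, 1), (5, 5), (6, 5)]
centers, labels = kmeans(points, centers=[(0, 0), (5, 5)])
```

On this toy input the first pass already assigns the two left points to one cluster and the two right points to the other, and the second pass confirms that the centers no longer move.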

The K-means Algorithm (cont.)

• Example 1

[Figure: K-means example 1]

The K-means Algorithm (cont.)

• Example 2

[Figure: K-means example 2, with clusters labeled government, finance, sports, research, and name]

The K-means Algorithm (cont.)

• The choice of initial cluster centers (seeds) $\vec{m}_i$ is important
  – Pick at random
  – Calculate the mean $\vec{m}$ of all data and generate k initial centers by adding a small random vector to the mean: $\vec{m}_i = \vec{m} \pm \vec{\delta}$
  – Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $\vec{m}_i$
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont.)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb the objects slightly

• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an $l$-dimensional space/hypercube, with $l = \log_2 k$ ($k$ clusters)

[Figure: the clusters as nodes on the hypercube, followed by a linear classifier]

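The K-means Algorithm (cont.)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray

[Figure: the global mean is split into the cluster 1 mean and the cluster 2 mean, and further into components $\{\mu_{11}, \Sigma_{11}, \omega_{11}\}$, $\{\mu_{12}, \Sigma_{12}, \omega_{12}\}$, $\{\mu_{13}, \Sigma_{13}, \omega_{13}\}$, $\{\mu_{14}, \Sigma_{14}, \omega_{14}\}$; the number of clusters grows as $M \rightarrow 2M$ at each iteration]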

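The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

[Figure: a mixture Gaussian HMM (or a mixture of Gaussians): $\vec{x}_i$ is scored by $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \dots, P(\vec{x}_i \mid c_K)$, weighted by $\pi_1 = P(c_1), \pi_2 = P(c_2), \dots, \pi_K = P(c_K)$, and summed]

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)$$

• Continuous case

$$P(\vec{x}_i \mid c_k; \Theta) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$$

• Likelihood function for the data samples $\mathbf{X} = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)

$$P(\mathbf{X} \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k; \Theta)\, P(c_k \mid \Theta)$$

• Classification

$$\max_k P(c_k \mid \vec{x}_i, \Theta) = \max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$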


The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment

[Figure: four balls, two black (B) and two white (W), all assigned to state S1]

  P(B | S1) = 2/4 = 0.5

  P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

  Ball    State S1   State S2
  1 (B)   0.7        0.3
  2 (W)   0.4        0.6
  3 (B)   0.9        0.1
  4 (W)   0.5        0.5

  P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64

  P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36

  P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 = 0.27

  P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 = 0.73
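The soft-assignment estimates can be reproduced with a few lines of Python. The sketch assumes, as the numerators of the formulas imply, that the first and third balls are black and the second and fourth are white; the function name `soft_mle` is made up for this example.

```python
colors = ["B", "W", "B", "W"]
# Each row holds one ball's membership weights for (S1, S2).
weights = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]

def soft_mle(colors, weights, state):
    """Weighted relative frequency of each color in the given state."""
    total = sum(w[state] for w in weights)
    return {
        c: sum(w[state] for col, w in zip(colors, weights) if col == c) / total
        for c in sorted(set(colors))
    }

p_s1 = soft_mle(colors, weights, 0)  # distribution over colors for S1
p_s2 = soft_mle(colors, weights, 1)  # distribution over colors for S2
```

Hard assignment is the special case in which every weight is 0 or 1, so the same function then reduces to ordinary counting.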

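The EM Algorithm (cont.)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$\begin{aligned}
P(\mathbf{X} \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \dots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \dots \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \dots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \dots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{\mathbf{C}} P(\mathbf{X}, \mathbf{C} \mid \Theta)
\end{aligned}$$

  where $\mathbf{X} = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_{n-1}, \vec{x}_n\}$ and $\mathbf{C} = \{c_{k_1}, c_{k_2}, \dots, c_{k_{n-1}}, c_{k_n}\}$.

  – $P(\mathbf{X} \mid \Theta)$ is the likelihood function; $P(\mathbf{X}, \mathbf{C} \mid \Theta)$ is the complete data likelihood function
  – How many kinds of $\mathbf{C}$ are there? ($K^n$ kinds)

• Note

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \dots + a_{1M})(a_{21} + a_{22} + \dots + a_{2M}) \cdots (a_{T1} + a_{T2} + \dots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$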
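The product-of-sums identity in the note, and the count of $M^T$ label sequences, can be verified on a tiny example. The matrix `a` below is arbitrary test data, not taken from the slides.

```python
import math
from itertools import product

# a[t][k] for T = 3 positions and M = 2 choices per position.
a = [[1, 2], [3, 4], [5, 6]]
T, M = len(a), len(a[0])

# Left-hand side: product over t of the sum over k.
lhs = math.prod(sum(row) for row in a)

# Right-hand side: sum over every label sequence (k_1, ..., k_T) of the product.
sequences = list(product(range(M), repeat=T))
rhs = sum(math.prod(a[t][k] for t, k in enumerate(ks)) for ks in sequences)
```

Each element of `sequences` plays the role of one complete assignment $\mathbf{C}$, which is why summing the complete data likelihood over all $\mathbf{C}$ recovers the ordinary likelihood.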

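The EM Algorithm (cont.)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $\mathbf{C}$, conditioned on the known data $(\mathbf{X}, \Theta)$; here $\Theta$ is known and $\hat{\Theta}$ is unknown

    $$\Phi(\Theta, \hat{\Theta}) = E_{\mathbf{C}}[L_{CM}] = E_{\mathbf{C}}\left[ \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) \,\middle|\, \mathbf{X}, \Theta \right] = \sum_{\mathbf{C}} P(\mathbf{C} \mid \mathbf{X}, \Theta) \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) = \sum_{\mathbf{C}} \frac{P(\mathbf{X}, \mathbf{C} \mid \Theta)}{P(\mathbf{X} \mid \Theta)} \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta})$$

  – Maximize the log-likelihood function $\log P(\mathbf{X} \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

      $$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(\mathbf{X} \mid \hat{\Theta}) - \log P(\mathbf{X} \mid \Theta)$$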

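The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{\mathbf{C}} \frac{P(\mathbf{X}, \mathbf{C} \mid \Theta)}{P(\mathbf{X} \mid \Theta)} \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) \\
&= \sum_{\mathbf{C}} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \right] \left[ \log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{\mathbf{C}} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left\{ \left[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i, k} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \right] \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}$$

  where

  $$\delta_{k_i, k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}$$

  and the term in braces in the fifth line equals $P(c_k \mid \vec{x}_i, \Theta)$ (see the next slide).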

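The EM Algorithm (cont.)

  – Note that

$$\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{\substack{j=1 \\ j \neq i}}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{\substack{j=1 \\ j \neq i}}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{\substack{j=1 \\ j \neq i}}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}$$

  – In the first step, $\delta_{k_i, k}$ means that $\vec{x}_i$ can only be aligned to $c_k$
  – The second step uses the identity

    $$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \dots + a_{1M})(a_{21} + a_{22} + \dots + a_{2M}) \cdots (a_{T1} + a_{T2} + \dots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$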

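The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

    $$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

    where

    $$\begin{aligned}
    \Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) \\
    &= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
    &= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
    \end{aligned}$$

    is the auxiliary function for the mixture weights, and

    $$\begin{aligned}
    \Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
    &= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
    \end{aligned}$$

    is the auxiliary function for the cluster distributions.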

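The EM Algorithm (cont.)

• M-step (Maximization)
  – Remember that a function $F$ can be maximized under a constraint by applying a Lagrange multiplier
    • Suppose that

      $$F = \sum_{j=1}^{N} w_j \log y_j, \qquad \text{with the constraint } \sum_{j=1}^{N} y_j = 1$$

    • By applying a Lagrange multiplier $\ell$

      $$\begin{aligned}
      \hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
      \frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \ \Rightarrow\ w_j = -\ell\, y_j \quad \forall j \\
      &\Rightarrow \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell \\
      \therefore\ y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
      \end{aligned}$$

    • Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$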

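The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians)

    $$\hat{\Phi}_a(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) = \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)$$

    $$\Rightarrow\ \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle \sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}} = \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

  – Each term of the final sum is the expected number of times that $\vec{x}_i$ falls in class $c_k$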

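The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

    $$P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2} |\hat{\Sigma}_k|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)$$

  – Let

    $$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

    and

    $$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

  – Then

    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

    where $D$ is a constant.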

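The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

    $$\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0$$

    $$\Rightarrow\ \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}$$

  – The derivation uses $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here
  – $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$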

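The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

    $$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

    $$\begin{aligned}
    \frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
    &\Rightarrow \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
    &\Rightarrow \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
    &\Rightarrow \hat{\Sigma}_k \sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
    &\Rightarrow \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
    \end{aligned}$$

  – The derivation uses the following identities, with $\hat{\Sigma}_k$ symmetric here

    $$\frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \qquad \frac{d\,(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-T} \vec{a}\, \vec{b}^T X^{-T}$$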

The EM Algorithm (cont.)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(\mathbf{X} \mid \Theta)$ has converged or the maximum number of iterations is reached
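The E-step and M-step formulas can be put together in a short sketch for the one-dimensional case ($m = 1$), where each covariance matrix reduces to a scalar variance. This is an illustrative implementation under assumptions: the function names, the hand-picked initial parameters (standing in for a K-means initialization), the convergence tolerance, and the small variance floor are all choices made for this example.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, variances, priors, max_iter=200, tol=1e-9):
    """EM for a 1-D Gaussian mixture; returns (mus, variances, priors, log_likelihood)."""
    mus, variances, priors = list(mus), list(variances), list(priors)
    n, K = len(xs), len(mus)
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E-step: posteriors w[i][k] = P(c_k | x_i, Theta), plus the log-likelihood.
        w, ll = [], 0.0
        for x in xs:
            joint = [priors[k] * gaussian_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(joint)
            ll += math.log(total)
            w.append([j / total for j in joint])
        if ll - prev_ll < tol:  # the likelihood has converged
            break
        prev_ll = ll
        # M-step: re-estimate priors, means, and variances from the posteriors.
        for k in range(K):
            wk = sum(w[i][k] for i in range(n))
            priors[k] = wk / n
            mus[k] = sum(w[i][k] * xs[i] for i in range(n)) / wk
            var = sum(w[i][k] * (xs[i] - mus[k]) ** 2 for i in range(n)) / wk
            variances[k] = max(var, 1e-6)  # assumption: floor to avoid a collapsed variance
    return mus, variances, priors, ll

xs = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
mus, variances, priors, ll = em_gmm_1d(xs, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
```

On this well-separated toy data the posteriors quickly become almost hard assignments, so the estimated means approach the two group averages and the priors approach one half each.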

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

    $$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]$$

    $$P(T_l \mid Y_k) = \frac{E(T_l, T_k)}{\sum_{s=1}^{K} E(T_s, T_k)}, \qquad E(T_l, T_k) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{dist(T_l, T_k)^2}{2\sigma^2} \right]$$

    $$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by their distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]

Hierarchical Document Organization (cont.)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

  $$L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}$$

  – EM training can be performed

    $$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

    $$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

    where

    $$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid Y_{k'}) \right] P(T_{k'} \mid D_i) \right\}}$$

Hierarchical Document Organization (cont.)

• Criterion for topic word selection

  $$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}$$

Hierarchical Document Organization (cont.)

• Example

[Figure: example of the organized topic map]

Hierarchical Document Organization (cont.)

• Example (cont.)

[Figure: example of the organized topic map, continued]

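Hierarchical Document Organization (cont.)

• Self-Organization Map (SOM)
  – A recursive regression process

    $$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}), i}(t) \left[ \vec{x}(t) - \vec{m}_i(t) \right]$$

    $$c(\vec{x}) = \arg\min_i \left\| \vec{x} - \vec{m}_i \right\|, \quad \text{where } \left\| \vec{x} - \vec{m}_i \right\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

    $$h_{c(\vec{x}), i}(t) = \alpha(t) \exp\left( -\frac{\left\| \vec{r}_i - \vec{r}_{c(\vec{x})} \right\|^2}{2\sigma^2(t)} \right)$$

[Figure: an input layer holding the input vector $\vec{x} = [x_1, x_2, \dots, x_n]^T$ and a mapping layer whose units carry weight vectors $\vec{m}_i = [m_{i,1}, m_{i,2}, \dots, m_{i,n}]^T$, e.g., $\vec{m}_1 = [m_{1,1}, m_{1,2}, \dots, m_{1,n}]^T$; the update moves $\vec{m}_i$ along $\vec{x} - \vec{m}_i$ to a new position $\vec{m}_i'$]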
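One SOM update step can be sketched as follows. The function name `som_step`, the argument layout, and the fixed `alpha` and `sigma` values (which stand for $\alpha(t)$ and $\sigma(t)$ at a single time step) are assumptions made for this illustration.

```python
import math

def som_step(weights, positions, x, alpha, sigma):
    """One SOM update; weights[i] is m_i and positions[i] is r_i on the map."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # Best-matching unit: c(x) = argmin_i ||x - m_i||.
    c = min(range(len(weights)), key=lambda i: sqdist(x, weights[i]))
    new_weights = []
    for m_i, r_i in zip(weights, positions):
        # Neighborhood function: decays with map distance from the winner.
        h = alpha * math.exp(-sqdist(r_i, positions[c]) / (2 * sigma ** 2))
        # Move m_i toward x by the fraction h.
        new_weights.append([m + h * (xi - m) for m, xi in zip(m_i, x)])
    return c, new_weights

weights = [[0.0, 0.0], [1.0, 1.0]]
positions = [(0, 0), (0, 1)]
c, new_weights = som_step(weights, positions, x=[1.0, 1.0], alpha=0.5, sigma=1.0)
```

Units near the winner on the map are pulled strongly toward the input, while distant units barely move; in practice both `alpha` and `sigma` would be decreased over time.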

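Hierarchical Document Organization (cont.)

• Results

  Model   Iterations   R_Dist = dist_Between / dist_Within
  TMM     10           1.9165
  TMM     20           2.0650
  TMM     30           1.9477
  TMM     40           1.9175
  SOM     100          2.0604

  $$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

  where

  $$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

  $$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

  $$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

  $$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$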
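The distance ratio can be computed as in the sketch below. It reads $T_{r,i}$ as the reference topic label of document $i$ and $(x_i, y_i)$ as the map position assigned to that document; both readings, along with the function name and the toy data, are assumptions made for this illustration.

```python
import math
from itertools import combinations

def dist_ratio(map_positions, reference_topics):
    """R_Dist: mean map distance of between-topic pairs over that of within-topic pairs."""
    between, within = [], []
    for i, j in combinations(range(len(map_positions)), 2):
        (xi, yi), (xj, yj) = map_positions[i], map_positions[j]
        d = math.hypot(xi - xj, yi - yj)  # dist_Map(i, j)
        if reference_topics[i] == reference_topics[j]:
            within.append(d)
        else:
            between.append(d)
    dist_between = sum(between) / len(between)
    dist_within = sum(within) / len(within)
    return dist_between / dist_within

positions = [(0, 0), (0, 1), (3, 0), (3, 1)]
topics = ["a", "a", "b", "b"]
r = dist_ratio(positions, topics)
```

A larger ratio means that documents sharing a reference topic sit closer together on the map than documents from different topics. The sketch assumes at least one pair of each kind exists.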

ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice


Hierarchical Agglomerative Clustering (HAC)

• A bottom-up approach
• Assume a similarity measure for determining the similarity of two objects
• Start with each object in a separate cluster, then repeatedly join the two most similar clusters until only one cluster survives
• The history of merging/clustering forms a binary tree or hierarchy

HAC (cont)

• Algorithm [figure: HAC pseudocode not recovered; surviving annotations:]
  – cluster number
  – Initialization (for tree leaves): each object is a cluster
  – merged as a new cluster
  – The original two clusters are removed
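The merge loop described above can be sketched in a few lines. This is a minimal illustration, not the slide's own pseudocode: the function name `hac`, the `sim` callback, and the `linkage` parameter are assumptions of this sketch (`linkage=max` compares clusters by their most similar pair, `linkage=min` by their least similar pair).

```python
from itertools import combinations

def hac(objects, sim, linkage=max):
    """Return the merge history of bottom-up clustering.

    sim(a, b) is a user-supplied similarity (hypothetical callback);
    linkage aggregates pairwise similarities between two clusters.
    """
    clusters = [(i,) for i in range(len(objects))]  # leaves: one object each
    history = []

    def cluster_sim(pair):
        a, b = pair
        return linkage(sim(objects[i], objects[j]) for i in a for j in b)

    while len(clusters) > 1:
        # pick the two clusters with the greatest similarity
        a, b = max(combinations(clusters, 2), key=cluster_sim)
        clusters.remove(a)      # the original two clusters are removed
        clusters.remove(b)
        clusters.append(a + b)  # and merged as a new cluster
        history.append((a, b))
    return history

points = [0.0, 1.0, 10.0]
history = hac(points, sim=lambda x, y: -abs(x - y))
```

The returned history lists the merges in order, which is enough to rebuild the binary tree. The exhaustive pair search is quadratic per merge and is only meant to make the steps explicit.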

Distance Metrics

• Euclidean distance (L2 norm)
$$L_2(\vec{x},\vec{y})=\sqrt{\sum_{i=1}^{m}(x_i-y_i)^2}$$
  – Make sure that all attributes/dimensions have the same scale (or the same variance)
• L1 norm (city-block distance)
$$L_1(\vec{x},\vec{y})=\sum_{i=1}^{m}|x_i-y_i|$$
• Cosine similarity (transform to a distance by subtracting from 1)
$$1-\frac{\vec{x}\cdot\vec{y}}{|\vec{x}|\,|\vec{y}|}$$
  – Ranges between 0 and 1 (for vectors with non-negative components)
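As a small illustration of the three measures above, the following sketch implements them for plain Python sequences. The function names are chosen here for readability and are not taken from the slides.

```python
import math

def l2_distance(x, y):
    # Euclidean distance: square root of the summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    # City-block distance: summed absolute differences
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between x and y
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm
```

For example, `l2_distance([0, 0], [3, 4])` is 5.0 while `l1_distance([0, 0], [3, 4])` is 7, and two orthogonal vectors have cosine distance 1.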

Measures of Cluster Similarity

• Especially for the bottom-up approaches
• Single-link clustering
  – The similarity between two clusters is the similarity of the two closest objects in the clusters
  – Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity
  – Elongated clusters are achieved (cf. the minimal spanning tree)
$$sim(c_i,c_j)=\max_{\vec{x}\in c_i,\ \vec{y}\in c_j} sim(\vec{x},\vec{y})$$
[Figure: clusters Ci and Cj, linked by the pair with the greatest similarity]

Measures of Cluster Similarity (cont)

• Complete-link clustering
  – The similarity between two clusters is the similarity of their two most dissimilar members
  – Sphere-shaped clusters are achieved
  – Preferable for most IR and NLP applications
$$sim(c_i,c_j)=\min_{\vec{x}\in c_i,\ \vec{y}\in c_j} sim(\vec{x},\vec{y})$$
[Figure: clusters Ci and Cj, linked by the pair with the least similarity]
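The single-link and complete-link measures differ only in whether the best or the worst cross-cluster pair is kept. A minimal sketch, assuming clusters are lists of vectors and using the dot product as a stand-in similarity (both are assumptions of this example):

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def single_link(ci, cj, sim=dot):
    # similarity of the two closest objects across the clusters
    return max(sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj, sim=dot):
    # similarity of the two most dissimilar objects across the clusters
    return min(sim(x, y) for x in ci for y in cj)

ci = [(1.0, 0.0)]
cj = [(1.0, 0.0), (0.0, 1.0)]
```

On this toy pair of clusters the single-link similarity is 1.0 (the identical vectors) and the complete-link similarity is 0.0 (the orthogonal ones).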

Measures of Cluster Similarity (cont)

[Figure: clustering results with single link and with complete link]

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering
  – A compromise between single-link and complete-link clustering
  – The similarity between two clusters is the average similarity between members
  – If the objects are represented as length-normalized vectors and the similarity measure is the cosine
    • There exists a fast algorithm for computing the average similarity
$$sim(\vec{x},\vec{y})=\cos(\vec{x},\vec{y})=\frac{\vec{x}\cdot\vec{y}}{|\vec{x}|\,|\vec{y}|}=\vec{x}\cdot\vec{y}\qquad\text{(length-normalized vectors)}$$
[Figure: clusters Ci and Cj]

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)
  – The average similarity SIM between vectors in a cluster $c_j$ is defined as
$$SIM(c_j)=\frac{1}{|c_j|(|c_j|-1)}\sum_{\vec{x}\in c_j}\ \sum_{\vec{y}\in c_j,\ \vec{y}\neq\vec{x}}sim(\vec{x},\vec{y})=\frac{1}{|c_j|(|c_j|-1)}\sum_{\vec{x}\in c_j}\ \sum_{\vec{y}\in c_j,\ \vec{y}\neq\vec{x}}\vec{x}\cdot\vec{y}$$
  – The sum of members in a cluster $c_j$
$$\vec{s}(c_j)=\sum_{\vec{x}\in c_j}\vec{x}$$
  – Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$
$$\begin{aligned}
\vec{s}(c_j)\cdot\vec{s}(c_j)&=\sum_{\vec{x}\in c_j}\vec{x}\cdot\vec{s}(c_j)=\sum_{\vec{x}\in c_j}\sum_{\vec{y}\in c_j}\vec{x}\cdot\vec{y}\\
&=|c_j|(|c_j|-1)\,SIM(c_j)+\sum_{\vec{x}\in c_j}\vec{x}\cdot\vec{x}\\
&=|c_j|(|c_j|-1)\,SIM(c_j)+|c_j|\qquad(\vec{x}\cdot\vec{x}=1\text{ for a length-normalized vector})\\
\therefore\ SIM(c_j)&=\frac{\vec{s}(c_j)\cdot\vec{s}(c_j)-|c_j|}{|c_j|(|c_j|-1)}
\end{aligned}$$
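The identity between the pairwise average and the sum-vector form can be checked numerically. The sketch below compares a direct double loop with the closed form; the helper names and the three sample vectors are illustrative assumptions.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def avg_sim_naive(cluster):
    # average of x.y over all ordered pairs of distinct members
    n = len(cluster)
    total = sum(dot(x, y)
                for i, x in enumerate(cluster)
                for j, y in enumerate(cluster) if i != j)
    return total / (n * (n - 1))

def avg_sim_fast(cluster):
    # (s.s - |c|) / (|c| (|c| - 1)), valid for length-normalized members
    n = len(cluster)
    s = [sum(col) for col in zip(*cluster)]
    return (dot(s, s) - n) / (n * (n - 1))

cluster = [normalize(v) for v in ([1, 2, 0], [2, 1, 1], [0, 1, 3])]
```

Both functions return the same value up to rounding, but the second needs only one pass over the members instead of a pass over all pairs.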

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)
  – When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance
$$\vec{s}(c_{New})=\vec{s}(c_i)+\vec{s}(c_j),\qquad |c_{New}|=|c_i|+|c_j|$$
  – The average similarity for their union will be
$$SIM(c_i\cup c_j)=\frac{\left(\vec{s}(c_i)+\vec{s}(c_j)\right)\cdot\left(\vec{s}(c_i)+\vec{s}(c_j)\right)-\left(|c_i|+|c_j|\right)}{\left(|c_i|+|c_j|\right)\left(|c_i|+|c_j|-1\right)}$$
[Figure: clusters Ci and Cj with their sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$]
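Because only the sum vector and the size of each cluster are needed, a merge can be scored without revisiting the members. A minimal sketch under the same length-normalization assumption; the function name `merged_avg_sim` and the toy clusters are hypothetical.

```python
def merged_avg_sim(s_i, n_i, s_j, n_j):
    """Average similarity of the union of two clusters, given their
    sum vectors (s_i, s_j) and sizes (n_i, n_j)."""
    s_new = [a + b for a, b in zip(s_i, s_j)]
    n_new = n_i + n_j
    sim = (sum(a * a for a in s_new) - n_new) / (n_new * (n_new - 1))
    return s_new, n_new, sim

# c_i = {(1, 0)} and c_j = {(0, 1), (1, 0)}, all unit-length vectors
s_new, n_new, sim = merged_avg_sim([1.0, 0.0], 1, [1.0, 1.0], 2)
```

Among the three members only one unordered pair has similarity 1 and the other two have similarity 0, so the average over the six ordered pairs is 1/3, which is what the closed form returns.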

Example Word Clustering

• Words (objects) are described and clustered using a set of features and values
  – E.g., the left and right neighbors of tokens of words
[Figure: dendrogram of 22 words. “be” has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity]

Divisive Clustering

• A top-down approach
• Start with all objects in a single cluster
• At each iteration, select the least coherent cluster and split it
• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved
• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure
• How to split a cluster
  – Splitting is also a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm [figure: divisive clustering pseudocode not recovered; surviving annotations:]
  – split the least coherent cluster
  – Generate two new clusters and remove the original one
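The top-down loop can be sketched as follows. This is an illustration under stated assumptions, not the slide's pseudocode: coherence is measured here by the mean squared distance to the centroid (the slides suggest single-link, complete-link, or group-average measures instead), and the split is a small two-center refinement seeded with two far-apart points. All function names are hypothetical.

```python
def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def incoherence(points):
    # assumed coherence measure: mean squared distance to the centroid
    m = centroid(points)
    return sum(sq_dist(p, m) for p in points) / len(points)

def split_in_two(points, passes=10):
    # seed with the first point and the point farthest from it
    seeds = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    halves = [list(points), []]
    for _ in range(passes):
        trial = [[], []]
        for p in points:
            trial[int(sq_dist(p, seeds[1]) < sq_dist(p, seeds[0]))].append(p)
        if not trial[0] or not trial[1]:
            break
        halves = trial
        seeds = [centroid(h) for h in halves]
    return halves

def divisive(points, k):
    clusters = [list(points)]           # start with a single cluster
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        worst = max(candidates, key=incoherence)  # least coherent cluster
        a, b = split_in_two(worst)
        if not b:                       # cannot be split any further
            break
        clusters.remove(worst)          # remove the original cluster
        clusters.extend([a, b])         # and keep the two new ones
    return clusters

result = divisive([(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)], k=3)
```

Stopping at a target cluster count is one possible predefined criterion; a threshold on the incoherence score would work in the same place.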

Non-Hierarchical Clustering


Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)
• Problems associated with non-hierarchical clustering
  – When to stop (candidate measures: group-average similarity, likelihood, mutual information)
  – What is the right number of clusters (k-1 → k → k+1); hierarchical clustering also has to face this problem
• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Define clusters by the center of mass of their members
• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)
$$X=\{\vec{x}^t\}_{t=1}^{N}\ \longrightarrow\ F=\{\vec{m}_j\}_{j=1}^{k}$$
    • $j$ is the index; $\vec{m}_j$ is the cluster centroid, also called the reference vector, code vector, or code word
  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., $Dim(\vec{x}^t)=24$ → $k=2^8$
    • A compression rate of 3
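To make the quantization step concrete, the sketch below maps a vector to the index of its nearest code word and reproduces the compression-rate arithmetic. The three-entry codebook is a made-up toy (a real 8-bit palette would have 256 entries), and `quantize` is a name chosen for this example.

```python
def quantize(x, codebook):
    # index j of the code word m_j closest to x (squared Euclidean distance)
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codebook[j])))

codebook = [(0, 0, 0), (255, 0, 0), (0, 255, 0)]   # toy palette
index = quantize((250, 10, 10), codebook)          # a reddish pixel

bits_before, bits_after = 24, 8
compression_rate = bits_before / bits_after        # 24 / 8
palette_size = 2 ** bits_after                     # number of code words
```

Only the index is stored per pixel, so the storage cost drops from 24 bits to 8 bits per pixel once the codebook is shared.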

The K-means Algorithm (cont)

• Total reconstruction error
$$E\left(\{\vec{m}_i\}_{i=1}^{k}\mid X\right)=\sum_{t=1}^{N}\sum_{i=1}^{k}b_i^t\,\|\vec{x}^t-\vec{m}_i\|^2,\qquad\text{where } b_i^t=\begin{cases}1&\text{if }\|\vec{x}^t-\vec{m}_i\|=\min_j\|\vec{x}^t-\vec{m}_j\|\\0&\text{otherwise}\end{cases}$$
  – $b_i^t$ (the label) and $\vec{m}_i$ are unknown
  – $b_i^t$ depends on $\vec{m}_i$, and this optimization problem cannot be solved analytically

The K-means Algorithm (cont)

• Initialization
  – A set of initial cluster centers $\{\vec{m}_i\}_{i=1}^{k}$ is needed
• Recursion
  – Assign each object $\vec{x}^t$ to the cluster whose center is closest
$$b_i^t=\begin{cases}1&\text{if }\|\vec{x}^t-\vec{m}_i\|=\min_j\|\vec{x}^t-\vec{m}_j\|\\0&\text{otherwise}\end{cases}$$
  – Then re-compute the center of each cluster as the centroid or mean (average) of its members
$$\vec{m}_i=\frac{\sum_{t=1}^{N}b_i^t\,\vec{x}^t}{\sum_{t=1}^{N}b_i^t}$$
    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)
  – These two steps are repeated until $\vec{m}_i$ stabilizes
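The two alternating steps translate almost directly into code. The following is a minimal pure-Python sketch, assuming the initial centers are passed in by the caller; ties are broken in favor of the lowest cluster index and an empty cluster keeps its old center, both of which are choices made for this example only.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(data, centers, max_iter=100):
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        # assignment step: label each object with its closest center
        labels = [min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
                  for x in data]
        # update step: each center becomes the mean of its members
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, lab in zip(data, labels) if lab == i]
            new_centers.append([sum(col) / len(members) for col in zip(*members)]
                               if members else old)
        if new_centers == centers:   # centers have stabilized
            break
        centers = new_centers
    return centers, labels

def reconstruction_error(data, centers, labels):
    # total squared distance of each object to its assigned center
    return sum(sq_dist(x, centers[lab]) for x, lab in zip(data, labels))

data = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, labels = kmeans(data, centers=[(0, 0), (10, 0)])
error = reconstruction_error(data, centers, labels)
```

On this toy data the centers settle at (0, 1) and (10, 1) after one update, and every point lies at squared distance 1 from its center.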

The K-means Algorithm (cont)

• Algorithm [figure: K-means pseudocode, not recovered]

The K-means Algorithm (cont)

• Example 1 [figure not recovered]

The K-means Algorithm (cont)

• Example 2 [figure not recovered; cluster labels: government, finance, sports, research, name]

The K-means Algorithm (cont)

• Choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean $\vec{m}$ of all data and generate k initial centers $\vec{m}_i$ by adding a small random vector to the mean: $\vec{m}\pm\vec{\delta}$
  – Project data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of data in each group as the initial center $\vec{m}_i$
  – Or use another method, such as a hierarchical clustering algorithm on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set
• Poor seeds will result in sub-optimal clustering
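One of the seeding strategies listed above, perturbing the global mean, is easy to sketch. The perturbation scale, the uniform noise, and the fixed random seed are assumptions of this example rather than values given in the slides.

```python
import random

def seeds_around_mean(data, k, scale=0.01, seed=0):
    """Generate k initial centers as the data mean plus a small random vector."""
    rng = random.Random(seed)
    mean = [sum(col) / len(data) for col in zip(*data)]
    return [[m + rng.uniform(-scale, scale) for m in mean] for _ in range(k)]

seeds = seeds_around_mean([(0, 0), (2, 4)], k=3)
```

Each seed stays within `scale` of the mean (1, 2) in every dimension, so the first assignment pass decides how the centers spread out.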

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly
• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with $l=\log_2 k$ for k clusters
    • [Figure: nodes on the hypercube; a linear classifier]

The K-means Algorithm (cont)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray
  – M → 2M at each iteration
[Figure: the global mean, the Cluster 1 mean and the Cluster 2 mean, and four components {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}]

The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be the member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions
[Figure: a mixture of Gaussians (a mixture Gaussian HMM): $\vec{x}_i$ is generated by the component densities $P(\vec{x}_i|c_1),P(\vec{x}_i|c_2),\ldots,P(\vec{x}_i|c_K)$, summed with weights $\pi_1=P(c_1),\ \pi_2=P(c_2),\ \ldots,\ \pi_K=P(c_K)$]
$$P(\vec{x}_i|\Theta)=\sum_{k=1}^{K}P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)$$
  – Continuous case
$$P(\vec{x}_i|c_k,\Theta)=\frac{1}{(2\pi)^{m/2}|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^T\Sigma_k^{-1}(\vec{x}_i-\vec{\mu}_k)\right)$$
  – Likelihood function for data samples $X=\{\vec{x}_1,\vec{x}_2,\ldots,\vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)
$$P(X|\Theta)=\prod_{i=1}^{n}P(\vec{x}_i|\Theta)=\prod_{i=1}^{n}\sum_{k=1}^{K}P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)$$
  – Classification
$$\arg\max_k P(c_k|\vec{x}_i,\Theta)=\arg\max_k\frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{P(\vec{x}_i|\Theta)}=\arg\max_k P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)$$
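The mixture density, the i.i.d. log-likelihood, and the classification rule can be illustrated with a one-dimensional mixture. Restricting to scalar data with a variance per component is a simplification made for this sketch (the slides use multivariate Gaussians), and all function names are hypothetical.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x, weights, mus, variances):
    # P(x | Theta) = sum_k P(x | c_k, Theta) P(c_k | Theta)
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, mus, variances))

def log_likelihood(data, weights, mus, variances):
    # log of the product over i.i.d. samples
    return sum(math.log(mixture_pdf(x, weights, mus, variances)) for x in data)

def posteriors(x, weights, mus, variances):
    # P(c_k | x, Theta) for every component k
    joint = [w * gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, mus, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def classify(x, weights, mus, variances):
    post = posteriors(x, weights, mus, variances)
    return max(range(len(post)), key=post.__getitem__)

params = ([0.5, 0.5], [0.0, 5.0], [1.0, 1.0])  # weights, means, variances
```

With these toy parameters, a sample near 0 is assigned to the first component and a sample near 5 to the second, while the posteriors still give the soft membership of each sample in both clusters.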

The EM Algorithm (cont)


Maximum Likelihood Estimation

• Hard Assignment
  – State S1
P(B|S1) = 2/4 = 0.5
P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

Sample   State S1   State S2
B        0.7        0.3
W        0.4        0.6
B        0.9        0.1
W        0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27
P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
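The soft-assignment estimates are weighted relative frequencies and can be reproduced in a few lines. In this sketch the sample colors (B, W, B, W) are inferred from which weights appear in each numerator above, which is an assumption; `soft_mle` is a name chosen for the example.

```python
colors = ["B", "W", "B", "W"]
# (weight on S1, weight on S2) for each of the four samples
weights = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]

def soft_mle(colors, weights, state):
    """Estimate P(color | state) from fractional (soft) counts."""
    total = sum(w[state] for w in weights)
    return {c: sum(w[state] for col, w in zip(colors, weights) if col == c) / total
            for c in set(colors)}

p_s1 = soft_mle(colors, weights, state=0)
p_s2 = soft_mle(colors, weights, state=1)
```

Setting every weight to 0 or 1 turns the same function into the hard-assignment estimate, where each sample contributes a whole count to exactly one state.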

The EM Algorithm (cont)

• E–step (Expectation)
  – Derive the complete data likelihood function
$$\begin{aligned}
P(X|\Theta)&=\prod_{i=1}^{n}P(\vec{x}_i|\Theta)=\prod_{i=1}^{n}\sum_{k=1}^{K}P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)\\
&=\left[P(\vec{x}_1|c_1,\Theta)P(c_1|\Theta)+\cdots+P(\vec{x}_1|c_K,\Theta)P(c_K|\Theta)\right]\times\cdots\times\left[P(\vec{x}_n|c_1,\Theta)P(c_1|\Theta)+\cdots+P(\vec{x}_n|c_K,\Theta)P(c_K|\Theta)\right]\\
&=\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \prod_{i=1}^{n}P(\vec{x}_i|c_{k_i},\Theta)\,P(c_{k_i}|\Theta)\\
&=\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \prod_{i=1}^{n}P(\vec{x}_i,c_{k_i}|\Theta)\\
&=\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}P(\vec{x}_1,c_{k_1},\vec{x}_2,c_{k_2},\ldots,\vec{x}_n,c_{k_n}|\Theta)\\
&=\sum_{C}P(X,C|\Theta)
\end{aligned}$$
  – Here $X=\vec{x}_1,\vec{x}_2,\ldots,\vec{x}_{n-1},\vec{x}_n$ and $C=c_{k_1},c_{k_2},\ldots,c_{k_{n-1}},c_{k_n}$; $P(X|\Theta)$ is the likelihood function and $P(X,C|\Theta)$ is the complete data likelihood function
  – How many kinds of $C$? $K^n$ kinds
  – Note:
$$\prod_{t=1}^{T}\sum_{k=1}^{M}a_{tk}=(a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM})=\sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\ \prod_{t=1}^{T}a_{t k_t}$$
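The product-of-sums identity used in this derivation can be checked by brute force on a small array. The values of `a` below are arbitrary illustrative numbers, with T = 2 rows (samples) and M = 3 columns (clusters).

```python
import math
from itertools import product

a = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]   # a[t][k]
T, M = len(a), len(a[0])

# left-hand side: product over t of the sum over k
lhs = math.prod(sum(row) for row in a)

# right-hand side: sum over every alignment (k_1, ..., k_T) of the product
alignments = list(product(range(M), repeat=T))
rhs = sum(math.prod(a[t][k] for t, k in enumerate(ks)) for ks in alignments)
```

The enumeration visits M**T alignments, which mirrors the K**n possible cluster sequences C and shows why the sum over C is never expanded explicitly in practice.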

The EM Algorithm (cont)

• E–step (Expectation)
  – Define the auxiliary function $\Phi(\Theta,\hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X,\Theta)$
$$\Phi(\Theta,\hat{\Theta})=E\left[L_{CM}\right]=E_C\left[\log P(X,C|\hat{\Theta})\mid X,\Theta\right]=\sum_{C}P(C|X,\Theta)\log P(X,C|\hat{\Theta})=\sum_{C}\frac{P(X,C|\Theta)}{P(X|\Theta)}\log P(X,C|\hat{\Theta})$$
    • $\Theta$ is known and $\hat{\Theta}$ is unknown
  – Maximize the log likelihood function $\log P(X|\hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta,\hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model
$$\Phi(\Theta,\hat{\Theta})-\Phi(\Theta,\Theta)\le\log P(X|\hat{\Theta})-\log P(X|\Theta)$$

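The EM Algorithm (cont)

• E–step (Expectation)
  – The auxiliary function $\Phi(\Theta,\hat{\Theta})$
$$\begin{aligned}
\Phi(\Theta,\hat{\Theta})&=\sum_{C}\frac{P(X,C|\Theta)}{P(X|\Theta)}\log P(X,C|\hat{\Theta})\\
&=\sum_{C}\left[\prod_{j=1}^{n}\frac{P(\vec{x}_j,c_{k_j}|\Theta)}{P(\vec{x}_j|\Theta)}\right]\log\prod_{i=1}^{n}P(\vec{x}_i,c_{k_i}|\hat{\Theta})\\
&=\sum_{C}\left[\prod_{j=1}^{n}P(c_{k_j}|\vec{x}_j,\Theta)\right]\sum_{i=1}^{n}\log P(\vec{x}_i,c_{k_i}|\hat{\Theta})\\
&=\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \sum_{i=1}^{n}\log P(\vec{x}_i,c_{k_i}|\hat{\Theta})\prod_{j=1}^{n}P(c_{k_j}|\vec{x}_j,\Theta)\\
&=\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \sum_{i=1}^{n}\left[\sum_{k=1}^{K}\delta_{k,k_i}\log P(\vec{x}_i,c_k|\hat{\Theta})\right]\prod_{j=1}^{n}P(c_{k_j}|\vec{x}_j,\Theta)\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}\log P(\vec{x}_i,c_k|\hat{\Theta})\left\{\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k,k_i}\prod_{j=1}^{n}P(c_{k_j}|\vec{x}_j,\Theta)\right\}\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k|\vec{x}_i,\Theta)\log P(\vec{x}_i,c_k|\hat{\Theta})\qquad\text{(see next slide)}\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k|\vec{x}_i,\Theta)\log\left[P(c_k|\hat{\Theta})\,P(\vec{x}_i|c_k,\hat{\Theta})\right]\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k|\vec{x}_i,\Theta)\log P(c_k|\hat{\Theta})+\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k|\vec{x}_i,\Theta)\log P(\vec{x}_i|c_k,\hat{\Theta})
\end{aligned}$$
$$\text{where }\delta_{k,k_i}=\begin{cases}1&\text{if }k_i=k\\0&\text{otherwise}\end{cases}$$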

The EM Algorithm (cont)

  – Note that
$$\begin{aligned}
&\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k,k_i}\prod_{j=1}^{n}P(c_{k_j}|\vec{x}_j,\Theta)\\
&\quad=\left[\sum_{k_1=1}^{K}\cdots\sum_{k_{i-1}=1}^{K}\ \sum_{k_{i+1}=1}^{K}\cdots\sum_{k_n=1}^{K}\ \prod_{j=1,\,j\neq i}^{n}P(c_{k_j}|\vec{x}_j,\Theta)\right]P(c_k|\vec{x}_i,\Theta)\\
&\quad=\left[\prod_{j=1,\,j\neq i}^{n}\ \sum_{k_j=1}^{K}P(c_{k_j}|\vec{x}_j,\Theta)\right]P(c_k|\vec{x}_i,\Theta)\\
&\quad=P(c_k|\vec{x}_i,\Theta)
\end{aligned}$$
  – Because of $\delta_{k,k_i}$, $\vec{x}_i$ can only be aligned to $c_k$; each remaining inner sum $\sum_{k_j=1}^{K}P(c_{k_j}|\vec{x}_j,\Theta)$ equals 1
  – Note:
$$\prod_{t=1}^{T}\sum_{k=1}^{M}a_{tk}=(a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM})=\sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\ \prod_{t=1}^{T}a_{t k_t}$$

The EM Algorithm (cont)

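• E–step (Expectation)
  – The auxiliary function can also be divided into two parts
$$\Phi(\Theta,\hat{\Theta})=\Phi_a(\Theta,\hat{\Theta})+\Phi_b(\Theta,\hat{\Theta})$$
  – where $\Phi_a$ is the auxiliary function for mixture weights
$$\Phi_a(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k|\vec{x}_i,\Theta)\log P(c_k|\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i,c_k|\Theta)}{P(\vec{x}_i|\Theta)}\log P(c_k|\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}\log P(c_k|\hat{\Theta})$$
  – and $\Phi_b$ is the auxiliary function for cluster distributions
$$\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k|\vec{x}_i,\Theta)\log P(\vec{x}_i|c_k,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}\log P(\vec{x}_i|c_k,\hat{\Theta})$$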

The EM Algorithm (cont)

• M-step (Maximization)
  – Remember that
    • Maximize a function F with a constraint by applying a Lagrange multiplier
$$\text{Suppose that }F=\sum_{j=1}^{N}w_j\log y_j\quad\text{with the constraint}\quad\sum_{j=1}^{N}y_j=1$$
$$\text{By applying a Lagrange multiplier }\ell:\quad\hat{F}=\sum_{j=1}^{N}w_j\log y_j+\ell\left(\sum_{j=1}^{N}y_j-1\right)$$
$$\frac{\partial\hat{F}}{\partial y_j}=\frac{w_j}{y_j}+\ell=0\ \Rightarrow\ w_j=-\ell\,y_j\quad\forall j$$
$$\sum_{j=1}^{N}w_j=-\ell\sum_{j=1}^{N}y_j\ \Rightarrow\ \ell=-\sum_{j=1}^{N}w_j$$
$$\therefore\ y_j=\frac{w_j}{\sum_{j'=1}^{N}w_{j'}}$$
    • Note: $\dfrac{\partial\log y_j}{\partial y_j}=\dfrac{1}{y_j}$

The EM Algorithm (cont)

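• M-step (Maximization)
  – Maximize $\Phi_a(\Theta,\hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)
$$\hat{\Phi}_a(\Theta,\hat{\Theta})=\Phi_a(\Theta,\hat{\Theta})+\ell\left(\sum_{k=1}^{K}P(c_k|\hat{\Theta})-1\right)=\sum_{k=1}^{K}\underbrace{\left[\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}\right]}_{w_k}\log\underbrace{P(c_k|\hat{\Theta})}_{y_k}+\ell\left(\sum_{k=1}^{K}P(c_k|\hat{\Theta})-1\right)$$
$$\Rightarrow\ \hat{\pi}_k=P(c_k|\hat{\Theta})=\frac{w_k}{\sum_{k'=1}^{K}w_{k'}}=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}}{\displaystyle\sum_{k'=1}^{K}\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_{k'},\Theta)P(c_{k'}|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}}=\frac{1}{n}\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}$$
  – Each term $\dfrac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$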

The EM Algorithm (cont)

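• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances
$$\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}\log P(\vec{x}_i|c_k,\hat{\Theta})$$
$$P(\vec{x}_i|c_k,\hat{\Theta})=\frac{1}{(2\pi)^{m/2}|\hat{\Sigma}_k|^{1/2}}\exp\left(-\frac{1}{2}(\vec{x}_i-\hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\mu}_k)\right)$$
  – Let
$$w_{ik}=\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}\quad\text{and}\quad\log P(\vec{x}_i|c_k,\hat{\Theta})=-\frac{m}{2}\log(2\pi)-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\mu}_k)$$
$$\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\mu}_k)\right]+D\qquad(D\text{ is a constant})$$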

The EM Algorithm (cont)

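• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\mu}_k$
$$\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\mu}_k)\right]+D$$
$$\frac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\mu}_k}=-\frac{1}{2}\sum_{i=1}^{n}w_{ik}\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\mu}_k)\cdot(-1)=0$$
$$\Rightarrow\ \hat{\mu}_k=\frac{\sum_{i=1}^{n}w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n}w_{ik}}=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}\cdot\vec{x}_i}{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}}$$
  – Note: $\dfrac{d(\vec{x}^T C\vec{x})}{d\vec{x}}=(C+C^T)\vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here
  – $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$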

The EM Algorithm (cont)

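• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$
$$\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\mu}_k)\right]+D$$
$$\begin{aligned}
&\frac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\Sigma}_k}=-\frac{1}{2}\sum_{i=1}^{n}w_{ik}\left[\frac{1}{|\hat{\Sigma}_k|}\cdot|\hat{\Sigma}_k|\cdot\hat{\Sigma}_k^{-1}-\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\mu}_k)(\vec{x}_i-\hat{\mu}_k)^T\hat{\Sigma}_k^{-1}\right]=0\\
&\Rightarrow\ \sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k^{-1}=\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\mu}_k)(\vec{x}_i-\hat{\mu}_k)^T\hat{\Sigma}_k^{-1}\\
&\Rightarrow\ \sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k=\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\mu}_k)(\vec{x}_i-\hat{\mu}_k)^T\hat{\Sigma}_k^{-1}\hat{\Sigma}_k\\
&\Rightarrow\ \sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k=\sum_{i=1}^{n}w_{ik}\,(\vec{x}_i-\hat{\mu}_k)(\vec{x}_i-\hat{\mu}_k)^T\\
&\Rightarrow\ \hat{\Sigma}_k=\frac{\sum_{i=1}^{n}w_{ik}\,(\vec{x}_i-\hat{\mu}_k)(\vec{x}_i-\hat{\mu}_k)^T}{\sum_{i=1}^{n}w_{ik}}=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}\cdot(\vec{x}_i-\hat{\mu}_k)(\vec{x}_i-\hat{\mu}_k)^T}{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i|c_k,\Theta)P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)P(c_l|\Theta)}}
\end{aligned}$$
  – Note: $\dfrac{d\det(X)}{dX}=\det(X)\cdot X^{-T}$, and $\hat{\Sigma}_k$ is symmetric here
  – Note: $\dfrac{d(\vec{a}^T X^{-1}\vec{b})}{dX}=-X^{-1}\vec{a}\,\vec{b}^T X^{-1}$ for symmetric $X$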

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm
• The procedure terminates when the likelihood function $P(X|\Theta)$ has converged or the maximum number of iterations is reached
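Putting the E-step weights and the three M-step updates together gives a compact training loop. The sketch below is restricted to one-dimensional data with a scalar variance per component, which is a simplification of the multivariate case derived in the slides; the small variance floor, the tolerance, and all function names are assumptions added for this example.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, pis, mus, variances):
    return sum(math.log(sum(p * gaussian_pdf(x, m, v)
                            for p, m, v in zip(pis, mus, variances)))
               for x in data)

def em_step(data, pis, mus, variances):
    K, n = len(pis), len(data)
    # E-step: w[i][k] = P(c_k | x_i, Theta)
    w = []
    for x in data:
        joint = [pis[k] * gaussian_pdf(x, mus[k], variances[k]) for k in range(K)]
        total = sum(joint)
        w.append([j / total for j in joint])
    # M-step: re-estimate weights, means, and variances from the soft counts
    new_pis, new_mus, new_vars = [], [], []
    for k in range(K):
        wk = sum(w[i][k] for i in range(n))
        mu = sum(w[i][k] * data[i] for i in range(n)) / wk
        var = sum(w[i][k] * (data[i] - mu) ** 2 for i in range(n)) / wk
        new_pis.append(wk / n)
        new_mus.append(mu)
        new_vars.append(max(var, 1e-6))  # floor added here to avoid collapse
    return new_pis, new_mus, new_vars

def fit(data, pis, mus, variances, max_iter=100, tol=1e-8):
    prev = log_likelihood(data, pis, mus, variances)
    for _ in range(max_iter):
        pis, mus, variances = em_step(data, pis, mus, variances)
        cur = log_likelihood(data, pis, mus, variances)
        if cur - prev < tol:   # likelihood has converged
            break
        prev = cur
    return pis, mus, variances

data = [-0.2, 0.0, 0.2, 4.8, 5.0, 5.2]
pis, mus, variances = fit(data, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```

In this toy run the initial means are supplied by hand; in line with the bullets above, they could instead come from a K-means pass. The loop stops either when the log-likelihood gain falls below `tol` or after `max_iter` iterations.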

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach
$$P(w_j|D_i)=\sum_{k=1}^{K}P(T_k|D_i)\left[\sum_{l=1}^{K}P(T_l|Y_k)\,P(w_j|T_l)\right]$$
$$P(T_l|Y_k)=\frac{E(T_k,T_l)}{\sum_{s=1}^{K}E(T_k,T_s)},\qquad E(T_k,T_l)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\right]$$
$$dist(T_i,T_j)=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2}$$
• Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map
• Related documents are in the same cluster, and the relationships among the clusters are reflected by the distance on the map
• When a cluster has many documents, we can further analyze it into another map on the next layer
[Figure: two-dimensional tree structure for organized topics]
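The neighborhood distribution P(T_l | Y_k) is a Gaussian kernel over map distances, normalized per topic. A minimal sketch, assuming each topic is identified by its (x, y) position on the map; the function name and the 3-topic layout are illustrative only.

```python
import math

def neighborhood(coords, sigma):
    """Return P[k][l] = P(T_l | Y_k) for topics placed at map coordinates."""
    K = len(coords)
    E = [[math.exp(-math.dist(coords[k], coords[l]) ** 2 / (2 * sigma ** 2))
          / (math.sqrt(2 * math.pi) * sigma)
          for l in range(K)] for k in range(K)]
    # normalize each row so that it sums to one
    return [[E[k][l] / sum(E[k]) for l in range(K)] for k in range(K)]

P = neighborhood([(0, 0), (0, 1), (2, 2)], sigma=1.0)
```

Each row is a probability distribution in which a topic gives the largest weight to itself and progressively less to topics that lie farther away on the map, with `sigma` controlling how quickly the weight decays.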

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
$$L_T=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\log P(w_j|D_i)=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\log\left\{\sum_{k=1}^{K}P(T_k|D_i)\left[\sum_{l=1}^{K}P(T_l|Y_k)\,P(w_j|T_l)\right]\right\}$$
  – EM training can be performed
$$\hat{P}(w_j|T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k|w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N}c(w_{j'},D_{i'})\,P(T_k|w_{j'},D_{i'})}$$
$$\hat{P}(T_k|D_i)=\frac{\sum_{j=1}^{J}c(w_j,D_i)\,P(T_k|w_j,D_i)}{c(D_i)}$$
  – where
$$P(T_k|w_j,D_i)=\frac{\left[\sum_{l=1}^{K}P(w_j|T_l)\,P(T_l|Y_k)\right]P(T_k|D_i)}{\sum_{k'=1}^{K}\left\{\left[\sum_{l=1}^{K}P(w_j|T_l)\,P(T_l|Y_{k'})\right]P(T_{k'}|D_i)\right\}}$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection
$$S(w_j,T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k|D_i)}{\sum_{i'=1}^{N}c(w_j,D_{i'})\left[1-P(T_k|D_{i'})\right]}$$

Hierarchical Document Organization (cont)

• Example [figure not recovered]

Hierarchical Document Organization (cont)

• Example (cont) [figure not recovered]

Hierarchical Document Organization (cont)

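• Self-Organization Map (SOM)
  – A recursive regression process
$$\vec{m}_i(t+1)=\vec{m}_i(t)+h_{c(\vec{x}),i}(t)\left[\vec{x}(t)-\vec{m}_i(t)\right]$$
$$c(\vec{x})=\arg\min_i\|\vec{x}-\vec{m}_i\|,\qquad\text{where }\|\vec{x}-\vec{m}_i\|=\sqrt{\sum_{n}(x_n-m_{i,n})^2}$$
$$h_{c(\vec{x}),i}(t)=\alpha(t)\exp\left(-\frac{\|\vec{r}_i-\vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)}\right)$$
[Figure: the input layer holds the input vector $\vec{x}=[x_1,x_2,\ldots,x_n]^T$; each unit $i$ of the mapping layer holds a weight vector $\vec{m}_i=[m_{i,1},m_{i,2},\ldots,m_{i,n}]^T$, e.g., $\vec{m}_1=[m_{1,1},m_{1,2},\ldots,m_{1,n}]^T$]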
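A single SOM update consists of finding the winning unit and pulling every unit toward the input by an amount that decays with map distance from the winner. The sketch below performs one such step with fixed `alpha` and `sigma`; in the formulas above both depend on the iteration t, and the function name and toy values are assumptions of this example.

```python
import math

def som_step(weights, positions, x, alpha, sigma):
    """One SOM update; returns the winner index and the new weight vectors."""
    # winner c(x): the unit whose weight vector is closest to the input
    c = min(range(len(weights)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, weights[i])))
    updated = []
    for m_i, r_i in zip(weights, positions):
        d2 = sum((a - b) ** 2 for a, b in zip(r_i, positions[c]))
        h = alpha * math.exp(-d2 / (2 * sigma ** 2))   # neighborhood function
        updated.append([m + h * (xv - m) for m, xv in zip(m_i, x)])
    return c, updated

weights = [[0.0, 0.0], [1.0, 1.0]]     # weight vectors m_i
positions = [(0, 0), (0, 1)]           # map coordinates r_i
c, updated = som_step(weights, positions, x=[1.0, 1.0], alpha=0.5, sigma=1.0)
```

Here the second unit wins and is already equal to the input, so it stays put, while the first unit moves a fraction 0.5·exp(-0.5) of the way toward the input because it sits one map step away from the winner.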

Hierarchical Document Organization (cont)

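• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$R_{Dist}=\frac{dist_{Between}}{dist_{Within}}$$
where
$$dist_{Between}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Between}(i,j)},\qquad dist_{Within}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Within}(i,j)}$$
$$f_{Between}(i,j)=\begin{cases}dist_{Map}(i,j)&T_{r,i}\neq T_{r,j}\\0&\text{otherwise}\end{cases}\qquad C_{Between}(i,j)=\begin{cases}1&T_{r,i}\neq T_{r,j}\\0&\text{otherwise}\end{cases}$$
$$f_{Within}(i,j)=\begin{cases}dist_{Map}(i,j)&T_{r,i}=T_{r,j}\\0&\text{otherwise}\end{cases}\qquad C_{Within}(i,j)=\begin{cases}1&T_{r,i}=T_{r,j}\\0&\text{otherwise}\end{cases}$$
$$dist_{Map}(i,j)=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2}$$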
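The distance ratio is the mean map distance between document pairs with different labels divided by the mean map distance between pairs with the same label. A minimal sketch, assuming `coords[i]` is the map position assigned to document i and `topics[i]` is its reference label (one reading of the T_{r,i} symbol); both argument names are hypothetical.

```python
import math
from itertools import combinations

def distance_ratio(coords, topics):
    """Return dist_Between / dist_Within over all document pairs i < j."""
    between, within = [], []
    for i, j in combinations(range(len(coords)), 2):
        d = math.dist(coords[i], coords[j])   # map distance of the pair
        (within if topics[i] == topics[j] else between).append(d)
    return (sum(between) / len(between)) / (sum(within) / len(within))

coords = [(0, 0), (0, 1), (3, 0), (3, 1)]
topics = [0, 0, 1, 1]
ratio = distance_ratio(coords, topics)
```

A larger ratio means that documents with different labels are placed farther apart on the map than documents sharing a label, which is the sense in which the ratio compares the two models.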


IR ndash Berlin Chen 12

HAC (cont)

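• Algorithm (annotations recovered from the pseudocode figure)
– Initialization (for the tree leaves): each object is a cluster
– The current cluster number is tracked
– At each step, two clusters are merged as a new cluster
– The original two clusters are removed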
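The pseudocode figure did not survive extraction, so the following is only a minimal sketch of bottom-up clustering under stated assumptions: objects are coordinate tuples, similarity is the negative Euclidean distance, and clusters are compared with the single-link rule. The names `hac` and `single_link_sim` are illustrative and do not come from the slides.

```python
from itertools import combinations
from math import dist


def single_link_sim(a, b):
    """Cluster similarity = similarity of the two closest members (assumed rule)."""
    return max(-dist(x, y) for x in a for y in b)


def hac(points, k, cluster_sim=single_link_sim):
    """Merge the most similar pair of clusters until k clusters remain."""
    # Initialization (tree leaves): each object is its own cluster.
    clusters = [[p] for p in points]
    merges = []  # merge history, i.e. the internal nodes of the tree
    while len(clusters) > k:
        # Pick the pair of clusters with the greatest similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        merges.append((clusters[i], clusters[j]))
        # The original two clusters are removed; the merged one is added.
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters, merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = hac(points, k=3)
```

Passing a different `cluster_sim` function swaps in another cluster-similarity measure without touching the merge loop.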

IR ndash Berlin Chen 13

Distance Metrics

• Euclidean distance (L2 norm)

$$L_2(\vec{x},\vec{y}) = \sqrt{\sum_{i=1}^{m}(x_i-y_i)^2}$$

– Make sure that all attributes/dimensions have the same scale (or the same variance)

• L1 norm (city-block distance)

$$L_1(\vec{x},\vec{y}) = \sum_{i=1}^{m}\left|x_i-y_i\right|$$

• Cosine similarity (transformed to a distance by subtracting it from 1)

$$1-\frac{\vec{x}\cdot\vec{y}}{|\vec{x}|\,|\vec{y}|}$$

– Ranges between 0 and 1 (for vectors with non-negative components)
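The three measures can be sketched in a few lines of plain Python. The function names are illustrative assumptions, and vectors are taken to be equal-length sequences of numbers.

```python
from math import sqrt


def euclidean(x, y):
    """L2 norm of the difference vector."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def city_block(x, y):
    """L1 norm of the difference vector."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_distance(x, y):
    """1 minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms


d2 = euclidean((0, 0), (3, 4))
d1 = city_block((0, 0), (3, 4))
dc = cosine_distance((1, 0), (0, 1))
```

Because the Euclidean and city-block distances add up raw coordinate differences, a dimension with a much larger scale dominates them, which is why the slide asks for equal scale or variance.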

IR ndash Berlin Chen 14

Measures of Cluster Similarity

• Especially for the bottom-up approaches

• Single-link clustering
– The similarity between two clusters is the similarity of the two closest objects in the clusters
– Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity
– Elongated clusters are achieved (cf. the minimal spanning tree)

$$sim(c_i,c_j) = \max_{\vec{x}\in c_i,\ \vec{y}\in c_j} sim(\vec{x},\vec{y})$$

[Figure: clusters Ci and Cj linked by the pair of objects with the greatest similarity]

IR ndash Berlin Chen 15

Measures of Cluster Similarity (cont)

• Complete-link clustering
– The similarity between two clusters is the similarity of their two most dissimilar members
– Sphere-shaped clusters are achieved
– Preferable for most IR and NLP applications

$$sim(c_i,c_j) = \min_{\vec{x}\in c_i,\ \vec{y}\in c_j} sim(\vec{x},\vec{y})$$

[Figure: clusters Ci and Cj linked by the pair of objects with the least similarity]
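The single-link and complete-link definitions differ only in taking the maximum or the minimum over cross-cluster pairs. A small sketch, assuming cosine similarity as the object-level measure and illustrative function names:

```python
from math import sqrt


def cos_sim(x, y):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def single_link(ci, cj, sim=cos_sim):
    """Similarity of the two closest objects across the clusters."""
    return max(sim(x, y) for x in ci for y in cj)


def complete_link(ci, cj, sim=cos_sim):
    """Similarity of the two most dissimilar objects across the clusters."""
    return min(sim(x, y) for x in ci for y in cj)


ci = [(1.0, 0.0), (1.0, 1.0)]
cj = [(0.0, 1.0), (1.0, 2.0)]
s_single = single_link(ci, cj)
s_complete = complete_link(ci, cj)
```

Both functions scan all |ci| × |cj| pairs, so the cost of one comparison grows with the product of the cluster sizes.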

IR ndash Berlin Chen 16

Measures of Cluster Similarity (cont)

[Figure: example clusterings produced by single link and by complete link]

IR ndash Berlin Chen 17

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering
– A compromise between single-link and complete-link clustering
– The similarity between two clusters is the average similarity between members
– If the objects are represented as length-normalized vectors and the similarity measure is the cosine
  • There exists a fast algorithm for computing the average similarity

$$sim(\vec{x},\vec{y}) = \cos(\vec{x},\vec{y}) = \frac{\vec{x}\cdot\vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x}\cdot\vec{y} \quad \text{(length-normalized vectors)}$$

[Figure: clusters Ci and Cj]

IR ndash Berlin Chen 18

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

– The average similarity SIM between vectors in a cluster c_j is defined as

$$SIM(c_j) = \frac{1}{|c_j|\,(|c_j|-1)} \sum_{\vec{x}\in c_j}\ \sum_{\vec{y}\in c_j,\ \vec{y}\neq\vec{x}} sim(\vec{x},\vec{y}) = \frac{1}{|c_j|\,(|c_j|-1)} \sum_{\vec{x}\in c_j}\ \sum_{\vec{y}\in c_j,\ \vec{y}\neq\vec{x}} \vec{x}\cdot\vec{y}$$

– The sum of members in a cluster c_j

$$\vec{s}(c_j) = \sum_{\vec{x}\in c_j}\vec{x}$$

– Express SIM(c_j) in terms of s(c_j)

$$
\begin{aligned}
\vec{s}(c_j)\cdot\vec{s}(c_j) &= \sum_{\vec{x}\in c_j}\vec{s}(c_j)\cdot\vec{x} = \sum_{\vec{x}\in c_j}\sum_{\vec{y}\in c_j}\vec{x}\cdot\vec{y} \\
&= |c_j|\,(|c_j|-1)\,SIM(c_j) + \sum_{\vec{x}\in c_j}\vec{x}\cdot\vec{x} \\
&= |c_j|\,(|c_j|-1)\,SIM(c_j) + |c_j| \qquad (\vec{x}\cdot\vec{x}=1 \text{ for a length-normalized vector}) \\
\therefore\ SIM(c_j) &= \frac{\vec{s}(c_j)\cdot\vec{s}(c_j) - |c_j|}{|c_j|\,(|c_j|-1)}
\end{aligned}
$$

IR ndash Berlin Chen 19

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

– When merging two clusters c_i and c_j, the cluster sum vectors s(c_i) and s(c_j) are known in advance

$$\vec{s}(c_{New}) = \vec{s}(c_i)+\vec{s}(c_j), \qquad |c_{New}| = |c_i|+|c_j|$$

– The average similarity for their union will be

$$SIM(c_i\cup c_j) = \frac{\big(\vec{s}(c_i)+\vec{s}(c_j)\big)\cdot\big(\vec{s}(c_i)+\vec{s}(c_j)\big) - \big(|c_i|+|c_j|\big)}{\big(|c_i|+|c_j|\big)\big(|c_i|+|c_j|-1\big)}$$

[Figure: clusters Ci and Cj with their sum vectors s(c_i) and s(c_j)]
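A short numerical check of the sum-vector shortcut may help. The sketch below assumes length-normalized 2-D vectors and compares the shortcut for a merged cluster with a brute-force average over all ordered pairs; every function name is illustrative.

```python
from math import sqrt


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = sqrt(sum(a * a for a in v))
    return tuple(a / n for a in v)


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def sum_vector(cluster):
    """s(c): component-wise sum of the members."""
    return tuple(sum(col) for col in zip(*cluster))


def avg_sim_fast(s, size):
    """SIM(c) = (s.s - |c|) / (|c| (|c| - 1)), valid for unit-length members."""
    return (dot(s, s) - size) / (size * (size - 1))


def avg_sim_brute(cluster):
    """Average of x.y over all ordered pairs of distinct members."""
    n = len(cluster)
    total = sum(
        dot(x, y)
        for i, x in enumerate(cluster)
        for j, y in enumerate(cluster)
        if i != j
    )
    return total / (n * (n - 1))


ci = [normalize(v) for v in [(1.0, 0.0), (1.0, 1.0)]]
cj = [normalize(v) for v in [(0.0, 1.0), (1.0, 2.0), (2.0, 1.0)]]

# Merging only needs the two stored sum vectors and the two sizes.
s_new = tuple(a + b for a, b in zip(sum_vector(ci), sum_vector(cj)))
fast = avg_sim_fast(s_new, len(ci) + len(cj))
brute = avg_sim_brute(ci + cj)
```

The shortcut needs one dot product per candidate merge, whereas the brute-force average visits every pair of members.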

IR ndash Berlin Chen 20

Example Word Clustering

• Words (objects) are described and clustered using a set of features and values
– E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of the clustered words. “be” has the least similarity with the other 21 words; higher nodes indicate decreasing similarity]

IR ndash Berlin Chen 21

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

IR ndash Berlin Chen 22

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
– Single-link measure
– Complete-link measure
– Group-average measure

• How to split a cluster
– Splitting is also a clustering task (finding two sub-clusters)
– Any clustering algorithm can be used for the splitting operation, e.g.,
  • Bottom-up (agglomerative) algorithms
  • Non-hierarchical clustering algorithms (e.g., K-means)

IR ndash Berlin Chen 23

Divisive Clustering (cont)

• Algorithm (annotations recovered from the pseudocode figure)
– Split the least coherent cluster
– Generate two new clusters and remove the original one
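As the pseudocode figure is missing, here is a rough top-down sketch under explicit assumptions. Coherence is judged by the average pairwise distance within a cluster, which stands in for the group-average measure. A cluster is split around its two mutually farthest members, which is just one possible splitting rule and is not taken from the slides. All names are illustrative.

```python
from itertools import combinations
from math import dist


def incoherence(cluster):
    """Average pairwise distance; a larger value means a less coherent cluster."""
    if len(cluster) < 2:
        return 0.0
    pairs = list(combinations(cluster, 2))
    return sum(dist(x, y) for x, y in pairs) / len(pairs)


def split(cluster):
    """Split a cluster around its two mutually farthest members (assumed rule)."""
    a, b = max(combinations(cluster, 2), key=lambda p: dist(*p))
    left = [x for x in cluster if dist(x, a) <= dist(x, b)]
    right = [x for x in cluster if dist(x, a) > dist(x, b)]
    return left, right


def divisive(points, k):
    clusters = [list(points)]  # start with all objects in a single cluster
    while len(clusters) < k:
        worst = max(clusters, key=incoherence)  # the least coherent cluster
        if len(worst) < 2:
            break  # nothing left to split
        clusters.remove(worst)         # remove the original cluster ...
        clusters.extend(split(worst))  # ... and add the two new ones
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = divisive(points, k=3)
```

Replacing `split` with a 2-means run or a small agglomerative pass gives the other splitting options listed on the previous slide.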

IR ndash Berlin Chen 24

Non-Hierarchical Clustering

IR ndash Berlin Chen 25

Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
– In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
– When to stop (possible criteria: group-average similarity, likelihood, mutual information)
– What is the right number of clusters (k-1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
– The K-means algorithm
– The EM algorithm

IR ndash Berlin Chen 26

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
– A hard clustering algorithm
– Defines clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
– A kind of vector quantization
  • Maps from a continuous space (high resolution) to a discrete space (low resolution)

$$F:\ \mathcal{X}=\{\vec{x}^{\,t}\}_{t=1}^{n}\ \rightarrow\ \text{index } j, \qquad \{\vec{m}_j\}_{j=1}^{k}$$

  where m_j is the cluster centroid, also called the reference vector, code vector, or code word

– E.g., color quantization
  • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., Dim(x^t) = 24 → k = 2^8
  • A compression rate of 3 (24 bits / 8 bits)

IR ndash Berlin Chen 27

The K-means Algorithm (cont)

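• Total reconstruction error

$$E\big(\{\vec{m}_i\}_{i=1}^{k}\,\big|\,\mathcal{X}\big) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t\,\big\|\vec{x}^{\,t}-\vec{m}_i\big\|^2,
\qquad \text{where } b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^{\,t}-\vec{m}_i\| = \min_j \|\vec{x}^{\,t}-\vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$

– b_i^t (the label) and m_i are unknown
– b_i^t depends on m_i, and this optimization problem cannot be solved analytically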

IR ndash Berlin Chen 28

The K-means Algorithm (cont)

• Initialization
– A set of initial cluster centers {m_i}, i = 1, ..., k, is needed

• Recursion
– Assign each object x^t to the cluster whose center is closest

$$b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^{\,t}-\vec{m}_i\| = \min_j \|\vec{x}^{\,t}-\vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$

– Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$\vec{m}_i = \frac{\sum_{t=1}^{N} b_i^t\,\vec{x}^{\,t}}{\sum_{t=1}^{N} b_i^t}$$

  • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

These two steps are repeated until m_i stabilizes

IR ndash Berlin Chen 29

The K-means Algorithm (cont)

• Algorithm
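The algorithm figure is not recoverable, so the following is a minimal sketch of the two alternating steps (assign each object to the closest center, then recompute each center as the mean of its members). It assumes Euclidean distance, caller-supplied initial centers, and that an empty cluster keeps its old center; these choices and all names are illustrative assumptions.

```python
from math import dist


def assign(points, centers):
    """Label each point with the index of its closest center (b_i^t = 1)."""
    return [min(range(len(centers)), key=lambda i: dist(x, centers[i]))
            for x in points]


def kmeans(points, centers, max_iter=100):
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:
                # Centroid: component-wise mean of the members.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: keep an empty cluster's center
        if new_centers == centers:  # centers have stabilized
            break
        centers = new_centers
    return centers, assign(points, centers)


def reconstruction_error(points, centers, labels):
    """Total squared distance of every point to its assigned center."""
    return sum(dist(x, centers[lab]) ** 2 for x, lab in zip(points, labels))


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
error = reconstruction_error(points, centers, labels)
```

Different initial centers can lead to a different final partition, since the procedure only converges to a local minimum of the reconstruction error.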

IR ndash Berlin Chen 30

The K-means Algorithm (cont)

• Example 1

IR ndash Berlin Chen 31

The K-means Algorithm (cont)

• Example 2

[Figure: resulting clusters, labeled government, finance, sports, research, and name]

IR ndash Berlin Chen 32

The K-means Algorithm (cont)

• The choice of initial cluster centers (seeds) is important
– Pick at random
– Calculate the mean m of all data and generate k initial centers m_i by adding a small random vector to the mean: m ± δ
– Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center m_i
– Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
  • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

IR ndash Berlin Chen 33

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
– Randomly assign the object to one of the candidate clusters
– Or perturb objects slightly

• Applications of the K-means algorithm
– Clustering
– Vector quantization
– A preprocessing stage before classification or regression
  • Map from the original space to an l-dimensional space/hypercube, with l = log2 k (k clusters); the clusters become nodes on the hypercube, which can then be fed to, e.g., a linear classifier

IR ndash Berlin Chen 34

The K-means Algorithm (cont)

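• E.g., the LBG algorithm
– By Linde, Buzo, and Gray
– The number of clusters doubles (M → 2M) at each iteration

[Figure: the global mean is split into the cluster 1 mean and the cluster 2 mean; after a further split the four clusters carry the parameters (μ11, Σ11, ω11), (μ12, Σ12, ω12), (μ13, Σ13, ω13), and (μ14, Σ14, ω14)]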

IR ndash Berlin Chen 35

The EM Algorithm

• A soft version of the K-means algorithm
– Each object could be a member of multiple clusters
– Clustering as estimating a mixture of (continuous) probability distributions, i.e., a mixture of Gaussians (cf. a mixture Gaussian HMM)

$$P(\vec{x}_i\,|\,\Theta) = \sum_{k=1}^{K} P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta), \qquad \pi_k = P(c_k),\ k=1,\dots,K$$

• Continuous case

$$P(\vec{x}_i\,|\,c_k,\Theta) = \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}} \exp\!\Big(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^T\,\Sigma_k^{-1}\,(\vec{x}_i-\vec{\mu}_k)\Big)$$

• Likelihood function for the data samples X = {x_1, x_2, ..., x_n}, where the x_i's are independent and identically distributed (i.i.d.)

$$P(\mathbf{X}\,|\,\Theta) = \prod_{i=1}^{n} P(\vec{x}_i\,|\,\Theta) = \prod_{i=1}^{n}\sum_{k=1}^{K} P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)$$

• Classification

$$\arg\max_k P(c_k\,|\,\vec{x}_i,\Theta) = \arg\max_k \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{P(\vec{x}_i\,|\,\Theta)} = \arg\max_k P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)$$

IR ndash Berlin Chen 36

The EM Algorithm (cont)

IR ndash Berlin Chen 37

Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B|S1) = 2/4 = 0.5

P(W|S1) = 2/4 = 0.5

IR ndash Berlin Chen 38

Maximum Likelihood Estimation

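• Soft Assignment

Membership weights of the four observations:

Observation   State S1   State S2
B             0.7        0.3
W             0.4        0.6
B             0.9        0.1
W             0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64

P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36

P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27

P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73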
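The soft-assignment estimates can be reproduced by accumulating fractional counts. The sketch below assumes, as the sums on the slide imply, that the first and third observations are B and the second and fourth are W; the data layout and the function name are illustrative.

```python
# (symbol, weight for state S1, weight for state S2) for each observation.
observations = [
    ("B", 0.7, 0.3),
    ("W", 0.4, 0.6),
    ("B", 0.9, 0.1),
    ("W", 0.5, 0.5),
]


def soft_mle(observations, state_index):
    """Estimate P(symbol | state) from fractional (soft) counts."""
    totals = {}
    for symbol, *weights in observations:
        totals[symbol] = totals.get(symbol, 0.0) + weights[state_index]
    norm = sum(totals.values())
    return {symbol: count / norm for symbol, count in totals.items()}


p_s1 = soft_mle(observations, state_index=0)
p_s2 = soft_mle(observations, state_index=1)
```

With hard assignment every weight would be 0 or 1, and the same function would reduce to ordinary relative-frequency counting.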

IR ndash Berlin Chen 39

The EM Algorithm (cont)

• E-step (Expectation)
– Derive the complete data likelihood function

$$
\begin{aligned}
P(\mathbf{X}\,|\,\Theta) &= \prod_{i=1}^{n} P(\vec{x}_i\,|\,\Theta) = \prod_{i=1}^{n}\sum_{k_i=1}^{K} P(\vec{x}_i\,|\,c_{k_i},\Theta)\,P(c_{k_i}\,|\,\Theta) \\
&= \big[P(\vec{x}_1|c_1,\Theta)P(c_1|\Theta)+\dots+P(\vec{x}_1|c_K,\Theta)P(c_K|\Theta)\big] \times \dots \times \big[P(\vec{x}_n|c_1,\Theta)P(c_1|\Theta)+\dots+P(\vec{x}_n|c_K,\Theta)P(c_K|\Theta)\big] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \prod_{i=1}^{n} P(\vec{x}_i\,|\,c_{k_i},\Theta)\,P(c_{k_i}\,|\,\Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i}\,|\,\Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} P(\vec{x}_1,c_{k_1},\vec{x}_2,c_{k_2},\dots,\vec{x}_n,c_{k_n}\,|\,\Theta) \\
&= \sum_{\mathbf{C}} P(\mathbf{X},\mathbf{C}\,|\,\Theta)
\end{aligned}
$$

where X = x_1, x_2, ..., x_n and C = c_{k_1}, c_{k_2}, ..., c_{k_n}. P(X|Θ) is the likelihood function and P(X,C|Θ) is the complete data likelihood function.

– How many kinds of C are there? K^n kinds

– Note:

$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = (a_{11}+a_{12}+\dots+a_{1M})(a_{21}+a_{22}+\dots+a_{2M})\cdots(a_{T1}+a_{T2}+\dots+a_{TM}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\ \prod_{t=1}^{T} a_{t k_t}$$

IR ndash Berlin Chen 40

The EM Algorithm (cont)

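• E-step (Expectation)
– Define the auxiliary function Φ(Θ, Θ̂) as the expectation of the log complete likelihood function L_CM with respect to the hidden/latent variable C, conditioned on the known data X and the current parameters Θ (Θ is known, Θ̂ is unknown)

$$
\Phi(\Theta,\hat{\Theta}) = E_{\mathbf{C}}\big[L_{CM}\big] = E\big[\log P(\mathbf{X},\mathbf{C}\,|\,\hat{\Theta})\,\big|\,\mathbf{X},\Theta\big]
= \sum_{\mathbf{C}} P(\mathbf{C}\,|\,\mathbf{X},\Theta)\,\log P(\mathbf{X},\mathbf{C}\,|\,\hat{\Theta})
= \sum_{\mathbf{C}} \frac{P(\mathbf{X},\mathbf{C}\,|\,\Theta)}{P(\mathbf{X}\,|\,\Theta)}\,\log P(\mathbf{X},\mathbf{C}\,|\,\hat{\Theta})
$$

– Maximize the log likelihood function log P(X|Θ̂) by maximizing the expectation of the log complete likelihood function Φ(Θ, Θ̂)
  • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta,\hat{\Theta}) - \Phi(\Theta,\Theta) \ \le\ \log P(\mathbf{X}\,|\,\hat{\Theta}) - \log P(\mathbf{X}\,|\,\Theta)$$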

IR ndash Berlin Chen 41

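The EM Algorithm (cont)

• E-step (Expectation)
– The auxiliary function Φ(Θ, Θ̂)

$$
\begin{aligned}
\Phi(\Theta,\hat{\Theta}) &= \sum_{\mathbf{C}} \frac{P(\mathbf{X},\mathbf{C}\,|\,\Theta)}{P(\mathbf{X}\,|\,\Theta)}\,\log P(\mathbf{X},\mathbf{C}\,|\,\hat{\Theta}) \\
&= \sum_{\mathbf{C}} \Big[\prod_{j=1}^{n}\frac{P(\vec{x}_j,c_{k_j}\,|\,\Theta)}{P(\vec{x}_j\,|\,\Theta)}\Big]\ \log\prod_{i=1}^{n} P(\vec{x}_i,c_{k_i}\,|\,\hat{\Theta}) \\
&= \sum_{\mathbf{C}} \Big[\prod_{j=1}^{n} P(c_{k_j}\,|\,\vec{x}_j,\Theta)\Big]\ \sum_{i=1}^{n}\log P(\vec{x}_i,c_{k_i}\,|\,\hat{\Theta}) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \sum_{i=1}^{n}\log P(\vec{x}_i,c_{k_i}\,|\,\hat{\Theta})\ \prod_{j=1}^{n} P(c_{k_j}\,|\,\vec{x}_j,\Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \sum_{i=1}^{n}\Big\{\sum_{k=1}^{K}\delta_{k_i,k}\,\log P(\vec{x}_i,c_k\,|\,\hat{\Theta})\Big\}\ \prod_{j=1}^{n} P(c_{k_j}\,|\,\vec{x}_j,\Theta) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}\log P(\vec{x}_i,c_k\,|\,\hat{\Theta})\ \Big\{\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k_i,k}\prod_{j=1}^{n} P(c_{k_j}\,|\,\vec{x}_j,\Theta)\Big\} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k\,|\,\vec{x}_i,\Theta)\,\log P(\vec{x}_i,c_k\,|\,\hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k\,|\,\vec{x}_i,\Theta)\,\log\big[P(c_k\,|\,\hat{\Theta})\,P(\vec{x}_i\,|\,c_k,\hat{\Theta})\big] \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k\,|\,\vec{x}_i,\Theta)\,\log P(c_k\,|\,\hat{\Theta}) + \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k\,|\,\vec{x}_i,\Theta)\,\log P(\vec{x}_i\,|\,c_k,\hat{\Theta})
\end{aligned}
$$

where

$$\delta_{k_i,k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}$$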

IR ndash Berlin Chen 42

The EM Algorithm (cont)

– Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k_i,k}\prod_{j=1}^{n} P(c_{k_j}\,|\,\vec{x}_j,\Theta)
&= \Big[\prod_{j=1,\ j\neq i}^{n}\ \sum_{k_j=1}^{K} P(c_{k_j}\,|\,\vec{x}_j,\Theta)\Big]\,P(c_k\,|\,\vec{x}_i,\Theta) \\
&= \Big[\prod_{j=1,\ j\neq i}^{n} 1\Big]\,P(c_k\,|\,\vec{x}_i,\Theta) \\
&= P(c_k\,|\,\vec{x}_i,\Theta)
\end{aligned}
$$

  • Because of δ_{k_i,k}, x_i can only be aligned to c_k
  • The first step uses the identity

$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = (a_{11}+a_{12}+\dots+a_{1M})(a_{21}+a_{22}+\dots+a_{2M})\cdots(a_{T1}+a_{T2}+\dots+a_{TM}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\ \prod_{t=1}^{T} a_{t k_t}$$

IR ndash Berlin Chen 43

The EM Algorithm (cont)

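• E-step (Expectation)
– The auxiliary function can also be divided into two parts

$$\Phi(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + \Phi_b(\Theta,\hat{\Theta})$$

where

$$
\begin{aligned}
\Phi_a(\Theta,\hat{\Theta}) &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k\,|\,\vec{x}_i,\Theta)\,\log P(c_k\,|\,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{P(\vec{x}_i\,|\,\Theta)}\,\log P(c_k\,|\,\hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i\,|\,c_l,\Theta)\,P(c_l\,|\,\Theta)}\,\log P(c_k\,|\,\hat{\Theta})
\qquad \text{(auxiliary function for mixture weights)}
\end{aligned}
$$

$$
\begin{aligned}
\Phi_b(\Theta,\hat{\Theta}) &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k\,|\,\vec{x}_i,\Theta)\,\log P(\vec{x}_i\,|\,c_k,\hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i\,|\,c_l,\Theta)\,P(c_l\,|\,\Theta)}\,\log P(\vec{x}_i\,|\,c_k,\hat{\Theta})
\qquad \text{(auxiliary function for cluster distributions)}
\end{aligned}
$$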

IR ndash Berlin Chen 44

The EM Algorithm (cont)

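• M-step (Maximization)
– Remember that
  • A function F can be maximized under a constraint by applying a Lagrange multiplier

Suppose that

$$F = \sum_{j=1}^{N} w_j \log y_j, \qquad \text{with the constraint } \sum_{j=1}^{N} y_j = 1$$

By applying a Lagrange multiplier ℓ:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell\Big(\sum_{j=1}^{N} y_j - 1\Big) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \ \Rightarrow\ y_j = -\frac{w_j}{\ell} \quad \forall j \\
\sum_{j=1}^{N} y_j = 1 \ &\Rightarrow\ -\frac{\sum_{j=1}^{N} w_j}{\ell} = 1 \ \Rightarrow\ \ell = -\sum_{j=1}^{N} w_j \\
\therefore\ y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note:

$$\frac{\partial \log y_j}{\partial y_j} = \frac{1}{y_j}$$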

IR ndash Berlin Chen 45

The EM Algorithm (cont)

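• M-step (Maximization)
– Maximize Φ_a(Θ, Θ̂), the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta,\hat{\Theta}) &= \Phi_a(\Theta,\hat{\Theta}) + \ell\Big(\sum_{k=1}^{K} P(c_k\,|\,\hat{\Theta}) - 1\Big) \\
&= \sum_{k=1}^{K}\ \underbrace{\sum_{i=1}^{n} \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i\,|\,c_l,\Theta)\,P(c_l\,|\,\Theta)}}_{w_k}\ \log \underbrace{P(c_k\,|\,\hat{\Theta})}_{y_k} + \ell\Big(\sum_{k=1}^{K} P(c_k\,|\,\hat{\Theta}) - 1\Big)
\end{aligned}
$$

$$
\Rightarrow\ \hat{\pi}_k = \hat{P}(c_k\,|\,\hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i\,|\,c_l,\Theta)\,P(c_l\,|\,\Theta)}}{\displaystyle\sum_{k'=1}^{K}\sum_{i=1}^{n} \frac{P(\vec{x}_i\,|\,c_{k'},\Theta)\,P(c_{k'}\,|\,\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i\,|\,c_l,\Theta)\,P(c_l\,|\,\Theta)}}
= \frac{1}{n}\sum_{i=1}^{n} \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i\,|\,c_l,\Theta)\,P(c_l\,|\,\Theta)}
$$

  • Each term of the sum over i is the expected number of times that x_i falls in class c_k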

IR ndash Berlin Chen 46

The EM Algorithm (cont)

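• M-step (Maximization)
– Maximize Φ_b(Θ, Θ̂), the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i\,|\,c_l,\Theta)\,P(c_l\,|\,\Theta)}\,\log P(\vec{x}_i\,|\,c_k,\hat{\Theta})$$

$$P(\vec{x}_i\,|\,c_k,\hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\,|\hat{\Sigma}_k|^{1/2}} \exp\!\Big(-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\Big)$$

Let

$$w_{ik} = \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i\,|\,c_l,\Theta)\,P(c_l\,|\,\Theta)}
\quad\text{and}\quad
\log P(\vec{x}_i\,|\,c_k,\hat{\Theta}) = -\frac{m}{2}\log(2\pi) - \frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)$$

Then

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\Big[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\Big] + D$$

where D is a constant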

IR ndash Berlin Chen 47

The EM Algorithm (cont)

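• M-step (Maximization)
– Maximize Φ_b(Θ, Θ̂) with respect to the mean μ̂_k

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\Big[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\Big] + D$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\vec{\mu}}_k} &= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1) = 0 \\
\Rightarrow\ \hat{\vec{\mu}}_k &= \frac{\sum_{i=1}^{n} w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i\,|\,c_l,\Theta)\,P(c_l\,|\,\Theta)}\ \vec{x}_i}{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i\,|\,c_l,\Theta)\,P(c_l\,|\,\Theta)}}
\end{aligned}
$$

  • Note: d(x^T C x)/dx = (C + C^T) x, and Σ_k^{-1} is symmetric here
  • w_ik is the expected number of times that x_i falls in class c_k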

IR ndash Berlin Chen 48

The EM Algorithm (cont)

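• M-step (Maximization)
– Maximize Φ_b(Θ, Θ̂) with respect to the covariance Σ̂_k

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\Big[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\Big] + D$$

  • Notes (Σ_k is symmetric here):

$$\frac{d\,\det(X)}{dX} = \det(X)\cdot X^{-T}, \qquad \frac{d\,(a^T X^{-1} b)}{dX} = -X^{-1}\,a\,b^T X^{-1}$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\Big[\frac{1}{|\hat{\Sigma}_k|}\cdot|\hat{\Sigma}_k|\cdot\hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1}\Big] = 0 \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1} \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1}\hat{\Sigma}_k \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T \\
&\Rightarrow\ \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i\,|\,c_l,\Theta)\,P(c_l\,|\,\Theta)}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T}{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i\,|\,c_k,\Theta)\,P(c_k\,|\,\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i\,|\,c_l,\Theta)\,P(c_l\,|\,\Theta)}}
\end{aligned}
$$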

IR ndash Berlin Chen 49

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function P(X|Θ) has converged or the maximum number of iterations is reached
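To make the E-step and M-step concrete, here is a minimal sketch for a one-dimensional mixture of Gaussians. The slides treat the multivariate case, so the scalar variance, the small variance floor, the fixed initial parameters (the slides suggest K-means for initialization), and all names are simplifying assumptions.

```python
from math import exp, log, pi, sqrt


def gaussian(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return exp(-0.5 * (x - mu) ** 2 / var) / sqrt(2 * pi * var)


def em_step(data, weights, means, variances):
    k_range = range(len(weights))
    # E-step: posterior w_ik = P(c_k | x_i, Theta) for every sample.
    resp = []
    for x in data:
        joint = [weights[k] * gaussian(x, means[k], variances[k]) for k in k_range]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate the weights, means and variances from the posteriors.
    new_w, new_mu, new_var = [], [], []
    for k in k_range:
        nk = sum(r[k] for r in resp)
        mu = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, data)) / nk
        new_w.append(nk / len(data))
        new_mu.append(mu)
        new_var.append(max(var, 1e-6))  # assumed floor to avoid a zero variance
    return new_w, new_mu, new_var


def log_likelihood(data, weights, means, variances):
    return sum(
        log(sum(w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.8, 5.0, 5.2, 4.9, 5.1]
params = ([0.5, 0.5], [0.0, 6.0], [1.0, 1.0])  # weights, means, variances
history = [log_likelihood(data, *params)]
for _ in range(20):
    params = em_step(data, *params)
    history.append(log_likelihood(data, *params))
weights, means, variances = params
```

Tracking `history` gives a simple stopping test: iterate until the increase in log-likelihood falls below a threshold or an iteration cap is hit.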

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  - TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are placed in the same cluster, and the relationships among the clusters are reflected by their distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$P(w_j|D_i) = \sum_{k=1}^{K} P(T_k|D_i)\left[\sum_{l=1}^{K} P(T_l|Y_k)\,P(w_j|T_l)\right]$$

$$P(T_l|Y_k) = \frac{E(T_k,T_l)}{\sum_{s=1}^{K} E(T_k,T_s)}$$

$$E(T_k,T_l) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\right]$$

$$dist(T_i,T_j) = \sqrt{(x_i-x_j)^2+(y_i-y_j)^2}$$
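The neighborhood weights $P(T_l|Y_k)$ and the resulting word probability can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the 2x2 grid of topic coordinates, and all probability values are made up for the example.

```python
import math

def topic_neighborhood(coords, sigma):
    """Return rows[k][l] = P(T_l | Y_k) for topics placed at 2-D map positions."""
    def e(k, l):
        (xk, yk), (xl, yl) = coords[k], coords[l]
        dist_sq = (xk - xl) ** 2 + (yk - yl) ** 2
        return math.exp(-dist_sq / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)
    rows = []
    for k in range(len(coords)):
        weights = [e(k, l) for l in range(len(coords))]
        total = sum(weights)
        rows.append([w / total for w in weights])   # normalize over all topics T_s
    return rows

def word_given_document(p_topic_given_doc, p_word_given_topic, neighborhood):
    """P(w_j | D_i) = sum_k P(T_k | D_i) * sum_l P(T_l | Y_k) * P(w_j | T_l)."""
    return sum(
        p_k * sum(n_kl * p_w for n_kl, p_w in zip(neighborhood[k], p_word_given_topic))
        for k, p_k in enumerate(p_topic_given_doc)
    )

# Hypothetical 2x2 topic map and made-up probabilities.
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
neighborhood = topic_neighborhood(grid, sigma=1.0)
p_word = word_given_document([0.7, 0.1, 0.1, 0.1], [0.05, 0.01, 0.01, 0.001], neighborhood)
```

A smaller `sigma` concentrates each row on the topic itself, so the model approaches a plain mixture over topics; a larger `sigma` smooths word probabilities across neighboring topics on the map.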

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$\begin{aligned}
L_T &= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\log P(w_j|D_i) \\
    &= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\log\left\{\sum_{k=1}^{K} P(T_k|D_i)\left[\sum_{l=1}^{K} P(T_l|Y_k)\,P(w_j|T_l)\right]\right\}
\end{aligned}$$

  - EM training can be performed

$$\hat{P}(w_j|T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k|w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N} c(w_{j'},D_{i'})\,P(T_k|w_{j'},D_{i'})}$$

$$\hat{P}(T_k|D_i) = \frac{\sum_{j=1}^{J} c(w_j,D_i)\,P(T_k|w_j,D_i)}{c(D_i)}$$

where

$$P(T_k|w_j,D_i) = \frac{\left[\sum_{l=1}^{K} P(w_j|T_l)\,P(T_l|Y_k)\right]\cdot P(T_k|D_i)}{\sum_{k'=1}^{K}\left\{\left[\sum_{l=1}^{K} P(w_j|T_l)\,P(T_l|Y_{k'})\right]\cdot P(T_{k'}|D_i)\right\}}$$

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$S(w_j,T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k|D_i)}{\sum_{i'=1}^{N} c(w_j,D_{i'})\,[1-P(T_k|D_{i'})]}$$
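The selection score can be computed directly from per-document counts and topic posteriors. A minimal sketch follows; the function name, the handling of a zero denominator, and the numbers are assumptions made for illustration.

```python
def topic_word_score(counts, p_topic_given_doc):
    """S(w_j, T_k): counts[i] = c(w_j, D_i), p_topic_given_doc[i] = P(T_k | D_i)."""
    inside = sum(c * p for c, p in zip(counts, p_topic_given_doc))
    outside = sum(c * (1.0 - p) for c, p in zip(counts, p_topic_given_doc))
    # Assumption: a word that never occurs outside the topic gets an infinite score.
    return inside / outside if outside > 0 else float("inf")

# Made-up example: the word occurs 3 times in D_1 and once in D_2.
score = topic_word_score([3, 1, 0], [0.9, 0.2, 0.5])
```

Words whose occurrences fall mostly in documents that belong strongly to topic $T_k$ get a high score, which makes them candidates for labeling that topic.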

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

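Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  - A recursive regression process

[Figure: an input layer feeding a mapping layer. Input vector $\vec{x} = [x_1, x_2, \ldots, x_n]^T$; weight vector of map unit $i$: $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$ (e.g. $\vec{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$); the update moves $\vec{m}_i$ along $\vec{x} - \vec{m}_i$ to $\vec{m}_i'$]

$$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),i}(t)\,[\vec{x}(t) - \vec{m}_i(t)]$$

$$c(\vec{x}) = \arg\min_i \lVert\vec{x}-\vec{m}_i\rVert, \quad\text{where}\quad \lVert\vec{x}-\vec{m}_i\rVert = \sqrt{\sum_{n}(x_n - m_{i,n})^2}$$

$$h_{c(\vec{x}),i}(t) = \alpha(t)\exp\left(-\frac{\lVert\vec{r}_i - \vec{r}_{c(\vec{x})}\rVert^2}{2\sigma^2(t)}\right)$$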
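One SOM update step, as defined by the equations above, might look like this. It is a sketch only: the function names, the two-unit map, and the fixed values of `alpha` and `sigma` (which in practice decay with time `t`) are assumptions.

```python
import math

def best_matching_unit(x, weights):
    """c(x): index of the map unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))

def som_step(x, weights, positions, alpha, sigma):
    """Move every weight vector toward x, scaled by the neighborhood function h."""
    c = best_matching_unit(x, weights)
    updated = []
    for m_i, r_i in zip(weights, positions):
        # h_{c(x),i} = alpha * exp(-||r_i - r_c||^2 / (2 sigma^2))
        h = alpha * math.exp(-math.dist(r_i, positions[c]) ** 2 / (2.0 * sigma ** 2))
        updated.append([m + h * (x_n - m) for m, x_n in zip(m_i, x)])
    return c, updated

# Hypothetical map with two units located at (0, 0) and (1, 0).
winner, new_weights = som_step(
    x=[0.2, 0.0],
    weights=[[0.0, 0.0], [1.0, 1.0]],
    positions=[(0.0, 0.0), (1.0, 0.0)],
    alpha=0.5,
    sigma=1.0,
)
```

The winning unit is pulled toward the input by the full learning rate, while its map neighbors are pulled by a rate that shrinks with their distance on the map.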

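Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

$$dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)}, \qquad
dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}$$

where

$$f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i,j) = \sqrt{(x_i-x_j)^2+(y_i-y_j)^2}$$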
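The distance ratio can be sketched as a single pass over all document pairs. The function name and the toy data are assumptions; `positions[i]` stands for the map coordinates of document `i` and `labels[i]` for its reference topic $T_{r,i}$.

```python
import math
from itertools import combinations

def distance_ratio(positions, labels):
    """R_Dist = mean map distance between classes / mean map distance within classes."""
    between_sum = within_sum = 0.0
    between_count = within_count = 0
    for i, j in combinations(range(len(positions)), 2):   # pairs with j > i
        d = math.dist(positions[i], positions[j])
        if labels[i] != labels[j]:
            between_sum += d
            between_count += 1
        else:
            within_sum += d
            within_count += 1
    dist_between = between_sum / between_count
    dist_within = within_sum / within_count
    return dist_between / dist_within

# Made-up layout: two topics, two documents each.
ratio = distance_ratio([(0, 0), (0, 1), (3, 0), (3, 1)], ["a", "a", "b", "b"])
```

A larger ratio means documents of the same reference topic sit closer together on the map than documents of different topics. The sketch assumes at least one pair of each kind, otherwise a division by zero occurs.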


Distance Metrics

• Euclidean Distance (L2 norm)

$$L_2(\vec{x},\vec{y}) = \sqrt{\sum_{i=1}^{m}(x_i-y_i)^2}$$

  - Make sure that all attributes/dimensions have the same scale (or the same variance)

• L1 Norm (City-block distance)

$$L_1(\vec{x},\vec{y}) = \sum_{i=1}^{m}|x_i-y_i|$$

• Cosine Similarity (transform to a distance by subtracting from 1)

$$1 - \frac{\vec{x}\cdot\vec{y}}{|\vec{x}|\,|\vec{y}|}$$

  - Ranges between 0 and 1 (for vectors with non-negative components, such as term weights)
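The three measures can be written in a few lines. This is a minimal sketch with assumed function names; it does not rescale the attributes, which the slide recommends doing before using the Euclidean distance.

```python
import math

def l2_distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    """City-block distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms

x, y = [1.0, 0.0], [0.0, 1.0]
distances = (l2_distance(x, y), l1_distance(x, y), cosine_distance(x, y))
```

For the two orthogonal unit vectors in the example, the cosine distance reaches its maximum for non-negative vectors, while scaling either vector would change the L1 and L2 distances but not the cosine distance.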

Measures of Cluster Similarity

• Especially for the bottom-up approaches

• Single-link clustering
  - The similarity between two clusters is the similarity of the two closest objects in the clusters
  - Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity
  - Elongated clusters are achieved

$$sim(c_i,c_j) = \max_{\vec{x}\in c_i,\ \vec{y}\in c_j} sim(\vec{x},\vec{y})$$

[Figure: clusters $C_i$ and $C_j$ linked by their pair of greatest similarity]

cf. the minimal spanning tree

Measures of Cluster Similarity (cont)

• Complete-link clustering
  - The similarity between two clusters is the similarity of their two most dissimilar members
  - Sphere-shaped clusters are achieved
  - Preferable for most IR and NLP applications

$$sim(c_i,c_j) = \min_{\vec{x}\in c_i,\ \vec{y}\in c_j} sim(\vec{x},\vec{y})$$

[Figure: clusters $C_i$ and $C_j$ linked by their pair of least similarity]
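Single-link and complete-link similarity differ only in taking the maximum or the minimum over cross-cluster pairs. The sketch below uses cosine similarity as the object-level measure; the function names and the toy clusters are assumptions.

```python
import math

def cosine(x, y):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def single_link(c_i, c_j, sim=cosine):
    """Similarity of the two closest objects across the clusters."""
    return max(sim(x, y) for x in c_i for y in c_j)

def complete_link(c_i, c_j, sim=cosine):
    """Similarity of the two most dissimilar objects across the clusters."""
    return min(sim(x, y) for x in c_i for y in c_j)

c_i = [[1.0, 0.0]]
c_j = [[1.0, 1.0], [0.0, 1.0]]
link_scores = (single_link(c_i, c_j), complete_link(c_i, c_j))
```

Both functions examine every cross-cluster pair, so a direct implementation costs on the order of $|c_i|\cdot|c_j|$ similarity evaluations per cluster pair.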

Measures of Cluster Similarity (cont)

[Figure: the same data clustered with single link and with complete link]

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering
  - A compromise between single-link and complete-link clustering
  - The similarity between two clusters is the average similarity between members
  - If the objects are represented as length-normalized vectors and the similarity measure is the cosine
    • There exists a fast algorithm for computing the average similarity

$$sim(\vec{x},\vec{y}) = \cos(\vec{x},\vec{y}) = \frac{\vec{x}\cdot\vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x}\cdot\vec{y} \quad\text{(length-normalized vectors)}$$

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

  - The average similarity SIM between vectors in a cluster $c_j$ is defined as

$$SIM(c_j) = \frac{1}{|c_j|(|c_j|-1)}\sum_{\vec{x}\in c_j}\ \sum_{\vec{y}\in c_j,\ \vec{y}\ne\vec{x}} sim(\vec{x},\vec{y})
 = \frac{1}{|c_j|(|c_j|-1)}\sum_{\vec{x}\in c_j}\ \sum_{\vec{y}\in c_j,\ \vec{y}\ne\vec{x}} \vec{x}\cdot\vec{y}$$

  - The sum of members in a cluster $c_j$

$$\vec{s}(c_j) = \sum_{\vec{x}\in c_j}\vec{x}$$

  - Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$

$$\begin{aligned}
\vec{s}(c_j)\cdot\vec{s}(c_j) &= \sum_{\vec{x}\in c_j}\vec{x}\cdot\vec{s}(c_j) = \sum_{\vec{x}\in c_j}\sum_{\vec{y}\in c_j}\vec{x}\cdot\vec{y} \\
 &= |c_j|(|c_j|-1)\,SIM(c_j) + \sum_{\vec{x}\in c_j}\vec{x}\cdot\vec{x} \\
 &= |c_j|(|c_j|-1)\,SIM(c_j) + |c_j| \qquad (\vec{x}\cdot\vec{x}=1 \text{ for a length-normalized vector}) \\
\therefore\ SIM(c_j) &= \frac{\vec{s}(c_j)\cdot\vec{s}(c_j) - |c_j|}{|c_j|(|c_j|-1)}
\end{aligned}$$
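The sum-vector shortcut can be checked against the pairwise definition. This is a small sketch; the function names and the three example vectors are assumptions.

```python
import math
from itertools import permutations

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def sim_pairwise(cluster):
    """SIM(c_j) from its definition: average over all ordered pairs x != y."""
    n = len(cluster)
    return sum(dot(x, y) for x, y in permutations(cluster, 2)) / (n * (n - 1))

def sim_from_sum_vector(cluster):
    """SIM(c_j) = (s.s - |c_j|) / (|c_j| (|c_j| - 1)), valid for unit-length members."""
    n = len(cluster)
    s = [sum(column) for column in zip(*cluster)]   # s(c_j): sum of the members
    return (dot(s, s) - n) / (n * (n - 1))

cluster = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
slow, fast = sim_pairwise(cluster), sim_from_sum_vector(cluster)
```

The pairwise version needs a number of dot products that grows quadratically with the cluster size, while the sum-vector version needs one pass to build $\vec{s}(c_j)$ and a single dot product.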

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

  - When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

$$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$

  - The average similarity for their union will be

$$SIM(c_i\cup c_j) = \frac{(\vec{s}(c_i)+\vec{s}(c_j))\cdot(\vec{s}(c_i)+\vec{s}(c_j)) - (|c_i|+|c_j|)}{(|c_i|+|c_j|)(|c_i|+|c_j|-1)}$$

[Figure: clusters $C_i$ and $C_j$ with sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$, merged into a cluster of size $|c_i|+|c_j|$]
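Because only the sum vector and the size are needed, a merge can be scored without touching the individual members. A minimal sketch, with assumed function names and toy unit vectors:

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def group_average(sum_vector, size):
    """SIM of a cluster of unit-length vectors, from its sum vector and size."""
    return (dot(sum_vector, sum_vector) - size) / (size * (size - 1))

def merge(sum_i, size_i, sum_j, size_j):
    """Return the sum vector, size, and average similarity of the union c_i U c_j."""
    sum_new = [a + b for a, b in zip(sum_i, sum_j)]
    size_new = size_i + size_j
    return sum_new, size_new, group_average(sum_new, size_new)

# c_i = {(1, 0)} and c_j = {(0, 1), (0, 1)}, all of unit length.
sum_new, size_new, sim_union = merge([1.0, 0.0], 1, [0.0, 2.0], 2)
```

In an agglomerative loop, the pair of clusters with the highest `sim_union` would be merged next, and the returned sum vector and size are stored for the new cluster.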

Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values
  - E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of 22 words. "be" has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  - Single-link measure
  - Complete-link measure
  - Group-average measure

• How to split a cluster
  - Splitting is itself a clustering task (finding two sub-clusters)
  - Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm

[Figure: algorithm listing. At each step, split the least coherent cluster, then generate two new clusters and remove the original one]

Non-Hierarchical Clustering

Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  - In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  - When to stop (possible criteria: group average similarity, likelihood, mutual information)
  - What is the right number of clusters (k-1 → k → k+1)
  - Hierarchical clustering also has to face this problem

• Algorithms introduced here
  - The K-means algorithm
  - The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  - A hard clustering algorithm
  - Define clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  - A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)
  - E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors)
    • A compression rate of 3

$$\mathbf{X} = \{\mathbf{x}^t\}_{t=1}^{n} \ \xrightarrow{\ F\ }\ \text{index } j, \qquad \{\mathbf{m}_j\}_{j=1}^{k}$$

where $\mathbf{m}_j$ is the cluster centroid (also called reference vector, code word, or code vector). In the color example, $Dim(\mathbf{x}^t)=24 \rightarrow k=2^8$.

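The K-means Algorithm (cont)

Total reconstruction error:

$$E\left(\{\mathbf{m}_i\}_{i=1}^{k}\,\middle|\,\mathbf{X}\right) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t\,\lVert\mathbf{x}^t-\mathbf{m}_i\rVert^2,
\quad\text{where}\quad
b_i^t = \begin{cases} 1 & \text{if } \lVert\mathbf{x}^t-\mathbf{m}_i\rVert = \min_j \lVert\mathbf{x}^t-\mathbf{m}_j\rVert \\ 0 & \text{otherwise} \end{cases}$$

  - $b_i^t$ (the label) and $\mathbf{m}_i$ are unknown
  - $b_i^t$ depends on $\mathbf{m}_i$, and this optimization problem cannot be solved analytically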

The K-means Algorithm (cont)

• Initialization
  - A set of initial cluster centers $\{\mathbf{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  - Assign each object $\mathbf{x}^t$ to the cluster whose center is closest

$$b_i^t = \begin{cases} 1 & \text{if } \lVert\mathbf{x}^t-\mathbf{m}_i\rVert = \min_j \lVert\mathbf{x}^t-\mathbf{m}_j\rVert \\ 0 & \text{otherwise} \end{cases}$$

  - Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$\mathbf{m}_i = \frac{\sum_{t=1}^{N} b_i^t\,\mathbf{x}^t}{\sum_{t=1}^{N} b_i^t}$$

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

These two steps are repeated until $\mathbf{m}_i$ stabilizes
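The two alternating steps can be sketched as follows. This is an illustrative implementation, not the one from the slides: the function names, the `max_iter` cap, and the choice to keep the old center of an empty cluster are assumptions.

```python
import math

def assign(points, centers):
    """Label each point with the index of its closest center (b_i^t = 1)."""
    return [min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
            for x in points]

def recompute(points, labels, centers):
    """Move each center to the mean of its members."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [x for x, label in zip(points, labels) if label == i]
        if members:
            new_centers.append(tuple(sum(x[d] for x in members) / len(members)
                                     for d in range(len(old))))
        else:
            new_centers.append(old)   # assumption: an empty cluster keeps its center
    return new_centers

def reconstruction_error(points, labels, centers):
    """Total squared distance of each point to its assigned center."""
    return sum(math.dist(x, centers[label]) ** 2 for x, label in zip(points, labels))

def k_means(points, centers, max_iter=100):
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        new_centers = recompute(points, assign(points, centers), centers)
        if new_centers == centers:    # the centers have stabilized
            break
        centers = new_centers
    return centers, assign(points, centers)

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
centers, labels = k_means(points, [(0, 0), (5, 5)])
error = reconstruction_error(points, labels, centers)
```

Each iteration cannot increase the total reconstruction error, which is why the loop stops once the centers no longer move.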

The K-means Algorithm (cont)

• Algorithm

The K-means Algorithm (cont)

• Example 1

The K-means Algorithm (cont)

• Example 2

[Figure labels: government, finance, sports, research, name]

The K-means Algorithm (cont)

• Choice of initial cluster centers (seeds) is important
  - Pick at random
  - Calculate the mean $\mathbf{m}$ of all data and generate k initial centers $\mathbf{m}_i$ by adding a small random vector to the mean: $\mathbf{m}_i = \mathbf{m} \pm \delta$
  - Project data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of data in each group as the initial center $\mathbf{m}_i$
  - Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  - Randomly assign the object to one of the candidate clusters
  - Or perturb objects slightly

• Applications of the K-means algorithm
  - Clustering
  - Vector quantization
  - A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with $l = \log_2 k$ (k clusters)

[Figure labels: nodes on the hypercube; a linear classifier]

The K-means Algorithm (cont)

• E.g., the LBG algorithm
  - By Linde, Buzo, and Gray
  - The number of clusters doubles (M → 2M) at each iteration

[Figure labels: global mean; cluster 1 mean; cluster 2 mean; {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}]
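The M → 2M doubling can be sketched on one-dimensional data: start from the global mean, split every center into two slightly perturbed copies, refine with K-means, and repeat. The function names, the perturbation `delta`, and the toy data are assumptions, and the sketch only reaches codebook sizes that are powers of two.

```python
def refine(points, centers, max_iter=50):
    """Plain 1-D K-means refinement of the given centers."""
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        # An empty group keeps its old center (an assumption of this sketch).
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers

def lbg(points, target_size, delta=0.01):
    """Grow the codebook 1 -> 2 -> 4 -> ... until it has target_size centers."""
    centers = [sum(points) / len(points)]          # global mean
    while len(centers) < target_size:
        # Split every center m into m - delta and m + delta (M -> 2M).
        centers = [c + sign * delta for c in centers for sign in (-1, 1)]
        centers = refine(points, centers)
    return sorted(centers)

codebook = lbg([1.0, 1.2, 5.0, 5.2, 9.0, 9.2, 13.0, 13.2], target_size=4)
```

Splitting existing centers avoids the dependence on random seeds, since every stage starts from the refined solution of the previous, smaller codebook.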

The EM Algorithm

• A soft version of the K-means algorithm
  - Each object could be a member of multiple clusters
  - Clustering as estimating a mixture of (continuous) probability distributions

[Figure: A Mixture of Gaussians (or a mixture-Gaussian HMM). The component likelihoods $P(\vec{x}_i|c_1), P(\vec{x}_i|c_2), \ldots, P(\vec{x}_i|c_K)$ are weighted by $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$ and summed]

$$P(\vec{x}_i|\Theta) = \sum_{k=1}^{K} P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)$$

Continuous case:

$$P(\vec{x}_i|c_k,\Theta) = \frac{1}{(2\pi)^{m/2}|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^T\Sigma_k^{-1}(\vec{x}_i-\vec{\mu}_k)\right)$$

Likelihood function for data samples $\mathbf{X} = \{\vec{x}_1,\vec{x}_2,\ldots,\vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.):

$$P(\mathbf{X}|\Theta) = \prod_{i=1}^{n} P(\vec{x}_i|\Theta) = \prod_{i=1}^{n}\sum_{k=1}^{K} P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)$$

Classification:

$$\max_k P(c_k|\vec{x}_i,\Theta) = \max_k \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{P(\vec{x}_i|\Theta)} = \max_k P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)$$
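The mixture density and the classification rule can be sketched in one dimension, where the covariance matrix reduces to a variance. The function names and the two-component parameters are assumptions for illustration.

```python
import math

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density (the m = 1 case of the multivariate formula)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, priors, means, variances):
    """P(x | Theta) = sum_k P(x | c_k, Theta) P(c_k | Theta)."""
    return sum(p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances))

def classify(x, priors, means, variances):
    """argmax_k P(x | c_k, Theta) P(c_k | Theta); the evidence P(x | Theta) cancels."""
    scores = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical two-component mixture.
priors, means, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
density_at_2 = mixture_density(2.0, priors, means, variances)
assigned = [classify(x, priors, means, variances) for x in (1.0, 3.5)]
```

The denominator $P(\vec{x}_i|\Theta)$ is the same for every class, which is why `classify` compares the unnormalized products only.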

The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment

[Figure: four samples assigned to State S1, two B and two W]

P(B|S1) = 2/4 = 0.5

P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

Sample   State S1   State S2
B        0.7        0.3
W        0.4        0.6
B        0.9        0.1
W        0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64

P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36

P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27

P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
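The soft-assignment estimates are weighted relative frequencies, which a few lines of code can reproduce. The function name and the data layout are assumptions; the weights are the ones from the example above.

```python
def soft_mle(observations, memberships):
    """Estimate P(symbol | state) from fractional state memberships.

    observations[t] is the symbol of sample t; memberships[t][s] is the
    weight with which sample t belongs to state s.
    """
    estimates = []
    for s in range(len(memberships[0])):
        total = sum(row[s] for row in memberships)
        probs = {}
        for symbol, row in zip(observations, memberships):
            probs[symbol] = probs.get(symbol, 0.0) + row[s] / total
        estimates.append(probs)
    return estimates

observations = ["B", "W", "B", "W"]
memberships = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]
p_s1, p_s2 = soft_mle(observations, memberships)
```

With memberships restricted to 0 and 1, the same function reduces to the hard-assignment counts.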

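The EM Algorithm (cont)

• E-step (Expectation)
  - Derive the complete data likelihood function

$$\begin{aligned}
P(\mathbf{X}|\Theta) &= \prod_{i=1}^{n} P(\vec{x}_i|\Theta) = \prod_{i=1}^{n}\sum_{k=1}^{K} P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta) \\
 &= \left[P(\vec{x}_1|c_1,\Theta)P(c_1|\Theta)+\cdots+P(\vec{x}_1|c_K,\Theta)P(c_K|\Theta)\right]\times\cdots\times\left[P(\vec{x}_n|c_1,\Theta)P(c_1|\Theta)+\cdots+P(\vec{x}_n|c_K,\Theta)P(c_K|\Theta)\right] \\
 &= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \prod_{i=1}^{n} P(\vec{x}_i|c_{k_i},\Theta)\,P(c_{k_i}|\Theta) \\
 &= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \prod_{i=1}^{n} P(\vec{x}_i,c_{k_i}|\Theta) \\
 &= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} P(\vec{x}_1,c_{k_1},\vec{x}_2,c_{k_2},\ldots,\vec{x}_n,c_{k_n}|\Theta) \\
 &= \sum_{\mathbf{C}} P(\mathbf{X},\mathbf{C}|\Theta)
\end{aligned}$$

Here $P(\mathbf{X}|\Theta)$ is the likelihood function and $P(\mathbf{X},\mathbf{C}|\Theta)$ is the complete data likelihood function, with

$$\mathbf{X} = \vec{x}_1,\vec{x}_2,\ldots,\vec{x}_{n-1},\vec{x}_n, \qquad \mathbf{C} = c_{k_1},c_{k_2},\ldots,c_{k_{n-1}},c_{k_n}$$

How many kinds of $\mathbf{C}$ are there? ($K^n$ kinds)

Note:

$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{t,k} = (a_{1,1}+a_{1,2}+\cdots+a_{1,M})(a_{2,1}+a_{2,2}+\cdots+a_{2,M})\cdots(a_{T,1}+a_{T,2}+\cdots+a_{T,M}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\ \prod_{t=1}^{T} a_{t,k_t}$$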
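The product-of-sums identity used in the note, and the count of $K^n$ alignments, can be checked numerically on a small table. The function names and the values of `a` are assumptions for illustration.

```python
from itertools import product
from math import isclose, prod

def product_of_sums(a):
    """prod_t sum_k a[t][k]"""
    return prod(sum(row) for row in a)

def sum_over_alignments(a):
    """sum over all (k_1, ..., k_T) of prod_t a[t][k_t]"""
    n_choices = len(a[0])
    return sum(
        prod(a[t][k] for t, k in enumerate(alignment))
        for alignment in product(range(n_choices), repeat=len(a))
    )

a = [[0.2, 0.3], [0.5, 0.1], [0.4, 0.6]]   # T = 3 rows, M = 2 choices per row
lhs, rhs = product_of_sums(a), sum_over_alignments(a)
n_alignments = len(list(product(range(2), repeat=3)))
```

The right-hand side enumerates every alignment explicitly, so its cost grows exponentially with the number of rows; the left-hand side gives the same value in linear time.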

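The EM Algorithm (cont)

• E-step (Expectation)
  - Define the auxiliary function $\Phi(\Theta,\hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $\mathbf{C}$, conditioned on the known data $(\mathbf{X},\Theta)$

$$\begin{aligned}
\Phi(\Theta,\hat{\Theta}) &= E[L_{CM}] = E_{\mathbf{C}}\left[\log P(\mathbf{X},\mathbf{C}|\hat{\Theta})\,\middle|\,\mathbf{X},\Theta\right] \\
 &= \sum_{\mathbf{C}} P(\mathbf{C}|\mathbf{X},\Theta)\log P(\mathbf{X},\mathbf{C}|\hat{\Theta}) \\
 &= \sum_{\mathbf{C}} \frac{P(\mathbf{X},\mathbf{C}|\Theta)}{P(\mathbf{X}|\Theta)}\log P(\mathbf{X},\mathbf{C}|\hat{\Theta})
\end{aligned}$$

where $\Theta$ is known and $\hat{\Theta}$ is unknown.

  - Maximize the log likelihood function $\log P(\mathbf{X}|\hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta,\hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta,\hat{\Theta}) - \Phi(\Theta,\Theta) \le \log P(\mathbf{X}|\hat{\Theta}) - \log P(\mathbf{X}|\Theta)$$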

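The EM Algorithm (cont)

• E-step (Expectation)
  - The auxiliary function $\Phi(\Theta,\hat{\Theta})$

$$\begin{aligned}
\Phi(\Theta,\hat{\Theta}) &= \sum_{\mathbf{C}} \frac{P(\mathbf{X},\mathbf{C}|\Theta)}{P(\mathbf{X}|\Theta)}\log P(\mathbf{X},\mathbf{C}|\hat{\Theta}) = \sum_{\mathbf{C}} P(\mathbf{C}|\mathbf{X},\Theta)\log P(\mathbf{X},\mathbf{C}|\hat{\Theta}) \\
 &= \sum_{\mathbf{C}}\left[\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\right]\log\left[\prod_{i=1}^{n} P(\vec{x}_i,c_{k_i}|\hat{\Theta})\right] \\
 &= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\left[\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\right]\sum_{i=1}^{n}\log P(\vec{x}_i,c_{k_i}|\hat{\Theta}) \\
 &= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\left[\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\right]\sum_{i=1}^{n}\sum_{k=1}^{K}\delta_{k_i,k}\log P(\vec{x}_i,c_k|\hat{\Theta}) \\
 &= \sum_{i=1}^{n}\sum_{k=1}^{K}\log P(\vec{x}_i,c_k|\hat{\Theta})\left\{\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k_i,k}\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\right\} \\
 &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k|\vec{x}_i,\Theta)\log P(\vec{x}_i,c_k|\hat{\Theta}) \\
 &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k|\vec{x}_i,\Theta)\log\left[P(c_k|\hat{\Theta})\,P(\vec{x}_i|c_k,\hat{\Theta})\right] \\
 &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k|\vec{x}_i,\Theta)\log P(c_k|\hat{\Theta}) + \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k|\vec{x}_i,\Theta)\log P(\vec{x}_i|c_k,\hat{\Theta})
\end{aligned}$$

where

$$\delta_{k_i,k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}$$

The term in braces reduces to $P(c_k|\vec{x}_i,\Theta)$ (see next slide).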

The EM Algorithm (cont)

  - Note that

$$\begin{aligned}
\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k_i,k}\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)
 &= \left[\prod_{j=1,\,j\ne i}^{n}\ \sum_{k_j=1}^{K} P(c_{k_j}|\vec{x}_j,\Theta)\right] P(c_k|\vec{x}_i,\Theta) \\
 &= 1\cdot P(c_k|\vec{x}_i,\Theta) \\
 &= P(c_k|\vec{x}_i,\Theta)
\end{aligned}$$

($\vec{x}_i$ can only be aligned to $c_k$)

Note:

$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{t,k} = (a_{1,1}+a_{1,2}+\cdots+a_{1,M})(a_{2,1}+a_{2,2}+\cdots+a_{2,M})\cdots(a_{T,1}+a_{T,2}+\cdots+a_{T,M}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\ \prod_{t=1}^{T} a_{t,k_t}$$

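The EM Algorithm (cont)

• E-step (Expectation)
  - The auxiliary function can also be divided into two parts

$$\Phi(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + \Phi_b(\Theta,\hat{\Theta})$$

where

$$\begin{aligned}
\Phi_a(\Theta,\hat{\Theta}) &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k|\vec{x}_i,\Theta)\log P(c_k|\hat{\Theta})
 = \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i,c_k|\Theta)}{P(\vec{x}_i|\Theta)}\log P(c_k|\hat{\Theta}) \\
 &= \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\log P(c_k|\hat{\Theta})
\end{aligned}$$

(auxiliary function for mixture weights)

$$\begin{aligned}
\Phi_b(\Theta,\hat{\Theta}) &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k|\vec{x}_i,\Theta)\log P(\vec{x}_i|c_k,\hat{\Theta}) \\
 &= \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\log P(\vec{x}_i|c_k,\hat{\Theta})
\end{aligned}$$

(auxiliary function for cluster distributions)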

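The EM Algorithm (cont)

• M-step (Maximization)
  - Remember that
    • Maximize a function F with a constraint by applying a Lagrange multiplier

Suppose that

$$F = \sum_{j=1}^{N} w_j\log y_j \quad\text{with the constraint}\quad \sum_{j=1}^{N} y_j = 1$$

By applying a Lagrange multiplier $\ell$:

$$\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j\log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right) \\
\frac{\partial\hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \ \Rightarrow\ w_j = -\ell\,y_j \quad \forall j \\
\Rightarrow\ \sum_{j=1}^{N} w_j &= -\ell\sum_{j=1}^{N} y_j = -\ell \ \Rightarrow\ \ell = -\sum_{j=1}^{N} w_j \\
\therefore\ y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}$$

Note:

$$\frac{\partial\log y_j}{\partial y_j} = \frac{1}{y_j}$$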

IR ndash Berlin Chen 45
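The closed-form solution of the constrained maximization above can be sanity-checked numerically. This is a minimal sketch, assuming NumPy and an arbitrary weight vector `w`; it compares the Lagrange solution against random points on the probability simplex (the helper names are illustrative).

```python
import numpy as np

def constrained_argmax(w):
    """Closed form from the Lagrange derivation: y_j = w_j / sum_j w_j."""
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def objective(w, y):
    """F = sum_j w_j log y_j."""
    return float(np.sum(np.asarray(w) * np.log(y)))

rng = np.random.default_rng(0)
w = np.array([2.0, 1.0, 0.5])
y_star = constrained_argmax(w)

# Random points satisfying sum_j y_j = 1 should never score higher than y_star.
candidates = rng.dirichlet(np.ones(len(w)), size=1000)
best_random = max(objective(w, y) for y in candidates)
```

The same result is what turns the expected counts into normalized mixture weights in the M-step.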

The EM Algorithm (cont)

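• M-step (Maximization)
  – Maximize $\Phi_a(\Theta,\hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\bar{\Phi}_a(\Theta,\hat{\Theta})
&= \Phi_a(\Theta,\hat{\Theta})+\ell\left(\sum_{k=1}^{K}P(c_k\mid\hat{\Theta})-1\right)\\
&= \sum_{k=1}^{K}\underbrace{\left[\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)P(c_l\mid\Theta)}\right]}_{w_k}\log\underbrace{P(c_k\mid\hat{\Theta})}_{y_k}
+\ell\left(\sum_{k=1}^{K}P(c_k\mid\hat{\Theta})-1\right)
\end{aligned}
$$

$$
\Rightarrow\;\hat{\pi}_k=P(c_k\mid\hat{\Theta})=\frac{w_k}{\sum_{k'=1}^{K}w_{k'}}
=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)P(c_l\mid\Theta)}}{\displaystyle\sum_{k'=1}^{K}\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_{k'},\Theta)P(c_{k'}\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)P(c_l\mid\Theta)}}
=\frac{1}{n}\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)P(c_l\mid\Theta)}
$$

$w_k$: the expected number of times that the $\vec{x}_i$ fall in class $c_k$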

The EM Algorithm (cont)

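• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$
\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)P(c_l\mid\Theta)}\,\log P(\vec{x}_i\mid c_k,\hat{\Theta})
$$

$$
P(\vec{x}_i\mid c_k,\Theta)=\frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^{T}\Sigma_k^{-1}(\vec{x}_i-\vec{\mu}_k)\right)
$$

Let

$$
w_{ik}=\frac{P(\vec{x}_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)P(c_l\mid\Theta)}
\quad\text{and}\quad
\log P(\vec{x}_i\mid c_k,\hat{\Theta})=-\frac{m}{2}\log(2\pi)-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)
$$

$$
\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right]+D
\qquad(D\text{: constant})
$$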

The EM Algorithm (cont)

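• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$
\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right]+D
$$

$$
\frac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\vec{\mu}}_k}
=-\frac{1}{2}\sum_{i=1}^{n}w_{ik}\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1)=0
$$

$$
\Rightarrow\;\hat{\vec{\mu}}_k=\frac{\sum_{i=1}^{n}w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n}w_{ik}}
=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)P(c_l\mid\Theta)}\cdot\vec{x}_i}{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)P(c_l\mid\Theta)}}
$$

Note: $\dfrac{d(\vec{x}^{T}C\vec{x})}{d\vec{x}}=(C+C^{T})\vec{x}$, and $\Sigma_k^{-1}$ is symmetric here.

$w_{ik}$: the expected number of times that $\vec{x}_i$ falls in class $c_k$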

The EM Algorithm (cont)

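• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right]+D
$$

Notes ($\Sigma_k$ is symmetric here):

$$
\frac{d\,\det(X)}{dX}=\det(X)\cdot X^{-T},
\qquad
\frac{d\,(\vec{a}^{T}X^{-1}\vec{b})}{dX}=-X^{-1}\vec{a}\,\vec{b}^{T}X^{-1}
$$

$$
\begin{aligned}
\frac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\Sigma}_k}
&=-\frac{1}{2}\sum_{i=1}^{n}w_{ik}\left[\frac{1}{|\hat{\Sigma}_k|}\cdot|\hat{\Sigma}_k|\cdot\hat{\Sigma}_k^{-1}-\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\right]=0\\
&\Rightarrow\;\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k^{-1}=\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\\
&\Rightarrow\;\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k=\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\hat{\Sigma}_k\\
&\Rightarrow\;\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k=\sum_{i=1}^{n}w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\\
&\Rightarrow\;\hat{\Sigma}_k=\frac{\sum_{i=1}^{n}w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n}w_{ik}}
=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)P(c_l\mid\Theta)}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)P(c_l\mid\Theta)}}
\end{aligned}
$$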

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X\mid\Theta)$ has converged or the maximum number of iterations is reached
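The E-step and the three M-step updates (mixture weights, means, covariances) can be put together in a short loop. The following is a minimal sketch, assuming NumPy, full covariance matrices, and initial means supplied by the caller (for instance from K-means). The small ridge term `reg` added to each covariance is an assumption made here for numerical stability and is not part of the slides; the function names are illustrative.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate Gaussian density evaluated at each row of X."""
    m = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff)
    norm = np.sqrt((2 * np.pi) ** m * np.linalg.det(cov))
    return np.exp(expo) / norm

def em_gmm(X, means, n_iter=50, tol=1e-6, reg=1e-6):
    """EM for a mixture of Gaussians, started from the given initial means."""
    n, m = X.shape
    K = len(means)
    means = np.array(means, dtype=float)
    # Start every component from the global covariance and uniform weights.
    covs = np.array([np.cov(X.T).reshape(m, m) + reg * np.eye(m)] * K)
    weights = np.full(K, 1.0 / K)
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: w[i, k] = P(c_k | x_i, Theta)
        joint = np.column_stack(
            [weights[k] * gaussian_pdf(X, means[k], covs[k]) for k in range(K)])
        loglik = float(np.log(joint.sum(axis=1)).sum())
        w = joint / joint.sum(axis=1, keepdims=True)
        # M-step: closed-form updates derived with the auxiliary function
        Nk = w.sum(axis=0)                      # expected counts per cluster
        weights = Nk / n
        means = (w.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (w[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(m)
        if loglik - prev < tol:                 # likelihood has converged
            break
        prev = loglik
    return weights, means, covs, loglik

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
               rng.normal([5, 5], 0.5, size=(100, 2))])
weights, means, covs, loglik = em_gmm(X, means=[[1.0, 1.0], [4.0, 4.0]])
```

On this synthetic two-blob data the estimated means should land near (0, 0) and (5, 5) with roughly equal weights; on real data the outcome depends on the initialization.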

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]

$$
P(w_j\mid D_i)=\sum_{k=1}^{K}P(T_k\mid D_i)\left[\sum_{l=1}^{K}P(T_l\mid Y_k)\,P(w_j\mid T_l)\right]
$$

$$
P(T_l\mid Y_k)=\frac{E(T_k,T_l)}{\sum_{s=1}^{K}E(T_k,T_s)},
\qquad
E(T_k,T_l)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\right],
\qquad
dist(T_i,T_j)=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2}
$$

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\,\log P(w_j\mid D_i)\\
&= \sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\,\log\left\{\sum_{k=1}^{K}P(T_k\mid D_i)\left[\sum_{l=1}^{K}P(T_l\mid Y_k)\,P(w_j\mid T_l)\right]\right\}
\end{aligned}
$$

  – EM training can be performed

$$
\hat{P}(w_j\mid T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N}c(w_{j'},D_{i'})\,P(T_k\mid w_{j'},D_{i'})},
\qquad
\hat{P}(T_k\mid D_i)=\frac{\sum_{j=1}^{J}c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{c(D_i)}
$$

where

$$
P(T_k\mid w_j,D_i)=\frac{\left[\sum_{l=1}^{K}P(w_j\mid T_l)\,P(T_l\mid Y_k)\right]\cdot P(T_k\mid D_i)}{\sum_{k'=1}^{K}\left\{\left[\sum_{l=1}^{K}P(w_j\mid T_l)\,P(T_l\mid Y_{k'})\right]\cdot P(T_{k'}\mid D_i)\right\}}
$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$
S(w_j,T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k\mid D_i)}{\sum_{i'=1}^{N}c(w_j,D_{i'})\,\left[1-P(T_k\mid D_{i'})\right]}
$$

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

Hierarchical Document Organization (cont)

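• Self-Organization Map (SOM)
  – A recursive regression process

[Figure: an input layer fully connected to a mapping layer]

Input vector: $\vec{x}=[x_1,x_2,\ldots,x_n]^{T}$

Weight vector of map unit $i$: $\vec{m}_i=[m_{i,1},m_{i,2},\ldots,m_{i,n}]^{T}$ (e.g. $\vec{m}_1=[m_{1,1},m_{1,2},\ldots,m_{1,n}]^{T}$)

$$
\vec{m}_i(t+1)=\vec{m}_i(t)+h_{c(\vec{x}),i}(t)\,[\vec{x}(t)-\vec{m}_i(t)]
$$

where

$$
c(\vec{x})=\arg\min_i\lVert\vec{x}-\vec{m}_i\rVert,
\qquad
\lVert\vec{x}-\vec{m}_i\rVert=\sqrt{\sum_{n'}(x_{n'}-m_{i,n'})^2}
$$

$$
h_{c(\vec{x}),i}(t)=\alpha(t)\exp\left(-\frac{\lVert\vec{r}_i-\vec{r}_{c(\vec{x})}\rVert^2}{2\sigma^2(t)}\right)
$$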
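A single step of the SOM update rule can be sketched as follows. This assumes NumPy, a tiny 2x2 map whose unit coordinates are stored in `grid`, and fixed values for the learning rate `alpha` and the neighborhood width `sigma`; in practice both usually decay with t. All names are illustrative.

```python
import numpy as np

def som_step(weights, grid, x, alpha, sigma):
    """One SOM update: find the best-matching unit, then pull every unit toward x."""
    # c(x) = argmin_i ||x - m_i||
    c = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    # h_{c(x),i} = alpha * exp(-||r_i - r_c||^2 / (2 sigma^2))
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)
    h = alpha * np.exp(-d2 / (2.0 * sigma ** 2))
    # m_i(t+1) = m_i(t) + h_{c(x),i}(t) [x(t) - m_i(t)]
    return weights + h[:, None] * (x - weights), c

# Map coordinates r_i of four units and their current weight vectors m_i.
grid = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
weights = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
x = np.array([0.9, 0.8])
new_weights, bmu = som_step(weights, grid, x, alpha=0.5, sigma=1.0)
```

The winning unit moves the most, and units farther away on the map move less, which is what keeps neighboring units similar.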

Hierarchical Document Organization (cont)

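• Results

Model   Iterations   R_Dist = dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$
R_{Dist}=\frac{dist_{Between}}{dist_{Within}}
$$

where

$$
dist_{Between}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Between}(i,j)},
\qquad
f_{Between}(i,j)=\begin{cases}dist_{Map}(i,j) & T_{r,i}\neq T_{r,j}\\ 0 & \text{otherwise}\end{cases},
\qquad
C_{Between}(i,j)=\begin{cases}1 & T_{r,i}\neq T_{r,j}\\ 0 & \text{otherwise}\end{cases}
$$

$$
dist_{Within}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Within}(i,j)},
\qquad
f_{Within}(i,j)=\begin{cases}dist_{Map}(i,j) & T_{r,i}=T_{r,j}\\ 0 & \text{otherwise}\end{cases},
\qquad
C_{Within}(i,j)=\begin{cases}1 & T_{r,i}=T_{r,j}\\ 0 & \text{otherwise}\end{cases}
$$

$$
dist_{Map}(i,j)=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2}
$$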
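The distance ratio can be computed directly from its definition. The sketch below assumes NumPy, that each document i has a map position (x_i, y_i) stored in `positions`, and that `topics` holds its reference topic label T_{r,i}; the toy data and names are hypothetical and not taken from the experiment.

```python
import numpy as np

def distance_ratio(positions, topics):
    """R_Dist: mean map distance of between-topic pairs over that of within-topic pairs."""
    between, within = [], []
    D = len(positions)
    for i in range(D):
        for j in range(i + 1, D):
            d = float(np.linalg.norm(positions[i] - positions[j]))  # dist_Map(i, j)
            (within if topics[i] == topics[j] else between).append(d)
    return np.mean(between) / np.mean(within)

positions = np.array([[0, 0], [0, 1], [3, 0], [3, 1]], dtype=float)
topics = ["a", "a", "b", "b"]
r = distance_ratio(positions, topics)
```

A larger ratio means that documents of different topics sit farther apart on the map than documents of the same topic.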


Measures of Cluster Similarity

• Especially for the bottom-up approaches

• Single-link clustering
  – The similarity between two clusters is the similarity of the two closest objects in the clusters
  – Search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity
  – Elongated clusters are achieved (cf. the minimal spanning tree)

$$
sim(c_i,c_j)=\max_{\vec{x}\in c_i,\;\vec{y}\in c_j}sim(\vec{x},\vec{y})
$$

[Figure: clusters C_i and C_j linked by the pair with the greatest similarity]

Measures of Cluster Similarity (cont)

• Complete-link clustering
  – The similarity between two clusters is the similarity of their two most dissimilar members
  – Sphere-shaped clusters are achieved
  – Preferable for most IR and NLP applications

$$
sim(c_i,c_j)=\min_{\vec{x}\in c_i,\;\vec{y}\in c_j}sim(\vec{x},\vec{y})
$$

[Figure: clusters C_i and C_j linked by the pair with the least similarity]
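The two cluster similarity measures differ only in whether the maximum or the minimum is taken over all cross-cluster pairs. A minimal sketch, assuming NumPy and cosine similarity as the object-level measure (the slides leave `sim` generic; the function names are illustrative):

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def single_link(ci, cj, sim=cosine):
    """Similarity of the two closest (most similar) members."""
    return max(sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj, sim=cosine):
    """Similarity of the two most dissimilar members."""
    return min(sim(x, y) for x in ci for y in cj)

ci = np.array([[1.0, 0.0], [1.0, 1.0]])
cj = np.array([[0.0, 1.0], [1.0, 2.0]])
s_single = single_link(ci, cj)
s_complete = complete_link(ci, cj)
```

Both measures scan all |c_i| x |c_j| pairs, so the complete-link value can never exceed the single-link value for the same pair of clusters.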

Measures of Cluster Similarity (cont)

[Figure: clusterings produced by single link and by complete link]

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering
  – A compromise between single-link and complete-link clustering
  – The similarity between two clusters is the average similarity between members
  – If the objects are represented as length-normalized vectors and the similarity measure is the cosine
    • There exists a fast algorithm for computing the average similarity

$$
sim(\vec{x},\vec{y})=\cos(\vec{x},\vec{y})=\frac{\vec{x}\cdot\vec{y}}{|\vec{x}|\,|\vec{y}|}=\vec{x}\cdot\vec{y}
\qquad\text{(length-normalized vectors)}
$$

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

  – The average similarity SIM between vectors in a cluster $c_j$ is defined as

$$
SIM(c_j)=\frac{1}{|c_j|\,(|c_j|-1)}\sum_{\vec{x}\in c_j}\;\sum_{\vec{y}\in c_j,\;\vec{y}\neq\vec{x}}sim(\vec{x},\vec{y})
=\frac{1}{|c_j|\,(|c_j|-1)}\sum_{\vec{x}\in c_j}\;\sum_{\vec{y}\in c_j,\;\vec{y}\neq\vec{x}}\vec{x}\cdot\vec{y}
$$

  – The sum of members in a cluster $c_j$

$$
\vec{s}(c_j)=\sum_{\vec{x}\in c_j}\vec{x}
$$

  – Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$

$$
\begin{aligned}
\vec{s}(c_j)\cdot\vec{s}(c_j)
&=\sum_{\vec{x}\in c_j}\vec{x}\cdot\vec{s}(c_j)=\sum_{\vec{x}\in c_j}\sum_{\vec{y}\in c_j}\vec{x}\cdot\vec{y}\\
&=|c_j|\,(|c_j|-1)\,SIM(c_j)+\sum_{\vec{x}\in c_j}\vec{x}\cdot\vec{x}\\
&=|c_j|\,(|c_j|-1)\,SIM(c_j)+|c_j|
\qquad(\vec{x}\cdot\vec{x}=1\text{ for a length-normalized vector})\\
\therefore\;SIM(c_j)&=\frac{\vec{s}(c_j)\cdot\vec{s}(c_j)-|c_j|}{|c_j|\,(|c_j|-1)}
\end{aligned}
$$

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

  – When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

$$
\vec{s}(c_{New})=\vec{s}(c_i)+\vec{s}(c_j),
\qquad
|c_{New}|=|c_i|+|c_j|
$$

  – The average similarity for their union will be

$$
SIM(c_i\cup c_j)=\frac{\left(\vec{s}(c_i)+\vec{s}(c_j)\right)\cdot\left(\vec{s}(c_i)+\vec{s}(c_j)\right)-\left(|c_i|+|c_j|\right)}{\left(|c_i|+|c_j|\right)\left(|c_i|+|c_j|-1\right)}
$$
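The shortcut through the cluster sum vectors can be compared against the direct pairwise definition. This is a minimal sketch, assuming NumPy and randomly generated unit-length vectors; the helper names are illustrative.

```python
import numpy as np

def avg_sim_pairwise(vectors):
    """SIM(c) by direct summation over all ordered pairs x != y."""
    n = len(vectors)
    total = sum(float(vectors[a] @ vectors[b])
                for a in range(n) for b in range(n) if a != b)
    return total / (n * (n - 1))

def avg_sim_from_sum(s, size):
    """SIM(c) = (s.s - |c|) / (|c| (|c| - 1)), valid for unit-length vectors only."""
    return (float(s @ s) - size) / (size * (size - 1))

rng = np.random.default_rng(0)

def unit_rows(n, dim):
    """n random vectors of dimension dim, each normalized to length 1."""
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

ci, cj = unit_rows(4, 5), unit_rows(3, 5)
s_i, s_j = ci.sum(axis=0), cj.sum(axis=0)

# Merging only needs the two stored sum vectors and the two cluster sizes.
merged_fast = avg_sim_from_sum(s_i + s_j, len(ci) + len(cj))
merged_slow = avg_sim_pairwise(np.vstack([ci, cj]))
```

The fast version costs one dot product per candidate merge, whereas the direct version is quadratic in the size of the merged cluster.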

Example Word Clustering

• Words (objects) are described and clustered using a set of features and values
  – E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of word clusters; higher nodes correspond to decreasing similarity; "be" has the least similarity with the other 21 words]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure

• How to split a cluster
  – Splitting is itself a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm
  – Split the least coherent cluster
  – Generate two new clusters and remove the original one

Non-Hierarchical Clustering


Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  – When to stop (e.g., based on group average similarity, likelihood, or mutual information)
  – What is the right number of clusters (k-1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Define clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)
  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors)
    • A compression rate of 3

$$
F:\;X=\{\mathbf{x}^t\}_{t=1}^{N}\;\longrightarrow\;\text{index } j,\quad\{\mathbf{m}_j\}_{j=1}^{k}
\qquad(\mathrm{Dim}(\mathbf{x}^t)=24\;\rightarrow\;k=2^8)
$$

where $\mathbf{m}_j$ is the cluster centroid (also called the reference vector, code vector, or code word)

The K-means Algorithm (cont)

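$$
E\left(\{\mathbf{m}_i\}_{i=1}^{k}\mid X\right)=\sum_{t=1}^{N}\sum_{i=1}^{k}b_i^t\,\lVert\mathbf{x}^t-\mathbf{m}_i\rVert^2
\qquad\text{(total reconstruction error)}
$$

where the label $b_i^t$ is

$$
b_i^t=\begin{cases}1 & \text{if }\lVert\mathbf{x}^t-\mathbf{m}_i\rVert=\min_j\lVert\mathbf{x}^t-\mathbf{m}_j\rVert\\ 0 & \text{otherwise}\end{cases}
$$

  – $b_i^t$ and $\mathbf{m}_i$ are unknown
  – $b_i^t$ depends on $\mathbf{m}_i$, and this optimization problem cannot be solved analytically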

The K-means Algorithm (cont)

• Initialization
  – A set of initial cluster centers $\{\mathbf{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  – Assign each object $\mathbf{x}^t$ to the cluster whose center is closest

$$
b_i^t=\begin{cases}1 & \text{if }\lVert\mathbf{x}^t-\mathbf{m}_i\rVert=\min_j\lVert\mathbf{x}^t-\mathbf{m}_j\rVert\\ 0 & \text{otherwise}\end{cases}
$$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$
\mathbf{m}_i=\frac{\sum_{t=1}^{N}b_i^t\,\mathbf{x}^t}{\sum_{t=1}^{N}b_i^t}
$$

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

These two steps are repeated until $\mathbf{m}_i$ stabilizes
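The assignment and re-estimation steps can be sketched in a few lines. This is a minimal illustration, assuming NumPy and caller-supplied initial centers; ties are broken here by taking the lowest cluster index, which is only one of several possible policies, and an empty cluster simply keeps its previous center. The names are illustrative.

```python
import numpy as np

def nearest(X, centers):
    """Index of the closest center for every row of X (ties go to the lowest index)."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, centers, max_iter=100):
    """Alternate hard assignment and centroid re-estimation until the centers stabilize."""
    centers = np.array(centers, dtype=float)
    for _ in range(max_iter):
        labels = nearest(X, centers)                 # assignment step (b_i^t)
        new_centers = centers.copy()
        for i in range(len(centers)):
            members = X[labels == i]
            if len(members) > 0:                     # empty cluster: keep old center
                new_centers[i] = members.mean(axis=0)
        if np.allclose(new_centers, centers):        # centers have stabilized
            break
        centers = new_centers
    labels = nearest(X, centers)
    error = float(np.sum((X - centers[labels]) ** 2))  # total reconstruction error
    return centers, labels, error

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, labels, error = kmeans(X, centers=[[0.0, 0.0], [1.0, 1.0]])
```

On this toy input the loop stops after two passes, with one center per pair of nearby points.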

The K-means Algorithm (cont)

• Algorithm

The K-means Algorithm (cont)

• Example 1

The K-means Algorithm (cont)

• Example 2

[Figure: document clusters labeled government, finance, sports, research, and name]

The K-means Algorithm (cont)

• Choice of initial cluster centers (seeds) is important

  – Pick at random
  – Calculate the mean $\mathbf{m}$ of all data and generate the k initial centers $\mathbf{m}_i$ by adding a small random vector to the mean ($\mathbf{m}\pm\boldsymbol{\delta}$)
  – Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $\mathbf{m}_i$
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly

• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with l = log2 k (k clusters)
    • The clusters become nodes on the hypercube, which can be followed by a linear classifier

The K-means Algorithm (cont)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray
  – M → 2M at each iteration

[Figure: the global mean is split into the cluster 1 mean and the cluster 2 mean; the resulting components are labeled {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}]

The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

[Figure: a mixture Gaussian HMM (or a mixture of Gaussians); $\vec{x}_i$ is generated by summing K components $P(\vec{x}_i\mid c_k)$ weighted by $\pi_k=P(c_k)$, $k=1,\ldots,K$]

$$
P(\vec{x}_i\mid\Theta)=\sum_{k=1}^{K}P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)
$$

Continuous case:

$$
P(\vec{x}_i\mid c_k,\Theta)=\frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^{T}\Sigma_k^{-1}(\vec{x}_i-\vec{\mu}_k)\right)
$$

Likelihood function for the data samples $X=\{\vec{x}_1,\vec{x}_2,\ldots,\vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.):

$$
P(X\mid\Theta)=\prod_{i=1}^{n}P(\vec{x}_i\mid\Theta)=\prod_{i=1}^{n}\sum_{k=1}^{K}P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)
$$

Classification:

$$
\max_k P(c_k\mid\vec{x}_i,\Theta)=\max_k\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{P(\vec{x}_i\mid\Theta)}=\max_k P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)
$$
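The mixture density and the classification rule can be illustrated with a one-dimensional example. This is a minimal sketch using only the standard library; the two components and their parameters are made up for illustration.

```python
import math

def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posteriors(x, priors, mus, variances):
    """P(c_k | x, Theta) for every component k."""
    joint = [p * gauss(x, mu, v) for p, mu, v in zip(priors, mus, variances)]
    px = sum(joint)                      # mixture density P(x | Theta)
    return [j / px for j in joint]

def classify(x, priors, mus, variances):
    """Index of the component with the largest posterior."""
    post = posteriors(x, priors, mus, variances)
    return max(range(len(post)), key=post.__getitem__)

priors, mus, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
post = posteriors(1.0, priors, mus, variances)
label_near_first = classify(1.0, priors, mus, variances)
label_near_second = classify(3.0, priors, mus, variances)
```

Because the denominator P(x | Θ) is the same for every k, the arg max could equally be taken over the unnormalized products; the normalized posteriors are what the E-step reuses as soft memberships.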

The EM Algorithm (cont)


Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B|S1) = 2/4 = 0.5

P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

Sample   Color   State S1   State S2
1        B       0.7        0.3
2        W       0.4        0.6
3        B       0.9        0.1
4        W       0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64

P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36

P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27

P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
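The soft-assignment estimates are ratios of summed membership weights. A minimal sketch reproducing the arithmetic above, with the sample colors inferred from which weights appear in each numerator (the variable names are illustrative):

```python
# Membership weights of four samples in (S1, S2) and the observed colors.
memberships = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]
colors = ["B", "W", "B", "W"]

def soft_estimate(color, state):
    """P(color | state): weighted count of that color over the total weight in the state."""
    num = sum(m[state] for m, c in zip(memberships, colors) if c == color)
    den = sum(m[state] for m in memberships)
    return num / den

p_b_s1 = soft_estimate("B", 0)
p_w_s1 = soft_estimate("W", 0)
p_b_s2 = soft_estimate("B", 1)
p_w_s2 = soft_estimate("W", 1)
```

With hard assignment each weight would be 0 or 1, and the same function reduces to ordinary relative-frequency counting.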

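The EM Algorithm (cont)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(\mathbf{X}|\Theta) &= \prod_{i=1}^{n} P(\vec{x}_i|\Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta) \\
&= \left[P(\vec{x}_1|c_1,\Theta)P(c_1|\Theta) + \cdots + P(\vec{x}_1|c_K,\Theta)P(c_K|\Theta)\right] \times \cdots \times \left[P(\vec{x}_n|c_1,\Theta)P(c_1|\Theta) + \cdots + P(\vec{x}_n|c_K,\Theta)P(c_K|\Theta)\right] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i|c_{k_i},\Theta)\,P(c_{k_i}|\Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i}|\Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n}|\Theta) \\
&= \sum_{\mathbf{C}} P(\mathbf{X}, \mathbf{C}|\Theta)
\end{aligned}
$$

where $\mathbf{X} = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_{n-1}, \vec{x}_n$, $\mathbf{C} = c_{k_1}, c_{k_2}, \ldots, c_{k_{n-1}}, c_{k_n}$, $P(\mathbf{X}|\Theta)$ is the likelihood function, and $P(\mathbf{X}, \mathbf{C}|\Theta)$ is the complete data likelihood function.

How many kinds of $\mathbf{C}$? ($K^n$ kinds)

Note:

$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T} a_{t k_t}$$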
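The identity in the note (a product of sums equals a sum over all assignments of products) can be checked numerically by brute force. The sketch below is illustrative only; the function names are hypothetical and the enumeration is exponential in the number of samples, which is exactly the $K^n$ count mentioned above.

```python
import itertools
import math

def product_of_sums(a):
    """Left-hand side: prod_t sum_k a[t][k]."""
    return math.prod(sum(row) for row in a)

def sum_over_assignments(a):
    """Right-hand side: sum over every (k_1, ..., k_T) of prod_t a[t][k_t]."""
    n_clusters = len(a[0])
    total = 0.0
    for assignment in itertools.product(range(n_clusters), repeat=len(a)):
        total += math.prod(a[t][k] for t, k in enumerate(assignment))
    return total

def count_assignments(n_samples, n_clusters):
    """Number of distinct hidden sequences C: K ** n."""
    return n_clusters ** n_samples

a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # T = 3 samples, M = 2 clusters
lhs = product_of_sums(a)
rhs = sum_over_assignments(a)
```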

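The EM Algorithm (cont)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta,\hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $\mathbf{C}$, conditioned on the known data $(\mathbf{X}, \Theta)$ ($\Theta$: known; $\hat{\Theta}$: unknown)

$$
\begin{aligned}
\Phi(\Theta,\hat{\Theta}) &= E_{\mathbf{C}}\left[L_{CM}\,|\,\mathbf{X},\Theta\right] = E_{\mathbf{C}}\left[\log P(\mathbf{X},\mathbf{C}|\hat{\Theta})\,|\,\mathbf{X},\Theta\right] \\
&= \sum_{\mathbf{C}} P(\mathbf{C}|\mathbf{X},\Theta)\,\log P(\mathbf{X},\mathbf{C}|\hat{\Theta}) \\
&= \sum_{\mathbf{C}} \frac{P(\mathbf{X},\mathbf{C}|\Theta)}{P(\mathbf{X}|\Theta)}\,\log P(\mathbf{X},\mathbf{C}|\hat{\Theta})
\end{aligned}
$$

  – Maximize the log likelihood function $\log P(\mathbf{X}|\hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta,\hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta,\hat{\Theta}) - \Phi(\Theta,\Theta) \le \log P(\mathbf{X}|\hat{\Theta}) - \log P(\mathbf{X}|\Theta)$$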

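The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta,\hat{\Theta})$ (here $m$ denotes the number of clusters, written $K$ elsewhere)

$$
\begin{aligned}
\Phi(\Theta,\hat{\Theta}) &= \sum_{\mathbf{C}} \frac{P(\mathbf{X},\mathbf{C}|\Theta)}{P(\mathbf{X}|\Theta)}\,\log P(\mathbf{X},\mathbf{C}|\hat{\Theta}) \\
&= \sum_{\mathbf{C}} \left[\prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j}|\Theta)}{P(\vec{x}_j|\Theta)}\right] \log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i}|\hat{\Theta}) \\
&= \sum_{k_1=1}^{m}\sum_{k_2=1}^{m}\cdots\sum_{k_n=1}^{m} \left[\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\right] \left[\sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i}|\hat{\Theta})\right] \\
&= \sum_{k_1=1}^{m}\sum_{k_2=1}^{m}\cdots\sum_{k_n=1}^{m} \left[\sum_{i=1}^{n}\sum_{k=1}^{m} \delta_{k,k_i} \log P(\vec{x}_i, c_k|\hat{\Theta})\right] \left[\prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\right] \\
&= \sum_{i=1}^{n}\sum_{k=1}^{m} \log P(\vec{x}_i, c_k|\hat{\Theta}) \left\{\sum_{k_1=1}^{m}\sum_{k_2=1}^{m}\cdots\sum_{k_n=1}^{m} \delta_{k,k_i} \prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\right\} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{m} P(c_k|\vec{x}_i,\Theta)\,\log P(\vec{x}_i, c_k|\hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{m} P(c_k|\vec{x}_i,\Theta)\,\log\left[P(\vec{x}_i|c_k,\hat{\Theta})\,P(c_k|\hat{\Theta})\right] \\
&= \sum_{i=1}^{n}\sum_{k=1}^{m} P(c_k|\vec{x}_i,\Theta)\,\log P(c_k|\hat{\Theta}) + \sum_{i=1}^{n}\sum_{k=1}^{m} P(c_k|\vec{x}_i,\Theta)\,\log P(\vec{x}_i|c_k,\hat{\Theta})
\end{aligned}
$$

where

$$\delta_{k,k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}$$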

The EM Algorithm (cont)

  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{m}\sum_{k_2=1}^{m}\cdots\sum_{k_n=1}^{m} \delta_{k,k_i} \prod_{j=1}^{n} P(c_{k_j}|\vec{x}_j,\Theta)
&= \left[\sum_{k_1=1}^{m}\cdots\sum_{k_{i-1}=1}^{m}\sum_{k_{i+1}=1}^{m}\cdots\sum_{k_n=1}^{m} \prod_{j=1, j\ne i}^{n} P(c_{k_j}|\vec{x}_j,\Theta)\right] P(c_k|\vec{x}_i,\Theta) \\
&= \left[\prod_{j=1, j\ne i}^{n} \sum_{k_j=1}^{m} P(c_{k_j}|\vec{x}_j,\Theta)\right] P(c_k|\vec{x}_i,\Theta) \\
&= \left[\prod_{j=1, j\ne i}^{n} 1\right] P(c_k|\vec{x}_i,\Theta) \\
&= P(c_k|\vec{x}_i,\Theta)
\end{aligned}
$$

($\delta_{k,k_i}$ means that $\vec{x}_i$ can only be aligned to $c_k$.)

Note:

$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T} a_{t k_t}$$

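The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts:

$$\Phi(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + \Phi_b(\Theta,\hat{\Theta})$$

where

$$
\begin{aligned}
\Phi_a(\Theta,\hat{\Theta}) &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k|\vec{x}_i,\Theta)\,\log P(c_k|\hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{P(\vec{x}_i|\Theta)}\,\log P(c_k|\hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\,\log P(c_k|\hat{\Theta})
\end{aligned}
$$

(auxiliary function for mixture weights)

$$
\begin{aligned}
\Phi_b(\Theta,\hat{\Theta}) &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k|\vec{x}_i,\Theta)\,\log P(\vec{x}_i|c_k,\hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\,\log P(\vec{x}_i|c_k,\hat{\Theta})
\end{aligned}
$$

(auxiliary function for cluster distributions)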

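The EM Algorithm (cont)

• M-step (Maximization)
  – Remember that
    • Maximize a function $F$ with a constraint by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$ with the constraint $\sum_{j=1}^{N} y_j = 1$.

By applying the Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; y_j = -\frac{w_j}{\ell} \quad \forall j \\
\sum_{j=1}^{N} y_j &= -\frac{\sum_{j=1}^{N} w_j}{\ell} = 1 \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$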
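As a quick numerical sanity check of the Lagrange-multiplier result, the sketch below compares the objective $F$ at the closed-form solution with its value at a few other points on the constraint. The function names and the example weights are illustrative assumptions, and a handful of comparisons is of course not a proof.

```python
import math

def objective(w, y):
    """F = sum_j w_j * log(y_j), defined for strictly positive y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

def constrained_maximizer(w):
    """Closed-form solution y_j = w_j / sum_j' w_j' under sum_j y_j = 1."""
    total = sum(w)
    return [wj / total for wj in w]

w = [2.0, 1.0, 3.0]
y_star = constrained_maximizer(w)

# Other feasible points (each sums to 1) should never score higher.
candidates = [
    [1 / 3, 1 / 3, 1 / 3],
    [0.2, 0.3, 0.5],
    [0.5, 0.1, 0.4],
    [0.3, 0.2, 0.5],
]
best_other = max(objective(w, y) for y in candidates)
```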

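The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta,\hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta,\hat{\Theta}) &= \Phi_a(\Theta,\hat{\Theta}) + \ell\left(\sum_{k=1}^{K} P(c_k|\hat{\Theta}) - 1\right) \\
&= \sum_{k=1}^{K} \underbrace{\left[\sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\right]}_{w_k} \log \underbrace{P(c_k|\hat{\Theta})}_{y_k} + \ell\left(\sum_{k=1}^{K} P(c_k|\hat{\Theta}) - 1\right)
\end{aligned}
$$

$$
\Rightarrow\; \hat{\pi}_k = P(c_k|\hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}}{\displaystyle\sum_{k'=1}^{K}\sum_{i=1}^{n} \frac{P(\vec{x}_i|c_{k'},\Theta)\,P(c_{k'}|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}}
= \frac{1}{n}\sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}
$$

Each term of the sum over $i$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.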

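The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\,\log P(\vec{x}_i|c_k,\hat{\Theta})$$

$$P(\vec{x}_i|c_k,\hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\,|\hat{\Sigma}_k|^{1/2}} \exp\left(-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i-\hat{\vec{\mu}}_k)\right)$$

Let

$$w_{ik} = \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}$$

and

$$\log P(\vec{x}_i|c_k,\hat{\Theta}) = -\frac{m}{2}\log(2\pi) - \frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i-\hat{\vec{\mu}}_k)$$

Then

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D$$

where $D$ is a constant.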

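The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\vec{\mu}}_k} &= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1) = 0 \\
\Rightarrow\; \hat{\vec{\mu}}_k &= \frac{\sum_{i=1}^{n} w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\cdot\vec{x}_i}{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}}
\end{aligned}
$$

Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\vec{x}$, and $\Sigma_k^{-1}$ is symmetric here.

$w_{ik}$: the expected number of times that $\vec{x}_i$ falls in class $c_k$.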

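The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D$$

Notes (with $\Sigma_k$ symmetric here):

$$\frac{d\,\det(X)}{dX} = \det(X)\cdot X^{-T}, \qquad \frac{d\,(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1}\vec{a}\,\vec{b}^T X^{-1} \;\text{ (for symmetric } X\text{)}$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\left[\frac{1}{|\hat{\Sigma}_k|}\cdot|\hat{\Sigma}_k|\cdot\hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1}\right] = 0 \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1} &= \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1} \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1}\hat{\Sigma}_k \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T \\
\Rightarrow\; \hat{\Sigma}_k &= \frac{\sum_{i=1}^{n} w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T}{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i|c_k,\Theta)\,P(c_k|\Theta)}{\sum_{l=1}^{K} P(\vec{x}_i|c_l,\Theta)\,P(c_l|\Theta)}}
\end{aligned}
$$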

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(\mathbf{X}|\Theta)$ has converged or the maximum number of iterations is reached
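Putting the E-step, the three M-step updates, and the stopping rule together gives the loop sketched below. It is a simplified illustration rather than the lecture's implementation: it assumes univariate Gaussians, takes the initial parameters as arguments (they could come from K-means), and adds a small variance floor `min_var` that is not part of the derivation. All names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density (assumption: one-dimensional data)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, priors, means, variances, max_iter=100, tol=1e-6, min_var=1e-6):
    """EM for a 1-D Gaussian mixture; returns (priors, means, variances, log_lik)."""
    priors, means, variances = list(priors), list(means), list(variances)
    prev_ll = -math.inf
    log_lik = prev_ll
    for _ in range(max_iter):
        # E-step: responsibilities w_ik = P(c_k | x_i, Theta).
        resp = []
        log_lik = 0.0
        for x in data:
            joint = [p * gaussian_pdf(x, m, v)
                     for p, m, v in zip(priors, means, variances)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        # Stop once the log-likelihood no longer improves by at least tol.
        if log_lik - prev_ll < tol:
            break
        prev_ll = log_lik
        # M-step: closed-form updates for weights, means, and variances.
        for k in range(len(priors)):
            w_k = sum(r[k] for r in resp)
            priors[k] = w_k / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / w_k
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / w_k
            variances[k] = max(var_k, min_var)  # floor is an added safeguard
    return priors, means, variances, log_lik

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
priors, means, variances, log_lik = em_gmm_1d(data, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
```

The returned `log_lik` is the value computed in the last E-step, so if the loop ends by exhausting `max_iter` it refers to the parameters before the final update.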

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information
  – TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]

$$P(w_j|D_i) = \sum_{k=1}^{K} P(T_k|D_i)\left[\sum_{l=1}^{K} P(T_l|Y_k)\,P(w_j|T_l)\right]$$

$$P(T_l|Y_k) = \frac{E(T_l, T_k)}{\sum_{s=1}^{K} E(T_s, T_k)}$$

$$E(T_l, T_k) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{dist(T_k, T_l)^2}{2\sigma^2}\right]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
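The neighborhood weighting above can be sketched as follows. This is an illustrative reading of the formulas, assuming each topic $T_k$ has a known (x, y) position on the map; the function names and the toy values are hypothetical.

```python
import math

def map_distance(pos_a, pos_b):
    """Euclidean distance dist(T_i, T_j) between two map positions."""
    return math.hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1])

def neighborhood(pos_k, pos_l, sigma):
    """Gaussian neighborhood E(T_l, T_k) on the map."""
    d = map_distance(pos_k, pos_l)
    return math.exp(-d * d / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def topic_given_cell(positions, sigma):
    """table[k][l] = P(T_l | Y_k): neighborhood values normalized over l."""
    table = []
    for pos_k in positions:
        row = [neighborhood(pos_k, pos_l, sigma) for pos_l in positions]
        total = sum(row)
        table.append([value / total for value in row])
    return table

def word_given_document(p_topic_given_doc, p_word_given_topic, table):
    """P(w_j | D_i) for one word and one document, given per-topic inputs."""
    n_topics = len(table)
    return sum(
        p_topic_given_doc[k]
        * sum(table[k][l] * p_word_given_topic[l] for l in range(n_topics))
        for k in range(n_topics)
    )

positions = [(0, 0), (0, 1), (1, 0), (1, 1)]  # a 2 x 2 topic map
table = topic_given_cell(positions, sigma=1.0)
p_word = word_given_document([0.7, 0.1, 0.1, 0.1], [0.2, 0.2, 0.2, 0.2], table)
```

Because each row of `table` sums to one, a word that is equally likely under every topic keeps that same probability in every document, which is a handy consistency check.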

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j, D_i)\,\log P(w_j|D_i) \\
&= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j, D_i)\,\log\left\{\sum_{k=1}^{K} P(T_k|D_i)\left[\sum_{l=1}^{K} P(T_l|Y_k)\,P(w_j|T_l)\right]\right\}
\end{aligned}
$$

  – EM training can be performed

$$\hat{P}(w_j|T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\,P(T_k|w_j, D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N} c(w_{j'}, D_{i'})\,P(T_k|w_{j'}, D_{i'})}$$

$$\hat{P}(T_k|D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\,P(T_k|w_j, D_i)}{c(D_i)}$$

where

$$P(T_k|w_j, D_i) = \frac{\left[\sum_{l=1}^{K} P(w_j|T_l)\,P(T_l|Y_k)\right] P(T_k|D_i)}{\sum_{k'=1}^{K}\left\{\left[\sum_{l=1}^{K} P(w_j|T_l)\,P(T_l|Y_{k'})\right] P(T_{k'}|D_i)\right\}}$$

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\,P(T_k|D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'})\,\left[1 - P(T_k|D_{i'})\right]}$$

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

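Hierarchical Document Organization (cont)

• Self-Organizing Map (SOM)
  – A recursive regression process

[Figure: an input layer holding the input vector $\vec{x} = [x_1, x_2, \ldots, x_n]^T$ and a mapping layer whose units hold weight vectors $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, e.g. $\vec{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$; the distance $\|\vec{x} - \vec{m}_i\|$ links the two layers.]

$$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),i}(t)\,\left[\vec{x}(t) - \vec{m}_i(t)\right]$$

where

$$c(\vec{x}) = \arg\min_{i'} \|\vec{x} - \vec{m}_{i'}\|, \qquad \|\vec{x} - \vec{m}_{i'}\| = \sqrt{\sum_{n} (x_n - m_{i',n})^2}$$

$$h_{c(\vec{x}),i}(t) = \alpha(t)\,\exp\left(-\frac{\|\vec{r}_i - \vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)}\right)$$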
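One SOM update step, as described by the three formulas above, might look like the sketch below. It treats $\alpha(t)$ and $\sigma(t)$ as plain numbers supplied by the caller (their schedules are not given on the slide), and the names `best_matching_unit`, `som_update`, and `grid` are illustrative assumptions.

```python
import math

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def best_matching_unit(x, weights):
    """c(x): index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: squared_distance(x, weights[i]))

def som_update(x, weights, grid, alpha, sigma):
    """Return the weight vectors m_i(t+1) after presenting input x once.

    weights: list of weight vectors m_i(t)
    grid:    list of unit positions r_i on the map
    """
    c = best_matching_unit(x, weights)
    updated = []
    for m_i, r_i in zip(weights, grid):
        # Neighborhood function: units near the winner on the map move more.
        h = alpha * math.exp(-squared_distance(r_i, grid[c]) / (2.0 * sigma ** 2))
        updated.append([m + h * (x_n - m) for x_n, m in zip(x, m_i)])
    return updated

weights = [[0.0, 0.0], [1.0, 1.0]]
grid = [(0, 0), (0, 1)]
x = [0.9, 0.9]
new_weights = som_update(x, weights, grid, alpha=0.5, sigma=1.0)
```

In this toy run the second unit wins and moves halfway toward the input, while the first unit moves by the smaller factor 0.5·exp(−0.5).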

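Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)}$$

$$f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}$$

$$f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$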
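The distance ratio can be computed directly from the definitions. In the sketch below, `map_positions[i]` is assumed to be the (x, y) map location of document i and `topic_labels[i]` its topic $T_{r,i}$; both names and the toy data are illustrative, and the code assumes at least one within-topic pair and one between-topic pair exist.

```python
import math
from itertools import combinations

def distance_ratio(map_positions, topic_labels):
    """R_Dist: mean between-topic map distance over mean within-topic distance."""
    between_sum = within_sum = 0.0
    between_count = within_count = 0
    for i, j in combinations(range(len(map_positions)), 2):
        (xi, yi), (xj, yj) = map_positions[i], map_positions[j]
        d = math.hypot(xi - xj, yi - yj)
        if topic_labels[i] != topic_labels[j]:
            between_sum += d
            between_count += 1
        else:
            within_sum += d
            within_count += 1
    dist_between = between_sum / between_count
    dist_within = within_sum / within_count
    return dist_between / dist_within

positions = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = [0, 0, 1, 1]
ratio = distance_ratio(positions, labels)
```

A larger ratio means documents of different topics sit farther apart on the map relative to documents of the same topic.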


Measures of Cluster Similarity (cont)

• Complete-link clustering
  – The similarity between two clusters is the similarity of their two most dissimilar members
  – Sphere-shaped clusters are achieved
  – Preferable for most IR and NLP applications

[Figure: clusters C_i and C_j joined at their least-similar pair of members]

$$sim(c_i, c_j) = \min_{\vec{x}\in c_i,\ \vec{y}\in c_j} sim(\vec{x}, \vec{y})$$

Measures of Cluster Similarity (cont)

[Figure: the same data clustered with single link and with complete link]
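The two linkage criteria differ only in whether the most similar or the least similar cross-cluster pair is used. A minimal sketch, assuming cosine similarity between vectors and using hypothetical function names:

```python
import math

def cosine(x, y):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def single_link(cluster_i, cluster_j, sim=cosine):
    """Similarity of the two MOST similar members, one from each cluster."""
    return max(sim(x, y) for x in cluster_i for y in cluster_j)

def complete_link(cluster_i, cluster_j, sim=cosine):
    """Similarity of the two LEAST similar members, one from each cluster."""
    return min(sim(x, y) for x in cluster_i for y in cluster_j)

c_i = [(1.0, 0.0)]
c_j = [(1.0, 0.0), (0.0, 1.0)]
single = single_link(c_i, c_j)
complete = complete_link(c_i, c_j)
```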

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering
  – A compromise between single-link and complete-link clustering
  – The similarity between two clusters is the average similarity between members
  – If the objects are represented as length-normalized vectors and the similarity measure is the cosine
    • There exists a fast algorithm for computing the average similarity

$$sim(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x}\cdot\vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x}\cdot\vec{y} \qquad \text{(length-normalized vectors)}$$

[Figure: clusters C_i and C_j]

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

  – The average similarity SIM between vectors in a cluster $c_j$ is defined as

$$SIM(c_j) = \frac{1}{|c_j|\,(|c_j|-1)} \sum_{\vec{x}\in c_j}\ \sum_{\vec{y}\in c_j,\ \vec{y}\ne\vec{x}} sim(\vec{x}, \vec{y}) = \frac{1}{|c_j|\,(|c_j|-1)} \sum_{\vec{x}\in c_j}\ \sum_{\vec{y}\in c_j,\ \vec{y}\ne\vec{x}} \vec{x}\cdot\vec{y}$$

  – The sum of members in a cluster $c_j$:

$$\vec{s}(c_j) = \sum_{\vec{x}\in c_j} \vec{x}$$

  – Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$:

$$
\begin{aligned}
\vec{s}(c_j)\cdot\vec{s}(c_j) &= \sum_{\vec{x}\in c_j} \vec{x}\cdot\vec{s}(c_j) = \sum_{\vec{x}\in c_j}\sum_{\vec{y}\in c_j} \vec{x}\cdot\vec{y} \\
&= |c_j|\,(|c_j|-1)\,SIM(c_j) + \sum_{\vec{x}\in c_j} \vec{x}\cdot\vec{x} \\
&= |c_j|\,(|c_j|-1)\,SIM(c_j) + |c_j| \qquad (\vec{x}\cdot\vec{x} = 1 \text{ for a length-normalized vector}) \\
\therefore\; SIM(c_j) &= \frac{\vec{s}(c_j)\cdot\vec{s}(c_j) - |c_j|}{|c_j|\,(|c_j|-1)}
\end{aligned}
$$

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

  – When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance:

$$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$

  – The average similarity for their union will be

$$SIM(c_i \cup c_j) = \frac{\left(\vec{s}(c_i) + \vec{s}(c_j)\right)\cdot\left(\vec{s}(c_i) + \vec{s}(c_j)\right) - \left(|c_i| + |c_j|\right)}{\left(|c_i| + |c_j|\right)\left(|c_i| + |c_j| - 1\right)}$$

[Figure: clusters C_i and C_j with their sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$]
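The sum-vector shortcut can be verified against the brute-force definition. The sketch below is illustrative (function names are hypothetical) and relies on the same assumption as the derivation: every vector is length-normalized, so the cosine is a plain dot product.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def sum_vector(cluster):
    """s(c): component-wise sum of the cluster members."""
    return [sum(column) for column in zip(*cluster)]

def average_similarity(s, size):
    """SIM(c) computed from the sum vector alone."""
    return (dot(s, s) - size) / (size * (size - 1))

def merged_average_similarity(s_i, n_i, s_j, n_j):
    """SIM(c_i U c_j) from the two sum vectors and cluster sizes."""
    s_new = [a + b for a, b in zip(s_i, s_j)]
    return average_similarity(s_new, n_i + n_j)

def brute_force_average(cluster):
    """SIM(c) by summing x . y over all ordered pairs with x != y."""
    n = len(cluster)
    total = sum(dot(x, y)
                for a, x in enumerate(cluster)
                for b, y in enumerate(cluster) if a != b)
    return total / (n * (n - 1))

c_i = [normalize(v) for v in [(1, 0), (1, 1)]]
c_j = [normalize(v) for v in [(0, 1), (1, 2), (2, 1)]]
fast = merged_average_similarity(sum_vector(c_i), len(c_i), sum_vector(c_j), len(c_j))
slow = brute_force_average(c_i + c_j)
```

The shortcut needs one dot product per candidate merge instead of a sum over all member pairs, which is what makes group-average clustering practical.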

Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values
  – E.g., the left and right neighbors of tokens of words

[Figure: word-clustering hierarchy of 22 words. Higher nodes: decreasing similarity. "be" has the least similarity with the other 21 words.]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure

• How to split a cluster
  – Splitting is itself a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm

[Figure: algorithm listing. Annotations: split the least coherent cluster; generate two new clusters and remove the original one.]
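The top-down procedure can be sketched on one-dimensional data. The choices below are assumptions made only to keep the example short: coherence is measured by the cluster's spread (largest minus smallest value, a complete-link-style measure), and a cluster is split at its largest gap rather than by K-means or an agglomerative pass. All function names are hypothetical.

```python
def spread(cluster):
    """Incoherence of a 1-D cluster: distance between its extreme members."""
    return max(cluster) - min(cluster)

def split_at_largest_gap(cluster):
    """Split a 1-D cluster into two sub-clusters at its widest gap."""
    ordered = sorted(cluster)
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, cut = max(gaps)
    return ordered[:cut + 1], ordered[cut + 1:]

def divisive_clustering(points, k):
    """Start from one cluster and split the least coherent one until k remain."""
    clusters = [sorted(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break  # nothing left to split
        worst = max(splittable, key=spread)
        # Generate two new clusters and remove the original one.
        clusters.remove(worst)
        clusters.extend(split_at_largest_gap(worst))
    return clusters

result = divisive_clustering([1, 2, 3, 10, 11, 12, 30], k=3)
```

Recording which cluster was split at each iteration would give the binary tree mentioned above.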

Non-Hierarchical Clustering

Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  – When to stop (criteria: group average similarity, likelihood, mutual information)
  – What is the right number of clusters (k-1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Define clusters by the center of mass of their members

• The K-means algorithm also can be regarded as
  – A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)
  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors)
    • A compression rate of 3

$$\mathbf{X} = \{\vec{x}^t\}_{t=1}^{n} \;\xrightarrow{\;F\;}\; \text{index } j, \qquad \{\vec{m}_j\}_{j=1}^{k}$$

$\vec{m}_j$: cluster centroid, also called reference vector, code vector, or code word. In the color example, $Dim(\vec{x}^t) = 24$ → $k = 2^8$.

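The K-means Algorithm (cont)

Total reconstruction error:

$$E\left(\{\vec{m}_i\}_{i=1}^{k}\,\middle|\,\mathbf{X}\right) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t\,\|\vec{x}^t - \vec{m}_i\|^2$$

where the label $b_i^t$ is

$$b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t - \vec{m}_i\| = \min_j \|\vec{x}^t - \vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$

  – $b_i^t$ and $\vec{m}_i$ are unknown
  – $b_i^t$ depends on $\vec{m}_i$, and this optimization problem cannot be solved analytically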

The K-means Algorithm (cont)

• Initialization
  – A set of initial cluster centers $\{\vec{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  – Assign each object $\vec{x}^t$ to the cluster whose center is closest

$$b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t - \vec{m}_i\| = \min_j \|\vec{x}^t - \vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$\vec{m}_i = \frac{\sum_{t=1}^{N} b_i^t\,\vec{x}^t}{\sum_{t=1}^{N} b_i^t}$$

    • A variant uses the medoid as the cluster center (a medoid is one of the objects in the cluster)

These two steps are repeated until $\vec{m}_i$ stabilizes.
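The two alternating steps translate almost directly into code. The sketch below is a plain-Python illustration, not the lecture's implementation: the initial centers are passed in by the caller, an empty cluster simply keeps its old center (an assumption the slide does not address), and the function names are hypothetical.

```python
def squared_distance(x, m):
    return sum((a - b) ** 2 for a, b in zip(x, m))

def nearest(x, centers):
    """Index i of the center closest to x (the i with b_i^t = 1)."""
    return min(range(len(centers)), key=lambda i: squared_distance(x, centers[i]))

def kmeans(data, centers, max_iter=100):
    """Return (centers, labels, total reconstruction error)."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each object goes to its closest center.
        labels = [nearest(x, centers) for x in data]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, label in zip(data, labels) if label == i]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # assumption: keep an empty cluster's center
        if new_centers == centers:
            break  # centers have stabilized
        centers = new_centers
    labels = [nearest(x, centers) for x in data]
    error = sum(squared_distance(x, centers[label]) for x, label in zip(data, labels))
    return centers, labels, error

data = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, labels, error = kmeans(data, [(0, 0), (10, 10)])
```

Ties in `nearest` are broken by taking the lowest index, which is only one of several possible tie-breaking rules.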

The K-means Algorithm (cont)

• Algorithm

The K-means Algorithm (cont)

• Example 1

The K-means Algorithm (cont)

• Example 2

[Figure: clusters labeled government, finance, sports, research, name]

The K-means Algorithm (cont)

• The choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean $m$ of all data and generate the $k$ initial centers $m_i$ by adding a small random vector to the mean: $m_i = m \pm \delta$
  – Project the data onto the principal component (first eigenvector), divide its range into $k$ equal intervals, and take the mean of the data in each group as the initial center $m_i$
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of that of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly

• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an $l$-dimensional space (hypercube), with $l = \log_2 k$ for $k$ clusters
    • (Figure labels: nodes on the hypercube; a linear classifier)

The K-means Algorithm (cont)

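• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray
  – The number of clusters doubles ($M \rightarrow 2M$) at each iteration

(Figure labels: global mean; cluster 1 mean; cluster 2 mean; $\{\mu_{11}, \Sigma_{11}, \omega_{11}\}$, $\{\mu_{12}, \Sigma_{12}, \omega_{12}\}$, $\{\mu_{13}, \Sigma_{13}, \omega_{13}\}$, $\{\mu_{14}, \Sigma_{14}, \omega_{14}\}$)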

The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta), \qquad \pi_k = P(c_k)$$

  – Continuous case: each cluster is a multivariate Gaussian (a mixture Gaussian HMM, or a mixture of Gaussians)

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$$

  – Likelihood function for the data samples $\mathbf{X} = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)

$$P(\mathbf{X} \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

  – Classification

$$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

The EM Algorithm (cont)

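Maximum Likelihood Estimation

• Hard assignment (state S1; B = black, W = white)

P(B | S1) = 2/4 = 0.5
P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft assignment
  – Each sample belongs to state S1 and to state S2 with the probabilities below

Sample   Color   State S1   State S2
1        B       0.7        0.3
2        W       0.4        0.6
3        B       0.9        0.1
4        W       0.5        0.5

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64
P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36
P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 = 0.27
P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 = 0.73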
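The soft-assignment arithmetic above amounts to replacing integer counts with fractional counts. A minimal sketch follows, assuming the four samples are colored B, W, B, W as the sums imply; the function name and the data layout are hypothetical.

```python
# Probability of each state for each sample, paired with the sample's color.
# The numbers come from the soft-assignment example; colors follow from its sums.
samples = [
    ("B", {"S1": 0.7, "S2": 0.3}),
    ("W", {"S1": 0.4, "S2": 0.6}),
    ("B", {"S1": 0.9, "S2": 0.1}),
    ("W", {"S1": 0.5, "S2": 0.5}),
]


def soft_emission_estimates(samples, state):
    """Estimate P(color | state) from fractional (soft) counts."""
    total = sum(post[state] for _, post in samples)
    counts = {}
    for color, post in samples:
        counts[color] = counts.get(color, 0.0) + post[state]
    return {color: count / total for color, count in counts.items()}


for state in ("S1", "S2"):
    estimates = soft_emission_estimates(samples, state)
    print(state, {color: round(p, 2) for color, p in estimates.items()})
# S1 {'B': 0.64, 'W': 0.36}
# S2 {'B': 0.27, 'W': 0.73}
```

Setting every probability to 0 or 1 turns the same function into the hard-assignment estimate.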

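The EM Algorithm (cont)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(\mathbf{X} \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \\
&\qquad \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \dots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{\mathbf{C}} P(\mathbf{X}, \mathbf{C} \mid \Theta)
\end{aligned}
$$

  where $\mathbf{X} = \vec{x}_1, \vec{x}_2, \dots, \vec{x}_n$ and $\mathbf{C} = c_{k_1}, c_{k_2}, \dots, c_{k_n}$; $P(\mathbf{X} \mid \Theta)$ is the likelihood function and $P(\mathbf{X}, \mathbf{C} \mid \Theta)$ is the complete data likelihood function

  – How many kinds of $\mathbf{C}$ are there? $K^n$ kinds

  – Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t,k} = (a_{1,1} + a_{1,2} + \cdots + a_{1,M})(a_{2,1} + a_{2,2} + \cdots + a_{2,M}) \cdots (a_{T,1} + a_{T,2} + \cdots + a_{T,M}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t,k_t}$$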
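The product-of-sums identity in the note, and the count of label sequences, can be checked numerically on a small case. The sketch below uses a hypothetical 3 × 2 table of terms; the variable names are illustrative only.

```python
import itertools
import math

# A hypothetical T x M table of terms a[t][k], here with T = 3 and M = 2.
a = [
    [0.2, 0.5],
    [0.1, 0.4],
    [0.3, 0.6],
]

# Left-hand side: product over t of the sum over k.
lhs = math.prod(sum(row) for row in a)

# Right-hand side: sum over every index sequence (k_1, ..., k_T)
# of the product of a[t][k_t].
M = len(a[0])
sequences = list(itertools.product(range(M), repeat=len(a)))
rhs = sum(math.prod(a[t][k] for t, k in enumerate(seq)) for seq in sequences)

print(len(sequences))          # 8, i.e. M ** T label sequences
print(math.isclose(lhs, rhs))  # True
```

The number of sequences grows as M ** T (K ** n in the clustering notation), which is why the derivation rearranges the sum instead of enumerating every labeling.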

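The EM Algorithm (cont)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $\mathbf{C}$, conditioned on the known data $\mathbf{X}$ and the current estimate $\Theta$ ($\Theta$ is known, $\hat{\Theta}$ is unknown)

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) = E\left[ L_{CM} \right] &= E_{\mathbf{C}}\left[ \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) \,\middle|\, \mathbf{X}, \Theta \right] \\
&= \sum_{\mathbf{C}} P(\mathbf{C} \mid \mathbf{X}, \Theta) \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) \\
&= \sum_{\mathbf{C}} \frac{P(\mathbf{X}, \mathbf{C} \mid \Theta)}{P(\mathbf{X} \mid \Theta)} \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta})
\end{aligned}
$$

  – Maximize the log likelihood function $\log P(\mathbf{X} \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(\mathbf{X} \mid \hat{\Theta}) - \log P(\mathbf{X} \mid \Theta)$$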

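The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{\mathbf{C}} P(\mathbf{C} \mid \mathbf{X}, \Theta) \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) \\
&= \sum_{\mathbf{C}} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i, k} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

  where

$$\delta_{k_i, k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}$$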

The EM Algorithm (cont)

  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

    • The factor $\delta_{k_i, k}$ means that $\vec{x}_i$ can only be aligned to $c_k$
    • The second step uses the identity $\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t,k} = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t,k_t}$

The EM Algorithm (cont)

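• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

  where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

  is the auxiliary function for the mixture weights, and

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

  is the auxiliary function for the cluster distributions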

The EM Algorithm (cont)

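• M-step (Maximization)
  – Remember that
    • A function $F$ can be maximized under a constraint by applying a Lagrange multiplier

$$\text{Suppose that } F = \sum_{j=1}^{N} w_j \log y_j, \quad \text{with the constraint } \sum_{j=1}^{N} y_j = 1$$

$$
\begin{aligned}
&\text{By applying a Lagrange multiplier } \ell: \quad \hat{F} = \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
&\frac{\partial \hat{F}}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
&\Rightarrow\; \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
&\therefore\; y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

    • Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$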
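The closed-form result of the Lagrange-multiplier argument (each y_j equal to w_j divided by the sum of all weights) can be sanity-checked by comparing it against random points that satisfy the constraint. This is an illustrative sketch only; the weights and the random search are assumptions made for the check.

```python
import math
import random


def objective(w, y):
    """F = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))


w = [2.0, 1.0, 0.5, 0.5]           # hypothetical non-negative weights
total = sum(w)
y_star = [wj / total for wj in w]  # closed-form maximizer: y_j = w_j / sum(w)

rng = random.Random(0)
best_random = -math.inf
for _ in range(10000):
    # Draw a random point with positive entries that sum to 1 (the constraint).
    raw = [rng.random() + 1e-9 for _ in w]
    s = sum(raw)
    y = [r / s for r in raw]
    best_random = max(best_random, objective(w, y))

print(y_star)                               # [0.5, 0.25, 0.125, 0.125]
print(objective(w, y_star) >= best_random)  # True
```

No feasible point found by the random search beats the closed-form solution, which is the same normalization used for the mixture weights in the M-step.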

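The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

$$
\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle \sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

  – Each term of the sum over $i$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$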

The EM Algorithm (cont)

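• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

  – With the Gaussian cluster distribution

$$P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\, |\hat{\Sigma}_k|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)$$

  – Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \quad \text{and} \quad \log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

  – Then

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

  where $D$ is a constant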

The EM Algorithm (cont)

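• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0$$

$$
\Rightarrow\; \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
$$

  – Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here
  – $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$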

The EM Algorithm (cont)

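• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$
\begin{aligned}
&\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ |\hat{\Sigma}_k|^{-1} \cdot |\hat{\Sigma}_k| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow\; \hat{\Sigma}_k \cdot \sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
&\Rightarrow\; \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

  – Notes: $\dfrac{d \det(X)}{dX} = \det(X) \cdot X^{-T}$ and $\dfrac{d(a^T X^{-1} b)}{dX} = -X^{-T} a\, b^T X^{-T}$; $\hat{\Sigma}_k$ is symmetric here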

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(\mathbf{X} \mid \Theta)$ has converged or the maximum number of iterations is reached
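The full E-step/M-step loop with this stopping rule can be sketched for a one-dimensional Gaussian mixture, where each covariance matrix reduces to a scalar variance. This is a minimal sketch under stated assumptions, not the lecture's implementation: the data, the initial values, the tolerance, and the variance floor are all made up for illustration, and the initial parameters are set by hand rather than by K-means.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """log P(X | Theta) for a one-dimensional Gaussian mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration: posteriors w_ik (E-step), then closed-form updates (M-step)."""
    K, n = len(weights), len(data)
    # E-step: w_ik = P(c_k | x_i, Theta)
    post = []
    for x in data:
        joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(K)]
        total = sum(joint)
        post.append([j / total for j in joint])
    # M-step: mixture weights, means, and variances
    new_w, new_m, new_v = [], [], []
    for k in range(K):
        nk = sum(post[i][k] for i in range(n))
        mean = sum(post[i][k] * data[i] for i in range(n)) / nk
        var = sum(post[i][k] * (data[i] - mean) ** 2 for i in range(n)) / nk
        new_w.append(nk / n)
        new_m.append(mean)
        new_v.append(max(var, var_floor))  # the floor is an added safeguard (assumption)
    return new_w, new_m, new_v


# Hypothetical data drawn around 1 and 5, with hand-picked initial parameters.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.7, 5.1, 4.9]
weights, means, variances = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]

history = [log_likelihood(data, weights, means, variances)]
for _ in range(50):  # maximum number of iterations
    weights, means, variances = em_step(data, weights, means, variances)
    history.append(log_likelihood(data, weights, means, variances))
    if history[-1] - history[-2] < 1e-8:  # the likelihood has converged
        break

print(weights, means, variances)
```

The `history` list records the log likelihood after every iteration, so it can be inspected to confirm that the value never decreases from one iteration to the next.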

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]$$

$$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map (figure: two-dimensional tree structure for organized topics)

• Related documents are placed in the same cluster, and the relationships among the clusters are reflected by the distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer
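The neighborhood probabilities P(T_l | Y_k) depend only on the topic positions on the map and on σ, so they can be tabulated once. The sketch below assumes a hypothetical 2 × 2 map and σ = 1; the function name and data layout are illustrative, not part of the TMM description.

```python
import math


def neighborhood_probabilities(coords, sigma):
    """Return a table P[k][l] = P(T_l | Y_k) from Gaussian-weighted map distances."""
    K = len(coords)
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    table = []
    for k in range(K):
        xk, yk = coords[k]
        e = []
        for l in range(K):
            xl, yl = coords[l]
            dist_sq = (xk - xl) ** 2 + (yk - yl) ** 2  # dist(T_k, T_l) squared
            e.append(norm * math.exp(-dist_sq / (2.0 * sigma ** 2)))
        total = sum(e)  # normalize over all topics T_s
        table.append([value / total for value in e])
    return table


# Hypothetical 2 x 2 map: topic index -> (x, y) position on the map.
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
P = neighborhood_probabilities(coords, sigma=1.0)
for row in P:
    print([round(p, 3) for p in row])
```

Each row sums to one, and topics that are closer on the map receive more of the probability mass, which is what ties neighboring clusters together.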

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}
$$

  – EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

  where

$$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid Y_{k'}) \right] P(T_{k'} \mid D_i) \right\}}$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}$$

Hierarchical Document Organization (cont)

• Example (figure)

Hierarchical Document Organization (cont)

• Example (cont) (figure)

Hierarchical Document Organization (cont)

• Self-Organizing Map (SOM)
  – A recursive regression process
  – Input layer: input vector $x = [x_1, x_2, \dots, x_n]^T$
  – Mapping layer: each unit $i$ has a weight vector $m_i = [m_{i,1}, m_{i,2}, \dots, m_{i,n}]^T$, e.g., $m_1 = [m_{1,1}, m_{1,2}, \dots, m_{1,n}]^T$

$$m_i(t+1) = m_i(t) + h_{c(x),i}(t) \left[ x(t) - m_i(t) \right]$$

  where

$$c(x) = \arg\min_i \left\| x - m_i \right\|, \qquad \left\| x - m_i \right\| = \sqrt{\sum_{d=1}^{n} (x_d - m_{i,d})^2}$$

$$h_{c(x),i}(t) = \alpha(t) \exp\left( -\frac{\left\| r_i - r_{c(x)} \right\|^2}{2\sigma^2(t)} \right)$$
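A single SOM update can be sketched as follows: find the winning unit c(x), then move every unit toward the input by an amount weighted by the neighborhood function. The map layout, the input, and the fixed values of α and σ are hypothetical; in practice both would shrink over time t.

```python
import math


def som_step(weights, positions, x, alpha, sigma):
    """One SOM update: find the winner c(x), then pull every unit toward x."""
    # Winner: the unit whose weight vector is closest to the input.
    c = min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
    updated = []
    for m_i, r_i in zip(weights, positions):
        # Neighborhood function h_{c(x),i}: Gaussian in map distance to the winner.
        h = alpha * math.exp(-math.dist(r_i, positions[c]) ** 2 / (2.0 * sigma ** 2))
        updated.append([m + h * (xd - m) for m, xd in zip(m_i, x)])
    return c, updated


# Hypothetical map with three units on a line and two-dimensional inputs.
positions = [(0.0,), (1.0,), (2.0,)]
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
winner, weights = som_step(weights, positions, x=[0.9, 1.0], alpha=0.5, sigma=1.0)
print(winner, weights)
```

The winner moves the most, and units farther away on the map move progressively less, which is what preserves the map's topology.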


Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

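$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

$$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$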
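The distance ratio can be computed directly from the map positions of the documents and their reference topic labels. The sketch below assumes that T_{r,i} denotes the reference topic of document i; the function name and the four-document example are hypothetical.

```python
import math
from itertools import combinations


def dist_ratio(map_xy, reference_topic):
    """R_Dist = dist_Between / dist_Within over all document pairs (i, j) with i < j."""
    between, within = [], []
    for i, j in combinations(range(len(map_xy)), 2):
        d = math.dist(map_xy[i], map_xy[j])  # dist_Map(i, j)
        # A pair counts as "within" when both documents share a reference topic.
        if reference_topic[i] == reference_topic[j]:
            within.append(d)
        else:
            between.append(d)
    dist_between = sum(between) / len(between)
    dist_within = sum(within) / len(within)
    return dist_between / dist_within


# Hypothetical data: four documents placed on the map, two reference topics.
map_xy = [(0, 0), (0, 1), (3, 3), (3, 4)]
reference_topic = ["sports", "sports", "finance", "finance"]
r = dist_ratio(map_xy, reference_topic)
print(round(r, 4))
```

A larger ratio means that documents of different topics lie farther apart on the map relative to documents of the same topic.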


Measures of Cluster Similarity (cont)

(Figure: single link and complete link)

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering
  – A compromise between single-link and complete-link clustering
  – The similarity between two clusters is the average similarity between members
  – If the objects are represented as length-normalized vectors and the similarity measure is the cosine
    • There exists a fast algorithm for computing the average similarity

$$sim(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\, |\vec{y}|} = \vec{x} \cdot \vec{y} \qquad \text{(length-normalized vectors)}$$

Measures of Cluster Similarity (cont)

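• Group-average agglomerative clustering (cont)
  – The average similarity SIM between vectors in a cluster $c_j$ is defined as

$$SIM(c_j) = \frac{1}{|c_j| \left( |c_j| - 1 \right)} \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j,\, \vec{y} \ne \vec{x}} sim(\vec{x}, \vec{y}) = \frac{1}{|c_j| \left( |c_j| - 1 \right)} \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j,\, \vec{y} \ne \vec{x}} \vec{x} \cdot \vec{y}$$

  – The sum of the members in a cluster $c_j$

$$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$

  – Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$

$$
\begin{aligned}
\vec{s}(c_j) \cdot \vec{s}(c_j) &= \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} \\
&= |c_j| \left( |c_j| - 1 \right) SIM(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} \\
&= |c_j| \left( |c_j| - 1 \right) SIM(c_j) + |c_j| \qquad (\vec{x} \cdot \vec{x} = 1 \text{ for a length-normalized vector}) \\
\therefore\; SIM(c_j) &= \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j| \left( |c_j| - 1 \right)}
\end{aligned}
$$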
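The shortcut can be checked against the pairwise definition on a small cluster. This is an illustrative sketch; the helper names and the four example vectors are assumptions.

```python
import math
from itertools import permutations


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def average_similarity_direct(cluster):
    """Average cosine over all ordered pairs of distinct members."""
    n = len(cluster)
    return sum(dot(x, y) for x, y in permutations(cluster, 2)) / (n * (n - 1))


def average_similarity_fast(cluster):
    """The same quantity computed from the cluster sum vector s(c_j)."""
    n = len(cluster)
    s = [sum(col) for col in zip(*cluster)]
    return (dot(s, s) - n) / (n * (n - 1))


# Hypothetical cluster of four length-normalized vectors.
raw = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
cluster = [normalize(v) for v in raw]

direct = average_similarity_direct(cluster)
fast = average_similarity_fast(cluster)
print(math.isclose(direct, fast))  # True
```

The direct version costs a number of dot products quadratic in the cluster size, while the sum-vector version needs only one pass over the members, which is the point of the fast algorithm.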

Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)
  – When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

$$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$

  – The average similarity for their union will be

$$SIM(c_i \cup c_j) = \frac{\left( \vec{s}(c_i) + \vec{s}(c_j) \right) \cdot \left( \vec{s}(c_i) + \vec{s}(c_j) \right) - \left( |c_i| + |c_j| \right)}{\left( |c_i| + |c_j| \right) \left( |c_i| + |c_j| - 1 \right)}$$

Example Word Clustering

• Words (objects) are described and clustered using a set of features and values
  – E.g., the left and right neighbors of tokens of words

(Figure annotations: "be" has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity)

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure
• How to split a cluster
  – Splitting is itself a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm

[Figure: divisive clustering procedure. Annotations: split the least coherent cluster; generate two new clusters and remove the original one]
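The divisive procedure can be sketched in a few lines. This is only an illustration under stated assumptions: the objects are 1-D numbers, coherence is measured as the negative mean pairwise distance (a group-average style measure), and the split is done with a 2-means pass seeded by the minimum and maximum. The names `coherence`, `split_in_two` and `divisive` are hypothetical.

```python
def coherence(cluster):
    """Negative mean pairwise distance; higher means more coherent."""
    n = len(cluster)
    if n < 2:
        return 0.0
    total = sum(abs(a - b) for i, a in enumerate(cluster) for b in cluster[i + 1:])
    return -total / (n * (n - 1) / 2)

def split_in_two(cluster):
    """Split a 1-D cluster with 2-means, seeded by its smallest and largest value."""
    lo, hi = min(cluster), max(cluster)
    for _ in range(100):
        left = [x for x in cluster if abs(x - lo) <= abs(x - hi)]
        right = [x for x in cluster if abs(x - lo) > abs(x - hi)]
        new_lo, new_hi = sum(left) / len(left), sum(right) / len(right)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return left, right

def divisive(points, k):
    """Top-down clustering: repeatedly split the least coherent cluster."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = min(clusters, key=coherence)
        if coherence(worst) == 0.0:      # nothing left that can be split
            break
        clusters.remove(worst)           # remove the original cluster ...
        clusters.extend(split_in_two(worst))  # ... and add its two parts
    return clusters

result = divisive([1.0, 1.2, 0.8, 5.0, 5.2, 9.0, 9.1], k=3)
```

Any other splitting routine (for example a bottom-up algorithm run until two clusters remain) could replace `split_in_two` without changing the outer loop.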

Non-Hierarchical Clustering


Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)
• Problems associated with non-hierarchical clustering
  – When to stop (e.g., based on group-average similarity, likelihood, or mutual information)
  – What is the right number of clusters (k-1 → k → k+1)
    • Hierarchical clustering also has to face this problem
• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Define clusters by the center of mass of their members
• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)
  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors)
    • A compression rate of 3

\[
X = \{\mathbf{x}^t\}_{t=1}^{n} \;\xrightarrow{\;F\;}\; \{\mathbf{m}_j\}_{j=1}^{k}, \qquad \mathbf{x}^t \mapsto \text{index } j
\]

Here \(\mathbf{m}_j\) is the cluster centroid, also called the reference vector, code vector, or code word. In the color example, Dim(\(\mathbf{x}^t\)) = 24 → k = 2^8.

The K-means Algorithm (cont)

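• Total reconstruction error

\[
E\bigl(\{\mathbf{m}_i\}_{i=1}^{k} \mid X\bigr) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t \,\bigl\|\mathbf{x}^t - \mathbf{m}_i\bigr\|^2,
\qquad\text{where the label}\quad
b_i^t =
\begin{cases}
1 & \text{if } \|\mathbf{x}^t - \mathbf{m}_i\| = \min_j \|\mathbf{x}^t - \mathbf{m}_j\| \\
0 & \text{otherwise}
\end{cases}
\]

  – \(b_i^t\) and \(\mathbf{m}_i\) are unknown
  – \(b_i^t\) depends on \(\mathbf{m}_i\), and this optimization problem cannot be solved analytically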

The K-means Algorithm (cont)

• Initialization
  – A set of initial cluster centers \(\{\mathbf{m}_i\}_{i=1}^{k}\) is needed
• Recursion
  – Assign each object \(\mathbf{x}^t\) to the cluster whose center is closest

\[
b_i^t =
\begin{cases}
1 & \text{if } \|\mathbf{x}^t - \mathbf{m}_i\| = \min_j \|\mathbf{x}^t - \mathbf{m}_j\| \\
0 & \text{otherwise}
\end{cases}
\]

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

\[
\mathbf{m}_i = \frac{\sum_{t=1}^{N} b_i^t\,\mathbf{x}^t}{\sum_{t=1}^{N} b_i^t}
\]

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)
• These two steps are repeated until \(\mathbf{m}_i\) stabilizes
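The two alternating steps can be written down directly. The sketch below is one possible pure-Python rendering, assuming points given as tuples, caller-supplied seeds, and ties broken in favor of the lowest cluster index; `assign`, `kmeans`, `sq_dist` and the toy data are illustrative names and values.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Assignment step: label each point with the index of its closest center."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]

def kmeans(points, seeds, max_iter=100):
    centers = [tuple(c) for c in seeds]
    for _ in range(max_iter):
        labels = assign(points, centers)
        # Update step: each center becomes the mean of its current members.
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)   # keep an empty cluster's center unchanged
        if new_centers == centers:        # centers have stabilized
            break
        centers = new_centers
    labels = assign(points, centers)
    # Total reconstruction error E for the final centers and labels.
    error = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, error

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centers, labels, error = kmeans(points, seeds=[points[0], points[3]])
```

On this toy set the third point first ties between the two seeds and then moves to the second cluster once the centers are re-estimated, which illustrates why several passes are needed.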

The K-means Algorithm (cont)

• Algorithm

The K-means Algorithm (cont)

• Example 1

The K-means Algorithm (cont)

• Example 2

[Figure labels: government, finance, sports, research, name]

The K-means Algorithm (cont)

• Choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean \(\mathbf{m}\) of all data and generate k initial centers \(\mathbf{m}_i\) by adding a small random vector to the mean: \(\mathbf{m} \pm \boldsymbol{\delta}\)
  – Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center \(\mathbf{m}_i\)
  – Or use another method, such as a hierarchical clustering algorithm on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set
• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly
• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with l = log2 k (k clusters)

[Figure annotations: nodes on the hypercube; a linear classifier]

The K-means Algorithm (cont)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray

[Figure: starting from the global mean, the codebook is split into Cluster 1 mean and Cluster 2 mean, then further into components {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}; M → 2M at each iteration]

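The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

\[
P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta), \qquad \pi_k = P(c_k)
\]

[Figure: a mixture Gaussian HMM (or a mixture of Gaussians); branch k has weight \(\pi_k = P(c_k)\) and density \(P(\vec{x}_i \mid c_k)\), and the branches are summed to give \(\vec{x}_i\)]

• Continuous case

\[
P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^{T}\,\Sigma_k^{-1}\,(\vec{x}_i-\vec{\mu}_k)\right)
\]

• Likelihood function for the data samples \(X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}\), where the \(\vec{x}_i\)'s are independent and identically distributed (i.i.d.)

\[
P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n}\sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)
\]

• Classification

\[
\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)
\]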
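To make the mixture likelihood and the classification rule concrete, here is a small sketch for the one-dimensional case (m = 1, so the covariance reduces to a variance). The parameter values and the helper names `gaussian_pdf`, `posteriors`, `log_likelihood` and `classify` are assumptions made for illustration only.

```python
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density P(x | c_k) with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posteriors(x, weights, means, variances):
    """P(c_k | x) for every k: joint P(x | c_k) P(c_k) divided by P(x)."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)                      # this is the mixture density P(x)
    return [j / total for j in joint]

def log_likelihood(xs, weights, means, variances):
    """log P(X) = sum_i log sum_k P(x_i | c_k) P(c_k), assuming i.i.d. samples."""
    return sum(math.log(sum(w * gaussian_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in xs)

def classify(x, weights, means, variances):
    """Index k with the largest posterior (equivalently the largest joint)."""
    post = posteriors(x, weights, means, variances)
    return max(range(len(post)), key=post.__getitem__)

weights, means, variances = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]
post = posteriors(2.0, weights, means, variances)
label_a = classify(0.1, weights, means, variances)
label_b = classify(4.8, weights, means, variances)
ll = log_likelihood([0.1, 4.8, 2.0], weights, means, variances)
```

Unlike the hard labels of K-means, `post` gives every cluster a non-zero share of the sample at x = 2.0.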

The EM Algorithm (cont)


Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B|S1) = 2/4 = 0.5

P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

Sample   State S1   State S2
1 (B)    0.7        0.3
2 (W)    0.4        0.6
3 (B)    0.9        0.1
4 (W)    0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64

P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36

P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27

P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
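The soft-assignment estimates are weighted relative frequencies, which a few lines of code reproduce. This sketch assumes the four samples are labeled B, W, B, W in row order (as implied by the sums in the formulas); `soft_ml_estimate` is a hypothetical helper name.

```python
# Each sample: (observed symbol, weight for state S1, weight for state S2).
samples = [("B", 0.7, 0.3), ("W", 0.4, 0.6), ("B", 0.9, 0.1), ("W", 0.5, 0.5)]

def soft_ml_estimate(samples, state):
    """Return P(symbol | state) using fractional (soft) counts."""
    total = sum(s[1 + state] for s in samples)
    probs = {}
    for symbol, *state_weights in samples:
        probs[symbol] = probs.get(symbol, 0.0) + state_weights[state] / total
    return probs

p_s1 = soft_ml_estimate(samples, 0)   # state S1
p_s2 = soft_ml_estimate(samples, 1)   # state S2
```

With all weights set to 0 or 1 the same function reduces to the hard-assignment counts.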

The EM Algorithm (cont)

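• E-step (Expectation)
  – Derive the complete data likelihood function

\[
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n}\left[\sum_{k=1}^{K} P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)\right] \\
&= \bigl[P(\vec{x}_1 \mid c_1,\Theta)P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K,\Theta)P(c_K \mid \Theta)\bigr] \times \cdots \times \bigl[P(\vec{x}_n \mid c_1,\Theta)P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K,\Theta)P(c_K \mid \Theta)\bigr] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\;\prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i},\Theta)\,P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\;\prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
\]

where \(X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_{n-1}, \vec{x}_n\}\) and \(C = \{c_{k_1}, c_{k_2}, \ldots, c_{k_{n-1}}, c_{k_n}\}\). \(P(X \mid \Theta)\) is the likelihood function and \(P(X, C \mid \Theta)\) is the complete data likelihood function.

  – How many kinds of C are there? \(K^n\) kinds.

Note:

\[
\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\;\prod_{t=1}^{T} a_{t k_t}
\]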
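The product-of-sums identity used in this derivation, and the count of K^n hidden assignments, can be verified by brute force for small sizes. The numbers in `a` below are arbitrary illustrative values, where `a[t][k]` stands for the joint term of sample t with cluster k.

```python
import itertools
import math

# a[t][k] plays the role of P(x_t | c_k) P(c_k); T = 2 samples, M = 3 clusters.
a = [[0.2, 0.5, 0.3],
     [0.1, 0.6, 0.3]]
T, M = len(a), len(a[0])

# Left-hand side: product over samples of the sum over clusters.
prod_of_sums = math.prod(sum(row) for row in a)

# Right-hand side: sum over every assignment (k_1, ..., k_T) of the product.
assignments = list(itertools.product(range(M), repeat=T))
sum_of_prods = sum(math.prod(a[t][k] for t, k in enumerate(ks)) for ks in assignments)
```

The list `assignments` has M**T entries, which is why summing over C explicitly is infeasible for realistic n and motivates the factorized E-step.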

The EM Algorithm (cont)

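• E-step (Expectation)
  – Define the auxiliary function \(\Phi(\Theta,\hat{\Theta})\) as the expectation of the log complete likelihood function \(L_{CM}\) with respect to the hidden/latent variable C, conditioned on the known data \((X, \Theta)\)

\[
\begin{aligned}
\Phi(\Theta,\hat{\Theta}) &= E_{C}\bigl[L_{CM} \mid X, \Theta\bigr] = E_{C}\bigl[\log P(X, C \mid \hat{\Theta}) \mid X, \Theta\bigr] \\
&= \sum_{C} P(C \mid X, \Theta)\,\log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)}\,\log P(X, C \mid \hat{\Theta})
\end{aligned}
\]

    (\(\Theta\): known; \(\hat{\Theta}\): unknown)

  – Maximize the log likelihood function \(\log P(X \mid \hat{\Theta})\) by maximizing the expectation of the log complete likelihood function \(\Phi(\Theta,\hat{\Theta})\)
    • We have shown this property when deriving the HMM-based retrieval model

\[
\Phi(\Theta,\hat{\Theta}) - \Phi(\Theta,\Theta) \;\le\; \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)
\]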

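The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function \(\Phi(\Theta,\hat{\Theta})\)

\[
\begin{aligned}
\Phi(\Theta,\hat{\Theta}) &= \sum_{C} P(C \mid X, \Theta)\,\log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C}\left[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\right]\left[\log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta})\right] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\left[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\right]\left[\sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta})\right] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\left[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\right]\left[\sum_{i=1}^{n}\sum_{k=1}^{K} \delta_{k_i,k}\,\log P(\vec{x}_i, c_{k} \mid \hat{\Theta})\right] \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \log P(\vec{x}_i, c_{k} \mid \hat{\Theta})\left\{\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} \delta_{k_i,k}\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\right\} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\,\log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\,\log\bigl[P(\vec{x}_i \mid c_k, \hat{\Theta})\,P(c_k \mid \hat{\Theta})\bigr] \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\,\log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\,\log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
\]

where

\[
\delta_{k_i,k} =
\begin{cases}
1 & \text{if } k_i = k \\
0 & \text{otherwise}
\end{cases}
\]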

The EM Algorithm (cont)

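  – Note that

\[
\begin{aligned}
\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} \delta_{k_i,k}\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[\sum_{k_1=1}^{K}\cdots\sum_{k_{i-1}=1}^{K}\sum_{k_{i+1}=1}^{K}\cdots\sum_{k_n=1}^{K}\;\prod_{\substack{j=1 \\ j\neq i}}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[\prod_{\substack{j=1 \\ j\neq i}}^{n}\;\sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta)\right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[\prod_{\substack{j=1 \\ j\neq i}}^{n} 1\right] P(c_k \mid \vec{x}_i, \Theta) = P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
\]

(Because of \(\delta_{k_i,k}\), \(\vec{x}_i\) can only be aligned to \(c_k\).)

Note:

\[
\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\;\prod_{t=1}^{T} a_{t k_t}
\]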

The EM Algorithm (cont)

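• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

\[
\Phi(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + \Phi_b(\Theta,\hat{\Theta})
\]

where

\[
\begin{aligned}
\Phi_a(\Theta,\hat{\Theta}) &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\,\log P(c_k \mid \hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)}\,\log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\,P(c_l \mid \Theta)}\,\log P(c_k \mid \hat{\Theta})
\qquad \text{(auxiliary function for mixture weights)}
\end{aligned}
\]

\[
\begin{aligned}
\Phi_b(\Theta,\hat{\Theta}) &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\,\log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\,P(c_l \mid \Theta)}\,\log P(\vec{x}_i \mid c_k, \hat{\Theta})
\qquad \text{(auxiliary function for cluster distributions)}
\end{aligned}
\]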

The EM Algorithm (cont)

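• M-step (Maximization)
  – Remember that
    • A function F can be maximized under a constraint by applying a Lagrange multiplier

Suppose that

\[
F = \sum_{j=1}^{N} w_j \log y_j \qquad \text{with the constraint} \qquad \sum_{j=1}^{N} y_j = 1
\]

By applying a Lagrange multiplier \(\ell\):

\[
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\,y_j \quad \forall j \\
\Rightarrow\; \sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j = -\ell \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
\]

Note: \(\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}\)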
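The closed-form solution y_j = w_j / Σ w_j can be sanity-checked numerically. The sketch below compares it against randomly drawn points on the probability simplex; the weights, the sample count, and the fixed random seed are arbitrary assumptions.

```python
import math
import random

def objective(w, y):
    """F = sum_j w_j log y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

w = [2.0, 1.0, 1.0]
y_star = [wj / sum(w) for wj in w]       # the Lagrange-multiplier solution

rng = random.Random(0)

def random_simplex_point(n):
    """A random positive vector rescaled to sum to one."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

# No feasible random candidate should beat the closed-form maximizer.
best_random = max(objective(w, random_simplex_point(len(w))) for _ in range(1000))
f_star = objective(w, y_star)
```

This is only a spot check, not a proof; the derivation above is what guarantees optimality.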

The EM Algorithm (cont)

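• M-step (Maximization)
  – Maximize \(\Phi_a(\Theta,\hat{\Theta})\), the auxiliary function for mixture weights (or priors for Gaussians)

\[
\begin{aligned}
\hat{\Phi}_a(\Theta,\hat{\Theta}) &= \Phi_a(\Theta,\hat{\Theta}) + \ell\left(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\right) \\
&= \sum_{k=1}^{K}\underbrace{\left[\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\,P(c_l \mid \Theta)}\right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell\left(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\right)
\end{aligned}
\]

\[
\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\,P(c_l \mid \Theta)}}{\displaystyle\sum_{k'=1}^{K}\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\,P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\,P(c_l \mid \Theta)}}
= \frac{1}{n}\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\,P(c_l \mid \Theta)}
\]

(Each summand \(P(c_k \mid \vec{x}_i, \Theta)\) is the expected number of times that \(\vec{x}_i\) falls in class \(c_k\).)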

The EM Algorithm (cont)

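• M-step (Maximization)
  – Maximize \(\Phi_b(\Theta,\hat{\Theta})\), the auxiliary function for (multivariate) Gaussian means and variances

\[
\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\,P(c_l \mid \Theta)}\,\log P(\vec{x}_i \mid c_k, \hat{\Theta})
\]

\[
P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^{T}\,\Sigma_k^{-1}\,(\vec{x}_i-\vec{\mu}_k)\right)
\]

Let

\[
w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\,P(c_l \mid \Theta)}
\qquad\text{and}\qquad
\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2}\log(2\pi) - \frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)
\]

Then

\[
\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D
\]

where D is a constant.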

The EM Algorithm (cont)

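• M-step (Maximization)
  – Maximize \(\Phi_b(\Theta,\hat{\Theta})\) with respect to \(\hat{\vec{\mu}}_k\)

\[
\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D
\]

\[
\begin{aligned}
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\vec{\mu}}_k} &= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\cdot 2\cdot\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1) = 0 \\
\Rightarrow\; \hat{\vec{\mu}}_k &= \frac{\sum_{i=1}^{n} w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\,P(c_l \mid \Theta)}\cdot\vec{x}_i}{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\,P(c_l \mid \Theta)}}
\end{aligned}
\]

Note: \(\dfrac{d(\vec{x}^{T} C \vec{x})}{d\vec{x}} = (C + C^{T})\,\vec{x}\), and \(\Sigma_k^{-1}\) is symmetric here.

(\(w_{ik}\) is the expected number of times that \(\vec{x}_i\) falls in class \(c_k\).)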

The EM Algorithm (cont)

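• M-step (Maximization)
  – Maximize \(\Phi_b(\Theta,\hat{\Theta})\) with respect to \(\hat{\Sigma}_k\)

\[
\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D
\]

\[
\begin{aligned}
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\left[\frac{1}{|\hat{\Sigma}_k|}\cdot|\hat{\Sigma}_k|\cdot\hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\right] = 0 \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k^{-1} &= \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1} \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\hat{\Sigma}_k \\
\Rightarrow\; \hat{\Sigma}_k\sum_{i=1}^{n} w_{ik} &= \sum_{i=1}^{n} w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T} \\
\Rightarrow\; \hat{\Sigma}_k &= \frac{\sum_{i=1}^{n} w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\,P(c_l \mid \Theta)}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\,P(c_l \mid \Theta)}}
\end{aligned}
\]

Notes: \(\dfrac{d\,\det(X)}{dX} = \det(X)\cdot X^{-T}\) and \(\dfrac{d\,(\vec{a}^{T} X^{-1} \vec{b})}{dX} = -X^{-1}\,\vec{a}\,\vec{b}^{T} X^{-1}\) for symmetric X; \(\Sigma_k\) is symmetric here.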

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function \(P(X \mid \Theta)\) has converged or the maximum number of iterations is reached
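Putting the E-step and the three M-step updates together gives the following loop. It is a minimal sketch for one-dimensional data (so each covariance is a scalar variance), assuming caller-supplied initial parameters, for instance from K-means. The function name `em_gmm_1d`, the variance floor of 1e-6, and the tolerance are assumptions added for numerical safety and are not part of the lecture.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, weights, means, variances, max_iter=200, tol=1e-9):
    """EM for a 1-D Gaussian mixture; returns updated parameters and the
    log-likelihood evaluated at the last E-step."""
    weights, means, variances = list(weights), list(means), list(variances)
    n, K = len(xs), len(weights)
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E-step: responsibilities w_ik = P(c_k | x_i, Theta).
        resp, ll = [], 0.0
        for x in xs:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(K)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([j / total for j in joint])
        # Terminate once the log-likelihood has stopped improving.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: closed-form updates for weights, means and variances.
        for k in range(K):
            wk = sum(r[k] for r in resp)
            weights[k] = wk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / wk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / wk
            variances[k] = max(var, 1e-6)   # assumed floor to avoid collapse
    return weights, means, variances, ll

xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, means, variances, ll = em_gmm_1d(xs, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
```

For vector-valued data the same structure applies, with the scalar variance update replaced by the weighted outer-product formula for the covariance matrix.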

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

\[
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i)\left[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\right]
\]

\[
P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad
E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{dist(T_k, T_l)^2}{2\sigma^2}\right], \qquad
dist(T_i, T_j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}
\]

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are placed in the same cluster, and the relationships among the clusters are reflected by their distances on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]
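The neighborhood probabilities P(T_l | Y_k) depend only on the map coordinates of the topics. The sketch below computes them for a 2 x 2 grid; the coordinates, the value of sigma, and the function name `neighbor_probs` are illustrative assumptions.

```python
import math

def neighbor_probs(coords, sigma):
    """Return p with p[k][l] = P(T_l | Y_k) = E(T_k, T_l) / sum_s E(T_k, T_s)."""
    def e(a, b):
        # Gaussian kernel of the squared Euclidean distance on the map.
        d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
        return math.exp(-d2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    probs = []
    for ck in coords:
        row = [e(ck, cl) for cl in coords]
        total = sum(row)
        probs.append([v / total for v in row])
    return probs

coords = [(0, 0), (0, 1), (1, 0), (1, 1)]   # topic positions on the map
p = neighbor_probs(coords, sigma=1.0)
```

A small sigma concentrates each row on the topic itself, while a large sigma spreads the mass over the whole map, so sigma controls how strongly neighboring topics share word statistics.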

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

\[
\begin{aligned}
L_T &= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j, D_i)\,\log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j, D_i)\,\log\left\{\sum_{k=1}^{K} P(T_k \mid D_i)\left[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\right]\right\}
\end{aligned}
\]

  – EM training can be performed

\[
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\,P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N} c(w_{j'}, D_{i'})\,P(T_k \mid w_{j'}, D_{i'})}
\]

\[
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\,P(T_k \mid w_j, D_i)}{c(D_i)}
\]

where

\[
P(T_k \mid w_j, D_i) = \frac{\left[\sum_{l=1}^{K} P(w_j \mid T_l)\,P(T_l \mid Y_k)\right] P(T_k \mid D_i)}{\sum_{k'=1}^{K}\left\{\left[\sum_{l'=1}^{K} P(w_j \mid T_{l'})\,P(T_{l'} \mid Y_{k'})\right] P(T_{k'} \mid D_i)\right\}}
\]

Hierarchical Document Organization (cont)

• Criterion for topic word selection

\[
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\,P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'})\,\bigl[1 - P(T_k \mid D_{i'})\bigr]}
\]

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

Hierarchical Document Organization (cont)

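• Self-Organization Map (SOM)
  – A recursive regression process

\[
\mathbf{m}_i(t+1) = \mathbf{m}_i(t) + h_{c(\mathbf{x}),i}(t)\,\bigl[\mathbf{x}(t) - \mathbf{m}_i(t)\bigr]
\]

where

\[
c(\mathbf{x}) = \arg\min_i \|\mathbf{x} - \mathbf{m}_i\|, \qquad
\|\mathbf{x} - \mathbf{m}_i\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}, \qquad
h_{c(\mathbf{x}),i}(t) = \alpha(t)\,\exp\!\left(-\frac{\|\mathbf{r}_i - \mathbf{r}_{c(\mathbf{x})}\|^2}{2\sigma^2(t)}\right)
\]

Input vector (input layer): \(\mathbf{x} = [x_1, x_2, \ldots, x_n]^{T}\)

Weight vector (mapping layer): \(\mathbf{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^{T}\), e.g., \(\mathbf{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^{T}\)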
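One step of the SOM update rule can be sketched as follows, assuming the map units sit at fixed grid positions r_i and that the learning rate alpha(t) and width sigma(t) are passed in as plain numbers for the current step. The name `som_update` and the toy values are hypothetical.

```python
import math

def som_update(weights, grid, x, alpha, sigma):
    """Apply one SOM step for input x; returns (winner index, new weights)."""
    # Winner c(x): the unit whose weight vector is closest to x.
    c = min(range(len(weights)),
            key=lambda i: sum((xn - mn) ** 2 for xn, mn in zip(x, weights[i])))
    new_weights = []
    for m_i, r_i in zip(weights, grid):
        # Neighborhood function h_{c(x),i}: decays with map distance to the winner.
        d2 = sum((a - b) ** 2 for a, b in zip(r_i, grid[c]))
        h = alpha * math.exp(-d2 / (2 * sigma ** 2))
        # Move the unit toward x in proportion to h.
        new_weights.append([m + h * (xn - m) for m, xn in zip(m_i, x)])
    return c, new_weights

weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # weight vectors m_i
grid = [(0,), (1,), (2,)]                        # unit positions r_i on a 1-D map
winner, updated = som_update(weights, grid, x=[0.9, 1.2], alpha=0.5, sigma=1.0)
```

The winner moves the most, and its map neighbors follow by a smaller amount, which is what makes nearby units end up with similar weight vectors.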

Hierarchical Document Organization (cont)

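• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

\[
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
\]

where

\[
dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)}, \qquad
f_{Between}(i,j) =
\begin{cases}
dist_{Map}(i,j) & T_{r,i} \neq T_{r,j} \\
0 & \text{otherwise}
\end{cases}, \qquad
C_{Between}(i,j) =
\begin{cases}
1 & T_{r,i} \neq T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\]

\[
dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}, \qquad
f_{Within}(i,j) =
\begin{cases}
dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}, \qquad
C_{Within}(i,j) =
\begin{cases}
1 & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\]

\[
dist_{Map}(i,j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}
\]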
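The distance ratio can be computed with a single pass over all document pairs. The sketch below assumes each document has a map position (x, y) and a reference topic label; the function name `distance_ratio` and the four-document toy input are illustrative.

```python
import math
from itertools import combinations

def distance_ratio(positions, topics):
    """R_Dist: mean map distance between documents of different reference
    topics divided by the mean map distance within the same topic."""
    between_sum = within_sum = 0.0
    between_cnt = within_cnt = 0
    for i, j in combinations(range(len(positions)), 2):
        d = math.dist(positions[i], positions[j])   # dist_Map(i, j)
        if topics[i] != topics[j]:
            between_sum += d
            between_cnt += 1
        else:
            within_sum += d
            within_cnt += 1
    return (between_sum / between_cnt) / (within_sum / within_cnt)

positions = [(0, 0), (0, 1), (3, 0), (3, 1)]
topics = ["a", "a", "b", "b"]
ratio = distance_ratio(positions, topics)
```

A larger ratio means that documents sharing a reference topic lie closer together on the map than documents from different topics.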


Page 17: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR – Berlin Chen 17

Measures of Cluster Similarity (cont.)

• Group-average agglomerative clustering
  – A compromise between single-link and complete-link clustering
  – The similarity between two clusters is the average similarity between their members
  – If the objects are represented as length-normalized vectors and the similarity measure is the cosine
    • There exists a fast algorithm for computing the average similarity

$$\mathrm{sim}(\vec{x},\vec{y}) = \cos(\vec{x},\vec{y}) = \frac{\vec{x}\cdot\vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x}\cdot\vec{y} \qquad \text{(length-normalized vectors)}$$

[Figure: two clusters $C_i$ and $C_j$]

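IR – Berlin Chen 18

Measures of Cluster Similarity (cont.)

• Group-average agglomerative clustering (cont.)
  – The average similarity SIM between vectors in a cluster $c_j$ is defined as

$$SIM(c_j) = \frac{1}{|c_j|\,(|c_j|-1)} \sum_{\vec{x}\in c_j}\ \sum_{\substack{\vec{y}\in c_j \\ \vec{y}\neq\vec{x}}} \mathrm{sim}(\vec{x},\vec{y}) = \frac{1}{|c_j|\,(|c_j|-1)} \sum_{\vec{x}\in c_j}\ \sum_{\substack{\vec{y}\in c_j \\ \vec{y}\neq\vec{x}}} \vec{x}\cdot\vec{y}$$

  – The sum of the members in a cluster $c_j$

$$\vec{s}(c_j) = \sum_{\vec{x}\in c_j} \vec{x}$$

  – Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$

$$\begin{aligned}
\vec{s}(c_j)\cdot\vec{s}(c_j) &= \sum_{\vec{x}\in c_j} \vec{x}\cdot\vec{s}(c_j) = \sum_{\vec{x}\in c_j}\sum_{\vec{y}\in c_j} \vec{x}\cdot\vec{y} \\
&= |c_j|\,(|c_j|-1)\,SIM(c_j) + \sum_{\vec{x}\in c_j} \vec{x}\cdot\vec{x} \\
&= |c_j|\,(|c_j|-1)\,SIM(c_j) + |c_j| \qquad (\vec{x}\cdot\vec{x} = 1 \text{ for a length-normalized vector}) \\
\therefore\; SIM(c_j) &= \frac{\vec{s}(c_j)\cdot\vec{s}(c_j) - |c_j|}{|c_j|\,(|c_j|-1)}
\end{aligned}$$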
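The identity on this slide can be checked numerically. The sketch below is illustrative only: the helper names (`normalize`, `sim_brute_force`, `sim_from_sum`) and the three sample vectors are assumptions, not part of the lecture. It compares the pairwise definition of SIM with the sum-vector shortcut.

```python
import math
from itertools import permutations

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sim_brute_force(cluster):
    """Average of x.y over all ordered pairs of distinct members."""
    n = len(cluster)
    total = sum(dot(x, y) for x, y in permutations(cluster, 2))
    return total / (n * (n - 1))

def sim_from_sum(cluster):
    """Same quantity from the sum vector: (s.s - |c|) / (|c| (|c| - 1))."""
    n = len(cluster)
    s = [sum(col) for col in zip(*cluster)]
    return (dot(s, s) - n) / (n * (n - 1))

# Hypothetical cluster of three length-normalized vectors.
cluster = [normalize(v) for v in ([1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 2.0, 1.0])]
print(sim_brute_force(cluster), sim_from_sum(cluster))
```

The brute-force version touches every pair, while the shortcut needs only one pass to build the sum vector, which is why the slide calls it a fast algorithm.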

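IR – Berlin Chen 19

Measures of Cluster Similarity (cont.)

• Group-average agglomerative clustering (cont.)
  – When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

$$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$

  – The average similarity for their union will be

$$SIM(c_i \cup c_j) = \frac{\bigl(\vec{s}(c_i)+\vec{s}(c_j)\bigr)\cdot\bigl(\vec{s}(c_i)+\vec{s}(c_j)\bigr) - \bigl(|c_i|+|c_j|\bigr)}{\bigl(|c_i|+|c_j|\bigr)\bigl(|c_i|+|c_j|-1\bigr)}$$

[Figure: clusters $C_i$ and $C_j$ with their sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$, and the merged cluster $c_i + c_j$]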
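As a small worked case of the merge formula, the sketch below evaluates $SIM(c_i \cup c_j)$ from the two sum vectors and cluster sizes alone. The function name `merged_sim` and the two unit vectors (placed 60 degrees apart) are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def merged_sim(s_i, n_i, s_j, n_j):
    """SIM of the union, using only the sum vectors and the cluster sizes."""
    s_new = [a + b for a, b in zip(s_i, s_j)]   # s(c_New) = s(c_i) + s(c_j)
    n_new = n_i + n_j                           # |c_New| = |c_i| + |c_j|
    return (dot(s_new, s_new) - n_new) / (n_new * (n_new - 1))

# Two singleton clusters; for a singleton the sum vector is the member itself.
x = [1.0, 0.0]
y = [0.5, math.sqrt(3) / 2]          # unit vector at 60 degrees to x
sim_xy = merged_sim(x, 1, y, 1)
print(sim_xy)                        # close to cos(60 degrees) = 0.5
```

For two singletons the result reduces to the cosine of the two members, which matches the definition of the average similarity over the single pair.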

IR – Berlin Chen 20

Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values
  – E.g., the left and right neighbors of tokens of words

[Figure: hierarchical clustering of 22 words. Annotations: “be” has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity]

IR – Berlin Chen 21

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

IR – Berlin Chen 22

Divisive Clustering (cont.)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure

• How to split a cluster
  – Splitting is itself a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

IR – Berlin Chen 23

Divisive Clustering (cont.)

• Algorithm

[Figure: divisive clustering algorithm. Annotations: split the least coherent cluster; generate two new clusters and remove the original one]

IR – Berlin Chen 24

Non-Hierarchical Clustering

IR – Berlin Chen 25

Non-Hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  – When to stop (criteria: group-average similarity, likelihood, mutual information)
  – What is the right number of clusters ($k-1 \rightarrow k \rightarrow k+1$); hierarchical clustering also has to face this problem

• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

IR – Berlin Chen 26

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Define clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)

$$X = \{\vec{x}^t\}_{t=1}^{n} \;\longrightarrow\; F = \{\vec{m}_j\}_{j=1}^{k}$$

      where $\vec{m}_j$ is a cluster centroid (also called reference vector, code vector, or code word) and $j$ is its index

  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e. $\mathrm{Dim}(\vec{x}^t) = 24 \rightarrow k = 2^8$
    • A compression rate of 3

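IR – Berlin Chen 27

The K-means Algorithm (cont.)

• Total reconstruction error

$$E\bigl(\{\vec{m}_i\}_{i=1}^{k} \mid X\bigr) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t \,\bigl\|\vec{x}^t - \vec{m}_i\bigr\|^2, \qquad \text{where } b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t-\vec{m}_i\| = \min_j \|\vec{x}^t-\vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$

  – $b_i^t$ (the label) and $\vec{m}_i$ are unknown
  – $b_i^t$ depends on $\vec{m}_i$, and this optimization problem cannot be solved analytically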

IR – Berlin Chen 28

The K-means Algorithm (cont.)

• Initialization
  – A set of initial cluster centers $\{\vec{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  – Assign each object $\vec{x}^t$ to the cluster whose center is closest

$$b_i^t = \begin{cases} 1 & \text{if } \|\vec{x}^t-\vec{m}_i\| = \min_j \|\vec{x}^t-\vec{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$\vec{m}_i = \frac{\sum_{t=1}^{N} b_i^t\,\vec{x}^t}{\sum_{t=1}^{N} b_i^t}$$

    • The medoid can be used as the cluster center instead (a medoid is one of the objects in the cluster)

  – These two steps are repeated until $\vec{m}_i$ stabilizes
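The two alternating steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the algorithm figure from the lecture: the function names, the tie-breaking rule (lowest index wins), the handling of empty clusters, and the six sample points are all choices made here.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and re-estimation until the centers stop moving."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: label each point with its closest center
        # (ties go to the lowest index, an arbitrary choice for this sketch).
        labels = [min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
                  for x in points]
        # Re-estimation step: each center becomes the mean of its members.
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # leave an empty cluster's center unchanged
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def reconstruction_error(points, centers, labels):
    """Total squared distance of every point to its assigned center."""
    return sum(sq_dist(x, centers[lab]) for x, lab in zip(points, labels))

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (1.0, 0.0)])
error = reconstruction_error(points, centers, labels)
print(centers, labels, error)
```

Both seeds start inside the lower-left group, yet the second center migrates to the upper-right group after one re-estimation, because the far points pull its mean away.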

IR – Berlin Chen 29

The K-means Algorithm (cont.)

• Algorithm

IR – Berlin Chen 30

The K-means Algorithm (cont.)

• Example 1

IR – Berlin Chen 31

The K-means Algorithm (cont.)

• Example 2

[Figure: clusters labeled government, finance, sports, research, name]

IR – Berlin Chen 32

The K-means Algorithm (cont.)

• Choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean $\vec{m}$ of all data and generate the $k$ initial centers $\vec{m}_i$ by adding a small random vector to the mean: $\vec{m} \pm \vec{\delta}$
  – Project the data onto the principal component (first eigenvector), divide its range into $k$ equal intervals, and take the mean of the data in each group as the initial center $\vec{m}_i$
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

IR – Berlin Chen 33

The K-means Algorithm (cont.)

• How to break ties when several centers have the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly

• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an $l$-dimensional space/hypercube, with $l = \log_2 k$ ($k$ clusters)

[Figure: nodes on the hypercube, followed by a linear classifier]

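IR – Berlin Chen 34

The K-means Algorithm (cont.)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray

[Figure: global mean, Cluster 1 mean, Cluster 2 mean; cluster parameters $\{\mu_{11},\Sigma_{11},\omega_{11}\}$, $\{\mu_{12},\Sigma_{12},\omega_{12}\}$, $\{\mu_{13},\Sigma_{13},\omega_{13}\}$, $\{\mu_{14},\Sigma_{14},\omega_{14}\}$; $M \rightarrow 2M$ at each iteration]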

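IR – Berlin Chen 35

The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

[Figure: a mixture Gaussian HMM (or a mixture of Gaussians). The component densities $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \ldots, P(\vec{x}_i \mid c_K)$ are weighted by $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$ and summed]

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)$$

• Continuous case

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^T\,\Sigma_k^{-1}\,(\vec{x}_i-\vec{\mu}_k)\right)$$

• Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)

$$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n}\sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)$$

• Classification

$$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\,P(c_k \mid \Theta)$$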

IR – Berlin Chen 36

The EM Algorithm (cont.)

IR – Berlin Chen 37

Maximum Likelihood Estimation

• Hard Assignment

[Figure: observations assigned to State S1]

P(B|S1) = 2/4 = 0.5

P(W|S1) = 2/4 = 0.5

IR – Berlin Chen 38

Maximum Likelihood Estimation

• Soft Assignment

  State S1   State S2
    0.7        0.3
    0.4        0.6
    0.9        0.1
    0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64

P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36

P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27

P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
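The soft-assignment arithmetic can be reproduced with fractional counts. The sketch below assumes that the first and third observations are B and the second and fourth are W, which is what the numerators on the slide imply; the names `samples` and `soft_estimates` are illustrative.

```python
# (color, weight for S1, weight for S2) for the four observations.
# The color of each observation is inferred from the slide's numerators.
samples = [("B", 0.7, 0.3), ("W", 0.4, 0.6), ("B", 0.9, 0.1), ("W", 0.5, 0.5)]

def soft_estimates(samples, state):
    """Return P(color | state) using fractional (soft) counts."""
    totals = {}
    for color, *weights in samples:
        totals[color] = totals.get(color, 0.0) + weights[state]
    norm = sum(totals.values())
    return {color: count / norm for color, count in totals.items()}

p_s1 = soft_estimates(samples, 0)   # state S1
p_s2 = soft_estimates(samples, 1)   # state S2
print(p_s1, p_s2)
```

With hard assignment every weight would be 0 or 1 and the same function would return plain relative frequencies.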

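IR – Berlin Chen 39

The EM Algorithm (cont.)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$\begin{aligned}
P(X\mid\Theta) &= \prod_{i=1}^{n} P(\vec{x}_i\mid\Theta) = \prod_{i=1}^{n}\left[\sum_{k=1}^{K} P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)\right] \\
&= \bigl[P(\vec{x}_1\mid c_1,\Theta)P(c_1\mid\Theta) + \cdots + P(\vec{x}_1\mid c_K,\Theta)P(c_K\mid\Theta)\bigr] \times \cdots \times \bigl[P(\vec{x}_n\mid c_1,\Theta)P(c_1\mid\Theta) + \cdots + P(\vec{x}_n\mid c_K,\Theta)P(c_K\mid\Theta)\bigr] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \prod_{i=1}^{n} P(\vec{x}_i\mid c_{k_i},\Theta)\,P(c_{k_i}\mid\Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i}\mid\Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n}\mid\Theta) \\
&= \sum_{C} P(X, C\mid\Theta)
\end{aligned}$$

    where $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ and $C = \{c_{k_1}, c_{k_2}, \ldots, c_{k_n}\}$; $P(X\mid\Theta)$ is the likelihood function and $P(X, C\mid\Theta)$ is the complete data likelihood function
  – How many kinds of $C$? ($K^n$ kinds)

• Note

$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\ \prod_{t=1}^{T} a_{t k_t}$$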
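The product-of-sums identity in the note, and the count of possible assignments $C$, can be verified by brute force on a small case. The table `a` below holds arbitrary made-up numbers standing in for $P(\vec{x}_t \mid c_k)P(c_k)$; it is an assumption for illustration only.

```python
from itertools import product
from math import prod

# a[t][k] plays the role of P(x_t | c_k) P(c_k); the values are arbitrary.
a = [[0.1, 0.3, 0.2],
     [0.4, 0.1, 0.5],
     [0.2, 0.2, 0.6],
     [0.3, 0.5, 0.1]]
T, M = len(a), len(a[0])

# Left-hand side: product over t of the sum over k.
lhs = prod(sum(row) for row in a)

# Right-hand side: sum over every assignment (k_1, ..., k_T)
# of the product of the chosen terms.
assignments = list(product(range(M), repeat=T))
rhs = sum(prod(a[t][k] for t, k in enumerate(ks)) for ks in assignments)

print(len(assignments), lhs, rhs)
```

The number of assignments grows as $M^T$, so the expanded right-hand side is useful for derivations but not for computation; the factored left-hand side is what one would evaluate in practice.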

IR – Berlin Chen 40

The EM Algorithm (cont.)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta,\hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$

$$\Phi(\Theta,\hat{\Theta}) = E[L_{CM}] = E_C\bigl[\log P(X, C\mid\hat{\Theta}) \,\big|\, X, \Theta\bigr] = \sum_{C} P(C\mid X,\Theta)\,\log P(X, C\mid\hat{\Theta}) = \sum_{C} \frac{P(X, C\mid\Theta)}{P(X\mid\Theta)}\,\log P(X, C\mid\hat{\Theta})$$

    ($\Theta$: known; $\hat{\Theta}$: unknown)

  – Maximize the log likelihood function $\log P(X\mid\hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta,\hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$Q(\Theta,\hat{\Theta}) - Q(\Theta,\Theta) \le \log P(X\mid\hat{\Theta}) - \log P(X\mid\Theta)$$

    (here $Q$ denotes the auxiliary function $\Phi$)

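IR – Berlin Chen 41

The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta,\hat{\Theta})$

$$\begin{aligned}
\Phi(\Theta,\hat{\Theta}) &= \sum_{C}\frac{P(X,C\mid\Theta)}{P(X\mid\Theta)}\,\log P(X,C\mid\hat{\Theta}) \\
&= \sum_{C}\left[\prod_{j=1}^{n}\frac{P(\vec{x}_j,c_{k_j}\mid\Theta)}{P(\vec{x}_j\mid\Theta)}\right]\log\prod_{i=1}^{n}P(\vec{x}_i,c_{k_i}\mid\hat{\Theta}) \\
&= \sum_{C}\left[\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\right]\sum_{i=1}^{n}\log P(\vec{x}_i,c_{k_i}\mid\hat{\Theta}) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\left[\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\right]\sum_{i=1}^{n}\sum_{k=1}^{K}\delta_{k_i,k}\,\log P(\vec{x}_i,c_k\mid\hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}\log P(\vec{x}_i,c_k\mid\hat{\Theta})\left\{\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k_i,k}\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\right\} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log P(\vec{x}_i,c_k\mid\hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log\bigl[P(\vec{x}_i\mid c_k,\hat{\Theta})\,P(c_k\mid\hat{\Theta})\bigr] \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log P(c_k\mid\hat{\Theta}) + \sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log P(\vec{x}_i\mid c_k,\hat{\Theta})
\end{aligned}$$

    where

$$\delta_{k_i,k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}$$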

IR – Berlin Chen 42

The EM Algorithm (cont.)

  – Note that

$$\begin{aligned}
\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k_i,k}\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)
&= \sum_{k_1=1}^{K}\cdots\sum_{k_{i-1}=1}^{K}\ \sum_{k_{i+1}=1}^{K}\cdots\sum_{k_n=1}^{K}\left[\prod_{j=1,\,j\neq i}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\right]P(c_k\mid\vec{x}_i,\Theta) \\
&= \left[\prod_{j=1,\,j\neq i}^{n}\ \sum_{k_j=1}^{K}P(c_{k_j}\mid\vec{x}_j,\Theta)\right]P(c_k\mid\vec{x}_i,\Theta) \\
&= \left[\prod_{j=1,\,j\neq i}^{n} 1\right]P(c_k\mid\vec{x}_i,\Theta) \\
&= P(c_k\mid\vec{x}_i,\Theta)
\end{aligned}$$

    ($\vec{x}_i$ can only be aligned to $c_k$)

• Note

$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\ \prod_{t=1}^{T} a_{t k_t}$$

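IR – Berlin Chen 43

The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$\Phi(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + \Phi_b(\Theta,\hat{\Theta})$$

    where

$$\begin{aligned}
\Phi_a(\Theta,\hat{\Theta}) &= \sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log P(c_k\mid\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{P(\vec{x}_i\mid\Theta)}\,\log P(c_k\mid\hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\,\log P(c_k\mid\hat{\Theta}) \qquad \text{(auxiliary function for mixture weights)}
\end{aligned}$$

$$\begin{aligned}
\Phi_b(\Theta,\hat{\Theta}) &= \sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log P(\vec{x}_i\mid c_k,\hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\,\log P(\vec{x}_i\mid c_k,\hat{\Theta}) \qquad \text{(auxiliary function for cluster distributions)}
\end{aligned}$$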

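IR – Berlin Chen 44

The EM Algorithm (cont.)

• M-step (Maximization)
  – Remember that
    • A function $F$ can be maximized under a constraint by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$ with the constraint $\sum_{j=1}^{N} y_j = 1$. By applying a Lagrange multiplier $\ell$:

$$\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right) \\
\frac{\partial\hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\,y_j \quad \forall j \\
\sum_{j=1}^{N} w_j &= -\ell\sum_{j=1}^{N} y_j = -\ell \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$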
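The closed-form maximizer $y_j = w_j / \sum_{j'} w_{j'}$ can be sanity-checked by comparing it with randomly drawn points on the probability simplex. This is only a numerical spot check under assumed weights `w`, not a proof; the names `objective`, `y_star`, and `best_other` are illustrative.

```python
import math
import random

def objective(w, y):
    """F = sum_j w_j log y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

w = [3.0, 1.0, 6.0]                      # arbitrary positive weights
y_star = [wj / sum(w) for wj in w]       # the Lagrange-multiplier solution

random.seed(0)
best_other = -math.inf
for _ in range(1000):
    # Draw a random point with positive coordinates that sum to one.
    raw = [random.random() + 1e-9 for _ in w]
    y = [r / sum(raw) for r in raw]
    best_other = max(best_other, objective(w, y))

print(objective(w, y_star), best_other)
```

None of the random candidates exceeds the value at `y_star`, which is what the derivation predicts.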

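IR – Berlin Chen 45

The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta,\hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$\begin{aligned}
\hat{\Phi}_a(\Theta,\hat{\Theta}) &= \Phi_a(\Theta,\hat{\Theta}) + \ell\left(\sum_{k=1}^{K}P(c_k\mid\hat{\Theta}) - 1\right) \\
&= \sum_{k=1}^{K}\ \underbrace{\left[\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\right]}_{w_k}\ \log\underbrace{P(c_k\mid\hat{\Theta})}_{y_k} + \ell\left(\sum_{k=1}^{K}P(c_k\mid\hat{\Theta}) - 1\right)
\end{aligned}$$

$$\Rightarrow\; \hat{\pi}_k = P(c_k\mid\hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} = \frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}{\displaystyle\sum_{k'=1}^{K}\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_{k'},\Theta)\,P(c_{k'}\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}} = \frac{1}{n}\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}$$

    (the numerator $w_k$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$)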

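IR – Berlin Chen 46

The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\,\log P(\vec{x}_i\mid c_k,\hat{\Theta})$$

$$P(\vec{x}_i\mid c_k,\hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\,|\hat{\Sigma}_k|^{1/2}}\exp\!\left(-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\right)$$

Let

$$w_{ik} = \frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)} \quad\text{and}\quad \log P(\vec{x}_i\mid c_k,\hat{\Theta}) = -\frac{m}{2}\log(2\pi) - \frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)$$

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k) + D\right] \qquad (D \text{ is a constant})$$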

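IR – Berlin Chen 47

The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k) + D\right]$$

$$\frac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\vec{\mu}}_k} = -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1) = 0$$

$$\Rightarrow\; \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\cdot\vec{x}_i}{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}$$

    (the denominator is the expected number of times that $\vec{x}_i$ falls in class $c_k$)

Note: $\dfrac{d(\vec{x}^T C\,\vec{x})}{d\vec{x}} = (C + C^T)\,\vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here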

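IR – Berlin Chen 48

The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k) + D\right]$$

$$\begin{aligned}
\frac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\Sigma}_k} &= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\left[\frac{1}{|\hat{\Sigma}_k|}\cdot|\hat{\Sigma}_k|\cdot\hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1}\right] = 0 \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1} \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1}\hat{\Sigma}_k \\
&\Rightarrow\; \hat{\Sigma}_k\sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T \\
&\Rightarrow\; \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T}{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}
\end{aligned}$$

Notes:

$$\frac{d\,\det(X)}{dX} = \det(X)\cdot X^{-T}, \qquad \frac{d\,(\vec{a}^T X^{-1}\vec{b})}{dX} = -X^{-1}\,\vec{a}\,\vec{b}^T X^{-1} \ \text{ for symmetric } X, \qquad \text{and } \hat{\Sigma}_k \text{ is symmetric here}$$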

IR – Berlin Chen 49

The EM Algorithm (cont.)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X\mid\Theta)$ has converged or the maximum number of iterations is reached
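Putting the E-step and the M-step updates together, the following is a minimal sketch of EM for a one-dimensional mixture of two Gaussians. It is a simplification of the multivariate case in the slides (scalar variances instead of covariance matrices), and the data, the hand-picked starting parameters (used here in place of a K-means initialization), the variance floor, and the stopping tolerance are all assumptions made for illustration.

```python
import math

def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, pis, mus, vars_):
    """log P(X | Theta) for the mixture."""
    return sum(math.log(sum(p * gauss(x, m, v) for p, m, v in zip(pis, mus, vars_)))
               for x in xs)

def em_step(xs, pis, mus, vars_):
    K, n = len(pis), len(xs)
    # E-step: w[i][k] = P(c_k | x_i, Theta)
    w = []
    for x in xs:
        joint = [p * gauss(x, m, v) for p, m, v in zip(pis, mus, vars_)]
        total = sum(joint)
        w.append([j / total for j in joint])
    # M-step: closed-form updates for weights, means, and variances
    nk = [sum(w[i][k] for i in range(n)) for k in range(K)]
    new_pis = [nk[k] / n for k in range(K)]
    new_mus = [sum(w[i][k] * xs[i] for i in range(n)) / nk[k] for k in range(K)]
    new_vars = [max(sum(w[i][k] * (xs[i] - new_mus[k]) ** 2 for i in range(n)) / nk[k],
                    1e-6)  # small floor (an assumption) to avoid a zero variance
                for k in range(K)]
    return new_pis, new_mus, new_vars

xs = [0.8, 1.0, 1.2, 1.1, 0.9, 4.9, 5.2, 5.0, 4.8, 5.1]
params = ([0.5, 0.5], [0.0, 6.0], [1.0, 1.0])   # hand-picked initial Theta
history = [log_likelihood(xs, *params)]
for _ in range(50):                              # maximum number of iterations
    params = em_step(xs, *params)
    history.append(log_likelihood(xs, *params))
    if history[-1] - history[-2] < 1e-9:         # likelihood has converged
        break
pis, mus, vars_ = params
print(pis, mus, vars_, history[-1])
```

The list `history` records the log likelihood after every iteration, which makes the two stopping conditions named on the slide explicit: convergence of the likelihood, or the iteration cap.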

IR – Berlin Chen 50

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

$$P(w_j\mid D_i) = \sum_{k=1}^{K} P(T_k\mid D_i)\left[\sum_{l=1}^{K} P(T_l\mid Y_k)\,P(w_j\mid T_l)\right]$$

$$P(T_l\mid Y_k) = \frac{E(T_k,T_l)}{\sum_{s=1}^{K} E(T_k,T_s)}, \qquad E(T_k,T_l) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\right], \qquad dist(T_i,T_j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by the distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]
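To make the role of the map distance concrete, the sketch below computes the normalized neighborhood probabilities $P(T_l \mid Y_k)$ for topics placed on a small grid. The 2 x 2 layout, the value of `sigma`, and the function name `neighborhood` are hypothetical choices for illustration; the lecture does not specify them.

```python
import math

def neighborhood(coords, sigma):
    """Return table[k][l] = P(T_l | Y_k) for topics at 2-D map coordinates."""
    def e(a, b):
        # Gaussian function of the Euclidean map distance between two topics.
        d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
        return math.exp(-d2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    table = []
    for tk in coords:
        row = [e(tk, tl) for tl in coords]
        total = sum(row)
        table.append([v / total for v in row])   # normalize over all topics
    return table

# A hypothetical 2 x 2 map holding four topics.
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
P = neighborhood(coords, sigma=1.0)
for row in P:
    print([round(v, 3) for v in row])
```

Each row sums to one, and topics that are closer on the map receive a larger share, so a document's topic mass spreads mostly to neighboring topics.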

IR – Berlin Chen 51

Hierarchical Document Organization (cont.)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$L_T = \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log P(w_j\mid D_i) = \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log\left\{\sum_{k=1}^{K} P(T_k\mid D_i)\left[\sum_{l=1}^{K} P(T_l\mid Y_k)\,P(w_j\mid T_l)\right]\right\}$$

  – EM training can be performed

$$\hat{P}(w_j\mid T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N} c(w_{j'},D_{i'})\,P(T_k\mid w_{j'},D_{i'})}$$

$$\hat{P}(T_k\mid D_i) = \frac{\sum_{j=1}^{J} c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{c(D_i)}$$

    where

$$P(T_k\mid w_j,D_i) = \frac{\left[\sum_{l=1}^{K} P(w_j\mid T_l)\,P(T_l\mid Y_k)\right]P(T_k\mid D_i)}{\sum_{k'=1}^{K}\left\{\left[\sum_{l'=1}^{K} P(w_j\mid T_{l'})\,P(T_{l'}\mid Y_{k'})\right]P(T_{k'}\mid D_i)\right\}}$$

IR – Berlin Chen 52

Hierarchical Document Organization (cont.)

• Criterion for topic word selection

$$S(w_j,T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k\mid D_i)}{\sum_{i'=1}^{N} c(w_j,D_{i'})\,\bigl[1 - P(T_k\mid D_{i'})\bigr]}$$

IR – Berlin Chen 53

Hierarchical Document Organization (cont.)

• Example

IR – Berlin Chen 54

Hierarchical Document Organization (cont.)

• Example (cont.)

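IR – Berlin Chen 55

Hierarchical Document Organization (cont.)

• Self-Organization Map (SOM)
  – A recursive regression process

[Figure: an input layer connected to a mapping layer. Input vector $\vec{x} = [x_1, x_2, \ldots, x_n]^T$; weight vector of unit $i$: $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, e.g. $\vec{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$]

$$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),i}(t)\,\bigl[\vec{x}(t) - \vec{m}_i(t)\bigr]$$

where

$$c(\vec{x}) = \arg\min_{i'} \|\vec{x} - \vec{m}_{i'}\|, \qquad \|\vec{x} - \vec{m}_{i'}\| = \sqrt{\sum_{n}(x_n - m_{i',n})^2}$$

$$h_{c(\vec{x}),i}(t) = \alpha(t)\,\exp\!\left(-\frac{\|\vec{r}_i - \vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)}\right)$$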
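One step of the SOM recursion can be sketched as follows. The three-unit one-dimensional map, the weight vectors, and the values of `alpha` and `sigma` are hypothetical; in the update rule both would normally shrink with the iteration index $t$, which this single-step example leaves out.

```python
import math

def som_step(weights, positions, x, alpha, sigma):
    """One SOM update: find the winner c(x), then pull every unit toward x."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    # Winner: the unit whose weight vector is closest to the input.
    winner = min(range(len(weights)), key=lambda i: dist2(x, weights[i]))
    updated = []
    for m_i, r_i in zip(weights, positions):
        # Neighborhood function: decays with map distance from the winner.
        h = alpha * math.exp(-dist2(r_i, positions[winner]) / (2 * sigma ** 2))
        updated.append([m + h * (xn - m) for m, xn in zip(m_i, x)])
    return winner, updated

# Three units on a 1-D map (positions 0, 1, 2) with 2-D weight vectors.
positions = [(0.0,), (1.0,), (2.0,)]
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
x = [0.9, 1.1]
winner, new_weights = som_step(weights, positions, x, alpha=0.5, sigma=1.0)
print(winner, new_weights)
```

The winner moves halfway toward the input (since `alpha` is 0.5), while units farther away on the map move by a smaller fraction.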

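IR – Berlin Chen 56

Hierarchical Document Organization (cont.)

• Results

  Model   Iterations   dist_Between / dist_Within
  TMM     10           1.9165
  TMM     20           2.0650
  TMM     30           1.9477
  TMM     40           1.9175
  SOM     100          2.0604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

$$dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}$$

where

$$f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i,j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}$$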
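The ratio $R_{Dist}$ can be computed directly from map coordinates and reference labels. The sketch below is a toy illustration: the four map positions and the labels `"a"`/`"b"` (standing in for the reference classes $T_{r,i}$) are made up, and the function name `r_dist` is an assumption.

```python
import math
from itertools import combinations

def r_dist(map_xy, labels):
    """Mean map distance of differently-labelled pairs over that of same-labelled pairs."""
    between, within = [], []
    for i, j in combinations(range(len(labels)), 2):   # all pairs with i < j
        d = math.dist(map_xy[i], map_xy[j])            # Euclidean distance on the map
        (within if labels[i] == labels[j] else between).append(d)
    return (sum(between) / len(between)) / (sum(within) / len(within))

# Hypothetical placement of four documents and their reference classes.
map_xy = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = ["a", "a", "b", "b"]
ratio = r_dist(map_xy, labels)
print(ratio)
```

A larger ratio means documents of different classes sit farther apart on the map than documents of the same class, which is the sense in which the results table compares the models.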


Page 18: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 18

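Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

  – The average similarity SIM between vectors in a cluster $c_j$ is defined as

$$SIM(c_j) = \frac{1}{|c_j|\,(|c_j|-1)} \sum_{\vec{x} \in c_j} \; \sum_{\vec{y} \in c_j,\, \vec{y} \neq \vec{x}} sim(\vec{x}, \vec{y}) = \frac{1}{|c_j|\,(|c_j|-1)} \sum_{\vec{x} \in c_j} \; \sum_{\vec{y} \in c_j,\, \vec{y} \neq \vec{x}} \vec{x} \cdot \vec{y}$$

  – The sum of members in a cluster $c_j$

$$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$

  – Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$

$$
\begin{aligned}
\vec{s}(c_j) \cdot \vec{s}(c_j) &= \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} \\
&= |c_j|\,(|c_j|-1)\, SIM(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} \\
&= |c_j|\,(|c_j|-1)\, SIM(c_j) + |c_j| \\
\therefore\; SIM(c_j) &= \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|\,(|c_j|-1)}
\end{aligned}
$$

    where $\vec{x} \cdot \vec{x} = 1$ because each $\vec{x}$ is a length-normalized vector.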
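The identity above can be checked numerically. The following is a minimal sketch, assuming small hand-made 2-D vectors; the helper names (`normalize`, `sim_pairwise`, `sim_from_sum`) are illustrative and not from the slides. It computes SIM(c_j) once from the pairwise definition and once from the sum vector.

```python
import math

def normalize(v):
    """Scale a vector to unit length so that v . v == 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sim_pairwise(cluster):
    """SIM(c_j) from the definition: average dot product over ordered pairs x != y."""
    n = len(cluster)
    total = sum(dot(x, y)
                for i, x in enumerate(cluster)
                for j, y in enumerate(cluster) if i != j)
    return total / (n * (n - 1))

def sim_from_sum(cluster):
    """SIM(c_j) from the sum vector: (s . s - |c_j|) / (|c_j| (|c_j| - 1))."""
    n = len(cluster)
    s = [sum(col) for col in zip(*cluster)]
    return (dot(s, s) - n) / (n * (n - 1))

# Hypothetical cluster of three length-normalized vectors (needs at least two members).
cluster = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 2.0])]
print(sim_pairwise(cluster), sim_from_sum(cluster))
```

The sum-vector form needs one pass over the members instead of a double loop over all pairs, which is what makes it attractive during agglomerative merging.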

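Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

  – When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

$$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$

  – The average similarity for their union will be

$$SIM(c_i \cup c_j) = \frac{\left(\vec{s}(c_i) + \vec{s}(c_j)\right) \cdot \left(\vec{s}(c_i) + \vec{s}(c_j)\right) - \left(|c_i| + |c_j|\right)}{\left(|c_i| + |c_j|\right)\left(|c_i| + |c_j| - 1\right)}$$

[Figure: clusters $c_i$ and $c_j$, with sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$, merged into one cluster of size $|c_i| + |c_j|$]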
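As a rough illustration of the merge formula, the sketch below keeps only a sum vector and a size per cluster and evaluates SIM for the union without revisiting the members. The function names and the toy clusters are assumptions made for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sim_from_sum(s, size):
    """SIM of a cluster given its sum vector s and its size (size >= 2)."""
    return (dot(s, s) - size) / (size * (size - 1))

def merge(s_i, n_i, s_j, n_j):
    """Return (SIM of the union, new sum vector, new size) for two clusters."""
    s_new = [a + b for a, b in zip(s_i, s_j)]
    n_new = n_i + n_j
    return sim_from_sum(s_new, n_new), s_new, n_new

# Hypothetical clusters of unit vectors: c_i = {(1, 0)}, c_j = {(0, 1), (0, 1)}.
s_i, n_i = [1.0, 0.0], 1
s_j, n_j = [0.0, 2.0], 2
sim_union, s_new, n_new = merge(s_i, n_i, s_j, n_j)
print(sim_union, s_new, n_new)
```

Here only the pair of identical vectors in c_j contributes a non-zero dot product, so the union's average similarity is 2/6.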

Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values
  – E.g., the left and right neighbors of tokens of words

[Figure: word-clustering tree. “be” has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure

• How to split a cluster
  – Splitting is itself a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm

[Figure: algorithm listing with two annotated steps: split the least coherent cluster; generate two new clusters and remove the original one]
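The top-down loop can be sketched as follows. This is only an illustration on 1-D points: the coherence measure (mean squared distance to the cluster mean) and the splitter (cut the sorted points at the widest gap) are stand-ins chosen for brevity, not the single-link, complete-link, or group-average measures named on the slides, and all function names are hypothetical.

```python
def incoherence(cluster):
    """Mean squared distance to the cluster mean; larger means less coherent."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster) / len(cluster)

def split_at_largest_gap(cluster):
    """Stand-in splitter: cut the sorted 1-D points at the widest gap."""
    pts = sorted(cluster)
    cut = max(range(1, len(pts)), key=lambda i: pts[i] - pts[i - 1])
    return pts[:cut], pts[cut:]

def divisive(points, k):
    clusters = [list(points)]              # start with all objects in one cluster
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:                 # nothing left that can be split
            break
        worst = max(candidates, key=incoherence)
        clusters.remove(worst)             # remove the original cluster ...
        clusters.extend(split_at_largest_gap(worst))  # ... and add its two parts
    return clusters

result = divisive([1.0, 1.2, 0.9, 5.0, 5.1, 9.0], 3)
print(sorted(result))
```

Any other clustering routine that returns two sub-clusters (for instance 2-means) could replace `split_at_largest_gap` without changing the outer loop.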

Non-Hierarchical Clustering

Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  – When to stop (judged by, e.g., group average similarity, likelihood, or mutual information)
  – What is the right number of clusters (k−1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Define clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)

$$\mathcal{X} = \{\vec{x}^t\}_{t=1}^{N} \;\longrightarrow\; F = \{\vec{m}_j\}_{j=1}^{k}$$

      where each $\vec{m}_j$ is a cluster centroid (also called a reference vector, code vector, or code word) and $j$ is its index

  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., $Dim(\vec{x}^t) = 24 \rightarrow k = 2^8$
    • A compression rate of 3

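The K-means Algorithm (cont)

• Total reconstruction error

$$E\left(\{\vec{m}_i\}_{i=1}^{k} \mid \mathcal{X}\right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \left\| \vec{x}^t - \vec{m}_i \right\|^2, \qquad \text{where } b_i^t = \begin{cases} 1 & \text{if } \left\| \vec{x}^t - \vec{m}_i \right\| = \min_j \left\| \vec{x}^t - \vec{m}_j \right\| \\ 0 & \text{otherwise} \end{cases}$$

  – The labels $b_i^t$ and the centers $\vec{m}_i$ are unknown
  – $b_i^t$ depends on $\vec{m}_i$, and this optimization problem cannot be solved analytically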

The K-means Algorithm (cont)

• Initialization
  – A set of initial cluster centers $\{\vec{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  – Assign each object $\vec{x}^t$ to the cluster whose center is closest

$$b_i^t = \begin{cases} 1 & \text{if } \left\| \vec{x}^t - \vec{m}_i \right\| = \min_j \left\| \vec{x}^t - \vec{m}_j \right\| \\ 0 & \text{otherwise} \end{cases}$$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$\vec{m}_i = \frac{\sum_{t=1}^{N} b_i^t \, \vec{x}^t}{\sum_{t=1}^{N} b_i^t}$$

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

These two steps are repeated until $\vec{m}_i$ stabilizes.
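The two alternating steps can be sketched in plain Python as below. This is a minimal illustration, assuming squared Euclidean distance, seeds supplied by the caller, and ties broken in favor of the lowest cluster index; the function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def assign(points, centers):
    """Assignment step: index of the closest center for each point (b_i^t = 1)."""
    return [min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
            for x in points]

def kmeans(points, seeds, max_iter=100):
    centers = [list(c) for c in seeds]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:
                # Update step: the center becomes the mean of its members.
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)    # leave the center of an empty cluster as is
        if new_centers == centers:         # centers have stabilized
            break
        centers = new_centers
    labels = assign(points, centers)
    # Total reconstruction error for the final centers and labels.
    error = sum(sq_dist(x, centers[lab]) for x, lab in zip(points, labels))
    return centers, labels, error

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels, error = kmeans(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
print(centers, labels, error)
```

With these seeds the assignment does not change after the first pass, so the loop stops on the second iteration.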

The K-means Algorithm (cont)

• Algorithm

The K-means Algorithm (cont)

• Example 1

The K-means Algorithm (cont)

• Example 2

[Figure: example clusters labeled government, finance, sports, research, name]

The K-means Algorithm (cont)

bull Choice of initial cluster centers (seeds) is important

ndash Pick at randomndash Calculate the mean of all data and generate k initial centers

by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)

divide it range into k equal interval and take the mean of data in each group as the initial center

ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects

bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set

bull Poor seeds will result in sub-optimal clustering

im

im

δm plusmni

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly

• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, where $l = \log_2 k$ (k clusters)

[Figure: the k clusters as nodes on the hypercube, followed by a linear classifier]

The K-means Algorithm (cont)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray

[Figure: the global mean is split into the Cluster 1 mean and the Cluster 2 mean; after the next split the clusters carry the parameters {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}]

M → 2M at each iteration
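The doubling scheme can be sketched on 1-D data as follows. This is an illustration under stated assumptions: each center m is split into m − delta and m + delta (one common perturbation choice; the slides do not specify it), a fixed number of K-means passes refines the codebook after every split, and the target size is a power of two. The names `refine`, `lbg`, and `delta` are hypothetical.

```python
def refine(points, centers, iters=20):
    """A few K-means passes on 1-D data."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        # Empty groups keep their previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def lbg(points, target_size, delta=0.01):
    centers = [sum(points) / len(points)]          # start from the global mean
    while len(centers) < target_size:
        # Split every center into two perturbed copies: M -> 2M.
        centers = [m + s for m in centers for s in (-delta, delta)]
        centers = refine(points, centers)
    return sorted(centers)

codebook = lbg([0.0, 0.2, 0.4, 9.8, 10.0, 10.2], target_size=2)
print(codebook)
```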

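The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

[Figure: a mixture of Gaussians (or a mixture Gaussian HMM); the component likelihoods $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \ldots, P(\vec{x}_i \mid c_K)$ are weighted by the priors $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$ and summed]

• Continuous case

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$$

• Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)

$$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

• Classification

$$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$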
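A small sketch of the mixture density, the data log-likelihood, and the classification rule follows. To stay self-contained it assumes 1-D Gaussians (a scalar variance instead of a covariance matrix), and the parameter values are made up for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density, the scalar special case of P(x | c_k, Theta)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, priors, means, variances):
    """P(x | Theta) = sum_k P(x | c_k, Theta) P(c_k | Theta)."""
    return sum(p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances))

def log_likelihood(data, priors, means, variances):
    """log P(X | Theta) for i.i.d. samples."""
    return sum(math.log(mixture_density(x, priors, means, variances)) for x in data)

def classify(x, priors, means, variances):
    """argmax_k P(x | c_k, Theta) P(c_k | Theta); the denominator P(x | Theta) cancels."""
    scores = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical two-component mixture.
priors, means, variances = [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]
print(classify(1.0, priors, means, variances), classify(9.0, priors, means, variances))
print(log_likelihood([0.0, 1.0, 9.0], priors, means, variances))
```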

The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment

  State S1:

  P(B | S1) = 2/4 = 0.5

  P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

  Observation   State S1   State S2
  B             0.7        0.3
  W             0.4        0.6
  B             0.9        0.1
  W             0.5        0.5

  P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64

  P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36

  P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 = 0.27

  P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 = 0.73
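The soft-assignment estimates are ratios of fractional counts, which the following sketch reproduces. The observation order (B, W, B, W) is inferred from which weights appear in each numerator; the function name is hypothetical.

```python
observations = ["B", "W", "B", "W"]
# Membership weights of each observation for (S1, S2); each row sums to 1.
posteriors = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]

def soft_estimate(symbol, state):
    """P(symbol | state) as the fractional count of symbol over the total weight of state."""
    num = sum(p[state] for o, p in zip(observations, posteriors) if o == symbol)
    den = sum(p[state] for p in posteriors)
    return num / den

for state, name in enumerate(("S1", "S2")):
    for symbol in ("B", "W"):
        print(f"P({symbol} | {name}) = {soft_estimate(symbol, state):.2f}")
```

With hard assignment every weight would be 0 or 1 and the same function would reduce to ordinary relative-frequency counting.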

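The EM Algorithm (cont)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
$$

    where $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ and $C = \{c_{k_1}, c_{k_2}, \ldots, c_{k_n}\}$; $P(X \mid \Theta)$ is the likelihood function and $P(X, C \mid \Theta)$ is the complete data likelihood function

  – Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$

  – How many kinds of $C$ are there? ($K^n$ kinds)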

The EM Algorithm (cont)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$

$$\Phi(\Theta, \hat{\Theta}) = E_C\left[ L_{CM} \mid X, \Theta \right] = E_C\left[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \right] = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})$$

    where $\Theta$ is known and $\hat{\Theta}$ is unknown

  – Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$

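The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \log \left[ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, k_i} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

    where

$$\delta_{k, k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}$$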

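The EM Algorithm (cont)

  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \; \prod_{j=1,\, j \neq i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1,\, j \neq i}^{n} \; \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

    because $\delta_{k, k_i}$ means that $\vec{x}_i$ can only be aligned to $c_k$, and each inner sum $\sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta)$ equals 1

  – Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$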

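The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

    where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

    is the auxiliary function for the mixture weights, and

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

    is the auxiliary function for the cluster distributions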

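The EM Algorithm (cont)

• M-step (Maximization)
  – Remember that
    • A function F can be maximized under a constraint by applying a Lagrange multiplier

$$\text{Suppose that } F = \sum_{j=1}^{N} w_j \log y_j, \qquad \text{with the constraint } \sum_{j=1}^{N} y_j = 1$$

$$\text{By applying a Lagrange multiplier } \ell: \qquad \hat{F} = \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right)$$

$$
\begin{aligned}
\frac{\partial \hat{F}}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0 \;&\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
\sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j \;&\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

    • Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$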
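The closed-form result y_j = w_j / Σ w_j can be sanity-checked numerically. The sketch below, which uses made-up weights and a fixed random seed, compares the objective at the closed-form solution with its value at many random points that satisfy the same constraint.

```python
import math
import random

def objective(w, y):
    """F = sum_j w_j log y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

w = [2.0, 1.0, 1.0]                       # hypothetical non-negative weights
y_star = [wj / sum(w) for wj in w]        # closed-form maximizer under sum_j y_j = 1

rng = random.Random(0)

def random_simplex_point(n):
    """A random positive vector rescaled so that its entries sum to 1."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

best_random = max(objective(w, random_simplex_point(len(w))) for _ in range(1000))
print(y_star, objective(w, y_star), best_random)
```

No random feasible point should score higher than `y_star`; this is a spot check, not a proof.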

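The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for Gaussians), subject to $\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) = 1$

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

$$
\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

    where each term of the sum over $i$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$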

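The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

$$P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\, |\hat{\Sigma}_k|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)$$

$$\text{Let } w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \quad \text{and} \quad \log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

    where $D$ is a constant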

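The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0$$

$$\Rightarrow\; \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}$$

    where $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$

  – Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here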

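The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

  – Notes: $\dfrac{d \det(X)}{dX} = \det(X) \cdot X^{-T}$ and $\dfrac{d(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-T} \vec{a}\, \vec{b}^T X^{-T}$, and $\hat{\Sigma}_k$ is symmetric here

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{|\hat{\Sigma}_k|} \cdot |\hat{\Sigma}_k| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow\; \hat{\Sigma}_k \sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
&\Rightarrow\; \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$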

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
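Putting the E-step, the three M-step updates, and the stopping rule together gives the loop sketched below. It is a simplified illustration, assuming 1-D data (scalar variances in place of covariance matrices), hand-picked initial parameters instead of a K-means initialization, and a small variance floor added here to avoid division by zero; none of these choices come from the slides.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, priors, means, variances, max_iter=200, tol=1e-8):
    priors, means, variances = list(priors), list(means), list(variances)
    n, prev_ll, ll = len(data), None, 0.0
    for _ in range(max_iter):
        # E-step: w[i][k] = P(c_k | x_i, Theta), plus the log-likelihood log P(X | Theta).
        w, ll = [], 0.0
        for x in data:
            joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
            total = sum(joint)
            ll += math.log(total)
            w.append([j / total for j in joint])
        # Terminate when the log-likelihood has converged.
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
        # M-step: re-estimate mixture weights, means, and variances.
        for k in range(len(priors)):
            wk = sum(row[k] for row in w)
            priors[k] = wk / n
            means[k] = sum(row[k] * x for row, x in zip(w, data)) / wk
            var = sum(row[k] * (x - means[k]) ** 2 for row, x in zip(w, data)) / wk
            variances[k] = max(var, 1e-6)   # variance floor: an assumption of this sketch
    return priors, means, variances, ll

data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
priors, means, variances, ll = em_gmm_1d(data, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
print(priors, means, variances, ll)
```

On this well-separated toy data the responsibilities become almost hard assignments, so the estimated means settle near the two group averages.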

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]$$

$$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by their distances on the map

• When a cluster has many documents, it can be further analyzed into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}$$

  – EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

    where

$$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}$$

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

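Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  – A recursive regression process

$$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}), i}(t) \left[ \vec{x}(t) - \vec{m}_i(t) \right]$$

$$c(\vec{x}) = \arg\min_i \left\| \vec{x} - \vec{m}_i \right\|, \qquad \text{where } \left\| \vec{x} - \vec{m}_i \right\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

$$h_{c(\vec{x}), i}(t) = \alpha(t) \exp\left( -\frac{\left\| \vec{r}_i - \vec{r}_{c(\vec{x})} \right\|^2}{2\sigma^2(t)} \right)$$

[Figure: an input layer holding the input vector $\vec{x} = [x_1, x_2, \ldots, x_n]^T$ and a mapping layer whose units carry weight vectors $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, e.g., $\vec{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$; the difference $\vec{x} - \vec{m}_i$ is marked between the input and unit i]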
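One update of the recursive regression can be sketched as below. The sketch assumes a map with just two units, fixed values of α(t) and σ(t) passed in as plain numbers, and map coordinates stored in a `positions` list; these names and values are illustrative only.

```python
import math

def som_step(x, weights, positions, alpha, sigma):
    """Apply one SOM update in place and return the index of the winning unit c(x)."""
    # Winner: the unit whose weight vector is closest to the input.
    c = min(range(len(weights)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, weights[i])))
    for i, m in enumerate(weights):
        # Neighborhood function based on the squared distance between map positions.
        grid_sq = sum((a - b) ** 2 for a, b in zip(positions[i], positions[c]))
        h = alpha * math.exp(-grid_sq / (2.0 * sigma ** 2))
        # m_i(t+1) = m_i(t) + h * (x(t) - m_i(t))
        weights[i] = [mi + h * (xi - mi) for mi, xi in zip(m, x)]
    return c

weights = [[0.0, 0.0], [1.0, 1.0]]        # hypothetical weight vectors
positions = [(0, 0), (0, 1)]              # their coordinates on the map
winner = som_step([1.0, 0.8], weights, positions, alpha=0.5, sigma=1.0)
print(winner, weights)
```

The winner moves halfway toward the input, while its neighbor moves by a smaller factor that decays with map distance.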

Hierarchical Document Organization (cont)

• Results

  Model   Iterations   dist_Between / dist_Within
  TMM     10           1.9165
  TMM     20           2.0650
  TMM     30           1.9477
  TMM     40           1.9175
  SOM     100          2.0604

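$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}, \qquad f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$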

IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 19: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

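Measures of Cluster Similarity (cont)

• Group-average agglomerative clustering (cont)

  - When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance

$$\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j), \qquad |c_{New}| = |c_i| + |c_j|$$

  - The average similarity for their union will be

$$SIM(c_i \cup c_j) = \frac{\left(\vec{s}(c_i) + \vec{s}(c_j)\right) \cdot \left(\vec{s}(c_i) + \vec{s}(c_j)\right) - \left(|c_i| + |c_j|\right)}{\left(|c_i| + |c_j|\right)\left(|c_i| + |c_j| - 1\right)}$$

[Figure: clusters $C_i$ and $C_j$ with sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$, merged into one cluster of size $|c_i| + |c_j|$]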
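The update from the sum vectors can be checked numerically. The sketch below assumes unit-length member vectors and dot-product (cosine) similarity, which is what makes the "minus cluster size" term in the numerator work; the helper names and the toy vectors are illustrative only.

```python
import math

def normalize(v):
    """Scale a vector to unit length (the shortcut formula assumes this)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sum_vector(cluster):
    """Component-wise sum of all member vectors, i.e. s(c)."""
    return [sum(column) for column in zip(*cluster)]

def merged_group_average(s_i, n_i, s_j, n_j):
    """SIM(c_i U c_j) from the two sum vectors and the two cluster sizes only."""
    s_new = [a + b for a, b in zip(s_i, s_j)]
    n_new = n_i + n_j
    return (dot(s_new, s_new) - n_new) / (n_new * (n_new - 1))

def brute_force_group_average(cluster):
    """Average similarity over all ordered pairs of distinct members."""
    n = len(cluster)
    total = sum(dot(cluster[a], cluster[b])
                for a in range(n) for b in range(n) if a != b)
    return total / (n * (n - 1))

# Toy clusters (made up for illustration).
c_i = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0])]
c_j = [normalize(v) for v in ([0.0, 1.0], [1.0, 2.0], [2.0, 1.0])]

fast = merged_group_average(sum_vector(c_i), len(c_i), sum_vector(c_j), len(c_j))
slow = brute_force_group_average(c_i + c_j)
print(fast, slow)  # the two values agree up to rounding
```

The shortcut needs only the stored sum vectors and sizes, so each candidate merge costs one dot product instead of a pass over all member pairs.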

Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values
  - E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of 22 words. "be" has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity.]

Divisive Clustering

• A top-down approach

• Start with all objects in a single cluster

• At each iteration, select the least coherent cluster and split it

• Continue the iterations until a predefined criterion (e.g., the cluster number) is achieved

• The history of clustering forms a binary tree or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  - Single-link measure
  - Complete-link measure
  - Group-average measure

• How to split a cluster
  - Splitting is itself a clustering task (finding two sub-clusters)
  - Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm

[Figure: pseudocode of the divisive clustering algorithm, annotated "split the least coherent cluster" and "generate two new clusters and remove the original one"]
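The top-down loop can be sketched as follows. This is only one possible instantiation: it assumes that "least coherent" means the largest mean squared distance to the centroid, and that a cluster is split with a small 2-means pass; neither choice is prescribed by the slides, and all function names are hypothetical.

```python
def centroid(points):
    return [sum(column) / len(points) for column in zip(*points)]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def incoherence(cluster):
    """Mean squared distance to the centroid (assumed coherence measure)."""
    c = centroid(cluster)
    return sum(sq_dist(p, c) for p in cluster) / len(cluster)

def split_in_two(cluster, iterations=10):
    """2-means split, seeded with the first point and the point farthest from it."""
    seeds = [cluster[0], max(cluster, key=lambda p: sq_dist(p, cluster[0]))]
    halves = ([], [])
    for _ in range(iterations):
        new = ([], [])
        for p in cluster:
            nearer = 0 if sq_dist(p, seeds[0]) <= sq_dist(p, seeds[1]) else 1
            new[nearer].append(p)
        if not new[0] or not new[1]:
            break                      # keep the last valid split
        halves = new
        seeds = [centroid(new[0]), centroid(new[1])]
    return halves

def divisive(points, num_clusters):
    clusters = [list(points)]          # start with all objects in one cluster
    while len(clusters) < num_clusters:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        worst = max(candidates, key=incoherence)   # least coherent cluster
        left, right = split_in_two(worst)
        if not left or not right:
            break
        clusters.remove(worst)         # remove the original cluster
        clusters.extend([left, right]) # and add the two new ones
    return clusters

points = [[0.0], [0.2], [0.1], [5.0], [5.1], [9.0]]
clusters = divisive(points, 3)
print(clusters)
```

Recording which cluster was split at each step would give the binary tree mentioned earlier; the sketch returns only the final partition.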


Non-Hierarchical Clustering

Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  - In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  - When to stop (e.g., monitor group-average similarity, likelihood, or mutual information)
  - What is the right number of clusters (k-1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
  - The K-means algorithm
  - The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  - A hard clustering algorithm
  - Define clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  - A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)

$$X = \{\mathbf{x}^t\}_{t=1}^{N} \;\xrightarrow{\;F\;}\; \text{index } j \text{ of a code vector in } \{\mathbf{m}_j\}_{j=1}^{k}$$

    where $\mathbf{m}_j$ is the cluster centroid (also called reference vector, code vector, or code word)

  - E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e. $Dim(\mathbf{x}^t) = 24 \rightarrow k = 2^8$
    • A compression rate of 3

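The K-means Algorithm (cont)

• Total reconstruction error

$$E\left(\{\mathbf{m}_i\}_{i=1}^{k} \mid X\right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \left\| \mathbf{x}^t - \mathbf{m}_i \right\|^2$$

$$\text{where the label } b_i^t = \begin{cases} 1 & \text{if } \left\| \mathbf{x}^t - \mathbf{m}_i \right\| = \min_j \left\| \mathbf{x}^t - \mathbf{m}_j \right\| \\ 0 & \text{otherwise} \end{cases}$$

  - $b_i^t$ and $\mathbf{m}_i$ are unknown
  - $b_i^t$ depends on $\mathbf{m}_i$, and this optimization problem cannot be solved analytically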

The K-means Algorithm (cont)

• Initialization
  - A set of initial cluster centers $\{\mathbf{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  - Assign each object $\mathbf{x}^t$ to the cluster whose center is closest

$$b_i^t = \begin{cases} 1 & \text{if } \left\| \mathbf{x}^t - \mathbf{m}_i \right\| = \min_j \left\| \mathbf{x}^t - \mathbf{m}_j \right\| \\ 0 & \text{otherwise} \end{cases}$$

  - Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$\mathbf{m}_i = \frac{\sum_{t=1}^{N} b_i^t \, \mathbf{x}^t}{\sum_{t=1}^{N} b_i^t}$$

    • Or use the medoid as the cluster center (a medoid is one of the objects in the cluster)

• These two steps are repeated until $\mathbf{m}_i$ stabilizes

The K-means Algorithm (cont)

• Algorithm

[Figure: pseudocode of the K-means algorithm]
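A minimal sketch of the two alternating steps (assign each object to the closest center, then recompute each center as the mean of its members) might look as follows. It assumes squared Euclidean distance and takes the initial centers as an argument; the function name `k_means` and the toy data are illustrative, not taken from the slides.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]

def k_means(data, initial_centers, max_iterations=100):
    """Return (centers, labels, total reconstruction error)."""
    centers = [list(c) for c in initial_centers]
    k = len(centers)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: the label of x is the index of its closest center.
        new_labels = [min(range(k), key=lambda i: sq_dist(x, centers[i]))
                      for x in data]
        if new_labels == labels:       # assignments (and centers) have stabilized
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for i in range(k):
            members = [x for x, lab in zip(data, labels) if lab == i]
            if members:                # assumption: an empty cluster keeps its old center
                centers[i] = mean(members)
    error = sum(sq_dist(x, centers[lab]) for x, lab in zip(data, labels))
    return centers, labels, error

data = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
        [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]]
centers, labels, error = k_means(data, initial_centers=[data[0], data[3]])
print(labels, centers, error)
```

Ties are broken here by taking the lowest cluster index, which is just what `min` does; it is not a recommendation.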

The K-means Algorithm (cont)

• Example 1

[Figure: K-means example 1]

The K-means Algorithm (cont)

• Example 2

[Figure: K-means example 2, with clusters labeled government, finance, sports, research, and name]

The K-means Algorithm (cont)

• Choice of initial cluster centers (seeds) is important
  - Pick at random
  - Calculate the mean $\mathbf{m}$ of all data and generate k initial centers $\mathbf{m}_i$ by adding a small random vector to the mean: $\mathbf{m}_i = \mathbf{m} \pm \delta$
  - Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $\mathbf{m}_i$
  - Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when several centers have the same distance from an object
  - Randomly assign the object to one of the candidate clusters
  - Or perturb objects slightly

• Applications of the K-means algorithm
  - Clustering
  - Vector quantization
  - A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with $l = \log_2 k$ (k clusters)

[Figure: clusters as nodes on the hypercube, followed by a linear classifier]

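The K-means Algorithm (cont)

• E.g., the LBG algorithm
  - By Linde, Buzo, and Gray
  - The number of clusters doubles at each iteration: M → 2M

[Figure: the global mean is split into the cluster 1 mean and the cluster 2 mean, which are split again into four clusters with parameters {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}]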
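The M → 2M growth can be sketched as a split-and-refine loop. The version below assumes that each code vector is split into m + δ and m - δ with a fixed small δ, and that a few K-means passes refine the doubled codebook; the parameter `delta`, the pass count, and the function names are assumptions for illustration.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]

def refine(data, centers, iterations=20):
    """A few K-means passes: assign to the nearest center, then recompute means."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
            groups[nearest].append(x)
        # An empty group keeps its previous center.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def lbg(data, target_size, delta=0.01):
    """Grow the codebook 1 -> 2 -> 4 -> ... by splitting every code vector."""
    codebook = [mean(data)]                        # start from the global mean
    while len(codebook) < target_size:
        split = []
        for m in codebook:
            split.append([v + delta for v in m])   # m + delta
            split.append([v - delta for v in m])   # m - delta
        codebook = refine(data, split)
    return codebook

data = [[0.0], [0.2], [4.0], [4.2]]
two = sorted(lbg(data, 2))
four = sorted(lbg(data, 4))
print(two, four)
```

Because the codebook doubles at every pass, a target size that is not a power of two is overshot by this sketch.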

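The EM Algorithm

• A soft version of the K-means algorithm
  - Each object could be a member of multiple clusters
  - Clustering as estimating a mixture of (continuous) probability distributions

[Figure: a mixture Gaussian HMM (or a mixture of Gaussians): the component densities $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \dots, P(\vec{x}_i \mid c_K)$ are weighted by $\pi_1 = P(c_1), \pi_2 = P(c_2), \dots, \pi_K = P(c_K)$ and summed]

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

• Continuous case (m-dimensional Gaussian components)

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2} \left| \Sigma_k \right|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$$

• Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)

$$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

• Classification

$$\max_k P(c_k \mid \vec{x}_i, \Theta) = \max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$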
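To make the mixture density, the likelihood, and the classification rule concrete, here is a small sketch restricted to one-dimensional Gaussians (so the covariance matrix reduces to a scalar variance). The parameter values are made up for illustration.

```python
import math

def gaussian_pdf(x, mu, var):
    """One-dimensional Gaussian density P(x | c_k)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def joint(x, weights, mus, variances):
    """P(x | c_k) P(c_k) for every component k."""
    return [w * gaussian_pdf(x, mu, var)
            for w, mu, var in zip(weights, mus, variances)]

def mixture_density(x, weights, mus, variances):
    """P(x | Theta) = sum_k P(x | c_k) P(c_k)."""
    return sum(joint(x, weights, mus, variances))

def posteriors(x, weights, mus, variances):
    """P(c_k | x) by Bayes' rule."""
    terms = joint(x, weights, mus, variances)
    total = sum(terms)
    return [t / total for t in terms]

def log_likelihood(samples, weights, mus, variances):
    """log P(X | Theta) for i.i.d. samples."""
    return sum(math.log(mixture_density(x, weights, mus, variances))
               for x in samples)

def classify(x, weights, mus, variances):
    """Index k maximizing P(x | c_k) P(c_k); the evidence P(x) is not needed."""
    terms = joint(x, weights, mus, variances)
    return max(range(len(terms)), key=terms.__getitem__)

weights, mus, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
print(posteriors(1.0, weights, mus, variances))
print(classify(0.5, weights, mus, variances), classify(3.0, weights, mus, variances))
print(log_likelihood([0.1, 3.9, 4.2], weights, mus, variances))
```

Unlike the hard labels of K-means, `posteriors` returns a graded membership for every cluster.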

The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment
  - State S1: P(B | S1) = 2/4 = 0.5, P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

  Observation   State S1   State S2
  B             0.7        0.3
  W             0.4        0.6
  B             0.9        0.1
  W             0.5        0.5

  P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6 / 2.5 = 0.64

  P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9 / 2.5 = 0.36

  P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4 / 1.5 = 0.27

  P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1 / 1.5 = 0.73
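The soft-assignment estimates are ratios of fractional counts, which a few lines of code can reproduce. The observation order B, W, B, W is an assumption read off the numerators of the formulas (0.7 and 0.9 count toward B in S1, 0.4 and 0.5 toward W).

```python
# Each sample is shared between the two states according to its membership weights.
observations = ["B", "W", "B", "W"]                      # assumed order
memberships = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]

def soft_estimate(symbol, state):
    """P(symbol | state) as a ratio of expected (fractional) counts."""
    numerator = sum(m[state] for o, m in zip(observations, memberships)
                    if o == symbol)
    denominator = sum(m[state] for m in memberships)
    return numerator / denominator

for state in (0, 1):
    for symbol in ("B", "W"):
        print(f"P({symbol} | S{state + 1}) = {soft_estimate(symbol, state):.2f}")
```

With memberships restricted to 0 or 1 the same function reduces to the hard-assignment relative frequencies.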

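The EM Algorithm (cont)

• E-step (Expectation)
  - Derive the complete data likelihood function

$$
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \dots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
$$

  where $X = \vec{x}_1, \vec{x}_2, \dots, \vec{x}_{n-1}, \vec{x}_n$, $C = c_{k_1}, c_{k_2}, \dots, c_{k_{n-1}}, c_{k_n}$, $P(X \mid \Theta)$ is the likelihood function, and $P(X, C \mid \Theta)$ is the complete data likelihood function

  - How many kinds of $C$ are there? ($K^n$ kinds)

  - Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$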
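The product-of-sums identity behind the complete data likelihood can be verified by brute force for a small table. In the sketch below `a[t][k]` stands in for the joint term of sample t with cluster k; the numbers are arbitrary.

```python
import itertools
import math

def product_of_sums(a):
    """prod_t sum_k a[t][k]"""
    return math.prod(sum(row) for row in a)

def sum_over_assignments(a):
    """Sum of prod_t a[t][k_t] over every assignment (k_1, ..., k_T)."""
    T, M = len(a), len(a[0])
    total, count = 0.0, 0
    for ks in itertools.product(range(M), repeat=T):
        total += math.prod(a[t][ks[t]] for t in range(T))
        count += 1
    return total, count

a = [[0.1, 0.2, 0.3],
     [0.5, 0.25, 0.25],
     [1.0, 2.0, 3.0],
     [0.3, 0.3, 0.4]]          # T = 4 samples, M = 3 clusters

lhs = product_of_sums(a)
rhs, num_assignments = sum_over_assignments(a)
print(lhs, rhs, num_assignments)   # num_assignments equals M ** T
```

The number of assignments grows exponentially with the number of samples, which is why the per-sample factorization matters in practice.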

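The EM Algorithm (cont)

• E-step (Expectation)
  - Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= E_C\left[ L_{CM} \mid X, \Theta \right] = E_C\left[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \right] \\
&= \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})
\end{aligned}
$$

    where $\Theta$ is known and $\hat{\Theta}$ is unknown

  - Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$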

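The EM Algorithm (cont)

• E-step (Expectation)
  - The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \log \left[ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, k_i} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see the note on the next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

$$\text{where } \delta_{k, k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}$$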

The EM Algorithm (cont)

  - Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

    because, with $\delta_{k, k_i}$, $\vec{x}_i$ can only be aligned to $c_k$

  - Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$

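The EM Algorithm (cont)

• E-step (Expectation)
  - The auxiliary function can also be divided into two parts

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

    where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

    is the auxiliary function for the mixture weights, and

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

    is the auxiliary function for the cluster distributions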

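The EM Algorithm (cont)

• M-step (Maximization)
  - Remember that
    • A function $F$ can be maximized under a constraint by applying a Lagrange multiplier

$$\text{Suppose that } F = \sum_{j=1}^{N} w_j \log y_j, \quad \text{with the constraint } \sum_{j=1}^{N} y_j = 1$$

    By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
&\Rightarrow \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
&\therefore\; y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

    Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$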

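The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

    where $w_k$ is the expected number of times that the samples $\vec{x}_i$ fall in class $c_k$

$$
\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle \sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$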

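The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

$$P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2} \left| \hat{\Sigma}_k \right|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)$$

    Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

    and

$$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \cdot \log(2\pi) - \frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

    Then

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

    where $D$ is a constant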

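The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} &= -\frac{1}{2} \cdot \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow\; \hat{\vec{\mu}}_k &= \frac{\sum_{i=1}^{n} w_{ik} \cdot \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

    where $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$

    Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here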

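The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{|\hat{\Sigma}_k|} \cdot |\hat{\Sigma}_k| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
&\Rightarrow \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow \hat{\Sigma}_k \cdot \sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
&\Rightarrow \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

    Notes ($\hat{\Sigma}_k$ is symmetric here):

$$\frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \qquad \frac{d\,(a^T X^{-1} b)}{dX} = -X^{-T} a\, b^T X^{-T}$$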

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
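Putting the E-step posteriors and the three M-step updates (mixture weights, means, variances) together gives the loop sketched below. It is restricted to one-dimensional Gaussians to stay short, takes the initial parameters as arguments (they could come from K-means), and adds a small variance floor as a numerical safeguard; the floor, the tolerance, and the function names are assumptions.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(samples, weights, mus, variances, max_iterations=200, tol=1e-8):
    """EM for a 1-D Gaussian mixture; returns (weights, mus, variances, log_lik)."""
    weights, mus, variances = list(weights), list(mus), list(variances)
    n, K = len(samples), len(weights)
    previous = -math.inf
    log_lik = previous
    for _ in range(max_iterations):
        # E-step: w_ik = P(c_k | x_i, Theta) under the current parameters.
        resp, log_lik = [], 0.0
        for x in samples:
            joint = [weights[k] * gaussian_pdf(x, mus[k], variances[k])
                     for k in range(K)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        if log_lik - previous < tol:       # the likelihood has converged
            break
        previous = log_lik
        # M-step: re-estimate weights, means, and variances from w_ik.
        for k in range(K):
            w_k = sum(r[k] for r in resp)
            weights[k] = w_k / n
            mus[k] = sum(r[k] * x for r, x in zip(resp, samples)) / w_k
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, samples)) / w_k
            variances[k] = max(var, 1e-6)  # assumed floor against collapse
    return weights, mus, variances, log_lik

samples = [0.0, 0.2, -0.1, 0.1, 4.0, 4.1, 3.9, 4.2]
weights, mus, variances, log_lik = em_gmm_1d(
    samples, weights=[0.5, 0.5], mus=[1.0, 3.0], variances=[1.0, 1.0])
print(weights, mus, variances, log_lik)
```

The returned log-likelihood is the one computed in the last E-step, so it reflects the parameters before the final update when the iteration cap is hit.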

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  - TMM/PLSA approach

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]$$

$$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by their distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]
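The neighborhood term P(T_l | Y_k) is a Gaussian of the map distance, normalized over all topics. The sketch below computes it for a made-up 2 x 2 map and plugs it into the word probability P(w_j | D_i); the coordinates, sigma, and probability tables are illustrative values, not from the slides.

```python
import math

def neighborhood_probabilities(coords, sigma):
    """Rows table[k][l] = P(T_l | Y_k) for topics placed at 2-D map coordinates."""
    def e(k, l):
        d2 = (coords[k][0] - coords[l][0]) ** 2 + (coords[k][1] - coords[l][1]) ** 2
        return math.exp(-d2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)
    K = len(coords)
    table = []
    for k in range(K):
        row = [e(k, l) for l in range(K)]
        total = sum(row)
        table.append([v / total for v in row])
    return table

def word_given_document(j, topic_given_doc, word_given_topic, table):
    """P(w_j | D_i) = sum_k P(T_k | D_i) * sum_l P(T_l | Y_k) P(w_j | T_l)."""
    K = len(table)
    return sum(topic_given_doc[k] *
               sum(table[k][l] * word_given_topic[l][j] for l in range(K))
               for k in range(K))

coords = [(0, 0), (0, 1), (1, 0), (1, 1)]           # four topics on a 2 x 2 map
table = neighborhood_probabilities(coords, sigma=1.0)

topic_given_doc = [0.7, 0.1, 0.1, 0.1]               # P(T_k | D_i), made up
word_given_topic = [[0.6, 0.3, 0.1],                 # P(w_j | T_l), made up
                    [0.2, 0.5, 0.3],
                    [0.1, 0.1, 0.8],
                    [0.3, 0.3, 0.4]]
probs = [word_given_document(j, topic_given_doc, word_given_topic, table)
         for j in range(3)]
print(table[0], probs)
```

A small sigma concentrates each row of the table on the topic itself, and a large sigma spreads the mass evenly across the map.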

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}
$$

  - EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

    where

$$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid Y_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}$$

Hierarchical Document Organization (cont)

• Example

• Example (cont)

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  - A recursive regression process

[Figure: an input layer holding the input vector $x = [x_1, x_2, \dots, x_n]^T$, fully connected to a mapping layer whose unit $i$ has the weight vector $m_i = [m_{i,1}, m_{i,2}, \dots, m_{i,n}]^T$, e.g. $m_1 = [m_{1,1}, m_{1,2}, \dots, m_{1,n}]^T$]

$$m_i(t+1) = m_i(t) + h_{c(x), i}(t) \left[ x(t) - m_i(t) \right]$$

$$c(x) = \arg\min_i \left\| x - m_i \right\|, \quad \text{where } \left\| x - m_i \right\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

$$h_{c(x), i}(t) = \alpha(t) \exp\left( -\frac{\left\| r_i - r_{c(x)} \right\|^2}{2\sigma^2(t)} \right)$$
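One SOM update step can be sketched directly from these three formulas: find the winning unit c(x), then move every unit toward the input by an amount that decays with its map distance to the winner. The learning rate `alpha`, the width `sigma`, and the toy map below are assumed values; in practice both parameters are decreased over time.

```python
import math

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def som_step(weights, positions, x, alpha, sigma):
    """Return (index of the winning unit, list of updated weight vectors)."""
    # c(x): the unit whose weight vector is closest to the input.
    winner = min(range(len(weights)), key=lambda i: sq_dist(x, weights[i]))
    updated = []
    for m_i, r_i in zip(weights, positions):
        # Neighborhood function h_{c(x), i}: Gaussian in map distance to the winner.
        h = alpha * math.exp(-sq_dist(r_i, positions[winner]) / (2.0 * sigma ** 2))
        updated.append([m + h * (xv - m) for m, xv in zip(m_i, x)])
    return winner, updated

weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]    # weight vectors m_i
positions = [(0, 0), (0, 1), (0, 2)]              # map coordinates r_i
x = [1.2, 0.8]

winner, updated = som_step(weights, positions, x, alpha=0.5, sigma=1.0)
print(winner, updated)
```

Units close to the winner on the map move the most, which is what lets neighboring units end up representing similar inputs.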

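Hierarchical Document Organization (cont)

• Results

  Model   Iterations   dist_Between / dist_Within
  TMM     10           1.9165
  TMM     20           2.0650
  TMM     30           1.9477
  TMM     40           1.9175
  SOM     100          2.0604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

    where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

$$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$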
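The distance ratio can be computed with a single pass over all document pairs. The sketch below assumes that each document i has a map position (x_i, y_i) and a topic label standing for T_{r,i}; the four documents are invented for illustration.

```python
import math

def distance_ratio(map_xy, topic_labels):
    """R_Dist = mean map distance between topics / mean map distance within topics."""
    between_sum = within_sum = 0.0
    between_count = within_count = 0
    D = len(map_xy)
    for i in range(D):
        for j in range(i + 1, D):
            d = math.dist(map_xy[i], map_xy[j])     # dist_Map(i, j)
            if topic_labels[i] != topic_labels[j]:
                between_sum += d
                between_count += 1
            else:
                within_sum += d
                within_count += 1
    return (between_sum / between_count) / (within_sum / within_count)

map_xy = [(0, 0), (0, 1), (3, 0), (3, 1)]
topic_labels = ["a", "a", "b", "b"]
ratio = distance_ratio(map_xy, topic_labels)
print(ratio)
```

A larger ratio means that documents of different topics lie farther apart on the map than documents of the same topic. The sketch divides by zero if all documents share one label or if every label is unique.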

ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice


Example: Word Clustering

• Words (objects) are described and clustered using a set of features and values
  - E.g., the left and right neighbors of tokens of words

[Figure: dendrogram of word clusters. "be" has the least similarity with the other 21 words; higher nodes correspond to decreasing similarity.]

Divisive Clustering

• A top-down approach
• Start with all objects in a single cluster
• At each iteration, select the least coherent cluster and split it
• Continue the iterations until a predefined criterion (e.g., the number of clusters) is met
• The history of clustering forms a binary tree, or hierarchy

Divisive Clustering (cont)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  - Single-link measure
  - Complete-link measure
  - Group-average measure
• How to split a cluster
  - Splitting is itself a clustering task (finding two sub-clusters)
  - Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont)

• Algorithm

[Figure: pseudocode of divisive clustering. Annotations: split the least coherent cluster; generate two new clusters and remove the original one.]
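The divisive procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pseudocode from the slide: coherence is measured here as the negative mean squared distance to the centroid (the slides list link-based measures instead), the split is done with a small 2-means pass seeded from a far-apart pair of points, and the names `coherence`, `split_in_two`, and `divisive` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def coherence(cluster):
    """Assumed coherence measure: negative mean squared distance to the centroid."""
    c = mean(cluster)
    return -sum(sq_dist(p, c) for p in cluster) / len(cluster)

def split_in_two(cluster, iters=20):
    """Split one cluster into two sub-clusters with a small 2-means pass."""
    a = max(cluster, key=lambda p: sq_dist(p, cluster[0]))  # far from the first point
    b = max(cluster, key=lambda p: sq_dist(p, a))           # far from a
    centers = [a, b]
    left, right = [a], [b]
    for _ in range(iters):
        new_left = [p for p in cluster
                    if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        new_right = [p for p in cluster
                     if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not new_left or not new_right or (new_left, new_right) == (left, right):
            break  # keep the previous valid partition
        left, right = new_left, new_right
        centers = [mean(left), mean(right)]
    if len(left) + len(right) != len(cluster):  # never left the seed-only state
        left = [p for p in cluster if sq_dist(p, a) <= sq_dist(p, b)]
        right = [p for p in cluster if sq_dist(p, a) > sq_dist(p, b)]
    return left, right

def divisive(points, k):
    """Top-down clustering: repeatedly split the least coherent cluster."""
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(set(c)) > 1]
        if not splittable:
            break
        worst = min(splittable, key=coherence)   # least coherent cluster
        clusters.remove(worst)                   # remove the original one
        clusters.extend(split_in_two(worst))     # add the two new clusters
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
clusters = divisive(points, 3)
```

On this toy data the three well-separated pairs end up in separate clusters. Any other splitting routine (for example an agglomerative one) could replace `split_in_two` without changing the outer loop.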

Non-Hierarchical Clustering


Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster), and then refine the initial partition
  - In a multi-pass manner (recursion/iterations)
• Problems associated with non-hierarchical clustering
  - When to stop (criteria: group-average similarity, likelihood, mutual information)
  - What is the right number of clusters (k-1 → k → k+1)
  - Hierarchical clustering also has to face this problem
• Algorithms introduced here
  - The K-means algorithm
  - The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  - A hard clustering algorithm
  - Defines clusters by the center of mass of their members
• The K-means algorithm can also be regarded as
  - A kind of vector quantization
    • A map from a continuous space (high resolution) to a discrete space (low resolution)

$$
F:\; X=\{\mathbf{x}^t\}_{t=1}^{N} \;\longrightarrow\; \text{index } j \;\longrightarrow\; \{\mathbf{m}_j\}_{j=1}^{k}
$$

    where each $\mathbf{m}_j$ is called a cluster centroid, reference vector, code vector, or code word
  - E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., $\mathrm{Dim}(\mathbf{x}^t)=24 \rightarrow k=2^8$
    • A compression rate of 3

The K-means Algorithm (cont)

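• Total reconstruction error

$$
E\left(\{\mathbf{m}_i\}_{i=1}^{k} \mid X\right) = \sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t \,\lVert \mathbf{x}^t - \mathbf{m}_i \rVert^2,
\qquad
b_i^t =
\begin{cases}
1 & \text{if } \lVert \mathbf{x}^t - \mathbf{m}_i \rVert = \min_j \lVert \mathbf{x}^t - \mathbf{m}_j \rVert \\
0 & \text{otherwise}
\end{cases}
$$

  - $b_i^t$ (the label) and $\mathbf{m}_i$ are unknown
  - $b_i^t$ depends on $\mathbf{m}_i$, and this optimization problem cannot be solved analytically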

The K-means Algorithm (cont)

• Initialization
  - A set of initial cluster centers $\{\mathbf{m}_i\}_{i=1}^{k}$ is needed
• Recursion
  - Assign each object $\mathbf{x}^t$ to the cluster whose center is closest

$$
b_i^t =
\begin{cases}
1 & \text{if } \lVert \mathbf{x}^t - \mathbf{m}_i \rVert = \min_j \lVert \mathbf{x}^t - \mathbf{m}_j \rVert \\
0 & \text{otherwise}
\end{cases}
$$

  - Then re-compute the center of each cluster as the centroid, or mean (average), of its members

$$
\mathbf{m}_i = \frac{\sum_{t=1}^{N} b_i^t \,\mathbf{x}^t}{\sum_{t=1}^{N} b_i^t}
$$

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)
  - These two steps are repeated until $\mathbf{m}_i$ stabilizes
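The two alternating steps above (assign each object to the closest center, then re-compute each center as the mean of its members) can be sketched as follows. This is a minimal pure-Python illustration rather than the algorithm figure from the slides; the function name `kmeans`, the `max_iter` cap, the rule of keeping an empty cluster's old center, and the lowest-index tie-breaking are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(X, init_centers, max_iter=100):
    """Hard clustering: returns (centers m_i, labels, total reconstruction error)."""
    centers = [tuple(c) for c in init_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: b_i^t = 1 for the closest center (ties go to the lowest index).
        labels = [min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
                  for x in X]
        # Update step: m_i = mean of the members of cluster i.
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, lab in zip(X, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        if new_centers == centers:       # the centers have stabilized
            break
        centers = new_centers
    error = sum(sq_dist(x, centers[lab]) for x, lab in zip(X, labels))
    return centers, labels, error

X = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, labels, error = kmeans(X, [(0.0, 0.0), (10.0, 10.0)])
```

For this toy input the centers settle at (1.0, 1.5) and (8.5, 8.0), and the returned error is the total reconstruction error E evaluated at those centers.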

The K-means Algorithm (cont)

• Algorithm

[Figure: pseudocode of the K-means algorithm.]

The K-means Algorithm (cont)

• Example 1

[Figure: K-means example 1.]

The K-means Algorithm (cont)

• Example 2

[Figure: K-means example 2, with clusters labeled government, finance, sports, research, and name.]

The K-means Algorithm (cont)

• The choice of initial cluster centers (seeds) is important
  - Pick at random
  - Calculate the mean $\mathbf{m}$ of all data and generate the k initial centers $\mathbf{m}_i$ by adding a small random vector to the mean: $\mathbf{m}_i = \mathbf{m} \pm \boldsymbol{\delta}$
  - Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $\mathbf{m}_i$
  - Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set
• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  - Randomly assign the object to one of the candidate clusters
  - Or perturb the objects slightly
• Applications of the K-means algorithm
  - Clustering
  - Vector quantization
  - A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with $l = \log_2 k$ for k clusters

[Figure: clusters mapped to nodes on the hypercube, followed by a linear classifier.]

The K-means Algorithm (cont)

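• E.g., the LBG algorithm
  - By Linde, Buzo, and Gray

[Figure: LBG splitting. The global mean is split into the cluster 1 mean and the cluster 2 mean, and then into four clusters with parameters {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}; M → 2M at each iteration.]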

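The EM Algorithm

• A soft version of the K-means algorithm
  - Each object can be a member of multiple clusters
  - Clustering as estimating a mixture of (continuous) probability distributions: a mixture of Gaussians (or a mixture Gaussian HMM)

$$
P(\mathbf{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta),
\qquad \pi_k = P(c_k \mid \Theta)
$$

  - Continuous case (m is the dimension of $\mathbf{x}_i$):

$$
P(\mathbf{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\,\lvert \Sigma_k \rvert^{1/2}}
\exp\!\left(-\frac{1}{2}(\mathbf{x}_i - \boldsymbol{\mu}_k)^T \Sigma_k^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_k)\right)
$$

  - Likelihood function for the data samples $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, where the $\mathbf{x}_i$'s are independent and identically distributed (i.i.d.):

$$
P(\mathbf{X} \mid \Theta) = \prod_{i=1}^{n} P(\mathbf{x}_i \mid \Theta)
= \prod_{i=1}^{n} \sum_{k=1}^{K} P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
$$

  - Classification:

$$
\max_k P(c_k \mid \mathbf{x}_i, \Theta)
= \max_k \frac{P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\mathbf{x}_i \mid \Theta)}
= \max_k P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
$$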

The EM Algorithm (cont)


Maximum Likelihood Estimation

• Hard Assignment
  - State S1

    P(B|S1) = 2/4 = 0.5
    P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment
  - Each observation belongs to State S1 and State S2 with the following weights:

    State S1   State S2
    0.7        0.3
    0.4        0.6
    0.9        0.1
    0.5        0.5

    P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
    P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
    P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27
    P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
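The soft-assignment estimates above are weighted relative frequencies, which the following sketch reproduces. The observation sequence B, W, B, W is an assumption inferred from the arithmetic on the slide (the weights 0.7 and 0.9 are the ones summed for B in state S1); the function name `soft_emission_probs` is hypothetical.

```python
# Assumed observation colors, inferred from which weights are summed for B and W.
observations = ["B", "W", "B", "W"]
# Membership weights of each observation in (State S1, State S2), from the slide.
weights = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]

def soft_emission_probs(observations, weights, state):
    """Weighted relative frequency of each symbol in the given state (0 or 1)."""
    total = sum(w[state] for w in weights)
    probs = {}
    for symbol in sorted(set(observations)):
        mass = sum(w[state] for o, w in zip(observations, weights) if o == symbol)
        probs[symbol] = mass / total
    return probs

p_s1 = soft_emission_probs(observations, weights, 0)  # B ≈ 0.64, W ≈ 0.36
p_s2 = soft_emission_probs(observations, weights, 1)  # B ≈ 0.27, W ≈ 0.73
```

With hard assignment every weight would be 0 or 1, and the same function would reduce to plain counting.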

The EM Algorithm (cont)

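• E-step (Expectation)
  - Derive the complete data likelihood function

$$
\begin{aligned}
P(\mathbf{X} \mid \Theta)
&= \prod_{i=1}^{n} P(\mathbf{x}_i \mid \Theta)
 = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
&= \left[ P(\mathbf{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\mathbf{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right]
   \times \cdots \times
   \left[ P(\mathbf{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\mathbf{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\mathbf{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\mathbf{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\mathbf{x}_1, c_{k_1}, \mathbf{x}_2, c_{k_2}, \ldots, \mathbf{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{\mathbf{C}} P(\mathbf{X}, \mathbf{C} \mid \Theta)
\end{aligned}
$$

    where $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, $\mathbf{C} = \{c_{k_1}, c_{k_2}, \ldots, c_{k_n}\}$, $P(\mathbf{X} \mid \Theta)$ is the likelihood function, and $P(\mathbf{X}, \mathbf{C} \mid \Theta)$ is the complete data likelihood function
  - Note:

$$
\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk}
= (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM})
= \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}
$$

  - How many kinds of $\mathbf{C}$ are there? $K^n$ kinds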
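The product-of-sums identity used in this derivation can be checked numerically on a small case. The sketch below enumerates every assignment $(k_1, \ldots, k_T)$ explicitly; the matrix `a` is an arbitrary made-up example with T = 3 rows and M = 2 columns, standing in for the terms $P(\mathbf{x}_i \mid c_k, \Theta)P(c_k \mid \Theta)$.

```python
from itertools import product
from math import prod

# Hypothetical terms a[t][k], with T = 3 and M = 2.
a = [[1, 2],
     [3, 4],
     [5, 6]]
T, M = len(a), len(a[0])

# Left-hand side: product over t of the sum over k.
lhs = prod(sum(row) for row in a)

# Right-hand side: sum over all M**T assignments (k_1, ..., k_T) of the product of a[t][k_t].
assignments = list(product(range(M), repeat=T))
rhs = sum(prod(a[t][k] for t, k in enumerate(ks)) for ks in assignments)

n_terms = len(assignments)  # M**T terms, the analogue of the K**n kinds of C
```

Both sides evaluate to 231 here, and the enumeration contains 2**3 = 8 terms. The exponential number of terms is why EM works with per-sample posteriors rather than summing over all of C directly.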

The EM Algorithm (cont)

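• E-step (Expectation)
  - Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $\mathbf{C}$, conditioned on the known data $(\mathbf{X}, \Theta)$; here $\Theta$ is known (the current estimate) and $\hat{\Theta}$ is unknown (the new estimate)

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta})
&= E_{\mathbf{C}}\!\left[ L_{CM} \mid \mathbf{X}, \Theta \right]
 = E_{\mathbf{C}}\!\left[ \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) \mid \mathbf{X}, \Theta \right] \\
&= \sum_{\mathbf{C}} P(\mathbf{C} \mid \mathbf{X}, \Theta) \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta})
 = \sum_{\mathbf{C}} \frac{P(\mathbf{X}, \mathbf{C} \mid \Theta)}{P(\mathbf{X} \mid \Theta)} \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta})
\end{aligned}
$$

  - Maximize the log likelihood function $\log P(\mathbf{X} \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$
\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(\mathbf{X} \mid \hat{\Theta}) - \log P(\mathbf{X} \mid \Theta)
$$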

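The EM Algorithm (cont)

• E-step (Expectation)
  - The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta})
&= \sum_{\mathbf{C}} \frac{P(\mathbf{X}, \mathbf{C} \mid \Theta)}{P(\mathbf{X} \mid \Theta)} \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) \\
&= \sum_{\mathbf{C}} \left[ \prod_{j=1}^{n} \frac{P(\mathbf{x}_j, c_{k_j} \mid \Theta)}{P(\mathbf{x}_j \mid \Theta)} \right] \log \prod_{i=1}^{n} P(\mathbf{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{\mathbf{C}} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \mathbf{x}_j, \Theta) \right] \sum_{i=1}^{n} \log P(\mathbf{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \log P(\mathbf{x}_i, c_{k_i} \mid \hat{\Theta}) \prod_{j=1}^{n} P(c_{k_j} \mid \mathbf{x}_j, \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, k_i} \log P(\mathbf{x}_i, c_k \mid \hat{\Theta}) \prod_{j=1}^{n} P(c_{k_j} \mid \mathbf{x}_j, \Theta) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\mathbf{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \mathbf{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \mathbf{x}_i, \Theta) \log P(\mathbf{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \mathbf{x}_i, \Theta) \log \left[ P(\mathbf{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \mathbf{x}_i, \Theta) \log P(c_k \mid \hat{\Theta})
 + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \mathbf{x}_i, \Theta) \log P(\mathbf{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

    where

$$
\delta_{k, k_i} =
\begin{cases}
1 & \text{if } k = k_i \\
0 & \text{otherwise}
\end{cases}
$$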

The EM Algorithm (cont)

  - Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \mathbf{x}_j, \Theta)
&= \left[ \prod_{j=1,\, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \mathbf{x}_j, \Theta) \right] P(c_k \mid \mathbf{x}_i, \Theta) \\
&= 1 \cdot P(c_k \mid \mathbf{x}_i, \Theta) \\
&= P(c_k \mid \mathbf{x}_i, \Theta)
\end{aligned}
$$

    because $\delta_{k, k_i}$ means that $\mathbf{x}_i$ can only be aligned to $c_k$, and each remaining sum over $k_j$ equals 1
  - The first step uses the identity

$$
\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk}
= (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM})
= \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}
$$

The EM Algorithm (cont)

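• E-step (Expectation)
  - The auxiliary function can also be divided into two parts:

$$
\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})
$$

    where

$$
\Phi_a(\Theta, \hat{\Theta})
= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \mathbf{x}_i, \Theta) \log P(c_k \mid \hat{\Theta})
= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\mathbf{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
$$

    is the auxiliary function for the mixture weights, and

$$
\Phi_b(\Theta, \hat{\Theta})
= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \mathbf{x}_i, \Theta) \log P(\mathbf{x}_i \mid c_k, \hat{\Theta})
= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\mathbf{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\mathbf{x}_i \mid c_k, \hat{\Theta})
$$

    is the auxiliary function for the cluster distributions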

The EM Algorithm (cont)

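• M-step (Maximization)
  - Remember that
    • A function F can be maximized under a constraint by applying a Lagrange multiplier

$$
\text{Suppose that } F = \sum_{j=1}^{N} w_j \log y_j, \quad \text{with the constraint } \sum_{j=1}^{N} y_j = 1
$$

$$
\begin{aligned}
&\text{By applying a Lagrange multiplier } \ell: \quad
\hat{F} = \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
&\frac{\partial \hat{F}}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0
\;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
&\Rightarrow\; \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j
\;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
&\therefore\; y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

    • Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$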

The EM Algorithm (cont)

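• M-step (Maximization)
  - Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta})
&= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\mathbf{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k}
 + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

$$
\Rightarrow\;
\hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\mathbf{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle \sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\mathbf{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\mathbf{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\mathbf{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

    where each posterior term is the expected number of times that $\mathbf{x}_i$ falls in class $c_k$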

The EM Algorithm (cont)

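• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\mathbf{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\mathbf{x}_i \mid c_k, \hat{\Theta})
$$

$$
P(\mathbf{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\, \lvert \hat{\Sigma}_k \rvert^{1/2}}
\exp\!\left( -\frac{1}{2} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k) \right)
$$

$$
\text{Let } w_{ik} = \frac{P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\mathbf{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
\quad \text{and} \quad
\log P(\mathbf{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)
$$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k) \right] + D
$$

    where D is a constant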

The EM Algorithm (cont)

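• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\boldsymbol{\mu}}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k) \right] + D
$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\boldsymbol{\mu}}_k}
&= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow\; \hat{\boldsymbol{\mu}}_k
&= \frac{\sum_{i=1}^{n} w_{ik}\, \mathbf{x}_i}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\mathbf{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \mathbf{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\mathbf{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

    where $w_{ik}$ is the expected number of times that $\mathbf{x}_i$ falls in class $c_k$
  - Note: $\dfrac{d(\mathbf{x}^T C \mathbf{x})}{d\mathbf{x}} = (C + C^T)\mathbf{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here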

The EM Algorithm (cont)

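• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k) \right] + D
$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k}
&= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow\; \hat{\Sigma}_k \sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik}\, (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T \\
&\Rightarrow\; \hat{\Sigma}_k
 = \frac{\sum_{i=1}^{n} w_{ik}\, (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\mathbf{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T}{\displaystyle \sum_{i=1}^{n} \frac{P(\mathbf{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\mathbf{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

  - Notes ($\hat{\Sigma}_k$ is symmetric here):

$$
\frac{d \det(X)}{dX} = \det(X) \cdot X^{-T},
\qquad
\frac{d(\mathbf{a}^T X^{-1} \mathbf{b})}{dX} = -X^{-1} \mathbf{a} \mathbf{b}^T X^{-1} \;\; \text{(for symmetric } X\text{)}
$$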

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm
• The procedure terminates when the likelihood function $P(\mathbf{X} \mid \Theta)$ has converged or the maximum number of iterations is reached
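The E-step and M-step formulas derived above can be put together in a short sketch. To stay within the standard library it is restricted to one-dimensional Gaussians, so each covariance matrix reduces to a scalar variance; that restriction, the variance floor, the tolerance, and the name `em_gmm_1d` are assumptions made for illustration. The initial parameters are passed in by hand here, whereas the slides suggest obtaining them from K-means.

```python
import math

def gauss(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, variances, pis, max_iter=200, tol=1e-8, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture; returns (means, variances, weights, log-likelihood)."""
    mus, variances, pis = list(mus), list(variances), list(pis)
    K, n = len(mus), len(xs)
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E-step: posteriors w[i][k] = P(c_k | x_i, Theta), plus log P(X | Theta).
        w, ll = [], 0.0
        for x in xs:
            joint = [pis[k] * gauss(x, mus[k], variances[k]) for k in range(K)]
            px = sum(joint)
            ll += math.log(px)
            w.append([j / px for j in joint])
        if ll - prev_ll < tol:   # the likelihood has converged
            break
        prev_ll = ll
        # M-step: closed-form updates for the weights, means, and variances.
        for k in range(K):
            nk = sum(w[i][k] for i in range(n))
            pis[k] = nk / n
            mus[k] = sum(w[i][k] * xs[i] for i in range(n)) / nk
            var = sum(w[i][k] * (xs[i] - mus[k]) ** 2 for i in range(n)) / nk
            variances[k] = max(var, var_floor)  # assumption: floor to avoid collapse
    return mus, variances, pis, ll

xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mus, variances, pis, ll = em_gmm_1d(xs, mus=[0.0, 6.0], variances=[1.0, 1.0], pis=[0.5, 0.5])
```

On this toy data the two means move close to 1 and 5, and the mixture weights stay near 0.5 each. A multivariate version would replace the scalar variance update with the outer-product formula for the covariance matrix.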

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  - TMM/PLSA approach

$$
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]
$$

$$
P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)},
\qquad
E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{\mathrm{dist}(T_k, T_l)^2}{2\sigma^2} \right],
\qquad
\mathrm{dist}(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
• Related documents are in the same cluster, and the relationships among the clusters are reflected by their distance on the map
• When a cluster has many documents, it can be further analyzed into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]
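The neighborhood probabilities $P(T_l \mid Y_k)$ depend only on the map coordinates of the topics and on σ, so they can be computed up front. The sketch below assumes the topics sit on a 2×2 grid with unit spacing and uses σ = 1; both choices, and the function name `neighborhood_probs`, are illustrative assumptions.

```python
import math

def neighborhood_probs(coords, sigma):
    """Return P[k][l] = P(T_l | Y_k) from the map coordinates of the K topics."""
    K = len(coords)
    P = []
    for k in range(K):
        # E(T_k, T_l): Gaussian kernel of the map distance between topics k and l.
        e = []
        for l in range(K):
            d2 = (coords[k][0] - coords[l][0]) ** 2 + (coords[k][1] - coords[l][1]) ** 2
            e.append(math.exp(-d2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma))
        total = sum(e)
        P.append([v / total for v in e])  # normalize over all topics T_s
    return P

coords = [(0, 0), (0, 1), (1, 0), (1, 1)]  # hypothetical 2x2 topic map
P = neighborhood_probs(coords, sigma=1.0)
```

Each row of `P` sums to 1, and topics that are closer on the map receive a larger share. The constant factor in E(T_k, T_l) cancels in the normalization, so only the exponential term affects the result.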

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i)
= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
$$

  - EM training can be performed

$$
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}
$$

$$
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}
$$

    where

$$
P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'}) \right] P(T_{k'} \mid D_i) \right\}}
$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}
$$

Hierarchical Document Organization (cont)

• Example

[Figure: example of hierarchical document organization.]

Hierarchical Document Organization (cont)

• Example (cont)

[Figure: example of hierarchical document organization, continued.]

Hierarchical Document Organization (cont)

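• Self-Organization Map (SOM)
  - A recursive regression process

[Figure: an input layer receiving the input vector, connected to a mapping layer whose units each hold a weight vector.]

  - Input vector: $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$
  - Weight vector of unit i: $\mathbf{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$

$$
\mathbf{m}_i(t+1) = \mathbf{m}_i(t) + h_{c(\mathbf{x}), i}(t) \left[ \mathbf{x}(t) - \mathbf{m}_i(t) \right]
$$

    where

$$
c(\mathbf{x}) = \arg\min_i \lVert \mathbf{x} - \mathbf{m}_i \rVert,
\qquad
\lVert \mathbf{x} - \mathbf{m}_i \rVert = \sqrt{\sum_{d=1}^{n} (x_d - m_{i,d})^2},
\qquad
h_{c(\mathbf{x}), i}(t) = \alpha(t) \exp\!\left( -\frac{\lVert \mathbf{r}_i - \mathbf{r}_{c(\mathbf{x})} \rVert^2}{2\sigma^2(t)} \right)
$$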
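One SOM update can be sketched directly from the formulas above: find the winning unit c(x), then pull every weight vector toward the input by an amount that decays with map distance from the winner. The function name `som_step`, the two-unit map, and the fixed values of α and σ are assumptions for illustration; in practice α(t) and σ(t) shrink over time.

```python
import math

def som_step(weights, positions, x, alpha, sigma):
    """Apply one SOM update for input x; returns (winner index, new weight vectors)."""
    # Winner: the unit whose weight vector is closest to x.
    c = min(range(len(weights)),
            key=lambda i: sum((xd - md) ** 2 for xd, md in zip(x, weights[i])))
    new_weights = []
    for m, r in zip(weights, positions):
        # Neighborhood function h_{c(x),i}: decays with squared map distance to the winner.
        d2 = sum((a - b) ** 2 for a, b in zip(r, positions[c]))
        h = alpha * math.exp(-d2 / (2 * sigma ** 2))
        new_weights.append([md + h * (xd - md) for xd, md in zip(x, m)])
    return c, new_weights

weights = [[0.0, 0.0], [1.0, 1.0]]   # weight vectors m_i of two units
positions = [(0, 0), (0, 1)]         # map coordinates r_i of the units
winner, updated = som_step(weights, positions, x=[0.9, 0.9], alpha=0.5, sigma=1.0)
```

Here unit 1 wins and moves halfway toward the input, while unit 0 also moves toward it but by a smaller factor, which is what gives neighboring units similar weight vectors.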

Hierarchical Document Organization (cont)

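• Results

    Model   Iterations   dist_Between / dist_Within
    TMM     10           1.9165
    TMM     20           2.0650
    TMM     30           1.9477
    TMM     40           1.9175
    SOM     100          2.0604

$$
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
$$

    where

$$
dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)},
\qquad
f_{Between}(i, j) =
\begin{cases}
dist_{Map}(i, j) & \text{if } T_{r,i} \ne T_{r,j} \\
0 & \text{otherwise}
\end{cases},
\qquad
C_{Between}(i, j) =
\begin{cases}
1 & \text{if } T_{r,i} \ne T_{r,j} \\
0 & \text{otherwise}
\end{cases}
$$

$$
dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)},
\qquad
f_{Within}(i, j) =
\begin{cases}
dist_{Map}(i, j) & \text{if } T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases},
\qquad
C_{Within}(i, j) =
\begin{cases}
1 & \text{if } T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
$$

$$
dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$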
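The evaluation ratio can be sketched as an average over document pairs. The reading of $T_{r,i}$ as a reference topic label of document i and of $(x_i, y_i)$ as the document's position on the map is an assumption, as are the toy data and the name `distance_ratio`.

```python
import math
from itertools import combinations

def distance_ratio(map_xy, topics):
    """Mean map distance of different-topic pairs divided by that of same-topic pairs."""
    between, within = [], []
    for i, j in combinations(range(len(topics)), 2):   # all pairs with i < j
        d = math.dist(map_xy[i], map_xy[j])            # dist_Map(i, j)
        (within if topics[i] == topics[j] else between).append(d)
    dist_between = sum(between) / len(between)
    dist_within = sum(within) / len(within)
    return dist_between / dist_within

map_xy = [(0, 0), (1, 0), (4, 0), (5, 0)]   # hypothetical map positions of four documents
topics = ["a", "a", "b", "b"]               # hypothetical reference topic labels
ratio = distance_ratio(map_xy, topics)
```

For this toy layout the same-topic pairs are 1 apart on average and the different-topic pairs 4 apart, so the ratio is 4; a larger ratio means documents of the same topic sit closer together on the map than documents of different topics.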

ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 21: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 21

Divisive Clustering

• A top-down approach
• Start with all objects in a single cluster
• At each iteration, select the least coherent cluster and split it
• Continue the iterations until a predefined criterion (e.g., the number of clusters) is met
• The history of clustering forms a binary tree, or hierarchy

Divisive Clustering (cont.)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be reused here
  – Single-link measure
  – Complete-link measure
  – Group-average measure
• How to split a cluster
  – Splitting is itself a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont.)

• Algorithm
  – Split the least coherent cluster
  – Generate two new clusters and remove the original one
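The divisive procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: coherence is measured by the average pairwise distance (a group-average style measure), the split uses a small 2-means routine seeded with the two farthest points, and the function names `avg_pairwise_distance`, `split_in_two`, and `divisive_clustering` are hypothetical and do not come from the slides.

```python
from itertools import combinations
from math import dist

def avg_pairwise_distance(cluster):
    """Incoherence score: mean distance over all pairs (0 for singletons)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def split_in_two(cluster, iterations=10):
    """Split one cluster with 2-means, seeded by its two farthest points."""
    c0, c1 = max(combinations(cluster, 2), key=lambda p: dist(*p))
    for _ in range(iterations):
        left = [x for x in cluster if dist(x, c0) <= dist(x, c1)]
        right = [x for x in cluster if dist(x, c0) > dist(x, c1)]
        c0 = tuple(sum(v) / len(left) for v in zip(*left))
        c1 = tuple(sum(v) / len(right) for v in zip(*right))
    return left, right

def divisive_clustering(points, num_clusters):
    clusters = [list(points)]                 # start with everything in one cluster
    while len(clusters) < num_clusters:
        worst = max(clusters, key=avg_pairwise_distance)  # least coherent cluster
        if avg_pairwise_distance(worst) == 0:
            break                             # nothing left that can be split
        clusters.remove(worst)                # remove the original cluster ...
        clusters.extend(split_in_two(worst))  # ... and add its two sub-clusters
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0)]
clusters = divisive_clustering(points, 3)
```

Any other splitting routine (for example an agglomerative pass cut at two clusters) could replace `split_in_two` without changing the outer loop.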

Non-Hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster), then refine the initial partition
  – In a multi-pass manner (recursion/iterations)
• Problems associated with non-hierarchical clustering
  – When to stop (possible criteria: group-average similarity, likelihood, mutual information)
  – What is the right number of clusters (k-1 → k → k+1)
  – Hierarchical clustering also has to face this problem
• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Defines clusters by the center of mass of their members
• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Maps from a continuous space (high resolution) to a discrete space (low resolution)

$$X = \{x^t\}_{t=1}^{N} \;\xrightarrow{\;F\;}\; \text{index } j, \quad \{m_j\}_{j=1}^{k}$$

      where each $m_j$ is a cluster centroid, also called a reference vector, code vector, or code word
  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., $\mathrm{Dim}(x^t) = 24 \rightarrow k = 2^8$
    • A compression rate of 3

The K-means Algorithm (cont.)

• Total reconstruction error

$$E\left(\{m_i\}_{i=1}^{k} \mid X\right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \left\| x^t - m_i \right\|^2, \quad \text{where } b_i^t = \begin{cases} 1 & \text{if } \left\| x^t - m_i \right\| = \min_j \left\| x^t - m_j \right\| \\ 0 & \text{otherwise} \end{cases}$$

  – $b_i^t$ (the label) and $m_i$ are unknown
  – $b_i^t$ depends on $m_i$, and this optimization problem cannot be solved analytically

The K-means Algorithm (cont.)

• Initialization
  – A set of initial cluster centers $\{m_i\}_{i=1}^{k}$ is needed
• Recursion
  – Assign each object $x^t$ to the cluster whose center is closest

$$b_i^t = \begin{cases} 1 & \text{if } \left\| x^t - m_i \right\| = \min_j \left\| x^t - m_j \right\| \\ 0 & \text{otherwise} \end{cases}$$

  – Then re-compute the center of each cluster as the centroid, or mean (average), of its members

$$m_i = \frac{\sum_{t=1}^{N} b_i^t \, x^t}{\sum_{t=1}^{N} b_i^t}$$

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)
• These two steps are repeated until $m_i$ stabilizes
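The two alternating steps can be written out as a short Python sketch. It assumes Euclidean distance, points given as tuples, and caller-supplied initial centers; the names `kmeans` and `reconstruction_error` are illustrative, and keeping the old center for an empty cluster is one possible convention rather than something the slides prescribe.

```python
from math import dist

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and re-estimation until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: b_i^t = 1 for the closest center i, 0 otherwise.
        labels = [min(range(len(centers)), key=lambda i: dist(x, centers[i]))
                  for x in points]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def reconstruction_error(points, centers, labels):
    """Total squared distance between each point and its assigned center."""
    return sum(dist(x, centers[lab]) ** 2 for x, lab in zip(points, labels))

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
error = reconstruction_error(points, centers, labels)
```

Replacing the mean in the update step with the member that minimizes the summed distance to the other members would give the medoid variant.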

The K-means Algorithm (cont.)

• Algorithm

The K-means Algorithm (cont.)

• Example 1

The K-means Algorithm (cont.)

• Example 2 (cluster labels shown in the figure: government, finance, sports, research, name)

The K-means Algorithm (cont.)

• The choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean $m$ of all data and generate $k$ initial centers $m_i = m \pm \delta$ by adding a small random vector $\delta$ to the mean
  – Project the data onto the principal component (first eigenvector), divide its range into $k$ equal intervals, and take the mean of the data in each group as the initial center $m_i$
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the Buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set
• Poor seeds will result in sub-optimal clustering
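Two of the seeding strategies listed above (random picks, and the global mean plus a small random vector) are simple enough to sketch directly. The helper names, the `scale` parameter, and the use of a uniform perturbation are assumptions made for this illustration only.

```python
import random

def seeds_around_mean(points, k, scale=0.01, rng=None):
    """Generate k seeds by adding a small random vector to the global mean."""
    rng = rng or random.Random(0)
    dim = len(points[0])
    mean = [sum(x[d] for x in points) / len(points) for d in range(dim)]
    return [tuple(mean[d] + rng.uniform(-scale, scale) for d in range(dim))
            for _ in range(k)]

def random_seeds(points, k, rng=None):
    """Pick k distinct data points at random as seeds."""
    rng = rng or random.Random(0)
    return rng.sample(points, k)

points = [(0.0, 0.0), (2.0, 4.0), (4.0, 2.0)]
near_mean = seeds_around_mean(points, k=3)
picked = random_seeds(points, k=2)
```

Because a poor seed set can lead to a sub-optimal clustering, a common practical hedge is to run K-means from several seed sets and keep the run with the lowest reconstruction error.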

The K-means Algorithm (cont.)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb the objects slightly
• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an $l$-dimensional space (hypercube), with $l = \log_2 k$ for $k$ clusters
    • The clusters become nodes on the hypercube, which can then be fed to a linear classifier

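The K-means Algorithm (cont.)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray
  – The number of clusters doubles at each iteration: $M \rightarrow 2M$
  – Figure labels: global mean; cluster 1 mean; cluster 2 mean; $\{\mu_{11}, \Sigma_{11}, \omega_{11}\}$, $\{\mu_{12}, \Sigma_{12}, \omega_{12}\}$, $\{\mu_{13}, \Sigma_{13}, \omega_{13}\}$, $\{\mu_{14}, \Sigma_{14}, \omega_{14}\}$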
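The $M \rightarrow 2M$ growth can be illustrated with a one-dimensional sketch: start from the global mean, split every code word into a slightly perturbed pair, and refine with K-means passes. The perturbation size `delta`, the fixed number of refinement passes, and the function name `lbg` are assumptions for this example, not details taken from the slides.

```python
def lbg(data, target_size, delta=0.01, iterations=20):
    """Grow a 1-D codebook by repeated splitting (M -> 2M) plus K-means refinement."""
    codebook = [sum(data) / len(data)]            # start from the global mean
    while len(codebook) < target_size:
        # Split every code word m into the pair m - delta, m + delta.
        codebook = [m + s for m in codebook for s in (-delta, delta)]
        for _ in range(iterations):               # K-means refinement passes
            groups = [[] for _ in codebook]
            for x in data:
                nearest = min(range(len(codebook)), key=lambda j: abs(x - codebook[j]))
                groups[nearest].append(x)
            # An empty group keeps its previous code word.
            codebook = [sum(g) / len(g) if g else m
                        for g, m in zip(groups, codebook)]
    return sorted(codebook)

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8, 13.0, 13.2, 12.8]
codebook = lbg(data, target_size=4)
```

On this toy data the codebook grows from one code word to two and then to four, ending near the four group means.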

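The EM Algorithm

• A soft version of the K-means algorithm
  – Each object can be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions
• A mixture Gaussian HMM (or a mixture of Gaussians): each sample $\vec{x}_i$ is explained by $K$ clusters with priors $\pi_k = P(c_k)$, $k = 1, \ldots, K$, and cluster distributions $P(\vec{x}_i \mid c_k)$

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

• Continuous case (Gaussian cluster distributions, where $m$ is the dimension of $\vec{x}_i$)

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2} \left| \Sigma_k \right|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$$

• Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)

$$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

• Classification

$$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$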
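A one-dimensional sketch may help make the mixture density and the classification rule concrete. It assumes scalar observations with a per-cluster mean and variance instead of the multivariate form on the slide, and all function names are hypothetical.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density, the 1-D case of P(x | c_k)."""
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2 * pi * var)

def mixture_density(x, weights, means, variances):
    """P(x | Theta) = sum_k P(x | c_k) P(c_k)."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def posteriors(x, weights, means, variances):
    """P(c_k | x, Theta) for every cluster k, via Bayes' rule."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def classify(x, weights, means, variances):
    """Index of the cluster with the largest posterior probability."""
    post = posteriors(x, weights, means, variances)
    return max(range(len(post)), key=post.__getitem__)

weights, means, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
label_low = classify(0.5, weights, means, variances)
label_high = classify(3.0, weights, means, variances)
```

Since the denominator $P(\vec{x}_i \mid \Theta)$ is the same for every cluster, `classify` would return the same index if it compared the unnormalized joint values instead.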

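Maximum Likelihood Estimation

• Hard assignment
  – Each sample belongs to exactly one state; of the four samples in state S1, two are B and two are W
  – P(B | S1) = 2/4 = 0.5
  – P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft assignment: each sample belongs to state S1 and to state S2 with a fractional weight

  Sample   Observation   State S1   State S2
  1        B             0.7        0.3
  2        W             0.4        0.6
  3        B             0.9        0.1
  4        W             0.5        0.5

  – P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6 / 2.5 = 0.64
  – P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9 / 2.5 = 0.36
  – P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4 / 1.5 = 0.27
  – P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1 / 1.5 = 0.73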
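The soft-assignment estimates are just weighted relative frequencies, which the following sketch reproduces. The pairing of observations with weights (samples 1 and 3 are B, samples 2 and 4 are W) is inferred from the numerators on the slide, and `soft_mle` is a hypothetical helper name.

```python
# Each sample: (observed symbol, weight for state S1, weight for state S2).
samples = [("B", 0.7, 0.3), ("W", 0.4, 0.6), ("B", 0.9, 0.1), ("W", 0.5, 0.5)]

def soft_mle(samples, state):
    """Weighted relative frequency of each symbol in the given state (0 = S1, 1 = S2)."""
    totals = {}
    for symbol, *weights in samples:
        totals[symbol] = totals.get(symbol, 0.0) + weights[state]
    norm = sum(totals.values())
    return {symbol: count / norm for symbol, count in totals.items()}

p_s1 = soft_mle(samples, 0)
p_s2 = soft_mle(samples, 1)
```

With weights restricted to 0 and 1 the same function reduces to the hard-assignment counts.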

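The EM Algorithm (cont.)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \\
&\quad \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
$$

    where $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_{n-1}, \vec{x}_n\}$, $C = \{c_{k_1}, c_{k_2}, \ldots, c_{k_{n-1}}, c_{k_n}\}$, $P(X \mid \Theta)$ is the likelihood function, and $P(X, C \mid \Theta)$ is the complete data likelihood function
  – How many kinds of $C$ are there? $K^n$ kinds
  – Note

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$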
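The product-of-sums identity used in the note, and the count of possible assignments, can be checked numerically on a tiny case. The matrix `a` below holds arbitrary illustrative values with T = 2 factors and M = 3 terms per factor.

```python
from itertools import product
from math import prod

# a[t][k]: arbitrary illustrative values, T = 2 factors with M = 3 terms each.
a = [[1, 2, 3],
     [4, 5, 6]]
T, M = len(a), len(a[0])

# Left-hand side: the product over t of the sum over k.
lhs = prod(sum(row) for row in a)

# Right-hand side: the sum over every assignment (k_1, ..., k_T) of the product.
assignments = list(product(range(M), repeat=T))
rhs = sum(prod(a[t][k] for t, k in enumerate(ks)) for ks in assignments)
```

The list `assignments` has $M^T$ entries, which mirrors the $K^n$ possible values of $C$ and shows why the sum over $C$ is never enumerated explicitly in practice.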

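The EM Algorithm (cont.)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden (latent) variable $C$, conditioned on the known data $(X, \Theta)$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= E_C\left[ L_{CM} \right] = E_C\left[ \log P(X, C \mid \hat{\Theta}) \,\middle|\, X, \Theta \right] \\
&= \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})
\end{aligned}
$$

    where $\Theta$ is known (the current estimate) and $\hat{\Theta}$ is unknown (the estimate to be found)
  – Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$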

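The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \right] \log \left[ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, k_i} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see the next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(c_k \mid \hat{\Theta})\, P(\vec{x}_i \mid c_k, \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

    where

$$\delta_{k, k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}$$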

The EM Algorithm (cont.)

  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

    because of $\delta_{k, k_i}$, $\vec{x}_i$ can only be aligned to $c_k$
  – Note

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$
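The marginalization step on this slide (summing the product of posteriors over all assignments that fix sample i to cluster k leaves just the posterior of sample i) can be verified by brute force on a small case. The posterior table `post` holds made-up values whose rows sum to one, and `marginal` is a hypothetical helper.

```python
from itertools import product
from math import prod, isclose

# post[j][k] stands for P(c_k | x_j, Theta); illustrative values, each row sums to 1.
post = [[0.7, 0.2, 0.1],
        [0.1, 0.6, 0.3],
        [0.5, 0.25, 0.25]]
n, K = len(post), len(post[0])

def marginal(i, k):
    """Sum over all assignments (k_1, ..., k_n) with k_i == k of prod_j post[j][k_j]."""
    return sum(prod(post[j][ks[j]] for j in range(n))
               for ks in product(range(K), repeat=n) if ks[i] == k)

checks = [isclose(marginal(i, k), post[i][k]) for i in range(n) for k in range(K)]
```

Every entry of `checks` should be true, because each of the other samples contributes a factor that sums to one.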

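The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

    where the auxiliary function for the mixture weights is

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

    and the auxiliary function for the cluster distributions is

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$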

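The EM Algorithm (cont.)

• M-step (Maximization)
  – Remember that
    • A function $F$ with a constraint can be maximized by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$, with the constraint $\sum_{j=1}^{N} y_j = 1$. By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j, \quad \forall j \\
\sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$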
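As a sanity check on the Lagrange-multiplier result, the closed-form solution (each y_j equal to w_j divided by the sum of all weights) can be compared against many random points that also satisfy the sum-to-one constraint. The weights below are arbitrary illustrative values, and random sampling is only a spot check, not a proof.

```python
import random
from math import log

def objective(w, y):
    """F = sum_j w_j log y_j."""
    return sum(wj * log(yj) for wj, yj in zip(w, y))

w = [3.0, 1.0, 6.0]                        # illustrative non-negative weights
y_star = [wj / sum(w) for wj in w]         # closed-form maximizer

rng = random.Random(0)
best_random = float("-inf")
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in w]
    y = [r / sum(raw) for r in raw]        # random point with sum(y) == 1
    best_random = max(best_random, objective(w, y))
```

No random candidate should beat `objective(w, y_star)`, which is the same argument that gives the mixture-weight update in the M-step.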

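The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

$$
\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle \sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

    where $w_k$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$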

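The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

  – The cluster distributions are Gaussian

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2} \left| \Sigma_k \right|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$$

  – Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \quad \text{and} \quad \log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

  – Then

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

    where $D$ is a constant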

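The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow\; \hat{\vec{\mu}}_k &= \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

    where the denominator is the expected number of times that $\vec{x}_i$ falls in class $c_k$
  – Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\,\vec{x}$, and $\Sigma_k^{-1}$ is symmetric here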

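The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{\left| \hat{\Sigma}_k \right|} \cdot \left| \hat{\Sigma}_k \right| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} &= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow\; \hat{\Sigma}_k \sum_{i=1}^{n} w_{ik} &= \sum_{i=1}^{n} w_{ik} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
\Rightarrow\; \hat{\Sigma}_k &= \frac{\sum_{i=1}^{n} w_{ik} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

  – Notes ($\Sigma_k$ is symmetric here)

$$\frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \qquad \frac{d(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1} \vec{a}\, \vec{b}^T X^{-1} \;\; \text{(for symmetric } X\text{)}$$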

The EM Algorithm (cont.)

• The initial cluster distributions can be estimated using the K-means algorithm
• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
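Putting the E-step and the three M-step updates together gives the following one-dimensional sketch. It assumes scalar data (so each covariance matrix reduces to a variance), caller-supplied initial parameters (for instance from K-means), and a small variance floor that is a practical safeguard rather than part of the derivation; the name `em_gmm_1d` is hypothetical.

```python
from math import exp, log, pi, sqrt

def gaussian_pdf(x, mean, var):
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2 * pi * var)

def em_gmm_1d(data, weights, means, variances, max_iter=100, tol=1e-8):
    """EM for a 1-D Gaussian mixture; returns the parameters and the log-likelihood trace."""
    weights, means, variances = list(weights), list(means), list(variances)
    n, K = len(data), len(weights)
    trace = []
    for _ in range(max_iter):
        # E-step: w[i][k] = P(c_k | x_i, Theta) under the current parameters.
        w, loglik = [], 0.0
        for x in data:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(K)]
            total = sum(joint)
            loglik += log(total)
            w.append([j / total for j in joint])
        trace.append(loglik)
        # Stop when the log-likelihood no longer improves noticeably.
        if len(trace) > 1 and trace[-1] - trace[-2] < tol:
            break
        # M-step: closed-form updates for the weights, means, and variances.
        for k in range(K):
            nk = sum(w[i][k] for i in range(n))
            weights[k] = nk / n
            means[k] = sum(w[i][k] * data[i] for i in range(n)) / nk
            var = sum(w[i][k] * (data[i] - means[k]) ** 2 for i in range(n)) / nk
            variances[k] = max(var, 1e-6)  # variance floor (assumption, avoids collapse)
    return weights, means, variances, trace

data = [0.9, 1.0, 1.1, 1.2, 0.8, 4.9, 5.0, 5.1, 5.2, 4.8]
weights, means, variances, trace = em_gmm_1d(
    data, weights=[0.5, 0.5], means=[0.0, 6.0], variances=[1.0, 1.0])
```

The log-likelihood trace should be non-decreasing from one iteration to the next, which is a convenient check when debugging an EM implementation.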

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach
• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
• Related documents are in the same cluster, and the relationships among the clusters are reflected by the distance on the map
• When a cluster has many documents, we can further analyze it into another map on the next layer

Two-dimensional tree structure for organized topics:

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]$$

$$P(T_l \mid Y_k) = \frac{E(T_l, T_k)}{\sum_{s=1}^{K} E(T_s, T_k)}, \qquad E(T_l, T_k) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{dist(T_l, T_k)^2}{2\sigma^2} \right]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
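The neighborhood probabilities between topics follow directly from the map coordinates, as this sketch shows for a small 2 x 2 map. The grid layout, the value of sigma, and the function name `neighborhood_probs` are assumptions chosen for illustration.

```python
from math import exp, hypot, pi, sqrt

def neighborhood_probs(coords, sigma):
    """Return probs[k][l] = P(T_l | Y_k), computed from topic coordinates on the map."""
    def e(a, b):
        d = hypot(a[0] - b[0], a[1] - b[1])               # dist(T_l, T_k) on the map
        return exp(-d * d / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)
    probs = []
    for tk in coords:
        row = [e(tl, tk) for tl in coords]
        total = sum(row)                                  # sum over s of E(T_s, T_k)
        probs.append([v / total for v in row])
    return probs

coords = [(x, y) for x in range(2) for y in range(2)]     # a 2 x 2 topic map
probs = neighborhood_probs(coords, sigma=1.0)
```

Each row is a distribution over topics that puts the most mass on the topic itself and less on topics farther away on the map; a smaller sigma concentrates the mass more sharply.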

Hierarchical Document Organization (cont.)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}
$$

  – EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

    where

$$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid Y_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}$$

Hierarchical Document Organization (cont.)

• Criterion for topic word selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}$$
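The selection score for one word and one topic is a ratio of two weighted counts, as the following sketch illustrates. The inputs are made-up numbers, and `topic_word_score` is a hypothetical helper; the denominator is zero when every document containing the word belongs entirely to the topic, a case this sketch does not handle.

```python
def topic_word_score(counts, topic_given_doc):
    """S(w_j, T_k) for one word and one topic.

    counts[i] stands for c(w_j, D_i); topic_given_doc[i] stands for P(T_k | D_i).
    """
    inside = sum(c * p for c, p in zip(counts, topic_given_doc))
    outside = sum(c * (1.0 - p) for c, p in zip(counts, topic_given_doc))
    return inside / outside

# Illustrative values: the word occurs 3 times in D_1, never in D_2, once in D_3.
score = topic_word_score(counts=[3, 0, 1], topic_given_doc=[0.9, 0.5, 0.2])
```

A word scores highly when most of its occurrences fall in documents that are strongly associated with the topic.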

Hierarchical Document Organization (cont.)

• Example

Hierarchical Document Organization (cont.)

• Example (cont.)

Hierarchical Document Organization (cont.)

• Self-Organization Map (SOM)
  – A recursive regression process
  – Input layer: the input vector $x = [x_1, x_2, \ldots, x_n]^T$
  – Mapping layer: each unit $i$ has a weight vector $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, e.g., $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$

$$m_i(t+1) = m_i(t) + h_{c(x),i}(t) \left[ x(t) - m_i(t) \right]$$

$$c(x) = \arg\min_i \left\| x - m_i \right\|, \quad \text{where } \left\| x - m_i \right\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

$$h_{c(x),i}(t) = \alpha(t) \exp\left( -\frac{\left\| r_i - r_{c(x)} \right\|^2}{2\sigma^2(t)} \right)$$
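A single SOM update can be sketched as follows: find the winning unit, then move every unit toward the input by an amount that decays with its map distance from the winner. The 2 x 2 map, the fixed `alpha` and `sigma` values (instead of schedules that shrink over time), and the name `som_step` are assumptions for this illustration.

```python
from math import dist, exp

def som_step(weights, positions, x, alpha, sigma):
    """One SOM update: find the winning unit c(x), then pull every unit toward x."""
    winner = min(range(len(weights)), key=lambda i: dist(x, weights[i]))
    updated = []
    for m_i, r_i in zip(weights, positions):
        # Neighborhood function h_{c(x),i}: largest at the winner, decaying on the map.
        h = alpha * exp(-dist(r_i, positions[winner]) ** 2 / (2 * sigma ** 2))
        updated.append(tuple(m + h * (xn - m) for m, xn in zip(m_i, x)))
    return winner, updated

positions = [(0, 0), (0, 1), (1, 0), (1, 1)]                  # unit locations r_i on the map
weights = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]    # weight vectors m_i
x = (0.9, 0.8)
winner, new_weights = som_step(weights, positions, x, alpha=0.5, sigma=1.0)
```

In a full training loop this step would be repeated over many inputs while the learning rate and neighborhood width are gradually reduced.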

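Hierarchical Document Organization (cont.)

• Results

  Model   Iterations   dist_Between / dist_Within
  TMM     10           1.9165
  TMM     20           2.0650
  TMM     30           1.9477
  TMM     40           1.9175
  SOM     100          2.0604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

    where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

$$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$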
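The distance ratio can be computed with a short loop over document pairs. This sketch assumes that each document has a position on the map and that $T_{r,i}$ denotes a topic label for document i (an interpretation, since the slide does not define it); the data are made up and `dist_ratio` is a hypothetical name.

```python
from itertools import combinations
from math import dist, sqrt

def dist_ratio(map_coords, topics):
    """R_Dist = dist_Between / dist_Within over all document pairs (i < j)."""
    between, within = [], []
    for i, j in combinations(range(len(topics)), 2):
        d = dist(map_coords[i], map_coords[j])        # dist_Map(i, j)
        (within if topics[i] == topics[j] else between).append(d)
    return (sum(between) / len(between)) / (sum(within) / len(within))

# Four documents: two of topic 0 placed close together, two of topic 1 placed farther away.
map_coords = [(0, 0), (0, 1), (3, 0), (3, 1)]
topics = [0, 0, 1, 1]
ratio = dist_ratio(map_coords, topics)
```

A larger ratio means that documents of different topics lie farther apart on the map than documents of the same topic, which is the sense in which the table compares TMM and SOM.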

ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 22: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 22

Divisive Clustering (cont.)

• To select the least coherent cluster, the measures used in bottom-up clustering (e.g., HAC) can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure

• How to split a cluster
  – Splitting is itself a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up (agglomerative) algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)

Divisive Clustering (cont.)

• Algorithm
  – Split the least coherent cluster
  – Generate two new clusters and remove the original one
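The two annotated steps can be sketched as follows. This is only an illustration under stated assumptions: coherence is measured here by the average squared distance to the centroid (a stand-in for the single-link, complete-link, or group-average measures), the split uses a small 2-means loop, and the names `coherence`, `two_means`, and `divisive` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def coherence(cluster):
    """Assumed measure: negative average squared distance to the centroid."""
    c = mean(cluster)
    return -sum(sq_dist(p, c) for p in cluster) / len(cluster)


def two_means(cluster, iters=20):
    """Split a cluster with at least two distinct points into two parts."""
    c = mean(cluster)
    c0 = max(cluster, key=lambda p: sq_dist(p, c))   # far from the centroid
    c1 = max(cluster, key=lambda p: sq_dist(p, c0))  # far from c0
    centers, parts = [c0, c1], None
    for _ in range(iters):
        new = ([], [])
        for p in cluster:
            side = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            new[side].append(p)
        if not new[0] or not new[1] or new == parts:
            break
        parts = new
        centers = [mean(parts[0]), mean(parts[1])]
    return parts


def divisive(points, k):
    """Top-down clustering: repeatedly split the least coherent cluster."""
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if coherence(c) < 0]
        if not splittable:
            break  # every remaining cluster holds identical points
        worst = min(splittable, key=coherence)
        left, right = two_means(worst)
        clusters.remove(worst)       # remove the original cluster
        clusters += [left, right]    # add the two new clusters
    return clusters
```

Any other clustering routine could replace `two_means`, since the slide only requires that the split produce two sub-clusters.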


Non-Hierarchical Clustering

Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster), and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  – When to stop (criteria: group-average similarity, likelihood, mutual information)
  – What is the right number of clusters (k-1 → k → k+1)
    • Hierarchical clustering also has to face this problem

• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Defines clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Maps from a continuous space (high resolution) to a discrete space (low resolution)

$$X=\{x^t\}_{t=1}^{n}\;\xrightarrow{\;F\;}\;\{m_j\}_{j=1}^{k}$$

    • Each $m_j$ is a cluster centroid (also called a reference vector, code vector, or code word), and $j$ is its index
  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., Dim(x^t) = 24 → k = 2^8
    • A compression rate of 3

The K-means Algorithm (cont.)

• Total reconstruction error

$$E\left(\{m_i\}_{i=1}^{k}\mid X\right)=\sum_{t=1}^{N}\sum_{i=1}^{k} b_i^t\,\lVert x^t-m_i\rVert^2
\quad\text{where}\quad
b_i^t=\begin{cases}1 & \text{if } \lVert x^t-m_i\rVert=\min_j\lVert x^t-m_j\rVert\\ 0 & \text{otherwise}\end{cases}$$

  – $b_i^t$ (the label) and $m_i$ are unknown
  – $b_i^t$ depends on $m_i$, and this optimization problem cannot be solved analytically
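For a fixed set of centers, the labels and the total reconstruction error can be computed directly. The sketch below assumes points are tuples of numbers and breaks ties by taking the lowest-indexed center; the function names are hypothetical.

```python
def sq_dist(x, m):
    """Squared Euclidean distance ||x - m||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def labels(X, centers):
    """b[t][i] = 1 if centers[i] is the nearest center to X[t], else 0."""
    b = []
    for x in X:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
        b.append([1 if i == nearest else 0 for i in range(len(centers))])
    return b


def reconstruction_error(X, centers):
    """E = sum over t and i of b[t][i] * ||x^t - m_i||^2."""
    b = labels(X, centers)
    return sum(
        b[t][i] * sq_dist(X[t], centers[i])
        for t in range(len(X))
        for i in range(len(centers))
    )
```

With `X = [(0.0,), (2.0,), (10.0,)]` and centers `[(1.0,), (10.0,)]`, the first two points are labeled with the first center and the error is 1 + 1 + 0 = 2.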

The K-means Algorithm (cont.)

• Initialization
  – A set of initial cluster centers $\{m_i\}_{i=1}^{k}$ is needed

• Recursion
  – Assign each object $x^t$ to the cluster whose center is closest

$$b_i^t=\begin{cases}1 & \text{if } \lVert x^t-m_i\rVert=\min_j\lVert x^t-m_j\rVert\\ 0 & \text{otherwise}\end{cases}$$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$m_i=\frac{\sum_{t=1}^{N} b_i^t\,x^t}{\sum_{t=1}^{N} b_i^t}$$

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

• These two steps are repeated until $m_i$ stabilizes

The K-means Algorithm (cont.)

• Algorithm
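The algorithm figure did not survive extraction. A minimal sketch of the usual assignment/update loop is given below, assuming tuple-valued points and caller-supplied initial centers. Ties go to the lowest-indexed center and an empty cluster keeps its old center; both are choices made for this sketch.

```python
def sq_dist(x, m):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def k_means(X, centers, max_iter=100):
    """Return (centers, members) after the centers stop changing."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each object joins the cluster with the closest center.
        members = [[] for _ in centers]
        for x in X:
            i = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
            members[i].append(x)
        # Update step: each center becomes the mean of its members.
        new_centers = [mean(m) if m else c for m, c in zip(members, centers)]
        if new_centers == centers:
            break  # the centers have stabilized
        centers = new_centers
    return centers, members
```

For example, `k_means([(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)], [(1.0, 1.0), (9.0, 8.0)])` settles on the centers `(1.0, 1.5)` and `(8.5, 8.0)`.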

The K-means Algorithm (cont.)

• Example 1

The K-means Algorithm (cont.)

• Example 2
  – Cluster labels shown in the figure: government, finance, sports, research, name

The K-means Algorithm (cont.)

• The choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean $m$ of all data and generate the k initial centers $m_i$ by adding a small random vector to the mean: $m_i = m \pm \delta$
  – Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $m_i$
  – Or use another method, such as a hierarchical clustering algorithm on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont.)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly

• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, where l = log2 k (k clusters)
    • Figure labels: nodes on the hypercube; a linear classifier

The K-means Algorithm (cont.)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray
  – Figure labels: Global mean; Cluster 1 mean; Cluster 2 mean; {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}
  – M → 2M at each iteration
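The M → 2M doubling can be sketched as follows: start from the global mean, split every center into a perturbed pair, and refine with K-means before the next split. The additive perturbation `delta`, the function names, and the restriction to power-of-two codebook sizes are assumptions of this sketch. It tracks means only, not the covariances and weights shown in the figure.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def refine(X, centers, iters=50):
    """Plain K-means refinement; an empty cluster keeps its old center."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in X:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
            groups[nearest].append(x)
        updated = [mean(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


def lbg(X, target, delta=0.01):
    """Grow the codebook 1 -> 2 -> 4 -> ... until it has `target` centers."""
    centers = [mean(X)]  # M = 1: the global mean
    while len(centers) < target:
        # Split step: each center m becomes m + delta and m - delta (M -> 2M).
        centers = [tuple(v + s * delta for v in c) for c in centers for s in (1, -1)]
        centers = refine(X, centers)
    return centers
```

On the one-dimensional data `[(0.0,), (1.0,), (10.0,), (11.0,)]`, a target of 2 yields the centers 0.5 and 10.5, and a target of 4 recovers the four points themselves.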

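The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

• A Mixture Gaussian HMM (or A Mixture of Gaussians)
  – The component densities $P(\vec{x}_i\mid c_1), P(\vec{x}_i\mid c_2), \ldots, P(\vec{x}_i\mid c_K)$ are weighted by $\pi_1=P(c_1), \pi_2=P(c_2), \ldots, \pi_K=P(c_K)$ and summed to give the density of $\vec{x}_i$

$$P(\vec{x}_i\mid\Theta)=\sum_{k=1}^{K}P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)$$

• Continuous case

$$P(\vec{x}_i\mid c_k,\Theta)=\frac{1}{(2\pi)^{m/2}\,\lvert\Sigma_k\rvert^{1/2}}\exp\left(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^{T}\Sigma_k^{-1}(\vec{x}_i-\vec{\mu}_k)\right)$$

• Likelihood function for data samples

$$P(X\mid\Theta)=\prod_{i=1}^{n}P(\vec{x}_i\mid\Theta)=\prod_{i=1}^{n}\sum_{k=1}^{K}P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta),\qquad X=\{\vec{x}_1,\vec{x}_2,\ldots,\vec{x}_n\}$$

  – The $\vec{x}_i$'s are independent and identically distributed (i.i.d.)

• Classification

$$\arg\max_k P(c_k\mid\vec{x}_i,\Theta)=\arg\max_k\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{P(\vec{x}_i\mid\Theta)}=\arg\max_k P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)$$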
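The mixture density, the data log-likelihood, and the classification rule can be sketched for the one-dimensional case (m = 1, so each covariance reduces to a scalar variance). The function names and the list-based parameter layout are assumptions of this sketch.

```python
import math


def gaussian_pdf(x, mu, var):
    """One-dimensional Gaussian density P(x | c_k)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def mixture_pdf(x, priors, mus, variances):
    """P(x | Theta) = sum over k of P(x | c_k) * P(c_k)."""
    return sum(p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, mus, variances))


def log_likelihood(X, priors, mus, variances):
    """log P(X | Theta) for i.i.d. samples: the sum of per-sample log densities."""
    return sum(math.log(mixture_pdf(x, priors, mus, variances)) for x in X)


def classify(x, priors, mus, variances):
    """Index k maximizing P(x | c_k) * P(c_k); P(x | Theta) is common to all k."""
    scores = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, mus, variances)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal priors, means 0 and 10, and unit variances, `classify(1.0, ...)` picks the first component and `classify(9.0, ...)` the second.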

The EM Algorithm (cont.)

Maximum Likelihood Estimation

• Hard Assignment
  – State S1: P(B|S1) = 2/4 = 0.5, P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

  Sample   Symbol   State S1   State S2
  1        B        0.7        0.3
  2        W        0.4        0.6
  3        B        0.9        0.1
  4        W        0.5        0.5

  P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
  P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
  P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27
  P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
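The soft-assignment estimates are weighted relative frequencies. The sketch below reproduces the arithmetic of the slide. The pairing of symbols with weights (B, W, B, W) is inferred from the numerators of the slide's fractions, and `soft_mle` is a hypothetical helper name.

```python
# Each sample carries a symbol and its membership weights for (S1, S2).
colors = ["B", "W", "B", "W"]
weights = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]


def soft_mle(colors, weights, state):
    """P(symbol | state) as a ratio of soft counts."""
    total = sum(w[state] for w in weights)
    return {
        c: sum(w[state] for col, w in zip(colors, weights) if col == c) / total
        for c in sorted(set(colors))
    }


for state, name in enumerate(["S1", "S2"]):
    probs = soft_mle(colors, weights, state)
    print(name, {c: round(p, 2) for c, p in probs.items()})
```

Rounded to two decimals this gives 0.64 and 0.36 for S1, and 0.27 and 0.73 for S2. Hard assignment is the special case in which every weight is 0 or 1.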

The EM Algorithm (cont.)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$\begin{aligned}
P(X\mid\Theta)&=\prod_{i=1}^{n}P(\vec{x}_i\mid\Theta)=\prod_{i=1}^{n}\sum_{k=1}^{K}P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)\\
&=\big[P(\vec{x}_1\mid c_1,\Theta)P(c_1\mid\Theta)+\cdots+P(\vec{x}_1\mid c_K,\Theta)P(c_K\mid\Theta)\big]\times\cdots\times\big[P(\vec{x}_n\mid c_1,\Theta)P(c_1\mid\Theta)+\cdots+P(\vec{x}_n\mid c_K,\Theta)P(c_K\mid\Theta)\big]\\
&=\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\prod_{i=1}^{n}P(\vec{x}_i\mid c_{k_i},\Theta)\,P(c_{k_i}\mid\Theta)\\
&=\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\prod_{i=1}^{n}P(\vec{x}_i,c_{k_i}\mid\Theta)\\
&=\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}P(\vec{x}_1,c_{k_1},\vec{x}_2,c_{k_2},\ldots,\vec{x}_n,c_{k_n}\mid\Theta)\\
&=\sum_{C}P(X,C\mid\Theta)
\end{aligned}$$

  – $P(X\mid\Theta)$ is the likelihood function and $P(X,C\mid\Theta)$ is the complete data likelihood function, where $X=\{\vec{x}_1,\vec{x}_2,\ldots,\vec{x}_n\}$ and $C=\{c_{k_1},c_{k_2},\ldots,c_{k_n}\}$
  – How many kinds of $C$? ($K^n$ kinds)

Note:

$$\prod_{t=1}^{T}\sum_{k=1}^{M}a_{tk}=(a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM})=\sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T}a_{t k_t}$$
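The identity in the note, on which the derivation rests, can be checked numerically by enumerating all M^T index assignments. This is a brute-force illustration with hypothetical function names. It is only practical for tiny T and M, which is the reason EM avoids summing over all $K^n$ configurations of C explicitly.

```python
import itertools
import math


def product_of_sums(a):
    """Left-hand side: prod over t of (sum over k of a[t][k])."""
    return math.prod(sum(row) for row in a)


def sum_of_products(a):
    """Right-hand side: sum over all (k_1, ..., k_T) of prod over t of a[t][k_t]."""
    T, M = len(a), len(a[0])
    return sum(
        math.prod(a[t][ks[t]] for t in range(T))
        for ks in itertools.product(range(M), repeat=T)
    )


a = [[1, 2], [3, 4], [5, 6]]  # T = 3 rows, M = 2 columns
assignments = list(itertools.product(range(2), repeat=3))
print(product_of_sums(a), sum_of_products(a), len(assignments))
```

Both sides evaluate to 3 × 7 × 11 = 231, and there are 2^3 = 8 assignments.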

The EM Algorithm (cont.)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta,\hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X,\Theta)$

$$\Phi(\Theta,\hat{\Theta})=E_C\big[L_{CM}\big]=E_C\big[\log P(X,C\mid\hat{\Theta})\,\big|\,X,\Theta\big]=\sum_{C}P(C\mid X,\Theta)\log P(X,C\mid\hat{\Theta})=\sum_{C}\frac{P(X,C\mid\Theta)}{P(X\mid\Theta)}\log P(X,C\mid\hat{\Theta})$$

    • $\Theta$ is known; $\hat{\Theta}$ is unknown
  – Maximize the log likelihood function $\log P(X\mid\hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta,\hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta,\hat{\Theta})-\Phi(\Theta,\Theta)\le\log P(X\mid\hat{\Theta})-\log P(X\mid\Theta)$$

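The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta,\hat{\Theta})$

$$\begin{aligned}
\Phi(\Theta,\hat{\Theta})&=\sum_{C}\frac{P(X,C\mid\Theta)}{P(X\mid\Theta)}\log P(X,C\mid\hat{\Theta})\\
&=\sum_{C}\left[\prod_{j=1}^{n}\frac{P(\vec{x}_j,c_{k_j}\mid\Theta)}{P(\vec{x}_j\mid\Theta)}\right]\log\prod_{i=1}^{n}P(\vec{x}_i,c_{k_i}\mid\hat{\Theta})\\
&=\sum_{C}\left[\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\right]\sum_{i=1}^{n}\log P(\vec{x}_i,c_{k_i}\mid\hat{\Theta})\\
&=\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\sum_{i=1}^{n}\log P(\vec{x}_i,c_{k_i}\mid\hat{\Theta})\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\\
&=\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\sum_{i=1}^{n}\sum_{k=1}^{K}\delta_{k,k_i}\log P(\vec{x}_i,c_k\mid\hat{\Theta})\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}\log P(\vec{x}_i,c_k\mid\hat{\Theta})\left\{\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k,k_i}\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\right\}\qquad\text{(see next slide)}\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\log P(\vec{x}_i,c_k\mid\hat{\Theta})\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\log\big[P(\vec{x}_i\mid c_k,\hat{\Theta})\,P(c_k\mid\hat{\Theta})\big]\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\log P(c_k\mid\hat{\Theta})+\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\log P(\vec{x}_i\mid c_k,\hat{\Theta})
\end{aligned}$$

where

$$\delta_{k,k_i}=\begin{cases}1 & \text{if } k=k_i\\ 0 & \text{otherwise}\end{cases}$$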

The EM Algorithm (cont.)

  – Note that

$$\begin{aligned}
\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k,k_i}\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)
&=\left[\sum_{k_1=1}^{K}\cdots\sum_{k_{i-1}=1}^{K}\sum_{k_{i+1}=1}^{K}\cdots\sum_{k_n=1}^{K}\;\prod_{j=1,\,j\ne i}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\right]P(c_k\mid\vec{x}_i,\Theta)\\
&=\left[\prod_{j=1,\,j\ne i}^{n}\sum_{k_j=1}^{K}P(c_{k_j}\mid\vec{x}_j,\Theta)\right]P(c_k\mid\vec{x}_i,\Theta)\\
&=\left[\prod_{j=1,\,j\ne i}^{n}1\right]P(c_k\mid\vec{x}_i,\Theta)\\
&=P(c_k\mid\vec{x}_i,\Theta)
\end{aligned}$$

    • Because of $\delta_{k,k_i}$, $\vec{x}_i$ can only be aligned to $c_k$

Note:

$$\prod_{t=1}^{T}\sum_{k=1}^{M}a_{tk}=(a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM})=\sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T}a_{t k_t}$$

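The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$\Phi(\Theta,\hat{\Theta})=\Phi_a(\Theta,\hat{\Theta})+\Phi_b(\Theta,\hat{\Theta})$$

where

$$\begin{aligned}
\Phi_a(\Theta,\hat{\Theta})&=\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\log P(c_k\mid\hat{\Theta})
=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i,c_k\mid\Theta)}{P(\vec{x}_i\mid\Theta)}\log P(c_k\mid\hat{\Theta})\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\log P(c_k\mid\hat{\Theta})
\end{aligned}$$

(the auxiliary function for the mixture weights), and

$$\begin{aligned}
\Phi_b(\Theta,\hat{\Theta})&=\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\log P(\vec{x}_i\mid c_k,\hat{\Theta})\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\log P(\vec{x}_i\mid c_k,\hat{\Theta})
\end{aligned}$$

(the auxiliary function for the cluster distributions)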

The EM Algorithm (cont.)

• M-step (Maximization)
  – Remember that
    • A function F can be maximized under a constraint by applying a Lagrange multiplier

Suppose that

$$F=\sum_{j=1}^{N}w_j\log y_j\qquad\text{with the constraint}\qquad\sum_{j=1}^{N}y_j=1$$

By applying a Lagrange multiplier $\ell$:

$$\hat{F}=\sum_{j=1}^{N}w_j\log y_j+\ell\left(\sum_{j=1}^{N}y_j-1\right)$$

$$\frac{\partial\hat{F}}{\partial y_j}=\frac{w_j}{y_j}+\ell=0\;\Rightarrow\;w_j=-\ell\,y_j\quad\forall j$$

$$\Rightarrow\;\sum_{j=1}^{N}w_j=-\ell\sum_{j=1}^{N}y_j=-\ell\;\Rightarrow\;\ell=-\sum_{j=1}^{N}w_j$$

$$\therefore\;y_j=\frac{w_j}{\sum_{j'=1}^{N}w_{j'}}$$

Note: $\dfrac{\partial\log y_j}{\partial y_j}=\dfrac{1}{y_j}$
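The closed-form result $y_j = w_j / \sum_{j'} w_{j'}$ can be sanity-checked numerically: no other point on the probability simplex should give a larger value of $\sum_j w_j \log y_j$. The sketch below compares the closed form against random feasible points. The helper names and the sampling scheme are assumptions, and the check is a spot test rather than a proof.

```python
import math
import random


def objective(w, y):
    """F = sum over j of w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))


def closed_form(w):
    """Constrained maximizer: y_j = w_j / sum(w)."""
    total = sum(w)
    return [wj / total for wj in w]


def random_simplex_point(n, rng):
    """A random positive vector normalized to sum to 1."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(raw)
    return [r / s for r in raw]


w = [1.6, 0.9]            # soft counts, e.g. expected counts per symbol
y_star = closed_form(w)   # [0.64, 0.36] up to rounding
rng = random.Random(0)
best_random = max(objective(w, random_simplex_point(2, rng)) for _ in range(1000))
print(objective(w, y_star) >= best_random)
```

The weights 1.6 and 0.9 give the maximizer 0.64 and 0.36, i.e. normalized soft counts.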

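The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta,\hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians)

$$\begin{aligned}
\hat{\Phi}_a(\Theta,\hat{\Theta})&=\Phi_a(\Theta,\hat{\Theta})+\ell\left(\sum_{k=1}^{K}P(c_k\mid\hat{\Theta})-1\right)\\
&=\sum_{k=1}^{K}\underbrace{\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}_{w_k}\,\log\underbrace{P(c_k\mid\hat{\Theta})}_{y_k}+\ell\left(\sum_{k=1}^{K}P(c_k\mid\hat{\Theta})-1\right)
\end{aligned}$$

$$\Rightarrow\;\hat{\pi}_k=P(c_k\mid\hat{\Theta})=\frac{w_k}{\sum_{k'=1}^{K}w_{k'}}
=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}{\displaystyle\sum_{k'=1}^{K}\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_{k'},\Theta)\,P(c_{k'}\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}
=\frac{1}{n}\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}$$

    • Each term of the sum over $i$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$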

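The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\log P(\vec{x}_i\mid c_k,\hat{\Theta})$$

$$P(\vec{x}_i\mid c_k,\hat{\Theta})=\frac{1}{(2\pi)^{m/2}\,\lvert\hat{\Sigma}_k\rvert^{1/2}}\exp\left(-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right)$$

Let

$$w_{ik}=\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}$$

and

$$\log P(\vec{x}_i\mid c_k,\hat{\Theta})=-\frac{m}{2}\log(2\pi)-\frac{1}{2}\log\lvert\hat{\Sigma}_k\rvert-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)$$

Then

$$\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\left[-\frac{1}{2}\log\lvert\hat{\Sigma}_k\rvert-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right]+D$$

where $D$ is a constant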

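The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\left[-\frac{1}{2}\log\lvert\hat{\Sigma}_k\rvert-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right]+D$$

$$\frac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\vec{\mu}}_k}=-\frac{1}{2}\sum_{i=1}^{n}w_{ik}\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1)=0$$

$$\Rightarrow\;\hat{\vec{\mu}}_k=\frac{\sum_{i=1}^{n}w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n}w_{ik}}
=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\cdot\vec{x}_i}{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}$$

    • $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$

Note: $\dfrac{d(\vec{x}^{T}C\vec{x})}{d\vec{x}}=(C+C^{T})\vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here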

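The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\left[-\frac{1}{2}\log\lvert\hat{\Sigma}_k\rvert-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right]+D$$

$$\begin{aligned}
&\frac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\Sigma}_k}=-\frac{1}{2}\sum_{i=1}^{n}w_{ik}\left[\frac{1}{\lvert\hat{\Sigma}_k\rvert}\cdot\lvert\hat{\Sigma}_k\rvert\cdot\hat{\Sigma}_k^{-1}-\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\right]=0\\
&\Rightarrow\;\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k^{-1}=\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\\
&\Rightarrow\;\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k=\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\hat{\Sigma}_k\\
&\Rightarrow\;\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k=\sum_{i=1}^{n}w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\\
&\Rightarrow\;\hat{\Sigma}_k=\frac{\sum_{i=1}^{n}w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n}w_{ik}}
=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}
\end{aligned}$$

Notes ($\hat{\Sigma}_k$ is symmetric here):

$$\frac{d\,\det(X)}{dX}=\det(X)\cdot X^{-T},\qquad\frac{d\,(a^{T}X^{-1}b)}{dX}=-X^{-1}ab^{T}X^{-1}\ \text{ for symmetric } X$$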

The EM Algorithm (cont.)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X\mid\Theta)$ has converged or the maximum number of iterations is reached
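Putting the E-step posteriors and the three M-step updates (weights, means, variances) together gives the loop below. It is a sketch for one-dimensional data only, so each covariance reduces to a scalar variance. The initial parameters are supplied by the caller (for example from K-means), and the variance floor `min_var` and the function names are assumptions of this sketch.

```python
import math


def gaussian_pdf(x, mu, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(X, priors, mus, variances, max_iter=200, tol=1e-8, min_var=1e-6):
    """Fit a 1-D Gaussian mixture; returns (priors, mus, variances, log_lik)."""
    priors, mus, variances = list(priors), list(mus), list(variances)
    n, K = len(X), len(priors)
    prev, log_lik = -math.inf, -math.inf
    for _ in range(max_iter):
        # E-step: w[i][k] = P(c_k | x_i, Theta) under the current parameters.
        w, log_lik = [], 0.0
        for x in X:
            joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, mus, variances)]
            total = sum(joint)
            log_lik += math.log(total)
            w.append([j / total for j in joint])
        # Stop when the log-likelihood has converged.
        if log_lik - prev < tol:
            break
        prev = log_lik
        # M-step: re-estimate weights, means, and variances from soft counts.
        for k in range(K):
            nk = sum(w[i][k] for i in range(n))
            priors[k] = nk / n
            mus[k] = sum(w[i][k] * X[i] for i in range(n)) / nk
            var = sum(w[i][k] * (X[i] - mus[k]) ** 2 for i in range(n)) / nk
            variances[k] = max(var, min_var)  # floor against collapse
    return priors, mus, variances, log_lik


data = [0.0, 0.2, -0.2, 10.0, 10.2, 9.8]
priors, mus, variances, log_lik = em_gmm_1d(data, [0.5, 0.5], [1.0, 9.0], [1.0, 1.0])
```

On this toy data the two means move to roughly 0 and 10 with equal weights. The sketch does not guard against a component receiving zero total weight or a sample whose density underflows to zero under every component.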

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

$$P(w_j\mid D_i)=\sum_{k=1}^{K}P(T_k\mid D_i)\left[\sum_{l=1}^{K}P(T_l\mid Y_k)\,P(w_j\mid T_l)\right]$$

$$P(T_l\mid Y_k)=\frac{E(T_k,T_l)}{\sum_{s=1}^{K}E(T_k,T_s)},\qquad
E(T_k,T_l)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{dist(T_k,T_l)^{2}}{2\sigma^{2}}\right],\qquad
dist(T_i,T_j)=\sqrt{(x_i-x_j)^{2}+(y_i-y_j)^{2}}$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map (figure: Two-dimensional Tree Structure for Organized Topics)

• Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

Hierarchical Document Organization (cont.)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$L_T=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\log P(w_j\mid D_i)
=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\log\left\{\sum_{k=1}^{K}P(T_k\mid D_i)\left[\sum_{l=1}^{K}P(T_l\mid Y_k)\,P(w_j\mid T_l)\right]\right\}$$

  – EM training can be performed

$$\hat{P}(w_j\mid T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N}c(w_{j'},D_{i'})\,P(T_k\mid w_{j'},D_{i'})}$$

$$\hat{P}(T_k\mid D_i)=\frac{\sum_{j=1}^{J}c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{c(D_i)}$$

where

$$P(T_k\mid w_j,D_i)=\frac{\left[\sum_{l=1}^{K}P(w_j\mid T_l)\,P(T_l\mid Y_k)\right]P(T_k\mid D_i)}{\sum_{k'=1}^{K}\left\{\left[\sum_{l'=1}^{K}P(w_j\mid T_{l'})\,P(T_{l'}\mid Y_{k'})\right]P(T_{k'}\mid D_i)\right\}}$$

Hierarchical Document Organization (cont.)

• Criterion for topic word selection

$$S(w_j,T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k\mid D_i)}{\sum_{i'=1}^{N}c(w_j,D_{i'})\,\big[1-P(T_k\mid D_{i'})\big]}$$

Hierarchical Document Organization (cont.)

• Example

Hierarchical Document Organization (cont.)

• Example (cont.)

Hierarchical Document Organization (cont.)

• Self-Organization Map (SOM)
  – A recursive regression process
  – Figure labels: input layer with input vector $x=[x_1,x_2,\ldots,x_n]^{T}$; mapping layer whose unit $i$ has weight vector $m_i=[m_{i,1},m_{i,2},\ldots,m_{i,n}]^{T}$ (e.g., $m_1=[m_{1,1},m_{1,2},\ldots,m_{1,n}]^{T}$)

$$m_i(t+1)=m_i(t)+h_{c(x),i}(t)\,\big[x(t)-m_i(t)\big]$$

where

$$c(x)=\arg\min_i\lVert x-m_i\rVert,\qquad\lVert x-m_i\rVert=\sqrt{\sum_{n}(x_n-m_{i,n})^{2}}$$

$$h_{c(x),i}(t)=\alpha(t)\exp\left(-\frac{\lVert r_i-r_{c(x)}\rVert^{2}}{2\sigma^{2}(t)}\right)$$
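One application of the update rule can be sketched as follows: find the best-matching unit c(x), then move every weight vector toward the input by an amount that decays with map distance from that unit. Treating α(t) and σ(t) as plain arguments, storing the unit coordinates r_i in a `positions` list, and the function names are all assumptions of this sketch.

```python
import math


def best_matching_unit(x, weights):
    """c(x): index of the weight vector closest to the input x."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))


def neighborhood(r_i, r_c, alpha, sigma):
    """h = alpha * exp(-||r_i - r_c||^2 / (2 * sigma^2))."""
    return alpha * math.exp(-math.dist(r_i, r_c) ** 2 / (2 * sigma ** 2))


def som_step(x, weights, positions, alpha, sigma):
    """Return the updated weight vectors m_i(t+1) for a single input x(t)."""
    c = best_matching_unit(x, weights)
    updated = []
    for i, w in enumerate(weights):
        h = neighborhood(positions[i], positions[c], alpha, sigma)
        updated.append([m + h * (xn - m) for m, xn in zip(w, x)])
    return updated


weights = [[0.0, 0.0], [1.0, 1.0]]   # one weight vector per map unit
positions = [(0, 0), (0, 1)]         # unit coordinates on the map
new_weights = som_step([2.0, 2.0], weights, positions, alpha=0.5, sigma=1.0)
```

Here unit 1 is the best match and moves halfway to the input, while unit 0, one map step away, moves by the smaller factor 0.5 · exp(-1/2). A full training run would repeat this step over many inputs while shrinking α(t) and σ(t).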

Hierarchical Document Organization (cont.)

• Results

  Model   Iterations   dist_Between / dist_Within
  TMM     10           1.9165
  TMM     20           2.0650
  TMM     30           1.9477
  TMM     40           1.9175
  SOM     100          2.0604

$$R_{Dist}=\frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Between}(i,j)},\qquad
dist_{Within}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Within}(i,j)}$$

$$f_{Between}(i,j)=\begin{cases}dist_{Map}(i,j) & T_{r,i}\ne T_{r,j}\\ 0 & \text{otherwise}\end{cases}\qquad
C_{Between}(i,j)=\begin{cases}1 & T_{r,i}\ne T_{r,j}\\ 0 & \text{otherwise}\end{cases}$$

$$f_{Within}(i,j)=\begin{cases}dist_{Map}(i,j) & T_{r,i}=T_{r,j}\\ 0 & \text{otherwise}\end{cases}\qquad
C_{Within}(i,j)=\begin{cases}1 & T_{r,i}=T_{r,j}\\ 0 & \text{otherwise}\end{cases}$$

$$dist_{Map}(i,j)=\sqrt{(x_i-x_j)^{2}+(y_i-y_j)^{2}}$$

ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 23: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

Divisive Clustering (cont)

• Algorithm
  – Split the least coherent cluster
  – Generate two new clusters and remove the original one

Non-Hierarchical Clustering

Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  – When to stop (criteria: group average similarity, likelihood, mutual information)
  – What is the right number of clusters (k-1 → k → k+1)
    • Hierarchical clustering also has to face this problem

• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Define clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)

$$ X = \{x^t\}_{t=1}^{N} \;\xrightarrow{\ \text{cluster}\ }\; F = \{m_j\}_{j=1}^{k} $$

    where $m_j$ is a reference vector (also called a code vector, code word, or centroid) and $j$ is its index

  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., Dim($x^t$) = 24 → $k = 2^8$
    • A compression rate of 3
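The color-quantization arithmetic above can be checked with a minimal sketch. The tiny four-entry codebook and the helper names `quantize` and `compression_rate` are illustrative assumptions, not part of the slides; a real codebook for this example would hold 256 reference vectors.

```python
import math

def quantize(pixel, codebook):
    """Return the index j of the reference vector m_j nearest to the pixel."""
    return min(range(len(codebook)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(pixel, codebook[j])))

def compression_rate(bits_before, codebook_size):
    """Bits per vector before quantization divided by the bits needed for an index."""
    bits_after = math.ceil(math.log2(codebook_size))
    return bits_before / bits_after

# Hypothetical codebook of RGB reference vectors (assumption for illustration).
codebook = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 255)]
index = quantize((250, 10, 5), codebook)  # this pixel is closest to pure red
rate = compression_rate(24, 256)          # 24 bits/pixel -> 8 bits/pixel
```

Only the index is stored per pixel, which is where the factor of 24/8 = 3 comes from.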

The K-means Algorithm (cont)

• Total reconstruction error

$$ E\left(\{m_i\}_{i=1}^{k} \mid X\right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \left\| x^t - m_i \right\|^2 $$

  where the label $b_i^t$ is

$$ b_i^t = \begin{cases} 1 & \text{if } \left\| x^t - m_i \right\| = \min_j \left\| x^t - m_j \right\| \\ 0 & \text{otherwise} \end{cases} $$

  – $b_i^t$ and $m_i$ are unknown
  – $b_i^t$ depends on $m_i$, and this optimization problem cannot be solved analytically
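As a small illustration of the total reconstruction error, the sketch below evaluates it for fixed centers. The function names and the toy data are assumptions made for the example; ties are broken by taking the lowest cluster index.

```python
def nearest(x, centers):
    """Index i with b_i^t = 1, i.e. the center closest to x (lowest index on ties)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))

def reconstruction_error(data, centers):
    """Sum over all objects of the squared distance to the assigned center."""
    total = 0.0
    for x in data:
        i = nearest(x, centers)
        total += sum((a - b) ** 2 for a, b in zip(x, centers[i]))
    return total

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
error = reconstruction_error(data, centers)  # every object is at squared distance 1
```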

The K-means Algorithm (cont)

• Initialization
  – A set of initial cluster centers $\{m_i\}_{i=1}^{k}$ is needed

• Recursion
  – Assign each object $x^t$ to the cluster whose center is closest

$$ b_i^t = \begin{cases} 1 & \text{if } \left\| x^t - m_i \right\| = \min_j \left\| x^t - m_j \right\| \\ 0 & \text{otherwise} \end{cases} $$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$ m_i = \frac{\sum_{t=1}^{N} b_i^t \, x^t}{\sum_{t=1}^{N} b_i^t} $$

    • Using the medoid as the cluster center (a medoid is one of the objects in the cluster)

• These two steps are repeated until $m_i$ stabilizes
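The two alternating steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the seeds are passed in by the caller, an empty cluster simply keeps its old center, and "stabilizes" is taken to mean that the centers do not change between two passes.

```python
def squared_distance(x, m):
    """Squared Euclidean distance between an object x and a center m."""
    return sum((a - b) ** 2 for a, b in zip(x, m))

def kmeans(data, seeds, max_iter=100):
    """Alternate assignment and re-estimation until the centers stop changing."""
    centers = [tuple(m) for m in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: label each object with the index of its closest center.
        labels = [min(range(len(centers)), key=lambda i: squared_distance(x, centers[i]))
                  for x in data]
        # Re-estimation step: each center becomes the mean of its members.
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, label in zip(data, labels) if label == i]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, labels = kmeans(data, seeds=[(0.0, 0.0), (10.0, 2.0)])
```

On this toy data the centers move to the middle of each pair of points after one pass and are unchanged on the second pass, which ends the loop.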

The K-means Algorithm (cont)

• Algorithm [figure not reproduced]

The K-means Algorithm (cont)

• Example 1 [figure not reproduced]

The K-means Algorithm (cont)

• Example 2 [figure not reproduced; labels in the figure: government, finance, sports, research, name]

The K-means Algorithm (cont)

• Choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean $m$ of all data and generate k initial centers $m_i$ by adding a small random vector to the mean ($m \pm \delta$)
  – Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $m_i$
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering
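Two of the seeding options above can be sketched in a few lines. The perturbation scale, the fixed random seed, and the function names are assumptions chosen for the example; the buckshot helper only draws the square-root-sized sample and leaves the agglomerative clustering of that sample out.

```python
import random

def mean_vector(data):
    """Component-wise mean m of all objects."""
    n = len(data)
    return tuple(sum(col) / n for col in zip(*data))

def seeds_around_mean(data, k, scale=0.01, rng=None):
    """Generate k seeds by adding a small random vector to the global mean."""
    rng = rng or random.Random(0)
    m = mean_vector(data)
    return [tuple(m_d + rng.uniform(-scale, scale) for m_d in m) for _ in range(k)]

def buckshot_sample(data, rng=None):
    """Draw a random sample whose size is the square root of the data size."""
    rng = rng or random.Random(0)
    size = max(1, round(len(data) ** 0.5))
    return rng.sample(data, size)

data = [(float(i), float(i % 4)) for i in range(16)]
seeds = seeds_around_mean(data, k=3)
sample = buckshot_sample(data)
```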

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly

• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with $l = \log_2 k$ (k clusters)
    • [figure not reproduced; labels in the figure: nodes on the hypercube, a linear classifier]

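The K-means Algorithm (cont)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray
  – M → 2M at each iteration
  – [figure not reproduced; labels in the figure: global mean, cluster 1 mean, cluster 2 mean, and the resulting clusters $\{\mu_{11}, \Sigma_{11}, \omega_{11}\}$, $\{\mu_{12}, \Sigma_{12}, \omega_{12}\}$, $\{\mu_{13}, \Sigma_{13}, \omega_{13}\}$, $\{\mu_{14}, \Sigma_{14}, \omega_{14}\}$]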
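The doubling scheme (M → 2M) can be sketched as below. The slide does not spell out the splitting rule, so this sketch assumes the common variant in which every center m is replaced by m + eps and m - eps and the doubled codebook is then refined with K-means passes; `eps`, the helper names, and the toy data are all assumptions.

```python
def squared_distance(x, m):
    return sum((a - b) ** 2 for a, b in zip(x, m))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def refine(data, centers, max_iter=50):
    """Plain K-means passes started from the given centers."""
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for x in data:
            i = min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))
            groups[i].append(x)
        # Assumption: an empty cluster keeps its previous center.
        new_centers = [centroid(g) if g else m for g, m in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def lbg(data, target_size, eps=0.01):
    """Start from the global mean and double the codebook until target_size is reached."""
    centers = [centroid(data)]
    while len(centers) < target_size:
        # Split every center into two slightly perturbed copies (assumed rule).
        centers = [tuple(c + sign * eps for c in m) for m in centers for sign in (1, -1)]
        centers = refine(data, centers)
    return centers

data = [(0.0,), (1.0,), (10.0,), (11.0,)]
codebook = lbg(data, target_size=2)
```

Because the codebook size doubles each round, `target_size` is most naturally a power of two in this sketch.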

The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be the member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

$$ P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta), \qquad \pi_k = P(c_k) $$

• Continuous case: a mixture Gaussian HMM (or a mixture of Gaussians)

$$ P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2} \left|\Sigma_k\right|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right) $$

• Likelihood function for data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)

$$ P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) $$

• Classification

$$ \max_k P(c_k \mid \vec{x}_i, \Theta) = \max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) $$
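To make the mixture density, the likelihood, and the classification rule concrete, here is a minimal sketch restricted to one-dimensional Gaussians (so each covariance reduces to a scalar variance). The parameter values and function names are illustrative assumptions.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density, the m = 1 case of the formula above."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors(x, priors, means, variances):
    """P(c_k | x, Theta) for every cluster k, via Bayes' rule."""
    joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
    total = sum(joint)  # this is the mixture density P(x | Theta)
    return [j / total for j in joint]

def classify(x, priors, means, variances):
    """Index of the cluster with the largest posterior probability."""
    post = posteriors(x, priors, means, variances)
    return max(range(len(post)), key=post.__getitem__)

def log_likelihood(data, priors, means, variances):
    """log P(X | Theta) for i.i.d. samples: the sum of log mixture densities."""
    return sum(math.log(sum(p * gaussian_pdf(x, m, v)
                            for p, m, v in zip(priors, means, variances)))
               for x in data)

priors, means, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
soft = posteriors(2.0, priors, means, variances)  # halfway between the two means
hard = [classify(x, priors, means, variances) for x in (0.5, 3.0)]
```

Unlike K-means, every object receives a non-zero membership in each cluster; the hard label is only recovered by taking the maximum.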

The EM Algorithm (cont)

[figure not reproduced]

Maximum Likelihood Estimation

• Hard Assignment

  State S1

  P(B|S1) = 2/4 = 0.5
  P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

  Sample   State S1   State S2
  B        0.7        0.3
  W        0.4        0.6
  B        0.9        0.1
  W        0.5        0.5

  P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
  P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
  P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27
  P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
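The soft-assignment estimates can be reproduced with fractional counts. In this sketch the pairing of each B/W observation with its membership weights is inferred from the sums in the worked example, and the function name is hypothetical.

```python
# Each observation carries a membership weight for state S1 (index 0) and S2 (index 1).
observations = ["B", "W", "B", "W"]
memberships = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]

def soft_estimates(observations, memberships, state):
    """Maximum likelihood estimate of P(symbol | state) from fractional counts."""
    total = sum(w[state] for w in memberships)
    estimates = {}
    for symbol in sorted(set(observations)):
        count = sum(w[state] for o, w in zip(observations, memberships) if o == symbol)
        estimates[symbol] = count / total
    return estimates

p_s1 = soft_estimates(observations, memberships, state=0)
p_s2 = soft_estimates(observations, memberships, state=1)
```

With weights restricted to 0 or 1 the same function reduces to the hard-assignment case, where the counts are whole numbers.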

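The EM Algorithm (cont)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
$$

  where $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ and $C = \{c_{k_1}, c_{k_2}, \ldots, c_{k_n}\}$; $P(X \mid \Theta)$ is the likelihood function and $P(X, C \mid \Theta)$ is the complete data likelihood function

  – How many kinds of $C$? ($K^n$ kinds)

  – Note:

$$ \prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t} $$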
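The product-of-sums identity used in the derivation, and the count of $K^n$ alignments, can be verified numerically on a small array. The values in `a` and the function names are arbitrary choices for this check.

```python
import math
from itertools import product

def product_of_sums(a):
    """Left-hand side: the product over t of the row sums."""
    return math.prod(sum(row) for row in a)

def sum_over_alignments(a):
    """Right-hand side: sum over every index sequence (k_1, ..., k_T) of the product."""
    n_rows, n_cols = len(a), len(a[0])
    total, count = 0.0, 0
    for ks in product(range(n_cols), repeat=n_rows):
        total += math.prod(a[t][k] for t, k in enumerate(ks))
        count += 1
    return total, count

a = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2]]
lhs = product_of_sums(a)
rhs, n_alignments = sum_over_alignments(a)  # enumerates 3**4 alignments
```

The enumeration grows exponentially with the number of samples, which is why the later slides reduce the sum over all $C$ to per-sample posteriors.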

The EM Algorithm (cont)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on known data $(X, \Theta)$

$$ \Phi(\Theta, \hat{\Theta}) = E\left[ L_{CM} \right] = E_{C}\left[ \log P(X, C \mid \hat{\Theta}) \,\middle|\, X, \Theta \right] = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) $$

    where $\Theta$ is known and $\hat{\Theta}$ is unknown

  – Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$ \Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta) $$

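The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta})
&= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \right] \log \left[ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left\{ \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i, k} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \right] \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

  where

$$ \delta_{k_i, k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases} $$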

The EM Algorithm (cont)

  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left\{ \prod_{j=1, j \ne i}^{n} \left[ \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \right\} P(c_k \mid \vec{x}_i, \Theta) \\
&= 1 \cdot P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

    because of $\delta_{k_i, k}$, $\vec{x}_i$ can only be aligned to $c_k$

  – Note:

$$ \prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t} $$

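The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$ \Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta}) $$

  where

$$ \Phi_a(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta}) $$

  is the auxiliary function for mixture weights, and

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta}) $$

  is the auxiliary function for cluster distributions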

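The EM Algorithm (cont)

• M-step (Maximization)
  – Remember that
    • Maximize a function $F$ with a constraint by applying a Lagrange multiplier

  Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$, with the constraint $\sum_{j=1}^{N} y_j = 1$. By applying the Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; y_j = -\frac{w_j}{\ell} \quad \forall j \\
\sum_{j=1}^{N} y_j &= -\frac{1}{\ell} \sum_{j=1}^{N} w_j = 1 \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore \; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

  Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$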
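A quick numerical check of the Lagrange-multiplier result: the normalized weights should score at least as high as any other point on the probability simplex. The weights, the number of random trials, and the fixed seed are arbitrary assumptions for this sketch.

```python
import math
import random

def objective(w, y):
    """F = sum_j w_j log y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

def lagrange_solution(w):
    """Closed-form maximizer under sum_j y_j = 1: y_j = w_j / sum_j w_j."""
    total = sum(w)
    return [wj / total for wj in w]

def random_simplex_point(n, rng):
    """A random positive vector rescaled so that its entries sum to one."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(raw)
    return [r / s for r in raw]

w = [2.0, 1.0, 1.0]
y_star = lagrange_solution(w)
rng = random.Random(0)
best_random = max(objective(w, random_simplex_point(len(w), rng)) for _ in range(1000))
```

The same closed form is what turns the fractional counts in the M-step into the updated mixture weights.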

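The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

$$ \Rightarrow \; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k=1}^{K} w_k} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\sum_{k=1}^{K} \sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}} = \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} $$

    where $w_k$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$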

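The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta}) $$

$$ P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2} \left|\hat{\Sigma}_k\right|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right) $$

  Let

$$ w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} $$

  and

$$ \log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log \left|\hat{\Sigma}_k\right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) $$

  Then

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left|\hat{\Sigma}_k\right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D \right] $$

  where $D$ is a constant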

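The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left|\hat{\Sigma}_k\right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D \right] $$

$$ \frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 $$

$$ \Rightarrow \; \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}} $$

    where the denominator is the expected number of times that $\vec{x}_i$ falls in class $c_k$

  Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\Sigma_k^{-1}$ is symmetric here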

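The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left|\hat{\Sigma}_k\right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D \right] $$

$$
\begin{aligned}
& \frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \left|\hat{\Sigma}_k\right|^{-1} \cdot \left|\hat{\Sigma}_k\right| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
\Rightarrow \; & \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
\Rightarrow \; & \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow \; & \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
\Rightarrow \; & \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

  Notes ($\Sigma_k$ is symmetric here):

$$ \frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \qquad \frac{d(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1} \vec{a}\, \vec{b}^T X^{-1} $$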
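The three M-step updates (mixture weight, mean, covariance) can be sketched for the one-dimensional case, where each covariance matrix reduces to a scalar variance. The function name, the toy data, and the hard 0/1 responsibilities used in the example are assumptions for illustration.

```python
def m_step(data, resp):
    """Re-estimate priors, means, and variances from responsibilities resp[i][k] = w_ik."""
    n, n_clusters = len(data), len(resp[0])
    priors, means, variances = [], [], []
    for k in range(n_clusters):
        w_k = sum(resp[i][k] for i in range(n))                       # expected count
        mean_k = sum(resp[i][k] * data[i] for i in range(n)) / w_k    # weighted mean
        var_k = sum(resp[i][k] * (data[i] - mean_k) ** 2 for i in range(n)) / w_k
        priors.append(w_k / n)
        means.append(mean_k)
        variances.append(var_k)
    return priors, means, variances

data = [0.0, 2.0, 10.0, 12.0]
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # hard memberships for clarity
priors, means, variances = m_step(data, resp)
```

With fractional responsibilities the same code gives the soft updates; a cluster whose total responsibility is zero would cause a division by zero, which this sketch does not guard against.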

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
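Putting the E-step, the M-step, and the stopping rule together gives the loop sketched below for one-dimensional Gaussians. The tolerance, iteration cap, variance floor, and the hand-picked initial parameters are assumptions; in practice the initial values could come from a K-means run.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, priors, means, variances, tol=1e-6, max_iter=200, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture; stops when the log-likelihood gain drops below tol."""
    priors, means, variances = list(priors), list(means), list(variances)
    history = []
    for _ in range(max_iter):
        # E-step: responsibilities w_ik and the current log-likelihood.
        resp, log_lik = [], 0.0
        for x in data:
            joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        history.append(log_lik)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        # M-step: closed-form updates of weights, means, and variances.
        for k in range(len(priors)):
            w_k = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / w_k
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / w_k
            variances[k] = max(var_floor, spread)  # assumption: floor avoids collapse
            priors[k] = w_k / len(data)
    return priors, means, variances, history

data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
priors, means, variances, history = em_gmm_1d(data, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
```

The recorded `history` makes it easy to confirm that the log-likelihood does not decrease from one iteration to the next.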

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information
  – TMM/PLSA approach

$$ P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] $$

$$ P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right] $$

$$ dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} $$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by the distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[figure not reproduced: two-dimensional tree structure for organized topics]
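The neighborhood term $P(T_l \mid Y_k)$ and the resulting word probability can be sketched as follows. The 2×2 grid of topic positions, the value of sigma, and all probability tables are made-up values for illustration, as are the function names.

```python
import math

def neighborhood(positions, sigma):
    """Matrix P[k][l] = E(T_k, T_l) / sum_s E(T_k, T_s) from map coordinates."""
    rows = []
    for xk, yk in positions:
        e = [math.exp(-((xk - xl) ** 2 + (yk - yl) ** 2) / (2.0 * sigma ** 2))
             / (math.sqrt(2.0 * math.pi) * sigma)
             for xl, yl in positions]
        total = sum(e)
        rows.append([v / total for v in e])
    return rows

def word_probability(j, p_topic_given_doc, p_neighbor, p_word_given_topic):
    """P(w_j | D_i) = sum_k P(T_k | D_i) * sum_l P(T_l | Y_k) * P(w_j | T_l)."""
    n_topics = len(p_topic_given_doc)
    return sum(p_topic_given_doc[k]
               * sum(p_neighbor[k][l] * p_word_given_topic[l][j] for l in range(n_topics))
               for k in range(n_topics))

positions = [(0, 0), (0, 1), (1, 0), (1, 1)]          # topics on a 2x2 map (assumed)
p_neighbor = neighborhood(positions, sigma=1.0)
p_topic_given_doc = [0.7, 0.1, 0.1, 0.1]              # P(T_k | D_i), assumed
p_word_given_topic = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3],
                      [0.1, 0.1, 0.8], [0.3, 0.3, 0.4]]  # P(w_j | T_l), assumed
p_words = [word_probability(j, p_topic_given_doc, p_neighbor, p_word_given_topic)
           for j in range(3)]
```

Each topic shares probability mass with its neighbors on the map, so nearby topics end up with similar word distributions.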

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$ L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\} $$

  – EM training can be performed

$$ \hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})} $$

$$ \hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)} $$

  where

$$ P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid Y_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}} $$

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selecting

$$ S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]} $$

Hierarchical Document Organization (cont)

• Example [figure not reproduced]

Hierarchical Document Organization (cont)

• Example (cont) [figure not reproduced]

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  – A recursive regression process
  – Input layer: input vector $x = [x_1, x_2, \ldots, x_n]^T$
  – Mapping layer: each unit $i$ has a weight vector $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$ (e.g., $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$)

$$ m_i(t+1) = m_i(t) + h_{c(x), i}(t) \left[ x(t) - m_i(t) \right] $$

  where

$$ c(x) = \arg\min_i \left\| x - m_i \right\|, \qquad \left\| x - m_i \right\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2} $$

$$ h_{c(x), i}(t) = \alpha(t) \exp\left( -\frac{\left\| r_i - r_{c(x)} \right\|^2}{2 \sigma^2(t)} \right) $$
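One SOM update step can be sketched as below. The two-unit map, the learning rate alpha, the neighborhood width sigma, and the function names are assumptions for the example; a full training run would repeat this step over many inputs while shrinking alpha(t) and sigma(t).

```python
import math

def best_matching_unit(x, weights):
    """c(x): index of the unit whose weight vector is closest to the input x."""
    return min(range(len(weights)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, weights[i])))

def som_step(x, weights, positions, alpha, sigma):
    """Move every weight vector toward x, scaled by its map distance to the winner."""
    c = best_matching_unit(x, weights)
    updated = []
    for m_i, r_i in zip(weights, positions):
        d2 = sum((a - b) ** 2 for a, b in zip(r_i, positions[c]))
        h = alpha * math.exp(-d2 / (2.0 * sigma ** 2))   # neighborhood function
        updated.append([m + h * (xv - m) for m, xv in zip(m_i, x)])
    return c, updated

weights = [[0.0, 0.0], [1.0, 1.0]]   # weight vectors m_i (assumed)
positions = [(0, 0), (0, 1)]         # unit coordinates r_i on the map (assumed)
x = [0.2, 0.0]
winner, new_weights = som_step(x, weights, positions, alpha=0.5, sigma=1.0)
```

The winning unit moves the most; its neighbors on the map move by a smaller amount that decays with map distance.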

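Hierarchical Document Organization (cont)

• Results

  Model   Iterations   dist_Between / dist_Within
  TMM     10           1.9165
  TMM     20           2.0650
  TMM     30           1.9477
  TMM     40           1.9175
  SOM     100          2.0604

$$ R_{dist} = \frac{dist_{Between}}{dist_{Within}} $$

  where

$$ dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)} $$

$$ f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases} $$

$$ f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases} $$

$$ dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} $$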
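The distance ratio can be computed directly from map coordinates and reference topic labels, as in the sketch below. The four documents, their coordinates, and their labels are invented for illustration, and the function name is hypothetical.

```python
import math
from itertools import combinations

def distance_ratio(positions, topics):
    """R_dist: mean map distance of between-topic pairs over that of within-topic pairs."""
    between_sum = within_sum = 0.0
    between_count = within_count = 0
    for i, j in combinations(range(len(positions)), 2):
        d = math.hypot(positions[i][0] - positions[j][0],
                       positions[i][1] - positions[j][1])
        if topics[i] != topics[j]:
            between_sum += d
            between_count += 1
        else:
            within_sum += d
            within_count += 1
    dist_between = between_sum / between_count
    dist_within = within_sum / within_count
    return dist_between / dist_within

positions = [(0, 0), (0, 1), (3, 0), (3, 1)]   # map coordinates of documents (assumed)
topics = ["a", "a", "b", "b"]                  # reference topic labels (assumed)
ratio = distance_ratio(positions, topics)
```

A larger ratio means that documents with different reference topics sit farther apart on the map than documents sharing a topic. The sketch assumes at least one pair of each kind and a non-zero within-topic distance.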

ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 24: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 24

Non-Hierarchical Clustering

Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster), and then refine the initial partition
  - In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  - When to stop (criteria: group-average similarity, likelihood, mutual information)
  - What is the right number of clusters (k-1 → k → k+1)
  - Hierarchical clustering also has to face this problem

• Algorithms introduced here
  - The K-means algorithm
  - The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  - A hard clustering algorithm
  - Defines clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  - A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)

$$X = \{\mathbf{x}^t\}_{t=1}^{N} \;\xrightarrow{\;F\;}\; \{\mathbf{m}_j\}_{j=1}^{k}$$

      where $j$ is the index and $\mathbf{m}_j$ is the cluster centroid (also called reference vector, code vector, or code word)
  - E.g. color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e. Dim($\mathbf{x}^t$) = 24 → $k = 2^8$
    • A compression rate of 3
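The quantization step can be sketched in a few lines of Python. This is only an illustration: the 4-entry `codebook` and the `pixel` value are made-up data, and `quantize` is a hypothetical helper name, not something defined in the slides.

```python
import math

def quantize(x, codebook):
    """Return the index j of the code vector m_j nearest to x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codebook[j])))

# Hypothetical 4-entry codebook of RGB code vectors (illustrative values only).
codebook = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (255, 255, 255)]
pixel = (250, 10, 5)
index = quantize(pixel, codebook)          # only this index needs to be stored

# Bits needed to store one index into this small codebook.
bits_per_index = math.ceil(math.log2(len(codebook)))

# The slide's example: 24-bit pixels replaced by 8-bit indices (k = 2**8 code words).
compression_rate = 24 / 8
```

Storing the index instead of the vector is what gives the compression; the cost is the reconstruction error between each pixel and its code vector.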

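The K-means Algorithm (cont)

• Total reconstruction error

$$E\left(\{\mathbf{m}_i\}_{i=1}^{k} \mid X\right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \left\| \mathbf{x}^t - \mathbf{m}_i \right\|^2$$

  where the label $b_i^t$ is

$$b_i^t = \begin{cases} 1 & \text{if } \|\mathbf{x}^t - \mathbf{m}_i\| = \min_j \|\mathbf{x}^t - \mathbf{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$

  - $\mathbf{m}_i$ and $b_i^t$ are unknown
  - $b_i^t$ depends on $\mathbf{m}_i$, and this optimization problem cannot be solved analytically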

The K-means Algorithm (cont)

• Initialization
  - A set of initial cluster centers $\{\mathbf{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  - Assign each object $\mathbf{x}^t$ to the cluster whose center is closest

$$b_i^t = \begin{cases} 1 & \text{if } \|\mathbf{x}^t - \mathbf{m}_i\| = \min_j \|\mathbf{x}^t - \mathbf{m}_j\| \\ 0 & \text{otherwise} \end{cases}$$

  - Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$\mathbf{m}_i = \frac{\sum_{t=1}^{N} b_i^t \, \mathbf{x}^t}{\sum_{t=1}^{N} b_i^t}$$

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

• These two steps are repeated until $\mathbf{m}_i$ stabilizes
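The two alternating steps can be sketched in plain Python as follows. This is a minimal illustration, not the lecture's own code: the function names, the toy `points`, and the choices to break ties by lowest index and to leave an empty cluster's center unchanged are all assumptions made here.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest(x, centers):
    """Index i of the center m_i closest to x (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))

def kmeans(points, centers, max_iter=100):
    """Return (centers, labels, total reconstruction error)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: b_i^t = 1 for the nearest center, 0 otherwise.
        labels = [nearest(x, centers) for x in points]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: keep the center of an empty cluster
        if new_centers == centers:       # the centers have stabilized
            break
        centers = new_centers
    labels = [nearest(x, centers) for x in points]
    error = sum(sq_dist(x, centers[lab]) for x, lab in zip(points, labels))
    return centers, labels, error

# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels, error = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The returned `error` is the total reconstruction error, the sum of squared distances from each point to its assigned center.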

The K-means Algorithm (cont)

• Algorithm

The K-means Algorithm (cont)

• Example 1

The K-means Algorithm (cont)

• Example 2 [figure: clusters labeled government, finance, sports, research, name]

The K-means Algorithm (cont)

• The choice of initial cluster centers (seeds) is important
  - Pick at random
  - Calculate the mean $\mathbf{m}$ of all data and generate the k initial centers $\mathbf{m}_i$ by adding a small random vector to the mean: $\mathbf{m}_i = \mathbf{m} \pm \boldsymbol{\delta}$
  - Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $\mathbf{m}_i$
  - Or use another method, such as a hierarchical clustering algorithm on a subset of the objects
    • E.g. the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering
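The second seeding strategy (global mean plus a small random vector) can be sketched as below. The function name, the uniform perturbation, the `scale` value, and the fixed random seed are assumptions made for illustration; the slides only say "small random vector".

```python
import random

def perturbed_mean_seeds(points, k, scale=0.01, rng=None):
    """Generate k initial centers m_i = m + delta, where m is the global mean."""
    rng = rng or random.Random(0)          # fixed seed only to make the sketch repeatable
    dim = len(points[0])
    mean = [sum(x[d] for x in points) / len(points) for d in range(dim)]
    # Each delta component is drawn uniformly from [-scale, scale] (an assumption).
    return [tuple(mean[d] + rng.uniform(-scale, scale) for d in range(dim))
            for _ in range(k)]

seeds = perturbed_mean_seeds([(0.0, 0.0), (2.0, 2.0)], k=3)
```

All seeds start close to the global mean, so the first few K-means passes are what pull them apart.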

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  - Randomly assign the object to one of the candidate clusters
  - Or perturb objects slightly

• Applications of the K-means algorithm
  - Clustering
  - Vector quantization
  - A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with $l = \log_2 k$ (k clusters)
    • [figure: nodes on the hypercube; a linear classifier]
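One way to picture the mapping to an l-dimensional hypercube is to write each cluster index in binary, so that every cluster becomes one corner of the cube. The slides do not say which corner each cluster gets, so the binary encoding below is purely an assumption, and `hypercube_code` is a hypothetical helper.

```python
import math

def hypercube_code(cluster_index, k):
    """Map a cluster index in [0, k) to a corner of the l-dimensional hypercube, l = ceil(log2 k)."""
    l = max(1, math.ceil(math.log2(k)))
    # Most significant bit first; each coordinate is 0 or 1.
    return tuple((cluster_index >> bit) & 1 for bit in reversed(range(l)))

code = hypercube_code(5, 8)   # 8 clusters -> 3 binary coordinates
```

The resulting 0/1 coordinates can then serve as low-dimensional input features for a later classifier or regressor.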

The K-means Algorithm (cont)

• E.g. the LBG algorithm
  - By Linde, Buzo, and Gray
  - [figure: the global mean is split into the cluster 1 mean and the cluster 2 mean; the resulting clusters carry the parameters {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}]
  - M → 2M at each iteration
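The M → 2M growth can be sketched as a binary-splitting loop: start from the global mean, split every center into two slightly perturbed copies, and refine with a few K-means passes. The perturbation size `eps`, the fixed number of refinement passes, and all function names are assumptions of this sketch rather than details given in the slides.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def mean(vectors):
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

def refine(points, centers, iters=20):
    """A few K-means passes: assign to the nearest center, then recompute the means."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in points:
            i = min(range(len(centers)), key=lambda c: sq_dist(x, centers[c]))
            groups[i].append(x)
        # An empty group keeps its previous center (an assumption of this sketch).
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def lbg(points, target_size, eps=0.01):
    """Grow a codebook by binary splitting: M -> 2M centers at each iteration."""
    centers = [mean(points)]                          # start from the global mean
    while len(centers) < target_size:
        split = []
        for c in centers:                             # split each center into c + eps and c - eps
            split.append(tuple(v + eps for v in c))
            split.append(tuple(v - eps for v in c))
        centers = refine(points, split)
    return centers

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
codebook2 = lbg(points, 2)
codebook4 = lbg(points, 4)
```

Because the codebook doubles each round, `target_size` is effectively rounded up to the next power of two.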

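The EM Algorithm

• A soft version of the K-means algorithm
  - Each object can be a member of multiple clusters
  - Clustering as estimating a mixture of (continuous) probability distributions: a mixture Gaussian HMM (or a mixture of Gaussians)

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta), \qquad \pi_k = P(c_k)$$

• Continuous case: each cluster distribution is an m-dimensional Gaussian

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k)\right)$$

• Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)

$$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

• Classification: the denominator $P(\vec{x}_i \mid \Theta)$ does not depend on k, so it can be dropped

$$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$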
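For a one-dimensional mixture, the posterior cluster memberships and the classification rule can be sketched as follows. The parameter values are made up for illustration, and restricting to one dimension (so that the covariance is a scalar variance) is a simplification assumed here.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posteriors(x, weights, mus, variances):
    """P(c_k | x, Theta) for every cluster k, by Bayes' rule."""
    joint = [w * gaussian_pdf(x, mu, var) for w, mu, var in zip(weights, mus, variances)]
    total = sum(joint)                     # the mixture likelihood P(x | Theta)
    return [j / total for j in joint]

# Hypothetical two-component mixture.
weights, mus, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
post = posteriors(0.0, weights, mus, variances)
best = max(range(len(post)), key=lambda k: post[k])   # hard classification by arg max
```

Unlike K-means, every point keeps a fractional membership in every cluster; the arg max is needed only when a hard decision is required.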

The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment
  - State S1 [figure: four samples assigned to state S1, two B and two W]

  P(B|S1) = 2/4 = 0.5
  P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

  Sample    State S1    State S2
  1 (B)     0.7         0.3
  2 (W)     0.4         0.6
  3 (B)     0.9         0.1
  4 (W)     0.5         0.5

  P(B|S1) = (0.7+0.9) / (0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
  P(W|S1) = (0.4+0.5) / (0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
  P(B|S2) = (0.3+0.1) / (0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27
  P(W|S2) = (0.6+0.5) / (0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
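The soft-assignment estimates are weighted relative frequencies, which can be reproduced with a short sketch. The function name and data layout are assumptions; the membership weights are the ones on the slide, with samples 1 and 3 taken to be B and samples 2 and 4 taken to be W, as implied by the sums in the formulas.

```python
def soft_mle(observations, memberships):
    """observations[t] is a symbol; memberships[t][s] is the weight of sample t in state s.
    Returns, for each state, a dict symbol -> estimated probability."""
    n_states = len(memberships[0])
    estimates = []
    for s in range(n_states):
        total = sum(m[s] for m in memberships)          # soft count of all samples in state s
        counts = {}
        for obs, m in zip(observations, memberships):
            counts[obs] = counts.get(obs, 0.0) + m[s]   # soft count of each symbol
        estimates.append({sym: c / total for sym, c in counts.items()})
    return estimates

est = soft_mle(["B", "W", "B", "W"],
               [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)])
```

With all weights equal to 0 or 1 the same function reduces to the hard-assignment counts of the previous slide.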

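The EM Algorithm (cont)

• E-step (Expectation)
  - Derive the complete data likelihood function

$$
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
$$

    where $X = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_{n-1}, \vec{x}_n$ and $C = c_{k_1}, c_{k_2}, \ldots, c_{k_{n-1}}, c_{k_n}$; $P(X, C \mid \Theta)$ is the complete data likelihood function and $P(X \mid \Theta)$ is the likelihood function
  - How many kinds of $C$ are there? ($K^n$ kinds)
  - Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$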
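The product-of-sums identity and the count of $K^n$ possible alignments can be checked by brute force on a tiny example. The numbers in `a` are arbitrary illustrative values, with T = 3 rows (samples) and M = 2 columns (clusters).

```python
import itertools
import math

# Hypothetical terms a[t][k], e.g. P(x_t | c_k) P(c_k) for 3 samples and 2 clusters.
a = [[0.2, 0.3], [0.5, 0.1], [0.4, 0.6]]
T, M = len(a), len(a[0])

# Left-hand side: a product over samples of a sum over clusters.
lhs = math.prod(sum(row) for row in a)

# Right-hand side: a sum over every alignment (k_1, ..., k_T) of the product of chosen terms.
alignments = list(itertools.product(range(M), repeat=T))
rhs = sum(math.prod(a[t][k] for t, k in enumerate(alignment)) for alignment in alignments)

n_alignments = len(alignments)   # M ** T possible alignments
```

The exponential number of alignments is why the sum over C is never enumerated in practice; the E-step derivation that follows collapses it into per-sample posteriors.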

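The EM Algorithm (cont)

• E-step (Expectation)
  - Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= E_C\left[ L_{CM} \mid X, \Theta \right] = E_C\left[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \right] \\
&= \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})
\end{aligned}
$$

    where $\Theta$ is known and $\hat{\Theta}$ is unknown
  - Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$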

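The EM Algorithm (cont)

• E-step (Expectation)
  - The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, k_i} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

    where

$$\delta_{k, k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}$$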

The EM Algorithm (cont)

  - Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \; \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

  - Because of $\delta_{k, k_i}$, $\vec{x}_i$ can only be aligned to $c_k$
  - Note: the second step uses the identity

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + \cdots + a_{1M})(a_{21} + \cdots + a_{2M}) \cdots (a_{T1} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$
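The collapse of the sum over all alignments to a single posterior can be verified numerically on a small case. The posterior table `post` is made-up data with each row summing to 1, and `collapsed` is a hypothetical helper that enumerates every alignment explicitly.

```python
import itertools
import math

# Hypothetical posteriors post[j][k] = P(c_k | x_j, Theta); each row sums to 1.
post = [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]]
n, K = len(post), len(post[0])

def collapsed(i, k):
    """Sum over all alignments (k_1..k_n) with k_i = k of the product of posteriors."""
    total = 0.0
    for alignment in itertools.product(range(K), repeat=n):
        if alignment[i] == k:                       # the Kronecker delta keeps only k_i = k
            total += math.prod(post[j][alignment[j]] for j in range(n))
    return total

max_gap = max(abs(collapsed(i, k) - post[i][k]) for i in range(n) for k in range(K))
```

The gap is zero up to rounding because the factors for every other sample sum to 1 and drop out.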

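The EM Algorithm (cont)

• E-step (Expectation)
  - The auxiliary function can also be divided into two parts

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

    where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

    is the auxiliary function for the mixture weights, and

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

    is the auxiliary function for the cluster distributions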

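The EM Algorithm (cont)

• M-step (Maximization)
  - Remember that
    • A function $F$ can be maximized under a constraint by applying a Lagrange multiplier

  Suppose that

$$F = \sum_{j=1}^{N} w_j \log y_j \qquad \text{with the constraint} \qquad \sum_{j=1}^{N} y_j = 1$$

  By applying a Lagrange multiplier $\lambda$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \lambda \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \lambda = 0 \;\Rightarrow\; w_j = -\lambda\, y_j \quad \forall j \\
\sum_{j=1}^{N} w_j &= -\lambda \sum_{j=1}^{N} y_j \;\Rightarrow\; \lambda = -\sum_{j=1}^{N} w_j \\
\therefore \; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

  Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$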
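The closed-form result $y_j = w_j / \sum_{j'} w_{j'}$ can be sanity-checked numerically by comparing it against many random points on the probability simplex. The weights `w`, the number of trials, and the random seed are arbitrary choices for this sketch.

```python
import math
import random

def objective(w, y):
    """F = sum_j w_j log y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

w = [2.0, 1.0, 1.0]
y_star = [wj / sum(w) for wj in w]          # the Lagrange-multiplier solution

rng = random.Random(0)
best_random = -math.inf
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in w]  # strictly positive, then normalized to sum to 1
    y = [r / sum(raw) for r in raw]
    best_random = max(best_random, objective(w, y))
```

No random feasible point scores higher than `y_star`, which is what the derivation predicts.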

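The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians), subject to $\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) = 1$

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \lambda \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta}) + \lambda \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

  - This has the form $\sum_k w_k \log y_k$ with

$$w_k = \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}, \qquad y_k = P(c_k \mid \hat{\Theta})$$

    where $w_k$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$
  - Applying the Lagrange-multiplier result $y_k = w_k / \sum_{k'} w_{k'}$:

$$
\hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle\sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$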

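The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

$$P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\, |\hat{\Sigma}_k|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)\right)$$

  - Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

    and

$$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

  - Then, with the constant $D = -\frac{m}{2} \log(2\pi)$,

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D \right]$$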

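The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D \right]$$

$$\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = \sum_{i=1}^{n} w_{ik} \cdot \left(-\frac{1}{2}\right) \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0$$

$$\Rightarrow \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}$$

  - Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here
  - $w_{ik}$: the expected number of times that $\vec{x}_i$ falls in class $c_k$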

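The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) + D \right]$$

$$
\begin{aligned}
& \frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{|\hat{\Sigma}_k|} \cdot |\hat{\Sigma}_k| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
\Rightarrow\; & \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
\Rightarrow\; & \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow\; & \hat{\Sigma}_k \sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
\Rightarrow\; & \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

  - Note: $\dfrac{d \det(X)}{dX} = \det(X) \cdot (X^{-1})^T$, and $\hat{\Sigma}_k$ is symmetric here
  - Note: $\dfrac{d\,(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1} \vec{a}\, \vec{b}^T X^{-1}$ (for symmetric $X$)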

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
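Putting the E-step and the three M-step updates together, a one-dimensional Gaussian mixture can be fitted with the sketch below. It is an illustration under stated assumptions, not the lecture's implementation: the data are scalars (so each covariance is a single variance), the initial parameters are passed in by the caller (they could come from K-means), and the variance floor `1e-6` and tolerance `tol` are arbitrary safeguards chosen here.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, weights, mus, variances, max_iter=200, tol=1e-9):
    """EM for a 1-D Gaussian mixture.
    Returns (weights, mus, variances, log-likelihood from the last E-step)."""
    weights, mus, variances = list(weights), list(mus), list(variances)
    K, n = len(weights), len(xs)
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E-step: w[i][k] = P(c_k | x_i, Theta); also accumulate log P(X | Theta).
        w, ll = [], 0.0
        for x in xs:
            joint = [weights[k] * gaussian_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(joint)            # assumed > 0 (no numerical underflow)
            ll += math.log(total)
            w.append([j / total for j in joint])
        if ll - prev_ll < tol:            # the likelihood has converged
            break
        prev_ll = ll
        # M-step: closed-form updates for the weights, means, and variances.
        for k in range(K):
            wk = sum(w[i][k] for i in range(n))
            weights[k] = wk / n
            mus[k] = sum(w[i][k] * xs[i] for i in range(n)) / wk
            var = sum(w[i][k] * (xs[i] - mus[k]) ** 2 for i in range(n)) / wk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsing component
    return weights, mus, variances, ll

# Toy data: five points near 0 and five points near 5.
xs = [0.0, 0.2, -0.2, 0.1, -0.1, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, mus, variances, ll = em_gmm_1d(xs, [0.5, 0.5], [1.0, 4.0], [1.0, 1.0])
```

The loop stops on whichever comes first, a negligible increase in the log-likelihood or `max_iter` iterations, mirroring the two termination conditions named above.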

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  - TMM/PLSA approach

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]$$

$$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by the distances on the map

• When a cluster has many documents, it can be further analyzed into another map on the next layer

[figure: two-dimensional tree structure for organized topics]
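The neighborhood probabilities $P(T_l \mid Y_k)$ depend only on the positions of the topics on the map, so they can be precomputed. The sketch below assumes a 2×2 grid of topic coordinates and σ = 1, both chosen only for illustration; `neighborhood_probs` is a hypothetical helper name.

```python
import math

def neighborhood_probs(coords, sigma):
    """probs[k][l] = P(T_l | Y_k), a Gaussian of the map distance, normalized over l."""
    def e(a, b):
        d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2          # squared map distance
        return math.exp(-d2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    probs = []
    for k in coords:
        row = [e(k, l) for l in coords]
        total = sum(row)
        probs.append([v / total for v in row])
    return probs

# Hypothetical 2x2 topic map.
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
probs = neighborhood_probs(coords, sigma=1.0)
```

Topics that sit close together on the map share probability mass, which is what ties the learned clusters to the map layout.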

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}
$$

  - EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

    where $c(D_i) = \sum_{j=1}^{J} c(w_j, D_i)$ is the total term count of $D_i$ and

$$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid Y_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}$$
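The selection score compares how much of a word's count falls in documents that belong to the topic against how much falls elsewhere. A minimal sketch follows, with a made-up count matrix `counts[i][j]` = c(w_j, D_i) and made-up topic posteriors `topic_probs[i][k]` = P(T_k | D_i); the function name is hypothetical.

```python
def topic_word_score(counts, topic_probs, j, k):
    """S(w_j, T_k): topic-weighted count of word j divided by its count outside topic k."""
    n_docs = len(counts)
    inside = sum(counts[i][j] * topic_probs[i][k] for i in range(n_docs))
    outside = sum(counts[i][j] * (1.0 - topic_probs[i][k]) for i in range(n_docs))
    return inside / outside   # assumes the word is not exclusive to topic k (outside > 0)

# Two documents, two words, two topics (illustrative numbers only).
counts = [[3, 0], [1, 2]]
topic_probs = [[0.9, 0.1], [0.2, 0.8]]
score_w0_t0 = topic_word_score(counts, topic_probs, j=0, k=0)
score_w0_t1 = topic_word_score(counts, topic_probs, j=0, k=1)
```

Word 0 occurs mostly in the document dominated by topic 0, so it scores higher for topic 0 than for topic 1.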

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

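Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  - A recursive regression process
  - [figure: an input layer receiving the input vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ and a mapping layer whose units hold weight vectors $\mathbf{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, e.g. $\mathbf{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$]

$$\mathbf{m}_i(t+1) = \mathbf{m}_i(t) + h_{c(\mathbf{x}), i}(t) \left[ \mathbf{x}(t) - \mathbf{m}_i(t) \right]$$

  where

$$c(\mathbf{x}) = \arg\min_i \|\mathbf{x} - \mathbf{m}_i\|, \qquad \|\mathbf{x} - \mathbf{m}_i\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

$$h_{c(\mathbf{x}), i}(t) = \alpha(t) \exp\left( -\frac{\|\mathbf{r}_i - \mathbf{r}_{c(\mathbf{x})}\|^2}{2\sigma^2(t)} \right)$$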
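A single SOM update can be sketched as follows: find the best-matching unit c(x), then pull every weight vector toward the input by an amount that decays with map distance from that unit. The two-unit map, the input, and the values of `alpha` and `sigma` are invented for illustration; in practice both would shrink over time t.

```python
import math

def som_step(x, weights, positions, alpha, sigma):
    """One SOM update. weights[i] is m_i; positions[i] is the map coordinate r_i.
    Returns (index of the best-matching unit, updated weights)."""
    def sq(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    c = min(range(len(weights)), key=lambda i: sq(x, weights[i]))   # c(x) = arg min ||x - m_i||
    updated = []
    for m_i, r_i in zip(weights, positions):
        h = alpha * math.exp(-sq(r_i, positions[c]) / (2 * sigma ** 2))  # neighborhood function
        updated.append([m + h * (xv - m) for m, xv in zip(m_i, x)])
    return c, updated

weights = [[0.0, 0.0], [1.0, 1.0]]
positions = [(0, 0), (1, 0)]
c, updated = som_step([0.9, 0.9], weights, positions, alpha=0.5, sigma=1.0)
```

The winning unit moves the most; its map neighbors move less, which is how nearby units come to represent similar inputs.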

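Hierarchical Document Organization (cont)

• Results (ratio values as extracted; their decimal points are missing in the source)

  Model    Iterations    dist_Between / dist_Within
  TMM      10            19165
  TMM      20            20650
  TMM      30            19477
  TMM      40            19175
  SOM      100           20604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

  where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

$$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$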
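The distance ratio can be computed directly from map coordinates and topic labels. In the sketch below the coordinates and labels are toy values, and treating the labels as each document's reference topic is an assumption about the notation.

```python
import math
from itertools import combinations

def r_dist(coords, labels):
    """Average map distance between documents of different topics,
    divided by the average map distance between documents of the same topic."""
    between_sum = within_sum = 0.0
    between_count = within_count = 0
    for i, j in combinations(range(len(coords)), 2):      # all pairs with i < j
        d = math.dist(coords[i], coords[j])               # Euclidean distance on the map
        if labels[i] != labels[j]:
            between_sum += d
            between_count += 1
        else:
            within_sum += d
            within_count += 1
    return (between_sum / between_count) / (within_sum / within_count)

# Four documents on the map, two per topic.
coords = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = ["a", "a", "b", "b"]
ratio = r_dist(coords, labels)
```

A larger ratio means that documents of the same topic sit closer together on the map than documents of different topics.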

ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 25: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 25

Non-hierarchical Clustering

• Start out with a partition based on randomly selected seeds (one seed per cluster), and then refine the initial partition
  - In a multi-pass manner (recursion/iterations)

• Problems associated with non-hierarchical clustering
  - When to stop (possible criteria: group-average similarity, likelihood, mutual information)
  - What is the right number of clusters (k-1 → k → k+1); hierarchical clustering also has to face this problem

• Algorithms introduced here
  - The K-means algorithm
  - The EM algorithm

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  - A hard clustering algorithm
  - Defines clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  - A kind of vector quantization
    • Maps from a continuous space (high resolution) to a discrete space (low resolution)

$$X = \{x^t\}_{t=1}^{N} \;\longrightarrow\; F = \{m_j\}_{j=1}^{k}$$

    where each $m_j$ is a reference vector (also called code word, code vector, or centroid) and $j$ is the cluster index

  - E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors), i.e., $\mathrm{Dim}(x^t) = 24 \rightarrow k = 2^8$
    • A compression rate of 3
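The quantization mapping and the compression rate of 3 can be illustrated with a minimal sketch. The three-entry codebook and the pixel values below are made up for illustration, and the helper names `quantize` and `compression_rate` are assumptions, not part of the slides.

```python
import math

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest code vector (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(range(len(codebook)), key=lambda j: sqdist(x, codebook[j]))
            for x in vectors]

def compression_rate(bits_before, codebook_size):
    """Bits per vector before quantization divided by bits needed for a codebook index."""
    return bits_before / math.log2(codebook_size)

# Toy codebook (hypothetical): black, white, red
codebook = [(0, 0, 0), (255, 255, 255), (255, 0, 0)]
pixels = [(10, 5, 0), (250, 240, 255), (200, 30, 20)]
indices = quantize(pixels, codebook)

# 24 bits/pixel replaced by an index into 2**8 = 256 code vectors
rate = compression_rate(24, 256)
```

Only the index of the nearest code vector is stored per pixel, which is where the 24/8 = 3 ratio comes from.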

The K-means Algorithm (cont.)

• Total reconstruction error

$$E\left(\{m_i\}_{i=1}^{k} \mid X\right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \, \lVert x^t - m_i \rVert^2, \quad \text{where } b_i^t = \begin{cases} 1 & \text{if } \lVert x^t - m_i \rVert = \min_j \lVert x^t - m_j \rVert \\ 0 & \text{otherwise} \end{cases}$$

  - The labels $b_i^t$ and the centers $m_i$ are unknown
  - $b_i^t$ depends on $m_i$, and this optimization problem cannot be solved analytically

The K-means Algorithm (cont.)

• Initialization
  - A set of initial cluster centers $\{m_i\}_{i=1}^{k}$ is needed

• Recursion
  - Assign each object $x^t$ to the cluster whose center is closest

$$b_i^t = \begin{cases} 1 & \text{if } \lVert x^t - m_i \rVert = \min_j \lVert x^t - m_j \rVert \\ 0 & \text{otherwise} \end{cases}$$

  - Then recompute the center of each cluster as the centroid or mean (average) of its members

$$m_i = \frac{\sum_{t=1}^{N} b_i^t \, x^t}{\sum_{t=1}^{N} b_i^t}$$

    • Or use the medoid as the cluster center (a medoid is one of the objects in the cluster)

• These two steps are repeated until $m_i$ stabilizes
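The two alternating steps can be sketched in plain Python as follows. This is a minimal illustration, not the lecture's own code: the function names, the `max_iter` cap, and the choice to keep the old center when a cluster becomes empty are all assumptions.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def assign(points, centers):
    """Hard assignment: index i of the nearest center, i.e. the i with b_i^t = 1."""
    return [min(range(len(centers)), key=lambda i: sqdist(x, centers[i]))
            for x in points]

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and centroid update until the centers stop changing."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:
                # Centroid: coordinate-wise mean of the cluster members
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    # Total reconstruction error E for the final centers and labels
    error = sum(sqdist(x, centers[lab]) for x, lab in zip(points, labels))
    return centers, labels, error

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, labels, error = kmeans(points, [(0, 0), (10, 10)])
```

On this toy data the centers settle at (0, 0.5) and (10, 10.5) after one update, and the reconstruction error is 1.0.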

The K-means Algorithm (cont.)

• Algorithm [figure not recovered]

The K-means Algorithm (cont.)

• Example 1 [figure not recovered]

The K-means Algorithm (cont.)

• Example 2 [figure not recovered; cluster labels: government, finance, sports, research, name]

The K-means Algorithm (cont.)

• The choice of initial cluster centers (seeds) is important
  - Pick at random
  - Calculate the mean $m$ of all data and generate the $k$ initial centers $m_i$ by adding a small random vector to the mean: $m \pm \delta$
  - Project the data onto the principal component (first eigenvector), divide its range into $k$ equal intervals, and take the mean of the data in each group as the initial center $m_i$
  - Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering
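Two of the seeding options above can be sketched briefly: perturbing the global mean, and drawing the square-root-sized sample on which buckshot would then run agglomerative clustering (the clustering itself is not shown). The function names, the `scale` parameter, and the fixed random seed are illustrative assumptions.

```python
import math
import random

def seeds_around_mean(points, k, scale=0.01, rng=None):
    """Generate k seeds as the global mean plus a small random vector (m +/- delta)."""
    rng = rng or random.Random(0)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return [tuple(m + rng.uniform(-scale, scale) for m in mean) for _ in range(k)]

def buckshot_sample(points, rng=None):
    """Draw a random sample of about sqrt(n) objects, as in the buckshot seeding step."""
    rng = rng or random.Random(0)
    size = max(1, math.isqrt(len(points)))
    return rng.sample(points, size)

pts = [(float(i), float(i % 4)) for i in range(16)]
seeds = seeds_around_mean(pts, 3)
sample = buckshot_sample(pts)
```

Clustering only the sample keeps the cost of the agglomerative step low; the resulting cluster means would then serve as K-means seeds.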

The K-means Algorithm (cont.)

• How to break ties when several centers are at the same distance from an object
  - Randomly assign the object to one of the candidate clusters
  - Or perturb the objects slightly

• Applications of the K-means algorithm
  - Clustering
  - Vector quantization
  - A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with $l = \log_2 k$ for $k$ clusters

[Figure labels: nodes on the hypercube; a linear classifier]

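The K-means Algorithm (cont.)

• E.g., the LBG algorithm
  - By Linde, Buzo, and Gray
  - The number of clusters doubles at each iteration: $M \rightarrow 2M$

[Figure: the global mean is split into the cluster 1 mean and the cluster 2 mean, which are split in turn into the components $\{\mu_{11}, \Sigma_{11}, \omega_{11}\}$, $\{\mu_{12}, \Sigma_{12}, \omega_{12}\}$, $\{\mu_{13}, \Sigma_{13}, \omega_{13}\}$, $\{\mu_{14}, \Sigma_{14}, \omega_{14}\}$]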
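The doubling scheme can be sketched as repeated splitting followed by K-means refinement. This is a simplified illustration on means only (no covariances or weights); the perturbation size `delta`, the fixed number of refinement passes, and all function names are assumptions.

```python
def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def refine(points, centers, iters=20):
    """A few K-means passes: assign to the nearest center, then recompute the means."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in points:
            j = min(range(len(centers)), key=lambda i: sqdist(x, centers[i]))
            groups[j].append(x)
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else old
                   for g, old in zip(groups, centers)]
    return centers

def lbg(points, target_size, delta=0.01):
    """Start from the global mean and split every center into m + delta and m - delta."""
    dim = len(points[0])
    centers = [tuple(sum(p[d] for p in points) / len(points) for d in range(dim))]
    while len(centers) < target_size:
        split = []
        for m in centers:
            split.append(tuple(v + delta for v in m))
            split.append(tuple(v - delta for v in m))
        centers = refine(points, split)  # M -> 2M centers after each split
    return centers

data = [(0.0,), (1.0,), (10.0,), (11.0,)]
two = sorted(lbg(data, 2))
four = sorted(lbg(data, 4))
```

Because each split doubles the codebook, the target size is naturally a power of two.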

The EM Algorithm

• A soft version of the K-means algorithm
  - Each object can be a member of multiple clusters
  - Clustering as estimating a mixture of (continuous) probability distributions

• A mixture Gaussian HMM (or a mixture of Gaussians), with mixture weights $\pi_k = P(c_k)$, $k = 1, \dots, K$

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)$$

• Continuous case ($m$ is the dimension of $\vec{x}_i$)

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2} \lvert \Sigma_k \rvert^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$$

• Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.)

$$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)$$

• Classification

$$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)$$

The EM Algorithm (cont.)

[figure not recovered]

Maximum Likelihood Estimation

• Hard assignment: every sample is assigned to exactly one state (here state S1, with four samples, two B and two W)

P(B|S1) = 2/4 = 0.5

P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft assignment: every sample belongs to each state with a membership probability

Sample   State S1   State S2
1 (B)    0.7        0.3
2 (W)    0.4        0.6
3 (B)    0.9        0.1
4 (W)    0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64

P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36

P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27

P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
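The soft-assignment arithmetic can be reproduced with a short sketch. The sample order B, W, B, W is inferred from the numerators of the fractions on the slide, and the function name is an assumption.

```python
def soft_ml_estimate(colors, weights):
    """Weighted relative frequency of each color, using one state's membership weights."""
    total = sum(weights)
    return {c: sum(w for col, w in zip(colors, weights) if col == c) / total
            for c in sorted(set(colors))}

colors = ["B", "W", "B", "W"]          # inferred sample order
p_s1 = [0.7, 0.4, 0.9, 0.5]            # membership of each sample in state S1
p_s2 = [1 - p for p in p_s1]           # membership in state S2

est_s1 = soft_ml_estimate(colors, p_s1)
est_s2 = soft_ml_estimate(colors, p_s2)
```

With hard assignment every weight would be 0 or 1, and the same function would reduce to plain counting.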

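The EM Algorithm (cont.)

• E-step (Expectation)
  - Derive the complete data likelihood function

$$\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \dots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \dots \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \dots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta) \, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \dots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}$$

    where $X = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$ and $C = \{c_{k_1}, c_{k_2}, \dots, c_{k_n}\}$; $P(X \mid \Theta)$ is the likelihood function and $P(X, C \mid \Theta)$ is the complete data likelihood function

  - How many kinds of $C$ are there? ($K^n$ kinds)

  - Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t,k} = (a_{1,1} + a_{1,2} + \dots + a_{1,M})(a_{2,1} + a_{2,2} + \dots + a_{2,M}) \cdots (a_{T,1} + a_{T,2} + \dots + a_{T,M}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t,k_t}$$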
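The product-of-sums identity in the note, and the count of $K^n$ alignments, can be checked numerically by brute-force enumeration. The matrix `a` below is arbitrary illustrative data and the function names are assumptions.

```python
import itertools
import math

def product_of_sums(a):
    """Left-hand side: product over t of the row sums."""
    return math.prod(sum(row) for row in a)

def sum_over_alignments(a):
    """Right-hand side: sum over every choice (k_1, ..., k_T) of one column per row."""
    T, M = len(a), len(a[0])
    total, count = 0.0, 0
    for ks in itertools.product(range(M), repeat=T):
        total += math.prod(a[t][ks[t]] for t in range(T))
        count += 1
    return total, count

a = [[0.2, 0.3], [0.1, 0.4], [0.5, 0.6]]   # T = 3 rows, M = 2 columns
lhs = product_of_sums(a)
rhs, count = sum_over_alignments(a)         # count is M**T = 8
```

The enumeration grows exponentially with the number of rows, which is why the likelihood is evaluated in its product-of-sums form in practice.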

The EM Algorithm (cont.)

• E-step (Expectation)
  - Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$; here $\Theta$ is known and $\hat{\Theta}$ is unknown

$$\Phi(\Theta, \hat{\Theta}) = E_C\left[ L_{CM} \mid X, \Theta \right] = E_C\left[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \right] = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})$$

  - Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$

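The EM Algorithm (cont.)

• E-step (Expectation)
  - The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i,k} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i,k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(c_k \mid \hat{\Theta}) \, P(\vec{x}_i \mid c_k, \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}$$

    where

$$\delta_{k_i,k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}$$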

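The EM Algorithm (cont.)

  - Note that

$$\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i,k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}$$

    • $\delta_{k_i,k}$ keeps only the terms in which $\vec{x}_i$ is aligned to $c_k$ ($\vec{x}_i$ can only be aligned to $c_k$)
    • The second step uses the identity

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t,k} = (a_{1,1} + a_{1,2} + \dots + a_{1,M})(a_{2,1} + a_{2,2} + \dots + a_{2,M}) \cdots (a_{T,1} + a_{T,2} + \dots + a_{T,M}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t,k_t}$$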

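The EM Algorithm (cont.)

• E-step (Expectation)
  - The auxiliary function can also be divided into two parts

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

    where

$$\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) \, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}$$

    is the auxiliary function for the mixture weights, and

$$\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) \, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}$$

    is the auxiliary function for the cluster distributions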

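The EM Algorithm (cont.)

• M-step (Maximization)
  - Remember that
    • A function $F$ can be maximized under a constraint by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$, with the constraint $\sum_{j=1}^{N} y_j = 1$.

By applying a Lagrange multiplier $\ell$:

$$\hat{F} = \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right)$$

$$\frac{\partial \hat{F}}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell \, y_j \quad \forall j$$

$$\Rightarrow\; \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j$$

$$\therefore\; y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$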
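The closed-form result $y_j = w_j / \sum_j w_j$ can be sanity-checked numerically by comparing it against random points on the probability simplex. This is only an illustrative check, not a proof; the weights, the random seed, and the number of trials are arbitrary assumptions.

```python
import math
import random

def objective(w, y):
    """F = sum_j w_j log y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

def closed_form(w):
    """Maximizer under sum_j y_j = 1, from the Lagrange-multiplier derivation."""
    s = sum(w)
    return [wj / s for wj in w]

rng = random.Random(0)
w = [2.0, 1.0, 1.0]
y_star = closed_form(w)
best = objective(w, y_star)

# Evaluate F at random feasible points; none should beat the closed-form solution.
trials = []
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in w]
    y = [r / sum(raw) for r in raw]
    trials.append(objective(w, y))
```

The same normalization is what turns expected counts into mixture weights in the M-step.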

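The EM Algorithm (cont.)

• M-step (Maximization)
  - Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians), subject to $\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) = 1$

$$\hat{\Phi}_a(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) = \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) \, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)$$

$$\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) \, P(c_l \mid \Theta)}}{\sum_{k'=1}^{K} \sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_{k'}, \Theta) \, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) \, P(c_l \mid \Theta)}} = \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) \, P(c_l \mid \Theta)}$$

    • Each summand $P(c_k \mid \vec{x}_i, \Theta)$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$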

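The EM Algorithm (cont.)

• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) \, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

$$P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2} \lvert \hat{\Sigma}_k \rvert^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)$$

Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) \, P(c_l \mid \Theta)}$$

and

$$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

Then

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

where $D$ is a constant (it does not depend on $\hat{\Theta}$)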

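The EM Algorithm (cont.)

• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

Using $\dfrac{d(x^T C x)}{dx} = (C + C^T) x$, with $\hat{\Sigma}_k^{-1}$ symmetric here:

$$\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0$$

$$\Rightarrow\; \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik} \, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) \, P(c_l \mid \Theta)} \, \vec{x}_i}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) \, P(c_l \mid \Theta)}}$$

    • $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$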

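The EM Algorithm (cont.)

• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

Using the following identities, with $\hat{\Sigma}_k$ symmetric here:

$$\frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \qquad \frac{d (a^T X^{-1} b)}{dX} = -X^{-T} a \, b^T X^{-T}$$

$$\begin{aligned}
& \frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{\lvert \hat{\Sigma}_k \rvert} \cdot \lvert \hat{\Sigma}_k \rvert \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
\Rightarrow\; & \sum_{i=1}^{n} w_{ik} \, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik} \, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
\Rightarrow\; & \sum_{i=1}^{n} w_{ik} \, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik} \, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow\; & \sum_{i=1}^{n} w_{ik} \, \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik} \, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
\Rightarrow\; & \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik} \, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) \, P(c_l \mid \Theta)} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta) \, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) \, P(c_l \mid \Theta)}}
\end{aligned}$$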

The EM Algorithm (cont.)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
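The E-step and M-step updates derived above can be put together in a minimal sketch for the one-dimensional case, where each covariance matrix reduces to a scalar variance. This is a simplification for illustration, not the multivariate procedure of the slides; the variance floor `1e-6`, the tolerance, and the toy data are assumptions.

```python
import math

def gauss(x, mu, var):
    """1-D Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    """log P(X | Theta) for i.i.d. samples under the mixture."""
    return sum(math.log(sum(p * gauss(x, m, v) for p, m, v in zip(pis, mus, variances)))
               for x in xs)

def em_step(xs, pis, mus, variances):
    K, n = len(pis), len(xs)
    # E-step: w[i][k] = P(c_k | x_i, Theta)
    w = []
    for x in xs:
        joint = [pis[k] * gauss(x, mus[k], variances[k]) for k in range(K)]
        z = sum(joint)
        w.append([j / z for j in joint])
    # M-step: closed-form updates for weights, means, and variances
    nk = [sum(w[i][k] for i in range(n)) for k in range(K)]
    new_pis = [nk[k] / n for k in range(K)]
    new_mus = [sum(w[i][k] * xs[i] for i in range(n)) / nk[k] for k in range(K)]
    new_vars = [max(sum(w[i][k] * (xs[i] - new_mus[k]) ** 2 for i in range(n)) / nk[k], 1e-6)
                for k in range(K)]  # floor guards against a collapsing variance
    return new_pis, new_mus, new_vars

def em(xs, pis, mus, variances, tol=1e-8, max_iter=200):
    """Iterate until the log-likelihood gain drops below tol or max_iter is reached."""
    prev = cur = log_likelihood(xs, pis, mus, variances)
    for _ in range(max_iter):
        pis, mus, variances = em_step(xs, pis, mus, variances)
        cur = log_likelihood(xs, pis, mus, variances)
        if cur - prev < tol:
            break
        prev = cur
    return pis, mus, variances, cur

xs = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
start = ([0.5, 0.5], [0.0, 5.0], [1.0, 1.0])   # e.g. means taken from a K-means run
pis, mus, variances, ll = em(xs, *start)
```

On this well-separated toy data the means converge to roughly 0.05 and 5.05 with equal mixture weights.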

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  - TMM/PLSA approach

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k) \, P(w_j \mid T_l) \right]$$

$$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \qquad E(T_k, T_l) = \frac{1}{\sqrt{2\pi} \, \sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by their distances on the map

• When a cluster has many documents, it can be further analyzed into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]
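The neighborhood weights $P(T_l \mid Y_k)$ and the resulting word distribution $P(w_j \mid D_i)$ can be sketched as follows. The 2x2 map coordinates, the value of `sigma`, and the toy probabilities are illustrative assumptions, as are the function names.

```python
import math

def neighborhood(coords, sigma):
    """P(T_l | Y_k): Gaussian weights of the map distance, normalized over l for each k."""
    K = len(coords)
    def e(k, l):
        d2 = (coords[k][0] - coords[l][0]) ** 2 + (coords[k][1] - coords[l][1]) ** 2
        return math.exp(-d2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return [[e(k, l) / sum(e(k, s) for s in range(K)) for l in range(K)]
            for k in range(K)]

def word_given_doc(p_topic_doc, p_neigh, p_word_topic):
    """P(w_j | D_i) for one document: sum_k P(T_k|D_i) sum_l P(T_l|Y_k) P(w_j|T_l)."""
    K = len(p_topic_doc)
    J = len(p_word_topic[0])
    return [sum(p_topic_doc[k] * sum(p_neigh[k][l] * p_word_topic[l][j] for l in range(K))
                for k in range(K))
            for j in range(J)]

coords = [(0, 0), (0, 1), (1, 0), (1, 1)]      # topic positions on a 2x2 map
p_neigh = neighborhood(coords, sigma=1.0)
p_topic_doc = [0.7, 0.1, 0.1, 0.1]             # P(T_k | D_i) for one document
p_word_topic = [[0.5, 0.3, 0.2],               # P(w_j | T_l), one row per topic
                [0.2, 0.5, 0.3],
                [0.3, 0.2, 0.5],
                [0.1, 0.1, 0.8]]
p_word_doc = word_given_doc(p_topic_doc, p_neigh, p_word_topic)
```

Topics that are close on the map share probability mass, which is what ties the word distributions of neighboring clusters together.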

Hierarchical Document Organization (cont.)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k) \, P(w_j \mid T_l) \right] \right\}$$

  - EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i) \, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'}) \, P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i) \, P(T_k \mid w_j, D_i)}{c(D_i)}$$

    where

$$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l) \, P(T_l \mid T_k) \right] P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l=1}^{K} P(w_j \mid T_l) \, P(T_l \mid T_{k'}) \right] P(T_{k'} \mid D_i) \right\}}$$

Hierarchical Document Organization (cont.)

• Criterion for topic word selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i) \, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}$$
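The selection score compares how much of a word's count mass falls inside a topic with how much falls outside it. A minimal sketch with made-up counts and topic posteriors (the function name and the handling of a zero denominator are assumptions):

```python
def topic_word_score(counts, p_topic_doc, j, k):
    """S(w_j, T_k): count mass of word j inside topic k divided by the mass outside it."""
    docs = range(len(counts))
    inside = sum(counts[i][j] * p_topic_doc[i][k] for i in docs)
    outside = sum(counts[i][j] * (1.0 - p_topic_doc[i][k]) for i in docs)
    return inside / outside if outside > 0 else float("inf")

counts = [[2, 0],            # c(w_j, D_i): rows are documents, columns are words
          [1, 3]]
p_topic_doc = [[0.8, 0.2],   # P(T_k | D_i): rows are documents, columns are topics
               [0.5, 0.5]]
score = topic_word_score(counts, p_topic_doc, j=0, k=0)
```

Words with the highest score for a topic are the natural candidates for labeling that topic on the map.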

Hierarchical Document Organization (cont.)

• Example [figure not recovered]

Hierarchical Document Organization (cont.)

• Example (cont.) [figure not recovered]

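Hierarchical Document Organization (cont.)

• Self-Organization Map (SOM)
  - A recursive regression process

$$m_i(t+1) = m_i(t) + h_{c(x),i}(t) \left[ x(t) - m_i(t) \right]$$

$$c(x) = \arg\min_i \lVert x - m_i \rVert, \quad \text{where } \lVert x - m_i \rVert = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

$$h_{c(x),i}(t) = \alpha(t) \exp\left( -\frac{\lVert r_i - r_{c(x)} \rVert^2}{2\sigma^2(t)} \right)$$

[Figure: the input layer holds the input vector $x = [x_1, x_2, \dots, x_n]^T$; each unit $i$ of the mapping layer holds a weight vector $m_i = [m_{i,1}, m_{i,2}, \dots, m_{i,n}]^T$, e.g. $m_1 = [m_{1,1}, m_{1,2}, \dots, m_{1,n}]^T$; the distance $\lVert x - m_i \rVert$ between the input and a unit is marked]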
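One SOM update can be sketched directly from the three equations: find the winning unit $c(x)$, compute the neighborhood value for every unit, and move each weight vector toward the input. Here `alpha` and `sigma` are passed as fixed numbers for a single step, whereas the slide treats them as functions of $t$; the function name and toy values are assumptions.

```python
import math

def som_step(weights, positions, x, alpha, sigma):
    """One SOM update: returns the winner index c(x) and the updated weight vectors."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    # Winner: the unit whose weight vector is closest to the input
    c = min(range(len(weights)), key=lambda i: sqdist(x, weights[i]))
    updated = []
    for i, m in enumerate(weights):
        # Neighborhood value decays with the map distance between unit i and the winner
        h = alpha * math.exp(-sqdist(positions[i], positions[c]) / (2 * sigma ** 2))
        updated.append([mi + h * (xi - mi) for mi, xi in zip(m, x)])
    return c, updated

weights = [[0.0, 0.0], [1.0, 1.0]]     # weight vectors m_i
positions = [(0, 0), (0, 1)]           # map coordinates r_i of the units
winner, new_weights = som_step(weights, positions, [0.9, 0.9], alpha=0.5, sigma=1.0)
```

The winner moves halfway toward the input, while its neighbor moves by a smaller, distance-discounted amount.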

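Hierarchical Document Organization (cont.)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           19165
TMM     20           20650
TMM     30           19477
TMM     40           19175
SOM     100          20604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

$$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$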
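The distance ratio can be computed with a short sketch over all document pairs. The map coordinates are made up, and reading $T_{r,i}$ as a reference topic label for document $i$ is an assumption, since the slide does not spell it out.

```python
import math

def dist_ratio(coords, topics):
    """Return (dist_between, dist_within, R_dist) over all document pairs i < j."""
    between = within = 0.0
    n_between = n_within = 0
    D = len(coords)
    for i in range(D):
        for j in range(i + 1, D):
            d = math.dist(coords[i], coords[j])   # dist_Map(i, j)
            if topics[i] != topics[j]:
                between += d
                n_between += 1
            else:
                within += d
                n_within += 1
    d_between = between / n_between
    d_within = within / n_within
    return d_between, d_within, d_between / d_within

coords = [(0, 0), (0, 1), (3, 0), (3, 1)]   # map positions of four documents
topics = [0, 0, 1, 1]                       # assumed reference topic labels T_r
d_between, d_within, r_dist = dist_ratio(coords, topics)
```

A larger ratio means that documents of different topics lie farther apart on the map than documents of the same topic.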

IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 26: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

The K-means Algorithm

• Also called Linde-Buzo-Gray (LBG) in signal processing
  – A hard clustering algorithm
  – Define clusters by the center of mass of their members

• The K-means algorithm can also be regarded as
  – A kind of vector quantization
    • Map from a continuous space (high resolution) to a discrete space (low resolution)
  – E.g., color quantization
    • 24 bits/pixel (16 million colors) → 8 bits/pixel (256 colors)
    • A compression rate of 3

$$\mathcal{X} = \{x^t\}_{t=1}^{N} \;\longrightarrow\; F = \{m_j\}_{j=1}^{k}$$

where $j$ is the index and each $m_j$ is a reference vector (also called code vector, code word, or cluster centroid). For the color example, $\mathrm{Dim}(x^t) = 24$ → $k = 2^8$.
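The mapping from a sample to the index of its closest code word can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name `nearest_code_word` and the four-entry RGB codebook are assumptions made for the example.

```python
def nearest_code_word(x, codebook):
    """Return the index j of the code word closest to x (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda j: sq_dist(x, codebook[j]))

# Hypothetical 4-entry RGB codebook (k = 4, so an index needs only 2 bits).
codebook = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (255, 255, 255)]
pixel = (250, 10, 5)              # a 24-bit color
index = nearest_code_word(pixel, codebook)
reconstructed = codebook[index]   # what a decoder would display
compression_rate = 24 / 8         # the 24-bit -> 8-bit case from the slide
```

Only the index is stored or transmitted; the decoder looks the color up in the shared codebook.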

The K-means Algorithm (cont)

Total reconstruction error:

$$E\big(\{m_i\}_{i=1}^{k} \mid \mathcal{X}\big) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \,\lVert x^t - m_i \rVert^2$$

where the label $b_i^t$ is

$$b_i^t = \begin{cases} 1 & \text{if } \lVert x^t - m_i \rVert = \min_j \lVert x^t - m_j \rVert \\ 0 & \text{otherwise} \end{cases}$$

  – $m_i$ and $b_i^t$ are unknown
  – $b_i^t$ depends on $m_i$, and this optimization problem cannot be solved analytically
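The total reconstruction error can be evaluated directly once the centers are fixed, because each label simply selects the nearest center. The sketch below assumes samples and centers are tuples of floats; the helper names are illustrative.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def reconstruction_error(samples, centers):
    """E = sum_t min_i ||x^t - m_i||^2 (the label b_i^t picks the nearest center)."""
    return sum(min(sq_dist(x, m) for m in centers) for x in samples)

samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
error = reconstruction_error(samples, centers)   # each sample is at distance 1 from its center
```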

The K-means Algorithm (cont)

• Initialization
  – A set of initial cluster centers $\{m_i\}_{i=1}^{k}$ is needed

• Recursion
  – Assign each object $x^t$ to the cluster whose center is closest

$$b_i^t = \begin{cases} 1 & \text{if } \lVert x^t - m_i \rVert = \min_j \lVert x^t - m_j \rVert \\ 0 & \text{otherwise} \end{cases}$$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$m_i = \frac{\sum_{t=1}^{N} b_i^t \, x^t}{\sum_{t=1}^{N} b_i^t}$$

    • Or use the medoid as the cluster center (a medoid is one of the objects in the cluster)

These two steps are repeated until $m_i$ stabilizes.
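The two alternating steps can be sketched in a few lines of plain Python. This is a minimal illustration under some assumptions that the slides do not fix: ties go to the lowest-index center, an empty cluster keeps its old center, and `max_iter` caps the loop.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(samples, centers, max_iter=100):
    """Alternate the assignment and re-estimation steps until the centers stop moving."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: labels[t] is the index i of the closest center (b_i^t = 1).
        labels = [min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
                  for x in samples]
        # Re-estimation step: each center becomes the mean of its members.
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, lab in zip(samples, labels) if lab == i]
            new_centers.append(mean(members) if members else old)  # empty cluster: keep old center
        if new_centers == centers:      # centers have stabilized
            break
        centers = new_centers
    return centers, labels

samples = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, labels = k_means(samples, centers=[samples[0], samples[3]])
```

Here the initial centers are simply two of the samples; any other seeding strategy can be passed in through the `centers` argument.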

The K-means Algorithm (cont)

• Algorithm

The K-means Algorithm (cont)

• Example 1

The K-means Algorithm (cont)

• Example 2
  [Figure labels: government, finance, sports, research, name]

The K-means Algorithm (cont)

• The choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean $m$ of all data and generate the $k$ initial centers $m_i$ by adding a small random vector $\delta$ to the mean ($m \pm \delta$)
  – Project the data onto the principal component (first eigenvector), divide its range into $k$ equal intervals, and take the mean of the data in each group as the initial center $m_i$
  – Or use another method, such as running a hierarchical clustering algorithm on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering
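The second seeding strategy (global mean plus a small random vector) might look like the sketch below. The perturbation is drawn uniformly from `[-scale, scale]` per coordinate, which is an assumption; the slides only ask for a "small random vector". The fixed random seed is there only to make the example reproducible.

```python
import random

def seeds_around_mean(samples, k, scale=0.01, rng=None):
    """Generate k initial centers by adding small random vectors to the global mean."""
    rng = rng or random.Random(0)   # fixed seed so the sketch is reproducible
    n = len(samples)
    global_mean = [sum(coords) / n for coords in zip(*samples)]
    return [tuple(m + rng.uniform(-scale, scale) for m in global_mean)
            for _ in range(k)]

samples = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
seeds = seeds_around_mean(samples, k=3)
```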

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly

• Applications of the K-means Algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an $l$-dimensional space/hypercube, with $l = \log_2 k$ ($k$ clusters)
    [Figure labels: nodes on the hypercube; a linear classifier]
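One way to place $k$ clusters on the corners of an $l$-dimensional hypercube is to use the binary expansion of the cluster index. The slides state only that $l = \log_2 k$; the particular bit ordering below, and rounding $l$ up when $k$ is not a power of two, are assumptions of this sketch.

```python
import math

def hypercube_code(cluster_index, k):
    """Encode a cluster index as a corner of the l-dimensional hypercube, l = ceil(log2 k)."""
    l = max(1, math.ceil(math.log2(k)))
    # Most significant bit first; each coordinate is 0 or 1.
    return tuple((cluster_index >> bit) & 1 for bit in reversed(range(l)))

codes = [hypercube_code(i, k=8) for i in range(8)]   # l = 3
```

The resulting 0/1 vectors can then serve as low-dimensional inputs to a later classifier or regressor.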

The K-means Algorithm (cont)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray
  – M → 2M at each iteration

[Figure: the global mean is split into a Cluster 1 mean and a Cluster 2 mean, which are split again into clusters with parameters {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}]
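The doubling scheme (M → 2M) can be sketched for one-dimensional data as below. This is a simplified illustration: splitting each center into `m - delta` and `m + delta`, and running a fixed number of K-means refinement passes after each split, are assumptions rather than details given on the slide.

```python
def lbg(samples, target_size, delta=0.01, refine_iters=20):
    """Grow a 1-D codebook 1 -> 2 -> 4 ... by splitting every center into m - delta and m + delta."""
    codebook = [sum(samples) / len(samples)]          # start from the global mean
    while len(codebook) < target_size:
        codebook = [m + s for m in codebook for s in (-delta, delta)]   # M -> 2M
        for _ in range(refine_iters):                  # K-means refinement of the split codebook
            groups = [[] for _ in codebook]
            for x in samples:
                j = min(range(len(codebook)), key=lambda i: (x - codebook[i]) ** 2)
                groups[j].append(x)
            codebook = [sum(g) / len(g) if g else m for g, m in zip(groups, codebook)]
    return sorted(codebook)

samples = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4, 10.0, 10.2, 10.4, 15.0, 15.2, 15.4]
codebook = lbg(samples, target_size=4)
```

Because the codebook doubles each round, the target size is reached exactly only when it is a power of two.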

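The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

A mixture Gaussian HMM (or a mixture of Gaussians): component $k$ has prior $\pi_k = P(c_k)$ and density $P(\vec{x}_i \mid c_k)$, for $k = 1, 2, \dots, K$, and the weighted component outputs are summed:

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

Continuous case, where $m$ is the dimension of $\vec{x}_i$:

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\,\lvert \Sigma_k \rvert^{1/2}} \exp\!\Big( -\tfrac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \Big)$$

Likelihood function for the data samples $\mathbf{X} = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.):

$$P(\mathbf{X} \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

Classification:

$$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$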
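The mixture density and the classification rule can be sketched for the one-dimensional case as follows. The parameter values are made up for illustration, and a scalar variance replaces the covariance matrix.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, priors, mus, variances):
    """P(x | Theta) = sum_k P(x | c_k) P(c_k)."""
    return sum(p * gaussian_pdf(x, mu, var) for p, mu, var in zip(priors, mus, variances))

def classify(x, priors, mus, variances):
    """Pick the k that maximizes P(x | c_k) P(c_k); the shared denominator P(x) is dropped."""
    scores = [p * gaussian_pdf(x, mu, var) for p, mu, var in zip(priors, mus, variances)]
    return max(range(len(scores)), key=scores.__getitem__)

priors, mus, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]   # hypothetical parameters
label_a = classify(0.5, priors, mus, variances)
label_b = classify(3.0, priors, mus, variances)
density = mixture_density(2.0, priors, mus, variances)
```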

The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B | S1) = 2/4 = 0.5

P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

Observation   State S1   State S2
B             0.7        0.3
W             0.4        0.6
B             0.9        0.1
W             0.5        0.5

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64

P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36

P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 = 0.27

P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 = 0.73
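The soft-assignment estimates are ratios of fractional counts. The sketch below reproduces the arithmetic, assuming the four samples are observed as B, W, B, W (the order implied by which weights appear in each numerator).

```python
def soft_emission_estimates(observations, posteriors):
    """posteriors[i][s] is the weight of sample i in state s; returns P(symbol | state)."""
    num_states = len(posteriors[0])
    estimates = []
    for s in range(num_states):
        total = sum(p[s] for p in posteriors)
        counts = {}
        for obs, p in zip(observations, posteriors):
            counts[obs] = counts.get(obs, 0.0) + p[s]     # fractional count
        estimates.append({sym: c / total for sym, c in counts.items()})
    return estimates

observations = ["B", "W", "B", "W"]
posteriors = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]
estimates = soft_emission_estimates(observations, posteriors)
```

With hard assignment every weight is 0 or 1, and the same function reduces to ordinary relative-frequency counting.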

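The EM Algorithm (cont)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(\mathbf{X} \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \Big[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \Big] \\
&= \big[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \dots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \big] \times \cdots \\
&\quad \times \big[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \dots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \big] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \dots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{\mathbf{C}} P(\mathbf{X}, \mathbf{C} \mid \Theta)
\end{aligned}
$$

where $\mathbf{X} = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$ and $\mathbf{C} = \{c_{k_1}, c_{k_2}, \dots, c_{k_n}\}$. $P(\mathbf{X} \mid \Theta)$ is the likelihood function and $P(\mathbf{X}, \mathbf{C} \mid \Theta)$ is the complete data likelihood function.

How many kinds of $\mathbf{C}$ are there? $K^n$ kinds.

Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t,k} = (a_{1,1} + a_{1,2} + \dots + a_{1,M})(a_{2,1} + a_{2,2} + \dots + a_{2,M}) \cdots (a_{T,1} + a_{T,2} + \dots + a_{T,M}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t,k_t}$$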
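The product-of-sums identity in the note, and the count of $M^T$ alignments, can be checked by brute force on a small table. The values in `a` are arbitrary numbers chosen for this sketch.

```python
import itertools
import math

def product_of_sums(a):
    """prod_t sum_k a[t][k]"""
    return math.prod(sum(row) for row in a)

def sum_over_alignments(a):
    """Sum over every (k_1, ..., k_T) of prod_t a[t][k_t]; there are M**T alignments."""
    T, M = len(a), len(a[0])
    total, count = 0.0, 0
    for ks in itertools.product(range(M), repeat=T):
        total += math.prod(a[t][ks[t]] for t in range(T))
        count += 1
    return total, count

a = [[0.1, 0.4, 0.7], [0.3, 0.3, 0.9], [0.2, 0.7, 0.1], [0.6, 0.2, 0.5]]   # T = 4, M = 3
lhs = product_of_sums(a)
rhs, num_alignments = sum_over_alignments(a)
```

The exponential number of alignments is why EM works with per-sample posteriors instead of enumerating $\mathbf{C}$.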

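The EM Algorithm (cont)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $\mathbf{C}$, conditioned on the known data $(\mathbf{X}, \Theta)$

$$\Phi(\Theta, \hat{\Theta}) = E[L_{CM}] = E_{\mathbf{C}}\big[ \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) \,\big|\, \mathbf{X}, \Theta \big] = \sum_{\mathbf{C}} P(\mathbf{C} \mid \mathbf{X}, \Theta) \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) = \sum_{\mathbf{C}} \frac{P(\mathbf{X}, \mathbf{C} \mid \Theta)}{P(\mathbf{X} \mid \Theta)} \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta})$$

    Here $\Theta$ is known and $\hat{\Theta}$ is unknown.

  – Maximize the log likelihood function $\log P(\mathbf{X} \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(\mathbf{X} \mid \hat{\Theta}) - \log P(\mathbf{X} \mid \Theta)$$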
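Why raising the auxiliary function also raises the log likelihood can be seen from Jensen's inequality. The derivation below is the standard argument, added here as a sketch; it uses only the definition of $\Phi$ and the identity $P(\mathbf{C} \mid \mathbf{X}, \Theta) = P(\mathbf{X}, \mathbf{C} \mid \Theta) / P(\mathbf{X} \mid \Theta)$.

```latex
\begin{aligned}
\log P(\mathbf{X} \mid \hat{\Theta}) - \log P(\mathbf{X} \mid \Theta)
  &= \log \sum_{\mathbf{C}} \frac{P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta})}{P(\mathbf{X} \mid \Theta)} \\
  &= \log \sum_{\mathbf{C}} P(\mathbf{C} \mid \mathbf{X}, \Theta)\,
       \frac{P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta})}{P(\mathbf{X}, \mathbf{C} \mid \Theta)} \\
  &\ge \sum_{\mathbf{C}} P(\mathbf{C} \mid \mathbf{X}, \Theta)\,
       \log \frac{P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta})}{P(\mathbf{X}, \mathbf{C} \mid \Theta)}
       \qquad \text{(Jensen, since } \log \text{ is concave)} \\
  &= \Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta)
\end{aligned}
```

So any $\hat{\Theta}$ with $\Phi(\Theta, \hat{\Theta}) \ge \Phi(\Theta, \Theta)$ cannot decrease the log likelihood.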

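The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{\mathbf{C}} \frac{P(\mathbf{X}, \mathbf{C} \mid \Theta)}{P(\mathbf{X} \mid \Theta)} \log P(\mathbf{X}, \mathbf{C} \mid \hat{\Theta}) \\
&= \sum_{\mathbf{C}} \Big[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \Big] \log \Big[ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \Big] \\
&= \sum_{\mathbf{C}} \Big[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Big] \Big[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \Big] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, k_i} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \Big\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Big\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see the next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \big[ P(c_k \mid \hat{\Theta})\, P(\vec{x}_i \mid c_k, \hat{\Theta}) \big] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

where

$$\delta_{k, k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}$$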

The EM Algorithm (cont)

  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \Big[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Big] P(c_k \mid \vec{x}_i, \Theta) \\
&= \Big[ \prod_{j=1, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \Big] P(c_k \mid \vec{x}_i, \Theta) \\
&= \Big[ \prod_{j=1, j \ne i}^{n} 1 \Big] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

The factor $\delta_{k, k_i}$ means that $\vec{x}_i$ can only be aligned to $c_k$.

Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t,k} = (a_{1,1} + a_{1,2} + \dots + a_{1,M})(a_{2,1} + a_{2,2} + \dots + a_{2,M}) \cdots (a_{T,1} + a_{T,2} + \dots + a_{T,M}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t,k_t}$$
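The collapse of the constrained sum to a single posterior can be verified numerically on a tiny example. The posterior table below is hypothetical; each row sums to 1, which is what makes the other factors drop out.

```python
import itertools
import math

def constrained_sum(post, i, k):
    """Sum over all alignments (k_1..k_n) with k_i fixed to k of prod_j post[j][k_j]."""
    n, K = len(post), len(post[0])
    total = 0.0
    for ks in itertools.product(range(K), repeat=n):
        if ks[i] == k:                       # the Kronecker delta keeps only k_i = k
            total += math.prod(post[j][ks[j]] for j in range(n))
    return total

# Hypothetical posteriors P(c_k | x_j, Theta); each row sums to 1.
post = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.3, 0.4, 0.3]]
value = constrained_sum(post, i=1, k=2)
```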

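The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

is the auxiliary function for mixture weights, and

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

is the auxiliary function for cluster distributions.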

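The EM Algorithm (cont)

• M-step (Maximization)
  – Remember that
    • Maximize a function $F$ with a constraint by applying a Lagrange multiplier

Suppose that

$$F = \sum_{j=1}^{N} w_j \log y_j, \qquad \text{with the constraint } \sum_{j=1}^{N} y_j = 1$$

By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \Big( \sum_{j=1}^{N} y_j - 1 \Big) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
\sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$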
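The closed-form solution $y_j = w_j / \sum_{j'} w_{j'}$ can be sanity-checked against random points on the simplex. This is only a numerical spot check with made-up weights, not a proof.

```python
import math
import random

def objective(w, y):
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

def closed_form_maximizer(w):
    total = sum(w)
    return [wj / total for wj in w]

rng = random.Random(1)
w = [2.0, 5.0, 3.0]                          # hypothetical non-negative weights
y_star = closed_form_maximizer(w)
best = objective(w, y_star)

# Compare against random points on the simplex (y_j > 0, sum_j y_j = 1).
beaten = False
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in w]
    y = [r / sum(raw) for r in raw]
    if objective(w, y) > best + 1e-12:
        beaten = True
```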

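The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\hat{\Phi}_a(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \ell \Big( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \Big)
= \sum_{k=1}^{K} \underbrace{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \Big( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \Big)
$$

$$
\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle \sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

Each fraction is the expected number of times that $\vec{x}_i$ falls in class $c_k$.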
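The posterior fractions and the resulting prior update can be sketched for one-dimensional Gaussians as follows. The data and starting parameters are made up; the function names are illustrative.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities(xs, priors, mus, variances):
    """w[i][k] = P(x_i|c_k)P(c_k) / sum_l P(x_i|c_l)P(c_l), under the current Theta."""
    table = []
    for x in xs:
        joint = [p * gaussian_pdf(x, mu, var) for p, mu, var in zip(priors, mus, variances)]
        total = sum(joint)
        table.append([j / total for j in joint])
    return table

def update_priors(w):
    """New prior of class k = (1/n) * sum_i w[i][k]."""
    n = len(w)
    return [sum(row[k] for row in w) / n for k in range(len(w[0]))]

xs = [-0.2, 0.1, 0.3, 3.9, 4.2]
w = responsibilities(xs, priors=[0.5, 0.5], mus=[0.0, 4.0], variances=[1.0, 1.0])
new_priors = update_priors(w)
```

Three of the five samples sit near the first component, so its updated prior comes out close to 0.6.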

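The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

$$P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\,\lvert \hat{\Sigma}_k \rvert^{1/2}} \exp\!\Big( -\tfrac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \Big)$$

Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

and

$$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

Then

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Big[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \Big] + D$$

where $D$ is a constant.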

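The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Big[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \Big] + D$$

$$\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0$$

$$\Rightarrow\; \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}$$

Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\,\vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here.

$w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.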
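The mean update is a responsibility-weighted average of the samples. A minimal sketch, with a hypothetical responsibility table `w` (rows are samples, columns are classes):

```python
def update_mean(xs, w, k):
    """mu_k = sum_i w[i][k] * x_i / sum_i w[i][k], applied per coordinate."""
    total = sum(row[k] for row in w)
    dim = len(xs[0])
    return tuple(sum(row[k] * x[d] for row, x in zip(w, xs)) / total for d in range(dim))

xs = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
w = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]      # hypothetical responsibilities; rows sum to 1
mu_0 = update_mean(xs, w, k=0)
mu_1 = update_mean(xs, w, k=1)
```

With 0/1 responsibilities this reduces to the K-means centroid update.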

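The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Big[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \Big] + D$$

Notes: $\dfrac{d \det(X)}{dX} = \det(X) \cdot X^{-T}$ and, for symmetric $X$, $\dfrac{d(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1} \vec{a}\, \vec{b}^T X^{-1}$; $\hat{\Sigma}_k$ is symmetric here.

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \Big[ \frac{1}{\lvert \hat{\Sigma}_k \rvert} \cdot \lvert \hat{\Sigma}_k \rvert \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \Big] = 0 \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow\; \hat{\Sigma}_k \sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
&\Rightarrow\; \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$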
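The covariance update is a responsibility-weighted average of outer products. The sketch below uses nested lists for the matrix and a hypothetical single-class responsibility table so the result is easy to check by hand.

```python
def update_covariance(xs, w, k, mu):
    """Sigma_k = sum_i w[i][k] (x_i - mu)(x_i - mu)^T / sum_i w[i][k]."""
    total = sum(row[k] for row in w)
    dim = len(mu)
    cov = [[0.0] * dim for _ in range(dim)]
    for row, x in zip(w, xs):
        diff = [x[d] - mu[d] for d in range(dim)]
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += row[k] * diff[a] * diff[b]   # weighted outer product
    return [[value / total for value in line] for line in cov]

xs = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
w = [[1.0], [1.0], [1.0], [1.0]]     # a single class owning every sample
sigma = update_covariance(xs, w, k=0, mu=(1.0, 1.0))
```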

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(\mathbf{X} \mid \Theta)$ has converged or the maximum number of iterations is reached
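Putting the E-step and M-step together with both stopping rules gives a loop like the one below. It is a sketch for one-dimensional data only; the tolerance `tol`, the variance floor of `1e-6`, and the initial values (standing in for what K-means would supply) are assumptions made for this example.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, priors, mus, variances, max_iter=200, tol=1e-8):
    """EM for a 1-D Gaussian mixture; returns the parameters and the log-likelihood trace."""
    n, K = len(xs), len(mus)
    trace = []
    for _ in range(max_iter):
        # E-step: responsibilities w[i][k] and the log-likelihood under the current parameters.
        w, log_lik = [], 0.0
        for x in xs:
            joint = [priors[k] * gaussian_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(joint)
            log_lik += math.log(total)
            w.append([j / total for j in joint])
        trace.append(log_lik)
        if len(trace) > 1 and abs(trace[-1] - trace[-2]) < tol:
            break                                    # likelihood has converged
        # M-step: closed-form updates for priors, means, and variances.
        for k in range(K):
            nk = sum(row[k] for row in w)
            priors[k] = nk / n
            mus[k] = sum(row[k] * x for row, x in zip(w, xs)) / nk
            var = sum(row[k] * (x - mus[k]) ** 2 for row, x in zip(w, xs)) / nk
            variances[k] = max(var, 1e-6)            # assumed floor to avoid a collapsing component
    return priors, mus, variances, trace

xs = [-0.3, 0.1, 0.4, -0.1, 5.8, 6.1, 6.3, 5.9]
# Initial values of the kind K-means might supply (hypothetical numbers).
priors, mus, variances, trace = em_gmm_1d(xs, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
```

The log-likelihood trace should never decrease from one iteration to the next, which makes it a convenient check that an implementation is correct.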

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

[Figure: Two-dimensional Tree Structure for Organized Topics]

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map
• Related documents are in the same cluster, and the relationships among the clusters correspond to the distance on the map
• When a cluster has many documents, we can further analyze it into another map on the next layer

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \Big[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \Big]$$

$$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}$$

$$E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Big[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \Big]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
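The neighborhood distribution $P(T_l \mid Y_k)$ depends only on the map coordinates of the topics and on $\sigma$. A minimal sketch, assuming a hypothetical 2 x 2 grid of topics:

```python
import math

def topic_neighbor_probs(coords, sigma):
    """P(T_l | Y_k) for every pair (k, l), from Gaussian-weighted map distances."""
    def e(k, l):
        (xk, yk), (xl, yl) = coords[k], coords[l]
        dist_sq = (xk - xl) ** 2 + (yk - yl) ** 2
        return math.exp(-dist_sq / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)
    K = len(coords)
    table = []
    for k in range(K):
        row = [e(k, l) for l in range(K)]
        total = sum(row)
        table.append([value / total for value in row])   # normalize over l
    return table

# Hypothetical 2 x 2 topic map.
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
probs = topic_neighbor_probs(coords, sigma=1.0)
```

A smaller `sigma` concentrates each row on the topic itself, while a larger one spreads probability to neighboring topics on the map.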

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \Big\{ \sum_{k=1}^{K} P(T_k \mid D_i) \Big[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \Big] \Big\}
\end{aligned}
$$

  – EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

where

$$P(T_k \mid w_j, D_i) = \frac{\Big[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \Big] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \Big\{ \Big[ \sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid Y_{k'}) \Big] \cdot P(T_{k'} \mid D_i) \Big\}}$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'})\, [1 - P(T_k \mid D_{i'})]}$$
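The selection score compares how much of a word's count falls inside a topic with how much falls outside it. A minimal sketch with made-up counts and topic posteriors:

```python
def topic_word_score(counts_j, topic_probs_k):
    """S(w_j, T_k): counts_j[i] = c(w_j, D_i), topic_probs_k[i] = P(T_k | D_i)."""
    inside = sum(c * p for c, p in zip(counts_j, topic_probs_k))
    outside = sum(c * (1.0 - p) for c, p in zip(counts_j, topic_probs_k))
    return inside / outside

# Hypothetical counts of one word in four documents, and P(T_k | D_i) for one topic.
counts = [5, 3, 0, 1]
topic_probs = [0.9, 0.8, 0.1, 0.2]
score = topic_word_score(counts, topic_probs)
```

A score well above 1 suggests the word occurs mostly in documents dominated by the topic. The sketch does not guard against a zero denominator, which would occur if every document containing the word had $P(T_k \mid D_i) = 1$.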

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  – A recursive regression process

[Figure: Input Layer and Mapping Layer]

Input vector: $x = [x_1, x_2, \dots, x_n]^T$

Weight vector: $m_i = [m_{i,1}, m_{i,2}, \dots, m_{i,n}]^T$ (e.g., $m_1 = [m_{1,1}, m_{1,2}, \dots, m_{1,n}]^T$)

$$m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,[x(t) - m_i(t)]$$

where

$$c(x) = \arg\min_i \lVert x - m_i \rVert, \qquad \lVert x - m_i \rVert = \sqrt{\sum_{d=1}^{n} (x_d - m_{i,d})^2}$$

$$h_{c(x),i}(t) = \alpha(t) \exp\!\Big( -\frac{\lVert r_i - r_{c(x)} \rVert^2}{2\sigma^2(t)} \Big)$$
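A single SOM update can be sketched as follows: find the winning unit, then move every unit toward the input by an amount that decays with map distance from the winner. The map layout, the weights, and the fixed values of `alpha` and `sigma` (which would normally shrink with $t$) are assumptions of this example.

```python
import math

def som_step(weights, positions, x, alpha, sigma):
    """One SOM update: find the winner c(x), then pull every unit toward x by h_{c(x),i}."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    winner = min(range(len(weights)), key=lambda i: sq_dist(x, weights[i]))
    updated = []
    for m_i, r_i in zip(weights, positions):
        h = alpha * math.exp(-sq_dist(r_i, positions[winner]) / (2.0 * sigma ** 2))
        updated.append(tuple(m + h * (xi - m) for m, xi in zip(m_i, x)))
    return winner, updated

# Hypothetical 1 x 3 map with 2-D weight vectors.
positions = [(0, 0), (0, 1), (0, 2)]
weights = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
winner, new_weights = som_step(weights, positions, x=(1.0, 0.8), alpha=0.5, sigma=1.0)
```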

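Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

$$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases} \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases} \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$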
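The distance ratio can be computed directly from map coordinates and topic labels by averaging pairwise distances over different-topic and same-topic pairs. The four documents below are hypothetical, and the sketch assumes both kinds of pair exist so neither average divides by zero.

```python
import math
from itertools import combinations

def distance_ratio(positions, topics):
    """R_Dist = mean map distance between different-topic pairs / same-topic pairs."""
    between, within = [], []
    for i, j in combinations(range(len(positions)), 2):      # all pairs with j > i
        d = math.dist(positions[i], positions[j])
        (within if topics[i] == topics[j] else between).append(d)
    return (sum(between) / len(between)) / (sum(within) / len(within))

# Hypothetical map coordinates and reference topics for four documents.
positions = [(0, 0), (0, 1), (3, 0), (3, 1)]
topics = ["a", "a", "b", "b"]
ratio = distance_ratio(positions, topics)
```

A larger ratio means documents of different topics are placed farther apart on the map relative to documents of the same topic.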


Page 27: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 27

The K-means Algorithm (cont)

- $m_i$ and $b_i^t$ are unknown
- $b_i^t$ depends on $m_i$, and this optimization problem cannot be solved analytically

Total reconstruction error:

\[
E\left(\{m_i\}_{i=1}^{k} \mid X\right) = \sum_{t=1}^{N} \sum_{i=1}^{k} b_i^t \,\lVert x^t - m_i \rVert^2
\]

where the label $b_i^t$ is

\[
b_i^t =
\begin{cases}
1 & \text{if } \lVert x^t - m_i \rVert = \min_j \lVert x^t - m_j \rVert \\
0 & \text{otherwise}
\end{cases}
\]

The K-means Algorithm (cont)

• Initialization
  - A set of initial cluster centers $\{m_i\}_{i=1}^{k}$ is needed

• Recursion
  - Assign each object $x^t$ to the cluster whose center is closest:

\[
b_i^t =
\begin{cases}
1 & \text{if } \lVert x^t - m_i \rVert = \min_j \lVert x^t - m_j \rVert \\
0 & \text{otherwise}
\end{cases}
\]

  - Then re-compute the center of each cluster as the centroid, or mean (average), of its members:

\[
m_i = \frac{\sum_{t=1}^{N} b_i^t \, x^t}{\sum_{t=1}^{N} b_i^t}
\]

    • Alternatively, use the medoid as the cluster center (a medoid is one of the objects in the cluster)

These two steps are repeated until $m_i$ stabilizes
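The two-step recursion above can be sketched in a few lines of Python. This is only a minimal illustration, assuming squared Euclidean distance, random seeds drawn from the data, and a rule (not stated on the slides) that an empty cluster keeps its old center; the names `kmeans` and `sq_dist` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initialization: pick k distinct objects as the initial centers m_i.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: b_i^t = 1 for the closest center, 0 otherwise.
        new_labels = [min(range(k), key=lambda i: sq_dist(x, centers[i]))
                      for x in points]
        if new_labels == labels:      # labels stable, so the centers are too
            break
        labels = new_labels
        # Update step: m_i becomes the mean of the members of cluster i.
        for i in range(k):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:               # assumption: empty cluster keeps its center
                centers[i] = [sum(c) / len(members) for c in zip(*members)]
    # Total reconstruction error E = sum_t sum_i b_i^t ||x^t - m_i||^2.
    error = sum(sq_dist(x, centers[lab]) for x, lab in zip(points, labels))
    return centers, labels, error

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels, error = kmeans(pts, k=2)
```

On this toy data the two tight pairs end up in separate clusters. On harder data the result depends on the seeds, which is why the choice of initial centers matters.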

The K-means Algorithm (cont)

• Algorithm [figure not recoverable]

The K-means Algorithm (cont)

• Example 1 [figure not recoverable]

The K-means Algorithm (cont)

• Example 2 [figure not recoverable; surviving labels: government, finance, sports, research, name]

The K-means Algorithm (cont)

• The choice of initial cluster centers (seeds) is important
  - Pick them at random
  - Calculate the mean $m$ of all data and generate the $k$ initial centers $m_i$ by adding a small random vector to the mean: $m_i = m \pm \delta$
  - Project the data onto the principal component (first eigenvector), divide its range into $k$ equal intervals, and take the mean of the data in each group as the initial center $m_i$
  - Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  - Randomly assign the object to one of the candidate clusters
  - Or perturb the objects slightly

• Applications of the K-means algorithm
  - Clustering
  - Vector quantization
  - A preprocessing stage before classification or regression
    • Map from the original space to an $l$-dimensional space/hypercube, with $l = \log_2 k$ for $k$ clusters

[Figure: nodes on the hypercube feeding a linear classifier]

The K-means Algorithm (cont)

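• E.g., the LBG algorithm
  - By Linde, Buzo, and Gray
  - $M \rightarrow 2M$ at each iteration

[Figure: the global mean is split into the Cluster 1 mean and the Cluster 2 mean; the next split yields four clusters with parameters $\{\mu_{11}, \Sigma_{11}, \omega_{11}\}$, $\{\mu_{12}, \Sigma_{12}, \omega_{12}\}$, $\{\mu_{13}, \Sigma_{13}, \omega_{13}\}$, $\{\mu_{14}, \Sigma_{14}, \omega_{14}\}$]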

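The EM Algorithm

• A soft version of the K-means algorithm
  - Each object could be a member of multiple clusters
  - Clustering as estimating a mixture of (continuous) probability distributions

A mixture of Gaussians (or a mixture-Gaussian HMM) with mixture weights $\pi_k = P(c_k)$, $k = 1, \dots, K$:

\[
P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
\]

Continuous case ($m$ is the dimension of $\vec{x}_i$):

\[
P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{\sqrt{(2\pi)^m \lvert \Sigma_k \rvert}}
\exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)
\]

Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.):

\[
P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta)
= \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
\]

Classification:

\[
\max_k P(c_k \mid \vec{x}_i, \Theta)
= \max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)}
= \max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
\]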


Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B | S1) = 2/4 = 0.5

P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

Observation   State S1   State S2
B             0.7        0.3
W             0.4        0.6
B             0.9        0.1
W             0.5        0.5

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6 / 2.5 = 0.64

P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9 / 2.5 = 0.36

P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4 / 1.5 = 0.27

P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1 / 1.5 = 0.73
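The soft-assignment arithmetic can be checked with a short Python sketch. It assumes, as the numerators of the fractions suggest, that the first and third observations are B and the second and fourth are W; the function name `soft_emission_probs` is hypothetical.

```python
# Observed symbols and their soft state memberships (P(S1), P(S2)) per observation.
observations = ["B", "W", "B", "W"]
posteriors = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]

def soft_emission_probs(observations, posteriors):
    """Estimate P(symbol | state) from fractional (soft) counts."""
    n_states = len(posteriors[0])
    symbols = sorted(set(observations))
    probs = []
    for s in range(n_states):
        total = sum(p[s] for p in posteriors)   # expected occupancy of state s
        probs.append({
            sym: sum(p[s] for o, p in zip(observations, posteriors) if o == sym) / total
            for sym in symbols
        })
    return probs

probs = soft_emission_probs(observations, posteriors)
for state, dist in enumerate(probs, start=1):
    print(f"S{state}:", {sym: round(p, 2) for sym, p in dist.items()})
```

With hard assignment every membership is 0 or 1 and the same function reduces to ordinary relative-frequency counting.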

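The EM Algorithm (cont)

• E-step (Expectation)
  - Derive the complete data likelihood function

\[
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta)
 = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \dots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right]
 \times \dots \times
 \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \dots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \dots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
\]

where $X = \vec{x}_1, \vec{x}_2, \dots, \vec{x}_n$ and $C = c_{k_1}, c_{k_2}, \dots, c_{k_n}$, and $P(X, C \mid \Theta)$ is the complete data likelihood function.

Note:

\[
\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k}
= (a_{11} + a_{12} + \dots + a_{1M})(a_{21} + a_{22} + \dots + a_{2M}) \cdots (a_{T1} + a_{T2} + \dots + a_{TM})
= \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}
\]

How many kinds of $C$? ($K^n$ kinds)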
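The product-of-sums identity used in this derivation can be verified numerically by brute force. The sketch below is an illustration only; the matrix `a` holds arbitrary made-up values standing in for the terms $a_{tk}$, and the function names are hypothetical.

```python
from itertools import product
from math import prod, isclose

def product_of_sums(a):
    """Left-hand side: prod_t sum_k a[t][k]."""
    return prod(sum(row) for row in a)

def sum_over_alignments(a):
    """Right-hand side: sum over every alignment (k_1, ..., k_T) of prod_t a[t][k_t]."""
    T, M = len(a), len(a[0])
    return sum(
        prod(a[t][k_t] for t, k_t in enumerate(ks))
        for ks in product(range(M), repeat=T)
    )

# Arbitrary illustrative values: T = 3 samples, M = 3 clusters.
a = [[0.2, 0.5, 0.1],
     [0.3, 0.3, 0.6],
     [0.9, 0.4, 0.2]]

lhs = product_of_sums(a)
rhs = sum_over_alignments(a)
n_alignments = len(a[0]) ** len(a)   # M^T distinct alignments C
```

The right-hand side enumerates every possible alignment, so its cost grows exponentially with the number of samples; this is why the derivation goes on to factor the sum rather than evaluate it directly.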

The EM Algorithm (cont)

• E-step (Expectation)
  - Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$:

\[
\Phi(\Theta, \hat{\Theta}) = E\left[ L_{CM} \right]
= E_{C}\left[ \log P(X, C \mid \hat{\Theta}) \,\middle|\, X, \Theta \right]
= \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta})
= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})
\]

    Here $\Theta$ is known and $\hat{\Theta}$ is unknown.

  - Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$:

\[
\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)
\]

    • We have shown this property when deriving the HMM-based retrieval model

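The EM Algorithm (cont)

• E-step (Expectation)
  - The auxiliary function $\Phi(\Theta, \hat{\Theta})$

\[
\begin{aligned}
\Phi(\Theta, \hat{\Theta})
&= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \right]
   \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K}
   \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right]
   \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K}
   \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right]
   \left[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, k_i} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta})
   \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta})
   \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta})
 + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
\]

where

\[
\delta_{k, k_i} =
\begin{cases}
1 & \text{if } k = k_i \\
0 & \text{otherwise}
\end{cases}
\]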

The EM Algorithm (cont)

- Note that

\[
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= P(c_k \mid \vec{x}_i, \Theta) \prod_{j=1,\, j \ne i}^{n} \left[ \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \\
&= P(c_k \mid \vec{x}_i, \Theta) \cdot 1 \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
\]

  because $\delta_{k, k_i}$ means that $\vec{x}_i$ can only be aligned to $c_k$, while each remaining factor sums to one.

Note:

\[
\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k}
= (a_{11} + a_{12} + \dots + a_{1M})(a_{21} + a_{22} + \dots + a_{2M}) \cdots (a_{T1} + a_{T2} + \dots + a_{TM})
= \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}
\]

The EM Algorithm (cont)

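• E-step (Expectation)
  - The auxiliary function can also be divided into two parts:

\[
\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})
\]

where

\[
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta})
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
\]

is the auxiliary function for the mixture weights, and

\[
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta})
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
\]

is the auxiliary function for the cluster distributions.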

The EM Algorithm (cont)

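• M-step (Maximization)
  - Remember that
    • A function $F$ can be maximized under a constraint by applying a Lagrange multiplier

Suppose that

\[
F = \sum_{j=1}^{N} w_j \log y_j
\qquad \text{with the constraint} \qquad
\sum_{j=1}^{N} y_j = 1
\]

By applying a Lagrange multiplier $\ell$:

\[
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0
\;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
\sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j
\;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
\]

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$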
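The closed-form result of the Lagrange-multiplier argument can be sanity-checked numerically. The sketch below is illustrative only: the weights `w` are made up, and it compares the closed-form solution against random points on the probability simplex rather than proving optimality.

```python
import math
import random

def weighted_log_objective(w, y):
    """F = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

def lagrange_solution(w):
    """Closed-form maximizer under sum_j y_j = 1: y_j = w_j / sum_j w_j."""
    total = sum(w)
    return [wj / total for wj in w]

def beats_random_candidates(w, trials=1000, seed=0):
    """Check that no random point on the simplex scores higher than the solution."""
    rng = random.Random(seed)
    best = weighted_log_objective(w, lagrange_solution(w))
    for _ in range(trials):
        raw = [rng.random() + 1e-9 for _ in w]   # strictly positive entries
        y = [r / sum(raw) for r in raw]          # normalize onto the simplex
        if weighted_log_objective(w, y) > best + 1e-12:
            return False
    return True

w = [2.0, 1.0, 5.0]          # made-up non-negative weights
y_star = lagrange_solution(w)
ok = beats_random_candidates(w)
```

The same normalization pattern is what turns accumulated soft counts into a probability estimate in the M-step.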

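The EM Algorithm (cont)

• M-step (Maximization)
  - Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians)

\[
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta})
&= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k}
   \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k}
 + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
\]

\[
\Rightarrow\;
\hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
       {\displaystyle \sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
\]

The numerator $w_k$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.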

The EM Algorithm (cont)

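• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

\[
\Phi_b(\Theta, \hat{\Theta})
= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\]

with

\[
P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{\sqrt{(2\pi)^m \lvert \hat{\Sigma}_k \rvert}}
\exp\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)
\]

Let

\[
w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
\qquad \text{and} \qquad
\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log (2\pi) - \frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)
\]

Then

\[
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik}
\left[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
\]

where $D$ is a constant.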

The EM Algorithm (cont)

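• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

\[
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik}
\left[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
\]

\[
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k}
= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0
\]

\[
\Rightarrow\;
\hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}
       {\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\]

The denominator $\sum_{i} w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.

Note: $\dfrac{d (x^T C x)}{d x} = (C + C^T)\, x$, and $\Sigma_k^{-1}$ is symmetric here.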

The EM Algorithm (cont)

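• M-step (Maximization)
  - Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

\[
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik}
\left[ -\frac{1}{2} \log \lvert \hat{\Sigma}_k \rvert - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
\]

Notes ($\Sigma_k$ is symmetric here):

\[
\frac{d \det(X)}{d X} = \det(X) \cdot (X^{-1})^T,
\qquad
\frac{d (a^T X^{-1} b)}{d X} = -X^{-1} a\, b^T X^{-1}
\]

\[
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k}
&= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \lvert \hat{\Sigma}_k \rvert^{-1} \cdot \lvert \hat{\Sigma}_k \rvert \cdot \hat{\Sigma}_k^{-1}
 - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
&\Rightarrow \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1}
 = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k
 = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k
 = \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
&\Rightarrow \hat{\Sigma}_k
 = \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}
        {\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
\]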

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
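The E-step and M-step updates derived on the preceding slides can be put together in a compact sketch. To keep it dependency-free it assumes one-dimensional Gaussians (a scalar variance instead of a covariance matrix), a hand-picked initialization instead of K-means, and a small variance floor `min_var` that is not part of the slides; all function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density P(x | c_k)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, weights, means, variances):
    """log P(X | Theta) = sum_i log sum_k P(x_i | c_k) P(c_k)."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in xs
    )

def em_step(xs, weights, means, variances, min_var=1e-6):
    K, n = len(weights), len(xs)
    # E-step: posteriors w_ik = P(c_k | x_i, Theta).
    resp = []
    for x in xs:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate mixture weights, means, and variances.
    new_w, new_m, new_v = [], [], []
    for k in range(K):
        nk = sum(r[k] for r in resp)                  # expected count for c_k
        mean_k = sum(r[k] * x for r, x in zip(resp, xs)) / nk
        var_k = sum(r[k] * (x - mean_k) ** 2 for r, x in zip(resp, xs)) / nk
        new_w.append(nk / n)
        new_m.append(mean_k)
        new_v.append(max(var_k, min_var))             # assumption: variance floor
    return new_w, new_m, new_v

def fit_gmm(xs, weights, means, variances, tol=1e-8, max_iter=200):
    prev = cur = log_likelihood(xs, weights, means, variances)
    for _ in range(max_iter):
        weights, means, variances = em_step(xs, weights, means, variances)
        cur = log_likelihood(xs, weights, means, variances)
        if cur - prev < tol:          # likelihood has converged
            break
        prev = cur
    return weights, means, variances, cur

xs = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
weights, means, variances, ll = fit_gmm(xs, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
```

The loop stops on either of the two termination conditions named above. A production version would work in the log domain to avoid underflow and would guard against a component whose expected count drops to zero.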

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  - TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by the distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

\[
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]
\]

\[
P(T_l \mid Y_k) = \frac{E(T_l, T_k)}{\sum_{s=1}^{K} E(T_s, T_k)},
\qquad
E(T_l, T_k) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2 \sigma^2} \right]
\]

\[
dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
\]
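A small sketch may help make the role of the map neighborhood concrete. It assumes each topic $T_k$ sits at a 2-D map coordinate and that $P(T_l \mid Y_k)$ is the Gaussian neighborhood weight normalized over $l$; the function names and the toy 2x2 map are hypothetical.

```python
import math

def neighborhood(coords, sigma):
    """Return table[k][l] = P(T_l | Y_k) from 2-D map coordinates of the topics."""
    K = len(coords)

    def E(k, l):
        d2 = (coords[k][0] - coords[l][0]) ** 2 + (coords[k][1] - coords[l][1]) ** 2
        return math.exp(-d2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

    table = []
    for k in range(K):
        row = [E(k, l) for l in range(K)]
        total = sum(row)
        table.append([e / total for e in row])   # normalize over l
    return table

def word_prob(p_topic_given_doc, p_neighbor, p_word_given_topic):
    """P(w | D) = sum_k P(T_k | D) * sum_l P(T_l | Y_k) * P(w | T_l)."""
    K = len(p_topic_given_doc)
    return sum(
        p_topic_given_doc[k]
        * sum(p_neighbor[k][l] * p_word_given_topic[l] for l in range(K))
        for k in range(K)
    )

# Toy 2x2 map with four topics (illustrative values only).
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
table = neighborhood(coords, sigma=1.0)
p = word_prob([0.7, 0.1, 0.1, 0.1], table, [0.05, 0.01, 0.01, 0.001])
```

A small `sigma` concentrates each row of the table on the topic itself, which makes the model behave like plain PLSA; a larger `sigma` shares word statistics among topics that are close on the map.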

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

\[
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
    &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}
\]

  - EM training can be performed

\[
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}
\]

\[
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}
\]

where

\[
P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] \cdot P(T_k \mid D_i)}
{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}
\]

Hierarchical Document Organization (cont)

• Criterion for topic word selection

\[
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}
\]

Hierarchical Document Organization (cont)

• Example [figures spanning two slides; not recoverable]

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  - A recursive regression process

[Figure: an input layer holding the input vector $x = [x_1, x_2, \dots, x_n]^T$, connected to a mapping layer whose units hold weight vectors $m_i = [m_{i,1}, m_{i,2}, \dots, m_{i,n}]^T$]

\[
m_i(t+1) = m_i(t) + h_{c(x), i}(t) \left[ x(t) - m_i(t) \right]
\]

where

\[
c(x) = \arg\min_i \lVert x - m_i \rVert,
\qquad
\lVert x - m_i \rVert = \sqrt{\sum_{n} (x_n - m_{i,n})^2}
\]

\[
h_{c(x), i}(t) = \alpha(t) \exp\left( -\frac{\lVert r_i - r_{c(x)} \rVert^2}{2 \sigma^2(t)} \right)
\]
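One step of the SOM update rule can be sketched as follows. This is a minimal illustration assuming fixed values of $\alpha$ and $\sigma$ for a single step (in practice both decay over time $t$) and map positions $r_i$ given as coordinate tuples; the function names are hypothetical.

```python
import math

def best_matching_unit(x, weights):
    """c(x) = argmin_i ||x - m_i||."""
    return min(
        range(len(weights)),
        key=lambda i: sum((xn - mn) ** 2 for xn, mn in zip(x, weights[i])),
    )

def som_update(x, weights, positions, alpha, sigma):
    """Apply m_i <- m_i + h_{c(x),i} * (x - m_i) to every unit for one input x."""
    c = best_matching_unit(x, weights)
    new_weights = []
    for m_i, r_i in zip(weights, positions):
        # Squared map distance between unit i and the winning unit c(x).
        d2 = sum((a - b) ** 2 for a, b in zip(r_i, positions[c]))
        h = alpha * math.exp(-d2 / (2.0 * sigma ** 2))
        new_weights.append([m + h * (xn - m) for m, xn in zip(m_i, x)])
    return c, new_weights

weights = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]   # weight vectors m_i
positions = [(0, 0), (0, 1), (0, 2)]             # map coordinates r_i
x = [1.2, 0.8]
c, updated = som_update(x, weights, positions, alpha=0.5, sigma=1.0)
```

The winning unit moves furthest toward the input, and units farther away on the map move less, which is what makes neighboring units end up with similar weight vectors.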

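Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

\[
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
\]

where

\[
dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)},
\qquad
dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}
\]

\[
f_{Between}(i, j) =
\begin{cases}
dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\qquad
C_{Between}(i, j) =
\begin{cases}
1 & T_{r,i} \ne T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\]

\[
f_{Within}(i, j) =
\begin{cases}
dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\qquad
C_{Within}(i, j) =
\begin{cases}
1 & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\]

\[
dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
\]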
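The distance ratio can be computed directly from the definitions. The sketch below assumes that `topics[i]` plays the role of $T_{r,i}$ (read here as a reference topic label for document $i$, which is an interpretation of the garbled subscript) and that `coords[i]` is the map location to which document $i$ was assigned; the toy data are made up.

```python
import math
from itertools import combinations

def r_dist(topics, coords):
    """R_Dist = (mean map distance of between-topic pairs) / (mean of within-topic pairs)."""
    between_sum = within_sum = 0.0
    between_n = within_n = 0
    for i, j in combinations(range(len(topics)), 2):   # all pairs with i < j
        (xi, yi), (xj, yj) = coords[i], coords[j]
        d = math.hypot(xi - xj, yi - yj)               # dist_Map(i, j)
        if topics[i] != topics[j]:
            between_sum += d
            between_n += 1
        else:
            within_sum += d
            within_n += 1
    return (between_sum / between_n) / (within_sum / within_n)

# Toy example: two topics, two documents each (illustrative values only).
topics = ["a", "a", "b", "b"]
coords = [(0, 0), (0, 1), (3, 0), (3, 1)]
ratio = r_dist(topics, coords)
```

A larger ratio means documents of different topics lie farther apart on the map than documents of the same topic. The function raises `ZeroDivisionError` if there are no within-topic pairs, no between-topic pairs, or all within-topic pairs share a location.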



The K-means Algorithm (cont)

• Initialization
  – A set of initial cluster centers $\{\boldsymbol{m}_i\}_{i=1}^{k}$ is needed

• Recursion
  – Assign each object $\boldsymbol{x}^t$ to the cluster whose center is closest

$$ b_i^t = \begin{cases} 1 & \text{if } \|\boldsymbol{x}^t - \boldsymbol{m}_i\| = \min_j \|\boldsymbol{x}^t - \boldsymbol{m}_j\| \\ 0 & \text{otherwise} \end{cases} $$

  – Then re-compute the center of each cluster as the centroid or mean (average) of its members

$$ \boldsymbol{m}_i = \frac{\sum_{t=1}^{N} b_i^t \cdot \boldsymbol{x}^t}{\sum_{t=1}^{N} b_i^t} $$

    • Using the medoid as the cluster center (a medoid is one of the objects in the cluster)

These two steps are repeated until $\boldsymbol{m}_i$ stabilizes
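The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's own code: the function name `kmeans`, the `max_iter` cap, the lowest-index tie-breaking, and the handling of empty clusters are all assumptions made for the example.

```python
import math

def kmeans(points, centers, max_iter=100):
    """K-means on tuples of numbers; `centers` holds the initial seeds (hypothetical API)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: b_i^t = 1 for the closest center (ties go to the lowest index).
        labels = [min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
                  for x in points]
        # Update step: m_i becomes the mean of the objects assigned to cluster i.
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        if new_centers == centers:  # the centers have stabilized
            break
        centers = new_centers
    return centers, labels
```

Calling `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])` separates the two obvious groups after two passes.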

The K-means Algorithm (cont)

• Algorithm

The K-means Algorithm (cont)

• Example 1

The K-means Algorithm (cont)

• Example 2

[Figure: document clusters labeled government, finance, sports, research, and name]


The K-means Algorithm (cont)

• The choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean $\boldsymbol{m}$ of all data and generate k initial centers $\boldsymbol{m}_i$ by adding a small random vector to the mean: $\boldsymbol{m}_i = \boldsymbol{m} \pm \boldsymbol{\delta}$
  – Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $\boldsymbol{m}_i$
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set

• Poor seeds will result in sub-optimal clustering
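The second seeding strategy (mean plus a small random vector) can be sketched as follows. The function name `seeds_around_mean`, the uniform perturbation, and the `scale` and `rng` parameters are assumptions for illustration; the slide only says that the random vector should be small.

```python
import random

def seeds_around_mean(points, k, scale=1e-2, rng=None):
    """Return k seeds of the form m + delta, with delta drawn uniformly from [-scale, scale]."""
    rng = rng or random.Random(0)  # fixed default seed so the sketch is reproducible
    dim = len(points[0])
    # Global mean m of all data.
    mean = [sum(x[d] for x in points) / len(points) for d in range(dim)]
    # Each seed perturbs every coordinate of the mean independently.
    return [tuple(mean[d] + rng.uniform(-scale, scale) for d in range(dim))
            for _ in range(k)]
```

Because all seeds start close together, the first few K-means passes do most of the work of spreading them out, which is one reason seed choice affects the final clustering.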

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb the objects slightly

• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with $l = \log_2 k$ (k clusters)

[Figure: the clusters are nodes on the hypercube, followed by a linear classifier]


The K-means Algorithm (cont)

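• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray

[Figure: the global mean is split into the cluster 1 mean and the cluster 2 mean; the next level shows four clusters with parameters $\{\mu_{11}, \Sigma_{11}, \omega_{11}\}$, $\{\mu_{12}, \Sigma_{12}, \omega_{12}\}$, $\{\mu_{13}, \Sigma_{13}, \omega_{13}\}$, and $\{\mu_{14}, \Sigma_{14}, \omega_{14}\}$]

$M \rightarrow 2M$ at each iteration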
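The doubling scheme can be sketched as below. This is an illustrative reading of the slide, not the published LBG procedure in full: the names `refine` and `lbg`, the fixed perturbation `delta`, and the fixed number of refinement passes are assumptions.

```python
import math

def refine(points, centers, iters=20):
    """A few K-means passes (assign to the nearest center, then recompute means)."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
            groups[nearest].append(x)
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else m
                   for g, m in zip(groups, centers)]
    return centers

def lbg(points, target, delta=1e-3):
    """Start from the global mean and double the codebook (M -> 2M) until `target` centers exist."""
    centers = [tuple(sum(col) / len(points) for col in zip(*points))]
    while len(centers) < target:
        split = []
        for m in centers:
            split.append(tuple(v + delta for v in m))  # m + delta
            split.append(tuple(v - delta for v in m))  # m - delta
        centers = refine(points, split)
    return centers
```

Since the codebook doubles each round, `target` is effectively rounded up to a power of two in this sketch.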

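The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

[Figure: A Mixture Gaussian HMM (or A Mixture of Gaussians). The component densities $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \ldots, P(\vec{x}_i \mid c_K)$ are weighted by $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$ and summed for $\vec{x}_i$]

$$ P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) $$

Continuous case:

$$ P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{\sqrt{(2\pi)^m\, |\Sigma_k|}} \exp\!\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right) $$

Likelihood function for data samples $\boldsymbol{X} = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ (the $\vec{x}_i$'s are independent and identically distributed, i.i.d.):

$$ P(\boldsymbol{X} \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) $$

Classification:

$$ \arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) $$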
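The mixture density, the data log-likelihood, and the classification rule can be illustrated with a one-dimensional sketch. Restricting to scalar data (so that each covariance reduces to a variance) and the function names are assumptions made to keep the example short.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density, the m = 1 case of P(x | c_k, Theta)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_density(x, weights, means, variances):
    """P(x | Theta) = sum_k P(x | c_k, Theta) P(c_k | Theta)."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def log_likelihood(xs, weights, means, variances):
    """log P(X | Theta) for i.i.d. samples: the log of the product becomes a sum of logs."""
    return sum(math.log(mixture_density(x, weights, means, variances)) for x in xs)

def classify(x, weights, means, variances):
    """Index k maximizing P(x | c_k) P(c_k); the shared denominator P(x | Theta) is dropped."""
    scores = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Working with the log-likelihood rather than the raw product avoids numerical underflow when n is large.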

The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B|S1) = 2/4 = 0.5

P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

Observation   State S1   State S2
B             0.7        0.3
W             0.4        0.6
B             0.9        0.1
W             0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64

P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36

P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27

P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
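The soft-assignment estimates are weighted relative frequencies, which can be reproduced with a short sketch. The function name `emission_estimates` and the data layout (one list of state weights per observation) are assumptions for the example; hard assignment is the special case where every weight is 0 or 1.

```python
def emission_estimates(observations, memberships):
    """Return, for each state, P(symbol | state) estimated from fractional counts."""
    n_states = len(memberships[0])
    estimates = []
    for s in range(n_states):
        total = sum(m[s] for m in memberships)  # total weight assigned to state s
        counts = {}
        for symbol, m in zip(observations, memberships):
            counts[symbol] = counts.get(symbol, 0.0) + m[s]
        estimates.append({symbol: c / total for symbol, c in counts.items()})
    return estimates

# The four observations and their state weights from the soft-assignment slide.
obs = ["B", "W", "B", "W"]
soft = [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1], [0.5, 0.5]]
est = emission_estimates(obs, soft)
```

Here `est[0]["B"]` is 1.6/2.5 and `est[1]["W"]` is 1.1/1.5, matching the slide's figures.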

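The EM Algorithm (cont)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(\boldsymbol{X} \mid \Theta)
&= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta)
 = \prod_{i=1}^{n} \Bigl[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \Bigr] \\
&= \bigl[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \bigr] \times \cdots \\
&\qquad \times \bigl[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \bigr] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{\boldsymbol{C}} P(\boldsymbol{X}, \boldsymbol{C} \mid \Theta)
\end{aligned}
$$

where $\boldsymbol{X} = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_{n-1}, \vec{x}_n\}$, $\boldsymbol{C} = \{c_{k_1}, c_{k_2}, \ldots, c_{k_{n-1}}, c_{k_n}\}$, $P(\boldsymbol{X} \mid \Theta)$ is the likelihood function, and $P(\boldsymbol{X}, \boldsymbol{C} \mid \Theta)$ is the complete data likelihood function.

Note:

$$ \prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t} $$

How many kinds of $\boldsymbol{C}$? ($K^n$ kinds)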
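The product-of-sums identity in the note, and the count of assignments, can be checked by brute force on a small example. This sketch is only a numerical sanity check; the function names are illustrative.

```python
import itertools
import math

def product_of_sums(a):
    """Left-hand side: prod_t sum_k a[t][k]."""
    return math.prod(sum(row) for row in a)

def sum_over_assignments(a):
    """Right-hand side: sum over every (k_1, ..., k_T) of prod_t a[t][k_t], plus the number of terms."""
    n_components = len(a[0])
    total, count = 0.0, 0
    for ks in itertools.product(range(n_components), repeat=len(a)):
        total += math.prod(a[t][k] for t, k in enumerate(ks))
        count += 1
    return total, count
```

With T rows and M columns the enumeration visits M^T assignments, which is why the direct sum over C is impractical for real data and the factorized form is used instead.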

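The EM Algorithm (cont)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $\boldsymbol{C}$, conditioned on the known data $(\boldsymbol{X}, \Theta)$

$$
\Phi(\Theta, \hat{\Theta}) = E_{\boldsymbol{C}}\bigl[ L_{CM} \mid \boldsymbol{X}, \Theta \bigr]
= E_{\boldsymbol{C}}\bigl[ \log P(\boldsymbol{X}, \boldsymbol{C} \mid \hat{\Theta}) \mid \boldsymbol{X}, \Theta \bigr]
= \sum_{\boldsymbol{C}} P(\boldsymbol{C} \mid \boldsymbol{X}, \Theta) \log P(\boldsymbol{X}, \boldsymbol{C} \mid \hat{\Theta})
= \sum_{\boldsymbol{C}} \frac{P(\boldsymbol{X}, \boldsymbol{C} \mid \Theta)}{P(\boldsymbol{X} \mid \Theta)} \log P(\boldsymbol{X}, \boldsymbol{C} \mid \hat{\Theta})
$$

    (here $\Theta$ is known and $\hat{\Theta}$ is unknown)

  – Maximize the log likelihood function $\log P(\boldsymbol{X} \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$ \Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(\boldsymbol{X} \mid \hat{\Theta}) - \log P(\boldsymbol{X} \mid \Theta) $$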
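For readers who want the reasoning behind the inequality, one standard argument runs as follows. The symbol $H$ is auxiliary notation introduced here and does not appear on the slides; the argument relies only on the non-negativity of the Kullback-Leibler divergence.

```latex
\begin{aligned}
\log P(X \mid \hat{\Theta})
  &= \sum_{C} P(C \mid X, \Theta)\,
     \bigl[ \log P(X, C \mid \hat{\Theta}) - \log P(C \mid X, \hat{\Theta}) \bigr]
   = \Phi(\Theta, \hat{\Theta}) + H(\Theta, \hat{\Theta}), \\
H(\Theta, \hat{\Theta})
  &= -\sum_{C} P(C \mid X, \Theta) \log P(C \mid X, \hat{\Theta}), \\
\log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)
  &= \bigl[ \Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \bigr]
   + \bigl[ H(\Theta, \hat{\Theta}) - H(\Theta, \Theta) \bigr], \\
H(\Theta, \hat{\Theta}) - H(\Theta, \Theta)
  &= \sum_{C} P(C \mid X, \Theta)
     \log \frac{P(C \mid X, \Theta)}{P(C \mid X, \hat{\Theta})} \;\ge\; 0 .
\end{aligned}
```

Because the last term is never negative, any $\hat{\Theta}$ that increases $\Phi$ increases the log-likelihood by at least as much.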

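The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta})
&= \sum_{\boldsymbol{C}} \frac{P(\boldsymbol{X}, \boldsymbol{C} \mid \Theta)}{P(\boldsymbol{X} \mid \Theta)} \log P(\boldsymbol{X}, \boldsymbol{C} \mid \hat{\Theta}) \\
&= \sum_{\boldsymbol{C}} \Bigl[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \Bigr] \log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{\boldsymbol{C}} \Bigl[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Bigr] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \Bigl[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Bigr] \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i, k} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \Bigl\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Bigr\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \bigl[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \bigr] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

where

$$ \delta_{k_i, k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases} $$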

The EM Algorithm (cont)

  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \Bigl[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \Bigr] P(c_k \mid \vec{x}_i, \Theta) \\
&= \Bigl[ \prod_{j=1, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \Bigr] P(c_k \mid \vec{x}_i, \Theta) \\
&= \Bigl[ \prod_{j=1, j \ne i}^{n} 1 \Bigr] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

    ($\delta_{k_i, k}$ means that $\vec{x}_i$ can only be aligned to $c_k$)

Note:

$$ \prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t} $$

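The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$ \Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta}) $$

where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta})
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta})
 = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

is the auxiliary function for mixture weights, and

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta})
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

is the auxiliary function for cluster distributions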
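Both parts of the auxiliary function share the posterior weight $P(c_k \mid \vec{x}_i, \Theta)$, which is what the E-step actually computes. A minimal one-dimensional sketch, assuming scalar data and univariate Gaussian components (the function names are illustrative):

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density standing in for P(x_i | c_k, Theta)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def responsibilities(xs, weights, means, variances):
    """Return rows P(c_k | x_i, Theta) = P(x_i|c_k) P(c_k) / sum_l P(x_i|c_l) P(c_l)."""
    resp = []
    for x in xs:
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)  # P(x_i | Theta)
        resp.append([j / total for j in joint])
    return resp
```

Each row sums to one, so an object contributes a total weight of one spread across the clusters.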

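The EM Algorithm (cont)

• M-step (Maximization)
  – Remember that
    • Maximize a function F with a constraint by applying a Lagrange multiplier

Suppose that

$$ F = \sum_{j=1}^{N} w_j \log y_j, \qquad \text{Constraint: } \sum_{j=1}^{N} y_j = 1 $$

By applying the Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \Bigl( \sum_{j=1}^{N} y_j - 1 \Bigr) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
\sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$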
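The closed-form solution can be checked numerically by comparing it against random points on the probability simplex. This sketch is only a sanity check; the function names and the sampling scheme are assumptions for illustration.

```python
import math
import random

def objective(w, y):
    """F = sum_j w_j log y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

def constrained_maximizer(w):
    """The Lagrange-multiplier solution y_j = w_j / sum_j' w_j'."""
    total = sum(w)
    return [wj / total for wj in w]

def random_simplex_point(n, rng):
    """A random positive vector normalized to sum to one."""
    draws = [rng.random() + 1e-9 for _ in range(n)]  # offset avoids log(0)
    s = sum(draws)
    return [d / s for d in draws]

w = [2.0, 1.0, 1.0]
best = constrained_maximizer(w)
rng = random.Random(0)
beaten = any(objective(w, random_simplex_point(3, rng)) > objective(w, best) + 1e-12
             for _ in range(1000))
```

No random candidate exceeds the closed-form point, so `beaten` stays `False`.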

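The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta})
&= \Phi_a(\Theta, \hat{\Theta}) + \ell \Bigl( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \Bigr) \\
&= \sum_{k=1}^{K} \Bigl[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \Bigr] \log P(c_k \mid \hat{\Theta}) + \ell \Bigl( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \Bigr)
\end{aligned}
$$

Here the bracketed sum plays the role of $w_k$ and $P(c_k \mid \hat{\Theta})$ plays the role of $y_k$, so

$$
\hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\sum_{k'=1}^{K} \sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

Each term of the sum over i is the expected number of times that $\vec{x}_i$ falls in class $c_k$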

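The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta}) $$

$$ P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{\sqrt{(2\pi)^m\, |\Sigma_k|}} \exp\!\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right) $$

Let

$$ w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} $$

and

$$ \log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \cdot \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) $$

Then

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Bigl[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \Bigr] + D $$

where D is a constant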

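The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Bigl[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \Bigr] + D $$

Using $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and noting that $\hat{\Sigma}_k^{-1}$ is symmetric here:

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k}
&= \sum_{i=1}^{n} w_{ik} \cdot \Bigl( -\frac{1}{2} \Bigr) \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow\; \hat{\vec{\mu}}_k
&= \frac{\sum_{i=1}^{n} w_{ik} \cdot \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

Each weight $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$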

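The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$ \Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \Bigl[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \Bigr] + D $$

Using $\dfrac{d \det(X)}{dX} = \det(X) \cdot X^{-T}$ and $\dfrac{d(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1} \vec{a}\, \vec{b}^T X^{-1}$ for symmetric $X$ ($\hat{\Sigma}_k$ is symmetric here):

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k}
&= \sum_{i=1}^{n} w_{ik} \cdot \Bigl( -\frac{1}{2} \Bigr) \Bigl[ \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \Bigr] = 0 \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1}
&= \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k
&= \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow\; \hat{\Sigma}_k \cdot \sum_{i=1}^{n} w_{ik}
&= \sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
\Rightarrow\; \hat{\Sigma}_k
&= \frac{\sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$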
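The three M-step formulas (mixture weights, means, covariances) can be sketched together for the one-dimensional case, where each covariance matrix reduces to a scalar variance. The scalar restriction and the name `m_step` are assumptions to keep the example compact; `resp[i][k]` stands for the weight $w_{ik}$ computed in the E-step.

```python
def m_step(xs, resp):
    """Re-estimate (weights, means, variances) from data xs and responsibilities resp."""
    n = len(xs)
    n_components = len(resp[0])
    weights, means, variances = [], [], []
    for k in range(n_components):
        w_k = sum(r[k] for r in resp)  # expected number of objects falling in cluster k
        mean = sum(r[k] * x for r, x in zip(resp, xs)) / w_k
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, xs)) / w_k
        weights.append(w_k / n)  # pi_k = w_k / n
        means.append(mean)
        variances.append(var)
    return weights, means, variances
```

With 0/1 responsibilities this reduces to per-cluster sample proportions, means, and variances, which is the hard-assignment estimate.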

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(\boldsymbol{X} \mid \Theta)$ has converged or the maximum number of iterations is reached
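The termination rule can be sketched as a complete loop for a one-dimensional Gaussian mixture. This is an illustrative sketch under several assumptions: scalar data, caller-supplied initial parameters (for instance from K-means), a log-likelihood improvement threshold `tol`, and a small variance floor that is not part of the slides and is only there to avoid division by zero.

```python
import math

def em_gmm_1d(xs, weights, means, variances, max_iter=200, tol=1e-8):
    """Run EM until the log-likelihood stops improving or max_iter is reached."""
    history = []
    for _ in range(max_iter):
        # E-step: responsibilities and the current log-likelihood log P(X | Theta).
        resp, log_lik = [], 0.0
        for x in xs:
            joint = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        history.append(log_lik)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break  # the likelihood has converged
        # M-step: closed-form updates for weights, means, and variances.
        ks = range(len(weights))
        counts = [sum(r[k] for r in resp) for k in ks]
        weights = [c / len(xs) for c in counts]
        means = [sum(r[k] * x for r, x in zip(resp, xs)) / counts[k] for k in ks]
        variances = [max(sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / counts[k],
                         1e-6)  # assumed floor, not from the slides
                     for k in ks]
    return weights, means, variances, history
```

The returned `history` makes it easy to confirm that the log-likelihood never decreases from one iteration to the next.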

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information
  – TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$ P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \Bigl[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \Bigr] $$

$$ E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right] $$

$$ dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} $$

$$ P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)} $$
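The neighborhood weights $P(T_l \mid Y_k)$ and the smoothed word probability can be sketched as follows. The function names, the list-based data layout, and the use of (x, y) grid coordinates for topic positions are assumptions for illustration.

```python
import math

def neighborhood_probs(coords, sigma):
    """coords[k] = (x_k, y_k) is the map position of topic T_k; returns rows P(T_l | Y_k)."""
    probs = []
    for xk, yk in coords:
        # E(T_k, T_l): Gaussian of the Euclidean map distance between topics k and l.
        e = [math.exp(-((xk - xl) ** 2 + (yk - yl) ** 2) / (2 * sigma ** 2))
             / (math.sqrt(2 * math.pi) * sigma) for xl, yl in coords]
        total = sum(e)
        probs.append([v / total for v in e])
    return probs

def word_prob(p_topic_given_doc, p_word_given_topic, probs):
    """P(w_j | D_i) = sum_k P(T_k | D_i) * sum_l P(T_l | Y_k) P(w_j | T_l)."""
    n_topics = len(probs)
    return sum(p_k * sum(probs[k][l] * p_word_given_topic[l] for l in range(n_topics))
               for k, p_k in enumerate(p_topic_given_doc))
```

A larger `sigma` spreads each topic's mass over more neighbors on the map, while a very small `sigma` approaches an unsmoothed topic mixture.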

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \Bigl\{ \sum_{k=1}^{K} P(T_k \mid D_i) \Bigl[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \Bigr] \Bigr\}
\end{aligned}
$$

  – EM training can be performed

$$ \hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})} $$

$$ \hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)} $$

where

$$ P(T_k \mid w_j, D_i) = \frac{\Bigl[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \Bigr] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \Bigl\{ \Bigl[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'}) \Bigr] \cdot P(T_{k'} \mid D_i) \Bigr\}} $$

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$ S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'})\, [1 - P(T_k \mid D_{i'})]} $$

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)


Hierarchical Document Organization (cont)

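• Self-Organizing Map (SOM)
  – A recursive regression process

[Figure: an input layer holding the input vector $\boldsymbol{x} = [x_1, x_2, \ldots, x_n]^T$ is connected to a mapping layer in which each node i holds a weight vector $\boldsymbol{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, for example $\boldsymbol{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$]

$$ \boldsymbol{m}_i(t+1) = \boldsymbol{m}_i(t) + h_{c(\boldsymbol{x}), i}(t)\, [\boldsymbol{x}(t) - \boldsymbol{m}_i(t)] $$

where

$$ c(\boldsymbol{x}) = \arg\min_i \|\boldsymbol{x} - \boldsymbol{m}_i\|, \qquad \|\boldsymbol{x} - \boldsymbol{m}_i\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2} $$

$$ h_{c(\boldsymbol{x}), i}(t) = \alpha(t) \exp\!\left( -\frac{\|\boldsymbol{r}_i - \boldsymbol{r}_{c(\boldsymbol{x})}\|^2}{2\sigma^2(t)} \right) $$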
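A single SOM update can be sketched as below. The function name `som_step`, the explicit `positions` list for the map coordinates $\boldsymbol{r}_i$, and passing `alpha` and `sigma` as plain numbers for one time step are assumptions for illustration; in practice both typically decay over time.

```python
import math

def som_step(weights, positions, x, alpha, sigma):
    """Apply one update m_i <- m_i + h_{c(x),i} (x - m_i) and return (winner, new weights)."""
    # Winner c(x): the node whose weight vector is closest to the input.
    winner = min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
    updated = []
    for m_i, r_i in zip(weights, positions):
        # Neighborhood function: Gaussian in map distance from the winner.
        h = alpha * math.exp(-math.dist(r_i, positions[winner]) ** 2 / (2 * sigma ** 2))
        updated.append([m + h * (xv - m) for m, xv in zip(m_i, x)])
    return winner, updated
```

Every node moves toward the input, but nodes far from the winner on the map move less, which is what preserves the map's topology.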

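Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$ R_{dist} = \frac{dist_{Between}}{dist_{Within}} $$

where

$$ dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)} $$

$$ f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & \text{if } T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases} \qquad C_{Between}(i, j) = \begin{cases} 1 & \text{if } T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases} $$

$$ f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & \text{if } T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases} \qquad C_{Within}(i, j) = \begin{cases} 1 & \text{if } T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases} $$

$$ dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} $$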
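The distance ratio can be computed directly from document map positions and labels. In this sketch, `labels[i]` stands for the slide's $T_{r,i}$; treating it as a per-document class label is an assumption, as are the function name and data layout. The function assumes at least one within-class pair and one between-class pair exist.

```python
import itertools
import math

def distance_ratio(positions, labels):
    """R_dist = (mean map distance of between-class pairs) / (mean of within-class pairs)."""
    between, within = [], []
    # Pairs (i, j) with i < j, matching the double sum over i and j = i+1..D.
    for i, j in itertools.combinations(range(len(positions)), 2):
        d = math.dist(positions[i], positions[j])
        (within if labels[i] == labels[j] else between).append(d)
    dist_between = sum(between) / len(between)
    dist_within = sum(within) / len(within)
    return dist_between / dist_within
```

A larger ratio means documents sharing a label sit closer together on the map than documents with different labels.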


Page 29: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 29

The K-means Algorithm (cont)

bull Algorithm

IR ndash Berlin Chen 30

The K-means Algorithm (cont)

bull Example 1

IR ndash Berlin Chen 31

The K-means Algorithm (cont)

bull Example 2

governmentfinancesports

research

name

The K-means Algorithm (cont)

• The choice of initial cluster centers (seeds) is important
  – Pick the seeds at random
  – Calculate the mean $m$ of all the data and generate k initial centers $m_i = m \pm \delta$ by adding a small random vector $\delta$ to the mean
  – Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center $m_i$
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the size of the complete set
• Poor seeds will result in sub-optimal clustering
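The second seeding strategy (perturbing the global mean) can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name `seeds_from_mean`, the `scale` parameter, and the uniform perturbation are assumptions made for the example.

```python
import random

def seeds_from_mean(data, k, scale=0.01, rng=None):
    """Return k seeds, each equal to the data mean plus a small random vector.

    `scale` bounds each coordinate of the perturbation (an assumed choice).
    """
    rng = rng or random.Random(0)
    dim = len(data[0])
    # Global mean of all data points.
    mean = [sum(x[d] for x in data) / len(data) for d in range(dim)]
    # m_i = m + delta_i, with delta_i drawn uniformly from [-scale, scale]^dim.
    return [[mean[d] + rng.uniform(-scale, scale) for d in range(dim)]
            for _ in range(k)]

data = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
seeds = seeds_from_mean(data, k=3)
```

All seeds start close together near the mean, so the first few K-means iterations are what pull them apart; a larger `scale` spreads them sooner.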

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb the objects slightly
• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with $l = \log_2 k$ (k clusters); the clusters become nodes on the hypercube, on which a linear classifier can be applied

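The K-means Algorithm (cont)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray
  – The number of clusters doubles at each iteration (M → 2M): the global mean is split into the cluster 1 mean and the cluster 2 mean, which are split in turn into four clusters with parameters {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}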
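The M → 2M splitting scheme can be sketched as below. This is a simplified illustration under stated assumptions: each center is split into two copies shifted by a fixed offset `eps` (a hypothetical parameter), followed by plain K-means refinement; the helper names are made up for the example, and only means are tracked (no covariances or weights).

```python
def nearest(x, centers):
    """Index of the center closest to x (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))

def kmeans(data, centers, iters=20):
    """Refine the given centers with a fixed number of K-means iterations."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in data:
            groups[nearest(x, centers)].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    return centers

def lbg(data, target_k, eps=0.01):
    """Grow a codebook by repeated splitting: 1 -> 2 -> 4 -> ... centers."""
    centers = [[sum(col) / len(data) for col in zip(*data)]]  # global mean
    while len(centers) < target_k:
        # Split every center c into c + eps and c - eps (M -> 2M).
        centers = [[v + s * eps for v in c] for c in centers for s in (+1, -1)]
        centers = kmeans(data, centers)
    return centers

data = [[0.0], [0.2], [4.0], [4.2], [8.0], [8.2], [12.0], [12.2]]
codebook = sorted(lbg(data, target_k=4))
```

Because the codebook size doubles each round, this sketch only reaches target sizes that are powers of two.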

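The EM Algorithm

• A soft version of the K-means algorithm
  – Each object can be a member of multiple clusters
  – Clustering is treated as estimating a mixture of (continuous) probability distributions: a mixture of Gaussians (or a mixture Gaussian HMM)

(Figure: a sample $\vec{x}_i$ is scored by each component density $P(\vec{x}_i \mid c_1), \ldots, P(\vec{x}_i \mid c_K)$, weighted by the priors $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$, and summed.)

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

Continuous case:

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$$

Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$ are independent and identically distributed (i.i.d.):

$$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

Classification:

$$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$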
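The mixture density and the classification rule can be illustrated with a small sketch. To keep it short it assumes one-dimensional Gaussians (the slide uses the multivariate form); the function names and the example parameters are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, weights, means, variances):
    """P(x | Theta) = sum_k P(x | c_k) P(c_k)."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

def posteriors(x, weights, means, variances):
    """P(c_k | x, Theta) for every component k (soft membership)."""
    joint = [w * gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def classify(x, weights, means, variances):
    """Index of the component with the largest posterior."""
    post = posteriors(x, weights, means, variances)
    return max(range(len(post)), key=post.__getitem__)

weights, means, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
post = posteriors(1.0, weights, means, variances)
label = classify(1.0, weights, means, variances)
```

The posteriors are the soft memberships; taking only the arg max turns them back into a hard, K-means-like assignment.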

The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment (State S1)

P(B|S1) = 2/4 = 0.5

P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

Object   State S1   State S2
B        0.7        0.3
W        0.4        0.6
B        0.9        0.1
W        0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64

P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36

P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27

P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
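The soft-assignment estimates above can be reproduced with fractional counts. This sketch assumes the pairing of observations and membership weights implied by the sums on the slide (B with 0.7/0.3 and 0.9/0.1, W with 0.4/0.6 and 0.5/0.5); the function name is made up for the example.

```python
def soft_estimates(observations, memberships):
    """Estimate P(symbol | state) from fractional (soft) counts.

    observations: list of symbols, one per object.
    memberships:  list of per-state weights, one tuple per object.
    """
    n_states = len(memberships[0])
    estimates = []
    for s in range(n_states):
        total = sum(m[s] for m in memberships)  # total soft count of state s
        counts = {}
        for sym, m in zip(observations, memberships):
            counts[sym] = counts.get(sym, 0.0) + m[s]
        estimates.append({sym: c / total for sym, c in counts.items()})
    return estimates

observations = ["B", "W", "B", "W"]
memberships = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]
est = soft_estimates(observations, memberships)
```

Setting every membership to (1, 0) or (0, 1) recovers ordinary counting, which is the hard-assignment case.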

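The EM Algorithm (cont)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
$$

where $X = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_{n-1}, \vec{x}_n$ and $C = c_{k_1}, c_{k_2}, \ldots, c_{k_{n-1}}, c_{k_n}$, and $P(X, C \mid \Theta)$ is the complete data likelihood function.

How many kinds of $C$ are there? $K^n$ kinds.

Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$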
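The product-of-sums identity used in this derivation can be checked numerically by brute force. The sketch below enumerates every assignment $(k_1, \ldots, k_T)$ explicitly; the matrix `a` holds arbitrary example values, and the function names are made up for the illustration.

```python
import itertools
import math

def product_of_sums(a):
    """prod_t sum_k a[t][k]."""
    return math.prod(sum(row) for row in a)

def sum_over_assignments(a):
    """sum over all (k_1, ..., k_T) of prod_t a[t][k_t]."""
    T, M = len(a), len(a[0])
    return sum(math.prod(a[t][ks[t]] for t in range(T))
               for ks in itertools.product(range(M), repeat=T))

a = [[0.2, 0.5, 0.1],
     [0.3, 0.3, 0.4],
     [0.6, 0.1, 0.2],
     [0.05, 0.7, 0.25]]
lhs = product_of_sums(a)
rhs = sum_over_assignments(a)
n_assignments = len(a[0]) ** len(a)  # M**T terms in the expanded sum
```

The expanded sum has $M^T$ terms (here $3^4 = 81$), which is why the factored form is the one used in practice.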

The EM Algorithm (cont)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $X$ and the current parameters $\Theta$

$$\Phi(\Theta, \hat{\Theta}) = E[L_{CM}] = E_{C}\left[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \right] = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})$$

Here $\Theta$ is known and $\hat{\Theta}$ is unknown.

  – Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$
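The bound stated on this slide can be sketched with Jensen's inequality. The steps below are a standard argument and not necessarily the derivation used in the earlier lecture; $\Theta$ denotes the current parameters, $\hat{\Theta}$ the new ones, and $\Phi$ the auxiliary function.

$$
\begin{aligned}
\log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)
&= \log \sum_{C} \frac{P(X, C \mid \hat{\Theta})}{P(X \mid \Theta)} \\
&= \log \sum_{C} P(C \mid X, \Theta)\, \frac{P(X, C \mid \hat{\Theta})}{P(X, C \mid \Theta)} \\
&\ge \sum_{C} P(C \mid X, \Theta) \log \frac{P(X, C \mid \hat{\Theta})}{P(X, C \mid \Theta)} \\
&= \Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta)
\end{aligned}
$$

The second line uses $P(X, C \mid \Theta) = P(C \mid X, \Theta)\, P(X \mid \Theta)$, and the inequality holds because the logarithm is concave. Any $\hat{\Theta}$ that increases the auxiliary function therefore cannot decrease the log likelihood.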

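The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \log \left[ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i, k} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

where

$$\delta_{k_i, k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}$$

The step that collapses the braces to $P(c_k \mid \vec{x}_i, \Theta)$ is shown on the next slide.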

The EM Algorithm (cont)

  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[ \prod_{j=1,\, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1,\, j \ne i}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

because $\delta_{k_i, k}$ forces $\vec{x}_i$ to be aligned only to $c_k$, while each remaining factor sums to one.

Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$

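The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

where the auxiliary function for the mixture weights is

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

and the auxiliary function for the cluster distributions is

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$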

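The EM Algorithm (cont)

• M-step (Maximization)
  – Remember that
    • A function $F$ can be maximized under a constraint by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$, with the constraint $\sum_{j=1}^{N} y_j = 1$.

By applying the Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
&\Rightarrow \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
&\therefore\; y_j = \frac{w_j}{\sum_{j=1}^{N} w_j}
\end{aligned}
$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$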
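The closed-form solution $y_j = w_j / \sum_j w_j$ can be sanity-checked numerically. The sketch below compares it against random points on the probability simplex; the weights and the number of trials are arbitrary choices for the illustration.

```python
import math
import random

def objective(w, y):
    """F = sum_j w_j log y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

def closed_form(w):
    """Constrained maximizer obtained with the Lagrange multiplier."""
    total = sum(w)
    return [wj / total for wj in w]

rng = random.Random(0)
w = [2.0, 1.0, 0.5]
best = closed_form(w)
best_value = objective(w, best)

# Evaluate F at random points that satisfy the constraint sum_j y_j = 1.
trials = []
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in w]
    s = sum(raw)
    trials.append(objective(w, [r / s for r in raw]))
```

No random candidate should beat `best_value`, which is what the derivation predicts for positive weights.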

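The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right] \log P(c_k \mid \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

Here the bracketed sum plays the role of $w_k$ and $P(c_k \mid \hat{\Theta})$ plays the role of $y_k$, so

$$
\hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k=1}^{K} w_k}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle \sum_{k=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

The numerator $w_k$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.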
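The mixture-weight update can be sketched as follows: compute the posterior of every component for every sample, then average over the samples. The example assumes one-dimensional Gaussian components for brevity; the function names and data are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities(data, weights, means, variances):
    """resp[i][k] = P(c_k | x_i, Theta) under the current parameters."""
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp

def update_weights(resp):
    """New prior of cluster k = average posterior of k over all samples."""
    n = len(resp)
    return [sum(r[k] for r in resp) / n for k in range(len(resp[0]))]

data = [-0.2, 0.1, 0.3, 3.8, 4.1]
resp = responsibilities(data, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
new_weights = update_weights(resp)
```

With three samples near 0 and two near 4, the updated weights move from (0.5, 0.5) toward roughly (0.6, 0.4).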

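The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

$$P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\, |\hat{\Sigma}_k|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)$$

Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

and

$$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

Then

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

where $D$ is a constant.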

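The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow\; \hat{\vec{\mu}}_k &= \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here.

$\sum_{i} w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.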

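The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{|\hat{\Sigma}_k|} \cdot |\hat{\Sigma}_k| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
&\Rightarrow \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow \hat{\Sigma}_k \sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
&\Rightarrow \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

Notes: $\dfrac{d \det(X)}{dX} = \det(X) \cdot X^{-T}$ and $\dfrac{d(a^T X^{-1} b)}{dX} = -X^{-T} a\, b^T X^{-T}$, and $\hat{\Sigma}_k$ is symmetric here.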
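The mean and covariance updates are weighted averages, which the following sketch computes for a single cluster $k$. It assumes the responsibilities $w_{ik}$ have already been obtained in the E-step; here they are set to equal values purely for illustration, and the function names are made up.

```python
def weighted_mean(data, w):
    """mu_k = sum_i w_ik x_i / sum_i w_ik."""
    total = sum(w)
    dim = len(data[0])
    return [sum(wi * x[d] for wi, x in zip(w, data)) / total
            for d in range(dim)]

def weighted_covariance(data, w, mean):
    """Sigma_k = sum_i w_ik (x_i - mu_k)(x_i - mu_k)^T / sum_i w_ik."""
    total = sum(w)
    dim = len(mean)
    cov = [[0.0] * dim for _ in range(dim)]
    for wi, x in zip(w, data):
        diff = [x[d] - mean[d] for d in range(dim)]
        # Accumulate the weighted outer product of the deviation vector.
        for r in range(dim):
            for c in range(dim):
                cov[r][c] += wi * diff[r] * diff[c]
    return [[v / total for v in row] for row in cov]

data = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
w_k = [1.0, 1.0, 1.0, 1.0]  # assumed responsibilities of one cluster
mu_k = weighted_mean(data, w_k)
sigma_k = weighted_covariance(data, w_k, mu_k)
```

Note that the covariance update uses the newly estimated mean, matching the hatted $\hat{\vec{\mu}}_k$ in the formula.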

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
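The termination rule can be sketched as a loop that stops on a small log-likelihood gain or on an iteration cap. This is a minimal one-dimensional illustration: the initial means are assumed to come from a prior K-means run, and `tol`, `max_iter`, and the variance floor `min_var` are hypothetical parameters.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """log P(X | Theta) for i.i.d. samples."""
    return sum(math.log(sum(w * gaussian_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in data)

def em(data, weights, means, variances, tol=1e-6, max_iter=100, min_var=1e-6):
    K = len(weights)
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(max_iter):
        # E-step: posterior of each component for each sample.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: weighted averages, as derived on the preceding slides.
        sums = [sum(r[k] for r in resp) for k in range(K)]
        weights = [s / len(data) for s in sums]
        means = [sum(r[k] * x for r, x in zip(resp, data)) / sums[k]
                 for k in range(K)]
        variances = [max(min_var,
                         sum(r[k] * (x - means[k]) ** 2
                             for r, x in zip(resp, data)) / sums[k])
                     for k in range(K)]
        history.append(log_likelihood(data, weights, means, variances))
        if history[-1] - history[-2] < tol:  # likelihood has converged
            break
    return weights, means, variances, history

data = [-0.3, 0.0, 0.2, 0.4, 3.7, 4.0, 4.1, 4.4]
weights, means, variances, history = em(data, [0.5, 0.5], [0.5, 3.5], [1.0, 1.0])
```

The recorded `history` should never decrease, which is a convenient check when debugging an EM implementation.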

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by the distances on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

(Figure: Two-dimensional Tree Structure for Organized Topics)

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]$$

$$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}$$

$$E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
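The neighborhood probabilities $P(T_l \mid Y_k)$ can be computed directly from the topics' map coordinates. The sketch below assumes a small 2×2 map and $\sigma = 1$; both choices, and the function name, are made up for the illustration.

```python
import math

def neighborhood_probs(coords, sigma):
    """Return P[k][l] = P(T_l | Y_k) from the 2-D map positions of the topics."""
    def e(a, b):
        # Gaussian of the Euclidean distance between two map positions.
        d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
        return math.exp(-d2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

    probs = []
    for ck in coords:
        row = [e(ck, cl) for cl in coords]
        total = sum(row)  # normalize over all topics s
        probs.append([v / total for v in row])
    return probs

coords = [(0, 0), (0, 1), (1, 0), (1, 1)]  # topic positions on the map
P = neighborhood_probs(coords, sigma=1.0)
```

A smaller `sigma` concentrates each row on the topic itself, so the model approaches one with no sharing between neighboring topics.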

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}
$$

  – EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

where

$$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid Y_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}$$

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}$$

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  – A recursive regression process

Input layer: input vector $x = [x_1, x_2, \ldots, x_n]^T$

Mapping layer: weight vectors $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$ (e.g., $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$)

$$m_i(t+1) = m_i(t) + h_{c(x),i}(t)\, [x(t) - m_i(t)]$$

$$c(x) = \arg\min_i \| x - m_i \|, \quad \text{where } \| x - m_i \| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

$$h_{c(x),i}(t) = \alpha(t) \exp\left( -\frac{\| r_i - r_{c(x)} \|^2}{2\sigma^2(t)} \right)$$

(Figure: the input $x$, a weight vector $m_i$, the difference $x - m_i$, and the updated weight vector $m_i'$.)
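One SOM update step can be sketched as below: find the best-matching unit, then pull every unit toward the input with a strength that decays with map distance from the winner. The grid positions, the fixed `alpha` and `sigma` values, and the function names are assumptions for the example; in practice both parameters shrink over time.

```python
import math

def best_matching_unit(x, weights):
    """c(x): index of the weight vector closest to the input x."""
    return min(range(len(weights)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, weights[i])))

def som_step(x, weights, positions, alpha, sigma):
    """Apply m_i <- m_i + h_{c(x),i} (x - m_i) to every unit i."""
    c = best_matching_unit(x, weights)
    updated = []
    for m_i, r_i in zip(weights, positions):
        # Squared map distance between unit i and the winning unit.
        d2 = sum((a - b) ** 2 for a, b in zip(r_i, positions[c]))
        h = alpha * math.exp(-d2 / (2.0 * sigma ** 2))
        updated.append([m + h * (xv - m) for m, xv in zip(m_i, x)])
    return c, updated

weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # one weight vector per unit
positions = [(0, 0), (1, 0), (2, 0)]             # unit locations on the map
x = [1.2, 0.9]
c, updated = som_step(x, weights, positions, alpha=0.5, sigma=1.0)
```

The winner moves the most, and its map neighbors move by a smaller amount, which is what produces the topology-preserving layout.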

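Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
        20           2.0650
        30           1.9477
        40           1.9175
SOM     100          2.0604

$$R_{dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}$$

$$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

$$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$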
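The between/within distance ratio can be computed as in the following sketch. It assumes each document has a map position and a reference topic label (read here as $T_{r,i}$); the tiny data set and the names `map_xy` and `ref_topics` are hypothetical.

```python
import math

def dist_ratio(map_xy, ref_topics):
    """R_dist = average between-topic map distance / average within-topic one."""
    between_sum = within_sum = 0.0
    between_cnt = within_cnt = 0
    D = len(map_xy)
    for i in range(D):
        for j in range(i + 1, D):
            d = math.dist(map_xy[i], map_xy[j])  # dist_Map(i, j)
            if ref_topics[i] != ref_topics[j]:
                between_sum += d
                between_cnt += 1
            else:
                within_sum += d
                within_cnt += 1
    dist_between = between_sum / between_cnt
    dist_within = within_sum / within_cnt
    return dist_between / dist_within

map_xy = [(0, 0), (0, 1), (3, 0), (3, 1)]  # document positions on the map
ref_topics = ["a", "a", "b", "b"]          # reference topic of each document
r = dist_ratio(map_xy, ref_topics)
```

A larger ratio means documents of different reference topics sit farther apart on the map than documents of the same topic. The sketch assumes at least one pair of each kind exists; otherwise a count would be zero.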


Page 30: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 30

The K-means Algorithm (cont)

bull Example 1

IR ndash Berlin Chen 31

The K-means Algorithm (cont)

bull Example 2

governmentfinancesports

research

name

IR ndash Berlin Chen 32

The K-means Algorithm (cont)

bull Choice of initial cluster centers (seeds) is important

ndash Pick at randomndash Calculate the mean of all data and generate k initial centers

by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)

divide it range into k equal interval and take the mean of data in each group as the initial center

ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects

bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set

bull Poor seeds will result in sub-optimal clustering

im

im

δm plusmni

IR ndash Berlin Chen 33

The K-means Algorithm (cont)

bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters

ndash Or perturb objects slightly

bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression

bull Map from the original space to l-dimensional spacehypercube

l=log2k (k clusters)Nodes on the hypercube

A linear classifier

IR ndash Berlin Chen 34

The K-means Algorithm (cont)

bull Eg the LBG algorithmndash By Linde Buzo and Gray

Global mean Cluster 1 mean

Cluster 2mean

μ11Σ11ω11μ12Σ12ω12

μ13Σ13ω13 μ14Σ14ω14

Mrarr2M at each iteration

IR ndash Berlin Chen 35

The EM Algorithmbull A soft version of the K-mean algorithm

ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability

distributions

sumixr

( )1cxP i( )11 cP=π

( )22 cP=π

( )KK cP=π

( )2cxP i

( )Ki cxP

( ) ( ) ( )sum=

=ΘK

kkkii cPcxPxP

1

ΘΘ rr

( )( )

( ) ( )⎟⎠⎞

⎜⎝⎛ minusΣminusminus

Σ= minus

kikT

ki

kmki xxcxP μμ

π

rrrrr 1

21exp

2

Continuous caseLikelihood function fordata samples

( ) ( )

( ) ( ) cPcxP

xPP

n

i

K

kkki

n

ii

i

iiprod sum

prod

= =

=

=

=

1 1

1

ΘΘ

ΘΘ

r

rX

A Mixture Gaussian HMM(or A Mixture of Gaussians)

xxx nr

Krr 21=X

( ) ( ) ( )( )

( ) ( )ΘΘmax

ΘΘmax max

tionclassifica

kkik

i

kki

kikk

cPcx

xPcPcx

xcP

r

r

rr

=

Θ=Θ

(iid) ddistributey identicallt independen are sixr xxx n

rL

rr 21=X

IR ndash Berlin Chen 36

The EM Algorithm (cont)

IR ndash Berlin Chen 37

Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B | S1) = 2/4 = 0.5

P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

Each observation contributes to each state in proportion to its weight:

Observation   State S1   State S2
B             0.7        0.3
W             0.4        0.6
B             0.9        0.1
W             0.5        0.5

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6 / 2.5 = 0.64

P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9 / 2.5 = 0.36

P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4 / 1.5 = 0.27

P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1 / 1.5 = 0.73
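The soft-assignment estimate can be reproduced with a few lines of Python. This is a small sketch using the slide's weights; the function name and the data layout (one row of state weights per observation) are assumptions made for illustration.

```python
from collections import defaultdict

def soft_emission_estimates(observations, weights):
    """weights[i][s] is the weight of state s for observation i.
    Returns, per state, the weighted relative frequency of each symbol."""
    n_states = len(weights[0])
    counts = [defaultdict(float) for _ in range(n_states)]
    for obs, row in zip(observations, weights):
        for s, w in enumerate(row):
            counts[s][obs] += w  # fractional (soft) count
    return [{o: c / sum(cnt.values()) for o, c in cnt.items()} for cnt in counts]

obs = ["B", "W", "B", "W"]
w = [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1], [0.5, 0.5]]
est = soft_emission_estimates(obs, w)
```

Hard assignment is the special case in which every row of weights is a one-hot vector.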

The EM Algorithm (cont)

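• E-step (Expectation)

– Derive the complete data likelihood function

$$\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}$$

where $X = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n$ and $C = c_{k_1}, c_{k_2}, \ldots, c_{k_n}$, and $P(X, C \mid \Theta)$ is the complete data likelihood function.

• Note: a product of sums expands into a sum of products

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$

• How many kinds of C are there? $K^n$ kinds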
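The product-of-sums identity used in the derivation, and the count of K^n possible assignments C, can be checked numerically on a small case. The sizes n = 4 and K = 3 and the random terms are arbitrary choices for illustration.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
n, K = 4, 3
a = rng.random((n, K))  # a[i, k] stands in for P(x_i | c_k) P(c_k)

# Left-hand side: product over samples of the sum over clusters.
lhs = np.prod(a.sum(axis=1))

# Right-hand side: sum over every assignment (k_1, ..., k_n) of the product of terms.
assignments = list(itertools.product(range(K), repeat=n))
rhs = sum(math.prod(a[i, k] for i, k in enumerate(ks)) for ks in assignments)

n_assignments = len(assignments)  # equals K ** n
```

The enumeration grows exponentially in n, which is why EM never sums over C explicitly and works with per-sample posteriors instead.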

The EM Algorithm (cont)

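• E-step (Expectation)

– Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden (latent) variable C, conditioned on the known data X and the current parameters $\Theta$

– Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function

• We have shown this property when deriving the HMM-based retrieval model

$$\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= E_C\left[ L_{CM} \right] = E_C\left[ \log P(X, C \mid \hat{\Theta}) \,\middle|\, X, \Theta \right] \\
&= \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})
\end{aligned}$$

Here $\Theta$ is known and $\hat{\Theta}$ is unknown. An increase in the auxiliary function guarantees at least as large an increase in the log likelihood:

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$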

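The EM Algorithm (cont)

• E-step (Expectation)

– The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \right] \log \left[ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i, k} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}$$

where

$$\delta_{k_i, k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}$$

The term in braces equals $P(c_k \mid \vec{x}_i, \Theta)$ (see the next slide).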

The EM Algorithm (cont)

– Note that

$$\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= P(c_k \mid \vec{x}_i, \Theta) \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta) \prod_{j=1, j \ne i}^{n} \left[ \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \\
&= P(c_k \mid \vec{x}_i, \Theta) \cdot 1 \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}$$

The factor $\delta_{k_i, k}$ means that $\vec{x}_i$ can only be aligned to $c_k$. The second step uses the identity

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$

The EM Algorithm (cont)

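• E-step (Expectation)

– The auxiliary function can also be divided into two parts:

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

where

$$\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}$$

is the auxiliary function for the mixture weights, and

$$\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}$$

is the auxiliary function for the cluster distributions.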

The EM Algorithm (cont)

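• M-step (Maximization)

– Remember that

• A function F can be maximized under a constraint by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$, with the constraint $\sum_{j=1}^{N} y_j = 1$.

By applying a Lagrange multiplier $\ell$:

$$\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
\sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j = -\ell \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$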
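The closed-form result y_j = w_j / Σ w_j can be sanity-checked numerically by comparing it against many random points on the probability simplex. The weights below and the use of Dirichlet samples as competitors are assumptions chosen only for this check.

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.array([2.0, 1.0, 0.5])

def F(y):
    """Objective F = sum_j w_j log y_j."""
    return float(np.sum(w * np.log(y)))

# Closed-form maximizer from the Lagrange multiplier argument.
y_star = w / w.sum()

# Random feasible points: each row is non-negative and sums to 1.
candidates = rng.dirichlet(np.ones(len(w)), size=1000)
best_random = max(F(y) for y in candidates)
```

No random candidate should score higher than `y_star`; this is a spot check rather than a proof.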

The EM Algorithm (cont)

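• M-step (Maximization)

– Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians), subject to $\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) = 1$

$$\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}$$

$$\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle \sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

– The numerator $w_k$ is the expected number of times that a sample $\vec{x}_i$ falls in class $c_k$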

The EM Algorithm (cont)

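• M-step (Maximization)

– Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$$

Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

and

$$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

Then

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

where D is a constant that does not depend on $\hat{\Theta}$.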

The EM Algorithm (cont)

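• M-step (Maximization)

– Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\Sigma_k^{-1}$ is symmetric here.

$$\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow\; \hat{\vec{\mu}}_k &= \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}$$

– The denominator $\sum_{i} w_{ik}$ is the expected number of times that a sample $\vec{x}_i$ falls in class $c_k$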

The EM Algorithm (cont)

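• M-step (Maximization)

– Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

Note: $\dfrac{d \det(X)}{dX} = \det(X) \cdot X^{-T}$ and $\dfrac{d(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-T} \vec{a}\, \vec{b}^T X^{-T}$, and $\Sigma_k$ is symmetric here.

$$\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{|\hat{\Sigma}_k|} \cdot |\hat{\Sigma}_k| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} &= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
\Rightarrow\; \hat{\Sigma}_k &= \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}$$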

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
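Putting the E-step posteriors and the M-step updates for the mixture weights, means, and covariances together gives the loop sketched below. This is a minimal illustration, not the lecture's implementation: the initial means `mu0` are assumed to come from elsewhere (for example K-means), the initial covariances are set to the global covariance, and the small `reg` term added to each covariance is an assumption to keep it invertible.

```python
import numpy as np

def e_step(X, pi, mu, cov):
    """Return posteriors w[i, k] = P(c_k | x_i) and the log likelihood log P(X)."""
    n, m = X.shape
    dens = np.empty((n, len(pi)))
    for k in range(len(pi)):
        diff = X - mu[k]
        maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov[k]), diff)
        norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(cov[k]))
        dens[:, k] = pi[k] * np.exp(-0.5 * maha) / norm
    px = dens.sum(axis=1)
    return dens / px[:, None], float(np.log(px).sum())

def m_step(X, w, reg=1e-6):
    """Re-estimate weights, means, and covariances from the posteriors."""
    n, m = X.shape
    nk = w.sum(axis=0)                # expected count per cluster
    pi = nk / n
    mu = (w.T @ X) / nk[:, None]
    cov = np.empty((len(nk), m, m))
    for k in range(len(nk)):
        diff = X - mu[k]
        cov[k] = (w[:, k, None] * diff).T @ diff / nk[k] + reg * np.eye(m)
    return pi, mu, cov

def fit_gmm(X, mu0, max_iter=100, tol=1e-6):
    K, m = mu0.shape
    pi = np.full(K, 1.0 / K)
    mu = mu0.astype(float).copy()
    cov = np.array([np.cov(X.T).reshape(m, m) + 1e-6 * np.eye(m)] * K)
    prev = -np.inf
    for _ in range(max_iter):
        w, ll = e_step(X, pi, mu, cov)
        if ll - prev < tol:           # likelihood has converged
            break
        pi, mu, cov = m_step(X, w)
        prev = ll
    return pi, mu, cov, ll
```

The sketch works with raw densities for readability; for high-dimensional data a log-domain E-step would be numerically safer.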

Hierarchical Document Organization

• Explore the probabilistic latent topical information

– TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters correspond to their distances on the map

• When a cluster has many documents, it can be further analyzed into another map on the next layer

[Figure: two-dimensional tree structure for organized topics]

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]$$

$$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}$$

$$E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
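The neighborhood probabilities P(T_l | Y_k) and the resulting word probabilities can be sketched as matrix operations. The array layout below (topics as rows of map coordinates, probability tables as row-stochastic matrices) and the function names are assumptions for illustration.

```python
import numpy as np

def neighborhood_probs(coords, sigma):
    """Row k holds P(T_l | Y_k), from a Gaussian of the map distance dist(T_k, T_l)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    E = np.exp(-dist2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return E / E.sum(axis=1, keepdims=True)

def word_given_doc(p_topic_doc, p_neighbor, p_word_topic):
    """P(w_j | D_i) = sum_k P(T_k | D_i) sum_l P(T_l | Y_k) P(w_j | T_l).
    Shapes: (N, K) @ (K, K) @ (K, J) -> (N, J)."""
    return p_topic_doc @ p_neighbor @ p_word_topic
```

A small `sigma` makes each topic draw almost only on itself, while a large `sigma` smooths every topic toward the average of all topics on the map.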

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}$$

– EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

where

$$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid Y_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}$$

Hierarchical Document Organization (cont)

bull Example

IR ndash Berlin Chen 54

Hierarchical Document Organization (cont)

bull Example (cont)

IR ndash Berlin Chen 55

Hierarchical Document Organization (cont)

• Self-Organizing Map (SOM)

– A recursive regression process

[Figure: an input layer holding the input vector $x = [x_1, x_2, \ldots, x_n]^T$ and a mapping layer in which each unit i holds a weight vector $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$; the weight vector $m_i$ moves along $x - m_i$ to $m_i'$]

$$m_i(t+1) = m_i(t) + h_{c(x),i}(t)\, [x(t) - m_i(t)]$$

$$c(x) = \arg\min_i \| x - m_i \|$$

where

$$\| x - m_i \| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

$$h_{c(x),i}(t) = \alpha(t) \exp\left( -\frac{\| r_i - r_{c(x)} \|^2}{2\sigma^2(t)} \right)$$
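One step of the SOM update rule can be sketched as follows. The function name and the array layout are assumptions: `weights` holds one weight vector per map unit, `grid` holds the fixed map position r_i of each unit, and `alpha` and `sigma` are the learning rate and neighborhood width at the current time step.

```python
import numpy as np

def som_step(weights, grid, x, alpha, sigma):
    """Apply one SOM update for input x; returns (new_weights, winner_index)."""
    # Winner c(x): the unit whose weight vector is closest to x.
    c = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    # Neighborhood function h_{c(x),i}: Gaussian in map distance to the winner.
    r2 = ((grid - grid[c]) ** 2).sum(axis=1)
    h = alpha * np.exp(-r2 / (2 * sigma ** 2))
    # Every unit moves toward x, scaled by its neighborhood value.
    return weights + h[:, None] * (x - weights), c
```

In a full training run both `alpha` and `sigma` would typically be decreased over time so the map first orders itself globally and then fine-tunes locally.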

Hierarchical Document Organization (cont)

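• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           19165
TMM     20           20650
TMM     30           19477
TMM     40           19175
SOM     100          20604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}$$

$$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

$$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$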
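The distance ratio can be computed directly from the definitions above. In this sketch `coords` is assumed to hold the map position (x, y) of each document and `topics` its reference topic label T_r; both names are illustrative.

```python
import numpy as np

def distance_ratio(coords, topics):
    """R_Dist = mean map distance over different-topic pairs
    divided by mean map distance over same-topic pairs."""
    coords = np.asarray(coords, dtype=float)
    topics = np.asarray(topics)
    i, j = np.triu_indices(len(topics), k=1)       # all pairs with i < j
    d = np.linalg.norm(coords[i] - coords[j], axis=1)
    same = topics[i] == topics[j]
    dist_within = d[same].mean()
    dist_between = d[~same].mean()
    return dist_between / dist_within
```

A larger ratio means that documents sharing a topic sit closer together on the map than documents from different topics.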


IR ndash Berlin Chen 31

The K-means Algorithm (cont)

bull Example 2

governmentfinancesports

research

name

IR ndash Berlin Chen 32

The K-means Algorithm (cont)

bull Choice of initial cluster centers (seeds) is important

ndash Pick at randomndash Calculate the mean of all data and generate k initial centers

by adding small random vector to the meanndash Project data onto the principal component (first eigenvector)

divide it range into k equal interval and take the mean of data in each group as the initial center

ndash Or use another method such as hierarchical clustering algorithm on a subset of the objects

bull Eg buckshot algorithm uses the group-average agglomerative clustering to randomly sample of the data that has size square root of the complete set

bull Poor seeds will result in sub-optimal clustering

im

im

δm plusmni

IR ndash Berlin Chen 33

The K-means Algorithm (cont)

bull How to break ties when in case there are several centers with the same distance from an objectndash Randomly assign the object to one of the candidate clusters

ndash Or perturb objects slightly

bull Applications of the K-means Algorithmndash Clusteringndash Vector quantization ndash A preprocessing stage before classification or regression

bull Map from the original space to l-dimensional spacehypercube

l=log2k (k clusters)Nodes on the hypercube

A linear classifier

IR ndash Berlin Chen 34

The K-means Algorithm (cont)

bull Eg the LBG algorithmndash By Linde Buzo and Gray

Global mean Cluster 1 mean

Cluster 2mean

μ11Σ11ω11μ12Σ12ω12

μ13Σ13ω13 μ14Σ14ω14

Mrarr2M at each iteration

IR ndash Berlin Chen 35

The EM Algorithmbull A soft version of the K-mean algorithm

ndash Each object could be the member of multiple clustersndash Clustering as estimating a mixture of (continuous) probability

distributions

sumixr

( )1cxP i( )11 cP=π

( )22 cP=π

( )KK cP=π

( )2cxP i

( )Ki cxP

( ) ( ) ( )sum=

=ΘK

kkkii cPcxPxP

1

ΘΘ rr

( )( )

( ) ( )⎟⎠⎞

⎜⎝⎛ minusΣminusminus

Σ= minus

kikT

ki

kmki xxcxP μμ

π

rrrrr 1

21exp

2

Continuous caseLikelihood function fordata samples

( ) ( )

( ) ( ) cPcxP

xPP

n

i

K

kkki

n

ii

i

iiprod sum

prod

= =

=

=

=

1 1

1

ΘΘ

ΘΘ

r

rX

A Mixture Gaussian HMM(or A Mixture of Gaussians)

xxx nr

Krr 21=X

( ) ( ) ( )( )

( ) ( )ΘΘmax

ΘΘmax max

tionclassifica

kkik

i

kki

kikk

cPcx

xPcPcx

xcP

r

r

rr

=

Θ=Θ

(iid) ddistributey identicallt independen are sixr xxx n

rL

rr 21=X

IR ndash Berlin Chen 36

The EM Algorithm (cont)

IR ndash Berlin Chen 37

Maximum Likelihood Estimation

bull Hard Assignment

State S1

P(B| S1)=24=05

P(W| S1)=24=05

IR ndash Berlin Chen 38

Maximum Likelihood Estimation

bull Soft Assignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(B| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(B| S2)=(06+05)(03+06+01+05)

=01115=073

IR ndash Berlin Chen 39

The EM Algorithm (cont)

bull Endashstep (Expectation)ndash Derive the complete data likelihood function

( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )

( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]

( )[ ]( )[ ]

( )[ ]sum

sum sumsum

sum sumsum

sum sum prodsum

sum sum prodsum

prod sumprod

=

=

=

=

=

++times

times++=

==

= ==

= ==

= = ==

= = ==

= ==

CCX

X

Θ

Θ

Θ

Θ

ΘΘ

ΘΘΘΘ

ΘΘΘΘ

ΘΘΘΘ

1 121

1

1 121

1

1 1 11

1 1 11

11

1111

1 11

21

2

21

2

2

2

P

cxcxcxP

cxcxcxP

cxP

cPcxP

cPcxPcPcxP

cPcxPcPcxP

cPcxP xPP

K

k

K

kknkk

K

k

K

k

K

kknkk

K

k

K

k

K

k

n

iki

K

k

K

k

K

k

n

ikki

K

k

KKnn

KK

n

i

K

kkki

n

ii

i n

n

i n

n

i n

i

i n

ii

i

ii

rL

rrL

rL

rrL

rL

rL

rr

Lrr

rr

nn kkkk

nn

ccccxxxx

121

121

minus== minus

L

rrL

rr

CX

the complete data likelihood function

( )( )( ) ( )

sum sum sum prod

prod sum

= = = =

= =

=

+++++++++=M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

likelihood function

)kinds( ofkindsmany How

nKC

IR ndash Berlin Chen 40

The EM Algorithm (cont)

bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the

log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data

ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function

bull We have shown this property when deriving the HMM-based retrieval model

( )ΘΘΦ

( )ΘX

( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum

sum

=

=

==Φ

C

C

XCXC

CXX

CX

CXXC

CX

ΘlogΘΘ

ΘlogΘ

ΘloglogΘΘΘΘ

PP

P

PP

PELE CM

( )Θlog XP( )ΘΘΦ

known unknown

( )ΘΘΦ ( )Θlog XP

( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ

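The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta})
&= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \right] \log \left[ \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, k_i} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(c_k \mid \hat{\Theta})\, P(\vec{x}_i \mid c_k, \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta})
 + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

where

$$
\delta_{k, k_i} =
\begin{cases}
1 & \text{if } k = k_i \\
0 & \text{otherwise}
\end{cases}
$$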

The EM Algorithm (cont)

  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \left[ \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

The factor $\delta_{k, k_i}$ means that $\vec{x}_i$ can only be aligned to $c_k$.

Note:

$$
\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk}
= (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM})
= \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}
$$
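The marginalization step above (summing the product of posteriors over every assignment in which sample i is aligned to cluster k) can be verified by brute force. The posterior table `post` below is a made-up example with n = 3 samples and K = 2 clusters; the function name `marginal` is illustrative.

```python
import itertools
import math

# Hypothetical posteriors P(c_k | x_j, theta); each row sums to 1.
post = [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]
n, K = len(post), len(post[0])

def marginal(i, k):
    """Sum over all assignments (k_1..k_n) with k_i == k of prod_j P(c_{k_j} | x_j)."""
    total = 0.0
    for ks in itertools.product(range(K), repeat=n):
        if ks[i] == k:  # the Kronecker delta keeps only assignments aligning x_i to c_k
            total += math.prod(post[j][ks[j]] for j in range(n))
    return total

# Every other sample's posteriors sum to 1, so only P(c_k | x_i) survives.
recovered = [[marginal(i, k) for k in range(K)] for i in range(n)]
```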

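The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts:

$$
\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})
$$

where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta})
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta})
 = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

is the auxiliary function for mixture weights, and

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta})
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

is the auxiliary function for cluster distributions.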

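The EM Algorithm (cont)

• M-step (Maximization)
  – Remember that
    • Maximize a function F with a constraint by applying a Lagrange multiplier

Suppose that

$$
F = \sum_{j=1}^{N} w_j \log y_j, \qquad \text{with the constraint } \sum_{j=1}^{N} y_j = 1
$$

By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
\sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j = -\ell \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore \; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial \log y}{\partial y} = \dfrac{1}{y}$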
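The closed-form result of the Lagrange-multiplier argument (the maximizer is the normalized weight vector) can be sanity-checked against random points on the probability simplex. This is a rough numerical illustration only; the weights `w` are made up and the function names are hypothetical.

```python
import math
import random

def objective(w, y):
    """F = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

def closed_form(w):
    """Constrained maximizer y_j = w_j / sum(w), from the Lagrange condition."""
    total = sum(w)
    return [wj / total for wj in w]

w = [2.0, 1.0, 3.0]  # hypothetical non-negative weights
y_star = closed_form(w)

# Compare against random points that also satisfy sum(y) == 1.
rng = random.Random(0)
best_random = -math.inf
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in w]  # small offset keeps log() finite
    s = sum(raw)
    y = [r / s for r in raw]
    best_random = max(best_random, objective(w, y))
```

No random feasible point should beat `y_star`, which is the pattern reused in the M-step for the mixture weights.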

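The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\bar{\Phi}_a(\Theta, \hat{\Theta})
&= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k}
 + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

$$
\Rightarrow \; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle \sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

Each posterior term is the expected number of times that $\vec{x}_i$ falls in class $c_k$.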
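The mixture-weight update above can be sketched for a one-dimensional Gaussian mixture: compute the posteriors by Bayes' rule, then average them over the samples. The data, starting parameters, and function names (`gaussian_pdf`, `responsibilities`, `update_priors`) are assumptions chosen for illustration, not taken from the slides.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities(xs, priors, means, variances):
    """w[i][k] = P(c_k | x_i, theta), computed by Bayes' rule."""
    table = []
    for x in xs:
        joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
        total = sum(joint)
        table.append([j / total for j in joint])
    return table

def update_priors(resp):
    """pi_hat_k = (1 / n) * sum_i w[i][k]."""
    n = len(resp)
    return [sum(row[k] for row in resp) / n for k in range(len(resp[0]))]

xs = [-2.1, -1.9, 0.1, 1.8, 2.2, 2.0]  # hypothetical 1-D samples
resp = responsibilities(xs, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
new_priors = update_priors(resp)
```

Because every row of `resp` sums to one, the updated priors sum to one automatically, matching the constraint enforced by the Lagrange multiplier.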

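The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
$$

$$
P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2} \left| \Sigma_k \right|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)
$$

Let

$$
w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

and

$$
\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)
$$

Then

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

where $D$ is a constant.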

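The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

$$
\begin{aligned}
\frac{\partial \Phi_b}{\partial \hat{\vec{\mu}}_k}
&= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow \; \hat{\vec{\mu}}_k
&= \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\Sigma_k^{-1}$ is symmetric here.

$w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.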
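The mean update is a posterior-weighted average of the samples. A minimal sketch follows, assuming the posteriors w_ik are already available as a table `resp`; the data and the hard 0/1 posteriors are made up so that the result is easy to verify by hand.

```python
def update_means(xs, resp):
    """mu_hat_k = sum_i w[i][k] * x_i / sum_i w[i][k], for vector-valued x_i."""
    K, dim = len(resp[0]), len(xs[0])
    means = []
    for k in range(K):
        weight = sum(row[k] for row in resp)
        means.append(
            [sum(row[k] * x[d] for row, x in zip(resp, xs)) / weight for d in range(dim)]
        )
    return means

xs = [[0.0, 0.0], [2.0, 0.0], [10.0, 4.0]]     # hypothetical 2-D samples
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # hypothetical posteriors w[i][k]
means = update_means(xs, resp)
```

With 0/1 posteriors, as here, the update reduces to the K-means centroid computation; fractional posteriors give the soft version.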

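The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

Notes: $\dfrac{d \det(X)}{dX} = \det(X) \cdot (X^{-1})^T$ and $\dfrac{d(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1} \vec{a}\, \vec{b}^T X^{-1}$ for symmetric $X$; $\Sigma_k$ is symmetric here.

$$
\begin{aligned}
\frac{\partial \Phi_b}{\partial \hat{\Sigma}_k}
&= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{\left| \hat{\Sigma}_k \right|} \cdot \left| \hat{\Sigma}_k \right| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
&\Rightarrow \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
&\Rightarrow \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$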
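The covariance update is a posterior-weighted average of outer products of the centred samples. The sketch below assumes the posteriors (`resp`) and the already-updated means are given; all values and the function name `update_covariances` are hypothetical and chosen so the result can be checked by hand.

```python
def update_covariances(xs, resp, means):
    """Sigma_hat_k = sum_i w[i][k] (x_i - mu_k)(x_i - mu_k)^T / sum_i w[i][k]."""
    K, dim = len(means), len(xs[0])
    covs = []
    for k in range(K):
        weight = sum(row[k] for row in resp)
        cov = [[0.0] * dim for _ in range(dim)]
        for row, x in zip(resp, xs):
            diff = [x[d] - means[k][d] for d in range(dim)]
            for a in range(dim):
                for b in range(dim):
                    cov[a][b] += row[k] * diff[a] * diff[b]  # weighted outer product
        covs.append([[c / weight for c in r] for r in cov])
    return covs

xs = [[0.0, 0.0], [2.0, 2.0], [5.0, 1.0], [5.0, 3.0]]        # hypothetical samples
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]      # hypothetical posteriors
means = [[1.0, 1.0], [5.0, 2.0]]                             # weighted means of the two groups
covs = update_covariances(xs, resp, means)
```

Practical implementations often add a small value to the diagonal to keep each covariance invertible; that safeguard is not part of the derivation above.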

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
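The termination rule can be sketched as a complete EM loop for a one-dimensional Gaussian mixture. This is an illustration under assumptions: the initial parameters are passed in by hand rather than obtained from K-means, the tolerance `tol`, the iteration cap `max_iter` and the variance floor are choices made here, and `em_gmm_1d` is a hypothetical name.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, priors, means, variances, tol=1e-6, max_iter=200):
    """Run EM; stop when the log-likelihood gain falls below tol or after max_iter."""
    priors, means, variances = list(priors), list(means), list(variances)
    history = []
    for _ in range(max_iter):
        # E-step: posteriors w[i][k] and the log likelihood under the current parameters.
        resp, log_lik = [], 0.0
        for x in xs:
            joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        converged = bool(history) and log_lik - history[-1] < tol
        history.append(log_lik)
        if converged:
            break
        # M-step: closed-form updates for weights, means and variances.
        for k in range(len(priors)):
            weight = sum(row[k] for row in resp)
            priors[k] = weight / len(xs)
            means[k] = sum(row[k] * x for row, x in zip(resp, xs)) / weight
            var = sum(row[k] * (x - means[k]) ** 2 for row, x in zip(resp, xs)) / weight
            variances[k] = max(var, 1e-6)  # variance floor: an added safeguard, not from the slides
    return priors, means, variances, history

xs = [-2.3, -2.0, -1.7, -2.1, 1.9, 2.2, 2.0, 1.8, 2.4]  # hypothetical data
priors, means, variances, history = em_gmm_1d(xs, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The recorded `history` should never decrease, which is a convenient check when debugging an EM implementation.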

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by their distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

Figure: Two-dimensional Tree Structure for Organized Topics

$$
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]
$$

$$
P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)},
\qquad
E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]
$$

$$
dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$
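The neighbourhood probabilities P(T_l | Y_k) depend only on the map coordinates of the topics and on σ. A minimal sketch follows, assuming a 2×2 grid of topics and σ = 1; the coordinates and the function name `topic_neighbourhood` are illustrative assumptions.

```python
import math

def topic_neighbourhood(coords, sigma):
    """Return p[k][l] = P(T_l | Y_k) from the 2-D map coordinates of the topics."""
    table = []
    for xk, yk in coords:
        row = []
        for xl, yl in coords:
            dist = math.hypot(xk - xl, yk - yl)  # Euclidean distance on the map
            row.append(
                math.exp(-dist ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)
            )
        total = sum(row)
        table.append([e / total for e in row])  # normalize over all topics T_s
    return table

coords = [(0, 0), (0, 1), (1, 0), (1, 1)]  # hypothetical 2x2 topic map
p = topic_neighbourhood(coords, sigma=1.0)
```

Topics that are close on the map share probability mass, so a smaller σ makes each latent variable concentrate on its own topic, while a larger σ smooths across neighbours.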

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}
$$

  – EM training can be performed

$$
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}
$$

$$
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}
$$

where

$$
P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid Y_{k'}) \right] P(T_{k'} \mid D_i) \right\}}
$$

Hierarchical Document Organization (cont)

• Criterion for topic word selection

$$
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}
$$
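The topic-word selection score compares how much of a word's occurrence mass falls inside a topic against how much falls outside it. A small sketch for a single word and a single topic follows; the counts, the probabilities, and the name `topic_word_score` are made-up illustrations.

```python
def topic_word_score(counts, topic_given_doc):
    """S(w_j, T_k) for one word and one topic.

    counts[i] is c(w_j, D_i); topic_given_doc[i] is P(T_k | D_i).
    """
    inside = sum(c * p for c, p in zip(counts, topic_given_doc))
    outside = sum(c * (1.0 - p) for c, p in zip(counts, topic_given_doc))
    return inside / outside  # undefined if the word occurs only in documents with P(T_k | D_i) = 1

counts = [3, 0, 1]                  # hypothetical counts of the word in three documents
topic_given_doc = [0.9, 0.2, 0.1]   # hypothetical P(T_k | D_i)
score = topic_word_score(counts, topic_given_doc)
```

A score above 1 indicates that the word occurs mostly in documents dominated by the topic, which is what makes it a candidate label for that topic.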

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

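Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  – A recursive regression process

Figure labels: Mapping Layer; Input Layer; Input Vector $\vec{x} = [x_1, x_2, \ldots, x_n]^T$; Weight Vector $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$ (for example $\vec{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$); $\vec{x} - \vec{m}_i$; $\vec{m}_i'$

$$
\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}), i}(t) \left[ \vec{x}(t) - \vec{m}_i(t) \right]
$$

$$
c(\vec{x}) = \arg\min_i \left\| \vec{x} - \vec{m}_i \right\|,
\qquad \text{where } \left\| \vec{x} - \vec{m}_i \right\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}
$$

$$
h_{c(\vec{x}), i}(t) = \alpha(t) \exp\left( -\frac{\left\| \vec{r}_i - \vec{r}_{c(\vec{x})} \right\|^2}{2\sigma^2(t)} \right)
$$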
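One SOM update step can be sketched directly from the equations above: find the best-matching unit c(x), then move every unit toward the input by an amount that decays with grid distance from that unit. The weights, grid positions, learning rate and neighbourhood width below are hypothetical, as is the name `som_step`.

```python
import math

def som_step(weights, grid, x, alpha, sigma):
    """One SOM update: return the best-matching unit index and the updated weights."""
    dists = [math.dist(x, m) for m in weights]
    c = dists.index(min(dists))  # c(x) = argmin_i ||x - m_i||
    updated = []
    for m, r in zip(weights, grid):
        # Neighbourhood function h_{c(x),i} based on distance between grid positions.
        h = alpha * math.exp(-math.dist(r, grid[c]) ** 2 / (2.0 * sigma ** 2))
        updated.append([mi + h * (xi - mi) for mi, xi in zip(m, x)])
    return c, updated

weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # hypothetical weight vectors m_i
grid = [(0, 0), (1, 0), (2, 0)]                 # hypothetical map positions r_i
x = [0.9, 1.2]
c, updated = som_step(weights, grid, x, alpha=0.5, sigma=1.0)
```

In a full training run, alpha and sigma would typically shrink over time t, as the notation α(t) and σ(t) indicates; this sketch keeps them fixed for a single step.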

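Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$
R_{dist} = \frac{dist_{Between}}{dist_{Within}}
$$

where

$$
dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)},
\qquad
dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}
$$

$$
f_{Between}(i, j) =
\begin{cases}
dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\qquad
C_{Between}(i, j) =
\begin{cases}
1 & T_{r,i} \ne T_{r,j} \\
0 & \text{otherwise}
\end{cases}
$$

$$
f_{Within}(i, j) =
\begin{cases}
dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\qquad
C_{Within}(i, j) =
\begin{cases}
1 & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
$$

$$
dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$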
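The distance ratio can be computed with a single pass over all document pairs. The sketch below assumes that each document has map coordinates and a topic label T_{r,i} (treated here as a plain class label, which is an interpretation); the four documents and the name `distance_ratio` are made up for illustration.

```python
import math

def distance_ratio(positions, topics):
    """R_dist = dist_Between / dist_Within over all document pairs i < j.

    positions[i] is the (x_i, y_i) map location of document i;
    topics[i] is its topic label (assumed here to play the role of T_{r,i}).
    """
    between_sum = within_sum = 0.0
    between_n = within_n = 0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = math.dist(positions[i], positions[j])
            if topics[i] != topics[j]:
                between_sum += d
                between_n += 1
            else:
                within_sum += d
                within_n += 1
    return (between_sum / between_n) / (within_sum / within_n)

positions = [(0, 0), (0, 1), (3, 0), (3, 1)]  # hypothetical map locations
topics = ["a", "a", "b", "b"]                 # hypothetical topic labels
ratio = distance_ratio(positions, topics)
```

A larger ratio means documents with different labels sit farther apart on the map than documents sharing a label.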


The K-means Algorithm (cont)

• Choice of initial cluster centers (seeds) is important
  – Pick at random
  – Calculate the mean of all data and generate k initial centers by adding a small random vector to the mean ($\vec{m}_i = \vec{m} \pm \vec{\delta}$)
  – Project the data onto the principal component (first eigenvector), divide its range into k equal intervals, and take the mean of the data in each group as the initial center
  – Or use another method, such as a hierarchical clustering algorithm, on a subset of the objects
    • E.g., the buckshot algorithm applies group-average agglomerative clustering to a random sample of the data whose size is the square root of the complete set

• Poor seeds will result in sub-optimal clustering
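The second seeding strategy (global mean plus a small random vector) can be sketched as follows. The perturbation size `scale`, the fixed random seed, and the name `seeds_around_mean` are assumptions made for this illustration.

```python
import random

def seeds_around_mean(data, k, scale=0.01, rng=None):
    """Generate k initial centers by adding a small random vector to the global mean."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    dim = len(data[0])
    mean = [sum(p[d] for p in data) / len(data) for d in range(dim)]
    return [[m + rng.uniform(-scale, scale) for m in mean] for _ in range(k)]

data = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # hypothetical data points
centers = seeds_around_mean(data, k=3)
```

Seeds produced this way start very close together, so the first few K-means iterations do most of the work of pulling them apart.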

The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
  – Randomly assign the object to one of the candidate clusters
  – Or perturb objects slightly

• Applications of the K-means algorithm
  – Clustering
  – Vector quantization
  – A preprocessing stage before classification or regression
    • Map from the original space to an l-dimensional space/hypercube, with l = log2 k (k clusters)

Figure labels: Nodes on the hypercube; A linear classifier

The K-means Algorithm (cont)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray

Figure labels: Global mean; Cluster 1 mean; Cluster 2 mean; {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}

M → 2M at each iteration

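The EM Algorithm

• A soft version of the K-means algorithm
  – Each object could be a member of multiple clusters
  – Clustering as estimating a mixture of (continuous) probability distributions

A Mixture Gaussian HMM (or A Mixture of Gaussians), with mixture weights $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$ and cluster distributions $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \ldots, P(\vec{x}_i \mid c_K)$:

$$
P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
$$

Continuous case:

$$
P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2} \left| \Sigma_k \right|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)
$$

Likelihood function for data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.):

$$
P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
$$

Classification:

$$
\max_k P(c_k \mid \vec{x}_i, \Theta) = \max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
$$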
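The mixture density and the classification rule can be sketched for the one-dimensional case. The two-component parameters below are hypothetical, and the function names are illustrative; the multivariate case replaces the scalar variance with a covariance matrix.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, priors, means, variances):
    """P(x | theta) = sum_k P(x | c_k, theta) * P(c_k | theta)."""
    return sum(p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances))

def classify(x, priors, means, variances):
    """Index k maximizing P(x | c_k) * P(c_k); the shared denominator P(x) is dropped."""
    scores = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
    return scores.index(max(scores))

priors, means, variances = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]  # hypothetical parameters
label_left = classify(-1.5, priors, means, variances)
label_right = classify(1.0, priors, means, variances)
density_at_zero = mixture_density(0.0, priors, means, variances)
```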

The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B|S1) = 2/4 = 0.5

P(W|S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

State S1   State S2
0.7        0.3
0.4        0.6
0.9        0.1
0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5)

= 1.6/2.5 = 0.64

P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5)

= 0.9/2.5 = 0.36

P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5)

= 0.4/1.5 = 0.27

P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5)

= 1.1/1.5 = 0.73

IR ndash Berlin Chen 39

The EM Algorithm (cont)

bull Endashstep (Expectation)ndash Derive the complete data likelihood function

( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )

( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]

( )[ ]( )[ ]

( )[ ]sum

sum sumsum

sum sumsum

sum sum prodsum

sum sum prodsum

prod sumprod

=

=

=

=

=

++times

times++=

==

= ==

= ==

= = ==

= = ==

= ==

CCX

X

Θ

Θ

Θ

Θ

ΘΘ

ΘΘΘΘ

ΘΘΘΘ

ΘΘΘΘ

1 121

1

1 121

1

1 1 11

1 1 11

11

1111

1 11

21

2

21

2

2

2

P

cxcxcxP

cxcxcxP

cxP

cPcxP

cPcxPcPcxP

cPcxPcPcxP

cPcxP xPP

K

k

K

kknkk

K

k

K

k

K

kknkk

K

k

K

k

K

k

n

iki

K

k

K

k

K

k

n

ikki

K

k

KKnn

KK

n

i

K

kkki

n

ii

i n

n

i n

n

i n

i

i n

ii

i

ii

rL

rrL

rL

rrL

rL

rL

rr

Lrr

rr

nn kkkk

nn

ccccxxxx

121

121

minus== minus

L

rrL

rr

CX

the complete data likelihood function

( )( )( ) ( )

sum sum sum prod

prod sum

= = = =

= =

=

+++++++++=M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

likelihood function

)kinds( ofkindsmany How

nKC

IR ndash Berlin Chen 40

The EM Algorithm (cont)

bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the

log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data

ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function

bull We have shown this property when deriving the HMM-based retrieval model

( )ΘΘΦ

( )ΘX

( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum

sum

=

=

==Φ

C

C

XCXC

CXX

CX

CXXC

CX

ΘlogΘΘ

ΘlogΘ

ΘloglogΘΘΘΘ

PP

P

PP

PELE CM

( )Θlog XP( )ΘΘΦ

known unknown

( )ΘΘΦ ( )Θlog XP

( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ

IR ndash Berlin Chen 41

The EM Algorithm (cont)bull Endashstep (Expectation)

ndash The auxiliary function ( )ΘΘΦ

( ) ( )( ) ( )( )( ) ( )

( ) ( )

( ) ( )

( )[ ] ( )

( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum

sum sum

sum sum

sum sum

sum sum sum prod

sum sum prodsum

sum sumprod

sum prodprod

sum

= = = =

= =

= =

= =

= = = =

= = ==

= ==

= ==

+=

=

=

=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎥⎦

⎤⎢⎣

⎡=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎥⎦

⎤⎢⎣

⎡⎥⎦

⎤⎢⎣

⎡=

⎥⎦

⎤⎢⎣

⎡⎥⎦

⎤⎢⎣

⎡=

⎥⎦

⎤⎢⎣

⎥⎥⎦

⎢⎢⎣

⎡=

m

k

n

i

m

k

n

ikijkkjk

m

k

n

ikkijk

m

k

n

ikijk

m

k

n

iikki

m

k

n

i ccc

n

jjkkkki

ccc

m

k

n

jjkki

n

ikk

cccki

n

i

n

jjk

ccc

n

iki

n

j j

kj

cxPxcPcPxcP

cPcxPxcP

cxPxcP

xcPcxP

xcPcxP

xcPcxP

cxPxcP

cxPxP

cxP

PP

P

jj

j

j

nkkk

ji

nkkk

ji

nkkk

ij

nkkk

i

j

1 1 1 1

1 1

1 1

1 1

1 1 1

1 11

11

11

ΘlogΘΘlogΘ

ΘΘlogΘ

ΘlogΘ

ΘΘlog

ΘΘlog

ΘΘlog

ΘlogΘ

ΘlogΘ

Θ

ΘlogΘΘ

ΘΘ

21

21

21

21

rrr

rr

rr

rr

rr

rr

rr

rr

r

K

K

K

K

C

C

C

C

C

CXX

CX

δ

δ

⎩⎨⎧ =

=otherwise 0

if 1

kk ikk i

δ

See Next Slide

IR ndash Berlin Chen 42

The EM Algorithm (cont)

ndash Note that

( )

( )[ ]

( )[ ]

( ) ( )

( )

( )Θ

Θ1

ΘΘ

Θ

Θ

Θ

1

1

1 1

1 1 1 1

1 1 1 1

1

1 2

1 2

21

ik

ik

n

ijj

m

cikkk

n

ijj

m

kjk

m

c

m

c

m

c

n

jjkkk

m

c

m

c

m

c

n

jjkkk

ccc

n

jjkkk

xcP

xcP

xcPxcP

xcP

xcP

xcP

ik

ii

j

j

k k nk

ji

k k nk

ji

nkkk

ji

r

r

rr

rL

rL

r

K

=

⎥⎦

⎤⎢⎣

⎡=

⎥⎥⎦

⎢⎢⎣

⎥⎥⎦

⎢⎢⎣

⎥⎥⎦

⎢⎢⎣

⎡=

=

=

⎥⎦

⎤⎢⎣

prod

sumprod sum

sum sum sum prod

sum sum sum prod

sum prod

ne=

=ne= =

= = = =

= = = =

= =

δ

δ

δ

δC

( )( )( ) ( )

sum sum sum prod

prod sum

= = = =

= =

=

+++++++++=M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

ki cx toaligned beonly can r

IR ndash Berlin Chen 43

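The EM Algorithm (cont)

• E-step (Expectation)
– The auxiliary function can also be divided into two parts:

$$
\Phi(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + \Phi_b(\Theta,\hat{\Theta})
$$

where

$$
\begin{aligned}
\Phi_a(\Theta,\hat{\Theta})
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i,\Theta)\,\log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)}\,\log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

(auxiliary function for mixture weights)

$$
\begin{aligned}
\Phi_b(\Theta,\hat{\Theta})
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i,\Theta)\,\log P(\vec{x}_i \mid c_k,\hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(\vec{x}_i \mid c_k,\hat{\Theta})
\end{aligned}
$$

(auxiliary function for cluster distributions)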

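The EM Algorithm (cont)

• M-step (Maximization)
– Remember that
• Maximize a function F with a constraint by applying a Lagrange multiplier

Suppose that

$$
F = \sum_{j=1}^{N} w_j \log y_j
\qquad\text{with the constraint}\qquad \sum_{j=1}^{N} y_j = 1
$$

By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell\Big(\sum_{j=1}^{N} y_j - 1\Big) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0
\;\Rightarrow\; w_j = -\ell\,y_j \quad \forall j \\
\Rightarrow\; \sum_{j=1}^{N} w_j &= -\ell\sum_{j=1}^{N} y_j = -\ell
\;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial \log y}{\partial y} = \dfrac{1}{y}$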
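As a quick sanity check of the Lagrange-multiplier result, the sketch below compares the closed form y_j = w_j / Σ w_j against randomly drawn points on the probability simplex. The weights and helper names are hypothetical and chosen only for illustration.

```python
import math
import random

def constrained_argmax(w):
    """Closed form from the Lagrange derivation: y_j = w_j / sum_j w_j."""
    total = sum(w)
    return [wj / total for wj in w]

def objective(w, y):
    """F(y) = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

w = [3.0, 1.0, 6.0]            # hypothetical non-negative weights
y_star = constrained_argmax(w)

# Compare against random feasible points (y_j > 0, sum_j y_j = 1).
rng = random.Random(0)
best_random = -math.inf
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in w]
    s = sum(raw)
    best_random = max(best_random, objective(w, [r / s for r in raw]))
```

No random feasible point should score higher than `y_star`, which is the behavior the derivation predicts.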

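The EM Algorithm (cont)

• M-step (Maximization)
– Maximize $\Phi_a(\Theta,\hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta,\hat{\Theta})
&= \Phi_a(\Theta,\hat{\Theta}) + \ell\Big(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\Big) \\
&= \sum_{k=1}^{K}\underbrace{\sum_{i=1}^{n}
   \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}_{w_k}
   \,\log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k}
   + \ell\Big(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\Big)
\end{aligned}
$$

$$
\Rightarrow\;
\hat{\pi}_k = P(c_k \mid \hat{\Theta})
= \frac{w_k}{\sum_{k=1}^{K} w_k}
= \frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
       {\displaystyle\sum_{k=1}^{K}\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
= \frac{1}{n}\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}
$$

($w_k$: the expected number of times that $\vec{x}_i$ falls in class $c_k$)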
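The mixture-weight update can be sketched in a few lines. This is a minimal illustration that assumes one-dimensional Gaussian components to stay within the standard library; the data and starting parameters are made up.

```python
import math

def gauss_pdf(x, mu, var):
    """Univariate Gaussian density (1-D stand-in for the cluster distribution)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors(xs, weights, mus, variances):
    """post[i][k] = P(c_k | x_i, Theta) under the current parameters."""
    table = []
    for x in xs:
        joint = [p * gauss_pdf(x, m, v) for p, m, v in zip(weights, mus, variances)]
        total = sum(joint)
        table.append([j / total for j in joint])
    return table

def update_weights(post):
    """pi_hat_k = (1/n) * sum_i P(c_k | x_i, Theta)."""
    n = len(post)
    return [sum(row[k] for row in post) / n for k in range(len(post[0]))]

# Hypothetical data: three points near -2 and two points near 3.
xs = [-2.1, -1.9, -2.0, 3.0, 3.2]
post = posteriors(xs, [0.5, 0.5], [-2.0, 3.0], [1.0, 1.0])
new_weights = update_weights(post)
```

With well-separated clusters the posteriors are nearly hard, so the updated weights come out close to the cluster proportions 3/5 and 2/5.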

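The EM Algorithm (cont)

• M-step (Maximization)
– Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K}
  \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}
  \,\log P(\vec{x}_i \mid c_k,\hat{\Theta})
$$

with

$$
P(\vec{x}_i \mid c_k,\Theta)
= \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}}
  \exp\!\Big(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^{T}\Sigma_k^{-1}(\vec{x}_i-\vec{\mu}_k)\Big)
$$

Let

$$
w_{ik} = \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}
\quad\text{and}\quad
\log P(\vec{x}_i \mid c_k,\hat{\Theta})
= -\frac{m}{2}\log(2\pi) - \frac{1}{2}\log|\hat{\Sigma}_k|
  - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)
$$

Then

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}
  \Big[-\frac{1}{2}\log|\hat{\Sigma}_k|
       - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\Big] + D
$$

where $D$ is a constant.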

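The EM Algorithm (cont)

• M-step (Maximization)
– Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}
  \Big[-\frac{1}{2}\log|\hat{\Sigma}_k|
       - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\Big] + D
$$

Using $\dfrac{d(\vec{x}^{T}C\vec{x})}{d\vec{x}} = (C + C^{T})\vec{x}$, and the fact that $\hat{\Sigma}_k^{-1}$ is symmetric here:

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\vec{\mu}}_k}
&= \sum_{i=1}^{n} w_{ik}\cdot\Big(-\frac{1}{2}\Big)\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1) = 0 \\
\Rightarrow\; \hat{\vec{\mu}}_k
&= \frac{\sum_{i=1}^{n} w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\cdot\vec{x}_i}
        {\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
\end{aligned}
$$

($w_{ik}$: the expected number of times that $\vec{x}_i$ falls in class $c_k$)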
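The mean update is a posterior-weighted average of the samples. A minimal sketch follows; the posterior table `post` is supplied directly rather than computed, and all values are illustrative assumptions.

```python
def update_means(xs, post):
    """mu_hat_k = sum_i w_ik * x_i / sum_i w_ik, for vector-valued samples x_i."""
    n_clusters = len(post[0])
    dim = len(xs[0])
    means = []
    for k in range(n_clusters):
        mass = sum(row[k] for row in post)          # sum_i w_ik
        means.append([
            sum(row[k] * x[d] for row, x in zip(post, xs)) / mass
            for d in range(dim)
        ])
    return means

# Hypothetical 2-D samples and posteriors w_ik (each row sums to 1).
xs = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
post = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
means = update_means(xs, post)
```

If every row of `post` were a one-hot vector, this would reduce to the K-means centroid update, which is one way to see EM as a soft version of K-means.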

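The EM Algorithm (cont)

• M-step (Maximization)
– Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}
  \Big[-\frac{1}{2}\log|\hat{\Sigma}_k|
       - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\Big] + D
$$

Using the identities

$$
\frac{d\,\det(X)}{dX} = \det(X)\cdot X^{-T},
\qquad
\frac{d\,(a^{T}X^{-1}b)}{dX} = -X^{-T}\,a\,b^{T}X^{-T}
$$

and the fact that $\hat{\Sigma}_k$ is symmetric here:

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\Sigma}_k}
&= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}
   \Big[|\hat{\Sigma}_k|^{-1}\cdot|\hat{\Sigma}_k|\cdot\hat{\Sigma}_k^{-1}
        - \hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\Big] = 0 \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1}
&= \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1} \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k
&= \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\hat{\Sigma}_k \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k
&= \sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T} \\
\Rightarrow\; \hat{\Sigma}_k
&= \frac{\sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}
        {\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
\end{aligned}
$$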

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
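Putting the pieces together, the loop below is a minimal sketch of EM for a mixture of one-dimensional Gaussians, seeded by a crude K-means pass and stopped when the log-likelihood gain falls below a tolerance or the iteration cap is hit. The 1-D restriction, the seeding rule, the variance floor, and all names are assumptions made to keep the example short; it also assumes no cluster becomes empty.

```python
import math

def gauss_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def kmeans_1d(xs, k, iters=10):
    """Crude 1-D K-means, used only to seed EM (centers start at spread-out order statistics)."""
    s = sorted(xs)
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda c: (x - centers[c]) ** 2)].append(x)
        centers = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
    return centers, groups

def em_gmm_1d(xs, k, max_iter=200, tol=1e-8, var_floor=1e-6):
    n = len(xs)
    mus, groups = kmeans_1d(xs, k)
    pis = [len(g) / n for g in groups]
    variances = [
        max(sum((x - m) ** 2 for x in g) / len(g), var_floor) if g else 1.0
        for g, m in zip(groups, mus)
    ]
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E-step: posteriors P(c_k | x_i, Theta) and the log-likelihood log P(X | Theta).
        post, ll = [], 0.0
        for x in xs:
            joint = [p * gauss_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]
            px = sum(joint)
            ll += math.log(px)
            post.append([j / px for j in joint])
        if ll - prev_ll < tol:          # likelihood has converged
            break
        prev_ll = ll
        # M-step: closed-form updates for weights, means, and variances.
        for j in range(k):
            mass = sum(row[j] for row in post)
            pis[j] = mass / n
            mus[j] = sum(row[j] * x for row, x in zip(post, xs)) / mass
            variances[j] = max(
                sum(row[j] * (x - mus[j]) ** 2 for row, x in zip(post, xs)) / mass,
                var_floor,
            )
    return pis, mus, variances, ll

xs = [-2.3, -1.9, -2.0, -1.8, -2.1, 2.9, 3.1, 3.0, 3.3, 2.7]   # hypothetical data
pis, mus, variances, ll = em_gmm_1d(xs, k=2)
```

On this toy data the two recovered components sit at the sample means of the two groups with roughly equal weights. A multivariate version would replace the scalar variance with the covariance update and needs a matrix inverse, which is why a numerical library is usually used in practice.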

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information
– TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are placed in the same cluster, and the relationships among the clusters are reflected by their distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

Figure: Two-dimensional Tree Structure for Organized Topics

$$
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i)\Big[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\Big]
$$

$$
E(T_k,T_l) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\Big[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\Big],
\qquad
dist(T_i,T_j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}
$$

$$
P(T_l \mid Y_k) = \frac{E(T_k,T_l)}{\sum_{s=1}^{K} E(T_k,T_s)}
$$
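The neighborhood-smoothed word probability can be sketched as follows. This is a minimal illustration on a hypothetical 2 x 2 topic map; the coordinates, sigma, and probability tables are invented for the example.

```python
import math

def neighborhood(coords, sigma):
    """Return p[k][l] = P(T_l | Y_k), a Gaussian of the map distance, normalized over l."""
    table = []
    for xk, yk in coords:
        row = []
        for xl, yl in coords:
            dist = math.hypot(xk - xl, yk - yl)
            row.append(math.exp(-dist ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma))
        total = sum(row)
        table.append([e / total for e in row])
    return table

def word_given_doc(p_topic_given_doc, p_neighbor, p_word_given_topic, j):
    """P(w_j | D_i) = sum_k P(T_k | D_i) * sum_l P(T_l | Y_k) * P(w_j | T_l)."""
    n_topics = len(p_topic_given_doc)
    return sum(
        p_topic_given_doc[k]
        * sum(p_neighbor[k][l] * p_word_given_topic[l][j] for l in range(n_topics))
        for k in range(n_topics)
    )

coords = [(0, 0), (0, 1), (1, 0), (1, 1)]            # hypothetical 2 x 2 topic map
p_neighbor = neighborhood(coords, sigma=1.0)
p_topic_given_doc = [0.7, 0.1, 0.1, 0.1]             # P(T_k | D_i) for one document
p_word_given_topic = [                               # P(w_j | T_l), 4 topics x 3 words
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.3, 0.4],
]
p_words = [word_given_doc(p_topic_given_doc, p_neighbor, p_word_given_topic, j) for j in range(3)]
```

Because every factor is a proper distribution, the resulting word probabilities still sum to 1. A smaller sigma concentrates each topic's neighborhood on itself, while a larger sigma shares word statistics across nearby topics on the map.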

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log P(w_j \mid D_i) \\
    &= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,
       \log\Big\{\sum_{k=1}^{K} P(T_k \mid D_i)\Big[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\Big]\Big\}
\end{aligned}
$$

– EM training can be performed

$$
\hat{P}(w_j \mid T_k)
= \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k \mid w_j,D_i)}
       {\sum_{j'=1}^{J}\sum_{i'=1}^{N} c(w_{j'},D_{i'})\,P(T_k \mid w_{j'},D_{i'})}
$$

$$
\hat{P}(T_k \mid D_i)
= \frac{\sum_{j=1}^{J} c(w_j,D_i)\,P(T_k \mid w_j,D_i)}{c(D_i)}
$$

where

$$
P(T_k \mid w_j,D_i)
= \frac{\Big[\sum_{l=1}^{K} P(w_j \mid T_l)\,P(T_l \mid T_k)\Big]\cdot P(T_k \mid D_i)}
       {\sum_{k'=1}^{K}\Big\{\Big[\sum_{l=1}^{K} P(w_j \mid T_l)\,P(T_l \mid T_{k'})\Big]\cdot P(T_{k'} \mid D_i)\Big\}}
$$

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$
S(w_j,T_k)
= \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k \mid D_i)}
       {\sum_{i'=1}^{N} c(w_j,D_{i'})\,\big[1 - P(T_k \mid D_{i'})\big]}
$$

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
– A recursive regression process

Figure labels: Input Layer, with input vector $x = [x_1, x_2, \ldots, x_n]^T$; Mapping Layer, with one weight vector per unit, $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$ (e.g., $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$); the update moves $m_i$ along $x - m_i$ to give $m_i'$.

$$
m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,[x(t) - m_i(t)]
$$

$$
c(x) = \arg\min_i \lVert x - m_i \rVert,
\qquad\text{where}\qquad
\lVert x - m_i \rVert = \sqrt{\sum_{n}(x_n - m_{i,n})^2}
$$

$$
h_{c(x),i}(t) = \alpha(t)\exp\!\Big(-\frac{\lVert r_i - r_{c(x)} \rVert^2}{2\sigma^2(t)}\Big)
$$
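One SOM training step can be sketched as below. It assumes that `grid[i]` holds the map coordinate r_i of unit i and that alpha and sigma are passed in as the current values of α(t) and σ(t); those names and the toy numbers are assumptions for illustration.

```python
import math

def best_matching_unit(x, weights):
    """c(x) = argmin_i ||x - m_i||."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))

def som_step(x, weights, grid, alpha, sigma):
    """Apply m_i(t+1) = m_i(t) + h_{c(x),i}(t) * (x(t) - m_i(t)) to every unit i."""
    c = best_matching_unit(x, weights)
    updated = []
    for m_i, r_i in zip(weights, grid):
        # Neighborhood kernel: largest at the winning unit, decaying with map distance.
        h = alpha * math.exp(-math.dist(r_i, grid[c]) ** 2 / (2.0 * sigma ** 2))
        updated.append([m + h * (xn - m) for m, xn in zip(m_i, x)])
    return c, updated

# Hypothetical map of three units laid out on a line.
weights = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
grid = [(0, 0), (0, 1), (0, 2)]
x = [1.0, 0.8]
winner, new_weights = som_step(x, weights, grid, alpha=0.5, sigma=1.0)
```

The winning unit moves halfway toward `x` (alpha = 0.5), and its map neighbors move by smaller fractions. In a full training run, alpha and sigma would typically shrink over time.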

Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

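where

$$
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
$$

$$
dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)},
\qquad
dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}
$$

$$
f_{Between}(i,j) =
\begin{cases}
dist_{Map}(i,j) & T_{r,i} \neq T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\qquad
C_{Between}(i,j) =
\begin{cases}
1 & T_{r,i} \neq T_{r,j} \\
0 & \text{otherwise}
\end{cases}
$$

$$
f_{Within}(i,j) =
\begin{cases}
dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\qquad
C_{Within}(i,j) =
\begin{cases}
1 & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
$$

$$
dist_{Map}(i,j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}
$$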
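The distance ratio can be computed directly from its definition. The sketch below assumes that `positions[i]` is the map coordinate of document i and that `topics[i]` plays the role of T_{r,i}; this reading of T_{r,i} as a per-document topic label is an assumption, as are the toy inputs.

```python
import math
from itertools import combinations

def r_dist(positions, topics):
    """R_Dist = dist_Between / dist_Within over all document pairs (i, j) with j > i."""
    between_sum = within_sum = 0.0
    between_n = within_n = 0
    for i, j in combinations(range(len(positions)), 2):
        d = math.dist(positions[i], positions[j])       # dist_Map(i, j)
        if topics[i] != topics[j]:
            between_sum += d
            between_n += 1
        else:
            within_sum += d
            within_n += 1
    # Assumes both pair types occur and that dist_Within is non-zero.
    return (between_sum / between_n) / (within_sum / within_n)

# Hypothetical layout: two documents per topic, topics three map units apart.
positions = [(0, 0), (0, 1), (3, 0), (3, 1)]
topics = ["a", "a", "b", "b"]
ratio = r_dist(positions, topics)
```

A larger ratio means documents of different topics sit farther apart on the map relative to documents of the same topic.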


The K-means Algorithm (cont)

• How to break ties when several centers are at the same distance from an object
– Randomly assign the object to one of the candidate clusters
– Or perturb objects slightly

• Applications of the K-means Algorithm
– Clustering
– Vector quantization
– A preprocessing stage before classification or regression
• Map from the original space to an l-dimensional space/hypercube, with l = log2 k (k clusters)

Figure labels: nodes on the hypercube; a linear classifier
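The random tie-breaking rule for the assignment step can be sketched as follows. The tolerance `tol`, the seeded generator, and the sample points are assumptions made for the illustration.

```python
import math
import random

def assign_with_random_ties(x, centers, rng, tol=1e-12):
    """Index of the nearest center; ties (within tol) are broken uniformly at random."""
    dists = [math.dist(x, c) for c in centers]
    best = min(dists)
    candidates = [k for k, d in enumerate(dists) if d - best <= tol]
    return rng.choice(candidates)

rng = random.Random(0)
centers = [(0.0, 0.0), (2.0, 0.0), (5.0, 5.0)]

# (1, 0) is equidistant from the first two centers, so either may be chosen.
picks = {assign_with_random_ties((1.0, 0.0), centers, rng) for _ in range(50)}
# (4.9, 5.0) has a unique nearest center.
unique = assign_with_random_ties((4.9, 5.0), centers, rng)
```

The alternative mentioned on the slide, perturbing the objects slightly, avoids exact ties altogether at the cost of modifying the data.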

The K-means Algorithm (cont)

• E.g., the LBG algorithm
– By Linde, Buzo, and Gray

Figure labels: Global mean; Cluster 1 mean; Cluster 2 mean; {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}; M → 2M at each iteration

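The EM Algorithm

• A soft version of the K-means algorithm
– Each object could be a member of multiple clusters
– Clustering as estimating a mixture of (continuous) probability distributions

Figure: A Mixture Gaussian HMM (or A Mixture of Gaussians). The component densities $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \ldots, P(\vec{x}_i \mid c_K)$ are weighted by $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$ and summed.

$$
P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)
$$

Continuous case:

$$
P(\vec{x}_i \mid c_k,\Theta)
= \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}}
  \exp\!\Big(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^{T}\Sigma_k^{-1}(\vec{x}_i-\vec{\mu}_k)\Big)
$$

Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.):

$$
P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta)
= \prod_{i=1}^{n}\sum_{k=1}^{K} P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)
$$

Classification:

$$
\max_k P(c_k \mid \vec{x}_i,\Theta)
= \max_k \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)}
= \max_k P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)
$$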

The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B | S1) = 2/4 = 0.5

P(W | S1) = 2/4 = 0.5

Maximum Likelihood Estimation

• Soft Assignment

Observation   State S1   State S2
B             0.7        0.3
W             0.4        0.6
B             0.9        0.1
W             0.5        0.5

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6 / 2.5 = 0.64

P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9 / 2.5 = 0.36

P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4 / 1.5 = 0.27

P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1 / 1.5 = 0.73
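The soft-assignment estimates can be reproduced with fractional counts. In this minimal sketch the assignment of symbols B and W to the four observations is inferred from the numerators on the slide and should be read as an assumption; the function name is hypothetical.

```python
def soft_emission_estimates(symbols, posteriors):
    """posteriors[i][s] = P(state s | observation i); returns P(symbol | state) for each state."""
    n_states = len(posteriors[0])
    estimates = []
    for s in range(n_states):
        mass = sum(row[s] for row in posteriors)        # total fractional count of state s
        probs = {}
        for sym, row in zip(symbols, posteriors):
            probs[sym] = probs.get(sym, 0.0) + row[s] / mass
        estimates.append(probs)
    return estimates

symbols = ["B", "W", "B", "W"]                          # assumed symbol of each observation
posteriors = [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1], [0.5, 0.5]]
est = soft_emission_estimates(symbols, posteriors)
```

Here `est[0]` holds the estimates for state S1 and `est[1]` those for state S2. Replacing each posterior row with a one-hot vector turns the same function into the hard-assignment estimate.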

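The EM Algorithm (cont)

• E-step (Expectation)
– Derive the complete data likelihood function

$$
\begin{aligned}
P(X \mid \Theta)
&= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta)
 = \prod_{i=1}^{n}\sum_{k_i=1}^{K} P(\vec{x}_i \mid c_{k_i},\Theta)\,P(c_{k_i} \mid \Theta) \\
&= \big[P(\vec{x}_1 \mid c_1,\Theta)P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K,\Theta)P(c_K \mid \Theta)\big]
   \times\cdots\times
   \big[P(\vec{x}_n \mid c_1,\Theta)P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K,\Theta)P(c_K \mid \Theta)\big] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\;\prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i},\Theta)\,P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\;\prod_{i=1}^{n} P(\vec{x}_i,c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} P(\vec{x}_1,c_{k_1},\vec{x}_2,c_{k_2},\ldots,\vec{x}_n,c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X,C \mid \Theta)
\end{aligned}
$$

where $X = \vec{x}_1\vec{x}_2\cdots\vec{x}_{n-1}\vec{x}_n$, $C = c_{k_1}c_{k_2}\cdots c_{k_{n-1}}c_{k_n}$, $P(X \mid \Theta)$ is the likelihood function, and $P(X,C \mid \Theta)$ is the complete data likelihood function.

How many kinds of $C$? ($K^n$ kinds)

Note:

$$
\prod_{t=1}^{T}\sum_{k=1}^{M} a_{t,k}
= (a_{1,1}+a_{1,2}+\cdots+a_{1,M})(a_{2,1}+a_{2,2}+\cdots+a_{2,M})\cdots(a_{T,1}+a_{T,2}+\cdots+a_{T,M})
= \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\;\prod_{t=1}^{T} a_{t,k_t}
$$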

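The EM Algorithm (cont)

• E-step (Expectation)
– Define the auxiliary function $\Phi(\Theta,\hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$

$$
\begin{aligned}
\Phi(\Theta,\hat{\Theta})
&= E[L_{CM}] = E_{C}\big[\log P(X,C \mid \hat{\Theta}) \,\big|\, X,\Theta\big] \\
&= \sum_{C} P(C \mid X,\Theta)\,\log P(X,C \mid \hat{\Theta}) \\
&= \sum_{C} \frac{P(X,C \mid \Theta)}{P(X \mid \Theta)}\,\log P(X,C \mid \hat{\Theta})
\end{aligned}
$$

Here $\Theta$ is known (the current estimate) and $\hat{\Theta}$ is unknown (the new estimate).

– Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta,\hat{\Theta})$
• We have shown this property when deriving the HMM-based retrieval model

$$
\Phi(\Theta,\hat{\Theta}) - \Phi(\Theta,\Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)
$$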


TTE

TTEYTP

1

IR ndash Berlin Chen 51

Hierarchical Document Organization (cont)

bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

ndash EM training can be performed

( ) ( )

( ) ( ) ( ) ( )⎭⎬⎫

⎩⎨⎧sum ⎥

⎤⎢⎣

⎡sumsum sum=

sum sum=

= == =

= =

K

k

K

lljklik

N

i

J

nij

ijN

i

J

nijT

TwPYTPDTPDwc

DwPDwcL

1 11 1

1 1

log

log

( )( ) ( )

( ) ( )sum sum

sum=

=prime =primeprimeprimeprimeprime

=J

j

N

iijkij

N

iijkij

kjDwTPDwc

DwTPDwcTwP

1 1

1

|

||ˆ

( )( ) ( )

( ) |

|ˆ 1

i

J

jijkij

ik Dc

DwTPDwcDTP

sum= =

where

( )( ) ( ) ( )

( ) ( ) ( )sum⎭⎬⎫

⎩⎨⎧

sdot⎥⎦

⎤⎢⎣

⎡sum

sdot⎥⎦

⎤⎢⎣

⎡sum

=prime

=primeprime

=primeprimeprimeprime

=K

kik

K

lkllj

ikK

lkllj

ijk

DTPTTPTwP

DTPTTPTwPDwTP

1 1

1

|||

||||

IR ndash Berlin Chen 52

Hierarchical Document Organization (cont)

bull Criterion for Topic Word Selecting

( )( ) ( )

( ) ( )sum minus

sum=

=primeprimeprime

=N

iikij

N

iikij

kjDTPDwc

DTPDwcTwS

1

1

]|1[

|

IR ndash Berlin Chen 53

Hierarchical Document Organization (cont)

bull Example

IR ndash Berlin Chen 54

Hierarchical Document Organization (cont)

bull Example (cont)

IR ndash Berlin Chen 55

Hierarchical Document Organization (cont)

bull Self-Organization Map (SOM) ndash A recursive regression process

[ ]Tnmmmm 121111 =

(Mapping Layer

Input Layer

[ ]Tnxxxx 21=Input Vector

[ ]Tniiii mmmm 21 =

Weight Vector

)]()()[()()1( )( tmtxthtmtm iixcii minus+=+

ii

mxxc primeprime

minus= minarg)(

where( )sum minus=minus primeprime n nini mxmx 2

⎟⎟⎟

⎜⎜⎜

⎛ minusminus=

)(2exp)()( 2

2

)()( t

rrtth xci

ixc σα

imx

ii mx minus

imprime

IR ndash Berlin Chen 56

Hierarchical Document Organization (cont)

bull Results

20604100SOM1917540194773020650201916510

TMM

distBetweendistWithinIterationsModel

Within

BetweenDist dist

distR =

sumsum

sumsum

= +=

= +==D

i

D

ijBetween

D

i

D

ijBetween

Between

jiC

jifdist

1 1

1 1

)(

)(⎩⎨⎧ ne

=otherwise

TTijdistjif jrirMap

Between 0

)()(

( ) ( )22)( jijiMap yyxxijdist minus+minus=

⎩⎨⎧ ne

= 0

1 )(

otherwise

TTjiC jrir

Between

sumsum

sumsum

= +=

= +== D

i

D

ijWithin

D

i

D

ijWithin

Within

jiC

jifdist

1 1

1 1

)(

)(⎩⎨⎧ =

= 0

)()(

otherwise

TTijdistjif jrirMap

Within

⎪⎩

⎪⎨⎧ =

= 0

1 )(

otherwise

TTjiC jrir

Within

where



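IR – Berlin Chen 34

The K-means Algorithm (cont.)

• E.g., the LBG algorithm
  – By Linde, Buzo, and Gray

[Figure: binary splitting of cluster means. The global mean is split into the Cluster 1 mean and the Cluster 2 mean, which are split again into four clusters with parameters {μ11, Σ11, ω11}, {μ12, Σ12, ω12}, {μ13, Σ13, ω13}, {μ14, Σ14, ω14}. M → 2M at each iteration.]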
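The M → 2M splitting above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exact LBG procedure from the original paper: the additive perturbation `eps`, the fixed number of refinement passes, and the function names `nearest`, `kmeans`, and `lbg` are all choices made here for illustration.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))

def kmeans(points, centroids, iters=20):
    """Plain K-means refinement starting from the given centroids."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def lbg(points, target_size, eps=0.01):
    """Start from the global mean, then double the codebook (M -> 2M) until
    it holds target_size centroids (target_size is assumed a power of two)."""
    centroids = [[sum(col) / len(points) for col in zip(*points)]]  # global mean
    while len(centroids) < target_size:
        # Split every centroid into two slightly perturbed copies (assumed +/- eps).
        centroids = [[c + s * eps for c in cen]
                     for cen in centroids for s in (+1, -1)]
        centroids = kmeans(points, centroids)
    return centroids
```

Calling `lbg(points, 4)` on a list of equal-length coordinate lists returns four centroids after two splitting rounds.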

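IR – Berlin Chen 35

The EM Algorithm

• A soft version of the K-means algorithm
  – Each object can be a member of multiple clusters
  – Clustering is treated as estimating a mixture of (continuous) probability distributions

[Figure: a mixture of Gaussians (or a mixture Gaussian HMM). The sample $\vec{x}_i$ is scored by each component density $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \ldots, P(\vec{x}_i \mid c_K)$, weighted by the priors $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$, and the results are summed.]

$$
P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
$$

Continuous case:

$$
P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^{T} \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)
$$

Likelihood function for the data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.):

$$
P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
$$

Classification:

$$
\max_{k} P(c_k \mid \vec{x}_i, \Theta) = \max_{k} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \max_{k} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)
$$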
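The mixture density and the classification rule above can be made concrete with a small sketch. It is restricted to the univariate case (m = 1) for brevity, and the function names and parameter values are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density P(x | c_k)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_density(x, priors, means, variances):
    """P(x | Theta) = sum_k P(x | c_k) P(c_k)."""
    return sum(p * gaussian_pdf(x, m, v)
               for p, m, v in zip(priors, means, variances))

def posteriors(x, priors, means, variances):
    """P(c_k | x, Theta) for every component k (soft membership)."""
    joint = [p * gaussian_pdf(x, m, v)
             for p, m, v in zip(priors, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def classify(x, priors, means, variances):
    """Index of the component with the largest posterior."""
    post = posteriors(x, priors, means, variances)
    return max(range(len(post)), key=post.__getitem__)
```

Because the denominator P(x | Θ) is the same for every k, `classify` would return the same index if it compared the unnormalized products instead.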

IR – Berlin Chen 36

The EM Algorithm (cont.)

IR – Berlin Chen 37

Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B | S1) = 2/4 = 0.5

P(W | S1) = 2/4 = 0.5

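IR – Berlin Chen 38

Maximum Likelihood Estimation

• Soft Assignment

Each of the four samples belongs to State S1 and State S2 with the following weights:

Sample   State S1   State S2
B        0.7        0.3
W        0.4        0.6
B        0.9        0.1
W        0.5        0.5

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6 / 2.5 = 0.64

P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9 / 2.5 = 0.36

P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4 / 1.5 = 0.27

P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1 / 1.5 = 0.73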
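The soft-assignment estimates above are weighted relative frequencies, which a short sketch can reproduce. The data layout and the function name `soft_mle` are assumptions made for this example, and the pairing of colours with weights is read off the four formulas on the slide.

```python
# Each sample: (colour, weight in state S1, weight in state S2).
samples = [("B", 0.7, 0.3), ("W", 0.4, 0.6), ("B", 0.9, 0.1), ("W", 0.5, 0.5)]

def soft_mle(samples, color, state):
    """Weighted relative frequency of `color` in `state` (0 for S1, 1 for S2)."""
    num = sum(s[1 + state] for s in samples if s[0] == color)
    den = sum(s[1 + state] for s in samples)
    return num / den

p_b_s1 = soft_mle(samples, "B", 0)   # 1.6 / 2.5
p_w_s1 = soft_mle(samples, "W", 0)   # 0.9 / 2.5
p_b_s2 = soft_mle(samples, "B", 1)   # 0.4 / 1.5
p_w_s2 = soft_mle(samples, "W", 1)   # 1.1 / 1.5
```

With hard assignment every weight would be 0 or 1, and the same function would reduce to plain counting.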

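IR – Berlin Chen 39

The EM Algorithm (cont.)

• E-step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta)
 = \prod_{i=1}^{n} \left[ \sum_{k_i=1}^{K} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \\
&\qquad \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
$$

where $X = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n$ and $C = c_{k_1}, c_{k_2}, \ldots, c_{k_n}$. Here $P(X \mid \Theta)$ is the likelihood function and $P(X, C \mid \Theta)$ is the complete data likelihood function.

How many kinds of $C$ are there? ($K^n$ kinds)

Note:

$$
\prod_{t=1}^{T} \sum_{k_t=1}^{M} a_{t k_t}
 = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM})
 = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}
$$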
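The product-of-sums identity in the note, and the count of alignments C, can be checked by brute force on a tiny case. The matrix values below are made up purely for illustration.

```python
import math
from itertools import product

# a[t][k]: T = 2 samples (rows) and M = 3 clusters (columns); made-up values.
a = [[0.2, 0.5, 0.3],
     [0.6, 0.1, 0.3]]
T, M = len(a), len(a[0])

# Left-hand side: product over samples of the per-sample sums.
product_of_sums = math.prod(sum(row) for row in a)

# Right-hand side: enumerate every alignment (k_1, ..., k_T) and add up
# the product of the selected entries.
alignments = list(product(range(M), repeat=T))
sum_of_products = sum(
    math.prod(a[t][k] for t, k in enumerate(ks)) for ks in alignments
)
```

The enumeration visits M**T alignments, which is why the E-step avoids it and works with per-sample posteriors instead.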

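IR – Berlin Chen 40

The EM Algorithm (cont.)

• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$

$$
\Phi(\Theta, \hat{\Theta}) = E\left[ L_{CM} \right]
 = E_{C}\!\left[ \log P(X, C \mid \hat{\Theta}) \,\middle|\, X, \Theta \right]
 = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta})
 = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})
$$

    Here $\Theta$ is known (the current estimate) and $\hat{\Theta}$ is unknown (the new estimate).

  – Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$
\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)
$$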

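IR – Berlin Chen 41

The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta})
&= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \right] \log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k_i, k} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta})
 + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

where

$$
\delta_{k_i, k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}
$$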

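IR – Berlin Chen 42

The EM Algorithm (cont.)

  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k_i, k} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

    because of $\delta_{k_i, k}$, $\vec{x}_i$ can only be aligned to $c_k$.

Note:

$$
\prod_{t=1}^{T} \sum_{k_t=1}^{M} a_{t k_t}
 = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM})
 = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}
$$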

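IR – Berlin Chen 43

The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts:

$$
\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})
$$

where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta})
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta})
 = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

is the auxiliary function for the mixture weights, and

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta})
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

is the auxiliary function for the cluster distributions.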

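IR – Berlin Chen 44

The EM Algorithm (cont.)

• M-step (Maximization)
  – Remember that
    • A function $F$ can be maximized under a constraint by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$ with the constraint $\sum_{j=1}^{N} y_j = 1$. By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
&\Rightarrow\; \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$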
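The closed-form result of the Lagrange-multiplier argument above can be sanity-checked numerically. The sketch below compares the normalized weights against randomly drawn points on the probability simplex; the weights, the random seed, and the helper names are arbitrary choices made for illustration.

```python
import math
import random

def objective(w, y):
    """F = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

def lagrange_solution(w):
    """y_j = w_j / sum_j' w_j', the constrained maximizer derived above."""
    total = sum(w)
    return [wj / total for wj in w]

def random_simplex_point(n, rng):
    """A random positive vector rescaled so that its entries sum to 1."""
    draws = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(draws)
    return [d / s for d in draws]

w = [2.0, 1.0, 5.0]
y_star = lagrange_solution(w)
rng = random.Random(0)
best_random = max(objective(w, random_simplex_point(len(w), rng))
                  for _ in range(1000))
```

No random candidate should score higher than `y_star`, which matches the derivation.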

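IR – Berlin Chen 45

The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for the mixture weights (or priors for the Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta})
&= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k}
 + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

$$
\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
 = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle \sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
 = \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

The posterior inside the sum is the expected number of times that $\vec{x}_i$ falls in class $c_k$.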

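IR – Berlin Chen 46

The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
$$

$$
P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\, |\hat{\Sigma}_k|^{1/2}} \exp\!\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)
$$

Let

$$
w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

and

$$
\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \cdot \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)
$$

Then

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

where $D$ is a constant.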

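IR – Berlin Chen 47

The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k}
&= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0 \\
\Rightarrow\; \hat{\vec{\mu}}_k
&= \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$

Note: $\dfrac{d(\vec{x}^{T} C \vec{x})}{d\vec{x}} = (C + C^{T})\, \vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here.

$w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.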

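IR – Berlin Chen 48

The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

Notes ($\hat{\Sigma}_k$ is symmetric here):

$$
\frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \qquad
\frac{d(\vec{a}^{T} X^{-1} \vec{b})}{dX} = -X^{-1} \vec{a}\, \vec{b}^{T} X^{-1}
$$

$$
\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k}
&= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ |\hat{\Sigma}_k|^{-1} \cdot |\hat{\Sigma}_k| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} \right] = 0 \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \\
&\Rightarrow\; \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n} w_{ik}}
 = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^{T}}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$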

IR – Berlin Chen 49

The EM Algorithm (cont.)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
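Putting the E-step, the three M-step updates, and the stopping rule together, a complete EM loop can be sketched as follows. To stay short it handles only univariate Gaussians (m = 1), takes the initial parameters as arguments (for example from a K-means run), and adds a small variance floor `var_floor`; the floor, the tolerance, and all function names are assumptions of this sketch rather than part of the slides.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, means, variances, priors,
              max_iter=100, tol=1e-8, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture; returns (priors, means, variances)."""
    means, variances, priors = list(means), list(variances), list(priors)
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: w[i][k] = P(c_k | x_i, Theta) under the current parameters.
        w, ll = [], 0.0
        for x in xs:
            joint = [p * gaussian_pdf(x, m, v)
                     for p, m, v in zip(priors, means, variances)]
            total = sum(joint)            # P(x_i | Theta)
            ll += math.log(total)         # accumulates log P(X | Theta)
            w.append([j / total for j in joint])
        if ll - prev_ll < tol:            # log-likelihood has converged
            break
        prev_ll = ll
        # M-step: closed-form updates for weights, means, and variances.
        for k in range(len(means)):
            wk = sum(row[k] for row in w)
            priors[k] = wk / len(xs)
            means[k] = sum(row[k] * x for row, x in zip(w, xs)) / wk
            var = sum(row[k] * (x - means[k]) ** 2 for row, x in zip(w, xs)) / wk
            variances[k] = max(var, var_floor)  # assumed guard against collapse
    return priors, means, variances
```

For instance, `em_gmm_1d([0.9, 1.0, 1.1, 4.9, 5.0, 5.1], [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])` moves the two means toward the two groups of samples.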

IR – Berlin Chen 50

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

• Documents are clustered by their latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are placed in the same cluster, and the relationships among clusters are reflected by their distance on the map

• When a cluster has many documents, it can be further analyzed into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]
$$

$$
E(T_l, T_k) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\!\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right], \qquad
dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$

$$
P(T_l \mid Y_k) = \frac{E(T_l, T_k)}{\sum_{s=1}^{K} E(T_s, T_k)}
$$
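The neighbourhood weights P(T_l | Y_k) and the resulting word probability can be computed directly from the formulas above. The sketch below assumes each topic is given as an (x, y) map coordinate; the function names and argument layout are illustrative, not taken from the slides.

```python
import math

def neighbor_probs(coords, sigma):
    """Return table[k][l] = P(T_l | Y_k) for topics placed at map coords."""
    def e(a, b):
        d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2   # dist(T_k, T_l)^2
        return math.exp(-d2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    table = []
    for ck in coords:
        row = [e(cl, ck) for cl in coords]
        total = sum(row)                                # sum_s E(T_s, T_k)
        table.append([v / total for v in row])
    return table

def word_prob(p_topic_given_doc, p_word_given_topic, neighbor):
    """P(w_j | D_i) = sum_k P(T_k | D_i) * sum_l P(T_l | Y_k) * P(w_j | T_l)."""
    K = len(p_topic_given_doc)
    return sum(
        p_topic_given_doc[k]
        * sum(neighbor[k][l] * p_word_given_topic[l] for l in range(K))
        for k in range(K)
    )
```

As sigma shrinks, each row of the table concentrates on the topic itself, and `word_prob` approaches the ordinary sum over topics of P(T_k | D_i) P(w_j | T_k).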

IR – Berlin Chen 51

Hierarchical Document Organization (cont.)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}
$$

  – EM training can be performed

$$
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}
$$

$$
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}
$$

where

$$
P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}
$$

IR – Berlin Chen 52

Hierarchical Document Organization (cont.)

• Criterion for topic word selection

$$
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}
$$

IR – Berlin Chen 53

Hierarchical Document Organization (cont.)

• Example

IR – Berlin Chen 54

Hierarchical Document Organization (cont.)

• Example (cont.)

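IR – Berlin Chen 55

Hierarchical Document Organization (cont.)

• Self-Organization Map (SOM)
  – A recursive regression process

[Figure: an input layer fully connected to a mapping layer. The input vector is $\vec{x} = [x_1, x_2, \ldots, x_n]^{T}$; each unit $i$ of the mapping layer holds a weight vector $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^{T}$, e.g. $\vec{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^{T}$. The update moves $\vec{m}_i$ along $\vec{x} - \vec{m}_i$ to its new value $\vec{m}_i'$.]

$$
\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}), i}(t) \left[ \vec{x}(t) - \vec{m}_i(t) \right]
$$

$$
c(\vec{x}) = \arg\min_{i} \lVert \vec{x} - \vec{m}_i \rVert, \qquad \text{where } \lVert \vec{x} - \vec{m}_i \rVert = \sqrt{\sum_{d=1}^{n} (x_d - m_{i,d})^2}
$$

$$
h_{c(\vec{x}), i}(t) = \alpha(t) \exp\!\left( -\frac{\lVert \vec{r}_i - \vec{r}_{c(\vec{x})} \rVert^2}{2\sigma^2(t)} \right)
$$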

IR ndash Berlin Chen 56

Hierarchical Document Organization (cont)

bull Results

20604100SOM1917540194773020650201916510

TMM

distBetweendistWithinIterationsModel

Within

BetweenDist dist

distR =

sumsum

sumsum

= +=

= +==D

i

D

ijBetween

D

i

D

ijBetween

Between

jiC

jifdist

1 1

1 1

)(

)(⎩⎨⎧ ne

=otherwise

TTijdistjif jrirMap

Between 0

)()(

( ) ( )22)( jijiMap yyxxijdist minus+minus=

⎩⎨⎧ ne

= 0

1 )(

otherwise

TTjiC jrir

Between

sumsum

sumsum

= +=

= +== D

i

D

ijWithin

D

i

D

ijWithin

Within

jiC

jifdist

1 1

1 1

)(

)(⎩⎨⎧ =

= 0

)()(

otherwise

TTijdistjif jrirMap

Within

⎪⎩

⎪⎨⎧ =

= 0

1 )(

otherwise

TTjiC jrir

Within

where


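The EM Algorithm

• A soft version of the K-means algorithm

– Each object could be a member of multiple clusters

– Clustering as estimating a mixture of (continuous) probability distributions

A Mixture Gaussian HMM (or A Mixture of Gaussians): each sample $\vec{x}_i$ is generated by summing $K$ component distributions $P(\vec{x}_i \mid c_1), P(\vec{x}_i \mid c_2), \ldots, P(\vec{x}_i \mid c_K)$, weighted by the priors $\pi_1 = P(c_1), \pi_2 = P(c_2), \ldots, \pi_K = P(c_K)$.

$$P(\vec{x}_i \mid \Theta) = \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

Continuous case (with $m$ the dimension of $\vec{x}_i$):

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{\sqrt{(2\pi)^m\, |\Sigma_k|}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)$$

Likelihood function for data samples $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, where the $\vec{x}_i$'s are independent and identically distributed (i.i.d.):

$$P(X \mid \Theta) = \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$

Classification:

$$\arg\max_k P(c_k \mid \vec{x}_i, \Theta) = \arg\max_k \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} = \arg\max_k P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)$$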

The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B | S1) = 2/4 = 0.5

P(W | S1) = 2/4 = 0.5

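Maximum Likelihood Estimation

• Soft Assignment

Membership weights of the four samples (B, W, B, W) in the two states:

Sample   State S1   State S2
B        0.7        0.3
W        0.4        0.6
B        0.9        0.1
W        0.5        0.5

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6 / 2.5 = 0.64

P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9 / 2.5 = 0.36

P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4 / 1.5 = 0.27

P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1 / 1.5 = 0.73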
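The soft-assignment estimates on this slide can be reproduced with fractional counts. The sketch below is only an illustration: the function name `soft_estimates` and the data layout are assumptions, and the pairing of colours with weight rows is inferred from the numerators used on the slide.

```python
# Each observation: (symbol, weight in S1, weight in S2), as on the slide.
observations = [
    ("B", 0.7, 0.3),
    ("W", 0.4, 0.6),
    ("B", 0.9, 0.1),
    ("W", 0.5, 0.5),
]


def soft_estimates(obs):
    """Return {state: {symbol: P(symbol | state)}} from fractional counts."""
    totals = {"S1": 0.0, "S2": 0.0}
    counts = {"S1": {}, "S2": {}}
    for symbol, w1, w2 in obs:
        for state, weight in (("S1", w1), ("S2", w2)):
            # Each sample contributes its membership weight, not a full count.
            totals[state] += weight
            counts[state][symbol] = counts[state].get(symbol, 0.0) + weight
    return {
        state: {sym: c / totals[state] for sym, c in counts[state].items()}
        for state in counts
    }


probs = soft_estimates(observations)
for state in ("S1", "S2"):
    print(state, {sym: round(p, 2) for sym, p in probs[state].items()})
```

Setting every weight to 0 or 1 turns the same routine into the hard-assignment estimate.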

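The EM Algorithm (cont)

• E-step (Expectation)

– Derive the complete data likelihood function

$$\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \right] \\
&= \left[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \right] \times \cdots \times \left[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}$$

where $X = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n$ and $C = c_{k_1}, c_{k_2}, \ldots, c_{k_n}$. $P(X, C \mid \Theta)$ is the complete data likelihood function and $P(X \mid \Theta)$ is the likelihood function.

Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t,k} = (a_{1,1} + a_{1,2} + \cdots + a_{1,M})(a_{2,1} + a_{2,2} + \cdots + a_{2,M}) \cdots (a_{T,1} + a_{T,2} + \cdots + a_{T,M}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t,k_t}$$

How many kinds of $C$? ($K^n$ kinds)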
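The identity used in this derivation (a product of sums equals a sum over all assignment sequences of products) can be checked numerically on a small case. This is a minimal sketch; the matrix `a` and both function names are made up for illustration.

```python
import itertools
import math


def product_of_sums(a):
    """Left-hand side: prod over t of (sum over k of a[t][k])."""
    return math.prod(sum(row) for row in a)


def sum_over_assignments(a):
    """Right-hand side: sum over every (k_1, ..., k_T) of prod_t a[t][k_t]."""
    num_components = len(a[0])
    total = 0.0
    for ks in itertools.product(range(num_components), repeat=len(a)):
        total += math.prod(a[t][k] for t, k in enumerate(ks))
    return total


# T = 4 samples, M = 3 components; the values are arbitrary positive numbers.
a = [
    [0.2, 0.5, 0.1],
    [0.3, 0.3, 0.9],
    [0.4, 0.6, 0.2],
    [0.7, 0.1, 0.5],
]
lhs = product_of_sums(a)
rhs = sum_over_assignments(a)
n_assignments = len(a[0]) ** len(a)  # number of distinct sequences C
print(lhs, rhs, n_assignments)
```

The enumeration visits M^T sequences, which is why the explicit sum over C is only feasible for toy sizes and the factored form is used in practice.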

The EM Algorithm (cont)

• E-step (Expectation)

– Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$

$$\Phi(\Theta, \hat{\Theta}) = E[L_{CM}] = E_C\left[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \right] = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})$$

Here $\Theta$ (the current model) is known and $\hat{\Theta}$ (the new estimate) is unknown.

– Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$

• We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$

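The EM Algorithm (cont)

• E-step (Expectation)

– The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \right] \log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{C} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} \delta_{k,k_i} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \right] \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k,k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(c_k \mid \hat{\Theta})\, P(\vec{x}_i \mid c_k, \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}$$

where

$$\delta_{k,k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}$$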

The EM Algorithm (cont)

– Note that

$$\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k,k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) &= \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}$$

Because of $\delta_{k,k_i}$, $\vec{x}_i$ can only be aligned to $c_k$.

Note:

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t,k} = (a_{1,1} + a_{1,2} + \cdots + a_{1,M})(a_{2,1} + a_{2,2} + \cdots + a_{2,M}) \cdots (a_{T,1} + a_{T,2} + \cdots + a_{T,M}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t,k_t}$$

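The EM Algorithm (cont)

• E-step (Expectation)

– The auxiliary function can also be divided into two parts:

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

where the auxiliary function for mixture weights is

$$\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}$$

and the auxiliary function for cluster distributions is

$$\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}$$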
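Both parts of the auxiliary function are weighted by the same posterior, P(c_k | x_i, Θ), so the E-step amounts to computing that posterior for every sample and cluster. The following is a minimal sketch for one-dimensional Gaussian components; the function names and the toy parameters are assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(xs, priors, means, variances):
    """E-step weights w[i][k] = P(c_k | x_i, Theta) under the current model."""
    w = []
    for x in xs:
        # Joint P(x_i, c_k | Theta) = P(x_i | c_k, Theta) * P(c_k | Theta).
        joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
        total = sum(joint)  # P(x_i | Theta), the normaliser
        w.append([j / total for j in joint])
    return w


xs = [-2.1, -1.9, 0.0, 2.0, 2.2]
w = responsibilities(xs, priors=[0.5, 0.5], means=[-2.0, 2.0], variances=[1.0, 1.0])
for x, row in zip(xs, w):
    print(x, [round(v, 4) for v in row])
```

Each row sums to one, which is the soft counterpart of assigning a sample to exactly one cluster.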

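The EM Algorithm (cont)

• M-step (Maximization)

– Remember that

• Maximize a function F with a constraint by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$, with the constraint $\sum_{j=1}^{N} y_j = 1$. By applying a Lagrange multiplier $\ell$:

$$\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
\sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$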
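The closed-form result of this Lagrange argument (each y_j is its weight divided by the sum of the weights) can be sanity-checked numerically. The sketch below uses made-up weights and compares the closed-form solution with many randomly drawn distributions; it is a check of the result, not a derivation.

```python
import math
import random


def objective(w, y):
    """F = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))


w = [3.0, 1.0, 6.0]                    # arbitrary positive weights
y_star = [wj / sum(w) for wj in w]     # closed-form maximiser y_j = w_j / sum(w)

rng = random.Random(0)
best_random = -math.inf
for _ in range(10000):
    raw = [rng.random() + 1e-12 for _ in w]
    y = [r / sum(raw) for r in raw]    # a random point satisfying sum(y) = 1
    best_random = max(best_random, objective(w, y))

print(objective(w, y_star), best_random)
```

No random candidate should exceed the value at `y_star`, since the objective is concave on the simplex and the stationary point found above is its maximum.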

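The EM Algorithm (cont)

• M-step (Maximization)

– Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$\begin{aligned}
\bar{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \left[ \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \right] \log P(c_k \mid \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}$$

Here the bracketed sum plays the role of $w_k$ and $P(c_k \mid \hat{\Theta})$ plays the role of $y_k$, so

$$\hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle \sum_{k'=1}^{K} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}} = \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

Each term of the sum over $i$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.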

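The EM Algorithm (cont)

• M-step (Maximization)

– Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

$$P(\vec{x}_i \mid c_k, \hat{\Theta}) = \frac{1}{\sqrt{(2\pi)^m\, |\hat{\Sigma}_k|}} \exp\left( -\frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right)$$

Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

and

$$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \log(2\pi) - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

Then, with $D$ a constant,

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$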

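The EM Algorithm (cont)

• M-step (Maximization)

– Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\, \vec{x}$, and $\Sigma_k^{-1}$ is symmetric here.

$$\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0$$

$$\Rightarrow\; \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}$$

Here $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.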

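The EM Algorithm (cont)

• M-step (Maximization)

– Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D$$

Note: $\dfrac{d \det(X)}{dX} = \det(X) \cdot X^{-T}$ and, for symmetric $X$, $\dfrac{d(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1} \vec{a}\, \vec{b}^T X^{-1}$; $\Sigma_k$ is symmetric here.

$$\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{|\hat{\Sigma}_k|} \cdot |\hat{\Sigma}_k| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \right] = 0 \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} &= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow\; \hat{\Sigma}_k \sum_{i=1}^{n} w_{ik} &= \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
\Rightarrow\; \hat{\Sigma}_k &= \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} \\
&= \frac{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}$$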

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
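Putting the E-step and M-step together gives the full training loop. The sketch below is a simplified illustration for one-dimensional data, not the lecture's implementation: it starts from hand-picked initial means instead of a K-means run, uses unit initial variances and equal priors, and floors the variance at a small constant; all names and parameter values are assumptions.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, priors, means, variances):
    """log P(X | Theta) for a 1-D Gaussian mixture."""
    return sum(
        math.log(sum(p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)))
        for x in xs
    )


def em_gmm_1d(xs, init_means, max_iter=200, tol=1e-8):
    K, n = len(init_means), len(xs)
    priors, means, variances = [1.0 / K] * K, list(init_means), [1.0] * K
    history = [log_likelihood(xs, priors, means, variances)]
    for _ in range(max_iter):
        # E-step: w[i][k] = P(c_k | x_i, Theta)
        w = []
        for x in xs:
            joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
            total = sum(joint)
            w.append([j / total for j in joint])
        # M-step: weighted prior, mean and variance for each cluster
        for k in range(K):
            wk = sum(w[i][k] for i in range(n))
            priors[k] = wk / n
            means[k] = sum(w[i][k] * xs[i] for i in range(n)) / wk
            var = sum(w[i][k] * (xs[i] - means[k]) ** 2 for i in range(n)) / wk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed component
        history.append(log_likelihood(xs, priors, means, variances))
        if history[-1] - history[-2] < tol:  # likelihood has converged
            break
    return priors, means, variances, history


rng = random.Random(0)
data = [rng.gauss(-3.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
priors, means, variances, history = em_gmm_1d(data, init_means=[-1.0, 1.0])
print(priors, means, variances, len(history) - 1)
```

The recorded `history` makes the stopping rule visible: the log-likelihood should not decrease from one iteration to the next, and the loop ends once the improvement falls below `tol` or `max_iter` is reached.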

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information

– TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by their distances on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]$$

$$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}$$

$$E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
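The neighbourhood weights P(T_l | Y_k) depend only on where the topics sit on the two-dimensional map, so they can be tabulated once. A minimal sketch follows, assuming a small 2x2 grid of topic coordinates and a hand-picked σ; both are illustrative choices, as is the function name.

```python
import math


def neighborhood_weights(coords, sigma):
    """Row k holds P(T_l | Y_k) for every topic l on the map."""
    weights = []
    for xk, yk in coords:
        e = []
        for xl, yl in coords:
            dist2 = (xk - xl) ** 2 + (yk - yl) ** 2  # squared map distance
            e.append(math.exp(-dist2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma))
        total = sum(e)
        weights.append([v / total for v in e])        # normalise over all topics
    return weights


# Topics placed at (0,0), (1,0), (0,1), (1,1).
grid = [(x, y) for y in range(2) for x in range(2)]
P = neighborhood_weights(grid, sigma=1.0)
for row in P:
    print([round(v, 3) for v in row])
```

A smaller σ concentrates each row on the topic itself, while a larger σ spreads probability over neighbouring topics and ties nearby clusters more closely together.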

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$\begin{aligned}
L_T &= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
\end{aligned}$$

– EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

where

$$P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}$$

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}$$

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

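Hierarchical Document Organization (cont)

• Self-Organization Map (SOM) – A recursive regression process

[Figure: an input layer holding the input vector $\vec{x}$ and a mapping layer whose units hold weight vectors $\vec{m}_i$; the update moves $\vec{m}_i$ along $\vec{x} - \vec{m}_i$ to a new position $\vec{m}_i'$]

Input vector: $\vec{x} = [x_1, x_2, \ldots, x_n]^T$

Weight vector: $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$ (for example, $\vec{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$)

$$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),i}(t)\, [\vec{x}(t) - \vec{m}_i(t)]$$

where

$$c(\vec{x}) = \arg\min_i \|\vec{x} - \vec{m}_i\|, \qquad \|\vec{x} - \vec{m}_i\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

$$h_{c(\vec{x}),i}(t) = \alpha(t) \exp\left( -\frac{\|\vec{r}_i - \vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)} \right)$$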
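One SOM update can be written in a few lines: find the winning unit for the input, then pull every unit toward the input by an amount that decays with its map distance from the winner. This is a minimal sketch; the function name `som_step`, the three-unit map and the values of α and σ are illustrative assumptions.

```python
import math


def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def som_step(weights, positions, x, alpha, sigma):
    """Apply one SOM update for input x; return (winner index, new weights)."""
    # Winner c(x): the unit whose weight vector is closest to the input.
    c = min(range(len(weights)), key=lambda i: sqdist(x, weights[i]))
    updated = []
    for m_i, r_i in zip(weights, positions):
        # Neighbourhood function: largest at the winner, decaying on the map.
        h = alpha * math.exp(-sqdist(r_i, positions[c]) / (2.0 * sigma ** 2))
        updated.append([m + h * (xn - m) for m, xn in zip(m_i, x)])
    return c, updated


weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]   # weight vectors m_i
positions = [(0, 0), (1, 0), (2, 0)]             # map coordinates r_i
x = [1.2, 0.8]
winner, new_weights = som_step(weights, positions, x, alpha=0.5, sigma=1.0)
print(winner, new_weights)
```

In a full training run α(t) and σ(t) are usually decreased over time so that the map first orders itself globally and then fine-tunes locally; the schedule is left out here.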

Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           19165
TMM     20           20650
TMM     30           19477
TMM     40           19175
SOM     100          20604

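$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}$$

$$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases} \qquad C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

$$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases} \qquad C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$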
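The evaluation ratio compares the average map distance between document pairs with different reference topics to the average distance between pairs sharing a topic. The sketch below assumes that each document has a map coordinate and a reference topic label (read here as the T_r values of the slide); the function name and the four-document toy data are hypothetical.

```python
import math
from itertools import combinations


def r_dist(map_positions, reference_topics):
    """Return dist_Between / dist_Within over all document pairs i < j."""
    between_sum = within_sum = 0.0
    between_n = within_n = 0
    for i, j in combinations(range(len(map_positions)), 2):
        (xi, yi), (xj, yj) = map_positions[i], map_positions[j]
        d = math.hypot(xi - xj, yi - yj)      # dist_Map(i, j)
        if reference_topics[i] == reference_topics[j]:
            within_sum += d
            within_n += 1
        else:
            between_sum += d
            between_n += 1
    # Assumes at least one pair of each kind and a non-zero within distance.
    dist_between = between_sum / between_n
    dist_within = within_sum / within_n
    return dist_between / dist_within


positions = [(0, 0), (0, 1), (3, 0), (3, 1)]
topics = ["a", "a", "b", "b"]
ratio = r_dist(positions, topics)
print(ratio)
```

A larger ratio means documents of different topics are placed farther apart on the map relative to documents of the same topic.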

ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA 
ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN 
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice


The EM Algorithm (cont)

Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B | S1) = 2/4 = 0.5

P(W | S1) = 2/4 = 0.5

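Maximum Likelihood Estimation

• Soft Assignment

Sample        | State S1 | State S2
Sample 1 (B)  | 0.7      | 0.3
Sample 2 (W)  | 0.4      | 0.6
Sample 3 (B)  | 0.9      | 0.1
Sample 4 (W)  | 0.5      | 0.5

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64

P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36

P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 = 0.27

P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 = 0.73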
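The soft-assignment estimates can be reproduced with fractional counts. The following is a minimal sketch, assuming that samples 1 and 3 are B and samples 2 and 4 are W (as the numerators of the formulas imply); the function name `soft_emission_probs` is hypothetical.

```python
# Each sample: (observed symbol, P(S1 | sample), P(S2 | sample)).
samples = [("B", 0.7, 0.3), ("W", 0.4, 0.6), ("B", 0.9, 0.1), ("W", 0.5, 0.5)]

def soft_emission_probs(samples, state_index):
    """Return {symbol: P(symbol | state)} estimated from fractional (soft) counts."""
    totals = {}
    for symbol, *posteriors in samples:
        # Each sample contributes its posterior weight instead of a count of 1.
        totals[symbol] = totals.get(symbol, 0.0) + posteriors[state_index]
    norm = sum(totals.values())
    return {symbol: count / norm for symbol, count in totals.items()}

p_s1 = soft_emission_probs(samples, 0)
p_s2 = soft_emission_probs(samples, 1)
print({s: round(p, 2) for s, p in p_s1.items()})
print({s: round(p, 2) for s, p in p_s2.items()})
```

With hard assignment every posterior is 0 or 1, and the same function reduces to ordinary relative-frequency counting.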

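The EM Algorithm (cont)

• E-step (Expectation)
– Derive the complete data likelihood function

Let $X = \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n$ be the observed samples and $C = c_{k_1}, c_{k_2}, \ldots, c_{k_n}$ the corresponding cluster labels. The likelihood function $P(X \mid \Theta)$ expands as

$$
\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \\
&= \bigl[ P(\vec{x}_1 \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K, \Theta) P(c_K \mid \Theta) \bigr] \times \cdots \times \bigl[ P(\vec{x}_n \mid c_1, \Theta) P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K, \Theta) P(c_K \mid \Theta) \bigr] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i}, \Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}
$$

where $P(X, C \mid \Theta)$ is the complete data likelihood function.

Note:

$$
\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}
$$

How many kinds of $C$? ($K^n$ kinds)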
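The product-of-sums identity in the note, and the count of label sequences, can be checked by brute force on a tiny case. This is only an illustrative sketch; the table `a` is made-up data, with T = 3 rows and M = 2 columns.

```python
import itertools
import math

# Hypothetical table a[t][k] with T = 3 rows and M = 2 columns.
a = [[0.2, 0.5], [0.1, 0.7], [0.4, 0.3]]
T, M = len(a), len(a[0])

# Left-hand side: product over t of the row sums.
lhs = math.prod(sum(row) for row in a)

# Right-hand side: sum over every index tuple (k_1, ..., k_T) of the product a[t][k_t].
assignments = list(itertools.product(range(M), repeat=T))
rhs = sum(math.prod(a[t][k_t] for t, k_t in enumerate(ks)) for ks in assignments)

print(len(assignments))          # number of index tuples, M ** T
print(math.isclose(lhs, rhs))    # the two sides agree
```

The enumeration grows as M ** T (here K ** n label sequences for n samples), which is why the E-step works with per-sample posteriors instead of summing over all sequences explicitly.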

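The EM Algorithm (cont)

• E-step (Expectation)
– Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$

$$
\Phi(\Theta, \hat{\Theta}) = E\left[ L_{CM} \right] = E_{C}\left[ \log P(X, C \mid \hat{\Theta}) \mid X, \Theta \right] = \sum_{C} P(C \mid X, \Theta) \log P(X, C \mid \hat{\Theta}) = \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta})
$$

Here $\Theta$ is known and $\hat{\Theta}$ is unknown.

– Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$

  • We have shown this property when deriving the HMM-based retrieval model

$$
Q(\Theta, \hat{\Theta}) - Q(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)
$$

where $Q$ denotes the auxiliary function $\Phi$.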

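The EM Algorithm (cont)

• E-step (Expectation)
– The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)} \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)} \right] \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \left[ \sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \right] \\
&= \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \left\{ \left[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_{k, k_i} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \right] \left[ \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{ \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right\} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log \left[ P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta}) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

where

$$
\delta_{k, k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}
$$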

The EM Algorithm (cont)

– Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \cdots \sum_{k_n=1}^{K} \delta_{k, k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \sum_{k_1=1}^{K} \cdots \sum_{k_{i-1}=1}^{K} \sum_{k_{i+1}=1}^{K} \cdots \sum_{k_n=1}^{K} \left[ \prod_{j=1, j \ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} \sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta) \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= \left[ \prod_{j=1, j \ne i}^{n} 1 \right] P(c_k \mid \vec{x}_i, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}
$$

because $\delta_{k, k_i}$ means that $\vec{x}_i$ can only be aligned to $c_k$.

Note:

$$
\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + a_{22} + \cdots + a_{2M}) \cdots (a_{T1} + a_{T2} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}
$$

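The EM Algorithm (cont)

• E-step (Expectation)
– The auxiliary function can also be divided into two parts:

$$
\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})
$$

where

$$
\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)} \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

is the auxiliary function for mixture weights, and

$$
\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}
$$

is the auxiliary function for cluster distributions.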

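The EM Algorithm (cont)

• M-step (Maximization)
– Remember that

  • Maximize a function F with a constraint by applying a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$, subject to the constraint $\sum_{j=1}^{N} y_j = 1$.

By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
\Rightarrow \sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$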
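The closed-form result $y_j = w_j / \sum_{j'} w_{j'}$ can be sanity-checked numerically. The sketch below, with made-up weights `w`, compares the closed-form point against random points on the probability simplex; it illustrates the result and is not a proof.

```python
import math
import random

def weighted_log_sum(w, y):
    """F = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

w = [2.0, 1.0, 0.5, 0.5]                    # hypothetical non-negative weights
y_star = [wj / sum(w) for wj in w]          # closed-form maximizer y_j = w_j / sum(w)

# Evaluate F at random points that also satisfy sum_j y_j = 1.
rng = random.Random(0)
best_random = -math.inf
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in w]
    y = [r / sum(raw) for r in raw]
    best_random = max(best_random, weighted_log_sum(w, y))

print(y_star)
print(weighted_log_sum(w, y_star) >= best_random)
```

No random feasible point should score higher than `y_star`, which matches the Lagrange-multiplier solution.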

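The EM Algorithm (cont)

• M-step (Maximization)
– Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right) \\
&= \sum_{k=1}^{K} \underbrace{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell \left( \sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1 \right)
\end{aligned}
$$

$$
\Rightarrow\; \hat{\pi}_k = \hat{P}(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\sum_{k'=1}^{K} \sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_{k'}, \Theta)\, P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

Each term of the sum over $i$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.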

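The EM Algorithm (cont)

• M-step (Maximization)
– Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \log P(\vec{x}_i \mid c_k, \hat{\Theta})
$$

with the $m$-dimensional Gaussian cluster distribution

$$
P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2} \left| \Sigma_k \right|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x}_i - \vec{\mu}_k)^{T} \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k) \right)
$$

Let

$$
w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}
$$

and

$$
\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2} \cdot \log(2\pi) - \frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)
$$

Then

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

where $D$ is a constant.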

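The EM Algorithm (cont)

• M-step (Maximization)
– Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

$$
\frac{\partial \Phi_b}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0
$$

$$
\Rightarrow\; \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
$$

Note: $\dfrac{d(\vec{x}^{T} C \vec{x})}{d\vec{x}} = (C + C^{T})\, \vec{x}$, and $\hat{\Sigma}_k^{-1}$ is symmetric here.

Here $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.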

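The EM Algorithm (cont)

• M-step (Maximization)
– Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} (\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \right] + D
$$

Notes: $\dfrac{d \det(X)}{dX} = \det(X) \cdot X^{-T}$ and $\dfrac{d(a^{T} X^{-1} b)}{dX} = -X^{-1} a\, b^{T} X^{-1}$ for symmetric $X$; $\hat{\Sigma}_k$ is symmetric here.

$$
\begin{aligned}
\frac{\partial \Phi_b}{\partial \hat{\Sigma}_k} &= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \frac{1}{\left| \hat{\Sigma}_k \right|} \cdot \left| \hat{\Sigma}_k \right| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} \right] = 0 \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} &= \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} \\
\Rightarrow\; \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k &= \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow\; \hat{\Sigma}_k \cdot \sum_{i=1}^{n} w_{ik} &= \sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^{T} \\
\Rightarrow\; \hat{\Sigma}_k &= \frac{\sum_{i=1}^{n} w_{ik} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n} w_{ik}}
= \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
\end{aligned}
$$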

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
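The E-step and M-step updates can be put together in a few lines for the one-dimensional case. The following is a minimal sketch, not the lecture's implementation: it assumes scalar samples, uses hand-picked initial parameters instead of K-means, and adds a small variance floor as a safeguard. The names `gaussian_pdf` and `em_gmm_1d` are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, means, variances, weights, max_iter=100, tol=1e-8):
    """Fit a 1-D Gaussian mixture with EM; returns parameters and the log-likelihood trace."""
    means, variances, weights = list(means), list(variances), list(weights)
    n, K = len(xs), len(means)
    trace = []
    for _ in range(max_iter):
        # E-step: post[i][k] = P(c_k | x_i, Theta); also accumulate log P(X | Theta).
        post, log_lik = [], 0.0
        for x in xs:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(K)]
            total = sum(joint)
            log_lik += math.log(total)
            post.append([j / total for j in joint])
        trace.append(log_lik)
        # Terminate once the log likelihood has stopped improving.
        if len(trace) > 1 and trace[-1] - trace[-2] < tol:
            break
        # M-step: closed-form updates for mixture weights, means and variances.
        for k in range(K):
            w_k = sum(post[i][k] for i in range(n))
            weights[k] = w_k / n
            means[k] = sum(post[i][k] * xs[i] for i in range(n)) / w_k
            var_k = sum(post[i][k] * (xs[i] - means[k]) ** 2 for i in range(n)) / w_k
            variances[k] = max(var_k, 1e-6)  # variance floor: an added safeguard
    return weights, means, variances, trace

# Made-up data: two well-separated groups around 1.0 and 5.0.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, trace = em_gmm_1d(xs, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print([round(m, 3) for m in means], [round(w, 3) for w in weights])
```

The `trace` list holds the log likelihood after each E-step, so the two stopping rules (convergence or `max_iter`) can be inspected directly.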

Hierarchical Document Organization

• Explore the probabilistic latent topical information
– TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are placed in the same cluster, and the relationships among the clusters are reflected by their distances on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional tree structure for organized topics]

$$
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right]
$$

$$
P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}
$$

$$
E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{dist(T_k, T_l)^2}{2\sigma^2} \right]
$$

$$
dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$
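The neighborhood probabilities $P(T_l \mid Y_k)$ depend only on the map coordinates of the topics and on $\sigma$. The sketch below computes them for a made-up 2×2 map; the function name `neighborhood_probs` and the coordinates are illustrative assumptions.

```python
import math

def neighborhood_probs(coords, sigma):
    """Return P[k][l] = P(T_l | Y_k) for topics placed at 2-D map coordinates."""
    K = len(coords)

    def E(k, l):
        # Gaussian kernel of the Euclidean map distance between topics k and l.
        d2 = (coords[k][0] - coords[l][0]) ** 2 + (coords[k][1] - coords[l][1]) ** 2
        return math.exp(-d2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

    P = []
    for k in range(K):
        row = [E(k, l) for l in range(K)]
        norm = sum(row)
        P.append([e / norm for e in row])  # normalize over all topics s
    return P

# Hypothetical 2x2 map: topics at (0,0), (1,0), (0,1), (1,1).
coords = [(x, y) for y in range(2) for x in range(2)]
P = neighborhood_probs(coords, sigma=1.0)
print([round(p, 3) for p in P[0]])
```

Each row sums to 1, and topics that are closer on the map receive more probability mass. A smaller `sigma` concentrates the mass on the topic itself.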

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i)
= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
$$

– EM training can be performed

$$
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}
$$

$$
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}
$$

where

$$
P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k) \right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}
$$

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}
$$
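The selection score is a ratio of count mass inside the topic to count mass outside it. A minimal sketch with made-up counts and posteriors follows; the function name `topic_word_score` is hypothetical, and returning infinity for a zero denominator is an added convention.

```python
import math

def topic_word_score(counts_j, topic_post_k):
    """S(w_j, T_k), with counts_j[i] = c(w_j, D_i) and topic_post_k[i] = P(T_k | D_i)."""
    inside = sum(c * p for c, p in zip(counts_j, topic_post_k))
    outside = sum(c * (1.0 - p) for c, p in zip(counts_j, topic_post_k))
    # Added convention: a word that occurs only inside the topic gets an infinite score.
    return inside / outside if outside > 0.0 else math.inf

# Hypothetical word that occurs 3 times in D_1, never in D_2, once in D_3.
counts_j = [3, 0, 1]
topic_post_k = [0.9, 0.2, 0.1]
score = topic_word_score(counts_j, topic_post_k)
print(round(score, 3))
```

Words with the highest scores for a topic occur mostly in documents that belong to that topic, which makes them natural labels for the topic on the map.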

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
– A recursive regression process

[Figure: a mapping layer above an input layer, with labels $x$, $m_i$, $x - m_i$ and $m_i'$]

Input vector: $x = [x_1, x_2, \ldots, x_n]^{T}$

Weight vector: $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^{T}$ (for example, $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^{T}$)

$$
m_i(t+1) = m_i(t) + h_{c(x),i}(t) \left[ x(t) - m_i(t) \right]
$$

where

$$
c(x) = \arg\min_{i} \lVert x - m_i \rVert, \qquad \lVert x - m_i \rVert = \sqrt{\sum_{n} (x_n - m_{i,n})^2}
$$

$$
h_{c(x),i}(t) = \alpha(t) \exp\left( -\frac{\lVert r_i - r_{c(x)} \rVert^2}{2\sigma^2(t)} \right)
$$
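One SOM update can be sketched as follows. This is an illustrative example rather than the lecture's code: it assumes that $r_i$ is the map location of unit $i$, treats $\alpha$ and $\sigma$ as fixed numbers for a single step, and uses made-up weights and a one-dimensional map. The name `som_step` is hypothetical.

```python
import math

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def som_step(weights, positions, x, alpha, sigma):
    """One SOM update; weights[i] is m_i and positions[i] is the map location r_i."""
    # Best-matching unit c(x) = argmin_i ||x - m_i||.
    c = min(range(len(weights)), key=lambda i: squared_distance(x, weights[i]))
    new_weights = []
    for m_i, r_i in zip(weights, positions):
        # Neighborhood function h_{c(x),i} = alpha * exp(-||r_i - r_c||^2 / (2 sigma^2)).
        h = alpha * math.exp(-squared_distance(r_i, positions[c]) / (2.0 * sigma ** 2))
        # Move m_i toward x by the fraction h.
        new_weights.append([m + h * (x_n - m) for m, x_n in zip(m_i, x)])
    return c, new_weights

weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # hypothetical weight vectors
positions = [(0.0,), (1.0,), (2.0,)]             # units on a 1-D map
x = [0.9, 1.2]
c, new_weights = som_step(weights, positions, x, alpha=0.5, sigma=1.0)
print(c, [[round(v, 3) for v in m] for m in new_weights])
```

The winning unit moves the most and its map neighbors move less, which is what makes nearby units on the map end up with similar weight vectors.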

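Hierarchical Document Organization (cont)

• Results

Model | Iterations | dist_Between / dist_Within
TMM   | 10         | 1.9165
TMM   | 20         | 2.0650
TMM   | 30         | 1.9477
TMM   | 40         | 1.9175
SOM   | 100        | 2.0604

$$
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
$$

where

$$
dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}, \qquad
dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}
$$

$$
f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$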
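The ratio of between-class to within-class map distance can be computed directly from document positions on the map. The sketch below assumes that $T_{r,i}$ is the topic label of document $i$ and that both pair types occur with a non-zero within-class average; the data and the name `dist_ratio` are made up for illustration.

```python
import math
from itertools import combinations

def dist_ratio(map_xy, ref_topic):
    """R_Dist = dist_Between / dist_Within over all document pairs i < j."""
    between_sum = within_sum = 0.0
    between_n = within_n = 0
    for i, j in combinations(range(len(map_xy)), 2):
        d = math.dist(map_xy[i], map_xy[j])   # Euclidean distance on the map
        if ref_topic[i] != ref_topic[j]:
            between_sum += d
            between_n += 1
        else:
            within_sum += d
            within_n += 1
    # Assumes at least one pair of each kind and a non-zero within-class average.
    return (between_sum / between_n) / (within_sum / within_n)

# Hypothetical map positions and topic labels for four documents.
map_xy = [(0, 0), (0, 1), (3, 0), (3, 1)]
ref_topic = ["a", "a", "b", "b"]
r = dist_ratio(map_xy, ref_topic)
print(round(r, 4))
```

A larger ratio means that documents of different topics are placed farther apart on the map relative to documents of the same topic.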

ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 37: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR – Berlin Chen 37

Maximum Likelihood Estimation

• Hard Assignment

State S1

P(B|S1) = 2/4 = 0.5

P(W|S1) = 2/4 = 0.5

IR – Berlin Chen 38

Maximum Likelihood Estimation

• Soft Assignment

Observation   State S1   State S2
1 (B)         0.7        0.3
2 (W)         0.4        0.6
3 (B)         0.9        0.1
4 (W)         0.5        0.5

P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64

P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36

P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27

P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73
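The soft-assignment estimates above are weighted relative frequencies, and can be checked with a short sketch. The function name `soft_mle` and the data layout are illustrative assumptions, not part of the slides; the weights and the B/W labels are the ones used in the example.

```python
def soft_mle(weights, labels):
    """Weighted relative-frequency estimate of P(label | state).

    weights[i][s] is the soft assignment of observation i to state s.
    (Hypothetical helper, named here for illustration only.)
    """
    n_states = len(weights[0])
    estimates = []
    for s in range(n_states):
        total = sum(w[s] for w in weights)  # denominator: all weight in state s
        probs = {}
        for w, lab in zip(weights, labels):
            probs[lab] = probs.get(lab, 0.0) + w[s] / total
        estimates.append(probs)
    return estimates


# Soft assignments of the four observations to (S1, S2), as on the slide.
weights = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]
labels = ["B", "W", "B", "W"]

estimates = soft_mle(weights, labels)
for s, probs in enumerate(estimates, start=1):
    print(f"S{s}:", {lab: round(p, 2) for lab, p in probs.items()})
```

With hard assignment every weight is 0 or 1, and the same function reduces to simple counting.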

IR – Berlin Chen 39

The EM Algorithm (cont)

• E–step (Expectation)
  – Derive the complete data likelihood function

$$
\begin{aligned}
P(X\mid\Theta) &= \prod_{i=1}^{n} P(\vec{x}_i\mid\Theta)
 = \prod_{i=1}^{n}\left[\sum_{k=1}^{K} P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)\right] \\
&= \left[P(\vec{x}_1\mid c_1,\Theta)P(c_1\mid\Theta)+\cdots+P(\vec{x}_1\mid c_K,\Theta)P(c_K\mid\Theta)\right]\times\cdots\times\left[P(\vec{x}_n\mid c_1,\Theta)P(c_1\mid\Theta)+\cdots+P(\vec{x}_n\mid c_K,\Theta)P(c_K\mid\Theta)\right] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\;\prod_{i=1}^{n} P(\vec{x}_i\mid c_{k_i},\Theta)\,P(c_{k_i}\mid\Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\;\prod_{i=1}^{n} P(\vec{x}_i, c_{k_i}\mid\Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} P(\vec{x}_1,c_{k_1},\vec{x}_2,c_{k_2},\ldots,\vec{x}_n,c_{k_n}\mid\Theta) \\
&= \sum_{C} P(X, C\mid\Theta)
\end{aligned}
$$

where $X=\vec{x}_1,\vec{x}_2,\ldots,\vec{x}_{n-1},\vec{x}_n$ and $C=c_{k_1},c_{k_2},\ldots,c_{k_{n-1}},c_{k_n}$; $P(X,C\mid\Theta)$ is the complete data likelihood function.

Note:

$$
\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk}
= (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM})
= \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\;\prod_{t=1}^{T} a_{t k_t}
$$

How many kinds of $C$? $K^n$ kinds.
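The product-of-sums identity used in the note can be verified numerically by brute force for small sizes. The sketch below is only an illustration; the matrix `a` and the helper names are made up for the check.

```python
from itertools import product
from math import prod, isclose


def product_of_sums(a):
    """Left-hand side: prod_t sum_k a[t][k]."""
    return prod(sum(row) for row in a)


def sum_over_alignments(a):
    """Right-hand side: sum over every alignment (k_1, ..., k_T) of prod_t a[t][k_t]."""
    T, M = len(a), len(a[0])
    return sum(
        prod(a[t][k] for t, k in enumerate(ks))
        for ks in product(range(M), repeat=T)
    )


# Arbitrary illustrative values: T = 4 rows (observations), M = 3 columns (clusters).
a = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
    [0.7, 0.1, 0.2],
]

lhs = product_of_sums(a)
rhs = sum_over_alignments(a)
n_alignments = len(a[0]) ** len(a)  # M^T alignments, the analogue of K^n kinds of C

print(isclose(lhs, rhs), n_alignments)
```

The right-hand side enumerates every alignment explicitly, which is why its cost grows as $K^n$ while the left-hand side needs only $n \cdot K$ terms.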

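IR – Berlin Chen 40

The EM Algorithm (cont)

• E–step (Expectation)
  – Define the auxiliary function $\Phi(\Theta,\hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X,\Theta)$

$$
\Phi(\Theta,\hat{\Theta})
= E_{C}\left[L_{CM}\mid X,\Theta\right]
= E_{C}\left[\log P(X,C\mid\hat{\Theta})\mid X,\Theta\right]
= \sum_{C} P(C\mid X,\Theta)\,\log P(X,C\mid\hat{\Theta})
= \sum_{C} \frac{P(X,C\mid\Theta)}{P(X\mid\Theta)}\,\log P(X,C\mid\hat{\Theta})
$$

    Here $\Theta$ is known and $\hat{\Theta}$ is unknown.

  – Maximize the log likelihood function $\log P(X\mid\hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta,\hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$
Q(\Theta,\hat{\Theta}) - Q(\Theta,\Theta) \le \log P(X\mid\hat{\Theta}) - \log P(X\mid\Theta)
$$

    ($Q$ here denotes the auxiliary function $\Phi$.)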

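IR – Berlin Chen 41

The EM Algorithm (cont)

• E–step (Expectation)
  – The auxiliary function $\Phi(\Theta,\hat{\Theta})$

$$
\begin{aligned}
\Phi(\Theta,\hat{\Theta})
&= \sum_{C} \frac{P(X,C\mid\Theta)}{P(X\mid\Theta)}\,\log P(X,C\mid\hat{\Theta}) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}
   \left[\prod_{j=1}^{n}\frac{P(\vec{x}_j,c_{k_j}\mid\Theta)}{P(\vec{x}_j\mid\Theta)}\right]
   \left[\log\prod_{i=1}^{n}P(\vec{x}_i,c_{k_i}\mid\hat{\Theta})\right] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}
   \left[\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\right]
   \left[\sum_{i=1}^{n}\log P(\vec{x}_i,c_{k_i}\mid\hat{\Theta})\right] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}
   \left[\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\right]
   \left[\sum_{i=1}^{n}\sum_{k=1}^{K}\delta_{k,k_i}\log P(\vec{x}_i,c_{k}\mid\hat{\Theta})\right] \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}\log P(\vec{x}_i,c_{k}\mid\hat{\Theta})
   \left\{\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k,k_i}\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\right\} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log P(\vec{x}_i,c_k\mid\hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log\left[P(c_k\mid\hat{\Theta})\,P(\vec{x}_i\mid c_k,\hat{\Theta})\right] \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log P(c_k\mid\hat{\Theta})
 + \sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log P(\vec{x}_i\mid c_k,\hat{\Theta})
\end{aligned}
$$

where

$$
\delta_{k,k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}
$$

The term in braces reduces to $P(c_k\mid\vec{x}_i,\Theta)$ (see next slide).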

IR – Berlin Chen 42

The EM Algorithm (cont)

  – Note that

$$
\begin{aligned}
\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k,k_i}\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)
&= \left[\sum_{k_1=1}^{K}\cdots\sum_{k_{i-1}=1}^{K}\sum_{k_{i+1}=1}^{K}\cdots\sum_{k_n=1}^{K}\;\prod_{j=1,\,j\ne i}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)\right]P(c_k\mid\vec{x}_i,\Theta) \\
&= \left[\prod_{j=1,\,j\ne i}^{n}\;\sum_{k_j=1}^{K}P(c_{k_j}\mid\vec{x}_j,\Theta)\right]P(c_k\mid\vec{x}_i,\Theta) \\
&= 1\cdot P(c_k\mid\vec{x}_i,\Theta) = P(c_k\mid\vec{x}_i,\Theta)
\end{aligned}
$$

    because $\delta_{k,k_i}$ means $\vec{x}_i$ can only be aligned to $c_k$, and each remaining sum $\sum_{k_j=1}^{K}P(c_{k_j}\mid\vec{x}_j,\Theta)$ equals 1.

Note:

$$
\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk}
= (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM})
= \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\;\prod_{t=1}^{T} a_{t k_t}
$$

IR – Berlin Chen 43

The EM Algorithm (cont)

• E–step (Expectation)
  – The auxiliary function can also be divided into two parts

$$
\Phi(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + \Phi_b(\Theta,\hat{\Theta})
$$

where

$$
\Phi_a(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log P(c_k\mid\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\,\log P(c_k\mid\hat{\Theta})
$$

    (auxiliary function for mixture weights)

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log P(\vec{x}_i\mid c_k,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\,\log P(\vec{x}_i\mid c_k,\hat{\Theta})
$$

    (auxiliary function for cluster distributions)

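IR – Berlin Chen 44

The EM Algorithm (cont)

• M-step (Maximization)
  – Remember that
    • Maximize a function F with a constraint by applying a Lagrange multiplier

Suppose that

$$
F = \sum_{j=1}^{N} w_j\log y_j, \qquad \text{with the constraint } \sum_{j=1}^{N} y_j = 1
$$

By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j\log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right) \\
\frac{\partial\hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0
 \;\Rightarrow\; y_j = -\frac{w_j}{\ell}\quad\forall j \\
\sum_{j=1}^{N} y_j &= -\frac{1}{\ell}\sum_{j=1}^{N} w_j = 1
 \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial\log y_j}{\partial y_j} = \dfrac{1}{y_j}$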
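The closed-form solution $y_j = w_j / \sum_{j'} w_{j'}$ can be sanity-checked numerically. The following sketch, with made-up weights and hypothetical helper names, compares the objective at the closed-form point against randomly drawn points on the simplex.

```python
import math
import random


def maximize_weighted_log(w):
    """Closed-form maximizer of sum_j w_j log y_j subject to sum_j y_j = 1."""
    total = sum(w)
    return [wj / total for wj in w]


def objective(w, y):
    """F = sum_j w_j log y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))


# Illustrative weights (assumed values, not from the slides).
w = [2.0, 1.0, 0.5, 0.5]
y_star = maximize_weighted_log(w)
best = objective(w, y_star)

# No random feasible point should beat the closed-form solution.
rng = random.Random(0)
worse_everywhere = True
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in w]
    s = sum(raw)
    y = [r / s for r in raw]  # normalize so the constraint sum_j y_j = 1 holds
    if objective(w, y) > best + 1e-12:
        worse_everywhere = False

print(y_star, worse_everywhere)
```

A random search is of course not a proof; it only illustrates that the stationary point found with the Lagrange multiplier is a maximum on the simplex.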

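IR – Berlin Chen 45

The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta,\hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta,\hat{\Theta})
&= \Phi_a(\Theta,\hat{\Theta}) + \ell\left(\sum_{k=1}^{K}P(c_k\mid\hat{\Theta}) - 1\right) \\
&= \sum_{k=1}^{K}\underbrace{\left[\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\right]}_{w_k}\log\underbrace{P(c_k\mid\hat{\Theta})}_{y_k}
 + \ell\left(\sum_{k=1}^{K}P(c_k\mid\hat{\Theta}) - 1\right)
\end{aligned}
$$

$$
\Rightarrow\;
\hat{\pi}_k = P(c_k\mid\hat{\Theta})
= \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}
       {\displaystyle\sum_{k'=1}^{K}\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_{k'},\Theta)\,P(c_{k'}\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}
= \frac{1}{n}\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}
$$

Each summand is the expected number of times that $\vec{x}_i$ falls in class $c_k$.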

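IR – Berlin Chen 46

The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\,\log P(\vec{x}_i\mid c_k,\hat{\Theta})
$$

with the Gaussian cluster distribution ($m$ is the dimension of $\vec{x}_i$)

$$
P(\vec{x}_i\mid c_k,\Theta)
= \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}}
  \exp\left(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^{T}\Sigma_k^{-1}(\vec{x}_i-\vec{\mu}_k)\right)
$$

Let

$$
w_{ik} = \frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}
$$

and

$$
\log P(\vec{x}_i\mid c_k,\hat{\Theta})
= -\frac{m}{2}\cdot\log(2\pi) - \frac{1}{2}\log|\hat{\Sigma}_k|
  - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)
$$

Then

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k|
  - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D
$$

where $D$ is a constant.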

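IR – Berlin Chen 47

The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k|
  - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D
$$

$$
\frac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\vec{\mu}}_k}
= -\frac{1}{2}\cdot\sum_{i=1}^{n} w_{ik}\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1) = 0
$$

$$
\Rightarrow\;
\hat{\vec{\mu}}_k
= \frac{\sum_{i=1}^{n} w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\cdot\vec{x}_i}
       {\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}
$$

Note: $\dfrac{d(\vec{x}^{T}C\vec{x})}{d\vec{x}} = (C + C^{T})\,\vec{x}$, and $\Sigma_k^{-1}$ is symmetric here.

$w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.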

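IR – Berlin Chen 48

The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k|
  - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D
$$

$$
\begin{aligned}
\frac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\Sigma}_k}
&= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\left[\frac{1}{|\hat{\Sigma}_k|}\cdot|\hat{\Sigma}_k|\cdot\hat{\Sigma}_k^{-1}
   - \hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\right] = 0 \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1}
 = \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1} \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k
 = \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\hat{\Sigma}_k \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k
 = \sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}
\end{aligned}
$$

$$
\Rightarrow\;
\hat{\Sigma}_k
= \frac{\sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}
       {\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}
$$

Notes ($\Sigma_k$ is symmetric here):

$$
\frac{d\,\det(X)}{dX} = \det(X)\cdot X^{-T},
\qquad
\frac{d\,(\vec{a}^{T}X^{-1}\vec{b})}{dX} = -X^{-1}\vec{a}\,\vec{b}^{T}X^{-1}
$$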

IR – Berlin Chen 49

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X\mid\Theta)$ has converged or the maximum number of iterations is reached
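The E-step and M-step updates derived on the preceding slides can be put together in a compact sketch. The version below is a simplification under stated assumptions: it handles the univariate case only (so $\Sigma_k$ becomes a scalar variance), it uses a crude initialization from the data range instead of K-means, it applies a small variance floor for numerical safety, and all function names are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, max_iter=200, tol=1e-8):
    """EM for a univariate Gaussian mixture; returns parameters and the log-likelihood trace."""
    means, variances, weights = list(means), list(variances), list(weights)
    K, n = len(means), len(data)
    trace = []
    for _ in range(max_iter):
        # E-step: posteriors w_ik = P(c_k | x_i, Theta) and log P(X | Theta).
        post, loglik = [], 0.0
        for x in data:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(K)]
            px = sum(joint)
            loglik += math.log(px)
            post.append([j / px for j in joint])
        trace.append(loglik)
        # Terminate once the log-likelihood has (numerically) converged.
        if len(trace) > 1 and trace[-1] - trace[-2] < tol:
            break
        # M-step: closed-form updates for mixture weights, means and variances.
        for k in range(K):
            wk = sum(p[k] for p in post)
            weights[k] = wk / n
            means[k] = sum(p[k] * x for p, x in zip(post, data)) / wk
            var = sum(p[k] * (x - means[k]) ** 2 for p, x in zip(post, data)) / wk
            variances[k] = max(var, 1e-6)  # variance floor: an added safeguard, not in the slides
    return means, variances, weights, trace


# Synthetic data from two well-separated Gaussians (illustrative only).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(6.0, 1.0) for _ in range(200)]

means, variances, weights, trace = em_gmm_1d(
    data, means=[min(data), max(data)], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print(sorted(means), weights, len(trace))
```

The returned `trace` makes the stopping rule visible: the log-likelihood should not decrease from one iteration to the next, and the loop ends when the improvement falls below `tol` or `max_iter` is reached.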

IR – Berlin Chen 50

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information
  – TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are placed in the same cluster, and the relationships among the clusters are reflected by their distances on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$
P(w_j\mid D_i) = \sum_{k=1}^{K} P(T_k\mid D_i)\left[\sum_{l=1}^{K} P(T_l\mid Y_k)\,P(w_j\mid T_l)\right]
$$

$$
E(T_l,T_k) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\right],
\qquad
dist(T_i,T_j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}
$$

$$
P(T_l\mid Y_k) = \frac{E(T_l,T_k)}{\sum_{s=1}^{K} E(T_s,T_k)}
$$

IR – Berlin Chen 51

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
L_T = \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log P(w_j\mid D_i)
= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log\left\{\sum_{k=1}^{K} P(T_k\mid D_i)\left[\sum_{l=1}^{K} P(T_l\mid Y_k)\,P(w_j\mid T_l)\right]\right\}
$$

  – EM training can be performed

$$
\hat{P}(w_j\mid T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N} c(w_{j'},D_{i'})\,P(T_k\mid w_{j'},D_{i'})}
$$

$$
\hat{P}(T_k\mid D_i) = \frac{\sum_{j=1}^{J} c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{c(D_i)}
$$

where

$$
P(T_k\mid w_j,D_i) = \frac{\left[\sum_{l=1}^{K} P(w_j\mid T_l)\,P(T_l\mid Y_k)\right]\cdot P(T_k\mid D_i)}{\sum_{k'=1}^{K}\left\{\left[\sum_{l'=1}^{K} P(w_j\mid T_{l'})\,P(T_{l'}\mid Y_{k'})\right]\cdot P(T_{k'}\mid D_i)\right\}}
$$

IR – Berlin Chen 52

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$
S(w_j,T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k\mid D_i)}{\sum_{i'=1}^{N} c(w_j,D_{i'})\,\left[1 - P(T_k\mid D_{i'})\right]}
$$
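The selection score compares how much of a word's count mass falls in documents that belong to the topic against the mass that falls outside it. The sketch below uses invented counts and topic posteriors purely for illustration; `topic_word_score` is a hypothetical helper.

```python
def topic_word_score(counts_j, topic_post_k):
    """S(w_j, T_k), given c(w_j, D_i) and P(T_k | D_i) for every document i."""
    num = sum(c * p for c, p in zip(counts_j, topic_post_k))
    den = sum(c * (1.0 - p) for c, p in zip(counts_j, topic_post_k))
    # If the word never occurs outside the topic, treat the score as unbounded.
    return num / den if den > 0 else float("inf")


# Invented example: counts of two words in four documents, and P(T_k | D_i).
counts = {
    "stock": [5, 4, 0, 1],   # concentrated in documents 1 and 2
    "the": [9, 8, 9, 10],    # spread evenly over all documents
}
p_topic = [0.9, 0.8, 0.1, 0.2]

scores = {w: topic_word_score(c, p_topic) for w, c in counts.items()}
for w, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(w, round(s, 3))
```

Words that occur mostly in documents with high $P(T_k\mid D_i)$ score well above 1, while words spread evenly across the collection stay near or below 1, so ranking by this score favors topic-specific words.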

IR – Berlin Chen 53

Hierarchical Document Organization (cont)

• Example

IR – Berlin Chen 54

Hierarchical Document Organization (cont)

• Example (cont)

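IR – Berlin Chen 55

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  – A recursive regression process

[Figure: a Mapping Layer of units with weight vectors $m_i$ above an Input Layer that receives the input vector $x$]

Input vector: $x = [x_1, x_2, \ldots, x_n]^{T}$

Weight vector: $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^{T}$ (for example, $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^{T}$)

$$
m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,[x(t) - m_i(t)]
$$

$$
c(x) = \arg\min_{i}\,\lVert x - m_i\rVert,
\qquad\text{where}\quad
\lVert x - m_i\rVert = \sqrt{\sum_{n}(x_n - m_{i,n})^2}
$$

$$
h_{c(x),i}(t) = \alpha(t)\exp\left(-\frac{\lVert r_i - r_{c(x)}\rVert^2}{2\sigma^2(t)}\right)
$$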
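A single step of the SOM update rule can be sketched as follows. The three-unit map, the fixed values of $\alpha$ and $\sigma$ (which would normally decay with $t$), and the function names are all assumptions made for the illustration.

```python
import math


def best_matching_unit(x, weights):
    """c(x): index of the unit whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda i: sum((xn - mn) ** 2 for xn, mn in zip(x, weights[i])),
    )


def som_step(x, weights, positions, alpha, sigma):
    """Apply m_i(t+1) = m_i(t) + h_{c(x),i}(t) [x(t) - m_i(t)] to every unit, in place."""
    c = best_matching_unit(x, weights)
    for i, m in enumerate(weights):
        # Squared distance between units i and c on the map (r_i, r_c), not in input space.
        d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[c]))
        h = alpha * math.exp(-d2 / (2 * sigma ** 2))
        weights[i] = [mn + h * (xn - mn) for xn, mn in zip(x, m)]
    return c


# Assumed toy map: three units on a line, with 2-dimensional weight vectors.
positions = [(0, 0), (1, 0), (2, 0)]
weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
x = [2.0, 1.8]

before = [list(m) for m in weights]
winner = som_step(x, weights, positions, alpha=0.5, sigma=1.0)
print(winner, weights)
```

Every unit moves toward the input, but the step size shrinks with the unit's map distance from the winner, which is what makes neighboring units on the map end up with similar weight vectors.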

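IR – Berlin Chen 56

Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
$$

where

$$
dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)},
\qquad
dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}
$$

$$
f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
dist_{Map}(i,j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}
$$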
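The distance ratio can be computed from document positions on the map and a topic label per document. The sketch below assumes that $T_{r,i}$ is a reference topic label for document $i$ and that $(x_i, y_i)$ are the map coordinates of the cluster holding it; the slides do not spell either out, and the data is invented.

```python
import math
from itertools import combinations


def distance_ratio(map_xy, ref_topics):
    """R_Dist = average between-topic map distance / average within-topic map distance."""
    between, within = [], []
    for i, j in combinations(range(len(map_xy)), 2):  # all pairs with i < j
        d = math.dist(map_xy[i], map_xy[j])           # dist_Map(i, j)
        if ref_topics[i] == ref_topics[j]:
            within.append(d)
        else:
            between.append(d)
    dist_between = sum(between) / len(between)
    dist_within = sum(within) / len(within)
    return dist_between / dist_within


# Invented example: four documents, two reference topics, placed on a 2-D map.
map_xy = [(0, 0), (0, 1), (3, 3), (3, 4)]
ref_topics = ["a", "a", "b", "b"]

ratio = distance_ratio(map_xy, ref_topics)
print(round(ratio, 4))
```

A ratio above 1 means documents from different topics sit farther apart on the map than documents from the same topic, so larger values indicate a better-organized map.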


Page 38: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 38

Maximum Likelihood Estimation

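• Soft Assignment

Posterior probability of each state for the four observations ($\bar{B}$ denotes the second observation symbol):

    Observation   Symbol      State S1   State S2
    1             B           0.7        0.3
    2             $\bar{B}$   0.4        0.6
    3             B           0.9        0.1
    4             $\bar{B}$   0.5        0.5

$$P(B \mid S_1) = \frac{0.7+0.9}{0.7+0.4+0.9+0.5} = \frac{1.6}{2.5} = 0.64$$

$$P(\bar{B} \mid S_1) = \frac{0.4+0.5}{0.7+0.4+0.9+0.5} = \frac{0.9}{2.5} = 0.36$$

$$P(B \mid S_2) = \frac{0.3+0.1}{0.3+0.6+0.1+0.5} = \frac{0.4}{1.5} = 0.27$$

$$P(\bar{B} \mid S_2) = \frac{0.6+0.5}{0.3+0.6+0.1+0.5} = \frac{1.1}{1.5} = 0.73$$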
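The soft-count estimates on this slide can be reproduced with a few lines of Python. This is only a sketch: the names `posteriors`, `symbols`, and `soft_emission_estimates` are hypothetical, and the assignment of symbols to observations is inferred from which posteriors appear in each numerator.

```python
# Posterior weights (S1, S2) for the four observations, taken from the slide.
posteriors = [(0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)]
# Assumption: observations 1 and 3 show symbol "B", observations 2 and 4 the other symbol.
symbols = ["B", "notB", "B", "notB"]


def soft_emission_estimates(posteriors, symbols, n_states=2):
    """Return estimates[state][symbol] = soft count of symbol / total soft count."""
    estimates = []
    for s in range(n_states):
        total = sum(p[s] for p in posteriors)  # expected number of visits to state s
        counts = {}
        for p, sym in zip(posteriors, symbols):
            counts[sym] = counts.get(sym, 0.0) + p[s]  # fractional (soft) count
        estimates.append({sym: c / total for sym, c in counts.items()})
    return estimates


estimates = soft_emission_estimates(posteriors, symbols)
print(estimates)
```

With hard assignment each observation would add a count of 1 to a single state; here every observation contributes to both states in proportion to its posterior.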

IR ndash Berlin Chen 39

The EM Algorithm (cont)

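• E-step (Expectation)
  – Derive the complete data likelihood function

$$\begin{aligned}
P(X \mid \Theta) &= \prod_{i=1}^{n} P(\vec{x}_i \mid \Theta) \\
&= \prod_{i=1}^{n} \sum_{k=1}^{K} P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta) \\
&= \bigl[P(\vec{x}_1 \mid c_1,\Theta)P(c_1 \mid \Theta) + \cdots + P(\vec{x}_1 \mid c_K,\Theta)P(c_K \mid \Theta)\bigr] \times \cdots \times \bigl[P(\vec{x}_n \mid c_1,\Theta)P(c_1 \mid \Theta) + \cdots + P(\vec{x}_n \mid c_K,\Theta)P(c_K \mid \Theta)\bigr] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i \mid c_{k_i},\Theta)\, P(c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \Theta) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} P(\vec{x}_1, c_{k_1}, \vec{x}_2, c_{k_2}, \ldots, \vec{x}_n, c_{k_n} \mid \Theta) \\
&= \sum_{C} P(X, C \mid \Theta)
\end{aligned}$$

Here $P(X \mid \Theta)$ is the likelihood function and $P(X, C \mid \Theta)$ is the complete data likelihood function, with

$$X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_{n-1}, \vec{x}_n\}, \qquad C = \{c_{k_1}, c_{k_2}, \ldots, c_{k_{n-1}}, c_{k_n}\}$$

How many kinds of $C$? ($K^n$ kinds)

Note:

$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T} a_{t k_t}$$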
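The product-of-sums identity used in this derivation can be checked numerically on a tiny case. The sketch below uses made-up values `a[i][k]` standing in for $P(\vec{x}_i \mid c_k,\Theta)P(c_k \mid \Theta)$; the sizes `n = 3` and `K = 2` are arbitrary choices for illustration.

```python
import itertools
import math
import random

random.seed(0)
n, K = 3, 2  # three samples, two clusters (illustrative sizes)
# a[i][k] plays the role of P(x_i | c_k) * P(c_k); values are arbitrary positives.
a = [[random.random() for _ in range(K)] for _ in range(n)]

# Left-hand side: product over samples of the sum over clusters.
product_of_sums = math.prod(sum(row) for row in a)

# Right-hand side: sum over every assignment C = (k_1, ..., k_n) of the product.
assignments = list(itertools.product(range(K), repeat=n))
sum_of_products = sum(
    math.prod(a[i][ks[i]] for i in range(n)) for ks in assignments
)

print(len(assignments), product_of_sums, sum_of_products)
```

The enumeration visits $K^n$ assignments, which is why the complete-data form is useful for derivations but never evaluated directly in practice.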

IR ndash Berlin Chen 40

The EM Algorithm (cont)

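• E-step (Expectation)
  – Define the auxiliary function $\Phi(\Theta, \hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X, \Theta)$

$$\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= E_C\bigl[L_{CM}\bigr] = E_C\bigl[\log P(X, C \mid \hat{\Theta}) \,\big|\, X, \Theta\bigr] \\
&= \sum_{C} P(C \mid X, \Theta)\, \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)}\, \log P(X, C \mid \hat{\Theta})
\end{aligned}$$

    where $\Theta$ is known (the current estimate) and $\hat{\Theta}$ is unknown (the estimate to be found)

  – Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta, \hat{\Theta})$
    • We have shown this property when deriving the HMM-based retrieval model

$$\Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)$$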
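The inequality relating the auxiliary function to the log likelihood is stated without proof on this slide. One standard way to see it, sketched here under the slide's notation (current parameters $\Theta$, new parameters $\hat{\Theta}$), is to write $\log P(X \mid \hat{\Theta}) = \log P(X, C \mid \hat{\Theta}) - \log P(C \mid X, \hat{\Theta})$ and average over $P(C \mid X, \Theta)$:

```latex
\begin{aligned}
\log P(X \mid \hat{\Theta})
  &= \sum_{C} P(C \mid X, \Theta)\,
     \bigl[\log P(X, C \mid \hat{\Theta}) - \log P(C \mid X, \hat{\Theta})\bigr] \\
  &= \Phi(\Theta, \hat{\Theta})
     - \sum_{C} P(C \mid X, \Theta)\, \log P(C \mid X, \hat{\Theta}) \\[4pt]
\log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)
  &= \Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta)
     + \sum_{C} P(C \mid X, \Theta)\,
       \log \frac{P(C \mid X, \Theta)}{P(C \mid X, \hat{\Theta})} \\
  &\ge \Phi(\Theta, \hat{\Theta}) - \Phi(\Theta, \Theta)
\end{aligned}
```

The last sum is a Kullback-Leibler divergence and therefore non-negative, so any $\hat{\Theta}$ that increases the auxiliary function cannot decrease the log likelihood.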

IR ndash Berlin Chen 41

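The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function $\Phi(\Theta, \hat{\Theta})$

$$\begin{aligned}
\Phi(\Theta, \hat{\Theta}) &= \sum_{C} \frac{P(X, C \mid \Theta)}{P(X \mid \Theta)}\, \log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C} \left[\prod_{j=1}^{n} \frac{P(\vec{x}_j, c_{k_j} \mid \Theta)}{P(\vec{x}_j \mid \Theta)}\right] \log \prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{C} \left[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\right] \left[\sum_{i=1}^{n} \log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta})\right] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} \left[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\right] \left[\sum_{i=1}^{n}\sum_{k=1}^{K} \delta_{k,k_i} \log P(\vec{x}_i, c_k \mid \hat{\Theta})\right] \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \left\{\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} \delta_{k,k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)\right\} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log P(\vec{x}_i, c_k \mid \hat{\Theta}) \qquad \text{(see next slide)} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log \bigl[P(\vec{x}_i \mid c_k, \hat{\Theta})\, P(c_k \mid \hat{\Theta})\bigr] \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log P(c_k \mid \hat{\Theta}) + \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}$$

where

$$\delta_{k,k_i} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases}$$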

IR ndash Berlin Chen 42

The EM Algorithm (cont)

  – Note that

$$\begin{aligned}
\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K} \delta_{k,k_i} \prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta)
&= \sum_{k_1=1}^{K}\cdots\sum_{k_{i-1}=1}^{K}\;\sum_{k_{i+1}=1}^{K}\cdots\sum_{k_n=1}^{K} P(c_k \mid \vec{x}_i, \Theta) \prod_{j=1,\, j\ne i}^{n} P(c_{k_j} \mid \vec{x}_j, \Theta) \\
&= P(c_k \mid \vec{x}_i, \Theta) \prod_{j=1,\, j\ne i}^{n} \left[\sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j, \Theta)\right] \\
&= P(c_k \mid \vec{x}_i, \Theta) \cdot 1 \\
&= P(c_k \mid \vec{x}_i, \Theta)
\end{aligned}$$

    The factor $\delta_{k,k_i}$ means that $\vec{x}_i$ can only be aligned to $c_k$.

Note:

$$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T} a_{t k_t}$$
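The reduction of the auxiliary function from a sum over all $K^n$ assignments to a double sum over samples and clusters can be verified by brute force. The sketch below assumes toy values: `post[i][k]` stands for $P(c_k \mid \vec{x}_i, \Theta)$ (each row sums to 1) and `logjoint[i][k]` for $\log P(\vec{x}_i, c_k \mid \hat{\Theta})$; both tables are invented for illustration.

```python
import itertools
import math

n, K = 3, 2
# Hypothetical posteriors P(c_k | x_i, Theta); every row sums to 1.
post = [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]]
# Hypothetical log joint values log P(x_i, c_k | Theta_hat).
logjoint = [[-1.0, -2.0], [-0.5, -1.5], [-3.0, -0.2]]

# Brute force: expectation over every assignment C = (k_1, ..., k_n).
brute_force = 0.0
for ks in itertools.product(range(K), repeat=n):
    weight = math.prod(post[j][ks[j]] for j in range(n))  # P(C | X, Theta)
    brute_force += weight * sum(logjoint[i][ks[i]] for i in range(n))

# Compact form: sum_i sum_k P(c_k | x_i, Theta) * log P(x_i, c_k | Theta_hat).
compact = sum(post[i][k] * logjoint[i][k] for i in range(n) for k in range(K))

print(brute_force, compact)
```

Both numbers agree because, for each fixed sample, the posteriors of all other samples sum to 1 and drop out.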

IR ndash Berlin Chen 43

The EM Algorithm (cont)

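• E-step (Expectation)
  – The auxiliary function can also be divided into two parts

$$\Phi(\Theta, \hat{\Theta}) = \Phi_a(\Theta, \hat{\Theta}) + \Phi_b(\Theta, \hat{\Theta})$$

where

$$\begin{aligned}
\Phi_a(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)}\, \log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}\, \log P(c_k \mid \hat{\Theta})
\end{aligned}$$

    (auxiliary function for mixture weights)

$$\begin{aligned}
\Phi_b(\Theta, \hat{\Theta}) &= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i, \Theta)\, \log P(\vec{x}_i \mid c_k, \hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}\, \log P(\vec{x}_i \mid c_k, \hat{\Theta})
\end{aligned}$$

    (auxiliary function for cluster distributions)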
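To make the two parts concrete, the following sketch computes the posteriors $P(c_k \mid \vec{x}_i, \Theta)$ and then the mixture-weight and cluster-distribution terms for a one-dimensional Gaussian mixture. The one-dimensional setting, the helper names, and all parameter values are assumptions made for illustration only.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(xs, weights, means, variances):
    """r[i][k] = P(x_i|c_k)P(c_k) / sum_l P(x_i|c_l)P(c_l) under the current parameters."""
    out = []
    for x in xs:
        joint = [w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        out.append([j / total for j in joint])
    return out


def auxiliary_terms(xs, old, new):
    """Return (phi_a, phi_b); old and new are (weights, means, variances) tuples."""
    r = responsibilities(xs, *old)
    new_w, new_m, new_v = new
    idx = [(i, k) for i in range(len(xs)) for k in range(len(new_w))]
    phi_a = sum(r[i][k] * math.log(new_w[k]) for i, k in idx)
    phi_b = sum(r[i][k] * math.log(gauss_pdf(xs[i], new_m[k], new_v[k])) for i, k in idx)
    return phi_a, phi_b


xs = [0.2, 0.9, 3.1, 4.0]                       # toy data
old = ([0.5, 0.5], [0.0, 4.0], [1.0, 1.0])      # current parameters (Theta)
new = ([0.4, 0.6], [0.5, 3.5], [0.8, 1.2])      # candidate parameters (Theta_hat)
phi_a, phi_b = auxiliary_terms(xs, old, new)
print(phi_a, phi_b)
```

Because `phi_a` depends only on the new mixture weights and `phi_b` only on the new component parameters, the two can be maximized separately in the M-step.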

IR ndash Berlin Chen 44

The EM Algorithm (cont)

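• M-step (Maximization)
  – Remember that
    • A function $F$ can be maximized under a constraint by applying a Lagrange multiplier

Suppose that

$$F = \sum_{j=1}^{N} w_j \log y_j \qquad \text{with the constraint} \qquad \sum_{j=1}^{N} y_j = 1$$

By applying a Lagrange multiplier $\ell$:

$$\hat{F} = \sum_{j=1}^{N} w_j \log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right)$$

$$\frac{\partial \hat{F}}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j$$

$$\Rightarrow\; \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j$$

$$\therefore\; \hat{y}_j = \frac{w_j}{\sum_{j=1}^{N} w_j}$$

Note:

$$\frac{\partial \log y}{\partial y} = \frac{1}{y}$$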
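The closed-form solution of this constrained maximization can be sanity-checked numerically. The sketch below compares $\hat{y}_j = w_j / \sum_j w_j$ against random points on the probability simplex; the weights `w` and the random-search procedure are illustrative assumptions, not part of the original derivation.

```python
import math
import random


def objective(w, y):
    """F = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))


w = [1.6, 0.9, 0.5]                       # arbitrary positive weights
y_star = [wj / sum(w) for wj in w]        # closed-form maximizer from the Lagrange condition

random.seed(1)
best_random = -math.inf
for _ in range(1000):
    raw = [random.random() + 1e-9 for _ in w]
    y = [v / sum(raw) for v in raw]       # random point satisfying sum_j y_j = 1
    best_random = max(best_random, objective(w, y))

print(objective(w, y_star), best_random)
```

No random candidate exceeds the closed-form value, which is what the stationarity condition predicts for this concave objective.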

IR ndash Berlin Chen 45

The EM Algorithm (cont)

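• M-step (Maximization)
  – Maximize $\Phi_a(\Theta, \hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$\begin{aligned}
\hat{\Phi}_a(\Theta, \hat{\Theta}) &= \Phi_a(\Theta, \hat{\Theta}) + \ell\left(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\right) \\
&= \sum_{k=1}^{K} \underbrace{\left[\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}\right]}_{w_k} \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k} + \ell\left(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\right)
\end{aligned}$$

$$\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k=1}^{K} w_k}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}{\displaystyle\sum_{k=1}^{K}\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}
= \frac{1}{n}\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

    The numerator is the expected number of times that a sample $\vec{x}_i$ falls in class $c_k$.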
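The mixture-weight update amounts to averaging the posteriors over the samples. A minimal sketch, assuming the posteriors have already been computed and stored in a hypothetical table `resp[i][k]`:

```python
def update_mixture_weights(resp):
    """New weight of cluster k = (1/n) * sum_i P(c_k | x_i, Theta)."""
    n, K = len(resp), len(resp[0])
    return [sum(resp[i][k] for i in range(n)) / n for k in range(K)]


# Hypothetical posteriors for four samples and two clusters (rows sum to 1).
resp = [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1], [0.5, 0.5]]
new_weights = update_mixture_weights(resp)
print(new_weights)
```

Since every row of `resp` sums to 1, the double sum in the denominator equals `n`, and the new weights automatically sum to 1.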

IR ndash Berlin Chen 46

The EM Algorithm (cont)

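• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}\, \log P(\vec{x}_i \mid c_k, \hat{\Theta})$$

with the $m$-dimensional Gaussian

$$P(\vec{x}_i \mid c_k, \Theta) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k)\right)$$

Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}$$

and

$$\log P(\vec{x}_i \mid c_k, \hat{\Theta}) = -\frac{m}{2}\log(2\pi) - \frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

Then

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)\right] + D$$

where $D$ is a constant.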
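The log-density that enters the cluster-distribution term can be written out directly. The sketch below restricts itself to two dimensions so that the determinant and inverse can be computed by hand without a linear-algebra library; that restriction and the function name are assumptions for illustration.

```python
import math


def log_gaussian_2d(x, mean, cov):
    """log N(x; mean, cov) for a 2-D Gaussian, using the explicit 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    maha = sum(dx[r] * inv[r][s] * dx[s] for r in range(2) for s in range(2))
    m = 2  # dimensionality
    return -0.5 * m * math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * maha


value_at_mean = log_gaussian_2d([1.0, 2.0], [1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])
print(value_at_mean)
```

Only the second and third terms depend on the mean and covariance, which is why the first term can be absorbed into the constant $D$.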

IR ndash Berlin Chen 47

The EM Algorithm (cont)

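• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)\right] + D$$

$$\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2}\sum_{i=1}^{n} w_{ik} \cdot 2 \cdot \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k) \cdot (-1) = 0$$

$$\Rightarrow\; \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\, \vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot \vec{x}_i}{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}$$

    The denominator is the expected number of times that a sample $\vec{x}_i$ falls in class $c_k$.

Note:

$$\frac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\,\vec{x}, \qquad \text{and } \hat{\Sigma}_k^{-1} \text{ is symmetric here}$$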
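The mean update is a posterior-weighted average of the samples. A minimal sketch in plain Python, where `xs` (the samples) and `resp[i][k]` (standing for the weights $w_{ik}$) are invented toy values:

```python
def update_means(xs, resp):
    """means[k] = sum_i w_ik * x_i / sum_i w_ik, computed per dimension."""
    n, K, m = len(xs), len(resp[0]), len(xs[0])
    means = []
    for k in range(K):
        total = sum(resp[i][k] for i in range(n))  # expected count of cluster k
        means.append(
            [sum(resp[i][k] * xs[i][d] for i in range(n)) / total for d in range(m)]
        )
    return means


xs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
means = update_means(xs, resp)
print(means)
```

The last sample is shared equally between the two clusters, so it pulls both means toward (2, 2) with half weight.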

IR ndash Berlin Chen 48

The EM Algorithm (cont)

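• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)\right] + D$$

Note:

$$\frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \qquad \frac{d(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1}\, \vec{a}\, \vec{b}^T X^{-1} \ \text{ (for symmetric } X\text{)}, \qquad \text{and } \hat{\Sigma}_k \text{ is symmetric here}$$

$$\begin{aligned}
\frac{\partial \Phi_b(\Theta, \hat{\Theta})}{\partial \hat{\Sigma}_k} &= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\left[\frac{1}{|\hat{\Sigma}_k|} \cdot |\hat{\Sigma}_k| \cdot \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1}\right] = 0 \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \\
&\Rightarrow\; \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\, \hat{\Sigma}_k \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
&\Rightarrow\; \hat{\Sigma}_k \sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T
\end{aligned}$$

$$\Rightarrow\; \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\, (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)} \cdot (\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k, \Theta)\, P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta)\, P(c_l \mid \Theta)}}$$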
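The covariance update is a posterior-weighted average of outer products around the new mean. A minimal sketch for a single cluster, where `weights_k` stands for the column $w_{ik}$ of one cluster $k$ and all values are toy assumptions:

```python
def update_covariance(xs, weights_k, mean_k):
    """Sigma_k = sum_i w_ik (x_i - mu_k)(x_i - mu_k)^T / sum_i w_ik."""
    m = len(mean_k)
    total = sum(weights_k)
    cov = [[0.0] * m for _ in range(m)]
    for x, w in zip(xs, weights_k):
        d = [x[r] - mean_k[r] for r in range(m)]
        for r in range(m):
            for s in range(m):
                cov[r][s] += w * d[r] * d[s]  # weighted outer product
    return [[c / total for c in row] for row in cov]


xs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
weights_k = [1.0, 1.0, 1.0, 1.0]
cov = update_covariance(xs, weights_k, mean_k=[1.0, 1.0])
print(cov)
```

With equal weights this reduces to the ordinary (biased) sample covariance, which for the four corner points is the identity matrix.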

IR ndash Berlin Chen 49

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
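The complete iteration, including the stopping rule, can be sketched for a one-dimensional Gaussian mixture. Several choices here are assumptions rather than the lecture's method: the data are one-dimensional, the initial means are passed in by hand instead of coming from K-means, variances start at 1, and a small variance floor of `1e-6` guards against collapse.

```python
import math


def gauss_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, weights, means, variances):
    """log P(X | Theta) for the mixture."""
    return sum(
        math.log(sum(w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in xs
    )


def em_gmm_1d(xs, init_means, max_iter=100, tol=1e-8):
    K = len(init_means)
    weights, means, variances = [1.0 / K] * K, list(init_means), [1.0] * K
    history = [log_likelihood(xs, weights, means, variances)]
    for _ in range(max_iter):
        # E-step: posteriors w_ik under the current parameters.
        resp = []
        for x in xs:
            joint = [w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: closed-form updates for weights, means, and variances.
        for k in range(K):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # variance floor (assumption)
        history.append(log_likelihood(xs, weights, means, variances))
        if history[-1] - history[-2] < tol:  # likelihood has converged
            break
    return weights, means, variances, history


xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, means, variances, history = em_gmm_1d(xs, init_means=[0.0, 6.0])
print(weights, means, variances, len(history) - 1)
```

The `history` list records the log likelihood after every iteration, so the monotone increase predicted by the auxiliary-function argument can be inspected directly.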

IR ndash Berlin Chen 50

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information
  – TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by their distances on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[\sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l)\right]$$

$$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}$$

$$E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{dist(T_k, T_l)^2}{2\sigma^2}\right]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
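The neighborhood probabilities $P(T_l \mid Y_k)$ follow directly from the map coordinates of the topics. The sketch below assumes a hypothetical 2×2 topic map with unit spacing and $\sigma = 1$; the function names and all numeric values are illustrative.

```python
import math


def neighborhood_probs(coords, sigma):
    """probs[k][l] = P(T_l | Y_k) = E(T_k, T_l) / sum_s E(T_k, T_s)."""
    def kernel(k, l):
        d2 = (coords[k][0] - coords[l][0]) ** 2 + (coords[k][1] - coords[l][1]) ** 2
        return math.exp(-d2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

    K = len(coords)
    probs = []
    for k in range(K):
        row = [kernel(k, l) for l in range(K)]
        total = sum(row)
        probs.append([e / total for e in row])
    return probs


def word_prob(p_topic_given_doc, p_word_given_topic, probs):
    """P(w_j | D_i) = sum_k P(T_k | D_i) * sum_l P(T_l | Y_k) * P(w_j | T_l)."""
    K = len(probs)
    return sum(
        p_topic_given_doc[k] * sum(probs[k][l] * p_word_given_topic[l] for l in range(K))
        for k in range(K)
    )


coords = [(0, 0), (0, 1), (1, 0), (1, 1)]   # hypothetical topic positions on the map
probs = neighborhood_probs(coords, sigma=1.0)
p_w = word_prob([0.7, 0.1, 0.1, 0.1], [0.3, 0.05, 0.05, 0.01], probs)
print(probs[0], p_w)
```

Nearby topics on the map receive higher $P(T_l \mid Y_k)$, so a document's word distribution is smoothed by the topics adjacent to its own.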

IR ndash Berlin Chen 51

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$\begin{aligned}
L_T &= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j, D_i)\, \log P(w_j \mid D_i) \\
&= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j, D_i)\, \log \left\{\sum_{k=1}^{K} P(T_k \mid D_i) \left[\sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l)\right]\right\}
\end{aligned}$$

  – EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}$$

where

$$P(T_k \mid w_j, D_i) = \frac{\left[\sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid Y_k)\right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{\left[\sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid Y_{k'})\right] \cdot P(T_{k'} \mid D_i)\right\}}$$

IR ndash Berlin Chen 52

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'})\, \bigl[1 - P(T_k \mid D_{i'})\bigr]}$$
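The selection score compares how often a word occurs in documents that belong to a topic with how often it occurs in documents that do not. A minimal sketch with invented counts and topic posteriors (the names `counts_j` and `p_topic_k` are hypothetical):

```python
def topic_word_score(counts_j, p_topic_k):
    """S(w_j, T_k) = sum_i c(w_j, D_i) P(T_k|D_i) / sum_i c(w_j, D_i) (1 - P(T_k|D_i))."""
    inside = sum(c * p for c, p in zip(counts_j, p_topic_k))
    outside = sum(c * (1.0 - p) for c, p in zip(counts_j, p_topic_k))
    return inside / outside


# Hypothetical data: counts of word w_j in three documents and P(T_k | D_i) for each.
counts_j = [3, 0, 1]
p_topic_k = [0.9, 0.2, 0.1]
score = topic_word_score(counts_j, p_topic_k)
print(score)
```

A score above 1 indicates that the word's occurrences are concentrated in documents of that topic, which makes it a candidate label for the topic.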

IR ndash Berlin Chen 53

Hierarchical Document Organization (cont)

• Example

IR ndash Berlin Chen 54

Hierarchical Document Organization (cont)

• Example (cont)

IR ndash Berlin Chen 55

Hierarchical Document Organization (cont)

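• Self-Organization Map (SOM)
  – A recursive regression process

[Figure: SOM with an input layer and a mapping layer. Input vector $\vec{x} = [x_1, x_2, \ldots, x_n]^T$; weight vector of mapping unit $i$: $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, for example $\vec{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$. The update moves $\vec{m}_i$ along $\vec{x} - \vec{m}_i$ to the new position $\vec{m}_i'$.]

$$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),i}(t)\, \bigl[\vec{x}(t) - \vec{m}_i(t)\bigr]$$

$$c(\vec{x}) = \arg\min_{i} \|\vec{x} - \vec{m}_i\|, \qquad \text{where } \|\vec{x} - \vec{m}_i\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

$$h_{c(\vec{x}),i}(t) = \alpha(t) \exp\left(-\frac{\|\vec{r}_i - \vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)}\right)$$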
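One SOM update step can be sketched as follows. The function name `som_step`, the two-unit map, and the fixed values of $\alpha$ and $\sigma$ are assumptions for illustration; in practice both would decrease with the iteration index $t$.

```python
import math


def som_step(weights, positions, x, alpha, sigma):
    """Apply one update m_i <- m_i + h_{c(x),i} * (x - m_i) to every unit.

    weights[i]   : weight vector m_i of unit i
    positions[i] : map coordinates r_i of unit i
    Returns the index c(x) of the best-matching unit and the new weights.
    """
    dists = [math.sqrt(sum((xn - mn) ** 2 for xn, mn in zip(x, m))) for m in weights]
    c = dists.index(min(dists))  # best-matching unit c(x)
    new_weights = []
    for m, r in zip(weights, positions):
        grid_d2 = sum((a - b) ** 2 for a, b in zip(r, positions[c]))
        h = alpha * math.exp(-grid_d2 / (2.0 * sigma ** 2))  # neighborhood function
        new_weights.append([mn + h * (xn - mn) for xn, mn in zip(x, m)])
    return c, new_weights


weights = [[0.0, 0.0], [1.0, 1.0]]
positions = [(0, 0), (0, 1)]
c, new_weights = som_step(weights, positions, x=[0.9, 0.9], alpha=0.5, sigma=1.0)
print(c, new_weights)
```

The winning unit moves halfway toward the input, while its neighbor moves by a smaller amount determined by its distance on the map.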

IR ndash Berlin Chen 56

Hierarchical Document Organization (cont)

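• Results

    Model   Iterations   dist_Between / dist_Within
    TMM     10           1.9165
    TMM     20           2.0650
    TMM     30           1.9477
    TMM     40           1.9175
    SOM     100          2.0604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}$$

$$f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases} \qquad C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases} \qquad C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$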
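The distance ratio can be computed by averaging map distances over document pairs. In the sketch below, `coords[i]` is the map position of document $i$ and `labels[i]` plays the role of $T_{r,i}$, which is assumed here to be a reference topic label of document $i$; the data are invented.

```python
import math


def dist_ratio(coords, labels):
    """R_Dist = (mean map distance of different-label pairs) / (mean of same-label pairs)."""
    between_sum = within_sum = 0.0
    between_n = within_n = 0
    D = len(coords)
    for i in range(D):
        for j in range(i + 1, D):
            d = math.dist(coords[i], coords[j])  # Euclidean distance on the map
            if labels[i] != labels[j]:
                between_sum += d
                between_n += 1
            else:
                within_sum += d
                within_n += 1
    return (between_sum / between_n) / (within_sum / within_n)


coords = [(0, 0), (0, 1), (3, 0), (3, 1)]  # hypothetical map positions
labels = [0, 0, 1, 1]                      # hypothetical reference labels
ratio = dist_ratio(coords, labels)
print(ratio)
```

A larger ratio means documents with different labels sit farther apart on the map than documents sharing a label, so higher values indicate a better organization.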


IR ndash Berlin Chen 39

The EM Algorithm (cont)

bull Endashstep (Expectation)ndash Derive the complete data likelihood function

( ) ( ) ( ) ( )( ) ( ) ( ) ( )( )

( ) ( ) ( ) ( )( )( ) ( )[ ]( )[ ]

( )[ ]( )[ ]

( )[ ]sum

sum sumsum

sum sumsum

sum sum prodsum

sum sum prodsum

prod sumprod

=

=

=

=

=

++times

times++=

==

= ==

= ==

= = ==

= = ==

= ==

CCX

X

Θ

Θ

Θ

Θ

ΘΘ

ΘΘΘΘ

ΘΘΘΘ

ΘΘΘΘ

1 121

1

1 121

1

1 1 11

1 1 11

11

1111

1 11

21

2

21

2

2

2

P

cxcxcxP

cxcxcxP

cxP

cPcxP

cPcxPcPcxP

cPcxPcPcxP

cPcxP xPP

K

k

K

kknkk

K

k

K

k

K

kknkk

K

k

K

k

K

k

n

iki

K

k

K

k

K

k

n

ikki

K

k

KKnn

KK

n

i

K

kkki

n

ii

i n

n

i n

n

i n

i

i n

ii

i

ii

rL

rrL

rL

rrL

rL

rL

rr

Lrr

rr

nn kkkk

nn

ccccxxxx

121

121

minus== minus

L

rrL

rr

CX

the complete data likelihood function

( )( )( ) ( )

sum sum sum prod

prod sum

= = = =

= =

=

+++++++++=M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

likelihood function

)kinds( ofkindsmany How

nKC

IR ndash Berlin Chen 40

The EM Algorithm (cont)

bull Endashstep (Expectation)ndash Define the auxiliary function as the expectation of the

log complete likelihood function LCM with respect to thehiddenlatent variable C conditioned on known data

ndash Maximize the log likelihood function by maximizing the expectation of the log complete likelihood function

bull We have shown this property when deriving the HMM-based retrieval model

( )ΘΘΦ

( )ΘX

( ) [ ] ( )[ ]( ) ( )( )( ) ( )sum

sum

=

=

==Φ

C

C

XCXC

CXX

CX

CXXC

CX

ΘlogΘΘ

ΘlogΘ

ΘloglogΘΘΘΘ

PP

P

PP

PELE CM

( )Θlog XP( )ΘΘΦ

known unknown

( )ΘΘΦ ( )Θlog XP

( ) ( ) ( ) ( )ΘΘΘΘ XX PPQQ minusΘleminusΘ

IR ndash Berlin Chen 41

The EM Algorithm (cont)bull Endashstep (Expectation)

ndash The auxiliary function ( )ΘΘΦ

( ) ( )( ) ( )( )( ) ( )

( ) ( )

( ) ( )

( )[ ] ( )

( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum

sum sum

sum sum

sum sum

sum sum sum prod

sum sum prodsum

sum sumprod

sum prodprod

sum

= = = =

= =

= =

= =

= = = =

= = ==

= ==

= ==

+=

=

=

=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎥⎦

⎤⎢⎣

⎡=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎥⎦

⎤⎢⎣

⎡⎥⎦

⎤⎢⎣

⎡=

⎥⎦

⎤⎢⎣

⎡⎥⎦

⎤⎢⎣

⎡=

⎥⎦

⎤⎢⎣

⎥⎥⎦

⎢⎢⎣

⎡=

m

k

n

i

m

k

n

ikijkkjk

m

k

n

ikkijk

m

k

n

ikijk

m

k

n

iikki

m

k

n

i ccc

n

jjkkkki

ccc

m

k

n

jjkki

n

ikk

cccki

n

i

n

jjk

ccc

n

iki

n

j j

kj

cxPxcPcPxcP

cPcxPxcP

cxPxcP

xcPcxP

xcPcxP

xcPcxP

cxPxcP

cxPxP

cxP

PP

P

jj

j

j

nkkk

ji

nkkk

ji

nkkk

ij

nkkk

i

j

1 1 1 1

1 1

1 1

1 1

1 1 1

1 11

11

11

ΘlogΘΘlogΘ

ΘΘlogΘ

ΘlogΘ

ΘΘlog

ΘΘlog

ΘΘlog

ΘlogΘ

ΘlogΘ

Θ

ΘlogΘΘ

ΘΘ

21

21

21

21

rrr

rr

rr

rr

rr

rr

rr

rr

r

K

K

K

K

C

C

C

C

C

CXX

CX

δ

δ

⎩⎨⎧ =

=otherwise 0

if 1

kk ikk i

δ

See Next Slide

IR ndash Berlin Chen 42

The EM Algorithm (cont)

ndash Note that

( )

( )[ ]

( )[ ]

( ) ( )

( )

( )Θ

Θ1

ΘΘ

Θ

Θ

Θ

1

1

1 1

1 1 1 1

1 1 1 1

1

1 2

1 2

21

ik

ik

n

ijj

m

cikkk

n

ijj

m

kjk

m

c

m

c

m

c

n

jjkkk

m

c

m

c

m

c

n

jjkkk

ccc

n

jjkkk

xcP

xcP

xcPxcP

xcP

xcP

xcP

ik

ii

j

j

k k nk

ji

k k nk

ji

nkkk

ji

r

r

rr

rL

rL

r

K

=

⎥⎦

⎤⎢⎣

⎡=

⎥⎥⎦

⎢⎢⎣

⎥⎥⎦

⎢⎢⎣

⎥⎥⎦

⎢⎢⎣

⎡=

=

=

⎥⎦

⎤⎢⎣

prod

sumprod sum

sum sum sum prod

sum sum sum prod

sum prod

ne=

=ne= =

= = = =

= = = =

= =

δ

δ

δ

δC

( )( )( ) ( )

sum sum sum prod

prod sum

= = = =

= =

=

+++++++++=M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

ki cx toaligned beonly can r

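IR – Berlin Chen 43

The EM Algorithm (cont.)

• E–step (Expectation)
– The auxiliary function can also be divided into two parts:

$$
\Phi(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + \Phi_b(\Theta,\hat{\Theta})
$$

where

$$
\begin{aligned}
\Phi_a(\Theta,\hat{\Theta})
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i,\Theta)\,\log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i,c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)}\,\log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K}P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(c_k \mid \hat{\Theta})
\end{aligned}
$$

(auxiliary function for mixture weights)

$$
\begin{aligned}
\Phi_b(\Theta,\hat{\Theta})
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i,\Theta)\,\log P(\vec{x}_i \mid c_k,\hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K}P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(\vec{x}_i \mid c_k,\hat{\Theta})
\end{aligned}
$$

(auxiliary function for cluster distributions)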
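The split of the auxiliary function into a mixture-weight part and a cluster-distribution part can be sketched numerically. The example below is only an illustration under stated assumptions: one-dimensional Gaussian clusters, and hypothetical data and parameter values; the helper names (`gaussian_pdf`, `posteriors`, `auxiliary_parts`) are not from the slides.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian (assumed cluster distribution for this sketch)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors(xs, weights, means, variances):
    """w[i][k] = P(c_k | x_i, Theta), computed with Bayes' rule under the current Theta."""
    w = []
    for x in xs:
        joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(weights, means, variances)]
        total = sum(joint)
        w.append([j / total for j in joint])
    return w

def auxiliary_parts(xs, w, new_weights, new_means, new_variances):
    """Return (Phi_a, Phi_b) evaluated at the candidate parameters Theta_hat."""
    phi_a = 0.0
    phi_b = 0.0
    for x, row in zip(xs, w):
        for k, w_ik in enumerate(row):
            phi_a += w_ik * math.log(new_weights[k])
            phi_b += w_ik * math.log(gaussian_pdf(x, new_means[k], new_variances[k]))
    return phi_a, phi_b

# Hypothetical data and parameters.
xs = [-2.1, -1.9, 0.1, 2.0, 2.2]
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]              # current Theta
new_weights, new_means, new_variances = [0.4, 0.6], [-2.0, 2.0], [0.5, 0.5]  # candidate Theta_hat

w = posteriors(xs, weights, means, variances)
phi_a, phi_b = auxiliary_parts(xs, w, new_weights, new_means, new_variances)
```

Only `phi_a` depends on the new mixture weights and only `phi_b` on the new cluster parameters, so the two parts can be maximized separately in the M-step.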

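IR – Berlin Chen 44

The EM Algorithm (cont.)

• M-step (Maximization)
– Remember that
• Maximize a function F with a constraint by applying a Lagrange multiplier

Suppose that

$$
F = \sum_{j=1}^{N} w_j \log y_j
\qquad\text{with the constraint}\qquad
\sum_{j=1}^{N} y_j = 1
$$

By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell\Big(\sum_{j=1}^{N} y_j - 1\Big) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0
\ \Rightarrow\ w_j = -\ell\, y_j \quad \forall j \\
\sum_{j=1}^{N} w_j &= -\ell \sum_{j=1}^{N} y_j
\ \Rightarrow\ \ell = -\sum_{j=1}^{N} w_j \\
\therefore\ y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial \log y}{\partial y} = \dfrac{1}{y}$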
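The closed-form result of the Lagrange argument (normalize the weights) can be sanity-checked numerically. This is a rough sketch with hypothetical weights `w`; the random search is only a spot check, not a proof, and the helper names are assumptions made for illustration.

```python
import math
import random

def objective(w, y):
    """F = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

def lagrange_solution(w):
    """Closed-form maximizer under sum_j y_j = 1: y_j = w_j / sum_j' w_j'."""
    total = sum(w)
    return [wj / total for wj in w]

w = [3.0, 1.0, 4.0, 2.0]          # hypothetical non-negative weights
y_star = lagrange_solution(w)     # approximately [0.3, 0.1, 0.4, 0.2]

rng = random.Random(0)            # fixed seed so the check is repeatable

def random_simplex_point(n):
    """A random point with positive entries that sum to 1."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(raw)
    return [r / s for r in raw]

# No randomly drawn feasible point should beat the closed-form solution.
best_random = max(objective(w, random_simplex_point(len(w))) for _ in range(1000))
```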

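IR – Berlin Chen 45

The EM Algorithm (cont.)

• M-step (Maximization)
– Maximize $\Phi_a(\Theta,\hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta,\hat{\Theta})
&= \Phi_a(\Theta,\hat{\Theta}) + \ell\Big(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\Big) \\
&= \sum_{k=1}^{K}\underbrace{\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K}P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}_{w_k}\ \log \underbrace{P(c_k \mid \hat{\Theta})}_{y_k}
 + \ell\Big(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\Big)
\end{aligned}
$$

$$
\Rightarrow\ \hat{\pi}_k = P(c_k \mid \hat{\Theta})
= \frac{w_k}{\sum_{k=1}^{K} w_k}
= \frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K}P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
       {\displaystyle\sum_{k=1}^{K}\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K}P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
= \frac{1}{n}\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K}P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}
$$

The numerator $w_k$ is the expected number of times that a sample $\vec{x}_i$ falls in class $c_k$.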
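In code, the mixture-weight update is just the column average of the posterior matrix. A minimal sketch, assuming the posteriors $P(c_k \mid \vec{x}_i,\Theta)$ have already been computed and stored in a hypothetical list-of-rows `w`:

```python
def update_mixture_weights(w):
    """New priors P(c_k | Theta_hat) = (1/n) * sum_i P(c_k | x_i, Theta).

    w[i][k] holds P(c_k | x_i, Theta); each row is assumed to sum to 1.
    """
    n = len(w)
    K = len(w[0])
    return [sum(w[i][k] for i in range(n)) / n for k in range(K)]

# Hypothetical posteriors for n = 4 samples and K = 2 clusters.
w = [[0.9, 0.1],
     [0.8, 0.2],
     [0.3, 0.7],
     [0.1, 0.9]]
new_priors = update_mixture_weights(w)   # expected cluster counts 2.1 and 1.9, divided by n = 4
```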

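IR – Berlin Chen 46

The EM Algorithm (cont.)

• M-step (Maximization)
– Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K}P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(\vec{x}_i \mid c_k,\hat{\Theta})
$$

With $m$-dimensional Gaussian cluster distributions:

$$
P(\vec{x}_i \mid c_k,\Theta)
= \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}}
\exp\Big(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^{T}\,\Sigma_k^{-1}\,(\vec{x}_i-\vec{\mu}_k)\Big)
$$

Let

$$
w_{ik} = \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K}P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}
$$

and

$$
\log P(\vec{x}_i \mid c_k,\hat{\Theta})
= -\frac{m}{2}\cdot\log(2\pi) - \frac{1}{2}\log|\hat{\Sigma}_k|
 - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)
$$

Then

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\Big[-\frac{1}{2}\log|\hat{\Sigma}_k|
 - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\Big] + D
$$

where $D$ is a constant.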

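IR – Berlin Chen 47

The EM Algorithm (cont.)

• M-step (Maximization)
– Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\Big[-\frac{1}{2}\log|\hat{\Sigma}_k|
 - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\Big] + D
$$

$$
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\vec{\mu}}_k}
= -\frac{1}{2}\cdot\sum_{i=1}^{n} w_{ik}\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1) = 0
$$

$$
\Rightarrow\ \hat{\vec{\mu}}_k
= \frac{\sum_{i=1}^{n} w_{ik}\cdot\vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K}P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\cdot\vec{x}_i}
       {\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K}P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
$$

Here $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.

Note: $\dfrac{d(\vec{x}^{T} C\,\vec{x})}{d\vec{x}} = (C + C^{T})\,\vec{x}$, and $\Sigma_k^{-1}$ is symmetric here.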
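The mean update is a posterior-weighted average of the samples. A small sketch in plain Python, assuming hypothetical two-dimensional samples `xs` and a hypothetical posterior matrix `w` with entries $w_{ik}$:

```python
def update_means(xs, w):
    """mu_hat_k = sum_i w[i][k] * x_i / sum_i w[i][k], computed per dimension."""
    n, K, dim = len(xs), len(w[0]), len(xs[0])
    means = []
    for k in range(K):
        mass = sum(w[i][k] for i in range(n))   # expected number of samples in cluster k
        means.append([sum(w[i][k] * xs[i][d] for i in range(n)) / mass
                      for d in range(dim)])
    return means

# Hypothetical samples and soft assignments (rows sum to 1).
xs = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
w = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]
means = update_means(xs, w)
# Cluster 0: (1.0*[0,0] + 0.5*[2,0]) / 1.5; cluster 1: (0.5*[2,0] + 1.0*[10,10]) / 1.5
```

With hard 0/1 assignments the same formula reduces to the ordinary per-cluster mean used by K-means.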

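IR – Berlin Chen 48

The EM Algorithm (cont.)

• M-step (Maximization)
– Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta,\hat{\Theta})
= \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\Big[-\frac{1}{2}\log|\hat{\Sigma}_k|
 - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\Big] + D
$$

$$
\begin{aligned}
&\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\Sigma}_k}
= -\frac{1}{2}\cdot\sum_{i=1}^{n} w_{ik}\Big[\frac{1}{|\hat{\Sigma}_k|}\cdot|\hat{\Sigma}_k|\cdot\hat{\Sigma}_k^{-1}
 - \hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\Big] = 0 \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1}
= \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1} \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k
= \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\hat{\Sigma}_k \\
&\Rightarrow\ \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k
= \sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T} \\
&\Rightarrow\ \hat{\Sigma}_k
= \frac{\sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K}P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}
       {\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K}P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
\end{aligned}
$$

Notes:

$$
\frac{d\,\det(X)}{dX} = \det(X)\cdot X^{-T},
\qquad
\frac{d\,(\vec{a}^{T}X^{-1}\vec{b})}{dX} = -X^{-1}\,\vec{a}\,\vec{b}^{T}X^{-1}\ \ (\text{for symmetric } X),
$$

and $\Sigma_k$ is symmetric here.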
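The covariance update is a posterior-weighted average of outer products around the new mean. A minimal sketch for a single cluster $k$, in plain Python with hypothetical inputs (`w_k` holds $w_{ik}$ for that cluster, `mean_k` the already updated mean):

```python
def update_covariance(xs, w_k, mean_k):
    """Sigma_hat_k = sum_i w_ik (x_i - mu_k)(x_i - mu_k)^T / sum_i w_ik."""
    dim = len(mean_k)
    mass = sum(w_k)
    cov = [[0.0] * dim for _ in range(dim)]
    for x, w_ik in zip(xs, w_k):
        diff = [x[d] - mean_k[d] for d in range(dim)]
        for r in range(dim):
            for c in range(dim):
                cov[r][c] += w_ik * diff[r] * diff[c]   # weighted outer product
    return [[v / mass for v in row] for row in cov]

# Hypothetical cluster: four corner points of a square, all fully assigned to it.
xs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
w_k = [1.0, 1.0, 1.0, 1.0]
mean_k = [1.0, 1.0]
cov = update_covariance(xs, w_k, mean_k)
```

For this symmetric toy configuration the off-diagonal terms cancel and each diagonal entry is 1, so `cov` is the identity matrix.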

IR – Berlin Chen 49

The EM Algorithm (cont.)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
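The termination rule can be illustrated with a compact EM loop. This is a sketch under simplifying assumptions, not the lecture's implementation: one-dimensional Gaussian clusters, hand-picked starting values instead of a K-means initialization, and hypothetical parameters `tol`, `max_iter` and `var_floor`.

```python
import math

def _pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, weights, means, variances):
    """log P(X | Theta) for a 1-D Gaussian mixture."""
    return sum(math.log(sum(p * _pdf(x, m, v)
                            for p, m, v in zip(weights, means, variances)))
               for x in xs)

def em_fit(xs, weights, means, variances, tol=1e-6, max_iter=200, var_floor=1e-6):
    """Iterate E- and M-steps until the log-likelihood gain drops below tol."""
    n, K = len(xs), len(weights)
    history = [log_likelihood(xs, weights, means, variances)]
    for _ in range(max_iter):
        # E-step: posteriors w[i][k] = P(c_k | x_i, Theta)
        w = []
        for x in xs:
            joint = [p * _pdf(x, m, v) for p, m, v in zip(weights, means, variances)]
            total = sum(joint)
            w.append([j / total for j in joint])
        # M-step: closed-form updates of priors, means and variances
        mass = [sum(row[k] for row in w) for k in range(K)]
        weights = [mass[k] / n for k in range(K)]
        means = [sum(row[k] * x for row, x in zip(w, xs)) / mass[k] for k in range(K)]
        variances = [max(var_floor,
                         sum(row[k] * (x - means[k]) ** 2 for row, x in zip(w, xs)) / mass[k])
                     for k in range(K)]
        history.append(log_likelihood(xs, weights, means, variances))
        if history[-1] - history[-2] < tol:      # likelihood has converged
            break
    return weights, means, variances, history

# Hypothetical data: two well-separated groups.
xs = [-2.2, -1.9, -2.0, -1.8, 2.1, 1.9, 2.0, 2.3]
weights, means, variances, history = em_fit(xs, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The variance floor is an added safeguard against a cluster collapsing onto a single point; it is not part of the derivation on the slides.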

IR – Berlin Chen 50

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information
– TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by their distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i)\Big[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\Big]
$$

$$
E(T_k,T_l) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\Big],
\qquad
dist(T_i,T_j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}
$$

$$
P(T_l \mid Y_k) = \frac{E(T_k,T_l)}{\sum_{s=1}^{K} E(T_k,T_s)}
$$
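The topic mixture with map neighborhoods can be sketched as follows. Everything concrete here is assumed for illustration: a 2×2 grid of topic coordinates, the kernel width `sigma`, the probability tables, and the function names.

```python
import math

def neighborhood(coords, sigma):
    """Row k, column l holds P(T_l | Y_k): a Gaussian kernel over map distance, normalized per row."""
    K = len(coords)
    rows = []
    for k in range(K):
        e = []
        for l in range(K):
            d2 = (coords[k][0] - coords[l][0]) ** 2 + (coords[k][1] - coords[l][1]) ** 2
            e.append(math.exp(-d2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma))
        total = sum(e)
        rows.append([v / total for v in e])
    return rows

def word_given_doc(p_topic_given_doc, p_neighbor, p_word_given_topic):
    """P(w_j | D_i) = sum_k P(T_k | D_i) * sum_l P(T_l | Y_k) * P(w_j | T_l)."""
    K = len(p_topic_given_doc)
    return sum(p_topic_given_doc[k]
               * sum(p_neighbor[k][l] * p_word_given_topic[l] for l in range(K))
               for k in range(K))

# Hypothetical 2x2 topic map, one document, and a 3-word vocabulary.
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
p_neighbor = neighborhood(coords, sigma=1.0)
p_topic_given_doc = [0.4, 0.3, 0.2, 0.1]
p_word_topic = [[0.7, 0.2, 0.1],      # P(w_j | T_l), one row per topic l
                [0.1, 0.8, 0.1],
                [0.3, 0.3, 0.4],
                [0.25, 0.25, 0.5]]
p_words = [word_given_doc(p_topic_given_doc, p_neighbor, [row[j] for row in p_word_topic])
           for j in range(3)]
```

Because each row of `p_neighbor` and of `p_word_topic` is a distribution, the resulting `p_words` again sums to 1 over the vocabulary.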

IR – Berlin Chen 51

Hierarchical Document Organization (cont.)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
\begin{aligned}
L_T &= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log P(w_j \mid D_i) \\
    &= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log\Big\{\sum_{k=1}^{K} P(T_k \mid D_i)\Big[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\Big]\Big\}
\end{aligned}
$$

– EM training can be performed

$$
\hat{P}(w_j \mid T_k)
= \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k \mid w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N} c(w_{j'},D_{i'})\,P(T_k \mid w_{j'},D_{i'})}
$$

$$
\hat{P}(T_k \mid D_i)
= \frac{\sum_{j=1}^{J} c(w_j,D_i)\,P(T_k \mid w_j,D_i)}{c(D_i)}
$$

where

$$
P(T_k \mid w_j,D_i)
= \frac{\Big[\sum_{l=1}^{K} P(w_j \mid T_l)\,P(T_l \mid Y_k)\Big]\cdot P(T_k \mid D_i)}
       {\sum_{k'=1}^{K}\Big\{\Big[\sum_{l=1}^{K} P(w_j \mid T_l)\,P(T_l \mid Y_{k'})\Big]\cdot P(T_{k'} \mid D_i)\Big\}}
$$

IR – Berlin Chen 52

Hierarchical Document Organization (cont.)

• Criterion for Topic Word Selecting

$$
S(w_j,T_k)
= \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j,D_{i'})\,[1 - P(T_k \mid D_{i'})]}
$$
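The topic-word selection score is a ratio of count mass inside the topic to count mass outside it. A small sketch with hypothetical counts and topic posteriors; returning infinity when the denominator is zero is a choice made here, not something the slide specifies.

```python
def topic_word_score(counts_j, p_topic_k):
    """S(w_j, T_k) = sum_i c(w_j, D_i) P(T_k | D_i) / sum_i c(w_j, D_i) [1 - P(T_k | D_i)].

    counts_j[i]  : c(w_j, D_i), count of word w_j in document D_i
    p_topic_k[i] : P(T_k | D_i)
    """
    inside = sum(c * p for c, p in zip(counts_j, p_topic_k))
    outside = sum(c * (1.0 - p) for c, p in zip(counts_j, p_topic_k))
    return inside / outside if outside > 0 else float("inf")

# Hypothetical word that occurs mostly in a document dominated by topic k.
counts_j = [3, 0, 1]
p_topic_k = [0.9, 0.5, 0.2]
score = topic_word_score(counts_j, p_topic_k)   # (2.7 + 0.2) / (0.3 + 0.8)
```

Words with a high score are concentrated in documents that belong to the topic, which makes them candidates for labeling that topic on the map.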

IR – Berlin Chen 53

Hierarchical Document Organization (cont.)

• Example

IR – Berlin Chen 54

Hierarchical Document Organization (cont.)

• Example (cont.)

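IR – Berlin Chen 55

Hierarchical Document Organization (cont.)

• Self-Organization Map (SOM)
– A recursive regression process

[Figure: Input Layer and Mapping Layer; labels: input vector $x$, weight vector $m_i$, difference $x - m_i$, and $m_i'$]

Input vector: $x = [x_1, x_2, \ldots, x_n]^T$

Weight vector: $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, e.g. $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$

$$
m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,[x(t) - m_i(t)]
$$

$$
c(x) = \arg\min_{i}\,\lVert x - m_i\rVert,
\qquad\text{where}\qquad
\lVert x - m_i\rVert = \sqrt{\sum_{n}(x_n - m_{i,n})^2}
$$

$$
h_{c(x),i}(t) = \alpha(t)\,\exp\Big(-\frac{\lVert r_i - r_{c(x)}\rVert^2}{2\sigma^2(t)}\Big)
$$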
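One SOM update step can be sketched in a few lines. The map layout, the learning rate `alpha`, the neighborhood width `sigma` and all data values below are hypothetical; the functions are an illustration of the update rule, not code from the lecture.

```python
import math

def best_matching_unit(x, weights):
    """c(x): index of the unit whose weight vector is closest to x (Euclidean distance)."""
    return min(range(len(weights)),
               key=lambda i: sum((xn - mn) ** 2 for xn, mn in zip(x, weights[i])))

def som_step(x, weights, positions, alpha, sigma):
    """Apply m_i(t+1) = m_i(t) + h_{c(x),i}(t) [x(t) - m_i(t)] to every unit i."""
    c = best_matching_unit(x, weights)
    updated = []
    for i, m in enumerate(weights):
        # Squared distance between unit i and the winner on the map grid.
        d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[c]))
        h = alpha * math.exp(-d2 / (2.0 * sigma ** 2))
        updated.append([mn + h * (xn - mn) for xn, mn in zip(x, m)])
    return c, updated

# Hypothetical map with three units on a line and 2-D inputs.
weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
positions = [(0, 0), (0, 1), (0, 2)]
x = [1.2, 0.8]
c, new_weights = som_step(x, weights, positions, alpha=0.5, sigma=1.0)
```

Here unit 1 wins and moves halfway toward `x`; its map neighbors move by a smaller fraction because the neighborhood function decays with grid distance.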

IR – Berlin Chen 56

Hierarchical Document Organization (cont.)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
$$

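$$
dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)},
\qquad
dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}
$$

where

$$
f_{Between}(i,j) =
\begin{cases}
dist_{Map}(i,j) & T_{r,i} \ne T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\qquad
C_{Between}(i,j) =
\begin{cases}
1 & T_{r,i} \ne T_{r,j} \\
0 & \text{otherwise}
\end{cases}
$$

$$
f_{Within}(i,j) =
\begin{cases}
dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
\qquad
C_{Within}(i,j) =
\begin{cases}
1 & T_{r,i} = T_{r,j} \\
0 & \text{otherwise}
\end{cases}
$$

$$
dist_{Map}(i,j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}
$$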
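The between/within distance ratio can be computed directly from map coordinates and topic labels. A minimal sketch, assuming each document $i$ has a map position $(x_i, y_i)$ and a label standing in for $T_{r,i}$; the positions and labels below are hypothetical.

```python
import math
from itertools import combinations

def distance_ratio(positions, topics):
    """R_Dist = dist_Between / dist_Within over all document pairs i < j.

    positions[i] : (x_i, y_i), map coordinates of document i
    topics[i]    : label standing in for T_{r,i}
    """
    between_sum = within_sum = 0.0
    between_n = within_n = 0
    for i, j in combinations(range(len(positions)), 2):
        d = math.dist(positions[i], positions[j])     # dist_Map(i, j)
        if topics[i] != topics[j]:
            between_sum += d
            between_n += 1
        else:
            within_sum += d
            within_n += 1
    dist_between = between_sum / between_n            # average over pairs with different topics
    dist_within = within_sum / within_n               # average over pairs with the same topic
    return dist_between / dist_within

# Hypothetical layout: two tight groups placed three map units apart.
positions = [(0, 0), (0, 1), (3, 0), (3, 1)]
topics = ["a", "a", "b", "b"]
ratio = distance_ratio(positions, topics)
```

A larger ratio means that documents with different topics sit farther apart on the map than documents sharing a topic. The sketch assumes at least one pair of each kind and a nonzero within-topic distance.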


Page 40: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

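IR – Berlin Chen 40

The EM Algorithm (cont.)

• E–step (Expectation)
– Define the auxiliary function $\Phi(\Theta,\hat{\Theta})$ as the expectation of the log complete likelihood function $L_{CM}$ with respect to the hidden/latent variable $C$, conditioned on the known data $(X,\Theta)$

$$
\begin{aligned}
\Phi(\Theta,\hat{\Theta})
&= E_C\big[L_{CM} \mid X,\Theta\big]
 = E_C\big[\log P(X,C \mid \hat{\Theta}) \mid X,\Theta\big] \\
&= \sum_{C} P(C \mid X,\Theta)\,\log P(X,C \mid \hat{\Theta}) \\
&= \sum_{C} \frac{P(X,C \mid \Theta)}{P(X \mid \Theta)}\,\log P(X,C \mid \hat{\Theta})
\end{aligned}
$$

Here $\Theta$ is known and $\hat{\Theta}$ is unknown.

– Maximize the log likelihood function $\log P(X \mid \hat{\Theta})$ by maximizing the expectation of the log complete likelihood function $\Phi(\Theta,\hat{\Theta})$

• We have shown this property when deriving the HMM-based retrieval model

$$
Q(\Theta,\hat{\Theta}) - Q(\Theta,\Theta) \le \log P(X \mid \hat{\Theta}) - \log P(X \mid \Theta)
$$

where $Q$ denotes the auxiliary function $\Phi$.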

IR ndash Berlin Chen 41

The EM Algorithm (cont)bull Endashstep (Expectation)

ndash The auxiliary function ( )ΘΘΦ

( ) ( )( ) ( )( )( ) ( )

( ) ( )

( ) ( )

( )[ ] ( )

( )[ ] ( ) ( ) ( ) ( ) ( ) ( )[ ] ( ) ( ) ( ) ( ) sum sum sum sum

sum sum

sum sum

sum sum

sum sum sum prod

sum sum prodsum

sum sumprod

sum prodprod

sum

= = = =

= =

= =

= =

= = = =

= = ==

= ==

= ==

+=

=

=

=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎥⎦

⎤⎢⎣

⎡=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎥⎦

⎤⎢⎣

⎡⎥⎦

⎤⎢⎣

⎡=

⎥⎦

⎤⎢⎣

⎡⎥⎦

⎤⎢⎣

⎡=

⎥⎦

⎤⎢⎣

⎥⎥⎦

⎢⎢⎣

⎡=

m

k

n

i

m

k

n

ikijkkjk

m

k

n

ikkijk

m

k

n

ikijk

m

k

n

iikki

m

k

n

i ccc

n

jjkkkki

ccc

m

k

n

jjkki

n

ikk

cccki

n

i

n

jjk

ccc

n

iki

n

j j

kj

cxPxcPcPxcP

cPcxPxcP

cxPxcP

xcPcxP

xcPcxP

xcPcxP

cxPxcP

cxPxP

cxP

PP

P

jj

j

j

nkkk

ji

nkkk

ji

nkkk

ij

nkkk

i

j

1 1 1 1

1 1

1 1

1 1

1 1 1

1 11

11

11

ΘlogΘΘlogΘ

ΘΘlogΘ

ΘlogΘ

ΘΘlog

ΘΘlog

ΘΘlog

ΘlogΘ

ΘlogΘ

Θ

ΘlogΘΘ

ΘΘ

21

21

21

21

rrr

rr

rr

rr

rr

rr

rr

rr

r

K

K

K

K

C

C

C

C

C

CXX

CX

δ

δ

⎩⎨⎧ =

=otherwise 0

if 1

kk ikk i

δ

See Next Slide

IR ndash Berlin Chen 42

The EM Algorithm (cont)

ndash Note that

( )

( )[ ]

( )[ ]

( ) ( )

( )

( )Θ

Θ1

ΘΘ

Θ

Θ

Θ

1

1

1 1

1 1 1 1

1 1 1 1

1

1 2

1 2

21

ik

ik

n

ijj

m

cikkk

n

ijj

m

kjk

m

c

m

c

m

c

n

jjkkk

m

c

m

c

m

c

n

jjkkk

ccc

n

jjkkk

xcP

xcP

xcPxcP

xcP

xcP

xcP

ik

ii

j

j

k k nk

ji

k k nk

ji

nkkk

ji

r

r

rr

rL

rL

r

K

=

⎥⎦

⎤⎢⎣

⎡=

⎥⎥⎦

⎢⎢⎣

⎥⎥⎦

⎢⎢⎣

⎥⎥⎦

⎢⎢⎣

⎡=

=

=

⎥⎦

⎤⎢⎣

prod

sumprod sum

sum sum sum prod

sum sum sum prod

sum prod

ne=

=ne= =

= = = =

= = = =

= =

δ

δ

δ

δC

( )( )( ) ( )

sum sum sum prod

prod sum

= = = =

= =

=

+++++++++=M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

ki cx toaligned beonly can r

IR ndash Berlin Chen 43

The EM Algorithm (cont)

bull Endashstep (Expectation)ndash The auxiliary function can also be divided into two

( ) ( ) ( )

( ) ( ) ( )( ) ( )

( ) ( )( ) ( )( ) ( )

( )

( ) ( ) ( )( ) ( )( ) ( )

( )sum sumsum

sum sum

sum sumsum

sum sum

sum sum

= =

=

= =

= =

=

= =

= =

=

=

=

Φ+Φ=Φ

n

i

K

kkiK

llli

kki

n

i

K

kkiikb

n

i

K

kkK

llli

kki

n

i

K

kk

i

kki

n

i

K

kkika

ba

cxPcPcxP

cPcxP

cxPxcP

cPcPcxP

cPcxP

cPxP

cPcxP

cPxcP

1 1

1

1 1

1 1

1

1 1

1 1

ΘlogΘΘ

ΘΘ

ΘlogΘΘΘ

ΘlogΘΘ

ΘΘ

ΘlogΘ

ΘΘ

ΘlogΘΘΘ

whereΘΘΘΘΘΘ

r

r

r

rr

r

r

r

r

r

auxiliary function for mixture weights

auxiliary function for cluster distributions

IR ndash Berlin Chen 44

The EM Algorithm (cont)

bull M-step (Maximization)ndash Remember that

bull Maximize a function F with a constraint by applying Lagrange multiplier

sum

sumsumsum

sum sumsum

=

===

= ==

=there4

minus=rArrminus=

forallminus=rArr=+=

⎟⎟⎠

⎞⎜⎜⎝

⎛minus+=rArr=

N

jj

jj

N

jj

N

jj

N

jj

j

j

j

j

j

N

j

N

jjjj

N

jjj

w

wy

wwy

jyw

yw

yF

yywFywF

1

111

1 11

1logˆlog that Suppose

Multiplier Lagrange applyingBy

ll

ll

l

l

partpart

Constraint

jj

j

yyy 1logNote

=part

part

IR ndash Berlin Chen 45

The EM Algorithm (cont)

bull M-step (Maximization)ndash Maximize ( )ΘΘaΦ

( ) ( ) ( )( ) ( )

( ) ( )( ) ( ) 1ΘΘlog

ΘΘ

ΘΘ

1ΘΘΘΘΘ

11 1

1

1

⎟⎠

⎞⎜⎝

⎛minus+=

⎟⎠

⎞⎜⎝

⎛minus+Φ=Φ

sumsum sumsum

sum

== =

=

=

K

kk

K

k

n

ikK

llli

kki

K

kkaa

cPlcPcPcxP

cPcxP

cPl

r

r

kw ky

( )

( ) ( )( ) ( )( ) ( )( ) ( )

( ) ( )( ) ( )

ΘΘ

ΘΘ

ΘΘ

ΘΘ

ΘΘ

ΘΘ

Θˆ

1

1

1 1

1

1

1

1

n

cPcxP

cPcxP

cPcxP

cPcxP

cPcxP

cPcxP

w

wcP

n

iK

llli

kki

K

k

n

iK

llli

kki

n

iK

llli

kki

K

kk

kkk

sumsum

sum sumsum

sumsum

sum

=

=

= =

=

=

=

=

====rArr

r

r

r

r

r

r

π

auxiliary function for mixture weights (or priors for Gaussians)

kii

k cxr classin falls that timesofnumber expected the r

IR ndash Berlin Chen 46

The EM Algorithm (cont)

bull M-step (Maximization)ndash Maximize ( )ΘΘbΦ

( ) ( ) ( )( ) ( )

( )sum sumsum= =

=

=Φn

i

K

kkiK

llli

kkib cxP

cPcxP

cPcxP

1 1

1

ΘlogΘΘ

ΘΘ ΘΘ r

r

r

auxiliary function for (multivariate) Gaussian Means and Variances

( )( )

( ) ( )⎟⎠⎞

⎜⎝⎛ minusΣminusminus

Σ=Θ minus

kiT

ki

kmki xxcxP

kμμ

π

rrrrr 1

21exp

2

1

( ) ( )( ) ( )

and ΘΘ

ΘΘLet

1

sum=

= K

llli

kkiik

cPcxP

cPcxPw

r

r ( )( ) ( ) ( )ki

Tkik

ki

xxm

cxP

kμμπ rrrr

r

minusΣminusminusΣminussdotminus

minus1

21log2

12log2

log

( ) ( ) ( )sum sum= =

minus +⎥⎦⎤

⎢⎣⎡ minusΣminus+Σminus=Φ

n

i

K

kkik

T

kikikb Dxxw1 1

1

ˆˆˆ2

1log21 ΘΘ μμ rrrr

constant

IR ndash Berlin Chen 47

The EM Algorithm (cont)

bull M-step (Maximization)ndash Maximize with respect to

( ) ( ) ( )( )

( ) ( )( ) ( )( ) ( )( ) ( )

sumsum

sumsum

sum

sum

sum

=

=

=

=

=

=

=

minus

ΘΘ

ΘΘ

sdotΘΘ

ΘΘ

=sdot

=rArr

=minusminusΣsdotsdotsdotminus=part

Φpart

n

iK

llli

kki

n

iiK

llli

kki

n

iik

n

iiik

k

n

ikikik

k

b

cPcxP

cPcxP

xcPcxP

cPcxP

w

xw

xw

1

1

1

1

1

1

1

1

ˆ

01ˆˆ221 ˆ

ΘΘ

r

r

r

r

r

r

r

rrr

μ

μμ

( )ΘΘbΦ kμr

( ) ( ) ( )sum sum= =

minus +⎥⎦⎤

⎢⎣⎡ minusΣminus+Σminus=Φ

n

i

K

kkik

T

kikikb Dxxw1 1

1

ˆˆˆ2

1ˆlog21 ΘΘ μμ rrrr

( )here symmetric is and

)(

1minus

+=

kΣd

d xCCxCxx T

T

ki

ik

cxr

classin falls that timesofnumber expected the

r

IR ndash Berlin Chen 48

The EM Algorithm (cont)

bull M-step (Maximization)ndash Maximize with respect to( )ΘΘbΦ

( )[ ]

here symmetric is and

)det(det

kΣd

d TXXX

X minussdot=

( ) ( ) ( )sum sum= =

minus +⎥⎦⎤

⎢⎣⎡ minusΣminus+Σminus=Φ

n

i

K

kkik

T

kikikb Dxxw1 1

1

ˆˆˆ2

1ˆlog21 ΘΘ μμ rrrr

( ) ( )( )

( )( )

( )( )

( )( )

( )( )( ) ( )( ) ( )

( )( )

( ) ( )( ) ( )

sumsum

sumsum

sum

sum

sumsum

sumsum

sumsum

sum

=

=

=

=

=

=

==

=

minusminus

=

minus

=

minusminus

=

minus

=

minusminusminusminus

ΘΘ

ΘΘ

minusminussdotΘΘ

ΘΘ

=minusminussdot

=ΣrArr

minusminussdot=ΣsdotrArr

ΣΣminusminusΣΣsdot=ΣΣΣsdotrArr

ΣminusminusΣsdot=ΣsdotrArr

=⎥⎦⎤

⎢⎣⎡ ΣminusminusΣminusΣsdotΣsdotΣsdotminus=

ΣpartΦpart

n

iK

llli

kki

n

i

T

kikiK

llli

kki

n

iik

n

i

T

kikiik

k

n

i

T

kikiik

n

ikik

k

n

ik

T

kikikkik

n

ikkkik

n

ik

T

kikikik

n

ikik

n

ik

T

kikikkkkikk

b

cPcxP

cPcxP

xxcPcxP

cPcxP

w

xxw

xxww

xxww

xxww

xxw

1

1

1

1

1

1

1

1

1

11

1

1

1

11

1

1

1

1111

ˆˆ

ˆˆˆ

ˆˆˆ

ˆˆˆˆˆˆˆˆˆ

ˆˆˆˆˆ

0ˆˆˆˆˆ2

1 ˆΘΘ

r

r

rrrr

r

r

rrrr

rrrr

rrrr

rrrr

rrrr

μμμμ

μμ

μμ

μμ

μμ

111 )( minusminusminus

minus= XabXX

bXa TT

dd

IR ndash Berlin Chen 49

The EM Algorithm (cont)

bull The initial cluster distributions can be estimated using the K-means algorithm

bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached

( )ΘXP

IR ndash Berlin Chen 50

Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information

ndash TMMPLSA approach

bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map

bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map

bull When a cluster has many documents we can further analyze it into an other map on the next layer

Two-dimensional Tree Structure

for Organized Topics

( ) ( ) ( ) ( )sum ⎥⎦

⎤⎢⎣

⎡sum=

= =

K

k

K

lljklikij TwPYTPDTPDwP

1 1

( ) ( )⎥⎥⎦

⎢⎢⎣

⎡minus= 2

2

2exp

21

σσπlk

klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=

( ) ( )( )sum

=

=

K

sks

klkl

TTE

TTEYTP

1

IR ndash Berlin Chen 51

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

\[
\begin{aligned}
L_T &= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j, D_i)\,\log P(w_j \mid D_i) \\
    &= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j, D_i)\,\log\left\{\sum_{k=1}^{K} P(T_k \mid D_i)\left[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\right]\right\}
\end{aligned}
\]

  - EM training can be performed

\[
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\,P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N} c(w_{j'}, D_{i'})\,P(T_k \mid w_{j'}, D_{i'})}
\]

\[
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\,P(T_k \mid w_j, D_i)}{c(D_i)}
\]

where

\[
P(T_k \mid w_j, D_i) = \frac{\left[\sum_{l=1}^{K} P(w_j \mid T_l)\,P(T_l \mid Y_k)\right] P(T_k \mid D_i)}{\sum_{k'=1}^{K}\left\{\left[\sum_{l=1}^{K} P(w_j \mid T_l)\,P(T_l \mid Y_{k'})\right] P(T_{k'} \mid D_i)\right\}}
\]
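One pass of these update formulas can be sketched as follows. This is a hedged illustration, not the author's code: the function name `tmm_em_step`, the nested-list layout of the probability tables, and the assumption that every count row and every normalizer is non-zero are all choices made for the example.

```python
def tmm_em_step(counts, p_t_d, p_w_t, p_t_y):
    """One iteration of the update formulas above.

    counts[i][j] = c(w_j, D_i)
    p_t_d[i][k]  = P(T_k | D_i)
    p_w_t[l][j]  = P(w_j | T_l)
    p_t_y[k][l]  = P(T_l | Y_k)
    Returns the re-estimated (P(T_k | D_i), P(w_j | T_k)) tables.
    """
    N, J, K = len(counts), len(counts[0]), len(p_w_t)
    # q[k][j] = sum_l P(T_l | Y_k) P(w_j | T_l): neighborhood-smoothed word probability
    q = [[sum(p_t_y[k][l] * p_w_t[l][j] for l in range(K)) for j in range(J)]
         for k in range(K)]
    new_w_t = [[0.0] * J for _ in range(K)]
    new_t_d = [[0.0] * K for _ in range(N)]
    for i in range(N):
        for j in range(J):
            c = counts[i][j]
            if c == 0:
                continue
            # posterior P(T_k | w_j, D_i)
            joint = [q[k][j] * p_t_d[i][k] for k in range(K)]
            total = sum(joint)
            for k in range(K):
                post = joint[k] / total
                new_w_t[k][j] += c * post   # numerator of P^(w_j | T_k)
                new_t_d[i][k] += c * post   # numerator of P^(T_k | D_i)
    for k in range(K):
        norm = sum(new_w_t[k])              # sum over all words and documents
        new_w_t[k] = [v / norm for v in new_w_t[k]]
    for i in range(N):
        c_d = sum(counts[i])                # c(D_i): total term count of D_i
        new_t_d[i] = [v / c_d for v in new_t_d[i]]
    return new_t_d, new_w_t
```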


Hierarchical Document Organization (cont)

• Criterion for topic word selection

\[
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\,P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'})\,\left[1 - P(T_k \mid D_{i'})\right]}
\]
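The selection score is a ratio of topic-weighted counts to complement-weighted counts, which a few lines of code make concrete. The function name and the choice to return infinity when the denominator is zero are assumptions of this sketch.

```python
def topic_word_score(counts, p_t_d, j, k):
    """S(w_j, T_k): counts[i][j] = c(w_j, D_i), p_t_d[i][k] = P(T_k | D_i)."""
    n_docs = len(counts)
    inside = sum(counts[i][j] * p_t_d[i][k] for i in range(n_docs))
    outside = sum(counts[i][j] * (1.0 - p_t_d[i][k]) for i in range(n_docs))
    # A zero denominator means the word occurs only in documents fully
    # assigned to T_k; returning infinity here is a choice of this sketch.
    return inside / outside if outside > 0 else float("inf")
```

Words with the highest score for a topic occur mostly in documents dominated by that topic, so they are natural candidates for labeling it.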


Hierarchical Document Organization (cont)

• Example


Hierarchical Document Organization (cont)

• Example (cont)


Hierarchical Document Organization (cont)

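• Self-Organizing Map (SOM)
  - A recursive regression process

[Figure: Input Layer and Mapping Layer; a weight vector \(m_i\) is moved toward the input \(x\) along \(x - m_i\), giving the updated \(m_i'\)]

Input vector: \(x = [x_1, x_2, \ldots, x_n]^T\)

Weight vector: \(m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T\), e.g. \(m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T\)

\[
m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,\left[x(t) - m_i(t)\right]
\]

where

\[
c(x) = \arg\min_i \lVert x - m_i \rVert, \qquad
\lVert x - m_i \rVert = \sqrt{\sum_{n} (x_n - m_{i,n})^2}
\]

\[
h_{c(x),i}(t) = \alpha(t)\,\exp\left(-\frac{\lVert r_i - r_{c(x)} \rVert^2}{2\sigma^2(t)}\right)
\]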
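A single SOM update, as given by the equations above, can be sketched as follows. The function name `som_step`, the list-based data layout, and passing `alpha` and `sigma` as fixed numbers for one time step are assumptions of this example.

```python
import math


def som_step(x, weights, grid, alpha, sigma):
    """Apply one SOM update for input x.

    weights[i] is the weight vector m_i, grid[i] is the map position r_i.
    Returns (index of the winning unit c(x), updated weight vectors).
    """
    # Winner: the unit whose weight vector is closest to x
    c = min(range(len(weights)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, weights[i])))
    updated = []
    for i, m in enumerate(weights):
        # Neighborhood function h_{c(x),i}: decays with map distance to the winner
        d2 = sum((a - b) ** 2 for a, b in zip(grid[i], grid[c]))
        h = alpha * math.exp(-d2 / (2.0 * sigma ** 2))
        # Move m_i toward x by a fraction h of the difference x - m_i
        updated.append([mi + h * (xi - mi) for xi, mi in zip(x, m)])
    return c, updated
```

In a full training run `alpha` and `sigma` would typically shrink over time, so that early updates organize the map globally and later ones fine-tune it.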


Hierarchical Document Organization (cont)

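• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

\[
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
\]

where

\[
dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)},
\qquad
dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}
\]

\[
f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\]

\[
f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\]

\[
dist_{Map}(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
\]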
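The ratio \(R_{Dist}\) can be computed with one pass over all document pairs. In this sketch the function name `r_dist` is hypothetical, and `topics[i]` is assumed to hold the topic label written \(T_{r,i}\) above; the example also assumes at least one pair of each kind and a non-zero within-topic distance.

```python
import math


def r_dist(coords, topics):
    """Ratio of mean between-topic to mean within-topic map distance.

    coords[i] = (x_i, y_i) is the map position of document i;
    topics[i] is its topic label (assumed meaning of T_{r,i}).
    """
    between_sum = within_sum = 0.0
    between_n = within_n = 0
    D = len(coords)
    for i in range(D):
        for j in range(i + 1, D):
            d = math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1])
            if topics[i] != topics[j]:
                between_sum += d
                between_n += 1
            else:
                within_sum += d
                within_n += 1
    dist_between = between_sum / between_n
    dist_within = within_sum / within_n
    return dist_between / dist_within
```

A larger ratio means that documents of different topics lie farther apart on the map than documents of the same topic.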


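The EM Algorithm (cont)

• E-step (Expectation)
  - The auxiliary function \(\Phi(\Theta, \hat{\Theta})\)

\[
\begin{aligned}
\Phi(\Theta,\hat{\Theta})
&= \sum_{C} P(C \mid X,\Theta)\,\log P(X, C \mid \hat{\Theta}) \\
&= \sum_{C}\left[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j,\Theta)\right]\log\left[\prod_{i=1}^{n} P(\vec{x}_i, c_{k_i} \mid \hat{\Theta})\right] \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\left[\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j,\Theta)\right]\sum_{i=1}^{n}\log P(\vec{x}_i, c_{k_i} \mid \hat{\Theta}) \\
&= \sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\;\sum_{i=1}^{n}\sum_{k=1}^{K}\delta_{k_i,k}\,\log P(\vec{x}_i, c_k \mid \hat{\Theta})\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j,\Theta) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}\log P(\vec{x}_i, c_k \mid \hat{\Theta})\left\{\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k_i,k}\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j,\Theta)\right\} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i,\Theta)\,\log P(\vec{x}_i, c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i,\Theta)\,\log\left[P(c_k \mid \hat{\Theta})\,P(\vec{x}_i \mid c_k,\hat{\Theta})\right] \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i,\Theta)\,\log P(c_k \mid \hat{\Theta})
 + \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i,\Theta)\,\log P(\vec{x}_i \mid c_k,\hat{\Theta})
\end{aligned}
\]

where

\[
\delta_{k_i,k} = \begin{cases} 1 & \text{if } k_i = k \\ 0 & \text{otherwise} \end{cases}
\]

The term in braces reduces to \(P(c_k \mid \vec{x}_i,\Theta)\) (see the next slide).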


The EM Algorithm (cont)

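  - Note that

\[
\begin{aligned}
\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}\delta_{k_i,k}\prod_{j=1}^{n} P(c_{k_j} \mid \vec{x}_j,\Theta)
&= \left[\sum_{k_1=1}^{K}\cdots\sum_{k_{i-1}=1}^{K}\sum_{k_{i+1}=1}^{K}\cdots\sum_{k_n=1}^{K}\;\prod_{j=1,\,j\neq i}^{n} P(c_{k_j} \mid \vec{x}_j,\Theta)\right] P(c_k \mid \vec{x}_i,\Theta) \\
&= \left[\prod_{j=1,\,j\neq i}^{n}\;\sum_{k_j=1}^{K} P(c_{k_j} \mid \vec{x}_j,\Theta)\right] P(c_k \mid \vec{x}_i,\Theta) \\
&= \left[\prod_{j=1,\,j\neq i}^{n} 1\right] P(c_k \mid \vec{x}_i,\Theta) \\
&= P(c_k \mid \vec{x}_i,\Theta)
\end{aligned}
\]

because \(\delta_{k_i,k}\) means that \(\vec{x}_i\) can only be aligned to \(c_k\).

Note:

\[
\prod_{t=1}^{T}\sum_{k=1}^{M} a_{t,k}
= (a_{1,1} + a_{1,2} + \cdots + a_{1,M})(a_{2,1} + a_{2,2} + \cdots + a_{2,M})\cdots(a_{T,1} + a_{T,2} + \cdots + a_{T,M})
= \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\;\prod_{t=1}^{T} a_{t,k_t}
\]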
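The product-of-sums identity in the note can be checked numerically by brute force. The two helper names below are hypothetical; `a[t][k]` stands for \(a_{t,k}\) with zero-based indices.

```python
import math
from itertools import product


def product_of_sums(a):
    """Left-hand side: prod_t sum_k a[t][k]."""
    return math.prod(sum(row) for row in a)


def sum_of_products(a):
    """Right-hand side: sum over all index tuples (k_1..k_T) of prod_t a[t][k_t]."""
    T, M = len(a), len(a[0])
    return sum(
        math.prod(a[t][ks[t]] for t in range(T))
        for ks in product(range(M), repeat=T)
    )
```

When each row of `a` is a posterior distribution over classes, every row sums to one, so both sides equal one. This is the step that collapses the bracketed factor in the derivation above.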


The EM Algorithm (cont)

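• E-step (Expectation)
  - The auxiliary function can also be divided into two parts

\[
\Phi(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + \Phi_b(\Theta,\hat{\Theta})
\]

where

\[
\begin{aligned}
\Phi_a(\Theta,\hat{\Theta})
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i,\Theta)\,\log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i, c_k \mid \Theta)}{P(\vec{x}_i \mid \Theta)}\,\log P(c_k \mid \hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(c_k \mid \hat{\Theta})
\end{aligned}
\]

(auxiliary function for mixture weights)

\[
\begin{aligned}
\Phi_b(\Theta,\hat{\Theta})
&= \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i,\Theta)\,\log P(\vec{x}_i \mid c_k,\hat{\Theta}) \\
&= \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(\vec{x}_i \mid c_k,\hat{\Theta})
\end{aligned}
\]

(auxiliary function for cluster distributions)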


The EM Algorithm (cont)

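• M-step (Maximization)
  - Remember that
    • A function F can be maximized under a constraint by applying a Lagrange multiplier

Suppose that \(F = \sum_{j=1}^{N} w_j \log y_j\), with the constraint \(\sum_{j=1}^{N} y_j = 1\).

By applying a Lagrange multiplier \(\ell\):

\[
\begin{aligned}
\hat{F} &= \sum_{j=1}^{N} w_j \log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right) \\
\frac{\partial \hat{F}}{\partial y_j} &= \frac{w_j}{y_j} + \ell = 0
\;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j \\
\sum_{j=1}^{N} w_j &= -\ell\sum_{j=1}^{N} y_j = -\ell
\;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j \\
\therefore\; y_j &= \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
\end{aligned}
\]

Note: \(\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}\)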
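The closed-form result \(y_j = w_j / \sum_{j'} w_{j'}\) can be sanity-checked by comparing the objective at that point with its value at other feasible points. The helper names here are hypothetical.

```python
import math


def maximize_weighted_log(w):
    """Maximizer of sum_j w_j log y_j subject to sum_j y_j = 1."""
    total = sum(w)
    return [wj / total for wj in w]


def objective(w, y):
    """F = sum_j w_j log y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))
```

For `w = [2, 1, 1]` the maximizer is `[0.5, 0.25, 0.25]`, and any other distribution over three outcomes gives a lower value of `objective`.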


The EM Algorithm (cont)

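• M-step (Maximization)
  - Maximize \(\Phi_a(\Theta,\hat{\Theta})\), the auxiliary function for mixture weights (or priors for Gaussians)

\[
\begin{aligned}
\hat{\Phi}_a(\Theta,\hat{\Theta})
&= \Phi_a(\Theta,\hat{\Theta}) + \ell\left(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\right) \\
&= \sum_{k=1}^{K}\underbrace{\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}_{w_k}\,\log\underbrace{P(c_k \mid \hat{\Theta})}_{y_k}
 + \ell\left(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\right)
\end{aligned}
\]

\[
\Rightarrow\;
\hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}}
= \frac{\sum_{i=1}^{n}\dfrac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}{\sum_{k'=1}^{K}\sum_{i=1}^{n}\dfrac{P(\vec{x}_i \mid c_{k'},\Theta)\,P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
= \frac{1}{n}\sum_{i=1}^{n}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}
\]

Each term in the sum over \(i\) is the expected number of times that \(\vec{x}_i\) falls in class \(c_k\).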
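The weight update is an average of posteriors, which the following sketch computes for one-dimensional Gaussian components. The function names and the restriction to 1-D data are assumptions made to keep the example short.

```python
import math


def gauss_pdf(x, mu, var):
    """Density of a 1-D Gaussian N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(xs, weights, mus, variances):
    """resp[i][k] = P(c_k | x_i, Theta), computed with Bayes' rule."""
    resp = []
    for x in xs:
        joint = [w * gauss_pdf(x, m, v)
                 for w, m, v in zip(weights, mus, variances)]
        total = sum(joint)              # P(x_i | Theta)
        resp.append([j / total for j in joint])
    return resp


def update_weights(resp):
    """New mixture weights: mean posterior of each class over all samples."""
    n, K = len(resp), len(resp[0])
    return [sum(r[k] for r in resp) / n for k in range(K)]
```

With two samples near the first component and one near the second, the first weight moves from 0.5 toward roughly two thirds after a single update.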


The EM Algorithm (cont)

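• M-step (Maximization)
  - Maximize \(\Phi_b(\Theta,\hat{\Theta})\), the auxiliary function for (multivariate) Gaussian means and variances

\[
\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(\vec{x}_i \mid c_k,\hat{\Theta})
\]

\[
P(\vec{x}_i \mid c_k,\hat{\Theta}) = \frac{1}{(2\pi)^{m/2}\,\lvert\hat{\Sigma}_k\rvert^{1/2}}\exp\left(-\frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i - \hat{\vec{\mu}}_k)\right)
\]

Let

\[
w_{ik} = \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}
\]

and

\[
\log P(\vec{x}_i \mid c_k,\hat{\Theta}) = -\frac{m}{2}\log(2\pi) - \frac{1}{2}\log\lvert\hat{\Sigma}_k\rvert - \frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i - \hat{\vec{\mu}}_k)
\]

Then

\[
\Phi_b(\Theta,\hat{\Theta}) = -\sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[\frac{1}{2}\log\lvert\hat{\Sigma}_k\rvert + \frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i - \hat{\vec{\mu}}_k)\right] + D
\]

where \(D\) is a constant.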


The EM Algorithm (cont)

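• M-step (Maximization)
  - Maximize \(\Phi_b(\Theta,\hat{\Theta})\) with respect to \(\hat{\vec{\mu}}_k\)

\[
\Phi_b(\Theta,\hat{\Theta}) = -\sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[\frac{1}{2}\log\lvert\hat{\Sigma}_k\rvert + \frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i - \hat{\vec{\mu}}_k)\right] + D
\]

\[
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\vec{\mu}}_k}
= -\sum_{i=1}^{n} w_{ik}\cdot\frac{1}{2}\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i - \hat{\vec{\mu}}_k)\cdot(-1) = 0
\]

\[
\Rightarrow\;
\hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\sum_{i=1}^{n}\dfrac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\cdot\vec{x}_i}{\sum_{i=1}^{n}\dfrac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
\]

Each term in the denominator is the expected number of times that \(\vec{x}_i\) falls in class \(c_k\).

Note: \(\dfrac{d\,(x^T C x)}{dx} = (C + C^T)\,x\), and \(\hat{\Sigma}_k^{-1}\) is symmetric here.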
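The mean update is a posterior-weighted average of the samples. A minimal sketch, assuming the posteriors \(w_{ik}\) are already available as `resp[i][k]` and samples are plain lists of coordinates (the function name is hypothetical):

```python
def update_means(xs, resp):
    """Return mu_k = sum_i w_ik x_i / sum_i w_ik for every class k.

    xs[i] is the i-th sample (a list of m coordinates);
    resp[i][k] is the posterior weight w_ik.
    """
    n, K, m = len(xs), len(resp[0]), len(xs[0])
    means = []
    for k in range(K):
        nk = sum(resp[i][k] for i in range(n))     # expected count of class k
        means.append([
            sum(resp[i][k] * xs[i][d] for i in range(n)) / nk
            for d in range(m)
        ])
    return means
```

With hard (0/1) posteriors this reduces to the per-cluster centroid computed by K-means.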


The EM Algorithm (cont)

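• M-step (Maximization)
  - Maximize \(\Phi_b(\Theta,\hat{\Theta})\) with respect to \(\hat{\Sigma}_k\)

\[
\Phi_b(\Theta,\hat{\Theta}) = -\sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[\frac{1}{2}\log\lvert\hat{\Sigma}_k\rvert + \frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i - \hat{\vec{\mu}}_k)\right] + D
\]

Notes: \(\dfrac{d\,\det(X)}{dX} = \det(X)\cdot X^{-T}\), \(\dfrac{d\,(a^T X^{-1} b)}{dX} = -X^{-T}\,a\,b^T\,X^{-T}\), and \(\hat{\Sigma}_k\) is symmetric here.

\[
\begin{aligned}
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\Sigma}_k}
&= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\left[\lvert\hat{\Sigma}_k\rvert^{-1}\cdot\lvert\hat{\Sigma}_k\rvert\cdot\hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1}(\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1}\right] = 0 \\
\Rightarrow\;& \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k^{-1}(\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1} \\
\Rightarrow\;& \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1}\hat{\Sigma}_k \\
\Rightarrow\;& \hat{\Sigma}_k\sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik}\,(\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \\
\Rightarrow\;& \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\,(\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}}
= \frac{\sum_{i=1}^{n}\dfrac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\cdot(\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n}\dfrac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
\end{aligned}
\]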
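The covariance update is a posterior-weighted average of outer products. A minimal sketch, assuming the posteriors \(w_{ik}\) are given as `resp[i][k]` and the updated means as `means[k]` (function name and data layout are hypothetical):

```python
def update_covariances(xs, resp, means):
    """Return Sigma_k = sum_i w_ik (x_i - mu_k)(x_i - mu_k)^T / sum_i w_ik.

    xs[i] is the i-th sample, resp[i][k] the posterior weight w_ik,
    means[k] the (already updated) mean of class k.
    """
    n, K, m = len(xs), len(means), len(xs[0])
    covs = []
    for k in range(K):
        nk = sum(resp[i][k] for i in range(n))     # expected count of class k
        mu = means[k]
        covs.append([
            [
                sum(resp[i][k] * (xs[i][a] - mu[a]) * (xs[i][b] - mu[b])
                    for i in range(n)) / nk
                for b in range(m)
            ]
            for a in range(m)
        ])
    return covs
```

A class that effectively owns a single sample gets a zero covariance matrix from this formula, which is why practical implementations usually add some form of regularization.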

ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice


The EM Algorithm (cont.)

– Note that

$$
\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_n=1}^{K}
\delta_{k_i,k}\prod_{j=1}^{n}P(c_{k_j}\mid\vec{x}_j,\Theta)
=\left[\prod_{j=1,\,j\neq i}^{n}\;\sum_{k_j=1}^{K}P(c_{k_j}\mid\vec{x}_j,\Theta)\right]P(c_k\mid\vec{x}_i,\Theta)
=P(c_k\mid\vec{x}_i,\Theta)
$$

because each inner sum $\sum_{k_j=1}^{K}P(c_{k_j}\mid\vec{x}_j,\Theta)$ equals 1. Here $\delta_{k_i,k}=1$ if $k_i=k$ and 0 otherwise, that is, $\vec{x}_i$ can only be aligned to $c_k$.

Note: the step above uses the general identity

$$
(a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots(a_{T1}+a_{T2}+\cdots+a_{TM})
=\prod_{t=1}^{T}\sum_{k=1}^{M}a_{t,k}
=\sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\;\prod_{t=1}^{T}a_{t,k_t}
$$
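The product-of-sums identity used on this slide can be checked numerically. The following is a minimal sketch; the function names and the small matrix `a` are illustrative assumptions, not part of the original slides.

```python
from itertools import product
from math import prod

def product_of_sums(a):
    """Left-hand side: product over rows t of (sum over k of a[t][k])."""
    return prod(sum(row) for row in a)

def sum_of_products(a):
    """Right-hand side: sum over every index tuple (k_1, ..., k_T) of prod_t a[t][k_t]."""
    T, M = len(a), len(a[0])
    return sum(
        prod(a[t][ks[t]] for t in range(T))
        for ks in product(range(M), repeat=T)
    )

# Hypothetical example: T = 3 rows, M = 2 columns, each row sums to 1
# (like the posteriors P(c | x_j, Theta) of three samples over two clusters).
a = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
lhs = product_of_sums(a)
rhs = sum_of_products(a)
```

Because every row sums to 1, both sides equal 1 here, which is why the factors for all samples other than the i-th drop out of the multiple sum.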

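The EM Algorithm (cont.)

• E-step (Expectation)
  – The auxiliary function can also be divided into two parts:

$$
\Phi(\Theta,\hat{\Theta})=\Phi_a(\Theta,\hat{\Theta})+\Phi_b(\Theta,\hat{\Theta})
$$

where

$$
\begin{aligned}
\Phi_a(\Theta,\hat{\Theta})
&=\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log P(c_k\mid\hat{\Theta})\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{P(\vec{x}_i\mid\Theta)}\,\log P(c_k\mid\hat{\Theta})\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\,\log P(c_k\mid\hat{\Theta})
\end{aligned}
$$

(auxiliary function for mixture weights)

$$
\begin{aligned}
\Phi_b(\Theta,\hat{\Theta})
&=\sum_{i=1}^{n}\sum_{k=1}^{K}P(c_k\mid\vec{x}_i,\Theta)\,\log P(\vec{x}_i\mid c_k,\hat{\Theta})\\
&=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\,\log P(\vec{x}_i\mid c_k,\hat{\Theta})
\end{aligned}
$$

(auxiliary function for cluster distributions)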
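Both auxiliary functions are weighted by the same posterior P(c_k | x_i, Θ). As a minimal sketch of the E-step, the code below computes that posterior for a one-dimensional Gaussian mixture; the 1-D restriction, the function names, and the sample values are assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities(xs, priors, means, variances):
    """Return w[i][k] = P(c_k | x_i, Theta) via Bayes' rule."""
    w = []
    for x in xs:
        # Numerator of Bayes' rule for each cluster: P(x_i | c_k) * P(c_k)
        joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
        total = sum(joint)  # P(x_i | Theta), the sum over all clusters l
        w.append([j / total for j in joint])
    return w

# Hypothetical data: two samples near -2 and two near +2.
xs = [-2.1, -1.9, 2.0, 2.2]
w = responsibilities(xs, priors=[0.5, 0.5], means=[-2.0, 2.0], variances=[1.0, 1.0])
```

Each row of `w` sums to 1, and samples close to a cluster mean are assigned almost entirely to that cluster.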

The EM Algorithm (cont.)

• M-step (Maximization)
  – Remember that
    • a function F can be maximized under a constraint by applying a Lagrange multiplier

Suppose that

$$
F=\sum_{j=1}^{N}w_j\log y_j
\qquad\text{with the constraint}\qquad
\sum_{j=1}^{N}y_j=1
$$

By applying a Lagrange multiplier $\ell$:

$$
\begin{aligned}
\hat{F}&=\sum_{j=1}^{N}w_j\log y_j+\ell\left(\sum_{j=1}^{N}y_j-1\right)\\
\frac{\partial\hat{F}}{\partial y_j}&=\frac{w_j}{y_j}+\ell=0
\;\Rightarrow\; w_j=-\ell\,y_j\quad\forall j\\
\sum_{j=1}^{N}w_j&=-\ell\sum_{j=1}^{N}y_j
\;\Rightarrow\;\ell=-\sum_{j=1}^{N}w_j\\
\therefore\; y_j&=\frac{w_j}{\sum_{j'=1}^{N}w_{j'}}
\end{aligned}
$$

Note: $\dfrac{\partial\log y}{\partial y}=\dfrac{1}{y}$
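The closed-form result y_j = w_j / Σ w_j can be compared against other points that satisfy the constraint. This is a small sketch; the weights and candidate points are made-up values.

```python
import math

def lagrange_optimum(w):
    """Maximizer of F = sum_j w_j * log(y_j) subject to sum_j y_j = 1."""
    total = sum(w)
    return [wj / total for wj in w]

def objective(w, y):
    """F(y) = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))

w = [3.0, 1.0, 6.0]              # hypothetical non-negative weights
y_star = lagrange_optimum(w)     # proportional to w: 0.3, 0.1, 0.6 up to rounding
best_F = objective(w, y_star)

# A few other points on the probability simplex, for comparison.
candidates = [[1 / 3, 1 / 3, 1 / 3], [0.2, 0.2, 0.6], [0.35, 0.05, 0.6]]
others = [objective(w, y) for y in candidates]
```

Every candidate gives a strictly smaller objective than `y_star`, as the derivation predicts.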

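The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta,\hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$
\begin{aligned}
\hat{\Phi}_a(\Theta,\hat{\Theta})
&=\Phi_a(\Theta,\hat{\Theta})+\ell\left(\sum_{k=1}^{K}P(c_k\mid\hat{\Theta})-1\right)\\
&=\sum_{k=1}^{K}\underbrace{\left[\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\right]}_{w_k}\log\underbrace{P(c_k\mid\hat{\Theta})}_{y_k}
+\ell\left(\sum_{k=1}^{K}P(c_k\mid\hat{\Theta})-1\right)
\end{aligned}
$$

$$
\Rightarrow\;\hat{\pi}_k=P(c_k\mid\hat{\Theta})=\frac{w_k}{\sum_{k'=1}^{K}w_{k'}}
=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}{\displaystyle\sum_{k'=1}^{K}\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_{k'},\Theta)\,P(c_{k'}\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}
=\frac{1}{n}\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}
$$

Each term $\dfrac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.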
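The mixture-weight update is just the average posterior over all samples. A minimal sketch, assuming the posteriors are already available as a list of rows `w[i][k]` (a hypothetical layout):

```python
def update_priors(w):
    """New mixture weights: pi_k = (1/n) * sum_i P(c_k | x_i, Theta)."""
    n, K = len(w), len(w[0])
    return [sum(w[i][k] for i in range(n)) / n for k in range(K)]

# Hypothetical posteriors for n = 4 samples and K = 2 clusters.
w = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.0, 1.0]]
priors = update_priors(w)
```

Since each row of `w` sums to 1, the updated weights sum to 1 without any extra normalization.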

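The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$
\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\,\log P(\vec{x}_i\mid c_k,\hat{\Theta})
$$

with the multivariate Gaussian density (m is the dimension of $\vec{x}_i$)

$$
P(\vec{x}_i\mid c_k,\Theta)=\frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^{T}\Sigma_k^{-1}(\vec{x}_i-\vec{\mu}_k)\right)
$$

Let

$$
w_{ik}=\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}
$$

and

$$
\log P(\vec{x}_i\mid c_k,\hat{\Theta})=-\frac{m}{2}\cdot\log(2\pi)-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)
$$

Then

$$
\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right]+D
$$

where D is a constant.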

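The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$
\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right]+D
$$

$$
\frac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\vec{\mu}}_k}
=-\frac{1}{2}\sum_{i=1}^{n}w_{ik}\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1)=0
$$

$$
\Rightarrow\;\hat{\vec{\mu}}_k=\frac{\sum_{i=1}^{n}w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n}w_{ik}}
=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\cdot\vec{x}_i}{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}
$$

Here $w_{ik}$ is the expected number of times that $\vec{x}_i$ falls in class $c_k$.

Note: $\dfrac{d(\vec{x}^{T}C\vec{x})}{d\vec{x}}=(C+C^{T})\vec{x}$, and $\Sigma_k^{-1}$ is symmetric here.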
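The mean update is a posterior-weighted average of the samples. A minimal sketch with plain Python lists; the data layout (`xs[i]` as a coordinate list, `w[i][k]` as the posterior) is an assumption for illustration.

```python
def update_means(xs, w):
    """mu_k = sum_i w[i][k] * x_i / sum_i w[i][k], computed per dimension."""
    n, K, dim = len(xs), len(w[0]), len(xs[0])
    means = []
    for k in range(K):
        mass = sum(w[i][k] for i in range(n))  # expected number of samples in cluster k
        means.append([sum(w[i][k] * xs[i][d] for i in range(n)) / mass for d in range(dim)])
    return means

# Hypothetical hard assignments: the first two points belong to cluster 0, the third to cluster 1.
xs = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
w = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
means = update_means(xs, w)
```

With 0/1 posteriors the update reduces to the ordinary centroid of each cluster, which is the K-means update as a special case.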

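The EM Algorithm (cont.)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta,\hat{\Theta})=\sum_{i=1}^{n}\sum_{k=1}^{K}w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k|-\frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\right]+D
$$

$$
\begin{aligned}
&\frac{\partial\Phi_b(\Theta,\hat{\Theta})}{\partial\hat{\Sigma}_k}
=-\frac{1}{2}\sum_{i=1}^{n}w_{ik}\left[\hat{\Sigma}_k^{-1}-\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\right]=0\\
&\Rightarrow\;\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k^{-1}
=\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\\
&\Rightarrow\;\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k
=\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\hat{\Sigma}_k\\
&\Rightarrow\;\sum_{i=1}^{n}w_{ik}\,\hat{\Sigma}_k
=\sum_{i=1}^{n}w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\\
&\Rightarrow\;\hat{\Sigma}_k
=\frac{\sum_{i=1}^{n}w_{ik}\,(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n}w_{ik}}
=\frac{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\displaystyle\sum_{i=1}^{n}\frac{P(\vec{x}_i\mid c_k,\Theta)\,P(c_k\mid\Theta)}{\sum_{l=1}^{K}P(\vec{x}_i\mid c_l,\Theta)\,P(c_l\mid\Theta)}}
\end{aligned}
$$

Notes:

$$
\frac{d\,\det(X)}{dX}=\det(X)\cdot X^{-T},
\qquad
\frac{d\,(\vec{a}^{T}X^{-1}\vec{b})}{dX}=-X^{-T}\vec{a}\,\vec{b}^{T}X^{-T}
$$

and $\hat{\Sigma}_k$ is symmetric here.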
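The covariance update is a posterior-weighted average of outer products. The sketch below handles a single cluster k with plain lists; the function name, argument layout, and sample values are assumptions for illustration.

```python
def update_covariance(xs, w_k, mean_k):
    """Sigma_k = sum_i w_ik (x_i - mu_k)(x_i - mu_k)^T / sum_i w_ik."""
    dim = len(mean_k)
    mass = sum(w_k)
    cov = [[0.0] * dim for _ in range(dim)]
    for x, wi in zip(xs, w_k):
        diff = [x[d] - mean_k[d] for d in range(dim)]
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += wi * diff[a] * diff[b]  # weighted outer product
    return [[c / mass for c in row] for row in cov]

# Hypothetical data: the third point has zero posterior for this cluster.
xs = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
w_k = [1.0, 1.0, 0.0]
cov = update_covariance(xs, w_k, mean_k=[1.0, 0.0])
```

The two contributing points differ only along the first axis, so the estimate has variance 1 on that axis and 0 elsewhere. In practice a small floor or regularizer is often added to keep the matrix invertible; that is a common precaution, not something stated on the slides.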

The EM Algorithm (cont.)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X\mid\Theta)$ has converged or the maximum number of iterations is reached
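The two stopping rules can be seen in a complete loop. The following is a sketch for a one-dimensional Gaussian mixture, under several assumptions not taken from the slides: initial means are passed in directly (standing in for a K-means initialization), initial variances are 1, and `tol`, `max_iter`, and `var_floor` are hypothetical parameters.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_1d(xs, init_means, max_iter=100, tol=1e-8, var_floor=1e-6):
    K, n = len(init_means), len(xs)
    means = list(init_means)
    priors = [1.0 / K] * K
    variances = [1.0] * K
    prev_ll = -math.inf
    for iteration in range(1, max_iter + 1):
        # E-step: posteriors w[i][k] and the log-likelihood log P(X | Theta)
        w, ll = [], 0.0
        for x in xs:
            joint = [priors[k] * normal_pdf(x, means[k], variances[k]) for k in range(K)]
            total = sum(joint)
            ll += math.log(total)
            w.append([j / total for j in joint])
        # Terminate when the log-likelihood has stopped improving
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: closed-form updates for priors, means, variances
        for k in range(K):
            mass = sum(row[k] for row in w)
            priors[k] = mass / n
            means[k] = sum(row[k] * x for row, x in zip(w, xs)) / mass
            var = sum(row[k] * (x - means[k]) ** 2 for row, x in zip(w, xs)) / mass
            variances[k] = max(var, var_floor)  # floor is an assumed safeguard
    return priors, means, variances, iteration

# Hypothetical data: two tight groups around -2 and +2.
xs = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]
priors, means, variances, iterations = em_1d(xs, init_means=[-1.0, 1.0])
```

On this well-separated data the loop stops on the likelihood criterion well before `max_iter`, with means close to -2 and +2 and equal priors.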

Hierarchical Document Organization

• Explore the probabilistic latent topical information
  – TMM/PLSA approach

$$
P(w_j\mid D_i)=\sum_{k=1}^{K}P(T_k\mid D_i)\left[\sum_{l=1}^{K}P(T_l\mid Y_k)\,P(w_j\mid T_l)\right]
$$

$$
P(T_l\mid Y_k)=\frac{E(T_k,T_l)}{\sum_{s=1}^{K}E(T_k,T_s)},
\qquad
E(T_k,T_l)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{dist(T_k,T_l)^{2}}{2\sigma^{2}}\right]
$$

$$
dist(T_i,T_j)=\sqrt{(x_i-x_j)^{2}+(y_i-y_j)^{2}}
$$

• Documents are clustered by their latent topics and organized in a two-dimensional tree structure, or a two-layer map (figure: two-dimensional tree structure for organized topics)

• Related documents fall in the same cluster, and the relationship between two clusters is reflected by their distance on the map

• When a cluster contains many documents, it can be further analyzed into another map on the next layer
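The neighborhood probabilities P(T_l | Y_k) depend only on the topic coordinates on the map and on σ. A minimal sketch, assuming topics sit on a hypothetical 2×2 grid:

```python
import math

def neighborhood(coords, sigma):
    """Return P[k][l] = P(T_l | Y_k) from Gaussian-weighted map distances."""
    def E(k, l):
        (xk, yk), (xl, yl) = coords[k], coords[l]
        d2 = (xk - xl) ** 2 + (yk - yl) ** 2  # squared Euclidean map distance
        return math.exp(-d2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

    K = len(coords)
    P = []
    for k in range(K):
        row = [E(k, l) for l in range(K)]
        total = sum(row)  # normalize over all topics s
        P.append([e / total for e in row])
    return P

# Hypothetical 2 x 2 topic map.
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
P = neighborhood(coords, sigma=1.0)
```

Each row sums to 1, a topic weights itself most, and nearer topics receive more weight than farther ones. As σ shrinks, the rows approach one-hot vectors and the word model reduces to a plain mixture over topics.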

Hierarchical Document Organization (cont.)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
L_T=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\log P(w_j\mid D_i)
=\sum_{i=1}^{N}\sum_{j=1}^{J}c(w_j,D_i)\log\left\{\sum_{k=1}^{K}P(T_k\mid D_i)\left[\sum_{l=1}^{K}P(T_l\mid Y_k)\,P(w_j\mid T_l)\right]\right\}
$$

  – EM training can be performed

$$
\hat{P}(w_j\mid T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N}c(w_{j'},D_{i'})\,P(T_k\mid w_{j'},D_{i'})}
$$

$$
\hat{P}(T_k\mid D_i)=\frac{\sum_{j=1}^{J}c(w_j,D_i)\,P(T_k\mid w_j,D_i)}{c(D_i)}
$$

where

$$
P(T_k\mid w_j,D_i)=\frac{\left[\sum_{l=1}^{K}P(w_j\mid T_l)\,P(T_l\mid Y_k)\right]\cdot P(T_k\mid D_i)}{\sum_{k'=1}^{K}\left\{\left[\sum_{l=1}^{K}P(w_j\mid T_l)\,P(T_l\mid Y_{k'})\right]\cdot P(T_{k'}\mid D_i)\right\}}
$$
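The posterior P(T_k | w_j, D_i) that drives both updates can be computed for one word and one document as follows. This is a sketch; the argument names and the tiny two-topic example are assumptions made for illustration.

```python
def word_prob_and_posterior(p_topic_given_doc, p_neighbor, p_word_given_topic):
    """Return (P(w_j | D_i), [P(T_k | w_j, D_i) for each k]).

    p_topic_given_doc[k]  = P(T_k | D_i)
    p_neighbor[k][l]      = P(T_l | Y_k)
    p_word_given_topic[l] = P(w_j | T_l)
    """
    K = len(p_topic_given_doc)
    # Neighborhood-smoothed word probability for each topic k
    smoothed = [
        sum(p_neighbor[k][l] * p_word_given_topic[l] for l in range(K))
        for k in range(K)
    ]
    joint = [smoothed[k] * p_topic_given_doc[k] for k in range(K)]
    p_word = sum(joint)
    return p_word, [j / p_word for j in joint]

# Hypothetical example; an identity neighborhood matrix means no smoothing across topics.
p_word, posterior = word_prob_and_posterior(
    p_topic_given_doc=[0.5, 0.5],
    p_neighbor=[[1.0, 0.0], [0.0, 1.0]],
    p_word_given_topic=[0.2, 0.6],
)
```

Here the word probability is 0.5·0.2 + 0.5·0.6 = 0.4, and the posterior splits it in the ratio 1:3 between the two topics.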

Hierarchical Document Organization (cont.)

• Criterion for topic word selection

$$
S(w_j,T_k)=\frac{\sum_{i=1}^{N}c(w_j,D_i)\,P(T_k\mid D_i)}{\sum_{i'=1}^{N}c(w_j,D_{i'})\,\left[1-P(T_k\mid D_{i'})\right]}
$$
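The selection score compares how often a word occurs in documents that belong to a topic with how often it occurs in documents that do not. A minimal sketch with made-up counts:

```python
def topic_word_score(counts, p_topic_given_doc):
    """S(w_j, T_k) for one word and one topic.

    counts[i]            = c(w_j, D_i), occurrences of the word in document i
    p_topic_given_doc[i] = P(T_k | D_i)
    """
    inside = sum(c * p for c, p in zip(counts, p_topic_given_doc))
    outside = sum(c * (1.0 - p) for c, p in zip(counts, p_topic_given_doc))
    return inside / outside  # undefined if the word occurs only in pure-topic documents

# Hypothetical counts over three documents.
score = topic_word_score(counts=[4, 1, 0], p_topic_given_doc=[0.9, 0.5, 0.1])
```

A score above 1 means the word's occurrences are concentrated in documents of that topic; the zero-denominator case would need separate handling in real use.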

Hierarchical Document Organization (cont.)

• Example [figure omitted]

Hierarchical Document Organization (cont.)

• Example (cont.) [figure omitted]


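Hierarchical Document Organization (cont.)

• Self-Organization Map (SOM)
  – A recursive regression process

Input layer: input vector $\vec{x}=[x_1,x_2,\ldots,x_n]^{T}$

Mapping layer: each unit $i$ has a weight vector $\vec{m}_i=[m_{i,1},m_{i,2},\ldots,m_{i,n}]^{T}$ (for example $\vec{m}_1=[m_{1,1},m_{1,2},\ldots,m_{1,n}]^{T}$)

$$
\vec{m}_i(t+1)=\vec{m}_i(t)+h_{c(\vec{x}),i}(t)\,[\vec{x}(t)-\vec{m}_i(t)]
$$

$$
c(\vec{x})=\arg\min_{i'}\lVert\vec{x}-\vec{m}_{i'}\rVert
\qquad\text{where}\qquad
\lVert\vec{x}-\vec{m}_{i'}\rVert=\sqrt{\sum_{n}(x_n-m_{i',n})^{2}}
$$

$$
h_{c(\vec{x}),i}(t)=\alpha(t)\exp\left(-\frac{\lVert\vec{r}_i-\vec{r}_{c(\vec{x})}\rVert^{2}}{2\sigma^{2}(t)}\right)
$$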
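One SOM update can be sketched as follows: find the best-matching unit c(x), then pull every weight vector toward the input by an amount that decays with map distance from that unit. The unit positions, learning rate `alpha`, and width `sigma` below are hypothetical values.

```python
import math

def som_step(x, weights, positions, alpha, sigma):
    """One SOM update: m_i <- m_i + h_{c(x),i} * (x - m_i)."""
    # Best-matching unit: the weight vector closest to the input
    dists = [math.dist(x, m) for m in weights]
    c = dists.index(min(dists))
    new_weights = []
    for m, r in zip(weights, positions):
        # Neighborhood function, based on distance between map positions r_i and r_c
        h = alpha * math.exp(-math.dist(r, positions[c]) ** 2 / (2.0 * sigma ** 2))
        new_weights.append([mi + h * (xi - mi) for xi, mi in zip(x, m)])
    return c, new_weights

# Hypothetical map with three units arranged in a line.
x = [1.0, 1.0]
weights = [[0.0, 0.0], [1.0, 0.8], [3.0, 3.0]]
positions = [(0, 0), (0, 1), (0, 2)]
c, new_weights = som_step(x, weights, positions, alpha=0.5, sigma=1.0)
```

The winner (unit 1) moves halfway to the input; its map neighbors move by a smaller fraction. In a full training run `alpha` and `sigma` would typically shrink over time, which is what the time argument t in the formulas expresses.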

Hierarchical Document Organization (cont.)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$
R_{Dist}=\frac{dist_{Between}}{dist_{Within}}
$$

where

$$
dist_{Between}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Between}(i,j)},
\qquad
dist_{Within}=\frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D}f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D}C_{Within}(i,j)}
$$

$$
f_{Between}(i,j)=\begin{cases}dist_{Map}(i,j)&T_{r,i}\neq T_{r,j}\\0&\text{otherwise}\end{cases}
\qquad
C_{Between}(i,j)=\begin{cases}1&T_{r,i}\neq T_{r,j}\\0&\text{otherwise}\end{cases}
$$

$$
f_{Within}(i,j)=\begin{cases}dist_{Map}(i,j)&T_{r,i}=T_{r,j}\\0&\text{otherwise}\end{cases}
\qquad
C_{Within}(i,j)=\begin{cases}1&T_{r,i}=T_{r,j}\\0&\text{otherwise}\end{cases}
$$

$$
dist_{Map}(i,j)=\sqrt{(x_i-x_j)^{2}+(y_i-y_j)^{2}}
$$
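The distance ratio can be computed directly from map coordinates and reference labels. A minimal sketch; the coordinates and labels are made-up values, with `labels[i]` playing the role of T_{r,i}.

```python
import math
from itertools import combinations

def distance_ratio(coords, labels):
    """R_Dist = mean map distance of different-label pairs / mean of same-label pairs."""
    between_sum = within_sum = 0.0
    between_cnt = within_cnt = 0
    for i, j in combinations(range(len(coords)), 2):  # all pairs with i < j
        d = math.dist(coords[i], coords[j])
        if labels[i] != labels[j]:
            between_sum += d
            between_cnt += 1
        else:
            within_sum += d
            within_cnt += 1
    return (between_sum / between_cnt) / (within_sum / within_cnt)

# Hypothetical layout: two documents per class, classes 3 units apart on the map.
coords = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = ["a", "a", "b", "b"]
R = distance_ratio(coords, labels)
```

A larger ratio means documents of different classes are placed farther apart on the map than documents of the same class.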

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK 
DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT 
ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU 
ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP 
ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA 
ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN 
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice


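IR – Berlin Chen 43

The EM Algorithm (cont)

• E-step (Expectation)
  – The auxiliary function can also be divided into two

$$\Phi(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + \Phi_b(\Theta,\hat{\Theta})$$

where

$$\Phi_a(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i,\Theta)\,\log P(c_k \mid \hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(c_k \mid \hat{\Theta})$$

(auxiliary function for mixture weights)

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} P(c_k \mid \vec{x}_i,\Theta)\,\log P(\vec{x}_i \mid c_k,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(\vec{x}_i \mid c_k,\hat{\Theta})$$

(auxiliary function for cluster distributions)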
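The split of the auxiliary function into a mixture-weight part and a cluster-distribution part can be sketched numerically. The following is a minimal illustration, assuming one-dimensional Gaussian cluster distributions; the function names (`gauss_pdf`, `responsibilities`, `auxiliary_parts`) and the toy data are hypothetical and do not come from the slides.

```python
import math

def gauss_pdf(x, mean, var):
    """1-D Gaussian density, standing in for P(x | c_k, Theta) (an assumption)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities(xs, priors, means, variances):
    """Posterior P(c_k | x_i, Theta) for every sample i and cluster k."""
    post = []
    for x in xs:
        joint = [p * gauss_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
        total = sum(joint)
        post.append([j / total for j in joint])
    return post

def auxiliary_parts(xs, post, new_priors, new_means, new_variances):
    """Return (Phi_a, Phi_b) evaluated at the candidate parameters Theta-hat."""
    phi_a = sum(r * math.log(new_priors[k])
                for row in post for k, r in enumerate(row))
    phi_b = sum(r * math.log(gauss_pdf(x, new_means[k], new_variances[k]))
                for x, row in zip(xs, post) for k, r in enumerate(row))
    return phi_a, phi_b

# Toy data: two clusters, equal priors, unit variances.
xs = [-2.1, -1.9, 0.1, 2.0, 2.2]
priors, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
post = responsibilities(xs, priors, means, variances)
phi_a, phi_b = auxiliary_parts(xs, post, priors, means, variances)
```

Because only `phi_a` depends on the new priors and only `phi_b` depends on the new means and variances, the two parts can be maximized separately in the M-step.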

IR – Berlin Chen 44

The EM Algorithm (cont)

• M-step (Maximization)
  – Remember that
    • Maximize a function F with a constraint by applying a Lagrange multiplier

Suppose that

$$F = \sum_{j=1}^{N} w_j \log y_j$$

Constraint:

$$\sum_{j=1}^{N} y_j = 1$$

By applying a Lagrange multiplier $\ell$:

$$\hat{F} = \sum_{j=1}^{N} w_j \log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right)$$

$$\frac{\partial \hat{F}}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j$$

$$\sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j \;\Rightarrow\; \ell = -\sum_{j=1}^{N} w_j$$

$$\therefore\; y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$$

Note: $\dfrac{\partial \log y}{\partial y} = \dfrac{1}{y}$
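The closed-form result of the Lagrange multiplier argument, $y_j = w_j / \sum_{j'} w_{j'}$, can be checked numerically. This is a small sketch with made-up weights; the helper names are hypothetical, and the comparison points are simply a few other distributions on the simplex rather than an exhaustive search.

```python
import math

def constrained_argmax(weights):
    """Closed-form maximizer of sum_j w_j * log(y_j) subject to sum_j y_j = 1."""
    total = sum(weights)
    return [w / total for w in weights]

def objective(weights, ys):
    """F = sum_j w_j * log(y_j)."""
    return sum(w * math.log(y) for w, y in zip(weights, ys))

w = [3.0, 1.0, 6.0]
y_star = constrained_argmax(w)

# Other feasible points (each sums to 1) should not score higher than y_star.
others = [[1 / 3, 1 / 3, 1 / 3], [0.2, 0.2, 0.6], [0.35, 0.05, 0.6]]
best = objective(w, y_star)
scores = [objective(w, o) for o in others]
```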

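IR – Berlin Chen 45

The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_a(\Theta,\hat{\Theta})$, the auxiliary function for mixture weights (or priors for Gaussians)

$$\bar{\Phi}_a(\Theta,\hat{\Theta}) = \Phi_a(\Theta,\hat{\Theta}) + \ell\left(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\right) = \sum_{k=1}^{K}\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(c_k \mid \hat{\Theta}) + \ell\left(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\right)$$

Here $w_k = \sum_{i=1}^{n} P(c_k \mid \vec{x}_i,\Theta)$ and $y_k = P(c_k \mid \hat{\Theta})$ play the roles of $w_j$ and $y_j$ in the Lagrange multiplier result.

$$\Rightarrow\; \hat{\pi}_k = P(c_k \mid \hat{\Theta}) = \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}{\sum_{k'=1}^{K}\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_{k'},\Theta)\,P(c_{k'} \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}} = \frac{1}{n}\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}$$

$P(c_k \mid \vec{x}_i,\Theta)$: the expected number of times that $\vec{x}_i$ falls in class $c_k$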
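The mixture-weight update amounts to averaging the posteriors over all samples. A minimal sketch, assuming the posteriors $P(c_k \mid \vec{x}_i,\Theta)$ have already been computed and stored row by row; `update_priors` and the sample values are illustrative only.

```python
def update_priors(post):
    """post[i][k] = P(c_k | x_i, Theta); returns the new priors P(c_k | Theta-hat)."""
    n = len(post)
    num_clusters = len(post[0])
    # Each new prior is the average posterior mass that cluster k receives.
    return [sum(row[k] for row in post) / n for k in range(num_clusters)]

# Four samples, two clusters (made-up posteriors, each row sums to 1).
post = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.0, 1.0]]
new_priors = update_priors(post)
```

Since every row of `post` sums to one, the updated priors sum to one automatically, which is exactly what the Lagrange constraint enforces.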

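IR – Berlin Chen 46

The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for (multivariate) Gaussian means and variances

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(\vec{x}_i \mid c_k,\hat{\Theta})$$

with the $m$-dimensional Gaussian cluster distribution

$$P(\vec{x}_i \mid c_k,\Theta) = \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(\vec{x}_i - \vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i - \vec{\mu}_k)\right)$$

Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}$$

and

$$\log P(\vec{x}_i \mid c_k,\hat{\Theta}) = -\frac{m}{2}\cdot\log(2\pi) - \frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)$$

Then

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)\right] + D$$

where $D$ is a constant.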

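IR – Berlin Chen 47

The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)\right] + D$$

$$\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\cdot 2\cdot \hat{\Sigma}_k^{-1}(\vec{x}_i - \hat{\vec{\mu}}_k)\cdot(-1) = 0$$

$$\Rightarrow\; \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\cdot \vec{x}_i}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}$$

Note: $\dfrac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\,\vec{x}$, and $\Sigma_k^{-1}$ is symmetric here.

$w_{ik}$: the expected number of times that $\vec{x}_i$ falls in class $c_k$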
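The mean update is a posterior-weighted average of the samples. A minimal sketch in plain Python, assuming the weights $w_{ik}$ are given as a list of rows; the function name `update_means` and the toy data (hard 0/1 weights, chosen so the result is easy to verify by hand) are illustrative.

```python
def update_means(xs, post):
    """xs[i] is a sample vector, post[i][k] = w_ik; returns one mean vector per cluster."""
    dim = len(xs[0])
    num_clusters = len(post[0])
    means = []
    for k in range(num_clusters):
        weight_sum = sum(row[k] for row in post)  # denominator: sum_i w_ik
        means.append([
            sum(row[k] * x[d] for x, row in zip(xs, post)) / weight_sum
            for d in range(dim)
        ])
    return means

# Three 2-D samples; the first two belong to cluster 0, the last to cluster 1.
xs = [[0.0, 0.0], [2.0, 0.0], [10.0, 4.0]]
post = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
means = update_means(xs, post)
```

With soft weights the same code applies unchanged; every sample then contributes to every cluster mean in proportion to its posterior.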

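IR – Berlin Chen 48

The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i - \hat{\vec{\mu}}_k)\right] + D$$

Notes ($\Sigma_k$ is symmetric here):

$$\frac{d\,\det(X)}{dX} = \det(X)\cdot X^{-T}, \qquad \frac{d\,(a^T X^{-1} b)}{dX} = -X^{-1} a\, b^T X^{-1}$$

$$\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\Sigma}_k} = -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\left[\frac{1}{|\hat{\Sigma}_k|}\cdot|\hat{\Sigma}_k|\cdot\hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1}(\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1}\right] = 0$$

$$\Rightarrow\; \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1}$$

$$\Rightarrow\; \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k \hat{\Sigma}_k^{-1}(\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} \hat{\Sigma}_k$$

$$\Rightarrow\; \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T$$

$$\Rightarrow\; \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\cdot(\vec{x}_i - \hat{\vec{\mu}}_k)(\vec{x}_i - \hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}$$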
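The covariance update is a posterior-weighted average of outer products around the new mean. The sketch below uses nested lists instead of a matrix library; `update_covariances` and the toy data are hypothetical, and hard 0/1 weights are used only so the expected matrices can be verified by hand.

```python
def update_covariances(xs, post, means):
    """Weighted average of (x_i - mu_k)(x_i - mu_k)^T for each cluster k."""
    dim = len(xs[0])
    covs = []
    for k, mean in enumerate(means):
        weight_sum = sum(row[k] for row in post)  # sum_i w_ik
        cov = [[0.0] * dim for _ in range(dim)]
        for x, row in zip(xs, post):
            diff = [x[d] - mean[d] for d in range(dim)]
            for a in range(dim):
                for b in range(dim):
                    cov[a][b] += row[k] * diff[a] * diff[b]  # weighted outer product
        covs.append([[c / weight_sum for c in r] for r in cov])
    return covs

xs = [[0.0, 0.0], [2.0, 2.0], [5.0, 1.0], [5.0, 3.0]]
post = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
means = [[1.0, 1.0], [5.0, 2.0]]  # assumed to be the already-updated means
covs = update_covariances(xs, post, means)
```

Both toy covariances happen to be singular because each cluster has only two points; in that situation the Gaussian density in the next E-step is undefined, so real data needs enough samples per cluster.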

IR – Berlin Chen 49

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
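Putting the pieces together, the whole procedure (K-means initialization, alternating E- and M-steps, and the two stopping conditions) can be sketched as follows. This is a simplified illustration for one-dimensional data, not the slides' implementation: all names, the tolerance `tol`, and the iteration cap are assumptions, and the code assumes no cluster ever becomes empty or collapses to zero variance.

```python
import math

def gauss_pdf(x, mean, var):
    """1-D Gaussian density, standing in for P(x | c_k, Theta)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def kmeans_1d(xs, centers, iterations=10):
    """Plain K-means on scalars, used only to seed EM."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
            groups[nearest].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

def log_likelihood(xs, priors, means, variances):
    """log P(X | Theta) for the mixture."""
    return sum(
        math.log(sum(p * gauss_pdf(x, m, v) for p, m, v in zip(priors, means, variances)))
        for x in xs
    )

def em_gmm_1d(xs, init_centers, max_iter=100, tol=1e-8):
    # Initialization from K-means: cluster sizes, means, and variances.
    means, groups = kmeans_1d(xs, list(init_centers))
    priors = [len(g) / len(xs) for g in groups]
    variances = [sum((x - m) ** 2 for x in g) / len(g) for g, m in zip(groups, means)]
    history = [log_likelihood(xs, priors, means, variances)]
    for _ in range(max_iter):
        # E-step: posteriors w_ik.
        post = []
        for x in xs:
            joint = [p * gauss_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
            total = sum(joint)
            post.append([j / total for j in joint])
        # M-step: closed-form updates for priors, means, variances.
        for k in range(len(means)):
            w_k = sum(row[k] for row in post)
            priors[k] = w_k / len(xs)
            means[k] = sum(row[k] * x for x, row in zip(xs, post)) / w_k
            variances[k] = sum(row[k] * (x - means[k]) ** 2 for x, row in zip(xs, post)) / w_k
        history.append(log_likelihood(xs, priors, means, variances))
        if abs(history[-1] - history[-2]) < tol:  # likelihood has converged
            break
    return priors, means, variances, history

xs = [-2.3, -2.0, -1.8, -1.9, 2.1, 1.9, 2.2, 1.8]
priors, means, variances, history = em_gmm_1d(xs, init_centers=[-1.0, 1.0])
```

The `history` list records the log-likelihood after each iteration, which makes it easy to confirm that the value never decreases from one EM iteration to the next.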

IR – Berlin Chen 50

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information
  – TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters are reflected by their distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i)\left[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\right]$$

$$P(T_l \mid Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}$$

$$E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{dist(T_k, T_l)^2}{2\sigma^2}\right]$$

$$dist(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$
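The word probability of this model can be sketched in a few lines: build the neighborhood table $P(T_l \mid Y_k)$ from the map coordinates of the topics, then mix the topic-word distributions. The function names, the map coordinates, and all probability values below are made-up illustrations.

```python
import math

def neighborhood(coords, sigma):
    """table[k][l] = P(T_l | Y_k): Gaussian of the map distance, normalized over l."""
    table = []
    for xk, yk in coords:
        e = [
            math.exp(-((xk - xl) ** 2 + (yk - yl) ** 2) / (2.0 * sigma ** 2))
            / (math.sqrt(2.0 * math.pi) * sigma)
            for xl, yl in coords
        ]
        total = sum(e)
        table.append([v / total for v in e])
    return table

def word_given_doc(p_topic_given_doc, p_neighbor, p_word_given_topic):
    """P(w_j | D_i) for every word j of one document i."""
    num_topics = len(p_topic_given_doc)
    num_words = len(p_word_given_topic[0])
    return [
        sum(
            p_topic_given_doc[k]
            * sum(p_neighbor[k][l] * p_word_given_topic[l][j] for l in range(num_topics))
            for k in range(num_topics)
        )
        for j in range(num_words)
    ]

coords = [(0, 0), (0, 1), (1, 0)]                 # topic positions on the map
p_neighbor = neighborhood(coords, sigma=1.0)
p_topic_given_doc = [0.7, 0.2, 0.1]               # P(T_k | D_i)
p_word_given_topic = [[0.6, 0.2, 0.1, 0.1],       # P(w_j | T_l), one row per topic
                      [0.1, 0.6, 0.2, 0.1],
                      [0.1, 0.1, 0.2, 0.6]]
p_words = word_given_doc(p_topic_given_doc, p_neighbor, p_word_given_topic)
```

A small `sigma` concentrates each row of `p_neighbor` on the topic itself, in which case the model behaves like an ordinary topic mixture without the map-based smoothing.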

IR – Berlin Chen 51

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$L_T = \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j, D_i)\,\log P(w_j \mid D_i) = \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j, D_i)\,\log\left\{\sum_{k=1}^{K} P(T_k \mid D_i)\left[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\right]\right\}$$

  – EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\,P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N} c(w_{j'}, D_{i'})\,P(T_k \mid w_{j'}, D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\,P(T_k \mid w_j, D_i)}{c(D_i)}$$

where

$$P(T_k \mid w_j, D_i) = \frac{\left[\sum_{l=1}^{K} P(w_j \mid T_l)\,P(T_l \mid Y_k)\right]\cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K}\left\{\left[\sum_{l=1}^{K} P(w_j \mid T_l)\,P(T_l \mid Y_{k'})\right]\cdot P(T_{k'} \mid D_i)\right\}}$$
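One part of this EM training, the topic posterior and the re-estimation of $P(T_k \mid D_i)$ for a single document, can be sketched as below. The sketch assumes that $c(D_i)$ is the total word count of document $D_i$; the function names and all numbers are hypothetical.

```python
def topic_posterior(j, p_topic_given_doc_i, p_neighbor, p_word_given_topic):
    """P(T_k | w_j, D_i) for all k, for one document i and one word index j."""
    num_topics = len(p_topic_given_doc_i)
    scores = [
        sum(p_word_given_topic[l][j] * p_neighbor[k][l] for l in range(num_topics))
        * p_topic_given_doc_i[k]
        for k in range(num_topics)
    ]
    total = sum(scores)
    return [s / total for s in scores]

def update_topic_given_doc(counts_i, p_topic_given_doc_i, p_neighbor, p_word_given_topic):
    """New P(T_k | D_i) = sum_j c(w_j, D_i) * P(T_k | w_j, D_i) / c(D_i)."""
    num_topics = len(p_topic_given_doc_i)
    doc_length = sum(counts_i)  # assumption: c(D_i) is the total word count of D_i
    new = [0.0] * num_topics
    for j, count in enumerate(counts_i):
        if count == 0:
            continue
        post = topic_posterior(j, p_topic_given_doc_i, p_neighbor, p_word_given_topic)
        for k in range(num_topics):
            new[k] += count * post[k]
    return [v / doc_length for v in new]

p_neighbor = [[0.8, 0.2], [0.2, 0.8]]                  # P(T_l | Y_k), rows indexed by k
p_word_given_topic = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]  # P(w_j | T_l)
counts_i = [4, 0, 1]                                   # c(w_j, D_i) for one document
new_p = update_topic_given_doc(counts_i, [0.5, 0.5], p_neighbor, p_word_given_topic)
```

In this toy document the first word dominates, and since that word is much more likely under the first topic, the updated distribution shifts toward that topic.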

IR – Berlin Chen 52

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\,P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'})\,[1 - P(T_k \mid D_{i'})]}$$
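The selection criterion compares how much of a word's mass falls inside a topic with how much falls outside it. A minimal sketch with made-up counts and topic posteriors; `topic_word_score` is a hypothetical name, and returning infinity for a zero denominator is a choice made here, not something the slides specify.

```python
def topic_word_score(j, k, counts, p_topic_given_doc):
    """S(w_j, T_k): count mass of word j inside topic k divided by the mass outside it."""
    num_docs = len(counts)
    inside = sum(counts[i][j] * p_topic_given_doc[i][k] for i in range(num_docs))
    outside = sum(counts[i][j] * (1.0 - p_topic_given_doc[i][k]) for i in range(num_docs))
    return inside / outside if outside > 0 else float("inf")

# counts[i][j] = c(w_j, D_i); p_topic_given_doc[i][k] = P(T_k | D_i)
counts = [[5, 1], [4, 1], [0, 1]]
p_topic_given_doc = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

s_word0 = topic_word_score(0, 0, counts, p_topic_given_doc)  # word concentrated in topic 0
s_word1 = topic_word_score(1, 0, counts, p_topic_given_doc)  # word spread over all documents
```

The word that occurs mostly in documents dominated by topic 0 receives the higher score, so ranking words by this score yields candidate labels for each topic.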

IR – Berlin Chen 53

Hierarchical Document Organization (cont)

• Example

IR – Berlin Chen 54

Hierarchical Document Organization (cont)

• Example (cont)

IR – Berlin Chen 55

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  – A recursive regression process

[Figure: SOM with an Input Layer and a Mapping Layer]

Input vector: $x = [x_1, x_2, \ldots, x_n]^T$

Weight vector: $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, e.g. $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$

$$m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,[x(t) - m_i(t)]$$

$$c(x) = \arg\min_i \|x - m_i\|$$

where

$$\|x - m_i\| = \sqrt{\sum_{n} (x_n - m_{i,n})^2}$$

$$h_{c(x),i}(t) = \alpha(t)\exp\left(-\frac{\|r_i - r_{c(x)}\|^2}{2\sigma^2(t)}\right)$$
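A single SOM update can be sketched as follows: find the best-matching unit $c(x)$, then pull every weight vector toward the input by an amount that decays with the unit's grid distance from $c(x)$. The names, the one-dimensional grid layout, and the fixed `alpha` and `sigma` values are assumptions for illustration; in practice both are usually decreased over time.

```python
import math

def best_matching_unit(x, weights):
    """c(x): index of the weight vector closest to x (squared distance suffices)."""
    return min(
        range(len(weights)),
        key=lambda i: sum((xn - mn) ** 2 for xn, mn in zip(x, weights[i])),
    )

def som_step(x, weights, positions, alpha, sigma):
    """One update m_i(t+1) = m_i(t) + h_{c(x),i}(t) * [x(t) - m_i(t)] for all units."""
    c = best_matching_unit(x, weights)
    updated = []
    for i, m in enumerate(weights):
        grid_sq = sum((a - b) ** 2 for a, b in zip(positions[i], positions[c]))
        h = alpha * math.exp(-grid_sq / (2.0 * sigma ** 2))  # neighborhood function
        updated.append([mn + h * (xn - mn) for xn, mn in zip(x, m)])
    return c, updated

weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]   # m_i, one per map unit
positions = [(0, 0), (0, 1), (0, 2)]             # r_i, unit locations on the map
x = [1.2, 0.8]
c, updated = som_step(x, weights, positions, alpha=0.5, sigma=1.0)
```

The best-matching unit moves the most (by the full factor `alpha`), while units farther away on the map move less, which is what gradually makes neighboring units respond to similar inputs.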

IR – Berlin Chen 56

Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

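where

$$dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)}$$

$$f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases} \qquad C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}$$

$$f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases} \qquad C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$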
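The evaluation measure can be sketched directly from its definition: average the map distance over all document pairs with different reference labels, average it over all pairs with the same label, and take the ratio. The function name, the map coordinates, and the labels below are made up for illustration, and the code assumes that both kinds of pairs occur at least once.

```python
import math

def distance_ratio(coords, labels):
    """Return (dist_Between, dist_Within, R_Dist) over all document pairs i < j."""
    between_sum = within_sum = 0.0
    between_n = within_n = 0
    num_docs = len(coords)
    for i in range(num_docs):
        for j in range(i + 1, num_docs):
            d = math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            if labels[i] != labels[j]:
                between_sum += d   # f_Between(i, j)
                between_n += 1     # C_Between(i, j)
            else:
                within_sum += d    # f_Within(i, j)
                within_n += 1      # C_Within(i, j)
    dist_between = between_sum / between_n
    dist_within = within_sum / within_n
    return dist_between, dist_within, dist_between / dist_within

# Map positions of four documents and their reference classes.
coords = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = ["a", "a", "b", "b"]
dist_between, dist_within, r_dist = distance_ratio(coords, labels)
```

A larger ratio means that documents of different classes are placed farther apart on the map than documents of the same class, which is the behavior a good organization should show.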




The EM Algorithm (cont)

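• M-step (Maximization)
  – Maximize $\Phi_a(\Theta,\hat{\Theta})$, the auxiliary function for the mixture weights (or priors for Gaussians)

Adding the constraint $\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) = 1$ with a Lagrange multiplier $\ell$:

$$
\bar{\Phi}_a(\Theta,\hat{\Theta})
= \Phi_a(\Theta,\hat{\Theta}) + \ell\left(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\right)
= \sum_{i=1}^{n}\sum_{k=1}^{K}
\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}
\log P(c_k \mid \hat{\Theta})
+ \ell\left(\sum_{k=1}^{K} P(c_k \mid \hat{\Theta}) - 1\right)
$$

This has the form $\sum_k w_k \log y_k$ with $w_k = \sum_{i=1}^{n} P(c_k \mid \vec{x}_i,\Theta)$ and $y_k = P(c_k \mid \hat{\Theta})$, so

$$
\hat{\pi}_k = P(c_k \mid \hat{\Theta})
= \frac{w_k}{\sum_{k=1}^{K} w_k}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
       {\displaystyle\sum_{k=1}^{K}\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
= \frac{1}{n}\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}
$$

Each summand is the expected number of times that $\vec{x}_i$ falls in class $c_k$.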
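The mixture-weight update can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function names, the list-of-lists layout and the toy likelihood values are assumptions. It computes the class posteriors from given likelihoods $P(\vec{x}_i \mid c_k,\Theta)$ and current priors, then re-estimates each prior as the average posterior over the $n$ samples.

```python
def posteriors(likelihoods, priors):
    """Return w[i][k] = P(x_i|c_k) P(c_k) / sum_l P(x_i|c_l) P(c_l)."""
    w = []
    for row in likelihoods:                      # one row per sample x_i
        joint = [p_x * p_c for p_x, p_c in zip(row, priors)]
        total = sum(joint)
        w.append([j / total for j in joint])
    return w


def update_priors(w):
    """M-step for mixture weights: P_hat(c_k) = (1/n) * sum_i w[i][k]."""
    n = len(w)
    num_classes = len(w[0])
    return [sum(row[k] for row in w) / n for k in range(num_classes)]


# Hypothetical likelihoods P(x_i | c_k) for 3 samples and 2 classes.
likelihoods = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]
priors = [0.5, 0.5]
w = posteriors(likelihoods, priors)
new_priors = update_priors(w)
```

Because every row of posteriors sums to one, the re-estimated priors sum to one as well, which is exactly what the Lagrange constraint enforces.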

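The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$
\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K}
\frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}
\log P(\vec{x}_i \mid c_k,\hat{\Theta})
$$

Each class is an $m$-dimensional Gaussian:

$$
P(\vec{x}_i \mid c_k,\Theta) = \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}}
\exp\left(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^{T}\,\Sigma_k^{-1}\,(\vec{x}_i-\vec{\mu}_k)\right)
$$

Let

$$
w_{ik} = \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}
$$

and

$$
\log P(\vec{x}_i \mid c_k,\hat{\Theta}) = -\frac{m}{2}\cdot\log(2\pi) - \frac{1}{2}\log|\hat{\Sigma}_k|
- \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)
$$

Then

$$
\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}
\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D
$$

where $D$ is a constant.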

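The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$
\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}
\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D
$$

Using $\dfrac{d(\vec{x}^{T} C \vec{x})}{d\vec{x}} = (C + C^{T})\,\vec{x}$, where $\Sigma_k^{-1}$ is symmetric here:

$$
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\vec{\mu}}_k}
= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\cdot 2\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1) = 0
$$

$$
\Rightarrow\quad
\hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\cdot\vec{x}_i}
       {\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
$$

$w_{ik}$: the expected number of times that $\vec{x}_i$ falls in class $c_k$.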

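The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}
\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D
$$

Two matrix identities are used, where $\Sigma_k$ is symmetric here:

$$
\frac{d\,\det(X)}{dX} = \det(X)\cdot X^{-T},
\qquad
\frac{d\,(a^{T} X^{-1} b)}{dX} = -X^{-1}\,a\,b^{T}\,X^{-1}
$$

Setting the derivative to zero:

$$
\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\Sigma}_k}
= -\frac{1}{2}\sum_{i=1}^{n} w_{ik}
\left[\frac{1}{|\hat{\Sigma}_k|}\cdot|\hat{\Sigma}_k|\cdot\hat{\Sigma}_k^{-1}
- \hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\right] = 0
$$

$$
\Rightarrow\quad \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1}
= \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}
$$

$$
\Rightarrow\quad \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k
= \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}\hat{\Sigma}_k^{-1}\hat{\Sigma}_k
$$

$$
\Rightarrow\quad \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k
= \sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}
$$

$$
\Rightarrow\quad
\hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}{\sum_{i=1}^{n} w_{ik}}
= \frac{\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^{T}}
       {\displaystyle\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}
$$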
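As a hedged illustration of the two Gaussian updates (weighted mean and weighted covariance), the sketch below re-estimates both from a given matrix of posteriors. It is not the lecture's code: `update_gaussians`, the list-based data layout and the toy samples are assumptions, and no guard is included for a class whose total posterior mass is zero.

```python
def update_gaussians(samples, w):
    """Return (means, covariances) re-estimated from posteriors w[i][k]."""
    n, dim, num_classes = len(samples), len(samples[0]), len(w[0])
    means, covariances = [], []
    for k in range(num_classes):
        mass = sum(w[i][k] for i in range(n))    # sum_i w_ik
        # mu_hat_k = sum_i w_ik x_i / sum_i w_ik
        mu = [sum(w[i][k] * samples[i][d] for i in range(n)) / mass
              for d in range(dim)]
        # Sigma_hat_k = sum_i w_ik (x_i - mu)(x_i - mu)^T / sum_i w_ik
        cov = [[sum(w[i][k] * (samples[i][a] - mu[a]) * (samples[i][b] - mu[b])
                    for i in range(n)) / mass
                for b in range(dim)]
               for a in range(dim)]
        means.append(mu)
        covariances.append(cov)
    return means, covariances


# Toy data: hard 0/1 posteriors reduce the update to per-class sample statistics.
samples = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
w = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
means, covariances = update_gaussians(samples, w)
```

With soft posteriors the same function applies unchanged; every sample then contributes to every class in proportion to its posterior.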

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
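The initialization and stopping rule can be sketched as a complete loop. To stay short, the example below is a simplification under stated assumptions: it fits a univariate (1-D) Gaussian mixture rather than the multivariate case, uses a few Lloyd iterations as the K-means initializer, and applies a small variance floor. All names, the tolerance and the toy data are hypothetical, and empty classes or numerical underflow are not handled.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def kmeans_1d(data, k, iters=10):
    """A few Lloyd iterations, used only to initialise the means."""
    ordered = sorted(data)
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def em_gmm_1d(data, k, max_iters=100, tol=1e-6):
    n = len(data)
    means = kmeans_1d(data, k)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    priors = [1.0 / k] * k
    prev_ll = -math.inf
    for iteration in range(1, max_iters + 1):
        # E-step: posteriors w[i][j] and the log-likelihood log P(X | Theta).
        w, ll = [], 0.0
        for x in data:
            joint = [priors[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            w.append([p / total for p in joint])
        # Terminate once the likelihood has stopped improving.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: priors, means and variances from the posteriors.
        for j in range(k):
            mass = sum(w[i][j] for i in range(n))
            priors[j] = mass / n
            means[j] = sum(w[i][j] * data[i] for i in range(n)) / mass
            var = sum(w[i][j] * (data[i] - means[j]) ** 2 for i in range(n)) / mass
            variances[j] = max(var, 1e-6)        # assumed floor, avoids collapse
    return priors, means, variances, ll, iteration


data = [0.9, 1.0, 1.1, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
priors, means, variances, ll, iteration = em_gmm_1d(data, 2)
```

The loop stops on whichever comes first: the log-likelihood gain dropping below `tol`, or `max_iters` iterations.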

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information
  – TMM/PLSA approach

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents are in the same cluster, and the relationships among the clusters have to do with the distance on the map

• When a cluster has many documents, we can further analyze it into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i)\left[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\right]
$$

$$
P(T_l \mid Y_k) = \frac{E(T_k,T_l)}{\sum_{s=1}^{K} E(T_k,T_s)},
\qquad
E(T_k,T_l) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{dist(T_k,T_l)^{2}}{2\sigma^{2}}\right]
$$

$$
dist(T_i,T_j) = \sqrt{(x_i-x_j)^{2} + (y_i-y_j)^{2}}
$$
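A small sketch may help make the map-smoothed document model concrete. Under the assumption that each topic sits at a 2-D map coordinate, it builds the neighbourhood weights $P(T_l \mid Y_k)$ from a Gaussian of the map distance and then evaluates $P(w_j \mid D_i)$ as a mixture over topics. The function names, the 2x2 map and all probability values are illustrative assumptions, not data from the lecture.

```python
import math


def neighbourhood(coords, sigma):
    """Return p_ty[k][l] = P(T_l | Y_k) from 2-D map coordinates of the topics."""
    def e(a, b):
        dist_sq = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
        return math.exp(-dist_sq / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

    rows = []
    for k_pos in coords:
        weights = [e(k_pos, l_pos) for l_pos in coords]
        total = sum(weights)
        rows.append([v / total for v in weights])
    return rows


def word_given_doc(p_t_d, p_ty, p_w_t, j):
    """P(w_j | D_i) = sum_k P(T_k|D_i) * sum_l P(T_l|Y_k) P(w_j|T_l)."""
    num_topics = len(p_t_d)
    return sum(
        p_t_d[k] * sum(p_ty[k][l] * p_w_t[l][j] for l in range(num_topics))
        for k in range(num_topics)
    )


# Hypothetical 2x2 topic map, one document, three vocabulary words.
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
p_ty = neighbourhood(coords, sigma=1.0)
p_t_d = [0.7, 0.1, 0.1, 0.1]                     # P(T_k | D_i)
p_w_t = [[0.7, 0.2, 0.1],                        # P(w_j | T_l), one row per topic
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4],
         [0.2, 0.2, 0.6]]
probs = [word_given_doc(p_t_d, p_ty, p_w_t, j) for j in range(3)]
```

A smaller `sigma` concentrates each row of `p_ty` on the topic itself, which recovers an ordinary topic mixture; a larger `sigma` shares word probabilities among neighbouring map cells.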

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection
  – EM training can be performed

$$
L_T = \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log P(w_j \mid D_i)
= \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log\left\{\sum_{k=1}^{K} P(T_k \mid D_i)\left[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\right]\right\}
$$

$$
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k \mid w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N} c(w_{j'},D_{i'})\,P(T_k \mid w_{j'},D_{i'})}
$$

$$
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j,D_i)\,P(T_k \mid w_j,D_i)}{c(D_i)}
$$

where

$$
P(T_k \mid w_j,D_i) = \frac{\left[\sum_{l=1}^{K} P(w_j \mid T_l)\,P(T_l \mid Y_k)\right]\cdot P(T_k \mid D_i)}
{\sum_{k'=1}^{K}\left\{\left[\sum_{l'=1}^{K} P(w_j \mid T_{l'})\,P(T_{l'} \mid Y_{k'})\right]\cdot P(T_{k'} \mid D_i)\right\}}
$$
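One EM iteration for this model can be sketched as follows. This is an illustrative reading of the update formulas rather than the lecture's implementation: the function name and argument layout are assumptions, $c(D_i)$ is taken to be the total term count of document $D_i$, and the neighbourhood weights $P(T_l \mid Y_k)$ are treated as fixed inputs.

```python
def tmm_em_step(counts, p_w_t, p_t_d, p_ty):
    """One EM iteration.

    counts[i][j] = c(w_j, D_i); p_w_t[l][j] = P(w_j|T_l);
    p_t_d[i][k] = P(T_k|D_i); p_ty[k][l] = P(T_l|Y_k) (fixed by the map).
    """
    n_docs, n_words, n_topics = len(counts), len(counts[0]), len(p_w_t)
    # Neighbourhood-smoothed word probabilities: sum_l P(w_j|T_l) P(T_l|Y_k).
    smoothed = [[sum(p_ty[k][l] * p_w_t[l][j] for l in range(n_topics))
                 for j in range(n_words)] for k in range(n_topics)]
    word_acc = [[0.0] * n_words for _ in range(n_topics)]
    doc_acc = [[0.0] * n_topics for _ in range(n_docs)]
    for i in range(n_docs):
        for j in range(n_words):
            if counts[i][j] == 0:
                continue
            joint = [smoothed[k][j] * p_t_d[i][k] for k in range(n_topics)]
            total = sum(joint)
            for k in range(n_topics):
                post = joint[k] / total          # P(T_k | w_j, D_i)
                word_acc[k][j] += counts[i][j] * post
                doc_acc[i][k] += counts[i][j] * post
    # Normalise over words for each topic, and by c(D_i) for each document.
    new_p_w_t = [[v / sum(row) for v in row] for row in word_acc]
    new_p_t_d = [[v / sum(counts[i]) for v in doc_acc[i]] for i in range(n_docs)]
    return new_p_w_t, new_p_t_d


# Hypothetical toy collection: 2 documents, 3 words, 2 topics.
counts = [[3, 1, 0], [0, 2, 4]]
p_w_t = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
p_t_d = [[0.6, 0.4], [0.4, 0.6]]
p_ty = [[0.9, 0.1], [0.1, 0.9]]
new_p_w_t, new_p_t_d = tmm_em_step(counts, p_w_t, p_t_d, p_ty)
```

Repeating the step until the total log-likelihood stops improving gives the full training loop.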

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$
S(w_j,T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j,D_{i'})\,\left[1 - P(T_k \mid D_{i'})\right]}
$$
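The selection criterion is a ratio: the count mass of a word inside a topic over its count mass outside that topic. A minimal sketch, assuming count and posterior matrices laid out as nested lists (the function names, vocabulary and values are hypothetical), ranks the vocabulary for one topic. It does not guard against a zero denominator, which occurs when a word appears only in documents with $P(T_k \mid D_i) = 1$.

```python
def topic_word_score(counts, p_t_d, j, k):
    """S(w_j, T_k): count mass of w_j inside topic T_k over its mass outside."""
    docs = range(len(counts))
    inside = sum(counts[i][j] * p_t_d[i][k] for i in docs)
    outside = sum(counts[i][j] * (1.0 - p_t_d[i][k]) for i in docs)
    return inside / outside


def top_words(counts, p_t_d, k, vocabulary, how_many):
    """Return the how_many highest-scoring words for topic k."""
    scores = [(topic_word_score(counts, p_t_d, j, k), word)
              for j, word in enumerate(vocabulary)]
    return [word for _, word in sorted(scores, reverse=True)[:how_many]]


# Hypothetical counts c(w_j, D_i) and posteriors P(T_k | D_i) for 2 documents.
vocabulary = ["cluster", "the", "topic"]
counts = [[4, 5, 0], [0, 5, 3]]
p_t_d = [[0.9, 0.1], [0.2, 0.8]]
score = topic_word_score(counts, p_t_d, 0, 0)
best = top_words(counts, p_t_d, 0, vocabulary, 2)
```

A word spread evenly over all documents, such as "the" here, scores close to the overall odds of the topic, while a word concentrated in the topic's documents scores much higher.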

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

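Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)
  – A recursive regression process

[Figure: Input Layer and Mapping Layer, showing the input vector $x$, a weight vector $m_i$, the difference $x - m_i$, and the updated weight vector $m_i'$]

Input vector: $x = [x_1, x_2, \ldots, x_n]^{T}$

Weight vector: $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^{T}$, for example $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^{T}$

$$
m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,[x(t) - m_i(t)]
$$

where

$$
c(x) = \arg\min_{i} \lVert x - m_i \rVert,
\qquad
\lVert x - m_i \rVert = \sqrt{\sum_{n} (x_n - m_{i,n})^{2}}
$$

$$
h_{c(x),i}(t) = \alpha(t)\,\exp\left(-\frac{\lVert r_i - r_{c(x)} \rVert^{2}}{2\sigma^{2}(t)}\right)
$$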
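A single SOM update can be sketched as follows, assuming the learning rate $\alpha(t)$ and width $\sigma(t)$ are passed in as their current values and that each unit $i$ has a fixed grid position $r_i$. The function names, the 2x2 grid and the toy vectors are illustrative assumptions; `math.dist` requires Python 3.8 or later.

```python
import math


def best_matching_unit(x, weights):
    """c(x) = argmin_i ||x - m_i|| (Euclidean distance)."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))


def som_step(x, weights, grid, alpha, sigma):
    """Apply m_i(t+1) = m_i(t) + h_{c(x),i}(t) [x(t) - m_i(t)] to every unit."""
    c = best_matching_unit(x, weights)
    updated = []
    for m_i, r_i in zip(weights, grid):
        # Neighbourhood function: Gaussian in the grid distance to the winner.
        h = alpha * math.exp(-math.dist(r_i, grid[c]) ** 2 / (2.0 * sigma ** 2))
        updated.append([m + h * (xn - m) for m, xn in zip(m_i, x)])
    return updated


# Hypothetical 2x2 map whose weight vectors start at the grid corners.
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
x = [0.9, 0.8]
winner = best_matching_unit(x, weights)
new_weights = som_step(x, weights, grid, alpha=0.5, sigma=1.0)
```

The winning unit moves a fraction `alpha` of the way toward the input, and units farther away on the grid move by progressively smaller fractions. Training repeats this step over the inputs while `alpha` and `sigma` are decreased.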

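Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

(Decimal points were lost in extraction; the ratios are shown assuming one integer digit.)

$$
R_{dist} = \frac{dist_{Between}}{dist_{Within}}
$$

where

$$
dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)},
\qquad
dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}
$$

$$
f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
dist_{Map}(i,j) = \sqrt{(x_i-x_j)^{2} + (y_i-y_j)^{2}}
$$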
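The evaluation measure compares the average map distance between document pairs with different reference topics to the average map distance between pairs with the same reference topic. The sketch below computes that ratio under two assumptions that are not stated in the slides: `positions[i]` is the map coordinate assigned to document $i$, and `labels[i]` is its reference topic $T_{r,i}$. The function name and the toy data are hypothetical.

```python
import math
from itertools import combinations


def distance_ratio(positions, labels):
    """R_dist = dist_Between / dist_Within over all document pairs i < j."""
    between_sum = within_sum = 0.0
    between_count = within_count = 0
    for i, j in combinations(range(len(positions)), 2):
        d = math.dist(positions[i], positions[j])   # dist_Map(i, j)
        if labels[i] != labels[j]:
            between_sum += d
            between_count += 1
        else:
            within_sum += d
            within_count += 1
    return (between_sum / between_count) / (within_sum / within_count)


# Four documents on a map: two with reference topic "a", two with "b".
positions = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = ["a", "a", "b", "b"]
ratio = distance_ratio(positions, labels)
```

A larger ratio means that documents sharing a reference topic sit closer together on the map than documents from different topics.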


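IR – Berlin Chen 46

The EM Algorithm (cont)

• M-step (Maximization)

– Maximize $\Phi_b(\Theta,\hat{\Theta})$, the auxiliary function for the (multivariate) Gaussian means and variances

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\,\log P(\vec{x}_i \mid c_k,\hat{\Theta})$$

– Each class is modeled by an $m$-dimensional Gaussian:

$$P(\vec{x}_i \mid c_k,\Theta) = \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(\vec{x}_i-\vec{\mu}_k)^T \Sigma_k^{-1} (\vec{x}_i-\vec{\mu}_k)\right)$$

– Let

$$w_{ik} = \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}$$

and

$$\log P(\vec{x}_i \mid c_k,\hat{\Theta}) = -\frac{m}{2}\cdot\log(2\pi) - \frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i-\hat{\vec{\mu}}_k)$$

– Then

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D$$

where $D$ is a constant.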
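As a rough illustration of the weights $w_{ik}$, the sketch below normalizes the products of class likelihoods and priors for each sample. The function name `responsibilities` and the numeric likelihood values are hypothetical; the likelihoods are assumed to have been evaluated already with the Gaussian density of each class.

```python
def responsibilities(likelihoods, priors):
    """Return w[i][k] = P(x_i|c_k) P(c_k) / sum_l P(x_i|c_l) P(c_l)."""
    weights = []
    for row in likelihoods:
        # joint[k] = P(x_i | c_k, Theta) * P(c_k | Theta)
        joint = [p_x * p_c for p_x, p_c in zip(row, priors)]
        total = sum(joint)
        weights.append([value / total for value in joint])
    return weights


# Hypothetical likelihood values for two samples and two classes.
likelihoods = [[0.2, 0.1], [0.05, 0.4]]
priors = [0.5, 0.5]
w = responsibilities(likelihoods, priors)
```

Each row of `w` sums to one, so the weights can be read as a soft assignment of a sample to the classes.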

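IR – Berlin Chen 47

The EM Algorithm (cont)

• M-step (Maximization)

– Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\vec{\mu}}_k$

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D$$

$$\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\vec{\mu}}_k} = -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\cdot 2\cdot \hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)\cdot(-1) = 0$$

$$\Rightarrow\ \hat{\vec{\mu}}_k = \frac{\sum_{i=1}^{n} w_{ik}\,\vec{x}_i}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\cdot\vec{x}_i}{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}$$

– Identity used: $\frac{d(\vec{x}^T C \vec{x})}{d\vec{x}} = (C + C^T)\,\vec{x}$, and $\Sigma_k^{-1}$ is symmetric here

– $w_{ik}$: the expected number of times that $\vec{x}_i$ falls in class $c_k$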
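The mean update is a weighted average of the samples, which can be sketched in a few lines. The function name `update_means` and the toy data are assumptions for illustration; `weights[i][k]` plays the role of $w_{ik}$.

```python
def update_means(points, weights):
    """mu_k = sum_i w_ik x_i / sum_i w_ik for every class k."""
    n_classes = len(weights[0])
    dim = len(points[0])
    means = []
    for k in range(n_classes):
        total = sum(w[k] for w in weights)  # sum_i w_ik
        means.append([
            sum(w[k] * x[d] for x, w in zip(points, weights)) / total
            for d in range(dim)
        ])
    return means


# Toy data: the middle point is shared equally by both classes.
points = [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]]
weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
means = update_means(points, weights)
```

With hard 0/1 weights this reduces to the centroid update of K-means.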

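IR – Berlin Chen 48

The EM Algorithm (cont)

• M-step (Maximization)

– Maximize $\Phi_b(\Theta,\hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$\Phi_b(\Theta,\hat{\Theta}) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\left[-\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(\vec{x}_i-\hat{\vec{\mu}}_k)^T \hat{\Sigma}_k^{-1} (\vec{x}_i-\hat{\vec{\mu}}_k)\right] + D$$

– Identities used ($\Sigma_k$ is symmetric here):

$$\frac{d\,\det(X)}{dX} = \det(X)\cdot X^{-T}, \qquad \frac{d\,(\vec{a}^T X^{-1} \vec{b})}{dX} = -X^{-1}\,\vec{a}\,\vec{b}^T X^{-1}$$

– Setting the derivative to zero:

$$\frac{\partial \Phi_b(\Theta,\hat{\Theta})}{\partial \hat{\Sigma}_k} = -\frac{1}{2}\sum_{i=1}^{n} w_{ik}\left[|\hat{\Sigma}_k|^{-1}\cdot|\hat{\Sigma}_k|\cdot\hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1}\right] = 0$$

$$\Rightarrow\ \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1}$$

$$\Rightarrow\ \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k\hat{\Sigma}_k^{-1}\hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik}\cdot\hat{\Sigma}_k\hat{\Sigma}_k^{-1}(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T\hat{\Sigma}_k^{-1}\hat{\Sigma}_k$$

$$\Rightarrow\ \hat{\Sigma}_k\cdot\sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T$$

$$\Rightarrow\ \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} w_{ik}} = \frac{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}\cdot(\vec{x}_i-\hat{\vec{\mu}}_k)(\vec{x}_i-\hat{\vec{\mu}}_k)^T}{\sum_{i=1}^{n} \frac{P(\vec{x}_i \mid c_k,\Theta)\,P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l,\Theta)\,P(c_l \mid \Theta)}}$$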
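The covariance update is a weighted average of outer products around the new mean. A minimal sketch follows; the function name `update_covariance` and the four toy points are hypothetical, and `weights[i][k]` again stands for $w_{ik}$.

```python
def update_covariance(points, weights, mean, k):
    """Sigma_k = sum_i w_ik (x_i - mu_k)(x_i - mu_k)^T / sum_i w_ik."""
    dim = len(mean)
    total = sum(w[k] for w in weights)
    cov = [[0.0] * dim for _ in range(dim)]
    for x, w in zip(points, weights):
        diff = [x[d] - mean[d] for d in range(dim)]
        # Accumulate the weighted outer product diff * diff^T.
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += w[k] * diff[a] * diff[b]
    return [[value / total for value in row] for row in cov]


# Four corners of a square, all fully assigned to class 0.
points = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
weights = [[1.0], [1.0], [1.0], [1.0]]
cov = update_covariance(points, weights, mean=[1.0, 1.0], k=0)
```

For this symmetric toy set the result is the identity matrix, since the off-diagonal products cancel.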

IR – Berlin Chen 49

The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
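The stopping rule can be sketched with a one-dimensional mixture, which keeps the example free of matrix algebra. Everything here is an assumption for illustration: the function names, the tolerance `tol`, the variance floor `var_floor`, the standard mixture-weight update $P(c_k) = \frac{1}{n}\sum_i w_{ik}$, and the initial means, which stand in for centroids that K-means would supply.

```python
import math


def gauss(x, mu, var):
    """One-dimensional Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_1d(xs, init_means, max_iter=100, tol=1e-6, var_floor=1e-6):
    k = len(init_means)
    mus = list(init_means)
    variances = [1.0] * k
    priors = [1.0 / k] * k
    history = []  # log-likelihood before each update
    for _ in range(max_iter):
        # E-step: joint[i][c] = P(x_i | c) P(c) under the current parameters.
        joint = [[p * gauss(x, m, v) for m, v, p in zip(mus, variances, priors)]
                 for x in xs]
        log_lik = sum(math.log(sum(row)) for row in joint)
        # Stop once the log-likelihood improves by less than tol.
        if history and log_lik - history[-1] < tol:
            history.append(log_lik)
            break
        history.append(log_lik)
        w = [[value / sum(row) for value in row] for row in joint]
        # M-step: weighted mean, variance and prior for each class.
        for c in range(k):
            n_c = sum(row[c] for row in w)
            mus[c] = sum(row[c] * x for row, x in zip(w, xs)) / n_c
            var = sum(row[c] * (x - mus[c]) ** 2 for row, x in zip(w, xs)) / n_c
            variances[c] = max(var, var_floor)  # guard against collapse
            priors[c] = n_c / len(xs)
    return mus, variances, priors, history


xs = [0.0, 0.2, -0.1, 5.0, 5.1, 4.9]
mus, variances, priors, history = em_1d(xs, init_means=[0.0, 5.0])
```

The loop ends either through the `break` (likelihood converged) or by exhausting `max_iter`, matching the two termination conditions above.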

IR – Berlin Chen 50

Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information

– TMM/PLSA approach

$$P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i)\left[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\right]$$

$$P(T_l \mid Y_k) = \frac{E(T_k,T_l)}{\sum_{s=1}^{K} E(T_k,T_s)}, \qquad E(T_k,T_l) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{dist(T_k,T_l)^2}{2\sigma^2}\right]$$

$$dist(T_i,T_j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}$$

• Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map

• Related documents are placed in the same cluster, and the relationships among clusters are reflected by their distances on the map

• When a cluster has many documents, it can be further analyzed into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]
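To make the role of the map distance concrete, the sketch below builds the neighborhood table $P(T_l \mid Y_k)$ from topic positions on a grid and then evaluates $P(w_j \mid D_i)$ for one word. The function names, the 2x2 grid, and the value of `sigma` are assumptions chosen for illustration.

```python
import math


def neighbourhood(coords, sigma):
    """Return table[k][l] = P(T_l | Y_k) for topics at map positions coords."""
    def e(a, b):
        d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2  # squared map distance
        return math.exp(-d2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    table = []
    for ck in coords:
        row = [e(ck, cl) for cl in coords]
        total = sum(row)
        table.append([value / total for value in row])
    return table


def word_prob(p_topic_given_doc, p_neigh, p_word_given_topic):
    """P(w_j|D_i) = sum_k P(T_k|D_i) sum_l P(T_l|Y_k) P(w_j|T_l)."""
    n_topics = len(p_word_given_topic)
    return sum(
        p_tk * sum(p_neigh[k][l] * p_word_given_topic[l] for l in range(n_topics))
        for k, p_tk in enumerate(p_topic_given_doc)
    )


# Four topics on a 2x2 map (hypothetical layout).
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
p_neigh = neighbourhood(coords, sigma=1.0)
p = word_prob([0.1, 0.2, 0.3, 0.4], p_neigh, [0.25, 0.25, 0.25, 0.25])
```

Topics that sit close together on the map share more probability mass, which is what ties neighboring clusters to related content.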

IR – Berlin Chen 51

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$L_T = \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log P(w_j \mid D_i) = \sum_{i=1}^{N}\sum_{j=1}^{J} c(w_j,D_i)\,\log\left\{\sum_{k=1}^{K} P(T_k \mid D_i)\left[\sum_{l=1}^{K} P(T_l \mid Y_k)\,P(w_j \mid T_l)\right]\right\}$$

– EM training can be performed

$$\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k \mid w_j,D_i)}{\sum_{j'=1}^{J}\sum_{i'=1}^{N} c(w_{j'},D_{i'})\,P(T_k \mid w_{j'},D_{i'})}$$

$$\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j,D_i)\,P(T_k \mid w_j,D_i)}{c(D_i)}$$

where

$$P(T_k \mid w_j,D_i) = \frac{\left[\sum_{l=1}^{K} P(w_j \mid T_l)\,P(T_l \mid Y_k)\right]\cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K}\left\{\left[\sum_{l=1}^{K} P(w_j \mid T_l)\,P(T_l \mid Y_{k'})\right]\cdot P(T_{k'} \mid D_i)\right\}}$$
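One EM update for this model can be sketched as follows. The function names, the array layout (`p_w_t[k][j]` for $P(w_j \mid T_k)$, `p_neigh[k][l]` for $P(T_l \mid Y_k)$, `p_t_d[i][k]` for $P(T_k \mid D_i)$), and the toy numbers are hypothetical. The sketch also assumes that $c(D_i)$ is the total term count of document $D_i$, which is what makes the updated $\hat{P}(T_k \mid D_i)$ sum to one.

```python
def posterior(j, i, p_w_t, p_neigh, p_t_d):
    """Return P(T_k | w_j, D_i) for every topic k."""
    n_topics = len(p_neigh)
    scores = [
        sum(p_w_t[l][j] * p_neigh[k][l] for l in range(n_topics)) * p_t_d[i][k]
        for k in range(n_topics)
    ]
    total = sum(scores)
    return [s / total for s in scores]


def em_update(counts, p_w_t, p_neigh, p_t_d):
    """One E-step plus M-step; counts[i][j] = c(w_j, D_i)."""
    n_docs, n_words, n_topics = len(counts), len(counts[0]), len(p_neigh)
    post = [[posterior(j, i, p_w_t, p_neigh, p_t_d) for j in range(n_words)]
            for i in range(n_docs)]
    # P(T_k | D_i): expected topic counts divided by document length c(D_i).
    new_p_t_d = [
        [sum(counts[i][j] * post[i][j][k] for j in range(n_words)) / sum(counts[i])
         for k in range(n_topics)]
        for i in range(n_docs)
    ]
    # P(w_j | T_k): expected word counts per topic, normalized over words.
    new_p_w_t = []
    for k in range(n_topics):
        num = [sum(counts[i][j] * post[i][j][k] for i in range(n_docs))
               for j in range(n_words)]
        denom = sum(num)
        new_p_w_t.append([value / denom for value in num])
    return new_p_w_t, new_p_t_d


counts = [[2, 1], [0, 3]]
p_w_t = [[0.7, 0.3], [0.2, 0.8]]
p_neigh = [[0.8, 0.2], [0.2, 0.8]]
p_t_d = [[0.5, 0.5], [0.5, 0.5]]
new_p_w_t, new_p_t_d = em_update(counts, p_w_t, p_neigh, p_t_d)
```

In the toy data the second document contains only the second word, which the second topic favors, so its updated topic weights shift toward that topic.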

IR – Berlin Chen 52

Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$S(w_j,T_k) = \frac{\sum_{i=1}^{N} c(w_j,D_i)\,P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j,D_{i'})\,\left[1 - P(T_k \mid D_{i'})\right]}$$
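The score compares how often a word occurs in documents that belong to a topic against how often it occurs elsewhere. A minimal sketch, with a hypothetical function name and toy counts, and with `p_t_d[i][k]` standing for $P(T_k \mid D_i)$:

```python
def topic_word_score(counts, p_t_d, j, k):
    """S(w_j, T_k): in-topic weighted count over out-of-topic weighted count."""
    n_docs = len(counts)
    inside = sum(counts[i][j] * p_t_d[i][k] for i in range(n_docs))
    outside = sum(counts[i][j] * (1.0 - p_t_d[i][k]) for i in range(n_docs))
    return inside / outside  # undefined when the word never occurs off-topic


# counts[i][j] = c(w_j, D_i) for two documents and two words.
counts = [[3, 1], [1, 4]]
p_t_d = [[0.9, 0.1], [0.2, 0.8]]
score_w0 = topic_word_score(counts, p_t_d, j=0, k=0)
score_w1 = topic_word_score(counts, p_t_d, j=1, k=0)
```

Word 0 is concentrated in the document dominated by topic 0, so it scores higher for that topic than word 1 does.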

IR – Berlin Chen 53

Hierarchical Document Organization (cont)

• Example

IR – Berlin Chen 54

Hierarchical Document Organization (cont)

• Example (cont)

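IR – Berlin Chen 55

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM)

– A recursive regression process

[Figure: an input layer holding the input vector $\vec{x} = [x_1, x_2, \ldots, x_n]^T$ and a mapping layer whose units carry weight vectors $\vec{m}_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, for example $\vec{m}_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$; a second sketch shows $\vec{m}_i$ moved along $\vec{x} - \vec{m}_i$ to its updated position]

$$\vec{m}_i(t+1) = \vec{m}_i(t) + h_{c(\vec{x}),i}(t)\,[\vec{x}(t) - \vec{m}_i(t)]$$

$$c(\vec{x}) = \arg\min_i \|\vec{x} - \vec{m}_i\|, \quad \text{where } \|\vec{x} - \vec{m}_i\| = \sqrt{\sum_n (x_n - m_{i,n})^2}$$

$$h_{c(\vec{x}),i}(t) = \alpha(t)\,\exp\left(-\frac{\|\vec{r}_i - \vec{r}_{c(\vec{x})}\|^2}{2\sigma^2(t)}\right)$$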
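A single SOM training step can be sketched as below: find the best-matching unit, then pull every unit toward the input with a strength that decays with map distance. The function name `som_step`, the three-unit map, and the fixed `alpha` and `sigma` values are assumptions; in practice both would shrink over time as $\alpha(t)$ and $\sigma(t)$.

```python
import math


def som_step(x, weights, positions, alpha, sigma):
    """Return (index of best-matching unit, updated weight vectors)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    # c(x) = argmin_i ||x - m_i||
    c = min(range(len(weights)), key=lambda i: sq_dist(x, weights[i]))
    updated = []
    for m, r in zip(weights, positions):
        # Neighborhood strength h_{c(x),i} from the distance on the map.
        h = alpha * math.exp(-sq_dist(r, positions[c]) / (2 * sigma ** 2))
        updated.append([mn + h * (xn - mn) for xn, mn in zip(x, m)])
    return c, updated


x = [1.0, 1.0]
weights = [[0.0, 0.0], [1.0, 0.8], [2.0, 2.0]]
positions = [(0, 0), (1, 0), (2, 0)]  # unit coordinates r_i on the map
c, updated = som_step(x, weights, positions, alpha=0.5, sigma=1.0)
```

The winning unit moves the most, while its map neighbors move by a smaller fraction of their distance to the input.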

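IR – Berlin Chen 56

Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}$$

$$f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}, \qquad C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i,j) = \sqrt{(x_i-x_j)^2 + (y_i-y_j)^2}$$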
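The distance ratio can be computed by averaging map distances over document pairs with different labels and over pairs with the same label. In this sketch the function name `r_dist` and the toy layout are hypothetical, and `labels[i]` is assumed to play the role of $T_{r,i}$, a topic label attached to document $i$.

```python
import math
from itertools import combinations


def r_dist(positions, labels):
    """R_Dist = mean between-label map distance / mean within-label distance."""
    between, within = [], []
    for i, j in combinations(range(len(positions)), 2):
        d = math.hypot(positions[i][0] - positions[j][0],
                       positions[i][1] - positions[j][1])
        # Pairs with equal labels count as "within", all others as "between".
        (within if labels[i] == labels[j] else between).append(d)
    return (sum(between) / len(between)) / (sum(within) / len(within))


# Two tight groups placed three units apart on the map.
positions = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = ["a", "a", "b", "b"]
ratio = r_dist(positions, labels)
```

A larger ratio means documents with the same label sit closer together on the map than documents with different labels. The function assumes at least one pair of each kind; otherwise one of the averages is undefined.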



ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN 
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 48: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter


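The EM Algorithm (cont)

• M-step (Maximization)
  – Maximize $\Phi_b(\Theta, \hat{\Theta})$ with respect to $\hat{\Sigma}_k$

$$
\Phi_b(\Theta, \hat{\Theta}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik} \left[ -\frac{1}{2} \log \left| \hat{\Sigma}_k \right| - \frac{1}{2} \left( \vec{x}_i - \hat{\vec{\mu}}_k \right)^{T} \hat{\Sigma}_k^{-1} \left( \vec{x}_i - \hat{\vec{\mu}}_k \right) \right] + D
$$

Matrix identities used ($\Sigma_k$ is symmetric here):

$$
\frac{d \det(X)}{dX} = \det(X) \cdot X^{-T}, \qquad \frac{d \left( a^{T} X^{-1} b \right)}{dX} = -X^{-T} a b^{T} X^{-T}
$$

$$
\begin{aligned}
\frac{\partial \Phi_b}{\partial \hat{\Sigma}_k}
&= -\frac{1}{2} \sum_{i=1}^{n} w_{ik} \left[ \hat{\Sigma}_k^{-1} - \hat{\Sigma}_k^{-1} \left( \vec{x}_i - \hat{\vec{\mu}}_k \right) \left( \vec{x}_i - \hat{\vec{\mu}}_k \right)^{T} \hat{\Sigma}_k^{-1} \right] = 0 \\
\Rightarrow \; & \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} = \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k^{-1} \left( \vec{x}_i - \hat{\vec{\mu}}_k \right) \left( \vec{x}_i - \hat{\vec{\mu}}_k \right)^{T} \hat{\Sigma}_k^{-1} \\
\Rightarrow \; & \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \hat{\Sigma}_k = \sum_{i=1}^{n} w_{ik} \cdot \hat{\Sigma}_k \hat{\Sigma}_k^{-1} \left( \vec{x}_i - \hat{\vec{\mu}}_k \right) \left( \vec{x}_i - \hat{\vec{\mu}}_k \right)^{T} \hat{\Sigma}_k^{-1} \hat{\Sigma}_k \\
\Rightarrow \; & \hat{\Sigma}_k \cdot \sum_{i=1}^{n} w_{ik} = \sum_{i=1}^{n} w_{ik} \cdot \left( \vec{x}_i - \hat{\vec{\mu}}_k \right) \left( \vec{x}_i - \hat{\vec{\mu}}_k \right)^{T} \\
\Rightarrow \; & \hat{\Sigma}_k = \frac{\sum_{i=1}^{n} w_{ik} \cdot \left( \vec{x}_i - \hat{\vec{\mu}}_k \right) \left( \vec{x}_i - \hat{\vec{\mu}}_k \right)^{T}}{\sum_{i=1}^{n} w_{ik}}
= \frac{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)} \cdot \left( \vec{x}_i - \hat{\vec{\mu}}_k \right) \left( \vec{x}_i - \hat{\vec{\mu}}_k \right)^{T}}{\sum_{i=1}^{n} \dfrac{P(\vec{x}_i \mid c_k, \Theta) P(c_k \mid \Theta)}{\sum_{l=1}^{K} P(\vec{x}_i \mid c_l, \Theta) P(c_l \mid \Theta)}}
\end{aligned}
$$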
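The covariance update on this slide is a responsibility-weighted average of outer products. A minimal pure-Python sketch of that one step follows; it is an illustration under assumptions, not code from the lecture. The function name `update_covariance` and the toy inputs are hypothetical, `w[i][k]` plays the role of $w_{ik}$, and the means are assumed to have been updated already.

```python
def update_covariance(xs, w, mu, k):
    """Weighted covariance estimate for cluster k (hypothetical helper).

    xs : list of n points, each a list of d floats
    w  : n x K responsibilities, w[i][k] standing in for w_ik
    mu : list of K mean vectors, assumed already updated
    """
    d = len(xs[0])
    total = sum(w[i][k] for i in range(len(xs)))  # sum_i w_ik
    sigma = [[0.0] * d for _ in range(d)]
    for i, x in enumerate(xs):
        diff = [x[a] - mu[k][a] for a in range(d)]
        for a in range(d):
            for b in range(d):
                # accumulate w_ik * (x_i - mu_k)(x_i - mu_k)^T
                sigma[a][b] += w[i][k] * diff[a] * diff[b]
    return [[s / total for s in row] for row in sigma]


# Toy input: two points fully assigned to cluster 0.
xs = [[0.0, 0.0], [2.0, 2.0]]
w = [[1.0, 0.0], [1.0, 0.0]]
mu = [[1.0, 1.0], [0.0, 0.0]]
sigma0 = update_covariance(xs, w, mu, 0)
```

With hard assignments the result reduces to the ordinary sample covariance of the cluster members, which is a quick sanity check on the formula.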


The EM Algorithm (cont)

• The initial cluster distributions can be estimated using the K-means algorithm

• The procedure terminates when the likelihood function $P(X \mid \Theta)$ has converged or the maximum number of iterations is reached
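The two points on this slide, K-means initialization and the stopping rule, can be sketched for the simplest case. The code below is a hedged illustration that assumes one-dimensional data and scalar variances so it needs only the standard library; `kmeans_1d`, `em_gmm_1d`, `tol`, `max_iter` and `var_floor` are hypothetical names, and numerical underflow of the likelihood is not handled.

```python
import math


def kmeans_1d(xs, K, iters=10):
    """Plain K-means on scalars; returns a cluster label for every point."""
    srt = sorted(xs)
    # deterministic start: K evenly spaced order statistics
    centers = [srt[(2 * k + 1) * len(xs) // (2 * K)] for k in range(K)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(K), key=lambda k: (x - centers[k]) ** 2) for x in xs]
        for k in range(K):
            members = [x for x, l in zip(xs, labels) if l == k]
            if members:  # keep the old center if a cluster empties
                centers[k] = sum(members) / len(members)
    return labels


def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, K, tol=1e-6, max_iter=100, var_floor=1e-6):
    n = len(xs)
    labels = kmeans_1d(xs, K)
    # initial cluster distributions estimated from the K-means partition
    pi, mu, var = [], [], []
    for k in range(K):
        members = [x for x, l in zip(xs, labels) if l == k] or xs
        m = sum(members) / len(members)
        pi.append(len(members) / n)
        mu.append(m)
        var.append(max(sum((x - m) ** 2 for x in members) / len(members), var_floor))
    s = sum(pi)
    pi = [p / s for p in pi]

    prev_ll, ll, it = -math.inf, -math.inf, 0
    for it in range(1, max_iter + 1):
        # E-step: responsibilities and log-likelihood log P(X | Theta)
        ll, w = 0.0, []
        for x in xs:
            joint = [pi[k] * gauss(x, mu[k], var[k]) for k in range(K)]
            px = sum(joint)
            ll += math.log(px)
            w.append([j / px for j in joint])
        if abs(ll - prev_ll) < tol:  # likelihood has converged
            break
        prev_ll = ll
        # M-step: re-estimate weights, means and variances
        for k in range(K):
            nk = sum(w[i][k] for i in range(n))
            if nk == 0:  # dead component: leave its parameters unchanged
                continue
            pi[k] = nk / n
            mu[k] = sum(w[i][k] * xs[i] for i in range(n)) / nk
            var[k] = max(sum(w[i][k] * (xs[i] - mu[k]) ** 2 for i in range(n)) / nk,
                         var_floor)
    return pi, mu, var, ll, it


xs = [-0.2, 0.0, 0.1, 0.3, 9.8, 10.0, 10.1, 10.3]
pi_hat, mu_hat, var_hat, loglik, n_iter = em_gmm_1d(xs, K=2)
```

The loop exits on whichever condition fires first: a change in log-likelihood below `tol`, or `max_iter` iterations. Comparing log-likelihoods rather than raw likelihoods is a design choice made here to avoid very small products.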


Hierarchical Document Organization

• Explore the Probabilistic Latent Topical Information
  – TMM/PLSA approach

• Documents are clustered by their latent topics and organized in a two-dimensional tree structure, or a two-layer map

• Related documents fall in the same cluster, and the relationships among clusters are reflected by their distances on the map

• When a cluster contains many documents, it can be further analyzed into another map on the next layer

[Figure: Two-dimensional Tree Structure for Organized Topics]

$$
P(w_j \mid D_i) = \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k) \, P(w_j \mid T_l) \right]
$$

$$
P(T_l \mid Y_k) = \frac{E(T_l, T_k)}{\sum_{s=1}^{K} E(T_s, T_k)}
$$

$$
E(T_l, T_k) = \frac{1}{\sqrt{2\pi} \, \sigma} \exp \left[ -\frac{\mathrm{dist}(T_l, T_k)^2}{2 \sigma^2} \right]
$$

$$
\mathrm{dist}(T_i, T_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$
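To make the map-based smoothing concrete, the sketch below computes $P(T_l \mid Y_k)$ from topic coordinates on the map and then evaluates $P(w_j \mid D_i)$ for one document. It is an illustration under assumptions: the function names, the 2 x 2 layout, the three-word vocabulary and all probability values are made up for the example.

```python
import math


def neighbor_probs(coords, sigma):
    """P(T_l | Y_k) for topics placed at 2-D map coordinates.

    Returns a K x K list p with p[k][l] = E(T_l, T_k) / sum_s E(T_s, T_k).
    """
    p = []
    for xk, yk in coords:
        e = []
        for xl, yl in coords:
            d2 = (xk - xl) ** 2 + (yk - yl) ** 2  # dist(T_l, T_k)^2
            e.append(math.exp(-d2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma))
        total = sum(e)
        p.append([v / total for v in e])
    return p


def word_given_doc(p_t_d, p_l_k, p_w_t):
    """P(w_j | D_i) = sum_k P(T_k|D_i) * sum_l P(T_l|Y_k) P(w_j|T_l).

    p_t_d : K topic weights for one document, P(T_k | D_i)
    p_l_k : K x K output of neighbor_probs
    p_w_t : K x J, p_w_t[l][j] = P(w_j | T_l)
    """
    K, J = len(p_w_t), len(p_w_t[0])
    return [
        sum(p_t_d[k] * sum(p_l_k[k][l] * p_w_t[l][j] for l in range(K))
            for k in range(K))
        for j in range(J)
    ]


# Hypothetical 2 x 2 map with four topics and a three-word vocabulary.
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
p_l_k = neighbor_probs(coords, sigma=1.0)
p_w_t = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.3, 0.3, 0.4]]
p_w_d = word_given_doc([0.4, 0.3, 0.2, 0.1], p_l_k, p_w_t)
```

Because each row of `p_l_k` is normalized, nearby topics on the map share probability mass while the result remains a proper distribution over words.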


Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

$$
L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i)
= \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k) \, P(w_j \mid T_l) \right] \right\}
$$

  – EM training can be performed

$$
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i) \, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'}) \, P(T_k \mid w_{j'}, D_{i'})}
$$

$$
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i) \, P(T_k \mid w_j, D_i)}{c(D_i)}
$$

where $c(D_i) = \sum_{j=1}^{J} c(w_j, D_i)$ and

$$
P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l) \, P(T_l \mid Y_k) \right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'}) \, P(T_{l'} \mid Y_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}
$$
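One EM iteration for this model can be sketched as follows. This is a minimal illustration assuming dense Python lists and strictly positive probabilities; the function name `tmm_em_step`, the variable names and the toy counts are hypothetical, and the neighborhood probabilities `p_l_k` are treated as fixed inputs.

```python
def tmm_em_step(c, p_t_d, p_w_t, p_l_k):
    """One EM iteration (hypothetical helper).

    c     : N x J term counts, c[i][j] = c(w_j, D_i)
    p_t_d : N x K, P(T_k | D_i)
    p_w_t : K x J, P(w_j | T_k)
    p_l_k : K x K, p_l_k[k][l] = P(T_l | Y_k), kept fixed
    """
    N, J, K = len(c), len(c[0]), len(p_w_t)
    # word distribution seen through topic k: sum_l P(T_l|Y_k) P(w_j|T_l)
    smooth = [[sum(p_l_k[k][l] * p_w_t[l][j] for l in range(K)) for j in range(J)]
              for k in range(K)]

    # E-step: posterior P(T_k | w_j, D_i)
    post = [[None] * J for _ in range(N)]
    for i in range(N):
        for j in range(J):
            joint = [smooth[k][j] * p_t_d[i][k] for k in range(K)]
            z = sum(joint)
            post[i][j] = [v / z for v in joint]

    # M-step: re-estimate P(w_j | T_k), normalized over the vocabulary
    new_w_t = []
    for k in range(K):
        num = [sum(c[i][j] * post[i][j][k] for i in range(N)) for j in range(J)]
        den = sum(num)
        new_w_t.append([v / den for v in num])

    # M-step: re-estimate P(T_k | D_i), normalized by document length c(D_i)
    new_t_d = []
    for i in range(N):
        length = sum(c[i])
        new_t_d.append([sum(c[i][j] * post[i][j][k] for j in range(J)) / length
                        for k in range(K)])
    return new_t_d, new_w_t


# Toy collection: two documents, three words, two topics.
c = [[3, 1, 0], [0, 2, 2]]
p_t_d = [[0.5, 0.5], [0.5, 0.5]]
p_w_t = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
p_l_k = [[0.9, 0.1], [0.1, 0.9]]
new_t_d, new_w_t = tmm_em_step(c, p_t_d, p_w_t, p_l_k)
```

In practice the step would be repeated until the total log-likelihood stops improving; that loop is omitted here to keep the sketch short.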


Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i) \, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'}) \left[ 1 - P(T_k \mid D_{i'}) \right]}
$$
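The selection score compares how often a word occurs in documents that belong to topic $T_k$ with how often it occurs elsewhere. A short sketch follows, with hypothetical names and toy values; the handling of a zero denominator is an assumption added for the example, since the slide does not address it.

```python
def topic_word_scores(c, p_t_d, k):
    """S(w_j, T_k) for every word j (hypothetical helper).

    c     : N x J term counts, c[i][j] = c(w_j, D_i)
    p_t_d : N x K, P(T_k | D_i)
    """
    N, J = len(c), len(c[0])
    scores = []
    for j in range(J):
        inside = sum(c[i][j] * p_t_d[i][k] for i in range(N))
        outside = sum(c[i][j] * (1.0 - p_t_d[i][k]) for i in range(N))
        if outside > 0:
            scores.append(inside / outside)
        elif inside > 0:
            scores.append(float("inf"))  # assumption: word occurs only in topic k
        else:
            scores.append(0.0)  # assumption: unseen word gets the lowest score
    return scores


c = [[3, 1, 0], [0, 2, 2]]
p_t_d = [[0.8, 0.2], [0.1, 0.9]]
scores = topic_word_scores(c, p_t_d, k=0)
# index of the best topic word for topic 0
top = max(range(len(scores)), key=scores.__getitem__)
```

Sorting words by this score and keeping the first few gives a label set for each topic on the map.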


Hierarchical Document Organization (cont)

• Example [figure]

Hierarchical Document Organization (cont)

• Example (cont) [figure]


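Hierarchical Document Organization (cont)

• Self-Organizing Map (SOM)
  – A recursive regression process

[Figure: SOM architecture, labeled Mapping Layer, Input Layer, Input Vector and Weight Vector, with vectors $x$, $m_i$, $x - m_i$ and $m_i'$]

Input vector: $x = [x_1, x_2, \ldots, x_n]^T$

Weight vector: $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$ (for example, $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$)

$$
m_i(t+1) = m_i(t) + h_{c(x),i}(t) \left[ x(t) - m_i(t) \right]
$$

$$
c(x) = \arg\min_i \left\| x - m_i \right\|, \qquad \text{where } \left\| x - m_i \right\| = \sqrt{\sum_{n} \left( x_n - m_{i,n} \right)^2}
$$

$$
h_{c(x),i}(t) = \alpha(t) \exp \left( -\frac{\left\| r_i - r_{c(x)} \right\|^2}{2 \sigma^2(t)} \right)
$$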
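A single SOM update can be sketched directly from these formulas. The code below is a minimal illustration: `som_step`, the 1 x 3 map layout and the fixed values of `alpha` and `sigma` are assumptions, whereas in the formulas both $\alpha(t)$ and $\sigma(t)$ vary with time.

```python
import math


def som_step(x, weights, positions, alpha, sigma):
    """One SOM update; weights is modified in place, the winner index is returned."""

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    # c(x): the unit whose weight vector is closest to x
    # (minimizing the squared distance gives the same winner)
    c = min(range(len(weights)), key=lambda i: sqdist(x, weights[i]))
    for i, m in enumerate(weights):
        # neighborhood kernel h_{c(x),i} on the map coordinates r_i
        h = alpha * math.exp(-sqdist(positions[i], positions[c]) / (2 * sigma ** 2))
        # m_i(t+1) = m_i(t) + h * [x - m_i(t)]
        weights[i] = [mv + h * (xv - mv) for mv, xv in zip(m, x)]
    return c


weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
positions = [(0, 0), (1, 0), (2, 0)]  # hypothetical 1 x 3 map
winner = som_step([1.2, 0.8], weights, positions, alpha=0.5, sigma=1.0)
```

The winner moves halfway toward the input because its kernel value equals `alpha`; neighbors move by smaller amounts that decay with their map distance from the winner.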


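Hierarchical Document Organization (cont)

• Results

Model   Iterations   R_Dist = dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
$$

where

$$
dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i,j)}, \qquad
dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i,j)}
$$

$$
f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
dist_{Map}(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$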
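The evaluation measure is the average map distance between document pairs with different topic labels, divided by the average distance between pairs with the same label. A small sketch follows; `distance_ratio`, the coordinates and the labels are hypothetical, and the code assumes at least one pair of each kind exists.

```python
import math


def distance_ratio(coords, labels):
    """R_Dist = dist_Between / dist_Within over all document pairs i < j.

    coords : map position (x, y) of each document
    labels : topic label T_r of each document
    """
    between = within = 0.0
    n_between = n_within = 0
    D = len(coords)
    for i in range(D):
        for j in range(i + 1, D):
            d = math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            if labels[i] != labels[j]:
                between += d      # f_Between(i, j)
                n_between += 1    # C_Between(i, j)
            else:
                within += d       # f_Within(i, j)
                n_within += 1     # C_Within(i, j)
    return (between / n_between) / (within / n_within)


coords = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = ["a", "a", "b", "b"]
r_dist = distance_ratio(coords, labels)
```

A larger ratio means that documents with the same label sit closer together on the map than documents with different labels.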


Page 49: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 49

The EM Algorithm (cont)

bull The initial cluster distributions can be estimated using the K-means algorithm

bull The procedure terminates when the likelihoodfunction is converged or maximum numberof iterations is reached

( )ΘXP

IR ndash Berlin Chen 50

Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information

ndash TMMPLSA approach

bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map

bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map

bull When a cluster has many documents we can further analyze it into an other map on the next layer

Two-dimensional Tree Structure

for Organized Topics

( ) ( ) ( ) ( )sum ⎥⎦

⎤⎢⎣

⎡sum=

= =

K

k

K

lljklikij TwPYTPDTPDwP

1 1

( ) ( )⎥⎥⎦

⎢⎢⎣

⎡minus= 2

2

2exp

21

σσπlk

klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=

( ) ( )( )sum

=

=

K

sks

klkl

TTE

TTEYTP

1

IR ndash Berlin Chen 51

Hierarchical Document Organization (cont)

bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

ndash EM training can be performed

( ) ( )

( ) ( ) ( ) ( )⎭⎬⎫

⎩⎨⎧sum ⎥

⎤⎢⎣

⎡sumsum sum=

sum sum=

= == =

= =

K

k

K

lljklik

N

i

J

nij

ijN

i

J

nijT

TwPYTPDTPDwc

DwPDwcL

1 11 1

1 1

log

log

( )( ) ( )

( ) ( )sum sum

sum=

=prime =primeprimeprimeprimeprime

=J

j

N

iijkij

N

iijkij

kjDwTPDwc

DwTPDwcTwP

1 1

1

|

||ˆ

( )( ) ( )

( ) |

|ˆ 1

i

J

jijkij

ik Dc

DwTPDwcDTP

sum= =

where

( )( ) ( ) ( )

( ) ( ) ( )sum⎭⎬⎫

⎩⎨⎧

sdot⎥⎦

⎤⎢⎣

⎡sum

sdot⎥⎦

⎤⎢⎣

⎡sum

=prime

=primeprime

=primeprimeprimeprime

=K

kik

K

lkllj

ikK

lkllj

ijk

DTPTTPTwP

DTPTTPTwPDwTP

1 1

1

|||

||||

IR ndash Berlin Chen 52

Hierarchical Document Organization (cont)

bull Criterion for Topic Word Selecting

( )( ) ( )

( ) ( )sum minus

sum=

=primeprimeprime

=N

iikij

N

iikij

kjDTPDwc

DTPDwcTwS

1

1

]|1[

|

IR ndash Berlin Chen 53

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

Hierarchical Document Organization (cont)

• Self-Organizing Map (SOM) – a recursive regression process

Input layer, input vector: $x = [x_1, x_2, \ldots, x_n]^T$

Mapping layer, weight vector of unit $i$: $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$ (for example, $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$)

$$m_i(t+1) = m_i(t) + h_{c(x),i}(t)\, [x(t) - m_i(t)]$$

$$c(x) = \arg\min_i \| x - m_i \|$$

where

$$\| x - m_i \| = \sqrt{\sum_n (x_n - m_{i,n})^2}$$

$$h_{c(x),i}(t) = \alpha(t) \exp\left( -\frac{\| r_i - r_{c(x)} \|^2}{2\sigma^2(t)} \right)$$

[Figure: the vectors $x$, $m_i$, $x - m_i$, and the updated weight vector $m_i'$]
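A single SOM update, as described by the equations above, can be sketched as follows. The function name `som_step` and the argument layout are illustrative assumptions. `positions[i]` plays the role of the grid location $r_i$, and `alpha` and `sigma` stand for the values of $\alpha(t)$ and $\sigma(t)$ at the current step.

```python
import math


def som_step(weights, positions, x, alpha, sigma):
    """Apply one SOM update for the input vector x.

    weights[i]   = weight vector m_i of map unit i
    positions[i] = grid location r_i of map unit i
    Returns (new_weights, index_of_best_matching_unit).
    """
    # c(x): the unit whose weight vector is closest to x (squared distance
    # gives the same argmin as the Euclidean norm).
    c = min(range(len(weights)),
            key=lambda i: sum((xn - mn) ** 2 for xn, mn in zip(x, weights[i])))
    rc = positions[c]
    new_weights = []
    for m, r in zip(weights, positions):
        grid_d2 = sum((a - b) ** 2 for a, b in zip(r, rc))
        h = alpha * math.exp(-grid_d2 / (2 * sigma ** 2))  # h_{c(x),i}(t)
        new_weights.append([mn + h * (xn - mn) for xn, mn in zip(x, m)])
    return new_weights, c


# Toy usage: three units on a line; the middle unit is closest to x.
weights = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
positions = [(0, 0), (0, 1), (0, 2)]
x = [1.2, 0.8]
updated, winner = som_step(weights, positions, x, alpha=0.5, sigma=1.0)
```

In a full training run `alpha` and `sigma` would typically shrink over time, so that early steps order the map globally and later steps fine-tune it. The schedule is left out here because the slide does not specify one.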

Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

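$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

where

$$dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i, j)}$$

$$f_{Between}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i, j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i, j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i, j)}$$

$$f_{Within}(i, j) = \begin{cases} dist_{Map}(i, j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i, j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$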
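The distance ratio defined above can be computed with a short sketch. It rests on an interpretation that the slide does not spell out: `coords[i]` is taken to be the map position $(x_i, y_i)$ at which document $i$ is placed, and `labels[i]` is taken to be its topic label $T_{r,i}$. The function name `r_dist` is hypothetical.

```python
import math


def r_dist(coords, labels):
    """Ratio of mean between-topic map distance to mean within-topic map distance.

    coords[i] = (x_i, y_i), assumed map position of document i
    labels[i] = assumed topic label T_{r,i} of document i
    """
    between_sum = within_sum = 0.0
    between_n = within_n = 0
    n_docs = len(coords)
    for i in range(n_docs):
        for j in range(i + 1, n_docs):
            d = math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1])
            if labels[i] != labels[j]:
                between_sum += d
                between_n += 1
            else:
                within_sum += d
                within_n += 1
    dist_between = between_sum / between_n
    dist_within = within_sum / within_n
    return dist_between / dist_within


# Toy usage: two topics whose documents sit three map units apart.
coords = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = ["a", "a", "b", "b"]
ratio = r_dist(coords, labels)
```

A larger ratio means that documents sharing a topic lie closer together on the map than documents from different topics. The sketch does not guard against an empty pair set or a zero within-topic distance, either of which would raise `ZeroDivisionError`.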

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK 
DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT 
ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU 
ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP 
ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA 
ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN 
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 50: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 50

Hierarchical Document Organizationbull Explore the Probabilistic Latent Topical Information

ndash TMMPLSA approach

bull Documents are clustered by the latent topics and organized in a two-dimensional tree structure or a two-layer map

bull Those related documents are in the same cluster and the relationships among the clusters have to do with the distance on the map

bull When a cluster has many documents we can further analyze it into an other map on the next layer

Two-dimensional Tree Structure

for Organized Topics

( ) ( ) ( ) ( )sum ⎥⎦

⎤⎢⎣

⎡sum=

= =

K

k

K

lljklikij TwPYTPDTPDwP

1 1

( ) ( )⎥⎥⎦

⎢⎢⎣

⎡minus= 2

2

2exp

21

σσπlk

klTTdistTTE( ) ( ) ( )22 jijiji yyxxTTdist minus+minus=

( ) ( )( )sum

=

=

K

sks

klkl

TTE

TTEYTP

1

IR ndash Berlin Chen 51

Hierarchical Document Organization (cont)

bull The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

ndash EM training can be performed

( ) ( )

( ) ( ) ( ) ( )⎭⎬⎫

⎩⎨⎧sum ⎥

⎤⎢⎣

⎡sumsum sum=

sum sum=

= == =

= =

K

k

K

lljklik

N

i

J

nij

ijN

i

J

nijT

TwPYTPDTPDwc

DwPDwcL

1 11 1

1 1

log

log

( )( ) ( )

( ) ( )sum sum

sum=

=prime =primeprimeprimeprimeprime

=J

j

N

iijkij

N

iijkij

kjDwTPDwc

DwTPDwcTwP

1 1

1

|

||ˆ

( )( ) ( )

( ) |

|ˆ 1

i

J

jijkij

ik Dc

DwTPDwcDTP

sum= =

where

( )( ) ( ) ( )

( ) ( ) ( )sum⎭⎬⎫

⎩⎨⎧

sdot⎥⎦

⎤⎢⎣

⎡sum

sdot⎥⎦

⎤⎢⎣

⎡sum

=prime

=primeprime

=primeprimeprimeprime

=K

kik

K

lkllj

ikK

lkllj

ijk

DTPTTPTwP

DTPTTPTwPDwTP

1 1

1

|||

||||

IR ndash Berlin Chen 52

Hierarchical Document Organization (cont)

bull Criterion for Topic Word Selecting

( )( ) ( )

( ) ( )sum minus

sum=

=primeprimeprime

=N

iikij

N

iikij

kjDTPDwc

DTPDwcTwS

1

1

]|1[

|

IR ndash Berlin Chen 53

Hierarchical Document Organization (cont)

bull Example

IR ndash Berlin Chen 54

Hierarchical Document Organization (cont)

bull Example (cont)

IR ndash Berlin Chen 55

Hierarchical Document Organization (cont)

bull Self-Organization Map (SOM) ndash A recursive regression process

[ ]Tnmmmm 121111 =

(Mapping Layer

Input Layer

[ ]Tnxxxx 21=Input Vector

[ ]Tniiii mmmm 21 =

Weight Vector

)]()()[()()1( )( tmtxthtmtm iixcii minus+=+

ii

mxxc primeprime

minus= minarg)(

where( )sum minus=minus primeprime n nini mxmx 2

⎟⎟⎟

⎜⎜⎜

⎛ minusminus=

)(2exp)()( 2

2

)()( t

rrtth xci

ixc σα

imx

ii mx minus

imprime

IR ndash Berlin Chen 56

Hierarchical Document Organization (cont)

bull Results

20604100SOM1917540194773020650201916510

TMM

distBetweendistWithinIterationsModel

Within

BetweenDist dist

distR =

sumsum

sumsum

= +=

= +==D

i

D

ijBetween

D

i

D

ijBetween

Between

jiC

jifdist

1 1

1 1

)(

)(⎩⎨⎧ ne

=otherwise

TTijdistjif jrirMap

Between 0

)()(

( ) ( )22)( jijiMap yyxxijdist minus+minus=

⎩⎨⎧ ne

= 0

1 )(

otherwise

TTjiC jrir

Between

sumsum

sumsum

= +=

= +== D

i

D

ijWithin

D

i

D

ijWithin

Within

jiC

jifdist

1 1

1 1

)(

)(⎩⎨⎧ =

= 0

)()(

otherwise

TTijdistjif jrirMap

Within

⎪⎩

⎪⎨⎧ =

= 0

1 )(

otherwise

TTjiC jrir

Within

where

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK 
DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT 
ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU 
ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP 
ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA 
ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN 
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 

Hierarchical Document Organization (cont)

• The model can be trained by maximizing the total log-likelihood of all terms observed in the document collection

  – EM training can be performed

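$$
L_T = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log P(w_j \mid D_i)
    = \sum_{i=1}^{N} \sum_{j=1}^{J} c(w_j, D_i) \log \left\{ \sum_{k=1}^{K} P(T_k \mid D_i) \left[ \sum_{l=1}^{K} P(T_l \mid Y_k)\, P(w_j \mid T_l) \right] \right\}
$$

$$
\hat{P}(w_j \mid T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{j'=1}^{J} \sum_{i'=1}^{N} c(w_{j'}, D_{i'})\, P(T_k \mid w_{j'}, D_{i'})}
$$

$$
\hat{P}(T_k \mid D_i) = \frac{\sum_{j=1}^{J} c(w_j, D_i)\, P(T_k \mid w_j, D_i)}{c(D_i)}
$$

where

$$
P(T_k \mid w_j, D_i) = \frac{\left[ \sum_{l=1}^{K} P(w_j \mid T_l)\, P(T_l \mid T_k) \right] \cdot P(T_k \mid D_i)}{\sum_{k'=1}^{K} \left\{ \left[ \sum_{l'=1}^{K} P(w_j \mid T_{l'})\, P(T_{l'} \mid T_{k'}) \right] \cdot P(T_{k'} \mid D_i) \right\}}
$$

(the neighborhood weight $P(T_l \mid Y_k)$ of the log-likelihood is written $P(T_l \mid T_k)$ in the posterior)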
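The EM updates on this slide can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the lecture's implementation: the function name `tmm_em_step`, the nested-list layout, and the choice to hold the topic-neighborhood weights `p_nb[k][l]` (standing in for P(T_l | T_k)) fixed during training are all assumptions. It also assumes that c(D_i) is the total term count of document D_i and that every topic receives some posterior mass.

```python
import math

def tmm_em_step(counts, p_w_t, p_t_d, p_nb):
    """One EM iteration of the topic mixture model (hypothetical helper).

    counts[i][j] = c(w_j, D_i)        term counts
    p_w_t[k][j]  = P(w_j | T_k)       topic word distributions
    p_t_d[i][k]  = P(T_k | D_i)       document topic weights
    p_nb[k][l]   = P(T_l | T_k)       neighborhood weights, held fixed (assumption)
    """
    N, J, K = len(counts), len(counts[0]), len(p_w_t)
    new_w_t = [[0.0] * J for _ in range(K)]
    new_t_d = [[0.0] * K for _ in range(N)]
    log_lik = 0.0
    for i in range(N):
        for j in range(J):
            c = counts[i][j]
            if c == 0:
                continue
            # joint[k] = [sum_l P(w_j|T_l) P(T_l|T_k)] * P(T_k|D_i)
            joint = [
                sum(p_w_t[l][j] * p_nb[k][l] for l in range(K)) * p_t_d[i][k]
                for k in range(K)
            ]
            total = sum(joint)              # this is P(w_j | D_i)
            log_lik += c * math.log(total)  # log-likelihood under the current parameters
            for k in range(K):
                post = joint[k] / total     # E-step: P(T_k | w_j, D_i)
                new_w_t[k][j] += c * post   # numerator of the P(w_j | T_k) update
                new_t_d[i][k] += c * post   # numerator of the P(T_k | D_i) update
    # M-step: normalize the accumulated expected counts
    for k in range(K):
        z = sum(new_w_t[k])
        new_w_t[k] = [v / z for v in new_w_t[k]]
    for i in range(N):
        c_d = sum(counts[i])                # c(D_i), assumed to be the document length
        new_t_d[i] = [v / c_d for v in new_t_d[i]]
    return new_w_t, new_t_d, log_lik
```

Calling the function repeatedly, feeding each result back in, gives the training loop; the returned log-likelihood can be tracked to decide when to stop.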


Hierarchical Document Organization (cont)

• Criterion for Topic Word Selection

$$
S(w_j, T_k) = \frac{\sum_{i=1}^{N} c(w_j, D_i)\, P(T_k \mid D_i)}{\sum_{i'=1}^{N} c(w_j, D_{i'})\, \left[ 1 - P(T_k \mid D_{i'}) \right]}
$$
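As a rough illustration of the selection criterion, the score S(w_j, T_k) compares how much of a word's count falls in documents weighted toward topic T_k against how much falls elsewhere. The sketch below assumes the same nested-list layout as a term-count matrix; `topic_word_scores`, `top_words` and the handling of a zero denominator are hypothetical choices, not part of the slides.

```python
def topic_word_scores(counts, p_t_d, k):
    """Score every word for topic k (hypothetical helper).

    counts[i][j] = c(w_j, D_i), p_t_d[i][k] = P(T_k | D_i).
    """
    N, J = len(counts), len(counts[0])
    scores = []
    for j in range(J):
        num = sum(counts[i][j] * p_t_d[i][k] for i in range(N))
        den = sum(counts[i][j] * (1.0 - p_t_d[i][k]) for i in range(N))
        if den > 0:
            scores.append(num / den)
        else:
            # Assumption: an unseen word scores 0; a word seen only in
            # documents fully assigned to topic k gets an infinite score.
            scores.append(float("inf") if num > 0 else 0.0)
    return scores

def top_words(vocab, counts, p_t_d, k, n=5):
    """Return the n highest-scoring words for topic k."""
    scores = topic_word_scores(counts, p_t_d, k)
    ranked = sorted(range(len(vocab)), key=lambda j: scores[j], reverse=True)
    return [vocab[j] for j in ranked[:n]]
```

The top-ranked words can then serve as labels for the topic on the map.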

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

Hierarchical Document Organization (cont)

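• Self-Organizing Map (SOM) – A recursive regression process

[Figure: an input layer feeds the input vector $x$ to the mapping layer, where each unit $i$ holds a weight vector $m_i$; the update moves $m_i$ along $x - m_i$ to $m_i'$.]

Input vector: $x = [x_1, x_2, \ldots, x_n]^T$

Weight vector: $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$ (for example, $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$)

$$
m_i(t+1) = m_i(t) + h_{c(x),i}(t)\, [x(t) - m_i(t)]
$$

$$
c(x) = \arg\min_i \lVert x - m_i \rVert, \quad \text{where } \lVert x - m_i \rVert = \sqrt{\sum_{n} (x_n - m_{i,n})^2}
$$

$$
h_{c(x),i}(t) = \alpha(t) \exp\left( -\frac{\lVert r_i - r_{c(x)} \rVert^2}{2\sigma^2(t)} \right)
$$

Here $r_i$ is the position of unit $i$ on the map, $\alpha(t)$ is the learning rate, and $\sigma(t)$ is the neighborhood width.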
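A single SOM update, as described on this slide, can be sketched as follows: find the winning unit c(x) closest to the input, then pull every unit's weight vector toward the input by an amount that decays with map distance from the winner. The function name `som_step` and the list-based data layout are assumptions made for illustration; `alpha` and `sigma` stand for the values of α(t) and σ(t) at the current step.

```python
import math

def som_step(weights, coords, x, alpha, sigma):
    """One SOM update for a single input vector (hypothetical helper).

    weights[i] = weight vector m_i, coords[i] = map position r_i of unit i.
    Returns the winner index c(x) and the updated weight vectors.
    """
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # Winner: c(x) = argmin_i ||x - m_i|| (the squared distance has the same argmin)
    c = min(range(len(weights)), key=lambda i: sqdist(x, weights[i]))

    new_weights = []
    for i, m in enumerate(weights):
        # Gaussian neighborhood h_{c(x),i}(t) centered on the winner
        h = alpha * math.exp(-sqdist(coords[i], coords[c]) / (2.0 * sigma ** 2))
        # m_i(t+1) = m_i(t) + h * (x - m_i(t))
        new_weights.append([mi + h * (xi - mi) for mi, xi in zip(m, x)])
    return c, new_weights
```

In a full training run, `alpha` and `sigma` would typically be decreased over time so that the map first orders itself globally and then fine-tunes locally.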


Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

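$$
R_{Dist} = \frac{dist_{Between}}{dist_{Within}}
$$

where

$$
dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i,j)}
$$

$$
f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \ne T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i,j)}
$$

$$
f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
\qquad
C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}
$$

$$
dist_{Map}(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
$$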
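The distance ratio used in the results can be computed directly from map coordinates and topic labels. The sketch below is a minimal illustration under assumed inputs: `positions[i]` holds the map coordinates (x_i, y_i) of document i, `labels[i]` holds its topic label T_{r,i}, and the function name `r_dist` is hypothetical. It assumes there is at least one same-label pair and at least one different-label pair.

```python
import math

def r_dist(positions, labels):
    """Ratio of mean between-topic to mean within-topic map distance."""
    sum_between = sum_within = 0.0
    n_between = n_within = 0
    D = len(positions)
    for i in range(D):
        for j in range(i + 1, D):          # each unordered pair once
            (xi, yi), (xj, yj) = positions[i], positions[j]
            d = math.hypot(xi - xj, yi - yj)   # dist_Map(i, j)
            if labels[i] == labels[j]:
                sum_within += d
                n_within += 1
            else:
                sum_between += d
                n_between += 1
    dist_between = sum_between / n_between
    dist_within = sum_within / n_within
    return dist_between / dist_within
```

A larger ratio means that documents with different labels sit farther apart on the map, on average, than documents sharing a label.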

Page 53: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

Hierarchical Document Organization (cont)

• Example

Hierarchical Document Organization (cont)

• Example (cont)

Hierarchical Document Organization (cont)

• Self-Organization Map (SOM): a recursive regression process

– Input layer: input vector $x = [x_1, x_2, \ldots, x_n]^T$
– Mapping layer: weight vector $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$, e.g. $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$

$$m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,[x(t) - m_i(t)]$$

where

$$c(x) = \arg\min_i \|x - m_i\|, \qquad \|x - m_i\| = \sqrt{\sum_n (x_n - m_{i,n})^2}$$

$$h_{c(x),i}(t) = \alpha(t)\exp\left(-\frac{\|r_i - r_{c(x)}\|^2}{2\sigma^2(t)}\right)$$

(Figure labels: $x$, $m_i$, $x - m_i$, and the updated weight vector $m_i'$.)
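The SOM update rule, $m_i(t+1) = m_i(t) + h_{c(x),i}(t)[x(t) - m_i(t)]$ with a Gaussian neighborhood around the winning unit $c(x)$, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names, the toy three-unit map, and the fixed values of `alpha` and `sigma` are hypothetical and do not come from the slides, where $\alpha(t)$ and $\sigma(t)$ vary with time.

```python
import math

def best_matching_unit(x, weights):
    """Return c(x): the index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))

def som_update(x, weights, positions, alpha, sigma):
    """One SOM step: m_i <- m_i + h_{c(x),i} * (x - m_i) for every unit i."""
    c = best_matching_unit(x, weights)
    updated = []
    for m_i, r_i in zip(weights, positions):
        # Gaussian neighborhood centered on the winner's map position r_c.
        h = alpha * math.exp(-math.dist(r_i, positions[c]) ** 2 / (2 * sigma ** 2))
        updated.append([m + h * (x_n - m) for m, x_n in zip(m_i, x)])
    return c, updated

# Hypothetical toy map: three units on a line, two-dimensional inputs.
weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
positions = [(0, 0), (1, 0), (2, 0)]
winner, new_weights = som_update([0.9, 1.2], weights, positions, alpha=0.5, sigma=1.0)
```

The winner moves furthest toward the input, and its map neighbors move by a smaller amount that decays with their distance on the map, which is what makes nearby units end up representing similar inputs.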

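Hierarchical Document Organization (cont)

• Results

Model   Iterations   dist_Between / dist_Within
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604

$$R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$$

$$dist_{Between} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Between}(i,j)}, \qquad dist_{Within} = \frac{\sum_{i=1}^{D}\sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D}\sum_{j=i+1}^{D} C_{Within}(i,j)}$$

where

$$f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases} \qquad C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases} \qquad C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$$

$$dist_{Map}(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$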
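The ratio $R_{Dist}$ compares the average map distance between document pairs with different topic labels to the average map distance between pairs with the same label. The sketch below shows one way it might be computed. The function name `dist_ratio`, the argument layout, and the four-document example are hypothetical, and the code assumes at least one pair of each kind exists so that neither average divides by zero.

```python
import math
from itertools import combinations

def dist_ratio(map_xy, topics):
    """Return (dist_between, dist_within, R_dist) over all document pairs i < j."""
    between_sum = within_sum = 0.0
    between_n = within_n = 0
    for i, j in combinations(range(len(topics)), 2):
        d = math.dist(map_xy[i], map_xy[j])  # dist_Map(i, j) on the 2-D map
        if topics[i] != topics[j]:
            between_sum += d   # f_Between(i, j)
            between_n += 1     # C_Between(i, j)
        else:
            within_sum += d    # f_Within(i, j)
            within_n += 1      # C_Within(i, j)
    # Assumes both counts are nonzero (hypothetical precondition).
    dist_between = between_sum / between_n
    dist_within = within_sum / within_n
    return dist_between, dist_within, dist_between / dist_within

# Hypothetical data: four documents placed on the map, two reference topics.
map_xy = [(0, 0), (0, 1), (3, 0), (3, 1)]
topics = ["a", "a", "b", "b"]
between, within, ratio = dist_ratio(map_xy, topics)
```

A ratio above 1 means that documents sharing a topic sit closer together on the map than documents from different topics.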

Page 55: Clustering Techniques - NTNUberlin.csie.ntnu.edu.tw/.../IR2005F-Lecture09-Clustering.pdfClustering Techniques Berlin Chen 2005 References: 1. Introduction to Machine Learning , Chapter

IR ndash Berlin Chen 55

Hierarchical Document Organization (cont)

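• Self-Organization Map (SOM)
  – A recursive regression process

  Input Layer, Input Vector:
  $x = [x_1, x_2, \ldots, x_n]^T$

  Mapping Layer, Weight Vector of unit $i$:
  $m_i = [m_{i,1}, m_{i,2}, \ldots, m_{i,n}]^T$ (for example, $m_1 = [m_{1,1}, m_{1,2}, \ldots, m_{1,n}]^T$)

  Update rule:
  $m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,[x(t) - m_i(t)]$

  $c(x) = \arg\min_i \|x - m_i\|$

  where $\|x - m_i\| = \sqrt{\sum_n (x_n - m_{i,n})^2}$

  $h_{c(x),i}(t) = \alpha(t)\,\exp\left(-\frac{\|r_i - r_{c(x)}\|^2}{2\sigma^2(t)}\right)$

  [Figure: vectors $x$, $m_i$, $x - m_i$, and the updated weight vector $m_i'$]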
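A minimal sketch of one SOM update step, following the rule $m_i(t+1) = m_i(t) + h_{c(x),i}(t)[x(t) - m_i(t)]$ with a Gaussian neighborhood. The function names, the toy weights, and the choice to pass $\alpha(t)$ and $\sigma(t)$ as plain per-step numbers are assumptions made for illustration; the slides do not specify a schedule for either. The map positions $r_i$ are assumed to be grid coordinates.

```python
import math

def best_matching_unit(x, weights):
    """Return c(x): index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))

def neighborhood(r_i, r_c, alpha, sigma):
    """Gaussian neighborhood: alpha * exp(-||r_i - r_c||^2 / (2 * sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(r_i, r_c))
    return alpha * math.exp(-d2 / (2.0 * sigma ** 2))

def som_step(x, weights, positions, alpha, sigma):
    """Apply m_i <- m_i + h_{c(x),i} * (x - m_i) to every unit i."""
    c = best_matching_unit(x, weights)
    new_weights = []
    for m_i, r_i in zip(weights, positions):
        h = neighborhood(r_i, positions[c], alpha, sigma)
        new_weights.append([m + h * (xn - m) for m, xn in zip(m_i, x)])
    return c, new_weights

# Hypothetical toy map: three units on a 1 x 3 grid, two-dimensional inputs.
weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
positions = [(0, 0), (0, 1), (0, 2)]
c, updated = som_step([1.0, 2.0], weights, positions, alpha=0.5, sigma=1.0)
```

In this toy run the winning unit is the middle one, so it moves halfway toward the input, while its two grid neighbors move by a smaller fraction set by the neighborhood function.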


Hierarchical Document Organization (cont)

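• Results

  Model   Iterations   dist_Between / dist_Within
  TMM     10           1.9165
  TMM     20           2.0650
  TMM     30           1.9477
  TMM     40           1.9175
  SOM     100          2.0604

  $R_{Dist} = \frac{dist_{Between}}{dist_{Within}}$

  where

  $dist_{Between} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Between}(i,j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Between}(i,j)}$

  $f_{Between}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}$

  $C_{Between}(i,j) = \begin{cases} 1 & T_{r,i} \neq T_{r,j} \\ 0 & \text{otherwise} \end{cases}$

  $dist_{Within} = \frac{\sum_{i=1}^{D} \sum_{j=i+1}^{D} f_{Within}(i,j)}{\sum_{i=1}^{D} \sum_{j=i+1}^{D} C_{Within}(i,j)}$

  $f_{Within}(i,j) = \begin{cases} dist_{Map}(i,j) & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$

  $C_{Within}(i,j) = \begin{cases} 1 & T_{r,i} = T_{r,j} \\ 0 & \text{otherwise} \end{cases}$

  $dist_{Map}(i,j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$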
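The ratio $R_{Dist}$ can be sketched in a few lines: average the map distance over all document pairs with different labels, average it over all pairs with the same label, and divide. The function name `r_dist`, the toy coordinates, and the string labels are hypothetical; the sketch also assumes there is at least one pair of each kind, so neither count is zero.

```python
import math
from itertools import combinations

def r_dist(coords, topics):
    """Return dist_Between / dist_Within for documents placed on a 2-D map.

    coords[i] is the (x, y) map position of document i and topics[i] is its
    label T_{r,i}. Each unordered pair (i, j) with i < j is counted once.
    """
    between_sum = within_sum = 0.0
    between_cnt = within_cnt = 0
    for i, j in combinations(range(len(coords)), 2):
        d = math.dist(coords[i], coords[j])  # dist_Map(i, j)
        if topics[i] != topics[j]:
            between_sum += d
            between_cnt += 1
        else:
            within_sum += d
            within_cnt += 1
    dist_between = between_sum / between_cnt
    dist_within = within_sum / within_cnt
    return dist_between / dist_within

# Hypothetical toy layout: two topics, two documents each.
coords = [(0, 0), (0, 1), (3, 0), (3, 1)]
topics = ["a", "a", "b", "b"]
ratio = r_dist(coords, topics)
```

A larger ratio means documents with different labels sit farther apart on the map relative to documents sharing a label.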
