Bayesian Clustering with the Dirichlet Process: Issues with priors and interpreting MCMC
Shane T. Jensen
Department of Statistics
The Wharton School, University of Pennsylvania
Collaborative work with J. Liu, L. Dicker, and G. Tuteja
May 13, 2006
Introduction
• Bayesian non-parametric or semi-parametric models are
very useful in many applications
• Non-parametric: random variables are realizations from an
unspecified probability distribution, e.g.,
Xi ∼ F(·), i = 1, . . . , n
• Xi’s can be observed data, latent variables or unknown
parameters (often in a hierarchical setting)
• Prior distributions for F(·) play an important role in
non-parametric modeling
Dirichlet Process Priors
• A commonly-used prior distribution for an unknown
probability distribution is the Dirichlet process
F(·) ∼ DP(α, F0)
• F0 is a probability measure
– can represent prior belief in form of F
• α is a weight parameter
– can represent degree of belief in prior form F0
• Ferguson (1973,1974); Antoniak (1974); many others
• An important consequence of the Dirichlet process is that it
induces a discrete posterior distribution
Consequence of DP priors
• Ferguson (1974): using a Dirichlet process DP(α, F0) prior
for F(·) results in a posterior mixture of F0 and point
masses at the observations Xi:
F(·) | X1, . . . , Xn ∼ DP( α + n , [ α F0 + Σ_{i=1}^{n} δ(Xi) ] / (α + n) )
• For density estimation, discreteness may be a problem:
convolutions with kernel functions can be used to produce
a continuous density estimate (a sketch follows below)
• In other applications, discreteness is not a disadvantage!
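The kernel-convolution idea fits in a few lines of Python. This is an illustrative sketch, not the talk's code: the Gaussian kernel and the bandwidth h are arbitrary choices here.

```python
import math

def smoothed_density(x, point_masses, h=0.5):
    """Continuous density estimate: average of Gaussian kernels
    centered at the posterior point masses, with bandwidth h."""
    kernel = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return sum(kernel((x - xi) / h)
               for xi in point_masses) / (len(point_masses) * h)

# evaluate the smoothed density at 0 for four point masses
print(smoothed_density(0.0, [-1.0, 0.2, 0.4, 2.0]))
```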
Clustering with a DP prior
• Point mass component of posterior leads to a random
partition of our variables
• Consider a new variable Xn+1 and let X*1, . . . , X*C be the
unique values among X1:n = (X1, . . . , Xn). Then,
P (Xn+1 = X*c | X1:n) = Nc / (α + n)   for c = 1, . . . , C
P (Xn+1 = new | X1:n) = α / (α + n)
• Nc = size of cluster c: number of Xi in X1:n that equal X*c
“Rich get richer”: we will return to this (a simulation sketch follows below)...
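The predictive rule above is easy to simulate. A minimal Python sketch (illustrative, not the talk's implementation) that draws a random partition by assigning each new item according to those probabilities:

```python
import random

def sample_dp_partition(n, alpha, seed=0):
    """Sequentially assign n items: join cluster c with probability
    Nc / (alpha + i), start a new cluster with probability alpha / (alpha + i)."""
    rng = random.Random(seed)
    sizes, labels = [], []
    for i in range(n):
        u = rng.uniform(0, i + alpha)   # total weight = sum(sizes) + alpha
        acc, c = 0.0, len(sizes)        # default: new cluster
        for k, Nk in enumerate(sizes):
            acc += Nk
            if u <= acc:
                c = k
                break
        if c == len(sizes):
            sizes.append(1)             # open a new cluster
        else:
            sizes[c] += 1               # "rich get richer"
        labels.append(c)
    return labels

print(sample_dp_partition(10, alpha=1.0))
```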
Motivating Application: TF motifs
• Genes are regulated by transcription factor (TF) proteins
that bind to the DNA sequence near the gene
• TF proteins can selectively control only certain target genes
by binding only to the “same” sequence, called a motif
• The motif sites are highly conserved but not identical, so
we use a matrix description of the motif appearance
Frequency Matrix Xi:
     pos1  pos2  pos3  pos4  pos5  pos6
A    0.05  0.02  0.85  0.02  0.21  0.06
C    0.04  0.02  0.03  0.93  0.05  0.06
G    0.06  0.94  0.06  0.04  0.70  0.11
T    0.85  0.02  0.06  0.01  0.04  0.77
[Sequence logo for this frequency matrix]
Collections of TF motifs
• Large databases contain motif information on many TFs,
but with a large amount of redundancy
– TRANSFAC and JASPAR are the largest (hundreds of motifs in each)
• Want to cluster motifs together to either reduce
redundancy in databases or match new motifs to database
• Nucleotide conservation varies both within a single motif
(between positions) and between different motifs
[Sequence logos of two example motifs: Tal1beta-E47S and AGL3]
Motif Clustering with DP prior
• Hierarchical model with levels for both within-unit and
between-unit variability in discovered motifs
– Observed count matrix Yi is a product-multinomial
realization of the frequency matrix Xi (sketched below)
– Unknown Xi’s share unknown distribution F(·)
• Dirichlet process DP(α, F0) prior for F(·) leads to
posterior mixture of F0 and point masses at each Xi
• Our prior measure F0 in this application is a product
Dirichlet distribution
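The within-unit level of the model can be illustrated with a short NumPy sketch (illustrative; the count of 30 aligned sites is hypothetical): each column of the observed count matrix Yi is an independent multinomial draw from the corresponding column of Xi.

```python
import numpy as np

rng = np.random.default_rng(0)

# frequency matrix Xi from the earlier slide (columns sum to 1)
Xi = np.array([
    [0.05, 0.02, 0.85, 0.02, 0.21, 0.06],   # A
    [0.04, 0.02, 0.03, 0.93, 0.05, 0.06],   # C
    [0.06, 0.94, 0.06, 0.04, 0.70, 0.11],   # G
    [0.85, 0.02, 0.06, 0.01, 0.04, 0.77],   # T
])

n_sites = 30   # hypothetical number of aligned binding sites
# product multinomial: one independent multinomial draw per position
Yi = np.column_stack([rng.multinomial(n_sites, Xi[:, j])
                      for j in range(Xi.shape[1])])
print(Yi)
```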
Benefits and Issues with DP prior
• Allows unknown number of clusters without need to model
number of clusters directly
– No real prior knowledge about number of clusters in our
application
• However, with DP there are implicit assumptions about
number of clusters (and their size distribution)
• “Rich get richer” property influences prior predictive
number of clusters and cluster size distribution
– How influential is this property in an application?
Benefits and Issues with MCMC
• DP-based model is easy to implement via Gibbs sampling
– p(Xi | X−i) has the same choice structure as p(Xn+1 | X1:n)
– Xi is either sampled into one of the current clusters defined
by X−i, or sampled from F0 to form a new cluster
• Alternative is direct model on number of clusters and then
use something like Reversible Jump MCMC
• Mixing can be an issue with Gibbs sampler
– collapsed Gibbs sampler: integrate out the Xi and deal
directly with clustering indicators (a sketch follows this list)
– split/merge moves to speed up mixing: lots of great
work by R. Neal, D. Dahl and others
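One collapsed-Gibbs scan over clustering indicators can be sketched as follows. This is illustrative, not the talk's implementation: the function loglik stands in for the marginal likelihood of an item joining a cluster (it would integrate over F0 in the real model).

```python
import math, random

def gibbs_scan(z, alpha, loglik, seed=0):
    """One scan: reassign each item to an existing cluster (weight Nc
    times its marginal likelihood) or a new one (weight alpha)."""
    rng, n = random.Random(seed), len(z)
    for i in range(n):
        clusters = {}                       # clusters excluding item i
        for j in range(n):
            if j != i:
                clusters.setdefault(z[j], []).append(j)
        options = list(clusters.items()) + [(max(z) + 1, [])]  # + new
        logw = [(math.log(len(m)) if m else math.log(alpha)) + loglik(m, i)
                for _, m in options]
        top = max(logw)
        w = [math.exp(lw - top) for lw in logw]   # stabilized weights
        u, acc = rng.uniform(0, sum(w)), 0.0
        for (label, _), wk in zip(options, w):
            acc += wk
            if u <= acc:
                z[i] = label
                break
        else:
            z[i] = options[-1][0]           # guard against rounding
    return z

# with loglik = 0 the scan reduces to prior-only (CRP) reassignment
print(gibbs_scan([0] * 8, alpha=1.0, loglik=lambda members, i: 0.0))
```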
Main Issue 1: Posterior Inference from MCMC
• However, posterior inference based on the Gibbs sampling
output also has issues
• Need to infer a set of clusters from sampled partitions, but
we have a label switching problem (Stephens, 1999)
• cluster labels are exchangeable for a particular partition
• usual summaries such as the posterior mean can be misleading
mixtures over these exchangeable labelings
• need summaries that are unaffected by the labeling
Posterior Inference Options
• Option 1: clusters defined by last partition visited
– sampled partition produced at end of Gibbs chain
– surprisingly popular, e.g. in Latent Dirichlet Allocation models
• Option 2: clusters defined by MAP partition
– sampled partition with highest posterior density
– simple and popular
• Option 3: clusters defined by a threshold on the pairwise
posterior probabilities Pij (sketched below)
– Pij = frequency of iterations with motifs i & j in the same cluster
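Option 3 is straightforward to compute from the sampled partitions. A minimal NumPy sketch (illustrative):

```python
import numpy as np

def pairwise_probs(sampled_partitions):
    """P_ij = fraction of sampled partitions in which items i and j
    share a cluster label; input shape is (n_iterations, n_items)."""
    S = np.asarray(sampled_partitions)
    return np.mean(S[:, :, None] == S[:, None, :], axis=0)

# toy example: three sampled partitions of four motifs
P = pairwise_probs([[0, 0, 1, 1],
                    [0, 0, 0, 1],
                    [0, 0, 1, 1]])
print(P[0, 1])    # motifs 0 and 1 always co-cluster -> 1.0
print(P > 0.5)    # thresholding gives a co-clustering adjacency
```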
Main Issue 2: Implicit DP Assumptions
• DP has implicit “rich get richer” property: easy to see
from the predictive distribution:
P (Xn+1 joins cluster c) = Nc / (α + n)   for c = 1, . . . , C
P (Xn+1 forms new cluster) = α / (α + n)
• Chinese restaurant process: a new customer chooses a table
– sits at a currently occupied table with probability ∝ Nc, the
number of customers already sitting there
– sits at an entirely new table with probability ∝ α
Alternative Priors for Clustering
• Uniform Prior: socialism, no one gets rich
P (Xn+1 joins cluster c) = 1 / (α + C)   for c = 1, . . . , C
P (Xn+1 forms new cluster) = α / (α + C)
• Pitman-Yor Prior: rich get richer, but charitable
P (Xn+1 joins cluster c) = (Nc − δ) / (α + n)   for c = 1, . . . , C
P (Xn+1 forms new cluster) = (α + C · δ) / (α + n)
• 0 ≤ δ ≤ 1 is often called the “discount factor” (all three
predictive rules are sketched in code below)
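The three predictive rules can be written side by side in a short Python sketch (illustrative; sizes holds the current cluster sizes N1, . . . , NC):

```python
def predictive(sizes, alpha, prior="DP", delta=0.5):
    """Return (join probability for each existing cluster,
    probability of forming a new cluster)."""
    n, C = sum(sizes), len(sizes)
    if prior == "DP":                  # rich get richer
        return [Nc / (alpha + n) for Nc in sizes], alpha / (alpha + n)
    if prior == "Uniform":             # no one gets rich
        return [1 / (alpha + C)] * C, alpha / (alpha + C)
    if prior == "PitmanYor":           # rich get richer, but charitable
        return ([(Nc - delta) / (alpha + n) for Nc in sizes],
                (alpha + C * delta) / (alpha + n))
    raise ValueError(prior)

# three current clusters of sizes 5, 3, 1
print(predictive([5, 3, 1], alpha=1.0, prior="DP"))
print(predictive([5, 3, 1], alpha=1.0, prior="PitmanYor", delta=0.25))
```

In each case the probabilities sum to one over the C + 1 choices.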
Asymptotic Comparison of Priors
• Number of clusters Cn is clearly a function of sample size n
• How does Cn grow as n → ∞?
DP Prior:          E(Cn) ≈ α · log(n)
Pitman-Yor Prior:  E(Cn) ≈ K(α, δ) · n^δ
Uniform Prior:     E(Cn) ≈ K(α) · n^(1/2)
• DP prior shows the slowest growth in number of clusters Cn
• Interestingly, Pitman-Yor can lead to either faster or slower
growth vs. Uniform, depending on whether δ is above or
below 1/2 (see the simulation sketch after this list)
• Also working on results for distribution of cluster sizes
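These growth rates are easy to check by simulation. A self-contained Python sketch (illustrative) that sequentially assigns n items under each predictive rule and counts the resulting clusters:

```python
import random

def grow(n, alpha, prior, delta=0.5, seed=1):
    """Simulate Cn: assign n items one at a time, count clusters."""
    rng, sizes = random.Random(seed), []
    for i in range(n):
        m, C = sum(sizes), len(sizes)
        if prior == "DP":
            join = [Nc / (alpha + m) for Nc in sizes]
            new = alpha / (alpha + m)
        elif prior == "Uniform":
            join, new = [1 / (alpha + C)] * C, alpha / (alpha + C)
        else:                                        # Pitman-Yor
            join = [(Nc - delta) / (alpha + m) for Nc in sizes]
            new = (alpha + C * delta) / (alpha + m)
        u, acc = rng.uniform(0, sum(join) + new), 0.0
        for c, p in enumerate(join):
            acc += p
            if u <= acc:
                sizes[c] += 1
                break
        else:
            sizes.append(1)                          # new cluster
    return len(sizes)

for prior in ("DP", "Uniform", "PitmanYor"):
    print(prior, [grow(n, alpha=1.0, prior=prior)
                  for n in (100, 1000, 10000)])
```

With α = 1, a single run already shows the DP's log-type growth against the faster growth of the alternatives.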
Finite Sample Comparison of Priors
• E(Cn) versus n for different values of the weight parameter α
[Figure: three log-log panels (α = 1, 10, 100) of expected number
of clusters versus n = number of observations, comparing the DP,
Uniform (UN), and Pitman-Yor (PY) priors with discount
δ = 0.25, 0.5, 0.75]
Simulation Study of Motif Clustering
• Evaluation of different priors and modes of inference in
context of motif clustering application
• Simulated realistic collections of motifs (known partitions)
• Different simulation conditions to vary clustering difficulty:
– high to low within-cluster similarity
– high to low between-cluster similarity
• Success measured by the Jaccard similarity between the true
partition z and the inferred partition ẑ:
J(z, ẑ) = TP / (TP + FP + FN)
where TP, FP, FN count pairs of motifs clustered together in
both partitions, only in ẑ, and only in z, respectively
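The pairwise Jaccard index can be computed directly from two label vectors; a minimal Python sketch (illustrative):

```python
from itertools import combinations

def jaccard(z_true, z_hat):
    """Pairwise Jaccard similarity between two partitions,
    given as lists of cluster labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(z_true)), 2):
        same_true = z_true[i] == z_true[j]
        same_hat = z_hat[i] == z_hat[j]
        tp += same_true and same_hat      # together in both
        fp += same_hat and not same_true  # together only in z_hat
        fn += same_true and not same_hat  # together only in z_true
    return tp / (tp + fp + fn)

# toy example: one pair recovered, two spurious, one missed
print(jaccard([0, 0, 1, 1], [0, 0, 0, 1]))   # 1 / (1 + 2 + 1) = 0.25
```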
Simulation Comparison of Inference Alternatives
[Figure: Jaccard index (0.2 to 1.0) versus increasing clustering
difficulty for the MAP partition and for pairwise posterior
probabilities thresholded at 0.5 and 0.25]
• The MAP partition is consistently inferior to the pairwise probabilities
• Posterior probabilities incorporate uncertainty across iterations
Simulation Comparison of Prior Alternatives
[Figure: Jaccard index (0.70 to 0.95) versus increasing clustering
difficulty for the Uniform, Pitman-Yor (δ = 0.25, 0.5, 0.75),
and DP priors]
• Not much difference in general between the priors
• Uniform does a little worse in most situations
Real Data Results: Clustering JASPAR database
• Tree based on pairwise posterior probabilities Pij:
[Dendrogram of the JASPAR motifs, labeled by species and TF
family, with branch heights given by 1 − Prob(clustering)]
• After post-processing the MAP partition to remove weak
relationships, it is very similar to the thresholded posterior probabilities
Comparing Priors: Clustering JASPAR database
[Figure: posterior histograms of the number of clusters (roughly
20 to 35) and of the average cluster size (roughly 2.5 to 3.5)
under the Uniform and DP priors]
• Very little difference between using the DP and uniform prior
• Likelihood is dominating any prior assumption on the partition
Summary
• Non-parametric Bayesian approaches based on Dirichlet
process can be very useful for clustering applications
• Issues with MCMC inference: popular MAP partitions
seem inferior to partitions based on posterior probabilities
• Issues with implicit DP assumptions: alternative priors give
quite different prior partitions
• Posterior differences between priors are small in our motif
application, but can be larger in other applications
• Jensen and Liu, JASA (forthcoming) plus other
manuscripts soon available on my website
http://stat.wharton.upenn.edu/~stjensen