Bayesian Clustering with the Dirichlet Process: Issues with priors and interpreting MCMC
Shane T. Jensen
Department of Statistics
The Wharton School, University of Pennsylvania
stjensen@wharton.upenn.edu
Collaborative work with J. Liu, L. Dicker, and G. Tuteja
Introduction
• Bayesian non-parametric or semi-parametric models are
very useful in many applications
• Non-parametric: random variables are realizations from an
unspecified probability distribution, e.g.,

Xi ∼ F(·),  i = 1, . . . , n
• Xi’s can be observed data, latent variables or unknown
parameters (often in a hierarchical setting)
• Prior distributions for F(·) play an important role in
non-parametric modeling
Dirichlet Process Priors
• A commonly-used prior distribution for an unknown
probability distribution is the Dirichlet process
F(·) ∼ DP(α, F0)
• F0 is a probability measure
– can represent prior belief in form of F
• α is a weight parameter
– can represent degree of belief in prior form F0
• Ferguson (1973,1974); Antoniak (1974); many others
• An important consequence of the Dirichlet process is that it
induces a discrete posterior distribution
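To make the prior concrete, here is a minimal sketch (not from the talk) that draws an approximate realization of F ∼ DP(α, F0) by truncated stick-breaking; the standard-normal F0, the truncation level, and all function names are illustrative assumptions.

```python
# Minimal sketch: approximate draw of F ~ DP(alpha, F0) by truncated
# stick-breaking. F0 (standard normal) and the truncation are assumptions.
import numpy as np

def dp_stick_breaking(alpha, f0_sampler, truncation=1000, rng=None):
    """Return atoms and weights of a truncated draw from DP(alpha, F0)."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=truncation)   # stick-breaking fractions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining                     # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    atoms = f0_sampler(truncation, rng)             # atom locations drawn from F0
    return atoms, weights

atoms, weights = dp_stick_breaking(
    alpha=2.0, f0_sampler=lambda size, rng: rng.normal(size=size))
print(weights.sum())   # close to 1 at this truncation; the draw is discrete
```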
Consequence of DP priors
• Ferguson, 1974: using a Dirichlet process DP(α, F0) prior
for F(·) results in a posterior mixture of F0 and point
masses at the observations Xi:

F(·) | X1, . . . , Xn ∼ DP( α + n , (α F0 + Σ_{i=1}^{n} δ(Xi)) / (α + n) )
• For density estimation, this discreteness may be a problem:
convolutions with kernel functions can be used to produce
a continuous density estimate (sketched below)
• In other applications, discreteness is not a disadvantage!
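A minimal sketch of the kernel-convolution idea, assuming a Gaussian kernel; the bandwidth h and the toy atoms and weights standing in for posterior point masses are illustrative.

```python
# Smooth a discrete distribution (atoms + weights) into a continuous
# density estimate by convolving with a Gaussian kernel of bandwidth h.
import numpy as np
from scipy.stats import norm

def smoothed_density(x, atoms, weights, h=0.2):
    """density(x) = sum_k w_k * Normal(x; atom_k, h^2)."""
    return (weights * norm.pdf((x[:, None] - atoms[None, :]) / h) / h).sum(axis=1)

atoms = np.array([-1.3, 0.2, 0.9])       # toy posterior point masses
weights = np.array([0.5, 0.3, 0.2])
grid = np.linspace(-4, 4, 201)
density = smoothed_density(grid, atoms, weights)   # integrates to ~1
```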
Clustering with a DP prior
• Point mass component of posterior leads to a random
partition of our variables
• Consider a new variable Xn+1 and let X*1, . . . , X*C be the
unique values of X1:n = (X1, . . . , Xn). Then,

P(Xn+1 = X*c | X1:n) = Nc / (α + n),  c = 1, . . . , C
P(Xn+1 = new value | X1:n) = α / (α + n)

• Nc = size of cluster c: the number of values in X1:n equal to X*c
“Rich get richer”: will return to this...
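The predictive rule above can be simulated directly; a minimal sketch of the resulting "rich get richer" behavior, with illustrative values of α and n:

```python
# Sequentially assign n items to clusters via the DP predictive rule
# (Chinese restaurant process); tracks cluster sizes only.
import numpy as np

def crp_partition(n, alpha, seed=None):
    rng = np.random.default_rng(seed)
    sizes = []                                  # N_c for each current cluster
    for i in range(n):                          # item i sees i earlier items
        probs = np.array(sizes + [alpha]) / (alpha + i)
        c = rng.choice(len(probs), p=probs)
        if c == len(sizes):
            sizes.append(1)                     # new cluster: prob alpha/(alpha+i)
        else:
            sizes[c] += 1                       # join cluster c: prob N_c/(alpha+i)
    return sizes

print(crp_partition(n=100, alpha=1.0, seed=0))  # few large, many small clusters
```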
Motivating Application: TF motifs
• Genes are regulated by transcription factor (TF) proteins
that bind to the DNA sequence near the gene
• TF proteins can selectively control certain target genes by
binding only to the “same” sequence, called a motif
• The motif sites are highly conserved but not identical, so
we use a matrix description of the motif appearance
Frequency Matrix Xi (columns = motif positions)

      1     2     3     4     5     6
A  0.05  0.02  0.85  0.02  0.21  0.06
C  0.04  0.02  0.03  0.93  0.05  0.06
G  0.06  0.94  0.06  0.04  0.70  0.11
T  0.85  0.02  0.06  0.01  0.04  0.77
[Figure: sequence logo for this motif]
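One way to connect the frequency matrix to the sequence logo: letter heights in a logo are conventionally the column frequencies scaled by the column's information content, 2 − H(column) bits. This convention is an assumption on my part; the talk only displays the matrix and the logo.

```python
# Per-position information content of the frequency matrix shown above;
# conserved positions (e.g. the G column) approach the 2-bit maximum.
import numpy as np

X = np.array([  # rows A, C, G, T; columns = motif positions (from the slide)
    [0.05, 0.02, 0.85, 0.02, 0.21, 0.06],
    [0.04, 0.02, 0.03, 0.93, 0.05, 0.06],
    [0.06, 0.94, 0.06, 0.04, 0.70, 0.11],
    [0.85, 0.02, 0.06, 0.01, 0.04, 0.77],
])

entropy = -(X * np.log2(X)).sum(axis=0)   # H(column) in bits
info = 2.0 - entropy                      # information content per position
print(np.round(info, 2))
```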
Collections of TF motifs
• Large databases contain motif information on many TFs
but with large amount of redundancy
– TRANSFAC and JASPAR are the largest (hundreds of motifs in each)
• Want to cluster motifs together to either reduce
redundancy in databases or match new motifs to database
• Nucleotide conservation varies both within a single motif
(between positions) and between different motifs
[Figure: example motif logos, e.g. Tal1beta-E47S and AGL3]
Motif Clustering with DP prior
• Hierarchical model with levels for both within-unit and
between-unit variability in discovered motifs
– Observed count matrix Yi is a product multinomial
realization of frequency matrix Xi
– Unknown Xi’s share unknown distribution F(·)
• Dirichlet process DP(α, F0) prior for F(·) leads to
posterior mixture of F0 and point masses at each Xi
• Our prior measure F0 in this application is a product
Dirichlet distribution
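A hedged sketch of the resulting marginal likelihood, assuming each column of the 4 × w count matrix Yi is multinomial given the corresponding column of the frequency matrix Xi, with a Dirichlet prior on each column under F0; the pseudocounts and toy counts are illustrative.

```python
# Product Dirichlet-multinomial: with a Dirichlet(beta) prior per column,
# the frequency-matrix columns integrate out in closed form, and the
# marginal likelihood of a count matrix factors over columns.
import numpy as np
from scipy.special import gammaln

def log_marginal_column(y, beta):
    """log of integral Multinomial(y | x) Dirichlet(x | beta) dx
    (multinomial coefficient omitted; it is constant in x)."""
    n = y.sum()
    return (gammaln(beta.sum()) - gammaln(beta.sum() + n)
            + np.sum(gammaln(beta + y) - gammaln(beta)))

def log_marginal_matrix(Y, beta):
    return sum(log_marginal_column(Y[:, j], beta) for j in range(Y.shape[1]))

Y = np.array([[1, 0, 17], [1, 0, 1], [1, 19, 1], [17, 1, 1]])  # toy 4 x 3 counts
print(log_marginal_matrix(Y, beta=np.full(4, 0.5)))            # toy pseudocounts
```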
Benefits and Issues with DP prior
• Allows unknown number of clusters without need to model
number of clusters directly
– No real prior knowledge about number of clusters in our
application
• However, with DP there are implicit assumptions about
number of clusters (and their size distribution)
• “Rich get richer” property influences prior predictive
number of clusters and cluster size distribution
– How influential is this property in an application?
Benefits and Issues with MCMC
• DP-based model is easy to implement via Gibbs sampling
– p(Xi | X−i) has the same choice structure as p(Xn+1 | X1:n)
– Xi is either sampled into one of the current clusters defined by
X−i or sampled from F0 to form a new cluster
• Alternative is direct model on number of clusters and then
use something like Reversible Jump MCMC
• Mixing can be an issue with Gibbs sampler
– collapsed Gibbs sampler: integrate out Xi and deal
directly with clustering indicators (a minimal sketch follows this list)
– split/merge moves to speed up mixing: lots of great
work by R. Neal, D. Dahl and others
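A minimal sketch of one collapsed-Gibbs sweep over the clustering indicators. Here cluster_loglik is a hypothetical helper returning the log marginal likelihood of a set of items with their Xi integrated out against F0 (in the motif model, the product Dirichlet-multinomial marginal above).

```python
# One collapsed Gibbs pass: each item is removed from its cluster and
# reassigned with probability proportional to (CRP weight) x (marginal
# likelihood ratio), mirroring p(X_i | X_{-i}).
import numpy as np

def gibbs_sweep(z, data, alpha, cluster_loglik, rng):
    n = len(z)
    for i in range(n):
        z[i] = -1                                   # remove item i
        labels = sorted(set(z) - {-1})
        logp = []
        for c in labels:                            # weight for joining cluster c
            members = [j for j in range(n) if z[j] == c]
            logp.append(np.log(len(members))
                        + cluster_loglik(members + [i], data)
                        - cluster_loglik(members, data))
        logp.append(np.log(alpha) + cluster_loglik([i], data))  # new cluster
        logp = np.array(logp)
        p = np.exp(logp - logp.max())
        p /= p.sum()
        k = rng.choice(len(p), p=p)
        z[i] = labels[k] if k < len(labels) else max(labels, default=-1) + 1
    return z
```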
Main Issue 1: Posterior Inference from MCMC
• However, posterior inference based on Gibbs sampling
output also has issues
• Need to infer a set of clusters from sampled partitions, but
we have a label switching problem (Stephens, 1999)
• Cluster labels are exchangeable for a particular partition
• Usual summaries such as the posterior mean are misleading:
they mix over these exchangeable labelings
• We need summaries that are uninfluenced by labeling
Posterior Inference Options
• Option 1: clusters defined by last partition visited
– sampled partition produced at end of Gibbs chain
– surprisingly popular, e.g. in Latent Dirichlet Allocation models
• Option 2: clusters defined by MAP partition
– sampled partition with highest posterior density
– simple and popular
• Option 3: clusters defined by threshold on pairwise
posterior probabilities Pij
– frequency of iterations with motifs i & j in the same cluster
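Option 3 is easy to state in code; a minimal sketch with toy sampled partitions:

```python
# Estimate P_ij = Pr(motifs i and j share a cluster) across Gibbs samples,
# then link pairs whose probability exceeds a threshold.
import numpy as np

def pairwise_probs(partitions):
    """partitions: (S, n) array of cluster labels from S Gibbs iterations."""
    partitions = np.asarray(partitions)
    S, n = partitions.shape
    P = np.zeros((n, n))
    for z in partitions:
        P += (z[:, None] == z[None, :])     # co-clustering indicators this sweep
    return P / S

samples = np.array([[0, 0, 1, 1], [0, 0, 0, 1], [2, 2, 1, 1]])  # toy output
P = pairwise_probs(samples)
linked = P > 0.5                            # threshold on pairwise probabilities
```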
Main Issue 2: Implicit DP Assumptions
• DP has an implicit “rich get richer” property: easy to see
from the predictive distribution:

P(Xn+1 joins cluster c) = Nc / (α + n),  c = 1, . . . , C
P(Xn+1 forms new cluster) = α / (α + n)
• Chinese restaurant process: new customer chooses table
– sits at a current table c with probability ∝ Nc, the number
of customers already sitting there
– sits at an entirely new table with probability ∝ α
Alternative Priors for Clustering
• Uniform Prior: socialism, no one gets rich

P(Xn+1 joins cluster c) = 1 / (α + C),  c = 1, . . . , C
P(Xn+1 forms new cluster) = α / (α + C)

• Pitman-Yor Prior: rich get richer, but charitable

P(Xn+1 joins cluster c) = (Nc − d) / (α + n),  c = 1, . . . , C
P(Xn+1 forms new cluster) = (α + C · d) / (α + n)

• 0 ≤ d ≤ 1 is often called the “discount factor”
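All three predictive rules side by side, as a sketch with illustrative α, discount d, and cluster sizes:

```python
# Predictive probabilities of joining each existing cluster (sizes N_c)
# or opening a new one, under the DP, Uniform, and Pitman-Yor rules.
import numpy as np

def predictive(sizes, alpha, prior="DP", d=0.5):
    sizes = np.asarray(sizes, dtype=float)
    n, C = sizes.sum(), len(sizes)
    if prior == "DP":                        # rich get richer
        join, new = sizes / (alpha + n), alpha / (alpha + n)
    elif prior == "uniform":                 # every current cluster equally likely
        join, new = np.full(C, 1.0 / (alpha + C)), alpha / (alpha + C)
    else:                                    # Pitman-Yor: richer, but charitable
        join, new = (sizes - d) / (alpha + n), (alpha + C * d) / (alpha + n)
    return join, new

for prior in ("DP", "uniform", "PY"):
    print(prior, predictive([50, 3, 1], alpha=1.0, prior=prior))
```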
Asymptotic Comparison of Priors
• Number of clusters Cn is clearly a function of sample size n
• How does Cn grow as n → ∞?

DP Prior: E(Cn) ≈ α · log(n)
Pitman-Yor Prior: E(Cn) ≈ K(α, d) · n^d
Uniform Prior: E(Cn) ≈ K(α) · n^(1/2)
• DP prior shows slowest growth in number of clusters Cn
• Interestingly, Pitman-Yor can lead to either faster or slower
growth than Uniform, depending on d
• Also working on results for distribution of cluster sizes
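A quick numerical check of the DP rate, using the identity E(Cn) = Σ_{i=1}^{n} α/(α + i − 1) implied by the predictive rule; the comparison against α · log(n) is illustrative.

```python
# Exact prior mean number of clusters under the DP, versus alpha * log(n).
import numpy as np

def expected_clusters_dp(n, alpha):
    i = np.arange(1, n + 1)
    return np.sum(alpha / (alpha + i - 1.0))

for n in (100, 10_000, 1_000_000):
    print(n, round(expected_clusters_dp(n, alpha=1.0), 2), round(np.log(n), 2))
```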
Finite Sample Comparison of Priors
• Expected number of clusters Cn vs. n = number of observations, for different values of the weight parameter θ

[Figure: three panels (θ = 1, θ = 10, θ = 100) on log-log axes, each showing the expected number of clusters against n for the DP, Uniform (UN), and Pitman-Yor priors with discount 0.25, 0.5, 0.75 (labeled α in the legend)]
Simulation Study of Motif Clustering
• Evaluation of different priors and modes of inference in the
context of the motif clustering application
• Simulated realistic collections of motifs (known partitions)
• Different simulation conditions to vary clustering difficulty:
– high to low within-cluster similarity
– high to low between-cluster similarity
• Success measured by the Jaccard similarity between the true
partition z and the inferred partition ẑ, computed over pairs
of motifs (sketched below):

J(z, ẑ) = TP / (TP + FP + FN)
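A minimal sketch of this pairwise Jaccard measure: over all item pairs, TP counts pairs together in both partitions, FP pairs together only in the estimate, and FN pairs together only in the truth.

```python
# Pairwise Jaccard similarity between a true and an inferred partition,
# given as label vectors; the labels themselves are arbitrary.
from itertools import combinations

def jaccard(z_true, z_hat):
    tp = fp = fn = 0
    for i, j in combinations(range(len(z_true)), 2):
        same_true = z_true[i] == z_true[j]
        same_hat = z_hat[i] == z_hat[j]
        tp += same_true and same_hat
        fp += same_hat and not same_true
        fn += same_true and not same_hat
    return tp / (tp + fp + fn)

print(jaccard([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # 1 TP, 1 FP, 1 FN -> 1/3
```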
Simulation Comparison of Inference Alternatives
[Figure: Jaccard index vs. increasing clustering difficulty for three inference options: MAP, Prob > 0.5, and Prob > 0.25]
• MAP partition consistently inferior to pairwise probabilities
• Posterior probabilities incorporate uncertainty across iterations
Simulation Comparison of Prior Alternatives
[Figure: Jaccard index vs. increasing clustering difficulty for the Uniform, PY 0.25, PY 0.5, PY 0.75, and DP priors]
• Not much difference in general between priors
• Uniform does a little worse in most situations
Real Data Results: Clustering JASPAR database
• Tree based on pairwise posterior probabilities:

[Figure: dendrogram of the JASPAR motifs with branch heights given by 1 − Prob(clustering); leaves labeled by species, TF family, and JASPAR accession, e.g. Homo.sapiens−NUCLEAR−MA0065]
• Post-processing the MAP partition to remove weak
relationships makes it very similar to the thresholded posterior probabilities
Comparing Priors: Clustering JASPAR database

[Figure: histograms over posterior samples of the number of clusters (roughly 20 to 35) and the average cluster size (roughly 2.5 to 3.5), under the Uniform prior and under the DP prior]
• Very little difference between using DP and uniform prior
• Likelihood is dominating any prior assumption on partition
Summary
• Non-parametric Bayesian approaches based on Dirichlet
process can be very useful for clustering applications
• Issues with MCMC inference: popular MAP partitions
seem inferior to partitions based on posterior probabilities
• Issues with implicit DP assumptions: alternative priors give
quite different prior partitions
• Posterior differences between priors are small in our motif
application, but can be larger in other applications
• Jensen and Liu, JASA (forthcoming) plus other
manuscripts soon available on my website
http://stat.wharton.upenn.edu/~stjensen