Bayesian Clustering with the Dirichlet Process: Issues with priors and interpreting MCMC
Shane T. Jensen
Department of Statistics
The Wharton School, University of Pennsylvania
Collaborative work with J. Liu, L. Dicker, and G. Tuteja
May 13, 2006
Introduction
• Bayesian non-parametric or semi-parametric models are
very useful in many applications
• Non-parametric: random variables are realizations from an
unspecified probability distribution, e.g.,
Xi ∼ F(·), i = 1, . . . , n
• Xi’s can be observed data, latent variables or unknown
parameters (often in a hierarchical setting)
• Prior distributions for F(·) play an important role in
non-parametric modeling
Dirichlet Process Priors
• A commonly-used prior distribution for an unknown
probability distribution is the Dirichlet process
F(·) ∼ DP(α, F0)
• F0 is a probability measure
– can represent prior belief in form of F
• α is a weight parameter
– can represent degree of belief in prior form F0
• Ferguson (1973,1974); Antoniak (1974); many others
• An important consequence of the Dirichlet process is that it
induces a discrete posterior distribution
Consequence of DP priors
• Ferguson (1974): using a Dirichlet process DP(α, F0) prior
for F(·) results in a posterior mixture of F0 and point
masses at the observations Xi:
F(·) | X1, . . . , Xn ∼ DP( α + n , [ α F0 + Σ_{i=1}^{n} δ(Xi) ] / (α + n) )
• For density estimation, discreteness may be a problem:
convolutions with kernel functions can be used to produce
a continuous density estimate (a sketch follows below)
• In other applications, discreteness is not a disadvantage!
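The kernel-convolution idea fits in a few lines of Python. This is an illustrative sketch, not the talk's code: the Gaussian kernel and the bandwidth h are arbitrary choices here.

```python
import math

def smoothed_density(x, point_masses, h=0.5):
    """Continuous density estimate: average of Gaussian kernels
    centered at the posterior point masses, with bandwidth h."""
    kernel = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return sum(kernel((x - xi) / h)
               for xi in point_masses) / (len(point_masses) * h)

# evaluate the smoothed density at 0 for four point masses
print(smoothed_density(0.0, [-1.0, 0.2, 0.4, 2.0]))
```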
Clustering with a DP prior
• Point mass component of posterior leads to a random
partition of our variables
• Consider a new variable Xn+1 and let X*1, . . . , X*C be the
unique values among X1:n = (X1, . . . , Xn). Then,
P (Xn+1 = X*c | X1:n) = Nc / (α + n)   for c = 1, . . . , C
P (Xn+1 = new | X1:n) = α / (α + n)
• Nc = size of cluster c: number of Xi in X1:n that equal X*c
“Rich get richer”: we will return to this (a simulation sketch follows below)...
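The predictive rule above is easy to simulate. A minimal Python sketch (illustrative, not the talk's implementation) that draws a random partition by assigning each new item according to those probabilities:

```python
import random

def sample_dp_partition(n, alpha, seed=0):
    """Sequentially assign n items: join cluster c with probability
    Nc / (alpha + i), start a new cluster with probability alpha / (alpha + i)."""
    rng = random.Random(seed)
    sizes, labels = [], []
    for i in range(n):
        u = rng.uniform(0, i + alpha)   # total weight = sum(sizes) + alpha
        acc, c = 0.0, len(sizes)        # default: new cluster
        for k, Nk in enumerate(sizes):
            acc += Nk
            if u <= acc:
                c = k
                break
        if c == len(sizes):
            sizes.append(1)             # open a new cluster
        else:
            sizes[c] += 1               # "rich get richer"
        labels.append(c)
    return labels

print(sample_dp_partition(10, alpha=1.0))
```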
Motivating Application: TF motifs
• Genes are regulated by transcription factor (TF) proteins
that bind to the DNA sequence near the gene
• TF proteins can selectively control only certain target genes
by binding only to the “same” sequence, called a motif
• The motif sites are highly conserved but not identical, so
we use a matrix description of the motif appearance
Frequency Matrix Xi:
     pos1  pos2  pos3  pos4  pos5  pos6
A    0.05  0.02  0.85  0.02  0.21  0.06
C    0.04  0.02  0.03  0.93  0.05  0.06
G    0.06  0.94  0.06  0.04  0.70  0.11
T    0.85  0.02  0.06  0.01  0.04  0.77
[Sequence logo for this frequency matrix]
Collections of TF motifs
• Large databases contain motif information on many TFs,
but with a large amount of redundancy
– TRANSFAC and JASPAR are the largest (hundreds of motifs in each)
• Want to cluster motifs together to either reduce
redundancy in databases or match new motifs to database
• Nucleotide conservation varies both within a single motif
(between positions) and between different motifs
[Sequence logos of two example motifs: Tal1beta-E47S and AGL3]
Motif Clustering with DP prior
• Hierarchical model with levels for both within-unit and
between-unit variability in discovered motifs
– Observed count matrix Yi is a product-multinomial
realization of the frequency matrix Xi (sketched below)
– Unknown Xi’s share unknown distribution F(·)
• Dirichlet process DP(α, F0) prior for F(·) leads to
posterior mixture of F0 and point masses at each Xi
• Our prior measure F0 in this application is a product
Dirichlet distribution
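The within-unit level of the model can be illustrated with a short NumPy sketch (illustrative; the count of 30 aligned sites is hypothetical): each column of the observed count matrix Yi is an independent multinomial draw from the corresponding column of Xi.

```python
import numpy as np

rng = np.random.default_rng(0)

# frequency matrix Xi from the earlier slide (columns sum to 1)
Xi = np.array([
    [0.05, 0.02, 0.85, 0.02, 0.21, 0.06],   # A
    [0.04, 0.02, 0.03, 0.93, 0.05, 0.06],   # C
    [0.06, 0.94, 0.06, 0.04, 0.70, 0.11],   # G
    [0.85, 0.02, 0.06, 0.01, 0.04, 0.77],   # T
])

n_sites = 30   # hypothetical number of aligned binding sites
# product multinomial: one independent multinomial draw per position
Yi = np.column_stack([rng.multinomial(n_sites, Xi[:, j])
                      for j in range(Xi.shape[1])])
print(Yi)
```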
Benefits and Issues with DP prior
• Allows unknown number of clusters without need to model
number of clusters directly
– No real prior knowledge about number of clusters in our
application
• However, with DP there are implicit assumptions about
number of clusters (and their size distribution)
• “Rich get richer” property influences prior predictive
number of clusters and cluster size distribution
– How influential is this property in an application?
Benefits and Issues with MCMC
• DP-based model is easy to implement via Gibbs sampling
– p(Xi | X−i) has the same choice structure as p(Xn+1 | X1:n)
– Xi is either sampled into one of the current clusters defined
by X−i, or sampled from F0 to form a new cluster
• Alternative is direct model on number of clusters and then
use something like Reversible Jump MCMC
• Mixing can be an issue with Gibbs sampler
– collapsed Gibbs sampler: integrate out the Xi and deal
directly with clustering indicators (a sketch follows this list)
– split/merge moves to speed up mixing: lots of great
work by R. Neal, D. Dahl and others
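One collapsed-Gibbs scan over clustering indicators can be sketched as follows. This is illustrative, not the talk's implementation: the function loglik stands in for the marginal likelihood of an item joining a cluster (it would integrate over F0 in the real model).

```python
import math, random

def gibbs_scan(z, alpha, loglik, seed=0):
    """One scan: reassign each item to an existing cluster (weight Nc
    times its marginal likelihood) or a new one (weight alpha)."""
    rng, n = random.Random(seed), len(z)
    for i in range(n):
        clusters = {}                       # clusters excluding item i
        for j in range(n):
            if j != i:
                clusters.setdefault(z[j], []).append(j)
        options = list(clusters.items()) + [(max(z) + 1, [])]  # + new
        logw = [(math.log(len(m)) if m else math.log(alpha)) + loglik(m, i)
                for _, m in options]
        top = max(logw)
        w = [math.exp(lw - top) for lw in logw]   # stabilized weights
        u, acc = rng.uniform(0, sum(w)), 0.0
        for (label, _), wk in zip(options, w):
            acc += wk
            if u <= acc:
                z[i] = label
                break
        else:
            z[i] = options[-1][0]           # guard against rounding
    return z

# with loglik = 0 the scan reduces to prior-only (CRP) reassignment
print(gibbs_scan([0] * 8, alpha=1.0, loglik=lambda members, i: 0.0))
```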
Main Issue 1: Posterior Inference from MCMC
• However, posterior inference based on the Gibbs sampling
output also has issues
• Need to infer a set of clusters from sampled partitions, but
we have a label switching problem (Stephens, 1999)
• cluster labels are exchangeable for a particular partition
• usual summaries such as the posterior mean can be misleading
mixtures over these exchangeable labelings
• need summaries that are unaffected by the labeling
Posterior Inference Options
• Option 1: clusters defined by last partition visited
– sampled partition produced at end of Gibbs chain
– surprisingly popular, e.g. in Latent Dirichlet Allocation models
• Option 2: clusters defined by MAP partition
– sampled partition with highest posterior density
– simple and popular
• Option 3: clusters defined by a threshold on the pairwise
posterior probabilities Pij (sketched below)
– Pij = frequency of iterations with motifs i & j in the same cluster
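Option 3 is straightforward to compute from the sampled partitions. A minimal NumPy sketch (illustrative):

```python
import numpy as np

def pairwise_probs(sampled_partitions):
    """P_ij = fraction of sampled partitions in which items i and j
    share a cluster label; input shape is (n_iterations, n_items)."""
    S = np.asarray(sampled_partitions)
    return np.mean(S[:, :, None] == S[:, None, :], axis=0)

# toy example: three sampled partitions of four motifs
P = pairwise_probs([[0, 0, 1, 1],
                    [0, 0, 0, 1],
                    [0, 0, 1, 1]])
print(P[0, 1])    # motifs 0 and 1 always co-cluster -> 1.0
print(P > 0.5)    # thresholding gives a co-clustering adjacency
```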
Main Issue 2: Implicit DP Assumptions
• DP has implicit “rich get richer” property: easy to see
from the predictive distribution:
P (Xn+1 joins cluster c) = Nc / (α + n)   for c = 1, . . . , C
P (Xn+1 forms new cluster) = α / (α + n)
• Chinese restaurant process: a new customer chooses a table
– sits at a currently occupied table with probability ∝ Nc, the
number of customers already sitting there
– sits at an entirely new table with probability ∝ α
Alternative Priors for Clustering
• Uniform Prior: socialism, no one gets rich
P (Xn+1 joins cluster c) = 1 / (α + C)   for c = 1, . . . , C
P (Xn+1 forms new cluster) = α / (α + C)
• Pitman-Yor Prior: rich get richer, but charitable
P (Xn+1 joins cluster c) = (Nc − δ) / (α + n)   for c = 1, . . . , C
P (Xn+1 forms new cluster) = (α + C · δ) / (α + n)
• 0 ≤ δ ≤ 1 is often called the “discount factor” (all three
predictive rules are sketched in code below)
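The three predictive rules can be written side by side in a short Python sketch (illustrative; sizes holds the current cluster sizes N1, . . . , NC):

```python
def predictive(sizes, alpha, prior="DP", delta=0.5):
    """Return (join probability for each existing cluster,
    probability of forming a new cluster)."""
    n, C = sum(sizes), len(sizes)
    if prior == "DP":                  # rich get richer
        return [Nc / (alpha + n) for Nc in sizes], alpha / (alpha + n)
    if prior == "Uniform":             # no one gets rich
        return [1 / (alpha + C)] * C, alpha / (alpha + C)
    if prior == "PitmanYor":           # rich get richer, but charitable
        return ([(Nc - delta) / (alpha + n) for Nc in sizes],
                (alpha + C * delta) / (alpha + n))
    raise ValueError(prior)

# three current clusters of sizes 5, 3, 1
print(predictive([5, 3, 1], alpha=1.0, prior="DP"))
print(predictive([5, 3, 1], alpha=1.0, prior="PitmanYor", delta=0.25))
```

In each case the probabilities sum to one over the C + 1 choices.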
Asymptotic Comparison of Priors
• Number of clusters Cn is clearly a function of sample size n
• How does Cn grow as n → ∞?
DP Prior:          E(Cn) ≈ α · log(n)
Pitman-Yor Prior:  E(Cn) ≈ K(α, δ) · n^δ
Uniform Prior:     E(Cn) ≈ K(α) · n^(1/2)
• DP prior shows the slowest growth in number of clusters Cn
• Interestingly, Pitman-Yor can lead to either faster or slower
growth vs. Uniform, depending on whether δ is above or
below 1/2 (see the simulation sketch after this list)
• Also working on results for distribution of cluster sizes
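These growth rates are easy to check by simulation. A self-contained Python sketch (illustrative) that sequentially assigns n items under each predictive rule and counts the resulting clusters:

```python
import random

def grow(n, alpha, prior, delta=0.5, seed=1):
    """Simulate Cn: assign n items one at a time, count clusters."""
    rng, sizes = random.Random(seed), []
    for i in range(n):
        m, C = sum(sizes), len(sizes)
        if prior == "DP":
            join = [Nc / (alpha + m) for Nc in sizes]
            new = alpha / (alpha + m)
        elif prior == "Uniform":
            join, new = [1 / (alpha + C)] * C, alpha / (alpha + C)
        else:                                        # Pitman-Yor
            join = [(Nc - delta) / (alpha + m) for Nc in sizes]
            new = (alpha + C * delta) / (alpha + m)
        u, acc = rng.uniform(0, sum(join) + new), 0.0
        for c, p in enumerate(join):
            acc += p
            if u <= acc:
                sizes[c] += 1
                break
        else:
            sizes.append(1)                          # new cluster
    return len(sizes)

for prior in ("DP", "Uniform", "PitmanYor"):
    print(prior, [grow(n, alpha=1.0, prior=prior)
                  for n in (100, 1000, 10000)])
```

With α = 1, a single run already shows the DP's log-type growth against the faster growth of the alternatives.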
Finite Sample Comparison of Priors
• E(Cn) versus n for different values of the weight parameter α
[Figure: three log-log panels (α = 1, 10, 100) of expected number
of clusters versus n = number of observations, comparing the DP,
Uniform (UN), and Pitman-Yor (PY) priors with discount
δ = 0.25, 0.5, 0.75]
Simulation Study of Motif Clustering
• Evaluation of different priors and modes of inference in
context of motif clustering application
• Simulated realistic collections of motifs (known partitions)
• Different simulation conditions to vary clustering difficulty:
– high to low within-cluster similarity
– high to low between-cluster similarity
• Success measured by the Jaccard similarity between the true
partition z and the inferred partition ẑ:
J(z, ẑ) = TP / (TP + FP + FN)
where TP, FP, FN count pairs of motifs clustered together in
both partitions, only in ẑ, and only in z, respectively
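The pairwise Jaccard index can be computed directly from two label vectors; a minimal Python sketch (illustrative):

```python
from itertools import combinations

def jaccard(z_true, z_hat):
    """Pairwise Jaccard similarity between two partitions,
    given as lists of cluster labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(z_true)), 2):
        same_true = z_true[i] == z_true[j]
        same_hat = z_hat[i] == z_hat[j]
        tp += same_true and same_hat      # together in both
        fp += same_hat and not same_true  # together only in z_hat
        fn += same_true and not same_hat  # together only in z_true
    return tp / (tp + fp + fn)

# toy example: one pair recovered, two spurious, one missed
print(jaccard([0, 0, 1, 1], [0, 0, 0, 1]))   # 1 / (1 + 2 + 1) = 0.25
```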
Simulation Comparison of Inference Alternatives
[Figure: Jaccard index (0.2 to 1.0) versus increasing clustering
difficulty for the MAP partition and for pairwise posterior
probabilities thresholded at 0.5 and 0.25]
• The MAP partition is consistently inferior to the pairwise probabilities
• Posterior probabilities incorporate uncertainty across iterations
Simulation Comparison of Prior Alternatives
[Figure: Jaccard index (0.70 to 0.95) versus increasing clustering
difficulty for the Uniform, Pitman-Yor (δ = 0.25, 0.5, 0.75),
and DP priors]
• Not much difference in general between the priors
• Uniform does a little worse in most situations
Real Data Results: Clustering JASPAR database
• Tree based on pairwise posterior probabilities Pij:
[Dendrogram of the JASPAR motifs, labeled by species and TF
family, with branch heights given by 1 − Prob(clustering)]
• After post-processing the MAP partition to remove weak
relationships, it is very similar to the thresholded posterior probabilities
Comparing Priors: Clustering JASPAR database
[Figure: posterior histograms of the number of clusters (roughly
20 to 35) and of the average cluster size (roughly 2.5 to 3.5)
under the Uniform and DP priors]
• Very little difference between using the DP and uniform prior
• Likelihood is dominating any prior assumption on the partition
Summary
• Non-parametric Bayesian approaches based on Dirichlet
process can be very useful for clustering applications
• Issues with MCMC inference: popular MAP partitions
seem inferior to partitions based on posterior probabilities
• Issues with implicit DP assumptions: alternative priors give
quite different prior partitions
• Posterior differences between priors are small in our motif
application, but can be larger in other applications
• Jensen and Liu, JASA (forthcoming) plus other
manuscripts soon available on my website
http://stat.wharton.upenn.edu/~stjensen