Bayesian Clustering with the Dirichlet Process: Issues with priors and interpreting MCMC
Shane T. Jensen
Department of Statistics
The Wharton School, University of Pennsylvania
stjensen@wharton.upenn.edu
Collaborative work with J. Liu, L. Dicker, and G. Tuteja
Introduction
• Bayesian non-parametric or semi-parametric models are
very useful in many applications
• Non-parametric: random variables are realizations from an
unspecified probability distribution, e.g.,

Xi ∼ F(·),  i = 1, . . . , n
• Xi’s can be observed data, latent variables or unknown
parameters (often in a hierarchical setting)
• Prior distributions for F(·) play an important role in
non-parametric modeling
Dirichlet Process Priors
• A commonly-used prior distribution for an unknown
probability distribution is the Dirichlet process
F(·) ∼ DP(α, F0)
• F0 is a probability measure
– can represent prior belief in form of F
• α is a weight parameter
– can represent degree of belief in prior form F0
• Ferguson (1973,1974); Antoniak (1974); many others
• An important consequence of the Dirichlet process is that it
induces a discrete posterior distribution
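To make the prior concrete, here is a minimal sketch (not from the talk) that draws an approximate realization of F ∼ DP(α, F0) by truncated stick-breaking; the standard-normal F0, the truncation level, and all function names are illustrative assumptions.

```python
# Minimal sketch: approximate draw of F ~ DP(alpha, F0) by truncated
# stick-breaking. F0 (standard normal) and the truncation are assumptions.
import numpy as np

def dp_stick_breaking(alpha, f0_sampler, truncation=1000, rng=None):
    """Return atoms and weights of a truncated draw from DP(alpha, F0)."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=truncation)   # stick-breaking fractions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining                     # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    atoms = f0_sampler(truncation, rng)             # atom locations drawn from F0
    return atoms, weights

atoms, weights = dp_stick_breaking(
    alpha=2.0, f0_sampler=lambda size, rng: rng.normal(size=size))
print(weights.sum())   # close to 1 at this truncation; the draw is discrete
```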
Consequence of DP priors
• Ferguson, 1974: using a Dirichlet process DP(α, F0) prior
for F(·) results in a posterior mixture of F0 and point
masses at the observations Xi:

F(·) | X1, . . . , Xn ∼ DP( α + n , (α F0 + Σ_{i=1}^{n} δ(Xi)) / (α + n) )
• For density estimation, this discreteness may be a problem:
convolutions with kernel functions can be used to produce
a continuous density estimate (sketched below)
• In other applications, discreteness is not a disadvantage!
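A minimal sketch of the kernel-convolution idea, assuming a Gaussian kernel; the bandwidth h and the toy atoms and weights standing in for posterior point masses are illustrative.

```python
# Smooth a discrete distribution (atoms + weights) into a continuous
# density estimate by convolving with a Gaussian kernel of bandwidth h.
import numpy as np
from scipy.stats import norm

def smoothed_density(x, atoms, weights, h=0.2):
    """density(x) = sum_k w_k * Normal(x; atom_k, h^2)."""
    return (weights * norm.pdf((x[:, None] - atoms[None, :]) / h) / h).sum(axis=1)

atoms = np.array([-1.3, 0.2, 0.9])       # toy posterior point masses
weights = np.array([0.5, 0.3, 0.2])
grid = np.linspace(-4, 4, 201)
density = smoothed_density(grid, atoms, weights)   # integrates to ~1
```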
Clustering with a DP prior
• Point mass component of posterior leads to a random
partition of our variables
• Consider a new variable Xn+1 and let X*1, . . . , X*C be the
unique values of X1:n = (X1, . . . , Xn). Then,

P(Xn+1 = X*c | X1:n) = Nc / (α + n),  c = 1, . . . , C
P(Xn+1 = new value | X1:n) = α / (α + n)

• Nc = size of cluster c: the number of values in X1:n equal to X*c
“Rich get richer”: will return to this...
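The predictive rule above can be simulated directly; a minimal sketch of the resulting "rich get richer" behavior, with illustrative values of α and n:

```python
# Sequentially assign n items to clusters via the DP predictive rule
# (Chinese restaurant process); tracks cluster sizes only.
import numpy as np

def crp_partition(n, alpha, seed=None):
    rng = np.random.default_rng(seed)
    sizes = []                                  # N_c for each current cluster
    for i in range(n):                          # item i sees i earlier items
        probs = np.array(sizes + [alpha]) / (alpha + i)
        c = rng.choice(len(probs), p=probs)
        if c == len(sizes):
            sizes.append(1)                     # new cluster: prob alpha/(alpha+i)
        else:
            sizes[c] += 1                       # join cluster c: prob N_c/(alpha+i)
    return sizes

print(crp_partition(n=100, alpha=1.0, seed=0))  # few large, many small clusters
```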
Motivating Application: TF motifs
• Genes are regulated by transcription factor (TF) proteins
that bind to the DNA sequence near the gene
• TF proteins can selectively control certain target genes by
binding only to the “same” sequence, called a motif
• The motif sites are highly conserved but not identical, so
we use a matrix description of the motif appearance
Frequency Matrix Xi (columns = motif positions)

      1     2     3     4     5     6
A  0.05  0.02  0.85  0.02  0.21  0.06
C  0.04  0.02  0.03  0.93  0.05  0.06
G  0.06  0.94  0.06  0.04  0.70  0.11
T  0.85  0.02  0.06  0.01  0.04  0.77
[Figure: sequence logo for this motif]
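One way to connect the frequency matrix to the sequence logo: letter heights in a logo are conventionally the column frequencies scaled by the column's information content, 2 − H(column) bits. This convention is an assumption on my part; the talk only displays the matrix and the logo.

```python
# Per-position information content of the frequency matrix shown above;
# conserved positions (e.g. the G column) approach the 2-bit maximum.
import numpy as np

X = np.array([  # rows A, C, G, T; columns = motif positions (from the slide)
    [0.05, 0.02, 0.85, 0.02, 0.21, 0.06],
    [0.04, 0.02, 0.03, 0.93, 0.05, 0.06],
    [0.06, 0.94, 0.06, 0.04, 0.70, 0.11],
    [0.85, 0.02, 0.06, 0.01, 0.04, 0.77],
])

entropy = -(X * np.log2(X)).sum(axis=0)   # H(column) in bits
info = 2.0 - entropy                      # information content per position
print(np.round(info, 2))
```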
Collections of TF motifs
• Large databases contain motif information on many TFs
but with large amount of redundancy
– TRANSFAC and JASPAR are the largest (hundreds of motifs in each)
• Want to cluster motifs together to either reduce
redundancy in databases or match new motifs to database
• Nucleotide conservation varies both within a single motif
(between positions) and between different motifs
[Figure: example motif logos, e.g. Tal1beta-E47S and AGL3]
Motif Clustering with DP prior
• Hierarchical model with levels for both within-unit and
between-unit variability in discovered motifs
– Observed count matrix Yi is a product multinomial
realization of frequency matrix Xi
– Unknown Xi’s share unknown distribution F(·)
• Dirichlet process DP(α, F0) prior for F(·) leads to
posterior mixture of F0 and point masses at each Xi
• Our prior measure F0 in this application is a product
Dirichlet distribution
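A hedged sketch of the resulting marginal likelihood, assuming each column of the 4 × w count matrix Yi is multinomial given the corresponding column of the frequency matrix Xi, with a Dirichlet prior on each column under F0; the pseudocounts and toy counts are illustrative.

```python
# Product Dirichlet-multinomial: with a Dirichlet(beta) prior per column,
# the frequency-matrix columns integrate out in closed form, and the
# marginal likelihood of a count matrix factors over columns.
import numpy as np
from scipy.special import gammaln

def log_marginal_column(y, beta):
    """log of integral Multinomial(y | x) Dirichlet(x | beta) dx
    (multinomial coefficient omitted; it is constant in x)."""
    n = y.sum()
    return (gammaln(beta.sum()) - gammaln(beta.sum() + n)
            + np.sum(gammaln(beta + y) - gammaln(beta)))

def log_marginal_matrix(Y, beta):
    return sum(log_marginal_column(Y[:, j], beta) for j in range(Y.shape[1]))

Y = np.array([[1, 0, 17], [1, 0, 1], [1, 19, 1], [17, 1, 1]])  # toy 4 x 3 counts
print(log_marginal_matrix(Y, beta=np.full(4, 0.5)))            # toy pseudocounts
```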
Benefits and Issues with DP prior
• Allows unknown number of clusters without need to model
number of clusters directly
– No real prior knowledge about number of clusters in our
application
• However, with DP there are implicit assumptions about
number of clusters (and their size distribution)
• “Rich get richer” property influences prior predictive
number of clusters and cluster size distribution
– How influential is this property in an application?
Benefits and Issues with MCMC
• DP-based model is easy to implement via Gibbs sampling
– p(Xi | X−i) has the same choice structure as p(Xn+1 | X1:n)
– Xi is either sampled into one of the current clusters defined by
X−i or sampled from F0 to form a new cluster
• Alternative is direct model on number of clusters and then
use something like Reversible Jump MCMC
• Mixing can be an issue with Gibbs sampler
– collapsed Gibbs sampler: integrate out Xi and deal
directly with clustering indicators (a minimal sketch follows this list)
– split/merge moves to speed up mixing: lots of great
work by R. Neal, D. Dahl and others
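A minimal sketch of one collapsed-Gibbs sweep over the clustering indicators. Here cluster_loglik is a hypothetical helper returning the log marginal likelihood of a set of items with their Xi integrated out against F0 (in the motif model, the product Dirichlet-multinomial marginal above).

```python
# One collapsed Gibbs pass: each item is removed from its cluster and
# reassigned with probability proportional to (CRP weight) x (marginal
# likelihood ratio), mirroring p(X_i | X_{-i}).
import numpy as np

def gibbs_sweep(z, data, alpha, cluster_loglik, rng):
    n = len(z)
    for i in range(n):
        z[i] = -1                                   # remove item i
        labels = sorted(set(z) - {-1})
        logp = []
        for c in labels:                            # weight for joining cluster c
            members = [j for j in range(n) if z[j] == c]
            logp.append(np.log(len(members))
                        + cluster_loglik(members + [i], data)
                        - cluster_loglik(members, data))
        logp.append(np.log(alpha) + cluster_loglik([i], data))  # new cluster
        logp = np.array(logp)
        p = np.exp(logp - logp.max())
        p /= p.sum()
        k = rng.choice(len(p), p=p)
        z[i] = labels[k] if k < len(labels) else max(labels, default=-1) + 1
    return z
```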
Main Issue 1: Posterior Inference from MCMC
• However, posterior inference based on Gibbs sampling
output also has issues
• Need to infer a set of clusters from sampled partitions, but
we have a label switching problem (Stephens, 1999)
• Cluster labels are exchangeable for a particular partition
• Usual summaries such as the posterior mean are misleading:
they mix over these exchangeable labelings
• We need summaries that are uninfluenced by labeling
Posterior Inference Options
• Option 1: clusters defined by last partition visited
– sampled partition produced at end of Gibbs chain
– surprisingly popular, e.g. in Latent Dirichlet Allocation models
• Option 2: clusters defined by MAP partition
– sampled partition with highest posterior density
– simple and popular
• Option 3: clusters defined by threshold on pairwise
posterior probabilities Pij
– frequency of iterations with motifs i & j in the same cluster
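Option 3 is easy to state in code; a minimal sketch with toy sampled partitions:

```python
# Estimate P_ij = Pr(motifs i and j share a cluster) across Gibbs samples,
# then link pairs whose probability exceeds a threshold.
import numpy as np

def pairwise_probs(partitions):
    """partitions: (S, n) array of cluster labels from S Gibbs iterations."""
    partitions = np.asarray(partitions)
    S, n = partitions.shape
    P = np.zeros((n, n))
    for z in partitions:
        P += (z[:, None] == z[None, :])     # co-clustering indicators this sweep
    return P / S

samples = np.array([[0, 0, 1, 1], [0, 0, 0, 1], [2, 2, 1, 1]])  # toy output
P = pairwise_probs(samples)
linked = P > 0.5                            # threshold on pairwise probabilities
```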
Main Issue 2: Implicit DP Assumptions
• DP has an implicit “rich get richer” property: easy to see
from the predictive distribution:

P(Xn+1 joins cluster c) = Nc / (α + n),  c = 1, . . . , C
P(Xn+1 forms new cluster) = α / (α + n)
• Chinese restaurant process: new customer chooses table
– sits at a current table c with probability ∝ Nc, the number
of customers already sitting there
– sits at an entirely new table with probability ∝ α
Alternative Priors for Clustering
• Uniform Prior: socialism, no one gets rich

P(Xn+1 joins cluster c) = 1 / (α + C),  c = 1, . . . , C
P(Xn+1 forms new cluster) = α / (α + C)

• Pitman-Yor Prior: rich get richer, but charitable

P(Xn+1 joins cluster c) = (Nc − d) / (α + n),  c = 1, . . . , C
P(Xn+1 forms new cluster) = (α + C · d) / (α + n)

• 0 ≤ d ≤ 1 is often called the “discount factor”
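All three predictive rules side by side, as a sketch with illustrative α, discount d, and cluster sizes:

```python
# Predictive probabilities of joining each existing cluster (sizes N_c)
# or opening a new one, under the DP, Uniform, and Pitman-Yor rules.
import numpy as np

def predictive(sizes, alpha, prior="DP", d=0.5):
    sizes = np.asarray(sizes, dtype=float)
    n, C = sizes.sum(), len(sizes)
    if prior == "DP":                        # rich get richer
        join, new = sizes / (alpha + n), alpha / (alpha + n)
    elif prior == "uniform":                 # every current cluster equally likely
        join, new = np.full(C, 1.0 / (alpha + C)), alpha / (alpha + C)
    else:                                    # Pitman-Yor: richer, but charitable
        join, new = (sizes - d) / (alpha + n), (alpha + C * d) / (alpha + n)
    return join, new

for prior in ("DP", "uniform", "PY"):
    print(prior, predictive([50, 3, 1], alpha=1.0, prior=prior))
```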
Asymptotic Comparison of Priors
• Number of clusters Cn is clearly a function of sample size n
• How does Cn grow as n → ∞?

DP Prior: E(Cn) ≈ α · log(n)
Pitman-Yor Prior: E(Cn) ≈ K(α, d) · n^d
Uniform Prior: E(Cn) ≈ K(α) · n^(1/2)
• DP prior shows slowest growth in number of clusters Cn
• Interestingly, Pitman-Yor can lead to either faster or slower
growth than Uniform, depending on d
• Also working on results for distribution of cluster sizes
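A quick numerical check of the DP rate, using the identity E(Cn) = Σ_{i=1}^{n} α/(α + i − 1) implied by the predictive rule; the comparison against α · log(n) is illustrative.

```python
# Exact prior mean number of clusters under the DP, versus alpha * log(n).
import numpy as np

def expected_clusters_dp(n, alpha):
    i = np.arange(1, n + 1)
    return np.sum(alpha / (alpha + i - 1.0))

for n in (100, 10_000, 1_000_000):
    print(n, round(expected_clusters_dp(n, alpha=1.0), 2), round(np.log(n), 2))
```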
Finite Sample Comparison of Priors
• Expected number of clusters Cn vs. n = number of observations, for different values of the weight parameter θ

[Figure: three panels (θ = 1, θ = 10, θ = 100) on log-log axes, each showing the expected number of clusters against n for the DP, Uniform (UN), and Pitman-Yor priors with discount 0.25, 0.5, 0.75 (labeled α in the legend)]
Simulation Study of Motif Clustering
• Evaluation of different priors and modes of inference in the
context of the motif clustering application
• Simulated realistic collections of motifs (known partitions)
• Different simulation conditions to vary clustering difficulty:
– high to low within-cluster similarity
– high to low between-cluster similarity
• Success measured by the Jaccard similarity between the true
partition z and the inferred partition ẑ, computed over pairs
of motifs (sketched below):

J(z, ẑ) = TP / (TP + FP + FN)
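A minimal sketch of this pairwise Jaccard measure: over all item pairs, TP counts pairs together in both partitions, FP pairs together only in the estimate, and FN pairs together only in the truth.

```python
# Pairwise Jaccard similarity between a true and an inferred partition,
# given as label vectors; the labels themselves are arbitrary.
from itertools import combinations

def jaccard(z_true, z_hat):
    tp = fp = fn = 0
    for i, j in combinations(range(len(z_true)), 2):
        same_true = z_true[i] == z_true[j]
        same_hat = z_hat[i] == z_hat[j]
        tp += same_true and same_hat
        fp += same_hat and not same_true
        fn += same_true and not same_hat
    return tp / (tp + fp + fn)

print(jaccard([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # 1 TP, 1 FP, 1 FN -> 1/3
```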
Simulation Comparison of Inference Alternatives
[Figure: Jaccard index vs. increasing clustering difficulty for three inference options: MAP, Prob > 0.5, and Prob > 0.25]
• MAP partition consistently inferior to pairwise probabilities
• Posterior probabilities incorporate uncertainty across iterations
Simulation Comparison of Prior Alternatives
[Figure: Jaccard index vs. increasing clustering difficulty for the Uniform, PY 0.25, PY 0.5, PY 0.75, and DP priors]
• Not much difference in general between priors
• Uniform does a little worse in most situations
Real Data Results: Clustering JASPAR database
• Tree based on pairwise posterior probabilities:

[Figure: dendrogram of the JASPAR motifs with branch heights given by 1 − Prob(clustering); leaves labeled by species, TF family, and JASPAR accession, e.g. Homo.sapiens−NUCLEAR−MA0065]
• Post-processing the MAP partition to remove weak
relationships makes it very similar to the thresholded posterior probabilities
Comparing Priors: Clustering JASPAR database

[Figure: histograms over posterior samples of the number of clusters (roughly 20 to 35) and the average cluster size (roughly 2.5 to 3.5), under the Uniform prior and under the DP prior]
• Very little difference between using DP and uniform prior
• Likelihood is dominating any prior assumption on partition
Summary
• Non-parametric Bayesian approaches based on Dirichlet
process can be very useful for clustering applications
• Issues with MCMC inference: popular MAP partitions
seem inferior to partitions based on posterior probabilities
• Issues with implicit DP assumptions: alternative priors give
quite different prior partitions
• Posterior differences between priors are small in our motif
application, but can be larger in other applications
• Jensen and Liu, JASA (forthcoming) plus other
manuscripts soon available on my website
http://stat.wharton.upenn.edu/~stjensen