Page 1: Statistical Artificial Intelligence Laboratory (SAIL)*, UNIST · sail.unist.ac.kr/talks/3_Bayesian_Nonparametric_ICCAS... · 2016-10-16

Nonparametric Bayesian for Sequential Data¹

Jaesik Choi

Statistical Artificial Intelligence Laboratory (SAIL)*

UNIST

*http://sail.unist.ac.kr

¹ Some slides are courtesy of [Zoubin, UAI 2005], [Miller, 2009], and [Orbanz, NIPS 2012]

Jaesik Choi(UNIST) Part 3. Nonparametric Bayesian for Sequential Data 1 / 38

Page 2:

Overview

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior · Dirichlet Processes · Chinese Restaurant Processes · Beta Processes · Indian Buffet Processes · Nonparametric Hierarchical Models

Page 3:

Terminology

Parametric model

Number of parameters is fixed (or bounded by a constant) w.r.t. sample size

Nonparametric model

Number of parameters grows with sample size; the parameter space is ∞-dimensional

Example: Density estimation

Page 4:

Nonparametric Bayesian Model

Definition: A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.

Interpretation: Parameter space T = set of possible model parameters (or patterns), for example:

Problem              T
Density estimation   Probability distributions
Regression           Smooth functions
Clustering           Partitions

Solution to a Bayesian problem = posterior distribution on the model parameters

Page 5:

Exchangeability

Can we justify our assumptions?
Assumption: data = model + noise
In Bayes' theorem,

p(model | data) = p(data | model) p(model) / p(data)

Definition: X1, · · · , Xn are exchangeable if P(X1, · · · , Xn) is invariant under any permutation π:

p(X1 = x1, · · · , Xn = xn) = p(X1 = xπ(1), · · · , Xn = xπ(n))

e.g. p(1, 1, 0, 0, 0, 1, 1, 0) = p(1, 0, 1, 0, 1, 0, 0, 1)

Order of observations does not matter.

Page 6:

Exchangeability

De Finetti's Theorem (binary case)
Binary sequence case: all exchangeable binary sequences are mixtures of Bernoulli sequences [de Finetti, 1931]:

p(x1, · · · , xn) = ∫₀¹ θ^{t_n} (1 − θ)^{n − t_n} dF(θ),

where p(x1, · · · , xn) = p(X1 = x1, · · · , Xn = xn) and t_n = Σ_{i=1}^n x_i.

Implications

Exchangeable data decomposes into mixtures of models with i.i.d. sequences.
Caution: θ is in general an ∞-dimensional quantity.

Page 7:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior · Dirichlet Processes · Chinese Restaurant Processes · Beta Processes · Indian Buffet Processes · Nonparametric Hierarchical Models

Page 8:

Clustering

Observations X1, X2, · · ·
Each observation belongs to exactly one cluster.

Unknown pattern = partition of {1, · · · , n}

Page 9:

Mixture Models

Mixture models

p(data | model) = ∫_{Ω_θ} p(data | θ) m(dθ)

m is called the mixing measure.

Two-stage sampling
Sample data X ∼ p(· | m) as:
1. Θ ∼ m
2. X ∼ p(· | θ)

Finite mixture model

m(·) = Σ_{k=1}^K c_k δ_{θ_k}(·)

Page 10:

Bayesian Mixture Models (BMMs)

Random mixing measure

m(·) = Σ_{k=1}^K c_k δ_{θ_k}(·)

Conjugate priors
A Bayesian model is conjugate if the posterior is an element of the same class of distributions as the prior ('closure under sampling').

p(data | model), likelihood    p(model), conjugate prior
Multinomial                    Dirichlet
Gaussian                       Gaussian

Choice of priors in a BMM: conjugate priors

Choose a conjugate prior for each parameter.
E.g.: Dirichlet prior

Page 11:

Dirichlet Process Mixtures

Dirichlet process (DP) [Ferguson, 1973] [Sethuraman, 1994]
A DP is a distribution on random probability measures of the form

m(·) = Σ_{k=1}^∞ c_k δ_{θ_k}(·), where Σ_{k=1}^∞ c_k = 1.

Constructive definition of DP(α, G0):

θ_k ∼iid G0
V_k ∼iid Beta(1, α)

Compute c_k as

c_k = V_k ∏_{i=1}^{k−1} (1 − V_i)

This procedure is called the 'stick-breaking construction'.
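The stick-breaking recipe above can be written in a few lines of NumPy. A minimal sketch; the function name `stick_breaking` and the truncation level `k` are illustrative choices, not part of the slides:

```python
import numpy as np

def stick_breaking(alpha, k, rng=None):
    """Draw the first k stick-breaking weights of DP(alpha, G0):
    V_k ~ Beta(1, alpha),  c_k = V_k * prod_{i<k} (1 - V_i)."""
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=k)               # stick proportions V_k
    # length of the stick remaining before each break: 1, (1-V_1), (1-V_1)(1-V_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining                           # weights c_1, ..., c_k

weights = stick_breaking(alpha=2.0, k=1000, rng=0)
# The weights are nonnegative and the infinite sum is 1; a long
# truncation already accounts for almost all of the mass.
```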

Page 12:

Posterior Distribution

DP Posterior

θ_{n+1} | θ_1, · · · , θ_n ∼ (1/(n + α)) Σ_{j=1}^n δ_{θ_j}(θ_{n+1}) + (α/(n + α)) G0(θ_{n+1})

Mixture Posterior

p(x_{n+1} | x_1, · · · , x_n) = Σ_{k=1}^{K_n} (n_k/(n + α)) p(x_{n+1} | θ*_k) + (α/(n + α)) ∫ p(x_{n+1} | θ) G0(θ) dθ

Conjugacy

The posterior of DP(α, G0) is DP(α + n, (1/(n + α)) (Σ_k n_k δ_{θ*_k} + α G0)).

The Dirichlet process is conjugate.

Page 13:

Inference

Latent variables

p(x_{n+1} | x_1, · · · , x_n) = Σ_{k=1}^{K_n} (n_k/(n + α)) p(x_{n+1} | θ*_k) + (α/(n + α)) ∫ p(x_{n+1} | θ) G0(θ) dθ

We observe the x_i and do not actually observe the θ_k (latent).

Assignment probabilities

q_{jk} ∝ n_k p(x_j | θ*_k)
q_{j0} ∝ α ∫ p(x_j | θ) G0(θ) dθ

Gibbs Sampling
Uses an assignment variable φ_j for each observation x_j.

Assignment step: sample φ_j ∼ Multinomial(q_{j0}, · · · , q_{jK_n})
Parameter sampling:

θ*_k ∼ G0(θ*_k) ∏_{x_j ∈ cluster k} p(x_j | θ*_k)
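As a rough illustration of the assignment probabilities, the sketch below assumes a 1-D Gaussian likelihood N(x | θ, obs_var) with base measure G0 = N(0, base_var), so the new-cluster integral has the closed form N(x_j | 0, obs_var + base_var). The concrete likelihood, the base measure, and all names are assumptions for illustration, not part of the slides:

```python
import numpy as np
from math import sqrt, pi, exp

def normal_pdf(x, mean, var):
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def assignment_probs(x_j, cluster_params, cluster_sizes, alpha,
                     obs_var=1.0, base_var=1.0):
    """Gibbs assignment distribution for one observation x_j:
    q_j0 ∝ α ∫ p(x_j | θ) G0(dθ)   (new cluster, here in closed form)
    q_jk ∝ n_k p(x_j | θ*_k)        (existing clusters)."""
    q = [alpha * normal_pdf(x_j, 0.0, obs_var + base_var)]   # new cluster
    for n_k, theta_k in zip(cluster_sizes, cluster_params):
        q.append(n_k * normal_pdf(x_j, theta_k, obs_var))
    q = np.asarray(q)
    return q / q.sum()                                        # normalize

# Two existing clusters at θ* = 0 (size 3) and θ* = 5 (size 2):
probs = assignment_probs(x_j=0.2, cluster_params=[0.0, 5.0],
                         cluster_sizes=[3, 2], alpha=1.0)
```

An observation near 0 is assigned to the nearby, larger cluster with overwhelming probability.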

Page 14:

Number of Clusters

Dirichlet process

Kn = # of clusters in sample of size n

E[Kn] = O(log(n))

Modeling assumption

Parametric clustering: K_∞ is finite (possibly unknown, but fixed).
Nonparametric clustering: K_∞ is infinite.

Rephrasing the question

The estimate of K_n is controlled by the distribution of the cluster sizes c_k in Σ_k c_k δ_{θ_k}.

What should we assume about the distribution of the c_k?

Page 15:

Generalizing the DP

Pitman-Yor process

p(x_{n+1} | x_1, · · · , x_n) = Σ_{k=1}^{K_n} ((n_k − d)/(n + α)) p(x_{n+1} | θ*_k) + ((α + K_n · d)/(n + α)) ∫ p(x_{n+1} | θ) G0(θ) dθ

Discount parameter d ∈ [0, 1].

Cluster sizes [figure]

Page 16:

Power Laws

The distribution of cluster sizes is called a power law if

c_j ∼ γ(β) · j^{−β}

for some β ∈ [0, 1].

Examples of power laws

Word frequencies
Popularity (# of friends) in social networks
Pitman-Yor language model

Page 17:

Random Partitions

Discrete measures and partitions
Sampling from a discrete measure determines a partition of N into blocks b_k:

θ_n ∼iid Σ_{k=1}^∞ c_k δ_{θ*_k}  and set  n ∈ b_k ⇔ θ_n = θ*_k

As n → ∞, the block proportions converge: |b_k|/n → c_k.

Induced random partition
The distribution of a random discrete measure m = Σ_{k=1}^∞ c_k δ_{θ_k} induces the distribution of a random partition Π = (B1, B2, · · · ).

Exchangeable random partitions
Π is called exchangeable if its distribution depends only on the sizes of its blocks.
All exchangeable random partitions, and only those, can be represented by a random discrete distribution (Kingman's theorem).

Page 18:

Chinese Restaurant Process (CRP)

Chinese Restaurant Process
The distribution of the random partition induced by the Dirichlet process.

'Customers and tables' analogy

Customers = observations (indices in N)
Tables = clusters (blocks)

Historical remark
Originally introduced by Dubins & Pitman as a distribution on infinite permutations.
A permutation of n items defines a partition of {1, · · · , n} (regard the cycles of the permutation as the blocks of the partition).
The induced distribution on partitions is the CRP we use in clustering.
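The 'customers and tables' dynamics can be simulated directly. A minimal sketch; the sampler name and parameters are illustrative:

```python
import numpy as np

def crp_partition(n, alpha, rng=None):
    """Seat n customers by the Chinese restaurant process:
    customer i joins occupied table k with prob n_k / (i + alpha),
    or opens a new table with prob alpha / (i + alpha).
    Returns the list of table (cluster) sizes."""
    rng = np.random.default_rng(rng)
    tables = []
    for i in range(n):
        probs = np.array(tables + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)          # open a new table
        else:
            tables[k] += 1
    return tables

sizes = crp_partition(n=500, alpha=2.0, rng=0)
# E[K_n] grows like O(log n), so 500 customers occupy
# only a handful of tables.
```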

Page 19:

Summary: Clustering

Nonparametric Bayesian clustering

Infinite # of clusters, K_n ≤ n of which are observed.
If the partition is exchangeable, it can be represented by a random discrete distribution.

Inference: latent-variable algorithms, since the assignments (i.e., the partition) are not observed.

Gibbs sampling
Variational algorithms

Prior assumption

The distribution of cluster sizes implies a prior assumption on the number K_n of clusters.

Page 20:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior · Dirichlet Processes · Chinese Restaurant Processes · Beta Processes · Indian Buffet Processes · Nonparametric Hierarchical Models

Page 21:

Gaussian Distributions

Gaussian

p(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

Multivariate Gaussian

p(x) = (1/√((2π)^n |Σ|)) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))

Page 22:

Gaussian Processes: Definition

GPs are distributions over functions such that any finite set of function evaluations [f(x1), · · · , f(xn)] has a jointly Gaussian distribution.

A GP is completely specified by its mean function µ(x) = E[f(x)] and its covariance kernel function k(x, x′) = Cov(f(x), f(x′)).

Given data x = [x1, · · · , xn] and y = [f(x1), · · · , f(xn)], the GP model is specified by µ(x) and k(x, x′).

Notation for 'f follows the GP':

f ∼ GP(µ(x), k(x, x′))

The likelihood is:

p(y | X) = (1/√((2π)^n |Σ|)) exp(−(1/2)(y − µ)ᵀ Σ⁻¹ (y − µ))

where µ = [µ(x1), · · · , µ(xn)] and Σ_ij = k(xi, xj).
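A minimal sketch of drawing from a GP prior and evaluating the Gaussian log-likelihood above, assuming a squared-exponential kernel with unit hyperparameters (the small jitter term is a standard numerical stabilizer, not part of the slides):

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared-exponential covariance k(x, x') = σ² exp(−(x−x')²/(2ℓ²))."""
    d = x1[:, None] - x2[None, :]
    return sigma ** 2 * np.exp(-d ** 2 / (2 * ell ** 2))

def gp_log_likelihood(y, mu, K, jitter=1e-8):
    """log N(y | mu, K) computed via a Cholesky factorization."""
    n = len(y)
    L = np.linalg.cholesky(K + jitter * np.eye(n))
    z = np.linalg.solve(L, y - mu)
    return (-0.5 * z @ z - np.log(np.diag(L)).sum()
            - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
mu = np.zeros(50)
K = se_kernel(x, x)
f = rng.multivariate_normal(mu, K + 1e-8 * np.eye(50))  # one GP prior draw
ll = gp_log_likelihood(f, mu, K)
```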

Page 23:

Gaussian Processes: Samples with different kernels

Page 24:

Gaussian Processes: Predictions with different kernels

A sample from the prior for each covariance function

Corresponding predictions, mean with two standard deviations:

Page 25:

Gaussian Processes

Nonparametric regression
Parameter space = continuous functions, say on [a, b]:

f : [a, b] → R,  T = C[a, b]

Gaussian Process

f ∼ GP ↔ (f(x1), · · · , f(xd)) is d-dimensional Gaussian for any finite set X ⊂ [a, b].

Construction: Intuition

The marginal of the GP for any finite X ⊂ [a, b] is a Gaussian.
All these Gaussians are marginals of each other.

Page 26:

Examples

Application                 Parameter space          Bayesian nonparametric model
Regression                  Function                 Gaussian process
Clustering                  Partition                Chinese restaurant process
Density estimation          Density                  Dirichlet process mixture
Hierarchical clustering     Hierarchical partition   Dirichlet/Pitman-Yor diffusion tree, nested CRP
Latent variable modeling    Features                 Beta process / Indian buffet process
Dictionary learning         Dictionary               Beta process / Indian buffet process
Survival analysis           Hazard                   Beta process, neutral-to-the-right process
Power-law behaviour         -                        Pitman-Yor process, stable-beta process
Deep learning               Features                 Cascading/nested Indian buffet process
Dimensionality reduction    Manifold                 Gaussian process latent variable model
Topic models                Atomic distribution      Hierarchical Dirichlet process
Relational modeling         Relations                Infinite relational model, Mondrian process

Page 27:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior · Dirichlet Processes · Chinese Restaurant Processes · Beta Processes · Indian Buffet Processes · Nonparametric Hierarchical Models

Page 28:

Recent advances in Nonparametric Regression

Page 29:

Problem

Descriptive prediction of multiple time series


Page 31:

Problem

[Figure: a time series decomposed into components:
a linear function (decrease of x/week),
a smooth function (length scale: y weeks),
and a rapidly varying smooth function (length scale: z hours).]

Descriptive prediction of multiple time series

Page 32:

Problem (this paper)

Descriptive prediction of multiple time series

[Figure: a time series decomposed into components:
a constant function,
a sudden drop between 9/12/01 and 9/15/01,
a smooth function (length scale: y weeks),
and a rapidly varying smooth function (length scale: z hours).]

Page 33:

Models

Automatic Bayesian Covariance Discovery* [Lloyd et al., 2014] [Ghahramani, 2015]

* http://www.automaticstatistician.com/

Page 34:

Gaussian Processes

f(x) ∼ GP(µ(x), k(x, x′))

Function ∼ Gaussian Process

Mean function: µ(x) = E[f(x)]
Covariance kernel function: k(x, x′) = Cov(f(x), f(x′))

Page 35:

Gaussian Processes

f(x) ∼ GP(µ(x), k(x, x′))

Function ∼ Gaussian Process

Mean function: µ(x) = E[f(x)]
Covariance kernel function: k(x, x′) = Cov(f(x), f(x′))

[f(x1), … , f(xN)] ∼ N(µ, Σ)

Function evaluations ∼ Multivariate Gaussian

Mean vector: µ = [µ(x1), … , µ(xN)]
Covariance matrix: Σ_ij = k(xi, xj)

Page 36:

* Automatic Bayesian Covariance Discovery (http://www.automaticstatistician.com/)

[Lloyd; Duvenaud; Grosse; Tenenbaum; Ghahramani. 2014.] [Ghahramani. Nature. 2015.]

The Automatic Statistician*

f(x) ∼ GP(µ(x), k(x, x′))

Find an appropriate kernel:
(1) Encode characteristics

Page 37:

* Automatic Bayesian Covariance Discovery (http://www.automaticstatistician.com/)

[Lloyd; Duvenaud; Grosse; Tenenbaum; Ghahramani. 2014.] [Ghahramani. Nature. 2015.]

The Automatic Statistician*

If g(x) ∼ GP(0, k_g), h(x) ∼ GP(0, k_h) and g(x) ⊥ h(x), then

g(x) + h(x) ∼ GP(0, k_g + k_h)
g(x) × h(x) ∼ GP(0, k_g × k_h)

f(x) ∼ GP(µ(x), k(x, x′))

Find an appropriate kernel:
(1) Encode characteristics
(2) Compose new kernels

Page 38:

The Automatic Statistician: Base kernels

Base kernel   Encoding function    Kernel function               Parameters
LIN(x, x′)    Linear function      σ²(x − ℓ)(x′ − ℓ)             σ, ℓ
SE(x, x′)     Smooth function      σ² exp(−(x − x′)²/(2ℓ²))      σ, ℓ
PER(x, x′)    Periodic function    (in appendix)                 σ, ℓ, p

[Figure columns: example kernel function shapes and example encoded functions]
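Under the kernel formulas in the table, the base kernels and the '+'/'×' composition operators of the next slides can be sketched directly. The periodic kernel below uses the standard exp-sine-squared form as an assumption, since the slide defers the exact formula to the appendix:

```python
import numpy as np

def lin(x1, x2, sigma=1.0, ell=0.0):
    """LIN(x, x') = σ² (x − ℓ)(x' − ℓ)."""
    return sigma ** 2 * np.outer(x1 - ell, x2 - ell)

def se(x1, x2, sigma=1.0, ell=1.0):
    """SE(x, x') = σ² exp(−(x − x')² / (2ℓ²))."""
    d = x1[:, None] - x2[None, :]
    return sigma ** 2 * np.exp(-d ** 2 / (2 * ell ** 2))

def per(x1, x2, sigma=1.0, ell=1.0, p=1.0):
    """Assumed standard periodic (exp-sine-squared) kernel."""
    d = np.abs(x1[:, None] - x2[None, :])
    return sigma ** 2 * np.exp(-2 * np.sin(np.pi * d / p) ** 2 / ell ** 2)

x = np.linspace(0.0, 2.0, 20)
K_add = se(x, x) + per(x, x)   # '+' superposes characteristics (OR)
K_mul = se(x, x) * per(x, x)   # '×' combines characteristics (AND)
```

Sums and elementwise products of valid covariance matrices are again valid (symmetric, positive semi-definite), which is what licenses the compositional search.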

Page 39:

The Automatic Statistician: Operators

Op.   Concept                           Params   Example
+     Addition / superposition (OR)     N/A      SE + PER, LIN + PER
×     Multiplication (AND)              N/A      SE × PER

[Figure columns: example kernel function shapes and example encoded functions]

Page 40:

The Automatic Statistician: Operators

Op.   Concept                   Params     Example
CP    Divide: left vs. right    ℓ, s       CP(LIN, LIN)
CW    Divide: in vs. out        ℓ, s, w    CW(LIN, C)

[Figure column: example encoded functions]

Page 41:

The Automatic statistician: Learning

(1) Optimization criterion: Bayesian Information Criterion (BIC)

BIC(ℳ) = −2 log P(D | ℳ) + |ℳ| log |D|

−2 log P(D | ℳ): negative log-likelihood (data fit)
|ℳ| log |D|: model complexity penalty, where |ℳ| = number of model parameters and |D| = number of data points
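The criterion itself is one line of code. A minimal sketch; the numeric likelihood values below are invented purely to illustrate the fit-vs-complexity trade-off:

```python
import numpy as np

def bic(neg_log_lik, n_params, n_data):
    """BIC(M) = −2 log P(D | M) + |M| log |D|.
    Lower is better: the first term rewards data fit,
    the second penalizes the number of kernel parameters."""
    return 2.0 * neg_log_lik + n_params * np.log(n_data)

# A kernel with one extra parameter must improve the fit enough
# to offset the complexity penalty (hypothetical numbers):
score_simple = bic(neg_log_lik=120.0, n_params=2, n_data=100)
score_rich = bic(neg_log_lik=118.0, n_params=3, n_data=100)
```

Here the richer kernel improves the fit by 2 nats but pays log(100) ≈ 4.6 in penalty, so the simpler kernel wins.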

Page 42:

The Automatic statistician: Learning

(1) Optimization criterion: Bayesian Information Criterion (BIC)

BIC(ℳ) = −2 log P(D | ℳ) + |ℳ| log |D|

(2) Learning algorithm (Composite Kernel Learning)

Iteratively select the best model (structure k, parameters θ):

(1) Expand: the current kernel
(2) Optimize: conjugate gradient descent
(3) Select: the best kernel in the level (greedy)
(4) Iterate: go back to (1) for the next level
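A toy version of the expand-optimize-select loop, with kernel expressions represented as strings and a stand-in scoring function; in the real system, scoring would optimize the hyperparameters and then evaluate BIC. Everything here, including the scoring function, is illustrative:

```python
BASE = ["LIN", "SE", "PER"]

def expand(expr):
    """All one-step expansions of the current kernel expression
    via the '+' and '×' grammar operators."""
    return ([f"({expr} + {b})" for b in BASE]
            + [f"({expr} * {b})" for b in BASE])

def greedy_search(score, levels=2, start="SE"):
    """Greedy structure search: keep the lowest-scoring candidate
    at each level (lower BIC is better)."""
    best = start
    for _ in range(levels):
        best = min(expand(best), key=score)
    return best

# Hypothetical scoring function: prefers short expressions that
# contain PER (imagine data with a periodic trend).
toy_score = lambda e: len(e) - 100 * ("PER" in e)
found = greedy_search(toy_score, levels=2)
```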

Page 43:

Example Result

[Figure: example decomposition:
a linear function (decrease of x/week),
a smooth function (length scale: y weeks),
and a rapidly varying smooth function (length scale: z hours).]

The Automatic Statistician*

* Automatic Bayesian Covariance Discovery (http://www.automaticstatistician.com/)

[Lloyd; Duvenaud; Grosse; Tenenbaum; Ghahramani. 2014.] [Ghahramani. Nature. 2015.]

Page 44:

Prediction Performance (Extrapolation)

13 famous regression datasets

Page 45:

The Relational Automatic Statistician

[Hwang, Tong and Choi, 2016]

* http://www.automaticstatistician.com/

Page 46:

Motivation

[Figure: adjusted close of General Electric around 9/11, 2001, decomposed by the Automatic Statistician into:
a linear function (decrease of x/week),
a smooth function (length scale: y weeks),
and a rapidly varying smooth function (length scale: z hours).]

The Automatic Statistician

Page 47:

Motivation

[Figure: adjusted closes of General Electric, Microsoft, and ExxonMobil around 9/11, 2001.]

• Exploit multiple time series
• Find global descriptions
• Hope for better predictive performance

The Relational Automatic Statistician

Page 48:

Model: Gaussian Process

[Graphical model: latent function → latent function evaluations → observations;
mean function fixed, covariance kernel function given and optimized.]

P(D | ℳ) = P(D | GP(0, k(x, x′; θ)))

Gaussian Processes → CKL → RKL → SRKL

Page 49:

Model: Composite Kernel Learning (CKL)

[Graphical model: latent function → latent function evaluations → observations;
mean function fixed, covariance kernel function built from a grammar and optimized.]

P(D | ℳ) = P(D | GP(0, k(x, x′; θ)))

GPs → Composite Kernel Learning → RKL → SRKL

Generalized Multi Kernel Learning

Page 50:

Model: Relational Kernel Learning (RKL)

[Graphical model: latent function → latent function evaluations → observations;
mean function fixed, covariance kernel function from a grammar;
per-series scale parameters σ_j and shift parameters c_j optimized.]

P(D | ℳ) = ∏_{j=1}^M P(d_j | GP(0, σ_j × k(x, x′; θ) + c_j))

GPs → CKL → Relational Kernel Learning → SRKL

[Figure: the same shared component at different scales and shifts across series.]

Page 51:

Model: Semi-Relational Kernel Learning (SRKL)

[Graphical model: latent function → latent function evaluations → observations;
mean function fixed, shared kernel from a grammar with scale parameter σ_j,
plus a distinctive per-series kernel k_j (fixed spectral mixture form, optimized).]

P(D | ℳ) = ∏_{j=1}^M P(d_j | GP(0, σ_j × k(x, x′; θ) + k_j(x, x′; θ_j)))

GPs → CKL → RKL → Semi-Relational Kernel Learning

Page 52:

Semi-Relational Kernel Learning (SRKL)

Input: M time series
Output: a shared kernel k and M spectral mixture (SM) kernels k_j

(1) Expand: the current shared kernel for all time series
(2) Optimize: the expanded kernels for all M time series (conjugate gradient descent);
    for each series, individual distinctions are handled by the SM kernel
(3) Select: the best shared kernel for all time series (greedy)
(4) Iterate: go back to (1) until the level limit s is reached

Find the best shared and distinctive kernels iteratively!

Page 53:

Experimental Results

Three real-world data sets:

US top 9 stocks in year 2001
US top 6 housing markets from 2003 to 2013
Currency exchange rates of 4 emerging markets

Page 54:

Data sets

Descriptions                            Details
9 adjusted closes of stock figures      GE, MSFT, XOM, PFE, C, WMT, INTC, BP, AIG
6 US housing price indices              New York, Los Angeles, Chicago, Phoenix, San Diego, San Francisco
4 emerging market currency exchanges    Indonesian (IDR), Malaysian (MYR), South African (ZAR), Russian (RUB)

[Graphs column: normalized time series plots]

Page 55:

Qualitative Results

[Figures: adjusted closes with learned components 1 and 2, Automatic Statistician vs. Relational Automatic Statistician; 4 currency exchange rates with a learned component.]

US stock market values suddenly drop after the US 9/11 attacks.
Currency exchange rates are affected by the FED's policy change in interest rates around mid-September 2015.

Page 56:

An Automatically Generated Report

Page 57:

Quantitative Results

STOCK3 = GE, MSFT, XOM
STOCK6 = STOCK3 + PFE, C, WMT
STOCK9 = STOCK6 + INTC, BP, AIG
HOUSE2 = NY, LA
HOUSE4 = HOUSE2 + Chicago, Phoenix
HOUSE6 = HOUSE4 + San Diego, San Francisco
CURRENCY4 = IDR, MYR, ZAR, RUB

Page 58:

Quantitative Results (box plots)

[Box plots: 9 stocks · 6 house price indices · 4 currency exchanges]

Page 59:

Conclusion

• Our research topic: solving the descriptive prediction problem for multiple time series.

• We proposed models that exploit both common and distinct changes.

• We found that our models show better qualitative and quantitative performance compared to the state-of-the-art GP regression method.

http://saildemo.unist.ac.kr/automatic_statistician/

Page 60:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior · Dirichlet Processes · Chinese Restaurant Processes · Beta Processes · Indian Buffet Processes · Nonparametric Hierarchical Models

Page 61:

Introduction

Simple Example

Task: Toss a (potentially biased) coin N times. Compute θ, the probability of heads.

Suppose we observe: T, H, H, T. What do we think θ is? The maximum likelihood estimate is θ = 1/2. Seems reasonable.

Now suppose we observe: H, H, H, H. What do we think θ is? The maximum likelihood estimate is θ = 1. Does that seem reasonable?

Not really. Why?

Kurt T. Miller · Nonparametric Bayes · 9

Page 63:

Introduction

Simple Example

When we observe H, H, H, H, why does θ = 1 seem unreasonable?

Prior knowledge! We believe coins generally have θ ≈ 1/2. How to encode this? By using a Beta prior on θ.


Page 65:

Introduction

Bayesian Approach to Estimating θ

Place a Beta(a, b) prior on θ. This prior has the form

p(θ) ∝ θ^{a−1}(1 − θ)^{b−1}.

What does this distribution look like?

[Plot: Beta densities for (α1, α2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0)]

Page 67:

Introduction

Bayesian Approach to Estimating θ

After observing X, a sequence with n heads and m tails, the posterior on θ is:

p(θ|X) ∝ p(X|θ) p(θ) ∝ θ^{a+n−1}(1 − θ)^{b+m−1}

i.e., θ|X ∼ Beta(a + n, b + m).

If a = b = 1 and we observe 5 heads and 2 tails, Beta(6, 3) looks like

[Plot: the Beta(6, 3) posterior density]
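The conjugate update above is easy to check numerically. A minimal sketch with scipy, mirroring the slide's a = b = 1, 5 heads, 2 tails example (nothing here is specific to the talk's own code):

```python
import numpy as np
from scipy import stats

# Prior Beta(a, b); observed data: n heads, m tails.
a, b = 1.0, 1.0
n, m = 5, 2

# Conjugacy: the posterior is Beta(a + n, b + m) = Beta(6, 3).
posterior = stats.beta(a + n, b + m)

# The posterior mean (a + n) / (a + b + n + m) shrinks the MLE
# n / (n + m) toward the prior mean a / (a + b) = 1/2.
post_mean = posterior.mean()
mle = n / (n + m)
print(post_mean, mle)  # 0.666..., 0.714...
```

Note that the posterior mean sits strictly between the prior mean and the MLE, which is exactly the regularizing effect the slides motivate.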

Page 69:

Parametric Bayesian Clustering

The Dirichlet Distribution

We had π ∼ Dirichlet(α1, . . . , αK)

The Dirichlet density is defined as

p(π|α) = Γ(∑_{k=1}^K αk) / (∏_{k=1}^K Γ(αk)) · π1^{α1−1} π2^{α2−1} · · · πK^{αK−1}

where πK = 1 − ∑_{k=1}^{K−1} πk.

The expectations of π are

E(πi) = αi / ∑_{k=1}^K αk
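The mean formula can be checked empirically with numpy's Dirichlet sampler; the choice α = (2, 2, 5) below is just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0, 5.0])

# Each draw pi ~ Dir(alpha) lies on the probability simplex.
samples = rng.dirichlet(alpha, size=100_000)

# Empirical means approach E(pi_i) = alpha_i / sum_k alpha_k.
print(samples.mean(axis=0))  # ~ [2/9, 2/9, 5/9]
```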

Page 70:

Parametric Bayesian Clustering

The Beta Distribution

A special case of the Dirichlet distribution is the Beta distribution, obtained when K = 2:

p(π|α1, α2) = Γ(α1 + α2) / (Γ(α1)Γ(α2)) · π^{α1−1}(1 − π)^{α2−1}

[Plot: Beta densities for (α1, α2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0)]

Page 71:

Parametric Bayesian Clustering

The Dirichlet Distribution

In three dimensions:

p(π|α1, α2, α3) = Γ(α1 + α2 + α3) / (Γ(α1)Γ(α2)Γ(α3)) · π1^{α1−1} π2^{α2−1} (1 − π1 − π2)^{α3−1}

[Density plots on the simplex for α = (2, 2, 2), (5, 5, 5), and (2, 2, 5)]

Page 72:

Parametric Bayesian Clustering

Draws from the Dirichlet Distribution

[Bar plots: three draws of π each from Dir(2, 2, 2), Dir(5, 5, 5), and Dir(2, 2, 5)]

Page 73:

Parametric Bayesian Clustering

Key Property of the Dirichlet Distribution

The Aggregation Property: If

(π1, . . . , πi, πi+1, . . . , πK) ∼ Dir(α1, . . . , αi, αi+1, . . . , αK)

then

(π1, . . . , πi + πi+1, . . . , πK) ∼ Dir(α1, . . . , αi + αi+1, . . . , αK)

This is also valid for any aggregation:

(π1 + π2, ∑_{k=3}^K πk) ∼ Dir(α1 + α2, ∑_{k=3}^K αk), i.e., π1 + π2 ∼ Beta(α1 + α2, ∑_{k=3}^K αk)
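The aggregation property can also be verified by simulation; a sketch with an arbitrary α = (1, 2, 3, 4):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([1.0, 2.0, 3.0, 4.0])
samples = rng.dirichlet(alpha, size=100_000)

# Aggregation: pi1 + pi2 should be Beta(alpha1 + alpha2, alpha3 + alpha4)
# = Beta(3, 7), whose mean is 3/10 and variance 21/1100.
agg = samples[:, 0] + samples[:, 1]
print(agg.mean(), agg.var())  # ~ 0.3, ~ 0.0191
```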

Page 74:

Parametric Bayesian Clustering

Multinomial-Dirichlet Conjugacy

Let Z ∼ Multinomial(π) and π ∼ Dir(α).

Posterior:

p(π|z) ∝ p(z|π) p(π)
= (π1^{z1} · · · πK^{zK}) (π1^{α1−1} · · · πK^{αK−1})
= π1^{z1+α1−1} · · · πK^{zK+αK−1}

which is Dir(α + z).
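In code, this conjugate update is literally "add the counts to the prior"; a minimal sketch (the counts z are made up for illustration):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])  # symmetric Dirichlet prior over 3 outcomes
z = np.array([5, 2, 3])            # observed multinomial counts

# Posterior is Dir(alpha + z); its mean gives smoothed frequencies.
alpha_post = alpha + z
post_mean = alpha_post / alpha_post.sum()
print(post_mean)  # [6/13, 3/13, 4/13]
```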

Page 75:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models

Page 76:

The Dirichlet Process Model

Parameters for the Dirichlet Process

• α - The concentration parameter.

• G0 - The base measure. A prior distribution for the cluster-specific parameters.

The Dirichlet Process (DP) is a distribution over distributions. We write

G ∼ DP (α,G0)

to indicate G is a distribution drawn from the DP.

It will become clearer in a bit what α and G0 are.


Page 77:

The Dirichlet Process Model

The Dirichlet Process

Definition: Let G0 be a probability measure on the measurable space (Ω, B) and α ∈ R+.

The Dirichlet Process DP (α,G0) is the distribution on probabilitymeasures G such that for any finite partition (A1, . . . , Am) of Ω,

(G(A1), . . . , G(Am)) ∼ Dir(αG0(A1), . . . , αG0(Am)).

[Figure: a finite partition A1, . . . , A5 of Ω]

(Ferguson, ’73)

Page 78:

The Dirichlet Process Model

Mathematical Properties of the Dirichlet Process

Suppose we sample

• G ∼ DP (α,G0)

• θ1 ∼ G

What is the posterior distribution of G given θ1?

G|θ1 ∼ DP(α + 1, α/(α+1) · G0 + 1/(α+1) · δ_{θ1})

More generally

G|θ1, . . . , θn ∼ DP(α + n, α/(α+n) · G0 + 1/(α+n) · ∑_{i=1}^n δ_{θi})

Page 80:

The Dirichlet Process Model

Mathematical Properties of the Dirichlet Process

With probability 1, a sample G ∼ DP(α, G0) is of the form

G = ∑_{k=1}^∞ πk δ_{φk}

(Sethuraman, ’94)

Page 81:

The Dirichlet Process Model

The Stick-Breaking Process

• Define an infinite sequence of Beta random variables:

βk ∼ Beta(1, α) k = 1, 2, . . .

• And then define an infinite sequence of mixing proportions as:

π1 = β1
πk = βk ∏_{l=1}^{k−1} (1 − βl)   k = 2, 3, . . .

• This can be viewed as breaking off portions of a unit-length stick:

[Figure: stick of length 1 broken into π1 = β1, π2 = β2(1 − β1), . . . ]

• When π are drawn this way, we can write π ∼ GEM(α).
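Sampling the first few GEM weights takes a few lines of numpy; a sketch (truncating at k_max is an approximation for illustration, not part of the definition):

```python
import numpy as np

def gem_weights(alpha, k_max, rng):
    """First k_max stick-breaking weights of pi ~ GEM(alpha)."""
    beta = rng.beta(1.0, alpha, size=k_max)          # beta_k ~ Beta(1, alpha)
    # Length of stick remaining before break k: prod_{l<k} (1 - beta_l).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
    return beta * remaining                          # pi_k

rng = np.random.default_rng(2)
pi = gem_weights(alpha=2.0, k_max=200, rng=rng)
print(pi.sum())  # partial sums approach 1 as k_max grows
```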

Page 82:

The Dirichlet Process Model

The Stick-Breaking Process

• We now have an explicit formula for each πk: πk = βk ∏_{l=1}^{k−1} (1 − βl)

• We can also easily see that ∑_{k=1}^∞ πk = 1 (w.p. 1):

1 − ∑_{k=1}^K πk = 1 − β1 − β2(1 − β1) − β3(1 − β1)(1 − β2) − · · ·
= (1 − β1)(1 − β2 − β3(1 − β2) − · · · )
= ∏_{k=1}^K (1 − βk)
→ 0 (w.p. 1 as K → ∞)

• So now G = ∑_{k=1}^∞ πk δ_{φk} has a clean definition as a random measure

Page 83:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models

Page 84:

The Dirichlet Process Model

The Chinese Restaurant Process (CRP)

• A random process in which n customers sit down in a Chinese restaurant with an infinite number of tables

• The first customer sits at the first table
• The mth subsequent customer sits at a table drawn from the following distribution:

P(previously occupied table i | Fm−1) ∝ ni
P(the next unoccupied table | Fm−1) ∝ α

where ni is the number of customers currently at table i and where Fm−1 denotes the state of the restaurant after m − 1 customers have been seated
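The seating rule above translates directly into a simulator; a minimal sketch:

```python
import numpy as np

def crp(n_customers, alpha, rng):
    """Simulate CRP(alpha) table assignments for n_customers."""
    counts = []                                   # n_i per occupied table
    assignments = []
    for _ in range(n_customers):
        # P(existing table i) ∝ n_i,  P(new table) ∝ alpha.
        weights = np.array(counts + [alpha])
        table = rng.choice(len(weights), p=weights / weights.sum())
        if table == len(counts):
            counts.append(1)                      # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

rng = np.random.default_rng(3)
seating, counts = crp(100, alpha=1.0, rng=rng)
print(len(counts))  # number of occupied tables; grows like alpha * log(n)
```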

Page 85:

The Dirichlet Process Model

The CRP and Clustering

• Data points are customers; tables are clusters

• the CRP defines a prior distribution on the partitioning of the data and on the number of tables

• This prior can be completed with:

• a likelihood—e.g., associate a parameterized probability distribution with each table

• a prior for the parameters—the first customer to sit at table k chooses the parameter vector for that table (φk) from the prior

[Figure: occupied tables with parameters φ1, φ2, φ3, φ4]

• So we now have a distribution—or can obtain one—for any quantity that we might care about in the clustering setting

Page 86:

The Dirichlet Process Model

The CRP Prior, Gaussian Likelihood, Conjugate Prior


Page 87:

The Dirichlet Process Model

The CRP and the DP

OK, so we’ve seen how the CRP relates to clustering. How does it relate to the DP?

Important fact: The CRP is exchangeable.

Remember De Finetti’s Theorem: If (x1, x2, . . .) are infinitely exchangeable, then ∀n

p(x1, . . . , xn) = ∫ ∏_{i=1}^n p(xi|G) dP(G)

for some random variable G.

Page 89:

The Dirichlet Process Model

The CRP and the DP

The Dirichlet Process is the De Finetti mixing distribution for the CRP.

That means, when we integrate out G, we get the CRP.

p(θ1, . . . , θn) = ∫ ∏_{i=1}^n p(θi|G) dP(G)

[Graphical model: α, G0 → G → θi → xi, i = 1, . . . , N]

Page 91:

The Dirichlet Process Model

The CRP and the DP

The Dirichlet Process is the De Finetti mixing distribution for the CRP.

In English, this means that if the DP is the prior on G, then the CRP defines how points are assigned to clusters when we integrate out G.

Page 92:

The Dirichlet Process Model

The DP, CRP, and Stick-Breaking Process Summary

[Graphical model: G ∼ DP(α, G0); θi ∼ G; xi ∼ p(·|θi), i = 1, . . . , N]

The Stick-Breaking Process gives just the weights of G.

The CRP describes the partitions of θ when G is marginalized out.

Page 93:

Beta Processes

Definition [Hjort, ’90]: Let H0 be a continuous probability measure on (Ω, B) and α ∈ R+. Then the Beta Process BP(α, H0) is the distribution on random measures H such that any (disjoint) finite partition (A1, · · · , AK) of Ω satisfies

H(Ai) ∼ Beta(αH0(Ai), α(1 − H0(Ai)))

with K → ∞ and H0(Ai) → 0 for i = 1, · · · , K.

The beta process can be written in set-function form,

H(ω) = ∑_{k=1}^∞ πk δ_{ωk}(ω)

Page 94:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models

Page 95:

Indian Buffet Processes

Latent feature models

Clustering is not enough for mixed memberships
Grouping problem with overlapping clusters
Encode as a binary matrix: observation n in cluster k ⇔ xnk = 1
Alternatively: item n possesses feature k ⇔ xnk = 1

Indian buffet process (IBP)

1. Customer 1 tries Poisson(α) dishes.
2. Subsequent customer n + 1:
   tries a previously tried dish k with probability nk/(n + 1)
   tries Poisson(α/(n + 1)) new dishes.

Properties

An exchangeable distribution over finite sets (of dishes).
Observation (= customer) n is in cluster (= dish) k if customer n ‘tries dish k’
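The two-step culinary rule above can be simulated directly; a sketch that builds the binary feature matrix row by row (α = 3 and 50 customers are arbitrary illustration values):

```python
import numpy as np

def ibp(n_customers, alpha, rng):
    """Simulate a binary feature matrix X from the IBP."""
    n_k = []                                  # times each dish has been tried
    rows = []
    for n in range(n_customers):              # customer n + 1 (0-indexed n)
        row = []
        # Revisit a previously tried dish k with probability n_k / (n + 1).
        for count in n_k:
            row.append(int(rng.random() < count / (n + 1)))
        # Then try Poisson(alpha / (n + 1)) brand-new dishes.
        new = rng.poisson(alpha / (n + 1))
        row.extend([1] * new)
        n_k = [c + row[k] for k, c in enumerate(n_k)] + [1] * new
        rows.append(row)
    X = np.zeros((n_customers, len(n_k)), dtype=int)
    for i, row in enumerate(rows):
        X[i, :len(row)] = row
    return X

rng = np.random.default_rng(4)
X = ibp(50, alpha=3.0, rng=rng)
print(X.shape)  # (50, K), with K random; E[K] = alpha * sum_{n=1}^{50} 1/n
```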

Page 96:

Indian Buffet Process

Alternative description

1. Sample w1, · · · , wK ∼iid Beta(α/K, 1)
2. Sample x1k, · · · , xnk ∼iid Bernoulli(wk)

Beta Process (BP)

Distribution on objects of the form

θ = ∑_{k=1}^∞ wk δ_{φk}   with wk ∈ [0, 1].

IBP matrix entries are sampled as xnk ∼iid Bernoulli(wk).
The Beta process is the de Finetti measure of the IBP.
θ is a random measure

Page 97:

Binary matrices in left-order form


Page 98:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models

Page 99:

Hierarchical Gaussian Processes

Apply the Bayesian representation recursively. Split the parameter Θ into

Θ → Ψ and Θ|Ψ

so that

p(data, Θ) = p(data|Θ) p(Θ)
p(data, Θ, Ψ) = p(data|Θ) p(Θ|Ψ) p(Ψ)

Example: Hierarchical Gaussian process

Sample Ψ ∼ p(Ψ) (e.g., large length-scale, mean 0)

Sample Θ|Ψ ∼ p(·|Ψ) (e.g., smaller length-scale, mean Ψ)

Decompose underlying pattern:

Low-frequency component Ψ

High-frequency component Θ

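This two-level decomposition can be sketched with plain numpy. The squared-exponential kernel and the specific length-scales below are illustrative assumptions, not the talk's exact model:

```python
import numpy as np

def se_kernel(x, lengthscale, variance=1.0):
    """Squared-exponential covariance matrix over inputs x."""
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 200)
jitter = 1e-8 * np.eye(len(x))  # numerical stabilizer for sampling

# Low-frequency component: Psi ~ GP(0, k_long), large length-scale.
psi = rng.multivariate_normal(np.zeros(len(x)),
                              se_kernel(x, lengthscale=5.0) + jitter)

# High-frequency component: Theta | Psi ~ GP(Psi, k_short),
# smaller length-scale and variance, mean Psi.
theta = rng.multivariate_normal(psi,
                                se_kernel(x, lengthscale=0.5, variance=0.1) + jitter)

# theta decomposes into a smooth trend (psi) plus local variation (theta - psi).
print(theta.shape)
```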

Page 100:

Hierarchical Dirichlet Processes

Sampling scheme

Sample G0 ∼ DP(γ, H)
Sample G1, G2, · · · ∼ DP(α, G0)
Sample xij ∼ Gj

G1, G2, · · · have a common ‘vocabulary’ of atoms.

Nonparametric Latent Dirichlet Allocation (LDA)

G0 = ∑_{k=1}^∞ ck δ_{θ∗k}   Gj = ∑_{l=1}^∞ D_l^j δ_{Φ_l^j}

θ∗k = finite probability (= ‘topic’)
ck = occurrence probability of topic k
Document j is drawn from a weighted combination of topics, with proportions D_l^j (‘admixture model’)

