Page 1: Statistical Artificial Intelligence Laboratory (SAIL)*, UNIST · sail.unist.ac.kr/talks/3_Bayesian_Nonparametric_ICCAS... · 2016-10-16

Nonparametric Bayesian for Sequential Data¹

Jaesik Choi

Statistical Artificial Intelligence Laboratory (SAIL)*

UNIST

*http://sail.unist.ac.kr

¹ Some slides are courtesy of [Zoubin, UAI 2005], [Miller, 2009], and [Orbanz, NIPS 2012]

Jaesik Choi(UNIST) Part 3. Nonparametric Bayesian for Sequential Data 1 / 38

Page 2:

Overview

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior · Dirichlet Processes · Chinese Restaurant Processes · Beta Processes · Indian Buffet Processes · Nonparametric Hierarchical Models

Page 3:

Terminology

Parametric model

Number of parameters is fixed (or bounded by a constant) w.r.t. sample size

Nonparametric model

Number of parameters grows with sample size; the parameter space is ∞-dimensional

Example: Density estimation

Page 4:

Nonparametric Bayesian Model

Definition: A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.

Interpretation: Parameter space T = set of possible model parameters (or patterns), for example:

Problem              T
Density estimation   Probability distributions
Regression           Smooth functions
Clustering           Partitions

Solution to a Bayesian problem = posterior distribution on the model parameters

Page 5:

Exchangeability

Can we justify our assumptions?
Assumption: data = model + noise
In Bayes' theorem,

p(model | data) = p(data | model) p(model) / p(data)

Definition: X1, · · · , Xn are exchangeable if P(X1, · · · , Xn) is invariant under any permutation π:

p(X1 = x1, · · · , Xn = xn) = p(X1 = xπ(1), · · · , Xn = xπ(n))

e.g. p(1, 1, 0, 0, 0, 1, 1, 0) = p(1, 0, 1, 0, 1, 0, 0, 1)

Order of observations does not matter.

Page 6:

Exchangeability

De Finetti's Theorem (binary case)
Binary sequence case: all exchangeable binary sequences are mixtures of Bernoulli sequences [de Finetti, 1931]:

p(x1, · · · , xn) = ∫₀¹ θ^{t_n} (1 − θ)^{n − t_n} dF(θ),

where p(x1, · · · , xn) = p(X1 = x1, · · · , Xn = xn) and t_n = Σ_{i=1}^n x_i.

Implications

Exchangeable data decomposes into mixtures of models with i.i.d. sequences.
Caution: θ is in general an ∞-dimensional quantity.

Page 7:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior · Dirichlet Processes · Chinese Restaurant Processes · Beta Processes · Indian Buffet Processes · Nonparametric Hierarchical Models

Page 8:

Clustering

Observations X1, X2, · · ·
Each observation belongs to exactly one cluster.

Unknown pattern = partition of {1, · · · , n}

Page 9:

Mixture Models

Mixture models

p(data | model) = ∫_{Ω_θ} p(data | θ) m(dθ)

m is called the mixing measure.

Two-stage sampling
Sample data X ∼ p(· | m) as:
1. Θ ∼ m
2. X ∼ p(· | θ)

Finite mixture model

m(·) = Σ_{k=1}^K c_k δ_{θ_k}(·)

Page 10:

Bayesian Mixture Models (BMMs)

Random mixing measure

m(·) = Σ_{k=1}^K c_k δ_{θ_k}(·)

Conjugate priors
A Bayesian model is conjugate if the posterior is an element of the same class of distributions as the prior ('closure under sampling').

p(data | model), likelihood    p(model), conjugate prior
Multinomial                    Dirichlet
Gaussian                       Gaussian

Choice of priors in a BMM: conjugate priors

Choose a conjugate prior for each parameter.
E.g.: Dirichlet prior

Page 11:

Dirichlet Process Mixtures

Dirichlet process (DP) [Ferguson, 1973] [Sethuraman, 1994]
A DP is a distribution on random probability measures of the form

m(·) = Σ_{k=1}^∞ c_k δ_{θ_k}(·), where Σ_{k=1}^∞ c_k = 1.

Constructive definition of DP(α, G0):

θ_k ∼iid G0
V_k ∼iid Beta(1, α)

Compute c_k as

c_k = V_k ∏_{i=1}^{k−1} (1 − V_i)

This procedure is called the 'stick-breaking construction'.
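The stick-breaking recipe above can be written in a few lines of NumPy. A minimal sketch; the function name `stick_breaking` and the truncation level `k` are illustrative choices, not part of the slides:

```python
import numpy as np

def stick_breaking(alpha, k, rng=None):
    """Draw the first k stick-breaking weights of DP(alpha, G0):
    V_k ~ Beta(1, alpha),  c_k = V_k * prod_{i<k} (1 - V_i)."""
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=k)               # stick proportions V_k
    # length of the stick remaining before each break: 1, (1-V_1), (1-V_1)(1-V_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining                           # weights c_1, ..., c_k

weights = stick_breaking(alpha=2.0, k=1000, rng=0)
# The weights are nonnegative and the infinite sum is 1; a long
# truncation already accounts for almost all of the mass.
```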

Page 12:

Posterior Distribution

DP Posterior

θ_{n+1} | θ_1, · · · , θ_n ∼ (1/(n + α)) Σ_{j=1}^n δ_{θ_j}(θ_{n+1}) + (α/(n + α)) G0(θ_{n+1})

Mixture Posterior

p(x_{n+1} | x_1, · · · , x_n) = Σ_{k=1}^{K_n} (n_k/(n + α)) p(x_{n+1} | θ*_k) + (α/(n + α)) ∫ p(x_{n+1} | θ) G0(θ) dθ

Conjugacy

The posterior of DP(α, G0) is DP(α + n, (1/(n + α)) (Σ_k n_k δ_{θ*_k} + α G0)).

The Dirichlet process is conjugate.

Page 13:

Inference

Latent variables

p(x_{n+1} | x_1, · · · , x_n) = Σ_{k=1}^{K_n} (n_k/(n + α)) p(x_{n+1} | θ*_k) + (α/(n + α)) ∫ p(x_{n+1} | θ) G0(θ) dθ

We observe the x_i and do not actually observe the θ_k (latent).

Assignment probabilities

q_{jk} ∝ n_k p(x_j | θ*_k)
q_{j0} ∝ α ∫ p(x_j | θ) G0(θ) dθ

Gibbs Sampling
Uses an assignment variable φ_j for each observation x_j.

Assignment step: sample φ_j ∼ Multinomial(q_{j0}, · · · , q_{jK_n})
Parameter sampling:

θ*_k ∼ G0(θ*_k) ∏_{x_j ∈ cluster k} p(x_j | θ*_k)
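As a rough illustration of the assignment probabilities, the sketch below assumes a 1-D Gaussian likelihood N(x | θ, obs_var) with base measure G0 = N(0, base_var), so the new-cluster integral has the closed form N(x_j | 0, obs_var + base_var). The concrete likelihood, the base measure, and all names are assumptions for illustration, not part of the slides:

```python
import numpy as np
from math import sqrt, pi, exp

def normal_pdf(x, mean, var):
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def assignment_probs(x_j, cluster_params, cluster_sizes, alpha,
                     obs_var=1.0, base_var=1.0):
    """Gibbs assignment distribution for one observation x_j:
    q_j0 ∝ α ∫ p(x_j | θ) G0(dθ)   (new cluster, here in closed form)
    q_jk ∝ n_k p(x_j | θ*_k)        (existing clusters)."""
    q = [alpha * normal_pdf(x_j, 0.0, obs_var + base_var)]   # new cluster
    for n_k, theta_k in zip(cluster_sizes, cluster_params):
        q.append(n_k * normal_pdf(x_j, theta_k, obs_var))
    q = np.asarray(q)
    return q / q.sum()                                        # normalize

# Two existing clusters at θ* = 0 (size 3) and θ* = 5 (size 2):
probs = assignment_probs(x_j=0.2, cluster_params=[0.0, 5.0],
                         cluster_sizes=[3, 2], alpha=1.0)
```

An observation near 0 is assigned to the nearby, larger cluster with overwhelming probability.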

Page 14:

Number of Clusters

Dirichlet process

Kn = # of clusters in sample of size n

E[Kn] = O(log(n))

Modeling assumption

Parametric clustering: K_∞ is finite (possibly unknown, but fixed).
Nonparametric clustering: K_∞ is infinite.

Rephrasing the question

The estimate of K_n is controlled by the distribution of the cluster sizes c_k in Σ_k c_k δ_{θ_k}.

What should we assume about the distribution of the c_k?

Page 15:

Generalizing the DP

Pitman-Yor process

p(x_{n+1} | x_1, · · · , x_n) = Σ_{k=1}^{K_n} ((n_k − d)/(n + α)) p(x_{n+1} | θ*_k) + ((α + K_n · d)/(n + α)) ∫ p(x_{n+1} | θ) G0(θ) dθ

Discount parameter d ∈ [0, 1].

Cluster sizes [figure]

Page 16:

Power Laws

The distribution of cluster sizes is called a power law if

c_j ∼ γ(β) · j^{−β}

for some β ∈ [0, 1].

Examples of power laws

Word frequencies
Popularity (# of friends) in social networks
Pitman-Yor language model

Page 17:

Random Partitions

Discrete measures and partitions
Sampling from a discrete measure determines a partition of N into blocks b_k:

θ_n ∼iid Σ_{k=1}^∞ c_k δ_{θ*_k}  and set  n ∈ b_k ⇔ θ_n = θ*_k

As n → ∞, the block proportions converge: |b_k|/n → c_k.

Induced random partition
The distribution of a random discrete measure m = Σ_{k=1}^∞ c_k δ_{θ_k} induces the distribution of a random partition Π = (B1, B2, · · · ).

Exchangeable random partitions
Π is called exchangeable if its distribution depends only on the sizes of its blocks.
All exchangeable random partitions, and only those, can be represented by a random discrete distribution (Kingman's theorem).

Page 18:

Chinese Restaurant Process (CRP)

Chinese Restaurant Process
The distribution of the random partition induced by the Dirichlet process.

'Customers and tables' analogy

Customers = observations (indices in N)
Tables = clusters (blocks)

Historical remark
Originally introduced by Dubins & Pitman as a distribution on infinite permutations.
A permutation of n items defines a partition of {1, · · · , n} (regard the cycles of the permutation as the blocks of the partition).
The induced distribution on partitions is the CRP we use in clustering.
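The 'customers and tables' dynamics can be simulated directly. A minimal sketch; the sampler name and parameters are illustrative:

```python
import numpy as np

def crp_partition(n, alpha, rng=None):
    """Seat n customers by the Chinese restaurant process:
    customer i joins occupied table k with prob n_k / (i + alpha),
    or opens a new table with prob alpha / (i + alpha).
    Returns the list of table (cluster) sizes."""
    rng = np.random.default_rng(rng)
    tables = []
    for i in range(n):
        probs = np.array(tables + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)          # open a new table
        else:
            tables[k] += 1
    return tables

sizes = crp_partition(n=500, alpha=2.0, rng=0)
# E[K_n] grows like O(log n), so 500 customers occupy
# only a handful of tables.
```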

Page 19:

Summary: Clustering

Nonparametric Bayesian clustering

Infinite # of clusters, K_n ≤ n of which are observed.
If the partition is exchangeable, it can be represented by a random discrete distribution.

Inference: latent-variable algorithms, since the assignments (i.e., the partition) are not observed.

Gibbs sampling
Variational algorithms

Prior assumption

The distribution of cluster sizes implies a prior assumption on the number K_n of clusters.

Page 20:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior · Dirichlet Processes · Chinese Restaurant Processes · Beta Processes · Indian Buffet Processes · Nonparametric Hierarchical Models

Page 21:

Gaussian Distributions

Gaussian

p(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

Multivariate Gaussian

p(x) = (1/√((2π)^n |Σ|)) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))

Page 22:

Gaussian Processes: Definition

GPs are distributions over functions such that any finite set of function evaluations [f(x1), · · · , f(xn)] has a jointly Gaussian distribution.

A GP is completely specified by its mean function µ(x) = E[f(x)] and its covariance kernel function k(x, x′) = Cov(f(x), f(x′)).

Given data x = [x1, · · · , xn] and y = [f(x1), · · · , f(xn)], the GP model is specified by µ(x) and k(x, x′).

Notation for 'f follows the GP':

f ∼ GP(µ(x), k(x, x′))

The likelihood is:

p(y | X) = (1/√((2π)^n |Σ|)) exp(−(1/2)(y − µ)ᵀ Σ⁻¹ (y − µ))

where µ = [µ(x1), · · · , µ(xn)] and Σ_ij = k(xi, xj).
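A minimal sketch of drawing from a GP prior and evaluating the Gaussian log-likelihood above, assuming a squared-exponential kernel with unit hyperparameters (the small jitter term is a standard numerical stabilizer, not part of the slides):

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared-exponential covariance k(x, x') = σ² exp(−(x−x')²/(2ℓ²))."""
    d = x1[:, None] - x2[None, :]
    return sigma ** 2 * np.exp(-d ** 2 / (2 * ell ** 2))

def gp_log_likelihood(y, mu, K, jitter=1e-8):
    """log N(y | mu, K) computed via a Cholesky factorization."""
    n = len(y)
    L = np.linalg.cholesky(K + jitter * np.eye(n))
    z = np.linalg.solve(L, y - mu)
    return (-0.5 * z @ z - np.log(np.diag(L)).sum()
            - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
mu = np.zeros(50)
K = se_kernel(x, x)
f = rng.multivariate_normal(mu, K + 1e-8 * np.eye(50))  # one GP prior draw
ll = gp_log_likelihood(f, mu, K)
```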

Page 23:

Gaussian Processes: Samples with different kernels

Page 24:

Gaussian Processes: Predictions with different kernels

A sample from the prior for each covariance function

Corresponding predictions, mean with two standard deviations:

Page 25:

Gaussian Processes

Nonparametric regression
Parameter space = continuous functions, say on [a, b]:

f : [a, b] → R,  T = C[a, b]

Gaussian Process

f ∼ GP ↔ (f(x1), · · · , f(xd)) is d-dimensional Gaussian for any finite set X ⊂ [a, b].

Construction: Intuition

The marginal of the GP for any finite X ⊂ [a, b] is a Gaussian.
All these Gaussians are marginals of each other.

Page 26:

Examples

Application                 Parameter space          Bayesian nonparametric model
Regression                  Function                 Gaussian process
Clustering                  Partition                Chinese restaurant process
Density estimation          Density                  Dirichlet process mixture
Hierarchical clustering     Hierarchical partition   Dirichlet/Pitman-Yor diffusion tree, nested CRP
Latent variable modeling    Features                 Beta process / Indian buffet process
Dictionary learning         Dictionary               Beta process / Indian buffet process
Survival analysis           Hazard                   Beta process, neutral-to-the-right process
Power-law behaviour         -                        Pitman-Yor process, stable-beta process
Deep learning               Features                 Cascading/nested Indian buffet process
Dimensionality reduction    Manifold                 Gaussian process latent variable model
Topic models                Atomic distribution      Hierarchical Dirichlet process
Relational modeling         Relations                Infinite relational model, Mondrian process

Page 27:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior · Dirichlet Processes · Chinese Restaurant Processes · Beta Processes · Indian Buffet Processes · Nonparametric Hierarchical Models

Page 28:

Recent advances in Nonparametric Regression

Page 29:

Problem

Descriptive prediction of multiple time series


Page 31:

Problem

[Figure: a time series decomposed into components:
a linear function (decrease of x/week),
a smooth function (length scale: y weeks),
and a rapidly varying smooth function (length scale: z hours).]

Descriptive prediction of multiple time series

Page 32:

Problem (this paper)

Descriptive prediction of multiple time series

[Figure: a time series decomposed into components:
a constant function,
a sudden drop between 9/12/01 and 9/15/01,
a smooth function (length scale: y weeks),
and a rapidly varying smooth function (length scale: z hours).]

Page 33:

Models

Automatic Bayesian Covariance Discovery* [Lloyd et al., 2014] [Ghahramani, 2015]

* http://www.automaticstatistician.com/

Page 34:

Gaussian Processes

f(x) ∼ GP(µ(x), k(x, x′))

Function ∼ Gaussian Process

Mean function: µ(x) = E[f(x)]
Covariance kernel function: k(x, x′) = Cov(f(x), f(x′))

Page 35:

Gaussian Processes

f(x) ∼ GP(µ(x), k(x, x′))

Function ∼ Gaussian Process

Mean function: µ(x) = E[f(x)]
Covariance kernel function: k(x, x′) = Cov(f(x), f(x′))

[f(x1), … , f(xN)] ∼ N(µ, Σ)

Function evaluations ∼ Multivariate Gaussian

Mean vector: µ = [µ(x1), … , µ(xN)]
Covariance matrix: Σ_ij = k(xi, xj)

Page 36:

* Automatic Bayesian Covariance Discovery (http://www.automaticstatistician.com/)

[Lloyd; Duvenaud; Grosse; Tenenbaum; Ghahramani. 2014.] [Ghahramani. Nature. 2015.]

The Automatic Statistician*

f(x) ∼ GP(µ(x), k(x, x′))

Find an appropriate kernel:
(1) Encode characteristics

Page 37:

* Automatic Bayesian Covariance Discovery (http://www.automaticstatistician.com/)

[Lloyd; Duvenaud; Grosse; Tenenbaum; Ghahramani. 2014.] [Ghahramani. Nature. 2015.]

The Automatic Statistician*

If g(x) ∼ GP(0, k_g), h(x) ∼ GP(0, k_h) and g(x) ⊥ h(x), then

g(x) + h(x) ∼ GP(0, k_g + k_h)
g(x) × h(x) ∼ GP(0, k_g × k_h)

f(x) ∼ GP(µ(x), k(x, x′))

Find an appropriate kernel:
(1) Encode characteristics
(2) Compose new kernels

Page 38:

The Automatic Statistician: Base kernels

Base kernel   Encoding function    Kernel function               Parameters
LIN(x, x′)    Linear function      σ²(x − ℓ)(x′ − ℓ)             σ, ℓ
SE(x, x′)     Smooth function      σ² exp(−(x − x′)²/(2ℓ²))      σ, ℓ
PER(x, x′)    Periodic function    (in appendix)                 σ, ℓ, p

[Figure columns: example kernel function shapes and example encoded functions]
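Under the kernel formulas in the table, the base kernels and the '+'/'×' composition operators of the next slides can be sketched directly. The periodic kernel below uses the standard exp-sine-squared form as an assumption, since the slide defers the exact formula to the appendix:

```python
import numpy as np

def lin(x1, x2, sigma=1.0, ell=0.0):
    """LIN(x, x') = σ² (x − ℓ)(x' − ℓ)."""
    return sigma ** 2 * np.outer(x1 - ell, x2 - ell)

def se(x1, x2, sigma=1.0, ell=1.0):
    """SE(x, x') = σ² exp(−(x − x')² / (2ℓ²))."""
    d = x1[:, None] - x2[None, :]
    return sigma ** 2 * np.exp(-d ** 2 / (2 * ell ** 2))

def per(x1, x2, sigma=1.0, ell=1.0, p=1.0):
    """Assumed standard periodic (exp-sine-squared) kernel."""
    d = np.abs(x1[:, None] - x2[None, :])
    return sigma ** 2 * np.exp(-2 * np.sin(np.pi * d / p) ** 2 / ell ** 2)

x = np.linspace(0.0, 2.0, 20)
K_add = se(x, x) + per(x, x)   # '+' superposes characteristics (OR)
K_mul = se(x, x) * per(x, x)   # '×' combines characteristics (AND)
```

Sums and elementwise products of valid covariance matrices are again valid (symmetric, positive semi-definite), which is what licenses the compositional search.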

Page 39:

The Automatic Statistician: Operators

Op.   Concept                           Params   Example
+     Addition / superposition (OR)     N/A      SE + PER, LIN + PER
×     Multiplication (AND)              N/A      SE × PER

[Figure columns: example kernel function shapes and example encoded functions]

Page 40:

The Automatic Statistician: Operators

Op.   Concept                   Params     Example
CP    Divide: left vs. right    ℓ, s       CP(LIN, LIN)
CW    Divide: in vs. out        ℓ, s, w    CW(LIN, C)

[Figure column: example encoded functions]

Page 41:

The Automatic statistician: Learning

(1) Optimization criterion: Bayesian Information Criterion (BIC)

BIC(ℳ) = −2 log P(D | ℳ) + |ℳ| log |D|

−2 log P(D | ℳ): negative log-likelihood (data fit)
|ℳ| log |D|: model complexity penalty, where |ℳ| = number of model parameters and |D| = number of data points
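The criterion itself is one line of code. A minimal sketch; the numeric likelihood values below are invented purely to illustrate the fit-vs-complexity trade-off:

```python
import numpy as np

def bic(neg_log_lik, n_params, n_data):
    """BIC(M) = −2 log P(D | M) + |M| log |D|.
    Lower is better: the first term rewards data fit,
    the second penalizes the number of kernel parameters."""
    return 2.0 * neg_log_lik + n_params * np.log(n_data)

# A kernel with one extra parameter must improve the fit enough
# to offset the complexity penalty (hypothetical numbers):
score_simple = bic(neg_log_lik=120.0, n_params=2, n_data=100)
score_rich = bic(neg_log_lik=118.0, n_params=3, n_data=100)
```

Here the richer kernel improves the fit by 2 nats but pays log(100) ≈ 4.6 in penalty, so the simpler kernel wins.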

Page 42:

The Automatic statistician: Learning

(1) Optimization criterion: Bayesian Information Criterion (BIC)

BIC(ℳ) = −2 log P(D | ℳ) + |ℳ| log |D|

(2) Learning algorithm (Composite Kernel Learning)

Iteratively select the best model (structure k, parameters θ):

(1) Expand: the current kernel
(2) Optimize: conjugate gradient descent
(3) Select: the best kernel in the level (greedy)
(4) Iterate: go back to (1) for the next level
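A toy version of the expand-optimize-select loop, with kernel expressions represented as strings and a stand-in scoring function; in the real system, scoring would optimize the hyperparameters and then evaluate BIC. Everything here, including the scoring function, is illustrative:

```python
BASE = ["LIN", "SE", "PER"]

def expand(expr):
    """All one-step expansions of the current kernel expression
    via the '+' and '×' grammar operators."""
    return ([f"({expr} + {b})" for b in BASE]
            + [f"({expr} * {b})" for b in BASE])

def greedy_search(score, levels=2, start="SE"):
    """Greedy structure search: keep the lowest-scoring candidate
    at each level (lower BIC is better)."""
    best = start
    for _ in range(levels):
        best = min(expand(best), key=score)
    return best

# Hypothetical scoring function: prefers short expressions that
# contain PER (imagine data with a periodic trend).
toy_score = lambda e: len(e) - 100 * ("PER" in e)
found = greedy_search(toy_score, levels=2)
```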

Page 43:

Example Result

[Figure: example decomposition:
a linear function (decrease of x/week),
a smooth function (length scale: y weeks),
and a rapidly varying smooth function (length scale: z hours).]

The Automatic Statistician*

* Automatic Bayesian Covariance Discovery (http://www.automaticstatistician.com/)

[Lloyd; Duvenaud; Grosse; Tenenbaum; Ghahramani. 2014.] [Ghahramani. Nature. 2015.]

Page 44:

Prediction Performance (Extrapolation)

13 famous regression datasets

Page 45:

The Relational Automatic Statistician

[Hwang, Tong and Choi, 2016]

* http://www.automaticstatistician.com/

Page 46:

Motivation

[Figure: adjusted close of General Electric around 9/11, 2001, decomposed by the Automatic Statistician into:
a linear function (decrease of x/week),
a smooth function (length scale: y weeks),
and a rapidly varying smooth function (length scale: z hours).]

The Automatic Statistician

Page 47:

Motivation

[Figure: adjusted closes of General Electric, Microsoft, and ExxonMobil around 9/11, 2001.]

• Exploit multiple time series
• Find global descriptions
• Hope for better predictive performance

The Relational Automatic Statistician

Page 48:

Model: Gaussian Process

[Graphical model: latent function → latent function evaluations → observations;
mean function fixed, covariance kernel function given and optimized.]

P(D | ℳ) = P(D | GP(0, k(x, x′; θ)))

Gaussian Processes → CKL → RKL → SRKL

Page 49:

Model: Composite Kernel Learning (CKL)

[Graphical model: latent function → latent function evaluations → observations;
mean function fixed, covariance kernel function built from a grammar and optimized.]

P(D | ℳ) = P(D | GP(0, k(x, x′; θ)))

GPs → Composite Kernel Learning → RKL → SRKL

Generalized Multi Kernel Learning

Page 50:

Model: Relational Kernel Learning (RKL)

[Graphical model: latent function → latent function evaluations → observations;
mean function fixed, covariance kernel function from a grammar;
per-series scale parameters σ_j and shift parameters c_j optimized.]

P(D | ℳ) = ∏_{j=1}^M P(d_j | GP(0, σ_j × k(x, x′; θ) + c_j))

GPs → CKL → Relational Kernel Learning → SRKL

[Figure: the same shared component at different scales and shifts across series.]

Page 51:

Model: Semi-Relational Kernel Learning (SRKL)

[Graphical model: latent function → latent function evaluations → observations;
mean function fixed, shared kernel from a grammar with scale parameter σ_j,
plus a distinctive per-series kernel k_j (fixed spectral mixture form, optimized).]

P(D | ℳ) = ∏_{j=1}^M P(d_j | GP(0, σ_j × k(x, x′; θ) + k_j(x, x′; θ_j)))

GPs → CKL → RKL → Semi-Relational Kernel Learning

Page 52:

Semi-Relational Kernel Learning (SRKL)

Input: M time series
Output: a shared kernel k and M spectral mixture (SM) kernels k_j

(1) Expand: the current shared kernel for all time series
(2) Optimize: the expanded kernels for all M time series (conjugate gradient descent);
    for each series, individual distinctions are handled by the SM kernel
(3) Select: the best shared kernel for all time series (greedy)
(4) Iterate: go back to (1) until the level limit s is reached

Find the best shared and distinctive kernels iteratively!

Page 53:

Experimental Results

Three real-world data sets:

US top 9 stocks in year 2001
US top 6 housing markets from 2003 to 2013
Currency exchange rates of 4 emerging markets

Page 54:

Data sets

Descriptions                            Details
9 adjusted closes of stock figures      GE, MSFT, XOM, PFE, C, WMT, INTC, BP, AIG
6 US housing price indices              New York, Los Angeles, Chicago, Phoenix, San Diego, San Francisco
4 emerging market currency exchanges    Indonesian (IDR), Malaysian (MYR), South African (ZAR), Russian (RUB)

[Graphs column: normalized time series plots]

Page 55:

Qualitative Results

[Figures: adjusted closes with learned components 1 and 2, Automatic Statistician vs. Relational Automatic Statistician; 4 currency exchange rates with a learned component.]

US stock market values suddenly drop after the US 9/11 attacks.
Currency exchange rates are affected by the FED's policy change in interest rates around mid-September 2015.

Page 56:

An Automatically Generated Report

Page 57:

Quantitative Results

STOCK3 = GE, MSFT, XOM
STOCK6 = STOCK3 + PFE, C, WMT
STOCK9 = STOCK6 + INTC, BP, AIG
HOUSE2 = NY, LA
HOUSE4 = HOUSE2 + Chicago, Phoenix
HOUSE6 = HOUSE4 + San Diego, San Francisco
CURRENCY4 = IDR, MYR, ZAR, RUB

Page 58:

Quantitative Results (box plots)

[Box plots: 9 stocks · 6 house price indices · 4 currency exchanges]

Page 59:

Conclusion

• Our research topic: solving the descriptive prediction problem for multiple time series.

• We proposed models that exploit both common and distinct changes.

• We found that our models show better qualitative and quantitative performance compared to the state-of-the-art GP regression method.

http://saildemo.unist.ac.kr/automatic_statistician/

Page 60:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior · Dirichlet Processes · Chinese Restaurant Processes · Beta Processes · Indian Buffet Processes · Nonparametric Hierarchical Models

Page 61:

Introduction

Simple Example

Task: Toss a (potentially biased) coin N times. Compute θ, the probability of heads.

Suppose we observe: T, H, H, T. What do we think θ is? The maximum likelihood estimate is θ = 1/2. Seems reasonable.

Now suppose we observe: H, H, H, H. What do we think θ is? The maximum likelihood estimate is θ = 1. Does that seem reasonable?

Not really. Why?

Kurt T. Miller · Nonparametric Bayes · 9

Page 63:

Introduction

Simple Example

When we observe H, H, H, H, why does θ = 1 seem unreasonable?

Prior knowledge! We believe coins generally have θ ≈ 1/2. How to encode this? By using a Beta prior on θ.


Page 65:

Introduction

Bayesian Approach to Estimating θ

Place a Beta(a, b) prior on θ. This prior has the form

p(θ) ∝ θ^{a−1}(1 − θ)^{b−1}.

What does this distribution look like?

[Plot: Beta densities for (α1, α2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0)]

Page 67:

Introduction

Bayesian Approach to Estimating θ

After observing X, a sequence with n heads and m tails, the posterior on θ is:

p(θ|X) ∝ p(X|θ) p(θ) ∝ θ^{a+n−1}(1 − θ)^{b+m−1}

i.e., θ|X ∼ Beta(a + n, b + m).

If a = b = 1 and we observe 5 heads and 2 tails, Beta(6, 3) looks like

[Plot: the Beta(6, 3) posterior density]
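The conjugate update above is easy to check numerically. A minimal sketch with scipy, mirroring the slide's a = b = 1, 5 heads, 2 tails example (nothing here is specific to the talk's own code):

```python
import numpy as np
from scipy import stats

# Prior Beta(a, b); observed data: n heads, m tails.
a, b = 1.0, 1.0
n, m = 5, 2

# Conjugacy: the posterior is Beta(a + n, b + m) = Beta(6, 3).
posterior = stats.beta(a + n, b + m)

# The posterior mean (a + n) / (a + b + n + m) shrinks the MLE
# n / (n + m) toward the prior mean a / (a + b) = 1/2.
post_mean = posterior.mean()
mle = n / (n + m)
print(post_mean, mle)  # 0.666..., 0.714...
```

Note that the posterior mean sits strictly between the prior mean and the MLE, which is exactly the regularizing effect the slides motivate.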

Page 69:

Parametric Bayesian Clustering

The Dirichlet Distribution

We had π ∼ Dirichlet(α1, . . . , αK)

The Dirichlet density is defined as

p(π|α) = Γ(∑_{k=1}^K αk) / (∏_{k=1}^K Γ(αk)) · π1^{α1−1} π2^{α2−1} · · · πK^{αK−1}

where πK = 1 − ∑_{k=1}^{K−1} πk.

The expectations of π are

E(πi) = αi / ∑_{k=1}^K αk
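The mean formula can be checked empirically with numpy's Dirichlet sampler; the choice α = (2, 2, 5) below is just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0, 5.0])

# Each draw pi ~ Dir(alpha) lies on the probability simplex.
samples = rng.dirichlet(alpha, size=100_000)

# Empirical means approach E(pi_i) = alpha_i / sum_k alpha_k.
print(samples.mean(axis=0))  # ~ [2/9, 2/9, 5/9]
```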

Page 70:

Parametric Bayesian Clustering

The Beta Distribution

A special case of the Dirichlet distribution is the Beta distribution, obtained when K = 2:

p(π|α1, α2) = Γ(α1 + α2) / (Γ(α1)Γ(α2)) · π^{α1−1}(1 − π)^{α2−1}

[Plot: Beta densities for (α1, α2) = (1.0, 0.1), (1.0, 1.0), (1.0, 5.0), (1.0, 10.0), (9.0, 3.0)]

Page 71:

Parametric Bayesian Clustering

The Dirichlet Distribution

In three dimensions:

p(π|α1, α2, α3) = Γ(α1 + α2 + α3) / (Γ(α1)Γ(α2)Γ(α3)) · π1^{α1−1} π2^{α2−1} (1 − π1 − π2)^{α3−1}

[Density plots on the simplex for α = (2, 2, 2), (5, 5, 5), and (2, 2, 5)]

Page 72:

Parametric Bayesian Clustering

Draws from the Dirichlet Distribution

[Bar plots: three draws of π each from Dir(2, 2, 2), Dir(5, 5, 5), and Dir(2, 2, 5)]

Page 73:

Parametric Bayesian Clustering

Key Property of the Dirichlet Distribution

The Aggregation Property: If

(π1, . . . , πi, πi+1, . . . , πK) ∼ Dir(α1, . . . , αi, αi+1, . . . , αK)

then

(π1, . . . , πi + πi+1, . . . , πK) ∼ Dir(α1, . . . , αi + αi+1, . . . , αK)

This is also valid for any aggregation:

(π1 + π2, ∑_{k=3}^K πk) ∼ Dir(α1 + α2, ∑_{k=3}^K αk), i.e., π1 + π2 ∼ Beta(α1 + α2, ∑_{k=3}^K αk)
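The aggregation property can also be verified by simulation; a sketch with an arbitrary α = (1, 2, 3, 4):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([1.0, 2.0, 3.0, 4.0])
samples = rng.dirichlet(alpha, size=100_000)

# Aggregation: pi1 + pi2 should be Beta(alpha1 + alpha2, alpha3 + alpha4)
# = Beta(3, 7), whose mean is 3/10 and variance 21/1100.
agg = samples[:, 0] + samples[:, 1]
print(agg.mean(), agg.var())  # ~ 0.3, ~ 0.0191
```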

Page 74:

Parametric Bayesian Clustering

Multinomial-Dirichlet Conjugacy

Let Z ∼ Multinomial(π) and π ∼ Dir(α).

Posterior:

p(π|z) ∝ p(z|π) p(π)
= (π1^{z1} · · · πK^{zK}) (π1^{α1−1} · · · πK^{αK−1})
= π1^{z1+α1−1} · · · πK^{zK+αK−1}

which is Dir(α + z).
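In code, this conjugate update is literally "add the counts to the prior"; a minimal sketch (the counts z are made up for illustration):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])  # symmetric Dirichlet prior over 3 outcomes
z = np.array([5, 2, 3])            # observed multinomial counts

# Posterior is Dir(alpha + z); its mean gives smoothed frequencies.
alpha_post = alpha + z
post_mean = alpha_post / alpha_post.sum()
print(post_mean)  # [6/13, 3/13, 4/13]
```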

Page 75:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models

Page 76:

The Dirichlet Process Model

Parameters for the Dirichlet Process

• α - The concentration parameter.

• G0 - The base measure. A prior distribution for the cluster-specific parameters.

The Dirichlet Process (DP) is a distribution over distributions. We write

G ∼ DP (α,G0)

to indicate G is a distribution drawn from the DP.

It will become clearer in a bit what α and G0 are.


Page 77:

The Dirichlet Process Model

The Dirichlet Process

Definition: Let G0 be a probability measure on the measurable space (Ω, B) and α ∈ R+.

The Dirichlet Process DP (α,G0) is the distribution on probabilitymeasures G such that for any finite partition (A1, . . . , Am) of Ω,

(G(A1), . . . , G(Am)) ∼ Dir(αG0(A1), . . . , αG0(Am)).

[Figure: a finite partition A1, . . . , A5 of Ω]

(Ferguson, ’73)

Page 78:

The Dirichlet Process Model

Mathematical Properties of the Dirichlet Process

Suppose we sample

• G ∼ DP (α,G0)

• θ1 ∼ G

What is the posterior distribution of G given θ1?

G|θ1 ∼ DP(α + 1, α/(α+1) · G0 + 1/(α+1) · δ_{θ1})

More generally

G|θ1, . . . , θn ∼ DP(α + n, α/(α+n) · G0 + 1/(α+n) · ∑_{i=1}^n δ_{θi})

Page 80:

The Dirichlet Process Model

Mathematical Properties of the Dirichlet Process

With probability 1, a sample G ∼ DP(α, G0) is of the form

G = ∑_{k=1}^∞ πk δ_{φk}

(Sethuraman, ’94)

Page 81:

The Dirichlet Process Model

The Stick-Breaking Process

• Define an infinite sequence of Beta random variables:

βk ∼ Beta(1, α) k = 1, 2, . . .

• And then define an infinite sequence of mixing proportions as:

π1 = β1
πk = βk ∏_{l=1}^{k−1} (1 − βl)   k = 2, 3, . . .

• This can be viewed as breaking off portions of a unit-length stick:

[Figure: stick of length 1 broken into π1 = β1, π2 = β2(1 − β1), . . . ]

• When π are drawn this way, we can write π ∼ GEM(α).
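Sampling the first few GEM weights takes a few lines of numpy; a sketch (truncating at k_max is an approximation for illustration, not part of the definition):

```python
import numpy as np

def gem_weights(alpha, k_max, rng):
    """First k_max stick-breaking weights of pi ~ GEM(alpha)."""
    beta = rng.beta(1.0, alpha, size=k_max)          # beta_k ~ Beta(1, alpha)
    # Length of stick remaining before break k: prod_{l<k} (1 - beta_l).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
    return beta * remaining                          # pi_k

rng = np.random.default_rng(2)
pi = gem_weights(alpha=2.0, k_max=200, rng=rng)
print(pi.sum())  # partial sums approach 1 as k_max grows
```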

Page 82:

The Dirichlet Process Model

The Stick-Breaking Process

• We now have an explicit formula for each πk: πk = βk ∏_{l=1}^{k−1} (1 − βl)

• We can also easily see that ∑_{k=1}^∞ πk = 1 (w.p. 1):

1 − ∑_{k=1}^K πk = 1 − β1 − β2(1 − β1) − β3(1 − β1)(1 − β2) − · · ·
= (1 − β1)(1 − β2 − β3(1 − β2) − · · · )
= ∏_{k=1}^K (1 − βk)
→ 0 (w.p. 1 as K → ∞)

• So now G = ∑_{k=1}^∞ πk δ_{φk} has a clean definition as a random measure

Page 83:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models

Page 84:

The Dirichlet Process Model

The Chinese Restaurant Process (CRP)

• A random process in which n customers sit down in a Chinese restaurant with an infinite number of tables

• The first customer sits at the first table
• The mth subsequent customer sits at a table drawn from the following distribution:

P(previously occupied table i | Fm−1) ∝ ni
P(the next unoccupied table | Fm−1) ∝ α

where ni is the number of customers currently at table i and where Fm−1 denotes the state of the restaurant after m − 1 customers have been seated
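The seating rule above translates directly into a simulator; a minimal sketch:

```python
import numpy as np

def crp(n_customers, alpha, rng):
    """Simulate CRP(alpha) table assignments for n_customers."""
    counts = []                                   # n_i per occupied table
    assignments = []
    for _ in range(n_customers):
        # P(existing table i) ∝ n_i,  P(new table) ∝ alpha.
        weights = np.array(counts + [alpha])
        table = rng.choice(len(weights), p=weights / weights.sum())
        if table == len(counts):
            counts.append(1)                      # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

rng = np.random.default_rng(3)
seating, counts = crp(100, alpha=1.0, rng=rng)
print(len(counts))  # number of occupied tables; grows like alpha * log(n)
```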

Page 85:

The Dirichlet Process Model

The CRP and Clustering

• Data points are customers; tables are clusters

• the CRP defines a prior distribution on the partitioning of the data and on the number of tables

• This prior can be completed with:

• a likelihood—e.g., associate a parameterized probability distribution with each table

• a prior for the parameters—the first customer to sit at table k chooses the parameter vector for that table (φk) from the prior

[Figure: occupied tables with parameters φ1, φ2, φ3, φ4]

• So we now have a distribution—or can obtain one—for any quantity that we might care about in the clustering setting

Page 86:

The Dirichlet Process Model

The CRP Prior, Gaussian Likelihood, Conjugate Prior


Page 87:

The Dirichlet Process Model

The CRP and the DP

OK, so we’ve seen how the CRP relates to clustering. How does it relate to the DP?

Important fact: The CRP is exchangeable.

Remember De Finetti’s Theorem: If (x1, x2, . . .) are infinitely exchangeable, then ∀n

p(x1, . . . , xn) = ∫ ∏_{i=1}^n p(xi|G) dP(G)

for some random variable G.

Page 89:

The Dirichlet Process Model

The CRP and the DP

The Dirichlet Process is the De Finetti mixing distribution for the CRP.

That means, when we integrate out G, we get the CRP.

p(θ1, . . . , θn) = ∫ ∏_{i=1}^n p(θi|G) dP(G)

[Graphical model: α, G0 → G → θi → xi, i = 1, . . . , N]

Page 91:

The Dirichlet Process Model

The CRP and the DP

The Dirichlet Process is the De Finetti mixing distribution for the CRP.

In English, this means that if the DP is the prior on G, then the CRP defines how points are assigned to clusters when we integrate out G.

Page 92:

The Dirichlet Process Model

The DP, CRP, and Stick-Breaking Process Summary

[Graphical model: G ∼ DP(α, G0); θi ∼ G; xi ∼ p(·|θi), i = 1, . . . , N]

The Stick-Breaking Process gives just the weights of G.

The CRP describes the partitions of θ when G is marginalized out.

Page 93:

Beta Processes

Definition [Hjort, ’90]: Let H0 be a continuous probability measure on (Ω, B) and α ∈ R+. Then the Beta Process BP(α, H0) is the distribution on random measures H such that any (disjoint) finite partition (A1, · · · , AK) of Ω satisfies

H(Ai) ∼ Beta(αH0(Ai), α(1 − H0(Ai)))

with K → ∞ and H0(Ai) → 0 for i = 1, · · · , K.

The beta process can be written in set-function form,

H(ω) = ∑_{k=1}^∞ πk δ_{ωk}(ω)

Page 94:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models

Page 95:

Indian Buffet Processes

Latent feature models

Clustering is not enough for mixed memberships
Grouping problem with overlapping clusters
Encode as a binary matrix: observation n in cluster k ⇔ xnk = 1
Alternatively: item n possesses feature k ⇔ xnk = 1

Indian buffet process (IBP)

1. Customer 1 tries Poisson(α) dishes.
2. Subsequent customer n + 1:
   tries a previously tried dish k with probability nk/(n + 1)
   tries Poisson(α/(n + 1)) new dishes.

Properties

An exchangeable distribution over finite sets (of dishes).
Observation (= customer) n is in cluster (= dish) k if customer n ‘tries dish k’
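The two-step culinary rule above can be simulated directly; a sketch that builds the binary feature matrix row by row (α = 3 and 50 customers are arbitrary illustration values):

```python
import numpy as np

def ibp(n_customers, alpha, rng):
    """Simulate a binary feature matrix X from the IBP."""
    n_k = []                                  # times each dish has been tried
    rows = []
    for n in range(n_customers):              # customer n + 1 (0-indexed n)
        row = []
        # Revisit a previously tried dish k with probability n_k / (n + 1).
        for count in n_k:
            row.append(int(rng.random() < count / (n + 1)))
        # Then try Poisson(alpha / (n + 1)) brand-new dishes.
        new = rng.poisson(alpha / (n + 1))
        row.extend([1] * new)
        n_k = [c + row[k] for k, c in enumerate(n_k)] + [1] * new
        rows.append(row)
    X = np.zeros((n_customers, len(n_k)), dtype=int)
    for i, row in enumerate(rows):
        X[i, :len(row)] = row
    return X

rng = np.random.default_rng(4)
X = ibp(50, alpha=3.0, rng=rng)
print(X.shape)  # (50, K), with K random; E[K] = alpha * sum_{n=1}^{50} 1/n
```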

Page 96:

Indian Buffet Process

Alternative description

1. Sample w1, · · · , wK ∼iid Beta(α/K, 1)
2. Sample x1k, · · · , xnk ∼iid Bernoulli(wk)

Beta Process (BP)

Distribution on objects of the form

θ = ∑_{k=1}^∞ wk δ_{φk}   with wk ∈ [0, 1].

IBP matrix entries are sampled as xnk ∼iid Bernoulli(wk).
The Beta process is the de Finetti measure of the IBP.
θ is a random measure

Page 97:

Binary matrices in left-order form


Page 98:

Contents

1 Introduction

2 Dirichlet Processes - Nonparametric Clustering

3 Gaussian Processes - Nonparametric Regression

4 Case Study: The Relational Automatic Statistician System

5 Appendix: Conjugate Prior, Dirichlet Processes, Chinese Restaurant Processes, Beta Processes, Indian Buffet Processes, Nonparametric Hierarchical Models

Page 99:

Hierarchical Gaussian Processes

Apply the Bayesian representation recursively. Split the parameter Θ into

Θ → Ψ and Θ|Ψ

so that

p(data, Θ) = p(data|Θ) p(Θ)
p(data, Θ, Ψ) = p(data|Θ) p(Θ|Ψ) p(Ψ)

Example: Hierarchical Gaussian process

Sample Ψ ∼ p(Ψ) (e.g., large length-scale, mean 0)

Sample Θ|Ψ ∼ p(·|Ψ) (e.g., smaller length-scale, mean Ψ)

Decompose underlying pattern:

Low-frequency component Ψ

High-frequency component Θ

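This two-level decomposition can be sketched with plain numpy. The squared-exponential kernel and the specific length-scales below are illustrative assumptions, not the talk's exact model:

```python
import numpy as np

def se_kernel(x, lengthscale, variance=1.0):
    """Squared-exponential covariance matrix over inputs x."""
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 200)
jitter = 1e-8 * np.eye(len(x))  # numerical stabilizer for sampling

# Low-frequency component: Psi ~ GP(0, k_long), large length-scale.
psi = rng.multivariate_normal(np.zeros(len(x)),
                              se_kernel(x, lengthscale=5.0) + jitter)

# High-frequency component: Theta | Psi ~ GP(Psi, k_short),
# smaller length-scale and variance, mean Psi.
theta = rng.multivariate_normal(psi,
                                se_kernel(x, lengthscale=0.5, variance=0.1) + jitter)

# theta decomposes into a smooth trend (psi) plus local variation (theta - psi).
print(theta.shape)
```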

Page 100:

Hierarchical Dirichlet Processes

Sampling scheme

Sample G0 ∼ DP(γ, H)
Sample G1, G2, · · · ∼ DP(α, G0)
Sample xij ∼ Gj

G1, G2, · · · have a common ‘vocabulary’ of atoms.

Nonparametric Latent Dirichlet Allocation (LDA)

G0 = ∑_{k=1}^∞ ck δ_{θ∗k}   Gj = ∑_{l=1}^∞ D_l^j δ_{Φ_l^j}

θ∗k = finite probability (= ‘topic’)
ck = occurrence probability of topic k
Document j is drawn from a weighted combination of topics, with proportions D_l^j (‘admixture model’)

