Page 1: Non-parametric Bayesian Learning in Discrete Data

Non-parametric Bayesian Learning in Discrete Data

Yueshen Xu, [email protected] / [email protected]

Middleware, CCNT, ZJU

Statistics & Computational Linguistics, 5/10/2016

Page 2: Non-parametric Bayesian Learning in Discrete Data

Outline

Bayes' Rule

Parametric Bayesian Learning
  Concept & Example
  Discrete & Continuous Data
  Text Clustering & Topic Modeling
  Pros and Cons
  Some Important Concepts

Non-parametric Bayesian Learning
  Dirichlet Process and Process Construction
  Dirichlet Process Mixture
  Hierarchical Dirichlet Process
  Chinese Restaurant Process
  Example: Hierarchical Topic Modeling

Markov Chain Monte Carlo

Reference

Discussion

Page 3: Non-parametric Bayesian Learning in Discrete Data

Bayes' Rule

Posterior ∝ Likelihood × Prior

$$p(\mathrm{Hypothesis} \mid \mathrm{Data}) = \frac{p(\mathrm{Data} \mid \mathrm{Hypothesis})\, p(\mathrm{Hypothesis})}{p(\mathrm{Data})}$$

i.e. Posterior = Likelihood × Prior / Evidence.

Bayes' rule updates beliefs in hypotheses in response to data.

Parametric vs. non-parametric: whether the structure of the hypothesis space is constrained or not (examples follow later).

The prior expresses your confidence in a hypothesis before seeing the data.
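To make the update concrete, here is a minimal numeric sketch in Python; the two coin hypotheses and their probabilities are invented purely for illustration.

```python
# A minimal numeric illustration of Bayes' rule (hypothetical numbers):
# two hypotheses about a coin, updated after observing one head.
priors = {"fair": 0.5, "biased": 0.5}            # p(Hypothesis)
likelihood_heads = {"fair": 0.5, "biased": 0.9}  # p(Data | Hypothesis)

# Evidence p(Data) = sum over hypotheses of likelihood * prior
evidence = sum(likelihood_heads[h] * priors[h] for h in priors)

# Posterior p(Hypothesis | Data) = likelihood * prior / evidence
posterior = {h: likelihood_heads[h] * priors[h] / evidence for h in priors}
print(posterior)  # {'fair': 0.357..., 'biased': 0.642...}
```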

Page 4: Non-parametric Bayesian Learning in Discrete Data

Parametric Bayesian Learning

$$p(\mathrm{Hypothesis} \mid \mathrm{Data}) \propto p(\mathrm{Data} \mid \mathrm{Hypothesis})\, p(\mathrm{Hypothesis})$$

The hypothesis may be parametric or non-parametric.

The evidence is a fact about the data: a constant carrying no probabilistic structure, so dropping it and working up to proportionality is a commonly used trick.

Non-parametric != no parameters.

Hyper-parameters are the parameters of prior distributions; keep parameters distinct from variables. In the Dirichlet density

$$\mathrm{Dir}(\theta \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}, \qquad \alpha_0 = \sum_{k=1}^{K} \alpha_k,$$

θ is the variable and α the hyper-parameter, while in p(θ|X) ∝ p(X|θ)p(θ), θ plays the role of the parameter.

Page 5: Non-parametric Bayesian Learning in Discrete Data

Parametric Bayesian Learning

Some examples:

Clustering: K-Means/Medoid, NMF
Topic Modeling: LSA, pLSA, LDA
Hierarchical Concept Building

Page 6: Non-parametric Bayesian Learning in Discrete Data

Parametric Bayesian Learning

Serious problems: how could we know

the number of clusters?
the number of topics?
the number of layers?

Heuristic pre-processing? Guessing and tuning?

Page 7: Non-parametric Bayesian Learning in Discrete Data

Parametric Bayesian Learning

Some basics: discrete data vs. continuous data.

Discrete data: text, modeled as natural numbers.

Continuous data: stock prices, trading volumes, signals, quality scores, ratings, modeled as real numbers.

Some important concepts (also used in the non-parametric case):

Discrete (categorical) distribution: $X_i \mid \theta \sim \mathrm{Discrete}(\theta)$

$$p(X \mid \theta) = \prod_{i=1}^{n} \mathrm{Discrete}(X_i; \theta) = \prod_{j=1}^{m} \theta_j^{N_j}$$

Multinomial distribution: $N \mid n, \theta \sim \mathrm{Multi}(\theta, n)$

$$p(N \mid n, \theta) = \frac{n!}{\prod_{j=1}^{m} N_j!} \prod_{j=1}^{m} \theta_j^{N_j}$$

Computer scientists often mix the two up: they share the kernel $\prod_j \theta_j^{N_j}$, but the multinomial adds the combinatorial factor over count vectors.
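A small sketch makes the distinction concrete (NumPy, with an assumed three-outcome θ): the categorical view keeps the individual draws, the multinomial view keeps only their counts.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, 0.3, 0.2])  # theta_j, sums to 1
n = 10

# Discrete (categorical): n individual draws X_i in {0, 1, 2}
draws = rng.choice(len(theta), size=n, p=theta)

# Multinomial: only the count vector N = (N_1, ..., N_m) of such draws
counts = rng.multinomial(n, theta)

# Both share the likelihood kernel prod_j theta_j^{N_j}; the multinomial
# pmf just adds the combinatorial factor n! / prod_j N_j!
print(draws, counts)
```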

Page 8: Non-parametric Bayesian Learning in Discrete Data

Parametric Bayesian Learning

Some important concepts (cont.)

Dirichlet distribution: $\theta \mid \boldsymbol{\alpha} \sim \mathrm{Dir}(\boldsymbol{\alpha})$

$$\mathrm{Dir}(\theta \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$$

Conjugate prior: if the posterior p(θ|X) is in the same family as the prior p(θ), the prior is called a conjugate prior of the likelihood p(X|θ).

Examples:

Binomial distribution ←→ Beta distribution
Multinomial distribution ←→ Dirichlet distribution

From p(θ|N, α) ∝ p(N|θ) p(θ|α):

$$p(\theta \mid \boldsymbol{N}, \boldsymbol{\alpha}) = \mathrm{Dir}(\theta \mid \boldsymbol{N} + \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0 + N)}{\Gamma(\alpha_1 + N_1)\cdots\Gamma(\alpha_K + N_K)} \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}, \qquad N = \sum_{k=1}^{K} N_k.$$

Why is it better for the prior and the posterior to be conjugate distributions?
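One answer to that question: with a conjugate prior the posterior update is pure bookkeeping, adding observed counts to the hyper-parameters. A minimal sketch with invented counts:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet hyper-parameters (assumed symmetric)
counts = np.array([6, 3, 1])        # observed multinomial counts N (toy numbers)

# Conjugacy: Dir(alpha) prior + multinomial counts N -> Dir(alpha + N) posterior
posterior_alpha = alpha + counts

# Posterior mean of theta_k is (alpha_k + N_k) / (alpha_0 + N)
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(posterior_mean)  # [0.538..., 0.307..., 0.153...]
```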

Page 9: Non-parametric Bayesian Learning in Discrete Data

Parametric Bayesian Learning

Some important concepts (cont.)

Probabilistic graphical models: modeling Bayesian networks using plates and circles.

Generative model vs. discriminative model (both target p(θ|X)):

Generative model: p(θ|X) ∝ p(X|θ)p(θ). Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, … — typically unsupervised learning.

Discriminative model: models p(θ|X) directly. LR, KNN, SVM, Boosting, Decision Tree, … — typically supervised learning. These also have graphical-model representations.

Page 10: Non-parametric Bayesian Learning in Discrete Data

Non-parametric Bayesian Learning

When we talk about non-parametric methods, what do we usually talk about?

Discrete data: Dirichlet Distribution, Dirichlet Process, Chinese Restaurant Process, Pólya Urn, Pitman-Yor Process, Hierarchical Dirichlet Process, Dirichlet Process Mixture, Dirichlet Process Multinomial Model, Clustering, …

Continuous data: Gaussian Distribution, Gaussian Process, Regression, Classification, Factorization, Gradient Descent, Covariance Matrix, Brownian Motion, …

The common theme: the models are infinite-dimensional.

Page 11: Non-parametric Bayesian Learning in Discrete Data

Non-parametric Bayesian Learning

Dirichlet Process [Yee Whye Teh et al.]

Let $G_0$ be a probability measure/distribution (the base distribution) and $\alpha_0$ a positive real number. A probability distribution $G$ satisfies $G \sim \mathrm{DP}(\alpha_0, G_0)$ if and only if, for every finite partition $(A_1, A_2, \ldots, A_r)$ of the space,

$$(G(A_1), \ldots, G(A_r)) \sim \mathrm{Dir}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_r)).$$

Which exact distribution is $G_0$? We don't know. Which exact distribution is $G$? We don't know either.

Page 12: Non-parametric Bayesian Learning in Discrete Data

Non-parametric Bayesian Learning

Where does the infinity show up? In the construction of the DP: we need an explicit construction, since the definition alone does not tell us how to draw a $G$.

Constructions: stick-breaking, Pólya urn scheme, Chinese restaurant process.

Stick-breaking construction: draw i.i.d. sequences $(\beta_k)_{k=1}^{\infty}$ and $(\phi_k)_{k=1}^{\infty}$ with

$$\beta_k \mid \alpha_0 \sim \mathrm{Beta}(1, \alpha_0), \qquad \phi_k \mid G_0 \sim G_0,$$

$$\pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l), \qquad G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}.$$

Here $\sum_{k=1}^{\infty} \pi_k = 1$, $\delta_{\phi_k}$ is a point mass at $\phi_k$, $\pi_k$ is the probability of atom $\phi_k$, and $(\pi_k)$ defines a distribution over the positive integers.

Why a DP? …
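A truncated stick-breaking sampler, as a sketch: the infinite sum is cut at a finite truncation level, and the standard normal base distribution G0 is an assumption made purely for illustration.

```python
import numpy as np

def stick_breaking(alpha0, base_draw, truncation=1000, rng=None):
    """Truncated stick-breaking draw of G ~ DP(alpha0, G0).

    base_draw(rng) samples one atom phi_k from the base distribution G0.
    Returns atoms (phi_k) and weights pi_k = beta_k * prod_{l<k}(1 - beta_l).
    """
    if rng is None:
        rng = np.random.default_rng()
    betas = rng.beta(1.0, alpha0, size=truncation)
    # Remaining stick lengths: 1, (1-b1), (1-b1)(1-b2), ...
    pis = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    phis = np.array([base_draw(rng) for _ in range(truncation)])
    return phis, pis

# Example with a standard normal base distribution G0 (an illustrative assumption)
phis, pis = stick_breaking(alpha0=2.0, base_draw=lambda r: r.normal())
print(pis[:5], pis.sum())  # weights decay; the sum approaches 1 as truncation grows
```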

Page 13: Non-parametric Bayesian Learning in Discrete Data

Non-parametric Bayesian Learning

Chinese Restaurant Process

A restaurant has an infinite number of tables, and customers (words, each generated from a $\theta_i$, one-to-one) enter the restaurant sequentially. The $i$-th customer ($\theta_i$) sits at a table ($\phi_k$) according to the standard CRP probabilities

$$p(\text{occupied table } k) = \frac{n_k}{i - 1 + \alpha_0}, \qquad p(\text{new table}) = \frac{\alpha_0}{i - 1 + \alpha_0},$$

where $n_k$ is the number of customers already seated at table $k$.

The tables $\phi_k$ are clusters, and clustering covers perhaps two-thirds of unsupervised learning: clustering itself, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
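A minimal CRP simulation along the lines above (the helper name `crp` and the settings are illustrative choices):

```python
import numpy as np

def crp(n_customers, alpha0, rng=None):
    """Simulate table assignments under a Chinese Restaurant Process."""
    if rng is None:
        rng = np.random.default_rng()
    tables = []          # tables[k] = number of customers at table k
    assignments = []
    for i in range(n_customers):
        # Occupied table k with prob n_k/(i + alpha0), new table with prob alpha0/(i + alpha0)
        probs = np.array(tables + [alpha0]) / (i + alpha0)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)     # open a new table
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables

assignments, tables = crp(100, alpha0=1.0)
print(len(tables), tables)  # the number of tables (clusters) grows roughly like alpha0 * log(n)
```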

Page 14: Non-parametric Bayesian Learning in Discrete Data

Non-parametric Bayesian Learning

Dirichlet Process Mixture (DPM)

A DP alone is not enough: draws from $G$ repeat its atoms exactly (cloning), whereas we want observations that are similar but not identical. Hence mixture models. (You can draw the graphical model yourself.)

Mixture models: an observation is generated from a mixture/group of variables (usually latent variables): GMM, LDA, pLSA, …

DPM: $\theta_i \mid G \sim G$, $x_i \mid \theta_i \sim F(\theta_i)$. For text data, $F(\theta_i)$ is Discrete/Multinomial.

This is intuitive but not by itself helpful; we again rely on the stick-breaking construction:

$$\beta_k \mid \alpha_0 \sim \mathrm{Beta}(1, \alpha_0), \qquad \phi_k \mid G_0 \sim G_0,$$

$$\pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l), \qquad G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}.$$

Page 15: Non-parametric Bayesian Learning in Discrete Data

Non-parametric Bayesian Learning

Dirichlet Process Mixture (DPM)

Its finite counterpart is the Dirichlet Multinomial Mixture Model (DMMM).

What can the DMMM do? Clustering of sparse count vectors such as

(0,0,0,Caption,0,0,0,0,0,0,USA,0,0,0,0,0,0,0,0,0,Action,0,0,0,0,0,0,0,Hero,0,0,0,0,0,0,…)

Page 16: Non-parametric Bayesian Learning in Discrete Data

Non-parametric Bayesian Learning

Hierarchical Dirichlet Process (HDP)

HDP: each group $j$ draws its own $G_j \sim \mathrm{DP}(\alpha_0, G_0)$, where the shared base measure $G_0$ is itself drawn from a DP; then $\theta_{ji} \mid G_j \sim G_j$ and $x_{ji} \mid \theta_{ji} \sim F(\theta_{ji})$.

Construction: stick-breaking again, applied at both levels (see Teh et al.).

A very natural model for the statistics folks, but for us computer folks… hehe….

Its finite counterpart with $F$ multinomial is LDA; equivalently, LDA is a hierarchical Dirichlet multinomial mixture model.

Page 17: Non-parametric Bayesian Learning in Discrete Data

Non-parametric Bayesian Learning

Hierarchical Topic Modeling

What can we get from reviews, blogs, question answering, Twitter, news, …? Only topics? Far from enough.

What we really need is a hierarchy illustrating what exactly the text tells people.

Page 18: Non-parametric Bayesian Learning in Discrete Data

Non-parametric Bayesian Learning

Hierarchical Topic Modeling

Prior: Nested CRP/DP (nCRP) [Blei and Jordan, NIPS 2004]

nCRP: at the 1st level there is one restaurant with a single table, which is linked to an infinite number of tables at the 2nd level. Each table at the 2nd level is in turn linked to an infinite number of tables at the 3rd level, and so on: a Matryoshka doll.

The CRP is the prior used to choose a table at each level, forming a path: one document, one path.

Page 19: Non-parametric Bayesian Learning in Discrete Data

Non-parametric Bayesian Learning

Hierarchical Topic Modeling

Generative process (see the sketch after this list):

1. Let $c_1$ be the root restaurant (only one table).
2. For each level $l \in \{2, \ldots, L\}$: draw a table from restaurant $c_{l-1}$ using the CRP, and set $c_l$ to the restaurant referred to by that table.
3. Draw an $L$-dimensional topic proportion vector $\theta \sim \mathrm{Dir}(\alpha)$.
4. For each word $w_n$: draw $z \in \{1, \ldots, L\} \sim \mathrm{Mult}(\theta)$, then draw $w_n$ from the topic associated with restaurant $c_z$.

[Graphical model: per-document path $c_1, c_2, \ldots, c_L$ drawn with nCRP concentration $\gamma$; per-document $\theta \sim \mathrm{Dir}(\alpha)$; $z_{m,n} \sim \mathrm{Mult}(\theta)$; $w_{m,n}$ drawn from topic $\beta_k$ at restaurant $c_z$; plates over $N$ words, $M$ documents, $T$ tables.]

$L$ can be infinite, but need not be.
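A sketch of this generative process in Python; the `tree`/`topics` bookkeeping, the vocabulary size, and the lazily drawn symmetric Dirichlet topics are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
L, alpha, gamma, V = 3, 1.0, 1.0, 50  # depth, Dir prior, nCRP concentration, vocab size
tree = {}     # path prefix (tuple) -> per-child customer counts
topics = {}   # node path (tuple) -> word distribution over V, drawn lazily

def generate_document(n_words):
    # Steps 1-2: build the document's path root -> leaf, one CRP choice per level
    path = [()]
    for _ in range(1, L):
        counts = tree.setdefault(path[-1], [])
        probs = np.array(counts + [gamma], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(0)              # open a new table / child restaurant
        counts[k] += 1
        path.append(path[-1] + (k,))
    theta = rng.dirichlet([alpha] * L)    # Step 3: level proportions
    words = []
    for _ in range(n_words):              # Step 4: pick a level z, then a word
        z = rng.choice(L, p=theta)
        node = path[z]
        if node not in topics:
            topics[node] = rng.dirichlet([0.1] * V)  # assumed topic prior
        words.append(int(rng.choice(V, p=topics[node])))
    return path, words

path, words = generate_document(20)
print(path, words[:10])                   # one document, one path through the tree
```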

Page 20: Non-parametric Bayesian Learning in Discrete Data

Non-parametric Bayesian Learning

What we can get: the learned topic hierarchy.

Page 21: Non-parametric Bayesian Learning in Discrete Data

Markov Chain Monte Carlo

Markov Chain

Initial distribution: $\pi_0 = \{\pi_0(1), \pi_0(2), \ldots, \pi_0(|S|)\}$

$$\pi_n = \pi_{n-1} P = \pi_{n-2} P^2 = \cdots = \pi_0 P^n \quad \text{(Chapman-Kolmogorov equation)}$$

Convergence theorem for Markov chains: if $P$ is irreducible (connected) and aperiodic, then

$$\lim_{n\to\infty} P_{ij}^n = \pi(j), \qquad \pi(j) = \sum_{i=1}^{|S|} \pi(i) P_{ij},$$

so $\lim_{n\to\infty} \pi_0 P^n = \pi = \{\pi(1), \pi(2), \ldots, \pi(j), \ldots, \pi(|S|)\}$ regardless of $\pi_0$: every row of $P^n$ converges to $\pi$, the stationary distribution.

The transition matrix is $P = \big(p(j \mid i)\big)_{i,j=1}^{|S|}$, governing the step $X_m \to X_{m+1}$.

$$X_0 \sim \pi_0(x) \to X_1 \sim \pi_1(x) \to \cdots \to X_n \sim \pi(x) \to X_{n+1} \sim \pi(x) \to X_{n+2} \sim \pi(x) \to \cdots$$

After convergence, every $X_n$ is a sample from the stationary distribution.
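A quick numerical check of this convergence with a toy 3-state transition matrix (the matrix values are arbitrary):

```python
import numpy as np

# A small row-stochastic transition matrix P, P[i, j] = p(j | i) (toy example)
P = np.array([[0.9,  0.075, 0.025],
              [0.15, 0.8,   0.05],
              [0.25, 0.25,  0.5]])

pi = np.array([1.0, 0.0, 0.0])   # arbitrary initial distribution pi_0
for _ in range(100):
    pi = pi @ P                   # pi_n = pi_{n-1} P

# Every row of P^n converges to the same stationary distribution
print(pi)                                  # approx [0.625, 0.3125, 0.0625]
print(np.linalg.matrix_power(P, 100)[0])   # same vector, from lim P^n
```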

Page 22: Non-parametric Bayesian Learning in Discrete Data

Markov Chain Monte Carlo

Gibbs Sampling

Step 1: initialize $X^{(0)} = x^{(0)} = \{x_i^{(0)} : i = 1, 2, \ldots, n\}$

Step 2: for $t = 0, 1, 2, \ldots$:

$$x_1^{(t+1)} \sim p(x_1 \mid x_2^{(t)}, x_3^{(t)}, \ldots, x_n^{(t)})$$
$$x_2^{(t+1)} \sim p(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_n^{(t)})$$
$$\cdots$$
$$x_j^{(t+1)} \sim p(x_j \mid x_1^{(t+1)}, \ldots, x_{j-1}^{(t+1)}, x_{j+1}^{(t)}, \ldots, x_n^{(t)})$$
$$\cdots$$
$$x_n^{(t+1)} \sim p(x_n \mid x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{n-1}^{(t+1)})$$

In short, each coordinate is resampled from its full conditional: $x_i \sim p(x_i \mid x_{-i})$.

[Figure: coordinate-wise moves between points A, B, C, D in the two-variable case.]

Gibbs sampling is a special case of Metropolis-Hastings sampling.

If you want to understand Gibbs sampling for HDP/DPM/nCRP, you had better first understand Gibbs sampling for LDA and the DMMM.
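As a warm-up far simpler than the LDA/DMMM samplers, here is Gibbs sampling for a bivariate normal target, whose full conditionals are available in closed form (ρ = 0.8 is an arbitrary choice):

```python
import numpy as np

# Gibbs sampling for a standard bivariate normal with correlation rho
# (a classic toy target, not the LDA/HDP samplers mentioned above).
rho, n_iters = 0.8, 10000
rng = np.random.default_rng(0)
x1, x2 = 0.0, 0.0
samples = np.empty((n_iters, 2))

for t in range(n_iters):
    # Full conditionals: x1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically for x2
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[t] = (x1, x2)

print(np.corrcoef(samples[1000:].T))  # empirical correlation approaches rho
```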

Page 23: Non-parametric Bayesian Learning in Discrete Data

Reference

• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007
• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006
• David Blei. Probabilistic Topic Models. Communications of the ACM, 2012
• David Blei, et al. Latent Dirichlet Allocation. JMLR, 2003
• David Blei, et al. The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies. Journal of the ACM, 2010
• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973
• Martin J. Wainwright. Graphical Models, Exponential Families, and Variational Inference
• Rick Durrett. Probability: Theory and Examples, 2010
• Christopher Bishop. Pattern Recognition and Machine Learning, 2007
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014
• David P. Williams. Gaussian Processes, Duke University, 2006

Page 24: Non-parametric Bayesian Learning in Discrete Data

Q&A
