Page 1: Principles and Applications of Probabilistic Learning

Principles and Applications of Probabilistic Learning

Padhraic Smyth
Department of Computer Science

University of California, Irvine
www.ics.uci.edu/~smyth

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 2: Principles and Applications of Probabilistic Learning

NEW

New Slides

• Original slides created in mid-July for ACM

– Some new slides have been added
• “new” logo in upper left

– A few slides have been updated
• “updated” logo in upper left

• Current slides (including new and updated) at: www.ics.uci.edu/~smyth/talks

Page 3: Principles and Applications of Probabilistic Learning

UPDATED

New Slides

• Original slides created in mid-July for ACM

– Some new slides have been added
• “new” logo in upper left

– A few slides have been updated
• “updated” logo in upper left

• Current slides (including new and updated) at: www.ics.uci.edu/~smyth/talks

Page 4: Principles and Applications of Probabilistic Learning

NEW

From the tutorial Web page:

“The intent of this tutorial is to provide a starting point for students and researchers……”

Page 5: Principles and Applications of Probabilistic Learning

Probabilistic Modeling vs. Function Approximation

• Two major themes in machine learning:

1. Function approximation / “black box” methods
• e.g., for classification and regression
• Learn a flexible function y = f(x)
• e.g., SVMs, decision trees, boosting, etc.

2. Probabilistic learning
• e.g., for regression, model p(y|x) or p(y,x)
• e.g., graphical models, mixture models, hidden Markov models, etc.

• Both approaches are useful in general
– In this tutorial we will focus only on the 2nd approach, probabilistic modeling

Page 6: Principles and Applications of Probabilistic Learning

Motivations for Probabilistic Modeling

• leverage prior knowledge

• generalize beyond data analysis in vector-spaces

• handle missing data

• combine multiple types of information into an analysis

• generate calibrated probability outputs

• quantify uncertainty about parameters, models, and predictions in a statistical manner

Page 7: Principles and Applications of Probabilistic Learning

Probabilistic Model

Real World Data

P(Data | Parameters)

Page 8: Principles and Applications of Probabilistic Learning

Probabilistic Model

Real World Data

P(Data | Parameters)

P(Parameters | Data)

Page 9: Principles and Applications of Probabilistic Learning

(Generative Model)

Probabilistic Model

Real World Data

P(Data | Parameters)

P(Parameters | Data)

(Inference)

Page 10: Principles and Applications of Probabilistic Learning

Outline

1. Review of probability

2. Graphical models

3. Connecting probability models to data

4. Models with hidden variables

5. Case studies
(i) Simulating and forecasting rainfall data

(ii) Curve clustering with cyclone trajectories

(iii) Topic modeling from text documents

Page 11: Principles and Applications of Probabilistic Learning

Part 1: Review of Probability

Page 12: Principles and Applications of Probabilistic Learning

Notation and Definitions

• X is a random variable
– Lower-case x is some possible value for X
– “X = x” is a logical proposition: that X takes value x
– There is uncertainty about the value of X

• e.g., X is the Dow Jones index at 5pm tomorrow

• p(X = x) is the probability that the proposition X = x is true
– often shortened to p(x)

• If the set of possible x’s is finite, we have a probability distribution and Σ p(x) = 1

• If the set of possible x’s is infinite, p(x) is a density function, and p(x) integrates to 1 over the range of X

Page 13: Principles and Applications of Probabilistic Learning

Example

• Let X be the Dow Jones Index (DJI) at 5pm Monday August 22nd (tomorrow)

• X can take real values from 0 to some large number

• p(x) is a density representing our uncertainty about X
– This density could be constructed from historical data, e.g.:

[Figure: an example density over possible values of the DJI]

– After 5pm p(x) becomes infinitely narrow around the true known x (no uncertainty)

Page 14: Principles and Applications of Probabilistic Learning

Probability as Degree of Belief

• Different agents can have different p(x)’s
– Your p(x) and the p(x) of a Wall Street expert might be quite different
– OR: if we were on vacation we might not have access to stock market information
• we would still be uncertain about p(x) after 5pm

• So we should really think of p(x) as p(x | BI)
– Where BI is background information available to agent I

– (will drop explicit conditioning on BI in notation)

• Thus, p(x) represents the degree of belief that agent I has in proposition x, conditioned on available background information

Page 15: Principles and Applications of Probabilistic Learning

Comments on Degree of Belief

• Different agents can have different probability models
– There is no necessarily “correct” p(x)
– Why? Because p(x) is a model built on whatever assumptions or background information we use
– Naturally leads to the notion of updating
• p(x | BI) -> p(x | BI, CI)

• This is the subjective Bayesian interpretation of probability
– Generalizes other interpretations (such as frequentist)
– Can be used in cases where frequentist reasoning is not applicable
– We will use “degree of belief” as our interpretation of p(x) in this tutorial

• Note!
– Degree of belief is just our semantic interpretation of p(x)
– The mathematics of probability (e.g., Bayes rule) remain the same regardless of our semantic interpretation

Page 16: Principles and Applications of Probabilistic Learning

Multiple Variables

• p(x, y, z)
– Probability that X=x AND Y=y AND Z=z
– Possible values: cross-product of X, Y, Z

– e.g., X, Y, Z each take 10 possible values
• (x,y,z) can take 10^3 possible values
• p(x,y,z) is a 3-dimensional array/table
– Defines 10^3 probabilities
• Note the exponential increase as we add more variables

– e.g., X, Y, Z are all real-valued
• (x,y,z) live in a 3-dimensional vector space
• p(x,y,z) is a positive function defined over this space, integrates to 1

Page 17: Principles and Applications of Probabilistic Learning

Conditional Probability

• p(x | y, z)
– Probability of x given that Y=y and Z=z
– Could be
• hypothetical, e.g., “if Y=y and if Z=z”
• observational, e.g., we observed values y and z
– can also have p(x, y | z), etc.
– “all probabilities are conditional probabilities”

• Computing conditional probabilities is the basis of many prediction and learning problems, e.g.,
– p(DJI tomorrow | DJI index last week)
– expected value of [DJI tomorrow | DJI index next week]
– most likely value of parameter α given observed data

Page 18: Principles and Applications of Probabilistic Learning

Computing Conditional Probabilities

• Variables A, B, C, D
– All distributions of interest related to A, B, C, D can be computed from the full joint distribution p(a,b,c,d)

• Examples, using the Law of Total Probability

– p(a) = Σ_{b,c,d} p(a, b, c, d)

– p(c,d) = Σ_{a,b} p(a, b, c, d)

– p(a,c | d) = Σ_{b} p(a, b, c | d), where p(a, b, c | d) = p(a,b,c,d)/p(d)

• These are standard probability manipulations: however, we will see how to use these to make inferences about parameters and unobserved variables, given data
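These manipulations are easy to check numerically. A minimal NumPy sketch (not from the slides; the joint table here is randomly generated) that computes p(a), p(c,d), and p(a,c | d) from a full joint table over four binary variables:

```python
import numpy as np

# Hypothetical joint distribution over four binary variables A, B, C, D,
# stored as a 4-dimensional table whose entries sum to 1.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2, 2))
joint /= joint.sum()                       # p(a, b, c, d)

p_a = joint.sum(axis=(1, 2, 3))            # p(a)   = sum over b, c, d
p_cd = joint.sum(axis=(0, 1))              # p(c,d) = sum over a, b
p_d = joint.sum(axis=(0, 1, 2))            # p(d)
p_abc_given_d = joint / p_d                # p(a,b,c | d) = p(a,b,c,d) / p(d)
p_ac_given_d = p_abc_given_d.sum(axis=1)   # p(a,c | d)  = sum over b

# for each value of d, p(a,c | d) sums to 1
assert np.allclose(p_ac_given_d.sum(axis=(0, 1)), 1.0)
```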

Page 19: Principles and Applications of Probabilistic Learning

Conditional Independence

• A is conditionally independent of B given C iff
p(a | b, c) = p(a | c)

(also implies that B is conditionally independent of A given C)

• In words, B provides no information about A, if the value of C is known

• Example:
– a = “patient has upset stomach”
– b = “patient has headache”
– c = “patient has flu”

• Note that conditional independence does not imply marginal independence

Page 20: Principles and Applications of Probabilistic Learning

Two Practical Problems

(Assume for simplicity each variable takes K values)

• Problem 1: Computational Complexity
– Conditional probability computations scale as O(K^N)
• where N is the number of variables being summed over

• Problem 2: Model Specification
– To specify a joint distribution we need a table of O(K^N) numbers
– Where do these numbers come from?

Page 21: Principles and Applications of Probabilistic Learning

Two Key Ideas

• Problem 1: Computational Complexity
– Idea: Graphical models
• Structured probability models lead to tractable inference

• Problem 2: Model Specification
– Idea: Probabilistic learning

• General principles for learning from data

Page 22: Principles and Applications of Probabilistic Learning

Part 2: Graphical Models

Page 23: Principles and Applications of Probabilistic Learning

“…probability theory is more fundamentally concerned with the structure of reasoning and causation than with numbers.”

Glenn Shafer and Judea Pearl
Introduction to Readings in Uncertain Reasoning, Morgan Kaufmann, 1990

Page 24: Principles and Applications of Probabilistic Learning

Graphical Models

• Represent dependency structure with a directed graph
– Node <-> random variable
– Edges encode dependencies
• Absence of edge -> conditional independence
– Directed and undirected versions

• Why is this useful?
– A language for communication
– A language for computation

• Origins:
– Wright 1920’s
– Independently developed by Spiegelhalter and Lauritzen in statistics and Pearl in computer science in the late 1980’s

Page 25: Principles and Applications of Probabilistic Learning

Examples of 3-way Graphical Models

[Graph: nodes A, B, C with no edges]

Marginal Independence: p(A,B,C) = p(A) p(B) p(C)

Page 26: Principles and Applications of Probabilistic Learning

Examples of 3-way Graphical Models

[Graph: A → B, A → C]

Conditionally independent effects: p(A,B,C) = p(B|A) p(C|A) p(A)

B and C are conditionally independent given A

e.g., A is a disease, and we model B and C as conditionally independent symptoms given A

Page 27: Principles and Applications of Probabilistic Learning

Examples of 3-way Graphical Models

[Graph: A → C ← B]

Independent Causes: p(A,B,C) = p(C|A,B) p(A) p(B)

Page 28: Principles and Applications of Probabilistic Learning

Examples of 3-way Graphical Models

[Graph: A → B → C]

Markov dependence: p(A,B,C) = p(C|B) p(B|A) p(A)

Page 29: Principles and Applications of Probabilistic Learning

Real-World Example

Monitoring Intensive-Care Patients
• 37 variables
• 509 parameters … instead of 2^37

[Figure: the ALARM network for intensive-care patient monitoring, with nodes such as PCWP, CO, HRBP, BP, SAO2, VENTLUNG, INTUBATION, HYPOVOLEMIA, etc. (figure courtesy of Kevin Murphy / Nir Friedman)]

Page 30: Principles and Applications of Probabilistic Learning

Directed Graphical Models

[Graph: A → C ← B]

p(A,B,C) = p(C|A,B) p(A) p(B)

Page 31: Principles and Applications of Probabilistic Learning

Directed Graphical Models

[Graph: A → C ← B]

p(A,B,C) = p(C|A,B) p(A) p(B)

In general, p(X1, X2, …, XN) = Π p(Xi | parents(Xi))

Page 32: Principles and Applications of Probabilistic Learning

Directed Graphical Models

[Graph: A → C ← B]

p(A,B,C) = p(C|A,B) p(A) p(B)

In general, p(X1, X2, …, XN) = Π p(Xi | parents(Xi))

• Probability model has simple factored form

• Directed edges => direct dependence

• Absence of an edge => conditional independence

• Also known as belief networks, Bayesian networks, causal networks
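A small illustration of the factored form, as a hedged sketch with made-up probability tables, using the “independent causes” structure A → C ← B from the preceding slides:

```python
import numpy as np

# Toy example (not from the slides): binary A, B, C with A -> C <- B.
# The factored form needs only p(A), p(B), and p(C|A,B):
p_A = np.array([0.7, 0.3])                 # p(A)
p_B = np.array([0.6, 0.4])                 # p(B)
p_C_given_AB = np.array([[[0.9, 0.1],      # p(C | A=0, B=0)
                          [0.5, 0.5]],     # p(C | A=0, B=1)
                         [[0.4, 0.6],      # p(C | A=1, B=0)
                          [0.2, 0.8]]])    # p(C | A=1, B=1)

# p(A,B,C) = p(C|A,B) p(A) p(B), built by broadcasting
joint = p_C_given_AB * p_A[:, None, None] * p_B[None, :, None]
assert np.isclose(joint.sum(), 1.0)
# The factored form has 1 + 1 + 4 free parameters, versus 7 for a full 2x2x2 table.
```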

Page 33: Principles and Applications of Probabilistic Learning

Example

[Graph: directed tree with edges D → B, B → A, B → C, D → E, E → F, E → G]

Page 34: Principles and Applications of Probabilistic Learning

Example

[Graph: the same tree, D → B → {A, C} and D → E → {F, G}]

p(A, B, C, D, E, F, G) = Π p( variable | parents )

= p(A|B)p(C|B)p(B|D)p(F|E)p(G|E)p(E|D) p(D)

Page 35: Principles and Applications of Probabilistic Learning

Example

[Graph: the same tree, with C = c and G = g observed]

Say we want to compute p(a | c, g)

Page 36: Principles and Applications of Probabilistic Learning

Example

[Graph: the same tree, with C = c and G = g observed]

Direct calculation: p(a|c,g) = Σ_{b,d,e,f} p(a,b,d,e,f | c,g)

Complexity of the sum is O(K^4)

Page 37: Principles and Applications of Probabilistic Learning

Example

[Graph: the same tree, with C = c and G = g observed]

Reordering (using the factorization):
Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) Σ_f p(e,f|g)

Page 38: Principles and Applications of Probabilistic Learning

Example

[Graph: the same tree, with C = c and G = g observed]

Reordering:
Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) Σ_f p(e,f|g)

The innermost sum, Σ_f p(e,f|g), yields p(e|g)

Page 39: Principles and Applications of Probabilistic Learning

Example

[Graph: the same tree, with C = c and G = g observed]

Reordering:
Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) p(e|g)

The sum over e yields p(d|g)

Page 40: Principles and Applications of Probabilistic Learning

Example

[Graph: the same tree, with C = c and G = g observed]

Reordering:
Σ_b p(a|b) Σ_d p(b|d,c) p(d|g)

The sum over d yields p(b|c,g)

Page 41: Principles and Applications of Probabilistic Learning

Example

[Graph: the same tree, with C = c and G = g observed]

Reordering:
Σ_b p(a|b) p(b|c,g)

= p(a|c,g). Complexity is O(K), compared to O(K^4)
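A minimal NumPy sketch of the same idea, with made-up conditional probability tables for binary variables: pushing the sums inside the factored joint gives the same p(a | c, g) as the brute-force sum, one small summation at a time.

```python
import numpy as np

K = 2
rng = np.random.default_rng(1)

def cpt(*shape):
    """Random conditional table; the last axis is the child variable and sums to 1."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# Made-up tables for the example graph D -> B -> {A, C}, D -> E -> {F, G}
p_D = cpt(K)          # p(d)
p_B_D = cpt(K, K)     # p(b | d)
p_A_B = cpt(K, K)     # p(a | b)
p_C_B = cpt(K, K)     # p(c | b)
p_E_D = cpt(K, K)     # p(e | d)
p_F_E = cpt(K, K)     # p(f | e)
p_G_E = cpt(K, K)     # p(g | e)
c_obs, g_obs = 0, 1   # observed values of C and G

# Brute force: build the full joint p(a,b,c,d,e,f,g), then sum over b, d, e, f
joint = np.einsum('d,db,ba,bc,de,ef,eg->abcdefg',
                  p_D, p_B_D, p_A_B, p_C_B, p_E_D, p_F_E, p_G_E)
brute = joint[:, :, c_obs, :, :, :, g_obs].sum(axis=(1, 2, 3, 4))
brute /= brute.sum()                                          # p(a | c, g)

# Variable elimination: push each sum in as far as it will go
m_f = p_F_E.sum(axis=1)                                       # sum_f p(f|e) (= 1 for each e)
m_e = (p_E_D * (p_G_E[:, g_obs] * m_f)).sum(axis=1)           # sum_e p(e|d) p(g|e)
m_d = (p_D[:, None] * p_B_D * m_e[:, None]).sum(axis=0)       # sum_d p(d) p(b|d) m_e(d)
m_b = (p_A_B * (p_C_B[:, c_obs] * m_d)[:, None]).sum(axis=0)  # sum_b p(a|b) p(c|b) m_d(b)
elim = m_b / m_b.sum()                                        # p(a | c, g)

assert np.allclose(brute, elim)
print(elim)
```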

Page 42: Principles and Applications of Probabilistic Learning

A More General Algorithm

• Message Passing (MP) Algorithm
– Pearl, 1988; Lauritzen and Spiegelhalter, 1988

– Declare 1 node (any node) to be a root

– Schedule two phases of message-passing

• nodes pass messages up to the root

• messages are distributed back to the leaves

– In time O(N), we can compute P(….)

Page 43: Principles and Applications of Probabilistic Learning

Sketch of the MP algorithm in action

Page 44: Principles and Applications of Probabilistic Learning

Sketch of the MP algorithm in action

[Figure: message 1 passed toward the root]

Page 45: Principles and Applications of Probabilistic Learning

Sketch of the MP algorithm in action

[Figure: messages 1 and 2 passed toward the root]

Page 46: Principles and Applications of Probabilistic Learning

Sketch of the MP algorithm in action

[Figure: messages 1, 2, and 3]

Page 47: Principles and Applications of Probabilistic Learning

Sketch of the MP algorithm in action

[Figure: messages 1, 2, 3, and 4, passed up to the root and back down to the leaves]

Page 48: Principles and Applications of Probabilistic Learning

Complexity of the MP Algorithm

• Efficient
– Complexity scales as O(N K^m)
• N = number of variables
• K = arity of variables
• m = maximum number of parents for any node
– Compare to O(K^N) for the brute-force method

Page 49: Principles and Applications of Probabilistic Learning

Graphs with “loops”

[Graph: a directed graph over A, B, C, D, E, F, G containing a “loop” (multiple paths between two nodes)]

The message passing algorithm does not work when there are multiple paths between 2 nodes

Page 50: Principles and Applications of Probabilistic Learning

Graphs with “loops”

[Graph: the same graph with a loop]

General approach: “cluster” variables together to convert the graph to a tree

Page 51: Principles and Applications of Probabilistic Learning

Reduce to a Tree

[Graph: the resulting tree, with B and E merged into a single clustered node {B, E}]

Page 52: Principles and Applications of Probabilistic Learning

Reduce to a Tree

[Graph: the same clustered tree with merged node {B, E}]

Good news: can perform the MP algorithm on this tree

Bad news: complexity is now O(K^2)

Page 53: Principles and Applications of Probabilistic Learning

Probability Calculations on Graphs

• Structure of the graph reveals
– Computational strategy
– Dependency relations

• Complexity is typically O(K^(max number of parents))
– If single parents (e.g., a tree) -> O(K)
– The sparser the graph, the lower the complexity

• Technique can be “automated”
– i.e., a fully general algorithm for arbitrary graphs
– For continuous variables: replace sum with integral
– For identification of most likely values: replace sum with max operator

Page 54: Principles and Applications of Probabilistic Learning

Hidden Markov Model (HMM)

[Figure: HMM graphical model. Hidden state chain S1 → S2 → S3 → … → Sn, with each observation Yt depending only on the corresponding St. The S’s are hidden, the Y’s are observed.]

Two key assumptions:
1. the hidden state sequence is Markov
2. observation Yt is CI of all other variables given St

Widely used in speech recognition, protein sequence models

Motivation: switching dynamics, low-d representation of Y’s, etc

Page 55: Principles and Applications of Probabilistic Learning

HMMs as graphical models…

• Computations of interest

• p(Y) = Σ_s p(Y, S = s) -> “forward-backward” algorithm

• arg max_s p(S = s | Y) -> Viterbi algorithm

• Both algorithms…
– computation time linear in T
– special cases of the MP algorithm

• Many generalizations and extensions…
– Make state S continuous -> Kalman filters
– Add inputs -> convolutional decoding
– Add additional dependencies in the model
• Generalized HMMs
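For concreteness, a minimal sketch of the forward pass (one half of forward-backward) for a discrete HMM with made-up parameters; it computes log p(Y) in time linear in the sequence length.

```python
import numpy as np

# Made-up HMM parameters: K hidden states, M observation symbols
K, M = 3, 4
rng = np.random.default_rng(2)
pi = np.ones(K) / K                            # p(S1)
A = rng.dirichlet(np.ones(K), size=K)          # A[i, j] = p(S_{t+1} = j | S_t = i)
B = rng.dirichlet(np.ones(M), size=K)          # B[i, m] = p(Y_t = m | S_t = i)

def log_likelihood(y):
    """Forward algorithm: log p(Y) for an observation sequence y."""
    alpha = pi * B[:, y[0]]                    # alpha_1(s) = p(y1, S1 = s)
    log_p = 0.0
    for t in range(1, len(y)):
        norm = alpha.sum()                     # rescale to avoid numerical underflow
        log_p += np.log(norm)
        alpha = (alpha / norm) @ A * B[:, y[t]]   # recursion: sum over previous state
    return log_p + np.log(alpha.sum())

y = rng.integers(0, M, size=50)                # a random toy observation sequence
print(log_likelihood(y))
```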

Page 56: Principles and Applications of Probabilistic Learning

Part 3: Connecting Probability Models to Data

Recommended References for this Section:

• All of Statistics, L. Wasserman, Chapman and Hall, 2004 (Chapters 6,9,11)

• Pattern Classification and Scene Analysis, 1st ed, R. Duda and P. Hart, Wiley, 1973, Chapter 3.

Page 57: Principles and Applications of Probabilistic Learning

(Generative Model)

Probabilistic Model

Real World Data

P(Data | Parameters)

P(Parameters | Data)

(Inference)

Page 58: Principles and Applications of Probabilistic Learning

“Plate” Notation

[Figure: plate notation. Node θ (the model parameters) points to node y_i, which sits inside a plate labeled i = 1:n.]

Data = {y1, …, yn}

Plate = rectangle in a graphical model: variables within a plate are replicated in a conditionally independent manner

Page 59: Principles and Applications of Probabilistic Learning

Example: Gaussian Model

[Figure: plate model with parameters µ and σ pointing to y_i, i = 1:n]

Generative model: p(y1,…yn | µ, σ) = Π p(yi | µ, σ)

= p(data | parameters)

= p(D | θ) where θ = {µ, σ }
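A small sketch of this generative model with assumed values of µ and σ: draw y_1, …, y_n independently, then evaluate the (log) joint probability of the sample.

```python
import numpy as np

# Assumed parameter values for illustration
mu, sigma, n = 5.0, 2.0, 1000
rng = np.random.default_rng(3)
y = rng.normal(loc=mu, scale=sigma, size=n)   # y_i ~ N(mu, sigma^2), i = 1..n

# p(y_1, ..., y_n | mu, sigma) is a product of the individual Gaussian densities,
# usually evaluated on the log scale:
log_p = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2))
print(log_p)
```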

Page 60: Principles and Applications of Probabilistic Learning

The Likelihood Function

• Likelihood = p(data | parameters)

= p( D | θ )

= L (θ)

• Likelihood tells us how likely the observed data are conditioned on a particular setting of the parameters

• Details
– Constants that do not involve θ can be dropped in defining L(θ)
– Often easier to work with log L(θ)

Page 61: Principles and Applications of Probabilistic Learning

Comments on the Likelihood Function

• Constructing a likelihood function L (θ) is the first step in probabilistic modeling

• The likelihood function implicitly assumes an underlying probabilistic model M with parameters θ

• L (θ) connects the model to the observed data

• Graphical models provide a useful language for constructing likelihoods

Page 62: Principles and Applications of Probabilistic Learning

NEW

Binomial Likelihood

• Binomial model
– N memoryless trials
– probability θ of success at each trial

• Observed data
– r successes in n trials
– Defines a likelihood:

L(θ) = p(D | θ) = p(successes) p(non-successes)
= θ^r (1-θ)^(n-r)
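A quick numerical sketch (made-up counts): evaluating this likelihood on a grid shows that it peaks at the familiar value r/n.

```python
import numpy as np

r, n = 7, 20                                   # 7 successes in 20 trials (made up)
theta = np.linspace(0.001, 0.999, 999)
log_L = r * np.log(theta) + (n - r) * np.log(1 - theta)   # log of theta^r (1-theta)^(n-r)
print(theta[np.argmax(log_L)])                 # close to r / n = 0.35
```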

Page 63: Principles and Applications of Probabilistic Learning

NEW

Binomial Likelihood Examples

Page 64: Principles and Applications of Probabilistic Learning

Gaussian Model and Likelihood

Model assumptions:
1. y’s are conditionally independent given the model
2. each y comes from a Gaussian (Normal) density

Page 65: Principles and Applications of Probabilistic Learning

Page 66: Principles and Applications of Probabilistic Learning

Conditional Independence (CI)

• CI in a likelihood model means that we are assuming data points provide no information about each other, if the model parameters are assumed known.

p(D | θ) = p(y1, …, yN | θ) = Π p(yi | θ)   (the CI assumption)

• Works well for (e.g.)
– Patients randomly arriving at a clinic
– Web surfers randomly arriving at a Web site

• Does not work well for
– Time-dependent data (e.g., stock market)
– Spatial data (e.g., pixel correlations)

Page 67: Principles and Applications of Probabilistic Learning

Example: Markov Likelihood

• Motivation: wish to model data in a sequence where there is sequential dependence
– e.g., a first-order Markov chain for a DNA sequence

– Markov modeling assumption: p(y_t | y_t-1, y_t-2, …, y_1) = p(y_t | y_t-1)

– θ = K x K matrix of transition probabilities

L(θ) = p(D | θ) = p(y1, …, yN | θ) = Π p(y_t | y_t-1, θ)
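A small sketch (toy sequence, assumed alphabet) of how this likelihood is maximized in practice: the ML transition matrix is just the table of normalized transition counts.

```python
import numpy as np

# Toy DNA-like sequence (made up); each symbol maps to a state index
alphabet = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
seq = "ACGTACGGGTACCA"
y = [alphabet[ch] for ch in seq]

K = len(alphabet)
counts = np.zeros((K, K))
for prev, curr in zip(y[:-1], y[1:]):
    counts[prev, curr] += 1                 # n_ij = number of observed i -> j transitions

# ML estimate: theta_ij = n_ij / sum_j n_ij (each row must have at least one transition)
theta = counts / counts.sum(axis=1, keepdims=True)
print(theta)
```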

Page 68: Principles and Applications of Probabilistic Learning

Maximum Likelihood (ML) Principle (R. Fisher, ~1922)

[Figure: plate model with parameters θ pointing to y_i, i = 1:n]

Data = {y1, …, yn}

L(θ) = p(Data | θ) = Π p(yi | θ)

Maximum Likelihood: θ_ML = arg max { Likelihood(θ) }

Select the parameters that make the observed data most likely

Page 69: Principles and Applications of Probabilistic Learning

Example: ML for Gaussian Model

Maximum Likelihood Estimate θ_ML

Page 70: Principles and Applications of Probabilistic Learning

Maximizing the Likelihood

• More generally, we solve for the θ value that maximizes the function L(θ)
– With p parameters, L(θ) is a scalar function defined over a p-dimensional space

– 2 situations:
• We can analytically solve for the maxima of L(θ)
– This is rare
• We have to resort to iterative techniques to find θ_ML
– More common

• General approach
– Define a generative probabilistic model
– Define an associated likelihood (connect the model to data)
– Solve an optimization problem to find θ_ML

Page 71: Principles and Applications of Probabilistic Learning

Analytical Solution for Gaussian Likelihood
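The standard closed-form result here is that the ML estimates are the sample mean and the (1/n) sample standard deviation; a quick numerical check on simulated data, as a hedged sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=5.0, scale=2.0, size=10_000)   # simulated data from N(5, 2^2)

mu_ml = y.mean()                                   # argmax over mu of log L
sigma_ml = np.sqrt(((y - mu_ml) ** 2).mean())      # argmax over sigma (note 1/n, not 1/(n-1))
print(mu_ml, sigma_ml)                             # close to 5.0 and 2.0
```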

Page 72: Principles and Applications of Probabilistic Learning

Graphical Model for Regression

[Figure: plate model for regression. Input x_i and parameters θ, σ point to y_i, i = 1:n.]

Page 73: Principles and Applications of Probabilistic Learning

Example

[Figure: scatter plot of y versus x]

Page 74: Principles and Applications of Probabilistic Learning

Example

[Figure: the same scatter plot of y versus x, with an unknown underlying function f(x; θ)]

Page 75: Principles and Applications of Probabilistic Learning

Example: ML for Linear Regression

• Generative model: y = ax + b + Gaussian noise
p(y) = N(ax + b, σ)

• Conditional Likelihood
L(θ) = p(y1, …, yN | x1, …, xN, θ)
= Π p(yi | xi, θ),  θ = {a, b}

• Can show (homework problem!) that

log L(θ) = - Σ [yi - (a xi + b)]^2   (up to constants)

i.e., finding a, b to maximize the log-likelihood is the same as finding a, b that minimize the least-squares error
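A small simulated-data sketch of this equivalence (the true a, b and noise level below are made up): solving the least-squares problem recovers the ML estimates of a and b.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=200)
y = 1.5 * x + 2.0 + rng.normal(0, 1.0, size=200)   # y = ax + b + Gaussian noise

# Least-squares fit via the normal equations = ML estimate of (a, b) under Gaussian noise
X = np.column_stack([x, np.ones_like(x)])
a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(a_hat, b_hat)                                # close to (1.5, 2.0)
```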

Page 76: Principles and Applications of Probabilistic Learning

ML and Regression

• Multivariate case
– multiple x’s, multiple regression coefficients
– with Gaussian noise, the ML solution is again equivalent to least-squares (solutions to a set of linear equations)

• Non-linear multivariate model
– With Gaussian noise we get

log L(θ) = - Σ [yi - f(xi; θ)]^2

– Conditions for the θ that maximizes L(θ) lead to a set of p non-linear equations in p variables
– e.g., f(xi; θ) = a multilayer neural network with 1000 weights
• Optimization = finding the maximum of a non-convex function in a 1000-dimensional space!
• Typically use iterative local search based on the gradient (many possible variations)

Page 77: Principles and Applications of Probabilistic Learning

NEW

Probabilistic Learning and Classification

• 2 main approaches:

1. p(c | x) = p(x|c) p(c) / p(x) ~ p(x|c) p(c)
-> learn a model for p(x|c) for each class, use Bayes rule to classify
- example: naïve Bayes
- advantage: theoretically optimal if p(x|c) is “correct”
- disadvantage: not directly optimizing predictive accuracy

2. Learn p(c|x) directly, e.g.,
– logistic regression (see tutorial notes from D. Lewis)
– other regression methods such as neural networks, etc.
– Often quite effective in practice: very useful for ranking, scoring, etc.
– Contrast with purely discriminative methods such as SVMs, trees
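As an illustration of approach 1, a minimal sketch (simulated data, diagonal-Gaussian class-conditional models, i.e., Gaussian naive Bayes) of fitting p(x|c) and p(c) and then classifying with Bayes rule:

```python
import numpy as np

# Simulated two-class data in 2 dimensions (all values made up)
rng = np.random.default_rng(6)
x0 = rng.normal([0, 0], 1.0, size=(100, 2))       # class 0 samples
x1 = rng.normal([2, 2], 1.0, size=(100, 2))       # class 1 samples
X = np.vstack([x0, x1])
c = np.array([0] * 100 + [1] * 100)

def log_gauss(x, mu, var):
    # log of a diagonal-Gaussian density
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var), axis=-1)

# Fit p(x|c) (per-class mean and variance) and the class prior p(c)
class_stats = [(X[c == k].mean(0), X[c == k].var(0), np.mean(c == k)) for k in (0, 1)]

def predict(x_new):
    # p(c|x) ~ p(x|c) p(c); pick the class with the larger log score
    scores = [log_gauss(x_new, mu, var) + np.log(prior) for mu, var, prior in class_stats]
    return int(scores[1] > scores[0])

print(predict(np.array([1.8, 2.1])))               # most likely class 1
```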

Page 78: Principles and Applications of Probabilistic Learning

The Bayesian Approach to Learning

[Figure: plate model with hyperparameter α pointing to θ, and θ pointing to y_i, i = 1:n]

Prior(θ) = p(θ | α)

Maximum A Posteriori: θ_MAP = arg max { Likelihood(θ) x Prior(θ) }

Fully Bayesian: p(θ | Data) = p(Data | θ) p(θ) / p(Data)

Page 79: Principles and Applications of Probabilistic Learning

The Bayesian Approach

Fully Bayesian: p(θ | Data) = p(Data | θ) p(θ) / p(Data)

= Likelihood x Prior / Normalization term

Estimating p( θ | Data) can be viewed as inference in a graphical model

ML is a special case = MAP with a “flat” prior

[Figure: the same plate model, α → θ → y_i, i = 1:n]

Page 80: Principles and Applications of Probabilistic Learning

More Comments on Bayesian Learning

• “fully” Bayesian: report the full posterior density p(θ|D)
– For simple models, we can calculate p(θ|D) analytically
– Otherwise we empirically estimate p(θ|D)
• Monte Carlo sampling methods are very useful

• Bayesian prediction (e.g., for regression):

p(y | x, D) = ∫ p(y, θ | x, D) dθ
= ∫ p(y | θ, x) p(θ|D) dθ

-> prediction at each θ is weighted by p(θ|D)

[theoretically preferable to picking a single θ (as in ML)]
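A small sketch of this weighting idea using the Beta-Binomial example developed a few slides later (the numbers are made up): the predictive probability of a success averages θ over the posterior p(θ|D), approximated here by Monte Carlo sampling.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta, r, n = 2.0, 2.0, 7, 20                  # assumed prior and observed data

# Draws from the posterior p(theta | D) = Beta(alpha + r, beta + n - r)
theta_samples = rng.beta(alpha + r, beta + n - r, size=100_000)

# Predictive p(success | D) = integral of theta * p(theta | D) d theta ~ sample mean
p_next_success = theta_samples.mean()
print(p_next_success)                                # close to (alpha + r) / (alpha + beta + n)
```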

Page 81: Principles and Applications of Probabilistic Learning

More Comments on Bayesian Learning

• In practice…
– Fully Bayesian is theoretically optimal but not always the most practical approach
• e.g., computational limitations with large numbers of parameters
• assessing priors can be tricky

• Bayesian approach particularly useful for small data sets

• For large data sets, Bayesian, MAP, and ML tend to agree
– ML/MAP are much simpler => often used in practice

Page 82: Principles and Applications of Probabilistic Learning

Example of Bayesian Estimation

• Definition of Beta prior

• Definition of Binomial likelihood

• Form of Beta posterior

• Examples of plots with prior+likelihood -> posterior

Page 83: Principles and Applications of Probabilistic Learning

NEW

Beta Density as a Prior

• Let θ be a proportion
– e.g., the fraction of customers that respond to an email ad
– p(θ) is a prior for θ
– e.g., p(θ | α, β) = Beta density with parameters α and β

p(θ | α, β) ~ θ^(α-1) (1-θ)^(β-1)

α /(α + β) influences the location

α + β controls the width

Page 84: Principles and Applications of Probabilistic Learning

NEW

Examples of Beta Density Priors

Page 85: Principles and Applications of Probabilistic Learning

NEW

Binomial Likelihood

• Binomial model
– N memoryless trials
– probability θ of success at each trial

• Observed data
– r successes in n trials
– Defines a likelihood:

p(D | θ) = p(successes) p(non-successes)
= θ^r (1-θ)^(n-r)

Page 86: Principles and Applications of Probabilistic Learning

NEW

Beta + Binomial -> Beta

p(θ | D) = Posterior ~ Likelihood x Prior
= Binomial x Beta
~ θ^r (1-θ)^(n-r) x θ^(α-1) (1-θ)^(β-1)

= Beta(α + r, β + n – r)

Prior is “updated” using data:

Parameters: α -> α+r, β -> β + n – r

Sample size: α + β -> α + β + n

Mean: α /(α + β) -> (α + r)/(α + β + n)
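A tiny worked example of this update (numbers made up): a Beta(2, 2) prior combined with r = 7 successes in n = 20 trials gives a Beta(9, 15) posterior.

```python
# Beta-Binomial conjugate update, following the formulas above
alpha, beta = 2, 2          # prior parameters
r, n = 7, 20                # observed data: r successes in n trials

alpha_post, beta_post = alpha + r, beta + n - r            # Beta(9, 15)
prior_mean = alpha / (alpha + beta)                        # 0.50
posterior_mean = alpha_post / (alpha_post + beta_post)     # 9 / 24 = 0.375
print(alpha_post, beta_post, prior_mean, posterior_mean)
```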

Page 87: Principles and Applications of Probabilistic Learning

NEW

[Figure: example plot of a prior and likelihood combining to give the posterior]

Page 88: Principles and Applications of Probabilistic Learning

NEW

Page 89: Principles and Applications of Probabilistic Learning

NEW

Page 90: Principles and Applications of Probabilistic Learning

NEW

Extensions

• K categories with K probabilities that sum to 1
– Dirichlet prior + Multinomial likelihood -> Dirichlet posterior
– Used in text modeling, protein alignment algorithms, etc.
• e.g., Biological Sequence Analysis, R. Durbin et al., Cambridge University Press, 1998

• Hierarchical modeling
– Multiple trials for different individuals
– Each individual has their own θ
– The θ’s ~ common population distribution
– For applications in marketing see
• Market Segmentation: Conceptual and Methodological Foundations, M. Wedel and W. A. Kamakura, Kluwer, 1998

Page 91: Principles and Applications of Probabilistic Learning

Example: Bayesian Gaussian Model

[Figure: plate model with hyperparameters α and β pointing to µ and σ, which point to y_i, i = 1:n]

Note: priors and parameters are assumed independent here

Page 92: Principles and Applications of Probabilistic Learning

Example: Bayesian Regression

[Figure: plate model with hyperparameters α and β over θ and σ; input x_i and parameters θ, σ point to y_i, i = 1:n]

Model: y_i = f[x_i; θ] + e,  e ~ N(0, σ^2)

p(y_i | x_i) ~ N(f[x_i; θ], σ^2)

Page 93: Principles and Applications of Probabilistic Learning

UPDATED

Other Examples

• Bayesian examples
– Bayesian neural networks

• Richer probabilistic models
– Random effects models

• Learning graphical model structure
– Chow-Liu trees
– General graphical model structures

• Learning to align curves
– Alignment of growth curves

Comprehensive reference:
Bayesian Data Analysis, A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Chapman and Hall, 2nd edition, 2003.

Page 94: Principles and Applications of Probabilistic Learning

Learning Shapes and Shifts

[Figure: two panels of curves, “Original data” and “Data after learning”]

Data = smoothed growth acceleration data from teenagers

EM used to learn a spline model + time-shift for each curve

Page 95: Principles and Applications of Probabilistic Learning

Model Uncertainty

• How do we know what model M to select for our likelihood function?
– In general, we don’t!

– However, we can use the data to help us infer which model from a set of possible models is best

Page 96: Principles and Applications of Probabilistic Learning

Method 1: Bayesian Approach

• Can evaluate the evidence for each model, p(M|D) = p(D|M) p(M) / p(D)
– Can get p(D|M) by integrating p(D, θ | M) over parameter space (this is the “marginal likelihood”)
– in theory p(M|D) is how much evidence exists in the data for model M
• More complex models are automatically penalized because of the integration over higher-dimensional parameter spaces
– in practice p(M|D) can rarely be computed directly
• Monte Carlo schemes are popular
• Also: approximations such as BIC, Laplace, etc.

Page 97: Principles and Applications of Probabilistic Learning

Comments on Bayesian Approach

• Bayesian Model Averaging (BMA):
– Instead of selecting the single best model, for prediction average over all available models (theoretically the correct thing to do)
– Weights used for averaging are p(M|D)

• Empirical alternatives
– e.g., Stacking, Bagging
– Idea is to learn a set of unconstrained combining weights from the data, weights that optimize predictive accuracy
• “emulate” the BMA approach
• may be more effective in practice

Page 98: Principles and Applications of Probabilistic Learning

Method 2: Predictive Validation

• Instead of the Bayesian approach, we could use the probability of new unseen test data as our metric for selecting models

• E.g., 2 models
– If p(D | M1) > p(D | M2) then M1 is assigning higher probability to new data than M2
– This will (with enough data) select the model that predicts the best, in a probabilistic sense
– Useful for problems where we have very large amounts of data and it is easy to create a large validation data set D
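A minimal sketch of this kind of comparison (simulated validation data, two Gaussian models with assumed parameters), scoring each model by the log-probability it assigns to held-out data:

```python
import numpy as np

rng = np.random.default_rng(8)
d_val = rng.normal(0.0, 1.0, size=5_000)      # validation set, truly drawn from N(0, 1)

def log_p(data, mu, sigma):
    # log p(data | model) for a Gaussian model with the given parameters
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (data - mu)**2 / (2 * sigma**2))

score_m1 = log_p(d_val, mu=0.0, sigma=1.0)    # M1: well-matched to the data
score_m2 = log_p(d_val, mu=0.5, sigma=2.0)    # M2: a poorer model
print(score_m1 > score_m2)                    # True: prefer M1
```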

Page 99: Principles and Applications of Probabilistic Learning

NEW

The Prediction Game

[Figure: observed data x plotted on an axis from 0 to 10]

What is a good guess at p(x)?

[Figure: two candidate densities over the same axis, Model A for p(x) and Model B for p(x)]

Page 100: Principles and Applications of Probabilistic Learning

NEW

Which of Model A or B is better?

Test data generated from the true underlying q(x)

[Figure: the test data overlaid on Model A and on Model B]

We can score each model in terms of p(new data | model)

Asymptotically, this is a fair, unbiased score (irrespective of the complexities of the models)

Note: empirical average of log p(data) scores ~ negative entropy

Page 101: Principles and Applications of Probabilistic Learning

NEW

[Figure: predictive entropy out-of-sample. Negative log-likelihood (bits/token) versus number of mixture components K, comparing mixtures of multinomials with mixtures of SFSMs.]

Model-based clustering and visualization of navigation patterns on a Web site, Cadez et al., Journal of Data Mining and Knowledge Discovery, 2003

Page 102: Principles and Applications of Probabilistic Learning

Simple Model Class

Page 103: Principles and Applications of Probabilistic Learning

Data-generating process (“truth”)

Simple Model Class

Page 104: Principles and Applications of Probabilistic Learning

Data-generating process (“truth”)

Simple Model Class

“Closest” model in terms of KL distance

Best model is relatively far from Truth => High Bias

Page 105: Principles and Applications of Probabilistic Learning

Data-generating process (“truth”)

Simple Model Class

Complex Model Class

Page 106: Principles and Applications of Probabilistic Learning

Data-generating process (“truth”)

Simple Model Class

Complex Model Class: best model is closer to Truth => Low Bias

Page 107: Principles and Applications of Probabilistic Learning

Data-generating process (“truth”)

Simple Model Class

Complex Model Class

However… this could be the model that best fits the observed data => High Variance

Page 108: Principles and Applications of Probabilistic Learning

Part 4: Models with Hidden Variables

Page 109: Principles and Applications of Probabilistic Learning

Hidden or Latent Variables

• In many applications there are 2 sets of variables:
– Variables whose values we can directly measure
– Variables that are “hidden” and cannot be measured

• Examples:
– Speech recognition:
• Observed: acoustic voice signal
• Hidden: label of the word spoken
– Face tracking in images
• Observed: pixel intensities
• Hidden: position of the face in the image
– Text modeling
• Observed: counts of words in a document
• Hidden: topics that the document is about

Page 110: Principles and Applications of Probabilistic Learning

Mixture Models
Pearson, 1894, Phil. Trans. Roy. Soc. A.

p(Y) = Σ_k p(Y | S=k) p(S=k)

S: hidden discrete variable
Y: observed variable(s)

Motivation:
1. models a true process (e.g., the fish example)

2. approximation for a complex process

Page 111: Principles and Applications of Probabilistic Learning

[Figure: two Gaussian component densities (Component 1, Component 2) and the resulting mixture density p(x), plotted for x from -5 to 10]

Page 112: Principles and Applications of Probabilistic Learning

[Figure: the two component densities and the resulting mixture density p(x)]

Page 113: Principles and Applications of Probabilistic Learning

[Figure: the component models and the resulting mixture density p(x)]

Page 114: Principles and Applications of Probabilistic Learning

A Graphical Model for Clustering

[Figure: hidden discrete (cluster) variable S with arrows to observed variables Y1, …, Yd, which are assumed conditionally independent given S]

Clusters = p(Y1,…Yd | S = s)

Probabilistic Clustering = learning these probability distributions from data

Page 115: Principles and Applications of Probabilistic Learning

Hidden Markov Model (HMM)

[Figure: HMM graphical model. Hidden state chain S1 → S2 → S3 → … → Sn, with each observation Yt depending only on St. The S’s are hidden, the Y’s are observed.]

Two key assumptions:
1. the hidden state sequence is Markov
2. observation Yt is CI of all other variables given St

Widely used in speech recognition, protein sequence models

Motivation?
- S can provide non-linear switching
- S can encode low-dim time-dependence for high-dim Y

Page 116: Principles and Applications of Probabilistic Learning

Generalizing HMMs

[Figure: HMM with two independent hidden state chains, St and Tt, both influencing each observation Yt]

Two independent state variables, e.g., two processes evolving at different time-scales

Page 117: Principles and Applications of Probabilistic Learning

Generalizing HMMs

[Figure: HMM with input variables I1, …, In feeding into the hidden states S1, …, Sn]

Inputs I provide context to influence switching, e.g., external forcing variables

Model is still a tree -> inference is still linear

Page 118: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Generalizing HMMs

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Y1

S1

Y2

S2

Y3

S3

Yn

Sn

I1 I2 I3 In

Add direct dependence between Y’s to better model persistence

Can merge each St and Yt to construct a tree-structured model

Page 119: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Mixture Model

Si

yi

i=1:n

θ

Likelihood(θ) = p(Data | θ )

= Πi p(yi | θ )

= Πi [ Σk p(yi |si = k , θ ) p(si = k) ]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 120: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Learning with Missing Data

• Guess at some initial parameters θ0

• E-step (Inference)
  – For each case and each unknown variable, compute p(S | known data, θ0)

• M-step (Optimization)
  – Maximize L(θ) using p(S | … )
  – This yields new parameter estimates θ1

• This is the EM algorithm:
  – Guaranteed to converge to a (local) maximum of L(θ)
  – Dempster, Laird, and Rubin, 1977
  – (a small numerical sketch for a Gaussian mixture follows below)
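As a concrete illustration of the two steps, here is a minimal EM sketch for a K-component 1-D Gaussian mixture (a sketch under assumed Gaussian components; variable names and defaults are illustrative, not the tutorial's own code):

```python
# Minimal EM sketch for a 1-D Gaussian mixture (illustrative only).
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(y, K, n_iter=50, seed=0):
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y)
    # initial guess theta_0
    means, stds, weights = rng.choice(y, K), np.ones(K), np.ones(K) / K
    for _ in range(n_iter):
        # E-step: p(S = k | y_i, theta) for every case i and every component k
        dens = np.array([w * norm.pdf(y, m, s)
                         for w, m, s in zip(weights, means, stds)])   # shape (K, n)
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: maximize the expected log-likelihood -> new parameters theta_1
        Nk = resp.sum(axis=1)
        weights = Nk / n
        means = (resp @ y) / Nk
        stds = np.sqrt((resp * (y - means[:, None]) ** 2).sum(axis=1) / Nk)
    return weights, means, stds
```

Each iteration of this sketch costs O(nK), consistent with the per-iteration complexity stated a few slides below.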

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 121: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

E-Step

Si

yi

i=1:n

θ

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 122: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

M-Step

Si

yi

i=1:n

θ

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 123: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

E-Step

Si

yi

i=1:n

θ

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 124: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

The E (Expectation) Step

Current K componentsand parameters

n objects

E step: Compute p(object i is in group k)

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 125: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

The M (Maximization) Step

New parameters forthe K components

n objects

M step: Compute θ, given n objects and memberships

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 126: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Complexity of EM for mixtures

K models, n objects

Complexity per iteration scales as O( n K f(d) )

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 127: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

[Figure: "Anemia Patients and Controls" scatter plot — x-axis: Red Blood Cell Volume, y-axis: Red Blood Cell Hemoglobin Concentration. Data from Prof. Christine McLaren, Dept of Epidemiology, UC Irvine]

Page 128: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

[Figure: anemia scatter plot with fitted mixture components — EM iteration 1; x-axis: Red Blood Cell Volume, y-axis: Red Blood Cell Hemoglobin Concentration]

Page 129: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

[Figure: anemia scatter plot with fitted mixture components — EM iteration 3]

Page 130: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

[Figure: anemia scatter plot with fitted mixture components — EM iteration 5]

Page 131: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

[Figure: anemia scatter plot with fitted mixture components — EM iteration 10]

Page 132: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

[Figure: anemia scatter plot with fitted mixture components — EM iteration 15]

Page 133: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

[Figure: anemia scatter plot with fitted mixture components — EM iteration 25]

Page 134: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

[Figure: "Anemia Data with Labels" — the same scatter plot with the Anemia Group and Control Group labeled; x-axis: Red Blood Cell Volume, y-axis: Red Blood Cell Hemoglobin Concentration]

Page 135: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

[Figure: "Log-Likelihood as a Function of EM Iterations" — x-axis: EM Iteration, y-axis: Log-Likelihood]

Page 136: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Example of a Log-Likelihood Surface

[Figure: contour plot of the log-likelihood surface — axes: Mean 2 and log scale for Sigma 2]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 137: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

[Figure: "Log-Likelihood Cross-Section" — x-axis: log(sigma), y-axis: log-likelihood]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 138: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

HMMs

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Y1

S1

Y2

S2

Y3

S3

YN

SN

Page 139: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Y1

S1

Y2

S2

Y3

S3

YN

SN

θ1

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 140: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Y1

S1

Y2

S2

Y3

S3

YN

SN

θ1

θ2

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 141: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

E-Step (linear inference)

Y1

S1

Y2

S2

Y3

S3

YN

SN

θ1

θ2

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 142: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

M-Step (closed form)

Y1

S1

Y2

S2

Y3

S3

YN

SN

θ1

θ2

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 143: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Alternatives to EM

• Method of Moments
  – EM is more efficient

• Direct optimization
  – e.g., gradient descent, Newton methods
  – EM is usually simpler to implement

• Sampling (e.g., MCMC)

• Minimum distance, e.g.,

IMSE(θ) = E[ ( p(x | θ) − q(x) )² ]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 144: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Mixtures as “Data Simulators”

For i = 1 to N

classk ~ p(class1, class2, …., class K)

xi ~ p(x | classk)

end
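Read as a simulator, the loop above is plain ancestral sampling; a possible numpy sketch, assuming Gaussian class-conditional densities with illustrative parameters:

```python
# Minimal sketch of the mixture "data simulator" (Gaussian components assumed).
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])                           # p(class_k)
means, stds = np.array([0.0, 5.0]), np.array([1.0, 2.0])

N = 1000
classes = rng.choice(len(weights), size=N, p=weights)    # class_k ~ p(class_1, ..., class_K)
x = rng.normal(means[classes], stds[classes])            # x_i ~ p(x | class_k)
```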

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 145: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Mixtures with Markov Dependence

For i = 1 to N

classk ~ p(class1, class2, …., class K | class[xi-1] )

xi ~ p(x | classk)

end
Current class depends on the previous class (Markov dependence)

This is a hidden Markov model
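A minimal sketch of the same simulator with the Markov dependence added, i.e. a two-state HMM with an assumed transition matrix and Gaussian outputs (all values illustrative):

```python
# Minimal HMM simulator sketch: the class (state) now follows a Markov chain.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1],                # A[i, j] = p(S_t = j | S_{t-1} = i)
              [0.2, 0.8]])
init = np.array([0.5, 0.5])              # p(S_1)
means, stds = np.array([0.0, 5.0]), np.array([1.0, 2.0])

N = 500
s = np.zeros(N, dtype=int)
s[0] = rng.choice(2, p=init)
for t in range(1, N):
    s[t] = rng.choice(2, p=A[s[t - 1]])  # current class depends on the previous class
x = rng.normal(means[s], stds[s])        # x_t ~ p(x | class)
```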

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 146: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Mixtures of Sequences

For i = 1 to N

classk ~ p(class1, class2, …., class K)

  while non-end state
    xij ~ p(xj | xj-1, classk)
  end
end

Markov sequence model

Produces a variablelength sequence

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 147: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Mixtures of Curves

For i = 1 to N

  classk ~ p(class1, class2, …, classK)
  Li ~ p(Li | classk)
  for j = 1 to Li
    yij ~ f(y | xj, classk) + ek
  end
end

Length of curve

Independent variable x

Class-dependent curve model

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 148: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Mixtures of Image Models

For i = 1 to N
  classk ~ p(class1, class2, …, classK)
  sizei ~ p(size | classk)
  for j = 1 to Vi-1
    intensityj ~ p(intensity | classk)
  end
end

Pixel generation model

Number of vertices

Global scale

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 149: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

More generally…..

p(Di) = Σk=1…K αk p(Di | ck)

Generative Model

- select a component ck for individual i

- generate data according to p(Di | ck)

- p(Di | ck) can be very general

- e.g., sets of sequences, spatial patterns, etc

[Note: given p(Di | ck), we can define an EM algorithm]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 150: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

NEW

References

• The EM Algorithm and Mixture Models
  – The EM Algorithm and Extensions, G. McLachlan and T. Krishnan, John Wiley and Sons, New York, 1997.

• Mixture models
  – Statistical Analysis of Finite Mixture Distributions, D. M. Titterington, A. F. M. Smith, and U. E. Makov, Wiley & Sons, Inc., New York, 1985.
  – Finite Mixture Models, G. J. McLachlan and D. Peel, New York: Wiley, 2000.
  – Model-based clustering, discriminant analysis, and density estimation, C. Fraley and A. E. Raftery, Journal of the American Statistical Association, 97:611-631, 2002.

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 151: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

NEW

References

• Hidden Markov Models
  – A tutorial on hidden Markov models and selected applications in speech recognition, L. R. Rabiner, Proceedings of the IEEE, vol. 77, no. 2, 257-287, 1989.
  – Probabilistic independence networks for hidden Markov models, P. Smyth, D. Heckerman, and M. Jordan, Neural Computation, vol. 9, no. 2, 227-269, 1997.

– Hidden Markov models, A. Moore, online tutorial slides, http://www.autonlab.org/tutorials/hmm12.pdf

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 152: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Part 5: Case Studies

(i) Simulating and forecasting rainfall data

(ii) Curve clustering with cyclones

(iii) Topic modeling from text documents

and if time permits…..

(iv) Sequence clustering for Web data

(v) Analysis of time-course gene expression data

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 153: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Case Study 1:

Simulating and Predicting Rainfall Patterns

Joint work with:

Andy Robertson, International Research Institute for Climate Prediction

Sergey Kirshner, Department of Computer Science, UC Irvine

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 154: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Spatio-Temporal Rainfall Data

Northeast Brazil 1975-2002

90-day time series, 24 years, 10 stations

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 155: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

[Figure: "Data for One Rain-Station" — x-axis: Day, y-axis: Year]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 156: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Modeling Goals

• “Downscaling”
  – Modeling interannual variability
  – Coupling rainfall to large-scale effects like El Niño

• Prediction
  – e.g., “hindcasting” of missing data

• Seasonal Forecasts
  – e.g., on Dec 1 produce simulations of likely 90-day winters

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 157: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

HMMs for Rainfall Modeling

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Y1

S1

Y2

S2

Y3

S3

YN

SN

I1 I2 I3 IN

S = unobserved weather state
Y = spatial rainfall pattern (“outputs”)
I = atmospheric variables (“inputs”)

Page 158: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Learned Weather States

States provide an interpretable “view” of spatio-temporal relationships in the data

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 159: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 160: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Weather States for Kenya

Page 161: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 162: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 163: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Spatial Chow-Liu Trees

- Spatial distribution given a state is a tree structure (a graphical model)

- Useful intermediate between full pair-wise model and conditional independence

- Optimal topology learned from data using a minimum spanning tree algorithm

- Can use priors based on distance, topography

- Tree-structure over time also
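A possible realization of the spanning-tree step is sketched below, assuming binary (rain / no-rain) data with one column per station; it scores each station pair by empirical mutual information and then takes a maximum-weight spanning tree (equivalently, a minimum spanning tree on negated weights), here using networkx:

```python
# Sketch of Chow-Liu tree learning for binary rainfall-occurrence data.
# X is an (n_days x n_stations) 0/1 array; layout and use of networkx are assumptions.
import numpy as np
import networkx as nx

def chow_liu_tree(X):
    n, d = X.shape
    G = nx.Graph()
    for i in range(d):
        for j in range(i + 1, d):
            mi = 0.0                      # empirical mutual information I(X_i; X_j)
            for a in (0, 1):
                for b in (0, 1):
                    p_ab = np.mean((X[:, i] == a) & (X[:, j] == b))
                    p_a = np.mean(X[:, i] == a)
                    p_b = np.mean(X[:, j] == b)
                    if p_ab > 0:
                        mi += p_ab * np.log(p_ab / (p_a * p_b))
            G.add_edge(i, j, weight=mi)
    # maximum-weight spanning tree over mutual information = optimal tree topology
    return nx.maximum_spanning_tree(G)
```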

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 164: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Missing Data

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 165: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Error rate v. fraction of missing data

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 166: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

NEW

References

• Trees and Hidden Markov Models
  – Conditional Chow-Liu tree structures for modeling discrete-valued vector time series, S. Kirshner, P. Smyth, and A. Robertson, in Proceedings of the 20th International Conference on Uncertainty in AI, 2004.

• Applications to rainfall modeling
  – Hidden Markov models for modeling daily rainfall occurrence over Brazil, A. Robertson, S. Kirshner, and P. Smyth, Journal of Climate, November 2005.

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 167: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Summary

• Simple “empirical” probabilistic models can be very helpful in interpreting large scientific data sets
  – e.g., HMM states provide scientists with a basic but useful classification of historical spatial rainfall patterns

• Graphical models provide “glue” to link together different information
  – Spatial
  – Temporal
  – Hidden states, etc

• “Generative” aspect of probabilistic models can be quite useful, e.g., for simulation

• Missing data is handled naturally in a probabilistic framework

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 168: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Case Study 2:

Clustering Cyclone Trajectories

Joint work with:

Suzana Camargo, Andy Robertson, International Research Institute for Climate Prediction

Scott Gaffney, Department of Computer Science, UC Irvine

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 169: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Storm Trajectories

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 170: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Microarray Gene Expression Data

[Figure: "Time-Course Gene Expression Data" — x-axis: Time (7-minute increments), y-axis: normalized log-ratio of intensity. Yeast cell-cycle data, Spellman et al. (1998)]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 171: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Clustering “non-vector” data

• Challenges with the data…
  – May be of different “lengths”, “sizes”, etc
  – Not easily representable in vector spaces
  – Distance is not naturally defined a priori

• Possible approaches
  – “convert” into a fixed-dimensional vector space
    • Apply standard vector clustering – but loses information
  – use hierarchical clustering
    • But O(N²) and requires a distance measure
  – probabilistic clustering with mixtures
    • Define a generative mixture model for the data
    • Learn distance and clustering simultaneously

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 172: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Graphical Models for Curves

Data = { (y1, t1), …, (yT, tT) }

y

n

θ t

y = f(t ; θ )

e.g., y = at² + bt + c, θ = {a, b, c}
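For intuition, a small sketch (illustrative values only) that simulates one noisy quadratic curve y = at² + bt + c + noise and recovers θ = {a, b, c} by least squares, which is the maximum-likelihood fit under Gaussian noise:

```python
# Minimal sketch: one curve model y = f(t; theta) + Gaussian noise, theta = (a, b, c).
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
a, b, c, sigma = 2.0, -1.0, 0.5, 0.1                  # illustrative "true" parameters
y = a * t**2 + b * t + c + rng.normal(0, sigma, t.size)

design = np.column_stack([t**2, t, np.ones_like(t)])  # columns for a, b, c
theta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
```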

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 173: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Graphical Models for Curves

y

T points

θ t

σ

y ~ Gaussian density with mean = f(t ; θ), variance = σ²

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 174: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Example

y

t

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 175: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Example

f(t ; θ ) <- this is hidden

y

t

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 176: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Graphical Models for Sets of Curves

y

T

θ t

σ

N curves

Each curve: P(yi | ti, θ ) = product of Gaussians

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 177: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Curve-Specific Transformations

y

T

θ t

σ

N curves

α

Note: we can learn function parameters and shifts simultaneously with EM

e.g., yi = at² + bt + c + αi, θ = {a, b, c, α1, …, αN}

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 178: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Learning Shapes and Shifts

Original data Data after Learning

Data = smoothed growth acceleration data from teenagers

EM used to learn a spline model + time-shift for each curve

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 179: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Clustering: Mixtures of Curves

y

T

θ t

σ

N curves

α

c

Each set of trajectory points comes from 1 of K models
Model for group k is a Gaussian curve model
Marginal probability for a trajectory = mixture model

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 180: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

The Learning Problem

• K cluster models
  – Each cluster is a shape model E[Y] = f(X; θ) with its own parameters

• N observed curves: for each curve we learn
  – P(cluster k | curve data)
  – distribution on alignments, shifts, scaling, etc., given the data

• Requires simultaneous learning of
  – Cluster models
  – Curve transformation parameters

• Results in an EM algorithm where the E and M steps are tractable (a small sketch of the E-step follows below)
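A minimal sketch of the E-step for such a curve mixture is given below; for brevity it ignores the alignment/shift variables and assumes the cluster shape models are polynomials with known coefficients, noise levels, and weights (all of this is illustrative, not the actual implementation used for the cyclone results):

```python
# Sketch of the E-step for a mixture of polynomial regression ("curve") models:
# compute p(cluster k | curve data) for each observed curve, given current parameters.
import numpy as np
from scipy.stats import norm

def curve_responsibilities(curves, coeffs, sigmas, weights):
    """curves: list of (t, y) arrays; coeffs[k]: polynomial coefficients of cluster k."""
    K = len(weights)
    resp = np.zeros((len(curves), K))
    for i, (t, y) in enumerate(curves):
        for k in range(K):
            mean_k = np.polyval(coeffs[k], t)
            # log p(curve i | cluster k): product of Gaussians over the curve's points
            log_lik = norm.logpdf(y, mean_k, sigmas[k]).sum()
            resp[i, k] = np.log(weights[k]) + log_lik
        resp[i] -= resp[i].max()          # subtract max for numerical stability
        resp[i] = np.exp(resp[i])
        resp[i] /= resp[i].sum()          # normalize to get p(cluster k | curve i)
    return resp
```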

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 181: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

[Figure: "Simulated Curves (K=2 Clusters)" — x-axis: Time]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 182: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

[Figure: "Simulated Data after Alignment" — x-axis: Time]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 183: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Results on Simulated Data

Method               Classification Accuracy   LogP     Error in Mean   Within-Cluster σ
True Model           1                          2.01     0               0.050
EM with Alignment    0.99                       1.34     0.019           0.048
Standard EM          0.89                      -7.87     0.171           0.105
K-means              0.79                       —        0.424           0.129

*Averaged over 50 train/test sets

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 184: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Clusters of Trajectories

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 185: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Cluster Shapes for Pacific Cyclones

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 186: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

TROPICAL CYCLONES Western North Pacific 1983-2002

Page 187: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 188: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

NEW

References on Curve Clustering

• Functional Data Analysis, J. O. Ramsay and B. W. Silverman, Springer, 1997.

• Probabilistic curve-aligned clustering and prediction with regression mixture models, S. J. Gaffney, PhD thesis, Department of Computer Science, University of California, Irvine, March 2004.

• Joint probabilistic curve clustering and alignment, S. Gaffney and P. Smyth, Advances in Neural Information Processing 17, in press, 2005.

• Probabilistic clustering of extratropical cyclones using regression mixture models, S. Gaffney, A. Robertson, P. Smyth, S. Camargo, M. Ghil, preprint, online at www.datalab.uci.edu.

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 189: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Summary

• Graphical models provide a flexible representational language for modeling complex scientific data
  – can build complex models from simpler building blocks

• Systematic variability in the data can be handled in a principled way
  – Variable length time-series
  – Misalignments in trajectories

• Generative probabilistic models are interpretable and understandable by scientists

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 190: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Case Study 3:

Topic Modeling from Text Documents

Joint work with:

Mark Steyvers, Dave Newman, Chaitanya Chemudugunta, UC Irvine

Michal Rosen-Zvi, Hebrew University, Jerusalem

Tom Griffiths, Brown University

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 191: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Enron email data

250,000 emails

5000 authors

1999-2002

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 192: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Questions of Interest

– What topics do these documents “span”?

– Which documents are about a particular topic?

– How have topics changed over time?

– What does author X write about?

– Who is likely to write about topic Y?

– Who wrote this specific document?

– and so on…..

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 193: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Graphical Model for Clustering

z

w

Cluster for document

Word

φ: Cluster-Word distributions

D

n

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 194: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Graphical Model for Topics

z

w

Topic

Word

θ

φ

Document-Topic distributions

D

n

Topic-Word distributions
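The generative story this graphic encodes can be sketched in a few lines: for every word slot, draw a topic z from the document's topic weights θ, then draw the word w from that topic's word distribution φ. The vocabulary and distributions below are illustrative placeholders only:

```python
# Minimal sketch of the topic-model generative process for one document.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["data", "mining", "retrieval", "text", "probability", "bayesian"]
phi = np.array([[0.40, 0.30, 0.05, 0.05, 0.10, 0.10],   # topic-word distributions (rows sum to 1)
                [0.05, 0.05, 0.10, 0.10, 0.35, 0.35]])
theta = np.array([0.7, 0.3])                             # document-topic distribution

n_words = 10
z = rng.choice(len(theta), size=n_words, p=theta)        # topic assignment per word
words = [vocab[rng.choice(len(vocab), p=phi[k])] for k in z]
```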

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 195: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Topic = probability distribution over words

WORD PROB.

PROBABILISTIC 0.0778 BAYESIAN 0.0671

PROBABILITY 0.0532 CARLO 0.0309 MONTE 0.0308

DISTRIBUTION 0.0257 INFERENCE 0.0253

PROBABILITIES 0.0253 CONDITIONAL 0.0229

PRIOR 0.0219.... ...

TOPIC 209

WORD PROB.

RETRIEVAL 0.1179 TEXT 0.0853

DOCUMENTS 0.0527 INFORMATION 0.0504

DOCUMENT 0.0441 CONTENT 0.0242 INDEXING 0.0205

RELEVANCE 0.0159 COLLECTION 0.0146

RELEVANT 0.0136... ...

TOPIC 289

P(w | z)

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 196: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Topics vs. Other Approaches

• Clustering documents
  – Computationally simpler…
  – But a less accurate and less flexible model

• LSI/LSA
  – Projects words into a K-dimensional hidden space
  – Less interpretable
  – Not generalizable
    • E.g., to authors or other side-information
  – Not as accurate
    • E.g., precision-recall: Hoffman, Blei et al, Buntine, etc

• Topic Models (aka LDA model)
  – “next-generation” text modeling, after LSI
  – More flexible and more accurate (in prediction)
  – Linear time complexity in fitting the model

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 197: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

What can Topic Models be used for?

– Queries
  • Who writes on this topic?
    – e.g., finding experts or reviewers in a particular area
  • What topics does this person do research on?

– Comparing groups of authors or documents

– Discovering trends over time

– Detecting unusual papers and authors

– Interactive browsing of a digital library via topics

– Parsing documents (and parts of documents) by topic

– and more…..

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 198: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

[Figure: "Changing Trends in Computer Science" — topic probability by year, 1990-2002, for the topics Operating Systems, Information Retrieval, WWW, and Programming Languages]

Page 199: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

[Figure: "Security-Related Topics" — topic probability by year, 1990-2002, for the topics Computer Security and Encryption]

Page 200: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Enron email data

250,000 emails

5000 authors

1999-2002

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 201: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Enron email topics

WORD PROB. WORD PROB. WORD PROB. WORD PROB.FEEDBACK 0.0781 PROJECT 0.0514 FERC 0.0554 ENVIRONMENTAL 0.0291

PERFORMANCE 0.0462 PLANT 0.028 MARKET 0.0328 AIR 0.0232PROCESS 0.0455 COST 0.0182 ISO 0.0226 MTBE 0.019

PEP 0.0446 CONSTRUCTION 0.0169 COMMISSION 0.0215 EMISSIONS 0.017MANAGEMENT 0.03 UNIT 0.0166 ORDER 0.0212 CLEAN 0.0143

COMPLETE 0.0205 FACILITY 0.0165 FILING 0.0149 EPA 0.0133QUESTIONS 0.0203 SITE 0.0136 COMMENTS 0.0116 PENDING 0.0129SELECTED 0.0187 PROJECTS 0.0117 PRICE 0.0116 SAFETY 0.0104

COMPLETED 0.0146 CONTRACT 0.011 CALIFORNIA 0.0110 WATER 0.0092SYSTEM 0.0146 UNITS 0.0106 FILED 0.0110 GASOLINE 0.0086

SENDER PROB. SENDER PROB. SENDER PROB. SENDER PROB.perfmgmt 0.2195 *** 0.0288 *** 0.0532 *** 0.1339

perf eval process 0.0784 *** 0.022 *** 0.0454 *** 0.0275enron announcements 0.0489 *** 0.0123 *** 0.0384 *** 0.0205

*** 0.0089 *** 0.0111 *** 0.0334 *** 0.0166*** 0.0048 *** 0.0108 *** 0.0317 *** 0.0129

TOPIC 23TOPIC 36 TOPIC 72 TOPIC 54

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 202: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Non-work Topics…

WORD PROB. WORD PROB. WORD PROB. WORD PROB.HOLIDAY 0.0857 TEXANS 0.0145 GOD 0.0357 AMAZON 0.0312PARTY 0.0368 WIN 0.0143 LIFE 0.0272 GIFT 0.0226YEAR 0.0316 FOOTBALL 0.0137 MAN 0.0116 CLICK 0.0193

SEASON 0.0305 FANTASY 0.0129 PEOPLE 0.0103 SAVE 0.0147COMPANY 0.0255 SPORTSLINE 0.0129 CHRIST 0.0092 SHOPPING 0.0140

CELEBRATION 0.0199 PLAY 0.0123 FAITH 0.0083 OFFER 0.0124ENRON 0.0198 TEAM 0.0114 LORD 0.0079 HOLIDAY 0.0122

TIME 0.0194 GAME 0.0112 JESUS 0.0075 RECEIVE 0.0102RECOGNIZE 0.019 SPORTS 0.011 SPIRITUAL 0.0066 SHIPPING 0.0100

MONTH 0.018 GAMES 0.0109 VISIT 0.0065 FLOWERS 0.0099

SENDER PROB. SENDER PROB. SENDER PROB. SENDER PROB.chairman & ceo 0.131 cbs sportsline com 0.0866 crosswalk com 0.2358 amazon com 0.1344

*** 0.0102 houston texans 0.0267 wordsmith 0.0208 jos a bank 0.0266*** 0.0046 houstontexans 0.0203 *** 0.0107 sharperimageoffers 0.0136*** 0.0022 sportsline rewards 0.0175 doctor dictionary 0.0101 travelocity com 0.0094

general announcement 0.0017 pro football 0.0136 *** 0.0061 barnes & noble com 0.0089

TOPIC 109TOPIC 66 TOPIC 182 TOPIC 113

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 203: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Topical Topics

WORD PROB. WORD PROB. WORD PROB. WORD PROB.POWER 0.0915 STATE 0.0253 COMMITTEE 0.0197 LAW 0.0380

CALIFORNIA 0.0756 PLAN 0.0245 BILL 0.0189 TESTIMONY 0.0201ELECTRICITY 0.0331 CALIFORNIA 0.0137 HOUSE 0.0169 ATTORNEY 0.0164

UTILITIES 0.0253 POLITICIAN Y 0.0137 WASHINGTON 0.0140 SETTLEMENT 0.0131PRICES 0.0249 RATE 0.0131 SENATE 0.0135 LEGAL 0.0100MARKET 0.0244 BANKRUPTCY 0.0126 POLITICIAN X 0.0114 EXHIBIT 0.0098PRICE 0.0207 SOCAL 0.0119 CONGRESS 0.0112 CLE 0.0093

UTILITY 0.0140 POWER 0.0114 PRESIDENT 0.0105 SOCALGAS 0.0093CUSTOMERS 0.0134 BONDS 0.0109 LEGISLATION 0.0099 METALS 0.0091

ELECTRIC 0.0120 MOU 0.0107 DC 0.0093 PERSON Z 0.0083

SENDER PROB. SENDER PROB. SENDER PROB. SENDER PROB.*** 0.1160 *** 0.0395 *** 0.0696 *** 0.0696*** 0.0518 *** 0.0337 *** 0.0453 *** 0.0453*** 0.0284 *** 0.0295 *** 0.0255 *** 0.0255*** 0.0272 *** 0.0251 *** 0.0173 *** 0.0173*** 0.0266 *** 0.0202 *** 0.0317 *** 0.0317

TOPIC 194TOPIC 18 TOPIC 22 TOPIC 114

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 204: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

UPDATED

Using Topic Models for Information Retrieval

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 205: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Author-Topic Models

• The author-topic model
  – a probabilistic model linking authors and topics
    • authors -> topics -> words
  – Topic = distribution over words
  – Author = distribution over topics
  – Document = generated from a mixture of author distributions
  – Learns about entities based on associated text

• Can be generalized
  – Replace author with any categorical doc information
  – e.g., publication type, source, year, country of origin, etc

(a small generative sketch follows below)
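A small sketch of this generative story (the one referenced above): each word first picks one of the document's authors, then a topic from that author's topic distribution, then a word from that topic. All names and distributions here are illustrative placeholders:

```python
# Minimal sketch of the author-topic generative process for one document.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["query", "index", "retrieval", "bayesian", "inference"]
phi = np.array([[0.5, 0.3, 0.2, 0.0, 0.0],       # topic-word distributions
                [0.0, 0.0, 0.2, 0.4, 0.4]])
author_topics = {"author_A": [0.9, 0.1],          # author-topic distributions (hypothetical authors)
                 "author_B": [0.2, 0.8]}

doc_authors = ["author_A", "author_B"]
words = []
for _ in range(8):
    a = rng.choice(doc_authors)                              # x: pick an author of the document
    z = rng.choice(2, p=author_topics[a])                    # z: topic from that author
    words.append(vocab[rng.choice(len(vocab), p=phi[z])])    # w: word from that topic
```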

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 206: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Author-Topic Graphical Model

x

z

w

a

Author

Topic

Word

θ

φ

Author-Topic distributions

Topic-Word distributions

D

n

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 207: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Learning Author-Topic Models from Text

• Full probabilistic model
  – Power of statistical learning can be leveraged
  – Learning algorithm is linear in number of word occurrences
    • Scalable to very large data sets
  – Completely automated (no tweaking required)
    • completely unsupervised, no labels

• Query answering
  – A wide variety of queries can be answered:
    • Which authors write on topic X?
    • What are the spatial patterns in usage of topic Y?
    • How have authors A, B and C changed over time?
  – Queries answered using probabilistic inference
    • Query time is real-time (learning is offline)

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 208: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Author-Topic Models for CiteSeer

WORD PROB. WORD PROB. WORD PROB. WORD PROB. DATA 0.1563 PROBABILISTIC 0.0778 RETRIEVAL 0.1179 QUERY 0.1848

MINING 0.0674 BAYESIAN 0.0671 TEXT 0.0853 QUERIES 0.1367 ATTRIBUTES 0.0462 PROBABILITY 0.0532 DOCUMENTS 0.0527 INDEX 0.0488 DISCOVERY 0.0401 CARLO 0.0309 INFORMATION 0.0504 DATA 0.0368

ASSOCIATION 0.0335 MONTE 0.0308 DOCUMENT 0.0441 JOIN 0.0260 LARGE 0.0280 DISTRIBUTION 0.0257 CONTENT 0.0242 INDEXING 0.0180

KNOWLEDGE 0.0260 INFERENCE 0.0253 INDEXING 0.0205 PROCESSING 0.0113 DATABASES 0.0210 PROBABILITIES 0.0253 RELEVANCE 0.0159 AGGREGATE 0.0110 ATTRIBUTE 0.0188 CONDITIONAL 0.0229 COLLECTION 0.0146 ACCESS 0.0102 DATASETS 0.0165 PRIOR 0.0219 RELEVANT 0.0136 PRESENT 0.0095

AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. Han_J 0.0196 Friedman_N 0.0094 Oard_D 0.0110 Suciu_D 0.0102

Rastogi_R 0.0094 Heckerman_D 0.0067 Croft_W 0.0056 Naughton_J 0.0095 Zaki_M 0.0084 Ghahramani_Z 0.0062 Jones_K 0.0053 Levy_A 0.0071 Shim_K 0.0077 Koller_D 0.0062 Schauble_P 0.0051 DeWitt_D 0.0068 Ng_R 0.0060 Jordan_M 0.0059 Voorhees_E 0.0050 Wong_L 0.0067 Liu_B 0.0058 Neal_R 0.0055 Singhal_A 0.0048 Chakrabarti_K 0.0064

Mannila_H 0.0056 Raftery_A 0.0054 Hawking_D 0.0048 Ross_K 0.0061 Brin_S 0.0054 Lukasiewicz_T 0.0053 Merkl_D 0.0042 Hellerstein_J 0.0059 Liu_H 0.0047 Halpern_J 0.0052 Allan_J 0.0040 Lenzerini_M 0.0054

Holder_L 0.0044 Muller_P 0.0048 Doermann_D 0.0039 Moerkotte_G 0.0053

TOPIC 205 TOPIC 209 TOPIC 289 TOPIC 10

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 209: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Author-Profiles

• Author = Andrew McCallum, U Mass:
  – Topic 1: classification, training, generalization, decision, data, …
  – Topic 2: learning, machine, examples, reinforcement, inductive, …
  – Topic 3: retrieval, text, document, information, content, …

• Author = Hector Garcia-Molina, Stanford:
  – Topic 1: query, index, data, join, processing, aggregate, …
  – Topic 2: transaction, concurrency, copy, permission, distributed, …
  – Topic 3: source, separation, paper, heterogeneous, merging, …

• Author = Jerry Friedman, Stanford:
  – Topic 1: regression, estimate, variance, data, series, …
  – Topic 2: classification, training, accuracy, decision, data, …
  – Topic 3: distance, metric, similarity, measure, nearest, …

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 210: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

PROB. TOPIC WORDS

.9910 82 OUTLOOK, MIGRATION, NOTES, OWA, INFORMATION, EMAIL, BUTTON, SEND, MAILBOX, ACCESS

.0016 91 ENRON, CORP, SERVICES, BROADBAND, EBS, ADDITION, BUILDING, INCLUDES, ATTACHMENT, COMPETITION

.0005 77 EMAIL, ADDRESS, INTERNET, SEND, ECT, MESSAGING, BUSINESS, ADMINISTRATION, QUESTIONS, SUPPORT

.0004 83 ISSUE, GENERAL, ISSUES, CASE, DUE, INVOLVED, DISCUSSION, MENTIONED, PLACE, POINT

PROB. TOPIC WORDS

.3593 17 ANALYST, SERVICES, INDUSTRY, TELECOM, ENERGY, MARKETS, FOOL, BANDWIDTH, ESOURCE, TRAINING

.0773 177 ACCOUNT, ONLINE, OFFER, TRADE, TIME, INVESTMENT, ACCOUNTS, FREE, INFORMATION, ACCESS

.0713 169 HTTP, WWW, GIF, IMAGES, ASP, SPACER, EMAIL, CGI, HTML, CLICK

.0660 200 DECEMBER, JANUARY, MARCH, NOVEMBER, FEBRUARY, WEEK, FRIDAY, SEPTEMBER, WEDNESDAY, TUESDAY

PROB. TOPIC WORDS

.1855 105 CUSTOMERS, RATE, PG, CPUC, SCE, UTILITY, ACCESS, CUSTOMER, DECISION, DIRECT

.1289 54 FERC, MARKET, ISO, COMMISSION, ORDER, FILING, COMMENTS, PRICE, CALIFORNIA, FILED

.0920 44 MILLION, BILLION, YEAR, NEWS, CORP, CONTRACTS, GAS, COMPANY, COMPANIES, WATER

.0719 124 STATE, PUBLIC, DAVIS, SAN, GOVERNOR, COMMISSION, GOV, SUMMER, COSTS, HOUR

PROB. TOPIC WORDS

.2590 178 CAPACITY, GAS, EL, PASO, PIPELINE, MMBTU, CALIFORNIA, SHIPPERS, MMCF, RATE

.0902 74 GAS, CONTRACT, DAY, VOLUMES, CHANGE, DAILY, DAN, MONTH, KIM, CONTRACTS

.0645 70 GOOD, TIME, WORK, TALK, DON, BACK, WEEK, DIDN, THOUGHT, SEND

.0599 116 SYSTEM, FACILITIES, TIME, EXISTING, SERVICES, BASED, ADDITIONAL, CURRENT, END, AREA

PROB. TOPIC WORDS

.1268 42 MEXICO, ARGENTINA, ANDREA, BRAZIL, TAX, OFFICE, LOCAL, RICHARD, COPY, STAFF

.1045 189 AGREEMENT, ENA, LANGUAGE, CONTRACT, TRANSACTION, DEAL, FORWARD, REVIEW, TERMS, QUESTIONS

.0815 176 MARK, TRADING, LEGAL, LONDON, DERIVATIVES, ENRONONLINE, TRADE, ENTITY, COUNTERPARTY, HOUSTON

.0784 135 SUBJECT, REQUIRED, INCLUDING, BASIS, POLICY, BASED, APPROVAL, APPROVED, RIGHTS, DAYS

AUTHOR = Individual C (159 emails)

AUTHOR = Individual B (193 emails)

AUTHOR = Outlook Migration Team (132 emails)

AUTHOR = The Motley Fool (145 emails)

AUTHOR = Individual A (411 emails)

Page 211: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

PubMed-Query Topics

WORD PROB. WORD PROB. WORD PROB. WORD PROB. BIOLOGICAL 0.1002 PLAGUE 0.0296 BOTULISM 0.1014 HIV 0.0916 AGENTS 0.0889 MEDICAL 0.0287 BOTULINUM 0.0888 PROTEASE 0.0563 THREAT 0.0396 CENTURY 0.0280 TOXIN 0.0877 AMPRENAVIR 0.0527

BIOTERRORISM 0.0348 MEDICINE 0.0266 TYPE 0.0669 INHIBITORS 0.0366 WEAPONS 0.0328 HISTORY 0.0203 CLOSTRIDIUM 0.0340 INHIBITOR 0.0220 POTENTIAL 0.0305 EPIDEMIC 0.0106 INFANT 0.0245 PLASMA 0.0204 ATTACK 0.0290 GREAT 0.0091 NEUROTOXIN 0.0184 APV 0.0169

CHEMICAL 0.0288 EPIDEMICS 0.0090 BONT 0.0167 DRUG 0.0169 WARFARE 0.0219 CHINESE 0.0083 FOOD 0.0134 RITONAVIR 0.0164 ANTHRAX 0.0146 FRENCH 0.0082 PARALYSIS 0.0124 IMMUNODEFICIENC 0.0150

AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. Atlas_RM 0.0044 Károly_L 0.0089 Hatheway_CL 0.0254 Sadler_BM 0.0129 Tegnell_A 0.0036 Jian-ping_Z 0.0085 Schiavo_G 0.0141 Tisdale_M 0.0118 Aas_P 0.0036 Sabbatani_S 0.0080 Sugiyama_H 0.0111 Lou_Y 0.0069

Greenfield_RA 0.0032 Theodorides_J 0.0045 Arnon_SS 0.0108 Stein_DS 0.0069 Bricaire_F 0.0032 Bowers_JZ 0.0045 Simpson_LL 0.0093 Haubrich_R 0.0061

TOPIC 32TOPIC 188 TOPIC 63 TOPIC 85

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 212: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

PubMed-Query Topics

WORD PROB. WORD PROB. WORD PROB. WORD PROB. ANTHRACIS 0.1627 CHEMICAL 0.0578 HD 0.0657 ENZYME 0.0938 ANTHRAX 0.1402 SARIN 0.0454 MUSTARD 0.0639 ACTIVE 0.0429 BACILLUS 0.1219 AGENT 0.0332 EXPOSURE 0.0444 SUBSTRATE 0.0399 SPORES 0.0614 GAS 0.0312 SM 0.0353 SITE 0.0361 CEREUS 0.0382 AGENTS 0.0268 SULFUR 0.0343 ENZYMES 0.0308 SPORE 0.0274 VX 0.0264 SKIN 0.0208 REACTION 0.0225

THURINGIENSIS 0.0177 NERVE 0.0232 EXPOSED 0.0185 SUBSTRATES 0.0201 SUBTILIS 0.0152 ACID 0.0220 AGENT 0.0140 FOLD 0.0176 STERNE 0.0124 TOXIC 0.0197 EPIDERMAL 0.0129 CATALYTIC 0.0154

INHALATIONAL 0.0104 PRODUCTS 0.0170 DAMAGE 0.0116 RATE 0.0148

AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. AUTHOR PROB. Mock_M 0.0203 Minami_M 0.0093 Monteiro-Riviere_NA 0.0284 Masson_P 0.0166 Phillips_AP 0.0125 Hoskin_FC 0.0092 Smith_WJ 0.0219 Kovach_IM 0.0137

Welkos_SL 0.0083 Benschop_HP 0.0090 Lindsay_CD 0.0214 Schramm_VL 0.0094 Turnbull_PC 0.0071 Raushel_FM 0.0084 Sawyer_TW 0.0146 Barak_D 0.0076 Fouet_A 0.0067 Wild_JR 0.0075 Meier_HL 0.0139 Broomfield_CA 0.0072

TOPIC 178TOPIC 40 TOPIC 89 TOPIC 104

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 213: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

PubMed: Topics by Country

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

ISRAEL, n=196 authors TOPIC 188 TOPIC 6 TOPIC 133 TOPIC 104 TOPIC 159

p=0.049 p=0.045 p=0.043 p=0.027 p=0.025 BIOLOGICAL INJURY HEALTH HD EMERGENCY

AGENTS INJURIES PUBLIC MUSTARD RESPONSE THREAT WAR CARE EXPOSURE MEDICAL

BIOTERRORISM TERRORIST SERVICES SM PREPAREDNESS WEAPONS MILITARY EDUCATION SULFUR DISASTER POTENTIAL MEDICAL NATIONAL SKIN MANAGEMENT

ATTACK VICTIMS COMMUNITY EXPOSED TRAINING CHEMICAL TRAUMA INFORMATION AGENT EVENTS

WARFARE BLAST PREVENTION EPIDERMAL BIOTERRORISM ANTHRAX VETERANS LOCAL DAMAGE LOCAL

CHINA, n=1775 authors TOPIC 177 TOPIC 7 TOPIC 79 TOPIC 49 TOPIC 197

p=0.045 p=0.026 p=0.024 p=0.024 p=0.023 SARS RENAL FINDINGS METHODS PATIENTS

RESPIRATORY HFRS CHEST RESULTS HOSPITAL SEVERE VIRUS CT CONCLUSION PATIENT

COV SYNDROME LUNG OBJECTIVE ADMITTED SYNDROME FEVER CLINICAL CONCLUSIONS TWENTY

ACUTE HEMORRHAGIC PULMONARY BACKGROUND HOSPITALIZED CORONAVIRUS HANTAVIRUS ABNORMAL STUDY CONSECUTIVE

CHINA HANTAAN INVOLVEMENT OBJECTIVES PROSPECTIVELY KONG PUUMALA COMMON INVESTIGATE DIAGNOSED

Page 214: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

POTENTIAL MEDICAL NATIONAL SKIN MANAGEMENT ATTACK VICTIMS COMMUNITY EXPOSED TRAINING

CHEMICAL TRAUMA INFORMATION AGENT EVENTS WARFARE BLAST PREVENTION EPIDERMAL BIOTERRORISM

ANTHRAX VETERANS LOCAL DAMAGE LOCAL

CHINA, n=1775 authors TOPIC 177 TOPIC 7 TOPIC 79 TOPIC 49 TOPIC 197

p=0.045 p=0.026 p=0.024 p=0.024 p=0.023 SARS RENAL FINDINGS METHODS PATIENTS

RESPIRATORY HFRS CHEST RESULTS HOSPITAL SEVERE VIRUS CT CONCLUSION PATIENT

COV SYNDROME LUNG OBJECTIVE ADMITTED SYNDROME FEVER CLINICAL CONCLUSIONS TWENTY

ACUTE HEMORRHAGIC PULMONARY BACKGROUND HOSPITALIZED CORONAVIRUS HANTAVIRUS ABNORMAL STUDY CONSECUTIVE

CHINA HANTAAN INVOLVEMENT OBJECTIVES PROSPECTIVELY KONG PUUMALA COMMON INVESTIGATE DIAGNOSED

PROBABLE HANTAVIRUSES RADIOGRAPHIC DESIGN PROGNOSIS

PubMed-Query: Topics by Country

Page 215: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Extended Models

• Conditioning on non-authors
  – “side-information” other than authors
  – e.g., date, publication venue, country, etc
  – can use citations as authors

• Fictitious authors and common author
  – Allow 1 unique fictitious author per document
    • Captures document-specific effects
  – Assign 1 common fictitious author to each document
    • Captures broad topics that are used in many documents

• Semantics and syntax model
  – Semantic topics = topics that are specific to certain documents
  – Syntactic topics = broad, across many documents
  – Probabilistic model that learns each type automatically

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 216: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Scientific syntax and semantics
(Griffiths et al., NIPS 2004 – slides courtesy of Mark Steyvers and Tom Griffiths, PNAS Symposium presentation, 2003)

Factorization of language based on statistical dependency patterns:

  – long-range, document-specific dependencies

  – short-range dependencies constant across all documents

semantics: probabilistic topics

[Figure: graphical model — per-document topic weights θ generate topic assignments z for some words w (semantics), while a Markov chain over classes x generates the remaining words]

syntax: probabilistic regular grammar

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 217: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

[Figure: example generative model — semantic topics z = 1 (HEART, LOVE, SOUL, TEARS, JOY; weight 0.4) and z = 2 (SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS; weight 0.6); syntactic classes x = 1 (topic words), x = 2 (OF, FOR, BETWEEN), x = 3 (THE, A, MANY), with transition probabilities between classes]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

Page 218: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

[Figure: same model as the previous slide]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

THE ………………………………

Page 219: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

[Figure: same model as the previous slide]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

THE LOVE……………………

Page 220: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

[Figure: same model as the previous slide]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

THE LOVE OF………………

Page 221: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

[Figure: same model as the previous slide]

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005

THE LOVE OF RESEARCH ……

Page 222: Principles and Applications of Probabilistic Learningsmyth/talks/kdd2005_tutorial_prob_learning.pdf · 2. Probabilistic learning • e.g., for regression, model p(y|x) or p(y,x) •

Semantic topics

Topic 29:  AGE LIFE AGING OLD YOUNG CRE AGED SENESCENCE MORTALITY AGES CR INFANTS SPAN MEN WOMEN SENESCENT LOXP INDIVIDUALS CHILDREN NORMAL

Topic 46:  SELECTION POPULATION SPECIES POPULATIONS GENETIC EVOLUTION SIZE NATURAL VARIATION FITNESS MUTATION PER NUCLEOTIDE RATES RATE HYBRID DIVERSITY SUBSTITUTION SPECIATION EVOLUTIONARY

Topic 51:  LOCI LOCUS ALLELES ALLELE GENETIC LINKAGE POLYMORPHISM CHROMOSOME MARKERS SUSCEPTIBILITY ALLELIC POLYMORPHIC POLYMORPHISMS RESTRICTION FRAGMENT HAPLOTYPE GENE LENGTH DISEASE MICROSATELLITE

Topic 71:  TUMOR CANCER TUMORS BREAST HUMAN CARCINOMA PROSTATE MELANOMA CANCERS NORMAL COLON LUNG APC MAMMARY CARCINOMAS MALIGNANT CELL GROWTH METASTATIC EPITHELIAL

Topic 115: MALE FEMALE MALES FEMALES SPERM SEX SEXUAL MATING REPRODUCTIVE OFFSPRING PHEROMONE SOCIAL EGG BEHAVIOR EGGS FERTILIZATION MATERNAL PATERNAL FERTILITY GERM

Topic 125: MEMORY LEARNING BRAIN TASK CORTEX SUBJECTS LEFT RIGHT SONG TASKS HIPPOCAMPAL PERFORMANCE SPATIAL PREFRONTAL COGNITIVE TRAINING TOMOGRAPHY FRONTAL MOTOR EMISSION
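Tables like this are read off the fitted topic-word distributions after learning. A minimal sketch of that step (hypothetical variable names, assuming a word-by-topic count matrix from a Gibbs sampler and a symmetric Dirichlet smoothing parameter beta):

import numpy as np

def top_words(word_topic_counts, vocab, beta=0.01, n_top=20):
    # word_topic_counts: (V, K) array of counts n_{w,k} accumulated by the sampler
    phi = word_topic_counts + beta
    phi = phi / phi.sum(axis=0, keepdims=True)      # column k is p(word | topic k)
    return {k: [vocab[i] for i in np.argsort(-phi[:, k])[:n_top]]
            for k in range(phi.shape[1])}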


Syntactic classes

Class 5:  IN FOR ON BETWEEN DURING AMONG FROM UNDER WITHIN THROUGHOUT THROUGH TOWARD INTO AT INVOLVING AFTER ACROSS AGAINST WHEN ALONG

Class 8:  ARE WERE WAS IS WHEN REMAIN REMAINS REMAINED PREVIOUSLY BECOME BECAME BEING BUT GIVE MERE APPEARED APPEAR ALLOWED NORMALLY EACH

Class 14: THE THIS ITS THEIR AN EACH ONE ANY INCREASED EXOGENOUS OUR RECOMBINANT ENDOGENOUS TOTAL PURIFIED TILE FULL CHRONIC ANOTHER EXCESS

Class 25: SUGGEST INDICATE SUGGESTING SUGGESTS SHOWED REVEALED SHOW DEMONSTRATE INDICATING PROVIDE SUPPORT INDICATES PROVIDES INDICATED DEMONSTRATED SHOWS SO REVEAL DEMONSTRATES SUGGESTED

Class 26: LEVELS NUMBER LEVEL RATE TIME CONCENTRATIONS VARIETY RANGE CONCENTRATION DOSE FAMILY SET FREQUENCY SERIES AMOUNTS RATES CLASS VALUES AMOUNT SITES

Class 30: RESULTS ANALYSIS DATA STUDIES STUDY FINDINGS EXPERIMENTS OBSERVATIONS HYPOTHESIS ANALYSES ASSAYS POSSIBILITY MICROSCOPY PAPER WORK EVIDENCE FINDING MUTAGENESIS OBSERVATION MEASUREMENTS REMAINED

Class 33: BEEN MAY CAN COULD WELL DID DOES DO MIGHT SHOULD WILL WOULD MUST CANNOT THEY ALSO BECOME MAG LIKELY


(PNAS, 1991, vol. 88, 4874-4876)

A23 generalized49 fundamental11 theorem20 of4 natural46 selection46 is32

derived17 for5 populations46 incorporating22 both39 genetic46 and37 cultural46

transmission46. The14 phenotype15 is32 determined17 by42 an23 arbitrary49

number26 of4 multiallelic52 loci40 with22 two39-factor148 epistasis46 and37 an23

arbitrary49 linkage11 map20, as43 well33 as43 by42 cultural46 transmission46

from22 the14 parents46. Generations46 are8 discrete49 but37 partially19

overlapping24, and37 mating46 may33 be44 nonrandom17 at9 either39 the14

genotypic46 or37 the14 phenotypic46 level46 (or37 both39). I12 show34 that47

cultural46 transmission46 has18 several39 important49 implications6 for5 the14

evolution46 of4 population46 fitness46, most36 notably4 that47 there41 is32 a23

time26 lag7 in22 the14 response28 to31 selection46 such9 that47 the14 future137

evolution46 depends29 on21 the14 past24 selection46 history46 of4 the14

population46.

(graylevel = “semanticity”, the probability of using LDA over HMM)
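One way to make the "semanticity" score explicit (a hedged paraphrase of the composite model, not a formula taken from the slides): it is the posterior probability that word i was emitted through the topic (LDA) state of the class chain rather than one of the purely syntactic classes. Up to normalization over the classes, this combines the HMM transition terms with the word's probability under the document's topics:

P(c_i = \mathrm{sem} \mid \mathbf{w}, \mathbf{c}_{-i}, \theta_d) \;\propto\; P(c_i = \mathrm{sem} \mid c_{i-1}) \, P(c_{i+1} \mid c_i = \mathrm{sem}) \, \sum_k P(z_i = k \mid \theta_d) \, P(w_i \mid z_i = k)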


(PNAS, 1996, vol. 93, 14628-14631)

The14 ''shape7'' of4 a23 female115 mating115 preference125 is32 the14

relationship7 between4 a23 male115 trait15 and37 the14 probability7 of4

acceptance21 as43 a23 mating115 partner20, The14 shape7 of4 preferences115

is32 important49 in5 many39 models6 of4 sexual115 selection46, mate115

recognition125, communication9, and37 speciation46, yet50 it41 has18

rarely19 been33 measured17 precisely19, Here12 I9 examine34 preference7

shape7 for5 male115 calling115 song125 in22 a23 bushcricket*13 (katydid*48). Preferences115 change46 dramatically19 between22 races46 of4 a23 species15, from22 strongly19 directional11 to31 broadly19 stabilizing45 (but50 with21 a23

net49 directional46 effect46), Preference115 shape46 generally19 matches10

the14 distribution16 of4 the14 male115 trait15, This41 is32 compatible29 with21

a23 coevolutionary46 model20 of4 signal9-preference115 evolution46, although50 it41 does33 nor37 rule20 out17 an23 alternative11 model20, sensory125 exploitation150. Preference46 shapes40 are8 shown35 to31 be44

genetic11 in5 origin7.



NEW

References on Topic Models

• Latent Dirichlet allocation. David Blei, Andrew Y. Ng, and Michael Jordan. Journal of Machine Learning Research, 3:993-1022, 2003.

• Finding scientific topics. Griffiths, T., & Steyvers, M. (2004). Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.

• Probabilistic author-topic models for information discovery. M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths, in Proceedings of the ACM SIGKDD Conference on Data Mining and Knowledge Discovery, August 2004.

• Integrating topics and syntax. Griffiths, T.L., Steyvers, M., Blei, D.M., & Tenenbaum, J.B. (in press, 2005). In: Advances in Neural Information Processing Systems, 17.


Summary

• State-of-the-art probabilistic text models can be constructed from large text data sets
  – Can yield better performance than other approaches like clustering, LSI, etc.
  – Advantage of the probabilistic approach is that a wide range of queries can be supported by a single model
  – See also recent work by Buntine and colleagues

• Learning algorithms are slow but scalable
  – Linear in the number of word tokens (a minimal sketch of the per-token update is given below)
  – Applying this type of Monte Carlo statistical learning to millions of words was unheard of a few years ago
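The sketch below makes the "linear in the number of word tokens" claim concrete: a sweep of collapsed Gibbs sampling for a topic model visits every token exactly once and does O(K) work per token, where K is the number of topics. This is a minimal illustrative sketch of the standard collapsed Gibbs update (variable names are mine, not the tutorial's).

import numpy as np

def gibbs_sweep(tokens, z, ndk, nkw, nk, alpha=0.1, beta=0.01):
    # tokens: list of (doc_id, word_id) pairs; z[i] is token i's current topic
    # ndk, nkw, nk: count arrays n_{d,k}, n_{k,w}, n_k, kept consistent with z
    K, W = nkw.shape
    for i, (d, w) in enumerate(tokens):
        k = z[i]
        ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1            # remove token i's counts
        p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + W * beta)
        k = np.random.choice(K, p=p / p.sum())                # resample its topic
        z[i] = k
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1            # put the counts back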


Conclusion


[Diagram: Real-World Data and a Probabilistic Model, connected by the twin activities of Modeling and Learning.]

NEW

“All models are wrong, but some are useful” (G.E.P. Box)


Concluding Comments

• The probabilistic approach is worthy of inclusion in a data miner’s toolbox
  – Systematic handling of missing information and uncertainty
  – Ability to incorporate prior knowledge
  – Integration of different sources of information
  – However, not always best choice for “black-box” predictive modeling

• Graphical models in particular provide:
  – A flexible and modular representational language for modeling
  – Efficient and general computational inference and learning algorithms

• Many recent advances in theory, algorithms, and applications
  – Likely to continue to see advances in new powerful models, more efficient scalable learning algorithms, etc.


Examples of New Research Directions

• Modeling and Learning
  – Probabilistic Relational Models
    • Work by Koller et al., Russell et al., etc.
  – Conditional Markov Random Fields
    • Information extraction (McCallum et al.)
  – Dirichlet processes
    • Flexible non-parametric models (Jordan et al.)
  – Combining discriminative and generative models
    • e.g., Haussler and Jaakkola

• Applications
  – Computer vision: particle filters
  – Robotics: map learning
  – Statistical machine translation
  – and many more…


UPDATED

General References

• All of Statistics: A Concise Course in Statistical Inference. L. Wasserman, Chapman and Hall, 2004.

• Bayesian Data Analysis. A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Chapman and Hall, 2nd edition, 2003.

• Learning in Graphical Models. M. I. Jordan (ed.), MIT Press, 1998.

• Graphical models. M. I. Jordan. Statistical Science (Special Issue on Bayesian Statistics), 19, 140-155, 2004.

• The Elements of Statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, J. H. Friedman, Springer, 2001.

• Recent research:
  – Proceedings of the NIPS and UAI conferences; Journal of Machine Learning Research
