Aug 25th, 2001
Copyright © 2001, Andrew W. Moore
Slide 1
Probabilistic Models
Brigham S. Anderson
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~brigham
Probability = relative area
2
OUTLINE
• Probability: Axioms, Rules
• Probabilistic Models: Probability Tables, Bayes Nets
• Machine Learning: Inference, Anomaly detection, Classification
3
Overview
The point of this section:
1. Introduce probability models.
2. Introduce Bayes Nets as powerful probability models.
4
Representing P(A,B,C)
• How can we represent the function P(A,B,C)?
• It ideally should contain all the information in this Venn diagram
[Venn diagram over events A, B, and C; the eight region probabilities are 0.30, 0.25, 0.10, 0.10, 0.10, 0.05, 0.05, 0.05]
5
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
Example: P(A, B, C)
The Joint Probability Table
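As a minimal sketch (the dictionary representation and variable names are mine, not the slides'), the joint probability table above can be stored as a small lookup and checked against the axioms:

```python
# Sketch: the joint distribution P(A,B,C) from the slide, stored as a table.
# Keys are (A, B, C) value triples; values are the row probabilities.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Axiom check: entries are non-negative and sum to 1.
assert all(p >= 0 for p in joint.values())
assert abs(sum(joint.values()) - 1.0) < 1e-9
```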
6
Using the Prob. Table
Once you have the PT you can ask for the probability of any logical expression E involving your attributes:

P(E) = ∑_{rows matching E} P(row)
7
P(Poor, Male) = 0.4654

P(E) = ∑_{rows matching E} P(row)
Using the Prob. Table
8
P(Poor) = 0.7604

P(E) = ∑_{rows matching E} P(row)
Using the Prob. Table
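A minimal sketch of the sum-the-matching-rows recipe above, applied to the P(A,B,C) table from the earlier slide (the `prob` helper name is my own):

```python
# Sketch: P(E) = sum of P(row) over the rows matching expression E.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event, joint=joint):
    """event is a predicate over a row (a, b, c); returns P(event)."""
    return sum(p for row, p in joint.items() if event(*row))

p_a = prob(lambda a, b, c: a == 1)       # 0.05+0.10+0.25+0.10 = 0.50
p_a_or_b = prob(lambda a, b, c: a or b)  # 1 - P(~A ^ ~B) = 1 - 0.35 = 0.65
```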
9
Inference with the Prob. Table

P(E1 | E2) = P(E1, E2) / P(E2) = ( ∑_{rows matching E1 and E2} P(row) ) / ( ∑_{rows matching E2} P(row) )
10
Inference with the Prob. Table

P(E1 | E2) = P(E1, E2) / P(E2) = ( ∑_{rows matching E1 and E2} P(row) ) / ( ∑_{rows matching E2} P(row) )
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
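The same two-sum recipe can be sketched in code, again on the small P(A,B,C) table since the full Poor/Male table is not reproduced here (helper names are mine):

```python
# Sketch: P(E1 | E2) computed from a joint table by the two-sum recipe.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    return sum(p for row, p in joint.items() if event(*row))

def cond(e1, e2):
    """P(e1 | e2) = P(e1 and e2) / P(e2)."""
    return prob(lambda *r: e1(*r) and e2(*r)) / prob(e2)

# P(A | B): rows with A=1,B=1 sum to 0.35; rows with B=1 sum to 0.50
p_a_given_b = cond(lambda a, b, c: a == 1, lambda a, b, c: b == 1)  # 0.7
```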
11
Inference is a big deal
• I’ve got this evidence. What’s the chance that this conclusion is true?
• I’ve got a sore neck: how likely am I to have meningitis?
• I see my lights are out and it’s 9pm. What’s the chance my spouse is already asleep?
• I see that there’s a long queue by my gate and the clouds are dark and I’m traveling by US Airways. What’s the chance I’ll be home tonight? (Answer: 0.00000001)
• There’s a thriving set of industries growing based around Bayesian Inference. Highlights are: Medicine, Pharma, Help Desk Support, Engine Fault Diagnosis
12
The Full Probability Table
Problem: We usually don’t know P(A,B,C)!

Solutions:
1. Construct it analytically
2. Learn it from data
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
P(A,B,C)
13
Learning a Probability Table
Build a PT for your attributes in which the probabilities are unspecified
Then fill in each row with

P̂(row) = (# records matching row) / (total number of records)
A B C Prob
0 0 0 ?
0 0 1 ?
0 1 0 ?
0 1 1 ?
1 0 0 ?
1 0 1 ?
1 1 0 ?
1 1 1 ?
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10

The entry for A=1, B=1, C=0 is the fraction of all records in which A and B are True but C is False.
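A hedged sketch of this counting recipe on a toy dataset (the records are made up for illustration; helper names are mine):

```python
from collections import Counter

# Sketch: learn a probability table by counting matching records.
records = [
    (1, 1, 0), (0, 0, 0), (1, 1, 0), (0, 1, 0),
    (1, 1, 0), (0, 0, 0), (1, 0, 1), (0, 0, 0),
]  # toy (A, B, C) records, made up for illustration

counts = Counter(records)
ptable = {row: n / len(records) for row, n in counts.items()}
# ptable[(1, 1, 0)] is 3/8: the fraction of records matching that row
```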
14
Example of Learning a Prob. Table
• This PTable was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]
15
Where are we?
• We have covered the fundamentals of probability
• We have become content with PTables
• And we even know how to learn PTables from data.
16
Probability Table as Anomaly Detector
• Our PTable is our first example of something called an Anomaly Detector.
• An anomaly detector tells you how likely a datapoint is given a model.
Data point x → [Anomaly Detector] → P(x)
17
Evaluating a Model

• Given a record x, a model M can tell you how likely the record is:

P̂(x | M)

• Given a dataset with R records, a model can tell you how likely the dataset is:

P̂(dataset | M) = P̂(x1, x2, …, xR | M) = ∏_{k=1}^{R} P̂(xk | M)

(Under the assumption that all records were independently generated from the model’s probability function)
18
A small dataset: Miles Per Gallon
From the UCI repository (thanks to Ross Quinlan)
192 Training Set Records
mpg modelyear maker
good 75to78 asia
bad 70to74 america
bad 75to78 europe
bad 70to74 america
bad 70to74 america
bad 70to74 asia
bad 70to74 asia
bad 75to78 america
: : :
: : :
: : :
bad 70to74 america
good 79to83 america
bad 75to78 america
good 79to83 america
bad 75to78 america
good 79to83 america
good 79to83 america
bad 70to74 america
good 75to78 europe
bad 75to78 europe
19
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
(same 192 records as above)
20
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
(same 192 records as above)
P̂(dataset | M) = P̂(x1, x2, …, xR | M) = ∏_{k=1}^{R} P̂(xk | M) = 3.4 × 10^−203 (in this case)
21
Log Probabilities
Since probabilities of datasets get so small we usually use log probabilities
log P̂(dataset | M) = log ∏_{k=1}^{R} P̂(xk | M) = ∑_{k=1}^{R} log P̂(xk | M)
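A small sketch of why log probabilities help (the model table and data here are made up; `log_likelihood` is my own helper name):

```python
import math

# Sketch: score a dataset under a model by summing log probabilities
# instead of multiplying tiny probabilities (which underflows).
def log_likelihood(dataset, p_hat):
    """p_hat(x) is the model's probability for record x."""
    return sum(math.log(p_hat(x)) for x in dataset)

# Toy model: a learned table over (A, B) rows (numbers made up).
table = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.1}
data = [(0, 0), (0, 1), (1, 1)]
ll = log_likelihood(data, table.get)  # log 0.5 + log 0.2 + log 0.1
```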
22
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
(same 192 records as above)
log P̂(dataset | M) = log ∏_{k=1}^{R} P̂(xk | M) = ∑_{k=1}^{R} log P̂(xk | M) = −466.19 (in this case)
23
Summary: The Good News
Prob. Tables allow us to learn P(X) from data.
• Can do anomaly detection: spot suspicious / incorrect records
• Can do inference: P(E1|E2) (Automatic Doctor, Help Desk, etc.)
• Can do Bayesian classification (coming soon…)
24
Summary: The Bad News
Learning a Prob. Table for P(X) is for Cowboy data miners
• Those probability tables get big fast!
• With so many parameters relative to our data, the resulting model could be pretty wild.
25
Using a test set
An independent test set with 196 cars has a worse log likelihood
(actually it’s a billion quintillion quintillion quintillion quintillion times less likely)
….Density estimators can overfit. And the full joint density estimator is the overfittiest of them all!
26
Overfitting Prob. Tables
If this ever happens, it means there are certain combinations that we learn are impossible
log P̂(testset | M) = ∑_{k=1}^{R} log P̂(xk | M) = −∞ if P̂(xk | M) = 0 for any k
27
Using a test set
The only reason that our test set didn’t score -infinity is that Andrew’s code is hard-wired to always predict a probability of at least one in 10^20
We need P() estimators that are less prone to overfitting
28
Naïve Density Estimation
The problem with the Probability Table is that it just mirrors the training data.
In fact, it is just another way of storing the data: we could reconstruct the original dataset perfectly from our Table.
We need something which generalizes more usefully.
The naïve model generalizes strongly: Assume that each attribute is distributed independently of any of the other attributes.
29
Probability Models
Full Prob. Table:
• No assumptions
• Overfitting-prone
• Scales horribly

Naïve Density:
• Strong assumptions
• Overfitting-resistant
• Scales incredibly well
30
Independently Distributed Data
• Let x[i] denote the i’th field of record x.
• The independently distributed assumption says that for any i, v, u1, u2, …, u(i−1), u(i+1), …, uM:

P(x[i]=v | x[1]=u1, x[2]=u2, …, x[i−1]=u(i−1), x[i+1]=u(i+1), …, x[M]=uM) = P(x[i]=v)

• Or in other words, x[i] is independent of {x[1], x[2], …, x[i−1], x[i+1], …, x[M]}
• This is often written as x[i] ⊥ {x[1], x[2], …, x[i−1], x[i+1], …, x[M]}
31
A note about independence
• Assume A and B are Boolean Random Variables. Then
“A and B are independent”
if and only if
P(A|B) = P(A)
• “A and B are independent” is often notated as
A ⊥ B
32
Independence Theorems
• Assume P(A|B) = P(A)
Then P(A, B) = P(A|B) P(B) = P(A) P(B)

• Assume P(A|B) = P(A)
Then P(B|A) = P(A|B) P(B) / P(A) = P(B)
33
Independence Theorems
• Assume P(A|B) = P(A)
Then P(~A|B) = 1 − P(A|B) = 1 − P(A) = P(~A)

• Assume P(A|B) = P(A)
Then P(A|~B) = (P(A) − P(A,B)) / P(~B) = (P(A) − P(A)P(B)) / (1 − P(B)) = P(A)
34
Multivalued Independence
For Random Variables A and B,
A ⊥ B if and only if
∀u,v : P(A=u | B=v) = P(A=u)
from which you can then prove things like…
∀u,v : P(A=u, B=v) = P(A=u) P(B=v)
∀u,v : P(B=v | A=u) = P(B=v)
35
Back to Naïve Density Estimation
• Let x[i] denote the i’th field of record x.
• Naïve DE assumes x[i] is independent of {x[1], x[2], …, x[i−1], x[i+1], …, x[M]}
• Example:
• Suppose that each record is generated by randomly shaking a green die and a red die
• Dataset 1: A = red value, B = green value
• Dataset 2: A = red value, B = sum of values
• Dataset 3: A = sum of values, B = difference of values
• Which of these datasets violates the naïve assumption?
36
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed.
What is P(A^~B^C^~D)?
= P(A) P(~B) P(C) P(~D)
37
Naïve Distribution General Case
• Suppose x[1], x[2], … x[M] are independently distributed.
P(x[1]=u1, x[2]=u2, …, x[M]=uM) = ∏_{k=1}^{M} P(x[k]=uk)
• So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand.
• So we can do any inference*
• But how do we learn a Naïve Density Estimator?
38
Learning a Naïve Density Estimator
P̂(x[i]=u) = (# records in which x[i]=u) / (total number of records)
Another trivial learning algorithm!
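The two naive-model recipes, learning the per-attribute marginals and multiplying them back into a joint row, can be sketched together (toy records and helper names are my own):

```python
# Sketch: learn a Naive Density Estimator (per-attribute marginals) and
# use it to reconstruct any row of the implied joint on demand.
records = [
    (1, 0, 1), (1, 1, 1), (0, 0, 1), (1, 0, 0),
]  # toy data, made up for illustration

M = len(records[0])
R = len(records)
# marginal[i][u] = fraction of records in which x[i] == u
marginal = [
    {u: sum(1 for x in records if x[i] == u) / R for u in (0, 1)}
    for i in range(M)
]

def naive_joint(row):
    """P(x[1]=u1, ..., x[M]=uM) = product of the learned marginals."""
    p = 1.0
    for i, u in enumerate(row):
        p *= marginal[i][u]
    return p
```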
39
Contrast
Joint DE:
• Can model anything
• No problem to model “C is a noisy copy of A”
• Given 100 records and more than 6 boolean attributes will screw up badly

Naïve DE:
• Can model only very boring distributions
• “C is a noisy copy of A” is outside Naïve’s scope
• Given 100 records and 10,000 multivalued attributes will be fine
40
Empirical Results: “Hopeless”
The “hopeless” dataset consists of 40,000 records and 21 boolean attributes called a,b,c, … u. Each attribute in each record is generated 50-50 randomly as 0 or 1.
Despite the vast amount of data, “Joint” overfits hopelessly and does much worse
Average test set log probability during 10 folds of k-fold cross-validation (*described in a future Andrew lecture)
41
Empirical Results: “Logical”
The “logical” dataset consists of 40,000 records and 4 boolean attributes called a,b,c,d where a,b,c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.
The DE learned by
“Joint”
The DE learned by
“Naive”
42
Empirical Results: “Logical”
The “logical” dataset consists of 40,000 records and 4 boolean attributes called a,b,c,d where a,b,c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.
The DE learned by
“Joint”
The DE learned by
“Naive”
43
A tiny part of the DE
learned by “Joint”
Empirical Results: “MPG”
The “MPG” dataset consists of 392 records and 8 attributes
The DE learned by
“Naive”
44
The DE learned by
“Joint”
Empirical Results: “Weight vs MPG”
Suppose we train only from the “Weight” and “MPG” attributes
The DE learned by
“Naive”
45
The DE learned by
“Joint”
Empirical Results: “Weight vs MPG”
Suppose we train only from the “Weight” and “MPG” attributes
The DE learned by
“Naive”
46
The DE learned by
“Joint”
“Weight vs MPG”: The best that Naïve can do
The DE learned by
“Naive”
47
Reminder: The Good News
• We have two ways to learn a Density Estimator from data.
• There are vastly more impressive Density Estimators (Mixture Models, Bayesian Networks, Density Trees, Kernel Densities and many more)
• Density estimators can do many good things…
• Anomaly detection
• Inference: P(E1|E2) (Automatic Doctor / Help Desk etc.)
• Ingredient for Bayes Classifiers
48
Probability Models
Full Prob. Table:
• No assumptions
• Overfitting-prone
• Scales horribly

Naïve Density:
• Strong assumptions
• Overfitting-resistant
• Scales incredibly well

Bayes Nets:
• Carefully chosen assumptions
• Overfitting and scaling properties depend on assumptions
49
Bayes Nets
50
Bayes Nets
• What are they?
Bayesian nets are a network-based framework for representing and analyzing models involving uncertainty
• What are they used for?
Intelligent decision aids, data fusion, 3-E feature recognition, intelligent diagnostic aids, automated free text understanding, data mining
• Where did they come from?
Cross fertilization of ideas between the artificial intelligence, decision analysis, and statistics communities
• Why the sudden interest?
Development of propagation algorithms followed by availability of easy to use commercial software; growing number of creative applications
• How are they different from other knowledge representation and probabilistic analysis tools?
Different from other knowledge-based systems tools because uncertainty is handled in a mathematically rigorous yet efficient and simple way
51
Bayes Net Concepts
1. Chain Rule: P(A,B) = P(A) P(B|A)
2. Conditional Independence: P(B|A) = P(B)
52
A Simple Bayes Net
• Let’s assume that we already have P(Mpg,Horse)
How would you rewrite this using the Chain rule?
P(Mpg, Horse):
        low   high
good    0.36  0.04
 bad    0.12  0.48

P(Mpg, Horse) = ?
53
Review: Chain Rule
P(Mpg, Horse):
        low   high
good    0.36  0.04
 bad    0.12  0.48

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
54
Review: Chain Rule
P(Mpg, Horse):
        low   high
good    0.36  0.04
 bad    0.12  0.48

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79

P(good, low) = P(good) * P( low|good) = 0.4 * 0.89 ≈ 0.36
P(good,high) = P(good) * P(high|good) = 0.4 * 0.11 ≈ 0.04
P( bad, low) = P( bad) * P( low| bad) = 0.6 * 0.21 ≈ 0.12
P( bad,high) = P( bad) * P(high| bad) = 0.6 * 0.79 ≈ 0.48
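A quick sketch that multiplies the factorization back out (numbers from the slide; note the slide's table rounds the products, and the dictionary names are mine):

```python
# Sketch: the chain-rule factorization P(Mpg, Horse) = P(Mpg) * P(Horse|Mpg).
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse_given_mpg = {
    ("low", "good"): 0.89, ("high", "good"): 0.11,
    ("low", "bad"): 0.21,  ("high", "bad"): 0.79,
}

joint = {
    (m, h): p_mpg[m] * p_horse_given_mpg[(h, m)]
    for m in p_mpg for h in ("low", "high")
}
# joint[("good", "low")] == 0.4 * 0.89 == 0.356 (the slide rounds to 0.36)
```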
55
How to Make a Bayes Net
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
Mpg → Horse
56
How to Make a Bayes Net
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
Mpg → Horse
P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse|Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
57
How to Make a Bayes Net
Mpg → Horse
P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse|Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
• Each node is a probability function
• Each arc denotes conditional dependence
58
How to Make a Bayes Net
So, what have we accomplished thus far?
Nothing; we’ve just “Bayes Net-ified”
P(Mpg, Horse) using the Chain rule.
…the real excitement starts when we wield conditional independence
Mpg → Horse
P(Mpg)
P(Horse|Mpg)
59
How to Make a Bayes Net
Before we apply conditional independence, we need a worthier opponent than puny P(Mpg, Horse)…
We’ll use P(Mpg, Horse, Accel)
60
How to Make a Bayes Net
Rewrite joint using the Chain rule.
P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)
Note: Obviously, we could have written this 3! = 6 different ways…

P(M, H, A) = P(M) * P(H|M) * P(A|M,H)
           = P(M) * P(A|M) * P(H|M,A)
           = P(H) * P(M|H) * P(A|H,M)
           = P(H) * P(A|H) * P(M|H,A)
           = …
61
How to Make a Bayes Net
[Net: Mpg → Horse, Mpg → Accel, Horse → Accel]
Rewrite joint using the Chain rule.
P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)
62
How to Make a Bayes Net
[Net: Mpg → Horse, Mpg → Accel, Horse → Accel]
P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse|Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
P(Accel|Mpg,Horse): P( low|good, low) = 0.97, P( low|good,high) = 0.15, P( low| bad, low) = 0.90, P( low| bad,high) = 0.05, P(high|good, low) = 0.03, P(high|good,high) = 0.85, P(high| bad, low) = 0.10, P(high| bad,high) = 0.95
* Note: I made these up…
63
How to Make a Bayes Net
[Net: Mpg → Horse, Mpg → Accel, Horse → Accel]
P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse|Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
P(Accel|Mpg,Horse): P( low|good, low) = 0.97, P( low|good,high) = 0.15, P( low| bad, low) = 0.90, P( low| bad,high) = 0.05, P(high|good, low) = 0.03, P(high|good,high) = 0.85, P(high| bad, low) = 0.10, P(high| bad,high) = 0.95
A Miracle Occurs!
You are told by God (or another domain expert) that Accel is independent of Mpg given Horse!
P(Accel | Mpg, Horse) = P(Accel | Horse)
64
How to Make a Bayes Net
[Net: Mpg → Horse → Accel]
P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse|Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
P(Accel|Horse): P( low| low) = 0.95, P( low|high) = 0.11, P(high| low) = 0.05, P(high|high) = 0.89
65
How to Make a Bayes Net
[Net: Mpg → Horse → Accel]
P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse|Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
P(Accel|Horse): P( low| low) = 0.95, P( low|high) = 0.11, P(high| low) = 0.05, P(high|high) = 0.89
Thank you, domain expert! Now I only need to learn 5 parameters instead of 7 from my data! My parameter estimates will be more accurate as a result!
66
Independence
“The Acceleration does not depend on the Mpg once I know the Horsepower.”
This can be specified very simply:
P(Accel | Mpg, Horse) = P(Accel | Horse)
This is a powerful statement!
It required extra domain knowledge. A different kind of knowledge than numerical probabilities. It needed an understanding of causation.
67
Bayes Net-Building Tools
1. Chain Rule: P(A,B) = P(A) P(B|A)
2. Conditional Independence: P(B|A) = P(B)
This is Ridiculously Useful in general
68
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:
• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any length are allowed.
Each vertex in V contains the following information:
• The name of a random variable
• A probability distribution table indicating how the probability of this variable’s values depends on all possible combinations of parental values.
69
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.G. What could we do to compute P(R | T, ~S)?

[Net: S → L, M → L, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
70
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.G. What could we do to compute P(R | T, ~S)?

[Net: S → L, M → L, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
Step 1: Compute P(R ^ T ^ ~S)
Step 2: Compute P(~R ^ T ^ ~S)
Step 3: Return
P(R ^ T ^ ~S)
-------------------------------------
P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)
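A sketch of this three-step recipe on the S, M, L, R, T net, using the CPT numbers above (helper names are mine; each sum below performs 4 joint computes):

```python
# Sketch: joint entries of the net are products of each node's CPT entry
# given its parents; conditionals come from two sums of joint entries.
P_S, P_M = 0.3, 0.6
P_R = {True: 0.3, False: 0.6}            # P(R | M), keyed by M
P_T = {True: 0.3, False: 0.8}            # P(T | L), keyed by L
P_L = {(True, True): 0.05, (True, False): 0.1,
       (False, True): 0.1, (False, False): 0.2}   # P(L | M, S), keyed by (M, S)

def bern(p, value):
    return p if value else 1.0 - p

def joint(s, m, l, r, t):
    return (bern(P_S, s) * bern(P_M, m) * bern(P_L[(m, s)], l)
            * bern(P_R[m], r) * bern(P_T[l], t))

# Steps 1-3: P(R | T, ~S) by summing joint entries over the hidden M, L.
num = sum(joint(False, m, l, True, True)
          for m in (True, False) for l in (True, False))
den = num + sum(joint(False, m, l, False, True)
                for m in (True, False) for l in (True, False))
p_r_given_t_not_s = num / den
```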
71
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.G. What could we do to compute P(R | T, ~S)?

[Net: S → L, M → L, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
Step 1: Compute P(R ^ T ^ ~S)
Step 2: Compute P(~R ^ T ^ ~S)
Step 3: Return
P(R ^ T ^ ~S)
-------------------------------------
P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)
Sum of all the rows in the Joint that match R ^ T ^ ~S
Sum of all the rows in the Joint that match ~R ^ T ^ ~S
72
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.G. What could we do to compute P(R | T, ~S)?

[Net: S → L, M → L, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
Step 1: Compute P(R ^ T ^ ~S)
Step 2: Compute P(~R ^ T ^ ~S)
Step 3: Return
P(R ^ T ^ ~S)
-------------------------------------
P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)
Sum of all the rows in the Joint that match R ^ T ^ ~S
Sum of all the rows in the Joint that match ~R ^ T ^ ~S
Each of these obtained by the “computing a joint probability entry” method of the earlier slides
4 joint computes
4 joint computes
73
Independence
We’ve stated:
P(M) = 0.6, P(S) = 0.3, P(S|M) = P(S)

From these statements, we can derive the full joint pdf (using independence, e.g. P(M ^ S) = P(M) P(S) = 0.18):

M S Prob
T T 0.18
T F 0.42
F T 0.12
F F 0.28

And since we now have the joint pdf, we can make any queries we like.
74
The good news
We can do inference. We can compute any conditional probability:
P( Some variable | Some other variable values )

P(E1 | E2) = P(E1 ^ E2) / P(E2) = ( ∑_{joint entries matching E1 and E2} P(joint entry) ) / ( ∑_{joint entries matching E2} P(joint entry) )
75
The good news
We can do inference. We can compute any conditional probability:
P( Some variable | Some other variable values )

P(E1 | E2) = P(E1 ^ E2) / P(E2) = ( ∑_{joint entries matching E1 and E2} P(joint entry) ) / ( ∑_{joint entries matching E2} P(joint entry) )
Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables.
How much work is the above computation?
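A brute-force sketch of the enumeration this question is about; the loop visits all 2^m assignments, which is exactly the exponential cost the deck discusses (function and parameter names are mine):

```python
from itertools import product

# Sketch: conditional inference by enumerating the whole joint.
# With m binary variables, the loop below touches 2**m entries,
# which is what makes this method exponential in m.
def cond_prob(joint_fn, m, e1, e2):
    """P(e1 | e2); joint_fn gives P of a full assignment (tuple of bools)."""
    num = den = 0.0
    for assign in product((True, False), repeat=m):   # 2**m assignments
        p = joint_fn(assign)
        if e2(assign):
            den += p
            if e1(assign):
                num += p
    return num / den
```

For example, with two independent fair coins (uniform joint of 0.25 per entry), conditioning on the second coin leaves the first coin at probability 0.5.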
76
The sad, bad news
Conditional probabilities by enumerating all matching entries in the joint are expensive:
Exponential in the number of variables.
77
The sad, bad news
Conditional probabilities by enumerating all matching entries in the joint are expensive:
Exponential in the number of variables.
But perhaps there are faster ways of querying Bayes nets?• In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find
there are often many tricks to save you time.• So we’ve just got to program our computer to do those tricks too, right?
78
The sad, bad news
Conditional probabilities by enumerating all matching entries in the joint are expensive:
Exponential in the number of variables.
But perhaps there are faster ways of querying Bayes nets?• In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find
there are often many tricks to save you time.• So we’ve just got to program our computer to do those tricks too, right?
Sadder and worse news:
General querying of Bayes nets is NP-complete.
79
Case Study I
Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA).
• Diagnostic system for lymph-node diseases.
• 60 diseases and 100 symptoms and test-results.
• 14,000 probabilities
• Expert consulted to make net.
• 8 hours to determine variables.
• 35 hours for net topology.
• 40 hours for probability table values.
• Apparently, the experts found it quite easy to invent the causal links and probabilities.
• Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.
80
What you should know
• The meanings and importance of independence and conditional independence.
• The definition of a Bayes net.
• Computing probabilities of assignments of variables (i.e. members of the joint p.d.f.) with a Bayes net.
• The slow (exponential) method for computing arbitrary conditional probabilities.
• The stochastic simulation method and likelihood weighting.
81
82
Chain Rule

P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg):
        good   bad
 low    0.89   0.21
high    0.11   0.79

P(Horse): P( low) = 0.48, P(high) = 0.52
P(Mpg | Horse):
         low   high
good    0.72   0.08
 bad    0.28   0.92

P(Mpg, Horse):
         low   high
good    0.36   0.04
 bad    0.12   0.48
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
P(Mpg, Horse) = P(Horse) * P(Mpg | Horse)
83
Chain Rule

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
P(Mpg, Horse) = P(Horse) * P(Mpg | Horse)

P(Mpg, Horse):
 Mpg   Horse
good    low   0.36
good   high   0.04
 bad    low   0.12
 bad   high   0.48

P(Mpg): good 0.4, bad 0.6
P(Horse | Mpg):
 Mpg   Horse
good    low   0.89
good   high   0.11
 bad    low   0.21
 bad   high   0.79

P(Horse): low 0.48, high 0.52
P(Mpg | Horse):
Horse    Mpg
 low    good  0.72
 low     bad  0.28
high    good  0.08
high     bad  0.92
84
Making a Bayes Net
1. Start with a Joint: P(Horse, Mpg)
2. Rewrite the joint with the Chain Rule: P(Horse) P(Mpg | Horse)
3. Each component becomes a node (with arcs): Horse → Mpg
85
P(Mpg, Horsepower)
Mpg → Horsepower

P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horsepower | Mpg):
        good   bad
 low    0.89   0.21
high    0.11   0.79
We can rewrite P(Mpg, Horsepower) as a Bayes net!
1. Decompose with the Chain rule.
2. Make each element of decomposition into a node
86
P(Mpg, Horsepower)
Horsepower → Mpg

P(Horse): P( low) = 0.48, P(high) = 0.52
P(Mpg | Horse):
Horse    Mpg
 low    good  0.72
 low     bad  0.28
high    good  0.08
high     bad  0.92
87
P(f, h):
        h     ~h
 f    0.01   0.01
~f    0.09   0.89
The Joint Distribution
88
The Joint Distribution

[Venn diagram over events f, h, and a]
89
Joint Distribution
0.480.12bad
0.040.36good
highlow
P(Mpg, Horse)
P(Mpg, Horse)
Venn Diagrams• Visual• Good for tutorials
Probability Tables• Efficient• Good for computation
good0.36
high0.48
0.12
0.04
Up to this point, we have represented a probability model (or joint distribution) in two ways:
90
Where do Joint Distributions come from?
• Idea One: Expert Humans with lots of free time
• Idea Two: Simpler probabilistic facts and some algebra

Example: Suppose you knew
P(A) = 0.7
P(B|A) = 0.2P(B|~A) = 0.1
P(C|A,B) = 0.1P(C|A,~B) = 0.8P(C|~A,B) = 0.3P(C|~A,~B) = 0.1
Then you can automatically compute the PT using the chain rule:

P(x, y, z) = P(x) P(y|x) P(z|x,y)

Later: Bayes Nets will do this systematically.
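A sketch of that computation with the numbers above (the dictionary layout and helper names are mine):

```python
from itertools import product

# Sketch: build the full 8-row PT from the simpler facts via the chain rule
# P(a, b, c) = P(a) * P(b|a) * P(c|a,b), using the numbers from the slide.
P_A = 0.7
P_B = {True: 0.2, False: 0.1}                     # P(B | A), keyed by A
P_C = {(True, True): 0.1, (True, False): 0.8,
       (False, True): 0.3, (False, False): 0.1}   # P(C | A, B)

def bern(p, v):
    return p if v else 1.0 - p

ptable = {
    (a, b, c): bern(P_A, a) * bern(P_B[a], b) * bern(P_C[(a, b)], c)
    for a, b, c in product((True, False), repeat=3)
}
# e.g. P(A, B, C) = 0.7 * 0.2 * 0.1
```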
91
Where do Joint Distributions come from?
• Idea Three: Learn them from data!
Prepare to be impressed…
92
A more interesting case
• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.
Let’s begin with writing down knowledge we’re happy about:
P(S|M) = P(S), P(S) = 0.3, P(M) = 0.6
Lateness is not independent of the weather and is not independent of the lecturer.
93
A more interesting case
• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.
Let’s begin with writing down knowledge we’re happy about:
P(S|M) = P(S), P(S) = 0.3, P(M) = 0.6
Lateness is not independent of the weather and is not independent of the lecturer.
We already know the Joint of S and M, so all we need now is P(L | S=u, M=v) in the 4 cases of u/v = True/False.
94
A more interesting case
• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.
P(S|M) = P(S), P(S) = 0.3, P(M) = 0.6
P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2
Now we can derive a full joint p.d.f. with a “mere” six numbers instead of seven*
*Savings are larger for larger numbers of variables.
95
A more interesting case
• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.
P(S|M) = P(S), P(S) = 0.3, P(M) = 0.6
P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2
Question: Express P(L=x ^ M=y ^ S=z) in terms that only need the above expressions, where x, y and z may each be True or False.
96
A bit of notation

P(S|M) = P(S), P(S) = 0.3, P(M) = 0.6
P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2

[Net: S → L ← M]
P(S)=0.3, P(M)=0.6
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
97
A bit of notation

P(S|M) = P(S), P(S) = 0.3, P(M) = 0.6
P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2

[Net: S → L ← M]
P(S)=0.3, P(M)=0.6
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
Read the absence of an arrow between S and M to mean “it would not help me predict M if I knew the value of S”.
Read the two arrows into L to mean that if I want to know the value of L it may help me to know M and to know S.
This kind of stuff will be thoroughly formalized later.
98
An even cuter trick
Suppose we have these three events:
• M : Lecture taught by Manuela
• L : Lecturer arrives late
• R : Lecture concerns robots
Suppose:
• Andrew has a higher chance of being late than Manuela.
• Andrew has a higher chance of giving robotics lectures.
What kind of independence can we find?
How about:
• P(L|M) = P(L) ?
• P(R|M) = P(R) ?
• P(L|R) = P(L) ?
99
Conditional independence
Once you know who the lecturer is, then whether they arrive late doesn’t affect whether the lecture concerns robots.
P(R | M, L) = P(R | M) and
P(R | ~M, L) = P(R | ~M)
We express this in the following way:
“R and L are conditionally independent given M”
[Net: L ← M → R]
Given knowledge of M, knowing anything else in the diagram won’t help us with L, etc.
..which is also notated by the following diagram.
100
Conditional Independence formalized
R and L are conditionally independent given M if
for all x,y,z in {T,F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally:
Let S1 and S2 and S3 be sets of variables.
Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets,
P(S1’s assignments | S2’s assignments & S3’s assignments) =
P(S1’s assignments | S3’s assignments)
101
Example:
R and L are conditionally independent given M if
for all x,y,z in {T,F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally:
Let S1 and S2 and S3 be sets of variables.
Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets,
P(S1’s assignments | S2’s assignments & S3’s assignments) =
P(S1’s assignments | S3’s assignments)
“Shoe-size is conditionally independent of Glove-size given height, weight, and age”
means
forall s,g,h,w,a
P(ShoeSize=s|Height=h,Weight=w,Age=a)
=
P(ShoeSize=s|Height=h,Weight=w,Age=a,GloveSize=g)
102
Example:
R and L are conditionally independent given M if
for all x,y,z in {T,F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally:
Let S1 and S2 and S3 be sets of variables.
Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets,
P(S1’s assignments | S2’s assignments & S3’s assignments) =
P(S1’s assignments | S3’s assignments)
“Shoe-size is conditionally independent of Glove-size given height, weight, and age”
does not mean
forall s,g,h
P(ShoeSize=s|Height=h)
=
P(ShoeSize=s|Height=h, GloveSize=g)
103
Conditional independence
[Net: L ← M → R]

We can write down P(M). And then, since we know L is only directly influenced by M, we can write down the values of P(L|M) and P(L|~M) and know we’ve fully specified L’s behavior. Ditto for R.

P(M) = 0.6, P(L|M) = 0.085, P(L|~M) = 0.17, P(R|M) = 0.3, P(R|~M) = 0.6
‘R and L conditionally independent given M’
104
Conditional independence

[Net: L ← M → R]

P(M) = 0.6
P(L|M) = 0.085
P(L|~M) = 0.17
P(R|M) = 0.3
P(R|~M) = 0.6

Conditional Independence: P(R|M,L) = P(R|M), P(R|~M,L) = P(R|~M)

Again, we can obtain any member of the Joint prob dist that we desire:

P(L=x ^ R=y ^ M=z) = P(M=z) P(L=x | M=z) P(R=y | M=z)
105
Assume five variables
T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny
• T only directly influenced by L (i.e. T is conditionally independent of R,M,S given L)
• L only directly influenced by M and S (i.e. L is conditionally independent of R given M & S)
• R only directly influenced by M (i.e. R is conditionally independent of L,S, given M)
• M and S are independent
106
Making a Bayes net
(nodes: S, M, R, L, T)
Step One: add variables.
• Just choose the variables you'd like to be included in the net.
107
Making a Bayes net
(links: S → L, M → L, M → R, L → T)
Step Two: add links.
• The link structure must be acyclic.
• If node X is given parents Q1,Q2,..Qn, you are promising that any variable that is a non-descendant of X is conditionally independent of X given {Q1,Q2,..Qn}.
108
Making a Bayes net
(network: S → L, M → L, M → R, L → T)
P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
Step Three: add a probability table for each node.
• The table for node X must list P(X | parent values) for each possible combination of parent values.
109
Making a Bayes net
(network: S → L, M → L, M → R, L → T)
P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
• Two unconnected variables may still be correlated.
• Each node is conditionally independent of all non-descendants in the tree, given its parents.
• You can deduce many other conditional independence relations from a Bayes net. See the next lecture.
110
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E), where:
• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any length are allowed.
Each vertex in V contains the following information:
• The name of a random variable.
• A probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parental values.
111
Building a Bayes Net
1. Choose a set of relevant variables.
2. Choose an ordering for them.
3. Assume they're called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc.)
4. For i = 1 to m:
   1. Add the Xi node to the network.
   2. Set Parents(Xi) to be a minimal subset of {X1…Xi-1} such that we have conditional independence of Xi and all other members of {X1…Xi-1} given Parents(Xi).
   3. Define the probability table of P(Xi=k | Assignments of Parents(Xi)).
112
Example Bayes Net Building
Suppose we're building a nuclear power station. There are the following random variables:
GRL: Gauge Reads Low.
CTL: Core temperature is low.
FG: Gauge is faulty.
FA: Alarm is faulty.
AS: Alarm sounds.
• If the alarm is working properly, it is meant to sound if the gauge stops reading a low temp.
• If the gauge is working properly, it is meant to read the temp of the core.
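One plausible link structure for these five variables can be written down as parent lists. This is an assumption offered for illustration only; the slides leave the structure as an exercise:

```python
# One plausible structure for the power-station variables -- an assumption
# for illustration, not an answer given in the slides. Each variable is
# listed with the parents that directly influence it.
plausible_parents = {
    "CTL": [],             # core temperature is a root cause
    "FG":  [],             # gauge faultiness is exogenous
    "FA":  [],             # alarm faultiness is exogenous
    "GRL": ["CTL", "FG"],  # gauge tracks the core temp, unless faulty
    "AS":  ["GRL", "FA"],  # alarm responds to the gauge, unless faulty
}
```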
113
Computing a Joint Entry
How do we compute an entry in the joint distribution?
E.g.: What is P(S ^ ~M ^ L ^ ~R ^ T)?
(network: S → L, M → L, M → R, L → T)
P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
114
Computing with Bayes Net
P(T ^ ~R ^ L ^ ~M ^ S)
= P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S)
(network: S → L, M → L, M → R, L → T)
P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
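Plugging the network's numbers into the last line of the derivation gives the entry directly. A quick numerical check:

```python
# Evaluating P(T ^ ~R ^ L ^ ~M ^ S) with the network's numbers:
# P(T|L) * P(~R|~M) * P(L|~M^S) * P(~M) * P(S)
p_T_given_L = 0.3
p_notR_given_notM = 1 - 0.6   # = 1 - P(R|~M)
p_L_given_notM_S = 0.1
p_notM = 1 - 0.6              # = 1 - P(M)
p_S = 0.3

p = p_T_given_L * p_notR_given_notM * p_L_given_notM_S * p_notM * p_S
# = 0.3 * 0.4 * 0.1 * 0.4 * 0.3 = 0.00144
```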
115
The general case
P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X1=x1)
:
:
= Π(i=1..n) P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1)
= Π(i=1..n) P(Xi=xi | Assignments of Parents(Xi))
(the last step uses the conditional independences promised when the links were added)
So any entry in joint pdf table can be computed. And so any conditional probability can be computed.
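The product-over-parents formula is short in code. A minimal Python sketch, using the running lecture example's structure and numbers (the dict layout is an illustrative choice, not notation from the slides):

```python
# One CPT per node, keyed by the tuple of parent values, gives any
# joint entry in a single pass over the nodes.
net = {
    # node: (parents, CPT mapping parent-value tuple -> P(node=True | parents))
    "S": ((), {(): 0.3}),
    "M": ((), {(): 0.6}),
    "L": (("M", "S"), {(True, True): 0.05, (True, False): 0.1,
                       (False, True): 0.1, (False, False): 0.2}),
    "R": (("M",), {(True,): 0.3, (False,): 0.6}),
    "T": (("L",), {(True,): 0.3, (False,): 0.8}),
}

def joint_entry(assign, net=net):
    """P of one full truth assignment: product of P(Xi = xi | Parents(Xi))."""
    p = 1.0
    for node, (parents, cpt) in net.items():
        p_true = cpt[tuple(assign[q] for q in parents)]
        p *= p_true if assign[node] else 1 - p_true
    return p

# P(T ^ ~R ^ L ^ ~M ^ S), matching the hand derivation on the previous slide:
p = joint_entry({"S": True, "M": False, "L": True, "R": False, "T": True})
```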
119
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.g.: What could we do to compute P(R | T, ~S)?
(network: S → L, M → L, M → R, L → T)
P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
Step 1: Compute P(R ^ T ^ ~S), the sum of all the rows in the joint that match R ^ T ^ ~S.
Step 2: Compute P(~R ^ T ^ ~S), the sum of all the rows in the joint that match ~R ^ T ^ ~S.
Step 3: Return
P(R ^ T ^ ~S) / [ P(R ^ T ^ ~S) + P(~R ^ T ^ ~S) ]
Each of these sums is obtained by the “computing a joint probability entry” method of the earlier slides: 4 joint computes for Step 1 and 4 for Step 2.
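Steps 1 to 3 amount to brute-force enumeration, which is easy to sketch. A minimal Python version on the running example (the dict layout and function names are illustrative choices, not notation from the slides):

```python
from itertools import product

# One CPT per node, keyed by the tuple of parent values.
net = {
    "S": ((), {(): 0.3}),
    "M": ((), {(): 0.6}),
    "L": (("M", "S"), {(True, True): 0.05, (True, False): 0.1,
                       (False, True): 0.1, (False, False): 0.2}),
    "R": (("M",), {(True,): 0.3, (False,): 0.6}),
    "T": (("L",), {(True,): 0.3, (False,): 0.8}),
}

def joint_entry(assign):
    """P of one full truth assignment: product of P(Xi | Parents(Xi))."""
    p = 1.0
    for node, (parents, cpt) in net.items():
        p_true = cpt[tuple(assign[q] for q in parents)]
        p *= p_true if assign[node] else 1 - p_true
    return p

def conditional(var, val, evidence):
    """P(var=val | evidence) by summing all matching joint entries."""
    names = list(net)
    num = den = 0.0
    for values in product([True, False], repeat=len(names)):
        assign = dict(zip(names, values))
        if any(assign[e] != v for e, v in evidence.items()):
            continue  # row does not match the evidence
        p = joint_entry(assign)
        den += p                      # rows matching T ^ ~S
        if assign[var] == val:
            num += p                  # rows also matching R
    return num / den

p = conditional("R", True, {"T": True, "S": False})  # P(R | T, ~S)
```

With only M and L left unobserved, the numerator and denominator here each reduce to the 4 joint computes named on the slide.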