Aug 25th, 2001
Copyright © 2001, Andrew W. Moore
Slide 1
Probabilistic Models
Brigham S. Anderson
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~brigham
Probability = relative area
2
OUTLINE
• Probability: Axioms, Rules
• Probabilistic Models: Probability Tables, Bayes Nets
• Machine Learning: Inference, Anomaly detection, Classification
3
Overview
The point of this section:
1. Introduce probability models.
2. Introduce Bayes Nets as powerful probability models.
4
Representing P(A,B,C)
• How can we represent the function P(A,B,C)?
• It ideally should contain all the information in this Venn diagram
[Venn diagram over events A, B, and C; the eight region probabilities are 0.30, 0.25, 0.10, 0.10, 0.10, 0.05, 0.05, 0.05]
5
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
Example: P(A, B, C)
The Joint Probability Table
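As a minimal sketch (the dictionary representation and variable names are mine, not the slides'), the joint probability table above can be stored as a small lookup and checked against the axioms:

```python
# Sketch: the joint distribution P(A,B,C) from the slide, stored as a table.
# Keys are (A, B, C) value triples; values are the row probabilities.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Axiom check: entries are non-negative and sum to 1.
assert all(p >= 0 for p in joint.values())
assert abs(sum(joint.values()) - 1.0) < 1e-9
```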
6
Using the Prob. Table
Once you have the PT you can ask for the probability of any logical expression E involving your attributes:

P(E) = ∑_{rows matching E} P(row)
7
P(Poor, Male) = 0.4654

P(E) = ∑_{rows matching E} P(row)
Using the Prob. Table
8
P(Poor) = 0.7604

P(E) = ∑_{rows matching E} P(row)
Using the Prob. Table
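A minimal sketch of the sum-the-matching-rows recipe above, applied to the P(A,B,C) table from the earlier slide (the `prob` helper name is my own):

```python
# Sketch: P(E) = sum of P(row) over the rows matching expression E.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event, joint=joint):
    """event is a predicate over a row (a, b, c); returns P(event)."""
    return sum(p for row, p in joint.items() if event(*row))

p_a = prob(lambda a, b, c: a == 1)       # 0.05+0.10+0.25+0.10 = 0.50
p_a_or_b = prob(lambda a, b, c: a or b)  # 1 - P(~A ^ ~B) = 1 - 0.35 = 0.65
```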
9
Inference with the Prob. Table

P(E1 | E2) = P(E1, E2) / P(E2) = ( ∑_{rows matching E1 and E2} P(row) ) / ( ∑_{rows matching E2} P(row) )
10
Inference with the Prob. Table

P(E1 | E2) = P(E1, E2) / P(E2) = ( ∑_{rows matching E1 and E2} P(row) ) / ( ∑_{rows matching E2} P(row) )
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
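The same two-sum recipe can be sketched in code, again on the small P(A,B,C) table since the full Poor/Male table is not reproduced here (helper names are mine):

```python
# Sketch: P(E1 | E2) computed from a joint table by the two-sum recipe.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    return sum(p for row, p in joint.items() if event(*row))

def cond(e1, e2):
    """P(e1 | e2) = P(e1 and e2) / P(e2)."""
    return prob(lambda *r: e1(*r) and e2(*r)) / prob(e2)

# P(A | B): rows with A=1,B=1 sum to 0.35; rows with B=1 sum to 0.50
p_a_given_b = cond(lambda a, b, c: a == 1, lambda a, b, c: b == 1)  # 0.7
```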
11
Inference is a big deal
• I’ve got this evidence. What’s the chance that this conclusion is true?
• I’ve got a sore neck: how likely am I to have meningitis?
• I see my lights are out and it’s 9pm. What’s the chance my spouse is already asleep?
• I see that there’s a long queue by my gate and the clouds are dark and I’m traveling by US Airways. What’s the chance I’ll be home tonight? (Answer: 0.00000001)
• There’s a thriving set of industries growing based around Bayesian Inference. Highlights are: Medicine, Pharma, Help Desk Support, Engine Fault Diagnosis
12
The Full Probability Table
Problem: We usually don’t know P(A,B,C)!

Solutions:
1. Construct it analytically
2. Learn it from data
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
P(A,B,C)
13
Learning a Probability Table
Build a PT for your attributes in which the probabilities are unspecified
Then fill in each row with

P̂(row) = (# records matching row) / (total number of records)
A B C Prob
0 0 0 ?
0 0 1 ?
0 1 0 ?
0 1 1 ?
1 0 0 ?
1 0 1 ?
1 1 0 ?
1 1 1 ?
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10

The entry for A=1, B=1, C=0 is the fraction of all records in which A and B are True but C is False.
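A hedged sketch of this counting recipe on a toy dataset (the records are made up for illustration; helper names are mine):

```python
from collections import Counter

# Sketch: learn a probability table by counting matching records.
records = [
    (1, 1, 0), (0, 0, 0), (1, 1, 0), (0, 1, 0),
    (1, 1, 0), (0, 0, 0), (1, 0, 1), (0, 0, 0),
]  # toy (A, B, C) records, made up for illustration

counts = Counter(records)
ptable = {row: n / len(records) for row, n in counts.items()}
# ptable[(1, 1, 0)] is 3/8: the fraction of records matching that row
```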
14
Example of Learning a Prob. Table
• This PTable was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]
15
Where are we?
• We have covered the fundamentals of probability
• We have become content with PTables
• And we even know how to learn PTables from data.
16
Probability Table as Anomaly Detector
• Our PTable is our first example of something called an Anomaly Detector.
• An anomaly detector tells you how likely a datapoint is given a model.
Data point x → [Anomaly Detector] → P(x)
17
Evaluating a Model

• Given a record x, a model M can tell you how likely the record is:

P̂(x | M)

• Given a dataset with R records, a model can tell you how likely the dataset is:

P̂(dataset | M) = P̂(x1, x2, …, xR | M) = ∏_{k=1}^{R} P̂(xk | M)

(Under the assumption that all records were independently generated from the model’s probability function)
18
A small dataset: Miles Per Gallon
From the UCI repository (thanks to Ross Quinlan)
192 Training Set Records
mpg modelyear maker
good 75to78 asia
bad 70to74 america
bad 75to78 europe
bad 70to74 america
bad 70to74 america
bad 70to74 asia
bad 70to74 asia
bad 75to78 america
: : :
: : :
: : :
bad 70to74 america
good 79to83 america
bad 75to78 america
good 79to83 america
bad 75to78 america
good 79to83 america
good 79to83 america
bad 70to74 america
good 75to78 europe
bad 75to78 europe
19
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
(same 192 records as above)
20
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
(same 192 records as above)
P̂(dataset | M) = P̂(x1, x2, …, xR | M) = ∏_{k=1}^{R} P̂(xk | M) = 3.4 × 10^−203 (in this case)
21
Log Probabilities
Since probabilities of datasets get so small we usually use log probabilities
log P̂(dataset | M) = log ∏_{k=1}^{R} P̂(xk | M) = ∑_{k=1}^{R} log P̂(xk | M)
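A small sketch of why log probabilities help (the model table and data here are made up; `log_likelihood` is my own helper name):

```python
import math

# Sketch: score a dataset under a model by summing log probabilities
# instead of multiplying tiny probabilities (which underflows).
def log_likelihood(dataset, p_hat):
    """p_hat(x) is the model's probability for record x."""
    return sum(math.log(p_hat(x)) for x in dataset)

# Toy model: a learned table over (A, B) rows (numbers made up).
table = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.1}
data = [(0, 0), (0, 1), (1, 1)]
ll = log_likelihood(data, table.get)  # log 0.5 + log 0.2 + log 0.1
```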
22
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
(same 192 records as above)
log P̂(dataset | M) = log ∏_{k=1}^{R} P̂(xk | M) = ∑_{k=1}^{R} log P̂(xk | M) = −466.19 (in this case)
23
Summary: The Good News
Prob. Tables allow us to learn P(X) from data.
• Can do anomaly detection: spot suspicious / incorrect records
• Can do inference: P(E1|E2) (Automatic Doctor, Help Desk, etc.)
• Can do Bayesian classification (coming soon…)
24
Summary: The Bad News
Learning a Prob. Table for P(X) is for Cowboy data miners
• Those probability tables get big fast!
• With so many parameters relative to our data, the resulting model could be pretty wild.
25
Using a test set
An independent test set with 196 cars has a worse log likelihood
(actually it’s a billion quintillion quintillion quintillion quintillion times less likely)
….Density estimators can overfit. And the full joint density estimator is the overfittiest of them all!
26
Overfitting Prob. Tables
If this ever happens, it means there are certain combinations that we learn are impossible
log P̂(testset | M) = ∑_{k=1}^{R} log P̂(xk | M) = −∞ if P̂(xk | M) = 0 for any k
27
Using a test set
The only reason that our test set didn’t score -infinity is that Andrew’s code is hard-wired to always predict a probability of at least one in 10^20
We need P() estimators that are less prone to overfitting
28
Naïve Density Estimation
The problem with the Probability Table is that it just mirrors the training data.
In fact, it is just another way of storing the data: we could reconstruct the original dataset perfectly from our Table.
We need something which generalizes more usefully.
The naïve model generalizes strongly: Assume that each attribute is distributed independently of any of the other attributes.
29
Probability Models
Full Prob. Table:
• No assumptions
• Overfitting-prone
• Scales horribly

Naïve Density:
• Strong assumptions
• Overfitting-resistant
• Scales incredibly well
30
Independently Distributed Data
• Let x[i] denote the i’th field of record x.
• The independently distributed assumption says that for any i, v, u1, u2, …, u(i−1), u(i+1), …, uM:

P(x[i]=v | x[1]=u1, x[2]=u2, …, x[i−1]=u(i−1), x[i+1]=u(i+1), …, x[M]=uM) = P(x[i]=v)

• Or in other words, x[i] is independent of {x[1], x[2], …, x[i−1], x[i+1], …, x[M]}
• This is often written as x[i] ⊥ {x[1], x[2], …, x[i−1], x[i+1], …, x[M]}
31
A note about independence
• Assume A and B are Boolean Random Variables. Then
“A and B are independent”
if and only if
P(A|B) = P(A)
• “A and B are independent” is often notated as
A ⊥ B
32
Independence Theorems
• Assume P(A|B) = P(A)
Then P(A, B) = P(A|B) P(B) = P(A) P(B)

• Assume P(A|B) = P(A)
Then P(B|A) = P(A|B) P(B) / P(A) = P(B)
33
Independence Theorems
• Assume P(A|B) = P(A)
Then P(~A|B) = 1 − P(A|B) = 1 − P(A) = P(~A)

• Assume P(A|B) = P(A)
Then P(A|~B) = (P(A) − P(A,B)) / P(~B) = (P(A) − P(A)P(B)) / (1 − P(B)) = P(A)
34
Multivalued Independence
For Random Variables A and B,
A ⊥ B if and only if
∀u,v : P(A=u | B=v) = P(A=u)
from which you can then prove things like…
∀u,v : P(A=u, B=v) = P(A=u) P(B=v)
∀u,v : P(B=v | A=u) = P(B=v)
35
Back to Naïve Density Estimation
• Let x[i] denote the i’th field of record x.
• Naïve DE assumes x[i] is independent of {x[1], x[2], …, x[i−1], x[i+1], …, x[M]}
• Example:
• Suppose that each record is generated by randomly shaking a green die and a red die
• Dataset 1: A = red value, B = green value
• Dataset 2: A = red value, B = sum of values
• Dataset 3: A = sum of values, B = difference of values
• Which of these datasets violates the naïve assumption?
36
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed.
What is P(A^~B^C^~D)?
= P(A) P(~B) P(C) P(~D)
37
Naïve Distribution General Case
• Suppose x[1], x[2], … x[M] are independently distributed.
P(x[1]=u1, x[2]=u2, …, x[M]=uM) = ∏_{k=1}^{M} P(x[k]=uk)
• So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand.
• So we can do any inference*
• But how do we learn a Naïve Density Estimator?
38
Learning a Naïve Density Estimator
P̂(x[i]=u) = (# records in which x[i]=u) / (total number of records)
Another trivial learning algorithm!
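The two naive-model recipes, learning the per-attribute marginals and multiplying them back into a joint row, can be sketched together (toy records and helper names are my own):

```python
# Sketch: learn a Naive Density Estimator (per-attribute marginals) and
# use it to reconstruct any row of the implied joint on demand.
records = [
    (1, 0, 1), (1, 1, 1), (0, 0, 1), (1, 0, 0),
]  # toy data, made up for illustration

M = len(records[0])
R = len(records)
# marginal[i][u] = fraction of records in which x[i] == u
marginal = [
    {u: sum(1 for x in records if x[i] == u) / R for u in (0, 1)}
    for i in range(M)
]

def naive_joint(row):
    """P(x[1]=u1, ..., x[M]=uM) = product of the learned marginals."""
    p = 1.0
    for i, u in enumerate(row):
        p *= marginal[i][u]
    return p
```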
39
Contrast
Joint DE:
• Can model anything
• No problem to model “C is a noisy copy of A”
• Given 100 records and more than 6 boolean attributes will screw up badly

Naïve DE:
• Can model only very boring distributions
• “C is a noisy copy of A” is outside Naïve’s scope
• Given 100 records and 10,000 multivalued attributes will be fine
40
Empirical Results: “Hopeless”
The “hopeless” dataset consists of 40,000 records and 21 boolean attributes called a,b,c, … u. Each attribute in each record is generated 50-50 randomly as 0 or 1.
Despite the vast amount of data, “Joint” overfits hopelessly and does much worse
Average test set log probability during 10 folds of k-fold cross-validation (*described in a future Andrew lecture)
41
Empirical Results: “Logical”
The “logical” dataset consists of 40,000 records and 4 boolean attributes called a,b,c,d where a,b,c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.
The DE learned by
“Joint”
The DE learned by
“Naive”
42
Empirical Results: “Logical”
The “logical” dataset consists of 40,000 records and 4 boolean attributes called a,b,c,d where a,b,c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.
The DE learned by
“Joint”
The DE learned by
“Naive”
43
A tiny part of the DE
learned by “Joint”
Empirical Results: “MPG”
The “MPG” dataset consists of 392 records and 8 attributes
The DE learned by
“Naive”
44
The DE learned by
“Joint”
Empirical Results: “Weight vs MPG”
Suppose we train only from the “Weight” and “MPG” attributes
The DE learned by
“Naive”
45
The DE learned by
“Joint”
Empirical Results: “Weight vs MPG”
Suppose we train only from the “Weight” and “MPG” attributes
The DE learned by
“Naive”
46
The DE learned by
“Joint”
“Weight vs MPG”: The best that Naïve can do
The DE learned by
“Naive”
47
Reminder: The Good News
• We have two ways to learn a Density Estimator from data.
• There are vastly more impressive Density Estimators (Mixture Models, Bayesian Networks, Density Trees, Kernel Densities and many more)
• Density estimators can do many good things…
• Anomaly detection
• Inference: P(E1|E2) (Automatic Doctor / Help Desk etc.)
• Ingredient for Bayes Classifiers
48
Probability Models
Full Prob. Table:
• No assumptions
• Overfitting-prone
• Scales horribly

Naïve Density:
• Strong assumptions
• Overfitting-resistant
• Scales incredibly well

Bayes Nets:
• Carefully chosen assumptions
• Overfitting and scaling properties depend on assumptions
49
Bayes Nets
50
Bayes Nets
• What are they?
Bayesian nets are a network-based framework for representing and analyzing models involving uncertainty
• What are they used for?
Intelligent decision aids, data fusion, 3-E feature recognition, intelligent diagnostic aids, automated free text understanding, data mining
• Where did they come from?
Cross fertilization of ideas between the artificial intelligence, decision analysis, and statistics communities
• Why the sudden interest?
Development of propagation algorithms followed by availability of easy to use commercial software; growing number of creative applications
• How are they different from other knowledge representation and probabilistic analysis tools?
Different from other knowledge-based systems tools because uncertainty is handled in a mathematically rigorous yet efficient and simple way
51
Bayes Net Concepts
1. Chain Rule: P(A,B) = P(A) P(B|A)
2. Conditional Independence: P(B|A) = P(B)
52
A Simple Bayes Net
• Let’s assume that we already have P(Mpg,Horse)
How would you rewrite this using the Chain rule?
P(Mpg, Horse):
        low   high
good    0.36  0.04
 bad    0.12  0.48

P(Mpg, Horse) = ?
53
Review: Chain Rule
P(Mpg, Horse):
        low   high
good    0.36  0.04
 bad    0.12  0.48

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
54
Review: Chain Rule
P(Mpg, Horse):
        low   high
good    0.36  0.04
 bad    0.12  0.48

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79

P(good, low) = P(good) * P( low|good) = 0.4 * 0.89 ≈ 0.36
P(good,high) = P(good) * P(high|good) = 0.4 * 0.11 ≈ 0.04
P( bad, low) = P( bad) * P( low| bad) = 0.6 * 0.21 ≈ 0.12
P( bad,high) = P( bad) * P(high| bad) = 0.6 * 0.79 ≈ 0.48
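A quick sketch that multiplies the factorization back out (numbers from the slide; note the slide's table rounds the products, and the dictionary names are mine):

```python
# Sketch: the chain-rule factorization P(Mpg, Horse) = P(Mpg) * P(Horse|Mpg).
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse_given_mpg = {
    ("low", "good"): 0.89, ("high", "good"): 0.11,
    ("low", "bad"): 0.21,  ("high", "bad"): 0.79,
}

joint = {
    (m, h): p_mpg[m] * p_horse_given_mpg[(h, m)]
    for m in p_mpg for h in ("low", "high")
}
# joint[("good", "low")] == 0.4 * 0.89 == 0.356 (the slide rounds to 0.36)
```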
55
How to Make a Bayes Net
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
Mpg → Horse
56
How to Make a Bayes Net
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
Mpg → Horse
P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse|Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
57
How to Make a Bayes Net
Mpg → Horse
P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse|Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
• Each node is a probability function
• Each arc denotes conditional dependence
58
How to Make a Bayes Net
So, what have we accomplished thus far?
Nothing; we’ve just “Bayes Net-ified”
P(Mpg, Horse) using the Chain rule.
…the real excitement starts when we wield conditional independence
Mpg → Horse
P(Mpg)
P(Horse|Mpg)
59
How to Make a Bayes Net
Before we apply conditional independence, we need a worthier opponent than puny P(Mpg, Horse)…
We’ll use P(Mpg, Horse, Accel)
60
How to Make a Bayes Net
Rewrite joint using the Chain rule.
P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)
Note: Obviously, we could have written this 3! = 6 different ways…

P(M, H, A) = P(M) * P(H|M) * P(A|M,H)
           = P(M) * P(A|M) * P(H|M,A)
           = P(H) * P(M|H) * P(A|H,M)
           = P(H) * P(A|H) * P(M|H,A)
           = …
61
How to Make a Bayes Net
[Net: Mpg → Horse, Mpg → Accel, Horse → Accel]
Rewrite joint using the Chain rule.
P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)
62
How to Make a Bayes Net
[Net: Mpg → Horse, Mpg → Accel, Horse → Accel]
P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse|Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
P(Accel|Mpg,Horse): P( low|good, low) = 0.97, P( low|good,high) = 0.15, P( low| bad, low) = 0.90, P( low| bad,high) = 0.05, P(high|good, low) = 0.03, P(high|good,high) = 0.85, P(high| bad, low) = 0.10, P(high| bad,high) = 0.95
* Note: I made these up…
63
How to Make a Bayes Net
[Net: Mpg → Horse, Mpg → Accel, Horse → Accel]
P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse|Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
P(Accel|Mpg,Horse): P( low|good, low) = 0.97, P( low|good,high) = 0.15, P( low| bad, low) = 0.90, P( low| bad,high) = 0.05, P(high|good, low) = 0.03, P(high|good,high) = 0.85, P(high| bad, low) = 0.10, P(high| bad,high) = 0.95
A Miracle Occurs!
You are told by God (or another domain expert) that Accel is independent of Mpg given Horse!
P(Accel | Mpg, Horse) = P(Accel | Horse)
64
How to Make a Bayes Net
[Net: Mpg → Horse → Accel]
P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse|Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
P(Accel|Horse): P( low| low) = 0.95, P( low|high) = 0.11, P(high| low) = 0.05, P(high|high) = 0.89
65
How to Make a Bayes Net
[Net: Mpg → Horse → Accel]
P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse|Mpg): P( low|good) = 0.89, P( low| bad) = 0.21, P(high|good) = 0.11, P(high| bad) = 0.79
P(Accel|Horse): P( low| low) = 0.95, P( low|high) = 0.11, P(high| low) = 0.05, P(high|high) = 0.89
Thank you, domain expert! Now I only need to learn 5 parameters instead of 7 from my data! My parameter estimates will be more accurate as a result!
66
Independence
“The Acceleration does not depend on the Mpg once I know the Horsepower.”
This can be specified very simply:
P(Accel | Mpg, Horse) = P(Accel | Horse)
This is a powerful statement!
It required extra domain knowledge. A different kind of knowledge than numerical probabilities. It needed an understanding of causation.
67
Bayes Net-Building Tools
1. Chain Rule: P(A,B) = P(A) P(B|A)
2. Conditional Independence: P(B|A) = P(B)
This is Ridiculously Useful in general
68
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:
• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any length are allowed.
Each vertex in V contains the following information:
• The name of a random variable
• A probability distribution table indicating how the probability of this variable’s values depends on all possible combinations of parental values.
69
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.G. What could we do to compute P(R | T, ~S)?

[Net: S → L, M → L, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
70
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.G. What could we do to compute P(R | T, ~S)?

[Net: S → L, M → L, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
Step 1: Compute P(R ^ T ^ ~S)
Step 2: Compute P(~R ^ T ^ ~S)
Step 3: Return
P(R ^ T ^ ~S)
-------------------------------------
P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)
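A sketch of this three-step recipe on the S, M, L, R, T net, using the CPT numbers above (helper names are mine; each sum below performs 4 joint computes):

```python
# Sketch: joint entries of the net are products of each node's CPT entry
# given its parents; conditionals come from two sums of joint entries.
P_S, P_M = 0.3, 0.6
P_R = {True: 0.3, False: 0.6}            # P(R | M), keyed by M
P_T = {True: 0.3, False: 0.8}            # P(T | L), keyed by L
P_L = {(True, True): 0.05, (True, False): 0.1,
       (False, True): 0.1, (False, False): 0.2}   # P(L | M, S), keyed by (M, S)

def bern(p, value):
    return p if value else 1.0 - p

def joint(s, m, l, r, t):
    return (bern(P_S, s) * bern(P_M, m) * bern(P_L[(m, s)], l)
            * bern(P_R[m], r) * bern(P_T[l], t))

# Steps 1-3: P(R | T, ~S) by summing joint entries over the hidden M, L.
num = sum(joint(False, m, l, True, True)
          for m in (True, False) for l in (True, False))
den = num + sum(joint(False, m, l, False, True)
                for m in (True, False) for l in (True, False))
p_r_given_t_not_s = num / den
```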
71
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.G. What could we do to compute P(R | T, ~S)?

[Net: S → L, M → L, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
Step 1: Compute P(R ^ T ^ ~S)
Step 2: Compute P(~R ^ T ^ ~S)
Step 3: Return
P(R ^ T ^ ~S)
-------------------------------------
P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)
Sum of all the rows in the Joint that match R ^ T ^ ~S
Sum of all the rows in the Joint that match ~R ^ T ^ ~S
72
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.G. What could we do to compute P(R | T, ~S)?

[Net: S → L, M → L, M → R, L → T]

P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
Step 1: Compute P(R ^ T ^ ~S)
Step 2: Compute P(~R ^ T ^ ~S)
Step 3: Return
P(R ^ T ^ ~S)
-------------------------------------
P(R ^ T ^ ~S)+ P(~R ^ T ^ ~S)
Sum of all the rows in the Joint that match R ^ T ^ ~S
Sum of all the rows in the Joint that match ~R ^ T ^ ~S
Each of these obtained by the “computing a joint probability entry” method of the earlier slides
4 joint computes
4 joint computes
73
Independence
We’ve stated:
P(M) = 0.6, P(S) = 0.3, P(S|M) = P(S)

From these statements, we can derive the full joint pdf (using independence, e.g. P(M ^ S) = P(M) P(S) = 0.18):

M S Prob
T T 0.18
T F 0.42
F T 0.12
F F 0.28

And since we now have the joint pdf, we can make any queries we like.
74
The good news
We can do inference. We can compute any conditional probability:
P( Some variable | Some other variable values )

P(E1 | E2) = P(E1 ^ E2) / P(E2) = ( ∑_{joint entries matching E1 and E2} P(joint entry) ) / ( ∑_{joint entries matching E2} P(joint entry) )
75
The good news
We can do inference. We can compute any conditional probability:
P( Some variable | Some other variable values )

P(E1 | E2) = P(E1 ^ E2) / P(E2) = ( ∑_{joint entries matching E1 and E2} P(joint entry) ) / ( ∑_{joint entries matching E2} P(joint entry) )
Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables.
How much work is the above computation?
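A brute-force sketch of the enumeration this question is about; the loop visits all 2^m assignments, which is exactly the exponential cost the deck discusses (function and parameter names are mine):

```python
from itertools import product

# Sketch: conditional inference by enumerating the whole joint.
# With m binary variables, the loop below touches 2**m entries,
# which is what makes this method exponential in m.
def cond_prob(joint_fn, m, e1, e2):
    """P(e1 | e2); joint_fn gives P of a full assignment (tuple of bools)."""
    num = den = 0.0
    for assign in product((True, False), repeat=m):   # 2**m assignments
        p = joint_fn(assign)
        if e2(assign):
            den += p
            if e1(assign):
                num += p
    return num / den
```

For example, with two independent fair coins (uniform joint of 0.25 per entry), conditioning on the second coin leaves the first coin at probability 0.5.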
76
The sad, bad news
Conditional probabilities by enumerating all matching entries in the joint are expensive:
Exponential in the number of variables.
77
The sad, bad news
Conditional probabilities by enumerating all matching entries in the joint are expensive:
Exponential in the number of variables.
But perhaps there are faster ways of querying Bayes nets?• In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find
there are often many tricks to save you time.• So we’ve just got to program our computer to do those tricks too, right?
78
The sad, bad news
Conditional probabilities by enumerating all matching entries in the joint are expensive:
Exponential in the number of variables.
But perhaps there are faster ways of querying Bayes nets?• In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find
there are often many tricks to save you time.• So we’ve just got to program our computer to do those tricks too, right?
Sadder and worse news:
General querying of Bayes nets is NP-complete.
79
Case Study I
Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA).
• Diagnostic system for lymph-node diseases.
• 60 diseases and 100 symptoms and test-results.
• 14,000 probabilities
• Expert consulted to make net.
• 8 hours to determine variables.
• 35 hours for net topology.
• 40 hours for probability table values.
• Apparently, the experts found it quite easy to invent the causal links and probabilities.
• Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.
80
What you should know
• The meanings and importance of independence and conditional independence.
• The definition of a Bayes net.
• Computing probabilities of assignments of variables (i.e. members of the joint p.d.f.) with a Bayes net.
• The slow (exponential) method for computing arbitrary conditional probabilities.
• The stochastic simulation method and likelihood weighting.
81
82
Chain Rule

P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg):
        good   bad
 low    0.89   0.21
high    0.11   0.79

P(Horse): P( low) = 0.48, P(high) = 0.52
P(Mpg | Horse):
         low   high
good    0.72   0.08
 bad    0.28   0.92

P(Mpg, Horse):
         low   high
good    0.36   0.04
 bad    0.12   0.48
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
P(Mpg, Horse) = P(Horse) * P(Mpg | Horse)
83
Chain Rule

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
P(Mpg, Horse) = P(Horse) * P(Mpg | Horse)

P(Mpg, Horse):
 Mpg   Horse
good    low   0.36
good   high   0.04
 bad    low   0.12
 bad   high   0.48

P(Mpg): good 0.4, bad 0.6
P(Horse | Mpg):
 Mpg   Horse
good    low   0.89
good   high   0.11
 bad    low   0.21
 bad   high   0.79

P(Horse): low 0.48, high 0.52
P(Mpg | Horse):
Horse    Mpg
 low    good  0.72
 low     bad  0.28
high    good  0.08
high     bad  0.92
84
Making a Bayes Net
1. Start with a Joint: P(Horse, Mpg)
2. Rewrite the joint with the Chain Rule: P(Horse) P(Mpg | Horse)
3. Each component becomes a node (with arcs): Horse → Mpg
85
P(Mpg, Horsepower)
Mpg → Horsepower

P(Mpg): P(good) = 0.4, P( bad) = 0.6
P(Horsepower | Mpg):
        good   bad
 low    0.89   0.21
high    0.11   0.79
We can rewrite P(Mpg, Horsepower) as a Bayes net!
1. Decompose with the Chain rule.
2. Make each element of decomposition into a node
86
P(Mpg, Horsepower)
Horsepower → Mpg

P(Horse): P( low) = 0.48, P(high) = 0.52
P(Mpg | Horse):
Horse    Mpg
 low    good  0.72
 low     bad  0.28
high    good  0.08
high     bad  0.92
87
P(f, h):
        h     ~h
 f    0.01   0.01
~f    0.09   0.89
The Joint Distribution
88
The Joint Distribution

[Venn diagram over events f, h, and a]
89
Joint Distribution
0.480.12bad
0.040.36good
highlow
P(Mpg, Horse)
P(Mpg, Horse)
Venn Diagrams• Visual• Good for tutorials
Probability Tables• Efficient• Good for computation
good0.36
high0.48
0.12
0.04
Up to this point, we have represented a probability model (or joint distribution) in two ways:
90
Where do Joint Distributions come from?
• Idea One: Expert Humans with lots of free time
• Idea Two: Simpler probabilistic facts and some algebra

Example: Suppose you knew
P(A) = 0.7
P(B|A) = 0.2P(B|~A) = 0.1
P(C|A,B) = 0.1P(C|A,~B) = 0.8P(C|~A,B) = 0.3P(C|~A,~B) = 0.1
Then you can automatically compute the PT using the chain rule:

P(x, y, z) = P(x) P(y|x) P(z|x,y)

Later: Bayes Nets will do this systematically.
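A sketch of that computation with the numbers above (the dictionary layout and helper names are mine):

```python
from itertools import product

# Sketch: build the full 8-row PT from the simpler facts via the chain rule
# P(a, b, c) = P(a) * P(b|a) * P(c|a,b), using the numbers from the slide.
P_A = 0.7
P_B = {True: 0.2, False: 0.1}                     # P(B | A), keyed by A
P_C = {(True, True): 0.1, (True, False): 0.8,
       (False, True): 0.3, (False, False): 0.1}   # P(C | A, B)

def bern(p, v):
    return p if v else 1.0 - p

ptable = {
    (a, b, c): bern(P_A, a) * bern(P_B[a], b) * bern(P_C[(a, b)], c)
    for a, b, c in product((True, False), repeat=3)
}
# e.g. P(A, B, C) = 0.7 * 0.2 * 0.1
```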
91
Where do Joint Distributions come from?
• Idea Three: Learn them from data!
Prepare to be impressed…
92
A more interesting case
• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.
Let’s begin with writing down knowledge we’re happy about:
P(S|M) = P(S), P(S) = 0.3, P(M) = 0.6
Lateness is not independent of the weather and is not independent of the lecturer.
93
A more interesting case
• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.
Let’s begin with writing down knowledge we’re happy about:
P(S|M) = P(S), P(S) = 0.3, P(M) = 0.6
Lateness is not independent of the weather and is not independent of the lecturer.
We already know the Joint of S and M, so all we need now is P(L | S=u, M=v) in the 4 cases of u/v = True/False.
94
A more interesting case
• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.
P(S|M) = P(S), P(S) = 0.3, P(M) = 0.6
P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2
Now we can derive a full joint p.d.f. with a “mere” six numbers instead of seven*
*Savings are larger for larger numbers of variables.
95
A more interesting case
• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.
P(S|M) = P(S), P(S) = 0.3, P(M) = 0.6
P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2
Question: Express P(L=x ^ M=y ^ S=z) in terms that only need the above expressions, where x, y and z may each be True or False.
96
A bit of notation

P(S|M) = P(S), P(S) = 0.3, P(M) = 0.6
P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2

[Net: S → L ← M]
P(S)=0.3, P(M)=0.6
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
97
A bit of notation

P(S|M) = P(S), P(S) = 0.3, P(M) = 0.6
P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2

[Net: S → L ← M]
P(S)=0.3, P(M)=0.6
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
Read the absence of an arrow between S and M to mean “it would not help me predict M if I knew the value of S”.
Read the two arrows into L to mean that if I want to know the value of L it may help me to know M and to know S.
This kind of stuff will be thoroughly formalized later.
98
An even cuter trick
Suppose we have these three events:
• M : Lecture taught by Manuela
• L : Lecturer arrives late
• R : Lecture concerns robots
Suppose:
• Andrew has a higher chance of being late than Manuela.
• Andrew has a higher chance of giving robotics lectures.
What kind of independence can we find?
How about:
• P(L|M) = P(L) ?
• P(R|M) = P(R) ?
• P(L|R) = P(L) ?
99
Conditional independence
Once you know who the lecturer is, then whether they arrive late doesn’t affect whether the lecture concerns robots.
P(R | M, L) = P(R | M) and
P(R | ~M, L) = P(R | ~M)
We express this in the following way:
“R and L are conditionally independent given M”
[Net: L ← M → R]
Given knowledge of M, knowing anything else in the diagram won’t help us with L, etc.
..which is also notated by the following diagram.
100
Conditional Independence formalized
R and L are conditionally independent given M if
for all x,y,z in {T,F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally:
Let S1 and S2 and S3 be sets of variables.
Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets,
P(S1’s assignments | S2’s assignments & S3’s assignments) =
P(S1’s assignments | S3’s assignments)
101
Example:
R and L are conditionally independent given M if
for all x,y,z in {T,F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally:
Let S1 and S2 and S3 be sets of variables.
Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets,
P(S1’s assignments | S2’s assignments & S3’s assignments) =
P(S1’s assignments | S3’s assignments)
“Shoe-size is conditionally independent of Glove-size given height, weight, and age”
means
forall s,g,h,w,a
P(ShoeSize=s|Height=h,Weight=w,Age=a)
=
P(ShoeSize=s|Height=h,Weight=w,Age=a,GloveSize=g)
102
Example:
R and L are conditionally independent given M if
for all x,y,z in {T,F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally:
Let S1 and S2 and S3 be sets of variables.
Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets,
P(S1’s assignments | S2’s assignments & S3’s assignments) =
P(S1’s assignments | S3’s assignments)
“Shoe-size is conditionally independent of Glove-size given height, weight, and age”
does not mean
forall s,g,h
P(ShoeSize=s|Height=h)
=
P(ShoeSize=s|Height=h, GloveSize=g)
103
Conditional independence
[Net: L ← M → R]

We can write down P(M). And then, since we know L is only directly influenced by M, we can write down the values of P(L|M) and P(L|~M) and know we’ve fully specified L’s behavior. Ditto for R.

P(M) = 0.6, P(L|M) = 0.085, P(L|~M) = 0.17, P(R|M) = 0.3, P(R|~M) = 0.6
‘R and L conditionally independent given M’
104
Conditional independence

[Net: L ← M → R]

P(M) = 0.6
P(L|M) = 0.085
P(L|~M) = 0.17
P(R|M) = 0.3
P(R|~M) = 0.6

Conditional Independence: P(R|M,L) = P(R|M), P(R|~M,L) = P(R|~M)

Again, we can obtain any member of the Joint prob dist that we desire:

P(L=x ^ R=y ^ M=z) = P(M=z) P(L=x | M=z) P(R=y | M=z)
105
Assume five variables
T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny
• T only directly influenced by L (i.e. T is conditionally independent of R,M,S given L)
• L only directly influenced by M and S (i.e. L is conditionally independent of R given M & S)
• R only directly influenced by M (i.e. R is conditionally independent of L,S, given M)
• M and S are independent
106
Making a Bayes net
(nodes: S, M, R, L, T)
Step One: add variables.
• Just choose the variables you'd like to be included in the net.
107
Making a Bayes net
(links: S → L, M → L, M → R, L → T)
Step Two: add links.
• The link structure must be acyclic.
• If node X is given parents Q1,Q2,..Qn, you are promising that any variable that is a non-descendant of X is conditionally independent of X given {Q1,Q2,..Qn}.
108
Making a Bayes net
(network: S → L, M → L, M → R, L → T)
P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
Step Three: add a probability table for each node.
• The table for node X must list P(X | parent values) for each possible combination of parent values.
109
Making a Bayes net
(network: S → L, M → L, M → R, L → T)
P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
• Two unconnected variables may still be correlated.
• Each node is conditionally independent of all non-descendants in the tree, given its parents.
• You can deduce many other conditional independence relations from a Bayes net. See the next lecture.
110
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E), where:
• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any length are allowed.
Each vertex in V contains the following information:
• The name of a random variable.
• A probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parental values.
111
Building a Bayes Net
1. Choose a set of relevant variables.
2. Choose an ordering for them.
3. Assume they're called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc.)
4. For i = 1 to m:
   1. Add the Xi node to the network.
   2. Set Parents(Xi) to be a minimal subset of {X1…Xi-1} such that we have conditional independence of Xi and all other members of {X1…Xi-1} given Parents(Xi).
   3. Define the probability table of P(Xi=k | Assignments of Parents(Xi)).
112
Example Bayes Net Building
Suppose we're building a nuclear power station. There are the following random variables:
GRL: Gauge Reads Low.
CTL: Core temperature is low.
FG: Gauge is faulty.
FA: Alarm is faulty.
AS: Alarm sounds.
• If the alarm is working properly, it is meant to sound if the gauge stops reading a low temp.
• If the gauge is working properly, it is meant to read the temp of the core.
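One plausible link structure for these five variables can be written down as parent lists. This is an assumption offered for illustration only; the slides leave the structure as an exercise:

```python
# One plausible structure for the power-station variables -- an assumption
# for illustration, not an answer given in the slides. Each variable is
# listed with the parents that directly influence it.
plausible_parents = {
    "CTL": [],             # core temperature is a root cause
    "FG":  [],             # gauge faultiness is exogenous
    "FA":  [],             # alarm faultiness is exogenous
    "GRL": ["CTL", "FG"],  # gauge tracks the core temp, unless faulty
    "AS":  ["GRL", "FA"],  # alarm responds to the gauge, unless faulty
}
```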
113
Computing a Joint Entry
How do we compute an entry in the joint distribution?
E.g.: What is P(S ^ ~M ^ L ^ ~R ^ T)?
(network: S → L, M → L, M → R, L → T)
P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
114
Computing with Bayes Net
P(T ^ ~R ^ L ^ ~M ^ S)
= P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S)
(network: S → L, M → L, M → R, L → T)
P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
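Plugging the network's numbers into the last line of the derivation gives the entry directly. A quick numerical check:

```python
# Evaluating P(T ^ ~R ^ L ^ ~M ^ S) with the network's numbers:
# P(T|L) * P(~R|~M) * P(L|~M^S) * P(~M) * P(S)
p_T_given_L = 0.3
p_notR_given_notM = 1 - 0.6   # = 1 - P(R|~M)
p_L_given_notM_S = 0.1
p_notM = 1 - 0.6              # = 1 - P(M)
p_S = 0.3

p = p_T_given_L * p_notR_given_notM * p_L_given_notM_S * p_notM * p_S
# = 0.3 * 0.4 * 0.1 * 0.4 * 0.3 = 0.00144
```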
115
The general case
P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X1=x1)
:
:
= Π(i=1..n) P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1)
= Π(i=1..n) P(Xi=xi | Assignments of Parents(Xi))
(the last step uses the conditional independences promised when the links were added)
So any entry in joint pdf table can be computed. And so any conditional probability can be computed.
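The product-over-parents formula is short in code. A minimal Python sketch, using the running lecture example's structure and numbers (the dict layout is an illustrative choice, not notation from the slides):

```python
# One CPT per node, keyed by the tuple of parent values, gives any
# joint entry in a single pass over the nodes.
net = {
    # node: (parents, CPT mapping parent-value tuple -> P(node=True | parents))
    "S": ((), {(): 0.3}),
    "M": ((), {(): 0.6}),
    "L": (("M", "S"), {(True, True): 0.05, (True, False): 0.1,
                       (False, True): 0.1, (False, False): 0.2}),
    "R": (("M",), {(True,): 0.3, (False,): 0.6}),
    "T": (("L",), {(True,): 0.3, (False,): 0.8}),
}

def joint_entry(assign, net=net):
    """P of one full truth assignment: product of P(Xi = xi | Parents(Xi))."""
    p = 1.0
    for node, (parents, cpt) in net.items():
        p_true = cpt[tuple(assign[q] for q in parents)]
        p *= p_true if assign[node] else 1 - p_true
    return p

# P(T ^ ~R ^ L ^ ~M ^ S), matching the hand derivation on the previous slide:
p = joint_entry({"S": True, "M": False, "L": True, "R": False, "T": True})
```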
119
Where are we now?
• We have a methodology for building Bayes nets.
• We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.
E.g.: What could we do to compute P(R | T, ~S)?
(network: S → L, M → L, M → R, L → T)
P(S)=0.3, P(M)=0.6
P(R|M)=0.3, P(R|~M)=0.6
P(T|L)=0.3, P(T|~L)=0.8
P(L|M^S)=0.05, P(L|M^~S)=0.1, P(L|~M^S)=0.1, P(L|~M^~S)=0.2
Step 1: Compute P(R ^ T ^ ~S), the sum of all the rows in the joint that match R ^ T ^ ~S.
Step 2: Compute P(~R ^ T ^ ~S), the sum of all the rows in the joint that match ~R ^ T ^ ~S.
Step 3: Return
P(R ^ T ^ ~S) / [ P(R ^ T ^ ~S) + P(~R ^ T ^ ~S) ]
Each of these sums is obtained by the “computing a joint probability entry” method of the earlier slides: 4 joint computes for Step 1 and 4 for Step 2.
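Steps 1 to 3 amount to brute-force enumeration, which is easy to sketch. A minimal Python version on the running example (the dict layout and function names are illustrative choices, not notation from the slides):

```python
from itertools import product

# One CPT per node, keyed by the tuple of parent values.
net = {
    "S": ((), {(): 0.3}),
    "M": ((), {(): 0.6}),
    "L": (("M", "S"), {(True, True): 0.05, (True, False): 0.1,
                       (False, True): 0.1, (False, False): 0.2}),
    "R": (("M",), {(True,): 0.3, (False,): 0.6}),
    "T": (("L",), {(True,): 0.3, (False,): 0.8}),
}

def joint_entry(assign):
    """P of one full truth assignment: product of P(Xi | Parents(Xi))."""
    p = 1.0
    for node, (parents, cpt) in net.items():
        p_true = cpt[tuple(assign[q] for q in parents)]
        p *= p_true if assign[node] else 1 - p_true
    return p

def conditional(var, val, evidence):
    """P(var=val | evidence) by summing all matching joint entries."""
    names = list(net)
    num = den = 0.0
    for values in product([True, False], repeat=len(names)):
        assign = dict(zip(names, values))
        if any(assign[e] != v for e, v in evidence.items()):
            continue  # row does not match the evidence
        p = joint_entry(assign)
        den += p                      # rows matching T ^ ~S
        if assign[var] == val:
            num += p                  # rows also matching R
    return num / den

p = conditional("R", True, {"T": True, "S": False})  # P(R | T, ~S)
```

With only M and L left unobserved, the numerator and denominator here each reduce to the 4 joint computes named on the slide.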