Oliver Schulte, Machine Learning 726: Bayes Nets and Probabilities
Slide 2
2/57 Bayes Nets: General Points
- Represent domain knowledge.
- Allow for uncertainty.
- Complete representation of probabilistic knowledge.
- Represent causal relations.
- Fast answers to queries such as:
  - Probabilistic: What is the probability that a patient has strep throat given that they have fever?
  - Relevance: Is fever relevant to having strep throat?
Slide 3
3/57 Bayes Net Links
Judea Pearl's Turing Award. See UBC's AISpace.
Slide 4
4/57 Probability Reasoning (With Bayes Nets)
Slide 5
5/57 Random Variables
A random variable has a probability associated with each of its values. A basic statement assigns a value to a random variable.

Variable | Value | Probability
Weather | Sunny | 0.7
Weather | Rainy | 0.2
Weather | Cloudy | 0.08
Weather | Snow | 0.02
Cavity | True | 0.2
Cavity | False | 0.8
Slide 6
6/57 Probability for Sentences
A sentence or query is formed by using and, or, not recursively with basic statements. Sentences also have probabilities assigned to them.

Sentence | Probability
P(Cavity = false AND Toothache = false) | 0.72
P(Cavity = true AND Toothache = false) | 0.08
Slide 7
7/57 Probability Notation
Probability theorists often write A, B instead of A AND B (like Prolog). If the intended random variables are known, they are often not mentioned.

Shorthand | Full Notation
P(Cavity = false, Toothache = false) | P(Cavity = false AND Toothache = false)
P(false, false) | P(Cavity = false AND Toothache = false)
Slide 8
8/57 Axioms of probability
For any formulas A, B:
- 0 <= P(A) <= 1
- P(true) = 1 and P(false) = 0
- P(A OR B) = P(A) + P(B) - P(A AND B)
- P(A) = P(B) if A and B are logically equivalent.
Formulas can be considered as sets of complete assignments (possible worlds).
Slide 9
9/57 Rule 1: Logical Equivalence
P(NOT (NOT Cavity)) | P(Cavity) | 0.2
P(NOT (Cavity OR Toothache)) | P(Cavity = F AND Toothache = F) | 0.72
P(NOT (Cavity AND Toothache)) | P(Cavity = F OR Toothache = F) | 0.88
Slide 10
10/57 The Logical Equivalence Pattern
P(NOT (NOT Cavity)) = P(Cavity) = 0.2
P(NOT (Cavity OR Toothache)) = P(Cavity = F AND Toothache = F) = 0.72
P(NOT (Cavity AND Toothache)) = P(Cavity = F OR Toothache = F) = 0.88
Rule 1: Logically equivalent expressions have the same probability.
13/57 Prove the Pattern: Marginalization
Theorem. P(A) = P(A, B) + P(A, not B).
Proof.
1. A is logically equivalent to [(A and B) or (A and not B)].
2. P(A) = P([(A and B) or (A and not B)]) = P(A and B) + P(A and not B) - P([(A and B) and (A and not B)]), by the disjunction rule.
3. [(A and B) and (A and not B)] is logically equivalent to false, so P([(A and B) and (A and not B)]) = 0.
4. So 2. implies P(A) = P(A and B) + P(A and not B).
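The theorem can be checked numerically; a quick sketch, using joint values consistent with the deck's dentist example:

```python
# Marginalization: P(A) = P(A, B) + P(A, not B), with A = Cavity, B = Toothache.
p_cavity_toothache = 0.12      # P(Cavity, Toothache)
p_cavity_no_toothache = 0.08   # P(Cavity, NOT Toothache)
p_cavity = p_cavity_toothache + p_cavity_no_toothache
print(p_cavity)  # marginal P(Cavity), approximately 0.2
```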
Slide 14
14/57 Completeness of Bayes Nets A probabilistic query system
is complete if it can compute a probability for every sentence.
Proposition: A Bayes net is complete. Proof has two steps. 1. Any
system that encodes the joint distribution is complete. 2. A Bayes
net encodes the joint distribution.
Slide 15
15/57 The Joint Distribution
Slide 16
16/57 Assigning Probabilities to Sentences A complete
assignment is a conjunctive sentence that assigns a value to each
random variable. The joint probability distribution specifies a
probability for each complete assignment. A joint distribution
determines a probability for every sentence. How? Spot the
pattern.
Slide 17
17/57 Probabilities for Sentences: Spot the Pattern
Sentence | Probability
P(Cavity = false AND Toothache = false) | 0.72
P(Cavity = true AND Toothache = false) | 0.08
P(Toothache = false) | 0.8
Slide 18
18/57 Inference by enumeration
Slide 19
19/57 Inference by enumeration Marginalization: For any
sentence A, sum the joint probabilities for the complete
assignments where A is true. P(toothache) = 0.108 + 0.012 + 0.016 +
0.064 = 0.2.
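Inference by enumeration can be sketched directly in Python. The eight joint values below are an assumption chosen so that the four toothache rows match the sum on the slide (0.108 + 0.012 + 0.016 + 0.064 = 0.2); the remaining four are illustrative:

```python
# Inference by enumeration over a dentist-style joint distribution.
# Worlds are (toothache, catch, cavity); values sum to 1.
joint = {
    (True,  True,  True):  0.108,
    (True,  False, True):  0.012,
    (True,  True,  False): 0.016,
    (True,  False, False): 0.064,
    (False, True,  True):  0.072,
    (False, False, True):  0.008,
    (False, True,  False): 0.144,
    (False, False, False): 0.576,
}

def prob(sentence):
    """Sum the joint probabilities of the complete assignments where the sentence is true."""
    return sum(p for world, p in joint.items() if sentence(*world))

p_toothache = prob(lambda toothache, catch, cavity: toothache)
print(round(p_toothache, 3))  # 0.2
```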
Slide 20
20/57 Completeness Proof for Joint Distribution
Theorem [from propositional logic]. Every sentence is logically equivalent to a disjunction of the form A_1 or A_2 or ... or A_k, where the A_i are complete assignments.
1. All of the A_i are mutually exclusive (joint probability 0). Why?
2. So if S is equivalent to A_1 or A_2 or ... or A_k, then P(S) = sum_i P(A_i), where each P(A_i) is given by the joint distribution.
Slide 21
21/57 Bayes Nets and The Joint Distribution
Slide 22
22/57 Example: Complete Bayesian Network
Slide 23
23/57 The Story
You have a new burglar alarm installed at home. It's reliable at detecting burglary but also responds to earthquakes. You have two neighbors who promise to call you at work when they hear the alarm. John always calls when he hears the alarm, but sometimes confuses the alarm with the telephone ringing. Mary listens to loud music and sometimes misses the alarm.
Slide 24
24/57 Computing The Joint Distribution
A Bayes net provides a compact factored representation of a joint distribution. In words, the joint probability is computed as follows.
1. For each node X_i:
   a. Find the assigned value x_i.
   b. Find the values y_1, ..., y_k assigned to the parents of X_i.
   c. Look up the conditional probability P(x_i | y_1, ..., y_k) in the Bayes net.
2. Multiply together these conditional probabilities.
Slide 25
25/57 Product Formula Example: Burglary
Query: What is the joint probability that all variables are true?
P(M, J, A, E, B) = P(M|A) P(J|A) P(A|E,B) P(E) P(B) = 0.7 x 0.9 x 0.95 x 0.002 x 0.001
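Multiplying the slide's conditional probabilities reproduces the joint value; a minimal check:

```python
# Product formula for the burglary network, all variables true,
# using the CPT entries quoted on the slide.
p_m_given_a = 0.7    # P(M | A)
p_j_given_a = 0.9    # P(J | A)
p_a_given_eb = 0.95  # P(A | E, B)
p_e = 0.002          # P(E)
p_b = 0.001          # P(B)
p_joint = p_m_given_a * p_j_given_a * p_a_given_eb * p_e * p_b
print(p_joint)  # about 1.2e-06
```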
Slide 26
26/57 Compactness of Bayesian Networks
Consider n binary variables. An unconstrained joint distribution requires O(2^n) probabilities. If we have a Bayesian network with a maximum of k parents for any node, then we need only O(n 2^k) probabilities.
Example:
- Full unconstrained joint distribution, n = 30: need 2^30 probabilities.
- Bayesian network, n = 30, k = 4: need only 30 x 2^4 = 480 probabilities.
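The parameter counts are easy to verify:

```python
# Full joint vs. Bayes net parameter counts for n binary variables,
# with at most k parents per node (numbers from the slide).
n, k = 30, 4
full_joint = 2 ** n      # unconstrained joint distribution
bayes_net = n * 2 ** k   # O(n * 2^k) conditional probabilities
print(full_joint, bayes_net)  # 1073741824 480
```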
Slide 27
27/57 Summary: Why are Bayes nets useful?
- Graph structure supports
  - Modular representation of knowledge
  - Local, distributed algorithms for inference and learning
  - Intuitive (possibly causal) interpretation
- Factored representation may have exponentially fewer parameters than the full joint P(X_1, ..., X_n) =>
  - lower sample complexity (less data for learning)
  - lower time complexity (less time for inference)
Slide 28
28/57 Is it Magic?
How can the Bayes net reduce parameters? By exploiting conditional independencies.
Why does the product formula work?
1. The Bayes net's topological (graphical) semantics: the graph by itself entails conditional independencies.
2. The Chain Rule.
Slide 29
29/57 Conditional Probabilities and Independence
Slide 30
30/57 Conditional Probabilities: Intro
Given (A) that a die comes up with an odd number, what is the probability that (B) the number is (1) a 2? (2) a 3?
Answer: the number of cases that satisfy both A and B, out of the number of cases that satisfy A.
Examples:
1. #faces with (odd and 2) / #faces with odd = 0/3 = 0.
2. #faces with (odd and 3) / #faces with odd = 1/3.
Slide 31
31/57 Conditional Probs ctd.
Suppose that 50 students are taking 310 and 30 of them are women. Given (A) that a student is taking 310, what is the probability that (B) they are a woman?
Answer: #students who take 310 and are women / #students in 310 = 30/50 = 3/5.
Notation: P(B|A).
Slide 32
32/57 Conditional Ratios: Spot the Pattern
P(Student takes 310) | P(Student takes 310 and is a woman) | P(Student is a woman | Student takes 310)
50/15,000 | 30/15,000 | 3/5
P(die comes up with odd number) | P(die comes up with 3) | P(3 | odd number)
1/2 | 1/6 | 1/3
Slide 33
33/57 Conditional Probs: The Ratio Pattern
P(Student takes 310 and is a woman) / P(Student takes 310) = P(Student is a woman | Student takes 310): (30/15,000) / (50/15,000) = 3/5
P(die comes up with 3) / P(die comes up with odd number) = P(3 | odd number): (1/6) / (1/2) = 1/3
P(A|B) = P(A and B) / P(B). Important!
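The ratio rule can be checked with the class example from the slides (using the 15,000-student population that appears there):

```python
# P(A | B) = P(A and B) / P(B), with B = "takes 310", A = "is a woman".
total = 15_000
p_takes_310 = 50 / total
p_takes_310_and_woman = 30 / total
p_woman_given_310 = p_takes_310_and_woman / p_takes_310
print(p_woman_given_310)  # 3/5
```

Note that the population size cancels in the ratio, which is why counting within the conditioning class (30/50) gives the same answer.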
Slide 34
34/57 Conditional Probabilities: Motivation
Much knowledge can be represented as implications B_1, ..., B_k => A. Conditional probabilities are a probabilistic version of reasoning about what follows from conditions. Cognitive Science: our minds store implicational knowledge.
Slide 35
35/57 The Product Rule: Spot the Pattern
P(Cavity) | P(Toothache | Cavity) | P(Cavity, Toothache)
0.2 | 0.6 | 0.12
P(Cavity = F) | P(Toothache | Cavity = F) | P(Toothache, Cavity = F)
0.8 | 0.1 | 0.08
P(Toothache) | P(Cavity | Toothache) | P(Cavity, Toothache)
0.2 | 0.6 | 0.12
Slide 36
36/57 The Product Rule Pattern
P(Cavity) x P(Toothache | Cavity) = P(Cavity, Toothache): 0.2 x 0.6 = 0.12
P(Cavity = F) x P(Toothache | Cavity = F) = P(Toothache, Cavity = F): 0.8 x 0.1 = 0.08
P(Toothache) x P(Cavity | Toothache) = P(Cavity, Toothache): 0.2 x 0.6 = 0.12
Slide 37
37/57 Independence
A and B are independent iff P(A|B) = P(A), or P(B|A) = P(B), or P(A, B) = P(A) P(B).
Suppose that Weather is independent of the Cavity scenario. Then the joint distribution decomposes: P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather).
Absolute independence is powerful but rare. Dentistry is a large field with hundreds of variables, none of which are independent. What to do?
Slide 38
38/57 Exercise
Prove that the three definitions of independence are equivalent (assuming all probabilities are positive). A and B are independent iff
1. P(A|B) = P(A), or
2. P(B|A) = P(B), or
3. P(A, B) = P(A) P(B).
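A proof sketch of the equivalences (assuming P(A), P(B) > 0), using only the product rule:

```latex
(1) \Rightarrow (3):\quad P(A, B) = P(A \mid B)\, P(B) = P(A)\, P(B). \\
(3) \Rightarrow (2):\quad P(B \mid A) = \frac{P(A, B)}{P(A)}
    = \frac{P(A)\, P(B)}{P(A)} = P(B). \\
(2) \Rightarrow (1):\quad \text{symmetric to the first two steps, swapping } A \text{ and } B.
```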
Slide 39
39/57 Conditional independence
If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache: (1) P(catch | toothache, cavity) = P(catch | cavity).
The same independence holds if I haven't got a cavity: (2) P(catch | toothache, NOT cavity) = P(catch | NOT cavity).
Catch is conditionally independent of Toothache given Cavity: P(Catch | Toothache, Cavity) = P(Catch | Cavity).
The equivalences for independence also hold for conditional independence, e.g.: P(Toothache | Catch, Cavity) = P(Toothache | Cavity); P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity).
Conditional independence is our most basic and robust form of knowledge about uncertain environments.
Slide 40
40/57 Bayes Nets Graphical Semantics
Slide 41
41/57 Common Causes: Spot the Pattern
[Diagram: Cavity is a common cause of Catch and Toothache.]
Catch is independent of Toothache given Cavity.
Slide 42
42/57 Burglary Example JohnCalls, MaryCalls are conditionally
independent given Alarm.
Slide 43
43/57 Spot the Pattern: Chain Scenario MaryCalls is independent
of Burglary given Alarm. JohnCalls is independent of Earthquake
given Alarm.
Slide 44
44/57 The Markov Condition
A Bayes net is constructed so that each variable is conditionally independent of its nondescendants given its parents. The graph alone (without specified probabilities) entails these conditional independencies.
Causal interpretation: each parent is a direct cause.
Slide 45
45/57 Derivation of the Product Formula
Slide 46
46/57 The Chain Rule
We can always write P(a, b, c, ..., z) = P(a | b, c, ..., z) P(b, c, ..., z) (Product Rule).
Repeatedly applying this idea, we obtain P(a, b, c, ..., z) = P(a | b, c, ..., z) P(b | c, ..., z) P(c | ..., z) ... P(z).
Order the variables such that children come before parents. Then given its parents, each node is independent of its other ancestors, by the topological independence. So P(a, b, c, ..., z) = prod_x P(x | parents(x)).
Slide 47
47/57 Example in Burglary Network
P(M, J, A, E, B)
= P(M | J, A, E, B) P(J, A, E, B) = P(M|A) P(J, A, E, B)
= P(M|A) P(J | A, E, B) P(A, E, B) = P(M|A) P(J|A) P(A, E, B)
= P(M|A) P(J|A) P(A | E, B) P(E, B)
= P(M|A) P(J|A) P(A | E, B) P(E) P(B)
Colours show applications of the Bayes net topological independence.
Slide 48
48/57 Explaining Away
Slide 49
49/57 Common Effects: Spot the Pattern
Influenza and Smokes are independent. Given Bronchitis, they become dependent. [Diagram: Influenza -> Bronchitis <- Smokes.]
Battery Age and Charging System OK are independent. Given Battery Voltage, they become dependent. [Diagram: Battery Age -> Battery Voltage <- Charging System OK.]
Slide 50
50/57 Conditioning on Children
[Diagram: A -> C <- B.]
Independent causes: A and B are independent.
Explaining-away effect: given C, observing A makes B less likely. E.g. Bronchitis in the UBC Simple Diagnostic Problem.
A and B are (marginally) independent, but become dependent once C is known.
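The explaining-away effect can be demonstrated on a small collider network A -> C <- B; the CPT numbers below are illustrative assumptions, not from the slides:

```python
# Explaining-away sketch on a collider A -> C <- B with made-up CPTs.
from itertools import product

p_a, p_b = 0.1, 0.1

def p_c_given(a, b):
    # either cause makes the effect C likely (illustrative numbers)
    return 0.9 if (a or b) else 0.05

def joint(a, b, c):
    pc = p_c_given(a, b)
    return ((p_a if a else 1 - p_a)
            * (p_b if b else 1 - p_b)
            * (pc if c else 1 - pc))

def cond(query, given):
    """P(query | given) by enumerating the eight complete assignments."""
    worlds = [w for w in product([True, False], repeat=3) if given(*w)]
    z = sum(joint(*w) for w in worlds)
    return sum(joint(*w) for w in worlds if query(*w)) / z

p_b_given_c = cond(lambda a, b, c: b, lambda a, b, c: c)
p_b_given_c_and_a = cond(lambda a, b, c: b, lambda a, b, c: c and a)
# Observing A explains C away: P(B | C, A) drops back toward the prior P(B).
print(p_b_given_c, p_b_given_c_and_a)
```

With these numbers, observing C raises the probability of B well above its prior of 0.1, and then additionally observing the alternative cause A pushes it back down.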
Slide 51
51/57 D-separation
A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked if it contains a node such that either
a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
b) the arrows meet head-to-head at the node, and neither the node nor any of its descendants is in the set C.
If all paths from A to B are blocked, A is said to be d-separated from B by C. If A is d-separated from B by C, then the joint distribution over all variables in the graph satisfies the conditional independence of A and B given C.
Slide 52
52/57 D-separation: Example
Slide 53
53/57 Mathematical Analysis
Theorem: If A, B have no common ancestors and neither is a descendant of the other, then they are independent of each other.
Proof for our example [Diagram: A -> C <- B]:
P(a, b) = sum_c P(a, b, c) = sum_c P(a) P(b) P(c | a, b) = P(a) P(b) sum_c P(c | a, b) = P(a) P(b).
Slide 54
54/57 Bayes Theorem
Slide 55
55/57 Abductive Reasoning
Implications are often causal, from cause to effect. Many important queries are diagnostic, from effect to cause. Examples: Burglary -> Alarm; Cavity -> Toothache.
Slide 56
56/57 Bayes Theorem: Another Example A doctor knows the
following. The disease meningitis causes the patient to have a
stiff neck 50% of the time. The prior probability that someone has
meningitis is 1/50,000. The prior that someone has a stiff neck is
1/20. Question: knowing that a person has a stiff neck, what is the probability that they have meningitis?
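Plugging the slide's numbers into Bayes theorem gives the answer; a minimal check:

```python
# P(m | s) = P(s | m) P(m) / P(s), with the slide's meningitis numbers.
p_s_given_m = 0.5     # stiff neck given meningitis
p_m = 1 / 50_000      # prior for meningitis
p_s = 1 / 20          # prior for stiff neck
p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)  # approximately 0.0002
```

Even though meningitis almost always explains a stiff neck, the tiny prior keeps the posterior very small.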
59/57 Explain the Pattern: Bayes Theorem Exercise: Prove Bayes
Theorem P(A | B) = P(B | A) P(A) / P(B).
Slide 60
60/57 On Bayes Theorem
P(a | b) = P(b | a) P(a) / P(b). Useful for assessing diagnostic probability from causal probability: P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect).
Likelihood: how well does the cause explain the effect?
Prior: how plausible is the explanation before any evidence?
Evidence term / normalization constant: how surprising is the evidence?