Date post: | 11-Jan-2016 |
Upload: | harvey-sharp |
Computer Science CPSC 502
Uncertainty Probability and Bayesian Networks
(Ch. 6)
Outline
• Uncertainty and Probability
• Marginal and Conditional Independence
• Bayesian networks
Where are we?
[Figure: course map relating Environment (Deterministic, Stochastic) and Problem Type (Static: Constraint Satisfaction, Query; Sequential: Planning) to Representation and Reasoning Technique. Representations shown: Vars + Constraints, Logics, STRIPS, Belief Nets, Decision Nets, Markov Processes. Reasoning techniques shown: Search, Arc Consistency, Variable Elimination, Value Iteration.]
Done with Deterministic Environments
Second Part of the Course
We’ll focus on Belief Nets
Two main sources of uncertainty (From Lecture 2)
• Sensing Uncertainty: The agent cannot fully observe a state of interest. For example:
  - Right now, how many people are in this building?
  - What disease does this patient have?
  - Where is the soccer player behind me?
• Effect Uncertainty: The agent cannot be certain about the effects of its actions. For example:
  - If I work hard, will I get an A?
  - Will this drug work for this patient?
  - Where will the ball go when I kick it?
Motivation for uncertainty
• To act in the real world, we almost always have to handle uncertainty (both effect and sensing uncertainty)
• Deterministic domains are an abstraction
  - Sometimes this abstraction enables more powerful inference
  - Now we don't make this abstraction anymore
• The main focus of AI shifted from logic to probability in the 1980s
• The language of probability is very expressive and general
• New representations enable efficient reasoning
  - We will see some of these, in particular Bayesian networks
• Reasoning under uncertainty is part of the 'new' AI
• This is not a dichotomy: the framework for probability is logical!
• New frontier: combine logic and probability
Probability as a measure of uncertainty/ignorance
• Probability measures an agent's degree of belief in the truth of propositions about states of the world
• Belief in a proposition f can be measured in terms of a number between 0 and 1
  - this is the probability of f
  - E.g., P("roll of fair die came out as a 6") = 1/6 ≈ 16.7% = 0.167
  - P(f) = 0 means that f is believed to be definitely false
  - P(f) = 1 means that f is believed to be definitely true
  - Using probabilities between 0 and 1 is purely a convention.
Probability Theory and Random Variables
• Probability Theory
  - system of logical axioms and formal operations for sound reasoning under uncertainty
• Basic element: random variable X
  - X is a variable like the ones we have seen in CSP/Planning/Logic, but the agent can be uncertain about the value of X
  - As usual, the domain of a random variable X, written dom(X), is the set of values X can take
• Types of variables
  - Boolean: e.g., Cancer (does the patient have cancer or not?)
  - Categorical: e.g., CancerType could be one of {breastCancer, lungCancer, skinMelanomas}
  - Numeric: e.g., Temperature (integer or real)
• We will focus on Boolean and categorical variables
Possible Worlds
• A possible world specifies an assignment to each random variable
• Example: weather in Vancouver, represented by random variables
  - Temperature: {hot, mild, cold}
  - Weather: {sunny, cloudy}
• There are 6 possible worlds:

      Weather  Temperature  µ(w)
  w1  sunny    hot          0.10
  w2  sunny    mild         0.20
  w3  sunny    cold         0.10
  w4  cloudy   hot          0.05
  w5  cloudy   mild         0.35
  w6  cloudy   cold         0.20

• w ╞ f means that proposition f is true in world w
• A probability measure µ(w) over possible worlds w assigns each world a nonnegative real number such that
  - µ(w) sums to 1 over all possible worlds w (because for sure we are in one of these worlds!)
  - The probability of proposition f is defined by P(f) = Σ_{w ╞ f} µ(w), i.e., the sum of the probabilities of the worlds w in which f is true
Example
• What's the probability of it being cloudy or cold?
  - µ(w3) + µ(w4) + µ(w5) + µ(w6) = 0.10 + 0.05 + 0.35 + 0.20 = 0.70

      Weather  Temperature  µ(w)
  w1  sunny    hot          0.10
  w2  sunny    mild         0.20
  w3  sunny    cold         0.10
  w4  cloudy   hot          0.05
  w5  cloudy   mild         0.35
  w6  cloudy   cold         0.20

• Remember
  - The probability of proposition f is defined by P(f) = Σ_{w ╞ f} µ(w): the sum of the probabilities of the worlds w in which f is true
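The definition above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the helper name `prob` are assumptions made for this example, not part of the lecture.

```python
# Possible worlds of the Weather/Temperature example with their measure mu(w).
worlds = {
    ("sunny", "hot"): 0.10,
    ("sunny", "mild"): 0.20,
    ("sunny", "cold"): 0.10,
    ("cloudy", "hot"): 0.05,
    ("cloudy", "mild"): 0.35,
    ("cloudy", "cold"): 0.20,
}


def prob(proposition, worlds):
    """P(f): sum of mu(w) over the worlds w in which proposition f is true."""
    return sum(mu for w, mu in worlds.items() if proposition(*w))


# Proposition "cloudy or cold" is true in worlds w3, w4, w5 and w6.
p_cloudy_or_cold = prob(
    lambda weather, temperature: weather == "cloudy" or temperature == "cold",
    worlds,
)
print(round(p_cloudy_or_cold, 2))  # 0.7
```

Representing a proposition as a predicate over worlds mirrors the notation w ╞ f: the predicate returns true exactly for the worlds in which f holds.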
Joint Probability Distribution
• Joint distribution over random variables X1, …, Xn:
  - a probability distribution over the joint random variable <X1, …, Xn> with domain dom(X1) × … × dom(Xn) (the Cartesian product)
• Think of a joint distribution over n variables as the n-dimensional table of the corresponding possible worlds
  - Each row corresponds to an assignment X1=x1, …, Xn=xn and its probability P(X1=x1, …, Xn=xn)
• E.g., {Weather, Temperature} example

  Weather  Temperature  µ(w)
  sunny    hot          0.10
  sunny    mild         0.20
  sunny    cold         0.10
  cloudy   hot          0.05
  cloudy   mild         0.35
  cloudy   cold         0.20

Definition (probability distribution)
A probability distribution P on a random variable X is a function dom(X) → [0,1] such that Σ_{x ∈ dom(X)} P(X=x) = 1
Marginalization
• Given the joint distribution, we can compute distributions over subsets of the variables through marginalization:
  P(X=x) = Σ_{z ∈ dom(Z)} P(X=x, Z=z)    (marginalization over Z)
• We also write this as P(X) = Σ_{z ∈ dom(Z)} P(X, Z=z)
• Simply an application of the definition of probability measure!
• Remember?
  - The probability of proposition f is defined by P(f) = Σ_{w ╞ f} µ(w): the sum of the probabilities of the worlds w in which f is true
Marginalization
• Given the joint distribution, we can compute distributions over subsets of the variables through marginalization:
  P(X=x) = Σ_{z ∈ dom(Z)} P(X=x, Z=z)    (marginalization over Z)
• We also write this as P(X) = Σ_{z ∈ dom(Z)} P(X, Z=z)
• This corresponds to summing out a dimension in the table.
• The new table still sums to 1. It must, since it's a probability distribution!

  Weather  Temperature  µ(w)
  sunny    hot          0.10
  sunny    mild         0.20
  sunny    cold         0.10
  cloudy   hot          0.05
  cloudy   mild         0.35
  cloudy   cold         0.20

  Temperature  µ(w)
  hot          0.15
  mild         ?
  cold         ?

P(Temperature=hot) = P(Weather=sunny, Temperature=hot) + P(Weather=cloudy, Temperature=hot) = 0.10 + 0.05 = 0.15
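As a minimal sketch of summing out a dimension, the snippet below marginalizes the Weather/Temperature table over Weather. The function name `marginalize` and its `keep` argument are hypothetical names chosen for this illustration.

```python
from collections import defaultdict

# Joint distribution P(Weather, Temperature) from the table above.
joint = {
    ("sunny", "hot"): 0.10,
    ("sunny", "mild"): 0.20,
    ("sunny", "cold"): 0.10,
    ("cloudy", "hot"): 0.05,
    ("cloudy", "mild"): 0.35,
    ("cloudy", "cold"): 0.20,
}


def marginalize(joint, keep):
    """Sum out every variable whose position is not listed in `keep`."""
    result = defaultdict(float)
    for world, p in joint.items():
        reduced = tuple(world[i] for i in keep)
        result[reduced] += p
    return dict(result)


# Keep position 1 (Temperature), i.e. sum out Weather.
p_temperature = marginalize(joint, keep=[1])
for (temperature,), p in p_temperature.items():
    print(temperature, round(p, 2))
```

The same function handles marginalizing over several variables at once: every position left out of `keep` is summed out.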
Marginalization
• We can also marginalize over more than one variable at once:
  P(X=x) = Σ_{z1 ∈ dom(Z1), …, zn ∈ dom(Zn)} P(X=x, Z1=z1, …, Zn=zn)
  e.g., marginalization over Temperature and Wind

  Wind  Weather  Temperature  µ(w)
  yes   sunny    hot          0.04
  yes   sunny    mild         0.09
  yes   sunny    cold         0.07
  yes   cloudy   hot          0.01
  yes   cloudy   mild         0.10
  yes   cloudy   cold         0.12
  no    sunny    hot          0.06
  no    sunny    mild         0.11
  no    sunny    cold         0.03
  no    cloudy   hot          0.04
  no    cloudy   mild         0.25
  no    cloudy   cold         0.08

  Weather  µ(w)
  sunny    0.40
  cloudy   ?
Marginalization
• We can also get marginals for more than one variable:
  P(X=x, Y=y) = Σ_{z1 ∈ dom(Z1), …, zn ∈ dom(Zn)} P(X=x, Y=y, Z1=z1, …, Zn=zn)
• E.g., summing Wind out of the joint table over Wind, Weather and Temperature:

  Weather  Temperature  µ(w)
  sunny    hot          0.10
  sunny    mild         ?
  sunny    cold         ?
  cloudy   hot          ?
  cloudy   mild         ?
  cloudy   cold         ?

• Still simply an application of the definition of probability measure!
  - The probability of proposition f is defined by P(f) = Σ_{w ╞ f} µ(w): the sum of the probabilities of the worlds w in which f is true
Conditioning
Conditioning: revise beliefs based on new observations
• Build a probabilistic model (the joint probability distribution, JPD)
  - Take into account all background information
  - Called the prior probability distribution
  - Denote the prior probability for proposition h as P(h)
• Observe new information about the world
  - Call all information we received subsequently the evidence e
• Integrate the two sources of information to compute the conditional probability P(h|e)
  - This is also called the posterior probability of h given e.
Example for conditioning
• You have a prior for the joint distribution of weather and temperature

  Possible world  Weather  Temperature  µ(w)
  w1              sunny    hot          0.10
  w2              sunny    mild         0.20
  w3              sunny    cold         0.10
  w4              cloudy   hot          0.05
  w5              cloudy   mild         0.35
  w6              cloudy   cold         0.20

• Now, you look outside and see that it's sunny
  - You are now certain that you're in one of worlds w1, w2, or w3
• To get the conditional probability P(T|W=sunny)
  - renormalize µ(w1), µ(w2), µ(w3) to sum to 1
  - µ(w1) + µ(w2) + µ(w3) = 0.10 + 0.20 + 0.10 = 0.40

  T     P(T|W=sunny)
  hot   0.10/0.40 = 0.25
  mild  0.20/0.40 = 0.50
  cold  0.10/0.40 = 0.25

P(T | W=sunny) = P(T ∧ W=sunny) / P(W=sunny)
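The renormalization step can be sketched as follows. The helper `condition` and its arguments are illustrative names assumed for this example; it keeps the worlds consistent with the evidence and divides by their total mass.

```python
# Prior joint distribution P(Weather, Temperature).
joint = {
    ("sunny", "hot"): 0.10,
    ("sunny", "mild"): 0.20,
    ("sunny", "cold"): 0.10,
    ("cloudy", "hot"): 0.05,
    ("cloudy", "mild"): 0.35,
    ("cloudy", "cold"): 0.20,
}


def condition(joint, index, value):
    """Keep the worlds where the variable at `index` equals `value`, then renormalize."""
    consistent = {w: p for w, p in joint.items() if w[index] == value}
    p_evidence = sum(consistent.values())  # P(e), here P(W=sunny) = 0.40
    if p_evidence == 0:
        raise ValueError("evidence has probability 0; conditioning is undefined")
    return {w: p / p_evidence for w, p in consistent.items()}


posterior = condition(joint, index=0, value="sunny")
for (weather, temperature), p in posterior.items():
    print(temperature, round(p, 2))
```

Raising an error for zero-probability evidence is a design choice of this sketch: the ratio P(h ∧ e) / P(e) is not defined in that case.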
Conditional Probability
Definition (conditional probability)
The conditional probability of proposition h given evidence e is
  P(h | e) = P(h ∧ e) / P(e)
• P(e): Sum of probability for all worlds in which e is true
• P(h ∧ e): Sum of probability for all worlds in which both h and e are true
Conditional Probability among Random Variables
P(X | Y) = P(X, Y) / P(Y)
E.g., P(Temperature | Weather) = P(Temperature, Weather) / P(Weather)
It expresses the conditional probability of each possible value for X given each possible value for Y

              T = hot         T = cold
  W = sunny   P(hot|sunny)    P(cold|sunny)
  W = cloudy  P(hot|cloudy)   P(cold|cloudy)

Which of the following is true?
1. The probabilities in each row should sum to 1
2. The probabilities in each column should sum to 1
3. Both of the above
4. None of the above
It is crucial that you can answer this question. Think about it at home and let me know if you have questions next time
Inference by Enumeration
Great, we can compute arbitrary probabilities now!
• Given:
  - Prior joint probability distribution (JPD) on set of variables X
  - specific values e for the evidence variables E (subset of X)
• We want to compute:
  - posterior joint distribution of query variables Y (a subset of X) given evidence e
• Step 1: Condition to get distribution P(X|e)
• Step 2: Marginalize to get distribution P(Y|e)
Inference by Enumeration: example
• Given P(W, C, T) as JPD below, and evidence e: "Wind=yes"
• What is the probability that it is hot? I.e., P(Temperature=hot | Wind=yes)
• Step 1: condition to get distribution P(X|e)

  Windy W  Cloudy C  Temperature T  P(W, C, T)
  yes      no        hot            0.04
  yes      no        mild           0.09
  yes      no        cold           0.07
  yes      yes       hot            0.01
  yes      yes       mild           0.10
  yes      yes       cold           0.12
  no       no        hot            0.06
  no       no        mild           0.11
  no       no        cold           0.03
  no       yes       hot            0.04
  no       yes       mild           0.25
  no       yes       cold           0.08

P(W=yes) = 0.04 + 0.09 + 0.07 + 0.01 + 0.10 + 0.12 = 0.43

  Cloudy C  Temperature T  P(C, T | W=yes)
  no        hot            0.04/0.43 ≈ 0.10
  no        mild           0.09/0.43 ≈ 0.21
  no        cold           0.07/0.43 ≈ 0.16
  yes       hot            0.01/0.43 ≈ 0.02
  yes       mild           0.10/0.43 ≈ 0.23
  yes       cold           0.12/0.43 ≈ 0.28

(Entries are rounded so that the table sums to 1; 0.04/0.43 is 0.093 before rounding.)
Inference by Enumeration: example
• Given P(W, C, T) as JPD, and evidence e: "Wind=yes"
• What is the probability that it is hot? I.e., P(Temperature=hot | Wind=yes)
• Step 2: marginalize to get distribution P(Y|e)

  Cloudy C  Temperature T  P(C, T | W=yes)
  no        hot            0.10
  no        mild           0.21
  no        cold           0.16
  yes       hot            0.02
  yes       mild           0.23
  yes       cold           0.28

  Temperature T  P(T | W=yes)
  hot            0.10 + 0.02 = 0.12
  mild           0.21 + 0.23 = 0.44
  cold           0.16 + 0.28 = 0.44
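The two steps can be combined in a short sketch. The function name `enumerate_query` and the dictionary-based evidence format are assumptions for illustration; the sketch sums the entries consistent with the evidence per query value and then normalizes, which is equivalent to conditioning first and marginalizing second.

```python
from collections import defaultdict

VARIABLES = ("Wind", "Cloudy", "Temperature")

# JPD P(W, C, T) from the example table.
jpd = {
    ("yes", "no", "hot"): 0.04, ("yes", "no", "mild"): 0.09, ("yes", "no", "cold"): 0.07,
    ("yes", "yes", "hot"): 0.01, ("yes", "yes", "mild"): 0.10, ("yes", "yes", "cold"): 0.12,
    ("no", "no", "hot"): 0.06, ("no", "no", "mild"): 0.11, ("no", "no", "cold"): 0.03,
    ("no", "yes", "hot"): 0.04, ("no", "yes", "mild"): 0.25, ("no", "yes", "cold"): 0.08,
}


def enumerate_query(jpd, variables, query, evidence):
    """Return P(query | evidence) as a dict from query values to probabilities."""
    q = variables.index(query)
    totals = defaultdict(float)
    for world, p in jpd.items():
        # Skip worlds that contradict the evidence.
        if all(world[variables.index(var)] == val for var, val in evidence.items()):
            totals[world[q]] += p  # marginalize over the remaining variables
    p_evidence = sum(totals.values())  # P(e), here P(Wind=yes) = 0.43
    return {value: p / p_evidence for value, p in totals.items()}


posterior = enumerate_query(jpd, VARIABLES, "Temperature", {"Wind": "yes"})
for value, p in posterior.items():
    print(value, round(p, 3))
```

Because this version normalizes only once at the end, it avoids the intermediate rounding of the hand computation: the unrounded value of P(Temperature=hot | Wind=yes) is 0.05/0.43, about 0.116.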
Problems of Inference by Enumeration
• If we have n variables, and d is the size of the largest domain
• What is the space complexity to store the joint distribution?
  - We need to store the probability for each possible world
  - There are O(d^n) possible worlds, so the space complexity is O(d^n)
• How do we find the numbers for O(d^n) entries?
• Time complexity O(d^n)
  - In the worst case, we need to sum over all entries in the JPD
• We will look at an alternative way to perform inference, Bayesian networks
  - Formalism to exploit (conditional) independence between variables
• But first, let's look at a couple more definitions
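Product Rule
• By definition, we know that:
  P(f2 | f1) = P(f2 ∧ f1) / P(f1)
• We can rewrite this to
  P(f2 ∧ f1) = P(f2 | f1) P(f1)
• In general, the same rewriting can be applied repeatedly to a conjunction of more than two propositions (the chain rule)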
Chain Rule
  P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | X1, …, Xi-1)
Why does the chain rule help us?
We will see how, under specific circumstances (independence among variables), this rule helps gain compactness
• We can represent the JPD as a product of marginal distributions
• We can simplify some terms when the variables involved are independent or conditionally independent
Outline
• Uncertainty and Probability
• Marginal and Conditional Independence
• Bayesian networks
Marginal Independence
• Intuitively: if X and Y are marginally independent, then
  - learning that Y=y does not change your belief in X
  - and this is true for all values y that Y could take
• For example, weather is marginally independent from the result of a dice throw
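A minimal sketch of how marginal independence could be checked numerically on a joint table. It uses the standard equivalent condition P(X=x, Y=y) = P(X=x) P(Y=y) for all x and y; the helper names and the weather/die joint built here are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product


def marginals(joint):
    """Return the two marginal distributions of a joint over pairs (x, y)."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return px, py


def is_marginally_independent(joint, tol=1e-9):
    """True if P(x, y) equals P(x) * P(y) for every pair of values."""
    px, py = marginals(joint)
    return all(
        abs(joint.get((x, y), 0.0) - px[x] * py[y]) <= tol
        for x, y in product(list(px), list(py))
    )


# Weather and Temperature from the earlier table: not independent.
weather_temperature = {
    ("sunny", "hot"): 0.10, ("sunny", "mild"): 0.20, ("sunny", "cold"): 0.10,
    ("cloudy", "hot"): 0.05, ("cloudy", "mild"): 0.35, ("cloudy", "cold"): 0.20,
}

# Weather and a fair die, constructed as a product of marginals: independent.
p_weather = {"sunny": 0.4, "cloudy": 0.6}
weather_die = {(w, d): pw / 6 for w, pw in p_weather.items() for d in range(1, 7)}

print(is_marginally_independent(weather_temperature))  # False
print(is_marginally_independent(weather_die))          # True
```

For Weather and Temperature the check fails already on the first pair: P(sunny) P(hot) = 0.40 × 0.15 = 0.06, while the table gives P(sunny, hot) = 0.10.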
Examples for marginal independence
• Intuitively (without numbers):
  - Boolean random variable "Canucks win the Stanley Cup this season"
  - Numerical random variable "Canucks' revenue last season"
  - Are the two marginally independent?
No! Without revenue they cannot afford to keep their best players
Exploiting marginal independence
Exponentially fewer than the JPD!
Follow-up Example
• We said that "Canucks win the Stanley Cup this season" and "Canucks' revenue last season" are not marginally independent
• But they are conditionally independent given the Canucks line-up
  - Once we know who is playing, learning their revenue last year won't change our belief in their chances
Conditional Independence
• Intuitively: if X and Y are conditionally independent given Z, then
  - learning that Y=y does not change your belief in X when we already know Z=z
  - and this is true for all values y that Y could take and all values z that Z could take
Example for Conditional Independence
• Whether light l1 is lit and the position of switch s2 are not marginally independent
  - The position of the switch determines whether there is power in the wire w0 connected to the light
• However, whether light l1 is lit is conditionally independent from the position of switch s2 given whether there is power in wire w0
  - Once we know Power w0, learning values for any other variable will not change our beliefs about light l1
  - I.e., Lit l1 is independent of any other variable given Power w0
[Figure: network over the variables Up s2, Power w0 and Lit l1]
Exploiting Conditional Independence
• Recall the chain rule
Belief Networks
• Belief networks and their extensions are representation and reasoning (R&R) systems explicitly defined to exploit independence in probabilistic reasoning
Outline
• Uncertainty and Probability
• Marginal and Conditional Independence
• Bayesian networks
Bayesian Network Motivation
• We want a representation and reasoning system that is based on conditional (and marginal) independence
  - Compact yet expressive representation
  - Efficient reasoning procedures
• Bayesian (Belief) Networks are such a representation
  - Named after Thomas Bayes (ca. 1702–1761)
  - Term coined in 1985 by Judea Pearl (1936– )
  - Their invention changed the primary focus of AI from logic to probability!
[Photos: Thomas Bayes, Judea Pearl]
Pearl recently received the very prestigious ACM Turing Award for his contributions to Artificial Intelligence, and is going to give a DLS talk on November 8!
Belief (or Bayesian) networks
Def. A Belief network consists of
• a directed, acyclic graph (DAG) where each node is associated with a random variable Xi
• a domain for each variable Xi
• a set of conditional probability distributions for each node Xi given its parents Pa(Xi) in the graph: P(Xi | Pa(Xi))
• The parents Pa(Xi) of a variable Xi are those variables upon which Xi depends directly
• A Bayesian network is a compact representation of the JPD for a set of variables (X1, …, Xn):
  P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))
How to build a Bayesian network
• Define a total order over the random variables: (X1, …, Xn)
• If we apply the chain rule, we have
  P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | X1, …, Xi-1)
  where X1, …, Xi-1 are the predecessors of Xi in the total order defined over the variables
• For each Xi, select as parents in the Belief network a minimal set of its predecessors Pa(Xi) such that
  P(Xi | X1, …, Xi-1) = P(Xi | Pa(Xi))
  i.e., Xi is conditionally independent from all its other predecessors given Pa(Xi)
• Putting it all together, in a Belief network
  P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))
  This is a compact representation of the JPD: a factorization of the JPD based on existing conditional independencies among the variables
Example for BN construction: Fire Diagnosis
• You want to diagnose whether there is a fire in a building
• You receive a noisy report about whether everyone is leaving the building
• If everyone is leaving, this may have been caused by a fire alarm
• If there is a fire alarm, it may have been caused by a fire or by tampering
• If there is a fire, there may be smoke
Let's construct the Bayesian network for this
Example for BN construction: Fire Diagnosis
First you choose the variables. In this case, all are Boolean:
• Tampering is true when the alarm has been tampered with
• Fire is true when there is a fire
• Alarm is true when there is an alarm
• Smoke is true when there is smoke
• Leaving is true if there are lots of people leaving the building
• Report is true if the sensor reports that lots of people are leaving the building
Example for BN construction: Fire Diagnosis
• Next, define a total ordering of variables:
  - Let's say Fire (F), Tampering (T), Alarm (A), Smoke (S), Leaving (L), Report (R)
• The chain rule gives us:
  P(F,T,A,S,L,R) = P(F) P(T | F) P(A | F,T) P(S | F,T,A) P(L | F,T,A,S) P(R | F,T,A,S,L)
• Now choose the parents for each variable by evaluating conditional independencies
Fire Diagnosis Example
Starting from the chain-rule factorization
  P(F) P(T | F) P(A | F,T) P(S | F,T,A) P(L | F,T,A,S) P(R | F,T,A,S,L)
we choose the parents of each variable in turn:
• Fire (F) is the first variable in the ordering, X1. It does not have parents.
  Nodes so far: Fire
• Tampering (T) is independent of Fire (learning that one is true would not change your beliefs about the probability of the other)
  Nodes so far: Tampering, Fire
  P(F) P(T) P(A | F,T) P(S | F,T,A) P(L | F,T,A,S) P(R | F,T,A,S,L)
• Alarm (A) depends on both Fire and Tampering: it could be caused by either or both
  Nodes so far: Tampering, Fire, Alarm
  P(F) P(T) P(A | F,T) P(S | F,T,A) P(L | F,T,A,S) P(R | F,T,A,S,L)
• Smoke (S) is caused by Fire, and so is independent of Tampering and Alarm given whether there is a Fire
  Nodes so far: Tampering, Fire, Alarm, Smoke
  P(F) P(T) P(A | F,T) P(S | F) P(L | F,T,A,S) P(R | F,T,A,S,L)
• Leaving (L) is caused by Alarm, and thus is independent of the other variables given Alarm
  Nodes so far: Tampering, Fire, Alarm, Smoke, Leaving
  P(F) P(T) P(A | F,T) P(S | F) P(L | A) P(R | F,T,A,S,L)
• Report (R) is caused by Leaving, and thus is independent of the other variables given Leaving
  Nodes so far: Tampering, Fire, Alarm, Smoke, Leaving, Report
The result is the following Bnet and its corresponding, very compact factorization of the original JPD
[Figure: Tampering → Alarm ← Fire; Fire → Smoke; Alarm → Leaving; Leaving → Report]
  P(F) P(T) P(A | F,T) P(S | F) P(L | A) P(R | L)
Example for BN construction: Fire Diagnosis
• We are not done yet: we must specify the Conditional Probability Table (CPT) for each variable. All variables are Boolean.
• How many probabilities do we need to specify for this Bayesian network?
  - P(Tampering): 1 probability
  - P(Alarm | Tampering, Fire): 4 (independent) probabilities, 1 for each of the 4 instantiations of the parents
  - In total: 1+1+4+2+2+2 = 12 (compared to 2^6 − 1 = 63 for the full JPD!)
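The count above can be reproduced with a small sketch. The `parents` dictionary encodes the network structure derived in the construction steps; the function name `num_parameters` is an assumption for illustration, and the formula only holds when every variable is Boolean.

```python
# Parents of each node in the fire diagnosis network.
parents = {
    "Tampering": [],
    "Fire": [],
    "Alarm": ["Tampering", "Fire"],
    "Smoke": ["Fire"],
    "Leaving": ["Alarm"],
    "Report": ["Leaving"],
}


def num_parameters(parents):
    """One independent number per combination of parent values (Boolean variables only)."""
    return sum(2 ** len(node_parents) for node_parents in parents.values())


network_size = num_parameters(parents)   # 1 + 1 + 4 + 2 + 2 + 2
full_jpd_size = 2 ** len(parents) - 1    # 2^6 - 1 independent entries

print(network_size, full_jpd_size)  # 12 63
```

Passing the parent sets of a different structure to `num_parameters` gives a quick way to compare how compact two candidate networks are.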
Example for BN construction: Fire Diagnosis

  P(Tampering=t)  P(Tampering=f)
  0.02            0.98

We don't need to store P(Tampering=f) since probabilities sum to 1

  P(Tampering=t)
  0.02

  P(Fire=t)
  0.01

  Tampering T  Fire F  P(Alarm=t|T,F)  P(Alarm=f|T,F)
  t            t       0.5             0.5
  t            f       0.85            0.15
  f            t       0.99            0.01
  f            f       0.0001          0.9999

Each row of this table is a conditional probability distribution
We don't need to store P(Alarm=f|T,F) since probabilities sum to 1
Example for BN construction: Fire Diagnosis

  P(Tampering=t)
  0.02

  P(Fire=t)
  0.01

  Tampering T  Fire F  P(Alarm=t|T,F)
  t            t       0.5
  t            f       0.85
  f            t       0.99
  f            f       0.0001

  Fire F  P(Smoke=t|F)
  t       0.9
  f       0.01

  Alarm A  P(Leaving=t|A)
  t        0.88
  f        0.001

  Leaving L  P(Report=t|L)
  t          0.75
  f          0.01

In total: 1+1+4+2+2+2 = 12 (compared to 2^6 − 1 = 63 for the full JPD!)
Compactness
• A CPT for a Boolean variable Xi with k Boolean parents has 2^k rows for the combinations of parent values
• Each row requires one number p for Xi = true (the number for Xi = false is just 1−p)
• If each variable has no more than k parents, the complete network requires specifying O(n · 2^k) numbers
• For k << n, this is a substantial improvement:
  - the numbers required grow linearly with n, vs. O(2^n) for the full joint distribution
• E.g., if we have a Bnet with 30 Boolean variables, each with 5 parents
  - We need to specify 30 · 2^5 = 960 probabilities
  - But we need 2^30 (about 10^9) for the JPD
Realistic BNet: Liver Diagnosis Source: Onisko et al., 1999
Compactness
• What happens if the network is fully connected?
  - Or k ≈ n
  - Not much saving compared to the numbers needed to specify the full JPD
• Bnets are useful in sparse (or locally structured) domains
  - Domains in which each component interacts with (is related to) a small fraction of other components
• What if this is not the case in a domain we need to reason about?
  - We may need to make simplifying assumptions to reduce the dependencies in a domain
"Where do the numbers (CPTs) come from?"
From experts
• Tedious
• Costly
• Not always reliable
From data => Machine Learning
• There are algorithms to learn both structures and numbers (later in the course)
• Can be hard to get enough data
Still, usually better than specifying the full JPD
Example of probability computation
How do we compute, for instance,
  P(Tampering=t, Fire=f, Alarm=t, Smoke=f, Leaving=t, Report=t)
using the CPTs of the fire diagnosis network?
Example for BN construction: Fire Diagnosis
P(Tampering=t, Fire=f, Alarm=t, Smoke=f, Leaving=t, Report=t)
  = P(Tampering=t) × P(Fire=f) × P(Alarm=t | Tampering=t, Fire=f)
    × P(Smoke=f | Fire=f) × P(Leaving=t | Alarm=t) × P(Report=t | Leaving=t)
  = 0.02 × (1 − 0.01) × 0.85 × (1 − 0.01) × 0.88 × 0.75
  ≈ 0.011
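The factorization can be evaluated mechanically. The sketch below stores only the "true" column of each CPT, as on the slides; the variable names and the helper `bern` are assumptions for illustration, and the Report CPT is indexed by Leaving, following the factor P(R | L).

```python
# CPTs of the fire diagnosis network: probability that each variable is true.
p_tampering = 0.02
p_fire = 0.01
p_alarm = {  # keyed by (tampering, fire)
    (True, True): 0.5,
    (True, False): 0.85,
    (False, True): 0.99,
    (False, False): 0.0001,
}
p_smoke = {True: 0.9, False: 0.01}      # keyed by fire
p_leaving = {True: 0.88, False: 0.001}  # keyed by alarm
p_report = {True: 0.75, False: 0.01}    # keyed by leaving


def bern(p_true, value):
    """Probability of `value` for a Boolean variable with P(true) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(t, f, a, s, l, r):
    """P(T, F, A, S, L, R) = P(T) P(F) P(A | T, F) P(S | F) P(L | A) P(R | L)."""
    return (
        bern(p_tampering, t)
        * bern(p_fire, f)
        * bern(p_alarm[(t, f)], a)
        * bern(p_smoke[f], s)
        * bern(p_leaving[a], l)
        * bern(p_report[l], r)
    )


p = joint(t=True, f=False, a=True, s=False, l=True, r=True)
print(round(p, 6))  # 0.010997
```

Only six table lookups and five multiplications are needed, regardless of how many entries the full JPD would have.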
What if we use a different ordering?
• Say, we use the following order:
  - Leaving; Tampering; Report; Smoke; Alarm; Fire.
[Figure: network over Leaving, Tampering, Report, Smoke, Alarm and Fire obtained with this ordering]
• We end up with a completely different network structure!
• Which of the two structures is better (think computationally)?
Which Structure is Better?
• Non-causal network is less compact: 1+2+2+2+8+8 = 23 numbers needed
• Deciding on conditional independence is hard in non-causal directions
  - Causal models and conditional independence seem hardwired for humans!
• Specifying the conditional probabilities may be harder than in the causal direction
  - For instance, we have lost the direct dependency between alarm and one of its causes, which essentially describes the alarm's reliability (info often provided by the maker)
[Figure: non-causal network over Leaving, Report, Tampering, Smoke, Alarm and Fire]
Example contd.
• Other than that, our two Bnets for the Alarm problem are equivalent as long as they represent the same probability distribution
  P(T,F,A,S,L,R) = P(T) P(F) P(A | T,F) P(S | F) P(L | A) P(R | L)
                 = P(L) P(T | L) P(R | L) P(S | L) P(A | S,L,T) P(F | S,A,T)
  i.e., they are equivalent if the corresponding CPTs are specified so that they satisfy the equation above
[Figure: non-causal network over Leaving, Report, Tampering, Smoke, Alarm and Fire]
Are there wrong network structures?
• Given an order of variables, a network with arcs in excess of those required by the direct dependencies implied by that order is still OK
  - Just not as efficient
  P(L) P(T|L) P(R|L) P(S|L,R,T) P(A|S,L,T) P(F|S,A,T) = P(L) P(T|L) P(R|L) P(S|L) P(A|S,L,T) P(F|S,A,T)
• One extreme: the fully connected network is always correct but rarely the best choice
  - Essentially we are not leveraging conditional independencies to simplify the factorization of the JPD
  P(L) P(T|L) P(R|L,T) P(S|L,T,R) P(A|S,L,T,R) P(F|S,A,T,L,R)
[Figure: network over Leaving, Report, Tampering, Smoke, Alarm and Fire with excess arcs]
Are there wrong network structures?
• How can a network structure be wrong?
  - If it misses directed edges that are required
  - E.g., an edge is missing in the figure: Fire is not conditionally independent of Alarm | {Tampering, Smoke}
[Figure: network over Leaving, Report, Tampering, Smoke, Alarm and Fire with a missing edge]
But remember what we said a few slides back. Sometimes we may need to make simplifying assumptions (e.g., assume conditional independence when it does not actually hold) in order to reduce complexity
Deciding on Structure
• In general, the direction of a direct dependency can always be changed using Bayes' rule
• Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
• Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)
• Useful for assessing diagnostic probability from causal probability (or vice versa):
  - P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
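As a small illustration of turning a causal probability into a diagnostic one, the sketch below applies Bayes' rule to the Fire and Smoke numbers from the fire diagnosis CPTs. The function name and argument names are assumptions; P(Effect) is obtained by marginalizing over the cause, which is valid here because Fire is the only parent of Smoke.

```python
def diagnostic_from_causal(p_effect_given_cause, p_effect_given_no_cause, p_cause):
    """P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect) for a Boolean cause."""
    # P(Effect) by marginalizing over the two values of the cause.
    p_effect = (
        p_effect_given_cause * p_cause
        + p_effect_given_no_cause * (1.0 - p_cause)
    )
    return p_effect_given_cause * p_cause / p_effect


# P(Smoke=t | Fire=t) = 0.9, P(Smoke=t | Fire=f) = 0.01, P(Fire=t) = 0.01
p_fire_given_smoke = diagnostic_from_causal(0.9, 0.01, 0.01)
print(round(p_fire_given_smoke, 3))  # 0.476
```

Even though smoke is very likely when there is a fire, the low prior on fire keeps the diagnostic probability below one half.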
Structure (contd.)
• So the two simple Bnets below are equivalent as long as the CPTs are related via Bayes' rule
• Which structure to choose depends, among other things, on which CPT is easier to specify
[Figure: two networks over Burglar and Alarm with the arc in opposite directions]
  P(A | B) = P(B | A) P(A) / P(B)
Structure (contd.)
• CPTs for causal relationships represent knowledge of the mechanisms underlying the process of interest
  - e.g., how an alarm works, why a disease generates certain symptoms
• CPTs for diagnostic relations can be defined only based on past observations
• E.g., let m be meningitis, s be stiff neck:
  P(m|s) = P(s|m) P(m) / P(s)
  - P(s|m) can be defined based on medical knowledge of the workings of meningitis
  - P(m|s) requires statistics on how often the symptom of stiff neck appears in conjunction with meningitis
In Summary
• A Belief network is such that the JPD of the variables involved is defined as the product of the local conditional distributions:
  P(X1, …, Xn) = ∏_i P(Xi | X1, …, Xi-1) = ∏_i P(Xi | Pa(Xi))
• Once we know the JPD, we can answer any query about any subset of the variables
• Thus, a Belief network allows one to answer any query on any subset of the variables
Bayesian Networks: Types of Inference
• Diagnostic: people are leaving (L=t). P(F | L=t) = ?  [Network: Fire → Alarm → Leaving]
• Predictive: fire happens (F=t). P(L | F=t) = ?  [Network: Fire → Alarm → Leaving]
• Intercausal: alarm goes off (P(a) = 1.0) and there is tampering (T=t). P(F | A=t, T=t) = ?  [Network: Fire → Alarm ← Tampering]
• Mixed: people are leaving (L=t) and there is no fire (F=f). P(A | F=f, L=t) = ?  [Network: Fire → Alarm → Leaving]
We will use the same reasoning procedure for all of these types
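To make the point that one procedure covers all query types, here is a brute-force sketch that answers diagnostic, predictive and intercausal queries on the fire diagnosis network by enumerating the factored joint. This is plain enumeration, chosen for simplicity, not the more efficient variable elimination listed in the course map; the names `query`, `joint` and `bern` are assumptions for illustration.

```python
from itertools import product

VARS = ("T", "F", "A", "S", "L", "R")

# CPTs: probability that each variable is true, given its parents.
P_T = 0.02
P_F = 0.01
P_A = {(True, True): 0.5, (True, False): 0.85,
       (False, True): 0.99, (False, False): 0.0001}  # keyed by (T, F)
P_S = {True: 0.9, False: 0.01}     # keyed by F
P_L = {True: 0.88, False: 0.001}   # keyed by A
P_R = {True: 0.75, False: 0.01}    # keyed by L


def bern(p_true, value):
    """Probability of `value` for a Boolean variable with P(true) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(w):
    """Factored joint P(T) P(F) P(A | T, F) P(S | F) P(L | A) P(R | L) for world w."""
    return (bern(P_T, w["T"]) * bern(P_F, w["F"])
            * bern(P_A[(w["T"], w["F"])], w["A"])
            * bern(P_S[w["F"]], w["S"])
            * bern(P_L[w["A"]], w["L"])
            * bern(P_R[w["L"]], w["R"]))


def query(target, evidence):
    """P(target = true | evidence), summing the joint over all consistent worlds."""
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(VARS)):
        w = dict(zip(VARS, values))
        if any(w[var] != val for var, val in evidence.items()):
            continue  # world contradicts the evidence
        p = joint(w)
        denominator += p
        if w[target]:
            numerator += p
    return numerator / denominator


print("diagnostic  P(F | L=t)      =", round(query("F", {"L": True}), 3))
print("predictive  P(L | F=t)      =", round(query("L", {"F": True}), 3))
print("intercausal P(F | A=t, T=t) =", round(query("F", {"A": True, "T": True}), 3))
```

With these CPTs the intercausal query comes out far lower than P(F | A=t) alone would: knowing that the alarm was tampered with explains away the alarm as evidence for fire.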
X
Dependencies in a Bayesian Network
By construction, a node X is conditionally independent of its non-descendant nodes (e.g., Zij in the picture) given its parents. The gray area “blocks” probability propagation
Grey areas in the picture below represent evidence
• A node X is conditionally independent of all other nodes in the network given its Markov blanket (the gray area in the picture). It “blocks” probability propagation
• Note that node X is conditionally dependent on non-descendant nodes in its Markov blanket (e.g., its children’s parents, like Z1j) given their common descendants (e.g., Y1j).
• This allows, for instance, explaining away one cause (e.g., X) because of evidence of its effect (e.g., Y1) and of another potential cause (e.g., Z1j)
Dependencies: another way to see them
In 1, 2 and 3, X and Y are dependent
[Figure: three configurations, numbered 1 to 3, of nodes X, Z and Y, with evidence E]
• In 3, X and Y become dependent as soon as there is evidence on Z or on any of its descendants.
• This is because knowledge of one possible cause given evidence of the effect explains away the other cause
Or Conditional Independencies
Or, blocking paths for probability propagation. Three ways in which a path between X and Y can be blocked, given evidence E
[Figure: three configurations, numbered 1 to 3, of nodes X, Z and Y, with evidence E]
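The blocking rules can be turned into a small independence check. The sketch below is a simplified illustration, not a general-purpose algorithm: it enumerates every undirected path between two nodes and tests each intermediate node against the rules (a chain or common-cause node blocks when observed; a common-effect node blocks unless it or one of its descendants is observed). The example DAG is an assumed fire-style network, not the network from the slides' figure, and all function names are hypothetical.

```python
# Assumed example DAG, given as node -> list of parents.
PARENTS = {
    "Fire": [], "Tampering": [],
    "Alarm": ["Fire", "Tampering"],
    "Leaving": ["Alarm"],
}

def children(node):
    return [c for c, ps in PARENTS.items() if node in ps]

def descendants(node):
    found, stack = set(), [node]
    while stack:
        for c in children(stack.pop()):
            if c not in found:
                found.add(c)
                stack.append(c)
    return found

def simple_paths(x, y, path=None):
    """All undirected paths from x to y that never revisit a node."""
    path = (path or []) + [x]
    if x == y:
        yield path
        return
    for nxt in PARENTS[x] + children(x):
        if nxt not in path:
            yield from simple_paths(nxt, y, path)

def path_blocked(path, evidence):
    """True if some intermediate node on the path blocks it given the evidence."""
    for prev, z, nxt in zip(path, path[1:], path[2:]):
        common_effect = prev in PARENTS[z] and nxt in PARENTS[z]
        if common_effect:
            # Common effect: blocked unless Z or one of its descendants is observed.
            if z not in evidence and not (descendants(z) & evidence):
                return True
        elif z in evidence:
            # Chain or common cause: blocked when Z is observed.
            return True
    return False

def independent(x, y, evidence=()):
    """True if every path between x and y is blocked given the evidence."""
    evidence = set(evidence)
    return all(path_blocked(p, evidence) for p in simple_paths(x, y))

print(independent("Fire", "Tampering"))             # no evidence
print(independent("Fire", "Tampering", {"Alarm"}))  # evidence on the common effect
print(independent("Fire", "Leaving", {"Alarm"}))    # evidence on the chain node
```

Enumerating all paths is fine for networks of this size; larger networks would need a reachability-based procedure instead.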
Conditional Independence in a Bnet
Is H conditionally independent of E given I?
[Figure: example network, with the three configurations 1, 2, 3 repeated as a legend]
True: since we have no evidence on F, changes in the belief on G (caused by changes in the belief on H) do not affect the belief on E
(In)Dependencies in a Bnet
Is A conditionally independent of I given F?
[Figure: example network, with the three configurations 1, 2, 3 repeated as a legend]
False: since we have evidence on F, changes to the belief on G due to changes in I affect the belief on E, which in turn propagates to A via C and B
What if node C is not there?
(In)Dependencies in a Bnet
Is A conditionally independent of I given F?
[Figure: example network without node C, with the three configurations 1, 2, 3 repeated as a legend]
True: the lack of evidence on D blocks the path that goes through E, D, B
What does this mean in terms of choosing structure?
That you need to double check the appropriateness of the indirect dependencies/independencies generated by your chosen structures
Example: representing a domain for an intelligent system that acts as a tutor (aka Intelligent Tutoring System)
• Topics divided in sub-topics
• Student knowledge of a topic depends on student knowledge of its sub-topics
• We can never observe student knowledge directly, we can only observe it indirectly via student test answers
Two Ways of Representing Knowledge
[Figure: two alternative networks over the same nodes: Overall Proficiency, Topic 1, Sub-topic 1.1, Sub-topic 1.2, and Answer 1 to Answer 4]
Which one should I pick?
Two Ways of Representing Knowledge
[Figure: the same two networks over Overall Proficiency, Topic 1, Sub-topic 1.1, Sub-topic 1.2, and Answer 1 to Answer 4]
• In one network, a change in probability for a given node always propagates to its siblings, because we never get direct evidence on knowledge
• In the other, a change in probability for a given node does not propagate to its siblings, because we never get direct evidence on knowledge
Which one you choose depends on the domain you want to represent
Bayesian Networks - AISpace
Try the queries in the previous slides with the Belief Networks applet in AIspace:
• Load the Fire Diagnosis example (ex. 6.10 in textbook)
• Compare the probability of each query node before and after the new evidence is observed
  • Try first to predict how they should change using your understanding of the domain
• Make sure you explore and understand the various other example networks in the textbook and AIspace
  • Electrical Circuit example (textbook ex. 6.11)
  • Patient’s wheezing and coughing example (ex. 6.14). Not in the applet, but you can easily insert it
  • Several other examples in AIspace
Test your understanding of dependencies in a Bnet
Use the AIspace (http://www.aispace.org/mainApplets.shtml) applet for Belief and Decision networks (http://www.aispace.org/bayes/index.shtml)
• Load the “conditional independence quiz” network
• Go into “Solve” mode and select “Independence Quiz”
Learning Goals For Today’s Class
• Given a JPD
  • Marginalize over specific variables
  • Compute distributions over any subset of the variables
• Use inference by enumeration to compute joint posterior probability distributions over any subset of variables given evidence
• Define and use marginal and conditional independence
• Build a Bayesian Network for a given domain
• Compute the representational savings in terms of number of probabilities required
• Identify dependencies/independencies between nodes in a Bayesian Network