Computer Science CPSC 502 Uncertainty Probability and Bayesian Networks (Ch. 6)

Outline

• Uncertainty and Probability
• Marginal and Conditional Independence
• Bayesian networks

Where are we?

[Figure: course overview map. Axes: Environment (Deterministic, Stochastic) and Problem Type (Static, Sequential). For each problem type it lists a Representation and a Reasoning Technique. Labels shown: Constraint Satisfaction, Query, Planning; Vars + Constraints, Logics, STRIPS, Belief Nets, Decision Nets, Markov Processes; Search, Arc Consistency, Variable Elimination, Value Iteration.]

Done with Deterministic Environments

Where are we?

[Figure: the course overview map shown again.]

Second Part of the Course

Where Are We?

[Figure: the course overview map shown again.]

We’ll focus on Belief Nets

Two main sources of uncertainty (from Lecture 2)

• Sensing Uncertainty: The agent cannot fully observe a state of interest. For example:
  • Right now, how many people are in this building?
  • What disease does this patient have?
  • Where is the soccer player behind me?

• Effect Uncertainty: The agent cannot be certain about the effects of its actions. For example:
  • If I work hard, will I get an A?
  • Will this drug work for this patient?
  • Where will the ball go when I kick it?

Motivation for uncertainty

• To act in the real world, we almost always have to handle uncertainty (both effect and sensing uncertainty)
• Deterministic domains are an abstraction
  • Sometimes this abstraction enables more powerful inference
  • Now we don't make this abstraction anymore

• AI's main focus shifted from logic to probability in the 1980s
• The language of probability is very expressive and general
• New representations enable efficient reasoning
  • We will see some of these, in particular Bayesian networks
• Reasoning under uncertainty is part of the 'new' AI
• This is not a dichotomy: the framework for probability is logical!
• New frontier: combine logic and probability

Probability as a measure of uncertainty/ignorance

• Probability measures an agent's degree of belief in the truth of propositions about states of the world
• Belief in a proposition f can be measured in terms of a number between 0 and 1
  • this is the probability of f
  • E.g., P("roll of fair die came out as a 6") = 1/6 ≈ 16.7% = 0.167
  • P(f) = 0 means that f is believed to be definitely false
  • P(f) = 1 means f is believed to be definitely true
  • Using probabilities between 0 and 1 is purely a convention.

Probability Theory and Random Variables

• Probability Theory
  • system of logical axioms and formal operations for sound reasoning under uncertainty
• Basic element: random variable X
  • X is a variable like the ones we have seen in CSP/Planning/Logic, but the agent can be uncertain about the value of X
  • As usual, the domain of a random variable X, written dom(X), is the set of values X can take
• Types of variables
  • Boolean: e.g., Cancer (does the patient have cancer or not?)
  • Categorical: e.g., CancerType could be one of {breastCancer, lungCancer, skinMelanomas}
  • Numeric: e.g., Temperature (integer or real)
• We will focus on Boolean and categorical variables

Possible Worlds

• A possible world specifies an assignment to each random variable
• Example: weather in Vancouver, represented by random variables
  - Temperature: {hot, mild, cold}
  - Weather: {sunny, cloudy}
• There are 6 possible worlds:

        Weather  Temperature  µ(w)
    w1  sunny    hot          0.10
    w2  sunny    mild         0.20
    w3  sunny    cold         0.10
    w4  cloudy   hot          0.05
    w5  cloudy   mild         0.35
    w6  cloudy   cold         0.20

• w ╞ f means that proposition f is true in world w
• A probability measure µ(w) over possible worlds w assigns a nonnegative real number to each world such that
  - µ(w) sums to 1 over all possible worlds w (because for sure we are in one of these worlds!)
  - The probability of proposition f is defined by $P(f) = \sum_{w \models f} \mu(w)$, i.e., the sum of the probabilities of the worlds w in which f is true

Example

• What's the probability of it being cloudy or cold?
  • µ(w3) + µ(w4) + µ(w5) + µ(w6) = 0.10 + 0.05 + 0.35 + 0.20 = 0.70

        Weather  Temperature  µ(w)
    w1  sunny    hot          0.10
    w2  sunny    mild         0.20
    w3  sunny    cold         0.10
    w4  cloudy   hot          0.05
    w5  cloudy   mild         0.35
    w6  cloudy   cold         0.20

• Remember: the probability of proposition f is defined by $P(f) = \sum_{w \models f} \mu(w)$, the sum of the probabilities of the worlds w in which f is true
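The definition of P(f) can be sketched directly in Python. This is a minimal illustration, not part of the original slides: the dictionary `worlds` and the helper `prob` are names chosen here, and a proposition is represented (as an assumption) by a function that returns True in the worlds where it holds.

```python
# Possible worlds for (Weather, Temperature) with the measure µ(w) from the table above.
worlds = {
    ("sunny", "hot"): 0.10,
    ("sunny", "mild"): 0.20,
    ("sunny", "cold"): 0.10,
    ("cloudy", "hot"): 0.05,
    ("cloudy", "mild"): 0.35,
    ("cloudy", "cold"): 0.20,
}


def prob(proposition, worlds):
    """Sum µ(w) over the worlds w in which the proposition is true."""
    return sum(mu for w, mu in worlds.items() if proposition(*w))


# Proposition f: "it is cloudy or cold".
p_cloudy_or_cold = prob(lambda weather, temp: weather == "cloudy" or temp == "cold", worlds)
print(round(p_cloudy_or_cold, 2))  # 0.7
```

Any other proposition over Weather and Temperature can be evaluated the same way by passing a different function.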

Joint Probability Distribution

• Joint distribution over random variables X1, …, Xn:
  • a probability distribution over the joint random variable <X1, …, Xn> with domain dom(X1) × … × dom(Xn) (the Cartesian product)

• Think of a joint distribution over n variables as the n-dimensional table of the corresponding possible worlds
  • Each row corresponds to an assignment X1 = x1, …, Xn = xn and its probability P(X1 = x1, …, Xn = xn)

• E.g., {Weather, Temperature} example

    Weather  Temperature  µ(w)
    sunny    hot          0.10
    sunny    mild         0.20
    sunny    cold         0.10
    cloudy   hot          0.05
    cloudy   mild         0.35
    cloudy   cold         0.20

Definition (probability distribution)
A probability distribution P on a random variable X is a function dom(X) → [0,1] such that $\sum_{x \in dom(X)} P(X=x) = 1$

Marginalization

• Given the joint distribution, we can compute distributions over subsets of the variables through marginalization:

  $P(X=x) = \sum_{z \in dom(Z)} P(X=x, Z=z)$   (marginalization over Z)

• We also write this as $P(X) = \sum_{z \in dom(Z)} P(X, Z=z)$.
• Simply an application of the definition of probability measure! Remember: $P(f) = \sum_{w \models f} \mu(w)$
• This corresponds to summing out a dimension in the table.
• Does the new table still sum to 1? Yes. It must, since it's a probability distribution!

    Weather  Temperature  µ(w)
    sunny    hot          0.10
    sunny    mild         0.20
    sunny    cold         0.10
    cloudy   hot          0.05
    cloudy   mild         0.35
    cloudy   cold         0.20

    Temperature  P(Temperature)
    hot          0.15
    mild         0.55
    cold         0.30

P(Temperature=hot) = P(Weather=sunny, Temperature=hot) + P(Weather=cloudy, Temperature=hot) = 0.10 + 0.05 = 0.15
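Summing out a dimension of the table can be sketched as follows. This is an illustrative sketch only: the function `marginalize` and its `keep` parameter are names introduced here, and assignments are assumed to be stored as tuples in the order (Weather, Temperature).

```python
from collections import defaultdict

# Joint distribution P(Weather, Temperature) from the table above.
joint = {
    ("sunny", "hot"): 0.10,
    ("sunny", "mild"): 0.20,
    ("sunny", "cold"): 0.10,
    ("cloudy", "hot"): 0.05,
    ("cloudy", "mild"): 0.35,
    ("cloudy", "cold"): 0.20,
}


def marginalize(joint, keep):
    """Return the marginal over the tuple positions in `keep`, summing out the rest."""
    marginal = defaultdict(float)
    for assignment, p in joint.items():
        marginal[tuple(assignment[i] for i in keep)] += p
    return dict(marginal)


p_temperature = marginalize(joint, keep=[1])  # sum out Weather (position 0)
p_weather = marginalize(joint, keep=[0])      # sum out Temperature (position 1)

for (temperature,), p in p_temperature.items():
    print(temperature, round(p, 2))
```

The same function handles more than one summed-out variable, since every position not listed in `keep` is dropped from the key.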


Marginalization

• We can also marginalize over more than one variable at once

  $P(X=x) = \sum_{z_1 \in dom(Z_1)} \dots \sum_{z_n \in dom(Z_n)} P(X=x, Z_1=z_1, \dots, Z_n=z_n)$

  e.g., marginalization over Temperature and Wind:

    Wind  Weather  Temperature  µ(w)
    yes   sunny    hot          0.04
    yes   sunny    mild         0.09
    yes   sunny    cold         0.07
    yes   cloudy   hot          0.01
    yes   cloudy   mild         0.10
    yes   cloudy   cold         0.12
    no    sunny    hot          0.06
    no    sunny    mild         0.11
    no    sunny    cold         0.03
    no    cloudy   hot          0.04
    no    cloudy   mild         0.25
    no    cloudy   cold         0.08

    Weather  P(Weather)
    sunny    0.40
    cloudy   0.60

Marginalization

• We can also get marginals for more than one variable

  $P(X=x, Y=y) = \sum_{z_1 \in dom(Z_1)} \dots \sum_{z_n \in dom(Z_n)} P(X=x, Y=y, Z_1=z_1, \dots, Z_n=z_n)$

• Still simply an application of the definition of probability measure!
• E.g., summing out Wind from the joint distribution over Wind, Weather and Temperature:

    Weather  Temperature  P(Weather, Temperature)
    sunny    hot          0.04 + 0.06 = 0.10
    sunny    mild         0.09 + 0.11 = 0.20
    sunny    cold         0.07 + 0.03 = 0.10
    cloudy   hot          0.01 + 0.04 = 0.05
    cloudy   mild         0.10 + 0.25 = 0.35
    cloudy   cold         0.12 + 0.08 = 0.20


Conditioning

• Conditioning: revise beliefs based on new observations
• Build a probabilistic model (the joint probability distribution, JPD)
  • Take into account all background information
  • Called the prior probability distribution
  • Denote the prior probability for proposition h as P(h)
• Observe new information about the world
  • Call all information we received subsequently the evidence e
• Integrate the two sources of information to compute the conditional probability P(h|e)
  • This is also called the posterior probability of h given e.

Example for conditioning

• You have a prior for the joint distribution of weather and temperature

    Possible world  Weather  Temperature  µ(w)
    w1              sunny    hot          0.10
    w2              sunny    mild         0.20
    w3              sunny    cold         0.10
    w4              cloudy   hot          0.05
    w5              cloudy   mild         0.35
    w6              cloudy   cold         0.20

• Now, you look outside and see that it's sunny
  • You are now certain that you're in one of worlds w1, w2, or w3
• To get the conditional probability P(T|W=sunny)
  • renormalize µ(w1), µ(w2), µ(w3) to sum to 1
  • µ(w1) + µ(w2) + µ(w3) = 0.10 + 0.20 + 0.10 = 0.40

  $P(T \mid W=\text{sunny}) = \frac{P(T \wedge W=\text{sunny})}{P(W=\text{sunny})}$

    T     P(T|W=sunny)
    hot   0.10/0.40 = 0.25
    mild  0.20/0.40 = 0.50
    cold  0.10/0.40 = 0.25

Conditional Probability

Definition (conditional probability)
The conditional probability of proposition h given evidence e is

  $P(h \mid e) = \frac{P(h \wedge e)}{P(e)}$

• P(e): Sum of probability for all worlds in which e is true
• P(h ∧ e): Sum of probability for all worlds in which both h and e are true
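The definition translates into two sums over possible worlds. The sketch below is only an illustration under the same assumptions as before: propositions are Python functions over (weather, temperature), and `conditional` is a helper name introduced here, not something from the course material.

```python
# Possible worlds for (Weather, Temperature) with their measure µ(w).
worlds = {
    ("sunny", "hot"): 0.10,
    ("sunny", "mild"): 0.20,
    ("sunny", "cold"): 0.10,
    ("cloudy", "hot"): 0.05,
    ("cloudy", "mild"): 0.35,
    ("cloudy", "cold"): 0.20,
}


def conditional(h, e, worlds):
    """P(h | e) = P(h and e) / P(e), both computed by summing over worlds."""
    p_e = sum(mu for w, mu in worlds.items() if e(*w))
    if p_e == 0:
        raise ValueError("cannot condition on evidence with probability 0")
    p_h_and_e = sum(mu for w, mu in worlds.items() if h(*w) and e(*w))
    return p_h_and_e / p_e


def is_sunny(weather, temp):
    return weather == "sunny"


def is_hot(weather, temp):
    return temp == "hot"


p_hot_given_sunny = conditional(is_hot, is_sunny, worlds)
print(round(p_hot_given_sunny, 2))  # 0.25
```

Dividing by P(e) is exactly the renormalization step from the conditioning example: worlds inconsistent with the evidence are dropped and the remaining ones are rescaled to sum to 1.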

Conditional Probability among Random Variables

P(X | Y) = P(X, Y) / P(Y)

It expresses the conditional probability of each possible value for X given each possible value for Y

P(X | Y) = P(Temperature | Weather) = P(Temperature, Weather) / P(Weather)

                T = hot         T = cold
    W = sunny   P(hot|sunny)    P(cold|sunny)
    W = cloudy  P(hot|cloudy)   P(cold|cloudy)

Which of the following is true?

1. The probabilities in each row should sum to 1

2. The probabilities in each column should sum to 1

3. Both of the above

4. None of the above

Crucial that you can answer this question. Think about it at home and let me know if you have questions next time

Inference by Enumeration

Great, we can compute arbitrary probabilities now!

• Given:
  • Prior joint probability distribution (JPD) on a set of variables X
  • specific values e for the evidence variables E (subset of X)
• We want to compute:
  • posterior joint distribution of query variables Y (a subset of X) given evidence e
• Step 1: Condition to get distribution P(X|e)
• Step 2: Marginalize to get distribution P(Y|e)

Inference by Enumeration: example

• Given P(W,C,T) as JPD below, and evidence e: "Wind=yes"
• What is the probability that it is hot? I.e., P(Temperature=hot | Wind=yes)
• Step 1: condition to get distribution P(X|e)

    Windy W  Cloudy C  Temperature T  P(W, C, T)
    yes      no        hot            0.04
    yes      no        mild           0.09
    yes      no        cold           0.07
    yes      yes       hot            0.01
    yes      yes       mild           0.10
    yes      yes       cold           0.12
    no       no        hot            0.06
    no       no        mild           0.11
    no       no        cold           0.03
    no       yes       hot            0.04
    no       yes       mild           0.25
    no       yes       cold           0.08

• Step 1: condition to get distribution P(X|e)
  • Keep the rows of the JPD with W=yes and divide each by P(W=yes) = 0.04 + 0.09 + 0.07 + 0.01 + 0.10 + 0.12 = 0.43

    Cloudy C  Temperature T  P(C, T | W=yes)
    no        hot            0.04/0.43 ≈ 0.093
    no        mild           0.09/0.43 ≈ 0.209
    no        cold           0.07/0.43 ≈ 0.163
    yes       hot            0.01/0.43 ≈ 0.023
    yes       mild           0.10/0.43 ≈ 0.233
    yes       cold           0.12/0.43 ≈ 0.279

• Step 2: marginalize to get distribution P(Y|e)
  • Sum out Cloudy from the conditioned table

    Temperature T  P(T | W=yes)
    hot            0.093 + 0.023 = 0.116
    mild           0.209 + 0.233 = 0.442
    cold           0.163 + 0.279 = 0.442
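The two steps (condition, then marginalize) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the JPD is stored as a dictionary keyed by (Windy, Cloudy, Temperature) tuples, and `enumerate_query` is a hypothetical helper name, not an algorithm from the textbook.

```python
from collections import defaultdict

VARS = ("Windy", "Cloudy", "Temperature")

# JPD P(W, C, T) from the example above.
jpd = {
    ("yes", "no", "hot"): 0.04, ("yes", "no", "mild"): 0.09, ("yes", "no", "cold"): 0.07,
    ("yes", "yes", "hot"): 0.01, ("yes", "yes", "mild"): 0.10, ("yes", "yes", "cold"): 0.12,
    ("no", "no", "hot"): 0.06, ("no", "no", "mild"): 0.11, ("no", "no", "cold"): 0.03,
    ("no", "yes", "hot"): 0.04, ("no", "yes", "mild"): 0.25, ("no", "yes", "cold"): 0.08,
}


def enumerate_query(jpd, query_var, evidence):
    """Return P(query_var | evidence) as a dict from value to probability."""
    index = {name: i for i, name in enumerate(VARS)}
    # Step 1: condition. Keep only the worlds consistent with the evidence.
    consistent = {
        world: p for world, p in jpd.items()
        if all(world[index[var]] == value for var, value in evidence.items())
    }
    p_evidence = sum(consistent.values())
    # Step 2: marginalize. Sum the renormalized entries for each query value.
    posterior = defaultdict(float)
    for world, p in consistent.items():
        posterior[world[index[query_var]]] += p / p_evidence
    return dict(posterior)


p_temp_given_windy = enumerate_query(jpd, "Temperature", {"Windy": "yes"})
for value, p in p_temp_given_windy.items():
    print(value, round(p, 3))
```

Note that the loop touches every entry of the JPD, which is the cost discussed in the next slide.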

Problems of Inference by Enumeration

• If we have n variables, and d is the size of the largest domain
• What is the space complexity to store the joint distribution?
  • We need to store the probability for each possible world
  • There are O(d^n) possible worlds, so the space complexity is O(d^n)
  • How do we find the numbers for O(d^n) entries?
• Time complexity O(d^n)
  • In the worst case, need to sum over all entries in the JPD

• We will look at an alternative way to perform inference, Bayesian networks
  • Formalism to exploit (conditional) independence between variables
• But first, let's look at a couple more definitions

Product Rule

• By definition, we know that:

  $P(f_2 \mid f_1) = \frac{P(f_2 \wedge f_1)}{P(f_1)}$

• We can rewrite this to

  $P(f_2 \wedge f_1) = P(f_2 \mid f_1)\, P(f_1)$

• In general: [equation not preserved in the source]

Chain Rule

  $P(f_1 \wedge \dots \wedge f_n) = \prod_{i=1}^{n} P(f_i \mid f_1 \wedge \dots \wedge f_{i-1})$

Why does the chain rule help us?

We will see how, under specific circumstances (independence between variables), this rule helps gain compactness
• We can represent the JPD as a product of conditional distributions
• We can simplify some terms when the variables involved are independent or conditionally independent
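As a sanity check, the chain rule can be verified numerically on the Windy/Cloudy/Temperature JPD used earlier: P(W,C,T) = P(W) P(C|W) P(T|W,C) for every world. The sketch below is illustrative only; `p` and `chain_rule_product` are names introduced here, and each conditional is computed from the JPD as a ratio of sums.

```python
# JPD P(W, C, T), keyed by (Windy, Cloudy, Temperature).
jpd = {
    ("yes", "no", "hot"): 0.04, ("yes", "no", "mild"): 0.09, ("yes", "no", "cold"): 0.07,
    ("yes", "yes", "hot"): 0.01, ("yes", "yes", "mild"): 0.10, ("yes", "yes", "cold"): 0.12,
    ("no", "no", "hot"): 0.06, ("no", "no", "mild"): 0.11, ("no", "no", "cold"): 0.03,
    ("no", "yes", "hot"): 0.04, ("no", "yes", "mild"): 0.25, ("no", "yes", "cold"): 0.08,
}


def p(partial):
    """Probability of a partial assignment {position: value}, by summing the JPD."""
    return sum(pr for world, pr in jpd.items()
               if all(world[i] == value for i, value in partial.items()))


def chain_rule_product(world):
    """P(W) * P(C | W) * P(T | W, C), each factor derived from the JPD."""
    w, c, t = world
    p_w = p({0: w})
    p_c_given_w = p({0: w, 1: c}) / p({0: w})
    p_t_given_wc = p({0: w, 1: c, 2: t}) / p({0: w, 1: c})
    return p_w * p_c_given_w * p_t_given_wc


max_error = max(abs(chain_rule_product(world) - pr) for world, pr in jpd.items())
print(max_error < 1e-12)  # True
```

The identity holds for any JPD because the product telescopes; the savings only appear once some conditioning variables can be dropped thanks to independence.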

Marginal Independence

• Intuitively: if X and Y are marginally independent, then
  • learning that Y=y does not change your belief in X
  • and this is true for all values y that Y could take
• For example, weather is marginally independent of the result of a die throw
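A numerical check makes the intuition concrete. The sketch below uses the standard characterization that X and Y are marginally independent exactly when P(X=x, Y=y) = P(X=x) P(Y=y) for all values (equivalent to P(X|Y) = P(X) whenever P(Y=y) > 0). The function names are chosen here for illustration, and the weather/die joint is a hypothetical distribution built as a product of two marginals.

```python
from itertools import product

# P(Weather, Temperature) from the earlier slides.
weather_temperature = {
    ("sunny", "hot"): 0.10, ("sunny", "mild"): 0.20, ("sunny", "cold"): 0.10,
    ("cloudy", "hot"): 0.05, ("cloudy", "mild"): 0.35, ("cloudy", "cold"): 0.20,
}


def marginals(joint):
    """Return the two marginals P(X) and P(Y) of a joint over pairs (x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return p_x, p_y


def marginally_independent(joint, tol=1e-9):
    """True if P(X=x, Y=y) = P(X=x) P(Y=y) for every pair of values."""
    p_x, p_y = marginals(joint)
    return all(abs(joint.get((x, y), 0.0) - p_x[x] * p_y[y]) <= tol
               for x, y in product(p_x, p_y))


print(marginally_independent(weather_temperature))  # False

# Hypothetical joint of Weather and a fair die, built as a product of marginals.
weather = {"sunny": 0.4, "cloudy": 0.6}
die = {face: 1 / 6 for face in range(1, 7)}
weather_die = {(w, d): pw * pd for w, pw in weather.items() for d, pd in die.items()}
print(marginally_independent(weather_die))  # True
```

Weather and Temperature fail the test because, for instance, P(sunny, hot) = 0.10 while P(sunny) P(hot) = 0.40 × 0.15 = 0.06.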

Examples for marginal independence

• Intuitively (without numbers):
  • Boolean random variable "Canucks win the Stanley Cup this season"
  • Numerical random variable "Canucks' revenue last season"
  • Are the two marginally independent?

No! Without revenue they cannot afford to keep their best players

Exploiting marginal independence

Exponentially fewer than the JPD!

Follow-up Example

• We said that "Canucks win the Stanley Cup this season" and "Canucks' revenue last season" are not marginally independent
• But they are conditionally independent given the Canucks line-up
  • Once we know who is playing, learning their revenue last year won't change our belief in their chances

Conditional Independence

• Intuitively: if X and Y are conditionally independent given Z, then
  • learning that Y=y does not change your belief in X when we already know Z=z
  • and this is true for all values y that Y could take and all values z that Z could take

Example for Conditional Independence

• Whether light l1 is lit and the position of switch s2 are not marginally independent
  • The position of the switch determines whether there is power in the wire w0 connected to the light
• However, whether light l1 is lit is conditionally independent of the position of switch s2 given whether there is power in wire w0
  • Once we know Power w0, learning values for any other variable will not change our beliefs about light l1
  • I.e., Lit l1 is independent of any other variable given Power w0

[Figure: network over Up s2, Power w0 and Lit l1.]
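The switch/wire/light example can be checked numerically. The slides give no probabilities for it, so all numbers below are hypothetical, chosen only for illustration; the structure assumed is that the switch position influences power in the wire, which in turn influences the light. The helper names (`bern`, `p`, `lit_independent_of_up`) are also introduced here.

```python
from itertools import product

# Hypothetical numbers (not from the slides).
p_up = {True: 0.5, False: 0.5}                 # P(Up s2)
p_power_given_up = {True: 0.95, False: 0.05}   # P(Power w0 = true | Up s2)
p_lit_given_power = {True: 0.90, False: 0.0}   # P(Lit l1 = true | Power w0)


def bern(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true


# Joint over (up, power, lit), built from the three local distributions.
joint = {
    (u, w, l): p_up[u] * bern(p_power_given_up[u], w) * bern(p_lit_given_power[w], l)
    for u, w, l in product([True, False], repeat=3)
}
NAMES = ("up", "power", "lit")


def p(**fixed):
    """Probability of a partial assignment, e.g. p(lit=True, up=False)."""
    return sum(pr for world, pr in joint.items()
               if all(world[NAMES.index(k)] == v for k, v in fixed.items()))


def lit_independent_of_up(given_power, tol=1e-9):
    """Check P(lit | up, power) = P(lit | power), or P(lit | up) = P(lit)."""
    for u, w in product([True, False], repeat=2):
        if given_power:
            lhs = p(lit=True, up=u, power=w) / p(up=u, power=w)
            rhs = p(lit=True, power=w) / p(power=w)
        else:
            lhs = p(lit=True, up=u) / p(up=u)
            rhs = p(lit=True)
        if abs(lhs - rhs) > tol:
            return False
    return True


print(lit_independent_of_up(given_power=False))  # False: not marginally independent
print(lit_independent_of_up(given_power=True))   # True: independent given Power w0
```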

Exploiting Conditional Independence

• Recall the chain rule

Belief Networks

• Belief networks and their extensions are representation and reasoning (R&R) systems explicitly defined to exploit independence in probabilistic reasoning

Bayesian Network Motivation

• We want a representation and reasoning system that is based on conditional (and marginal) independence
  • Compact yet expressive representation
  • Efficient reasoning procedures

• Bayesian (Belief) Networks are such a representation
  • Named after Thomas Bayes (ca. 1702–1761)
  • Term coined in 1985 by Judea Pearl (1936– )
  • Their invention changed the primary focus of AI from logic to probability!

[Photos: Thomas Bayes, Judea Pearl]

Pearl recently received the very prestigious ACM Turing Award for his contributions to Artificial Intelligence! And he is going to give a DLS talk on November 8!

http://forwardthinking.pcmag.com/show-reports/295428-acm-awards-judea-pearl-the-turing-award-for-work-on-artificial-intelligence

https://www.cs.ubc.ca/event/2012/11/dls-talk-judea-pearl-ucla

Belief (or Bayesian) networks

Def. A Belief network consists of
• a directed, acyclic graph (DAG) where each node is associated with a random variable Xi
• a domain for each variable Xi
• a set of conditional probability distributions for each node Xi given its parents Pa(Xi) in the graph: P(Xi | Pa(Xi))

• The parents Pa(Xi) of a variable Xi are those variables upon which Xi depends directly
• A Bayesian network is a compact representation of the JPD for a set of variables (X1, …, Xn):

  $P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$

How to build a Bayesian network

• Define a total order over the random variables: (X1, …, Xn)
• If we apply the chain rule, we have

  $P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \dots, X_{i-1})$

  where X1, …, Xi-1 are the predecessors of Xi in the total order defined over the variables
• For each Xi, select as parents in the Belief network a minimal set of its predecessors Pa(Xi) such that

  $P(X_i \mid X_1, \dots, X_{i-1}) = P(X_i \mid Pa(X_i))$

  i.e., Xi is conditionally independent of all its other predecessors given Pa(Xi)
• Putting it all together, in a Belief network

  $P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$

  This is a compact representation of the JPD: a factorization of the JPD based on existing conditional independencies among the variables

Example for BN construction: Fire Diagnosis

• You want to diagnose whether there is a fire in a building
• You receive a noisy report about whether everyone is leaving the building
• If everyone is leaving, this may have been caused by a fire alarm
• If there is a fire alarm, it may have been caused by a fire or by tampering
• If there is a fire, there may be smoke

Let's construct the Bayesian network for this

First you choose the variables. In this case, all are Boolean:
• Tampering is true when the alarm has been tampered with
• Fire is true when there is a fire
• Alarm is true when there is an alarm
• Smoke is true when there is smoke
• Leaving is true if there are lots of people leaving the building
• Report is true if the sensor reports that lots of people are leaving the building

Example for BN construction: Fire Diagnosis

• Next, define a total ordering of variables:
  • Let's say Fire (F), Tampering (T), Alarm (A), Smoke (S), Leaving (L), Report (R)
• The chain rule gives us:

  P(F,T,A,S,L,R) = P(F) P(T | F) P(A | F,T) P(S | F,T,A) P(L | F,T,A,S) P(R | F,T,A,S,L)

• Now choose the parents for each variable by evaluating conditional independencies

Fire Diagnosis Example

• Fire (F) is the first variable in the ordering, X1. It does not have parents.

  Network so far: Fire

• Tampering (T) is independent of Fire (learning that one is true would not change your beliefs about the probability of the other)

  Network so far: Tampering, Fire (no arc between them)
  P(F) P(T) P(A | F,T) P(S | F,T,A) P(L | F,T,A,S) P(R | F,T,A,S,L)

• Alarm (A) depends on both Fire and Tampering: it could be caused by either or both

  New arcs: Tampering → Alarm, Fire → Alarm

• Smoke (S) is caused by Fire, and so is independent of Tampering and Alarm given whether there is a Fire

  New arc: Fire → Smoke
  P(F) P(T) P(A | F,T) P(S | F) P(L | F,T,A,S) P(R | F,T,A,S,L)

• Leaving (L) is caused by Alarm, and thus is independent of the other variables given Alarm

  New arc: Alarm → Leaving
  P(F) P(T) P(A | F,T) P(S | F) P(L | A) P(R | F,T,A,S,L)

• Report (R) is caused by Leaving, and thus is independent of the other variables given Leaving

  New arc: Leaving → Report

The result is the following Bnet and its corresponding, very compact factorization of the original JPD

  Arcs: Tampering → Alarm, Fire → Alarm, Fire → Smoke, Alarm → Leaving, Leaving → Report
  P(F) P(T) P(A | F,T) P(S | F) P(L | A) P(R | L)


• We are not done yet: must specify the Conditional Probability Table (CPT) for each variable. All variables are Boolean.
• How many probabilities do we need to specify for this Bayesian network?
  • P(Tampering): 1 probability
  • P(Alarm|Tampering, Fire): 4 (independent) probabilities, 1 for each of the 4 instantiations of the parents
  • In total: 1+1+4+2+2+2 = 12 (compared to 2^6 − 1 = 63 for the full JPD!)


Example for BN construction: Fire Diagnosis

    P(Tampering=t)  P(Tampering=f)
    0.02            0.98

We don't need to store P(Tampering=f) since probabilities sum to 1

    P(Tampering=t) = 0.02
    P(Fire=t) = 0.01

    Tampering T  Fire F  P(Alarm=t|T,F)  P(Alarm=f|T,F)
    t            t       0.5             0.5
    t            f       0.85            0.15
    f            t       0.99            0.01
    f            f       0.0001          0.9999

Each row of this table is a conditional probability distribution. We don't need to store P(Alarm=f|T,F) since probabilities sum to 1

    Fire F  P(Smoke=t|F)
    t       0.9
    f       0.01

    Alarm A  P(Leaving=t|A)
    t        0.88
    f        0.001

    Leaving L  P(Report=t|L)
    t          0.75
    f          0.01

In total: 1+1+4+2+2+2 = 12 (compared to 2^6 − 1 = 63 for the full JPD!)

Compactness

• A CPT for a Boolean variable Xi with k Boolean parents has 2^k rows for the combinations of parent values
• Each row requires one number p for Xi = true (the number for Xi = false is just 1-p)
• If each variable has no more than k parents, the complete network requires us to specify O(n · 2^k) numbers
• For k << n, this is a substantial improvement
  • the numbers required grow linearly with n, vs. O(2^n) for the full joint distribution
• E.g., if we have a Bnet with 30 Boolean variables, each with 5 parents
  • Need to specify 30 · 2^5 = 960 probabilities
  • But we need 2^30 for the JPD
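The counting argument is easy to reproduce. The sketch below assumes all variables are Boolean and represents a network simply as a dictionary from each variable to its list of parents; `cpt_parameters` is a helper name introduced here. The Fire Diagnosis structure is the one built in the earlier slides.

```python
def cpt_parameters(parents):
    """Numbers needed for a network of Boolean variables:
    one per instantiation of the parents, i.e. 2**k for a node with k parents."""
    return sum(2 ** len(pa) for pa in parents.values())


# Structure of the Fire Diagnosis network (variable -> parents).
fire_network = {
    "Tampering": [],
    "Fire": [],
    "Alarm": ["Tampering", "Fire"],
    "Smoke": ["Fire"],
    "Leaving": ["Alarm"],
    "Report": ["Leaving"],
}

print(cpt_parameters(fire_network))  # 12
print(2 ** len(fire_network) - 1)    # 63 independent numbers for the full JPD

# The 30-variable example: 5 parents each vs. the full joint distribution.
n, k = 30, 5
print(n * 2 ** k)  # 960
print(2 ** n)      # 1073741824
```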

Realistic BNet: Liver Diagnosis Source: Onisko et al., 1999

Compactness

• What happens if the network is fully connected?
  • Or k ≈ n
  • Not much saving compared to the numbers needed to specify the full JPD
• Bnets are useful in sparse (or locally structured) domains
  • Domains in which each component interacts with (is related to) a small fraction of other components
• What if this is not the case in a domain we need to reason about?
  • May need to make simplifying assumptions to reduce the dependencies in a domain

“Where do the numbers (CPTs) come from?”

From experts
• Tedious
• Costly
• Not always reliable

From data => Machine Learning
• There are algorithms to learn both structures and numbers (later in the course)
• Can be hard to get enough data

Still, usually better than specifying the full JPD

Example of probability computation

How do we compute, for instance, P(Tampering=t, Fire=f, Alarm=t, Smoke=f, Leaving=t, Report=t)?

Using the factorization of the network and the CPTs above:

P(Tampering=t, Fire=f, Alarm=t, Smoke=f, Leaving=t, Report=t)
  = P(Tampering=t) × P(Fire=f) × P(Alarm=t | Tampering=t, Fire=f) × P(Smoke=f | Fire=f) × P(Leaving=t | Alarm=t) × P(Report=t | Leaving=t)
  = 0.02 × 0.99 × 0.85 × 0.99 × 0.88 × 0.75 ≈ 0.011
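The same computation can be sketched in Python by multiplying one CPT entry per variable. The variable names and the `bern` helper are illustrative choices made here; the numbers are the CPT values from the Fire Diagnosis slides.

```python
from itertools import product

# CPTs of the Fire Diagnosis network: probability that each variable is true.
p_tampering = 0.02
p_fire = 0.01
p_alarm = {  # keyed by (tampering, fire)
    (True, True): 0.5, (True, False): 0.85,
    (False, True): 0.99, (False, False): 0.0001,
}
p_smoke = {True: 0.9, False: 0.01}      # keyed by fire
p_leaving = {True: 0.88, False: 0.001}  # keyed by alarm
p_report = {True: 0.75, False: 0.01}    # keyed by leaving


def bern(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint(t, f, a, s, l, r):
    """P(T, F, A, S, L, R) = P(T) P(F) P(A|T,F) P(S|F) P(L|A) P(R|L)."""
    return (bern(p_tampering, t) * bern(p_fire, f) * bern(p_alarm[(t, f)], a)
            * bern(p_smoke[f], s) * bern(p_leaving[a], l) * bern(p_report[l], r))


value = joint(t=True, f=False, a=True, s=False, l=True, r=True)
print(round(value, 4))  # 0.011

# Sanity check: the 64 joint entries defined by the network sum to 1.
total = sum(joint(*world) for world in product([True, False], repeat=6))
```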

What if we use a different ordering?

• Say, we use the following order:
  • Leaving; Tampering; Report; Smoke; Alarm; Fire.

[Figure: the network obtained with this ordering, over Leaving, Report, Tampering, Smoke, Alarm and Fire.]

• We end up with a completely different network structure!
• Which of the two structures is better (think computationally)?

Which Structure is Better?

[Figure: the non-causal network over Leaving, Report, Tampering, Smoke, Alarm and Fire.]

• Non-causal network is less compact: 1+2+2+2+8+8 = 23 numbers needed
• Deciding on conditional independence is hard in non-causal directions
  • Causal models and conditional independence seem hardwired for humans!
• Specifying the conditional probabilities may be harder than in the causal direction
  • For instance, we have lost the direct dependency between alarm and one of its causes, which essentially describes the alarm's reliability (info often provided by the maker)

Example contd.

• Other than that, our two Bnets for the Alarm problem are equivalent as long as they represent the same probability distribution

  P(T,F,A,S,L,R) = P(T) P(F) P(A | T,F) P(S | F) P(L | A) P(R | L)
                 = P(L) P(T|L) P(R|L) P(S|L) P(A|S,L,T) P(F|S,A,T)

  i.e., they are equivalent if the corresponding CPTs are specified so that they satisfy the equation above

Are there wrong network structures?

• Given an order of variables, a network with arcs in excess of those required by the direct dependencies implied by that order is still OK
  • Just not as efficient

  P(L) P(T|L) P(R|L) P(S|L,R,T) P(A|S,L,T) P(F|S,A,T) = P(L) P(T|L) P(R|L) P(S|L) P(A|S,L,T) P(F|S,A,T)

• One extreme: the fully connected network is always correct but rarely the best choice
  • Essentially we are not leveraging conditional independencies to simplify the factorization of the JPD

  P(L) P(T|L) P(R|L,T) P(S|L,T,R) P(A|S,L,T,R) P(F|S,A,T,L,R)

Are there wrong network structures?

• How can a network structure be wrong?
  • If it misses directed edges that are required
  • E.g., an edge is missing below: Fire is not conditionally independent of Alarm | {Tampering, Smoke}

[Figure: the non-causal network over Leaving, Report, Tampering, Smoke, Alarm and Fire, with a required edge missing.]

But remember what we said a few slides back. Sometimes we may need to make simplifying assumptions (e.g., assume conditional independence when it does not actually hold) in order to reduce complexity

Deciding on Structure

• In general, the direction of a direct dependency can always be changed using Bayes' rule
• Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
• Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)
• Useful for assessing diagnostic probability from causal probability (or vice versa):
  • P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)

Structure (contd.)

• So the two simple Bnets below are equivalent as long as the CPTs are related via Bayes' rule
• Which structure to choose depends, among other things, on which CPT is easier to specify

[Figure: two two-node networks, Burglar → Alarm and Alarm → Burglar.]

P(A | B) = P(B | A) P(A) / P(B)

Structure (contd.)

• CPTs for causal relationships represent knowledge of the mechanisms underlying the process of interest.
  • e.g., how an alarm works, why a disease generates certain symptoms
• CPTs for diagnostic relations can be defined only based on past observations.
• E.g., let m be meningitis, s be stiff neck:

  P(m|s) = P(s|m) P(m) / P(s)

  • P(s|m) can be defined based on medical knowledge on the workings of meningitis
  • P(m|s) requires statistics on how often the symptom of stiff neck appears in conjunction with meningitis.
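Bayes' rule can be checked numerically. The slides give no numbers for the meningitis example, so the sketch below reuses the Weather/Temperature joint from the earlier slides instead: it reverses P(hot | sunny) into P(sunny | hot) and compares the result with the value read directly off the joint. The helper `p` is a name introduced here for illustration.

```python
# P(Weather, Temperature) from the earlier slides.
joint = {
    ("sunny", "hot"): 0.10, ("sunny", "mild"): 0.20, ("sunny", "cold"): 0.10,
    ("cloudy", "hot"): 0.05, ("cloudy", "mild"): 0.35, ("cloudy", "cold"): 0.20,
}


def p(weather=None, temperature=None):
    """Probability of a (possibly partial) assignment, by summing the joint."""
    return sum(pr for (w, t), pr in joint.items()
               if (weather is None or w == weather)
               and (temperature is None or t == temperature))


p_hot_given_sunny = p("sunny", "hot") / p(weather="sunny")  # 0.10 / 0.40
p_sunny = p(weather="sunny")                                 # 0.40
p_hot = p(temperature="hot")                                 # 0.15

# Bayes' rule: P(sunny | hot) = P(hot | sunny) P(sunny) / P(hot)
p_sunny_given_hot = p_hot_given_sunny * p_sunny / p_hot

# The same quantity computed directly from the definition of conditional probability.
direct = p("sunny", "hot") / p_hot

print(round(p_sunny_given_hot, 3))  # 0.667
```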

In Summary

• A Belief network is such that the JPD of the variables involved is defined as the product of the local conditional distributions:

  $P(X_1, \dots, X_n) = \prod_i P(X_i \mid X_1, \dots, X_{i-1}) = \prod_i P(X_i \mid Pa(X_i))$

• Once we know the JPD, we can answer any query about any subset of the variables
• Thus, a Belief network allows one to answer any query on any subset of the variables

Bayesian Networks: Types of Inference

• Diagnostic: people are leaving (L=t). P(F|L=t) = ?
• Predictive: fire happens (F=t). P(L|F=t) = ?
• Intercausal: alarm goes off (A=t, i.e., P(a) = 1.0) and there is tampering (T=t). P(F|A=t,T=t) = ?
• Mixed: people are leaving (L=t) and there is no fire (F=f). P(A|F=f,L=t) = ?

[Figure: each inference type drawn on a fragment of the Fire Diagnosis network: Fire, Alarm, Leaving for the diagnostic, predictive and mixed cases; Fire, Alarm, Tampering for the intercausal case.]

We will use the same reasoning procedure for all of these types
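All four query types can be answered by brute-force enumeration over the joint distribution defined by the Fire Diagnosis network. The sketch below is not the reasoning procedure the course goes on to use (variable elimination appears in the course map); it is plain inference by enumeration, with helper names (`bern`, `joint`, `query`) chosen here and CPT values taken from the earlier slides.

```python
from itertools import product

VARS = ("T", "F", "A", "S", "L", "R")  # Tampering, Fire, Alarm, Smoke, Leaving, Report

P_T = 0.02
P_F = 0.01
P_A = {(True, True): 0.5, (True, False): 0.85,   # keyed by (T, F)
       (False, True): 0.99, (False, False): 0.0001}
P_S = {True: 0.9, False: 0.01}      # keyed by F
P_L = {True: 0.88, False: 0.001}    # keyed by A
P_R = {True: 0.75, False: 0.01}     # keyed by L


def bern(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint(w):
    """Joint probability of a complete world, using the network factorization."""
    return (bern(P_T, w["T"]) * bern(P_F, w["F"]) * bern(P_A[(w["T"], w["F"])], w["A"])
            * bern(P_S[w["F"]], w["S"]) * bern(P_L[w["A"]], w["L"]) * bern(P_R[w["L"]], w["R"]))


def query(var, evidence):
    """P(var = true | evidence), summing the joint over all consistent worlds."""
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(VARS)):
        world = dict(zip(VARS, values))
        if all(world[name] == value for name, value in evidence.items()):
            totals[world[var]] += joint(world)
    return totals[True] / (totals[True] + totals[False])


diagnostic = query("F", {"L": True})               # P(F=t | L=t)
predictive = query("L", {"F": True})               # P(L=t | F=t)
intercausal = query("F", {"A": True, "T": True})   # P(F=t | A=t, T=t)
mixed = query("A", {"F": False, "L": True})        # P(A=t | F=f, L=t)

for name, value in [("diagnostic", diagnostic), ("predictive", predictive),
                    ("intercausal", intercausal), ("mixed", mixed)]:
    print(name, round(value, 4))
```

Comparing `intercausal` with `query("F", {"A": True})` shows explaining away: once tampering is known, the alarm is much weaker evidence for a fire.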

Dependencies in a Bayesian Network

[Figure: a node X with its parents, children, and children's other parents. Grey areas in the picture represent evidence.]

• By construction, a node X is conditionally independent of its non-descendant nodes (e.g., Zij in the picture) given its parents. The gray area "blocks" probability propagation
• A node X is conditionally independent of all other nodes in the network given its Markov blanket (the gray area in the picture). It "blocks" probability propagation
• Note that node X is conditionally dependent on non-descendant nodes in its Markov blanket (e.g., its children's parents, like Z1j) given their common descendants (e.g., Y1j).
• This allows, for instance, explaining away one cause (e.g., X) because of evidence of its effect (e.g., Y1) and another potential cause (e.g., Z1j)

Dependencies: another way to see them

[Figure: three configurations (1, 2, 3) of a path between X and Y through a node Z, with evidence E.]

• In 1, 2 and 3, X and Y are dependent
• In 3, X and Y become dependent as soon as there is evidence on Z or on any of its descendants.
  • This is because knowledge of one possible cause given evidence of the effect explains away the other cause

Or Conditional Independencies

Or, blocking paths for probability propagation. Three ways in which a path from Y to X (or vice versa) can be blocked, given evidence E

[Figure: the same three configurations (1, 2, 3), with evidence E placed so that the path between X and Y is blocked.]

Conditional Independence in a Bnet

Is H conditionally independent of E given I?

[Figure: example network for this question, shown next to the three blocking configurations (1, 2, 3).]

True: Since we have no evidence on F, changes to the belief on G (caused by changes in the belief on H) do not affect belief on E

(In)Dependencies in a Bnet

Is A conditionally independent of I given F?

[Figure: example network for this question, shown next to the three blocking configurations (1, 2, 3).]

False: Since we have evidence on F, changes to the belief on G due to changes in I affect belief on E, which in turn propagates to A via C and B

What if node C is not there?

True: lack of evidence on D blocks the path that goes through E, D, B

What does this mean in terms of choosing structure?

That you need to double check the appropriateness of the indirect dependencies/independencies generated by your chosen structures

Example: representing a domain for an intelligent system that acts as a tutor (aka Intelligent Tutoring System)
• Topics divided in sub-topics
• Student knowledge of a topic depends on student knowledge of its sub-topics
• We can never observe student knowledge directly, we can only observe it indirectly via student test answers

Two Ways of Representing Knowledge

[Figure: two alternative network structures over the same nodes: Overall Proficiency, Topic 1, Sub-topic 1.1, Sub-topic 1.2, and Answer 1 to Answer 4.]

Which one should I pick?

• In one structure, a change in probability for a given node always propagates to its siblings, because we never get direct evidence on knowledge
• In the other, a change in probability for a given node does not propagate to its siblings, because we never get direct evidence on knowledge

Which one you want to choose depends on the domain you want to represent

Bayesian Networks - AIspace

Try the queries in the previous slides with the Belief Networks applet in AIspace
• Load the Fire Diagnosis example (ex. 6.10 in textbook)
• Compare the probability of each query node before and after the new evidence is observed
  • try first to predict how they should change using your understanding of the domain

• Make sure you explore and understand the various other example networks in the textbook and AIspace
  • Electrical Circuit example (textbook ex. 6.11)
  • Patient's wheezing and coughing example (ex. 6.14)
    • Not in applet, but you can easily insert it
  • Several other examples in AIspace

Test your understanding of dependencies in a Bnet

• Use the AIspace (http://www.aispace.org/mainApplets.shtml) applet for Belief and Decision networks (http://www.aispace.org/bayes/index.shtml)
• Load the "conditional independence quiz" network
• Go in "Solve" mode and select "Independence Quiz"

Learning Goals For Today's Class

• Given a JPD
  • Marginalize over specific variables
  • Compute distributions over any subset of the variables
• Use inference by enumeration
  • to compute joint posterior probability distributions over any subset of variables given evidence
• Define and use marginal and conditional independence
• Build a Bayesian Network for a given domain
• Compute the representational savings in terms of number of probabilities required
• Identify dependencies/independencies between nodes in a Bayesian Network
