CAP 4621 ARTIFICIAL INTELLIGENCE Reminder: [CAP4621] in your email subject line “Probabilistic Reasoning” Eakta Jain [email protected] With slides from: Stuart Russell, Hwee Tou Ng, Dan Klein, Pieter Abbeel
Transcript
Page 1:

CAP 4621 ARTIFICIAL INTELLIGENCE
Reminder: [CAP4621] in your email subject line

“Probabilistic Reasoning”

Eakta Jain

[email protected]

With slides from: Stuart Russell, Hwee Tou Ng, Dan Klein, Pieter Abbeel

Page 2:

Wake up! (2 minutes)

Page 3:

Module 1
• Objectives

– Interpret a given Bayesian network with respect to independence and conditional independence

– Construct a Bayesian network given a problem statement

Page 4:

Recap

• Probability models are a representation of our uncertain knowledge about the world

Page 5:

Recap

• Probability models are a representation of our uncertain knowledge about the world

• Takeaways from previous week

Page 6:

Recap

• Probability models are a representation of our uncertain knowledge about the world

• Takeaways from previous week
  – Examples we discussed

• Joint probability table of Weather, Toothache, Cavity, Catch

• Joint probability table of symptoms and diseases for an intelligent medical assistant

Page 7:

Recap

• Probability models are a representation of our uncertain knowledge about the world

• Takeaways from previous week
• Independence and conditional independence simplify our probabilistic representation of the world

Page 8:

Recap

• Probability models are a representation of our uncertain knowledge about the world

• Takeaways from previous week
• Independence and conditional independence simplify our probabilistic representation of the world
  – Example: Weather is independent of (Toothache, Catch, Cavity)

Page 9:

Recap

• Probability models are a representation of our uncertain knowledge about the world

• Takeaways from previous week
• Independence and conditional independence simplify our probabilistic representation of the world
  – Example: Weather is independent of (Toothache, Catch, Cavity)
  – This allows us to reduce the size of our joint probability table from 32 to 8 + 4 = 12 (an 8-entry table over Toothache, Catch, Cavity plus a 4-entry table for Weather)

Page 10:

Bayesian Networks

• Bayesian networks are a way to represent independence and conditional independence relationships

Page 11:

Bayesian Networks

• Bayesian networks are a way to represent independence and conditional independence relationships

• We will define the syntax and semantics of Bayesian networks

Page 12:

Bayesian Networks

• Bayesian networks are a way to represent independence and conditional independence relationships

• We will define the syntax and semantics of Bayesian networks

• We will discuss how probabilistic inference can be performed in practical situations

Page 13:

Bayesian network

• Also called belief network, graphical model, causal network, knowledge map

Page 14:

Bayesian network

• Also called belief network, graphical model, causal network, knowledge map

• A directed graph

Example: Topology of the network encodes conditional independence assertions.

[Figure: nodes Weather, Cavity, Toothache, Catch, with edges Cavity → Toothache and Cavity → Catch]

Weather is independent of the other variables.
Toothache and Catch are conditionally independent given Cavity.

(Chapter 14.1–3)

Page 15:

Bayesian network

• Also called belief network, graphical model, causal network, knowledge map

• A directed graph

Representation of our belief that Weather is independent of other three variables, Cavity causes Toothache and Catch


Page 16:

Bayesian network

• Also called belief network, graphical model, causal network, knowledge map

• A directed acyclic graph

Representation of our belief that Weather is independent of other three variables, Cavity causes Toothache and Catch


Page 17:

Bayesian network

• Also called belief network, graphical model, causal network, knowledge map

• A directed acyclic graph

Representation of the joint probability model


Page 18:

Representing the full joint distribution

Representation of the joint probability model
Abbreviate variables: W, C, T, H


Page 19:

Representing the full joint distribution

Representation of the joint probability model
Abbreviate variables: W, C, T, H


P (W,C, T,H) = P (W |C, T,H)P (C, T,H)

Page 20:

Representing the full joint distribution

Representation of the joint probability model
Abbreviate variables: W, C, T, H


P (W,C, T,H) = P (W |C, T,H)P (C, T,H)

Let’s expand the right hand side and simplify based on the probability rules we studied last time

Page 21:

Representing the full joint distribution

P (W,C, T,H) = P (W |C, T,H)P (C, T,H)

Page 22:

Representing the full joint distribution

P (W,C, T,H) = P (W |C, T,H)P (C, T,H)

P (W,C, T,H) = P (W )P (C, T,H)

Page 23:

Representing the full joint distribution

P (W,C, T,H) = P (W |C, T,H)P (C, T,H)

P (W,C, T,H) = P (W )P (C, T,H)

P (W,C, T,H) = P (W )P (T |C,H)P (C,H)

Page 24:

Representing the full joint distribution

P (W,C, T,H) = P (W |C, T,H)P (C, T,H)

P (W,C, T,H) = P (W )P (C, T,H)

P (W,C, T,H) = P (W )P (T |C,H)P (C,H)

P (W,C, T,H) = P (W )P (T |C)P (C,H)

Page 25:

Representing the full joint distribution

P (W,C, T,H) = P (W |C, T,H)P (C, T,H)

P (W,C, T,H) = P (W )P (C, T,H)

P (W,C, T,H) = P (W )P (T |C,H)P (C,H)

P (W,C, T,H) = P (W )P (T |C)P (C,H)

P (W,C, T,H) = P (W )P (T |C)P (H|C)P (C)

Page 26:

Bayesian network

Example


P (W,C, T,H) = P (W )P (T |C)P (H|C)P (C)

P(x1, x2, ..., xn) = ∏i P(xi | Parents(Xi))
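The product form above is what makes the representation compact: any entry of the full joint table is a product of one CPT entry per variable. A minimal Python sketch of this factorization for the Weather/Cavity/Toothache/Catch example (the CPT numbers below are made up for illustration; the slides do not give them):

    # Factored joint: P(W, C, T, H) = P(W) * P(T|C) * P(H|C) * P(C)
    # All numbers are illustrative placeholders, not values from the lecture.
    P_W = {"sunny": 0.6, "rain": 0.1, "cloudy": 0.29, "snow": 0.01}   # P(Weather)
    P_C = {True: 0.2, False: 0.8}                                     # P(Cavity)
    P_T_given_C = {True: {True: 0.6, False: 0.4},                     # P(Toothache | Cavity)
                   False: {True: 0.1, False: 0.9}}
    P_H_given_C = {True: {True: 0.9, False: 0.1},                     # P(Catch | Cavity)
                   False: {True: 0.2, False: 0.8}}

    def joint(w, c, t, h):
        """One entry of the full joint table, rebuilt from the small factored tables."""
        return P_W[w] * P_T_given_C[c][t] * P_H_given_C[c][h] * P_C[c]

    print(joint("sunny", True, True, False))   # P(W=sunny, C=true, T=true, H=false)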

Page 27:

Classroom Exercise

• Handout

Page 28:

Module 2
• Objectives

– Construct a Bayesian network given a problem statement

Page 29:

A method to construct Bayesian networks

• First, determine the set of variables that are required to model the domain.
• Order them {X1, X2, …, Xn}
  – Any order will work, but the resulting network will be more compact if variables are ordered with causes preceding effects
• Loop for i = 1 to n
  – For each node Xi, choose the minimal set of parents for Xi from the list X1, …, Xi−1
  – Insert a link from each parent to Xi
(A sketch of this loop appears below.)
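A minimal Python sketch of the loop above. The independence test is the domain-specific step, so it is passed in as a hypothetical oracle is_independent(Xi, others, parents); the toy oracle in the example simply encodes the known structure of the burglary network used later in the lecture.

    from itertools import combinations

    def build_network(ordered_vars, is_independent):
        """Choose a minimal parent set for each variable from its predecessors."""
        parents = {}
        for i, x in enumerate(ordered_vars):
            predecessors = ordered_vars[:i]
            chosen = predecessors                      # fallback: all predecessors
            for size in range(len(predecessors) + 1):  # try the smallest candidate sets first
                for candidate in combinations(predecessors, size):
                    rest = [p for p in predecessors if p not in candidate]
                    if is_independent(x, rest, list(candidate)):
                        chosen = list(candidate)
                        break
                else:
                    continue
                break
            parents[x] = chosen                        # link each parent to x
        return parents

    # Toy oracle: x is independent of the rest given `cand` iff cand covers x's true parents.
    true_parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
    oracle = lambda x, rest, cand: set(true_parents[x]) <= set(cand)
    print(build_network(["B", "E", "A", "J", "M"], oracle))
    # {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}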

Page 30:

Hybrid Bayesian Networks
• Many real world problems involve continuous quantities
• Example
  – Subsidy (is the govt subsidy scheme in operation?)
  – Buys (does the customer buy the fruit?)
  – Cost (price of fruit)
  – Harvest (quantity of fruit harvested)

(Excerpt from Chapter 14, Probabilistic Reasoning, p. 520:)

Figure 14.5 A simple network with discrete variables (Subsidy and Buys) and continuous variables (Harvest and Cost). [Edges: Subsidy → Cost, Harvest → Cost, Cost → Buys]

… possible values into a fixed set of intervals. For example, temperatures could be divided into (<0°C), (0°C–100°C), and (>100°C). Discretization is sometimes an adequate solution, but often results in a considerable loss of accuracy and very large CPTs. The most common solution is to define standard families of probability density functions (see Appendix A) that are specified by a finite number of parameters. For example, a Gaussian (or normal) distribution N(µ, σ²)(x) has the mean µ and the variance σ² as parameters. Yet another solution—sometimes called a nonparametric representation—is to define the conditional distribution implicitly with a collection of instances, each containing specific values of the parent and child variables. We explore this approach further in Chapter 18.

A network with both discrete and continuous variables is called a hybrid Bayesian network. To specify a hybrid network, we have to specify two new kinds of distributions: the conditional distribution for a continuous variable given discrete or continuous parents; and the conditional distribution for a discrete variable given continuous parents. Consider the simple example in Figure 14.5, in which a customer buys some fruit depending on its cost, which depends in turn on the size of the harvest and whether the government’s subsidy scheme is operating. The variable Cost is continuous and has continuous and discrete parents; the variable Buys is discrete and has a continuous parent.

For the Cost variable, we need to specify P(Cost | Harvest, Subsidy). The discrete parent is handled by enumeration—that is, by specifying both P(Cost | Harvest, subsidy) and P(Cost | Harvest, ¬subsidy). To handle Harvest, we specify how the distribution over the cost c depends on the continuous value h of Harvest. In other words, we specify the parameters of the cost distribution as a function of h. The most common choice is the linear Gaussian distribution, in which the child has a Gaussian distribution whose mean µ varies linearly with the value of the parent and whose standard deviation σ is fixed. We need two distributions, one for subsidy and one for ¬subsidy, with different parameters:

P(c | h, subsidy) = N(a_t h + b_t, σ_t²)(c) = (1 / (σ_t √(2π))) exp(−½ ((c − (a_t h + b_t)) / σ_t)²)

P(c | h, ¬subsidy) = N(a_f h + b_f, σ_f²)(c) = (1 / (σ_f √(2π))) exp(−½ ((c − (a_f h + b_f)) / σ_f)²)

For this example, then, the conditional distribution for Cost is specified by naming the linear Gaussian distribution and providing the parameters a_t, b_t, σ_t, a_f, b_f, and σ_f.
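A small Python sketch of the linear Gaussian CPT for Cost described above. The density follows the equations just given; the slope, intercept, and σ values are placeholder numbers, since the text only names the parameters a_t, b_t, σ_t, a_f, b_f, σ_f.

    import math

    def gaussian_pdf(x, mu, sigma):
        # N(mu, sigma^2)(x) = 1/(sigma*sqrt(2*pi)) * exp(-0.5*((x - mu)/sigma)^2)
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    # Placeholder (slope, intercept, sigma) per value of Subsidy; slopes are negative
    # because cost falls as the harvest (supply) grows.
    params = {True:  (-0.5, 5.0, 1.0),   # a_t, b_t, sigma_t  (subsidy)
              False: (-0.5, 8.0, 1.0)}   # a_f, b_f, sigma_f  (no subsidy)

    def p_cost(c, h, subsidy):
        # P(Cost = c | Harvest = h, Subsidy): a Gaussian whose mean is linear in h
        a, b, sigma = params[subsidy]
        return gaussian_pdf(c, a * h + b, sigma)

    print(p_cost(4.0, 6.0, True), p_cost(4.0, 6.0, False))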

Page 31:

Hybrid Bayesian Networks
• Many real world problems involve continuous quantities
• Example
  – Subsidy (is the govt subsidy scheme in operation?)
  – Buys (does the customer buy the fruit?)
  – Cost (price of fruit)
  – Harvest (quantity of fruit harvested)


Which ones are discrete/continuous?

Page 32:

Hybrid Bayesian Networks

• P(Cost|Harvest, Subsidy)

Page 33:

Hybrid Bayesian Networks

• P(Cost|Harvest, Subsidy)
• The discrete parent is handled by writing out
  P(Cost|Harvest, subsidy)
  P(Cost|Harvest, !subsidy)

Page 34:

Hybrid Bayesian Networks

• P(Cost|Harvest, Subsidy)
• The continuous parent is handled by the use of a linear Gaussian distribution

Page 35:

Hybrid Bayesian Networks
• P(Cost|Harvest, Subsidy)
• The continuous parent is handled by the use of a linear Gaussian distribution


(Excerpt from Appendix A, Mathematical background, p. 1058:)

The density function must be nonnegative for all x and must have

∫_{−∞}^{∞} P(x) dx = 1.

We can also define a cumulative probability density function F_X(x), which is the probability of a random variable being less than x:

F_X(x) = P(X ≤ x) = ∫_{−∞}^{x} P(u) du.

Note that the probability density function has units, whereas the discrete probability function is unitless. For example, if values of X are measured in seconds, then the density is measured in Hz (i.e., 1/sec). If values of X are points in three-dimensional space measured in meters, then density is measured in 1/m³.

One of the most important probability distributions is the Gaussian distribution, also known as the normal distribution. A Gaussian distribution with mean µ and standard deviation σ (and therefore variance σ²) is defined as

P(x) = (1 / (σ √(2π))) e^(−(x−µ)² / (2σ²)),

where x is a continuous variable ranging from −∞ to +∞. With mean µ = 0 and variance σ² = 1, we get the special case of the standard normal distribution. For a distribution over a vector x in n dimensions, there is the multivariate Gaussian distribution:

P(x) = (1 / √((2π)ⁿ |Σ|)) e^(−½ (x−µ)ᵀ Σ⁻¹ (x−µ)),

where µ is the mean vector and Σ is the covariance matrix (see below). In one dimension, we can define the cumulative distribution function F(x) as the probability that a random variable will be less than x. For the normal distribution, this is

F(x) = ∫_{−∞}^{x} P(z) dz = ½ (1 + erf((x − µ) / (σ √2))),

where erf(x) is the so-called error function, which has no closed-form representation.

The central limit theorem states that the distribution formed by sampling n independent random variables and taking their mean tends to a normal distribution as n tends to infinity. This holds for almost any collection of random variables, even if they are not strictly independent, unless the variance of any finite subset of variables dominates the others.

The expectation of a random variable, E(X), is the mean or average value, weighted by the probability of each value. For a discrete variable it is

E(X) = Σ_i x_i P(X = x_i).

For a continuous variable, replace the summation with an integral over the probability density function, P(x):

E(X) = ∫_{−∞}^{∞} x P(x) dx.
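A short Python check of these definitions (a sketch, not part of the slides): it evaluates the Gaussian density numerically and confirms that it integrates to about 1 and that its expectation is about µ.

    import math

    def normal_pdf(x, mu=0.0, sigma=1.0):
        # P(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    def integrate(f, lo, hi, n=100000):
        # crude midpoint-rule approximation of the integral of f over [lo, hi]
        dx = (hi - lo) / n
        return sum(f(lo + (i + 0.5) * dx) for i in range(n)) * dx

    mu, sigma = 2.0, 1.5
    total = integrate(lambda x: normal_pdf(x, mu, sigma), mu - 10 * sigma, mu + 10 * sigma)
    mean  = integrate(lambda x: x * normal_pdf(x, mu, sigma), mu - 10 * sigma, mu + 10 * sigma)
    print(total)   # ~1.0  (the density integrates to 1)
    print(mean)    # ~2.0  (E(X) = mu)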

Page 36:

Hybrid Bayesian Networks
• P(Cost|Harvest, Subsidy)
• The continuous parent is handled by the use of a linear Gaussian distribution


The mean value of cost increases linearly with harvest. (Remember, coefficients can be negative.)

Page 37:

Hybrid Bayesian Networks
• P(Cost|Harvest, Subsidy)
• The continuous parent is handled by the use of a linear Gaussian distribution


One set of coefficients for when subsidy is operational, and one set of coefficients for when it is not.

Page 38:

Hybrid Bayesian Networks
• P(Cost|Harvest, Subsidy)
• The continuous parent is handled by the use of a linear Gaussian distribution


(Excerpt from Section 14.3, Efficient Representation of Conditional Distributions, p. 521:)

Figure 14.6 The graphs in (a) and (b) show the probability distribution over Cost as a function of Harvest size, with Subsidy true and false, respectively. Graph (c) shows the distribution P(Cost | Harvest), obtained by summing over the two subsidy cases.

Figures 14.6(a) and (b) show these two relationships. Notice that in each case the slope is negative, because cost decreases as supply increases. (Of course, the assumption of linearity implies that the cost becomes negative at some point; the linear model is reasonable only if the harvest size is limited to a narrow range.) Figure 14.6(c) shows the distribution P(c | h), averaging over the two possible values of Subsidy and assuming that each has prior probability 0.5. This shows that even with very simple models, quite interesting distributions can be represented.

The linear Gaussian conditional distribution has some special properties. A network containing only continuous variables with linear Gaussian distributions has a joint distribution that is a multivariate Gaussian distribution (see Appendix A) over all the variables (Exercise 14.9). Furthermore, the posterior distribution given any evidence also has this property. (It follows that inference in linear Gaussian networks takes only O(n³) time in the worst case, regardless of the network topology. In Section 14.4, we see that inference for networks of discrete variables is NP-hard.) When discrete variables are added as parents (not as children) of continuous variables, the network defines a conditional Gaussian, or CG, distribution: given any assignment to the discrete variables, the distribution over the continuous variables is a multivariate Gaussian.

Now we turn to the distributions for discrete variables with continuous parents. Consider, for example, the Buys node in Figure 14.5. It seems reasonable to assume that the customer will buy if the cost is low and will not buy if it is high, and that the probability of buying varies smoothly in some intermediate region. In other words, the conditional distribution is like a “soft” threshold function. One way to make soft thresholds is to use the integral of the standard normal distribution:

Φ(x) = ∫_{−∞}^{x} N(0, 1)(x) dx.

Then the probability of Buys given Cost might be

P(buys | Cost = c) = Φ((−c + µ)/σ),

which means that the cost threshold occurs around µ, the width of the threshold region is proportional to σ, and the probability of buying decreases as cost increases.

Page 39:

Hybrid Bayesian Networks
• P(Buys|Cost)
• The continuous parent of a discrete variable

Page 40:

Hybrid Bayesian Networks
• P(Buys|Cost)
• The continuous parent of a discrete variable

(Excerpt from Chapter 14, p. 522:)

Figure 14.7 (a) A normal (Gaussian) distribution for the cost threshold, centered on µ = 6.0 with standard deviation σ = 1.0. (b) Logit and probit distributions for the probability of buys given cost, for the parameters µ = 6.0 and σ = 1.0.

This probit distribution (pronounced “pro-bit” and short for “probability unit”) is illustrated in Figure 14.7(a). The form can be justified by proposing that the underlying decision process has a hard threshold, but that the precise location of the threshold is subject to random Gaussian noise.

An alternative to the probit model is the logit distribution (pronounced “low-jit”). It uses the logistic function 1/(1 + e^(−x)) to produce a soft threshold:

P(buys | Cost = c) = 1 / (1 + exp(−2 (−c + µ) / σ)).

This is illustrated in Figure 14.7(b). The two distributions look similar, but the logit actually has much longer “tails.” The probit is often a better fit to real situations, but the logit is sometimes easier to deal with mathematically. It is used widely in neural networks (Chapter 20). Both probit and logit can be generalized to handle multiple continuous parents by taking a linear combination of the parent values.

14.4 EXACT INFERENCE IN BAYESIAN NETWORKS

The basic task for any probabilistic inference system is to compute the posterior probability distribution for a set of query variables, given some observed event—that is, some assignment of values to a set of evidence variables. To simplify the presentation, we will consider only one query variable at a time; the algorithms can easily be extended to queries with multiple variables. We will use the notation from Chapter 13: X denotes the query variable; E denotes the set of evidence variables E1, …, Em, and e is a particular observed event; Y denotes the nonevidence, nonquery variables Y1, …, Yl (called the hidden variables). Thus, the complete set of variables is X = {X} ∪ E ∪ Y. A typical query asks for the posterior probability distribution P(X | e).
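A short Python sketch of the probit and logit models for P(buys | Cost = c), using the Figure 14.7 parameters µ = 6.0 and σ = 1.0. Φ is computed with math.erf; both curves give probability about 0.5 at the threshold c = µ, and the probability of buying falls as cost rises.

    import math

    MU, SIGMA = 6.0, 1.0   # parameters from Figure 14.7

    def phi(x):
        # standard normal CDF Phi(x), via the error function
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def p_buys_probit(c, mu=MU, sigma=SIGMA):
        # P(buys | Cost = c) = Phi((-c + mu) / sigma)
        return phi((-c + mu) / sigma)

    def p_buys_logit(c, mu=MU, sigma=SIGMA):
        # P(buys | Cost = c) = 1 / (1 + exp(-2 * (-c + mu) / sigma))
        return 1.0 / (1.0 + math.exp(-2.0 * (-c + mu) / sigma))

    for c in (4.0, 6.0, 8.0):
        print(c, round(p_buys_probit(c), 3), round(p_buys_logit(c), 3))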

Page 41:

Hybrid Bayesian Networks
• P(Buys|Cost)
• The continuous parent of a discrete variable


Logistic function: used in neural networks

Page 42:

CAP 4621 ARTIFICIAL INTELLIGENCE
Reminder: [CAP4621] in your email subject line

“Probabilistic Reasoning”

Eakta Jain

[email protected]

With slides from: Stuart Russell, Hwee Tou Ng, Dan Klein, Pieter Abbeel

Page 43:

Wake up! (2 minutes)

Page 44:

Announcements

• Thank you for the mid-semester feedback

Page 45:

Announcements

• Common themes
  – Advantage of coming to class is staying on top of course content and not having things back up
  – Morning class – too early!

Page 46:

Announcements

• Suggestions from you that I will incorporate going forward
  – Hard to take notes → Lecture slides online, I will try to go slower on slides
  – Confused about how readings, homeworks fit in → I will make this clearer
  – Not enough time for group work → More time for group work, less follow along?

Page 47:

Announcements
• Things that you can do to improve experience in this class
  – “xyz is not clear”, “TA hours N/A” → Go to TA and instructor office hours, because the classroom is not an ideal environment for individualized attention

Page 48:

Announcements
• Things that you can do to improve experience in this class
  – “xyz is not clear”, “TA hours N/A” → Go to TA and instructor office hours, because the classroom is not an ideal environment for individualized attention
  – Confused about how readings, homeworks fit in
    • Revisit topics list and learning objectives in the handout for Week 1
    • Ask your peers (use Canvas, use the 10 minute break on Tuesdays)

Page 49:

Module 3
• Objectives

– Perform inference by enumeration to answer queries on Bayesian networks

Page 50:

What does “Inference” mean?

• Compute the posterior probability for a set of query variables given the values of some evidence variables

• Reading: Sec 14.4 of textbook

Page 51:

Example

Figure 14.2 A typical Bayesian network, showing both the topology and the conditional probability tables (CPTs). In the CPTs, the letters B, E, A, J, and M stand for Burglary, Earthquake, Alarm, JohnCalls, and MaryCalls, respectively.

[Topology: Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls]

P(B) = .001        P(E) = .002

P(A | B, E):
  B=t, E=t : .95
  B=t, E=f : .94
  B=f, E=t : .29
  B=f, E=f : .001

P(J | A):          P(M | A):
  A=t : .90          A=t : .70
  A=f : .05          A=f : .01
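One possible encoding of the Figure 14.2 CPTs as Python dictionaries (a sketch, not part of the slides); the inference-by-enumeration sketch at the end of this transcript reuses the same tables.

    # Each CPT maps an assignment of the parents to P(variable = true).
    parents = {"B": (), "E": (), "A": ("B", "E"), "J": ("A",), "M": ("A",)}
    cpt = {
        "B": {(): 0.001},
        "E": {(): 0.002},
        "A": {(True, True): 0.95, (True, False): 0.94,
              (False, True): 0.29, (False, False): 0.001},
        "J": {(True,): 0.90, (False,): 0.05},
        "M": {(True,): 0.70, (False,): 0.01},
    }

    def prob(var, value, assignment):
        # P(var = value | parents(var)) under a full assignment {name: bool}
        p_true = cpt[var][tuple(assignment[p] for p in parents[var])]
        return p_true if value else 1.0 - p_true

    # Example joint entry: P(j, m, a, !b, !e) = P(!b) P(!e) P(a|!b,!e) P(j|a) P(m|a)
    a = {"B": False, "E": False, "A": True, "J": True, "M": True}
    print(prob("B", False, a) * prob("E", False, a) * prob("A", True, a)
          * prob("J", True, a) * prob("M", True, a))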

Page 52:

Example


Evidence variable? Query variable?

Page 53:

Example


Evidence variable? Query variable?

Burglary is a variable, abbreviated B. It can take on two values: 0 or 1 (True or False). These are abbreviated as “b” or “!b”.

Page 54:

Example


Evidence variable? Query variable?

P(B) is a distribution. It refers to two numbers, P(b) and P(!b).

Page 55:

Example


Evidence variable? Query variable? Posterior probability

P(B|J,M) is the posterior probability distribution of the query B given the evidence variables J and M. It refers to a table which enumerates P(b) for every value that J and M can take.

Page 56:

Example


Evidence variable? Query variable? Posterior probability

Page 57:

Example


Evidence variable?Query variable? Posterior probability

Alarm is a hidden variable, i.e., a non-query, non-evidence variable.

Page 58

Example

[Figure 14.2: burglary network with CPTs, as above]

Evidence variable? Query variable? Posterior probability

Alarm is a hidden variable, i.e., a non-query, non-evidence variable.

When the query is on Burglary, then Earthquake is also a hidden variable.
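To make the query / evidence / hidden split concrete, here is a small sketch (my own helper, not from the slides) that derives the hidden variables from a chosen query variable and an evidence assignment.

# Hidden variables are whatever is neither the query nor part of the evidence.
ALL_VARS = ["Burglary", "Earthquake", "Alarm", "JohnCalls", "MaryCalls"]

def hidden_variables(query, evidence):
    return [v for v in ALL_VARS if v != query and v not in evidence]

print(hidden_variables("Burglary", {"JohnCalls": True, "MaryCalls": True}))
# -> ['Earthquake', 'Alarm']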

Page 59

Example

[Figure 14.2: burglary network with CPTs, as above]

Inference algorithms compute P(B|J,M). More generally, inference algorithms compute the distribution P(X|e).

Page 60

Inference by enumeration

• Goal: Compute P(X|e)

Page 61

Inference by enumeration

• Goal: Compute P(X|e)
• Recall (Equation 13.9): P(X | e) = α P(X, e) = α Σ_y P(X, e, y)

Excerpt from Section 14.4, Exact Inference in Bayesian Networks:

In the burglary network, we might observe the event in which JohnCalls = true and MaryCalls = true. We could then ask for, say, the probability that a burglary has occurred:

P(Burglary | JohnCalls = true, MaryCalls = true) = ⟨0.284, 0.716⟩.

In this section we discuss exact algorithms for computing posterior probabilities and will consider the complexity of this task. It turns out that the general case is intractable, so Section 14.5 covers methods for approximate inference.

14.4.1 Inference by enumeration

Chapter 13 explained that any conditional probability can be computed by summing terms from the full joint distribution. More specifically, a query P(X | e) can be answered using Equation (13.9), which we repeat here for convenience:

P(X | e) = α P(X, e) = α Σ_y P(X, e, y).

Now, a Bayesian network gives a complete representation of the full joint distribution. More specifically, Equation (14.2) shows that the terms P(x, e, y) in the joint distribution can be written as products of conditional probabilities from the network. Therefore, a query can be answered using a Bayesian network by computing sums of products of conditional probabilities from the network.

Consider the query P(Burglary | JohnCalls = true, MaryCalls = true). The hidden variables for this query are Earthquake and Alarm. From Equation (13.9), using initial letters for the variables to shorten the expressions, we have

P(B | j, m) = α P(B, j, m) = α Σ_e Σ_a P(B, j, m, e, a).

The semantics of Bayesian networks (Equation (14.2)) then gives us an expression in terms of CPT entries. For simplicity, we do this just for Burglary = true:

P(b | j, m) = α Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a).

To compute this expression, we have to add four terms, each computed by multiplying five numbers. In the worst case, where we have to sum out almost all the variables, the complexity of the algorithm for a network with n Boolean variables is O(n 2^n).

An improvement can be obtained from the following simple observations: the P(b) term is a constant and can be moved outside the summations over a and e, and the P(e) term can be moved outside the summation over a. Hence, we have

P(b | j, m) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) P(m | a).    (14.4)

This expression can be evaluated by looping through the variables in order, multiplying CPT entries as we go. For each summation, we also need to loop over the variable's possible values.

(Footnote: An expression such as Σ_e P(a, e) means to sum P(A = a, E = e) for all possible values of e. When E is Boolean, there is an ambiguity in that P(e) is used to mean both P(E = true) and P(E = e), but it should be clear from context which is intended; in the context of a sum, the latter is intended.)
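The following is a minimal sketch of inference by enumeration for this specific query, written from scratch rather than taken from the book's pseudocode: it forms each joint-probability term as a product of CPT entries, sums over all values of the hidden variables E and A, and then normalizes. The CPT values are the ones from Figure 14.2; the names and structure are my own.

from itertools import product

P_B, P_E = 0.001, 0.002                      # P(B = true), P(E = true)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    # P(b, e, a, j, m) as a product of CPT entries (the chain rule for this network).
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

def posterior_burglary(j, m):
    # P(Burglary | J = j, M = m) by summing out the hidden variables E and A.
    unnormalized = {}
    for b in (True, False):
        unnormalized[b] = sum(joint(b, e, a, j, m)
                              for e, a in product((True, False), repeat=2))
    alpha = 1.0 / sum(unnormalized.values())     # normalization constant
    return {b: alpha * p for b, p in unnormalized.items()}

print(posterior_burglary(True, True))            # roughly {True: 0.284, False: 0.716}

Summing over e and a with both calls fixed to true reproduces the ⟨0.284, 0.716⟩ result quoted above.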

Page 62

Inference by enumeration

• Goal: Compute P(X|e)
• Recall (Equation 13.9): P(X | e) = α P(X, e) = α Σ_y P(X, e, y)


Chapter 13, Equation 13.9. Revisit Section 13.3 if needed.

Page 63

Inference by enumeration

• Goal: Compute P(X|e)
• Recall (Equation 13.9): P(X | e) = α P(X, e) = α Σ_y P(X, e, y)
• For a Bayesian network, the joint probability is computed using the chain rule


P(x1, x2, ..., xn) = Π_i P(xi | Parents(Xi))
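As a sketch of what this chain-rule product looks like in code (my own representation: each variable maps to its parent list and a CPT giving P(var = true | parents)), the full joint probability of one complete assignment is just a product of looked-up entries.

# Network as {variable: (parents, CPT)}; CPT keys are tuples of parent values.
network = {
    "B": ([], {(): 0.001}),
    "E": ([], {(): 0.002}),
    "A": (["B", "E"], {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (["A"], {(True,): 0.90, (False,): 0.05}),
    "M": (["A"], {(True,): 0.70, (False,): 0.01}),
}

def joint_probability(assignment):
    # P(x1, ..., xn) = product over variables of P(xi | Parents(Xi)).
    prob = 1.0
    for var, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[p] for p in parents)]
        prob *= p_true if assignment[var] else 1.0 - p_true
    return prob

print(joint_probability({"B": False, "E": False, "A": True, "J": True, "M": True}))
# roughly 0.000628: P(!b) P(!e) P(a | !b, !e) P(j | a) P(m | a)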

Page 64

Example

• Query: P(Burglary | JohnCalls = true, MaryCalls = true)


Page 65

Example

• Query: P(Burglary | JohnCalls = true, MaryCalls = true)
• Hidden variables: Earthquake and Alarm


Page 66

Example

• Query: P(Burglary | JohnCalls = true, MaryCalls = true)
• Hidden variables: Earthquake and Alarm
• Shorthand notation: P(B | j, m) = α P(B, j, m) = α Σ_e Σ_a P(B, j, m, e, a)


Page 67

Example

• Start by using the product rule and replacing the denominator with a normalization constant: P(B | j, m) = α P(B, j, m)


Page 68

Example

• Start by using the product rule and replacing the denominator with a normalization constant: P(B | j, m) = α P(B, j, m)
• Then sum over the hidden variables: P(B | j, m) = α Σ_e Σ_a P(B, j, m, e, a)


Page 69

Example

• Start by using the product rule and replacing the denominator with a normalization constant
• Then sum over the hidden variables
• Use the chain rule: P(b | j, m) = α Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a)


Page 70

Example

• Start by using the product rule and replacing the denominator with a normalization constant
• Then sum over the hidden variables
• Use the chain rule: P(b | j, m) = α Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a)


Anyone confused about B vs b?

Page 71

Example

• How many terms are added in this formula? P(b | j, m) = α Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a)


Page 72

Example

• How many terms are added in this formula?


E = 0 or 1, A = 0 or 1, so the double summation adds 2^2 = 4 terms. In general: 2^k terms, for k Boolean hidden variables.
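A tiny sketch (my own) of where the 2^k count comes from: the double summation ranges over every joint assignment of the hidden Boolean variables, and the number of such assignments doubles with each additional hidden variable.

from itertools import product

hidden = ["E", "A"]                                  # k = 2 hidden Boolean variables
assignments = list(product((True, False), repeat=len(hidden)))
print(len(assignments))                              # 2**2 = 4 summand terms
# With k hidden Boolean variables there are 2**k terms, each a product of CPT entries.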

Page 73

Example

• Can we do better?


Page 74

Example

• Can we do better?

• Collect the repeated terms

Page 75

CAP 4621 ARTIFICIAL INTELLIGENCEReminder: [CAP4621] in your email subject line

“Probabilistic Reasoning”

Eakta Jain

[email protected]

With slides from: Stuart Russell, Hwee Tau Ng, Dan Klein, Pieter Abbeel

Page 76

Wake up! (2 minutes)

Page 77

Module 3

• Objectives

– Perform inference by enumeration to answer queries on Bayesian networks

Page 78

Finish up Classroom Exercise

• Answer for manual computation

• Recognize that inference by enumeration is a depth first traversal of an expression tree

Page 79

Algorithm to perform inference

function ENUMERATION-ASK(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayes net with variables {X} ∪ E ∪ Y   /* Y = hidden variables */

  Q(X) ← a distribution over X, initially empty
  for each value xi of X do
    Q(xi) ← ENUMERATE-ALL(bn.VARS, e_xi)
      where e_xi is e extended with X = xi
  return NORMALIZE(Q(X))

function ENUMERATE-ALL(vars, e) returns a real number
  if EMPTY?(vars) then return 1.0
  Y ← FIRST(vars)
  if Y has value y in e
    then return P(y | parents(Y)) × ENUMERATE-ALL(REST(vars), e)
    else return Σ_y P(y | parents(Y)) × ENUMERATE-ALL(REST(vars), e_y)
      where e_y is e extended with Y = y

Figure 14.9 The enumeration algorithm for answering queries on Bayesian networks.

function ELIMINATION-ASK(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network specifying joint distribution P(X1, ..., Xn)

  factors ← [ ]
  for each var in ORDER(bn.VARS) do
    factors ← [MAKE-FACTOR(var, e) | factors]
    if var is a hidden variable then factors ← SUM-OUT(var, factors)
  return NORMALIZE(POINTWISE-PRODUCT(factors))

Figure 14.10 The variable elimination algorithm for inference in Bayesian networks.
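To make Figure 14.9 concrete, here is a minimal Python sketch of the same recursion on the alarm network. The dictionary encoding of the network and the variable names are illustrative choices, not an API from the textbook or any library.

```python
# A sketch of ENUMERATION-ASK / ENUMERATE-ALL (Figure 14.9) on the alarm network.
# network maps each variable to (parents, CPT), where the CPT maps a tuple of
# parent values to P(var = True | parents).

network = {
    "B": ((), {(): 0.001}),
    "E": ((), {(): 0.002}),
    "A": (("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (("A",), {(True,): 0.90, (False,): 0.05}),
    "M": (("A",), {(True,): 0.70, (False,): 0.01}),
}
ORDER = ["B", "E", "A", "J", "M"]   # topological order: parents before children

def prob(var, value, evidence):
    """P(var = value | parents), reading parent values from the evidence dict."""
    parents, cpt = network[var]
    p_true = cpt[tuple(evidence[p] for p in parents)]
    return p_true if value else 1.0 - p_true

def enumerate_all(variables, evidence):
    if not variables:
        return 1.0
    Y, rest = variables[0], variables[1:]
    if Y in evidence:                      # Y is observed (or already assigned)
        return prob(Y, evidence[Y], evidence) * enumerate_all(rest, evidence)
    return sum(prob(Y, y, {**evidence, Y: y}) *      # sum over hidden variable Y
               enumerate_all(rest, {**evidence, Y: y})
               for y in (True, False))

def enumeration_ask(X, evidence):
    q = {x: enumerate_all(ORDER, {**evidence, X: x}) for x in (True, False)}
    alpha = 1.0 / sum(q.values())
    return {x: alpha * v for x, v in q.items()}

print(enumeration_ask("B", {"J": True, "M": True}))   # ≈ {True: 0.284, False: 0.716}
```

Note how the hidden-variable branch re-explores the J and M subtrees once for every combination of a and e; that repetition is exactly what the expression tree on the later slide shows, and what variable elimination (Figure 14.10) avoids by computing each factor once.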


Page 81

Classroom Exercise

• Open handout again
• Evaluate

• Step your way through the ENUMERATION-ASK algorithm to perform inference by enumeration

• Points of group discussion: what is the tree being traversed? Is this depth first or breadth first traversal?


Page 82

Expression tree

[Figure 14.8: The structure of the expression shown in Equation (14.4). The evaluation proceeds top down, multiplying values along each path and summing at the "+" nodes. Notice the repetition of the paths for j and m.]

Page 83

Practical Example

• Recommender systems
• Examples?

Page 84

Practical Example: Movie Recommendations!

• Learn the probability model P(U,S,C,V)
• U = user's profile, e.g., age, sex
• S = user's situation, e.g., location
• C = movie attributes, e.g., ??

Page 85

Practical Example: Movie Recommendations!

• Learn the probability model P(U,S,C,V)
• U = user's profile, e.g., age, sex
• S = user's situation, e.g., location
• C = movie attributes, e.g., genre, director, lead actor

Page 86

Practical Example: Movie Recommendations!

• Learn the probability model P(U,S,C,V)
• U = user's profile, e.g., age, sex
• S = user's situation, e.g., location
• C = movie attributes, e.g., genre, director, lead actor
• V = user's rating of a given movie

Page 87

Practical Example: Movie Recommendations!

• Learn the probability model P(U,S,C,V)

• Why? What are the types of inferences you can make with this model?

Page 88

Practical Example: Movie Recommendations!

• Learn the probability model P(U,S,C,V)

• Why? What are the types of inferences you can make with this model?
– Find the movies that a user is likely to rate highly, i.e., compute P(V|u,s,c) for the target user u, given situation s, and movie attributes c, and then recommend movies in decreasing order of P(V|u,s,c)
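As a concrete illustration of this first kind of query, the sketch below ranks candidate movies for one user by a learned P(V = high | u, s, c). The conditional table, the attribute values, and the movie titles are made-up placeholders, not numbers from any real model or from the cited study.

```python
# Minimal sketch: rank candidate movies for one user by P(V = "high" | u, s, c).
# The learned conditional distribution is represented here as a plain dict;
# all values below are illustrative placeholders.

P_high_given_usc = {
    # (age_band, sex, location, genre) -> P(V = "high" | u, s, c)
    ("18-24", "F", "urban", "comedy"): 0.62,
    ("18-24", "F", "urban", "drama"):  0.41,
    ("18-24", "F", "urban", "action"): 0.35,
}

candidate_movies = [
    {"title": "Movie A", "genre": "comedy"},
    {"title": "Movie B", "genre": "drama"},
    {"title": "Movie C", "genre": "action"},
]

def recommend(user, situation, movies):
    """Score each movie by P(high rating | u, s, c) and sort, best first."""
    scored = [(P_high_given_usc[(user["age"], user["sex"], situation, m["genre"])],
               m["title"])
              for m in movies]
    return sorted(scored, reverse=True)

user = {"age": "18-24", "sex": "F"}
for p, title in recommend(user, "urban", candidate_movies):
    print(f"{title}: P(high rating) = {p:.2f}")
```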

Page 89

Practical Example: Movie Recommendations!

• Learn the probability model P(U,S,C,V)

• Why? What are the types of inferences you can make with this model?
– Find the movie attributes that a user is likely to rate highly, i.e., compute P(C|u,s,v) for the target user u, given situation s, and rating v on a given movie, and then find a subset of attributes in decreasing order of P(C|u,s,v) and recommend movies with those attributes

Page 90

Practical Example: Movie Recommendations!

• Learn the probability model P(U,S,C,V)

• Why? What are the types of inferences you can make with this model?
– Find the users that are likely to rate a given movie highly, i.e., compute P(U|c,s,v) for the target movie's attributes c, given situation s, and rating v on that movie, and then send promotional materials to those users

Page 91

An approach to construct the Bayesian network

• Ono C, Kurokawa M, Motomura Y, Asoh H. A context-aware movie preference model using a Bayesian network for recommendation and promotion. In International Conference on User Modeling 2007 Jul 25 (pp. 247-257). Springer, Berlin, Heidelberg.

Page 92

An approach to construct the Bayesian network

• Assume a rough network structure based on recommendations by domain experts

• Estimate the conditional probabilities from data

Page 93

An approach to construct the Bayesian network

• Assume a rough network structure based on recommendations by domain experts

Age Sex Location Genre Director

Laugh Cry

Rating

Where would you put arrows?

Page 94

An approach to construct the Bayesian network

• Assume a rough network structure based on recommendations by domain experts

Age Sex Location Genre Director

Laugh Cry

Rating

P(rating|age,sex,location,genre,director) = ??
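A quick back-of-the-envelope check on the size of that table, with made-up cardinalities purely for illustration: conditioning Rating directly on all five attributes means storing one distribution per combination of parent values, so the CPT grows as the product of the parent cardinalities.

```python
# Hypothetical cardinalities for each parent of Rating; the values are
# illustrative, not taken from the paper.
cardinality = {"age": 6, "sex": 2, "location": 10, "genre": 12, "director": 50}

rows = 1
for var, k in cardinality.items():
    rows *= k
print(rows)   # 72000 parent combinations, each needing its own P(Rating | parents)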

Page 95

An approach to construct the Bayesian network

• Assume a rough network structure based on recommendations by domain experts

Age Sex Location Genre Director

Laugh Cry

Rating

Page 96

An approach to construct the Bayesian network

• Assume a rough network structure based on recommendations by domain experts

• Estimate the conditional probabilities from data


Page 98

An approach to construct the Bayesian network

• Assume a rough network structure based on recommendations by domain experts

• Estimate the conditional probabilities from data

• Ask users

Page 99

An approach to construct the Bayesian network

• Assume a rough network structure based on recommendations by domain experts

• Estimate the conditional probabilities from data

• Ask users
• Enumerate, and count!

Page 100

An approach to construct the Bayesian network

• Assume a rough network structure based on recommendations by domain experts

• Estimate the conditional probabilities from data

• Ask users
• Enumerate, and count! (a small counting sketch follows)
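The "enumerate, and count" step amounts to maximum-likelihood estimation of each CPT from the collected responses. Below is a minimal sketch for one hypothetical CPT, P(Rating | Laugh, Cry), assuming Laugh and Cry are Rating's parents in the network; the survey records are made-up placeholders, not data from the cited study.

```python
# Maximum-likelihood estimate of P(Rating | Laugh, Cry) by counting responses.
from collections import Counter, defaultdict

responses = [
    # (laugh, cry, rating) answers collected from users (illustrative only)
    ("yes", "no",  "high"), ("yes", "no",  "high"), ("yes", "no",  "low"),
    ("no",  "yes", "high"), ("no",  "no",  "low"),  ("no",  "no",  "low"),
]

counts = defaultdict(Counter)
for laugh, cry, rating in responses:
    counts[(laugh, cry)][rating] += 1          # enumerate the cases, and count

cpt = {parents: {r: n / sum(c.values()) for r, n in c.items()}
       for parents, c in counts.items()}
print(cpt[("yes", "no")])   # ≈ {'high': 0.667, 'low': 0.333} = P(Rating | laugh=yes, cry=no)
```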

Discussion: Is the network sexist?

Page 101

Classroom Exercise (5 minutes)

• What is your one takeaway from this class
• What is still unclear
• Collector: Collect these for this class and identify one common item in each category. Put on Canvas.

• PS: putting extra resources on Canvas for these topics counts towards class participation

