Introduction to Bayesian Belief Nets

Post on 22-Jan-2016


Introduction to Bayesian Belief Nets

Russ Greiner
Dep’t of Computing Science
Alberta Ingenuity Centre for Machine Learning
University of Alberta

http://www.cs.ualberta.ca/~greiner/bn.html

3

Motivation

Gates says [LATimes, 28/Oct/96]:

Microsoft’s competitive advantage is its expertise in “Bayesian networks”

Current Products:
• Microsoft Pregnancy and Child Care (MSN)
• Answer Wizard (Office 95, Office 2000)
• Print Troubleshooter
• Excel Workbook Troubleshooter
• Office 95 Setup Media Troubleshooter
• Windows NT 4.0 Video Troubleshooter
• Word Mail Merge Troubleshooter

4

Motivation (II)

• US Army: SAIP (Battalion Detection from SAR, IR, … Gulf War)
• NASA: Vista (DSS for Space Shuttle)
• GE: Gems (real-time monitor for utility generators)
• Intel: (infer possible processing problems from end-of-line tests on semiconductor chips)
• KIC:
  - medical: sleep disorders, pathology, trauma care, hand and wrist evaluations, dermatology, home-based health evaluations
  - DSS for capital equipment: locomotives, gas-turbine engines, office equipment

5

Motivation (III)
• Lymph-node pathology diagnosis
• Manufacturing control
• Software diagnosis
• Information retrieval

Types of tasks:
• Classification/Regression
• Sensor Fusion
• Prediction/Forecasting

6

Outline
• Existing uses of Belief Nets (BNs)
• How to reason with BNs
• Specific Examples of BNs
• Contrast with Rules, Neural Nets, …
• Possible applications of BNs
• Challenges
  - How to reason efficiently
  - How to learn BNs

7

[Figure: a patient’s complaint (“Blah blah ouch yak ouch blah …”) yields Symptoms (chief complaint, history, …) and Signs (physical exam, test results, …), which lead to a Diagnosis and then a Plan (treatment, …)]

8

Objectives: Decision Support System
• Determine
  - which tests to perform
  - which repair to suggest
  based on costs, sensitivity/specificity, …
• Use all sources of information
  - symbolic (discrete observations, history, …)
  - signal (from sensors)
• Handle partial information
• Adapt to track fault distribution

9

Underlying Task

Situation: Given observations {O1=v1, …, Ok=vk} (symptoms, history, test results, …), what is the best DIAGNOSIS Dxi for the patient?

Approach 1: Use a set of “obs1 & … & obsm → Dxi” rules, e.g.

  If Temp>100 & BP = High & Cough = Yes → DiseaseX

but…
• Rules are seldom completely certain
• Need a rule for each situation:
  - for each diagnosis Dxr
  - for each set of possible values vj for Oj
  - for each subset of obs. {Ox1, Ox2, …} ⊆ {Oj}
• Can’t use the rule above if only know Temp and BP

10

Underlying Task, II

Situation: Given observations {O1=v1, …, Ok=vk} (symptoms, history, test results, …), what is the best DIAGNOSIS Dxi for the patient?

Approach 2: Compute probabilities of Dxi given observations { obsj }:

  P( Dx = u | O1= v1, …, Ok= vk )

Challenge: How to express Probabilities?

11

How to deal with Probabilities

• Sufficient: “atomic events”:
  P( Dx = u, O1 = v1, …, Ok = vk, …, ON = vN )
  for all 2^(1+N) values u ∈ {T, F}, vj ∈ {T, F}

  P( Dx=T, O1=T, O2=T, …, ON=T ) = 0.03
  P( Dx=T, O1=T, O2=T, …, ON=F ) = 0.4
  …
  P( Dx=T, O1=F, O2=F, …, ON=T ) = 0
  …
  P( Dx=F, O1=F, O2=F, …, ON=F ) = 0.01

• Then:
  Marginalize:
  $$P(Dx = u,\, O_1 = v_1, \ldots, O_k = v_k) = \sum_{v_{k+1}, \ldots, v_N} P(Dx = u,\, O_1 = v_1, \ldots, O_k = v_k, \ldots, O_N = v_N)$$
  Conditionalize:
  $$P(Dx = u \mid O_1 = v_1, \ldots, O_k = v_k) = \frac{P(Dx = u,\, O_1 = v_1, \ldots, O_k = v_k)}{P(O_1 = v_1, \ldots, O_k = v_k)}$$

• But… even if binary Dx, 20 binary obs.’s: > 2,097,000 numbers!

12

Problems with “Atomic Events”

• Representation is not intuitive
  - Should make “connections” explicit, use “local information”:
    P(Jaundice | Hepatitis), P(LightDim | BadBattery), …
• Too many numbers: O(2^N)
  - Hard to store
  - Hard to use [Must add 2^r values to marginalize r variables]
  - Hard to learn [Takes O(2^N) samples to learn 2^N parameters]

⇒ Include only necessary “connections”: Belief Nets

13

[Figure: a patient with two possible findings, Jaundiced and BloodTest. Questions posed: “Hepatitis?” and “Hepatitis, not Jaundiced but +BloodTest?”]

14

Hepatitis Example

• (Boolean) Variables:
  H  Hepatitis
  J  Jaundice
  B  (positive) Blood test

• Want P( H=1 | J=0, B=1 ), …, P( H=1 | B=1, J=1 ), P( H=1 | B=0, J=0 ), …

• Alternatively…

Option 1:

J  B  H  P(J, B, H)
0  0  0  0.03395
0  0  1  0.0095
0  1  0  0.0003
0  1  1  0.1805
1  0  0  0.01455
1  0  1  0.038
1  1  0  0.00045
1  1  1  0.722

… Marginalize/Conditionalize, to get P( H=1 | J=0, B=1 ) …
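The marginalize-then-conditionalize recipe can be sketched directly on the Option 1 table. This is a minimal illustration, not the presenter's code: the helper names `marginal` and `conditional` are made up here, and the table values are copied as they appear on the slide.

```python
# Joint table P(J, B, H) as listed under "Option 1", keyed by (j, b, h).
JOINT = {
    (0, 0, 0): 0.03395, (0, 0, 1): 0.0095,
    (0, 1, 0): 0.0003,  (0, 1, 1): 0.1805,
    (1, 0, 0): 0.01455, (1, 0, 1): 0.038,
    (1, 1, 0): 0.00045, (1, 1, 1): 0.722,
}
NAMES = ("J", "B", "H")  # position of each variable inside a key


def marginal(joint, **fixed):
    """Marginalize: sum the joint over every variable not named in `fixed`."""
    return sum(
        p for key, p in joint.items()
        if all(key[NAMES.index(name)] == value for name, value in fixed.items())
    )


def conditional(joint, query, evidence):
    """Conditionalize: P(query | evidence) = P(query, evidence) / P(evidence)."""
    return marginal(joint, **query, **evidence) / marginal(joint, **evidence)


posterior = conditional(JOINT, {"H": 1}, {"J": 0, "B": 1})
print(f"P(H=1 | J=0, B=1) = {posterior:.4f}")
```

Every query touches the whole table, which is the storage and computation problem the factored representation is meant to avoid.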

15

Encoding Causal Links

Simple Belief Net: [H → B, H → J, B → J]

• Node ~ Variable
• Link ~ “Causal dependency”
• “CPTable” ~ P(child | parents)

P(H=0)  P(H=1)
0.95    0.05

h  P(B=0 | H=h)  P(B=1 | H=h)
0  0.97          0.03
1  0.05          0.95

h  b  P(J=0 | h, b)  P(J=1 | h, b)
0  0  0.7            0.3
0  1  0.7            0.3
1  0  0.2            0.8
1  1  0.2            0.8

16

Encoding Causal Links

[Belief net: H → B, H → J, B → J]

P(H=1)
0.05

h  P(B=1 | H=h)
1  0.95
0  0.03

h  b  P(J=1 | h, b)
1  1  0.8
1  0  0.8
0  1  0.3
0  0  0.3

P(J | H, B=0) = P(J | H, B=1) for all J, H  ⇒  P(J | H, B) = P(J | H)

J is INDEPENDENT of B, once we know H
Don’t need B → J arc!


18

Encoding Causal Links

[Belief net: H → B, H → J]

P(H=1)
0.05

h  P(B=1 | H=h)
1  0.95
0  0.03

h  P(J=1 | h)
1  0.8
0  0.3

P(J | H, B=0) = P(J | H, B=1) for all J, H  ⇒  P(J | H, B) = P(J | H)

J is INDEPENDENT of B, once we know H
Don’t need B → J arc!

19

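Sufficient Belief Net

[Belief net: H → B, H → J]

P(H=1)
0.05

h  P(B=1 | H=h)
1  0.95
0  0.03

h  P(J=1 | h)
1  0.8
0  0.3

Requires:
• P(H=1) known
• P(J=1 | H=1) known
• P(B=1 | H=1) known
(Only 5 parameters, not 7)

Hence, with normalizing constant α = P(J=0, B=1):

$$P(H{=}1 \mid J{=}0, B{=}1) = \frac{1}{\alpha}\, P(H{=}1)\, P(J{=}0 \mid H{=}1)\, \underbrace{P(B{=}1 \mid J{=}0, H{=}1)}_{=\; P(B=1 \mid H=1)}$$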
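As a worked check of this factored computation, the sketch below plugs in the five parameters from the tables (P(H=1), P(J=1 | h), P(B=1 | h)) and normalizes over h. The function names are illustrative only.

```python
# The five parameters of the H -> J, H -> B net, as given in the tables.
P_H1 = 0.05
P_J1_GIVEN_H = {1: 0.8, 0: 0.3}    # P(J=1 | H=h)
P_B1_GIVEN_H = {1: 0.95, 0: 0.03}  # P(B=1 | H=h)


def bernoulli(p_one, value):
    """Probability of `value` for a binary variable with P(value=1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def posterior_h(j, b):
    """P(H=1 | J=j, B=b), using P(h) * P(j | h) * P(b | h) and normalizing."""
    score = {
        h: bernoulli(P_H1, h)
        * bernoulli(P_J1_GIVEN_H[h], j)
        * bernoulli(P_B1_GIVEN_H[h], b)
        for h in (0, 1)
    }
    alpha = score[0] + score[1]  # alpha = P(J=j, B=b)
    return score[1] / alpha


print(f"P(H=1 | J=0, B=1) = {posterior_h(0, 1):.4f}")
```

With these numbers the unnormalized scores are 0.05 × 0.2 × 0.95 = 0.0095 for H=1 and 0.95 × 0.7 × 0.03 = 0.01995 for H=0, so the posterior is 0.0095 / 0.02945 = 10/31.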

20

“Factoring”

[Belief net: H → B, H → J]

• B does depend on J:
  If J=1, then likely that H=1 ⇒ B=1
• but… ONLY THROUGH H:
  If know H=1, then likely that B=1
  … doesn’t matter whether J=1 or J=0!
  P(B=1 | J=0, H=1) = P(B=1 | H=1)

N.b., B and J ARE correlated a priori: P(B | J) ≠ P(B)
GIVEN H, they become uncorrelated: P(B | J, H) = P(B | H)
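Both claims can be checked numerically with the hepatitis parameters used on the earlier slides (P(H=1)=0.05, P(J=1 | h), P(B=1 | h)). The sketch builds the full joint from the factored form and compares the relevant conditionals; the helper `prob` is a name chosen for this example.

```python
from itertools import product

P_H1 = 0.05
P_J1_GIVEN_H = {1: 0.8, 0: 0.3}    # P(J=1 | H=h)
P_B1_GIVEN_H = {1: 0.95, 0: 0.03}  # P(B=1 | H=h)


def pick(p_one, value):
    return p_one if value == 1 else 1.0 - p_one


# Full joint P(h, j, b) built from the factored form P(h) P(j | h) P(b | h).
joint = {
    (h, j, b): pick(P_H1, h) * pick(P_J1_GIVEN_H[h], j) * pick(P_B1_GIVEN_H[h], b)
    for h, j, b in product((0, 1), repeat=3)
}
INDEX = {"h": 0, "j": 1, "b": 2}


def prob(**fixed):
    """Probability that the named variables take the given values."""
    return sum(
        p for key, p in joint.items()
        if all(key[INDEX[name]] == value for name, value in fixed.items())
    )


p_b = prob(b=1)                                         # a priori
p_b_given_j = prob(b=1, j=1) / prob(j=1)                # differs from p_b
p_b_given_h = prob(b=1, h=1) / prob(h=1)                # given H only
p_b_given_jh = prob(b=1, j=0, h=1) / prob(j=0, h=1)     # J adds nothing once H is known

print(f"P(B=1) = {p_b:.4f}, P(B=1 | J=1) = {p_b_given_j:.4f}")
print(f"P(B=1 | H=1) = {p_b_given_h:.4f}, P(B=1 | J=0, H=1) = {p_b_given_jh:.4f}")
```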

21

Factored Distribution

• Symptoms independent, given Disease

  H  Hepatitis
  J  Jaundice
  B  (positive) Blood test

  P( B | J ) ≠ P( B ) but
  P( B | J, H ) = P( B | H )

• ReadingAbility and ShoeSize are dependent,
  P(ReadAbility | ShoeSize) ≠ P(ReadAbility)
  but become independent, given Age
  P(ReadAbility | ShoeSize, Age) = P(ReadAbility | Age)

[Belief net: Age → ShoeSize, Age → Reading]

22

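“Naïve Bayes”

[Belief net: H → O1, H → O2, …, H → On]

Classification Task:
Given { O1 = v1, …, On = vn }
Find hi that maximizes P(H = hi | O1 = v1, …, On = vn)

Given
• P(H = hi)
• P(Oj = vj | H = hi)
• Independent: P(Oj | H, Ok, …) = P(Oj | H)

Find argmax over {hi} of

$$P(H = h_i \mid O_1 = v_1, \ldots, O_n = v_n) = \frac{1}{\alpha}\, P(H = h_i) \prod_j P(O_j = v_j \mid H = h_i)$$

where α is a normalizing term.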

23

Naïve Bayes (con’t)

[Belief net: H → O1, H → O2, …, H → On]

$$P(H = h_i \mid O_1 = v_1, \ldots, O_n = v_n) = \frac{1}{\alpha}\, P(H = h_i) \prod_j P(O_j = v_j \mid H = h_i)$$

Normalizing term (No need to compute, as same for all hi):

$$\alpha = P(O_1 = v_1, \ldots, O_n = v_n) = \sum_i P(H = h_i) \prod_j P(O_j = v_j \mid H = h_i)$$

• Easy to use for Classification
• Can use even if some vj’s not specified
• If k Dx’s and n Oi’s, requires only k priors, n * k pairwise-conditionals
  (Not 2^(n+k)… relatively easy to learn)

n    1 + 2n   2^(n+1) - 1
10   21       2,047
30   61       2,147,483,647
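A minimal Naïve Bayes classifier along these lines is sketched below. The data layout (`priors`, `cond`, `observed`) is an assumption made for illustration; the example numbers reuse the hepatitis parameters from the earlier slides. Unobserved variables are simply left out of the product, which is how missing vj values are handled.

```python
from math import prod


def naive_bayes_posterior(priors, cond, observed):
    """
    priors:   {h: P(H=h)}
    cond:     {obs_name: {h: {value: P(O=value | H=h)}}}
    observed: {obs_name: value}, only for the observations that are available
    Returns {h: P(H=h | observed)}.
    """
    scores = {
        h: p_h * prod(cond[name][h][value] for name, value in observed.items())
        for h, p_h in priors.items()
    }
    alpha = sum(scores.values())  # normalizing term P(observed)
    return {h: s / alpha for h, s in scores.items()}


def n_parameters(n, k=2):
    """Free parameters for k classes and n binary observations: (naive, full joint)."""
    return (k - 1) + n * k, k * 2 ** n - 1


PRIORS = {1: 0.05, 0: 0.95}
COND = {
    "J": {1: {1: 0.8, 0: 0.2}, 0: {1: 0.3, 0: 0.7}},
    "B": {1: {1: 0.95, 0: 0.05}, 0: {1: 0.03, 0: 0.97}},
}

both = naive_bayes_posterior(PRIORS, COND, {"J": 0, "B": 1})
blood_only = naive_bayes_posterior(PRIORS, COND, {"B": 1})  # J not specified
best = max(both, key=both.get)  # argmax over h; alpha is not needed for this step
print(both, blood_only, best, n_parameters(10), n_parameters(30))
```

For k = 2 the parameter counts reproduce the table: 21 vs 2,047 for n = 10, and 61 vs 2,147,483,647 for n = 30.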

24

Bigger Networks

[Belief net: GeneticPH (D) → Hepatitis (H), LiverTrauma (I) → Hepatitis (H), Hepatitis → Jaundice (J), Hepatitis → Bloodtest (B)]

Intuition: Show CAUSAL connections:
GeneticPH CAUSES Hepatitis; Hepatitis CAUSES Jaundice

If GeneticPH, then expect Jaundice: GeneticPH ⇒ Hepatitis ⇒ Jaundice
But only via Hepatitis: GeneticPH and not Hepatitis ⇏ Jaundice

P( J | D ) ≠ P( J ) but
P( J | D, H ) = P( J | H )

P(I=1)
0.20

P(H=1)
0.32

d  i  P(H=1 | d, i)
1  1  0.82
1  0  0.10
0  1  0.45
0  0  0.04

h  P(J=1 | h)
1  0.8
0  0.3

h  P(B=1 | h)
1  0.98
0  0.01

25

Belief Nets

[Belief net: D → H, I → H, H → J, H → B]

• DAG structure
  - Each node ~ variable v
  - v depends (only) on its parents
    + conditional prob: P(vi | parenti = 0, 1, …)
• v is INDEPENDENT of non-descendants, given assignments to its parents

Given H = 1,
- D has no influence on J
- J has no influence on B
- etc.

26

Less Trivial Situations

[Belief net: FHD → MS, FHD → D, MS → D]

• N.b., obs1 is not always independent of obs2 given H
• E.g., FamilyHistoryDepression ‘causes’ MotherSuicide and Depression;
  MotherSuicide causes Depression (w/ or w/o F.H.Depression)
• Here, P( D | MS, FHD ) ≠ P( D | FHD )!
• Can be done using Belief Network, but need to specify:
  P( FHD )          1
  P( MS | FHD )     2
  P( D | MS, FHD )  4

P(FHD=1)
0.001

f  P(MS=1 | FHD=f)
1  0.10
0  0.03

f  m  P(D=1 | FHD=f, MS=m)
1  1  0.97
1  0  0.90
0  1  0.08
0  0  0.04
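The seven numbers above determine the whole joint P(FHD, MS, D). The sketch below builds it by multiplying the three CP-tables and then answers a few queries by enumeration; `prob` and the variable names are illustrative, not part of the original slides.

```python
from itertools import product

P_FHD1 = 0.001
P_MS1 = {1: 0.10, 0: 0.03}               # P(MS=1 | FHD=f)
P_D1 = {(1, 1): 0.97, (1, 0): 0.90,
        (0, 1): 0.08, (0, 0): 0.04}      # P(D=1 | FHD=f, MS=m)


def pick(p_one, value):
    return p_one if value == 1 else 1.0 - p_one


# Joint over (f, m, d) = P(f) * P(m | f) * P(d | f, m).
joint = {
    (f, m, d): pick(P_FHD1, f) * pick(P_MS1[f], m) * pick(P_D1[(f, m)], d)
    for f, m, d in product((0, 1), repeat=3)
}


def prob(f=None, m=None, d=None):
    """Sum the joint over all entries consistent with the given values."""
    want = (f, m, d)
    return sum(
        p for key, p in joint.items()
        if all(w is None or w == k for w, k in zip(want, key))
    )


p_d_given_ms = prob(m=1, d=1) / prob(m=1)
p_d_given_fhd = prob(f=1, d=1) / prob(f=1)
p_d_given_fhd_ms = prob(f=1, m=1, d=1) / prob(f=1, m=1)
print(p_d_given_ms, p_d_given_fhd, p_d_given_fhd_ms)
```

Here P(D=1 | FHD=1) = 0.907 while P(D=1 | FHD=1, MS=1) = 0.97, so MS still carries information about D after FHD is known, which is why the MS → D arc is needed.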

27

Example: Car Diagnosis

28

MammoNet

29

ALARM

A Logical Alarm Reduction Mechanism• 8 diagnoses, 16 findings, …

30

Troop Detection

31

ARCO1: Forecasting Oil Prices

32

ARCO1: Forecasting Oil Prices

33

Forecasting Potato Production

34

Warning System

35

Extensions
• Find best values (posterior distr.) for SEVERAL (> 1) “output” variables
• Partial specification of “input” values
  - only subset of variables
  - only “distribution” of each input variable
• General Variables
  - Discrete, but domain > 2
  - Continuous (Gaussian: x = Σi bi yi for parents {Y})
• Decision Theory: Decision Nets (Influence Diagrams)
  - Making Decisions, not just assigning prob’s
• Storing P( v | p1, p2, …, pk )
  - General “CP Tables”: O(2^k)
  - Noisy-Or, Noisy-And, Noisy-Max
  - “Decision Trees”

36

Outline
• Existing uses of Belief Nets (BNs)
• How to reason with BNs
• Specific Examples of BNs
• Contrast with Rules, Neural Nets, …
• Possible applications of BNs
• Challenges
  - How to reason efficiently
  - How to learn BNs

37

Belief Nets vs Rules
• Both have “Locality”: specific clusters (rules / connected nodes)
• Often same nodes (rep’ning Propositions) but
  - BN: Cause → Effect, “Hep → Jaundice”, P(J | H)
  - Rule: Effect → Cause, “Jaundice → Hep”
  WHY?: Easier for people to reason CAUSALLY, even if use is DIAGNOSTIC
• BN provide OPTIMAL way to deal with
  + Uncertainty
  + Vagueness (var not given, or only dist)
  + Error
  … Signals meeting Symbols …
• BN permits different “direction”s of inference

38

Belief Nets vs Neural Nets
• Both have “graph structure” but
  - BN: Nodes have SEMANTICs; Combination Rules: Sound Probability
  - NN: Nodes: arbitrary; Combination Rules: Arbitrary
• So harder to
  - Initialize NN
  - Explain NN
  (But perhaps easier to learn NN from examples only?)
• BNs can deal with
  - Partial Information
  - Different “direction”s of inference

39

Belief Nets vs Markov Nets
• Each uses “graph structure” to FACTOR a distribution
  … explicitly specify dependencies, implicitly independencies…
• but subtle differences…
  - BNs capture “causality”, “hierarchies”
  - MNs capture “temporality”

Technical:
• BNs use DIRECTED arcs, allow “induced dependencies”
  [Figure: directed net over A, B, C]
  I(A, {}, B)    “A independent of B, given {}”
  ¬ I(A, C, B)   “A dependent on B, given C”
• MNs use UNDIRECTED arcs, allow other independencies
  [Figure: undirected net over A, B, C, D]
  I(A, BC, D)    A independent of D, given B, C
  I(B, AD, C)    B independent of C, given A, D

40

Uses of Belief Nets #1
• Medical Diagnosis: “Assist/Critique” MD
  - identify diseases not ruled-out
  - specify additional tests to perform
  - suggest treatments appropriate/cost-effective
  - react to MD’s proposed treatment
• Decision Support: Find/repair faults in complex machines
  [Device, or Manufacturing Plant, or …]
  … based on sensors, recorded info, history, …
• Preventative Maintenance: Anticipate problems in complex machines
  [Device, or Manufacturing Plant, or …]
  … based on sensors, statistics, recorded info, device history, …

41

Uses (con’t)
• Logistics Support: Stock warehouses appropriately
  … based on (estimated) freq. of needs, costs, …
• Diagnose Software: Find most probable bugs, given program behavior, core dump, source code, …
• Part Inspection/Classification:
  … based on multiple sensors, background, model of production, …
• Information Retrieval: Combine information from various sources,
  based on info from various “agents”, …

General: Partial Info, Sensor fusion
- Classification
- Interpretation
- Prediction
- …

42

Challenge #1: Computational Efficiency

[Belief net over D, I, H, J, B]

For given BN, general problem is:
  Given    O1 = v1, …, On = vn
  Compute  P(H | O1 = v1, …, On = vn)

+ If BN is “poly tree”: efficient alg.
- If BN is gen’l DAG (>1 path from X to Y):
  - NP-hard in theory
  - slow in practice

Tricks: Get approximate answer (quickly)
+ Use abstraction of BN
+ Use “abstraction” of query (range)

43

#2a: Obtaining Accurate BN

• BN encodes distribution over n variables
  Not O(2^n) values, but “only” Σi 2^(ki)
  (Node ni binary, with ki parents)
  Still lots of values! … structure …
• Qualitative Information. Structure: “What depends on what?”
  - Easy for people (background knowledge)
  - But NP-hard to learn from samples…
• Quantitative Information. Actual CP-tables
  - Easy to learn, given lots of examples.
  - But people have hard time…

Knowledge acquisition: from human experts
Simple learning algorithm
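The Σi 2^(ki) count is easy to compute from a parent list. The sketch below applies it to the five-node net over D, I, H, J, B used on the earlier slides (H has parents D and I; J and B each have parent H); the dictionary layout and function name are assumptions for illustration.

```python
def cpt_size(parents):
    """parents: {node: [parent nodes]} for binary nodes. Returns sum over i of 2**k_i."""
    return sum(2 ** len(ps) for ps in parents.values())


# D, I -> H -> J, B (all nodes binary)
net = {"D": [], "I": [], "H": ["D", "I"], "J": ["H"], "B": ["H"]}

factored = cpt_size(net)            # 1 + 1 + 4 + 2 + 2
full_joint = 2 ** len(net) - 1      # free parameters of the unrestricted joint
print(factored, full_joint)
```

For this net the factored form needs 10 numbers against 31 for the full joint, and the gap widens rapidly as nodes are added while the number of parents per node stays small.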

44

Notes on Learning
• Mixed Sources: Person provides structure; Algorithm fills-in numbers.
• Just Learning Algorithm: algorithms that learn from samples
  - structure
  - values
• Just Human Expert: People produce CP-table, as well as structure
  - Relatively few values really required,
    esp. if NoisyOr, NoisyAnd, NaiveBayes, …
  - Actual values not that important… Sensitivity studies
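For the “person provides structure, algorithm fills in numbers” case, the simplest way to fill in a CP-table from complete samples is relative-frequency counting. The sketch below assumes binary variables and samples given as dictionaries; the eight samples are made up for illustration.

```python
from collections import Counter


def estimate_cpt(samples, child, parents):
    """Relative-frequency estimate of P(child=1 | parents) from complete samples."""
    ones, totals = Counter(), Counter()
    for sample in samples:
        key = tuple(sample[p] for p in parents)
        totals[key] += 1
        ones[key] += sample[child]
    return {key: ones[key] / totals[key] for key in totals}


# Made-up data for the H -> J link.
samples = [
    {"H": 1, "J": 1}, {"H": 1, "J": 1}, {"H": 1, "J": 0}, {"H": 1, "J": 1},
    {"H": 0, "J": 0}, {"H": 0, "J": 1}, {"H": 0, "J": 0}, {"H": 0, "J": 0},
]
cpt = estimate_cpt(samples, "J", ["H"])
print(cpt)
```

Parent configurations that never occur in the data get no entry, which is one reason priors or expert-supplied values are combined with counts in practice.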

45

My Current Work
• Learning Belief Nets
  - Model selection: Challenging myth that MDL is appropriate criterion
  - Learning “performance system”, not model
• Validating Belief Nets
  - “Error bars” around answers
• Adaptive User Interfaces
• Efficient Vision Systems
• Foundations of Learnability
  - Learning Active Classifiers
  - Sequential learners
• Condition Based maintenance, Bio-signal interpretation, …

46

#2b: Maintaining Accurate BN

• The world changes. Information in BN* may be
  - perfect at time t
  - sub-optimal at time t + 20
  - worthless at time t + 200
• Need to MAINTAIN a BN over time using
  - on-going human consultant
  - Adaptive BN
    Dirichlet distribution (variables)
    Priors over BNs
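One common way to make a CP-table adaptive is to keep Dirichlet pseudo-counts per table row and update them as cases arrive. The sketch below is an illustration under that assumption; the class name and the `decay` forgetting factor are hypothetical additions, not something the slides specify.

```python
class DirichletRow:
    """Pseudo-counts for one row of a CP-table (one parent configuration)."""

    def __init__(self, prior_counts):
        self.counts = [float(c) for c in prior_counts]

    def update(self, value, decay=1.0):
        """Record one observed value; decay < 1 down-weights older evidence
        (a hypothetical forgetting factor for a changing world)."""
        self.counts = [c * decay for c in self.counts]
        self.counts[value] += 1.0

    def mean(self):
        """Posterior-mean estimate of the row's probabilities."""
        total = sum(self.counts)
        return [c / total for c in self.counts]


# Prior equivalent to 10 cases with P(J=1 | H=0) = 0.3, then two new cases with J=1.
row = DirichletRow([7, 3])
row.update(1)
row.update(1)
print(row.mean())
```

The size of the prior counts controls how quickly new data overrides the original estimate.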

47

Conclusions
• Belief Nets are PROVEN TECHNOLOGY
  - Medical Diagnosis
  - DSS for complex machines
  - Forecasting, Modeling, InfoRetrieval, …
• Provide effective way to
  - Represent complicated, inter-related events
  - Reason about such situations
    • Diagnosis, Explanation, ValueOfInfo
    • Explain conclusions
    • Mix Symbolic and Numeric observations
• Challenges
  - Efficient ways to use BNs
  - Ways to create BNs
  - Ways to maintain BNs
  - Reason about time

48

Extra Slides
• AI Seminar: Friday, noon, CSC3-33. Free PIZZA! http://www.cs.ualberta.ca/~ai/ai-seminar.html
• References: http://www.cs.ualberta.ca/~greiner/bn.html
• Crusher Controller
• Formal Framework
• Decision Nets
• Developing the Model
• Why Reasoning is Hard
• Learning Accurate Belief Nets

49

References
• http://www.cs.ualberta.ca/~greiner/bn.html
• Overview textbooks:
  - Judea Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
  - Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995. (See esp. Ch. 14, 15, 19.)
• General info re BayesNets:
  - http://www.afit.af.mil:80/Schools/EN/ENG/LABS/AI/BayesianNetworks
  - Proceedings: http://www.sis.pitt.edu/~dsl/uai.html
  - Assoc for Uncertainty in AI: http://www.auai.org/
• Learning:
  - David Heckerman, A tutorial on learning with Bayesian networks, 1995, http://www.research.microsoft.com/research/dtg/heckerma/TR-95-06.htm
• Software:
  - General: http://bayes.stat.washington.edu/almond/belief.html
  - JavaBayes: http://www.cs.cmu.edu/~fgcozman/Research/JavaBaye
  - Norsys: http://www.norsys.com/

50

Decision Net: Test/Buy a Car

51

Utility: Decision Nets
• Given c(action, state) ∈ ℝ (cost function)
• Cp(a) = Es[ c(a, s) ] = Σ_{s ∈ S} p(s | obs) * c(a, s)
• Best (immediate) action: a* = argmin_{a ∈ A} { Cp(a) }
• Decision Net (like Belief Net) but…
  - 3 types of nodes
    chance (like Belief net)
    action: repair, sensing
    cost/utility
  - Links for “dependency”
  - Given observations, obs, computes best action, a*
• Sequence of Actions: MDPs, POMDPs, …
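The expected-cost rule a* = argmin over a of Σ p(s | obs) c(a, s) can be sketched in a few lines. The states, actions, probabilities, and costs below are made-up illustration values, and `best_action` is a name chosen for this example.

```python
def best_action(p_state, cost):
    """
    p_state: {s: p(s | obs)}
    cost:    {(a, s): c(a, s)}
    Returns (a*, {a: expected cost}), where a* minimizes expected cost.
    """
    actions = sorted({a for a, _ in cost})
    expected = {
        a: sum(p * cost[(a, s)] for s, p in p_state.items())
        for a in actions
    }
    return min(expected, key=expected.get), expected


# Made-up posterior over states and cost table.
p_state = {"ok": 0.9, "faulty": 0.1}
cost = {
    ("continue", "ok"): 0.0, ("continue", "faulty"): 100.0,
    ("repair", "ok"): 5.0,   ("repair", "faulty"): 5.0,
}
a_star, expected = best_action(p_state, cost)
print(a_star, expected)
```

In a decision net, p(s | obs) would come from belief-net inference over the chance nodes rather than being supplied by hand.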

52

Decision Net: Drill for Oil?


53

Formal Framework

Always true:
P(x1, …, xn) = P(x1) P(x2 | x1) P(x3 | x2, x1) … P(xn | xn-1, …, x1)

Given independencies,
P(xk | x1, …, xk-1) = P(xk | pak) for some pak ⊆ {x1, …, xk-1}

Hence

$$P(x_1, \ldots, x_n) = \prod_i P(x_i \mid pa_i)$$

So just connect each y ∈ pai to xi… DAG structure

Note:
- Size of BN is $O\left(\sum_i 2^{|pa_i|}\right)$, so better to use small pai.
- pai = {1, …, i - 1} is never incorrect … but seldom min’l… (so hard to store, learn, reason with, …)
- Order of variables can make HUGE difference:
  can have |pai| = 1 for one ordering, |pai| = i - 1 for another

54

Developing the Model

Source of information
+ (Human) Expert(s)
+ Data from earlier Runs
+ Simulator

Typical Process
1. Develop / Refine Initial Prototype
2. Test Prototype ↦ Accurate System
3. Deploy System
4. Update / Maintain System

55

Develop/Refine Prototype

Requires expert; useful to have data

• Initial Interview(s):
  - To establish “what relates to what”
  - Expert time: ≈ ½ day
• Iterative process (Gradual refinement):
  - To refine qualitative connections
  - To establish correct operations
  - Expert presents “Good Performance”
  - KE implements Expert’s claims
  - KE tests on examples (real data or expert), and reports to Expert
  - Expert time: ≈ 1 - 2 hours / week for ?? weeks
    (Depends on complexity of device, and accuracy of model)

56

Why Reasoning is Hard

BN reasoning may look easy: Just “propagate” information from node to node

[Belief net: Z → A, Z → B, A → C, B → C]

P(Z=t)
0.5

z  P(A=t | Z=z)
t  1.0
f  0.0

z  P(B=t | Z=z)
t  0.0
f  1.0

a  b  P(C=t | a, b)
t  t  1.0
t  f  0.0
f  t  0.0
f  f  0.0

Challenge: What is P(C=t)?
• A = Z = ¬B
• P( A = t ) = P( B = f ) = ½
• So… ? P( C = t ) = P( A = t, B = t ) = P( A = t ) * P( B = t ) = ½ * ½ = ¼
• Wrong: P( C = t ) = 0 !

Need to maintain dependencies!
P( A = t, B = t ) = P( A = t ) * P( B = t | A = t )
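The pitfall can be reproduced by computing P(C=t) two ways from the CP-tables above: exactly, by summing the joint over Z, A, B, and naively, by multiplying the marginals of A and B as if they were independent. Function names are illustrative.

```python
from itertools import product

P_Z = 0.5
P_A = {True: 1.0, False: 0.0}    # P(A=t | Z=z)
P_B = {True: 0.0, False: 1.0}    # P(B=t | Z=z)
P_C = {(True, True): 1.0, (True, False): 0.0,
       (False, True): 0.0, (False, False): 0.0}   # P(C=t | a, b)


def pick(p_true, value):
    return p_true if value else 1.0 - p_true


def p_c_true():
    """Exact P(C=t): sum the joint over all values of Z, A, B."""
    total = 0.0
    for z, a, b in product((True, False), repeat=3):
        total += pick(P_Z, z) * pick(P_A[z], a) * pick(P_B[z], b) * P_C[(a, b)]
    return total


def marginal_true(node_cpt):
    """P(node=t) for a child of Z, summing over Z."""
    return sum(pick(P_Z, z) * node_cpt[z] for z in (True, False))


exact = p_c_true()
naive = marginal_true(P_A) * marginal_true(P_B)  # wrongly treats A and B as independent
print(f"exact P(C=t) = {exact}, naive product = {naive}")
```

The naive product gives ¼ because it discards the fact that A and B are both determined by Z and can never be true together.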

57

Crusher Controller
• Given observations
  - History, sensor readings, schedule, …
• Specify best action for crusher
  - “stop immediately”, “increase roller speed by ”
  - Best == minimize expected cost …
• Initially: just recommendation to human operator
  Later: Directly implement (some) actions
• ? Request values of other sensors ?

58

Approach
1. For each state s (“Good flow”, “tooth about to enter”, …)
   for each action a (“Stop immediately”, “Change p7 += 0.32”, …)
   determine utility of performing a in s
   (Cost of lost production if stopped; … of reduced production efficiency if continue; …)
2. Use observations to estimate (dist over) current states
   Infer EXPECTED UTILITY of each action, based on distr.
3. Return action with highest Expected Utility

59

Details
• Inputs
  - Sensor Readings (history): Camera, microphone, power-draw
  - Parameter settings
  - Log files, Maintenance records
  - Schedule (maintenance, anticipated load, …)
• Outputs
  - Continue as is
  - Adjust parameters: GapSize, ApronFeederSpeed, 1J_ConveyorSpeed
  - Shut down immediately
  - Stop adding new material
  - Tell operator to look
• State (“CrusherEnvironment”)
  - #UncrushableThingsNowInCrusher
  - #TeethMissing
  - NextUncrushableEntry
  - Control Parameters

60

Benefits
• Increase Crusher Effectiveness
  - Find best settings for parameters, to maximize production of well-sized chunks
• Reduce Down Time
  - Know when maintain/repair is critical
• Reduce Damage to Crusher
• Usable Model of Crusher
  - Easy to modify when needed
  - Training
  - Design of next generation
• Prototype for design of {control, diagnostician} of other machines

61

My Background
• PhD, Stanford (Computer Science)
  - Representational issues, Analogical Inference
  - … everything in Logic
• PostDoc at UofToronto (CS)
  - Foundations of learnability, logical inference, DB, control theory, …
  - … everything in Logic
• Industrial research (Siemens Corporate Research)
  - Need to solve REAL problems
  - Theory Revision, Navigational systems, …
  - … logic is not be-all-and-end-all!
• Prof at UofAlberta (CS)
  - Industrial problems (Siemens, BioTools, Syncrude)
  - Foundations of learnability, probabilistic inference, …
