transcript
- Slide 1
- Author: David Heckerman. Presented by: Yan Zhang (2006), Jeremy
Gould (2013).
- Slide 2
- Outline: Bayesian Approach (Bayesian vs. classical probability
methods; Examples). Bayesian Network (Structure; Inference; Learning
Probabilities; Learning the Network Structure; Two coin toss, an
example). Conclusions. Exam Questions.
- Slide 3
- Bayesian vs. the Classical Approach: The Bayesian probability of
an event x represents a person's degree of belief or confidence
in that event's occurrence, based on prior and observed facts.
Classical probability refers to the true or actual probability of
the event and is not concerned with observed behavior.
- Slide 4
- Example: Is this Man a Martian Spy?
- Slide 5
- Example: We start with two concepts. 1. Hypothesis (H): He either
is or is not a Martian spy. 2. Data (D): Some set of information
about the subject. Perhaps financial data, phone records, maybe we
bugged his office.
- Slide 6
- Example. Frequentist says: Given a hypothesis (he IS a Martian)
there is a probability P of seeing this data: P(D | H). (Considers
absolute ground truth; the uncertainty/noise is in the data.)
Bayesian says: Given this data there is a probability P of this
hypothesis being true: P(H | D). (This probability indicates our
level of belief in the hypothesis.)
- Slide 7
- Bayesian vs. the Classical Approach: The Bayesian approach
restricts its prediction to the next (N+1) occurrence of an event,
given the observed previous (N) events. The classical approach is to
predict the likelihood of any given event regardless of the number
of occurrences. NOTE: The Bayesian approach can be updated as new
data is observed.
- Slide 8
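- Bayes' Theorem: $P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}$,
where $P(D) = \sum_i P(D \mid H_i)\, P(H_i)$ sums over a partition
$H_1, \dots, H_k$ of mutually exclusive and exhaustive hypotheses.
For the continuous case, imagine an infinite number of
infinitesimally small partitions, so that the sum becomes an integral.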
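A minimal sketch of the discrete form of Bayes' theorem, applied to the Martian-spy hypothesis. The function name `posterior` and every number below are hypothetical; none of them come from the slides.

```python
def posterior(priors, likelihoods):
    """Return P(H_i | D) for each hypothesis H_i via Bayes' theorem."""
    # Joint P(D, H_i) = P(D | H_i) * P(H_i) for every hypothesis in the partition.
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    # P(D) is the sum over the partition (law of total probability).
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


# Hypothetical numbers, chosen only for illustration.
priors = {"spy": 0.01, "not_spy": 0.99}          # P(H)
likelihoods = {"spy": 0.9, "not_spy": 0.05}      # P(D | H)
result = posterior(priors, likelihoods)
print(result)
```

Even with data that is much more likely under the "spy" hypothesis, a small prior keeps the posterior belief modest, which is the point of weighing P(D | H) by P(H).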
- Slide 9
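- Example: Coin Toss. I want to toss a coin n = 100 times. Let's
denote the random variable X as the outcome of one flip:
$p(X = \text{head}) = \theta$, $p(X = \text{tail}) = 1 - \theta$.
Before doing this experiment we have some belief in our mind: the
prior probability $p(\theta \mid \xi)$, where $\xi$ is our background
knowledge. Let's assume that this prior has a Beta distribution (a
common assumption):
$p(\theta \mid \xi) = \mathrm{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$.
[Figure: sample Beta distributions]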
- Slide 10
- Example: Coin Toss. If we assume a 50-50 coin we can use
$\alpha = \beta = 5$, which gives a prior density peaked at
$\theta = 0.5$. (Hopefully, what you were expecting!)
- Slide 11
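- Example: Coin Toss. Now I can run my experiment. As I go I can
update my beliefs based on the observed heads (h) and tails (t) by
applying Bayes' Law to the Beta distribution:
$p(\theta \mid D, \xi) = \frac{p(\theta \mid \xi)\, p(D \mid \theta, \xi)}{p(D \mid \xi)}$,
with likelihood $p(D \mid \theta, \xi) = \theta^{h} (1 - \theta)^{t}$.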
- Slide 12
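- Example: Coin Toss. Since we're assuming a Beta distribution,
this becomes
$p(\theta \mid D, \xi) = \mathrm{Beta}(\theta \mid \alpha + h, \beta + t)$,
our posterior probability. Supposing that we observed h = 65,
t = 35, with the prior $\alpha = \beta = 5$ we would get
$\mathrm{Beta}(\theta \mid 70, 40)$.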
- Slide 13
- Example: Coin Toss
- Slide 14
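- Integration: To find the probability that $X_{n+1}$ = heads, we
could also integrate over all possible values of $\theta$ to find the
average value of $\theta$, which yields
$p(X_{n+1} = \text{heads} \mid D, \xi) = \int \theta\, p(\theta \mid D, \xi)\, d\theta = \frac{\alpha + h}{\alpha + \beta + n}$.
This might be necessary if we were working with a distribution with
a less obvious expected value.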
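The coin-toss update can be sketched in a few lines, assuming the Beta(5, 5) prior and the observed h = 65, t = 35 from the slides. The helper `beta_pdf` and the midpoint-rule integration are illustrative choices, not something the slides prescribe.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, computed via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))


alpha, beta = 5, 5                      # prior for a roughly 50-50 coin
h, t = 65, 35                           # observed heads and tails
post_a, post_b = alpha + h, beta + t    # conjugate update gives Beta(70, 40)

# Closed form: the posterior mean of theta.
p_heads_closed = post_a / (post_a + post_b)

# Numerical check: midpoint-rule integral of theta * p(theta | D) over (0, 1).
steps = 100_000
width = 1.0 / steps
p_heads_numeric = sum(
    ((i + 0.5) * width) * beta_pdf((i + 0.5) * width, post_a, post_b) * width
    for i in range(steps)
)
print(p_heads_closed, p_heads_numeric)
```

The two values agree closely; the numerical route is the one that still works when the posterior has no convenient closed-form mean.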
- Slide 15
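- More than Two Outcomes: In the previous example, we used a Beta
distribution to encode our beliefs about $\theta$. This was
possible because there were only 2 states/outcomes of the variable
X. In general, if the observed variable X is discrete, having r
possible states {1, ..., r}, the likelihood function is given by
$p(X = k \mid \theta, \xi) = \theta_k$ for $k = 1, \dots, r$, with
$\sum_k \theta_k = 1$. In this general case we can use a Dirichlet
distribution instead:
$p(\theta \mid \xi) = \mathrm{Dir}(\theta \mid \alpha_1, \dots, \alpha_r) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{r} \theta_k^{\alpha_k - 1}$.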
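For more than two outcomes the same conjugate update applies: the observed count of each state is added to the matching Dirichlet hyperparameter. The sketch below assumes a hypothetical three-state variable with a uniform prior and made-up counts; the function names are illustrative.

```python
def dirichlet_update(alphas, counts):
    """Posterior Dirichlet hyperparameters: add observed counts to the prior ones."""
    return [a + n for a, n in zip(alphas, counts)]


def predictive(alphas):
    """Probability of each state on the next draw (the mean of the Dirichlet)."""
    total = sum(alphas)
    return [a / total for a in alphas]


# Hypothetical three-state variable: uniform prior and made-up counts.
prior = [1, 1, 1]
counts = [12, 5, 3]
post = dirichlet_update(prior, counts)
print(post, predictive(post))
```

With r = 2 this reduces to the Beta update used in the coin-toss example.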
- Slide 16
- Vocabulary Review. Prior probability, $P(\theta \mid \xi)$: the
probability of a particular value of $\theta$ given no observed data
(our previous belief); $\xi$ is our background knowledge. Posterior
probability, $P(\theta \mid D, \xi)$: the probability of a particular
value of $\theta$ given that D has been observed (our updated belief
about $\theta$). Observed probability or likelihood,
$P(D \mid \theta, \xi)$: the likelihood of the sequence of coin tosses
D being observed given that $\theta$ is a particular value.
$P(D \mid \xi)$: the raw probability of D.
- Slide 17
- Outline: Bayesian Approach (Bayes' Theorem; Bayesian vs. classical
probability methods; coin toss, an example). Bayesian Network
(Structure; Inference; Learning Probabilities; Learning the Network
Structure; Two coin toss, an example). Conclusions. Exam Questions.
- Slide 18
- OK, But So What? That's great, but this is Data Mining, not
Philosophy of Mathematics. Why should we care about all of this
ugly math?
- Slide 19
- Bayesian Advantages: It turns out that the Bayesian technique
permits us to do some very useful things from a mining perspective!
1. We can use the Chain Rule with Bayesian probabilities:
$P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1})$.
This isn't something we can easily do with classical probability!
2. As we've already seen, using the Bayesian model permits us to
update our beliefs based on new data.
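As a small numerical check of the Chain Rule, the sketch below builds a hypothetical joint distribution over three binary variables (A, B, C) and confirms that P(A) P(B | A) P(C | A, B) reproduces every joint entry. The weights are made up and the helper names are illustrative.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (A, B, C).
weights = [3, 1, 2, 2, 1, 4, 2, 1]
joint = {abc: w / sum(weights) for abc, w in zip(product([0, 1], repeat=3), weights)}


def marginal(fixed):
    """Sum the joint over every assignment that agrees with `fixed` (index -> value)."""
    return sum(p for abc, p in joint.items() if all(abc[i] == v for i, v in fixed.items()))


def chain_rule(a, b, c):
    """P(A) * P(B | A) * P(C | A, B), with each conditional derived from the joint."""
    p_a = marginal({0: a})
    p_b_given_a = marginal({0: a, 1: b}) / p_a
    p_c_given_ab = joint[(a, b, c)] / marginal({0: a, 1: b})
    return p_a * p_b_given_a * p_c_given_ab


# Largest difference between the factorised product and the stored joint entry.
max_gap = max(abs(chain_rule(*abc) - p) for abc, p in joint.items())
print(max_gap)
```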
- Slide 20
- Example Network: To create a Bayesian network we will
ultimately need 3 things: a set of variables
$X = \{X_1, \dots, X_n\}$, a network structure, and a Conditional
Probability Table (CPT). Note that when we start we may not have any
of these things, or a given element may be incomplete!
- Slide 21
- Let's start with a simple case where we are given all three
things: a credit fraud network designed to determine the
probability of credit fraud.
- Slide 22
- Set of Variables: Each node represents a random variable.
(Let's assume discrete for now.)
- Slide 23
- Network Structure: Each edge represents a conditional
dependence between variables.
- Slide 24
- Conditional Probability Table: Each rule represents the
quantification of a conditional dependency.
- Slide 25
- Since we've been given the network structure, we can easily
read off the conditional independencies: P(A|F) = P(A);
P(S|F,A) = P(S); P(G|F,A,S) = P(G|F); P(J|F,A,S,G) = P(J|F,A,S).
- Slide 26
- Note that the absence of an edge indicates conditional
independence: P(A|G) = P(A).
- Slide 27
- Important Note: The presence of a cycle will render one
or more of the relationships intractable!
- Slide 28
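- Inference: Now suppose we want to calculate (infer) our
confidence level in a hypothesis on the fraud variable f given some
knowledge about the other variables. This can be directly
calculated via
$P(f \mid a, s, g, j) = \frac{P(f, a, s, g, j)}{P(a, s, g, j)} = \frac{P(f, a, s, g, j)}{\sum_{f'} P(f', a, s, g, j)}$.
(Kind of messy.)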
- Slide 29
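- Inference: Fortunately, we can use the Chain Rule, together with
the conditional independencies in the network, to simplify:
$P(f \mid a, s, g, j) = \frac{P(f)\, P(g \mid f)\, P(j \mid f, a, s)}{\sum_{f'} P(f')\, P(g \mid f')\, P(j \mid f', a, s)}$.
This simplification is especially powerful when the network is
sparse, which is frequently the case in real-world problems. This
shows how we can use a Bayesian network to infer a probability not
stored directly in the model.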
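The inference step can be sketched as follows, assuming the structure F -> G and F, A, S -> J implied by the factorisation P(A) P(S) P(G|F) P(J|F,A,S). All variables are treated as binary and every probability below is hypothetical; the slides' actual CPT values are not reproduced here.

```python
from itertools import product

# Hypothetical CPTs; every number is made up for illustration.
p_f = {True: 0.001, False: 0.999}        # P(F): fraud or not
p_a = {True: 0.25, False: 0.75}          # P(A): binary age variable
p_s = {True: 0.5, False: 0.5}            # P(S): binary sex variable
p_g_true = {True: 0.2, False: 0.01}      # P(G=True | F)


def p_j_true(f, a, s):
    """P(J=True | F, A, S); hypothetical values."""
    if f:
        return 0.05
    return 0.0001 if (a and s) else 0.001


def bern(p_true, value):
    """Probability of a binary value given the probability that it is True."""
    return p_true if value else 1.0 - p_true


def joint(f, a, s, g, j):
    """Full joint P(f, a, s, g, j) via the network factorisation."""
    return (p_f[f] * p_a[a] * p_s[s]
            * bern(p_g_true[f], g) * bern(p_j_true(f, a, s), j))


def fraud_posterior_direct(a, s, g, j):
    """P(F=True | a, s, g, j) as the joint divided by its sum over f'."""
    return joint(True, a, s, g, j) / sum(joint(f, a, s, g, j) for f in (True, False))


def fraud_posterior_simplified(a, s, g, j):
    """Same quantity after P(a) and P(s) cancel from numerator and denominator."""
    def term(f):
        return p_f[f] * bern(p_g_true[f], g) * bern(p_j_true(f, a, s), j)
    return term(True) / (term(True) + term(False))


print(fraud_posterior_direct(True, True, True, True),
      fraud_posterior_simplified(True, True, True, True))
```

Both functions return the same value for every evidence combination; the simplified one never touches P(a) or P(s), which is where a sparse network saves work.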
- Slide 30
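- Now for the Data Mining! So far we haven't added much value to
the data. So let's take advantage of the Bayesian model's ability to
update our beliefs and learn from new data. First we'll rewrite our
joint probability distribution in a more compact form:
$P(x \mid \theta_s, S^h) = \prod_{i=1}^{n} P(x_i \mid \mathrm{pa}_i, \theta_i, S^h)$,
where $\mathrm{pa}_i$ are the parents of $X_i$ in structure $S^h$
and $\theta_i$ are the parameters of its local distribution.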
- Slide 31
- Learning Probabilities in a Bayesian Network: First we need to
make two assumptions: 1. There is no missing data (i.e. the data
accurately describes the distribution). 2. The parameter vectors are
independent (generally a good assumption, at least locally).
- Slide 32
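- Learning Probabilities in a Bayesian Network: If these
assumptions hold and each parameter vector has a Dirichlet prior,
the posterior is again Dirichlet:
$P(\theta_{ij} \mid D, S^h) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1} + N_{ij1}, \dots, \alpha_{ijr_i} + N_{ijr_i})$,
where $N_{ijk}$ counts the cases in D in which $X_i$ is in state k
and its parents are in configuration j.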
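Under the two assumptions above, learning a conditional probability table reduces to counting. The sketch below estimates P(child | parents) from complete data, assuming a Dirichlet prior with one pseudo-count per state; the function `learn_cpt`, the variable names F and G, and the data are all hypothetical.

```python
from collections import Counter


def learn_cpt(data, child, parents, states, prior_count=1.0):
    """Posterior-mean estimate of P(child | parents) from complete data.

    Assumes a Dirichlet prior with `prior_count` pseudo-counts per child state.
    """
    # Count how often each (parent configuration, child state) pair occurs.
    counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_configs = {cfg for cfg, _ in counts}
    cpt = {}
    for cfg in parent_configs:
        total = sum(counts[(cfg, s)] + prior_count for s in states)
        cpt[cfg] = {s: (counts[(cfg, s)] + prior_count) / total for s in states}
    return cpt


# Hypothetical complete data set over two variables, F and G.
data = (
    [{"F": "no", "G": "no"}] * 8
    + [{"F": "no", "G": "yes"}] * 2
    + [{"F": "yes", "G": "yes"}] * 3
    + [{"F": "yes", "G": "no"}] * 1
)
cpt = learn_cpt(data, child="G", parents=["F"], states=["yes", "no"])
print(cpt)
```

Each parent configuration is updated independently of the others, which is what the parameter-independence assumption buys.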
- Slide 33
- Dealing with Unknowns: Whew! Now we know how to use our network
to infer conditional relationships and how to update our network
with new data. But what if we aren't given a well-defined network?
We could start with a missing or incomplete: 1. Set of Variables. 2.
Conditional Relationship Data. 3. Network Structure.
- Slide 34
- Unknown Variable Set: Our goal when choosing variables is to
organize ... into variables having mutually exclusive and
collectively exhaustive states. This is a problem shared by all data
mining algorithms: What should we measure and why? There is not, and
probably cannot be, an algorithmic solution to this problem, as
arriving at any solution requires intelligent and creative thought.
- Slide 35
- Unknown Conditional Relationships: This can be easy. So long as
we can generate a plausible initial belief about a conditional
relationship, we can simply start with our assumption and let our
data refine our model via the mechanism shown in the "Learning
Probabilities in a Bayesian Network" slide.
- Slide 36
- Unknown Conditional Relationships: However, when our ignorance
becomes serious enough that we no longer even know what is
dependent on what, we segue into the Unknown Structure scenario.
- Slide 37
- Learning the Network Structure: Sometimes the conditional
relationships are not obvious. In this case we are uncertain about
the network structure: we don't know where the edges should be.
- Slide 38
- Learning the Network Structure: Theoretically, we can use a
Bayesian approach to get the posterior distribution of the network
structure: $P(S^h \mid D) = \frac{P(S^h)\, P(D \mid S^h)}{P(D)}$.
Unfortunately, the number of possible network structures grows more
than exponentially with n, the number of nodes. We're basically
asking ourselves to consider every possible graph with n nodes!
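To get a feel for how quickly the search space grows, the sketch below counts labeled directed acyclic graphs on n nodes using Robinson's recurrence. This recurrence is a standard combinatorial result brought in here for illustration; it is not taken from the slides.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edges.
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(1, 8):
    print(n, count_dags(n))
```

Already at five nodes there are 29,281 candidate structures, which is why exhaustive scoring is impractical for realistic networks.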
- Slide 39
- Learning the Network Structure: Heckerman describes two main
methods for shortening the search for a network model. Model
Selection: select a good model (i.e. the network structure) from
all possible models, and use it as if it were the correct model.
Selective Model Averaging: select a manageable number of good
models from among all possible models and pretend that these models
are exhaustive. The math behind both techniques is quite involved,
so I'm afraid we'll have to content ourselves with a toy example
today.
- Slide 40
- Two Coin Toss Example. Experiment: flip two coins and observe
the outcome. Propose two network structures, $S^h_1$ or $S^h_2$, and
assume $P(S^h_1) = P(S^h_2) = 0.5$. $S^h_1$: X1 and X2 with no edge
between them, with p(H) = p(T) = 0.5. $S^h_2$: X1 -> X2, with
P(X2 | X1) given by p(H|H) = 0.1, p(T|H) = 0.9, p(H|T) = 0.9,
p(T|T) = 0.1. After observing some data, which model is more
accurate for this collection of data?
- Slide 41
- Two Coin Toss Example. Observed data (trial: X1, X2): 1: T, T;
2: T, H; 3: H, T; 4: H, T; 5: T, H; 6: H, T; 7: T, H; 8: T, H;
9: H, T; 10: H, T.
- Slide 42
- Two Coin Toss Example (same data table as Slide 41).
- Slide 43
- Two Coin Toss Example (same data table as Slide 41).
- Slide 44
- Two Coin Toss Example
- Slide 45
- Two Coin Toss Example
- Slide 46
- Two Coin Toss Example
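A simplified way to compare the two structures on the ten observed tosses is sketched below. It treats the probabilities from Slide 40 as fixed, known parameters and assumes X1 is a fair coin under both structures; Heckerman's full treatment instead averages over uncertain parameters, so this is an approximation for illustration only.

```python
# Observed (X1, X2) pairs from the data table.
data = ["TT", "TH", "HT", "HT", "TH", "HT", "TH", "TH", "HT", "HT"]

# P(X2 | X1) under the second structure, keyed as (x1, x2).
p_x2_given_x1 = {("H", "H"): 0.1, ("H", "T"): 0.9, ("T", "H"): 0.9, ("T", "T"): 0.1}


def likelihood_independent(rows):
    """P(D | S1): two independent fair coins."""
    p = 1.0
    for _ in rows:
        p *= 0.5 * 0.5
    return p


def likelihood_dependent(rows):
    """P(D | S2): X1 fair (an assumption), X2 drawn from P(X2 | X1)."""
    p = 1.0
    for x1, x2 in rows:
        p *= 0.5 * p_x2_given_x1[(x1, x2)]
    return p


prior = {"S1": 0.5, "S2": 0.5}
like = {"S1": likelihood_independent(data), "S2": likelihood_dependent(data)}
evidence = sum(prior[s] * like[s] for s in prior)
posterior = {s: prior[s] * like[s] / evidence for s in prior}
print(posterior)
```

Nine of the ten trials show the two coins disagreeing, so the structure with an edge from X1 to X2 ends up with most of the posterior mass.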
- Slide 47
- Outline: Bayesian Approach (Bayes' Theorem; Bayesian vs. classical
probability methods; coin toss, an example). Bayesian Network
(Structure; Inference; Learning Probabilities; Learning the Network
Structure; Two coin toss, an example). Conclusions. Exam Questions.
- Slide 48
- Conclusions: Bayesian method. Bayesian network: Structure;
Inference; Learn parameters and structure. Advantages.
- Slide 49
- Question 1: What is Bayesian Probability? A person's degree of
belief in a certain event. Your own degree of certainty that a
tossed coin will land heads. A degree of confidence in an outcome
given some data.
- Slide 50
- Question 2: Compare the Bayesian and classical approaches to
probability (any one point). Bayesian Approach: (+) Reflects an
expert's knowledge. (+) The belief keeps updating as new data items
arrive. (-) Arbitrary (more subjective). Wants P(H | D). Classical
Probability: (+) Objective and unbiased. (-) Generally not
available: it takes a long time to measure the object's physical
characteristics. Wants P(D | H).
- Slide 51
- Question 3: Mention at least 1 advantage of Bayesian analysis.
Handle incomplete data sets; learning about causal relationships;
combine domain knowledge and data; avoid overfitting.
- Slide 52
- The End. Any Questions?