Introduction to Bayesian Learning
Lecture Outline:
• Relevance of Bayesian Methods for Machine Learning
• Bayes Theorem and MAP/ML Hypotheses
• Bayes Theorem and Concept Learning
• Maximum Likelihood and Least-Squared Error Hypotheses
• Minimum Description Length Principle
Reading:
Sections 6.1–6.4 and 6.6 of Mitchell
COM3250 / 6170 1 2010-2011
Relevance of Bayesian Methods for Machine Learning
Bayesian methods have two distinct, important roles in machine learning:
1. They provide practical learning algorithms which explicitly manipulate probabilities:
• Naive Bayes learning
• Bayesian belief network learning
A significant feature of these Bayesian learning algorithms is that they allow us to combine
prior knowledge (prior probabilities) with observed data
2. They provide a useful conceptual framework:
• A “gold standard” for evaluating other learning algorithms
• Additional insight into Occam’s razor and the inductive bias of decision tree learning in
favour of short hypotheses
Bayes Theorem
P(h|D) = P(D|h)P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(D|h) = probability of D given h
• P(h|D) = probability of h given D
Example
A patient takes a cancer test and the result comes back positive. The test returns a correct positive
result in only 98% of the cases in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire
population have this cancer. What is the probability that the patient has cancer?
• h is the hypothesis that the patient has cancer
  – P(h) = prior probability of hypothesis h = .008
• D is data concerning positive outcomes of the test
  – P(D) = (.008 ∗ .98) + (.992 ∗ .03) = .0376
• P(D|h) is the probability of a positive outcome to the test given that a patient has cancer
  – P(D|h) = .98
Calculate P(h|D), the probability of the patient having cancer given a positive outcome to the test, using Bayes Theorem:

P(h|D) = P(D|h)P(h) / P(D) = (.98 ∗ .008) / .0376 = .2085
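This calculation can be sketched in a few lines of Python, using the probabilities given in the problem statement:

```python
# Cancer-test example: the values below are the ones from the slide.
p_cancer = 0.008              # prior P(h)
p_pos_given_cancer = 0.98     # P(D|h): true positive rate
p_pos_given_healthy = 0.03    # false positive rate (1 - 0.97 specificity)

# Total probability of a positive test, P(D)
p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy

# Bayes Theorem: P(h|D) = P(D|h)P(h) / P(D)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 4))  # → 0.2085
```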
MAP/ML Hypotheses
Bayes Theorem: P(h|D) = P(D|h)P(h) / P(D)

• Generally we want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis, hMAP:

  hMAP ≡ argmax_{h∈H} P(h|D)
       = argmax_{h∈H} P(D|h)P(h) / P(D)
       = argmax_{h∈H} P(D|h)P(h)

The last step is justified since P(D) does not depend on h, and so has no impact on which h maximises the expression.
MAP/ML Hypotheses (cont)
• In some cases it may be useful to assume every hypothesis is equally probable a priori,
  i.e. P(hi) = P(hj) for all hi, hj ∈ H.
  – In such cases we need only consider P(D|h) to find the most probable hypothesis
  – P(D|h) is called the likelihood of data D given h
  – Any hypothesis that maximises P(D|h) is called a maximum likelihood (ML) hypothesis:

    hML = argmax_{h∈H} P(D|h)
Example Again
A patient takes a cancer test and the result comes back positive. The test returns a correct positive
result in only 98% of the cases in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire
population have this cancer. What is the MAP hypothesis?
• From the problem statement we see that in general:

  P(cancer) = .008        P(¬cancer) = .992
  P(+|cancer) = .98       P(−|cancer) = .02
  P(+|¬cancer) = .03      P(−|¬cancer) = .97

• Given that a patient returns a positive test, what is the MAP hypothesis?
  As noted: hMAP = argmax_{h∈H} P(D|h)P(h)

  P(+|cancer)P(cancer) = .98 ∗ .008 = .0078
  P(+|¬cancer)P(¬cancer) = .03 ∗ .992 = .0298

  Therefore, the MAP hypothesis is that the patient does not have cancer.
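A minimal sketch of this comparison in Python; since P(D) is the same for both hypotheses, it suffices to compare P(D|h)P(h) directly:

```python
# Compare P(D|h)P(h) for the two hypotheses (values from the slide).
scores = {
    "cancer":     0.98 * 0.008,   # P(+|cancer)P(cancer)   = .0078
    "not cancer": 0.03 * 0.992,   # P(+|¬cancer)P(¬cancer) = .0298
}
h_map = max(scores, key=scores.get)
print(h_map)  # → not cancer
```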
Bayes Theorem and Concept Learning
• Recall the concept learning problem:
  Given:
  – an instance space X
  – a hypothesis space H
  – a target concept c : X → {0,1}
  – a sequence of training examples 〈〈x1,d1〉 . . . 〈xm,dm〉〉, where di = c(xi)
  Learn a hypothesis h such that h(x) = c(x) for as many x ∈ X as possible.
• Brute-Force MAP Hypothesis Learner:
  1. For each hypothesis h in H, calculate the posterior probability

     P(h|D) = P(D|h)P(h) / P(D)

  2. Output the hypothesis hMAP with the highest posterior probability:

     hMAP = argmax_{h∈H} P(h|D)
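The two steps above can be sketched as follows, over a tiny, hypothetical hypothesis space (threshold concepts on integers); all names and the example data are illustrative, not from Mitchell:

```python
# Minimal sketch of the Brute-Force MAP learner.

def likelihood(h, data):
    """P(D|h) under the noise-free assumption: 1 if h is consistent with D, else 0."""
    return 1.0 if all(h(x) == d for x, d in data) else 0.0

def brute_force_map(hypotheses, prior, data):
    """Step 1: score every hypothesis; step 2: return the highest-posterior one."""
    scores = [(likelihood(h, data) * prior(h), h) for h in hypotheses]
    p_d = sum(s for s, _ in scores)   # P(D), by the theorem of total probability
    return max(scores, key=lambda t: t[0])[1], p_d

# Toy space: h_t(x) = (x >= t) for thresholds t = 0..4
hypotheses = [lambda x, t=t: x >= t for t in range(5)]
prior = lambda h: 1 / len(hypotheses)     # uniform prior P(h) = 1/|H|
data = [(1, False), (3, True)]            # noise-free training examples
h, p_d = brute_force_map(hypotheses, prior, data)
print(p_d)  # → 0.4, i.e. 2 of the 5 hypotheses are consistent
```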
Bayes Theorem and Concept Learning (cont)
• The brute-force MAP learning approach may be computationally infeasible:
  – it requires applying Bayes Theorem to all h ∈ H
• It is still useful as a standard against which other concept learning approaches may be judged.
• How can we apply Brute-Force MAP to concept learning? Assume:
  1. The training data are noise free: di = c(xi).
  2. The target concept c is in the hypothesis space H.
  3. There is no a priori reason to believe any one hypothesis more probable than any other.
• Given 2. and 3. we should choose

  P(h) = 1/|H|  for all h ∈ H

• Given 1., P(D|h), the probability of observing the target values D = 〈d1 . . . dm〉 for fixed instances 〈x1 . . . xm〉 if h is true, is

  P(D|h) = 1 if di = h(xi) for all di ∈ D, and 0 otherwise
Bayes Theorem and Concept Learning (cont)
• Given these choices for P(h) and P(D|h) we can now explore how the Brute-Force MAP Learning algorithm would proceed.
• Step 1: compute the posterior probability of each hypothesis h ∈ H, using Bayes Theorem:

  P(h|D) = P(D|h)P(h) / P(D)

• There are two cases to consider:
  1. h is inconsistent with the training data D.
     Then P(D|h) = 0, and

     P(h|D) = (0 · P(h)) / P(D) = 0
Bayes Theorem and Concept Learning (cont)
  2. h is consistent with the training data D.
     Then P(D|h) = 1, and

     P(h|D) = (1 · 1/|H|) / P(D)
            = (1 · 1/|H|) / (|VS_{H,D}|/|H|)
            = 1/|VS_{H,D}|

     where VS_{H,D} (the version space) is the subset of hypotheses in H consistent with D.

     The value of P(D) follows from the theorem of total probability, since the hypotheses are mutually exclusive ((∀i ≠ j) P(hi ∧ hj) = 0) and Σ_{hi∈H} P(hi) = 1:

     P(D) = Σ_{hi∈H} P(D|hi)P(hi)
          = Σ_{hi∈VS_{H,D}} 1 · 1/|H| + Σ_{hi∉VS_{H,D}} 0 · 1/|H|
          = Σ_{hi∈VS_{H,D}} 1 · 1/|H|
          = |VS_{H,D}|/|H|
Bayes Theorem and Concept Learning (cont)
• In sum, if we assume
  – a uniform prior probability distribution over H (i.e. P(hi) = P(hj), 1 ≤ i, j ≤ |H|)
  – deterministic, noise-free data (i.e. P(D|h) = 1 if D and h are consistent; 0 otherwise)
  then Bayes Theorem tells us

  P(h|D) = 1/|VS_{H,D}| if h is consistent with D, and 0 otherwise

• Thus, every consistent hypothesis has posterior probability 1/|VS_{H,D}| and is a MAP hypothesis.
• We can think of the posterior probability evolving as training examples are presented: from an initial, even distribution over all hypotheses to a concentrated distribution over those hypotheses consistent with the examples.
  (Figure: (a) the uniform prior P(h); (b) the posterior P(h|D1) after one example; (c) the posterior P(h|D1,D2) after two examples.)
MAP Hypotheses and Consistent Learners
• A consistent learner is any learner that outputs a hypothesis that commits 0 errors over the training examples.
• Each such learner, under the two assumptions of
  – a uniform prior probability distribution over H
  – deterministic, noise-free data
  outputs a MAP hypothesis.
• Consider FIND-S, the simple concept learner that searches H and outputs a maximally specific consistent hypothesis.
• Since FIND-S outputs a consistent hypothesis (i.e. is a consistent learner), then under the above assumptions about P(h) and P(D|h), FIND-S outputs a MAP hypothesis.
  – Note that FIND-S does not explicitly manipulate probabilities, but using a Bayesian framework we can analyse it to show that its outputs are MAP hypotheses.
MAP Hypotheses and Consistent Learners (cont)
• Since FIND-S outputs a maximally specific hypothesis, under any prior probability distribution favouring more specific hypotheses FIND-S will also output a MAP hypothesis.
  E.g. a probability distribution P(h) over H that assigns P(h1) ≥ P(h2) whenever h1 is more specific than h2.
• The Bayesian framework gives us a way to characterise a learning algorithm by identifying probability distributions (P(h) and P(D|h)) under which the algorithm outputs MAP hypotheses (analogous to characterising ML algorithms by inductive bias).
Learning A Real Valued Function
• A Bayesian analysis can be used to show, under certain assumptions, that any learning algorithm that minimises the squared error between hypothesis and training data in learning a real-valued function will output a maximum likelihood hypothesis.
• Consider any real-valued target function f and training examples 〈xi, di〉, where di is a noisy training value:
  – di = f(xi) + ei
  – ei is a random variable (noise) drawn independently for each xi according to some Normal distribution with mean 0
  (Figure: noisy training points y = f(x) + e scattered around the target function f, with the maximum likelihood hypothesis hML fit through them.)
• Then the maximum likelihood hypothesis hML is the one that minimises the sum of squared errors:

  hML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
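This can be illustrated with a standard least-squares line fit; the target function, noise level, and data below are all illustrative:

```python
# Sketch: minimising squared error recovers the target under Gaussian noise.
import random

random.seed(0)
f = lambda x: 2.0 * x + 1.0                      # hypothetical target f
xs = [i / 10 for i in range(50)]
ds = [f(x) + random.gauss(0, 0.3) for x in xs]   # di = f(xi) + ei, ei ~ N(0, 0.3^2)

# Closed-form least-squares estimates of slope and intercept
n = len(xs)
mx, md = sum(xs) / n, sum(ds) / n
sxx = sum((x - mx) ** 2 for x in xs)
slope = sum((x - mx) * (d - md) for x, d in zip(xs, ds)) / sxx
intercept = md - slope * mx
print(round(slope, 1), round(intercept, 1))  # close to the true 2.0 and 1.0
```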
Learning A Real Valued Function (cont)
• Why? Roughly . . . (see Mitchell for full details)
• To associate probabilities with continuous variables like e we must use probability densities (so the integral of the probability density p over all values is one, rather than the sum of the probabilities P of discrete values being one).
• As before, the maximum likelihood hypothesis is the hypothesis that maximises the probability of the data:

  hML = argmax_{h∈H} p(D|h)
      = argmax_{h∈H} ∏_{i=1}^{m} p(di|h)    (assuming the training examples are mutually independent)
      = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^{−(1/2)((di − h(xi))/σ)²}

  (since ei is Normally distributed with variance σ² and mean µ = 0, di = f(xi) + ei is too, with variance σ² and mean µ = f(xi) = h(xi))
Learning A Real Valued Function (cont)
• Maximise the natural log instead (maximising ln p maximises p):

  hML = argmax_{h∈H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − (1/2)((di − h(xi))/σ)² ]
      = argmax_{h∈H} Σ_{i=1}^{m} −(1/2)((di − h(xi))/σ)²
      = argmax_{h∈H} Σ_{i=1}^{m} −(di − h(xi))²
      = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
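The equivalence can be checked numerically: over any set of candidate hypotheses, the Gaussian log-likelihood and the sum of squared errors pick out the same one. The data and candidate lines below are illustrative:

```python
# The log-likelihood maximiser equals the squared-error minimiser.
import math

xs = [0.0, 1.0, 2.0, 3.0]
ds = [1.1, 2.9, 5.2, 6.8]          # hypothetical noisy observations
sigma = 0.5

def log_likelihood(h):
    return sum(math.log(1 / math.sqrt(2 * math.pi * sigma**2))
               - 0.5 * ((d - h(x)) / sigma) ** 2 for x, d in zip(xs, ds))

def sse(h):
    return sum((d - h(x)) ** 2 for x, d in zip(xs, ds))

hypotheses = [lambda x, a=a: a * x + 1.0 for a in (1.5, 2.0, 2.5)]
best_ll = max(hypotheses, key=log_likelihood)
best_sse = min(hypotheses, key=sse)
print(best_ll is best_sse)  # → True
```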
Minimum Description Length Principle
• The Minimum Description Length (MDL) Principle is the principle that shorter encodings (e.g. of hypotheses) are to be preferred.
  It too can be given an interpretation in the Bayesian framework.
• In the machine learning setting, MDL says: prefer the hypothesis hMDL such that

  hMDL = argmin_{h∈H} L_{C1}(h) + L_{C2}(D|h)

  where L_C(x) is the description length of x under encoding C.
  I.e. prefer the hypothesis that minimises the length of encoding the hypothesis, plus the length of encoding the data using the hypothesis.
• Example: H = decision trees, D = training data labels
  – L_{C1}(h) is the number of bits to describe tree h
  – L_{C2}(D|h) is the number of bits to describe D given h
    ∗ Note L_{C2}(D|h) = 0 if the examples are classified perfectly by h; we need only describe the exceptions
  – Hence hMDL trades off tree size against training errors
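The trade-off can be sketched with made-up description lengths. Suppose each tree node costs 4 bits to encode and each misclassified example costs 6 bits to describe as an exception (both numbers are purely illustrative):

```python
# MDL trade-off: neither the smallest tree nor the error-free tree wins.
candidates = [
    # (name, number of nodes, training errors)
    ("small tree", 3, 5),
    ("medium tree", 7, 1),
    ("large tree", 15, 0),
]

def description_length(nodes, errors):
    return 4 * nodes + 6 * errors   # L_C1(h) + L_C2(D|h)

h_mdl = min(candidates, key=lambda c: description_length(c[1], c[2]))
print(h_mdl[0])  # → medium tree (34 bits vs 42 and 60)
```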
Minimum Description Length Principle (cont)
• Recall the definition of the MAP hypothesis:

  hMAP = argmax_{h∈H} P(D|h)P(h)
       = argmax_{h∈H} log2 P(D|h) + log2 P(h)
       = argmin_{h∈H} − log2 P(D|h) − log2 P(h)   (1)

• An interesting fact from information theory:
  The optimal (shortest expected coding length) code for an event with probability p is − log2 p bits.
• So we can interpret (1):
  − log2 P(h) is the length of h under the optimal code
  − log2 P(D|h) is the length of D given h under the optimal code
  → prefer the hypothesis that minimises

  length(h) + length(misclassifications)
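The information-theoretic fact can be checked numerically for a few dyadic probabilities (the values are chosen purely for illustration):

```python
# Optimal (shortest expected) code length for an event with probability p
# is -log2(p) bits.
import math

code_lengths = {p: -math.log2(p) for p in (0.5, 0.25, 0.0625)}
print(code_lengths)  # → {0.5: 1.0, 0.25: 2.0, 0.0625: 4.0}
```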
Summary
• Bayesian methods are important for machine learning both because they supply specific
algorithms for ML and because they provide a framework for analysing other ML algorithms
• Generally we would like to find the most probable hypothesis given the training data – the
MAP hypothesis
• A brute force approach to calculating the MAP hypothesis involves computing the posterior
probability of every hypothesis and then selecting the one with the highest probability – such
an approach will not in general be computationally feasible
• In the case of concept learning given noise-free data, a Bayesian analysis shows that all
consistent hypotheses are MAP hypotheses which have equivalent probability, equal to one
over the size of the version space. Hence FIND-S is shown to output MAP hypotheses
• Bayesian analysis can also be used to show that any hypothesis that minimises the squared
error in learning a real-valued target function is a maximum likelihood hypothesis
• The MAP hypothesis can be interpreted in the light of the minimum description length
principle as the hypothesis for which the length of the hypothesis plus the length of any
residual misclassifications is minimised