Introduction to Bayesian Learning
Lecture Outline:
• Relevance of Bayesian Methods for Machine Learning
• Bayes Theorem and MAP/ML Hypotheses
• Bayes Theorem and Concept Learning
• Maximum Likelihood and Least-Squared Error Hypotheses
• Minimum Description Length Principle
Reading:
Sections 6.1–6.4 and 6.6 of Mitchell
COM3250 / 6170 1 2010-2011
Relevance of Bayesian Methods for Machine Learning
Bayesian methods have two distinct, important roles in machine learning:
1. They provide practical learning algorithms which explicitly manipulate probabilities:
• Naive Bayes learning
• Bayesian belief network learning
A significant feature of these Bayesian learning algorithms is that they allow us to combine
prior knowledge (prior probabilities) with observed data
2. They provide a useful conceptual framework:
• A “gold standard” for evaluating other learning algorithms
• Additional insight into Occam’s razor and the inductive bias of decision tree learning in
favour of short hypotheses
Bayes Theorem
P(h|D) = P(D|h)P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(D|h) = probability of D given h
• P(h|D) = probability of h given D
Example
A patient takes a cancer test and the result comes back positive. The test returns a correct positive
result in only 98% of the cases in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire
population have this cancer. What is the probability that the patient has cancer?
• h is the hypothesis that the patient has cancer
  – P(h) = prior probability of hypothesis h = .008
• D is data concerning positive outcomes of the test
  – P(D) = (.008 ∗ .98) + (.992 ∗ .03) = .0376
• P(D|h) is the probability of a positive outcome to the test given that a patient has cancer
  – P(D|h) = .98
Calculate P(h|D), the probability of the patient having cancer given a positive outcome to the test, using Bayes Theorem:

P(h|D) = P(D|h)P(h) / P(D) = (.98 ∗ .008) / .0376 = .2085
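This calculation can be sketched in a few lines of Python, using the probabilities given in the problem statement:

```python
# Cancer-test example: the values below are the ones from the slide.
p_cancer = 0.008              # prior P(h)
p_pos_given_cancer = 0.98     # P(D|h): true positive rate
p_pos_given_healthy = 0.03    # false positive rate (1 - 0.97 specificity)

# Total probability of a positive test, P(D)
p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy

# Bayes Theorem: P(h|D) = P(D|h)P(h) / P(D)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 4))  # → 0.2085
```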
MAP/ML Hypotheses
Bayes Theorem: P(h|D) = P(D|h)P(h) / P(D)

• Generally we want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis, hMAP:

  hMAP ≡ argmax_{h∈H} P(h|D)
       = argmax_{h∈H} P(D|h)P(h) / P(D)
       = argmax_{h∈H} P(D|h)P(h)

The last step is justified since P(D) does not depend on h, and so has no impact on which h maximises the expression.
MAP/ML Hypotheses (cont)
• In some cases it may be useful to assume every hypothesis is equally probable a priori,
  i.e. P(hi) = P(hj) for all hi, hj ∈ H.
  – In such cases we need only consider P(D|h) to find the most probable hypothesis
  – P(D|h) is called the likelihood of data D given h
  – Any hypothesis that maximises P(D|h) is called a maximum likelihood (ML) hypothesis:

    hML = argmax_{h∈H} P(D|h)
Example Again
A patient takes a cancer test and the result comes back positive. The test returns a correct positive
result in only 98% of the cases in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire
population have this cancer. What is the MAP hypothesis?
• From the problem statement we see that in general:

  P(cancer) = .008        P(¬cancer) = .992
  P(+|cancer) = .98       P(−|cancer) = .02
  P(+|¬cancer) = .03      P(−|¬cancer) = .97

• Given that a patient returns a positive test, what is the MAP hypothesis?
  As noted: hMAP = argmax_{h∈H} P(D|h)P(h)

  P(+|cancer)P(cancer) = .98 ∗ .008 = .0078
  P(+|¬cancer)P(¬cancer) = .03 ∗ .992 = .0298

  Therefore, the MAP hypothesis is that the patient does not have cancer.
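A minimal sketch of this comparison in Python; since P(D) is the same for both hypotheses, it suffices to compare P(D|h)P(h) directly:

```python
# Compare P(D|h)P(h) for the two hypotheses (values from the slide).
scores = {
    "cancer":     0.98 * 0.008,   # P(+|cancer)P(cancer)   = .0078
    "not cancer": 0.03 * 0.992,   # P(+|¬cancer)P(¬cancer) = .0298
}
h_map = max(scores, key=scores.get)
print(h_map)  # → not cancer
```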
Bayes Theorem and Concept Learning
• Recall the concept learning problem:
  Given:
  – an instance space X
  – a hypothesis space H
  – a target concept c : X → {0,1}
  – a sequence of training examples 〈〈x1,d1〉 . . . 〈xm,dm〉〉, where di = c(xi)
  Learn a hypothesis h such that h(x) = c(x) for as many x ∈ X as possible.
• Brute-Force MAP Hypothesis Learner:
  1. For each hypothesis h in H, calculate the posterior probability

     P(h|D) = P(D|h)P(h) / P(D)

  2. Output the hypothesis hMAP with the highest posterior probability:

     hMAP = argmax_{h∈H} P(h|D)
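The two steps above can be sketched as follows, over a tiny, hypothetical hypothesis space (threshold concepts on integers); all names and the example data are illustrative, not from Mitchell:

```python
# Minimal sketch of the Brute-Force MAP learner.

def likelihood(h, data):
    """P(D|h) under the noise-free assumption: 1 if h is consistent with D, else 0."""
    return 1.0 if all(h(x) == d for x, d in data) else 0.0

def brute_force_map(hypotheses, prior, data):
    """Step 1: score every hypothesis; step 2: return the highest-posterior one."""
    scores = [(likelihood(h, data) * prior(h), h) for h in hypotheses]
    p_d = sum(s for s, _ in scores)   # P(D), by the theorem of total probability
    return max(scores, key=lambda t: t[0])[1], p_d

# Toy space: h_t(x) = (x >= t) for thresholds t = 0..4
hypotheses = [lambda x, t=t: x >= t for t in range(5)]
prior = lambda h: 1 / len(hypotheses)     # uniform prior P(h) = 1/|H|
data = [(1, False), (3, True)]            # noise-free training examples
h, p_d = brute_force_map(hypotheses, prior, data)
print(p_d)  # → 0.4, i.e. 2 of the 5 hypotheses are consistent
```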
Bayes Theorem and Concept Learning (cont)
• The brute-force MAP learning approach may be computationally infeasible:
  – it requires applying Bayes Theorem to all h ∈ H
• It is still useful as a standard against which other concept learning approaches may be judged.
• How can we apply Brute-Force MAP to concept learning? Assume:
  1. The training data are noise free: di = c(xi).
  2. The target concept c is in the hypothesis space H.
  3. There is no a priori reason to believe any one hypothesis more probable than any other.
• Given 2. and 3. we should choose

  P(h) = 1/|H|  for all h ∈ H

• Given 1., P(D|h), the probability of observing the target values D = 〈d1 . . . dm〉 for fixed instances 〈x1 . . . xm〉 if h is true, is

  P(D|h) = 1 if di = h(xi) for all di ∈ D, and 0 otherwise
Bayes Theorem and Concept Learning (cont)
• Given these choices for P(h) and P(D|h) we can now explore how the Brute-Force MAP Learning algorithm would proceed.
• Step 1: compute the posterior probability of each hypothesis h ∈ H, using Bayes Theorem:

  P(h|D) = P(D|h)P(h) / P(D)

• There are two cases to consider:
  1. h is inconsistent with the training data D.
     Then P(D|h) = 0, and

     P(h|D) = (0 · P(h)) / P(D) = 0
Bayes Theorem and Concept Learning (cont)
  2. h is consistent with the training data D.
     Then P(D|h) = 1, and

     P(h|D) = (1 · 1/|H|) / P(D)
            = (1 · 1/|H|) / (|VS_{H,D}|/|H|)
            = 1/|VS_{H,D}|

     where VS_{H,D} (the version space) is the subset of hypotheses in H consistent with D.

     The value of P(D) follows from the theorem of total probability, since the hypotheses are mutually exclusive ((∀i ≠ j) P(hi ∧ hj) = 0) and Σ_{hi∈H} P(hi) = 1:

     P(D) = Σ_{hi∈H} P(D|hi)P(hi)
          = Σ_{hi∈VS_{H,D}} 1 · 1/|H| + Σ_{hi∉VS_{H,D}} 0 · 1/|H|
          = Σ_{hi∈VS_{H,D}} 1 · 1/|H|
          = |VS_{H,D}|/|H|
Bayes Theorem and Concept Learning (cont)
• In sum, if we assume
  – a uniform prior probability distribution over H (i.e. P(hi) = P(hj), 1 ≤ i, j ≤ |H|)
  – deterministic, noise-free data (i.e. P(D|h) = 1 if D and h are consistent; 0 otherwise)
  then Bayes Theorem tells us

  P(h|D) = 1/|VS_{H,D}| if h is consistent with D, and 0 otherwise

• Thus, every consistent hypothesis has posterior probability 1/|VS_{H,D}| and is a MAP hypothesis.
• We can think of the posterior probability evolving as training examples are presented: from an initial, even distribution over all hypotheses to a concentrated distribution over those hypotheses consistent with the examples.
  (Figure: (a) the uniform prior P(h); (b) the posterior P(h|D1) after one example; (c) the posterior P(h|D1,D2) after two examples.)
MAP Hypotheses and Consistent Learners
• A consistent learner is any learner that outputs a hypothesis that commits 0 errors over the training examples.
• Each such learner, under the two assumptions of
  – a uniform prior probability distribution over H
  – deterministic, noise-free data
  outputs a MAP hypothesis.
• Consider FIND-S, the simple concept learner that searches H and outputs a maximally specific consistent hypothesis.
• Since FIND-S outputs a consistent hypothesis (i.e. is a consistent learner), then under the above assumptions about P(h) and P(D|h), FIND-S outputs a MAP hypothesis.
  – Note that FIND-S does not explicitly manipulate probabilities, but using a Bayesian framework we can analyse it to show that its outputs are MAP hypotheses.
MAP Hypotheses and Consistent Learners (cont)
• Since FIND-S outputs a maximally specific hypothesis, under any prior probability distribution favouring more specific hypotheses FIND-S will also output a MAP hypothesis.
  E.g. a probability distribution P(h) over H that assigns P(h1) ≥ P(h2) whenever h1 is more specific than h2.
• The Bayesian framework gives us a way to characterise a learning algorithm by identifying probability distributions (P(h) and P(D|h)) under which the algorithm outputs MAP hypotheses (analogous to characterising ML algorithms by inductive bias).
Learning A Real Valued Function
• A Bayesian analysis can be used to show, under certain assumptions, that any learning algorithm that minimises the squared error between hypothesis and training data in learning a real-valued function will output a maximum likelihood hypothesis.
• Consider any real-valued target function f and training examples 〈xi, di〉, where di is a noisy training value:
  – di = f(xi) + ei
  – ei is a random variable (noise) drawn independently for each xi according to some Normal distribution with mean 0
  (Figure: noisy training points y = f(x) + e scattered around the target function f, with the maximum likelihood hypothesis hML fit through them.)
• Then the maximum likelihood hypothesis hML is the one that minimises the sum of squared errors:

  hML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
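This can be illustrated with a standard least-squares line fit; the target function, noise level, and data below are all illustrative:

```python
# Sketch: minimising squared error recovers the target under Gaussian noise.
import random

random.seed(0)
f = lambda x: 2.0 * x + 1.0                      # hypothetical target f
xs = [i / 10 for i in range(50)]
ds = [f(x) + random.gauss(0, 0.3) for x in xs]   # di = f(xi) + ei, ei ~ N(0, 0.3^2)

# Closed-form least-squares estimates of slope and intercept
n = len(xs)
mx, md = sum(xs) / n, sum(ds) / n
sxx = sum((x - mx) ** 2 for x in xs)
slope = sum((x - mx) * (d - md) for x, d in zip(xs, ds)) / sxx
intercept = md - slope * mx
print(round(slope, 1), round(intercept, 1))  # close to the true 2.0 and 1.0
```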
Learning A Real Valued Function (cont)
• Why? Roughly . . . (see Mitchell for full details)
• To associate probabilities with continuous variables like e we must use probability densities (so the integral of the probability density p over all values is one, rather than the sum of the probabilities P of discrete values being one).
• As before, the maximum likelihood hypothesis is the hypothesis that maximises the probability of the data:

  hML = argmax_{h∈H} p(D|h)
      = argmax_{h∈H} ∏_{i=1}^{m} p(di|h)    (assuming the training examples are mutually independent)
      = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^{−(1/2)((di − h(xi))/σ)²}

  (since ei is Normally distributed with variance σ² and mean µ = 0, di = f(xi) + ei is too, with variance σ² and mean µ = f(xi) = h(xi))
Learning A Real Valued Function (cont)
• Maximise the natural log instead (maximising ln p maximises p):

  hML = argmax_{h∈H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − (1/2)((di − h(xi))/σ)² ]
      = argmax_{h∈H} Σ_{i=1}^{m} −(1/2)((di − h(xi))/σ)²
      = argmax_{h∈H} Σ_{i=1}^{m} −(di − h(xi))²
      = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
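The equivalence can be checked numerically: over any set of candidate hypotheses, the Gaussian log-likelihood and the sum of squared errors pick out the same one. The data and candidate lines below are illustrative:

```python
# The log-likelihood maximiser equals the squared-error minimiser.
import math

xs = [0.0, 1.0, 2.0, 3.0]
ds = [1.1, 2.9, 5.2, 6.8]          # hypothetical noisy observations
sigma = 0.5

def log_likelihood(h):
    return sum(math.log(1 / math.sqrt(2 * math.pi * sigma**2))
               - 0.5 * ((d - h(x)) / sigma) ** 2 for x, d in zip(xs, ds))

def sse(h):
    return sum((d - h(x)) ** 2 for x, d in zip(xs, ds))

hypotheses = [lambda x, a=a: a * x + 1.0 for a in (1.5, 2.0, 2.5)]
best_ll = max(hypotheses, key=log_likelihood)
best_sse = min(hypotheses, key=sse)
print(best_ll is best_sse)  # → True
```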
Minimum Description Length Principle
• The Minimum Description Length (MDL) Principle is the principle that shorter encodings (e.g. of hypotheses) are to be preferred.
  It too can be given an interpretation in the Bayesian framework.
• In the machine learning setting, MDL says: prefer the hypothesis hMDL such that

  hMDL = argmin_{h∈H} L_{C1}(h) + L_{C2}(D|h)

  where L_C(x) is the description length of x under encoding C.
  I.e. prefer the hypothesis that minimises the length of encoding the hypothesis, plus the length of encoding the data using the hypothesis.
• Example: H = decision trees, D = training data labels
  – L_{C1}(h) is the number of bits to describe tree h
  – L_{C2}(D|h) is the number of bits to describe D given h
    ∗ Note L_{C2}(D|h) = 0 if the examples are classified perfectly by h; we need only describe the exceptions
  – Hence hMDL trades off tree size against training errors
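The trade-off can be sketched with made-up description lengths. Suppose each tree node costs 4 bits to encode and each misclassified example costs 6 bits to describe as an exception (both numbers are purely illustrative):

```python
# MDL trade-off: neither the smallest tree nor the error-free tree wins.
candidates = [
    # (name, number of nodes, training errors)
    ("small tree", 3, 5),
    ("medium tree", 7, 1),
    ("large tree", 15, 0),
]

def description_length(nodes, errors):
    return 4 * nodes + 6 * errors   # L_C1(h) + L_C2(D|h)

h_mdl = min(candidates, key=lambda c: description_length(c[1], c[2]))
print(h_mdl[0])  # → medium tree (34 bits vs 42 and 60)
```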
Minimum Description Length Principle (cont)
• Recall the definition of the MAP hypothesis:

  hMAP = argmax_{h∈H} P(D|h)P(h)
       = argmax_{h∈H} log2 P(D|h) + log2 P(h)
       = argmin_{h∈H} − log2 P(D|h) − log2 P(h)   (1)

• An interesting fact from information theory:
  The optimal (shortest expected coding length) code for an event with probability p is − log2 p bits.
• So we can interpret (1):
  − log2 P(h) is the length of h under the optimal code
  − log2 P(D|h) is the length of D given h under the optimal code
  → prefer the hypothesis that minimises

  length(h) + length(misclassifications)
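The information-theoretic fact can be checked numerically for a few dyadic probabilities (the values are chosen purely for illustration):

```python
# Optimal (shortest expected) code length for an event with probability p
# is -log2(p) bits.
import math

code_lengths = {p: -math.log2(p) for p in (0.5, 0.25, 0.0625)}
print(code_lengths)  # → {0.5: 1.0, 0.25: 2.0, 0.0625: 4.0}
```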
Summary
• Bayesian methods are important for machine learning both because they supply specific
algorithms for ML and because they provide a framework for analysing other ML algorithms
• Generally we would like to find the most probable hypothesis given the training data – the
MAP hypothesis
• A brute force approach to calculating the MAP hypothesis involves computing the posterior
probability of every hypothesis and then selecting the one with the highest probability – such
an approach will not in general be computationally feasible
• In the case of concept learning given noise-free data, a Bayesian analysis shows that all
consistent hypotheses are MAP hypotheses which have equivalent probability, equal to one
over the size of the version space. Hence FIND-S is shown to output MAP hypotheses
• Bayesian analysis can also be used to show that any hypothesis that minimises the squared
error in learning a real-valued target function is a maximum likelihood hypothesis
• The MAP hypothesis can be interpreted in the light of the minimum description length
principle as the hypothesis for which the length of the hypothesis plus the length of any
residual misclassifications is minimised