Computational and Statistical Learning Theory
TTIC 31120
Prof. Nati Srebro
Lecture 4: MDL and PAC-Bayes
Uniform vs Non-Uniform Bias
• No Free Lunch: we need some "inductive bias"
• Limiting attention to a hypothesis class $\mathcal{H}$: "flat" bias
  • $p(h) = \frac{1}{|\mathcal{H}|}$ for $h \in \mathcal{H}$, and $p(h) = 0$ otherwise
• Non-uniform bias: $p(h)$ encodes the bias
  • Can use any $p(h) \ge 0$ s.t. $\sum_h p(h) \le 1$
  • E.g. choose a prefix-disambiguous encoding $\sigma(h)$ and use $p(h) = 2^{-|\sigma(h)|}$ (see the sketch after this slide)
  • Or choose an interpreter $J : \Pi \to \mathcal{Y}^{\mathcal{X}}$ over prefix-disambiguous programs $\Pi \subseteq \{0,1\}^*$ and use $p(h) = 2^{-\min_{J(\pi)=h} |\pi|}$
• The choice of $\mathcal{H}$, $p(h)$, or $\sigma(h)$ encodes our expert knowledge / inductive bias
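To make the prefix-code construction concrete, here is a minimal Python sketch (the hypothesis names and codewords are invented for illustration, not from the lecture): a prefix-disambiguous code $\sigma$ induces $p(h) = 2^{-|\sigma(h)|}$, and Kraft's inequality guarantees $\sum_h p(h) \le 1$.

```python
# Toy illustration: a prefix-disambiguous code over hypotheses induces a
# "prior" p(h) = 2^{-|sigma(h)|}, and Kraft's inequality gives sum p(h) <= 1.

# Hypothetical prefix-free encodings of four hypotheses (invented names).
sigma = {
    "h_const_plus": "0",     # 1 bit  -> p = 1/2
    "h_const_minus": "10",   # 2 bits -> p = 1/4
    "h_parity": "110",       # 3 bits -> p = 1/8
    "h_threshold": "111",    # 3 bits -> p = 1/8
}

def is_prefix_free(codes):
    """True iff no codeword is a proper prefix of another codeword."""
    words = list(codes.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

p = {h: 2.0 ** -len(code) for h, code in sigma.items()}

assert is_prefix_free(sigma)
assert sum(p.values()) <= 1 + 1e-12  # Kraft: sum_h 2^{-|sigma(h)|} <= 1
print(p)  # shorter description <=> larger prior mass
```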
Minimum Description Length Learning
• Choose a "prior" $p(h)$ s.t. $\sum_h p(h) \le 1$ (or a description language $\sigma(h)$, or an interpreter $J(\pi)$)
• Minimum Description Length learning rule (based on the above prior / description language; see the sketch after this slide):
  $MDL_p(S) = \arg\max_{L_S(h)=0} p(h) = \arg\min_{L_S(h)=0} |\sigma(h)|$
• For any $\mathcal{D}$, w.p. $\ge 1-\delta$:
  $L(MDL_p(S)) \le \inf_{h \text{ s.t. } L_S(h)=0} \sqrt{\frac{-\log p(h) + \log 2/\delta}{2m}}$
• Sample complexity:
  $m(\epsilon, \delta, h) = O\!\left(\frac{-\log p(h)}{\epsilon^2}\right) = O\!\left(\frac{|\sigma(h)|}{\epsilon^2}\right)$
  (a more careful analysis of the realizable case gives $O\!\left(\frac{|\sigma(h)|}{\epsilon}\right)$)
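A minimal sketch of the MDL rule and its bound, under an assumed toy interface in which each hypothesis comes paired with its prior mass (the threshold class and all numbers are invented for illustration):

```python
import math

# MDL sketch under an assumed toy interface: a countable class given as
# (hypothesis, prior-mass) pairs with sum p(h) <= 1.

def mdl(S, weighted_hypotheses):
    """Among hypotheses with zero empirical error on S, return the one with
    the largest prior mass p(h) (equivalently, the shortest description)."""
    consistent = [(h, p) for h, p in weighted_hypotheses
                  if all(h(x) == y for x, y in S)]
    if not consistent:
        return None
    return max(consistent, key=lambda hp: hp[1])[0]

def mdl_bound(p_h, m, delta):
    """W.p. >= 1 - delta, every consistent h satisfies
    L(h) <= sqrt((-log p(h) + log(2/delta)) / (2m))."""
    return math.sqrt((-math.log(p_h) + math.log(2 / delta)) / (2 * m))

# Toy usage: integer thresholds, biased toward small thresholds.
hypotheses = [(lambda x, t=t: 1 if x >= t else -1, 2.0 ** -(t + 1))
              for t in range(10)]
S = [(1, -1), (5, 1), (7, 1)]
h_hat = mdl(S, hypotheses)           # picks the consistent t with largest p
print(mdl_bound(p_h=2.0 ** -3, m=len(S), delta=0.05))
```

In this toy run the returned hypothesis is the consistent threshold $t = 2$, whose prior mass $2^{-3}$ is what enters the bound; no property of the whole (countable) class appears.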
MDL and Universal Learning
• Theorem: For any $\mathcal{H}$ and $p : \mathcal{H} \to [0,1]$ s.t. $\sum_h p(h) \le 1$, and any source distribution $\mathcal{D}$, if there exists $h^*$ with $L(h^*) = 0$ and $p(h^*) > 0$, then w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^m$:
  $L(MDL_p(S)) \le \sqrt{\frac{-\log p(h^*) + \log 2/\delta}{2m}}$
• Can learn any countable class!
  • The class of all computable functions, with $p(h) = 2^{-\min_{J(\pi)=h} |\pi|}$
  • Any class enumerable via $f : \mathbb{N} \to \mathcal{H}$, with $p(f(i)) = 2^{-i}$
• But VCdim(all computable functions) $= \infty$!
• Why no contradiction to the Fundamental Theorem?
  • PAC learning: the sample complexity $m(\epsilon, \delta)$ is uniform over all $h \in \mathcal{H}$; it depends only on the class $\mathcal{H}$, not on the specific $h^*$.
  • MDL: the sample complexity $m(\epsilon, \delta, h)$ depends on $h$.
Uniform and Non-Uniform Learnability
• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\exists m(\epsilon,\delta)$, $\forall \mathcal{D}$, $\forall h$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta)}$:
  $L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$
• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\exists m(\epsilon,\delta,h)$, $\forall \mathcal{D}$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h)}$:
  $L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$
• Theorem: $\forall \mathcal{D}$, if there exists $h$ with $L(h) = 0$, then w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^m$:
  $L(MDL_p(S)) \le \sqrt{\frac{-\log p(h) + \log 2/\delta}{2m}}$
• Can we compete also with $h$ s.t. $L(h) > 0$?
Allowing Errors: From MDL to SRM
• $L(h) \le L_S(h) + \sqrt{\frac{-\log p(h) + \log 2/\delta}{2m}}$
• Structural Risk Minimization (see the sketch after this slide):
  $SRM_p(S) = \arg\min_h \; L_S(h) + \sqrt{\frac{-\log p(h)}{2m}}$
• Theorem: For any prior $p(h)$, $\sum_h p(h) \le 1$, and any source distribution $\mathcal{D}$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^m$:
  $L(SRM_p(S)) \le \inf_h \left[ L(h) + 2\sqrt{\frac{-\log p(h) + \log 2/\delta}{m}} \right]$
• In the SRM objective, the first term $L_S(h)$ (minimized by ERM) fits the data, while the second term $-\log p(h)$ (minimized by MDL) matches the prior, i.e. prefers a simple / short description.
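A matching sketch of the SRM rule, with the same assumed toy interface as the MDL sketch above (hypotheses paired with their prior mass):

```python
import math

# Minimal SRM sketch (toy interface as in the MDL sketch): trade off
# empirical error against the description-length penalty.

def srm(S, weighted_hypotheses):
    """Return argmin_h  L_S(h) + sqrt(-log p(h) / (2m))."""
    m = len(S)
    def objective(hp):
        h, p = hp
        emp_error = sum(h(x) != y for x, y in S) / m
        return emp_error + math.sqrt(-math.log(p) / (2 * m))
    return min(weighted_hypotheses, key=objective)[0]
```

Unlike MDL, SRM always returns a hypothesis: when nothing is consistent with $S$, it simply pays the empirical-error term.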
Non-Uniform Learning: Beyond Cardinality
• MDL is still essentially based on cardinality ("how many hypotheses are simpler than me") and ignores the relationships between predictors.
• It generalizes the cardinality bound: using $p(h) = \frac{1}{|\mathcal{H}|}$ we get
  $m(\epsilon, \delta, h) = m(\epsilon, \delta) = \frac{\log|\mathcal{H}| + \log 2/\delta}{\epsilon^2}$
• Can we treat continuous classes (e.g. linear predictors)? Move from cardinality to the "growth function"?
• E.g.:
  • $\mathcal{H} = \{\, \mathrm{sign}(f(\phi(x))) \mid f : \mathbb{R}^d \to \mathbb{R} \text{ is a polynomial} \,\}$, with $\phi : \mathcal{X} \to \mathbb{R}^d$
  • $\mathrm{VCdim}(\mathcal{H}) = \infty$; $\mathcal{H}$ is uncountable, and there is no distribution with $\forall_{h \in \mathcal{H}}\; p(h) > 0$. But what if we bias toward lower-order polynomials?
• Answer 1: a prior over hypothesis classes. Write $\mathcal{H} = \bigcup_r \mathcal{H}_r$ (e.g. $\mathcal{H}_r$ = degree-$r$ polynomials) and use a prior $p(\mathcal{H}_r)$ over the hypothesis classes.
Prior Over Hypothesis Classes
• VC bound: $\forall h \in \mathcal{H}_r$, w.p. $\ge 1-\delta_r$,
  $L(h) \le L_S(h) + O\!\left(\sqrt{\frac{\mathrm{VCdim}(\mathcal{H}_r) + \log 1/\delta_r}{m}}\right)$
• Setting $\delta_r = p(\mathcal{H}_r)\,\delta$ and taking a union bound: w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^m$, $\forall r$, $\forall h \in \mathcal{H}_r$,
  $L(h) \le L_S(h) + O\!\left(\sqrt{\frac{\mathrm{VCdim}(\mathcal{H}_r) - \log p(\mathcal{H}_r) + \log 1/\delta}{m}}\right)$
• Structural Risk Minimization over hypothesis classes (see the sketch after this slide):
  $SRM(S) = \arg\min_{h \in \mathcal{H}_r} \; L_S(h) + C\sqrt{\frac{-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r)}{m}}$
• Theorem: w.p. $\ge 1-\delta$,
  $L_{\mathcal{D}}(SRM(S)) \le \min_{\mathcal{H}_r,\, h \in \mathcal{H}_r} L_{\mathcal{D}}(h) + O\!\left(\sqrt{\frac{-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r) + \log 1/\delta}{m}}\right)$
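A sketch of SRM over hypothesis classes, under an assumed summary interface: each class is described only by its ERM training error, its VC dimension, and its prior mass (all numbers are invented):

```python
import math

# SRM-over-classes sketch: minimize the penalized empirical risk from the
# theorem above, given per-class summaries.

def srm_over_classes(classes, m, C=1.0):
    """classes: list of dicts with keys 'erm_error', 'vc', 'prior'.
    Returns the index r minimizing L_S(h_r) + C*sqrt((-log p + VC)/m)."""
    def penalized(c):
        return c["erm_error"] + C * math.sqrt(
            (-math.log(c["prior"]) + c["vc"]) / m)
    return min(range(len(classes)), key=lambda r: penalized(classes[r]))

# Invented numbers: higher-"degree" classes fit better but pay more penalty.
classes = [
    {"erm_error": 0.20, "vc": 2,  "prior": 0.5},    # e.g. degree-1
    {"erm_error": 0.05, "vc": 6,  "prior": 0.25},   # e.g. degree-2
    {"erm_error": 0.04, "vc": 20, "prior": 0.125},  # e.g. degree-3
]
print(srm_over_classes(classes, m=200))  # picks the middle class here
```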
Structural Risk Minimization
• Theorem: For a prior $p(\mathcal{H}_r)$ with $\sum_r p(\mathcal{H}_r) \le 1$ and any $\mathcal{D}$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^m$,
  $L_{\mathcal{D}}(SRM(S)) \le \min_{\mathcal{H}_r,\, h \in \mathcal{H}_r} L_{\mathcal{D}}(h) + O\!\left(\sqrt{\frac{-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r) + \log 1/\delta}{m}}\right)$
• For singleton classes $\mathcal{H}_r = \{h_r\}$:
  • $\mathrm{VCdim}(\mathcal{H}_r) = 0$
  • Reduces to "standard" SRM with a prior over hypotheses
• For $p(\mathcal{H}_1) = 1$:
  • Reduces to ERM over a single finite-VC class
• More general. E.g. for polynomials over $\phi(x) \in \mathbb{R}^d$ with $p(\text{degree } r) = 2^{-r}$:
  $m(\epsilon, \delta, h) = O\!\left(\frac{\mathrm{degree}(h) + (d+1)^{\mathrm{degree}(h)} + \log 1/\delta}{\epsilon^2}\right)$
• Allows non-uniform learning of a countable union of finite-VC classes
Uniform and Non-Uniform Learnability
• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\exists m(\epsilon,\delta)$, $\forall \mathcal{D}$, $\forall h$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta)}$:
  $L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$
• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\exists m(\epsilon,\delta,h)$, $\forall \mathcal{D}$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h)}$:
  $L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$
• Theorem: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if and only if it is a countable union of finite-VC classes ($\mathcal{H} = \bigcup_{i \in \mathbb{N}} \mathcal{H}_i$ with $\mathrm{VCdim}(\mathcal{H}_i) < \infty$)
• Definition: A hypothesis class $\mathcal{H}$ is "consistently learnable" if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\forall \mathcal{D}$, $\exists m(\epsilon,\delta,h,\mathcal{D})$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h,\mathcal{D})}$:
  $L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$
Consistency
• $\mathcal{X}$ countable (e.g. $\mathcal{X} = \{0,1\}^*$), $\mathcal{H} = \{\pm 1\}^{\mathcal{X}}$ (all possible functions)
• $\mathcal{H}$ is uncountable, it is not a countable union of finite-VC classes, and it is thus not non-uniformly learnable
• Claim: $\mathcal{H}$ is "consistently learnable" using the majority-memorization rule (see the sketch after this slide)
  $A(S)(x) = \mathrm{Majority}\left(y_i \text{ s.t. } (x_i, y_i) \in S,\; x_i = x\right)$
• Proof sketch: for any $\epsilon$,
  • Sort $\mathcal{X}$ by decreasing probability. The tail has diminishing probability, and thus for any $\epsilon$ there exists some prefix $\mathcal{X}'$ of the sort s.t. the tail $\mathcal{X} \setminus \mathcal{X}'$ has probability mass $\le \epsilon/2$.
  • We give up on the tail. $\mathcal{X}'$ is finite, and so $\{\pm 1\}^{\mathcal{X}'}$ is also finite.
• Why only "consistently learnable"?
  • The size of $\mathcal{X}'$ required to capture $1 - \epsilon/2$ of the mass depends on $\mathcal{D}$.
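A sketch of the majority-memorization rule from the claim above (the fallback label on unseen points is an arbitrary choice here):

```python
from collections import Counter, defaultdict

# Majority-memorization sketch: predict the majority label among training
# points with the same x; return an arbitrary default label on unseen x.

def majority_memorize(S, default=1):
    votes = defaultdict(Counter)
    for x, y in S:
        votes[x][y] += 1
    def h(x):
        return votes[x].most_common(1)[0][0] if x in votes else default
    return h

S = [("00", 1), ("00", 1), ("00", -1), ("01", -1)]
h = majority_memorize(S)
print(h("00"), h("01"), h("11"))  # 1 -1 1  (default on the unseen "11")
```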
Uniform and Non-Uniform Learnability
• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\exists m(\epsilon,\delta)$, $\forall \mathcal{D}$, $\forall h$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta)}$:
  $L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$
  • (Agnostically) PAC-learnable iff $\mathrm{VCdim}(\mathcal{H}) < \infty$
• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\exists m(\epsilon,\delta,h)$, $\forall \mathcal{D}$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h)}$:
  $L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$
  • Non-uniformly learnable iff $\mathcal{H}$ is a countable union of finite-VC classes
• Definition: A hypothesis class $\mathcal{H}$ is "consistently learnable" if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\forall \mathcal{D}$, $\exists m(\epsilon,\delta,h,\mathcal{D})$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h,\mathcal{D})}$:
  $L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$
SRM In Practice
$SRM(S) = \arg\min_{h \in \mathcal{H}_r} \; L_S(h) + C\sqrt{\frac{-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r)}{m}}$
• The bound is loose anyway. Better to view SRM as a bi-criteria optimization: minimize both $L_S(h)$ and $-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r)$. E.g. scalarize as
  $\arg\min \; L_S(h) + \lambda \left( -\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r) \right)$
• Typically $-\log p(\mathcal{H}_r)$ and $\mathrm{VCdim}(\mathcal{H}_r)$ are monotone in a "complexity" measure $r$: minimize both $L_S(h)$ and $r(h)$,
  where $r(h) = \min\{\, r \text{ s.t. } h \in \mathcal{H}_r \,\}$
SRM as a Bi-Criteria Problem
• Minimize both $L_S(h)$ and $r(h)$
• Regularization path $= \left\{\, \arg\min_h L_S(h) + \lambda \cdot r(h) \;\middle|\; 0 \le \lambda \le \infty \,\right\}$
• Select $\lambda$ using a validation set; the exact bound is not needed (see the sketch after this slide)
[Figure: the regularization path (Pareto frontier) of $\mathcal{H}$ in the $(L_S(h), r(h))$ plane, traced from $\lambda = \infty$ to $\lambda = 0$]
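A sketch of the bi-criteria view under an assumed toy setup (threshold predictors, with the threshold index standing in for the complexity $r(h)$): sweep $\lambda$ to trace a regularization path, then pick $\lambda$ on a validation set.

```python
# Regularization-path sketch with invented toy data and a toy complexity
# measure; none of this is from the lecture.

def err(h, data):
    return sum(h(x) != y for x, y in data) / len(data)

def regularization_path(train, hypotheses, complexity, lambdas):
    """For each lambda, return argmin_h  L_train(h) + lambda * r(h)."""
    return {lam: min(hypotheses,
                     key=lambda h: err(h, train) + lam * complexity(h))
            for lam in lambdas}

def select_by_validation(path, val):
    """Pick the point on the path with the smallest validation error."""
    best_lam = min(path, key=lambda lam: err(path[lam], val))
    return best_lam, path[best_lam]

hypotheses = [lambda x, t=t: 1 if x >= t else -1 for t in range(6)]
complexity = lambda h: h.__defaults__[0]   # toy proxy: r(h) = threshold t
train = [(0, -1), (2, 1), (3, 1)]
val = [(2, 1), (4, 1)]
path = regularization_path(train, hypotheses, complexity,
                           lambdas=[0.0, 0.1, 1.0, 10.0])
lam, h_star = select_by_validation(path, val)
print(lam, h_star.__defaults__[0])
```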
Non-Uniform Learning: Beyond Cardinality
• MDL is still essentially based on cardinality ("how many hypotheses are simpler than me") and ignores the relationships between predictors.
• Can we treat continuous classes (e.g. linear predictors)? Move away from cardinality? Take into account that many predictors are similar?
• Answer 1: a prior $p(\mathcal{H}_r)$ over hypothesis classes
• Answer 2: PAC-Bayes theory
  • A prior distribution $P$ (not necessarily discrete) over $\mathcal{H}$
(David McAllester)
PAC-Bayes
• Until now (MDL, SRM) we used a discrete "prior" (a discrete "distribution" $p(h)$ over hypotheses, or a discrete "distribution" $p(\mathcal{H}_r)$ over hypothesis classes)
• Instead: encode the inductive bias as a distribution $P$ over hypotheses
• Use the randomized (averaged) predictor $h_Q$, which for each prediction chooses $h \sim Q$ and predicts $h(x)$ (see the sketch after this slide)
  • $h_Q(x) = y$ w.p. $\mathbb{P}_{h \sim Q}\left(h(x) = y\right)$
  • $L_{\mathcal{D}}(h_Q) = \mathbb{E}_{h \sim Q}\left[L_{\mathcal{D}}(h)\right]$
• Theorem: for any distribution $P$ over hypotheses and any $\mathcal{D}$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^m$, for all $Q$:
  $L_{\mathcal{D}}(h_Q) - L_S(h_Q) \le \sqrt{\frac{KL(Q\|P) + \log\frac{2m}{\delta}}{2(m-1)}}$
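A sketch of the randomized predictor and the bound above, assuming a discrete $Q$ given as (hypothesis, weight) pairs (the demo numbers are invented):

```python
import math
import random

# Randomized-predictor sketch: h_Q draws a fresh h ~ Q for each prediction;
# its loss is the Q-average of the individual losses.

def gibbs_predict(Q, x):
    """Q: list of (h, q) pairs with sum q = 1. Draw h ~ Q, predict h(x)."""
    hs, qs = zip(*Q)
    return random.choices(hs, weights=qs, k=1)[0](x)

def expected_loss(Q, data):
    """L(h_Q) = E_{h~Q}[L(h)]."""
    return sum(q * sum(h(x) != y for x, y in data) / len(data) for h, q in Q)

def mcallester_gap(kl_qp, m, delta=0.05):
    """W.p. >= 1-delta:
    L_D(h_Q) - L_S(h_Q) <= sqrt((KL(Q||P) + log(2m/delta)) / (2(m-1)))."""
    return math.sqrt((kl_qp + math.log(2 * m / delta)) / (2 * (m - 1)))

# Toy demo: two constant predictors mixed 70/30.
Q = [(lambda x: +1, 0.7), (lambda x: -1, 0.3)]
data = [(0, +1), (1, +1), (2, -1)]
print(gibbs_predict(Q, 0))               # random draw: +1 w.p. 0.7
print(expected_loss(Q, data))            # 0.7*(1/3) + 0.3*(2/3)
print(mcallester_gap(kl_qp=math.log(2), m=1000))
```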
KL-Divergence
$KL(Q\|P) = \mathbb{E}_{h \sim Q}\left[\log\frac{Q(h)}{P(h)}\right] = \sum_h Q(h)\log\frac{Q(h)}{P(h)}$ for discrete distributions with pmfs $Q, P$
$\phantom{KL(Q\|P)} = \int q(h)\log\frac{q(h)}{p(h)}\,dh$ for continuous distributions with densities $q, p$
• Measures how much $Q$ deviates from $P$
• $KL(Q\|P) \ge 0$, and $KL(Q\|P) = 0$ if and only if $Q = P$
• If $Q(A) > 0$ while $P(A) = 0$, then $KL(Q\|P) = \infty$ (the other direction is allowed)
• $KL(H_1\|H_0)$ = the information per sample for rejecting $H_0$ when $H_1$ is true
• $KL(Q\,\|\,\mathrm{Unif}(n)) = \log n - H(Q)$
• $I(X;Y) = KL\left(p(X,Y)\,\|\,p(X)\,p(Y)\right)$
(Solomon Kullback, Richard Leibler)
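A sketch of the discrete KL above, including the point-mass-vs-uniform case that recovers the $\log|\mathcal{H}|$ cardinality term (conventions: $0 \log 0 = 0$; mass where $P$ vanishes gives $\infty$):

```python
import math

# Discrete KL sketch, with the conventions 0*log(0/p) = 0 and
# KL = infinity when Q puts mass where P has none.

def kl_divergence(Q, P):
    """KL(Q||P) = sum_h Q(h) log(Q(h)/P(h)) for pmfs given as dicts."""
    total = 0.0
    for h, q in Q.items():
        if q == 0.0:
            continue
        p = P.get(h, 0.0)
        if p == 0.0:
            return math.inf
        total += q * math.log(q / p)
    return total

# Point mass vs uniform over n hypotheses: KL = log n, the cardinality term.
n = 8
P = {h: 1 / n for h in range(n)}
Q = {3: 1.0}
print(kl_divergence(Q, P), math.log(n))  # equal
```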
PAC-Bayes
• For any distribution $P$ over hypotheses and any $\mathcal{D}$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^m$, for all $Q$:
  $L_{\mathcal{D}}(h_Q) - L_S(h_Q) \le \sqrt{\frac{KL(Q\|P) + \log\frac{2m}{\delta}}{2(m-1)}}$
• Can only use hypotheses in the support of $P$ (otherwise $KL(Q\|P) = \infty$)
• For a finite $\mathcal{H}$ with $P = \mathrm{Unif}(\mathcal{H})$:
  • Consider $Q$ = a point mass on $h$
  • $KL(Q\|P) = \log|\mathcal{H}|$
  • Generalizes the cardinality bound (up to $\log m$)
• More generally, for a discrete $P$ and $Q$ = a point mass on $h$:
  • $KL(Q\|P) = Q(h)\log\frac{Q(h)}{P(h)} = \log\frac{1}{p(h)}$
  • Generalizes MDL/SRM (up to $\log m$)
• For a continuous $P$ (e.g. over linear predictors or polynomials):
  • For $Q$ = a point mass (or any discrete $Q$), $KL(Q\|P) = \infty$
  • Take $h_Q$ as an average over similar hypotheses (e.g. with the same behavior on $S$)
PAC-Bayes
$L_{\mathcal{D}}(h_Q) \le L_S(h_Q) + \sqrt{\frac{KL(Q\|P) + \log\frac{2m}{\delta}}{2(m-1)}}$
• What learning rule does the PAC-Bayes bound suggest?
  $Q_S = \arg\min_Q \; L_S(h_Q) + \lambda \cdot KL(Q\|P)$
• Theorem: $Q_S(h) \propto P(h)\,e^{-\beta L_S(h)}$ for some "inverse temperature" $\beta$ (see the sketch after this slide)
• As $\lambda \to \infty$ we ignore the data, corresponding to infinite temperature, $\beta \to 0$
• As $\lambda \to 0$ we insist on minimizing $L_S(h_Q)$, corresponding to zero temperature, $\beta \to \infty$, and the prediction becomes ERM (or rather, a distribution over the ERM hypotheses in the support of $P$)
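A sketch of the minimizing Gibbs distribution over an assumed finite, discrete setting (prior masses and empirical losses are invented). As $\beta$ grows, the mass concentrates on the empirical risk minimizers in the support of $P$:

```python
import math

# Gibbs-distribution sketch over an assumed finite discrete setting:
# Q_S(h) proportional to P(h) * exp(-beta * L_S(h)).

def gibbs_posterior(P, emp_loss, beta):
    """P: dict h -> prior mass; emp_loss: dict h -> L_S(h)."""
    weights = {h: p * math.exp(-beta * emp_loss[h]) for h, p in P.items()}
    Z = sum(weights.values())
    return {h: w / Z for h, w in weights.items()}

# Invented numbers: beta = 0 returns the prior; large beta concentrates
# on the empirical risk minimizer in the support of P (here "h2").
P = {"h0": 0.5, "h1": 0.25, "h2": 0.25}
L_S = {"h0": 0.30, "h1": 0.10, "h2": 0.00}
for beta in (0.0, 5.0, 100.0):
    print(beta, gibbs_posterior(P, L_S, beta))
```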
PAC-Bayes vs Bayes
Bayesian approach:
• Assume $h \sim P$
• $y_1, \dots, y_m$ i.i.d. conditioned on $h$, with $y_i \mid x_i, h = \begin{cases} h(x_i) & \text{w.p. } 1-\eta \\ -h(x_i) & \text{w.p. } \eta \end{cases}$
Use the posterior:
$p(h \mid S) \propto p(h)\,p(S \mid h) = p(h)\prod_i p(y_i \mid x_i, h) \propto p(h)\,\eta^{\#\{i : h(x_i) \ne y_i\}}\,(1-\eta)^{\#\{i : h(x_i) = y_i\}} = p(h)\,(1-\eta)^m \left(\tfrac{\eta}{1-\eta}\right)^{m L_S(h)} \propto p(h)\,e^{-\beta L_S(h)}$,
where $\beta = m \log\frac{1-\eta}{\eta}$
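A numeric check of the identity above, with invented toy values of $\eta$, $m$, $k$: the likelihood $\eta^k (1-\eta)^{m-k}$ indeed equals $(1-\eta)^m e^{-\beta L_S(h)}$ with $\beta = m \log\frac{1-\eta}{\eta}$ and $L_S(h) = k/m$.

```python
import math

# Numeric check of the derivation above, with invented toy values.
eta, m, k = 0.1, 20, 3  # noise rate, sample size, number of errors
beta = m * math.log((1 - eta) / eta)
likelihood = eta ** k * (1 - eta) ** (m - k)
gibbs_form = (1 - eta) ** m * math.exp(-beta * k / m)
print(likelihood, gibbs_form)  # equal up to floating point
```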
PAC-Bayes vs Bayes
PAC-Bayes:
• $P$ encodes an inductive bias, not an assumption about reality
• The SRM-type bound is minimized by the Gibbs distribution
  $Q_S(h) \propto P(h)\,e^{-\beta L_S(h)}$
• Post-hoc guarantee always valid ($\forall \mathcal{D}$), with no assumption about reality:
  $L_{\mathcal{D}}(h_Q) \le L_S(h_Q) + \sqrt{\frac{KL(Q\|P) + \log\frac{2m}{\delta}}{2(m-1)}}$
• The bound is valid for any $Q$
• If the inductive bias is very different from reality, the bound will be high
Bayesian approach:
• $P$ is a prior over reality
• The posterior is given by the Gibbs distribution
  $p(h \mid S) \propto p(h)\,e^{-\beta L_S(h)}$
• Risk analysis assumes the prior
PAC-Bayes: Tighter Version
• For any distribution $P$ over hypotheses and any source distribution $\mathcal{D}$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^m$, for all $Q$:
  $kl\left(L_S(h_Q)\,\middle\|\,L_{\mathcal{D}}(h_Q)\right) \le \frac{KL(Q\|P) + \log\frac{2m}{\delta}}{m-1}$
  where $kl(\alpha\|\beta) = \alpha\log\frac{\alpha}{\beta} + (1-\alpha)\log\frac{1-\alpha}{1-\beta}$ for $\alpha, \beta \in [0,1]$
• This implies
  $L_{\mathcal{D}}(h_Q) \le L_S(h_Q) + \sqrt{\frac{2 L_S(h_Q)\left(KL(Q\|P) + \log\frac{2m}{\delta}\right)}{m-1}} + \frac{2\left(KL(Q\|P) + \log\frac{2m}{\delta}\right)}{m-1}$
• This generalizes the realizable case ($L_S(h_Q) = 0$, so only the $\frac{1}{m}$ term appears) and the agnostic case (where the $\frac{1}{\sqrt{m}}$ term is dominant)
• Numerically much tighter (see the inversion sketch below)
• Can also be used as a tail bound, instead of Hoeffding or Bernstein, together with cardinality- or VC-based guarantees; it arises naturally in PAC-Bayes.
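Since the tighter bound leaves $L_{\mathcal{D}}(h_Q)$ inside a binary kl, in practice one numerically inverts it. A sketch by bisection (the toy numbers at the end are invented):

```python
import math

# Invert the binary kl: given the train loss and the bound's right-hand
# side, the largest consistent beta upper-bounds the true loss L_D(h_Q).

def binary_kl(alpha, beta):
    """kl(alpha||beta) for Bernoulli means, with the 0*log(0) = 0 convention."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(alpha, beta) + term(1 - alpha, 1 - beta)

def kl_inverse(train_loss, rhs, tol=1e-10):
    """Largest beta in [train_loss, 1) with kl(train_loss||beta) <= rhs,
    found by bisection."""
    lo, hi = train_loss, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(train_loss, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

m, delta, kl_qp = 1000, 0.05, 5.0
rhs = (kl_qp + math.log(2 * m / delta)) / (m - 1)
print(kl_inverse(0.0, rhs))  # realizable-style: about rhs, a 1/m rate
print(kl_inverse(0.2, rhs))  # agnostic-style: roughly 0.2 + a 1/sqrt(m) term
```

At $L_S(h_Q) = 0$ the inverse is $1 - e^{-\mathrm{rhs}} \approx \mathrm{rhs}$, recovering the $\frac{1}{m}$ realizable rate; at larger train loss the slack behaves like the $\frac{1}{\sqrt{m}}$ agnostic rate, matching the two regimes noted on the slide.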