Page 1

Computational and Statistical Learning Theory

TTIC 31120

Prof. Nati Srebro

Lecture 4: MDL and PAC-Bayes

Page 2

Uniform vs Non-Uniform Bias

• No Free Lunch: we need some "inductive bias"

• Limiting attention to a hypothesis class $\mathcal{H}$: "flat" bias

• $p(h) = \frac{1}{|\mathcal{H}|}$ for $h \in \mathcal{H}$, and $p(h) = 0$ otherwise

• Non-uniform bias: $p(h)$ encodes the bias

• Can use any $p(h) \ge 0$ s.t. $\sum_h p(h) \le 1$

• E.g. choose a prefix-disambiguous encoding $d(h)$ and use $p(h) = 2^{-|d(h)|}$

• Or, choose $c:\mathcal{U} \to \mathcal{Y}^{\mathcal{X}}$ over prefix-disambiguous programs $\mathcal{U} \subset \{0,1\}^*$ and use $p(h) = 2^{-\min_{\sigma:\, c(\sigma)=h} |\sigma|}$

• The choice of $p(\cdot)$, $d(\cdot)$ or $c(\cdot)$ encodes our expert knowledge / inductive bias (see the sketch below)
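To make the description-language view concrete, here is a minimal sketch (my own illustration with made-up hypothesis names and code words, not part of the lecture) of turning a prefix-disambiguous code $d(h)$ into a prior $p(h) = 2^{-|d(h)|}$. Kraft's inequality guarantees $\sum_h p(h) \le 1$ for any prefix-free code.

```python
# Hypothetical prefix-free code d(h); the names and code words are invented.
code = {
    "h_const_plus1":  "0",
    "h_const_minus1": "10",
    "h_threshold":    "110",
    "h_parity":       "1110",
}

def is_prefix_free(codewords):
    """No codeword is a proper prefix of another (prefix-disambiguity)."""
    ws = list(codewords)
    return not any(a != b and b.startswith(a) for a in ws for b in ws)

# The induced "prior": shorter descriptions get more prior mass.
p = {h: 2.0 ** (-len(w)) for h, w in code.items()}

assert is_prefix_free(code.values())
assert sum(p.values()) <= 1.0 + 1e-12   # Kraft's inequality
print(p)  # {'h_const_plus1': 0.5, 'h_const_minus1': 0.25, ...}
```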

Page 3

Minimum Description Length Learning

• Choose a "prior" $p(h)$ s.t. $\sum_h p(h) \le 1$ (or a description language $d(\cdot)$ or $c(\cdot)$)

• Minimum Description Length learning rule (based on the above prior / description language; see the sketch below):

$$MDL_p(S) = \arg\max_{h:\, L_S(h)=0} p(h) = \arg\min_{h:\, L_S(h)=0} |d(h)|$$

• For any $\mathcal{D}$, w.p. $\ge 1-\delta$:

$$L_{\mathcal{D}}(MDL_p(S)) \;\le\; \inf_{h \text{ s.t. } L_{\mathcal{D}}(h)=0} \sqrt{\frac{-\log p(h) + \log \frac{2}{\delta}}{2m}}$$

Sample complexity: $m(\epsilon,\delta,h) = O\!\left(\frac{-\log p(h)}{\epsilon^2}\right) = O\!\left(\frac{|d(h)|}{\epsilon^2}\right)$ (more careful analysis: $O\!\left(\frac{|d(h^*)|}{\epsilon}\right)$)

Page 4

MDL and Universal Learning

• Theorem: For any $\mathcal{H}$ and $p:\mathcal{H}\to[0,1]$ s.t. $\sum_h p(h) \le 1$, and any source distribution $\mathcal{D}$, if there exists $h$ with $L_{\mathcal{D}}(h) = 0$ and $p(h) > 0$, then w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^m$ (a proof sketch follows at the end of this slide):

$$L_{\mathcal{D}}(MDL_p(S)) \;\le\; \sqrt{\frac{-\log p(h) + \log \frac{2}{\delta}}{2m}}$$

• Can learn any countable class!

• Class of all computable functions, with $p(h) = 2^{-\min_{\sigma:\, c(\sigma)=h} |\sigma|}$.

• Any class enumerable via $n:\mathcal{H}\to\mathbb{N}$, with $p(h) = 2^{-n(h)}$

• But VCdim(all computable functions) $= \infty$!

• Why no contradiction to the Fundamental Theorem?

• PAC learning: sample complexity $m(\epsilon,\delta)$ is uniform for all $h\in\mathcal{H}$. It depends only on the class $\mathcal{H}$, not on the specific $h^*$

• MDL: sample complexity $m(\epsilon,\delta,h)$ depends on $h$.
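The step the slide leaves implicit is why $-\log p(h)$ appears: give each hypothesis its own confidence budget $\delta_h = p(h)\,\delta$ and take a union bound. A brief LaTeX sketch of this standard argument, in the notation above:

```latex
% For each fixed h, Hoeffding gives, w.p. >= 1 - \delta_h over S ~ D^m:
%   L_D(h) <= L_S(h) + sqrt( log(2/\delta_h) / (2m) ).
% Choose \delta_h = p(h)\,\delta, so that \sum_h \delta_h <= \delta, and union bound:
\text{w.p. } \ge 1-\delta: \quad \forall h,\quad
L_{\mathcal{D}}(h) \;\le\; L_S(h) + \sqrt{\frac{-\log p(h) + \log\tfrac{2}{\delta}}{2m}} .
% MDL_p(S) has L_S = 0 and maximizes p(h) among consistent hypotheses; any h
% with L_D(h) = 0 is consistent, so the bound in the theorem follows.
```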

Page 5

Uniform and Non-Uniform Learnability

• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-Learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\exists m(\epsilon,\delta)$, $\forall \mathcal{D}$, $\forall h$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta)}}$:

$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\forall h$, $\exists m(\epsilon,\delta,h)$, $\forall \mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta,h)}}$:

$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

• Theorem: $\forall \mathcal{D}$, if there exists $h$ with $L_{\mathcal{D}}(h) = 0$, then $\forall^{\delta}_{S\sim\mathcal{D}^m}$:

$$L_{\mathcal{D}}(MDL_p(S)) \;\le\; \sqrt{\frac{-\log p(h) + \log \frac{2}{\delta}}{2m}}$$

Can we compete also with $h$ s.t. $L_{\mathcal{D}}(h) > 0$?

Page 6

Allowing Errors: From MDL to SRM

$$L_{\mathcal{D}}(h) \;\le\; L_S(h) + \sqrt{\frac{-\log p(h) + \log \frac{2}{\delta}}{2m}}$$

• Structural Risk Minimization (a code sketch follows below):

$$SRM_p(S) = \arg\min_h \left[\, L_S(h) + \sqrt{\frac{-\log p(h)}{2m}} \,\right]$$

• Theorem: For any prior $p(h)$ with $\sum_h p(h) \le 1$, and any source distribution $\mathcal{D}$, w.p. $\ge 1-\delta$ over $S\sim\mathcal{D}^m$:

$$L_{\mathcal{D}}(SRM_p(S)) \;\le\; \inf_h \left[\, L_{\mathcal{D}}(h) + 2\sqrt{\frac{-\log p(h) + \log \frac{2}{\delta}}{m}} \,\right]$$

(Slide annotations: the $L_S(h)$ term says "fit the data" and is minimized by ERM; the $-\log p(h)$ term says "match the prior / simple / short description" and is minimized by MDL.)
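A minimal sketch (my own illustration, not the lecture's code) of the SRM rule above over a finite pool of candidate hypotheses, each carrying a prior value $p(h) > 0$. Unlike MDL, it does not require any candidate to fit $S$ perfectly.

```python
import math

def empirical_error(h, S):
    """L_S(h): fraction of labeled examples (x, y) in S that h misclassifies."""
    return sum(h(x) != y for x, y in S) / len(S)

def srm(candidates, S):
    """candidates: list of (hypothesis_function, prior_p) pairs with prior_p > 0.
    Returns argmin_h  L_S(h) + sqrt(-log p(h) / (2m)), as on the slide."""
    m = len(S)
    def objective(pair):
        h, p = pair
        return empirical_error(h, S) + math.sqrt(-math.log(p) / (2 * m))
    h_best, _ = min(candidates, key=objective)
    return h_best
```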

Page 7

Non-Uniform Learning: Beyond Cardinality

• MDL is still essentially based on cardinality ("how many hypotheses are simpler than me") and ignores relationships between predictors.

• Generalizes the cardinality bound: using $p(h) = \frac{1}{|\mathcal{H}|}$ we get

$$m(\epsilon,\delta,h) = m(\epsilon,\delta) = \frac{\log|\mathcal{H}| + \log\frac{2}{\delta}}{\epsilon^2}$$

• Can we treat continuous classes (e.g. linear predictors)? Move from cardinality to the "growth function"?

• E.g.:

• $\mathcal{H} = \{\, \mathrm{sign}(f(\phi(x))) \mid f:\mathbb{R}^d\to\mathbb{R} \text{ is a polynomial} \,\}$, $\phi:\mathcal{X}\to\mathbb{R}^d$

• VCdim($\mathcal{H}$) $= \infty$

• $\mathcal{H}$ is uncountable, and there is no distribution with $\forall h\in\mathcal{H}\ p(h) > 0$

• But what if we bias toward lower-order polynomials?

• Answer 1: prior over hypothesis classes

• Write $\mathcal{H} = \cup_r \mathcal{H}_r$ (e.g. $\mathcal{H}_r$ = degree-$r$ polynomials)

• Use a prior $p(\mathcal{H}_r)$ over hypothesis classes

Page 8

Prior Over Hypothesis Classes

• VC bound: $\forall r$,

$$\mathbb{P}\!\left[\, \forall h\in\mathcal{H}_r:\ L_{\mathcal{D}}(h) \le L_S(h) + O\!\left(\sqrt{\frac{\mathrm{VCdim}(\mathcal{H}_r) + \log\frac{1}{\delta_r}}{m}}\right) \right] \ge 1-\delta_r$$

• Setting $\delta_r = p(\mathcal{H}_r)\cdot\delta$ and taking a union bound, $\forall^{\delta}_{S\sim\mathcal{D}^m}$:

$$\forall \mathcal{H}_r,\ \forall h\in\mathcal{H}_r:\quad L_{\mathcal{D}}(h) \le L_S(h) + O\!\left(\sqrt{\frac{\mathrm{VCdim}(\mathcal{H}_r) - \log p(\mathcal{H}_r) + \log\frac{1}{\delta}}{m}}\right)$$

• Structural Risk Minimization over hypothesis classes (see the sketch below):

$$SRM_p(S) = \arg\min_{r,\ h\in\mathcal{H}_r} \left[\, L_S(h) + C\sqrt{\frac{-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r)}{m}} \,\right]$$

• Theorem: w.p. $\ge 1-\delta$,

$$L_{\mathcal{D}}(SRM_p(S)) \;\le\; \min_{\mathcal{H}_r,\ h\in\mathcal{H}_r} \left[\, L_{\mathcal{D}}(h) + O\!\left(\sqrt{\frac{-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r) + \log\frac{1}{\delta}}{m}}\right) \right]$$
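A minimal sketch (hypothetical API, not the lecture's code) of the SRM-over-classes rule above for a nested family $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \dots$ with prior $p(\mathcal{H}_r) = 2^{-r}$: run ERM within each class, add the complexity penalty, and keep the best trade-off.

```python
import math

def srm_over_classes(classes, S, C=1.0, max_r=20):
    """classes: function r -> (erm_fn, vcdim), where erm_fn(S) returns (h, L_S(h)).
    Prior over classes is p(H_r) = 2^{-r}, so -log p(H_r) = r * log 2 (in nats)."""
    m = len(S)
    best = None
    for r in range(1, max_r + 1):
        erm_fn, vcdim = classes(r)
        h, train_err = erm_fn(S)
        penalty = C * math.sqrt((r * math.log(2) + vcdim) / m)
        score = train_err + penalty
        if best is None or score < best[0]:
            best = (score, r, h)
    return best  # (objective value, selected complexity level r, selected hypothesis)
```

The cap `max_r` is a practical cutoff, since the penalty grows with $r$ while the training error is bounded by 1.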

Page 9

Structural Risk Minimization

• Theorem: For a prior $p(\mathcal{H}_r)$ with $\sum_{\mathcal{H}_r} p(\mathcal{H}_r) \le 1$ and any $\mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^m}$:

$$L_{\mathcal{D}}(SRM_p(S)) \;\le\; \min_{\mathcal{H}_r,\ h\in\mathcal{H}_r} \left[\, L_{\mathcal{D}}(h) + O\!\left(\sqrt{\frac{-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r) + \log\frac{1}{\delta}}{m}}\right) \right]$$

• For singleton classes $\mathcal{H}_i = \{h_i\}$:

• $\mathrm{VCdim}(\mathcal{H}_i) = 0$

• Reduces to "standard" SRM with a prior over hypotheses

• For $p(\mathcal{H}_r) = 1$ (all prior mass on a single class):

• Reduces to ERM over a finite-VC class

• More general. E.g. for polynomials over $\phi(x)\in\mathbb{R}^d$ with $p(\text{degree } r) = 2^{-r}$:

$$m(\epsilon,\delta,h) = O\!\left(\frac{\mathrm{degree}(h) + (d+1)^{\mathrm{degree}(h)} + \log\frac{1}{\delta}}{\epsilon^2}\right)$$

• Allows non-uniform learning of a countable union of finite-VC classes

Page 10

Uniform and Non-Uniform Learnability

• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-Learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\exists m(\epsilon,\delta)$, $\forall \mathcal{D}$, $\forall h$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta)}}$:

$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\forall h$, $\exists m(\epsilon,\delta,h)$, $\forall \mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta,h)}}$:

$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

• Theorem: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if and only if it is a countable union of finite-VC classes ($\mathcal{H} = \cup_{i\in\mathbb{N}} \mathcal{H}_i$, $\mathrm{VCdim}(\mathcal{H}_i) < \infty$)

• Definition: A hypothesis class $\mathcal{H}$ is "consistently learnable" if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\forall h$, $\forall \mathcal{D}$, $\exists m(\epsilon,\delta,h,\mathcal{D})$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta,h,\mathcal{D})}}$:

$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

Page 11

Consistency

• $\mathcal{X}$ countable (e.g. $\mathcal{X} = \{0,1\}^*$), $\mathcal{H} = \{\pm 1\}^{\mathcal{X}}$ (all possible functions)

• $\mathcal{H}$ is uncountable, it is not a countable union of finite-VC classes, and is thus not non-uniformly learnable

• Claim: $\mathcal{H}$ is "consistently learnable" using $ERM_{\mathcal{H}}(S)(x) = \mathrm{MAJORITY}\{\, y_i \text{ s.t. } (x_i, y_i)\in S,\ x_i = x \,\}$ (see the sketch below)

• Proof sketch: for any $\mathcal{D}$,

• Sort $\mathcal{X}$ by decreasing probability. The tail has diminishing probability, and thus for any $\epsilon$ there exists some prefix $\mathcal{X}'$ of this ordering s.t. the tail $\mathcal{X}\setminus\mathcal{X}'$ has probability mass $\le \epsilon/2$.

• We'll give up on the tail. $\mathcal{X}'$ is finite, and so $\{\pm 1\}^{\mathcal{X}'}$ is also finite.

• Why only "consistently learnable"?

• The size of $\mathcal{X}'$ required to capture $1 - \epsilon/2$ of the mass depends on $\mathcal{D}$.
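A minimal sketch (my own construction, following the slide's claim) of the memorizing ERM over $\mathcal{H} = \{\pm 1\}^{\mathcal{X}}$: predict the majority label observed for $x$ in the sample, and an arbitrary default on unseen points (the "tail" we gave up on). The default value is my choice for illustration.

```python
from collections import Counter, defaultdict

def majority_erm(S, default=+1):
    """S: list of (x, y) with y in {-1, +1}. Returns a predictor h(x)."""
    votes = defaultdict(Counter)
    for x, y in S:
        votes[x][y] += 1
    def h(x):
        if x in votes:
            return votes[x].most_common(1)[0][0]  # majority label seen for this x
        return default                            # unseen x: arbitrary prediction
    return h

h = majority_erm([("00", +1), ("00", +1), ("01", -1), ("11", +1)])
print(h("00"), h("01"), h("10"))  # 1 -1 1  (the last one is the default)
```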

Page 12

Uniform and Non-Uniform Learnability

• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-Learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\exists m(\epsilon,\delta)$, $\forall \mathcal{D}$, $\forall h$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta)}}$:

$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

• (Agnostically) PAC-Learnable iff $\mathrm{VCdim}(\mathcal{H}) < \infty$

• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\forall h$, $\exists m(\epsilon,\delta,h)$, $\forall \mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta,h)}}$:

$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

• Non-uniformly learnable iff $\mathcal{H}$ is a countable union of finite-VC classes

• Definition: A hypothesis class $\mathcal{H}$ is "consistently learnable" if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\forall h$, $\forall \mathcal{D}$, $\exists m(\epsilon,\delta,h,\mathcal{D})$, $\forall^{\delta}_{S\sim\mathcal{D}^{m(\epsilon,\delta,h,\mathcal{D})}}$:

$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

Page 13

SRM In Practice

$$SRM_p(S) = \arg\min_{r,\ h\in\mathcal{H}_r} \left[\, L_S(h) + C\sqrt{\frac{-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r)}{m}} \,\right]$$

• The bound is loose anyway. Better to view this as a bi-criteria optimization: $\arg\min_h\ L_S(h)$ and $-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r)$

E.g. scalarize as $\arg\min_h\ L_S(h) + \lambda\left(-\log p(\mathcal{H}_r) + \mathrm{VCdim}(\mathcal{H}_r)\right)$

• Typically $-\log p(\mathcal{H}_r)$ and $\mathrm{VCdim}(\mathcal{H}_r)$ are monotone in the "complexity" $r$: $\arg\min_h\ L_S(h)$ and $r(h)$, where $r(h) = \min\{\, r \text{ s.t. } h\in\mathcal{H}_r \,\}$

Page 14

SRM as a Bi-Criteria Problem

$\arg\min_h\ L_S(h)$ and $r(h)$

Regularization Path $= \left\{\, \arg\min_h\ L_S(h) + \lambda\cdot r(h) \;\middle|\; 0 \le \lambda \le \infty \,\right\}$

Select $\lambda$ using a validation set; the exact bound is not needed

[Figure: the regularization path (Pareto frontier) in the $(r(h), L_S(h))$ plane, traced from $\lambda = \infty$ (smallest $r(h)$) to $\lambda = 0$ (smallest $L_S(h)$).]

Page 15

Non-Uniform Learning: Beyond Cardinality

• MDL is still essentially based on cardinality ("how many hypotheses are simpler than me") and ignores relationships between predictors.

• Can we treat continuous classes (e.g. linear predictors)? Move beyond cardinality? Take into account that many predictors are similar?

• Answer 1: a prior $p(\mathcal{H}_r)$ over hypothesis classes

• Answer 2: PAC-Bayes Theory

• Prior distribution $P$ (not necessarily discrete) over $\mathcal{H}$

David McAllester

Page 16

PAC-Bayes

• Until now (MDL, SRM) we used a discrete "prior" (a discrete "distribution" $p(h)$ over hypotheses, or a discrete "distribution" $p(\mathcal{H}_r)$ over hypothesis classes)

• Instead: encode the inductive bias as a distribution $P$ over hypotheses

• Use the randomized (averaged) predictor $h_Q$, which for each prediction draws $h \sim Q$ and predicts $h(x)$

• $h_Q(x) = y$ w.p. $\mathbb{P}_{h\sim Q}[h(x) = y]$

• $L_{\mathcal{D}}(h_Q) = \mathbb{E}_{h\sim Q}[L_{\mathcal{D}}(h)]$

• Theorem: for any distribution $P$ over hypotheses and any $\mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^m}$:

$$L_{\mathcal{D}}(h_Q) - L_S(h_Q) \;\le\; \sqrt{\frac{KL(Q\|P) + \log\frac{2m}{\delta}}{2(m-1)}}$$
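The bound is easy to evaluate numerically once $KL(Q\|P)$ is known. A minimal sketch (my own illustration, with made-up numbers):

```python
import math

def pac_bayes_bound(train_gibbs_risk, kl_q_p, m, delta):
    """Upper bound on L_D(h_Q):
    L_S(h_Q) + sqrt((KL(Q||P) + log(2m/delta)) / (2(m-1)))."""
    slack = math.sqrt((kl_q_p + math.log(2 * m / delta)) / (2 * (m - 1)))
    return train_gibbs_risk + slack

# Hypothetical numbers: m = 10000 samples, KL(Q||P) = 25 nats, delta = 0.05.
print(pac_bayes_bound(0.08, 25.0, 10_000, 0.05))  # ≈ 0.08 + 0.044 = 0.124
```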

Page 17

KL-Divergence

$$KL(Q\|P) = \mathbb{E}_{h\sim Q}\!\left[\log\frac{dQ}{dP}\right] = \begin{cases} \sum_h q(h)\log\frac{q(h)}{p(h)} & \text{for discrete distributions with pmfs } q, p \\[4pt] \int f_Q(h)\log\frac{f_Q(h)}{f_P(h)}\, dh & \text{for continuous distributions with densities } f_Q, f_P \end{cases}$$

• Measures how much $Q$ deviates from $P$

• $KL(Q\|P) \ge 0$, and $KL(Q\|P) = 0$ if and only if $Q = P$

• If $Q(A) > 0$ while $P(A) = 0$, then $KL(Q\|P) = \infty$ (the other direction is allowed)

• $KL(H_1\|H_0)$ = information per sample for rejecting $H_0$ when $H_1$ is true

• $KL(\mathcal{Q}\,\|\,\mathrm{Unif}(n)) = \log n - H(\mathcal{Q})$

• $I(X,Y) = KL\big(P(X,Y)\,\|\,P(X)P(Y)\big)$

Solomon Kullback, Richard Leibler
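The discrete form above is straightforward to compute directly. A minimal sketch (my own illustration), which also previews the point-mass example on the next slide ($KL(\text{point mass}\,\|\,\mathrm{Unif}(\mathcal{H})) = \log|\mathcal{H}|$):

```python
import math

def kl_divergence(q, p):
    """KL(Q||P) in nats for discrete distributions given as dicts of probabilities."""
    total = 0.0
    for h, qh in q.items():
        if qh == 0.0:
            continue                 # 0 * log 0 = 0 by convention
        ph = p.get(h, 0.0)
        if ph == 0.0:
            return math.inf          # Q puts mass where P has none
        total += qh * math.log(qh / ph)
    return total

uniform = {h: 0.25 for h in "abcd"}
point_mass = {"a": 1.0}
print(kl_divergence(point_mass, uniform))  # log 4 ≈ 1.386 (= log|H|)
print(kl_divergence(uniform, point_mass))  # inf: the other direction blows up
```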

Page 18

PAC-Bayes

• For any distribution $P$ over hypotheses and any $\mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^m}$:

$$L_{\mathcal{D}}(h_Q) - L_S(h_Q) \;\le\; \sqrt{\frac{KL(Q\|P) + \log\frac{2m}{\delta}}{2(m-1)}}$$

• Can only use hypotheses in the support of $P$ (otherwise $KL(Q\|P) = \infty$)

• For a finite $\mathcal{H}$ with $P = \mathrm{Unif}(\mathcal{H})$:

• Consider $Q$ = point mass on $h$

• $KL(Q\|P) = \log|\mathcal{H}|$

• Generalizes the cardinality bound (up to the $\log m$ term)

• More generally, for a discrete $P$ and $Q$ = point mass on $h$:

• $KL(Q\|P) = \sum_{h'} q(h')\log\frac{q(h')}{p(h')} = \log\frac{1}{p(h)}$

• Generalizes MDL/SRM (up to the $\log m$ term)

• For continuous $P$ (e.g. over linear predictors or polynomials):

• For $Q$ = point mass (or any discrete $Q$), $KL(Q\|P) = \infty$

• Take $h_Q$ as an average over similar hypotheses (e.g. with the same behavior on $S$)

Page 19

PAC-Bayes

$$L_{\mathcal{D}}(h_Q) \;\le\; L_S(h_Q) + \sqrt{\frac{KL(Q\|P) + \log\frac{2m}{\delta}}{2(m-1)}}$$

• What learning rule does the PAC-Bayes bound suggest?

$$Q_\lambda = \arg\min_Q\ L_S(h_Q) + \lambda\cdot KL(Q\|P)$$

• Theorem: $q_\lambda(h) \propto p(h)\, e^{-\beta L_S(h)}$ for some "inverse temperature" $\beta$

• As $\lambda \to \infty$ we ignore the data, corresponding to infinite temperature, $\beta \to 0$

• As $\lambda \to 0$ we insist on minimizing $L_S(h_Q)$, corresponding to zero temperature, $\beta \to \infty$, and the prediction becomes ERM (or rather, a distribution over the ERM hypotheses in the support of $P$)
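A minimal sketch (my own illustration, not the lecture's code) of the Gibbs distribution $q_\beta(h) \propto p(h)\, e^{-\beta L_S(h)}$ over a finite hypothesis pool, showing the two temperature extremes discussed above:

```python
import math

def gibbs_posterior(prior, train_errors, beta):
    """prior, train_errors: dicts over the same hypothesis names; beta >= 0."""
    weights = {h: prior[h] * math.exp(-beta * train_errors[h]) for h in prior}
    z = sum(weights.values())
    return {h: w / z for h, w in weights.items()}

prior = {"h1": 0.5, "h2": 0.25, "h3": 0.25}
errs = {"h1": 0.30, "h2": 0.05, "h3": 0.00}
print(gibbs_posterior(prior, errs, beta=0.0))    # beta -> 0: just the prior
print(gibbs_posterior(prior, errs, beta=100.0))  # large beta: concentrates on the ERM (h3)
```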

Page 20

PAC-Bayes vs Bayes

Bayesian approach:

• Assume $h \sim \mathcal{P}$

• $y_1, \dots, y_m$ i.i.d. conditioned on $h$, with

$$y_i \mid x_i, h = \begin{cases} h(x_i) & \text{w.p. } 1-\nu \\ -h(x_i) & \text{w.p. } \nu \end{cases}$$

Use the posterior:

$$p(h\mid S) \;\propto\; p(h)\, p(S\mid h) \;=\; p(h)\prod_i p(x_i)\, p(y_i\mid x_i) \;\propto\; p(h)\prod_i \left(\frac{\nu}{1-\nu}\right)^{[h(x_i)\ne y_i]} \;=\; p(h)\left(\frac{\nu}{1-\nu}\right)^{\sum_i [h(x_i)\ne y_i]} \;=\; p(h)\, e^{-\beta L_S(h)}$$

where $\beta = m\log\frac{1-\nu}{\nu}$

Page 21

PAC-Bayes vs Bayes

PAC-Bayes:

• $P$ encodes an inductive bias, not an assumption about reality

• SRM-type bound minimized by the Gibbs distribution $q_\lambda(h) \propto p(h)\, e^{-\beta L_S(h)}$

• Post-hoc guarantee always valid ($\forall^{\delta}_S$), with no assumption about reality:

$$L_{\mathcal{D}}(h_Q) \;\le\; L_S(h_Q) + \sqrt{\frac{KL(Q\|P) + \log\frac{2m}{\delta}}{2(m-1)}}$$

• Bound valid for any $Q$

• If the inductive bias is very different from reality, the bound will be high

Bayesian Approach:

• $\mathcal{P}$ is a prior over reality

• Posterior given by the Gibbs distribution $q_\lambda(h) \propto p(h)\, e^{-\beta L_S(h)}$

• Risk analysis assuming the prior

Page 22

PAC-Bayes: Tighter Version

• For any distribution $P$ over hypotheses and any source distribution $\mathcal{D}$, $\forall^{\delta}_{S\sim\mathcal{D}^m}$:

$$KL\big(L_S(h_Q)\,\big\|\, L_{\mathcal{D}}(h_Q)\big) \;\le\; \frac{KL(Q\|P) + \log\frac{2m}{\delta}}{m-1}$$

where $KL(\alpha\|\beta) = \alpha\log\frac{\alpha}{\beta} + (1-\alpha)\log\frac{1-\alpha}{1-\beta}$ for $\alpha,\beta\in[0,1]$ (the binary KL divergence), which implies

$$L_{\mathcal{D}}(h_Q) \;\le\; L_S(h_Q) + \sqrt{\frac{2 L_S(h_Q)\left(KL(Q\|P) + \log\frac{2m}{\delta}\right)}{m-1}} + \frac{2\left(KL(Q\|P) + \log\frac{2m}{\delta}\right)}{m-1}$$

• This generalizes the realizable case ($L_S(h_Q) = 0$, so only the $\frac{1}{m}$ term appears) and the agnostic case (where the $\frac{1}{\sqrt{m}}$ term is dominant)

• Numerically much tighter

• Can also be used as a tail bound instead of Hoeffding or Bernstein, also with cardinality- or VC-based guarantees; it arises naturally in PAC-Bayes.
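A minimal sketch (my own illustration, not from the lecture) of how the tighter, kl-form bound is typically used: numerically invert it, i.e. find the largest $L_{\mathcal{D}}$ consistent with $KL(L_S\|L_{\mathcal{D}}) \le \frac{KL(Q\|P)+\log\frac{2m}{\delta}}{m-1}$ by binary search, since $KL(\alpha\|\beta)$ is increasing in $\beta$ for $\beta \ge \alpha$.

```python
import math

def binary_kl(a, b):
    """kl(a||b) for a, b in [0, 1], with the 0*log(0) = 0 convention."""
    eps = 1e-12
    b = min(max(b, eps), 1 - eps)
    term1 = 0.0 if a == 0 else a * math.log(a / b)
    term2 = 0.0 if a == 1 else (1 - a) * math.log((1 - a) / (1 - b))
    return term1 + term2

def kl_inverse_upper(train_risk, rhs, tol=1e-9):
    """Largest b in [train_risk, 1] with kl(train_risk || b) <= rhs."""
    lo, hi = train_risk, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(train_risk, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical numbers, matching the earlier sketch: the inverted bound is
# noticeably tighter than the sqrt-form bound for the same inputs.
m, delta, kl_q_p, train_risk = 10_000, 0.05, 25.0, 0.08
rhs = (kl_q_p + math.log(2 * m / delta)) / (m - 1)
print(kl_inverse_upper(train_risk, rhs))  # ≈ 0.106 upper bound on L_D(h_Q)
```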

