Computational and Statistical Learning Theory, TTIC 31120, Prof. Nati Srebro. Lecture 17: Stochastic Optimization; Part II: Realizable vs Agnostic Rates; Part III: Nearest Neighbor Classification
Page 1

Computational and Statistical Learning Theory

TTIC 31120

Prof. Nati Srebro

Lecture 17: Stochastic Optimization

Part II: Realizable vs Agnostic Rates
Part III: Nearest Neighbor Classification

Page 2

Stochastic (Sub)-Gradient Descent

If $\|\nabla f(w, z)\|_2 \le G$, then with an appropriate step size:

𝔼 𝐹 π‘€π‘š ≀ infπ‘€βˆˆπ’², 𝑀 2≀𝐡

𝐹 𝑀 + 𝑂𝐡2𝐺2

π‘š

Similarly also for Stochastic Mirror Descent.

Optimize $F(w) = \mathbb{E}_{z\sim\mathcal{D}}[f(w, z)]$ s.t. $w \in \mathcal{W}$:
1. Initialize $w_1 = 0 \in \mathcal{W}$
2. At iteration $t = 1, 2, 3, \ldots$:
   1. Sample $z_t \sim \mathcal{D}$ (obtain $g_t$ s.t. $\mathbb{E}[g_t] \in \partial F(w_t)$)
   2. $w_{t+1} = \Pi_{\mathcal{W}}\big(w_t - \eta_t \nabla f(w_t, z_t)\big)$
3. Return $\bar{w}_m = \frac{1}{m}\sum_{t=1}^{m} w_t$

[Diagram: online-to-batch conversions, passing $g_t$ from the online to the stochastic algorithm. Online Gradient Descent [Zinkevich 03] → Stochastic Gradient Descent [Nemirovski Yudin 78], via online2stochastic [Cesa-Bianchi et al 02]; Online Mirror Descent [Shalev-Shwartz Singer 07] → Stochastic Mirror Descent [Nemirovski Yudin 78].]

Page 3

Stochastic Optimization

minπ‘€βˆˆπ’²

𝐹(𝑀)

based only on stochastic information about $F$:
• Only unbiased estimates of $F(w)$, $\nabla F(w)$
• No direct access to $F$

E.g., fixed $f(w, z)$ but $\mathcal{D}$ unknown:
• Optimize $F(w)$ based on an i.i.d. sample $z_1, z_2, \ldots, z_m \sim \mathcal{D}$
• $g = \nabla f(w, z_t)$ is an unbiased estimate of $\nabla F(w)$

• Traditional applications:
• Optimization under uncertainty

β€’ Uncertainty about network performance

β€’ Uncertainty about client demands

β€’ Uncertainty about system behavior in control problems

• Complex systems where it's easier to sample than to integrate over $z$

Page 4

Machine Learning is Stochastic Optimization

$\min_h\; L(h) = \mathbb{E}_{z\sim\mathcal{D}}[\ell(h, z)] = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathrm{loss}(h(x), y)]$

β€’ Optimization variable: predictor β„Ž

β€’ Objective: generalization error 𝐿(β„Ž)

β€’ Stochasticity over 𝑧 = (π‘₯, 𝑦)

β€œGeneral Learning” ≑ Stochastic Optimization:

Vladimir Vapnik

Arkadi Nemirovski

Page 5

Statistical Learning vs Stochastic Optimization

Stochastic Optimization (Arkadi Nemirovski):
• Focus on computational efficiency
• Generally assumes unlimited sampling, as in Monte Carlo methods for complicated objectives
• Optimization variable generally a vector in a normed space; complexity control through the norm
• Mostly convex objectives

Statistical Learning (Vladimir Vapnik):
• Focus on sample size
• What can be done with a fixed number of samples?
• Abstract hypothesis classes: linear predictors, but also combinatorial hypothesis classes; generic measures of complexity such as VC-dim, fat shattering, Rademacher
• Also non-convex classes and loss functions

Page 6

Two Approaches to Stochastic Optimization / Learning

minπ‘€βˆˆπ’²

𝐹(𝑀) = 𝔼𝑧~π’Ÿ[𝑓 𝑀, 𝑧 ]

• Empirical Risk Minimization (ERM) / Sample Average Approximation (SAA):
• Collect sample $z_1, \ldots, z_m$
• Minimize $F_S(w) = \frac{1}{m}\sum_i f(w, z_i)$

β€’ Analysis typically based on Uniform Convergence

• Stochastic Approximation (SA): [Robbins Monro 1951]
• Update $w_t$ based on $z_t$
• E.g., based on $g_t = \nabla f(w_t, z_t)$
• E.g.: stochastic gradient descent
• Online-to-batch conversion of an online algorithm…
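To make the two approaches concrete, here is a small Python sketch applying both to the same toy stochastic objective. The objective $f(w,z) = \tfrac{1}{2}(w-z)^2$, the Gaussian distribution, and the $1/t$ step size are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=1.0, size=1000)   # sample z_1, ..., z_m ~ D
# Objective: F(w) = E[0.5 * (w - z)^2], minimized at w = E[z] = 3.

# ERM / SAA: minimize the sample average F_S(w) = (1/m) * sum_i 0.5*(w - z_i)^2.
w_erm = z.mean()                                 # closed-form minimizer of F_S

# SA: one update per sample, using g_t = grad_w f(w, z_t) = w - z_t.
w = 0.0
for t, z_t in enumerate(z, start=1):
    w -= (1.0 / t) * (w - z_t)                   # step size eta_t = 1/t
w_sa = w

print(w_erm, w_sa)                               # both are close to E[z] = 3
```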

Page 7

SA/SGD for Machine Learning

β€’ In learning with ERM, need to optimize

$\hat{w} = \arg\min_{w\in\mathcal{W}} L_S(w), \qquad L_S(w) = \frac{1}{m}\sum_i \ell(w, z_i)$

• $L_S(w)$ is expensive to evaluate exactly: $O(md)$ time
• Cheap to get an unbiased gradient estimate in $O(d)$ time:

$i \sim \mathrm{Unif}(1\ldots m), \qquad g = \nabla\ell(w, z_i), \qquad \mathbb{E}[g] = \frac{1}{m}\sum_i \nabla\ell(w, z_i) = \nabla L_S(w)$

• SGD guarantee:

$\mathbb{E}\big[L_S(\bar{w}^{(T)})\big] \le \inf_{w\in\mathcal{W}} L_S(w) + \sqrt{\frac{\sup\|\nabla\ell\|_2^2 \,\sup_{w\in\mathcal{W}}\|w\|_2^2}{T}}$
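A quick numerical check of the unbiasedness claim, in Python. The linear model and squared loss $\ell(w,(x,y)) = \tfrac{1}{2}(\langle w,x\rangle - y)^2$ are assumed for illustration; they are not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 500, 10
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
w = rng.normal(size=d)

# Exact gradient of L_S(w) = (1/m) * sum_i 0.5*(<w, x_i> - y_i)^2: O(md) time.
grad_full = X.T @ (X @ w - y) / m

# Cheap O(d)-time estimate: g = grad l(w, z_i) for i ~ Unif(1..m).
# Averaging many independent draws recovers grad L_S(w), i.e. E[g] = grad L_S(w).
idx = rng.integers(m, size=200_000)
g_avg = np.mean([(X[i] @ w - y[i]) * X[i] for i in idx], axis=0)

print(np.linalg.norm(grad_full - g_avg))         # close to 0
```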

Page 8

SGD for SVM

$\min L_S(w) \quad \text{s.t.}\ \|w\|_2 \le B$

Use $g_t = \nabla_w\, \mathrm{loss}_{\mathrm{hinge}}\big(\langle w^{(t)}, \phi(x_{i_t})\rangle;\, y_{i_t}\big)$ for random $i_t$

Initialize $w^{(0)} = 0$
At iteration $t$:
• Pick $i \in 1\ldots m$ at random
• If $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$: $\ w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
• If $\|w^{(t+1)}\|_2 > B$: $\ w^{(t+1)} \leftarrow B\,\frac{w^{(t+1)}}{\|w^{(t+1)}\|_2}$
Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$

πœ™ π‘₯ 2 ≀ 𝐺 𝑔𝑑 2 ≀ 𝐺 𝐿𝑆 ΰ΄₯𝑀 𝑇 ≀ 𝐿𝑆 ෝ𝑀 +𝐡2𝐺2

𝑇

(in expectation over randomness in algorithm)
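A Python sketch of the algorithm above (one reading of the pseudocode, not official course code; the synthetic data, the identity feature map $\phi(x) = x$, and the step size $\eta_t = B/(G\sqrt{t})$ are assumptions):

```python
import numpy as np

def sgd_svm(X, y, B, T, seed=0):
    """Projected stochastic subgradient descent on the hinge loss over ||w||_2 <= B,
    returning the averaged iterate w_bar_T."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    G = np.max(np.linalg.norm(X, axis=1))    # bound on ||phi(x)||_2, hence ||g_t||_2 <= G
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                  # pick i in 1..m at random
        if y[i] * (w @ X[i]) < 1:            # hinge subgradient step
            w = w + (B / (G * np.sqrt(t))) * y[i] * X[i]
        norm = np.linalg.norm(w)
        if norm > B:                         # project back onto ||w||_2 <= B
            w = w * (B / norm)
        w_sum += w
    return w_sum / T

# Illustrative use on synthetic, nearly separable data.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000))
w_bar = sgd_svm(X, y, B=10.0, T=50_000)
print(np.mean(np.sign(X @ w_bar) == y))      # training accuracy
```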

Page 9

Stochastic vs Batch

[Figure: batch vs. stochastic updates for $\min L_S(w)$ s.t. $\|w\|_2 \le B$. Batch GD computes all $m$ per-example gradients $g_i = \nabla \mathrm{loss}(w, (x_i, y_i))$ and takes a single step using $\nabla L_S(w) = \frac{1}{m}\sum_i g_i$; SGD takes a step $w \leftarrow w - g_i$ after each single-example gradient.]

Page 10

Stochastic vs Batch

• Intuitive argument: if only taking simple gradient steps, better to be stochastic
• To get $L_S(w) \le L_S(\hat{w}) + \epsilon_{opt}$:

            #iter                          cost/iter   runtime
Batch GD    $B^2G^2/\epsilon_{opt}^2$      $md$        $md \cdot B^2G^2/\epsilon_{opt}^2$
SGD         $B^2G^2/\epsilon_{opt}^2$      $d$         $d \cdot B^2G^2/\epsilon_{opt}^2$

• How does this compare to methods with a $\log(1/\epsilon_{opt})$ dependence, which use the structure of $L_S(w)$ (not only local access)?
• How small should $\epsilon_{opt}$ be?
• What about $L(w)$, which is what we really care about?
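For a sense of scale, a back-of-the-envelope comparison with made-up numbers (not from the lecture): take $m = 10^6$ examples, $d = 10^3$ features, and $B^2G^2/\epsilon_{opt}^2 = 10^4$ iterations. Then

Batch GD runtime $\approx md \cdot \frac{B^2G^2}{\epsilon_{opt}^2} = 10^6 \cdot 10^3 \cdot 10^4 = 10^{13}$ operations, versus SGD runtime $\approx d \cdot \frac{B^2G^2}{\epsilon_{opt}^2} = 10^3 \cdot 10^4 = 10^7$ operations.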

Page 11

Overall Analysis of $L_\mathcal{D}(w)$

• Recall for ERM: $\ L_\mathcal{D}(\hat{w}) \le L_\mathcal{D}(w^*) + 2\sup_w \big| L_\mathcal{D}(w) - L_S(w) \big|$
• For an $\epsilon_{opt}$-suboptimal ERM $\bar{w}$:

$L_\mathcal{D}(\bar{w}) \le L_\mathcal{D}(w^*) + 2\sup_w \big| L_\mathcal{D}(w) - L_S(w) \big| + \big( L_S(\bar{w}) - L_S(\hat{w}) \big)$

where $\hat{w} = \arg\min_{\|w\|\le B} L_S(w)$ and $w^* = \arg\min_{\|w\|\le B} L_\mathcal{D}(w)$; the three terms are $\epsilon_{approx} = L_\mathcal{D}(w^*)$, $\epsilon_{est} \le 2\sqrt{\frac{B^2G^2}{m}}$, and $\epsilon_{opt} \le \sqrt{\frac{B^2G^2}{T}}$.

• Take $\epsilon_{opt} \approx \epsilon_{est}$, i.e. #iterations $T \approx$ sample size $m$
• To ensure $L_\mathcal{D}(w) \le L_\mathcal{D}(w^*) + \epsilon$: $\ T, m = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$

Page 12

Direct Online-to-Batch: SGD on $L_\mathcal{D}(w)$

$\min_w L_\mathcal{D}(w)$

Use $g_t = \nabla_w\, \mathrm{loss}_{\mathrm{hinge}}\big(y\langle w, \phi(x)\rangle\big)$ for a random $(x, y) \sim \mathcal{D}$; then $\mathbb{E}[g_t] = \nabla L_\mathcal{D}(w)$.

Initialize $w^{(0)} = 0$
At iteration $t$:
• Draw $(x_t, y_t) \sim \mathcal{D}$
• If $y_t\langle w^{(t)}, \phi(x_t)\rangle < 1$: $\ w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$

πΏπ’Ÿ ΰ΄₯𝑀 𝑇 ≀ inf𝑀 2≀𝐡

πΏπ’Ÿ 𝑀 +𝐡2𝐺2

𝑇

π‘š = 𝑇 = 𝑂𝐡2𝐺2

πœ–2

Page 13

SGD for Machine Learning

Direct SA (online2batch) approach: $\min_w L(w)$
• Initialize $w^{(0)} = 0$
• At iteration $t$: draw $(x_t, y_t) \sim \mathcal{D}$; if $y_t\langle w^{(t)}, \phi(x_t)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$, else $w^{(t+1)} \leftarrow w^{(t)}$
• Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$

SGD on ERM: $\min_{\|w\|_2 \le B} L_S(w)$
• Draw $(x_1, y_1), \ldots, (x_m, y_m) \sim \mathcal{D}$
• Initialize $w^{(0)} = 0$
• At iteration $t$: pick $i \in 1\ldots m$ at random; if $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$, else $w^{(t+1)} \leftarrow w^{(t)}$; then project $w^{(t+1)}$ onto $\{\|w\| \le B\}$
• Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$

Direct SA approach:
• Fresh sample at each iteration, $m = T$
• No need to project nor require $\|w\| \le B$
• Implicit regularization via early stopping

SGD on ERM:
• Can have $T > m$ iterations
• Need to project to $\|w\| \le B$
• Explicit regularization via $\|w\|$

Page 14

SGD for Machine Learning

(The same two procedures as on the previous page: the direct SA (online2batch) approach for $\min_w L(w)$, using a fresh sample $(x_t, y_t)\sim\mathcal{D}$ at each iteration, and SGD on the ERM $\min_{\|w\|_2\le B} L_S(w)$, resampling from $S$ with projection.)

With step size $\eta_t = \sqrt{B^2/(G^2 t)}$ and $w^* = \arg\min_{\|w\|\le B} L(w)$:

• Direct SA approach: $L(\bar{w}^{(T)}) \le L(w^*) + \sqrt{\frac{B^2G^2}{T}}$

• SGD on ERM: $L(\bar{w}^{(T)}) \le L(w^*) + 2\sqrt{\frac{B^2G^2}{m}} + \sqrt{\frac{B^2G^2}{T}}$

Page 15

SGD for Machine Learning

Direct SA (online2batch) approach: $\min_w L(w)$, same as before (fresh sample $(x_t, y_t)\sim\mathcal{D}$ at each iteration, hinge update, return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$).

SGD on RERM: $\min_w L_S(w) + \frac{\lambda}{2}\|w\|^2$
• Draw $(x_1, y_1), \ldots, (x_m, y_m) \sim \mathcal{D}$
• Initialize $w^{(0)} = 0$
• At iteration $t$: pick $i \in 1\ldots m$ at random; if $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$, else $w^{(t+1)} \leftarrow w^{(t)}$; then shrink: $w^{(t+1)} \leftarrow w^{(t+1)} - \lambda w^{(t)}$
• Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$

Direct SA approach:
• Fresh sample at each iteration, $m = T$
• No need to shrink $w$
• Implicit regularization via early stopping

SGD on RERM:
• Can have $T > m$ iterations
• Need to shrink $w$
• Explicit regularization via $\|w\|$

Page 16

SGD vs ERM

[Figure: starting from $w^{(0)}$, the SGD iterate $\bar{w}^{(m)}$ and the constrained ERM $\hat{w} = \arg\min_{\|w\|\le B} L_S(w)$ both land within $O\big(\sqrt{B^2G^2/m}\big)$ of $w^* = \arg\min_{\|w\|\le B} L(w)$; running $T > m$ iterations with projection moves toward $\hat{w}$, while the unconstrained $\arg\min_w L_S(w)$ overfits.]

Page 17

Mixed Approach: SGD on ERM

[Plot: test misclassification error (roughly 0.052 to 0.058) vs. number of SGD iterations (up to 3,000,000) on the Reuters RCV1 data, CCAT task, for training set sizes m = 300,000, 400,000, 500,000.]

β€’ The mixed approach (reusing examples) can make sense

Page 18

Mixed Approach: SGD on ERM

[Same plot as on the previous page: test misclassification error vs. number of SGD iterations on the Reuters RCV1 data, CCAT task, for m = 300,000, 400,000, 500,000.]

• The mixed approach (reusing examples) can make sense
• Still: fresh samples are better
• With a larger training set, we can reduce the generalization error faster
• A larger training set means less runtime to reach a target generalization error

Page 19

Online Optimization vs Stochastic Approximation

• In both the Online Setting and Stochastic Approximation:
• Receive samples sequentially
• Update $w$ after each sample

• But in the Online Setting:
• The objective is empirical regret, i.e. behavior on the observed instances
• $z_t$ is chosen adversarially (no distribution involved)

• As opposed to Stochastic Approximation:
• The objective is $\mathbb{E}[\ell(w, z)]$, i.e. behavior on "future" samples
• i.i.d. samples $z_t$

• Stochastic Approximation is a computational approach; Online Learning is an analysis setup
• E.g. "Follow the Leader"

Page 20

Part II: Realizable vs Agnostic Rates

Page 21

Realizable vs Agnostic Rates

β€’ Recall for finite hypothesis classes:

πΏπ’Ÿ β„Ž ≀ infβ„Žβˆˆβ„‹

πΏπ’Ÿ(β„Ž) + 2log |β„‹|+log ΰ΅—2 𝛿

2π‘š π‘š = 𝑂

log β„‹

πœ–πŸ

• But in the realizable case, i.e. if $\inf_{h\in\mathcal{H}} L_\mathcal{D}(h) = 0$:

$L_\mathcal{D}(\hat{h}) \le \frac{\log|\mathcal{H}| + \log(1/\delta)}{m}, \qquad m(\epsilon) = O\!\left(\frac{\log|\mathcal{H}|}{\epsilon}\right)$

• Also for VC classes: in general $m = O\!\left(\frac{\mathrm{VCdim}}{\epsilon^2}\right)$, while in the realizable case $m = O\!\left(\frac{\mathrm{VCdim}\cdot\log(1/\epsilon)}{\epsilon}\right)$

• What happens if $L^* = \inf_{h\in\mathcal{H}} L_\mathcal{D}(h)$ is low, but not zero?

Page 22

Estimating the Bias of a Coin

$|p - \hat{p}| \le \sqrt{\frac{\log(2/\delta)}{2m}}$

$|p - \hat{p}| \le \sqrt{\frac{2p\log(2/\delta)}{m}} + \frac{2\log(2/\delta)}{3m}$

Page 23

Optimistic VC bound (aka $L^*$-bound, multiplicative bound)

$\hat{h} = \arg\min_{h\in\mathcal{H}} L_S(h), \qquad L^* = \inf_{h\in\mathcal{H}} L(h)$

• For a hypothesis class with VC-dimension $D$, w.p. $1-\delta$ over $m$ samples:

𝐿 β„Ž ≀ πΏβˆ— + 2 πΏβˆ—π· log Ξ€2π‘’π‘š

𝐷+log ΰ΅—2 𝛿

π‘š+ 4

𝐷 log Ξ€2π‘’π‘šπ·+log ΰ΅—2 𝛿

π‘š

= inf𝛼

1 + 𝛼 πΏβˆ— + 1 +1

𝛼4𝐷 log Ξ€2π‘’π‘š

𝐷+log ΰ΅—2 𝛿

π‘š

• Sample complexity to get $L(\hat{h}) \le L^* + \epsilon$:

$m(\epsilon) = O\!\left(\frac{D}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\cdot\log\frac{1}{\epsilon}\right)$

β€’ Extends to bounded real-valued loss in terms of VC-subgraph dim
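A hedged numeric illustration of why the multiplicative form matters (the values of $D$, $\epsilon$, and $L^*$ are made up): compare the agnostic sample complexity $O(D/\epsilon^2)$ with the optimistic $O\!\big(\tfrac{D}{\epsilon}\cdot\tfrac{L^*+\epsilon}{\epsilon}\log\tfrac{1}{\epsilon}\big)$.

```python
import numpy as np

D, eps = 100, 0.01
agnostic = D / eps**2                              # ~ D / eps^2, ignoring constants
for L_star in (0.0, 0.01, 0.1, 0.3):
    optimistic = (D / eps) * ((L_star + eps) / eps) * np.log(1 / eps)
    print(L_star, agnostic, optimistic)
# When L* is on the order of eps, the optimistic bound behaves like (D/eps) log(1/eps),
# far smaller than D/eps^2; when L* is large, both scale as 1/eps^2.
```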

Page 24

From Parametric to Scale-Sensitive

$L(h) = \mathbb{E}[\mathrm{loss}(h(x), y)], \qquad h \in \mathcal{H}$

• Instead of VC-dim or VC-subgraph-dim (≈ #params), rely on metric scale to control complexity, e.g.:

$\mathcal{H} = \{\, x \mapsto \langle w, x\rangle \;:\; \|w\|_2 \le B \,\}$

β€’ Learning depends on:

β€’ Metric complexity measures: fat shattering dimension, covering numbers, Rademacher Complexity

β€’ Scale sensitivity of loss (bound on derivatives or β€œmargin”)

• For $\mathcal{H}$ with Rademacher complexity $\mathcal{R}_m$ and $|\mathrm{loss}'| \le G$:

$L(\hat{h}) \le L^* + 2G\mathcal{R}_m + \sqrt{\frac{\log(2/\delta)}{2m}} \;\le\; L^* + O\!\left(\sqrt{\frac{G^2 R + \log(2/\delta)}{2m}}\right), \qquad \text{where } \mathcal{R}_m \le \sqrt{\frac{R}{m}}$

E.g. for the linear class above, $\mathcal{R}_m(\mathcal{H}) = \sqrt{\frac{B^2 \sup\|x\|_2^2}{m}}$, i.e. $R = B^2\sup\|x\|_2^2$.

Page 25

Non-Parametric Optimistic Rate for Smooth Loss

• Theorem: for any $\mathcal{H}$ with (worst-case) Rademacher complexity $\mathcal{R}_m(\mathcal{H})$, and any smooth loss with $|\mathrm{loss}''| \le H$ and $\mathrm{loss} \le b$, w.p. $1-\delta$ over $m$ samples:

• Sample complexity:

Page 26

Parametric vs Non-Parametric

Rates for $L(\hat{h}) - L^*$ (parametric: $\dim(\mathcal{H}) \le D$, $|h| \le 1$; scale-sensitive: $\mathcal{R}_m(\mathcal{H}) \le \sqrt{R/m}$):

• Lipschitz loss, $|\phi'| \le G$ (e.g. hinge, $\ell_1$): parametric $\frac{GD}{m} + \sqrt{\frac{L^* GD}{m}}$; scale-sensitive $\sqrt{\frac{G^2 R}{m}}$
• Smooth loss, $|\phi''| \le H$ (e.g. logistic, Huber, smoothed hinge): parametric $\frac{HD}{m} + \sqrt{\frac{L^* HD}{m}}$; scale-sensitive $\frac{HR}{m} + \sqrt{\frac{L^* HR}{m}}$
• Smooth and strongly convex loss, $\mu \le \phi'' \le H$ (e.g. square loss): parametric $\frac{H}{\mu}\cdot\frac{HD}{m}$; scale-sensitive $\frac{HR}{m} + \sqrt{\frac{L^* HR}{m}}$

Min-max tight up to poly-log factors.

Page 27

Optimistic Learning Guarantees

$L(\hat{h}) \le (1+\alpha)L^* + \left(1+\frac{1}{\alpha}\right)\tilde{O}\!\left(\frac{R}{m}\right), \qquad m(\epsilon) \le \tilde{O}\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$

✓ Parametric classes
✓ Scale-sensitive classes with smooth loss
✓ Perceptron guarantee
✓ Margin bounds
✓ Stability-based guarantees with smooth loss
✓ Online Learning/Optimization with smooth loss
× Non-parametric (scale-sensitive) classes with non-smooth loss
× Online Learning/Optimization with non-smooth loss

Page 28

Why Optimistic Guarantees?

$L(\hat{h}) \le (1+\alpha)L^* + \left(1+\frac{1}{\alpha}\right)\tilde{O}\!\left(\frac{R}{m}\right), \qquad m(\epsilon) \le \tilde{O}\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$

• The optimistic regime is typically the relevant regime:
• Approximation error $L^*$ ≈ estimation error $\epsilon$
• If $\epsilon \ll L^*$, better to spend energy on lowering the approximation error (use a more complex class)

• Often important in highlighting true phenomena

Page 29

Part III: Nearest Neighbor Classification

Page 30

The Nearest Neighbor Classifier

• Training sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$

β€’ Want to predict label of new point π‘₯

β€’ The Nearest Neighbor Rule:

β€’ Find the closest training point: 𝑖 = argminπ‘–πœŒ(π‘₯, π‘₯𝑖)

β€’ Predict label of π‘₯ as 𝑦𝑖


Page 31

The Nearest Neighbor Classifier

• Training sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$

β€’ Want to predict label of new point π‘₯

β€’ The Nearest Neighbor Rule:

β€’ Find the closest training point: 𝑖 = argminπ‘–πœŒ(π‘₯, π‘₯𝑖)

β€’ Predict label of π‘₯ as 𝑦𝑖

• As a learning rule: $NN(S) = h$ where $h(x) = y_{\arg\min_i \rho(x, x_i)}$

Page 32

[Figure: nearest-neighbor decision regions under $\|x - x'\|_2$ vs. under $\|\tilde{\phi}(x) - \tilde{\phi}(x')\|_2$ with $\tilde{\phi}(x) = (5x[1], x[2])$.]

Where is the Bias Hiding?

β€’ What is the right β€œdistance” between images? Between sound waves? Between sentences?

• Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$

β€’ What representation πœ™(π‘₯)?

β€’ Find the closest training point: 𝑖 = argminπ‘–πœŒ(π‘₯, π‘₯𝑖)

β€’ Predict label of π‘₯ as 𝑦𝑖

Page 33

[Figure: nearest-neighbor decision regions under $\|x - x'\|_1$, $\|x - x'\|_\infty$, and $\|\tilde{\phi}(x) - \tilde{\phi}(x')\|_2$ with $\tilde{\phi}(x) = (5x[1], x[2])$.]

Where is the Bias Hiding?

β€’ What is the right β€œdistance” between images? Between sound waves? Between sentences?

• Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$

β€’ What representation πœ™(π‘₯)?

• Maybe a different distance? $\|\phi(x) - \phi(x')\|_1$? $\|\phi(x) - \phi(x')\|_\infty$? $\sin\big(\angle(\phi(x), \phi(x'))\big)$? $KL\big(\phi(x)\,\|\,\phi(x')\big)$?

β€’ Find the closest training point: 𝑖 = argminπ‘–πœŒ(π‘₯, π‘₯𝑖)

β€’ Predict label of π‘₯ as 𝑦𝑖

Page 34

Where is the Bias Hiding?

β€’ What is the right β€œdistance” between images? Between sound waves? Between sentences?

• Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$

β€’ What representation πœ™(π‘₯)?

• Maybe a different distance? $\|\phi(x) - \phi(x')\|_1$? $\|\phi(x) - \phi(x')\|_\infty$? $\sin\big(\angle(\phi(x), \phi(x'))\big)$? $KL\big(\phi(x)\,\|\,\phi(x')\big)$?

• Option 2: Special-purpose distance measure on $x$
• E.g. edit distance, deformation measure, etc.

β€’ Find the closest training point: 𝑖 = argminπ‘–πœŒ(π‘₯, π‘₯𝑖)

β€’ Predict label of π‘₯ as 𝑦𝑖

Page 35

Nearest Neighbor Learning Guarantee

• Optimal predictor: $h^* = \arg\min_h L_\mathcal{D}(h)$, where

$h^*(x) = \begin{cases} +1, & \eta(x) > 0.5 \\ -1, & \eta(x) < 0.5 \end{cases} \qquad \eta(x) = P_\mathcal{D}(y = 1 \mid x)$

• For the NN rule with $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$ and $\phi: \mathcal{X} \to [0,1]^d$:

$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(NN(S))\big] \le 2L(h^*) + \frac{4 c_\mathcal{D} \sqrt{d}}{\sqrt[d+1]{m}}$

where $|\eta(x) - \eta(x')| \le c_\mathcal{D}\cdot\rho(x, x')$

Page 36

Data Fit / Complexity Tradeoff

• $k$-Nearest Neighbor: predict according to the majority label among the $k$ closest points from $S$ (a code sketch follows below).

π”Όπ‘†βˆΌπ’Ÿπ‘š 𝐿 𝑁𝑁 𝑆 ≀ 2𝐿(β„Žβˆ—) + 4π‘π’Ÿπ‘‘

𝑑+1 π‘š

Page 37

k-Nearest Neighbor: Data Fit / Complexity Tradeoff

[Figure: $k$-NN decision boundaries on a sample $S$ for $k$ = 1, 5, 12, 50, 100, 200, compared with the optimal predictor $h^*$.]

Page 38

k-Nearest Neighbor Guarantee

• For $k$-NN with $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$ and $\phi: \mathcal{X} \to [0,1]^d$:

$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(NN_k(S))\big] \le \left(1 + \sqrt{\tfrac{8}{k}}\right) L(h^*) + \frac{6 c_\mathcal{D} \sqrt{d} + k}{\sqrt[d+1]{m}}$

• Should increase $k$ with the sample size $m$
• The above theory suggests $k_m \propto L(h^*)^{2/3}\cdot m^{\frac{2}{3(d+1)}}$

• "Universal" learning: for any "smooth" $\mathcal{D}$ and representation $\phi(\cdot)$ (with continuous $P(y \mid \phi(x))$), if we increase $k$ slowly enough, we will eventually converge to the optimal $L(h^*)$

β€’ Very non-uniform: sample complexity depends not only on β„Žβˆ—, but also on π’Ÿ

πœ‚ π‘₯ βˆ’ πœ‚ π‘₯β€² ≀ π‘π’Ÿ β‹… 𝜌(π‘₯, π‘₯β€²)

Page 39

Uniform and Non-Uniform Learnability

• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\exists m(\epsilon, \delta)$, $\forall \mathcal{D}$, $\forall h$, w.p. at least $1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta)}$:

$L_\mathcal{D}(A(S)) \le L_\mathcal{D}(h) + \epsilon$

• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\exists m(\epsilon, \delta, h)$, $\forall \mathcal{D}$, w.p. at least $1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h)}$:

$L_\mathcal{D}(A(S)) \le L_\mathcal{D}(h) + \epsilon$

• Definition: A hypothesis class $\mathcal{H}$ is "consistently learnable" if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\forall \mathcal{D}$, $\exists m(\epsilon, \delta, h, \mathcal{D})$, w.p. at least $1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h,\mathcal{D})}$:

$L_\mathcal{D}(A(S)) \le L_\mathcal{D}(h) + \epsilon$

Realizable/Optimistic Guarantees: π’Ÿ dependence through πΏπ’Ÿ(β„Ž)

