Page 1

Computational and Statistical Learning Theory

TTIC 31120

Prof. Nati Srebro

Lecture 17: Stochastic Optimization

Part II: Realizable vs Agnostic Rates
Part III: Nearest Neighbor Classification

Page 2

Stochastic (Sub)-Gradient Descent

If $\|\nabla f(w,z)\|_2 \le G$, then with an appropriate step size:

$$\mathbb{E}\big[F(\bar w_m)\big] \;\le\; \inf_{w\in\mathcal{W},\,\|w\|_2\le B} F(w) \;+\; O\!\left(\sqrt{\frac{B^2G^2}{m}}\right)$$

Similarly, also Stochastic Mirror Descent

Optimize $F(w) = \mathbb{E}_{z\sim\mathcal{D}}[f(w,z)]$ s.t. $w \in \mathcal{W}$:

1. Initialize $w_1 = 0 \in \mathcal{W}$
2. At iteration $t = 1, 2, 3, \dots$
   1. Sample $z_t \sim \mathcal{D}$ (obtain $g_t$ s.t. $\mathbb{E}[g_t] \in \partial F(w_t)$)
   2. $w_{t+1} = \Pi_{\mathcal{W}}\big(w_t - \eta_t \nabla f(w_t, z_t)\big)$
3. Return $\bar w_m = \frac{1}{m}\sum_{t=1}^{m} w_t$
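A minimal Python sketch of this loop for the Euclidean case $\mathcal{W} = \{w : \|w\|_2 \le B\}$; the callables `grad_f` and `sample_z` and the step size $\eta_t = B/(G\sqrt{t})$ are illustrative assumptions standing in for the abstract $f$, $\mathcal{D}$, and "appropriate step size" above:

```python
import numpy as np

def projected_sgd(grad_f, sample_z, dim, B, G, T):
    """Stochastic (sub)gradient descent over the ball {w : ||w||_2 <= B}.

    grad_f(w, z): a (sub)gradient of f(w, z) with respect to w.
    sample_z():   draws a fresh z ~ D.
    Returns the averaged iterate, matching the O(sqrt(B^2 G^2 / T)) guarantee.
    """
    w = np.zeros(dim)                     # w_1 = 0
    w_sum = np.zeros(dim)
    for t in range(1, T + 1):
        z = sample_z()                    # sample z_t ~ D
        g = grad_f(w, z)                  # in expectation, a subgradient of F at w_t
        w = w - (B / (G * np.sqrt(t))) * g
        norm = np.linalg.norm(w)
        if norm > B:                      # project back onto the B-ball
            w *= B / norm
        w_sum += w
    return w_sum / T                      # averaged iterate
```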

[Diagram: Online Gradient Descent [Zinkevich 03] and Online Mirror Descent [Shalev-Shwartz Singer 07] yield, via online2stochastic conversion with stochastic gradients $g_t$ [Cesa-Bianchi et al 02], Stochastic Gradient Descent and Stochastic Mirror Descent [Nemirovski Yudin 78].]

Page 3

Stochastic Optimization

$$\min_{w\in\mathcal{W}} F(w) = \mathbb{E}_{z\sim\mathcal{D}}[f(w,z)]$$

based only on stochastic information on $F$:
• Only unbiased estimates of $F(w)$, $\nabla F(w)$
• No direct access to $F$

E.g., fixed $f(w,z)$ but $\mathcal{D}$ unknown:
• Optimize $F(w)$ based on an i.i.d. sample $z_1, z_2, \dots, z_m \sim \mathcal{D}$
• $g = \nabla f(w, z_t)$ is an unbiased estimate of $\nabla F(w)$

• Traditional applications:
  • Optimization under uncertainty
    • Uncertainty about network performance
    • Uncertainty about client demands
    • Uncertainty about system behavior in control problems
  • Complex systems where it's easier to sample than to integrate over $z$

Page 4

Machine Learning is Stochastic Optimization

$$\min_h L(h) = \mathbb{E}_{z\sim\mathcal{D}}\big[\ell(h,z)\big] = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathrm{loss}(h(x), y)\big]$$

• Optimization variable: predictor $h$
• Objective: generalization error $L(h)$
• Stochasticity over $z = (x, y)$

“General Learning” ≡ Stochastic Optimization:

Vladimir Vapnik

Arkadi Nemirovskii

Page 5

Stochastic Optimization vs Statistical Learning

Stochastic Optimization:
• Focus on computational efficiency
• Generally assumes unlimited sampling, as in Monte Carlo methods for complicated objectives
• Optimization variable generally a vector in a normed space; complexity control through the norm
• Mostly convex objectives

Statistical Learning:
• Focus on sample size
• What can be done with a fixed number of samples?
• Abstract hypothesis classes: linear predictors, but also combinatorial hypothesis classes; generic measures of complexity such as VC dimension, fat shattering, Rademacher complexity
• Also non-convex classes and loss functions

Vladimir Vapnik

Arkadi Nemirovskii

Page 6

Two Approaches to Stochastic Optimization / Learning

$$\min_{w\in\mathcal{W}} F(w) = \mathbb{E}_{z\sim\mathcal{D}}[f(w,z)]$$

• Empirical Risk Minimization (ERM) / Sample Average Approximation (SAA):
  • Collect a sample $z_1, \dots, z_m$
  • Minimize $F_S(w) = \frac{1}{m}\sum_i f(w, z_i)$
  • Analysis typically based on uniform convergence
• Stochastic Approximation (SA): [Robbins Monro 1951]
  • Update $w_t$ based on $z_t$, e.g. based on $g_t = \nabla f(w_t, z_t)$
  • E.g.: stochastic gradient descent
  • Online-to-batch conversion of an online algorithm
Page 7

SA/SGD for Machine Learning

• In learning with ERM, need to optimize
  $$\hat w = \arg\min_{w\in\mathcal{W}} L_S(w), \qquad L_S(w) = \frac{1}{m}\sum_i \ell(w, z_i)$$
• $L_S(w)$ is expensive to evaluate exactly: $O(md)$ time
• Cheap to get an unbiased gradient estimate: $O(d)$ time
  $$i \sim \mathrm{Unif}(1..m), \quad g = \nabla\ell(w, z_i), \quad \mathbb{E}[g] = \sum_i \tfrac{1}{m}\nabla\ell(w, z_i) = \nabla L_S(w)$$
• SGD guarantee:
  $$\mathbb{E}\big[L_S(\bar w^{(T)})\big] \;\le\; \inf_{w\in\mathcal{W}} L_S(w) + \sqrt{\frac{\sup\|\nabla\ell\|_2^2 \cdot \sup\|w\|_2^2}{T}}$$
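As an illustration of the $O(d)$ estimate versus the $O(md)$ exact gradient, here is a small NumPy sketch using a hypothetical squared-loss linear model (the loss choice and all names are assumptions, not from the slides); averaging many single-example gradients recovers $\nabla L_S(w)$:

```python
import numpy as np

def full_gradient(w, X, Y):
    """Exact gradient of L_S(w) = (1/m) sum_i (1/2)(<w, x_i> - y_i)^2: O(md) time."""
    return X.T @ (X @ w - Y) / len(Y)

def stochastic_gradient(w, X, Y, rng):
    """Unbiased estimate: i ~ Unif(1..m), gradient of the single loss term: O(d) time."""
    i = rng.integers(len(Y))
    return (X[i] @ w - Y[i]) * X[i]

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w = rng.normal(size=5)
est = np.mean([stochastic_gradient(w, X, Y, rng) for _ in range(20000)], axis=0)
print(np.abs(est - full_gradient(w, X, Y)).max())   # should be small (a few hundredths)
```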

Page 8

SGD for SVM

$$\min L_S(w) \quad \text{s.t.}\quad \|w\|_2 \le B$$

Use $g_t = \nabla_w\,\mathrm{loss}_{\mathrm{hinge}}\big(\langle w_t, \phi(x_{i_t})\rangle;\, y_{i_t}\big)$ for random $i_t$.

Initialize $w^{(0)} = 0$. At iteration $t$:
• Pick $i \in 1..m$ at random
• If $y_i \langle w^{(t)}, \phi(x_i)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
• If $\|w^{(t+1)}\|_2 > B$: $w^{(t+1)} \leftarrow B\,\dfrac{w^{(t+1)}}{\|w^{(t+1)}\|_2}$

Return $\bar w^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

If $\|\phi(x)\|_2 \le G$ then $\|g_t\|_2 \le G$, and
$$L_S\big(\bar w^{(T)}\big) \;\le\; L_S(\hat w) + \sqrt{\frac{B^2G^2}{T}}$$
(in expectation over the randomness in the algorithm)
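A hedged NumPy sketch of this projected SGD for the hinge loss; the feature map is taken to be the identity and the step size $\eta_t = B/(G\sqrt{t})$ is an assumption consistent with the bound above:

```python
import numpy as np

def sgd_svm(X, Y, B, T, rng=None):
    """Projected SGD on the hinge loss over {w : ||w||_2 <= B}.

    X: (m, d) feature rows (phi(x_i)), Y: labels in {+1, -1}.
    Returns the averaged iterate.
    """
    rng = rng or np.random.default_rng()
    m, d = X.shape
    G = np.linalg.norm(X, axis=1).max()      # bound on ||phi(x)||_2, hence on ||g_t||_2
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                  # pick i in 1..m at random
        if Y[i] * (w @ X[i]) < 1:            # margin violated: step along y_i * phi(x_i)
            w = w + (B / (G * np.sqrt(t))) * Y[i] * X[i]
        nrm = np.linalg.norm(w)
        if nrm > B:                          # project back onto the B-ball
            w *= B / nrm
        w_sum += w
    return w_sum / T
```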

Page 9

Stochastic vs Batch

[Diagram: a training set $(x_1, y_1), \dots, (x_m, y_m)$ with per-example gradients $g_i = \nabla \mathrm{loss}(w, (x_i, y_i))$. Batch gradient descent aggregates all of them into $\nabla L_S(w) = \frac{1}{m}\sum_i g_i$ before each update, while SGD performs an update $w \leftarrow w - g_i$ after each single example.]

Objective: $\min L_S(w)$ s.t. $\|w\|_2 \le B$.

Page 10

Stochastic vs Batch

• Intuitive argument: if only taking simple gradient steps, better to be stochastic
• To get $L_S(w) \le L_S(\hat w) + \epsilon_{\mathrm{opt}}$:

| | #iter | cost/iter | runtime |
|---|---|---|---|
| Batch GD | $B^2G^2/\epsilon_{\mathrm{opt}}^2$ | $md$ | $md\cdot\dfrac{B^2G^2}{\epsilon_{\mathrm{opt}}^2}$ |
| SGD | $B^2G^2/\epsilon_{\mathrm{opt}}^2$ | $d$ | $d\cdot\dfrac{B^2G^2}{\epsilon_{\mathrm{opt}}^2}$ |

• Comparison to methods with a $\log(1/\epsilon_{\mathrm{opt}})$ dependence that use the structure of $L_S(w)$ (not only local access)?
• How small should $\epsilon_{\mathrm{opt}}$ be?
• What about $L(w)$, which is what we really care about?

Page 11

Overall Analysis of $L_\mathcal{D}(w)$

• Recall for ERM: $L_\mathcal{D}(\hat w) \le L_\mathcal{D}(w^*) + 2\sup_w \big|L_\mathcal{D}(w) - L_S(w)\big|$
• For an $\epsilon_{\mathrm{opt}}$-suboptimal ERM $\bar w$:
  $$L_\mathcal{D}(\bar w) \;\le\; L_\mathcal{D}(w^*) + 2\sup_w \big|L_\mathcal{D}(w) - L_S(w)\big| + \big(L_S(\bar w) - L_S(\hat w)\big)$$
  where $\hat w = \arg\min_{\|w\|\le B} L_S(w)$ and $w^* = \arg\min_{\|w\|\le B} L_\mathcal{D}(w)$; the terms correspond to $\epsilon_{\mathrm{approx}}$, $\epsilon_{\mathrm{est}} \le 2\sqrt{\frac{B^2G^2}{m}}$, and $\epsilon_{\mathrm{opt}} \le \sqrt{\frac{B^2G^2}{T}}$
• Take $\epsilon_{\mathrm{opt}} \approx \epsilon_{\mathrm{est}}$, i.e. #iterations $T \approx$ sample size $m$
• To ensure $L_\mathcal{D}(\bar w) \le L_\mathcal{D}(w^*) + \epsilon$:
  $$T, m = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$$

Page 12

Direct Online-to-Batch: SGD on $L_\mathcal{D}(w)$

$$\min_w L_\mathcal{D}(w)$$

Use $g_t = \nabla_w\,\mathrm{hinge}\big(y\langle w, \phi(x)\rangle\big)$ for a random $(x, y)\sim\mathcal{D}$, so that $\mathbb{E}[g_t] = \nabla L_\mathcal{D}(w)$.

Initialize $w^{(0)} = 0$. At iteration $t$:
• Draw $(x_t, y_t) \sim \mathcal{D}$
• If $y_t \langle w^{(t)}, \phi(x_t)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$; else: $w^{(t+1)} \leftarrow w^{(t)}$

Return $\bar w^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

$$L_\mathcal{D}\big(\bar w^{(T)}\big) \;\le\; \inf_{\|w\|_2\le B} L_\mathcal{D}(w) + \sqrt{\frac{B^2G^2}{T}}, \qquad m = T = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$$

Page 13

SGD for Machine Learning

Direct SA (online2batch) approach: $\min_w L(w)$
• Initialize $w^{(0)} = 0$. At iteration $t$: draw $(x_t, y_t)\sim\mathcal{D}$; if $y_t\langle w^{(t)}, \phi(x_t)\rangle < 1$, set $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$, else $w^{(t+1)} \leftarrow w^{(t)}$. Return $\bar w^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$
• Fresh sample at each iteration, $m = T$
• No need to project nor require $\|w\| \le B$
• Implicit regularization via early stopping

SGD on ERM: $\min_{\|w\|_2\le B} L_S(w)$
• Draw $(x_1, y_1), \dots, (x_m, y_m) \sim \mathcal{D}$ up front
• Initialize $w^{(0)} = 0$. At iteration $t$: pick $i \in 1..m$ at random; if $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$, set $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$, else $w^{(t+1)} \leftarrow w^{(t)}$; then project $w^{(t+1)}$ onto $\|w\| \le B$. Return $\bar w^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$
• Can have $T > m$ iterations
• Need to project onto $\|w\| \le B$
• Explicit regularization via $\|w\|$

Page 14

SGD for Machine Learning

The same two procedures, with step size $\eta_t = \sqrt{B^2/(G^2 t)}$ and $w^* = \arg\min_{\|w\|\le B} L(w)$:

Direct SA (online2batch) approach, $\min_w L(w)$:
$$L\big(\bar w^{(T)}\big) \;\le\; L(w^*) + \sqrt{\frac{B^2G^2}{T}}$$

SGD on ERM, $\min_{\|w\|_2\le B} L_S(w)$:
$$L\big(\bar w^{(T)}\big) \;\le\; L(w^*) + 2\sqrt{\frac{B^2G^2}{m}} + \sqrt{\frac{B^2G^2}{T}}$$

Page 15

SGD for Machine Learning

As above, but the ERM variant now runs SGD on the regularized empirical risk (RERM), replacing the projection step with a shrinkage step:

Direct SA (online2batch) approach, $\min_w L(w)$:
• Fresh sample at each iteration, $m = T$
• No need to shrink $w$
• Implicit regularization via early stopping

SGD on RERM, $\min_w L_S(w) + \frac{\lambda}{2}\|w\|^2$:
• Same update on a random training example, followed by the shrinkage $w^{(t+1)} \leftarrow w^{(t+1)} - \lambda w^{(t)}$
• Can have $T > m$ iterations
• Need to shrink $w$
• Explicit regularization via $\|w\|$
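A minimal sketch of the regularized variant (NumPy, identity features). The shrinkage is written here as the standard $(1 - \eta_t\lambda)$ factor with $\eta_t = 1/(\lambda t)$, a Pegasos-style choice assumed for concreteness rather than taken from the slide:

```python
import numpy as np

def sgd_regularized_erm(X, Y, lam, T, rng=None):
    """SGD on the regularized empirical risk L_S(w) + (lam/2) ||w||^2.

    Each step shrinks w (the gradient of the regularizer) and, if the sampled
    example is margin-violating, adds eta_t * y_i * x_i. No projection needed.
    """
    rng = rng or np.random.default_rng()
    m, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        eta = 1.0 / (lam * t)            # assumed step size (Pegasos-style)
        i = rng.integers(m)
        w = (1 - eta * lam) * w          # shrinkage from the (lam/2)||w||^2 term
        if Y[i] * (w @ X[i]) < 1:        # hinge subgradient on the sampled example
            w = w + eta * Y[i] * X[i]
        w_sum += w
    return w_sum / T
```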

Page 16

SGD vs ERM

[Diagram: starting from $w^{(0)}$, SGD reaches $\bar w^{(m)}$, which lies within $O\big(\sqrt{B^2G^2/m}\big)$ of both $\hat w = \arg\min_{\|w\|\le B} L_S(w)$ and $w^* = \arg\min_{\|w\|\le B} L(w)$; running $T > m$ iterations with projection moves toward $\hat w$, while the unconstrained $\arg\min_w L_S(w)$ overfits.]

Page 17

Mixed Approach: SGD on ERM

[Plot: test misclassification error (about 0.052-0.058) vs. number of SGD iterations (up to 3,000,000) on the Reuters RCV1 data, CCAT task, for training-set sizes m = 300,000, 400,000, 500,000.]

• The mixed approach (reusing examples) can make sense

Page 18

Mixed Approach: SGD on ERM

[Same plot as on the previous page.]

• The mixed approach (reusing examples) can make sense
• Still: fresh samples are better
  • With a larger training set, the generalization error can be reduced faster
  • A larger training set means less runtime to reach a target generalization error

Page 19

Online Optimization vs Stochastic Approximation

• In both the Online Setting and Stochastic Approximation:
  • Receive samples sequentially
  • Update $w$ after each sample
• But, in the Online Setting:
  • Objective is empirical regret, i.e. behavior on observed instances
  • $z_t$ chosen adversarially (no distribution involved)
• As opposed to Stochastic Approximation:
  • Objective is $\mathbb{E}[\ell(w, z)]$, i.e. behavior on "future" samples
  • i.i.d. samples $z_t$
• Stochastic Approximation is a computational approach; Online Learning is an analysis setup
  • E.g. "Follow the Leader"

Page 20

Part II: Realizable vs Agnostic Rates

Page 21

Realizable vs Agnostic Rates

• Recall for finite hypothesis classes:
  $$L_\mathcal{D}(\hat h) \;\le\; \inf_{h\in\mathcal{H}} L_\mathcal{D}(h) + 2\sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2m}}, \qquad m = O\!\left(\frac{\log|\mathcal{H}|}{\epsilon^2}\right)$$
• But in the realizable case, if $\inf_{h\in\mathcal{H}} L_\mathcal{D}(h) = 0$:
  $$L_\mathcal{D}(\hat h) \;\le\; \frac{\log|\mathcal{H}| + \log(1/\delta)}{m}, \qquad m = O\!\left(\frac{\log|\mathcal{H}|}{\epsilon}\right)$$
• Also for VC classes: in general $m = O\!\left(\frac{\mathrm{VCdim}}{\epsilon^2}\right)$, while in the realizable case $m = O\!\left(\frac{\mathrm{VCdim}\cdot\log(1/\epsilon)}{\epsilon}\right)$
• What happens if $L^* = \inf_{h\in\mathcal{H}} L_\mathcal{D}(h)$ is low, but not zero?

Page 22

Estimating the Bias of a Coin

$$|p - \hat p| \;\le\; \sqrt{\frac{\log(2/\delta)}{2m}}$$

$$|p - \hat p| \;\le\; \sqrt{\frac{2\,p\log(2/\delta)}{m}} + \frac{2\log(2/\delta)}{3m}$$
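A small numeric sketch of these two deviation bounds in plain Python (the particular m, δ, and p values are illustrative only):

```python
import math

def additive_bound(m, delta):
    """Hoeffding-type bound: |p - p_hat| <= sqrt(log(2/delta) / (2m))."""
    return math.sqrt(math.log(2 / delta) / (2 * m))

def multiplicative_bound(p, m, delta):
    """Bernstein-type bound: sqrt(2 p log(2/delta) / m) + 2 log(2/delta) / (3m)."""
    log_term = math.log(2 / delta)
    return math.sqrt(2 * p * log_term / m) + 2 * log_term / (3 * m)

m, delta = 10_000, 0.05
for p in (0.5, 0.01):
    print(p, round(additive_bound(m, delta), 4), round(multiplicative_bound(p, m, delta), 4))
# For p = 0.5 the two bounds are comparable; for p = 0.01 the second,
# p-dependent bound is several times tighter.
```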

Page 23

Optimistic VC Bound (aka $L^*$-bound, multiplicative bound)

$$\hat h = \arg\min_{h\in\mathcal{H}} L_S(h), \qquad L^* = \inf_{h\in\mathcal{H}} L(h)$$

• For a hypothesis class with VC-dim $D$, w.p. $1-\delta$ over $m$ samples:
  $$L(\hat h) \;\le\; L^* + 2\sqrt{L^*\,\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}} + 4\,\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}$$
  $$= \inf_{\alpha}\ (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\frac{4\big(D\log\frac{2em}{D} + \log\frac{2}{\delta}\big)}{m}$$
• Sample complexity to get $L(\hat h) \le L^* + \epsilon$:
  $$m(\epsilon) = O\!\left(\frac{D}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\log\frac{1}{\epsilon}\right)$$
• Extends to bounded real-valued losses in terms of the VC-subgraph dimension

Page 24

From Parametric to Scale-Sensitive

$$L(h) = \mathbb{E}\big[\mathrm{loss}(h(x), y)\big], \qquad h \in \mathcal{H}$$

• Instead of VC-dim or VC-subgraph-dim ($\approx$ #params), rely on metric scale to control complexity, e.g.:
  $$\mathcal{H} = \{\, x \mapsto \langle w, x\rangle \;:\; \|w\|_2 \le B \,\}$$
• Learning depends on:
  • Metric complexity measures: fat-shattering dimension, covering numbers, Rademacher complexity
  • Scale sensitivity of the loss (bound on derivatives, or "margin")
• For $\mathcal{H}$ with Rademacher complexity $\mathcal{R}_m \le \sqrt{R/m}$, and $|\mathrm{loss}'| \le G$:
  $$L(\hat h) \;\le\; L^* + 2G\mathcal{R}_m + \sqrt{\frac{\log(2/\delta)}{2m}} \;\le\; L^* + O\!\left(\sqrt{\frac{G^2 R + \log(2/\delta)}{2m}}\right)$$
  For the linear class above, $\mathcal{R}_m(\mathcal{H}) = \sqrt{\dfrac{B^2\sup\|x\|_2^2}{m}}$.
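A sketch estimating the empirical Rademacher complexity of the bounded linear class by Monte Carlo (NumPy; the data here are synthetic, and the closed form quoted above upper-bounds this empirical quantity):

```python
import numpy as np

def rademacher_linear(X, B, n_draws=2000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= B} on the sample X (shape m x d).

    For this class, sup_{||w||<=B} (1/m) sum_i sigma_i <w, x_i>
    equals (B/m) ||sum_i sigma_i x_i||_2, so we average that over random signs.
    """
    rng = rng or np.random.default_rng(0)
    m = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))          # random sign vectors
    return (B / m) * np.linalg.norm(sigma @ X, axis=1).mean()

X = np.random.default_rng(1).normal(size=(500, 20))
print(rademacher_linear(X, B=1.0))                              # Monte Carlo estimate
print(1.0 * np.linalg.norm(X, axis=1).max() / np.sqrt(len(X)))  # B * sup||x||_2 / sqrt(m)
```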

Page 25

Non-Parametric Optimistic Rate for Smooth Loss

• Theorem: for any $\mathcal{H}$ with (worst-case) Rademacher complexity $\mathcal{R}_m(\mathcal{H})$, and any smooth loss with $|\mathrm{loss}''| \le H$ and $|\mathrm{loss}| \le b$, w.p. $1-\delta$ over $m$ samples:
• Sample complexity:

Page 26

Parametric vs Non-Parametric

| | Parametric: $\dim(\mathcal{H}) \le D$, $|h| \le 1$ | Scale-sensitive: $\mathcal{R}_m(\mathcal{H}) \le \sqrt{R/m}$ |
|---|---|---|
| Lipschitz: $|\phi'| \le G$ (e.g. hinge, $\ell_1$) | $\frac{GD}{m} + \sqrt{\frac{L^* GD}{m}}$ | $\sqrt{\frac{G^2 R}{m}}$ |
| Smooth: $|\phi''| \le H$ (e.g. logistic, Huber, smoothed hinge) | $\frac{HD}{m} + \sqrt{\frac{L^* HD}{m}}$ | $\frac{HR}{m} + \sqrt{\frac{L^* HR}{m}}$ |
| Smooth & strongly convex: $\mu \le \phi'' \le H$ (e.g. square loss) | $\frac{H}{\mu}\cdot\frac{HD}{m}$ | $\frac{HR}{m} + \sqrt{\frac{L^* HR}{m}}$ |

Min-max tight up to poly-log factors.

Page 27

Optimistic Learning Guarantees

$$L(\hat h) \;\le\; (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\tilde O\!\left(\frac{R}{m}\right), \qquad m(\epsilon) \le \tilde O\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$$

Guarantees of this optimistic form hold for:
• Parametric classes
• Scale-sensitive classes with smooth loss
• Perceptron guarantee
• Margin bounds
• Stability-based guarantees with smooth loss
• Online learning/optimization with smooth loss

But not for:
× Non-parametric (scale-sensitive) classes with non-smooth loss
× Online learning/optimization with non-smooth loss

Page 28

Why Optimistic Guarantees?

$$L(\hat h) \;\le\; (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\tilde O\!\left(\frac{R}{m}\right), \qquad m(\epsilon) \le \tilde O\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$$

• The optimistic regime is typically the relevant regime:
  • Approximation error $L^*$ $\approx$ estimation error $\epsilon$
  • If $\epsilon \ll L^*$, better to spend energy on lowering the approximation error (use a more complex class)
• Often important in highlighting true phenomena

Page 29

Part III: Nearest Neighbor Classification

Page 30

The Nearest Neighbor Classifier

• Training sample $S = (x_1, y_1), \dots, (x_m, y_m)$
• Want to predict the label of a new point $x$
• The Nearest Neighbor Rule:
  • Find the closest training point: $i = \arg\min_i \rho(x, x_i)$
  • Predict the label of $x$ as $y_i$

Page 31

The Nearest Neighbor Classifier (cont.)

• As a learning rule: $\mathrm{NN}(S) = h$ where $h(x) = y_{\arg\min_i \rho(x, x_i)}$
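A minimal sketch of the rule (NumPy, with Euclidean $\rho$ and the representation taken to be the identity; names are illustrative):

```python
import numpy as np

def nn_predict(X_train, y_train, x):
    """1-NN rule: return the label of the training point closest to x
    under rho(x, x') = ||x - x'||_2."""
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[i]
```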

Page 32

[Illustration: with the rescaled representation $\tilde\phi(x) = (5x[1],\, x[2])$, the nearest neighbor under $\|\tilde\phi(x) - \tilde\phi(x')\|_2$ differs from the one under $\|x - x'\|_1$ or $\|x - x'\|_2$.]

Where is the Bias Hiding?

• What is the right "distance" between images? Between sound waves? Between sentences?
• Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$
  • What representation $\phi(x)$?

Page 33

[Illustration: the nearest neighbor likewise changes between $\|x - x'\|_1$ and $\|x - x'\|_\infty$.]

Where is the Bias Hiding? (cont.)

• Maybe a different distance? $\|\phi(x) - \phi(x')\|_1$? $\|\phi(x) - \phi(x')\|_\infty$? $\sin\big(\angle(\phi(x), \phi(x'))\big)$? $\mathrm{KL}\big(\phi(x)\,\|\,\phi(x')\big)$?

Page 34

Where is the Bias Hiding? (cont.)

• Option 2: a special-purpose distance measure on $x$, e.g. edit distance, deformation measure, etc.

Page 35

Nearest Neighbor Learning Guarantee

• Optimal predictor: $h^* = \arg\min_h L_\mathcal{D}(h)$, where
  $$h^*(x) = \begin{cases} +1, & \eta(x) > 0.5 \\ -1, & \eta(x) < 0.5 \end{cases} \qquad \eta(x) = P_\mathcal{D}(y = 1 \mid x)$$
• For the NN rule with $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$, $\phi: \mathcal{X} \to [0,1]^d$, and $|\eta(x) - \eta(x')| \le c_\mathcal{D}\cdot\rho(x, x')$:
  $$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(\mathrm{NN}(S))\big] \;\le\; 2L(h^*) + \frac{4\,c_\mathcal{D}\sqrt{d}}{m^{1/(d+1)}}$$

Page 36

Data Fit / Complexity Tradeoff

• k-Nearest Neighbor: predict according to the majority among the $k$ closest points from $S$.
• Recall for 1-NN:
  $$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(\mathrm{NN}(S))\big] \;\le\; 2L(h^*) + \frac{4\,c_\mathcal{D}\sqrt{d}}{m^{1/(d+1)}}$$
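A sketch of the k-NN rule for ±1 labels (NumPy; breaking ties toward +1 is an arbitrary choice, not specified on the slide):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """k-NN rule: majority vote among the k training points closest to x
    under the Euclidean distance; labels assumed to be +1 / -1."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return 1 if y_train[nearest].sum() >= 0 else -1
```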

Page 37

k-Nearest Neighbor: Data Fit / Complexity Tradeoff

[Figure: k-NN decision boundaries on a sample $S$ for k = 1, 5, 12, 50, 100, 200, shown alongside the optimal predictor $h^*$.]

Page 38

k-Nearest Neighbor Guarantee

• For k-NN with $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$, $\phi: \mathcal{X} \to [0,1]^d$, and $|\eta(x) - \eta(x')| \le c_\mathcal{D}\cdot\rho(x, x')$:
  $$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(\mathrm{NN}_k(S))\big] \;\le\; \left(1 + \sqrt{\frac{8}{k}}\right)L(h^*) + \frac{6\,c_\mathcal{D}\sqrt{d} + k}{m^{1/(d+1)}}$$
• Should increase $k$ with the sample size $m$
• The above theory suggests $k_m \propto L(h^*)^{2/3}\cdot m^{2/(3(d+1))}$
• "Universal" learning: for any "smooth" $\mathcal{D}$ and representation $\phi(\cdot)$ (with continuous $P(y\mid\phi(x))$), if we increase $k$ slowly enough, we will eventually converge to the optimal $L(h^*)$
• Very non-uniform: the sample complexity depends not only on $h^*$, but also on $\mathcal{D}$

Page 39

Uniform and Non-Uniform Learnability

• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\exists m(\epsilon,\delta)$, $\forall \mathcal{D}$, $\forall h$: w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta)}$,
  $$L_\mathcal{D}(A(S)) \le L_\mathcal{D}(h) + \epsilon$$
• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\exists m(\epsilon,\delta,h)$, $\forall \mathcal{D}$: w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h)}$,
  $$L_\mathcal{D}(A(S)) \le L_\mathcal{D}(h) + \epsilon$$
• Definition: A hypothesis class $\mathcal{H}$ is "consistently learnable" if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\forall \mathcal{D}$, $\exists m(\epsilon,\delta,h,\mathcal{D})$: w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h,\mathcal{D})}$,
  $$L_\mathcal{D}(A(S)) \le L_\mathcal{D}(h) + \epsilon$$

Realizable/Optimistic guarantees: $\mathcal{D}$ dependence through $L_\mathcal{D}(h)$
