Computational and Statistical Learning Theory
TTIC 31120
Prof. Nati Srebro
Lecture 17: Stochastic Optimization
Part II: Realizable vs Agnostic Rates
Part III: Nearest Neighbor Classification
Stochastic (Sub)-Gradient Descent
If $\|\nabla f(w,z)\|_2 \le G$, then with appropriate step size:
$$\mathbb{E}\left[F(\bar w_T)\right] \;\le\; \inf_{w \in \mathcal{W},\ \|w\|_2 \le B} F(w) \;+\; \sqrt{\frac{B^2 G^2}{T}}$$
Similarly, also Stochastic Mirror Descent
Optimize $F(w) = \mathbb{E}_{z \sim \mathcal{D}}[f(w,z)]$ s.t. $w \in \mathcal{W}$:
1. Initialize $w_1 = 0 \in \mathcal{W}$
2. At iteration $t = 1, 2, 3, \dots$
   1. Sample $z_t \sim \mathcal{D}$ (obtain $g_t$ s.t. $\mathbb{E}[g_t] \in \partial F(w_t)$)
   2. $w_{t+1} = \Pi_{\mathcal{W}}\!\left(w_t - \eta_t \nabla f(w_t, z_t)\right)$
3. Return $\bar w_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
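A minimal sketch of this projected SGD loop in Python (an illustration, not course code; `sample_z` and `subgradient` are placeholder names for the sampling oracle and the stochastic (sub)gradient, and $\mathcal{W}$ is taken to be the Euclidean ball of radius $B$):

```python
import numpy as np

def projected_sgd(subgradient, sample_z, dim, B, G, T):
    """Projected SGD returning the averaged iterate.

    subgradient(w, z) -- unbiased (sub)gradient of f(w, z), assumed ||g||_2 <= G
    sample_z()        -- draws a fresh z ~ D
    B                 -- radius of the ball W = {w : ||w||_2 <= B}
    """
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for t in range(1, T + 1):
        z = sample_z()
        g = subgradient(w, z)
        eta = B / (G * np.sqrt(t))   # step size giving the sqrt(B^2 G^2 / T) rate
        w = w - eta * g
        norm = np.linalg.norm(w)
        if norm > B:                 # Euclidean projection onto the ball
            w = (B / norm) * w
        w_sum += w
    return w_sum / T                 # averaged iterate, \bar{w}_T
```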
Online Gradient Descent [Zinkevich 03] → online2stochastic → Stochastic Gradient Descent [Cesa-Bianchi et al 02] [Nemirovski Yudin 78]
Online Mirror Descent [Shalev-Shwartz Singer 07] → online2stochastic → Stochastic Mirror Descent [Nemirovski Yudin 78]
Stochastic Optimization
$$\min_{w \in \mathcal{W}} F(w) = \mathbb{E}_{z \sim \mathcal{D}}[f(w,z)]$$
based only on stochastic information on $F$:
• Only unbiased estimates of $F(w)$, $\nabla F(w)$
• No direct access to $F$
E.g., fixed $f(w,z)$ but $\mathcal{D}$ unknown:
• Optimize $F(w)$ based on an i.i.d. sample $z_1, z_2, \dots, z_m \sim \mathcal{D}$
• $g = \nabla f(w, z_t)$ is an unbiased estimate of $\nabla F(w)$
• Traditional applications:
  • Optimization under uncertainty
    • Uncertainty about network performance
    • Uncertainty about client demands
    • Uncertainty about system behavior in control problems
  • Complex systems where it's easier to sample than to integrate over $z$
Machine Learning is Stochastic Optimization
$$\min_h \; L_{\mathcal D}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)] = \mathbb{E}_{(x,y) \sim \mathcal{D}}[loss(h(x), y)]$$
• Optimization variable: predictor $h$
• Objective: generalization error $L_{\mathcal D}(h)$
• Stochasticity over $z = (x, y)$
"General Learning" ≡ Stochastic Optimization:
Vladimir Vapnik
Arkadi Nemirovskii
Statistical Learning vs Stochastic Optimization

Stochastic Optimization (Arkadi Nemirovskii):
• Focus on computational efficiency
• Generally assumes unlimited sampling, as in Monte Carlo methods for complicated objectives
• Optimization variable generally a vector in a normed space; complexity control through norm
• Mostly convex objectives

Statistical Learning (Vladimir Vapnik):
• Focus on sample size
• What can be done with a fixed number of samples?
• Abstract hypothesis classes: linear predictors, but also combinatorial hypothesis classes; generic measures of complexity such as VC dimension, fat-shattering dimension, Rademacher complexity
• Also non-convex classes and loss functions
Two Approaches to Stochastic Optimization / Learning
$$\min_{w \in \mathcal{W}} F(w) = \mathbb{E}_{z \sim \mathcal{D}}[f(w,z)]$$
• Empirical Risk Minimization (ERM) / Sample Average Approximation (SAA):
  • Collect sample $z_1, \dots, z_m$
  • Minimize $F_S(w) = \frac{1}{m}\sum_i f(w, z_i)$
  • Analysis typically based on Uniform Convergence
• Stochastic Approximation (SA): [Robbins Monro 1951]
  • Update $w_t$ based on $z_t$
  • E.g., based on $g_t = \nabla f(w, z_t)$
  • E.g.: stochastic gradient descent
  • Online-to-batch conversion of an online algorithm…
SA/SGD for Machine Learning
• In learning with ERM, we need to optimize
$$\hat w = \arg\min_{w \in \mathcal{W}} L_S(w), \qquad L_S(w) = \frac{1}{m}\sum_i \ell(w, z_i)$$
• $L_S(w)$ is expensive to evaluate exactly → $O(md)$ time
• Cheap to get an unbiased gradient estimate → $O(d)$ time:
$$i \sim \mathrm{Unif}(1 \dots m), \quad g = \nabla \ell(w, z_i), \quad \mathbb{E}[g] = \tfrac{1}{m}\textstyle\sum_i \nabla \ell(w, z_i) = \nabla L_S(w)$$
• SGD guarantee:
$$\mathbb{E}\left[L_S(\bar w_T)\right] \;\le\; \inf_{w \in \mathcal{W}} L_S(w) + \sqrt{\frac{\sup\|\nabla \ell\|_2^2 \,\cdot\, \sup\|w\|_2^2}{T}}$$
SGD for SVM
$$\min L_S(w) \quad \text{s.t.}\quad \|w\|_2 \le B$$
Use $g_t = \nabla_w\, loss_{hinge}\!\left(\langle w_t, \phi(x_{i_t})\rangle;\, y_{i_t}\right)$ for a random $i_t$:

Initialize $w^{(0)} = 0$
At iteration $t$:
• Pick $i \in 1 \dots m$ at random
• If $y_i \langle w^{(t)}, \phi(x_i)\rangle < 1$: $\ w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
• If $\|w^{(t+1)}\|_2 > B$: $\ w^{(t+1)} \leftarrow B\, \frac{w^{(t+1)}}{\|w^{(t+1)}\|_2}$
Return $\bar w^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

$$\|\phi(x)\|_2 \le G \;\Rightarrow\; \|g_t\|_2 \le G \;\Rightarrow\; L_S(\bar w^{(T)}) \le L_S(\hat w) + \sqrt{\frac{B^2 G^2}{T}}$$
(in expectation over the randomness in the algorithm)
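A runnable sketch of this loop (an illustration under simplifying assumptions: the feature map $\phi$ is the identity, and `X`, `y` are a NumPy array of examples and a vector of ±1 labels):

```python
import numpy as np

def sgd_svm(X, y, B, T, G=None):
    """SGD on the empirical hinge loss, constrained to the ball ||w||_2 <= B."""
    m, d = X.shape
    if G is None:
        G = np.max(np.linalg.norm(X, axis=1))  # bound on ||phi(x)||_2, hence on ||g_t||_2
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        i = np.random.randint(m)               # pick a training example uniformly at random
        if y[i] * np.dot(w, X[i]) < 1:         # hinge loss active: take a subgradient step
            w = w + (B / (G * np.sqrt(t))) * y[i] * X[i]
        norm = np.linalg.norm(w)
        if norm > B:                           # project back onto {w : ||w||_2 <= B}
            w = (B / norm) * w
        w_sum += w
    return w_sum / T                           # averaged iterate

# Hypothetical usage:
# X = np.random.randn(1000, 20); y = np.sign(X[:, 0])
# w_bar = sgd_svm(X, y, B=10.0, T=5000)
```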
Stochastic vs Batch
[Figure: for $\min L_S(w)$ s.t. $\|w\|_2 \le B$ over a training set $(x_1,y_1),\dots,(x_m,y_m)$ with per-example gradients $g_i = \nabla loss(w, (x_i, y_i))$, batch gradient descent uses the full gradient $\nabla L_S(w) = \frac{1}{m}\sum_i g_i$ at every step, whereas SGD takes a step $w \leftarrow w - \eta\, g_i$ using a single example at a time.]
Stochastic vs Batch
• Intuitive argument: if only taking simple gradient steps, better to be stochastic
• To get $L_S(w) \le L_S(\hat w) + \epsilon_{opt}$:
             #iter                        cost/iter   runtime
Batch GD     $B^2G^2/\epsilon_{opt}^2$    $md$        $md \cdot B^2G^2/\epsilon_{opt}^2$
SGD          $B^2G^2/\epsilon_{opt}^2$    $d$         $d \cdot B^2G^2/\epsilon_{opt}^2$
• Comparison to methods with a $\log(1/\epsilon_{opt})$ dependence that use the structure of $L_S(w)$ (not only local access)?
• How small should $\epsilon_{opt}$ be?
• What about $L_{\mathcal D}(w)$, which is what we really care about?
Overall Analysis of $L_{\mathcal D}(\bar w)$
• Recall for ERM: $L_{\mathcal D}(\hat w) \le L_{\mathcal D}(w^*) + 2\sup_w \left|L_{\mathcal D}(w) - L_S(w)\right|$
• For an $\epsilon_{opt}$-suboptimal ERM $\bar w$ (approximation + estimation + optimization error):
$$L_{\mathcal D}(\bar w) \;\le\; L_{\mathcal D}(w^*) \;+\; 2\sup_w \left|L_{\mathcal D}(w) - L_S(w)\right| \;+\; \left(L_S(\bar w) - L_S(\hat w)\right)$$
where $\hat w = \arg\min_{\|w\| \le B} L_S(w)$, $w^* = \arg\min_{\|w\| \le B} L_{\mathcal D}(w)$, and for SGD
$$\epsilon_{est} \le 2\sqrt{\frac{B^2G^2}{m}}, \qquad \epsilon_{opt} \le \sqrt{\frac{B^2G^2}{T}}$$
• Take $\epsilon_{opt} \approx \epsilon_{est}$, i.e. #iterations $T \approx$ sample size $m$
• To ensure $L_{\mathcal D}(\bar w) \le L_{\mathcal D}(w^*) + \epsilon$:
$$m,\, T = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$$
Direct Online-to-Batch: SGD on $L_{\mathcal D}(w)$
$$\min_w L_{\mathcal D}(w)$$
Use $g_t = \nabla_w\, loss_{hinge}\!\left(y\langle w, \phi(x)\rangle\right)$ for a random $(x, y) \sim \mathcal{D}$, so that $\mathbb{E}[g_t] = \nabla L_{\mathcal D}(w_t)$:

Initialize $w^{(0)} = 0$
At iteration $t$:
• Draw $(x_t, y_t) \sim \mathcal{D}$
• If $y_t \langle w^{(t)}, \phi(x_t)\rangle < 1$: $\ w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
Return $\bar w^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

$$L_{\mathcal D}(\bar w^{(T)}) \;\le\; \inf_{\|w\|_2 \le B} L_{\mathcal D}(w) + \sqrt{\frac{B^2G^2}{T}}
\qquad\Rightarrow\qquad m = T = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$$
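A corresponding sketch of this streaming (online-to-batch) variant (hedged: `sample_example` stands in for drawing a fresh $(x, y) \sim \mathcal{D}$, i.e. reading the next unseen example; as on the slide, no projection is applied, and the step size is tuned for comparators with $\|w\|_2 \le B$):

```python
import numpy as np

def streaming_sgd_hinge(sample_example, dim, B, G, T):
    """Online-to-batch SGD on the population hinge loss: one fresh example per step."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for t in range(1, T + 1):
        x, y = sample_example()            # fresh (x, y) ~ D, never reused
        if y * np.dot(w, x) < 1:           # hinge subgradient step
            w = w + (B / (G * np.sqrt(t))) * y * x
        w_sum += w                         # no projection: regularization is implicit
    return w_sum / T
```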
SGD for Machine Learning
Two variants of the same update: the direct SA (online2batch) approach runs SGD on $\min_w L_{\mathcal D}(w)$, drawing a fresh $(x_t, y_t) \sim \mathcal{D}$ at each iteration (as above); SGD on ERM runs on $\min_{\|w\|_2 \le B} L_S(w)$, picking $i \in 1 \dots m$ at random at each iteration and projecting onto $\|w\| \le B$ (as above).

Direct SA approach:
• Fresh sample at each iteration, $T = m$
• No need to project nor require $\|w\| \le B$
• Implicit regularization via early stopping

SGD on ERM:
• Can have $T > m$ iterations
• Need to project to $\|w\| \le B$
• Explicit regularization via $\|w\|$
SGD for Machine Learning
Same two algorithms as above (direct SA / online2batch on $\min_w L_{\mathcal D}(w)$ vs. SGD on ERM, $\min_{\|w\|_2 \le B} L_S(w)$), with step size $\eta_t = \sqrt{B^2/(G^2 t)}$ and $w^* = \arg\min_{\|w\| \le B} L_{\mathcal D}(w)$:

Direct SA (online2batch) approach:
$$L_{\mathcal D}(\bar w^{(T)}) \;\le\; L_{\mathcal D}(w^*) + \sqrt{\frac{B^2G^2}{T}}$$
SGD on ERM:
$$L_{\mathcal D}(\bar w^{(T)}) \;\le\; L_{\mathcal D}(w^*) + 2\sqrt{\frac{B^2G^2}{m}} + \sqrt{\frac{B^2G^2}{T}}$$
SGD for Machine Learning
Direct SA (online2batch) approach, SGD on $\min_w L_{\mathcal D}(w)$, vs. SGD on regularized ERM (RERM), $\min_w L_S(w) + \frac{\lambda}{2}\|w\|^2$: the same update as above, with the projection step replaced by the shrinkage step $w^{(t+1)} \leftarrow w^{(t+1)} - \eta_t \lambda\, w^{(t)}$.

Direct SA approach:
• Fresh sample at each iteration, $T = m$
• No need to shrink $w$
• Implicit regularization via early stopping

SGD on RERM:
• Can have $T > m$ iterations
• Need to shrink $w$
• Explicit regularization via $\|w\|$
SGD vs ERM
[Figure: schematic of the iterates in the ball $\|w\| \le B$. Starting from $w^{(0)}$, the SGD average $\bar w^{(T)}$ approaches $w^* = \arg\min_{\|w\| \le B} L_{\mathcal D}(w)$ to within $\sqrt{B^2G^2/m}$; with $T > m$ iterations and projection the iterates move toward $\hat w = \arg\min_{\|w\| \le B} L_S(w)$, and heading further toward the unconstrained $\arg\min_w L_S(w)$ overfits.]
Mixed Approach: SGD on ERM
[Figure: test misclassification error (0.052 to 0.058) vs. number of SGD iterations (up to 3,000,000) on the Reuters RCV1 data, CCAT task, for training set sizes m = 300,000, 400,000, 500,000.]
• The mixed approach (reusing examples) can make sense
• Still: fresh samples are better
  ⇒ With a larger training set, can reduce generalization error faster
  ⇒ A larger training set means less runtime to get the target generalization error
Online Optimization vs Stochastic Approximation
• In both the Online Setting and Stochastic Approximation:
  • Receive samples sequentially
  • Update $w$ after each sample
• But in the Online Setting:
  • Objective is empirical regret, i.e. behavior on observed instances
  • $z_t$ chosen adversarially (no distribution involved)
• As opposed to Stochastic Approximation:
  • Objective is $\mathbb{E}[\ell(w, z)]$, i.e. behavior on "future" samples
  • i.i.d. samples $z_t$
• Stochastic Approximation is a computational approach; Online Learning is an analysis setup
  • E.g. "Follow the Leader"
Part II: Realizable vs Agnostic Rates
Realizable vs Agnostic Rates
• Recall for finite hypothesis classes:
$$L_{\mathcal D}(\hat h) \;\le\; \inf_{h \in \mathcal{H}} L_{\mathcal D}(h) + 2\sqrt{\frac{\log|\mathcal{H}| + \log\frac{2}{\delta}}{2m}}
\qquad\qquad m(\epsilon) = O\!\left(\frac{\log|\mathcal{H}|}{\epsilon^2}\right)$$
• But in the realizable case, if $\inf_{h \in \mathcal{H}} L_{\mathcal D}(h) = 0$:
$$L_{\mathcal D}(\hat h) \;\le\; \frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{m}
\qquad\qquad m(\epsilon) = O\!\left(\frac{\log|\mathcal{H}|}{\epsilon}\right)$$
• Also for VC classes: in general $m(\epsilon) = O\!\left(\frac{VCdim}{\epsilon^2}\right)$, while in the realizable case $m(\epsilon) = O\!\left(\frac{VCdim \cdot \log(1/\epsilon)}{\epsilon}\right)$
• What happens if $L^* = \inf_{h \in \mathcal{H}} L_{\mathcal D}(h)$ is low, but not zero?
Estimating the Bias of a Coin
$$\left|p - \hat p\right| \;\le\; \sqrt{\frac{\log\frac{2}{\delta}}{2m}}$$
$$\left|p - \hat p\right| \;\le\; \sqrt{\frac{2\,p\,\log\frac{2}{\delta}}{m}} + \frac{2\log\frac{2}{\delta}}{3m}$$
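To see numerically why the second, variance-sensitive bound is the interesting one when the bias is small, here is a quick comparison (an illustrative sketch; the values p = 0.01, m = 1000, δ = 0.05 are arbitrary):

```python
import math

def hoeffding_bound(m, delta):
    # |p - p_hat| <= sqrt(log(2/delta) / (2m))
    return math.sqrt(math.log(2 / delta) / (2 * m))

def bernstein_bound(p, m, delta):
    # |p - p_hat| <= sqrt(2 p log(2/delta) / m) + 2 log(2/delta) / (3m)
    return math.sqrt(2 * p * math.log(2 / delta) / m) + 2 * math.log(2 / delta) / (3 * m)

p, m, delta = 0.01, 1000, 0.05
print(hoeffding_bound(m, delta))     # ~0.043, independent of p
print(bernstein_bound(p, m, delta))  # ~0.011, much tighter when p is close to 0
```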
Optimistic VC bound (aka $L^*$-bound, multiplicative bound)
$$\hat h = \arg\min_{h \in \mathcal{H}} L_S(h) \qquad\qquad L^* = \inf_{h \in \mathcal{H}} L_{\mathcal D}(h)$$
• For a hypothesis class with VC-dim $D$, w.p. $1-\delta$ over $n$ samples:
$$L_{\mathcal D}(\hat h) \;\le\; L^* + 2\sqrt{L^*\,\frac{D\log\frac{2en}{D} + \log\frac{2}{\delta}}{n}} + 4\,\frac{D\log\frac{2en}{D} + \log\frac{2}{\delta}}{n}$$
$$= \inf_{\alpha}\; (1+\alpha)\,L^* + \left(1 + \frac{1}{\alpha}\right)\frac{4\left(D\log\frac{2en}{D} + \log\frac{2}{\delta}\right)}{n}$$
• Sample complexity to get $L_{\mathcal D}(\hat h) \le L^* + \epsilon$:
$$n(\epsilon) = O\!\left(\frac{D}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\cdot\log\frac{1}{\epsilon}\right)$$
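As a sanity check on how this sample complexity follows from the bound above (a rough sketch, ignoring constants and absorbing the $\log\frac{2en}{D}$ factor into the final $\log\frac{1}{\epsilon}$):

```latex
% Write C = D\log\tfrac{2en}{D} + \log\tfrac{2}{\delta}, so the bound reads
%   L_{\mathcal D}(\hat h) - L^* \le 2\sqrt{L^* C / n} + 4C/n .
% Requiring each term to be at most \epsilon/2 gives
\[
  n \gtrsim \frac{L^* C}{\epsilon^2}
  \qquad\text{and}\qquad
  n \gtrsim \frac{C}{\epsilon},
\]
% and since \max(L^* C/\epsilon^2,\, C/\epsilon) \asymp \frac{C}{\epsilon}\cdot\frac{L^*+\epsilon}{\epsilon},
\[
  n(\epsilon) \;=\; O\!\left(\frac{C}{\epsilon}\cdot\frac{L^*+\epsilon}{\epsilon}\right)
  \;=\; O\!\left(\frac{D}{\epsilon}\cdot\frac{L^*+\epsilon}{\epsilon}\cdot\log\frac{1}{\epsilon}\right).
\]
```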
• Extends to bounded real-valued loss in terms of VC-subgraph dim
From Parametric to Scale Sensitive
$$L_{\mathcal D}(h) = \mathbb{E}\left[loss(h(x), y)\right], \qquad h \in \mathcal{H}$$
• Instead of VC-dim or VC-subgraph-dim (≈ #params), rely on metric scale to control complexity, e.g.:
$$\mathcal{H} = \{\, x \mapsto \langle w, x\rangle \;:\; \|w\|_2 \le B \,\}$$
• Learning depends on:
  • Metric complexity measures: fat-shattering dimension, covering numbers, Rademacher complexity
  • Scale sensitivity of the loss (bound on derivatives, or "margin")
• For $\mathcal{H}$ with Rademacher complexity $\mathcal{R}_m(\mathcal{H})$, and $|loss'| \le G$:
$$L_{\mathcal D}(\hat h) \;\le\; L^* + 2G\,\mathcal{R}_m(\mathcal{H}) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}
\;\le\; L^* + O\!\left(\sqrt{\frac{G^2 R + \log\frac{2}{\delta}}{m}}\right)$$
with $\mathcal{R}_m(\mathcal{H}) \le \sqrt{R/m}$; e.g. for the class above, $\mathcal{R}_m(\mathcal{H}) = \sqrt{\frac{B^2 \sup\|x\|_2^2}{m}}$, i.e. $R = B^2 \sup\|x\|_2^2$.
Non-Parametric Optimistic Rate for Smooth Loss
• Theorem: for any $\mathcal{H}$ with (worst case) Rademacher complexity $\mathcal{R}_n(\mathcal{H})$, and any smooth loss with $|loss''| \le H$ and $loss \le b$, w.p. $1-\delta$ over $n$ samples:
• Sample complexity
Parametric vs Non-Parametric
(Parametric: $\dim(\mathcal{H}) \le D$, $loss \le b$; Scale-sensitive: $\mathcal{R}_n(\mathcal{H}) \le \sqrt{R/n}$)
• Lipschitz loss, $|loss'| \le G$ (e.g. hinge, $\ell_1$): parametric rate $\frac{GD}{n} + \sqrt{L^*\frac{GD}{n}}$; scale-sensitive rate $\sqrt{\frac{G^2 R}{n}}$
• Smooth loss, $|loss''| \le H$ (e.g. logistic, Huber, smoothed hinge): parametric rate $\frac{HD}{n} + \sqrt{L^*\frac{HD}{n}}$; scale-sensitive rate $\frac{HR}{n} + \sqrt{L^*\frac{HR}{n}}$
• Smooth & strongly convex loss, $\lambda \le loss'' \le H$ (e.g. square loss): parametric rate $\frac{H}{\lambda}\cdot\frac{HD}{n}$; scale-sensitive rate $\frac{HR}{n} + \sqrt{L^*\frac{HR}{n}}$
Min-max tight up to poly-log factors
Optimistic Learning Guarantees
$$L_{\mathcal D}(\hat h) \;\le\; (1+\alpha)\,L^* + \left(1 + \frac{1}{\alpha}\right)\frac{\tilde{C}}{n}
\qquad\qquad n(\epsilon) \;\le\; \tilde{O}\!\left(\frac{C}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$$
✓ Parametric classes
✓ Scale-sensitive classes with smooth loss
✓ Perceptron guarantee
✓ Margin bounds
✓ Stability-based guarantees with smooth loss
✓ Online learning/optimization with smooth loss
✗ Non-parametric (scale-sensitive) classes with non-smooth loss
✗ Online learning/optimization with non-smooth loss
Why Optimistic Guarantees?
$$L_{\mathcal D}(\hat h) \;\le\; (1+\alpha)\,L^* + \left(1 + \frac{1}{\alpha}\right)\frac{\tilde{C}}{n}
\qquad\qquad n(\epsilon) \;\le\; \tilde{O}\!\left(\frac{C}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$$
• The optimistic regime is typically the relevant regime:
  • Approximation error $L^*$ ≈ Estimation error $\epsilon$
  • If $\epsilon \ll L^*$, better to spend energy on lowering the approximation error (use a more complex class)
• Often important in highlighting true phenomena
Part III: Nearest Neighbor Classification
The Nearest Neighbor Classifier
• Training sample $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$
• Want to predict the label of a new point $x$
• The Nearest Neighbor Rule:
  • Find the closest training point: $i = \arg\min_i d(x, x_i)$
  • Predict the label of $x$ as $y_i$
• As a learning rule: $NN(S) = h$ where $h(x) = y_{\arg\min_i d(x, x_i)}$
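A minimal sketch of the 1-NN rule with a Euclidean distance over a chosen representation (hedged: `phi` is a placeholder for whatever $\phi(x)$ one picks, which, as the next slide stresses, is exactly where the bias hides):

```python
import numpy as np

def nn_predict(X_train, y_train, x, phi=lambda v: v):
    """1-Nearest-Neighbor prediction with d(x, x') = ||phi(x) - phi(x')||_2."""
    reps = np.array([phi(xi) for xi in X_train])      # representations of training points
    dists = np.linalg.norm(reps - phi(x), axis=1)     # distances to the query point
    return y_train[np.argmin(dists)]                  # label of the closest training point
```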
Where is the Bias Hiding?
• What is the right "distance" between images? Between sound waves? Between sentences?
• Option 1: $d(x, x') = \|\phi(x) - \phi(x')\|_2$
  • What representation $\phi(x)$?
  • Maybe a different distance? $\|\phi(x) - \phi(x')\|_1$? $\|\phi(x) - \phi(x')\|_\infty$? $\sin(\angle(\phi(x), \phi(x')))$? $KL(\phi(x)\,\|\,\phi(x'))$?
• Option 2: Special-purpose distance measure on $x$
  • E.g. edit distance, deformation measure, etc.
Nearest Neighbor Learning Guarantee
• Optimal predictor: $h^* = \arg\min_h L_{\mathcal D}(h)$
$$h^*(x) = \begin{cases} +1, & \eta(x) > 0.5 \\ -1, & \eta(x) < 0.5 \end{cases}
\qquad \eta(x) = \Pr(y = 1 \mid x)$$
• For the NN rule with $d(x, x') = \|\phi(x) - \phi(x')\|_2$ and $\phi: \mathcal{X} \to [0,1]^d$:
$$\mathbb{E}_{S \sim \mathcal{D}^m}\!\left[L_{\mathcal D}(NN(S))\right] \;\le\; 2\,L_{\mathcal D}(h^*) + \frac{4\,c\sqrt{d}}{m^{1/(d+1)}}$$
(assuming $\eta$ is $c$-Lipschitz: $|\eta(x) - \eta(x')| \le c\, d(x, x')$)
Data Fit / Complexity Tradeoff
• k-Nearest Neighbor: predict according to the majority vote among the $k$ closest points from $S$.
$$\mathbb{E}_{S \sim \mathcal{D}^m}\!\left[L_{\mathcal D}(NN(S))\right] \;\le\; 2\,L_{\mathcal D}(h^*) + \frac{4\,c\sqrt{d}}{m^{1/(d+1)}}$$
k-Nearest Neighbor: Data Fit / Complexity Tradeoff
[Figure: k-NN decision regions on a sample S for k = 1, 5, 12, 50, 100, 200, compared with the optimal predictor h*.]
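A small k-NN sketch that exposes the data-fit / complexity knob $k$ (again with a plain Euclidean representation; an illustration rather than a prescribed implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """k-NN: majority vote among the k closest training points (Euclidean distance)."""
    dists = np.linalg.norm(np.asarray(X_train) - x, axis=1)
    nearest = np.argsort(dists)[:k]                      # indices of the k nearest neighbors
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]

# Sweeping k (e.g. 1, 5, 12, 50, 100, 200) trades data fit against complexity,
# as in the decision-region figure above.
```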
k-Nearest Neighbor Guarantee
• For k-NN with $d(x, x') = \|\phi(x) - \phi(x')\|_2$ and $\phi: \mathcal{X} \to [0,1]^d$:
$$\mathbb{E}_{S \sim \mathcal{D}^m}\!\left[L_{\mathcal D}(kNN(S))\right] \;\le\; \left(1 + \sqrt{\tfrac{8}{k}}\right) L_{\mathcal D}(h^*) + \frac{6\,c\sqrt{d} + k}{m^{1/(d+1)}}$$
• Should increase $k$ with the sample size $m$
• The above theory suggests $k_m \approx L_{\mathcal D}(h^*)^{2/3} \cdot m^{\frac{2}{3(d+1)}}$
• "Universal" learning: for any "smooth" $\mathcal{D}$ and representation $\phi$ (with continuous $p(y \mid \phi(x))$), if we increase $k$ slowly enough, we will eventually converge to the optimal $L_{\mathcal D}(h^*)$
• Very non-uniform: sample complexity depends not only on $h^*$, but also on $\mathcal{D}$
(assuming $\eta$ is $c$-Lipschitz: $|\eta(x) - \eta(x')| \le c\, d(x, x')$)
Uniform and Non-Uniform Learnability
• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\exists m(\epsilon, \delta)$, $\forall \mathcal{D}$, $\forall h$, with probability $\ge 1 - \delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta)}$:
$$L_{\mathcal D}(A(S)) \le L_{\mathcal D}(h) + \epsilon$$
• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\exists m(\epsilon, \delta, h)$, $\forall \mathcal{D}$, with probability $\ge 1 - \delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h)}$:
$$L_{\mathcal D}(A(S)) \le L_{\mathcal D}(h) + \epsilon$$
• Definition: A hypothesis class $\mathcal{H}$ is "consistently learnable" if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\forall \mathcal{D}$, $\exists m(\epsilon, \delta, h, \mathcal{D})$, with probability $\ge 1 - \delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h,\mathcal{D})}$:
$$L_{\mathcal D}(A(S)) \le L_{\mathcal D}(h) + \epsilon$$
Realizable/Optimistic Guarantees: $\mathcal{D}$ dependence through $L_{\mathcal D}(h)$