Computational and Statistical Learning Theory
TTIC 31120
Prof. Nati Srebro
Lecture 17: Stochastic Optimization
Part II: Realizable vs Agnostic Rates; Part III: Nearest Neighbor Classification
Stochastic (Sub)-Gradient Descent

Optimize $F(w) = \mathbb{E}_{z\sim\mathcal{D}}[f(w,z)]$ s.t. $w \in \mathcal{W}$:
1. Initialize $w_1 = 0 \in \mathcal{W}$
2. At iteration $t = 1, 2, 3, \ldots$
   1. Sample $z_t \sim \mathcal{D}$ (obtain $g_t$ s.t. $\mathbb{E}[g_t] \in \partial F(w_t)$)
   2. $w_{t+1} = \Pi_{\mathcal{W}}\big(w_t - \eta_t \nabla f(w_t, z_t)\big)$
3. Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$

If $\|\nabla f(w,z)\|_2 \le G$ then with an appropriate step size:
$$\mathbb{E}\big[F(\bar{w}_T)\big] \;\le\; \inf_{w\in\mathcal{W},\,\|w\|_2\le B} F(w) \;+\; \sqrt{\frac{B^2 G^2}{T}}$$

Similarly, also Stochastic Mirror Descent.
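Below is a minimal sketch of this projected SGD loop, assuming a Euclidean ball constraint $\mathcal{W} = \{\|w\|_2 \le B\}$ and a user-supplied stochastic (sub)gradient oracle; the oracle name `sample_grad` and the step-size schedule $\eta_t = B/(G\sqrt{t})$ are illustrative choices consistent with the bound above.

```python
import numpy as np

def projected_sgd(sample_grad, dim, B, G, T):
    """Projected stochastic (sub)gradient descent on the ball ||w||_2 <= B.

    sample_grad(w) should return an unbiased estimate of a (sub)gradient of F
    at w, with norm at most G (illustrative interface, not from the slides).
    Returns the averaged iterate w_bar = (1/T) * sum_t w_t.
    """
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for t in range(1, T + 1):
        g = sample_grad(w)                 # unbiased (sub)gradient estimate
        eta = B / (G * np.sqrt(t))         # step size suggested by the B^2 G^2 / T analysis
        w = w - eta * g
        norm = np.linalg.norm(w)
        if norm > B:                       # Euclidean projection onto {||w||_2 <= B}
            w = (B / norm) * w
        w_sum += w
    return w_sum / T
```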
Stochastic Gradient Descent / Online Gradient Descent

• Online Gradient Descent [Zinkevich 03] → (online2stochastic conversion [Cesa-Bianchi et al 02]) → Stochastic Gradient Descent [Nemirovski & Yudin 78]
• Online Mirror Descent [Shalev-Shwartz & Singer 07] → Stochastic Mirror Descent [Nemirovski & Yudin 78]
Stochastic Optimization

$$\min_{w\in\mathcal{W}} F(w) = \mathbb{E}_{z\sim\mathcal{D}}\big[f(w,z)\big]$$

based only on stochastic information on $F$:
• No direct access to $F$
• Only unbiased estimates of $F(w)$, $\nabla F(w)$
• E.g., fixed $f(w,z)$ but $\mathcal{D}$ unknown:
  • Optimize $F(w)$ based on an i.i.d. sample $z_1, z_2, \ldots \sim \mathcal{D}$
  • $g = \nabla f(w, z_t)$ is an unbiased estimate of $\nabla F(w)$
• Traditional applications: optimization under uncertainty
  • Uncertainty about network performance
  • Uncertainty about client demands
  • Uncertainty about system behavior in control problems
• Complex systems where it is easier to sample than to integrate over $z$
Machine Learning is Stochastic Optimization

$$\min_h L(h), \qquad L(h) = \mathbb{E}_{z\sim\mathcal{D}}\big[\ell(h, z)\big] = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathrm{loss}(h(x), y)\big]$$

• Optimization variable: the predictor $h$
• Objective: the generalization error $L(h)$
• Stochasticity over $z = (x, y)$

"General Learning" ≡ Stochastic Optimization

Statistical Learning (Vladimir Vapnik) vs Stochastic Optimization (Arkadi Nemirovski)

Statistical Learning:
• Focus on sample size
• What can be done with a fixed number of samples?
• Abstract hypothesis classes: linear predictors, but also combinatorial hypothesis classes; generic measures of complexity such as VC dimension, fat-shattering dimension, Rademacher complexity
• Also non-convex classes and loss functions

Stochastic Optimization:
• Focus on computational efficiency
• Generally assumes unlimited sampling, as in Monte Carlo methods for complicated objectives
• Optimization variable generally a vector in a normed space; complexity control through the norm
• Mostly convex objectives
Two Approaches to Stochastic Optimization / Learning

$$\min_{w\in\mathcal{W}} F(w) = \mathbb{E}_{z\sim\mathcal{D}}\big[f(w,z)\big]$$

• Empirical Risk Minimization (ERM) / Sample Average Approximation (SAA):
  • Collect a sample $z_1, \ldots, z_m$
  • Minimize $F_S(w) = \frac{1}{m}\sum_i f(w, z_i)$
  • Analysis typically based on uniform convergence
• Stochastic Approximation (SA): [Robbins & Monro 1951]
  • Update $w_t$ based on $z_t$
  • E.g., based on $g_t = \nabla f(w_t, z_t)$
  • E.g.: stochastic gradient descent
  • Online-to-batch conversion of an online algorithm…
SA/SGD for Machine Learning

• In learning with ERM, we need to optimize
$$\hat{w} = \arg\min_{w\in\mathcal{W}} L_S(w), \qquad L_S(w) = \frac{1}{m}\sum_i \ell(w, z_i)$$
• $L_S(w)$ is expensive to evaluate exactly: $O(md)$ time
• Cheap to get an unbiased gradient estimate: $O(d)$ time
$$i \sim \mathrm{Unif}(1\ldots m), \qquad g = \nabla\ell(w, z_i), \qquad \mathbb{E}[g] = \sum_i \tfrac{1}{m}\nabla\ell(w, z_i) = \nabla L_S(w)$$
• SGD guarantee:
$$\mathbb{E}\big[L_S(\bar{w}_T)\big] \;\le\; \inf_{w\in\mathcal{W}} L_S(w) \;+\; \sqrt{\frac{\sup\|\nabla\ell\|_2^2 \,\sup\|w\|_2^2}{T}}$$
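As a small illustration of the $O(md)$ vs $O(d)$ point, here is a sketch contrasting the full ERM gradient with a single-example unbiased estimate; squared loss on linear predictors is an illustrative choice, not part of the slide.

```python
import numpy as np

def full_gradient(w, X, y):
    """Gradient of L_S(w) = (1/m) * sum_i (1/2)(<w, x_i> - y_i)^2 -- touches all m examples, O(md)."""
    m = X.shape[0]
    return X.T @ (X @ w - y) / m

def stochastic_gradient(w, X, y, rng):
    """Unbiased estimate of the same gradient from one random example -- O(d) per call."""
    i = rng.integers(X.shape[0])
    return (X[i] @ w - y[i]) * X[i]
```

Averaging many independent calls to `stochastic_gradient` recovers `full_gradient` in expectation, which is exactly the unbiasedness used in the SGD guarantee above.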
SGD for SVM

$$\min L_S(w) \ \text{ s.t. } \|w\|_2 \le B$$

Use $g_t = \nabla_w\,\mathrm{loss}^{\mathrm{hinge}}\big(\langle w_t, \phi(x_i)\rangle;\, y_i\big)$ for a random index $i$:

Initialize $w^{(0)} = 0$. At iteration $t$:
• Pick $i \in \{1, \ldots, m\}$ at random
• If $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
• If $\|w^{(t+1)}\|_2 > B$: $w^{(t+1)} \leftarrow B\,\dfrac{w^{(t+1)}}{\|w^{(t+1)}\|_2}$
Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

If $\|\phi(x)\|_2 \le G$ then $\|g_t\|_2 \le G$, and
$$L_S\big(\bar{w}^{(T)}\big) \;\le\; L_S(\hat{w}) + \sqrt{\frac{B^2G^2}{T}}$$
(in expectation over the randomness of the algorithm)
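A minimal Python sketch of this update, assuming a precomputed feature matrix `Phi` whose rows are $\phi(x_i)$ and labels in $\{-1, +1\}$; the input format and step-size schedule are illustrative.

```python
import numpy as np

def sgd_svm(Phi, y, B, G, T, rng=None):
    """SGD on the hinge-loss ERM over {||w||_2 <= B}, following the update above.

    Phi: (m, d) array whose rows are feature vectors phi(x_i) (illustrative input format).
    y:   (m,) array of labels in {-1, +1}.
    Returns the averaged iterate w_bar.
    """
    rng = rng or np.random.default_rng()
    m, d = Phi.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                 # pick a training example at random
        eta = B / (G * np.sqrt(t))          # step size from the B^2 G^2 / T analysis
        if y[i] * (w @ Phi[i]) < 1:         # hinge loss is active: subgradient step
            w = w + eta * y[i] * Phi[i]
        norm = np.linalg.norm(w)
        if norm > B:                        # project back onto the ball ||w||_2 <= B
            w = (B / norm) * w
        w_sum += w
    return w_sum / T
```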
Stochastic vs Batch
x1,y1
x2,y2
x3,y3
x4,y4
x5,y5
xm,ym
ð1 = âððð ð (ð€, (ð¥1, ðŠ1) )
ð2 = âððð ð (ð€, (ð¥2, ðŠ2) )
ð3 = âððð ð (ð€, (ð¥3, ðŠ3) )
ð4 = âððð ð (ð€, (ð¥4, ðŠ4) )
ð5 = âððð ð (ð€, ð¥5, ðŠ5 )
ðð = âððð ð (ð€, (ð¥ð, ðŠð) )
ð1 = âððð ð (ð€, (ð¥1, ðŠ1) )
ð2 = âððð ð (ð€, (ð¥2, ðŠ2) )
ð3 = âððð ð (ð€, (ð¥3, ðŠ3) )
ð4 = âððð ð (ð€, (ð¥4, ðŠ4) )
ð5 = âððð ð (ð€, (ð¥5, ðŠ5) )
ðð = âððð ð (ð€, (ð¥ð, ðŠð) )
ð€ â ð€ â ð1
ð€ â ð€ â ð2
ð€ â ð€ â ð3
ð€ â ð€ â ð4
ð€ â ð€ â ððâ1
ð€ â ð€ â ððð€ â ð€ â ðð
ð€ â ð€ â ð5
min ð¿ð ð€ ð . ð¡. ð€ 2 †ðµ
âð¿ð ð€ =1
ðÏðð
Stochastic vs Batch

• Intuitive argument: if only taking simple gradient steps, better to be stochastic
• To get $L_S(w) \le L_S(\hat{w}) + \epsilon_{\mathrm{opt}}$:
  • Batch GD: #iter $= \dfrac{B^2G^2}{\epsilon_{\mathrm{opt}}^2}$, cost/iter $= md$, runtime $= md\,\dfrac{B^2G^2}{\epsilon_{\mathrm{opt}}^2}$
  • SGD: #iter $= \dfrac{B^2G^2}{\epsilon_{\mathrm{opt}}^2}$, cost/iter $= d$, runtime $= d\,\dfrac{B^2G^2}{\epsilon_{\mathrm{opt}}^2}$
• Comparison to methods with a $\log(1/\epsilon_{\mathrm{opt}})$ dependence that use the structure of $L_S(w)$ (not only local access)?
• How small should $\epsilon_{\mathrm{opt}}$ be?
• What about $L(w)$, which is what we really care about?
Overall Analysis of $L(w)$

$$\hat{w} = \arg\min_{\|w\|\le B} L_S(w), \qquad w^* = \arg\min_{\|w\|\le B} L(w)$$

• Recall for ERM: $L(\hat{w}) \le L(w^*) + 2\sup_w\big|L(w) - L_S(w)\big|$
• For an $\epsilon_{\mathrm{opt}}$-suboptimal ERM $\bar{w}$:
$$L(\bar{w}) \;\le\; \underbrace{L(w^*)}_{\epsilon_{\mathrm{approx}}} \;+\; \underbrace{2\sup_w\big|L(w) - L_S(w)\big|}_{\epsilon_{\mathrm{est}} \,\le\, 2\sqrt{B^2G^2/m}} \;+\; \underbrace{\big(L_S(\bar{w}) - L_S(\hat{w})\big)}_{\epsilon_{\mathrm{opt}} \,\le\, \sqrt{B^2G^2/T}}$$
• Take $\epsilon_{\mathrm{opt}} \approx \epsilon_{\mathrm{est}}$, i.e. #iterations $T \approx$ sample size $m$
• To ensure $L(w) \le L(w^*) + \epsilon$:
$$m, T = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$$
Direct Online-to-Batch: SGD on $L(w)$

$$\min_w L(w)$$

Use $g_t = \nabla_w\,\mathrm{loss}\big(\langle w, \phi(x)\rangle;\, y\big)$ for a random $(x,y)\sim\mathcal{D}$; then $\mathbb{E}[g_t] = \nabla L(w)$.

Initialize $w^{(0)} = 0$. At iteration $t$:
• Draw $(x_t, y_t) \sim \mathcal{D}$
• If $y_t\langle w^{(t)}, \phi(x_t)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

$$L\big(\bar{w}^{(T)}\big) \;\le\; \inf_{\|w\|_2\le B} L(w) + \sqrt{\frac{B^2G^2}{T}}, \qquad m = T = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$$
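A sketch of this direct stochastic-approximation variant, assuming a hypothetical `draw_example()` that returns a fresh $(\phi(x), y)$ pair drawn i.i.d. from the data distribution; note that no projection step is needed here.

```python
import numpy as np

def sgd_online_to_batch(draw_example, d, B, G, T):
    """Streaming SGD directly on L(w): one fresh example per step, no projection.

    draw_example() is assumed to return (phi_x, y) with phi_x a length-d array
    and y in {-1, +1}, drawn i.i.d. from the data distribution (hypothetical interface).
    """
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        phi_x, y = draw_example()          # fresh sample at every iteration (m = T)
        eta = B / (G * np.sqrt(t))
        if y * (w @ phi_x) < 1:            # hinge subgradient step
            w = w + eta * y * phi_x
        w_sum += w
    return w_sum / T
```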
SGD for Machine Learning

Direct SA (online2batch) approach: SGD on $\min_w L(w)$
Initialize $w^{(0)} = 0$. At iteration $t$:
• Draw $(x_t, y_t) \sim \mathcal{D}$
• If $y_t\langle w^{(t)}, \phi(x_t)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

SGD on ERM: $\min_{\|w\|_2\le B} L_S(w)$
Draw $(x_1,y_1), \ldots, (x_m,y_m) \sim \mathcal{D}$.
Initialize $w^{(0)} = 0$. At iteration $t$:
• Pick $i \in \{1, \ldots, m\}$ at random
• If $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
• $w^{(t+1)} \leftarrow \mathrm{proj}\big(w^{(t+1)}\big)$ so that $\|w\| \le B$
Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

Direct SA approach:
• Fresh sample at each iteration, $T = m$
• No need to project, nor to require $\|w\| \le B$
• Implicit regularization via early stopping
$$L\big(\bar{w}^{(T)}\big) \;\le\; L(w^*) + \sqrt{\frac{B^2G^2}{T}}$$

SGD on ERM:
• Can have $T > m$ iterations
• Need to project to $\|w\| \le B$
• Explicit regularization via $\|w\|$
$$L\big(\bar{w}^{(T)}\big) \;\le\; L(w^*) + 2\sqrt{\frac{B^2G^2}{m}} + \sqrt{\frac{B^2G^2}{T}}$$

with $w^* = \arg\min_{\|w\|\le B} L(w)$ and $\eta_t = \sqrt{B^2/(G^2 t)}$.
SGD for Machine Learning

Direct SA (online2batch) approach: SGD on $\min_w L(w)$, as above.

SGD on RERM: $\min_w L_S(w) + \frac{\lambda}{2}\|w\|^2$
Draw $(x_1,y_1), \ldots, (x_m,y_m) \sim \mathcal{D}$.
Initialize $w^{(0)} = 0$. At iteration $t$:
• Pick $i \in \{1, \ldots, m\}$ at random
• If $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
• $w^{(t+1)} \leftarrow w^{(t+1)} - \eta_t\lambda\, w^{(t)}$ (shrinkage step from the regularizer)
Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

Direct SA approach:
• Fresh sample at each iteration, $T = m$
• No need to shrink $w$
• Implicit regularization via early stopping

SGD on RERM:
• Can have $T > m$ iterations
• Need to shrink $w$
• Explicit regularization via $\|w\|$
SGD vs ERM

$$w^* = \arg\min_{\|w\|\le B} L(w), \qquad \hat{w} = \arg\min_{\|w\|\le B} L_S(w)$$

[Figure: the SGD path from $w^{(0)}$ to $\bar{w}^{(T)}$ in weight space, shown relative to $\hat{w}$, $w^*$, and the unconstrained $\arg\min_w L_S(w)$ (overfit); running $T > m$ iterations with projection.]
Mixed Approach: SGD on ERM

[Figure: test misclassification error (roughly 0.052 to 0.058) vs. number of SGD iterations (0 to 3,000,000) for training-set sizes m = 300,000, 400,000, 500,000; Reuters RCV1 data, CCAT task.]

• The mixed approach (reusing examples) can make sense
• Still: fresh samples are better
  ⇒ With a larger training set, we can reduce the generalization error faster
  ⇒ A larger training set means less runtime to reach a target generalization error
Online Optimization vs Stochastic Approximation

• In both the Online setting and Stochastic Approximation:
  • Receive samples sequentially
  • Update $w$ after each sample
• But in the Online setting:
  • The objective is empirical regret, i.e. behavior on the observed instances
  • $z_t$ is chosen adversarially (no distribution involved)
• As opposed to Stochastic Approximation, where:
  • The objective is $\mathbb{E}\big[\ell(w, z)\big]$, i.e. behavior on "future" samples
  • The samples $z_t$ are i.i.d.
• Stochastic Approximation is a computational approach; Online Learning is an analysis setup
  • E.g. "Follow the Leader"
Part II: Realizable vs Agnostic Rates
Realizable vs Agnostic Rates

• Recall, for finite hypothesis classes:
$$L(\hat{h}) \;\le\; \inf_{h\in\mathcal{H}} L(h) + 2\sqrt{\frac{\log|\mathcal{H}| + \log\frac{2}{\delta}}{2m}}, \qquad m(\epsilon) = O\!\left(\frac{\log|\mathcal{H}|}{\epsilon^2}\right)$$
• But in the realizable case, if $\inf_{h\in\mathcal{H}} L(h) = 0$:
$$L(\hat{h}) \;\le\; \frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{m}, \qquad m(\epsilon) = O\!\left(\frac{\log|\mathcal{H}|}{\epsilon}\right)$$
• Also for VC classes: in general $m = O\!\left(\frac{VCdim}{\epsilon^2}\right)$, while in the realizable case $m = O\!\left(\frac{VCdim\cdot\log(1/\epsilon)}{\epsilon}\right)$
• What happens if $L^* = \inf_{h\in\mathcal{H}} L(h)$ is low, but not zero?
Estimating the Bias of a Coin

$$|p - \hat{p}| \;\le\; \sqrt{\frac{\log\frac{2}{\delta}}{2n}}$$
$$|p - \hat{p}| \;\le\; \sqrt{\frac{2p\log\frac{2}{\delta}}{n}} + \frac{2\log\frac{2}{\delta}}{3n}$$
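A quick numeric check of the two bounds above (the additive Hoeffding-style bound vs. the multiplicative Bernstein-style bound), purely for illustration: when the coin is heavily biased, the second bound is much tighter.

```python
import numpy as np

def hoeffding_width(n, delta):
    """Additive bound: |p - p_hat| <= sqrt(log(2/delta) / (2n))."""
    return np.sqrt(np.log(2 / delta) / (2 * n))

def bernstein_width(p, n, delta):
    """Multiplicative-style bound: sqrt(2 p log(2/delta) / n) + 2 log(2/delta) / (3n)."""
    return np.sqrt(2 * p * np.log(2 / delta) / n) + 2 * np.log(2 / delta) / (3 * n)

n, delta = 10_000, 0.05
print(hoeffding_width(n, delta))          # ~0.014, independent of p
print(bernstein_width(0.01, n, delta))    # ~0.003 when p = 0.01
```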
Optimistic VC Bound (aka $L^*$-bound, multiplicative bound)

$$\hat{h} = \arg\min_{h\in\mathcal{H}} L_S(h), \qquad L^* = \inf_{h\in\mathcal{H}} L(h)$$

• For a hypothesis class with VC-dim $D$, w.p. $1-\delta$ over $n$ samples:
$$L(\hat{h}) \;\le\; L^* + 2\sqrt{L^*\,\frac{D\log\frac{2en}{D} + \log\frac{2}{\delta}}{n}} + 4\,\frac{D\log\frac{2en}{D} + \log\frac{2}{\delta}}{n}$$
$$= \inf_{\alpha}\left[(1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\frac{4\big(D\log\frac{2en}{D} + \log\frac{2}{\delta}\big)}{n}\right]$$
• Sample complexity to get $L(\hat{h}) \le L^* + \epsilon$:
$$n(\epsilon) = O\!\left(\frac{D}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\cdot\log\frac{1}{\epsilon}\right)$$
• Extends to bounded real-valued losses in terms of the VC-subgraph dimension
From Parametric to Scale-Sensitive

$$L(h) = \mathbb{E}\big[\mathrm{loss}(h(x), y)\big], \qquad h \in \mathcal{H}$$

• Instead of VC-dim or VC-subgraph-dim (≈ #params), rely on metric scale to control complexity, e.g.:
$$\mathcal{H} = \{\, x \mapsto \langle w, x\rangle \;\mid\; \|w\|_2 \le B \,\}$$
• Learning depends on:
  • Metric complexity measures: fat-shattering dimension, covering numbers, Rademacher complexity
  • Scale sensitivity of the loss (a bound on its derivatives, or a "margin")
• For $\mathcal{H}$ with Rademacher complexity $\mathcal{R}_n(\mathcal{H}) \le \sqrt{R/n}$, and $|\mathrm{loss}'| \le G$:
$$L(\hat{h}) \;\le\; L^* + 2G\,\mathcal{R}_n(\mathcal{H}) + \sqrt{\frac{\log\frac{2}{\delta}}{2n}} \;\le\; L^* + O\!\left(\sqrt{\frac{G^2 R + \log\frac{2}{\delta}}{n}}\right)$$
  e.g. for the linear class above, $\mathcal{R}_n(\mathcal{H}) = \sqrt{\dfrac{B^2\sup\|x\|_2^2}{n}}$, i.e. $R = B^2\sup\|x\|_2^2$.
Non-Parametric Optimistic Rate for Smooth Loss

• Theorem: for any $\mathcal{H}$ with (worst-case) Rademacher complexity $\mathcal{R}_n(\mathcal{H})$, and any smooth loss with $|\mathrm{loss}''| \le H$ and $|\mathrm{loss}| \le b$, w.p. $1-\delta$ over $n$ samples:
• Sample complexity:
Parametric vs Non-Parametric

|  | Parametric: $\dim(\mathcal{H}) \le D$, $|\mathrm{loss}| \le b$ | Scale-sensitive: $\mathcal{R}_n(\mathcal{H}) \le \sqrt{R/n}$ |
|---|---|---|
| Lipschitz: $|\mathrm{loss}'| \le G$ (e.g. hinge, $\ell_1$) | $\dfrac{GD}{n} + \sqrt{L^*\dfrac{GD}{n}}$ | $\sqrt{\dfrac{G^2R}{n}}$ |
| Smooth: $|\mathrm{loss}''| \le H$ (e.g. logistic, Huber, smoothed hinge) | $\dfrac{HD}{n} + \sqrt{L^*\dfrac{HD}{n}}$ | $\dfrac{HR}{n} + \sqrt{L^*\dfrac{HR}{n}}$ |
| Smooth & strongly convex: $\lambda \le \mathrm{loss}'' \le H$ (e.g. square loss) | $\dfrac{H}{\lambda}\cdot\dfrac{HD}{n}$ | $\dfrac{H}{\lambda}\cdot\dfrac{HR}{n} + \sqrt{L^*\dfrac{HR}{n}}$ |

Min-max tight up to poly-log factors.
Optimistic Learning Guarantees

$$L(\hat{h}) \;\le\; (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\frac{\tilde{C}}{n}, \qquad n(\epsilon) \;\le\; \tilde{O}\!\left(\frac{C}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$$

✓ Parametric classes
✓ Scale-sensitive classes with smooth loss
✓ Perceptron guarantee
✓ Margin bounds
✓ Stability-based guarantees with smooth loss
✓ Online learning/optimization with smooth loss
✗ Non-parametric (scale-sensitive) classes with non-smooth loss
✗ Online learning/optimization with non-smooth loss
Why Optimistic Guarantees?

$$L(\hat{h}) \;\le\; (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\frac{\tilde{C}}{n}, \qquad n(\epsilon) \;\le\; \tilde{O}\!\left(\frac{C}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$$

• The optimistic regime is typically the relevant regime:
  • Approximation error $L^*$ ≈ estimation error $\epsilon$
  • If $\epsilon \ll L^*$, it is better to spend energy on lowering the approximation error (use a more complex class)
• Often important in highlighting the true phenomena
Part III: Nearest Neighbor Classification
The Nearest Neighbor Classifier

• Training sample $S = \big((x_1, y_1), \ldots, (x_m, y_m)\big)$
• Want to predict the label of a new point $x$
• The Nearest Neighbor rule:
  • Find the closest training point: $i = \arg\min_i d(x, x_i)$
  • Predict the label of $x$ as $y_i$
• As a learning rule: $\mathrm{NN}(S) = h$ where $h(x) = y_{\arg\min_i d(x, x_i)}$
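A minimal sketch of the nearest-neighbor rule, with Euclidean distance on the given feature vectors as an illustrative choice (the rule works with any distance $d$):

```python
import numpy as np

def nn_predict(X_train, y_train, x):
    """1-Nearest-Neighbor rule: return the label of the closest training point."""
    dists = np.linalg.norm(X_train - x, axis=1)   # d(x, x_i) for all training points
    return y_train[np.argmin(dists)]
```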
Where is the Bias Hiding?

• The rule itself stays the same: find the closest training point $i = \arg\min_i d(x, x_i)$ and predict $y_i$.
• But what is the right "distance" between images? Between sound waves? Between sentences?
• Option 1: $d(x, x') = \|\phi(x) - \phi(x')\|_2$
  • What representation $\phi(x)$?
  • Maybe a different distance? $\|\phi(x) - \phi(x')\|_1$? $\|\phi(x) - \phi(x')\|_\infty$? $\sin\big(\angle(\phi(x), \phi(x'))\big)$? $KL\big(\phi(x)\,\|\,\phi(x')\big)$?
• Option 2: a special-purpose distance measure on $x$
  • E.g. edit distance, deformation measure, etc.
Nearest Neighbor Learning Guarantee

• Optimal predictor: $h^* = \arg\min_h L(h)$,
$$h^*(x) = \begin{cases} +1, & \eta(x) > 0.5 \\ -1, & \eta(x) < 0.5 \end{cases} \qquad \eta(x) = \Pr(y = 1 \mid x)$$
• For the NN rule with $d(x, x') = \|\phi(x) - \phi(x')\|_2$ and $\phi: \mathcal{X} \to [0,1]^d$, assuming $|\eta(x) - \eta(x')| \le c\cdot d(x, x')$:
$$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(\mathrm{NN}(S))\big] \;\le\; 2L(h^*) + \frac{4c\sqrt{d}}{\sqrt[d+1]{m}}$$
Data Fit / Complexity Tradeoff

• $k$-Nearest Neighbor: predict according to the majority vote among the $k$ closest points in $S$.
$$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(\mathrm{NN}(S))\big] \;\le\; 2L(h^*) + \frac{4c\sqrt{d}}{\sqrt[d+1]{m}}$$
k-Nearest Neighbor: Data Fit / Complexity Tradeoff

[Figure: $k$-NN decision boundaries on a sample $S$, for $k = 1, 5, 12, 50, 100, 200$, compared with $h^*$.]
k-Nearest Neighbor Guarantee

• For $k$-NN with $d(x, x') = \|\phi(x) - \phi(x')\|_2$ and $\phi: \mathcal{X} \to [0,1]^d$, assuming $|\eta(x) - \eta(x')| \le c\cdot d(x, x')$:
$$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(k\mathrm{NN}(S))\big] \;\le\; \left(1 + \sqrt{\frac{8}{k}}\right)L(h^*) + \frac{6c\sqrt{d} + k}{\sqrt[d+1]{m}}$$
• Should increase $k$ with the sample size $m$
• The above theory suggests $k_m \propto L(h^*)^{2/3}\, m^{\frac{2}{3(d+1)}}$
• "Universal" learning: for any "smooth" $\mathcal{D}$ and representation $\phi$ (with continuous $\Pr(y\mid\phi(x))$), if we increase $k$ slowly enough, we will eventually converge to the optimal $L(h^*)$
• Very non-uniform: the sample complexity depends not only on $h^*$, but also on $\mathcal{D}$
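For completeness, a minimal $k$-NN sketch for binary labels in $\{-1, +1\}$ (majority vote among the $k$ nearest points); Euclidean distance is again an illustrative choice.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """k-NN rule for labels in {-1, +1}: majority vote among the k closest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbors
    return 1 if y_train[nearest].sum() >= 0 else -1
```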
Uniform and Non-Uniform Learnability

• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\exists m(\epsilon, \delta)$, $\forall \mathcal{D}$, $\forall h$:
$$\Pr_{S\sim\mathcal{D}^{m(\epsilon,\delta)}}\Big[L_\mathcal{D}(A(S)) \le L_\mathcal{D}(h) + \epsilon\Big] \ge 1 - \delta$$
• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\exists m(\epsilon, \delta, h)$, $\forall \mathcal{D}$:
$$\Pr_{S\sim\mathcal{D}^{m(\epsilon,\delta,h)}}\Big[L_\mathcal{D}(A(S)) \le L_\mathcal{D}(h) + \epsilon\Big] \ge 1 - \delta$$
• Definition: A hypothesis class $\mathcal{H}$ is "consistently learnable" if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\forall \mathcal{D}$, $\exists m(\epsilon, \delta, h, \mathcal{D})$:
$$\Pr_{S\sim\mathcal{D}^{m(\epsilon,\delta,h,\mathcal{D})}}\Big[L_\mathcal{D}(A(S)) \le L_\mathcal{D}(h) + \epsilon\Big] \ge 1 - \delta$$

Realizable/Optimistic guarantees: the $\mathcal{D}$-dependence enters only through $L_\mathcal{D}(h)$.