Tutorial: PART 1
Online Convex Optimization: A Game-Theoretic Approach to Learning
Elad Hazan (Princeton University) and Satyen Kale (Yahoo Research)
http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm
Agenda
1. Motivating examples
2. Unified framework & why it makes sense
3. Algorithms & main results / techniques
4. Research directions & open questions
Section 1: Motivating Examples
Inherently adversarial & online

Sequential Spam Classification - online & adversarial learning
• Observe n features (words): a_t ∈ R^n
• Predict label b̂_t ∈ {−1, 1}; feedback b_t
• Objective: average error → best linear model in hindsight:

  min_{x ∈ R^n} Σ_t log(1 + e^{−b_t · a_t^⊤ x})
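A minimal sketch of this per-round logistic loss in Python (numpy assumed; the function names are illustrative, not from the tutorial):

```python
import numpy as np

def logistic_loss(x, a_t, b_t):
    """Loss of the linear model x on example (a_t, b_t), with b_t in {-1, +1}."""
    return np.log1p(np.exp(-b_t * (a_t @ x)))

def total_loss(x, A, b):
    """The objective above: cumulative logistic loss over the stream."""
    return sum(logistic_loss(x, a_t, b_t) for a_t, b_t in zip(A, b))
```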
Spam Classifier Selection
Stream of emails:
Objective: make predictions with accuracy → best classifier in hindsight

Experts = classifiers with different regularization coefficients, e.g. for a ∈ R^d:
  log(1 + e^{−b · a^⊤ x}) + λ_1 ||x||²
  log(1 + e^{−b · a^⊤ x}) + λ_2 ||x||²
  log(1 + e^{−b · a^⊤ x}) + λ_3 ||x||²
Portfolio selection

• x_t = distribution of wealth over assets
• r_t = price relatives vector (per-asset wealth multipliers)
• Objective: choose portfolios s.t. avg. log wealth → avg. log wealth of best CRP

Example: r_t = (1.5, 1, 1, 0.5) and x_t = (1/2, 0, 1/2, 0):
  Wealth multiplier = r_t^⊤ x_t = ½·1.5 + 0·1 + ½·1 + 0·0.5 = 1.25
Market: no statistical assumptions

Universal portfolio selection

Price relatives: r_t(i) = (closing price of i) / (opening price of i) for commodity i

Change in log-wealth: log W_{t+1} = log W_t + log(r_t^⊤ x_t)
Constant rebalanced portfolio

• Each single asset depreciates exponentially:
  Asset 1: × 0.1, × 2, × 0.1, × 2, × 0.1, × …
  Asset 2: × 2, × 0.1, × 2, × 0.1, × 2, × …
• The 50/50 constant rebalanced portfolio (CRP) makes 5% every day:
  ½·2 + ½·0.1 = 1.05
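A quick numeric check of this claim (a sketch; numpy assumed): either asset alone decays by ×0.2 every two days, while rebalancing to 50/50 daily multiplies wealth by 1.05 per day.

```python
import numpy as np

T = 10  # days, illustrative
# Price relatives per day: asset 1 alternates x0.1, x2; asset 2 the opposite.
r = np.array([[0.1, 2.0] if t % 2 == 0 else [2.0, 0.1] for t in range(T)])

print(np.prod(r[:, 0]))                   # hold asset 1 only: 0.2**5 ~ 0.0003
print(np.prod(r[:, 1]))                   # hold asset 2 only: 0.2**5 ~ 0.0003
print(np.prod(r @ np.array([0.5, 0.5])))  # 50/50 CRP: 1.05**10 ~ 1.63
```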
Recommendation systems

[Figure: partially observed ratings matrix, 480,000 users × 18,000 movies, observed entries in {−1, +1}, most entries unobserved]

• Objective: predict ratings with accuracy → best low-rank matrix
• Computational challenge: optimizing over low-rank matrices
Ad Selection
• Yahoo, Google, etc. display ads on pages; clicks generate revenue
• Abstraction:
  • Observe a stream of users
  • Given a user, select one ad to display
  • Feedback: click (or not) on the ad shown
• Objective: average revenue per ad → best policy in a given class
• Additional difficulty: partial (bandit) feedback (feedback only for ad actually shown)
Section 2: Methodology: Online Convex Optimization
• Definition & relation to PAC learning
• Go through examples
• Argue it makes sense
Statistical (PAC) learning

• Nature: i.i.d. samples from distribution D over A × B = {(a, b)}
• Learner: picks a hypothesis h from a class H = {h_1, h_2, …, h_N}, after seeing examples (a_1, b_1), …, (a_M, b_M)
• Loss, e.g. ℓ(h, (a, b)) = (h(a) − b)²

Hypothesis class H: X → Y is learnable if ∀ ε, δ > 0 there exists an algorithm s.t. after seeing m examples, for m = poly(1/δ, 1/ε, dimension(H)), it finds h s.t. w.p. 1 − δ:

  err(h) ≤ min_{h* ∈ H} err(h*) + ε,  where err(h) = E_{(a,b)∼D}[ℓ(h, (a, b))]
Online Learning in Games

Iteratively, for t = 1, 2, …, T:
• Player: h_t ∈ H
• Adversary: (a_t, b_t) ∈ A × B
• Loss ℓ(h_t, (a_t, b_t))

Goal: minimize (average, expected) regret:

  (1/T) [ Σ_t ℓ(h_t, (a_t, b_t)) − min_{h* ∈ H} Σ_t ℓ(h*, (a_t, b_t)) ] → 0 as T → ∞

Can we learn in games efficiently? Regret → o(T)?
(i.e. an analogue of the fundamental theorem of statistical learning)
Online vs. Batch (PAC, Statistical)
Online convex optimization & regret minimization
• Learning against an adversary
• Regret
• Streaming setting: process examples sequentially
• Fewer assumptions, stronger guarantees
• Harder algorithmically
Batch learning, PAC/statistical learning
• Nature is benign, i.i.d. samples from unknown distribution
• Generalization error
• Batch: process dataset multiple times
• Weaker guarantees (changing distribution, etc.)
• Easier algorithmically
Online Convex Optimization
• Online player: chooses point x_t in convex set K ⊆ R^n
• Adversary: chooses convex cost function f_t
• Player suffers loss f_t(x_t); total loss Σ_t f_t(x_t)
• Access to K, f_t?

Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)
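A skeleton of this protocol in code (a sketch, not from the tutorial; `player` and `adversary` stand for arbitrary strategies, and their method names are hypothetical):

```python
def play(player, adversary, T):
    """One run of the OCO protocol; returns per-round losses and the f_t's."""
    losses, cost_fns = [], []
    for t in range(T):
        x_t = player.choose()        # point x_t in the convex set K
        f_t = adversary.cost(t)      # convex cost function f_t
        losses.append(f_t(x_t))      # player suffers loss f_t(x_t)
        player.observe(f_t)          # full-information feedback: f_t revealed
        cost_fns.append(f_t)
    return losses, cost_fns

def regret(losses, cost_fns, x_star):
    """Regret against a fixed comparator x* in K."""
    return sum(losses) - sum(f(x_star) for f in cost_fns)
```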
Prediction from expert advice
• Decision set = set of all distributions over n experts
• Cost functions? Let c_t(i) = loss of expert i in round t
• f_t(x) = expected loss when choosing experts by distribution x_t
• Regret = difference in expected loss vs. best expert:

  K = Δ_n = {x ∈ R^n : Σ_i x_i = 1, x_i ≥ 0}

  f_t(x) = c_t^⊤ x = Σ_i c_t(i) x(i)

  Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)
Parameterized Hypothesis Classes
• H parameterized by vector x in convex set K ⊆ R^n
• In round t, define the loss function based on example (a_t, b_t)

1. Online Linear Spam Filtering:
   • K = {x : ||x|| ≤ ω}
   • Loss function f_t(x) = (a_t^⊤ x − b_t)²
2. Online Soft-Margin SVM:
   • K = R^n
   • Loss function f_t(x) = max{0, 1 − b_t a_t^⊤ x} + λ ||x||²
3. Online Matrix Completion:
   • K = {X ∈ R^{n×n} : ||X||_* ≤ k}, matrices with bounded nuclear norm
   • At time t, if a_t = (i_t, j_t), then loss function f_t(X) = (X(i_t, j_t) − b_t)²
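Sketches of these three loss families in code (numpy assumed; names illustrative):

```python
import numpy as np

def spam_filtering_loss(x, a_t, b_t):
    """1. Online linear spam filtering: squared loss of a_t'x against b_t."""
    return (a_t @ x - b_t) ** 2

def soft_margin_svm_loss(x, a_t, b_t, lam):
    """2. Online soft-margin SVM: hinge loss plus L2 regularization."""
    return max(0.0, 1.0 - b_t * (a_t @ x)) + lam * (x @ x)

def matrix_completion_loss(X, i_t, j_t, b_t):
    """3. Online matrix completion: squared loss on the revealed entry."""
    return (X[i_t, j_t] - b_t) ** 2
```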
Universal portfolio selection

Recall the change in log-wealth: log W_{t+1} = log W_t + log(r_t^⊤ x_t)

A "Universal Portfolio" algorithm: run OCO with

  K = Δ_n = {x ∈ R^n : Σ_i x_i = 1, x_i ≥ 0}

  f_t(x) = −log(r_t^⊤ x)

Thus:

  Regret/T = (1/T) Σ_t f_t(x_t) − (1/T) min_{x*} Σ_t f_t(x*) → 0

i.e. the algorithm's average log-wealth approaches that of the best CRP in hindsight.
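A sketch of this formulation for two assets (assumptions: the alternating market from the CRP example above, and the best CRP found by naive grid search over the simplex; this illustrates the objective, not Cover's actual algorithm):

```python
import numpy as np

def cum_loss(p, R):
    """Sum_t f_t(x) for the fixed CRP x = (p, 1-p), with f_t(x) = -log(r_t'x)."""
    return -np.sum(np.log(R @ np.array([p, 1.0 - p])))

R = np.array([[0.1, 2.0], [2.0, 0.1]] * 5)   # the alternating market above
best_p = min(np.linspace(0.0, 1.0, 101), key=lambda p: cum_loss(p, R))
print(best_p)                                 # ~0.5: the 50/50 CRP
print(-cum_loss(best_p, R))                   # its log-wealth: 10*log(1.05)
```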
Bandit Online Convex Optimization
• Same as OCO, except ft is not revealed to learner, only ft(xt) is.
• E.g. Ad selection: a bandit version of “prediction with expert advice”
• Decision set K = set of distributions over possible ads to be shown
• Loss function: let ct(i) = 0 if ad i would be clicked if it were shown, and 1 otherwise
  K = Δ_n = {x ∈ R^n : Σ_i x_i = 1, x_i ≥ 0}

  f_t(x) = c_t^⊤ x = Σ_i c_t(i) x(i)
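A sketch of what bandit feedback means here (numpy assumed; names illustrative): the learner commits to a distribution x_t over n ads, one ad is sampled and shown, and only that ad's loss is observed, never the full vector c_t.

```python
import numpy as np

rng = np.random.default_rng(0)

def bandit_round(x_t, c_t):
    """Sample an ad from distribution x_t; reveal only c_t(i), not all of c_t."""
    i = rng.choice(len(x_t), p=x_t)   # ad actually shown
    return i, c_t[i]                  # 0 if clicked, 1 otherwise
```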
Algorithms
• Randomized weighted majority
• Follow-the-leader
• Online gradient descent, regularization?
• Log-regret, Online Newton Step
• Bandit algorithms (FKM, barriers, volumetric spanners)
• Projection-free methods (FW algorithm, online FW)
Part 1: Foundations
Prediction from expert advice
• N "experts" (classifiers with different regularizations), inputs a ∈ R^d
• Loss: 0 if correct, 1 for a mistake
• Can we perform as well as the best expert? (non-stochastic, adversarial setting)
Weighted Majority [Littlestone-‐Warmuth ‘89]
Setting: binary classification, {0, 1} loss; n experts with weights x_t(h_1), x_t(h_2), …, x_t(h_n).

• Initially, x_1(h) = 1 for all h
• In round t, predict by weighted majority vote
• Update weights:
  x_{t+1}(h) = (1 − η) x_t(h) if expert h errs
  x_{t+1}(h) = x_t(h) otherwise

Thm: Total errors ≤ 2(1 + η)·(best expert's errors) + (2 log n)/η
Weighted Majority: Analysis
• Total weight: Φ_t = Σ_{h ∈ H} x_t(h); initially Φ_0 = n
1. Alg mistake ⟹ at least half the weight is on experts that erred, so:
   Φ_{t+1} ≤ ½ Φ_t (1 − η) + ½ Φ_t ≤ Φ_t (1 − η/2)
2. If h* is the best expert:
   Φ_T ≥ x_T(h*) = (1 − η)^{#(h* errors)}
• Thus:
   (1 − η)^{#(best expert errors)} ≤ Φ_T ≤ Φ_0 · (1 − η/2)^{#(alg errors)}
   and taking logarithms:
   #(alg errors) ≤ 2(1 + η)·#(best expert errors) + (2 log n)/η
Randomized Weighted Majority [LW‘89]
Setting: general losses; expert h at time t incurs loss c_t(h) ∈ [0, 1]; weights x_t(h_1), x_t(h_2), …, x_t(h_n).

• Initially, x_1(h) = 1 for all h
• In round t, predict h with probability x_t(h) / Σ_{h'} x_t(h')
• Update weights: x_{t+1}(h) = e^{−η c_t(h)} · x_t(h)

Thm: For η = √(log n / T): Regret = O(√(T log n))
Reminder: Online Convex Optimization
• Online player: chooses point x_t in convex set K ⊆ R^n
• Adversary: chooses convex cost function f_t
• Player suffers loss f_t(x_t); total loss Σ_t f_t(x_t)
• Access to f_t? Benchmark?

Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)
Minimize regret: best-in-hindsight

• Most natural: play the best decision on what has been seen so far ("follow the leader"):

  x_{t+1} = argmin_{x ∈ K} Σ_{i=1}^{t} f_i(x)

• Provably works (even for non-convex sets [Kalai-Vempala '05])
• But is x_t close to x_{t+1}? The decision may be unstable!
  (teaser: this can be fixed with regularization:)

  x_t = argmin_{x ∈ K} { Σ_{i=1}^{t} f_i(x) + (1/η) R(x) }
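To see the instability concretely, here is the standard two-expert example (folklore, not from the slides): linear losses on the simplex chosen so the leader flips every round, making follow-the-leader pay ~1 per round while the best expert pays ~T/2.

```python
import numpy as np

T = 10
c = [np.array([0.5, 0.0])]                    # round 1 breaks the tie
c += [np.array([0.0, 1.0]) if t % 2 == 1 else np.array([1.0, 0.0])
      for t in range(1, T)]

cum = np.zeros(2)
ftl_loss = 0.0
for t in range(T):
    x_t = np.eye(2)[np.argmin(cum)]           # FTL: best vertex so far
    ftl_loss += c[t] @ x_t                    # the leader is always wrong next
    cum += c[t]

print(ftl_loss, cum.min())                    # ~T vs ~T/2: linear regret
```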
Offline: Steepest Descent

• Move in the direction of steepest descent, which is:

  −[∇f(x)]_i = −∂f(x)/∂x_i

[Figure: descent path p_1 → p_2 → p_3 → … → p*]
Online gradient descent [Zinkevich ’03]
  y_{t+1} = x_t − η ∇f_t(x_t)                  (gradient step)
  x_{t+1} = argmin_{x ∈ K} ||y_{t+1} − x||     (projection onto K)

Theorem: Regret = O(√T)
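A minimal sketch of these two steps (numpy assumed); K is taken to be the Euclidean unit ball purely so the projection has a closed form:

```python
import numpy as np

def project_unit_ball(y):
    """Euclidean projection onto K = {x : ||x|| <= 1}."""
    norm = np.linalg.norm(y)
    return y if norm <= 1.0 else y / norm

def ogd(grad_fns, x1, eta):
    """grad_fns[t](x) returns grad f_t(x); yields the iterates x_1, x_2, ..."""
    x = x1
    for g in grad_fns:
        yield x
        y = x - eta * g(x)            # gradient step
        x = project_unit_ball(y)      # projection onto K

# Per the theorem, eta = Theta(1/sqrt(T)) gives O(sqrt(T)) regret.
```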
Analysis

Notation: ∇_t := ∇f_t(x_t).

Observation 1:
  ||y_{t+1} − x*||² = ||x_t − x*||² + 2η ∇_t^⊤(x* − x_t) + η² ||∇_t||²

Observation 2 (Pythagoras):
  ||x_{t+1} − x*|| ≤ ||y_{t+1} − x*||

Thus:
  ||x_{t+1} − x*||² ≤ ||x_t − x*||² + 2η ∇_t^⊤(x* − x_t) + η² ||∇_t||²

Convexity:
  Σ_t [f_t(x_t) − f_t(x*)] ≤ Σ_t ∇_t^⊤(x_t − x*)
    ≤ (1/2η) Σ_t (||x_t − x*||² − ||x_{t+1} − x*||²) + (η/2) Σ_t ||∇_t||²
    ≤ (1/2η) ||x_1 − x*||² + (η/2) T G² = O(√T)

for G ≥ max_t ||∇_t|| and η = Θ(1/√T).
Lower bound

• 2 experts, T iterations:
  • First expert has random loss in {−1, 1}
  • Second expert's loss = first × (−1)
• Expected loss = 0 (for any algorithm)
• Regret (think of the first expert only):
  E[ |#1's − #(−1)'s| ] = Ω(√T)

Regret = Ω(√T)
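A quick Monte Carlo check of the Ω(√T) claim (a sketch; the comparison value E|Σ of T random signs| ≈ √(2T/π) is a standard random-walk fact):

```python
import numpy as np

rng = np.random.default_rng(0)
for T in [100, 400, 1600]:
    signs = rng.choice([-1, 1], size=(10000, T))
    empirical = np.abs(signs.sum(axis=1)).mean()
    print(T, empirical, np.sqrt(2 * T / np.pi))   # both grow like sqrt(T)
```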
Algorithms
• Randomized weighted majority
• Follow-the-leader
• Online gradient descent, regularization?
• Log-regret, Online Newton Step
• Bandit algorithms (FKM, barriers, volumetric spanners)
• Projection-free methods (FW algorithm, online FW)
Part 2 coming up…