Outline
Introduction
  Why online learning?
  Basics of online learning
Prediction with expert advice
  Halving algorithm
  Weighted majority algorithm
  Randomized weighted majority algorithm
  Exponential weighted average algorithm
Why online learning?
In many cases, data arrives sequentially while predictions are required on-the-fly
Online algorithms do not require any distributional assumption
Applicable in adversarial environments
Simple algorithms
Theoretical guarantees
Introduction
Basic Properties:
Instead of learning from a training set and then testing on a test set, the online learning scenario mixes the training and test phases.
Instead of assuming that the distribution over data points is fixed for both training and test, with points sampled in an i.i.d. fashion, online learning makes no distributional assumption.
Instead of learning a hypothesis with small generalization error, online learning algorithms are measured using a mistake model and the regret.
Introduction
Basic Setting: For t = 1, 2, ..., T
Receive an instance xt ∈ X
Make a prediction ŷt ∈ Y
Receive true label yt ∈ Y
Suffer loss L(ŷt, yt)

Objective:
min ∑_{t=1}^{T} L(ŷt, yt)
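The round-based protocol above can be written as a generic loop. The following is a minimal sketch; `learner`, `environment`, and their methods are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of the online learning protocol. The names `learner`,
# `environment`, `predict`, `label`, and `update` are illustrative assumptions.
def online_protocol(learner, environment, T, loss):
    """Run T rounds: receive x_t, predict, observe y_t, suffer loss."""
    total_loss = 0.0
    for t in range(T):
        x_t = environment.instance(t)   # receive an instance x_t in X
        y_hat = learner.predict(x_t)    # make a prediction
        y_t = environment.label(t)      # receive the true label y_t
        total_loss += loss(y_hat, y_t)  # suffer loss L(y_hat, y_t)
        learner.update(x_t, y_t)        # the learner may adapt online
    return total_loss
```

Note that training and testing are interleaved: the learner only sees yt after committing to its prediction.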
Prediction with Expert Advice
For t = 1, 2, ..., T
Receive an instance xt ∈ X
Receive advice yt,i ∈ Y, i ∈ [1, N], from the N experts
Make a prediction ŷt ∈ Y
Receive true label yt ∈ Y
Suffer loss L(ŷt, yt)
Figure: Weather forecast: an example of a prediction problem based on expert advice [Mohri et al., 2012]
Regret Analysis
Objective: minimize the regret RT
RT = ∑_{t=1}^{T} L(ŷt, yt) − min_{i∈[1,N]} ∑_{t=1}^{T} L(yt,i, yt)
What does low regret mean?
It means that we don’t lose much from not knowing future events
It means that we can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight
It means that we can compete with a changing environment
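As a concrete check of the definition, the regret can be computed directly from a table of recorded losses. This is a sketch; the function and variable names are assumptions for illustration.

```python
# Sketch: compute the regret R_T from recorded losses.
# alg_losses[t] = L(y_hat_t, y_t); expert_losses[t][i] = L(y_{t,i}, y_t).
def regret(alg_losses, expert_losses):
    T = len(alg_losses)
    N = len(expert_losses[0])
    # cumulative loss of the single best expert in hindsight
    best = min(sum(expert_losses[t][i] for t in range(T)) for i in range(N))
    return sum(alg_losses) - best
```

The comparator is fixed in hindsight: the minimum is taken over experts after all T rounds are known.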
Halving algorithm
Realizable case: after some number of rounds T, we will learn the concept and no longer make errors.
Mistake bound: how many mistakes do we make before learning a particular concept?
Maximum number of mistakes a learning algorithm A makes for a concept c:
MA(c) = max_S |mistakes(A, c)|
Maximum number of mistakes a learning algorithm A makes for a concept class C:
MA(C) = max_{c∈C} MA(c)
Halving algorithm
Algorithm 1 HALVING(H)
H1 ← H
for t ← 1 to T do
  RECEIVE(xt)
  ŷt ← MAJORITYVOTE(Ht, xt)
  RECEIVE(yt)
  if ŷt ≠ yt then
    Ht+1 ← {c ∈ Ht : c(xt) = yt}
return HT+1
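A runnable sketch of HALVING follows, with hypotheses modelled as Python callables returning labels in {0, 1} (an assumption made for illustration).

```python
# Sketch of the Halving algorithm. Hypotheses are 0/1-valued callables;
# in the realizable case the target concept belongs to the initial set.
def halving(hypotheses, stream):
    """stream yields (x_t, y_t) pairs; returns (active set, mistake count)."""
    H = list(hypotheses)
    mistakes = 0
    for x_t, y_t in stream:
        votes = sum(h(x_t) for h in H)
        y_hat = 1 if 2 * votes >= len(H) else 0  # majority vote over H_t
        if y_hat != y_t:
            mistakes += 1
            # a mistake removes at least half of the active hypotheses
            H = [h for h in H if h(x_t) == y_t]
    return H, mistakes
```

With |H| = 4 threshold hypotheses, the mistake count stays within the log2 |H| = 2 bound of the theorem below.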
Halving algorithm
Theorem
Let H be a finite hypothesis set, then
MHalving(H) ≤ log2 |H|
Proof.
The algorithm makes predictions using a majority vote over the active set. Thus, at each mistake, the active set is reduced by at least half. Hence, after log2 |H| mistakes, at most one active hypothesis can remain. Since we are in the realizable case, this hypothesis must coincide with the target concept, and we will make no further mistakes.
Weighted majority algorithm
Algorithm 2 WEIGHTED-MAJORITY(N)
for i ← 1 to N do
  w1,i ← 1
for t ← 1 to T do
  RECEIVE(xt)
  if ∑_{i:yt,i=1} wt,i ≥ ∑_{i:yt,i=0} wt,i then
    ŷt ← 1
  else
    ŷt ← 0
  RECEIVE(yt)
  if ŷt ≠ yt then
    for i ← 1 to N do
      if yt,i ≠ yt then
        wt+1,i ← βwt,i
      else wt+1,i ← wt,i
return wT+1
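Algorithm 2 translates directly into Python. This is a sketch for binary advice in {0, 1}; the function and variable names are illustrative.

```python
# Sketch of WEIGHTED-MAJORITY. advice[t][i] is expert i's prediction
# (0 or 1) at round t; labels[t] is the true label y_t.
def weighted_majority(advice, labels, beta=0.5):
    N = len(advice[0])
    w = [1.0] * N
    mistakes = 0
    for preds, y_t in zip(advice, labels):
        w1 = sum(w[i] for i in range(N) if preds[i] == 1)
        w0 = sum(w[i] for i in range(N) if preds[i] == 0)
        y_hat = 1 if w1 >= w0 else 0       # weighted majority vote
        if y_hat != y_t:
            mistakes += 1
            for i in range(N):
                if preds[i] != y_t:
                    w[i] *= beta           # penalize experts that erred
    return w, mistakes
```

Note that, as in the pseudocode, weights are only updated on rounds where the algorithm itself makes a mistake.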
Weighted majority algorithm
Theorem
Fix β ∈ (0, 1). Let mT be the number of mistakes made by algorithm WM after T ≥ 1 rounds, and m∗T be the number of mistakes made by the best of the N experts. Then, the following inequality holds:

mT ≤ (log N + m∗T log(1/β)) / log(2/(1 + β))
Proof.
Introduce a potential function Wt = ∑_{i=1}^{N} wt,i, then derive its upper and lower bounds. Since the predictions are generated by weighted majority vote, if the algorithm makes an error at round t, the experts that erred held at least half of the total weight, and their weight is multiplied by β. Hence

Wt+1 ≤ ((1 + β)/2) Wt
Weighted majority algorithm
Proof (Cont.)
After mT mistakes in T rounds, and since W1 = N, we have

WT+1 ≤ ((1 + β)/2)^mT · N

Note that we also have

WT+1 ≥ wT+1,i ≥ β^mT,i for every i,

where mT,i is the number of mistakes made by the ith expert. Thus,

β^m∗T ≤ ((1 + β)/2)^mT · N ⇒ mT ≤ (log N + m∗T log(1/β)) / log(2/(1 + β))
Weighted majority algorithm
mT ≤ (log N + m∗T log(1/β)) / log(2/(1 + β))

mT ≤ O(log N) + constant × (number of mistakes of the best expert)
No assumption about the sequence of samples
The number of mistakes is roughly a constant times that of the best expert in hindsight
When m∗T = 0, the bound reduces to mT ≤ O(log N), which matches the Halving algorithm
Randomized weighted majority algorithm
Drawback of the weighted majority algorithm: with zero-one loss, no deterministic algorithm can achieve regret RT = o(T)
In the randomized scenario:
A set A = {1, ..., N} of N actions is available
At each round t ∈ [1, T], an online algorithm A selects a distribution pt over the set of actions
Receive a loss vector lt, where lt,i ∈ {0, 1} is the loss associated with action i
Define the expected loss for round t: Lt = ∑_{i=1}^{N} pt,i lt,i; the total loss over T rounds: LT = ∑_{t=1}^{T} Lt
Define the total loss associated with action i: LT,i = ∑_{t=1}^{T} lt,i; the minimal loss of a single action: Lmin_T = min_{i∈A} LT,i
Randomized weighted majority algorithm
Algorithm 3 RANDOMIZED-WEIGHTED-MAJORITY(N)
for i ← 1 to N do
  w1,i ← 1
  p1,i ← 1/N
for t ← 1 to T do
  for i ← 1 to N do
    if lt,i = 1 then
      wt+1,i ← βwt,i
    else wt+1,i ← wt,i
  Wt+1 ← ∑_{i=1}^{N} wt+1,i
  for i ← 1 to N do
    pt+1,i ← wt+1,i/Wt+1
return wT+1

Note: Let w0 be the total weight on outcome 0 and w1 the total weight on outcome 1, with W = w0 + w1; then the prediction strategy is to predict outcome i with probability wi/W.
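The action/loss-vector version of Algorithm 3 can be sketched as follows; rather than sampling actions, the code tracks the expected loss Lt directly. Names are assumptions for illustration.

```python
# Sketch of RANDOMIZED-WEIGHTED-MAJORITY on explicit loss vectors.
# loss_vectors[t][i] = l_{t,i} in {0, 1}; returns the total expected loss L_T.
def randomized_weighted_majority(loss_vectors, beta=0.5):
    N = len(loss_vectors[0])
    w = [1.0] * N
    total = 0.0
    for l_t in loss_vectors:
        W = sum(w)
        p = [w_i / W for w_i in w]                       # distribution p_t
        total += sum(p[i] * l_t[i] for i in range(N))    # expected loss L_t
        for i in range(N):
            if l_t[i] == 1:
                w[i] *= beta                             # down-weight losing actions
    return total
```

An actual run would sample action i with probability pt,i each round; the expected loss computed here is what the theorem below bounds.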
Randomized weighted majority algorithm
Theorem
Fix β ∈ [1/2, 1). Then for any T ≥ 1, the loss of algorithm RWM on any sequence can be bounded as follows:

LT ≤ (log N)/(1 − β) + (2 − β) Lmin_T

In particular, for β = max{1/2, 1 − √((log N)/T)}, the loss can be bounded as:

LT ≤ Lmin_T + 2√(T log N)
Proof.
Define the potential function Wt = ∑_{i=1}^{N} wt,i, t ∈ [1, T + 1].
Proof (Cont.)
Wt+1 = ∑_{i:lt,i=0} wt,i + β ∑_{i:lt,i=1} wt,i
     = Wt + (β − 1) Wt ∑_{i:lt,i=1} pt,i
     = Wt (1 − (1 − β) Lt)

⇒ WT+1 = N ∏_{t=1}^{T} (1 − (1 − β) Lt)

Note that we also have WT+1 ≥ max_{i∈[1,N]} wT+1,i = β^Lmin_T, thus

β^Lmin_T ≤ N ∏_{t=1}^{T} (1 − (1 − β) Lt)
⇒ Lmin_T log β ≤ log N − (1 − β) LT (using log(1 − x) ≤ −x)
⇒ LT ≤ (log N)/(1 − β) + (2 − β) Lmin_T (using −log β ≤ (1 − β)(2 − β) for β ∈ [1/2, 1))

Since Lmin_T ≤ T, this also implies

LT ≤ (log N)/(1 − β) + (1 − β)T + Lmin_T

By minimizing the RHS w.r.t. β, we get

LT ≤ Lmin_T + 2√(T log N), i.e., RT ≤ 2√(T log N)
Exponential weighted average algorithm
The WM algorithm can be extended to other loss functions L taking values in [0, 1]. The EWA algorithm here is a further extension in which L is convex in its first argument.
Algorithm 4 EXPONENTIAL-WEIGHTED-AVERAGE(N)
for i ← 1 to N do
  w1,i ← 1
for t ← 1 to T do
  RECEIVE(xt)
  ŷt ← (∑_{i=1}^{N} wt,i yt,i) / (∑_{i=1}^{N} wt,i)
  RECEIVE(yt)
  for i ← 1 to N do
    wt+1,i ← wt,i e^(−ηL(yt,i, yt))
return wT+1
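A runnable sketch of EWA follows, instantiated with the absolute loss L(ŷ, y) = |ŷ − y|, one admissible choice: it is convex in its first argument and takes values in [0, 1] when predictions and labels lie in [0, 1]. Names are assumptions.

```python
import math

# Sketch of EXPONENTIAL-WEIGHTED-AVERAGE with absolute loss.
# advice[t][i] = y_{t,i} in [0, 1]; labels[t] = y_t in [0, 1].
def exponential_weighted_average(advice, labels, eta):
    N = len(advice[0])
    w = [1.0] * N
    total = 0.0
    for preds, y_t in zip(advice, labels):
        W = sum(w)
        y_hat = sum(w[i] * preds[i] for i in range(N)) / W   # weighted average
        total += abs(y_hat - y_t)                            # suffer L(y_hat, y_t)
        for i in range(N):
            w[i] *= math.exp(-eta * abs(preds[i] - y_t))     # exponential update
    return total
```

Unlike WM, the prediction is a weighted average rather than a vote, and every expert's weight is updated every round, in proportion to its own loss.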
Exponential weighted average algorithm
Theorem
Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence y1, ..., yT ∈ Y, the regret of the EWA algorithm is bounded as:

RT ≤ (log N)/η + ηT/8

In particular, for η = √(8 log N / T), the regret is bounded as:

RT ≤ √((T/2) log N)
Proof.
Define the potential function Φt = log ∑_{i=1}^{N} wt,i, t ∈ [1, T + 1].
Exponential weighted average algorithm
Proof (Cont.)

We can prove (using Hoeffding's lemma and the convexity of L in its first argument) that

Φt+1 − Φt ≤ −ηL(ŷt, yt) + η²/8
⇒ ΦT+1 − Φ1 ≤ −η ∑_{t=1}^{T} L(ŷt, yt) + η²T/8

Then we lower-bound ΦT+1 − Φ1:

ΦT+1 − Φ1 = log ∑_{i=1}^{N} e^(−ηLT,i) − log N
          ≥ log max_{i∈[1,N]} e^(−ηLT,i) − log N
          = −η min_{i∈[1,N]} LT,i − log N

Combining the lower and upper bounds, we get

∑_{t=1}^{T} L(ŷt, yt) − min_{i∈[1,N]} LT,i ≤ (log N)/η + ηT/8
Exponential weighted average algorithm
The optimal choice of η requires knowledge of T, which is a disadvantage of this analysis. How can this be addressed?

The doubling trick: divide time into periods [2^k, 2^(k+1) − 1] of length 2^k, k = 0, ..., n, and choose ηk = √(8 log N / 2^k) within period k. This leads to the following theorem.
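The period schedule of the doubling trick can be sketched as follows (the function name is an assumption for illustration):

```python
import math

# Sketch of the doubling-trick schedule: periods of length 2^k cover
# rounds 1..T, each run with its own eta_k = sqrt(8 log N / 2^k).
def doubling_schedule(T, N):
    """Return a list of (period_length, eta_k) pairs covering T rounds."""
    schedule = []
    covered, k = 0, 0
    while covered < T:
        length = 2 ** k
        eta_k = math.sqrt(8 * math.log(N) / length)
        schedule.append((length, eta_k))
        covered += length
        k += 1
    return schedule
```

EWA is restarted at the beginning of each period with the re-tuned ηk; summing the per-period regret bounds yields the theorem below.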
Theorem
Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then for any T ≥ 1 and any sequence y1, ..., yT ∈ Y, the regret of the EWA algorithm after T rounds is bounded as follows:

RT ≤ (√2/(√2 − 1)) √((T/2) log N) + √((log N)/2)
Summary
In many cases, data arrives sequentially while predictions are required on-the-fly
Online algorithms do not require any distributional assumption
Applicable in adversarial environments
Simple algorithms
Theoretical guarantees
Reference I
Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of Machine Learning. MIT Press.
Shalev-Shwartz, S. and Singer, Y. (2008). Tutorial on theory and applications of online learning. http://ttic.uchicago.edu/~shai/icml08tutorial/OLtutorial.pdf. [Online; accessed 15-Mar-2015].