New Algorithms for Contextual Bandits
Lev Reyzin, Georgia Institute of Technology
(Work done at Yahoo!)
- A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R.E. Schapire. Contextual Bandit Algorithms with Supervised Learning Guarantees (AISTATS 2011)
- M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, T. Zhang. Efficient Optimal Learning for Contextual Bandits (UAI 2011)
- S. Kale, L. Reyzin, R.E. Schapire. Non-Stochastic Bandit Slate Problems (NIPS 2010)
Serving Content to Users
[Figure: a user interacts with a web server over repeated rounds. The user supplies a query, IP address, browser properties, etc.; the server returns a result (e.g. an ad or a news story); the server then observes a click or no click.]
Outline
- The setting and some background
- Show ideas that fail
- Give a high probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates
Multiarmed Bandits [Robbins '52]
[Figure: k arms played over rounds 1, 2, 3, ..., T. Each round the learner pulls one arm and sees only that arm's payoff, e.g. $0.50 from arm 3, $0 from arm 2, and so on. Over T rounds the arms pay out totals of roughly $0.5T, $0.2T, $0.33T, and $0.1T, while the algorithm earns $0.4T; its regret to the best arm is $0.5T - $0.4T = $0.1T.]
Contextual Bandits [Auer-CesaBianchi-Freund-Schapire '02]
[Figure: the same k arms and T rounds, but now there are N experts/policies/functions (think of N >> K). On each round a context x_t arrives, each expert recommends an arm, and the learner picks one arm and sees only its payoff. In the example, the experts accumulate totals such as $0.12T, $0.1T, $0.22T, $0, and $0.17T; the algorithm earns $0.2T, so its regret to the best expert is $0.22T - $0.2T = $0.02T.]
The rewards can come i.i.d. from a distribution or be arbitrary (stochastic vs. adversarial), and the experts can be present or not (contextual vs. non-contextual).
The Setting
- T rounds, K possible actions, N policies π in Π (mapping contexts → actions)
- for t = 1 to T:
  - the world commits to rewards r(t) = r_1(t), r_2(t), ..., r_K(t) (adversarial or i.i.d.)
  - the world provides context x_t
  - the learner's policies recommend π_1(x_t), π_2(x_t), ..., π_N(x_t)
  - the learner chooses action j_t
  - the learner receives reward r_{j_t}(t)
- we want to compete with following the best policy in hindsight (a sketch of this protocol appears below)
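As a minimal sketch of the protocol above: the callables draw_context, draw_rewards, and choose_action are stand-ins for the environment and the learner, not anything from the cited papers.

```python
def run_protocol(policies, draw_context, draw_rewards, choose_action, T):
    """Play T rounds of the contextual bandit protocol; return the learner's
    total reward and each policy's (hindsight) total reward."""
    learner_total = 0.0
    policy_totals = [0.0] * len(policies)
    for t in range(T):
        r = draw_rewards(t)                 # r[j] for each action j (hidden from the learner)
        x = draw_context(t)                 # context x_t shown to the learner
        recs = [pi(x) for pi in policies]   # each policy recommends an action
        j = choose_action(x, recs, t)       # learner picks action j_t
        learner_total += r[j]               # only r_{j_t}(t) is revealed
        for i, a in enumerate(recs):        # bookkeeping for regret; never shown to the learner
            policy_totals[i] += r[a]
    return learner_total, policy_totals

# Regret in hindsight: max(policy_totals) - learner_total.
```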
Regret
- reward of algorithm A: $G_A(T) \doteq \sum_{t=1}^{T} r_{j_t}(t)$
- expected reward of policy i: $G_i(T) \doteq \sum_{t=1}^{T} \pi_i(x_t) \cdot r(t)$
- algorithm A's regret: $\max_i G_i(T) - G_A(T)$
Regret
- algorithm A's regret: $\max_i G_i(T) - G_A(T)$
- bound on expected regret: $\max_i G_i(T) - \mathbb{E}[G_A(T)] < \varepsilon$
- high-probability bound: $\Pr[\max_i G_i(T) - G_A(T) > \varepsilon] \le \delta$
- Harder than supervised learning: in the bandit setting we do not know the rewards of actions not taken.
- Many applications: ad auctions, medicine, finance, ...
- Exploration/exploitation: we can exploit an expert/arm we've learned to be good, and we can explore an expert/arm we're not sure about.
Some Barriers
- Ω((KT)^{1/2}) (non-contextual) and Ω((TK ln N)^{1/2}) (contextual) are known lower bounds on regret [Auer et al. '02], even in the stochastic case.
- Any algorithm achieving regret Õ((KT polylog N)^{1/2}) is said to be optimal.
- ε-greedy algorithms that first explore (act randomly) and then exploit (follow the best policy) cannot be optimal. Any optimal algorithm must be adaptive.
Two Types of Approaches
UCB [Auer '02]
- Algorithm: at every time step, (1) pull the arm with the highest upper confidence bound, and (2) update the confidence bound of the arm pulled.
[Figure: confidence intervals around empirical means shrinking over t = 1, 2, 3, ...]
EXP3 / Exponential Weights [Littlestone-Warmuth '94], [Auer et al. '02]
- Algorithm: at every time step, (1) sample from the distribution defined by the weights (mixed with uniform), and (2) update the weights "exponentially".
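Rough sketches of the two templates follow, with textbook constants rather than the exact ones from the cited papers; pull is a stand-in for the environment returning a reward in [0, 1].

```python
import math
import random

def ucb1(T, K, pull):
    """UCB sketch: pull the arm with the highest upper confidence bound,
    then update that arm's empirical mean and count."""
    means, counts = [0.0] * K, [0] * K
    for t in range(1, T + 1):
        if t <= K:
            j = t - 1  # initialize: pull each arm once
        else:
            j = max(range(K),
                    key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(j)
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]
    return means

def exp3(T, K, pull, gamma=0.1):
    """EXP3 sketch: sample from exponential weights mixed with the uniform
    distribution, then update the pulled arm's weight using an
    importance-weighted (unbiased) reward estimate."""
    w = [1.0] * K
    for t in range(T):
        W = sum(w)
        p = [(1 - gamma) * w[j] / W + gamma / K for j in range(K)]
        j = random.choices(range(K), weights=p)[0]
        r = pull(j)                        # observed reward, assumed in [0, 1]
        w[j] *= math.exp(gamma * (r / p[j]) / K)
    return w
```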
UCB vs. EXP3: A Comparison
UCB [Auer '02]
- Pros: optimal for the stochastic setting; succeeds with high probability.
- Cons: does not work in the adversarial setting; is not optimal in the contextual setting.
EXP3 & Friends [Auer-CesaBianchi-Freund-Schapire '02]
- Pros: optimal for both the adversarial and stochastic settings; can be made to work in the contextual setting.
- Cons: does not succeed with high probability in the contextual setting (only in expectation).
Algorithm | Regret | High Prob? | Context?
Exp4 [ACFS '02] | Õ((KT ln N)^{1/2}) | No | Yes
ε-greedy, epoch-greedy [LZ '07] | Õ((K ln N)^{1/3} T^{2/3}) | Yes | Yes
Exp3.P [ACFS '02], UCB [Auer '00] | Õ((KT)^{1/2}) | Yes | No
Exp4.P [BLLRS '10] | Õ((KT ln(N/δ))^{1/2}) | Yes | Yes

(Ω((KT)^{1/2}) lower bound [ACFS '02].)
Exp4.P [Beygelzimer-Langford-Li-R-Schapire '11]
Exp4.P combines the advantages of Exponential Weights and UCB:
- optimal for both the stochastic and adversarial settings
- works for the contextual case (and also the non-contextual case)
- a high-probability result
Main Theorem [Beygelzimer-Langford-Li-R-Schapire '11]: For any δ > 0, with probability at least 1 - δ, Exp4.P has regret at most O((KT ln(N/δ))^{1/2}) in the adversarial contextual bandit setting.
Outline
- The setting and some background
- Show ideas that fail
- Give a high probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates
Some Failed Approaches
- Bad idea 1: Maintain a set of plausible hypotheses and randomize uniformly over their predicted actions.
  - The adversary has two actions, one always paying off 1 and the other 0. The hypotheses generally agree on the correct action, except for a different one that defects each round. This incurs regret of about T/2.
- Bad idea 2: Maintain a set of plausible hypotheses and randomize uniformly among the hypotheses.
  - The adversary has two actions, one always paying off 1 and the other 0. If all but one of more than 2T hypotheses always predict the wrong arm, and only one hypothesis always predicts the good arm, then with probability greater than 1/2 it is never picked, and the algorithm incurs regret of about T.
Epsilon-Greedy
- Rough idea of ε-greedy (or ε-first): act randomly for ε rounds, then go with the best (arm or expert).
- Even if we know the number of rounds in advance, ε-first won't achieve regret O(T^{1/2}), even in the non-contextual setting.
- Rough analysis: even for just 2 arms, we suffer regret of about ε + (T - ε)/ε^{1/2}.
  - ε ≈ T^{2/3} is the optimal tradeoff, giving regret ≈ T^{2/3}. (A worked version of this tradeoff follows.)
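To make the tradeoff concrete, here is the back-of-the-envelope optimization, under the rough assumption that ε exploration rounds estimate each arm's mean to within about ε^{-1/2}:

```latex
\mathrm{regret}(\varepsilon) \;\approx\; \varepsilon + \frac{T-\varepsilon}{\sqrt{\varepsilon}}
\;\approx\; \varepsilon + T\varepsilon^{-1/2}, \qquad
\frac{d}{d\varepsilon}\left(\varepsilon + T\varepsilon^{-1/2}\right)
= 1 - \tfrac{1}{2}\,T\,\varepsilon^{-3/2} = 0
\;\Longrightarrow\; \varepsilon = (T/2)^{2/3} = \Theta(T^{2/3}),
\qquad \mathrm{regret} = \Theta(T^{2/3}).
```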
Outline
- The setting and some background
- Show ideas that fail
- Give a high probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates
Ideas Behind Exp4.P (all appeared in previous algorithms)
- Exponential weights: keep a weight on each expert that scales exponentially with the expert's (estimated) performance.
- Upper confidence bounds: use an upper confidence bound on each expert's estimated reward.
- Ensuring exploration: make sure each action is taken with some minimum probability.
- Importance weighting: give rare events more importance to keep estimates unbiased.
(A schematic sketch of how these four ingredients combine in one round appears below.)
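A schematic sketch of one round, assembling the four ingredients: the constants and the exact form of the confidence bonus only loosely follow the actual Exp4.P paper, and pull, eta, and bonus are illustrative names.

```python
import math
import random

def exp4p_round(weights, advice, pull, K, gamma, eta, bonus):
    """One schematic round combining the four ingredients.  advice[i] is the
    action recommended by expert i on the current context."""
    N = len(weights)
    W = sum(weights)
    # exploration: mix the weight-induced action distribution with uniform
    p = [gamma / K] * K
    for i in range(N):
        p[advice[i]] += (1 - gamma) * weights[i] / W
    j = random.choices(range(K), weights=p)[0]
    r = pull(j)
    rhat = [0.0] * K
    rhat[j] = r / p[j]  # importance weighting: unbiased estimate of each r_j
    for i in range(N):
        yhat = rhat[advice[i]]        # expert i's estimated reward this round
        vhat = 1.0 / p[advice[i]]     # variance proxy feeding the confidence bonus
        # exponential weights with an upper-confidence-style bonus "stacked" in
        weights[i] *= math.exp(eta * (yhat + bonus * vhat))
    return j, r
```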
Exp4.P [Beygelzimer, Langford, Li, R, Schapire '10]
[Algorithm statement; slide from the Beygelzimer & Langford ICML 2010 tutorial.]
Lemma 1 [statement on slide]
Lemma 2 [statement on slide]
Exp4.P [Beygelzimer-Langford-Li-R-Schapire '11]
Main Theorem [Beygelzimer-Langford-Li-R-Schapire '11]: For any δ > 0, with probability at least 1 - δ, Exp4.P has regret at most O((KT ln(N/δ))^{1/2}) in the adversarial contextual bandit setting.
Key insights (on top of UCB/EXP):
1) exponential weights and upper confidence bounds "stack"
2) a generalized Bernstein's inequality for martingales
Efficiency
Algorithm | Regret | High Prob? | Context? | Efficient?
Exp4 [ACFS '02] | Õ(T^{1/2}) | No | Yes | No
epoch-greedy [LZ '07] | Õ(T^{2/3}) | Yes | Yes | Yes
Exp3.P/UCB [ACFS '02][A '00] | Õ(T^{1/2}) | Yes | No | Yes
Exp4.P [BLLRS '10] | Õ(T^{1/2}) | Yes | Yes | No
Exp4.P Applied to Yahoo!
Experiments on Yahoo! Data
- We chose a policy class for which we could efficiently keep track of the weights.
- We created 5 clusters, with users (at each time step) getting features based on their distances to the clusters.
- Policies mapped clusters to article (action) choices.
- We ran on personalized news article recommendation for the Yahoo! front page.
- We used a learning bucket, on which we ran the algorithms, and a deployment bucket, on which we ran the greedy (best) learned policy.
Experimental Results
Reported estimated (normalized) click-through rates on front-page news. Over 41M user visits; 253 total articles; 21 candidate articles per visit.

 | Exp4.P | Exp4 | ε-greedy
Learning eCTR | 1.0525 | 1.0988 | 1.3829
Deployment eCTR | 1.6512 | 1.5309 | 1.4290

Why does this work in practice?
Outline
- The setting and some background
- Show ideas that fail
- Give a high probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates
Infinitely Many Policies
- What if we have an infinite number of policies?
- Our bound of Õ((KT ln N)^{1/2}) becomes vacuous.
- If we assume our policy class has finite VC dimension d, then we can tackle this problem.
- We need an i.i.d. assumption. We will also assume K = 2 to illustrate the argument.
VC Dimension
- The VC dimension of a hypothesis class captures the class's expressive power.
- It is the cardinality of the largest set (in our case, of contexts) the class can shatter.
- To shatter a set means to label it in all possible configurations. For example, threshold functions on the line can shatter any single point but no pair of points, so they have VC dimension 1.
VE, an Algorithm for VC Sets
The VE algorithm:
- Act uniformly at random for τ rounds.
- This partitions our policies Π into equivalence classes according to their labelings of the first τ examples.
- Pick one representative from each equivalence class to form Π′.
- Run Exp4.P on Π′. (A sketch of the partitioning step follows.)
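A sketch of VE's reduction step, assuming a finite, enumerable policy class for illustration; for an infinite class one would enumerate labelings rather than policies.

```python
from collections import defaultdict

def ve_representatives(policies, contexts):
    """Group policies by their labeling of the tau observed contexts and keep
    one representative per equivalence class.  Sauer's lemma bounds the number
    of classes by (e*tau/d)^d for a class of VC dimension d."""
    classes = defaultdict(list)
    for pi in policies:
        signature = tuple(pi(x) for x in contexts)  # labeling of the first tau contexts
        classes[signature].append(pi)
    return [group[0] for group in classes.values()]  # this is Pi'

# VE then runs Exp4.P on the representatives.
```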
Outline of the Analysis of VE
- Sauer's lemma bounds the number of equivalence classes by (eτ/d)^d.
- Hence, using the Exp4.P bound, VE's regret to Π′ is ≈ τ + O((Td ln τ)^{1/2}).
- We can show that the regret of Π′ to Π is ≈ (T/τ)(d ln T), by looking at the probability of disagreeing on future data given agreement for τ steps.
- τ ≈ (Td ln(1/δ))^{1/2} achieves the optimal tradeoff.
- This gives Õ((Td)^{1/2}) regret.
- Still inefficient!
Outline
- The setting and some background
- Show ideas that fail
- Give a high probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates
Hope for an Efficient Algorithm? [Dudik-Hsu-Kale-Karampatziakis-Langford-R-Zhang '11]
For Exp4.P, the dependence on N in the regret is logarithmic.
This suggests we could compete with a large, even super-polynomial, number of policies! (E.g., N = K^100 becomes 10 ln^{1/2} K in the regret.)
However, all known contextual bandit algorithms explicitly "keep track" of the N policies. Even worse, just reading in the N policies would take too long for large N.
Idea: Use Supervised Learning
- "Competing" with a large (even exponentially large) set of policies is commonplace in supervised learning.
  - Targets: e.g. linear thresholds, CNF, decision trees (in practice only)
  - Methods: e.g. boosting, SVMs, neural networks, gradient descent
- The recommendations of the policies don't need to be explicitly read in when the policy class has structure!
[Figure: a supervised learning oracle takes a policy class Π and returns a good policy in Π.]
Idea originates with [Langford-Zhang '07].
Warning: the oracle's optimization problem is NP-hard in general.
Back to Contextual Bandits
[Figure: the contextual bandit protocol as before, with contexts x_1, x_2, x_3, ... and observed rewards such as $1.20, $0.50, $0.70. The learner feeds "made-up data" built from the bandit rounds to the supervised learning oracle, which returns good policies from the class.]
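A sketch of how the "made-up data" for the oracle might be built from importance-weighted reward estimates; the tuple format of history is an illustrative assumption, not the paper's interface.

```python
def make_supervised_examples(history, K):
    """Each bandit round, recorded as (context x, chosen action j, observed
    reward r, probability p with which j was chosen), becomes one
    cost-sensitive example whose reward vector is the importance-weighted
    estimate: r/p on the chosen action and 0 elsewhere (unbiased for the
    true reward vector)."""
    examples = []
    for x, j, r, p in history:
        rhat = [0.0] * K
        rhat[j] = r / p
        examples.append((x, rhat))
    return examples

# An argmax (supervised learning) oracle then returns
#   argmax over pi in Pi of  sum over examples (x, rhat) of rhat[pi(x)],
# i.e. a cost-sensitive classification problem.
```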
Randomized-UCB
Main Theorem [Dudik-Hsu-Kale-Karampatziakis-Langford-R-Zhang '11]: For any δ > 0, with probability at least 1 - δ, given access to a supervised learning oracle, Randomized-UCB has regret at most O((KT ln(NT/δ))^{1/2} + K ln(NK/δ)) in the stochastic contextual bandit setting, and runs in time poly(K, T, ln N).
Key ideas:
- If arms are chosen among only good policies such that all have variance < roughly 2K, we win; such a distribution can be shown to exist via a minimax theorem.
- This condition can be softened to occasionally allow choosing bad policies, via "randomized" upper confidence bounds.
- This creates the problem of choosing arms so as to satisfy the constraints, expressed as a convex optimization problem.
- It is solvable by the ellipsoid algorithm; a separation oracle can be implemented with the supervised learning oracle.
Not practical to implement! (yet)
Outline
- The setting and some background
- Show ideas that fail
- Give a high probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates
Bandit Slate Problems [Kale-R-Schapire '11]
Problem: Instead of selecting one arm, we need to select s ≥ 1 arms (possibly ranked). The motivation is web ads, where a search engine shows multiple ads at once.
Slates Setting
- On round t, the algorithm selects a slate S_t of s arms.
  - Unordered or ordered
  - No context or contextual
- The algorithm sees r_j(t) for all j in S_t.
- The algorithm gets reward $\sum_{j \in S_t} r_j(t)$.
- The obvious solution is to reduce to the regular bandit problem, but we can do much better.
Algorithm Idea
[Figure: the distribution over slates is represented by its arm marginals, a point in a polytope (e.g. the uniform point .33, .33, .33). Each round, the algorithm makes a multiplicative update to the marginals and then a relative entropy projection back onto the polytope.]
This is also "Component Hedge," obtained independently by Koolen et al. '10.
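A simplified sketch of one round's update on the arm marginals for unordered slates; the exact projection and the sampling of a slate with given marginals (e.g. via dependent rounding) follow the paper, which this only approximates.

```python
import math

def slate_marginal_update(p, rhat, eta, s):
    """Multiplicative update on the K arm marginals, then a relative entropy
    projection back onto the polytope { q : sum(q) = s, 0 <= q_j <= 1 }.
    The KL projection caps saturated coordinates at 1 and rescales the rest."""
    K = len(p)
    q = [p[j] * math.exp(eta * rhat[j]) for j in range(K)]
    capped = [False] * K
    while True:
        free = [j for j in range(K) if not capped[j]]
        scale = (s - (K - len(free))) / sum(q[j] for j in free)
        hit_cap = [j for j in free if q[j] * scale >= 1.0]
        if not hit_cap:
            for j in free:
                q[j] *= scale
            return q  # new marginals; sample a slate of s arms with these marginals
        for j in hit_cap:            # cap saturated coordinates and re-scale the rest
            capped[j], q[j] = True, 1.0
```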
Slate Results
 | Unordered Slates | Ordered, with Positional Factors
No Policies | Õ((sKT)^{1/2}) * | Õ(s(KT)^{1/2})
N Policies | Õ((sKT ln N)^{1/2}) | Õ(s(KT ln N)^{1/2})
* Independently obtained by Uchiya et al. '10, using different methods.
Discussion
- The contextual bandit setting captures many interesting real-world problems.
- We presented the first optimal, high-probability, contextual algorithm.
- We showed how one could possibly make it efficient. (Not fully there yet...)
- We discussed slates, a more real-world setting. (How to make those efficient?)