The Multi-Armed Bandit Problem
Sumeet Katariya
Electrical and Computer Engineering
December 7, 2013
Sumeet Katariya Multi-armed Bandit
Outline
1 Motivation
2 Mathematical Model
3 Algorithms
    ǫ-Greedy Algorithm
    Softmax Algorithm
    Upper Confidence Bound Algorithm
A/B Testing
Exploration vs. Exploitation
Scientist View
Explore new ideas
Businessman View
Exploit best idea found so far
Terminology
pulling an arm = making a choice (which ad/color to display)
reward/regret = measure of success (user-click, item-buy)
Problem Formulation
Formulation
K arms 1, …, K
Arm i yields rewards from a distribution ν_i(x), x ∈ [0, 1], with mean μ_i. Think Bernoulli(p_i).
The ν_i are unknown
Finite time horizon (number of arm pulls) n
At time t, the player chooses arm I_t ∈ {1, …, K}; the environment returns reward g_{I_t,t} ∼ ν_{I_t}
Definitions
Define i* = arg max_{i=1,…,K} μ_i and μ* = max_{i=1,…,K} μ_i
Δ_i = μ* − μ_i,  T_i(n) = Σ_{t=1}^{n} 1{I_t = i}

Cumulative regret: R̂_n = Σ_{t=1}^{n} g_{i*,t} − Σ_{t=1}^{n} g_{I_t,t}

Objective
Find the best arm
Minimize the expected regret
R_n = E[R̂_n] = n μ* − E[Σ_{i=1}^{K} T_i(n) μ_i] = Σ_{i=1}^{K} Δ_i E[T_i(n)]
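The regret decomposition above can be checked numerically. A minimal sketch, assuming Bernoulli arms and a uniformly random player (the arm means and all names here are illustrative, not from the slides):

```python
import random

random.seed(0)

mus = [0.1, 0.3, 0.9]       # true arm means; mu* = 0.9, gaps Delta = [0.8, 0.6, 0.0]
mu_star = max(mus)
K, n = len(mus), 100_000

pulls = [0] * K              # T_i(n): number of times arm i is pulled
for t in range(n):
    pulls[random.randrange(K)] += 1   # uniformly random policy

# Regret via the decomposition R_n = sum_i Delta_i * T_i(n)
regret = sum((mu_star - mus[i]) * pulls[i] for i in range(K))

# A uniform policy pulls each arm ~ n/K times, so regret ~ (n/3) * (0.8 + 0.6)
print(round(regret, 1))
```

A uniform policy incurs regret linear in n; the algorithms in the next section aim for logarithmic growth instead.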
Clarification
Objectively and Subjectively Best Options
Objectively best: Which option is truly the best (as known to an oracle)?
Subjectively best: Which option has been best in the past?
Exploitation vs. Exploration
Exploitation: Choose the subjectively best arm
Exploration: Choose anything else
ǫ-Greedy Algorithm
Strategy = ǫ·Scientist + (1 − ǫ)·Businessman

At each time t:
With probability 1 − ǫ, pick the subjectively best arm
With probability ǫ, pick an arm uniformly at random (each arm has probability ǫ/K)
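The rule above is short enough to sketch in full. A minimal sketch, assuming Bernoulli arms (function and variable names are my own, not from the slides):

```python
import random

def epsilon_greedy(epsilon, means, n, seed=0):
    """Run epsilon-greedy on Bernoulli arms; return per-arm pull counts T_i(n)."""
    rng = random.Random(seed)
    K = len(means)
    pulls = [0] * K        # T_i(t): times arm i was pulled so far
    sums = [0.0] * K       # cumulative reward of arm i
    for t in range(n):
        if rng.random() < epsilon:
            i = rng.randrange(K)   # explore: uniformly random arm (prob eps/K each)
        else:
            # exploit: the subjectively best arm (highest empirical mean)
            i = max(range(K), key=lambda j: sums[j] / pulls[j] if pulls[j] else 0.0)
        if rng.random() < means[i]:    # Bernoulli reward
            sums[i] += 1.0
        pulls[i] += 1
    return pulls

counts = epsilon_greedy(0.1, [0.1, 0.1, 0.1, 0.1, 0.9], n=2000)
print(counts)   # the best (last) arm should receive most of the pulls
```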
Probability of Selecting Best Arm
5 Bernoulli arms with reward probabilities 0.1, 0.1, 0.1, 0.1, 0.9
[Figure: "Accuracy of the Epsilon Greedy Algorithm" — probability of selecting the best arm vs. time (0 to 250 pulls), one curve per ǫ ∈ {0.1, 0.2, 0.3, 0.4, 0.5}]
ǫ = 0.1 (Businessman):
Learns slowly
Does well at the end

ǫ = 0.5 (Scientist):
Learns quickly
Doesn't exploit at the end
Theoretical guarantee
Weakness: ǫ is constant. Solution: annealing (decrease ǫ_t over time)
Theoretical Guarantee (Auer, Cesa-Bianchi, Fischer, 2002)
Let Δ = min_{i: Δ_i > 0} Δ_i and consider ǫ_t = min(6K / (Δ² t), 1).

When t ≥ 6K/Δ², the probability of choosing a suboptimal arm i is bounded by C / (Δ² t), for some constant C > 0.

As a consequence, E[T_i(n)] ≤ (C/Δ²) log n and

R_n ≤ Σ_{i: Δ_i > 0} (C Δ_i / Δ²) log n → logarithmic regret.
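The annealed schedule above is easy to tabulate. A small sketch plugging in illustrative values (the choices of K and Δ are my own):

```python
def annealed_epsilon(t, K, delta):
    """Exploration rate eps_t = min(6K / (delta^2 * t), 1) from Auer et al. (2002)."""
    return min(6 * K / (delta ** 2 * t), 1.0)

K, delta = 5, 0.8   # 5 arms, smallest suboptimality gap 0.8
# For t <= 6K/delta^2 (about 47 here) the rate is clamped at 1: pure exploration.
# Afterwards it decays like 1/t, so exploration never stops but becomes rare.
for t in (1, 10, 100, 1000, 10000):
    print(t, round(annealed_epsilon(t, K, delta), 4))
```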
Weakness of ǫ-Greedy
Exploration is insensitive to relative performance levels; it treats these two cases identically:
Two arms with rewards 0.9 and 0.1
Two arms with rewards 0.15 and 0.1
Solution: Softmax algorithm
Idea:
P(arm 1) = μ̂_1 / (μ̂_1 + μ̂_2),  P(arm 2) = μ̂_2 / (μ̂_1 + μ̂_2)

Variant (with temperature T):
P(arm 1) = e^{μ̂_1/T} / (e^{μ̂_1/T} + e^{μ̂_2/T}),  P(arm 2) = e^{μ̂_2/T} / (e^{μ̂_1/T} + e^{μ̂_2/T})

T → ∞: Pure exploration
T → 0: Pure exploitation
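The two-arm formulas above generalize directly to K arms. A minimal sketch (the max-subtraction for numerical stability is my addition; it leaves the probabilities unchanged):

```python
import math
import random

def softmax_probs(estimates, T):
    """P(arm i) proportional to exp(mu_hat_i / T), with temperature T > 0."""
    m = max(estimates)                      # subtract max for numerical stability
    weights = [math.exp((mu - m) / T) for mu in estimates]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_pick(estimates, T, rng=random):
    """Sample one arm index from the softmax distribution."""
    return rng.choices(range(len(estimates)), weights=softmax_probs(estimates, T))[0]

# High temperature: near-uniform (exploration); low temperature: near-greedy (exploitation).
print([round(p, 3) for p in softmax_probs([0.15, 0.10], T=10.0)])   # ~[0.501, 0.499]
print([round(p, 3) for p in softmax_probs([0.15, 0.10], T=0.01)])   # ~[0.993, 0.007]
```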
Weakness of Softmax
Doesn't use confidence. Compare:
p̂_1 = 0.15 after 100 plays, p̂_2 = 0.1 after 100 plays
p̂_1 = 0.15 after 100K plays, p̂_2 = 0.1 after 100K plays
Softmax treats both situations identically, although the second leaves far less doubt about which arm is better.
Solution: UCB (Upper Confidence Bound) Algorithm
UCB Algorithm
Optimism in the Face of Uncertainty

At time t, construct the most optimistic estimate for each arm:
V_{i,t−1} = μ̂_{i,t−1} + √(2 log t / T_i(t−1))

Play the arm with the maximum upper bound, i.e. play I_t ∈ arg max_{i ∈ {1,…,K}} V_{i,t−1}

Proof based on Hoeffding's inequality
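The index above directly addresses the Softmax weakness from the previous slide: identical-looking empirical means get different bonuses depending on how often each arm was tried. A minimal sketch (the numbers are illustrative):

```python
import math

def ucb_index(mu_hat, t, pulls):
    """V_{i,t-1} = mu_hat_{i,t-1} + sqrt(2 log t / T_i(t-1))."""
    return mu_hat + math.sqrt(2 * math.log(t) / pulls)

# A slightly worse empirical mean with far fewer pulls gets the larger
# confidence bonus, so UCB explores the less-certain arm first.
v_few  = ucb_index(0.10, t=1000, pulls=10)       # barely explored arm
v_many = ucb_index(0.15, t=1000, pulls=10000)    # thoroughly explored arm
print(round(v_few, 3), round(v_many, 3))
```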
Results
[Figure: "Accuracy of the UCB1 Algorithm" — probability of selecting the best arm vs. time (0 to 250 pulls)]
Theoretical Guarantee
UCB Regret Bound (Auer, Cesa-Bianchi, Fischer, 2002)
R_n ≤ [ Σ_{i: μ_i < μ*} (8 log n) / Δ_i ] + (1 + π²/3) Σ_{i=1}^{K} Δ_i

Lower bound (Lai and Robbins, 1985)
The asymptotic regret is at least logarithmic in the number of steps:
lim_{n→∞} R_n / log n ≥ Σ_{i: Δ_i > 0} Δ_i / KL(ν_i ‖ ν*)
Comparison
[Figure: "Accuracy of Different Algorithms" — probability of selecting the best arm vs. time (0 to 250 pulls) for Annealing ǫ-Greedy, UCB1, and Annealing Softmax]
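A comparison in the spirit of the figure can be reproduced with a short simulation. A minimal sketch, using a fixed ǫ rather than the annealed variants for brevity (all names are my own):

```python
import math
import random

def run(policy, means, n, seed=0):
    """Play n rounds; return the fraction of pulls spent on the truly best arm."""
    rng = random.Random(seed)
    K = len(means)
    pulls, sums = [0] * K, [0.0] * K
    best_arm = max(range(K), key=lambda j: means[j])
    best_pulls = 0
    for t in range(1, n + 1):
        i = policy(t, pulls, sums, rng)
        if rng.random() < means[i]:    # Bernoulli reward
            sums[i] += 1.0
        pulls[i] += 1
        if i == best_arm:
            best_pulls += 1
    return best_pulls / n

def eps_greedy(t, pulls, sums, rng, eps=0.1):
    if 0 in pulls:                     # initialize: try each arm once
        return pulls.index(0)
    if rng.random() < eps:
        return rng.randrange(len(pulls))
    return max(range(len(pulls)), key=lambda j: sums[j] / pulls[j])

def ucb1(t, pulls, sums, rng):
    if 0 in pulls:                     # initialize: try each arm once
        return pulls.index(0)
    return max(range(len(pulls)),
               key=lambda j: sums[j] / pulls[j] + math.sqrt(2 * math.log(t) / pulls[j]))

means = [0.1, 0.1, 0.1, 0.1, 0.9]
print("eps-greedy:", run(eps_greedy, means, 5000))
print("UCB1:", run(ucb1, means, 5000))
```

With a gap this large both policies concentrate on the best arm; UCB1's advantage shows up most clearly when the gaps are small or the horizon is long.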
Summary
1 Motivation
2 Mathematical Model
3 Algorithms
    ǫ-Greedy Algorithm
    Softmax Algorithm
    Upper Confidence Bound Algorithm
References
White, John. Bandit Algorithms for Website Optimization. O'Reilly, 2012.

Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47.2-3 (2002): 235-256.