
An Introduction to Stochastic Multi-armed Bandits

Shivaram Kalyanakrishnan ([email protected])
Department of Computer Science and Automation, Indian Institute of Science

August 2014

Today's Talk

• What we will cover
  - Stochastic bandits

• What we will not cover
  - Adversarial bandits
  - Dueling bandits
  - Contextual bandits
  - Mortal bandits
  - Sleeping bandits
  - Real bandits

A Game

Coin 1: P{heads} = p1    Coin 2: P{heads} = p2    Coin 3: P{heads} = p3

• p1, p2, and p3 are unknown.
• You are given a total of 20 tosses.
• Maximise the total number of heads!
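The tension between trying out coins and sticking with a promising one can be made concrete with a small simulation. The sketch below is illustrative and not from the slides: the hidden biases (0.4, 0.5, 0.6), the explore-then-commit split, and the helper names are all assumptions.

```python
import random

# Hypothetical hidden biases for the three coins (not given in the slides).
P = [0.4, 0.5, 0.6]
BUDGET = 20

def toss(coin):
    """Return 1 for heads, 0 for tails."""
    return 1 if random.random() < P[coin] else 0

def explore_then_commit(explore_per_coin=3):
    """Toss each coin a few times, then spend the remaining budget
    on the coin with the best empirical heads rate."""
    heads = [0, 0, 0]
    total = 0
    for coin in range(3):
        for _ in range(explore_per_coin):
            r = toss(coin)
            heads[coin] += r
            total += r
    best = max(range(3), key=lambda c: heads[c] / explore_per_coin)
    for _ in range(BUDGET - 3 * explore_per_coin):
        total += toss(best)
    return total

runs = 10000
avg = sum(explore_then_commit() for _ in range(runs)) / runs
print(f"average heads in {BUDGET} tosses: {avg:.2f}")
```

Averaged over many runs, the split strategy illustrates the trade-off: exploring too little risks committing to the wrong coin, exploring too much wastes tosses on inferior coins.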

To Explore or to Exploit?

• On-line advertising: template optimisation
  [Figure: alternative ad templates for a "Cars" query]

• Clinical trials (Robbins, 1952)

• Packet routing in communication networks (Altman, 2002)

• Game playing and reinforcement learning (Kocsis and Szepesvári, 2006)


Overview

1. Problem definition

2. Two natural algorithms

3. Lower bound

4. Two improved algorithms

5. Conclusion


Stochastic Multi-armed Bandits

[Figure: Bernoulli reward distributions on {0, 1}, one per arm, with arm means marked on [0, 1]]

• n arms, each associated with a Bernoulli distribution.
• Arm a has mean p_a.
• The highest mean is p*.


One-armed Bandits


Regret Minimisation

• What does an algorithm do?

  For t = 1, 2, 3, ..., T:
  - Given the history a_1, r_1, a_2, r_2, a_3, r_3, ..., a_{t−1}, r_{t−1},
  - Pick an arm a_t to sample, and
  - Obtain a reward r_t drawn from the distribution corresponding to arm a_t.

• T is the total sampling budget, or the horizon.

• The regret at time t is defined as p* − r_t.

• The cumulative regret over a run is Σ_{t=1}^{T} (p* − r_t).

• The expected cumulative regret of the algorithm (or simply "regret") is

  R_T = E[ Σ_{t=1}^{T} (p* − r_t) ] = T·p* − Σ_{t=1}^{T} E[r_t].

We desire an algorithm that minimises regret!
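This interaction loop is easy to instrument. A minimal sketch, assuming Bernoulli arms as in the slides: the harness below measures the cumulative regret of one run for any sampling policy; the uniformly random baseline and the helper names are illustrative, not from the slides.

```python
import random

def run(means, algorithm, horizon):
    """Play `horizon` rounds on a Bernoulli bandit with the given arm means.
    Returns the cumulative regret of the run, sum over t of (p* - r_t)."""
    p_star = max(means)
    history = []          # list of (arm, reward) pairs seen so far
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = algorithm(history, len(means), t, horizon)
        reward = 1 if random.random() < means[arm] else 0
        history.append((arm, reward))
        regret += p_star - reward
    return regret

def uniform_random(history, n_arms, t, horizon):
    """Baseline policy: pick an arm uniformly at random every round."""
    return random.randrange(n_arms)

# Test instance I1 from the slides: n = 20 arms with means 0.01, 0.02, ..., 0.20.
I1 = [0.01 * k for k in range(1, 21)]
print(run(I1, uniform_random, horizon=100000))
```

The later algorithm sketches follow the same pattern, with the sampling rule replaced.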


ε-Greedy Strategies

• εG1 (parameter ε ∈ [0, 1] controls the amount of exploration)
  - If t ≤ εT, sample an arm uniformly at random.
  - At t = ⌊εT⌋, identify a_best, an arm with the highest empirical mean.
  - If t > εT, sample a_best.

• εG2
  - If t ≤ εT, sample an arm uniformly at random.
  - If t > εT, sample an arm with the highest empirical mean.

• εG3 (Sutton and Barto, 1998; see Chapter 2.2)
  - With probability ε, sample an arm uniformly at random; with probability 1 − ε,
    sample an arm with the highest empirical mean.

• Test instance I1: n = 20; means = 0.01, 0.02, 0.03, ..., 0.2; T = 100,000.

[Figure: regret on I1 as a function of ε ∈ [0.01, 0.1] for εG1, εG2, and εG3]

εG2 with ε = 0.03 is denoted εG*. It attains regret of 822 ± 24 over a horizon of 100,000 (a code sketch of εG2 follows below).
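A minimal sketch of εG2 on instance I1, assuming Bernoulli rewards as in the slides; the guard for never-sampled arms and the tie-breaking behaviour of max() are implementation details, and a single run will naturally vary around the averaged figure quoted above.

```python
import random

def eps_g2(means, eps, horizon):
    """εG2: explore uniformly for the first εT steps, then always play an arm
    with the highest empirical mean. Returns the cumulative regret of one run."""
    n = len(means)
    counts = [0] * n      # times each arm has been sampled
    sums = [0.0] * n      # total reward obtained from each arm
    p_star = max(means)
    regret = 0.0
    explore_steps = int(eps * horizon)
    for t in range(horizon):
        if t < explore_steps:
            arm = random.randrange(n)
        else:
            # Greedy phase: play an empirically best arm (unsampled arms score 0).
            arm = max(range(n),
                      key=lambda a: sums[a] / counts[a] if counts[a] else 0.0)
        reward = 1 if random.random() < means[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
        regret += p_star - reward
    return regret

# The configuration highlighted on the slide: εG* = εG2 with ε = 0.03 on I1.
I1 = [0.01 * k for k in range(1, 21)]
print(eps_g2(I1, eps=0.03, horizon=100000))
```

εG1 and εG3 differ only in the sampling rule inside the loop, so they fit the same skeleton.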

Softmax Exploration

• Softmax (Sutton and Barto, 1998; see Chapter 2.3)
  - At time t, sample arm a with probability proportional to exp(α · p̂_a^t · t / T)
    (a code sketch follows below).
  - p̂_a^t is the empirical mean of arm a.
  - α is a tunable parameter that controls exploration.
  - One could "anneal" at rates different from 1/t.

[Figure: regret on I1 as a function of α ∈ [20, 200] for Softmax]

Softmax with α = 100 is denoted Softmax*. It attains regret of 720 ± 13 on I1 over a horizon of T = 100,000.
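A sketch of this rule, assuming the annealed exponent α · p̂_a^t · t / T stated above and Bernoulli rewards; the max-subtraction for numerical stability and the helper names are illustrative details.

```python
import math
import random

def softmax_bandit(means, alpha, horizon):
    """Boltzmann (softmax) exploration: at step t, play arm a with probability
    proportional to exp(alpha * empirical_mean[a] * t / horizon)."""
    n = len(means)
    counts = [0] * n
    sums = [0.0] * n
    p_star = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        p_hat = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(n)]
        logits = [alpha * p_hat[a] * t / horizon for a in range(n)]
        m = max(logits)                               # subtract max for numerical stability
        weights = [math.exp(x - m) for x in logits]
        arm = random.choices(range(n), weights=weights)[0]
        reward = 1 if random.random() < means[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
        regret += p_star - reward
    return regret

# The configuration highlighted on the slide: Softmax* = Softmax with α = 100 on I1.
I1 = [0.01 * k for k in range(1, 21)]
print(softmax_bandit(I1, alpha=100, horizon=100000))
```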


A Lower Bound on Regret

Paraphrasing Lai and Robbins (1985; see Theorem 2).

Let A be an algorithm such that for every bandit instance I and for every a > 0, as T → ∞:

  R_T(A, I) = o(T^a).

Then, for every bandit instance I, as T → ∞:

  R_T(A, I) ≥ ( Σ_{a: p_a(I) ≠ p*(I)} (p*(I) − p_a(I)) / KL(p_a(I), p*(I)) ) · log(T).
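For Bernoulli arms the divergence has the closed form KL(p, q) = p·log(p/q) + (1 − p)·log((1 − p)/(1 − q)), so the constant multiplying log(T) can be evaluated directly. The sketch below does this for instance I1; the closed-form KL is standard, while the choice to print the constant times log(T) at T = 100,000 is purely illustrative (the bound itself is asymptotic).

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Instance I1: means 0.01, 0.02, ..., 0.20; the best arm has p* = 0.20.
I1 = [0.01 * k for k in range(1, 21)]
p_star = max(I1)

# Constant in front of log(T) in the Lai-Robbins bound for this instance.
constant = sum((p_star - p) / bernoulli_kl(p, p_star) for p in I1 if p != p_star)
print(f"Lai-Robbins constant for I1: {constant:.1f}")
print(f"constant * log(T) at T = 100,000: {constant * math.log(100000):.0f}")
```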


Upper Confidence Bounds

• UCB (Auer et al., 2002a)
  - At time t, for every arm a, define ucb_a^t = p̂_a^t + √(2 ln(t) / u_a^t).
  - u_a^t is the number of times a has been sampled at time t.
  - Sample an arm a for which ucb_a^t is maximal (a code sketch follows below).

[Figure: on the reward scale [0, 1], the upper confidence bound ucb_a^t of an arm lies above its empirical mean p̂_a^t]

• Achieves regret of O( (Σ_{a: p_a ≠ p*} 1/(p* − p_a)) · log(T) ): optimal dependence on T.

• KL-UCB (Garivier and Cappé, 2011) yields regret O( (Σ_{a: p_a ≠ p*} (p* − p_a)/KL(p_a, p*)) · log(T) ),
  matching the lower bound from Lai and Robbins (1985).

Regret on instance I1 (with T = 100,000) – UCB: 1168 ± 16; KL-UCB: 738 ± 18.
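A sketch of the UCB rule above on Bernoulli arms; pulling every arm once before applying the index is an implementation detail, and KL-UCB (which replaces the index with a per-arm optimisation) is omitted.

```python
import math
import random

def ucb(means, horizon):
    """UCB (Auer et al., 2002a): after playing each arm once, always play an arm
    maximising  p̂_a^t + sqrt(2 ln(t) / u_a^t)."""
    n = len(means)
    counts = [0] * n      # u_a^t: times each arm has been sampled
    sums = [0.0] * n      # total reward from each arm
    p_star = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= n:
            arm = t - 1                                   # pull every arm once first
        else:
            arm = max(range(n),
                      key=lambda a: sums[a] / counts[a]
                                    + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1 if random.random() < means[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
        regret += p_star - reward
    return regret

I1 = [0.01 * k for k in range(1, 21)]
print(ucb(I1, horizon=100000))
```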

Thompson Sampling

• Thompson (Thompson, 1933)
  - At time t, let arm a have s_a^t successes (ones) and f_a^t failures (zeroes).
  - Beta(s_a^t + 1, f_a^t + 1) represents a "belief" about the true mean of arm a.
  - Mean = (s_a^t + 1) / (s_a^t + f_a^t + 2);
    variance = (s_a^t + 1)(f_a^t + 1) / ((s_a^t + f_a^t + 2)^2 (s_a^t + f_a^t + 3)).
  - Computational step: for every arm a, draw a sample x_a^t ~ Beta(s_a^t + 1, f_a^t + 1).
  - Sampling step: sample an arm a for which x_a^t is maximal (a code sketch follows below).

[Figure: Beta belief densities over [0, 1], one per arm]

• Achieves optimal regret (Kaufmann et al., 2012); is excellent in practice (Chapelle and Li, 2011).

On instance I1 (with T = 100,000), regret is 463 ± 18.
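A direct sketch of the two steps above, with Bernoulli rewards and the uniform Beta(1, 1) prior implied by the slide's counts-plus-one parameters.

```python
import random

def thompson(means, horizon):
    """Thompson Sampling: draw x_a ~ Beta(s_a + 1, f_a + 1) for every arm,
    then play an arm for which the draw is maximal."""
    n = len(means)
    successes = [0] * n   # s_a: observed ones
    failures = [0] * n    # f_a: observed zeroes
    p_star = max(means)
    regret = 0.0
    for _ in range(horizon):
        draws = [random.betavariate(successes[a] + 1, failures[a] + 1)
                 for a in range(n)]
        arm = max(range(n), key=lambda a: draws[a])
        reward = 1 if random.random() < means[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        regret += p_star - reward
    return regret

I1 = [0.01 * k for k in range(1, 21)]
print(thompson(I1, horizon=100000))
```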

Consolidated Results on Instance I1

Method      Regret at T = 100,000
εG*         822 ± 24
Softmax*    720 ± 13
UCB         1168 ± 16
KL-UCB      738 ± 16
Thompson    463 ± 18

[Figure: regret as a function of the horizon T (10^2 to 10^6) for εG*, Softmax*, UCB, KL-UCB, and Thompson]

Use Thompson Sampling!

Principle: "Optimism in the face of uncertainty."


Discussion

• Challenges and extensions
  - Set of arms can change over time.
  - On-line updates not feasible; batch updating needed.
  - Rewards are delayed.
  - Arms might be dependent; "context" can be modeled (Li et al., 2010).
  - Nonstationary rewards; adversarial modeling possible (Auer et al., 2002b).

• Summary
  - Adaptive sampling of options, based on stochastic feedback, to maximise total reward.
  - Well-studied problem with a long history.
  - Thompson Sampling is an essentially optimal algorithm.
  - Modeling assumptions are typically violated only slightly in practice.

References

W. R. Thompson, 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

Herbert Robbins, 1952. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

T. L. Lai and Herbert Robbins, 1985. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Richard S. Sutton and Andrew G. Barto, 1998. Reinforcement Learning: An Introduction. MIT Press, 1998.

Eitan Altman, 2002. Applications of Markov Decision Processes in Communication Networks. Handbook of Markov Decision Processes: International Series in Operations Research & Management Science, 40:489–536, Springer, 2002.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, 2002a. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2–3):235–256, 2002.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire, 2002b. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Levente Kocsis and Csaba Szepesvári, 2006. Bandit Based Monte-Carlo Planning. In Proceedings of the Seventeenth European Conference on Machine Learning (ECML 2006), pp. 282–293, Springer, 2006.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire, 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the Nineteenth International Conference on the World Wide Web (WWW 2010), pp. 661–670, ACM, 2010.

Olivier Chapelle and Lihong Li, 2011. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems 24 (NIPS 2011), pp. 2249–2257, Curran Associates, 2011.

Aurélien Garivier and Olivier Cappé, 2011. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. Journal of Machine Learning Research (Workshop and Conference Proceedings), 19:359–376, 2011.

Emilie Kaufmann, Nathaniel Korda, and Rémi Munos, 2012. Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis. In Proceedings of the Twenty-third International Conference on Algorithmic Learning Theory (ALT 2012), pp. 199–213, Springer, 2012.

Upcoming Talks

• PAC Subset Selection in Stochastic Multi-armed Bandits
  - 4.00 p.m. – 5.30 p.m.; Thursday, August 14, 2014; CSA 254.

• RoboCup: A Grand Challenge for AI
  - 4.00 p.m. – 5.00 p.m.; Wednesday, August 20, 2014; CSA 254.

• An Introduction to Reinforcement Learning
  - 4.00 p.m. – 5.15 p.m.; Wednesday, August 27, 2014; CSA 254.

Thank you!

