Learning Reliable Policies in the Bandit Setting with Application to Adaptive Clinical Trials

    Hossein Aboutalebi∗1, Doina Precup1, and Tibor Schuster2

1 Department of Computer Science, McGill University; Mila Quebec AI Institute
2 Department of Family Medicine, McGill University

∗ [email protected]

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The stochastic multi-armed bandit problem is a well-known model for studying the exploration-exploitation trade-off. It has significant possible applications in adaptive clinical trials, which allow for a dynamic change of patient allocation ratios. However, most bandit learning algorithms are designed with the goal of minimizing the expected regret. While this approach is useful in many areas, in clinical trials it can be sensitive to outlier data, especially when the sample size is small. In this article, we propose a modification of the BESA algorithm [Baransi, Maillard, and Mannor, 2014] which takes into account the variance of the action outcomes in addition to the mean. We present a regret bound for our approach and evaluate it empirically both on synthetic problems and on a dataset from the clinical trial literature. Our approach compares favorably to a suite of standard bandit algorithms.

Introduction

The multi-armed bandit is a standard model for researchers to investigate the exploration-exploitation trade-off; see e.g. [Baransi, Maillard, and Mannor, 2014; Auer, Cesa-Bianchi, and Fischer, 2002; Sani, Lazaric, and Munos, 2012a; Chapelle and Li, 2011; Sutton and Barto, 1998]. Unlike fully sequential decision-making problems, multi-armed bandit problems are simple enough to allow for theoretical studies.

The multi-armed bandit problem consists of a set of arms, each of which generates a stochastic reward from a fixed but unknown distribution associated with it. The standard goal in this setting is to find the arm ⋆ which has the maximum expected reward µ_⋆ (or, equivalently, minimum expected regret). The expected regret R_T is defined as the sum of the expected differences between the mean reward of the chosen arm a_t and that of the optimal arm until t = T:

$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} \left(\mu_\star - \mu_{a_t}\right)\right]$$

While this objective is very popular, there are practical applications, for example in medical research and AI safety [García and Fernández, 2015], where maximizing expected value is not sufficient, and it would be better to have an algorithm that is also sensitive to the variability of the outcomes of a given arm. For example, consider multi-arm clinical trials where the objective is to find the most promising treatment among a pool of available treatments. Due to heterogeneity in patients' treatment responses, considering only the expected mean may not be of interest [Austin, 2011]. Specifically, as the mean is usually sensitive to outliers and does not provide information about the dispersion of individual responses, the expected reward has only limited value in achieving a clinical trial's objective. This is especially true if some outcomes are very bad for the patients. Due to this issue, analysis-of-variance approaches for studying the effectiveness of treatments were recently proposed [Corbin-Berrigan et al., 2018]. In other studies, like [Carandini, 2004], the variance itself is the source of interest. Also, the consistency of treatment effects among patients is essential, with the ideal treatment usually defined as the one which has a high positive response rate while showing low variability in response among patients.

In this paper, we tackle the problem of designing bandit algorithms that reflect both the mean and the variability of the arms, by extending one of the recent algorithms in the bandit literature called BESA (Best Empirical Sampled Average) [Baransi, Maillard, and Mannor, 2014]. One of the main advantages of BESA compared to other existing bandit algorithms is that it is a non-parametric learning algorithm. This is especially useful when one has no, or insufficient, prior knowledge about the different arms at the beginning. We establish regret bounds for the proposed algorithm, and we show that this new algorithm is superior to some past risk-averse learning algorithms, like MV-LCB and ExpExp [Sani, Lazaric, and Munos, 2012a], both on simulated tasks and on some clinical trial tasks.

Background and Notation

We consider the standard bandit setting with action (arm) set A, where each action a ∈ A is characterized by a reward distribution r_a bounded in the interval [0, 1]. The distribution for action a has mean µ_a and variance σ²_a. Let X_{a,i} ∼ r_a denote the i-th reward sampled from the distribution of action a. All actions and samples are independent. The bandit problem is described as an iterative game where, on each step (round) t, the player (an algorithm) selects action (arm) a_t and observes sample X_{a, N_{a,t}}, where

$$N_{a,t} = \sum_{s=1}^{t} \mathbb{I}\{a_s = a\}$$

denotes the number of samples observed for action a up to time t (inclusively). A policy is a distribution over A. In general, stochastic distributions are necessary during the learning stage in order to identify the best arm. We discuss the exact notion of "best" below.

We define I_S(m, j) as the set obtained by sub-sampling without replacement j elements from the set S of size m. Let X_{a,t} denote the history of observations (records) obtained from action (arm) a up to time t (inclusively), such that |X_{a,t}| = N_{a,t}. The notation X_{a,t}(I) indicates the set of sub-samples from X_{a,t}, where the sub-sample index set I ⊂ {1, 2, . . . , N_{a,t}}.
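A minimal sketch of the sub-sampling operator I_S(m, j) described above, assuming an arm's history is stored as a NumPy array (the function name and data layout are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sub_sample(records, j, rng=rng):
    """I_S(m, j): pick j of the m stored rewards uniformly at random,
    without replacement, and return them as a new array."""
    m = len(records)
    assert j <= m, "cannot sub-sample more elements than the history contains"
    idx = rng.choice(m, size=j, replace=False)   # index set I of size j
    return records[idx]

history = np.array([0.1, 0.9, 0.4, 0.7, 0.3])    # X_{a,t} with N_{a,t} = 5
print(sub_sample(history, 3))
```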

The multi-armed bandit was first presented in the seminal work of Robbins [Robbins, 1985]. It has been shown that under certain conditions [Burnetas and Katehakis, 1996; Lai and Robbins, 1985], a policy can have logarithmic cumulative regret:

$$\liminf_{t \to \infty} \frac{R_t}{\log(t)} \geq \sum_{a:\, \mu_a < \mu_\star} \frac{\mu_\star - \mu_a}{\mathrm{KL}(r_a, r_\star)}$$

where KL denotes the Kullback-Leibler divergence between the reward distribution of arm a and that of the optimal arm ⋆.

Measure of regret

Definition 0.2. Consistency-aware regret: The consistency-aware regret for a bandit algorithm B is defined as:

$$R_T^B(\rho) = \sum_{a \in A} \left(\mu_\star - \rho\sigma_\star^2 - \mu_a + \rho\sigma_a^2\right) \mathbb{E}[N_{a,T}] \qquad (2)$$

Note that when ρ = 0, the consistency-aware regret corresponds to the well-known expected regret:

$$R_T^B = \sum_{a \in A} \left(\mu_\star - \mu_a\right) \mathbb{E}[N_{a,T}] \qquad (3)$$
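A small sketch of equation (2), assuming the true means, variances, and expected pull counts are available (as they would be in a simulation), and assuming ⋆ denotes the arm with the largest consistency-aware value; the array names and numbers are illustrative:

```python
import numpy as np

def consistency_aware_regret(mu, var, expected_pulls, rho):
    """Equation (2): sum over arms of (v_star - v_a) * E[N_{a,T}],
    where v_a = mu_a - rho * var_a is an arm's consistency-aware value."""
    value = mu - rho * var                 # v_a for every arm
    v_star = value.max()                   # value of the optimal arm (largest v_a)
    return np.sum((v_star - value) * expected_pulls)

# Toy example: arm 0 has the higher mean but a much higher variance than arm 1.
mu = np.array([0.7, 0.6])
var = np.array([0.25, 0.05])
pulls = np.array([400.0, 600.0])           # hypothetical E[N_{a,T}] for T = 1000
print(consistency_aware_regret(mu, var, pulls, rho=0.0))   # plain expected regret, eq. (3)
print(consistency_aware_regret(mu, var, pulls, rho=1.0))   # variance-penalized regret, eq. (2)
```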

Note that when the context is clear, we will just use R_T. It is clear that computing the consistency-aware regret is not feasible in a real environment, as we do not have access to the underlying distributions of the arms. Hence, we define the following empirical mean and variance, which will be used to estimate this regret in our algorithm:

Definition 0.3. Empirical mean and variance: For an algorithm B, the empirical mean and empirical variance of arm a up to time t are:

$$\hat{\mu}_{a,t} = \frac{1}{N_{a,t}} \sum_{i=1}^{N_{a,t}} r_{a,i} \qquad (4)$$

$$\hat{\sigma}_{a,t}^2 = \frac{1}{N_{a,t} - 1} \sum_{i=1}^{N_{a,t}} \left(r_{a,i} - \hat{\mu}_{a,t}\right)^2 \qquad (5)$$

where r_{a,i} is the i-th reward obtained from pulling arm a.

Note that, unlike in [Sani, Lazaric, and Munos, 2012b], the empirical estimation of the variance of an arm here is unbiased. We will exploit this feature later in our proofs.

For ease of notation, we define the value function for the set of records of an arm as follows:

Definition 0.4. Consistency-aware value function: For a given record set X_{a,t}(I) of an arm a up to time step t, the corresponding value function is defined as:

$$\hat{v}(X_{a,t}(I)) = \hat{\mu}(X_{a,t}(I)) - \rho\,\hat{\sigma}^2(X_{a,t}(I)) \qquad (6)$$
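The empirical quantities in equations (4)-(6) map directly onto a few lines of code. The sketch below is illustrative (the function and variable names are not from the paper); note the N_{a,t} − 1 denominator that makes the variance estimate unbiased, as emphasized above.

```python
import numpy as np

def empirical_mean(rewards):
    """Equation (4): sample mean of the observed rewards of one arm."""
    return np.mean(rewards)

def empirical_variance(rewards):
    """Equation (5): unbiased sample variance (divides by N_{a,t} - 1)."""
    return np.var(rewards, ddof=1)

def value(rewards, rho):
    """Equation (6): consistency-aware value of a record set,
    v_hat = mu_hat - rho * sigma_hat^2."""
    return empirical_mean(rewards) - rho * empirical_variance(rewards)

records = np.array([0.2, 0.8, 0.5, 0.6])       # toy history X_{a,t}
print(value(records, rho=0.5))
```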

In the next section, we develop an algorithm which optimizes the consistency-aware regret, using the quantities defined above in its estimation.


Proposed Algorithm

In order to optimize the consistency-aware regret, we build on the BESA algorithm, which we now briefly review. As discussed in [Baransi, Maillard, and Mannor, 2014], BESA is a non-parametric approach for finding the optimal arm according to the expected mean regret criterion. Consider a two-armed bandit with actions a and ⋆, where µ_⋆ > µ_a, and assume that N_{a,t} < N_{⋆,t} at time step t. In order to select the next arm for time step t + 1, BESA first sub-samples s_⋆ = I_⋆(N_{⋆,t}, N_{a,t}) from the observation history (records) of arm ⋆ and similarly sub-samples s_a = I_a(N_{a,t}, N_{a,t}) = X_{a,t} from the records of arm a. If µ̂_{s_a} > µ̂_{s_⋆}, BESA chooses arm a; otherwise it chooses arm ⋆.

The main reason behind the sub-sampling is that it gives a similar opportunity to both arms. Consequently, the effect of having a small sample size, which may cause bias in the estimates, diminishes. When there are more than two arms, BESA runs a tournament algorithm on the arms [Baransi, Maillard, and Mannor, 2014].

Finally, it is worth mentioning that the proof of the regret bound of BESA uses a non-trivial lemma for which the authors did not provide any formal proof. In this paper, we avoid using this lemma to prove the soundness of our proposed algorithm.

We are now ready to outline our proposed approach, which we call BESA+. As in [Baransi, Maillard, and Mannor, 2014], we focus on the two-arm bandit. For more than two arms, a tournament can be set up in our case as well.

Algorithm BESA+ (two-action case)
Parameters: current time step t, actions a and b. Initially N_{a,0} = 0 and N_{b,0} = 0.
Shuffle the arms a and b with a function M to get a', b'.

1: if N_{a',t-1} = 0 ∨ N_{a',t-1} < log(t) then
2:   a_t = a'
3: else if N_{b',t-1} = 0 ∨ N_{b',t-1} < log(t) then
4:   a_t = b'
5: else
6:   n_{t-1} = min{N_{a',t-1}, N_{b',t-1}}
7:   I_{a',t-1} ← I_{a'}(N_{a',t-1}, n_{t-1})
8:   I_{b',t-1} ← I_{b'}(N_{b',t-1}, n_{t-1})
9:   Calculate ṽ_{a',t} = v̂(X_{a',t-1}(I_{a',t-1})) and ṽ_{b',t} = v̂(X_{b',t-1}(I_{b',t-1}))
10:  a_t = argmax_{i ∈ {a',b'}} ṽ_{i,t} (break ties by choosing the arm with fewer tries)
11: end if
12: return M^{-1}(a_t)
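Read as code, the listing above amounts to the following Python sketch. It is an illustrative reimplementation, not the authors' code: the random permutation plays the role of the shuffle M, Bernoulli arms and the horizon in the usage example are made up, and ties are broken toward the less-pulled arm as in step 10.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def value_hat(rewards, rho):
    """v_hat of a record set: empirical mean minus rho times the unbiased variance."""
    if len(rewards) < 2:
        return float(np.mean(rewards))           # variance undefined for a single sample
    return float(np.mean(rewards) - rho * np.var(rewards, ddof=1))

def besa_plus_select(t, histories, rho):
    """Choose the arm to pull at step t for a two-armed bandit.
    `histories` maps each arm (0 or 1) to the list of rewards observed so far."""
    a, b = rng.permutation([0, 1])               # the shuffle M
    threshold = math.log(t)                      # forced exploration: fewer than log(t) pulls
    if len(histories[a]) == 0 or len(histories[a]) < threshold:
        return a
    if len(histories[b]) == 0 or len(histories[b]) < threshold:
        return b
    n = min(len(histories[a]), len(histories[b]))
    sub_a = rng.choice(histories[a], size=n, replace=False)   # I_a(N_a, n)
    sub_b = rng.choice(histories[b], size=n, replace=False)   # I_b(N_b, n)
    va, vb = value_hat(sub_a, rho), value_hat(sub_b, rho)
    if va == vb:                                 # tie: prefer the less-tried arm
        return a if len(histories[a]) <= len(histories[b]) else b
    return a if va > vb else b

# Toy usage: arm 0 ~ Bernoulli(0.4), arm 1 ~ Bernoulli(0.6), horizon T = 2000.
histories = {0: [], 1: []}
for t in range(1, 2001):
    arm = besa_plus_select(t, histories, rho=0.1)
    reward = float(rng.random() < (0.4 if arm == 0 else 0.6))
    histories[arm].append(reward)
print({a: len(h) for a, h in histories.items()})   # most pulls should go to arm 1
```

For more than two arms, the same pairwise comparison would be applied repeatedly in the tournament fashion described earlier.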

The first major difference between BESA+ and BESA is the use of the consistency-aware value function instead of the simple regret. A second important change is that BESA+ selects the arm which has been tried less up to time step t if that arm has been chosen fewer than log(t) times up to t. Essentially, this change in the algorithm is negligible in terms of the total expected regret, as we cannot achieve any better bound than log(T), as shown in Robbins' lemma [Lai and Robbins, 1985]. This tweak also turns out to be vital in proving that the expected regret of the BESA+ algorithm is bounded by log(T) (a result which we present shortly).

To better understand why this modification is necessary, consider a two-arm scenario. The first arm gives a deterministic reward of r ∈ [0, 0.5) and the second arm has a uniform distribution on the interval [0, 1], with an expected reward of 0.5. If we are only interested in the expected reward (ρ = 0), the algorithm should ultimately favor the second arm. On the other hand, there is a probability of r that the BESA algorithm will constantly choose the first arm, namely when the second arm gives a value less than r on its first pull. In contrast, BESA+ evades this problem by letting the second arm be selected enough times that it eventually becomes distinguishable from the first arm.
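A quick way to see the failure mode described above is to estimate the probability of the lock-in event, i.e. the uniform arm's first draw falling below r. This is a hedged sketch with r chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

r = 0.4                                     # deterministic reward of the first arm
first_pulls = rng.random(100_000)           # first draw of the uniform arm in many runs
lock_in_rate = np.mean(first_pulls < r)     # fraction of runs where BESA can get stuck
print(lock_in_rate)                         # close to r = 0.4
```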

We are now ready to state the main theoretical result for our proposed algorithm.

Theorem 0.1. Let A = {a, ⋆} be a two-armed bandit with rewards bounded in [0, 1] and value gap ∆ = v_⋆ − v_a. Given the value ρ, the expected consistency-aware regret of Algorithm BESA+ up to time T is upper bounded as follows:

$$R_T = C_{\Delta,\rho} + O(\log(T)) \qquad (7)$$

where C_{∆,ρ} in (7) is a constant which depends on the values of ρ and ∆.

The interested reader can find the full proof at https://drive.google.com/file/d/1SAg8b7pJeY_rp8O9CfpQuDyzMPMauNTF/view.

Empirical results

Empirical comparison of BESA and BESA+

As discussed in the previous section, BESA+ has some advantages over BESA. We illustrate the example discussed in the previous section through the results in Figures 1-6, for r ∈ {0.2, 0.3, 0.4}. Each experiment has been repeated 200 times. Note that while BESA exhibits almost linear regret behavior, BESA+ can learn the optimal arm within the given time horizon and its expected accumulated regret is upper bounded by a log function. It is also easy to notice that BESA+ has a faster convergence rate compared with BESA. As r gets closer to 0.5, the problem becomes harder. This phenomenon is a direct illustration of our theoretical result.

Figure 1: Expected regret per step for r = 0.4, ρ = 0
Figure 2: Accumulated expected regret for r = 0.4, ρ = 0
Figure 3: Expected regret per step for r = 0.3, ρ = 0
Figure 4: Accumulated expected regret for r = 0.3, ρ = 0
Figure 5: Expected regret per step for r = 0.2, ρ = 0
Figure 6: Accumulated expected regret for r = 0.2, ρ = 0

Statistical dispersion estimate via sub-sampling without replacement

In this subsection, we explore the effect of sample size and sub-sample size on the consistency-aware value function error of BESA+. We have also studied different distributions to find out their effect on the consistency-aware value function error. Average results and standard error bars are computed over 200 independent experiments in all graphs.

Based on our experiments, changing the ρ value in the consistency-aware value function definition does not have much impact on the convergence rate. Moreover, as one would expect, the distribution type does not have a noticeable influence on the convergence rate either, although some small differences can be observed. Due to the space limit, we only include two figures (Figures 7, 8) to illustrate our claim. As can be seen in both figures, as we increase the sub-sample size, the expected error decreases significantly. It is also interesting to note that the expected error is almost independent of the sample size given the sub-sample size.

Figure 7: Expected error for the normal distribution (µ = 0 and σ = 1)
Figure 8: Expected error for the uniform distribution on the interval [0, 1]
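The sub-sampling experiment described above can be reproduced in outline as follows. This is a hedged sketch: the distribution, the value of ρ, and the grid of sizes are placeholders, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.5                                   # illustrative risk parameter
true_value = 0.0 - rho * 1.0                # mu - rho * sigma^2 for N(0, 1)

def value_hat(x, rho):
    """Empirical consistency-aware value of a record set."""
    return np.mean(x) - rho * np.var(x, ddof=1)

def mean_abs_error(sample_size, sub_size, trials=200):
    """Average |v_hat(sub-sample) - v| over independent experiments."""
    errs = []
    for _ in range(trials):
        sample = rng.normal(0.0, 1.0, size=sample_size)          # full history
        sub = rng.choice(sample, size=sub_size, replace=False)   # I_S(m, j)
        errs.append(abs(value_hat(sub, rho) - true_value))
    return np.mean(errs)

for j in (5, 20, 80):
    print(j, mean_abs_error(sample_size=200, sub_size=j))        # error shrinks with j
```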

Algorithm BESA+ performance

We also evaluated the performance of BESA+ with consistency-aware regret in a simulated environment. Our setup here is similar to the work of [Sani, Lazaric, and Munos, 2012b], in which the authors depicted the performance of MV-LCB and ExpExp in a synthetic environment. Here, we considered a two-arm environment with arm 1 having mean 1 and variance in the range [0.1, 1], and arm 2 having mean in the range [0.1, 1] and variance 1. It is clear that under any positive value of ρ, arm 1 should be preferred over arm 2. We have examined different values of ρ and studied their corresponding effects on the expected consistency-aware regret of Algorithm BESA+ (Figures 9, 10, 11). In the figures, n stands for the time step. These figures uncover three important aspects of BESA+. First, we can observe the kind of problems which are difficult for BESA+: it appears that as the difference between the means or variances of the two arms shrinks, BESA+ usually suffers a higher amount of regret. This fact can also be inferred from Theorem 0.1. Second, we can see the importance of the ρ value in diminishing the effect of variance or mean. In Figure 9, where ρ = 1, we can observe a bump near the squares where both the mean and variance gaps are small. In Figure 10, where ρ = 10, the effect of the mean gap almost vanishes, and we see that as we go from Figure 9 to Figure 10, the regret graph orients itself toward the smaller variance gap. The same thing happens as we go from Figure 10 to Figure 11. In this regard, in Figure 11 (where ρ = 0.1), the regret graph has oriented itself toward the smaller mean gap. Finally, these figures depict the speed of convergence of the BESA+ algorithm, which appears faster than that of MV-LCB and ExpExp.

Figure 9: Case ρ = 1. Top figure: n = 20. Middle figure: n = 200. Bottom figure: n = 2000.
Figure 10: Case ρ = 10. Top figure: n = 20. Middle figure: n = 200. Bottom figure: n = 2000.
Figure 11: Case ρ = 0.1. Top figure: n = 20. Middle figure: n = 200. Bottom figure: n = 2000.

Real Clinical Trial Dataset

Finally, we examined the performance of BESA+ against other methods (BESA, UCB1, Thompson sampling, MV-LCB, and ExpExp) on a real clinical dataset. This dataset includes the survival times of patients who were suffering from lung cancer [Ripley et al., 2013]. Two different kinds of treatments (a standard treatment and a test treatment) were applied to them, and the results are based on the number of days a patient survived after receiving one of the treatments. For the purpose of illustration and simplicity, we assumed non-informative censoring and equal follow-up times in both treatment groups. As the experiment has already been conducted, to apply bandit algorithms, each time a treatment is selected by a bandit algorithm we sampled uniformly from the recorded results of the patients who received that selected treatment and used the survival time as the reward signal. Figure 12 shows the distributions of treatments 1 and 2. We categorized the survival time into ten categories (category 1 denoting the minimum survival time). It is interesting to notice that while treatment 2 has a higher mean than treatment 1 due to the effect of outliers, it also has a higher variance than treatment 1. From Figure 12 it is easy to deduce that treatment 1 has a more consistent behavior than treatment 2, and a higher number of patients who received treatment 2 died early. That is why treatment 1 may be preferred over treatment 2 if we use the consistency-aware regret. In this regard, by setting ρ = 1, treatment 1 has a smaller expected mean-variance regret than treatment 2, and it should ultimately be favored by the learning algorithm. Figure 13 illustrates the performance of the different bandit algorithms. It is easy to notice that BESA+ performs relatively better than all the other ones.

Figure 12: Distribution of survival-time categories for treatment 1 and treatment 2.
Figure 13: Accumulated consistency-aware regret of UCB1, Thompson Sampling, MV-LCB, ExpExp, BESA, and BESA+.
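The replay-style evaluation described above (resampling recorded outcomes of whichever treatment the bandit picks) can be sketched as follows; the survival-time arrays are placeholders, not the actual dataset from [Ripley et al., 2013]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged outcomes (e.g., survival-time categories rescaled to [0, 1]).
logged = {
    0: np.array([0.1, 0.2, 0.2, 0.3, 0.9]),   # treatment 1: consistent, few outliers
    1: np.array([0.0, 0.1, 0.1, 0.2, 1.0]),   # treatment 2: higher spread
}

def replay_reward(arm):
    """Each time the bandit selects a treatment, draw uniformly at random
    from the recorded outcomes of patients who received that treatment."""
    return float(rng.choice(logged[arm]))

# Usage with any arm-selection rule: plug a bandit policy in place of the stand-in below.
histories = {0: [], 1: []}
for t in range(1, 1001):
    arm = int(rng.integers(2))        # stand-in policy; substitute BESA+ here
    histories[arm].append(replay_reward(arm))
```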



Conclusion and future work

In this paper, we developed a new definition of regret (called consistency-aware regret) which is sensitive to the variability in the rewards of the different arms by taking the variance of the rewards into account. We extended and modified the BESA algorithm to optimize consistency-aware regret and provided a bound on its performance. Finally, we illustrated the utility of our proposed algorithm on a real clinical dataset and studied its behaviour on some synthetic datasets.

We believe there still exists a noticeable gap between clinical trial problems, which inspired our work, and the nature of multi-armed bandit problems. Considering other ways to incorporate reward variability and providing bounds on the confidence interval of the arm chosen by a bandit learning algorithm are promising directions for future studies. It would also be interesting to extend other bandit algorithms, like Thompson sampling, to consistency-aware regret and study their properties. Finally, utilizing BESA+ in the acquisition of real data would be an important future validation step.


Acknowledgment

We would like to thank Audrey Durand for her comments and insight on this project. We also thank the Department of Family Medicine of McGill University and CIHR for their generous support during this project.

References

[Agrawal and Goyal, 2013] Agrawal, S., and Goyal, N. 2013. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, 99-107.

[Auer, Cesa-Bianchi, and Fischer, 2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235-256.

[Austin, 2011] Austin, P. C. 2011. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research 46(3):399-424.



[Baransi, Maillard, and Mannor, 2014] Baransi, A.; Maillard, O.-A.; and Mannor, S. 2014. Sub-sampling for multi-armed bandits. In ECML-KDD, 115-131.

[Burnetas and Katehakis, 1996] Burnetas, A. N., and Katehakis, M. N. 1996. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics 17(2):122-142.

[Carandini, 2004] Carandini, M. 2004. Amplification of trial-to-trial response variability by neurons in visual cortex. PLoS Biology 2(9):e264.

[Chapelle and Li, 2011] Chapelle, O., and Li, L. 2011. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, 2249-2257.

[Corbin-Berrigan et al., 2018] Corbin-Berrigan, L.-A.; Kowalski, K.; Faubert, J.; Christie, B.; and Gagnon, I. 2018. Three-dimensional multiple object tracking in the pediatric population: the NeuroTracker and its promising role in the management of mild traumatic brain injury. NeuroReport 29(7):559-563.

[Galichet, Sebag, and Teytaud, 2013] Galichet, N.; Sebag, M.; and Teytaud, O. 2013. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, 245-260.

[García and Fernández, 2015] García, J., and Fernández, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1):1437-1480.

[Kuleshov and Precup, 2014] Kuleshov, V., and Precup, D. 2014. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028.

[Lai and Robbins, 1985] Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4-22.

[Maillard, 2013] Maillard, O.-A. 2013. Robust risk-averse stochastic multi-armed bandits. In ICML, 218-233.

[Ripley et al., 2013] Ripley, B.; Venables, B.; Bates, D. M.; Hornik, K.; Gebhardt, A.; Firth, D.; and Ripley, M. B. 2013. Package MASS. CRAN.

[Robbins, 1985] Robbins, H. 1985. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers. Springer. 169-177.

[Sani, Lazaric, and Munos, 2012a] Sani, A.; Lazaric, A.; and Munos, R. 2012a. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, 3275-3283.

[Sani, Lazaric, and Munos, 2012b] Sani, A.; Lazaric, A.; and Munos, R. 2012b. Risk-aversion in multi-armed bandits. In NIPS, 3275-3283.

[Sutton and Barto, 1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.



