Multi-Armed Bandits with Correlated Arms
Samarth Gupta, Shreyas Chaudhari, Gauri Joshi and Osman Yağan
{samarthg, schaudh2, gaurij, oyagan}@andrew.cmu.edu
Carnegie Mellon University, Pittsburgh, PA, USA
ABSTRACT
We consider a multi-armed bandit framework where the rewards obtained by pulling different arms are correlated. The correlation information is captured in terms of pseudo-rewards, which are bounds on the rewards of the other arms given a reward realization, and can capture many general correlation structures. We leverage these pseudo-rewards to design a novel approach that extends any classical bandit algorithm to the correlated multi-armed bandit setting studied in this framework. In each round, our proposed C-Bandit algorithm identifies some arms as empirically non-competitive and avoids exploring them for that round. Through a unified regret analysis of the proposed C-Bandit algorithm, we show that C-UCB and C-TS (the correlated bandit versions of Upper Confidence Bound and Thompson Sampling) pull certain arms, called non-competitive arms, only O(1) times. As a result, we effectively reduce a K-armed bandit problem to a (C+1)-armed bandit problem, where C is the number of competitive arms, as only C sub-optimal arms are pulled O(log T) times. In many practical scenarios C can be zero, in which case our proposed C-Bandit algorithms achieve bounded regret. In the special case where rewards are correlated through a latent random variable X, we give a regret lower bound showing that bounded regret is possible only when C = 0. In addition to simulations, we validate the proposed algorithms via experiments on two real-world recommendation datasets, Movielens and Goodreads, and show that C-UCB and C-TS significantly outperform classical bandit algorithms.
CCS CONCEPTS
• Theory of computation → Online learning theory; Sequential decision making; Regret bounds.
KEYWORDS
multi-armed bandits, regret analysis, recommendation systems
ACM Reference Format:
Samarth Gupta, Shreyas Chaudhari, Gauri Joshi and Osman Yağan. 2020. Multi-Armed Bandits with Correlated Arms. In Under review at SIGMETRICS '20, June 08–12, 2020, Boston, MA. ACM, New York, NY, USA, 24 pages. https://doi.org/xx.xxxx/xxxxxxx.xxxxxxx
1 INTRODUCTION
1.1 Background and Motivation
Classical Multi-Armed Bandits. The multi-armed bandit (MAB) problem falls under the class of sequential decision making problems. In the classical multi-armed bandit problem, there are K arms, each with an unknown reward distribution. At each round t, we need to choose an arm kt ∈ K, and we receive a random reward Rt drawn from the reward distribution of arm kt. The goal in the classical multi-armed bandit is to maximize the long-term cumulative reward. In order to maximize cumulative reward, it is important to balance the exploration-exploitation trade-off, i.e., learning the mean reward of each arm while trying to make sure that the arm with the highest mean reward is played as many times as possible. This problem has been well studied for a long time, starting with the work of Lai and Robbins [1] that proposed the upper confidence bound (UCB) arm-selection algorithm and studied its fundamental limits in terms of bounds on regret. Subsequently, several other algorithms [2], including Thompson Sampling [3] and KL-UCB [4], have been proposed for this setting. The classical multi-armed bandit model is useful in numerous applications involving medical diagnosis [5], system testing [6], scheduling in computing systems [7–9], and web optimization [10, 11], among others.
Of particular interest to this work is the problem of optimal ad-selection. Suppose that a company is to run a display advertising campaign for one of its products, and its creative team has designed several different versions that can be displayed. It is expected that the user engagement (in terms of click probability and time spent looking at the ad) depends on the version of the ad that is displayed. In order to maximize the total user engagement over the course of the ad campaign, multi-armed bandit algorithms can be used; different versions of the ad correspond to the arms, and the reward from selecting an arm is given by the clicks or time spent looking at the ad version corresponding to that arm.
Personalized recommendations using Contextual and Structured bandits. Although the ad-selection problem can be solved by standard MAB algorithms, there are several specialized MAB variants that are designed to give better performance. For instance, the contextual bandit problem [12, 13] has been studied to provide personalized displays of the ads to the users. Here, before making a choice at each time step (i.e., deciding which version to show to a user), we observe the context associated with that user (e.g., age/occupation/income features). Contextual bandit algorithms learn the mappings from the context θ to the most favored version of the ad k*(θ) in an online manner and thus are useful for personalized recommendations. A closely related problem is the structured bandit problem [14–17], in which the context θ (age/income/occupation features) is hidden but the mean rewards of the different versions of the ad (arms) as a function of the hidden context θ are
arXiv:1911.03959v1 [stat.ML] 6 Nov 2019
Figure 1: The ratings of a user corresponding to different versions of the same ad are likely to be correlated. For example, if a person likes the first version, there is a good chance that they will also like the second, as it also has George Clooney in it. However, the population composition is unknown, i.e., the fraction of people liking the first/second or the last version is unknown. Photos of advertisement taken from https://www.nespresso.com.
known. Such models prove useful for personalized recommendation in which the context of the user is unknown, but the reward mappings µk(θ) are known through surveyed data.
Global Recommendations using Correlated-Reward Bandits. In this work we study a variant of the classical multi-armed bandit problem in which the rewards corresponding to different arms are correlated with each other. In many practical settings, the rewards we get from different arms at any given step are likely to be correlated. In the ad-selection example given in Figure 1, a user reacting positively (by clicking, ordering, etc.) to the first version of the ad with George Clooney might also be more likely to click the second version that also has Clooney; of course, one can construct examples where there is negative correlation between click events for different ads. The model we study in this paper explicitly captures these correlations. Similar to the classical MAB setting, the goal here is to display versions of the ad that maximize user engagement. Unlike contextual bandits, we do not observe the context (age/occupation/income) features of the user and do not focus on providing personalized recommendations. Instead, our goal is to provide global recommendations to a population whose demographics are unknown. Unlike structured bandits, we do not assume that the mean rewards are functions of a hidden context parameter θ. In structured bandits, although the mean rewards depend on θ, the reward realizations can still be independent.
1.2 Summary of Main Results.
Model overview. Motivated by the presence of correlation in user choices in multi-armed bandit environments, we study a multi-armed bandit problem that explicitly models correlations among rewards. These correlations are captured in the form of pseudo-rewards, which provide upper bounds on the conditional expectation of rewards. For example, in the context of displaying ad versions, where the user either likes or dislikes a version, pseudo-rewards represent an upper bound on the chance that the user likes version B of
Figure 2: Upon observing a reward r from an arm k, the pseudo-rewards sℓ,k(r) give us an upper bound on the conditional expectation of the reward from arm ℓ given that we observed reward r from arm k. These pseudo-rewards model the correlation in rewards corresponding to different arms.
the ad if they liked/disliked version A. We show that the knowledge of such bounds, even when they are not all tight, can lead to significant improvement in the cumulative reward obtained by reducing the amount of exploration compared to classical MAB algorithms. Figure 2 presents an illustration of our correlation model, where the pseudo-rewards, denoted by sℓ,k(r), provide an upper bound on the reward that we could have received from arm ℓ given that pulling arm k led to a reward of r.
Pseudo-rewards in practice. The pseudo-rewards sℓ,k(r) can be obtained through domain knowledge or from historical data.
Pseudo-rewards from domain knowledge. For instance, in the context of medical testing, where the goal is to identify the best drug to treat an ailment from among a set of K possible options, the effectiveness of two drugs is correlated when the drugs share some common ingredients. Through the domain knowledge of doctors, it is possible to answer questions such as "what are the chances that drug B would be effective given that drug A was not effective?", through which we can infer the pseudo-rewards.
Pseudo-rewards from surveyed data. In the context of displaying advertisements, such correlations can be learned through surveys in which a user is asked to rate different versions of the ad in an experimental setup. Once these pseudo-rewards are learned, the company can then use them to perform ad-selection at a global level (where the reward distributions corresponding to different arms are unknown). The correlations persist because of the inherent similarities and differences between the different versions of the ad.
A key advantage of our problem setup is that these pseudo-rewards are just upper bounds on the conditional expected rewards and can be arbitrarily loose. The proposed algorithm adapts accordingly and, in any case, performs at least as well as the classical bandit algorithms. This also makes the model practical: if some pseudo-rewards are unknown due to lack of domain knowledge/data, they can simply be replaced by the maximum possible reward entries, which serve as natural upper bounds.
Algorithm Overview. We use the knowledge of pseudo-rewards to extend any classical bandit strategy to the correlated MAB setting.
To do so, in each round t, the algorithm performs the following three steps.
(1) Select the arm kmax that has been pulled the most so far, i.e., until round t − 1.
(2) Identify the set At of arms that are empirically competitive with respect to kmax, that is, arms whose empirical mean pseudo-rewards are larger than the empirical mean reward of arm kmax.
(3) Use a classical multi-armed bandit algorithm (for example, UCB or Thompson Sampling) over the reduced set of arms At ∪ {kmax} to determine the arm pulled in round t.
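The three steps above can be sketched in a few lines. This is an illustrative sketch with names of our own choosing; `index` stands for whichever classical bandit index (a UCB value, a Thompson sample, etc.) is plugged in as the last step.

```python
import numpy as np

def c_bandit_round(mu_hat, n_pulls, phi_hat, index):
    """One round of C-Bandit (illustrative sketch, names are ours).

    mu_hat[k]     : empirical mean reward of arm k
    n_pulls[k]    : number of times arm k has been pulled so far
    phi_hat[l, k] : empirical pseudo-reward of arm l w.r.t. arm k
    index[k]      : index of the chosen classical bandit rule (e.g. UCB)
    """
    # Step 1: the arm pulled most often so far.
    k_max = int(np.argmax(n_pulls))
    # Step 2: arms whose empirical pseudo-reward w.r.t. k_max is at least
    # the empirical mean of k_max are empirically competitive.
    A_t = {l for l in range(len(mu_hat))
           if l != k_max and phi_hat[l, k_max] >= mu_hat[k_max]}
    # Step 3: run the classical rule over A_t ∪ {k_max} only.
    return max(A_t | {k_max}, key=lambda k: index[k])

mu_hat = np.array([0.9, 0.5, 0.2])
n_pulls = np.array([10, 3, 3])
phi_hat = np.array([[0.0, 1.0, 1.0],
                    [0.95, 0.0, 1.0],   # arm 1 looks competitive vs arm 0
                    [0.40, 0.8, 0.0]])  # arm 2 does not
index = np.array([0.95, 1.2, 2.0])
print(c_bandit_round(mu_hat, n_pulls, phi_hat, index))  # 1: arm 2 is excluded
```

Note that arm 2 has the highest index but is never considered, since its empirical pseudo-reward with respect to the most-pulled arm falls below that arm's empirical mean.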
We refer to this algorithm as C-Bandit, where Bandit refers to the classical bandit algorithm used in the last step (i.e., UCB/TS/KL-UCB).
Regret Analysis and the Notion of Competitive Arms. Through a regret analysis of C-UCB and C-TS, we obtain the following upper bound on their expected regret.
Proposition 1.1 (Upper Bound on Expected Regret). The expected cumulative regret of the C-UCB and C-TS algorithms is upper bounded as

E[Reg(T)] ≤ C · O(log T) + O(1).    (1)

Here C denotes the number of competitive arms. We call an arm k competitive if the expected pseudo-reward of arm k with respect to the optimal arm k* is larger than the mean reward of arm k*. Formally, an arm k is competitive if E[sk,k*(r)] ≥ µk*, and the arm is said to be non-competitive otherwise. The result in Proposition 1.1 arises from the fact that the C-UCB and C-TS algorithms end up pulling the non-competitive arms only O(1) times; only the competitive arms are pulled O(log T) times. In contrast to UCB/TS, which pull all K − 1 sub-optimal arms O(log T) times, our proposed C-UCB and C-TS algorithms pull only C ≤ K − 1 arms O(log T) times. In this sense, we reduce a K-armed bandit problem to a (C+1)-armed bandit problem. We emphasize that k*, µ* and C are all unknown to the algorithm at the beginning. In fact, when C = 0, our proposed algorithms achieve bounded regret, meaning that after some finite step, no arm but the optimal one will be selected.
Simulations and Experiments. Figure 3 illustrates the performance of C-UCB and C-TS relative to UCB in a correlated multi-armed bandit setting with three arms. The value of C depends on the underlying hidden joint probability distribution; we show setups where C = 0 in Figure 3(a), C = 1 in Figure 3(b) and C = 2 in Figure 3(c). We see that when C = 0, our proposed algorithms achieve bounded regret. In Figure 3(b), we see a reduction in regret over UCB, as only one arm is pulled O(log T) times by C-UCB and C-TS, and in Figure 3(c) we see that the performance of C-UCB is similar to UCB as both sub-optimal arms are competitive. We extensively validate our results by performing experiments on two real-world datasets, namely Movielens and Goodreads, which show that the proposed approach yields drastically smaller regret than classical multi-armed bandit strategies.
Figure 3: The cumulative regret of C-UCB and C-TS depends on the number of competitive arms, i.e., C. C itself depends on the unknown joint probability distribution of rewards and is not known beforehand. We consider setups where C = 0 in (a), C = 1 in (b) and C = 2 in (c).
1.3 Key Contributions.
i) A General and Previously Unexplored Correlated Multi-Armed Bandit Model. In Section 2 we describe our novel correlated multi-armed bandit model, in which the rewards of a user corresponding to different arms are correlated with each other. This correlation is captured by the knowledge of pseudo-rewards, which are upper bounds on the conditional mean reward of arm ℓ given the reward of arm k. While the pseudo-rewards are known, they can be arbitrarily loose. Thus, our framework captures very general settings where only partial information about correlations is available.
ii) An approach to generalize algorithms to the correlated MAB setting. We propose a novel approach in Section 3 that extends any classical bandit algorithm (such as UCB, TS, KL-UCB, etc.) to the correlated MAB setting studied in this paper. This is done by identifying some arms as empirically non-competitive in each round from the samples obtained so far. The empirically non-competitive arms are likely to be sub-optimal, and hence our algorithm focuses on picking one of the empirically competitive arms through any classical bandit algorithm of choice. Being able to choose any bandit algorithm for selection among the empirically competitive arms allows us to leverage algorithms such as Thompson Sampling and KL-UCB that are known to outperform UCB.
iii) Unified regret analysis. We present our regret bounds and analysis in Section 4. A rigorous analysis of the regret achieved under both C-UCB and C-TS is given through a unified technique. This technique can be of broad interest since it provides a recipe to obtain a regret analysis for any C-Bandit algorithm. Our regret bounds for C-UCB and C-TS show that the non-competitive arms are pulled only O(1) times, as opposed to the O(log T) times typical in bandit problems. As a result, only C ≤ K − 1 (competitive) arms are pulled O(log T) times, leading to a significant reduction in regret relative to UCB or TS, which pull each of the K − 1 sub-optimal arms O(log T) times.
iv) Evaluation using real-world datasets. We also perform simulations to validate our theoretical results in Section 5 and show performance in the special case where rewards are correlated through a hidden random variable X. Our experimental results
  r  s2,1(r)      r  s1,2(r)
  0    0.7        0    0.8
  1    0.4        1    0.5

 (a)      R1 = 0   R1 = 1       (b)      R1 = 0   R1 = 1
 R2 = 0    0.2      0.4         R2 = 0    0.2      0.3
 R2 = 1    0.2      0.2         R2 = 1    0.4      0.1

Table 1: The top row shows the pseudo-rewards of arms 1 and 2, i.e., upper bounds on the conditional expected rewards (which are known to the player). The bottom row depicts two possible joint probability distributions (unknown to the player). Under distribution (a), Arm 1 is optimal, whereas Arm 2 is optimal under distribution (b).
given in Section 6 on the Movielens [18] and Goodreads [19] datasets demonstrate the applicability of our C-Bandit approach in practical settings. In particular, they demonstrate how the pseudo-rewards can be learned in practice. The results show significant improvement over the performance of classical bandit approaches for recommendation system applications.
2 PROBLEM FORMULATION
2.1 Correlated Multi-Armed Bandit Model
Consider a multi-armed bandit setting with K arms {1, 2, . . . , K}. At each round t, a user enters the system and we need to decide an arm kt to display to the user. Upon displaying arm kt, we receive a random reward Rkt ∈ [0, B]. Our goal is to maximize the cumulative reward over time. The expected reward (over the population of users) of arm k is denoted by µk. If we knew the arm with the highest mean, i.e., k* = arg max_{k∈K} µk, beforehand, then we would always pull arm k* to maximize the expected cumulative reward. We now define the cumulative regret, minimizing which is equivalent to maximizing the cumulative reward:
Reg(T) = Σ_{t=1}^{T} (µ_{k*} − µ_{kt}) = Σ_{k≠k*} n_k(T) ∆_k.    (2)

Here, n_k(T) denotes the number of times a sub-optimal arm k is pulled up to round T, and ∆_k denotes the sub-optimality gap of arm k, i.e., ∆_k = µ_{k*} − µ_k.
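As a quick numerical sanity check of Eq. (2), the per-round sum of instantaneous gaps and the count-weighted sum over sub-optimal arms agree for any pull sequence. The mean values and the pull sequence below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.7, 0.5, 0.3])          # hypothetical mean rewards; arm 0 is k*
pulls = rng.integers(0, 3, size=1000)   # an arbitrary sequence of pulled arms

# Per-round form: sum over t of the instantaneous gap mu_{k*} - mu_{k_t}.
reg_rounds = np.sum(mu.max() - mu[pulls])

# Equivalent form: sum over arms of n_k(T) * Delta_k (Delta_{k*} = 0
# contributes nothing, matching the sum over k != k*).
n_k = np.bincount(pulls, minlength=3)
reg_counts = np.sum(n_k * (mu.max() - mu))
```

Both expressions count, for every sub-optimal pull, exactly the gap of the pulled arm, so they coincide.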
The classical multi-armed bandit setting assumes the rewards to be independent across arms. More formally, Pr(Rℓ = rℓ | Rk = r) = Pr(Rℓ = rℓ) for all rℓ, r. Consequently, E[Rℓ | Rk = r] = E[Rℓ] for all r. However, in most practical scenarios this assumption is unlikely to be true. In fact, the rewards of a user corresponding to different arms are likely to be correlated. Motivated by this, we consider a setup where the conditional distribution of the reward from arm ℓ given the reward from arm k is not equal to the marginal distribution of the reward from arm ℓ, i.e., f_{Rℓ|Rk}(rℓ | rk) ≠ f_{Rℓ}(rℓ), with f_{Rℓ}(rℓ) denoting the probability distribution function of the reward from arm ℓ. Consequently, due to such correlations, we have E[Rℓ | Rk] ≠ E[Rℓ].
In our problem setting, the rewards obtained from a user corresponding to different arms are correlated, and this correlation is modeled by the knowledge of pseudo-rewards, which constitute upper bounds on the conditional expected rewards.
  r  s2,1(r)  s3,1(r)      r  s1,2(r)  s3,2(r)      r  s1,3(r)  s2,3(r)
  0    0.7      2          0    0.5      1.5        0    1.5      2
  1    0.8      1.2        1    1.3      2          1    2        1.3
  2    2        1          2    2        0.8        2    0.7      0.75

Table 2: If some pseudo-reward entries are unknown (due to lack of prior knowledge/data), those entries can be replaced with the maximum possible reward and then used in the C-Bandit algorithm. We do that here by entering 2 for the entries where pseudo-rewards are unknown.
Definition 1 (Pseudo-Reward). Suppose we pull arm k and observe reward r; then the pseudo-reward of arm ℓ with respect to arm k, denoted by sℓ,k(r), is an upper bound on the conditional expected reward of arm ℓ, i.e.,

E[Rℓ | Rk = r] ≤ sℓ,k(r).    (3)
2.2 Illustration
Consider the example shown in Table 1, where we have a two-armed bandit problem in which the reward is either 0 or 1. Table 1 illustrates example values of the pseudo-rewards for this problem. While the pseudo-rewards are known, the underlying joint probability distribution is unknown. For instance, when the joint probability distribution is as shown in Table 1(a), Arm 1 is optimal, and Arm 2 is optimal if the joint probability distribution is as shown in Table 1(b).
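These claims are easy to verify numerically. The arrays below transcribe the two joint distributions from Table 1; the means and conditional expectations follow by direct computation.

```python
import numpy as np

# Joint distributions over (R1, R2) from Table 1; rows index R2, columns R1.
P_a = np.array([[0.2, 0.4],    # R2 = 0
                [0.2, 0.2]])   # R2 = 1
P_b = np.array([[0.2, 0.3],
                [0.4, 0.1]])

def means(P):
    """(mu1, mu2) = (P(R1 = 1), P(R2 = 1)) for binary rewards."""
    return P[:, 1].sum(), P[1, :].sum()

mu1_a, mu2_a = means(P_a)   # 0.6 vs 0.4: Arm 1 optimal under (a)
mu1_b, mu2_b = means(P_b)   # 0.4 vs 0.5: Arm 2 optimal under (b)

# The pseudo-reward s_{2,1}(0) = 0.7 from Table 1 upper-bounds the
# conditional expectation E[R2 | R1 = 0] under distribution (a):
e_r2 = P_a[1, 0] / P_a[:, 0].sum()   # 0.2 / 0.4 = 0.5 <= 0.7
```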
In practice, these pseudo-rewards can be learned from prior available data, or through offline surveys in which users are presented with all K arms, allowing us to sample R1, . . . , RK jointly. Through such data, we can estimate the conditional expected rewards. For example, in Table 1, we can look at all users who obtained reward 0 for Arm 1 and calculate their average reward for Arm 2, say µ2,1(0). This average provides an estimate of the conditional expected reward. If the training data is large, one can use this value directly as s2,1(0) because, by the law of large numbers, the empirical average converges to E[R2 | R1 = 0]. Since we only need an upper bound on E[R2 | R1 = 0], we can use several approaches to construct the pseudo-rewards. For example, we can set s2,1(0) = µ2,1(0) + σ2,1(0), with σ2,1(0) denoting the empirical standard deviation of the conditional reward of Arm 2. In addition, the pseudo-reward for any unknown conditional mean reward can be filled in with the maximum possible reward for the corresponding arm. Table 2 shows an example of a 3-armed bandit problem where some pseudo-reward entries are unknown, e.g., due to lack of data. We can fill these missing entries with the maximum possible reward (i.e., 2), as shown in Table 2, to complete the pseudo-reward entries.
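A minimal sketch of this estimation on synthetic, hypothetical survey data, using the mean-plus-one-standard-deviation construction described above (the survey itself and the correlation mechanism are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical survey: each row is one user's jointly sampled (R1, R2),
# binary rewards correlated through a common uniform draw z.
z = rng.random(1000)
survey = np.column_stack([(z < 0.6).astype(int),   # R1
                          (z < 0.8).astype(int)])  # R2

# Average reward on Arm 2 among surveyed users with reward 0 on Arm 1.
r2 = survey[survey[:, 0] == 0, 1]
mu_2_1_0, sigma_2_1_0 = r2.mean(), r2.std()

# Conservative pseudo-reward: mean plus one standard deviation, capped
# at the maximum possible reward (1 for binary rewards).
s_2_1_0 = min(mu_2_1_0 + sigma_2_1_0, 1.0)
```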
Remark 1 (Reduction to Classical Multi-Armed Bandits). When all pseudo-reward entries are unknown, all pseudo-reward entries can be filled with the maximum possible reward for each arm. In such a case, the problem framework studied in this paper reduces to the setting of the classical multi-armed bandit problem and our
Figure 4: Rewards for different arms are correlated through a hidden random variable X. At each round, X takes a realization in 𝒳. The reward obtained from an arm k is Yk(X). The figure illustrates lower and upper bounds on Yk(X) (through dotted lines). For instance, when X takes the realization 1, the reward of arm 3 is a random variable bounded between 1 and 3.
proposed C-Bandit algorithm performs exactly like standard bandit algorithms (e.g., UCB, TS).
2.3 Special Case: Correlated Bandits with a Latent Random Source
Our proposed correlated multi-armed bandit framework subsumes many interesting and previously unexplored multi-armed bandit settings. One such special case is the correlated multi-armed bandit model where the rewards depend on a common latent source of randomness. More concretely, the rewards of different arms are correlated through a hidden random variable X (see Figure 4). At each round t, X takes an i.i.d. realization Xt ∈ 𝒳 (unobserved by the player) and, upon pulling arm k, we observe a random reward Yk(Xt). The latent random variable X here could represent the features (e.g., age/occupation) of the user arriving to the system, to whom we show one of the K arms. These features of the user are hidden in the problem due to privacy concerns. For the application of ad-selection, the random reward Yk(Xt) represents the preference of a user with context Xt for the k-th version of the ad.
In this problem setup, upper and lower bounds on Yk(X), namely an upper bound ḡk(X) and a lower bound gk(X), are known. For instance, the information on upper and lower bounds of Yk(Xt) could represent knowledge of the form that children of age 5-10 rate documentaries only in the range 1-3 out of 5. Such information can be known or learned through prior available data. While the bounds on Yk(X) are known, the distribution of X and the reward distribution within the bounds are unknown, due to which the optimal arm is not known beforehand. Thus, an online approach is needed to minimize the regret. We now show how this setting can be covered by our general framework, which allows us to use the algorithms proposed in this paper to solve the correlated multi-armed bandit problem with a latent random source.
It is possible to translate this setting to the general framework described in this paper by transforming the mappings Yk(X) into pseudo-rewards sℓ,k(r). Recall that the pseudo-rewards represent an upper bound on the conditional expectation of the rewards. In this
Figure 5: An illustration of how to calculate pseudo-rewards in the correlated MAB problem with a latent random source. Upon observing a reward of 4 from arm 1, we can see that the maximum possible rewards for arms 2 and 3 are 3.5 and 4, respectively. Therefore, s2,1(4) = 3.5 and s3,1(4) = 4.
framework, sℓ,k(r) can be calculated as:

sℓ,k(r) = max_{x : gk(x) < r < ḡk(x)} ḡℓ(x),

where ḡk(x) and gk(x) represent the upper and lower bounds on Yk(x).
Upon observing a reward realization from arm k, it is possible to estimate the maximum possible reward that would have been obtained from arm ℓ through the knowledge of the bounds on Yk(X).
Figure 5 illustrates how the pseudo-reward is evaluated when we obtain a reward r = 4 by pulling arm 1. We first infer that X lies in [0, 0.8] if r = 4, and then find the maximum possible rewards for arm 2 and arm 3 over this set. Once these pseudo-rewards are constructed, the problem fits the general framework described in this paper and we can directly use the algorithms proposed for this setting.
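When the latent variable takes values on a discrete grid, the maximization above can be carried out directly. The bound arrays below are hypothetical stand-ins for ḡk and gk, and we use non-strict inequalities for the feasibility check so that rewards on the boundary remain consistent.

```python
import numpy as np

def pseudo_reward(r, k, ell, g_lo, g_hi):
    """s_{ell,k}(r): best reward arm `ell` could give over all latent
    realizations x consistent with observing reward r from arm k.
    g_lo[k], g_hi[k] are lower/upper bounds on Y_k(x) over the x-grid."""
    feasible = (g_lo[k] <= r) & (r <= g_hi[k])
    return np.max(g_hi[ell][feasible])

# Hypothetical bounds on a 3-point grid of latent realizations.
g_lo = np.array([[3.0, 1.0, 0.0],    # arm 0
                 [0.0, 2.0, 1.0]])   # arm 1
g_hi = np.array([[5.0, 2.0, 1.0],
                 [2.0, 4.0, 3.0]])

# A reward of 4 from arm 0 is consistent only with the first grid point,
# where arm 1 pays at most 2.
s = pseudo_reward(4.0, k=0, ell=1, g_lo=g_lo, g_hi=g_hi)  # 2.0
```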
2.4 Comparison with Parametric (Structured) Models
As mentioned in Section 1, a seemingly related model is the structured bandits model [14, 15, 20]. Structured bandits is a class of problems that covers linear bandits [16], generalized linear bandits [21], Lipschitz bandits [22], global bandits [23], regional bandits [24], etc. In the structured bandits setup, the mean rewards corresponding to different arms are related to one another through a hidden parameter θ. The underlying value of θ is fixed and the mean reward mappings θ → µk(θ) are known. Similarly, [25] studies a dependent-armed bandit problem that also has the mean rewards of different arms related to one another. It considers a parametric model, where the mean rewards of different arms are drawn from one of K clusters, each having an unknown parameter πi. All of these models are fundamentally different from the problem setting considered in this paper. We list some of the differences with structured bandits (and the model in [25]) below.
(1) In this work we explicitly model the correlations in the rewards of a user corresponding to different arms. While the mean rewards are related to each other in structured bandits and [25], the reward realizations are not necessarily correlated.
(2) Another key difference is that the model studied here is non-parametric, in the sense that there is no hidden feature space as is the case in structured bandits and [25].
(3) An important point to highlight is that the reward mappings from θ to µk(θ) in the structured bandits setup need to be exact. If they happen to be incorrect, then the algorithms for structured bandits cannot be used, as they rely on the correctness of µk(θ) to construct confidence intervals on the unknown parameter θ. In contrast, the model studied here only relies on the pseudo-rewards being upper bounds on the conditional expectations. These bounds need not be tight, and the proposed C-Bandit algorithms adjust accordingly and perform at least as well as the corresponding classical bandit algorithm. In other words, our approach can be useful in any setting where prior data is available with given confidence intervals (which can then be converted into upper bounds on the conditional mean rewards), while the structured setting requires exact mean values to be known.
3 THE PROPOSED C-BANDIT ALGORITHMS
We now propose an approach that extends classical multi-armed bandit algorithms (such as UCB, Thompson Sampling, KL-UCB) to the correlated MAB setting. At each round t + 1, the UCB algorithm [26] selects the arm with the highest UCB index I_{k,t}, i.e.,

k_{t+1} = arg max_{k∈K} I_{k,t},   I_{k,t} = µ_k(t) + B √(2 log t / n_k(t)),    (4)

where µ_k(t) is the empirical mean of the rewards received from arm k until round t, and n_k(t) is the number of times arm k has been pulled till round t. The second term in the UCB index causes the algorithm to explore arms that have been pulled only a few times (small n_k(t)). Recall that we assume all rewards to be bounded within an interval of size B. When the index t is implied by context, we abbreviate µ_k(t) and I_k(t) to µ_k and I_k, respectively, in the rest of the paper.
Under Thompson Sampling [27], the arm k_{t+1} = arg max_{k∈K} S_{k,t} is selected at time step t + 1. Here, S_{k,t} is the sample obtained from the posterior distribution of µ_k. That is,

k_{t+1} = arg max_{k∈K} S_{k,t},   S_{k,t} ∼ N(µ_k(t), βB/(n_k(t) + 1)).    (5)
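The two selection indices translate directly into code. This is a sketch under stated assumptions: B and β are placeholder values, and we read the second argument of N(·, ·) in Eq. (5) as a variance, hence the square root when drawing the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 1.0      # size of the reward interval (placeholder)
beta = 1.0   # TS scaling parameter (placeholder)

def ucb_index(mu_k, n_k, t):
    """UCB index of Eq. (4): empirical mean plus exploration bonus."""
    return mu_k + B * np.sqrt(2.0 * np.log(t) / n_k)

def ts_sample(mu_k, n_k):
    """Gaussian Thompson sample of Eq. (5); variance shrinks with pulls."""
    return rng.normal(mu_k, np.sqrt(beta * B / (n_k + 1)))

# The exploration bonus shrinks as an arm is pulled more often.
i_few, i_many = ucb_index(0.5, 4, 100), ucb_index(0.5, 64, 100)
```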
In the correlated MAB framework, the rewards observed from one arm can help estimate the rewards of other arms. Our key idea is to use this information to reduce the amount of exploration required. We do so by evaluating the empirical pseudo-reward of every other arm ℓ with respect to an arm k. If this pseudo-reward is smaller than the empirical mean reward of arm k, then arm ℓ is considered to be empirically non-competitive with respect to arm k, and we do not consider it as a candidate in the UCB/Thompson Sampling/any other bandit algorithm.
We define the notion of empirically competitive arms in Sec-tion 3.2 and then describe how we modify the classical banditalgorithms to perform in the considered correlated MAB setting inSection 3.3.
3.1 Empirical and Expected Pseudo-Rewards
In our correlated MAB framework, the pseudo-reward of arm ℓ with respect to arm k provides us an estimate of the reward of arm ℓ through the reward samples obtained from arm k. We now define the notion of empirical pseudo-reward, which can be used to obtain an optimistic estimate of µℓ using just the reward samples of arm k.
Definition 2 (Empirical and Expected Pseudo-Reward). After t rounds, arm k has been pulled n_k(t) times. Using these n_k(t) reward realizations, we can construct the empirical pseudo-reward ϕ_{ℓ,k}(t) for each arm ℓ with respect to arm k as follows:

ϕ_{ℓ,k}(t) ≜ (Σ_{τ=1}^{t} 1_{k_τ=k} s_{ℓ,k}(r_τ)) / n_k(t),   ℓ ∈ {1, . . . , K} \ {k}.    (6)

The expected pseudo-reward of arm ℓ with respect to arm k is defined as

ϕ_{ℓ,k} ≜ E[s_{ℓ,k}(r)].    (7)

Observe that E[s_{ℓ,k}(r)] ≥ E[E[Rℓ | Rk = r]] = µℓ. Due to this, the empirical pseudo-reward ϕ_{ℓ,k}(t) can be used to obtain an estimated upper bound on µℓ. Note that the empirical pseudo-reward ϕ_{ℓ,k}(t) is defined with respect to arm k and is only a function of the rewards observed by pulling arm k.
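Eq. (6) is just a conditional average of the pseudo-reward mapping over the rounds in which arm k was pulled. A small sketch, using the s2,1 values from Table 1 and an invented pull history:

```python
import numpy as np

def empirical_pseudo_reward(pulled, rewards, k, s_ell_k):
    """phi_{ell,k}(t) of Eq. (6): average of s_{ell,k}(r_tau) over the
    rounds tau in which arm k was pulled."""
    vals = [s_ell_k(r) for arm, r in zip(pulled, rewards) if arm == k]
    return float(np.mean(vals))

s_2_1 = {0: 0.7, 1: 0.4}.get   # s_{2,1}(r) from Table 1, as a function of r
pulled = [1, 1, 2, 1]          # arms pulled in rounds 1..4 (hypothetical)
rewards = [0, 1, 1, 0]         # rewards observed in those rounds

phi = empirical_pseudo_reward(pulled, rewards, k=1, s_ell_k=s_2_1)
# (0.7 + 0.4 + 0.7) / 3 = 0.6
```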
3.2 Competitive and Non-competitive Arms with respect to Arm k

Using the pseudo-reward estimates defined above, we can classify each arm ℓ ≠ k as competitive or non-competitive with respect to arm k. To this end, we first define the notion of the pseudo-gap.
Definition 3 (Pseudo-Gap). The pseudo-gap ∆_{ℓ,k} of arm ℓ with respect to arm k is defined as

∆_{ℓ,k} ≜ µ_k − ϕ_{ℓ,k},    (8)

i.e., the difference between the expected reward of arm k and the expected pseudo-reward of arm ℓ with respect to arm k.
From the definition of pseudo-reward, it follows that the expected pseudo-reward ϕ_{ℓ,k} is greater than or equal to the expected reward µℓ of arm ℓ. Thus, a positive pseudo-gap ∆_{ℓ,k} > 0 indicates that it is possible to classify arm ℓ as sub-optimal using only the rewards observed from arm k (with high probability as the number of pulls of arm k gets large); thus, arm ℓ need not be explored. Such arms are called non-competitive, as we define below.
Definition 4 (Competitive and Non-Competitive Arms). An arm ℓ is said to be non-competitive if its pseudo-gap with respect to the optimal arm k* is positive, that is, ∆_{ℓ,k*} > 0. Similarly, an arm ℓ is said to be competitive if ∆_{ℓ,k*} < 0. The unique best arm k* has ∆_{k*,k*} = 0 and is not counted in the set of competitive arms.
Since the reward distribution of each arm is unknown, we cannot compute the pseudo-gap of each arm, and thus have to resort to empirical estimates based on observed rewards. In our algorithm, we use a noisy notion of the competitiveness of an arm, defined as follows. Note that since the optimal arm k* is also not known, the empirical competitiveness of an arm ℓ is defined with respect to each of the other arms k ≠ ℓ.
Multi-Armed Bandits with Correlated Arms Under review at SIGMETRICS ’20, June 08–12, 2020, Boston, MA
Definition 5 (Empirically Competitive and Non-Competitive Arms). An arm ℓ is said to be empirically non-competitive with respect to arm k at round t if its empirical pseudo-reward is less than the empirical reward of arm k, that is, µ_k(t) − ϕ_{ℓ,k}(t) > 0. Similarly, an arm ℓ ≠ k is deemed empirically competitive with respect to arm k at round t if µ_k(t) − ϕ_{ℓ,k}(t) ≤ 0.
3.3 The C-Bandit Algorithm
The central idea in our correlated C-BANDIT approach is that after pulling the optimal arm k* a sufficiently large number of times, the non-competitive (and thus sub-optimal) arms can be classified as empirically non-competitive with increasing confidence, and thus need not be explored. As a result, the non-competitive arms are pulled only O(1) times. However, the competitive arms cannot be discerned as sub-optimal using just the rewards observed from the optimal arm, and have to be explored O(log T) times each. Thus, we are able to reduce a K-armed bandit to a C+1-armed bandit problem, where C is the number of competitive arms.
Using this idea, our C-BANDIT approach proceeds as follows. After every round t, we maintain the empirical reward µ_k(t) of each arm k. These empirical estimates are based on the n_k(t) reward samples observed for arm k up to round t. In addition, we maintain the empirical pseudo-reward ϕ_{ℓ,k}(t) of arm ℓ with respect to arm k, for all pairs of arms (ℓ, k). In each round t, the algorithm performs the following steps:
(1) Select the arm k_max = arg max_k n_k(t − 1) that has been pulled the most until round t − 1.
(2) Identify the empirically competitive arms A_t: identify the set A_t of arms that are empirically competitive with respect to arm k_max, i.e., A_t = { k ∈ K : µ_{k_max}(t − 1) ≤ ϕ_{k,k_max}(t − 1) }.
(3) Play the BANDIT algorithm over A_t ∪ {k_max}: for instance, C-UCB pulls the arm k_t = arg max_{k ∈ A_t ∪ {k_max}} I_{k,t−1}, where I_{k,t−1} is the UCB index defined in (4). Similarly, C-TS pulls the arm k_t = arg max_{k ∈ A_t ∪ {k_max}} S_{k,t−1}, where S_{k,t} is the Thompson sample defined in (5).
(4) Update the empirical pseudo-rewards ϕ_{ℓ,k_t}(t) for all ℓ, and the empirical reward µ_{k_t}(t) of arm k_t.
In step 1, we choose the arm that has been pulled the most number of times because we have the maximum number of reward samples from it; thus, it is the most likely to accurately identify the non-competitive arms. This property enables the proposed algorithm to achieve an O(1) regret contribution from non-competitive arms, as we show in Section 4 below.
Note that our C-BANDIT approach allows using any classical multi-armed bandit algorithm in the correlated multi-armed bandit setting. This is important because some algorithms, such as Thompson Sampling and KL-UCB, are known to achieve much better empirical performance than UCB. Extending them to the correlated MAB setting allows us to retain this superior empirical performance
Algorithm 1 C-UCB: Correlated UCB Algorithm
1: Input: pseudo-rewards s_{ℓ,k}(r)
2: Initialize: n_k = 0, I_k = ∞ for all k ∈ {1, 2, ..., K}
3: for each round t do
4:    Find k_max = arg max_k n_k(t − 1), the arm that has been pulled the most until round t − 1
5:    Initialize the empirically competitive set A_t = {1, 2, ..., K} \ {k_max}
6:    for k ≠ k_max do
7:        if µ_{k_max}(t − 1) > ϕ_{k,k_max}(t − 1) then
8:            Remove arm k from the empirically competitive set: A_t = A_t \ {k}
9:        end if
10:   end for
11:   Apply UCB1 over the arms in A_t ∪ {k_max} by pulling arm k_t = arg max_{k ∈ A_t ∪ {k_max}} I_k(t − 1)
12:   Receive reward r_t, and update n_{k_t} = n_{k_t} + 1
13:   Update the empirical reward: µ_{k_t}(t) = ( µ_{k_t}(t − 1)(n_{k_t}(t) − 1) + r_t ) / n_{k_t}(t)
14:   Update the UCB index: I_{k_t}(t) = µ_{k_t}(t) + B √( 2 log t / n_{k_t}(t) )
15:   Update the empirical pseudo-rewards for all k ≠ k_t: ϕ_{k,k_t}(t) = Σ_{τ: k_τ = k_t} s_{k,k_t}(r_τ) / n_{k_t}(t)
16: end for
Algorithm 2 C-TS: Correlated TS Algorithm
1: Steps 1–10 as in C-UCB
2: Apply TS over the arms in A_t ∪ {k_max} by pulling arm k_t = arg max_{k ∈ A_t ∪ {k_max}} S_{k,t}, where S_{k,t} ∼ N( µ_k(t), βB / (n_k(t) + 1) )
3: Receive reward r_t, and update n_{k_t}, µ_{k_t} and the empirical pseudo-rewards ϕ_{k,k_t}(t)
over UCB even in the correlated setting. This benefit is demonstrated in our simulations and experiments described in Section 5 and Section 6.
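To make the loop concrete, the following is a minimal Python sketch of Algorithm 1 (C-UCB). It is a simplified rendering and not the authors' code: ties are broken arbitrarily, each arm is pulled once for initialization, and `pull` and `s` stand in for the unknown environment and the known pseudo-reward table, respectively.

```python
import math

def c_ucb(K, pull, s, T, B=1.0):
    """Sketch of Algorithm 1 (C-UCB). `pull(k)` draws a reward in [0, B];
    `s(ell, k, r)` is the pseudo-reward of arm ell given reward r on arm k."""
    n = [0] * K                              # pull counts n_k
    mu = [0.0] * K                           # empirical means mu_k
    phi_sum = [[0.0] * K for _ in range(K)]  # running sums of s(ell, k, r)
    chosen = []
    for t in range(1, T + 1):
        if t <= K:
            k_t = t - 1                      # pull each arm once to initialize
        else:
            k_max = max(range(K), key=lambda k: n[k])
            # empirically competitive set w.r.t. k_max (Definition 5)
            A = [k for k in range(K)
                 if k == k_max or phi_sum[k][k_max] / n[k_max] >= mu[k_max]]
            # UCB1 index restricted to A ∪ {k_max}
            k_t = max(A, key=lambda k: mu[k] + B * math.sqrt(2 * math.log(t) / n[k]))
        r = pull(k_t)
        n[k_t] += 1
        mu[k_t] += (r - mu[k_t]) / n[k_t]    # incremental empirical mean
        for ell in range(K):
            if ell != k_t:
                phi_sum[ell][k_t] += s(ell, k_t, r)
        chosen.append(k_t)
    return chosen
```

When the pseudo-rewards of a sub-optimal arm with respect to the optimal arm fall below the optimal mean, the set A eventually excludes that arm in every round, so it stops being pulled; with trivially loose pseudo-rewards (e.g., the maximum of the reward range), A always contains all arms and the sketch degenerates to plain UCB1.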
4 REGRET ANALYSIS AND BOUNDS
We now characterize the performance of the C-UCB algorithm by analyzing the expected value of the cumulative regret (2). The expected regret can be expressed as
    E[Reg(T)] = Σ_{k=1}^{K} E[n_k(T)] ∆_k,   (9)
where ∆_k = µ_{k*} − µ_k is the sub-optimality gap of arm k with respect to the optimal arm k*, and n_k(T) is the number of times arm k is pulled in T slots.
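The decomposition (9) is straightforward to evaluate once the pull counts and the means are known; a small illustrative computation (the numbers below are made up):

```python
def expected_regret(pull_counts, means):
    """Regret decomposition (9): sum_k E[n_k(T)] * Delta_k, with Delta_k = mu* - mu_k."""
    mu_star = max(means)
    return sum(n * (mu_star - mu) for n, mu in zip(pull_counts, means))

# 1000 rounds split as 900/80/20 pulls over arms with means 0.9, 0.5, 0.3:
print(expected_regret([900, 80, 20], [0.9, 0.5, 0.3]))   # ≈ 44.0
```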
For the regret analysis, we assume without loss of generality that the rewards are bounded between 0 and 1 for all k ∈ {1, 2, ..., K}. Note that the C-Bandit algorithms do not require this condition, and the regret analysis can be generalized to any bounded rewards.
4.1 Regret Bounds
In order to bound E[Reg(T)] in (9), we can analyze the expected number of times sub-optimal arms are pulled, that is, E[n_k(T)] for all k ≠ k*. Theorem 1 and Theorem 2 below show that E[n_k(T)] scales as O(1) and O(log T) for non-competitive and competitive arms, respectively. Recall that a sub-optimal arm is said to be non-competitive if its pseudo-gap ∆_{k,k*} > 0, and competitive otherwise.
Theorem 1 (Expected Pulls of a Non-competitive Arm). The expected number of times a non-competitive arm with pseudo-gap ∆_{k,k*} is pulled by C-UCB is upper bounded as

    E[n_k(T)] ≤ K t_0 + K² Σ_{t=K t_0}^{T} 3 (t/K)^{−2} + Σ_{t=1}^{T} t^{−3}   (10)
              = O(1),   (11)

and for C-TS is bounded as

    E[n_k(T)] ≤ K t_b + K² Σ_{t=K t_b}^{T} ( 3 (t/K)^{−2} + (t/K)^{1−2β} ) + Σ_{t=1}^{T} t^{−3}   (12)
              = O(1) for β > 1,   (13)

where

    t_0 = inf{ τ ≥ 2 : ∆_min, ∆_{k,k*} ≥ 4 √( K log τ / τ ) },
    t_b = inf{ τ ≥ exp(11β) : ∆_min, ∆_{k,k*} ≥ 6 √( 2Kβ log τ / τ ) }.
Theorem 2 (Expected Pulls of a Competitive Arm). The expected number of times a competitive arm is pulled by the C-UCB algorithm is upper bounded as

    E[n_k(T)] ≤ 8 log(T) / ∆_k² + ( 1 + π²/3 ) + Σ_{t=1}^{T} t exp( −t ∆_min² / (2K) )   (14)
              = O(log T), where ∆_min = min_{k ≠ k*} ∆_k > 0,   (15)

and for C-TS is bounded as

    E[n_k(T)] ≤ 18 log(T ∆_k²) / ∆_k² + exp(11β) + 182 / ∆_k² + Σ_{t=1}^{T} t exp( −t ∆_min² / (2K) )
              = O(log T), where ∆_min = min_{k ≠ k*} ∆_k > 0.   (16)
Substituting the bounds on E[n_k(T)] derived in Theorem 1 and Theorem 2 into (9), we get the following upper bound on the expected regret.
Corollary 1 (Upper Bound on Expected Regret). The expected cumulative regret of the C-UCB and C-TS algorithms is upper bounded as

    E[Reg(T)] ≤ Σ_{k ∈ C} ∆_k U_k^{(c)}(T) + Σ_{k′ ∈ {1,...,K} \ (C ∪ {k*})} ∆_{k′} U_{k′}^{(nc)}(T)   (17)
              = C · O(log T) + O(1),   (18)

where C ⊆ {1, ..., K} \ {k*} is the set of competitive arms with cardinality C, U_k^{(c)}(T) is the upper bound on E[n_k(T)] for competitive arms given in Theorem 2, and U_k^{(nc)}(T) is the corresponding upper bound for non-competitive arms given in Theorem 1.
4.2 Proof Sketch
We now present an outline of our regret analysis of C-UCB and C-TS. A key strength of our analysis is that it can be extended very easily to any C-BANDIT algorithm. The results that are independent of the last step of the algorithm are presented in Appendix B, while the rigorous regret upper bounds for C-UCB and C-TS are presented in Appendices D and F.
There are three key components in proving the results in Theorem 1 and Theorem 2. The first two hold independent of which bandit algorithm (UCB/TS/KL-UCB) is used for selecting the arm from the set of competitive arms, which makes our analysis easy to extend to any C-BANDIT algorithm.
i) The probability of the optimal arm being identified as empirically non-competitive at round t (denoted by Pr(E1(t))) is small. In particular, we show that

    Pr(E1(t)) ≤ t exp( −t ∆_min² / (2K) ).
This ensures that the optimal arm is identified as empirically non-competitive only O(1) times. We show that the number of times a competitive arm is pulled is bounded as

    E[n_k(T)] ≤ Σ_{t=1}^{T} ( Pr(E1(t)) + Pr(E1^c(t), k_t = k, I_{k,t−1} > I_{k*,t−1}) ).   (19)

The first term sums to a constant, while the second term is upper bounded by the number of times UCB pulls the sub-optimal arm k. Due to this, the upper bound on the number of pulls of a competitive arm by C-UCB/C-TS is only an additive constant more than the corresponding upper bound for the UCB/TS algorithms, and hence we have the same pre-log constants in the upper bound on the pulls of competitive arms.
ii) The probability of identifying a non-competitive arm as empirically competitive jointly with the optimal arm being k_max(t) is small. Notice that the first two steps of our algorithm involve identifying k_max(t), the arm that has been pulled the most number of times so far, and eliminating the arms that are empirically non-competitive with respect to k_max(t) for round t. We show that the probability of the joint event that arm k* is k_max(t) and a non-competitive arm k is identified as empirically competitive is small. Formally,

    Pr(k_{t+1} = k, k* = k_max(t)) ≤ t exp( −t ∆_{k,k*}² / (2K) ).   (20)
This occurs because, upon obtaining a large number of samples of arm k*, the expected reward of arm k* (i.e., µ_{k*}) and the expected pseudo-reward of arm k with respect to arm k* (i.e., ϕ_{k,k*}) can be estimated fairly accurately. Since the pseudo-gap of arm k is positive (i.e., µ_{k*} > ϕ_{k,k*}), the probability that arm k is identified as empirically competitive is small.
An implication of (20) is that the expected number of times a non-competitive arm is identified as empirically competitive jointly with the optimal arm having the maximum number of pulls is bounded above by a constant.
    p1(r)   r   s2,1(r)   s3,1(r)
    0.2     0   0.7       2
    0.2     1   0.8       1.2
    0.6     2   2         1

Table 3: Suppose Arm 1 is optimal and its unknown probability distribution is (0.2, 0.2, 0.6); then µ1 = 1.4, while ϕ2,1 = 1.5 and ϕ3,1 = 1.2. Due to this, Arm 2 is competitive while Arm 3 is non-competitive.
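The quantities in the Table 3 caption follow directly from Definition 2; a quick check of the numbers (ϕ3,1 works out to about 1.2):

```python
# Verify the Table 3 example: mu_1 and the expected pseudo-rewards w.r.t. Arm 1.
p1     = [0.2, 0.2, 0.6]      # P(R1 = r) for r = 0, 1, 2
r_vals = [0, 1, 2]
s21    = [0.7, 0.8, 2.0]      # s_{2,1}(r)
s31    = [2.0, 1.2, 1.0]      # s_{3,1}(r)

mu1   = sum(p * r for p, r in zip(p1, r_vals))   # 1.4
phi21 = sum(p * s for p, s in zip(p1, s21))      # 1.5
phi31 = sum(p * s for p, s in zip(p1, s31))      # ≈ 1.2

assert phi21 > mu1   # Arm 2 is competitive
assert phi31 < mu1   # Arm 3 is non-competitive
```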
iii) The probability that a sub-optimal arm is k_max at round t is small. Formally, we show that for C-UCB, we have

    Pr(k = k_max(t)) ≤ 3K (t/K)^{−2}   ∀ t > K t_0, k ≠ k*.   (21)

This component of our analysis is specific to the classical bandit algorithm used in C-BANDIT. We show a similar result for C-TS rigorously in Lemma 10. Intuitively, a result of this kind should hold for any well-performing classical multi-armed bandit algorithm. We reach the result of (21) for C-UCB by showing that

    Pr( k_{t+1} = k, n_k(t) > t/(2K) ) ≤ t^{−3}   ∀ t > t_0, k ≠ k*.   (22)
The probability of selecting a sub-optimal arm k after it has been pulled significantly many times is small because, with more pulls, the exploration component in the UCB index of arm k becomes small; consequently, the index is likely to be smaller than the UCB index of the optimal arm k* (which has a larger empirical mean reward or has been pulled fewer times). Our analysis in Lemma 8 shows how the result in (22) can be translated to obtain (21) (this translation is again not dependent on which bandit algorithm is used in C-BANDIT).
We show that the expected number of pulls of a non-competitive arm k can be bounded as

    E[n_k(T)] ≤ Σ_{t=1}^{T} ( Pr(k_{t+1} = k, k* = k_max(t)) + Pr(k* ≠ k_max(t)) ).   (23)

The first term in (23) is O(1) due to (20), and the second term is O(1) due to (21). Refer to Appendices D and F for the rigorous regret analysis of C-UCB and C-TS.
4.3 Discussion on Regret Bounds
Competitive Arms. Recall that an arm k is said to be competitive if µ_{k*} (i.e., the expected reward of arm k*) is smaller than ϕ_{k,k*} = E[ s_{k,k*}(r) ] ≥ E[ E[R_k | R_{k*}] ] = µ_k. Since the reward distribution of each arm is unknown, the algorithm initially does not know which arms are competitive and which are non-competitive.
Reduction in effective number of arms. Interestingly, our result from Theorem 1 shows that the C-UCB and C-TS algorithms, which operate in a sequential fashion, ensure that non-competitive arms are pulled only O(1) times. Due to this, only the competitive arms are pulled O(log T) times. Moreover, the pre-log terms in the upper bounds of UCB and C-UCB (and correspondingly of TS and C-TS) for these arms are the same. In this sense, our C-BANDIT approach reduces a K-armed bandit problem to a C+1-armed bandit problem: effectively, only C ≤ K − 1 arms are pulled O(log T) times, while the other arms stop being pulled after a finite time.
Depending on the joint probability distribution, different arms can be optimal, competitive or non-competitive. Table 3 shows a case where Arm 1 is optimal and its reward distribution is (0.2, 0.2, 0.6), which leads to µ1 = 1.4 > ϕ3,1 = 1.2 and µ1 = 1.4 < ϕ2,1 = 1.5. Due to this, Arm 2 is competitive while Arm 3 is non-competitive.
Achieving Bounded Regret. If the set of competitive arms C is empty (i.e., the number of competitive arms C = 0), then our algorithms achieve (see (18)) an expected regret of O(1), instead of the typical O(log T) regret scaling of classical multi-armed bandits. One scenario in which this can happen is when the pseudo-rewards s_{k,k*} of all arms with respect to the optimal arm k* match the conditional expectations of the arms. Formally, if s_{k,k*} = E[R_k | R_{k*}] for all k, then E[ s_{k,k*} ] = E[R_k] = µ_k < µ_{k*}. Due to this, all arms are non-competitive and our algorithms achieve only O(1) regret. We now evaluate a lower bound for a special case of our model, where rewards are correlated through a latent random variable X, as described in Section 2.3.
We present a lower bound on the expected regret for the model described in Section 2.3. Intuitively, if an arm ℓ is competitive, it cannot be deemed sub-optimal even by pulling the optimal arm k* infinitely many times. This indicates that exploration is necessary for competitive arms. The proof of this bound closely follows that of the 2-armed classical bandit problem [1]; i.e., we construct a new bandit instance under which a previously sub-optimal arm becomes optimal without affecting the reward distribution of any other arm.
Theorem 3 (Lower Bound for Correlated MAB with a Latent Random Source). For any algorithm that achieves sub-polynomial regret, the expected cumulative regret for the model described in Section 2.3 is lower bounded as

    lim inf_{T→∞} E[Reg(T)] / log(T) ≥ { max_{k ∈ C} ∆_k / D( f_{R_k} ‖ f̃_{R_k} )  if C > 0;   0  if C = 0. }   (24)
Here f_{R_k} is the reward distribution of arm k, which is linked with f_X since R_k = Y_k(X). The term f̃_{R_k} represents the reward distribution of arm k in a new bandit instance in which arm k becomes optimal while the distribution f_{R_{k*}} is unaffected. The divergence term represents "the amount of distortion needed in the reward distribution of arm k to make it better than arm k*", and hence captures the problem difficulty in the lower bound expression.
Bounded regret whenever possible for the special case of Section 2.3. From Corollary 1, we see that whenever C > 0, our proposed algorithms achieve O(log T) regret, matching the lower bound in Theorem 3 order-wise. Also, when C = 0, our algorithms achieve O(1) regret. Thus, our algorithms achieve bounded regret whenever possible, i.e., when C = 0, for the model described in Section 2.3. In the general problem setting, an Ω(log T) lower bound exists whenever it is possible to change the joint distribution of rewards such that the marginal distribution of the optimal arm k* is unaffected and the pseudo-rewards s_{ℓ,k}(r) still remain upper bounds on E[R_ℓ | R_k = r] under the new joint probability distribution. In general, this can happen even if C = 0; we discuss one such scenario in Appendix G.2 and explain the algorithmic challenges that must be overcome to meet the lower bound.
[Figure 6 omitted: plots of cumulative regret vs. total rounds for UCB, C-UCB and C-TS.]
Figure 6: Cumulative regret for UCB, C-UCB and C-TS corresponding to the problem shown in Table 1. For setting (a) in Table 1, Arm 1 is optimal and Arm 2 is non-competitive; in setting (b) of Table 1, Arm 2 is optimal while Arm 1 is competitive.
5 SIMULATIONS
We now present the empirical performance of the proposed algorithms. For all the results presented in this section, we compare the performance of all algorithms on the same reward realizations and plot the cumulative regret averaged over 100 independent trials.
5.1 Simulations with known pseudo-rewards
Consider the example shown in Table 1, with the top row showing the pseudo-rewards, which are known to the player, and the bottom row showing two possible joint probability distributions (a) and (b), which are unknown to the player. We show simulation results for our proposed algorithms C-UCB and C-TS against UCB in Figure 6 for the setting considered in Table 1.
Case (a): Bounded regret. For probability distribution (a), notice that Arm 1 is optimal with µ1 = 0.6 and µ2 = 0.4. Moreover, ϕ2,1 = 0.4 × 0.7 + 0.6 × 0.4 = 0.52. Since ϕ2,1 < µ1, Arm 2 is non-competitive. Hence, in Figure 6(a), we see that our proposed C-UCB and C-TS algorithms achieve bounded regret, whereas UCB incurs logarithmic regret.
Case (b): All competitive arms. For probability distribution (b), Arm 2 is optimal with µ2 = 0.5 and µ1 = 0.4. The expected pseudo-reward of Arm 1 w.r.t. Arm 2 in this case is ϕ1,2 = 0.8 × 0.5 + 0.5 × 0.5 = 0.65. Since ϕ1,2 > µ2, the sub-optimal arm (i.e., Arm 1) is competitive, and hence C-UCB and C-TS also end up exploring Arm 1. Due to this, C-UCB and C-TS achieve regret similar to UCB in Figure 6(b). C-TS has empirically smaller regret than C-UCB, as Thompson Sampling performs better empirically than the UCB algorithm. The design of our C-Bandit approach allows the use of any other bandit algorithm in the last step, e.g., KL-UCB.
5.2 Simulations for the model in Section 2.3
We now show the performance of C-UCB and C-TS against UCB for the model considered in Section 2.3, where the rewards of different arms are correlated through a latent random variable X. We consider a setting where the reward of Arm 1, given a realization x of X, is bounded between 2x − 1 and 2x + 1, i.e.,
[Figure 7 omitted: plots of the bounds on Y1(X) and Y2(X) for X ∈ (0, 6).]
Figure 7: Rewards corresponding to the two arms are correlated through a random variable X lying in (0, 6). The lines represent the lower and upper bounds on the rewards of Arm 1, Y1(X), and Arm 2, Y2(X), given the realization of the random variable X.
[Figure 8 omitted: plots of cumulative regret vs. total rounds for UCB, C-UCB and C-TS.]
Figure 8: Simulation results for the example shown in Figure 7. In (a), X ∼ Beta(1, 1) and in (b), X ∼ Beta(1.5, 5).
2X − 1 ≤ Y1(X) ≤ 2X + 1. Similarly, the conditional reward of Arm 2 satisfies (3 − X)² − 1 ≤ Y2(X) ≤ (3 − X)² + 1. Figure 7 illustrates these upper and lower bounds on Yk(X). We run C-UCB, C-TS and UCB in this setting for two different distributions of X. For the simulations, we set the conditional reward of both arms to be distributed uniformly between the upper and lower bounds; however, this information is not known to the algorithms.
Case (a): X ∼ Beta(1, 1). When X ∼ Beta(1, 1), Arm 1 is optimal while Arm 2 is non-competitive. Due to this, we observe that C-UCB and C-TS achieve bounded regret in Figure 8(a).
Case (b): X ∼ Beta(1.5, 5). When X ∼ Beta(1.5, 5), Arm 2 is optimal while Arm 1 is competitive. Due to this, C-UCB and C-TS do not stop exploring Arm 1 in finite time, and we see cumulative regret similar to UCB in Figure 8(b).
Our next simulation considers a setting where the known upper and lower bounds on Yk(X) match, so the reward Yk corresponding to a realization of X is deterministic, i.e., Yk(X) = gk(X). We show simulation results for the reward functions described in Figure 9 with three different distributions of X. For X ∼ Beta(4, 4), Arm 1 is optimal and Arms 2 and 3 are non-competitive, leading to bounded regret for C-UCB and C-TS in Figure 3(a). In setting
[Figure 9 omitted: plots of g1(x), g2(x) and g3(x) for x ∈ [0, 1].]
Figure 9: Reward functions used for the simulation results presented in Figure 3.
(b), we consider X ∼ Beta(2, 5), in which Arm 1 is optimal, Arm 2 is competitive and Arm 3 is non-competitive. Due to this, our proposed C-UCB and C-TS algorithms stop pulling Arm 3 after some time and hence achieve significantly reduced regret relative to UCB in Figure 3(b). For the third scenario (c), we set X ∼ Beta(1, 5), which makes Arm 3 optimal while Arms 1 and 2 are competitive. Hence, our algorithms explore both sub-optimal arms and have regret comparable to that of UCB in Figure 3(c).
6 EXPERIMENTS
We now show the performance of our proposed algorithms in real-world settings. Through the use of the MovieLens and Goodreads datasets, we demonstrate how the correlated MAB framework can be used in practical recommendation-system applications. In such systems, it is possible to use prior available data (from a certain population) to learn the correlation structure in the form of pseudo-rewards. When designing a campaign to maximize user engagement in a new, unknown demographic, the learned correlation information in the form of pseudo-rewards can significantly reduce the regret, as we show in the results described below.
6.1 Experiments on the MovieLens dataset
The MovieLens dataset [18] contains a total of 1M ratings for 3883 movies rated by 6040 users. Each movie is rated on a scale of 1–5 by the users. Moreover, each movie is associated with one (and in some cases, multiple) genres. For our experiments, of the possibly several genres associated with each movie, one is picked uniformly at random. To perform our experiments, we split the data into two parts, with the first half containing the ratings of the users who provided the most ratings. This half is used to learn the pseudo-reward entries; the other half is the test set, which is used to evaluate the performance of the proposed algorithms. Such a split ensures that the rating distribution differs between the training and test data.
Recommending the Best Genre. In our first experiment, the goal is to provide the best genre recommendations to a population with an unknown demographic. We use the training dataset to learn the pseudo-reward entries. The pseudo-reward entry s_{ℓ,k}(r) is evaluated by taking the empirical average of the ratings of genre ℓ given by the users who rated genre k as r. To capture the fact that it might not be possible in practice to fill all pseudo-reward entries, we randomly remove a p-fraction of the pseudo-reward entries.
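This learning step can be sketched as follows. It is a hedged illustration and not the authors' code: the `(user, genre, rating)` tuple format is our assumption, while the fallback to the maximum rating for missing entries follows the text (missing entries are replaced by 5, a trivially valid upper bound):

```python
from collections import defaultdict

def learn_pseudo_rewards(ratings, max_rating=5):
    """ratings: iterable of (user, genre, rating) tuples from the training split.
    Returns s[(ell, k, r)] = mean rating of genre ell among users who rated
    genre k as r; missing entries default to max_rating (a valid upper bound)."""
    by_user = defaultdict(list)
    for user, genre, rating in ratings:
        by_user[user].append((genre, rating))
    sums, counts = defaultdict(float), defaultdict(int)
    for entries in by_user.values():
        for genre_k, r_k in entries:          # the "conditioning" rating of genre k
            for genre_ell, r_ell in entries:  # the rating of genre ell by the same user
                if genre_ell != genre_k:
                    sums[(genre_ell, genre_k, r_k)] += r_ell
                    counts[(genre_ell, genre_k, r_k)] += 1
    return defaultdict(lambda: float(max_rating),
                       {key: sums[key] / counts[key] for key in sums})
```

In this sketch, a user who rated genre k multiple times contributes one pair per rating; the paper's exact aggregation may differ.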
[Figure 10 omitted: plots of cumulative regret vs. number of rounds for UCB, C-UCB and C-TS.]
Figure 10: Cumulative regret of UCB, C-UCB and C-TS for the application of recommending the best genre in the MovieLens dataset, where a p fraction of the pseudo-reward entries are replaced with the maximum reward, i.e., 5. In (a), p = 0.1; in (b), p = 0.25; and in (c), p = 0.5.
[Figure 11 omitted: plots of cumulative regret vs. number of rounds for UCB, C-UCB and C-TS.]
Figure 11: Cumulative regret of UCB, C-UCB and C-TS for providing the best movie recommendations in the MovieLens dataset. Each pseudo-reward entry is increased by 0.1 in (a) and by 0.4 in (b).
The removed pseudo-reward entries are replaced by the maximum possible rating, i.e., 5 (as that gives a natural upper bound on the conditional mean reward). Using these pseudo-rewards, we evaluate our proposed algorithms on the test data. Upon recommending a particular genre (arm), the rating (reward) is obtained by sampling one of the ratings for the chosen arm in the test data. Our experimental results for this setting are shown in Figure 10, with p = 0.1, 0.25 and 0.5 (i.e., the fraction of pseudo-reward entries that are removed). We see that the proposed C-UCB and C-TS algorithms significantly outperform UCB in all three settings. This occurs as some of the 18 arms stop being pulled after some time. This shows that even when only a subset of the correlations is known, it is possible to exploit them to improve the performance of classical bandit algorithms.
Recommending the Best Movie. We now consider the goal of providing the best movie recommendations to the population. To do so, we consider the 50 most rated movies in the dataset, containing 109,804 user-ratings given by 6,025 users. In the testing phase, the goal is to recommend one of these 50 movies to each user. As in the previous experiment, we learn the pseudo-reward entries from the training data. Instead of using the learned pseudo-reward
[Figure 12 omitted: plots of cumulative regret vs. number of rounds for UCB, C-UCB and C-TS.]
Figure 12: Cumulative regret of UCB, C-UCB and C-TS for providing the best poetry book recommendations in the Goodreads dataset. Every pseudo-reward entry is increased by q, and a p fraction of the pseudo-reward entries are removed, with (a) p = 0.1, q = 0.1 and (b) p = 0.3, q = 0.1.
directly, we add a safety buffer to each pseudo-reward entry; i.e., we set the pseudo-reward to the empirical conditional mean plus the safety buffer. Adding a buffer will be needed in practice, as the conditional expectations learned from the training data are likely to have some noise, and a safety buffer helps ensure that the pseudo-rewards constitute an upper bound on the conditional expectations. Our experimental results in Figure 11 show the performance of C-UCB and C-TS relative to UCB for this setting, with the safety buffer set to 0.1 in Figure 11(a) and to 0.4 in Figure 11(b). In both cases, even after the addition of safety buffers, our proposed C-UCB and C-TS algorithms outperform the UCB algorithm.
6.2 Experiments on the Goodreads dataset
The Goodreads dataset [19] contains ratings for 1,561,465 books by a total of 808,749 users. Each rating is on a scale of 1–5. For our experiments, we only consider the poetry section and focus on the goal of providing the best poetry recommendations to a whole population whose demographics are unknown. The poetry dataset has 36,182 different poems rated by 267,821 different users. We pre-process the Goodreads dataset in the same manner as the MovieLens dataset, by splitting it into two halves, train and test. The train dataset contains the ratings of the users with the most ratings.
Recommending the best poetry book. We consider the 25 most rated books in the dataset and use these as the set of arms to recommend in the testing phase. These 25 poems have 349,523 user-ratings given by 171,433 users. As with the MovieLens dataset, the pseudo-reward entries are learned on the training data. In practical situations it might not be possible to obtain all pseudo-reward entries. Therefore, we randomly select a p fraction of the pseudo-reward entries and replace them with the maximum possible reward (i.e., 5). To each of the remaining pseudo-reward entries we add a safety buffer of q. Our results in Figure 12 show the performance of C-UCB and C-TS relative to UCB in two scenarios. In scenario (a), 10% of the pseudo-reward entries are replaced by 5 and the remaining entries are padded with a safety buffer of 0.1. In case (b), 30% of the entries are replaced by 5 and the safety buffer is 0.1. In both cases, our
proposed C-UCB and C-TS algorithms are able to outperform UCBsignificantly.
7 CONCLUSION
This work presents a new correlated multi-armed bandit problem in which the rewards obtained from different arms are correlated. We capture this correlation through the knowledge of pseudo-rewards. These pseudo-rewards, which represent upper bounds on the conditional mean rewards, could be known in practice from domain knowledge or learned from prior data. Using the knowledge of these pseudo-rewards, we propose the C-Bandit algorithm, which fundamentally generalizes any classical bandit algorithm to the correlated multi-armed bandit setting. A key strength of our work is that it allows the pseudo-rewards to be loose (in case there is not much prior information), and even then our C-Bandit algorithms adapt and provide performance at least as good as that of classical bandit algorithms.
We provide a unified method to analyze the regret of C-Bandit algorithms. In particular, the analysis shows that C-UCB and C-TS pull non-competitive arms only O(1) times; i.e., they stop pulling certain arms after a finite time t. Due to this, C-UCB and C-TS pull only C ≤ K − 1 of the K − 1 sub-optimal arms O(log T) times, as opposed to UCB/TS, which pull all K − 1 sub-optimal arms O(log T) times. In this sense, our C-Bandit algorithms reduce a K-armed bandit to a C+1-armed bandit problem. We present several cases where C = 0, for which C-UCB and C-TS achieve bounded regret. For the special case where rewards are correlated through a latent random variable X, we show that bounded regret is possible only when C = 0; if C > 0, then O(log T) regret cannot be avoided. Thus, our C-UCB and C-TS algorithms achieve bounded regret whenever possible. Simulation results validate the theoretical findings, and we perform experiments on the MovieLens and Goodreads datasets to demonstrate the applicability of our framework in the context of recommendation systems. The experiments on real-world datasets show that our C-UCB and C-TS algorithms significantly outperform the UCB algorithm.
There are several interesting open problems to be studied. We plan to study the problem of best-arm identification in the correlated multi-armed bandit setting, i.e., identifying the best arm with confidence 1 − δ in as few samples as possible. Since the rewards are correlated with each other, we believe the sample complexity can be significantly improved relative to state-of-the-art algorithms, such as LIL-UCB [28, 29], which are designed for classical multi-armed bandits. Another open direction is to improve the C-Bandit algorithm so that it achieves bounded regret whenever possible in the general framework studied in this paper.
REFERENCES
[1] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[2] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[3] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1–39.26, 2012.
[4] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376, 2011.
[5] Sofía S. Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science, 30(2):199, 2015.
[6] Cem Tekin and Eralp Turgay. Multi-objective contextual multi-armed bandit problem with a dominant objective. arXiv preprint arXiv:1708.05655, 2017.
[7] J. Niño-Mora. Stochastic scheduling. In Encyclopedia of Optimization, pages 3818–3824. Springer, New York, 2nd edition, 2009.
[8] Subhashini Krishnasamy, Rajat Sen, Ramesh Johari, and Sanjay Shakkottai. Regret of queueing bandits. CoRR, abs/1604.06377, 2016.
[9] Gauri Joshi. Efficient Redundancy Techniques to Reduce Delay in Cloud Systems. PhD thesis, Massachusetts Institute of Technology, June 2016.
[10] John White. Bandit Algorithms for Website Optimization. O'Reilly Media, Inc., 2012.
[11] Deepak Agarwal, Bee-Chung Chen, and Pradheep Elango. Explore/exploit schemes for web content optimization. In Proceedings of the Ninth IEEE International Conference on Data Mining (ICDM), pages 1–10. IEEE, 2009.
[12] Li Zhou. A survey on contextual multi-armed bandits. CoRR, abs/1508.03326, 2015.
[13] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the International Conference on Machine Learning (ICML), pages 1638–1646, 2014.
[14] Richard Combes, Stefan Magureanu, and Alexandre Proutière. Minimal exploration in structured stochastic bandits. In NIPS, 2017.
[15] Tor Lattimore and Rémi Munos. Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems, pages 550–558, 2014.
[16] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
[17] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the Conference on Learning Theory (COLT), 2008.
[18] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4), Article 19, 2015.
[19] Mengting Wan and Julian McAuley. Item recommendation on monotonic behavior chains. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 86–94. ACM, 2018.
[20] Samarth Gupta, Shreyas Chaudhari, Gauri Joshi, and Osman Yağan. Exploiting correlation in finite-armed structured bandits. arXiv preprint arXiv:1810.08164, 2018.
[21] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.
[22] Stefan Magureanu, Richard Combes, and Alexandre Proutière. Lipschitz bandits: Regret lower bounds and optimal algorithms. arXiv preprint arXiv:1405.4758, 2014.
[23] Onur Atan, Cem Tekin, and Mihaela van der Schaar. Global multi-armed bandits with Hölder continuity. In AISTATS, 2015.
[24] Zhiyang Wang, Ruida Zhou, and Cong Shen. Regional multi-armed bandits. In AISTATS, 2018.
[25] Sandeep Pandey, Deepayan Chakrabarti, and Deepak Agarwal. Multi-armed bandit problems with dependent arms. In Proceedings of the International Conference on Machine Learning, pages 721–728, 2007.
[26] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002.
[27] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.
[28] K. Jamieson and R. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Proceedings of the Annual Conference on Information Sciences and Systems (CISS), pages 1–6, March 2014.
[29] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
[30] Tor Lattimore. Instance dependent lower bounds. http://banditalgs.com/2016/09/30/instance-dependent-lower-bounds/.
[31] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008.
[32] J. Bretagnolle and C. Huber. Estimation des densités : risque minimax. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 47:119–137, 1979.
[33] Milton Abramowitz and Irene A. Stegun. Handbook of Mathematical Functions, pages 299–300, 1964.
A STANDARD RESULTS FROM PREVIOUS WORKS

Fact 1 (Hoeffding's inequality). Let Z_1, Z_2, ..., Z_n be i.i.d. random variables bounded in [a, b], i.e., a ≤ Z_i ≤ b. Then, for any δ > 0, we have
\[ \Pr\left( \frac{\sum_{i=1}^{n} Z_i}{n} - \mathbb{E}[Z_i] \ge \delta \right) \le \exp\left( \frac{-2n\delta^2}{(b-a)^2} \right). \]

Lemma 1 (Standard result used in bandit literature). If \(\hat{\mu}_{k,n_k(t)}\) denotes the empirical mean of arm k after it has been pulled n_k(t) times by any algorithm, and \(\mu_k\) denotes the true mean reward of arm k, then we have
\[ \Pr\left( \hat{\mu}_{k,n_k(t)} - \mu_k \ge \epsilon,\; \tau_2 \ge n_k(t) \ge \tau_1 \right) \le \sum_{s=\tau_1}^{\tau_2} \exp(-2s\epsilon^2). \]
Proof. Let Z_1, Z_2, ..., Z_t be the reward samples of arm k drawn independently. If the algorithm chooses to play arm k for the m-th time, it observes reward Z_m. The probability of the event \(\{\hat{\mu}_{k,n_k(t)} - \mu_k \ge \epsilon,\; \tau_2 \ge n_k(t) \ge \tau_1\}\) can then be upper bounded as follows:
\[ \Pr\left( \hat{\mu}_{k,n_k(t)} - \mu_k \ge \epsilon,\; \tau_2 \ge n_k(t) \ge \tau_1 \right) = \Pr\left( \frac{\sum_{i=1}^{n_k(t)} Z_i}{n_k(t)} - \mu_k \ge \epsilon,\; \tau_2 \ge n_k(t) \ge \tau_1 \right) \quad (25) \]
\[ \le \Pr\left( \bigcup_{m=\tau_1}^{\tau_2} \left\{ \frac{\sum_{i=1}^{m} Z_i}{m} - \mu_k \ge \epsilon \right\},\; \tau_2 \ge n_k(t) \ge \tau_1 \right) \quad (26) \]
\[ \le \Pr\left( \bigcup_{m=\tau_1}^{\tau_2} \left\{ \frac{\sum_{i=1}^{m} Z_i}{m} - \mu_k \ge \epsilon \right\} \right) \quad (27) \]
\[ \le \sum_{s=\tau_1}^{\tau_2} \exp(-2s\epsilon^2). \quad (28) \]
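Fact 1 can be sanity-checked numerically. The sketch below (an illustrative check with Bernoulli samples, not part of the proof) estimates the deviation probability by Monte Carlo and compares it against the Hoeffding bound exp(−2nδ²); for [0, 1]-valued samples, (b − a)² = 1.

```python
import math
import random

def hoeffding_check(n=50, delta=0.2, trials=20000, p=0.5, seed=0):
    """Estimate Pr(mean(Z) - E[Z] >= delta) for Bernoulli(p) samples in [0, 1]
    and compare it against the Hoeffding bound exp(-2 * n * delta^2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        if mean - p >= delta:
            hits += 1
    empirical = hits / trials
    bound = math.exp(-2 * n * delta * delta)
    return empirical, bound
```

With n = 50 and δ = 0.2, the Hoeffding bound is exp(−4) ≈ 0.018, and the Monte Carlo estimate lands well below it, as the inequality guarantees.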
Lemma 2 (From proof of Theorem 1 in [26]). Let I_k(t) denote the UCB index of arm k at round t, and \(\mu_k = \mathbb{E}[g_k(X)]\) denote the mean reward of that arm. Then, we have
\[ \Pr(\mu_k > I_k(t)) \le t^{-3}. \]
Observe that this bound does not depend on the number of times n_k(t) that arm k has been pulled. The UCB index is defined in equation (6) of the main paper.
Proof. This proof follows directly from [26]. We present it here for completeness, as we use it frequently in the paper.
\[ \Pr(\mu_k > I_k(t)) = \Pr\left( \mu_k > \hat{\mu}_{k,n_k(t)} + \sqrt{\frac{2\log t}{n_k(t)}} \right) \quad (29) \]
\[ \le \sum_{m=1}^{t} \Pr\left( \mu_k > \hat{\mu}_{k,m} + \sqrt{\frac{2\log t}{m}} \right) \quad (30) \]
\[ = \sum_{m=1}^{t} \Pr\left( \hat{\mu}_{k,m} - \mu_k < -\sqrt{\frac{2\log t}{m}} \right) \quad (31) \]
\[ \le \sum_{m=1}^{t} \exp\left( -2m\,\frac{2\log t}{m} \right) \quad (32) \]
\[ = \sum_{m=1}^{t} t^{-4} \quad (33) \]
\[ = t^{-3}. \quad (34) \]
Here, (30) follows from the union bound and is a standard trick (Lemma 1) to deal with the random variable n_k(t); we use this trick repeatedly in the proofs. We have (32) from Hoeffding's inequality.
Lemma 3. Let \(\mathbb{E}[n_{I_k > I_{k^*}}(T)]\) be the expected number of rounds in which \(I_k(t) > I_{k^*}(t)\) over T rounds. Then, we have
\[ \mathbb{E}\left[ n_{I_k > I_{k^*}}(T) \right] = \sum_{t=1}^{T} \Pr\left( I_k(t) > I_{k^*}(t) \right) \le \frac{8\log T}{\Delta_k^2} + \left( 1 + \frac{\pi^2}{3} \right). \]
The proof follows the analysis in Theorem 1 of [26]. The probability \(\Pr(I_k(t) > I_{k^*}(t))\) is analyzed by conditioning on the event that arm k has been pulled at least \(\frac{8\log T}{\Delta_k^2}\) times; conditioned on this event, \(\Pr(I_k(t) > I_{k^*}(t) \mid n_k(t)) \le t^{-2}\).
Lemma 4 (Theorem 2 of [1]). Consider a two-armed bandit problem with reward distributions \(\Theta = (f_{R_1}(r), f_{R_2}(r))\), where \(f_{R_1}(r)\) is the reward distribution of the optimal arm and \(f_{R_2}(r)\) that of the sub-optimal arm, with \(\mathbb{E}[R_1] > \mathbb{E}[R_2]\); i.e., arm 1 is optimal. If it is possible to create an alternate problem with distributions \(\Theta' = (f_{R_1}(r), \tilde{f}_{R_2}(r))\) such that \(\mathbb{E}_{\tilde{f}_{R_2}}[R_2] > \mathbb{E}[R_1]\) and \(0 < D(f_{R_2}(r) \,\|\, \tilde{f}_{R_2}(r)) < \infty\) (equivalent to assumption 1.6 in [1]), then for any policy that achieves sub-polynomial regret, we have
\[ \liminf_{T \to \infty} \frac{\mathbb{E}[n_2(T)]}{\log T} \ge \frac{1}{D(f_{R_2}(r) \,\|\, \tilde{f}_{R_2}(r))}. \]
Proof. This proof is derived from the analysis in [30]; we show it here for completeness. A bandit instance v is defined by the reward distributions of arm 1 and arm 2. Since policy π achieves sub-polynomial regret, for any instance v, \(\mathbb{E}_{v,\pi}[Reg(T)] = O(T^p)\) as \(T \to \infty\), for all \(p > 0\).

Consider the bandit instances \(\Theta = (f_{R_1}(r), f_{R_2}(r))\) and \(\Theta' = (f_{R_1}(r), \tilde{f}_{R_2}(r))\), where \(\mathbb{E}[R_2] < \mathbb{E}[R_1] < \mathbb{E}_{\tilde{f}_{R_2}}[R_2]\). The bandit instance \(\Theta'\) is constructed by changing the reward distribution of arm 2 in the original instance, in such a way that arm 2 becomes optimal in \(\Theta'\) without changing the reward distribution of arm 1.

From the divergence decomposition lemma (derived in [30]), it follows that
\[ D(P_{\Theta,\pi} \,\|\, P_{\Theta',\pi}) = \mathbb{E}_{\Theta,\pi}[n_2(T)]\, D(f_{R_2}(r) \,\|\, \tilde{f}_{R_2}(r)). \]
The high-probability Pinsker's inequality (Lemma 2.6 in [31], originally from [32]) gives that for any event A,
\[ P_{\Theta,\pi}(A) + P_{\Theta',\pi}(A^c) \ge \frac{1}{2} \exp\left( -D(P_{\Theta,\pi} \,\|\, P_{\Theta',\pi}) \right), \]
or equivalently,
\[ D(P_{\Theta,\pi} \,\|\, P_{\Theta',\pi}) \ge \log \frac{1}{2\left( P_{\Theta,\pi}(A) + P_{\Theta',\pi}(A^c) \right)}. \]
If arm 2 is sub-optimal in a 2-armed bandit problem, then \(\mathbb{E}[Reg(T)] = \Delta_2 \mathbb{E}[n_2(T)]\). The expected regret in Θ satisfies
\[ \mathbb{E}_{\Theta,\pi}[Reg(T)] \ge \frac{T\Delta_2}{2}\, P_{\Theta,\pi}\left( n_2(T) \ge \frac{T}{2} \right). \]
Similarly, the regret in bandit instance \(\Theta'\) satisfies
\[ \mathbb{E}_{\Theta',\pi}[Reg(T)] \ge \frac{T\delta}{2}\, P_{\Theta',\pi}\left( n_2(T) < \frac{T}{2} \right), \]
since the sub-optimality gap of arm 1 in \(\Theta'\) is δ. Define \(\kappa(\Delta_2, \delta) = \min(\Delta_2, \delta)/2\). Then we have
\[ P_{\Theta,\pi}\left( n_2(T) \ge \frac{T}{2} \right) + P_{\Theta',\pi}\left( n_2(T) < \frac{T}{2} \right) \le \frac{\mathbb{E}_{\Theta,\pi}[Reg(T)] + \mathbb{E}_{\Theta',\pi}[Reg(T)]}{\kappa(\Delta_2,\delta)\, T}. \]
On applying the high-probability Pinsker's inequality and the divergence decomposition lemma stated earlier, we get
\[ D(f_{R_2}(r) \,\|\, \tilde{f}_{R_2}(r))\, \mathbb{E}_{\Theta,\pi}[n_2(T)] \ge \log\left( \frac{\kappa(\Delta_2,\delta)\, T}{2\left( \mathbb{E}_{\Theta,\pi}[Reg(T)] + \mathbb{E}_{\Theta',\pi}[Reg(T)] \right)} \right) \quad (35) \]
\[ = \log\left( \frac{\kappa(\Delta_2,\delta)}{2} \right) + \log(T) - \log\left( \mathbb{E}_{\Theta,\pi}[Reg(T)] + \mathbb{E}_{\Theta',\pi}[Reg(T)] \right). \quad (36) \]
Since policy π achieves sub-polynomial regret for any bandit instance, \(\mathbb{E}_{\Theta,\pi}[Reg(T)] + \mathbb{E}_{\Theta',\pi}[Reg(T)] \le \gamma T^p\) for all T and any \(p > 0\). Hence,
\[ \liminf_{T\to\infty} \frac{D(f_{R_2}(r) \,\|\, \tilde{f}_{R_2}(r))\, \mathbb{E}_{\Theta,\pi}[n_2(T)]}{\log T} \ge 1 - \limsup_{T\to\infty} \frac{\log\left( \mathbb{E}_{\Theta,\pi}[Reg(T)] + \mathbb{E}_{\Theta',\pi}[Reg(T)] \right)}{\log T} + \liminf_{T\to\infty} \frac{\log\left( \kappa(\Delta_2,\delta)/2 \right)}{\log T} \quad (37) \]
\[ = 1. \quad (38) \]
Hence,
\[ \liminf_{T\to\infty} \frac{\mathbb{E}_{\Theta,\pi}[n_2(T)]}{\log T} \ge \frac{1}{D(f_{R_2}(r) \,\|\, \tilde{f}_{R_2}(r))}. \]
B RESULTS FOR ANY C-BANDIT ALGORITHM
Lemma 5. Define \(E_1(t)\) to be the event that arm \(k^*\) is deemed empirically non-competitive in round t + 1. Then,
\[ \Pr(E_1(t)) \le t \exp\left( \frac{-t\Delta_{\min}^2}{2K} \right), \]
where \(\Delta_{\min} = \min_{k \ne k^*} \Delta_k\) is the gap between the means of the best and the second-best arms.
Proof. We analyze the probability that arm \(k^*\) is deemed empirically non-competitive by conditioning on the identity of the arm that has been pulled the most until round t. This gives us:
\[ \Pr(E_1(t)) = \Pr\left( E_1(t),\; n_{k^*}(t) \ne \max_k n_k(t) \right) \quad (39) \]
\[ = \sum_{k \ne k^*} \Pr\left( E_1(t),\; n_k(t) = \max_{k'} n_{k'}(t) \right) \quad (40) \]
\[ \le \max_k \Pr\left( E_1(t),\; n_k(t) = \max_{k'} n_{k'}(t) \right) \quad (41) \]
\[ = \max_k \Pr\left( \hat{\mu}_k(t) > \hat{\phi}_{k^*,k}(t),\; n_k(t) = \max_{k'} n_{k'}(t) \right) \quad (42) \]
\[ \le \max_k \Pr\left( \hat{\mu}_k(t) > \hat{\phi}_{k^*,k}(t),\; n_k(t) \ge \frac{t}{K} \right) \quad (43) \]
\[ = \max_k \Pr\left( \frac{\sum_{\tau=1}^{t} \mathbb{1}_{k_\tau=k}\, r_\tau}{n_k(t)} > \frac{\sum_{\tau=1}^{t} \mathbb{1}_{k_\tau=k}\, s_{k^*,k}(r_\tau)}{n_k(t)},\; n_k(t) \ge \frac{t}{K} \right) \quad (44) \]
\[ = \max_k \Pr\left( \frac{\sum_{\tau=1}^{t} \mathbb{1}_{k_\tau=k}\left( r_\tau - s_{k^*,k}(r_\tau) \right)}{n_k(t)} > 0,\; n_k(t) \ge \frac{t}{K} \right) \quad (45) \]
\[ = \max_k \Pr\left( \frac{\sum_{\tau=1}^{t} \mathbb{1}_{k_\tau=k}\left( r_\tau - s_{k^*,k}(r_\tau) \right)}{n_k(t)} - (\mu_k - \phi_{k^*,k}) > \phi_{k^*,k} - \mu_k,\; n_k(t) \ge \frac{t}{K} \right) \quad (46) \]
\[ \le \max_k \Pr\left( \frac{\sum_{\tau=1}^{t} \mathbb{1}_{k_\tau=k}\left( r_\tau - s_{k^*,k}(r_\tau) \right)}{n_k(t)} - (\mu_k - \phi_{k^*,k}) > \Delta_k,\; n_k(t) \ge \frac{t}{K} \right) \quad (47) \]
\[ \le \max_k\; t \exp\left( \frac{-t\Delta_k^2}{2K} \right) \quad (48) \]
\[ = t \exp\left( \frac{-t\Delta_{\min}^2}{2K} \right). \quad (49) \]
Here, (42) follows from the fact that for arm \(k^*\) to be deemed empirically non-competitive, the empirical mean of arm k must exceed the empirical pseudo-reward of arm \(k^*\) with respect to arm k. Inequality (43) follows since \(n_k(t) \ge t/K\) is a necessary condition for \(n_k(t) = \max_{k'} n_{k'}(t)\). We have (47) since the expected pseudo-reward \(\phi_{k^*,k} = \mathbb{E}[s_{k^*,k}(R_k)]\) is at least \(\mu_{k^*}\), so \(\phi_{k^*,k} - \mu_k \ge \Delta_k\). We have (48) from Hoeffding's inequality, noting that the rewards \(\{r_\tau - s_{k^*,k}(r_\tau) : \tau = 1, \ldots, t,\; k_\tau = k\}\) form a collection of i.i.d. random variables, each bounded in [−1, 1], with mean \((\mu_k - \phi_{k^*,k})\). The factor t before the exponent in (48) arises because the random variable \(n_k(t)\) can take values from t/K to t (Lemma 1).
Lemma 6. If for a sub-optimal arm \(k \ne k^*\), \(\Delta_{k,k^*} > 0\), then
\[ \Pr\left( k_{t+1} = k,\; n_{k^*}(t) = \max_{k'} n_{k'}(t) \right) \le t \exp\left( \frac{-t\Delta_{k,k^*}^2}{2K} \right). \]
Moreover, if \(\Delta_{k,k^*} \ge 2\sqrt{2K \log t_0 / t_0}\) for some constant \(t_0 > 0\), then
\[ \Pr\left( k_{t+1} = k,\; n_{k^*}(t) = \max_{k'} n_{k'}(t) \right) \le t^{-3} \quad \forall t > t_0. \]
Proof. We bound this probability as follows:
\[ \Pr\left( k_{t+1} = k,\; n_{k^*}(t) = \max_{k'} n_{k'}(t) \right) = \Pr\left( \hat{\mu}_{k^*}(t) < \hat{\phi}_{k,k^*}(t),\; I_k(t) = \max_{k'} I_{k'}(t),\; n_{k^*}(t) = \max_{k'} n_{k'}(t) \right) \quad (50) \]
\[ \le \Pr\left( \hat{\mu}_{k^*}(t) < \hat{\phi}_{k,k^*}(t),\; n_{k^*}(t) = \max_{k'} n_{k'}(t) \right) \quad (51) \]
\[ \le \Pr\left( \hat{\mu}_{k^*}(t) < \hat{\phi}_{k,k^*}(t),\; n_{k^*}(t) \ge \frac{t}{K} \right) \quad (52) \]
\[ \le \Pr\left( \frac{\sum_{\tau=1}^{t} \mathbb{1}_{k_\tau=k^*}\, r_\tau}{n_{k^*}(t)} < \frac{\sum_{\tau=1}^{t} \mathbb{1}_{k_\tau=k^*}\, s_{k,k^*}(r_\tau)}{n_{k^*}(t)},\; n_{k^*}(t) \ge \frac{t}{K} \right) \quad (53) \]
\[ = \Pr\left( \frac{\sum_{\tau=1}^{t} \mathbb{1}_{k_\tau=k^*}\left( r_\tau - s_{k,k^*}(r_\tau) \right)}{n_{k^*}(t)} - (\mu_{k^*} - \phi_{k,k^*}) < -\Delta_{k,k^*},\; n_{k^*}(t) \ge \frac{t}{K} \right) \quad (54) \]
\[ \le t \exp\left( \frac{-t\Delta_{k,k^*}^2}{2K} \right) \quad (55) \]
\[ \le t^{-3} \quad \forall t > t_0. \quad (56) \]
Here, (55) follows from Hoeffding's inequality, noting that the rewards \(\{r_\tau - s_{k,k^*}(r_\tau) : \tau = 1, \ldots, t,\; k_\tau = k^*\}\) form a collection of i.i.d. random variables, each bounded in [−1, 1], with mean \((\mu_{k^*} - \phi_{k,k^*})\). The factor t before the exponent in (55) arises because the random variable \(n_{k^*}(t)\) can take values from t/K to t (Lemma 1). Step (56) follows from the condition \(\Delta_{k,k^*} \ge 2\sqrt{2K \log t_0 / t_0}\) for some constant \(t_0 > 0\).
C ALGORITHM SPECIFIC RESULTS FOR C-UCB

Lemma 7. If \(\Delta_{\min} \ge 4\sqrt{K \log t_0 / t_0}\) for some constant \(t_0 > 0\), then
\[ \Pr\left( k_{t+1} = k,\; n_k(t) \ge s \right) \le 3t^{-3} \quad \text{for } s > \frac{t}{2K},\; \forall t > t_0. \]
Proof. Noting that \(k_{t+1} = k\) requires arm k to have the highest index among the set \(\mathcal{A}\) of arms that are not deemed empirically non-competitive, we have:
\[ \Pr(k_{t+1} = k,\; n_k(t) \ge s) = \Pr\left( I_k(t) = \max_{k' \in \mathcal{A}} I_{k'}(t),\; n_k(t) \ge s \right) \quad (57) \]
\[ \le \Pr\left( E_1(t) \cup \left( E_1^c(t),\; I_k(t) > I_{k^*}(t) \right),\; n_k(t) \ge s \right) \quad (58) \]
\[ \le \Pr(E_1(t),\; n_k(t) \ge s) + \Pr\left( E_1^c(t),\; I_k(t) > I_{k^*}(t),\; n_k(t) \ge s \right) \quad (59) \]
\[ \le t \exp\left( \frac{-t\Delta_{\min}^2}{2K} \right) + \Pr\left( I_k(t) > I_{k^*}(t),\; n_k(t) \ge s \right). \quad (60) \]
Here, \(E_1(t)\) is the event described in Lemma 5. If arm \(k^*\) is not deemed empirically non-competitive at round t, then arm k can only be pulled in round t + 1 if \(I_k(t) > I_{k^*}(t)\), which gives (58). Inequalities (59) and (60) follow from the union bound and Lemma 5, respectively.
We now bound the second term in (60):
\[ \Pr(I_k(t) > I_{k^*}(t),\; n_k(t) \ge s) = \Pr\left( I_k(t) > I_{k^*}(t),\; n_k(t) \ge s,\; \mu_{k^*} \le I_{k^*}(t) \right) + \Pr\left( I_k(t) > I_{k^*}(t),\; n_k(t) \ge s \,\big|\, \mu_{k^*} > I_{k^*}(t) \right) \Pr\left( \mu_{k^*} > I_{k^*}(t) \right) \quad (61) \]
\[ \le \Pr\left( I_k(t) > I_{k^*}(t),\; n_k(t) \ge s,\; \mu_{k^*} \le I_{k^*}(t) \right) + \Pr\left( \mu_{k^*} > I_{k^*}(t) \right) \quad (62) \]
\[ \le \Pr\left( I_k(t) > I_{k^*}(t),\; n_k(t) \ge s,\; \mu_{k^*} \le I_{k^*}(t) \right) + t^{-3} \quad (63) \]
\[ \le \Pr\left( I_k(t) > \mu_{k^*},\; n_k(t) \ge s \right) + t^{-3} \quad (64) \]
\[ = \Pr\left( \hat{\mu}_k(t) + \sqrt{\frac{2\log t}{n_k(t)}} > \mu_{k^*},\; n_k(t) \ge s \right) + t^{-3} \quad (65) \]
\[ = \Pr\left( \hat{\mu}_k(t) - \mu_k > \mu_{k^*} - \mu_k - \sqrt{\frac{2\log t}{n_k(t)}},\; n_k(t) \ge s \right) + t^{-3} \quad (66) \]
\[ = \Pr\left( \frac{\sum_{\tau=1}^{t} \mathbb{1}_{k_\tau=k}\, r_\tau}{n_k(t)} - \mu_k > \Delta_k - \sqrt{\frac{2\log t}{n_k(t)}},\; n_k(t) \ge s \right) + t^{-3} \quad (67) \]
\[ \le t \exp\left( -2s\left( \Delta_k - \sqrt{\frac{2\log t}{s}} \right)^2 \right) + t^{-3} \quad (68) \]
\[ \le t^{-3} \exp\left( -2s\left( \Delta_k^2 - 2\Delta_k\sqrt{\frac{2\log t}{s}} \right) \right) + t^{-3} \quad (69) \]
\[ \le 2t^{-3} \quad \forall t > t_0. \quad (70) \]
We have (61) from the law of total probability, \(P(A) = P(A, B) + P(A \mid B^c)P(B^c)\). Inequality (63) follows from Lemma 2, and (64) holds since on the event \(\mu_{k^*} \le I_{k^*}(t)\), \(I_k(t) > I_{k^*}(t)\) implies \(I_k(t) > \mu_{k^*}\). From the definition of \(I_k(t)\) we have (65). Inequality (68) follows from Hoeffding's inequality; the factor t before the exponent in (68) arises because the random variable \(n_k(t)\) can take values from s to t (Lemma 1). Inequality (70) follows from the facts that \(s > \frac{t}{2K}\) and \(\Delta_k \ge 4\sqrt{K \log t_0 / t_0}\) for some constant \(t_0 > 0\), which together ensure \(\Delta_k^2 \ge 2\Delta_k\sqrt{2\log t / s}\) for all \(t > t_0\).
Plugging this into (60) gives us:
\[ \Pr(k_{t+1} = k,\; n_k(t) \ge s) \le t \exp\left( \frac{-t\Delta_{\min}^2}{2K} \right) + \Pr\left( I_k(t) > I_{k^*}(t),\; n_k(t) \ge s \right) \quad (71) \]
\[ \le t \exp\left( \frac{-t\Delta_{\min}^2}{2K} \right) + 2t^{-3} \quad (72) \]
\[ \le 3t^{-3}. \quad (73) \]
Here, (73) follows from the fact that \(\Delta_{\min} \ge 4\sqrt{K \log t_0 / t_0} \ge 2\sqrt{2K \log t_0 / t_0}\) for some constant \(t_0 > 0\).
Lemma 8. If \(\Delta_{\min} \ge 4\sqrt{K \log t_0 / t_0}\) for some constant \(t_0 > 0\), then
\[ \Pr\left( n_k(t) > \frac{t}{K} \right) \le 3K\left( \frac{t}{K} \right)^{-2} \quad \forall t > Kt_0. \]
Proof. We expand \(\Pr(n_k(t) \ge t/K)\) as:
\[ \Pr\left( n_k(t) \ge \frac{t}{K} \right) = \Pr\left( n_k(t) \ge \frac{t}{K} \,\Big|\, n_k(t-1) \ge \frac{t}{K} \right) \Pr\left( n_k(t-1) \ge \frac{t}{K} \right) + \Pr\left( k_t = k,\; n_k(t-1) = \frac{t}{K} - 1 \right) \quad (74) \]
\[ \le \Pr\left( n_k(t-1) \ge \frac{t}{K} \right) + \Pr\left( k_t = k,\; n_k(t-1) = \frac{t}{K} - 1 \right) \quad (75) \]
\[ \le \Pr\left( n_k(t-1) \ge \frac{t}{K} \right) + 3(t-1)^{-3} \quad \forall (t-1) > t_0. \quad (76) \]
Here, (76) follows from Lemma 7.
This gives us
\[ \Pr\left( n_k(t) \ge \frac{t}{K} \right) - \Pr\left( n_k(t-1) \ge \frac{t}{K} \right) \le 3(t-1)^{-3}, \quad \forall (t-1) > t_0. \]
Now consider the summation
\[ \sum_{\tau=t/K}^{t} \left[ \Pr\left( n_k(\tau) \ge \frac{t}{K} \right) - \Pr\left( n_k(\tau-1) \ge \frac{t}{K} \right) \right] \le \sum_{\tau=t/K}^{t} 3(\tau-1)^{-3}. \]
The left-hand side telescopes, which gives us
\[ \Pr\left( n_k(t) \ge \frac{t}{K} \right) - \Pr\left( n_k\left( \frac{t}{K} - 1 \right) \ge \frac{t}{K} \right) \le \sum_{\tau=t/K}^{t} 3(\tau-1)^{-3}. \]
Since \(\Pr\left( n_k\left( \frac{t}{K} - 1 \right) \ge \frac{t}{K} \right) = 0\), we have
\[ \Pr\left( n_k(t) \ge \frac{t}{K} \right) \le \sum_{\tau=t/K}^{t} 3(\tau-1)^{-3} \quad (77) \]
\[ \le 3K\left( \frac{t}{K} \right)^{-2} \quad \forall t > Kt_0. \quad (78) \]
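The final step (77)–(78) can be sanity-checked numerically. The sketch below (an illustrative check of our own, not part of the proof) compares the tail sum on the right side of (77) with the closed-form bound \(3K(t/K)^{-2}\) from (78) for a couple of (t, K) pairs.

```python
def tail_sum(t, K):
    """Sum_{tau = t/K}^{t} 3 (tau - 1)^{-3}, the right side of (77).
    Assumes t/K >= 2 so every denominator is positive."""
    start = t // K
    return sum(3.0 / (tau - 1) ** 3 for tau in range(start, t + 1))

def closed_form_bound(t, K):
    """The closed-form bound 3 K (t/K)^{-2} from (78)."""
    return 3.0 * K * (t / K) ** -2
```

For t = 100 and K = 5 the tail sum is roughly 0.004 while the bound is 0.0375, so the inequality holds with ample slack.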
D REGRET BOUNDS FOR C-UCB

Proof of Theorem 1. We bound \(\mathbb{E}[n_k(T)]\) as:
\[ \mathbb{E}[n_k(T)] = \mathbb{E}\left[ \sum_{t=1}^{T} \mathbb{1}_{k_t=k} \right] \quad (79) \]
\[ = \sum_{t=0}^{T-1} \Pr(k_{t+1} = k) \quad (80) \]
\[ = \sum_{t=1}^{Kt_0} \Pr(k_t = k) + \sum_{t=Kt_0}^{T-1} \Pr(k_{t+1} = k) \quad (81) \]
\[ \le Kt_0 + \sum_{t=Kt_0}^{T-1} \Pr\left( k_{t+1} = k,\; n_{k^*}(t) = \max_{k'} n_{k'}(t) \right) + \sum_{t=Kt_0}^{T-1} \sum_{k' \ne k^*} \Pr\left( n_{k'}(t) = \max_{k''} n_{k''}(t) \right) \Pr\left( k_{t+1} = k \,\Big|\, n_{k'}(t) = \max_{k''} n_{k''}(t) \right) \quad (82) \]
\[ \le Kt_0 + \sum_{t=Kt_0}^{T-1} \Pr\left( k_{t+1} = k,\; n_{k^*}(t) = \max_{k'} n_{k'}(t) \right) + \sum_{t=Kt_0}^{T-1} \sum_{k' \ne k^*} \Pr\left( n_{k'}(t) = \max_{k''} n_{k''}(t) \right) \quad (83) \]
\[ \le Kt_0 + \sum_{t=Kt_0}^{T-1} t^{-3} + \sum_{t=Kt_0}^{T} \sum_{k' \ne k^*} \Pr\left( n_{k'}(t) \ge \frac{t}{K} \right) \quad (84) \]
\[ \le Kt_0 + \sum_{t=1}^{T} t^{-3} + K(K-1) \sum_{t=Kt_0}^{T} 3\left( \frac{t}{K} \right)^{-2}. \quad (85) \]
Here, (84) follows from Lemma 6 and (85) follows from Lemma 8.

Proof of Theorem 2.
For any sub-optimal arm \(k \ne k^*\),
\[ \mathbb{E}[n_k(T)] \le \sum_{t=1}^{T} \Pr(k_t = k) \quad (86) \]
\[ = \sum_{t=1}^{T} \Pr\left( \left( E_1(t) \cup \left( E_1^c(t),\; I_k(t-1) > I_{k^*}(t-1) \right) \right),\; k_t = k \right) \quad (87) \]
\[ \le \sum_{t=1}^{T} \left[ \Pr(E_1(t)) + \Pr\left( E_1^c(t),\; I_k(t-1) > I_{k^*}(t-1),\; k_t = k \right) \right] \quad (88) \]
\[ \le \sum_{t=1}^{T} \left[ \Pr(E_1(t)) + \Pr\left( I_k(t-1) > I_{k^*}(t-1),\; k_t = k \right) \right] \quad (89) \]
\[ = \sum_{t=1}^{T} t \exp\left( \frac{-t\Delta_{\min}^2}{2K} \right) + \sum_{t=0}^{T-1} \Pr\left( I_k(t) > I_{k^*}(t),\; k_{t+1} = k \right) \quad (90) \]
\[ = \sum_{t=1}^{T} t \exp\left( \frac{-t\Delta_{\min}^2}{2K} \right) + \mathbb{E}\left[ n_{I_k > I_{k^*}}(T) \right] \quad (91) \]
\[ \le \frac{8\log T}{\Delta_k^2} + \left( 1 + \frac{\pi^2}{3} \right) + \sum_{t=1}^{T} t \exp\left( \frac{-t\Delta_{\min}^2}{2K} \right). \quad (92) \]
Here, (90) follows from Lemma 5. We have (91) from the definition of \(\mathbb{E}[n_{I_k > I_{k^*}}(T)]\) in Lemma 3, and (92) follows from the bound in Lemma 3.
Proof of Theorem 3. Follows directly by combining the results of Theorem 1 and Theorem 2.
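Since every sum on the right side of (85) converges, the bound on \(\mathbb{E}[n_k(T)]\) for a non-competitive arm stays bounded as T grows. The quick numeric evaluation below (illustrative only; the values of K and \(t_0\) are chosen arbitrarily, not taken from the paper) shows the bound plateauing.

```python
def c_ucb_pull_bound(T, K=5, t0=10):
    """Evaluate the right side of (85):
    K*t0 + sum_{t=1}^{T} t^-3 + 3 K (K-1) sum_{t=K t0}^{T} (t/K)^-2."""
    term1 = K * t0
    term2 = sum(t ** -3 for t in range(1, T + 1))
    term3 = 3 * K * (K - 1) * sum((t / K) ** -2 for t in range(K * t0, T + 1))
    return term1 + term2 + term3
```

Evaluating at T = 10^3 and T = 10^5 changes the bound by less than two pulls: a non-competitive arm is pulled only O(1) times, so it contributes O(1) regret.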
E ALGORITHM SPECIFIC RESULTS FOR C-TS

As done in [27], let us define two thresholds for an arm \(k \in \mathcal{K}\): a lower threshold \(L_k\) and an upper threshold \(U_k\),
\[ U_k = \mu_k + \frac{\Delta_k}{3}, \qquad L_k = \mu_{k^*} - \frac{\Delta_k}{3}. \quad (93) \]
Let \(E^\mu_k(t)\) and \(E^S_k(t)\) be the events that
\[ E^\mu_k(t) : \hat{\mu}_k(t) \le U_k, \qquad E^S_k(t) : S_k(t) \le L_k. \quad (94) \]
Recall that \(S_k(t)\) is the sample obtained from the posterior distribution on the mean reward of arm k at round t.

Fact 2 ([33]). For a Gaussian random variable Z with mean μ and variance \(\sigma^2\), for any constant c,
\[ \frac{1}{4\sqrt{\pi}} \exp(-7c^2/2) < \Pr(|Z - \mu| > c\sigma) \le \frac{1}{2} \exp(-c^2/2). \]

Fact 3 ([33]). For a Gaussian random variable Z with mean μ and variance \(\sigma^2\), for any constant c,
\[ \Pr(|Z - \mu| > c\sigma) \ge \frac{1}{\sqrt{2\pi}} \frac{c}{c^2+1} \exp(-c^2/2). \]
Lemma 9. If \(\Delta_{\min} \ge 6\sqrt{2\beta K \log t_0 / t_0}\) for some constant \(t_0 > \exp(11\beta\sigma^2)\), and \(s > \frac{t}{K}\), then
\[ \Pr\left( k_t = k,\; n_k(t-1) \ge s \right) \le 3t^{-3} + 2t^{-2\beta} \quad \forall t > t_0, \]
where \(k \ne k^*\) is a sub-optimal arm.
Proof. We start by bounding the probability that arm k is pulled at round t as follows:
\[ \Pr(k_t = k,\; n_k(t-1) \ge s) \le \Pr\left( E_1(t),\; k_t = k,\; n_k(t-1) \ge s \right) + \Pr\left( E_1^c(t),\; k_t = k,\; n_k(t-1) \ge s \right) \]
\[ \le t \exp\left( \frac{-t\Delta_{\min}^2}{2K} \right) + \Pr\left( E_1^c(t),\; k_t = k,\; n_k(t-1) \ge s \right) \quad (95) \]
\[ \le t^{-3} + \underbrace{\Pr\left( k_t = k,\; E^\mu_k(t),\; E^S_k(t),\; n_k(t-1) \ge s \right)}_{\text{term A}} + \underbrace{\Pr\left( k_t = k,\; E^\mu_k(t),\; \overline{E^S_k(t)},\; n_k(t-1) \ge s \right)}_{\text{term B}} + \underbrace{\Pr\left( k_t = k,\; \overline{E^\mu_k(t)},\; n_k(t-1) \ge s \right)}_{\text{term C}} \quad (96) \]
where the bound on the first term in (95) comes from Lemma 5. We now treat each term in (96) individually. We know from [27] that for all \(s \ge \exp(11\beta)\),
\[ (A) \le t^{-2\beta}. \]
For term B we can show that
\[ (B) \le \Pr\left( E^\mu_k(t),\; \overline{E^S_k(t)},\; n_k(t-1) > s \right) \quad (97) \]
\[ \overset{(a)}{\le} \Pr\left( \mathcal{N}\left( U_k, \frac{\beta}{n_k(t)+1} \right) > L_k,\; n_k(t-1) > s \right) \quad (98) \]
\[ \overset{(b)}{=} \Pr\left( \mathcal{N}\left( \mu_k + \frac{\Delta_k}{3}, \frac{\beta}{n_k(t)+1} \right) > \mu_{k^*} - \frac{\Delta_k}{3},\; n_k(t-1) > s \right) \quad (99) \]
\[ \overset{(c)}{=} \Pr\left( \mathcal{N}\left( \mu_k + \frac{\Delta_k}{3} - \frac{2\Delta_k}{3}, \frac{\beta}{n_k(t)+1} \right) > \mu_{k^*} - \frac{\Delta_k}{3} - \frac{2\Delta_k}{3},\; n_k(t-1) > s \right) \quad (100) \]
\[ \le \Pr\left( \mathcal{N}\left( \mu_k - \frac{\Delta_k}{3}, \frac{\beta}{n_k(t)+1} \right) > \mu_{k^*} - \frac{2\Delta_k}{3} - \sqrt{\frac{8\beta\log t}{s}},\; n_k(t-1) > s \right) \Pr\left( \frac{\Delta_k}{3} \ge \sqrt{\frac{8\beta\log t}{s}} \right) \quad (101) \]
\[ \quad + \Pr\left( \mathcal{N}\left( \mu_k - \frac{\Delta_k}{3}, \frac{\beta}{n_k(t)+1} \right) > \mu_{k^*} - \frac{2\Delta_k}{3} - \frac{\Delta_k}{3},\; n_k(t-1) > s \right) \Pr\left( \frac{\Delta_k}{3} < \sqrt{\frac{8\beta\log t}{s}} \right) \quad (102) \]
\[ \overset{(d)}{\le} \Pr\left( \mathcal{N}\left( \mu_k - \frac{\Delta_k}{3}, \frac{\beta}{n_k(t)+1} \right) > \mu_{k^*} - \frac{2\Delta_k}{3} - \sqrt{\frac{8\beta\log t}{s}},\; n_k(t-1) > s \right) \quad (103) \]
\[ \overset{(e)}{\le} \sum_{m=s}^{t} \frac{1}{2} \exp\left( -\frac{m\left( \mu_{k^*} - \frac{2\Delta_k}{3} - \mu_k + \frac{\Delta_k}{3} - \sqrt{\frac{8\beta\log t}{s}} \right)^2}{2\beta} \right) \quad (104) \]
\[ \overset{(f)}{\le} \frac{t}{2} \exp\left( -\frac{s}{2\beta}\left( \frac{8\beta\log t}{s} + \frac{4\Delta_k^2}{9} - \frac{4\Delta_k}{3}\sqrt{\frac{8\beta\log t}{s}} \right) \right) \quad (105) \]
\[ \overset{(g)}{\le} \frac{t^{-3}}{2} \exp\left( -\frac{s}{2\beta}\left( \frac{4\Delta_k^2}{9} - \frac{4\Delta_k}{3}\sqrt{\frac{8\beta\log t}{s}} \right) \right) \overset{(h)}{\le} \frac{t^{-3}}{2}. \quad (106) \]
Here, (a) follows since on the event \(E^\mu_k(t)\) we have \(\hat{\mu}_k(t) \le U_k\), and \(\overline{E^S_k(t)}\) is the event that the posterior sample \(\mathcal{N}\left( \hat{\mu}_k(t), \frac{\beta}{n_k(t)+1} \right)\) exceeds \(L_k\). Equality (b) follows by substituting the expressions for \(L_k\) and \(U_k\) from (93), and (c) shifts both the mean and the threshold by \(-\frac{2\Delta_k}{3}\), which leaves the probability unchanged. Inequality (d) follows since for \(t > t_0\) and \(s > t/K\), \(\Delta_k > 6\sqrt{2\beta\log t / s}\), so the second term in (102) has probability zero. Inequality (e) follows from Fact 2, noting that \(\mu_{k^*} - \frac{2\Delta_k}{3} - \mu_k + \frac{\Delta_k}{3} = \frac{2\Delta_k}{3}\). We have (h) since \(\Delta_k > 6\sqrt{2\beta\log t / s}\) for all \(s > t/K\) and \(t > t_0\), which makes the exponent non-positive.
Finally, for term C we can show that
\[ (C) = \Pr\left( k_t = k,\; \overline{E^\mu_k(t)},\; n_k(t-1) \ge s \right) \quad (107) \]
\[ \le \Pr\left( \overline{E^\mu_k(t)},\; n_k(t-1) \ge s \right) \quad (108) \]
\[ = \Pr\left( \hat{\mu}_k(t) - \mu_k > \frac{\Delta_k}{3},\; n_k(t-1) \ge s \right) \quad (109) \]
\[ \le 2t \exp\left( -2s\,\frac{\Delta_k^2}{9} \right) \quad (110) \]
\[ \le 2t^{-3} \quad \forall t > t_0. \quad (111) \]
Here, (110) follows from Hoeffding's inequality and the union-bound trick (Lemma 1) to handle the random variable \(n_k(t-1)\). We have (111) since \(\Delta_k > 6\sqrt{2K\beta\log t_0 / t_0}\) for some \(t_0 > 0\), \(s > t/K\), and \(\beta > 1\).
Lemma 10. If \(\Delta_{\min} \ge 6\sqrt{2\beta K \log t_0 / t_0}\) for some constant \(t_0 > 0\), then
\[ \Pr\left( n_k(t) > \frac{t}{K} \right) \le 3K\left( \frac{t}{K} \right)^{-2} + K\left( \frac{t}{K} \right)^{1-2\beta} \quad \forall t > Kt_0. \]
Proof. We expand \(\Pr(n_k(t) \ge t/K)\) as:
\[ \Pr\left( n_k(t) \ge \frac{t}{K} \right) = \Pr\left( n_k(t) \ge \frac{t}{K} \,\Big|\, n_k(t-1) \ge \frac{t}{K} \right) \Pr\left( n_k(t-1) \ge \frac{t}{K} \right) + \Pr\left( k_t = k,\; n_k(t-1) = \frac{t}{K} - 1 \right) \quad (112) \]
\[ \le \Pr\left( n_k(t-1) \ge \frac{t}{K} \right) + \Pr\left( k_t = k,\; n_k(t-1) = \frac{t}{K} - 1 \right) \quad (113) \]
\[ \le \Pr\left( n_k(t-1) \ge \frac{t}{K} \right) + 3(t-1)^{-3} + (t-1)^{-2\beta} \quad \forall (t-1) > t_0. \quad (114) \]
Here, (114) follows from Lemma 9.

This gives us
\[ \Pr\left( n_k(t) \ge \frac{t}{K} \right) - \Pr\left( n_k(t-1) \ge \frac{t}{K} \right) \le 3(t-1)^{-3} + (t-1)^{-2\beta}, \quad \forall (t-1) > t_0. \]
Now consider the summation
\[ \sum_{\tau=t/K}^{t} \left[ \Pr\left( n_k(\tau) \ge \frac{t}{K} \right) - \Pr\left( n_k(\tau-1) \ge \frac{t}{K} \right) \right] \le \sum_{\tau=t/K}^{t} \left[ 3(\tau-1)^{-3} + (\tau-1)^{-2\beta} \right]. \]
The left-hand side telescopes, which gives us
\[ \Pr\left( n_k(t) \ge \frac{t}{K} \right) - \Pr\left( n_k\left( \frac{t}{K} - 1 \right) \ge \frac{t}{K} \right) \le \sum_{\tau=t/K}^{t} \left[ 3(\tau-1)^{-3} + (\tau-1)^{-2\beta} \right]. \]
Since \(\Pr\left( n_k\left( \frac{t}{K} - 1 \right) \ge \frac{t}{K} \right) = 0\), we have
\[ \Pr\left( n_k(t) \ge \frac{t}{K} \right) \le \sum_{\tau=t/K}^{t} \left[ 3(\tau-1)^{-3} + (\tau-1)^{-2\beta} \right] \quad (115) \]
\[ \le 3K\left( \frac{t}{K} \right)^{-2} + K\left( \frac{t}{K} \right)^{1-2\beta} \quad \forall t > Kt_0. \quad (116) \]
F REGRET BOUNDS FOR C-TS

Proof of Theorem 1. Following the same steps as in Appendix D, we get
\[ \mathbb{E}[n_k(T)] \le Kt_0 + \sum_{t=Kt_0}^{T-1} \Pr\left( k_{t+1} = k,\; n_{k^*}(t) = \max_{k'} n_{k'}(t) \right) + \sum_{t=Kt_0}^{T-1} \sum_{k' \ne k^*} \Pr\left( n_{k'}(t) = \max_{k''} n_{k''}(t) \right) \quad (117) \]
\[ \le Kt_0 + \sum_{t=Kt_0}^{T-1} t^{-3} + \sum_{t=Kt_0}^{T} \sum_{k' \ne k^*} \Pr\left( n_{k'}(t) \ge \frac{t}{K} \right) \quad (118) \]
\[ \le Kt_0 + \sum_{t=1}^{T} t^{-3} + K(K-1) \sum_{t=Kt_0}^{T} \left( 3\left( \frac{t}{K} \right)^{-2} + \left( \frac{t}{K} \right)^{1-2\beta} \right) \quad (119) \]
\[ = O(1) \quad \text{for } \beta > 1. \quad (120) \]
Here, (118) follows from Lemma 6 and (119) follows from Lemma 10.

Proof of Theorem 2. Following the same steps as in Appendix D, we get, for any sub-optimal arm \(k \ne k^*\),
\[ \mathbb{E}[n_k(T)] \le \sum_{t=1}^{T} t \exp\left( \frac{-t\Delta_{\min}^2}{2K} \right) + \sum_{t=0}^{T-1} \Pr\left( S_k(t) > S_{k^*}(t),\; k_{t+1} = k \right) \quad (121) \]
\[ \le \frac{18\log(T\Delta_k^2)}{\Delta_k^2} + \exp(11\beta) + 5 + \frac{132}{\Delta_k^2} + \sum_{t=1}^{T} t \exp\left( \frac{-t\Delta_{\min}^2}{2K} \right). \quad (122) \]
Here, (122) follows from the analysis of Thompson sampling in [27].
G LOWER BOUNDS
For the proof we define \(R_k = Y_k(X)\) and \(\tilde{R}_k = \tilde{Y}_k(\tilde{X})\), where \(f_X(x)\) is the probability density function of the latent random variable X in the original bandit instance and \(f_{\tilde{X}}(x)\) is that of \(\tilde{X}\) in the new instance. Similarly, we define \(f_{R_k}(r)\) to be the reward distribution of arm k, and \(f_{\tilde{R}_k}(r)\) that of arm k in the new instance.
G.1 Proof of Theorem 4

Let arm k be a competitive sub-optimal arm, i.e., \(\tilde{\Delta}_{k,k^*} < 0\). To prove that regret is Ω(log T) in this setting, we need to create a new bandit instance in which the reward distribution of the optimal arm is unaffected, but the previously competitive sub-optimal arm k becomes optimal. We do so by constructing a bandit instance with latent randomness \(\tilde{X}\) and random rewards \(\tilde{Y}_k(\tilde{X})\), where \(\tilde{Y}_k(\tilde{X})\) denotes the random reward obtained on pulling arm k given the realization of \(\tilde{X}\). Let \(\mathcal{Y}_k\) denote the support of \(Y_k(X)\). To make arm k optimal in the new bandit instance, we construct \(\tilde{Y}_k(\tilde{X})\) and \(\tilde{X}\) in the following manner. Define
\[ \tilde{Y}_k(\tilde{X}) = \begin{cases} g_k(\tilde{X}) & \text{w.p. } 1 - \epsilon_1, \\ \sim \mathrm{Uniform}(\mathcal{Y}_k) & \text{w.p. } \epsilon_1. \end{cases} \]
This changes the conditional reward distribution of arm k in the new bandit instance (with increased mean). Furthermore, define
\[ \tilde{X} = \begin{cases} S(R_{k^*}) & \text{w.p. } 1 - \epsilon_2, \\ \sim \mathrm{Uniform}(\mathcal{X}) & \text{w.p. } \epsilon_2, \end{cases} \]
with \(S(R_{k^*}) = \arg\max_{x :\, g_{k^*}(x) = R_{k^*}} g_k(x)\); here \(R_{k^*}\) represents the random reward of arm \(k^*\) in the original bandit instance.

This construction of \(\tilde{X}\) is possible for some \(\epsilon_1, \epsilon_2 > 0\) whenever arm k is competitive, by definition. Moreover, under such a construction one can change the reward distribution of \(\tilde{Y}_{k^*}(\tilde{X})\) such that the reward \(\tilde{R}_{k^*}\) has the same distribution as \(R_{k^*}\). This is done by changing the conditional reward distribution to
\[ f_{\tilde{Y}_{k^*} \mid \tilde{X}}(r) = f_{Y_{k^*} \mid X}(r)\, \frac{f_X(x)}{f_{\tilde{X}}(x)}. \]
Due to this, if an arm is competitive, there exists a new bandit instance with latent randomness \(\tilde{X}\) and conditional rewards \(\tilde{Y}_{k^*} \mid \tilde{X}\) and \(\tilde{Y}_k \mid \tilde{X}\) such that \(f_{\tilde{R}_{k^*}} = f_{R_{k^*}}\) and \(\mathbb{E}[\tilde{R}_k] > \mu_{k^*}\), with \(f_{\tilde{R}_k}\) denoting the probability distribution of the reward from arm k, and \(\tilde{R}_k\) the reward from arm k, in the new bandit instance. Therefore, if these are the only two arms in our problem, then from Lemma 4,
\[ \liminf_{T \to \infty} \frac{\mathbb{E}[n_k(T)]}{\log T} \ge \frac{1}{D(f_{R_k}(r) \,\|\, f_{\tilde{R}_k}(r))}, \]
where \(f_{\tilde{R}_k}(r)\) represents the reward distribution of arm k in the new bandit instance.
Pseudo-rewards:
  r    s_{2,1}(r)        r    s_{1,2}(r)
  0    2/3               0    3/4
  1    6/7               1    2/3

(a)            R_2 = 0    R_2 = 1
  R_1 = 0      0.1        0.2
  R_1 = 1      0.3        0.4

(b)            R_2 = 0    R_2 = 1
  R_1 = 0      a          b
  R_1 = 1      c          d

Table 4: The top row shows the pseudo-rewards of arms 1 and 2, i.e., upper bounds on the conditional expected rewards (which are known to the player). The bottom row depicts two possible joint probability distributions (unknown to the player). Under distribution (a), Arm 1 is optimal and all pseudo-rewards except s_{2,1}(1) are tight.
Moreover, if we have K − 1 sub-optimal arms instead of just one, then
\[ \liminf_{T \to \infty} \frac{\mathbb{E}\left[ \sum_{\ell \ne k^*} n_\ell(T) \right]}{\log T} \ge \frac{1}{D(f_{R_k}(r) \,\|\, f_{\tilde{R}_k}(r))}. \]
Consequently, since \(\mathbb{E}[Reg(T)] = \sum_{\ell=1}^{K} \Delta_\ell\, \mathbb{E}[n_\ell(T)]\), we have
\[ \liminf_{T \to \infty} \frac{\mathbb{E}[Reg(T)]}{\log T} \ge \max_{k \in \mathcal{C}} \frac{\Delta_k}{D(f_{R_k} \,\|\, f_{\tilde{R}_k})}. \quad (123) \]
G.2 Lower bound discussion in the general framework

Consider the example shown in Table 4. Under the joint probability distribution (a), Arm 1 is optimal. Moreover, all pseudo-rewards except \(s_{2,1}(1)\) are tight, i.e., \(s_{\ell,k}(r) = \mathbb{E}[R_\ell \mid R_k = r]\). Under distribution (a), the expected pseudo-reward of Arm 2 is 0.8, and hence Arm 2 is competitive. Due to this, our C-UCB and C-TS algorithms pull Arm 2 O(log T) times.

However, it is not possible to construct an alternate bandit environment with the joint probability distribution shown in Table 4(b) such that Arm 2 becomes optimal while maintaining the same marginal distribution for Arm 1 and keeping the pseudo-rewards valid upper bounds on the conditional expected rewards. Formally, there do not exist \(a, b, c, d \ge 0\) such that \(c + d = 0.7\), \(b + d > 0.7\) (Arm 2 optimal), \(\frac{c}{a+c} < 3/4\), \(\frac{b}{a+b} < 2/3\), \(\frac{d}{b+d} < 2/3\), \(\frac{d}{d+c} < 6/7\), and \(a + b + c + d = 1\). This suggests that there should be a way to achieve O(1) regret in this scenario. We believe this can be done by using all the constraints imposed by the knowledge of pairwise pseudo-rewards to shrink the space of possible joint probability distributions when calculating the empirical pseudo-reward. However, this becomes difficult to implement when the ratings can take multiple possible values and the number of arms exceeds two. We leave the task of devising a practically feasible, easy-to-implement algorithm that achieves bounded regret whenever possible in the general setup as an interesting open problem.