Low Complexity Online Radio Access Technology Selection Algorithm in LTE-WiFi HetNet†

Arghyadip Roy, Student Member, IEEE, Vivek Borkar, Fellow, IEEE, Prasanna Chaporkar, Member, IEEE, and Abhay Karandikar, Member, IEEE

Abstract—In an offload-capable Long Term Evolution (LTE)-Wireless Fidelity (WiFi) Heterogeneous Network (HetNet), we consider the problem of maximization of the total system throughput under a voice user blocking probability constraint. The optimal policy is threshold in nature. However, computation of the optimal policy requires the knowledge of the statistics of system dynamics, viz., the arrival processes of voice and data users, which may be difficult to obtain in reality. Motivated by the Post-Decision State (PDS) framework to learn the optimal policy under unknown statistics of system dynamics, we propose, in this paper, an online Radio Access Technology (RAT) selection algorithm using the Relative Value Iteration Algorithm (RVIA). However, the convergence speed of this algorithm can be further improved if the underlying threshold structure of the optimal policy can be exploited. To this end, we propose a novel structure-aware online RAT selection algorithm which reduces the feasible policy space, thereby offering lower storage and computational complexity and faster convergence. This algorithm provides a novel framework for designing online learning algorithms for other problems and hence is of independent interest. We prove that both algorithms converge to the optimal policy. Simulation results demonstrate that the proposed algorithms converge faster than a traditional scheme. Also, the proposed schemes perform better than other benchmark algorithms under realistic network scenarios.

Index Terms—User association, LTE-WiFi offloading, Constrained MDP, Threshold policy, Stochastic Approximation.


1 INTRODUCTION

Recent developments in wireless communications have witnessed a proliferation of end-user equipment, such as tablets and smartphones, with advanced capabilities. The ever-increasing Quality of Service (QoS) requirements of users give rise to the standardization of various Radio Access Technologies (RATs) [2], such as Universal Mobile Telecommunications System (UMTS), Long Term Evolution (LTE), etc. According to [3], monthly global mobile data traffic is expected to exceed 49 exabytes by 2021. One of the basic limitations of the present RATs is the lack of support for the co-existence of multiple RATs. Therefore, to address this exceptional growth of data traffic, network providers have expressed their interest towards efficient interworking among various RATs in the context of the upcoming Fifth Generation (5G) [2] network. This gives rise to the notion of Heterogeneous Network (HetNet), where different RATs coexist and interwork with each other. Users can be associated with any candidate RAT and moved from one RAT to another in a seamless manner. Recent advances in the Software Defined Networking (SDN) [4], [5] paradigm also facilitate these mechanisms by bringing more flexibility in the integrated control and management of various RATs.

† This paper is a substantially expanded and revised version of the work in [1].

• The authors are with the Department of Electrical Engineering, Indian Institute of Technology Bombay, Mumbai, 400076, India. e-mail: {arghyadip, borkar, chaporkar, karandi}@ee.iitb.ac.in. Abhay Karandikar is currently Director, Indian Institute of Technology Kanpur (on leave from IIT Bombay), Kanpur, 208016, India. e-mail: [email protected].

In a Third Generation Partnership Project (3GPP) LTE HetNet, interworking with IEEE 802.11 [6] Wireless Local Area Network (WLAN) (popularly known as Wireless Fidelity (WiFi)) operating in the unlicensed band provides a potential solution due to the complementary nature of these RATs. The LTE Base Stations (BSs) can be deployed to provide ubiquitous coverage, whereas the WiFi Access Points (APs) offer a high bit rate solution in hot-spot regions. A user can be associated with either LTE or WiFi in an area where their coverage overlaps. Furthermore, according to 3GPP Release 12 specifications [7], data users can be steered from one RAT to another. This mechanism is known as mobile data offloading.

Intelligent RAT selection¹ and offloading decisions [8]–[18] may lead to efficient resource utilization and can be triggered either by the user or by the network. User-initiated decisions are taken with an objective of optimizing individual utilities and hence may not converge to the global optimality. Therefore, we focus on network-initiated association and offloading schemes which target to optimize different network-level system parameters.

In this paper, we propose online learning algorithms to obtain the optimal RAT selection policy for an LTE-WiFi HetNet. The model considered in this paper is similar to that of [19]. As illustrated in Fig. 1, the centralized controller (for example, an SDN controller) possesses an overall view of both the networks and is responsible for taking network-initiated RAT selection and offloading decisions. We consider voice and data users inside the LTE-WiFi HetNet.

1. The terminologies “association” and “RAT selection” are used interchangeably throughout this paper.


Figure 1: LTE-WiFi HetNet architecture.

Voice users are always associated with LTE since voice users require a QoS guarantee which may not be provided by WiFi. We assume that the data users can be associated with either LTE or WiFi. Additionally, we assume that data users can be offloaded from one RAT to another whenever a voice user is associated or an existing voice/data user departs.

Total system throughput is an important system parameter from a network operator’s perspective since the throughput experienced by the data users may have a significant impact on the profit and customer base of the operator. From the perspective of maximization of total system throughput, WiFi generally provides more throughput than LTE when the WiFi load is low. However, the total throughput in WiFi decreases as the load increases [20]. Therefore, although at low load WiFi may be preferable to data users for association, as the load increases, LTE may be preferred. LTE resources are shared between voice and data users from a common pool of resources. The throughput requirement of the LTE data users is usually more than that of the voice users. Therefore, LTE may reserve some of the resources, which otherwise would not be efficiently utilized by low throughput voice users, for data users. Thus, total system throughput maximization may result in excessive blocking of voice users. [19] addresses this inherent trade-off between the total system throughput and the blocking probability of voice users by formulating this problem as a constrained average reward continuous-time Markov Decision Process (MDP), which maximizes the total system throughput subject to a constraint on the voice user blocking probability. In this paper, we propose online learning algorithms which can be implemented without the knowledge of the statistics of arrival processes of voice and data users to obtain the optimal RAT selection policy for an LTE-WiFi HetNet.

One of the main advantages of learning is that it avoids explicit estimation, which may have high variance or may be computationally prohibitive. It replaces the conditional averaging in an iterative scheme for solving the dynamic programming equation by an actual evaluation at an observed transition and an incremental update which performs the conditional averaging implicitly. Also, our scheme has the advantage that it exploits the known threshold structure of the optimal policy to further reduce the computational complexity, and is one of the first to do so.

1.1 Related Work

RAT selection and offloading techniques in HetNets can be either user-initiated [8]–[13] or network-initiated [14]–[19], [21], [22] in nature. In [8], the authors propose a user-initiated RAT selection algorithm in an LTE-WiFi HetNet. The proposed algorithm addresses the trade-off between resource utilization and QoS. The problem where each user selfishly chooses a RAT with the objective of maximizing its individual throughput is considered in [10] and formulated as a non-cooperative game. [11] proposes a heuristic “on-the-spot offloading” scheme where a data user is always steered to WiFi whenever it is inside the coverage area of WiFi APs.

Among the network-initiated approaches, in [15], the optimal client-AP association problem in a WLAN is considered within the framework of a continuous-time MDP, and user arrival/departure is taken as a feasible decision epoch. In [22], the authors propose an optimal RAT selection algorithm aiming at maximizing the generated revenue. However, the proposed algorithm scales exponentially with the system size. The authors in [16] propose distributed association algorithms by formulating the association problem as a non-cooperative game and compare the performance with a centralized globally optimal scheme. A context-aware RAT selection algorithm is proposed in [17]. The proposed algorithm, which can be implemented on the user side, albeit with network assistance, minimizes signaling overhead as well as base station computations.

While the above solutions provide significant insight into RAT selection and offloading strategies, none of them specifically focuses on computational efficiency. Therefore, practical implementations of the proposed algorithms become infeasible. There are two main driving factors behind this issue. Firstly, standard dynamic programming techniques to solve optimal RAT selection and offloading problems become computationally inefficient, and thus difficult to implement, when the state space is large. Furthermore, we need to know the statistics of the arrival processes of data and voice users, which govern the transition probabilities between different states, in order to determine the optimal policy. In practice, this may be difficult to obtain. Recent studies [23]–[25] on the characteristics of cellular traffic reveal that although voice traffic can be predicted accurately, state-of-the-art prediction schemes for data traffic are not very satisfactory. Therefore, obtaining real-time traffic statistics in today's cellular networks is very difficult.

In the case of unknown system statistics, we may resort to Reinforcement Learning (RL) [26] algorithms, which determine the optimal policy using a trial-and-error method. [1], [21], [27], [28] adopt RL based schemes which can be implemented online. Q-learning [1], [21], [27] is a traditional RL algorithm which learns the optimal policy under unknown system dynamics. Our preliminary work [1] undertakes a Q-learning based approach to determine the optimal policy. The authors in [21] aim to improve the network performance and user experience jointly and formulate the RAT selection problem as a Semi-Markov Decision Process (SMDP) problem. A Q-learning based approach is also adopted there since network parameters may be difficult to obtain in reality. Based on locally available information at users and following a Q-learning approach, [27] undertakes distributed traffic offloading decisions. Although convergence to optimality is guaranteed, these learning schemes need to iteratively learn the value functions for all state-action pairs, thus incurring a large memory requirement. Additionally, due to the associated exploration mechanism, their convergence rate is slow, especially under a large state space.

1.2 Our Contribution

In this paper, our primary contribution is to propose online learning algorithms to maximize the total system throughput subject to a constraint on the voice user blocking probability, without knowing the statistics of the arrival processes of voice/data users in an LTE-WiFi HetNet. To address the issue of slow convergence of existing learning schemes in the literature, we propose a Post-Decision State (PDS) learning algorithm which speeds up the learning process by removing the action exploration. This approach is based on a reformulation of the Relative Value Iteration Algorithm (RVIA) equation and can be implemented online in the Stochastic Approximation (SA) framework. Furthermore, the PDS learning algorithm has a lower space complexity than that of Q-learning [1] because, instead of state-action pair values, we need to store only the value functions associated with states. We also prove the convergence of the PDS learning RAT selection algorithm to optimality.

We have shown in [19] that the optimal policy has a threshold structure, wherein beyond a certain threshold on the number of WiFi data users, data users are served using LTE. A similar property exists for the admission of voice users [19], where beyond a certain threshold on the number of LTE data and voice users, voice users are blocked. In this paper, we exploit the threshold properties in [19] and propose a structure-aware learning algorithm which, instead of the entire policy space, searches for the optimal policy only within the set of threshold policies. This reduces the convergence time as well as the computational and storage complexity in comparison to the proposed PDS learning algorithm. We prove that the threshold vector iterates in the proposed structure-aware learning algorithm indeed converge to the globally optimal solution. Note that the analytical methodologies presented in this paper to learn the optimal threshold policy are developed independently and can be applied to any learning problem where the optimal policy is threshold in nature.

Although we make some simplifying assumptions to facilitate the analysis, the performance of the proposed schemes is studied under realistic conditions without the simplifying assumptions. Extensive simulations are conducted in ns-3 [29], a discrete event network simulator, to characterize the performance of the proposed algorithms. It is observed through simulations that the proposed structure-aware online RAT selection learning algorithm outperforms the PDS learning algorithm, providing faster convergence to optimality. A performance comparison of the proposed algorithms is made with the Q-learning based RAT selection algorithm [1]. Furthermore, we observe that the proposed algorithms outperform other benchmark algorithms under realistic network scenarios such as the presence of channel fading, dynamic resource scheduling and user mobility. We use 3GPP recommended parameters for the simulations.

Our key contributions can be summarized as follows.
• Based on the PDS paradigm, we propose an online algorithm for RAT selection in an LTE-WiFi HetNet. The convergence proof for the proposed algorithm is provided.
• We further exploit the threshold nature of optimal policies [19] and propose a novel structure-aware online association algorithm. Theoretical and simulation results indicate that the knowledge of the threshold property helps in achieving reductions in storage and computational complexity as well as in convergence time. We also prove that the proposed scheme converges to the true value function.
• The proposed structure-aware algorithm provides a novel framework that can be applied for designing online learning algorithms for other problems and hence is of independent interest.
• The performance of the proposed algorithms is compared with other online algorithms in the literature [1].
• The performance of the proposed algorithms is compared with other benchmark RAT selection algorithms under realistic scenarios.

The rest of the paper is organized as follows. Section 2 describes the system model. In Section 3, the problem formulation within the framework of constrained MDP is described. The system model and formulation adopted in our paper are analogous to [1], [14], [19]. The developments described after Section 3 are our point of departure. We introduce the notion of PDS in Section 4. Sections 5 and 6 propose the PDS learning algorithm and the structure-aware learning algorithm, respectively, for RAT selection in an LTE-WiFi HetNet. A comparison of the computational and storage complexities of the proposed and traditional algorithms is provided in Section 7. Simulation results are presented in Section 8, followed by conclusions in Section 9. In the interest of preserving the flow of the paper, proofs are presented in Section 10.

2 SYSTEM MODEL

As demonstrated in Fig. 1, we consider a system consisting of a WiFi AP inside the coverage area of an LTE BS, both connected to a centralized controller using ideal lossless links. We assume that voice and data users are present at any geographical point in the coverage area of the LTE BS. Data users outside the common coverage area of the LTE BS and the WiFi AP always get associated with the LTE BS. Therefore, no decision is involved in this case. Hence, without loss of generality, we take into account only those data users which are present in the dual coverage area of the LTE BS and the WiFi AP. Data users can be associated with either the LTE BS or the WiFi AP. We assume that in LTE, voice and data users are allocated resources from a common resource pool. We assume that voice and data user arrivals are Poisson processes with means λv and λd, respectively. Service times for voice and data users follow exponential distributions with means 1/µv and 1/µd, respectively. Assumptions on service times follow the justifications in [30]. All the users are assumed to be stationary.

Remark 1. For brevity of notation, we have considered a single LTE BS and a single WiFi AP. However, the system model can be generalized to multiple LTE BSs and WiFi APs with small modifications. When the coverage areas of multiple APs/BSs do not overlap, we need to consider the number of users in each AP/BS in the state space. In case of multiple overlapping APs/BSs, we can construct a one-to-one mapping between a user location and an AP/BS using a simple criterion such as highest average signal strength, etc. Using this criterion for multiple overlapping APs, the problem can be reduced to a single-BS, multiple non-overlapping-AP problem. The set where more than one AP has equal signal strength is non-generic in the sense that in the associated parameter space it has Lebesgue measure 0.

2.1 State & Action Space

The system can be modeled as a controlled continuous-time stochastic process {X(t)}t≥0. Any state s in the state space S is symbolically represented as s = (i, j, k), where i, j represent the number of voice and data users in LTE, respectively. The number of data users in WiFi is denoted by k. The arrivals and departures of voice and data users are taken as decision epochs. Due to the Markovian nature of the system, it suffices to observe the system state at each of the decision epochs. Note that the system state changes only at these decision epochs. Therefore, it is not required to consider the system state at other points in time.

The state of the system changes whenever there is an arrival or a departure of voice/data users, referred to as events. We consider five types of events, namely, an arrival of a voice user in the system (E1), an arrival of a data user in the system (E2), a departure of an existing voice user from LTE (E3), a departure of an existing data user from LTE (E4) and a departure of an existing data user from WiFi (E5). At every decision epoch, the centralized controller takes a decision based on the present system state. Based on the chosen action, the system moves to different states with finite probabilities.

We assume that (i, j, k) ∈ S if (i + j) ≤ C and k ≤ W, where C is the total number of common resource blocks for voice and data users in the LTE resource pool, and W denotes the maximum number of users in WiFi to guarantee that the average per-user throughput in WiFi is more than a certain threshold. Note that the per-user throughput in WiFi monotonically decreases with the number of WiFi data users [20]. The first condition signifies that each admitted user is provided a single resource block, whenever resources are available.

Remark 2. For the sake of simplicity, we have considered single resource block allocation to LTE users. Although the consideration of multiple resource blocks mimics the practical scenario better, it complicates the system model without bringing much difference in the formulation developed in this paper.

Let the action space, i.e., the set of all possible association and offloading strategies in case of arrivals and departures of users, be denoted by A. We describe A below.

A =
  A1: Block the arriving user or do nothing during a departure,
  A2: Accept the voice/data user in LTE,
  A3: Accept the data user in WiFi,
  A4: Accept the voice user in LTE and offload one data user to WiFi,
  A5: Move one data user to the RAT from which the departure has occurred.

In case of voice and data user arrivals, the sets of all possible actions are {A1, A2, A4} and {A1, A2, A3}, respectively. In case of departure of voice/data users, the set of all possible actions is {A1, A5}. However, depending on the system state, one or more actions may be infeasible. Note that blocking is considered a feasible action for voice (data) users only when the system is non-empty (the capacity is reached for both LTE and WiFi).

2.2 State Transitions

From each state s, under an action a, the system makes a transition to a different state s′ with positive probability pss′(a). In state s = (i, j, k), let the sum of the arrival and service rates of users be denoted by v(i, j, k). Therefore, we have,

v(i, j, k) = λv + λd + iµv + jµd + kµd.

Then,

pss′(a) =
  λv / v(i′, j′, k′),    for s′ = (i′, j′, k′) (voice arrival),
  λd / v(i′, j′, k′),    for s′ = (i′, j′, k′) (data arrival),
  i′µv / v(i′, j′, k′),  for s′ = (i′ − 1, j′, k′),
  j′µd / v(i′, j′, k′),  for s′ = (i′, j′ − 1, k′),
  k′µd / v(i′, j′, k′),  for s′ = (i′, j′, k′ − 1).

Values of i′, j′ and k′ as a function of different actions a (conditioned on events El) are summarized in Table 1.

Table 1: Transition Probability Table.

a|El | (i′, j′, k′)
A1|(E1||...||E5) | (i, j, k)
A2|E1 | (i + 1, j, k)
A2|E2 | (i, j + 1, k)
A3|E2 | (i, j, k + 1)
A4|E1 | (i + 1, j − 1, k + 1)
A5|(E3||E4) | (i, j + 1, k − 1)
A5|E5 | (i, j − 1, k + 1)


2.3 Rewards and Costs

Let the reward and cost functions per unit time for state s and action a be denoted by r(s, a) and c(s, a), respectively. We assume that before the data transfer starts, a Transmission Control Protocol (TCP) connection is established between the data user and the LTE BS (WiFi AP). Let RL,V and RL,D denote the bit rates of voice and data users in LTE, respectively. We assume that RL,D is the maximum data rate provided to the data users by the TCP pipe in LTE. In an LTE network, a voice user generally generates traffic at a Constant Bit Rate (CBR). Therefore, we assume RL,V to be a constant. RW,D(k) denotes the per-user data throughput of k users in WiFi, assuming the full buffer traffic WiFi model [20]. The calculation of RW,D(k) takes into account factors like the contention-based medium access of WiFi users, success and collision probabilities, as well as slot times for successful transmissions, idle slots and busy slots corresponding to collisions.

The reward rate for a state-action pair, which is a monotone increasing function of the number of voice and data users in LTE, is defined as the total throughput of the system in the state under the corresponding action. The entire description of the reward rates in state s = (i, j, k) for different events and actions is provided in Table 2.

Table 2: Reward Rate Table.

(a|El) | r(s, a)
(A1|E1) | iRL,V + jRL,D + kRW,D(k)
(A1|E2) | iRL,V + jRL,D + kRW,D(k)
(A1|E3) | (i − 1)RL,V + jRL,D + kRW,D(k)
(A1|E4) | iRL,V + (j − 1)RL,D + kRW,D(k)
(A1|E5) | iRL,V + jRL,D + (k − 1)RW,D(k − 1)
(A2|E1) | (i + 1)RL,V + jRL,D + kRW,D(k)
(A2|E2) | iRL,V + (j + 1)RL,D + kRW,D(k)
(A3|E2) | iRL,V + jRL,D + (k + 1)RW,D(k + 1)
(A4|E1) | (i + 1)RL,V + (j − 1)RL,D + (k + 1)RW,D(k + 1)
(A5|E3) | (i − 1)RL,V + (j + 1)RL,D + (k − 1)RW,D(k − 1)
(A5|E4) | iRL,V + jRL,D + (k − 1)RW,D(k)
(A5|E5) | iRL,V + (j − 1)RL,D + kRW,D(k)

Whenever the centralized controller blocks an incoming voice user, the cost rate is one unit; otherwise, it is zero. Therefore,

c(s, a) = 1 if the voice user is blocked, and 0 otherwise.
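For concreteness, the reward and cost rates can be computed as in the sketch below (illustrative Python; RW,D is passed in as a caller-supplied function since its closed form from [20] is not reproduced here, and all function names are ours). The reward rate in Table 2 is simply the total throughput evaluated in the configuration reached under the event and action.

```python
def throughput(state, R_LV, R_LD, R_WD):
    """Total system throughput i*R_LV + j*R_LD + k*R_WD(k) in state (i, j, k)."""
    i, j, k = state
    return i * R_LV + j * R_LD + k * R_WD(k)

def cost_rate(action, event):
    """Unit cost rate whenever an arriving voice user is blocked, zero otherwise."""
    return 1.0 if (event == "E1" and action == "A1") else 0.0

def lagrangian_reward(state_after, action, event, beta, R_LV, R_LD, R_WD):
    """r(s, a; beta) = r(s, a) - beta * c(s, a), as used later in Section 3.2."""
    return throughput(state_after, R_LV, R_LD, R_WD) - beta * cost_rate(action, event)
```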

3 PROBLEM DESCRIPTION

An association policy is a sequence of decision rules which describes the actions to be chosen at different states and decision epochs. Maximization of the total system throughput may result in blocking of voice users, as the contribution of voice users towards the total system throughput is less than that of data users. Therefore, to address the inherent trade-off between the total system throughput and the voice user blocking probability, we aim to obtain an association policy which maximizes the total system throughput, subject to a constraint on the voice user blocking probability. Since the arrival and departure of users can happen at any point in time, this problem can be formulated as a constrained continuous-time MDP. It is well known that a randomized stationary optimal policy, i.e., a mixture of pure policies with associated probabilities, exists [31].

3.1 Problem Formulation

Let M be the set of all memoryless policies. We assume that the Markov chains constructed under such policies are unichain and therefore have unique stationary distributions. Let the average reward and cost of the system over the infinite horizon under the policy M ∈ M be denoted by VM and BM, respectively. Let R(t) and C(t) be the total reward and cost of the system up to time t, respectively. For the constrained MDP problem, our objective is described as follows.

Maximize: VM = lim_{t→∞} (1/t) EM[R(t)],
subject to: BM = lim_{t→∞} (1/t) EM[C(t)] ≤ Bmax,     (1)

where EM denotes the expectation operator under policy M, and Bmax denotes the constraint on the blocking probability of voice users. Since the optimal policies are stationary, the limits in Equation (1) exist.

3.2 Equivalent Discrete-time MDP and Lagrangian Approach

To obtain the optimal policy using RVIA [32], we need to employ the Lagrangian approach [31]. In this approach, for a fixed value of the Lagrange Multiplier (LM) β, the reward function is given by

r(s, a; β) = r(s, a) − β c(s, a).

The dynamic programming equation described below provides the necessary condition for optimality in the case of an SMDP, ∀s ∈ S, where s′ ∈ S.

V(s) = max_a [r(s, a; β) + ∑_{s′} pss′(a) V(s′) − ρ t(s, a)],

where V(s), ρ and t(s, a) denote the value function of state s ∈ S, the optimal average reward and the mean transition time from state s under action a, respectively. Since the sojourn times are exponential, this is a special case of a continuous-time controlled Markov chain for which we have,

0 = max_a [r(s, a; β) − ρ + ∑_{s′} q(s′|s, a) V(s′)],     (2)

where q(s′|s, a) are controlled transition rates satisfying q(s′|s, a) ≥ 0 for s′ ≠ s and ∑_{s′} q(s′|s, a) = 0. If we scale all the transition rates by a positive scalar, it amounts to time scaling, which scales the average reward accordingly for every policy including the optimal one, but does not change the optimal policy. Thus, without loss of generality, we assume that −q(s|s, a) ∈ (0, 1) ∀a, implying in particular that q(s′|s, a) ∈ [0, 1] for s′ ≠ s. Adding V(s) to both sides of Equation (2), we have,

V(s) = max_a [r(s, a; β) − ρ + ∑_{s′} pss′(a) V(s′)],     (3)

where pss′(a) = q(s′|s, a) for s′ ≠ s and pss′(a) = 1 + q(s′|s, a) for s′ = s (recall that q(s|s, a) is negative). This equation is the DP equation for a discrete-time MDP (say {Xn}) with controlled transition probabilities pss′(a). From here onwards, we focus on the discrete-time setting as described in Equation (3).

For a fixed value of β, the following equation describes how RVIA can be used to solve the equivalent unconstrained maximization problem.

Vn+1(s) = max_a [r(s, a; β) + ∑_{s′} pss′(a) Vn(s′) − Vn(s∗)],     (4)

where Vn(.) is an estimate of the value function after the nth iteration and s∗ is an arbitrary but fixed state. Next, we aim to determine the value of β (= β∗, say) which maximizes the average expected reward, subject to the cost constraint. The following gradient descent algorithm describes the rule to update the value of β.

βk+1 = βk + (1/k)(Bπβk − Bmax),

where βk is the value of β in the kth iteration, and Bπβk is the voice user blocking probability at the kth iteration. Once the value of β∗ is determined, we obtain the optimal policy by perturbing β∗ by a small amount ε in both directions (policies πβ∗−ε and πβ∗+ε, say) with associated costs Bβ∗−ε and Bβ∗+ε, respectively. Finally, we have a randomized optimal policy where the policies πβ∗−ε and πβ∗+ε are chosen with probabilities p and (1 − p), such that

p Bβ∗−ε + (1 − p) Bβ∗+ε = Bmax.

We know [33] that the optimal stationary policy can be randomized in at most one state s ∈ S, where the optimal action is randomized between two actions.

The gradient descent scheme for β converges to β∗. In Equation (1), instead of considering limiting values of pointwise reward and cost, we consider limiting values of average reward and cost. Therefore, no relaxation of the constraint is performed, and the obtained solution is optimal.
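Assuming the uniformized transition matrices of Equation (3) are available, Equation (4) and the β update can be combined into the following offline sketch (illustrative Python; the matrix/vector encoding, the blocking-probability callback and the iteration counts are our assumptions, not the paper's implementation).

```python
import numpy as np

def rvia_fixed_beta(P, r, c, beta, s_star=0, iters=2000):
    """Relative value iteration, Equation (4), for a fixed Lagrange multiplier beta.

    P[a]: |S| x |S| transition matrix of the uniformized chain; r[a], c[a]: |S| vectors.
    Infeasible state-action pairs can be handled by assigning them a very low reward.
    """
    n_states = next(iter(P.values())).shape[0]
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = {a: (r[a] - beta * c[a]) + P[a] @ V for a in P}   # one-step look-ahead
        V = np.max(np.stack(list(Q.values())), axis=0) - V[s_star]
    policy = {s: max(Q, key=lambda a: Q[a][s]) for s in range(n_states)}
    return V, policy

def tune_beta(P, r, c, blocking_prob, B_max, outer_iters=200):
    """beta_{k+1} = beta_k + (1/k)(B_{pi_beta_k} - B_max); beta kept nonnegative."""
    beta = 0.0
    for k in range(1, outer_iters + 1):
        _, policy = rvia_fixed_beta(P, r, c, beta)
        B = blocking_prob(policy)          # caller computes the voice blocking probability
        beta = max(0.0, beta + (1.0 / k) * (B - B_max))
    return beta
```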

4 POST-DECISION STATE FRAMEWORK


Figure 2: Transition among PDSs and pre-decision states.

As discussed in the previous section, the optimal policy can be determined using RVIA (Equation (4)) if the transition probabilities pss′(a) are known beforehand. However, knowledge of the transition probabilities requires the knowledge of the statistics of the arrival processes of voice and data users. In reality, it may be difficult to obtain these parameters [23]–[25]. Therefore, we aim to devise an online scheme which does not require the knowledge of the statistics of the arrival processes and can still converge to the optimal solution. However, before describing the online algorithm, we introduce the notion of the PDS.

A PDS is defined to be an imaginary state of the system just after an action is chosen and before the unknown system dynamics (noise) enters the system. The idea behind the PDS is to factor the transition from one state to another into known and unknown components. The known component consists of the effect of the action taken in a state, whereas the unknown component comprises the unknown random dynamics of the system (viz., the arrivals and departures of voice and data users). Let us assume that the state of the system is s = (i, j, k) ∈ S at some decision epoch. Based on the chosen action, the system moves to the post-decision state s̃ ∈ S. Based on the next event, the system moves to the actual pre-decision state s′ = (i′, j′, k′) ∈ S. Throughout this paper, whenever we refer to a “state”, we always refer to a pre-decision state. An example transition involving pre-decision states and PDSs is illustrated in Fig. 2. Under action A3, the system makes a transition from state s = (i, j, k) to PDS s̃ = (i, j, k + 1). Under the next event E3, the system moves from the PDS s̃ to the pre-decision state s′ = (i − 1, j, k + 1). In other words, the known information regarding the transition from state s to s′ is incorporated in the PDS s̃. On the other hand, the transition from PDS s̃ to state s′ consists only of the unknown system dynamics, which is not included in the PDS. Let Ṽ(s̃) be the value function associated with the PDS s̃ ∈ S. Thus, we have,

Ṽ(s̃) = E_{s′}[V(s′)],

where the expectation E_{s′} is taken over all the pre-decision states which are reachable from the post-decision state s̃. Let the transition probability from PDS s̃ to pre-decision state s′ be denoted by p(s̃, s′). The post-decision Bellman equation for the post-decision state s̃ = (x, y, z) ∈ S is

Ṽ(s̃) = ∑_{s′} p(s̃, s′) max_a [r(s′, a; β) + Ṽ(s̃′)] − ρ,     (5)

where s̃′ is the post-decision state when action a is chosen in pre-decision state s′. Using Equation (5), the RVIA based update rule is as follows.

Ṽn+1(s̃) = ∑_{s′} p(s̃, s′) max_a [r(s′, a; β) + Ṽn(s̃′)] − Ṽn(s̃∗),
Ṽn+1(s̃′′) = Ṽn(s̃′′) ∀ s̃′′ ≠ s̃,     (6)

where s̃ is the PDS associated with the nth decision epoch and s̃∗ is a fixed PDS. The idea is to update one component at a time and keep the others unchanged. This idea is translated into an online algorithm, stated below, which updates the value function of the PDS associated with the current decision epoch.

5 ONLINE RAT SELECTION ALGORITHM

The system changes states based on different events, i.e., the arrival/departure of users, and the various actions taken in different states. Since we do not know the arrival rates of voice and data users, and the max operator occurs outside the averaging operation with respect to the transition probabilities of the underlying Markov chain in Equation (4), online implementation of the same is not feasible. However, in Equation (6), the expectation operation which resides outside the max operation can be replaced by averaging over time to estimate the optimal value functions of the PDSs. Using the theory of SA [34], we can remove the expectation operation in Equation (6) and still converge to optimality in policy by averaging over time.

Let g(n) be a positive step-size sequence possessing the following properties:

∑_{n=1}^{∞} g(n) = ∞;  ∑_{n=1}^{∞} (g(n))² < ∞.     (7)

Let h(n) be another step-size sequence having the same properties as in Equation (7), along with the additional property

lim_{n→∞} h(n)/g(n) = 0.     (8)

The key idea is to update the value function associated with one PDS at a time and keep the other PDS values unchanged. Let Yn be the PDS which is updated at the nth iteration. Also, define γ(s̃, n) = ∑_{m=0}^{n} I{s̃ = Ym}, i.e., the number of times PDS s̃ is updated till the nth iteration. The scheme is as follows.

Ṽn+1(s̃) = (1 − g(γ(s̃, n))) Ṽn(s̃) + g(γ(s̃, n)) {max_a [r(s′, a; β) + Ṽn(s̃′)] − Ṽn(s̃∗)},
Ṽn+1(s̃′′) = Ṽn(s̃′′) ∀ s̃′′ ≠ s̃.     (9)

However, the scheme (9) is a primal RVIA algorithm which solves a dynamic programming equation for a fixed value of the LM β. To obtain optimality in β, β is to be iterated along the timescale h(n), as described below.

βn+1 = Λ[βn + h(n)(Bn − Bmax)],     (10)

where the projection operator Λ projects the value of the LM onto the interval [0, L] for a large L > 0. Therefore, the primal-dual RVIA can be described as follows.

If the system is at PDS s̃ at the nth iteration, then do the following.

Ṽn+1(s̃) = (1 − g(γ(s̃, n))) Ṽn(s̃) + g(γ(s̃, n)) {max_a [r(s′, a; β) + Ṽn(s̃′)] − Ṽn(s̃∗)},
Ṽn+1(s̃′′) = Ṽn(s̃′′) ∀ s̃′′ ≠ s̃,     (11)

βn+1 = Λ[βn + h(n)(Bn − Bmax)].     (12)

The assumptions on g(n) and h(n) (Equations (7) and (8)) ensure that the two quantities are updated on two different timescales. The value of the LM is updated on a slower timescale than that of the value function. From the slower LM timescale point of view, Ṽ(s̃) appears to be equilibrated in accordance with the current LM value, and from the faster timescale view, the LM appears to be almost constant. This two-timescale scheme induces a “leader-follower” behavior. The slow (fast) timescale iterate does not interfere in the convergence of the fast (slow) timescale iterate.

Theorem 1. The schemes (11)-(12) converge to (Ṽ, β∗) “almost surely” (a.s.).

Proof. Proof is provided in Section 10.1.

Based on the analysis presented above, the two-timescale PDS online learning algorithm is described in Algorithm 1.

Algorithm 1 PDS learning algorithm
1: Initialize the number of iterations k ← 1, the value function vector Ṽ(s̃) ← 0, ∀s̃ ∈ S, and the LM β ← 0.
2: while TRUE do
3:   Determine the event (arrival/departure) in the current decision epoch.
4:   Choose the action a which maximizes the R.H.S. expression in Equation (9).
5:   Update the value function of PDS s̃ using (9).
6:   Update the LM according to Equation (10).
7:   Update s̃ ← s̃′ and k ← k + 1.
8: end while

As described in the algorithm, the value functions associated with different states, the LM and the number of iterations are initialized at the beginning. Based on a random event (arrival or departure of a voice/data user), the system state is initialized. When the current PDS of the system is s̃, the system chooses an action which maximizes the R.H.S. expression in Equation (9). Based on the observed reward in the current state s′, Ṽ(s̃) is updated along with the LM. This process is repeated for every decision epoch.
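A minimal sketch of the two-timescale loop of Algorithm 1 (Equations (9)-(10)) is given below in Python. The environment interface, the use of the instantaneous blocking cost as the sample Bn, and all identifiers are our own assumptions layered on the algorithm statement above.

```python
from collections import defaultdict

def pds_learning(env, g, h, B_max, L=100.0, n_epochs=100_000):
    """Two-timescale PDS learning (Algorithm 1): fast value updates, slow LM updates.

    Assumed environment interface (not part of the paper):
      env.initial_pds()              -> an initial post-decision state
      env.sample_event(pds)          -> (event, pre-decision state s')
      env.feasible_actions(event, s) -> list of feasible actions in s for this event
      env.pds_of(s, a)               -> post-decision state reached from s under a
      env.reward(s, a), env.cost(event, a) -> throughput reward and blocking cost
    g, h: step-size sequences satisfying (7) and (8).
    """
    V = defaultdict(float)                 # value function of post-decision states
    visits = defaultdict(int)              # gamma(PDS, n): per-PDS update counts
    beta = 0.0                             # Lagrange multiplier
    pds = pds_star = env.initial_pds()

    for n in range(1, n_epochs + 1):
        event, s = env.sample_event(pds)   # unknown dynamics are only observed, never modeled
        # Greedy action w.r.t. r(s, a; beta) + V(PDS(s, a)); no explicit exploration needed.
        best = max(env.feasible_actions(event, s),
                   key=lambda a: env.reward(s, a) - beta * env.cost(event, a)
                                 + V[env.pds_of(s, a)])
        target = (env.reward(s, best) - beta * env.cost(event, best)
                  + V[env.pds_of(s, best)] - V[pds_star])
        visits[pds] += 1
        step = g(visits[pds])
        V[pds] = (1 - step) * V[pds] + step * target              # Equation (9)
        B_n = env.cost(event, best)                               # blocking sample (assumed)
        beta = min(max(beta + h(n) * (B_n - B_max), 0.0), L)      # Equation (10), projected
        pds = env.pds_of(s, best)
    return V, beta
```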

6 STRUCTURE-AWARE ONLINE RAT SELECTION ALGORITHM

In this section, we propose a learning algorithm exploiting the threshold properties of the optimal policy. The PDS learning algorithm proposed in Section 5 does not take into account the threshold nature of the optimal policy and hence optimizes over the entire policy space. However, utilizing the threshold nature of the optimal policy, the policy space can be reduced significantly. To this end, we propose a structure-aware online learning algorithm which searches for the optimal policy only within the set of threshold policies, providing faster convergence than the PDS learning algorithm. Note that the independent methodologies developed in this section can be applied to any learning problem having similar structural properties.

6.1 Gradient Based Online Algorithm

Let the throughput increment in WiFi when the number of WiFi users increases from k to (k + 1) be denoted by R̄W,D(k). Therefore, R̄W,D(k) = (k + 1)RW,D(k + 1) − kRW,D(k). We assume the following.

Assumption 1. R̄W,D(k) is a non-increasing function of k. This assumption is in line with the full buffer traffic model [20].


A summary of the structural properties of the optimal policy is as follows. Detailed proofs of the structural properties can be found in [19].
1) Up to a threshold on the number of WiFi data users (say kth), serve data users in WiFi (A3) and then serve them using LTE (A2) until LTE is full. When LTE is full, i.e., (i + j) = C, the optimal policy is to serve all data users using WiFi until k = W, where an incoming data user is blocked.
2) For {∀i, j | (i + j) < C} and a voice user arrival, A4 (A2) is better than A2 (A4) if k < kth (k ≥ kth).
3) For {∀i, j | (i + j) < C} and a voice user arrival, if the optimal action in state (i, j, k) is blocking, then the optimal action in state (i + 1, j, k) is blocking.
4) For {∀i, j | (i + j) = C} and a voice user arrival, if the optimal action in state (i, j, k) is blocking, then the optimal action in state (i + 1, j − 1, k) is blocking.

Using the first two properties, we can eliminate a number of suboptimal actions. In the case of a data user arrival (event E2) and departures of voice and data users (events E3, E4 and E5), a single decision is involved. This may provide improved convergence because, contrary to an online algorithm without any knowledge of the structural properties, we no longer need to learn the optimal actions in some states. The only event where multiple decisions are involved is the voice user arrival (event E1). As stated in Properties 3 and 4, the value of the threshold on i, where the optimal action changes to blocking, is a function of j and k. Thus, if we have the knowledge of the values of the thresholds, we can characterize the policy completely. The idea is to optimize over the threshold vector (say θ) using an update rule, so that the value of the threshold vector θ converges to the optimal value. Before proceeding further, we determine the dimension of θ using the analysis presented below.

Using Properties 1 and 2, we can identify three regions.
1) 0 ≤ k < kth: Using Property 1, we have j = 0. For each value of k, we need to know the value of the threshold, which belongs to the set {0, 1, ..., C}.
2) k = kth: Using Property 1, k = kth =⇒ j ≥ 0. Thus, it boils down to computing a single threshold which belongs to the set {0, 1, ..., C − j} (Property 3), for each value of j (0 ≤ j < C). Also, we need to compute a single threshold for (i + j) = C (Property 4).
3) W > k > kth: Using Property 1, k > kth =⇒ (i + j) = C. Thus, using Property 4, we need to obtain the threshold of blocking for (W − kth − 1) values of k.
Therefore, the dimension of θ is (kth + C + W − kth) = (C + W).

Remark 3. When the state space becomes too large, it becomes cumbersome to represent a policy, since this requires tabulating the actions corresponding to each state. Due to the threshold nature of the optimal policy, the representation using the threshold vector becomes computationally efficient. Instead of storing the optimal action corresponding to each state, we just need to store (C + W) individual thresholds.
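As a small numerical illustration of Remark 3, the following sketch (illustrative Python with arbitrary example values of C and W) compares the number of states, for each of which a conventional policy table would store one action, against the (C + W)-dimensional threshold vector.

```python
def num_states(C, W):
    """Number of feasible states (i, j, k) with i + j <= C and k <= W."""
    return sum(1 for i in range(C + 1) for j in range(C + 1 - i) for k in range(W + 1))

def threshold_vector_size(C, W):
    """Dimension of the threshold vector theta: (C + W)."""
    return C + W

C, W = 10, 18          # example values only (C matches Table 4; W is arbitrary)
print(num_states(C, W), "states vs.", threshold_vector_size(C, W), "thresholds")
# 1254 states would each need a tabulated action; theta has only 28 components.
```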

We consider a class of threshold policies which can be described in terms of the threshold vector θ. The main idea behind the online algorithm is to compute the gradient of the system metric, i.e., the average reward of the system, with respect to θ and to improve the policy by updating the value of θ in the direction of the gradient. Therefore, following [35], one needs to compute the gradient of the system metric. To express the dependence of the parameters associated with the underlying Markov chain on θ explicitly, we need to redefine the notation. Let the transition probability associated with the Markov chain {Xn} as a function of θ be given by

Pss′(θ) = P(Xn+1 = s′ | Xn = s, θ).

Assumption 2. We assume that for every s, s′ ∈ S, Pss′(θ) is a bounded, twice differentiable function, and the first and second derivatives of Pss′(θ) are bounded.

Let the average reward of the Markov chain, the steady-state stationary probability of state s and the value function of state s (as functions of θ) be denoted by ρ(θ), π(s, θ) and V(s, θ), respectively. The following proposition provides a closed-form expression for the gradient of the average reward of the system. A proof can be found in [35]. Although [35] considers a generalized case where the reward function depends on θ, in our case the same proof holds with the exception that the gradient of the reward function is zero.

Proposition 1. Under the assumptions on Pss′(θ) stated above, we have,

∇ρ(θ) = ∑_{s∈S} π(s, θ) ∑_{s′∈S} ∇Pss′(θ) V(s′, θ).     (13)

Hence, we can compute the value of ∇ρ(θ) (or ∇Pss′(θ)) to construct an incremental scheme similar to a stochastic gradient algorithm for the threshold values, of the form

θn+1 = θn + h(n)∇ρ(θn),     (14)

where θn represents the value of the threshold vector in the nth iteration on the slower timescale h(n). Given a threshold vector θ, we assume that the state transition in state s = (i, j, k) is governed by P0(s′|s) if i < θ(T) and by P1(s′|s) otherwise, where θ(T) denotes the component of θ which corresponds to state s. Specifically,

T = k + j,  if (i + j) ≠ C,
T = C + k,  if (i + j) = C.     (15)

Therefore, according to the two-timescale gradient-based learning framework, on the faster timescale we have,

Vn+1(s, θ) = (1 − g(γ(s, n))) Vn(s, θ) + g(γ(s, n)) [r(s, a; β) + Vn(s′, θ) − Vn(s∗, θ)],
Vn+1(s′′, θ) = Vn(s′′, θ), ∀ s′′ ≠ s.     (16)

For example, if the current state is s = (i, 0, 0) and i < θn(0), then the state transition is determined by P0(s′|s) (accept in LTE (A2)), i.e., s′ = (i + 1, 0, 0); otherwise, s′ is determined by P1(s′|s) (blocking (A1)), i.e., s′ = (i, 0, 0). However, the value functions corresponding to other states are kept unchanged.

Note that the above scheme works for a fixed value of the threshold vector θ and the LM β. To obtain the optimal value of θ, θ is to be iterated along the slower timescale h(n). Note that although individual components of the threshold take discrete values, we interpolate them to the continuous domain to be able to apply the online update rule. Since the threshold policy is a step function (governed by P0(s′|s) up to a threshold and by P1(s′|s) thereafter) defined at discrete points, Assumption 2 is not satisfied at every point. Therefore, we approximate the threshold policy in state s by a randomized policy which is a function of θ (f(s, θ), say). We define

Pss′(θ) ≈ P1(s′|s) f(s, θ) + P0(s′|s) (1 − f(s, θ)),

where f(s, θ(T)) = e^{(i − θ(T) − 0.5)} / (1 + e^{(i − θ(T) − 0.5)}) in state s = (i, j, k) provides a convenient approximation to the step function.

Remark 4. The rationale behind the choice of this function is the fact that it is continuously differentiable, and the derivative is nonzero everywhere.
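A sketch of this smoothed threshold and of its derivative with respect to θ(T), which is the quantity ∇f used in the update of Equation (18) below, is given here (illustrative Python; function names are ours).

```python
import math

def f(i, theta_T):
    """Sigmoid approximation of the step at i = theta(T), shifted by 0.5."""
    return 1.0 / (1.0 + math.exp(-(i - theta_T - 0.5)))

def grad_f(i, theta_T):
    """Derivative of f with respect to theta(T): -f(1 - f); nonzero everywhere (Remark 4)."""
    val = f(i, theta_T)
    return -val * (1.0 - val)
```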

While designing an online update scheme for θ, instead of ∇ρ(θn) (see Equation (14)), we can evaluate ∇Pss′(θ). The steady-state stationary probabilities inside the summation in Equation (13) can be omitted by performing averaging over time. We have,

∇Pss′(θ) = (P1(s′|s) − P0(s′|s)) ∇f(s, θ).     (17)

On the right hand side of Equation (17), we incorporate a multiplication factor of 1/2, since multiplication by a constant term does not alter the online scheme. The physical significance of this operation is that at any iteration, we have state transitions according to P0(.|.) and P1(.|.) with equal probabilities. The update of θ on the slower timescale h(n) is as follows.

θn+1(T) = ∆T[θn(T) + h(n) ∇f(s, θn(T)) (−1)^{αn} Vn(s′, θn)],
θn+1(T′) = θn(T′) ∀ T′ ≠ T,     (18)

where αn is a random variable which takes the values 0 and 1 with equal probabilities. When it takes the value 0, s′ is determined by P1(s′|s), otherwise by P0(s′|s). The averaging property of SA then leads to the effective drift (17). Depending on the state visited, the Tth component of the vector θ is updated as shown in Equation (18). For example, if the current state is (1, 0, 0), then θn(0) is updated (see Equation (15)), and the other components are kept unchanged. The projection operator ∆T ensures that the iterates remain bounded in the interval [0, M(T)], where

M(T) = C − (T − kth),  if kth ≤ T ≤ (kth + C),
M(T) = C,  otherwise.     (19)

Similar to Algorithm 1, to obtain the optimal value of β, β is to be iterated along the same timescale h(n), as specified below.

βn+1 = Λ[βn + h(n)(Bn − Bmax)].     (20)

Remark 5. The dynamics of the LM and the threshold vector do not depend on each other directly. However, both the β and θ iterates depend on the value functions on the faster timescale. Therefore, θ is updated on the same timescale as β, without requiring a third timescale.

Theorem 2. The schemes (16), (18) and (20) converge to optimality a.s.

Proof. Proof is provided in Section 10.2.

Based on the analysis described above, the structure-aware online learning algorithm is stated in Algorithm 2.

Algorithm 2 Structure-aware learning algorithm
1: Initialize the number of iterations k ← 1, the value function V(s) ← 0, ∀s ∈ S, the LM β ← 0 and the threshold vector θ ← 0.
2: while TRUE do
3:   Choose the action a given by the current value of the threshold vector θ.
4:   Update the value function of state s using (16).
5:   Update the LM according to Equation (20).
6:   Update the threshold vector according to Equation (18).
7:   Update s ← s′ and k ← k + 1.
8: end while

As described in the algorithm, the value functions associated with different states, the LM, the threshold vector and the number of iterations are initialized at the beginning. When the current state of the system is s, the system chooses the action given by the current value of the threshold vector. Based on the observed reward, V(s) and θ are updated along with the LM. This process is repeated for every decision epoch.
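A compact sketch of the loop of Algorithm 2 (Equations (16), (18) and (20)) follows. The environment interface, the use of the instantaneous blocking cost as Bn, and the fact that the threshold component T(s) is updated at every epoch are simplifying assumptions of ours; the paper's scheme distinguishes event types, which we omit for brevity.

```python
import math
import random
from collections import defaultdict

def structure_aware_learning(env, C, W, g, h, B_max, L=100.0, n_epochs=100_000):
    """Structure-aware learning (Algorithm 2): value, threshold and LM updates.

    Assumed environment interface (not part of the paper):
      env.initial_state()   -> initial state (i, j, k)
      env.index_T(s)        -> component index T of theta for state s, cf. Equation (15)
      env.cap(T)            -> upper bound M(T) of Equation (19)
      env.step(s, use_P1)   -> (reward r(s, a), blocking cost c(s, a), next state s'),
                               with the action implied by the kernel P1 or P0
    """
    V = defaultdict(float)
    visits = defaultdict(int)
    theta = [0.0] * (C + W)                      # threshold vector, dimension (C + W)
    beta = 0.0
    s = s_star = env.initial_state()

    def f(i, th):                                # smoothed threshold (Remark 4)
        return 1.0 / (1.0 + math.exp(-(i - th - 0.5)))

    for n in range(1, n_epochs + 1):
        i, T = s[0], env.index_T(s)
        alpha = random.randint(0, 1)             # P1 if alpha == 0, else P0 (equal probability)
        r, cost, s_next = env.step(s, use_P1=(alpha == 0))
        visits[s] += 1
        step = g(visits[s])
        V[s] = (1 - step) * V[s] + step * (r - beta * cost + V[s_next] - V[s_star])  # Eq. (16)
        grad = -f(i, theta[T]) * (1.0 - f(i, theta[T]))           # d f / d theta(T)
        delta = h(n) * grad * ((-1) ** alpha) * V[s_next]
        theta[T] = min(max(theta[T] + delta, 0.0), env.cap(T))    # Eq. (18), projected
        beta = min(max(beta + h(n) * (cost - B_max), 0.0), L)     # Eq. (20), projected
        s = s_next
    return theta, V, beta
```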

7 COMPARISON OF COMPLEXITIES OF LEARNING ALGORITHMS

In this section, we provide a comparison of the storage and computational complexities of traditional Q-learning [1] and the proposed PDS learning and structure-aware learning algorithms. We summarize the storage and computational complexities of these schemes in Table 3.

The Q-learning algorithm [1] stores value functions for every state-action pair, i.e., |S| × |A| values, and updates the value function of one state-action pair at a time. While updating the value function, Q-learning evaluates |A| functions. The PDS learning algorithm (see Equation (11)) requires storing |S| PDS value functions and the feasible actions in every state, i.e., |S| values. While updating the PDS value function, the PDS learning algorithm evaluates |A| functions, resulting in a per-iteration complexity of |A|.

In the case of the structure-aware learning algorithm, we no longer need to store |S| value functions. Rather, by virtue of the threshold nature of the optimal policy, we consider three cases.
1) 0 ≤ k < kth: Since we have j = 0, for each value of k, we need to store (C + 1) value functions.
2) k = kth: k = kth =⇒ j ≥ 0. Thus, we need to store (C + 1 − j) value functions for each value of j (0 ≤ j ≤ C).
3) W ≥ k > kth: k > kth =⇒ (i + j) = C. Therefore, we need to store the value functions of (C + 1) states for each value of k.

Therefore, the total number of value functions which need to be stored is (C + 1)kth + (C + 1)(C + 2)/2 + (C + 1)(W − kth), which is equal to (C + 1)(C + 2)/2 + (C + 1)W. Note that this is a considerable reduction in storage complexity in comparison to the PDS learning scheme, which has a storage complexity of O(C²W). For example, when W = C, the storage complexity reduces from O(C³) to O(C²). Furthermore, the feasible actions corresponding to each state need not be stored separately since the threshold vector completely characterizes the policy. The per-iteration computational complexity of this scheme (see Equation (16)) is O(1). This scheme also involves updating a single component of the threshold vector (Equation (18)) with a computational complexity of O(1).

Table 3: Computational and storage complexities of different algorithms.

Algorithm | Storage complexity | Computational complexity
Q-learning [1] | O(|S| × |A|) | O(|A|)
PDS learning | O(|S|) = O(C²W) | O(|A|)
Structure-aware learning | O(C² + CW) | O(1)
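The storage counts behind Table 3 can also be checked numerically, as in the short sketch below (illustrative Python with arbitrary example values of C, W and kth).

```python
def full_state_count(C, W):
    """|S|: all (i, j, k) with i + j <= C and k <= W."""
    return (C + 1) * (C + 2) // 2 * (W + 1)

def structured_state_count(C, W, k_th):
    """Value functions actually stored by the structure-aware scheme (Section 7)."""
    region1 = (C + 1) * k_th                        # 0 <= k < k_th, j = 0
    region2 = sum(C + 1 - j for j in range(C + 1))  # k = k_th
    region3 = (C + 1) * (W - k_th)                  # k_th < k <= W, i + j = C
    return region1 + region2 + region3

C, W, k_th = 10, 18, 6                              # example values only
assert structured_state_count(C, W, k_th) == (C + 1) * (C + 2) // 2 + (C + 1) * W
print(full_state_count(C, W), "vs.", structured_state_count(C, W, k_th))   # 1254 vs. 264
```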

8 SIMULATION RESULTS

In this section, the proposed PDS learning and structure-aware learning algorithms are simulated in ns-3 to characterize and compare their convergence behaviors. The convergence rates of the proposed algorithms are compared with that of Q-learning, as proposed in our earlier work [1]. Simulation results establish that the proposed PDS learning algorithm converges faster than Q-learning. Furthermore, it is observed that the knowledge of the structural properties indeed reduces the convergence time.

8.1 Simulation Model and Evaluation Methodology

The simulation setup comprises a 3GPP LTE BS and an operator-deployed IEEE 802.11g WiFi AP. All users are assumed to be stationary. Data users are distributed uniformly within a 30 m radius of the WiFi AP, which is approximately 50 m away from the LTE BS. The LTE and WiFi network parameters used in the simulations are chosen based on the 3GPP models [36], [37] and the saturation throughput model [20] for IEEE 802.11g WiFi [6], and are described in Tables 4 and 5. We consider the generation of CBR uplink traffic for voice and data users in LTE. This is implemented in ns-3 using an application (similar to the ON/OFF application) developed by us.

For the update of the PDS value functions, the threshold vector and the LM, we consider g(n) = 1/(⌊n/1000⌋ + 2)^0.6 and h(n) = 10/n.
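For reference, these step-size sequences can be transcribed directly (illustrative Python; the floor keeps g(n) piecewise constant over blocks of 1000 iterations).

```python
def g(n):
    """Faster-timescale step size: 1 / (floor(n / 1000) + 2) ** 0.6."""
    return 1.0 / (n // 1000 + 2) ** 0.6

def h(n):
    """Slower-timescale step size: 10 / n; h(n)/g(n) -> 0, consistent with (8)."""
    return 10.0 / n
```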

Table 4: LTE Network Model.

Parameter | Value
Maximum capacity | 10 users
Voice bit rate of a single user | 20 kbps
Data bit rate of a single user | 5 Mbps
Voice packet payload | 50 bits
Data packet payload | 600 bits
Tx power for BS and MS | 46 dBm and 23 dBm
Noise figure for BS and MS | 5 dB and 9 dB
Antenna height for BS and MS | 32 m and 1.5 m
Antenna parameter for BS and MS | Isotropic antenna
Path loss (R in km) | 128.1 + 37.6 log(R)

Table 5: WiFi Network Model.

Parameter | Value
Channel bit rate | 54 Mbps
UDP header | 224 bits
Packet payload | 1500 bytes
Slot duration | 20 µs
Short inter-frame space (SIFS) | 10 µs
Distributed Coordination Function IFS (DIFS) | 50 µs
Minimum acceptable per-user throughput | 3.5 Mbps
Tx power for AP | 23 dBm
Noise figure for AP | 4 dB
Antenna height for AP | 2.5 m
Antenna parameter | Isotropic antenna
Path loss (R in km) | 140.3 + 36.7 log(R)

8.2 Convergence Analysis

Fig. (3a) and (3b) illustrate how the Q-learning, PDS learning and structure-aware learning algorithms converge with an increasing number of iterations (n). We keep λv = λd = 1. It is evident that both proposed algorithms outperform Q-learning in terms of convergence speed. Contrary to PDS learning, even after a considerable number of iterations, Q-learning explores different actions with a finite probability, thereby reducing the convergence speed.

The knowledge of structure reduces the feasible policy space. Therefore, the structure-aware learning algorithm offers faster convergence to the optimal policy. We no longer need to learn the optimal actions in the subset of states where the optimal policy is determined using structural properties. As observed in Fig. 3a and 3b, the number of iterations before convergence reduces from 2000 to 300 and from 1000 to 300, respectively. Note that a smaller number of iterations translates into less time for convergence.

Fig. 4a and 4b depict the convergence of the LM as n increases. It is evident that as the number of iterations increases, the LM converges for both schemes.

In Table 6, we report the average time taken by a single iteration of the Q-learning, PDS learning and structure-aware learning algorithms, respectively, in simulations. The average per-iteration time of an algorithm reflects its per-iteration computational complexity. As observed from Table 6, the time taken by the structure-aware learning algorithm is the least. Also, the average per-iteration time of the PDS learning algorithm is slightly lower than that of Q-learning. These results corroborate the analysis summarized in Table 3.

Table 6: Average per-iteration time of different algorithms.

Algorithm                 | Average per-iteration time (µs)
Q-learning [1]            | 49.66
PDS learning              | 40.07
Structure-aware learning  | 32.119


Figure 3: Plot of total system throughput (Mbps) vs. number of iterations (n) for different algorithms (structure-aware learning, PDS learning and Q-learning). (a) µv = µd = 1/20; (b) µv = µd = 1/30.

Figure 4: Plot of the Lagrange Multiplier (LM) vs. number of iterations (n) for µv = µd = 1/20 and µv = µd = 1/30. (a) PDS learning; (b) structure-aware learning.

Figure 5: Plot of total system throughput (Mbps) vs. the sum of step sizes up to the nth iteration (Σ_{k=1}^n g(k)) for PDS learning and structure-aware learning. (a) λv = λd = 1, µv = µd = 1/20; (b) λv = λd = 1, µv = µd = 1/30.

8.3 Stopping Criteria

While simulating an online learning algorithm, in practical cases we may not want to wait for actual convergence. Instead, we can observe the total system throughput over a moving window of the sum of step sizes up to the present iteration, Σ_{k=1}^n g(k), and compute the ratio of the minimum to the maximum value of total system throughput over this window. We set the window size to 100. If the ratio exceeds 0.99, we conclude that the stopping criterion is met, i.e., the obtained policy is in a close neighborhood of the optimal policy with high probability. In Fig. 5a and 5b, total system throughput is plotted against Σ_{k=1}^n g(k). Note that Σ_{k=1}^n g(k) is chosen instead of n to decouple the effect of the diminishing step size while analyzing the convergence behavior of the proposed schemes. In other words, this parameter is selected to establish that the convergence of the algorithms is indeed due to convergence to the optimal policy and not due to very small values of the step size g(n) as n becomes large. We observe in Fig. 5a and 5b that the PDS learning algorithm meets the stopping criterion when Σ_{k=1}^n g(k) reaches 700, which corresponds to approximately 1000 iterations. On the other hand, the structure-aware learning algorithm meets the stopping criterion when Σ_{k=1}^n g(k) reaches 300, which translates into almost 450 iterations.
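The sketch below illustrates one way such a stopping rule can be implemented, assuming the learner exposes its throughput estimate after every iteration and interpreting the window as the 100 most recent samples; the function and parameter names are illustrative only.

```python
from collections import deque

def run_until_stable(step, g, window=100, ratio_threshold=0.99, max_iter=10**6):
    """Run an online learner until its throughput estimate is stable.

    `step(n)` performs one learning iteration and returns the current total
    system throughput; `g(n)` is the fast-timescale step size.  The loop
    tracks the throughput over a sliding window and stops once the ratio of
    the window's minimum to its maximum exceeds `ratio_threshold`.
    """
    recent = deque(maxlen=window)
    step_size_sum = 0.0
    for n in range(1, max_iter + 1):
        recent.append(step(n))
        step_size_sum += g(n)  # running value of sum_{k<=n} g(k), as plotted in Fig. 5
        if len(recent) == window and max(recent) > 0:
            if min(recent) / max(recent) > ratio_threshold:
                return n, step_size_sum
    return max_iter, step_size_sum
```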

8.4 Consideration of Realistic Scenarios

While the above simulation results provide significant insight into the convergence behavior of the proposed algorithms relative to traditional Q-learning, in this section we evaluate the performance of the proposed algorithms in realistic scenarios. We compare the total system throughput and voice user blocking probability of the proposed algorithms with those of the on-the-spot offloading [38] and LTE-preferred [19] schemes.

Although in the system model (see Section 2) we consider single resource block allocation to LTE data users, in simulations we relax this assumption and consider proportional fair scheduling at the LTE BS, which dynamically assigns resources to users based on their bandwidth demands. Users randomly generate individual bandwidth demands. However, we assume that the maximum data rate achievable by a single data user is 5 Mbps and that the bottleneck is in the access network. Furthermore, the previous subsections do not consider channel fading effects in LTE and WiFi. To address this, whenever we choose an action involving offloading of a user from one RAT to another (A4 and A5), the user with the worst channel is selected for offloading. For example, whenever A4 is chosen and we offload a data user from LTE to WiFi, we always choose the data user with the lowest Signal-to-Noise Ratio (SNR). Since a user with a poor channel generally contributes little throughput to the system, the user with the worst channel is chosen for offloading. We consider the Extended Pedestrian A model [39] for fading in LTE and Rayleigh fading for WiFi.
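The offloading-candidate selection described above amounts to a simple argmin over the per-user SNR; the following sketch illustrates it, with made-up data structures standing in for the per-RAT user lists maintained in our simulation.

```python
def pick_offload_candidate(users):
    """Return the user with the worst channel (lowest SNR in dB).

    `users` is a list of dicts of the form {"id": ..., "snr_db": ...}; this
    mirrors the rule used for actions A4 and A5: always offload the user
    whose channel currently contributes the least to system throughput.
    """
    if not users:
        return None
    return min(users, key=lambda u: u["snr_db"])

# Example: the data user with 3.1 dB SNR is offloaded from LTE to WiFi.
lte_data_users = [{"id": 1, "snr_db": 12.4}, {"id": 2, "snr_db": 3.1}, {"id": 3, "snr_db": 8.7}]
print(pick_offload_candidate(lte_data_users))  # {'id': 2, 'snr_db': 3.1}
```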

In on-the-spot offloading [38], data users always choose WiFi unless WiFi coverage is absent. Therefore, in our system model, on-the-spot offloading associates data users with WiFi until the WiFi capacity is reached. Voice users are associated with LTE, and when LTE reaches its capacity, voice users are blocked. In the LTE-preferred scheme [19], voice and data users are associated with LTE until LTE reaches its capacity. When LTE is at capacity and a voice user arrives, the voice user is blocked if there is no data user in LTE; otherwise, one existing data user is offloaded to WiFi, provided capacity is available in WiFi. Upon the departure of an existing voice or data user from LTE, an existing data user in WiFi, if any, is offloaded back to LTE. While offloading, we always choose the user with the worst channel.
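To make the two baselines concrete, the sketch below encodes their admission rules as plain functions using hypothetical user counters; it paraphrases the description above rather than reproducing code from [38] or [19], and the behavior of on-the-spot offloading when WiFi is full is an assumption.

```python
def on_the_spot_admit(user_type, lte_users, wifi_users, C_lte, C_wifi):
    """On-the-spot offloading: data users go to WiFi while it has room,
    voice users go to LTE and are blocked once LTE is full."""
    if user_type == "data":
        # Assumption: a data user finding WiFi full is not admitted.
        return "wifi" if wifi_users < C_wifi else "block"
    return "lte" if lte_users < C_lte else "block"

def lte_preferred_admit_voice(lte_voice, lte_data, wifi_users, C_lte, C_wifi):
    """LTE-preferred: admit voice in LTE; if LTE is full, push one data user
    to WiFi (when WiFi has room), otherwise block the voice user."""
    if lte_voice + lte_data < C_lte:
        return "admit_in_lte"
    if lte_data > 0 and wifi_users < C_wifi:
        return "offload_one_data_user_then_admit"
    return "block"
```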

8.4.1 Voice User Arrival Rate Variation

Fig. 6a depicts the blocking percentage of voice users for on-the-spot offloading, LTE-preferred and the proposed algorithms for varying λv. Since on-the-spot offloading blocks voice users when LTE reaches its capacity, the blocking probability of voice users increases with λv. Since the PDS learning and structure-aware learning algorithms learn the states in which blocking is to be chosen as the optimal action, the voice user blocking probabilities corresponding to these algorithms converge to the same value. Since the proposed algorithms may block voice users even when the LTE system has not reached its capacity, their blocking probability values are marginally higher than that of on-the-spot offloading. Voice users may be blocked to save LTE resources for future data user arrivals, which contribute more throughput to the system. The LTE-preferred scheme blocks a voice user when the LTE system is full and there is no data user in LTE. Therefore, on-the-spot offloading and the LTE-preferred scheme provide similar blocking probability performance.

Fig. 6b illustrates the total system throughput of different algorithms under varying λv. With an increase in λv, the average number of voice users in the system increases while the number of WiFi data users remains the same. Therefore, in the case of on-the-spot offloading, the total system throughput increases with λv. However, since the throughput of voice users is small compared to that of data users, the rate of increase is very small. The PDS learning and structure-aware learning algorithms learn the optimal policy, which performs significant load balancing via A4 and A5. Also, while offloading, the proposed algorithms take the channel state of users into account. Thus, these algorithms outperform on-the-spot offloading in terms of total system throughput, with the performance improvement varying from 10.72% (for λv = 0.13) to 28.72% (for λv = 0.07). With an increase in λv, the LTE-preferred scheme starts offloading data users to WiFi to accommodate incoming voice users. Under low WiFi load, the total throughput of the system increases. As the WiFi load increases, the rate of increase diminishes. However, both the proposed algorithms perform better than the LTE-preferred scheme.

8.4.2 Data User Arrival Rate Variation

As observed in Fig. 6c, since in on-the-spot offloading data and voice users are served using WiFi and LTE, respectively, changes in λd do not impact the blocking probability of voice users. The performance of both the PDS learning and structure-aware learning algorithms is similar to that of on-the-spot offloading. Due to the presence of a constraint on the voice user blocking probability, most of the voice users are blocked when the LTE system reaches capacity. Therefore, the proposed algorithms block voice users mostly in the same decision epochs as on-the-spot offloading. Since the LTE-preferred scheme blocks voice users only when LTE has no available capacity and there is no data user in LTE, the blocking probabilities of the LTE-preferred scheme and on-the-spot offloading are the same.


Figure 6: Plot of voice user blocking fraction and total system throughput for different algorithms (on-the-spot offloading, LTE-preferred, PDS learning, structure-aware learning). (a) Voice user blocking percentage vs. λv; (b) total system throughput vs. λv (λd = 1/20, µv = 1/60 and µd = 1/10); (c) voice user blocking percentage vs. λd; (d) total system throughput vs. λd (λv = 1/6, µv = 1/60 and µd = 1/10).

Since on-the-spot offloading associates data users with WiFi, with an increase in λd the load in WiFi increases. As a result, as λd increases (see Fig. 6d), the effects of contention and channel fading reduce the rate of increase of throughput. Both the proposed algorithms perform better than on-the-spot offloading by virtue of optimal RAT selection and load balancing actions, which reduce the effect of contention in WiFi. Also, while offloading, the proposed algorithms take the channel state of users into account. Therefore, the proposed algorithms outperform on-the-spot offloading, with the performance improvement varying from 20% (for λd = 0.1) to 54.6% (for λd = 0.5). As λd increases, the LTE-preferred scheme starts offloading more data users to WiFi, and hence the system throughput increases. Under high λd, the effect of contention is smaller than that of on-the-spot offloading, resulting in a better performance than on-the-spot offloading. However, the proposed algorithms perform better than the LTE-preferred scheme.

8.5 Consideration of User Mobility

In this section, we evaluate how the proposed algorithms perform in comparison to on-the-spot offloading and the LTE-preferred scheme in the face of user mobility. In addition to the ns-3 simulation settings described in the previous section, we consider the random waypoint model [40] for the mobility of voice and data users. As evident from Fig. 7a and Fig. 7b, although the total system throughput provided by the different algorithms changes due to mobility, the comparative performance of the proposed algorithms with respect to on-the-spot offloading and the LTE-preferred scheme remains the same. Since mobility does not have any impact on the blocking probability of voice users, the blocking probability performance of the considered algorithms is exactly the same as that shown in Fig. 6a and 6c.

Figure 7: Plot of total system throughput for different algorithms under user mobility. (a) Total system throughput vs. λv (λd = 1/20, µv = 1/60 and µd = 1/10); (b) total system throughput vs. λd (λv = 1/6, µv = 1/60 and µd = 1/10).

9 CONCLUSIONS

In this paper, we have proposed a PDS learning algorithm which can be implemented online without knowledge of the statistics of the arrival processes. It has been proved that the algorithm converges to the optimal policy. Furthermore, another online algorithm, which exploits the threshold structure of the optimal policy, has been proposed. The knowledge of the threshold structure provides improvements in computational and storage complexity and in convergence time. The proposed algorithm provides a novel framework that can be applied for designing online learning algorithms for other general problems and is of independent interest. We have proved that the structure-aware learning algorithm converges to the globally optimal threshold vector. Simulation results have been presented to exhibit how the PDS paradigm and the knowledge of structural properties provide improvements in convergence time over traditional online association algorithms. Moreover, simulation results establish that the proposed schemes outperform the on-the-spot offloading and LTE-preferred schemes under realistic network scenarios.

10 PROOFS

10.1 Proof of Theorem 1

The idea of the proof is similar to that of the online algorithm in [41]. The arguments are developed on the basis of the standard Ordinary Differential Equation (ODE) approach [34, Chapter 2] for analyzing stochastic approximation algorithms, which can be viewed as noisy discretizations of a limiting ODE. The learning parameters are considered as discrete time steps, and the iterates, when linearly interpolated, are compared with the trajectory of the ODE. Standard assumptions on step sizes, viz., Equations (7) and (8), ensure that the discretization error and the error due to noise are asymptotically negligible. Therefore, the iterates track the behavior of the associated ODEs asymptotically and hence converge to the globally asymptotically stable equilibrium a.s. However, for the sake of completeness, we restate the details of the proof for the proposed algorithm.

We rewrite the updates on the fast and slow timescales as

V_{n+1}(s) = V_n(s) + g(γ(s, n)){max_a [r(s', a; β) + V_n(s')] − V_n(s*) − V_n(s)},
V_{n+1}(s'') = V_n(s''), ∀ s'' ≠ s,   (21)

and

β_{n+1} = Λ[β_n + h(n)(B_n − B_max)].   (22)
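For intuition, a minimal Python sketch of these coupled updates is given below. It assumes a simulator-style interface (reward, pds_of and constraint_cost are placeholder helpers we introduce here, along with dictionary-based storage) and is intended only to illustrate the two-timescale structure of Equations (21) and (22), not to reproduce the full algorithm.

```python
def two_timescale_step(V, beta, s, s_next, actions, n, counts,
                       reward, pds_of, constraint_cost,
                       g, h, s_star, B_max, L):
    """One iteration of the coupled PDS value-function / LM updates.

    V: dict mapping each PDS to its value estimate; beta: Lagrange multiplier.
    The value of the current PDS s moves on the fast timescale g(.), while
    beta moves on the slow timescale h(.) and is projected onto [0, L].
    reward(s_next, a, beta) is the Lagrangian reward and pds_of(s_next, a)
    the PDS reached by taking action a in the observed next state s_next.
    """
    counts[s] = counts.get(s, 0) + 1  # per-PDS clock, plays the role of gamma(s, n)
    best = max(reward(s_next, a, beta) + V[pds_of(s_next, a)] for a in actions)
    V[s] += g(counts[s]) * (best - V[s_star] - V[s])         # cf. Eq. (21)
    B_n = constraint_cost(s_next)   # e.g., 1 if the arriving voice user was blocked
    beta = min(max(beta + h(n) * (B_n - B_max), 0.0), L)     # cf. Eq. (22), projection
    return V, beta
```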

Due to the condition h(n)/g(n) → 0, we have a two-timescale leader-follower behavior. The PDS value functions are updated on the faster timescale and the LM on the slower one. Let H1 : R^|S| → R^|S| be a map defined by (for s ∈ S)

H1(s) = Σ_{s'} p(s, s') max_a [r(s', a; β) + V_n(s')] − V_n(s*).   (23)

Note that the knowledge of p(s, s') is not required for this algorithm; we use it in Equation (23) only for the sake of analysis. Following the two-timescale analysis in [34, Section 6.1], we first analyze Equation (21) keeping the LM constant. This translates into analyzing the following limiting ODE, which tracks Equation (21):

V̇(t) = H1(V(t)) − V(t).   (24)

Following [42], as t → ∞, V(t) converges to the unique fixed point of H1(·), i.e., the V such that

H1(V) = V.

Note that this scheme is similar to those in [42], [43].

Lemma 1. The PDS value function iterates and the LM iterates are bounded a.s.

Proof. Consider a mapping H0 : R^|S| → R^|S| as follows:

H0(s) = Σ_{s'} p(s, s')[max_a V_n(s') − V_n(s*)].   (25)

Note that Equation (25) corresponds to Equation (23) with zero immediate reward. Now, we have lim_{b→∞} H1(bV)/b = H0(V). Also, the globally asymptotically stable equilibrium of the limiting ODE V̇(t) = H0(V(t)) − V(t), which is a scaled limit of the original ODE (24), is the origin (using the arguments of [41, Lemma 1]). The boundedness of V follows from [44].

The physical interpretation of this approach, as stated in [34], is as follows. If the iterates of the PDS value functions become unbounded along a subsequence, then a suitably scaled version of the original ODE approximately follows the limiting ODE. Since the origin is the globally asymptotically stable equilibrium of the scaled ODE, the scaled ODE must return towards the origin. Therefore, the original PDS iterates also begin to move towards a bounded set, ensuring stability of the iterates.

The iterates of β are bounded since they are constrained to remain in [0, L] by definition.

Lemma 2. We have V_n → V^{β_n} a.s., where V^{β_n} is the PDS value function for β = β_n.


Proof. It can be seen that the LM is varied on a much slower timescale than V. Therefore, the V iterations consider the LM to be almost constant. Hence, the β iterations can be written as β_{n+1} = β_n + γ(n), where γ(n) = O(h(n)) = o(g(n)). Thus, the limiting ODEs associated with the iterates are V̇(t) = H1(V(t)) − V(t) and β̇(t) = 0. Since β̇(t) = 0, for analyzing the V(·) iterates it is sufficient to consider the ODE V̇(t) = H1(V(t)) − V(t) for a fixed value of the LM β. The rest of the proof follows from [45].

The lemma presented next proves that the LM iterates converge to the optimal LM β*, and hence (V_n, β_n) converges to (V, β*).

Lemma 3. The LM iterates converge to β*.

Proof. The proof outline is similar to that of [41]. Note that in Equations (11) and (12), the iterations for the PDSs determine the maximum of a Lagrangian for the policy while keeping the LM almost constant. Therefore, the limiting ODE for the LM iterations is the same as a gradient descent on the Lagrangian minimized over the PDS value functions. The result follows from the "envelope theorem" [45], which allows the interchange of the min and gradient operators.

For each fixed policy, the reward is linear in β with negative slope. Thus V(·), which by a standard argument is the upper envelope, is piecewise linear with finitely many linear pieces and convex in β in each component. Let the stationary randomized policy M have a unique stationary distribution η^M. Let E_M[·] denote the expectation under a stationary randomized policy M. Let L(M, β) = E_M[r(X_n, Z_n) − β c(X_n, Z_n)] and I(β) = max_M L(M, β), where the controlled Markov chain {X_n} on a finite state space S is controlled by a control process {Z_n} taking values in a finite action space. Therefore, I is piecewise linear and convex. Define z(β) = Σ_{s,a} η^{M_β}(s, a)[r(s, a) − β c(s, a)], where M_β is an optimal stationary randomized policy when the multiplier β is used. The limiting ODE is

β̇(t) = z(β(t)).

According to the results in [34, Section 10.2], stochastic gradient descent for a convex function tracks this ODE and hence converges to the optimal β*. Thus, the desired β* is the global minimum, as there is no local minimum which is not a global minimum.

10.2 Proof of Theorem 2

The proof approach is similar to that adopted in the proof of Theorem 1; we describe only the arguments which are specific to this proof. We rewrite the updates of the iterates as

V_{n+1}(s, θ) = V_n(s, θ) + g(γ(s, n)){r(s, a; β) + V_n(s', θ) − V_n(s*, θ) − V_n(s, θ)},
V_{n+1}(s'', θ) = V_n(s'', θ), ∀ s'' ≠ s,   (26)

θ_{n+1}(T) = Δ_T[θ_n(T) + h(n) ∇f(s, θ_n(T)) (−1)^{α_n} V_n(s', θ_n)],
θ_{n+1}(T') = θ_n(T'), ∀ T' ≠ T,   (27)

β_{n+1} = Λ[β_n + h(n)(B_n − B_max)].   (28)
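Purely as an illustration of the projected-gradient form of Equation (27), a rough sketch follows; grad_f, alpha_n and the projection interval [0, T_max] are placeholders we introduce here (the actual gradient term and projection set Δ_T come from the randomized threshold policy defined earlier in the paper).

```python
def threshold_update(theta, T, n, h, grad_f, alpha_n, V_next, T_max):
    """Projected update of a single component theta[T] of the threshold vector.

    Mirrors the form of Eq. (27): perturb theta(T) by
    h(n) * grad_f * (-1)**alpha_n * V_next and project back onto a feasible
    interval; all other components of theta are left unchanged.
    """
    candidate = theta[T] + h(n) * grad_f * ((-1) ** alpha_n) * V_next
    theta[T] = min(max(candidate, 0.0), T_max)  # stands in for the projection Delta_T
    return theta
```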

Similar to Theorem 1, following the two-timescale analysis [34], we analyze Equation (26) keeping the threshold vector θ and the LM β constant. Therefore, it can be argued that as t → ∞, V converges to the asymptotically stable equilibrium of the associated ODE.

Lemma 4. The PDS value function, LM and threshold vector iterates are bounded a.s.

Proof. The boundedness of the PDS value functions and the LM follows from an approach similar to that of Lemma 1. Also, the iterates of the threshold vector are bounded (see Equation (19)).

Lemma 5. We have V_n → V^{β_n, θ_n} a.s., where V^{β_n, θ_n} is the value function of the states for β = β_n, θ = θ_n.

Proof. Since the iterations of the threshold vector θ can be expressed as θ_{n+1} = θ_n + δ(n), where δ(n) = o(g(n)), the proof approach is similar to that of Lemma 2.

The convergence of the LM follows immediately from Lemma 3. It remains to prove the convergence of the threshold vector iterates θ_n to the optimal threshold vector θ*, and hence the convergence of (V_n, β_n, θ_n) to (V, β*, θ*). Before that, we need the following lemma, which establishes the unimodality of the average reward ρ(θ) attained under a threshold policy with threshold θ, with respect to θ.

Lemma 6. ρ(θ) is unimodal in θ.

Proof. Proof is provided in Section 10.3.

Lemma 7. The threshold vector iterates θ_n → θ*.

Proof. The limiting ODE for the threshold vector iterations (Equation (27)) is the same as a gradient ascent of the form

θ̇(t) = ∇ρ(θ(t)).

Since the average reward is unimodal in θ (Lemma 6), there does not exist any local maximum which is not a global maximum. Therefore, θ_n → θ*.

In contrast to Theorem 1, convergence of the threshold vector iterates does not require individual clocks for every state, as long as all components are updated comparably often, i.e., the relative frequencies of their updates remain bounded away from zero. This is true in general for stochastic gradient schemes; see [34, Chapter 7].

10.3 Proof of Lemma 6

We prove this lemma componentwise, i.e., we prove that the average reward is unimodal with respect to each θ(T), hence proving unimodality with respect to θ. Consider the Value Iteration Algorithm (VIA). Let the value function of PDS (i, j, k) at the Nth iteration be denoted by V_N(i, j, k).


Lemma 8. V_N(i + 1, j, k) − V_N(i, j, k) decreases with N.

Proof. Proof is provided in Section 10.4.

Lemma 9. For all i, j with (i + j) = C, V_N(i + 1, j − 1, k + 1) − V_N(i, j, k) decreases with N.

Proof. Proof is provided in Section 10.5.

We consider both cases, where after a certain threshold on i, the optimal action changes from (1) A2 to A1 and (2) A4 to A1, respectively.

Proof of (1): We know that if the optimal action in PDS (i, j, k) is A1, then V(i + 1, j, k) − V(i, j, k) ≤ −β − R_{L,V}. Since the VIA converges to the threshold policy with threshold θ*, there exists an integer N_1 such that for all N ≥ N_1, V_N(i + 1, j, k) − V_N(i, j, k) ≤ −β − R_{L,V} for i ≥ θ*(T) and V_N(i + 1, j, k) − V_N(i, j, k) ≥ −β − R_{L,V} for i < θ*(T). For N ≥ 1, define U_N = min{i ∈ N_0 : V_N(i + 1, j, k) − V_N(i, j, k) ≤ −β − R_{L,V}}; if this set is empty, we set U_N = M(T). Hence, U_N can be regarded as the optimal threshold at the Nth iteration of the VIA. Now, since V_N(i + 1, j, k) − V_N(i, j, k) ≤ −β − R_{L,V} implies V_{N+1}(i + 1, j, k) − V_{N+1}(i, j, k) ≤ −β − R_{L,V} (using Lemma 8), U_N monotonically decreases with N. Also, lim_{N→∞} U_N = θ*(T).

Given a threshold θ1(T) (θ*(T) < θ1(T) ≤ M(T)), we consider a redefined problem where blocking is not allowed in any state with i < θ1(T), i.e., the action in these states is to always accept in LTE. In this case also, V_N(i + 1, j, k) − V_N(i, j, k) ≤ −β − R_{L,V} implies V_{N+1}(i + 1, j, k) − V_{N+1}(i, j, k) ≤ −β − R_{L,V}. Let N_{θ1(T)} = min{N : U_N ≤ θ1(T)}, i.e., N_{θ1(T)} is the first iteration of the VIA at which the threshold drops to θ1(T); it is finite since U_N converges to θ*(T) < θ1(T). For N < N_{θ1(T)}, the value function iterates of the redefined problem take exactly the same values as those of the original problem: the fact that blocking is not allowed for i < θ1(T) makes no difference, since blocking is never chosen as the optimal action in these states up to the N_{θ1(T)}th iteration of the original problem. Since the original problem converges to the threshold θ*(T), at the N_{θ1(T)}th iteration we have V_{N+1}(θ1(T) + 1, j, k) − V_{N+1}(θ1(T), j, k) ≤ −β − R_{L,V}, and the same inequality holds for the redefined problem. Since V_N(i + 1, j, k) − V_N(i, j, k) ≤ −β − R_{L,V} implies V_{N+1}(i + 1, j, k) − V_{N+1}(i, j, k) ≤ −β − R_{L,V} (Lemma 8), the inequality holds for all N ≥ N_{θ1(T)}. Hence, the VIA converges to the policy with threshold θ1(T) in the redefined problem, which implies that the θ1(T)-threshold policy is superior to the (θ1(T) + 1)-threshold policy. Since this argument holds for any threshold θ1(T), the average reward is monotonically decreasing in θ1(T) for θ1(T) > θ*(T).

Therefore, if ρ(θ(T)) ≥ ρ(θ(T) + 1), we must have θ(T) ≥ θ*(T). Thus, ρ(θ(T) + 1) ≥ ρ(θ(T) + 2). This completes the proof of unimodality for (1).

Proof of (2): In this case, if the optimal action in PDS (i, j, k) is blocking, we have V(i + 1, j − 1, k + 1) − V(i, j, k) ≤ −β − R_{L,V} + R_{L,D} − R_{W,D}(k). The rest of the proof is similar to (1) (using Lemma 9).

10.4 Proof of Lemma 8

Let us denote D_N V(i, j, k) = V_N(i + 1, j, k) − V_N(i, j, k). Therefore, we need to prove that D_{N+1} V(i, j, k) ≤ D_N V(i, j, k), ∀ N. We prove this using induction arguments on the VIA. When N = 0, V_N(i, j, k) = 0. Thus, D_0 V(i, j, k) = 0. We have

V_{N+1}(i, j, k) = λ_v max{f(i, j, k) − β + V_N(i, j, k), f(i + 1, j, k) + V_N(i + 1, j, k), f(i + 1, j − 1, k + 1) + V_N(i + 1, j − 1, k + 1)}
+ λ_d max{f(i, j + 1, k) + V_N(i, j + 1, k), f(i, j, k + 1) + V_N(i, j, k + 1)}
+ i µ_v max{f(i − 1, j, k) + V_N(i − 1, j, k), f(i − 1, j + 1, k − 1) + V_N(i − 1, j + 1, k − 1)}
+ j µ_d max{f(i, j − 1, k) + V_N(i, j − 1, k), f(i, j, k − 1) + V_N(i, j, k − 1)}
+ k µ_d max{f(i, j, k − 1) + V_N(i, j, k − 1), f(i, j − 1, k) + V_N(i, j − 1, k)}
+ (1 − v(i, j, k)) V_N(i, j, k),

where f(i, j, k) = i R_{L,V} + j R_{L,D} + k R_{W,D}(k).

For k ≥ k_th, we prove the claim componentwise; the proof for k < k_th follows in a similar way. Let us denote the pth component of V_N(i, j, k) by V^p_N(i, j, k), for p = 1, 2, . . . , 6. Therefore, we have

V^1_1(i, j, k) = max{f(i, j, k) − β + V_0(i, j, k), f(i + 1, j, k) + V_0(i + 1, j, k), f(i + 1, j − 1, k + 1) + V_0(i + 1, j − 1, k + 1)}.

We know that for k ≥ k_th, A4 is suboptimal (Property 2). Therefore, subtracting f(i, j, k) from the rewards corresponding to both remaining actions,

V^1_1(i, j, k) = max{−β + V_0(i, j, k), R_{L,V} + V_0(i + 1, j, k)}.

Therefore, D_1 V^1(i, j, k) = 0. Again, we have

V^2_1(i, j, k) = max{f(i, j + 1, k) + V_0(i, j + 1, k), f(i, j, k + 1) + V_0(i, j, k + 1)},

or, equivalently,

V^2_1(i, j, k) = max{R_{L,D} + V_0(i, j + 1, k), R_{W,D}(k) + V_0(i, j, k + 1)}.

We know that for k ≥ k_th, A2 is optimal. Thus,

V^2_1(i, j, k) = R_{L,D} + V_0(i, j + 1, k).

Therefore, D_1 V^2(i, j, k) = 0. Similarly, the other components can be shown to be equal to zero. Therefore, D_1 V(i, j, k) ≤ D_0 V(i, j, k).

Now, assume that the claim holds for an arbitrary N, i.e., D_{N+1} V(i, j, k) ≤ D_N V(i, j, k). We need to prove that D_{N+2} V(i, j, k) ≤ D_{N+1} V(i, j, k). Similar to the case of N = 0, we have D_{N+2} V^p(i, j, k) − D_{N+1} V^p(i, j, k) = D_{N+1} V(i, j, k) − D_N V(i, j, k) ≤ 0 for p ≠ 1. Therefore, it remains to prove that D_{N+2} V^1(i, j, k) − D_{N+1} V^1(i, j, k) ≤ 0. Let a_0, a_1 ∈ {A1, A2} be the maximizing actions in PDSs (i, j, k) and (i + 1, j, k), respectively, at the (N + 2)th iteration. Let b_0, b_1 ∈ {A1, A2} be the maximizing actions in PDSs (i, j, k) and (i + 1, j, k), respectively, at the (N + 1)th iteration. Now, it is not possible to have


a_1 = A2 and b_0 = A1. This is because if b_0 = A1, we have D_N V(i, j, k) ≤ −β − R_{L,V}. Using concavity with respect to i (shown in [19]), we must have D_N V(i + 1, j, k) ≤ −β − R_{L,V}. However, if a_1 = A2, we have D_{N+1} V(i + 1, j, k) ≥ −β − R_{L,V}, which contradicts the inductive assumption. Therefore, we consider the following three cases. In every case, for the given values of a_1 and b_0, if the inequality holds for any chosen values of a_0 and b_1, then it must hold for the maximizing a_0 and b_1.

1) If a_1 = b_0 = A1, then we can choose a_0 = b_1 = A1, resulting in

D_{N+2} V^1(i, j, k) − D_{N+1} V^1(i, j, k) = D_{N+1} V(i, j, k) − D_N V(i, j, k) ≤ 0.

2) If a_1 = b_0 = A2, then we can choose a_0 = b_1 = A2, and the inequality holds by the same reasoning as above.

3) If a_1 = A1 and b_0 = A2, then we can choose a_0 = A2 and b_1 = A1. In this case,

D_{N+2} V^1(i, j, k) − D_{N+1} V^1(i, j, k) = −β + V_{N+1}(i + 1, j, k) − R_{L,V} − V_{N+1}(i + 1, j, k) + β − V_N(i + 1, j, k) + R_{L,V} + V_N(i + 1, j, k) = 0.

Thus, we have D_{N+2} V(i, j, k) ≤ D_{N+1} V(i, j, k).

10.5 Proof of Lemma 9

Let us denote E_N V(i, j, k) = V_N(i + 1, j − 1, k + 1) − V_N(i, j, k). Therefore, we need to prove that E_{N+1} V(i, j, k) ≤ E_N V(i, j, k). When N = 0, V_N(i, j, k) = 0. Thus, E_0 V(i, j, k) = 0. We have

V^1_1(i, j, k) = max{−β + V_0(i, j, k), R_{L,V} − R_{L,D} + R_{W,D}(k) + V_0(i + 1, j − 1, k + 1)}.

Now, we have the following cases.
1) If −β ≥ R_{L,V} − R_{L,D} + R_{W,D}(k), then using Assumption 1, we have −β ≥ R_{L,V} − R_{L,D} + R_{W,D}(k + 1). Therefore, E_1 V^1(i, j, k) = 0.
2) If −β ≤ R_{L,V} − R_{L,D} + R_{W,D}(k + 1), then we have E_1 V^1(i, j, k) = R_{W,D}(k + 1) − R_{W,D}(k) ≤ 0 (using Assumption 1).
3) If R_{W,D}(k) > −β − R_{L,V} + R_{L,D} > R_{W,D}(k + 1), we have E_1 V^1(i, j, k) = −β − (R_{L,V} − R_{L,D} + R_{W,D}(k)) < 0.
Similar to Lemma 8, the other components can be shown to be equal to zero. Therefore, E_1 V(i, j, k) ≤ E_0 V(i, j, k).

Now, assume that the claim holds for an arbitrary N, i.e., E_{N+1} V(i, j, k) ≤ E_N V(i, j, k). We need to prove that E_{N+2} V(i, j, k) ≤ E_{N+1} V(i, j, k). Similar to the case of N = 0, we have E_{N+2} V^p(i, j, k) − E_{N+1} V^p(i, j, k) = E_{N+1} V(i, j, k) − E_N V(i, j, k) ≤ 0 for p ≠ 1. Therefore, it remains to prove that E_{N+2} V^1(i, j, k) − E_{N+1} V^1(i, j, k) ≤ 0. Let a_0, a_1 ∈ {A1, A4} be the maximizing actions in PDSs (i, j, k) and (i + 1, j − 1, k + 1), respectively, at the (N + 2)th iteration. Let b_0, b_1 ∈ {A1, A4} be the maximizing actions in PDSs (i, j, k) and (i + 1, j − 1, k + 1), respectively, at the (N + 1)th iteration. We consider the following four cases.

1) If a_1 = b_0 = A1, we choose a_0 = b_1 = A1, resulting in

E_{N+2} V^1(i, j, k) − E_{N+1} V^1(i, j, k) = E_{N+1} V(i, j, k) − E_N V(i, j, k) ≤ 0.

2) If a_1 = b_0 = A4, then we can choose a_0 = b_1 = A4, and the inequality holds by the same reasoning as above.
3) If a_1 = A1 and b_0 = A4, then we can choose a_0 = A4 and b_1 = A1. Similar to Case 3 in Lemma 8, E_{N+2} V^1(i, j, k) − E_{N+1} V^1(i, j, k) = 0.
4) If a_1 = A4 and b_0 = A1, then we can choose a_0 = A1 and b_1 = A4. In this case,

E_{N+2} V^1(i, j, k) − E_{N+1} V^1(i, j, k) = [E_{N+1} V(i + 1, j − 1, k + 1) + E_{N+1} V(i, j, k)] − [E_N V(i + 1, j − 1, k + 1) + E_N V(i, j, k)] ≤ 0.

Thus, we have E_{N+2} V(i, j, k) ≤ E_{N+1} V(i, j, k).

ACKNOWLEDGMENT

This work is funded by the Ministry of Electronics and Information Technology (MeitY), Government of India.

REFERENCES

[1] A. Roy, P. Chaporkar, and A. Karandikar, "An on-line radio access technology selection algorithm in an LTE-WiFi network," in IEEE WCNC, 2017, pp. 1–6.
[2] Y. He, M. Chen, B. Ge, and M. Guizani, "On WiFi offloading in heterogeneous networks: Various incentives and trade-off strategies," IEEE Communications Surveys & Tutorials, vol. 18, no. 4, pp. 2345–2385, 2016.
[3] Cisco, "Cisco visual networking index: Global mobile data traffic forecast update, 2013–2018," white paper, 2014.
[4] V. G. Nguyen, T. X. Do, and Y. Kim, "SDN and virtualization-based LTE mobile network architectures: A comprehensive survey," Wireless Personal Communications, vol. 86, no. 3, pp. 1401–1438, 2016.
[5] A. Nayak M., P. Jha, and A. Karandikar, "A centralized SDN architecture for the 5G cellular network," in IEEE 5GWF, 2018, pp. 147–152.
[6] IEEE 802.11-2012, Part 11, "Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications," 2012.
[7] 3GPP TR 37.834 v0.3.0, "Study on WLAN/3GPP Radio Interworking," 2013.
[8] A. Whittier, P. Kulkarni, F. Cao, and S. Armour, "Mobile data offloading addressing the service quality vs. resource utilisation dilemma," in IEEE PIMRC, 2016, pp. 1–6.
[9] F. Moety, M. Ibrahim, S. Lahoud, and K. Khawam, "Distributed heuristic algorithms for RAT selection in wireless heterogeneous networks," in IEEE WCNC, 2012, pp. 2220–2224.
[10] E. Aryafar, A. Keshavarz-Haddad, M. Wang, and M. Chiang, "RAT selection games in HetNets," in IEEE INFOCOM, 2013, pp. 998–1006.
[11] K. Lee, J. Lee, Y. Yi, I. Rhee, and S. Chong, "Mobile data offloading: How much can WiFi deliver?" IEEE/ACM Transactions on Networking, vol. 21, no. 2, pp. 536–550, 2013.
[12] D. Suh, H. Ko, and S. Pack, "Efficiency analysis of WiFi offloading techniques," IEEE Transactions on Vehicular Technology, vol. 65, no. 5, pp. 3813–3817, 2016.
[13] N. Cheng, N. Lu, N. Zhang, X. Zhang, X. S. Shen, and J. W. Mark, "Opportunistic WiFi offloading in vehicular environment: A game-theory approach," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 7, pp. 1944–1955, 2016.
[14] A. Roy and A. Karandikar, "Optimal radio access technology selection policy for LTE-WiFi network," in IEEE WiOpt, 2015, pp. 291–298.
[15] G. S. Kasbekar, P. Nuggehalli, and J. Kuri, "Online client-AP association in WLANs," in IEEE WiOpt, 2006, pp. 1–8.
[16] K. Khawam, S. Lahoud, M. Ibrahim, M. Yassin, S. Martin, M. El Helou, and F. Moety, "Radio access technology selection in heterogeneous networks," Physical Communication, vol. 18, pp. 125–139, 2016.
[17] S. Barmpounakis, A. Kaloxylos, P. Spapis, and N. Alonistioti, "Context-aware, user-driven, network-controlled RAT selection for 5G networks," Computer Networks, vol. 113, pp. 124–147, 2017.
[18] B. H. Jung, N. O. Song, and D. K. Sung, "A network-assisted user-centric WiFi-offloading model for maximizing per-user throughput in a heterogeneous network," IEEE Transactions on Vehicular Technology, vol. 63, no. 4, pp. 1940–1945, 2014.
[19] A. Roy, P. Chaporkar, and A. Karandikar, "Optimal radio access technology selection algorithm for LTE-WiFi network," IEEE Transactions on Vehicular Technology, vol. 67, no. 7, pp. 6446–6460, 2018.
[20] G. Bianchi, "Performance analysis of the IEEE 802.11 distributed coordination function," IEEE Journal on Selected Areas in Communications, vol. 18, no. 3, pp. 535–547, 2000.
[21] M. El Helou, M. Ibrahim, S. Lahoud, K. Khawam, D. Mezher, and B. Cousin, "A network-assisted approach for RAT selection in heterogeneous cellular networks," IEEE Journal on Selected Areas in Communications, vol. 33, no. 6, pp. 1055–1067, 2015.
[22] E. Khloussy, X. Gelabert, and Y. Jiang, "Investigation on MDP-based radio access technology selection in heterogeneous wireless networks," Computer Networks, vol. 91, pp. 57–67, 2015.
[23] R. Li, Z. Zhao, J. Zheng, C. Mei, Y. Cai, and H. Zhang, "The learning and prediction of application-level traffic data in cellular networks," IEEE Transactions on Wireless Communications, vol. 16, no. 6, pp. 3899–3912, 2017.
[24] K. Kumar, A. Gupta, R. Shah, A. Karandikar, and P. Chaporkar, "On analyzing Indian cellular traffic characteristics for energy efficient network operation," in IEEE NCC, 2015, pp. 1–6.
[25] U. Paul, A. P. Subramanian, M. M. Buddhikot, and S. R. Das, "Understanding traffic dynamics in cellular data networks," in IEEE INFOCOM, 2011, pp. 882–890.
[26] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
[27] K. Adachi, M. Li, P. H. Tan, Y. Zhou, and S. Sun, "Q-Learning based intelligent traffic steering in heterogeneous network," in IEEE VTC Spring, 2016, pp. 1–5.
[28] S. Anbalagan, D. Kumar, D. Ghosal, G. Raja, and V. Muthuvalliammai, "SDN-assisted learning approach for data offloading in 5G HetNets," Mobile Networks and Applications, vol. 22, no. 4, pp. 771–782, 2017.
[29] http://code.nsnam.org/ns-3-dev/.
[30] T. Bonald and J. W. Roberts, "Internet and the Erlang formula," ACM SIGCOMM Computer Communication Review, vol. 42, no. 1, pp. 23–30, 2012.
[31] E. Altman, Constrained Markov Decision Processes. CRC Press, 1999.
[32] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[33] F. J. Beutler and K. W. Ross, "Optimal policies for controlled Markov chains with a constraint," Journal of Mathematical Analysis and Applications, vol. 112, no. 1, pp. 236–252, 1985.
[34] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
[35] P. Marbach and J. N. Tsitsiklis, "Simulation-based optimization of Markov reward processes," IEEE Transactions on Automatic Control, vol. 46, no. 2, pp. 191–209, 2001.
[36] 3GPP TR 36.814 v9.0.0, "Further Advancements for E-UTRA Physical Layer Aspects," 2010.
[37] 3GPP TR 36.839 v11.1.0, "Mobility Enhancements in Heterogeneous Networks," 2012.
[38] F. Mehmeti and T. Spyropoulos, "Performance analysis of "on-the-spot" mobile data offloading," in IEEE GLOBECOM, 2013, pp. 1577–1583.
[39] 3GPP TS 36.104 V10.2.0, "Base Station (BS) Radio Transmission and Reception," 2011.
[40] D. B. Johnson and D. A. Maltz, "Dynamic source routing in ad hoc wireless networks," in Mobile Computing. Springer, 1996, pp. 153–181.
[41] N. Salodkar, A. Bhorkar, A. Karandikar, and V. S. Borkar, "An on-line learning algorithm for energy efficient delay constrained scheduling over a fading channel," IEEE Journal on Selected Areas in Communications, vol. 26, no. 4, pp. 732–742, 2008.
[42] V. R. Konda and V. S. Borkar, "Actor-critic-type learning algorithms for Markov decision processes," SIAM Journal on Control and Optimization, vol. 38, no. 1, pp. 94–123, 1999.
[43] J. Abounadi, D. Bertsekas, and V. S. Borkar, "Learning algorithms for Markov decision processes with average cost," SIAM Journal on Control and Optimization, vol. 40, no. 3, pp. 681–698, 2001.
[44] V. S. Borkar and S. P. Meyn, "The ODE method for convergence of stochastic approximation and reinforcement learning," SIAM Journal on Control and Optimization, vol. 38, no. 2, pp. 447–469, 2000.
[45] V. S. Borkar, "An actor-critic algorithm for constrained Markov decision processes," Systems & Control Letters, vol. 54, no. 3, pp. 207–213, 2005.

Arghyadip Roy is currently a research scholar in the Department of Electrical Engineering at IIT Bombay, India. He received the B.E. degree from Jadavpur University, Kolkata, India, in 2010, and the M.Tech. degree from IIT Kharagpur, India, in 2012. He previously worked at the Samsung R&D Institute, Bangalore, India. His research interests are resource allocation, optimization and control of stochastic systems.

Vivek Borkar completed his B.Tech. (EE) from IIT Bombay ('76), M.S. (Systems and Control) from Case Western Reserve Uni. ('77) and Ph.D. (EECS) from Uni. of California at Berkeley ('80). He has held positions at the TIFR Center for Applicable Mathematics and the Indian Institute of Science in Bengaluru, and the Tata Institute of Fundamental Research, Mumbai, before joining IIT Bombay, Mumbai, as Institute Chair Professor of Electrical Engineering in Aug. 2011. He has held visiting positions at Uni. of Twente, MIT, Uni. of Maryland at College Park and Uni. of California at Berkeley. He is a Fellow of IEEE, the American Math. Society, TWAS and the science and engineering academies in India. His research interests are in stochastic optimization and control, covering theory, algorithms and applications.

Prasanna Chaporkar received the MS degree from the Faculty of Engineering, Indian Institute of Science, Bangalore, India, in 2000, and the PhD degree from the University of Pennsylvania, Philadelphia, Pennsylvania, in 2006. He was an ERCIM post-doctoral fellow with ENS, Paris, France, and NTNU, Trondheim, Norway. Currently, he is an associate professor at the Indian Institute of Technology, Mumbai. His research interests include resource allocation, stochastic control, queueing theory, and distributed systems and algorithms.

Abhay Karandikar is currently Director of IIT Kanpur (on leave from IIT Bombay). He is also a Member (part-time) of the Telecom Regulatory Authority of India (TRAI). At IIT Bombay, he served as Institute Chair Professor in the Electrical Engineering Department, Dean (Faculty Affairs) from 2017 to 2018 and Head of the Electrical Engineering Department from 2012 to 2015. Prof. Karandikar is the founding member of the Telecom Standards Development Society, India (TSDSI), India's standards body for telecom. He was the Chairman of TSDSI from 2016 to 2018. His research interests include resource allocation in wireless networks, software defined networking, frugal 5G and rural broadband.

