

Multi Agent Reinforcement Learning Trajectory Design and Two-Stage Resource Management in CoMP UAV VLC Networks

Mohammad Reza Maleki, Mohammad Robat Mili, Mohammad Reza Javan, Senior Member, IEEE, Nader Mokari, Senior Member, IEEE, and Eduard A. Jorswieck, Fellow, IEEE

Abstract—In this paper, we consider unmanned aerial vehicles (UAVs) equipped with a visible light communication (VLC) access point and coordinated multipoint (CoMP) capability that allows users to connect to more than one UAV. UAVs can move in three dimensions (3D) at a constant acceleration, where a central server is responsible for synchronization and cooperation among UAVs. Unlike most existing works, which neglect accelerated motion, we examine the effects of variable speed on the kinematics and on radio resource allocation. For the proposed system model, we define two different time frames: in each frame, the acceleration of each UAV is specified, and in each slot, radio resources are allocated. Our goal is to formulate a multiobjective optimization problem in which the total data rate is maximized and the total communication power consumption is minimized simultaneously. To handle this multiobjective optimization, we first apply the scalarization method and then apply the multi-agent deep deterministic policy gradient (MADDPG). We improve this solution method by adding two critic networks together with two-stage resource allocation. Simulation results indicate that the constant acceleration motion of UAVs yields about 8% better performance than conventional motion systems.

Index Terms—Visible light communication, UAV, CoMP, two time frame, reinforcement learning, DDPG, MADDPG, resource allocation, 3D movements, constant acceleration.

I. INTRODUCTION

A. State of the art

As a cooperative communication system based on multiple transmission and reception points, coordinated multipoint (CoMP) is consolidated into the long-term evolution-advanced releases [1] as an adequate method for relieving inter-cell interference. It also enables symbol-level cooperation among unmanned aerial vehicles (UAVs) and base stations (BSs) to enhance communication quality. The CoMP technique significantly improves the data rate and connection availability for cell-center and cell-edge users [2], [3]. Because CoMP can mitigate the effects of severe inter-cell interference (ICI), it is considered a key enabling technology for beyond fifth-generation (B5G) and sixth-generation (6G) networks. In order to improve network coverage for the next-generation

M. R. Maleki and N. Mokari are with the Department of Electrical and Computer Engineering, Tarbiat Modares University, Tehran, Iran (e-mail: {M.mohammadreza, nader.mokari}@modares.ac.ir). M. R. Javan is with the Department of Electrical Engineering, Shahrood University of Technology, Iran (e-mail: [email protected]). E. A. Jorswieck is with the Institute of Communications Technology, TU Braunschweig, Germany (e-mail: [email protected]). This work was supported by the joint Iran National Science Foundation (INSF) and German Research Foundation (DFG) under grant No. 96007867.

mobile phone networks, CoMP is used to increase the received signal-to-interference-plus-noise ratio (SINR) [4]. UAVs are predicted to play an essential role in B5G and 6G cellular networks [5]. On the one hand, to improve the communication and service range and enhance the quality of service (QoS), UAVs with specific purposes such as aerial maneuvers can be linked directly with cellular BSs [6]–[9]. On the other hand, we can utilize UAVs as aerial wireless BSs in the sky to implement flexible and on-demand wireless services for mobile users, promoting communication performance and improving coverage [10]–[14]. Several technical opportunities and challenges are created with the advent of cellular-connected UAVs and wireless transmissions aided by UAVs. As a first consideration, UAVs usually have a strong line-of-sight (LoS) to users. As a result, channel gain and communication quality are improved, but inter-cell interference increases correspondingly. Second, UAVs offer high mobility in the 3-dimensional (3D) environment. Trajectory management becomes more complex due to 3D motion, but it provides more opportunity for UAV positioning and trajectory control, which can improve communication performance [3]. Visible light communication (VLC) is an evolving communication technology with low energy consumption and flexible coverage [15]. The VLC network can support a large number of services due to the available bandwidth in unlicensed spectrum, its ubiquitous presence, and low power consumption. Using light-emitting diodes (LEDs), VLC offers illumination and communication in scenarios such as search and rescue, and it will play an essential role in future generations [16], [17]. With the emergence of new technologies and applications, machine learning is becoming more prevalent in B5G wireless applications [18].

In this paper, we present a unified framework addressing these concerns and utilizing different emerging technologies. By designing a two time frame system for resource allocation and trajectory planning, we provide an approach that solves this multifaceted problem. Considering constant acceleration makes our work more practical. We can solve complex problems by combining multiple technologies with novel reinforcement learning (RL) algorithms. We adopt and extend these technologies to maximize the total data rate while users are moving, which is challenging in VLC networks. Our paper differs from previous works in that we design a new trajectory planning scheme with two time frames and utilize a novel RL method with two-stage actions.



B. Related works

We organize the review according to several system models. These categories include UAV-RL, UAV-CoMP, and VLC-RF, respectively, as follows:

1) UAV-RL: A QoS-constrained energy efficiency function is proposed by the authors in [19] as a reward function for providing reliable communication. To deal with dynamics and uncertainty in environments, they form a stochastic game for maximizing the expected rewards and solve the problem by using multi-agent reinforcement learning (MARL), where each UAV acts as an agent that seeks the best policy independently and relies only on its local observations. As a way to support the maximum number of offloaded tasks while maintaining the QoS requirements, each mobile edge computing (MEC) server is assigned unique optimization problems to jointly manage the MEC-mounted macro eNodeBs (MeNBs), allocate resources to vehicles, and make association decisions in [20]. Through the use of the MEC servers as agents, they convert the transformed problems into agents and then develop the MADDPG algorithm to solve them. In high-mobility and highly heterogeneous networks, [21] investigates the channel model and develops a novel deep-RL-based time division duplex (TDD) configuration method for dynamically allocating radio resources online. A joint scheduling approach between the UAVs' trajectory planning and time resource assignment is implemented to maximize the minimum throughput under the constraints of maximum flight speed, peak uplink power, and area of flight in [22]. To maximize the minimum throughput, they propose a multi-agent deep Q-learning (DQL)-based strategy for jointly optimizing the paths and time allocation of the UAVs. Every UAV possesses an independent deep Q-network (DQN) for its own action strategy, while the rest of the UAVs are considered parts of the environment. Specifically, the authors of [23] develop an architecture for delivering content to ground users in a hotspot area using caching-enabled UAV NOMA cellular networks. By optimizing the caching placement of a UAV, the scheduling of content requests by users, and the power allocation of NOMA users, they formulate an optimization problem to minimize content delivery delay as a Markov decision process (MDP). They propose an algorithm based on Q-learning that allows the UAV to learn and select action scenarios based on the MDP.

2) UAV-CoMP: The authors of [24] survey the top issues facing UAV-based wireless communication networks, and they investigate UAV networks with flying drones, the energy efficiency of UAVs, and seamless handover between UAVs and the ground BSs. [2] studies performance optimization of UAV placement and movement in multi-UAV CoMP communication, where each UAV forwards its received signals to a central processor for decoding. Through a trajectory design that exploits the high mobility of the UAV access point, it is possible to enhance the data rate significantly. However, this may result in very long delays for users [25], making it problematic for delay-sensitive ultra-reliable low latency communication (URLLC) applications. In [3], CoMP is investigated with 3D trajectory optimization where multiple

TABLE I: Notations and Symbols

Notation : Description
m/M/M : Index/number/set of users
f/F/F : Index/number/set of UAVs
t/T/T : Index/number/set of time frames
n/N/N : Index/number/set of time slots
w_m : Location of each user [m × m × m]
v_m : Velocity of each user [m/s × m/s × m/s]
q_f : Location of the fth UAV [m × m × m]
v_f : Velocity of the fth UAV [m/s × m/s × m/s]
a_f : Acceleration of the fth UAV [m/s² × m/s² × m/s²]
d_{m,f} : Distance between the fth UAV and the mth user [m]
T_s(·) : Gain of the optical filter
g(·) : Optical concentrator gain at the PD
φ : Angle of irradiance
A_r : Active area of the PD
$ : Order of Lambertian emission
ψ : Incidence angle between LED and device
ψ_c : Semi-angle field of view (FOV) of the PD

suspicious eavesdroppers are considered. It is demonstrated in [26] that maximizing sum rates for heterogeneous networks can lead to a remarkable enhancement of spectral efficiency for joint transmission CoMP-NOMA over a wide range of access distances. In [27], the UAV coverage problem is addressed in order to either maximize the coverage region or enhance the QoS. The authors of [28] examine strategies for incorporating a UAV into a two-cell NOMA CoMP system so that the BS can be sustained. A novel VLC/UAV framework is designed in [29] to both communicate and illuminate while also optimizing the locations of UAVs to minimize the total power consumption.

3) VLC-RF: In order to enhance communication coverage, hybrid VLC/RF systems emerge [30], so that mobile users can achieve higher data transmission rates via integrated VLC/RF. A multi-agent reinforcement learning method is used to improve the QoS for the users in [31]. An RF/VLC aggregated system is discussed by the authors of [32] in order to maximize energy efficiency. However, they do not consider user mobility, which is challenging in RF/VLC hybrid networks. Using NOMA-based hybrid VLC-RF with a common backhaul, [33] addresses the problem of optimal resource allocation to maximize the achievable data rate. An iterative algorithm is presented to assign users to the access networks of a hybrid RF/VLC system in [34]. To maximize the total achievable throughput, they formulate an optimization problem to assign power to RF APs and VLC APs. The symbol error rate of a code-domain NOMA-based VLC system is investigated in [35], revealing that users exhibit identical error rate performance across locations, while recent works demonstrate that power-domain NOMA is an effective multiple access scheme for VLC systems.

In terms of resource allocation, considering the movement of UAVs is challenging. First, UAV movement at a constant speed is practically impossible; in addition, it remarkably deteriorates maneuverability, whereas flying at a constant acceleration, with its increased maneuverability, provides better opportunities for allocating resources and tracking the users. Likewise, 3D motion improves UAV performance by increasing maneuverability. We also provide a minimum data rate for each user to assist users with weaker channels, usually located at the edge of the cell, to enhance QoS. In [36] and [37], the


authors proved that using NOMA in VLC systems results in better performance.

C. Contribution

In this paper, we consider the downlink scenario for UAVs that are equipped with VLC APs. Studies show that 3D motion can improve UAV performance in terms of the allocation of radio resources [38], [39]. We further improve UAV motion by designing a two time frame system applying constant acceleration movement. To enhance the resource allocation framework, we utilize constant acceleration motion to cover the weakness of low maneuverability; it also assists UAVs in hastily reaching better locations in terms of allocating resources. The increase in maneuverability, however, comes with higher system complexity, and solving more complex problems requires a better and more robust learning method. We address this challenge using a two time frame allocation method, which is entirely compatible with the proposed system model. In the frames, the constant acceleration is determined, whereas in the slots, radio resources are allocated and the initial velocities are calculated. Our contributions can be summarized as follows:

• We propose a mathematical model for the movement of UAVs so that they can move at various velocities. This means that the UAVs' accelerations are all different across frames. This model adds more complexity to the problem.

• To formulate the problem, we need a new frame structure that includes two time frames for resource allocation. In frames, the constant acceleration of each UAV is determined, whereas in slots, radio resources are allocated.

• Our goal is to maximize data rates and minimize power consumption simultaneously. We formulate our problem in two time frames to fit the equations of motion and to incorporate the movements of UAVs and users into our problem. We also consider a minimum data rate required for each user to improve QoS at the edge of the cell.

• By considering UAV trajectory and user movement in a complex frame structure, the problem becomes challenging. We present a novel solution method that adapts to the operation of UAVs and uses the two time frames to allocate resources. A multi-agent RL-based solution is used where agents interact with a central server. Using this approach increases the speed of convergence and solves the problem better compared with existing methods.

• In the simulation section, we provide a comprehensive review of the system model. The considered baselines confirm the superior performance of our solution. The impacts of different terms in the objective function are inspected, and the influence of various constraints on the objective function is also studied. We additionally examine the system with and without CoMP, which shows the positive impact of employing CoMP in our system model. Also, constant acceleration and constant velocity are examined, in which constant acceleration gives better performance than constant velocity.

The rest of this paper is organized as follows: Section II describes the system model for our considered UAV-VLC-enabled CoMP network. Section III formulates the UAV resource allocation and movement optimization problems to maximize the data rate and minimize the total power consumption. We propose our reinforcement learning (RL) approach in Section IV. Section V includes simulation results and, at the end, the conclusion is in Section VI.

Symbol Notations: Matrix and vector variables are represented by bold upper-case and lower-case letters, respectively. S indicates the set {1, 2, . . . , S} and |S| = S is the cardinality of set S. Transpose is indicated by (·)^T. ‖·‖ represents the Euclidean norm and |·| stands for the absolute value. E{·} is the expectation operator. The set of real numbers is represented as R.

II. SYSTEM MODEL

In this paper, we consider UAVs mounted with VLC APs, where the UAVs hover over the ground within a CoMP system while applying PD-NOMA to serve both communication and illumination simultaneously. Users are randomly distributed on the ground. As shown in Fig. 1, we consider the downlink transmission scenario where single-antenna UAVs are deployed as aerial BSs. For ease of exposition, the time horizon T is equally divided into N time slots with slot duration δ, as shown in Fig. 2, where T = {1, . . . , T} and N = {1, . . . , N} are the frame set and the slot set, respectively. We denote the set of UAVs by F = {1, . . . , F} and the set of users by M = {1, . . . , M}. We consider a 3D Cartesian coordinate system, with location, velocity, and acceleration measured in m, m/s, and m/s², respectively, where the horizontal coordinate and velocity of user m are denoted by w_m[t,n] = (x_m[t,n], y_m[t,n], 0)^T ∈ R^{3×1} and {v_m[t,n]}, respectively. We assume that each UAV flies with the maximum speed constraint v_max and the maximum acceleration constraint a_max. As such, the trajectory, velocity, and acceleration of UAV f over time T can be denoted by {q[t,n]}, {v[t,n]}, and {a[t]}, with q_f[t,n] = (x_f[t,n], y_f[t,n], z_f[t,n])^T ∈ R^{3×1}, v_f[t,n] = (v_{f_x}[t,n], v_{f_y}[t,n], v_{f_z}[t,n])^T ∈ R^{3×1}, and a_f[t,n] = (a_{f_x}[t,n], a_{f_y}[t,n], a_{f_z}[t,n])^T ∈ R^{3×1}, respectively. Assume that the communication channel from a UAV to each user is dominated by a line-of-sight (LoS) link. In VLC and UAV networks, without loss of generality, the LoS channel gain of the VLC link between UAV f and user m can be expressed as in (1).

Here,

d_{m,f}[t,n] = \| q_f[t,n] - w_m[t,n] \|,   (2)

and

F(\psi[t,n]) = T_s(\psi[t,n]) \cos(\psi[t,n]) \, g(\psi[t,n]),   (3)

where A_r is the active area of the photo detector (PD), d_{m,f}[t,n] is the transmission distance from the UAV to the user, ψ[t,n] denotes the angle of incidence between the UAV and the device, and φ[t,n] is the angle of irradiance from the UAV to the device. m is the order of the Lambertian emission with m = -\ln 2 / \ln(\cos \phi_{1/2}), where φ_{1/2} is the LED's semi-angle at half power. T_s(ψ[t,n]) is the gain of the optical filter and g(ψ[t,n]) is the optical concentrator gain at the PD.


h_{m,f}[t,n] = \begin{cases} \frac{(m+1)A_r}{2\pi (d_{m,f}[t,n])^2} \cos^m(\phi[t,n]) \, F(\psi[t,n]), & 0 \le \psi[t,n] \le \psi_c, \\ 0, & \psi_c < \psi[t,n], \end{cases}   (1)

Fig. 1: System model of the UAV VLC CoMP network, with CoMP and non-CoMP users served by UAVs in three cells coordinated by a central server.

Fig. 2: The two time frame structure: the UAVs move with a constant acceleration within each frame, resources and velocities are determined in each slot, the final velocity of the last slot gives the initial velocity of the next frame, and a new acceleration is chosen at the start of each frame.

The gain g(ψ[t,n]) can be expressed as g(ψ[t,n]) = η / sin²(ψ_c) when 0 ≤ ψ[t,n] ≤ ψ_c, and g(ψ[t,n]) = 0 if ψ_c < ψ[t,n], where ψ_c and η are the semi-angle field of view (FoV) of the PD and the refractive index, respectively.
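To make the channel model concrete, the following is a minimal numerical sketch of the Lambertian LoS gain in (1) together with the optical filter and concentrator gains of (3); the default values are loosely based on Table II, the function name and the assumption of a down-facing LED and up-facing PD are illustrative, and this is not the authors' released code.

```python
import numpy as np

def vlc_los_gain(q_uav, w_user, A_r=1e-4, phi_half_deg=60.0,
                 psi_c_deg=60.0, T_s=1.0, eta=1.5):
    """Sketch of the LoS VLC channel gain h_{m,f} of Eq. (1) for one UAV-user pair.

    q_uav, w_user: 3D positions in meters; A_r: PD area [m^2];
    phi_half_deg: LED semi-angle at half power (60 deg gives Lambertian order 1,
    matching the order listed in Table II); psi_c_deg: PD field of view;
    T_s: optical filter gain; eta: refractive index of the concentrator.
    """
    q_uav, w_user = np.asarray(q_uav, float), np.asarray(w_user, float)
    d = np.linalg.norm(q_uav - w_user)          # distance d_{m,f}, Eq. (2)
    height = q_uav[2] - w_user[2]
    # With a down-facing LED and an up-facing PD, irradiance and incidence
    # angles coincide: cos(phi) = cos(psi) = height / d.
    cos_ang = height / d
    psi_deg = np.degrees(np.arccos(np.clip(cos_ang, -1.0, 1.0)))
    if psi_deg > psi_c_deg:                      # outside the FoV -> zero gain
        return 0.0
    m = -np.log(2) / np.log(np.cos(np.radians(phi_half_deg)))   # Lambertian order
    g = eta / np.sin(np.radians(psi_c_deg))**2   # concentrator gain g(psi) as in the text
    F = T_s * cos_ang * g                        # F(psi) of Eq. (3)
    return (m + 1) * A_r / (2 * np.pi * d**2) * cos_ang**m * F  # Eq. (1)

# Example: UAV hovering 30 m above a user who is 10 m away horizontally.
print(vlc_los_gain([10.0, 0.0, 30.0], [0.0, 0.0, 0.0]))
```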

There are two types of users in this system: CoMP and non-CoMP users. A CoMP user is associated with more than one UAV, whereas a non-CoMP user is served as in a traditional network. The SINR of the mth user's data, received at user m, is expressed as

\gamma_{m,f}[t,n](m) = \frac{\mu^2 \, p_{m,f}[t,n] \, \rho_{m,f}[t,n] \, |h_{m,f}[t,n]|^2}{I^{m,f}_{\mathrm{Intra}}[t,n] + B_{m,f} (\sigma_m)^2},   (4)

where μ is the PD responsivity, p_{m,f}[t,n] is the power allocated between user m and UAV f, ρ_{m,f}[t,n] ∈ {0,1} is the user association variable, and |h_{m,f}| is the channel coefficient between user m and UAV f. I^{m,f}_{\mathrm{Intra}}[t,n] is the NOMA interference and (σ_m)² is the noise variance. Due to the low interference in the VLC system, only the NOMA interference appears in the denominator of the SINR. Following the principle of NOMA, I^{m,f}_{\mathrm{Intra}} is given by

I^{m,f}_{\mathrm{Intra}}[t,n] = \sum_{i \in U_m} \rho_{i,f'}[t,n] \, p_{i,f'}[t,n] \, |h_{i,f'}[t,n]|^2,   (5)

where

U_m = \{ b \in \mathcal{M} : |h_{b,f}[t,n]|^2 > |h_{m,f'}[t,n]|^2 \}.   (6)

In (5), the user with the best channel ignores other signals and only senses noise as its interference, as is evident from (6). We define the total data rate as the summation over CoMP and non-CoMP users:

R[t] = \sum_{m \in \mathcal{M}} \Big( R^m_{\mathrm{CoMP}}[t] + \sum_{f \in \mathcal{F}} R^{m,f}_{\mathrm{nonCoMP}}[t] \Big),   (7)

where

R^m_{\mathrm{CoMP}}[t] = \frac{1}{N} \sum_{n=1}^{N} R'^{\,m}_{\mathrm{CoMP}}[t,n],   (8)

R^{m,f}_{\mathrm{nonCoMP}}[t] = \frac{1}{N} \sum_{n=1}^{N} R'^{\,m,f}_{\mathrm{nonCoMP}}[t,n],   (9)

with

R'^{\,m}_{\mathrm{CoMP}}[t,n] = \nu_{m,f}[t,n] \sum_{f \in \mathcal{F}} \frac{B_{m,f}}{2} \log_2\big(1 + \gamma_{m,f}[t,n]\big),   (10)

R'^{\,m,f}_{\mathrm{nonCoMP}}[t,n] = \big(1 - \nu_{m,f}[t,n]\big) \frac{B_{m,f}}{2} \log_2\big(1 + \gamma_{m,f}[t,n]\big),   (11)

where ν_{m,f} is a binary variable which indicates whether the user is a CoMP user or not. The scaling factor 1/2 is due to the Hermitian symmetry [40].
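The following is a hedged single-slot sketch of how (4)-(6) and the rate split (10)-(11) could be evaluated numerically; the array layout, function name, and the toy numbers are assumptions made only for illustration.

```python
import numpy as np

def noma_sinr_and_rate(h, p, rho, nu, B, mu=0.53, sigma2=1e-12):
    """Per-user SINR (Eqs. 4-6) and instantaneous rates (Eqs. 10-11) for one slot.

    h, p, rho: M x F arrays of channel gains, allocated powers, and binary
    association variables; nu: length-M CoMP indicators; B: M x F bandwidths;
    mu: PD responsivity; sigma2: noise variance.
    """
    M, F = h.shape
    sinr = np.zeros((M, F))
    for f in range(F):
        for m in range(M):
            # U_m of Eq. (6): users on UAV f with a stronger channel than user m.
            stronger = np.abs(h[:, f])**2 > np.abs(h[m, f])**2
            I_intra = np.sum(rho[stronger, f] * p[stronger, f]
                             * np.abs(h[stronger, f])**2)          # Eq. (5)
            sinr[m, f] = (mu**2 * p[m, f] * rho[m, f] * np.abs(h[m, f])**2
                          / (I_intra + B[m, f] * sigma2))          # Eq. (4)
    per_link = 0.5 * B * np.log2(1.0 + sinr)      # factor 1/2: Hermitian symmetry
    r_comp = nu * per_link.sum(axis=1)            # Eq. (10): CoMP users sum over UAVs
    r_noncomp = (1.0 - nu)[:, None] * per_link    # Eq. (11): one rate per UAV link
    return sinr, r_comp, r_noncomp

# Toy example: 3 users, 2 UAVs, user 0 is a CoMP user.
rng = np.random.default_rng(0)
h = rng.uniform(1e-6, 1e-5, (3, 2))
sinr, r_c, r_nc = noma_sinr_and_rate(h, p=np.full((3, 2), 2.0),
                                     rho=np.ones((3, 2)),
                                     nu=np.array([1.0, 0.0, 0.0]),
                                     B=np.full((3, 2), 1e6))
```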

III. PROBLEM FORMULATION

In this section, we introduce a multiobjective optimization problem (MOOP) in which the data rates of CoMP users and non-CoMP users are maximized and the total power of these users is minimized simultaneously, as follows:

\max_{\{w_m[t,n],\, q[t,n],\, v[t,n-1],\, a[t],\, \rho_{m,f}[t,n],\, p_{m,f}[t,n]\}} \; R[t]   (12a)

\min_{\{w_m[t,n],\, q[t,n],\, v[t,n-1],\, a[t]\}} \; \sum_{f \in \mathcal{F}} \sum_{m \in \mathcal{M}} p_{m,f}[t,n]   (12b)

s.t.

\sum_{f \in \mathcal{F}} \sum_{m \in \mathcal{M}} \rho_{m,f}[t,n] \, p_{m,f}[t,n] \le p_{\max}, \quad \forall n \in \mathcal{N}, t \in \mathcal{T},   (12c)

0 \le \sum_{m \in \mathcal{M}} \rho_{m,f} \, p_{m,f}[t,n] \le p^f_{\max}, \quad \forall f \in \mathcal{F}, n \in \mathcal{N}, t \in \mathcal{T},   (12d)

R^m_{\mathrm{CoMP}}[t,n] \ge R'^{\,f}_{\min}, \quad \forall f \in \mathcal{F}, m \in \mathcal{M}, n \in \mathcal{N}, t \in \mathcal{T},   (12e)

R^{m,f}_{\mathrm{nonCoMP}}[t,n] \ge R^f_{\min}, \quad \forall f \in \mathcal{F}, m \in \mathcal{M}, n \in \mathcal{N}, t \in \mathcal{T},   (12f)

q_f[t,n] = q_f[t,n-1] + v_f[t,n-1]\,\delta + \tfrac{1}{2} a_f[t]\, \delta^2, \quad \forall f \in \mathcal{F}, n \in \mathcal{N}, t \in \mathcal{T},   (12g)

Fig. 3: Proposed MADDPG two-stage resource allocation and trajectory planning, with per-UAV actors and critics interacting with the environment and a central server hosting the replay buffer and the two global critics with their target networks.

v_f[t,n] = v_f[t,n-1] + a_f[t]\,\delta, \quad \forall f \in \mathcal{F}, n \in \mathcal{N}, t \in \mathcal{T},   (12h)

w_m[t,n] = w_m[t,n-1] + v_m[t,n-1]\,\delta, \quad \forall m \in \mathcal{M}, n \in \mathcal{N}, t \in \mathcal{T},   (12i)

\|(x_f[t,n], y_f[t,n]) - (x_{\mathrm{mid}}, y_{\mathrm{mid}})\| \le r_{\mathrm{Radius}}, \;\; 0 \le z_f[t,n] \le z_{\max}, \quad \forall f \in \mathcal{F}, n \in \mathcal{N}, t \in \mathcal{T},   (12j)

\|w_m[t,n] - (x_{\mathrm{mid}}, y_{\mathrm{mid}})\| \le r_{\mathrm{Radius}}, \quad \forall m \in \mathcal{M}, n \in \mathcal{N}, t \in \mathcal{T},   (12k)

\|v_f[t,n]\| \le v_{\max}, \quad \forall f \in \mathcal{F}, n \in \mathcal{N}, t \in \mathcal{T},   (12l)

\|a_f[t]\| \le a_{\max}, \quad \forall f \in \mathcal{F}, t \in \mathcal{T},   (12m)

\|v_m[t,n]\| \le v'_{\max}, \quad \forall m \in \mathcal{M}, n \in \mathcal{N}, t \in \mathcal{T},   (12n)

\sum_{m \in \mathcal{M}} \rho_{m,f}[t,n] \le J_K, \quad \forall f \in \mathcal{F}, n \in \mathcal{N}, t \in \mathcal{T},   (12o)

\gamma_{m,f}[t,n](i) - \gamma_{m,f}[t,n](m) \ge 0 \;\; \text{if} \;\; |h_{i,f}[t,n]|^2 > |h_{m,f}[t,n]|^2, \quad \forall f \in \mathcal{F}, m, i \in \mathcal{M}, n \in \mathcal{N}, t \in \mathcal{T},   (12p)

\rho_{m,f}[t,n] \in \{0,1\}, \quad \forall f \in \mathcal{F}, m \in \mathcal{M}, n \in \mathcal{N}, t \in \mathcal{T},   (12q)

\nu_{m,f}[t,n] \in \{0,1\}, \quad \forall f \in \mathcal{F}, m \in \mathcal{M}, n \in \mathcal{N}, t \in \mathcal{T}.   (12r)

Constraint (12c) limits the total transmit power of all UAVs, and (12d) constrains each UAV's power between zero and its maximum. (12e) and (12f) are the minimum data rate constraints which all UAVs need to satisfy. (12g)-(12n) are movement constraints: the equation of motion of each UAV is given in (12g), (12h) gives the velocity equation of the UAVs, and the location of the users is obtained from (12i). (12j) and (12k) describe the spatial constraints of the simulation area for UAVs and users, respectively. (12l), (12m), and (12n) indicate the maximum UAV speed, the maximum UAV acceleration, and the maximum speed of each user, respectively. (12o) limits the number of users served by each UAV. (12p) is the NOMA constraint, which states that the SINR of the mth user's signal decoded at the ith user must be larger than that decoded at the mth user itself when |h_{i,f}|² > |h_{m,f}|². (12q) defines the binary assignment index that specifies the assignment of a user to a UAV, and (12r) is the binary variable which indicates whether a user is a CoMP user or not. Problem (12) is a non-convex NP-hard problem that conventional mathematical methods cannot solve. To handle the above problem, in the following we present the MADDPG approach, a machine learning method suitable for complex problems.
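As a quick numerical illustration of the kinematic constraints (12g)-(12i), the sketch below rolls one frame forward under a constant acceleration chosen for the whole frame; the frame length and slot duration follow the values later used in the simulations, the speed and area clipping of (12j)-(12n) is omitted, and the function name is an assumption.

```python
import numpy as np

def propagate_frame(q0, v0, a_frame, w0, v_user, N=100, delta=1e-3):
    """Roll UAV and user positions through one frame of N slots.

    q0, v0: initial UAV position/velocity (3,); a_frame: constant UAV
    acceleration for the whole frame (3,), Eqs. (12g)-(12h);
    w0, v_user: initial user position and velocity (3,), Eq. (12i).
    Returns the slot-indexed position/velocity trajectories.
    """
    q, v, w = [np.array(q0, float)], [np.array(v0, float)], [np.array(w0, float)]
    a = np.asarray(a_frame, float)
    for _ in range(N):
        q.append(q[-1] + v[-1] * delta + 0.5 * a * delta**2)   # Eq. (12g)
        v.append(v[-1] + a * delta)                            # Eq. (12h)
        w.append(w[-1] + np.asarray(v_user, float) * delta)    # Eq. (12i)
    # The final velocity v[-1] becomes the initial velocity of the next frame,
    # where a new constant acceleration is chosen (cf. Fig. 2).
    return np.array(q), np.array(v), np.array(w)

q, v, w = propagate_frame(q0=[0, 0, 50], v0=[1, 0, 0], a_frame=[0.5, 0.5, 0],
                          w0=[10, 10, 0], v_user=[1, 0, 0])
```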

IV. MULTI AGENT BASED SOLUTION

Due to the complexity of the problem, we are not able to solve it by classical programming methods, so we turn to RL methods. Single-agent methods face problems in estimating and handling the overload of information, while conventional multi-agent methods cannot obtain better results than single-agent methods in small environments due to their completely independent functionality. In this section, we form our environment, the agents, and the relevant interaction among the agents, i.e., the states, actions, and rewards. The section ends with the formulation of our multi-agent approach.

A. Environment

The purpose of each agent in a multi-agent environment is to maximize its policy objective, which can be written as

\max_{\pi^f} \; J^f(\pi^f), \quad f \in \mathcal{F}, \; \pi^f \in \Pi^f,   (13)

where J^f(\pi^f) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r^f_{t+1} \mid s^f_0\big] is the conditional expectation of the discounted return, π^f is the policy of UAV f, and the set Π^f contains all feasible policies that are available for UAV f. Each UAV, as an agent, interacts with the UAV network environment and takes actions according to its policy. The goal is to solve the optimization problem (12), and the agents seek to improve their reward (13), which is related to the objective function. At each time n, the UAV takes action a_{t,n} after


observing the environment state s_{t,n}. The environment transfers to a new state s_{t,n+1} in the transition step and the UAV obtains a reward related to its action. Next we describe the state space S, the action space A, and the reward function r_{t,n} in our system model:

• State Space: The state of each agent includes all immediate channel gains h_{m,f}[t,n] at time n for all m ∈ M and f ∈ F, the previous velocity and location of all UAVs, q[t,n-1] and v[t,n-1], the previous location of the users w_{m,k}[t,n-1], and all interference terms of slot n-1, I^{m,f}_{\mathrm{Intra}}[t,n-1] and I^{m,f}_{\mathrm{Inter}}[t,n-1]:

s^f_{t,n} = \big[ h_{m,f}[t,n],\; q[t,n-1],\; v[t,n-1],\; w_{m,k}[t,n-1],\; I^{m,f}_{\mathrm{Intra}}[t,n-1],\; I^{m,f}_{\mathrm{Inter}}[t,n-1] \big], \quad f \in \mathcal{F}.   (14)

• Action Space: In each time step, the UAVs (agents) act as a^f_{t,n} = \{\rho^f_{t,n}, p^f_{t,n}, \nu^{m,f}_{t,n}, a^f_t\}. As mentioned above, ρ^f_{t,n} is the assignment variable, p^f_{t,n} is the power allocated to the users, ν^{m,f}_{t,n} is the CoMP indicator that shows whether a user is a CoMP user or not, and a^f_t is the acceleration matrix. Our variables contain both integer and continuous components, which suggests adopting policy gradients to solve the problem; making use of this capability allows us to come up with better solutions.
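One possible way to lay out the per-agent observation (14) and the mixed discrete/continuous action tuple as flat vectors for a policy-gradient agent is sketched below; the class and field names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class UavAction:
    """Action of one UAV agent: association rho, powers p, and CoMP flags nu
    (all per user), plus the 3D constant acceleration for the current frame."""
    rho: np.ndarray    # (M,) binary user association
    p: np.ndarray      # (M,) transmit powers
    nu: np.ndarray     # (M,) binary CoMP indicators
    accel: np.ndarray  # (3,) constant acceleration a_f[t]

    def flatten(self) -> np.ndarray:
        # Policy gradients operate on a single continuous vector; the binary
        # entries are relaxed here and thresholded by the environment.
        return np.concatenate([self.rho, self.p, self.nu, self.accel])

def make_state(h_m_f, q_prev, v_prev, w_prev, I_intra_prev, I_inter_prev):
    """Per-agent state of Eq. (14): current channel gains plus previous UAV/user
    kinematics and previous-slot interference terms, flattened into one vector."""
    return np.concatenate([np.ravel(x) for x in
                           (h_m_f, q_prev, v_prev, w_prev, I_intra_prev, I_inter_prev)])
```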

• Reward function: The reward is one of the most important parts of RL because it is the major driver of learning. It is crucial to formulate a function that both represents the objective function and yields fast, stable convergence. First we state our rewards and then discuss them. In this system we employ two types of reward: one per agent and one for the global critic network. Each agent's goal is to maximize its rate while minimizing its power consumption, and the global reward decreases the total interference among the users, as in [41]. Each agent is committed only to its own goal and does not need to know the other agents' policies, since they share the global critic. The goal of the global critic is to connect all the agents together to aid faster convergence with the support of the global reward. To maintain stable convergence, the central server connects the entire system, receives all the states, and criticizes the actions of the agents. Our per-agent and global rewards are defined as

r^f_\ell = \alpha \, \frac{\sum_{m \in \mathcal{M}} \big( R^{m,f}_{\mathrm{CoMP}}(\rho, p) + R^{m,f}_{\mathrm{nonCoMP}}(\rho, p) \big)}{\frac{B_{m,f}}{2} \log_2\big(1 + \frac{p^f_{\max}}{B_{m,f} (\sigma_m)^2}\big)} \;-\; (1-\alpha) \, \frac{\sum_{m \in \mathcal{M}} \rho_{m,f}\, p_{m,f}}{p^f_{\max}},   (15)

and the global reward

r_G = -\big( I^{m,f}_{\mathrm{Intra}} + I^{m,f}_{\mathrm{Inter}} \big).   (16)

We use linear scalarization to transform our multiobjective problem into a single-objective one using the weight factor α ∈ (0, 1) [42].
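A hedged numerical sketch of the scalarized per-agent reward (15), read as a normalized sum-rate term minus a normalized transmit-power term weighted by α and 1-α, and of the global reward (16), is given below; the particular normalization (summing the per-user normalizers) and the function names are assumptions.

```python
import numpy as np

def agent_reward(rate_comp, rate_noncomp, rho, p, B, p_f_max, sigma2, alpha=0.8):
    """Per-agent reward of Eq. (15): alpha-weighted normalized sum rate minus
    (1-alpha)-weighted normalized power. Inputs are per-user arrays of UAV f."""
    # One reading of the normalization in (15): the maximum achievable per-user rate.
    rate_norm = 0.5 * B * np.log2(1.0 + p_f_max / (B * sigma2))
    rate_term = np.sum(rate_comp + rate_noncomp) / np.sum(rate_norm)
    power_term = np.sum(rho * p) / p_f_max
    return alpha * rate_term - (1.0 - alpha) * power_term

def global_reward(I_intra, I_inter):
    """Global reward of Eq. (16): the negative total interference, so that the
    shared critic pushes all agents toward low-interference configurations."""
    return -(np.sum(I_intra) + np.sum(I_inter))
```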

B. MADDPG

We denote by π the set of all UAV policies, where the policy of UAV f is π^f (π^f ∈ π = {π^1, . . . , π^F}) with parameters θ^f, Q^f is the critic (Q-function) of UAV f with parameters φ^f, and Q_G is the global Q-function with parameters ψ. We can now form our neural networks and then discuss the policy gradient. We consider L^f_π, L^f_q, and L_G as the numbers of layers in the neural networks of each agent's actor, each agent's critic, and the global critic, respectively. Accordingly, the networks are parameterized as Θ_i = (W^{(1)}_π, . . . , W^{(L_π)}_π), Φ_q = (W^{(1)}_q, . . . , W^{(L_q)}_q), and Ψ_G = (W^{(1)}_G, . . . , W^{(L_G)}_G) for the actor, the UAV critic, and the global critic. According to the above, the policy gradient is

\nabla_{\theta^f} J^f = \mathbb{E}\Big[ \nabla_{\theta^f} \pi^f\big(a^f \mid s^f\big) \, \nabla_{a^f} Q^f_\pi(\mathbf{s}, \mathbf{a}) \big|_{a^f = \pi^f(s^f)} \Big],   (17)

where a = (a^1, . . . , a^F) collects the actions taken by all UAVs with observations s = (s^1, . . . , s^F). We integrate all actions and states in Q^f_π(s, a) as inputs to approximate the Q-function of UAV f. Here we utilize two critic networks, and we reformulate the policy gradient as follows:

\nabla_{\theta^f} J^f = \underbrace{\mathbb{E}_{\mathbf{s},\mathbf{a} \sim \mathcal{D}}\Big[ \nabla_{\theta^f} \pi^f\big(a^f \mid s^f\big) \, \nabla_{a^f} Q^\psi_G(\mathbf{s}, \mathbf{a}) \Big]}_{\text{Global critic}} + \underbrace{\mathbb{E}_{s^f, a^f \sim \mathcal{D}}\Big[ \nabla_{\theta^f} \pi^f\big(a^f \mid s^f\big) \, \nabla_{a^f} Q^f_{\phi^f}\big(s^f, a^f\big) \Big]}_{\text{UAV critic}},   (18)

where a^f = π^f(s^f) is the action taken by UAV f with observation s^f with respect to policy π^f. The first term in (18) corresponds to the global critic, which receives the actions and states of all UAVs and estimates the global Q-function using the global reward r_G, while the second term corresponds to the critic of the UAV, which receives only its own actions and states. For updating the global critic we use the loss function

L(\psi) = \mathbb{E}_{\mathbf{s},\mathbf{a},r,\mathbf{s}'}\Big[ \big( Q^\psi_G(\mathbf{s},\mathbf{a}) - y_G \big)^2 \Big],   (19)

where y_G is the target value of the estimation and

y_G = r_G + \gamma \, Q^{\psi'}_G(\mathbf{s}', \mathbf{a}') \big|_{a'^f = \pi'^f(s'^f)},   (20)

where the target policy is π' = {π'^1, . . . , π'^F}, parameterized by θ' = {θ'^1, . . . , θ'^F}. The UAV loss function and its target are updated as follows:

L^f\big(\phi^f\big) = \mathbb{E}_{s^f, a^f, r^f, s'^f}\Big[ \big( Q^f_{\phi^f}\big(s^f, a^f\big) - y^f \big)^2 \Big],   (21)

with

y^f = r^f + \gamma \, Q^f_{\phi'^f}\big(s'^f, a'^f\big) \big|_{a'^f = \pi'^f(s'^f)}.   (22)
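To show how (18)-(22) translate into one gradient step, the sketch below follows the standard MADDPG-style update for a single agent: mean-squared TD losses for the global and local critics against precomputed targets y_G and y^f, and an actor step that ascends the sum of the two critic values at a^f = π^f(s^f). The network interfaces (critics taking a state and an action), the batch layout, and all names are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def update_agent(actor, local_critic, global_critic,
                 actor_opt, local_opt, global_opt,
                 batch, agent_slice):
    """One update of Eqs. (18)-(22) for agent f.

    batch: dict of tensors with joint states/actions `s`, `a`, precomputed
    targets `y_g` (Eq. 20) and `y_f` (Eq. 22), and the agent's own `s_f`, `a_f`.
    agent_slice: the columns of the joint action owned by this agent.
    """
    # Global critic loss, Eq. (19).
    global_loss = F.mse_loss(global_critic(batch["s"], batch["a"]), batch["y_g"])
    global_opt.zero_grad(); global_loss.backward(); global_opt.step()

    # Local (per-UAV) critic loss, Eq. (21).
    local_loss = F.mse_loss(local_critic(batch["s_f"], batch["a_f"]), batch["y_f"])
    local_opt.zero_grad(); local_loss.backward(); local_opt.step()

    # Actor update, Eq. (18): ascend global + local critic values at a_f = pi(s_f).
    a_f = actor(batch["s_f"])
    a_joint = batch["a"].clone()
    a_joint[:, agent_slice] = a_f             # replace agent f's slice of the joint action
    actor_loss = -(global_critic(batch["s"], a_joint).mean()
                   + local_critic(batch["s_f"], a_f).mean())
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return global_loss.item(), local_loss.item(), actor_loss.item()
```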

Although this framework is able to reach quite good performance and converge in a moderate number of steps, it still suffers from overestimation, and the resulting loss of approximation accuracy deteriorates the framework's performance due to a sub-optimal policy in the Q-function. Following the results in [41] and [43], we replace the global critic with a twin delayed deep deterministic (TD3) critic pair.


Fig. 4: Mean reward per episode: comparing our approach with MADDPG, decentralized MADDPG, and DDPG.

With the twin delayed global critics, the new policy gradient in (23) becomes

\nabla_{\theta^f} J^f = \underbrace{\mathbb{E}_{\mathbf{s},\mathbf{a} \sim \mathcal{D}}\Big[ \nabla_{\theta^f} \pi^f\big(a^f \mid s^f\big) \, \nabla_{a^f} Q^{\psi_i}_{G_i}(\mathbf{s}, \mathbf{a}) \Big]}_{\text{TD3 global critic}} + \underbrace{\mathbb{E}_{s^f, a^f \sim \mathcal{D}}\Big[ \nabla_{\theta^f} \pi^f\big(a^f \mid s^f\big) \, \nabla_{a^f} Q^f_{\phi^f}\big(s^f, a^f\big) \Big]}_{\text{UAV critic}},   (23)

with the loss function

L(\psi_i) = \mathbb{E}_{\mathbf{s},\mathbf{a},r,\mathbf{s}'}\Big[ \big( Q^{\psi_i}_{G_i}(\mathbf{s},\mathbf{a}) - y_G \big)^2 \Big],   (24)

and the global target y_G is updated as

y_G = r_G + \gamma (1-d) \min_{i=1,2} Q^{\psi'_i}_{G_i}(\mathbf{s}', \mathbf{a}') \big|_{a'^f = \pi'^f(s'^f)}.   (25)

Using (21) and (22), the agents' critic networks are updated. Our proposed algorithm is summarized in Algorithm 1 and illustrated in Fig. 3. We now discuss the reasons for choosing the rewards of the system. Our first reward is directly related to the objective of the problem but is calculated independently for each UAV; each UAV strives to get as close to its maximum as possible by maximizing its own reward. Due to the neural network's weakness in handling interference, which plays a very important role in the management of radio resources, our global reward pushes down the total system interference. We apply TD3 because it trains the global critic with two extra networks, where d is the delay hyper-parameter.
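The following is a hedged sketch of the twin-global-critic target, taking the minimum of the two target global critics as in (25) and in line 12 of Algorithm 1, together with the Polyak target update and the delayed actor refresh; the names are illustrative and the (1-d) factor of (25) is left out of the code, which instead comments on the delayed-update schedule.

```python
import torch

@torch.no_grad()
def twin_global_target(r_g, s_next, a_next, target_critic_1, target_critic_2,
                       gamma=0.99):
    """Clipped double-Q target for the global critic (cf. Eq. (25), Alg. 1, line 12):
    the minimum of the two target global critics bounds overestimation.
    Eq. (25) additionally carries a (1 - d) factor tied to the delayed updates."""
    q1 = target_critic_1(s_next, a_next)
    q2 = target_critic_2(s_next, a_next)
    return r_g + gamma * torch.min(q1, q2)

def soft_update(target_net, net, tau=5e-4):
    """Polyak averaging of target parameters: psi' <- tau * psi + (1 - tau) * psi'."""
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)

# Delayed policy updates (the d of Algorithm 1): actors and target networks are
# refreshed only every d critic steps, e.g.
#   if step % d == 0:
#       update_actors(...); soft_update(target_actor, actor)
```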

C. Computational Complexity

In this section we investigate the computational complexity of the algorithm. We divide it into two parts and address each in its dedicated paragraph: the complexity is related to a) the number of trainable variables and b) the total number of neural networks applied.

Algorithm 1: Two time frame modified MADDPG
1: Initiate the environment, generate UAVs and users
2: Inputs: numbers of agents and users, action a_t and state s_t dimensions
3: Initialize the global critic networks Q^G_{ψ_1} and Q^G_{ψ_2}, the target global critic networks Q'^G_{ψ_1} and Q'^G_{ψ_2}, and each agent's policy and critic networks
4: for t = 1 to T do
5:   for n = 1 to N do
6:     for each agent f do
7:       Observe state s^f_t and take action a^f_t
8:     s_t = [s^1_t, . . . , s^F_t], a_t = [a^1_t, . . . , a^F_t]
9:     Receive the global and local rewards r_{G,t} and r^f_t
10:    Store (s_t, a_t, r^f_t, r_{G,t}, s_{t+1}) in the replay buffer D
11:    Sample a minibatch of size S, (s^j, a^j, r^j_g, r^j_ℓ, s'^j), from the replay buffer D
12:    Set y^j_g = r^j_g + γ min_i Q^{g_i}_{ψ'_i}(s'^j, a'^j)
13:    Update the global critics by minimizing the loss:
14:      L(ψ_i) = (1/S) Σ_j ( Q^{g_i}_{ψ_i}(s^j, a^j) - y^j_g )²
15:    Update the target parameters: ψ'_i ← τ ψ_i + (1 - τ) ψ'_i
16:    if episode mod d = 0 then
17:      Train the actor and critic networks
18:      for each agent f do
19:        Update the local critic by minimizing the loss:
             L(φ_i) = (1/S) Σ_j ( Q^i_{φ_i}(s^j_i, a^j_i) - y^j_i )²
           Update the local actor:
             ∇_{θ_i} J ≈ (1/S) Σ_j [ ∇_{θ_i} π_i(a_i | s^j_i) ∇_{a_i} Q^{g_1}_{ψ_1}(s^j, a^j) + ∇_{θ_i} π_i(a_i | s^j_i) ∇_{a_i} Q^i_{φ_i}(s^j_i, a^j_i) ]
20:      Update the target network parameters:
             θ'_i ← τ θ_i + (1 - τ) θ'_i,  φ'_i ← τ φ_i + (1 - τ) φ'_i

a) Number of trainable variables: The input of the global Q-function includes the actions and states of all agents. In MADDPG it is similar, but each agent owns such a critic. If we denote the observation and action space sizes by ω and α, the number of trainable variables of MADDPG is O(c²(ω + α)), where c indicates the number of agents. In the utilized algorithm, however, we apply a global critic which is shared among the agents; hence the complexity is O(c(ω + α)). On the other hand, each UAV critic only observes its own agent's states and actions and is fed by them, so for each UAV critic we have O(ω + α) as the parameter space.
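As a quick numeric illustration of this scaling claim, using arbitrary example sizes (three agents, observation size 64, action size 12):

```python
c, omega, alpha_dim = 3, 64, 12               # agents, observation size, action size

maddpg_inputs = c * c * (omega + alpha_dim)   # every agent owns a critic fed by all c agents
shared_global = c * (omega + alpha_dim)       # one shared global critic fed by all c agents
per_uav_critic = omega + alpha_dim            # each local critic sees only its own agent

print(maddpg_inputs, shared_global, per_uav_critic)   # 684 228 76
```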

b) Number of nodes in the neural network: In conventional MADDPG the total number of networks is 2 × (c(1_Q + 1_A)). The multiplication by 2 is because of the one extra network used as a target network; 1_Q and 1_A are the numbers of critic and actor networks, respectively, and c indicates the number of agents. In our proposed algorithm we apply TD3 and, due to the 2 critics in its global network, the total number of networks is 2 × (c(1_Q + 1_A) + 2_{Q_G}), where 2_{Q_G} indicates the 2 critic networks in our global critic.


TABLE II: Simulation parameters

UAV-VLC CoMP-enabled environment parameters : Value
FoV Ψ_c : 60° [40]
Detector area of PD A_PD : 1 cm² [40]
p_max : 36 W
p^f_max : 12 W
Number of users : 9
Number of UAVs : 3
Half power angle θ_{1/2} : 30° [40]
Order of the Lambertian emission $ : 1
PD responsivity : 0.53 A/W [40]
Noise variance : 10e-12
R^{m,f}_min : 0.1 kbps [44]
R'^{m,f}_min : 0.1 kbps [44]
J_K : 3
x_mid : 25 m
y_mid : 25 m
z_max : 100 m
a_max : 2√3 m/s²
v_max : 10√3 m/s [21], [45]
v'_max : 5√2 m/s [21]
T : 500 × 10e-1 s
N : 100
δ : 1 ms

Neural networks hyper-parameters : Value
Experience replay buffer size : 50000
Mini batch size : 64
Number/size of local actor networks hidden layers : 2 / 1024, 512
Number/size of local critic networks hidden layers : 2 / 512, 256
Number/size of global critic hidden layers : 3 / 1024, 512, 256
Critic/Actor networks learning rate : 0.001 / 0.0001
Discount factor : 0.99
Target networks soft update parameter τ : 0.0005
Number of episodes : 500
Number of iterations per episode : 100
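For convenience, the parameters of Table II can be collected into a single configuration object before building the environment and the networks; this layout is only a suggested sketch and is not taken from the authors' released source code [46].

```python
SIM_CONFIG = {
    # UAV-VLC CoMP environment (Table II)
    "fov_deg": 60, "pd_area_cm2": 1.0, "p_max_w": 36.0, "p_f_max_w": 12.0,
    "num_users": 9, "num_uavs": 3, "half_power_angle_deg": 30,
    "lambertian_order": 1, "pd_responsivity_a_per_w": 0.53, "noise_sigma2": 1e-12,
    "r_min_kbps": 0.1, "j_k": 3, "x_mid_m": 25, "y_mid_m": 25, "z_max_m": 100,
    "slot_delta_s": 1e-3, "slots_per_frame": 100,
    # Neural-network hyper-parameters (Table II)
    "replay_buffer": 50_000, "batch_size": 64,
    "actor_hidden": (1024, 512), "critic_hidden": (512, 256),
    "global_critic_hidden": (1024, 512, 256),
    "lr_critic": 1e-3, "lr_actor": 1e-4, "gamma": 0.99, "tau": 5e-4,
    "episodes": 500, "iters_per_episode": 100,
}
```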

V. SIMULATION

In this section, an evaluation of the performance of the proposed algorithm is presented via numerical results. The simulation settings are summarized in Table II. For the main simulation, the number of UAVs is 3, 9 users are moving on the ground, and the environment is limited to a cylinder with a radius of 50 m and a height of 100 m. The duration of a frame is 100 ms and the slots are 1 ms; the constant acceleration is specified in the frame, and in the slots the final speed of each slot is used as the initial speed of the next slot. The maximum speed for UAVs is 10 m/s in each direction, and the maximum acceleration is 2 m/s². All information about the neural networks and the number of layers for each network is given in Table II. In the following, we discuss all the figures obtained from the simulation. First, we explain the baselines, and then we examine the reward, the overall data rate of the network, the allocated powers, the effect of the minimum data rate, the trajectory of the UAVs, and finally the impact of constant velocity versus constant acceleration on the performance. In addition, the source code of the proposed modified MADDPG is available in [46].

A. Solution Baselines

The baselines include three solution methods, as shown in Fig. 4: MADDPG, fully decentralized MADDPG, and DDPG, described below.

• MADDPG: A standard method that operates on a multi-agent basis and has DDPG neural networks. The various agents interact with each other through a central server.

• Fully decentralized MADDPG: In this solution method, agents work independently of each other, have no contact, and only have their own observations from the environment.

• DDPG: A single agent interacts with the environment and has the whole environment as its neural network input.

B. System model baselines

• Non-CoMP: A conventional system model without CoMP technology. Users with poor channels do not receive any assistance from nearby UAVs.

• Constant velocity: This type of motion is rather unrealistic and is also a subset of accelerated motion in which the acceleration is zero; it is the model applied to UAVs in previous works.

C. Trade off between data rate and power consumption

Given the weight factor α, which changes the effect of each objective in the primary reward, we sweep this factor from 0 to 1 with a step size of 0.2. Note that this approach allows us to choose the priority between power and data rate flexibly. It can be seen in Fig. 5a that the goal is only to minimize the amount of power, where only three constraints are active: the required minimum data rate for each type of user, and clipping the power between 0.1 and p^f_max, which helps stabilize the learning process. In this figure, the only goal of the agents is to meet the minimum rate for each type of user. In Fig. 5b, our reward is dominated by the power minimization rather than the data rate. In Fig. 5b, the sudden drop of the reward is due to the movement of users and handover: the UAV is no longer able to track its users, and users switch among UAVs. In Fig. 5c, the effect of the data rate exceeds the effect of power, and as we can observe in the figure, the reward converges to a positive value; the effect of power minimization remains strong, but the priority is to maximize the data rate. In Fig. 5d, the impact of the data rate increases further, and almost all agents' policies are affected by the data rate of the whole system; the sudden drop can be related to user movement and changes in user assignment. In Fig. 5e, the impact of the data rate increases again, and at the beginning of learning the agents try to increase the reward; however, the commitment to reduce power consumption lowers the reward in later episodes, and convergence to another point is due to the power minimization term. Finally, in Fig. 5f, the agents seek to maximize the network data rate without considering power consumption. The fluctuations in Fig. 5 can be attributed to the complexity of the system model and the corresponding constraints, as the agents attempt to find the best case for the objective function.


Fig. 5: The rewards of the three UAVs and their convergence: (a) α = 0.0, (b) α = 0.2, (c) α = 0.4, (d) α = 0.6, (e) α = 0.8, and (f) α = 1.0.

Fig. 6: The data rate and its convergence: (a) α = 0.0, (b) α = 0.2, (c) α = 0.4, (d) α = 0.6, (e) α = 0.8, and (f) α = 1.0.

D. Data rate

Here, we gather all the data rate figures in Fig. 6 to compare them with each other. In Fig. 6a, the goal is only to minimize the total power of the system. In this case we do not have the term that maximizes the data rate, so it can be seen in the figure that the data rate changes abruptly, and only three constraints are involved in this situation, including the minimum system data rate and the power clipping between 0.1 and p^f_max, for the reason mentioned above. In Fig. 6b, while increasing the weight factor α, an attempt is initially made to increase the value of the data rate, but because the main priority is on power, the data rate decreases and converges to a lower value. Fig. 6c shows the trade-off between the data rate and power consumption, but as we can see in the figure, there are no obvious differences between the two terms of the reward; it seems that the two parts have neutralized each other's effect, as clearly shown in Fig. 6c. In Fig. 6d, Fig. 6e, and Fig. 6f, as the weight of the data rate (α) in the reward increases, the priority changes in favor of increasing the data rate. However, due to the existence of a


Fig. 7: The total power consumption and its convergence: (a) α = 0.0, (b) α = 0.2, (c) α = 0.4, (d) α = 0.6, (e) α = 0.8, and (f) α = 1.0.

Fig. 8: Effect of the minimum data rate on the total data rate (α = 1.0 and α = 0.6).

power control term, it is evident that the effect of the power minimization leads to decreasing the total data rate. The whole neural network strives to achieve the highest data rate.

E. Power

In Fig. 7, we examine the transmission power of the UAVs in downlink communication. As mentioned in the previous sections, in Fig. 7a, due to the two constraints for the minimum data rate and the clipping of the power between 0.1 and p^f_max, the allocated power does not converge to zero, and the agents try to establish the minimum data rate. By increasing the weight of the data rate (α) in the reward, it can be noticed that the power consumption further increases to meet the constraints (12e) and (12f). In Fig. 7e, it can be seen that there is a significant decrease in power consumption at the beginning of the run, which can be justified for the same reasons as above. In Fig. 7f, we can observe that the system's whole purpose is to maximize the user data rate and meet the minimum requirements for the users, but the central server makes the agents cooperate while reducing the interference; this central server also prevents maximum power usage.

F. Minimum data rate constraint

The more we increase the minimum data rate constraint, the smaller the set of feasible solutions gets. As illustrated in Fig. 8, we see a continuous reduction in the data rate of the entire network as the minimum data rate constraint increases, and this drop is more tangible for larger values of the minimum data rate constraint.

G. CoMP versus non-CoMP

Fig. 9 shows a complete run with F = 5 and M = 15. On average, we observe a 12% improvement in the reward of the CoMP system compared with the non-CoMP system. Fig. 10 shows the variation of the number of UAVs and users and of the α value, and their impact on the data rate and the power. From this figure we can observe that as the number of UAVs increases, the data rate increases. The power consumption of the CoMP system is also significantly reduced in comparison with non-CoMP. The effect of CoMP is similar when increasing the number of users, i.e., using CoMP enhances the data rate. CoMP users also reduce the power consumption of each UAV.

H. Constant acceleration versus constant velocity

Here, we provide the same conditions regarding time and resource allocation to compare the two modes of motion: motion with constant acceleration and motion with constant speed. From our simulation results, considering the reward with weight factor α = 0.8, we perceive a better result with constant acceleration motion in terms of convergence value and convergence speed. To justify this behavior, we can rely on the fact that moving at a constant acceleration makes UAVs more maneuverable and better and faster at tracking the users on the ground, as observed in Fig. 11. Fig. 11 shows that more data rate is allocated to the users, while the amount


Fig. 9: Comparing the CoMP and non-CoMP system models, F = 5, M = 15.

Fig. 10: Data rate versus total power consumption for CoMP and non-CoMP with varying numbers of UAVs and users and varying values of α.

of communication power consumption is significantly reduced. Also, the results are obtained for different values of α.

VI. CONCLUSION

In this paper, we proposed a new frame structure to study the effect of constant acceleration on CoMP UAV-enabled VLC networks. Using two time frames and constant acceleration, which helps improve the system efficiency, we solved this complex problem with a novel machine learning, multi-agent method, and we presented a solution approach tailored to our system model. The results obtained from the simulation showed better performance compared to other methods, and the constant acceleration system with two time frames outperforms the common system model with constant velocity. As future work, a novel RL method related to federated learning in swarm UAV networks can be investigated.

REFERENCES

[1] D. Lee, H. Seo, B. Clerckx, E. Hardouin, D. Mazzarese, S. Nagata, and K. Sayana, "Coordinated multipoint transmission and reception in LTE-advanced: deployment scenarios and operational challenges," IEEE Communications Magazine, vol. 50, no. 2, pp. 148–155, 2012.

[2] L. Liu, S. Zhang, and R. Zhang, "CoMP in the sky: UAV placement and movement optimization for multi-user communications," IEEE Transactions on Communications, vol. 67, no. 8, pp. 5645–5658, 2019.

[3] J. Yao and J. Xu, "Joint 3D maneuver and power adaptation for secure UAV communication with CoMP reception," IEEE Transactions on Wireless Communications, vol. 19, no. 10, pp. 6992–7006, 2020.

Fig. 11: Constant acceleration versus constant velocity.

[4] M. Elhattab, M.-A. Arfaoui, and C. Assi, "CoMP transmission in downlink NOMA-based heterogeneous cloud radio access networks," IEEE Transactions on Communications, vol. 68, no. 12, pp. 7779–7794, 2020.

[5] Y. Zeng, Q. Wu, and R. Zhang, "Accessing from the sky: A tutorial on UAV communications for 5G and beyond," Proceedings of the IEEE, vol. 107, no. 12, pp. 2327–2375, 2019.

[6] Y. Zeng, J. Lyu, and R. Zhang, "Cellular-connected UAV: Potential, challenges, and promising technologies," IEEE Wireless Communications, vol. 26, no. 1, pp. 120–127, 2018.

[7] H. Shakhatreh, A. H. Sawalmeh, A. Al-Fuqaha, Z. Dou, E. Almaita, I. Khalil, N. S. Othman, A. Khreishah, and M. Guizani, "Unmanned aerial vehicles (UAVs): A survey on civil applications and key research challenges," IEEE Access, vol. 7, pp. 48572–48634, 2019.

[8] Z. Ullah, F. Al-Turjman, and L. Mostarda, "Cognition in UAV-aided 5G and beyond communications: A survey," IEEE Transactions on Cognitive Communications and Networking, vol. 6, no. 3, pp. 872–891, 2020.

[9] Q. Wu, S. Zhang, B. Zheng, C. You, and R. Zhang, "Intelligent reflecting surface aided wireless communications: A tutorial," IEEE Transactions on Communications, 2021.

[10] Y. Zeng, R. Zhang, and T. J. Lim, "Wireless communications with unmanned aerial vehicles: Opportunities and challenges," IEEE Communications Magazine, vol. 54, no. 5, pp. 36–42, 2016.

[11] M. Mozaffari, W. Saad, M. Bennis, Y.-H. Nam, and M. Debbah, "A tutorial on UAVs for wireless networks: Applications, challenges, and open problems," IEEE Communications Surveys & Tutorials, vol. 21, no. 3, pp. 2334–2360, 2019.

[12] H. Wang, H. Zhao, W. Wu, J. Xiong, D. Ma, and J. Wei, "Deployment algorithms of flying base stations: 5G and beyond with UAVs," IEEE Internet of Things Journal, vol. 6, no. 6, pp. 10009–10027, 2019.

[13] B. Li, Z. Fei, and Y. Zhang, "UAV communications for 5G and beyond: Recent advances and future trends," IEEE Internet of Things Journal, vol. 6, no. 2, pp. 2241–2263, 2018.

[14] Y. Zeng, J. Xu, and R. Zhang, "Energy minimization for wireless communication with rotary-wing UAV," IEEE Transactions on Wireless Communications, vol. 18, no. 4, pp. 2329–2345, 2019.

[15] D. Bykhovsky and S. Arnon, "Multiple access resource allocation in visible light communication systems," Journal of Lightwave Technology, vol. 32, no. 8, pp. 1594–1600, 2014.

[16] C. Chen, W.-D. Zhong, H. Yang, and P. Du, "On the performance of MIMO-NOMA-based visible light communication systems," IEEE Photonics Technology Letters, vol. 30, no. 4, pp. 307–310, 2017.

[17] J. Chen, Z. Wang, and R. Jiang, "Downlink interference management in cell-free VLC network," IEEE Transactions on Vehicular Technology, vol. 68, no. 9, pp. 9007–9017, 2019.

[18] Y. Wang, M. Chen, Z. Yang, T. Luo, and W. Saad, "Deep learning for optimal deployment of UAVs with visible light communications," IEEE Transactions on Wireless Communications, vol. 19, no. 11, pp. 7049–7063, 2020.

[19] J. Cui, Y. Liu, and A. Nallanathan, "Multi-agent reinforcement learning-based resource allocation for UAV networks," IEEE Transactions on Wireless Communications, vol. 19, no. 2, pp. 729–743, 2019.

[20] H. Peng and X. Shen, "Multi-agent reinforcement learning based resource management in MEC- and UAV-assisted vehicular networks," IEEE Journal on Selected Areas in Communications, vol. 39, no. 1, pp. 131–141, 2020.

[21] F. Tang, Y. Zhou, and N. Kato, "Deep reinforcement learning for dynamic uplink/downlink resource allocation in high mobility 5G HetNet," IEEE Journal on Selected Areas in Communications, vol. 38, no. 12, pp. 2773–2782, 2020.

[22] J. Tang, J. Song, J. Ou, J. Luo, X. Zhang, and K.-K. Wong, "Minimum throughput maximization for multi-UAV enabled WPCN: A deep reinforcement learning method," IEEE Access, vol. 8, pp. 9124–9132, 2020.

[23] T. Zhang, Z. Wang, Y. Liu, W. Xu, and A. Nallanathan, "Caching placement and resource allocation for cache-enabling UAV NOMA networks," IEEE Transactions on Vehicular Technology, vol. 69, no. 11, pp. 12897–12911, 2020.

[24] L. Gupta, R. Jain, and G. Vaszkun, "Survey of important issues in UAV communication networks," IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1123–1152, 2015.

[25] L. Xie, J. Xu, and Y. Zeng, "Common throughput maximization for UAV-enabled interference channel with wireless powered communications," IEEE Transactions on Communications, vol. 68, no. 5, pp. 3197–3212, 2020.

[26] M. S. Ali, E. Hossain, A. Al-Dweik, and D. I. Kim, "Downlink power allocation for CoMP-NOMA in multi-cell networks," IEEE Transactions on Communications, vol. 66, no. 9, pp. 3982–3998, 2018.

[27] M. Hua, Y. Wang, M. Lin, C. Li, Y. Huang, and L. Yang, "Joint CoMP transmission for UAV-aided cognitive satellite terrestrial networks," IEEE Access, vol. 7, pp. 14959–14968, 2019.

[28] A. Kilzi, J. Farah, C. A. Nour, and C. Douillard, "Analysis of drone placement strategies for complete interference cancellation in two-cell NOMA CoMP systems," IEEE Access, vol. 8, pp. 179055–179069, 2020.

[29] Y. Yang, M. Chen, C. Guo, C. Feng, and W. Saad, "Power efficient visible light communication with unmanned aerial vehicles," IEEE Communications Letters, vol. 23, no. 7, pp. 1272–1275, 2019.

[30] M. Kashef, M. Ismail, M. Abdallah, K. A. Qaraqe, and E. Serpedin, "Energy efficient resource allocation for mixed RF/VLC heterogeneous wireless networks," IEEE Journal on Selected Areas in Communications, vol. 34, no. 4, pp. 883–893, 2016.

[31] J. Kong, Z.-Y. Wu, M. Ismail, E. Serpedin, and K. A. Qaraqe, "Q-learning based two-timescale power allocation for multi-homing hybrid RF/VLC networks," IEEE Wireless Communications Letters, vol. 9, no. 4, pp. 443–447, 2019.

[32] S. Ma, F. Zhang, H. Li, F. Zhou, M.-S. Alouini, and S. Li, "Aggregated VLC-RF systems: Achievable rates, optimal power allocation, and energy efficiency," IEEE Transactions on Wireless Communications, vol. 19, no. 11, pp. 7265–7278, 2020.

[33] V. K. Papanikolaou, P. D. Diamantoulakis, P. C. Sofotasios, S. Muhaidat, and G. K. Karagiannidis, "On optimal resource allocation for hybrid VLC/RF networks with common backhaul," IEEE Transactions on Cognitive Communications and Networking, vol. 6, no. 1, pp. 352–365, 2020.

[34] M. Obeed, A. M. Salhab, S. A. Zummo, and M.-S. Alouini, "Joint optimization of power allocation and load balancing for hybrid VLC/RF networks," Journal of Optical Communications and Networking, vol. 10, no. 5, pp. 553–562, 2018.

[35] J. Dai, K. Niu, and J. Lin, "Code-domain non-orthogonal multiple access for visible light communications," in 2018 IEEE Globecom Workshops (GC Wkshps), pp. 1–6, IEEE, 2018.

[36] H. Ren, Z. Wang, S. Han, J. Chen, C. Yu, C. Xu, and J. Yu, "Performance improvement of M-QAM OFDM-NOMA visible light communication systems," in 2018 IEEE Global Communications Conference (GLOBECOM), pp. 1–6, IEEE, 2018.

[37] J. C. Estrada-Jimenez, B. G. Guzman, M. J. F.-G. Garcia, and V. P. G. Jimenez, "Superimposed training-based channel estimation for MISO optical-OFDM VLC," IEEE Transactions on Vehicular Technology, vol. 68, no. 6, pp. 6161–6166, 2019.

[38] Y. Sun, D. Xu, D. W. K. Ng, L. Dai, and R. Schober, "Optimal 3D-trajectory design and resource allocation for solar-powered UAV communication systems," IEEE Transactions on Communications, vol. 67, no. 6, pp. 4281–4298, 2019.

[39] E. Kalantari, H. Yanikomeroglu, and A. Yongacoglu, "On the number and 3D placement of drone base stations in wireless cellular networks," in 2016 IEEE 84th Vehicular Technology Conference (VTC-Fall), pp. 1–6, IEEE, 2016.

[40] H. Zhang, N. Liu, K. Long, J. Cheng, V. C. Leung, and L. Hanzo, "Energy efficient subchannel and power allocation for software-defined heterogeneous VLC and RF networks," IEEE Journal on Selected Areas in Communications, vol. 36, no. 3, pp. 658–670, 2018.

[41] M. Parvini, M. R. Javan, N. Mokari, B. Abbasi, and E. A. Jorswieck, "AoI-aware resource allocation for platoon-based C-V2X networks via multi-agent multi-task reinforcement learning," 2021.

[42] O. Aydin, E. A. Jorswieck, D. Aziz, and A. Zappone, "Energy-spectral efficiency tradeoffs in 5G multi-operator networks with heterogeneous constraints," IEEE Transactions on Wireless Communications, vol. 16, no. 9, pp. 5869–5881, 2017.

[43] H. Van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and J. Tsang, "Hybrid reward architecture for reinforcement learning," arXiv preprint arXiv:1706.04208, 2017.

[44] M. Savi, M. Tornatore, and G. Verticale, "Impact of processing-resource sharing on the placement of chained virtual network functions," IEEE Transactions on Cloud Computing, 2019.

[45] X. Liu, Y. Liu, and Y. Chen, "Machine learning empowered trajectory and passive beamforming design in UAV-RIS wireless networks," IEEE Journal on Selected Areas in Communications, 2020.

[46] M. R. Maleki, M. Robat Mili, M. R. Javan, N. Mokari, and E. A. Jorswieck, "Multi Agent Reinforcement Learning Trajectory Design and Two-Stage Resource Management in CoMP UAV VLC Networks," 2021.

