
Automatic Successive Reinforcement Learning with Multiple Auxiliary Rewards

Zhao-Yang Fu, De-Chuan Zhan, Xin-Chun Li and Yi-Xing Lu
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

{fuzy,lixc}@lamda.nju.edu.cn, [email protected], [email protected]

Abstract
Reinforcement learning has played an important role in decision-making-related applications, e.g., robotics motion, self-driving, recommendation, etc. The reward function, as a crucial component, affects the efficiency and effectiveness of reinforcement learning to a large extent. In this paper, we focus on the investigation of reinforcement learning with more than one auxiliary reward. We find that different auxiliary rewards can boost the learning rate and effectiveness in different stages, and consequently we propose Automatic Successive Reinforcement Learning (ASR), which grades and selects auxiliary rewards for efficient reinforcement learning by stages. Experiments and simulations show the superiority of our proposed ASR on a range of environments, including OpenAI classical control domains and the video games Freeway and Catcher.

1 Introduction
Reinforcement learning [Sutton and Barto, 1998] has played an important role in many realistic domains, such as robotics, self-driving and recommendation systems. In reinforcement learning, agents interact with the environment by trial and error. Meanwhile, a reward signal, indicating how well they are performing, is given. In general cases, the goal of an agent is to maximize the expected cumulative reward.

The reward function is crucial to reinforcement learning [Ng et al., 1999]. For policy-based reinforcement learning methods, the reward provided by the environment determines the search directions of policies, which eventually affects the final policies obtained. For example, the reward signal in the FetchReach environment [Plappert et al., 2018] becomes higher as the gripper gets closer to the target, so the policy obtained leads the gripper to approach the target location. More generally, the choice of reward function is reflected in the efficiency of reinforcement learning approaches, e.g., the shaping reward is more efficient than the original reward in the Pathfinding environment [Brys et al., 2014a]. Though various rewards may lead to the same final results, a reward function without elaborate design may require more exploration.

A severe problem of reinforcement learning is how to train fast and efficiently using the information provided by rewards. Although reinforcement learning has achieved great success and has been applied in many domains, existing methods have difficulty exploring effectively to learn a good policy when the reward is sparse and rare. They usually need to interact with the environment millions of times, especially in complicated real-world environments, which makes the whole procedure intractable. Delayed rewards and sparse rewards thus build a barrier to the widespread applicability of reinforcement learning.

It is therefore natural to select or design an appropriate reward for better and more efficient reinforcement learning, especially for tasks with sparse and rare reward functions. In real-world scenarios, rewards can be designed from many aspects. A typical example is the traffic light problem, in which car delay and system throughput are two correlated reward signals [Brys et al., 2014b]. More generally, a reinforcement learning task in the real world is usually accompanied by many reward signals and much domain knowledge. Therefore, the optimal policy can be calculated in many ways.

Auxiliary rewards have attracted much attention recently. Inspired by human learning, Bengio et al. proposed curriculum learning: when training reinforcement learning models, we can start with easier tasks and gradually increase their difficulty level, which benefits learning effectiveness [Bengio et al., 2009]. Reward shaping [Ng et al., 1999] provides another way to modify the original reward function, preserving the optimal policy while boosting performance.

Making use of auxiliary rewards can definitely improve the model. Previous work has shown that auxiliary rewards are helpful to the learning rate, since multiple auxiliary rewards are expected to capture more information during learning. However, existing multiple auxiliary rewards methods are limited to simple linear combinations [Brys et al., 2014a] or ensembles of multiple rewards [Brys et al., 2017], which neglect the stage-wise influence of rewards and the reward selection we consider. In this paper, we focus on the investigation of reinforcement learning with multiple auxiliary rewards, exploiting and selecting them in different stages.

It is notable that in reinforcement learning the guidance provided by rewards varies across stages. For example, a simple ball moving task contains rewards for approaching the ball, grasping it, and moving it to a location; the reward based on the distance between the gripper and the ball matters most at first, and as the gripper approaches the ball, the reward for grasping becomes more important, and so on. Curriculum reinforcement learning manually designs the stages and invokes different reward functions and learning strategies for better results. However, in most cases, the hierarchical stages are hard to design.

In this paper, we propose Automatic Successive Reinforcement Learning (ASR) with multiple auxiliary rewards. ASR performs reward selection automatically at each training step, which is the “atom” of the training stages. It is expected that ASR can achieve faster and better training. We empirically investigate the effectiveness of ASR, and it achieves superior performance on various environments.

In the remainder of this paper, we start with a brief review of related work, then present the ASR approach and the experimental results. Finally, we conclude the paper.

2 Related Work
The exploitation of different rewards has attracted much attention recently. In this paper, our method concentrates on taking advantage of multiple auxiliary rewards. There are two kinds of methods using multiple rewards: curriculum learning and multi-objective reinforcement learning (MORL).

Curriculum learning breaks a complicated problem down into a sequence of manageable stages and tasks manually. Since the learning process of curriculum learning is similar to human learning, it has been widely used in robotics. Most practical curriculum learning approaches use manual task sequences [Karpathy and van de Panne, 2012]. Some general frameworks aim to generate increasingly difficult tasks [Schmidhuber, 2013]. Unlike ASR, all existing studies in curriculum learning rely on a curriculum designed from easy to difficult: the agent starts with easier tasks and learns all tasks in order of difficulty. Designing an effective curriculum by hand is a complex problem. In contrast, ASR makes use of multiple auxiliary rewards simultaneously to help training and does not require human effort for curriculum design. In ASR, the multiple auxiliary rewards only need to encode some information for learning the task.

MORL, from another aspect, focuses on handling tasks with inherently multiple, opposing objectives. According to how many policies are computed in a single run, there are two types of methods. The first type solves MORL with a scalarization function and is known as the single-policy methods, including the min-max method [Lin, 2005], the weighted sum method [Kim and De Weck, 2006], and the Chebyshev scalarization method [Moffaert et al., 2013]. More and more complicated scalarization functions have been proposed to approximate the Pareto frontier better. Other approaches seek to find multiple policies in a single run, e.g., the radial method and the Pareto following method for discrete Pareto frontier approximation [Parisi et al., 2014], and the manifold method for continuous Pareto frontier approximation [Pirotta et al., 2015]. Solutions of such problems often need to make a trade-off between multiple objectives, since the objectives conflict with each other. Different from our scenario, MORL aims at the Pareto frontier, which is essentially a preprocessing step for situations with missing constraints. The goal of ASR is to learn the optimal policy faster with multiple auxiliary rewards.

Therefore, many researchers have been devoted to taking advantage of multiple rewards. However, there is no previous research performing reward selection for deep reinforcement learning. In this paper, a novel ASR framework is proposed, which utilizes all rewards to learn an agent and can determine the importance of each reward automatically. Specifically, by maximizing the improvement of the objective function or the size of the parameter update at each training step, ASR utilizes the gradient information from multiple auxiliary rewards to find a better update direction. Consequently, faster and better performance can be achieved.

3 Preliminaries
In this section, we introduce the concepts and methods required for the derivation of ASR. We start with the notation of reinforcement learning.

3.1 Reinforcement Learning
A reinforcement learning environment usually consists of four components: a state space S, an action space A, a state transition function T and a scalar reward function R. At time t, an agent observes the state s_t ∈ S, then takes action a_t ∈ A and receives a real-valued reward r_t. A probabilistic policy π is defined as a mapping from the state space to probability distributions over the action space.

In general, the state value function V^π and the state-action value function Q^π for policy π are defined by the discounted cumulative reward: V^π(s) = E_π[∑_{t=0}^∞ γ^t r_t | s_0 = s] and Q^π(s, a) = E_π[∑_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a], where γ is the discount factor. The advantage function is usually defined as A^π(s, a) = Q^π(s, a) − V^π(s). Reinforcement learning algorithms aim to maximize the expected total reward J_π = E_π[∑_{t=0}^∞ γ^t r_t].
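To make these definitions concrete, here is a minimal NumPy sketch (not from the paper) of the discounted return underlying V^π, Q^π and J_π, computed from one sampled trajectory of rewards; Monte-Carlo estimates of the value functions average such returns over many trajectories.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return G_t = sum_{k>=0} gamma^k * r_{t+k} for every step of one trajectory.

    Averaging G_0 over many trajectories estimates J_pi; averaging G_t over
    trajectories passing through a given state estimates V^pi of that state.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A sparse reward that only fires at the last step of a 4-step episode:
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.9))  # [0.729 0.81 0.9 1.]
```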

3.2 Reward Shaping
Reward shaping is a useful method to incorporate auxiliary knowledge safely. The purpose of reward shaping is to explore how to modify the native reward function without misleading the agent. Let F be the shaping function; then R + F is the new reward. Ng et al. point out that when F is an arbitrary potential-based shaping function, the optimality of policies will not be changed [Ng et al., 1999]. According to this, many shaping functions can be constructed based on expert demonstrations or domain knowledge. A shaping function F is potential-based if there exists a real-valued function Φ : S → R such that Equation 1 holds for all s, s′ ∈ S.

F(s, a, s′) = γΦ(s′) − Φ(s) . (1)

Typically, the potential function indicates how good a state is, hence it is beneficial for learning a policy. Potential-based reward shaping has been successfully applied in many complex environments, such as StarCraft [Efthymiadis and Kudenko, 2013], RoboCup TakeAway [Devlin et al., 2011] and Mario [Brys et al., 2014a].
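As an illustration of Equation 1, the sketch below builds a shaped reward from an arbitrary potential function; the height-based potential for a MountainCar-like state is an assumed example, not the paper's exact choice.

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based shaping (Equation 1): the agent is trained on
    R + F with F(s, a, s') = gamma * Phi(s') - Phi(s)."""
    return r + gamma * potential(s_next) - potential(s)

# Assumed example: state = (position, velocity); higher position is "better".
height_potential = lambda s: s[0]

r_aux = shaped_reward(-1.0, s=(-0.50, 0.00), s_next=(-0.45, 0.02),
                      potential=height_potential)
```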


4 Proposed Method
The detailed approach and its optimization strategies are presented in this section. We restrict the discussion to policy gradient approaches and present the method concretely.

4.1 Automatic Successive Reinforcement Learning
Most recent deep reinforcement learning algorithms aim to optimize an objective function using a gradient descent method. Multiple auxiliary rewards can provide more gradient information, and the improvement of the objective function or the size of the parameter update tells us how much an agent learns at each iteration. Thus we can obtain a better gradient direction by combining the multiple gradient directions. Consequently, we propose the Automatic Successive Reinforcement Learning (ASR) framework, which focuses on the exploitation of multiple auxiliary rewards. At each training step, we determine the weight of each reward automatically, and then a single-reward algorithm is applied to learn an agent using the weighted sum of the multiple rewards.

It is worth emphasizing that the setting of ASR is different from MORL's. MORL attempts to approximate the Pareto frontier, and usually the optimal policy is not unique. In this paper, however, the auxiliary rewards are constructed by potential-based reward shaping, and thus the optimal policy is preserved for each auxiliary reward. ASR aims to learn the optimal policy faster and better with the help of the auxiliary rewards.

The trust region policy optimization method (TRPO) is one of the state-of-the-art policy gradient methods [Schulman et al., 2015]. TRPO is an effective method for optimizing large nonlinear policies such as neural networks with guaranteed monotonic improvement. In ASR, we choose TRPO as the base learner.

The standard TRPO optimization problem is defined by:

max_θ L_{θold}(θ) = E_t[ (π_θ(a|s) / π_{θold}(a|s)) Q^{π_{θold}}(s, a) ]
s.t. E_t[D_KL(π_{θold}(·|s) || π_θ(·|s))] ≤ δ , (2)

where θ and θold are the parameters of the policy network, π_{θold} is the current policy, π_θ is the policy to be optimized, E_t[·] denotes the empirical average over several sampled trajectories such as (s_0, a_0, s_1, a_1, . . . , s_T), and δ is the desired KL divergence. TRPO maximizes an objective function subject to a constraint on the KL divergence between the new policy and the old policy. Such a constraint avoids changing the policy parameters too much; consequently, the training process is more stable.

In practice, Equation 2 is hard to handle, so we usually deal with an approximate problem instead. Let H be the Hessian matrix of the KL divergence. After making a linear approximation to the objective and a quadratic approximation to the constraint, the approximate problem is:

max_θ (∇L_{θold}(θ))^T (θ − θold)
s.t. (1/2)(θ − θold)^T H (θ − θold) ≤ δ . (3)

Now Equation 3 becomes a natural gradient problem [Amari, 1998]. The search direction z is given by:

z = H^{-1} g , (4)

where g = ∇L_{θold} is the gradient of the objective function. The search direction can be computed by approximately solving the equation Hz = g, which can be efficiently solved using the conjugate gradient algorithm [Schulman et al., 2015]. Suppose the step size is β; the parameter θ is updated by:

θ = θold + βz . (5)

From the KL divergence constraint (1/2)β^2 z^T H z ≤ δ, the maximum value of β is:

β = √(2δ / (z^T H z)) . (6)
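A sketch of this natural gradient step (Equations 4–6), assuming a routine `hvp(v)` returning the KL-Hessian-vector product Hv is available from the surrounding TRPO code; the conjugate gradient solver never forms H explicitly.

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H z = g using only Hessian-vector products (Eq. 4)."""
    z = np.zeros_like(g)
    r = g.copy()          # residual g - H z (z = 0 initially)
    p = g.copy()
    rr = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rr / (p @ Hp)
        z += alpha * p
        r -= alpha * Hp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return z

def natural_gradient_step(theta_old, g, hvp, delta=0.01):
    """Equations 4-6: z = H^{-1} g, beta = sqrt(2*delta / z^T H z), theta = theta_old + beta*z."""
    z = conjugate_gradient(hvp, g)
    beta = np.sqrt(2.0 * delta / (z @ hvp(z)))
    return theta_old + beta * z
```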

Ideally, the maximization of Equation 3 should follow the gradient direction of the objective in Equation 2. In the multiple rewards scenario, what affects the gradient direction includes the importance of each gradient and the differences between the old policy and the new one. The policy update is guided by multiple search directions. However, how to select rewards while using as much information from all reward signals as possible is not an obvious problem. It is notable that a linear combination of potential-based shaping functions is also potential-based. We can therefore formulate this problem as computing a weighted sum of multiple rewards. At each training step, the optimal weight of each reward is determined by solving an optimization problem.

In scenarios where there are multiple reward functions, the multiple rewards can be denoted as a vector R = [R + F_1, R + F_2, . . . , R + F_n], where R is the original reward and F_i is a potential-based shaping function. Corresponding to the i-th reward signal, L_i denotes TRPO's objective function, g_i the gradient of L_i, z_i the search direction, and w_i the weight of the i-th reward. Their vector forms are denoted by L = [L_1, L_2, . . . , L_n]^T, G = [g_1, g_2, . . . , g_n], Z = [z_1, z_2, . . . , z_n], and w = [w_1, w_2, . . . , w_n]^T. According to Equation 4, each z_i = H^{-1} g_i, and hence HZ = G. The objective function of the weighted sum of multiple rewards is L^T w, its gradient is Gw, and the search direction is Zw. For fixed weights w, the optimization problem for the weighted sum of multiple rewards is:

max_θ (∇(L^T w))^T (θ − θold)
s.t. (1/2)(θ − θold)^T H (θ − θold) ≤ δ . (7)

Note that the solution of Equation 7 depends on the weights w. ASR aims to determine the weight of each reward automatically, so the weights w are variables to be optimized. When w has certain properties, ASR can select or weight the auxiliary rewards in different stages of reinforcement learning. Consequently, ASR can automatically “generate” a new combined reward for better reinforcement learning. In this paper, we advocate three strategies to determine the weight of each reward based on maximizing the improvement of the objective function or the size of the parameter update: reward selection, reward shrinking and maximum gradient.

In the ASR framework, each reward's gradient and search direction are computed first at each iteration, and then we find the optimal combination of rewards by one of these strategies. A summary of the ASR framework is shown in Algorithm 1.


Implementation 1: Reward Selection
The target of the reward selection strategy is to find the optimal weights w that maximize the improvement of the objective function under an ℓ1 constraint on w. The optimization problem then becomes:

max_w max_θ (∇(L^T w))^T (θ − θold)
s.t. (1/2)(θ − θold)^T H (θ − θold) ≤ δ,
     ‖w‖_1 = 1, w_i ≥ 0, i = 1, 2, . . . , n . (8)

Substituting Equation 5 and Equation 6 into Equation 8, we get the ℓ1-constrained ASR problem:

max_w w^T Z^T G w
s.t. ‖w‖_1 = 1, w_i ≥ 0, i = 1, 2, . . . , n . (9)

The ℓ1 constraint produces sparse solutions, therefore inherently performing reward selection. At each training step, the reward that can obtain the maximal improvement is more likely to be selected to perform the parameter update.

Implementation 2: Reward Shrinking
We propose the reward shrinking strategy by applying an ℓ2 constraint on the weights. The ℓ2 constraint has a different effect from the ℓ1 constraint; namely, it forces the weights to be spread out more equally. The ℓ2 constraint is more likely to produce a dense solution, hence it can take advantage of more rewards at each iteration. The ℓ2-constrained ASR problem is:

max_w w^T Z^T G w
s.t. ‖w‖_2 = 1, w_i ≥ 0, i = 1, 2, . . . , n . (10)

Implementation 3: Maximum Gradient
In the general case, agents learn the target policy by gradient descent methods, and the learning process of an agent is reflected in the parameter update. An uninformative reward returns a small gradient, leading to slow learning. Our maximum gradient strategy is the radical one: it seeks the search direction with the biggest change in parameter space, i.e., we maximize the parameter update ‖θ − θold‖_2. The optimization problem of the maximum gradient strategy is:

max_w (w^T Z^T Z w) / (w^T Z^T H Z w)
s.t. ‖w‖_2 = 1, w_i ≥ 0, i = 1, 2, . . . , n . (11)

Here we could remove the ℓ2 constraint because the objective does not depend on the magnitude of the weights. Equation 11 can be rewritten as Equation 12:

max_w w^T Z^T Z w
s.t. w^T Z^T H Z w = 1, w_i ≥ 0, i = 1, 2, . . . , n . (12)
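Ignoring the non-negativity constraint (as the text does until Section 4.2), Equation 11 is a generalized Rayleigh quotient; below is a minimal sketch of its reduction to a generalized eigenproblem, assuming Z^T H Z is positive definite. This is an illustrative computation, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def max_gradient_weights(Z, G):
    """Maximize (w^T Z^T Z w) / (w^T Z^T H Z w) (Equation 11), ignoring w_i >= 0.

    Z : matrix of search directions z_i = H^{-1} g_i (one column per reward).
    G : matrix of gradients g_i, so that Z^T H Z = Z^T G (since H z_i = g_i).
    """
    A = Z.T @ Z                    # numerator matrix
    B = Z.T @ G                    # denominator matrix Z^T H Z
    B = 0.5 * (B + B.T)            # symmetrize against numerical asymmetry
    vals, vecs = eigh(A, B)        # generalized eigenproblem A w = lambda B w
    w = vecs[:, -1]                # eigenvector of the largest eigenvalue
    return w / np.linalg.norm(w)
```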

Algorithm 1 The pseudo-code of the ASR framework

Initialize policy π
for iteration i = 0, 1, . . . until convergence do
    Run policy π_{θold} in the environment for T time steps
    for reward k = 1, 2, . . . , n do
        Compute all advantage values A^{π_{θold}}_k
        Compute the gradient g_k of L_k and the search direction z_k
    end for
    Find the optimal weights w
    θ ← θ + βZw
    θold ← θ
end for
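A Python-style sketch of one iteration of this framework; `conjugate_gradient` is the routine sketched earlier, while `hvp` (KL-Hessian-vector products), the per-reward gradients and the weight-finding strategy are assumed to be supplied by a TRPO implementation, so this is only an outline of Algorithm 1, not the authors' code.

```python
import numpy as np

def asr_iteration(theta_old, reward_grads, hvp, find_weights, delta=0.01):
    """One outer-loop step of Algorithm 1 (a sketch under the assumptions above).

    reward_grads : gradients g_k of the TRPO objectives L_k, one per auxiliary
                   reward R + F_k, all evaluated at theta_old.
    hvp          : function v -> H v for the KL-divergence Hessian H.
    find_weights : weight strategy, e.g. the l1 (Eq. 9) or l2 (Eq. 10) solver.
    """
    # Search directions z_k = H^{-1} g_k (Equation 4), via conjugate gradient.
    Z = np.stack([conjugate_gradient(hvp, g) for g in reward_grads], axis=1)
    G = np.stack(reward_grads, axis=1)
    w = find_weights(Z, G)                       # optimal reward weights
    z = Z @ w                                    # combined search direction Zw
    beta = np.sqrt(2.0 * delta / (z @ hvp(z)))   # maximal step size (Equation 6)
    return theta_old + beta * z                  # theta <- theta_old + beta * Zw
```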

4.2 Optimization
In this section, we mainly focus on the optimization of our proposed ASR methods. Note that Z^T G, Z^T Z and Z^T H Z are all n × n matrices, and n is usually very small compared to the parameter size. Thus, solving such problems does not take much time.

Firstly, Equation 9 is a quadratic programming problem with linear constraints, which is similar to the optimization problem of SVM. The SMO algorithm breaks the optimization problem of SVM into a series of sub-problems that can be solved analytically [Platt, 1998]. Inspired by the SMO algorithm, we design a similar algorithm to solve Equation 9 efficiently. At every iteration, a pair (w_i, w_j) is selected first, and then we compute the maximum of a one-dimensional quadratic function analytically.
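A rough sketch of such a pairwise solver for Equation 9, written as maximizing w^T A w with A = Z^T G over the simplex; the random pair selection and fixed iteration budget are simplifications, not the authors' exact scheme.

```python
import numpy as np

def smo_simplex_max(A, iters=200, seed=0):
    """Maximize w^T A w s.t. sum(w) = 1, w >= 0 (Equation 9) by updating
    one pair (w_i, w_j) at a time, each sub-problem solved analytically."""
    rng = np.random.default_rng(seed)
    S = 0.5 * (A + A.T)              # w^T A w depends only on the symmetric part
    n = S.shape[0]
    w = np.full(n, 1.0 / n)
    for _ in range(iters):
        i, j = rng.choice(n, size=2, replace=False)
        c = w[i] + w[j]
        if c == 0.0:
            continue
        # Contribution of the fixed coordinates to the cross terms.
        b = S @ w - S[:, i] * w[i] - S[:, j] * w[j]
        # With w_i = x and w_j = c - x, the objective is a2*x^2 + a1*x + const.
        a2 = S[i, i] + S[j, j] - 2.0 * S[i, j]
        a1 = 2.0 * (S[i, j] - S[j, j]) * c + 2.0 * (b[i] - b[j])
        if a2 < 0.0:                                  # concave: interior maximum
            x = float(np.clip(-a1 / (2.0 * a2), 0.0, c))
        else:                                         # convex/linear: best endpoint
            x = c if a2 * c * c + a1 * c > 0.0 else 0.0
        w[i], w[j] = x, c - x
    return w
```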

When ignoring the positive weight constraints, Equation 10 is a Rayleigh quotient problem and Equation 12 is a generalized Rayleigh quotient problem that can be transformed into a Rayleigh quotient problem, so we only need to consider Equation 10. Furthermore, the Rayleigh quotient problem can be solved analytically by matrix eigenvalue decomposition: the optimal solution is given by the eigenvector v corresponding to the largest eigenvalue of Z^T G.

Suppose the feasible region of Equation 10 is denoted by D, and P_D is a projection operator that projects a vector onto the region D. Even though the eigenvector v may no longer be the optimal solution of Equation 10, P_D(v) is an excellent initial point. After this initialization, projected gradient descent is performed to find a better point.
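A sketch of this initialization-plus-refinement scheme for Equation 10; the specific projection (clip to non-negative, then renormalize), the step size and the iteration budget are simplifying assumptions rather than the paper's exact choices.

```python
import numpy as np

def project_feasible(w):
    """Project onto D = {w : w_i >= 0, ||w||_2 = 1}: clip, then renormalize."""
    w = np.clip(w, 0.0, None)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0.0 else np.full_like(w, 1.0 / np.sqrt(len(w)))

def reward_shrinking_weights(ZtG, steps=100, lr=0.1):
    """Approximately solve Equation 10: max w^T Z^T G w s.t. ||w||_2 = 1, w >= 0."""
    S = 0.5 * (ZtG + ZtG.T)                  # symmetrize for the eigen-solver
    _, vecs = np.linalg.eigh(S)
    v = vecs[:, -1]                          # eigenvector of the largest eigenvalue
    if v.sum() < 0.0:                        # eigenvectors are defined up to sign
        v = -v
    w = project_feasible(v)                  # P_D(v) as the initial point
    for _ in range(steps):                   # projected gradient steps to refine
        w = project_feasible(w + lr * 2.0 * S @ w)
    return w
```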

5 Experiments
In this section, empirical experiments and investigations are conducted to validate the effectiveness of the ASR framework. We compare our approach to state-of-the-art single reward methods and multiple rewards methods. Since there are few ready-made multiple rewards environments, we extend some popular environments to multiple rewards environments, i.e., the OpenAI classical control domains MountainCar, CartPole and Acrobot [Brockman et al., 2016], the PLE game Catcher¹ and the Atari game Freeway. First, we show the comparison results on the three classical control environments and the two more complicated video games. Then we demonstrate the visualization of the weights during the training process.

¹ https://github.com/ntasfi/PyGame-Learning-Environment


Figure 1: Comparisons on classical control domains. Six panels plot cumulative episode reward against training episodes on MountainCar, CartPole and Acrobot, comparing ASR-L1, ASR-L2 and ASR-MG with single reward methods (A2C, ACER, PPO, TRPO) and with multiple rewards methods (ENSEMBLE-L, ENSEMBLE-M, ENSEMBLE-R, LSRH).

5.1 Compared Methods and Configurations
We compare ASR to two kinds of methods. The first is single reward policy gradient methods, i.e., A2C [Mnih et al., 2016], ACER [Wang et al., 2017], TRPO [Schulman et al., 2015], and PPO [Schulman et al., 2017]. The second kind is multiple rewards methods, i.e., linear scalarization reward shaping [Brys et al., 2014a] and ensemble methods with different voting strategies [Brys et al., 2017]. In detail, the compared methods are listed as follows:

Single Reward Methods
• A2C is a synchronous advantage actor-critic approach, which gives equal performance with A3C;
• ACER is an off-policy actor-critic model with experience replay, greatly increasing the sampling efficiency;
• TRPO applies a KL divergence constraint to avoid large parameter updates, improving training stability;
• PPO simplifies TRPO by using a clipped surrogate objective while retaining similar performance.

Multiple Rewards Methods
• LSRH applies a fixed linear scalarization of multiple shaping rewards;
• ENSEMBLE-L is an ensemble method with linear voting;
• ENSEMBLE-M is an ensemble method with majority voting;
• ENSEMBLE-R is an ensemble method with rank voting.

Single reward methods are implemented using OpenAI Baselines², which provides high-quality implementations of reinforcement learning algorithms. Since our proposed ASR methods are based on TRPO, for fairness the base learner of the multiple rewards methods is TRPO too.

² https://github.com/openai/baselines

For the classical control environments, we perform one million time steps of training for each method. For the video games, we perform ten million time steps of training for each method. Each method is run with five random seeds. We show the average training curves of the two runs with the highest average cumulative reward over the entire training period, and the shaded regions in the figures show the error in the estimate of the mean.

5.2 Comparisons on Classical Control Domains
Classical control is a collection of classical reinforcement learning environments implemented by OpenAI. We extend three classical control environments to multiple rewards environments, i.e., CartPole, MountainCar, and Acrobot. The details of the auxiliary reward designs for the selected environments are as follows:

• MountainCar The goal of the MountainCar environment is to drive the car up the mountain. The original reward is −1 for every step taken; therefore, the reward function is very uninformative, especially when the car has not yet arrived at the goal location. We suggest using height, speed and position as potential functions to construct auxiliary rewards (a minimal sketch of such potentials is given after this list);

• CartPole A pole is attached by an unactuated joint to a cart, which moves along a frictionless track. The original reward is 1 for every step taken. We suggest using the cart position and the pole angle to construct the potential functions, since keeping the cart position and pole angle small is conducive to preventing the pole from falling over;

• Acrobot Acrobot is a 2-link pendulum with only the second joint actuated. The goal is to swing the end-effector to a height at least the length of one link above the base. Similar to the MountainCar environment, the original reward function is very uninformative. Heuristically, we use the height of the end-effector as a potential function, since encouraging the agent to move up helps to reach the target height.
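As referenced in the MountainCar item above, here is a minimal sketch of how such potential functions could look for a MountainCar state (position, velocity); the exact functional forms and scalings are illustrative assumptions, not the paper's reported design.

```python
import numpy as np

GOAL_POSITION = 0.5   # MountainCar's goal position in OpenAI Gym

def phi_height(state):      # progress up the hill (the track height is sin(3x)-shaped)
    return np.sin(3.0 * state[0])

def phi_speed(state):       # reward building up momentum
    return abs(state[1])

def phi_position(state):    # reward closeness to the goal position
    return -abs(GOAL_POSITION - state[0])

def shaping(phi, s, s_next, gamma=0.99):
    """Auxiliary reward term F(s, a, s') = gamma * Phi(s') - Phi(s) (Equation 1)."""
    return gamma * phi(s_next) - phi(s)

# Each potential yields one auxiliary reward R + F_k used by ASR.
auxiliary_potentials = [phi_height, phi_speed, phi_position]
```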


Figure 2: Comparisons on video games. Two panels plot cumulative episode reward against training episodes on Freeway and Catcher for ENSEMBLE-L, ENSEMBLE-M, ENSEMBLE-R, LSRH, ASR-L1, ASR-L2 and ASR-MG.

The learning curves of all methods are shown in Figure 1. We plot the first one thousand episodes of each method, and the vertical axis is the cumulative reward of each episode. From the results, we can see that some learning curves are flat because of the low learning rate, and that the multiple rewards methods, especially ASR and LSRH, are faster than the single reward methods, which verifies that auxiliary rewards can improve the training speed. Moreover, compared to the other multiple rewards methods, especially LSRH with its fixed linear scalarization of rewards, the ASR methods are faster and more stable. The learning curves of ASR show that it provides monotonic and fast improvement during the learning process.

5.3 Comparisons on Video Games
In addition to the classical control environments, much more complex pixel-input environments are considered. Atari and PLE games are widely used video games. We extend two environments from Atari and PLE to multiple rewards environments: Freeway and Catcher.

• Freeway In Freeway, the agent controls a chicken running across a ten-lane highway filled with traffic. The goal is to move the chicken up and keep it away from cars; the chicken is forced back if hit by a car. If the chicken gets across, the reward is 1, otherwise 0. The potential function for this environment is the chicken's height;

• Catcher In Catcher, the agent controls a paddle that moves left or right to catch falling fruit. The agent receives a positive reward for each successful fruit catch; if the fruit is not caught, it receives a negative reward. Since the original reward is somewhat sparse, we introduce a denser reward in Catcher, i.e., the horizontal distance between the paddle and the fruit is used as a potential function.

Figure 3: Weights of original rewards in ASR. Smoothed weight curves over training iterations for Acrobot, CartPole, MountainCar, Catcher and Freeway.

We plot the first one thousand episodes of each method in Figure 2. From the results, our ASR approaches achieve the best performance in Freeway and Catcher, which is attributable to the automatic reward selection. Compared to LSRH, which applies a fixed linear scalarization of shaping rewards, the superiority of ASR validates the effectiveness of determining the weights automatically.

5.4 Visualization on Weights
The weight of a reward reflects the importance of that reward. As mentioned before, the original reward functions are somewhat uninformative, especially at the preliminary stage; therefore, it is reasonable to give the original reward a lower weight at first. We show the smoothed curves of the weights during the training process in Figure 3. From the results, ASR assigns the original rewards weights that are lower than average, which shows that our ASR methods can find a meaningful and helpful combination at each training step. Consequently, ASR can achieve faster and better performance.

6 Conclusion
Researchers have paid great attention to efficient reinforcement learning, while delayed or uninformative rewards usually lead to low efficiency. Considering that multiple rewards are usually available in many real-world environments and can provide more information to the agent, we make use of multiple auxiliary rewards to speed up the training process and develop the Automatic Successive Reinforcement Learning (ASR) framework. Our investigations show that ASR can increase the learning rate compared to existing methods. The experimental results also show that ASR assigns a low weight to the sparse reward and takes advantage of informative rewards at the preliminary stage. This observation validates that our reward selection is meaningful.

Acknowledgements
This work is supported by the National Key R&D Program of China (2018YFB1004300), NSFC (61773198), and the Collaborative Innovation Center of Novel Software Technology and Industrialization of NJU, Jiangsu. De-Chuan Zhan is the corresponding author. We thank Prof. Yang Yu for his valuable comments and suggestions on this work. Yi-Xing Lu is an undergraduate student of Dept. EE, NJU.


References
[Amari, 1998] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[Bengio et al., 2009] Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, Montreal, Canada, 2009.

[Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[Brys et al., 2014a] Tim Brys, Anna Harutyunyan, Peter Vrancx, Matthew E. Taylor, Daniel Kudenko, and Ann Nowe. Multi-objectivization of reinforcement learning problems by reward shaping. In Proceedings of the 2014 International Joint Conference on Neural Networks, pages 2315–2322, Beijing, China, 2014.

[Brys et al., 2014b] Tim Brys, Ann Nowe, Daniel Kudenko, and Matthew E. Taylor. Combining multiple correlated reward and shaping signals by measuring confidence. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, pages 1687–1693, Quebec, Canada, 2014.

[Brys et al., 2017] Tim Brys, Anna Harutyunyan, Peter Vrancx, Ann Nowe, and Matthew E. Taylor. Multi-objectivization and ensembles of shapings in reinforcement learning. Neurocomputing, 263:48–59, 2017.

[Devlin et al., 2011] Sam Devlin, Daniel Kudenko, and Marek Grzes. An empirical study of potential-based reward shaping and advice in complex, multi-agent systems. Advances in Complex Systems, 14(2):251–278, 2011.

[Efthymiadis and Kudenko, 2013] Kyriakos Efthymiadis and Daniel Kudenko. Using plan-based reward shaping to learn strategies in StarCraft: Broodwar. In Proceedings of the 2013 IEEE Conference on Computational Intelligence in Games, pages 1–8, Niagara Falls, Canada, 2013.

[Karpathy and van de Panne, 2012] Andrej Karpathy and Michiel van de Panne. Curriculum learning for motor skills. In Advances in Artificial Intelligence, pages 325–330, Heidelberg, Germany, 2012.

[Kim and De Weck, 2006] I. Y. Kim and O. L. De Weck. Adaptive weighted sum method for multiobjective optimization: A new method for Pareto front generation. Structural and Multidisciplinary Optimization, 31(2):105–116, 2006.

[Lin, 2005] JiGuan G. Lin. On min-norm and min-max methods of multi-objective optimization. Mathematical Programming, 103(1):1–33, 2005.

[Mnih et al., 2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 1928–1937, New York, NY, 2016.

[Moffaert et al., 2013] Kristof Van Moffaert, Madalina M. Drugan, and Ann Nowe. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 191–199, Singapore, 2013.

[Ng et al., 1999] Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, pages 278–287, Bled, Slovenia, 1999.

[Parisi et al., 2014] Simone Parisi, Matteo Pirotta, Nicola Smacchia, Luca Bascetta, and Marcello Restelli. Policy gradient approaches for multi-objective sequential decision making. In Proceedings of the 2014 International Joint Conference on Neural Networks, pages 2323–2330, Beijing, China, 2014.

[Pirotta et al., 2015] Matteo Pirotta, Simone Parisi, and Marcello Restelli. Multi-objective reinforcement learning with continuous Pareto frontier approximation. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 2928–2934, Austin, TX, 2015.

[Plappert et al., 2018] Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, Vikash Kumar, and Wojciech Zaremba. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.

[Platt, 1998] John C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Advances in Kernel Methods - Support Vector Learning, 208:185–208, 1998.

[Schmidhuber, 2013] Jurgen Schmidhuber. PowerPlay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. Frontiers in Psychology, 4:313, 2013.

[Schulman et al., 2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889–1897, Lille, France, 2015.

[Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, 1998.

[Wang et al., 2017] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 2017.


