
Scaling data-driven robotics with reward sketching and batch reinforcement learning

Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Żołna, Yusuf Aytar, David Budden, Mel Vecerik,

Oleg Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, Ziyu Wang

DeepMind

Abstract—By harnessing a growing dataset of robot experience, we learn control policies for a diverse and increasing set of related manipulation tasks. To make this possible, we introduce reward sketching: an effective way of eliciting human preferences to learn the reward function for a new task. This reward function is then used to retrospectively annotate all historical data, collected for different tasks, with predicted rewards for the new task. The resulting massive annotated dataset can then be used to learn manipulation policies with batch reinforcement learning (RL) from visual input in a completely off-line way, i.e., without interactions with the real robot. This approach makes it possible to scale up RL in robotics, as we no longer need to run the robot for each step of learning. We show that the trained batch RL agents, when deployed in real robots, can perform a variety of challenging tasks involving multiple interactions among rigid or deformable objects. Moreover, they display a significant degree of robustness and generalization. In some cases, they even outperform human teleoperators.

I. INTRODUCTION

Deep learning has successfully advanced many areas of artificial intelligence, including vision [39, 26], speech recognition [24, 46, 4], natural language processing [17], and reinforcement learning (RL) [49, 63]. The success of deep learning in each of these fields was made possible by the availability of huge amounts of labeled training data. Researchers in vision and language can easily train and evaluate deep neural networks on standard datasets with crowdsourced annotations such as ImageNet [58], COCO [45] and CLEVR [33]. In simulated environments like video games, where experience and rewards are easy to obtain, deep RL is tremendously successful in outperforming top skilled humans by ingesting huge amounts of data [63, 69, 9]. The OpenAI Five DOTA bot [9] processes 180 years of simulated experience every day to play at a professional level. Even playing simple Atari games typically requires 40 days of game play [49]. In contrast, in robotics we lack abundant data since data collection implies execution on a real robot, which cannot be accelerated beyond real time. Furthermore, task rewards do not naturally exist in real-world robotics as they do in simulated environments. The lack of large datasets with reward signals has limited the effectiveness of deep RL in robotics.

This paper presents a data-driven approach to apply deep RL effectively to learn to perform manipulation tasks on real robots from vision. Our solution is illustrated in Fig. 1. At its heart are three important ideas: (i) efficient elicitation of user preferences to learn reward functions, (ii) automatic annotation


Fig. 1: Our cyclical approach for never-ending collection of data and continual learning of new tasks consists of five stages: (1) generation of observation-action pairs by either teleoperation, scripted policies or trained agents, (2) a novel interactive approach for eliciting preferences for a specific new task, (3) learning the reward function for the new task and applying this function to automatically label all the historical data, (4) applying batch RL to learn policies purely from the massive growing dataset, without online interaction, and (5) evaluation of the learned policies.

of all historical data with any of the learned reward functions, and (iii) harnessing the large annotated datasets to learn policies purely from stored data via batch RL.

Existing RL approaches for real-world robotics mainly focus on tasks where hand-crafted reward mechanisms can be developed. Simple behaviours such as learning to grasp objects [35] or learning to fly [23] by avoiding crashing can be acquired by reward engineering. However, as the task complexity increases, this approach does not scale well. We propose a novel way to specify rewards that allows us to generate reward labels for a large number of diverse tasks. Our approach relies on human judgments about progress towards the goal to train task-specific reward functions. Annotations are elicited from humans in the form of per-timestep reward annotations using a process we call reward sketching (see Fig. 2). The sketching procedure is intuitive for humans, and allows them to label many timesteps rapidly and accurately. We use the human annotations to train a ranking reward model, which is then used to annotate all other episodes.

arXiv:1909.12200v3 [cs.RO] 4 Jun 2020


Fig. 2: Reward sketching procedure. Sketch of a reward function for the stack_green_on_red task. A video sequence (top) with a reward sketch (bottom), shown in blue. Reward is the perceived progress towards achieving the target task. The annotators are instructed to indicate successful timesteps with a reward high enough to reach the green area.


Fig. 3: Retrospective reward assignment. The reward function is learned from a limited set of episodes with reward sketches. The learned reward function is applied to a massive dataset of episodes from NeverEnding Storage. All historical episodes are now labelled with a newly learned reward function.

To generate enough data to train data-demanding deep neural network agents, we record experience continuously and persistently, regardless of the purpose or quality of the behavior. We collected over 400 hours of multiple-camera videos (Fig. 6), proprioception, and actions from behavior generated by human teleoperators, as well as random, scripted and trained policies. By using deep reward networks obtained as a result of reward sketching, it becomes possible to retrospectively assign rewards to any past or future experience for any given task. Thus, the learned reward function allows us to repurpose a large amount of past experience using a fixed amount of annotation effort per task (see Fig. 3). This large dataset with task-specific rewards can now be used to harness the power of deep batch RL.

For any given new task, our data is necessarily off-policy, and is typically off-task (i.e., collected for other tasks). In this case, batch RL [41] is a good method to learn visuomotor policies. Batch RL effectively enables us to learn new controllers without execution on the robot. Running RL off-line gives researchers several advantages. Firstly, there is no need to worry about wear and tear, limits of real-time processing, and many of the other challenges associated with operating real robots. Moreover, researchers are empowered to train policies using their batch RL algorithm of choice, similar to how vision researchers are empowered to try new methods on ImageNet. To this end, we release datasets [16] with this paper.

Fig. 4: Each row is an example episode of a successful task illustrating: (1) the ability to recover from a mistake in the stack_green_on_red task, (2) robustness to adversarial perturbations in the same task, (3) generalization to unseen initial conditions in the same task, (4) generalization to previously unseen objects in the lift_green task, (5) the ability to lift deformable objects, (6) inserting a USB key, (7) inserting a USB key despite the target computer being moved.

The integration of all the elements into a scalable system that tightly closes the loop of human input, reward learning and policy learning poses substantial engineering challenges. Nevertheless, this work is essential to advance data-driven robotics. For example, we store all robot experience, including demonstrations and behaviors generated by trained policies or scripted random policies. To be useful in learning, this data needs to be appropriately annotated and queried. This is achieved thanks to the design of our storage system, dubbed NeverEnding Storage (NES).

This multi-component system (Fig. 1) allows us to solve a variety of challenging tasks (Fig. 4) that require skillful manipulation, involve multi-object interaction, and consist of many time steps. An example of such a task is stacking arbitrarily shaped objects. In this task, small perturbations at the beginning can easily cause failure later: the robot not only has to achieve a successful grasp, but it must also grasp the first object in a way that allows for safe placement on top of the second object. Moreover, the second object may have a small surface area, which affects how demanding the task is. Learning policies directly from pixels makes the task more challenging, but eliminates the need for feature engineering and allows for additional generalization capacity. While some of our tasks can be solved effectively with scripted policies, learning policies that generalize to arbitrary shapes, sizes, textures and materials remains a formidable challenge, and hence the focus of this paper is on making progress towards meeting this challenge.

As shown in Fig. 4, the policies learned with our approach solve a variety of tasks including lifting and stacking of rigid and deformable objects, as well as USB insertion. Importantly, thanks to learning from pixels, the behaviour generalizes to new object shapes and to new initial conditions, recovers from mistakes and is robust to some real-time adversarial interference. Fig. 9 shows that the learned policies can also solve tasks more effectively than human teleoperators. For a better view of our results and general approach, we highly recommend watching the accompanying video on the project website.

The remainder of this paper is organized as follows. Sec. II introduces the methods, focusing on reward sketching, reward learning and batch RL, and also provides the broader context, highlighting the engineering contributions. Sec. III is devoted to describing our experimental setup, network architectures, benchmark results, and an interactive insertion task of industrial relevance. Sec. IV explores some of the related work.

II. METHODS

The general workflow is illustrated in Fig. 1 and a more detailed procedure is presented in Fig. 5. NES accumulates a large dataset of task-agnostic experience. A task-specific reward model allows us to retrospectively annotate data in NES with reward signals for a new task. With rewards, we can then train batch RL agents with all the data in NES.

The procedure for training an agent to complete a new task has the following steps, which are described in turn in the remainder of the section:

A. A human teleoperates the robot to provide first-person demonstrations of the target task.
B. All robot experience, including demonstrations, is accumulated into NES.
C. Humans annotate a subset of episodes from NES (including task-specific demos) with reward sketches.
D. A reward model for the target task is trained using the fixed amount of labelled experience.
E. An agent for the target task is trained using all experience in NES, using the predicted reward values.
F. The resulting policy is deployed on a real robot, while recording more data into NES, which can be further annotated.
G. Occasionally we select an agent for careful evaluation, to track overall progress on the task.

A. Teleoperation

To specify a new target task, a human operator first remotely controls the robot to provide several successful (and occasionally unsuccessful¹) examples of completing the task. By employing the demonstration trajectories, we facilitate both reward learning and reinforcement learning. Demonstrations help to bootstrap reward learning by providing examples of successful behavior with high rewards, which are also easy for humans to interpret and judge. In RL, we circumvent the problem of exploration: instead of requiring the agent to explore the state space autonomously, we use expert knowledge about the intended outcome of the task to guide the agent. In addition to full episodes of demonstrations, interactive interventions can also be performed when an agent controls the robot: a human operator can take over from, or return control to, the agent at any time. This data is useful for fixing particular corner cases that the agents might encounter.

¹We notice that in our dataset around 15% of human demonstrations fail to accomplish the task by the end of the episode.

Fig. 5: Structure of the data-driven workflow. Each step is described in Sec. II; the figure highlights which steps are performed on the robot, which involve a human operator, and which are task-specific or task-agnostic.

The robot is controlled with a 6-DoF mouse with an additional gripper button (see the video) or a hand-held virtual reality controller. A demonstrated sequence contains pairs of observations and corresponding actions for each time step t: ((x_0, a_0), …, (x_t, a_t), …, (x_T, a_T)). Observations x_t contain all available sensor data, including raw pixels from multiple cameras as well as proprioceptive inputs (Fig. 6).
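For concreteness, the following is a minimal sketch of how a single timestep and a demonstrated episode of (observation, action) pairs might be represented; the field names, array shapes and the 7-dimensional action layout are illustrative assumptions, not the actual storage format.

```python
# Minimal sketch of an episode record of (observation, action) pairs.
# Field names and shapes are illustrative assumptions, not the actual NES schema.
from dataclasses import dataclass, field
from typing import Dict, List

import numpy as np


@dataclass
class Timestep:
    # Observation x_t: raw pixels from several cameras plus proprioception.
    pixels: Dict[str, np.ndarray]   # e.g. {"basket_front_left": (80, 128, 3) uint8 image, ...}
    proprio: np.ndarray             # joint angles, gripper state, wrist force-torque, ...
    # Action a_t: 6-DoF Cartesian velocity target plus one binary gripper command.
    action: np.ndarray              # shape (7,)


@dataclass
class Episode:
    task: str                       # e.g. "stack_green_on_red"
    source: str                     # "teleoperation", "scripted", or "agent"
    timesteps: List[Timestep] = field(default_factory=list)


# Example: a tiny synthetic 2-step episode.
episode = Episode(task="lift_green", source="teleoperation")
for t in range(2):
    episode.timesteps.append(Timestep(
        pixels={"basket_front_left": np.zeros((80, 128, 3), dtype=np.uint8)},
        proprio=np.zeros(20, dtype=np.float32),
        action=np.zeros(7, dtype=np.float32),
    ))
print(len(episode.timesteps))  # -> 2
```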

B. NeverEnding Storage

NES captures all of the robot experience generated across all tasks in a central repository. This allows us to make use of historical data each time we learn a new target task, instead of generating a new dataset from scratch. NES includes teleoperated trajectories for various tasks, human play data, and experience from the execution of either scripted or learned policies. For every trajectory we store recordings from several cameras and sensors in the robot cage (Fig. 6). The main innovation in NES is the introduction of a rich metadata system into the RL training pipeline. It is implemented as a relational database that can be accessed using SQL-type queries. We attach environment and policy metadata to every trajectory (e.g., date and time of operation), as well as arbitrary human-readable labels and reward sketches. This information allows us to dynamically retrieve and slice the data relevant for a particular stage of our training pipeline.
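Since NES is described as a relational database with "SQL-type" queries over per-trajectory metadata, here is a minimal sketch of that idea using Python's sqlite3; the table and column names are a hypothetical schema invented for illustration, not the real NES layout.

```python
# Minimal sketch of "SQL-type" metadata queries over stored episodes, using sqlite3.
# The schema (table and column names) is a hypothetical illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE episodes (
        episode_id   TEXT PRIMARY KEY,
        task         TEXT,      -- e.g. 'stack_green_on_red'
        source       TEXT,      -- 'teleoperation', 'scripted', 'agent'
        recorded_at  TEXT,      -- date and time of operation
        has_sketch   INTEGER    -- 1 if a reward sketch exists for this episode
    )
""")
conn.executemany(
    "INSERT INTO episodes VALUES (?, ?, ?, ?, ?)",
    [
        ("ep-0001", "lift_green", "teleoperation", "2019-07-01T10:00", 1),
        ("ep-0002", "stack_green_on_red", "agent", "2019-07-02T14:30", 0),
        ("ep-0003", "random_watcher", "scripted", "2019-07-03T09:15", 0),
    ],
)

# Slice the data relevant for one stage of the pipeline:
# all sketched episodes plus everything recorded for the target task.
rows = conn.execute(
    "SELECT episode_id FROM episodes WHERE has_sketch = 1 OR task = ?",
    ("stack_green_on_red",),
).fetchall()
print(rows)  # -> [('ep-0001',), ('ep-0002',)]
```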

C. Reward Sketching

The second step in task specification is reward sketching. We ask human experts to provide per-timestep annotations of reward using a custom user interface. As illustrated in Fig. 2, the user draws a curve indicating the progress towards accomplishing the target task as a function of time, while the interface shows the frame corresponding to the current cursor position. This intuitive interface allows a single annotator to produce hundreds of frames of reward annotations per minute.

To sketch an episode, a user interactively selects a frame x_t and provides an associated reward value s(x_t) ∈ [0, 1]. The sketching interface allows the annotator to draw reward curves while “scrubbing” through a video episode, rather than annotating frame by frame. This efficient procedure provides a rich source of information about the reward across the entire episode. The sketches for an episode, {s(x_t)}_{t=1}^{T}, are stored in NES as described in Sec. II-B.

The reward sketches allow comparison of the perceived value of any two frames. In addition, the green region in Fig. 2 is reserved for frames where the goal is achieved. For each task, the episodes to be annotated are drawn from NES. They include both demonstrations of the target task and experience generated for prior tasks. Annotating data from prior tasks ensures better coverage of the state space.

Sketching is particularly suited to tasks where humans are able to compare two timesteps reliably. Typical object manipulation tasks fall in this category, but not all robot tasks are like this. For instance, it would be hard to sketch tasks where variable speed is important, or tasks with cycles, as in walking. While we are aware of these limitations, the proposed approach covers many manipulation tasks of interest, as shown here. We believe future work should advance interfaces to address a wider variety of tasks.

D. Reward Learning

The reward annotations produced by sketching are used to train a reward model. This model is then used to predict reward values for all experience in NES (Fig. 3). As a result, we can leverage all historical data in training a policy for a new task, without manual human annotation of the entire repository.
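A minimal sketch of this retrospective relabelling step, assuming a per-frame reward model and a simple in-memory episode format (both are illustrative assumptions):

```python
# Minimal sketch of retrospective reward assignment (Fig. 3): a newly learned,
# task-specific reward model is applied to every frame of every stored episode.
# `reward_model` and the episode format are illustrative assumptions.
from typing import Callable, Dict, List

import numpy as np


def relabel_storage(
    episodes: List[Dict[str, np.ndarray]],
    reward_model: Callable[[np.ndarray], np.ndarray],
    task: str,
) -> None:
    """Attach predicted per-timestep rewards for `task` to each episode in place."""
    for episode in episodes:
        frames = episode["frames"]                    # (T, H, W, C) array of observations
        episode.setdefault("predicted_rewards", {})
        episode["predicted_rewards"][task] = reward_model(frames)   # shape (T,)


# Toy usage: a fake "reward model" scoring mean pixel intensity.
episodes = [{"frames": np.random.rand(5, 8, 8, 3)} for _ in range(3)]
relabel_storage(episodes, lambda frames: frames.mean(axis=(1, 2, 3)), task="lift_green")
print(episodes[0]["predicted_rewards"]["lift_green"].shape)  # -> (5,)
```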

Episodes annotated with reward sketches are used to train a reward function in the form of a neural network with parameters ψ in a supervised manner. We find that although there is high agreement between annotators on the relative quality of timesteps within an episode, annotators are often not consistent in the overall scale of the sketched rewards. We therefore adopt an intra-episode ranking approach to learn reward functions, rather than trying to regress the sketched values directly.

Specifically, given two frames x_t and x_q in the same episode, we train the reward model to satisfy two conditions. First, if frame x_t is (un)successful according to the sketch s(x_t), it should be (un)successful according to the estimated reward function r_ψ(x_t): successful and unsuccessful frames in the reward sketches are defined by whether s(x_t) exceeds a threshold τ_s, while (un)successful frames in the predicted reward must exceed (fall below) a threshold τ_r1 (τ_r2). Second, if s(x_t) is higher than s(x_q) by a threshold µ_s, then r_ψ(x_t) should be higher than r_ψ(x_q) by another threshold µ_r. These conditions are captured by the following two hinge losses:

L_rank(ψ) = max{0, r_ψ(x_t) − r_ψ(x_q) + µ_r} · 1[s(x_q) − s(x_t) > µ_s]

L_success(ψ) = max{0, τ_r1 − r_ψ(x)} · 1[s(x) > τ_s] + max{0, r_ψ(x) − τ_r2} · 1[s(x) < τ_s]

The total loss is obtained by adding these terms: L_rank + λ L_success. In our experiments, we set µ_s = 0.2, µ_r = 0.1, τ_s = 0.85, τ_r1 = 0.9, τ_r2 = 0.7, and λ = 10.
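A minimal NumPy sketch of these two losses with the hyperparameters above; how the frame pairs (x_t, x_q) are sampled within an episode is an assumption not specified here.

```python
# Minimal NumPy sketch of the reward-learning losses above. r_t / r_q are predicted
# rewards r_psi(x_t), r_psi(x_q); s_t / s_q are the sketched values for the same frames.
# How (x_t, x_q) pairs are drawn within an episode is an assumption.
import numpy as np

MU_S, MU_R = 0.2, 0.1
TAU_S, TAU_R1, TAU_R2 = 0.85, 0.9, 0.7
LAMBDA = 10.0


def rank_loss(r_t, r_q, s_t, s_q):
    # Hinge applies only to pairs where the sketch says x_q is better by more than mu_s.
    indicator = (s_q - s_t) > MU_S
    return np.mean(np.maximum(0.0, r_t - r_q + MU_R) * indicator)


def success_loss(r, s):
    # Successful frames (s > tau_s) should score above tau_r1,
    # unsuccessful ones (s < tau_s) should score below tau_r2.
    return np.mean(
        np.maximum(0.0, TAU_R1 - r) * (s > TAU_S)
        + np.maximum(0.0, r - TAU_R2) * (s < TAU_S)
    )


def total_loss(r_t, r_q, s_t, s_q):
    return rank_loss(r_t, r_q, s_t, s_q) + LAMBDA * success_loss(r_t, s_t)


# Toy batch of frame pairs taken from the same episodes.
rng = np.random.default_rng(0)
r_t, r_q = rng.random(32), rng.random(32)
s_t, s_q = rng.random(32), rng.random(32)
print(total_loss(r_t, r_q, s_t, s_q))
```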

E. Batch RL

We train policies using batch RL [41]. In batch RL, the new policy is learned using a single batch of data generated by different previous policies, and without further execution on the robot. Our agent is trained using only distributional RL [7], without any feature pretraining, behaviour cloning (BC) initialization, any special batch correction terms, or auxiliary losses. We do, however, find it important to use the historical data from other tasks.

Our choice of distributional RL is partly motivated by the success of this method for batch RL in Atari [1]. We compare the distributional and non-distributional RL alternatives in our experiments. We note that other batch RL methods (see Sec. IV) might also lead to good results. Because of this, we release our datasets [16] and canonical agents [28] to encourage further investigation and advances in batch RL algorithms for robotics.

We use an algorithm similar to D4PG [7, 28] as our training algorithm. It maintains a value network Q(x_t, h^Q_t, a | θ) and a policy network π(x, h^π_t | φ). Given the effectiveness of recurrent value functions [36], both Q and π are recurrent, with h^Q_t and h^π_t representing the corresponding recurrent hidden states. The target networks have the same structure as the value and policy networks, but are parameterized by different parameters θ′ and φ′, which are periodically updated to the current parameters of the original networks.

Given the Q function, we update the policy using DPG [62]. As in D4PG, we adopt a distributional value function [8] and minimize the associated loss to learn the critic. During learning, we sample a batch of sequences of observations and actions {x^i_t, a^i_t, …, x^i_{t+n}}_i and use a zero start state to initialize all recurrent states at the start of the sampled sequences. We then update φ and θ using BPTT [70].

Since NES contains data from many different tasks, a randomly sampled batch from NES may contain data mostly irrelevant to the task at hand. To increase the representation of data from the current task, we construct fixed-ratio batches, with 75% of the batch drawn from the entirety of NES and 25% from the data specific to the target task. This is similar to the solution proposed in previous work [54], where fixed-ratio batches are formed with agent and demonstration data.
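A minimal sketch of such fixed-ratio batch construction; the transition format and batch size are illustrative assumptions.

```python
# Minimal sketch of fixed-ratio batch construction: 75% of each batch is drawn from
# all of NES, 25% from episodes of the target task. Transition format is illustrative.
import random
from typing import Any, List


def fixed_ratio_batch(
    nes_transitions: List[Any],
    task_transitions: List[Any],
    batch_size: int = 256,
    task_fraction: float = 0.25,
) -> List[Any]:
    n_task = int(batch_size * task_fraction)
    n_nes = batch_size - n_task
    batch = random.choices(task_transitions, k=n_task) + random.choices(nes_transitions, k=n_nes)
    random.shuffle(batch)
    return batch


# Toy usage with dummy transition ids.
nes = [("nes", i) for i in range(10_000)]
task = [("task", i) for i in range(1_000)]
batch = fixed_ratio_batch(nes, task)
print(sum(1 for source, _ in batch if source == "task"))  # -> 64 (25% of 256)
```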

F. Execution

Once an agent is trained, we can run it on the real robot. By running the agent, we collect more experience, which can be used for reward sketching or RL in future iterations. Running the agent also allows us to observe its performance and make judgments about the steps needed to improve it.


Fig. 6: The robot senses and records all data acquired with its 3 cage cameras, 3 wrist cameras (wide angle and depth) and proprioception. It also records its actions continuously. The robot is trained with a wide variety of object shapes, textures and sizes to achieve generalization at deployment time.

In early workflow iterations, before the reward functions are trained with sufficient coverage of the state space, the policies often exploit “delusions” where high rewards are assigned to undesired behaviors. To fix a reward delusion, a human annotator sketches some of the episodes where the delusion is observed. New annotations are used to improve the reward model, which is used in training a new policy. For each target task, this cycle is typically repeated 2–3 times until the predictions of the reward function are satisfactory.

III. EXPERIMENTS

A. Experimental Setup

Robotic setup: Our setup consists of a Sawyer robot with a Robotiq 2F-85 gripper and a wrist force-torque sensor facing a 35 × 35 cm basket. The action space has six continuous degrees of freedom, corresponding to Cartesian translational and rotational velocity targets of the gripper pinch point, and one binary control of the gripper fingers. The agent control loop is executed at 10 Hz. For safety, the pinch point movement is restricted to a 35 × 35 × 15 cm workspace, with maximum rotations of 30°, 90°, and 180° around each axis.
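A minimal sketch of the stated action space and safety limits; the numbers come from the text, while the integration and clamping of an Euler-angle pose at 10 Hz is an illustrative assumption about how such limits could be enforced.

```python
# Minimal sketch of the action space and workspace safety limits described above.
# The limits (35 x 35 x 15 cm box, 30/90/180 degree rotations, 10 Hz control) come from
# the text; the Euler-angle clamping used here is an illustrative assumption.
import numpy as np

CONTROL_HZ = 10
DT = 1.0 / CONTROL_HZ

# Workspace half-extents for the pinch point, in metres, and rotation limits in radians.
POS_HALF_EXTENTS = np.array([0.35, 0.35, 0.15]) / 2.0
ROT_LIMITS = np.deg2rad([30.0, 90.0, 180.0])


def step_pinch_point(position, rotation, action):
    """Integrate one 10 Hz control step and clip the pose to the allowed workspace.

    `action` is 7-D: Cartesian translational (3) and rotational (3) velocity targets,
    plus one binary gripper command.
    """
    lin_vel, ang_vel, gripper = action[:3], action[3:6], action[6] > 0.5
    new_position = np.clip(position + lin_vel * DT, -POS_HALF_EXTENTS, POS_HALF_EXTENTS)
    new_rotation = np.clip(rotation + ang_vel * DT, -ROT_LIMITS, ROT_LIMITS)
    return new_position, new_rotation, gripper


pos, rot = np.zeros(3), np.zeros(3)
pos, rot, grip = step_pinch_point(pos, rot, np.array([0.5, 0.0, -0.2, 0.0, 0.0, 1.0, 1.0]))
print(pos, rot, grip)
```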

Observations are provided by three cameras around the cage, as well as two wide-angle cameras and one depth camera mounted at the wrist, and proprioceptive sensors in the arm (Fig. 6). NES captures all of the observations, and we indicate what subset is used for each learned component.

Tasks and datasets: We focus on 2 subsets of NES, with data recorded during manipulation of 3 variable-shape rigid objects coloured red, green and blue (rgb dataset, Fig. 6), and 3 deformable objects: a soft ball, a rope and a cloth (deformable dataset, Fig. 4, row 5). The rgb dataset is used to learn policies for two tasks: lift_green and stack_green_on_red, and the deformable dataset is used for the lift_cloth task. Statistics for both datasets are presented in Tab. I, which describes how much data is teleoperated, how much comes from the target tasks and how much is obtained by random scripted policies. Each episode lasts for 200 steps (20 seconds) unless it is terminated earlier for safety reasons.

Type                  No. Episodes   No. steps   Hours
Teleoperation         6.2 K          1.1 M       31.9
lift_green            8.5 K          1.5 M       41.3
stack_green_on_red    10.3 K         2.0 M       56.1
random_watcher        13.1 K         2.6 M       70.9
Total                 37.9 K         7.0 M       193.3

(a) RGB dataset.

Type                  No. Episodes   No. steps   Hours
Teleoperation         2.8 K          568 K       15.8
lift_cloth            13.3 K         2.4 M       66.0
random_watcher        6.0 K          1.2 M       32.1
Total                 36.5 K         6.9 M       191.2

(b) Deformable dataset.

TABLE I: Dataset statistics. Total includes off-task data not listed in individual rows; teleoperation and the tasks lift_green, stack_green_on_red and lift_cloth partly overlap.

To generate initial datasets for training, we use a scripted policy called the random_watcher. This policy moves the end effector to randomly chosen locations and opens and closes the gripper at random times. When following this policy, the robot occasionally picks up or pushes the objects, but is typically just moving in free space. This data not only serves to seed the initial iteration of learning; removing it from the training datasets also degrades the performance of the final agents.

The datasets contain a significant number of teleoperated episodes. The majority are recorded via interactive teleoperation (Sec. II-A), and thus require limited human intervention. Only about 600 full teleoperated episodes correspond to the lift_green or stack_green_on_red tasks.

There are 894, 1201, and 585 sketched episodes for the lift_green, stack_green_on_red and lift_cloth tasks, respectively. Approximately 90% of the episodes are used for training and 10% for validation. The sketches are not obtained all at once, but accumulated over several iterations of the process illustrated in Fig. 1. In the first iteration, humans annotate randomly sampled demonstrations. In subsequent iterations, the annotations are usually done on agent data, and occasionally on demonstrations or random watcher data. Note that only a small portion of the data in NES is annotated.

Agent network architecture: The agent network is illustrated in Fig. 7. Each camera is encoded using a residual network followed by a spatial softmax keypoint encoder with 64 channels [42]. The spatial softmax layer produces a list of 64 (x, y) coordinates. We use one such list for each camera and concatenate the results.

Before applying the spatial softmax, we add noise from the distribution U[−0.1, 0.1] to the logits so that the network learns to concentrate its predictions, as illustrated with the circles in Fig. 7. Proprioceptive features are concatenated, embedded with a linear layer, layer-normalized [5], and finally mapped through a tanh activation. They are then appended to the camera encodings to form the joint input features.
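A minimal NumPy sketch of a spatial-softmax keypoint layer with the U[−0.1, 0.1] logit noise described above; shapes are illustrative, and a real implementation would sit inside the network and be differentiated through.

```python
# Minimal NumPy sketch of a spatial-softmax keypoint layer with U[-0.1, 0.1] logit noise.
# Shapes are illustrative; this is a forward-pass sketch only.
import numpy as np


def spatial_softmax_keypoints(feature_maps, rng, noise=0.1):
    """feature_maps: (C, H, W) logits -> (C, 2) expected (x, y) coordinates in [-1, 1]."""
    c, h, w = feature_maps.shape
    logits = feature_maps + rng.uniform(-noise, noise, size=feature_maps.shape)

    # Per-channel softmax over all H*W spatial positions.
    flat = logits.reshape(c, -1)
    flat = flat - flat.max(axis=1, keepdims=True)
    probs = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    probs = probs.reshape(c, h, w)

    # Expected coordinates under each channel's spatial distribution.
    xs = np.linspace(-1.0, 1.0, w)
    ys = np.linspace(-1.0, 1.0, h)
    expected_x = (probs.sum(axis=1) * xs).sum(axis=1)   # marginalize over rows, weight columns
    expected_y = (probs.sum(axis=2) * ys).sum(axis=1)   # marginalize over columns, weight rows
    return np.stack([expected_x, expected_y], axis=1)   # (C, 2) keypoint list


keypoints = spatial_softmax_keypoints(np.random.randn(64, 12, 16), np.random.default_rng(0))
print(keypoints.shape)  # -> (64, 2), one (x, y) per channel
```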

Fig. 7: Agent network architecture. Only the wrist camera encoder is shown here, but in practice we encode each camera independently and concatenate the results.

The actor network π(x) consumes these joint input features directly. The critic network Q(x, a) additionally passes them through a linear layer, concatenates the result with the actions passed through a linear layer, and maps the result through a linear layer with ReLU activations. The actor and critic networks each use two layer-normalized LSTMs with 256 hidden units. Action outputs are further processed through a tanh layer, placing them in the range [−1, 1], and then re-scaled to their native ranges before being sent to the robot.
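A minimal sketch of the final tanh-and-rescale step; the per-dimension native ranges below are illustrative assumptions, not the robot's actual limits.

```python
# Minimal sketch of re-scaling tanh outputs in [-1, 1] to native action ranges before
# they are sent to the robot. The specific ranges below are illustrative assumptions.
import numpy as np

# Per-dimension (low, high) native ranges: 3 translational velocities, 3 rotational
# velocities, 1 gripper command.
ACTION_LOW = np.array([-0.05, -0.05, -0.05, -1.0, -1.0, -1.0, 0.0])
ACTION_HIGH = np.array([0.05, 0.05, 0.05, 1.0, 1.0, 1.0, 1.0])


def to_native_range(tanh_action: np.ndarray) -> np.ndarray:
    """Affinely map each dimension from [-1, 1] to [low, high]."""
    return ACTION_LOW + (tanh_action + 1.0) * 0.5 * (ACTION_HIGH - ACTION_LOW)


print(to_native_range(np.array([-1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])))
# -> [-0.05  0.    0.05  0.    0.    0.    1.  ]
```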

The agent for the lift_green and stack_green_on_red tasks observes two cameras: a basket front left camera (80 × 128) and one of the wrist-mounted wide-angle cameras (96 × 128) (Fig. 6). The agent for lift_cloth uses an additional back left camera (80 × 128).

Reward network architecture: The reward network is a non-recurrent residual network with a spatial softmax layer [42], as in the agent network architecture. We also use the proprioceptive features, as in agent learning. As the sketched values are in the range [0, 1], the reward network ends with a sigmoid non-linearity.

Training: We train multiple RL agents in parallel and briefly evaluate the most promising ones on the robot. Each agent is trained for 400k update steps. To further improve performance, we save all episodes from the RL agents, sketch more reward curves if necessary, and use them when training the next generation of agents. We iterated this procedure 2–3 times; at each iteration the agent becomes more successful and more robust. Three typical episodes from three steps of improvement in the stack_green_on_red task are depicted in Fig. 8. They correspond to agents trained using approximately 82%, 94% and 100% of the collected data. In the first iteration, the agent can pick up a green block, but drops it. In the second iteration, the agent attempts to stack the green block on the red one, and only in the third iteration does it succeed. Next, we report the performance of the final agents.

Evaluation: While the reward and policy are learned from data, we cannot assess their ultimate quality without running the agent on the real robot. That is, we need to evaluate whether our agents, learned using the stored datasets, transfer to the real robot. As the agent is learned off-line, good performance on the real robot is a powerful indicator of generalization.

To this end, we conducted controlled evaluations on the physical robot with fixed initial conditions across different policies. For the lift_green and stack_green_on_red datasets, we devise three different evaluation conditions with varying levels of difficulty:

1) normal: basic rectangular green blocks (well represented in the training data), large red objects close to the center;
2) hard: more diverse objects (less well represented in the training data), smaller red objects with diverse locations;
3) unseen: green objects that were never seen during training, large red objects.

Fig. 8: Iterative improvement of the agent on the stack_green_on_red task. Each iteration corresponds to a cycle through the steps shown in Fig. 1. With more training data, the performance of the agent improves.

Each condition specifies 10 different initial positions of the objects (set by a human operator) as well as the initial pose of the robot (set automatically). The hard and unseen conditions are especially challenging, since they require the agent to cope with novel objects and novel object configurations.

We use the same 3 evaluation sets for both the lift_green and stack_green_on_red tasks. To evaluate the lift_cloth task, we randomize the initial conditions at every trial. As a quality metric, we measure the rate of successfully completed episodes, where success is indicated by a human operator.

B. Results

Results on the rgb dataset are summarized in Tab. II. Our agent achieves a success rate of 80% for lifting and 60% for stacking. Even with rarely seen objects positioned in adversarial ways, the agent is quite robust, with success rates of 80% and 40%, respectively. Remarkably, when dealing with objects that were never seen before, it can lift or stack them in 50% and 40% of cases (see Fig. 4 for examples of such behavior). The success rate of our agent for the lift_cloth task in 50 episodes with randomized initial conditions is 74%.

Agent                     Normal   Hard   Unseen
Our approach              80%      80%    50%
No random watcher data    80%      70%    20%
Only lift data            0%       0%     0%
Non-distributional RL     30%      20%    10%

(a) lift_green

Agent                     Normal   Hard   Unseen
Our approach              60%      40%    40%
No random watcher data    50%      30%    30%
Only stacking data        0%       10%    0%
Non-distributional RL     20%      0%     0%

(b) stack_green_on_red.

TABLE II: The success rate of our agent and ablations for a given task in different difficulty settings. Recall that our agent is trained off-line.

Our results compare favorably with those of Zhu et al. [73], where block lifting and stacking success rates are 64% and 35%. Note that these results are not perfectly comparable due to different physical setups, but we believe they provide some guidance. Wulfmeier et al. [72] also attempted reward learning with the block stacking task. Instead of learning directly from pixels, they rely on QR-code state estimation for a fixed set of cubes, whereas our policies can handle objects of various shapes, sizes and material properties. Jeong et al. [31] achieve 62% accuracy on block stacking (but with a fixed set of large blocks) using a sim2real approach with continuous 4-DoF control. In contrast, we can achieve similar performance with a variety of objects and more complex continuous 6-DoF control.

To understand the benefits of relabelling past experience with learned reward functions, we conduct ablations with fixed reward functions and varying training subsets for the RL agents. Firstly, we train the lifting (stacking) policy using only the lifting (stacking) episodes. Using only task-specific data is interesting because the similarity between the training data and the target behavior is higher (i.e., the training data is more on-policy). Secondly, we train an agent with access to data from all tasks, but no access to the random_watcher data. As this data is unlikely to contain episodes relevant to the task, we want to know how much it contributes to the final performance.

Tab. II shows the results of these two ablations. Remarkably, using only a task-specific dataset dramatically degrades the policy (its performance is 0% in almost all scenarios). Random watcher data proves to be valuable, as it contributes up to an additional 30% improvement, showing the biggest advantage in the hardest case with unseen objects.

We also evaluate the effect of distributional value functions. Confirming previous findings in Atari [1], the results in the last rows of Tab. II show that distributional value functions are essential for good performance in batch RL.

For qualitative results, we refer the reader to the accompanying video and Fig. 4, which demonstrate the robustness of our agents. The robot successfully deals with adversarial perturbations by a human operator, stacking several unseen and non-standard objects and lifting toys, such as a robot and a pony. Our agents move faster and are more efficient than a human operator in some cases, as illustrated in Fig. 9.


Fig. 9: Agent vs. human in the stack_green_on_red task. We show frames of an episode performed by an agent (top) and a human (bottom) every 3 seconds. The agent accomplishes the task faster than a human operator.

Fig. 10: USB-insertion task success rate during the process of on-line training. It illustrates the rapid progress of training a robot to solve an industrially relevant task.

C. Interactive Insertion

An alternative way to obtain a policy is to perform data collection, reward learning and policy learning in a tight loop. Here, the human operator interactively refines the learned reward function on-line while a policy is being learned. In this experiment, the policy is learned from scratch without relying on historical data and batch RL, which is possible in less data-demanding applications. In this section, we present an example of this approach applied to an industrially relevant task: inserting a USB key into a computer port.

We consider 6-DoF velocity control. The velocity actions are fed to a stateful safety controller, which uses a previously learned model to limit excess forces and a MuJoCo inverse kinematics model to infer target joint velocities. Episodes are set to last 15 seconds with 10 Hz control, for a total of 150 steps. Both the policy and reward model use wrist camera images of size 84 × 84 pixels.

At the start of each episode, the robot position is set within a 6 × 6 × 6 cm region with 8.6° rotation in each direction, and the allowed workspace is 8 × 8 × 15 cm with 17.2° rotation. This is a significant amount of variation for such a task. Episodes are terminated with a discount of zero when the robot reaches the boundary of the workspace. For faster convergence, a smaller network architecture is chosen, with 3 convolutional layers and 2 fully connected layers. At the start of the experiment, 100 human demonstrations are collected and annotated with sketches.

This experiment is repeated 3 times. The average success rate of the agent as a function of time is shown in Fig. 10. The agent reaches over 80% success rate within 8 hours. During this time, the human annotator provides 65 ± 10 additional reward sketches. This experiment demonstrates that it is possible to solve an industrial robot task from vision using human feedback within a single working day.

Two successful episodes of USB insertion are shown in the last two rows of Fig. 4. In the first example, the robot successfully inserts the key using only pixel inputs. As only vision input is used during training and actions are defined with respect to the wrist frame, the resulting policy is robust to unseen positional changes. In the second example, the agent (which is trained on the unperturbed state) can perform the insertion despite the input socket being moved significantly.

IV. RELATED WORK

RL has a long history in robotics [37, 53, 34, 25, 42, 43, 35]. However, applying RL in this domain inherits all the general difficulties of applying RL in the real world [18]. Most published works either rely on state estimation for a specific task, or work in a very limited regime to learn from raw observations. These methods typically entail highly engineered reward functions. In our work, we go beyond the usual scale of RL applications in robotics, learning from raw observations and without predefined rewards.

Batch RL trains policies from a fixed dataset and, thus, it is particularly useful in real-world applications like robotics. It is currently an active area of research (see the work of Lange et al. [41] for an overview), with a number of recent works aimed at improving its stability [22, 30, 1, 40].

In the RL-robotics literature, QT-Opt [35] is the closest approach to ours. The authors collect a dataset of over 580,000 grasps over several weeks with 7 robots. They train a distributed Q-learning agent that shows remarkable generalization to different objects. Yet, the whole system focuses on a single task: grasping. This task is well-suited for reward engineering and scripted data collection policies. However, these techniques are not easy to design for many tasks and, thus, relying on them limits the applicability of the method. In contrast, we collect diverse data and learn the reward functions.

Learning reward functions using inverse RL [52] has achieved tremendous success [20, 27, 44, 21, 48, 73, 6]. This class of methods works best when applied to states or well-engineered features. Making it work for high-dimensional input spaces, particularly raw pixels, remains a great challenge.

Learning from preferences has a long history [66, 50, 19, 64, 14, 32]. Interactive learning and optimization with human preferences dates back to work at the interface of machine learning and graphics [10, 11]. Preference elicitation is also used for reward learning in RL [65, 71]. It can be done by whole-episode comparisons [2, 3, 12, 59] or shorter clip comparisons [13, 29]. A core challenge is to engineer methods that acquire many preferences with as little user input as possible [38]. To deal with this challenge, our reward sketching interface allows perceptual reward learning [60] from any trajectories, even unsuccessful ones.

Many works in robotics choose to learn from demonstrations to avoid the hard exploration problems of RL. For example, supervised learning to mimic demonstrations is done in BC [55, 56]. However, BC requires high-quality, consistent demonstrations of the target task and, as such, it cannot benefit from heterogeneous data. Moreover, BC policies generally cannot outperform the human demonstrator. Demonstrations can also be used in RL [51, 57] to address the exploration problem. As in prior works [67, 54, 68], we use demonstrations as part of the agent experience and train with temporal difference learning in a model-free setting.

Several large-scale robotic datasets have been released recently to advance data-driven robotics. RoboTurk [47] collects crowd-sourced demonstrations for three tasks with a mobile platform; the dataset is used in experiments with online RL. The MIME dataset [61] contains both human and robot demonstrations for 20 diverse tasks, and its potential is tested in experiments with BC and task recognition. The RoboNet database [15] focuses on transferring experience across objects and robotic platforms; its large-scale collection is possible thanks to scripted policies, and its strength is evaluated in action-conditioned video prediction and in action prediction. Our dataset [16] is collected with demonstrations, scripted policies as well as learned policies. This paper is the first to show how to efficiently label such datasets with rewards and how to apply batch RL to such challenging domains.

V. CONCLUSIONS

We have proposed a new data-driven approach to robotics. Its key components include a method for reward learning, retrospective reward labelling and batch RL with distributional value functions. A significant amount of engineering and innovation was required to implement this at the present scale. To further advance data-driven robotics, reward learning and batch RL, we release the large datasets [16] from NeverEnding Storage and canonical agents [28].

We found that reward sketching is an effective way to elicit reward functions, since humans are good at judging progress toward a goal. In addition, the paper also showed that storing robot experience over a long period of time and across different tasks allows us to efficiently learn policies in a completely off-line manner. Interestingly, diversity of training data seems to be an essential factor in the success of standard state-of-the-art RL algorithms, which were previously reported to fail when trained only on expert data or the history of a single agent [22]. Our results across a wide set of tasks illustrate the versatility of our data-driven approach. In particular, the learned agents showed a significant degree of generalization and robustness.

This approach has its limitations. For example, it involves a human in the loop during training, which implies additional cost. The reward sketching procedure is not universal, and other strategies might be needed for different tasks. In addition, the learned agents remain sensitive to significant perturbations in the setup. These open questions are directions for future work.

ACKNOWLEDGMENTS

We would like to thank all the colleagues at DeepMind who teleoperated the robot for data collection.


REFERENCES

[1] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. Striving for simplicity in off-policy deep reinforcement learning. arXiv preprint arXiv:1907.04543, 2019.

[2] Riad Akrour, Marc Schoenauer, and Michèle Sebag. APRIL: Active preference learning-based reinforcement learning. In ECML PKDD, pages 116–131, 2012.

[3] Riad Akrour, Marc Schoenauer, Michèle Sebag, and Jean-Christophe Souplet. Programming by feedback. In International Conference on Machine Learning, pages 1503–1511, 2014.

[4] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173–182, 2016.

[5] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[6] Nir Baram, Oron Anschel, Itai Caspi, and Shie Mannor. End-to-end differentiable adversarial imitation learning. In International Conference on Machine Learning, pages 390–399, 2017.

[7] Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. In International Conference on Learning Representations, 2018.

[8] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458, 2017.

[9] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.

[10] Eric Brochu, Nando de Freitas, and Abhijeet Ghosh. Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems, pages 409–416, 2007.

[11] Eric Brochu, Tyson Brochu, and Nando de Freitas. A Bayesian interactive optimization approach to procedural animation design. In SIGGRAPH Symposium on Computer Animation, pages 103–112, 2010.

[12] Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, pages 783–792, 2019.

[13] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.

[14] Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In International Conference on Machine Learning, pages 137–144, 2005.

[15] Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. In Conference on Robot Learning, 2019.

[16] DeepMind. Sketchy data, 2020. URL https://github.com/deepmind/deepmind-research/tree/master/sketchy.

[17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 4171–4186, 2019.

[18] Gabriel Dulac-Arnold, Daniel J. Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.

[19] Stephen E Fienberg and Kinley Larntz. Log-linear representation for paired and multiple comparison models. Biometrika, 63(2):245–254, 1976.

[20] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.

[21] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations, 2018.

[22] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018.

[23] Dhiraj Gandhi, Lerrel Pinto, and Abhinav Gupta. Learning to fly by crashing. In International Conference on Intelligent Robots and Systems, pages 3948–3955, 2017.

[24] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772, 2014.

[25] Roland Hafner and Martin Riedmiller. Reinforcement learning in feedback control. Machine Learning, 84(1-2):137–169, 2011.

[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Computer Vision and Pattern Recognition, pages 770–778, 2016.

[27] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

[28] Matt Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Feryal Behbahani, Tamara Norman, Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli, Sarah Henderson, Alex Novikov, Sergio Gómez Colmenarejo, Serkan Cabi, Caglar Gulcehre, Tom Le Paine, Andrew Cowie, Ziyu Wang, Bilal Piot, and Nando de Freitas. Acme: A research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020.

[29] Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in Atari. In Advances in Neural Information Processing Systems, pages 8011–8023, 2018.

[30] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.

[31] Rae Jeong, Yusuf Aytar, David Khosid, Yuxiang Zhou, Jackie Kay, Thomas Lampe, Konstantinos Bousmalis, and Francesco Nori. Self-supervised sim-to-real adaptation for visual robotic manipulation. arXiv preprint arXiv:1910.09470, 2019.

[32] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Transactions on Information Systems, 25(2), 2007.

[33] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In IEEE Computer Vision and Pattern Recognition, 2017.

[34] Mrinal Kalakrishnan, Ludovic Righetti, Peter Pastor, and Stefan Schaal. Learning force control policies for compliant manipulation. In International Conference on Intelligent Robots and Systems, pages 4639–4644, 2011.

[35] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673, 2018.

[36] Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2018.

[37] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcementlearning in robotics: A survey. The International Journal ofRobotics Research, 32(11):1238–1274, 2013.

[38] Yuki Koyama, Issei Sato, Daisuke Sakamoto, and Takeo Igarashi. Sequential line search for efficient visual design optimization by crowds. ACM Transactions on Graphics, 36(4):1–11, 2017.

[39] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[40] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11761–11771, 2019.

[41] Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45–73. Springer, 2012.

[42] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

[43] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.

[44] Yunzhu Li, Jiaming Song, and Stefano Ermon. InfoGAIL: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pages 3812–3822, 2017.

[45] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.

[46] Andrew Maas, Ziang Xie, Dan Jurafsky, and Andrew Ng. Lexicon-free conversational speech recognition with neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 345–354, 2015.

[47] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–893, 2018.

[48] Josh Merel, Yuval Tassa, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg Wayne, and Nicolas Heess. Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201, 2017.

[49] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[50] F. Mosteller. Remarks on the method of paired comparisons: I. The least squares solution assuming equal standard deviations and equal correlations. Psychometrika, 16:3–9, 1951.

[51] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In IEEE International Conference on Robotics & Automation, pages 6292–6299, 2018.

[52] Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663–670, 2000.

[53] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

[54] Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado Van Hasselt, John Quan, Mel Vecerík, et al. Observe and look further: Achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593, 2018.

[55] Dean A Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.

[56] Rouhollah Rahmatizadeh, Pooya Abolghasemi, Ladislau Bölöni, and Sergey Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In IEEE International Conference on Robotics & Automation, pages 3758–3765, 2018.

[57] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. Robotics: Science and Systems, 2018.

[58] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[59] Dorsa Sadigh, Anca D. Dragan, Shankar Sastry, and Sanjit A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems, 2017.

[60] Pierre Sermanet, Kelvin Xu, and Sergey Levine. Unsupervised perceptual rewards for imitation learning. Robotics: Science and Systems, 2017.

[61] Pratyusha Sharma, Lekha Mohan, Lerrel Pinto, and Abhinav Gupta. Multiple interactions made easy (MIME): Large scale demonstrations data for imitation. In Conference on Robot Learning, pages 906–915, 2018.

[62] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387–395, 2014.

[63] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[64] Hal Stern. A continuum of paired comparison models. Biometrika, 77:265–273, 1990.

[65] Malcolm J. A. Strens and Andrew W. Moore. Policy search using paired comparisons. Journal of Machine Learning Research, 3:921–950, 2003.

[66] L. L. Thurstone. A law of comparative judgment. Psychological Review, 34:273–286, 1927.

[67] Matej Vecerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.

[68] Mel Vecerik, Oleg Sushkov, David Barker, Thomas Rothörl, Todd Hester, and Jon Scholz. A practical approach to insertion with variable socket position using deep reinforcement learning. In IEEE International Conference on Robotics & Automation, pages 754–760, 2019.

[69] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, et al. AlphaStar: Mastering the real-time strategy game StarCraft II. DeepMind Blog, 2019.

[70] Paul J Werbos et al. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

[71] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46, 2017.

[72] Markus Wulfmeier, Abbas Abdolmaleki, Roland Hafner, Jost Tobias Springenberg, Michael Neunert, Tim Hertweck, Thomas Lampe, Noah Siegel, Nicolas Heess, and Martin Riedmiller. Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228, 2019.

[73] Yuke Zhu, Ziyu Wang, Josh Merel, Andrei Rusu, Tom Erez, Serkan Cabi, Saran Tunyasuvunakool, János Kramár, Raia Hadsell, Nando de Freitas, et al. Reinforcement and imitation learning for diverse visuomotor skills. Robotics: Science and Systems, 2018.

