
EUROGRAPHICS 2021 / N. Mitra and I. Viola (Guest Editors)

Volume 40 (2021), Number 2

Learning Human Search Behavior from Egocentric Visual Inputs

Maks Sorokin¹, Wenhao Yu¹,², Sehoon Ha¹,², C. Karen Liu³

{maks,wenhaoyu,sehoonha}@gatech.edu, [email protected]
¹ Georgia Institute of Technology, Atlanta, GA, 30308, USA

² Robotics at Google, Mountain View, CA, 94043, USA
³ Stanford University, Stanford, CA, 94305, USA

Figure 1: A humanoid character learns to navigate and search for a target object (the mustard bottle) in a photorealistic 3D scene using its own egocentric vision and locomotion capability. Top: third-person view. Bottom: first-person view.

Abstract
“Looking for things” is a mundane but critical task we repeatedly carry out in our daily lives. We introduce a method to develop a human character capable of searching for a randomly located target object in a detailed 3D scene using its locomotion capability and egocentric vision perception represented as RGBD images. By depriving the human character of privileged 3D information, we force it to move and look around simultaneously to account for its restricted sensing capability, resulting in natural navigation and search behaviors. Our method consists of two components: 1) a search control policy based on an abstract character model, and 2) an online replanning control module for synthesizing detailed kinematic motion based on the trajectories planned by the search policy. We demonstrate that the combined techniques enable the character to effectively find often-occluded household items in indoor environments. The same search policy can be applied to different full-body characters without retraining. We evaluate our method quantitatively by testing it on randomly generated scenarios. Our work is a first step toward creating intelligent virtual agents with humanlike behaviors driven by onboard sensors, paving the road toward future robotic applications.

CCS Concepts
• Computing methodologies → Procedural animation; Motion processing;

1. Introduction

“We spend about 5000 hours of our lives looking for things around the home” [IKE19]. Indeed, searching for objects in complex indoor environments is a frequent event in our daily life: we look for ingredients in the kitchen for a recipe, we locate grocery items on the shelves in a supermarket, and we seem to always be in search of phones, keys, or glasses many times a day. The goal of this paper is to model such important and ubiquitous behaviors by developing a virtual human capable of using its egocentric vision perception and locomotion capability to search for any randomly placed target object in a complex 3D scene.

Search behaviors depend on simultaneous locomotion and survey of the environment, requiring modeling not only the physical motor skills, but also human sensory and decision making.



Conventional character animation assumes full knowledge of the 3D environment and utilizes it to achieve optimal movements. While optimality is indeed observed in many human locomotion and manipulation tasks, it is also at odds with the stochastic nature of human sensing and decision making capabilities. Having the “oracle ability” to know exactly the 3D position of every vertex in the scene will likely result in “optimal but unnatural” search behaviors. Alternatively, imitating or tracking motion capture trajectories directly can potentially lead to natural human behaviors. However, pre-scripting or pre-planning a reference trajectory can be challenging for a search controller tasked to find objects placed at random locations in random scenes.

This paper builds on the hypothesis that equipping the virtual character with human-like sensing capabilities can lead to more natural behaviors, as demonstrated by previous work [YLNP12, NZC∗18, EHSN20]. In particular, we limit the virtual character to egocentric vision perception from RGBD images when performing the search task. Without full state information about the environment or its own global position and orientation, the character is forced to coordinate its motor capabilities to navigate and scan the scene to find the target object, naturally inducing humanlike decision-making behaviors under partial observation. Our method consists of two components: 1) a search control policy that determines where the character should move to and look at based on an abstract model, and 2) an online replanning control module for synthesizing the detailed kinematic motion based on the planned trajectories from the search policy. This decoupling of the task provides several benefits: first, training the search policy with an abstract model that only includes a torso and a head/camera is more computationally efficient for a high-dimensional and large observation space, and second, a trained search policy can be re-used without retraining for multiple characters that share the same abstract model.

We use deep reinforcement learning (DRL) to train the search policy, which takes as input the visual perception from the abstract model and predicts where it should move to and look at. Our training framework shares some similarities with existing works in visual navigation while having a few key distinctions. Unlike most vision-guided navigation applications, our agent can actively choose where to look independent of the body's moving direction, enabling more plausible head movements and effective search strategies for the character. However, decoupling the body movement from the gaze direction makes the learning problem more challenging due to the larger observation space and the higher dimensional action space. We show that by combining Soft Actor Critic (SAC) with a contrastive loss proposed by Srinivas et al. [SLA20], a vision-guided policy can effectively learn features from high dimensional pixels for our search problem. In addition, transferring the policy trained for an abstract model to a full-body character presents many challenges. Drawing an analogy from sim-to-real transfer learning, we propose a zero-shot online replanning method to transfer a model-agnostic policy to the biped human model and a wheel-based robot model. Combining offline visuomotor policy learning with online trajectory planning results in a virtual human capable of making motion plans using egocentric vision perception.

We demonstrate our method on a human and a robotic character searching for household items in realistic indoor scenes. We show that the character is able to find a small object, such as a pair of glasses, in a large space including an open kitchen and a living room populated with furniture and other objects. To assess the overall performance of our policy, we report the success rate of the search policy tested on randomly created scenarios. We further demonstrate the importance of enabling the head movement of the character for both better learning performance and better search behavior. Our work is a first step toward creating intelligent virtual agents with humanlike behaviors driven by onboard sensors, paving the road toward future robotic applications.

2. Related Work

Our research is inspired by prior work in visual navigation, deep reinforcement learning, and data-driven kinematic animation. Below we review each of them in turn.

2.1. Visual Navigation

Training an autonomous agent to navigate complex environments from visual inputs has been an important topic in computer graphics, robotics, and machine learning [ACC∗18, KWR∗16, MPV∗16, ZMK∗17, PZI∗19].

Some of the work by Kuffner et al. [KJL99] tackles the problem using path-planning and path-following algorithms that utilize privileged information about the environment (e.g., floor layouts) and aims to generate collision-free paths. Shim et al. [SYT17] and Wang et al. [WSY∗18] avoid the use of privileged information and learn to approach the goal object, performing feature-based goal identification while tackling the search via random exploration. In this work, the character is deprived of privileged information and perceives the world only via visual observations, which leads to a learned searching behavior.

Enabling these agents to work with real images is essential for applying them to real-world applications. However, directly training visual navigation agents in real environments is expensive, especially since we usually want to train agents for a large variety of environments. To this end, researchers have developed simulated environments that leverage modern 3D scanning techniques to reproduce real-world scenarios and allow agents to observe photo-realistic visual inputs in a scalable way [XSL∗20, MAO∗19, KMH∗17]. These tools enable rapid advancements in learning algorithms and neural network structures for training visual navigation agents [GDL∗17, PS17, ZTB∗17, FTFFS19, WKM∗20]. For example, Fang et al. [FTFFS19] proposed a scene memory transformer architecture that saves the history of observations into memory and extracts relevant information using the Transformer architecture [VSP∗17]. Wijmans et al. developed decentralized distributed proximal policy optimization (DD-PPO), which allows large-scale training of visual navigation agents, and demonstrated that with large-scale training one can obtain agents that generalize to novel scenarios [WKM∗20].

Our method also utilizes the simulation tools developed by other researchers to train our virtual human in a realistic environment.


Specifically, we used iGibson, which provides a suite of realistic indoor environments [XSL∗20]. Unlike prior work in visual navigation that focuses on agents moving in 2D or 2.5D spaces, e.g., mobile robots, our work solves visual navigation tasks with the additional challenge of controlling the egocentric perspective.

2.2. Deep Reinforcement Learning

Deep reinforcement learning (DRL) provides a general framework for the automatic design of controllers for complex motor skills from simple reward functions. Within the graphics community, researchers have applied DRL algorithms to a variety of physics-based control problems, such as locomotion [YTL18, PBYVDP17, PALvdP18], manipulation [CYT∗18], aerial behaviors [WPKL17], and soft-body motion [MWL∗19]. However, most of these methods use low-dimensional character states or exploit privileged 3D information in the input space of the policy in order to simplify the learning problem. Directly learning a controller from egocentric vision inputs remains a challenging problem. Recent advancements in image-based deep reinforcement learning have shown promising progress in addressing this challenge [SLA20, HFW∗19, OLV18]. For example, Srinivas et al. proposed to learn an embedding of the visual input by minimizing a contrastive loss between randomly cropped input images from the replay buffer [SLA20]. Their method demonstrated superior performance on a set of image-based robotic control problems. In our work, we apply the method by Srinivas et al. [SLA20] to train egocentric vision-based policies to accomplish the search task.

Similar to our work, Merel et al. investigated the problem of creating full-body human motions with egocentric vision-based control policies [MAP∗19, MTA∗20]. In particular, they developed a hierarchical control scheme that exploits egocentric vision to coordinate low-level motor skill modules derived from motion capture demonstrations. Their learning approach is able to train character locomotion and whole-body manipulation using visual inputs. Our work also takes 2D images as input, but our 3D scenes contain detailed geometry with photo-realistic appearance, resulting in a much more complex observation space than those used in the previous work. In addition, the vision inputs are more critical to the search task and require careful coordination between the locomotion and the gaze direction to enable the character to navigate in a cluttered environment while thoroughly surveying it.

2.3. Data-driven Kinematic Animation

Data-driven kinematic animation has been an effective approach for generating realistic human animations from example motion trajectories. Early work constructs graph-based structures to automatically transition between recorded motion clips [LCR∗02, AF02, KGP02]. Although these methods can successfully generate whole-body motions, the output motions are limited to the motion clips in the database. To overcome this limitation, interpolation techniques based on linear bases [SHP04, CH05] or statistical transition models [CH07, LWB∗10] are adopted for predicting more expressive motions from a smaller number of examples. These methods are further extended by exploiting neural networks, such as the conditional Restricted Boltzmann Machine (cRBM) [TH09] or the Encoder-Recurrent-Decoder (ERD) network [FLFM15]. Recently, many researchers have demonstrated that deep neural networks can successfully learn human motion manifolds for bipedal locomotion [HSK16, HKS17], quadrupedal locomotion [ZSKS18], object interactions [SZKS19], and motion retargeting [AWL∗19]. We utilize previous work by Holden et al. [HKS17] and Starke et al. [SZKS19] to generate detailed full-body animations from an abstract trajectory planned by the search policy.

3. Overview

Given a realistic 3D indoor scene populated with furniture and household items, we develop a virtual human capable of using its own egocentric vision and locomotion capability to search for any randomly placed target object in the scene, including those occluded by furniture or by the layout of the room. We take a hierarchical approach which consists of two components: a search policy that operates on an abstract agent to determine the movement and gaze direction at every time step, and a motion synthesis module that synthesizes the kinematic motion on a full-body character to realize the actions determined by the search policy.

We make a few assumptions in our framework. The vision input includes RGBD images and a mask channel in which the target object has been segmented and labeled. Automatic segmentation and object recognition for general scenes and objects from raw images is a challenging computer vision problem, beyond the scope of this work. In addition, we assume that the 3D environment has furniture and partial divisions that block the line of sight, but is roughly one connected space, as we do not attempt to solve a maze navigation problem.

4. Search Control Policy

While end-to-end DRL approaches have demonstrated success in learning motor skills, solving a visuomotor policy with a detailed full-body character in a large and highly textured environment remains challenging due to the large and complex observation space of the agent. In this work, we propose to first use the learning approach to train an agent-agnostic search policy that has a vision-based observation space but an abstract action space. Once trained, the search policy can be applied to characters with various kinematics to synthesize search behaviors with full-body movements. To this end, we define an abstract model that consists of a cylinder-shaped main body and a camera connected to the main body via a universal joint with two degrees of freedom (dofs). The abstract model can move around in the 3D space while the camera can point in different directions independently from the body movement. This additional head movement allows a character to simultaneously navigate and look around, which is essential to model human-like search behaviors.

4.1. Problem Formulation

We formulate the vision-based search task as a Partially Observable Markov Decision Process (POMDP), (S, O, A, T, r, p0, γ), where S is the state space, O is the observation space, A is the action space, T is the transition function, r is the reward function, p0 is the initial state distribution, and γ is a discount factor.


Figure 2: Overview of the learning pipeline for training the search policy.

We take the approach of model-free reinforcement learning to find a policy π such that it maximizes the accumulated reward:

J(\pi) = \mathbb{E}_{s_0, a_0, \ldots, s_T}\left[ \sum_{t=0}^{T} \gamma^t r(s_t, a_t) \right],   (1)

where s0 ∼ p0, at ∼ π(ot), ot ∼ c(st), and st+1 = T(st, at). The state space contains 3D information about the environment and the global position and orientation of the agent for reward evaluation during training, but it is not available to the policy during testing. Instead, the policy can only access limited information observable by the onboard sensors. Our POMDP is defined as follows:

Observation space. The observation space consists of two sensing modalities: vision and proprioception (Figure 2). The proprioception for the abstract model contains only the joint position between the camera and the cylinder: the pitch angle qp and the yaw angle qy. The agent does not have access to its global position and orientation.

The vision perception is represented as 2D images observable by standard RGBD cameras, augmented by mask images that provide the segmentation of the target object. For the search task, we exclude the color information (RGB) because we found that the depth and mask images alone contain sufficient information for finding collision-free paths to complete the task when navigating in a cluttered 3D scene. The depth image D (84 × 84), obtained with a field of view (FOV) angle of 90°, has a maximal depth of 5 m and is normalized to the range [0, 1]. This setting has proven effective for visual navigation tasks, such as Point-Goal navigation [SKM∗19].
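For illustration, the following minimal sketch (our own, not the authors' implementation; the OpenCV resizing backend and function name are assumptions) shows one way to turn a raw metric depth map into the normalized 84 × 84 input described above.

```python
import numpy as np
import cv2  # assumed resizing backend; any image library would do

MAX_DEPTH_M = 5.0  # maximal sensing depth from the paper
IMG_SIZE = 84      # depth image resolution used by the policy

def preprocess_depth(depth_m: np.ndarray) -> np.ndarray:
    """Clip a metric depth map at 5 m, normalize to [0, 1], resize to 84x84."""
    depth = np.clip(depth_m, 0.0, MAX_DEPTH_M) / MAX_DEPTH_M
    depth = cv2.resize(depth.astype(np.float32), (IMG_SIZE, IMG_SIZE),
                       interpolation=cv2.INTER_NEAREST)
    return depth
```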

The mask image M is a binary image which contains 1s at the target object's pixel locations and 0s otherwise. When the target object is not in the field of view, M provides no information. We process the raw mask image M to obtain a feature vector m = [xc, yc, r, α, M], where xc and yc are the average coordinates of the pixels with value 1, r = √(xc² + yc²) and α = arctan(yc, xc) are their polar coordinates, and M is the downsampled mask image of size 5 × 5. Using our mask feature vector instead of the raw mask image reduces the dimensionality of the state space and forces the agent to learn a policy agnostic to object shapes. Note that xc, yc, r, and α are all defined in the image frame. The observation at every time step t is defined as ot = [Dt, Dt−1, ..., Dt−K+1, mt, qpt, qyt], where we concatenate the K most recent depth images so the policy has some “short-term memory” of the environment. We set K = 5 for all experiments.
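For concreteness, here is a minimal sketch of computing the mask feature vector m from a binary mask, following the description above. The max-pooling downsampling and the zero fallback when the target is out of view are our own assumptions for illustration.

```python
import numpy as np

def mask_feature(mask: np.ndarray, grid: int = 5) -> np.ndarray:
    """Compute m = [xc, yc, r, alpha, 5x5 downsampled mask] from a binary mask."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        # Target not in view: the mask carries no information (zeros assumed here).
        xc = yc = r = alpha = 0.0
    else:
        xc, yc = xs.mean(), ys.mean()   # centroid in image coordinates
        r = np.hypot(xc, yc)            # polar radius
        alpha = np.arctan2(yc, xc)      # polar angle
    h, w = mask.shape
    # Coarse 5x5 grid: does any target pixel fall inside each cell? (one plausible downsampling)
    coarse = mask[: h - h % grid, : w - w % grid]
    coarse = coarse.reshape(grid, h // grid, grid, w // grid).max(axis=(1, 3))
    return np.concatenate([[xc, yc, r, alpha], coarse.ravel()])
```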

Action space. We define a compact action space that only determines the agent's 2D global movement and the camera direction. Specifically, the action vector is defined as a = [∆x, ∆y, ∆θ, ∆qp, ∆qy], which are the relative movements in the forward direction, lateral direction, yaw orientation, camera pitch angle, and camera yaw angle, respectively. We use the action at the current time step to modify the target global configuration and camera angles for the next time step. The next state of the abstract model is simulated by tracking the target global configuration and camera pose using position control in a physics simulator: st+1 = T(st, a).
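A minimal sketch of how such a relative action could update the tracking targets; the dataclass and the planar integration step are our own simplification of the abstract model, not the authors' implementation, and a physics simulator would then track these targets with position control.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AbstractState:
    x: float      # global position (m)
    y: float
    theta: float  # body yaw (rad)
    qp: float     # camera pitch (rad)
    qy: float     # camera yaw (rad)

def apply_action(s: AbstractState, a: np.ndarray) -> AbstractState:
    """a = [dx, dy, dtheta, dqp, dqy]; forward/lateral deltas are body-relative."""
    dx, dy, dtheta, dqp, dqy = a
    # Rotate the body-frame translation into the world frame.
    wx = s.x + dx * np.cos(s.theta) - dy * np.sin(s.theta)
    wy = s.y + dx * np.sin(s.theta) + dy * np.cos(s.theta)
    return AbstractState(wx, wy, s.theta + dtheta, s.qp + dqp, s.qy + dqy)
```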

Reward function. Unlike during policy execution, when we evaluate the reward function during training we utilize all 3D information relevant to the reward calculation, such as the global position and orientation of the agent, the 3D coordinate of the target object, and the 3D meshes of the environment. Inspired by the work of Savva et al. [SKM∗19], we define the following reward function:

r(s_t) = w_1 r_s(s_t) + w_2 r_d(s_t) + w_3 r_l(s_t) + w_4 r_c(s_t).   (2)

rs is a success reward of 10 when the agent successfully finds the object, which is awarded only once the success-checking terminal condition is invoked. rd measures the negative distance to the goal and returns 0 if the goal is not visible, which encourages the agent to move toward and look at the target. rl is a live penalty of −0.1. Finally, rc checks for collision between the cylinder (the main body) and the environment mesh and penalizes collision by rc(st) = clip(−0.1 ncol, −3.0, 0.0), where ncol is the number of collisions in the history. For all experiments, we use the same weight vector [1.0, 1.0, 1.0, 0.1].
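The reward terms above can be summarized in a short sketch (the visibility, distance, and collision quantities are assumed to come from the simulator's ground-truth state; the function itself is ours, for illustration only):

```python
import numpy as np

WEIGHTS = np.array([1.0, 1.0, 1.0, 0.1])  # [w1, w2, w3, w4] from the paper

def reward(found, dist_to_goal, goal_visible, n_collisions):
    """Combine success, distance, live-penalty, and collision terms (Eq. 2)."""
    r_s = 10.0 if found else 0.0                     # success bonus
    r_d = -dist_to_goal if goal_visible else 0.0     # move toward / look at visible goal
    r_l = -0.1                                       # live penalty per step
    r_c = np.clip(-0.1 * n_collisions, -3.0, 0.0)    # accumulated collision penalty
    return float(np.dot(WEIGHTS, [r_s, r_d, r_l, r_c]))
```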

Initial state distribution. To train a robust search behavior, we randomize both the agent's initial location and the target object's location at each episode. We collect candidate locations for both by sampling random places and filtering out invalid candidates that collide with other objects. Note that the target object is not necessarily always blocked from the agent's line of sight and might be visible immediately, which helps facilitate the learning process.

Termination conditions. There are two occasions on which the trajectory rollout can be terminated. The first is a time-based condition, which caps the total number of actions in the environment at Tmax steps, set to 100 for all experiments. The second condition is a success check, which is triggered once the object is within 0.5 m of the agent and the agent is observing the object, i.e., the mask image M contains some 1s.
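A sketch of the success check (distance threshold plus mask test); the agent and target positions are assumed to be available from the simulator's privileged state during training.

```python
import numpy as np

SUCCESS_RADIUS_M = 0.5  # proximity threshold from the paper

def is_success(agent_xy, target_xy, mask: np.ndarray) -> bool:
    """Episode succeeds when the agent is near the target and can see it."""
    close_enough = np.linalg.norm(np.asarray(agent_xy) - np.asarray(target_xy)) < SUCCESS_RADIUS_M
    target_visible = mask.any()  # mask contains at least one 1
    return bool(close_enough and target_visible)
```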

Domain randomization. Similar to many sim-to-real transfer learning problems, the success of the search policy depends on whether it can be transferred to target characters with different state and action spaces. We apply domain randomization to increase the robustness of the search policy when it is transferred to a different character. We identify that the global vertical position of the character can be drastically different from that of the abstract model due to its designed body height.


We therefore randomize the height of the abstract model during training, for the duration of a trajectory rollout, in the range [1.0, 1.8] m. In addition, legged characters may exhibit natural vertical oscillation during locomotion, while the abstract model moves at a constant height. Therefore, we inject white noise in the range [−0.1, 0.1] m into the vertical position of the abstract model at each time step. The randomized vertical movement affects the visual observations by changing the camera position, which is essential for successful transfer of the search policy to different characters.
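A minimal sketch of the two randomizations described above, assuming the noise is drawn uniformly (the ranges come from the paper; the function boundaries are our own):

```python
import numpy as np

rng = np.random.default_rng()

def sample_episode_height() -> float:
    """Draw a camera height for the whole rollout, in [1.0, 1.8] m."""
    return rng.uniform(1.0, 1.8)

def perturb_height(base_height: float) -> float:
    """Add per-step noise in [-0.1, 0.1] m to mimic gait-induced vertical oscillation."""
    return base_height + rng.uniform(-0.1, 0.1)
```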

4.2. Policy Training Process

Training control policies for simulated characters with visual perception input poses several challenges. First, these policies usually have complex structures and a large number of parameters, making it computationally expensive to obtain reliable gradients for updating the policy. In addition, learning to extract useful features from images depending solely on the task reward might lead to suboptimal features that do not generalize well to new scenarios.

We train the policy using Soft Actor Critic (SAC) and Contrastive Unsupervised Representations for Reinforcement Learning (CURL), as proposed in the work of Srinivas et al. [SLA20]. SAC [HZH∗18] is an off-policy model-free reinforcement learning algorithm which has been applied to challenging robotic control problems with desirable sample efficiency [HHZ∗18]. CURL augments the SAC algorithm for learning effective features from high dimensional pixels by jointly optimizing the DRL loss and a contrastive loss to learn a compact latent space.

Specifically, from each input image, CURL randomly crops two sets of smaller images named queries and keys. These cropped images are then passed through a query encoder and a key encoder to obtain low-dimensional latent representations of the image, q and k. CURL formulates a contrastive loss:

L_q = \log \frac{\exp(q^T W k_+)}{\exp(q^T W k_+) + \sum_{i=0}^{K-1} \exp(q^T W k_i)},   (3)

where W is a learnable weight matrix and k+ are keys from the same time instance as q. The contrastive loss encourages the encoded latent features of queries and keys from the same time instance to be close to each other while being far away from the latent features from different time instances under a bilinear product distance. We optimize this contrastive loss jointly with the RL objective (Equation 1). Figure 2 illustrates the data flow of our learning algorithm and indicates the paths along which the gradient information is propagated. We refer readers to the original papers on SAC and CURL for more details about the algorithms.
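A minimal PyTorch-style sketch of the bilinear InfoNCE loss in Equation 3, treating the other keys in the batch as negatives (a common CURL convention; the tensor shapes and names are our own):

```python
import torch
import torch.nn.functional as F

def curl_loss(q: torch.Tensor, k: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """q, k: (B, D) latent codes of query/key crops; W: (D, D) learnable matrix.

    Each query's positive is the key from the same time instance (same row);
    all other keys in the batch act as negatives.
    """
    logits = q @ W @ k.t()                                   # (B, B) bilinear similarities
    logits = logits - logits.max(dim=1, keepdim=True).values  # numerical stability
    labels = torch.arange(q.size(0), device=q.device)         # positives on the diagonal
    return F.cross_entropy(logits, labels)
```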

We represent our search policy as a convolutional neural network. The history of depth images is passed through a pixel CNN to produce an embedding of 128 dimensions [SLA20]. The depth image embedding is then concatenated with the mask feature and the agent state, which are fed into 3 fully connected layers with 1024 neurons to obtain the final action output. We train the search policy using SAC with CURL for 0.75 million simulation samples.
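A sketch of such an architecture in PyTorch; the layer counts and sizes follow the text and Table 2, while the kernel sizes, strides, and the deterministic output head are our own assumptions (SAC would normally output Gaussian distribution parameters).

```python
import torch
import torch.nn as nn

class SearchPolicy(nn.Module):
    """Depth-stack encoder (-> 128-d) + MLP head over [embedding, mask feature, head angles]."""
    def __init__(self, k_frames=5, action_dim=5):
        super().__init__()
        self.encoder = nn.Sequential(            # 4 conv layers, 32 filters (Table 2)
            nn.Conv2d(k_frames, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(128),                  # 128-d latent (Table 2)
        )
        self.head = nn.Sequential(               # 3 hidden FC layers, 1024 units each
            nn.LazyLinear(1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, action_dim),
        )

    def forward(self, depth_stack, mask_feat, proprio):
        z = self.encoder(depth_stack)            # (B, 128) from (B, 5, 84, 84) depth history
        x = torch.cat([z, mask_feat, proprio], dim=-1)
        return self.head(x)                      # action (or action distribution parameters)
```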

5. Full-body Motion Synthesis

The search policy generates global/root configurations and head/camera movements for the abstract model. However, this trajectory cannot be transferred directly to the actual character due to the discrepancy in the state and action spaces and the transition functions between the actual character and the abstract model. In particular, the policy will receive different input images due to the oscillating head height of the character and because the character exhibits walking motions that the abstract model does not. Furthermore, given the same command, such as moving forward, the characters will reach different resulting states depending on the transition function of the locomotion model. The differences in character height can be mitigated by domain randomization and by injecting noise into the height of the abstract model during training to robustify the controller, while the difference in head motion is resolved by querying the controller as if it were following the trajectory generated by the actual character.

To overcome the discrepancy in the transition function of the locomotion model, we employ an online replanning scheme to generate character motions that best match the planned trajectory from the abstract model. Our method is analogous to the model predictive control (MPC) framework in that at every time step we replan a long trajectory using the abstract model and execute only a small portion of the trajectory under full-body dynamics. Note that the trajectories must be collision-free for both the abstract and the full-body dynamics.

At the current time step during testing, we first roll out the search policy (with the abstract model) for T time steps from the current state and observation, s0 and o0, by repeatedly querying the policy, at = π(ot), stepping forward in the environment, st+1 = T(st, at), and observing the environment, ot+1 = render(st+1), i.e., rendering the environment to images. Note that the state of the abstract model s includes the root configuration and the head pose. We pass the planned state trajectory s0, ..., sT to the locomotion generator, Phase-Functioned Neural Network (PFNN) [HKS17] or Neural State Machine (NSM) [SZKS19], which generates legged motion on a human character to match the plan. Our method is agnostic to the full-body motion generator as long as it can synthesize a reasonable full-body motion trajectory for the given abstract plan; we refer to them generically as “MG” hereafter.

However, the root and body/head states along the full-body trajectory are likely to deviate from the planned state trajectory due to the difference in the transition function between MG and our abstract model. Applying the strategy of online replanning, we only consider the first M (where M ≪ T) steps of the full-body trajectory q0, ..., qM−1 to be valid, and replan with the abstract model at the Mth step.

One issue with our online replanning scheme is that the locomotion generator, MG, does not generate vision perception during motion synthesis, but replanning at the Mth step requires a short history of depth images as the short-term memory. Furthermore, the planned head motion becomes suboptimal for the search task since the traversed trajectory deviates from the plan. As such, the vision observations generated by the head poses in q0, ..., qM−1 can also result in a suboptimal next plan. To overcome this issue, we iteratively update the history of the abstract model's state st from t = 0 to t = M−1 using the root configuration from the full-body pose qt and the head pose from the search policy.


We re-generate the observations ot using the modified abstract states and store the depth images in the buffer to recover the “memory lapse” from t = 0 to t = M−1. Finally, the abstract model makes a new plan from t = M to t = M+T with optimal head movements and restored memory (the history of depth images). Our algorithm, applied every M time steps, is summarized in Algorithm 1.

Algorithm 1 Online Motion Replanning and Synthesis

1:  Input: s0, o0, q0
2:  B = {s0}                            ▷ initialize plan buffer
3:  for t = 0 : T−1 do                  ▷ generate an abstract trajectory
4:      a = π(ot)                       ▷ query policy
5:      st+1 = T(st, a)                 ▷ advance in environment
6:      ot+1 = render(st+1)             ▷ render observation
7:      Store st+1 in B
8:  q0, ..., qM−1 = MG(B)               ▷ generate M human motion poses
9:  for i = 0 : M−1 do                  ▷ regenerate head orientations & history
10:     si.root = qi.root               ▷ update the root position
11:     si = set_root_height(qi)        ▷ update the camera height
12:     oi = render(si)
13:     Update depth image buffer with oi
14:     a = π(oi)                       ▷ query policy with new observation
15:     si+1 = T(si, a)                 ▷ modify camera pose according to π
16:     qi+1.head = si+1.head           ▷ overwrite head orientation
17: oM = render(sM)                     ▷ render at last camera pose
18: Return sM, oM, q0, ..., qM−1
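For readers who prefer code, here is a minimal Python sketch of one replanning cycle under assumed interfaces: policy, env, and mg are placeholders standing in for the search policy, the abstract-model environment, and the motion generator, and the pose objects are assumed to expose a mutable head attribute.

```python
def replan_step(policy, env, mg, s0, o0, T=100, M=10):
    """One online-replanning cycle: plan T abstract steps, keep M full-body frames."""
    # 1) Roll out the search policy with the abstract model to build a plan buffer.
    plan, s, o = [s0], s0, o0
    for _ in range(T):
        a = policy(o)
        s = env.transition(s, a)
        o = env.render(s)
        plan.append(s)

    # 2) Ask the motion generator for M full-body poses that follow the plan.
    q = mg.generate(plan)[:M]

    # 3) Regenerate head poses and the depth-image history along the executed frames.
    s, depth_history = s0, []
    for i in range(M):
        s = env.set_root_from_pose(s, q[i])   # root position & camera height from full body
        o = env.render(s)
        depth_history.append(o)               # restore the policy's short-term memory
        a = policy(o)
        s = env.transition(s, a)              # head pose chosen by the policy
        if i + 1 < M:
            q[i + 1].head = s.head            # overwrite the character's head orientation
    o_m = env.render(s)
    return s, o_m, q, depth_history
```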

6. Evaluation

We evaluate our method on a humanoid character and a wheel-based robot character, the Fetch robot [WFK∗16]. For the humanoid character, we applied two motion generators to synthesize the full-body motion. Figure 3 shows the humanoid character and the robot we use in our experiments. We test our search controller in a modern home scene with an open kitchen and a living area separated by a countertop, as well as a master bedroom connected to a bathroom (Figure 4). To increase the complexity of the scene, we add a few open cabinets in which the target objects can be placed. We use the iGibson [XSL∗20] environment, which provides 3D scans reconstructed from realistic indoor environments and a photorealistic renderer to generate the vision inputs to the character. For physics simulation, we use PyBullet [CB17] to simulate the motion of the abstract model and to check collisions with the environment.

6.1. Evaluation of Search Policy

We generate 100 random scenarios to evaluate the success rate of the search policy for the abstract model. At the beginning of each test, the agent is randomly assigned to a collision-free location in the scene with a random orientation. Similarly, the target object is placed randomly on any surface in the scene, including the interior of cabinets (Figure 5). If the agent can get close to the target object within 100 steps (∼15 seconds), we consider the trial successful.

Figure 3: An abstract model (left) for policy search and two animated characters, the Fetch robot (middle) and a humanoid character (right), for full-body animation.

Figure 4: The home scene used in our experiments. (a) A kitchen and a living area separated by a countertop. (b) A bedroom connected to a bathroom.

Figure 5: Examples of object placements in our experiments.

Since the search policy will be used by different agents with specific body heights, we evaluate the performance of the search policy by setting the abstract model to three different heights: 1.65 m, 1.05 m, and 0.45 m, where the first two correspond to the heights of the human character and the Fetch robot, and 0.45 m is selected for further comparison. We compute the success rate of the policy with these three heights and show the results in Table 1. In general, the advantage of height gives tall characters a better view of surfaces, while shorter characters struggle to see objects placed on surfaces above their heads.


As such, there is a nearly 20% drop in success rate for the shortest character.

We also test the policy with two sampling schemes for the target object locations: 1) sampling everywhere except for the inside of the low cabinets on the floor, and 2) sampling everywhere. The results show that the tall characters (1.05 m and 1.65 m) perform worse on the challenging cases where objects are inside the low cabinets, while the success rate of the shorter character (0.45 m) increases when we allow objects to be sampled inside the cabinets.

Head height    Excluding cabinets    Everywhere
0.45 m         50%                   58%
1.05 m         92%                   75%
1.65 m         90%                   76%

Table 1: Performance of the search policy with different camera heights and object locations. In general, abstract models with higher camera locations show better success rates. However, lower cameras are beneficial if we include the challenging case where the object is hidden in the cabinets on the floor.

6.2. Comparison to Search Policy without Head Movements

Figure 6: Training curves comparison of abstract agents with and without head movements. The result indicates that the agent with head movements shows a 20% higher success rate than the agent without head movements.

Our key hypothesis is that active head movements lead to more effective search behavior by allowing the character to look at different parts of the scene. To evaluate this, we train a baseline search policy for an abstract character without the degrees of freedom to move its head relative to the body. To look around the environment, this agent needs to rotate its entire body. The result shows that head movement is crucial, yielding a 20% higher success rate in the search task (Figure 6).

6.3. Evaluation of Full-body Characters

To evaluate the performance of our algorithm, we use a human character with a height of 1.65 m.

To animate the character with a kinematic motion generation model, we use two distinct motion generators: the Phase-Functioned Neural Network (PFNN) [HKS17] and the Neural State Machine (NSM) [SZKS19]. We use both to generate legged locomotion for the character and handle the discrepancy using the online replanning control scheme described in Section 5. Additionally, we apply the search policy to a drastically different model, a Fetch robot with a height of 1.05 m. Fetch is a wheel-based robot with a telescopic degree of freedom to adjust its height. Due to the similarity between our control model and Fetch's differential drive, we directly apply the actions produced by the search policy to Fetch.

We show that the search policy can be successfully realized by the full-body characters even in challenging cases in which the target objects are placed inside cabinets, on the other side of the room, or occluded by furniture. Different search strategies emerge around different locations in the scene. For example, the character tends to move more slowly around cabinets and look carefully at the interior where objects are likely to be placed. The character also utilizes backward steps to get a better view of the surface in front of it. The results are best viewed in the supplemental video, where we show the full-body motion in the first-person view as well as multiple third-person views.

6.4. Performance Analysis

To further analyze the performance of our method and understand the choices of different hyperparameters, we construct a few baselines for comparison:

• 1-step: To evaluate the importance of long-term planning for the search task, we design a baseline that controls the agent by querying the search policy for one single action and passing it to MG. This baseline does not utilize long-term trajectory planning but instead extrapolates a straight-line path in the action direction, with the corresponding orientations. We evaluate two settings with longer and shorter horizons for MG.
• Noisy Search: To evaluate the importance of the search policy before the target object is seen, we create a simple control scheme that queries the trained search policy when the object is visible (using it as an approach mechanism with obstacle avoidance), and otherwise applies actions sampled uniformly at random to move and look around the scene.
• Ours with different Ms: To understand the effects of the hyperparameter choices of the motion synthesis algorithm, we evaluate multiple values of the parameter M, which specifies the number of steps MG returns when attempting to follow the trajectory.

The above baselines and our method are compared using the following metrics:

• Number of attempts: the number of trajectories generated by the abstract search policy before the character reaches the goal.
• SPL: Success weighted by Path Length (see the sketch after this list), represented as

\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} \frac{S_i \, \ell_i}{\max(p_i, \ell_i)},

where \ell_i is the shortest-path length, p_i is the traversed path length, and S_i is a binary indicator of whether rollout i succeeded.
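A small sketch of the SPL computation over a batch of rollouts (array names are ours):

```python
import numpy as np

def spl(success: np.ndarray, shortest: np.ndarray, traversed: np.ndarray) -> float:
    """Success weighted by Path Length over N rollouts.

    success:   binary array S_i, 1 if rollout i found the object
    shortest:  shortest-path lengths l_i from start to target
    traversed: actual path lengths p_i walked by the character
    """
    ratios = success * shortest / np.maximum(traversed, shortest)
    return float(ratios.mean())
```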


Figure 7: Performance of the baselines (to the left of the dashed line) and our online motion planning with different hyperparameter choices (to the right of the dashed line). Left: Success weighted by Path Length, showing the importance of long-term trajectory planning and that simple noise-based exploration is not sufficient for finding the object in the designed environment. Right: smaller values of M require a larger number of abstract trajectories generated by the abstract model, which is the trade-off for a higher SPL.

Figure 8: Failure cases where the character penetrates furniture.

Figure 7 (left) shows that 1-step performs poorly for both MG horizons compared to our method with long-term planning. On average, our method scores 40% higher than 1-step on the SPL metric. The performance of noisy search is also worse than our method by 40%, which shows that structured exploration is required to efficiently find the target object. The performance of our method is sensitive to the choice of the hyperparameter M (the number of executed frames from the motion plan). In our experiments, smaller values of M perform better in terms of SPL because the algorithm replans more frequently, but they also incur a higher computational cost to generate the larger number of abstract trajectories (Figure 7, right).

The most common failure modes for the baselines are related to collisions with environment obstacles, which cause the character to get stuck in the furniture or the walls. Collisions can also happen to the full-body character if it deviates significantly from the plan. However, we found that collisions of the full-body character occur less frequently because the agent is trained with a penalty to avoid the obstacles. Some example scenarios are illustrated in Figure 8.

7. Discussion and Conclusion

In this work, we have developed a virtual character exhibiting search behaviors inspired by humans. We have demonstrated that by training an agent-agnostic search policy and using a replanning algorithm to transfer the planned abstract motion to actual characters, we can obtain successful and plausible search behaviors in complex environments. The decomposition of the search task enabled us to reuse a single trained search policy for characters with different shapes and motor capabilities.

Our key insight is that by depriving the character of privileged 3D information, humanlike behaviors emerge because the character is forced to rely on its own egocentric vision perception and locomotion to complete the task. Furthermore, allowing the head of the character to move independently from the rest of the body leads to more natural search behaviors, while facilitating the learning of a more effective policy.

One limitation of our current system is the dependence on the mask channel to recognize and identify the target object. This assumption can be lifted by incorporating state-of-the-art object recognition methods, such as [HGDG17, XGD∗17]. In our current implementation, we chose not to incorporate a memory structure in our policy beyond a very short history of the vision observations. We find that reasonable search behavior can be obtained without using memory for the set of environments we are working with. On the other hand, when working with more complex scenes, such as navigating an entire building, memory becomes essential for localizing the agent and recognizing places that have been explored in the past. On the locomotion side, we notice that the collision checking during search policy learning is sometimes not sufficient when the policy is applied to the full character (Figure 8). A finer-resolution collision check might be needed to further improve the motion quality. Lastly, our scheme performs reasonably well for characters with relatively small variations in height during locomotion, such as the ones presented here; however, characters with more complex head-motion dynamics, or tasks that require active control of the character's height, may pose a challenge, and further research on the topic is necessary.

There are a few promising future avenues for our work. The first is enabling interactions between the character and the environment, such as opening the fridge or drawers, which would allow more intricate search behaviors to emerge. Furthermore, our algorithm takes the character from a random location in the room to a position in front of the object of interest. This provides an ideal initial pose for the character to perform downstream manipulation tasks such as pouring water into a cup or picking up a phone. How to incorporate manipulation into our system and achieve more complex human behaviors is thus an important direction for future work.


Appendix A: Hyperparameters

In Table 2 we provide a complete list of the hyperparameters used for training with CURL.

Parameter                               Setting
Image size                              100×100
Augmentation                            Random crop (84×84)
Image history buffer size               5
Head history buffer size                1
Replay buffer                           100000
Discount rate γ                         0.99
Number of training steps                0.75M
Batch size                              32
Alpha (SAC): initial temperature        0.1
Alpha (SAC): learning rate              0.0001
Alpha (SAC): optimizer β1               0.5
Alpha (SAC): optimizer β2               0.999
Actor: learning rate                    0.001
Actor: optimizer β1                     0.9
Actor: optimizer β2                     0.999
Actor: number of layers                 4
Actor: hidden dim                       1024
Actor: activation function              ReLU
Critic: learning rate                   0.001
Critic: τ (Polyak averaging)            0.01
Critic: optimizer β1                    0.9
Critic: optimizer β2                    0.999
Critic: number of layers                4
Critic: hidden dim                      1024
Critic: activation function             ReLU
Encoder (CNN): learning rate            0.001
Encoder (CNN): τ (Polyak averaging)     0.05
Encoder (CNN): number of layers         4
Encoder (CNN): number of filters        32
Encoder (CNN): latent dimension         128
Encoder (CNN): activation function      ReLU

Table 2: The complete set of CURL hyperparameters used to conduct all of the training experiments.

References

[ACC∗18] ANDERSON P., CHANG A., CHAPLOT D. S., DOSOVITSKIY A., GUPTA S., KOLTUN V., KOSECKA J., MALIK J., MOTTAGHI R., SAVVA M., ET AL.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018).

[AF02] ARIKAN O., FORSYTH D. A.: Interactive motion generation from examples. ACM Transactions on Graphics (TOG) 21, 3 (2002), 483–490.

[AWL∗19] ABERMAN K., WU R., LISCHINSKI D., CHEN B., COHEN-OR D.: Learning character-agnostic motion for motion retargeting in 2D. arXiv preprint arXiv:1905.01680 (2019).

[CB17] COUMANS E., BAI Y.: PyBullet, a Python module for physics simulation in robotics, games and machine learning, 2016–2017. URL: http://pybullet.org.

[CH05] CHAI J., HODGINS J. K.: Performance animation from low-dimensional control signals. In ACM SIGGRAPH 2005 Papers. ACM, New York, NY, USA, 2005, pp. 686–696.

[CH07] CHAI J., HODGINS J. K.: Constraint-based motion optimization using a statistical dynamic model. In ACM SIGGRAPH 2007 Papers. ACM, New York, NY, USA, 2007, pp. 8–es.

[CYT∗18] CLEGG A., YU W., TAN J., LIU C. K., TURK G.: Learning to dress: Synthesizing human dressing motion via deep reinforcement learning. ACM Transactions on Graphics (TOG) 37, 6 (2018), 1–10.

[EHSN20] EOM H., HAN D., SHIN J. S., NOH J.: Model predictive control with a visuomotor system for physics-based character animation. ACM Trans. Graph. 39 (July 2020).

[FLFM15] FRAGKIADAKI K., LEVINE S., FELSEN P., MALIK J.: Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 4346–4354.

[FTFFS19] FANG K., TOSHEV A., FEI-FEI L., SAVARESE S.: Scene memory transformer for embodied agents in long-horizon tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 538–547.

[GDL∗17] GUPTA S., DAVIDSON J., LEVINE S., SUKTHANKAR R., MALIK J.: Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 2616–2625.

[HFW∗19] HE K., FAN H., WU Y., XIE S., GIRSHICK R.: Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722 (2019).

[HGDG17] HE K., GKIOXARI G., DOLLÁR P., GIRSHICK R.: Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 2961–2969.

[HHZ∗18] HAARNOJA T., HA S., ZHOU A., TAN J., TUCKER G., LEVINE S.: Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103 (2018).

[HKS17] HOLDEN D., KOMURA T., SAITO J.: Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–13.

[HSK16] HOLDEN D., SAITO J., KOMURA T.: A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35, 4 (2016), 1–11.

[HZH∗18] HAARNOJA T., ZHOU A., HARTIKAINEN K., TUCKER G., HA S., TAN J., KUMAR V., ZHU H., GUPTA A., ABBEEL P., ET AL.: Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905 (2018).

[IKE19] IKEA: How much time do we spend searching for things around the home?, 2019. URL: https://www.ikea.com/es/en/ideas/how-much-time-do-we-spend-searching-for-things-around-the-home-pubec2a8ae0.

[KGP02] KOVAR L., GLEICHER M., PIGHIN F.: Motion graphs. In ACM Transactions on Graphics. ACM, New York, NY, USA, 2002, pp. 1–10.

[KJL99] KUFFNER JR. J. J., LATOMBE J.-C.: Perception-based navigation for animated characters in real-time virtual environments.

[KMH∗17] KOLVE E., MOTTAGHI R., HAN W., VANDERBILT E., WEIHS L., HERRASTI A., GORDON D., ZHU Y., GUPTA A., FARHADI A.: AI2-THOR: An interactive 3D environment for visual AI. arXiv (2017).

[KWR∗16] KEMPKA M., WYDMUCH M., RUNC G., TOCZEK J., JASKOWSKI W.: ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG) (2016), IEEE, pp. 1–8.

[LCR∗02] LEE J., CHAI J., REITSMA P. S., HODGINS J. K., POLLARD N. S.: Interactive control of avatars animated with human motion data. In Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques. ACM, 2002, pp. 491–500.

[LWB∗10] LEE Y., WAMPLER K., BERNSTEIN G., POPOVIC J., POPOVIC Z.: Motion fields for interactive character locomotion. In ACM SIGGRAPH Asia 2010 Papers. ACM, 2010, pp. 1–8.

[MAO∗19] MANOLIS SAVVA*, ABHISHEK KADIAN*, OLEKSANDR MAKSYMETS*, ZHAO Y., WIJMANS E., JAIN B., STRAUB J., LIU J., KOLTUN V., MALIK J., PARIKH D., BATRA D.: Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019).

[MAP∗19] MEREL J., AHUJA A., PHAM V., TUNYASUVUNAKOOL S., LIU S., TIRUMALA D., HEESS N., WAYNE G.: Hierarchical visuomotor control of humanoids. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019 (2019).

[MPV∗16] MIROWSKI P., PASCANU R., VIOLA F., SOYER H., BALLARD A. J., BANINO A., DENIL M., GOROSHIN R., SIFRE L., KAVUKCUOGLU K., ET AL.: Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673 (2016).

[MTA∗20] MEREL J., TUNYASUVUNAKOOL S., AHUJA A., TASSA Y., HASENCLEVER L., PHAM V., EREZ T., WAYNE G., HEESS N.: Catch and carry: Reusable neural controllers for vision-guided whole-body tasks. ACM Trans. Graph. 39 (July 2020).

[MWL∗19] MIN S., WON J., LEE S., PARK J., LEE J.: SoftCon: Simulation and control of soft-bodied animals with biomimetic actuators. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–12.

[NZC∗18] NAKADA M., ZHOU T., CHEN H., WEISS T., TERZOPOULOS D.: Deep learning of biomimetic sensorimotor control for biomechanical human animation. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–15.

[OLV18] OORD A. V. D., LI Y., VINYALS O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).

[PALvdP18] PENG X. B., ABBEEL P., LEVINE S., VAN DE PANNE M.: DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–14.

[PBYVDP17] PENG X. B., BERSETH G., YIN K., VAN DE PANNE M.: DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–13.

[PS17] PARISOTTO E., SALAKHUTDINOV R.: Neural map: Structured memory for deep reinforcement learning. arXiv preprint arXiv:1702.08360 (2017).

[PZI∗19] PAN X., ZHANG T., ICHTER B., FAUST A., TAN J., HA S.: Zero-shot imitation learning from demonstrations for legged robot visual navigation. arXiv preprint arXiv:1909.12971 (2019).

[SHP04] SAFONOVA A., HODGINS J. K., POLLARD N. S.: Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces. ACM Transactions on Graphics (TOG) 23, 3 (2004), 514–521.

[SKM∗19] SAVVA M., KADIAN A., MAKSYMETS O., ZHAO Y., WIJMANS E., JAIN B., STRAUB J., LIU J., KOLTUN V., MALIK J., PARIKH D., BATRA D.: Habitat: A platform for embodied AI research. In The IEEE International Conference on Computer Vision (ICCV) (October 2019).

[SLA20] SRINIVAS A., LASKIN M., ABBEEL P.: CURL: Contrastive unsupervised representations for reinforcement learning. arXiv preprint arXiv:2004.04136 (2020).

[SYT17] SHIM V. A., YUAN M., TAN B. H.: Automatic object searching by a mobile robot with single RGB-D camera. In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (2017), IEEE, pp. 056–062.

[SZKS19] STARKE S., ZHANG H., KOMURA T., SAITO J.: Neural state machine for character-scene interactions. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.

[TH09] TAYLOR G. W., HINTON G. E.: Factored conditional restricted Boltzmann machines for modeling motion style. In Proceedings of the 26th Annual International Conference on Machine Learning (2009), pp. 1025–1032.

[VSP∗17] VASWANI A., SHAZEER N., PARMAR N., USZKOREIT J., JONES L., GOMEZ A. N., KAISER Ł., POLOSUKHIN I.: Attention is all you need. In Advances in Neural Information Processing Systems (2017), pp. 5998–6008.

[WFK∗16] WISE M., FERGUSON M., KING D., DIEHR E., DYMESICH D.: Fetch and Freight: Standard platforms for service robot applications. In Workshop on Autonomous Mobile Service Robots (2016).

[WKM∗20] WIJMANS E., KADIAN A., MORCOS A., LEE S., ESSA I., PARIKH D., SAVVA M., BATRA D.: DD-PPO: Learning near-perfect PointGoal navigators from 2.5 billion frames. International Conference on Learning Representations (ICLR) (2020).

[WPKL17] WON J., PARK J., KIM K., LEE J.: How to train your dragon: Example-guided control of flapping flight. ACM Transactions on Graphics (TOG) 36, 6 (2017), 1–13.

[WSY∗18] WANG J., SHIM V. A., YAN R., TANG H., SUN F.: Automatic object searching and behavior learning for mobile robots in unstructured environment by deep belief networks. IEEE Transactions on Cognitive and Developmental Systems 11, 3 (2018), 395–404.

[XGD∗17] XIE S., GIRSHICK R., DOLLÁR P., TU Z., HE K.: Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 1492–1500.

[XSL∗20] XIA F., SHEN W. B., LI C., KASIMBEG P., TCHAPMI M. E., TOSHEV A., MARTÍN-MARTÍN R., SAVARESE S.: Interactive Gibson benchmark: A benchmark for interactive navigation in cluttered environments. IEEE Robotics and Automation Letters 5, 2 (2020), 713–720.

[YLNP12] YEO S. H., LESMANA M., NEOG D. R., PAI D. K.: Eyecatch: Simulating visuomotor coordination for object interception. ACM Trans. Graph. 31, 4 (July 2012). URL: https://doi.org/10.1145/2185520.2185538, doi:10.1145/2185520.2185538.

[YTL18] YU W., TURK G., LIU C. K.: Learning symmetric and low-energy locomotion. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–12.

[ZMK∗17] ZHU Y., MOTTAGHI R., KOLVE E., LIM J. J., GUPTA A., FEI-FEI L., FARHADI A.: Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA) (2017), IEEE, pp. 3357–3364.

[ZSKS18] ZHANG H., STARKE S., KOMURA T., SAITO J.: Mode-adaptive neural networks for quadruped motion control. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–11.

[ZTB∗17] ZHANG J., TAI L., BOEDECKER J., BURGARD W., LIU M.: Neural SLAM: Learning to explore with external memory. arXiv preprint arXiv:1706.09520 (2017).

© 2021 The Author(s). Computer Graphics Forum © 2021 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd. DOI: 10.1111/cgf.142641