Walk the Talk: Coordinating Gesture with Locomotion for Conversational Characters

Yingying Wang1∗  Kerstin Ruhland2†  Michael Neff1‡  Carol O'Sullivan2,3§

1 University of California, Davis   2 Trinity College Dublin, Ireland   3 Disney Research Los Angeles

Abstract

Communicative behaviors are a very important aspect of human behavior, and deserve special attention when simulating groups and crowds of virtual pedestrians. Previous approaches have tended to focus on generating believable gestures for individual characters and talker-listener behaviors for static groups. In this paper, we consider the problem of creating rich and varied conversational behaviors for data-driven animation of walking and jogging characters. We captured ground truth data of participants conversing in pairs while walking and jogging. Our stylized splicing method takes as input a motion captured standing gesture performance and a set of looped full body locomotion clips. Guided by the ground truth metrics, we perform stylized splicing and synchronization of gesture with locomotion to produce natural conversations of characters in motion.

Keywords: animation, motion capture, gesture synthesis

∗[email protected]  †[email protected]  ‡[email protected]  §[email protected]

1 Introduction

While it may be a challenge for some to walk and chew gum at the same time, people frequently and effortlessly talk while they walk. This important behavior is missing, however, from most virtual character systems. Crowd simulations generally lack communicative behavior and miss the natural eye, head and gesture movements of people walking. Systems focused on gesture and nonverbal communication are targeted almost exclusively at standing or sitting characters.

In general, motion capture databases consist of motions specific to a particular domain, e.g., locomotion, conversation, fighting. For combinations of motions, such as walking and talking, it would be impractical to try to capture every possible combination of both types of motion. Simulating conversational behavior for virtual characters while in locomotion is not as trivial as simply compositing separate gesture and locomotion behaviors. Gestures must be adapted to fit the natural arm swings and cadence of walking or jogging behavior. While in locomotion, attention patterns differ compared to standing, as people have to pay attention to their walking path, to their conversational partner and to points of interest.

There is a general dearth of previous research on how people adjust their gesturing behavior while they walk or jog. We therefore began by conducting a study to obtain ground truth data on how people communicate during locomotion. Multiple subjects were video recorded standing, walking and jogging while engaged in different types of discussions and debates. According to our empirical observations, people tend to execute fewer and smaller gestures and gaze shifts toward their conversational partner while walking or jogging than they do while standing.

We take a data-driven approach to create natural animations of talking while walking or jogging. We take as input two sources of data: (i) a pre-existing locomotion database that contains various walking and jogging motions of multiple actors (without gesture); and (ii) gesticulation data from three-way standing conversations.

With our stylized splicing method, locomotion animation is automatically spliced with motion from the gesture database, which is intelligently adapted to match the style of the locomotion actor. For example, we temporally realign gesture emphasis to the locomotion tempo, and synthesize the typical bounce of arms seen in jogging.

Furthermore, we use an addresser-addressee relationship (AAR) to describe orientation behavior for characters engaged in a conversation with each other. In this way, we can convert standing group conversations into walking or jogging ones, by adding attention behavior and re-pairing the conversational partner relationship. The basis for the orientation behavior, and other conversational parameters, is derived from our annotated ground truth video. By transferring and adapting gesticulation data, our system is capable of creating conversational behaviors for individual characters and groups in locomotion.

2 Related Work

Generating conversational behaviors such as gesture, facial expressions and gaze has been a very active area of research. Utterance planning [1], prosody [2, 3], probabilistic modeling from input text [4] or real human performance [5], and rule-based systems [6] have all been used. Head movement and eye gaze of a virtual conversational partner may be used to communicate information about their internal states, attitudes, attentions and intentions [7] or to actively influence the conversation [8]. Ennis et al. [9] found that synchrony of the body motions of the conversing partners in a standing group was very important. However, the combination of conversational behaviors (e.g. gestures and gaze) for groups while walking, jogging and talking has not been explored.

Motion graphs and motion blending techniques have been proposed to reuse and combine existing motions into new motion sequences [10, 11, 12], where potential transition points in motion sequences are chosen based on a posture similarity metric and used to construct a graph structure. New sequences can be produced by stitching together the motion segments from a graph walk. Fernandez-Baena et al. [13] construct a Gesture Motion Graph (GMG) from a labeled gesture database and select the graph walk that best matches the accompanying prosodic accent and the gesture timing slot. Stone et al. [5] build a linguistic network based on a character's utterance and choose optimized edges by penalizing the match of utterance and gesture, the connection of neighboring utterances and adjacent gestures. However, these approaches all treat the character state vector as a monolithic whole, taking all the Degrees of Freedom (DOFs) of data from a single clip at a time.

When splicing motions together, naive DOF replacement can produce unrealistic results as it ignores the physical and stylistic correlations between body parts [14, 15]. Mousas et al. [16] overcome the synchronization problem by using velocity based temporal alignment. Partial-body motion graphs can also be generated to splice and synchronize arm or hand motions with full-body clips [17, 18, 19]. For example, Majkowska et al. [20] integrate separately captured hand motions into full body animation and find the corresponding splicing points using a two-pass dynamic time warping (DTW) algorithm. Our method not only adapts gesture performance spatially to the style of the locomotion arm swing, it also temporally aligns the gesture stroke peak to the locomotion tempo.

3 Ground truth data

While there are many studies of conversational behaviors conducted with sitting or standing participants, we wish to explore how gesture and gaze behaviors differ in the case of conversers in motion. For this paper, we focus on walking and jogging scenarios.

We recorded real video footage of two sets of male and two sets of female participants, aged between 21 and 39, talking together while standing, walking and jogging. To encourage a natural and lively discussion, we selected participants who knew each other and chose conversation topics that were of interest to them. Video and audio were recorded with a Sony HDR-AS100V action camera at a resolution of 1920x1080 and a frame rate of 29 fps. In all conditions, the participants were placed next to each other and orientated toward the camera. The video camera was placed in front of the participants, capturing the whole upper body for later annotation of head rotation and gesture (see Fig. 2). In the walking and jogging conditions, a continuous path of approximately 200 meters was chosen, with the camera moving in front of the subjects.

To create conversations with varying dynamics, we recorded two different conditions: dominant speaker conversations and debates (as in [9]). In the debates, each participant expressed their opinion on the topic being discussed, with interruptions from the conversational partner. An informative topic was chosen for the dominant speaker conversations, where one speaker did the majority of the speaking, while the other politely listened with only occasional responses or questions. In total, 24 dominant speaker conversations (3 for each of the 8 participants) and 12 debates (3 with each of the 4 groups of participants) were recorded. Dominant speaker conversations lasted approximately one minute, while debates lasted between one and two minutes.

3.1 Annotation

We annotated every fifth frame of the video footage to capture all important information, such as head turns and gestures. To approximate head rotation and to compensate for camera and participant movement, the x and y pixel coordinates of a center point on the body and the tip of the nose were marked. The participants' verbal behavior was noted as talking or listening. For each gesture, we reported the type and magnitude (from 0.5 to 4, with 1 being the relaxed hand position of the participant); the elbow bend (0 to 3, with 0 representing no bend and 2 a bend of 90 degrees); the arm displacement (0 representing the arm close to the body, with incrementing steps thereafter); the facing direction of the palm (up, down, in or out); and the peak of the gesture.

For the type of gesture, we used the taxonomy proposed by McNeill [21]: 'Beat' - a rhythmic flick of finger, hand or arm to highlight what is being said; 'Deictic' - a pointing gesture with direction; 'Iconic' - a representation of a concrete object, or drawing with the hand; and 'Metaphoric' - a representation of an abstract concept. We also noted 'Adaptor' motions, such as crossing arms and touching the face or hair.
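The annotation scheme amounts to a small record per gesture. A minimal sketch of such a record in Python, purely illustrative (the class and field names are our own, not part of the paper's tooling):

```python
from dataclasses import dataclass

@dataclass
class GestureAnnotation:
    """One annotated gesture from the video footage (hypothetical schema)."""
    gesture_type: str      # 'beat', 'deictic', 'iconic', 'metaphoric' or 'adaptor'
    magnitude: float       # 0.5 to 4, with 1 = the participant's relaxed hand position
    elbow_bend: int        # 0 to 3, with 0 = no bend and 2 = a 90-degree bend
    arm_displacement: int  # 0 = arm close to the body, incrementing steps outward
    palm_direction: str    # 'up', 'down', 'in' or 'out'
    peak_frame: int        # annotated video frame at which the gesture peaks
    speaker_state: str     # 'talking' or 'listening'
```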

Each gesture is divided into 4 phases: preparation, stroke, hold, and retraction. The gesture normally peaks during the stroke phase. The locomotion contact or flight information was annotated relative to this peak. During walking, a contact occurs when the front foot is close to or touching the ground, whereas one leg passing the other represents a flight (see Fig. 3). For jogging, flight occurs when both legs are in the air, and contact occurs when at least one foot is on the ground.

Based on our empirical observations, we hypothesized that individuals would use more gestures when standing than when walking or jogging. Similarly, we expected the range and duration of gestures to be greater during a standing conversation. We also hypothesized that personal style would persist across gaits, and that the peak of a gesture would occur during the contact phase of locomotion.

3.2 Analysis

Frequency and duration of gaze shifts: For each participant, we averaged the number of times they gazed at or away from their conversational partner across all conversation types. An Analysis of Variance (ANOVA) was conducted with factors gait and gaze direction. We found an interaction effect between gait and gaze direction (F(2,12) = 72.834, p ≈ 0). Participants gazed at their conversational partners more often while standing than while walking or jogging. We also conducted an ANOVA with dependent variable gaze duration and independent variables gait and gaze direction, which showed an interaction effect (F(2,12) = 17.611, p ≈ 0). With increasing intensity of body motion, the average gaze duration toward the conversational partner decreases. Both of these results suggest that it is harder, and at times not feasible, to initiate and maintain eye contact during physical activities.

Frequency and duration of gestures: An ANOVA was conducted with the percentages of different types of gestures as the dependent variable and within factors gait, conversation type, and type of gesture. The categorical predictor was the sex of the participant. A main effect of gait (F(2,12) = 6.154, p = 0.015) showed that participants gestured significantly less when jogging, and a main effect of conversation type showed that fewer gestures were used during debates than in dominant speaker conversations (F(1,6) = 13.172, p = 0.011). However, an interaction between the sex of the participant and the conversation type (F(1,6) = 6.194, p = 0.047) indicated that the male participants gestured significantly less during debates than dominant speaker conversations, whereas the women gestured equally in both. Previous studies have also found that females gesture more during conversation [22].

A main effect of gesture type (F(3,18) = 14.907, p ≈ 0) indicated that the most common gesture type used was beat, followed by adaptor. There was also an interaction between gesture type and gait (F(6,36) = 2.979, p = 0.007), showing that adaptors were used much more when standing than when walking or jogging. An analysis of the duration of gestures gave an interaction effect between gait and participant sex (F(2,12) = 4.379, p = 0.037), as male gestures were shorter during jogging, whereas the gesture duration for women was the same across gaits.

Gesture peaks: An ANOVA on the number of gesture peaks, with independent variables gait and locomotion phase, found a main effect of locomotion phase (F(1,6) = 268.44, p ≈ 0), where significantly more gesture peaks happened during the contact phase. No main effect of gait was found, suggesting that gesture peaks were distributed similarly across gaits. However, an interaction occurred between locomotion phase and gait (F(1,6) = 9.070, p = 0.024), which shows that the percentage of gesture peaks in the locomotion contact phase is significantly higher than in the flight phase for both gaits.

Elbow bend: As hypothesized, an ANOVA showed a main effect of gait on the gesture space (F(2,12) = 10.563, p = 0.002). During jogging, most gestures were made with the elbow bent by 90 degrees. The average elbow bend while standing and walking was significantly lower, ranging between 0 and 76.5 degrees.

From these results we can conclude that people make fewer and smaller gestures and gaze motions with increasing body motion intensity. We also conducted a short post-experiment survey, in which most participants gave the lowest rating for jogging in response to the question: "How engaged in the conversation do you think your partner was in the following situations?" It seems reasonable to assume that the physical exertion, reduced eye contact and reduced gesturing led to this assessment. We used these results to guide our stylized splicing and synchronization algorithms in order to generate natural conversational behaviors for conversing characters while walking or jogging.

4 Coordinated Gesture and Locomotion

In this section, we present our method for generating natural conversational behavior for pedestrians in motion, given locomotion clips from a variety of actors and standing conversational motion clips with gestures.

We use two existing motion corpora: (i) 19 standing conversations with three male (9) or three female (10) actors, each approximately 160 sec. long, for a total performance time of 8,403 sec. [9]; and (ii) normal walking and jogging motions captured from 16 male and 16 female actors, with varied styles of arm expansion, elbow bend and swing amplitude [23]. All motions are retargeted to a 22-joint, 69-DOF skeleton, and all captured motions are re-sampled to 30 fps to give a common time base.

Naively splicing two such clips generates unnatural results, so the gesture performance needs to be customized before being spliced with the locomotion. Figure 4 illustrates the general work flow: (1) we ensure stylistic consistency between the clips by adjusting the gesture performance to match a particular locomotion style; (2) we temporally micro-synchronize the gesture phase with the locomotion cycle, and simulate arm disturbances resulting from the body's ground interaction; (3) we fully exploit the functions of gesture preparation and retraction for smooth splicing; (4) for conversations with two or more participants, we coordinate their pairings by adding head and torso orientations. Our method builds on previous splicing approaches, such as [14], by handling stylization, synchronization and conversational pairings.

To extract temporal information from the gestures, we performed the following annotation of the gesture databases. Each gesture phrase is temporally composed of preparation, stroke, hold and retraction phases [24, 25, 26], where (gesture → [preparation] [hold] stroke [hold]). The main meaning of the gesture is carried in the stroke phase [26, 27]. The preparation phase places the arm, wrist, hand and fingers in the proper configuration to begin the stroke [26]. The retraction phase returns the arm to a resting position. During the annotation, we label the timing of gesture phrases and phases ($t_{Pb}$, $t_{Pe}$, $t_{Sb}$, $t_{Se}$, $t_{Rb}$ and $t_{Re}$), corresponding to the beginning and end of Preparation, Stroke and Retraction. The annotated gesture types are based on the taxonomy proposed by McNeill [21], as in the ground truth study. We also annotate gesturing handedness and the addresser/addressee in the group.
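These timing labels are what the splicing stage (Sec. 4.3) consumes. As a minimal illustrative structure, assuming frame indices at the 30 fps common time base (the class and field names are ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class GesturePhrase:
    """Phase timings for one gesture phrase, mirroring the labels
    tPb/tPe, tSb/tSe, tRb/tRe used in the paper (frame indices at 30 fps)."""
    t_pb: int           # preparation begin
    t_pe: int           # preparation end
    t_sb: int           # stroke begin
    t_se: int           # stroke end
    t_rb: int           # retraction begin
    t_re: int           # retraction end
    gesture_type: str   # McNeill taxonomy label, e.g. 'beat'
    handedness: str     # 'left', 'right' or 'both'
    addresser: int      # id of the addresser within the conversational group
```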

To extract the locomotion tempo, we use a standard breakdown of locomotion into four phases, as previously mentioned in Section 3.1 (see Fig. 3). Typically, the root altitude increases during flight and decreases during contact. All our locomotion data clips are between 1.5 and 2 seconds long, start from the left contact phase, consist of two full locomotion cycles, and can be seamlessly looped to any given length.

4.1 Stylization

As mentioned before, naively splicing gestures onto locomotion clips produces unrealistic results, as it does not take into account the differences between gestures while standing or in motion. For example, low gestures with straight elbows are common for standing characters, but are unnatural in a jogging situation, as our ground truth analysis shows that most jogging gestures are made with the elbow bent to around 90 degrees. Furthermore, variation in the styles of locomotion should be transferred to the gesture style. For example, pedestrians with a larger arm swing are more likely to perform broader gestures. The goal of our stylization process is to adjust gestures to be consistent with the locomotion arm shape and swing. We consider gestures to be auxiliary actions on base motions like standing or jogging, and use the statistics of the base motions to offset the gestures.

Given the locomotion clip $M_L$, we compute its mean arm pose $B_L$ as the base, including the shoulder, elbow and wrist DOFs. Similarly, for the gesture performance $M_G$, we use its rest pose as the base $B_G$. $B_L$ and $B_G$ thus reflect the overall correspondence between the arms and the torso (see Fig. 5). The difference between the base poses, $B_D = B_L - B_G$, is then used to adjust the original gesture motion $M_G$. Gestures are extracted from the standing character as an offset from the base pose and then layered onto the base pose of the desired locomotion clip, which generates $M'_G = M_G + B_D$.

To incorporate the dynamic features of the locomotion arm swing, we compute the standard deviation of each arm DOF $d_i$ in the locomotion clip. Arm rotations in $M'_G$ are constrained within $\pm c_i \cdot \mathrm{STD}(d_i)$, where $c_i$ is a user-specified constant, typically three. Joint rotations exceeding this active range are linearly rescaled. To avoid altering the pointing direction, stylization is not applied to deictic gestures.
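A minimal sketch of this stylization step, assuming arm DOF angles are stored as per-frame NumPy arrays and that the active range is centered on the locomotion base pose (that centering, and all names here, are our assumptions rather than the paper's implementation):

```python
import numpy as np

def stylize_gesture(gesture_arm, loco_arm, rest_pose, c=3.0):
    """Offset a standing gesture onto a locomotion base pose and constrain it
    to the locomotion arm-swing statistics (Sec. 4.1, sketch).

    gesture_arm : (F_g, D) arm DOF angles of the gesture performance M_G
    loco_arm    : (F_l, D) arm DOF angles of the locomotion clip M_L
    rest_pose   : (D,)     standing rest pose B_G of the gesture actor
    c           : the user constant c_i scaling the active range (typically 3)
    """
    b_l = loco_arm.mean(axis=0)        # locomotion base pose B_L (mean arm pose)
    b_d = b_l - rest_pose              # base-pose difference B_D = B_L - B_G
    styled = gesture_arm + b_d         # M'_G = M_G + B_D

    # Constrain each DOF within +/- c * STD(d_i); DOFs whose excursion
    # exceeds the active range are linearly rescaled to fit it.
    limit = c * loco_arm.std(axis=0)
    offset = styled - b_l
    peak = np.abs(offset).max(axis=0)
    scale = np.where(peak > limit, limit / np.maximum(peak, 1e-9), 1.0)
    return b_l + offset * scale
```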

4.2 Synchronization

Temporally, a gesture has its preparation, stroke, hold and retraction phases, while locomotion repeats its flight/contact cycles with a certain tempo. From the ground truth study, we found that the two are linked, in that significantly more stroke peaks happen during contact phases. Pedestrians are therefore likely to align their stroke peaks to the locomotion contact phase. Some stroke emphases are actually caused by the arms being shaken by ground impact during locomotion, which produces an effect of gesturing to the tempo of the locomotion cycle. We synthesize this effect to make the gesture performance more realistic during locomotion, especially for joggers.

Alignment: Unlike previous splicing research that uses Dynamic Time Warping (DTW) alignment [14] or velocity based synchronization [16] to align arm motion with the locomotion, we align gestures based on the timing of the utterance and the locomotion phase. The gesture 'synchrony rules' referred to in [21] indicate that gesture strokes have been observed to consistently end at or before, but never after, the prosodic stress peak of the accompanying syllable. User studies in [28, 29] have indicated that gestures performed 0.2 to 0.6 seconds earlier with respect to the accompanying utterance are rated highly for their naturalness. This provides us with an exact time window for gesture alignment: for a given gesture with a single stroke peak that does not already fall on a locomotion contact phase at its utterance-based timing, we search up to 0.6 seconds earlier to find the first contact phase point, and align the stroke peak with it. We do not re-align gestures with multiple stroke peaks, to avoid conflicts in alignment and to preserve the original time gaps between the peaks.
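A sketch of that alignment rule, assuming contact-phase points are given as a sorted list of timestamps (the function and parameter names are illustrative):

```python
def align_stroke_peak(peak_t, contact_times, window=0.6):
    """Return the time shift (in seconds, <= 0) that moves a single stroke
    peak onto the nearest earlier locomotion contact point (Sec. 4.2, sketch).

    peak_t        : stroke-peak time implied by the utterance
    contact_times : sorted contact-phase timestamps of the locomotion clip
    window        : how far back to search, matching the 0.2-0.6 s lead
                    found acceptable in prior perceptual studies
    """
    candidates = [t for t in contact_times if peak_t - window <= t <= peak_t]
    if not candidates:
        return 0.0                   # no contact point in the window: leave the gesture as-is
    return max(candidates) - peak_t  # shift the peak back onto the contact point
```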

Synthesis: During the contact phase of locomotion, the body hits the ground and suddenly changes its velocity. It is unlikely that a person could hold their arms as steady in this condition as a standing character could, especially when jogging. Some stroke emphases are actually caused by the arms shaking as a result of locomotion, an effect that varies with arm firmness and with the vigor of the jogging. To synthesize arm shake to the beat of the locomotion, we use the motion of the root to adjust the elbow: using $R'_{elbow} = R_{elbow} + k \cdot \Delta H_{root}$, we layer the influence of the root height change $\Delta H_{root}$ due to ground impact on top of the original elbow rotation $R_{elbow}$. For a walking motion, $\Delta H_{root}$ is negligible, but for jogging motions, $\Delta H_{root}$ is large and the elbow bounce is very obvious. Here $k$ is an adjustable parameter that rescales the height change into the elbow rotation space; by increasing $k$, we can synthesize jogging on a bumpy road with loose arms.
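A sketch of this bounce layer, reading $\Delta H_{root}$ as the per-frame change in root height (one plausible interpretation; the paper does not spell it out), with illustrative names:

```python
import numpy as np

def add_elbow_bounce(elbow_rot, root_height, k=1.0):
    """Layer ground-impact arm shake onto the elbow (Sec. 4.2):
    R'_elbow = R_elbow + k * dH_root.

    elbow_rot   : (F,) elbow rotation per frame (degrees)
    root_height : (F,) root joint height per frame
    k           : rescales root height change into elbow-rotation space;
                  a larger k gives looser, bouncier arms
    """
    delta_h = np.diff(root_height, prepend=root_height[0])  # per-frame height change
    return elbow_rot + k * delta_h
```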

4.3 Splicing

The goal of our splicing method is to add the gesture performance $M_G$ to the locomotion clip $M_L$ given the gesture timing, and to generate an output spliced motion $M_S$ that naturally combines the two. We further segment the skeleton into torso, lower body, left arm and right arm. A full-body motion sequence $M^{fb}$ can thus be described by the union of the motion of its four parts: left arm $M^{la}$, right arm $M^{ra}$, torso $M^{ts}$ and lower body $M^{lb}$, where each part should maintain close correlation with the others.

$$M^{lb}_S = M^{lb}_L, \qquad M^{ts}_S = M^{ts}_L, \qquad t \in [0, N] \tag{1}$$

As the lower-body motion is the dominant factor in locomotion, and the torso swivels to its tempo, we preserve $M^{lb}_L(t)$ and $M^{ts}_L(t)$ throughout time $t \in [0, N]$ (Eq. 1). The stroke and hold phases carry the semantics of the gesture, thus $M^{la}_G(t)$ and $M^{ra}_G(t)$ are preserved in the spliced motion from the beginning of the stroke $t_{Sb}$ to its end $t_{Se}$.

$$M^{arm}_S = \begin{cases} M^{arm}_L, & t \notin [t_{Pb}, t_{Re}] \\ M^{arm}_G, & t \in [t_{Sb}, t_{Se}] \\ \mathrm{slerp}\!\left(M^{arm}_L, M^{arm}_G, \frac{t - t_{Pb}}{t_{Pe} - t_{Pb} + 1}\right), & t \in [t_{Pb}, t_{Pe}] \\ \mathrm{slerp}\!\left(M^{arm}_L, M^{arm}_G, \frac{t - t_{Rb}}{t_{Re} - t_{Rb} + 1}\right), & t \in [t_{Rb}, t_{Re}] \end{cases} \tag{2}$$

Our method takes the gap between the beginning of the gesture preparation $t_{Pb}$ and the end of the preparation $t_{Pe}$, and applies spherical linear interpolation (slerp) to the arm joint rotations to transition from the locomotion swing to the gesture performance. Similarly, slerp is applied during the gap between the beginning of the gesture retraction $t_{Rb}$ and its end $t_{Re}$ to transition the gesture performance back to the locomotion swing (Eq. 2).
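A per-frame sketch of Eq. 2, assuming the arm channels are stored as per-joint unit quaternions; note that during retraction we blend from gesture back to locomotion, matching the transition described in the text (the helper names are ours):

```python
import numpy as np

def quat_slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions q0, q1 (shape (4,))."""
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                      # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:                   # nearly parallel: fall back to normalized lerp
        q = q0 + u * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - u) * theta) * q0 + np.sin(u * theta) * q1) / np.sin(theta)

def splice_arm(loco, gest, t_pb, t_pe, t_sb, t_se, t_rb, t_re):
    """Arm-channel splice following Eq. 2. `loco` and `gest` are (F, J, 4)
    arrays of per-joint quaternions; the t_* arguments are frame indices."""
    out = loco.copy()
    frames, joints, _ = loco.shape
    for t in range(frames):
        if t_pb <= t <= t_pe:          # preparation: blend locomotion -> gesture
            u = (t - t_pb) / (t_pe - t_pb + 1)
            out[t] = [quat_slerp(loco[t, j], gest[t, j], u) for j in range(joints)]
        elif t_rb <= t <= t_re:        # retraction: blend gesture -> locomotion
            u = (t - t_rb) / (t_re - t_rb + 1)
            out[t] = [quat_slerp(gest[t, j], loco[t, j], u) for j in range(joints)]
        elif t_pb <= t <= t_re:        # stroke and holds: keep the gesture arms
            out[t] = gest[t]
        # else: outside [t_pb, t_re], the locomotion arm swing is kept
    return out
```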

5 Interaction

Using the motion splicing and gesture stylization methods described above, we are able to synthesize multiple gesturing characters in locomotion. To make these characters appear plausibly engaged in a conversation, head and torso orientations need to be added to generate appropriate gaze behavior. This is done based on an 'addresser-addressee relationship' (AAR) that defines the conversational interaction, where the addresser gazes toward the addressee. This AAR specification includes high level information such as labeling the addresser, the addressee and the timing of the gaze behavior. The system supports multiple ways of generating the AAR, including random specification, manual specification, respecting the AAR from the original motion captured group conversation, or generating it from the statistics of our ground truth study.

Gaze Generation: Gaze is implemented by first dynamically retrieving the positions of the addresser and addressee in the scene at the specified time, and then computing the angle $\theta_{yaw}$ that would fully rotate one character's head, in the horizontal plane, to look at the other. Since gaze also involves eye movement, a complete head rotation is not always necessary. We use a distribution to determine the torso yaw angle as $r \cdot \theta_{yaw}$, where $r$ is randomly chosen from $[0.6, 1]$. The rotation is implemented with a combination of spine and neck DOFs. If the addressee is in front of or behind the addresser by more than a threshold distance (one meter in the current implementation), a small adjustment of forward/backward lean of up to 15° is added to the spine joint of the addresser. After stylized splicing, $M^{torso}_{Spliced}$ is directly derived from the locomotion $M_{Loco}$, which preserves the natural motion of the torso, timed to the locomotion tempo. The newly synthesized AAR head and torso orientations are layered on top of this torso movement.
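A sketch of the yaw computation, working in the horizontal (x, z) plane; the vector convention and function names are our own assumptions:

```python
import numpy as np

def gaze_yaw(addresser_pos, addresser_fwd, addressee_pos, rng=np.random):
    """Yaw needed for the addresser to gaze at the addressee (Sec. 5, sketch).

    addresser_pos, addressee_pos : (2,) horizontal-plane (x, z) positions
    addresser_fwd                : (2,) addresser's current facing direction
    Returns the applied yaw r * theta_yaw (radians), to be distributed over
    spine and neck DOFs; eye movement covers the remainder of the turn.
    """
    to_target = addressee_pos - addresser_pos
    theta = (np.arctan2(to_target[0], to_target[1])
             - np.arctan2(addresser_fwd[0], addresser_fwd[1]))
    theta = (theta + np.pi) % (2 * np.pi) - np.pi   # wrap theta_yaw to [-pi, pi]
    r = rng.uniform(0.6, 1.0)                       # r drawn from [0.6, 1]
    return r * theta
```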

AAR Specification: There are several ways to generate the AAR. Firstly, it may be inferred from the motion clip by assuming that the character performing a gesture is the addresser and the others are addressees. Gesture timing is used for the head and torso rotation, whereby the gaze direction is reached at the start of the stroke and returns to neutral with the retraction. This method allows an originally captured standing group conversation to be transformed into one with locomotion, or any gesture specification to generate gaze behavior. Alternatively, we allow a user to fully author the AAR, giving complete control at the cost of some added up-front labor. This flexible specification can help to pair participants from different conversations in the database to simulate a new group conversation. Our system can also randomly select addressers and addressees and pair them into AARs. This method facilitates simulating plausible crowd conversations during locomotion, with minimal user intervention.

Synthesize AAR from Ground Truth: Finally, our ground truth data may be leveraged to create more complex and realistic gaze behavior. Data analysis suggests that both the addresser and the addressee will gaze at and away from each other during the conversation, and that the duration of this behavior is not necessarily the same as the duration of a gesture. We analyzed the duration of the gaze-at and gaze-away behavior of each of our subjects for both walking and jogging and used this in some of our experiments. Gaze behavior during the conversation is thus determined for each addresser and addressee based on the statistics of the subject model assigned to them (selected from the ground truth data). When a particular behavior is chosen, say gaze-at, it is assigned a duration based on the subject model, with a small random variation. If subsequent selections of the same gaze behavior would exceed a total duration greater than the average duration plus one standard deviation, the gaze behavior is forced to switch to the opposite type. We experimented with different sampling strategies, and this method was found to generate a pattern of gaze-at and gaze-away behavior that appeared natural and non-repetitive.
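A sketch of that sampling loop under our reading of the switching rule (accumulated run length capped at mean plus one standard deviation); the statistics structure and names are illustrative:

```python
import random

def sample_gaze_schedule(total_time, stats, jitter=0.1, rng=random):
    """Alternating gaze-at / gaze-away schedule from a subject model (Sec. 5).

    stats : maps 'at'/'away' to (mean, std) durations in seconds, taken
            from the ground-truth annotation of the assigned subject
    Returns a list of (behavior, start_time, duration) tuples.
    """
    schedule, t = [], 0.0
    run = {"at": 0.0, "away": 0.0}            # accumulated duration of the current run
    while t < total_time:
        behavior = rng.choice(["at", "away"])
        mean, std = stats[behavior]
        if run[behavior] >= mean + std:       # run exceeded mean + 1 std: force a switch
            behavior = "away" if behavior == "at" else "at"
            mean, std = stats[behavior]
        duration = mean * (1.0 + rng.uniform(-jitter, jitter))  # small random variation
        run[behavior] += duration
        run["away" if behavior == "at" else "at"] = 0.0         # the other run resets
        schedule.append((behavior, t, duration))
        t += duration
    return schedule
```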

6 Results and Applications

To evaluate the effectiveness of our method, we apply it in a number of different scenarios. Please see the supplemental videos for the full animated results.

Stylized Splicing: To demonstrate the advantage of stylized splicing, we select five distinct jogging styles, detailed in Table 1. We experiment with different gesture types, varied wrist positions and stroke amplitudes.

Fig. 6 compares the results of splicing a low gesture from a standing posture onto five different jogging motions. As mentioned before, low gestures with straight elbows are common for standing characters; however, splicing them directly into a locomotion clip can generate unrealistic artifacts, as the base pose of jogging is quite different from the standing rest pose. In Fig. 6(a), naïve splicing not only produces an inconsistent straight-arm configuration in the middle of a jogging arm swing, it also generates an identical gesture performance for the different jogging styles. Fig. 6(b) demonstrates how stylized splicing effectively fixes these problems by transferring the jogging base poses, resulting in a more believable gesture performance for the jogging locomotion. Jogger J1 has a large elbow bend, small arm expansion and a positive swing, so the spliced gesture is also performed high and narrow in front of the body. J2 has a larger arm expansion during the jogging swing, so the gesture is also wider. J3, J4 and J5 have less elbow bend, with different variances: their elbows are straighter when performing the gesture, but J4 has a narrower arm expansion for the gesture, and J5 preserves his asymmetric style.

Conversation Simulation: Based on the AAR information from the gesture database annotation, our method converts the motion captured standing conversations into locomoting ones. Figure 1 shows the same group conversation in different locomotion conditions. Head and torso orientation is calculated according to the new position of the addressee in the scene. For this experiment, we used gaze timing profiles extracted from the ground truth study.

Our method can also simulate conversational relationships that vary from the original motion captured group structures. Figure 7 shows a simulated group conversation, using the same gesture and locomotion input as above, but in this case, no AAR specification is necessary from the user. By default, we randomly pick one addressee in the audience and pair it with the addresser to establish their conversational relationship. Cross-group communication different from the original motion capture data is highlighted. We allow further specification from the user for detailed control.

7 Conclusion

We have presented a novel method for generating conversational gestures for virtual pedestrians. Animators can fully reuse existing clips of locomotion and standing conversations. "Stylized splicing" flexibly adjusts gesture behavior in time and space to the locomotion style. Using the AAR specification, virtual pedestrians can be dynamically paired into conversational groups, which allows the simulation of crowd conversations. Our ground truth data can also serve as a solid reference for animators generating gestures for pedestrians.

Currently, gesture performance and head/torso orientation ground truth is extracted from videos. In the future, analyzing motion capture data of conversations during locomotion could help to quantify the data more precisely. Our method is capable of splicing gestures into any synthesized locomotion, so integrating the technique with a general motion graph would support a larger variety of scenes and different locomotion paths. Given the extracted ground truth information, a motion retrieval algorithm could be used with our method to efficiently search the gesture database and find the most suitable performance for a given sequence. We hope that our work can contribute to a new body of research on gesture synthesis for a wide set of naturalistic activities.

References

[1] S. Kopp and I. Wachsmuth. Synthesizing multimodal utterances for conversational agents. Computer Animation and Virtual Worlds, 15(1):39–52, 2004.

[2] S. Levine, C. Theobalt, and V. Koltun. Real-time prosody-driven synthesis of body language. ACM Transactions on Graphics (TOG), 28(5), 2009.

[3] S. Levine, P. Krahenbuhl, S. Thrun, and V. Koltun. Gesture controllers. ACM Transactions on Graphics (TOG), 29(4):124, 2010.

[4] M. Neff, M. Kipp, I. Albrecht, and H. Seidel. Gesture modeling and animation based on a probabilistic re-creation of speaker style. ACM Transactions on Graphics (TOG), 27(1), 2008.

[5] M. Stone, D. DeCarlo, I. Oh, C. Rodriguez, A. Stere, A. Lees, and C. Bregler. Speaking with hands: Creating animated conversational characters from recordings of human performance. ACM Transactions on Graphics (TOG), 23(3):506–513, 2004.

[6] J. Cassell, H.H. Vilhjalmsson, and T. Bickmore. BEAT: The behavior expression animation toolkit. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '01, pages 477–486. ACM, 2001.

[7] S. Marsella, J. Gratch, and J. Rickel. Expressive behaviors for virtual worlds. In Life-Like Characters, pages 317–360. Springer, 2004.

[8] D. Bohus and E. Horvitz. Facilitating multiparty dialog with gaze, gesture, and speech. In International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, pages 5:1–5:8. ACM, 2010.

[9] C. Ennis, R. McDonnell, and C. O'Sullivan. Seeing is believing: Body motion dominates in multisensory conversations. ACM Transactions on Graphics (TOG), 29(4):91:1–91:9, 2010.

[10] L. Kovar, M. Gleicher, and F. Pighin. Motion graphs. ACM Transactions on Graphics (TOG), 21(3):473–482, 2002.

[11] A. Safonova and J.K. Hodgins. Construction and optimal search of interpolated motion graphs. ACM Transactions on Graphics (TOG), 26(3), 2007.

[12] M. Gleicher, H. Shin, L. Kovar, and A. Jepsen. Snap-together motion: assembling run-time animations. In ACM SIGGRAPH 2008 Classes, SIGGRAPH '08, pages 52:1–52:9. ACM, 2008.

[13] A. Fernandez-Baena, M. Antonijoan, R. Montano, A. Fuste, and J. Amores. BodySpeech: A configurable facial and gesture animation system for speaking avatars. In Proceedings of the International Conference on Computer Graphics and Virtual Reality (CGVR), page 3, 2013.

[14] R. Heck, L. Kovar, and M. Gleicher. Splicing upper-body actions with locomotion. Computer Graphics Forum, 25(3):459–466, 2006.

[15] P. Luo and M. Neff. A perceptual study of the relationship between posture and gesture for virtual characters. In Motion in Games, pages 254–265. Springer Berlin Heidelberg, 2012.

[16] C. Mousas, P. Newbury, and C. Anagnostopoulos. Splicing of concurrent upper-body motion spaces with locomotion. Procedia Computer Science, 25:348–359, 2013.

[17] K. Tamada, S. Kitaoka, and Y. Kitamura. Splicing motion graphs: Interactive generation of character animation. In Short Papers of Computer Graphics International, 3, 2010.

[18] W. Ng, C. Choy, D. Lun, and L. Chau. Synchronized partial-body motion graphs. In ACM SIGGRAPH ASIA 2010 Sketches, pages 28:1–28:2. ACM, 2010.

[19] N. Al-Ghreimil and J.K. Hahn. Combined partial motion clips. In WSCG'03, 2003.

[20] A. Majkowska, V.B. Zordan, and P. Faloutsos. Automatic splicing for hand and body animations. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 309–316, 2006.

[21] D. McNeill. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, 1992.

[22] N.J. Briton and J.A. Hall. Beliefs about female and male nonverbal communication. Sex Roles, 32(1-2):79–90, 1995.

[23] L. Hoyet, K. Ryall, K. Zibrek, H. Park, J. Lee, J.K. Hodgins, and C. O'Sullivan. Evaluating the distinctiveness and attractiveness of human motions on realistic virtual bodies. ACM Transactions on Graphics (TOG), 32(6):204:1–204:11, 2013.

[24] D. Efron. Gesture and Environment. King's Crown Press, 1941.

[25] A. Kendon. Gesticulation and speech: Two aspects of the process of utterance. The Relationship of Verbal and Nonverbal Communication, 25:207–227, 1980.

[26] S. Kita, I. Van Gijn, and H. Van der Hulst. Movement phases in signs and co-speech gestures, and their transcription by human coders. In Gesture and Sign Language in Human-Computer Interaction, pages 23–35. Springer, 1997.

[27] D. McNeill. Gesture and Thought. University of Chicago Press, 2005.

[28] C. Kirchhof. On the audiovisual integration of speech and gesture. Int. Soc. Gesture Studies (ISGS), 2012.

[29] Y. Wang and M. Neff. The influence of prosody on the requirements for gesture-text alignment. In Intelligent Virtual Agents (IVA), pages 180–188, 2013.

Figure 1: Standing conversation (l) converted to walking (m) and jogging (r) groups.

Figure 2: Two participants walking

Figure 3: Locomotion Cycle: jogging


Figure 4: Overall work flow for performing stylized splicing and synchronization of conversing characters when walking or jogging.

Figure 5: Base arm poses for standing and 5 jogging styles (Table 1).

ID   Expansion (y)   Swing (z)      Bend (z)
J1   13.8 ± 1.1      27.3 ± 4.5     113.8 ± 3.1
J2   30.6 ± 3.5      −6.4 ± 13.7    109.9 ± 6.3
J3   21.5 ± 0.9      −3.5 ± 14.6    82.7 ± 11.8
J4   9.9 ± 3.1       −3.0 ± 8.6     68.4 ± 7.5
J5   22.5 ± 1.8      −2.9 ± 13.9    74.7 ± 13.9

Table 1: Quantized jogging styles (approx. mean and stdev in degrees).

(a) Without stylization, the spliced arm strokes look stiff and inconsistent.

(b) Gestures consistent with locomotion styles

Figure 6: Comparison demonstrating the effect of gesture stylization.

Figure 7: Simulated large group of joggers conversing with random AARs: original (blue), cross-group (red).
