Int J Comput Vis, DOI 10.1007/s11263-016-0888-3

Pictorial Human Spaces: A Computational Study on the Human Perception of 3D Articulated Poses

Elisabeta Marinoiu² · Dragos Papava² · Cristian Sminchisescu¹,²

Received: 14 August 2015 / Accepted: 11 February 2016. © Springer Science+Business Media New York 2016

Abstract  Human motion analysis in images and video, with its deeply inter-related 2D and 3D inference components, is a central computer vision problem. Yet, there are no studies that reveal how humans perceive other people in images and how accurate they are. In this paper we aim to unveil some of the processing—as well as the levels of accuracy—involved in the 3D perception of people from images by assessing the human performance. Moreover, we reveal the quantitative and qualitative differences between human and computer performance when presented with the same visual stimuli and show that metrics incorporating human perception can produce more meaningful results when integrated into automatic pose prediction algorithms. Our contributions are: (1) the construction of an experimental apparatus that relates perception and measurement, in particular the visual and kinematic performance with respect to 3D ground truth when the human subject is presented an image of a person in a given pose; (2) the creation of a dataset containing images, articulated 2D and 3D pose ground truth, as well as synchronized eye movement recordings of human subjects, shown a variety of human body configurations, both easy and difficult, as well as their 're-enacted' 3D poses; (3) quantitative analysis revealing the human performance in 3D pose re-enactment tasks, the degree of stability in the visual fixation patterns of human subjects, and the way it correlates with different poses; (4) extensive analysis on the differences between human re-enactments and poses produced by an automatic system when presented with the same visual stimuli; (5) an approach to learning perceptual metrics that, when integrated into visual sensing systems, produces more stable and meaningful results.

Keywords  Human pose estimation · Human perception · Metric learning · Perceptual distance · Human eye movements

Communicated by Deva Ramanan.

Corresponding author: Cristian Sminchisescu, [email protected]

Elisabeta Marinoiu, [email protected]
Dragos Papava, [email protected]

¹ Department of Mathematics, Lund University, Lund, Sweden
² Institute of Mathematics of the Romanian Academy, Bucharest, Romania

    1 Introduction

When shown a photograph of a person, humans have a vivid, immediate sense of 3D pose awareness, and a rapid understanding of the subtle body language, personal attributes, or intentionality of that person. How can this happen and what do humans perceive? How is such paradoxical monocular stereoscopy possible? Are the resulting percepts accurate in an objective, veridical sense, or are they an inaccurate, possibly stable by-product of extensive prior interaction with the world, modulated by sensations acquired through the selective visual observation of the photograph? The distinction between the regular 3D space we move in and the 3D space perceived when looking into a photograph—the pictorial space—has been introduced and beautifully studied by Koenderink (1998) for rigid objects through the notion of pictorial relief. In this paper we aim to explore the concept for the case of articulated human structures, motivating our pictorial human space terminology.¹ We develop methodology aimed to measure the human uncertainty in observing and reproducing a 3D human pose, given images, and for learning human perceptual metrics that adequately reflect it, in automatic pose estimation systems.

Three-dimensional human pose estimation in natural environments has been a subject of constant interest, more recently with the advent of sophisticated RGB and RGB-D sensors. Significant progress has been achieved by carefully integrating complex feature representations and large-scale learning methods (Bo and Sminchisescu 2009, 2010; Sigal and Black 2006; Sigal et al. 2010b; Shotton et al. 2011) as well as measurement-based optimization in multi-camera setups (Sidenbladh et al. 2000; Gall et al. 2010; Huang et al. 2013). Most if not all of these methods aim to accurately predict human poses as truthfully reflected in small Euclidean distances between the 3D joint positions (or joint angles) of a predicted human skeleton and the ground truth. While such metrics make sense from a computational point of view, and low errors would indeed correspond to good predictions, several issues have become apparent over time: (1) for general, complex poses and motions, it remains difficult to achieve Euclidean errors under 100 mm per joint on average; while these may not seem large, often many poses from that error distribution look meaningless perceptually, and (2) not all tasks require the same levels of accuracy, so it would be useful to be able to develop techniques to adapt the predictive metric for a desired task complexity.

Our approach to establish the observation-perception link is to make humans re-enact the 3D pose of another person (for which ground truth is available), shown in a photograph, following an exposure time. Simultaneously, our setup allows the measurement of human pose and eye movements during the 'pose matching' performance. The overall error therefore comprises observation, memory and re-enactment components. As the poses are taken from everyday activities, however, their reproduction should not put subjects in difficulty as far as the ability to articulate a pose is concerned.

The contribution of our work can be summarized as follows: (1) the construction of an apparatus relating the human visual perception (re-enactment as well as eye movement recordings) with 3D ground truth; (2) the creation of a dataset collected from 14 subjects (7 female and 7 male) with different levels of control skill, containing 120 images of humans in different poses, both easy and difficult, available online at http://vision.imar.ro/percept3d; (3) quantitative analysis of human eye movements, 3D pose re-enactment performance, error levels, stability, correlation as well as cross-stimulus control, in order to reveal how different 3D configurations relate to the subject focus on certain features in images, in the context of the given task; (4) extensive analysis on human re-enactments versus automatic pose estimations on the same stimuli; and (5) an approach to learn perceptual metrics that, when integrated into automatic state-of-the-art structured-prediction systems, produce considerably more stable and meaningful results compared to the plain Euclidean counterparts used in the literature. An earlier, considerably reduced version of our work has appeared in Marinoiu et al. (2013).

¹ No connection with pictorial structures (Sapp et al. 2010; Fischler and Elschlager 1973)—2D tree-structured models for object detection.

    2 Related Work

In this section we review two related research directions: pose estimation from monocular images and learning human-inspired perceptual metrics.

    2.1 Human Pose Estimation from Monocular Images

The problem of human pose estimation from static images has received significant attention in computer vision, both in the 2D (Ferrari et al. 2009; Sapp et al. 2010; Yang and Ramanan 2011) and the 3D case (Deutscher et al. 2000; Sigal et al. 2007; Sminchisescu and Triggs 2003; Urtasun et al. 2005; Agarwal and Triggs 2006; Gall et al. 2010; Bo and Sminchisescu 2010; Andriluka et al. 2010). The 2D case is potentially easier, but occlusion and foreshortening challenge the generalization ability of 2D models. For 3D inference, different models as well as image features have been explored, including joint positions (Lee and Chen 1985; Bourdev et al. 2010), edges and silhouettes (Sminchisescu and Triggs 2003; Gall et al. 2010) or intermediate labeling of 2D body parts (Ionescu et al. 2014a). Recent studies focused, as well, on inferring human attributes extracted from 3D pose (Sigal et al. 2010a) and on analyzing perceptual invariances based on 3D body shape representations (Sekunova et al. 2013). Akhter and Black (2015) have obtained encouraging results in 3D pose estimation given 2D joint locations by learning pose-dependent joint angle limits. These were obtained with the help of trained athletes and gymnasts who were able to perform an extensive variety of stretching poses. Other approaches leverage the power of deep convolutional neural networks for pose estimation both in 2D and 3D. Although deep learning methods have proven successful in image classification and object detection, their application to human pose estimation is not straightforward due to the flexible structure of the body, the required precision and the ambiguities in predicting the pose (Jain et al. 2014). In Toshev and Szegedy (2014) the authors propose a holistic approach that uses a cascade of deep neural networks. They start with an initial pose estimation based on the whole image and then refine the joint predictions using high-resolution sub-images. A combination of local appearance and holistic views of each local part is proposed in Fan et al. (2015). They use both part patches and body patches within a convolutional neural network, as well as a method to combine the joint localization results for different parts. Other methods propose hybrid approaches that integrate data-driven deep convolutional neural networks with the expressiveness of graphical models (Chen and Yuille 2014; Tompson et al. 2014). These approaches focus on 2D pose estimation from monocular images, although the more difficult 3D case has been tackled in Li and Chan (2014), where the authors propose a multi-task network for both regressing body joint locations and for joint detection.

It is well understood that the problem of geometrically inferring a skeleton from monocular joint positions, the problem of fitting a volumetric model to image features by non-linear optimization (Dickinson and Metaxas 1994), and the problem of predicting poses from a large training data corpus based on image descriptors are under-constrained given our present savoir faire. These produce either discrete sets of forward-backward ambiguities for known limb lengths (Lee and Chen 1985; Sminchisescu and Triggs 2003; Sminchisescu and Jepson 2004; Rehg et al. 2003) (continuous non-rigid affine folding for unknown lengths), or multiple solutions due to incorrect alignment or out-of-sample effects (Deutscher et al. 2000; Sminchisescu and Triggs 2003). 3D human pose ambiguities from monocular images may not be unavoidable: better models and features, a subtle understanding of shadows, body proportions, clothing or differential foreshortening effects may all reduce uncertainty. The question remains whether such constraints can be reliably integrated towards metrically accurate monocular results, and how they relate to human performance.

Studies like Lee and Chen (1985), Sminchisescu and Triggs (2003), Sminchisescu and Jepson (2004), and Rehg et al. (2003) can be viewed as revealing sets of ambiguities for mathematical models (articulated representations, perspective or affine projection) under relatively simple observables (joint positions, edge or silhouette features). The current study is complementary, aiming to reveal perceptual (human) pose estimation uncertainty under complex observables (real images). We take an experimental perspective, aiming at better understanding what humans are able to do, how accurately, and where they are looking when recognizing a 3D pose. Such insights can have implications for learning and evaluation methods in defining metrics that better capture the semantics of a human pose, and would support quantitative analysis (and training data) for defining more meaningful targets and levels of uncertainty for the operation of computer vision methods in similar tasks.

While this work focuses on experimental human sensing in monocular images, the moving light display setup of Johansson (1973) is worth mentioning as a milestone in emphasizing the sufficiency of dynamic minimalism with respect to human motion perception. Yet in that case, as for static images, the open question remains on how such vivid dynamic percepts relate to the veridical motion and how stable across observers they are. Our paper focuses mostly on analysis from a computer vision perspective but links with the broader domain of sensorimotor learning for redundant systems, under non-linearity, delays, uncertainty and noise (Wolpert et al. 2011). We are not aware, however, of a study similar to ours, nor of an apparatus connecting real images of people, eye movement recordings and 3D perceptual pose data, with multiple subject ground truth, as we propose.

    2.2 Learning Human Perceptual Metrics

In order to learn a metric that incorporates human perception of poses, previous work relies on asking people to rank the degree of similarity of two poses (Harada et al. 2004). They consider several measures based on both angular and joint position representations, and learn a weighted Euclidean distance over joints in order to maximize the correlation coefficient between the humans' perception of pose similarity and the proposed metric. Chen et al. (2009) make a relative judgment by assessing which of two poses is more similar to a third one. They define an extensive pool of relational geometric features, and attach weights to them using Adaboost, based on data labeled by humans. Müller et al. (2005) propose to use geometric boolean features, such as whether the right foot lies in front of or behind the plane spanned by the left foot, for efficient content-based retrieval of motion capture data.

The success of these approaches relies extensively on the differences between the poses presented to people, as showing two very different or very similar poses will make the decision trivial and will not provide more information than the standard Euclidean distance between the poses shown and the ground truth. Moreover, testing the proposed metric is done mainly in the context of pose retrieval, whereas we use it to improve pose estimation algorithms. A related domain is that of learning a similarity measure for feature learning in pose estimation (Kanaujia et al. 2007), human motion (Tang et al. 2008) or temporal clustering of human behavior (López-Méndez et al. 2012).

    3 Apparatus for Human Pose Perception

In this section we describe our design of an experimental apparatus that allows linking 3D human pose perception, and 2D feature identification strategies, based on images and the 3D human pose ground truth. The key difficulty in our experimental design is to link a partially subjective phenomenon like the 3D human visual perception with measurement. Our approach is to dress people in a motion capture suit, equip them with an eye tracker and show them images of other people in different poses, which were obtained using motion capture as well (Fig. 1). By asking the subjects to re-enact the poses shown, we can link perception and measurement. The focus of the paper is to investigate how well people can understand and reproduce body configurations shown in images. We were only interested in the body configuration per se, thus we only showed humans that did not interact with the surrounding environment (no manipulation of objects or occlusions by objects in the scene).² We use a state-of-the-art Vicon motion capture system together with a head-mounted, high-resolution mobile eye tracking system. The mocap system tracks a large number of markers attached to the full-body mocap suit worn by a person. Each marker track is then labeled according to its placement on the body relative to a model template. Given the labels, human models are used to accurately compute the location and orientation of each 3D body joint. The mobile eye tracker system maps a person's gaze trajectory on the video captured from its frontal camera. The synchronization between the two cameras is done automatically by the system.

Fig. 1 Illustration of our human pose perception apparatus. a Screen on which the image is projected, as captured by the external camera of the eye tracker. b Result of mapping the fixation distribution on the original high-resolution image, following border detection, tracking and alignment. c Heat map distribution of all fixations of one of our subjects for this particular pose. d Detail of our head-mounted eye tracker and e 3D motion capture setup

    3.1 Experimental Design and Dataset Collection

Subjects and general setup. We first analyze the re-enactment performance of ten subjects, five male and five female, who did not have a medical history of eye problems or mobility impediments. Moreover, their profession did not require above-average neuro-motor skills (as required in the case of dancing, acting or practicing a particular sport). We will refer to this group as the regular subjects. In addition to this, we also analyze the performance of another four subjects, two males and two females, who were all final-year choreography students, focusing on modern and classical ballet. This group will be referred to as the skilled subjects. All the participants were recruited through an agency and had no link with computer vision. Each subject was presented images of people in different poses taken from the Human3.6M (H3.6M) dataset (Ionescu et al. 2014b), and then asked to reproduce the poses seen as well as they could. They were explicitly instructed not to mirror the pose, but to reproduce the left and right sides accordingly. The images were projected on a 1.2 m tall screen located 2.5–3 m away. The eye tracker calibration was done by asking the subject to look at specific points, while the system was recording pupil positions at each point. The calibration points were projected on the same screen used to project images. For mocap we use the standard calibration procedure.

² While asking people to re-enact a certain interaction between a person and the surrounding environment can provide further insight on how people learn and reproduce certain activities, or manipulate objects, this is beyond the scope of the current study.

First Experiment—Limited Exposure of Visual Stimuli. Each subject was required to stand still and look at one image at a time until it disappeared, then re-enact the pose by taking as much time as necessary. We chose to display images for 5 s such that the subjects would have enough time to see the necessary pose detail, while still being a short enough exposure not to run into free viewing. The duration was chosen by first recording two test subjects and showing them images for 3, 5 and 8 s, respectively. Their feedback was that 3 s were too short to view enough detail, while 8 s were more than enough. From an eye tracking video we were interested in the 5 s that captured the subjects' gaze recordings over the image shown.

Fig. 2 Percentage of fixations falling on joints, for each of the 120 poses (easy and hard) shown to subjects. The mean and standard deviation is computed for each pose among the ten subjects. A fixation was considered to fall on a particular joint if this was within 40 px distance from the fixation. On average 54% of the fixations on easy poses and 30% of those for hard poses fell on joints

To be able to analyze the fixation pattern with respect to the joint coordinates of the person shown in the image (as in Fig. 2) and to increase the accuracy of recorded gazes, we mapped fixations that fell onto the image on the screen (captured by the eye tracker/viewer's camera) back to the original high-resolution image (cf. Fig. 1). We created green and blue borders for the original image, for easier detection and tracking later on. First, we evaluated the viewer's camera intrinsic parameters, and corrected for radial distortion in each image of the captured video. Then, we set a threshold in the HSV color space to retrieve the green and blue borders. Instead of directly detecting corners (which might fall off the image due to the subject's subtle head movements), we detected the green and blue borders using a Radon transform, imposing a 90° angle and a known aspect ratio of the image on the screen. Fixations inside the rectangular border of the image on the screen are easily translated into image coordinates.
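To make this mapping step concrete, the sketch below assumes the four corners of the green/blue border have already been recovered in the (distortion-corrected) scene-camera frame; it uses an OpenCV homography rather than the Radon-transform line fit described above, and all function and variable names are ours.

```python
import cv2
import numpy as np

def map_fixations_to_stimulus(border_corners, stim_size, fixations):
    """Map fixations from the eye-tracker scene camera into the
    coordinates of the original high-resolution stimulus image.

    border_corners: 4x2 array of detected screen-image corners in the
                    scene-camera frame (top-left, top-right,
                    bottom-right, bottom-left).
    stim_size:      (width, height) of the original stimulus image.
    fixations:      Nx2 array of fixation locations in the scene camera.
    """
    w, h = stim_size
    stim_corners = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    # Plane-to-plane mapping between the projected image and the original.
    H, _ = cv2.findHomography(np.float32(border_corners), stim_corners)
    pts = np.float32(fixations).reshape(-1, 1, 2)
    mapped = cv2.perspectiveTransform(pts, H).reshape(-1, 2)
    # Keep only fixations that fall inside the stimulus image.
    inside = (mapped[:, 0] >= 0) & (mapped[:, 0] < w) & \
             (mapped[:, 1] >= 0) & (mapped[:, 1] < h)
    return mapped[inside]
```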

Moreover, we synchronized the mocap and the eye tracker systems by detecting the start and end frames of the displayed images in the eye tracker video as well as in the video recorded with the motion capture system (two digital cameras of the mocap system were pointed at the screen as well). During experiments, subjects were dressed in the mocap suit and had the mobile eye tracker head-mounted. For each pose projected on the screen, we captured both the scanpaths and the 3D movement of the subject in the process of re-enacting the pose, once it had disappeared from the screen.

Once the 5 s exposure time had passed, the subject no longer had the possibility to see the image of the pose to re-enact, but had to adjust his position based on the memory of that pose. This time constraint ensures that the subject will mostly look at what is important in understanding and reproducing the pose. Moreover, it makes the process of translating fixations from the video coordinates of the eye tracker to the ones of the image on the screen robust, as there are no rapid head movements or frames where the pose is only partially seen (Fig. 1). We display a total of 120 images, each representing a bounding box of a person, and rescale them to 800 pixels height in order to have the same projected size. The images are mainly frontal. 100 contain easily reproducible standing poses, whereas 20 of them are harder to re-enact as they require sitting on the floor, which often results in self-occlusion. The poses shown were selected from Human3.6M (Ionescu et al. 2014b), from various types of daily activities, and were performed by ten subjects.

Second Experiment—Continuous Exposure of Visual Stimuli. Another experimental approach would be to allow the subjects to observe the image stimulus while adjusting their pose, thus removing a confounding factor due to short-term memory decay (forgetting). In this way, the subjects could alternate between adjusting their body and checking back in the image, without having to memorize all the pose details. While this choice may appear more natural, it has the drawback that, while the subject is checking back in the image to adjust his pose, there is no longer a simple, robust way to map the fixations from the scene camera to the original image, as it is difficult to separate those that fall on the pose from those on the subject's own body or surroundings. Furthermore, as the subject's head movements are not constrained anymore, the resulting image can be degraded by motion blur, and the stimuli may appear only partially in the images captured by the camera of the eye tracker.

To combine the advantages of the two approaches and to understand to what extent the subjects make errors due to short-term memory limitations, we designed a second experiment. In the first part of the second experiment our subjects are constrained to observe the image only for 5 s (similar to the first experiment) but then, once they have assumed the pose, we display the image again and allow them to correct their posture accordingly, taking as much time as needed. The stimulus image is available at all times during the pose correction process. This experimental setting allows us to analyze the subjects' performance when they do not have the visual aid, as compared to the case when the stimulus is available as long as necessary, for each of the poses shown. To participate in this experiment, another group of five subjects (two males and three females) was recruited under similar constraints as the regular subject group (no visual or mobility impediments and no particular body movement skills). They were shown another set of 150 images, also taken from Human3.6M (Ionescu et al. 2014b), which were split using the same criteria as in the first experiment, into 100 easy poses and 50 hard ones. The display conditions and all the pre-processing steps (distance to projector, scaling size, etc.) remain the same throughout both experiments.


    3.2 Evaluation and Error Measures

We use the same skeleton joints as in Human3.6M (Ionescu et al. 2014b) such that our analysis can immediately relate to existing computer vision methods and metrics.

H3.6M position error (MPJPE) between a recorded pose and the ground truth is computed by translating the root (pelvis) joint of the given pose to the one of the ground truth. We rotate the pose such that it faces the same direction as the ground truth. The error is then computed for each joint as the norm of the difference between the recorded pose and the ground truth. In this way, we compensate for the global orientation of the subject. We normalize both the subject skeleton and the ground truth to a default skeleton of average weight and height, ensuring that all computed errors are comparable between poses and subjects.
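A minimal sketch of this position error is given below; the yaw alignment about the vertical axis and the choice of a hip joint for estimating the facing direction are our own simplifications of the rotation step described above, and the skeleton normalization is assumed to have been applied beforehand.

```python
import numpy as np

def mpjpe(pred, gt, root=0, hip=1, up_axis=2):
    """Mean per-joint position error after root translation and a
    facing-direction alignment. pred, gt: (J, 3) joint positions,
    already normalized to a common default skeleton; root is the
    pelvis index, hip a hip joint used to estimate the heading."""
    p = pred - pred[root]                       # pelvis-centered prediction
    g = gt - gt[root]                           # pelvis-centered ground truth
    ax = [i for i in range(3) if i != up_axis]  # horizontal-plane axes
    heading = lambda pose: np.arctan2(pose[hip, ax][1], pose[hip, ax][0])
    a = heading(g) - heading(p)                 # rotation aligning the facing
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    p[:, ax] = p[:, ax] @ R.T                   # rotate about the vertical axis
    return float(np.linalg.norm(p - g, axis=1).mean())
```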

H3.6M angle error (MPJAE) is computed as the absolute value of the difference between the joint angles of the test and ground truth poses, for each 1 d.o.f. joint (e.g. for the elbow, the angle between the upper arm and the lower arm). For a 3 d.o.f. joint, the representation is in ZXY Euler angles and the angle difference is computed separately for each d.o.f. as previously; the final error is the mean over the 3 d.o.f. differences.
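A corresponding sketch for the angle error, assuming the joint angles are already given in degrees (one value per 1 d.o.f. joint, three ZXY Euler values per 3 d.o.f. joint); the wrap-around handling is our own choice, not specified in the text.

```python
import numpy as np

def mpjae(pred_angles, gt_angles):
    """pred_angles, gt_angles: lists with one array per joint, holding
    1 value (1 d.o.f.) or 3 values (ZXY Euler, 3 d.o.f.) in degrees."""
    per_joint = []
    for p, g in zip(pred_angles, gt_angles):
        d = np.abs(np.asarray(p, float) - np.asarray(g, float))
        d = np.minimum(d, 360.0 - d)   # shortest angular difference
        per_joint.append(d.mean())     # mean over the joint's d.o.f.
    return float(np.mean(per_joint))   # mean over joints
```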

    4 Data Analysis

What is the joint angle distribution of the poses in our dataset? In Fig. 3 we show the angle distribution, for each joint, measured over the 120 poses in the dataset. Easy poses contain mainly standing positions (very few angles over 30° in the lower body part), whereas in the case of the hard ones, often very different from standing, large angles are spread across all joints.

Since subjects were asked to match the right and left components of a pose accordingly, we want to know whether there is a balanced distribution between the deviations of our selected poses from a resting pose, over the right and left sides. Figure 4 shows the mean deviation from a rest pose, over each joint. The poses shown to subjects have a similar degree of displacement from a resting pose on both the right and left sides of the body.

Fig. 4 Deviation statistics from rest pose under MPJPE

    4.1 Human Eye Movement Recordings

The analysis of human eye movement recordings is carried out mainly on the ten regular subjects that participated in the first experiment. Cases where results are presented both on regular and skilled subjects (jointly or separately) are signaled appropriately.

Static and dynamic consistency. In this section we analyze how consistent the subjects are in terms of their fixated image locations. We are first concerned with evaluating static consistency, which considers only the fixation locations, and then with dynamic consistency, which takes into account the order of fixations. To evaluate how well the subjects agree on fixated image locations, we predict each subject's fixations in turn using the information from the other nine subjects (Ehinger et al. 2009; Mathe and Sminchisescu 2013). This was done considering the same pose as well as different poses. For each pose, we generate a probability distribution by assigning a value of 1 to each pixel fixated by at least one of the 9 subjects and 0 to the others, then locally blurring with a Gaussian kernel. The width was chosen such that, for each pose, it would span a visual angle of 1.5°. The probability at pixels where the 10th subject's fixations fall is taken as the prediction of the model obtained from nine subjects. For cross-stimulus control, we repeat the process for 100 pairs of randomly selected different poses. Figure 5 indicates good consistency.

Fig. 3 Distribution of joint angles in our dataset (under MPJAE), split over easy (left) and hard poses (right)

Fig. 5 Static inter-subject eye movement agreement. Fixations from one subject are predicted using data from the other nine subjects, both on the same image (blue) and on a different image of a person, randomly selected from our 120 poses (green) (Color figure online)
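A sketch of this leave-one-subject-out evaluation follows, under our own reading of the procedure: the Gaussian width corresponding to 1.5° of visual angle is assumed to be precomputed per pose, and randomly sampled pixels serve as negatives for the AUC, a detail the text does not specify.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.metrics import roc_auc_score

def loo_fixation_auc(train_fix, test_fix, img_shape, sigma_px,
                     n_neg=1000, seed=0):
    """Predict one subject's fixations from the other nine subjects'.

    train_fix: (M, 2) integer (row, col) fixations of the nine subjects.
    test_fix:  (K, 2) fixations of the held-out subject.
    sigma_px:  Gaussian width in pixels (~1.5 deg of visual angle).
    """
    H, W = img_shape
    fmap = np.zeros((H, W))
    fmap[train_fix[:, 0], train_fix[:, 1]] = 1.0   # binary fixation map
    fmap = gaussian_filter(fmap, sigma=sigma_px)   # local Gaussian blur
    pos = fmap[test_fix[:, 0], test_fix[:, 1]]     # scores at true fixations
    rng = np.random.default_rng(seed)
    neg = fmap.ravel()[rng.integers(0, H * W, n_neg)]  # random pixels as negatives
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return roc_auc_score(labels, np.concatenate([pos, neg]))
```

For the cross-stimulus control, the same function can be called with test fixations taken from a different, randomly selected pose.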

We also checked for a difference in the visual patterns of regular and skilled subjects by analyzing the static consistency between the two groups. This was done in a similar manner as described above, but using all 10 regular subjects for training and the four skilled subjects for testing. We obtained an inter-group mean AUCinter = 85.17 ± 3.7%, similar to the intra-group one (AUCintra = 85.87 ± 1.5%) (see Fig. 5), which suggests that there is no significant difference between the visual strategies of regular and skilled subjects when presented the same visual stimuli for re-enactment.

To evaluate how consistent the subjects are in their order of fixating areas of interest (AOIs), we used the hidden Markov modeling recently developed by Mathe and Sminchisescu (2013). The states correspond to AOIs that were fixated by subjects and the transitions correspond to saccades. For each pose, we learn a dynamic model from the scanpaths of nine subjects and compute the likelihood of the 10th subject's scanpath under the trained model. The leave-one-out process is repeated in turn for each subject and the likelihoods are averaged. The average likelihood (normalized by the scanpath length) obtained is −9.38. Results are compared against the likelihood of randomly generated scanpaths and the likelihoods of scanpaths from another randomly chosen pose. Specifically, for each pose, we generate a random scanpath with the exception of the first fixation, which was taken as the center of the image to account for central bias. Each random scanpath is evaluated against the model trained with subject fixations on that pose. The average likelihood is much smaller for randomly generated trajectories (−42.03) than for those of human subjects. Also, the likelihood of scanpaths obtained from other images is considerably smaller (−17.12) than the likelihood of scanpaths obtained from the same image, indicating that subjects are consistent in the order they fixate AOIs. Examples of trained HMMs are shown in Fig. 6.

Fig. 6 HMMs trained for two poses using the method of Mathe and Sminchisescu (2013). Ellipses correspond to states (fixation clusters), whereas dotted arrows to transition probabilities assigned by the HMM. The AOIs determined by the model correspond to regions that well characterize the pose

Fig. 7 Number of times a body joint was in the top-3 regions fixated, accumulated over the 120 poses shown
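The dynamic-consistency test described above can be approximated with a generic Gaussian HMM, as in the sketch below; the actual model of Mathe and Sminchisescu (2013) differs in how states (AOIs) are defined and trained, and the number of states here is an arbitrary assumption.

```python
import numpy as np
from hmmlearn import hmm

def scanpath_loglik(train_scanpaths, test_scanpath, n_states=4, seed=0):
    """Train an HMM on nine subjects' scanpaths for one pose and score
    the held-out subject's scanpath (per-fixation log-likelihood).

    train_scanpaths: list of (T_i, 2) arrays of fixation coordinates.
    test_scanpath:   (T, 2) array for the held-out subject.
    """
    X = np.concatenate(train_scanpaths)           # stacked observations
    lengths = [len(s) for s in train_scanpaths]   # per-sequence lengths
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="full",
                            n_iter=100, random_state=seed)
    model.fit(X, lengths)                         # states ~ fixation clusters (AOIs),
                                                  # transitions ~ saccades
    return model.score(test_scanpath) / len(test_scanpath)
```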

What percentage of fixations fall on joints? In order to understand where our subjects look and what are the most important body cues in re-enacting each 3D pose, we project the skeletal joint positions onto the stimulus image. We analyze the fixations relative to the 17 joints in Fig. 7. For each pose, we take into account the 3D occlusions (based on mocap data and a 3D volumetric model) and consider, as possibly fixated, only those joints that are visible. We consider a fixation to be on a joint if it falls within a distance of 40 pixels. This threshold was chosen to account for an angle of approximately 1.5° of visual acuity. Our first analysis aims to reveal to what extent subjects are fixating joints and how their particular choice of regions varies with the poses shown. Figure 2 shows what percent of fixations fell on joints for each pose, for each subject. On average 54% of fixations fell on various body joints for easy poses, but only 30% for hard poses. This is not surprising as more joints are usually occluded in the case of the complex poses shown. There are, on average, 0.59 ± 0.87 occluded joints in easy poses and 2.95 ± 2.5 occluded joints in hard poses. The percentage of fixations that fall on joints is negatively correlated with the number of occluded joints in a pose (−0.65).

Fig. 8 Fixation counts on each joint. The mean and standard deviation is computed among the 120 poses by aggregating over all ten subjects, for each pose
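The per-pose statistic plotted in Fig. 2 can be computed along the following lines; the 40 px threshold comes from the text, while the visibility mask is assumed to be precomputed from the mocap data and the volumetric model.

```python
import numpy as np

def fraction_on_joints(fixations, joints_2d, visible, radius=40.0):
    """Fraction of fixations within `radius` pixels of a visible joint.

    fixations: (N, 2) fixation locations in stimulus-image coordinates.
    joints_2d: (17, 2) projected joint positions.
    visible:   boolean (17,) mask of joints not occluded in 3D.
    """
    vis = joints_2d[visible]
    if len(vis) == 0 or len(fixations) == 0:
        return 0.0
    # Distance from every fixation to every visible joint.
    d = np.linalg.norm(fixations[:, None, :] - vis[None, :, :], axis=-1)
    return float((d.min(axis=1) <= radius).mean())
```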

Where do subjects look first? Since approximately half of human fixations fell on joints, we want to know whether certain joints are always sought first, and to what extent the joints considered first are pose-dependent. The order in which joints are fixated can offer insight into the cognitive process involved in pose recognition. Figure 7 shows how many times a joint was among the first 3 AOIs fixated, for each subject. On average a subject fixates 8.99 ± 2.5 regions on each image. The first 3 fixations almost never fall on the lower body part, which (typically) has less mobility, but mostly on the regions of the head and arms.

Which are the most fixated joints? We study whether certain joints are fixated more than others and we want to know whether this would happen regardless of the pose shown, or whether it varies with the pose. For this purpose, we consider the number of fixations that fall on a particular joint as well as the time spent on fixating each joint. Figures 8 and 9 show the distribution of fixations on body joints averaged over poses and over subjects, respectively. Notice that although certain joints have been fixated more than others, this depends on the specific pose.

The variation can also be observed in a detailed analysis of how frequently certain joints were fixated in two arbitrary poses, presented in Figs. 10 and 11. While in Fig. 10 the legs are almost never fixated, a different frequency pattern is apparent in Fig. 11. One simple, immediate observation is that the configuration of the legs in the second image has a greater deviation from a standard rest pose than in the first image. However, we can also notice that there is a higher degree of familiarity in the leg configuration shown in Fig. 10 as compared to the less typical (not so easily recognisable) configuration of the legs in Fig. 11. By computing the correlation coefficient between the number of fixations per joint and the deviation of each joint from the corresponding joint of a rest pose (under the Euclidean metric), we obtained only a small correlation for the shown poses: 0.18 for Fig. 10 and 0.17 for Fig. 11, and by looking at all 120 configurations we notice a large variation, between −0.68 and 0.74. This shows that the magnitude of the Euclidean distance between the joints of shown poses and those of a rest pose is not always a good indicator of how much people will fixate those joints. The process is more complex and probably depends on the types of joints (as with the head/neck area or the extremities) and the familiarity of subjects with certain sub-configurations.

Fig. 9 Fixation counts on each joint. The mean and standard deviation is computed for each of the ten subjects by aggregating their fixations over all 120 poses shown

Fig. 10 Number of fixations of each subject on the 17 body joints, when presented the image (pose) stimulus shown on the right

Fig. 11 Number of fixations made by each subject, on the 17 body joints, when presented the image (pose) stimulus shown on the right

Fig. 12 Time spent on fixating each joint. The mean and standard deviation is computed among the 120 poses by aggregating the duration of fixations for all ten subjects, for each pose

Fig. 13 Time spent on fixating each joint. The mean and standard deviation is computed among the ten subjects by aggregating, for each, the fixation duration over all 120 poses

In Fig. 9 we aggregate joint fixations for each subject, on all poses, and show the most frequently fixated joints, on average. The inter-subject variation is smaller than the one between poses, confirming a degree of subject consistency with respect to the joints more frequently fixated. The wrists and the head area are the most looked at, within a general trend of fixating upper body parts more than lower ones.

How long are people looking at different joints? As the length of a fixation varies, it is also important to consider the time spent in fixating a particular joint. In Fig. 12 we show the mean time and standard deviation spent on a pose, for each body joint, by aggregating over subjects. Similarly, in Fig. 13 we show the mean time and standard deviation between subjects, by aggregating over poses. Notice that the inter-pose standard deviation is higher than the inter-subject standard deviation. It can be further observed that joints at the extremity of the upper body (head, neck, wrists) are fixated the most.

Table 2 Average time taken by the skilled subjects to re-enact a 3D pose

Subjects       SK1   SK2    SK3   SK4
Time (s) easy  4.9   6.5    4.8   3.8
Time (s) hard  6.2   10.7   7.6   5.2

The mean time is 5.2 ± 1.1 s for an easy pose and 7.4 ± 2.3 s for a hard one

    4.2 3D Pose Re-Enactment

In this section we complement the eye-movement studies with an analysis of how well humans are able to reproduce the 3D poses of people shown in images.

How long does it take to re-enact a pose? While there is variance between subjects in the time taken to re-enact a pose, Tables 1 and 2 show that all of them (both regular and skilled) are consistent in taking more time for hard poses compared to easy ones. When comparing the group of regular subjects with the group of skilled ones, there is no meaningful difference (0.9 s for easy poses and 1.2 s for hard poses) in the average time taken for pose re-enactment between the two.

Are easy poses really easy and hard poses really hard? The selection criterion is based on our perception of how hard it would be to re-enact the pose. Here we check how this relates to the measured errors for different poses.

The two leftmost plots in Fig. 14 show that MPJAE smoothly decreases over time both in the case of the regular subjects and in the case of the skilled ones, when re-enacting an easy pose. It can also be noticed that subjects require different times to completion for an easy pose. The second easy pose seems to be perceived as slightly harder, with higher errors and longer completion times. The right side of Fig. 14 shows two hard poses as well as the errors and time taken to re-enact them, both for the regular subjects and for the skilled ones. The errors are considerably higher than for the easy poses, indicating that our selection of difficult poses indeed resulted in higher re-enactment errors and longer completion times.

Table 1 Average time taken by the regular subjects to re-enact a 3D pose

Subjects       SR1   SR2   SR3   SR4   SR5   SR6   SR7    SR8    SR9   SR10
Time (s) easy  6.6   6.0   4.4   4.4   5.2   6.7   9.2    7.8    4.6   4.8
Time (s) hard  9.6   8.2   6.4   8.6   7.5   10.0  11.1   11.5   6.5   7.0

The mean time is 6 ± 1.6 s for an easy pose and 8.6 ± 1.8 s for a hard one


Fig. 14 Error variation, over time, for two easy poses (left) and two hard ones (right). For each pose we show, from left to right, the error variation for the ten regular subjects and then for the four skilled ones

Table 3 Results for regular subjects detailed for easy poses, hard poses, as well as over all poses, under the MPJPE and MPJAE metrics

Subjects  MPJPE min error (mm)                   MPJPE completion error (mm)            MPJAE min error (deg)             MPJAE completion error (deg)
          Easy        Hard        Both           Easy        Hard        Both            Easy      Hard      Both         Easy      Hard      Both
SR1       105.2±34.8  155.0±59.8  113.5±43.9     119.7±38.9  171.1±70.7  128.3±49.3      18.3±6.5  26.4±6.7  19.6±7.2     20.0±6.5  31.0±8.6  21.8±8.0
SR2       75.2±24.4   138.2±49.8  85.7±38.0      87.8±27.6   156.1±65.5  99.1±44.4       16.2±5.8  22.9±6.6  17.3±6.4     17.2±6.3  25.3±8.1  18.6±7.2
SR3       79.9±34.5   130.0±51.5  88.3±42.0      88.5±37.8   138.3±56.1  96.8±45.1       15.9±5.8  23.5±7.7  17.2±6.8     16.6±5.9  26.7±9.6  18.3±7.6
SR4       78.4±32.2   140.0±41.3  88.7±40.8      90.6±36.4   162.7±47.5  102.6±46.8      16.4±5.9  24.0±6.5  17.7±6.6     17.8±6.5  27.2±6.9  19.4±7.4
SR5       73.9±28.5   130.2±40.8  83.3±37.2      85.3±29.0   162.2±74.2  98.1±49.1       16.1±5.4  23.4±6.1  17.3±6.2     17.5±5.7  25.8±7.5  18.9±6.7
SR6       81.0±38.5   143.7±46.4  91.4±46.1      92.1±43.6   155.3±44.3  102.6±49.6      16.4±6.2  24.4±7.1  17.8±7.0     17.2±6.5  26.7±9.4  18.8±7.9
SR7       84.4±33.5   125.3±39.7  91.2±37.7      99.5±39.8   142.2±56.3  106.6±45.6      17.1±6.0  25.4±7.8  18.5±7.0     18.8±6.5  28.0±8.4  20.3±7.7
SR8       77.3±25.5   139.2±41.5  87.6±36.8      85.8±29.2   152.4±45.7  96.9±40.8       15.4±6.0  24.6±7.4  17.0±7.1     16.2±6.1  25.9±7.6  17.8±7.3
SR9       73.7±30.0   152.0±56.9  86.7±46.1      87.1±32.8   175.2±69.1  101.7±52.4      15.7±5.7  22.4±5.8  16.8±6.2     17.2±5.9  25.2±8.1  18.6±7.0
SR10      72.0±23.5   133.9±39.3  82.3±35.2      80.6±25.2   151.9±43.1  92.5±39.2       15.4±5.6  24.3±7.3  16.9±6.7     16.7±5.8  26.5±8.0  18.3±7.2
All       80.1±32.1   138.8±47.0  89.9±41.3      91.7±35.9   156.7±58.1  102.5±47.2      16.3±5.9  24.1±6.9  17.6±6.8     17.5±6.2  26.8±8.2  19.1±7.4

We display the mean of the minimum errors attained by the regular subjects during re-enactment, as well as the completion errors for the subjects

Table 4 Results for skilled subjects detailed for easy poses, hard poses, as well as over all poses, under the MPJPE and MPJAE metrics

Subjects  MPJPE min error (mm)                  MPJPE completion error (mm)            MPJAE min error (deg)             MPJAE completion error (deg)
          Easy       Hard        Both           Easy       Hard        Both             Easy      Hard      Both         Easy      Hard      Both
SK1       80.3±26.4  147.3±59.4  91.5±42.1      86.4±29.2  170.0±95.5  100.4±56.1       16.2±5.8  23.6±6.6  19.6±7.2     17.0±5.8  25.4±6.9  18.4±6.7
SK2       76.3±31.0  131.1±41.1  85.4±38.6      85.2±34.8  150.9±56.3  96.2±46.0        17.2±7.2  23.1±5.1  17.3±6.4     18.7±7.2  26.1±6.9  19.9±7.6
SK3       73.6±26.5  141.7±50.3  84.9±40.5      83.1±29.5  163.2±66.1  96.4±48.2        15.9±5.1  23.5±7.7  17.2±6.8     17.3±5.7  24.3±5.7  18.5±6.2
SK4       72.5±28.0  115.8±37.8  79.7±33.8      81.1±33.0  128.6±44.1  89.0±39.1        16.4±5.9  25.0±6.5  17.7±6.6     17.5±6.0  25.5±8.8  18.8±6.9
All       75.8±27.9  133.9±47.1  85.9±38.6      83.9±31.6  153.1±65.5  95.4±47.3        16.4±5.9  23.2±6.4  17.5±6.5     17.6±6.1  25.3±7.0  18.8±6.9

We display the mean of the minimum errors attained by skilled subjects during re-enactment, as well as the completion errors for the skilled subjects

The plots for easy poses show a similar re-enactment pattern for both skilled and regular subjects: starting from a standard initial position, then smoothly decreasing the error. In the case of the two hard poses shown, the error curve is less regular, due to subjects taking longer to adjust their body, sometimes deciding to mirror their initial choice of body configuration before finally declaring the re-enactment completed.

How accurately do humans re-enact 3D poses? The subjects decide when they consider completion, i.e. their body configuration being closest to the one shown. Using our error measures we analyze whether their perceived minimum error was indeed the closest they were able to achieve. In Table 3 we show that, on average, regular subject completion errors are worse (by 14 ± 3% under MPJPE or 9 ± 10% under MPJAE) than their minimum error achieved during the process of pose re-enactment. A similar pattern is noticeable among skilled subjects (Table 4), with completion errors worse than the minimum achieved error by 11 ± 1% under MPJPE or 7 ± 1% under MPJAE. The 20 poses we perceived (and selected) as hard to re-enact indeed have larger errors than easy poses, by 73 ± 20% (MPJPE) or 53 ± 6% (MPJAE) for regular subjects and by 82 ± 18% (MPJPE) or 43 ± 4% (MPJAE) for skilled subjects. Average subject errors (both skilled and regular) for the different poses are shown in Fig. 17.

Fig. 15 Examples of subject re-enactment for two easy (left) and two hard (right) poses. For each pose, the first re-enactment shown is from a regular subject, whereas the second one is from a skilled one

We have also compared the re-enactment performance of regular and skilled subjects on the same stimuli. We want to understand to what extent formal training in professions requiring sharp neuro-motor skills and good body-positioning self-awareness influences perception and the capacity to re-enact poses, as measured under the widely used metrics, MPJPE and MPJAE. Tables 3 and 4 show re-enactment errors under the MPJPE and MPJAE for regular and skilled subjects, respectively. It can be noticed that the overall completion error of the skilled subjects is only 7.1 mm (under MPJPE) or 0.3° (under MPJAE) smaller than the overall completion error of the regular subjects. Similarly, the difference between the minimum errors achieved by skilled subjects and regular subjects during re-enactment is only 4 mm under the MPJPE metric and 0.1° under the MPJAE metric. The small difference in errors between the two groups of subjects suggests that: (1) the metrics used are not sensitive enough to the criteria optimized by people perceiving and re-enacting 3D poses, and (2) having superior mobility and body coordination does not make a significant difference in the context of the task analyzed here. Examples of re-enactment from both skilled and regular subjects on 2 easy and 2 hard poses are presented in Fig. 15.

In our experiments we asked subjects to re-enact the stimulus pose by correctly assigning the left and right sides, as opposed to mirroring, which humans might find unnatural. To understand whether this experimental choice has a significant effect on their re-enactment performance, we count for each subject how many times they made a left-right assignment mistake. We consider that such a re-enactment mistake occurs if, by mirroring either the upper body part (above the hips), the lower body part or the whole body, we obtain a pose with a smaller MPJPE error than the original one. We find that, on average, less than 10% of the re-enactments produced by either regular or skilled subjects exhibit left-right assignment mistakes (9.9 ± 4.1% for regular subjects and 5.6 ± 4.6% for skilled subjects).
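A sketch of this left-right check; here 'mirroring' a body part is approximated by swapping its left and right joints before recomputing the MPJPE, and the joint index pairs are hypothetical placeholders for an H3.6M-style 17-joint skeleton.

```python
import numpy as np

# Hypothetical (left, right) joint index pairs; the exact indices depend
# on the skeleton definition used.
UPPER_PAIRS = [(11, 14), (12, 15), (13, 16)]   # shoulders, elbows, wrists
LOWER_PAIRS = [(4, 1), (5, 2), (6, 3)]         # hips, knees, ankles

def has_lr_mistake(pose, gt, mpjpe_fn):
    """Flag a left-right assignment mistake: swapping the left/right
    joints of the upper body, the lower body, or the whole body yields
    a smaller MPJPE than the original re-enactment."""
    base = mpjpe_fn(pose, gt)
    for pairs in (UPPER_PAIRS, LOWER_PAIRS, UPPER_PAIRS + LOWER_PAIRS):
        mirrored = pose.copy()
        for l, r in pairs:
            mirrored[[l, r]] = mirrored[[r, l]]   # exchange the two joints
        if mpjpe_fn(mirrored, gt) < base:
            return True
    return False
```

For mpjpe_fn, the MPJPE sketch given in Sect. 3.2 above can be used.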

Do continuously available stimuli improve the quality of re-enactment? We complemented the initial experiment with another one (see Sect. 3.1), in which we allow subjects to adjust their pose while the image is available, thus ruling out short-term memory decay as a confounding factor. We first presented the image only for 5 s, removed it, and asked the subjects to re-enact the pose. Upon completion, we projected the image again, allowing the subjects to adjust their pose once more. For five subjects, shown 100 easy and 50 hard poses, the errors are presented in Table 5, both for the first experimental setup, where the visual stimulus was presented only for 5 s, and for the second setup, where the visual stimulus was available for pose correction. The mean errors were 103.92 mm (MPJPE) or 20.3° (MPJAE) without feedback, and 99.36 mm (MPJPE) or 20.0° (MPJAE) with visual feedback. The small difference between the two cases indicates that continuously available visual stimuli did not significantly change the re-enactment error on completion.

Are there correlations between errors of different body joints? We expect that when subjects misinterpret the position of a joint, thus exhibiting a large error in that particular articulation, there could be other joints that are incorrectly positioned, perhaps to compensate.


Table 5 Results detailed for easy poses, hard poses, as well as over all poses, under the MPJPE and MPJAE metrics

Subjects  MPJPE min error—no feedback (mm)        MPJPE min error with feedback (mm)      MPJAE min error—no feedback (deg)   MPJAE min error with feedback (deg)
          Easy         Hard         Both          Easy        Hard         Both            Easy      Hard      Both           Easy      Hard      Both
S1        110.8±49.4   112.8±39.9   111.5±46.3    114.0±52.1  105.8±36.0   111.2±47.3      17.7±5.4  24.0±6.9  19.8±6.6       17.5±5.0  25.0±7.7  20.0±6.9
S2        92.2±31.7    107.8±43.0   97.4±36.5     82.8±29.2   89.9±39.1    85.1±32.9       17.2±5.1  23.7±6.3  19.4±6.3       16.3±4.9  21.4±5.9  18.0±5.8
S3        109.2±41.2   115.4±37.7   111.3±40.0    110.3±44.2  119.8±43.3   113.5±44.0      18.6±5.2  25.8±6.5  21.0±6.6       18.5±5.5  26.3±6.6  21.1±6.9
S4        89.5±35.7    95.4±37.1    91.5±36.2     87.1±33.7   94.0±31.1    89.4±32.9       18.1±5.6  25.8±6.9  20.7±7.1       18.0±5.7  25.8±6.0  20.6±6.8
S5        101.6±39.6   121.3±50.4   108.2±44.3    97.5±36.4   96.7±39.4    97.2±37.3       18.9±6.1  25.0±6.1  20.9±6.7       18.8±5.7  23.7±6.2  20.4±6.3
All       100.63±39.5  110.52±41.6  103.92±40.6   98.42±39.8  101.23±37.7  99.36±38.87     18.1±5.5  24.8±6.6  20.3±6.7       17.8±5.4  24.4±6.5  20.0±6.6

We display the mean of the minimum errors attained by subjects during re-enactment in the setup of fixed time exposure of visual stimuli, as well as in the case when the visual feedback was continuously available

Figure 16 indeed shows strong error correlations for the upper body (under MPJPE) as well as for both arms (under MPJAE), computed for all 14 subjects participating in the first experiment.

How is the error localized at joint level? Apart from analyzing the average error for different poses (Fig. 17), we are also interested in a more detailed study aiming to understand whether there are joints that are systematically re-enacted with a larger error than others. In Fig. 18 we show the average angle errors detailed for each joint. The angle errors of the upper body are larger than the ones of the joints belonging to the lower body part, as the arms have the largest mobility. The large neck errors could be attributed to the fact that, as people focus on reproducing the configuration of the large body parts that have the main influence on the visual appearance of the pose (legs, arms, torso), they tend to overlook that the neck has its own orientation that has to be reproduced.

Figure 19 shows the mean position error on each joint, detailed over poses (left) and over subjects (center). The highest errors can be noticed on the extremities (wrists and ankles). Inner joints, which have the least mobility, are re-enacted with considerably lower errors. The inter-pose error variation is larger than the inter-subject variation, suggesting that while in general extremities are harder to re-enact, the levels of re-enactment error for each joint depend on the particular configuration of the observed pose.

What is the relation between fixations and the error levels for each joint? In order to gain a deeper understanding of the strategies that humans pursue when re-enacting a pose, we investigated the relation between the error levels of fixated and non-fixated joints. In Figs. 10 and 11 we show that familiar pose sub-configurations are fixated less than unfamiliar ones. Overall, 54% of the fixations of subjects fall on joints for easy poses and 30% for hard poses. We analyzed the error levels of the fixated joints as opposed to the joints that were not fixated. Table 6 shows the mean MPJPE error for fixated and non-fixated joints, detailed on easy and hard poses. It can be noticed that there is only a small difference between the two cases. This could suggest that the objective of human subjects is to distribute attentional resources such that the error per joint is relatively uniform. This may involve spending more time on fixating image regions associated with harder to reproduce joints (body parts) as opposed to easier ones.

    4.3 Insights from Data Analysis

    Our study reveals that people are not significantly better atre-enacting 3D poses given visual stimuli, on average, thanexisting computer vision algorithms (Ionescu et al. 2014b),at least within the laboratory setup of our study (naturally theerrors of computer vision algorithms could be radically dif-ferent and are subject of a detailed analysis in the followingsection). Errors in the order of 10–20◦ or 100mmper joint are


    Fig. 16 Joint error correlations under MPJPE (left) and MPJAE (right) computed for both regular and skilled subjects together

    Fig. 17 MPJPE and MPJAE versus pose index for the 14 subjects (both regular and skilled) participating in the first experiment. Notice significant subject variance and larger errors for hard poses

    not uncommon. Hard poses selected in the construction of the dataset indeed lead to higher errors compared to easy poses. This indicates that people are not necessarily good at accurate 3D pose recovery under conventional metrics, a finding consistent with earlier computational studies of 3D monocular human pose ambiguities (Sminchisescu and Triggs 2003, 2005; Sminchisescu and Jepson 2004). Instead, qualitative representations may be used for most tasks, although the implications for skill games (e.g. using Kinect (Sun et al. 2012)), where player accuracy is valued but may not be realizable, could be relevant. In the process of reproducing the pose, subjects attend to certain joints more than others, and this depends on the pose, but the scanpaths are stable across

    subjects both spatially and sequentially. Extremities, including the head or the wrists, are fixated more than internal joints, perhaps because once 'end-effector' positions are known, more constraints are applicable to 'fill in' intermediate joints on the kinematic chain. Familiar pose sub-configurations are often fixated less (or not at all) compared to unfamiliar ones, indicating that a degree of familiar sub-component pose recognition occurs from low-resolution stimuli, which does not rule out poselet approaches (Bourdev et al. 2010). An interesting avenue, not pursued in almost any artificial recognition system but not inconsistent with our findings, would be the combination of low-resolution inference (currently pervasive in computer vision) with pose- and image-dependent search


    Fig. 18 Angle error for each joint. The joints in the upper body have larger angle errors than the joints in the lower body

    strategies that focus on high-resolution features, combining bottom-up and selective top-down processing (Sminchisescu et al. 2006; Sigal et al. 2007; Andriluka et al. 2010).

    5 Perceptual Metrics for Automatic 3D Human Pose Estimation

    In this section we focus on two aspects: (1) understanding the types of errors humans make in pose re-enactment and comparing them with those of automatically generated poses, and (2) learning perceptual metrics that more truthfully reflect the semantics of a human pose.

    5.1 Human Versus Computer Vision Performance

    We aim to reveal the quantitative and qualitative differences between poses re-enacted by humans and poses estimated by a computer vision model and algorithm when presented with the same visual stimuli.

    Table 6 Average re-enactment error (MPJPE, in mm) on fixated and non-fixated joints

                         Easy poses      Hard poses       Both

    Fixated joints       92.3 ± 29.9     150.4 ± 56.0     102.6 ± 41.4
    Not fixated joints   90.7 ± 24.3     155.6 ± 39.3     101.5 ± 36.4

    Model description. We consider a structured prediction model, Kernel Dependency Estimation (KDE) (Cortes et al. 2005), to learn a mapping from features extracted from the image of a person to the 3D joint representation of his/her pose. In KDE the problem of learning a mapping from input descriptors X to a target pose representation Y is treated as a regression problem. Both inputs and targets are lifted to high-dimensional spaces via kernels $K_X$ and $K_Y$ associated with mapping functions $\phi_X$ and $\phi_Y$. To estimate a pose at testing time we need to solve the pre-image problem (Cortes et al. 2005):

    $\arg\min_{y \in \mathcal{Y}} \; \|W \phi_X(x) - \phi_Y(y)\|^2 \qquad (1)$

    As input features, X, we computed histograms of oriented gradients on the mask of the person. We apply a $\chi^2$ kernel $K_X$ to the input features and a Gaussian kernel $K_Y$ on the target variables Y. For computational reasons we used Fourier linear kernel approximation methods (Li et al. 2012; Rahimi and Recht 2007). As training set we use the same dataset, Human3.6M (Ionescu et al. 2014b), from which we selected the 120 poses shown to subjects for re-enactment. A pose is represented in xyz coordinates, centered at the hip (lower back) joint. We initially selected a subset of 55,000 poses of the entire dataset to be used for training. To ensure a fair evaluation, we eliminated poses similar to any of the 120 test poses (within a 75 mm distance), which resulted in 53,400 training data points.
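    For concreteness, below is a minimal sketch of such a KDE-style predictor, assuming random Fourier features for the Gaussian target kernel and a nearest-candidate search as an approximate pre-image solver for Eq. 1. All function names, the candidate-set solver and the parameter values are illustrative, not the authors' implementation (which also uses the $\chi^2$ approximation of Li et al. (2012) on the input side).

```python
import numpy as np

def gaussian_rff(Y, n_features=300, sigma=100.0, seed=0):
    """Random Fourier features approximating a Gaussian kernel
    k(y, y') = exp(-||y - y'||^2 / (2 sigma^2)) on pose targets Y of shape (n, d)."""
    rng = np.random.default_rng(seed)
    d = Y.shape[1]
    Omega = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(Y @ Omega + b)

def fit_kde(PhiX, PhiY, lam=1e-3):
    """Ridge regression for the linear map W between lifted inputs and lifted targets."""
    dx = PhiX.shape[1]
    return np.linalg.solve(PhiX.T @ PhiX + lam * np.eye(dx), PhiX.T @ PhiY)

def predict_pose(phix, W, PhiY_cand, Y_cand):
    """Approximate pre-image (Eq. 1): return the candidate pose whose lifted
    representation is closest to the regressed point in target feature space."""
    target = phix @ W
    idx = np.argmin(np.sum((PhiY_cand - target) ** 2, axis=1))
    return Y_cand[idx]
```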

    What is the error difference between humans and a vision model (KDE)? In Table 7 we show the average re-enactment error for both skilled and regular subjects and the average KDE prediction error under the MPJPE metric. On easy poses, the difference between skilled subjects and KDE pre-

    Fig. 19 Position error for each joint considering human re-enactment (left and center) and KDE predictions (right). The standard deviation is computed among poses (left) and among subjects (center) for human re-enactments, and among poses for KDE predictions (right)


    Table 7 Subject re-enactment results as well as KDE predictions for easy poses, hard poses and over all poses under the MPJPE metric

    Method             MPJPE error (mm)
                       Easy              Hard               Both

    Regular subject    91.7 ± 35.9       156.7 ± 58.1       102.5 ± 58.1
    Skilled subject    83.9 ± 31.6       153.1 ± 65.6       95.4 ± 47.3
    KDE                100.33 ± 39.54    267.42 ± 133.22    128.18 ± 89.45

    dictions is 16 mm, while between regular subjects and KDE predictions it is 9 mm. On the hard poses, however, the difference between subjects and KDE is significantly higher: 110 mm in the case of regular subjects and 113 mm in the case of skilled subjects. This indicates that although both human and algorithmic performance is diminished when presented with hard poses, the algorithmic approach struggles considerably more than humans when the poses are mainly seated and have severe self-occlusions. We are also interested in a more detailed analysis aiming to understand the differences between poses re-enacted by humans and the KDE predictions at joint level. Figure 19 shows the mean Euclidean error for each joint made by humans, both skilled and regular (left and center), versus KDE (right). Notice a strong similarity between the two: higher errors on extremities (wrists and ankles) and lower errors on inner joints. The main difference can be observed in the standard deviations for each joint: while humans are more consistent in the level of errors they make on different poses, the algorithmic approach shows higher variation. This is also reflected in Table 7, mainly on hard poses.

    Qualitative error analysis. Ionescu et al. (2014b) mention that MPJPE has the disadvantage of not being robust, i.e. one badly predicted joint can have an arbitrarily high impact on the overall distance between the compared poses. Furthermore, errors that humans hardly perceive can be overemphasized under this metric. To overcome such issues the authors propose a new error measure, the mean per joint

    localization error (MPJLE), which uses a perceptual tolerance parameter t. While this measure shows more clearly to what extent there are joints predicted with very large errors, a further improvement of the metric would be to set a different tolerance parameter for each joint. This is motivated by the anatomical constraints of the skeleton: extremities, for example, have a larger degree of mobility and are thus more likely to be predicted (by an algorithm) or re-enacted (by humans) with larger errors. This is also reflected in Fig. 19, where we show that subjects re-enact joints with different levels of accuracy and different standard deviations. Consequently, for each joint i we set a different threshold t(i) that corresponds to the standard deviation for that particular joint (obtained from subject re-enactments), multiplied by a level of tolerance that was varied between 1 and 20. Equation 2 shows the modified MPJLE given a predicted pose configuration $m_{pred}$ and the ground truth one $m_{gt}$ with N joints.

    $E_{MPJLE@t}(m_{pred}, m_{gt}) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_{\|m_{pred}(i) - m_{gt}(i)\|_2 \geq t(i)} \qquad (2)$
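    Read operationally, Eq. 2 counts the fraction of joints whose position error exceeds its per-joint threshold. A minimal sketch is shown below; the array shapes, the default tolerance value and the function name are assumptions, not part of the original protocol.

```python
import numpy as np

def mpjle(pred, gt, joint_std, tolerance=5.0):
    """Modified MPJLE (Eq. 2): fraction of joints whose position error exceeds
    a per-joint threshold t(i) = tolerance * joint_std[i].

    pred, gt:  arrays of shape (n_poses, n_joints, 3), in mm
    joint_std: array of shape (n_joints,), per-joint std from subject re-enactments
    """
    errors = np.linalg.norm(pred - gt, axis=-1)      # (n_poses, n_joints)
    thresholds = tolerance * joint_std               # t(i) per joint
    return np.mean(errors >= thresholds, axis=-1)    # one score per pose
```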

    Figure 20 shows the mean per joint localization error for subject re-enactment (considering regular and skilled subjects together) and KDE predictions on the 120 visual stimuli shown to subjects. Notice considerably higher errors on joints in model predictions compared to human re-enactment for the same set of poses. This implies that one of the measurable criteria that humans follow in order to acquire a desired level of perceptual similarity with the pose shown could be to maintain an evenly distributed error across joints. Even if only a small subset of the body joints is affected by a large error, the perceptual appearance of the pose can change dramatically. An example of this situation can be noticed in Fig. 21, where, although under the MPJPE metric human and KDE errors are quite similar, the poses re-enacted by humans are perceptually consistent with those presented as stimuli. The

    Fig. 20 Mean per joint localization error for subjects and KDE. Results are computed for all 120 poses (left), for easy poses (center) and for hard poses (right). Human re-enactments tend to have fewer joints with high errors than the algorithmic estimates, especially on hard poses


    Fig. 21 Example of subject re-enactment and KDE estimation of a pose. a Ground truth, b KDE prediction (MPJPE 64.57 mm), c Regular subject 9 (MPJPE 70.61 mm), d Regular subject 5 (MPJPE 56.46 mm), e Regular subject 3 (MPJPE 68.83 mm). Although the subjects' re-enactments are not perfect, they are consistent in respecting the ground truth's most distinctive sub-configurations: right hand close to the head and right elbow bent, as opposed to the KDE prediction, which misestimated the right hand

    human re-enactment is immediately recognized as reproducing the same pose (right wrist close to the head, left elbow bent), whereas the KDE prediction can be interpreted as a different pose than the ground truth. In the case of the KDE prediction there is one joint (the right wrist) that is very badly positioned (190.62 mm error), and this causes the pose to be perceived as considerably different with respect to the ground truth.

    Another approach to investigating the qualitative differences between subjects and algorithmic approaches is to use boolean geometric relationships between body parts. In (Pons-Moll et al. 2014), the authors propose posebits as a mid-level qualitative pose representation. They propose three types of posebits (joint distance, articulation angle and relative position) that could generate hundreds of possible boolean features, from which they randomly choose 30 to represent each pose. We choose a similar pose representation and initially generated a large pool of possible boolean features (e.g. 'left wrist is above head', 'right wrist touches left shoulder', 'right elbow is bent', etc.). However, instead of randomly picking a subset, we ranked them and chose the most relevant ones. The relevance of a feature (reflected in its ranking score) is defined such that it promotes those features which are consistently accurate in human re-enactment, not very rarely occurring, and non-trivial (an example of a trivial, frequently occurring feature is 'head above hips'). For a particular feature $f_i$, its ranking score is defined as $\mathrm{score}(f_i) = freq_{subj}(f_i) \cdot (1 - freq_{gt}(f_i)) \cdot freq_{gt}(f_i)$, where $freq_{subj}$ is the frequency of finding that feature valid among subjects (in relation to the ground truth pose shown) and $freq_{gt}$ accounts for how common that feature was among all ground truth poses presented to subjects. We rank

    Table 8 Comparison between human re-enactment and computer vision model performance using the qualitative pose representations. The results represent the Jaccard index between a pose (re-enacted or estimated) and the ground truth, computed on the 120 poses selected from Human3.6M

    Method             Jaccard Index
                       Easy           Hard           Both

    Regular subject    0.74 ± 0.13    0.65 ± 0.16    0.72 ± 0.14
    Skilled subject    0.76 ± 0.13    0.68 ± 0.15    0.74 ± 0.13
    KDE                0.66 ± 0.13    0.49 ± 0.15    0.64 ± 0.15

    features according to this score and use the first 50 to obtain qualitative pose representations.
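    To make the selection procedure concrete, the following sketch computes the ranking score and the Jaccard index used in the comparison below. The boolean feature encoding, the array shapes and the function names are assumptions for illustration, not the exact feature pool used in our experiments.

```python
import numpy as np

def rank_posebits(subj_correct, gt_feats):
    """Rank boolean posebit features by score(f) = freq_subj * (1 - freq_gt) * freq_gt.

    subj_correct: (n_reenactments, n_features) boolean, 1 if the feature of the shown
                  ground-truth pose is also valid in the subject's re-enactment
    gt_feats:     (n_gt_poses, n_features) boolean, feature values on ground-truth poses
    """
    freq_subj = subj_correct.mean(axis=0)
    freq_gt = gt_feats.mean(axis=0)
    score = freq_subj * (1.0 - freq_gt) * freq_gt
    return np.argsort(score)[::-1]          # feature indices, most relevant first

def jaccard(a, b):
    """Jaccard index (intersection over union) between two boolean feature vectors."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 1.0

# Hypothetical usage: keep the 50 most relevant features for the qualitative comparison.
# top_features = rank_posebits(subj_correct, gt_feats)[:50]
```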

    Under this boolean representation we compare the human re-enactments with the KDE model predictions. We analyze the Jaccard index (intersection over union) between the set of active features in a pose (re-enacted and respectively predicted) and the active features in the ground truth. We randomly split both the 100 easy poses and the 20 hard poses (as well as the associated subject re-enactments) in half, perform the selection of the features on one subset, and report the results on the other subset. We perform five different splits and average the results. Table 8 shows the average Jaccard index for skilled and regular subjects, and for KDE predictions on both easy and hard poses. While there is only a small difference between the skilled and regular subjects (2%), when comparing the computer vision model estimates with both types of subjects we notice that the model estimates have a considerably lower Jaccard index than both groups. The difference on easy poses is 10% for skilled subjects and 8% for regular ones, while for hard poses


    it is 19% for skilled subjects and 16% for regular ones (see Table 8).

    Our analysis reveals that re-enactments from subjects tend to have a higher qualitative consistency with the poses shown than those obtained by a computer vision model. Humans seem to make mistakes in the degree to which a feature is expressed (e.g. how much an elbow is bent, how high a hand is raised, etc.) rather than by completely missing a sub-configuration. Moreover, in the case of hard poses the results produced automatically degrade at a considerably higher rate than those produced by humans, making the algorithmic approach difficult to use in certain applications.

    5.2 Perceptual Metric Learning

    In this section we present our proposal for learning a new metric that captures the perceptual difference between poses. In this way, we aim to reduce the gap between the human perception of pose similarity and what the commonly used metric (MPJPE) evaluates.

    For this purpose we use the re-enactments of both skilled and regular subjects to learn a perceptual metric over poses. The re-enacted poses are the result of what subjects considered to be their closest body configuration to the poses shown, and thus give us a powerful insight into the human perception of 3D articulated poses. Although their errors as measured by MPJPE are not, on average, significantly better than those produced by a computer vision algorithm, we showed in the previous section that poses produced by humans have considerably fewer joints predicted with large error and a more pronounced visual consistency with the ground truth. We argue that such visual consistency is an important element that a metric between 3D poses has to take into account.

    To learn a perceptually relevant metric from subject re-enactments, we use Relevant Component Analysis (RCA) (Bar-hillel et al. 2003), which changes the feature space by a global linear transformation. It assigns high weights to 'relevant dimensions' and low weights to 'irrelevant dimensions'. The decision regarding the relevant dimensions is made based on chunklets, obtained through a transitive closure over known equivalence relations in the dataset.

    In our case, a chunklet consists of the re-enacted poses belonging to different subjects when presented with the same visual stimulus. Since perception is subjective, by defining an equivalence relation on re-enactments we do not assume that humans perceive a pose in the same manner. Instead, we assume that there may be common aspects of a pose that people will tend to get right, whereas other aspects of the pose will often be ignored. These are the aspects we would like to learn. Perception is not straightforward to quantify, as none of the subjects (or any external human) can provide an absolute judgment on what is perceptually valid and what is not. However, to be able to probe into perception and train algorithms

    that produce meaningful results with human-acceptable mistakes, we assume that people perceive poses in similar ways. Thus, instead of working with individual perceptions, we try to unveil their common elements. Considering this, we define our equivalence relation as follows: two poses are in relation if they represent subjects' re-enactments of the same pose (visual stimulus). The chunklets are obtained after applying a transitive closure on this relation. There will be as many chunklets as presented visual stimuli.

    RCA first computes the covariance matrix of all the centred data points in chunklets. Considering p points in k chunklets, where chunklet j consists of $n_j$ points $\{x_{ji}\}_{i=1}^{n_j}$ and has mean $\hat{m}_j$, RCA computes the following matrix (Bar-hillel et al. 2003):

    $\hat{C} = \frac{1}{p} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ji} - \hat{m}_j)(x_{ji} - \hat{m}_j)^{\top} \qquad (3)$

    In our experiments $x_{ji}$ are the position coordinates of a human pose. The inverse of $\hat{C}$ can be used as a Mahalanobis distance. Alternatively, one can use the whitening transform $W = \hat{C}^{-\frac{1}{2}}$ to warp the original data points and then use a Euclidean metric, with the same effect as employing the Mahalanobis distance on the unmodified data.
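    A minimal sketch of this computation is given below, assuming chunklets are provided as arrays of flattened joint coordinates; the small ridge added for numerical stability is our assumption and is not part of Eq. 3.

```python
import numpy as np

def rca_whitening(chunklets, eps=1e-6):
    """RCA: inner-chunklet covariance (Eq. 3) and the whitening transform W = C^(-1/2).

    chunklets: list of arrays, each (n_j, d); here one chunklet gathers the re-enactments
               (flattened 3D joint coordinates) of all subjects for one visual stimulus.
    eps:       small ridge added for numerical stability (an assumption).
    """
    p = sum(len(c) for c in chunklets)
    d = chunklets[0].shape[1]
    C = np.zeros((d, d))
    for c in chunklets:
        centered = c - c.mean(axis=0)        # subtract the chunklet mean m_j
        C += centered.T @ centered
    C = C / p + eps * np.eye(d)
    # C^(-1/2) via eigendecomposition of the symmetric positive-definite matrix C
    vals, vecs = np.linalg.eigh(C)
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return C, W

def perceptual_distance(x, y, W):
    """Mahalanobis distance induced by C^-1, i.e. the Euclidean distance after whitening."""
    return np.linalg.norm(W @ (x - y))
```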

    In order to understand to what extent the newly learned perceptual metric incorporates information from human subjects regarding pose similarity, we designed a pose retrieval evaluation scenario. To ensure that the poses used for training the distance differ from those used for testing it, we split our 120 poses, together with the associated subject re-enactments, in half. The distance is learned on one half and the retrieval experiment is carried out on the other half. We performed five different random splits and averaged the results.

    Given a pose from those shown to subjects (a query), we compute both the Euclidean distance and the perceptual distance from all subjects' re-enactments to that particular pose. Among the re-enactments there are the actual ones, corresponding to the current query (positive class), as well as re-enactments of other poses (negative class). Ideally, a metric consistent with subjects' perception of a pose would rank the actual re-enactments first. For every possible query pose, the number of positive and negative examples is fixed, as there are 14 subject re-enactments of every pose. To add variability, for each query pose we randomly subsampled the number of positive and negative examples. The results for both the Euclidean and the perceptual metric are shown in the precision-recall graph in Fig. 22. Notice that by using the learned perceptual metric we can rank the poses better, based on the relevance given by subjects' perception of pose similarity. In Fig. 23 we show (for one of the image stimuli presented to subjects) the closest pose (chosen from their re-enactments) under the Euclidean and perceptual distances. In


    Fig. 22 Precision-recall graph when ranking poses using Euclidean and perceptual distances. The rankings obtained with the learned perceptual metric are more relevant than those obtained using the standard Euclidean metric

    fact, the pose obtained using the Euclidean distance represents a subject re-enacting a different pose than the one we selected.
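    As a reference for this evaluation protocol, a sketch of ranking re-enactments under an arbitrary distance and computing precision and recall at every rank is given below; the function names and the commented usage (which reuses the hypothetical perceptual_distance from the RCA sketch above) are illustrative.

```python
import numpy as np

def retrieval_pr(query_pose, reenactments, labels, dist_fn):
    """Rank re-enactments by distance to the query and compute precision/recall at each
    rank; labels[i] is 1 if re-enactment i belongs to the query pose, else 0."""
    dists = np.array([dist_fn(query_pose, r) for r in reenactments])
    order = np.argsort(dists)
    relevant = np.asarray(labels)[order]
    tp = np.cumsum(relevant)
    precision = tp / np.arange(1, len(relevant) + 1)
    recall = tp / max(relevant.sum(), 1)
    return precision, recall

# Hypothetical usage: compare the plain Euclidean metric with the learned one.
# prec_e, rec_e = retrieval_pr(q, R, y, lambda a, b: np.linalg.norm(a - b))
# prec_p, rec_p = retrieval_pr(q, R, y, lambda a, b: perceptual_distance(a, b, W))
```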

    5.3 Perceptual Metric in Pose Estimation

    In this section we integrate the newly learned perceptual metric in the KDE framework towards more meaningful and perceptually valid results for 3D human pose estimation from monocular images. We used the 55,000 training data points previously presented (see Sect. 5.1) and selected 25,000 different poses from Human3.6M as testing data. The testing data come from

    Table 9 Pose estimation errors under both metrics, for KDE based on a Euclidean (Gaussian) and on a perceptual kernel

    Method              Mean MPJPE       Mean perceptual error

    KDE - Gaussian      113.07 ± 72      18.44 ± 10.25
    KDE - Perceptual    109.16 ± 66      14.92 ± 9.43

    all 15 action scenarios available in the dataset. We train the model using both a Gaussian kernel $K_Y(x, y) = e^{-\frac{\|x-y\|_2^2}{2\sigma^2}}$ and a perceptual kernel $K_Y(x, y) = e^{-\frac{d_P^2(x, y)}{2\sigma^2}}$ on the target variables, where $d_P(x, y)$ is the perceptual metric between poses x and y learned as described in Sect. 5.2.
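    A minimal sketch of the two target kernels follows, assuming the whitening matrix W from the RCA sketch above and an illustrative bandwidth; in the actual pipeline such kernels would typically again be approximated with Fourier features for efficiency, as in Sect. 5.1.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Standard Gaussian target kernel on raw pose coordinates."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def perceptual_kernel(x, y, W, sigma=1.0):
    """Perceptual target kernel: a Gaussian on the learned distance d_P, realized here
    as the Euclidean distance in the RCA-whitened space (W from the sketch above)."""
    d_p = np.linalg.norm(W @ (x - y))
    return np.exp(-d_p ** 2 / (2.0 * sigma ** 2))
```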

    In Table 9 we show the mean MPJPE and mean perceptual error for poses estimated with a Gaussian kernel and a perceptual kernel, respectively. The errors produced by the perceptual kernel are smaller under both metrics, especially under the perceptual one. Moreover, the standard deviations are also lower when using the perceptual kernel.

    As shown in Sect. 5.1, poses with similar MPJPE error can look perceptually very different if a subset of the joints have very large errors. First, we investigate how much the predictions of KDE with a Gaussian kernel and the predictions of KDE with a perceptual kernel agree at joint level. By agreement on a joint we mean that the difference between the two predictions of that particular joint is smaller than a threshold. Figure 24 shows the percentage of agreements between the poses predicted using the Gaussian kernel and the poses predicted using the perceptual kernel. Notice that very few predictions agree on more than 14 joints. For example, less than 20% of the poses agree on more than 14 joints at a threshold of 50 mm. This indicates differences between the two types of predictions, which we analyze in more detail.
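    The joint-level agreement used for Fig. 24 can be computed as in the following sketch; the array shapes, names and the 50 mm default are illustrative.

```python
import numpy as np

def joint_agreement(pred_a, pred_b, threshold=50.0):
    """Count, for each pose, the joints on which two sets of predictions agree,
    i.e. the per-joint distance between predictions is below the threshold (in mm).

    pred_a, pred_b: arrays of shape (n_poses, n_joints, 3)
    Returns an array of shape (n_poses,) with the number of agreeing joints per pose.
    """
    per_joint_dist = np.linalg.norm(pred_a - pred_b, axis=-1)
    return np.sum(per_joint_dist < threshold, axis=-1)

# Hypothetical usage: fraction of test poses agreeing on at least 14 joints (cf. Fig. 24).
# agree = joint_agreement(pred_gauss, pred_percep)
# frac_at_least_14 = np.mean(agree >= 14)
```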

    Fig. 23 a Original image shown to subjects (ground truth). b Closest subject re-enactment under Euclidean distance. c Closest subject re-enactment under perceptual distance. Notice that unlike the pose in b, the one in c shows consistent sub-configurations with the ground truth, such as: left foot at the back of the right foot, right hand bent, etc.


    Fig. 24 Percentage of test poses for which the predictions made using the Gaussian kernel and the perceptual kernel agree on at least i joints, where i = 2, ..., 17. Agreement on a joint means that the distance between the two predictions of that joint is smaller than a specified threshold

    In Sect. 5.1 we showed that one of the main differences between poses automatically estimated and human re-enactments, given the same stimuli, is that humans tend to have fewer joints with large errors (see Fig. 21). To

    understand whether our automatic approach that uses the perceptual kernel has successfully integrated elements of human perception when estimating poses, we check whether its mistakes are more human-like than those produced with the standard Gaussian kernel. Figure 25 depicts the modified MPJLE for both types of predictions. Notice that as the tolerance level increases, the difference between KDE with a Gaussian kernel and with a perceptual kernel, respectively, becomes larger. There are fewer joints with very large errors when using the perceptual kernel. Although more pronounced, the same difference is shown in Fig. 20 (left), when comparing the performance of humans and that of KDE with the Gaussian kernel.

    In Fig. 26 we show an example where the computer vision model prediction obtained using the perceptual kernel, although not perfect, appears qualitatively better than the one obtained using the Gaussian kernel. We also show the distribution of error per joint for the two predictions. Notice that even if most of the joints in the two predictions have similar errors, there are four joints (left elbow, left wrist, right elbow, right wrist) with extremely large errors, making the pose perceptually very different from the one shown in the stimulus image.

    Fig. 25 Mean per joint localization error for poses estimated using the Gaussian and perceptual kernel (left: all poses; center: easy poses; right: hard poses)

    Fig. 26 a Test image, b prediction of KDE with Gaussian kernel, c prediction of KDE with perceptual kernel, d per joint error distribution for the predictions obtained with the Gaussian and perceptual kernel, respectively


    The largest difference between humans and KDE on the 120 poses is in the case of hard poses. We analyze whether this also holds when comparing the results obtained using the Gaussian kernel and the results obtained with the perceptual kernel. Since the test dataset is more than two orders of magnitude larger than the initial 120 poses shown to subjects (25,000 images), in order to automatically split it into easy and hard poses we used a KNN procedure (K = 1) and assigned an easy or hard label to each pose in the test set, based on our easy or hard assignment of the 120 poses. Figure 25 (center and right) shows the modified MPJLE error split for the easy and hard poses in the test set when using a Gaussian and a perceptual kernel, respectively. Notice a more pronounced improvement of the perceptual kernel over the Gaussian one in the case of hard poses. It is interesting that this is exactly the case where humans outperform KDE the most (see Fig. 20).
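    A sketch of this 1-NN label transfer is shown below, under the assumption that poses are compared with a plain Euclidean distance on joint coordinates; names and shapes are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def transfer_difficulty_labels(test_poses, ref_poses, ref_labels):
    """1-NN label transfer: each test pose inherits the easy/hard label of its nearest
    pose among the 120 reference poses, under Euclidean distance on joint coordinates.

    test_poses: (n_test, d), ref_poses: (n_ref, d), ref_labels: length n_ref sequence.
    """
    nearest = np.argmin(cdist(test_poses, ref_poses), axis=1)
    return np.asarray(ref_labels)[nearest]
```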

    5.4 Human Evaluation

    We conducted a psychophysical experiment to better understand whether the poses obtained using KDE with a perceptual kernel are perceptually closer to the ground truth than those obtained using the Gaussian kernel. In our experiment we showed 50 images of humans in different poses, together with the predictions obtained with both the Gaussian and the perceptual kernels, asking seven independent users (four male and three female) to decide which of the two predictions is more similar to the ground truth. We randomized the order in which the two predictions appear. The 50 poses were chosen randomly from the test poses. We added a single condition, that the difference between the two predictions be larger than a 50 mm threshold, to emphasize cases where the two metrics produced sufficiently different results.

    The results are shown in Table 10. On average, people chose the prediction based on the perceptual kernel as being more truthful to the stimuli in 56% of the cases. The one obtained with the standard Gaussian kernel was judged superior in only 30% of the cases. In 13% of the cases subjects reported that they could not decide between the two. We also checked how well subjects agree in their responses, by considering how many times at least s of them gave the same answer, where s is varied between 4 and 7. Results are shown in Table 11. We see that in 72% of the cases at least five subjects preferred the same pose, and in 30% of the cases all subjects chose the same answer. Our results indicate that people consistently considered the poses produced by a computer vision model using the perceptual kernel as more truthful to the stimuli than those obtained using the Gaussian kernel.

    Table 10 Subjects' preference for poses obtained with a perceptual kernel and with a Gaussian kernel

    Subjects Perceptual kernel Gaussian kernel Undecided

    S1 56% 36% 8%

    S2 54% 40% 6%

    S3 62% 28% 10%

    S4 58% 24% 18%

    S5 58% 28% 14%

    S6 44% 22% 34%

    S7 60% 34% 6%

    Average 56.0% ± 5.8 30.2% ± 6.5 13.7% ± 9.9

    On average, in 56% of cases, poses obtained with the perceptual kernel are considered more truthful to the stimuli, while poses obtained with the Gaussian kernel were judged superior in only 30% of cases. Subjects could not decide between the two in 13% of the cases

    Table 11 Subjects' preference agreement

    Number of subjects (s) 4 5 6 7

    Agreement 94% 72% 58% 30%

    We checked how often (out of 50 poses shown) at least s subjects (top row) gave the same answer, where s is varied between 4 and 7

    6 Conclusions

    The visual analysis of humans is an active computer vision research area, yet it faces open problems regarding what elements of meaning we should detect, what elements of the pose we should represent and how, and what the acceptable levels of accuracy are for different human sensing tasks. In this paper we have taken an experimental approach to such questions by investigating the human pictorial space, consisting of the set of perceptually valid articulated pose configurations that humans re-enact when presented with image stimuli. We have developed a novel apparatus for this task, constructed a publicly available dataset, and performed quantitative analysis to reveal the level of human performance, the accuracy in pose re-enactment tasks, as well as the structure of eye movement patterns and their correlation with pose difficulty.

    We have also discussed the implications of such findings for the construction of computer-vision based human sensing systems, proposed to learn perceptual metrics for 3D human pose prediction, and extensively studied their effectiveness for large scale automatic 3D human pose estimation using Human3.6M (Ionescu et al. 2014b). While 3D human pose estimation has nowadays become a mainstream research area, training and performance evaluation are still largely based on ad-hoc Euclidean metrics that do not degrade gracefully, leading to results that can sometimes be meaningless. Moreover, current systems lack mechanisms that would allow them to calibrate to the level of performance required for different human sensing tasks, which


    may have very different constraints. Our study indicates that learning perceptual metrics can be performed effectively. Moreover, when these are used for prediction in connection with automatic computer vision models, they produce more meaningful pose estimates than their plain Euclidean counterparts.

    Acknowledgments This work was supported in part by CNCS-UEFISCDI under PCE-2011-3-0438, and JRP-RO-FR-2014-16.

    References

    Agarwal, A., & Triggs, B. (2006). Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 44–58.

    Akhter, I., & Black, M. J. (2015). Pose-conditioned joint angle limits for 3D human pose reconstruction. In IEEE international conference on computer vision and pattern recognition.

    Andriluka, M., Roth, S., & Schiele, B. (2010). Monocular 3D pose estimation and tracking by detection. In IEEE international conference on computer vision and pattern recognition.

    Bar-hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. In International conference on machine learning.

    Bo, L., & Sminchisescu, C. (2009). Structured output-associative regression. In IEEE international conference on computer vision and pattern recognition.

    Bo, L., & Sminchisescu, C. (2010). Twin Gaussian processes for structured prediction. International Journal of Computer Vision, 87, 28–52.

    Bourdev, L., Maji, S., Brox, T., & Malik, J. (2010). Detecting people using mutually consistent poselet activations. In European conference on computer vision. http://www.eecs.berkeley.edu/~lbourdev/poselets.

    Chen, C., Zhuang, Y., Xiao, J., & Liang, Z. (2009). Perceptual 3D pose distance estimation by boosting relational geometric features. Computer Animation and Virtual Worlds, 20, 267–277.

    Chen, X., & Yuille, A. (2014). Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in neural information processing systems (NIPS).

    Cortes, C., Mohri, M., & Weston, J. (2005). A general regression technique for learning transductions. In International conference on machine learning (pp. 153–160).

    Deutscher, J., Blake, A., & Reid, I. (2000). Articulated body motion capture by annealed particle filtering. In IEEE international conference on computer vision and pattern recognition.

    Dickinson, S., & Metaxas, D. (1994). Integrating qualitative and quantitative shape recovery. International Journal of Computer Vision.

    Ehinger, K. A., Hidalgo-Sotelo, B., Torralba, A., & Oliva, A. (2009). Modelling search for people in 900 scenes: A combined source model of eye guidance. Visual Cognition, 17, 945–978.

    Fan, X., Zheng, K., Lin, Y., & Wang, S. (2015). Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation. In IEEE conference on computer vision and pattern recognition (CVPR 2015), Boston, MA, June 7–12.

    Ferrari, V., Marin, M., & Zisserman, A. (2009). Pose search: Retrieving people using their pose. In IEEE international conference on computer vision and pattern recognition.

    Fischler, M. A., & Elschlager, R. A. (1973). The representation and matching of pictorial structures. IEEE Transactions on Computers, 22(1), 67–92. doi:10.1109/T-C.1973.223602.

    Gall, J., Rosenhahn, B., Brox, T., & Seidel, H. (2010). Optimization and filtering for human motion capture: A multi-layer framework. International Journal of Computer Vision, 87, 75–92.

    Harada, T., Taoka, S., Mori, T., & Sato, T. (2004). Quantitative evaluation method for pose and motion similarity based on human perception. International Journal of Humanoid Robotics.

    Huang, C. H., Boyer, E., & Ilic, S. (2013). Robust human body shape and pose tracking. In 3DV 2013, International Conference on 3D Vision (pp. 287–294). Seattle, United States. doi:10.1109/3DV.2013.45. https://hal.inria.fr/hal-00922934. Best paper runner-up award.

    Ionescu, C., Carreira, J., & Sminchisescu, C. (2014a). Iterated second-order label sensitive pooling for 3D human pose estimation. In IEEE conference on computer vision and pattern recognition.

    Ionescu, C., Papava, D., Olaru, V., & Sminchisescu, C. (2014b). Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence.

    Jain, A., Tompson, J., Andriluka, M., Taylor, G. W., & Bregler, C. (2014). Learning human pose estimation features with convolutional networks.

    Johansson, G. (1973). Visual perception of biological motion and a model for its analysis. Perception and Psychophysics.

    Kanaujia, A., Sminchisescu, C., & Metaxas, D. (2007). Semi-supervised hierarchical models for 3D human pose reconstruction. In IEEE international conference on computer vision and pattern recognition.

    Koenderink, J. (1998). Pictorial relief. Royal Society of London A: Mathematical, Physical and Engineering Sciences, 356, 1071–1086.

    Lee, H. J., & Chen, Z. (1985). Determination of 3D human body postures from a single view. Computer Vision, Graphics and Image Processing, 30, 148–168.

    Li, F., Lebanon, G., & Sminchisescu, C. (2012). Chebyshev approximations to the histogram χ2 kernel. In IEEE international conference on computer vision and pattern recognition.

    Li, S., & Chan, A. B. (2014). 3D human pose estimation from monocular images with deep convolutional neural network. In Computer Vision, ACCV 2014, 12th Asian Conference on Computer Vision, Singapore, November 1–5, Revised Selected Papers, Part II.

    López-Méndez, A., Gall, J., Casas, J., & van Gool, L. (2012). Metric learning from poses for temporal clustering of human motion. In R. Bowden, J. Collomosse, & K. Mikolajczyk (Eds.), British machine vision conference (BMVC) (pp. 49.1–49.12). BMVA Press.

    Marino

