
People Watching: Social, Perceptual, and Neurophysiological Studies of Body Perception
Kerri Johnson and Maggie Shiffrar

Print publication date: 2012
Print ISBN-13: 9780195393705
Published to Oxford Scholarship Online: Jan-13
DOI: 10.1093/acprof:oso/9780195393705.001.0001

Computational Mechanisms of the Visual Processing of Action Stimuli

Falk Fleischer, Martin A. Giese

DOI: 10.1093/acprof:oso/9780195393705.003.0022

Abstract and Keywords

Computational models are fundamentally important for testing the feasibility of theories of the visual processing of body movements and for deriving well-defined theoretical predictions that can be tested experimentally. A computational model is proposed for the recognition of transitive and nontransitive hand actions from real videos that reproduces several key neurophysiological properties of the action perception system. Limitations of the proposed model, along with novel predictions and areas for future research, are discussed.

Keywords: computational models, computer vision, transitive actions, nontransitive actions, goal-directed actions, motion processing, form processing, templates, mirror neuron system, dynamic controller architectures, superior temporal sulcus (STS), temporal order, view invariance, position invariance

The recognition of biological motion and actions is a core function of the visual system, with crucial importance for survival and social communication. Motion recognition addresses the processing of body movements of humans and other species. One class of such movements, called nontransitive actions, is not primarily directed toward goal objects. Examples are locomotion, such as walking and running, and many communicative gestures, like waving. Another important class of movements is goal-directed actions, also called transitive movements. These actions are directed toward specific goal objects. Examples are grasping, holding, pushing, or pointing toward objects. In neuroscience, these two subfunctions of motion recognition have been investigated largely independently by different research communities. One group of researchers, coming from vision research, has focused on the visual processing of biological motion stimuli and other body movements, often concentrating on nontransitive actions. Another community, stressing specifically the potential links between representations for action perception and execution, has often focused on goal-directed actions and stressed the dependence of action representations on action goals. Theoretical approaches accounting for the processing of these two types of body motion stimuli have remained largely unrelated, and it is not really clear which processes might be shared between the processing of nontransitive and transitive actions. Many details about the neural mechanisms of motion and action recognition are reviewed in this volume, so we focus here on those aspects that are central for modeling.

The systematic study of the visual recognition of nontransitive actions (without goal objects) has been influenced strongly by the classical work of Johansson (1973). His famous experiments have shown that body movements can be recognized from highly impoverished stimuli, such as point-light walkers. Subsequent studies have demonstrated that perception from point-light stimuli is amazingly robust, for example, against displacements of the dots along the skeleton of the moving figure (e.g., Beintema & Lappe, 2002; Dittrich, 1993), or against masking with substantial numbers of moving noise dots (Bertenthal & Pinto, 1994; Cutting, Moore, & Morrison, 1988; Thornton, Pinto, & Shiffrar, 1998). In addition, point-light stimuli can convey subtle details about motion style, conveying information about the gender, identity, or emotion of walkers (e.g., Beardsworth & Buckner, 1981; Chouchourelou, Matsuka, Harber, & Shiffrar, 2006; Cutting & Kozlowski, 1977; Dittrich, Troscianko, Lea, & Morgan, 1996; Pollick, Lestou, Ryu, & Cho, 2002). See de Gelder (Chapter 20, this volume) for a detailed discussion of the recognition of bodily expressions.

A variety of models have been developed for the recognition of biological motion and actions without goal objects. Early approaches have tried to exploit geometrical invariants of body motion, such as the fact that, for the side view, the distance between dots on the same limb remains approximately constant over the course of the motion (e.g., Hoffman & Flinchbaugh, 1982; Webb & Aggarwal, 1982). Another set of approaches has tried to account for body motion recognition by the fitting of three-dimensional models of body shapes (e.g., Marr & Vaina, 1982). This approach has been extensively extended in computer vision, combining it with probabilistic predictive models (e.g., Blake & Isard, 1998, and many others), but typically without claiming that the developed mechanisms are relevant for the brain.
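
The limb-rigidity invariant mentioned above is easy to state concretely. The following sketch (our illustration, not code from the cited models; the point-light input format and the tolerance threshold are assumptions) tests whether two tracked dots could lie on the same rigid limb by checking that their distance stays approximately constant across the frames of a side view:

    import numpy as np

    def on_same_limb(traj_a, traj_b, tolerance=0.05):
        # traj_a, traj_b: arrays of shape (n_frames, 2) with 2D dot positions.
        # For two markers on the same rigid limb seen from the side, the
        # inter-dot distance stays approximately constant over the sequence.
        distances = np.linalg.norm(traj_a - traj_b, axis=1)
        return distances.std() / distances.mean() < tolerance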

A third class of computational approaches is coarsely inspired by cortical mechanisms and tries to account for motion recognition by the extraction of form and motion features, or spatiotemporal image features, from video sequences. Early methods have compared such features with templates that were either constructed by hand or that had been learned using sequence recognition methods such as hidden Markov models (HMMs) (e.g., Bobick, 1997; Niyogi & Adelson, 1994). In general, the most robust state-of-the-art body motion recognition approaches in computer vision are based on the extraction of spatiotemporal image features combined with appropriate methods for the learning of optimized feature dictionaries and powerful classification methods (e.g., Dollar, Rabaud, Cottrell, & Belongie, 2005; Efros, Berg, Mori, & Malik, 2003; Gorelick, Blank, Shechtman, Irani, & Basri, 2007; Laptev & Lindeberg, 2003). (Much more detailed reviews of computer vision methods for action recognition are given, for example, in Gavrila [1999] and Moeslund and Granum [2001]; for a more detailed review stressing the relationship between such methods and biological models, see Giese [2004].)
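
As a minimal illustration of this family of approaches (a sketch under our own simplifying assumptions, not a reimplementation of any of the cited systems), each video can be reduced to a histogram of coarse spatiotemporal features and classified by comparison with learned templates:

    import numpy as np

    def motion_energy_features(video, n_bins=16):
        # video: array (n_frames, height, width) of gray-level values.
        # Histogram of absolute frame differences: a crude spatiotemporal
        # feature summarizing how much motion occurs in the sequence.
        diffs = np.abs(np.diff(video.astype(float), axis=0))
        hist, _ = np.histogram(diffs, bins=n_bins, range=(0.0, diffs.max() + 1e-9))
        return hist / hist.sum()

    def classify(video, templates):
        # templates: dict mapping action label -> stored feature histogram.
        f = motion_energy_features(video)
        return min(templates, key=lambda label: np.linalg.norm(f - templates[label]))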

Based on the idea of an extraction of motion and form features, a number of biologically inspired models for the recognition of nontransitive actions have been developed that reproduce basic findings from experiments in electrophysiology, psychophysics, and functional imaging (Casile & Giese, 2005; Giese & Poggio, 2003; Lange & Lappe, 2006). These models are based on hierarchies of neural detectors for form and/or motion features that mimic properties of cortical neurons involved in the recognition of motion patterns. In addition, these models assume mechanisms for the learning of motion and shape templates, which potentially determine the tuning of neurons in higher visual areas that are selective for moving bodies, such as the superior temporal sulcus (STS).

While these hierarchical neural models had originally been developed to model biological data, without any technical relevance, new interest in such architectures has recently emerged in computer vision, where it has been shown that such architectures can achieve performance levels in motion classification that are competitive with nonbiological state-of-the-art approaches in computer vision (Escobar, Masson, Vieville, & Kornprobst, 2009; Jhuang, Serre, Wolf, & Poggio, 2007; Schindler & van Gool, 2008; Serre, Wolf, Bileschi, Riesenhuber, & Poggio, 2007).

Research on the perception of transitive actions—actions that are directed toward a goal object—has recently become a very popular topic in neuroscience. A vast number of studies have investigated the perception of goal-directed actions, such as grasping, predominantly using behavioral and imaging methods (see, for example, Rizzolatti, Fogassi, & Gallese, 2001; Rizzolatti & Craighero, 2004; Ferrari, Bonini, & Fogassi, 2009). This research interest has been stimulated by the discovery of so-called mirror neurons in monkey premotor and parietal cortex. These neurons show selective tuning during the visual observation of actions, as well as during action execution (di Pellegrino, Fadiga, Fogassi, Gallese, & Rizzolatti, 1992; Fogassi et al., 2005; Gallese, Fadiga, Fogassi, & Rizzolatti, 1996; Rizzolatti et al., 2001), specifically if such actions are directed toward a goal object. Imaging studies in humans suggest the existence of a mirror neuron system in the human cortex (e.g., Iacoboni et al., 2005; Kilner, Neal, Weiskopf, Friston, & Frith, 2009). However, the significance of such observations in human functional magnetic resonance imaging (fMRI) experiments is still under dispute (e.g., Dinstein, Hasson, Rubin, & Heeger, 2007). At the single-cell level, neurons with visual selectivity for goal-directed actions and for action-related properties of the goal object have been found in the STS (Nelissen, Vanduffel, & Orban, 2006; Perrett et al., 1989) and in the parietal cortex (e.g., in the anterior intraparietal sulcus, or AIP) (Baumann, Fluet, & Scherberger, 2009; Gardner et al., 2007; Murata, Gallese, Luppino, Kaseda, & Sakata, 2000; Sakata, Taira, Kusunoki, Murata, & Tanaka, 1997). A more extensive review of the mirror neuron system in monkeys and humans is provided by Calvo-Merino (Chapter 16, this volume).

The observation of joint motor and visual tuning for actions in the same neurons, or the same areas in imaging studies, has been interpreted as evidence for recognition by resonance. This denotes the hypothesis that actions are recognized by an internal simulation of the visually observed actions in motor representations that are also involved in the control of the execution of the same action (Rizzolatti et al., 2001). In fact, behavioral research has provided an overwhelming amount of evidence (see, e.g., Blakemore & Frith, 2005; Schütz-Bosbach & Prinz, 2007; Wilson & Knoblich, 2005) that neural representations of action execution and action vision are functionally tightly coupled, consistent with the theory of a common coding (Prinz, 1997) of actions in perception and execution (an extensive review of the principles of common coding is provided by van der Wel, Sebanz, & Knoblich, Chapter 7, this volume). However, it remains an unresolved question how exactly the mirror neuron system contributes to the coupling of action execution and perception. A simple mechanism for recognition by resonance would, for example, predict that the tuning of mirror neurons for executed and perceived actions should be highly similar. However, such similarity is far from clearly present in many mirror neurons in area F5 (Caggiano et al., personal communication; Gallese et al., 1996).

Data from behavioral and fMRI studies in humans on action recognition have also been used as support for much further-reaching speculations. It has been suggested that the mirror neuron system might play a fundamental role in the imitative learning of actions and also in the development of language (see Arbib, 2008, for a review). In addition, the involvement of the mirror neuron system in a variety of other higher cognitive functions has been claimed, such as action understanding, theory of mind, empathy, and even the perception of aesthetic qualities (e.g., Frith & Singer, 2008; Freedberg & Gallese, 2007; Gallese & Goldman, 1998; Rizzolatti & Fabbri-Destro, 2008). We think that such extensions of the theory of "recognition by resonance" provide food for many interesting discussions, potentially even including important philosophical implications. However, we do not treat such aspects in the remainder of this chapter, since many of the underlying concepts cannot be easily formulated with sufficient accuracy and strict quantitative links to empirical data, which makes a treatment in the context of mathematical modeling very difficult.

Corresponding to the strong interest in the relationship between action perception and action execution in the context of transitive actions, most biologically motivated computational models in this area have focused on the potential role of motor representations for action recognition. One of the first neural network models with relevance for the biological system (Fagg & Arbib, 1998; Oztop & Arbib, 2002) demonstrated that the recognition of goal-directed actions from video stimuli is possible by comparison of internally simulated sequences of hand and arm configurations with the visual stimulus. This model was later extended, linking it to the MOSAIC model for the selection of motor controllers (Haruno, Wolpert, & Kawato, 2001; Oztop, Kawato, & Arbib, 2006) and including mechanisms that account for audio-visual properties of mirror neurons (Bonaiuto, Rosta, & Arbib, 2007). In general, cortical mechanisms of motor control are assumed to rely on forward models that compute predictions of the sensory signals dependent on the controller output. This prediction can then be compared with the actual sensory feedback. By comparing the predictions of different controller models, which realize different possible actions, with the actual sensory input, it is possible to select the most appropriate controller, which results in the best prediction of the actual sensory input. Once this controller has been found, the underlying action has been recognized (see, e.g., Wolpert, Doya, & Kawato, 2003; Wolpert & Ghahramani, 2000, for further details).
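
This selection principle can be sketched compactly (our illustration of the general idea, with invented interfaces; not the MOSAIC implementation): each candidate controller's forward model predicts the next sensory state, and the action whose predictions accumulate the smallest error counts as recognized.

    import numpy as np

    def recognize_action(observations, forward_models):
        # observations: array (n_steps, dim) of observed sensory states.
        # forward_models: dict mapping action label -> function that predicts
        # the next sensory state from the current one.
        errors = {label: 0.0 for label in forward_models}
        for t in range(len(observations) - 1):
            for label, predict in forward_models.items():
                prediction_error = predict(observations[t]) - observations[t + 1]
                errors[label] += np.sum(prediction_error ** 2)
        # The controller with the best prediction identifies the action.
        return min(errors, key=errors.get)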

The underlying principles are illustrated in Figure 22-1. During action execution (panel A), the motor controller generates a motor command that is mapped through a predictive forward model into predicted sensory signals. These signals are compared to the true sensory feedback, and the prediction error is used to update the controller. In the context of action observation (panel B), the controller runs without this error input and generates predictions based on the actual sensory inputs. After optimization of the controller, the sensory input activates the motor command that would be compatible with the actually observed action.

Figure 22-1. Recognition of actions based on predictions generated by forward models. (A) During action execution, a controller generates a motor command that depends on the desired motor state and an error signal, which results from a comparison of the true sensory input and the sensory input predicted from the motor command by an internal forward model. In the absence of any perturbations, the predicted perceptual state matches exactly the sensory feedback, so that the error signal disappears. (B) During action observation, no real motor output is generated. However, in the context of an internal simulation, motor commands might be generated that are mapped onto predicted sensory outcomes by forward models. By computing the difference between the predicted sensory outcomes and the true sensory information, it is possible to determine the dynamic state that would correspond to the actually observed movement, and to determine the type of the observed action by comparing the prediction errors between multiple available controller models. Modified from Wolpert, D. M., Doya, K., & Kawato, M. (2003). A unifying computational framework for motor control and social interaction. Philosophical Transactions of the Royal Society, 358, 593–602.

Biologically inspired dynamical controller architectures for action recognition that are based on these ideas have been successfully applied in the context of robot systems (e.g., Demiris & Simmons, 2006; Sauser & Billard, 2006; Schaal, Ijspeert, & Billard, 2003). Several models in this context have largely bypassed the aspect of visual processing by making strongly simplifying assumptions about the processing up to parietal areas and the STS. These models assume, for example, that the three-dimensional structure of effector and goal object and their metric relationship are given by the visual system in terms of low-dimensional variables, such as joint angles, which then serve as input for the dynamic controller architecture (e.g., Erlhagen, Mukovskiy, & Bicho, 2006; Tani, Ito, & Sugita, 2004; Wolpert et al., 2003). However, these approaches do not answer the question of how a sufficiently accurate estimation of such low-dimensional parameters is possible. Such estimation is a difficult computational problem, especially from monocular image sequences, which do not provide depth information through disparity cues. Yet humans and animals are very good at recognizing actions from stimuli without such depth cues.

To summarize, only a few of the existing computational approaches for the recognition of goal-directed actions work on real video stimuli at all (Billard & Mataric, 2001; Oztop & Arbib, 2002; Demiris & Simmons, 2006; Metta, Sandini, Natale, Craighero, & Fadiga, 2006; Kjellström, Romero, Martínez, & Kragić, 2008). None of these models exploits mechanisms that approximate the functions of neurons in the visual cortex, except for a single model for the estimation of grip aperture from video sequences (Prevete, Tessitore, Santoro, & Catanzariti, 2008).

The previous overview of the existing work raises several questions. Which physiologically plausible mechanisms are computationally powerful enough to accomplish the visual recognition of transitive, goal-directed body movements from real video stimuli? Only architectures based on such mechanisms seem ultimately suitable as a basis for the development of more detailed neural models of visual action recognition. How are the principles for the recognition of transitive and nontransitive actions related? Can a part of the architecture for the recognition of nontransitive actions be exploited as well for the recognition of transitive actions? Which additional computational and neural steps are required for the recognition of transitive actions with goal objects?

We will try to provide answers to these questions in this chapter: The next section presents a short overview of an established, basic architecture for the recognition of nontransitive actions. This architecture accounts for a variety of experimental data, and it has recently also been applied in biologically inspired computer vision, where it has reached competitive performance levels. In the section "Neural Architecture for the Recognition of Transitive Actions," we introduce an extension of this model architecture that makes it suitable for the recognition of transitive actions. The required additional computational mechanisms are explained in detail. In the section "Simulation Results and Predictions," we show a few simulation results, illustrating that the proposed models are not only powerful enough to recognize actions from real videos, but also reproduce key properties of action-selective cortical neurons. Furthermore, we show some predictions that can easily be validated at the level of single neurons. The last section discusses several limitations of the proposed model and presents concluding remarks.

Basic Model for the Visual Recognition of Nontransitive Actions

Here, we present a basic model architecture for the recognition of body movements that are not directed toward specific goal objects, such as walking or waving. The presented model has been compared in detail with a variety of experimental results (Casile & Giese, 2005; Giese & Poggio, 2003), and it has motivated several new experiments that have partially confirmed predictions made by the model (e.g., Jastorff & Orban, 2009; Peuskens, Vanrie, Verfaillie, & Orban, 2005; Singer & Sheinberg, 2010; Thurman & Grossman, 2008; Vangeneugden, Pollick, & Vogels, 2008). Furthermore, the basic architecture of the model has given rise to computationally more efficient implementations that have demonstrated the feasibility of the proposed approach even for technical motion recognition systems, reaching state-of-the-art performance levels (e.g., Jhuang et al., 2007).

The proposed motion recognition model is based on a number of principles that are well established for the visual cortex, and some of them are shared with the processing of static shapes (e.g., Riesenhuber & Poggio, 1999). An overview of the model architecture is shown in Figure 22-2.

Two Hierarchical Neural Pathways

Consistent with the basic anatomical architecture of the visual cortex, the model is partitioned into two hierarchical neural pathways that model the ventral and dorsal processing streams (Felleman & van Essen, 1991; Goodale & Milner, 1992; Ungerleider & Mishkin, 1982). The first pathway is specialized for the processing of form information, while the second pathway processes local motion information. Consistent with electrophysiological data (Saleem, Suzuki, Tanaka, & Hashikawa, 2000), these two pathways converge at a level that corresponds to the STS. While in the real visual cortex these two processing streams are likely connected at multiple levels, the model makes the oversimplifying assumption that they remain separate until the level of the STS, where they are integrated.

Figure 22-2. Overview of the basic architecture for the recognition of non-goal-directed body movements, with two pathways for the processing of form and optic flow information. Abbreviations indicate potentially corresponding areas in monkey and human cortex (V, visual cortex; M(S)T, medial (superior) temporal cortex; KO, kinetic-occipital area; IT, inferior temporal cortex; EBA, extrastriate body area; FBA, fusiform body area; IPL, inferior parietal lobule; F5, premotor cortex). (See color insert.)

Both pathways consist of hierarchies of neural detectors. Consistent with real neurons in the visual pathway, the complexity of the extracted features and the sizes of the receptive fields of the neural detectors increase along the hierarchy. The sizes of the receptive fields at different hierarchy levels coarsely match the receptive field sizes of corresponding neurons in the visual pathway. Invariance against translation and scaling of features is accomplished by nonlinear pooling along the hierarchy using a maximum operation (Fukushima, 1980; Riesenhuber & Poggio, 1999).

The neural detectors in the form pathway process form features with different levels of complexity. The first hierarchy level is formed by local orientation detectors, mimicking the properties of simple cells in area V1 (Hubel & Wiesel, 1962). The response properties of these cells were modeled by Gabor filters. These are local filters that respond maximally to grating stimuli of a particular orientation and spatial frequency. The detectors that form the next hierarchy level mimic the behavior of complex cells (e.g., in areas V2 or V4). Their responses were determined by computing the maximum of the responses of groups of Gabor filters with the same orientation preference, but slightly different receptive field centers and different spatial scales. It has been shown that such "maximum pooling" results in (partial) position and scale invariance and increases the robustness of the responses of the orientation detectors against background clutter (Riesenhuber & Poggio, 1999). Related models for the recognition of static shapes have added additional layers at this level, forming successively more complex form features by combining features from previous layers (Riesenhuber & Poggio, 1999; Serre et al., 2007). The resulting detectors extract form features of intermediate complexity, with tuning properties that match quite accurately those of neurons in area V4 (Cadieu et al., 2007). However, for the recognition of body motion from simple stimuli, the inclusion of more complex intermediate-level form features was not necessary (Giese & Poggio, 2003).
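
The first two levels of this hierarchy can be sketched as follows (a simplified illustration; the filter parameters and pooling size are assumptions, not the published values): Gabor filtering yields orientation-selective "simple cell" responses, and a local maximum operation over positions yields "complex cell" responses with partial position invariance.

    import numpy as np
    from scipy.signal import convolve2d
    from scipy.ndimage import maximum_filter

    def gabor_kernel(orientation, wavelength=8.0, sigma=3.0, size=15):
        # Odd-symmetric Gabor filter with a preferred orientation (radians).
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(orientation) + y * np.sin(orientation)
        return (np.exp(-(x**2 + y**2) / (2 * sigma**2))
                * np.sin(2 * np.pi * xr / wavelength))

    def simple_cells(image, n_orientations=12):
        # Rectified responses of orientation detectors, one map per orientation.
        return [np.maximum(convolve2d(image,
                                      gabor_kernel(np.pi * i / n_orientations),
                                      mode='same'), 0.0)
                for i in range(n_orientations)]

    def complex_cells(simple_maps, pool_size=7):
        # Maximum pooling over a local neighborhood of positions, producing
        # partial position invariance (cf. Riesenhuber & Poggio, 1999).
        return [maximum_filter(m, size=pool_size) for m in simple_maps]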

The next higher level of the form recognition hierarchy consists of detectors that are selective for complex shapes, corresponding to body configurations that are characteristic for "snapshots" from movement sequences. It was assumed that these shape-selective neurons are similar to view-tuned neurons, as described in area IT of monkeys (Logothetis & Sheinberg, 1996). These detectors were modeled by Gaussian radial basis functions (RBFs). These are model neurons with a Gaussian tuning function, typically within a multidimensional feature space. Their maximum response arises for a template feature vector (called the "center" of the RBF), and the response decays monotonically with the distance between the input vector and this template vector. Physiologically plausible circuits for the implementation of RBFs have been proposed by Kouh and Poggio (2008). The templates (RBF centers) were determined by learning from training image sequences, setting them to the output vectors from the previous hierarchy level that resulted from keyframes of the motion pattern, which were obtained by sampling with equidistant time steps. Neurons with properties similar to such snapshot neurons have been observed in the STS of monkey cortex (e.g., Oram & Perrett, 1996; Perrett et al., 1985, 2009), and, in fact, strong electrophysiological evidence for the existence of such neurons has recently been provided (Singer & Sheinberg, 2010; Vangeneugden et al., 2009).
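
Written out, a snapshot neuron is then just a Gaussian RBF over the feature vector delivered by the hierarchy (a sketch; the bandwidth sigma is an assumed parameter):

    import numpy as np

    def snapshot_response(feature_vector, center, sigma=1.0):
        # Gaussian RBF: maximal response when the input matches the stored
        # keyframe template (the RBF "center"), decaying monotonically with
        # distance in feature space.
        d2 = np.sum((feature_vector - center) ** 2)
        return np.exp(-d2 / (2.0 * sigma**2))

    # The centers would be set to the hierarchy's output vectors for keyframes
    # sampled at equidistant time steps from a training sequence.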

The motion pathway has, in principle, the same architecture as the form pathway. In this case, the extracted features depend on the local motion in the stimulus. The first hierarchy layer of this pathway contains local motion energy detectors that are selective for different local motion directions and speeds. These detectors model motion-selective neurons in primary visual cortex and the mediotemporal area (MT) (Smith & Snowden, 1994). Extensions of the original model have realized this level using different types of motion detectors that are suitable for the processing of real video sequences (Escobar et al., 2009; Jhuang et al., 2007). The second hierarchy level of the motion pathway consists of detectors for more complex local optic flow patterns; that is, motion patterns that integrate different directions and speeds with specific local spatial configurations. In the original models, these patterns were predefined (translation, opponent motion in different directions). In more recent implementations, "dictionaries" of optimized intermediate-level optic flow features have been learned from example videos (Jhuang et al., 2007; Sigala, Serre, Poggio, & Giese, 2005). The resulting detectors correspond to motion-selective neurons, potentially in areas MT and MSTl, that are selective for complex intermediate optic flow features (e.g., Allman, Miezin, & McGuinness, 1985; Born, 2000; Eifuku & Wurtz, 1998; Xiao, Raiguel, Marcar, Koenderink, & Orban, 1995).

The next higher level of the motion pathway consists of neural detectors for complex optic flow patterns that arise temporarily during action stimuli. These detectors are equivalent to the snapshot neurons in the form pathway. They are modeled by radial basis functions whose centers were determined by the feature vectors from the previous level, derived from training videos, just like the RBFs of the snapshot neurons in the form pathway. Neurons with selectivity for such highly complex motion patterns have been found in the STS (e.g., Perrett et al., 1985; Vangeneugden et al., 2009).

It is important to notice that the described optic flow pattern detectors are selective for patterns of local motion information with complex spatial organization. This spatial organization is holistic and covers the whole action stimulus. (Such holistic recognition mechanisms are sometimes termed "configural" in the psychophysical literature.) It is important to note here that holistic recognition can be accomplished based on form features (as for the snapshot neurons) and also on local motion features, since local motion also carries spatial information. The argument that holistic or configural recognition mechanisms automatically imply form-based processing is thus a conceptual confound that sometimes can be found in the related literature.

Recent work on similar models for object and motion recognition in computer vision has shown that, to accomplish good performance on real images and video sequences, it is crucial to optimize the tuning properties of the neural detectors at the intermediate levels of the hierarchy by learning from image data (Jhuang et al., 2007; Serre et al., 2007). This approach is also advanced in the section "Shape Recognition Hierarchy" below.

Selectivity for Temporal Order

The recognition of action patterns is critically dependent on temporal order; that is, the temporal sequence in which body shapes arise during the stimulus. To account for this effect, the model assumes that the snapshot and optic flow pattern neurons are embedded in dynamic recurrent neural networks (i.e., nonlinear neural networks with lateral connections) that make their responses dependent on the sequential temporal order. The details of this network are described later, in the section "Neural Architecture for the Recognition of Transitive Actions." As a consequence of the lateral connections between them, the individual snapshot neurons fire strongly only if the corresponding body shapes occur in the right temporal context. Showing the same stimulus sequence in reverse or scrambled temporal order results in a substantial decay of their neural responses (Giese & Poggio, 2003). Likewise, due to this network dynamics, the optic flow pattern neurons in the motion pathway respond strongly only if the relevant optic flow patterns arise in the right temporal sequence. The form of these lateral connections can easily be established by Hebbian learning (Jastorff & Giese, 2004), a physiologically plausible learning rule that makes synaptic changes dependent on the correlations between pre- and postsynaptic signals and their relative timing.
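
As a rough sketch of how such asymmetry can emerge (our illustration; the learning rate and delay are assumptions, and the resulting connection matrix would still have to be reduced to a translation-invariant kernel), a time-asymmetric Hebbian rule strengthens the connection from snapshot neuron l to neuron k whenever l fires shortly before k, which happens systematically when training sequences are shown in their natural order:

    import numpy as np

    def hebbian_asymmetric(responses, delay=1, rate=0.01):
        # responses: array (n_steps, n_neurons) of snapshot neuron activity
        # during training. Strengthens w[k, l] when neuron l is active
        # 'delay' steps before neuron k, yielding asymmetric lateral
        # connections that support sequence selectivity.
        n_neurons = responses.shape[1]
        w = np.zeros((n_neurons, n_neurons))
        for t in range(delay, responses.shape[0]):
            w += rate * np.outer(responses[t], responses[t - delay])
        return w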

The highest hierarchy levels of the form and motion pathways are formed by motion pattern detectors that temporally smooth and summate the activity of all snapshot neurons and optic flow pattern neurons that encode the same motion pattern. These neural detectors respond during the presence of a particular action (like "walking," "marching," "boxing," etc.). Their response is strongly sequence-selective, so that they do not become activated by actions shown in reverse temporal order (Giese & Poggio, 2003). Neurons with similar properties have been found in the STS (Jellema & Perrett, 2006; Oram & Perrett, 1996; Perrett et al., 1985; Vangeneugden et al., 2009).
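
These top-level detectors can be sketched as leaky integrators that sum the activity of all snapshot (or optic flow pattern) neurons encoding one action (the time constant is assumed for illustration; sequence selectivity is inherited from the recurrent snapshot network):

    import numpy as np

    def motion_pattern_neuron(snapshot_activity, dt=0.01, tau=0.15):
        # snapshot_activity: array (n_steps, n_snapshots) for one action.
        # Returns the temporally smoothed sum of all snapshot responses.
        y, trace = 0.0, []
        for s in snapshot_activity:
            y += dt / tau * (-y + s.sum())   # leaky temporal smoothing
            trace.append(y)
        return np.array(trace)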

Integration of Form and Motion

In the cortex, the two visual pathways converge at the level of the STS (Saleem et al., 2000). This convergence of form and motion has been a central feature of the model by Giese and Poggio (2003) and can be modeled simply by summing the responses of the motion pattern neurons in the form and motion pathways. Alternatively, one could also assume common motion pattern neurons for both pathways. The detailed mechanisms of the fusion of form and motion in body motion recognition in the visual cortex remain unknown and have to be decided based on appropriate data from single cells. However, experimental evidence suggests that both pathways interact at earlier levels than the STS and that there are top-down influences from motion pattern recognition into earlier levels of the motion and form pathways (e.g., Peuskens et al., 2005). Such top-down influences are not captured by the existing neural models for motion pattern recognition. A detailed discussion of top-down processes is provided by Thornton (Chapter 3, this volume). In addition, Saygin (Chapter 21, this volume) gives a detailed discussion of the integration of form and motion information in biological motion processing based on human imaging data.

Deviating from the idea of a fusion of form and motion cues (e.g., at the level of the STS), as proposed in our model, a lively discussion has emerged in the field of biological motion processing that tries to address the question of whether the recognition of point-light walkers is based exclusively on form or exclusively on motion information. The starting point of this discussion was a novel point-light stimulus (Beintema & Lappe, 2002) that presented the dots at randomized positions on the skeleton in every frame, yet still elicited the percept of biological motion in human observers, even though it reduced the amount of available local motion information. This result has motivated the hypothesis that biological motion recognition might be based exclusively on form templates, basically independent of local motion except for "segmentation" of the figure from the background. As further support for this hypothesis, Lange and Lappe (2006) have proposed a neural form template fitting model, which is very similar to the form pathway of the model discussed earlier, but lacks mechanisms for scale and position invariance. In detailed simulations, they showed that, assuming an appropriate prepositioning of the template over the stimulus, the model resulted in good curve fits for several psychophysical results. However, more detailed simulations with our model, including both a motion and a form pathway, suggested that spontaneous generalization between normal and point-light stimuli might be much easier to obtain in the motion pathway. In addition, these simulations showed that the stimulus by Beintema and Lappe (2002) contained substantial amounts of local motion information and is thus also suitable for recognition by a purely motion-based model (Casile & Giese, 2005).

The question of whether form or motion features contribute to the perception of body motion and actions seems interesting and has stimulated a large number of novel studies (e.g., Beintema, Georg, & Lappe, 2006; Lange, Georg, & Lappe, 2006; Casile & Giese, 2005; Thirkettle, Benton, & Scott-Samuel, 2009; Thurman & Grossman, 2008). From our point of view, however, extreme positions that posit "only form" or "only motion" as being relevant for the processing of body motion are not particularly helpful for developing a deeper understanding of brain function. From the point of view of computational efficiency, it seems extremely unlikely that the brain discards robust features like local motion from the recognition process. Also, some of the arguments in favor of an exclusively form-based recognition process seem problematic. For example, the stimulus presented by Beintema and Lappe (2002) contains substantial local motion (Casile & Giese, 2005) and thus does not provide an example of local motion-free biological motion recognition. In addition, the idea of fitting form templates (Lange & Lappe, 2006) completely bypasses the question of how such templates are positioned on the stimulus. This problem is especially critical in the presence of motion clutter. For dynamic backgrounds, a segmentation of the figure from the background using motion, as a basis for a subsequent simple fitting of a form template, is clearly not possible. In this case, segmentation and recognition of the stimulus cannot be decoupled, and a simple testing of all possible template positions (and scales) is computationally not tractable. However, subjects are easily able to recognize point-light stimuli of arbitrary size and position in motion clutter. Moreover, predictions from the hypothesis of exclusively form-based processing of biological motion are contradicted by experimental data (e.g., Thurman & Grossman, 2008). Furthermore, there is accumulating evidence from behavioral and imaging studies that supports an essential involvement of both form and motion information in the recognition of biological motion. Likewise, it would be easy to come up with a similar list of arguments against a purely motion-based processing of biological motion stimuli. In our view, a cue fusion account thus seems not only computationally more efficient, but also much more plausible in terms of cortical processing.

Limitations of the Basic Architecture

The described basic model architecture reproduced a range of empirical facts from electrophysiology, psychophysics, functional imaging, and even lesion studies in patients (e.g., Giese & Poggio, 2003). It specifically reproduced the recognition of point-light stimuli despite substantial amounts of motion clutter, without a priori knowledge of the position of the moving figure, and even from stimuli with degraded motion information, such as the one proposed by Beintema and Lappe (Casile & Giese, 2005). Also, the model reproduced the view-dependence of neurons that are selective for biological motion stimuli (Oram & Perrett, 1996; Perrett et al., 1985).

However, the original model was tested only with simplified artificial stimuli that had been used in neuroscience experiments. Recent extensions of such architectures have included neural models for the estimation of optic flow from gray-level videos. In addition, some of these studies have included learning of optimized mid-level features from example image sequences. A validation on benchmark databases from computer vision showed that such architectures can reach performance levels that are competitive with state-of-the-art action recognition systems in computer vision that are not inspired by biology (Escobar et al., 2009; Jhuang et al., 2007; Schindler & van Gool, 2008).

In a biological context, however, future work must address a number of serious limitations of the proposed basic model architecture. For example, the described models have a predominantly feed-forward architecture and neglect the influence of attentional modulation and context information that is present in motion recognition in biological systems. (See Thompson, Chapter 18, this volume, for a more detailed discussion of these issues.) In addition, this biologically motivated work on action recognition has focused solely on nontransitive actions, which are not directed toward goals or objects. The major purpose of this chapter is to propose an extension of the original architecture that can account for the recognition of transitive actions by the inclusion of a small set of additional neural mechanisms that are consistent with facts known from neurophysiology.

Neural Architecture for the Recognition of Transitive Actions

The architecture described so far reproduces a number of properties of the neural mechanisms in the recognition of nontransitive actions, such as walking or waving, which do not involve a direct interaction with a goal object. In this section, we describe several extensions that make the general architecture described earlier suitable for the recognition of transitive actions that are directed toward goal objects and specify interactions with them. More specifically, the proposed novel model accounts for the visual recognition of grasping acts from natural video sequences. In comparison with the model described above, this novel architecture is distinguished by the following features:

1. In order to limit the computational complexity of the first implementation, the present version of the model contains only a form pathway. It turned out that form-based recognition was sufficient to accomplish relatively robust recognition from grasping videos. However, it seems possible that, in the brain, the recognition of grasping acts combines form and motion features. A later extension of the model that includes a motion pathway seems easily possible.

2. To accomplish robust performance on real video sequences, the model was extended by an unsupervised learning algorithm that optimizes the form features of the detectors at the middle hierarchy levels.

3. The model includes additional neural circuits that model computational mechanisms likely located in the STS and in the parietal and premotor cortex. These circuits combine the information about the effector (hand) movement and the shape and location of the goal object, that is, the grasped object.

Figure 22-3 presents an overview of the extended architecture. Panel A shows the neural hierarchy for the recognition of effector and object shapes. Its architecture closely resembles the form pathway of the original model presented in Figure 22-2. Panel B shows the additional proposed neural mechanisms that integrate the information about the goal object and the effector, and which are required to realize a view-independent recognition of hand actions. These additional mechanisms are sketched below in more detail. A more extensive description of the model can be found in Fleischer et al. (2008, 2009a,b).

Figure 22-3. Overview of the extended architecture for the recognition of goal-directed actions. (A) Neural hierarchy for the recognition of effector and object shapes. (B) Mechanisms for the integration of the information about the goal object and the effector (hand). Reprinted from Fleischer, F., Casile, A., & Giese, M. A. (2009). View-independent recognition of grasping actions with a cortex-inspired model. Proceedings IEEE International Conference on Humanoid Robots. doi: 10.1109/ICHR.2009.5379524. With permission of the publisher, IEEE. (See color insert.)

Shape Recognition Hierarchy

The recognition of effector (hand) and object shape is based on the hierarchical neural framework introduced earlier, which was derived from well-established object recognition models (Mel & Fiser, 2000; Perrett & Oram, 1993; Riesenhuber & Poggio, 1999; Rolls & Milward, 2000; Serre et al., 2007). An overview of the shape recognition hierarchy is shown in Figure 22-3A.

The first hierarchy level consists of orientation detectors that were modeled by Gabor filters, with 12 different preferred orientations and two different spatial scales. "Complex cell" responses were computed from the output signals of these detectors by pooling the responses of detectors with the same orientation preference and spatial scale, but slightly different spatial preferences, using a maximum operation (see the section "Two Hierarchical Neural Pathways"). The spatial receptive fields of these complex cells correspond to a size of about 1.0 degrees, matching approximately the parameters derived from electrophysiological experiments. The output signals of all levels of the neural hierarchy were thresholded using linear threshold functions.

The next higher level extracts form features of intermediate complexity. Unlike in the original model described earlier, the selectivity of these mid-level form detectors was optimized by learning. The detectors were modeled by Gaussian radial basis functions (RBFs, see above) whose centers were defined by input signals from the previous hierarchy layer. Based on a set of training videos showing grasping actions, a limited set (dictionary) of such intermediate form features was learned using a simple greedy clustering procedure that preserves frequently occurring features, while it tends to eliminate new features that are redundant according to a similarity measure. The learned feature representation thus reflects the feature statistics present in the training data (Fidler et al., 2008; Mutch & Lowe, 2006; Serre et al., 2007). The thresholded responses of the learned feature detectors were pooled over a local spatial neighborhood with a diameter of 1.7 degrees, again using a maximum operation, thus generating detector responses with higher position invariance. The same procedure can be cascaded in order to generate several intermediate layers that extract more and more complex form features. For the given implementation, we tested between two and four intermediate layers, depending on the required recognition accuracy.
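
The dictionary learning step can be sketched as a single greedy pass over the training features (our simplified reading of such procedures; the distance threshold is an assumption): a feature becomes a new dictionary entry only if it is not already well represented by a stored entry, so frequent feature types are kept and redundant ones are discarded.

    import numpy as np

    def learn_dictionary(features, threshold=0.5):
        # features: iterable of feature vectors extracted from training videos.
        # Greedy clustering: keep a feature as a new entry only if its distance
        # to every stored entry exceeds the threshold.
        dictionary = []
        for f in features:
            if all(np.linalg.norm(f - d) > threshold for d in dictionary):
                dictionary.append(f)
        return dictionary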

As for the model architecture discussed earlier, the neural detectors at the highest level of the shape recognition hierarchy were given by Gaussian RBFs whose centers corresponded to keyframes from training image sequences. These feature detectors are selective for individual views of objects and hands, and they correspond to the body shape snapshot detectors described earlier. Unlike equivalent detectors in most standard object recognition models (e.g., Riesenhuber & Poggio, 1999), these highest-level shape detectors are not completely position-invariant. Instead, there exists a small ensemble of detectors with different spatial preferences (tuning width corresponding to about 3.7 degrees) for each learned shape. This makes it possible to read out the two-dimensional retinal position of the recognized shapes from this detector population using a simple population vector approach. (For example, an estimate of the stimulus position is given simply by a weighted average of the preferred positions of the individual detectors, where the weights are given by their normalized responses.) It will be shown below that the extraction of position information is important for the recognition of effective goal-directed actions, since it is crucial for the computation of the spatial relationship between the effector and goal object.
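This population-vector readout amounts to a single weighted average; a minimal sketch (Python; variable names are illustrative):

```python
import numpy as np

def population_vector(preferred_positions, responses):
    """Estimate stimulus position as the response-weighted average of
    the detectors' preferred positions (population vector readout)."""
    w = np.maximum(np.asarray(responses, dtype=float), 0.0)
    w = w / (w.sum() + 1e-9)                     # normalized responses
    return w @ np.asarray(preferred_positions)   # (N,) @ (N, 2) -> (x, y)

# Example: three detectors tuned to different retinal positions (degrees)
positions = np.array([[0.0, 0.0], [3.7, 0.0], [7.4, 0.0]])
activity = np.array([0.2, 1.0, 0.3])
print(population_vector(positions, activity))    # estimate near middle unit
```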

Several results from electrophysiology show that shape-selective neurons in ventral areas (such as IT) are characterized by a limited degree of position invariance (Baker, Keysers, Jellema, Wicker, & Perrett, 2000; DiCarlo & Maunsell, 2003; Lehky, Peng, McAdams, & Sereno, 2008; for a review, see Kravitz, Vinson, & Baker, 2008), supporting this assumption in our model.

Selectivity for Temporal Order

For the recognition of effector movements (like the closing of the grasping hand), the responses of the snapshot neurons were associated over time using a dynamic neural network. For this purpose, the responses of hand shape detectors with different spatial preferences were pooled using a maximum operation, in order to realize position-independent detectors for the recognition of individual hand shapes. In addition, we trained a linear remapping of the responses of these detectors onto a one-dimensional activity map that provides input for the dynamic neural network. This mapping compensates for the fact that natural grips are characterized by temporally strongly nonuniform shape variations of the hand (see Fleischer et al., 2009b, for further details). The resulting input distribution can be characterized by a time-dependent vector with the elements r_k(t), where the index k indicates the position of the activity variable in the map. If the trained grasping action is presented, an activity peak arises in this map that propagates with approximately constant speed.
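A sketch of this preprocessing step (Python; the matrix W is learned in the full model and is only a placeholder here):

```python
import numpy as np

def snapshot_input(shape_responses, W):
    """Hand-shape detector responses (n_shapes x n_positions) are pooled
    over spatial preferences with a MAX, yielding position-independent
    shape responses; a learned linear map W then distributes them onto
    the 1-D activity map with elements r_k(t)."""
    pooled = shape_responses.max(axis=1)
    return W @ pooled        # r_k(t), input to the dynamic neural network
```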

A dynamic neural network that results in sequence-selective responses can be derived from dynamic neural fields with asymmetric lateral connections (Giese, 1999; Xie & Giese, 2002; Zhang, 1996). (A neural field is an idealized model for a network of neurons that encodes continuous parameters. In this case, the network can be approximated by a continuous neural medium with a continuous instead of a discrete index for the individual neurons. This approximation can make the mathematical treatment substantially easier. See, e.g., Amari, 1977; Giese, 1999.) To define the dynamic network, we assume the existence of dynamic snapshot neurons whose activity u_k(t) obeys a differential equation of the standard Amari form:

τ du_k(t)/dt = −u_k(t) + r_k(t) + Σ_l w(k − l) [u_l(t)]_+ − h    (1)


The function [·]_+ denotes the linear threshold function, which is defined as [x]_+ = max(x, 0). The time constant of the dynamics τ was about 160 ms. The positive threshold parameter h determines the resting level of the neural network without input. The interaction kernel w(k) was chosen as a smooth asymmetric function that was adjusted to maximize the amplitude of the activity peak that emerged for the training stimulus sequence presented in the correct temporal order, and to minimize the response for the sequence presented in reverse temporal order. The snapshot neurons in this representation become strongly activated only if the presented hand shape matches the corresponding training shape, and if it occurs in the right temporal context (see Giese & Poggio, 2003, for further details). The appropriate connection strengths for the lateral connections w(k) could be learned as well, using physiologically plausible mechanisms (see earlier discussion).

The outputs of the snapshot neurons belonging to the same hand action were then temporally integrated by motion pattern neurons that obeyed a leaky-integrator dynamics of the form:

τ_s dy(t)/dt = −y(t) + Σ_k [u_k(t)]_+ − h_s    (2)

The time constant of this integrator dynamics was given by τ_s = 200 ms, and h_s is a positive threshold parameter. The motion pattern neurons respond selectively to a particular view of a particular hand action, but continuously during the whole action sequence. If the corresponding hand shapes are presented in the wrong temporal order, the response of these neurons is strongly reduced (Giese & Poggio, 2003).
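The sequence selectivity produced by Eqs. (1) and (2) can be illustrated with a simple Euler simulation (Python; the kernel shape and all parameter values are illustrative choices, not the fitted kernel of the model):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def run_field(r, dt=0.01, tau=0.16, tau_s=0.2, h=0.05, h_s=0.05):
    """Euler simulation of Eqs. (1)-(2): a 1-D field of snapshot neurons
    with an asymmetric lateral kernel drives a motion pattern neuron.
    r has shape (T, K): the input activity map r_k(t) over time."""
    T, K = r.shape
    k = np.arange(K)
    d = k[:, None] - k[None, :]          # d[k, l] = k - l
    # Asymmetric kernel: excitation peaked at k - l = +2 (i.e., coming
    # from neurons earlier in the sequence) plus broad inhibition.
    w = 1.5 * np.exp(-(d - 2) ** 2 / 8.0) - 0.5 * np.exp(-d ** 2 / 72.0)
    u = np.zeros(K)                      # snapshot neuron activities u_k
    y = 0.0                              # motion pattern neuron activity
    y_trace = np.empty(T)
    for t in range(T):
        u = u + dt / tau * (-u + r[t] + w @ relu(u) - h)    # Eq. (1)
        y = y + dt / tau_s * (-y + relu(u).sum() - h_s)     # Eq. (2)
        y_trace[t] = y
    return y_trace

# An input peak traveling in the trained direction is amplified by the
# asymmetric lateral connections; the time-reversed sequence (r[::-1])
# should produce a markedly weaker motion pattern response.
T, K = 200, 40
r = np.stack([np.exp(-(np.arange(K) - 0.2 * t) ** 2 / 4.0) for t in range(T)])
print(run_field(r)[-1], run_field(r[::-1])[-1])
```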

Integration of the Information about Effector and Object

The recognition of goal-directed actions requires not only the recognition of the effector movement but also the analysis of the relationship between the effector and the goal object. A grip that does not reach the goal object would be an unsuccessful action, and the same would be true if the hand shape were inappropriate for the realization of an efficient grip on the object. Neurophysiological data show that the activities of many action-selective neurons are critically dependent on whether the relationship between effector and object is appropriate. For example, action-selective neurons in the STS or in area F5 in premotor cortex show strongly reduced responses if the grasping hand does not touch the goal object (“mimicked action”) (Perrett et al., 1989; Umiltà et al., 2001). A model for the visual recognition of goal-directed actions thus has to account for this dependence of neural activity on the relationship between effector and object.

The proposed model explains this dependence by an additional circuit that receives its input from the hand shape and object shape detectors (Figure 22-3B). The proposed mechanism is critically dependent on the fact that, due to the incomplete position invariance of the shape detectors, the retinal positions of goal object and effector can be estimated from the responses of the shape detectors. The core of the proposed neural circuit is a neural map that represents the relative position of effector and object. In this two-dimensional relative position map, the relative position of object and effector is encoded by a neural activity peak (Figure 22-3B). The map can be constructed by simple neural operations from the outputs of the shape detectors. Let a_E(u,v,t) signify the retinotopic spatial activity map that represents the current effector position for a particular grip type, and let a_O(u,v,t) correspond to the activity map that is defined by the object shape detectors of an object that can be efficiently grasped with this grip type. Then the activity in the corresponding relative position map can be computed according to a relationship of the form (writing a_R for the relative position map):

a_R(x, y, t) = Σ_(u,v) a_E(u, v, t) a_O(u + x, v + y, t)    (3)

This convolution can be computed by a simple neural architecture that is similar to a gain field (Pouget & Sejnowski, 1997; Salinas & Abbott, 1995), by summing product terms from the two input maps. Gain fields have been a fundamental architecture for the realization of coordinate transformations (e.g., in parietal cortex). Due to the multiplication, a non-zero output signal in the map can arise only if both effector and object are present in the stimulus and if the object shape is suitable for a grip that is encoded by the corresponding hand shape detectors. In this way, the model matches the hand shape and the grip affordance of the object. In our present implementation, the center of the relative position map corresponds to the position of the effector, and the object position is represented relative to the effector. Similar implementations of coordinate transformations by gain fields have been found in multiple regions in the parietal cortex. Examples include the change from an eye-centered to a head-centered frame of reference (Batista, Buneo, Snyder, & Andersen, 1999; Buneo, Jarvis, Batista, & Andersen, 2002), or the representation of the relative positions of object parts (Chafee, Averbeck, & Crowe, 2007).
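As an illustration of Eq. (3), the sum of product terms can be written out directly (Python; array sizes and the offset range are illustrative):

```python
import numpy as np

def relative_position_map(a_E, a_O, max_offset=10):
    """Sum of product terms from the two retinotopic maps (Eq. 3): the
    entry for offset (dx, dy) is large only if effector activity at
    (u, v) and object activity at (u + dx, v + dy) are both present."""
    R = np.zeros((2 * max_offset + 1, 2 * max_offset + 1))
    for i, dx in enumerate(range(-max_offset, max_offset + 1)):
        for j, dy in enumerate(range(-max_offset, max_offset + 1)):
            # roll(a_O, -dx)[u] == a_O[u + dx]; the wrap-around is
            # harmless for activity bumps away from the image border
            shifted = np.roll(np.roll(a_O, -dx, axis=0), -dy, axis=1)
            R[i, j] = (a_E * shifted).sum()
    return R   # center entry: object exactly at the effector position
```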

A grip will only be successful if the effector is located within a certain range of spatial positions relative to a goal object. This range corresponds to a well-defined spatial region in the relative position map (in Figure 22-3B, the region marked by the cyan curve). A further postulate of our theory is the existence of a class of neurons, called affordance neurons, which sum the activity in the relative position map over these spatial regions. As a consequence, these neurons are activated only if both effector and object are present and if they have a spatial relationship that is appropriate for a successful grip. We assume that the receptive fields of the affordance neurons are established by learning.

As a final step in the integration of information about effector and object, we assume a multiplicative combination of the output signals from the affordance neurons and the motion pattern neurons that are detecting the same grip. This multiplication is computed by the view-dependent action-selective detectors in our model (Figure 22-3B). These detectors respond only in the presence of the appropriate effector motion and with the right spatial relationship between effector and object. Neurons with similar properties have been described in anterior STS (STSa) (Perrett et al., 1989) and area F5 (Rizzolatti et al., 2001). Motor neurons with specific tuning for object shapes and the relationship between object and effector have also been found in the parietal area AIP (Baumann et al., 2009; Gardner et al., 2007; Murata et al., 2000).
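These last two steps reduce to two elementary operations; a sketch (Python; the binary grip_region mask stands in for the learned receptive field of the affordance neuron):

```python
import numpy as np

def action_detector(rel_map, grip_region, motion_pattern_output):
    """Affordance neuron: sum relative-position activity over the region
    of offsets compatible with a successful grip. The view-dependent
    action detector multiplies this with the output of the matching
    motion pattern neuron, so it responds only if the effector moves
    correctly AND stands in the right spatial relation to the object."""
    affordance = (rel_map * grip_region).sum()
    return affordance * motion_pattern_output
```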

Integration of Different Views

The initial stages of our model realize recognition of learned views of the object and the effector. Correspondingly, the action-selective neural detectors described earlier are view-dependent; that is, they depend on the visual perspective from which the action is observed. Their response thus decays if an action is presented with views that deviate increasingly from the training view. Such view dependence is in accordance with electrophysiological data. View-dependent action-selective neurons have been observed in the STS (Oram & Perrett, 1996; Perrett et al., 1985), and the majority of F5 mirror neurons are also view-dependent (Caggiano et al., 2011). This clearly argues in favor of a view-based representation of actions rather than a full reconstruction of the three-dimensional geometry of effector and object. Such a full three-dimensional reconstruction is implicitly assumed by many other models for the recognition of goal-directed actions (see the opening paragraphs of this chapter).

In our model, view-independent action recognition is accomplished by pooling the output signals of a limited number of view-specific modules using a maximum operation (Figure 22-3B, right). This realization of view-invariant recognition of actions closely matches a widely accepted principle for view-invariant object recognition in the ventral stream (e.g., Logothetis, Pauls, & Poggio, 1995; Poggio & Edelman, 1990). Our simulations show that a very limited number of view-dependent modules is sufficient to accomplish fully view-independent recognition of actions.
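The pooling step itself is a single maximum operation over the view-specific detectors of one action; a minimal sketch (Python, illustrative values):

```python
import numpy as np

def view_invariant_action_response(view_module_outputs):
    """MAX pooling over the view-specific action detectors of one action
    yields a view-invariant detector (cf. Poggio & Edelman, 1990)."""
    return np.max(view_module_outputs, axis=0)

# Example: responses of seven view-specific modules (30-degree spacing)
print(view_invariant_action_response(
    np.array([0.1, 0.3, 0.9, 0.4, 0.1, 0.0, 0.0])))
```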

Simulation Results and Predictions

In the following section, a number of simulation results are presented to illustrate that the model is computationally powerful enough for the robust recognition of hand actions from real video sequences. In addition, the simulations demonstrate that the model reproduces a number of key properties of action-selective neurons in the cortex that have been observed in electrophysiological experiments. Further simulation results can be found in Fleischer et al. (2009a,b).

Robust Recognition from Real Video Sequences

The recognition of hand actions from real videos is a challenging computer vision problem (e.g., Athitsos & Sclaroff, 2003; Kjellström et al., 2008; Stenger, Thayananthan, Torr, & Cipolla, 2006). The distinction of different types of grips requires a highly accurate recognition of shape details. The difference between a precision grip (grasping with the index finger and the thumb) and a power grip (grasping with the full hand) might, depending on the view angle, be defined only by a relatively small number of pixels in a video stimulus. At the same time, recognition must be accomplished independent of the view of the action and of the position of the action stimulus within the visual field. Electrophysiological recordings from area F5 in macaque cortex show that action-selective neurons (mirror neurons) are highly selective for differences between grip types, but often also show strong invariance against the position of the stimulus within the visual field (Gallese et al., 1996).

To investigate the behavior of our model with real-world stimuli, we recorded video sequences with four different hand actions filmed from 19 view angles using a Canon XL1-S camera with a frame rate of 25 Hz. The videos show a human hand grasping an object with different grip types. Videos were converted to gray level and had a frame size of 350 × 315 pixels. The hand shape detectors were trained using example shapes derived from the original videos by color segmentation of hand and object. However, testing was based on unsegmented gray-level videos.

The performance of the snapshot neurons in the model is shown for the distinction between precision and power grip in Figure 22-4A, assuming the same view for both grips. In addition, this simulation compared two variants of the model, one with sequence selectivity (red line) and one without sequence selectivity (blue line). Classifications were based on the responses of the snapshot neurons, assigning as response the grip type that corresponded to the more strongly activated snapshot neuron at the same time step. Performance (percent of correct classifications) is plotted as a function of the normalized time, expressed as a fraction of the duration of the whole grasping action. The model was tested with novel grasping sequences that were not used to train the model.

Figure 22-4A shows that almost perfect recognition performance is achieved already after about half of the overall duration of the action. In addition, the comparison between the two model variants shows that sequence selectivity results in a slight improvement of the classification in the early stages of the grips, where the hand shapes of the two grips are still very similar. In this case, sequential-order information can disambiguate information arising from intermediate hand shapes. Even though the hand shape detectors were trained with example images that did not contain the goal object, the model generalizes efficiently to stimuli that also include goal objects.

A validation of the model’s performance with a variety of views is shown in Figure 22-4B. In this case, the model was trained with seven different views (in steps of 30 degrees) of two actions (power grip from above and from the side of the object). Again, the performance is shown for testing with videos that show the same action from novel views that were not presented during training. The average performance over all test views of the classification by the snapshot neurons is shown as a function of the normalized duration of the action. Almost perfect performance is accomplished after less than 60% of the overall time of the action, thus before the hand has reached its final configuration. This implies that the model accomplishes robust view-invariant recognition from real video stimuli, requiring only a relatively limited number of view-specific modules for each action.

Figure 22-4. Classification performance of the model for real video stimuli. (A) Classification performance of the snapshot neurons for precision versus power grip presented with the same view. A model including the neural mechanism for sequence selectivity (red curve) was tested against a version of the model without sequence selectivity (blue curve). (B) Testing with multiple views. Classification performance for a power grip from the top versus a power grip from the side presented with 12 different views that were disjoint from the training views. Dashed curves signify standard error over repeated simulations. (See color insert.)

Position Invariance

Many action-selective neurons (e.g., in premotor cortex) show only a weak variation in their responses to displacements of the stimulus in the visual field (Gallese et al., 1996). The model reproduces this strong position invariance.

To test position invariance, two grips (precision vs. power grip) were shown at nine different positions within the visual field of the model. As shown in Figure 22-5, the responses of action-selective neurons that are selective for precision or power grip barely change with the position of the stimulus within the visual field. This strong position invariance is achieved while the model shows high selectivity for the relative position of object and effector, as shown later.

In summary, the last two sections show that the proposed neural architecture, even though it is based on very elementary neural operations—all of which, in principle, could be realized with cortical neurons—is sufficiently powerful to solve the difficult computational problem of goal-directed action recognition for real-world stimuli. The challenge of this problem is that high accuracy for the recognition of finger positions and the relationship between effector and object have to be realized together with substantial invariance against stimulus position and view.

View Dependence

View dependence is a natural consequence of the fact that the model is based on view-dependent representations of hand and object shapes. Such view dependence matches electrophysiological data, since view-dependent action-selective neurons have been observed in the STS (Oram & Perrett, 1996; Perrett et al., 1985), as well as in the premotor cortex of monkeys (Caggiano et al., 2011). View tuning is also a common observation for neurons in ventral shape-selective areas, such as area IT (Logothetis, Pauls, & Poggio, 1995; Tarr & Bülthoff, 1998).

Figure 22-5. Position invariance for precision versus power grip. The responses of neurons at the highest hierarchy level selective for power grips are shown during presentation of power grip stimuli (dark gray) and of precision grip stimuli (light gray) for nine different positions of the stimuli within the visual field (indicated by filled dots in the insets).

The view tuning of the action detectors at the highest level of the view-dependent modules (Figure 22-3B) is illustrated in Figure 22-6. In this case, seven view-specific modules have been trained that are selective for views that differ by 30 degrees. The training actions were power grips of a rod-like object from the side and from the top. The thin curves indicate the activities of the view-dependent action detectors, with different colors indicating different view-dependent modules. All test views were different from the training views. Panels A and C show the responses for a grip from the top, and panels B and D show the responses for a grip from the side. The top-grip neurons (panels A and B) show strong responses only for the top grip, for views that are sufficiently close to their training view. In addition, they show a gradual decay of the tuning curve, which corresponds to a tuning width of about 60 degrees. This view dependence and the tuning width match quantitatively with electrophysiological results obtained by studying the view dependence of mirror neurons in area F5 using similar stimuli (Caggiano et al., 2011). For the grip from the side, the top-grip neurons show only relatively weak responses and no clear view tuning. The behavior of the side-grip neurons from the view-dependent modules trained with side-grip stimuli is complementary: Strong responses arise only for side-grip stimuli, if their view is similar to the training view. Again, one finds smooth view-tuning curves with widths of around 60 degrees.

The thick lines in Figure 22-6 indicate the responses at the highest level of the model; that is, those formed by the view-invariant action detectors. These detectors show strong responses for all views of the trained action and much smaller responses for the alternative action. Based on the responses of these detectors, it is trivial to classify the actions (simply selecting as recognized the action that corresponds to the action-selective detector with the higher activity). Most importantly, the direction of the activity difference between the two actions has the same sign for all views, even for the untrained views. This makes it possible to classify all views with only a small number of trained view-dependent modules. View-independent action-selective neurons have been found in area F5 of the macaque (Caggiano et al., 2011) and in the STS (Jellema & Perrett, 2006; Perrett et al., 1989).

Figure 22-6. Activity of the action-selective neurons (see Figure 22-3). Thin lines indicate the activity of the view-dependent detectors, and thick lines that of the corresponding view-independent detectors. Panels A and C show responses to a stimulus showing a power grip from the top, and panels B and D the responses for a power grip from the side. Panels A and B show the responses of the neural modules that have been trained with a top grip (top-grip neurons), and panels C and D show the responses of the neurons from the modules trained with the side grip (side-grip neurons). Test views were different from the views that were used to train the model. (See color insert.)

Selectivity for the Relationship Between Effector and Object

Many action-selective neurons show high selectivity of their responses for the relationship between effector and goal object. Mirror neurons in area F5 have been reported to fail to respond when either the effector or the goal object is missing from the stimulus (Umiltà et al., 2001). In addition, many mirror neurons fail to respond to “mimicked actions,” where both effector and object are present, but where the effector misses the goal object, with the experimenter grasping next to it. Similar observations have been made for action-selective neurons in the STS (Perrett et al., 1989).

The model reproduces this selectivity for the relationship between effector motion and goal object even in quantitative detail. Figure 22-7 shows the activity of an action-selective neuron that was trained with a power grip, for stimuli that show effector and object with the correct spatial relationship and with an incorrect spatial relationship (mimicked action). In addition, stimuli were tested that contained only the object or only the effector. The inset replots data from an electrophysiological study that investigated the responses of neurons in the anterior STS using the same type of stimuli (Perrett et al., 1989).

Figure 22-7. Selectivity for the correct relationship between effector and object. Responses are shown for an action-selective detector (selective for power grip) for a normal grasping stimulus, a mimicked action (the hand not reaching the object), and stimuli where either the hand or the object was missing. Error bars indicate standard deviation over 10 independent simulations. The inset shows corresponding neurophysiological data from an electrophysiological experiment by Perrett et al. (1989). Adapted from Perrett, D. I., et al. (1989). Frameworks of analysis for the neural representation of animate objects and actions. Journal of Experimental Biology, 146, 87–113. With permission of the publisher, Journal of Experimental Biology, jeb.biologists.org.

Clearly, the response of the action-selective detectors decays substantially if either the object or the effector is missing from the stimulus, or if a mimicked action is presented. The decay is even quantitatively similar to the response profile observed in the electrophysiological study.

We conclude that the proposed neural mechanism accounts, at least qualitatively, for these neurophysiological data. From a computational point of view, it is not trivial to accomplish this selectivity for the spatial relationship between effector and object while at the same time guaranteeing strong position invariance of the action recognition.

Predictions

The model is formulated largely in terms of mechanisms that could, in principle, be implemented by real cortical neurons. This makes it possible to derive a number of predictions that can be tested immediately in experiments. Here are only a few examples:

• Action-selective neurons should show sequence selectivity. This implies that the presentation of the same stimulus frames in forward and reverse order should result in different neural responses. In fact, initial data seem to confirm this prediction for action-selective neurons (mirror neurons) in area F5 of the macaque, obtaining a good agreement between the behavior of individual neurons in this area and the model (V. Caggiano, personal communication).

• Changing the relative positions of effector and object should result in well-defined gradual tuning curves for the dependence on relative position. This dependence should be invariant against position changes of the whole stimulus in the visual field. This prediction could be confirmed by recording of neurons, for example, in relevant areas of the parietal cortex or in area F5.

• The existence of the postulated neuron classes (affordance neurons, motion pattern neurons, view-dependent and view-invariant action-selective neurons) can be verified in electrophysiological studies. A coarse anatomical localization of the different postulated computational steps (relative position map, affordance neurons, etc.) might also be possible in carefully controlled fMRI experiments.

Conclusion

We have reviewed in this chapter biologically inspired models for the visual recognition of body movements and actions. We provided an overview of a class of architectures for the recognition of nontransitive actions that are not goal-directed, but which are relatively well established as models for brain functions and as basic architectures for computer vision systems for action recognition. In later sections, we presented an extension of this basic class of models, making them suitable to account also for the recognition of transitive actions. We have shown that a model based on the proposed extensions is computationally powerful enough to realize recognition of transitive actions from real videos. In addition, we have shown that the model reproduces several neurophysiological results about action-selective cortical neurons. While it is still somewhat preliminary, this makes the proposed architecture an interesting starting point for the development of more elaborate models that can be fitted in much more detail to experimental data.

Obviously, the proposed model has a number of limitations, which at the same time define topics for future research. We list here only a few of the major points:

• The proposed model completely ignores the influence of disparity features, which would provide depth information obtained by a comparison of the retinal images from both eyes. It is known that many neurons, specifically in parietal regions, show disparity tuning (Orban, Janssen, & Vogels, 2006; Tsutsui, Taira, & Sakata, 2005). Thus, it remains an important topic for future research to explore the role of disparity features in action recognition. To our knowledge, no neural model so far has addressed this topic.

• There has been an extensive discussion of how the perception and execution of actions are linked in the context of research on the mirror neuron system (e.g., Prinz, 1997; Rizzolatti et al., 2001). Doubtless, strong empirical evidence exists for a tight interaction between neural representations for action execution and action perception, as reviewed by Calvo-Merino, Chapter 16, this volume, and van der Wel, Sebanz, and Knoblich, Chapter 7, this volume. With respect to this discussion, the proposed model provides the insight that many visual tuning properties of action-selective visual neurons can be explained with relatively standard mechanisms that are also common to visual processes outside of action recognition. A direct coupling to motor representations, or even the time-synchronous internal resimulation of the observed motor behavior within the motor system (motor resonance), was not necessary to account for these results. However, the tight coupling of motor execution and visual recognition of body motion raises the question of how exactly motor and visual representations are linked to each other, and how the proposed model has to be modified to take this link into account. An interesting idea in this context is the existence of predictive dynamic mechanisms at multiple levels within a hierarchical system that can propagate predictions in a bottom-up as well as a top-down fashion (Kiebel, Daunizeau, & Friston, 2008).

• The computational limits of the proposed architecture need to be investigated much more thoroughly, using more extended datasets that include many objects and more types of grips. Only in this way will it be possible to judge how the proposed solution scales up to bigger problems and different tasks, such as the recognition of emotional expressions (Schindler & van Gool, 2008; see also de Gelder, Chapter 20, this volume). Another step that will be critical to make the system interesting for applications in computer vision and robotics is to improve the computational mechanisms at the different levels of the hierarchy in order to make the system applicable to action stimuli with background clutter.

• Another important problem that has rarely been addressed in the context of action vision is the influence of attention on the processing of complex motion stimuli (see, e.g., Rodriguez-Sanchez, Simine, & Tsotsos, 2007). It seems likely that an integration of attentional selection will play a key role in making the proposed architecture applicable to more complex problems, as for visual scenes in which multiple objects or effectors are present. Conversely, it also remains a question for future research to determine how action perception influences the control of attention (e.g., by directing attention toward action goals; see also the discussion of top-down and bottom-up processes in Thompson, Chapter 18, this volume).

In summary, we think that the proposed skeleton architecture might provide a first step toward more quantitative models for the visual recognition of goal-directed actions that make well-defined predictions that can be verified or falsified at the level of the behavior of single cells in relevant higher cortical areas. It is likely that only this approach will finally help to unravel the real neural circuits that underlie the visual processing of action stimuli and body motion.

Acknowledgments

We are grateful to M. Shiffrar for the invitation to write this chapter, to V. Caggiano for sharing his electrophysiological data, and to L. Omlor for help with the video stimuli. We thank P. Thier, A. Casile, L. Fogassi, and G. Rizzolatti for interesting discussions in the context of a project on mirror neurons. This work was supported by the Deutsche Forschungsgemeinschaft (DFG) SFB 550 and GI 305/4–1, EC FP6 project COBOL, and FP7 projects SEARISE, TANGO, and AMARSI. Further support from the Hermann Lilly Schilling Foundation is gratefully acknowledged.

References


Allman, J., Miezin, F., & McGuinness, E. (1985). Direction- and velocity-specific responses from beyond the classical receptive field in the middle temporal visual area (MT). Perception, 14, 105–126.

Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27, 77–87.

Arbib, M. A. (2008). From grasp to language: Embodied concepts and the challenge of abstraction. Journal of Physiology, 102, 4–20.

Athitsos, V., & Sclaroff, S. (2003). Estimating 3D hand pose from a cluttered image. Proceedings. IEEE International Conference on Computer Vision and Pattern Recognition, 2, 432–439.


Baker, C. I., Keysers, C., Jellema, T., Wicker, B., & Perrett, D. I. (2000). Coding of spatial position in the superior temporal sulcus of the macaque. Current Psychology Letters—Behavior, Brain, and Cognition, 1, 71–87.

Batista, A. P., Buneo, C. A., Snyder, L. H., & Andersen, R. A. (1999). Reach plans in eye-centered coordinates. Science, 285, 257–260.

Baumann, M. A., Fluet, M.-C., & Scherberger, H. (2009). Context-specific grasp movement representation in the macaque anterior intraparietal area. Journal of Neuroscience, 29, 6436–6448.

Beardsworth, T., & Buckner, T. (1981). The ability to recognize oneself from a video recording of one’s movements without seeing one’s body. Bulletin of the Psychonomic Society, 18, 19–22.

Beintema, J. A., & Lappe, M. (2002). Perception of biological motion without local image motion. Proceedings of the National Academy of Sciences USA, 99, 5661–5663.

Beintema, J. A., Georg, K., & Lappe, M. (2006). Perception of biological motion from limited-lifetime stimuli. Perception and Psychophysics, 68, 613–624.

Bertenthal, B. I., & Pinto, J. (1994). Global processing of biological motions. Psychological Science, 5, 221–225.

Billard, A., & Mataric, M. (2001). Learning human arm movements by imitation: Evaluation of a biologically-inspired connectionist architecture. Robotics and Autonomous Systems, 941, 1–16.

Blake, A., & Isard, M. (1998). Active contours. Berlin, Germany: Springer.

Blakemore, S. J., & Frith, C. (2005). The role of motor contagion in the prediction of action. Neuropsychologia, 43, 260–267.

Bobick, A. (1997). Movement, activity, and action: The role of knowledge in the perception of motion. Philosophical Transactions of the Royal Society B: Biological Sciences, 352, 1257–1265.

Bonaiuto, J., Rosta, E., & Arbib, M. (2007). Extending the mirror neuron system model, I. Audible actions and invisible grasps. Biological Cybernetics, 96, 9–38.


Born, R. T. (2000). Center-surround interactions in the middle temporal visual area of the owl monkey. Journal of Neurophysiology, 84, 2658–2669.

Buneo, C. A., Jarvis, M. R., Batista, A. P., & Andersen, R. A. (2002). Direct visuomotor transformations for reaching. Nature, 416, 632–636.

Cadieu, C., Kouh, M., Pasupathy, A., Connor, C. E., Riesenhuber, M., & Poggio, T. (2007). A model of V4 shape selectivity and invariance. Journal of Neurophysiology, 98, 1733–1750.

Caggiano, V., Fogassi, L., Rizzolatti, G., Pomper, J. K., Thier, P., Giese, M. A., & Casile, A. (2011). View-based encoding of actions in mirror neurons of area F5 in macaque premotor cortex. Current Biology, 21, 144–148.

Calvo-Merino, B. (2013). Neural mechanisms for action observation. In K. L. Johnson & M. Shiffrar (Eds.), People watching: Social, perceptual, and neurophysiological studies of body perception (Chapter 16). New York: Oxford University Press.

Casile, A., & Giese, M. A. (2005). Critical features for the recognition of biological motion. Journal of Vision, 5, 348–360.

Chafee, M. V., Averbeck, B. B., & Crowe, D. A. (2007). Representing spatial relationships in posterior parietal cortex: Single neurons code object-referenced position. Cerebral Cortex, 17, 2914–2932.

Chouchourelou, A., Matsuka, T., Harber, K., & Shiffrar, M. (2006). The visual analysis of emotional actions. Social Neuroscience, 1, 63–74.

Cutting, J. E., & Kozlowski, L. T. (1977). Recognizing friends by their walk: Gait perception without familiarity cues. Bulletin of the Psychonomic Society, 9, 353–356.

Cutting, J. E., Moore, C., & Morrison, R. (1988). Masking the motions of human gait. Perception and Psychophysics, 44, 339–347.

de Gelder, B. (2013). From body perception to action preparation: A distributed neural system for viewing bodily expressions of emotion. In K. L. Johnson & M. Shiffrar (Eds.), People watching: Social, perceptual, and neurophysiological studies of body perception (Chapter 20). New York: Oxford University Press.


Demiris, Y., & Simmons, G. (2006). Perceiving the unusual: Temporal properties of hierarchical motor representations for action perception. Neural Networks, 19, 272–284.

DiCarlo, J. J., & Maunsell, J. H. R. (2003). Anterior inferotemporal neurons of monkeys engaged in object recognition can be highly sensitive to object position. Journal of Neurophysiology, 89, 3246–3278.

Dinstein, I., Hasson, U., Rubin, N., & Heeger, D. J. (2007). Brain areas selective for both observed and executed movements. Journal of Neurophysiology, 98, 1415–1427.

Di Pellegrino, G., Fadiga, L., Fogassi, L., Gallese, V., & Rizzolatti, G. (1992). Understanding motor events: A neurophysiological study. Experimental Brain Research, 91, 176–180.

Dittrich, W. H. (1993). Action categories and the perception of biological motion. Perception, 22, 15–22.

Dittrich, W. H., Troscianko, T., Lea, S. E., & Morgan, D. (1996). Perception of emotion from dynamic point-light displays represented in dance. Perception, 25, 727–738.

Dollar, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. Proceedings of the International Conference on Computer Communications and Networks, 1, 65–72.

Efros, A., Berg, A., Mori, G., & Malik, J. (2003). Recognizing action at a distance. Proceedings. IEEE International Conference on Computer Vision, 2, 726–734.

Eifuku, S., & Wurtz, R. H. (1998). Response to motion in extrastriate area MSTl: Center-surround interactions. Journal of Neurophysiology, 80, 282–296.

Erlhagen, W., Mukovskiy, A., & Bicho, E. (2006). A dynamic model for action understanding and goal-directed imitation. Brain Research, 1083, 174–188.

Escobar, M., Masson, G., Vieville, T., & Kornprobst, P. (2009). Action recognition using a bio-inspired feedforward spiking network. International Journal of Computer Vision, 82, 284–301.

Fagg, A. H., & Arbib, M. A. (1998). Modeling parietal-premotor interactions in primate control of grasping. Neural Networks, 11, 1277–1303.


Felleman, D. J., & van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47.

Ferrari, P. F., Bonini, L., & Fogassi, L. (2009). From monkey mirror neurons to primate behaviours: Possible “direct” and “indirect” pathways. Philosophical Transactions of the Royal Society B: Biological Sciences, 364, 2311–2323.

Fidler, S., Boben, M., & Leonardis, A. (2008). Similarity-based cross-layered hierarchical representation for object categorization. Proceedings. IEEE Conference on Computer Vision and Pattern Recognition, DOI 10.1109/CVPR.2008.4587409.

Fleischer, F., Casile, A., & Giese, M. A. (2008). Neural model for the visual recognition of goal-directed movements. In V. Kurkova, R. Neruda, & J. Koutnik (Eds.), International Conference on Artificial Neural Networks, Part II, LNCS 5164, 939–948.

Fleischer, F., Casile, A., & Giese, M. A. (2009a). Bio-inspired approach for the recognition of goal-directed hand actions. In X. Jiang & N. Petkov (Eds.), International Conference on Computer Analysis of Images and Patterns, LNCS 5702, 714–722.

Fleischer, F., Casile, A., & Giese, M. A. (2009b). View-independent recognition of grasping actions with a cortex-inspired model. Proceedings. IEEE-RAS International Conference on Humanoid Robots, 1, 514–519.

Fogassi, L., Ferrari, P. F., Gesierich, B., Rozzi, S., Chersi, F., & Rizzolatti, G. (2005). Parietal lobe: From action organization to intention understanding. Science, 308, 662–667.

Freedberg, D., & Gallese, V. (2007). Motion, emotion and empathy in esthetic experience. Trends in Cognitive Science, 11, 197–203.

Frith, C. D., & Singer, T. (2008). The role of social cognition in decision making. Philosophical Transactions of the Royal Society B: Biological Sciences, 363, 3875–3886.

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.

Gallese, V., Fadiga, L., Fogassi, L., & Rizzolatti, G. (1996). Action recognition in the premotor cortex. Brain, 119, 593–609.


Gallese, V., & Goldman, A. (1998). Mirror neurons and the simulation theory of mind-reading. Trends in Cognitive Science, 2, 493–501.

Gardner, E. P., Babu, K. S., Reitzen, S. D., Ghosh, S., Brown, A. S., Chen, J., et al. (2007). Neurophysiology of prehension. III. Representation of object features in posterior parietal cortex of the macaque monkey. Journal of Neurophysiology, 98, 3708–3730.

Gavrila, D. M. (1999). The visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73, 82–98.

Giese, M. A. (1999). Dynamic neural field theory for motion perception. Dordrecht, Netherlands: Kluwer Academic Publishers.

Giese, M. A. (2004). Neural model for biological movement recognition. In L. M. Vaina, S. A. Beardsley, & S. Rushton (Eds.), Optic flow and beyond. Dordrecht, Netherlands: Kluwer.

Giese, M. A., & Poggio, T. (2003). Neural mechanisms for the recognition of biological movements. Nature Reviews Neuroscience, 4, 179–192.

Goodale, M. A., & Milner, A. D. (1992). Separate visual pathways for perception and action. Trends in Neurosciences, 15, 20–25.

Gorelick, L., Blank, M., Shechtman, E., Irani, M., & Basri, R. (2007). Actions as space–time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29, 2247–2253.

Haruno, M., Wolpert, D. M., & Kawato, M. (2001). Mosaic model for sensorimotor learning and control. Neural Computation, 13, 2201–2220.

Hoffman, D. D., & Flinchbaugh, B. E. (1982). The interpretation of biological motion. Biological Cybernetics, 42, 195–204.

Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. Journal of Physiology (London), 160, 106–154.

Iacoboni, M., Molnar-Szakacs, I., Gallese, V., Buccino, G., Mazziotta, J. C., & Rizzolatti, G. (2005). Grasping the intentions of others with one’s own mirror neuron system. PLoS Biology, 3, e79.


Jastorff, J., & Giese, M. A. (2004). Time-dependent Hebbian rules for the learning of templates for motion recognition. Dynamic Perception, 5, 151–156.

Jastorff, J., & Orban, G. A. (2009). Human functional magnetic resonance imaging reveals separation and integration of shape and motion cues in biological motion processing. Journal of Neuroscience, 29, 7315–7329.

Jellema, T., & Perrett, D. I. (2006). Neural representations of perceived bodily actions using a categorical frame of reference. Neuropsychologia, 44, 1535–1546.

Jhuang, H., Serre, T., Wolf, L., & Poggio, T. (2007). A biologically inspired system for action recognition. Proceedings. IEEE International Conference on Computer Vision, DOI 10.1109/ICCV.2007.4408988.

Johansson, G. (1973). Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14, 201–211.

Kiebel, S. J., Daunizeau, J., & Friston, K. J. (2008). A hierarchy of time-scales and the brain. PLoS Computational Biology, 4, e1000209.

Kilner, J. M., Neal, A., Weiskopf, N., Friston, K. J., & Frith, C. D. (2009). Evidence of mirror neurons in human inferior frontal gyrus. Journal of Neuroscience, 29, 10153–10159.

Kjellström, H., Romero, J., Martínez, D., & Kragić, D. (2008). Simultaneous visual recognition of manipulation actions and manipulated objects. In D. Forsyth, P. Torr, & A. Zisserman (Eds.), ECCV 2008, Part II, LNCS 5303 (pp. 336–349).

Kouh, M., & Poggio, T. (2008). A canonical neuronal circuit for cortical nonlinear operations. Neural Computation, 20, 1427–1451.

Kravitz, D. J., Vinson, L. D., & Baker, C. I. (2008). How position dependent is visual object recognition? Trends in Cognitive Science, 12, 114–122.

Lange, J., & Lappe, M. (2006). A model of biological motion perception from configural form cues. Journal of Neuroscience, 26, 2894–2906.

Lange, J., Georg, K., & Lappe, M. (2006). Visual perception of biological motion by form: A template-matching analysis. Journal of Vision, 6, 836–849.


Laptev, I., & Lindeberg, T. (2003). Space-time interest points. Proceedings.IEEE International Conference on Computer Vision, 1, 432–439.

Lehky, S. R., Peng, X., McAdams, C. J., & Sereno, A. B. (2008). Spatialmodulation of primate inferotemporal responses by eye position. PLoS ONE,3, e3492.

Logothetis, N. K., Pauls, J., & Poggio, T. (1995). Shape representation in theinferior temporal cortex of monkeys. Current Biology, 5, 552–563.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object vision. AnnualReview of Neuroscience, 19, 577–621.

Marr, D., & Vaina, L. M. V. (1982). Representation and recognition of themovements of shapes. Proceedings of the Royal Society of London B:Biology, 214, 501–524.

Mel, B., & Fiser, J. (2000). Minimizing binding errors using learned conjunctivefeatures. Neural Computation, 9, 779–796.

Metta, G., Sandini, G., Natale, L., Craighero, L., & Fadiga, L. (2006).Understanding mirror neurons: A bio-robotic approach. Interaction Studies(Special Issue on Epigenetic Robotica), 7, 197–232.

Moeslund, T. B., & Granum, G. (2001). A survey of computer vision-basedhuman motion capture. Computer Vision and Image Understanding, 81, 231–268.

Murata, A., Gallese, V., Luppino, G., Kaseda, M., & Sakata, H. (2000).Selectivity for the shape, size, and orientation of objects for grasping inneurons of monkey parietal area AIP. Journal of Neurophysiology, 83, 2580–2601.

Mutch, J., & Lowe, D. G. (2006). Multi-class object recognition with sparse,localized features. Proceedings. IEEE Conference on Computer Vision andPattern Recognition, 1, 11–18.

Nelissen, K., Vanduffel, W., & Orban, G. A. (2006). Charting the lower superior temporal region, a new motion-sensitive region in monkey superior temporal sulcus. Journal of Neuroscience, 26, 5929–5947.

Niyogi, S. A., & Adelson, E. H. (1994). Analyzing and recognizing walking figures in XYT. Proceedings. IEEE Conference on Computer Vision and Pattern Recognition, 1, 469–474.

Oram, M. W., & Perrett, D. I. (1996). Integration of form and motion in the anterior superior temporal polysensory area (STPa) of the macaque monkey. Journal of Neurophysiology, 76, 109–129.

Orban, G. A., Janssen, P., & Vogels, R. (2006). Extracting 3D structure from disparity. Trends in Neurosciences, 29, 466–473.

Oztop, E., & Arbib, M. A. (2002). Schema design and implementation of the grasp-related mirror neuron system. Biological Cybernetics, 87, 116–140.

Oztop, E., Kawato, M., & Arbib, M. (2006). Mirror neurons and imitation: A computationally guided review. Neural Networks, 19, 254–271.

Perrett, D. I., Harries, M. H., Bevan, R., Thomas, S., Benson, P. J., Mistlin, A. J., et al. (1989). Frameworks of analysis for the neural representation of animate objects and actions. Journal of Experimental Biology, 146, 87–113.

Perrett, D. I., & Oram, M. W. (1993). Neurophysiology of shape processing. Image and Vision Computing, 11, 317–333.

Perrett, D. I., Smith, P. A., Mistlin, A. J., Chitty, A. J., Head, A. S., Potter, D. D., et al. (1985). Visual analysis of body movements by neurones in the temporal cortex in the macaque monkey: A preliminary report. Behavioural Brain Research, 16, 153–170.

Perrett, D. I., Xiao, D., Barraclough, N. E., Keysers, C., & Oram, M. (2009). Seeing the future: Natural image sequences produce “anticipatory” neuronal activity and bias perceptual report. Quarterly Journal of Experimental Psychology, 62, 2081–2104.

Peuskens, H., Vanrie, J., Verfaillie, K., & Orban, G. A. (2005). Specificity of regions processing biological motion. European Journal of Neuroscience, 21, 2864–2875.

Poggio, T., & Edelman, S. (1990). A network that learns to recognize three-dimensional objects. Nature, 343, 263–266.

Pollick, F. E., Lestou, V., Ryu, J., & Cho, S. B. (2002). Estimating the efficiency of recognizing gender and affect from biological motion. Vision Research, 42, 2345–2355.

Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9, 223–237.

Prevete, R., Tessitore, G., Santoro, M., & Catanzariti, E. (2008). A connectionist architecture for view-independent grip-aperture computation. Brain Research, 1225, 133–145.

Prinz, W. (1997). Perception and action planning. European Journal of Cognitive Psychology, 9, 129–154.

Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.

Rizzolatti, G., Fogassi, L., & Gallese, V. (2001). Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Reviews Neuroscience, 2, 661–670.

Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience, 27, 169–192.

Rizzolatti, G., & Fabbri-Destro, M. (2008). The mirror system and its role in social cognition. Current Opinion in Neurobiology, 18, 179–184.

Rodriguez-Sanchez, A. J., Simine, E., & Tsotsos, J. K. (2007). Attention and visual search. International Journal of Neural Systems, 17, 275–288.

Rolls, E. T., & Milward, T. (2000). A model of invariant object recognition in the visual system: Learning rules, activation functions, lateral inhibition, and information-based performance measures. Neural Computation, 12, 2547–2572.

Sakata, H., Taira, M., Kusunoki, M., Murata, A., & Tanaka, Y. (1997). The TINS lecture: The parietal association cortex in depth perception and visual control of hand action. Trends in Neurosciences, 20, 350–357.

Saleem, K. S., Suzuki, W., Tanaka, K., & Hashikawa, T. (2000). Connections between anterior inferotemporal cortex and superior temporal sulcus regions in the macaque monkey. Journal of Neuroscience, 20, 5083–5101.

Salinas, E., & Abbott, L. F. (1995). Transfer of coded information from sensory to motor networks. Journal of Neuroscience, 15, 6461–6474.

Sauser, E. L., & Billard, A. G. (2006). Parallel and distributed neural models of the ideomotor principle: An investigation of imitative cortical pathways. Neural Networks, 19, 285–298.

Saygin, A. P. (2013). Sensory and motor brain areas subserving biological motion perception: Neuropsychological and neuroimaging studies. In K. L. Johnson & M. Shiffrar (Eds.), People watching: Social, perceptual, and neurophysiological studies of body perception (Chapter 21). New York: Oxford University Press.

Schaal, S., Ijspeert, A., & Billard, A. (2003). Computational approaches to motor learning by imitation. Philosophical Transactions of the Royal Society B: Biological Sciences, 358, 537–547.

Schindler, K., & van Gool, L. (2008). Combining densely sampled form and motion for human action recognition. Proceedings. DAGM Symposium, LNCS, 5096, 122–131.

Schütz-Bosbach, S., & Prinz, W. (2007). Perceptual resonance: Action-induced modulation of perception. Trends in Cognitive Sciences, 11, 349–355.

Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., & Poggio, T. (2007). Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29, 411–426.

Sigala, R., Serre, T., Poggio, T., & Giese, M. A. (2005). Learning features of intermediate complexity for the recognition of biological motion. Proceedings of the International Conference on Artificial Neural Networks, LNCS, 3696, 241–246.

Singer, J. M., & Sheinberg, D. L. (2010). Temporal cortex neurons encode articulated actions as slow sequences of integrated poses. Journal of Neuroscience, 30, 3133–3145.

Smith, A. T., & Snowden, R. J. (1994). Visual detection of motion. London: Academic Press.

Stenger, B., Thayananthan, A., Torr, P., & Cipolla, R. (2006). Model-based hand tracking using a hierarchical Bayesian filter. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 1372–1384.

Tani, J., Ito, M., & Sugita, Y. (2004). Self-organization of distributedly represented multiple behavior schemata in a mirror system: Reviews of robot experiments using RNNPB. Neural Networks, 17, 1273–1289.

Tarr, M. J., & Bülthoff, H. H. (1998). Image-based object recognition in man, monkey and machine. Cognition, 67, 1–20.

Thirkettle, M., Benton, C. E., & Scott-Samuel, N. E. (2009). Contributions of form, motion and task to biological motion perception. Journal of Vision, 9(28), 1–11.

Thompson, J. C. (2013). The how, when, and why of configural processing in the perception of human movement. In K. L. Johnson & M. Shiffrar (Eds.), People watching: Social, perceptual, and neurophysiological studies of body perception (Chapter 18). New York: Oxford University Press.

Thornton, I. M. (2013). Top-down versus bottom-up processing of biological motion. In K. L. Johnson & M. Shiffrar (Eds.), People watching: Social, perceptual, and neurophysiological studies of body perception (Chapter 3). New York: Oxford University Press.

Thornton, I. M., Pinto, J., & Shiffrar, M. (1998). The visual perception of human locomotion. Cognitive Neuropsychology, 15, 535–552.

Thurman, S. M., & Grossman, E. D. (2008). Temporal “Bubbles” reveal key features for point-light biological motion perception. Journal of Vision, 8(28), 1–11.

Tsutsui, K.-I., Taira, M., & Sakata, H. (2005). Neural mechanisms of three-dimensional vision. Neuroscience Research, 51, 221–229.

Umiltà, M. A., Kohler, E., Gallese, V., Fogassi, L., Fadiga, L., Keysers, C., & Rizzolatti, G. (2001). I know what you are doing: A neurophysiological study. Neuron, 31, 155–165.

Ungerleider, L. G., & Mishkin, M. (1982). Two cortical visual systems. In D. J. Ingle, M. A. Goodale, & R. J. W. Mansfield (Eds.), Analysis of visual behavior (pp. 549–586). Cambridge: MIT Press.

van der Wel, R. P. R. D., Sebanz, N., & Knoblich, G. (2013). Action perception from a common coding perspective. In K. L. Johnson & M. Shiffrar (Eds.), People watching: Social, perceptual, and neurophysiological studies of body perception (Chapter 7). New York: Oxford University Press.

Vangeneugden, J., Pollick, F., & Vogels, R. (2009). Functional differentiation of macaque visual temporal cortical neurons using a parametric action space. Cerebral Cortex, 19, 593–611.

Webb, J. A., & Aggarwal, J. K. (1982). Structure from motion of rigid and jointed objects. Artificial Intelligence, 19, 107–130.

Wilson, M., & Knoblich, G. (2005). The case for motor involvement in perceiving conspecifics. Psychological Bulletin, 131, 460–473.

Wolpert, D. M., Doya, K., & Kawato, M. (2003). A unifying computational framework for motor control and social interaction. Philosophical Transactions of the Royal Society B: Biological Sciences, 358, 593–602.

Wolpert, D. M., & Ghahramani, Z. (2000). Computational principles of movement neuroscience. Nature Neuroscience, 3, 1212–1217.

Xiao, D. K., Raiguel, S., Marcar, V., Koenderink, J., & Orban, G. A. (1995). Spatial heterogeneity of inhibitory surrounds in the middle temporal visual area. Proceedings of the National Academy of Sciences USA, 92, 11303–11306.

Xie, X., & Giese, M. A. (2002). Nonlinear dynamics of direction-selective recurrent neural media. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, 65, 051904/1–051904/11.

Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. Journal of Neuroscience, 16, 2112–2126.

