
2013, Vol.18 No.2, 109-116

Article ID 1007-1202(2013)02-0109-08

DOI 10.1007/s11859-013-0902-3

Motion Analysis for Human Interaction Detection Using Optical Flow on Lattice Superpixels

ZHENG Peng1,2, CAO Yu3, WANG Song3

1. School of Computer, Wuhan University, Wuhan 430072, Hubei, China;
2. The Key Laboratory of Aerospace Information Security and Trust Computing, Ministry of Education, Wuhan 430072, Hubei, China;
3. Department of Computer Science and Engineering, University of South Carolina, SC 29208, USA

© Wuhan University and Springer-Verlag Berlin Heidelberg 2013

Abstract: We develop a new video-based motion analysis algorithm to determine whether two persons have any interaction in their meeting. The interaction between two persons can be very general, such as shaking hands, exchanging objects, and so on. To make the motion analysis robust to image noise, we segment each video frame into a set of superpixels and then derive a motion feature and a motion pattern for each superpixel by averaging the optical flow within the superpixel. Specifically, we use the lattice cut to construct the superpixels, which are spatially and temporally consistent across frames. Based on the motion feature and the motion pattern of the superpixels, we develop an algorithm to divide an input video sequence into three consecutive periods: 1) two persons walking toward each other, 2) two persons meeting each other, and 3) two persons walking away from each other. The experiments show that the proposed algorithm can accurately distinguish the videos with and without human interactions.

Key words: superpixel; optical flow; interaction; video

CLC number: TP 37

Received date: 2012-02-21
Foundation item: Supported by the National Natural Science Foundation of China (61272453)
Biography: ZHENG Peng, male, Associate professor, Ph.D, research direction: computer vision, information hiding. E-mail: [email protected]

0 Introduction

Automatic human activity detection and recognition from video sequences have been attracting more and more researchers in the computer vision community. They have wide applications in monitoring and surveillance. For example, detecting the event where “a person leaves a bag unattended” or “two persons exchange their bags” may be very useful for public security at airports and bus/train stations. Detecting the event where “a person falls on the floor” may be very useful for nursing homes. Human activity detection and recognition are very challenging problems because of the variety and complexity of different activities, strong image noise, possible camera shaking, and the large data size of a video.

According to Ref.[1], human activities can be categorized into four different levels: gestures, actions, interactions, and group activities. Gestures are elementary movements of a person’s body parts, such as hands and legs. Actions are single-person activities, such as walking and jumping. Interactions are human activities that involve two or more persons and/or objects. Group activities are the activities performed by conceptual groups. In this paper, we investigate the problem of detecting human interactions from a short video that involves two persons. We focus on the following specific circumstance: given a video in which two persons walk toward each other, meet each other, and then walk away, we develop an algorithm to determine whether they interact with each other when they are meeting.

Different from previous work on activity recognition [1], this paper is not limited to recognizing a certain kind of interaction. Instead, the involved interaction can be very general and may take different forms. We focus on determining whether there is an interaction in the video or not. We expect that this research can provide insight into identifying common features underlying different interactions and allow the proposed method to generalize to videos with unseen interactions.

Human activity detection and recognition are usually achieved by exploring and modeling the motion patterns shown in the given video. Typical motion analysis algorithms, such as optical flow, are very sensitive to image noise. In this paper, we segment each frame into a set of superpixels and then use the Lucas-Kanade optical flow algorithm [2] to derive a motion feature and a motion pattern for each superpixel. To make the constructed superpixels consistent across frames, we use the lattice cut approach for generating these superpixels [3-5]. Based on the derived motion features and motion patterns on these lattice superpixels, we develop an algorithm to divide each input video into three periods: ① two persons walking toward each other, ② two persons meeting each other, and ③ two persons walking away from each other. We then compare the motion features in these three periods, summarize them into a low-dimensional feature vector describing this video, and use this final feature vector to determine whether the two persons have any interactions in this video. In the experiment, we collect 31 short videos, each of which contains either an interaction or no interaction. The experimental results show that the proposed algorithm can accurately classify the videos with and without human interactions.

1 Related Works

Human activity recognition from video sequences has been studied by many researchers. Recently, Aggarwal and Ryoo conducted a comprehensive review of human activity recognition [1], where various spatial and temporal features are extracted and used. In Refs.[6-8], the input video sequence is treated as a 3D volume, and local volumetric features are extracted for action recognition. In Refs.[9-12], local interest-point descriptors and/or discriminative trajectory features are detected and used for human activity recognition. In Refs.[13-15], the involved activity agents, such as persons, are first detected, and their relations are then modeled for recognizing the underlying human activities. In Ref.[16], a motion context descriptor is used to recognize human activities.

Many different models and methods have been developed to describe the identified features and agents for human activity recognition. For example, prior work has used the Hidden Markov Model (HMM), which has been successfully used in speech recognition, to describe and distinguish the dynamics underlying different human activities [17-19]. In Ref.[20], Bayesian networks, together with a Markov chain Monte Carlo algorithm, are used to recognize bicycle-related activities. A hierarchical probabilistic latent model is developed to represent the behavior pattern [21]. Probabilistic analysis, such as stochastic context grammars, is designed for modeling human activities in a hierarchical way [22,23]. Several test video databases have also been developed for performance evaluation. For example, the KTH dataset [24], the Weizmann dataset [25], the IXMAS dataset [26], and the UCF sports action dataset [27] contain videos of different human actions, such as walking, running, jumping, bending, and so on. The Hollywood human action dataset [28] contains videos of both actions and interactions, such as answering a phone, getting out of a car, handshaking, hugging, kissing, sitting down, sitting up, and standing up.

Most of the previous works and test video data focus on human actions where there is only one person involved. In addition, most previous works only handle very specific activities, such as jumping, running, hugging, walking, and so on. In this paper, we focus on videos with two persons, and our goal is to determine whether these two persons have any interactions, without requiring recognition of what kind of interaction it is. Given the wide variety of human interactions, it would be difficult to collect video data that cover all different kinds of interactions for supervised training. Therefore, we propose a simple but robust approach to identify and distinguish the motion patterns between the videos with and without human interactions, without using statistical training on a large set of video data.

Closely related to this paper is the pair-activity recognition studied in Ref.[11], where the dynamic relations between two active objects are identified and analyzed for recognizing five activities: “chasing”, “following”, “independent”, “meeting”, and “together”. We only focus on the activity of “meeting” between two persons, but our task is a more refined recognition: determining whether they have any interactions during the meeting. As a result, we need more refined feature detection and motion-pattern analysis to achieve this goal. In Ref.[12], a spatial-temporal video matching algorithm is developed to detect and localize complex nonperiodic activities of “shake-hands”, “point”, “hug”, “push”, “kick”, and “punch”. Some of these activities involve interactions between two persons. However, recognizing what kind of interaction occurs in the video is not the focus of this paper; our goal is to determine whether the two persons interact with each other or not during the meeting.

2 Proposed Method

In this paper, we address the following specific problem: given a video that describes a process where two persons walk toward each other, meet each other and then walk away from each other, find out whether these two persons have any interactions during their meeting.

The basic assumption is that such an interaction would introduce a motion pattern that is different from the case where two persons walk independently without any interaction. To simplify the problem, we focus on the case where the camera orientation is largely orthogonal to the motion: one person is walking from left to right, and the other is walking from right to left throughout the video. Furthermore, we assume that, in this short process, the camera is largely static, as in most monitoring applications.

We will not perform person detection, which is itself a challenging research problem. Instead, we know that the input video contains two persons, and we focus on studying their motion patterns in the video. For this purpose, we divide our method into four steps. First, we construct lattice superpixels on each video frame. Second, we calculate the motion features of each superpixel using optical flow and define the motion pattern for each superpixel. Third, we divide the video sequence into three periods based on the motion patterns of the superpixels. Finally, we compare the motion features in these three periods to determine whether the two persons have any interactions with each other during the meeting.

2.1 Lattice Superpixels

Given a video sequence, we first segment each frame into a set of superpixels [3-5, 29]. With superpixels, we reduce the complexity of the image representation and improve the robustness of the later motion-pattern analysis. Here, we desire temporally consistent superpixels across video frames. For further motion analysis, we particularly desire that the constructed superpixels in different frames show consistent structures just like pixels, e.g., the left, right, top, and bottom neighbors of each superpixel are well and uniquely defined. For this purpose, we use the lattice cut technique introduced in Refs.[3-5] to construct the superpixels in each frame, and we call them lattice superpixels.

As shown in Fig. 1, the lattice superpixels [5] are constructed in three steps.

Fig. 1 An example of constructing the lattice superpixels. (a) uniformly sampled 25 × 25 grid lattice on a frame; (b) boundary map constructed by the Berkeley edge detector; (c) the obtained 25 × 25 lattice superpixels; (d) the superpixels constructed by the Turbopixel algorithm [29]

First, the input image is uniformly divided into M × N rectangular grids, as shown in Fig. 1(a). We choose M = N = 25. Second, a probabilistic boundary map is constructed from the original image, as shown in Fig. 1(b). We use the state-of-the-art Berkeley edge detector developed in Ref.[30] to generate the boundary map. Third, the lattice superpixels are obtained by deforming the initial grids to align with the constructed boundary map, as shown in Fig. 1(c). This way, the number of the resulting superpixels is fixed to be M × N, and each superpixel corresponds to an initial grid cell, preserving the grid's spatial relations.
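As a rough illustration of the first step only, the sketch below (Python with numpy; the function name is ours) builds the initial uniform M × N grid labels for one frame. The boundary-aligned deformation performed by the lattice cut of Refs.[3-5] is not reproduced here.

```python
import numpy as np

def initial_grid_labels(height, width, M=25, N=25):
    """Assign each pixel of a height x width frame to one of the M x N
    initial rectangular grid cells (Fig. 1(a)).  The lattice-cut step would
    then deform these cell boundaries toward the probabilistic boundary
    map; that deformation is not shown in this sketch."""
    rows = np.minimum((np.arange(height) * M) // height, M - 1)
    cols = np.minimum((np.arange(width) * N) // width, N - 1)
    return rows[:, None] * N + cols[None, :]          # labels in 0 .. M*N-1

# Example: a 480 x 640 frame partitioned into 25 x 25 initial cells
labels = initial_grid_labels(480, 640)
```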

2.2 Motion Features

In this paper, we use optical flow to construct the motion features for each lattice superpixel on each video frame. Particularly, we adopt the widely used Lucas-Kanade algorithm [2], which assumes constant flow in a local neighborhood of a pixel. By concatenating the basic optical-flow equations for all the pixels in that neighborhood, we can solve these equations in the least-squares sense to estimate the velocity at each pixel. This algorithm uses spatial intensity-gradient information to find the matched locations in two frames and can be generalized to handle rotations. However, by considering only the information in a small neighborhood around a pixel, the Lucas-Kanade algorithm, just like many other optical-flow algorithms, cannot find the correct flow field within a region of uniform intensity. In this paper, we combine the optical flow and the above-mentioned lattice superpixels to extract more robust motion features.

In extracting the motion features on frame I_t, we first calculate the optical flow at this frame using the Lucas-Kanade algorithm, by taking its difference to frame I_{t+1}. Denote the flow vector at pixel i as p_i. Since our goal is to determine whether there is a human interaction during the meeting of two persons, we are mainly interested in their horizontal motions. For each pixel i in a superpixel S, we project p_i onto the horizontal axis and then take the average value as the motion feature p(S). Positive p(S) indicates a rightward motion, and negative p(S) indicates a leftward motion, for superpixel S. We also quantize each pixel's motion pattern into one of three types: leftward, rightward, and static. Specifically, if |p_i| is less than a given threshold T, pixel i is considered to be static. Otherwise, we check the projection of p_i onto the horizontal axis to decide whether it is leftward or rightward.

We then decide the motion pattern of a lattice superpixel by combining the motion patterns of the pixels inside this superpixel. Similarly, we classify each superpixel's motion pattern as leftward, rightward, or static. We use a simple voting technique for this classification: for each superpixel, we count the percentages of pixels that are moving leftward, moving rightward, or staying static, respectively, and the pattern associated with the largest percentage of pixels is chosen as the motion pattern of this superpixel. This voting technique makes the estimated motion pattern of a superpixel more robust against image noise. An example is shown in the second row of Fig. 2, where the leftward, rightward, and static superpixels are labeled in white, gray, and dark, respectively. In the following, we combine the extracted motion feature p(S) and motion pattern for each superpixel to determine whether a human interaction is present in the video.
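A minimal sketch of these two steps is given below (Python with numpy; the function name, the threshold value, and the pattern codes are illustrative). It takes a dense flow field and a superpixel label map as input and returns p(S) and the voted motion pattern per superpixel; the flow field itself could come from any Lucas-Kanade-style estimator and is not computed here.

```python
import numpy as np

LEFT, RIGHT, STATIC = -1, 1, 0   # illustrative motion-pattern codes

def superpixel_motion(flow, labels, n_superpixels, T=0.5):
    """Return (p_S, sp_pattern) for one frame.

    flow   : (H, W, 2) dense optical-flow field between I_t and I_{t+1}
    labels : (H, W) superpixel label map with values 0 .. n_superpixels-1
    T      : magnitude threshold below which a pixel is treated as static
    """
    u = flow[..., 0].ravel()                      # horizontal component of p_i
    mag = np.linalg.norm(flow, axis=-1).ravel()   # |p_i|
    lab = labels.ravel()

    # Per-pixel pattern: static if |p_i| < T, else sign of the horizontal flow
    pattern = np.where(mag < T, STATIC, np.where(u > 0, RIGHT, LEFT))

    p_S = np.zeros(n_superpixels)
    sp_pattern = np.full(n_superpixels, STATIC, dtype=int)
    for s in range(n_superpixels):
        idx = (lab == s)
        if not idx.any():
            continue
        p_S[s] = u[idx].mean()                    # motion feature p(S)
        votes = [np.sum(pattern[idx] == c) for c in (LEFT, RIGHT, STATIC)]
        sp_pattern[s] = (LEFT, RIGHT, STATIC)[int(np.argmax(votes))]
    return p_S, sp_pattern
```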

Fig. 2 An illustration of the motion pattern derived for lattice superpixels

Top row: selected video frames; bottom row: motion patterns of the superpixels, where the leftward, rightward, and static superpixels are labeled in white, gray, and dark, respectively

2.3 Three-Period Identification

The algorithms described above segment each frame into M × N lattice superpixels, where each superpixel is described by a motion feature p(S) and a motion pattern. In this section, we use them to temporally divide the input video into three consecutive periods: ① two persons walking toward each other, ② two persons meeting each other, and ③ two persons walking away from each other.

First, we check each row of N superpixels, because we are mainly interested in the horizontal motions. If the motion patterns of all N superpixels in a row are static, we exclude this row from further consideration, because it most likely belongs to the image background and does not provide any motion information about the foreground. If a lattice row contains superpixels with either the leftward or the rightward pattern but not both (besides the static pattern), it also does not provide useful information for determining the human interaction.
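A minimal sketch of this row filtering, reusing the illustrative pattern codes introduced above:

```python
LEFT, RIGHT, STATIC = -1, 1, 0   # same illustrative codes as above

def row_is_informative(row_patterns):
    """Keep a lattice row only if it contains both leftward and rightward
    superpixels; all-static or single-direction rows are skipped."""
    patterns = set(int(p) for p in row_patterns)
    return LEFT in patterns and RIGHT in patterns
```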

In this paper, we focus only on the lattice rows that contain superpixels with both the leftward and rightward patterns. In such a lattice row, we use the following algorithm to decide whether the two persons start meeting or end meeting each other in this frame.

Algorithm 1 Three-Period Identification

Step 1: Scan all the superpixels in this lattice row from left to right and locate the first K consecutive superpixels that all have the leftward pattern. Take the leftmost superpixel S_L among these K consecutive superpixels as the frontend of the person who is walking to the left side. To balance robustness against noise and the capability to handle persons of small size, we choose K = 2 in this paper.

Step 2: Scan all the superpixels in this row from right to left and locate the first K consecutive superpixels that all have the rightward pattern. Take the rightmost superpixel S_R among these K consecutive superpixels as the frontend of the person who is walking to the right side.

Step 3: If S_L is very close to S_R from the right side, we claim that the two persons start meeting each other in this row of this frame. If S_L is very close to S_R from the left side, we claim that the two persons end meeting each other in this row of this frame. In this paper, two superpixels in the same lattice row are considered to be very close if they are adjacent or there is only one superpixel with a static pattern in between, as illustrated in Fig. 3.

Fig. 3 An illustration of the cases where two persons start meeting and end meeting, respectively. ←, —, and → represent the leftward, static, and rightward motion patterns associated with a superpixel. (a) and (b): the cases where two persons start meeting in this lattice row; (c) and (d): the cases where two persons end meeting in this lattice row
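A sketch of Algorithm 1 for one lattice row of one frame follows (Python; the function name, return values, and the encoding of the closeness rule are ours):

```python
LEFT, RIGHT, STATIC = -1, 1, 0   # same illustrative codes as above

def row_meeting_state(row_patterns, K=2):
    """Return 'start', 'end', or None for one lattice row of one frame."""
    N = len(row_patterns)

    # Step 1: leftmost superpixel S_L of the first K consecutive leftward ones
    s_l = next((i for i in range(N - K + 1)
                if all(p == LEFT for p in row_patterns[i:i + K])), None)
    # Step 2: rightmost superpixel S_R of the first K consecutive rightward
    # ones, scanning from right to left
    s_r = next((i + K - 1 for i in range(N - K, -1, -1)
                if all(p == RIGHT for p in row_patterns[i:i + K])), None)
    if s_l is None or s_r is None:
        return None

    def very_close(a, b):
        # adjacent, or exactly one static superpixel in between (Fig. 3)
        gap = b - a - 1
        return gap == 0 or (gap == 1 and row_patterns[a + 1] == STATIC)

    # Step 3: S_L just to the right of S_R -> start meeting;
    #         S_L just to the left of S_R  -> end meeting
    if s_r < s_l and very_close(s_r, s_l):
        return 'start'
    if s_l < s_r and very_close(s_l, s_r):
        return 'end'
    return None
```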

For each lattice row, we scan all the frames and determine the start-meeting frame in terms of this row using Algorithm 1. We repeat this to locate the start-meeting frame in each of the M lattice rows. The earliest start-meeting frame among these M rows is then identified as the frame where the two persons start meeting each other, and we denote this frame as I_s.

Starting from frame I_s, we further scan for the end-meeting frame using a similar technique. Specifically, for each row of lattice superpixels, we find the first frame where the two persons end meeting using Algorithm 1. We repeat this to locate the end-meeting frame for each of the M rows. The earliest end-meeting frame among these M rows is then identified as the frame where the two persons end meeting each other, and we denote this frame as I_t. We divide the video into three periods such that Period 1 is from the first frame to frame I_s, Period 2 is from frame I_s+1 to frame I_t, and Period 3 is from frame I_t+1 to the end of the video sequence. Because of image noise, we may not detect a start-meeting frame or an end-meeting frame in a lattice row, even if this row contains superpixels with both leftward and rightward patterns. In this case, we simply exclude this row from further consideration.
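Building on row_meeting_state from the previous sketch, the following rough outline (all names are ours) shows how the frame scan can yield I_s, I_t, and hence the three periods; rows that never produce a detection are simply skipped, as described above.

```python
def split_into_periods(pattern_video, K=2):
    """pattern_video: list over frames, each an (M, N) array of superpixel
    motion-pattern codes.  Returns (I_s, I_t); Period 1 is frames [0, I_s],
    Period 2 is (I_s, I_t], and Period 3 is (I_t, end]."""
    n_frames = len(pattern_video)
    M = pattern_video[0].shape[0]

    def earliest(state, first_frame):
        # per-row earliest frame with the requested state, then the minimum
        per_row = []
        for r in range(M):
            for f in range(first_frame, n_frames):
                if row_meeting_state(pattern_video[f][r], K) == state:
                    per_row.append(f)
                    break
        return min(per_row) if per_row else None

    I_s = earliest('start', 0)                             # start-meeting frame
    I_t = earliest('end', I_s if I_s is not None else 0)   # end-meeting frame
    return I_s, I_t
```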

2.4 Detecting Human Interactions

Our basic idea for classifying the videos with and without human interaction is to check whether the motion features in the three periods obtained in the previous section are consistent or not. If the motion features in all three periods are similar, it is unlikely that these two persons have any interactions during the meeting. On the contrary, if the motion features in Period 2 show a noticeable difference from the ones in Periods 1 and 3, it is likely that the two persons have some interactions.

Specifically, for period j, j ∈ {1, 2, 3}, and lattice row i, i ∈ {1, 2, ..., M}, we calculate two features q_{ij}^L and q_{ij}^R. Here, q_{ij}^L is the average of the motion feature p(S) over all the superpixels that satisfy (a) showing a leftward motion pattern, (b) located in row i, and (c) located in a frame in period j. Similarly, q_{ij}^R is the average of the motion feature p(S) over all the superpixels that satisfy (a) showing a rightward motion pattern, (b) located in row i, and (c) located in a frame in period j. We then define

$$q_i^L = \frac{\left|q_{i2}^L - q_{i1}^L\right| + \left|q_{i3}^L - q_{i2}^L\right|}{2\max\left(\left|q_{i1}^L\right|, \left|q_{i2}^L\right|, \left|q_{i3}^L\right|\right)}, \qquad q_i^R = \frac{\left|q_{i2}^R - q_{i1}^R\right| + \left|q_{i3}^R - q_{i2}^R\right|}{2\max\left(\left|q_{i1}^R\right|, \left|q_{i2}^R\right|, \left|q_{i3}^R\right|\right)}$$

and further define the feature vectors

$$q^L = \left(q_1^L, q_2^L, \ldots, q_M^L\right), \qquad q^R = \left(q_1^R, q_2^R, \ldots, q_M^R\right).$$

Note that each element in the feature vectors q^L and q^R takes a value in the range [0, 1] and reflects the difference between the features in period 2 and the features in the other two periods.

Out of the two feature vectors q^L and q^R, we always select only one for the final classification. The major reason is that one person will occlude the other during the meeting, and we cannot locate the motion features of a person when he/she is occluded. Since we do not know which person is occluded in the meeting, we select the feature that is more informative for the classification. In the ideal case, if all the elements in these two feature vectors are 1, it strongly indicates an interaction in this video; if all the elements are 0, it strongly indicates no interaction; and if all the elements are 0.5, it is uninformative, and we cannot decide whether there is an interaction. Therefore, we calculate

$$d^L = \left\|q^L - 0.5\,I_M\right\|_2, \qquad d^R = \left\|q^R - 0.5\,I_M\right\|_2$$

where I_M is an M-dimensional identity vector, i.e., the vector whose M elements are all 1. If d^L ≥ d^R, we choose the feature vector q^L because it is more informative; otherwise, we choose the feature vector q^R. We denote the finally selected feature vector as q.
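A sketch of this feature construction (Python with numpy; the absolute values follow the reconstruction of the formulas above, and the zero-denominator guard is our own addition):

```python
import numpy as np

def interaction_feature(qL_periods, qR_periods):
    """qL_periods, qR_periods: (M, 3) arrays whose entry [i, j-1] is the
    average p(S) of the leftward (resp. rightward) superpixels in lattice
    row i during period j.  Returns the selected feature vector q."""
    def per_row(q):
        num = np.abs(q[:, 1] - q[:, 0]) + np.abs(q[:, 2] - q[:, 1])
        den = 2.0 * np.max(np.abs(q), axis=1)
        # q_i in [0, 1]; rows with no motion at all are mapped to 0
        return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

    qL = per_row(np.asarray(qL_periods, dtype=float))
    qR = per_row(np.asarray(qR_periods, dtype=float))

    M = qL.shape[0]
    dL = np.linalg.norm(qL - 0.5 * np.ones(M))    # distance from 0.5 * I_M
    dR = np.linalg.norm(qR - 0.5 * np.ones(M))
    return qL if dL >= dR else qR                 # keep the more informative one
```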

For classification, we use a simple nearest neighbor algorithm. We set the feature q extracted from one video with an interaction as the prototype of the video class with interactions. To classify a new video, we extract its feature and compare this feature’s distance to the prototypes.
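For illustration only, a generic nearest-prototype rule is sketched below; the paper describes a single prototype taken from one interaction video, so the two-class prototype dictionary assumed here is a generalization rather than the paper's exact decision rule.

```python
import numpy as np

def classify(q, prototypes):
    """Return the label of the prototype closest to feature vector q.

    prototypes: dict mapping a class label (e.g. 'interaction',
    'no interaction') to a prototype feature vector of the same length."""
    q = np.asarray(q, dtype=float)
    return min(prototypes,
               key=lambda label: np.linalg.norm(
                   q - np.asarray(prototypes[label], dtype=float)))
```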

3 Experiment

We collect a set of 31 videos, each of which consists of 120 to 500 frames. These videos are taken by a 'Kodak EasyShare Z1012 IS' digital camera with a resolution of 640 × 480. As shown in Fig. 4, these videos are taken of different persons in different backgrounds, both indoors and outdoors. In each video, there are two persons who walk toward each other, meet each other, and then walk away from each other. Ten of them have no interaction between the persons, and each of the other 21 videos contains one of four interactions: exchanging objects, shaking hands, hugging each other, and stopping/talking to each other.

Fig. 4 Selected frames of three tested videos, with the constructed superpixels. The top row shows an outdoor video where two persons exchange objects in hands. The middle row shows an outdoor video where two persons have no interaction in the meeting. The bottom row shows an indoor video where two persons shake hands. The frame number is shown below each frame; the start-meeting frame number is underlined, and the end-meeting frame number is marked with a dot

In constructing the lattice superpixels, we choose M = N = 25. We use one video with an interaction to construct the prototype for the nearest neighbor classifier. The performance is evaluated on the remaining 30 videos, out of which 20 contain human interactions and 10 contain no human interaction. As shown in Table 1, the proposed method classifies these videos quite accurately.

We do not find another available method that focuses on classifying videos with and without interactions; many existing methods instead try to recognize certain actions and/or events.

Table 1 Precision and recall when using three different methods

Methods              20 videos with interactions      10 videos without interaction
                     Precision      Recall            Precision      Recall
Method in Ref.[16]   0.74           1.00              1.00           0.30
Method by grid       0.70           0.80              0.43           0.30
Proposed method      0.95           1.00              1.00           0.90

However, most prior video event recognition methods extract certain features for classification. Figures 5 and 6 show the features extracted using the proposed method and the features extracted using the method described in Ref.[16], respectively. Figure 5 shows the superpixel-based motion features p(S) extracted from six videos, where the axis labeled "optical flow" is the value of p(S) at each superpixel. In Fig. 5, we can see that the videos with human interactions (Fig. 5(a-c)) show quite different feature profiles from the videos without human interactions (Fig. 5(d-f)): there is a clear valley at the frames where the two persons have an interaction with each other. Figure 6 shows the features extracted from the same six videos using the method described in Ref.[16]. We can see that the feature profiles from the videos with and without interactions do not show a clear difference.

Fig. 5 Features p(S) extracted using the proposed method. (a-c): features extracted from three videos with human interactions; (d-f): features extracted from three videos without human interactions

We also apply the same nearest neighbor algorithm to classify the same 30 videos using the features extracted by the method in Ref.[16]. The resulting precision and recall are shown in Table 1. We can see that its performance is much lower than that of the proposed method.

Fig. 6 Features extracted using the method developed in Ref.[16], on the same six videos as in Fig. 5

4 Conclusion

In this paper, we developed a new method to recognize whether two persons have any interaction when they walk toward each other in a video. We applied the lattice cut to segment each frame into a set of superpixels with fixed spatial relations and then used the optical flow to extract motion features on each lattice superpixel.

By combining the features on all the superpixels in all the frames, we developed an algorithm to divide the video into three periods: two persons walking toward each other, two persons meeting each other, and two persons walking away from each other. We finally compared the motion features in these three periods to determine whether the video contains human interactions. This method does not depend on supervised training on a large set of annotated videos. Instead, we analyzed the underlying motion patterns to perform the classification. Therefore, we believe that the proposed method has good generalization ability for processing new videos with unseen interactions. In the experiments, we collected 31 real videos to test the proposed method, and the results showed very promising performance.

References

[1] Aggarwal J K, Ryoo M S. Human activity analysis: A review [J]. ACM Computing Surveys, 2011, 43(3): 1601-1643.
[2] Lucas B D, Kanade T. An iterative image registration technique with an application to stereo vision [C]//Proceedings of Imaging Understanding Workshop. Washington D C: IEEE Press, 1981: 121-130.
[3] Moore A, Prince S J, Warrell J, et al. Superpixel lattices [C]//IEEE Conference on Computer Vision and Pattern Recognition. Washington D C: IEEE Press, 2008, 1(12): 998-1005.
[4] Moore A, Prince S J, Warrell J, et al. Scene shape priors for superpixel segmentation [C]//IEEE International Conference on Computer Vision. Washington D C: IEEE Press, 2009: 771-778.
[5] Moore A, Prince S J, Warrell J. "Lattice Cut"-constructing superpixels using layer constraints [C]//IEEE Conference on Computer Vision and Pattern Recognition. Washington D C: IEEE Press, 2010: 2117-2124.
[6] Ke Y, Sukthankar R, Hebert M. Efficient visual event detection using volumetric features [C]//IEEE International Conference on Computer Vision. Washington D C: IEEE Press, 2005, 1: 166-173.
[7] Ke Y, Sukthankar R, Hebert M. Event detection in crowded videos [C]//IEEE International Conference on Computer Vision. Washington D C: IEEE Press, 2007, 1: 1-8.
[8] Ke Y, Sukthankar R, Hebert M. Spatio-temporal shape and flow correlation for action recognition [C]//IEEE Workshop on Visual Surveillance. Washington D C: IEEE Press, 2007, 1: 1-8.
[9] Yilmaz A, Shah M. Recognizing human actions in videos acquired by uncalibrated moving cameras [C]//IEEE International Conference on Computer Vision. Washington D C: IEEE Press, 2005, 1: 150-157.
[10] Zheng H, Li Z, Fu Y. Efficient human action recognition by luminance field trajectory and geometry information [C]//IEEE International Conference on Multimedia and Expo. Washington D C: IEEE Press, 2009: 842-845.
[11] Zhou Y, Yan S, Huang T. Pair-activity classification by bi-trajectories analysis [C]//IEEE Conference on Computer Vision and Pattern Recognition. Washington D C: IEEE Press, 2008: 1-8.
[12] Ryoo M S, Aggarwal J K. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities [C]//IEEE International Conference on Computer Vision. Washington D C: IEEE Press, 2009: 1593-1600.
[13] Ni B, Yan S, Kassim A. Recognizing human group activities with localized causalities [C]//IEEE Conference on Computer Vision and Pattern Recognition. Washington D C: IEEE Press, 2009: 1470-1477.
[14] Natarajan P, Nevatia R. Coupled hidden semi-Markov models for activity recognition [C]//IEEE Workshop on Motion and Video Computing. Washington D C: IEEE Press, 2007: 1-10.
[15] Niebles J C, Han B, Li F F. Efficient extraction of human motion volumes by tracking [C]//IEEE Conference on Computer Vision and Pattern Recognition. Washington D C: IEEE Press, 2010: 655-662.
[16] Tran D, Sorokin A. Human activity recognition with metric learning [C]//European Conference on Computer Vision. Berlin: Springer-Verlag, 2008: 1-14.
[17] Oliver N, Horvitz E, Garg A. Layered representations for human activity recognition [C]//IEEE International Conference on Multimodal Interfaces. Washington D C: IEEE Press, 2002: 3-8.
[18] Nguyen N T, Phung D Q, Venkatesh S, et al. Learning and detecting activities from movement trajectories using the hierarchical hidden Markov models [C]//IEEE Conference on Computer Vision and Pattern Recognition. Washington D C: IEEE Press, 2005, 2: 955-960.
[19] Zhang D, Gatica-Perez D, Bengio S, et al. Modeling individual and group actions in meetings with layered HMMs [J]. IEEE Transactions on Multimedia, 2006, 8(3): 509-520.
[20] Damen D, Hogg D. Recognizing linked events: Searching the space of feasible explanations [C]//IEEE Conference on Computer Vision and Pattern Recognition. Washington D C: IEEE Press, 2009: 927-934.
[21] Yin J, Meng Y. Human activity recognition in video using a hierarchical probabilistic latent model [C]//CVPR Workshop. Washington D C: IEEE Press, 2010: 15-20.
[22] Joo S W, Chellappa R. Attribute grammar-based event recognition and anomaly detection [C]//CVPR Workshop. Berlin: Springer-Verlag, 2006: 107-114.
[23] Gupta A, Srinivasan P, Shi J, et al. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos [C]//IEEE Conference on Computer Vision and Pattern Recognition. Washington D C: IEEE Press, 2009: 2012-2019.
[24] Schüldt C, Laptev I, Caputo B. Recognizing human actions: A local SVM approach [C]//Proceedings of the International Conference on Pattern Recognition, 2004, 3: 32-36.
[25] Blank M, Gorelick L, Shechtman E, et al. Actions as space-time shapes [C]//IEEE International Conference on Computer Vision, 2005, 2: 1395-1402.
[26] Weinland D, Ronfard R, Boyer E. Free viewpoint action recognition using motion history volumes [J]. Computer Vision and Image Understanding, 2006, 104(2): 249-257.
[27] Rodriguez M D, Ahmed J, Shah M. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition [C]//IEEE Conference on Computer Vision and Pattern Recognition. Washington D C: IEEE Press, 2008: 1-8.
[28] Laptev I, Marszalek M, Schmid C, et al. Learning realistic human actions from movies [C]//IEEE Conference on Computer Vision and Pattern Recognition. Washington D C: IEEE Press, 2010: 1-8.
[29] Levinshtein A, Stere A, Kutulakos K, et al. TurboPixels: Fast superpixels using geometric flows [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(12): 2290-2297.
[30] Martin D R, Fowlkes C C, Malik J. Learning to detect natural image boundaries using local brightness, color, and texture cues [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(5): 530-549.

