
Human Classification, Activity Recognition, Object Detection and Human Object Interaction

Mihir Patankar University of California San Diego

San Diego, USA [email protected]

Abstract—Human detection, tracking, and activity recognition are important areas of research, with applications ranging from entertainment and robotics to surveillance and elderly care. In this project, I implemented and developed human activity and object classification algorithms using the Microsoft Kinect v1 and MATLAB R2016b/R2017a on Windows. The dataset used for activity classification was CAD-60, which contains 17 activities performed by 4 people, together with a manually collected dataset for detecting a microwave. A cascade object classifier built from multiple stages of one-level decision trees was used for object detection, while a 10-layer neural network was used for activity classification in real time. The neural network has an accuracy of 99.9% and was also tested on real data. The detection results are included in the report. The two classifiers were combined to detect human-object interaction based on proximity.

Keywords—Activity classification, Human classification, Object classification, Human-Object Interaction

I. INTRODUCTION

Human detection and tracking algorithms face various challenges arising from different human poses, illumination conditions, complex backgrounds, and occlusion. Given the importance of human detection in applications such as self-driving cars, elderly care, and human-robot interaction, the task must be performed accurately and in real time; failure to do so can lead to undesirable consequences. Hence, much research has focused on developing algorithms that not only detect and track humans accurately but also execute in real time. Similarly, detecting objects of different sizes, orientations, and illuminations in cluttered environments is challenging, and the same object can come in many different designs. Human activities are closely linked with various objects, and detecting those objects can help significantly in recognizing activities and adding context to the situation. For example, after detecting kitchen objects such as a refrigerator or a microwave, it becomes easier to narrow the search to kitchen-related activities rather than considering all possible ones. Human activities generally follow a particular pattern of joint motions that can be tracked easily. However, recognition remains challenging given the similarity between various activities; the different ways in which people perform the same activity, as well as multitasking, pose further issues.

While wearable sensors can make detection independent of pose and joint orientation, they come with drawbacks: they require continuous maintenance and are intrusive. RGB cameras, on the other hand, may invade the privacy of the monitored people if the sensing or the computation is done remotely. This problem is addressed here by using only depth data from the IR depth sensor for activity classification. RGB data is used for object classification, but it is processed locally, which resolves the privacy issue. To make the sensing independent of orientation and pose, data from multiple imaging sensors can be fused together. In this work, activity classification and object classification are performed in real time, and the two systems are combined to detect human-object interaction based on proximity. Activity classification is performed using machine learning algorithms such as k-NN, SVM, and decision trees, as well as a 10-layer neural network. A comparison of all the algorithms is provided. Various techniques are also used for object detection, such as feature extraction, flow tracking, a cascade object classifier, and R-CNN, and a comparative study is provided.

II. RELATED WORK

[1] Human activity recognition is a difficult problem because of the large number of human activities, the lack of a standard vocabulary to define them, the way activities compose across time and space, and the lack of knowledge about which body measurements characterize them. An activity recognition algorithm should be robust to noise, discriminative, and capable of rejecting unknown activities. The algorithm described in the paper works as follows. A histogram of the human silhouette is obtained through background subtraction. Optical flow measurements in the x and y directions are obtained using the KLT algorithm and smoothed with a median filter. The bounding box is split into n×n windows, and a descriptor is obtained for each window by binning; the combined vector is then used as the feature. The previous and next 5 frames are used for motion context and combined with the feature vector after dimensionality reduction using PCA.


The total feature dimension is 286. Several action classification models are evaluated, including Naive Bayes, 1-NN, and 1-NN with rejection. The main contribution of the paper is 1-NN with a metric learning algorithm called LMNN (Large Margin Nearest Neighbours), which maximizes the distance between samples with different labels and minimizes the distance between samples with the same label by optimizing a distance metric D. The video is subsampled before the algorithm is applied. The main advantages of the algorithm are that it is particularly useful when only a few samples are available and that it can reject unknown activities. The major drawbacks are that it needs both the history and the future of the motion, and is thus not suitable for real time, and that KLT is computationally intensive and difficult to run in real time. Background subtraction is noisy and not illumination invariant. The method also uses RGB data, which raises privacy concerns, and the paper does not report execution times.

[2] While much research has been done on activity recognition using wearable sensors, activity recognition using RGB-D data is an ongoing research topic. RGB-D sensors have the inherent advantage of being non-intrusive and protecting privacy through the use of depth data. The work uses a Microsoft Kinect to obtain skeleton data for human activities, k-means clustering to extract key poses from each activity, and finally a multi-class SVM to classify the activities. Most previous works have used SVMs, HMMs, and random forests for activity classification. The skeleton data extracted using the Kinect is normalized to make it invariant to the person and to scale. The N feature vectors are compressed into k clusters represented by their centroids, and this set of centroids is used as the feature vector for the SVM. The algorithm was tested on 5 public datasets, giving state-of-the-art or better accuracy. Its main advantages are that it is easy to implement, achieves high accuracy, and can run in real time; moreover, the k-means clustering and multi-class SVM can easily be replaced by other clustering algorithms and classifiers. Its drawbacks are that it cannot distinguish activities that are very similar to each other, for example talking on the phone versus drinking water, and that it performs poorly on complex activities.

[3] Along with recognizing objects, it is necessary to understand their interaction with humans in order to understand what is happening in an image. An image can be labelled with a triplet <subject, verb, object>. The wide variety and granularity of human actions, and the different ways in which humans can interact with objects, pose challenges to this task; humans also perform multiple actions simultaneously. The object position can be narrowed down by using the human pose and actions.

The method improves accuracy by 26% over prior approaches and runs at 135 ms/image, giving it potential for real-time use. A human-centric branch is added to a standard object detection network; it performs action classification and density estimation on the regions of interest (RoIs) associated with a person. For every action, the density estimator predicts a 4-D Gaussian distribution that models the likely position of the object relative to the person. Every returned pair of person and object bounding boxes is given a triplet score and an action label. Thus, the human-centric recognition branch, the standard object detection branch, and a simple pairwise interaction branch together form the learning system. The method gives high precision and accuracy, and since it builds on standard Faster R-CNN code, it is easy to extend. Its drawbacks are that it is computationally intensive and cannot currently run in real time.

[4] This paper introduced a novel algorithm for speeding up object detection. The bottleneck was feature extraction at various positions and scales in the image; the paper introduced the concept of the integral image, which allows features to be extracted in constant time. It also introduced the cascade classifier, which combines a number of stages of weak learners to obtain the final output, where each weak learner depends on a single feature. Another major contribution was the operation of the cascade, which rejects negative regions as early as possible and spends more time on the data that is actually important. Overall, this method gives very high detection accuracy at a much lower computational cost. The algorithm ran at 15 fps on an Intel Pentium III 700 MHz machine and can run in real time on the better processors available now.

[5] This paper presents a Bayesian model that integrates information from several tasks, namely scene analysis, human motion/pose recognition, object detection, and object reaction, for observing human-object interactions. This is particularly useful for differentiating objects that have similar shapes or features, such as a spray bottle and a drinking bottle, and actions that are similar in nature, such as running and kicking a ball. The paper makes two major contributions: using actions and object reactions to localize and recognize objects, and using object context and object reactions to aid action recognition.

The remainder of this report is organized as follows. Section III describes the collection of training data. Section IV describes human and activity classification using the data obtained. Section V describes the methods employed for object detection and tracking. Section VI combines activity classification and object detection for human-object interaction, and Section VII concludes the work.


III. SCALED TRAINING DATA

The Kinect library returns a set of 20 skeleton joint coordinates. These are world coordinates with the Kinect as the origin, and each joint Ji is represented as a vector of 3 values. An issue with using raw coordinates for activity classification is that the same human activity would produce different coordinate values depending on scale and position [2]. To make the training data independent of the scale and position of the human, all joint positions are made relative to the torso and normalized by the neck-to-torso distance. Let Jtorso be the coordinates of the torso joint and Jneck the coordinates of the neck joint; the scaled coordinates of any other joint Ji are then calculated as

Ji' = (Ji − Jtorso) / ‖Jtorso − Jneck‖

where ‖·‖ denotes the Euclidean distance between the two joints. The resulting set of 20 vectors is stored in a comma-separated value (CSV) file, which is used to train and test the machine learning or neural network algorithms.
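As a minimal sketch of this scaling step (it assumes the 20 joints arrive as a 20×3 matrix J; the torso and neck row indices are hypothetical and depend on the joint order returned by the Kinect library):

    % Scale skeleton joints relative to the torso, normalised by the
    % neck-to-torso distance (Section III). Joint indices are assumptions.
    TORSO = 1; NECK = 2;                % hypothetical row indices into J
    d = norm(J(TORSO,:) - J(NECK,:));   % neck-to-torso distance (a scalar)
    scaled = (J - J(TORSO,:)) / d;      % torso-relative, scale-invariant joints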

IV. CLASSIFICATION FROM SKELETON DATA

Human classification and activity classification are performed using data obtained from a publicly available dataset and a custom dataset.

1. HUMAN ACTIVITY CLASSIFICATION

The Cornell Activity Dataset (CAD-60) [6] is used as training data. It comprises 60 RGB-D videos of 4 people performing 17 different activities. The dataset provides text files containing the RGB values, depth values, and skeleton joint locations for 15 joints, which can be converted into a CSV file and used for training. Combined across all activities and people, this yields 80,312 training samples. The trained machine learning algorithms or neural networks can then be tested on real-time data taken from the Microsoft Kinect v1: the data from the Kinect is scaled as described in Section III and passed to the algorithms, which classify it as an activity. The results obtained using the machine learning algorithms are given below in Table 1 and Figure 1.
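As a minimal sketch of training such baselines (this assumes MATLAB's Statistics and Machine Learning Toolbox; the file name and variable names are illustrative, not the project's actual code):

    % Load scaled skeleton features; the last column is assumed to be the label.
    data = csvread('cad60_scaled.csv');             % hypothetical file name
    X = data(:, 1:end-1);  y = data(:, end);

    knnModel  = fitcknn(X, y, 'NumNeighbors', 5);   % k-NN
    svmModel  = fitcecoc(X, y);                     % multi-class SVM via ECOC
    treeModel = fitctree(X, y);                     % decision tree

    yhat = predict(knnModel, X);                    % classify samples
    acc  = mean(yhat == y);                         % fraction correctly labelled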

Table 1

Figure 1

As Figure 1 shows, for most algorithms, training on persons 1 and 4 and testing on person 2 gives better results than training on persons 1, 2, and 3 and testing on person 4. This is because person 3 is left-handed while persons 1, 2, and 4 are right-handed, and training and testing only on right-handed people gives better results. Using a neural network with 10 hidden layers in MATLAB R2016b, the data was divided into 70% training, 15% validation, and 15% test data. The results obtained after training are shown in Figure 2. The percentage errors on the validation and test data are 0.0183% and 0.027% respectively. However, the network has a few detection issues when tested on real-time data.
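A minimal sketch of this training setup follows (it assumes MATLAB's Neural Network Toolbox patternnet; the argument to patternnet sets the hidden layer sizes, and the exact architecture used in the project may differ):

    data = csvread('cad60_scaled.csv');    % hypothetical file name, as above
    X = data(:, 1:end-1)';                 % patternnet expects one sample per column
    T = full(ind2vec(data(:, end)'));      % one-hot activity targets

    net = patternnet(10);                  % hidden layer configuration (assumption)
    net.divideParam.trainRatio = 0.70;     % 70% training
    net.divideParam.valRatio   = 0.15;     % 15% validation
    net.divideParam.testRatio  = 0.15;     % 15% test
    net = train(net, X, T);

    activity = vec2ind(net(X));            % predicted activity indices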

Figure 2

The trained neural network was used for real-time classification; three recognized activities are shown in Figure 3: relaxing on a couch, writing on a whiteboard, and working on a computer.

Figure 3


When the neural network is trained only on persons 1, 2, and 4, who are right-handed, it gives much better results on real-time test data for right-handed people than when all 4 people, one of whom is left-handed, are used. The percentage of data used for training was also increased from 70% to 90%, and the number of hidden layers from 10 to 15. The activity labelled "random" in the dataset, which initially caused a few detection errors, was excluded from training. The results obtained after retraining the neural network are shown in Figure 4.

Figure 4

When the network was tested on real-time video, the following actions could be recognized in real time.

Figure 5

Thus, as Figure 5 shows, 8 of the 13 activities could be correctly classified in real time by the neural network.

2. HUMAN CLASSIFICATION

Human classification was performed using the absolute distances of all 20 skeleton joints. The system was initially tested on 2 humans, after which test data from a third human was added. The training data obtained earlier was problematic because it was not normalized: the absolute joint lengths differed with scale, orientation, and position. To tackle this problem, the joint distances were normalized by the neck-to-torso distance, as described before. The results obtained using various machine learning algorithms are summarized in Table 2 and Figure 6 below.
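A minimal sketch of such features is given below (it assumes the distances are measured from the torso joint, consistent with the scaling in Section III; this interpretation and the variable names Dtrain and humanLabels are assumptions):

    % Per-joint distance features for human classification. 'scaled' is the
    % 20x3 matrix of torso-relative, neck-to-torso-normalised joints.
    d = sqrt(sum(scaled.^2, 2))';               % 1x20 normalised joint distances
    humanModel = fitcknn(Dtrain, humanLabels);  % Dtrain: rows of such d vectors
    who = predict(humanModel, d);               % identify the person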

Figure 6

Table 2

V. OBJECT DETECTION AND TRACKING

Object detection is the process of finding instances of interest in an image frame. It can be used for surveillance, monitoring, and autonomous vehicles, and in this work it also adds context to activity recognition. Several techniques were used for object detection and tracking; they are described below along with the results obtained.

1. FEATURE EXTRACTION [7]

This approach uses Speeded Up Robust Features (SURF), a local feature detector and descriptor inspired by the Scale-Invariant Feature Transform (SIFT). Interest points are extracted from a reference image containing the object of interest, and the 100 strongest points are stored. For detection, SURF features are extracted from the scene and matched against the stored points. An affine transformation is then fitted so that the object can be found at any orientation in the image. From the matched points, the extreme top, bottom, left, and right points are used to draw a bounding box around the detected object. Figure 7 shows the 100 strongest points in the reference image, Figure 8 shows the scene in which the object is to be detected, and Figure 9 shows the detected object in the cluttered scene, marked with a yellow polygon.
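A minimal sketch of this pipeline in MATLAB's Computer Vision Toolbox is shown below (the image file names are hypothetical):

    ref   = rgb2gray(imread('microwave_ref.png'));   % reference image (assumption)
    scene = rgb2gray(imread('scene.png'));           % cluttered scene (assumption)

    refPts   = selectStrongest(detectSURFFeatures(ref), 100);
    scenePts = detectSURFFeatures(scene);
    [refF, refPts]     = extractFeatures(ref, refPts);
    [sceneF, scenePts] = extractFeatures(scene, scenePts);

    pairs = matchFeatures(refF, sceneF);             % matched point pairs
    [tform, ~, ~] = estimateGeometricTransform( ...  % affine fit rejects outliers
        refPts(pairs(:,1)), scenePts(pairs(:,2)), 'affine');

    corners = [1 1; size(ref,2) 1; size(ref,2) size(ref,1); 1 size(ref,1)];
    polygon = transformPointsForward(tform, corners);  % object outline in the scene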

Figure 7

Figure 8

Figure 9

The same algorithm can be applied to video by breaking it into frames and extracting features from each frame, which are then matched against the reference image to identify the object. However, this method depends on the illumination of the surroundings and is slow, since extracting features from every frame is computationally intensive.

2. OPTICAL FLOW [8]

To avoid the cost of extracting features from every frame of the video stream, the algorithm is initialized with a bounding box that is either marked manually or detected by the algorithm described in the previous section. The box is then tracked using the Kanade-Lucas-Tomasi (KLT) algorithm, which tracks a set of points over the frames. KLT assumes that the flow is constant in a local neighbourhood of the pixels under consideration and solves the optical flow equations for all pixels in that neighbourhood using a least-squares criterion. Figure 10 shows a set of manually marked points detected in the first frame; the green points are tracked over the frames of the video.
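A minimal sketch of this tracking loop (using vision.PointTracker from the Computer Vision Toolbox; the video source and initial points are assumed to come from the previous step):

    tracker = vision.PointTracker('MaxBidirectionalError', 2);
    initialize(tracker, points, firstFrame);       % points: Nx2 [x y] to track

    while hasFrame(reader)                         % reader: a VideoReader (assumption)
        frame = readFrame(reader);
        [points, valid] = step(tracker, frame);    % updated point locations
        p = points(valid, :);
        bbox = [min(p) max(p) - min(p)];           % [x y w h] around tracked points
    end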

Figure 10

The green points of interest are tracked over time, and a bounding box can be drawn around them as described before. Figure 11 shows one frame of the video in which the points are tracked and a red bounding box is drawn around them.


Figure 11

The drawback of this method is again sensitivity to illumination: not all of the tracked points may be visible in every frame, which can lead to tracking errors. Moreover, not all objects have distinctive SURF or other features, such as corners and edges, that can be tracked reliably.

3. CASCADE OBJECT CLASSIFICATION [9]

A cascade classifier consists of multiple stages, where each stage is a collection of weak learners. The weak learners are simple classifiers called decision stumps, which are one-level decision trees. Each stage is trained using boosting, which builds a highly accurate classifier from a weighted average of the decisions made by the weak learners. A sliding window passes over the image, and each stage labels the current region as either positive, indicating that an object is found, or negative. If the label is negative, the window moves on; otherwise the region is passed to the next stage. Only if the final stage labels the region as positive is the output of the entire classifier positive. Thus, negative regions are rejected as quickly as possible, while positive samples pass through all the stages of the classifier, since they are rare and worth the effort.

Each stage should therefore have a low false negative rate, since a sample classified as negative is rejected immediately and never gets a second chance. On the other hand, a stage can tolerate a high false positive rate, since the detector gets many more chances to reject the region in later stages.

The cascade classifier used here has the following parameters: 841 positive samples, 807 negative samples, 5 cascade stages, and a false alarm rate of 0.01. The trained classifier was tested on a video; Figure 12 shows one sample frame with the microwave labelled. The major advantages of this method are that it runs in real time with very high accuracy, and that the classifier can be trained to handle occlusions and different illuminations. To make the classifier more robust, the number of positive and negative images needs to be increased, and the positive images need to cover different scales, rotations, and illuminations.
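As a minimal sketch of training and running the detector with the parameters quoted above (using trainCascadeObjectDetector from the Computer Vision Toolbox; positiveInstances and negativeFolder are placeholders for the labelled microwave data):

    trainCascadeObjectDetector('microwave.xml', positiveInstances, negativeFolder, ...
        'NumCascadeStages', 5, 'FalseAlarmRate', 0.01);

    detector = vision.CascadeObjectDetector('microwave.xml');
    bboxes = step(detector, frame);                  % [x y w h] per detection
    out = insertObjectAnnotation(frame, 'rectangle', bboxes, 'microwave');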

Figure 12

4. R-CNN [10]

R-CNN has two parts. The first proposes regions that have a high probability of containing the object of interest; a typical image yields approximately 2000 region proposals. The second part runs a CNN on each region proposal, which is computationally intensive and makes the method extremely slow: execution takes almost 50 s per image. The CNN used is ZFNet, an improvement over AlexNet, implemented in MATLAB R2016b on a machine with an NVIDIA GeForce 840M 2 GB GPU. Compared to the cascade classifier described earlier, R-CNN has a much lower frame rate, while the cascade classifier can run in real time. One possible improvement is to increase the number of training samples, as deep learning methods generally require much larger datasets than other methods. Also, by rejecting detections with low confidence, the false positive rate can be reduced. Figures 13, 14, and 15 below show a correct detection, a false detection with high confidence, and a false detection with low confidence, which can be rejected.
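A minimal sketch of this setup (using trainRCNNObjectDetector from the Computer Vision Toolbox; the ground-truth table, network layers, training options, and confidence threshold are assumptions):

    % groundTruth: table with an image-file column and microwave bounding boxes.
    opts = trainingOptions('sgdm', 'MaxEpochs', 10);   % illustrative options
    rcnn = trainRCNNObjectDetector(groundTruth, layers, opts);

    [bboxes, scores] = detect(rcnn, testImage);
    keep = scores > 0.9;                               % reject low-confidence hits
    bboxes = bboxes(keep, :);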

Figure 13


Figure 14

Figure 15

5. FASTER R-CNN [11]

The major difference between R-CNN and Faster R-CNN lies in how regions are selected and classified. R-CNN and Fast R-CNN use a region proposal algorithm as a pre-processing step before running the CNN; in the case of Fast R-CNN, this step becomes the processing bottleneck. Faster R-CNN addresses the issue by implementing the region proposal mechanism with the CNN itself, making region proposal part of the CNN training and prediction steps. Training the network with 103 positive and 807 negative instances still takes around 113 minutes. The speed improvement over R-CNN brings detection down to around 0.2 seconds per image, or almost 5 fps, which is still slower than real time. Note that Faster R-CNN was introduced in MATLAB only in version R2017a and does not work in earlier versions. R-CNN and Faster R-CNN give high accuracy at the cost of speed; considering this trade-off, the cascade object classifier is the best choice for applications in home conditions, where speed is more important than accuracy.
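A minimal sketch (trainFasterRCNNObjectDetector requires R2017a or later, as noted above; the ground-truth table, layers, and options are assumptions):

    opts  = trainingOptions('sgdm', 'MaxEpochs', 10);      % illustrative options
    frcnn = trainFasterRCNNObjectDetector(groundTruth, layers, opts);
    [bboxes, scores] = detect(frcnn, testImage);           % ~0.2 s/image per the text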

VI. HUMAN-OBJECT INTERACTION

As described in the previous section, the location of the microwave can be obtained using the cascade object classifier, which returns the labelled area in which the object of interest, in this case the microwave, is present.

The human skeleton joint positions can be obtained either in the image frame or in the real world with the Kinect as the origin of the coordinate system. The (x, y) locations in the image frame for both the human and the microwave are therefore used to index the depth matrix, giving each depth in metres from the Kinect device. Figure 16 shows the microwave labelled in yellow with its position displayed in green text, while the position of the human is displayed in red text.

Figure 16

Interaction between the human and the microwave is detected from the proximity of the human in terms of (x, y) in the image frame and z in the world frame of reference. The output of the algorithm is shown in Figure 17: the positions of the human and the microwave are labelled and compared for proximity, and the blue text indicates that the human is near the microwave.
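A minimal sketch of the proximity test (the thresholds, joint choice, and variable names are illustrative assumptions; depthMap is the Kinect depth image in metres):

    handXY  = jointPixels(HAND_RIGHT, :);               % [x y] in image frame (assumption)
    microXY = [microBox(1) + microBox(3)/2, ...         % centre of detected box
               microBox(2) + microBox(4)/2];

    handZ  = depthMap(round(handXY(2)),  round(handXY(1)));   % depth lookup (m)
    microZ = depthMap(round(microXY(2)), round(microXY(1)));

    near = norm(handXY - microXY) < PIX_THRESH && abs(handZ - microZ) < Z_THRESH;
    if near
        disp('Human is near the microwave');
    end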

Figure 17

VII. CONCLUSION AND FUTURE WORK

In this work, real-time activity classification, object classification, and proximity-based human-object interaction detection were implemented using the Microsoft Kinect v1 and MATLAB R2016b/R2017a. A comparative study of execution time and accuracy is presented for both activity and object classification.


The parameters of the neural network, such as the amount of training data, the training rate, and the number of layers, can be fine-tuned to obtain better results. Other publicly available datasets can be used for training to obtain more robust detection. More household objects can be detected using the cascade classifier to increase the scope of the project; ImageNet, a publicly available image database with bounding-box annotations, can be used to increase the number of objects that can be recognized. The activity classification currently works for one human and could be extended to up to six people at the same time. Finally, video summarization techniques, such as those developed by Prof. Jeff Bilmes at the University of Washington, could be used to process the data efficiently and remove redundancies.

REFERENCES

[1] D. Tran et al., "Human Activity Recognition with Metric Learning", 2008.
[2] E. Cippitelli et al., "A Human Activity Recognition System Using Skeleton Data from RGBD Sensors", 2016.
[3] G. Gkioxari et al., "Detecting and Recognizing Human-Object Interactions", 2017.
[4] P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features", 2001.
[5] A. Gupta et al., "Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition", 2009.
[6] Cornell Activity Dataset (CAD-60): Download and Documentation.
[7] H. Bay et al., "Speeded-Up Robust Features", 2006; and Feature Extraction, MathWorks Documentation.
[8] Object Tracking and Motion Estimation, MathWorks Documentation.
[9] Cascade Object Classification, MathWorks Documentation.
[10] Detect Objects Using R-CNN, MATLAB Documentation.
[11] S. Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", 2016; and MathWorks Documentation.

