
Video Retrieval System Based on 3D Point Cloud Processing for Assisting Karate-kata Practice

Kazumoto Tanaka
Faculty of Engineering, Kinki University

[email protected]

Abstract

This paper presents a video retrieval system to assist trainees with karate-kata practice. The system uses a depth sensor and a color video camera to record the movements of karate-kata practitioners, who can then retrieve selected images. The system consists of two phases: a phase in which images are accumulated using the depth sensor and video camera, followed by a second phase in which the 3D point cloud of a stationary posture is used to retrieve the image associated with that posture. The proposed system was tested by asking two practitioners of the sport to execute the segments of a karate-kata, which were then recorded. One of the practitioners was then asked to retrieve each segment to assess the successful retrieval rate. The system was found to perform well.

1. Introduction

It is well known that the acquisition of sport skills by trainees is facilitated by self-observation of their actions through video images [1][2], a learning approach known as "video feedback." However, apart from special gyms where trained staff members are available to operate video equipment, video feedback is not used routinely at ordinary sites where sport is practiced. The main obstacle to its wider utilization is the trouble of locating the scene of interest within video image files. A function that enables automatic image retrieval from video scenes would therefore be key to ensuring the wider utilization of video feedback. The purpose of this study was thus to develop a video retrieval system for video image files of body actions. As a first step, previous work focused on video feedback for karate-kata practice along with the automatic retrieval of images displaying practicing participants [3]. The present paper describes a novel video retrieval system for karate-kata. The unique feature of the system is that it employs both a color video camera and a depth sensor, with the information obtained from the depth sensor utilized for image retrieval and the color video images utilized for video feedback.

Existing image retrieval methods that have already been put into practical use include the use of metadata attached to images according to the MPEG-7 standard [4]. Two methods are available to describe MPEG-7 metadata: manual description and the automatic extraction of features from video images for use as metadata. Manual description requires either the trainee or the coach to input the data after every practice session, which is burdensome and impractical. Although automatic image annotation has been the subject of various studies [5][6], the descriptive ability of this approach remains insufficient for body actions. The automatic feature extraction method of MPEG-7 utilizes image texture, a color histogram, contour shape, and image moments [7]. The disadvantage of this method is that image retrieval may fail when the camera changes position or the photographic subject changes position and posture, even within the same scene. To overcome these problems, many studies of feature matching that is robust against translation, rotation, and changes in scale have been conducted to date [8][9][10][11][12]. Nevertheless, the limitation of matching with two-dimensional (2D) image features is that three-dimensional (3D) rotation of the camera or the photographic subject can reveal details that were obscured in the original image, and vice versa.

Depth sensors have become increasingly available and affordable in recent times (e.g., the Kinect sensor by Microsoft Corp.), which has facilitated the simple measurement of 3D coordinate data of the surface of an object (a 3D point cloud). This has actively promoted studies of action recognition and image retrieval using these sensors [13]. The use of 3D features is greatly beneficial in that feature matching remains possible even when 3D rotation arises. In addition, environments supporting research and development related to the use of 3D point clouds have rapidly become available; for example, the enhancement of the Point Cloud Library (PCL) for handling 3D point clouds.

The use of libraries made available by depth sensor manufacturers (the Kinect SDK by Microsoft Corp. and others) and by PCL has enabled the detection of the human figure in the image recorded by the depth sensor (the depth image). In addition, the 3D point cloud of scenes photographed by the depth sensor has enabled the estimation of the 3D coordinates of human joints. However, for the karate movements examined in this study, the reliability of joint detection is not high because trainees are required to wear a bulky karate uniform in which the body shape is only slightly distinguishable. In particular, joint estimation fails when a target assumes a crosswise posture in which half of the body is hidden, as shown in Figure 1. Therefore, this study employed the 3D point cloud of the trainee instead of joint information and performed 3D point cloud matching to enable image retrieval. This paper also describes the point cloud matching method and the experimental assessment of video retrieval for video feedback.

Image retrieval with the aim of assisting the acquisition of sport skills has not been sufficiently studied to date. For example, in a study of image retrieval for video feedback in tennis coaching [14], a scene was recognized by tracking the movements of the player and ball, but this did not involve the retrieval of body actions. In another study, a reference body movement model was searched for in an exemplary movement database, mainly using information about the rotations of the trainee's joints obtained from a depth sensor [15]. However, because the target movements in that study, such as arm rotation, were simple compared with karate techniques, and the subjects were not required to wear bulky clothes such as a karate uniform, information about joint movements was readily available for retrieval.

Figure 1. Trainee with joints estimated by Kinect SDK. The estimation failed for several joints due to the trainee’s crosswise posture and karate uniform.

On the other hand, karate players assume various postures when performing karate-kata; several parts of their body are thus often occluded from the depth sensor, which causes joint estimation errors. Therefore, the video retrieval system proposed here utilizes information from the 3D point cloud of the postures for retrieval purposes. The approach the author followed to use the 3D point cloud is described in detail in the next section.

2. Methods

2.1. Video Retrieval System for Karate-kata

The system consists of two phases (see Figure 2): an image data accumulation phase using both a color video camera and depth sensor, followed by a video retrieval phase using the 3D point cloud obtained from the depth sensor. Karate-kata practice with video feedback proceeds by repeating the two phases alternately. Each of these two phases is described as follows.


Figure 2. Flow of karate-kata practice using the video retrieval system.

Image data accumulation phase: Practice sessions are photographed by simultaneously using both a color video camera and depth sensor. The color images are stored chronologically in a database together with the depth images. The 3D point cloud of a trainee is subsequently extracted for use in the video retrieval phase.

Video retrieval phase: The "search query" for the retrieval is the 3D point cloud of the trainee performing the karate techniques of the scene to be viewed. The query is assigned by having the trainee demonstrate the same movement in front of the depth sensor and extracting the 3D point cloud. The query point cloud is then compared with each entry of the time series of 3D point clouds stored during the image data accumulation phase. Finally, the video images beginning from the color image with the same time stamp as the retrieved 3D point cloud are presented to the trainee.
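
The retrieval loop can be summarized as follows. This is a minimal sketch rather than the paper's implementation: the data structure and function names are hypothetical, and register and dmhd stand in for the registration and distance steps described in Section 2.2.

```python
# Minimal sketch of the video retrieval phase. The StoredFrame fields
# and the register/dmhd callables are illustrative assumptions; the
# actual registration and distance measures are defined in Section 2.2.
from dataclasses import dataclass
import numpy as np

@dataclass
class StoredFrame:
    timestamp: float    # shared time stamp of the color and depth images
    cloud: np.ndarray   # (N, 3) trainee point cloud from the depth sensor

def retrieve(query_cloud, stored_frames, register, dmhd):
    """Return the time stamp of the stored posture closest to the query;
    playback then starts at the color image with that time stamp."""
    best_t, best_d = None, float("inf")
    for frame in stored_frames:
        aligned = register(frame.cloud, query_cloud)  # 2D PCA (or 2D ICP)
        d = dmhd(aligned, query_cloud)                # directed MHD of Sec. 2.2
        if d < best_d:
            best_t, best_d = frame.timestamp, d
    return best_t
```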

2.2. 3D Point Cloud Matching Method

Karate-kata is a competition that incorporates a combination of basic karate techniques performed in about 2-3 min. The action of each technique is extremely fast and cannot be captured reliably by ordinary sensors. However, a karate-kata starts from a stationary state (referred to as a pause), and a pause designated as "kime" is performed immediately after the completion of each technique. In other words, a karate-kata consists of a repetition of segments running from a stationary state to a movement state (see Figure 3). Therefore, the 3D point clouds of the postures before and after the objective segment can be used as the "search query" (refer to the pictures in the lower part of Figure 3); a sketch of this scoring is given below. This enables the retrieval of karate-kata images that include fast movements; hence, the system employs the 3D point clouds of the stationary postures for retrieval purposes.
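
A hypothetical sketch of this scoring, assuming each stored segment carries the point clouds of its two bracketing postures (the same summed distance is used in the experiment of Section 3):

```python
# Hypothetical segment score: a segment is identified by the pair of
# "kime" postures that bracket it, so its distance to the query is the
# sum over both postures (dmhd is defined later in this section).
def segment_score(segment, query_before, query_after, dmhd):
    return (dmhd(segment.cloud_before, query_before) +
            dmhd(segment.cloud_after, query_after))
```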

Object recognition is an important field of computer vision research and can be divided into two types of problems: rigid object recognition and non-rigid (deformable) object recognition. The karate-kata postures in the stationary states are fixed postures; thus, karate-kata posture recognition can be considered a rigid object recognition problem. However, the karate uniform is deformable, which restricts the choice of recognition methods.

The feature-matching approach is popular for rigid object recognition. Although many local feature descriptors for encoding local 3D shapes have been proposed and have performed well [16][17][18][19], applying the local feature descriptor approach to karate-kata posture recognition may cause the matching process to fail because of inconsistent folding of the karate uniform. Figure 4 shows a 3D point cloud of a player wearing the uniform together with local features extracted using the normal aligned radial feature (NARF) [18]. The features on the uniform (especially on the upper clothing) change in location and number with every photograph taken after a karate movement, even though the trainee assumes the same posture.

Global feature descriptors that represent the full point cloud at once have also been studied for object recognition [20][21]. They employ information about the whole appearance and are thus more robust to noise than local feature descriptors. On the other hand, because of their global nature, recognition tends to fail more often than with local feature descriptors when partial occlusions occur [22]. This is a crucial problem for karate-kata posture recognition, because changes in the relative position and direction between the depth sensor and the posture arise naturally, and parts of the body are hidden as a result.


Another approach utilizes both 3D point cloud registration and point matching. Registration transforms two sets of 3D point clouds (the reference and target clouds) into the same coordinate system so as to align the clouds by minimizing the distance between them. Recognition is then based on assessing the correspondence between them (i.e., point matching) after the registration. The iterative closest point (ICP) algorithm and its variants have been proposed for the registration and are very popular in the field of image registration [23][24][25]. These algorithms transform one point cloud to best fit another by revising the transformation iteratively. They are quite useful and widely utilized, but the iteration requires a non-negligible amount of time when the algorithms are applied to retrieval.

Figure 3. Selected stationary states (“kime” postures) of karate-kata designated as “Pin-an Syodan”.

Principal component analysis (PCA) is a mathematical procedure that detects the principal axes of a dataset as the eigenvectors corresponding to the largest eigenvalues of the data covariance matrix. PCA has been applied to registration in the following manner [24][26]. The covariance matrix of each of the two clouds is calculated, and then the three main orthogonal axes are computed. The rotation transform matrix is determined from the normalized axis vectors, and the translation is determined from the distance between the centers of the two clouds. The system proposed in this paper employs PCA for the registration because it is very fast. Furthermore, to make the registration robust, 2D PCA was employed, based on the following fact about karate-kata postures: a karate-kata is performed on a flat plane, and thus the postures are fixed except for rotation around the vertical axis (see Figure 5). This allows the rotation for the registration to be performed only around the vertical axis. The 2D PCA is performed as follows.

Let P be a point cloud of a human as a reference and Q be a point cloud of a human as a target (i.e., the query). P and Q are projected onto the plane corresponding to the plane on which the karate-kata is performed (the xz-plane in Figure 5), and 2D images f_P(x,z) and f_Q(x,z) are obtained, respectively. The covariance matrix of each image is calculated by:

\mathrm{Cov}_J = \frac{1}{N_J} \sum_{i=0}^{N_J - 1} (\mathbf{p}_{iJ} - \bar{\mathbf{p}}_J)(\mathbf{p}_{iJ} - \bar{\mathbf{p}}_J)^{\mathrm{T}}    (1)

where N_J is the number of points, \mathbf{p}_{iJ} is the i-th point (2D position vector) of the image f_J, and \bar{\mathbf{p}}_J is the center of the points. The main direction \mathbf{u}_J (the eigenvector corresponding to the largest eigenvalue \lambda_J), along which the greatest variance of the points lies, is computed by singular value decomposition for each of f_P(x,z) and f_Q(x,z):

\mathrm{Cov}_J = U_J D_J V_J^{\mathrm{T}}    (2)

The angle \theta between the eigenvectors \mathbf{u}_P and \mathbf{u}_Q is calculated, and the rotation matrix (around the y-axis) is then determined by \theta.
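
A minimal numpy sketch of this registration, assuming (N, 3) point clouds with y as the vertical axis; the function names are illustrative, and the eigenvector sign ambiguity a real system must resolve is noted in a comment.

```python
# Sketch of the 2D PCA registration of Eqs. (1)-(2): project onto the
# xz-plane, take the main direction of each projection, and rotate the
# reference around the y-axis by the angle between the two directions.
import numpy as np

def main_direction_2d(cloud):
    """Return the centroid, main direction u_J, and largest eigenvalue
    lambda_J of the xz-projection f_J(x, z) of a (N, 3) cloud."""
    pts = cloud[:, [0, 2]]                      # f_J(x, z)
    centered = pts - pts.mean(axis=0)
    cov = centered.T @ centered / len(pts)      # Eq. (1)
    U, D, Vt = np.linalg.svd(cov)               # Eq. (2)
    return pts.mean(axis=0), U[:, 0], D[0]

def pca_register(P, Q):
    """Rotate/translate the reference cloud P so its xz main direction
    and center match those of the query Q; also return theta and the
    largest eigenvalues (used by the fallback condition, Eq. (3))."""
    cP, uP, lP = main_direction_2d(P)
    cQ, uQ, lQ = main_direction_2d(Q)
    # Note: eigenvectors are defined only up to sign, so theta is
    # ambiguous by pi; a real system must disambiguate, e.g. by
    # comparing the point distributions along the axis.
    theta = np.arctan2(uQ[1], uQ[0]) - np.arctan2(uP[1], uP[0])
    c, s = np.cos(theta), np.sin(theta)
    Ry = np.array([[c, 0.0, -s],
                   [0.0, 1.0, 0.0],
                   [s, 0.0, c]])                # rotation about the y-axis
    aligned = (P - P.mean(axis=0)) @ Ry.T
    aligned[:, 0] += cQ[0]                      # move to Q's 2D center
    aligned[:, 2] += cQ[1]
    aligned[:, 1] += P[:, 1].mean()             # keep the original height
    return aligned, theta, lP, lQ
```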

The 2D PCA is generally very fast and useful, with the exception of a single situation. The problem and its solution are as follows. If an occlusion corrupts the information relating to the greatest variance, the results may be less accurate; in other words, the PCA method is most useful for whole-object registration. Figure 6 shows an example of the occlusion problem. To solve it, the system selects 2D ICP for the registration of f_P(x,z) and f_Q(x,z) when the problem occurs. The computation of 2D ICP is simpler than that of 3D ICP; thus, aligning the 2D images can be expected to require less time. The query Q is photographed from a view similar to that of the scene to be viewed, and is photographed carefully so that significant occlusion is avoided; therefore, the condition for selecting 2D ICP is:

\theta > th\_A \quad \mathrm{AND} \quad \lambda_P / \lambda_Q < th\_E    (3)

where th_A is a threshold given in radians and th_E is a threshold less than 1.0.
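
A sketch of this fallback decision together with a minimal 2D ICP on the projected images, under the same assumptions as the previous sketch; the threshold values are illustrative, not those of the paper.

```python
# Eq. (3) decides when the 2D PCA result looks unreliable and 2D ICP
# is used instead. th_A and th_E are illustrative default values.
import numpy as np
from scipy.spatial import cKDTree

def needs_icp(theta, lam_P, lam_Q, th_A=np.deg2rad(20.0), th_E=0.8):
    return abs(theta) > th_A and lam_P / lam_Q < th_E   # Eq. (3)

def icp_2d(P2, Q2, iterations=30):
    """Minimal 2D ICP aligning the points P2 (N, 2) onto Q2 (M, 2)."""
    tree = cKDTree(Q2)
    src = P2.copy()
    for _ in range(iterations):
        # 1) correspondences: nearest neighbor of each source point in Q2
        tgt = Q2[tree.query(src)[1]]
        # 2) best rigid transform for the pairs (Kabsch on 2D points)
        mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
        H = (src - mu_s).T @ (tgt - mu_t)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:        # enforce a proper rotation
            Vt[-1] *= -1
            R = Vt.T @ U.T
        # 3) apply the revised transform and iterate
        src = (src - mu_s) @ R.T + mu_t
    return src
```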

Next, the Hausdorff distance is considered for the assessment of the correspondence between the aligned P and Q. The distance is a metric between two point clouds and has been studied for point matching [27][28]. A modified Hausdorff distance (MHD) was proposed to decrease the influence of outliers [29]. The MHD and the directed MHD are defined as:

\mathrm{mhd}(A, B) := \max(\mathrm{dmhd}(A, B), \mathrm{dmhd}(B, A))    (4)

\mathrm{dmhd}(A, B) := \frac{1}{N_A} \sum_{a \in A} \min_{b \in B} \lVert a - b \rVert    (5)

where A and B denote two point clouds, N_A is the number of points in A, and \lVert \cdot \rVert indicates the Euclidean distance.
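
Equations (4) and (5) translate directly into code; a short sketch using a k-d tree for the nearest-neighbor search, with A and B as (N, 3) arrays:

```python
# Directed and symmetric modified Hausdorff distances, Eqs. (4)-(5).
import numpy as np
from scipy.spatial import cKDTree

def dmhd(A, B):
    """Eq. (5): mean distance from each point of A to its nearest
    neighbor in B, i.e. (1 / N_A) * sum_a min_b ||a - b||."""
    dists, _ = cKDTree(B).query(A)
    return float(dists.mean())

def mhd(A, B):
    """Eq. (4): symmetric modified Hausdorff distance."""
    return max(dmhd(A, B), dmhd(B, A))
```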


With regard to the system, the reference cloud P is extracted from the time series of 3D point clouds stored during karate-kata practice; hence, the cloud P may contain several occlusions. The system employed dmhd(P,Q) for the assessment, because dmhd(Q,P) may include erroneous Euclidean distances due to the occlusion even when P is obtained from the same posture as Q.

Finally, the procedure for extracting the point cloud of a human from the time series of 3D point clouds in the system is as follows.

Step 1: Noise is removed from the time series of 3D point clouds.

Step 2: Human regions are extracted from the time series. Segmentation algorithms provided by PCL or the Kinect SDK can be utilized for the extraction.

Step 3: Frames at which the sum of interframe differences of the 3D point clouds reaches a local minimum are extracted (extraction of stationary states; see the sketch after this list).

Step 4: Noise is removed from the 3D point cloud of the human.
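
A sketch of the stationary-state extraction in Step 3, assuming the interframe difference is measured with a cloud-to-cloud distance such as dmhd; the window size is an illustrative choice.

```python
# Step 3 sketch: stationary postures appear as local minima of the
# interframe difference of consecutive trainee clouds.
import numpy as np

def stationary_frames(clouds, diff, window=5):
    """clouds: time-ordered list of (N, 3) arrays; diff: cloud-to-cloud
    difference measure (e.g., dmhd). Returns indices of local minima."""
    d = np.array([diff(clouds[i], clouds[i + 1])
                  for i in range(len(clouds) - 1)])
    minima = []
    for i in range(window, len(d) - window):
        if d[i] == d[i - window : i + window + 1].min():
            minima.append(i)        # frame i is locally most stationary
    return minima
```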

Figure 4. Local features extracted by NARF for a support size of 15 cm. The features are represented by black squares.

Figure 5. A Karate-kata posture. The posture can be rotated around the y-axis (vertical axis). Rotation around the x/z-axis is either impossible or involves a change in the posture.

Figure 6. 3D point cloud of the same posture (upper images) and its projection onto the xz-plane (lower images). The images on the right were photographed after rotation by approximately 45 degrees around the y-axis (vertical axis) relative to the images on the left. The left leg is occluded in the right-hand image. The use of PCA for registration between the left and right clouds fails due to this occlusion; thus, the system selects ICP in this case.

3. Experiment and Results

Video retrieval was assessed experimentally using the proposed method. The karate-kata "Pin-an Syodan," which comprises 21 segments, was used for the experiment. A Kinect sensor (Microsoft Corp.), which incorporates the depth sensor and color camera in a single unit, was used for the experiment. The Kinect SDK was used for human detection and for calculating the 3D point cloud from the depth image.

In the experiments, two participants (identified as A and B, both karate trainees holding black belts) performed "Pin-an Syodan" ten times. The performances were photographed and stored in a database. Following this, the stationary postures of a segment were performed once by participant A and photographed as a query, and the database was then searched with that query. This was executed for each segment of the karate-kata. The sum of dmhd(P,Q) over the postures before and after a segment was used for the assessment of the correspondence. Table 1 shows the successful retrieval rates.


Table 1. Experimental results indicating the successful retrieval rate for each query (percentage).

Query number*   "Pin-an Syodan" by A   "Pin-an Syodan" by B
 1                      100                    100
 2                      100                     90
 3                      100                    100
 4                      100                     90
 5                      100                     80
 6                      100                    100
 7                      100                    100
 8                      100                    100
 9                      100                    100
10                      100                    100
11                      100                    100
12                      100                    100
13                      100                     90
14                      100                    100
15                      100                     80
16                      100                     80
17                       80                     60
18                      100                     90
19                      100                     70
20                      100                    100
21                      100                    100

*The query number corresponds to the segment number.

4. Discussion

A retrieval rate of almost 100% was attained when participant A retrieved his own performances. The retrieval of query no. 17 failed twice for participant A. This could be attributed to reduced precision of the shape of the 3D point clouds during this segment, which influenced the matching because the segment was performed farthest from the Kinect sensor. The problem could be solved by using multiple Kinect sensors; however, the objective of the study is to simplify the setup for video feedback, and requiring the trainee to set up multiple sensors would be too demanding. A practical solution would be to have the system present multiple retrieval results when dmhd(P,Q) is not sufficiently small and allow the user to select the right one. This could be done via a user interface on the system, which would not be particularly problematic because the operation would be infrequent.

Despite the problem described above, the retrieval rates can be considered high, and the system functioned well. This result can be attributed to the combined effect of the two posture clouds, representing the posture before and after each segment, which were used as the search query.

On the other hand, the results of the retrieval exercise on participant B's performances indicate that many false positives were detected. The number of false positives could be reduced by adding a face recognition function, which is feasible because the system also uses color images. Furthermore, the time-series data of each trainee could be stored separately in the database for exclusive retrieval by that trainee. From another point of view, however, it may be useful for learning to observe the same movements performed by another practitioner. Should the system follow this approach, the false positive rate would have to be reduced. Normalizing the cloud size in 3D as a pre-processing step before matching, as sketched below, would be one way to improve the performance of the system (participants A and B were 170 cm and 183 cm tall, respectively).
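
A minimal sketch of such a normalization, assuming y is the vertical axis: each trainee cloud is scaled uniformly to unit height before matching.

```python
# Hypothetical pre-processing step: normalize cloud size in 3D so that
# trainees of different heights produce comparable postures.
import numpy as np

def normalize_height(cloud):
    """Uniformly scale a (N, 3) cloud to unit vertical extent."""
    height = cloud[:, 1].max() - cloud[:, 1].min()
    return (cloud - cloud.mean(axis=0)) / height
```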

5. Conclusion

A novel video retrieval system employing both a depth sensor and a color video camera for assisting karate-kata practice was proposed in this paper. The system automatically retrieves the objective scene based on 2D PCA, 2D ICP, and the directed MHD metric. The system was evaluated experimentally and showed good performance, although some problems were identified. Future work will involve further verification of the system with a larger number of subjects and the extension of the system to other sports.

6. Acknowledgement

This work was supported by JSPS KAKENHI Grant Number 15K01102.

7. References

[1] Guadagnoli, M., Holcomb, W. & Davis, M., "The efficacy of video feedback for learning the golf swing", Journal of Sports Sciences, 20(8), pp.615-622, 2002.
[2] Zhenkun, Z., "The experimental study on computer video feedback in tennis serve teaching effect", Proceedings of International Workshop on Computer Science in Sports: Wuhan, China, August 1-2, pp.262-264, 2013.
[3] Tanaka, K., "Karate-kata exercise assist method using video image retrieval based on 3d point cloud processing", Proceedings of the 26th International Conference of Society for Information Technology and Teacher Education: Las Vegas, USA, March 2-6, pp.1399-1402, 2015.


[4] Chang, S.-F., Sikora, T. & Puri, A., "Overview of the MPEG-7 standard", IEEE Transactions on Circuits and Systems for Video Technology, 11(6), pp.688-695, 2001.
[5] Barai, S. & Cardenas, A. F., "Image annotation system using visual and textual features", Proceedings of the 16th International Conference on Distributed Multimedia Systems: Oak Brook, USA, October 14-16, pp.289-296, 2010.
[6] Mitran, M., Cabanac, G. & Boughanem, M., "GeoTime-based tag ranking model for automatic image annotation", Proceedings of the 29th Annual ACM Symposium on Applied Computing: Gyeongju, Korea, March 24-28, pp.896-901, 2014.
[7] Sikora, T., "The MPEG-7 visual standard for content description - an overview", IEEE Transactions on Circuits and Systems for Video Technology, 11(6), pp.696-702, 2001.
[8] Lowe, D. G., "Object recognition from local scale-invariant features", Proceedings of the 7th IEEE International Conference on Computer Vision: Kerkyra, Greece, September 20-27, pp.1150-1157, 1999.
[9] Ke, Y. & Sukthankar, R., "PCA-SIFT: a more distinctive representation for local image descriptors", Proceedings of IEEE Conference on Computer Vision and Pattern Recognition: Washington, USA, June 27-July 2, pp.506-513, 2004.
[10] Abdel-Hakim, A. E. & Farag, A. A., "CSIFT: a SIFT descriptor with color invariant characteristics", Proceedings of IEEE Conference on Computer Vision and Pattern Recognition: New York, USA, June 17-22, pp.1978-1983, 2006.
[11] Bay, H., Ess, A., Tuytelaars, T. & Gool, L. V., "SURF: speeded up robust features", Computer Vision and Image Understanding, 110(3), pp.346-359, 2008.
[12] Rublee, E., Rabaud, V., Konolige, K. & Bradski, G., "ORB: an efficient alternative to SIFT or SURF", Proceedings of IEEE International Conference on Computer Vision: Barcelona, Spain, November 6-13, pp.2564-2571, 2011.
[13] Aggarwal, J. K. & Xia, L., "Human activity recognition from 3d data: a review", Pattern Recognition Letters, 48(15), pp.70-80, 2014.
[14] Connaghan, D., Moran, K. & O'Connor, N. E., "An automatic visual analysis system for tennis", Journal of Sports Engineering and Technology, 227(4), pp.273-288, 2013.
[15] Hu, M.-C., Chen, C.-W., Cheng, W.-H., Chang, C.-H., Lai, J.-H. & Wu, J.-L., "Real-time human movement retrieval and assessment with kinect sensor", IEEE Transactions on Cybernetics, 45(4), pp.742-753, 2014.
[16] Scovanner, P., Ali, S. & Shah, M., "A 3-dimensional SIFT descriptor and its application to action recognition", Proceedings of the 15th International Conference on Multimedia: Augsburg, Germany, September 23-28, pp.357-360, 2007.
[17] Lavoué, G., "Bag of words and local spectral descriptor for 3d partial shape retrieval", Proceedings of the 4th Eurographics Conference on 3D Object Retrieval: Llandudno, UK, April 10, pp.41-48, 2011.
[18] Steder, B., Rusu, R. B., Konolige, K. & Burgard, W., "Point feature extraction on 3d range scans taking into account object boundaries", Proceedings of IEEE International Conference on Robotics and Automation: Shanghai, China, May 9-13, pp.2601-2608, 2011.
[19] Tabia, H., Laga, H., Picard, D. & Gosselin, P.-H., "Covariance descriptors for 3d shape matching and retrieval", Proceedings of IEEE Conference on Computer Vision and Pattern Recognition: Columbus, USA, June 24-27, pp.4185-4192, 2014.
[20] Muja, M., Rusu, R. B., Bradski, G. & Lowe, D. G., "REIN - a fast, robust, scalable REcognition INfrastructure", Proceedings of IEEE International Conference on Robotics and Automation: Shanghai, China, May 9-13, pp.2939-2946, 2011.
[21] Aldoma, A., Tombari, F., Rusu, R. B. & Vincze, M., "OUR-CVFH - oriented, unique and repeatable clustered viewpoint feature histogram for object recognition and 6dof pose estimation", Lecture Notes in Computer Science, 7476, pp.113-122, 2012.
[22] Martínez, L., Loncomilla, P. & Ruiz-del-Solar, J., "Object recognition for manipulation tasks in real domestic settings: a comparative study", Lecture Notes in Computer Science, 8992, pp.207-219, 2015.
[23] Besl, P. J. & McKay, N. D., "A method for registration of 3-d shapes", IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2), pp.239-256, 1992.
[24] Salvi, J., Matabosch, C., Fofi, D. & Forest, J., "A review of recent range image registration methods with accuracy evaluation", Journal of Image and Vision Computing, 25(5), pp.578-596, 2007.
[25] Torabi, M., Mousavi, S. M. & Younesian, G. D., "A new methodology in fast and accurate matching of the 2D and 3D point clouds extracted by laser scanner systems", Optics & Laser Technology, 66, pp.28-34, 2015.
[26] Chung, D. H., Yun, I. D. & Lee, S. U., "Registration of multiple-range views using the reverse-calibration technique", Pattern Recognition, 31(4), pp.457-464, 1998.
[27] Huttenlocher, D. P., Klanderman, G. A. & Rucklidge, W. J., "Comparing images using the Hausdorff distance", IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9), pp.850-863, 1993.
[28] Aiger, D. & Kedem, K., "Approximate input sensitive algorithms for point pattern matching", Pattern Recognition, 43, pp.153-159, 2010.
[29] Dubuisson, M.-P. & Jain, A. K., "A modified Hausdorff distance for object matching", Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol.1: Jerusalem, Israel, October 9-13, pp.566-568, 1994.
