XVII SIMPOSIO DE TRATAMIENTO DE SEÑALES, IMÁGENES Y VISIÓN ARTIFICIAL - STSIVA 2012
KERNEL BASED HAND GESTURE RECOGNITION USING KINECT SENSOR
Daniela Ramírez-Giraldo∗, Santiago Molina-Giraldo∗, Andrés M. Álvarez-Meza∗,
Genaro Daza-Santacoloma∗†, and Germán Castellanos-Domínguez∗
∗Signal Processing and Recognition Group, Universidad Nacional de Colombia sede Manizales, Manizales, Colombia
†Instituto de Epilepsia y Parkinson del Eje Cafetero - Neurocentro, Pereira, Colombia
e-mail: {daramirezgi, smolinag, amalvarezme, gdazas, cgcastellanosd}@unal.edu.co
Abstract—A machine learning based methodology is proposed to recognize a predefined set of hand gestures using depth images. For this purpose, an RGBD sensor (Microsoft Kinect) is employed to track the hand position, and a preprocessing stage is presented to extract the region of interest from the depth images. Moreover, a learning algorithm based on kernel methods is used to discover the relationships among samples, properly describing the studied gestures. The proposed methodology aims to obtain a representation space that allows us to identify the dynamics of hand movements. Attained results show that our approach achieves a suitable performance in detecting different hand gestures. As future work, we are interested in recognizing more complex human activities, in order to support the development of human-computer interface systems.
Keywords— Depth sensor, human motion, kernel methods.
I. INTRODUCTION
Interacting with machines and environments is a task of interest in computer vision systems. In fact, being able to detect human activities using computer vision techniques allows us to suitably build human-computer interfaces, which can be useful in fields like medicine, sports training, entertainment, process control, robotics design, among others [1], [2], [3]. Nonetheless, even though some current computer vision systems provide interactive human body tracking, the challenge remains to develop a low-cost system that is reliable in unstructured home settings and also straightforward to use.
Gestures are among the most common and ancient methods of human communication. In recent years, gestures have also been employed for interacting with machines or computer-assisted systems, instead of traditional devices such as the keyboard, mouse, or joystick. Human gesture interaction has several benefits: free movement, no wired device limitations, and hands left free to use other important tools. In order to track the human full-body pose in real time, camera-based motion capture systems can be used, but they typically require a person to wear cumbersome markers or suits [4]. Past approaches also have several limitations. Garg [5] uses 3D images to recognize hand gestures, but the process is complicated and inefficient; the focus should be on efficiency as well as accuracy, since processing time is a critical factor in real-time applications. Yang [6] analyzes the hand contour to select fingertip candidates, then finds peaks in their spatial distribution and checks local variance to locate fingertips; this method is not invariant to the orientation of the hand. Hence, human gesture recognition (particularly hand gesture recognition) is still a challenging task due to the complexity (degrees of freedom) and unpredictability of human movements.
Recent advances have produced depth cameras that allow acquiring dense, three-dimensional scans of a scene in real time, without the need for multi-camera systems. Such depth images are almost independent of lighting conditions and variations in visual appearance, e.g. due to clothing. At every image pixel, these cameras provide a measurement of the distance from the camera sensor to the closest object surface [4].
In this paper, we propose a machine learning based methodology to recognize a predefined set of hand gestures. For this purpose, we use an RGBD sensor (Microsoft Kinect) as the input device, and we present a learning algorithm based on kernel methods to discover the relationships among samples and infer the studied gestures. The goal of the proposed methodology is to obtain a representation space that allows us to properly identify the dynamics of the hand movements captured by the Kinect. Attained results show that our approach presents an acceptable performance for detecting different hand gestures.
The remainder of this paper is organized as follows. In Section II, the proposed methodology for estimating the hand position from depth images and the kernel based framework used for recognizing hand gestures are described. In Section III, the experimental conditions and the obtained results are shown. Finally, in Sections IV and V, the discussion and conclusions are presented.
II. RECOGNIZING HAND GESTURES
A. Data Acquisition and Preprocessing
The Kinect sensor has been widely used in computer vision tasks, due to the several advantages offered by its built-in depth camera [7]. The main advantages of depth sensors over traditional intensity sensors are: enhanced data representability by introducing a new characteristic (depth information), straightforward 3D reconstruction, the capability of working in low-light scenes, and a simplified background subtraction task. In order to take advantage of these Kinect properties, a preprocessing procedure is proposed to highlight the region of interest (the hand) in the depth images. In this regard, four different regions are extracted.
The first (gray) region is a dead zone configured by the user, in which depth points are not taken into account. In the next (yellow) region, the Kinect sensor searches for the nearest depth point, in our case the hand; note that the yellow region itself does not contribute depth data points. Then, in the green region, called the region of interest, the Kinect establishes a working range where the hand of the subject is expected to be found. Hence, as a result, an image containing only the data points that lie in the green region is obtained. Finally, the last (red) region is also considered a dead zone, where no object is captured by the sensor. Note that the lengths of the gray and green regions can be fixed by the user. The depth regions described above are summarized in Fig. 1.
Fig. 1. Kinect sensor depth regions.
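Although the experiments in Section III use the OpenCV C++ library, the region scheme above admits a compact NumPy sketch. The zone lengths below are the approximate values reported in Section III; the function name and the assumption that the depth frame arrives as per-pixel distances in meters (zeros for invalid readings) are illustrative, not details from the paper:

```python
import numpy as np

# Approximate zone lengths from Section III: gray dead zone ~1 m,
# green working range ~1 cm.
GRAY_END = 1.0    # [m] end of the user-configured dead zone
GREEN_LEN = 0.01  # [m] length of the region of interest

def green_region_mask(depth_frame):
    """Return a boolean mask of pixels falling in the green region.

    depth_frame: (h, w) array of per-pixel distances in meters;
    zeros (invalid readings) fall inside the gray zone and are skipped.
    """
    valid = depth_frame > GRAY_END            # ignore the gray dead zone
    if not np.any(valid):
        return np.zeros_like(depth_frame, dtype=bool)
    nearest = depth_frame[valid].min()        # closest point: the hand (yellow search)
    # Green region: a thin slice behind the nearest point;
    # everything beyond it belongs to the red dead zone.
    return (depth_frame >= nearest) & (depth_frame <= nearest + GREEN_LEN)
```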
Given a depth image matrix $\mathbf{D} \in \Re^{h \times w \times 3}$, all the pixels that belong to the green region are fixed to the depth value $\mathbf{gr} \in \Re^{1 \times 3}$. Therefore, the binary matrix $\mathbf{B} \in \Re^{h \times w}$ can be computed as in (1):

$$B_{ij} = \begin{cases} 1 & \left\| \mathbf{d}_{ij} - \mathbf{gr} \right\| = 0 \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$
where $\mathbf{d}_{ij} \in \Re^{1 \times 3}$ is the depth intensity vector of the $(i,j)$ pixel, with $i = 1, \ldots, h$ and $j = 1, \ldots, w$. Then, in order to identify the temporal dynamics of the hand movement, the centroid $(i_c, j_c)$ of the detected object is estimated as

$$i_c = \frac{1}{N_{gr}} \sum_{i=1}^{h} \sum_{j=1}^{w} i\, B_{ij}, \qquad j_c = \frac{1}{N_{gr}} \sum_{i=1}^{h} \sum_{j=1}^{w} j\, B_{ij}; \qquad (2)$$

being $N_{gr}$ the number of elements in $\mathbf{B}$ that are equal to one. Now, let $n > 0$ be the number of analyzed frames in a hand gesture; the trajectory matrix $\mathbf{V} \in \Re^{n \times 2}$ is then built with row vectors $\mathbf{v}_t = [i_c^t, j_c^t]$, being $(i_c^t, j_c^t)$ the centroid of the detected object in frame $t$, with $t = 1, \ldots, n$.
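Equations (1) and (2) transcribe directly into NumPy. The sketch below is a minimal illustration, not the authors' implementation; the function name is hypothetical:

```python
import numpy as np

def centroid_from_depth(D, gr):
    """Compute the binary mask B of (1) and the centroid (i_c, j_c) of (2).

    D:  (h, w, 3) depth image matrix
    gr: (3,) depth value assigned to green-region pixels
    """
    # (1): B_ij = 1 where the pixel depth vector equals gr exactly
    B = (np.linalg.norm(D - gr, axis=2) == 0).astype(float)
    N_gr = B.sum()                              # number of ones in B
    if N_gr == 0:
        return B, None                          # no hand detected in this frame
    h, w = B.shape
    i_idx, j_idx = np.mgrid[1:h + 1, 1:w + 1]   # 1-based indices, as in the paper
    # (2): means of the row and column indices of the segmented pixels
    i_c = (i_idx * B).sum() / N_gr
    j_c = (j_idx * B).sum() / N_gr
    return B, (i_c, j_c)
```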
Furthermore, a conventional linear interpolation method is used to properly compare hand gesture trajectories of different sizes. The matrix $\mathbf{S} \in \Re^{T \times 2}$ is obtained by interpolating the columns of $\mathbf{V}$, being $T > 0$ the fixed trajectory size. Finally, a dynamic range normalization is applied over each column of $\mathbf{S}$ to achieve consistency when comparing different trajectories. Therefore, the matrix $\mathbf{X} \in \Re^{T \times 2}$ is estimated as

$$X_{l1} = \frac{2\left(S_{l1} - \bar{s}_1\right)}{\max\left(\mathbf{s}_1\right) - \min\left(\mathbf{s}_1\right)}, \qquad X_{l2} = \frac{2\left(S_{l2} - \bar{s}_2\right)}{\max\left(\mathbf{s}_2\right) - \min\left(\mathbf{s}_2\right)} \qquad (3)$$

with $l = 1, \ldots, T$, being $\mathbf{s}_1$ and $\mathbf{s}_2$ the first and second columns of $\mathbf{S}$, respectively, and $\bar{s}_1$, $\bar{s}_2$ their means. Fig. 2 shows the proposed acquisition and preprocessing framework for extracting hand gesture trajectories from depth images.
Fig. 2. Data acquisition and preprocessing scheme: depth frames $\mathbf{D}_1, \ldots, \mathbf{D}_n$ → hand detection ($\mathbf{B}_1, \ldots, \mathbf{B}_n$) → centroid calculation ($\mathbf{V} \in \Re^{n \times 2}$) → interpolation ($\mathbf{S} \in \Re^{T \times 2}$) → normalization ($\mathbf{X} \in \Re^{T \times 2}$).
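The interpolation and normalization steps can be sketched as follows; np.interp performs the per-column linear resampling, and the scaling assumes the mean-centered form of (3) above. The default T = 80 matches the value used in Section III; the function name is illustrative:

```python
import numpy as np

def preprocess_trajectory(V, T=80):
    """Resample an (n, 2) centroid trajectory V to T samples and normalize it.

    Returns X in R^{T x 2}, with each column scaled by half its dynamic range.
    """
    n = V.shape[0]
    t_old = np.linspace(0.0, 1.0, n)
    t_new = np.linspace(0.0, 1.0, T)
    # Linear interpolation of each column of V -> S in R^{T x 2}
    S = np.column_stack([np.interp(t_new, t_old, V[:, k]) for k in range(2)])
    # (3): mean-centered dynamic range normalization, column-wise
    X = 2.0 * (S - S.mean(axis=0)) / (S.max(axis=0) - S.min(axis=0))
    return X
```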
B. Gesture recognition based on Kernel Representation
The use of kernel functions to infer relationships among samples is widespread in machine learning procedures [8]. Here, we propose to use a kernel representation to unfold the similarities among hand gesture trajectories. Recently, some machine learning approaches have shown that using multiple kernels to infer the data similarities, instead of just one, can be useful to improve data interpretability [9]. Given a pair of hand trajectory matrices $\mathbf{X}^p$ and $\mathbf{X}^q$, and assuming $Z$ kernel functions, multiple kernel representation (MKR) based methods aim to infer the combined kernel function $\kappa_\xi\left(\mathbf{X}^p, \mathbf{X}^q\right) = \sum_{z=1}^{Z} \xi_z \kappa_z\left(\mathbf{X}^p, \mathbf{X}^q\right)$, subject to $\xi_z \geq 0$ and $\sum_{z=1}^{Z} \xi_z = 1$ ($\forall \xi_z \in \Re$). Thereby, the input data are analyzed from different information sources by means of a convex combination of basis kernels.
Using the MKR framework described above, we propose to combine two different kernels, $\kappa_a$ and $\kappa_o$, to estimate the abscissa and ordinate similarities among hand trajectories. Hence, a combined kernel function is computed as

$$\kappa\left(\mathbf{X}^p, \mathbf{X}^q\right) = \xi_a \kappa_a\left(\mathbf{x}_a^p, \mathbf{x}_a^q\right) + \xi_o \kappa_o\left(\mathbf{x}_o^p, \mathbf{x}_o^q\right), \qquad (4)$$

where the vectors $\mathbf{x}_a^p$ and $\mathbf{x}_a^q$ correspond to the first columns of $\mathbf{X}^p$ and $\mathbf{X}^q$, respectively, and $\mathbf{x}_o^p$ and $\mathbf{x}_o^q$ contain the second ones. Moreover, $p, q = 1, \ldots, N$, being $N$ the number of given trajectories. Thus, the kernel matrix $\mathbf{K} \in \Re^{N \times N}$ can be estimated by (4).
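A sketch of the combined kernel of (4) over N preprocessed trajectories follows. The Gaussian basis kernel, the bandwidth σ = 3, and the equal weights ξ_a = ξ_o = 0.5 are the choices reported in Section III; the function names are illustrative:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=3.0):
    """Basis kernel kappa(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d = x - y
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def mkr_kernel_matrix(trajectories, xi_a=0.5, xi_o=0.5, sigma=3.0):
    """Combined kernel matrix K of (4) from a list of (T, 2) trajectories X^p."""
    N = len(trajectories)
    K = np.zeros((N, N))
    for p in range(N):
        for q in range(N):
            xa_p, xo_p = trajectories[p][:, 0], trajectories[p][:, 1]
            xa_q, xo_q = trajectories[q][:, 0], trajectories[q][:, 1]
            # Convex combination of abscissa and ordinate similarities
            K[p, q] = (xi_a * gaussian_kernel(xa_p, xa_q, sigma)
                       + xi_o * gaussian_kernel(xo_p, xo_q, sigma))
    return K
```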
III. EXPERIMENTAL SET-UP AND RESULTS
To test the performance of the proposed methodology in characterizing time-series data, a hand gesture recognition database was recorded using the Kinect sensor. We employ the Kinect camera, which delivers 640×480 images at 30 frames per second, with a depth resolution of 3[mm]. The database contains 3 different hand gesture symbols performed by 2 subjects. The chosen symbols are the letters O, S, and L, and each subject performs each symbol 10 times.
Data is acquired using the libfreenect software provided by OpenKinect (http://openkinect.org), and the OpenCV C++ library (http://opencv.willowgarage.com/wiki/) is used for the image processing operations. The data acquisition follows the region scheme explained in Section II-A. The gray zone is set to approximately 1[m] (the distance suggested by Microsoft). The length of the green region is fixed small enough, approximately 1[cm], to obtain more accurate results. The centroid of this region is determined by taking the mean of the row and column coordinates of the segmented data points, using (2). Moreover, to remove outliers, a median filter with a fixed window of 12 samples is applied over the abscissa and ordinate signals (each column of $\mathbf{V}$). Each signal is then scaled and interpolated with $T = 80$ (see Section II-A). For each symbol recording, $n$ frames are taken according to the length of each user's symbol. Fig. 3 shows a segmented image obtained with the proposed acquisition and preprocessing framework, and Fig. 4 presents some preprocessed hand gesture trajectories.
Fig. 3. Hand trajectory prediction example.
Fig. 4. Some preprocessed hand trajectories. a) Subject one. b) Subject two.
The MKR scheme explained in Section II-B is used to represent the obtained information as faithfully as possible. A Gaussian kernel is used as the basis kernel to estimate the relationships among hand trajectories in (4). For concrete testing, the kernel bandwidth $\sigma$ is empirically fixed to 3. Besides, $\xi_a$ and $\xi_o$ are set to 0.5 in (4). The resulting kernel matrix $\mathbf{K}$ of the studied dataset can be seen in Fig. 5.
Fig. 5. Gaussian kernel matrix (samples grouped by symbol, O, S, L, along both axes).
After that, Kernel Principal Component Analysis (KPCA) is applied over $\mathbf{K}$ [8], obtaining a low-dimensional feature space $\mathbf{E} \in \Re^{60 \times 3}$. Finally, a k-nearest neighbors classifier (knnc) is trained over the low-dimensional space. It is important to note that the system performance is tested using a 10-fold cross-validation scheme. Fig. 6 presents a 3D representation of the studied data.
Fig. 6. Low-dimensional KPCA projection - knnc test accuracy = 100[%].
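The final stage (KPCA over the precomputed kernel matrix, followed by a k-nn classifier) can be sketched as below. The double-centering of K and the eigendecomposition follow the standard KPCA formulation; the function name and the small eigenvalue floor are assumptions, not details given in the paper:

```python
import numpy as np

def kpca_projection(K, n_components=3):
    """Project samples into the top KPCA components from a precomputed kernel K."""
    N = K.shape[0]
    one = np.ones((N, N)) / N
    Kc = K - one @ K - K @ one + one @ K @ one   # double-center the kernel matrix
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]  # keep the largest components
    # Normalize eigenvectors so that projections have unit-variance directions
    alphas = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))
    return Kc @ alphas                           # E in R^{N x n_components}
```

E = kpca_projection(K, n_components=3) would then feed the classifier; for instance, scikit-learn's KNeighborsClassifier under 10-fold cross-validation is one plausible way to reproduce the evaluation protocol, although the paper does not specify the implementation used.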
IV. DISCUSSION
According to the preprocessing results shown in Fig. 4, it is possible to notice the capability of our approach to characterize the hand trajectory. Due to the region based methodology for inferring the hand position, the estimated trajectory is smooth enough for further characterization stages.
On the other hand, the resulting similarity measure obtained by MKR using a Gaussian kernel (Fig. 5) confirms that the similarity among signals from the same class is high, with a mean intra-class similarity of 0.69 (orange color). The class that exhibits the highest intra-class similarity corresponds to the symbol L, with a mean similarity of 0.78. The classes most similar to each other are S and L, with a mean similarity of 0.32.

The properties of the above measures are corroborated by the estimated KPCA low-dimensional projection presented in Fig. 6. It can be noticed how the MKR framework considerably facilitates the classification process: the resulting feature space exhibits an appropriate separation among the different classes. It is also noted that the L symbol (third class) shows the highest intra-similarity among all the signals.
V. CONCLUSIONS
A machine learning based methodology for recognizing hand gestures using depth images captured by a Kinect sensor was proposed. A region based acquisition scheme using depth images was employed in order to obtain an accurate segmentation of the region of interest. Moreover, an MKR framework was proposed to combine, into a single similarity matrix, the abscissa and ordinate features inferred from the centroid trajectories of the hand gestures. Attained results showed that the proposed acquisition methodology obtains accurate data points, properly identifying the dynamics of each gesture. Furthermore, the proposed MKR framework enhances the separability of the classes, facilitating the subsequent classification process. As future work, it would be interesting to include more hand gesture symbols, and it would also be useful to apply a similar MKR approach to skeleton tracking using depth images.
ACKNOWLEDGMENTS
This research was carried out under grants provided by an M.Sc. scholarship and a Ph.D. scholarship, and under the project "ANÁLISIS DE MOVIMIENTO EN SISTEMAS DE VISIÓN POR COMPUTADOR UTILIZANDO APRENDIZAJE DE MÁQUINA" (Motion Analysis in Computer Vision Systems Using Machine Learning), funded by the Universidad Nacional de Colombia.
REFERENCES
[1] R. Urtasun and T. Darrell, "Sparse probabilistic regression for activity-independent human pose inference," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[2] T. Jaeggli, E. Koller-Meier, and L. Van Gool, "Learning generative models for multi-activity body pose estimation," Int. J. Comput. Vis., vol. 82, pp. 121–134, 2009.
[3] R. Kehl and L. Van Gool, "Markerless tracking of complex human motions from multiple views," Comput. Vis. Image Underst., vol. 104, pp. 190–209, 2006.
[4] L. A. Schwarz, A. Mkhitaryan, D. Mateus, and N. Navab, "Human skeleton tracking from depth data using geodesic distances and optical flow," Image and Vision Computing, vol. 20, no. 1, pp. 217–226, 2012.
[5] P. Garg, N. Aggarwal, and S. Sofat, "Vision based hand gesture recognition," World Academy of Science, Engineering and Technology, vol. 49, pp. 972–977, 2009.
[6] D. Yang, L. Jin, J. Yin et al., "An effective robust fingertip detection method for finger writing character recognition system," in Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, 2005, pp. 4191–4196.
[7] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in CVPR, vol. 2, 2011, p. 7.
[8] B. Schölkopf and A. J. Smola, Learning with Kernels. Cambridge, MA, USA: The MIT Press, 2002.
[9] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.
Daniela Ramírez-Giraldo is a student of electronic engineering at the Universidad Nacional de Colombia sede Manizales. Her research interests are image and video processing using the Kinect sensor.

Santiago Molina-Giraldo received his undergraduate degree in electronic engineering from the Universidad Nacional de Colombia sede Manizales in 2012. Currently, he is pursuing an M.Sc. at the same university. His research interests are nonlinear dimensionality reduction and kernel methods for motion analysis and signal processing.

Andrés Marino Álvarez-Meza received his undergraduate degree in electronic engineering with honors, and his M.Sc. in engineering-industrial automation with honors, from the Universidad Nacional de Colombia sede Manizales, in 2009 and 2012 respectively. Currently, he is pursuing a Ph.D. at the same university. His research interests are nonlinear dimensionality reduction and kernel methods for signal processing.

Genaro Daza-Santacoloma received his undergraduate degree in electronic engineering in 2005, the M.Sc. degree in engineering-industrial automation with honors in 2007, and the Ph.D. degree in engineering-automatics with honors in 2010, from the Universidad Nacional de Colombia sede Manizales. Currently, he is an engineering researcher at the Instituto de Epilepsia y Parkinson del Eje Cafetero - Neurocentro, Pereira, Colombia. His research interests are feature extraction/selection and motion analysis for training pattern recognition systems.

Germán Castellanos-Domínguez received his undergraduate degree in radiotechnical systems and his Ph.D. in processing devices and systems from the Moscow Technical University of Communications and Informatics, in 1985 and 1990 respectively. Currently, he is a professor in the Department of Electrical, Electronic and Computer Engineering at the Universidad Nacional de Colombia at Manizales. He is Chairman of the GCPDS at the same university. His research interests include information and signal theory, digital signal processing, and bioengineering.