8/8/2019 Dual Hand Extraction Using Skin Color and Stereo Information
http://slidepdf.com/reader/full/dual-hand-extraction-using-skin-color-and-stereo-information 1/6
Dual Hand Extraction Using
Skin Color and Stereo Information∗
Thien Cong Pham, Xuan Dai Pham, Dung Duc Nguyen, Seung Hun Jin and Jae Wook Jeon, Member, IEEE
School of Information and Communication Engineering, Sungkyunkwan University
Suwon, Korea
{ pham, pdxuan}@ece.skku.ac.kr, [email protected], [email protected], [email protected]
Abstract—Extracting the positions of hands is an important step in Human Computer Interaction and Robot Vision applications. Posture and gesture can be extracted from hand positions and the appropriate task can be performed. In this paper, we propose an approach to extract hand images using skin color and stereo information. Our method does not require clear hand size or high-quality disparity. With a sound training database and an adequate working environment, we obtain nearly 100 percent accuracy. The run time, ignoring disparity generation time, is also acceptable.
Index Terms—Hand extraction, skin detection, stereo vision.
I. INTRODUCTION
Computer Vision plays an important role in Human Computer Interaction (HCI). Much research has been done and many papers have been published in this field. Among these, hand extraction has been of recent interest. Extraction of hand information is a task required for robots to understand our commands. The robot reads commands using a camera, taking images of the environment, extracting hand information and finding the appropriate operation to be performed. Hand extraction is usually not the final step in an HCI application. Later steps may include posture recognition or gesture recognition.
This paper concentrates on hand extraction to provide
information of hand position and hand shape. Our inputs are
the left color image, right color image and disparity image of
the environment. Our target result is an image that contains
only two hands. Fig. 1 is an example of hand extraction. The
more challenging final result is to obtain a clean and clear
hand image. That is, there should not be too much noise and
many holes in the resulting image.
In [1], the authors use only one hand and consider the plane that contains the back of the hand to estimate hand position and orientation. In our program, we have two hands in the camera view and we find their locations. The two hand planes in our application do not lie in approximately the same plane; that is, the distances from the left and right hands to the camera can differ. In [2], the authors use a special regular
∗This work is partially supported by Samsung Electronics.
(a)
(b)
Fig. 1. An example of hand extraction. (a) Original left image. (b) Hand result image.
diffuser to detect hands. In [3], the authors use image segmentation, which requires more calculation time. Our work takes a different approach, using only a camera and requiring fewer calculations.
Our program performs correctly in an operational environment that satisfies these conditions:
• The operational environment is indoors, either during day-time or night-time.
• The left hand, right hand and face do not overlap in the image.
• The difference in the vertical positions of the left and right hands should not be great.
Proceedings of the 2008 IEEE International Conference on Robotics and Biomimetics, Bangkok, Thailand, February 21-26, 2009.
978-1-4244-2679-9/08/$25.00 ©2008 IEEE
Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SAO CARLOS. Downloaded on August 03, 2010 at 11:18:39 UTC from IEEE Xplore. Restrictions apply.
The rest of this paper is organized as follows: in Section
II, we describe related work and our method for skin-based detection. This includes training the model and extracting skin pixels. Section III explains some methods to generate disparity images from stereo input images and the method that we use. Section IV describes a technique that uses disparity information and the skin-based image to remove background and small noise; a technique to detect connected components is used in this section. In Section V, we combine the results from previous sections to correctly generate images of two hands. Our experimental results, together with the operational environment, are provided in Section VI. Conclusions and future work are described in Section VII.
II. SKIN DETECTION
A. Related works
A survey of techniques for skin-based detection is provided in [4]. There are four groups of methods.
1) Explicitly defined skin region: This is the simplest and most static method. An (R, G, B) pixel is classified as skin if

R > 95 and G > 40 and B > 20 and
max{R, G, B} − min{R, G, B} > 15 and
|R − G| > 15 and R > G and R > B.    (1)
We experimented to check the performance of this method. The results were of poor quality, with much noise and many holes. In addition, the quality was dependent on illumination. We conclude that this method is inadequate for our application.
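For reference, rule (1) translates directly into code. The following Python sketch (the function name is ours, not from the paper) classifies a single RGB pixel:

```python
def is_skin_rgb(r, g, b):
    """Explicit skin rule (1): R > 95, G > 40, B > 20,
    max-min channel spread > 15, |R - G| > 15, R > G and R > B."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)
```

A typical skin tone such as (200, 120, 80) satisfies every condition, while a gray pixel fails the channel-spread test.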
2) Nonparametric skin distribution modelling: Three main
types are reported:
• Normalized lookup table (LUT).
• Bayes classifier.
• Self Organizing Map.
In the LUT approach, the colorspace is divided into a number of bins, each representing a particular range of color component value pairs. These bins form a 2D or 3D histogram (depending on the dimensionality of the colorspace) called a reference lookup table (LUT). Each bin stores the number of times that particular color occurred in the training skin sample images. The value of a lookup table bin constitutes the likelihood that the queried color corresponds to skin. Similarly, in a Bayes classifier, not only P(c|skin) is calculated, but P(c|¬skin) is also counted. The Self Organizing Map, trained using the well-known Compaq skin database provided by [5], is reported to
be marginally better than the Mixture of Gaussians model. In the Self Organizing Map method, two sets of labelled images, skin-only and skin plus non-skin, are used to train the model. Many colorspaces were tested with the Self Organizing Map detector.
The advantages of non-parametric methods are that they are rapid to train and use, and they are independent of the shape of the skin distribution. Their drawbacks include storage space requirements and an inability to interpolate or generalize from the training data.
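As an illustration of the normalized LUT method described above, the sketch below (the bin count and function names are our assumptions, not from the paper) builds a 2D chrominance histogram from training skin pixels and queries it as a per-bin likelihood:

```python
import numpy as np

def build_skin_lut(skin_pixels, bins=32):
    """Build a normalized 2D lookup table over two chrominance
    channels; channel values are assumed to lie in [0, 256)."""
    hist = np.zeros((bins, bins))
    step = 256 // bins
    for c1, c2 in skin_pixels:
        hist[c1 // step, c2 // step] += 1
    return hist / hist.sum()  # each bin holds a likelihood

def skin_likelihood(lut, c1, c2, bins=32):
    """Look up the skin likelihood of a chrominance pair."""
    step = 256 // bins
    return lut[c1 // step, c2 // step]
```

Classification then reduces to thresholding the returned likelihood, which is why LUT methods are fast but cannot generalize beyond the training bins.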
3) Parametric skin distribution modelling: The four main
reported types are:
• Single Gaussian.
• Mixture of Gaussians.
• Multiple Gaussian clusters.
• Elliptic boundary model.
In the Single Gaussian method, the skin color distribution is modelled by a Gaussian joint probability density function:

p(c|skin) = (1 / (2π |Σs|^(1/2))) · exp(−(1/2) (c − µs)^T Σs^(−1) (c − µs)).    (2)
In (2), c is the color vector, and µs and Σs are the mean vector and covariance matrix of the distribution. The probability p(c|skin) can alternatively be used as a skin-likeness measurement. The Mixture of Gaussians method considers the skin color distribution to be a mixture of Gaussian probability density functions:

p(c|skin) = Σ_{i=1..k} πi · pi(c|skin),    (3)
where k is the number of mixture components and πi are the mixture parameters. The Multiple Gaussian Clusters method approximates the skin color cluster with three 3D Gaussians in YCbCr colorspace. A pixel is classified as skin if the Mahalanobis distance from the color vector c to the nearest model cluster center is less than a threshold. In another approach, the Elliptic Boundary Model claims that a Gaussian shape is insufficient to approximate the skin color distribution, and instead proposes an elliptical boundary.
The performance of these methods clearly depends on the distribution shape in the target application. Some research proves the correctness of the distribution choice for specific cases. Our work, described in the next part of this section, uses the Mixture of Gaussians model.
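Equations (2) and (3) can be evaluated directly once the parameters are fitted. The sketch below assumes a 2-dimensional chrominance vector c (so the Gaussian normalization constant is 2π|Σs|^(1/2)); function names and the parameter layout are ours:

```python
import numpy as np

def gaussian_skin_likelihood(c, mu, sigma):
    """Single-Gaussian skin likelihood p(c|skin), eq. (2),
    for a 2D color vector c with mean mu and covariance sigma."""
    d = c - mu
    norm = 2 * np.pi * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d) / norm

def mixture_skin_likelihood(c, mus, sigmas, weights):
    """Mixture-of-Gaussians likelihood, eq. (3): a weighted sum
    of k component likelihoods."""
    return sum(w * gaussian_skin_likelihood(c, m, s)
               for m, s, w in zip(mus, sigmas, weights))
```

In practice the mixture parameters (mus, sigmas, weights) would come from an EM fit to the training skin samples.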
4) Dynamic skin distribution models: Rather than fixing the skin distribution model, methods of this type tune the model dynamically under different operational conditions. These methods require rapid training and classification, and they should be able to adapt to changing conditions. The complexity of this type of method is clearly greater than that of the earlier methods.
Determining the best method is outside the scope of this paper.
A comparative evaluation of skin-based detection methods
can be found in [4].
B. Our work
After considering the properties of these skin detection methods and reviewing the related work, we concluded that the Mixture of Gaussians is the best fit for our program. In [6], the HSV colorspace is used to obtain better tolerance to illumination. Operational conditions in [6] change very rapidly, because skin is detected in real video.
(a) (b)
Fig. 2. Some images of our training database. (a) Images provided by Massey. (b) Images generated by ourselves.
Fig. 3. Plot of the sample database used to estimate the number of Gaussians to train.
In our application, we use LUV colorspace and ignore the
Luminance part of the pixel’s information.
Since the database in [5] is no longer freely available, our
model is trained using two sources:
• Skin color database provided by Massey in [7].
• Skin color database created by ourselves: we captured sample images in the same operational environment as our target hand extraction program, then manually erased the non-skin parts of those images and retained the skin pixels.
Fig. 2 shows some of the samples used to train the Mixture of Gaussians model. We plot the samples and view the distribution shape to find the parameter k for the mixture of Gaussians in (3). Fig. 3 shows the plot of part of our database. This part of the database is best approximated by a mixture of two Gaussians, because its shape contains two peaks. In our experiment, the database contains day-time and night-time images; each subset is best fit with k = 2, so we use k = 4 for the entire database.
Fig. 4 is an intermediate result after the skin detection
(a)
(b)
Fig. 4. Intermediate result after skin detection. (a) Original left image and (b) Skin result image.
step. Background and noise remain in the resulting image. We will filter those pixels in later steps.
III. GENERATING DISPARITY IMAGES
A. Related work
Much current research focuses on generating disparity images. The official website [8] lists a number of recent publications. A taxonomy and evaluation of these methods is provided in [9]. This paper does not focus on disparity images; our purpose is only to consider disparity as an input component. Graph Cut, a global matching method described in [10], [11], is reported to be the best method.
B. Our work
To improve extractor performance, disparity images should
satisfy these requirements:
• High accuracy.
• Short running time.
The input images, including the raw skin-based image and the disparity image, are not required to be highly accurate; we combine them to extract high-quality results. Run time matters because it determines the system frame rate.
The method in [10] is chosen because it provides adequate results in an acceptable run time. Fig. 5 shows a sample disparity image generated using [10].
Fig. 5. Sample disparity images of Fig. 4(a), generated by Graph Cut [10].
(a)
(b)
Fig. 6. Intermediate result after filtering background and small noises. (a) Original left image. (b) Intermediate result that contains only the face, hands and big noises.
IV. BACKGROUND AND SMALL NOISE REMOVAL
Disparity information is useful for removing the background from the raw skin-based image. The disparity value range is 0 to 255. In this paper, using the range from 55 to 200, we can simply remove the background. The appropriate range threshold can be re-adjusted for different application conditions.
We follow these two steps to remove small noise:
• Every connected component is extracted.
• Small connected components, containing fewer than a predefined number of pixels, are removed.
Fig. 6 shows the intermediate result. At this step, the background is filtered using the disparity range. In addition, all connected components that have a small number of pixels are considered noise and are removed from the image. The resulting image, Fig. 6(b), still has some big components, located at the bottom left corner. These will be filtered in the next step.
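The two steps above, background filtering by disparity range followed by small-component removal, can be sketched as follows. The threshold values follow the text; the flood-fill labelling is our own minimal implementation, not the authors' code:

```python
import numpy as np
from collections import deque

def filter_background(skin_mask, disparity, lo=55, hi=200):
    """Keep skin pixels whose disparity value lies in [lo, hi]."""
    return skin_mask & (disparity >= lo) & (disparity <= hi)

def remove_small_components(mask, min_pixels):
    """Drop 4-connected components smaller than min_pixels (BFS labelling)."""
    h, w = mask.shape
    visited = np.zeros_like(mask, dtype=bool)
    out = np.zeros_like(mask, dtype=bool)
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not visited[sy, sx]:
                comp, q = [], deque([(sy, sx)])
                visited[sy, sx] = True
                while q:  # collect one connected component
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) >= min_pixels:  # keep only big components
                    for y, x in comp:
                        out[y, x] = True
    return out
```

In a production system the per-pixel Python loop would be replaced by a vectorized or library labelling routine, but the logic is the same.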
V. HAND EXTRACTION
After removing the background and small noise, the intermediate result contains only the face, the two hands and big noise components. Some research uses many hand features or restrictions to extract hands rapidly and easily. In [12], the hand is always the largest component inside the camera view; in [13], many hand shapes must be trained before extraction. In contrast, we use skin and depth information for both hands. This makes the extractor powerful without being too time consuming.
The relative information of any two components is considered to extract hands. Before designing an evaluation function to extract hands, we need to define some features of the detected connected components.
• Size of a component A: size(A), the number of pixels in A.
• Size index of a component A: size index(A), the rank of A by size compared to the other remaining components. The largest component has size index 0, the second largest has size index 1, and the smallest has size index equal to the number of components minus 1.
• Average height of a component A: avg height(A), the average vertical position of all pixels in A.
• Average disparity of a component A: avg disp(A), the average disparity value of all pixels in A.
• Disparity index of a component A: disp index(A), the rank of A by average disparity compared to the other remaining components. The component with the greatest average disparity has disparity index 0, the component with the second greatest has disparity index 1, and the component with the smallest has disparity index equal to the number of components minus 1.
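The two rank-based features, size index and disparity index, can be computed by sorting, as in this sketch (components are represented as dictionaries with hypothetical key names):

```python
def add_rank_indices(components):
    """Assign size_index (largest component = 0) and disp_index
    (largest average disparity = 0) to each component dict."""
    for i, c in enumerate(sorted(components,
                                 key=lambda c: c["size"], reverse=True)):
        c["size_index"] = i
    for i, c in enumerate(sorted(components,
                                 key=lambda c: c["avg_disp"], reverse=True)):
        c["disp_index"] = i
    return components
```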
We define the evaluation function as follows:

f(A, B) = f1(A, B) + f2(A, B) + f3(A, B) + f4(A, B),    (4)

where A and B are any two components, and f1, f2, f3 and f4 are sub-functions that represent, respectively, the size index, size difference, disparity index, and height difference of A and B.
Each sub-function in (4) is defined as follows. Observing that the hands, together with the face, are usually the biggest
components, we define the size index sub-function as:

f1(A, B) = 5, if size index(A) ≤ 3 and size index(B) ≤ 3;
           3, if size index(A) ≤ 5 and size index(B) ≤ 5;
           0, otherwise.    (5)
The two hands, in our application, usually have the same size. We define the size difference sub-function of the evaluation function as:

f2(A, B) = 5, if |size(A) − size(B)| < (1/10) · max(size(A), size(B));
           3, if |size(A) − size(B)| < (1/5) · max(size(A), size(B));
           1, if |size(A) − size(B)| < (1/3) · max(size(A), size(B));
           0, otherwise.    (6)
We observe that the two hands are usually the components closest to the camera, causing their average disparity values to be larger than any others. We define the disparity index sub-function as:

f3(A, B) = 5, if disp index(A) ≤ 3 and disp index(B) ≤ 3;
           3, if disp index(A) ≤ 5 and disp index(B) ≤ 5;
           0, otherwise.    (7)
The final part of the evaluation function comes from observing that the two hands usually have close average height values. We define the height difference sub-function as:

f4(A, B) = 10, if |avg height(A) − avg height(B)| < 70;
            5, if |avg height(A) − avg height(B)| < 100;
            3, if |avg height(A) − avg height(B)| < 150;
            0, otherwise.    (8)
Every pair of components is evaluated using this function. The pair that yields the largest return value is taken as the two hands. The return values of all sub-functions of f can be re-adjusted to fit the target application.
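A direct implementation of (4)-(8) and the pair search might look like the following sketch. Component features are assumed to be precomputed, and the dictionary key names are ours:

```python
from itertools import combinations

def f1(a, b):  # size index, eq. (5)
    if a["size_index"] <= 3 and b["size_index"] <= 3:
        return 5
    if a["size_index"] <= 5 and b["size_index"] <= 5:
        return 3
    return 0

def f2(a, b):  # size difference, eq. (6)
    diff, big = abs(a["size"] - b["size"]), max(a["size"], b["size"])
    if diff < big / 10:
        return 5
    if diff < big / 5:
        return 3
    if diff < big / 3:
        return 1
    return 0

def f3(a, b):  # disparity index, eq. (7)
    if a["disp_index"] <= 3 and b["disp_index"] <= 3:
        return 5
    if a["disp_index"] <= 5 and b["disp_index"] <= 5:
        return 3
    return 0

def f4(a, b):  # height difference, eq. (8)
    diff = abs(a["avg_height"] - b["avg_height"])
    if diff < 70:
        return 10
    if diff < 100:
        return 5
    if diff < 150:
        return 3
    return 0

def extract_hands(components):
    """Return the pair of components maximizing f = f1+f2+f3+f4, eq. (4)."""
    return max(combinations(components, 2),
               key=lambda p: f1(*p) + f2(*p) + f3(*p) + f4(*p))
```

With a face, two hands and residual noise as candidates, the two hand components score highest on all four criteria and are returned as the winning pair.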
VI. EXPERIMENTAL RESULT
A. Working environment
We used a standard personal computer for the experiments.
• CPU: AMD Athlon 64 X2 Dual 3800+.
• Memory: 1GB RAM.
• Operating System: Windows XP 32-bit SP3.
• Camera: BumbleBee 2 ICX204, a product of Point Grey Research Inc. [14].
• Frame resolution: 640×480.
• Operational environment: indoor (either day-time or night-time).
B. Result
We chose an environment close to the training environment for the experiment. About 80 sample images were used in the training process. More than 200 images were evaluated by our extractor. Table I and Table II summarize the statistical information from our experiment.
For Table I, we captured 120 stereo images in the morning or afternoon, extracted the hand results, and entered them in the upper row; these are the results under day-time conditions. For night-time conditions, we captured another 120 stereo images in the evening, extracted the hand results, and entered them in the lower row. The statistics show that we obtained 100 percent accuracy.
We calculated the average running times of skin-based detection, noise removal and hand extraction, and entered them in Table II. The statistics show that the speed of the entire process is good. The remaining time-consuming step is generating disparity by Graph Cut; in our test, this usually takes around 7 minutes per frame. This problem can be overcome by using BVZ [11] or Census [15] with the help of an FPGA. Our current software implementation of Census takes about 5 seconds per frame. We will apply it to our application in future work.
Fig. 7 shows some of our hand extraction results. Some holes in the results can be simply removed using the morphological opening operation (chapter 9, [16]). For some applications, hole removal is unnecessary in later steps. Therefore, we chose not to implement it in our work, as it would increase the run time.
TABLE I
ACCURACY STATISTICS
Total frames Correct frames Correct percentage
Day-time 120 120 100%
Night-time 120 120 100%
TABLE II
RUNNING TIME STATISTICS
Skin extraction Noises removal & hand extraction Total
30 ms/frame 90 ms/frame 120 ms/frame
VII. CONCLUSION AND FUTURE WORK
Our work, under specific conditions, uses skin color and
disparity information to extract the images of two hands. The
(a) (b) (c)
(d) (e) (f)
Fig. 7. Experimental result. (a), (b) and (c) are original left color images. (d), (e) and (f) are the corresponding respective hand extraction results.
program has been implemented and tested. Accuracy is high and the run time is acceptable.
Future research will focus on posture and gesture recognition. The extracted hands will provide us with features related to hand position and hand posture. The gesture recognizer will be trained using a model; in [17], a Hidden Markov Model is used with high reported accuracy. We will apply this model to our gesture recognizer.
ACKNOWLEDGMENT
This research was performed as part of the Samsung Project on Gesture Recognition, funded by Samsung Electronics, Republic of Korea.
REFERENCES
[1] A. Sepehri, Y. Yacoob and L. S. Davis, "Estimating 3D Hand Position and Orientation Using Stereo," Proc. of Conference on Computer Vision, Graphics and Image Processing, pp. 58-63, 2004.
[2] L. W. Chan, Y. F. Chuang, Y. W. Chia, Y. P. Hung and J. Y. Hsu, "A New Method for Multi-finger Detection Using a Regular Diffuser," Proc. of International Conference on Human-Computer Interaction, pp. 573-582, 2007.
[3] X. Yin, D. Guo and M. Xie, "Hand Image Segmentation using Color and RCE Neural Network," International Journal of Robotics and Autonomous Systems, pp. 235-250, 2001.
[4] V. Vezhnevets, V. Sazonov and A. Andreeva, "A Survey on Pixel-based Skin Color Detection Techniques," Proc. of Graphicon, pp. 85-92, 2003.
[5] M. J. Jones and J. M. Rehg, "Statistical Color Models with Application to Skin Detection," Proc. of Computer Vision and Pattern Recognition, vol. 1, pp. 274-280, 1999.
[6] L. Sigal, S. Sclaroff and V. Athitsos, "Skin Color-Based Video Segmentation under Time-Varying Illumination," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 862-877, 2004.
[7] F. Dadgostar, A. L. C. Barczak and A. Sarrafzadeh, "A Color Hand Gesture Database for Evaluating and Improving Algorithms on Hand Gesture and Posture Recognition," Research Letters in the Information and Mathematical Sciences, vol. 7, pp. 127-134, 2005.
[8] The Middlebury website. [Online]. Available: http://vision.middlebury.edu/stereo/, 2008.
[9] D. Scharstein and R. Szeliski, "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms," International Journal of Computer Vision, vol. 47, pp. 7-42, 2002.
[10] V. Kolmogorov and R. Zabih, "Multi-camera Scene Reconstruction via Graph Cuts," Proc. of European Conference on Computer Vision, pp. 82-96, 2002.
[11] Y. Boykov, O. Veksler and R. Zabih, "Fast Approximate Energy Minimization via Graph Cuts," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 1222-1239, 2001.
[12] S. J. Schmugge, M. A. Zaffar, L. V. Tsap and M. C. Shin, "Task-based Evaluation of Skin Detection for Communication and Perceptual Interfaces," Journal of Visual Communication and Image Representation, pp. 487-495, 2007.
[13] E. J. Ong and R. Bowden, "A Boosted Classifier Tree for Hand Shape Detection," Proc. of Automatic Face and Gesture Recognition, pp. 889-894, 2004.
[14] Point Grey Research Inc. [Online]. Available: http://www.ptgrey.com/, 2008.
[15] J. Woodfill and B. V. Herzen, "Real-Time Stereo Vision on the PARTS Reconfigurable Computer," IEEE Symposium on FPGAs for Custom Computing Machines, pp. 201-210, 1997.
[16] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd Edition. Prentice Hall, 2008.
[17] H. K. Lee and J. H. Kim, "An HMM-Based Threshold Model Approach for Gesture Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 961-973, 1999.