8/8/2019 Dual Hand Extraction Using Skin Color and Stereo Information
http://slidepdf.com/reader/full/dual-hand-extraction-using-skin-color-and-stereo-information 1/6
Dual Hand Extraction Using
Skin Color and Stereo Information∗
Thien Cong Pham, Xuan Dai Pham, Dung Duc Nguyen, Seung Hun Jin and Jae Wook Jeon, Member, IEEE
School of Information and Communication Engineering, Sungkyunkwan University
Suwon, Korea
{ pham, pdxuan}@ece.skku.ac.kr, [email protected], [email protected], [email protected]
Abstract—Extracting the positions of hands is an important step in Human Computer Interaction and Robot Vision applications. Posture and gesture can be extracted from hand positions and the appropriate task can be performed. In this paper, we propose an approach to extract hand images using skin color and stereo information. Our method does not require clear hand size or high-quality disparity. With a sound training database and an adequate working environment, we obtain nearly 100 percent accuracy. The run time, ignoring disparity generation time, is also acceptable.
Index Terms—Hand extraction, skin detection, stereo vision.
I. INTRODUCTION
Computer Vision plays an important role in Human Computer Interaction (HCI). Much research has been done and many papers have been published in this field. Among these, hand extraction has been of recent interest. Extraction of hand information is a task required for robots to understand our commands. The robot reads commands using a camera, taking images of the environment, extracting hand information and finding the appropriate operation to be performed. Hand extraction is usually not the final step in an HCI application. Later steps may include posture recognition or gesture recognition.
This paper concentrates on hand extraction to provide
information of hand position and hand shape. Our inputs are
the left color image, right color image and disparity image of
the environment. Our target result is an image that contains
only two hands. Fig. 1 is an example of hand extraction. The
more challenging final result is to obtain a clean and clear
hand image. That is, there should not be too much noise and
many holes in the resulting image.
In [1], the authors use only one hand and consider the plane that contains the back of the hand to estimate hand position and orientation. In our program, we have two hands in the camera view and we find their locations. The two hand planes in our application do not lie in approximately the same plane; that is, the distances from the left and right hands to the camera can differ. In [2], the authors use a special regular
∗This work is partially supported by Samsung Electronics.
(a)
(b)
Fig. 1. An example of hand extraction. (a) Original left image. (b) Hand result image.
diffuser to detect hands. In [3], the authors use image segmentation, which requires more calculation time. Our work takes a different approach, using only a camera and requiring fewer calculations.
Our program performs correctly in an operational environment that satisfies these conditions:
• The operational environment is indoors, either during day-time or night-time.
• The left hand, right hand and face do not overlap in the image.
• The difference in the vertical positions of the left and right hands should not be great.
Proceedings of the 2008 IEEE International Conference on Robotics and Biomimetics, Bangkok, Thailand, February 21-26, 2009.
978-1-4244-2679-9/08/$25.00 ©2008 IEEE
Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SAO CARLOS. Downloaded on August 03, 2010 at 11:18:39 UTC from IEEE Xplore. Restrictions apply.
The rest of this paper is organized as follows: in Section
II, we describe related work and our method for skin-based detection. This includes training the model and extracting skin pixels. Section III explains some methods to generate disparity images from stereo input images and the method that we use. Section IV describes a technique that uses disparity information and the skin-based image to remove background and small noise; a technique to detect connected components is used in this section. In Section V, we combine the results from previous sections to correctly generate images of two hands. Our experimental results, together with the operational environment, are provided in Section VI. Conclusions and future work are described in Section VII.
II. SKIN DETECTION
A. Related works
A survey of techniques for skin-based detection is provided in [4]. There are four groups of methods.
1) Explicitly defined skin region: This is the simplest and most static method. An (R, G, B) pixel is classified as skin if

R > 95 and G > 40 and B > 20 and
max{R, G, B} − min{R, G, B} > 15 and
|R − G| > 15 and R > G and R > B.    (1)
We experimented to check the performance of this method. The results were of poor quality, with much noise and many holes. In addition, the quality was dependent on illumination. We conclude that this method is inadequate for our application.
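For reference, rule (1) translates directly into code. The following Python sketch (the function name is ours, not from the paper) classifies a single RGB pixel:

```python
def is_skin_rgb(r, g, b):
    """Explicit skin rule (1): R > 95, G > 40, B > 20,
    max-min channel spread > 15, |R - G| > 15, R > G and R > B."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)
```

A typical skin tone such as (200, 120, 80) satisfies every condition, while a gray pixel fails the channel-spread test.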
2) Nonparametric skin distribution modelling: Three main
types are reported:
• Normalized lookup table (LUT).
• Bayes classifier.
• Self Organizing Map.
In the LUT approach, the colorspace is divided into a number of bins, each representing a particular range of color component value pairs. These bins form a 2D or 3D histogram (depending on the dimensionality of the colorspace) called a reference lookup table (LUT). Each bin stores the number of times that particular color occurred in the training skin sample images. The value of a lookup table bin constitutes the likelihood that the queried color corresponds to skin. Similarly, in a Bayes classifier, not only P(c|skin) is calculated, but P(c|¬skin) is also counted. The Self Organizing Map, trained using the well-known Compaq skin database provided by [5], is reported to
be marginally better than the Mixture of Gaussians model. In the Self Organizing Map method, two sets of labelled images, skin-only and skin plus non-skin, are used to train the model. Many colorspaces were tested with the Self Organizing Map detector.
The advantages of non-parametric methods are that they are rapid to train and use, and they are independent of the shape of the skin distribution. Their drawbacks include storage space requirements and an inability to interpolate or generalize from the training data.
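As an illustration of the normalized LUT method described above, the sketch below (the bin count and function names are our assumptions, not from the paper) builds a 2D chrominance histogram from training skin pixels and queries it as a per-bin likelihood:

```python
import numpy as np

def build_skin_lut(skin_pixels, bins=32):
    """Build a normalized 2D lookup table over two chrominance
    channels; channel values are assumed to lie in [0, 256)."""
    hist = np.zeros((bins, bins))
    step = 256 // bins
    for c1, c2 in skin_pixels:
        hist[c1 // step, c2 // step] += 1
    return hist / hist.sum()  # each bin holds a likelihood

def skin_likelihood(lut, c1, c2, bins=32):
    """Look up the skin likelihood of a chrominance pair."""
    step = 256 // bins
    return lut[c1 // step, c2 // step]
```

Classification then reduces to thresholding the returned likelihood, which is why LUT methods are fast but cannot generalize beyond the training bins.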
3) Parametric skin distribution modelling: The four main
reported types are:
• Single Gaussian.
• Mixture of Gaussians.
• Multiple Gaussian clusters.
• Elliptic boundary model.
In the Single Gaussian method, the skin color distribution is modelled by a Gaussian joint probability density function:

p(c|skin) = (1 / (2π |Σs|^(1/2))) · exp(−(1/2) (c − µs)^T Σs^(−1) (c − µs)).    (2)
In (2), c is the color vector, and µs and Σs are the mean vector and covariance matrix of the distribution. The probability p(c|skin) can alternatively be used as a skin-likeness measurement. The Mixture of Gaussians method considers the skin color distribution to be a mixture of Gaussian probability density functions:

p(c|skin) = Σ_{i=1..k} πi · pi(c|skin),    (3)
where k is the number of mixture components and πi are the mixture parameters. The Multiple Gaussian Clusters method approximates the skin color cluster with three 3D Gaussians in YCbCr colorspace. A pixel is classified as skin if the Mahalanobis distance from the color vector c to the nearest model cluster center is less than a threshold. In another approach, the Elliptic Boundary Model claims that a Gaussian shape is insufficient to approximate the skin color distribution, and instead proposes an elliptical boundary.
The performance of these methods clearly depends on the distribution shape in the target application. Some research proves the correctness of the distribution choice for specific cases. Our work, described in the next part of this section, uses the Mixture of Gaussians model.
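Equations (2) and (3) can be evaluated directly once the parameters are fitted. The sketch below assumes a 2-dimensional chrominance vector c (so the Gaussian normalization constant is 2π|Σs|^(1/2)); function names and the parameter layout are ours:

```python
import numpy as np

def gaussian_skin_likelihood(c, mu, sigma):
    """Single-Gaussian skin likelihood p(c|skin), eq. (2),
    for a 2D color vector c with mean mu and covariance sigma."""
    d = c - mu
    norm = 2 * np.pi * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d) / norm

def mixture_skin_likelihood(c, mus, sigmas, weights):
    """Mixture-of-Gaussians likelihood, eq. (3): a weighted sum
    of k component likelihoods."""
    return sum(w * gaussian_skin_likelihood(c, m, s)
               for m, s, w in zip(mus, sigmas, weights))
```

In practice the mixture parameters (mus, sigmas, weights) would come from an EM fit to the training skin samples.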
4) Dynamic skin distribution models: Rather than fixing the skin distribution model, methods of this type tune the model dynamically under different operational conditions. These methods require rapid training and classification, and they should be able to adapt to changing conditions. The complexity of this type of method is clearly greater than that of the earlier methods.
Determining the best method is outside the scope of this paper.
A comparative evaluation of skin-based detection methods
can be found in [4].
B. Our work
After considering the properties of these skin detection methods and reviewing the related work, we concluded that the Mixture of Gaussians is the best fit for our program. In [6], the HSV colorspace is used to obtain better tolerance to illumination. Operational conditions in [6] change very rapidly, because skin is detected in real video.
(a) (b)
Fig. 2. Some images of our training database. (a) Images provided by Massey. (b) Images generated by ourselves.
Fig. 3. Plot of the sample database used to estimate the number of Gaussians to train.
In our application, we use LUV colorspace and ignore the
Luminance part of the pixel’s information.
Since the database in [5] is no longer freely available, our
model is trained using two sources:
• Skin color database provided by Massey in [7].
• Skin color database created by ourselves: we captured sample images in the same operational environment as our target hand extraction program, then manually erased the non-skin parts of those images and retained the skin pixels.
Fig. 2 shows some of the samples used to train the Mixture of Gaussians model. We plot the samples and view the distribution shape to find the parameter k for the mixture of Gaussians in (3). Fig. 3 shows the plot of part of our database. This part of the database is best approximated by a mixture of two Gaussians, because its shape contains two peaks. In our experiment, the database contains day-time and night-time images; each subset is best fit with k = 2, so we use k = 4 for the entire database.
Fig. 4 is an intermediate result after the skin detection
(a)
(b)
Fig. 4. Intermediate result after skin detection. (a) Original left image and (b) Skin result image.
step. Background and noise remain in the resulting image. We will filter those pixels in later steps.
III. GENERATING DISPARITY IMAGES
A. Related work
Much current research focuses on generating disparity images. The official website [8] lists a number of recent publications. A taxonomy and evaluation of these methods is provided in [9]. This paper does not focus on disparity images; our purpose is only to consider disparity as an input component. Graph Cut, a global matching method described in [10], [11], is reported to be the best method.
B. Our work
To improve extractor performance, disparity images should
satisfy these requirements:
• High accuracy.
• Short running time.
The input images, including the raw skin-based image and the disparity image, are not required to be highly accurate; we combine them to extract high-quality results. Run time matters because it determines the system frame rate.
The method in [10] is chosen because it provides adequate results in an acceptable run time. Fig. 5 shows a sample disparity image generated using [10].
Fig. 5. Sample disparity images of Fig. 4(a), generated by Graph Cut [10].
(a)
(b)
Fig. 6. Intermediate result after filtering background and small noises. (a) Original left image. (b) Intermediate result that contains only the face, hands and big noises.
IV. BACKGROUND AND SMALL NOISE REMOVAL
Disparity information is useful for removing the background from the raw skin-based image. The disparity value range is 0 to 255. In this paper, using the range from 55 to 200, we can simply remove the background. The appropriate range threshold can be re-adjusted for different application conditions.
We follow these two steps to remove small noise:
• Every connected component is extracted.
• Small connected components, containing fewer than a predefined number of pixels, are removed.
Fig. 6 shows the intermediate result. At this step, the background is filtered using the disparity range. In addition, all connected components that have a small number of pixels are considered noise and are removed from the image. The resulting image, Fig. 6(b), still has some big components, located at the bottom left corner. These will be filtered in the next step.
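The two steps above, background filtering by disparity range followed by small-component removal, can be sketched as follows. The threshold values follow the text; the flood-fill labelling is our own minimal implementation, not the authors' code:

```python
import numpy as np
from collections import deque

def filter_background(skin_mask, disparity, lo=55, hi=200):
    """Keep skin pixels whose disparity value lies in [lo, hi]."""
    return skin_mask & (disparity >= lo) & (disparity <= hi)

def remove_small_components(mask, min_pixels):
    """Drop 4-connected components smaller than min_pixels (BFS labelling)."""
    h, w = mask.shape
    visited = np.zeros_like(mask, dtype=bool)
    out = np.zeros_like(mask, dtype=bool)
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not visited[sy, sx]:
                comp, q = [], deque([(sy, sx)])
                visited[sy, sx] = True
                while q:  # collect one connected component
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) >= min_pixels:  # keep only big components
                    for y, x in comp:
                        out[y, x] = True
    return out
```

In a production system the per-pixel Python loop would be replaced by a vectorized or library labelling routine, but the logic is the same.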
V. HAND EXTRACTION
After removing the background and small noise, the intermediate result contains only the face, the two hands and big noise components. Some research uses many hand features or restrictions to extract hands rapidly and easily. In [12], the hand is always the largest component inside the camera view; in [13], many hand shapes must be trained before extraction. In contrast, we use skin and depth information for both hands. This makes the extractor powerful without being too time consuming.
The relative information of any two components is considered to extract hands. Before designing an evaluation function to extract hands, we need to define some features of the detected connected components.
• Size of a component A: size(A), the number of pixels in A.
• Size index of a component A: size index(A), the rank of A by size compared to the other remaining components. The largest component has size index 0, the second largest has size index 1, and the smallest has size index equal to the number of components minus 1.
• Average height of a component A: avg height(A), the average vertical position of all pixels in A.
• Average disparity of a component A: avg disp(A), the average disparity value of all pixels in A.
• Disparity index of a component A: disp index(A), the rank of A by average disparity compared to the other remaining components. The component with the greatest average disparity has disparity index 0, the component with the second greatest has disparity index 1, and the component with the smallest has disparity index equal to the number of components minus 1.
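The two rank-based features, size index and disparity index, can be computed by sorting, as in this sketch (components are represented as dictionaries with hypothetical key names):

```python
def add_rank_indices(components):
    """Assign size_index (largest component = 0) and disp_index
    (largest average disparity = 0) to each component dict."""
    for i, c in enumerate(sorted(components,
                                 key=lambda c: c["size"], reverse=True)):
        c["size_index"] = i
    for i, c in enumerate(sorted(components,
                                 key=lambda c: c["avg_disp"], reverse=True)):
        c["disp_index"] = i
    return components
```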
We define the evaluation function as follows:

f(A, B) = f1(A, B) + f2(A, B) + f3(A, B) + f4(A, B),    (4)

where A and B are any two components, and f1, f2, f3 and f4 are sub-functions that represent, respectively, the size index, size difference, disparity index, and height difference of A and B.
Each sub-function in (4) is defined as follows. Observing that the hands, together with the face, are usually the biggest
components, we define the size index sub-function as:

f1(A, B) = 5, if size index(A) ≤ 3 and size index(B) ≤ 3;
           3, if size index(A) ≤ 5 and size index(B) ≤ 5;
           0, otherwise.    (5)
The two hands, in our application, usually have the same size. We define the size difference sub-function of the evaluation function as:

f2(A, B) = 5, if |size(A) − size(B)| < (1/10) · max(size(A), size(B));
           3, if |size(A) − size(B)| < (1/5) · max(size(A), size(B));
           1, if |size(A) − size(B)| < (1/3) · max(size(A), size(B));
           0, otherwise.    (6)
We observe that the two hands are usually the components closest to the camera, causing their average disparity values to be larger than any others. We define the disparity index sub-function as:

f3(A, B) = 5, if disp index(A) ≤ 3 and disp index(B) ≤ 3;
           3, if disp index(A) ≤ 5 and disp index(B) ≤ 5;
           0, otherwise.    (7)
The final part of the evaluation function comes from observing that the two hands usually have close average height values. We define the height difference sub-function as:

f4(A, B) = 10, if |avg height(A) − avg height(B)| < 70;
            5, if |avg height(A) − avg height(B)| < 100;
            3, if |avg height(A) − avg height(B)| < 150;
            0, otherwise.    (8)
Every pair of components is evaluated using this function. The pair that yields the largest return value is taken as the two hands. The return values of all sub-functions of f can be re-adjusted to fit the target application.
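A direct implementation of (4)-(8) and the pair search might look like the following sketch. Component features are assumed to be precomputed, and the dictionary key names are ours:

```python
from itertools import combinations

def f1(a, b):  # size index, eq. (5)
    if a["size_index"] <= 3 and b["size_index"] <= 3:
        return 5
    if a["size_index"] <= 5 and b["size_index"] <= 5:
        return 3
    return 0

def f2(a, b):  # size difference, eq. (6)
    diff, big = abs(a["size"] - b["size"]), max(a["size"], b["size"])
    if diff < big / 10:
        return 5
    if diff < big / 5:
        return 3
    if diff < big / 3:
        return 1
    return 0

def f3(a, b):  # disparity index, eq. (7)
    if a["disp_index"] <= 3 and b["disp_index"] <= 3:
        return 5
    if a["disp_index"] <= 5 and b["disp_index"] <= 5:
        return 3
    return 0

def f4(a, b):  # height difference, eq. (8)
    diff = abs(a["avg_height"] - b["avg_height"])
    if diff < 70:
        return 10
    if diff < 100:
        return 5
    if diff < 150:
        return 3
    return 0

def extract_hands(components):
    """Return the pair of components maximizing f = f1+f2+f3+f4, eq. (4)."""
    return max(combinations(components, 2),
               key=lambda p: f1(*p) + f2(*p) + f3(*p) + f4(*p))
```

With a face, two hands and residual noise as candidates, the two hand components score highest on all four criteria and are returned as the winning pair.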
VI. EXPERIMENTAL RESULT
A. Working environment
We used a standard personal computer for the experiments.
• CPU: AMD Athlon 64 X2 Dual 3800+.
• Memory: 1GB RAM.
• Operating System: Windows XP 32-bit SP3.
• Camera: BumbleBee 2 ICX204, a product of Point Grey Research Inc. [14].
• Frame resolution: 640×480.
• Operational environment: indoor (either day-time or night-time).
B. Result
We chose an environment close to the training environment for the experiment. About 80 sample images were used in the training process. More than 200 images were evaluated by our extractor. Table I and Table II summarize the statistical information from our experiment.
For Table I, we captured 120 stereo images in the morning or afternoon, extracted the hand results, and entered them in the upper row; these are the results under day-time conditions. For night-time conditions, we captured another 120 stereo images in the evening, extracted the hand results, and entered them in the lower row. The statistics show that we obtained 100 percent accuracy.
We calculated the average running times of skin-based detection, noise removal and hand extraction, and entered them in Table II. The statistics show that the speed of the entire process is good. The remaining time-consuming step is generating disparity by Graph Cut; in our test, this usually takes around 7 minutes per frame. This problem can be overcome by using BVZ [11] or Census [15] with the help of an FPGA. Our current software implementation of Census takes about 5 seconds per frame. We will apply it to our application in future work.
Fig. 7 shows some of our hand extraction results. Some holes in the results can be simply removed using the morphological opening operation (chapter 9, [16]). For some applications, hole removal is unnecessary in later steps. Therefore, we chose not to implement it in our work, as it would increase the run time.
TABLE I
ACCURACY STATISTICS
Total frames Correct frames Correct percentage
Day-time 120 120 100%
Night-time 120 120 100%
TABLE II
RUNNING TIME STATISTICS
Skin extraction Noises removal & hand extraction Total
30 ms/frame 90 ms/frame 120 ms/frame
VII. CONCLUSION AND FUTURE WORK
Our work, under specific conditions, uses skin color and
disparity information to extract the images of two hands. The
(a) (b) (c)
(d) (e) (f)
Fig. 7. Experimental result. (a), (b) and (c) are original left color images. (d), (e) and (f) are the corresponding respective hand extraction results.
program has been implemented and tested. Accuracy is high and the run time is acceptable.
Future research will focus on posture and gesture recognition. The extracted hands will provide us with features related to hand position and hand posture. The gesture recognizer will be trained using a model; in [17], a Hidden Markov Model is used with high reported accuracy. We will apply this model to our gesture recognizer.
ACKNOWLEDGMENT
This research was performed as part of the Samsung Project on Gesture Recognition, funded by Samsung Electronics, Republic of Korea.
REFERENCES
[1] A. Sepehri, Y. Yacoob and L. S. Davis, "Estimating 3D Hand Position and Orientation Using Stereo," Proc. of Conference on Computer Vision, Graphics and Image Processing, pp. 58-63, 2004.
[2] L. W. Chan, Y. F. Chuang, Y. W. Chia, Y. P. Hung and J. Y. Hsu, "A New Method for Multi-finger Detection Using a Regular Diffuser," Proc. of International Conference on Human-Computer Interaction, pp. 573-582, 2007.
[3] X. Yin, D. Guo and M. Xie, "Hand Image Segmentation using Color and RCE Neural Network," International Journal of Robotics and Autonomous Systems, pp. 235-250, 2001.
[4] V. Vezhnevets, V. Sazonov and A. Andreeva, "A Survey on Pixel-based Skin Color Detection Techniques," Proc. of Graphicon, pp. 85-92, 2003.
[5] M. J. Jones and J. M. Rehg, "Statistical Color Models with Application to Skin Detection," Proc. of Computer Vision and Pattern Recognition, vol. 1, pp. 274-280, 1999.
[6] L. Sigal, S. Sclaroff and V. Athitsos, "Skin Color-Based Video Segmentation under Time-Varying Illumination," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 862-877, 2004.
[7] F. Dadgostar, A. L. C. Barczak and A. Sarrafzadeh, "A Color Hand Gesture Database for Evaluating and Improving Algorithms on Hand Gesture and Posture Recognition," Research Letters in the Information and Mathematical Sciences, vol. 7, pp. 127-134, 2005.
[8] The Middlebury website. [Online]. Available: http://vision.middlebury.edu/stereo/, 2008.
[9] D. Scharstein and R. Szeliski, "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms," International Journal of Computer Vision, vol. 47, pp. 7-42, 2002.
[10] V. Kolmogorov and R. Zabih, "Multi-camera Scene Reconstruction via Graph Cuts," Proc. of European Conference on Computer Vision, pp. 82-96, 2002.
[11] Y. Boykov, O. Veksler and R. Zabih, "Fast Approximate Energy Minimization via Graph Cuts," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 1222-1239, 2001.
[12] S. J. Schmugge, M. A. Zaffar, L. V. Tsap and M. C. Shin, "Task-based Evaluation of Skin Detection for Communication and Perceptual Interfaces," Journal of Visual Communication and Image Representation, pp. 487-495, 2007.
[13] E. J. Ong and R. Bowden, "A Boosted Classifier Tree for Hand Shape Detection," Proc. of Automatic Face and Gesture Recognition, pp. 889-894, 2004.
[14] Point Grey Research Inc. [Online]. Available: http://www.ptgrey.com/, 2008.
[15] J. Woodfill and B. V. Herzen, "Real-Time Stereo Vision on the PARTS Reconfigurable Computer," IEEE Symposium on FPGAs for Custom Computing Machines, pp. 201-210, 1997.
[16] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd Edition. Prentice Hall, 2008.
[17] H. K. Lee and J. H. Kim, "An HMM-Based Threshold Model Approach for Gesture Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 961-973, 1999.