
Signal Processing: Image Communication 19 (2004) 421–436

*Corresponding author. Tel.: +44-1483-68-3433.
E-mail address: [email protected] (M. Hu).
0923-5965/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.image.2004.02.003

Automatic scalable face model design for 2D model-based video coding

M. Hu*, S. Worrall, A.H. Sadka, A.M. Kondoz

Centre for Communication Systems Research (CCSR), Department of Electronics Engineering, School of Electronics and Physical Science, University of Surrey, Guildford, Surrey, GU2 7XH, UK

Received 14 August 2003; received in revised form 14 January 2004; accepted 6 February 2004

Abstract

Scalable low bit-rate video coding is vital for the transmission of video signals over wireless channels. A scalable model-based video coding scheme is proposed in this paper to achieve this. This paper mainly addresses automatic scalable face model design. Firstly, a robust and adaptive face segmentation method is proposed, which is based on piecewise skin-colour distributions. Forty-three million skin pixels from 900 images are used to train the skin-colour model, which can identify skin-colour pixels reliably under different lighting conditions. Next, reliable algorithms are proposed for detecting the eyes, mouth and chin, which are used to verify the face candidatures. Then, based on the detected facial features and human face muscular distributions, a heuristic scalable face model is designed to represent the rigid and non-rigid motion of the head and facial features. A novel motion estimation algorithm is proposed to estimate the object model motion hierarchically. Experimental results illustrate the performance of the proposed algorithms for facial feature detection and the accuracy of the designed scalable face model for representing face motion.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Face detection; Facial feature extraction; Scalable face modelling; Model-based video coding; Scalable compression

1. Introduction

In the past decade, much research has focused on 3-dimensional (3-D) model-based video coding for videophones [6,13,16]. In model-based video coding, a 3-D wireframe model is predefined in both the encoder and the decoder to analyse and synthesize the face object, so only the analysis data need to be transmitted to the decoder. Experimental results show that it can achieve very low bit-rate video coding (below 5 kbps) [6].


However, the 3-D model-based video coding scheme has several disadvantages:

* The adaptation of the face model to a particular human face in the sequence is very complicated and time-consuming.

* The analysis process in the video encoder is too complex to extract the data needed to synthesize the image in the decoder, which makes it unsuitable for handheld mobile phones.

* It is very sensitive to channel errors. Although scalable coding can be used to solve this problem, it is very complex to design a 3-D scalable model.




Currently, 3-D model-based video codecs are too rigidly object-specific, because extracting the 3-D structure of single objects in an unrestricted environment and modelling their surfaces efficiently are extremely difficult tasks. Modelling objects is the most important issue in model-based video coding, as the complexity of analysis and synthesis depends on the model adopted. Therefore, at present, 3-D model-based video coding is applied only to coding head–shoulder sequences.

Much research has also focused on 2-D model-based video coding [2,13]. Compared with 3-D model-based video coding, 2-D models have several advantages. First, 2-D model-based video coding is rather universal and not limited to facial images; 2-D mesh models (unlike 3-D wireframe models) can easily be designed for arbitrary scenes. Second, 2-D parametric motion estimation is a better-posed problem than 3-D motion and structure estimation, so the analysis process of 2-D model-based video coding is much easier than that of 3-D model-based video coding, where complex algorithms are needed to obtain accurate 3-D motion and structure estimates because of the ill-posed nature of the problem. Although a priori knowledge of the object can be used to improve the efficiency of 3-D model-based video coding, it can equally be used to improve 2-D model design and object coding.

Some research has shown that 2-D model-based coding with affine/perspective transformations and triangular mesh models can simulate almost all the capabilities of 3-D model-based approaches using wireframe models, at a fraction of the computational cost [2,13]. Recently, 2-D object models have also been applied to object description and retrieval in information databases [15]. It is therefore useful to employ 2-D models in our research on scalable model-based video coding.

In our research, a scalable 2-D model-based video coding scheme has been proposed [7]. The proposed scheme consists of the following steps: video segmentation and 2-D scalable model design; progressive texture coding; scalable object model coding (including shape coding); object tracking and model adaptation; and scalable coding of the region of interest (ROI) for newly appeared regions. Although the proposed scheme is rather universal and not limited to head–shoulder sequences, the head–shoulder foreground object is treated specially to improve coding efficiency, by using a priori knowledge of the head object.

This paper mainly addresses scalable face model design and its performance evaluation. Firstly, a robust and adaptive face segmentation method is proposed, based on piecewise skin-colour distributions. Next, reliable algorithms are proposed for detecting the eyes, mouth and chin, which are also used to verify the face candidatures. Then, based on the detected facial features and human face muscular distributions, a heuristic scalable face model is designed to represent the rigid and non-rigid motion of the head and facial features. Finally, an efficient motion estimation method is proposed to evaluate the efficiency of the designed model.

The proposed method features three major novelties. (1) We propose a robust and simple face detection scheme: an illumination-piecewise statistical skin-colour model and a Bayesian decision/relaxation scheme achieve detection that is robust to different lighting conditions and skin colours. (2) A reliable and simple facial feature detection scheme is proposed, which is very important for its application to 2-D scalable model-based video coding. (3) The facial muscular distribution is introduced to build the scalable face model, which can describe face motion more precisely, hence reducing the warping error during model-based video coding.

The paper is organized as follows. Section 2 presents efficient and robust facial feature detection algorithms. In Section 3, a heuristic scalable face model is designed based on the detected facial features, the head structure and the face muscular distributions. A novel motion estimation algorithm is proposed in Section 4 to evaluate the model performance. Experimental results are provided in Section 5 to demonstrate the accuracy and efficiency of the facial feature detection and the scalable face model design. Conclusions are drawn in Section 6, followed by future work.


2. Facial feature detection

In recent years, facial feature detection has received considerable attention due to its wide range of applications, such as face recognition, human–computer interfaces and model-based video coding. Many approaches have been proposed, applying techniques such as neural networks, support vector machines, geometrical modelling, motion extraction and colour analysis [9,12,19,24,26]. Further schemes for facial feature detection on frontal-view faces are discussed in [11,21,29], and Yang et al. give a detailed review of face detection algorithms [31]. One disadvantage of these methods, such as [4,9,11,24,26], is their heavy computational complexity (including training), which makes them hard to use in our proposed 2-D scalable model-based video coding. The other disadvantage is that they are not robust enough to seriously cluttered backgrounds and different lighting conditions.

The proposed detection approach consists of three steps: face location, eye and mouth detection, and chin detection, which are described in detail in the following sections. During the detection process, we assume that the detected faces in the video sequence are frontal or near-frontal views. This makes facial feature detection and the construction of the scalable face model easier.

2.1. Face location

A robust and adaptive face segmentation method is proposed to locate and regularize face candidatures, based on luminance-piecewise skin-colour distributions. It consists of three steps:

1. Detect face candidatures based on a luminance-piecewise statistical skin-colour model and Bayesian decision/relaxation.

2. Regularize the face candidatures using spatial segmentation results.

3. Evaluate the face candidatures by both shape and size.

There are many methods that locate face candidatures based on a skin-colour model, owing to its processing speed [4,9]. However, we find in our experiments that none of these methods can detect the face robustly under both poor and strong lighting conditions, and the detected face is full of holes or has a zigzag shape. In fact, the skin-colour model, that is, the distribution of the chrominance components Cr and Cb, is related to the luminance value Y. In our research, non-parametric kernel density estimation is used to build the piecewise statistical skin-colour distributions. Forty-three million skin pixels from 900 images in [17] are used to train the skin models shown in Fig. 1. In order to increase robustness to different lighting conditions, the skin models are separated into 6 parts based on the luminance value Y, as shown in Fig. 1(a)–(f).
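As a concrete illustration (not the authors' code), the piecewise model can be sketched as a set of per-luminance-band 2-D Cb–Cr densities. A normalized histogram stands in here for the paper's kernel density estimation, and the band edges `Y_EDGES` are illustrative, since the paper does not publish the exact boundaries of its six luminance ranges:

```python
import numpy as np

# Illustrative luminance band edges; the paper splits the skin model into
# 6 parts by Y but does not give the exact boundaries.
Y_EDGES = np.array([0, 43, 86, 128, 170, 213, 256])

def train_piecewise_skin_model(ycbcr_pixels, bins=64):
    """Build one 2-D (Cb, Cr) density per luminance band from labelled skin
    pixels (an (N, 3) array of [Y, Cb, Cr]). A normalized histogram stands in
    for the paper's non-parametric kernel density estimation."""
    band = np.digitize(ycbcr_pixels[:, 0], Y_EDGES) - 1
    models = []
    for b in range(len(Y_EDGES) - 1):
        px = ycbcr_pixels[band == b]
        if len(px) == 0:
            models.append(np.zeros((bins, bins)))
            continue
        hist, _, _ = np.histogram2d(px[:, 1], px[:, 2], bins=bins,
                                    range=[[0, 256], [0, 256]], density=True)
        models.append(hist)
    return models

def skin_likelihood(models, y, cb, cr, bins=64):
    """Look up p(x | skin) for one pixel from its band-specific model."""
    b = min(np.digitize(y, Y_EDGES) - 1, len(Y_EDGES) - 2)
    return models[b][min(cb * bins // 256, bins - 1),
                     min(cr * bins // 256, bins - 1)]
```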

In our research, pixels are classified based on Bayesian decision and relaxation in order to minimize faulty decisions. Let x be the feature vector of a pixel, and let p(x|ω₁) and p(x|ω₂) be the class-conditional probability densities of the skin-colour class ω₁ and the non-skin-colour class ω₂, respectively.

The decision commonly involves the followingprocess:

\[
L(x) = \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;\ge\; TH \;\Rightarrow\; x \in \omega_1, \qquad \text{otherwise } x \in \omega_2. \tag{1}
\]

By applying the Bayes formula and minimizing the classification cost, the following relation holds:

\[
L(x) = \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;\ge\; \frac{C_{12} - C_{22}}{C_{21} - C_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} \;\Rightarrow\; x \in \omega_1, \qquad \text{otherwise } x \in \omega_2, \tag{2}
\]

where, as is standard, C_{ij} denotes the cost of deciding class ω_i when the true class is ω_j.

Unfortunately, it is very hard to obtain the non-skin class-conditional probability density over all kinds of background pixels, or the prior probabilities P(ω₁) and P(ω₂). In our experiments, we assume that the non-skin class-conditional probability density is uniform and that the priors P(ω₁) and P(ω₂) are equal. The threshold TH is then chosen to be 0.5.


Fig. 1. (a)–(f) Statistical distributions of human skin colour with different luminance values.


The above decision process does not take the relationship among adjacent pixels into consideration, namely that the neighbours of a skin-colour pixel are more likely to be skin-colour pixels themselves. Therefore, the Bayesian relaxation algorithm in [1] is exploited. The decision is based on the following formula:

\[
\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;\ge\; TH + 8(B + C) - 4\bigl(v_B(s)\,B + v_C(s)\,C\bigr) \;\Rightarrow\; x \in \omega_1, \qquad \text{otherwise } x \in \omega_2, \tag{3}
\]

where v_B(s) is the number of skin pixels that border pixel x horizontally or vertically, and v_C(s) is the number of skin pixels that are diagonal neighbours of pixel x. The cost parameters B and C in the relaxation algorithm are chosen to be 0.25 and 0.125, respectively.
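A minimal sketch of the decision rule of Eq. (1) combined with the relaxation of Eq. (3), assuming a precomputed likelihood-ratio image; the number of relaxation passes is our assumption, as the paper does not state it:

```python
import numpy as np
from scipy.ndimage import convolve

# Cost parameters from the paper; TH = 0.5 with a uniform non-skin model.
TH, B, C = 0.5, 0.25, 0.125

def bayes_relaxation(likelihood_ratio, n_iter=3):
    """Iterative Bayesian relaxation (Eq. (3)): a pixel's threshold is
    lowered when its 4- and 8-neighbours are already labelled skin.
    `likelihood_ratio` holds p(x|w1)/p(x|w2) per pixel."""
    skin = likelihood_ratio >= TH                       # initial decision, Eq. (1)
    k4 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])    # horizontal/vertical
    k8 = np.array([[1, 0, 1], [0, 0, 0], [1, 0, 1]])    # diagonal
    for _ in range(n_iter):
        vB = convolve(skin.astype(float), k4, mode='constant')
        vC = convolve(skin.astype(float), k8, mode='constant')
        thresh = TH + 8 * (B + C) - 4 * (vB * B + vC * C)
        skin = likelihood_ratio >= thresh
    return skin
```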

After Bayesian decision and relaxation, spatial segmentation is used to regularize the face candidatures. The watershed transformation is used to achieve the spatial segmentation, and the face candidatures are then superimposed on top of the spatial segmentation mask to regularize their shapes.


In our study, if at least 80% of a spatially segmented region belongs to the face candidature, the whole region is considered part of the candidature; if 20% or less belongs to it, the whole region is excluded; if the ratio lies between 20% and 80%, no change is made.
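The 80%/20% regularization rule can be sketched as follows; `region_labels` is assumed to be the label map produced by any watershed implementation, and the function name is ours:

```python
import numpy as np

def regularize_with_regions(face_mask, region_labels):
    """Regularize a skin-colour face candidature with a spatial segmentation
    (e.g. watershed) label map: regions covered >= 80% by the candidature are
    absorbed, regions covered <= 20% are discarded, and regions in between
    are left unchanged."""
    out = face_mask.copy()
    for lab in np.unique(region_labels):
        region = region_labels == lab
        ratio = face_mask[region].mean()   # fraction of region in candidature
        if ratio >= 0.8:
            out[region] = True
        elif ratio <= 0.2:
            out[region] = False
    return out
```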

For every candidature, its size and shape are evaluated, assuming that human faces in the video are not too small and that their shape is elliptical or oval [25]. Face candidatures that do not meet these conditions are considered non-face patches.

Fig. 2 demonstrates the face location results using (a) the skin-colour model in [4] and (b) the proposed skin-colour model. Both images were captured under the same bright lighting conditions. The left images in (a) and (b) are the original images containing the face candidature; the right images are the detected face candidatures. The results show that our proposed skin-colour model locates the face more precisely.

Fig. 2. Face localization using different skin-colour models: original image and the detected face candidature using (a) the method in [4]; (b) the proposed method. The left images of (a) and (b) are the original images and the right ones are the localised face candidatures.

Fig. 3 shows the face location results for several sequences under different lighting conditions. Parts (a) and (b) are sequences captured in the lab under controlled lighting conditions. The proposed skin-colour model locates faces correctly under both bright and dark lighting conditions, which cannot be achieved with the skin-colour model in [4]. Several standard video sequences are also used to test the performance of our proposed scheme, such as Miss am, Claire, Akiyo and Carphone. The detection results for the Akiyo and Carphone sequences are shown in parts (c) and (d) of Fig. 3, and more face location results are presented in Section 5.

2.2. Eye and mouth detection

After locating face candidatures, the eyes and mouth should be detected to verify these candidatures. We assume that the detected face is a frontal or near-frontal view, which makes eye and mouth detection easier.


Fig. 3. (a)–(d) Results of the face localization algorithm: the images in the left column are the original images; the images in the right column are the localised face candidatures.


The detection procedure is as follows:

1. Locate eyes and mouth in the face candidatures.

2. Verify the detected eye and mouth candidatures by using both geometry and orientation information.

3. Detect the corners of both eyes and mouth.


2.2.1. Locating eyes and mouth

In our research, colour and luminance are used to locate the positions of the eyes and mouth, based on the observation that high Cb and low Cr values are found around the eyes, and that the eyes contain both dark and bright pixels in the luminance component. The mouth region, likewise, contains red lips and low-luminance pixels located between the upper and lower lips.

The search procedure is as follows, with the search region restricted to the face candidatures:

1. Enhance Cb and Cr by using histogram equalization.

2. Calculate the colour map Map_C = Cb + (255 − Cr), and then enhance it using histogram equalization.

3. Emphasize the dark pixels in the Y component using a morphological dilation operation, and calculate the map Map_Y = dilation(Y) / (erosion(Y) + 0.0001). Then enhance it using histogram equalization.

4. Calculate the map EyeMouthMap = Map_Y + Map_C, and normalize it to brighten both the eyes and the mouth and to suppress other noise.

5. The eye and mouth candidatures are initially estimated by iteratively thresholding EyeMouthMap. In our experiments, the upper bound on the number of eye and mouth candidatures is 6. (A code sketch of steps 1–4 follows this list.)
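A sketch of steps 1–4 above in Python with OpenCV; the structuring-element size and the use of saturating 8-bit arithmetic are our assumptions, as the paper does not specify them:

```python
import cv2
import numpy as np

def eye_mouth_map(y, cb, cr, ksize=5):
    """Build the combined eye/mouth map from steps 1-4 above.
    `y`, `cb`, `cr` are uint8 planes restricted to a face candidature."""
    cb_eq, cr_eq = cv2.equalizeHist(cb), cv2.equalizeHist(cr)           # step 1
    map_c = cv2.equalizeHist(cv2.add(cb_eq, 255 - cr_eq))               # step 2
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))
    dil = cv2.dilate(y, kernel).astype(np.float32)                      # step 3
    ero = cv2.erode(y, kernel).astype(np.float32)
    map_y = dil / (ero + 1e-4)
    map_y = cv2.equalizeHist(
        cv2.normalize(map_y, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8))
    emm = cv2.add(map_y, map_c)                                         # step 4
    return cv2.normalize(emm, None, 0, 255, cv2.NORM_MINMAX)
```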

Fig. 4 shows the results of the above procedure for the Carphone sequence, and the final location of the eyes and mouth. In Fig. 4, (a) is the luminance component; (b) and (c) are the enhanced Cb and Cr; (d) and (e) are the calculated maps Map_Y and Map_C; (f) is the calculated EyeMouthMap, which is used to locate the positions of the eye–mouth candidatures in (g). Experimental tests show that our method has two advantages over that in [9]. First, it requires less computation. Second, it is more robust to different lighting conditions, under which the lip colour can be faint or similar to the surrounding skin.

Fig. 4. (a)–(g) Illustration of facial feature detection for the Carphone sequence.


2.2.2. Verifying eyes and mouth pairs

For the detected eye and mouth candidatures in Fig. 4, there are 6 eye or mouth candidatures and therefore, theoretically, $\binom{6}{3} = 20$ possible eye–mouth combinations. However, geometry and orientation information can be used to reduce this number. In this paper, symmetry-based cost functions for eye and mouth localization are proposed to verify the eye–mouth pairs. Fig. 5 illustrates the geometry and orientation relations among the face, eyes and mouth. The cost functions are designed to exploit the following criteria, which hold for frontal or near-frontal views of faces.

1. The face is upright and the eye pair should be located in the upper half of the face (above the minor axis of the fitted ellipse). This reduces the number of eye–mouth pairs from 20 to 9 for the face candidature in Fig. 4.

2. For every face candidature, the direction θ₂ of the major axis of the fitted ellipse should be almost the same as the direction θ₁ of the vector from the midpoint of the two eyes to the mouth. If the difference between θ₁ and θ₂ is less than a threshold, it is a face; otherwise, it is not.

3. The vector that is perpendicular to the interocular segment and passes through the midpoint of the two eyes should pass through the mouth candidature.

Fig. 5. Face and facial feature geometry and orientation.

4. The line passing through the two mouth corners should be almost parallel to the line passing through the two eyes; that is, θ₃ and θ₄ in Fig. 5 should be the same.

Experimental results show that eye–mouth pairs can be detected and verified correctly based on the above four criteria, and the computational complexity is very small, as the sketch below illustrates.
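As an illustration only, criteria 2 and 3 might be checked as follows; the angular and offset tolerances (`max_dev`, the 0.25 factor) are hypothetical, since the paper does not give its thresholds:

```python
import numpy as np

def plausible_eye_mouth(eye_l, eye_r, mouth, ellipse_angle, max_dev=0.3):
    """Check criteria 2 and 3 above for one eye-mouth triple.
    Points are (x, y) pairs; `ellipse_angle` is the direction (radians)
    of the major axis of the ellipse fitted to the face candidature."""
    eye_l, eye_r, mouth = map(np.asarray, (eye_l, eye_r, mouth))
    mid = (eye_l + eye_r) / 2.0
    v = mouth - mid
    theta1 = np.arctan2(v[1], v[0])                    # mid-eyes -> mouth
    # criterion 2: theta1 close to the major-axis direction theta2
    d = np.abs(np.angle(np.exp(1j * (theta1 - ellipse_angle))))
    if d > max_dev:
        return False
    # criterion 3: the perpendicular bisector of the interocular segment
    # should pass (near) the mouth candidature
    inter = eye_r - eye_l
    offset = np.dot(v, inter) / (np.linalg.norm(inter) + 1e-9)
    return abs(offset) < 0.25 * np.linalg.norm(inter)
```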

2.2.3. Detecting the corners of eyes and mouth

After detecting the positions of the eyes and mouth, their corners should also be detected in order to build the scalable face model. For eye corner detection, two methods are proposed for different face sizes. If the face size is small, a method based on the morphological open-by-reconstruction filter (MORF) and thresholding can be used. If the face size is large, a deformable template matching algorithm is used to detect the eye corners with high accuracy, but also with high computational complexity.

For detecting the eye corners in a small face, the procedure consists of the following steps (a sketch of the column scan follows the list):

* MORFs are applied to the eye patch, followed by thresholding, in order to obtain a binary map.

* Scanning the columns of this patch from left to right, the first and the last columns containing non-zero elements are chosen as the columns on which the eye corners are located. The centre of the eye can then be estimated from the eye corners. This scan step is illustrated in Fig. 6.

* The upper and lower eyelid points can be estimated based on the fact that the line joining these two points is perpendicular to the line joining the two eye corners.
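A minimal sketch of the column scan on the thresholded MORF output, assuming foreground (eye) pixels are non-zero; taking the corner rows as the mean row of the extreme columns is our reading of the scan step in Fig. 6:

```python
import numpy as np

def eye_corners_from_binary(eye_patch):
    """Scan a thresholded MORF eye patch column by column: the first and
    last columns containing foreground pixels give the two eye corners."""
    cols = np.where(eye_patch.any(axis=0))[0]
    if cols.size == 0:
        return None
    x_left, x_right = cols[0], cols[-1]
    y_left = int(np.mean(np.where(eye_patch[:, x_left])[0]))
    y_right = int(np.mean(np.where(eye_patch[:, x_right])[0]))
    return (x_left, y_left), (x_right, y_right)
```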

Fig. 6. Scan procedure for eye corner detection.


For larger eyes, it is not easy to estimate the eye corners precisely using the above method, so the deformable template matching algorithm is used. The edge and valley energies used to adjust the template are defined as in [32]; interested readers are referred to [32] for the detailed procedure.

For mouth corner detection, the SUSAN corner detector [23] and a deformable template matching algorithm are used to detect the four mouth corners. However, several modifications are made to improve its speed:

1. The SUSAN corner detector is used to detect the right and left mouth corner candidatures, which also reduces the search region for deformable template matching. Based on the mouth candidature, one patch of size [m, n] is selected, where m and n are determined by the face size. The SUSAN corner detector keeps the mouth corner candidatures whose values are larger than a predefined threshold.

2. The lip colour distribution is used to further reduce the search positions of the deformable template. The colour distribution inside the mouth is modelled as a Gaussian mixture with three components: the dark aperture, pink lips, and bright reflection of light from the teeth or lips. The parabolas of the upper and lower lips should include as many pink-lip pixels as possible within the template.

Fig. 7 shows the eye and mouth detection results for the Akiyo and Carphone sequences. It shows that the eyes and mouth can be detected precisely using the proposed methods.

Fig. 7. Eyes and mouth detection for (a) the Akiyo and (b) the Carphone sequence.

2.3. Chin detection

Several methods have been published to estimate the chin contour [10,20]. In [20], an active contour model was used to detect the chin contour. An active contour model is an energy-minimizing spline influenced by external forces and image features. Its performance is sensitive to facial textures, such as pockmarks and moustaches, and to weak contrast between the chin and the neck below it. Furthermore, its performance depends heavily on the initial position and the chosen external force. In [10], a deformable template matching method was used to estimate the chin contour: two parabolas represent the chin shape, and a cost function is minimized to find the best fit of the template to the chin. However, not all human chins have parabola-like shapes; for example, the method in [10] cannot detect the chin in Carphone correctly. Experiments show that it is not a robust detection method.

In our research, a reliable chin detection method is proposed to overcome the disadvantages of the above methods. In this method, the deformable template matching algorithm and the active contour model are exploited sequentially: the detection result of the deformable template matching algorithm is used as the initialization of the active snake model. Interested readers are referred to [30] for a detailed description of the proposed method.

In this method, edge energies are chosen as the external forces. First, edges are detected using the Canny edge detector [3]. The low threshold used in the hysteresis step of the Canny edge detector is set to 0 in order to detect weak edges, and short edges are removed in order to reduce the effect of noise in the face area, yielding a binary edge map, BinaryEdgeMap.


The gradient vector flow (GVF) of BinaryEdgeMap is then calculated and used as the external force [30]; it yields a smoother field than the gradient map used in [5]. A sketch of the edge-map construction is given below.
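A sketch of this edge-map construction; the `high` threshold and `min_len` are illustrative values, and connected-component area is used here as a proxy for edge length:

```python
import cv2
import numpy as np

def chin_edge_map(gray, high=80, min_len=15):
    """Binary edge map for the snake's external force: Canny with the low
    hysteresis threshold set to 0 to keep weak (chin) edges, followed by
    removal of short edge fragments to suppress noise."""
    edges = cv2.Canny(gray, 0, high)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(edges, connectivity=8)
    keep = np.zeros_like(edges)
    for i in range(1, n):                      # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_len:
            keep[labels == i] = 255
    return keep
```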

Fig. 8(a) shows the chin detection results for the Akiyo and Carphone sequences using deformable template matching, and (b) shows the results of the proposed method, which uses (a) as the initialization of the active snake. The experimental results show that, for Akiyo, the contour detected by the deformable template matching algorithm matches the real chin very well; for Carphone, however, the performance is less satisfactory. In the detection results of our proposed algorithm, the Akiyo chin contour is almost unchanged, while for the Carphone sequence the chin position improves greatly.

Fig. 8. Chin detection results for Carphone and Akiyo using: (a) deformable template matching; and (b) the proposed method.

This shows that the proposed method can detect chins with different face shapes and weak chin edges. More results are demonstrated in Section 5.

3. 2-D scalable face model design

In our proposed scalable model-based video coding scheme, the human face is considered a special object and is modelled separately from other video objects, since it undergoes both rigid and non-rigid motion and its motion description is complex. Furthermore, the human face attracts the most attention in videophone applications, where even a small error can be annoying. Much research has been conducted on facial feature motion analysis and description [28]; this a priori information can be used for our scalable model design.


Fig. 9. Scalable face model design.


In our research, a heuristic scalable face model is constructed based on the muscle distribution of human faces in [25]. During scalable model design, the more important nodes in the lower levels are allocated to the facial features and to the intersection points between different muscles, in order to represent the facial motion more precisely and to reduce the estimation and warping errors during video coding.

In our research, a three-level scalable face model is designed, as shown in Fig. 9. The design process is as follows:

* First, eight nodes and four nodes are allocated to the eyes and mouth, respectively, to represent their movement. In addition, five nodes are used to approximate the contour of the chin. Points 13 and 17 are found by extending the line interconnecting the right and left mouth corners towards the borders of the face segment. Point 15 is the intersection point between the detected chin and the line formed by interconnecting the upper and lower mouth corners (points 10 and 12). These points are very important for representing the movement of the face and are included in the lowest level, level 1 (points 1 to 17 in Fig. 9(a)).

* For level 2, shown in Fig. 9(b), six additional points (18 to 23) are introduced to represent the movement of the eyebrows and nose; they are useful for head motion estimation. If the face size is small, points 22 and 23 are merged into one point located midway between their positions.

* For level 3, 8 additional points are allocated, mainly based on the face muscle distribution [28]. The scalable model of level 3 is shown in Fig. 9(c). Points PA and PB are only auxiliary points, which carry no additional information; they are found by extending the lines interconnecting the predetermined eye corner points towards the borders of the face segment. Points 25 and 26 are located at the middle of the line segments joining the mouth corners to points PA and PB. The locations of the other points are shown in Fig. 9(c).

This heuristic model is based on the face muscle distributions and is used to represent the face motion. A more complex model can be designed if more complex facial expressions need to be encoded.

4. Motion estimation scheme for model evaluation

In our research, a hierarchical motion estimation scheme is proposed to estimate the motion vectors of the control points of the designed object model. As some control points are allocated at positions with smooth texture, block matching motion estimation methods cannot work efficiently and accurately there. The proposed method consists of the following steps: forward/backward motion estimation, reliability evaluation, and MV prediction and refinement.

First, a number of points, which are distinct from those used to represent the object model and have good features to track [22], are allocated in the interior of the object. Then, both the forward and backward motion vectors of these points between frame I(x, t−1) and frame I(x, t) are estimated using the Shi–Tomasi feature tracking algorithm [22].


That is, the forward motion vector of the ith node moves its location V_i in frame t−1 to location V′_i in frame t; the backward motion vector at location V′_i in frame t then maps back to V″_i in frame t−1.

Next, the motion "reliability" of each point is estimated from both the forward and backward motion vectors, using the following formula:

\[
Re = \exp\!\left( -\frac{\lVert V_i - V''_i \rVert^2}{2\sigma_m^2} \right), \tag{4}
\]

where σ_m is a free parameter. Eq. (4) shows that the smaller the difference between V_i and V″_i, the more reliable the motion vector of the ith node. Nodes whose reliability is below a threshold (0.3 in our experiments) are not considered during contour prediction; a sketch of this consistency check follows.

After estimating the motion vectors of these points, the motion vectors of the control points representing the object model are predicted from their m nearest surrounding motion vectors; m is chosen as 6 in our experiments. The weighted least squares (WLS) estimation in [18] is used to determine the affine motion parameters for the control points, with each motion vector weighted according to its reliability, as sketched below.
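A sketch of the WLS prediction step under the common affine parameterization [u v] = [x y 1]A; the paper's exact formulation follows [18], so this is an illustration rather than the authors' implementation:

```python
import numpy as np

def wls_affine_mv(node, pts, mvs, weights):
    """Predict a control point's motion vector from its m nearest tracked
    points via weighted least squares: fit an affine model [u v] = [x y 1] A
    to the neighbours' motion vectors, weighting each by its reliability,
    then evaluate the model at the control point."""
    X = np.hstack([pts, np.ones((len(pts), 1))])       # (m, 3) design matrix
    W = np.diag(weights)
    # normal equations of the weighted least-squares fit
    A = np.linalg.solve(X.T @ W @ X, X.T @ W @ mvs)    # (3, 2) affine params
    return np.array([node[0], node[1], 1.0]) @ A       # predicted (u, v)
```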

The estimated MVs of the object model are then refined to half-pixel resolution to further lower the warping error. During refinement, the hexagonal matching algorithm in [14] is exploited; it is efficient for mesh-based motion estimation and preserves the mesh structure during estimation.

5. Experimental results

5.1. Results for facial feature detection

The objective of our proposed method is to detect frontal or near-frontal views of faces under varying lighting conditions so that the scalable face model can be designed automatically; profile views of faces are therefore not considered in our experiments. Some results demonstrating the performance of our proposed method were given in Section 2. More images and image sequences are used in this section to test the performance of the proposed face and facial feature detection methods. The faces cover several racial groups and varying lighting conditions. Some face and facial feature detection results are illustrated in Fig. 10: the first column shows the original images; the second column gives the detected face patches; the third and fourth columns show the detected eye, mouth and chin components. The results show that the proposed algorithms can detect the facial features correctly, irrespective of whether the face is under strong or uneven lighting conditions.

5.2. Results for scalable face model design and evaluation

Four head–shoulder sequences (QCIF) are used to test how well the designed scalable face model represents face motion through a video sequence: Carphone, Akiyo, Claire and Miss am.

Before designing the scalable model of the foreground head–shoulder object, we first segment the object into a face object and a human body object (including the hair). For the human body part, the scalable model is designed using our published method in [8], where four levels are chosen. For the face object, the method proposed above is used to design the scalable face model, where only 3 levels are used. The two models are combined to achieve a four-level representation of the foreground head–shoulder object.

Figs. 11 and 12 demonstrate the designed scalable models of the head–shoulder objects for the Carphone and Akiyo sequences. Table 1 lists the number of control points in each level for the Carphone, Akiyo, Claire and Miss am sequences.

In order to test the performance of the designed scalable face model, for every video sequence four frames (frames 2, 4, 6 and 8) are warped from frame 0 based on the designed scalable model and the estimated motion vectors (MVs) of the control points, which are estimated using the method in Section 4 at half-pixel resolution.


Fig. 10. Face and facial feature detection results for faces with different skin colours.


Fig. 11. Scalable object models (four levels) for the Carphone sequence; the numbers of points are 63, 37, 15 and 35 for (a) level 1, (b) level 2, (c) level 3 and (d) level 4, respectively.


Then, the average PSNR values for every level are calculated from the four warped frames (frames 2, 4, 6 and 8) and their corresponding original frames. During the PSNR calculation, only the intersection of the warped and original VOP alpha-plane regions is considered, as in the sketch below.
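The masked PSNR computation can be sketched as follows (boolean alpha planes and 8-bit frames assumed):

```python
import numpy as np

def masked_psnr(warped, original, alpha_warped, alpha_original):
    """PSNR computed only over the intersection of the warped and original
    VOP alpha planes, as used for the values in Table 2."""
    mask = alpha_warped & alpha_original
    diff = warped[mask].astype(np.float64) - original[mask].astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse) if mse > 0 else float('inf')
```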

Table 2 lists the PSNR values for the Carphone, Claire, Miss am and Akiyo sequences. Compared with the results in [27] for the Akiyo sequence, the proposed method achieves improvements of about 2–5 dB. This shows that the designed scalable face models can represent the object motion more precisely than existing published methods.

During our experiments, we found that as the index of the test frame increases, the PSNR value of the warped frame drops. Some PSNR values of warped frames are very low because the face turns aside (to a profile view) in those frames and becomes self-occluded; in this case, a model adaptation step should be included in order to achieve scalable model-based video coding with good performance. Another cause of the PSNR drop is newly appearing texture patches. In conclusion, our proposed methods can design the scalable face model automatically and efficiently; further research, such as model adaptation, is needed to achieve scalable 2-D model-based video coding.

6. Conclusions and future work

In this paper, we have studied facial feature detection and scalable face model design for achieving 2-D scalable model-based video coding. First, a face skin-colour model was proposed that is robust to different lighting conditions, together with a reliable and efficient face localization and facial feature detection scheme. These methods achieve precise and reliable eye, mouth and chin detection.


Fig. 12. Scalable object models (four levels) for the Akiyo sequence; the numbers of points are 54, 35, 20 and 31 for (a) level 1, (b) level 2, (c) level 3 and (d) level 4, respectively.

Table 1
The number of control points for different levels

          Carphone   Akiyo   Miss am   Claire
Level 1      63        54       42       40
Level 2      37        35       30       30
Level 3      15        20       18       12
Level 4      35        31       30       28

Table 2
Average warping PSNR values (dB) of four QCIF sequences for different levels of representation

                                Level 1   Level 2   Level 3   Level 4
Claire (QCIF)                    31.46     34.27     36.18     39.02
Miss am (QCIF)                   33.14     36.57     38.09     40.23
Carphone (QCIF)                  32.82     34.39     35.06     35.78
Akiyo (QCIF)                     31.25     32.91     34.56     35.91
Akiyo (QCIF), method in [27]     28.64     29.27     29.83     30.56


Furthermore, a heuristic scalable face model was designed based on the face muscular distributions and the detected facial features. In order to evaluate the designed scalable face model, a novel motion estimation scheme was proposed that can estimate the model motion precisely even though some points are allocated in textureless regions, as on the human face. Experimental results show that this scalable model can represent the face motion more precisely than previously published techniques.

Future work will focus on applying more robust methods, such as the support vector machine (SVM) and the active shape model (ASM), to verify the eye and mouth pairs. It will also address object model adaptation in video sequences and scalable texture coding, to achieve scalable model-based video coding for whole sequences.



Acknowledgements

The authors would like to thank the reviewersfor their valuable comments.

References

[1] T. Aach, A. Kaup, Statistical model-based change detection in moving video, Signal Process. 31 (1993) 165–180.

[2] Y. Altunbasak, A.M. Tekalp, Occlusion-adaptive, content-based mesh design and forward tracking, IEEE Trans. Image Process. 6 (9) (1997) 1270–1280.

[3] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8 (6) (1986) 679–698.

[4] D. Chai, K.N. Ngan, Face segmentation using skin-color map in videophone applications, IEEE Trans. Circ. Syst. Video Technol. 9 (4) (1999) 551–564.

[5] J.Y. Deng, F. Lai, Region-based template deformation and masking for eye feature extraction and description, Pattern Recogn. 30 (3) (1997) 403–419.

[6] P. Eisert, T. Wiegand, B. Girod, Model-aided coding: a new approach to incorporate facial animation into motion-compensated video coding, IEEE Trans. Circ. Syst. Video Technol. 10 (3) (2000) 344–358.

[7] M. Hu, A new compression scheme for scalable video transmission, Technical Report, CCSR, University of Surrey, October 2002.

[8] M. Hu, S. Worrall, A.H. Sadka, A.M. Kondoz, Model design for scalable 2-dimensional model-based video coding, IEE Electron. Lett. 38 (24) (2002) 1513–1515.

[9] R.-L. Hsu, M. Abdel-Mottaleb, A.K. Jain, Face detection in color images, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 696–706.

[10] M. Kampmann, Estimation of the chin and cheek contours for precise face model adaptation, in: Proceedings of ICIP'97, 1997, pp. 300–304.

[11] C.J. Kuo, R.-S. Huang, T.-G. Lin, 3-D facial model estimation from single front-view facial image, IEEE Trans. Circ. Syst. Video Technol. 12 (3) (2002) 183–192.

[12] D. Maio, D. Maltoni, Real-time face location on gray-scale static images, Pattern Recogn. 33 (9) (2000) 1525–1539.

[13] A.M. Tekalp, Y. Altunbasak, G. Bozdagi, Two- versus three-dimensional object-based video compression, IEEE Trans. Circ. Syst. Video Technol. 7 (2) (1997) 391–397.

[14] Y. Nakaya, H. Harashima, Motion compensation based on spatial transformations, IEEE Trans. Circ. Syst. Video Technol. 4 (3) (1994) 339–356.

[15] J.-R. Ohm, B. Makai, Feature-similarity retrieval of face images based on 2-D model description, in: Proceedings of WIAMIS'99, Berlin, May 1999.

[16] D.E. Pearson, Developments in model-based video coding, Proc. IEEE 83 (6) (1995) 892–906.

[17] S.L. Phung, ECU face detection database, Edith Cowan University, School of Engineering and Mathematics, 2002. Available at: http://www.soem.ecu.edu.au/~sphung/face_detection/database/.

[18] P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, Wiley, New York, 1987.

[19] H.A. Rowley, S. Baluja, T. Kanade, Neural network-based face detection, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1) (1998) 23–38.

[20] R.L. Rudianto, K.N. Ngan, Automatic 3-D wireframe model fitting to frontal facial image in model-based video coding, in: Picture Coding Symposium (PCS'96), 1996, pp. 585–588.

[21] E. Saber, A.M. Tekalp, Frontal-view face detection and facial feature extraction using colour, shape and symmetry based cost functions, Pattern Recogn. Lett. 19 (8) (1998) 669–680.

[22] J. Shi, C. Tomasi, Good features to track, in: Proceedings of IEEE CVPR, 1994, pp. 593–600.

[23] S.M. Smith, J.M. Brady, SUSAN—a new approach to low level image processing, Int. J. Comput. Vis. 23 (1) (1997) 45–78.

[24] K. Sobottka, I. Pitas, Face localization and facial feature extraction based on shape and colour information, in: Proceedings of ICIP-96, 1996, pp. 483–486.

[25] K. Sobottka, I. Pitas, A novel method for automatic face segmentation, facial feature extraction and tracking, Signal Process.: Image Commun. 12 (2) (1998) 263–281.

[26] K. Sung, T. Poggio, Example-based learning for view-based human face detection, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1) (1998) 39–51.

[27] P. van Beek, A.M. Tekalp, N. Zhuang, I. Celasun, M. Xia, Hierarchical 2-D mesh representation, tracking, and compression for object-based video, IEEE Trans. Circ. Syst. Video Technol. 9 (2) (1999) 353–369.

[28] K. Waters, A muscle model for animating three-dimensional facial expression, Comput. Graph. 21 (4) (1987) 17–24.

[29] K.-W. Wong, K.-M. Lam, W.-C. Siu, A robust scheme for live detection of human faces in colour images, Signal Process.: Image Commun. 18 (2) (2003) 103–114.

[30] C. Xu, J.L. Prince, Gradient vector flow: a new external force for snakes, in: Proceedings of CVPR'97, 1997, pp. 66–71.

[31] M.-H. Yang, D. Kriegman, N. Ahuja, Detecting faces in images: a survey, IEEE Trans. Pattern Anal. Mach. Intell. 24 (1) (2002) 34–58.

[32] A.L. Yuille, P.W. Hallinan, D.S. Cohen, Feature extraction from faces using deformable templates, Int. J. Comput. Vis. 8 (2) (1992) 99–111.

