
Patrona, F., Iosifidis, A., Tefas, A., Nikolaidis, N., & Pitas, I. (2016). Visual Voice Activity Detection in the Wild. IEEE Transactions on Multimedia, 18(6), 967-977. https://doi.org/10.1109/TMM.2016.2535357

Peer reviewed version

Link to published version (if available): 10.1109/TMM.2016.2535357

Link to publication record in Explore Bristol Research
PDF-document

This is the accepted author manuscript (AAM). The final published version (version of record) is available online via the Institute of Electrical and Electronics Engineers at http://dx.doi.org/10.1109/TMM.2016.2535357. Please refer to any applicable terms of use of the publisher.

University of Bristol - Explore Bristol Research
General rights

This document is made available in accordance with publisher policies. Please cite only the published version using the reference above. Full terms of use are available: http://www.bristol.ac.uk/red/research-policy/pure/user-guides/ebr-terms/


Visual Voice Activity Detection in the Wild

Foteini Patrona∗, Alexandros Iosifidis†, Anastasios Tefas∗, Nikolaos Nikolaidis∗, and Ioannis Pitas∗

∗Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece. {tefas,nikolaid,pitas}@aiia.csd.auth.gr

†Department of Signal Processing, Tampere University of Technology, Tampere, Finland. {aiosif}@aiia.csd.auth.gr

Abstract—The Visual Voice Activity Detection (V-VAD) problem in unconstrained environments is investigated in this paper. A novel method for V-VAD in the wild is proposed, exploiting local shape and motion information appearing at spatiotemporal locations of interest for facial video description and the Bag of Words (BoW) model for facial video representation. Facial video classification is subsequently performed using state-of-the-art classification algorithms. Experimental results on a publicly available V-VAD data set demonstrate the effectiveness of the proposed method, since it achieves better generalization performance on unseen users when compared with recently proposed state-of-the-art methods. Additional results on a new, unconstrained data set provide evidence that the proposed method can be effective even in cases where other existing methods fail.

Index Terms—Voice Activity Detection in the wild, Space-Time Interest Points, Bag of Words model, kernel Extreme Learning Machine, Action Recognition

I. INTRODUCTION

The task of identifying silent (vocally inactive) and non-silent (vocally active) periods in speech, called Voice Activity Detection (VAD), has been widely studied for many decades using audio signals. In the last two decades, though, considerable attention has been paid to the use of visual information, mainly as an aid to traditional Audio-only Voice Activity Detection (A-VAD), due to the fact that, contrary to audio, visual information is insensitive to environmental noise and can, thus, be of help to A-VAD methods for speech enhancement and recognition [1], speaker detection [2], segregation [3] and identification [4], as well as speech source separation [5], [6] in noisy and reverberant conditions or in Human Computer Interfaces (HCIs).

All V-VAD methods proposed in the literature until now make several assumptions concerning the visual data recording conditions, which in their vast majority are rather constraining. In brief, the available data sets used for evaluating the performance of such methods are recorded indoors, under fully constrained conditions, e.g., using preset static illumination, a simple background and no or negligible background noise produced by humans speaking or by other sound sources. Moreover, no or only slight speaker movements are encountered and the recording setting is calibrated so that the entire speaker face as well as the mouth are always fully visible from a camera positioned right in front of the speaker, so that special features describing their shape and/or motion can be calculated. That is, the human face should have a frontal orientation with respect to the capturing camera and the facial Region Of Interest (ROI) should have adequate resolution (in pixels). Such a scenario restricts the applications where V-VAD methods can be exploited. For example, in movie (post-)production, the persons/actors are free to move and their facial pose may change over time, as is also the case in all the places where audio-visual surveillance would be of interest. Most V-VAD methods proposed in the literature would probably fail in such an application scenario. Last but not least, most currently existing methods focus on the accurate detection of the visually silent intervals in a video sequence, which in general is not as challenging as the accurate detection of the visually speaking intervals, due to the fact that the latter can easily be confused with intervals of laughter, mastication or other facial activities. The aforementioned difficulty of distinguishing especially between laughter and speech is highlighted in [7], where a method exploiting both audio and visual information and aiming at an effective discrimination is presented.

Non-invasive V-VAD, where the persons under investigation are free to change their orientation and their distance from the capturing camera, is within the scope of this paper. Inspired by related research in human action recognition [8], [9], [10], this unconstrained V-VAD problem will subsequently be referred to as V-VAD in the wild. While human action recognition in the wild has been extensively studied in the last decade and numerous methods addressing this problem have been proposed, V-VAD in the unconstrained case has not been addressed yet. In this paper, a method aimed at dealing with the problem of V-VAD in the wild is proposed, with the only prerequisite assumption being that the faces appearing in the videos being processed can be automatically detected using a face detection algorithm and tracked for a number of consecutive frames.

The proposed method is formed by three processing steps. In the first step, a face detection technique [11] is applied to a video frame, in order to determine the facial Region of Interest (ROI), which is subsequently tracked over time [12], in order to create a facial ROI trajectory of the person under investigation. Such videos are referred to as facial moving regions hereafter. In the second step, local shape and motion information appearing in spatiotemporal video locations of interest is exploited for the facial moving region video representation. To this end, two facial moving region representation approaches are evaluated: a) Histogram of Oriented Gradient (HOG) and Histogram of Optical Flow (HOF) descriptors calculated on Space Time Interest Point (STIP) video locations [8], and b) HOG, HOF and Motion Boundary Histogram (MBHx, MBHy) descriptors calculated on the trajectories of video frame interest points that are tracked for a number of L consecutive frames [9]. Both facial moving region descriptors are combined with the Bag of Words (BoW) model [13], [14], in order to determine facial moving region video representations.

Finally, facial moving region video classification into visually silent and visually speaking ones is performed employing a Single Hidden Layer Feedforward Neural (SLFN) network, trained by applying the recently proposed kernel Extreme Learning Machine (kELM) classifier [15], [16]. In cases where videos not depicting facial images may be encountered, a facial moving region verification step is introduced before this step, performing facial moving region versus non-facial moving region video classification, in order to ensure that only facial moving region videos are going to be classified as visually silent or non-silent. The proposed approach is evaluated on a publicly available V-VAD data set, namely CUAVE [17], where it is shown to outperform recently proposed V-VAD methods to a large extent. In addition, a new V-VAD data set was created, extracted from full-length movies, in order to evaluate the performance of the proposed approach in a case of V-VAD in the wild. Experimental results show that the proposed approach can operate reasonably well in cases where other V-VAD methods fail.

The remainder of this paper is organized as follows. Section II discusses previous work on V-VAD. The proposed V-VAD approach is described in Section III. The data sets used in our experiments and the respective experimental results are presented in Section IV. Finally, conclusions are drawn in Section V.

II. PREVIOUS WORK

V-VAD methods proposed in the literature can be roughly divided into model-based and model-free ones. Model-based methods require a training process, where positive and negative paradigms are employed for model learning. In model-free methods, no direct training is performed, thus circumventing the need for a priori knowledge of the data classes at the decision stage. Moreover, either visual-only or audiovisual data features can be exploited. In the latter case, combination of the audio and video modalities can be achieved in two different ways, either by combining the audio and visual features (feature/early fusion) or by performing A-VAD and V-VAD independently and fusing the obtained classification results (decision/late fusion) [18].

Model-free V-VAD methods usually rely solely on combinations of speaker-specific static and dynamic visual data parameters, like lip contour geometry and motion [19], or inner lip height and width trajectories [20], that are compared to appropriate thresholds for decision making. Emphasis is given to dynamic parameters due to the fact that identical lip shapes can be encountered both in silent and non-silent frames, making static features untrustworthy. In both these approaches, there is no discrimination between speech and non-speech acoustic events, which are thus handled as non-silent sections. Another model-free approach is proposed in [21], where signal detection algorithms are applied on mouth region pixel intensities along with their variations, in order to discriminate between speech and non-speech frames.

Concerning model-based V-VAD, features like lip opening, rounding and labio-dental touch (a binary feature indicating whether the lower lip is touching the upper teeth) for lip configuration, followed by motion detection and SVM classification, are proposed in [22], in an attempt to distinguish between moving and non-moving lips and then between lip motion originating either from speech or from other face/mouth activities, e.g., from facial expressions or mastication [19], [20]. Such a VAD system can constitute the first stage of a Visual Speech Recognition (VSR) system. The discriminative power of static and dynamic visual features in V-VAD is investigated in [23], where the predominance of dynamic ones is highlighted. The same approach is also adopted in [24], where facial profile as well as frontal views are used. Though not providing as much useful information as the frontal ones, facial profile views are proven to be useful in VAD. A greedy snake algorithm exploiting rotational template matching, shape energy constraints and area energy for lip extraction, avoiding common problems resulting from head rotation, low image resolution and active contour mismatches, is introduced in [25], where AdaBoost is used for classifier training. AdaBoost is also used in [5] for the V-VAD classifier training of a system performing Blind Source Separation (BSS) based on interference removal, after the extraction of lip region geometric features. Finally, HMMs are used in [26] to model the variation of the optical flow vectors from a speaker's mouth region during non-speech periods of mouth activity.

An early-fusion model-based AV-VAD approach is introduced in [27]. 2D Discrete Cosine Transform (2D-DCT) features are extracted from the visual signal and a pair of GMMs is used for classification of the feature vector. V-VAD accuracy is quite high in the speaker-dependent case. However, it dramatically decreases in the speaker-independent experiments, conducted on a simplistic data set called GRID [28]. Color information is used in the V-VAD subsystem proposed in [29] for skin and lip detection, followed by video-based HMMs aiming to distinguish speech from silence, while lip optical flow provided as input to SVMs is employed in [6] for utilization of the visual information, subsequently combined with audio information for multispeaker mid-fusion AV-VAD and Sound Source Localization (SSL).

III. PROPOSED V-VAD METHOD

The proposed method operates on grayscale facial moving regions. Face detection and tracking [11], [12] techniques are used to find such regions in a video. After determining the facial Regions of Interest (ROIs) in each facial video sequence, we find the union R = ∪_{k=1}^{K} R_k of all ROIs R_k within this video sequence. Then, we use this new ROI R for positioning the face in each video frame and we resize it to a fixed size of H × W pixels, in order to produce the so-called facial video segments. Subsequently, we apply the proposed V-VAD method. In this Section, we describe each step of the proposed V-VAD method in detail.
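The ROI-union and resizing step can be sketched as follows; this is an illustrative reading of the description above, assuming OpenCV-style frames, per-frame ROIs given as (x, y, w, h) tuples, and an arbitrary output size, none of which are specified by the authors' implementation:

```python
import cv2
import numpy as np

def union_roi(rois):
    """Bounding box R of all per-frame face ROIs R_k, each given as (x, y, w, h)."""
    x1 = min(x for x, y, w, h in rois)
    y1 = min(y for x, y, w, h in rois)
    x2 = max(x + w for x, y, w, h in rois)
    y2 = max(y + h for x, y, w, h in rois)
    return x1, y1, x2 - x1, y2 - y1

def facial_video_segment(frames, rois, out_size=(128, 128)):
    """Crop every frame with the union ROI and resize to a fixed size (W, H)."""
    x, y, w, h = union_roi(rois)
    segment = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # the method operates on grayscale regions
        crop = gray[y:y + h, x:x + w]
        segment.append(cv2.resize(crop, out_size))      # cv2.resize expects (width, height)
    return np.stack(segment)
```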


A. STIP-based facial video representation

Let U be an annotated facial video segment database containing N facial videos, which are automatically preprocessed in order to determine the relevant set of STIPs. In this paper, the Harris3D detector [30], which is a spatiotemporal extension of the Harris detector [31], is employed in order to detect spatiotemporal video locations where the image intensity values undergo significant spatiotemporal changes. After STIP localization, each facial video is described in terms of local shape and motion by a set of HOG/HOF descriptors (concatenations of L2-normalized HOG and HOF descriptors) p_ij, i = 1, ..., N, j = 1, ..., N_i, where i refers to the facial video index and j indicates the STIP index detected in facial video i. In the conducted experiments, the publicly available implementation in [32] has been used for the calculation of the HOG/HOF descriptors. An example of STIP locations on facial videos is illustrated in Figure 1. The descriptors p_ij, i = 1, ..., N, j = 1, ..., N_i are clustered by applying K-Means [33] and the cluster centers v_k, k = 1, ..., K form the so-called codebook, i.e., V = {v_1, ..., v_K}. The descriptors p_ij, j = 1, ..., N_i are subsequently quantized using V and l1-normalized in order to determine the BoW-based video representation of facial video i, s_i ∈ R^K. The vectors s_i are referred to as facial motion vectors hereafter.

Fig. 1. Examples of detected STIPs on facial videos.
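As an illustration of the BoW step described above, the following sketch learns a K-word codebook with K-Means and produces the l1-normalized facial motion vector of one video; it assumes the HOG/HOF descriptors have already been extracted (e.g., by the tool of [32]) and uses scikit-learn for clustering, which is not necessarily what the authors used:

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(all_descriptors, K=200, seed=0):
    """Cluster the pooled HOG/HOF descriptors of all training videos into K words."""
    return KMeans(n_clusters=K, random_state=seed, n_init=10).fit(all_descriptors)

def bow_histogram(video_descriptors, codebook):
    """Hard-assign each descriptor to its nearest word and l1-normalize the counts."""
    K = codebook.n_clusters
    words = codebook.predict(video_descriptors)           # nearest cluster center per STIP
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / max(hist.sum(), 1.0)                    # facial motion vector s_i in R^K
```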

B. Dense Trajectory-based facial video representation

In the Dense Trajectory-based facial video segment description [9], video frame interest points are detected on each video frame and are tracked for a number of L consecutive frames. Subsequently, D = 5 descriptors, i.e., HOG, HOF, MBHx, MBHy and the (normalized) trajectory coordinates, are calculated along the trajectory of each video frame point of interest. The publicly available implementation in [9] for the calculation of the Dense Trajectory-based video description was used in the conducted experiments. An example of Dense Trajectory locations on facial videos is illustrated in Figure 2. Let us denote by s^d_ij, i = 1, ..., N, j = 1, ..., N_i, d = 1, ..., D the set of descriptors calculated for the N facial video segments in U. Five codebooks V_d, d = 1, ..., D are obtained by applying K-Means on s^d_ij for the determination of K prototypes for each descriptor type. The descriptors s^d_ij, j = 1, ..., N_i are subsequently quantized using V_d in order to determine D BoW-based video representations for facial video i.

Fig. 2. Examples of Dense Trajectories on facial videos.
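For the Dense Trajectory-based representation, the same BoW step is applied per descriptor channel. The sketch below, under the same assumptions as the previous one, builds one codebook per descriptor type and returns the D = 5 histograms of a video; the channel names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

DESCRIPTOR_TYPES = ("traj", "hog", "hof", "mbh_x", "mbh_y")  # D = 5 channels

def learn_codebooks(train_descriptors, K=200, seed=0):
    """One K-word codebook per descriptor type; train_descriptors maps type -> (n, dim) array."""
    return {d: KMeans(n_clusters=K, random_state=seed, n_init=10).fit(train_descriptors[d])
            for d in DESCRIPTOR_TYPES}

def multi_channel_bow(video_descriptors, codebooks):
    """D separate l1-normalized BoW histograms for one facial video segment."""
    reps = {}
    for d, codebook in codebooks.items():
        words = codebook.predict(video_descriptors[d])
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        reps[d] = hist / max(hist.sum(), 1.0)
    return reps  # {descriptor type: s_i^d in R^K}
```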

C. Facial video segment verification

Due to the fact that the proposed method aims to be applicable in the wild, and on real-life recordings, it would be rather inaccurate and optimistic to assume that the face detection and tracking algorithms [11], [12] applied perform flawlessly and that, thus, only facial video segments are produced. For this reason, and in order for a fully automatic approach not requiring human intervention to be proposed, a facial video segment verification step had to be introduced before the classification of facial video segments as visually silent and visually speaking. In this step, it is verified whether the videos are indeed facial videos or not. Both the STIP and the Dense Trajectory-based video representations are employed in this step: when a test video is introduced to the pretrained SVM or the SLFN network, the corresponding descriptors are calculated on the video locations of interest and transformed to feature vectors, which are subsequently quantized with the aid of the codebook vectors, in order to produce the facial vector and introduce it to the trained classifiers. Based on the obtained responses, the video is classified as being a facial video segment or not, and the videos identified as non-facial moving regions are discarded from the data set, thus not being introduced to the second layer of classifiers, which performs V-VAD.
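The resulting two-stage decision logic can be summarized as in the sketch below; the `face_verifier` and `vvad_classifier` objects stand for any trained two-class scorers (kernel SVM or kernel ELM), and the `score` interface, class labels and threshold are placeholders rather than the authors' API:

```python
def cascade_vvad(segments, face_verifier, vvad_classifier, alpha=0.0):
    """Two-stage decision: keep only verified facial segments, then run V-VAD on them.

    `segments` is an iterable of BoW representations; `alpha` is the decision
    threshold on the V-VAD response (introduced in the following subsection).
    """
    results = []
    for s in segments:
        if face_verifier.score(s) < 0:            # rejected: not a facial moving region
            results.append("non-facial")
        elif vvad_classifier.score(s) >= alpha:   # precision-oriented threshold
            results.append("talking")
        else:
            results.append("silent")
    return results
```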

D. SLFN classification

After the calculation of the facial vectors s_i ∈ R^K, i = 1, ..., N, obtained by using the STIP or the Dense Trajectory-based facial video representation, they are used to train an SLFN network. Since both face verification and V-VAD correspond to two-class problems, the network should consist of K input, L hidden and one output neurons, as illustrated in Figure 3. The number L of hidden layer neurons is usually much greater than the number of classes involved in the classification problem [10], [15], i.e., L ≫ 2.


Fig. 3. SLFN network topology for V-VAD.

The network target values t_i, i = 1, ..., N, each corresponding to a facial vector s_i, are set to t_i = 1 or t_i = −1, depending on whether the respective video segment i is a facial video segment in the facial video verification case, or on whether the facial video segment depicts a talking or a non-talking human face in the case of V-VAD, respectively. In ELM-based classification schemes, the network input weights W_in ∈ R^{K×L} and the hidden layer bias values b ∈ R^L are randomly assigned, while the network output weight w ∈ R^L is analytically calculated. Let us denote by v_j and w_j the j-th column of W_in and the j-th element of w, respectively. For an activation function Φ(·), the output o_i of the SLFN network corresponding to the training facial vector s_i is calculated by:

o_i = \sum_{j=1}^{L} w_j \, \Phi(\mathbf{v}_j, b_j, \mathbf{s}_i). \quad (1)

It has been shown [34], [35] that almost any nonlinear piecewise continuous activation function Φ(·) can be used for the calculation of the network hidden layer outputs, e.g., the sigmoid, sine, Gaussian, hard-limiting and Radial Basis Functions (RBF), Fourier series, etc. In our experiments, we have employed the RBF-χ² activation function, which has been found to outperform other choices for BoW-based action classification [36].

By storing the network hidden layer outputs corresponding to the training facial vectors s_i, i = 1, ..., N in a matrix Φ:

\boldsymbol{\Phi} = \begin{bmatrix} \Phi(\mathbf{v}_1, b_1, \mathbf{s}_1) & \cdots & \Phi(\mathbf{v}_1, b_1, \mathbf{s}_N) \\ \vdots & \ddots & \vdots \\ \Phi(\mathbf{v}_L, b_L, \mathbf{s}_1) & \cdots & \Phi(\mathbf{v}_L, b_L, \mathbf{s}_N) \end{bmatrix}, \quad (2)

equation (1) can be expressed in matrix form as o = Φ^T w. In order to increase robustness to noisy data by allowing small training errors, the network output weight w can be obtained by solving:

\text{Minimize: } J = \frac{1}{2} \|\mathbf{w}\|_2^2 + \frac{c}{2} \sum_{i=1}^{N} \|\xi_i\|_2^2 \quad (3)

\text{Subject to: } \mathbf{w}^T \boldsymbol{\phi}_i = t_i - \xi_i, \quad i = 1, \ldots, N, \quad (4)

where ξ_i is the error corresponding to training facial vector s_i, φ_i is the i-th column of Φ denoting the representation of s_i in the ELM space, and c is a parameter denoting the importance of the training error in the optimization problem. The optimal value of the parameter c is determined by applying a line search strategy using cross-validation. The network output weight w is finally obtained by:

\mathbf{w} = \boldsymbol{\Phi} \left( \mathbf{K} + \frac{1}{c} \mathbf{I} \right)^{-1} \mathbf{t}, \quad (5)

where K ∈ R^{N×N} is the ELM kernel matrix, having elements equal to [K]_{i,j} = φ_i^T φ_j [16], [37]. By using (5), the network response o_l for a test vector x_l ∈ R^D is given by:

o_l = \mathbf{w}^T \boldsymbol{\phi}_l = \mathbf{t}^T \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \frac{1}{c} \mathbf{I} \right)^{-1} \mathbf{k}_l, \quad (6)

where k_l ∈ R^N is a vector having elements equal to k_{l,i} = φ_i^T φ_l.

The RBF-χ² similarity metric provides state-of-the-art performance for BoW-based video representations [36], [38]. Therefore, the RBF-χ² kernel function is used in our experiments:

K(i, j) = \exp\left( -\frac{1}{4A} \sum_{k=1}^{K} \frac{(s_{ik} - s_{jk})^2}{s_{ik} + s_{jk}} \right), \quad (7)

where the value A is set equal to the mean χ² distance between the training data s_i.
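A compact sketch of the kernel ELM training rule (5) and test response (6) with the RBF-χ² kernel (7) is given below; it is a simplified NumPy illustration under the stated equations, not the authors' implementation, and the default regularization value c is only an example:

```python
import numpy as np

def chi2_rbf_kernel(S, Z=None, A=None, eps=1e-12):
    """RBF-chi^2 kernel between l1-normalized BoW histograms (rows of S and Z)."""
    Z = S if Z is None else Z
    num = (S[:, None, :] - Z[None, :, :]) ** 2
    den = S[:, None, :] + Z[None, :, :] + eps
    chi2 = (num / den).sum(axis=2)
    if A is None:                                 # A: mean chi^2 distance over the training pairs
        A = chi2.mean()
    return np.exp(-chi2 / (4.0 * A)), A

def kelm_train(S_train, t, c=10.0):
    """Solve (K + I/c)^{-1} t once; predictions only need kernel values to the training data."""
    K, A = chi2_rbf_kernel(S_train)
    beta = np.linalg.solve(K + np.eye(len(t)) / c, t)   # beta = (K + I/c)^{-1} t
    return beta, A, S_train

def kelm_predict(S_test, model):
    beta, A, S_train = model
    K_test, _ = chi2_rbf_kernel(S_test, S_train, A=A)   # rows are the k_l vectors of (6)
    return K_test @ beta                                # o_l = t^T (K + I/c)^{-1} k_l
```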

In order to employ the Dense Trajectory-based facial video representation to train the kernel ELM network described above, a multi-channel kernel learning approach [39] is followed, where:

K(i, j) = \exp\left( -\sum_{d=1}^{D} \frac{1}{4A} \sum_{k=1}^{K} \frac{(s^d_{ik} - s^d_{jk})^2}{s^d_{ik} + s^d_{jk}} \right). \quad (8)
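The multi-channel combination of (8) only changes how the kernel matrix is formed: the per-descriptor χ² terms are summed inside the exponential, as in the sketch below (a single normalizer A is used, as in the formula above; a per-channel normalizer would be a common variant):

```python
import numpy as np

def multi_channel_chi2_kernel(reps_a, reps_b, A, eps=1e-12):
    """Multi-channel RBF-chi^2 kernel: per-descriptor chi^2 terms summed in the exponent.

    reps_a, reps_b map descriptor type -> (n, K) array of l1-normalized histograms.
    """
    total = 0.0
    for d in reps_a:
        Sa, Sb = reps_a[d], reps_b[d]
        num = (Sa[:, None, :] - Sb[None, :, :]) ** 2
        den = Sa[:, None, :] + Sb[None, :, :] + eps
        total = total + (num / den).sum(axis=2) / (4.0 * A)
    return np.exp(-total)
```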

In most applications where ELM-based classification is performed, the classification decision is made solely based on the sign of o_t. However, due to the fact that high precision values, i.e., a high true positive rate, are mainly of interest here, a threshold α was introduced in the training phase and fine-tuning was performed in order to identify the threshold value giving the best classification precision.
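A sketch of this precision-oriented threshold selection over a grid of candidate values (the default below matches the grid α = 0.1e, e = 0, ..., 5 stated in Section IV) is shown below; `scores` are the real-valued network outputs on a validation set and `labels` the ±1 targets:

```python
import numpy as np

def tune_threshold(scores, labels, candidates=np.arange(0.0, 0.51, 0.1)):
    """Pick the threshold alpha that maximizes precision of the 'talking' (+1) class."""
    best_alpha, best_precision = candidates[0], -1.0
    for alpha in candidates:
        predicted_pos = scores >= alpha
        if predicted_pos.sum() == 0:
            continue
        precision = (labels[predicted_pos] == 1).mean()
        if precision > best_precision:
            best_alpha, best_precision = alpha, precision
    return best_alpha
```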

E. Facial video segment classification (test phase)

In the test phase, a test facial video segment is introduced to the SLFN network. When the STIP-based facial video segment representation is employed, HOG and HOF descriptors are calculated on STIP video locations, L2-normalized and concatenated, in order to form the corresponding HOG/HOF feature vectors p_tj ∈ R^D, j = 1, ..., N_t. The vectors p_tj are quantized by using the codebook vectors v_k ∈ R^D, k = 1, ..., K determined in the training phase and l1-normalized, in order to produce the facial vector s_t. s_t is subsequently introduced to the trained kernel ELM network using (7) and its response o_t is obtained. Similarly, when the Dense Trajectory-based facial video representation is employed, HOG, HOF, MBHx, MBHy, and Trajectory descriptors are calculated on the trajectories of densely-sampled video frame interest points and D = 5 BoW-based video representations s^d_t, d = 1, ..., D are produced. The representations s^d_t are subsequently introduced to the trained kernel ELM network using (8) and its response o_t is obtained. Finally, the test facial video is classified to the visually talking class if o_t ≥ α, or to the visually non-talking class if o_t < α. In facial video segment verification testing, feature vectors consisting solely of HOG descriptors are also used, both with the STIP and with the Dense Trajectory-based video segment representation.

IV. EXPERIMENTS

In this section, experiments conducted in order to evaluate the performance of the proposed approach on V-VAD are presented. One publicly available data set, namely CUAVE, as well as a new movie data set containing visual voice activity samples in the wild, were used to this end. A short description of these data sets is provided in the following subsections. Experimental results obtained by SVM and ELM-based classification are subsequently given. Regarding the optimal parameter values used in our method, they have been determined by applying a grid search strategy using the values c = 10^r, r = −6, ..., 6 and α = 0.1e, e = 0, ..., 5.

The classification performance metrics adopted for the evaluation of the classification results achieved by the various methods are classification accuracy (CA), precision (P), F1 measure (F1), miss rate (MR), false acceptance rate (FAR) and half total error rate (HTER = (FAR + MR)/2). Moreover, it should be noted that, in case no or very slight motion is encountered in a facial video, the adopted video description techniques detect no points of interest and, as a consequence, calculate no descriptors. Even though these videos are omitted during classification, they are taken into consideration in the calculation of the aforementioned performance metrics in the evaluation phase, as we make the assumption that they depict either visually silent facial videos or background images, which are considered to belong to the visually silent class, too.
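For concreteness, the adopted metrics can be computed from the confusion counts of the visually speaking (positive) class as in the sketch below; this is only a restatement of the standard definitions with HTER = (FAR + MR)/2:

```python
def vvad_metrics(tp, fp, tn, fn):
    """CA, P, F1, MR, FAR and HTER from the confusion counts of the 'talking' class."""
    ca = (tp + tn) / (tp + fp + tn + fn)
    p = tp / (tp + fp) if tp + fp else 0.0          # talking-class precision
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * recall / (p + recall) if p + recall else 0.0
    mr = fn / (tp + fn) if tp + fn else 0.0         # miss rate (1 - recall)
    far = fp / (fp + tn) if fp + tn else 0.0        # false acceptance rate
    hter = (far + mr) / 2.0
    return {"CA": ca, "P": p, "F1": f1, "MR": mr, "FAR": far, "HTER": hter}
```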

A. CUAVE data set

CUAVE [17] is a speaker-independent data set which can be used for voice activity detection, lip reading and speaker identification. It consists of videos of 36 speakers, recorded both individually and in pairs, uttering isolated and connected digits while standing still in front of a simplistic background of solid color, or slightly moving. The participants are both male and female, with different skin complexions, accents and facial attributes, as can be seen in Figure 4. The facial videos used in our experiments were extracted at a resolution of 195×315 pixels.

Experiments on this data set are usually conducted by performing multiple training-test rounds (sub-experiments), omitting a small percentage of the speakers and using 80% of the remaining for training and the rest 20% for testing, as suggested in [23], [24] and adopted in our experiments. The performance of the evaluated method is subsequently measured by reporting the mean classification rate over all sub-experiments.

Fig. 4. Sample speakers of the CUAVE data set.

B. Movie data set

The motive for the construction of a data set consisting of facial image videos extracted from full-length movies was the absence of a data set suitable for (audio-)visual voice activity detection, speech recognition or speaker identification in the wild (i.e., resembling real-life conditions), as the vast majority of the currently available public data sets are recorded in constrained conditions, e.g., with participants usually standing still in front of a plain background uttering digits, letters, or short phrases. Our data set was, thus, constructed by performing automatic face detection and tracking [11], [12] in three full-length movies. The detected ROIs containing facial images were cropped and resized to fixed-size facial images of 195×315 pixels. In some initial exploratory experiments, such a resolution was proven adequate for this particular problem. In this way, 4194 video sequences depicting facial image trajectories of 126 actors were extracted in a fully automated way, consisting of facial videos of people of different ages, genders and possibly origins, appearing at random poses, performing unconstrained movements and talking normally. Moreover, indoor as well as outdoor shots are encountered, with both stationary and moving complicated backgrounds.

In order for the proposed method to be evaluated on this data set, the leave-one-movie-out cross-validation protocol was applied; thus, mean classification accuracy results are reported. It should be noted here that, due to the fact that the face detection and tracking were fully automated, some video sequences not depicting facial images also emerged. However, such videos should not exist in a data set intended for testing V-VAD methods and thus were removed from the data set. This removal can be done either manually or in an automated way. The automatic approach entails the addition of another classification step, prior to the V-VAD step. In this step, the videos are classified based on the presence or absence of human faces in them, using the method described in Section III. Only those classified as facial image videos are fed to the second layer of classifiers, in order to be classified as visually speaking or silent. This preliminary classification step was performed both using all the descriptor histograms calculated for visual speech/silence classification, and utilizing only HOG histograms.
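The leave-one-movie-out protocol amounts to a grouped cross-validation over the three movies, as in the sketch below; `train_and_score` stands for any training/evaluation routine (e.g., BoW plus kernel ELM) and is a placeholder, not part of the authors' code:

```python
import numpy as np

def leave_one_movie_out(features, labels, movie_ids, train_and_score):
    """Average accuracy over folds where each movie is held out in turn."""
    accuracies = []
    for movie in np.unique(movie_ids):
        test_mask = movie_ids == movie
        acc = train_and_score(features[~test_mask], labels[~test_mask],
                              features[test_mask], labels[test_mask])
        accuracies.append(acc)
    return float(np.mean(accuracies))
```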

C. Experimental Results

The proposed method has been applied on the CUAVE data set by using the experimental protocols suggested in [23], [24], after a preprocessing step which was necessary in order to obtain frame-based results from the proposed method, which normally performs video-based classification. Specifically, a sliding window of length equal to 7 frames, moving with a step equal to 1 frame, was applied on the original videos, in order to split them into smaller parts, and labels were assigned to the resulting videos using majority voting on the labels of the frames constituting them. Frame-based classification was thus performed, as in [23], [24]. The sliding window length was chosen in such a way that the number of frames used in V-VAD by the proposed method was equal to the number of frames used for the calculation of the dynamic features exploited by the methods in [23], [24] for the same purpose.
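The sliding-window preprocessing can be sketched as follows; frame labels are assumed to be +1 for visually speaking and −1 for visually silent, and the default window length and step match the values stated above:

```python
import numpy as np

def sliding_windows(frame_labels, length=7, step=1):
    """Split a video's per-frame labels into windows and majority-vote a window label."""
    windows = []
    for start in range(0, len(frame_labels) - length + 1, step):
        chunk = np.asarray(frame_labels[start:start + length])
        talking = int(np.sum(chunk == 1))
        windows.append((start, 1 if talking > length / 2 else -1))  # (start frame, label)
    return windows
```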

Table I summarizes, in terms of classification accuracy (CA) and visually talking class precision (P), the performance obtained for each experimental setup and each video description method by the aforementioned classification algorithms. As can be seen in this Table, satisfactory visual voice activity detection performance is obtained by applying the proposed method. In more detail, the STIP-based video description seems to be more suitable for this data set than the Dense Trajectory-based description (DT), achieving better classification accuracies by approximately 15% in both experiments. This can be explained by taking into account that the combination scheme derived from the second video description method is very complicated, while the visual data set is quite simplistic, thus leading to overtraining and poor generalization in testing.

TABLE I
CLASSIFICATION RATES AND TALKING CLASS PRECISION ON THE CUAVE DATA SET.

CUAVE DS      | Experiment [23]  | Experiment [24]
              | CA      P        | CA      P
STIPs  SVM    | 87.2%   87.4%    | 86.7%   88.0%
STIPs  ELM    | 87.6%   87.0%    | 86.8%   88.9%
DT     SVM    | 74.2%   76.7%    | 71.4%   73.7%
DT     ELM    | 73.8%   75.7%    | 70.3%   72.4%

Comparison results with other state-of-the-art methods evaluating their performance on the CUAVE data set are provided in Table II. As can be seen, the proposed method outperforms the classification accuracy of the methods reported in [23], [24] by 15.9% and 12.7%, respectively, on the two experimental setups used in the CUAVE data set, thus achieving good generalization ability on new data. Moreover, in both experiments the proposed method has significantly lower error rates, while method [21] seems to be unable to handle the problem posed by this data set.

The results obtained after applying the proposed method on the new, fully unconstrained data set, without removing the videos which do not depict facial images, are presented in Table III. Satisfactory performance is achieved by both description methods, with a half total error rate (HTER) of approximately 30%, which is comparable to the respective performance obtained by the state-of-the-art on constrained visual data sets. In addition, the Dense Trajectory-based approach outperforms the STIP-based one in all the reported metrics, contrary to what was the case on the CUAVE data set. This can be explained by the fact that in our data set, head movements as well as complex backgrounds are encountered. Thus, the descriptors calculated using the dense trajectories method seem to be more efficient, enabling a good estimation of the face contour and of its motion as distinct from that of the background, resulting in better classification rates than those obtained using the STIP-based description.

The problem whose results are reported in Table III was not the usual V-VAD one, since a third class of samples was also present in the data set, consisting basically of noise. In order to test our method on the real V-VAD problem, we manually removed all the irrelevant videos and performed the experiments again. The results on the "clear" data set are presented in Table IV. By comparing the reported results with those in Table III, a fall in the performance metric rates is noticed in Table IV, especially in the visual silence class, emanating from the removal of the irrelevant videos, which were correctly classified as visually silent cases in the previous experiment.

Mean classification results obtained on the three full-length movies constituting the constructed data set, detailed in Section IV-B, are presented in Table V. As can be seen, the facial video segment verification step performs quite well. Very low miss rates are obtained using STIPs, and the face class precision as well as the overall accuracy are satisfactory. Even better results are obtained using the Dense Trajectory-based description and representation, reaching a 93% precision rate, thus allowing the use of this step in the construction of the fully automatic system proposed in this paper, even though the miss rates are slightly worse (∼2–4%) than those reported for STIPs.

TABLE V
FACIAL VIDEO SEGMENT VERIFICATION RATES ON THE FULL MOVIE DATA SET.

MOVIE DS            | CA      P       MR      F1
STIPs  KSVM         | 83.6%   85.8%   3.4%    90.8%
STIPs  HOG KSVM     | 84.0%   85.0%   1.7%    91.2%
STIPs  KELM         | 83.8%   86.5%   4.2%    90.9%
STIPs  HOG KELM     | 83.8%   86.1%   3.8%    90.8%
DT     KSVM         | 94.8%   91.0%   5.2%    92.8%
DT     HOG KSVM     | 88.1%   91.5%   5.8%    92.8%
DT     KELM         | 89.1%   93.0%   6.3%    93.3%
DT     HOG KELM     | 87.7%   92.1%   7.0%    92.5%

Table VI summarizes the classification results obtained by all the classifier pairs adopted for the automatic removal of non-facial videos from the data set and the subsequent classification of the facial videos as visually speaking and non-speaking. According to these results, our approach performs very well, even in the wild, as the reported classification rates are similar to those obtained by other existing methods on the several simplistic data sets available. Moreover, as already mentioned, the STIP-based facial video description is proven inadequate for classification purposes in this case, leading to ∼10% lower precision rates and ∼5% higher HTER rates than the Dense Trajectory-based method. However, a universal choice of one of the classifier pairs, reported as the best one, would not be right, as depending on the application, different performance metrics are considered the most important.

TABLE II
COMPARISON RESULTS ON THE CUAVE DATA SET.

CUAVE DS          | Experiment [23]                 | Experiment [24]
                  | CA      HTER    FAR     MR      | CA      HTER    FAR     MR
Method [21]       | 52.8%   47.1%   40.8%   53.3%   | 52.6%   47.2%   41.0%   53.5%
Method [23]       | 71.3%   25.6%   31.8%   28.7%   | -       -       -       -
Method [24]       | -       -       -       -       | 74.1%   25.9%   24.2%   27.6%
Proposed method   | 87.2%   11.3%   14.1%   8.5%    | 86.8%   11.4%   11.5%   11.3%

TABLE III
CLASSIFICATION RATES ON THE FULL MOVIE DATA SET.

CONSTRUCTED DS    | Full data set   | Visual silence          | Visual speech
                  | CA      HTER    | P       FAR     F1      | P       MR      F1
STIPs             | 70.8%   37.7%   | 71.8%   8.9%    80.2%   | 68.6%   66.4%   44.0%
DT                | 76.4%   30.5%   | 76.1%   7.3%    83.6%   | 77.6%   53.8%   57.9%

TABLE IV
CLASSIFICATION RATES ON THE "CLEAR" MOVIE DATA SET.

CONSTRUCTED DS    | Full data set   | Visual silence          | Visual speech
                  | CA      HTER    | P       FAR     F1      | P       MR      F1
STIPs             | 67.8%   35.5%   | 68.5%   15.4%   75.5%   | 67.8%   55.6%   52.8%
DT                | 71.1%   31.3%   | 69.9%   13.2%   77.2%   | 74.8%   49.4%   60.3%

By taking this into consideration, the combination of two neural network-based classification steps using the Dense Trajectory-based facial video description can be regarded as the best alternative. This is in line with the remark that, in our experiments, we mainly focus on the minimization of the false detection error and, thus, on the maximization of the visually speaking class precision.

TABLE VI
CLASSIFICATION RATES ON THE AUTOMATICALLY CLEARED MOVIE DATA SET.

MOVIE DS                  | CA      HTER    P
STIPs  KSVM-KSVM          | 68.5%   37.0%   62.2%
STIPs  HOG KSVM-KSVM      | 70.9%   35.9%   67.5%
STIPs  KSVM-KELM          | 69.7%   37.8%   68.2%
STIPs  HOG KSVM-KELM      | 70.8%   36.7%   68.2%
STIPs  KELM-KSVM          | 70.1%   36.4%   67.3%
STIPs  HOG KELM-KSVM      | 70.7%   35.8%   67.5%
STIPs  KELM-KELM          | 69.3%   37.3%   64.9%
STIPs  HOG KELM-KELM      | 69.6%   37.2%   65.8%
DT     KSVM-KSVM          | 73.0%   29.8%   70.9%
DT     HOG KSVM-KSVM      | 73.0%   29.6%   71.2%
DT     KSVM-KELM          | 73.1%   31.0%   76.5%
DT     HOG KSVM-KELM      | 73.2%   30.7%   77.5%
DT     KELM-KSVM          | 72.5%   29.7%   71.1%
DT     HOG KELM-KSVM      | 72.6%   29.8%   71.0%
DT     KELM-KELM          | 73.2%   30.3%   78.8%
DT     HOG KELM-KELM      | 73.4%   30.3%   78.6%

Finally, based on the results reported in Table VII, our method is proven to be much more efficient than one of the current state-of-the-art methods for visual voice activity detection, as it outperforms it by 23.8%. More specifically, method [21], which was tested only on facial videos of frontal images, seems to fail in dealing with the unconstrained problem, while the proposed method achieves satisfactory classification accuracy. The poor performance of method [21] on this data set was to a great extent expected, as its implementation utilizes face proportions in order to perform mouth detection. This approach is successfully applicable only on frontal facial images and apparently fails in cases where face rotations of more than ∼30° horizontally and/or ∼10° vertically are encountered, which are very frequent in our data set.

TABLE VII
COMPARISON RESULTS ON THE CONSTRUCTED DATA SET.

CONSTRUCTED DS    | Full data set   | Visual silence          | Visual speech
                  | CA      HTER    | P       FAR     F1      | P       MR      F1
Method [21]       | 49.6%   49.2%   | 57.2%   64.9%   43.1%   | 45.2%   33.5%   53.8%
Proposed method   | 73.4%   30.3%   | 71.5%   9.3%    80.0%   | 78.6%   51.4%   60.0%

V. CONCLUSIONS

In this paper, we proposed a novel method for Visual Voice Activity Detection in the wild that exploits local shape and motion information appearing at spatiotemporal locations of interest for facial video description and the BoW model for facial video representation. SVM and Neural Network-based classification based on the ELM, using the BoW-based facial video representations, leads to satisfactory classification performance. Experimental results on one publicly available data set demonstrate the effectiveness of the proposed method, since it outperforms recently proposed state-of-the-art methods in a user-independent experimental setting. The respective results on the fully unconstrained data of a new movie data set, especially constructed for dealing with the V-VAD problem in the wild, prove the efficiency of the proposed method even in the unconstrained problem, in which state-of-the-art methods fail.

ACKNOWLEDGEMENT

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 287674 (3DTVS). This publication reflects only the author's views. The European Union is not liable for any use that may be made of the information contained therein.

REFERENCES

[1] G. Zhao, M. Barnard, and M. Pietikainen, "Lipreading with local spatiotemporal descriptors," IEEE Transactions on Multimedia, vol. 11, no. 7, pp. 1254–1265, November 2009.



[2] C. Zhang, P. Yin, Y. Rui, R. Cutler, P. Viola, X. Sun, N. Pinto, and Z. Zhang, "Boosting-based multimodal speaker detection for distributed meeting videos," IEEE Transactions on Multimedia, vol. 10, no. 8, pp. 1541–1552, December 2008.

[3] K. Nathwani, P. Pandit, and R. Hegde, "Group delay based methods for speaker segregation and its application in multimedia information retrieval," IEEE Transactions on Multimedia, vol. 15, no. 6, pp. 1326–1339, October 2013.

[4] M. Sargin, Y. Yemez, E. Erzin, and A. Tekalp, "Audiovisual synchronization and fusion using canonical correlation analysis," IEEE Transactions on Multimedia, vol. 9, no. 7, pp. 1396–1403, November 2007.

[5] Q. Liu, A. Aubrey, and W. Wang, "Interference reduction in reverberant speech separation with visual voice activity detection," IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1610–1623, October 2014.

[6] V. Minotto, C. Jung, and B. Lee, "Simultaneous-speaker voice activity detection and localization using mid-fusion of SVM and HMMs," IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 1032–1044, June 2014.

[7] S. Petridis and M. Pantic, "Audiovisual discrimination between speech and laughter: Why and when visual information might help," IEEE Transactions on Multimedia, vol. 13, no. 2, pp. 216–234, April 2011.

[8] I. Laptev, "On space-time interest points," International Journal of Computer Vision, vol. 64, no. 2–3, pp. 107–123, September 2005.

[9] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Action recognition by dense trajectories," Computer Vision and Pattern Recognition, pp. 3169–3176, 2011.

[10] A. Iosifidis, A. Tefas, and I. Pitas, "Minimum class variance extreme learning machine for human action recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 11, pp. 1968–1979, November 2013.

[11] G. Stamou, M. Krinidis, N. Nikolaidis, and I. Pitas, "A monocular system for person tracking: Implementation and testing," Journal on Multimodal User Interfaces, vol. 1, no. 2, pp. 31–47, 2007.

[12] O. Zoidi, A. Tefas, and I. Pitas, "Visual object tracking based on local steering kernels and color histograms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 5, pp. 870–882, 2013.

[13] Y. Huang, Z. Wu, L. Wang, and T. Tan, "Feature coding in image classification: A comprehensive study," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, pp. 493–506, 2014.

[14] A. Iosifidis, A. Tefas, and I. Pitas, "Discriminant bag of words based representation for human action recognition," Pattern Recognition Letters, vol. 49, pp. 185–192, 2014.

[15] G. Huang, Q. Zhu, and C. Siew, "Extreme learning machine: a new learning scheme of feedforward neural networks," International Joint Conference on Neural Networks, vol. 2, pp. 985–990, July 2004.

[18] S. Takeuchi, H. Takashi, S. Tamura, and S. Hayamizu, "Voice activity detection based on fusion of audio and visual information," AVSP, pp. 151–154, 2009.

[19] D. Sodoyer, B. Rivet, L. Girin, J.-L. Schwartz, and C. Jutten, "An analysis of visual speech information applied to voice activity detection," International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. I–I, 2006.

[20] D. Sodoyer, B. Rivet, L. Girin, C. Savariaux, J.-L. Schwartz, and C. Jutten, "A study of lip movements during spontaneous dialog and its application to voice activity detection," The Journal of the Acoustical Society of America, vol. 125, no. 2, pp. 1184–1196, 2009.

[21] S. Siatras, N. Nikolaidis, and I. Pitas, "Visual speech detection using mouth region intensities," European Signal Processing Conference, 2006.

[22] K. Saenko, K. Livescu, M. Siracusa, K. Wilson, J. Glass, and T. Darrell, "Visual speech recognition with loosely synchronized feature streams," International Conference on Computer Vision, vol. 2, pp. 1424–1431, 2005.

[23] R. Navarathna, D. Dean, P. Lucey, S. Sridharan, and C. Fookes, "Dynamic visual features for visual-speech activity detection," Conference of International Speech Communication Association, 2010.

[24] R. Navarathna, D. Dean, S. Sridharan, C. Fookes, and P. Lucey, "Visual voice activity detection using frontal versus profile views," International Conference on Digital Image Computing Techniques and Applications, pp. 134–139, 2011.

[25] Q. Liu, W. Wang, and P. Jackson, "A visual voice activity detection method with adaboosting," Sensor Signal Processing for Defence (SSPD 2011), pp. 1–5, 2011.

[26] A. Aubrey, Y. Hicks, and J. Chambers, "Visual voice activity detection with optical flow," IET Image Processing, vol. 4, no. 6, pp. 463–472, 2010.

[27] I. Almajai and B. Milner, "Using audio-visual features for robust voice activity detection in clean and noisy speech," European Signal Processing Conference, vol. 86, 2008.

[28] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, November 2006.

[29] V. Minotto, C. Lopes, J. Scharcanski, C. Jung, and B. Lee, "Audiovisual voice activity detection based on microphone arrays and color information," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 1, pp. 147–156, 2013.

[30] I. Laptev and T. Lindeberg, "Space-time interest points," International Conference on Computer Vision, pp. 432–439, 2003.

[31] C. Harris and M. Stephens, "A combined corner and edge detector," Alvey Vision Conference, pp. 147–152, 1988.

[32] H. Wang, M. Ullah, A. Klaser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," British Machine Vision Conference, 2009.

[33] S. Theodoridis and K. Koutroumbas, "Pattern recognition," Academic Press, 2008.

[34] G. B. Huang, L. Chen, and C. K. Siew, "Universal approximation using incremental constructive feedforward networks with random hidden nodes," IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879–892, 2006.

[35] G. B. Huang and L. Chen, "Convex incremental extreme learning machine," Neurocomputing, vol. 70, no. 16, pp. 3056–3062, 2008.

[36] A. Iosifidis, A. Tefas, and I. Pitas, "Minimum variance extreme learning machine for human action recognition," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5427–5431, 2014.

[37] G. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 513–529, 2012.

[38] H. Wang, M. Ullah, A. Klaser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," British Machine Vision Conference, 2009.

[39] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, "Local features and kernels for classification of texture and object categories: A comprehensive study," International Journal of Computer Vision, vol. 73, no. 2, pp. 213–238, 2007.

