Using Deep Autoencoders for Facial Expression Recognition

Muhammad Usman1, Siddique Latif2,3, and Junaid Qadir2

1 COMSATS Institute of Information Technology, Islamabad
2 Information Technology University (ITU), Punjab, Lahore, Pakistan
3 National University of Sciences and Technology (NUST), Islamabad, Pakistan

[email protected], [email protected], [email protected]

Abstract—Feature descriptors involved in image processing are generally manually chosen and high dimensional in nature. Selecting the most important features is a crucial task for systems like facial expression recognition. This paper investigates the performance of deep autoencoders for feature selection and dimension reduction for facial expression recognition at multiple levels of hidden layers. The features extracted from the stacked autoencoder outperformed other state-of-the-art feature selection and dimension reduction techniques.

I. INTRODUCTION

Emotion recognition is an important area of research for enabling effective human-computer interaction. Human emotions can be detected using speech signals, facial expressions, body language, and electroencephalography (EEG), among other modalities. In this paper, we focus on facial expression recognition (FER), which is a widely studied problem [1], [2]. FER has become a very interesting field of study, and its applications are not limited to human mental state identification and operator fatigue detection, but extend to other scenarios where computers (robots) play a social role such as an instructor, a helper, or even a companion. In such applications, it is essential that computers are able to recognize human emotions and behave according to users' affective states. In healthcare, recognizing patients' emotional instability can help in early diagnosis of psychological disorders [3]. Another application of FER is to monitor human stress levels in daily human-computer interaction.

Humans can easily recognize another human's emotions from facial expressions, but the same task is very challenging for machines. Generally, FER consists of three major steps, as shown in Figure 1. The first step involves detecting a human face in the whole image using image processing techniques. In the second step, key features are extracted from the detected face. Finally, machine learning models are used to classify images based on the extracted features.

Feature descriptors like histograms of oriented gradients (HOG) [4], local Gabor features [5], and the Weber Local Descriptor (WLD) [6] are widely used for FER; HOG in particular has been shown in the literature to be effective for the FER task [7]. The dimensionality of these features is usually high. Due to the complexity of multi-view features, dimension reduction and a more meaningful representation of this high-dimensional data is a challenging task. Therefore, techniques like Principal Component Analysis (PCA), Local Binary Patterns (LBP) [5], [8], and Non-Negative Matrix Factorization (NMF) are used to overcome the high-dimensionality problem by representing the most relevant features in lower dimensions.

Fig. 1: Facial expression recognition (FER) block diagram

Machine learning techniques have revolutionized many fields of science, including computer vision, pattern recognition, and speech processing, through their powerful ability to learn nonlinear relationships over hidden layers, which makes them suitable for automatic feature learning and the modeling of nonlinear transformations. Deep neural networks (DNNs) can be used for feature extraction as well as for dimensionality reduction [7], [9]. A large number of classification techniques have been used for FER. For example, Choi et al. [10] used artificial neural networks (ANNs) for classification of facial expressions. The authors in [8], [11] used Support Vector Machines (SVMs) for FER. In [12], [13], the authors utilized Hidden Markov Models (HMMs) for FER; HMMs are mostly used on frame-level features to handle sequential data. Besides these classifiers, Dynamic Bayesian Networks [14] and Gaussian Mixture Models [15] have also been utilized for learning facial expressions. The recent success of deep learning also motivates its use for FER [16], [17].

In this paper, we use a novel approach based on stacked autoencoders for FER. We exploit an autoencoder network for effective representation of high-dimensional facial features in lower dimensions. Autoencoders are an ANN configuration in which the output units are linked back to the input units through the hidden layers. A smaller number of hidden units forces them to represent the input data in a low-dimensional latent representation. In a stacked autoencoder, the output of the first layer is given directly to the second layer as input. In other words, stacked autoencoders are built by stacking additional unsupervised feature-learning hidden layers, which can be trained using greedy methods, one layer at a time.


As a result, when the data is passed through the multiple hidden layers of a stacked autoencoder, it encodes the input vector in a smaller representation more efficiently [18]. In our case, an autoencoder network is well suited because it not only reduces the dimension of the data but can also detect the most relevant features. In previous work, Hinton et al. [19] showed that autoencoder networks can be used for effective dimension reduction and can produce more effective representations than PCA.

For our experiments, we chose the Extended Cohn-Kanade (CK+) [20] dataset, which is extensively used for automatic facial image analysis and emotion classification. HOG features are computed from the selected area of facial expressions, and their dimensions are reduced using stacked autoencoders at multiple levels and with multiple hidden layers to obtain the most optimal encoded features. An SVM model in the one-vs-all scenario is used for classification on this reduced form of the features. We performed multiple experiments on the selection of the optimal dimension (10-500 features) of the feature vector. The feature vector of length 60, obtained after introducing four hidden layers in the autoencoder network, outperformed the other dimensions. Importantly, we also use PCA for dimension reduction in order to compare baseline results with the autoencoders. Our proposed method for FER using stacked autoencoders also outperformed PCA and other recent approaches published in this domain. This demonstrates the effectiveness of stacked autoencoders for selecting the most relevant features for the FER task.

The rest of the paper is organized as follows. In Section II, we present background and related work. In Section III, each step of our proposed method is described in detail. In Section IV, we explain the experimental procedure and the obtained results. Finally, we conclude in Section V.

II. RELATED WORK

Facial expressions are visually observable non-verbal communication signals that occur in response to a person's emotions and originate from changes in the facial muscles. They are the key mechanism for conveying and understanding emotions. Ekman and Friesen [21] postulated six universal emotions (i.e., anger, fear, disgust, joy, surprise, and sadness), each with distinct content and a unique facial expression. Most studies in the area of emotion recognition focus on classifying these six emotions.

Much effort has been made to classify facial expressions from various facial features using machine learning algorithms. For example, Anderson et al. [22] developed an FER system to recognize the six emotions; they used SVMs and multilayer perceptrons and achieved a recognition accuracy of 81.82%. In [6], Wang et al. combined HOG and WLD features to capture missing contour and shape information. The proposed solution attained a 95.86% recognition rate by using the chi-square distance and the nearest neighbor method to classify the fused features. Li et al. [23] used k-nearest neighbors to compare the performance of PCA and NMF on Taiwanese and Indian facial expression databases. They attained recognition rates above 75% using both techniques.

Recently, a comprehensive study was carried out by Liu et al. [8], who combined HOG with Local Binary Pattern (LBP) features. PCA was used for dimension reduction of the extracted features. After applying several classifiers on the reduced features, they achieved a maximum recognition rate of 98.3%. Similarly, Xing et al. [24] used local Gabor features with an Adaboost classifier for FER and achieved 95.1% accuracy with a tenfold reduction in the dimensionality of traditional Gabor features.

Encouragingly, Jung et al. [25] used deep neural networks to extract temporal appearance as well as temporal geometric features from raw data. They tested this technique on several datasets and obtained higher accuracy than state-of-the-art techniques. Jang et al. [26] worked on color images and attained an 85.74% recognition rate by using color channel-wise recurrent learning with deep learning. Similarly, Talele et al. [27] used LBP features and an ANN for classification, with a maximum success rate of 95.48%.

Recently, autoencoder models have been used more widely for feature learning from data and for classification problems. For example, Huang et al. [28] used sparse autoencoder networks for feature learning and classification. Their technique was good at avoiding human interaction, but at the cost of computational complexity. Interestingly, Gupta et al. [29] developed a multi-velocity autoencoder network that uses multi-velocity layers to generate velocity-free deep motion features for facial expressions. The proposed technique attained state-of-the-art accuracy on various FER datasets. Interesting work was done by Makhzani and Frey [30] to investigate the effectiveness of sparsity on MNIST data. They showed that sparse autoencoders are simple to train and achieve better classification results than denoising autoencoders and Restricted Boltzmann Machines (RBMs), as well as networks trained with dropout. Another study [31] explored the effect of hidden layers in stacked autoencoders on MNIST data. The authors showed that stacked autoencoders with larger depth have better learning capability but require more training examples and time.

III. PROPOSED METHOD

Our proposed FER system consists of four steps (as illustrated in Figure 2). The first step is image processing, in which we use the state-of-the-art Viola-Jones [32] method for face detection and extraction; this extracted portion exhibits the most variance when the expression changes. In the second step, HOG features are computed from the cropped image. In the third step, the high-dimensional HOG features are reduced to a lower dimension using stacked autoencoders. Finally, in the fourth step, an SVM model is used on these lower-dimensional features to classify the facial expressions. The Extended Cohn-Kanade (CK+) dataset is used in our experiment. Importantly, we investigated the performance of encoded features of length 5 to 100 using different depths of stacked autoencoders. Figure 2 shows the flowchart of our overall experiment. Each step is detailed below.


Fig. 2: Facial Emotion Recognition (FER) framework

A. Image Processing

At the image processing stage, we first detect and extract the face region to eliminate redundant regions, which can affect the recognition rate. The databases used contain much redundant information in the images; to eliminate it, the robust real-time detector developed by Viola and Jones [32] is employed. In this way, we obtain the local face region around the mouth and eyes, as these parts carry the most discriminating information when the facial expression changes.

Fig. 3: Face detection and cropping salient area

As shown in Figure 3, we crop out the face region and resize the face image to 128×128 to obtain the salient areas of the facial expression.
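As an illustrative sketch (the paper reports using the Viola-Jones detector but not a specific toolkit), OpenCV's bundled Haar cascade implements this detector; the file path and detection parameters below are assumptions:

```python
import cv2

# OpenCV ships a pre-trained Haar cascade implementing Viola-Jones detection.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

# Read one frame in grayscale; the file name is illustrative.
image = cv2.imread("subject_frame.png", cv2.IMREAD_GRAYSCALE)

# detectMultiScale returns (x, y, w, h) boxes; the parameters are assumed values.
faces = detector.detectMultiScale(image, scaleFactor=1.1, minNeighbors=5)

if len(faces) > 0:
    x, y, w, h = faces[0]
    # Crop the detected face and resize to 128x128, the salient-area size used here.
    face = cv2.resize(image[y:y + h, x:x + w], (128, 128))
```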

B. Feature Extraction

The histogram of oriented gradients (HOG) is a feature descriptor widely used in computer vision and image processing [33]. The technique counts occurrences of gradient orientations in localized portions of an image. HOG is invariant to geometric and photometric transformations, except for object orientation. The images in the database contain different expressions and different orientations of the eyes, nose, and lip corners. HOG is used in our algorithm because it is a powerful descriptor for detecting such variations (i.e., when facial expressions change). In our proposed approach, we applied HOG to the cropped face images and extracted the feature vectors. A cropped image of size 128×128 gives a feature vector of size 1×8100 using HOG. The feature vectors are concatenated to form the feature matrix shown in Table I.
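A sketch of this step with scikit-image's hog function; the parameter values (9 orientation bins, 8×8-pixel cells, 2×2-cell blocks) are assumptions chosen to reproduce the reported 1×8100 length for a 128×128 crop, since the paper does not state its HOG settings:

```python
import numpy as np
from skimage.feature import hog

# 'face' stands in for the 128x128 grayscale crop from the previous step.
face = np.random.rand(128, 128)  # placeholder input

# A 128x128 image has 16x16 cells of 8x8 pixels; 2x2-cell blocks give
# (16 - 2 + 1)^2 = 225 block positions, each contributing 2*2*9 = 36
# values, so the descriptor length is 225 * 36 = 8100.
features = hog(face, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

assert features.shape == (8100,)
```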

C. Dimension Reduction

The main aim of this work is to show how a nonlinear machine learning technique can be effectively used to obtain the most relevant holistic representation of features in a lower dimension. The extracted HOG features have a high dimension (N×8100) compared to the number of available images (327). The standard practice is to reduce the dimension of the feature vector using dimension reduction techniques such as PCA, linear discriminant analysis (LDA), and NMF. For this purpose, we use an autoencoder network on the high-dimensional feature descriptors extracted using HOG. To benchmark the performance of these features, we also use PCA for dimension reduction. Both techniques are discussed below.

1×8100 (HOG feature of image 1)
1×8100 (HOG feature of image 2)
1×8100 (HOG feature of image 3)
1×8100 (HOG feature of image 4)
...
1×8100 (HOG feature of image N−1)
1×8100 (HOG feature of image N)

TABLE I: HOG feature matrix of size N×8100

1) Autoencoders: An autoencoder is an unsupervised architecture that replicates its input at its output. It takes an input feature vector $x$ and learns a code dictionary by changing the raw input data from one representation to another. An autoencoder applies backpropagation by setting the target values equal to the inputs (i.e., $y^{(i)} = x^{(i)}$), as shown in Figure 4.

Fig. 4: Architecture of an autoencoder network. The input vector x of length m is encoded to a lower-dimensional feature vector a of length n, then reconstructed as y with length m similar to x (n < m).

For example, if an autoencoder is given correlated structural data as input, the network will discover some of these correlations [34]. In an autoencoder, the lower-dimensional representation $a$ is given by

$$a = f(Wx + b) \tag{1}$$

where $W$ is the weight matrix connecting the input units and the hidden units, $b$ is the bias associated with the hidden units, and $a$ is the activation of the hidden units in the network. Here $f(z)$ is the sigmoid function, given by

$$f(z) = \frac{1}{1 + \exp(-z)} \tag{2}$$

$$z_i = \sum_{j=1}^{m} W_{ij} x_j + b_i \tag{3}$$


and

$$a_i = f(z_i) \tag{4}$$

The stacked autoencoder can then be described as

$$a_i = f\left(W_i\, a_{i-1} + b_i\right) \tag{5}$$
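To make equations (1)-(5) concrete, the following NumPy sketch runs one HOG vector through a stacked encoder; the layer sizes and the random weights are illustrative stand-ins (a trained network would use learned $W_i$ and $b_i$):

```python
import numpy as np

def sigmoid(z):
    # Equation (2): f(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(8100)  # one HOG feature vector (m = 8100)

# Stacked encoder 8100 -> 500 -> 60, following eq. (5) with a_0 = x.
# The intermediate size 500 is an assumed, illustrative choice.
sizes = [8100, 500, 60]
a = x
for m, n in zip(sizes[:-1], sizes[1:]):
    W = rng.standard_normal((n, m)) * 0.01  # weight matrix of this layer
    b = np.zeros(n)                         # bias of this layer
    a = sigmoid(W @ a + b)                  # eqs. (1), (3)-(5)

print(a.shape)  # (60,) low-dimensional code
```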

Encouragingly, an autoencoder can also discover interesting structure in the data, even when the number of hidden units is large, by imposing a sparsity constraint on the hidden units. Such an architecture is called a sparse autoencoder. The cost function J(W, b) of a sparse autoencoder is given by:

$$J(W,b) = \left[\frac{1}{m}\sum_{i=1}^{m}\left(\frac{1}{2}\left\|h_{W,b}(x^{(i)}) - x^{(i)}\right\|^{2}\right)\right] + \frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(W_{ji}^{(l)}\right)^{2} + \beta\sum_{j=1}^{s_2}\mathrm{KL}(\rho\,\|\,\hat{\rho}_j) \tag{6}$$

where $h_{W,b}(x)$ is the activation function, and $W$ and $b$ are the weights and biases, respectively. The first term in equation (6) tries to minimize the difference between the input and the output. The second term is the weight decay that avoids over-fitting, where $\lambda$ is the weight-decay parameter, $L$ is the number of layers in the autoencoder network, and $s_l$ denotes the number of units in the $l$th layer. Similarly, $W_{ji}^{(l)}$ represents the weight between the $i$th unit of layer $l$ and the $j$th unit of layer $l+1$, and $b_i^{(l)}$ is the bias associated with unit $i$ in layer $l+1$. The last term is a sparsity penalty, where $\beta$ controls the weight of this term, $\rho$ is a sparsity parameter, and KL is the Kullback-Leibler (KL) divergence, given by

$$\mathrm{KL}(\rho\,\|\,\hat{\rho}_j) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j} \tag{7}$$

Typically, ρ is set to a small value close to 0. The KL divergence is a standard function for measuring the difference between two distributions.
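As a concrete illustration of the sparsity penalty in equations (6)-(7), the NumPy sketch below computes the KL term for one set of hidden activations; the values of ρ, β, and the activations themselves are assumed for illustration, not settings reported in the paper:

```python
import numpy as np

def kl_divergence(rho, rho_hat):
    # Equation (7): KL divergence between Bernoulli distributions
    # with means rho (target sparsity) and rho_hat (observed activation).
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# rho_hat[j] is the mean activation of hidden unit j over the training set.
rho = 0.05                                              # small target sparsity
rho_hat = np.clip(np.random.rand(60), 1e-6, 1 - 1e-6)   # illustrative activations

beta = 3.0                                              # assumed penalty weight
sparsity_penalty = beta * np.sum(kl_divergence(rho, rho_hat))
```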

In this experiment, the features extracted using HOG are input to the autoencoder network, which encodes them at the desired dimension by limiting the number of hidden units in the hidden layers. The number of hidden layers must be determined empirically; therefore, we also explored the effect of increasing the number of hidden layers in the stacked autoencoder, which is typically used for dimension reduction of the input data. We performed multiple experiments to validate our findings. To obtain high-quality encoded features from the autoencoder, we use backpropagation for fine-tuning the network parameters. Mean squared error (MSE) is used as the loss function, with 400 epochs of training.
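One way to realize this setup is sketched below in Keras (an assumption: the paper reports using MATLAB, and the intermediate layer sizes and optimizer here are illustrative). Each autoencoder is trained greedily on the previous layer's codes with an MSE loss for 400 epochs, then only its encoder is kept; the end-to-end fine-tuning by backpropagation described above is omitted for brevity.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder HOG matrix: 327 sequences x 5 peak frames = 1635 rows (illustrative).
X = np.random.rand(1635, 8100).astype("float32")

codes = X
for n_units in [500, 200, 60]:  # assumed layer sizes, ending at a 60-dim code
    inp = keras.Input(shape=(codes.shape[1],))
    hidden = layers.Dense(n_units, activation="sigmoid")(inp)          # encoder
    out = layers.Dense(codes.shape[1], activation="sigmoid")(hidden)   # decoder
    ae = keras.Model(inp, out)
    ae.compile(optimizer="adam", loss="mse")                 # MSE loss, as in the paper
    ae.fit(codes, codes, epochs=400, batch_size=32, verbose=0)  # 400 epochs
    encoder = keras.Model(inp, hidden)
    codes = encoder.predict(codes, verbose=0)                # codes feed the next layer

# 'codes' now holds the 60-dimensional encoded features for the SVM stage.
```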

2) Principal Component Analysis: The research domains of pattern recognition and computer vision are dominated by the extensive use of PCA, which is also referred to as the Karhunen-Loeve expansion [35]. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is an effective method for reducing the feature dimension and has been extensively applied in FER for dimension reduction of features [5], [8]. Therefore, we chose PCA as the baseline against which to compare nonlinear dimension reduction by the autoencoder. The high-dimensional feature matrix (N×8100) is reduced to multiple numbers of dimensions (i.e., 10 to 500) using PCA.
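For reference, this PCA baseline can be sketched with scikit-learn (the paper used MATLAB; the random matrix and the choice of 80 components, PCA's best-performing dimension in the experiments, are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1635, 8100)  # placeholder N x 8100 HOG feature matrix

# Project onto the top 80 principal components, the dimension at which
# the paper reports PCA's best accuracy (96.44%).
pca = PCA(n_components=80)
X_reduced = pca.fit_transform(X)  # shape: (N, 80)
```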

D. Support Vector Machine

SVMs are a very powerful tool for binary as well as multi-class classification problems. SVMs were initially designed for binary classification, separating binary-class data (k = 2) with a maximized margin. However, real-world problems often require discriminating between more than two (k > 2) categories. Therefore, two representative ensemble schemes exist for classifying multi-class data with SVMs: one-versus-all and one-versus-one. In this experiment, we use SVM in the one-vs-all scenario with a Gaussian kernel function. In the one-vs-all scheme, SVM constructs k separate binary classifiers to classify the k classes of data. The mth binary classifier is trained using the data from the mth class as positive examples and the remaining k−1 classes as negative examples. During testing, the class label is predicted by the binary classifier that gives the maximum output value. For a binary classification task with training data $x_i$ ($i = 1, 2, \ldots, N$) and corresponding labels $y_i = \pm 1$, the decision function can be formulated as:

$$f(x) = \operatorname{sign}(w^T x + b) \tag{8}$$

where $w^T x + b = 0$ denotes the separating hyperplane, $w$ is a weight vector normal to the separating hyperplane, and $b$ denotes the bias or offset of the hyperplane. The region between the hyperplanes, also called the margin band, is

$$\gamma = \frac{2}{\|w\|} \tag{9}$$

Finally, choosing the optimal values of $w$ and $b$ is formulated as an optimization problem in which equation (9) is maximized subject to the following constraint:

$$y_i(w^T x_i + b) \geq 1 \quad \forall i \tag{10}$$
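A hedged sketch of this one-vs-all setup using scikit-learn (an assumption; the paper's experiments used MATLAB). OneVsRestClassifier trains one binary RBF-kernel (Gaussian) SVM per class and predicts via the largest decision value, mirroring the scheme described above:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Placeholder 60-dim encoded features and six expression labels (illustrative).
X = np.random.rand(1635, 60)
y = np.random.randint(0, 6, size=1635)

# One-vs-all ensemble of binary SVMs with a Gaussian (RBF) kernel: each of the
# k classifiers treats one class as positive and the rest as negative, and
# prediction picks the classifier with the largest output.
clf = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale"))
clf.fit(X, y)
pred = clf.predict(X[:5])
```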

IV. EXPERIMENTAL RESULTS AND DISCUSSION

The performance of our proposed approach for FER has been evaluated on the publicly available CK+ database. This dataset contains 593 image sequences from 123 subjects. Only 327 of the 593 sequences are labeled with one of 7 human facial expressions. Of these 7 expressions, we used six (i.e., anger, happiness, disgust, sadness, surprise, and fear), similar to the methods adopted in [8], [18], [36]. Contempt has only 18 labeled images, so it was not included in our experiment. Each expression sequence starts with a neutral expression and ends with the peak expression (e.g., full anger). In our experiment, we use the five peak images of each expression to incorporate the temporal information of an expression. Figure 5 shows a sample sequence of images for the six emotions that we use for training. We use 80% of the data for training, and testing is performed on the remaining 20%. During testing, we use only one peak image per expression.
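A minimal sketch of the 80/20 split with scikit-learn; the placeholder arrays and the stratification choice are assumptions for illustration, as the paper does not detail its splitting procedure:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1635, 60)            # placeholder encoded features
y = np.random.randint(0, 6, size=1635)  # placeholder labels for six expressions

# 80% for training, 20% for testing, as described above; stratification is an
# assumed choice to keep class proportions comparable across the two splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```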


Fig. 5: Sample sequence of five images for various emotions: (a) Angry, (b) Disgust, (c) Fear, (d) Happy, (e) Sad, and (f) Surprised

A multiclass SVM in the one-vs-all scheme with a Gaussian kernel was used for classification of facial expressions, implemented in MATLAB. We performed multiple experiments on different feature lengths obtained with the autoencoder and PCA, as shown in Table II. Note that the encoded features obtained by the stacked autoencoders mostly outperformed the baseline (PCA). Using the autoencoder for dimension reduction, we achieved the highest recognition rate, 99.60%, with 60 dimensions, while PCA obtained a 96.44% success rate with 80 dimensions.

Number of Features    PCA (Accuracy %)    Autoencoders (Accuracy %)
10                    88.34               97.80
20                    90.29               98.10
40                    96.01               98.50
60                    96.11               99.60
80                    96.44               98.10
100                   94.17               98.40
200                   94.82               97.84
300                   95.15               96.28
400                   95.46               96.98
500                   95.45               96.98

TABLE II: Accuracy for different dimensions (number of features) with SVM

We also investigated the effect of adding more hidden layers to the autoencoder network. We performed multiple experiments, introducing additional hidden layers with different numbers of hidden units (i.e., 500, 400, 300, and 200), while the encoded features range from 5 to 100. Figure 6 shows the structure of the five autoencoders used in our experiments.

Table III shows the results of experiments in which different numbers of hidden layers are introduced. Note that a higher number of hidden layers does not necessarily increase the accuracy, as already indicated in [37], [38]. From these results, we can state that beyond a certain number of hidden layers for each feature length, the accuracy starts decreasing. For example, with 80 features, the recognition rate increases as hidden layers are introduced up to the second layer, but decreases after that.

Fig. 6: The structure of the five stacked autoencoders used in our experiment: (a) one hidden layer, (b) two hidden layers, (c) three hidden layers, (d) four hidden layers, and (e) five hidden layers

We find the same trend for all feature lengths, though at different numbers of hidden layers.

Number of Features    Hidden Layer 1    Hidden Layer 2    Hidden Layer 3    Hidden Layer 4    Hidden Layer 5
5                     78.6              89.0              90.1              93.4              94.9
10                    78.4              93.0              94.9              97.8              97.1
20                    77.7              95.2              98.5              98.1              97.8
40                    96.2              98.3              98.9              98.5              98.1
60                    96.9              98.7              99.2              99.6              98.8
80                    97.6              98.9              98.4              98.1              97.3
100                   97.9              98.5              98.9              98.4              98.2

TABLE III: Recognition rate achieved with multiple hidden layers for different dimensions (number of features)

Figure 7 shows the trend of the best recognition rate against the number of reduced dimensions with the autoencoder and PCA. It can be seen from Figure 7 that there is no regular relationship between accuracy and the number of dimensions; however, accuracy remains almost the same beyond a specific dimension, i.e., 200. An accuracy of 99.60% with as few as 60 features has not previously been reported in the literature.

Fig. 7: Accuracy obtained using different feature dimensions

We have also reviewed the latest papers in Table IV to compare our method with recently published work in this domain.


As Table IV shows, the previous maximum achieved accuracy is 99.51%, using a combination of features (i.e., HOG+LDA+PCA). Similarly, Liu et al. [8] achieved a 98.3% recognition rate using local binary pattern (LBP) and HOG features together; they achieved this accuracy using a combination of features with 80 dimensions. In contrast, our proposed method shows significantly better results while using a single type of feature at lower dimensions.

Study                  Year    Method                                       Accuracy (%)
Xing et al. [24]       2016    Local Gabor features + Adaboost classifier   95.1
Cossetin et al. [39]   2016    Pairwise features                            98.07
Kar et al. [40]        2017    HOG+LDA+PCA                                  99.51
Liu et al. [8]         2017    Local binary patterns (LBP)+HOG with PCA     98.3
Our study              2017    HOG + Autoencoders                           99.60

TABLE IV: Comparison of some recent papers with our study

Although our proposed approach achieved a state-of-the-art recognition rate, the time complexity of autoencoders depends linearly on the number of features and hidden layers: the greater the number of features or hidden layers, the more time is required to train the model.

V. CONCLUSION

The main contribution of this paper is to investigate the performance of deep autoencoders for lower-dimensional feature representation. The experiments show that nonlinear dimension reduction using autoencoders is more effective than linear dimension reduction techniques for FER. We used the CK+ dataset in our experiments and compared the results using features obtained by autoencoder networks against the state-of-the-art PCA. Most importantly, we explored the effect of increasing the number of hidden layers, which enhanced the learning capability of the network and provided more robust and optimal features for facial expression recognition.

REFERENCES

[1] J. Kumari, R. Rajesh, and K. Pooja, "Facial expression recognition: A survey," Procedia Computer Science, vol. 58, pp. 486–491, 2015.

[2] B. Fasel and J. Luettin, "Automatic facial expression analysis: a survey," Pattern Recognition, vol. 36, no. 1, pp. 259–275, 2003.

[3] S. Latif, J. Qadir, S. Farooq, and M. A. Imran, "How 5G (and concomitant technologies) will revolutionize healthcare," arXiv preprint arXiv:1708.08746, 2017.

[4] P. Carcagnì, M. Del Coco, M. Leo, and C. Distante, "Facial expression recognition and histograms of oriented gradients: a comprehensive study," SpringerPlus, vol. 4, no. 1, p. 645, 2015.

[5] M. Abdulrahman, T. R. Gwadabe, F. J. Abdu, and A. Eleyan, "Gabor wavelet transform based facial expression recognition using PCA and LBP," in Signal Processing and Communications Applications Conference (SIU), 2014 22nd. IEEE, 2014, pp. 2265–2268.

[6] X. Wang, C. Jin, W. Liu, M. Hu, L. Xu, and F. Ren, "Feature fusion of HOG and WLD for facial expression recognition," in System Integration (SII), 2013 IEEE/SICE International Symposium on. IEEE, 2013, pp. 227–232.

[7] M. V. Akinin, N. V. Akinina, A. I. Taganov, and M. B. Nikiforov, "Autoencoder: Approach to the reduction of the dimension of the vector space with controlled loss of information," in Embedded Computing (MECO), 2015 4th Mediterranean Conference on. IEEE, 2015, pp. 171–173.

[8] Y. Liu, Y. Li, X. Ma, and R. Song, "Facial expression recognition with fusion features extracted from salient facial areas," Sensors, vol. 17, no. 4, p. 712, 2017.

[9] Y. Wang, H. Yao, S. Zhao, and Y. Zheng, "Dimensionality reduction strategy based on auto-encoder," in Proceedings of the 7th International Conference on Internet Multimedia Computing and Service. ACM, 2015, p. 63.

[10] H.-C. Choi and S.-Y. Oh, "Realtime facial expression recognition using active appearance model and multilayer perceptron," in SICE-ICASE, 2006 International Joint Conference. IEEE, 2006, pp. 5924–5927.

[11] D. Ghimire, S. Jeong, J. Lee, and S. H. Park, "Facial expression recognition based on local region specific features and support vector machines," Multimedia Tools and Applications, vol. 76, no. 6, pp. 7803–7821, 2017.

[12] M. H. Siddiqi, S. Lee, Y.-K. Lee, A. M. Khan, and P. T. H. Truc, "Hierarchical recognition scheme for human facial expression recognition systems," Sensors, vol. 13, no. 12, pp. 16682–16713, 2013.

[13] M. Z. Uddin, J. Lee, and T.-S. Kim, "An enhanced independent component-based human facial expression recognition from video," IEEE Transactions on Consumer Electronics, vol. 55, no. 4, 2009.

[14] Y. Li, S. Wang, Y. Zhao, and Q. Ji, "Simultaneous facial feature tracking and facial expression recognition," IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2559–2573, 2013.

[15] M. Schels and F. Schwenker, "A multiple classifier system approach for facial expressions in image sequences utilizing GMM supervectors," in Pattern Recognition (ICPR), 2010 20th International Conference on. IEEE, 2010, pp. 4251–4254.

[16] P. Liu, S. Han, Z. Meng, and Y. Tong, "Facial expression recognition via a boosted deep belief network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.

[17] J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson, "Generating facial expressions with deep belief nets," in Affective Computing. InTech, 2008.

[18] M. Liu, S. Li, S. Shan, and X. Chen, "AU-inspired deep networks for facial expression feature learning," Neurocomputing, vol. 159, pp. 126–136, 2015.

[19] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.

[20] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE, 2010, pp. 94–101.

[21] P. Ekman and W. Friesen, "Facial action coding system: a technique for the measurement of facial movement," Palo Alto: Consulting Psychologists, 1978.

[22] K. Anderson and P. W. McOwan, "A real-time automated system for the recognition of human facial expressions," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 1, pp. 96–105, 2006.

[23] J. Li and M. Oussalah, "Automatic face emotion recognition system," in Cybernetic Intelligent Systems (CIS), 2010 IEEE 9th International Conference on. IEEE, 2010, pp. 1–6.

[24] Y. Xing and W. Luo, "Facial expression recognition using local Gabor features and Adaboost classifiers," in Progress in Informatics and Computing (PIC), 2016 International Conference on. IEEE, 2016, pp. 228–232.

[25] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim, "Joint fine-tuning in deep neural networks for facial expression recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2983–2991.

[26] J. Jang, D. H. Kim, H.-I. Kim, and Y. M. Ro, "Color channel-wise recurrent learning for facial expression recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 1233–1237.

[27] K. Talele, A. Shirsat, T. Uplenchwar, and K. Tuckley, "Facial expression recognition using general regression neural network," in Bombay Section Symposium (IBSS), 2016 IEEE. IEEE, 2016, pp. 1–6.

[28] B. Huang and Z. Ying, "Sparse autoencoder for facial expression recognition," in Ubiquitous Intelligence and Computing, Autonomic and Trusted Computing, and Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom), 2015 IEEE Intl Conf on. IEEE, 2015, pp. 1529–1532.

[29] O. Gupta, D. Raviv, and R. Raskar, "Multi-velocity neural networks for facial expression recognition in videos," IEEE Transactions on Affective Computing, 2017.

[30] A. Makhzani and B. Frey, "K-sparse autoencoders," arXiv preprint arXiv:1312.5663, 2013.

[31] Q. Xu, C. Zhang, L. Zhang, and Y. Song, "The learning effect of different hidden layers stacked autoencoder," in Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2016 8th International Conference on, vol. 2. IEEE, 2016, pp. 148–151.

[32] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[33] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.

[34] B. Leng, S. Guo, X. Zhang, and Z. Xiong, "3D object retrieval with stacked local convolutional autoencoder," Signal Processing, vol. 112, pp. 119–128, 2015.

[35] A. E. Omer and A. Khurran, "Facial recognition using principal component analysis based dimensionality reduction," in Computing, Control, Networking, Electronics and Embedded Systems Engineering (ICCNEEE), 2015 International Conference on. IEEE, 2015, pp. 434–439.

[36] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, and D. N. Metaxas, "Learning active facial patches for expression analysis," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2562–2569.

[37] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, "Deep learning-based classification of hyperspectral data," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 6, pp. 2094–2107, 2014.

[38] J. Zabalza, J. Ren, J. Zheng, H. Zhao, C. Qing, Z. Yang, P. Du, and S. Marshall, "Novel segmented stacked autoencoder for effective dimensionality reduction and feature extraction in hyperspectral imaging," Neurocomputing, vol. 185, pp. 1–10, 2016.

[39] M. J. Cossetin, J. C. Nievola, and A. L. Koerich, "Facial expression recognition using a pairwise feature selection and classification approach," in Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE, 2016, pp. 5149–5155.

[40] N. B. Kar, K. S. Babu, and S. K. Jena, "Face expression recognition using histograms of oriented gradients with reduced features," in Proceedings of International Conference on Computer Vision and Image Processing. Springer, 2017, pp. 209–219.

