Ensemble of Multi Features for Facial Expression
Recognition using Deep Learning Techniques
A Thesis submitted to Gujarat Technological University
for the Award of
Doctor of Philosophy
in
Computer/IT Engineering
By
Thacker Chintan Bhupeshbhai
Enrollment No. 159997107003
Under the supervision of
Dr. Ramji M. Makwana
GUJARAT TECHNOLOGICAL UNIVERSITY,
AHMEDABAD
May 2021
DECLARATION
I declare that the thesis entitled “Ensemble of Multi Features for Facial Expression
Recognition using Deep Learning Techniques” submitted by me for the degree of
Doctor of Philosophy is the record of research work carried out by me during the period
from October 2016 to May 2021 under the supervision of Dr. Ramji M. Makwana and
this has not formed the basis for the award of any degree, diploma, associateship,
fellowship, titles in this or any other University or other institution of higher learning.
I further declare that the material obtained from other sources has been duly acknowledged
in the thesis. I shall be solely responsible for any plagiarism or other irregularities, if
noticed in the thesis.
Signature of Research Scholar: Date: 20/05/2021
Name of Research Scholar: Thacker Chintan Bhupeshbhai
Place: Bhuj
CERTIFICATE
I certify that the work incorporated in the thesis “Ensemble of Multi Features for Facial
Expression Recognition using Deep Learning Techniques” submitted by Mr. Thacker
Chintan Bhupeshbhai was carried out by the candidate under my supervision/guidance.
To the best of my knowledge: (i) the candidate has not submitted the same research work
to any other institution for any degree/diploma, Associateship, Fellowship or other similar
titles; (ii) the thesis submitted is a record of original research work done by the Research
Scholar during the period of study under my supervision, and (iii) the thesis represents
independent research work on the part of the Research Scholar.
Signature of Supervisor: ................................................... Date: 20/05/2021
Name of Supervisor: Dr. Ramji M. Makwana
Place: Rajkot
Course-work Completion Certificate
This is to certify that Mr. Thacker Chintan Bhupeshbhai, Enrollment No. 159997107003,
is a PhD scholar enrolled in the PhD program in the branch of Computer/IT Engineering of
Gujarat Technological University, Ahmedabad.
(Please tick the relevant option(s))
He/She has been exempted from the course-work (successfully completed
during M.Phil. Course)
He/She has been exempted from Research Methodology Course only
(successfully completed during M.Phil. Course)
He/She has successfully completed the PhD course work for the partial
requirement for the award of PhD Degree. His/ Her performance in the
course work is as follows-
Grade Obtained in Research Methodology (PH001): BB
Grade Obtained in Self Study Course (Core Subject) (PH002): AB
Supervisor’s Sign
(Dr. Ramji M. Makwana)
Originality Report Certificate
It is certified that PhD Thesis titled “Ensemble of Multi Features for Facial Expression
Recognition using Deep Learning Techniques” by Thacker Chintan Bhupeshbhai has
been examined by us.
We undertake the following:
a. The thesis contains significant new work/knowledge as compared to work already
published or under consideration for publication elsewhere. No sentence, equation,
diagram, table, paragraph, or section has been copied verbatim from previous work
unless it is placed under quotation marks and duly referenced.
b. The work presented is the original and own work of the author (i.e., there is no
plagiarism). No ideas, processes, results or words of others have been presented as
the author's own work.
c. There is no fabrication of data or results which have been compiled/analysed.
d. There is no falsification by manipulating research materials, equipment or
processes, or changing or omitting data or results such that the research is not
accurately represented in the research record.
e. The thesis has been checked using “URKUND Plagiarism Checker” (copy of
originality report attached) and found within limits as per GTU Plagiarism Policy
and instructions issued from time to time (i.e., permitted similarity index <=10 %).
Signature of Research Scholar: Date: 20/05/2021
Name of Research Scholar: Thacker Chintan Bhupeshbhai
Place: Bhuj
Signature of Supervisor: ......................................................... Date: 20/05/2021
Name of Supervisor: Dr. Ramji M. Makwana
Place: Rajkot
PhD Thesis Non-Exclusive License to
GUJARAT TECHNOLOGICAL UNIVERSITY
In consideration of being PhD Research Scholar at GTU and in the interests of the
facilitation of research at GTU and elsewhere I, “Thacker Chintan Bhupeshbhai” having
Enrollment No. 159997107003 hereby grant a non-exclusive, royalty free and perpetual
license to GTU on the following terms:
a) GTU is permitted to archive, reproduce and distribute my thesis, in whole or in
part, and/or my abstract, in whole or in part (referred to collectively as the “Work”)
anywhere in the world, for non-commercial purposes, in all forms of media;
b) GTU is permitted to authorize, sub-lease, sub-contract or procure any of the acts
mentioned in the paragraph (a);
c) GTU is authorized to submit the Work at any National/International Library, under
the authority of their “Thesis Non- Exclusive License”;
d) The Universal Copyright Notice (©) shall appear on all copies made under the
authority of this license;
e) I undertake to submit my thesis, through my University, to any Library and
Archives. Any abstract submitted with the thesis will be considered to form part of
the thesis.
f) I represent that my thesis is my original work, does not infringe any rights of others,
including privacy rights, and that I have the right to make the grant conferred by this
non-exclusive license.
g) If third-party copyrighted material was included in my thesis for which, under the
terms of the Copyright Act, written permission from the copyright owners is
required, I have obtained such permission from the copyright owners to do the acts
mentioned in paragraph (a) above for the full term of copyright protection.
h) I retain copyright ownership and moral rights in my thesis, and may deal with the
copyright in my thesis, in any way consistent with rights granted by me to my
University in this non-exclusive license.
i) I further promise to inform any person to whom I may hereafter assign or license my
copyright in my thesis of the rights granted by me to my University in this non-
exclusive license.
j) I am aware of and agree to accept the conditions and regulations of PhD including
all policy matters related to authorship and plagiarism.
Signature of the Research Scholar:
Name of Research Scholar: Thacker Chintan Bhupeshbhai
Date: 20/05/2021 Place: Bhuj
Signature of Supervisor: ...............................................................................
Name of Supervisor: Dr. Ramji M. Makwana
Date: 20/05/2021 Place: Rajkot.
Seal: M.D. Aiivine PXL Pvt. Ltd.
Thesis Approval Form
The viva-voce of the PhD Thesis submitted by Mr. Thacker Chintan Bhupeshbhai
(Enrollment No. 159997107003) entitled Ensemble of Multi Features for Facial
Expression Recognition using Deep Learning Techniques was conducted on Thursday,
20/05/2021, at Gujarat Technological University.
(Please tick any one of the following options)
The performance of the candidate was satisfactory. We recommend that he be
awarded the PhD degree.
Any further modifications in research work recommended by the panel after 3
months from the date of the first viva-voce upon request of the Supervisor or
request of the Independent Research Scholar, after which the viva-voce can be
re-conducted by the same panel again.
The performance of the candidate was unsatisfactory. We recommend that he
should not be awarded the PhD degree.
(Dr. Ramji M. Makwana)
Name and signature of Supervisor with Seal
(Dr. Binod Kumar)
External Examiner-1 (Name and Signature)
(Dr. Sharnil Pandya)
External Examiner-2 (Name and Signature)
(Dr. Subodh Srivastava)
External Examiner-3 (Name and Signature)
(Briefly specify the modifications suggested by the panel)
(The panel must give justifications for rejecting the research work)
Abstract
As we move towards a digital world, Human-Computer Interaction becomes very important.
Facial Expressions are the key features of non-verbal communication and they play an
essential role in human-computer interaction. Facial Expressions play a crucial role in social
interactions and are commonly used in the behavioural interpretation of emotions. It becomes
easy to understand anyone's emotional state and intentions based on the facial expression
shown. Over the last few years, facial expression recognition has attracted researchers in
psychology, computer science, security and medicine-related fields. These fields have an
extensive range of applications based on facial expressions, such as identifying a suspicious
person with surveillance cameras, detecting a patient's pain at the hospital, gauging
engagement in an online meeting or E-learning system, playing songs in a music player based
on a person's mood, detecting a driver's tiredness from his expressions while driving, robotics,
behavioural science, etc. Although human beings can identify facial expressions correctly and
effortlessly, reliable automatic facial expression recognition by machines is still a challenge.
A facial expression recognition system consists of different stages: Face Detection, Feature
Extraction and Emotion Classification. There are seven universally defined facial expressions:
Angry, Disgust, Fear, Happy, Neutral, Sad and Surprise. Facial expression recognition using
the Convolutional Neural Network has been actively researched in the last decade due to its
many applications in the human-computer interaction domain. As Convolutional Neural
Networks have an exceptional capability to learn, their different pre-trained architectures
perform well on feature extraction. Existing state-of-the-art models have achieved good
recognition accuracy on laboratory-trained facial expression datasets; however, they struggle
to achieve good accuracy on real-time facial expression datasets captured in an uncontrolled
environment. Images captured in an uncontrolled setting or taken from the internet contain
many challenges, such as lower resolution, occlusion, variations in lighting conditions, and
head pose variations.
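As an illustrative aside, the face-detection stage of the pipeline described above is commonly implemented with OpenCV-Python (the tool later shown in Figure 2.5). The following is a minimal sketch under that assumption; the file name and crop size are illustrative, not this thesis's exact pre-processing.

```python
# Minimal face-detection sketch using OpenCV-Python's bundled Haar cascade.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("sample_face.jpg")            # illustrative input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # the detector works on grayscale

# Detect faces, then crop and resize each one for the downstream CNN.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))  # FER2013 images are 48x48
```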
The work introduced in this research focuses on recognizing facial expressions from
images using deep learning techniques to improve recognition accuracy. This research work
investigates deep learning methods that address the issue of recognition accuracy on
lower-resolution images for facial expression recognition. The key objective of this research
is to improve the recognition accuracy on real-time facial expression datasets, which contain
challenging real-world images, and on laboratory-trained datasets, whose images are captured
in a controlled environment, for the cross-database evaluation study. The feature extraction
process is more difficult for real-world images than for images captured in a controlled
environment. In this research work, three models are proposed: the Multi-Layer Feature-Fusion
based Classification (MLFFC) model, the Multi-Model Feature-Fusion based Classification
(MMFFC) model and a novel facial expression recognition model based on a Normalized
CNN. The MLFFC and MMFFC models use the fusion concept of layers in different aspects.
The idea of fusion utilizes a combination of knowledge obtained from two different domains
to enhance the feature extraction for the given images. In the MLFFC model, the concept of
inter-layer feature fusion is applied on the InceptionV3 CNN architecture. From the literature
survey, it is discovered that the majority of the work focuses on the feature maps obtained at
the last layer of the CNN model and gives little consideration to the advantages of the
intermediate layers of the model, which add some significant features. The proposed MLFFC
model is tested on two publicly available datasets: the laboratory-trained CK+ dataset and the
real-time facial expression dataset FER2013. The proposed model performs well and provides
better recognition accuracy on both kinds of facial expression datasets, unlike the models
which work exceptionally well on laboratory-trained facial expression datasets but fail to do so
when it comes to real-time facial expression datasets.
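To make the inter-layer fusion idea concrete, here is a minimal Keras sketch, not the thesis's exact configuration: features pooled from one of InceptionV3's later Inception modules ("mixed9", a Module C block in Keras's layer naming) are concatenated with the final-layer features before the softmax classifier.

```python
# Minimal sketch of inter-layer feature fusion on InceptionV3 (illustrative).
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import layers, Model

base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(299, 299, 3))

# Final-layer features: the last Inception block, globally pooled.
final_features = layers.GlobalAveragePooling2D()(base.output)

# Intermediate features from a Module C block, pooled to the same rank.
inter_features = layers.GlobalAveragePooling2D()(
    base.get_layer("mixed9").output)

# Fuse both feature vectors, then classify into the 7 facial expressions.
fused = layers.Concatenate()([final_features, inter_features])
outputs = layers.Dense(7, activation="softmax")(fused)
model = Model(inputs=base.input, outputs=outputs)
```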
In the MMFFC model, the concept of an ensemble of two CNN architectures is applied by
concatenating the two feature vectors generated at the final layers of the VGG16 and ResNet50
CNN architectures. Existing research approaches typically use a single CNN model to extract
features for facial expression recognition. From the literature survey, the ensemble-of-CNNs
concept was found to improve recognition accuracy. In this concept, the concatenation of
features from various networks helps to overcome the limitations of a single network and
produces superior performance. The MMFFC model is tested on two publicly available
datasets: the laboratory-trained KDEF dataset and the real-time facial expression dataset
FER2013. The proposed model performs well and provides better recognition accuracy on
the real-time facial expression dataset as well as on the laboratory-trained facial expression
dataset. Results are compared with other state-of-the-art methods.
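As an illustration of this ensemble concept, a minimal Keras sketch is given below; the input size, pooling and classifier head are illustrative assumptions rather than the thesis's exact settings (per-backbone input preprocessing is omitted for brevity).

```python
# Minimal sketch of a two-backbone ensemble with feature-vector concatenation.
from tensorflow.keras.applications import VGG16, ResNet50
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(224, 224, 3))

vgg = VGG16(weights="imagenet", include_top=False)
resnet = ResNet50(weights="imagenet", include_top=False)

# One global feature vector per backbone: 512-d (VGG16) and 2048-d (ResNet50).
vgg_vec = layers.GlobalAveragePooling2D()(vgg(inputs))
resnet_vec = layers.GlobalAveragePooling2D()(resnet(inputs))

# Concatenate the two vectors into a 2560-d fused feature, then classify.
fused = layers.Concatenate()([vgg_vec, resnet_vec])
outputs = layers.Dense(7, activation="softmax")(fused)
model = Model(inputs=inputs, outputs=outputs)
```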
In the third model, a novel concept known as EfficientNet, a rethinking of model scaling for
CNNs, is implemented. There are different EfficientNet models, B0 to B7, based on the
compound scaling method, which scales up a CNN in a more structured way. Unlike
conventional approaches that arbitrarily scale the network dimensions such as width, depth and
resolution, this approach uniformly scales the network dimensions with a fixed set of scaling
coefficients.
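For reference, the compound scaling rule from the EfficientNet paper (Tan and Le, 2019) can be written as follows, where the compound coefficient φ is chosen by the user and the constants α, β, γ are found by a small grid search on the B0 baseline:

```latex
% Compound scaling: one coefficient \phi uniformly scales all three dimensions.
\[
\begin{aligned}
\text{depth:}\quad      & d = \alpha^{\phi} \\
\text{width:}\quad      & w = \beta^{\phi}  \\
\text{resolution:}\quad & r = \gamma^{\phi} \\
\text{s.t.}\quad        & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
                          \qquad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}
\]
```

The constraint keeps the total FLOPS growth for any φ close to 2^φ, since FLOPS scale roughly linearly with depth but quadratically with width and resolution.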
This is the important characteristic of the novel EfficientNet approach, which works well on
higher-resolution images. From the literature study, it is found that no work has been carried
out for facial expression recognition using this concept to date. Different optimizers were
applied to the EfficientNetB7 architecture to determine which optimizer performs well on this
architecture for facial expression recognition. Different optimizers were also applied to the
ResNet152 architecture for the cross-evaluation study. Experimental results show that the
RMSprop optimizer performs well on the EfficientNetB7 architecture and the SGD optimizer
performs well on the ResNet152 architecture for facial expression recognition. The vanishing
gradient descent issue is also identified in the experimental results, caused by variance
generated in the computational process. Due to this issue, the model's accuracy and loss graphs
show irregular rather than smooth curves. The issue is resolved by applying the proposed
internal batch normalization approach, which retrains the model considering only the batch
normalization layers; these regularize the model and reduce the variance of the layer inputs.
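A minimal sketch of one plausible reading of this retraining step is shown below, assuming a Keras EfficientNetB7 model: after the initial training run, every layer except the BatchNormalization layers is frozen and the model is trained again, so only the normalization parameters adapt. The input size, head and hyperparameters are illustrative assumptions, not the thesis's exact configuration.

```python
# Sketch: retrain only the batch-normalization layers of an EfficientNetB7 model.
from tensorflow.keras.applications import EfficientNetB7
from tensorflow.keras import layers, Model, optimizers

base = EfficientNetB7(weights="imagenet", include_top=False,
                      input_shape=(224, 224, 3), pooling="avg")
outputs = layers.Dense(7, activation="softmax")(base.output)
model = Model(base.input, outputs)

# ... the first training pass over the facial expression dataset goes here ...

# Second pass: freeze everything except the BatchNormalization layers.
for layer in model.layers:
    layer.trainable = isinstance(layer, layers.BatchNormalization)

# RMSprop is the optimizer the experiments found to work well on EfficientNetB7.
model.compile(optimizer=optimizers.RMSprop(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(...) would then continue training with reduced layer-input variance.
```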
The experimental results demonstrate that all the proposed models achieve recognition
accuracy comparable to, or better than, the existing state-of-the-art methods.
Acknowledgment
Every achievement is the result of committed activities, more so when they are headed and
guided by worthy and knowledgeable persons. It is with a sense of pride and pleasure that I
humbly look back to acknowledge those who have been a source of encouragement in my
entire endeavour.
First and foremost, I would like to express my sincere gratitude to my Ph.D. research
supervisor, Dr. Ramji M. Makwana, for introducing me to this exciting research area
and for his continuous support, guidance, inspiration and encouragement throughout
my Ph.D. research. His passion, his robust view of research and his quest to provide high-
quality work have made a deep impression on me. During our interactions, I have learned
extensively from him, including how to think positively, how to look at a problem from a
new perspective and how to approach the problem through systematic thinking. I am very
much obliged to him for his profound approach, his motivation, and for spending valuable
time to mould this work and bring hidden aspects of research to light.
I extend special words of thanks to my Doctoral Progress Committee (DPC) members,
Dr. Narendra C. Chauhan and Dr. Apurva M. Shah, for their excellent guidance,
valuable comments, useful suggestions and encouragement to visualize the problem from
different perspectives. Their humble approach and their way of appreciating good
work have always created an amenable environment and boosted my confidence to push
the limit. I owe a lot of gratitude to them for always being there for me, and I feel
privileged to be associated with people like them during my life. I would also like to
thank my foreign supervisor, Dr. Shishir Shah, for his valuable guidance and support to
improve my research work.
I would also like to express my appreciation towards my parent institute, HJD Institute of
Technical Education and Research-Kutch, Dr. Jagdish Halai, Hon. Chairman, and Dr.
Rasila Hirani, Institute Coordinator, for providing all kinds of technical and
nontechnical support for my research work. It is a pleasure to thank my colleagues and the
non-teaching staff of the computer engineering department, who have directly or indirectly
helped me during my research work. My special regards to my dear friend Dr. Safvan
Vahora for his valuable suggestions and guidance.
I feel a deep sense of gratitude towards my grandparents, mother, father and brother, who
were part of my vision. Their unfailing love and support have always been my strength.
Their patience and sacrifice will remain my inspiration for my entire life. Finally, my
sincere and heartiest special thanks to my wife, Bhoomi, for her eternal support and
understanding of my goals and aspirations. Her support has always been my strength. Her
patience and sacrifice will remain my inspiration throughout my life. Without her support, I
would not have been able to complete much of what I have done. I am short of words to
express my loving gratitude to my loving son, Aadit, for his innocent smiles which inspired
me during the entire work.
Above all, I am very much thankful to the Almighty God for giving me this beautiful life
and standing by me at each stage of my life to complete this research.
Chintan B. Thacker
Table of Contents
Abstract ........................................................................................................................ xiii
Acknowledgment .......................................................................................................... xvi
Table of Contents ....................................................................................................... xviii
List of Abbreviations .................................................................................................... xxi
List of Figures ............................................................................................................ xxvii
List of Tables ............................................................................................................ xxxiii
1 Introduction .............................................................................................................. 1
1.1 Overview ............................................................................................................. 1
1.2 Research Motivation ............................................................................................ 3
1.3 Research Challenges ............................................................................................ 5
1.4 Problem Statement ............................................................................................... 7
1.5 Research Objectives and Scope ............................................................................ 8
1.5.1 Research Objectives ...................................................................................... 8
1.5.2 Scope of Research Work ............................................................................... 8
1.6 Organization of the Thesis ................................................................................... 9
2 Theoretical Background ......................................................................................... 11
2.1 Facial Expression Recognition System ............................................................... 11
2.1.1 Pre-processing for Face Detection ............................................................... 12
2.1.2 Feature Extraction ....................................................................................... 14
2.1.3 Facial Expression Classification .................................................................. 20
2.2 Deep Learning for Facial Expression Recognition.............................................. 25
2.2.1 Evolution of AI: Machine Learning and Deep Learning .............................. 25
2.2.2 Deep Neural Networks ................................................................................ 29
2.2.2.1 Convolutional Neural Networks .................................................................. 29
2.2.2.2 Deep Auto Encoder ..................................................................................... 31
2.2.2.3 Restricted Boltzmann Machine ................................................................... 32
2.2.2.4 Deep Belief Network .................................................................................. 32
2.2.2.5 Recurrent Neural Network .......................................................................... 33
2.2.2.6 Long Short-Term Memory .......................................................................... 35
2.3 Convolutional Neural Network .......................................................................... 37
2.3.1 Convolutional Layer ................................................................................... 38
2.3.2 Pooling Layer ............................................................................................. 42
2.3.3 Fully Connected Layer ................................................................................ 43
2.3.4 Transfer Learning ....................................................................................... 44
2.4 Fusion Approach in Convolutional Neural Network ........................................... 46
2.4.1 Multi-Feature Fusion based approach .......................................................... 46
2.4.2 Ensemble of Multi-CNN Feature Fusion based approach ............................ 48
3 Literature Review ................................................................................................... 50
3.1 Overview ........................................................................................................... 50
3.2 Conventional FER Approaches .......................................................................... 51
3.3 Deep-Learning based FER Approaches .............................................................. 56
3.4 Multi-Feature Fusion based FER Approaches .................................................... 62
3.4.1 Multi-Feature fusion in a single model ........................................................ 62
3.4.2 Multi-Feature fusion using multi-model ...................................................... 65
3.5 Summary and Discussion ................................................................................... 68
4 Proposed Multi-Layer Feature-Fusion based Classification Model .................... 69
4.1 Introduction ....................................................................................................... 69
4.2 Inception-V3 CNN Architecture ........................................................................ 70
4.3 Proposed MLFFC ............................................................................................. 74
4.4 Dataset Details ................................................................................................... 77
4.4.1 CK+ Dataset ............................................................................................... 77
4.4.2 FER2013 Dataset ........................................................................................ 77
4.5 Experiment and Results ..................................................................................... 78
4.5.1 Experimental Setup and Implementation Details ......................................... 78
4.5.2 Experimental Results on Inception Module C layers ................................... 79
4.5.3 Experimental Results on CK+ Dataset ........................................................ 80
4.5.4 Experimental Results on FER2013 Dataset ................................................. 84
4.6 Discussion and Summary ................................................................................... 88
5 Proposed Multi-Model Feature-Fusion based Classification Model .................... 89
5.1 Introduction ....................................................................................................... 89
5.2 VGG-16 CNN Architecture ............................................................................... 90
5.3 ResNet-50 CNN Architecture ............................................................................ 91
5.4 Proposed MMFFC model................................................................................... 94
5.5 Dataset Details ................................................................................................... 97
5.5.1 FER2013 Dataset ........................................................................................ 97
5.5.2 KDEF Dataset ............................................................................................. 98
5.6 Experiments and Results .................................................................................... 98
5.6.1 Experimental Setup and Implementation Details ......................................... 99
5.6.2 Experimental Results of Ensemble approach using different CNN
architectures ................................................................................................ 99
5.6.3 Experimental Results on FER2013 Dataset ............................................... 100
5.6.4 Experimental Results on KDEF Dataset .................................................... 105
5.7 Discussion and Summary ................................................................................. 109
6 Novel FER Model based on Normalized CNN..................................................... 110
6.1 Introduction ..................................................................................................... 110
6.2 EfficientNet Architecture and Working Methodology ...................................... 112
6.3 Proposed novel FER Model: EfficientNet-B7 .................................................. 117
6.4 Dataset Details ................................................................................................. 120
6.4.1 KDEF Dataset ........................................................................................... 120
6.4.2 FER2013 Dataset ...................................................................................... 121
6.5 Experiments and Results .................................................................................. 121
6.5.1 Experimental Results on proposed EfficientNet-B7 model ........................ 122
6.5.2 Internal Batch Normalization (IBN) & Experimental Results .................... 126
6.6 Discussion and Summary ................................................................................. 130
7 Conclusion and Further Enhancements .............................................................. 132
7.1 Conclusion....................................................................................................... 132
7.2 Future Enhancements ....................................................................................... 135
List of References ......................................................................................................... 137
List of Publications ...................................................................................................... 149
List of Abbreviations
1-D 1-Dimensional
2-D 2-Dimensional
3-D 3-Dimensional
AI Artificial Intelligence
AFER Automatic Facial Expression Recognition
Adam Adaptive Moment Estimation Optimizer
ANN Artificial Neural Network
AF Average Filter
AMF Adaptive Median Filter
AAM Active Appearance Model
AUs Action Units
AFEW Acted Facial Expression in Wild dataset
AUC Area Under the Curve
AutoML Automated Machine Learning
BF Bilateral Filter
BU-3DFE Binghamton University 3D Facial Expression dataset
BiLSTM Bidirectional Long Short-Term Memory
BDBN Boosted Deep Belief Network
BP4D Binghamton-Pittsburgh 4D Spontaneous expression dataset
CNN Convolutional Neural Network
CV Computer Vision
CK+ The Extended Cohn-Kanade dataset
Conv Convolutional
C-LSTM Convolutional Long Short-Term Memory
DCT Discrete Cosine Transform
1D DCT 1-Dimensional Discrete Cosine Transform
2D DCT 2-Dimensional Discrete Cosine Transform
DNN Deep Neural Network
DL Deep Learning
DAE Deep Auto Encoder
DBN Deep Belief Network
DWT Discrete Wavelet Transform
3DCNN 3-Dimensional Convolutional Neural Network
DISFA The Denver Intensity of Spontaneous Facial Action dataset
DSN Deep Spatial Network
DTN Deep Temporal Network
DLBP Directional Local Binary Pattern
DTAN Deep Temporal Appearance Network
DTGN Deep Temporal Geometry Network
DTAGN Deep Temporal Geometry Appearance Network
DSAE Deep Sparse Auto Encoder
EmotiW Emotion Recognition in the Wild
ECNN Ensemble of Convolutional Neural Network
FER Facial Expression Recognition
FER2013 The Facial Expression Recognition 2013 dataset
FC Fully Connected
FACS Facial Action Coding System
FED-RO Facial Expression Dataset in the presence of Real Occlusion
FERA Facial Expression Recognition and Analysis dataset
FV Feature Vector
FLOPS Floating-Point Operations Per Second
GPU Graphics Processing Unit
GK Gaussian Kernel
GRU Gated Recurrent Units
GF Gaussian Filter
HCI Human-Computer Interaction
HMM Hidden Markov Model
HOG Histogram of Oriented Gradients
ICA Independent Component Analysis
ICML International Conference on Machine Learning
IDE Integrated Developer Environment
IACNN Identity Aware Convolutional Neural Network
ILSVR ImageNet Large Scale Visual Recognition
IBN Internal Batch Normalization
JAFFE The Japanese Female Facial Expression dataset
KDEF The Karolinska Directed Emotional Faces dataset
KNN K-Nearest Neighbour
LBP Local Binary Pattern
LSTM Long Short-Term Memory
T-LSTM Temporal Long Short-Term Memory
MLFFC Multi-Layer Feature-Fusion based Classification
MMFFC Multi-Model Feature-Fusion based Classification
ML Machine Learning
MLP Multi-Layer Perceptron
MF Median Filter
MFFNN Multilayer Feed Forward Neural Network
MRE-CNN Multi-Region Ensemble Convolutional Neural Network
MLCNN Multi-level Convolutional Neural Network
MNF Multi-Network Fusion
MBConv Mobile Inverted Bottleneck Convolution
NIR Near-Infrared
NN Neural Network
NWPU-RESISC The Northwestern Polytechnical University Remote Sensing Image Scene
Classification dataset
OpenCV Open-Source Computer Vision
PCA Principal Component Analysis
POOL Pooling
RMSprop Root Mean Square Propagation optimizer
RNN Recurrent Neural Network
RBM Restricted Boltzmann Machine
ROI Region of Interest
ReLU Rectified Linear Unit
RBF Radial Basis Function
RAF-DB Real world Affective Faces Database
RaFD Radboud Faces Database
ROC Receiver Operating Characteristic curve
SGD Stochastic Gradient Descent optimizer
SVM Support Vector Machine
SVD Singular Value Decomposition
SIFT Scale Invariant Feature Transform
SURF Speeded-Up Robust Features
STTM Spatio-Temporal Texture Map
SBP Sparse Batch Normalization
SFEW Static Facial Expressions in the Wild database
SDCNN Single Deep Convolutional Neural Network
SBN-CNN Sparse Batch Normalization Convolutional Neural Network
STC-NLSTM Spatio-Temporal Convolutional Features with Nested Long
Short-Term Memory
VIS Visible Light Spectrum
VGD Vanishing Gradient Descent
List of Figures
FIGURE 1.1 Example of Facial Expression Recognition based on Human-Computer
Interaction [15]…………………………………………………………….1
FIGURE 1.2 Different Facial Expressions of one person from the JAFFE dataset[4]……2
FIGURE 1.3 Use of Facial Expression Recognition System to identify suspicious person
or criminal at the airport, railway station or any crowded place[16]……...3
FIGURE 1.4 Use of Facial Expression Recognition System to identify students'
engagement level in online classes[17]……………………………………4
FIGURE 1.5 Use of Facial Expression Recognition System to play songs based on a
person's mood[18]…………………………………………………………4
FIGURE 1.6 Example of high similarity between facial expressions in two different
classes[6,7]………………………………………………………………...6
FIGURE 1.7 Example of facial expression images taken in an uncontrolled environment
of FER2013 dataset which contains challenges like varying illumination,
head pose variation, lower resolution and occlusion[9]…………………...6
FIGURE 1.8 Examples of laboratory trained facial expression dataset (a) CK+ (b) JAFFE
[6,14]………………………………………………………………………7
FIGURE 2.1 Conventional Facial Expression Recognition System[37]…………………11
FIGURE 2.2 Example of Image Rotation during the pre-processing phase[21]…………12
FIGURE 2.3 Example of Image Cropping during the pre-processing phase[21]………...12
FIGURE 2.4 Example of Illustration of the Intensity Normalization during the pre-
processing phase[21]……………………………………………………..13
FIGURE 2.5 Example of Face detection carried out on a sample image using OpenCV-
Python……………………………………………………………………14
FIGURE 2.6 Geometric and Appearance-based Feature Extraction[26]…………………14
FIGURE 2.7 Classification of different Feature Extraction Methods[28]………………..15
FIGURE 2.8 Feature Extraction using LBP Histogram Method[29]……………………..16
FIGURE 2.9 Two sample facial expressions on the left-hand side and its optical method
result available on the right-hand side[30]……………………………….18
FIGURE 2.10 Feature-Point Tracking method using feature points displacements[29]…19
FIGURE 2.11 Feature extraction process in a Convolutional Neural Network generating
feature maps[31]………………………………………………………….19
FIGURE 2.12 Example of Seven basic Facial Expressions from CK+ dataset[32]……...20
FIGURE 2.13 Example of Support vector and hyperplane in the SVM method[36]…….23
FIGURE 2.14 Evolution of Artificial Intelligence (AI)[38]……………………………...25
FIGURE 2.15 Working methodology difference between Machine Learning and Deep
Learning[39]……………………………………………………………...27
FIGURE 2.16 Basic structure of Neural Networks with Input, Hidden and Output
layers[39]…………………………………………………………………28
FIGURE 2.17 Basic CNN Architecture[41]……………………………………………...30
FIGURE 2.18 Basic Structure of Deep Autoencoders (DAE)[44]……………………….31
FIGURE 2.19 Basic Structure of Restricted Boltzmann Machine (RBM)[45]…………..32
FIGURE 2.20 Basic Structure of Deep Belief Network (DBN) [46]…………………….33
FIGURE 2.21 The Schematic diagram of RNN Node[43]……………………………….34
FIGURE 2.22 Basic Structure of Recurrent Neural Network (RNN)[43]………………..35
FIGURE 2.23 The Schematic diagram of LSTM block with memory cell and gates[45]..36
FIGURE 2.24 General CNN Structure in facial expression recognition system[47]……..37
FIGURE 2.25 Convolutional Operation with Image matrix multiplies kernel or filter
matrix[49]………………………………………………………………...38
FIGURE 2.26 Example of dot product in Convolutional operation with image and
filter[48]…………………………………………………………………..39
FIGURE 2.27 Convolutional operation with Stride size of 2[48]………………………...40
FIGURE 2.28 Rectified Linear Unit (ReLU) Activation function[49]…………………...41
FIGURE 2.29 Rectified Linear Unit (ReLU) operation[48]……………………………...41
FIGURE 2.30 Example of Max pooling and Average pooling operations[48]………..…42
FIGURE 2.31 Example of Flattening operation converting into a single vector[47]…….43
FIGURE 2.32 Representation of features at different stages in the network[48]………...44
FIGURE 2.33 Conceptual diagram of transfer learning where learning of a new task relies
on the previously learned task[53]……………………………………….45
FIGURE 2.34 General framework of Multi-Feature-Fusion model[54]………………….47
FIGURE 2.35 Framework of Inter-layer Feature-Fusion process[54]…………………....47
FIGURE 2.36 Framework of Ensemble Multi-CNN feature-fusion-based approach[56]..48
FIGURE 4.1 Two 3x3 convolutions replacing one 5x5 convolution[130]……………….70
FIGURE 4.2 Basic Inception Module (naïve version)[132]……………………………...71
FIGURE 4.3 Inception Module with Dimension Reductions[132]……………………….71
FIGURE 4.4 The schematic diagram of Inception-V3 architecture[131]………………...72
FIGURE 4.5 Factorization process of Module A in Inception-V3 architecture[130]…….73
FIGURE 4.6 Factorization process of Module B in Inception-V3 architecture[130]…….73
FIGURE 4.7 Factorization process of Module C in Inception-V3 architecture[130]…….73
FIGURE 4.8 General framework of Multi-Feature-Fusion model[54]…………………….74
FIGURE 4.9 Framework of Inter-layer Feature-fusion process[54]……………………...74
FIGURE 4.10 Proposed Multi-Layer Feature-Fusion based Classification (MLFFC)
model……………………………………………………………………..75
FIGURE 4.11 Example of images in the CK+ dataset with different emotions[126]……77
FIGURE 4.12 Example of images in the FER2013 dataset with different emotions[135].78
FIGURE 4.13 Confusion matrix using the proposed MLFFC model on the CK+ dataset..81
FIGURE 4.14 Classification report for the proposed MLFFC model on the CK+ dataset..82
FIGURE 4.15 ROC-AUC curve on the CK+ dataset for (a) without feature-fusion and (b)
with feature-fusion……………………………………………………….82
FIGURE 4.16 Accuracy graph of the proposed MLFFC model for the CK+ dataset for
batch size 8………………………………………………………..……...83
FIGURE 4.17 Accuracy graph of the proposed MLFFC model for the CK+ dataset for
batch size 16………………………………………………..…………….83
FIGURE 4.18 Confusion matrix using the proposed MLFFC model on the FER2013
dataset…………………………………………………………………….85
FIGURE 4.19 Classification report for the proposed MLFFC model on the FER2013
dataset…………………………………………………………………….86
FIGURE 4.20 ROC-AUC curve on the FER2013 dataset for (a) without feature-fusion and
(b) with feature-fusion ...…………...…………………………………….86
FIGURE 4.21 Accuracy graph of the proposed MLFFC model for the FER2013 dataset
with batch size 8………………………………………………………….87
FIGURE 4.22 Accuracy graph of the proposed MLFFC model for the FER2013 dataset
with batch size 16………………………………………………………...87
FIGURE 5.1 Sample architecture of Ensemble of Multi-CNN[55]………………………89
FIGURE 5.2 VGG-16 architecture diagram with its layers' details[55]………………….90
FIGURE 5.3 VGG-16 architecture diagram[146]………………………………………...91
FIGURE 5.4 Residual Learning: a building block concept[149]…………………………92
FIGURE 5.5 ResNet architecture diagram comparison to plain network[150]…………..93
FIGURE 5.6 Diagram showing conversion of residual block[150]………………………93
FIGURE 5.7 Sample framework of Ensemble of Multi-CNN feature-fusion[126]………94
FIGURE 5.8 Proposed Multi-Model Feature-Fusion based Classification (MMFFC)
model……………………………………………………………………..95
FIGURE 5.9 Example of images in the FER2013 dataset with different emotions[135]...97
FIGURE 5.10 Sample images in the KDEF dataset with different emotions[152]………98
FIGURE 5.11 Confusion matrix using the proposed MMFFC model on the FER2013
dataset…………………………………………………………………...102
FIGURE 5.12 Classification report for the proposed MMFFC model on the FER2013
dataset…………………………………………………………………...103
FIGURE 5.13 ROC-AUC curve on the FER2013 dataset for (a) without multi-model
fusion and (b) with multi-model fusion ………………………………...103
FIGURE 5.14 Accuracy graph of the proposed MMFFC model for the FER2013 dataset
for batch size 16………………………………………………………...104
FIGURE 5.15 Accuracy graph of the proposed MMFFC model for the FER2013 dataset
for batch size 32………………………………………………………...104
FIGURE 5.16 Accuracy graph of the proposed MMFFC model for the FER2013 dataset
for batch size 64………………………………………………………...104
FIGURE 5.17 Confusion matrix using the proposed MMFFC model on the KDEF
dataset…………………………………………………………………...106
FIGURE 5.18 Classification report for the proposed MMFFC model on the KDEF
dataset…………………………………………………………………...107
FIGURE 5.19 ROC-AUC curve on the KDEF dataset for (a) without multi-model fusion
and (b) with multi-model fusion ………………………………………..107
FIGURE 5.20 Accuracy graph of the proposed MMFFC model for the KDEF dataset for
batch size 16.……………………………………………………………108
FIGURE 5.21 Accuracy graph of the proposed MMFFC model for the KDEF dataset for
batch size 32.……………………………………………………………108
FIGURE 6.1 ImageNet performance evaluation with other ConvNets[159]……………111
FIGURE 6.2 Model Scaling Approach[159]…………………………………………….112
FIGURE 6.3 Scaling up a Baseline model with different network width (w), depth (d), and
resolution (r) [159]……………………………………………………...113
FIGURE 6.4 A basic block representation of the EfficientNet-B0[161]………………..114
FIGURE 6.5 A basic representation of Depthwise and Pointwise Convolutions in (a) and
(b)[161]………………………………………………………………….115
FIGURE 6.6 Proposed Novel FER model with EfficientNet-B7 and ResNet152
architecture……………………………………………………………...117
FIGURE 6.7 Sample images in the KDEF dataset with different emotions[152]………120
FIGURE 6.8 Example of images in the FER2013 dataset with different emotions[135].121
FIGURE 6.9 Confusion matrix using the proposed novel EfficientNet-B7 model on the
KDEF dataset…………………………………………………………...123
FIGURE 6.10 Classification report for the proposed novel EfficientNet-B7 model on the
KDEF dataset…………………………………………………………...124
FIGURE 6.11 Comparative analysis of recognition accuracy on the proposed
EfficientNet-B7 model and ResNet152 architecture by applying different
optimizers……………………………………………………………….124
FIGURE 6.12 Vanishing Gradient Descent problem due to Variance in model loss and
accuracy graph of the proposed EfficientNet-B7 model………………..126
FIGURE 6.13 Sample figure of batch normalization process with N as batch axis, C as the
channel axis and (H,W) as the spatial axes[163]………………………..127
FIGURE 6.14 Resultant smooth curve achieved by applying an Internal Batch
Normalization concept and reducing variance effect…………………...130
List of Tables
TABLE 2.1 Descriptions of seven facial expressions[34]………………………………...21
TABLE 3.1 Performance summary of facial expression recognition using deep-learning-
based approaches…………………………………………………………60
TABLE 3.2 Performance summary of Multi-Layer Feature-Fusion methods using deep
learning techniques……………………………………………………….64
TABLE 3.3 Performance summary of Multi-Layer Feature-Fusion methods using
Ensemble of CNN models using deep learning techniques……………...67
TABLE 4.1 Comparison accuracy on different layers on the proposed MLFFC
architecture……………………………………………………………….79
TABLE 4.2 Results on the CK+ dataset by using and not using feature-fusion approach on
the proposed MLFFC model……………………………………………..80
TABLE 4.3 Comparative analysis of the proposed MLFFC model with state-of-the-art
methods on the CK+ dataset……………………………………………...81
TABLE 4.4 Results on the FER2013 dataset by using and not using feature-fusion
approach on the proposed MLFFC model………………………………..84
TABLE 4.5 Comparative analysis of the proposed MLFFC model with state-of-the-art
methods on the FER2013 dataset………………………………………...85
TABLE 4.6 Comparative analysis of the Error-Rate on both the datasets using the
proposed MLFFC model………………………………………................88
TABLE 5.1 Comparison accuracy on two different CNN architectures using an ensemble
approach………………………………………………………………….99
TABLE 5.2 FER2013 dataset performance for VGG16, ResNet50 and proposed MMFFC
model using ensemble approach………………………………………...101
TABLE 5.3 Comparative analysis of proposed MMFFC model with state-of-the-art
methods on the FER2013 dataset……………………………………….102
TABLE 5.4 KDEF dataset performance for VGG16, ResNet50 and proposed MMFFC
model using an Ensemble approach…………………….………………105
TABLE 5.5 Comparative analysis of the proposed MMFFC model with state-of-the-art
methods on the KDEF dataset…………………………………………..106
TABLE 5.6 Comparative analysis of the Error-Rate on both the datasets using the
proposed MMFFC model…………………………………….................108
TABLE 6.1 EfficientNet-B0 Baseline Network[161]…………………………………...115
TABLE 6.2 Comparative analysis of the proposed EfficientNet-B7 model with
Optimizers………………………………………………………………122
TABLE 6.3 Comparative analysis on ResNet152 CNN architecture with different
Optimizers………………………………………………………………123
TABLE 6.4 Comparative result analysis of the proposed EfficientNet-B7 model with
different optimizers on the FER2013 dataset…………………………...125
CHAPTER 1
Introduction
1.1 Overview
Human-Computer Interaction (HCI) plays an essential role in everyone's day-to-day
activities. In this 21st century, we live in a digital world where most of our activities
are accomplished by computer-driven systems. With the rise of Artificial Intelligence (AI)
and its subareas, Machine Learning and Deep Learning, the ability to combine intelligent
models with computer vision systems has become a popular way to handle more complex
application areas. One area that has been getting a rising amount of attention is detecting
the emotional state of humans from their faces, because of its many potential applications
in today's world. This task is known as emotion detection or Facial Expression Recognition
(FER). A system that could automatically recognize human emotions from facial expressions
could play an essential role in a wide range of applications such as video games, identifying
a suspicious person, detecting a patient's pain at the hospital, online meetings, E-learning
systems, playing songs in a music player based on a person's mood, detecting a driver's
tiredness from expressions while driving, robotics, behavioural science, etc. [1].
FIGURE 1.1 Example of Facial Expression Recognition based on Human-Computer Interaction [15]
The human face plays a vital role in interpersonal communication, so understanding
someone's emotional state through facial expressions becomes easy. Facial Expression
Recognition is the process of identifying a person's mental state. These expressions test the
limits of a machine's ability to perceive emotions, helping it to sense a certain kind of
situation or action [2]. Emotions can be described in several ways, but the six basic universal
expressions proposed by Ekman et al. [3] are Anger, Disgust, Fear, Happy, Sad and Surprise.
Sample images of these expressions are shown in figure 1.2. A neutral expression was later
added to this emotion category, and many facial expression datasets now contain seven facial
expressions: Anger, Disgust, Fear, Happy, Neutral, Sad, and Surprise.
FIGURE 1.2 Different Facial Expressions of one person from the JAFFE dataset [4,14]
In recent years, deep learning strategies have achieved great success as well as better
accuracy than traditional methods, owing to inexpensive computational power. One
example, the Convolutional Neural Network (CNN), has obtained excellent state-of-
the-art results in the field of computer vision (e.g., image classification, face recognition,
object detection). Different CNN models have been successfully applied to FER and have
shown better results than conventional methods due to their efficiency in feature learning
and representation. A well-designed CNN trained on millions of images can set the
parameters of a series of filters, which capture both low-level generic features and high-level
semantic features. In addition, current Graphics Processing Units (GPUs) accelerate the
training of deep neural networks to address processing-time issues in the training and testing
phases. Although human beings can identify facial expressions correctly and effortlessly,
reliable automatic facial expression recognition by machines is still a challenge. Research is
ongoing to develop more reliable and robust deep learning models for facial expression
recognition. Many researchers are trying to improve the recognition accuracy and overcome
the limitations of existing deep learning models [5].
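The pre-trained-filter idea above can be illustrated with a small sketch: a CNN pre-trained on millions of ImageNet images is reused as a fixed feature extractor whose pooled responses feed a facial expression classifier. The backbone choice (VGG16) and sizes here are illustrative assumptions, not a specific model from this thesis.

```python
# Sketch: a pre-trained CNN as a fixed feature extractor for expression images.
import numpy as np
from tensorflow.keras.applications import VGG16

backbone = VGG16(weights="imagenet", include_top=False, pooling="avg",
                 input_shape=(224, 224, 3))
backbone.trainable = False  # keep the pre-trained filters fixed

# A placeholder batch of four face images; real inputs would be cropped faces.
faces = np.random.rand(4, 224, 224, 3).astype("float32")
features = backbone.predict(faces)  # shape (4, 512): one feature vector per face
```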
1.2 Research Motivation
Facial Expressions are responses to a person's internal emotional states, intentions or
social communications, and they help other people understand them. Facial variation
analysis has gained much attention from the scientific and industrial communities over
recent decades due to its potential value across information security and access control
applications, surveillance and image understanding. The ability to make computers
distinguish facial expressions and use that information in a Human-Computer Interface to
take intelligent decisions has generated considerable interest in the research community. If
computers can analyze and understand the different facial expressions of a person based on
their mood, then automatic intelligent decisions can be taken, which will be helpful in many
applications such as emergency cases at the hospital, a student's engagement and feedback
during an online class, identifying suspicious persons at airports or railway stations, etc. [10].
The applications listed above are illustrated in figures 1.3, 1.4 and 1.5.
FIGURE 1.3 Use of Facial Expression Recognition system to identify the suspicious person or criminal at
the airport, railway station or any crowded place [16]
FIGURE 1.4 Use of Facial Expression Recognition system to identify students’ engagement level in online
classes [17]
FIGURE 1.5 Use of Facial Expression Recognition system to play songs based on a person’s mood [18]
Automatic human facial expression recognition has been receiving increasing attention from
researchers in the deep learning area, and several solutions have been proposed. Most of the
existing work has focused on a single CNN architecture for facial expression databases.
Rather than expanding the layers of a CNN into a more complex deep CNN, researchers are
exploring the fusion of different internal layers and the fusion of different models to improve
recognition accuracy on facial expression databases that contain real-world images. Deep
learning techniques that automatically extract useful features with a fusion-based approach
are a new and active research direction compared with traditional methods for facial
expression recognition [11,12].
Deep Learning techniques have been shown to perform well in solving various computer
vision problems, which have not been possible using traditional machine learning techniques.
The application of deep learning techniques has surpassed the accuracy of classical methods in
several computer vision tasks. The advances in faster Graphics Processing Units (GPUs) and
models trained using deep learning in the ImageNet Challenge attract many researchers to
improve existing models’ recognition accuracy by tackling many research challenges.
Therefore, a reasonable approach would be to use a deep learning model to train the automatic
facial expression recognition (AFER) system and improve this task’s accuracy [13]. Motivated
by the above factors and the power of deep learning techniques, this research is carried out
with deep learning techniques to address the challenge of improving recognition accuracy
of lower resolution images for facial expression recognition and enhancing the
performance using a feature-fusion approach.
1.3 Research Challenges
Recognition of expressions by machines is still considered a challenge in the facial
expression recognition process. Humans can easily analyze and identify expressions, but
machines must identify expressions more accurately in order to make further intelligent
decisions. Many researchers are working to improve the recognition accuracy of facial
expression recognition systems by handling several challenges. Under a controlled
environment, FER systems work well and are no longer a substantial problem. However, it
is still a challenge for machines to make accurate decisions in real-life scenarios [11].
Many facial expression datasets contain acted rather than realistic expressions. This is a
drawback, since the real emotion is not fully conveyed; thus, when a system is used in the
real world, it may fail to properly recognize realistic expressions. Facial expression datasets
are available in two categories: laboratory-trained facial expression datasets and real-world
facial expression datasets. In real-world facial expression datasets, images are captured in an
uncontrolled environment, so they contain many challenges like illumination variation,
occlusions, head pose variations and lower-resolution images. In laboratory-trained facial
expression datasets, images are captured under a controlled environment, so they contain
fewer challenges than real-world facial expression datasets [13].
As facial expressions vary from person to person due to differences in age, culture and
gender, recognizing emotion from the face is very challenging, and this is another issue in
facial expression recognition systems. Variation of image size, orientation of the face,
glasses or masks on faces, and lighting conditions are factors that increase the complexity
of the recognition task. Another research challenge is the high similarity between two
specific classes of facial expressions, e.g., disgust with anger and sadness with fear, which
leads to misclassification [8,11]. This problem is shown in figure 1.6. Challenges like
non-frontal faces, lower resolutions, varying lighting conditions, and occlusions in the
images of facial expression datasets are shown in figure 1.7. A laboratory-trained facial
expression dataset, where images are taken in a controlled environment, is shown in
figure 1.8.
(a) Disgust (b) Angry (c) Sadness (d) Fear
FIGURE 1.6 Example of high similarity between facial expressions in two different classes [6,7]
FIGURE 1.7 Example of facial expression images taken in an uncontrolled environment of FER2013 dataset
which contains challenges like varying illumination, head pose variation, lower resolution and occlusion [9]
FIGURE 1.8 Examples of laboratory trained facial expression datasets (a) CK+ (b) JAFFE [6,14]
Also, a large amount of training data is required to carry out the feature extraction process
efficiently. A significant challenge that deep FER systems face is a lack of sufficient
training data in terms of quality and quantity. To overcome the above challenges, robust
and reliable feature extraction techniques are required. Despite CNNs' better performance
in FER systems, robust CNN-based FER remains a challenging unsolved problem [12].
1.4 Problem Statement
The key problem to be addressed in this study is to recognize facial expressions more
precisely from lower-resolution images. Facial expression datasets are classified into two
categories. First, laboratory-trained datasets, where images are captured in a controlled
environment. Second, real-world datasets, where images are captured in an uncontrolled,
open environment. The recognition task becomes more challenging for real-world datasets
due to various challenges like varying illumination, occlusion, head pose variations, and
lower-resolution images. Many researchers are trying to improve recognition accuracy by
tackling these challenges of facial expression datasets using deep learning techniques. This
will further enhance the facial expression recognition system's performance and make it
accurate enough to use in different applications. Based on the above analysis, the problem
definition is:
“To develop deep learning models for facial expression recognition using deep
learning techniques to extract unique and distinct features from images for achieving
better recognition accuracy compared to the existing state-of-the-art research work.”
1.5 Research Objectives and Scope
The objectives and scope of our research work are as follows:
1.5.1 Research Objectives
• To study and investigate various deep learning methods and models for facial
expression recognition
• To study and investigate existing feature fusion-based techniques used in
convolutional neural networks for improving recognition accuracy of a facial
expression recognition system
• To design and develop effective proposed models for efficient facial expression
recognition using deep learning techniques
• To evaluate and validate the performance of proposed models on laboratory trained
and standard facial expression datasets which contain real-world images
1.5.2 Scope of Research Work
• The proposed research work is evaluated on facial expression datasets containing
images captured in an uncontrolled environment, which include challenges like
illumination variation, head pose variation, and lower resolution images. It is also
evaluated on a laboratory trained facial expression dataset, containing images captured
in a controlled environment, for a cross-database evaluation study.
• In the proposed research work, all seven facial expressions are considered: Happy,
Angry, Sad, Neutral, Surprise, Disgust, and Fear. Only frontal faces without
occlusion have been considered from the facial expression databases for
evaluation.
• An ensemble of models with feature-fusion deep learning methods is applied to
produce accurate classification of the expressions, which further enhances
recognition accuracy. The recent EfficientNet architecture, applied to the facial
expression dataset, improves recognition accuracy by addressing the vanishing
gradient issue.
1.6 Organization of the Thesis
The contents of the thesis are organized as follows.
Chapter 2 presents an overview of the Facial Expression Recognition System, the
Convolutional Neural Network (CNN), and Deep Learning.
Chapter 3 presents a comprehensive literature survey for facial expression recognition
using deep learning techniques with their pros and cons. This chapter also reviews and
analyses multi feature-fusion based deep learning approaches. In addition, information
related to the facial expression datasets used for this research is described. This
survey helps to identify the research gaps and challenges addressed in this research. Three
types of algorithms are proposed in the thesis, and each is explained in chapters 4, 5,
and 6.
Chapter 4 describes the proposed Multi-Layer feature-fusion based Classification
(MLFFC) model using the InceptionV3 CNN architecture. An Inter-Layer feature fusion
technique is applied in this model, integrating feature maps from different
layers instead of considering only the last layer. The proposed model is tested on different
internal layers of module C in InceptionV3 architecture as it contains higher feature
representations. It is found that concatenation of the internal layer with the final feature
vector layer improves the recognition accuracy. The standard CK+ and FER2013 datasets
are used to evaluate the proposed model.
Chapter 5 describes the proposed Multi-Model Feature-Fusion based Classification
(MMFFC) model, which uses an Ensemble of CNN model approach. In this model, the
concatenation of features of different layers from various networks helps to overcome the
limitation of a single network and produces robust and superior performance. Different
combinations of CNNs have been tested for this approach, and finally an ensemble of the
VGG16 and ResNet50 architectures is selected and applied in this proposed model. The
standard KDEF and FER2013 datasets are used to evaluate the proposed model.
Chapter 6 describes the novel concept of the EfficientNet architecture. A facial expression
recognition system is investigated and implemented using the EfficientNetB7 architecture.
EfficientNet, introduced in 2019, uses the compound scaling method to scale up CNNs in
a more structured way. From the literature review, it is found that no work had been
carried out for facial expression recognition using this architecture to date. Different
optimizers (SGD, RMSProp, and Adam) are applied to determine which gives better
performance in terms of recognition accuracy. Moreover, the vanishing gradient issue is
addressed by applying the proposed internal batch normalization method. This concept
works well with higher resolution images, so the standard KDEF dataset is used to
evaluate this model. The FER2013 dataset is also used for a cross-database study.
Chapter 7 contains the conclusion in which the contributions made in this thesis are
summarized, and the scope of further enhancement is outlined.
CHAPTER 2
Theoretical Background
2.1 Facial Expression Recognition System
A facial expression recognition system identifies an individual's emotion based on that
person's facial features. Using a deep learning approach, facial expression recognition
commonly involves three steps: Face Detection, Feature Extraction and Classification.
Figure 2.1 shows the structure of a deep learning-based facial expression recognition
system [19].
FIGURE 2.1 Conventional Facial Expression Recognition System [37]
2.1.1 Pre-processing for Face Detection
The first step in the facial expression recognition system structure is face detection which
is a pre-processing part. In this phase, the main task is to obtain pure facial images with
normalized intensity, uniform size and shape. Pre-processing the input image helps to
remove noise (unwanted information) and to compensate for illumination variations if
required. Converting an image into a normalized image for the feature extraction task
involves steps like Face Alignment, Data Augmentation and Face Normalization. These
steps include processes like detecting feature points, rotating to line up, and locating and
cropping the face region using a rectangle according to the model [20]. Examples of the
Image Rotation, Image Cropping and Image Intensity Normalization processes are shown
in figures 2.2, 2.3 and 2.4, respectively.
FIGURE 2.2 Example of Image Rotation during the pre-processing phase [21]
FIGURE 2.3 Example of Image Cropping during the pre-processing phase [21]
FIGURE 2.4 Example of Illustration of the Intensity Normalization during the pre-processing phase [21]
Face detection is one of the most studied topics in the computer vision area, not only
because of the challenging nature of the face as an object, but also due to the many
applications that require face detection as a first step. During the past 15 years,
tremendous progress has been made, thanks to the availability of data in unconstrained
environments (so-called 'in-the-wild') through the Internet, towards developing robust
facial expression recognition algorithms using deep learning techniques [22].
Face detection refers to detecting the face region in a frame or image. The Viola-Jones
algorithm [23] was the first to make face detection practically feasible in real-world
applications. Instead of working with image intensities, Papageorgiou et al. [24] developed
a framework based on a Haar wavelet representation in 1998. Later, in 2001, Viola and
Jones further developed this idea by proposing Haar-like features, which represent the
changes of texture or edges in particular facial regions and can be computed much faster
than operating on raw pixels. The OpenCV (Open Source Computer Vision) library is
also used for processing the images; it comes with a programming interface to Python.
OpenCV-Python is used in many algorithms to detect frontal faces from images using the
HaarCascade classifier function. It detects faces by drawing a rectangular box around each
frontal face in the image [25]. As an exercise, a sample image of mine with my friends
was processed using HaarCascade with OpenCV-Python, and several frontal faces were
detected, as shown in figure 2.5.
FIGURE 2.5 Example of Face Detection carried out on a sample image using OpenCV-Python
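A minimal OpenCV-Python sketch of the HaarCascade face detection described above is given below; the image file names are illustrative, and the scaleFactor and minNeighbors values are common defaults rather than values prescribed by this thesis:

```python
import cv2

# Load OpenCV's pre-trained frontal-face Haar cascade
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("group_photo.jpg")               # illustrative input file
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)      # the cascade operates on grayscale

# detectMultiScale returns one (x, y, w, h) bounding box per detected face
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:                        # draw a rectangle on each frontal face
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("group_photo_faces.jpg", img)
```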
2.1.2 Feature Extraction
Feature Extraction usually occurs immediately after face detection. It can be considered
one of the essential stages of facial expression recognition, as the system's effectiveness
depends on the quality of the extracted features. The changes in the facial expression can be either
based on minor deformations in wrinkles/bulges or based on significant deformations in
eyes, eyebrow, mouth, nose, etc. The feature extraction process is classified as Appearance
based features (non-geometric/non-structural features) and Geometric/Structural based
features as shown in figure 2.6 [26]
FIGURE 2.6 Geometric and Appearance-based Feature Extraction [26]
Geometric-based features represent the contour and position of face parts like the
forehead, eyes, nose, lips and chin. These features form a feature vector known as the
face geometry. Geometric feature extraction encodes these features using points,
stretches, angles and other geometric relationships among the components. In the
appearance-based feature extraction method, a single image filter or a filter bank is
applied either to the complete image or to a part of the image to extract changes in
appearance [27].
Feature extraction can be performed using various mathematical models, image processing
techniques and computational intelligence tools such as neural networks or fuzzy logic.
Feature extraction methods are classified into four categories, namely: feature-based,
appearance-based, template-based and part-based approaches, as shown in figure 2.7 [28].
FIGURE 2.7 Classification of different Feature Extraction Methods [28]
Feature extraction may directly influence algorithms’ performance, which is usually the
bottleneck of the facial expression recognition system. Widely used feature extraction
methods in FER systems mainly include Gabor Filter, Local Binary Pattern (LBP), Optical
Flow method, Haar-like feature extraction, Feature point tracking etc.
A. Gabor Filter:
Gabor filters are a set of wavelets, each of which concentrates its energy at a particular
frequency and orientation; expanding a signal over this set of wavelets gives a localized
frequency descriptor that captures features of the signal. One specialty of the Gabor filter
is that its frequency (or illumination) scale and orientation properties can be tuned, so in
many applications where the object of interest may appear at different scales and poses, a
multi-scale, multi-orientation Gabor filter is the most suitable for
feature extraction. The Gabor kernel is represented as the product of a 2D Gaussian kernel
(GK) and a sinusoidal kernel, as given by equation (2.1) [27]:

$$ g(x, y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\!\left[ -\frac{1}{2}\left( \frac{x'^2}{\sigma_x^2} + \frac{y'^2}{\sigma_y^2} \right) \right] \exp\!\left( \frac{2\pi i\, x'}{\lambda} \right) \qquad (2.1) $$

where $(x, y)$ is the position in the digital image, $\sigma_x$ and $\sigma_y$ are the standard deviations in the
x and y directions respectively, $\theta$ is the projection angle and $\lambda$ is the reciprocal of the projection
frequency. The rotated coordinates $x'$ and $y'$ can be found using equation (2.2):

$$ x' = x\cos\theta + y\sin\theta, \qquad y' = -x\sin\theta + y\cos\theta \qquad (2.2) $$
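As an illustration of this tunable multi-scale, multi-orientation property, the sketch below builds a small Gabor filter bank with OpenCV; the kernel size and the sigma, lambda and gamma values are arbitrary demonstration choices, not values prescribed by this thesis:

```python
import cv2
import numpy as np

img = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE)   # illustrative input image

# Four orientations at one fixed scale; in practice several scales are used too
responses = []
for theta in np.arange(0, np.pi, np.pi / 4):
    kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                lambd=10.0, gamma=0.5, psi=0)
    responses.append(cv2.filter2D(img, cv2.CV_32F, kernel))

# Stack the filter responses into a multi-orientation feature representation
features = np.stack(responses, axis=-1)
```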
B. Local Binary Pattern (LBP):
The LBP calculates the brightness relationship between each pixel contained in the image
and its local neighbourhood. Binary sequences are then coded to create a local binary
pattern. Finally, it uses a multi-region histogram as a feature description of the image as
shown in figure 2.8 [29]
FIGURE 2.8 Feature Extraction using LBP Histogram Method [29]
The LBP operator can be extended to pixel neighbourhoods of various sizes; the size of the
pixel neighbourhood is not limited here. It is formulated as per equation (2.3):

$$ LBP_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p, \qquad s(z) = \begin{cases} 1, & z \ge 0 \\ 0, & z < 0 \end{cases} \qquad (2.3) $$

where $g_c$ is the gray value of the centre pixel $(x_c, y_c)$ and $g_p$ ($p = 0, \dots, P-1$) are the gray
values of its $P$ neighbours on a circle of radius $R$.
Compared with the Gabor wavelet, the LBP operator requires less storage space and offers
higher computational efficiency. However, the LBP operator is ineffective on images with
noise.
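As a hedged sketch of the multi-region LBP histogram descriptor described above, the function below uses scikit-image's local_binary_pattern; the choice of uniform patterns and four horizontal regions is illustrative only:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(image, P=8, R=1.0, regions=4):
    """Uniform LBP codes pooled into a multi-region histogram descriptor."""
    codes = local_binary_pattern(image, P, R, method="uniform")
    n_bins = P + 2                                  # uniform patterns + one extra bin
    h_step = image.shape[0] // regions
    hists = []
    for i in range(regions):                        # one histogram per horizontal band
        band = codes[i * h_step:(i + 1) * h_step]
        hist, _ = np.histogram(band, bins=n_bins, range=(0, n_bins))
        hists.append(hist / max(hist.sum(), 1))     # normalize each region
    return np.concatenate(hists)                    # final feature vector

face = np.random.rand(64, 64)                       # placeholder grayscale face image
print(lbp_histogram(face).shape)
```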
C. Principal Component Analysis (PCA):
PCA is a transform that chooses a new coordinate system for the dataset such that the
greatest variance by any projection of the dataset comes to reside on the first axis, the
second greatest variance lies on the second axis, and so on. The goal of PCA is to reduce
the dimensionality of the data while retaining as much as the information present in the
original dataset. PCA has the ability to compress the data to lower dimensions by keeping
the most informative dimensions and rejecting the noisy and unnecessary dimensions so
that the data can be fed to machine learning algorithms. As per its name, the principal
components are the directions along which the data has the most variance, i.e., the
directions where the data is most spread out, unlike other transform methods [27,28].
PCA constructs a set of dominant features as per equation (2.4), where each newly
composed dominant feature $y_i$ is a linear combination of the primary features $x_1, \dots, x_n$:

$$ y_i = w_i^{T} x = \sum_{j=1}^{n} w_{ij}\, x_j \qquad (2.4) $$

The dominant feature set is a set of mutually perpendicular axes, for which the condition
is given as follows:

$$ w_i^{T} w_j = 0 \quad \text{for } i \ne j $$
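A minimal sketch of PCA-based dimensionality reduction as described above, using scikit-learn; the data and the number of retained components are placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA

# X: n_samples x n_pixels matrix of flattened face images (placeholder data)
X = np.random.rand(200, 64 * 64)

pca = PCA(n_components=50)              # keep the 50 most informative directions
X_reduced = pca.fit_transform(X)        # project onto mutually orthogonal axes

# Each row of pca.components_ is one dominant feature (principal axis);
# the axes are orthonormal, so their pairwise dot products are ~0.
print(X_reduced.shape, pca.explained_variance_ratio_[:5])
```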
D. Discrete Cosine Transform (DCT):
The discrete cosine transform (DCT) is used to transform and compress the training or test
image in the frequency domain without losing the key features of the image. DCT
represents the whole image as coefficients of cosines of various frequencies. In DCT, the
low-frequency components of an image are extracted, as they carry the higher magnitudes,
and the remaining high-frequency components are rejected. The low-frequency area lies in
the upper-left corner of the DCT matrix, and the coefficient frequencies increase
diagonally towards the lower-right corner. Numerous techniques can extract the low-
frequency area, but the zigzag selection technique gives an efficient selection [27].
The 1-Dimensional (1D) DCT is defined as per equation (2.5):

$$ C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x) \cos\!\left[ \frac{(2x+1)u\pi}{2N} \right], \qquad u = 0, 1, \dots, N-1 \qquad (2.5) $$

The 2-Dimensional (2D) DCT is defined as per equation (2.6):

$$ C(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y) \cos\!\left[ \frac{(2x+1)u\pi}{2N} \right] \cos\!\left[ \frac{(2y+1)v\pi}{2N} \right] \qquad (2.6) $$

where $\alpha(u) = \sqrt{1/N}$ for $u = 0$ and $\alpha(u) = \sqrt{2/N}$ otherwise.
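A short sketch of 2-D DCT feature extraction as described above, using SciPy; keeping the top-left k-by-k sub-block is a simple stand-in for the zigzag selection of low-frequency coefficients:

```python
import numpy as np
from scipy.fft import dctn

block = np.random.rand(8, 8)                 # placeholder 8x8 image block

coeffs = dctn(block, type=2, norm="ortho")   # 2-D DCT, as in equation (2.6)

# Low-frequency coefficients sit in the upper-left corner of the DCT matrix
k = 4
low_freq_features = coeffs[:k, :k].flatten()
print(low_freq_features.shape)
```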
E. Optical Flow Method:
Optical flow is the pattern of apparent motion caused by relative motion between an
observer and the scene. The optical flow method's basic principle is that each pixel in an
image is assigned a velocity vector; these velocity vectors form a motion field for the
image. At each instant of motion, an image point corresponds to an actual object point. In
the field of FER, the optical flow method is widely used to extract facial expression
features from dynamic image sequences, since it highlights facial deformation and reflects
the motion trend of image sequences, as shown in figure 2.9 [30].
FIGURE 2.9 Two sample facial expressions on the left-hand side and their optical flow method results on
the right-hand side [30]
F. Feature Point Tracking:
The primary purpose of the feature point tracking method is to synthesise the input
emotional expressions according to the displacement of feature points, as shown in figure
2.10. Feature point tracking methods often select feature points subject to large changes,
such as the eye corners and mouth corners; tracking these points then yields facial feature
displacement or deformation information [29,30].
FIGURE 2.10 Feature-Point Tracking method used with feature points displacement [29]
G. Feature Extraction in CNN:
In the case of a Convolutional Neural Network (CNN), the feature extraction process
includes several convolutional layers followed by max-pooling and an activation function,
as per its architecture, shown in figure 2.11 below. All these layers generate feature maps
during the feature extraction process [31].
FIGURE 2.11 Feature extraction process in a Convolutional Neural Network generating feature maps [31]
2.1.3 Facial Expression Classification
The last step of the facial expression recognition system is the classification that can be
realized either by attempting recognition or by interpretation. FER deals with the
classification of the face and its features into abstract classes that are entirely based on
visual information. Facial expression classification aims to design an appropriate
classification mechanism to identify facial expressions. Earlier, facial expressions were
categorized into six basic emotions: Disgust, Anger, Fear, Surprise, Happy and Sad; over
time, much of the recent research has added the Neutral expression to this list. Hence
facial expressions are categorized into seven basic emotions: Disgust, Anger, Fear,
Neutral, Surprise, Happy and Sad. Examples of the seven basic emotions are shown in figure
2.12 [33].
FIGURE 2.12 Example of Seven basic Facial Expressions from CK+ dataset [32]
To identify the above-listed facial expressions, a process must be able to recognize facial
feature movements. According to these movements, the different emotions are classified
into seven categories, as described in table 2.1 below.
TABLE 2.1 Descriptions of seven facial expressions [34]

Emotion Class | Description of Facial Expressions
------------- | ----------------------------------
Happy         | Eyebrows are relaxed. The mouth is open, and the mouth corners are upturned.
Sad           | Eyes are slightly closed. Eyebrows are bent upward, and the mouth is relaxed.
Fear          | Eyebrows are raised and pulled together. Eyes are open and tensed.
Anger         | Eyebrows are pulled downward and together. Eyes are wide open, and lips are tightly closed.
Surprise      | Eyebrows are raised. Eyes are wide open, and the mouth is open.
Disgust       | Eyebrows and eyelids are relaxed. The upper lip is raised and curled, often asymmetrically.
Neutral       | Eyebrows, eyes and mouth are relaxed.
Some of the most relevant classification methods are the Support Vector Machine (SVM),
the AdaBoost method, K-Nearest Neighbour (KNN), the Hidden Markov Model (HMM),
Bayesian classification, etc., which are explained below [30].
A. Hidden Markov Model (HMM):
Hidden Markov Model (HMM) is a Markov process containing hidden, unknown
parameters and can effectively describe the statistical model of the random signal
information. HMM consists of two interrelated processes. One is the underlying and
unobservable Markov chain with a certain number of states. The other is a set of
probability density distributions corresponding to each state [30]. An HMM can be defined
by the following triplet:

$$ \lambda = (A, B, \pi) $$

where A is the state transition probability matrix, B is the observation probability
distribution, and π is the initial state distribution. In a discrete density HMM, B represents
a matrix of probability entries. In a continuous density HMM, B is denoted by the
parameters of the probability distribution function of observations such as the Gaussian
distribution function or a mixture of Gaussians. HMM-based face recognition methods
have the following advantages: they allow for expression changes and large head
rotations, and they do not need to retrain on all samples after new samples are added;
however, part of the parameters must be given by experience [35].
B. Bayesian Network:
A Bayesian network is a probabilistic graphical model based on the Bayesian formula that
presents random variables via directed acyclic graphs. Bayesian networks, based on
probabilistic reasoning, were developed to solve uncertainty and incompleteness problems.
A Bayesian classifier represents the dependencies among feature data and sample labels
using a directed acyclic graph. Generally, Bayesian Network classifiers can be learned
using a fixed structure – the naïve-Bayes classifier [30].
Given a Bayesian Network classifier with a parameter set $\Theta$, the optimal classification
rule based on the maximum-likelihood idea, which classifies an observed $n$-dimensional
feature vector $x = (x_1, \dots, x_n)$ into one of $|C|$ class labels, $c \in \{1, 2, \dots, |C|\}$, is denoted by [30]:

$$ \hat{c} = \arg\max_{c \in \{1, \dots, |C|\}} P_\Theta(x_1, \dots, x_n \mid c) $$

The Bayesian network can improve classification accuracy. Still, it requires many
parameters, part of which must be supplied from human experience, and the estimated
result deviates from the actual result if the number of training samples is small [35].
C. K-Nearest Neighbor (KNN):
KNN is a type of instance-based learning classification algorithm. The principle of the
KNN method is that each sample has k closest samples in the feature space, and a
sample's label is assigned to the class most common among its k nearest neighbours by
majority vote. Without prior knowledge, the KNN classification algorithm frequently
employs the Euclidean distance as the distance metric [30]. Given two vectors
$x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, their Euclidean distance is given as:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$
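A minimal scikit-learn sketch of KNN classification under the Euclidean distance just defined; the feature vectors and labels are random placeholders standing in for extracted expression features:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(100, 32)             # placeholder feature vectors
y_train = np.random.randint(0, 7, size=100)   # placeholder labels (7 classes)
X_test = np.random.rand(10, 32)

# Majority vote among the 5 nearest neighbours under Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```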
D. Support Vector Machine (SVM):
The Support Vector Machine method is based on the structural risk minimization principle
for classification. It constructs a hyperplane or set of hyperplanes in a high- or infinite-
dimensional space. Training data points are marked as belonging to one of the categories,
and the hyperplane is chosen to have the largest distance to the nearest points of the
categories [35]. The principle of SVM is to transform the input vectors to a higher-
dimensional space by a non-linear transform; an optimal hyperplane which separates the
data can then be found [30].
The SVM algorithm finds the best line or decision boundary; this best boundary or region
is called a hyperplane. The algorithm then finds the points from both classes that lie
closest to this boundary; these points are called support vectors. The distance between the
support vectors and the hyperplane is called the margin, and the goal of SVM is to
maximize this margin. The hyperplane with maximum margin is known as the optimal
hyperplane, as shown in figure 2.13 below [36].
FIGURE 2.13 Example of Support vector and hyperplane in the SVM method [36]
There are four types of kernel functions for the SVM model: the linear kernel, the
polynomial kernel, the radial basis function kernel and the sigmoid kernel, given
below [30].
The linear kernel function is given as:

$$ K(x_i, x_j) = x_i^{T} x_j $$

The polynomial kernel function is given as:

$$ K(x_i, x_j) = \left( \gamma\, x_i^{T} x_j + r \right)^{d}, \quad \gamma > 0 $$

The radial basis kernel function is given as:

$$ K(x_i, x_j) = \exp\!\left( -\gamma \lVert x_i - x_j \rVert^2 \right), \quad \gamma > 0 $$

The sigmoid kernel function is given as:

$$ K(x_i, x_j) = \tanh\!\left( \gamma\, x_i^{T} x_j + r \right) $$
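The four kernels above map directly onto scikit-learn's SVC kernel options, as the hedged sketch below shows; the training data are random placeholders:

```python
import numpy as np
from sklearn.svm import SVC

X_train = np.random.rand(100, 32)             # placeholder feature vectors
y_train = np.random.randint(0, 7, size=100)   # placeholder labels (7 classes)

# Fit one SVM per kernel type; gamma="scale" is a common default
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, gamma="scale")
    clf.fit(X_train, y_train)
    print(kernel, "train accuracy:", clf.score(X_train, y_train))
```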
E. Adaboost Algorithm Method:
The core idea of Adaboost is to combine weak classifiers into a more robust final
classifier by changing the distribution of the data. One researcher proposed a
classification method based on the Adaboost algorithm, where Haar features were used to
construct a weak classifier space and a facial expression classifier was obtained by
learning with the continuous Adaboost algorithm. That work concludes that this algorithm
is faster than the support vector machine. However, research also shows that this
classification method is not suitable for small samples [35].
F. Classification using Artificial Neural Network:
The Artificial Neural Network (ANN) is an information-processing system that simulates
the human brain's neural system. It is a flexible mathematical structure which can capture
complex non-linear relationships between input data and output data. A neuron in an
artificial neural network is defined by a set of input values and associated weights. A
training function combines the weighted inputs and maps the result to an output. In
general, a neural network structure organizes neurons into three kinds of layers: the input
layer, hidden layers and the output layer. The input layer comprises the values of the
records that are given as inputs to the next layer of neurons. The hidden layer comes next;
a neural network may consist of several hidden layers, and their number can vary based on
the application. The output layer is the final layer, where one node is available for each
class. Artificial Neural Networks have the advantage of high speed due to their parallel
processing mechanism, and their distributed storage supports robust feature extraction and
a self-learning function, while their highly parallel, non-linear characteristics limit
development to some degree [35,30].
2.2 Deep Learning for Facial Expression Recognition
2.2.1 Evolution of AI: Machine Learning and Deep Learning
Artificial Intelligence (AI) is, just as the words imply, intelligence that is artificial:
programmed by humans to perform human activities. This artificial intelligence is
incorporated into computer systems to create AI systems that ultimately function as units
of “thinking machine”. Humans design AI systems to make decisions from historical or
real-time data or both. AI systems have the ability to learn and adapt as they compile
information and make decisions. AI systems often incorporate machine learning, deep
learning and data analytics with artificial intelligence that enable intelligent decision
making. This intelligence is not human intelligence. It’s the machine’s best approximation
to human intelligence [38].
FIGURE 2.14 Evolution of Artificial Intelligence (AI) [38]
As shown in figure 2.14, machine learning is a subset of AI, meaning that we can build
intelligent machines that can learn on their own from a provided dataset. Further, deep
learning is a subset of machine learning, in which similar machine learning algorithms
are used to train deep neural networks to achieve better accuracy in those cases where the
former does not perform up to the mark.
Machine Learning:
It is an application of artificial intelligence that provides the AI system with the ability to
learn from the environment and apply that learning to make better decisions. Three
categories of machine learning algorithms make this possible: Supervised Machine
Learning, Unsupervised Machine Learning, and Reinforcement Learning, explained
below [39]. Machine learning uses a variety of algorithms to iteratively learn from,
describe and improve data in order to predict better outcomes. These algorithms use
statistical techniques to spot patterns and then perform actions on these patterns.
Supervised Machine Learning: “Supervised” means that a teacher helps the program
throughout the training process: there is a training set with labelled data. For example, you
want to teach the computer to put red, blue and green socks into different baskets. First,
you show the system each of the objects and tell it which is which. Then, the program is
run on a validation set that checks whether the learned function is correct. This type
of learning is commonly used for classification and regression. This method’s algorithms
are Naïve Bayes, Support Vector Machine, Decision Tree, K-Nearest Neighbours, Logistic
Regression, Linear and Polynomial regression etc.
Unsupervised Machine Learning: In unsupervised learning, you do not provide any
labels; the program searches for patterns independently. Imagine you have a big
laundry basket that the computer has to separate into different categories: socks, t-shirts,
jeans etc. This is called Clustering, and unsupervised learning is often used to divide data
into groups by similarity. Unsupervised learning is also suitable for insightful data
analytics. For example, it can be used to find fraudulent transactions, forecast sales and
discounts or analyze customer’s preferences based on their search history. The
programmer does not know what they are trying to find, but there are indeed some
patterns, and the system can detect them. This method’s algorithms are K-means
clustering, DBSCAN, Mean-Shift, Principal Component Analysis (PCA), Singular Value
Decomposition (SVD) etc.
Reinforcement Learning: This is very similar to how humans learn: through trial and
error. Humans don't need constant supervision to learn effectively, as in supervised
learning; by receiving only positive or negative reinforcement signals in response to our
actions, we still learn effectively. For example, a child learns not to touch a hot pan after feeling pain.
One of the essential parts of reinforcement learning is that it allows you to step away from
training on static datasets. Instead, the computer can learn in a dynamic, noisy
environment such as the real world. These methods are used in applications such as
self-driving cars, games, robots, resource management, etc.
Deep Learning:
Deep learning is a subset of machine learning. Deep learning models can make their own
predictions entirely independently of humans, whereas earlier machine learning models
still need human intervention in many cases to arrive at the optimal outcome. Deep
learning models are built on artificial neural networks, whose design is inspired by the
biological neural network of the human brain. It analyses data with a logical structure
similar to how a human would draw conclusions, as shown in figure 2.15 below.
FIGURE 2.15 Working Methodology difference between Machine Learning and Deep Learning [39]
Deep learning is the next generation of machine learning algorithms that use multiple
layers to extract higher-level features from raw input. For instance, in image recognition
applications, instead of just recognizing matrix pixels, deep learning algorithms will
recognize edges at a certain level, nose at another level, and face at yet another level. With
the ability to understand data from the lower level all the way up the chain, a deep learning
algorithm can improve its performance over time and arrive at decisions at any given
moment in time.
Deep learning algorithms use complex multi-layered neural networks, where the level of
abstraction increases gradually by non-linear transformations of input data. In a neural
network, the information is transferred from one layer to another over connecting
channels, and they are known as weighted channels because each of them has a value
attached to it. Each neuron has a unique number called the bias. This bias is added to the
weighted sum of inputs reaching the neuron, and the result is passed to the activation
function. The output of the function determines whether the neuron gets activated. Every activated neuron
passes on information to the following layers. This continues up to the second last layer.
The output layer in an artificial neural network is the final layer that produces outputs for
the program.
Most deep learning methods use neural network architectures, so deep learning models are
often referred to as Deep Neural Networks. The term “deep” usually refers to the number
of hidden layers in the neural network, as shown in figure 2.16. Traditional neural
networks only contain 2-3 hidden layers, while deep networks can have as many as 150.
Deep learning models are trained by using large sets of labelled data and neural network
architectures that learn features directly from the data without the need for manual feature
extraction.
FIGURE 2.16 Basic Structure of Neural Networks with Input, Hidden and Output layers [39]
2.2.2 Deep Neural Networks:
A Deep Neural Network (DNN) is an artificial neural network (ANN) with multiple
hidden layers of units between the input and output layers. Similar to shallow ANNs,
DNNs can model complex non-linear relationships. DNNs are typically designed as Feed
Forward Networks. Data flows from the input layer to the output layer without going
backwards. The links between the layers are one-way, in the forward direction, and they
never touch a node again. Different architectures have been developed to solve problems
in various domains and use-cases; e.g., CNNs are used most of the time in computer
vision and image recognition, and RNNs are commonly used in time series
problems/forecasting. More recently, CNNs have been applied to acoustic modelling for
Automatic Speech Recognition, where they have shown success over previous models [40].
Different architectures have been designed and developed as part of deep neural networks.
The most common deep neural network architectures are listed below:
1. Convolutional Neural Network (CNN)
2. Deep Auto Encoder (DAE)
3. Restricted Boltzmann Machine (RBM)
4. Deep Belief Network (DBN)
5. Recurrent Neural Network (RNN)
6. Long Short-Term Memory (LSTM)
2.2.2.1 Convolutional Neural Networks (CNN):
Convolutional Neural Networks (CNN) have broad applications in video and image
recognition, natural language processing, speech recognition, and computer vision
including Facial Expression Recognition. From several studies, it is found that CNN is
robust to face location changes and scale variations and behaves better than the multi-layer
perceptron (MLP) in the case of previously unseen face pose variations. CNNs have
several advantages over generic DNNs, including their close similarity to the human
visual processing system, being well adapted to the structure of 2D and 3D image
processing, and the effective learning and extraction of 2D features [41].
FIGURE 2.17 Basic CNN Architecture [41]
As shown in figure 2.17, CNN has three types of different layers: convolutional layers,
pooling layers, and fully connected layers. The convolutional layer has a collection of
learnable filters to convolve through the whole input image and produce various specific
types of activation feature maps. The convolution operation is associated with three main
advantages: local connectivity which learns correlations among neighbouring pixels;
weight sharing in the same feature map which significantly reduces the number of the
parameters to be learned; and shift-invariance to the location of the object. The pooling
layer follows the convolutional layer and is used to reduce the spatial size of the feature
maps and the network’s computational cost. Average pooling and max pooling are the two
most commonly used nonlinear down-sampling strategies for translation invariance. The
fully connected layer is usually included at the end of the network to ensure that all
neurons in the layer are fully connected to activations in the previous layer and to enable
2D feature maps to be converted to 1D feature maps for further feature representation and
classification [11].
A significant advantage of CNN over conventional approaches is its ability to concurrently
extract features, reduce data dimensionality, and classify within one network structure.
Additionally, the CNN technique requires only minimal image pre-processing due to its
robust ability to minimize the noise introduced during image acquisition [41].
2.2.2.2 Deep Autoencoder (DAE):
Autoencoders are neural networks that are used to reduce the dimensionality of datasets.
They are implemented in an unsupervised fashion to generate only a representation of the
dataset within their hidden layer neurons, also called the latent vector. Taking the same set
of values for both input and output of the network, an autoencoder learns to reduce a
dataset into a representation state and learns how to reconstruct the data sample to its
original form from the learned representations [43].
FIGURE 2.18 Basic Structure of Deep Autoencoders (DAE) [44]
Deep Autoencoder (DAE) was introduced to learn efficient coding for dimensionality
reduction. Figure 2.18 represents the structure of DAE which is composed of Encoder,
Decoder and Bottleneck layer. It reconstructs the original input image from the noisy
image conversion. It extracts only the features of an image and produces the output by
eliminating any disturbance or unnecessary noise in the system. The code layer, also
known as the "bottleneck" of the network, presents the compressed image that is input to
the decoder. The decoder layer translates the encoded (noisy) image back to its original
dimension as a denoised image. A deep sparse autoencoder extracts low-dimensional
features that efficiently represent human activity or movement from high-dimensional
human action or motion data [41].
2.2.2.3 Restricted Boltzmann Machine (RBM):
The Restricted Boltzmann Machine is an artificial neural network to which an
unsupervised learning algorithm can be applied to build non-linear generative models from
unlabelled data. The goal is to train the network to increase the probability of the vector in
the visible units, i.e., to probabilistically reconstruct the input; the network learns the
probability distribution over its inputs. As shown in figure 2.19, an RBM is a two-layered
network consisting of visible and hidden layers. Each unit in the visible layer is connected
to all units in the hidden layer, and there are no connections between units in the same layer [45].
FIGURE 2.19 Basic Structure of Restricted Boltzmann Machine (RBM) [45]
2.2.2.4 Deep Belief Network (DBN):
The DBN is a typical network architecture but includes a novel training algorithm. The
DBN is a multilayer network (typically deep, including many hidden layers) in which each
pair of connected layers is a restricted Boltzmann machine (RBM). In this way, a DBN is
represented as a stack of RBMs. In the DBN, the input layer represents the raw sensory
inputs, and each hidden layer learns abstract representations of this input. The output layer,
which is treated somewhat differently than the other layers, implements the network
classification. Training occurs in two steps: unsupervised pretraining and supervised fine-
tuning.
FIGURE 2.20 Basic Structure of Deep Belief Network (DBN) [46]
In unsupervised pretraining, each RBM is trained to reconstruct its input. The next RBM is
trained similarly, but the first hidden layer is treated as the input (or visible) layer, and the
RBM is trained by using the outputs of the first hidden layer as the inputs. This process
continues until each layer is pre-trained. When the pretraining is complete, fine-tuning
begins. In this phase, labels are applied to the output nodes to give them meaning [46].
2.2.2.5 Recurrent Neural Network (RNN):
RNNs are a type of artificial neural network that includes a recurrent layer. The difference
between a recurrent layer and a regular fully-connected hidden layer is that neurons within
a recurrent layer can be connected to each other. In other words, the output of a neuron is
conveyed both to the neuron(s) within the next layer and to the next neuron within the
same layer. Using this mechanism, RNNs can carry information learned within a neuron to
the next neuron in the same layer. The traditional neural networks and CNN models
cannot remember any information from the past, as they do not contain any memory cell.
The RNN architecture has an internal memory such as the hidden state to store the
sequential input’s temporal dynamics, as shown in figure 2.22. The RNN model can
predict the class label based on the sequence of the previous context information. The
RNN maps the input data to a hidden state and hidden state data to the output, as shown in
figure 2.21 for sequence learning in the temporal domain [43]. These mappings are
mathematically expressed as:

$$ h_t = f\!\left( W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h \right), \qquad y_t = W_{hy}\, h_t + b_y $$

where $x_t$ is the input at time $t$, $h_t$ is the hidden state at time $t$, $h_{t-1}$ is the hidden state at
time $t-1$, $f$ is a non-linear function such as the sigmoid, rectified linear unit or hyperbolic
tangent, and $y_t$ is the output at time $t$. $W_{xh}$ is the weight matrix from the input to the hidden
state, $W_{hh}$ is the weight matrix from hidden to hidden, $W_{hy}$ is the weight matrix from the
hidden state to the output, and finally $b_h$ and $b_y$ are the bias units of the hidden state and the
output state respectively.
FIGURE 2.21 The Schematic diagram of RNN Node [43]
FIGURE 2.22 Basic Structure of Recurrent Neural Network (RNN) [43]
2.2.2.6 Long Short-Term Memory (LSTM):
LSTM is an implementation of the Recurrent Neural Network. Unlike the earlier described
feed-forward network architectures, LSTM can retain the knowledge of previous states
and can be trained for work that requires memory or state awareness. LSTM partly
addresses a major limitation of RNNs, i.e., the problem of vanishing gradients, by letting
gradients pass unaltered. As shown in figure 2.23, LSTM consists of blocks of memory
cell state through which signal flows while being regulated by input, forget and output
gates. These gates control what is stored, read and written on the cell [45].
In figure 2.23, $C$, $x$ and $h$ represent the cell state, input and output values. The subscript $t$
denotes the time stamp, i.e., $t-1$ refers to the previous LSTM block and $t$ to the current
block. The symbol $\sigma$ is the sigmoid function and $\tanh$ is the hyperbolic tangent function.
The operator $+$ is element-wise summation and $\odot$ is element-wise multiplication. The
computations of the gates are described by the equations below [45]:

$$ f_t = \sigma\!\left( W_f\, x_t + w_f\, h_{t-1} + b_f \right) $$
$$ i_t = \sigma\!\left( W_i\, x_t + w_i\, h_{t-1} + b_i \right) $$
$$ o_t = \sigma\!\left( W_o\, x_t + w_o\, h_{t-1} + b_o \right) $$
$$ C_t = f_t \odot C_{t-1} + i_t \odot \tanh\!\left( W_C\, x_t + w_C\, h_{t-1} + b_C \right) $$
$$ h_t = o_t \odot \tanh(C_t) $$

where $f$, $i$, $o$ are the forget, input and output gate vectors respectively, and $W$, $w$, $b$ and $\odot$
represent the weights of the input, the weights of the recurrent output, the bias and
element-wise multiplication respectively. There is a similar variation of the LSTM known as gated
recurrent units (GRU). GRUs are smaller than LSTMs, as they do not include the output
gate, and they perform better than LSTMs on some simpler datasets.
LSTM recurrent neural networks can keep track of long-term dependencies. So, they are
great for learning from sequence input data and building models that rely on context and
earlier states. The cell block of LSTM retains pertinent information of previous states. The
input, forget and output gates dictate new data going into the cell, what remains in the cell
and the cell values used to calculate the output of the LSTM block respectively [45].
FIGURE 2.23 The Schematic diagram of LSTM block with memory cell and gates [45]
2.3 Convolutional Neural Network (CNN):
The Convolutional Neural Network (ConvNet or CNN) is one of the variants of neural
networks used heavily in the field of computer vision; it derives its name from the type of
hidden layers it consists of. Image recognition, image classification, object detection and
face recognition are some of the areas where CNNs are widely used.
Convolutional neural network (CNN) is a supervised learning method that can perform the
feature extraction and classification process simultaneously and can automatically
discover the multiple levels of representations in data, which has been widely used in the
field of computer vision [47]. The general structure of the basic CNN model is shown in
below figure 2.24
FIGURE 2.24 General CNN structure in facial expression recognition system [47]
The reason for the increasing popularity of CNNs may arise from their ability to learn and
extract features directly from raw input data (even distorted images), whereas conventional
machine learning and computer vision techniques require manually extracted features.
CNNs combine the three steps of facial expression recognition (feature learning, feature
selection and feature classification) into one step and require minimal pre-processing.
Also, with the advantage of the graphical processing unit (GPU) technology, tasks that
require intensive computation can achieve promising results at low power consumption.
The CNNs gain the advantage by automatically learning features representation without
depending on human-crafted features using end-to-end system starting from raw pixels to
classifier outputs. Researchers focus on improving the performance of CNNs architecture
and methods such as layer design, activation function, regularization and exploring the
performance in different fields [11].
As shown in figure 2.24, CNN consists of an input and output layer and multiple hidden
layers. The hidden layers of a CNN typically consist of Convolutional layers, Pooling
layers, Fully connected layers and normalization layers.
2.3.1 Convolutional Layer:
The convolutional layer is the core building block of a CNN. Convolution is the first layer
to extract features from an input image; it convolves the pixels of the input image with a
locally connected small area called the neuron's receptive field. In CNN terminology, this
receptive field is also called a 'kernel' or 'filter', working as a feature detector. The
resulting dot products form the so-called 'feature map', obtained by sliding these filters
over the image. Every neuron shares a fixed set of weights with the receptive fields in a
locally connected layer, which is called the weight-sharing scheme. Convolution is a
mathematical operation that takes two inputs, an image matrix and a filter or kernel, as
shown in figure 2.25 [48].
FIGURE 2.25 Convolutional operation with Image matrix multiplies kernel or filter matrix [49]
This layer aims to figure out features in the image, for instance, the vertical/horizontal
edges, gradients, etc. In order to have multiple features examined, there will be various
different filters. Together, they will form the output of the neurons that are connected to
local regions in the input. In other words, the output after this layer is the features
extracted from the input of regions in the images. To get the result, dot-product is
performed between the Conv-Layer and the input layer, as shown in figure 2.26.
FIGURE 2.26 Example of Dot product in Convolutional operation with image and filter [48]
Mathematically, a convolution of two functions f and g is defined as:

$$ (f * g)(i, j) = \sum_{m} \sum_{n} f(m, n)\, g(i - m, j - n) $$

which is nothing but the dot product of the input function and a kernel function. Convolution
of an image with different filters can perform operations such as edge detection, blur and
sharpen by applying filters. The Conv-Layer will move step by step from left to right, top
to bottom on the input. At each stage, it will move by a number of Strides that was
specified.
Strides:
Stride is the number of pixels the filter shifts over the input matrix. When the stride is 1,
the filter moves 1 pixel at a time; when the stride is 2, it moves 2 pixels at a time, and so
on. Figure 2.27 below shows how convolution works with a stride of 2.
FIGURE 2.27 Convolution operation with Stride size of 2 [48]
Padding:
Sometimes the filter does not perfectly fit the input image. Two options are available in
this case: one is padding the picture with zeros (zero-padding), and the other is dropping
the part of the image where the filter does not fit. Zero-padding pads the input volume
with zeros around the border. A nice feature of zero-padding is that it allows us to control
the size of the output. If the output size is the same as the input, we call this padding
'same'; if they are not the same, we call the padding 'valid'.
The size of the output layer is then calculated according to the formula below [49]:

$$ O = \frac{N - F + 2P}{S} + 1 $$

where N is the input layer's size, F is the size of the filter, P is the amount of zero-padding
used, and S is the stride. After performing the convolution, the output is activated
according to the activation function; ReLU (Rectified Linear Unit) is widely used at this
step. This step can be combined with the Conv-Layer to form a single step, Convolution+ReLU.
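As a quick worked check of the output-size formula above: a 32x32 input (N = 32) convolved with a 5x5 filter (F = 5) using zero-padding P = 2 and stride S = 1 yields (32 - 5 + 2*2)/1 + 1 = 32, i.e., 'same' padding, while P = 0 gives (32 - 5)/1 + 1 = 28, i.e., 'valid' padding.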
Non-Linearity (ReLU):
ReLU stands for Rectified Linear Unit, a non-linear operation. Its output is given by
f(x) = max(0, x), as shown in figure 2.28.
FIGURE 2.28 Rectified Linear Unit (ReLU) Activation function [49]
ReLU's purpose is to introduce non-linearity into the ConvNet, since the real-world data
we would want the ConvNet to learn is non-linear; the ReLU operation is shown in figure
2.29. Other non-linear functions such as tanh or sigmoid can also be used instead of
ReLU, but most researchers use ReLU since, performance-wise, it is better than the other
two.
FIGURE 2.29 Rectified Linear Unit (ReLU) Operation [48]
2.3.2 Pooling Layer:
Sometimes when the images are too large, it is required to reduce the number of trainable
parameters. It is then desired to periodically introduce pooling layers between subsequent
convolution layers. Pooling is done for the sole purpose of reducing the spatial size of the
image. Pooling is done independently on each depth dimension; therefore, the depth of the
image remains unchanged. Spatial pooling is also called subsampling or down-sampling,
which reduces the size of each map but retains important information. Spatial pooling can
be of different types: Max Pooling, Average Pooling, Sum Pooling.
Max pooling takes the largest element from the rectified feature map. Taking the average
of the given elements is known as average pooling, and taking the sum of all elements in
the feature map is called sum pooling. Examples of max pooling and average pooling are
shown in figure 2.30 below [49].
FIGURE 2.30 Example of Max Pooling and Average Pooling operations [48]
The pooling operation has two advantages: the first is that it helps prevent the model from
over-fitting, as it provides an abstraction of the input volume; the second is that it reduces
the input volume, thereby reducing the number of learnable parameters and saving
computational resources.
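A minimal NumPy sketch of the max and average pooling operations described above; the 2x2 window and stride of 2 are the common illustrative choice:

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Spatial pooling over one 2-D feature map (applied per depth channel)."""
    h = (fmap.shape[0] - size) // stride + 1
    w = (fmap.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, mode="max"))   # 2x2 output: spatial size is halved
print(pool2d(fmap, mode="avg"))
```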
2.3.3 Fully Connected Layer:
Fully connected layers are an essential component of CNN, which have been proven very
successful in recognizing and classifying images. The CNN process begins with
convolution and pooling, breaking down the image into features, and analyzing them
independently. The result of this process feeds into a fully connected neural network
structure that drives the final classification decision. The input to the fully connected layer
is the output from the final pooling or convolutional layer, which is 'flattened' into a
single vector that can serve as the input for the next stage. Figure 2.31 below shows the
flattening operation, which converts the feature maps into a single feature vector [49].
FIGURE 2.31 Example of Flattening operation which will convert into a single vector [47]
After passing through the fully connected layers, the final layer uses the SoftMax
activation function to obtain the probability of the input belonging to each particular class;
this step, known as classification, assigns the output to the required label.
A Convolutional Neural Network is made up of CONV, POOL and FC layers. These
layers are often stacked together following a pattern such as:

INPUT → [CONV → RELU → CONV → RELU → POOL] × 3 → [FC → RELU] × 2 → FC
Here a pooling layer is placed after two Conv-Layers, which is a good idea for larger and
deeper networks. The idea behind a larger and deeper network is that it extracts small,
low-level features related to the picture in the early layers; later, going deeper, these
features gradually combine into bigger ones that carry more meaning, as shown in figure
2.32 below.
FIGURE 2.32 Representation of features at different stages in the network [48]
2.3.4 Transfer Learning:
Transfer learning is a machine learning technique whereby a model trained and developed
for one task is re-used on a second, related task. In cases where the problem domain is
similar and the training dataset is too small, transfer learning can be used instead of
constructing a new model. Using transfer learning, one can reuse a pre-defined model that
has already been trained on a dataset instead of training a new model from scratch.
Transfer learning is usually applied when the new dataset is smaller than the original
dataset used to train the pre-trained model [50].
A good way to understand transfer learning is to look at the student-teacher relationship.
The teacher offers a course after gathering detailed knowledge of the subject, and that
detail is conveyed through a series of lectures over time. The teacher (expert) can be
considered to be transferring information (knowledge) to the students (learners). The same
thing happens in deep learning: a network is trained with a significant
amount of data, and during the training, the model learns the weights and bias. These
weights can be transferred to other networks for testing or retraining a similar new model.
The network can start with pre-trained weights instead of training from scratch [42].
The basic premise of transfer learning is simple: take a model trained on a large dataset
and transfer knowledge to a smaller dataset. For object recognition or image recognition
with a CNN, freeze the early convolutional layers of the network and only train the last
few layers which make a prediction. The idea is the convolutional layers extract general,
low-level features that are applicable across images – such as edges, patterns, gradients –
and the later layers identify specific features within an image such as eyes or wheels [51].
The conceptual diagram for transfer learning method is shown in figure 2.33
FIGURE 2.33 Conceptual diagram of Transfer Learning where learning of a new task relies on the
previously learned task [53]
Regarding the initial training, transfer learning allows us to start with the learned features
on the ImageNet dataset and adjust these features and perhaps the model’s structure to suit
the new dataset instead of beginning the learning process on the data from scratch with
random weight initialization. TensorFlow is used to facilitate the transfer learning of the
new CNN pre-trained model [52].
Following is the general outline for transfer learning for object/image recognition [51]:
1. Load in a pre-trained CNN model trained on a large dataset
2. Freeze parameters (weights) in the model’s lower convolutional layers
3. Add custom classifier with several layers of trainable parameters to model
4. Train classifier layers on training data available for the task
5. Fine-tune hyperparameters and unfreeze more layers as needed
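The five steps above can be sketched with TensorFlow/Keras as follows; the VGG16 backbone, the layer sizes and the seven-class head are illustrative assumptions, not the exact configuration used later in this thesis:

```python
import tensorflow as tf

# Step 1: load a pre-trained CNN (here VGG16, trained on ImageNet)
base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                      # step 2: freeze convolutional layers

model = tf.keras.Sequential([               # step 3: add a custom classifier head
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(7, activation="softmax"),   # 7 expression classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Step 4: train the classifier head, e.g. model.fit(train_images, train_labels, ...)
# Step 5: unfreeze some of base.layers and re-compile with a low learning rate
```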
2.4 Fusion Approach in Convolutional Neural Network:
A CNN consists of convolutional layers, pooling layers and fully connected layers, and the
number of each kind of layer varies. Each layer processes the previous layer's output and
then delivers the result to the next layer in order. In other words, the features extracted by
successive layers move closer to semantic information from the shallow layers to the deep
layers. CNN provides an effective bridge between raw data and abstract representation and
is widely used for different problems by combining many operations, including deeper
layers. From the literature review, two essential fusion approaches were found for our
proposed models: the Multi-Feature fusion-based approach and the Ensemble (fusion) of
multi-CNN feature fusion-based approach.
2.4.1 Multi-Feature Fusion-based Approach:
Most works deliver only the feature maps of the last layer to the classifier, and few pay
attention to the feature information contained in other layers. In fact, the feature
information hidden in different layers has potential for feature discrimination. Therefore,
many researchers have tried to implement a fusion-based approach for feature extraction
from different CNN layers, which contain various properties. This approach integrates
feature maps from different layers in the CNN instead of the last layer only. Results show
an improvement in the recognition accuracy of models using this feature-fusion approach.
However, this approach usually uses features of the same size, which limits the effect of
feature fusion. A framework of the multi-layer feature fusion is shown in figure 2.34 [54].
FIGURE 2.34 General Framework of Multi Feature-Fusion model [54]
As shown in figure 2.34, the multi-layer feature fusion method modifies the general
structure of a CNN: features are extracted from different intermediate layers and fused in
a fusion module. The result is then provided to the fully connected layer and the classifier,
respectively, to complete the recognition task. Instead of generating a feature map only at
the final layer of the CNN, features from different intermediate layers are extracted and
finally concatenated in the fusion module to generate the final feature map. This process is
also known as the Inter-Layer Feature Fusion approach, as shown in figure 2.35. This
approach makes it possible to extract and merge different features taken from different
layers before the final layer, which helps to improve the recognition accuracy of a model
for the given problem [54].
FIGURE 2.35 Framework of Inter-Layer Feature-Fusion process [54]
In the concatenation-of-features process, as part of a fusion approach, the concatenation
usually designates a consolidated dimension along which to implement the fusion. The
principle of the concatenation process is shown in the formula below [54]:

$$ Z_{concat} = \operatorname{concat}(X_1, X_2, \dots, X_K) $$

where $X_k$ is the set of output feature maps of one layer and $k$ refers to the index of the layer.
From the above formula, each $X_k$ contributes unique features, and $Z_{concat}$ can be regarded as
a fusion set containing all the features. This indicates that $Z_{concat}$ increases the feature
diversity, so the classifier receives more features instead of only the feature map of the last
layer. Therefore, the essence of the concatenation process is enriching feature diversity so
that the classifier obtains better recognition ability.
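A hedged Keras sketch of this inter-layer feature fusion idea: two intermediate feature maps are flattened and concatenated (the Z_concat above) before classification. The 48x48 input and layer sizes are illustrative only, not the configuration of the proposed MLFFC model:

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(48, 48, 1))                  # illustrative input size
x1 = layers.MaxPooling2D()(layers.Conv2D(32, 3, activation="relu")(inp))
x2 = layers.MaxPooling2D()(layers.Conv2D(64, 3, activation="relu")(x1))

# Fusion module: flatten two intermediate feature maps and concatenate them,
# so the classifier sees features from more than one depth
fused = layers.Concatenate()([layers.Flatten()(x1), layers.Flatten()(x2)])
hidden = layers.Dense(128, activation="relu")(fused)
out = layers.Dense(7, activation="softmax")(hidden)    # 7 expression classes

model = tf.keras.Model(inp, out)
model.summary()
```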
2.4.2 Ensemble of Multi-CNN Feature Fusion-based Approach:
Convolutional Neural Networks (CNNs) are becoming increasingly popular in large-scale
image recognition and classification. Existing CNN models use a single model to extract
features, but the recognition accuracy is not adequate for real-time applications. Therefore,
many researchers have tried to implement the ensemble-of-CNN-models concept for
feature fusion in facial expression recognition to improve recognition accuracy. In the
ensemble-of-CNN concept, layers from different CNN architectures are concatenated to
finally generate a feature vector. Concatenating features from various layers of various
networks helps to overcome the limitations of a single network and produces robust,
superior performance. A framework of the ensemble of multi-CNN feature fusion-based
approach is shown in figure 2.36 [56].
FIGURE 2.36 Framework of Ensemble of Multi-CNN feature fusion-based approach [56]
In this concept, there are two possibilities for the feature-fusion process. One is to
concatenate the two final feature vectors generated by two different CNN architectures fed
the same input images. The other is to extract features from different CNN architectures
and concatenate them as a feature fusion-based approach. After feature fusion by either of
these methods, the result is provided to the fully connected layer and the classifier,
respectively, to complete the recognition process. This approach improves classification
performance, but complexity is the major issue in implementing it [55].
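The following is a hedged Keras sketch of the first variant, feeding the same input image
to two different CNN backbones and concatenating their final feature vectors; the choice
of VGG16 and ResNet50 is illustrative only, not a pairing prescribed by the cited works.

    from tensorflow.keras.applications import VGG16, ResNet50
    from tensorflow.keras import layers, Input, Model

    inp = Input(shape=(224, 224, 3))
    # Two different backbones receive the same input image
    f_a = VGG16(include_top=False, pooling='avg')(inp)      # (None, 512)
    f_b = ResNet50(include_top=False, pooling='avg')(inp)   # (None, 2048)

    fused = layers.Concatenate()([f_a, f_b])                # (None, 2560)
    out = layers.Dense(7, activation='softmax')(fused)      # 7 expression classes
    ensemble = Model(inp, out)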
CHAPTER 3
Literature Review
3.1 Overview
This research concerns the recognition of facial expressions from images by applying an
ensemble and fusion of features using deep learning techniques. Over the last years,
various methods have been proposed for facial expression recognition in images. In this
chapter, a literature review is carried out for various conventional approaches and deep
learning-based approaches to facial expression recognition. The literature review also
covers multi-feature fusion using the two different methods employed in our research work:
multi-feature fusion in a single model and multi-feature fusion by an ensemble of
different models. A notable characteristic of the conventional FER approach is that it is
highly dependent on manual feature engineering: researchers need to pre-process the image
and select the appropriate feature extraction and classification method for the target
dataset. The conventional FER procedure can be divided into three major steps: image
pre-processing, feature extraction and feature classification. Deep learning has
demonstrated outstanding performance in many machine learning tasks including
identification, classification and target detection. In terms of FER, deep learning-based
approaches greatly reduce the reliance on image pre-processing and feature extraction.
They are more robust to environments with varying elements, e.g., illumination and
occlusion, which means that they can significantly outperform the conventional approaches;
they also have the capability to handle a high volume of data. Although deep-learning-based
FER generally produces better accuracy than conventional FER, it also requires a large
amount of processing capacity, such as a graphics processing unit (GPU). A newer approach
known as multi-feature fusion is an effective way of representing facial expressions, in
which different features from several algorithms are combined to describe the expressions.
A CNN learns features through layer-by-layer propagation, which may lose some important
feature information at intermediate layers. Feature fusion from different layers of a
single CNN model, or from layers of other CNN models, supports a better classification
process, which improves the recognition accuracy of the facial expression recognition
system. A detailed survey of the multi-feature fusion approach, with comparative analysis
and state-of-the-art methods, is presented in this chapter.
This literature survey is divided into the following parts:
• Conventional FER approaches
• Deep Learning-based FER approaches
• Multi-Feature fusion-based approaches in a single model (Inter-Layer Feature Fusion)
• Multi-Feature fusion-based approaches in an ensemble of multi models (Multi-model feature fusion)
3.2 Conventional FER Approaches
Various types of conventional approaches have been studied for facial expression
recognition systems. These approaches have in common detecting the face region and
extracting geometric features, appearance features, or a hybrid of geometric and
appearance features from the target face. The conventional FER procedure can be divided
into three significant steps: image pre-processing, feature extraction and expression
classification [57].
Image Pre-processing:
This step is used to eliminate irrelevant information from the images and enhance the
detectability of relevant information. Image pre-processing can directly affect the
extraction of features and the performance of expression classification. Some images may
present complicating factors, e.g., light intensity, occlusion, varying size, or a mix of
colour and gray-scale images. These objective interference factors need to be handled by
pre-processing before recognition [29]. The different stages of image pre-processing are
as follows. Noise reduction is the first step of pre-processing; the Average Filter (AF),
Gaussian Filter (GF), Median Filter (MF), Adaptive Median Filter (AMF) and Bilateral
Filter (BF) are frequently used image processing filters. Face detection is an essential
pre-step in FER systems to localize and extract the face region. Normalization of scale
and gray level normalises the size and colour of input images, which reduces computational
complexity while preserving the key features of the face [29].
In 1998, Papageorgiou et al. [58] developed a framework based on the Haar wavelet
representation, and in 2001 Viola and Jones further developed this idea by proposing
Haar-like features, which represent changes of texture or edges in particular facial
regions and can be computed much faster than pixel-based representations. Haar-like
features compute the differences between the sums of pixels within rectangular areas, and
a cascade of classifiers uses these features to speed up the computation [59]. Weilong
Chen et al. [60] proposed an illumination normalization approach for face recognition
under changing lighting conditions. They used a Discrete Cosine Transform (DCT) method,
which helps to recognize faces under changing illumination conditions but fails to address
the shadowing and specularity problems. Owusu et al. [61] used the Bessel down-sampling
approach for face image size reduction, which preserves the aspect ratio and the
perceptual quality of the original image. Biswas et al. [62] used a Gaussian filter for
resizing the input images, which smooths the images.
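The Viola-Jones detector survives today in OpenCV's stock cascade files; the sketch below,
with a hypothetical input image 'face.jpg', shows the usual face-localization step.

    import cv2

    # Stock frontal-face cascade bundled with opencv-python
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

    img = cv2.imread('face.jpg')                     # hypothetical input image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        face_roi = gray[y:y+h, x:x+w]                # cropped face region for FER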
Normalization is a pre-processing method for reducing illumination and other variations in
facial images. Idrissi et al. [63] combined this normalization approach with a median
filter to obtain an improved face image; they used it to extract eye positions, which
makes the FER system more robust and gives the input images more clarity. Zhang et al.
[64] and Happy et al. [65] presented a localization pre-processing method using the
Viola-Jones algorithm to detect faces in the input image, in which the AdaBoost learning
algorithm and Haar-like features detect the size and areas of the face; localization is
mainly used for spotting the size and location of the face in the image. Face alignment
can be performed using the SIFT (Scale Invariant Feature Transform) algorithm. ROI (Region
of Interest) segmentation is an essential pre-processing method that includes three
functions: regulating the face dimensions by separating the colour components from the
face image, and segmenting the eye or forehead and mouth regions. The SIFT and ROI methods
above were used by Dahmane et al. [66] and Hernandez et al. [67], respectively, in facial
expression recognition systems. Demir et al. [68] and Cossetin et al. [69] used histogram
equalization to counteract illumination variations, which is useful for enhancing the
contrast of face images and improving the intensities.
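Two of the pre-processing steps above, median filtering for noise and histogram
equalization for illumination, map directly onto OpenCV calls; a minimal sketch with a
hypothetical input image follows.

    import cv2

    gray = cv2.imread('face.jpg', cv2.IMREAD_GRAYSCALE)  # hypothetical input
    denoised = cv2.medianBlur(gray, 5)       # median filter with a 5x5 kernel
    equalized = cv2.equalizeHist(denoised)   # spreads intensities, boosts contrast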
Feature Extraction:
The feature extraction process is the next stage of the FER system. Feature extraction
means finding and representing the salient features of interest within an image for
further processing. It is a significant stage, as the extracted feature representation
serves as the input to classification. Common approaches used in the feature extraction
process are geometric features, appearance features and a hybrid of geometric and
appearance features.
Ghimire and Lee [70] used two types of geometric features based on the position and angle
of 52 facial landmark points. First, the angle and Euclidean distance between each pair of
landmarks within a frame are calculated. Second, the distances and angles are subtracted
from the corresponding distances and angles in the first frame of the video sequence. Two
classifiers, AdaBoost and SVM, were applied to this approach. Junkai Chen et al. [71]
proposed a feature extraction technique that works on the entire face: facial muscle
movements are described by Histograms of Oriented Gradients (HOG), and a Support Vector
Machine (SVM) is used as the classifier. Experiments were conducted on the JAFFE and CK+
datasets. Happy et al. [72] utilized Local Binary Pattern (LBP) histograms of different
block sizes from the global face region as feature vectors and classified the facial
expressions using Principal Component Analysis (PCA). Although this method runs in real
time, its recognition accuracy tends to be degraded because it cannot reflect local
variations of the facial components in the feature vector. Ghimire et al. [73] extracted
region-specific appearance-based features by dividing the entire face region into
domain-specific local regions; important local regions are determined using an incremental
search approach, which reduces the feature dimensions and improves the recognition
accuracy. Aruna Bhadu et al. [74] used two different feature extraction techniques, the
Discrete Cosine Transform and the Wavelet Transform: hybrid features are extracted with
the DCT and DWT methods, with AdaBoost as the classifier. Experiments performed on the
JAFFE dataset show that hybrid features give better results than the individual feature
extraction techniques.
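A minimal sketch of a HOG-plus-SVM pipeline of the kind used in [71], assuming
scikit-image and scikit-learn, with random arrays standing in for pre-cropped grayscale
faces; the HOG cell and block sizes here are assumptions, not values reported in the paper.

    import numpy as np
    from skimage.feature import hog
    from sklearn.svm import SVC

    # Toy stand-ins for pre-cropped 48x48 grayscale faces and their labels
    images = np.random.rand(20, 48, 48)
    labels = np.random.randint(0, 7, size=20)

    # HOG descriptor per face, then a linear SVM over the descriptors
    X = np.array([hog(img, orientations=9, pixels_per_cell=(8, 8),
                      cells_per_block=(2, 2)) for img in images])
    clf = SVC(kernel='linear').fit(X, labels)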
Bao et al. [75] proposed a feature extraction method based on Bezier curves, evaluated on
the JAFFE dataset with SVM as the classifier; Bezier control points are used to identify
key parts of the face such as the eyes, eyebrows and mouth. Elena Lozano et al. [76]
proposed a geometric-features method using the Active Shape Model: the Active Shape Model
tracks the fiducial points, threshold segmentation is then applied to determine the
mouth's position, and an SVM classifier categorizes the expressions into seven classes.
Huang et al. [77] proposed a facial expression recognition method using Speeded-Up Robust
Features (SURF), in which a probability density function provides the initial
classification and a weighted majority voting classifier produces the recognition output;
this method was evaluated on the JAFFE dataset. Kamarol et al. [78] proposed an
appearance-based method using a Spatio-Temporal Texture Map (STTM) for feature extraction,
chosen for its ability to capture the spatial and temporal changes of facial expressions;
a 3D Harris corner function extracts the spatio-temporal information from faces and an SVM
classifier categorizes the expressions. Toan et al. [79] proposed a hybrid approach in
which a geometric feature-based method is combined with PCA and, separately, with
Independent Component Analysis (ICA); features are extracted from different regions of the
face such as the eyes, mouth and nose, and the integrated features are fed to a neural
network for classification. Experiments were performed on the Caltech dataset and achieved
90% accuracy. Bermani et al. [80] proposed a hybrid approach where geometric-based and
appearance-based features were combined for feature extraction, with a Radial Basis
Function (RBF) neural network as the classifier.
Some researchers have tried to recognize facial emotions using infrared images instead of
Visible Light Spectrum (VIS) images, because VIS images vary with the illumination
conditions. Zhao et al. [81] used Near-Infrared (NIR) video sequences and LBP-TOP (Local
Binary Patterns from Three Orthogonal Planes) feature descriptors. This study uses
component-based facial features to combine geometric and appearance information of the
face; for FER, SVM and sparse representation classifiers are used. Shen et al. [82] used
infrared thermal videos, extracting horizontal and vertical temperature differences from
different facial sub-regions; for FER, the AdaBoost algorithm with k-Nearest Neighbor weak
classifiers is used. Szwoch and Pieniazek [83] recognized facial expressions and emotions
based only on the depth channel of the Microsoft Kinect sensor, without using the camera.
This study used local movements within the face area as features and recognized facial
expressions through their relations to particular emotions. Sujono and Gunawan [84] used a
Kinect motion sensor to detect the face region based on depth information and an Active
Appearance Model (AAM) to track the detected face. The role of the AAM is to adjust the
shape and texture model to a new face when its shape and texture vary from the training
result. The change of key features in the AAM, together with fuzzy logic based on prior
knowledge derived from the Facial Action Coding System (FACS), is used to recognize facial
emotion. Wei et al. [85] proposed FER using the colour and depth information of a Kinect
sensor together: facial feature point vectors are extracted by a face tracking algorithm
from the captured sensor data, and six facial emotions are recognized by a random forest
algorithm.
Facial Expression Classification:
Facial expression classification is the final step in the facial expression recognition
system, in which classifiers categorize the expressions into seven categories: happy, sad,
fear, surprise, disgust, angry and neutral. Recent research work has added the neutral
expression to the classification categories; many previous approaches worked on the six
basic expressions: happy, sad, fear, angry, disgust and surprise. Commonly and widely used
classifiers in conventional methods include KNN (K-Nearest Neighbour), SVM (Support Vector
Machine), the Naïve Bayes classifier, the AdaBoost classifier, HMM (Hidden Markov Model),
Decision Trees and NN (Neural Networks).
Many researchers [86-88] have used the KNN classifier because it is simple and easy to
implement. An important characteristic of the KNN classifier is that it is sensitive to
the local structure of the data. Result optimization has been done by varying the size of
the neighbourhood, i.e., the value of k. None of the literature has shown a deterministic
technique to decide the value of k; hence, it can be concluded that the value of k is
solely dependent on the application and, more precisely, on the kind of input features
given to classify the samples.
The SVM classifier can find a good compromise solution on complex models given limited
sample data, obtaining strong generalization ability. It is also possible to map linearly
inseparable data into higher dimensions via kernel functions so that the data become
linearly separable; by using kernel functions, the system can effectively process
high-dimensional data. Liyuan Chen et al. [89] presented work on person-dependent and
person-independent expression recognition, using SVM with both a linear kernel and the
Radial Basis Function (RBF) kernel for classification; recognition of approximately 80% is
achieved in a few of the cases. Many researchers [90-93] have used the SVM classifier in
their proposed approaches for classification.
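A hedged scikit-learn sketch of the two classifiers discussed above, fitted on synthetic
feature vectors; k and the RBF parameters are placeholders since, as noted, suitable
values depend on the application.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    X = np.random.rand(100, 64)             # extracted feature vectors
    y = np.random.randint(0, 7, size=100)   # seven expression classes

    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)      # k chosen empirically
    svm = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)  # RBF kernel mapping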
AdaBoost [94-97] and Naïve Bayes [98-100] classifiers have been used by many researchers
in facial expression recognition systems. AdaBoost is sensitive to noisy and anomalous
data, yet in some problems it can be less susceptible to overfitting than other learning
algorithms; AdaBoost with decision trees is often referred to as the best out-of-the-box
classifier. The Naïve Bayes classifier, in turn, is highly scalable, requiring a number of
parameters linear in the number of variables in the learning problem; one advantage is
that only a small amount of training data is required to estimate the parameters needed
for classification.
Many facial expression recognition algorithms have used a neural network as the classifier
due to its high tolerance to noisy data and its ability to classify patterns on which it
has not been trained. A Multilayer Feed-Forward Neural Network (MFFNN) with the
backpropagation algorithm has been used for classification [61]. The Bayesian neural
network classifier is a classification method that also comprises input, hidden and output
layers; the classical backpropagation algorithm is used with the Bayesian classifier to
improve its accuracy [101]. Some researchers [102-104] have used a probabilistic neural
network as the classifier in facial expression recognition.
The conventional approaches for facial expression recognition are less dependent on data
and hardware than deep learning-based approaches. However, feature extraction and
classification have to be designed manually and separately, which means the two phases
cannot be optimised simultaneously. An advantage of the conventional approach is that it
requires relatively less computing power and memory than deep learning-based methods;
therefore, these approaches are still being studied for use in real-time embedded systems
because of their low computational complexity.
3.3 Deep Learning-based FER Approaches
Despite the notable success of traditional facial recognition methods through the
extraction of handcrafted features, over the past decade researchers have turned to the
deep learning approach due to its high automatic recognition capacity. Deep learning-based
algorithms have been used for feature extraction, classification and recognition tasks. In
terms of FER, deep learning-based approaches are more robust to environments with varying
elements, e.g., illumination and occlusion, which means that they can significantly
outperform the conventional methods; they also have the capability to handle a high volume
of data. In recent decades, there has been a breakthrough in deep-learning algorithms
applied to the field of computer vision, including the convolutional neural network (CNN)
and the recurrent neural network (RNN). The main advantage of the CNN is that it
completely removes or greatly reduces the dependence on physics-based models or other
pre-processing techniques by enabling "end-to-end" learning directly from input images.
For these reasons, CNNs have achieved state-of-the-art results in various fields including
object recognition, scene understanding, face detection and facial expression recognition.
Mollahosseini et al. [105] proposed a deep CNN for FER across several available databases.
After extracting the facial landmarks from the data, the images were reduced to 48x48
pixels and a data augmentation technique was applied. The architecture consists of two
convolution-pooling layers followed by two Inception-style modules containing
convolutional layers of size 1x1, 3x3 and 5x5. They demonstrate the value of the
network-in-network technique, which increases local performance because the convolution
layers are applied locally; this technique also helps to reduce the over-fitting problem.
Lopes et al. [106] studied the impact of data pre-processing before network training in
order to obtain better emotion classification. Data augmentation, rotation correction,
cropping, down-sampling to 32x32 pixels and intensity normalisation are the steps that
were applied before a CNN consisting of two convolution-pooling layers ending with two
fully connected layers of 256 and 7 neurons. The best weights obtained in the training
stage are used in the test stage. This approach was evaluated on three accessible
databases: CK+, JAFFE and BU-3DFE. The researchers show that combining all of these
pre-processing steps is more effective than applying them separately.
Mohammadpour et al. [107] proposed a novel CNN for detecting the Action Units (AUs) of the
face. The network uses two convolution layers, each followed by max-pooling, and ends with
two fully connected layers that indicate the numbers of AUs activated. Cai et al. [108]
proposed a novel CNN architecture with Sparse Batch Normalization (SBN). This network uses
two successive convolution layers at the beginning, followed by max-pooling and then SBN,
and to reduce the over-fitting problem, dropout is applied in the middle of three fully
connected layers. Li et al. [109] presented a new CNN method in which the data are first
fed into a VGGNet backbone and a CNN attention mechanism is then applied. This
architecture was trained and tested on three large databases: FED-RO, RAF-DB and
AffectNet. Detection of the essential parts of the face was proposed by Yolcu et al.
[110]. They used three CNNs with the same architecture, each detecting one part of the
face, such as the eyebrow, eye or mouth. Before being fed into the CNNs, the images go
through a cropping stage and facial key-point detection. The resulting iconic face,
combined with the raw image, was fed into a second type of CNN to detect the facial
expression. The researchers show that this method offers better accuracy than using the
raw images or the iconized face alone.
Agrawal et al. [111] studied the influence of varying CNN parameters on the recognition
rate using the FER2013 database. First, all images are resized to 64x64 pixels. They then
varied the size and number of filters, as well as the type of optimizer (Adam, SGD,
AdaDelta), on a simple CNN containing two successive convolution layers, the second
followed by max-pooling and a SoftMax function for classification. Based on these studies,
the researchers created two novel CNN models achieving average accuracies of 65.23% and
65.77%; the particularity of these models is that they contain neither fully connected
layers nor dropout, and the same filter size is kept throughout the network. Deepak Jain
et al. [112] proposed a novel deep CNN that includes two residual blocks, each containing
four convolution layers. The model is trained on the JAFFE and CK+ databases after a
pre-processing step that crops the images and normalizes their intensity.
Kim et al. [113] studied the variation of facial expression during an emotional state and
proposed a spatio-temporal architecture combining a CNN and an LSTM: first, the CNN learns
the spatial features of the facial expression across all emotional-state frames, and then
an LSTM is applied to preserve the whole sequence of these spatial features. Yu et al.
[114] presented a novel architecture called Spatio-Temporal Convolutional features with
Nested LSTM (STC-NLSTM), based on three deep learning sub-networks: a 3DCNN for
spatio-temporal feature extraction, followed by a temporal T-LSTM to preserve the temporal
dynamics and a convolutional C-LSTM to model the multi-level features. A deep
convolutional BiLSTM architecture was proposed by Liang et al. [115]: they created two
DCNNs, one designated for spatial features and the other for extracting temporal features
from facial expression sequences; these features are fused into a 256-dimensional vector,
and a BiLSTM network performs the classification into one of the six basic emotions. The
pre-processing stage uses a multitask cascade convolutional network to detect the face,
and data augmentation techniques are applied to enlarge the database.
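The CNN-then-LSTM pattern common to these works can be sketched in Keras as follows; the
frame count, layer sizes and six-class output are assumptions for illustration, not the
configurations of the cited papers.

    from tensorflow.keras import layers, Input, Model

    seq = Input(shape=(16, 48, 48, 1))                  # 16-frame image sequences
    x = layers.TimeDistributed(
        layers.Conv2D(32, 3, activation='relu'))(seq)   # per-frame spatial features
    x = layers.TimeDistributed(layers.MaxPooling2D())(x)
    x = layers.TimeDistributed(layers.Flatten())(x)
    x = layers.LSTM(64)(x)                              # temporal dynamics
    out = layers.Dense(6, activation='softmax')(x)      # six basic emotions
    model = Model(seq, out)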
Liu et al. [116] proposed a novel Boosted Deep Belief Network (BDBN) for facial expression
recognition, with three training stages performed iteratively in a unified loopy
framework. Their experiments selected the first frame (with the neutral expression) and
the last three frames from each image sequence to obtain more samples from the CK+
database. Extensive experiments on the CK+ and JAFFE databases showed that their framework
achieved dramatic improvements over the state-of-the-art algorithms benchmarked on these
two databases.
Burkert et al. [117] proposed a CNN architecture that does not depend on handcrafted
features. The architecture is composed of four parts: the images are first pre-processed
automatically through a convolutional layer, then down-sampled by a pooling layer in the
second part. The next block, called FeatEx, serves as the fundamental structure in this
architecture and is inspired by GoogleNet. Finally, after two concatenated FeatEx blocks,
the extracted features are fed into a fully connected layer to perform the classification.
Deep features from different layers are visualized to show their validity, and evaluations
are conducted on two standard datasets, namely MMI and CK+. Their experiment on the CK+
dataset evaluated seven classes and achieved a recognition rate of 99.6%.
Inspired by the architectures of AlexNet and GoogleNet, Mollahosseini et al. [118]
proposed their own CNN architecture in 2016, which consisted of two conventional CNN
modules (one of which contained a convolutional layer followed by a max-pooling layer),
four Inception modules and two fully connected layers, requiring only 25M operations
(compared to 100M in AlexNet). Face registration was performed to improve FER performance
by using bidirectional warping of the Active Appearance Model (AAM) and a supervised
method called IntraFace, which adopts SIFT features to extract 49 facial landmarks. Both
subject-independent and cross-database experiments were carried out on seven public
standard datasets (MultiPIE, MMI, CK+, DISFA, FERA, SFEW, and FER-2013), and six specific
classes (angry, disgust, fear, happiness, sadness and surprise, excluding the neutral and
contempt classes) were evaluated on the CK+ dataset.
A state-of-the-art performance analysis for facial expression recognition using
deep-learning-based approaches is shown in Table 3.1 below.
Table 3.1 Performance summary of facial expression recognition using deep-learning-based approaches

| Author | Year | Methods/Algorithms Details | Dataset | Accuracy (%) |
|---|---|---|---|---|
| Agrawal et al. [111] | 2020 | Two proposed CNN models evaluated for different kernel sizes and numbers of filters to overcome the limitations of the FER2013 dataset | FER2013 | 65.23 & 65.77 |
| Dandan Liang et al. [115] | 2020 | BiLSTM network for a fusion approach of spatial features and temporal dynamics using a deep spatial network (DSN) and deep temporal network (DTN) | CK+, Oulu-CASIA & MMI | 99.4, 91.07 & 80.71 |
| Gozde Yolcu et al. [110] | 2019 | 4-stage CNN structure: first CNN for eyebrow segmentation, second for eye segmentation, third for mouth segmentation and fourth for expression recognition | Radboud Face Database (RaFD) | 94.44 |
| Deepak Kumar Jain et al. [112] | 2019 | Single Deep Convolutional Neural Network (SDCNN) containing convolutional layers and deep residual blocks | JAFFE & CK+ | 95.23 & 93.24 |
| Jun Cai et al. [108] | 2018 | Sparse Batch Normalization CNN (SBN-CNN) model using a convolutional network and batch normalization to reduce the risk of over-fitting | JAFFE & CK+ | 95.24 & 96.87 |
| Yong Li et al. [109] | 2018 | Convolutional Neural Network with Attention mechanism (ACNN) that focuses on un-occluded face regions and perceives occlusion regions, with patch-based ACNN and global-local-based ACNN versions | RAF-DB & AffectNet | 85.07 & 58.78 |
| Zhenbo Yu et al. [114] | 2018 | Spatio-Temporal Convolutional features with Nested LSTM (STC-NLSTM) model using a 3DCNN to extract spatio-temporal convolutional features, with expression dynamics modelled by Nested LSTM | CK+, Oulu-CASIA, MMI & BP4D | 99.4, 93.45, 84.53 & 58 |
| Andre Lopes et al. [106] | 2017 | Facial expression recognition system with CNN and different pre-processing operations on images to decrease variations between images and reduce the need for a large amount of data | CK+, JAFFE & BU-3DFE | 96.76, 86.74 & 83.50 |
| Mostafa et al. [107] | 2017 | Proposed CNN architecture to detect AUs | CK+ | 97.01 |
| Dae Hoe Kim et al. [113] | 2017 | Fusion of CNN and LSTM architectures for spatial and temporal feature representation respectively | MMI & CASME II | 69.94 & 58.54 |
| Ali Mollahosseini et al. [105] | 2016 | Deep Neural Network (DNN) architecture consisting of convolutional, max-pooling and four Inception layers | CK+, FER2013 & MMI | 93.2, 66.4 & 77.9 |
| Peter Burkert et al. [117] | 2015 | Deep Convolutional Neural Network (DCNN) architecture with FeatEx, parallel feature extraction blocks, for rich feature representation | MMI & CK+ | 98.36 & 99.5 |
| Ping Liu et al. [116] | 2014 | Boosted Deep Belief Network (BDBN) in which features characterizing expression-related facial shape changes can be learned and selected to form a boosted strong classifier via a joint fine-tuning process | CK+ | 93 |
3.4 Multi-Feature Fusion based FER Approaches:
3.4.1. Multi-Feature Fusion in a single model (Inter-Layer Fusion):
Automatic human facial expression recognition has been receiving increasing attention from
researchers in the deep learning area, and several solutions have been proposed. Most of
the existing work has focused on single-model methodologies using one CNN architecture for
a facial expression database. Instead of adding layers to a CNN and creating ever more
complex deep CNNs, researchers are working on the concept of fusing layers and models to
improve recognition accuracy, including on facial expression databases that contain
real-world images. Using deep learning techniques to automatically extract useful features
from multi-model information, and employing them in fusion and classification while tuning
different network parameters to improve recognition accuracy, are current research
directions. The following review examines the state-of-the-art literature on the
feature-fusion based approach, from which the concept of our research work originated.
Chenhui Ma et al. [54] proposed a multi-layer feature-fusion based CNN model in which
intra-layer and inter-layer fusion mechanisms are applied to the InceptionV3 and VGG16 CNN
architectures on the UCM and NWPU-RESISC databases, achieving 98.4% and 95.32% accuracy
respectively. The novelty lies in using this approach to enhance the features by
integrating information extracted from different CNN layers to build a more discriminative
feature representation for classification. Choosing fusion methods according to the data
distribution, and selecting appropriate CNN models, are the decisive research points.
Although multi-layer feature fusion has achieved good results, this approach increases the
computational burden. Hai-Duong Nguyen et al. [119] presented a multi-level convolutional
neural network (MLCNN) approach that selects important mid-level and high-level features
according to their contribution. Feature maps generated at intermediate layers are
selected, and a fusion approach is applied before the classification layer. This approach
was evaluated on the FER2013 dataset and achieved 73.03% recognition accuracy. A drawback
of the model is that it involves two stages of training: the plain model and the MLCNN
must be trained separately, since the weights of the former are reused.
Tianhao Tang et al. [120] presented a hybrid multimodal method that includes audio, video
frames, video sequences and face landmark movement. This method combines acoustic features
and facial features in both non-temporal and temporal modes. They applied this approach to
the Acted Facial Expressions in the Wild (AFEW) dataset and achieved 61.87% recognition
accuracy in the EmotiW challenge. Although the researchers achieved good accuracy, emotion
recognition on video clips has not been solved entirely due to the lack of specific
specimen samples in the dataset. VenkataRamiReddy et al. [121] presented a multi-feature
fusion-based approach with features extracted by different techniques: Directional Local
Binary Pattern (DLBP) and Discrete Cosine Transform (DCT) methods extract local and global
features respectively, and weighted summation and Principal Component Analysis (PCA)
fusion methods fuse the local and global features extracted from the facial images. A
Radial Basis Function (RBF) neural network is used as the classifier. The Cohn-Kanade (CK)
database was used to evaluate the proposed method, which achieved 97% recognition
accuracy. Kuang Liu et al. [122] proposed a model consisting of several differently
structured subnets; each subnet is a compact CNN model trained separately, and the whole
network is formed by assembling these subnets together. The proposed network was evaluated
on the FER2013 dataset and achieved 65.03% recognition accuracy. The main advantage of
this model is its focus on several CNNs rather than one, which yields better performance
by combining all the results together, but it increases the complexity. Yingruo et al.
[123] proposed a novel Multi-Region Ensemble CNN (MRE-CNN) framework for facial expression
recognition, which aims to enhance the learning power of CNN models by capturing both
global and local features from multiple human face sub-regions. Weighted prediction scores
from each sub-network are then aggregated to produce the final high-accuracy prediction.
The proposed method was evaluated on two publicly available datasets, AFEW and RAF-DB, and
achieved 47.43% and 76.73% recognition accuracy, respectively.
Jung et al. [124] used two different types of CNN: the first extracts temporal appearance
features from image sequences and is known as the deep temporal appearance network (DTAN),
whereas the second extracts temporal geometry features from temporal facial landmark
points and is known as the deep temporal geometry network (DTGN). These two models are
combined using a new integration method, the deep temporal appearance-geometry network
(DTAGN), to boost the performance of facial expression recognition. This approach was
applied to the CK+ and Oulu-CASIA datasets and achieved 97.25% and 81.46% recognition
accuracy, respectively. A state-of-the-art performance analysis of recent research on
multi-feature fusion-based approaches using deep learning methods is shown in Table 3.2
below.
Table 3.2 Performance summary of Multi-Layer Feature-Fusion Methods using Deep Learning techniques

| Author | Year | Methods/Algorithms Details | Dataset | Accuracy (%) |
|---|---|---|---|---|
| Chenhui Ma et al. [54] | 2019 | Inter-layer and intra-layer feature fusion using Inception-V3 and VGG16 CNN architectures | UCM & NWPU | 97.7 & 94.7 |
| Hai-Duong Nguyen et al. [119] | 2018 | Multi-level Convolutional Neural Network (MLCNN) approach by an ensemble of feature maps from different layers | FER2013 | 73.03 |
| Tianhao Tang et al. [120] | 2018 | Hybrid multimodal (audio + video) feature data fusion | AFEW | 61.87 |
| Kuang Liu et al. [122] | 2018 | Multi-feature fusion with an ensemble of CNN subnets | FER2013 | 65.03 |
| Yingruo et al. [123] | 2018 | Multi-Region Ensemble CNN (MRE-CNN) approach with feature fusion | AFEW & RAF-DB | 47.43 & 76.7 |
| Jung et al. [124] | 2015 | Multi-feature fusion of temporal appearance features and temporal geometry features | CK+ | 97.25 |
| VenkataRamiReddy et al. [121] | 2014 | Fusion of local and global features with Directional Local Binary Pattern (DLBP) and Discrete Cosine Transform (DCT) methods | CK | 97 |
3.4.2. Multi-Feature Fusion using multi-model (Multi-Model Fusion):
Long D. Nguyen et al. [125] and V. Vaidehi et al. [56] described the use of an ensemble of
different CNN architectures for feature extraction, concatenating the outputs of these
architectures into a single feature vector for classification instead of obtaining one
feature map from a single CNN architecture. The researchers conclude that concatenating
features from various networks helps to overcome the limitations of a single network and
produces robust and superior performance.
Long D. Nguyen et al. [125] proposed a novel deep neural network architecture based on
transfer learning for microscopic image classification, where features are extracted from
three different pre-trained CNN architectures: InceptionV3, ResNet50 and
Inception-ResNetV2. They created a multimodal fusion approach by combining features at the
layer level. They applied this approach to the 2D-Hela and PAP-Smear datasets and achieved
92.57% and 92.63% accuracy, respectively. V. Vaidehi et al. [56] proposed an Ensemble of
Convolutional Neural Networks (ECNN) based approach for facial expression recognition. The
proposed model addresses the challenges of facial expression, ageing, low resolution and
pose variations. Features are extracted from the VGG16, Xception and Inception-V3 CNN
architectures and concatenated before classification. The model was evaluated on the
Web-Face and YouTube datasets and achieved 97.12% and 99% recognition accuracy,
respectively. Considerable computation time is required to implement this approach, which
can be mitigated by using multiple GPUs. Yingying Wang et al. [126] proposed an
auxiliary-model approach that combines multiple face sub-regions and the entire face image
by weighting factors, capturing more vital information to improve recognition accuracy.
Four different CNN models work in parallel to find weights for the eyes, mouth, nose and
whole facial image, and finally the four models are fused. This approach was tested on the
JAFFE, CK+, FER2013 and SFEW datasets and achieved 95.9%, 99.09%, 67.7% and 59.97%
accuracy, respectively. Hseng Li et al. [127] proposed an emotion recognition system for a
humanoid robot: a camera mounted on the robot's head performs live facial expression
recognition using a CNN and LSTM approach. They tested it on the JAFFE and CK+ datasets
and achieved 94.9% and 90.5% accuracy, respectively, which is somewhat lower than other
state-of-the-art methods.
Alessandro Renda et al. [128] presented two direct ensemble strategies, a seed strategy
and a pre-processing strategy, combining the base classifiers' outputs with the most
common aggregation schemes: averaging and majority voting. The approaches were evaluated
in two scenarios: CNN 10-S, training an ad-hoc architecture from scratch, and VGG-16,
fine-tuning a pre-trained model. The proposed strategy was evaluated on the FER2013
dataset and achieved 70.5% recognition accuracy. Chao Li et al. [129] proposed a
multi-network fusion (MNF) model based on CNNs to recognize facial expressions. The
researchers trained two network structures, one based on Tang's network structure and the
other based on the Caffe-ImageNet network structure, and then used an L2-SVM for
classification. The best network parameters are extracted from the two previously trained
networks as initialization parameters for the MNF structure, which is then fine-tuned. The
proposed model was evaluated on the FER2013 and JAFFE datasets and achieved 70.03% and
95.7% recognition accuracy, respectively. A state-of-the-art performance analysis of
recent research on multi-feature fusion-based approaches with multiple models using deep
learning methods is shown in Table 3.3 below.
Table 3.3 Performance summary of Multi-Layer Feature-Fusion Methods using Ensemble of CNN models using Deep Learning techniques

| Author | Year | Methods/Algorithms Details | Dataset | Accuracy (%) |
|---|---|---|---|---|
| V. Vaidehi et al. [56] | 2019 | Ensemble of Convolutional Neural Networks (ECNN) approach using VGG16, Xception and Inception-V3 CNN architectures | Web-Face & YouTube | 97.12 & 99 |
| Yingying Wang et al. [126] | 2019 | Auxiliary model approach combining multiple face sub-regions and the entire face image using four different CNNs working in parallel | JAFFE, CK+ & FER2013 | 95.95, 99.07 & 67.7 |
| Hseng Li et al. [127] | 2019 | Hybrid multimodal CNN and LSTM architecture for a camera-mounted humanoid robot | JAFFE & CK+ | 94.9 & 90.5 |
| Long D. Nguyen et al. [125] | 2018 | Deep Neural Network (DNN) architecture with feature fusion using InceptionV3, ResNet50 and Inception-ResNetV2 | 2D-Hela & PAP-Smear | 92.57 & 92.63 |
| Yingruo et al. [123] | 2018 | Multi-Region Ensemble CNN (MRE-CNN) approach with feature fusion | AFEW & RAF-DB | 47.43 & 76.7 |
| Alessandro Renda et al. [128] | 2018 | Seed strategy and pre-processing strategy using CNN-10S and VGG16 architectures for fine-tuning and ensemble approach | FER2013 | 70.53 |
| Chao Li et al. [129] | 2018 | Multi-Network Fusion (MNF) model based on the fusion of Tang's network and the Caffe-ImageNet network | FER2013 & JAFFE | 70.03 & 95.7 |
3.5 Summary and Discussion:
This chapter provides a review of the literature on different facial expression
recognition methods and algorithms. Various methods and algorithms based on conventional
approaches, deep-learning-based techniques and feature-fusion based approaches are
discussed, and the differences between conventional and deep-learning-based approaches,
with their respective advantages, are explored. It can be concluded that researchers are
working on different deep-learning-based methods to improve the recognition accuracy of
facial expression recognition systems. Nowadays, the feature-fusion based approach is used
by many researchers, as it is an effective way of representing facial expressions in which
different features from many algorithms or methods are combined. The feature-fusion based
approach is classified into two categories: multi-level feature fusion in a single model
and multi-level feature fusion across different models. In the first concept, feature
fusion within a single model can be applied at the low, intermediate or high level to
achieve the best accuracy; it is also known as the inter-layer or intra-layer
multi-feature fusion-based approach. In the second concept, an ensemble of different
models is used to perform multi-feature fusion: features generated at intermediate layers
of different models are combined to gain the advantages of the other models in a single
multimodal fusion-based approach. A state-of-the-art literature review of both
feature-fusion based approaches has been presented in this chapter.

From the above literature review, it is observed that many researchers are working to
improve the performance of existing CNN models on facial expression datasets that contain
real-world images as well as laboratory-controlled images. Therefore, an efficient model
with deep-learning fusion techniques is required to extract essential features from the
images and classify emotions correctly. In our research work, we have therefore proposed
three models: Multi-Layer Feature-Fusion based Classification (MLFFC), Multi-Modal
Feature-Fusion based Classification (MMFFC) and a novel FER model based on a normalized
CNN approach.
CHAPTER 4
Proposed Multi-Layer Feature-Fusion based
classification (MLFFC) Model
4.1 Introduction
In this chapter, the proposed multi-layer feature-fusion based classification (MLFFC)
approach is presented, which performs inter-layer fusion in the InceptionV3 CNN
architecture for facial expression recognition from images. The literature review showed
that many researchers are working to improve recognition accuracy on facial expression
datasets that contain real-world images as well as laboratory-controlled images. Among the
novel approaches introduced is the feature-fusion based approach, which is classified into
two categories: feature fusion within a single model and feature fusion across multiple
models. In a single model, feature fusion can be applied at the low, intermediate or high
level to achieve the best recognition accuracy; this is also known as the inter-layer
multi-feature fusion-based approach.

Many researchers have tried to implement a fusion-based approach for feature extraction in
facial expression recognition. The literature survey shows that most of the existing work
focuses on the feature maps of the last convolutional layer in a CNN and pays little
attention to the benefits of the other layers in the model, although feature information
hidden in the different layers has potential feature discrimination capacity [54]. The
InceptionV3 CNN architecture is built on factorization ideas, whose aim is to reduce the
number of parameters without decreasing network efficiency [130]. In the proposed MLFFC
model, the multi-feature fusion-based classification approach is therefore applied to the
InceptionV3 CNN architecture. The proposed model is evaluated on two facial expression
datasets, one containing real-world facial images and one containing laboratory-controlled
facial images, to improve recognition accuracy. The dataset with real-world images poses
more challenges, such as pose variations, illumination variation and lower-resolution
images, making the feature extraction process more demanding. The laboratory-controlled
facial images are used for a cross-database evaluation study.
4.2 Inception-V3 CNN Architecture:
The Inception deep convolutional architecture was introduced as GoogleNet in 2015, named
Inception-V1. The Inception architecture was later refined in various ways: first by
introducing batch normalization (Inception-V2), and then by additional factorization
ideas, referred to as Inception-V3. Thus, Inception-V3 is an extended version of
GoogleNet. It uses the Inception module concept to reduce the number of connections and
parameters. The Inception-V3 model is trained on the ImageNet dataset and can identify the
1000 ImageNet classes. The aim of factorizing convolutions is to reduce the number of
parameters without decreasing network efficiency [130].
Figure 4.1: Two 3x3 convolutions replacing one 5x5 convolution [130]
The factorization process shown in figure 4.1 converts larger convolutions into smaller
ones; for example, two 3x3 convolutions replace one 5x5 convolution. Using one layer of
5x5 filters, the number of parameters required is 5x5 = 25, while using two layers of 3x3
filters the number of parameters is (3x3) + (3x3) = 18, a reduction of 28%. The Inception
module uses multiple convolutions and multi-level convolutions to extract features in
different branches and aggregates these extracted features at the end of the corresponding
module, as shown in figure 4.2. This is also known as the "naïve" Inception module. It
performs convolution on the input with three different filter sizes (1x1, 3x3, 5x5), and
additionally performs max pooling. The outputs are concatenated and sent to the next
Inception module [131].
Figure 4.2: Basic Inception module (naïve version) [132]
Moreover, the Inception module reduces dimensionality and uses the rectified linear unit
activation function, serving a dual purpose, as shown in figure 4.3. Deep neural networks
are computationally expensive; to make them cheaper, the number of input channels is
limited by adding an extra 1x1 convolution before the 3x3 and 5x5 convolutions. A 1x1
convolution is also added after the max-pooling layer [132].
Figure 4.3: Inception module with Dimension Reductions [132]
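A minimal Keras sketch of the dimension-reduced module of figure 4.3: 1x1 convolutions
limit the channels before the 3x3 and 5x5 branches, and a 1x1 convolution follows max
pooling. The filter counts are illustrative assumptions, not the values used inside
Inception-V3.

    from tensorflow.keras import layers, Input, Model

    inp = Input(shape=(32, 32, 192))
    b1 = layers.Conv2D(64, 1, padding='same', activation='relu')(inp)   # 1x1
    b2 = layers.Conv2D(96, 1, padding='same', activation='relu')(inp)   # 1x1 reduce
    b2 = layers.Conv2D(128, 3, padding='same', activation='relu')(b2)   # 3x3
    b3 = layers.Conv2D(16, 1, padding='same', activation='relu')(inp)   # 1x1 reduce
    b3 = layers.Conv2D(32, 5, padding='same', activation='relu')(b3)    # 5x5
    b4 = layers.MaxPooling2D(3, strides=1, padding='same')(inp)
    b4 = layers.Conv2D(32, 1, padding='same', activation='relu')(b4)    # 1x1 after pool
    out = layers.Concatenate()([b1, b2, b3, b4])   # filter concatenation
    module = Model(inp, out)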
Using the above dimension-reduced Inception module, the Inception-V3 architecture was
built, as shown in figure 4.4. The schematic diagram of the Inception-V3 architecture
includes 11 different modules. In this architecture, a dropout layer with a 70% ratio
reduces the network's overfitting. Inception-V3 is a widely used image recognition model
that has been shown to attain greater than 78.1% accuracy on the ImageNet dataset. The
model is the culmination of many ideas developed by multiple researchers over the years.
It comprises symmetric and asymmetric building blocks, including convolutions, average
pooling, max pooling, concatenation, dropout and fully connected layers. Batch
normalization is used extensively throughout the model and applied to the activation
inputs. Loss is computed via a SoftMax classifier.
Figure 4.4: The schematic diagram of Inception-V3 architecture [131]
As shown in figure 4.4, the Inception-V3 architecture contains three kinds of Inception
modules, named Inception Module A, Inception Module B and Inception Module C. With
factorization, the number of parameters is reduced for the whole network, the model is
less likely to overfit, and consequently the network can go deeper. The factorization
process in these three modules is shown in figures 4.5, 4.6 and 4.7, respectively, where
Inception Module C is designed to promote high-dimensional feature representations. An
auxiliary classifier is used as a regularizer in this architecture and, with batch
normalization, gives high-quality training. Efficient grid size reduction is used to
further downsize the feature maps, providing an efficient network [130].
Figure 4.5: Factorization process in Inception Module A in Inception-V3 architecture [130]
Figure 4.6: Factorization process in Inception Module B in Inception-V3 architecture [130]
Figure 4.7: Factorization process in Inception Module C in Inception-V3 architecture [130]
4.3 Proposed MLFFC:
Inspired by the concept of multi-feature fusion in the Inception-V3 architecture presented
by Chenhui Ma et al. [54], a Multi-Layer Feature-Fusion based Classification approach is
proposed here for the facial expression recognition system to improve recognition
accuracy. Chenhui Ma et al. [54] applied multi-feature fusion in the Inception-V3
architecture by integrating feature maps from different layers instead of only the last
layer of the network, as shown in figure 4.8. The authors also suggested the inter-layer
feature-fusion based approach, which extracts and merges different features taken from
different layers before providing them to the final layer, as shown in figure 4.9.
FIGURE 4.8 General Framework of Multi Feature-Fusion model [54]
FIGURE 4.9 Framework of Inter-Layer Feature-Fusion process [54]
The proposed Multi-Layer Feature-Fusion based Classification approach is shown in figure
4.10. In this proposed model, we work on Inception Module C, which gives the highest-level
feature representations in the Inception-V3 architecture, and apply multi-feature fusion
by integrating feature maps from the layers of Inception Module C with the final layer.
FIGURE 4.10 Proposed Multi-Layer Feature-Fusion based Classification (MLFFC) model
In the proposed MLFFC model, the multi-feature fusion technique is applied between a
Module C layer of the Inception-V3 architecture and its final layer. As Module C provides
higher-level feature representations, the feature-fusion approach helps to classify
emotions efficiently. Module C of the Inception-V3 architecture contains the mixed8,
mixed9 and mixed10 layers, followed by the last convolutional layer where the feature
vector is generated. Features are extracted from these layers and concatenated with the
final layer in an inter-layer feature-fusion process before being passed to the
classifier. To identify which layer's fusion gives the better result in terms of
recognition accuracy, an evaluation is carried out for all of these combinations, and the
layer providing the best recognition accuracy is finally selected for the fusion approach.
The proposed MLFFC model aims to improve recognition accuracy on real-world facial
expression datasets, which contain lower-resolution images and other challenges. The
proposed approach is also applied to a laboratory-controlled facial expression dataset for
a cross-database evaluation study. The proposed MLFFC algorithm is explained below.
4.3.1 Detailed Process (Algorithm) of Multi-Layer feature-fusion based
classification (MLFFC)
base_model = InceptionV3
Input: CK+ dataset (593 grey-scaled images) and FER2013 dataset (35887 grey-scaled images)
1. Initialize parameters nb_class, x, y, epoch, lr, bs, c, where
   nb_class = number of facial expression classes
   x = height of the image
   y = width of the image
   epoch = number of iterations
   lr = learning rate
   bs = batch size
   c = classifier
2. for 1 : epoch
      train_data, val_data = train_test_split(dataset, 0.8)
      bs in {8, 16} and lr in {1e-1, ..., 1e-5}   // network parameters varied per run
      for 1 : last_block_layer
         if layer = 8                             // mixed8 layer of Inception Module C
            f1 = extract features from mixed8 layer of Inception Module C
            f2 = extract features from final layer of base_model
            res_f = concat(f1, f2)                // feature-fusion approach
            acc1 = predict result of res_f using classifier c
         end if
         if layer = 9                             // mixed9 layer of Inception Module C
            f1 = extract features from mixed9 layer of Inception Module C
            f2 = extract features from final layer of base_model
            res_f = concat(f1, f2)                // feature-fusion approach
            acc2 = predict result of res_f using classifier c
         end if
         if layer = 10                            // mixed10 layer of Inception Module C
            f1 = extract features from mixed10 layer of Inception Module C
            f2 = extract features from final layer of base_model
            res_f = concat(f1, f2)                // feature-fusion approach
            acc3 = predict result of res_f using classifier c
         end if
      end for
   end for
3. Calculate maximum recognition accuracy:
   final_acc = max(acc1, acc2, acc3)
4. END
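A minimal Keras sketch of the fusion step in the algorithm above, for the mixed9 case that
section 4.5.2 later selects; the layer name 'mixed9' and the pooled final output follow
the stock Keras InceptionV3 model, while the input size and learning rate here are
placeholders.

    import tensorflow as tf
    from tensorflow.keras.applications import InceptionV3
    from tensorflow.keras import layers, Model

    NB_CLASS = 7   # number of facial expression classes

    base = InceptionV3(weights='imagenet', include_top=False,
                       input_shape=(299, 299, 3), pooling='avg')

    f2 = base.output                                    # final-layer features, f2
    f1 = layers.GlobalAveragePooling2D()(               # Module C features, f1
        base.get_layer('mixed9').output)

    res_f = layers.Concatenate()([f1, f2])              # res_f = concat(f1, f2)
    out = layers.Dense(NB_CLASS, activation='softmax')(res_f)

    model = Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
                  loss='categorical_crossentropy', metrics=['accuracy'])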
4.4 Dataset Details:
The proposed Multi-Layer Feature-Fusion based Classification (MLFFC) model is tested on
the CK+ [134] and FER2013 [133] datasets by varying network parameters such as batch size
and learning rate. The Stochastic Gradient Descent (SGD) optimizer and the SoftMax
classifier are used for evaluation.
4.4.1 CK+ Dataset:
The Extended Cohn-Kanade (CK+) database is the most extensively used laboratory-controlled
database for evaluating facial expression recognition systems. CK+ contains 593 image
sequences from 123 subjects. The subjects are university students ranging from 18 to 30
years old, of whom 65% are female, 15% are African-American, and 3% are Asian or South
American. The emotions consist of anger, disgust, fear, happiness, sadness, surprise and
neutral. In total, the CK+ dataset contains 10,674 grey-scaled images with a resolution of
640x490 pixels. Example images from the CK+ dataset are shown in figure 4.11 [126]
FIGURE 4.11 Example of images in the CK+ dataset with different emotions [126]
4.4.2 FER2013 Dataset:
The FER2013 dataset was introduced during the International Conference on Machine Learning
(ICML) challenge in representation learning. FER2013 is a large-scale, unconstrained
database collected automatically via the Google image search API. The FER2013 dataset
contains 35,887 real-world images taken in uncontrolled environments, so it presents many
research challenges such as head-pose variations, illumination and lower resolution. The
images contain seven different facial expressions: anger, disgust, happy, surprise, fear,
sad and neutral, with a resolution of 48x48 pixels. Example images from the FER2013
dataset are shown in figure 4.12 [11]
FIGURE 4.12 Example of images in the FER2013 dataset with different emotions [135]
4.5 Experiment and Results:
In this section, we describe the experimental setup, the implementation details with the
required library support, the different parameters used in the proposed model and their
optimization, the benchmark datasets CK+ [134] and FER2013 [133] used to assess the
performance of the proposed model, the results of the proposed model on these datasets,
and a comparison of these results with state-of-the-art methods. Both datasets are divided
in an 80%-20% ratio for the training and validation process.
4.5.1 Experimental Setup and Implementation Details:
The proposed MLFFC model is implemented using Python, OpenCV and the deep learning API
Keras with TensorFlow as the backend. The experiments are performed on an NVIDIA GeForce
GTX 1050Ti 4GB Graphics Processing Unit (GPU) in an 8th-generation Intel Core i7 Windows
system with 16GB RAM. The Anaconda deep learning environment is installed with the
Keras-TensorFlow libraries, and PyCharm Community Edition is used as the IDE for
implementation in the Python programming language.
In this implementation, we use different evaluation parameters: a maximum of 100 epochs
and batch sizes of 16, 32 and 64, with learning rate values ranging from 0.1 to 0.00001,
for detailed analysis. Stochastic Gradient Descent (SGD) is used as the optimizer and
SoftMax as the classifier. The selection of network parameters such as epochs, batch size,
learning rate, optimizer and classifier is essential for achieving good recognition
accuracy.
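To show how such a parameter sweep can be organised, here is a hypothetical sketch in
which a toy CNN and synthetic tensors stand in for the MLFFC model and the datasets; only
the loop structure reflects the evaluation described above.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, models

    # Synthetic stand-ins for 48x48 grayscale faces and one-hot labels
    x_train = np.random.rand(64, 48, 48, 1).astype('float32')
    y_train = tf.keras.utils.to_categorical(np.random.randint(0, 7, 64), 7)

    def build_model(lr):
        m = models.Sequential([
            layers.Conv2D(16, 3, activation='relu', input_shape=(48, 48, 1)),
            layers.GlobalAveragePooling2D(),
            layers.Dense(7, activation='softmax')])
        m.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  loss='categorical_crossentropy', metrics=['accuracy'])
        return m

    results = {}
    for bs in (16, 32, 64):                         # batch sizes under study
        for lr in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5):   # learning rates under study
            hist = build_model(lr).fit(x_train, y_train, epochs=2,
                                       batch_size=bs, verbose=0)
            results[(bs, lr)] = hist.history['accuracy'][-1]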
4.5.2 Experimental Results on Inception Module C layers:
As discussed for the Inception-V3 architecture, the inter-layer feature-fusion approach is
applied to the layers of Inception Module C, which contain higher-level feature
representations. The approach combines a layer of Inception Module C with the final layer
of the Inception-V3 architecture. To decide which layer of Inception Module C performs
best in the inter-layer feature-fusion approach, experiments are carried out applying the
proposed approach and measuring the recognition accuracy. The accuracy of the different
layers in the proposed architecture is compared on the CK+ dataset using the network
parameters Max Epochs = 100, Batch Size = 16 and Learning Rate = 0.001. The experimental
results are shown in Table 4.1 below.
Table 4.1: Comparison of accuracy for different layers of the proposed MLFFC architecture

| Inter-Layer Feature-Fusion on layers of Inception Module C | Validation Accuracy (%) |
|---|---|
| Mixed 8 layer + final layer | 94.17 |
| Mixed 9 layer + final layer | 99.63 |
| Mixed 10 layer + final layer | 97.16 |
From the experimental results in table 4.1, the concatenation of the mixed 9 layer with the final layer is selected for further evaluation of the proposed MLFFC model. Before the classifier is applied, the MLFFC model combines the features of the mixed 9 layer and the final neural network layer. This fusion generates the final feature vector, which is then provided to the classifier for classification. The proposed MLFFC model is tested on the CK+ and FER2013 datasets with varying network parameters to improve the recognition accuracy on both the real-time dataset and the laboratory-trained dataset for cross-database evaluation.
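A minimal sketch of this inter-layer fusion is given below, assuming the Keras implementation of Inception-V3 (whose intermediate layers are named 'mixed0' to 'mixed10'); pooling choices and input size are assumptions for illustration, not the thesis's exact configuration.

from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import GlobalAveragePooling2D, Concatenate, Dense
from tensorflow.keras.models import Model

NB_CLASSES = 7  # seven facial expressions

base = InceptionV3(weights='imagenet', include_top=False,
                   input_shape=(299, 299, 3))

# pooled features from the intermediate 'mixed9' layer of Module C
mixed9 = GlobalAveragePooling2D()(base.get_layer('mixed9').output)
# pooled features from the final convolutional layer ('mixed10')
final = GlobalAveragePooling2D()(base.output)
fused = Concatenate()([mixed9, final])  # inter-layer feature-fusion
outputs = Dense(NB_CLASSES, activation='softmax')(fused)  # SoftMax classifier

mlffc = Model(inputs=base.input, outputs=outputs)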
4.5.3 Experimental Results on CK+ dataset:
The proposed MLFFC model is tested on the CK+ dataset with varying network parameter values: batch sizes of 8 and 16, and learning rates from 0.1 to 0.00001, with SGD as the optimizer and SoftMax as the classifier. Experiments were carried out in two ways: without the feature-fusion approach and with the feature-fusion approach applied to the Inception-V3 architecture. Table 4.2 shows that the proposed MLFFC model yields better recognition accuracy on the CK+ dataset across the different batch-size and learning-rate values. The maximum accuracy of 99.69% is obtained for batch size 16 and learning rate 0.1 using the proposed MLFFC model. Since the CK+ dataset contains only 593 image sequences, as explained in section 4.4, only the two batch sizes 8 and 16 are used for evaluation with the different learning rates.
Table 4.2: Results on the CK+ dataset with and without the feature-fusion approach on the proposed MLFFC model
Accuracy (%) of the MLFFC model on the CK+ dataset (Max. Epochs = 50)
                          Learning rate
Batch Size      0.1      0.01     0.001    0.0001   0.00001
Without Feature-Fusion
8               99.07    99.02    99.38    99.07    99.38
16              99.17    98.91    99.02    99.17    98.91
With Feature-Fusion
8               99.63    99.43    99.58    99.38    99.63
16              99.69    99.27    99.63    99.27    99.17
Many researchers are trying to improve recognition accuracy on real-time facial expression datasets. A comparative analysis of the proposed MLFFC model with state-of-the-art methods on the CK+ dataset is shown in table 4.3. The confusion matrix is shown in figure 4.13, and the detailed classification report and ROC-AUC for the proposed MLFFC model are shown in figures 4.14 and 4.15. The ROC-AUC curve on the CK+ dataset is shown both without and with the feature-fusion approach; in both cases the model's performance is measured as AUC against epochs, based on the classifier. The higher the AUC, the better the model predicts. The ROC-AUC analysis on the CK+ dataset shows an average AUC of 92% without feature-fusion and 96% with the feature-fusion approach; an AUC value between 0.9 and 1.0 indicates excellent performance. Accuracy charts of the proposed MLFFC model on the CK+ dataset for batch sizes 8 and 16 are shown in figures 4.16 and 4.17.
Table 4.3: Comparative analysis of the proposed MLFFC model with state-of-the-art methods on the CK+ dataset
Method Name                                                          Accuracy (%)
Multi Scale CNN (Discriminative Analysis + Auto Encoder) [136] 91.4%
3DCNN Model [137] 92.4%
Deep Neural Network Architecture [138] 93.2%
DSAE (Deep Sparse Auto Encoders) [139] 93.7%
IACNN (Identity Aware CNN) Model [140] 95.3%
Zero-Bias CNN Method [142] 98.3%
Siamese Model for expression recognition network [141] 98.5%
Concatenate of four CNN Models in parallel [126] 99.07%
Novel Texture Extraction Method [143] 99.36%
Proposed MLFFC Model 99.6%
FIGURE 4.13 Confusion matrix using the proposed MLFFC model on the CK+ dataset
FIGURE 4.14 Classification Report for the proposed MLFFC model on the CK+ dataset
(a) (b)
FIGURE 4.15 ROC-AUC curve on the CK+ dataset for (a) without feature-fusion (b) with feature-fusion
FIGURE 4.16 Accuracy graph of the proposed MLFFC model for the CK+ dataset for batch size 8
FIGURE 4.17 Accuracy graph of the proposed MLFFC model for the CK+ dataset for batch size 16
4.5.4 Experimental Results on FER2013 dataset:
Many researchers are trying to improve recognition accuracy on the real-time FER2013 dataset, as it contains real-world images with many challenges for facial expression recognition. The proposed MLFFC model is tested on the FER2013 real-time dataset with varying network parameter values: batch sizes of 8 and 16, and learning rates from 0.1 to 0.00001, with SGD as the optimizer and SoftMax as the classifier. Experiments were carried out in two ways: without the feature-fusion approach and with the feature-fusion approach applied to the Inception-V3 architecture. The results are given in table 4.4.
Table 4.4: Results on the FER2013 dataset with and without the feature-fusion approach on the proposed MLFFC model
Accuracy (%) of the MLFFC model on the FER2013 dataset (Max. Epochs = 50)
                          Learning rate
Batch Size      0.1      0.01     0.001    0.0001   0.00001
Without Feature-Fusion
8               62.97    66.98    65.17    63.89    65.7
16              68.18    67.54    67.03    66.36    66.11
With Feature-Fusion
8               67.93    68.06    68.79    67.81    67.7
16              68.43    68.26    70.29    68.93    69.37
Table 4.4 shows the FER2013 results for the different batch-size and learning-rate values. The maximum accuracy of 70.29% is achieved for batch size 16 and learning rate 0.001 using the proposed MLFFC model. Many researchers have worked to improve recognition accuracy on the FER2013 real-time facial expression dataset. A comparative analysis of the proposed MLFFC model on the FER2013 dataset is shown in table 4.5, where the proposed model achieves the third-highest recognition accuracy among the compared methods. The confusion matrix is shown in figure 4.18, and the detailed classification report and ROC-AUC curve for the proposed MLFFC model are shown in figures 4.19 and 4.20. The ROC-AUC curve on the FER2013 dataset is shown both without and with the feature-fusion approach; in both cases the model's performance is measured as AUC against epochs, based on the classifier. The higher the AUC, the better the model predicts. The ROC-AUC analysis on the FER2013 dataset shows excellent prediction performance with the feature-fusion approach. Because this real-time dataset contains low-resolution images, some variance arises during evaluation of the model, so the AUC graph is not a smooth curve. The average AUC value lies between 0.9 and 0.97. Accuracy charts of the proposed MLFFC model on the FER2013 dataset for batch sizes 8 and 16 are shown in figures 4.21 and 4.22. A comparative analysis of the error rate on both datasets using the proposed MLFFC approach is shown in table 4.6.
Table 4.5: Comparative analysis of the proposed MLFFC model with state-of-the-art methods on the FER2013 dataset
Method Name                                                  Accuracy (%)
An Ensemble of CNN – Subnets [122] 65.03%
Deep Neural Network [138] 66.4%
Multi-Task Network [144] 67.2%
Auxiliary Model [126] 67.7%
DCN+AMN (Alignment Mapping Network) [145] 71.8%
Ensemble of 3 MLCNN Model [56] 73.03%
Proposed MLFFC Model 70.29%
FIGURE 4.18 Confusion matrix using the proposed MLFFC model on the FER2013 dataset
FIGURE 4.19 Classification Report for the proposed MLFFC model on the FER2013 dataset
(a) (b)
FIGURE 4.20 ROC-AUC curve on the FER2013 dataset for (a) without feature-fusion (b) with feature-
fusion
FIGURE 4.21 Accuracy graph of the proposed MLFFC model for the FER2013 dataset with batch size 8
FIGURE 4.22 Accuracy graph of the proposed MLFFC model for FER2013 dataset with batch size 16
Table 4.6: Comparative analysis of the error rate on both datasets using the proposed MLFFC model
            Without Fusion Method                        With Fusion Method
Database    Recognition Accuracy (%)  Error rate (%)     Recognition Accuracy (%)  Error rate (%)
CK+         99.17                     0.83               99.69                     0.31
FER2013     67.03                     32.97              70.29                     29.71
4.6 Discussion and Summary:
Improving recognition accuracy on real-time facial expression datasets is a significant challenge, since such data involves head-pose variations, illumination changes, low-resolution images and so on. In this chapter, we have proposed a Multi-Layer Feature-Fusion based Classification (MLFFC) model built on an Inter-Layer Feature-Fusion approach. The model aims to integrate feature maps from different layers instead of from the last layer only. In the proposed MLFFC model, inter-layer feature-fusion combines an internal layer of Module C of the Inception-V3 CNN architecture with its final layer to improve recognition accuracy. The proposed model is evaluated on the FER2013 and CK+ datasets. Experimental results show that the proposed MLFFC model achieves better recognition accuracy on the real-time facial expression dataset as well as on the laboratory-trained facial expression dataset. On the FER2013 real-time facial expression dataset, the proposed model achieves a recognition accuracy of 70.29%, which compares favourably with most state-of-the-art methods. For the cross-database evaluation on the CK+ laboratory-trained facial expression dataset, the proposed model achieves the best recognition accuracy (99.6%) compared with the state-of-the-art methods. The proposed feature-fusion approach also reduces the error rates on both datasets: without feature-fusion, the error rates are 0.83% for CK+ and 32.97% for FER2013; with the proposed feature-fusion approach, the error rate reduces from 0.83% to 0.31% for CK+ and from 32.97% to 29.71% for FER2013. The inter-layer feature-fusion approach of the proposed MLFFC model thus helps to overcome the challenges of both real-time and laboratory-trained facial expression datasets, with improved recognition accuracy in the experimental results.
CHAPTER 5
Proposed Multi-Model Feature-Fusion based Classification (MMFFC) Model
5.1 Introduction
In this chapter, the proposed Multi-Model Feature-Fusion based Classification (MMFFC) approach is presented, which concatenates features from different CNN architectures for facial expression recognition. From the literature survey, it is found that most existing approaches use a single CNN model to extract features for facial expression recognition. To improve recognition accuracy, several researchers have proposed an ensemble-of-CNNs concept in which layers from different CNN architectures are concatenated to generate a final feature vector. Long Nguyen et al. [55] and Vaidehi et al. [56] used an ensemble of different CNN architectures for feature extraction and concatenated the outputs of these architectures into a single feature vector for classification, instead of using one feature map from a single CNN architecture. Using this approach, they conclude that the concatenation of features from various networks helps to overcome the limitations of a single network and produces robust, superior performance, which in turn helps to improve the recognition accuracy of the model. A sample architecture of the ensemble approach is shown in figure 5.1: the same input database images are provided to different CNN architectures, the final feature vector is generated by combining the outputs of these architectures, and it is then passed to the classification stage.
Figure 5.1: Sample architecture of Ensemble of multi-CNN [55]
In our proposed MMFFC model, we analysed the performance of different CNN architectures (InceptionV3, VGG16, VGG19 and ResNet50) for the ensemble approach and, based on their performance, concatenated the feature vectors generated at the last layers of the two CNN models VGG16 and ResNet50, providing the same input images to both architectures. The feature vectors generated by VGG16 and ResNet50 are then concatenated into a single final feature vector before the classification stage.
5.2 VGG-16 CNN Architecture:
VGG-16 is a Convolutional Neural Network (CNN) architecture developed by Simonyan and Zisserman from the University of Oxford for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, where it finished as first runner-up in the classification task. It is still considered one of the excellent vision model architectures. The distinctive feature of VGG-16 is that, instead of a large number of hyper-parameter choices, it uses 16 weight (convolutional/fully-connected) layers built from 3x3 convolutions with stride 1 and same padding, together with 2x2 max-pooling with stride 2, and it follows this arrangement of convolution and max-pool layers consistently throughout the whole architecture. At the end it has two fully connected (FC) layers followed by a SoftMax output. The "16" in VGG-16 refers to its 16 layers with weights. The VGG-16 architecture is shown in figure 5.3, and the architecture with its layer details is shown in figure 5.2.
Figure 5.2: VGG-16 Architecture diagram with its layer’s details [55]
Figure 5.3: VGG-16 Architecture diagram [146]
To summarize, VGG-16 consists of 16 weight layers: 13 convolutional layers with 3x3 filters and three fully connected layers. The stride and padding of all convolutional layers are fixed to 1 pixel. The convolutional layers are divided into five groups, each followed by a max-pooling layer carried out over a 2x2 window with stride 2. The number of filters starts at 64 in the first group and doubles after each max-pooling layer until it reaches 512. All hidden layers are equipped with the rectification (ReLU) non-linearity. The last layer of the model is the SoftMax layer used for classification; it can be replaced by another suitable classifier such as a neural network, random forest or support vector machine. A dropout layer is used to control overfitting of the network [147].
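A quick way to inspect this structure, assuming the standard Keras application model, is:

from tensorflow.keras.applications import VGG16

model = VGG16(weights='imagenet')  # 13 conv layers + 3 fully connected layers
model.summary()  # prints the five conv groups, each ending in a 2x2 max-pool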
5.3 ResNet-50 CNN Architecture:
ResNet, short for Residual Network is a specific type of neural network introduced in
2015 by Xiangyu Zhang et al. [148]. The Residual network is a classical neural network
used as a backbone for many computer vision tasks. This model was the winner of the
ImageNet challenge in 2015. The fundamental breakthrough with ResNet was it allowed
us to train extremely deep neural networks with 150+ layers successfully. It is similar in
92
architecture to networks such as VGG-16 but with the additional identity mapping
capability. ResNet models fit a residual mapping to predict the delta needed to reach the
final prediction from one layer to the next rather than fitting the latent weights to predict
the final emotion at each layer. The identity mapping enables the model to bypass a typical
CNN weight layer if the current layer is not necessary. This further helps the model to
avoid overfitting to the training set. From an overall architecture and performance
perspective, ResNet allows for much deeper networks while training much faster than
other CNNs. The problem of training very deep networks has been alleviated with the
introduction of ResNet, and these ResNets are made up from Residual Blocks concept
which is shown in figure 5.4 [148,149]
Figure 5.4: Residual Learning: a building block concept [149]
The residual learning block shown in figure 5.4 uses the skip connection concept, which is the core of residual blocks. Without a skip connection, the input x is multiplied by the layer weights, a bias term is added, and the result passes through the activation function f(), giving the output H(x) = f(x). With the skip connection, the output becomes H(x) = f(x) + x. A slight complication arises when the input dimensions differ from the output dimensions, which can happen with convolutional and pooling layers. When the dimensions of f(x) differ from those of x, there are two ways to resolve the mismatch: first, the skip connection can be padded with extra zero entries to increase its dimensions; second, a projection can be used to match the dimensions by adding 1x1 convolutional layers to the input. The skip connections in ResNet alleviate the vanishing gradient problem in deep neural networks by providing an alternate shortcut path for the gradient to flow through. The ResNet architecture is shown in figure 5.5, which compares plain and residual networks.
Figure 5.5: ResNet Architecture diagram comparison to plain network [150]
The diagram above visualizes the ResNet-34 architecture. For the ResNet-50 model, each two-layer residual block is replaced with a three-layer bottleneck block that uses 1x1 convolutions to reduce and subsequently restore the channel depth, reducing the computational load of the 3x3 convolution, as shown in figure 5.6.
Figure 5.6: Diagram showing the conversion of a residual block [150]
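The skip-connection idea described above can be sketched in a few lines of Keras code. This is an illustration of the concept H(x) = f(x) + x with a 1x1 projection shortcut, not the exact ResNet-50 block; the function name residual_block is hypothetical.

from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or x.shape[-1] != filters:
        # projection shortcut: 1x1 convolution to match the output dimensions
        shortcut = layers.Conv2D(filters, 1, strides=stride)(x)
    # H(x) = f(x) + x: add the shortcut back before the final activation
    return layers.ReLU()(layers.Add()([y, shortcut]))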
5.4 Proposed MMFFC Model:
Inspired by the ensembles of different CNN architectures described by Long Nguyen et al. [55] and Vaidehi et al. [56], we combine the VGG16 and ResNet50 CNN architectures through an ensemble approach in our proposed MMFFC model. We analysed the performance of different CNN architectures (InceptionV3, VGG16, VGG19 and ResNet50) for the ensemble approach and, based on their performance, selected VGG16 and ResNet50 as the two CNN architectures. In an ensemble approach, the concatenation of features generated from various networks helps to overcome the limitations of a single network and produces robust, superior performance. We obtain a feature vector from each of the VGG16 and ResNet50 architectures individually and concatenate these outputs into a single feature vector before the final emotion-prediction classification. The size of the final feature vector generated by this ensemble approach is the sum of the sizes of the feature vectors generated by each network. A sample framework of the ensemble of multi-CNN architectures for feature-fusion is shown in figure 5.7, and the proposed Multi-Model Feature-Fusion based Classification (MMFFC) model is shown in figure 5.8.
Figure 5.7: Sample framework of Ensemble of Multi-CNN feature-fusion [126]
Figure 5.8: Proposed Multi-Model Feature-Fusion based Classification (MMFFC) Model
As shown in figure 5.8, in the proposed MMFFC model VGG16 and ResNet50 serve as CNN architecture 1 and CNN architecture 2, respectively, in an ensemble approach. The same input images are provided to both architectures. Both architectures are trained and each generates a feature vector at its last layer, before the classification stage. Feature vector 1 (fv1) from CNN architecture 1 (VGG16) and feature vector 2 (fv2) from CNN architecture 2 (ResNet50) are combined to create the final feature vector fv. The ensemble approach gains its advantage here by concatenating the outputs of these architectures into a single feature vector for classification, instead of using one feature vector from a single CNN architecture. The proposed MMFFC model aims to improve recognition accuracy on a real-time facial expression dataset, which contains challenging real-world images; the approach is also applied to a laboratory-trained facial expression dataset for a cross-database evaluation study. The proposed MMFFC algorithm is explained below:
5.4.1 Detailed Process(Algorithm) of Multi-Model feature-fusion based
classification (MMFFC)
base_model1 = VGG16
base_model2 = ResNet50
Input: FER2013 dataset (35887 grey-scaled images) and KDEF dataset (4900
RGB images)
1. Initialize parameters nb_class, x, y, epoch, lr, bs, c, where
nb_class = number of facial expression classes
x = height of the image
y = width of the image
epoch = number of iterations
lr = learning rate
bs = batch size
c = classifier
2. for 1: epochs
train_data, val_data = train_test_split (dataset, 0.8)
bs = 16,32,64 and lr = 1e-1 to 1e-3 // for FER2013 dataset
for 1: last_block_layer // base_model1 VGG16
fv1 = generate feature vector for base_model1
acc1 = predict result of fv1 using classifier c
end for
for 1: last_block_layer // base_model2 ResNet50
fv2 = generate feature vector for base_model2
acc2 = predict result of fv2 using classifier c
end for
res_fr = concatenate (fv1, fv2) // multi-model fusion approach
final_acc = predict result of res_fr using classifier c
end for
3. for 1: epochs
train_data, val_data = train_test_split (dataset, 0.8)
bs = 16,32 and lr = 1e-1 to 1e-5 // for KDEF dataset
for 1: last_block_layer // base_model1 VGG16
fv1 = generate feature vector for base_model1
acc1 = predict result of fv1 using classifier c
end for
for 1: last_block_layer // base_model2 ResNet50
fv2 = generate feature vector for base_model2
acc2 = predict result of fv2 using classifier c
end for
res_fr = concatenate (fv1, fv2) // multi-model fusion approach
final_acc = predict result of res_fr using classifier c
end for
4. END
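A minimal runnable sketch of the fusion step in this algorithm is given below: the same input is fed to VGG16 and ResNet50, the two pooled feature vectors fv1 and fv2 are concatenated into fv, and a SoftMax classifier predicts the expression. The input size, pooling choice and use of ImageNet weights are assumptions for illustration.

from tensorflow.keras.applications import VGG16, ResNet50
from tensorflow.keras.layers import Input, GlobalAveragePooling2D, Concatenate, Dense
from tensorflow.keras.models import Model

NB_CLASSES = 7
inputs = Input(shape=(224, 224, 3))

vgg = VGG16(weights='imagenet', include_top=False)
resnet = ResNet50(weights='imagenet', include_top=False)

fv1 = GlobalAveragePooling2D()(vgg(inputs))     # feature vector from VGG16 (512-d)
fv2 = GlobalAveragePooling2D()(resnet(inputs))  # feature vector from ResNet50 (2048-d)
fv = Concatenate()([fv1, fv2])                  # fused vector: 512 + 2048 = 2560-d
outputs = Dense(NB_CLASSES, activation='softmax')(fv)  # SoftMax classifier

mmffc = Model(inputs=inputs, outputs=outputs)

The dimensionality of fv illustrates the point made in section 5.4: the fused vector's size is the sum of the sizes of the individual feature vectors.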
5.5 Dataset details:
The proposed Multi-Model Feature-Fusion based Classification (MMFFC) model is tested on the real-time facial expression dataset FER2013 [133] and the laboratory-trained facial expression dataset KDEF [151] by varying network parameters such as batch size and learning rate. The Stochastic Gradient Descent (SGD) optimizer and the SoftMax classifier are used for evaluation.
5.5.1 FER2013 Dataset:
The FER2013 dataset was introduced for the Challenges in Representation Learning workshop at the International Conference on Machine Learning (ICML). FER2013 is a large-scale, unconstrained database collected automatically with the Google image search API. It contains 35,887 real-world images taken in an uncontrolled environment, so it poses many research challenges such as head-pose variation, illumination changes and low resolution. The images cover seven facial expressions (Anger, Disgust, Happy, Surprise, Fear, Sad and Neutral) at a resolution of 48x48 pixels. Example images from the FER2013 dataset are shown in figure 5.9 [11].
5.5.2 KDEF Dataset:
The Karolinska Directed Emotional Faces (KDEF) dataset was created by Lundqvist, Flykt and Öhman [151] from the psychology section of the Department of Clinical Neuroscience, Karolinska Institute. KDEF is a set of 4,900 pictures of human facial expressions: 70 individuals displaying seven different emotional expressions, each viewed from five different angles. The images cover seven facial expressions (Anger, Disgust, Happy, Surprise, Fear, Sad and Neutral) at a resolution of 224x224 pixels. Example images from the KDEF dataset are shown in figure 5.10 [151,152].
FIGURE 5.10 Sample images in the KDEF dataset with different emotions [152]
5.6 Experiments and Results:
In this section, we describe the experimental setup, the implementation details with the necessary library support, the parameters used in the proposed model and their optimization, the benchmark datasets KDEF [151] and FER2013 [133] used to assess the performance of the proposed model, the results of the proposed model on these datasets, and a comparison of these results with state-of-the-art methods. Both datasets are divided in an 80%-20% ratio for the training and validation process.
5.6.1 Experimental Setup and Implementation Details:
The proposed MMFFC model is implemented in Python in a deep learning environment using the Keras API with TensorFlow as the backend. The experiments are performed on an NVIDIA GeForce GTX 1050Ti 4GB Graphics Processing Unit (GPU) in an 8th-generation Intel Core i7 Windows system with 16GB RAM. The Anaconda deep learning environment is installed with the Keras-TensorFlow libraries, and Google Colab and PyCharm Community Edition are used as the development tools for the Python implementation.
In this implementation, we use a maximum of 100 epochs, batch sizes of 16, 32 and 64, and learning rates ranging from 0.1 to 0.0001 for detailed analysis. Stochastic Gradient Descent (SGD) is used as the optimizer and SoftMax as the classifier. Careful selection of these network parameters (epochs, batch size, learning rate, optimizer and classifier) is essential for achieving good recognition accuracy.
5.6.2 Experimental results of the Ensemble approach using different CNN architectures:
As discussed in section 5.4, we analysed the performance of different CNN architectures (InceptionV3, VGG16, VGG19 and ResNet50) for the ensemble approach. To decide which combination of two CNN architectures gives the best recognition accuracy, experiments were carried out by applying the proposed approach to each combination and measuring the recognition accuracy. The accuracies of the different combinations are evaluated on the FER2013 dataset with network parameters Max Epochs = 100, Batch Size = 16 and Learning Rate = 0.001. The experimental results are shown in table 5.1.
Table 5.1: Comparison of accuracy for ensembles of two different CNN architectures
Ensemble of two CNN architectures for the proposed MMFFC model    Validation Accuracy (%)
InceptionV3 + VGG-16 50.24%
InceptionV3 + ResNet50 50.09%
VGG-16 + ResNet50 68.14%
VGG-19 + ResNet50 66.50%
From the experimental results in table 5.1, the combination of the VGG-16 and ResNet50 CNN architectures gives better recognition accuracy than the other combinations, so VGG-16 and ResNet50 are selected for the proposed MMFFC model. The feature vectors generated at the last layers of the individual VGG-16 and ResNet50 architectures are concatenated into a resultant feature vector, following the ensemble approach, and this vector is provided to the classifier for classification. The proposed MMFFC model is tested on the KDEF and FER2013 datasets with varying network parameters to improve the recognition accuracy on both the real-time dataset and the laboratory-trained dataset for cross-database evaluation.
5.6.3 Experimental results on FER2013 dataset:
The proposed MMFFC model is tested on the FER2013 real-time facial expression dataset with varying network parameter values: batch sizes of 16, 32 and 64, and learning rates from 0.1 to 0.001, with Stochastic Gradient Descent (SGD) as the optimizer and SoftMax as the classifier. Experiments are first carried out for the individual CNN architectures VGG-16 and ResNet50 with these network parameters, and then for the proposed MMFFC model (VGG-16 + ResNet50) using the ensemble approach. The results are given in table 5.2.
Table 5.2 shows the recognition accuracy on the FER2013 dataset for the individual VGG-16 and ResNet50 architectures and for the proposed MMFFC model using the ensemble of the two. The maximum recognition accuracy of the individual VGG-16 architecture is 67.36%, obtained for batch size 16 and learning rate 0.001; similarly, the individual ResNet50 architecture reaches a maximum of 64.24% for batch size 32 and learning rate 0.001. Compared with these two results, a higher maximum accuracy of 68.14% is obtained using the proposed MMFFC approach, for batch size 64 and learning rate 0.01. As discussed in the literature review, many researchers have worked to improve recognition accuracy on the FER2013 real-time facial expression dataset.
Table 5.2: FER2013 dataset performance for VGG16, ResNet50 and the proposed MMFFC model using the ensemble approach
Accuracy (%) on the FER2013 dataset (Max. Epochs = 50)
                Learning rate
Batch Size      0.1      0.01     0.001
Results of VGG16 Model
16              66.06    67.15    67.36
32              65.57    63.83    64.11
64              58.79    62.83    61.61
Results of ResNet50 Model
16              63.37    58.6     63.46
32              63.18    61.61    64.24
64              63.43    62.94    61.26
Results of Proposed MMFFC Model
16              66.93    65.63    67.61
32              68.08    65.97    67.33
64              67.95    68.14    66.99
A comparative analysis of the proposed MMFFC model against other multi-model methods on the FER2013 dataset is shown in table 5.3. The confusion matrix is shown in figure 5.11, and the detailed classification report and ROC-AUC curve for the proposed MMFFC model are shown in figures 5.12 and 5.13. The ROC-AUC curve on the FER2013 dataset is shown both without and with the ensemble-model approach; in both cases the model's performance is measured as AUC against epochs, based on the classifier. The higher the AUC, the better the model predicts. The ROC-AUC analysis on the FER2013 dataset shows better prediction performance with the ensemble-model approach. Because this real-time dataset contains low-resolution images, some variance arises during evaluation of the model, so the AUC graph is not a smooth curve. The average AUC value generated here lies between 0.8 and 0.9. Accuracy charts of the proposed MMFFC model on the FER2013 dataset for the different batch sizes (16, 32 and 64) are shown in figures 5.14, 5.15 and 5.16.
Table 5.3: Comparative analysis of the proposed MMFFC model with state-of-the-art methods on the
FER2013 dataset
Method Name Accuracy (%)
An Ensemble of CNN – Subnets [122] 65.03%
Deep Neural Network [138] 66.4%
Multi-Task Network [144] 67.2%
Auxiliary Model [126] 67.7%
DCN+AMN (Alignment Mapping Network) [145] 71.8%
Ensemble of 3 MLCNN Model [56] 73.03%
Proposed MMFFC Model 68.14%
FIGURE 5.11 Confusion matrix using the proposed MMFFC model on the FER2013 dataset
FIGURE 5.12 Classification Report for the proposed MMFFC model on the FER2013 dataset
(a) (b)
FIGURE 5.13 ROC-AUC curve on the FER2013 dataset for (a) without multi-model fusion (b) with multi-
model fusion
FIGURE 5.14 Accuracy graph of the proposed MMFFC model for the FER2013 dataset for batch size 16
FIGURE 5.15 Accuracy graph of the proposed MMFFC model for the FER2013 dataset for batch size 32
FIGURE 5.16 Accuracy graph of the proposed MMFFC model for the FER2013 dataset for batch size 64
5.6.4 Experimental results on KDEF dataset:
The proposed MMFFC model is also tested on the laboratory-trained KDEF dataset for a cross-database evaluation study, with varying network parameter values: batch sizes of 16 and 32, and learning rates from 0.1 to 0.001, with Stochastic Gradient Descent (SGD) as the optimizer and SoftMax as the classifier. Experiments are carried out for the proposed MMFFC model (VGG-16 + ResNet50) using the ensemble approach, and the results are given in table 5.4. They show that the maximum recognition accuracy on the KDEF dataset, 92.04%, is obtained for batch size 16 and learning rate 0.001.
Table 5.4: KDEF dataset performance for VGG16, ResNet50 and the proposed MMFFC model using the ensemble approach
Accuracy (%) on the KDEF dataset (Max. Epochs = 50)
                Learning rate
Batch Size      0.1      0.01     0.001
Results of VGG16 Model
16              90.61    90       89.79
32              88.36    87.77    89.59
Results of ResNet50 Model
16              88.97    83.87    90.81
32              85.1     81.63    87.55
Results of Proposed MMFFC Model
16              91.42    91.02    92.04
32              89.18    91.83    89.8
As discussed in the literature review, many researchers have also tried to improve recognition accuracy on laboratory-trained facial expression datasets for cross-dataset evaluation. A comparative state-of-the-art analysis of the proposed MMFFC model on the KDEF dataset is shown in table 5.5. The confusion matrix is shown in figure 5.17, and the detailed classification report and ROC-AUC curve are shown in figures 5.18 and 5.19. The ROC-AUC curve on the KDEF dataset is shown both without and with the ensemble-model approach; in both cases the model's performance is measured as AUC against epochs, based on the classifier. The higher the AUC, the better the model predicts. The ROC-AUC analysis on the KDEF dataset shows better prediction performance with the ensemble approach, with an average AUC between 0.8 and 0.9. Accuracy charts of the proposed MMFFC model on the KDEF dataset for batch sizes 16 and 32 are shown in figures 5.20 and 5.21. The maximum recognition accuracy achieved using the proposed MMFFC model is 92.04%. A comparative analysis of the error rate on both datasets using the proposed MMFFC approach is shown in table 5.6.
Table 5.5: Comparative analysis of proposed MMFFC model with state-of-the-art methods on KDEF dataset
Method Name Accuracy (%)
Histogram Oriented Gradients with SVM [153] 80.95%
Dynamic Bayesian Mixture Model [154] 85%
Gradient Laplacian RTNN Model [155] 88.16%
DCNN Model [156] 89.33%
Hybrid Approach for FER [157] 89.58%
Dense Facelivenet Model [158] 95.89%
Proposed MMFFC Model 92.04%
FIGURE 5.17 Confusion matrix using the proposed MMFFC model on the KDEF dataset
FIGURE 5.18 Classification Report for the proposed MMFFC model on the KDEF dataset
(a) (b)
FIGURE 5.19 ROC-AUC curve on the KDEF dataset for (a) without multi-model fusion (b) with multi-
model fusion
FIGURE 5.20 Accuracy graph of the proposed MMFFC model for the KDEF dataset for batch size 16
FIGURE 5.21 Accuracy graph of the proposed MMFFC model for the KDEF dataset for batch size 32
Table 5.6: Comparative analysis of the error rate on both datasets using the proposed MMFFC model
Method                       Database   Recognition Accuracy (%)   Error rate (%)
VGG16                        FER2013    67.36                      32.64
ResNet50                     FER2013    64.24                      35.76
Ensemble (VGG16 + ResNet50)  FER2013    68.14                      31.86
VGG16                        KDEF       90.61                      9.39
ResNet50                     KDEF       90.81                      9.19
Ensemble (VGG16 + ResNet50)  KDEF       92.04                      7.96
5.7 Discussion and Summary:
Improving recognition accuracy on real-time facial expression datasets is a significant challenge, since such data involves head-pose variations, illumination changes, low-resolution images and so on. In this chapter, we have proposed a Multi-Model Feature-Fusion based Classification (MMFFC) model built on an ensemble of multiple CNNs. The model concatenates the outputs of different CNN architectures for a better feature extraction process, instead of using one feature map from a single CNN architecture. In the proposed MMFFC model, the ensemble is formed from the two CNN architectures VGG-16 and ResNet50. Concatenating features from different networks helps to overcome the limitations of a single network and produces robust, superior performance that improves recognition accuracy. The only drawback is the increased complexity of concatenating the outputs of two CNN architectures instead of one. The proposed model is evaluated on the FER2013 and KDEF datasets. Experimental results show that the proposed MMFFC model achieves better recognition accuracy on the real-time FER2013 dataset as well as on the laboratory-trained KDEF dataset. On FER2013, the proposed model achieves a recognition accuracy of 68.14%, which is competitive with state-of-the-art methods; for the cross-database evaluation on KDEF, the proposed model achieves a high recognition accuracy of 92.04% in comparison with state-of-the-art methods. The proposed MMFFC approach also reduces the error rates on both datasets: with the ensemble approach, the error rate reduces to 31.86% for FER2013 and to 7.96% for KDEF. The ensemble of multiple CNNs in the proposed MMFFC model thus helps to overcome the challenges of both real-time and laboratory-trained facial expression datasets, with improved recognition accuracy in the experimental results. Based on these results, we conclude that the proposed MMFFC model works much better, and gives higher recognition accuracy, on higher-resolution images than on the lower-resolution images in the facial expression datasets.
CHAPTER 6
Novel FER Model based on Normalized CNN
6.1 Introduction
In 2019, Mingxing Tan et al. [159] introduced "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks", which describes the EfficientNet B0 to B7 models. The advantage of this approach is improved recognition accuracy with fewer parameters and better results compared with existing architectures. EfficientNets are based on AutoML and a compound scaling approach: the AutoML Mobile framework was used to develop a mobile-size baseline network, named EfficientNet-B0, and the compound scaling method is then used to scale up this baseline to obtain the EfficientNet-B1 to B7 models. Based on the literature survey and industry references, we found no prior work on facial expression recognition using this recent EfficientNet model. The EfficientNet approach works well on high-resolution images, so we have studied and applied the proposed EfficientNet-B7 model to facial expression recognition on datasets containing higher-resolution images. EfficientNet uses the compound scaling method to scale up a CNN in a structured way: unlike conventional approaches that scale network dimensions such as width, depth and resolution arbitrarily, it uniformly scales all three dimensions with a fixed set of scaling coefficients governed by a single compound coefficient. Figure 6.1 summarizes ImageNet performance, where EfficientNets significantly outperform other ConvNets. In particular, EfficientNet-B7 surpasses the best existing accuracy while using 8.4x fewer parameters and running 6.1x faster at inference. Beyond ImageNet, EfficientNets also transfer well and achieve state-of-the-art accuracy on widely used datasets while reducing parameters by up to 21x compared with existing ConvNets [159].
In this chapter, the EfficientNet architecture is presented in detail, and the proposed EfficientNet-B7 model is applied to the facial expression recognition task. For performance comparison with EfficientNet-B7, the ResNet152 architecture is also implemented for the same task. As per the literature review, no prior work on facial expression recognition uses this novel EfficientNet-B7 model. Different optimizers are applied to both architectures and their performance is analysed in terms of recognition accuracy, since the choice of optimizer is essential to the model's performance on a facial expression dataset. The optimizers stochastic gradient descent (SGD), RMSprop and Adam are evaluated with the proposed EfficientNet-B7 model and the ResNet152 architecture using network parameters such as epochs and learning rate. A vanishing-gradient issue also arises during evaluation of the proposed model when measuring the accuracy and loss graphs. This issue is resolved by the proposed Internal Batch Normalization (IBN) concept, which helps to reduce variance in the model's loss and achieve a smooth curve with optimized results.
Figure 6.1: ImageNet performance evaluation with other ConvNets [159]
As can be seen in figure 6.1, EfficientNets significantly outperform other ConvNets. In fact, EfficientNet-B7 achieved new state-of-the-art accuracy compared with other architectures while being 8.4 times smaller and 6.1 times faster. The appeal of EfficientNets is that they not only have better accuracy than their counterparts but are also lightweight and therefore faster to run.
6.2 EfficientNet Architecture and Working Methodology
Before EfficientNets, the most common way to scale up a ConvNet was along one of three dimensions: depth (number of layers), width (number of channels) or image resolution (image size). EfficientNets instead perform compound scaling, which scales all three dimensions while maintaining a balance between them. The difference between the scaling methods is illustrated in figure 6.2 below.
Figure 6.2: Model Scaling Approach [159]
In figure 6.2, (b) to (d) are conventional scaling methods that increase only one dimension of the network (width, depth or resolution), while (e) is the compound scaling method that uniformly scales all three dimensions with a fixed ratio. The idea of compound scaling makes sense because, if the input image is bigger (higher resolution), the network needs more layers (depth) and more channels (width) to capture the finer-grained patterns in the bigger image. Compound scaling also works on existing architectures such as MobileNet and ResNet.
Figure 6.3: Scaling up a baseline model with different network width (w), depth (d) and resolution (r) [159]
The authors of the EfficientNet architecture ran many experiments scaling depth, width and image resolution and made two main observations: first, scaling up any single dimension of network width, depth or resolution improves accuracy, but the gain diminishes for bigger models; second, to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth and resolution during ConvNet scaling. Scaling up a baseline model with different network width (w), depth (d) and resolution (r) coefficients is shown in figure 6.3: bigger networks with larger width, depth or resolution tend to achieve higher accuracy, but the accuracy gain quickly saturates after reaching about 80%, demonstrating the limitation of scaling a single dimension. This motivates scaling all dimensions together, which is the compound scaling technique [160].
Scaling network depth (the number of layers) is the most common approach used by ConvNets. With the advancements in deep learning, it is now possible to train deeper neural networks, which generally achieve higher accuracy than their shallower counterparts; the intuition is that a deeper ConvNet can capture richer and more complex features. However, deeper networks are also the most challenging to train due to the vanishing gradient problem. Figure 6.3 (middle) shows that accuracy saturates at d = 6.0, with no further improvement beyond that [160].
Scaling network width, that is, increasing the number of channels in the convolution layers, is most commonly used for smaller models; wider networks appear in MobileNets and MnasNet. While wider networks tend to capture more fine-grained features and are easier to train, extremely wide but shallow networks have difficulty capturing higher-level features. In figure 6.3 (left), accuracy quickly saturates as networks become much wider (large w). Increasing the input image resolution also helps improve the accuracy of ConvNets: in figure 6.3 (right), accuracy increases with input image size.
The authors used a neural architecture search approach similar to MnasNet, a reinforcement-learning-based method, to develop the baseline network architecture EfficientNet-B0. The search optimizes both accuracy and efficiency, the latter measured in floating-point operations per second (FLOPS), i.e., the number of floating-point operations, and achieves better results than models such as Inception and ResNet. The resulting architecture uses the Mobile Inverted Bottleneck Convolution (MBConv). The researchers then scaled up this baseline network to obtain a family of deep learning models called EfficientNets. The architecture is shown in figure 6.4 below.
Figure 6.4: A basic block representation of the EfficientNet-B0 [161]
The EfficientNet-B0 architecture is summarized in table 6.1, where the MBConv layer is an inverted bottleneck block with a squeeze-and-excitation block and swish activation.
Table 6.1: EfficientNet-B0 Baseline Network [161]
As table 6.1 shows, the architecture uses seven stages of inverted residual blocks, each with different settings. The MBConv block combines three ideas: depthwise convolution + pointwise convolution, inverted residuals, and a linear bottleneck, as shown in figure 6.5 (a) and (b).
Figure 6.5: A basic representation of Depthwise and Pointwise Convolutions in (a) and (b) [161]
Depthwise convolution and pointwise convolution split the original convolution into two stages to significantly reduce the computation cost with minimal loss of accuracy, greatly decreasing the number of trainable parameters. For the inverted residual, the original ResNet blocks consist of a layer that squeezes the channels followed by a layer that expands them, so skip connections link channel-rich layers. In MBConv, by contrast, blocks consist of a layer that first expands the channels and then compresses them, so that the layers with fewer channels are skip-connected. The linear bottleneck uses a linear activation in the last layer of each block to prevent the loss of information caused by ReLU [161].
The main building block of EfficientNet is MBConv, an inverted bottleneck convolution originally introduced in MobileNetV2. By using shortcuts between bottlenecks that connect a much smaller number of channels (compared with the expansion layers), combined with depthwise separable convolution, the computation is reduced by a factor of almost k² compared with traditional layers, where k denotes the kernel size, i.e., the height and width of the 2-dimensional convolution window. The Google Brain team also suggested a newer activation, Swish, which tends to work better than ReLU for deeper networks; Swish is the product of a linear and a sigmoid activation, Swish(x) = x · sigmoid(x).
To scale up the EfficientNet architecture from EfficientNet-B0 to EfficientNet-B1 through B7, the network depth (d), width (w) and input image resolution (r) are set as follows:
depth: d = α^φ        width: w = β^φ        resolution: r = γ^φ
subject to α · β² · γ² ≈ 2, with α ≥ 1, β ≥ 1, γ ≥ 1
Intuitively, φ is a user-defined coefficient that determines how many additional resources are available, while the constants α, β, γ determine how to distribute these extra resources across the network depth (d), width (w) and input resolution (r). Starting from the baseline EfficientNet-B0, the compound scaling method is applied in two steps:
STEP-1: First fix φ = 1, assuming twice the resources are available, and do a small grid search over α, β, γ. The best values found for EfficientNet-B0 are α = 1.2, β = 1.1, γ = 1.15, under the constraint α · β² · γ² ≈ 2.
STEP-2: Then fix α, β, γ as constants and scale up the baseline network with different φ to obtain EfficientNet-B1 to B7.
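A small worked example of this rule, using the EfficientNet-B0 constants stated above (the helper name compound_scale is hypothetical):

ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # found by grid search at phi = 1

def compound_scale(phi):
    depth = ALPHA ** phi        # multiplier for the number of layers
    width = BETA ** phi         # multiplier for the number of channels
    resolution = GAMMA ** phi   # multiplier for the input image size
    return depth, width, resolution

# Each unit increase in phi roughly doubles the FLOPS, since alpha * beta^2 * gamma^2 ~ 2
print(compound_scale(1))  # (1.2, 1.1, 1.15)
print(compound_scale(2))  # (1.44, 1.21, ~1.32), a deeper, wider, higher-resolution variant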
6.3 Proposed novel FER model: EfficientNet-B7
Based on the literature survey and industry references, we found no prior work on facial expression recognition using this latest EfficientNet concept, so we selected the recent EfficientNet-B7 model as our proposed model for facial expression recognition. The proposed model approach is shown in figure 6.6.
Figure 6.6: Proposed Novel FER model with EfficientNet-B7 and ResNet-152 architecture
In this proposed model, we apply different optimizers (Adam, RMSprop and SGD) to the EfficientNet-B7 and ResNet152 pre-trained CNN architectures, with varying network parameters, to decide which optimizer provides the best performance for the FER task. Since no prior work applies EfficientNet-B7 to facial expression recognition, it is essential to determine which optimizer gives the best recognition accuracy; the ResNet152 architecture is used for cross-verification against the EfficientNet-B7 model. Because the EfficientNet-B7 model works well on higher-resolution images, performance evaluation is carried out on the KDEF dataset [151], which contains higher-resolution images, and also on the real-time FER2013 dataset [133], which contains lower-resolution images, for cross-database evaluation. The proposed algorithm is explained below: the different optimizers are applied to the proposed EfficientNet-B7 model and to the ResNet152 model, the recognition accuracy is recorded for each optimizer, and the maximum recognition accuracy across the three optimizers is taken as the final recognition accuracy. This process is applied to both EfficientNet-B7 and ResNet152.
6.3.1 Detailed Process (Algorithm) of proposed novel FER model
base_model1 = EfficientNet-B7
base_model2 = ResNet152
op1 = Adam optimizer
op2 = RMSprop optimizer
op3 = SGD optimizer
Input: KDEF dataset (4900 RGB images) and FER2013 dataset (35887 grey-
scaled images)
1. Initialize parameters nb_class, x, y, epoch, lr, bs, C, where
nb_class = number of facial expression classes
x = height of the image
y = width of the image
epoch = number of iterations
lr = learning rate
bs = batch size
C = classifier
2. for 1: epochs
train_data, val_data = train_test_split (dataset, 0.8)
for 1: last_block_layer
model = compile base_model1 with op1 optimizer
fv = generate feature vector for model
acc1 = predict result of fv using classifier C
end for
for 1: last_block_layer
model = compile base_model1 with op2 optimizer
fv = generate feature vector for model
acc2 = predict result of fv using classifier C
end for
for 1: last_block_layer
model = compile base_model1 with op3 optimizer
fv = generate feature vector for model
acc3 = predict result of fv using classifier C
end for
final_acc = max (acc1, acc2, acc3)
end for
3. for 1: epochs
train_data, val_data = train_test_split (dataset, 0.8)
for 1: last_block_layer
model = compile base_model2 with op1 optimizer
fv = generate feature vector for model
acc1 = predict result of fv using classifier C
end for
for 1: last_block_layer
model = compile base_model2 with op2 optimizer
fv = generate feature vector for model
acc2 = predict result of fv using classifier C
end for
for 1: last_block_layer
model = compile base_model2 with op3 optimizer
fv = generate feature vector for model
acc3 = predict result of fv using classifier C
end for
final_acc = max (acc1, acc2, acc3)
end for
4. END
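A minimal Keras sketch of the optimizer-selection loop in this algorithm is given below, assuming pre-split arrays x_train, y_train, x_val and y_val; the batch size and the helper name build_model are assumptions for illustration.

from tensorflow.keras.applications import EfficientNetB7
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam, RMSprop, SGD

NB_CLASSES, EPOCHS, LR, BS = 7, 50, 1e-4, 16

def build_model():
    base = EfficientNetB7(weights='imagenet', include_top=False)
    x = GlobalAveragePooling2D()(base.output)
    return Model(base.input, Dense(NB_CLASSES, activation='softmax')(x))

results = {}
for name, opt in [('Adam', Adam(LR)), ('RMSprop', RMSprop(LR)), ('SGD', SGD(LR))]:
    model = build_model()  # fresh model for each optimizer
    model.compile(optimizer=opt, loss='categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        epochs=EPOCHS, batch_size=BS)
    results[name] = max(history.history['val_accuracy'])

final_acc = max(results.values())  # best accuracy across the three optimizers

The same loop applies unchanged to ResNet152 by swapping the backbone in build_model.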
6.4 Dataset details:
The proposed model is tested on the KDEF dataset [151], which contains higher-resolution images, and also on the real-time facial expression dataset FER2013 [133], which contains lower-resolution images, using network parameters such as batch size and learning rate. Different optimizers (Adam, RMSprop and SGD) are applied to check which gives the best performance in terms of recognition accuracy, with the SoftMax classifier used for classification.
6.4.1 KDEF Dataset:
The Karolinska Directed Emotional Faces (KDEF) dataset was created by Lundqvist, Flykt and Öhman [151] from the psychology section of the Department of Clinical Neuroscience, Karolinska Institute. KDEF is a set of 4,900 pictures of human facial expressions: 70 individuals displaying seven different emotional expressions, each viewed from five different angles. The images cover seven facial expressions (Anger, Disgust, Happy, Surprise, Fear, Sad and Neutral) at a resolution of 224x224 pixels. Example images from the KDEF dataset are shown in figure 6.7 [151,152].
FIGURE 6.7 Sample images in the KDEF dataset with different emotions [152]
6.4.2 FER2013 Dataset:
The FER2013 dataset was introduced for the Challenges in Representation Learning workshop at the International Conference on Machine Learning (ICML). FER2013 is a large-scale, unconstrained database collected automatically with the Google image search API. It contains 35,887 real-world images taken in an uncontrolled environment, so it poses many research challenges such as head-pose variation, illumination changes and low resolution. The images cover seven facial expressions (Anger, Disgust, Happy, Surprise, Fear, Sad and Neutral) at a resolution of 48x48 pixels. Example images from the FER2013 dataset are shown in figure 6.8 [11].
FIGURE 6.8 Example of images in the FER2013 dataset with different emotions [135]
6.5 Experiments and Results:
The proposed model is implemented in Python in a deep learning environment using the Keras API with TensorFlow as the backend. The experiments are performed on an NVIDIA GeForce GTX 1050Ti 4GB Graphics Processing Unit (GPU) in an 8th-generation Intel Core i7 Windows system with 16GB RAM. The Anaconda deep learning environment is installed with the Keras-TensorFlow libraries, and Google Colab is used as the development environment for the Python implementation.
In this implementation, we use a maximum of 50 epochs with a learning rate of 0.0001 on the KDEF dataset, which contains higher-resolution images. The different optimizers (Adam, RMSprop and SGD) are applied with these network parameter values and the recognition accuracy is measured. The same process is used on the real-time facial expression dataset FER2013 for the cross-database evaluation study. Based on the experimental results, we conclude which optimizer, for the given network parameters, works best on the EfficientNet-B7 and ResNet152 architectures to improve recognition accuracy for facial expression recognition.
6.5.1 Experimental results on proposed EfficientNet-B7 model:
The proposed EfficientNet-B7 model is tested on the KDEF dataset, which contains higher-resolution images, with network parameters of a maximum of 50 epochs and a learning rate of 0.0001. The different optimizers are applied to the proposed EfficientNet-B7 model with these network parameters, and the recognition accuracy on the KDEF dataset is measured to decide which optimizer performs best for the facial expression recognition task. The experimental results for the different optimizers and their recognition accuracies are shown in table 6.2.
Table 6.2: Comparative analysis of the proposed EfficientNet-B7 model with different optimizers
Optimizers                            Validation Accuracy (%)
Stochastic Gradient Descent (SGD)     62.41%
RMSprop                               91.78%
Adam                                  77.93%
Similarly, the ResNet152 pre-trained CNN architecture is tested on the KDEF dataset for cross-performance verification using the same network parameters: a maximum of 50 epochs and a learning rate of 0.0001. The different optimizers are applied to the ResNet152 architecture with these network parameters, and the recognition accuracy on the KDEF dataset is measured to decide which optimizer performs best for facial expression recognition. The experimental results for the different optimizers and their recognition accuracies are shown in table 6.3.
Table 6.3: Comparative analysis of the ResNet152 CNN architecture with different optimizers
Optimizers                            Validation Accuracy (%)
Stochastic Gradient Descent (SGD)     88.77%
RMSprop                               70.40%
Adam                                  74.08%
From the experimental results in tables 6.2 and 6.3, the proposed EfficientNet-B7 model achieves the highest recognition accuracy, 91.78%, with the RMSprop optimizer. In comparison, the ResNet152 architecture, implemented for cross-performance verification, achieves 88.77% recognition accuracy with the SGD optimizer. We conclude that the proposed model gives better results with the RMSprop optimizer for facial expression recognition. The confusion matrix and classification report generated for the proposed EfficientNet-B7 model with the RMSprop optimizer, which achieved the maximum recognition accuracy, are shown in figures 6.9 and 6.10, respectively. A performance comparison of recognition accuracy with the different optimizers on the proposed EfficientNet-B7 model and the ResNet152 architecture is shown in figure 6.11.
FIGURE 6.9 Confusion Matrix using the proposed novel EfficientNet-B7 model on the KDEF dataset
FIGURE 6.10 Classification Report for the proposed novel EfficientNet-B7 model on the KDEF dataset
FIGURE 6.11 Comparative analysis of recognition accuracy on the proposed EfficientNet-B7 model and
ResNet152 architecture by applying different optimizers
We also tested the proposed novel EfficientNet-B7 model on the real-time facial expression dataset FER2013 [135], which contains lower-resolution images. From the literature survey, the proposed EfficientNet-B7 model is expected to perform best on databases containing higher-resolution images. For the cross-database evaluation, we tested the proposed model on the FER2013 dataset, keeping the same network parameter values used for the KDEF dataset. The results in table 6.4 show that the proposed model achieves a maximum recognition accuracy of 57.56% with the RMSprop optimizer, which is significantly lower than the KDEF evaluation results. This shows that the proposed model does not perform well on facial expression datasets containing lower-resolution images. The experimental results for the different optimizers on the proposed EfficientNet-B7 model, with recognition accuracies on the FER2013 facial expression dataset, are shown in table 6.4.
Table 6.4: Comparative result analysis of the proposed EfficientNet-B7 model with different optimizers on
the FER2013 dataset
Optimizer                                Validation Accuracy (%)
Stochastic Gradient Descent (SGD)        21.58
RMSprop                                  57.56
Adam                                     51.23
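The cross-database run only changes the input pipeline. A hedged sketch, assuming the Kaggle FER2013 CSV layout (48x48 grayscale pixel strings in columns emotion, pixels, Usage) and reusing the model construction from the earlier sketch, is:

import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.read_csv("fer2013.csv")  # assumed Kaggle CSV layout
pixels = np.array([np.array(p.split(), dtype=np.float32) for p in df["pixels"]])
images = np.repeat(pixels.reshape(-1, 48, 48, 1), 3, axis=-1)  # grayscale to 3 channels
labels = tf.keras.utils.to_categorical(df["emotion"], num_classes=7)

# Upscale the low-resolution inputs to EfficientNet-B7's 600x600 on the fly,
# so the full-resolution copies are never materialized in memory at once
ds = (tf.data.Dataset.from_tensor_slices((images, labels))
      .map(lambda x, y: (tf.image.resize(x, (600, 600)), y))
      .batch(32))

# model, epochs and optimizers are kept exactly as in the KDEF experiments
model.fit(ds, epochs=50)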
After achieving a recognition accuracy of 91.78% with the proposed EfficientNet-B7 model
and the RMSprop optimizer, a vanishing gradient problem was observed in the model accuracy
and loss graphs, as shown in Figure 6.12. EfficientNet-B7 is a CNN architecture pre-trained
on the ImageNet dataset. When the KDEF facial expression dataset is applied, the pre-
trained weights produce high variance, visible as a zigzag pattern in the model loss graph.
The accuracy and loss curves are not smooth, and this problem makes the proposed model
unstable. To make the proposed model more stable and reduce the variance in the loss and
accuracy graphs, the Internal Batch Normalization (IBN) concept is applied.
FIGURE 6.12 Vanishing gradient problem caused by variance in the model loss and accuracy graphs of
the proposed EfficientNet-B7 model
6.5.2 Internal Batch Normalization (IBN) & Experimental results:
The internal working of Batch Normalization is explained below [162]:
• Batch Normalization aims to reduce internal covariate shift, and in doing so to
accelerate the training of deep neural nets.
• It accomplishes this via a normalization step that fixes the means and variances
of layer inputs.
• Batch Normalization also has a beneficial effect on the gradient flow through
the network, by reducing the dependence of gradients on the parameters’ scale or
initial values.
• This allows for the use of much higher learning rates without the risk of
divergence.
• Furthermore, Batch Normalization regularizes the model and reduces the need
for Dropout.
We apply Batch Normalization as follows for a minibatch B = \{x_1, \ldots, x_m\}, where
\gamma and \beta are learnable parameters [162]:

\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i \quad \text{(minibatch mean)}

\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \quad \text{(minibatch variance)}

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \quad \text{(normalization)}

y_i = \gamma \hat{x}_i + \beta \quad \text{(scale and shift)}
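The following NumPy sketch mirrors these four equations for a single feature. It is an illustration only: gamma and beta are fixed here rather than learned, and the input values are arbitrary.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: minibatch activations of one feature, shape (m,)
    mu = x.mean()                           # minibatch mean
    var = x.var()                           # minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale and shift

x = np.array([0.5, 1.2, -0.3, 2.0])
print(batch_norm_forward(x, gamma=1.0, beta=0.0))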
FIGURE 6.13 Illustration of the Batch Normalization process, with N as the batch axis, C as the channel
axis and (H, W) as the spatial axes [163]
Transfer learning concept for EfficientNet:
• While EfficientNet reduces the number of parameters, training a convolutional
network is still a time-consuming task. To further reduce the training time, we can
utilize transfer learning techniques.
• Transfer learning means that we use a pre-trained model and fine-tune it on
new data.
• In image classification, we can think of dividing the model into two parts.
• One part of the model is responsible for extracting the key features from images,
such as edges.
• The other part uses these features for the actual classification.
• Usually, a CNN is built of stacked convolution blocks that reduce the image size
while increasing the number of learnable features (filters). In the end, everything is
put together in a fully connected layer, which does the classification.
• The idea of transfer learning is to make the first part transferable, so that it can be
used for different tasks by replacing only the fully connected layer (often called the
“top”); a minimal Keras sketch of this replace-the-top idea follows the list.
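The sketch below freezes the feature-extraction part and attaches a new 7-class top. It is illustrative only, assuming Keras with TensorFlow 2, and is not the exact training code used in this work.

from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB7

# Feature-extraction part: EfficientNet-B7 body with ImageNet weights, frozen
base = EfficientNetB7(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False

# Classification part ("top"): replaced and trained on the new dataset
outputs = layers.Dense(7, activation="softmax")(base.output)
model = Model(base.input, outputs)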
Now we can train the last layer on our dataset while the feature-extraction layers keep
their weights from ImageNet. Unfortunately, we still get the vanishing gradient issue,
visible as variance in the model accuracy and loss graphs. To resolve this issue, we
re-train the model by applying the proposed Internal Batch Normalization (IBN) concept,
in which only the batch normalization layers of the base model are kept trainable; the
detailed process is given in Section 6.5.3.
By applying the Internal Batch Normalization (IBN) concept, the vanishing gradient issue
is resolved: the variance effect in the proposed model is reduced, making the model more
stable. The model accuracy and loss curves also become much smoother than the earlier
ones, as shown together with the ROC-AUC curve in Figure 6.14, where better results for
the ROC-AUC curve can be seen as well.
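For reference, a multiclass ROC-AUC of this kind can be computed with scikit-learn. This is a sketch with illustrative dummy values; in practice y_true (integer class labels) and y_prob (per-class predicted probabilities) come from the validation set.

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 2, 3, 4, 5, 6, 0])        # labels for 7 expression classes
rng = np.random.default_rng(0)
y_prob = rng.random((8, 7))
y_prob /= y_prob.sum(axis=1, keepdims=True)        # rows sum to 1, like softmax output

# One-vs-rest ROC-AUC, macro-averaged over the seven classes
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
print(f"macro one-vs-rest ROC-AUC: {auc:.3f}")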
6.5.3 Proposed algorithm for the EfficientNet-B7 model with the Internal Batch
Normalization (IBN) approach
base_model = EfficientNet-B7
opt = RMSprop optimizer
Input: KDEF dataset (4900 RGB images)
1. Initialize parameters nb_class, x, y, epoch, lr, bs, C, where
   nb_class = number of facial expression classes
   x = height of the image
   y = width of the image
   epoch = number of iterations
   lr = learning rate
   bs = batch size
   C = classifier
2. train_data, val_data = train_test_split(dataset, 0.8)
   for layer in base_model.layers:
       if isinstance(layer, BatchNormalization):
           layer.trainable = True
       else:
           layer.trainable = False
   end for
   model = compile base_model with optimizer opt
   for 1 : epoch
       fv = generate feature vector from model for train_data
       result_acc = predict result of fv on val_data using classifier C
   end for
3. END
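A hedged Keras rendering of this algorithm is sketched below. The data pipeline (prepared as in the earlier Section 6.5 sketch) and the batch size handling are assumptions; the exact training code of this work is not reproduced here.

import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB7

nb_class, epoch, lr, bs = 7, 50, 1e-4, 32  # network parameters from step 1

# train_ds / val_ds: an 80/20 split of the 4,900 KDEF images, built with
# image_dataset_from_directory as in the earlier sketch (batch_size=bs there)

base_model = EfficientNetB7(weights="imagenet", include_top=False, pooling="avg")

# IBN step: only the BatchNormalization layers of the base model stay trainable
for layer in base_model.layers:
    layer.trainable = isinstance(layer, layers.BatchNormalization)

outputs = layers.Dense(nb_class, activation="softmax")(base_model.output)  # classifier C
model = Model(base_model.input, outputs)
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=lr),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=epoch)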
FIGURE 6.14 Resultant smooth curves achieved by applying the Internal Batch Normalization concept and
reducing the variance effect
6.6 Discussion and Summary:
Tan and Le [159] introduced the “EfficientNet: Rethinking Model Scaling for Convolutional
Neural Networks” concept in 2019, in which the EfficientNet-B0 to B7 models are described.
It is among the best classification models used in many recognition tasks. EfficientNet
uses a compound scaling method to scale up a CNN in a more structured way, improving
recognition accuracy while reducing parameters. Unlike conventional approaches that
arbitrarily scale network dimensions such as width, depth and resolution, this approach
uniformly scales each dimension with a fixed set of scaling coefficients. From the
literature survey and industry references, we found that no work on facial expression
recognition had been carried out using the EfficientNet model to date. This EfficientNet
model also works well on higher-resolution images.
In our proposed model, the EfficientNet-B7 architecture is implemented with different
optimizers and network parameters on the KDEF dataset, which contains higher-resolution
images. Optimizers play an essential role in improving the recognition accuracy of CNN
models, so deciding which optimizer performs well and gives better recognition accuracy is
an important consideration. Different optimizers (RMSprop, Adam and SGD) were applied to
the proposed EfficientNet-B7 model with varying network parameters, namely the number of
epochs and the learning rate. The same network parameter values were applied to the
ResNet152 architecture for cross-performance evaluation. The experimental results show
that the proposed EfficientNet-B7 model performs best with the RMSprop optimizer,
achieving 91.78% recognition accuracy for the facial expression recognition task, whereas
the ResNet152 architecture performs best with the SGD optimizer, achieving 88.77%
recognition accuracy. The proposed EfficientNet-B7 model was also applied to the real-time
facial expression dataset FER2013, which contains lower-resolution images, for a cross-
database evaluation study. The experimental results show that the proposed model achieved
a maximum recognition accuracy of 57.56% with the RMSprop optimizer. We therefore conclude
that the proposed model does not perform well on datasets that contain lower-resolution
images.
After achieving a maximum recognition accuracy of 91.78% on the KDEF dataset with the
proposed EfficientNet-B7 model, a vanishing gradient issue was found in the model accuracy
and loss graphs due to the variance problem, which makes the model unstable. To make the
model more stable and reduce the variance in the graphs, the Internal Batch Normalization
(IBN) concept was proposed and applied; it removes the variance in the graphs and makes
the model accuracy and loss curves smoother, stabilizing the model. This, in turn, helps
to improve recognition accuracy for the facial expression recognition task.
CHAPTER 7
Conclusion and Further Enhancement
7.1 Conclusion
Various methods and algorithms have been investigated for improving the recognition
accuracy of facial expression recognition from images. Several methods have been proposed
to detect facial expressions in images. Most of these methods use laboratory-controlled
facial expression datasets, recorded under controlled conditions: the lighting is uniform,
the images contain the entire frontal face, and there are no occlusions in most cases. The
face detection and feature extraction steps of the facial expression recognition process
therefore become easier, and facial expression recognition on such datasets is much
simpler than on real-time facial expression datasets. For the latter type of dataset, the
images are taken from the internet and the real world, and therefore exhibit problems such
as differences in lighting conditions, varying head poses, varying image resolutions, and
occlusions such as sunglasses and hair. The thesis’s overall goal is to develop efficient
models for facial expression recognition using deep learning techniques that achieve
better recognition accuracy on the lower-resolution images of real-time facial expression
datasets, recognizing the seven basic facial expressions: happy, disgust, surprise, anger,
sad, fear and neutral.
In this research work, we have proposed three models for recognizing facial expressions
from images using deep learning techniques. To improve recognition accuracy both on the
lower-resolution images of real-time facial expression datasets and on laboratory-
controlled facial expression datasets, the specific contributions of this research work
are summarised as follows:
We have proposed the Multi-Layer Feature-Fusion based Classification (MLFFC) model, which
works on an inter-layer feature-fusion approach. The objective of this model is to
integrate feature maps from different layers of a network instead of only the last layer.
The InceptionV3 CNN architecture introduced the Inception Module concept by factorizing
the convolution node and applying a filter concatenation approach, and it provides good
performance in terms of image recognition accuracy. In the proposed MLFFC model, inter-
layer feature fusion combines an internal layer of Module C of the InceptionV3 CNN
architecture with its final layer to improve the model’s recognition accuracy. The
proposed model thus utilizes features from two different domains for the facial expression
recognition problem. The MLFFC model is tested on the real-time facial expression dataset
FER2013, which contains real-world images, and achieves better recognition accuracy
(70.29%) on this dataset than the state-of-the-art methods. The quality of the features
learned by the proposed model is further tested by performing a cross-database study on
the laboratory-controlled CK+ dataset, on which the proposed model achieves the best
recognition accuracy (99.6%). The proposed MLFFC approach also reduces the error rates on
both datasets. Without the feature-fusion approach, the error rates are 0.83% and 32.97%
for the CK+ and FER2013 datasets, respectively; with the proposed feature-fusion approach,
the error rate reduces from 0.83% to 0.31% for the CK+ dataset and from 32.97% to 29.71%
for the FER2013 dataset. This shows that the proposed MLFFC model can work well on both
kinds of facial expression datasets, laboratory-controlled and real-time, unlike models
that work exceptionally well on laboratory-controlled facial expression datasets but fail
to do so on real-time facial expression datasets.
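To make the inter-layer fusion step concrete, a simplified Keras sketch is given below. The choice of 'mixed9' as the internal Module C layer is an illustrative assumption (it is one of InceptionV3's named layers in Keras), not necessarily the exact layer used in the MLFFC model.

from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(weights="imagenet", include_top=False)

# Feature vector from an internal Module C layer ('mixed9'; illustrative choice)
inner = layers.GlobalAveragePooling2D()(base.get_layer("mixed9").output)
# Feature vector from the final convolutional layer
final = layers.GlobalAveragePooling2D()(base.output)

# Inter-layer feature fusion: concatenate the two feature vectors, then classify
fused = layers.Concatenate()([inner, final])
outputs = layers.Dense(7, activation="softmax")(fused)
model = Model(base.input, outputs)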
The second model we have proposed is the Multi-Modal Feature-Fusion based Classification
(MMFFC) model, which works on an ensemble-of-multi-CNN approach: it concatenates the
outputs of different CNN architectures for a better feature extraction process instead of
taking one feature map from a single CNN architecture. In the proposed MMFFC model, an
ensemble of two CNN architectures, VGG-16 and ResNet50, is built by concatenating the
features from the final layers of both architectures. This helps to overcome the
limitations of a single network and produces more robust and superior performance,
improving recognition accuracy. Complexity increases here because the outputs of two
different architectures are concatenated instead of using a single architecture. To select
the ensemble, experiments were performed on four architectures (InceptionV3, VGG16, VGG19
and ResNet50) by concatenating feature vectors at the output layers. Among these, the
combination of the VGG16 and ResNet50 architectures provided the best recognition
accuracy; hence, we selected these two architectures for the proposed MMFFC model. The
proposed model is tested on the real-time facial expression dataset FER2013 and achieves
68.14% recognition accuracy, better than other state-of-the-art methods. For the cross-
database evaluation, the proposed model is tested on the laboratory-controlled KDEF facial
expression dataset, where it also achieves a better recognition accuracy of 92.04%
compared to other state-of-the-art methods. The proposed MMFFC approach also reduces the
error rates on both datasets: using the proposed ensemble approach, the error rate reduces
to 31.86% for the FER2013 dataset and to 7.96% for the KDEF dataset. This shows that the
proposed MMFFC model can work well on both kinds of facial expression datasets,
laboratory-controlled and real-time.
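A hedged sketch of this two-network feature concatenation (a shared input, with the VGG-16 and ResNet50 feature vectors joined before a softmax classifier) is shown below; it is an illustration, not the exact MMFFC code, and the 224x224 input size is an assumption.

from tensorflow.keras import layers, Model, Input
from tensorflow.keras.applications import VGG16, ResNet50

inp = Input(shape=(224, 224, 3))
vgg = VGG16(weights="imagenet", include_top=False, pooling="avg")(inp)
res = ResNet50(weights="imagenet", include_top=False, pooling="avg")(inp)

# Ensemble of multi-CNN: concatenate the final-layer features of both networks
fused = layers.Concatenate()([vgg, res])
outputs = layers.Dense(7, activation="softmax")(fused)
model = Model(inp, outputs)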
We have proposed the third model using the “EfficientNet: Rethinking Model Scaling for
Convolutional Neural Networks” concept introduced in 2019. This concept comprises the
EfficientNet-B0 to B7 models, which are based on a compound scaling method that scales up
a CNN in a more structured way. From the literature review and industry references, we
found that no work had been carried out on facial expression recognition using this
EfficientNet approach to date. It is among the best image classification models, improving
recognition accuracy while reducing parameters. Unlike conventional methods that
arbitrarily scale network dimensions such as width, depth and resolution, this approach
uniformly scales each dimension with a fixed set of scaling coefficients. This important
characteristic of the EfficientNet approach makes it work well on higher-resolution
images. As a network parameter, the optimizer plays a vital role in improving the
recognition accuracy of any model. In our proposed model, the EfficientNet-B7 architecture
is implemented with different optimizers, namely Adam, RMSprop and SGD, to decide which
optimizer gives better recognition accuracy for the proposed model. Because this concept
works well with higher-resolution images, we tested the proposed model on the laboratory-
controlled KDEF facial expression dataset, applying the different optimizers and other
network parameters. Experimental results show that the proposed EfficientNet-B7 model
performs best with the RMSprop optimizer, achieving 91.78% recognition accuracy. We tested
the same approach on the ResNet152 CNN architecture for cross-performance evaluation and
achieved 88.77% recognition accuracy with the SGD optimizer. The proposed EfficientNet-B7
model was also tested on the real-time facial expression dataset FER2013, which contains
lower-resolution images, for cross-database evaluation. The experimental results show that
the proposed model achieved a maximum recognition accuracy of 57.56% with the RMSprop
optimizer, which is significantly lower than on the KDEF dataset with its higher-
resolution images. Hence, we conclude that the proposed EfficientNet-B7 model works better
on higher-resolution images. After achieving 91.78% accuracy with the proposed
EfficientNet-B7 model on the KDEF dataset, a vanishing gradient issue was found in the
model accuracy and loss graphs, due to variance generated during network training, which
made the model unstable. The proposed Internal Batch Normalization (IBN) approach is
applied by re-training the model using the transfer learning concept, keeping only the
batch normalization layers of the feature extractor trainable. This reduces the variance
in the resulting graphs and resolves the vanishing gradient issue.
In summary, the thesis contributes to the study, investigation and development of novel
models for improving the performance of the facial expression recognition task using deep
learning techniques, on real-time facial expression datasets as well as on laboratory-
controlled facial expression datasets.
7.2 Future Enhancements
Among all the work presented in this thesis, there are areas in which to progress and
improve further. Several useful extensions that could be addressed for further
improvement are explained below:
• Without considering the influence of head-pose variations, only frontal faces were
taken for training and implementation. Faces from several views, taken from images
or videos, could therefore be considered in future work, which may help to improve
the recognition accuracy.
• Complex and hybrid methodologies combining Convolutional Neural Networks and
Recurrent Neural Networks (CNN+RNN) can be used to boost the facial expression
recognition system’s performance.
• Deep learning techniques lack sufficient data to be as effective as they could be.
Therefore, it may be useful to pre-train a deep CNN on many other databases
before applying a fine-tuning process.
• We have considered appearance-based features in our research work. A hybrid
method could therefore be developed in the future by combining geometric and
appearance-based features to improve the performance of the facial expression
recognition system.
List of References
[1] Shao, J., & Qian, Y. (2019). Three convolutional neural network models for facial expression
recognition in the wild. Neurocomputing, 355, pp. 82-92
[2] Cao, T., & Li, M. (2019). Facial Expression Recognition Algorithm Based on the Combination
of CNN and K-Means. In Proceedings of the 2019 11th International Conference on Machine
Learning and Computing, pp. 400-404
[3] Ekman, P., & Keltner, D. (1997). Universal facial expressions of emotion. In: Segerstrale,
U., & Molnar, P. (Eds.), Nonverbal communication: Where nature meets culture, pp. 27-46
[4] Revina, I. M., & Emmanuel, W. S. (2018). A survey on human face expression recognition
techniques. Journal of King Saud University-Computer and Information Sciences.
[5] Yu Miao (2018). A Real Time Facial Expression Recognition System using Deep Learning.
Masters of Applied Science thesis. University of Ottawa.
[6] Patrick Lucey et al. “The extended cohn-kanade dataset (ck+): A complete dataset for action
unit and emotion-specified expression”. In: Proceedings of the 2010 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition Workshops.2010, pp. 94–101
[7] Michael Lyons et al. “Coding Facial Expressions with Gabor Wavelets”. In: Proceedings of
the 3rd. International Conference on Face & Gesture Recognition. 1998, pp. 200–205.
[8] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski,
Y. Tang, D. Thaler, D.-H. Lee et al., “Challenges in representation learning: A report on three
machine learning contests,” in International Conference on Neural Information Processing.
Springer, 2013, pp. 117–124.
[9] Pramerdorfer, C., & Kampel, M. (2016). Facial expression recognition using convolutional
neural networks: state of the art. arXiv preprint arXiv:1612.02903.
[10] Mellouk, W., & Handouzi, W. (2020). Facial emotion recognition using deep learning: review
and insights. Procedia Computer Science, 175, pp. 689-694.
[11] Li, S., & Deng, W. (2020). Deep facial expression recognition: A survey. IEEE Transactions
on Affective Computing.
[12] Utami, P., Hartanto, R., & Soesanti, I. (2019). A Study on Facial Expression Recognition in
Assessing Teaching Skills: Datasets and Methods. Procedia Computer Science, pp. 544-552.
[13] Fathima, A., & Vaidehi, K. (2020). Review on facial expression recognition system using
machine learning techniques. In Advances in Decision Sciences, Image Processing, Security
and Computer Vision, pp. 608-618, Springer, Cham.
[14] Japanese Female Facial Expression Dataset: www.kasrl.org/jaffe.html
[15] Robust Facial expression recognition based on human computer interaction:
http://gsse.pafkiet.edu.pk/robust-fpga-based-face-recognition-system/
[16] Facial expression recognition market applications top 7 trends:
https://www.thalesgroup.com/en/markets/digital-security/government/biometrics/facial-
recognition
[17] Zeng, H., Shu, X., Wang, Y., Wang, Y., Zhang, L., Pong, T. C., & Qu, H. (2020).
EmotionCues: Emotion-Oriented Visual Summarization of Classroom Videos. IEEE
Transactions on Visualization and Computer Graphics.
[18] Musically Knowledge Company app recommends music based on your facial expressions:
https://musically.com/2020/07/24/mmp-app-recommends-music-based-on-your-facial-
expression/
[19] Jung, H., Lee, S., Park, S., Kim, B., Kim, J., Lee, I., & Ahn, C. (2015). Development of deep
learning-based facial expression recognition system. In 2015 21st Korea-Japan Joint
Workshop on Frontiers of Computer Vision, pp. 1-4, IEEE.
[20] Hemalatha, G., & Sumathi, C. P. (2014). A study of techniques for facial detection and
expression classification. International Journal of Computer Science and Engineering
Survey, 5(2), 27.
[21] Lopes, A. T., de Aguiar, E., De Souza, A. F., & Oliveira-Santos, T. (2017). Facial expression
recognition with convolutional neural networks: coping with few data and the training sample
order. Pattern Recognition, 61, pp. 610-628.
[22] Zafeiriou, S., Zhang, C., & Zhang, Z. (2015). A survey on face detection in the wild: past,
present and future. Computer Vision and Image Understanding, 138, pp. 1-24.
[23] Paul Viola et al. “Robust real-time face detection”. In: International journal of computer vision
57.2 (2004), pp. 137–154.
[24] Constantine P Papageorgiou et al. (1998). “A general framework for object detection”. In:
Proceedings of the sixth international conference on computer vision, pp. 555–562.
[25] Sharma, M., Anuradha, J., KManne, H., & Kashyap, G. S. (2017). Facial detection using deep
learning. In School of Computing Science and Engineering. VIT University.
[26] Shepley, A. J. (2019). Deep Learning For Face Recognition: A Critical Analysis. arXiv
preprint arXiv:1907.12739.
[27] Sharma, A. K., Kumar, U., Gupta, S. K., Sharma, U., & LakshmiAgrwal, S. (2018). A survey
on feature extraction technique for facial expression recognition system. In 2018 4th
International Conference on Computing Communication and Automation (ICCCA), pp. 1-6,
IEEE.
[28] Pali, V., Goswami, S., & Bhaiya, L. P. (2014). An extensive survey on feature extraction
techniques for facial image processing. In 2014 International Conference on Computational
Intelligence and Communication Networks, pp. 142-148, IEEE.
[29] Huang, Y., Chen, F., Lv, S., & Wang, X. (2019). Facial expression recognition: A
survey. Symmetry, 11(10), 1189.
[30] Zhao, X., & Zhang, S. (2016). A review on facial expression recognition: feature extraction
and classification. IETE Technical Review, 33(5), pp. 505-517.
[31] Khoshdeli, M., Cong, R., & Parvin, B. (2017). Detection of nuclei in H&E stained sections
using convolutional neural networks. In 2017 IEEE EMBS International Conference on
Biomedical & Health Informatics (BHI), pp. 105-108, IEEE.
[32] Vedantham, R., & Reddy, E. S. (2020). A robust feature extraction with optimized DBN-SMO
for facial expression recognition. Multimedia Tools and Applications, pp. 1-26.
[33] Kumar, Y., & Sharma, S. (2017). A systematic survey of facial expression recognition
techniques. In 2017 international conference on computing methodologies and communication
(ICCMC), pp. 1074-1079, IEEE.
[34] Harshitha, S., Sangeetha, N., Shirly, A. P., & Abraham, C. D. (2019). Human facial expression
recognition using deep learning technique. In 2019 2nd International Conference on Signal
Processing and Communication (ICSPC), pp. 339-342, IEEE.
[35] Wu, T., Fu, S., & Yang, G. (2012). Survey of the facial expression recognition research.
In International Conference on Brain Inspired Cognitive Systems, pp. 392-402, Springer,
Berlin, Heidelberg.
[36] About Support Vector Machine Algorithm and its types details:
https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm
[37] Shi, M., Xu, L., & Chen, X. (2020). A Novel Facial Expression Intelligent Recognition Method
Using Improved Convolutional Neural Network. IEEE Access, 8, pp. 57606-57614.
[38] Evolution of Artificial Intelligence, Machine Learning and Deep Learning details:
https://towardsdatascience.com/ai-machine-learning-deep-learning-explained-simply-
7b553da5b960
[39] Machine Learning Algorithm and its different types in detail : https://medium.com/ai-in-plain-
english/artificial-intelligence-vs-machine-learning-vs-deep-learning-whats-the-difference-
dccce18efe7f
[40] Benuwa, B. B., Zhan, Y. Z., Ghansah, B., Wornyo, D. K., & Banaseka Kataka, F. (2016). A
review of deep machine learning. In International Journal of Engineering Research in
Africa,Vol. 24, pp. 124-136, Trans Tech Publications Ltd.
[41] Abiodun, O. I., Jantan, A., Omolara, A. E., Dada, K. V., Umar, A. M., Linus, O. U., & Kiru,
M. U. (2019). Comprehensive review of artificial neural network applications to pattern
recognition. IEEE Access, 7, pp. 158820-158846.
[42] Alom, M. Z., Taha, T. M., Yakopcic, C., Westberg, S., Sidike, P., Nasrin, M. S., & Asari, V.
K. (2019). A state-of-the-art survey on deep learning theory and
architectures. Electronics, 8(3), 292.
[43] Sit, M., Demiray, B. Z., Xiang, Z., Ewing, G. J., Sermet, Y., & Demir, I. (2020). A
comprehensive review of deep learning applications in hydrology and water resources. Water
Science and Technology.
[44] Vachhani, B., Bhat, C., Das, B., & Kopparapu, S. K. (2017). Deep Autoencoder Based Speech
Features for Improved Dysarthric Speech Recognition. In Interspeech, pp. 1854-1858.
[45] Shrestha, A., & Mahmood, A. (2019). Review of deep learning algorithms and
architectures. IEEE Access, 7, pp. 53040-53065.
[46] Deep learning architectures: https://developer.ibm.com/articles/cc-machine-learning-deep-
learning-architectures/
[47] Wang, Y., Li, Y., Song, Y., & Rong, X. (2019). Facial Expression Recognition Based on
Random Forest and Convolutional Neural Network. Information, 10(12), 375.
[48] Understanding of Convolutional Neural Network CNN – Deep Learning:
https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-
deep-learning-99760835f148
[49] Ajit, A., Acharya, K., & Samanta, A. (2020). A Review of Convolutional Neural Networks.
In 2020 International Conference on Emerging Trends in Information Technology and
Engineering (ic-ETITE), pp. 1-5, IEEE.
[50] Julin, F. (2019). Vision based facial emotion detection using deep convolutional neural
networks.
[51] Transfer learning with Convolutional Neural Networks:
https://towardsdatascience.com/transfer-learning-with-convolutional-neural-networks-in-
pytorch-dd09190245ce
[52] Hussain, M., Bird, J. J., & Faria, D. R. (2018). A study on cnn transfer learning for image
classification. In UK Workshop on Computational Intelligence, pp. 191-202, Springer, Cham.
[53] Improve your model accuracy by Transfer Learning: https://medium.com/data-science-
101/transfer-learning-57ce3b98650
[54] Ma, C., Mu, X., & Sha, D. (2019). Multi-layers feature fusion of convolutional neural network
for scene classification of remote sensing. IEEE Access, 7, pp. 121685-121694.
[55] Nguyen, L. D., Lin, D., Lin, Z., & Cao, J. (2018). Deep CNNs for microscopic image
classification by exploiting transfer learning and feature concatenation. In 2018 IEEE
International Symposium on Circuits and Systems (ISCAS), pp. 1-5, IEEE.
[56] Mohanraj, V., Chakkaravarthy, S. S., & Vaidehi, V. (2019). Ensemble of convolutional neural
networks for face recognition. In Recent Developments in Machine Learning and Data
Analytics, pp. 467-477, Springer, Singapore.
[57] Ko, B. C. (2018). A brief review of facial emotion recognition based on visual
information. sensors, 18(2), 401.
[58] Papageorgiou, C. P., Oren, M., & Poggio, T. (1998). A general framework for object detection.
In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pp. 555-
562, IEEE.
[59] Viola, P., & Jones, M. J. (2004). Robust real-time face detection. International journal of
computer vision, 57(2), pp. 137-154.
[60] Chen, W., Er, M. J., & Wu, S. (2006). Illumination compensation and normalization for
robust face recognition using discrete cosine transform in logarithm domain. IEEE
Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 36(2),
pp. 458-466.
[61] Owusu, E., Zhan, Y., & Mao, Q. R. (2014). A neural-AdaBoost based facial expression
recognition system. Expert Systems with Applications, 41(7), pp. 3383-3390.
[62] Biswas, S., & Sil, J. (2015). An efficient expression recognition method using contourlet
transform. In Proceedings of the 2nd International Conference on Perception and Machine
Intelligence, pp. 167-174.
[63] Ji, Y., & Idrissi, K. (2012). Automatic facial expression recognition based on spatiotemporal
descriptors. Pattern Recognition Letters, 33(10), pp. 1373-1380.
[64] Zhang, L., Tjondronegoro, D., & Chandran, V. (2014). Random Gabor based templates for
facial expression recognition in images with facial occlusion. Neurocomputing, 145, pp. 451-
464.
[65] Happy, S. L., & Routray, A. (2014). Automatic facial expression recognition using features of
salient facial patches. IEEE transactions on Affective Computing, 6(1), pp. 1-12.
[66] Dahmane, M., & Meunier, J. (2014). Prototype-based modeling for facial expression
analysis. IEEE Transactions on Multimedia, 16(6), pp. 1574-1584.
[67] Hernandez-Matamoros, A., Bonarini, A., Escamilla-Hernandez, E., Nakano-Miyatake, M., &
Perez-Meana, H. (2015, September). A facial expression recognition with automatic
segmentation of face regions. In International Conference on Intelligent Software
Methodologies, Tools, and Techniques, pp. 529-540, Springer, Cham.
[68] Uçar, A., Demir, Y., & Güzeliş, C. (2016). A new facial expression recognition based on
curvelet transform and online sequential extreme learning machine initialized with spherical
clustering. Neural Computing and Applications, 27(1), pp. 131-142.
[69] Cossetin, M. J., Nievola, J. C., & Koerich, A. L. (2016). Facial expression recognition using a
pairwise feature selection and classification approach. In 2016 International Joint Conference
on Neural Networks (IJCNN), pp. 5149-5155, IEEE.
[70] Ghimire, D., & Lee, J. (2013). Geometric feature-based facial expression recognition in image
sequences using multi-class adaboost and support vector machines. Sensors, 13(6), pp. 7714-
7734.
[71] Chen, J., Chen, Z., Chi, Z., & Fu, H. (2014). Facial expression recognition based on facial
components detection and hog features. In International workshops on electrical and computer
engineering subfields, pp. 884-888.
[72] Happy, S. L., George, A., & Routray, A. (2012). A real time facial expression classification
system using local binary patterns. In 2012 4th International conference on intelligent human
computer interaction (IHCI) (pp. 1-5). IEEE.
[73] Ghimire, D., Jeong, S., Lee, J., & Park, S. H. (2017). Facial expression recognition based on
local region specific features and support vector machines. Multimedia Tools and
Applications, 76(6), pp. 7803-7821.
[74] Bhadu, A., Kumar, V., Shekhawat, H. S., & Tokas, R. An improved method of feature
extraction technique for facial expression recognition using Adaboost neural
network. International Journal of Electronics and Computer Science Engineering
(IJECSE), Volume 1, pp. 1112-1118.
[75] Bao, H., & Ma, T. (2014). Feature extraction and facial expression recognition based on bezier
curve. In 2014 IEEE International Conference on Computer and Information Technology, pp.
884-887, IEEE.
[76] Lozano-Monasor, E., López, M. T., Fernández-Caballero, A., & Vigo-Bustos, F. (2014). Facial
expression recognition from webcam based on active shape models and support vector
machines. In International Workshop on Ambient Assisted Living, pp. 147-154, Springer,
Cham.
[77] Huang, H. F., & Tai, S. C. (2012). Facial expression recognition using new feature extraction
algorithm. ELCVIA Electronic Letters on Computer Vision and Image Analysis, 11(1), pp. 41-
54.
[78] Kamarol, S. K. A., Jaward, M. H., Parkkinen, J., & Parthiban, R. (2016). Spatiotemporal
feature extraction for facial expression recognition. IET Image Processing, 10(7), pp. 534-541.
[79] Do, T. T., & Le, T. H. (2008). Facial feature extraction using geometric feature and
independent component analysis. In Pacific Rim Knowledge Acquisition Workshop, pp. 231-
241, Springer, Berlin, Heidelberg.
[80] Bermani, A. K., Ghalwash, A. Z., & Youssif, A. A. (2012). Automatic facial expression
recognition based on hybrid approach. Editorial Preface.
[81] Zhao, G., Huang, X., Taini, M., Li, S. Z., & PietikäInen, M. (2011). Facial expression
recognition from near-infrared videos. Image and Vision Computing, 29(9), pp. 607-619.
[82] Shen, P., Wang, S., & Liu, Z. (2013). Facial expression recognition from infrared thermal
videos. In Intelligent Autonomous Systems 12, pp. 323-333, Springer, Berlin, Heidelberg.
[83] Szwoch, M., & Pieniążek, P. (2015). Facial emotion recognition using depth data. In 2015 8th
International Conference on Human System Interaction (HSI), pp. 271-277, IEEE.
[84] Gunawan, A. A. (2015). Face expression detection on Kinect using active appearance model
and fuzzy logic. Procedia Computer Science, 59, pp. 268-274.
[85] Wei, W., Jia, Q., & Chen, G. (2016). Real-time facial expression recognition for affective
computing based on Kinect. In 2016 IEEE 11th Conference on Industrial Electronics and
Applications (ICIEA), pp. 161-165, IEEE.
[86] Sohail, A. S. M., & Bhattacharya, P. (2007). Classification of facial expressions using k-nearest
neighbor classifier. In International Conference on Computer Vision/Computer Graphics
Collaboration Techniques and Applications, pp. 555-566, Springer, Berlin.
[87] Wang, X. H., Liu, A., & Zhang, S. Q. (2015). New facial expression recognition based on
FSVM and KNN. Optik, 126(21), pp. 3132-3134.
[88] Valstar, M., Patras, I., & Pantic, M. (2004). Facial action unit recognition using temporal
templates. In RO-MAN 2004. 13th IEEE International Workshop on Robot and Human
Interactive Communication (IEEE Catalog No. 04TH8759), pp. 253-258, IEEE.
[89] Chen, L., Zhou, C., & Shen, L. (2012). Facial expression recognition based on SVM in E-
learning. Ieri Procedia, 2, pp. 781-787.
[90] Michel, P., & El Kaliouby, R. (2003). Real time facial expression recognition in video using
support vector machines. In Proceedings of the 5th international conference on Multimodal
interfaces, pp. 258-264.
[91] Tsai, H. H., & Chang, Y. C. (2018). Facial expression recognition using a combination of
multiple facial features and support vector machine. Soft Computing, 22(13), pp. 4389-4405.
[92] Hsieh, C. C., Hsih, M. H., Jiang, M. K., Cheng, Y. M., & Liang, E. H. (2016). Effective
semantic features for facial expressions recognition using SVM. Multimedia Tools and
Applications, 75(11), pp. 6663-6682.
[93] Saeed, S., Baber, J., Bakhtyar, M., Ullah, I., Sheikh, N., Dad, I., & Sanjrani, A. A. (2018).
Empirical evaluation of SVM for facial expression recognition. Int. J. Adv. Comput. Sci.
Appl, 9(11), pp.670-673.
[94] Wang, Y., Ai, H., Wu, B., & Huang, C. (2004). Real time facial expression recognition with
adaboost. In Proceedings of the 17th International Conference on Pattern Recognition, ICPR,
Vol. 3, pp. 926-929, IEEE.
[95] Liew, C. F., & Yairi, T. (2015). Facial expression recognition and analysis: a comparison study
of feature descriptors. IPSJ transactions on computer vision and applications, 7, pp. 104-120.
[96] Gudipati, V. K., Barman, O. R., Gaffoor, M., & Abuzneid, A. (2016). Efficient facial
expression recognition using adaboost and haar cascade classifiers. In 2016 Annual
Connecticut Conference on Industrial Electronics, Technology & Automation (CT-IETA), pp.
1-4, IEEE.
[97] Zhang, S., Hu, B., Li, T., & Zheng, X. (2018). A Study on Emotion Recognition Based on
Hierarchical Adaboost Multi-class Algorithm. In International Conference on Algorithms and
Architectures for Parallel Processing, pp. 105-113, Springer, Cham.
[98] Moghaddam, B., Jebara, T., & Pentland, A. (2000). Bayesian face recognition. Pattern
recognition, 33(11), pp. 1771-1782.
[99] Mao, Q., Rao, Q., Yu, Y., & Dong, M. (2016). Hierarchical Bayesian theme models for
multipose facial expression recognition. IEEE Transactions on Multimedia, 19(4), pp.861-873.
[100] Surace, L., Patacchiola, M., Battini Sönmez, E., Spataro, W., & Cangelosi, A. (2017).
Emotion recognition in the wild using deep neural networks and Bayesian classifiers.
In Proceedings of the 19th ACM International Conference on Multimodal Interaction,
pp. 593-597.
[101] Mahersia, H., & Hamrouni, K. (2015). Using multiple steerable filters and Bayesian
regularization for facial expression recognition. Engineering Applications of Artificial
Intelligence, 38, pp. 190-202.
[102] Kusy, M., & Zajdel, R. (2014). Application of reinforcement learning algorithms for the
adaptive computation of the smoothing parameter for probabilistic neural network. IEEE
transactions on neural networks and learning systems, 26(9), pp. 2163-2175.
[103] Neggaz, N., Besnassi, M., & Benyettou, A. (2010). Application of improved AAM and
probabilistic neural network to facial expression recognition. Journal of Applied
Sciences(Faisalabad), 10(15), pp. 1572-1579.
[104] Fazli, S., Afrouzian, R., & Seyedarabi, H. (2009). High-performance facial expression
recognition using Gabor filter and probabilistic neural network. In 2009 IEEE International
Conference on Intelligent Computing and Intelligent Systems, Vol. 4, pp. 93-96, IEEE.
[105] Mollahosseini, A., Chan, D., & Mahoor, M. H. (2016). Going deeper in facial expression
recognition using deep neural networks. In 2016 IEEE Winter conference on applications of
computer vision (WACV), pp. 1-10, IEEE.
[106] Lopes, A. T., de Aguiar, E., De Souza, A. F., & Oliveira-Santos, T. (2017). Facial
expression recognition with convolutional neural networks: coping with few data and the
training sample order. Pattern Recognition, 61, pp. 610-628.
[107] Mohammadpour, M., Khaliliardali, H., Hashemi, S. M. R., & AlyanNezhadi, M. M.
(2017). Facial emotion recognition using deep convolutional networks. In 2017 IEEE 4th
international conference on knowledge-based engineering and innovation (KBEI), pp. 0017-
0021, IEEE.
[108] Cai, J., Chang, O., Tang, X. L., Xue, C., & Wei, C. (2018). Facial expression recognition
method based on sparse batch normalization CNN. In 2018 37th Chinese Control Conference
(CCC), pp. 9608-9613, IEEE.
[109] Li, Y., Zeng, J., Shan, S., & Chen, X. (2018). Occlusion aware facial expression
recognition using cnn with attention mechanism. IEEE Transactions on Image
Processing, 28(5), pp. 2439-2450.
[110] Yolcu, G., Oztel, I., Kazan, S., Oz, C., Palaniappan, K., Lever, T. E., & Bunyak, F. (2019).
Facial expression recognition for monitoring neurological disorders based on convolutional
neural network. Multimedia Tools and Applications, 78(22), pp. 31581-31603.
[111] Agrawal, A., & Mittal, N. (2020). Using CNN for facial expression recognition: a study
of the effects of kernel size and number of filters on accuracy. The Visual Computer, 36(2),
pp. 405-412.
[112] Jain, D. K., Shamsolmoali, P., & Sehdev, P. (2019). Extended deep neural network for
facial emotion recognition. Pattern Recognition Letters, 120, pp. 69-74.
[113] Kim, D. H., Baddar, W. J., Jang, J., & Ro, Y. M. (2017). Multi-objective based spatio-
temporal feature representation learning robust to expression intensity variations for facial
expression recognition. IEEE Transactions on Affective Computing, 10(2), 223-236.
[114] Yu, Z., Liu, G., Liu, Q., & Deng, J. (2018). Spatio-temporal convolutional features with
nested LSTM for facial expression recognition. Neurocomputing, 317, pp. 50-57.
[115] Liang, D., Liang, H., Yu, Z., & Zhang, Y. (2020). Deep convolutional BiLSTM fusion
network for facial expression recognition. The Visual Computer, 36(3), pp. 499-508.
[116] Liu, P., Han, S., Meng, Z., & Tong, Y. (2014). Facial expression recognition via a boosted
deep belief network. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 1805-1812.
[117] Burkert, P., Trier, F., Afzal, M. Z., Dengel, A., & Liwicki, M. (2015). Dexpression: Deep
convolutional neural network for expression recognition. arXiv preprint arXiv:1509.05371.
[118] Mollahosseini, A., Chan, D., & Mahoor, M. H. (2016). Going deeper in facial expression
recognition using deep neural networks. In 2016 IEEE Winter conference on applications of
computer vision (WACV), pp. 1-10, IEEE.
[119] Nguyen, H. D., Yeom, S., Oh, I. S., Kim, K. M., & Kim, S. H. (2018). Facial expression
recognition using a multi-level convolutional neural network. In International Conference on
Pattern Recognition and Artificial Intelligence, pp. 217-221.
[120] Liu, C., Tang, T., Lv, K., & Wang, M. (2018). Multi-feature-based emotion recognition
for video clips. In Proceedings of the 20th ACM International Conference on Multimodal
Interaction, pp. 630-634.
[121] VenkataRamiReddy, C., Kishore, K. K., Bhattacharyya, D., & Kim, T. H. (2014). Multi-
feature fusion based facial expression classification using DLBP and DCT. International
Journal of Software Engineering and Its Applications, 8(9), pp. 55-68.
[122] Liu, K., Zhang, M., & Pan, Z. (2016). Facial expression recognition with CNN ensemble.
In 2016 international conference on cyberworlds (CW), pp. 163-166, IEEE.
[123] Fan, Y., Lam, J. C., & Li, V. O. (2018). Multi-region ensemble convolutional neural
network for facial expression recognition. In International Conference on Artificial Neural
Networks, pp. 84-94, Springer, Cham.
[124] Jung, H., Lee, S., Yim, J., Park, S., & Kim, J. (2015). Joint fine-tuning in deep neural
networks for facial expression recognition. In Proceedings of the IEEE international
conference on computer vision, pp. 2983-2991.
[125] Nguyen, L. D., Lin, D., Lin, Z., & Cao, J. (2018). Deep CNNs for microscopic image
classification by exploiting transfer learning and feature concatenation. In 2018 IEEE
International Symposium on Circuits and Systems (ISCAS), pp. 1-5, IEEE.
[126] Wang, Y., Li, Y., Song, Y., & Rong, X. (2019). Facial Expression Recognition Based on
Auxiliary Models. Algorithms, 12(11), 227.
[127] Li, T. H. S., Kuo, P. H., Tsai, T. N., & Luan, P. C. (2019). CNN and LSTM based facial
expression analysis model for a humanoid robot. IEEE Access, 7, pp. 93998-94011.
[128] Renda, A., Barsacchi, M., Bechini, A., & Marcelloni, F. (2018). Assessing Accuracy of
Ensemble Learning for Facial Expression Recognition with CNNs. In International Conference
on Machine Learning, Optimization, and Data Science, pp. 406-417, Springer, Cham.
[129] Li, C., Ma, N., & Deng, Y. (2018). Multi-network fusion based on cnn for facial expression
recognition. In 2018 International Conference on Computer Science, Electronics and
Communication Engineering (CSECE 2018). Atlantis Press.
[130] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the
inception architecture for computer vision. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 2818-2826.
[131] Review: Inception-V3 -1st Runner up in ILSVRC 2015: https://sh-
tsang.medium.com/review-inception-v3-1st-runner-up-image-classification-in-ilsvrc-2015-
17915421f77c
[132] A simple guide to the versions of the Inception Network:
https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-
7fc52b863202
[133] Kaggle dataset real time FER2013: https://www.kaggle.com/deadskull7/fer2013
[134] Facial Expression dataset CK+ : http://www.consortium.ri.cmu.edu/ckagree/
[135] Tran, E., Mayhew, M. B., Kim, H., Karande, P., & Kaplan, A. D. (2018). Facial expression
recognition using a large out-of-context dataset. In 2018 IEEE Winter Applications of
Computer Vision Workshops (WACVW), pp. 52-59, IEEE.
[136] Rifai, S., Bengio, Y., Courville, A., Vincent, P., & Mirza, M. (2012). Disentangling factors
of variation for facial expression recognition. In European Conference on Computer Vision,
pp. 808-822, Springer, Berlin, Heidelberg.
[137] Liu, M., Li, S., Shan, S., Wang, R., & Chen, X. (2014). Deeply learning deformable facial
action parts model for dynamic expression analysis. In Asian conference on computer vision,
pp. 143-157, Springer, Cham.
[138] Mollahosseini, A., Chan, D., & Mahoor, M. H. (2016). Going deeper in facial expression
recognition using deep neural networks. In 2016 IEEE Winter conference on applications of
computer vision (WACV), pp. 1-10, IEEE.
[139] Zeng, N., Zhang, H., Song, B., Liu, W., Li, Y., & Dobaie, A. M. (2018). Facial expression
recognition via learning deep sparse autoencoders. Neurocomputing, 273, pp. 643-649.
[140] Meng, Z., Liu, P., Cai, J., Han, S., & Tong, Y. (2017). Identity-aware convolutional neural
network for facial expression recognition. In 2017 12th IEEE International Conference on
Automatic Face & Gesture Recognition (FG 2017), pp. 558-565, IEEE.
[141] Zhang, Z., Luo, P., Loy, C. C., & Tang, X. (2018). From facial expression recognition to
interpersonal relation prediction. International Journal of Computer Vision, 126(5), pp. 550-
569.
[142] Khorrami, P., Paine, T., & Huang, T. (2015). Do deep neural networks learn facial action
units when doing expression recognition?. In Proceedings of the IEEE International
Conference on Computer Vision Workshops, pp. 19-27.
[143] Al-Sumaidaee, S. A., Abdullah, M. A., Al-Nima, R. R. O., Dlay, S. S., & Chambers, J. A.
(2017). Multi-gradient features and elongated quinary pattern encoding for image-based facial
expression recognition. Pattern Recognition, 71, pp. 249-263.
[144] Devries, T., Biswaranjan, K., & Taylor, G. W. (2014). Multi-task learning of facial
landmarks and expression. In 2014 Canadian Conference on Computer and Robot Vision, pp.
98-103, IEEE.
[145] Kim, B. K., Dong, S. Y., Roh, J., Kim, G., & Lee, S. Y. (2016). Fusing aligned and non-
aligned face information for automatic affect recognition in the wild: a deep learning approach.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, pp. 48-57.
[146] Review: VGG-16 Architecture used in ILSVR 2014 Challenge:
https://towardsdatascience.com/step-by-step-vgg16-implementation-in-keras-for-beginners-
a833c686ae6c
[147] Gopalakrishnan, K., Khaitan, S. K., Choudhary, A., & Agrawal, A. (2017). Deep
convolutional neural networks with transfer learning for computer vision-based data-driven
pavement distress detection. Construction and Building Materials, 157, pp. 322-330.
[148] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-
778.
[149] Introduction to ResNet Residual Network: https://www.mygreatlearning.com/blog/resnet/
[150] Architectures in convolutional neural networks: https://www.jeremyjordan.me/convnet-
architectures/
[151] The Karolinska Directed Emotional Faces dataset: https://kdef.se/home/aboutKDEF.html
[152] Vedantham, R., & Reddy, E. S. (2020). A robust feature extraction with optimized DBN-
SMO for facial expression recognition. Multimedia Tools and Applications, pp. 1-26.
[153] Eng, S. K., Ali, H., Cheah, A. Y., & Chong, Y. F. (2019). Facial expression recognition in
JAFFE and KDEF Datasets using histogram of oriented gradients and support vector machine.
In IOP Conference Series: Materials Science and Engineering, Vol. 705, No. 1, p. 012031, IOP
Publishing.
[154] Faria, D. R., Vieira, M., Faria, F. C., & Premebida, C. (2017). Affective facial expressions
recognition for human-robot interaction. In 2017 26th IEEE International Symposium on
Robot and Human Interactive Communication (RO-MAN), pp. 805-810, IEEE.
[155] Pandey, R. K., Karmakar, S., Ramakrishnan, A. G., & Saha, N. (2019). Improving facial
emotion recognition systems using gradient and laplacian images. arXiv preprint
arXiv:1902.05411.
[156] Fei, Z., Yang, E., Li, D. D. U., Butler, S., Ijomah, W., Li, X., & Zhou, H. (2020). Deep
convolution network based emotion analysis towards mental health
care. Neurocomputing, 388, pp. 212-227.
[157] Puthanidam, R. V., & Moh, T. S. (2018). A Hybrid approach for facial expression
recognition. In Proceedings of the 12th International Conference on Ubiquitous Information
Management and Communication, pp. 1-8.
[158] Hung, J. C., Lin, K. C., & Lai, N. X. (2019). Recognizing learning emotion based on
convolutional neural networks and transfer learning. Applied Soft Computing, 84, 105724.
[159] Tan, M., & Le, Q. V. (2019). Efficientnet: Rethinking model scaling for convolutional
neural networks. arXiv preprint arXiv:1905.11946.
[160] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks:
https://amaarora.github.io/2020/08/13/efficientnet.html
[161] Reviewing EfficientNet: Increasing the accuracy and robustness of CNNs:
https://heartbeat.fritz.ai/reviewing-efficientnet-increasing-the-accuracy-and-robustness-of-
cnns-6aaf411fc81d
[162] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training
by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[163] Wu, Y., & He, K. (2018). Group normalization. In Proceedings of the European
conference on computer vision (ECCV) (pp. 3-19).
List of Publications
• Chintan Thacker, Dr. Ramji Makwana, “Human Behavior Analysis through Facial
Expression Recognition in images using Deep Learning”, International Journal of
Innovative Technology and Exploring Engineering, Vol. 9, Issue 2, 2019. ISSN:
2278-3075 (Scopus Indexed)
• Chintan Thacker, Dr. Ramji Makwana, “Ensemble of Multi Features Layers in CNN
for Facial Expression Recognition using Deep Learning”, International Journal of
Recent Technology and Engineering, Vol. 8, Issue 4, 2019. ISSN: 2278-3878
(Scopus Indexed)
• Chintan Thacker, Dr. Ramji Makwana, “A Review on Intelligent Video Surveillance
System for Human Behavior Analysis”, International Journal of Institute on
Emerging Research and Engineering Technology, 2018. ISSN: 2320-7590
• Chintan Thacker, Dr. Ramji Makwana, “Multimodal Ensemble fusion of CNN for
Facial Expression Recognition using Deep Learning Techniques”, The Imaging
Science Journal (Taylor & Francis Group, SCI-Scopus Indexed) (Under Review)