
    Face recognition using Hidden Markov Models

    by

    Johan Stephen Simeon Ballot

    Thesis presented at the University of Stellenbosch

    in partial fulfilment of the requirements for the

    degree of

    Master of Science in Electronic Engineering with Computer

    Science

    Department of Electrical & Electronic Engineering

    University of Stellenbosch

    Private Bag X1, 7602 Matieland, South Africa

    Study leaders:

Prof. J.A. du Preez
Prof. B.M. Herbst

    April 2005


    Copyright

    2005 University of Stellenbosch

    All rights reserved.


    Declaration

    I, the undersigned, hereby declare that the work contained in this thesis is

    my own original work and that I have not previously in its entirety or in

    part submitted it at any university for a degree.

Signature: ............................

J.S.S. Ballot

Date: ............................


    Abstract

    Face recognition using Hidden Markov Models

    J.S.S. Ballot

    Department of Electrical & Electronic Engineering

    University of Stellenbosch

    Private Bag X1, 7602 Matieland, South Africa

    Thesis: MScEng (E&E + CS)

    April 2005

This thesis relates to the design, implementation and evaluation of statistical face recognition techniques. In particular, the use of Hidden Markov Models in various forms is investigated as a recognition tool and critically evaluated. Current face recognition techniques are very dependent on issues like background noise, lighting and the position of key features (i.e. the eyes, lips, etc.). Using an approach based on an embedded Hidden Markov Model along with spectral domain feature extraction techniques, it is shown that these dependencies may be lessened while high recognition rates are maintained.


Uittreksel

Face recognition using Hidden Markov Models

J.S.S. Ballot

Department of Electrical & Electronic Engineering

University of Stellenbosch

Private Bag X1, 7602 Matieland, South Africa

Thesis: MScEng (E&E + CS)

April 2005

This thesis deals with the design, implementation and discussion of statistical face recognition techniques. Specifically, the use of Hidden Markov Models in various forms is investigated as a recognition technique and critically evaluated. Current face recognition techniques are mostly limited by factors such as background, lighting and the position of key features (for example the eyes, lips, etc.). By using an embedded Hidden Markov Model in combination with frequency-domain feature data, it is shown that these limitations are reduced while high recognition performance is maintained.


    Acknowledgements

    I would like to express my sincere gratitude to the following people and

    organisations who have contributed to making this work possible:

    Professors du Preez and Herbst for being enthusiastic study leaders

    and staying excited about this thesis even when at times I was not.

    The National Research Foundation who funded most of this work

    through the grant holder linked program.

    My mother and father who for 6 years provided the best bursary any

    student could hope for. Not to mention the emotional support and

    unconditional love!

    My friend, Pieter Rautenbach for being a wall of ideas and for always

    giving me an honest opinion or two about my project. Also for helping

    on the segmentation code which provided the much needed artistic

    flavour in the sea of analytical despair.

    The lab coffee machine, for obvious reasons.


    Contents

    Declaration ii

    Abstract iii

    Uittreksel iv

    Acknowledgements v

    Contents vi

    List of Figures ix

    List of Tables xi

    Nomenclature xii

    1 Introduction 1

    1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

    1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Literature synopsis . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    1.6 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    2 Literature study 11

    2.1 First efforts in face recognition . . . . . . . . . . . . . . . . . 11


    2.2 Hidden Markov Models enter the face recognition race . . . . 11

    2.3 Extending the extensible . . . . . . . . . . . . . . . . . . . . 12

    2.4 The latest HMM flavours used in face recognition . . . . . . 14

    2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    3 Face databases and their peculiarities 17

    3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

    3.2 Possible issues . . . . . . . . . . . . . . . . . . . . . . . . . . 19

    3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    4 Feature extraction methods 24

    4.1 To feature or not to feature, that is the question . . . . . . . 24

    4.2 Pixel intensities . . . . . . . . . . . . . . . . . . . . . . . . 25

    4.3 An introduction to the Discrete Cosine Transform . . . . . . 27

    4.4 The Discrete Cosine Transform . . . . . . . . . . . . . . . . 28

    4.5 Giving DCT features an extra boost of robustness . . . . . . 32

    4.6 Comparison of methods and summary . . . . . . . . . . . . 34

    5 Constructing the Hidden Markov Models 36

    5.1 A brief introduction to HMMs . . . . . . . . . . . . . . . . . 36

    5.2 HMM background . . . . . . . . . . . . . . . . . . . . . . . . 36

    5.3 Model Configurations . . . . . . . . . . . . . . . . . . . . . . 39

    6 Implementation 43

    6.1 Practical aspects . . . . . . . . . . . . . . . . . . . . . . . . 43

    6.2 The HMM configurations . . . . . . . . . . . . . . . . . . . . 46

    7 Experimental investigation 53

    7.1 Experiments on the ORL database . . . . . . . . . . . . . . 53

    7.2 Experiments on the XM2VTS database . . . . . . . . . . . . 57

    7.3 Summary of classification results . . . . . . . . . . . . . . . 59

    7.4 Face segmentation . . . . . . . . . . . . . . . . . . . . . . . 62

    7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64


    8 Conclusions and recommendations 67

    8.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

    8.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

    8.3 Possible improvements and Recommendations . . . . . . . . 70

    A The ORL database 74

    A.1 The complete ORL database . . . . . . . . . . . . . . . . . . 74

    B Solution to the evaluation problem 75

B.1 The forward-backward procedure . . . . . . . . . . . . . . . 75

B.2 The Viterbi algorithm . . . . . . . . . . . . . . . . . . . . . 76

    C Face image segmentations 77

    C.1 Examples of segmentations of face images in the XM2VTS

    database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

    Bibliography 82


    List of Figures

    1.1 Information flow in recognising human faces . . . . . . . . . . 1

    2.1 A one dimensional HMM for face recognition . . . . . . . . . . 12

    2.2 A one dimensional HMM with end-of-line states . . . . . . . . 13

    2.3 An embedded HMM for face recognition . . . . . . . . . . . . . 14

    3.1 Examples of pictures from the ORL database . . . . . . . . . . 19

    3.2 Examples of pictures from the University of Surrey XM2VTS

    database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3 Histogram of pixel intensities of bottom left image of figure 3.2 . 21

3.4 Example of differences between images of the same class in the
    XM2VTS database . . . . . . . . . . . . . . . . . . . . . . . . . 22

    3.5 Histogram of pixel intensities of top left image of figure 3.1 . . 23

    4.1 Enlarged grey scale picture of matrix A . . . . . . . . . . . . . 26

    4.2 Histogram of matrix A containing grey scale values . . . . . . . 27

    4.3 Example face from the University of Surrey, XM2VTS database

    (2002) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    4.4 Ordering of DCT coefficients for N=M=4 . . . . . . . . . . . . 30

    4.5 Reconstructions of figure 4.3 using DCT coefficients . . . . . . 31

    5.1 Standard left-to-right, non-ergodic HMM . . . . . . . . . . . . 37

    5.2 Vertical top-to-bottom HMM modelling a face . . . . . . . . . . 39

    5.3 Embedded HMM modelling a face . . . . . . . . . . . . . . . . . 41


    6.1 Passing of features from the feature domain to an HMM config-

    uration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    6.2 HMM configuration I topology . . . . . . . . . . . . . . . . . . 48

    6.3 Average of the DCT means of the ORL database . . . . . . . . 49

    6.4 Average of the DCT means of the XM2VTS database . . . . . . 50

    6.5 Average of the DCT-mod2 means of the ORL database . . . . . 50

    6.6 Average of the DCT-mod2 means of the XM2VTS database . . 51

    6.7 HMM configuration II topology . . . . . . . . . . . . . . . . . . 52

7.1 Wrong classifications on the ORL database using pixel values as features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    7.2 Examples of wrongly classified face images from the XM2VTS

    database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    7.3 Segmentation of an ORL face using DCT-mod2 features . . . . 63

    7.4 Mapping segmentation from the DCT-mod2 domain to the pixel

    domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

    7.5 Segmentation of a XM2VTS face using DCT-mod2 features . . 65

    8.1 Ultimate face classification system . . . . . . . . . . . . . . . . 73

    A.1 The Olivetti Research Laboratory, ORL database (1994) . . . . 74

    C.1 Segmentation of a XM2VTS face image using DCT-mod2 fea-

    tures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    C.2 Segmentation of a XM2VTS face image using DCT-mod2 fea-

    tures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

C.3 Segmentation of a XM2VTS face image using DCT-mod2 features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    C.4 Segmentation of a mystery face image using DCT-mod2 features 81


    List of Tables

    3.1 Database comparison . . . . . . . . . . . . . . . . . . . . . . . 18

    4.1 Comparisons of feature extraction methods . . . . . . . . . . . 34

    4.2 Classification accuracy on small scale . . . . . . . . . . . . . . 35

    6.1 Comparable partitioning of databases . . . . . . . . . . . . . . . 44

7.1 Summary of classification results: configuration I . . . . . . 54

7.2 Summary of classification results: configuration II . . . . . . 54

    7.3 Best classification results from literature . . . . . . . . . . . . . 56

    7.4 Our best classification results on the ORL database . . . . . . . 57

7.5 Summary of classification results: configuration I . . . . . . 58

7.6 Summary of classification results: configuration II . . . . . . 58

    7.7 Best classification results of Zhang et al. (2004) . . . . . . . . 59

    7.8 Our results using configuration II and DCT-mod2 feature extrac-

    tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59


    Nomenclature

Constants: π = 3,1415926535897932384626433832795

    Abbreviations:

    HMM Hidden Markov Model

    HMMs Hidden Markov Models

    AI Artificial Intelligence

    GMM Gaussian Mixture Model

    PCA Principal Component Analysis

    LDA Linear Discriminant Analysis

    EM Expectation Maximisation

    PDF Probability density function

    DCT Discrete Cosine Transform

    IDCT Inverse Discrete Cosine Transform

JPEG Joint Photographic Experts Group
AC Alternating Current

    DC Direct Current

    General Variables :

    x A vector x

    N Dimension


N(μ, Σ)   Gaussian pdf with mean (vector) μ and covariance (matrix) Σ

Variables referring to HMMs:

λ = {a, f}   An HMM λ with transition probabilities a and probability density functions f

f_i(x|S_t, λ)   Probability density function of state i, quantifying the similarity of a feature vector x to the state S_t = i given the model λ

X_1^T   Observation sequence from t = 1 to t = T


    Chapter 1

    Introduction

    1.1 Motivation

    In a world where security has become a very high priority and where there is

    no tolerance towards human error in this regard, computers and especially

    software have developed to such an extent, that they are able to distin-

    guish one human from another. Whether this is via fingerprint, voice or

    other physicalities, the uniqueness of each and every human is exploited to

    build robust computerised recognition systems which should in theory be

    more reliable and more cost effective than employing a person to do the

    same work. This thesis focuses on face recognition, especially using trained

    statistical models to distinguish between a variety of individuals. The pos-

    sibilities for applications are endless. Especially in an era of global paranoia

    in terms of personal safety, the high technology security field would be the

most exploitable for its application.

Figure 1.1: Information flow in recognising human faces (Sensor → Data → Recogniser → Result)

Recognising one human face from another is a process which happens


    sub-consciously in a human being. The flow of information in a typical

    recognition process is shown in figure 1.1. In a human system the process

    can be summarised as follows.

    Information is passed from the sensors to the recogniser

    In the recogniser a database of hundreds of thousands of faces is

    scanned in an instant and matched against the data obtained from

    the sensors

    The result is a recognition success or failure

    This system is highly effective in humans. One of the problems in copying

    this process for computerised applications is that we do not know how the

    human brain (in computer terms wetware) does the recognising. What

    features are extracted from the test data? How is the massive internal

    database scanned in a fraction of a second? These are all questions which

    remain largely unanswered even with current available technology.

To implement such a system, an artificial visual recogniser tries to simulate this process, so natural to humans each and every day. In such an

    artificial system, referring to figure 1.1, as sensor there is a camera of sorts,

    as recogniser some software implemented on some hardware, and finally

    specialised software where a decision is made as to whether a subject is

    recognised or not. The problem is that in the wetware human system, a

    face which is already in the database will almost certainly be recognised,

    but in an artificial system this is not the case. An artificial system must

be trained to recognise certain known features and it must also be designed to be robust in terms of eliminating background noise. In this respect the

    filtering capacity of the human brain is still an unrivalled technology.

    To summarise, the four basic problems in an artificial recogniser system

    are:

    Choosing robust features to interpret


    Choosing a model for the recogniser

    Running a classification experiment using the chosen model

    Interpreting the results

    The construction of a computerised recogniser can be seen as a special case

    of creating some form of artificial intelligence (AI). A computer system is

    set up to perform a task usually reserved for humans and therefore this

    exercise in modelling is also an investigation in the understanding of the

    human brain to a certain extent. Hopefully we can furthermore show that

    the AI can generate both consistent and satisfactory results.

    1.2 Background

    To recognise humans, three basic paths and one hybrid path could be fol-

    lowed namely:

    Chemical

    Audio

    Visual

    Hybrid

    It could be argued that the most effective recogniser is the chemical model.

It is, however, probably the most impractical, since humans tend to be sceptical about parting with a sample of their DNA! Advances in speech recogni-

    tion technology have shown such recognition systems to have substantial

    use. But a person about to be recognised must still be able or willing

    to speak. Security based applications could furthermore require a certain

    catch-phrase/language to be spoken. A visual recogniser is a subtle recog-

    niser; it can take a photo, process it and recognise a subject. All of these

    steps can be done in an instant and if necessary, undercover. This is one of


    the reasons why dependable face recognition technology is a very attractive

    proposition for security based applications. A hybrid recogniser combines

    one or more of the above techniques to improve recognition rates. Roughly

    stated, in choosing a recognition system a trade off between ease of imple-

    mentation and practicality exists.

    The usual problems that a face recognition system needs to solve are

    (Muller (2002)):

    Known/Unknown

    Classification

    Face verification

    Full identification

    In the first problem the system needs to identify whether a specific face

    belongs to some group of known faces. This is typically encountered in

    access control or security applications. Secondly classification is when a

    decision has to be made about the identity of a given face by assigning its

    identity to a group of known faces. This means that if there are a couple

    of faces of persons X, Y and Z in the known group, would the given face

    most likely be person X, Y or Z? With face verification the given face is

    claimed to be of identity X. The system needs to verify whether this is

    correct. Typically this is also used for security type applications. This can

    be viewed as a special case of the first problem. Full identification is used

    to determine whether a face is known and then to classify it. This is a

    combination of the first and second problems.

    This thesis investigates the classification problem. The first step in de-

    signing a face recognition system is choosing the model for the recogniser.

    Hidden Markov Models (HMMs) have proved to be quite a flexible statis-

    tical modelling tool for this purpose. In this thesis HMMs are investigated

    as a solution to the second of the listed four basic problems in artificial

    recognition systems. A brief overview on Hidden Markov Model theory is


    given later on, as well as why HMMs could form the basis of quite a robust

    recognition mechanism. To summarise the scope covered in this thesis, the

    following problems are addressed:

    Sensible preprocessing of face images in a given database

    Construction of a suitable HMM model to recognise the faces in the

    database

    Classification experiments

    Interpreting the results of a classification experiment

    Comparing the results to published results

    Segmentation of facial images

    The relevant concepts of this study are therefore the peculiarities of the

    available database, the modelling using HMMs and finally the achieved

    results and their interpretation.

    1.3 Literature synopsis

    Several approaches may be found in literature for face recognition without

    HMMs. These approaches are summarised and discussed in depth in Muller

    (2002). This thesis focuses on work done on recognising faces using HMMs.

    The most notable first efforts were made by Samaria & Young (1994). These

first HMMs used in face recognition had a straightforward topology as can be seen in figure 2.1. These HMMs typically had five states, each state

    modelling a specific area of a face image.

    Each state of such an HMM contains a single multivariate Gaussian dis-

    tribution as density function and pixel intensity values are used as feature

    vectors. A given image matrix of pixel intensity values is scanned in over-

    lapping blocks from the top of the image to the bottom to train the HMM.


    Satisfactory results were achieved but the flexibility of the HMM model

    allowed for further improvements.

    The seminal work in the field of HMM based face recognition is surely

    Samaria (1994). Here a left-to-right HMM is used to obtain segmentation

    information (or meaningful regions) of a given face. This segmentation in-

    formation could then be used to identify a face. The HMM has a pseudo two

    dimensional lattice of states each describing a distribution of feature vectors

    belonging to a certain area of the face as shown in figure 2.2. Each HMM

    has an end-of-line state with two possible transitions, either to the first

    state of its row or to the next row of states. The relevant database used in

    Samaria (1994) is the Olivetti Research Laboratory, ORL database (1994).

    This database consists of faces of 40 individuals, with 10 different images

    of each individual. The main feature of this database is that a picture of

    an individual contains mainly facial information and very little background.

    Background (noise) often transforms a seemingly great recognition system

    into quite an average one.

    Simultaneous efforts by Nefian & Hayes (1999) and Eickeler et al. (1999a)

    introduced an embedded HMM model which consisted of embedded states

    inside super states as shown in figure 2.3. This allowed for better transitions

    between states since the embedded HMMs proved to be tighter probabil-

    ity density functions than normal Gaussian distributions. Both furthermore

    showed that pixel intensity values do not form the most robust of features

and that using two dimensional DCT coefficients as features delivered bet-

    ter results. More recent developments on extending HMMs to be even more

    robust as recognising tool are discussed in Chapter 2.

    This study will aim to reconstruct most of these HMM based face recog-

    nition experiments, to verify their results and hopefully add some improve-

    ments.


    1.4 Objectives

    In any study of recognition the main goal or objective is achieving some or

    other high rate of recognition, in other words classifying accurately. The

    other lesser objectives all relate to this main one in being the stepping stones

in finding the ultimate result: perfect classification. The main goals of

    this study in face recognition can be summarised as follows:

    Investigating the use of HMMs as a face recognition tool

    Implementing a number of HMM topologies that could be used as face

    classifiers

    Evaluating the chosen HMM topologies as face classifiers against avail-

    able face databases

    Comparing the results of the HMM classifier against published sys-

    tems

It can be seen that all the objectives revolve around Hidden Markov Models and applying them to the rather uncustomary field of face recognition.

    Modelling with HMMs tends to be quite a flexible process and therefore

    a number of models can be constructed and tested as tools in order to

    accomplish the aforementioned main objective.

    1.5 Contributions

The available literature on HMMs used as a face recognition tool covers the main issues regarding this solution to the face recognition problem. There is

one aspect, though, which receives little attention at the level of detail: the choice of density functions inside

    the HMM states. We believe that this thesis deals with these details and

    in fact describes the process of selecting useful density function parameters

based on the available databases, thereby generating very good results.


    Another contribution deals with the question of what features to use, in

    other words, what preprocessing of images is necessary to obtain the best

    possible results. Furthermore, by using the segmentation of data provided

    by HMMs, we can extract faces from the background and locate facial

features, something very useful in computer vision based applications.

    Again summarising these contributions:

    Choosing density function parameters for HMM (embedded or not

    embedded) states and their peculiarities

    Training the HMMs with suitable features, i.e. feature extraction and

    noise elimination

    Designing the HMM topologies in accordance with the physicalities

    of the available database

    Segmentation of a face into meaningful regions

    1.6 Overview

    The focus of this thesis is the modelling of a face classification system using

    Hidden Markov Models. We start off with an overview of the available

    literature on face recognition using HMMs in Chapter 2 on page 11. This

    chapter emphasises the fact that there is not much available in published

    literature on Hidden Markov Models used in face recognition applications.

    Two basic HMM topologies namely an embedded HMM and a single top-

to-bottom HMM are mentioned in the literature. We implement both these models to test their value as face image classifiers.

    The focus of attention then moves on to the available databases used in

    the classification experiments in Chapter 3 on page 17. Both the databases

    we consider in this thesis have some interesting characteristics. Looking at

    typical image histograms (figures 3.3 and 3.5) it may be seen that back-

    ground noise and other factors should clearly be taken into consideration


    for at least the University of Surrey, XM2VTS database (2002). We cut out

    the bulk of the background in all of the images of the XM2VTS database

    to stop it from confusing the classifier. The other database we use, namely

the Olivetti Research Laboratory, ORL database (1994), is used as it is

    since the images in this database are already in a friendly format with

    very little variation between images and background noise to confuse the

    classifier.

    This leads in to Chapter 4 on page 24. As suggested by results obtained

    by previous systems in the available literature, we use features other than

    pixel intensity values. This is done mainly to improve classification accu-

    racy. Three feature extraction methods are implemented, focusing on the

    Discrete Cosine Transform (DCT) and why DCT coefficients form more ro-

    bust features for face recognition than pixel intensity values. Furthermore,

    the feature extraction technique known as DCT-mod2 is also discussed and

    how it could improve the robustness of the classifier. Our classification

    experiments using the DCT-mod2 coefficients give excellent results.

    With all the theory of the preprocessing in place, Chapter 5 on page 36

    then covers the theoretical modelling of the HMMs used in the face classifi-

    cation experiments. We decided to implement two HMM configurations, a

    normal top-to-bottom HMM modelling down the rows of an image and an

    embedded HMM with a vertical HMM containing horizontal HMMs as the

    probability density functions within its states. These two topologies were

    chosen as they are the most widely used in the available literature. It also

    provides a good comparison of what the extra complexity of an embedded

    HMM buys in terms of classification accuracy.

    With all the necessary modelling, motivation and theory in place, Chap-

    ter 6 on page 43 explains all the practical aspects concerning the implemen-

    tation of the face classification system. Here we show the detail on how

    the HMMs we use as classifiers, are constructed. Furthermore we show the

    specifics of training and scoring our classifier on pixel intensity values, DCT

    coefficients and DCT-mod2 coefficients.


    Finally the experiments conducted and results obtained are discussed

in Chapter 7 on page 53; the list of classification results on both databases

    is noted starting on page 54. Excellent results are achieved on both the

    databases we used in the experiments. The embedded HMM using DCT-

    mod2 features obtains the best classification results. It scores perfect clas-

    sification (100%) on the ORL database and on the complex XM2VTS

    database a classification score as high as 93.31% is recorded. These re-

    sults are furthermore shown to compare well against published systems.

    We furthermore show results of segmentations done on face images, as pro-

vided by the Viterbi algorithm. These segmentations show which areas of the face the embedded HMM models.

    The final section is Chapter 8 on page 67. There the conclusions of

    this thesis are encapsulated and recommendations are made for further

    improvements in possible future work. By using techniques such as LDA

    (Linear Discriminant Analysis) or KDA (Kernel Discriminant Analysis) we

    believe that the models we discuss in this thesis can be improved to be very

    robust recognisers.


    Chapter 2

    Literature study

    2.1 First efforts in face recognition

    It can be argued that the pioneering work in the field of face recognition

was done by Kirby & Sirovich (1990). The technique they proposed, commonly known as eigenfaces, is based on Principal Component Anal-

    ysis (PCA) and has been extended and optimised by various institutions

    and people to make it one of the most widely used current face recogni-

    tion techniques. This technique and other early methods (like elastic graph

    matching and linear discriminant analysis (LDA)) are discussed in Muller

    (2002) and Sanderson (2003). These first methods all used facial geometry

    and symmetry to classify faces.

    2.2 Hidden Markov Models enter the face

    recognition race

    The first efforts to use HMMs as a face recognition tool were made by

    Samaria & Young (1994). They introduced the HMM as quite a robust

    mechanism to deal with face recognition. The HMM used was a single left-

    to-right HMM as seen in figure 2.1 with each state modelling a specific facial

    region. Each state of this HMM contains a single multivariate Gaussian dis-


Figure 2.1: A one dimensional HMM for face recognition (five left-to-right states labelled Forehead, Eyes, Nose, Mouth and Chin, with self-transitions a11, ..., a55 and forward transitions a12, ..., a45)

    tribution as probability density function (pdf). This HMM is trained on a

    database of pictures, all of them read from top to bottom with each row of

    pixel intensity values used as feature vectors. This approach achieved bet-

    ter classification rates than a PCA based approach on the tested database.

    Another bonus of introducing HMMs is that it segments the face into mean-

    ingful regions which can also be used for other applications like facial gesture

    recognition. Follow-up work by the same author, Samaria (1994), extended

    the classic one dimensional left-to-right HMM to a pseudo two dimensional

    (pseudo-2D) one. This HMM had a pseudo two dimensional lattice of states

    each describing a distribution of feature vectors belonging to a certain area

    of the face as shown in figure 2.2. Each HMM had an end-of-line state with

    two possible transitions, either to the beginning state of its row or to the

    next row of states. In each state a multivariate Gaussian distribution was

    used to model the distribution of feature vectors relevant to that state. This

    approach was tested on the Olivetti Research Laboratory, ORL database

    (1994) and again it outperformed previous face recognition techniques at

    that time.
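As a concrete illustration of the one dimensional, top-to-bottom HMM just described (five states, a single Gaussian per state, image rows as feature vectors), the following is a minimal sketch using the open-source hmmlearn library. hmmlearn is an assumed stand-in, not the toolkit used in this thesis, and the training settings are illustrative only.

```python
# Sketch only: a 5-state left-to-right HMM with one Gaussian per state,
# trained on image rows, in the spirit of Samaria & Young (1994).
# hmmlearn is an assumed substitute for the thesis's own HMM toolkit.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_face_model(images, n_states=5):
    # Each grey-scale image (2-D array) is read top to bottom; every row
    # is one feature vector, so one image is one observation sequence.
    X = np.vstack([img.astype(float) for img in images])
    lengths = [img.shape[0] for img in images]

    # Left-to-right topology: start in state 1, allow only self-loops and
    # single forward transitions (zero entries stay zero during Baum-Welch).
    start = np.zeros(n_states); start[0] = 1.0
    trans = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        trans[i, i] = trans[i, i + 1] = 0.5
    trans[-1, -1] = 1.0

    model = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=25, init_params="mc", params="tmc")
    model.startprob_ = start
    model.transmat_ = trans
    model.fit(X, lengths)
    return model

def classify(image, models):
    # One model per known person; pick the highest log-likelihood.
    return max(models, key=lambda person: models[person].score(image.astype(float)))
```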

2.3 Extending the extensible

Simultaneous efforts by Nefian & Hayes (1999) and Eickeler et al. (1999a)

    introduced an embedded HMM which consisted of embedded states inside

    super states as shown in figure 2.3. Again, each of the top-to-bottom states

    models a specific facial region. This extended HMM model allows for better

    transitions between states since the embedded HMMs prove to be tighter

    probability density functions than normal Gaussian distributions. Both


    Figure 2.2: A one dimensional HMM with end-of-line states

    authors furthermore showed that pixel intensity values do not form the most

    robust of features and that using selected two dimensional discrete cosine

    transform (DCT) coefficients as features delivered better results. Perfect

    classification (100%) was obtained on the Olivetti Research Laboratory,

    ORL database (1994) using this technique and overall recognition speed

    increased because using only selected DCT features significantly compresses

    the data. The main problem in all the above mentioned techniques was that

    they were tested on a database which consisted of pictures with very little

    background (see figure 3.1). The modelling can therefore be done very

    accurately and the HMMs can be fine tuned to deliver remarkable results.

    In a practical system this step would only be possible if faces could be


Figure 2.3: An embedded HMM for face recognition (super states labelled Forehead, Eyes, Nose, Mouth and Chin)

    identified from pictures and then preprocessed to form a background free

    image for the HMMs to classify.

2.4 The latest HMM flavours used in face recognition

    Hidden Markov Models have traditionally been used to model time depen-

    dent data. For this use they have been fine tuned and thorough research has

    already been done on the subject, especially concerning what features to

    use (for example cepstra features in automatic speech recognition systems).


    In image processing, HMMs are quite a new addition to the fold of well es-

    tablished techniques and therefore extracting robust features is still one of

    the major areas for future development. Some novel new feature extraction

    techniques are discussed in Sanderson (2003). One of these techniques is

    the DCT-mod2 approach, which we included as a feature extraction method

    in this thesis. The DCT-mod2 feature extraction method could be seen as

a form of delta-coefficient extraction. This method shows considerable potential, especially in keeping the recogniser robust when illumination changes occur.

    As far as we could establish, DCT-mod2 features have not previously been

    used in HMM based classifiers. Consensus, it seems, has been reached that

    DCT based feature extraction methods are probably the most effective.

    Other advanced efforts were made by Muller et al. (2002) where they

    proposed a triple embedded HMM based model to recognise facial ex-

    pressions. It is also worthwhile mentioning the HMM recogniser used by

M. Bicego et al. (2003), where the authors propose wavelet coding as a feature extraction method. Using wavelets as features achieves the same perfect

    classification score on the ORL database. In Othman & Aboulnasr (2003)

    the authors proposed an HMM with an extended two dimensional structure

    to use as a recogniser. This means that all states allow both vertical and

    horizontal transitions. Again DCT coefficients were used as features which

    underlined the trend to move away from pixel intensities when choosing

    feature vectors. They also achieved remarkable results but again it was on

    the ORL database.

    Another improved HMM-based recogniser was proposed by Eickeler et al.

    (1999b) using JPEG format features. What makes this technique useful is

    that it can recognise faces directly from the JPEG format compressed data

    and it is therefore an improvement speed-wise on previous efforts. This

    method also underlines the fact that DCT-based features are used to sup-

    press the sensitivity to changes in light intensity.


    2.5 Summary

    The literature provides a summary of previous HMM-based classifiers. The

HMM topology most widely used seems to be the basic top-to-bottom HMM

    modelling down the rows of an image as first proposed by Samaria & Young

    (1994). The extension of this model to an embedded HMM (by Nefian

    & Hayes (1999) and Eickeler et al. (1999a)) shows a lot of promise as a

    possibly robust classifier. Furthermore, spectral domain feature extraction

    techniques are widely used in published systems to improve the robustness

of a classifier. In this thesis we reconstruct and improve the top-to-bottom and embedded HMMs. In training these models we also use specific spectral

    domain features (DCT coefficients, as proposed in the literature) to improve

    the classification accuracy of the HMMs.


    Chapter 3

    Face databases and their

    peculiarities

    3.1 Introduction

    The results of any classification experiment should always be seen in the

context of the database of face images that the classifier involved has been trained and tested on. Such a database could be characterised by the fol-

    lowing properties:

    Format of the pictures (i.e. file type, size, grey scale/colour)

    Number of persons in the database

    Number of images per person

    Variations in lighting conditions between images

Variations in individuals' features between images

    Amount of background in a picture

    These properties all play some part in either confusing or helping the clas-

    sifier to classify the faces in the database. In our experiments we use the

    University of Surrey, XM2VTS database (2002) and the Olivetti Research


    Laboratory, ORL database (1994). These databases differ in all the prop-

    erties mentioned above, so we list the differences in table 3.1. In order to

                         ORL database       XM2VTS database
Format                   Grey scale .pgm    RGB .tiff
Image size               112x92             576x720
Persons in database      40                 295
Images per person        10                 8
Total images             400                2360
Light variation          Slight             Slight
Percentage background                       40% of image
Background uniformity    Uniform black      Non-uniform blue

Table 3.1: Database comparison

    fully understand table 3.1, see figure 3.1 for samples of the ORL database

    and figure 3.2 for samples of the XM2VTS database. For the purposes of

    this thesis the XM2VTS database images were resized to be 288x360, which

corresponds to a scaling of 1/2

    on the rows and columns. These images were

also converted from RGB¹ to grey scale. Finally a window of 236x144 pixels

    was cut out, trying to capture as much of the face as possible. The ORL

    database pictures were already in a friendly format since the pictures were

all cropped around the faces they represented, reducing confusion caused by

    background.

    The differences between these two databases provided a good test to

    show the robustness of our methods.

¹Colour pictures are represented by three pictures, each corresponding to the red,

    green or blue (three primary colours) values of the pixels.
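A minimal sketch of the preprocessing just described (halving the image size, converting to grey scale and cutting out a 236x144 window), assuming Pillow and NumPy; the crop offsets below are illustrative placeholders, as the exact window position is not stated in the text.

```python
# Sketch of the XM2VTS preprocessing described above. Pillow/NumPy and the
# crop offsets are assumptions; only the image sizes come from the text.
import numpy as np
from PIL import Image

def preprocess_xm2vts(path, top=40, left=100):
    img = Image.open(path).convert("L")      # RGB .tiff -> grey scale
    img = img.resize((360, 288))             # 288x360 (rows x cols); PIL takes (width, height)
    face = img.crop((left, top, left + 144, top + 236))  # 236x144 window
    return np.asarray(face, dtype=float)     # 236 rows x 144 columns
```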


    Figure 3.1: Examples of pictures from the ORL database

    3.2 Possible issues

    3.2.1 The XM2VTS database

In order for our classifier to perform well on both databases, we need to investigate any possible issues that could be encountered when testing our

    classifier on these databases. The XM2VTS database is an extensive frontal

    face database containing images of 295 individuals (8 images each). This

    database was mainly constructed with face verification in mind and es-

    tablished a testing protocol to ensure that different institutions compare

    equivalent results. This protocol is known as the Lausanne protocol. For a


    Figure 3.2: Examples of pictures from the University of Surrey XM2VTS

    database

    comprehensive discussion on the particulars of the XM2VTS database see

    Messer et al. (1999). The main issue that arises when using this database

    is to classify faces against the large amount of background that exists in

    the images. This database was acquired over a period of five months, with

    acquisition sessions spaced over one month intervals. The fact that the ses-

    sions were spaced a month apart means that background detail also differs

    in different images.

    We focus on figure 3.2, and specifically on the sample face at the bottom

    left of this image. It can be seen that the background takes up a high

    percentage of the pixels of the picture. When referring to the histogram

    of the sample face image (see figure 3.3), this problem becomes even more

    evident. Most of the pixel values lie in and around the value of 50. Because

    HMMs are powerful modelling tools, they tend to model on the non-uniform

    background rather than the facial data purely because the background takes

    up so much of the data.

    Three possible solutions exist to overcome this problem. The first so-


Figure 3.3: Histogram of pixel intensities of bottom left image of figure 3.2

    lution is to adapt our model by carefully choosing the probability density

    functions (pdfs) and the features to extract. This probably represents the

most scientifically correct solution. A second possible solution is to crop all the pictures so that they consist mainly of the facial data. Automatic

    procedures to do this exist but we manually extracted the faces for our final

    experiments. The third possible solution is to normalise or transform the

    images in some or other way and then use the feature extraction methods

    as described in chapter 4.

    Another feature of the XM2VTS database is the way in which lighting

    and background as well as personal features (glasses, hair etc.) vary between

images belonging to the same class. One of the more extreme cases is presented in figure 3.4.²

²The colour images have been presented as they better highlight the subtle differences

    between images.


    Figure 3.4: Example of differences between images of the same class in the

    XM2VTS database

    3.2.2 The ORL database

    This database consists of images of 40 people, with 10 images per person.

    An image of the complete database (400 faces) is given in appendix A. The

    persons captured in this database are aged between 18 and 81. There are 4

    female and 36 male subjects, with each image containing a different facial

    expression. For most of the images light conditions differ but all of the

    images are set against a uniform black background. All of the images are

    cropped to consist of mostly facial data with very little background. The

    varying conditions of light and expressions but limited background, makes

    this database ideal for controlled face classification experiments. Take for

instance the top left sample face in figure 3.1 and that image's histogram as

    shown in figure 3.5. When comparing this histogram with the one presented

    in the previous section, it may be seen that it should be easier to model

    on this database because the pixel values are more evenly spread without

    extremities at specific pixel values.


Figure 3.5: Histogram of pixel intensities of top left image of figure 3.1 (x-axis: pixel intensity on the grey scale; y-axis: number of pixels)
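For reference, histograms such as figures 3.3 and 3.5 can be reproduced with a short sketch like the one below; NumPy arrays and Matplotlib are assumed, as the thesis does not name its plotting tools.

```python
# Sketch: plot the intensity histogram of a grey-scale image (values 0..255),
# as in figures 3.3 and 3.5. Matplotlib is an assumed choice of tool.
import matplotlib.pyplot as plt

def plot_intensity_histogram(image):
    plt.hist(image.ravel(), bins=256, range=(0, 255))
    plt.xlabel("Pixel intensity on the grey scale")
    plt.ylabel("Number of pixels")
    plt.show()
```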

    3.3 Summary

The characteristics of both databases have been established. A controlled effort can therefore be made to extract robust features for the classification

    experiments. The next chapter deals with feature extraction and how it

    is necessary to develop a way to overcome the difficulties, especially those

    presented by the complex images in the XM2VTS database.


    Chapter 4

    Feature extraction methods

    4.1 To feature or not to feature, that is the

    question

    Our model operates on features extracted from the images. With the issues

    surrounding databases as described in the previous chapter, these features

    should be chosen in such a way as to ensure a separation between individ-

    uals. The extraction of features concerns the passing on of object data in

    a specific format and size to some model, mainly for the purpose of recog-

    nising the object. Referring to figure 1.1, feature extraction is the step

    in between the sensor and the recogniser. Humans are restricted to fea-

    tures based on the five senses. Therefore, using the eyes, the frequencies

    (colour) and the intensity of light are the only features from which objects

    can be identified. In training an artificial recogniser, the features seen by

    the recognition models can be manipulated. Specifically in this thesis, the

    main question concerning feature extraction that arises is: what numerical

values are needed to train the HMM based classifier effectively? The

    identification of these values is the basis of the feature extraction problem.

    The following features were investigated and specifically used to train

    the HMMs used in the face classification experiments:


    Pixel intensity values

    Discrete Cosine Transform (DCT) coefficients

    DCT-mod2 coefficients

    Pixel intensity values are the raw data representing an image. In a grey

    format they typically vary in value from 0 to 255. DCT coefficients are

    obtained by applying the two dimensional DCT to blocks of a given image.

    The DCT-mod2 coefficients are extended DCT based features as proposed

    by Sanderson & Paliwal (2002). As far as could be ascertained DCT-mod2

    based feature extraction in HMM based face classification has not been in-

    vestigated before. The following sections deal with the in-depth explanation

    of the method behind each of these feature extraction techniques and their

    advantages or disadvantages when used in the classification of face images.

    4.2 Pixel intensities

Pixel intensity values are numerical values of light intensity on a specific scale and are used to store pictures digitally. For instance, say that a

    grey-scale digital photo is taken of the face of a human at a resolution of

    720x576. This makes it possible to store a matrix (with 720 columns and

    576 rows) of light intensity values on a computer. The grey scale implies

    that the intensity values are integers representing shades of grey ranging

    from 0 (black) to 255 (white). The following example illustrates grey scale

    pixel values: Assume we have a matrix of pixel intensity values (matrix A)

    representing an image (figure 4.1):

A = [   2  255    2  255
       10  200  200  100
       50  100   50    2
        2   50  150  200 ]

Storing pictures in this raw format wastes space, so one of the many available

    compression routines is used instead. These pixel values do however repre-


    Figure 4.1: Enlarged grey scale picture of matrix A

    sent features that can be used to train the HMM topologies discussed in this

thesis and satisfactory face classification results are obtained. The problem, however, is that many features have to be kept; therefore the training and

    scoring of models becomes computationally expensive. If we wanted to

    classify the image represented by matrix A for some or other reason using

    an HMM based classifier, the image could be scanned from top to bottom

    with each row forming a single feature vector. The complete observation

    sequence is therefore the four rows of this matrix. A histogram (figure 4.2)

    of the pixel intensities can be drawn. As was shown in the previous chapter,

    typical histograms of facial images in the available databases (see figures

    3.3 and 3.5) show how face data and background, which can be regarded as

    noise, are embedded in the features (pixel intensity values). This is one of

    the reasons why pixel intensity values are not the best features to use. For

    robust face classification we want features to be decorrelated in some way

    so we can model a face image as distinctly as possible.

    To summarise the advantages of pixel intensity values as features: they

    are easy to obtain and they have the same dimensions as the image data.


Figure 4.2: Histogram of matrix A containing grey scale values (x-axis: grey scale value; y-axis: number of pixels)

    The disadvantages of pixel values are: they tend to be sensitive to image

    noise as well as image rotations or shifts, and changes in illumination. They

    furthermore induce large dimensions on observation vectors. This causes

    any complex algorithm to take an unacceptably long time to complete.
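As a small illustration of the observation sequence described in this section, the sketch below scans the example matrix A from top to bottom, each row becoming one feature vector.

```python
# Sketch: the 4x4 example matrix A scanned top to bottom, each row forming
# one feature vector of the observation sequence.
import numpy as np

A = np.array([[  2, 255,   2, 255],
              [ 10, 200, 200, 100],
              [ 50, 100,  50,   2],
              [  2,  50, 150, 200]], dtype=float)

observation_sequence = [A[i, :] for i in range(A.shape[0])]  # four row vectors
```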

    4.3 An introduction to the Discrete Cosine

    Transform

    Compressing data is essential in both biological and signal processing ap-

    plications. Even in human vision the light signals received by the approx-

    imately 130 million photo-receptors (see Steven W. Smith (1999) for more

    details) on retinal level in the eye, are sent to the brain for compression

    and processing. By the time these signals arrive at the higher centres of the

    brain, they convey magnitude (contrast), phase and frequency, which are all

principal attributes of Fourier analysis. Especially in the image processing

    community the two dimensional DCT has been used as a data compres-


    sion tool. The two dimensional DCT forms the basis of the JPEG (Joint

    Photographic Expert Group) image compression standard. It is important

    to note that we will henceforth be referring to the two dimensional version

    of the DCT only as the DCT. The original DCT is mainly used in one

    dimensional applications (i.e. not image processing).

    4.4 The Discrete Cosine Transform

4.4.1 Motivation and the case of the missing sine coefficients

    In general, to obtain the frequency representation of a two dimensional sig-

    nal the Fourier Transform is used and specifically the FFT (Fast Fourier

    Transform) algorithm. The Fourier theorem specifies that any signal can

    be represented as a weighted sum of even and odd sinusoidal terms. The

    DCT is a transform very much like the Fourier Transform, but with the

    DCT a signal is represented only by the even sinusoidal terms (hence nam-

    ing it a cosine transform). Representing image information in terms of the

    DCT rather than with the FFT has the important advantage that DCT

    coefficients are always real valued. The DCT also delivers better energy

    compression and the coefficients are nearly uncorrelated (Eickeler et al.

    (1999a)). Having nearly uncorrelated coefficients makes the DCT very at-

tractive in terms of image processing. It means, for instance, that in the application of face recognition DCT features will be less sensitive to

    changes in image illumination. In general, the two dimensional DCT of a

    MxN matrix F is defined as follows:

\[
C(u, v) = \alpha(u)\,\alpha(v) \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} F(i, j)\,
\cos\!\left[\frac{(2i + 1)u\pi}{2M}\right] \cos\!\left[\frac{(2j + 1)v\pi}{2N}\right] \tag{4.4.1}
\]

where

\[
\alpha(u) = \begin{cases} \sqrt{1/M} & u = 0 \\ \sqrt{2/M} & u > 0 \end{cases},
\qquad
\alpha(v) = \begin{cases} \sqrt{1/N} & v = 0 \\ \sqrt{2/N} & v > 0 \end{cases}
\]

and

\[
0 \le u \le M - 1, \qquad 0 \le v \le N - 1.
\]

    From equation 4.4.1 a DCT coefficient matrix can be constructed. These

    coefficients represent the energy contribution by different frequencies. The

    first coefficient (C(0, 0)) represents the DC component or the average

    value of the MxN block. The rest of the coefficients represent the different

    AC components, as contributed by each of the frequencies present.
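To make the definition concrete, the following is a direct, unoptimised sketch of equation 4.4.1; it assumes the orthonormal scaling written above, which is also what fast library routines (for example SciPy's DCT with norm='ortho') compute.

```python
# Sketch: a literal, brute-force implementation of the 2-D DCT of equation
# 4.4.1, meant only to illustrate the definition.
import numpy as np

def dct2(F):
    M, N = F.shape
    C = np.zeros((M, N))
    for u in range(M):
        for v in range(N):
            a_u = np.sqrt(1.0 / M) if u == 0 else np.sqrt(2.0 / M)
            a_v = np.sqrt(1.0 / N) if v == 0 else np.sqrt(2.0 / N)
            s = 0.0
            for i in range(M):
                for j in range(N):
                    s += (F[i, j]
                          * np.cos((2 * i + 1) * u * np.pi / (2 * M))
                          * np.cos((2 * j + 1) * v * np.pi / (2 * N)))
            C[u, v] = a_u * a_v * s
    return C  # C[0, 0] is the DC term, the remaining entries are the AC terms
```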

    For the subsequent discussion refer to the sample image of a person

    (figure 4.3, scaled to two thirds the size) taken from the University of Sur-

    rey, XM2VTS database (2002). The main advantage of the DCT is that it

    Figure 4.3: Example face from the University of Surrey, XM2VTS database

    (2002)

    compresses data. This compression property of the DCT allows a block of


Figure 4.4: Ordering of DCT coefficients for N=M=4 (the zig-zag index K runs from 0 to 15 over the 4x4 grid of (u, v) positions)

    pixels to be represented by just a few DCT coefficients and it is therefore

    possible to work with less features, and still obtain more information than

    would be present when using the larger number of pixel values. In order

    to extract the coefficients which contain the most data about the block

    of data transformed, the DCT coefficient matrix needs to be scanned in

    a zig-zag pattern as shown in figure 4.4. This is because the contributing frequencies are arranged from low to high, as indicated by the zig-zag pattern represented by increasing K.

    Figure 4.4: Ordering of DCT coefficients for N=M=4

    To show these compression properties

    the first 10x10 (a compression of approximately 4000 times), 50x50 (approximately 160 times), 100x100 (approximately 40 times) and 200x200 (approximately 10 times) coefficients were extracted from figure 4.3 and run through the inverse transform (IDCT) to obtain approximated images. See figure 4.5 for the approximations of the face image.¹ It can be seen that the DCT

    provides suitable data compression and for this reason alone it should be

    considered when constructing features used in face recognition.

    ¹ This example shows the compression capabilities of the DCT and should not be

    confused with the JPEG compression standard, in which the DCT is used, but not in

    this manner.


    Figure 4.5: Reconstructions of figure 4.3 using DCT coefficients
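    To make the reconstruction experiment above easy to reproduce, the following minimal Python sketch (an illustration only, not the code used for this thesis) keeps the top-left K x K block of 2-D DCT coefficients of a grey-scale image and inverts the transform; scipy's dctn/idctn with the orthonormal convention correspond to equation 4.4.1. A 720x576 random array is used purely as a stand-in for the face image; any grey-scale image will do.

        import numpy as np
        from scipy.fft import dctn, idctn

        def dct_truncate(image: np.ndarray, k: int) -> np.ndarray:
            """Reconstruct an image from only its top-left k x k 2-D DCT coefficients."""
            coeffs = dctn(image.astype(float), norm="ortho")   # full 2-D DCT (equation 4.4.1)
            kept = np.zeros_like(coeffs)
            kept[:k, :k] = coeffs[:k, :k]                      # low-frequency coefficients only
            return idctn(kept, norm="ortho")                   # inverse transform (IDCT)

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            face = rng.integers(0, 256, size=(720, 576)).astype(float)   # stand-in image
            for k in (10, 50, 100, 200):
                approx = dct_truncate(face, k)
                rms = np.sqrt(np.mean((face - approx) ** 2))
                print(f"K={k:3d}: compression factor {face.size / k**2:6.0f}, RMS error {rms:.1f}")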

    4.4.2 Feature extraction using the DCT

    In this thesis the selection of suitable DCT coefficients from pictures in

    the available databases (see figures 3.1 and 3.2) was evaluated as a feature extraction method. For this method of feature extraction a sliding window

    of 8x8 pixels was scanned over a picture with the standard overlap of 50% in

    both the horizontal and vertical directions. For each window of 8x8 pixels,

    a DCT coefficient matrix of the same size was obtained. This means that

    for an image of Y rows and X columns there are

        N_D = \left(\frac{2Y}{N} - 1\right)\left(\frac{2X}{N} - 1\right)   (4.4.2)

    number of 8x8 DCT coefficient blocks (with N = 8 being the size of the window). These DCT coefficient blocks are then reduced by keeping their

    first 15 coefficients (as suggested by experiments of Sanderson (2003)) by

    following the zig-zag pattern described earlier. Thus every 64 values are

    reduced to L = 15 values and a single observation used to represent the

    data of block (b, a) is now the vector:

        \mathbf{x} = \left[c^{(b,a)}_0 \ c^{(b,a)}_1 \ c^{(b,a)}_2 \ \cdots \ c^{(b,a)}_{L-1}\right]^T   (4.4.3)


    A complete observation sequence is obtained consisting of N_D of these vectors. Specifically, for the two databases used, the images were of size 112x92 and 236x144.² This means we have observation sequences of sizes N_D = 594 blocks and N_D = 2030 blocks respectively.
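    A minimal Python sketch of this feature extraction step is given below (purely for illustration, not the thesis implementation; the window size, 50% overlap and zig-zag truncation follow the description above, although the exact traversal direction of figure 4.4 is an assumption). On a 112x92 stand-in image it produces the 594 blocks of 15 coefficients predicted by equation 4.4.2.

        import numpy as np
        from scipy.fft import dctn

        def zigzag_order(n: int):
            """(row, col) indices of an n x n block in zig-zag order, low to high frequency."""
            return sorted(((i, j) for i in range(n) for j in range(n)),
                          key=lambda ij: (ij[0] + ij[1],
                                          ij[0] if (ij[0] + ij[1]) % 2 else ij[1]))

        def dct_features(image: np.ndarray, n: int = 8, keep: int = 15) -> np.ndarray:
            """Slide an n x n window with 50% overlap and keep the first `keep`
            zig-zag DCT coefficients of every block (equations 4.4.2 and 4.4.3)."""
            rows, cols = [list(range(0, dim - n + 1, n // 2)) for dim in image.shape]
            order = zigzag_order(n)[:keep]
            feats = []
            for b in rows:
                for a in cols:
                    block = dctn(image[b:b + n, a:a + n].astype(float), norm="ortho")
                    feats.append([block[i, j] for i, j in order])
            return np.array(feats)                      # shape: (N_D, keep)

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            img = rng.integers(0, 256, size=(112, 92))  # ORL-sized stand-in image
            print(dct_features(img).shape)              # (594, 15), matching N_D = 594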

    4.5 Giving DCT features an extra boost of robustness

    In Sanderson & Paliwal (2002) a novel way of adding more robustness to the DCT is introduced. This method of feature extraction is based on poly-

    nomial coefficients, also known as deltas. In speech recognition applications

    an analogue to this method of feature extraction has proved very successful

    in eliminating background noise and channel mismatch. Images however

    consist inherently of two dimensional signals and therefore we have to re-

    define these coefficients. As proposed in Sanderson & Paliwal (2002) we

    will name this new method of feature extraction DCT-mod2. For images

    we now define the n-th horizontal delta coefficient for a block located at

    (b, a) as a modified first order orthogonal polynomial coefficient (Sanderson

    & Paliwal (2002)):

        \Delta^h c^{(b,a)}_n = \frac{\sum_{k=-K}^{K} k\, h_k\, c^{(b,a+k)}_n}{\sum_{k=-K}^{K} h_k\, k^2}   (4.5.1)

    Similarly, the n-th vertical delta coefficient is defined as:

        \Delta^v c^{(b,a)}_n = \frac{\sum_{k=-K}^{K} k\, h_k\, c^{(b+k,a)}_n}{\sum_{k=-K}^{K} h_k\, k^2}   (4.5.2)

    where h is a (2K+1)-dimensional symmetric window vector and c_n is the n-th DCT coefficient of a block located at (b, a). For our purposes we let K = 1 and h = [1 1 1]^T be a rectangular window. To illustrate the advantage of

    ² It is important to note that when speaking of the size of an image the customary format is (number of rows) x (number of columns), but the resolution of an image is written the other way around.


    using these modified delta features, assume we have three consecutive blocks

    X, Y and Z, as explained in Sanderson & Paliwal (2002). Let us assume

    that each block contains an information component and a noise component,

    say X = X_I + X_N, Y = Y_I + Y_N and Z = Z_I + Z_N. Let us assume that each block is corrupted by the same noise, therefore X_N = Y_N = Z_N. This

    is a reasonable assumption to make if the blocks are small and close to each

    other or if these blocks are neighbours as the result of overlapping used in

    the sampling process. The deltas for block Y can now be computed using

    equations 4.5.1 and 4.5.2:

        \Delta^h Y = \frac{1}{2}(-X + Z) = \frac{1}{2}(-X_I - X_N + Z_I + Z_N) = \frac{1}{2}(Z_I - X_I)   (4.5.3)

    and

        \Delta^v Y = \frac{1}{2}(-X + Z) = \frac{1}{2}(-X_I - X_N + Z_I + Z_N) = \frac{1}{2}(Z_I - X_I)   (4.5.4)

    and the noise component is removed. We now modify our DCT feature

    vector by replacing the first three coefficients by their horizontal and vertical

    deltas and form a feature vector representing a given block at (b, a) as a

    new vector:

        \mathbf{x} = \left[\Delta^h c_0 \ \Delta^v c_0 \ \Delta^h c_1 \ \Delta^v c_1 \ \Delta^h c_2 \ \Delta^v c_2 \ c_3 \ c_4 \ \cdots \ c_{L-1}\right]^T   (4.5.5)

    where the (b, a) indication was left out to maintain clarity and L = 15. The

    first three coefficients represent the most information held in the block and

    therefore to limit the size of the features they are replaced with their delta

    coefficients. A block of coefficients taken on the edges of the picture will not

    have a neighbouring block on the one side, so when using the DCT-mod2


    approach we end up with

        N_{D2} = \left(\frac{2Y}{N} - 3\right)\left(\frac{2X}{N} - 3\right)   (4.5.6)

    blocks. This gives observation sequences of sizes N_{D2} = 500 blocks and N_{D2} = 1848 blocks respectively.
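    The DCT-mod2 construction can be sketched directly from equations 4.5.1, 4.5.2 and 4.5.5. The following Python fragment (an illustration under the stated choices K = 1 and h = [1 1 1]^T, not the thesis implementation) assumes the per-block DCT coefficient vectors are already arranged on their spatial grid, discards the border blocks and replaces c0, c1 and c2 by their horizontal and vertical deltas.

        import numpy as np

        def dct_mod2(blocks: np.ndarray) -> np.ndarray:
            """DCT-mod2 features with K=1 and h=[1 1 1]: replace c0, c1, c2 of every
            interior block by horizontal and vertical deltas (eqs. 4.5.1, 4.5.2, 4.5.5).

            blocks : (rows, cols, L) grid of DCT coefficient vectors
            returns: (N_D2, L + 3) feature matrix, N_D2 = (rows-2)*(cols-2)
            """
            c = blocks.astype(float)
            dh = (c[1:-1, 2:, :3] - c[1:-1, :-2, :3]) / 2.0   # (c(b,a+1) - c(b,a-1)) / 2
            dv = (c[2:, 1:-1, :3] - c[:-2, 1:-1, :3]) / 2.0   # (c(b+1,a) - c(b-1,a)) / 2
            rest = c[1:-1, 1:-1, 3:]                          # untouched c3 ... c_{L-1}
            deltas = np.empty(dh.shape[:2] + (6,))
            deltas[..., 0::2] = dh                            # [dh c0, dv c0, dh c1, dv c1, dh c2, dv c2]
            deltas[..., 1::2] = dv
            feats = np.concatenate([deltas, rest], axis=-1)
            return feats.reshape(-1, feats.shape[-1])

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            grid = rng.normal(size=(27, 22, 15))   # DCT blocks of a 112x92 image (27 x 22 grid, L = 15)
            print(dct_mod2(grid).shape)            # (500, 18): N_D2 = 500 blocks of dimension 18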

    4.6 Comparison of methods and summary

    To summarise, in general any method of feature extraction has certain characteristics which need to be taken into account when constructing an artificial recogniser. The three feature extraction methods discussed are characterised in table 4.1.

                        Pixel intensities   DCT             DCT-mod2
      Preprocessing     None                N_D 2-D DCTs    N_D 2-D DCTs and N_{D2} linear operations
      Dimensionality    Large               Small           Small
      Robustness        None                Very            Most

    Table 4.1: Comparisons of feature extraction methods

    When training HMMs to recognise faces, it is

    desirable to speed up the process without sacrificing accuracy. By using

    the two DCT-based feature extraction methods, we improved the speed of

    our system (because of the fewer dimensions of the observation sequences).

    Furthermore, our system becomes robust to changes in illumination, something that is inherent in any picture. To briefly illustrate the value of the above feature extraction methods, a small classification experiment

    was run on the first 8 individuals (using 4 images of each) in the University

    of Surrey, XM2VTS database (2002) using each of the feature extraction

    methods. The leave-one-out method of training/scoring was used, with

    HMM configuration II (see chapter 6 for details of this configuration; it

    is a configured embedded HMM). The results we obtained from this mini


    experiment are summarised in table 4.2.

                      Recognition accuracy   Wrong classifications
      Pixel values    84.38%                 5 faces
      DCT             90.63%                 3 faces
      DCT-mod2        100.0%                 0 faces

    Table 4.2: Classification accuracy on small scale

    The full results achieved on both

    evaluated databases and using all three feature extraction methods are listed

    and discussed in chapter 7. We see from this discussion on feature extraction

    techniques that we need to give our HMM classifier as much information as

    possible about an image, while wasting as little space as possible. The next chap-

    ter deals with the foundation of this thesis on face recognition, namely the

    construction of the specialised HMMs used in the classification experiments.


    Chapter 5

    Constructing the Hidden Markov Models

    5.1 A brief introduction to HMMs

    Hidden Markov Model theory forms the background of the industry stan-

    dard in speech recognition based applications. HMMs tend to be robust recognisers with extreme flexibility in terms of parameters. These characteristics led us to believe that HMMs might be suitable for image recognition and, as this thesis shows, this is in fact the case. An in depth

    discussion on HMMs is deferred to the many excellent references on the

    topic, one being by Rabiner & Juang (1986). The purpose of this chapter

    is to introduce our application specific HMMs and show how an expan-

    sion on conventional one-dimensional HMM theory will suit our inherently

    two-dimensional application.

    5.2 HMM background

    We now introduce the notation and mathematical descriptions (regarding

    HMMs) necessary to illustrate subsequent discussions on our face recogni-

    tion model.


    5.2.1 Topology and notation

    Over the years of research in pattern recognition quite a number of HMM

    topologies and configurations have seen the light of day, as mentioned in du Preez

    (1997). The standard topology we are concerned with is the non-ergodic,

    left-to-right Hidden Markov Model as in figure 5.1. The reason this specific

    model was chosen is that the human face can naturally be divided into segments common to every human (eyes, nose, mouth, chin etc.), and these features appear in the same order.

    Figure 5.1: Standard left-to-right, non-ergodic HMM

    A Hidden Markov Model is defined as a set of N emitting states as well as an initial and an end-of-line state (these states

    are so-called null-states), so we end up with N + 2 states. The expression

    St = i will indicate the occurrence of state i at time t. The time indices

    run from t = 1 to t = T, where T is the length of the observation sequence

    X = [x_1 x_2 x_3 ... x_T] to be matched to the HMM. The states are coupled by transitions, with a_ij denoting the state transition probability (the subscripts indicating the two states involved) and a_ii referring to the self-

    loop probability. The first null-state has a transition probability of 1 and

    no self-loop probability. The last null-state has no emitting probabilities; it is the termination state. Each emitting state has an associated probability density function (pdf) described as f_i(x|S_t, λ). This pdf quantifies the similarity of a feature vector x_t from the observation sequence to the state

    S_t = i. It is important to note that no time step is needed to enter the first null-state; the process will already occupy that state. Using the common shorthand notation, a single left-to-right HMM can now be described as λ = {a, f}. Introducing the null-states effectively cancels the need for defining an initial value, often denoted by π, in most of the literature on

    HMM theory.
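    The topology just described can be written down compactly. The sketch below (Python, illustrative only) builds the transition matrix of a left-to-right HMM with N emitting states plus the two null-states: the entry null-state moves to the first emitting state with probability 1, and each emitting state carries a self-loop a_ii and a forward transition a_{i,i+1}. The value 0.5 matches the initialisation described later in section 5.3.1.

        import numpy as np

        def left_to_right_transitions(n_states: int, self_loop: float = 0.5) -> np.ndarray:
            """Transition matrix for a left-to-right HMM with N emitting states plus
            an initial and a terminating null-state (rows/columns 0 and N+1)."""
            n = n_states + 2
            a = np.zeros((n, n))
            a[0, 1] = 1.0                      # initial null-state enters state 1 with probability 1
            for i in range(1, n_states + 1):
                a[i, i] = self_loop            # a_ii: self-loop
                a[i, i + 1] = 1.0 - self_loop  # a_{i,i+1}: move on (the last move terminates)
            return a

        if __name__ == "__main__":
            a = left_to_right_transitions(5)
            print(a)
            print(a[1:-1].sum(axis=1))         # each emitting state's outgoing probabilities sum to 1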

    In order to train an HMM we need to quantify a few probabilities. The

    match between an observation sequence X_1^T and the model λ can be expressed in terms of the likelihood f(X_1^T | λ). The calculation of this likelihood is often known as the evaluation problem. A possible solution to this problem is to enumerate all possible sequences of states S_0^{T+1}, determine the value of f(X_1^T, S_0^{T+1} | λ) for each, and then determine the marginal pdf

    by summing over all of them. A more efficient approach is the forward-

    backward procedure, described in appendix B. We approximate this by the

    well known Viterbi algorithm since it is faster. The sequence which delivers

    the highest score will be the solution to what is known as the decoding

    problem, yielding the most likely state sequence.

    In training the HMM we need to optimise the parameters of the model

    based on the observation sequence. This can be quantified as finding the

    highest value of f(λ | X_1^T, S_0^{T+1}).¹ We used what is known as Viterbi re-

    estimation to solve what is often known as the learning problem. This

    method uses the state sequence (segmentation) obtained by the Viterbi al-

    gorithm to re-estimate the parameters of the HMM. This can easily be

    accomplished by simply updating all the parameters (pdfs and transition

    probabilities) within the segments specified by the Viterbi algorithm's segmentation. This algorithm is an example of an Expectation-Maximisation algorithm, as we change our pdfs' parameters to obtain the maximum prob-

    ability score (expectation).

    The procedures described above all involve matching an observation sequence to the model. This is quantified as a probability f(X_1^T | λ),

    showing that any HMM can be seen as a special kind of pdf.

    ¹ A reader familiar with basic statistics will note that this is the reverse of the eval-

    uation problem, and therefore simple Bayesian identities can solve this problem.
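    For concreteness, a log-domain Viterbi scorer for the left-to-right topology can be sketched in a few lines of Python (again only an illustration, not the trellis code actually used; it assumes the per-state log emission values have already been computed, and it forces the path to start in the first and end in the last emitting state, mimicking the two null-states).

        import numpy as np

        def viterbi_score(log_emis: np.ndarray, log_trans: np.ndarray) -> float:
            """Log score of the single best state path through a left-to-right HMM.

            log_emis[t, i]  : log f_i(x_t) for T observations and N emitting states
            log_trans[i, j] : log a_ij over the N emitting states (row-stochastic)
            """
            T, N = log_emis.shape
            delta = np.full(N, -np.inf)
            delta[0] = log_emis[0, 0]                 # entry null-state -> first emitting state
            for t in range(1, T):
                stay = delta + np.diag(log_trans)     # self-loops a_ii
                move = np.full(N, -np.inf)
                move[1:] = delta[:-1] + log_trans[np.arange(N - 1), np.arange(1, N)]  # a_{i,i+1}
                delta = np.maximum(stay, move) + log_emis[t]
            return delta[-1]                          # exit through the termination null-state

        if __name__ == "__main__":
            rng = np.random.default_rng(1)
            T, N = 20, 5
            a = np.full((N, N), 1e-12)
            for i in range(N - 1):
                a[i, i] = a[i, i + 1] = 0.5           # a_ii = a_{i,i+1} = 0.5, as in section 5.3.1
            a[N - 1, N - 1] = 1.0
            log_emis = rng.normal(size=(T, N))        # stand-in for per-state log pdf values
            print(viterbi_score(log_emis, np.log(a)))

    Viterbi re-estimation then updates each state's pdf and transition probabilities from the segments that this best path assigns to it, exactly as described above.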


    5.3 Model Configurations

    5.3.1 First configuration: 1D HMM

    For the face classification task we used two basic configurations of Hidden

    Markov Models. In the first case the face was modelled with a vertical

    HMM running along the rows of the image as seen in figure 5.2. With each

    state of the HMM representing a distinct facial region (i.e. the eyes, mouth,

    chin etc.), the characteristic features of any person can be modelled.

    Figure 5.2: Vertical top-to-bottom HMM modelling a face

    Inside each state S we use a Gaussian mixture model (GMM) as the probability density function f_i(x|S_t, λ) within the state. A Gaussian mixture model


    can be expressed as a weighted sum of K Gaussian distributions:

        L(\mathbf{x}) = \sum_{k=1}^{K} p(k)\, \mathcal{N}_k(\mathbf{x})   (5.3.1)

    where \mathcal{N}_k(\mathbf{x}) is a D-dimensional Gaussian distribution with mean \mu and covariance matrix \Sigma:

        \mathcal{N}_k(\mathbf{x}|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right]   (5.3.2)

    and p(k) is a mixture weight constrained by:

        0 \leq p(k) \leq 1 \quad \text{and} \quad \sum_{k=1}^{K} p(k) = 1

    The mixture weights can be seen as probabilities since they represent the

    importance of each separate Gaussian pdf in the GMM. The dimension D

    of the Gaussians depends on the feature extraction method we use.²
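    Equations 5.3.1 and 5.3.2 translate directly into code for the diagonal-covariance case used in this work. The sketch below (illustrative Python; the numbers in the example are arbitrary) evaluates log L(x) with the usual log-sum-exp over the K weighted components.

        import numpy as np

        def gmm_logpdf(x: np.ndarray, weights: np.ndarray,
                       means: np.ndarray, variances: np.ndarray) -> float:
            """log L(x) for a diagonal-covariance GMM (equations 5.3.1 and 5.3.2).

            weights   : (K,)   mixture weights p(k), summing to 1
            means     : (K, D) component means
            variances : (K, D) diagonal entries of each covariance matrix
            """
            d = x.shape[-1]
            log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
            log_exp = -0.5 * (((x - means) ** 2) / variances).sum(axis=1)
            return float(np.logaddexp.reduce(np.log(weights) + log_norm + log_exp))

        if __name__ == "__main__":
            w = np.array([0.5, 0.3, 0.2])
            mu = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, -1.0]])
            var = np.ones((3, 2))
            print(gmm_logpdf(np.array([0.5, 0.5]), w, mu, var))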

    Now that we have the density functions, we can finalise our HMM by

    initialising it. This is done by uniformly segmenting the face under consid-

    eration along its rows and obtaining the mean vector and covariance matrix

    of each of these segments. These will be the initial values of the parameters of our GMMs. Furthermore we set all the transition probabilities of the HMM equal to a_ij = 0.5, keeping in mind that for each state of the HMM these probabilities sum to 1. Now we have a complete model of our face, represented by λ = {a, f}. In order to train this model we use the procedure described in the previous section, matching an observation sequence to the model and then optimising the model's parameters.

    5.3.2 Second configuration: Embedded HMM

    We illustrated that calculating the match between an observation sequence

    and a model can be characterised as a probability f(X_1^T | λ). This means

    that an HMM itself could be seen as a specialised pdf. Embedding HMMs

    ² See chapter 4 for feature extraction methods.


    to serve as the pdfs of the states of our vertical HMM could indeed enhance

    the modelling capabilities of our system. For an embedded HMM the con-

    ventional top-to-bottom HMM has a horizontal HMM as the pdf of each of

    its states (instead of a GMM) as shown in figure 5.3. This means that for

    each vertical state we have f_i(x|S_t, λ) ≡ λ_i^e = {a_i^e, f_i^e}, where the superscript e indicates that we are referring specifically to the horizontal HMM λ_i^e, with i indicating the vertical state under consideration.

    Figure 5.3: Embedded HMM modelling a face

    Each of the horizontal HMMs also needs probability density functions f_i^e(x|λ_i) for its states. These

    pdfs were chosen to be Gaussian mixture models (as described by equation

    5.3.1), due to their flexibility. To initialise the whole HMM structure, an image under consideration is segmented uniformly. Each of these segments

    is then again uniformly segmented across its columns so we end up with

    uniform blocks of data.³ The mean vector and covariance matrices of each

    of these blocks are now found so that every GMM obtains values for its

    ³ Thus for a 5x5 embedded HMM we will have 25 uniform blocks.


    parameters. Just to clarify, the estimated mean of column m of an MxN block (matrix) x_d of data is:

        \hat{\mu}_d(m) = E[x_d(m)] = \frac{1}{N} \sum_{i=1}^{N} x_d(i, m)   (5.3.3)

    Thus the total mean \mu_d of the block can be expressed as a vector of M means (m = 1, ..., M). The covariance matrix is estimated by:

        \hat{\Sigma}_d = \frac{1}{N} \sum_{i=1}^{N} \left(x_d(i) - \mu_d\right)\left(x_d(i) - \mu_d\right)^T   (5.3.4)

    where x_d(i) denotes the i-th row of the block.
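    The initialisation just described amounts to computing these two statistics for every uniform block. A small Python sketch follows (illustrative only; raw pixel blocks are used here as a stand-in for the blocks of feature vectors that are actually modelled):

        import numpy as np

        def block_stats(block: np.ndarray):
            """Column-wise mean vector and covariance matrix of one uniform block
            of data, as in equations 5.3.3 and 5.3.4 (rows act as observations)."""
            mu = block.mean(axis=0)                           # vector of M column means
            centred = block - mu
            sigma = centred.T @ centred / block.shape[0]      # 1/N sum of outer products
            return mu, sigma

        def uniform_block_stats(image: np.ndarray, n_vert: int, n_horz: int):
            """Initial GMM statistics for an n_vert x n_horz embedded HMM: split the
            image into uniform blocks and return (mean, covariance) per block."""
            stats = []
            for rows in np.array_split(image, n_vert, axis=0):
                for block in np.array_split(rows, n_horz, axis=1):
                    stats.append(block_stats(block.astype(float)))
            return stats

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            face = rng.integers(0, 256, size=(112, 92))
            stats = uniform_block_stats(face, 5, 5)   # 25 uniform blocks for a 5x5 embedded HMM
            print(len(stats), stats[0][0].shape, stats[0][1].shape)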

    To summarise the whole embedded HMM concept, we have a vertical HMM

    containing a number of horizontal HMMs as probability density functions

    within its states. The initial values of this vertical HMM are the combined

    initial values of the horizontal HMMs along with uniform transition proba-

    bilities. Each horizontal HMM has a GMM as probability density function

    within each of its states, uniform transitional probabilities and the initial

    values are obtained from uniform blocks of data to initialise the GMMs.

    The horizontal HMMs are trained as described in a previous section and

    this means the calculation of a likelihood. Using the likelihoods calculated

    for all of the horizontal HMMs, we can finally obtain the same type of like-

    lihood value for our vertical HMM. This results in a trained final model.
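    To make the nesting concrete, the simplified sketch below scores an image against such a structure by fixing the vertical segmentation to uniform row strips: each strip is scored by the horizontal model of the corresponding vertical state, and these log-likelihoods act as the state "emission" scores of the vertical HMM. This is only an assumption-laden illustration; the horizontal scorers are stubbed with a trivial Gaussian match, and a real embedded HMM also searches over the vertical segmentation.

        import numpy as np

        def embedded_score(strips, horizontal_scorers, log_trans):
            """Simplified embedded-HMM score: one strip per vertical state, so the
            vertical path is fixed and the total log score is the sum of per-strip
            horizontal log-likelihoods plus the transitions along that path."""
            assert len(strips) == len(horizontal_scorers)
            score = 0.0
            for i, (strip, scorer) in enumerate(zip(strips, horizontal_scorers)):
                score += scorer(strip)                  # log f_i(strip | lambda_i^e)
                score += log_trans[i]                   # log transition along the fixed path
            return score

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            face = rng.normal(size=(100, 80))
            strips = np.array_split(face, 5, axis=0)    # 5 vertical states -> 5 row strips

            def make_scorer(mu):                        # stub for a trained horizontal HMM:
                return lambda strip: -0.5 * np.sum((strip - mu) ** 2)   # just a Gaussian match

            scorers = [make_scorer(mu) for mu in np.linspace(-1, 1, 5)]
            print(embedded_score(strips, scorers, np.log(np.full(5, 0.5))))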


    Chapter 6

    Implementation

    6.1 Practical aspects

    Now that we have built all the necessary foundations that explain the gist

    of our system, we can proceed to discuss the implementation issues and the

    practicalities of building a robust classifier.

    6.1.1 Classifying faces: database partitioning

    In order to classify faces we need to train our HMM based classifier on

    sample images (training data) of each person in the database. Then other

    unseen images, or test data, are scored against our trained models and, as in

    most real life scenarios, the highest score wins!

    The immediate question which arises involves the partitioning of the

    databases in terms of training and testing data. In order to compare our

    classification results with published results, the same partitioning must be

    used. The partitioning of a database is often done only once (by the first

    publisher) and then, in order to compare results, such a partitioning seems

    to propagate through all further publications in the field. To explain the

    partitioning problem we summarise common partitions of our two databases

    in Table 6.1. Just to clarify what is meant by the percentage values: they give the amount of training data as a percentage of the total amount of (training plus test) data.

              Partitioning    Reference
      XM2VTS  75%             Zhang et al. (2004)
      ORL     50%             Samaria (1994)

    Table 6.1: Comparable partitioning of databases

    Although the XM2VTS

    database has 8 images per person, we only used 4 faces (each time using 3

    faces to train on and one to test on) in experiments. We chose these faces in

    accordance with what seems to be the four faces used in Zhang et al. (2004).

    This could prove to be quite limiting as efficient modelling using HMMs is

    known to be very training data dependent. In the XM2VTS database one

    is also dealing with 295 individuals, so face classification does become more

    difficult. The only drawback is that, as far as could be established, Zhang

    et al. (2004) is the only known publication with classification results. All

    other publications concerning this database tackled the verification problem

    (because of the well defined protocol as described in Messer et al. (1999))

    and a large number of verification results are obtainable. Following the

    above discussion we set up the XM2VTS database experiments with 3 faces to train on and one to test on, to compare classification results.¹

    The experiments on the ORL database give a good indication of how

    our system compares to other systems, since a large number of published

    results are available. This database however has a collection of only 40

    individuals (figure A.1) which means results could be seen only as a rough

    approximation to how a commercial system would perform. We used the

    ORL database as it is and did classification experiments using the historical

    50% partitioning to compare our system with previous results, as well as a full leave-one-out experiment. To clarify the 50%, it means that the first

    five faces were used to train on and the last five used to test on. Perfect

    classification rates (100%) are obtainable on the ORL database, as we show

    in the next chapter.

    ¹ This partitioning does have its merits: the images were shot one month apart, so differences in appearance (different hair, glasses etc.) are present.


    6.1.2 Classifying faces: image background

    One of the most frustrating problems encountered in constructing our classi-

    fier was the large amount of background present in the XM2VTS database.

    As shown previously, the background can represent more than 40% of an im-

    age. A classification experiment was conducted using the embedded HMM

    approach with DCT-mod2 coefficients, using the first face of four as test

    data and the other three faces as training data on all 295 individuals in

    the database without reducing the amount of background. This carelessness was reflected in the results, as we achieved a correct classification rate of only 58%. Since we have 8 faces available, but use only four, the effect of

    the background can easily be verified by running the same experiment and

    again using 4 faces. However, this time 2 faces are replaced with 2 of those

    that have been left out.² In doing this the classification rate increases to

    80%! This means that the error rate is effectively halved. It is necessary to

    take out the background: it is confusing our classifier! This problem at

    least illustrates that because of the modelling power and dynamic aspects

    of HMMs, they are so flexible that they tend to model the background if it represents the bulk of the available data.

    As a solution to the problem posed by too much background we cut out

    all the relevant faces of the XM2VTS database (first face from each of the 4

    capturing sessions), giving a total of 295 x 4 faces to classify. The faces were

    cut out manually from downsized images, so as to fit as much of the face

    as possible into a 236x144 sized window. This corresponds to 56x33 blocks

    of DCT-mod2 coefficients and 58x35 blocks of DCT coefficients. Again the

    dimension of the DCT features is 15 and that of DCT-mod2 features is 18. The DCT coefficients are obtained with a sampling overlap of 50%.

    This was the combination of dimensions used in our final experiments.

    In the ORL database no cropping of faces was needed as the data is already

    presented in a friendly format as shown in chapter 3.

    ² The 8 faces per person in the XM2VTS database were shot across 4 sessions; here

    we use the 2 faces from the first 2 sessions.


    6.1.3 Classifying faces: training and scoring HMMs

    The process of classification on a database of faces can be summarised as

    follows:

    • First a database is partitioned into a training and a testing part, with the training data used to train an HMM for each person in the database. This means we have, for each person in the database, an HMM trained on that person's training data.

    • Each test face is then scored against all these models and it is classified to the model with the highest similarity measure. This scoring procedure was done with the reversed Viterbi algorithm, each score representing the similarity between the test face image and a trained model.
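    In code this classification step is a simple argmax over per-model scores. The sketch below (illustrative Python; the scoring function would in practice be the Viterbi match of an observation sequence against a trained HMM, but is stubbed here with a toy distance) shows the pattern:

        import numpy as np

        def classify(test_face, models, score):
            """Assign a test face to the enrolled person whose trained model gives
            the highest score, as described above.

            models : dict mapping person id -> trained model
            score  : function (face, model) -> log-likelihood (e.g. a Viterbi scorer)
            """
            scores = {person: score(test_face, model) for person, model in models.items()}
            return max(scores, key=scores.get), scores

        if __name__ == "__main__":
            # toy stand-in: each "model" is just a mean image, scored by negative distance
            rng = np.random.default_rng(0)
            models = {p: rng.normal(size=(10, 10)) for p in ("A", "B", "C")}
            probe = models["B"] + 0.1 * rng.normal(size=(10, 10))
            winner, all_scores = classify(probe, models, lambda x, m: -np.sum((x - m) ** 2))
            print(winner)   # expected: "B"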

    6.2 The HMM configurations

    For our experiments we use two basic configurations of HMMs with all three of the feature extraction techniques described in chapter 4. We also evaluate these six possible setups on both of the available databases. For all

    the following discussions on the dimensions of the observations, and hence of the Gaussian Mixture Models, see Figure 6.1.

    Figure 6.1: Passing of features from the feature domain to an HMM configuration

    6.2.1 HMM configuration I

    For configuration I we use a simple one dimensional left-to-right HMM

    modelling down the rows of each image. It has Gaussian mixture models

    within its states as probability density functions, each one modelling hori-

    zontal data. We specify seven states, each state modelling a specific facial

    region or background, namely: top background and hair, forehead, eyes,



    nose, mouth, chin and finally neck or possibly clothing as shown concep-

    tually in figure 6.2. The Gaussian mixtures consisted of three diagonal

    covariance Gaussians each, initialised with uniform weights in the mixture. The dimensions of the Gaussians were chosen to be N = 15 for the DCT

    based experiments and N = 18 for the DCT-mod2 experiments, as these are

    the dimensions of the features extracted as proposed by Sanderson (2003).

    For the pixel value experiments a dimension of N = 4 was used with a

    single observation being represented by a column vector of 4 pixel values.

    The whole observation sequence is formed by scanning the image from top to

    bottom with a window of 4 pixels high and a 75% overlap. The dimension of

    4 and the 75% overlap are chosen based on experiments by Samaria (1994). The same method is used in the DCT and DCT-mod2 experiments, except for the overlapping: this is set at 50% and done in the step where

    we transform the pixel values with the DCT (see chapter 4 for more details).
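    One possible reading of the pixel-value sampling just described is sketched below (illustrative Python; the exact traversal order of the 4-pixel column vectors is an assumption, since only the window height and the 75% overlap are specified above).

        import numpy as np

        def pixel_observations(image: np.ndarray, height: int = 4, overlap: float = 0.75):
            """Observation sequence for configuration I with raw pixel features: scan
            the image top to bottom with a strip `height` pixels high and the given
            overlap; every column of every strip becomes one observation vector."""
            step = max(1, int(round(height * (1.0 - overlap))))   # 4 pixels, 75% overlap -> step 1
            obs = []
            for top in range(0, image.shape[0] - height + 1, step):
                strip = image[top:top + height, :].astype(float)
                obs.extend(strip[:, col] for col in range(strip.shape[1]))
            return np.array(obs)                                  # shape: (T, height)

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            img = rng.integers(0, 256, size=(112, 92))
            print(pixel_observations(img).shape)                  # (T, 4)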

    In this configuration it is important to initialise the Gaussian parameters

    to sensible values, since the Gaussians do most of the modelling; the HMM merely selects which one of the GMMs is used.

    Figure 6.2: HMM configuration I topology

    For the pixel value experiments we intuitively choose the initial values as follows. Since

    the GMMs we use consist of three Gaussians each, we decided to divide the

    grey scale domain (0-255) into 4 roughly equal parts and therefore obtain

    three distinct borders used as the initial means for the Gaussians. These

    means are: 60, 120 and 180. This ensures that no prejudice towards particular pixel values is introduced that could bias the classifier. The diagonal covariance

    matrices are all initialised with values of 100 on the diagonal. For the DCT

    values and the DCT-mod2 values a more careful approach was needed, since these features do provide more stability, but at the cost of needing good

    initialisation. To the best of our knowledge the effect of initialisation has

    not been covered before in the literature, and we believe it to have an effect on

    the outcome of the classification experiment.
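    The pixel-value initialisation just described can be written down directly. The sketch below is illustrative only; extending the same initial mean to all four dimensions of the observation vector is our assumption, as the text only gives the scalar grey-level borders.

        import numpy as np

        def initial_pixel_gmm(dim: int = 4):
            """Unbiased initial GMM parameters for the pixel-value experiments:
            three diagonal Gaussians with means 60, 120 and 180 in every dimension,
            a variance of 100 on the diagonal and uniform mixture weights."""
            means = np.array([[60.0] * dim, [120.0] * dim, [180.0] * dim])
            variances = np.full((3, dim), 100.0)
            weights = np.full(3, 1.0 / 3.0)
            return weights, means, variances

        if __name__ == "__main__":
            w, mu, var = initial_pixel_gmm()
            print(w, mu[:, 0], var[0, 0])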

    For the DCT coefficients obtained from both databases we take the av-

    erage of the means down the columns for all 400 of the faces in the case


    of the ORL database and all 1180 faces of the XM2VTS database. These

    means are shown in figures 6.3 and 6.4. From this we decided to initiali


Recommended