
    Face recognition using Hidden Markov Models

    by

    Johan Stephen Simeon Ballot

    Thesis presented at the University of Stellenbosch

    in partial fulfilment of the requirements for the

    degree of

    Master of Science in Electronic Engineering with Computer

    Science

    Department of Electrical & Electronic Engineering

    University of Stellenbosch

    Private Bag X1, 7602 Matieland, South Africa

    Study leaders:

Prof. J.A. du Preez
Prof. B.M. Herbst

    April 2005


    Copyright

    2005 University of Stellenbosch

    All rights reserved.


    Declaration

    I, the undersigned, hereby declare that the work contained in this thesis is

    my own original work and that I have not previously in its entirety or in

    part submitted it at any university for a degree.

Signature: ............................

J.S.S. Ballot

Date: ............................


    Abstract

    Face recognition using Hidden Markov Models

    J.S.S. Ballot

    Department of Electrical & Electronic Engineering

    University of Stellenbosch

    Private Bag X1, 7602 Matieland, South Africa

    Thesis: MScEng (E&E + CS)

    April 2005

This thesis relates to the design, implementation and evaluation of statistical face recognition techniques. In particular, the use of Hidden Markov Models in various forms is investigated as a recognition tool and critically evaluated. Current face recognition techniques are very dependent on issues like background noise, lighting and the position of key features (i.e. the eyes, lips, etc.). Using an approach based on an embedded Hidden Markov Model along with spectral domain feature extraction techniques, it is shown that these dependencies may be lessened while high recognition rates are maintained.


Uittreksel

Face recognition using Hidden Markov Models

J.S.S. Ballot

Department of Electrical & Electronic Engineering

University of Stellenbosch

Private Bag X1, 7602 Matieland, South Africa

Thesis: MScEng (E&E + CS)

April 2005

This thesis deals with the design, implementation and discussion of statistical face recognition techniques. Specifically, the use of Hidden Markov Models in various forms is investigated as a recognition technique and critically evaluated. Current face recognition techniques are mostly limited by factors such as background, lighting and the position of key features (for example the eyes, lips, etc.). By using an embedded Hidden Markov Model in combination with frequency-domain feature data, it is shown that these limitations are reduced while high recognition performance is maintained.


    Acknowledgements

    I would like to express my sincere gratitude to the following people and

    organisations who have contributed to making this work possible:

    Professors du Preez and Herbst for being enthusiastic study leaders

    and staying excited about this thesis even when at times I was not.

    The National Research Foundation who funded most of this work

    through the grant holder linked program.

    My mother and father who for 6 years provided the best bursary any

    student could hope for. Not to mention the emotional support and

    unconditional love!

    My friend, Pieter Rautenbach for being a wall of ideas and for always

    giving me an honest opinion or two about my project. Also for helping

    on the segmentation code which provided the much needed artistic

    flavour in the sea of analytical despair.

    The lab coffee machine, for obvious reasons.


    Contents

    Declaration ii

    Abstract iii

    Uittreksel iv

    Acknowledgements v

    Contents vi

    List of Figures ix

    List of Tables xi

    Nomenclature xii

    1 Introduction 1

    1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

    1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Literature synopsis . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    1.6 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    2 Literature study 11

    2.1 First efforts in face recognition . . . . . . . . . . . . . . . . . 11


    2.2 Hidden Markov Models enter the face recognition race . . . . 11

    2.3 Extending the extensible . . . . . . . . . . . . . . . . . . . . 12

    2.4 The latest HMM flavours used in face recognition . . . . . . 14

    2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    3 Face databases and their peculiarities 17

    3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

    3.2 Possible issues . . . . . . . . . . . . . . . . . . . . . . . . . . 19

    3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    4 Feature extraction methods 24

    4.1 To feature or not to feature, that is the question . . . . . . . 24

    4.2 Pixel intensities . . . . . . . . . . . . . . . . . . . . . . . . 25

    4.3 An introduction to the Discrete Cosine Transform . . . . . . 27

    4.4 The Discrete Cosine Transform . . . . . . . . . . . . . . . . 28

    4.5 Giving DCT features an extra boost of robustness . . . . . . 32

    4.6 Comparison of methods and summary . . . . . . . . . . . . 34

    5 Constructing the Hidden Markov Models 36

    5.1 A brief introduction to HMMs . . . . . . . . . . . . . . . . . 36

    5.2 HMM background . . . . . . . . . . . . . . . . . . . . . . . . 36

    5.3 Model Configurations . . . . . . . . . . . . . . . . . . . . . . 39

    6 Implementation 43

    6.1 Practical aspects . . . . . . . . . . . . . . . . . . . . . . . . 43

    6.2 The HMM configurations . . . . . . . . . . . . . . . . . . . . 46

    7 Experimental investigation 53

    7.1 Experiments on the ORL database . . . . . . . . . . . . . . 53

    7.2 Experiments on the XM2VTS database . . . . . . . . . . . . 57

    7.3 Summary of classification results . . . . . . . . . . . . . . . 59

    7.4 Face segmentation . . . . . . . . . . . . . . . . . . . . . . . 62

    7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64


    8 Conclusions and recommendations 67

    8.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

    8.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

    8.3 Possible improvements and Recommendations . . . . . . . . 70

    A The ORL database 74

    A.1 The complete ORL database . . . . . . . . . . . . . . . . . . 74

    B Solution to the evaluation problem 75

B.1 The forward-backward procedure . . . . . . . . . . . . . . . 75

B.2 The Viterbi algorithm . . . . . . . . . . . . . . . . . . . . . 76

    C Face image segmentations 77

    C.1 Examples of segmentations of face images in the XM2VTS

    database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

    Bibliography 82


    List of Figures

    1.1 Information flow in recognising human faces . . . . . . . . . . 1

    2.1 A one dimensional HMM for face recognition . . . . . . . . . . 12

    2.2 A one dimensional HMM with end-of-line states . . . . . . . . 13

    2.3 An embedded HMM for face recognition . . . . . . . . . . . . . 14

    3.1 Examples of pictures from the ORL database . . . . . . . . . . 19

    3.2 Examples of pictures from the University of Surrey XM2VTS

    database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3 Histogram of pixel intensities of bottom left image of figure 3.2 . 21

3.4 Example of differences between images of the same class in the
    XM2VTS database . . . . . . . . . . . . . . . . . . . . . . . . . 22

    3.5 Histogram of pixel intensities of top left image of figure 3.1 . . 23

    4.1 Enlarged grey scale picture of matrix A . . . . . . . . . . . . . 26

    4.2 Histogram of matrix A containing grey scale values . . . . . . . 27

    4.3 Example face from the University of Surrey, XM2VTS database

    (2002) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    4.4 Ordering of DCT coefficients for N=M=4 . . . . . . . . . . . . 30

    4.5 Reconstructions of figure 4.3 using DCT coefficients . . . . . . 31

    5.1 Standard left-to-right, non-ergodic HMM . . . . . . . . . . . . 37

    5.2 Vertical top-to-bottom HMM modelling a face . . . . . . . . . . 39

    5.3 Embedded HMM modelling a face . . . . . . . . . . . . . . . . . 41


    6.1 Passing of features from the feature domain to an HMM config-

    uration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    6.2 HMM configuration I topology . . . . . . . . . . . . . . . . . . 48

    6.3 Average of the DCT means of the ORL database . . . . . . . . 49

    6.4 Average of the DCT means of the XM2VTS database . . . . . . 50

    6.5 Average of the DCT-mod2 means of the ORL database . . . . . 50

    6.6 Average of the DCT-mod2 means of the XM2VTS database . . 51

    6.7 HMM configuration II topology . . . . . . . . . . . . . . . . . . 52

7.1 Wrong classifications on the ORL database using pixel values as features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    7.2 Examples of wrongly classified face images from the XM2VTS

    database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    7.3 Segmentation of an ORL face using DCT-mod2 features . . . . 63

    7.4 Mapping segmentation from the DCT-mod2 domain to the pixel

    domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

    7.5 Segmentation of a XM2VTS face using DCT-mod2 features . . 65

    8.1 Ultimate face classification system . . . . . . . . . . . . . . . . 73

    A.1 The Olivetti Research Laboratory, ORL database (1994) . . . . 74

    C.1 Segmentation of a XM2VTS face image using DCT-mod2 fea-

    tures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    C.2 Segmentation of a XM2VTS face image using DCT-mod2 fea-

    tures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

C.3 Segmentation of a XM2VTS face image using DCT-mod2 features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    C.4 Segmentation of a mystery face image using DCT-mod2 features 81


    List of Tables

    3.1 Database comparison . . . . . . . . . . . . . . . . . . . . . . . 18

    4.1 Comparisons of feature extraction methods . . . . . . . . . . . 34

    4.2 Classification accuracy on small scale . . . . . . . . . . . . . . 35

    6.1 Comparable partitioning of databases . . . . . . . . . . . . . . . 44

7.1 Summary of classification results: configuration I . . . . . . 54

7.2 Summary of classification results: configuration II . . . . . . 54

    7.3 Best classification results from literature . . . . . . . . . . . . . 56

    7.4 Our best classification results on the ORL database . . . . . . . 57

7.5 Summary of classification results: configuration I . . . . . . 58

7.6 Summary of classification results: configuration II . . . . . . 58

    7.7 Best classification results of Zhang et al. (2004) . . . . . . . . 59

    7.8 Our results using configuration II and DCT-mod2 feature extrac-

    tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59


    Nomenclature

Constants: π = 3,1415926535897932384626433832795

    Abbreviations:

    HMM Hidden Markov Model

    HMMs Hidden Markov Models

    AI Artificial Intelligence

    GMM Gaussian Mixture Model

    PCA Principal Component Analysis

    LDA Linear Discriminant Analysis

    EM Expectation Maximisation

    PDF Probability density function

    DCT Discrete Cosine Transform

    IDCT Inverse Discrete Cosine Transform

JPEG Joint Photographic Experts Group
AC Alternating Current

    DC Direct Current

    General Variables :

    x A vector x

    N Dimension


N(μ, Σ)   Gaussian pdf with mean (vector) μ and covariance (matrix) Σ

Variables referring to HMMs:

λ = {a, f}   An HMM λ with transition probabilities a and probability density functions f

f_i(x|S_t, λ)   Probability density function of state i, quantifying the similarity of a feature vector x to the state S_t = i given the model λ

X_1^T   Observation sequence from t = 1 to t = T


    Chapter 1

    Introduction

    1.1 Motivation

    In a world where security has become a very high priority and where there is

    no tolerance towards human error in this regard, computers and especially

    software have developed to such an extent, that they are able to distin-

    guish one human from another. Whether this is via fingerprint, voice or

    other physicalities, the uniqueness of each and every human is exploited to

    build robust computerised recognition systems which should in theory be

    more reliable and more cost effective than employing a person to do the

    same work. This thesis focuses on face recognition, especially using trained

    statistical models to distinguish between a variety of individuals. The pos-

    sibilities for applications are endless. Especially in an era of global paranoia

    in terms of personal safety, the high technology security field would be the

most exploitable for its application.

Figure 1.1: Information flow in recognising human faces (Sensor → Data → Recogniser → Result)

Recognising one human face from another is a process which happens


    sub-consciously in a human being. The flow of information in a typical

    recognition process is shown in figure 1.1. In a human system the process

    can be summarised as follows.

    Information is passed from the sensors to the recogniser

    In the recogniser a database of hundreds of thousands of faces is

    scanned in an instant and matched against the data obtained from

    the sensors

    The result is a recognition success or failure

    This system is highly effective in humans. One of the problems in copying

    this process for computerised applications is that we do not know how the

    human brain (in computer terms wetware) does the recognising. What

    features are extracted from the test data? How is the massive internal

    database scanned in a fraction of a second? These are all questions which

    remain largely unanswered even with current available technology.

To implement such a system, an artificial visual recogniser tries to simulate this process, so natural to humans each and every day. In such an

    artificial system, referring to figure 1.1, as sensor there is a camera of sorts,

    as recogniser some software implemented on some hardware, and finally

    specialised software where a decision is made as to whether a subject is

    recognised or not. The problem is that in the wetware human system, a

    face which is already in the database will almost certainly be recognised,

    but in an artificial system this is not the case. An artificial system must

be trained to recognise certain known features and it must also be designed to be robust in terms of eliminating background noise. In this respect the

    filtering capacity of the human brain is still an unrivalled technology.

    To summarise, the four basic problems in an artificial recogniser system

    are:

    Choosing robust features to interpret


    Choosing a model for the recogniser

    Running a classification experiment using the chosen model

    Interpreting the results

    The construction of a computerised recogniser can be seen as a special case

    of creating some form of artificial intelligence (AI). A computer system is

    set up to perform a task usually reserved for humans and therefore this

    exercise in modelling is also an investigation in the understanding of the

    human brain to a certain extent. Hopefully we can furthermore show that

    the AI can generate both consistent and satisfactory results.

    1.2 Background

    To recognise humans, three basic paths and one hybrid path could be fol-

    lowed namely:

    Chemical

    Audio

    Visual

    Hybrid

    It could be argued that the most effective recogniser is the chemical model.

It is, however, probably the most impractical, since humans tend to be sceptical about parting with a sample of their DNA! Advances in speech recogni-

    tion technology have shown such recognition systems to have substantial

    use. But a person about to be recognised must still be able or willing

    to speak. Security based applications could furthermore require a certain

    catch-phrase/language to be spoken. A visual recogniser is a subtle recog-

    niser; it can take a photo, process it and recognise a subject. All of these

    steps can be done in an instant and if necessary, undercover. This is one of


    the reasons why dependable face recognition technology is a very attractive

    proposition for security based applications. A hybrid recogniser combines

    one or more of the above techniques to improve recognition rates. Roughly

    stated, in choosing a recognition system a trade off between ease of imple-

    mentation and practicality exists.

    The usual problems that a face recognition system needs to solve are

    (Muller (2002)):

    Known/Unknown

    Classification

    Face verification

    Full identification

    In the first problem the system needs to identify whether a specific face

    belongs to some group of known faces. This is typically encountered in

    access control or security applications. Secondly classification is when a

    decision has to be made about the identity of a given face by assigning its

    identity to a group of known faces. This means that if there are a couple

    of faces of persons X, Y and Z in the known group, would the given face

    most likely be person X, Y or Z? With face verification the given face is

    claimed to be of identity X. The system needs to verify whether this is

    correct. Typically this is also used for security type applications. This can

    be viewed as a special case of the first problem. Full identification is used

    to determine whether a face is known and then to classify it. This is a

    combination of the first and second problems.

    This thesis investigates the classification problem. The first step in de-

    signing a face recognition system is choosing the model for the recogniser.

    Hidden Markov Models (HMMs) have proved to be quite a flexible statis-

    tical modelling tool for this purpose. In this thesis HMMs are investigated

    as a solution to the second of the listed four basic problems in artificial

    recognition systems. A brief overview on Hidden Markov Model theory is


    given later on, as well as why HMMs could form the basis of quite a robust

    recognition mechanism. To summarise the scope covered in this thesis, the

    following problems are addressed:

    Sensible preprocessing of face images in a given database

    Construction of a suitable HMM model to recognise the faces in the

    database

    Classification experiments

    Interpreting the results of a classification experiment

    Comparing the results to published results

    Segmentation of facial images

    The relevant concepts of this study are therefore the peculiarities of the

    available database, the modelling using HMMs and finally the achieved

    results and their interpretation.

    1.3 Literature synopsis

    Several approaches may be found in literature for face recognition without

    HMMs. These approaches are summarised and discussed in depth in Muller

    (2002). This thesis focuses on work done on recognising faces using HMMs.

    The most notable first efforts were made by Samaria & Young (1994). These

first HMMs used in face recognition had a straightforward topology as can be seen in figure 2.1. These HMMs typically had five states, each state

    modelling a specific area of a face image.

    Each state of such an HMM contains a single multivariate Gaussian dis-

    tribution as density function and pixel intensity values are used as feature

    vectors. A given image matrix of pixel intensity values is scanned in over-

    lapping blocks from the top of the image to the bottom to train the HMM.


    Satisfactory results were achieved but the flexibility of the HMM model

    allowed for further improvements.

    The seminal work in the field of HMM based face recognition is surely

    Samaria (1994). Here a left-to-right HMM is used to obtain segmentation

    information (or meaningful regions) of a given face. This segmentation in-

    formation could then be used to identify a face. The HMM has a pseudo two

    dimensional lattice of states each describing a distribution of feature vectors

    belonging to a certain area of the face as shown in figure 2.2. Each HMM

    has an end-of-line state with two possible transitions, either to the first

    state of its row or to the next row of states. The relevant database used in

    Samaria (1994) is the Olivetti Research Laboratory, ORL database (1994).

    This database consists of faces of 40 individuals, with 10 different images

    of each individual. The main feature of this database is that a picture of

    an individual contains mainly facial information and very little background.

    Background (noise) often transforms a seemingly great recognition system

    into quite an average one.

    Simultaneous efforts by Nefian & Hayes (1999) and Eickeler et al. (1999a)

    introduced an embedded HMM model which consisted of embedded states

    inside super states as shown in figure 2.3. This allowed for better transitions

    between states since the embedded HMMs proved to be tighter probabil-

    ity density functions than normal Gaussian distributions. Both furthermore

    showed that pixel intensity values do not form the most robust of features

and that using two dimensional DCT coefficients as features delivered bet-

    ter results. More recent developments on extending HMMs to be even more

    robust as recognising tool are discussed in Chapter 2.

    This study will aim to reconstruct most of these HMM based face recog-

    nition experiments, to verify their results and hopefully add some improve-

    ments.


    1.4 Objectives

    In any study of recognition the main goal or objective is achieving some or

    other high rate of recognition, in other words classifying accurately. The

    other lesser objectives all relate to this main one in being the stepping stones

in finding the ultimate result: perfect classification. The main goals of

    this study in face recognition can be summarised as follows:

    Investigating the use of HMMs as a face recognition tool

    Implementing a number of HMM topologies that could be used as face

    classifiers

    Evaluating the chosen HMM topologies as face classifiers against avail-

    able face databases

    Comparing the results of the HMM classifier against published sys-

    tems

It can be seen that all the objectives revolve around Hidden Markov Models and applying them to the rather uncustomary field of face recognition.

    Modelling with HMMs tends to be quite a flexible process and therefore

    a number of models can be constructed and tested as tools in order to

    accomplish the aforementioned main objective.

    1.5 Contributions

The available literature on HMMs used as a face recognition tool covers the main issues regarding this solution to the face recognition problem. There is

one aspect, though, which receives little attention at the level of detail: the choice of density functions inside

    the HMM states. We believe that this thesis deals with these details and

    in fact describes the process of selecting useful density function parameters

based on the available databases, thereby generating very good results.


    Another contribution deals with the question of what features to use, in

    other words, what preprocessing of images is necessary to obtain the best

    possible results. Furthermore, by using the segmentation of data provided

    by HMMs, we can extract faces from the background and locate facial

features, something very useful in computer vision based applications.

    Again summarising these contributions:

    Choosing density function parameters for HMM (embedded or not

    embedded) states and their peculiarities

    Training the HMMs with suitable features, i.e. feature extraction and

    noise elimination

    Designing the HMM topologies in accordance with the physicalities

    of the available database

    Segmentation of a face into meaningful regions

    1.6 Overview

    The focus of this thesis is the modelling of a face classification system using

    Hidden Markov Models. We start off with an overview of the available

    literature on face recognition using HMMs in Chapter 2 on page 11. This

    chapter emphasises the fact that there is not much available in published

    literature on Hidden Markov Models used in face recognition applications.

    Two basic HMM topologies namely an embedded HMM and a single top-

to-bottom HMM are mentioned in the literature. We implement both these models to test their value as face image classifiers.

    The focus of attention then moves on to the available databases used in

    the classification experiments in Chapter 3 on page 17. Both the databases

    we consider in this thesis have some interesting characteristics. Looking at

    typical image histograms (figures 3.3 and 3.5) it may be seen that back-

    ground noise and other factors should clearly be taken into consideration


    for at least the University of Surrey, XM2VTS database (2002). We cut out

    the bulk of the background in all of the images of the XM2VTS database

    to stop it from confusing the classifier. The other database we use, namely

the Olivetti Research Laboratory, ORL database (1994), is used as it is

    since the images in this database are already in a friendly format with

    very little variation between images and background noise to confuse the

    classifier.

    This leads in to Chapter 4 on page 24. As suggested by results obtained

    by previous systems in the available literature, we use features other than

    pixel intensity values. This is done mainly to improve classification accu-

    racy. Three feature extraction methods are implemented, focusing on the

    Discrete Cosine Transform (DCT) and why DCT coefficients form more ro-

    bust features for face recognition than pixel intensity values. Furthermore,

    the feature extraction technique known as DCT-mod2 is also discussed and

    how it could improve the robustness of the classifier. Our classification

    experiments using the DCT-mod2 coefficients give excellent results.

    With all the theory of the preprocessing in place, Chapter 5 on page 36

    then covers the theoretical modelling of the HMMs used in the face classifi-

    cation experiments. We decided to implement two HMM configurations, a

    normal top-to-bottom HMM modelling down the rows of an image and an

    embedded HMM with a vertical HMM containing horizontal HMMs as the

    probability density functions within its states. These two topologies were

    chosen as they are the most widely used in the available literature. It also

    provides a good comparison of what the extra complexity of an embedded

    HMM buys in terms of classification accuracy.

    With all the necessary modelling, motivation and theory in place, Chap-

    ter 6 on page 43 explains all the practical aspects concerning the implemen-

    tation of the face classification system. Here we show the detail on how

    the HMMs we use as classifiers, are constructed. Furthermore we show the

    specifics of training and scoring our classifier on pixel intensity values, DCT

    coefficients and DCT-mod2 coefficients.


    Finally the experiments conducted and results obtained are discussed

in Chapter 7 on page 53; the list of classification results on both databases

    is noted starting on page 54. Excellent results are achieved on both the

    databases we used in the experiments. The embedded HMM using DCT-

    mod2 features obtains the best classification results. It scores perfect clas-

    sification (100%) on the ORL database and on the complex XM2VTS

    database a classification score as high as 93.31% is recorded. These re-

    sults are furthermore shown to compare well against published systems.

    We furthermore show results of segmentations done on face images, as pro-

vided by the Viterbi algorithm. These segmentations show which areas of the face the embedded HMM models.

    The final section is Chapter 8 on page 67. There the conclusions of

    this thesis are encapsulated and recommendations are made for further

    improvements in possible future work. By using techniques such as LDA

    (Linear Discriminant Analysis) or KDA (Kernel Discriminant Analysis) we

    believe that the models we discuss in this thesis can be improved to be very

    robust recognisers.


    Chapter 2

    Literature study

    2.1 First efforts in face recognition

    It can be argued that the pioneering work in the field of face recognition

was done by Kirby & Sirovich (1990). The technique they proposed, commonly known as eigenfaces, is based on Principal Component Anal-

    ysis (PCA) and has been extended and optimised by various institutions

    and people to make it one of the most widely used current face recogni-

    tion techniques. This technique and other early methods (like elastic graph

    matching and linear discriminant analysis (LDA)) are discussed in Muller

    (2002) and Sanderson (2003). These first methods all used facial geometry

    and symmetry to classify faces.

    2.2 Hidden Markov Models enter the face

    recognition race

    The first efforts to use HMMs as a face recognition tool were made by

    Samaria & Young (1994). They introduced the HMM as quite a robust

    mechanism to deal with face recognition. The HMM used was a single left-

    to-right HMM as seen in figure 2.1 with each state modelling a specific facial

    region. Each state of this HMM contains a single multivariate Gaussian dis-


Figure 2.1: A one dimensional HMM for face recognition (five left-to-right states labelled Forehead, Eyes, Nose, Mouth and Chin, with self-transitions a11, ..., a55 and forward transitions a12, ..., a45)

    tribution as probability density function (pdf). This HMM is trained on a

    database of pictures, all of them read from top to bottom with each row of

    pixel intensity values used as feature vectors. This approach achieved bet-

    ter classification rates than a PCA based approach on the tested database.

    Another bonus of introducing HMMs is that it segments the face into mean-

    ingful regions which can also be used for other applications like facial gesture

    recognition. Follow-up work by the same author, Samaria (1994), extended

    the classic one dimensional left-to-right HMM to a pseudo two dimensional

    (pseudo-2D) one. This HMM had a pseudo two dimensional lattice of states

    each describing a distribution of feature vectors belonging to a certain area

    of the face as shown in figure 2.2. Each HMM had an end-of-line state with

    two possible transitions, either to the beginning state of its row or to the

    next row of states. In each state a multivariate Gaussian distribution was

    used to model the distribution of feature vectors relevant to that state. This

    approach was tested on the Olivetti Research Laboratory, ORL database

    (1994) and again it outperformed previous face recognition techniques at

    that time.
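As a concrete illustration of the one dimensional, top-to-bottom HMM just described (five states, a single Gaussian per state, image rows as feature vectors), the following is a minimal sketch using the open-source hmmlearn library. hmmlearn is an assumed stand-in, not the toolkit used in this thesis, and the training settings are illustrative only.

```python
# Sketch only: a 5-state left-to-right HMM with one Gaussian per state,
# trained on image rows, in the spirit of Samaria & Young (1994).
# hmmlearn is an assumed substitute for the thesis's own HMM toolkit.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_face_model(images, n_states=5):
    # Each grey-scale image (2-D array) is read top to bottom; every row
    # is one feature vector, so one image is one observation sequence.
    X = np.vstack([img.astype(float) for img in images])
    lengths = [img.shape[0] for img in images]

    # Left-to-right topology: start in state 1, allow only self-loops and
    # single forward transitions (zero entries stay zero during Baum-Welch).
    start = np.zeros(n_states); start[0] = 1.0
    trans = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        trans[i, i] = trans[i, i + 1] = 0.5
    trans[-1, -1] = 1.0

    model = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=25, init_params="mc", params="tmc")
    model.startprob_ = start
    model.transmat_ = trans
    model.fit(X, lengths)
    return model

def classify(image, models):
    # One model per known person; pick the highest log-likelihood.
    return max(models, key=lambda person: models[person].score(image.astype(float)))
```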

2.3 Extending the extensible

Simultaneous efforts by Nefian & Hayes (1999) and Eickeler et al. (1999a)

    introduced an embedded HMM which consisted of embedded states inside

    super states as shown in figure 2.3. Again, each of the top-to-bottom states

    models a specific facial region. This extended HMM model allows for better

    transitions between states since the embedded HMMs prove to be tighter

    probability density functions than normal Gaussian distributions. Both


    Figure 2.2: A one dimensional HMM with end-of-line states

    authors furthermore showed that pixel intensity values do not form the most

    robust of features and that using selected two dimensional discrete cosine

    transform (DCT) coefficients as features delivered better results. Perfect

    classification (100%) was obtained on the Olivetti Research Laboratory,

    ORL database (1994) using this technique and overall recognition speed

    increased because using only selected DCT features significantly compresses

    the data. The main problem in all the above mentioned techniques was that

    they were tested on a database which consisted of pictures with very little

    background (see figure 3.1). The modelling can therefore be done very

    accurately and the HMMs can be fine tuned to deliver remarkable results.

    In a practical system this step would only be possible if faces could be


Figure 2.3: An embedded HMM for face recognition (super states labelled Forehead, Eyes, Nose, Mouth and Chin)

    identified from pictures and then preprocessed to form a background free

    image for the HMMs to classify.

2.4 The latest HMM flavours used in face recognition

    Hidden Markov Models have traditionally been used to model time depen-

    dent data. For this use they have been fine tuned and thorough research has

    already been done on the subject, especially concerning what features to

    use (for example cepstra features in automatic speech recognition systems).


    In image processing, HMMs are quite a new addition to the fold of well es-

    tablished techniques and therefore extracting robust features is still one of

    the major areas for future development. Some novel new feature extraction

    techniques are discussed in Sanderson (2003). One of these techniques is

    the DCT-mod2 approach, which we included as a feature extraction method

    in this thesis. The DCT-mod2 feature extraction method could be seen as

a form of delta-coefficient extraction. This method shows considerable potential, especially in keeping the recogniser robust when illumination changes occur.

    As far as we could establish, DCT-mod2 features have not previously been

    used in HMM based classifiers. Consensus, it seems, has been reached that

    DCT based feature extraction methods are probably the most effective.

    Other advanced efforts were made by Muller et al. (2002) where they

    proposed a triple embedded HMM based model to recognise facial ex-

    pressions. It is also worthwhile mentioning the HMM recogniser used by

M. Bicego et al. (2003), where the authors propose wavelet coding as a feature extraction method. Using wavelets as features achieves the same perfect

    classification score on the ORL database. In Othman & Aboulnasr (2003)

    the authors proposed an HMM with an extended two dimensional structure

    to use as a recogniser. This means that all states allow both vertical and

    horizontal transitions. Again DCT coefficients were used as features which

    underlined the trend to move away from pixel intensities when choosing

    feature vectors. They also achieved remarkable results but again it was on

    the ORL database.

    Another improved HMM-based recogniser was proposed by Eickeler et al.

    (1999b) using JPEG format features. What makes this technique useful is

    that it can recognise faces directly from the JPEG format compressed data

    and it is therefore an improvement speed-wise on previous efforts. This

    method also underlines the fact that DCT-based features are used to sup-

    press the sensitivity to changes in light intensity.


    2.5 Summary

    The literature provides a summary of previous HMM-based classifiers. The

HMM topology most widely used seems to be the basic top-to-bottom HMM

    modelling down the rows of an image as first proposed by Samaria & Young

    (1994). The extension of this model to an embedded HMM (by Nefian

    & Hayes (1999) and Eickeler et al. (1999a)) shows a lot of promise as a

    possibly robust classifier. Furthermore, spectral domain feature extraction

    techniques are widely used in published systems to improve the robustness

of a classifier. In this thesis we reconstruct and improve the top-to-bottom and embedded HMMs. In training these models we also use specific spectral

    domain features (DCT coefficients, as proposed in the literature) to improve

    the classification accuracy of the HMMs.


    Chapter 3

    Face databases and their

    peculiarities

    3.1 Introduction

    The results of any classification experiment should always be seen in the

context of the database of face images that the classifier involved has been trained and tested on. Such a database could be characterised by the fol-

    lowing properties:

    Format of the pictures (i.e. file type, size, grey scale/colour)

    Number of persons in the database

    Number of images per person

    Variations in lighting conditions between images

Variations in individuals' features between images

    Amount of background in a picture

    These properties all play some part in either confusing or helping the clas-

    sifier to classify the faces in the database. In our experiments we use the

    University of Surrey, XM2VTS database (2002) and the Olivetti Research


    Laboratory, ORL database (1994). These databases differ in all the prop-

    erties mentioned above, so we list the differences in table 3.1. In order to

                         ORL database       XM2VTS database
Format                   Grey scale .pgm    RGB .tiff
Image size               112x92             576x720
Persons in database      40                 295
Images per person        10                 8
Total images             400                2360
Light variation          Slight             Slight
Percentage background                       40% of image
Background uniformity    Uniform black      Non-uniform blue

Table 3.1: Database comparison

    fully understand table 3.1, see figure 3.1 for samples of the ORL database

    and figure 3.2 for samples of the XM2VTS database. For the purposes of

    this thesis the XM2VTS database images were resized to be 288x360, which

corresponds to a scaling of 1/2

    on the rows and columns. These images were

also converted from RGB¹ to grey scale. Finally a window of 236x144 pixels

    was cut out, trying to capture as much of the face as possible. The ORL

    database pictures were already in a friendly format since the pictures were

all cropped around the faces they represented, reducing confusion caused by

    background.

    The differences between these two databases provided a good test to

    show the robustness of our methods.

¹Colour pictures are represented by three pictures, each corresponding to the red,

    green or blue (three primary colours) values of the pixels.
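A minimal sketch of the preprocessing just described (halving the image size, converting to grey scale and cutting out a 236x144 window), assuming Pillow and NumPy; the crop offsets below are illustrative placeholders, as the exact window position is not stated in the text.

```python
# Sketch of the XM2VTS preprocessing described above. Pillow/NumPy and the
# crop offsets are assumptions; only the image sizes come from the text.
import numpy as np
from PIL import Image

def preprocess_xm2vts(path, top=40, left=100):
    img = Image.open(path).convert("L")      # RGB .tiff -> grey scale
    img = img.resize((360, 288))             # 288x360 (rows x cols); PIL takes (width, height)
    face = img.crop((left, top, left + 144, top + 236))  # 236x144 window
    return np.asarray(face, dtype=float)     # 236 rows x 144 columns
```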


    Figure 3.1: Examples of pictures from the ORL database

    3.2 Possible issues

    3.2.1 The XM2VTS database

In order for our classifier to perform well on both databases, we need to investigate any possible issues that could be encountered when testing our

    classifier on these databases. The XM2VTS database is an extensive frontal

    face database containing images of 295 individuals (8 images each). This

    database was mainly constructed with face verification in mind and es-

    tablished a testing protocol to ensure that different institutions compare

    equivalent results. This protocol is known as the Lausanne protocol. For a


    Figure 3.2: Examples of pictures from the University of Surrey XM2VTS

    database

    comprehensive discussion on the particulars of the XM2VTS database see

    Messer et al. (1999). The main issue that arises when using this database

    is to classify faces against the large amount of background that exists in

    the images. This database was acquired over a period of five months, with

    acquisition sessions spaced over one month intervals. The fact that the ses-

    sions were spaced a month apart means that background detail also differs

    in different images.

    We focus on figure 3.2, and specifically on the sample face at the bottom

    left of this image. It can be seen that the background takes up a high

    percentage of the pixels of the picture. When referring to the histogram

    of the sample face image (see figure 3.3), this problem becomes even more

    evident. Most of the pixel values lie in and around the value of 50. Because

    HMMs are powerful modelling tools, they tend to model on the non-uniform

    background rather than the facial data purely because the background takes

    up so much of the data.

    Three possible solutions exist to overcome this problem. The first so-


Figure 3.3: Histogram of pixel intensities of bottom left image of figure 3.2

    lution is to adapt our model by carefully choosing the probability density

    functions (pdfs) and the features to extract. This probably represents the

most scientifically correct solution. A second possible solution is to crop all the pictures so that they consist mainly of the facial data. Automatic

    procedures to do this exist but we manually extracted the faces for our final

    experiments. The third possible solution is to normalise or transform the

    images in some or other way and then use the feature extraction methods

    as described in chapter 4.

    Another feature of the XM2VTS database is the way in which lighting

    and background as well as personal features (glasses, hair etc.) vary between

images belonging to the same class. One of the more extreme cases is presented in figure 3.4.²

²The colour images have been presented as they better highlight the subtle differences

    between images.


    Figure 3.4: Example of differences between images of the same class in the

    XM2VTS database

    3.2.2 The ORL database

    This database consists of images of 40 people, with 10 images per person.

    An image of the complete database (400 faces) is given in appendix A. The

    persons captured in this database are aged between 18 and 81. There are 4

    female and 36 male subjects, with each image containing a different facial

    expression. For most of the images light conditions differ but all of the

    images are set against a uniform black background. All of the images are

    cropped to consist of mostly facial data with very little background. The

    varying conditions of light and expressions but limited background, makes

    this database ideal for controlled face classification experiments. Take for

instance the top left sample face in figure 3.1 and that image's histogram as

    shown in figure 3.5. When comparing this histogram with the one presented

    in the previous section, it may be seen that it should be easier to model

    on this database because the pixel values are more evenly spread without

    extremities at specific pixel values.


Figure 3.5: Histogram of pixel intensities of top left image of figure 3.1 (x-axis: pixel intensity on the grey scale; y-axis: number of pixels)
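For reference, histograms such as figures 3.3 and 3.5 can be reproduced with a short sketch like the one below; NumPy arrays and Matplotlib are assumed, as the thesis does not name its plotting tools.

```python
# Sketch: plot the intensity histogram of a grey-scale image (values 0..255),
# as in figures 3.3 and 3.5. Matplotlib is an assumed choice of tool.
import matplotlib.pyplot as plt

def plot_intensity_histogram(image):
    plt.hist(image.ravel(), bins=256, range=(0, 255))
    plt.xlabel("Pixel intensity on the grey scale")
    plt.ylabel("Number of pixels")
    plt.show()
```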

    3.3 Summary

The characteristics of both databases have been established. A controlled effort can therefore be made to extract robust features for the classification

    experiments. The next chapter deals with feature extraction and how it

    is necessary to develop a way to overcome the difficulties, especially those

    presented by the complex images in the XM2VTS database.


    Chapter 4

    Feature extraction methods

    4.1 To feature or not to feature, that is the

    question

    Our model operates on features extracted from the images. With the issues

    surrounding databases as described in the previous chapter, these features

    should be chosen in such a way as to ensure a separation between individ-

    uals. The extraction of features concerns the passing on of object data in

    a specific format and size to some model, mainly for the purpose of recog-

    nising the object. Referring to figure 1.1, feature extraction is the step

    in between the sensor and the recogniser. Humans are restricted to fea-

    tures based on the five senses. Therefore, using the eyes, the frequencies

    (colour) and the intensity of light are the only features from which objects

    can be identified. In training an artificial recogniser, the features seen by

    the recognition models can be manipulated. Specifically in this thesis, the

    main question concerning feature extraction that arises is: what numerical

values are needed to train the HMM based classifier effectively? The

    identification of these values is the basis of the feature extraction problem.

    The following features were investigated and specifically used to train

    the HMMs used in the face classification experiments:


    Pixel intensity values

    Discrete Cosine Transform (DCT) coefficients

    DCT-mod2 coefficients

    Pixel intensity values are the raw data representing an image. In a grey

    format they typically vary in value from 0 to 255. DCT coefficients are

    obtained by applying the two dimensional DCT to blocks of a given image.

    The DCT-mod2 coefficients are extended DCT based features as proposed

    by Sanderson & Paliwal (2002). As far as could be ascertained DCT-mod2

    based feature extraction in HMM based face classification has not been in-

    vestigated before. The following sections deal with the in-depth explanation

    of the method behind each of these feature extraction techniques and their

    advantages or disadvantages when used in the classification of face images.

    4.2 Pixel intensities

Pixel intensity values are numerical values of light intensity on a specific scale and are used to store pictures digitally. For instance, say that a

    grey-scale digital photo is taken of the face of a human at a resolution of

    720x576. This makes it possible to store a matrix (with 720 columns and

    576 rows) of light intensity values on a computer. The grey scale implies

    that the intensity values are integers representing shades of grey ranging

    from 0 (black) to 255 (white). The following example illustrates grey scale

    pixel values: Assume we have a matrix of pixel intensity values (matrix A)

    representing an image (figure 4.1):

A = [   2  255    2  255
       10  200  200  100
       50  100   50    2
        2   50  150  200 ]

Storing pictures in this raw format wastes space, so one of the many available

    compression routines is used instead. These pixel values do however repre-


    Figure 4.1: Enlarged grey scale picture of matrix A

    sent features that can be used to train the HMM topologies discussed in this

thesis and satisfactory face classification results are obtained. The problem, however, is that many features have to be kept; therefore the training and

    scoring of models becomes computationally expensive. If we wanted to

    classify the image represented by matrix A for some or other reason using

    an HMM based classifier, the image could be scanned from top to bottom

    with each row forming a single feature vector. The complete observation

    sequence is therefore the four rows of this matrix. A histogram (figure 4.2)

    of the pixel intensities can be drawn. As was shown in the previous chapter,

    typical histograms of facial images in the available databases (see figures

    3.3 and 3.5) show how face data and background, which can be regarded as

    noise, are embedded in the features (pixel intensity values). This is one of

    the reasons why pixel intensity values are not the best features to use. For

    robust face classification we want features to be decorrelated in some way

    so we can model a face image as distinctly as possible.

    To summarise the advantages of pixel intensity values as features: they

    are easy to obtain and they have the same dimensions as the image data.


Figure 4.2: Histogram of matrix A containing grey scale values (x-axis: grey scale value; y-axis: number of pixels)

    The disadvantages of pixel values are: they tend to be sensitive to image

    noise as well as image rotations or shifts, and changes in illumination. They

    furthermore induce large dimensions on observation vectors. This causes

    any complex algorithm to take an unacceptably long time to complete.
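As a small illustration of the observation sequence described in this section, the sketch below scans the example matrix A from top to bottom, each row becoming one feature vector.

```python
# Sketch: the 4x4 example matrix A scanned top to bottom, each row forming
# one feature vector of the observation sequence.
import numpy as np

A = np.array([[  2, 255,   2, 255],
              [ 10, 200, 200, 100],
              [ 50, 100,  50,   2],
              [  2,  50, 150, 200]], dtype=float)

observation_sequence = [A[i, :] for i in range(A.shape[0])]  # four row vectors
```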

    4.3 An introduction to the Discrete Cosine

    Transform

    Compressing data is essential in both biological and signal processing ap-

    plications. Even in human vision the light signals received by the approx-

    imately 130 million photo-receptors (see Steven W. Smith (1999) for more

    details) on retinal level in the eye, are sent to the brain for compression

    and processing. By the time these signals arrive at the higher centres of the

    brain, they convey magnitude (contrast), phase and frequency, which are all

principal attributes of Fourier analysis. Especially in the image processing

    community the two dimensional DCT has been used as a data compres-


    sion tool. The two dimensional DCT forms the basis of the JPEG (Joint

    Photographic Expert Group) image compression standard. It is important

    to note that we will henceforth be referring to the two dimensional version

    of the DCT only as the DCT. The original DCT is mainly used in one

    dimensional applications (i.e. not image processing).

    4.4 The Discrete Cosine Transform

4.4.1 Motivation and the case of the missing sine coefficients

    In general, to obtain the frequency representation of a two dimensional sig-

    nal the Fourier Transform is used and specifically the FFT (Fast Fourier

    Transform) algorithm. The Fourier theorem specifies that any signal can

    be represented as a weighted sum of even and odd sinusoidal terms. The

    DCT is a transform very much like the Fourier Transform, but with the

    DCT a signal is represented only by the even sinusoidal terms (hence nam-

    ing it a cosine transform). Representing image information in terms of the

    DCT rather than with the FFT has the important advantage that DCT

    coefficients are always real valued. The DCT also delivers better energy

    compression and the coefficients are nearly uncorrelated (Eickeler et al.

    (1999a)). Having nearly uncorrelated coefficients makes the DCT very at-

tractive in terms of image processing. It means, for instance, that in the application of face recognition DCT features will be less sensitive to

    changes in image illumination. In general, the two dimensional DCT of a

    MxN matrix F is defined as follows:

\[
C(u, v) = \alpha(u)\,\alpha(v) \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} F(i, j)\,
\cos\!\left[\frac{(2i + 1)u\pi}{2M}\right] \cos\!\left[\frac{(2j + 1)v\pi}{2N}\right] \tag{4.4.1}
\]

where

\[
\alpha(u) = \begin{cases} \sqrt{1/M} & u = 0 \\ \sqrt{2/M} & u > 0 \end{cases},
\qquad
\alpha(v) = \begin{cases} \sqrt{1/N} & v = 0 \\ \sqrt{2/N} & v > 0 \end{cases}
\]

and

\[
0 \le u \le M - 1, \qquad 0 \le v \le N - 1.
\]

    From equation 4.4.1 a DCT coefficient matrix can be constructed. These

    coefficients represent the energy contribution by different frequencies. The

    first coefficient (C(0, 0)) represents the DC component or the average

    value of the MxN block. The rest of the coefficients represent the different

    AC components, as contributed by each of the frequencies present.
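To make the definition concrete, the following is a direct, unoptimised sketch of equation 4.4.1; it assumes the orthonormal scaling written above, which is also what fast library routines (for example SciPy's DCT with norm='ortho') compute.

```python
# Sketch: a literal, brute-force implementation of the 2-D DCT of equation
# 4.4.1, meant only to illustrate the definition.
import numpy as np

def dct2(F):
    M, N = F.shape
    C = np.zeros((M, N))
    for u in range(M):
        for v in range(N):
            a_u = np.sqrt(1.0 / M) if u == 0 else np.sqrt(2.0 / M)
            a_v = np.sqrt(1.0 / N) if v == 0 else np.sqrt(2.0 / N)
            s = 0.0
            for i in range(M):
                for j in range(N):
                    s += (F[i, j]
                          * np.cos((2 * i + 1) * u * np.pi / (2 * M))
                          * np.cos((2 * j + 1) * v * np.pi / (2 * N)))
            C[u, v] = a_u * a_v * s
    return C  # C[0, 0] is the DC term, the remaining entries are the AC terms
```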

    For the subsequent discussion refer to the sample image of a person

    (figure 4.3, scaled to two thirds the size) taken from the University of Sur-

    rey, XM2VTS database (2002). The main advantage of the DCT is that it

    Figure 4.3: Example face from the University of Surrey, XM2VTS database

    (2002)

    compresses data. This compression property of the DCT allows a block of


Figure 4.4: Ordering of DCT coefficients for N=M=4 (the zig-zag index K runs from 0 to 15 over the 4x4 grid of (u, v) positions)

    pixels to be represented by just a few DCT coefficients and it is therefore

    possible to work with less features, and still obtain more information than

    would be present when using the larger number of pixel values. In order

    to extract the coefficients which contain the most data about the block

    of data transformed, the DCT coefficient matrix needs to be scanned in

    a zig-zag pattern as shown in figure 4.4. This is because the contributing frequencies are arranged from low to high, as indicated by the zig-zag pattern represented by increasing K.

    Figure 4.4: Ordering of DCT coefficients for N=M=4

    To show these compression properties

    the first 10x10 (a compression of approximately 4000 times), 50x50 (approximately 160 times), 100x100 (approximately 40 times) and 200x200 (approximately 10 times) coefficients were extracted from figure 4.3 and run through the inverse transform (IDCT) to obtain approximated images. See figure 4.5 for the approximations of the face image.¹ It can be seen that the DCT

    provides suitable data compression and for this reason alone it should be

    considered when constructing features used in face recognition.

    ¹ This example shows the compression capabilities of the DCT and should not be

    confused with the JPEG compression standard, in which the DCT is used, but not in

    this manner.


    Figure 4.5: Reconstructions of figure 4.3 using DCT coefficients
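    To make the reconstruction experiment above easy to reproduce, the following minimal Python sketch (an illustration only, not the code used for this thesis) keeps the top-left K x K block of 2-D DCT coefficients of a grey-scale image and inverts the transform; scipy's dctn/idctn with the orthonormal convention correspond to equation 4.4.1. A 720x576 random array is used purely as a stand-in for the face image; any grey-scale image will do.

        import numpy as np
        from scipy.fft import dctn, idctn

        def dct_truncate(image: np.ndarray, k: int) -> np.ndarray:
            """Reconstruct an image from only its top-left k x k 2-D DCT coefficients."""
            coeffs = dctn(image.astype(float), norm="ortho")   # full 2-D DCT (equation 4.4.1)
            kept = np.zeros_like(coeffs)
            kept[:k, :k] = coeffs[:k, :k]                      # low-frequency coefficients only
            return idctn(kept, norm="ortho")                   # inverse transform (IDCT)

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            face = rng.integers(0, 256, size=(720, 576)).astype(float)   # stand-in image
            for k in (10, 50, 100, 200):
                approx = dct_truncate(face, k)
                rms = np.sqrt(np.mean((face - approx) ** 2))
                print(f"K={k:3d}: compression factor {face.size / k**2:6.0f}, RMS error {rms:.1f}")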

    4.4.2 Feature extraction using the DCT

    In this thesis the selection of suitable DCT coefficients from pictures in

    the available databases (see figures 3.1 and 3.2) was evaluated as a feature extraction method. For this method of feature extraction a sliding window

    of 8x8 pixels was scanned over a picture with the standard overlap of 50% in

    both the horizontal and vertical directions. For each window of 8x8 pixels,

    a DCT coefficient matrix of the same size was obtained. This means that

    for an image of Y rows and X columns there are

        N_D = \left(\frac{2Y}{N} - 1\right)\left(\frac{2X}{N} - 1\right)   (4.4.2)

    number of 8x8 DCT coefficient blocks (with N = 8 being the size of the window). These DCT coefficient blocks are then reduced by keeping their

    first 15 coefficients (as suggested by experiments of Sanderson (2003)) by

    following the zig-zag pattern described earlier. Thus every 64 values are

    reduced to L = 15 values and a single observation used to represent the

    data of block (b, a) is now the vector:

        \mathbf{x} = \left[c^{(b,a)}_0 \ c^{(b,a)}_1 \ c^{(b,a)}_2 \ \cdots \ c^{(b,a)}_{L-1}\right]^T   (4.4.3)


    A complete observation sequence is obtained consisting of N_D of these vectors. Specifically, for the two databases used, the images were of size 112x92 and 236x144.² This means we have observation sequences of sizes N_D = 594 blocks and N_D = 2030 blocks respectively.
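    A minimal Python sketch of this feature extraction step is given below (purely for illustration, not the thesis implementation; the window size, 50% overlap and zig-zag truncation follow the description above, although the exact traversal direction of figure 4.4 is an assumption). On a 112x92 stand-in image it produces the 594 blocks of 15 coefficients predicted by equation 4.4.2.

        import numpy as np
        from scipy.fft import dctn

        def zigzag_order(n: int):
            """(row, col) indices of an n x n block in zig-zag order, low to high frequency."""
            return sorted(((i, j) for i in range(n) for j in range(n)),
                          key=lambda ij: (ij[0] + ij[1],
                                          ij[0] if (ij[0] + ij[1]) % 2 else ij[1]))

        def dct_features(image: np.ndarray, n: int = 8, keep: int = 15) -> np.ndarray:
            """Slide an n x n window with 50% overlap and keep the first `keep`
            zig-zag DCT coefficients of every block (equations 4.4.2 and 4.4.3)."""
            rows, cols = [list(range(0, dim - n + 1, n // 2)) for dim in image.shape]
            order = zigzag_order(n)[:keep]
            feats = []
            for b in rows:
                for a in cols:
                    block = dctn(image[b:b + n, a:a + n].astype(float), norm="ortho")
                    feats.append([block[i, j] for i, j in order])
            return np.array(feats)                      # shape: (N_D, keep)

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            img = rng.integers(0, 256, size=(112, 92))  # ORL-sized stand-in image
            print(dct_features(img).shape)              # (594, 15), matching N_D = 594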

    4.5 Giving DCT features an extra boost of robustness

    In Sanderson & Paliwal (2002) a novel way of adding more robustness to the DCT is introduced. This method of feature extraction is based on poly-

    nomial coefficients, also known as deltas. In speech recognition applications

    an analogue to this method of feature extraction has proved very successful

    in eliminating background noise and channel mismatch. Images however

    consist inherently of two dimensional signals and therefore we have to re-

    define these coefficients. As proposed in Sanderson & Paliwal (2002) we

    will name this new method of feature extraction DCT-mod2. For images

    we now define the n-th horizontal delta coefficient for a block located at

    (b, a) as a modified first order orthogonal polynomial coefficient (Sanderson

    & Paliwal (2002)):

        \Delta^h c^{(b,a)}_n = \frac{\sum_{k=-K}^{K} k\, h_k\, c^{(b,a+k)}_n}{\sum_{k=-K}^{K} h_k\, k^2}   (4.5.1)

    Similarly, the n-th vertical delta coefficient is defined as:

        \Delta^v c^{(b,a)}_n = \frac{\sum_{k=-K}^{K} k\, h_k\, c^{(b+k,a)}_n}{\sum_{k=-K}^{K} h_k\, k^2}   (4.5.2)

    where h is a (2K+1)-dimensional symmetric window vector and c_n is the n-th DCT coefficient of a block located at (b, a). For our purposes we let K = 1 and h = [1 1 1]^T be a rectangular window. To illustrate the advantage of

    ² It is important to note that when speaking of the size of an image the customary format is (number of rows) x (number of columns), but the resolution of an image is written the other way around.


    using these modified delta features, assume we have three consecutive blocks

    X, Y and Z, as explained in Sanderson & Paliwal (2002). Let us assume

    that each block contains an information component and a noise component,

    say X = X_I + X_N, Y = Y_I + Y_N and Z = Z_I + Z_N. Let us assume that each block is corrupted by the same noise, therefore X_N = Y_N = Z_N. This

    is a reasonable assumption to make if the blocks are small and close to each

    other or if these blocks are neighbours as the result of overlapping used in

    the sampling process. The deltas for block Y can now be computed using

    equations 4.5.1 and 4.5.2:

        \Delta^h Y = \frac{1}{2}(-X + Z) = \frac{1}{2}(-X_I - X_N + Z_I + Z_N) = \frac{1}{2}(Z_I - X_I)   (4.5.3)

    and

        \Delta^v Y = \frac{1}{2}(-X + Z) = \frac{1}{2}(-X_I - X_N + Z_I + Z_N) = \frac{1}{2}(Z_I - X_I)   (4.5.4)

    and the noise component is removed. We now modify our DCT feature

    vector by replacing the first three coefficients by their horizontal and vertical

    deltas and form a feature vector representing a given block at (b, a) as a

    new vector:

        \mathbf{x} = \left[\Delta^h c_0 \ \Delta^v c_0 \ \Delta^h c_1 \ \Delta^v c_1 \ \Delta^h c_2 \ \Delta^v c_2 \ c_3 \ c_4 \ \cdots \ c_{L-1}\right]^T   (4.5.5)

    where the (b, a) indication was left out to maintain clarity and L = 15. The

    first three coefficients represent the most information held in the block and

    therefore to limit the size of the features they are replaced with their delta

    coefficients. A block of coefficients taken on the edges of the picture will not

    have a neighbouring block on the one side, so when using the DCT-mod2


    approach we end up with

        N_{D2} = \left(\frac{2Y}{N} - 3\right)\left(\frac{2X}{N} - 3\right)   (4.5.6)

    blocks. This gives observation sequences of sizes N_{D2} = 500 blocks and N_{D2} = 1848 blocks respectively.
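    The DCT-mod2 construction can be sketched directly from equations 4.5.1, 4.5.2 and 4.5.5. The following Python fragment (an illustration under the stated choices K = 1 and h = [1 1 1]^T, not the thesis implementation) assumes the per-block DCT coefficient vectors are already arranged on their spatial grid, discards the border blocks and replaces c0, c1 and c2 by their horizontal and vertical deltas.

        import numpy as np

        def dct_mod2(blocks: np.ndarray) -> np.ndarray:
            """DCT-mod2 features with K=1 and h=[1 1 1]: replace c0, c1, c2 of every
            interior block by horizontal and vertical deltas (eqs. 4.5.1, 4.5.2, 4.5.5).

            blocks : (rows, cols, L) grid of DCT coefficient vectors
            returns: (N_D2, L + 3) feature matrix, N_D2 = (rows-2)*(cols-2)
            """
            c = blocks.astype(float)
            dh = (c[1:-1, 2:, :3] - c[1:-1, :-2, :3]) / 2.0   # (c(b,a+1) - c(b,a-1)) / 2
            dv = (c[2:, 1:-1, :3] - c[:-2, 1:-1, :3]) / 2.0   # (c(b+1,a) - c(b-1,a)) / 2
            rest = c[1:-1, 1:-1, 3:]                          # untouched c3 ... c_{L-1}
            deltas = np.empty(dh.shape[:2] + (6,))
            deltas[..., 0::2] = dh                            # [dh c0, dv c0, dh c1, dv c1, dh c2, dv c2]
            deltas[..., 1::2] = dv
            feats = np.concatenate([deltas, rest], axis=-1)
            return feats.reshape(-1, feats.shape[-1])

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            grid = rng.normal(size=(27, 22, 15))   # DCT blocks of a 112x92 image (27 x 22 grid, L = 15)
            print(dct_mod2(grid).shape)            # (500, 18): N_D2 = 500 blocks of dimension 18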

    4.6 Comparison of methods and summary

    To summarise, in general any method of feature extraction has certain characteristics which need to be taken into account when constructing an artificial recogniser. The three feature extraction methods discussed are characterised in table 4.1.

                        Pixel intensities   DCT             DCT-mod2
      Preprocessing     None                N_D 2-D DCTs    N_D 2-D DCTs and N_{D2} linear operations
      Dimensionality    Large               Small           Small
      Robustness        None                Very            Most

    Table 4.1: Comparisons of feature extraction methods

    When training HMMs to recognise faces, it is

    desirable to speed up the process without sacrificing accuracy. By using

    the two DCT-based feature extraction methods, we improved the speed of

    our system (because of the fewer dimensions of the observation sequences).

    Furthermore, our system becomes robust to changes in illumination, something that is inherent in any picture. To briefly illustrate the value of the above feature extraction methods, a small classification experiment

    was run on the first 8 individuals (using 4 images of each) in the University

    of Surrey, XM2VTS database (2002) using each of the feature extraction

    methods. The leave-one-out method of training/scoring was used, with

    HMM configuration II (see chapter 6 for details of this configuration; it

    is a configured embedded HMM). The results we obtained from this mini


    experiment are summarised in table 4.2.

                      Recognition accuracy   Wrong classifications
      Pixel values    84.38%                 5 faces
      DCT             90.63%                 3 faces
      DCT-mod2        100.0%                 0 faces

    Table 4.2: Classification accuracy on small scale

    The full results achieved on both

    evaluated databases and using all three feature extraction methods are listed

    and discussed in chapter 7. We see from this discussion on feature extraction

    techniques that we need to give our HMM classifier as much information as

    possible about an image, while wasting as little space as possible. The next chap-

    ter deals with the foundation of this thesis on face recognition, namely the

    construction of the specialised HMMs used in the classification experiments.


    Chapter 5

    Constructing the Hidden Markov Models

    5.1 A brief introduction to HMMs

    Hidden Markov Model theory forms the background of the industry stan-

    dard in speech recognition based applications. HMMs tend to be robust recognisers with extreme flexibility in terms of parameters. These characteristics led us to believe that HMMs might be suitable for image recognition and, as this thesis shows, this is in fact the case. An in depth

    discussion on HMMs is deferred to the many excellent references on the

    topic, one being by Rabiner & Juang (1986). The purpose of this chapter

    is to introduce our application specific HMMs and show how an expan-

    sion on conventional one-dimensional HMM theory will suit our inherently

    two-dimensional application.

    5.2 HMM background

    We now introduce the notation and mathematical descriptions (regarding

    HMMs) necessary to illustrate subsequent discussions on our face recogni-

    tion model.


    5.2.1 Topology and notation

    Over the years of research in pattern recognition quite a number of HMM

    topologies and configurations have seen the light of day, as mentioned in du Preez

    (1997). The standard topology we are concerned with is the non-ergodic,

    left-to-right Hidden Markov Model as in figure 5.1. The reason this specific

    model was chosen is that the human face can naturally be divided into segments common to every human (eyes, nose, mouth, chin etc.), and these features appear in the same order.

    Figure 5.1: Standard left-to-right, non-ergodic HMM

    A Hidden Markov Model is defined as a set of N emitting states as well as an initial and an end-of-line state (these states

    are so-called null-states), so we end up with N + 2 states. The expression

    St = i will indicate the occurrence of state i at time t. The time indices

    run from t = 1 to t = T, where T is the length of the observation sequence

    X = [x_1 x_2 x_3 ... x_T] to be matched to the HMM. The states are coupled by transitions, with a_ij denoting the state transition probability (the subscripts indicating the two states involved) and a_ii referring to the self-

    loop probability. The first null-state has a transition probability of 1 and

    no self-loop probability. The last null-state has no emitting probabilities; it is the termination state. Each emitting state has an associated probability density function (pdf) described as f_i(x|S_t, λ). This pdf quantifies the similarity of a feature vector x_t from the observation sequence to the state

    S_t = i. It is important to note that no time step is needed to enter the first null-state; the process will already occupy that state. Using the common shorthand notation, a single left-to-right HMM can now be described as λ = {a, f}. Introducing the null-states effectively cancels the need for defining an initial value, often denoted by π, in most of the literature on

    HMM theory.
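    The topology just described can be written down compactly. The sketch below (Python, illustrative only) builds the transition matrix of a left-to-right HMM with N emitting states plus the two null-states: the entry null-state moves to the first emitting state with probability 1, and each emitting state carries a self-loop a_ii and a forward transition a_{i,i+1}. The value 0.5 matches the initialisation described later in section 5.3.1.

        import numpy as np

        def left_to_right_transitions(n_states: int, self_loop: float = 0.5) -> np.ndarray:
            """Transition matrix for a left-to-right HMM with N emitting states plus
            an initial and a terminating null-state (rows/columns 0 and N+1)."""
            n = n_states + 2
            a = np.zeros((n, n))
            a[0, 1] = 1.0                      # initial null-state enters state 1 with probability 1
            for i in range(1, n_states + 1):
                a[i, i] = self_loop            # a_ii: self-loop
                a[i, i + 1] = 1.0 - self_loop  # a_{i,i+1}: move on (the last move terminates)
            return a

        if __name__ == "__main__":
            a = left_to_right_transitions(5)
            print(a)
            print(a[1:-1].sum(axis=1))         # each emitting state's outgoing probabilities sum to 1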

    In order to train an HMM we need to quantify a few probabilities. The

    match between an observation sequence X_1^T and the model λ can be expressed in terms of the likelihood f(X_1^T | λ). The calculation of this likelihood is often known as the evaluation problem. A possible solution to this problem is to enumerate all possible sequences of states S_0^{T+1}, determine the value of f(X_1^T, S_0^{T+1} | λ) for each, and then determine the marginal pdf

    by summing over all of them. A more efficient approach is the forward-

    backward procedure, described in appendix B. We approximate this by the

    well known Viterbi algorithm since it is faster. The sequence which delivers

    the highest score will be the solution to what is known as the decoding

    problem, yielding the most likely state sequence.

    In training the HMM we need to optimise the parameters of the model

    based on the observation sequence. This can be quantified as finding the

    highest value of f(λ | X_1^T, S_0^{T+1}).¹ We used what is known as Viterbi re-

    estimation to solve what is often known as the learning problem. This

    method uses the state sequence (segmentation) obtained by the Viterbi al-

    gorithm to re-estimate the parameters of the HMM. This can easily be

    accomplished by simply updating all the parameters (pdfs and transition

    probabilities) within the segments specified by the Viterbi algorithm's segmentation. This algorithm is an example of an Expectation-Maximisation algorithm, as we change our pdfs' parameters to obtain the maximum prob-

    ability score (expectation).

    The procedures described above all involve matching an observation sequence to the model. This is quantified as a probability f(X_1^T | λ),

    showing that any HMM can be seen as a special kind of pdf.

    ¹ A reader familiar with basic statistics will note that this is the reverse of the eval-

    uation problem, and therefore simple Bayesian identities can solve this problem.
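    For concreteness, a log-domain Viterbi scorer for the left-to-right topology can be sketched in a few lines of Python (again only an illustration, not the trellis code actually used; it assumes the per-state log emission values have already been computed, and it forces the path to start in the first and end in the last emitting state, mimicking the two null-states).

        import numpy as np

        def viterbi_score(log_emis: np.ndarray, log_trans: np.ndarray) -> float:
            """Log score of the single best state path through a left-to-right HMM.

            log_emis[t, i]  : log f_i(x_t) for T observations and N emitting states
            log_trans[i, j] : log a_ij over the N emitting states (row-stochastic)
            """
            T, N = log_emis.shape
            delta = np.full(N, -np.inf)
            delta[0] = log_emis[0, 0]                 # entry null-state -> first emitting state
            for t in range(1, T):
                stay = delta + np.diag(log_trans)     # self-loops a_ii
                move = np.full(N, -np.inf)
                move[1:] = delta[:-1] + log_trans[np.arange(N - 1), np.arange(1, N)]  # a_{i,i+1}
                delta = np.maximum(stay, move) + log_emis[t]
            return delta[-1]                          # exit through the termination null-state

        if __name__ == "__main__":
            rng = np.random.default_rng(1)
            T, N = 20, 5
            a = np.full((N, N), 1e-12)
            for i in range(N - 1):
                a[i, i] = a[i, i + 1] = 0.5           # a_ii = a_{i,i+1} = 0.5, as in section 5.3.1
            a[N - 1, N - 1] = 1.0
            log_emis = rng.normal(size=(T, N))        # stand-in for per-state log pdf values
            print(viterbi_score(log_emis, np.log(a)))

    Viterbi re-estimation then updates each state's pdf and transition probabilities from the segments that this best path assigns to it, exactly as described above.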


    5.3 Model Configurations

    5.3.1 First configuration: 1D HMM

    For the face classification task we used two basic configurations of Hidden

    Markov Models. In the first case the face was modelled with a vertical

    HMM running along the rows of the image as seen in figure 5.2. With each

    state of the HMM representing a distinct facial region (i.e. the eyes, mouth,

    chin etc.), the characteristic features of any person can be modelled.

    Figure 5.2: Vertical top-to-bottom HMM modelling a face

    Inside each state S we use a Gaussian mixture model (GMM) as the probability density function f_i(x|S_t, λ) within the state. A Gaussian mixture model


    can be expressed as a weighted sum of K Gaussian distributions:

        L(\mathbf{x}) = \sum_{k=1}^{K} p(k)\, \mathcal{N}_k(\mathbf{x})   (5.3.1)

    where \mathcal{N}_k(\mathbf{x}) is a D-dimensional Gaussian distribution with mean \mu and covariance matrix \Sigma:

        \mathcal{N}_k(\mathbf{x}|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right]   (5.3.2)

    and p(k) is a mixture weight constrained by:

        0 \leq p(k) \leq 1 \quad \text{and} \quad \sum_{k=1}^{K} p(k) = 1

    The mixture weights can be seen as probabilities since they represent the

    importance of each separate Gaussian pdf in the GMM. The dimension D

    of the Gaussians depends on the feature extraction method we use.²
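    Equations 5.3.1 and 5.3.2 translate directly into code for the diagonal-covariance case used in this work. The sketch below (illustrative Python; the numbers in the example are arbitrary) evaluates log L(x) with the usual log-sum-exp over the K weighted components.

        import numpy as np

        def gmm_logpdf(x: np.ndarray, weights: np.ndarray,
                       means: np.ndarray, variances: np.ndarray) -> float:
            """log L(x) for a diagonal-covariance GMM (equations 5.3.1 and 5.3.2).

            weights   : (K,)   mixture weights p(k), summing to 1
            means     : (K, D) component means
            variances : (K, D) diagonal entries of each covariance matrix
            """
            d = x.shape[-1]
            log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
            log_exp = -0.5 * (((x - means) ** 2) / variances).sum(axis=1)
            return float(np.logaddexp.reduce(np.log(weights) + log_norm + log_exp))

        if __name__ == "__main__":
            w = np.array([0.5, 0.3, 0.2])
            mu = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, -1.0]])
            var = np.ones((3, 2))
            print(gmm_logpdf(np.array([0.5, 0.5]), w, mu, var))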

    Now that we have the density functions, we can finalise our HMM by

    initialising it. This is done by uniformly segmenting the face under consid-

    eration along its rows and obtaining the mean vector and covariance matrix

    of each of these segments. These will be the initial values of the parameters of our GMMs. Furthermore we set all the transition probabilities of the HMM equal to a_ij = 0.5, keeping in mind that for each state of the HMM these probabilities sum to 1. Now we have a complete model of our face, represented by λ = {a, f}. In order to train this model we use the procedure described in the previous section, matching an observation sequence to the model and then optimising the model's parameters.

    5.3.2 Second configuration: Embedded HMM

    We illustrated that calculating the match between an observation sequence

    and a model can be characterised as a probability f(X_1^T | λ). This means

    that an HMM itself could be seen as a specialised pdf. Embedding HMMs

    ² See chapter 4 for feature extraction methods.


    to serve as the pdfs of the states of our vertical HMM could indeed enhance

    the modelling capabilities of our system. For an embedded HMM the con-

    ventional top-to-bottom HMM has a horizontal HMM as the pdf of each of

    its states (instead of a GMM) as shown in figure 5.3. This means that for

    each vertical state we have f_i(x|S_t, λ) ≡ λ_i^e = {a_i^e, f_i^e}, where the superscript e indicates that we are referring specifically to the horizontal HMM λ_i^e, with i indicating the vertical state under consideration.

    Figure 5.3: Embedded HMM modelling a face

    Each of the horizontal HMMs also needs probability density functions f_i^e(x|λ_i) for its states. These

    pdfs were chosen to be Gaussian mixture models (as described by equation

    5.3.1), due to their flexibility. To initialise the whole HMM structure, an image under consideration is segmented uniformly. Each of these segments

    is then again uniformly segmented across its columns so we end up with

    uniform blocks of data.³ The mean vector and covariance matrices of each

    of these blocks are now found so that every GMM obtains values for its

    ³ Thus for a 5x5 embedded HMM we will have 25 uniform blocks.


    parameters. Just to clarify, the estimated mean of column m of an MxN block (matrix) x_d of data is:

        \hat{\mu}_d(m) = E[x_d(m)] = \frac{1}{N} \sum_{i=1}^{N} x_d(i, m)   (5.3.3)

    Thus the total mean \mu_d of the block can be expressed as a vector of M means (m = 1, ..., M). The covariance matrix is estimated by:

        \hat{\Sigma}_d = \frac{1}{N} \sum_{i=1}^{N} \left(x_d(i) - \mu_d\right)\left(x_d(i) - \mu_d\right)^T   (5.3.4)

    where x_d(i) denotes the i-th row of the block.
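    The initialisation just described amounts to computing these two statistics for every uniform block. A small Python sketch follows (illustrative only; raw pixel blocks are used here as a stand-in for the blocks of feature vectors that are actually modelled):

        import numpy as np

        def block_stats(block: np.ndarray):
            """Column-wise mean vector and covariance matrix of one uniform block
            of data, as in equations 5.3.3 and 5.3.4 (rows act as observations)."""
            mu = block.mean(axis=0)                           # vector of M column means
            centred = block - mu
            sigma = centred.T @ centred / block.shape[0]      # 1/N sum of outer products
            return mu, sigma

        def uniform_block_stats(image: np.ndarray, n_vert: int, n_horz: int):
            """Initial GMM statistics for an n_vert x n_horz embedded HMM: split the
            image into uniform blocks and return (mean, covariance) per block."""
            stats = []
            for rows in np.array_split(image, n_vert, axis=0):
                for block in np.array_split(rows, n_horz, axis=1):
                    stats.append(block_stats(block.astype(float)))
            return stats

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            face = rng.integers(0, 256, size=(112, 92))
            stats = uniform_block_stats(face, 5, 5)   # 25 uniform blocks for a 5x5 embedded HMM
            print(len(stats), stats[0][0].shape, stats[0][1].shape)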

    To summarise the whole embedded HMM concept, we have a vertical HMM

    containing a number of horizontal HMMs as probability density functions

    within its states. The initial values of this vertical HMM are the combined

    initial values of the horizontal HMMs along with uniform transition proba-

    bilities. Each horizontal HMM has a GMM as probability density function

    within each of its states, uniform transitional probabilities and the initial

    values are obtained from uniform blocks of data to initialise the GMMs.

    The horizontal HMMs are trained as described in a previous section and

    this means the calculation of a likelihood. Using the likelihoods calculated

    for all of the horizontal HMMs, we can finally obtain the same type of like-

    lihood value for our vertical HMM. This results in a trained final model.
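    To make the nesting concrete, the simplified sketch below scores an image against such a structure by fixing the vertical segmentation to uniform row strips: each strip is scored by the horizontal model of the corresponding vertical state, and these log-likelihoods act as the state "emission" scores of the vertical HMM. This is only an assumption-laden illustration; the horizontal scorers are stubbed with a trivial Gaussian match, and a real embedded HMM also searches over the vertical segmentation.

        import numpy as np

        def embedded_score(strips, horizontal_scorers, log_trans):
            """Simplified embedded-HMM score: one strip per vertical state, so the
            vertical path is fixed and the total log score is the sum of per-strip
            horizontal log-likelihoods plus the transitions along that path."""
            assert len(strips) == len(horizontal_scorers)
            score = 0.0
            for i, (strip, scorer) in enumerate(zip(strips, horizontal_scorers)):
                score += scorer(strip)                  # log f_i(strip | lambda_i^e)
                score += log_trans[i]                   # log transition along the fixed path
            return score

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            face = rng.normal(size=(100, 80))
            strips = np.array_split(face, 5, axis=0)    # 5 vertical states -> 5 row strips

            def make_scorer(mu):                        # stub for a trained horizontal HMM:
                return lambda strip: -0.5 * np.sum((strip - mu) ** 2)   # just a Gaussian match

            scorers = [make_scorer(mu) for mu in np.linspace(-1, 1, 5)]
            print(embedded_score(strips, scorers, np.log(np.full(5, 0.5))))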


    Chapter 6

    Implementation

    6.1 Practical aspects

    Now that we have built all the necessary foundations that explain the gist

    of our system, we can proceed to discuss the implementation issues and the

    practicalities of building a robust classifier.

    6.1.1 Classifying faces: database partitioning

    In order to classify faces we need to train our HMM based classifier on

    sample images (training data) of each person in the database. Then other

    unseen images, or test data, are scored against our trained models and, as in

    most real life scenarios, the highest score wins!

    The immediate question which arises involves the partitioning of the

    databases in terms of training and testing data. In order to compare our

    classification results with published results, the same partitioning must be

    used. The partitioning of a database is often done only once (by the first

    publisher) and then, in order to compare results, such a partitioning seems

    to propagate through all further publications in the field. To explain the

    partitioning problem we summarise common partitions of our two databases

    in Table 6.1. Just to clarify what is meant by the percentage values: they give the amount of training data as a percentage of the total amount of (training plus test) data.

              Partitioning    Reference
      XM2VTS  75%             Zhang et al. (2004)
      ORL     50%             Samaria (1994)

    Table 6.1: Comparable partitioning of databases

    Although the XM2VTS

    database has 8 images per person, we only used 4 faces (each time using 3

    faces to train on and one to test on) in experiments. We chose these faces in

    accordance with what seems to be the four faces used in Zhang et al. (2004).

    This could prove to be quite limiting as efficient modelling using HMMs is

    known to be very training data dependent. In the XM2VTS database one

    is also dealing with 295 individuals, so face classification does become more

    difficult. The only drawback is that, as far as could be established, Zhang

    et al. (2004) is the only known publication with classification results. All

    other publications concerning this database tackled the verification problem

    (because of the well defined protocol as described in Messer et al. (1999))

    and a large number of verification results are obtainable. Following the

    above discussion we set up the XM2VTS database experiments with 3 faces to train on and one to test on, to compare classification results.¹

    The experiments on the ORL database give a good indication of how

    our system compares to other systems, since a large number of published

    results are available. This database however has a collection of only 40

    individuals (figure A.1) which means results could be seen only as a rough

    approximation to how a commercial system would perform. We used the

    ORL database as it is and did classification experiments using the historical

    50% partitioning to compare our system with previous results, as well as a full leave-one-out experiment. To clarify the 50%, it means that the first

    five faces were used to train on and the last five used to test on. Perfect

    classification rates (100%) are obtainable on the ORL database, as we show

    in the next chapter.

    ¹ This partitioning does have its merits: the images were shot one month apart, so differences in appearance (different hair, glasses etc.) are present.


    6.1.2 Classifying faces: image background

    One of the most frustrating problems encountered in constructing our classi-

    fier was the large amount of background present in the XM2VTS database.

    As shown previously, the background can represent more than 40% of an im-

    age. A classification experiment was conducted using the embedded HMM

    approach with DCT-mod2 coefficients, using the first face of four as test

    data and the other three faces as training data on all 295 individuals in

    the database without reducing the amount of background. This carelessness was reflected in the results, as we achieved a correct classification rate of only 58%. Since we have 8 faces available, but use only four, the effect of

    the background can easily be verified by running the same experiment and

    again using 4 faces. However, this time 2 faces are replaced with 2 of those

    that have been left out.² In doing this the classification rate increases to

    80%! This means that the error rate is effectively halved. It is necessary to

    take out the background: it is confusing our classifier! This problem at

    least illustrates that because of the modelling power and dynamic aspects

    of HMMs, they are so flexible that they tend to model the background if it represents the bulk of the available data.

    As a solution to the problem posed by too much background we cut out

    all the relevant faces of the XM2VTS database (first face from each of the 4

    capturing sessions), giving a total of 295 x 4 faces to classify. The faces were

    cut out manually from downsized images, so as to fit as much of the face

    as possible into a 236x144 sized window. This corresponds to 56x33 blocks

    of DCT-mod2 coefficients and 58x35 blocks of DCT coefficients. Again the

    dimension of the DCT features is 15 and that of DCT-mod2 features is 18. The DCT coefficients are obtained with a sampling overlap of 50%.

    This was the combination of dimensions used in our final experiments.

    In the ORL database no cropping of faces was needed as the data is already

    presented in a friendly format as shown in chapter 3.

    ² The 8 faces per person in the XM2VTS database were shot across 4 sessions; here

    we use the 2 faces from the first 2 sessions.


    6.1.3 Classifying faces: training and scoring HMMs

    The process of classification on a database of faces can be summarised as

    follows:

    • First a database is partitioned into a training and a testing part, with the training data used to train an HMM for each person in the database. This means we have, for each person in the database, an HMM trained on that person's training data.

    • Each test face is then scored against all these models and it is classified to the model with the highest similarity measure. This scoring procedure was done with the reversed Viterbi algorithm, each score representing the similarity between the test face image and a trained model.
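    In code this classification step is a simple argmax over per-model scores. The sketch below (illustrative Python; the scoring function would in practice be the Viterbi match of an observation sequence against a trained HMM, but is stubbed here with a toy distance) shows the pattern:

        import numpy as np

        def classify(test_face, models, score):
            """Assign a test face to the enrolled person whose trained model gives
            the highest score, as described above.

            models : dict mapping person id -> trained model
            score  : function (face, model) -> log-likelihood (e.g. a Viterbi scorer)
            """
            scores = {person: score(test_face, model) for person, model in models.items()}
            return max(scores, key=scores.get), scores

        if __name__ == "__main__":
            # toy stand-in: each "model" is just a mean image, scored by negative distance
            rng = np.random.default_rng(0)
            models = {p: rng.normal(size=(10, 10)) for p in ("A", "B", "C")}
            probe = models["B"] + 0.1 * rng.normal(size=(10, 10))
            winner, all_scores = classify(probe, models, lambda x, m: -np.sum((x - m) ** 2))
            print(winner)   # expected: "B"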

    6.2 The HMM configurations

    For our experiments we use two basic configurations of HMMs with all three of the feature extraction techniques described in chapter 4. We also evaluate these six possible setups on both of the available databases. For all

    the following discussions on the dimensions of the observations, and hence of the Gaussian Mixture Models, see Figure 6.1.

    Figure 6.1: Passing of features from the feature domain to an HMM configuration

    6.2.1 HMM configuration I

    For configuration I we use a simple one dimensional left-to-right HMM

    modelling down the rows of each image. It has Gaussian mixture models

    within its states as probability density functions, each one modelling hori-

    zontal data. We specify seven states, each state modelling a specific facial

    region or background, namely: top background and hair, forehead, eyes,



    nose, mouth, chin and finally neck or possibly clothing as shown concep-

    tually in figure 6.2. The Gaussian mixtures consisted of three diagonal

    covariance Gaussians each, initialised with uniform weights in the mixture. The dimensions of the Gaussians were chosen to be N = 15 for the DCT

    based experiments and N = 18 for the DCT-mod2 experiments, as these are

    the dimensions of the features extracted as proposed by Sanderson (2003).

    For the pixel value experiments a dimension of N = 4 was used with a

    single observation being represented by a column vector of 4 pixel values.

    The whole observation sequence is formed by scanning the image from top to

    bottom with a window of 4 pixels high and a 75% overlap. The dimension of

    4 and the 75% overlap are chosen based on experiments by Samaria (1994). The same method is used in the DCT and DCT-mod2 experiments, except for the overlapping: this is set at 50% and done in the step where

    we transform the pixel values with the DCT (see chapter 4 for more details).
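    One possible reading of the pixel-value sampling just described is sketched below (illustrative Python; the exact traversal order of the 4-pixel column vectors is an assumption, since only the window height and the 75% overlap are specified above).

        import numpy as np

        def pixel_observations(image: np.ndarray, height: int = 4, overlap: float = 0.75):
            """Observation sequence for configuration I with raw pixel features: scan
            the image top to bottom with a strip `height` pixels high and the given
            overlap; every column of every strip becomes one observation vector."""
            step = max(1, int(round(height * (1.0 - overlap))))   # 4 pixels, 75% overlap -> step 1
            obs = []
            for top in range(0, image.shape[0] - height + 1, step):
                strip = image[top:top + height, :].astype(float)
                obs.extend(strip[:, col] for col in range(strip.shape[1]))
            return np.array(obs)                                  # shape: (T, height)

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            img = rng.integers(0, 256, size=(112, 92))
            print(pixel_observations(img).shape)                  # (T, 4)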

    In this configuration it is important to initialise the Gaussian parameters

    to sensible values, since the Gaussians do most of the modelling; the HMM merely selects which one of the GMMs is used.

    Figure 6.2: HMM configuration I topology

    For the pixel value experiments we intuitively choose the initial values as follows. Since

    the GMMs we use consist of three Gaussians each, we decided to divide the

    grey scale domain (0-255) into 4 roughly equal parts and therefore obtain

    three distinct borders used as the initial means for the Gaussians. These

    means are: 60, 120 and 180. This ensures that no prejudice towards particular pixel values is introduced that could bias the classifier. The diagonal covariance

    matrices are all initialised with values of 100 on the diagonal. For the DCT

    values and the DCT-mod2 values a more careful approach was needed, since these features do provide more stability, but at the cost of needing good

    initialisation. To the best of our knowledge the effect of initialisation has

    not been covered before in the literature, and we believe it to have an effect on

    the outcome of the classification experiment.
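    The pixel-value initialisation just described can be written down directly. The sketch below is illustrative only; extending the same initial mean to all four dimensions of the observation vector is our assumption, as the text only gives the scalar grey-level borders.

        import numpy as np

        def initial_pixel_gmm(dim: int = 4):
            """Unbiased initial GMM parameters for the pixel-value experiments:
            three diagonal Gaussians with means 60, 120 and 180 in every dimension,
            a variance of 100 on the diagonal and uniform mixture weights."""
            means = np.array([[60.0] * dim, [120.0] * dim, [180.0] * dim])
            variances = np.full((3, dim), 100.0)
            weights = np.full(3, 1.0 / 3.0)
            return weights, means, variances

        if __name__ == "__main__":
            w, mu, var = initial_pixel_gmm()
            print(w, mu[:, 0], var[0, 0])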

    For the DCT coefficients obtained from both databases we take the av-

    erage of the means down the columns for all 400 of the faces in the case


    of the ORL database and all 1180 faces of the XM2VTS database. These

    means are shown in figures 6.3 and 6.4. From this we decided to initiali


Recommended