RELIGHTING, POSE CHANGE AND RECOGNITION OF FACES
By
RITWIK KUMAR
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2009
© 2009 Ritwik Kumar
To my mother, father and sister
ACKNOWLEDGMENTS
I am extremely grateful to Dr. Baba C. Vemuri and Dr. Arunava Banerjee for their
guidance and support during my graduate studies. They have been a constant source
of inspiration and encouragement for me, the most important ingredients necessary for
research. I am also thankful to Dr. Jeffery Ho, Dr. Anand Rangarajan and Dr. Trevor
Park for being on my supervisory committee and providing extremely useful insights into
the work presented in this dissertation.
I would like to thank the Department of Computer and Information Science and
Engineering (CISE) and the University of Florida (UF) for giving me the opportunity to
pursue my graduate studies in a very constructive environment. I am especially thankful
to my UF Alumni Fellowship, the Dept. of CISE and the NIH grants NS46812 (to Dr.
Baba C. Vemuri), EB007082 (to Dr. Baba C. Vemuri) and EB004752 (to Dr. Paul R.
Carney and Dr. Thomas H. Mareci) for funding my doctoral studies and travels to various
conferences. During my graduate studies, I enjoyed my job as a teaching assistant and for
that I am grateful to Dr. Manuel Bermudez for being a terrific boss.
I am grateful to Dr. Tanveer Syeda-Mahmood for giving me the opportunity to work
with her wonderful research group at the IBM Almaden Research Center. I appreciate
the opportunity given to me by Dr. Michael Jones and Dr. Tim Marks to work at the
Mitsubishi Electric Research Laboratories (MERL). I am especially grateful to MERL for
providing data and software to generate some of the results included in this dissertation.
At the Center for Vision Graphics and Medical Imaging (CVGMI), I was very
fortunate to get to spend time in the company of many wonderful people. I deeply appreciate
Angelos Barmpoutis for his friendship and guidance during the course of my graduate
studies. Fei Wang has been a wonderful friend throughout my graduate studies and I am
grateful for all the mentoring and support he has provided. I must thank Ajit Rajwade,
Bing Jian and Santhosh Kodipaka for being extremely helpful and patient, as I endlessly
bothered them with my questions. I also appreciate the camaraderie of my lab-mates
Nicholas Lord, Adrian Peter, O’Neil Smith, Ozlem Subakan, Guang Cheng and Ting
Chen.
I am thankful to my long time friends Siddharth Chouksey and Vaibhav Garg for
being there when I needed them. I must also thank Kris, someone whom I knew very briefly
but will always remember as a good friend. Thanks to Karthik Gurumoorthy, Venkat
Ramaswamy, Seniha Esen Yuksel, Jason Yu-Tseh Chi and Amit Dhurandar, my wonderful
friends who are capable of making any occasion fun. I am especially grateful to Mujde
Erten, for being a patient, understanding and helpful friend.
I am thankful to Dr. Arvind Kudchadkar, Dr. R. N. Biswas, Dr. Prabhat Ranjan,
Dr. Naresh Jotwani, Dr. Ashish Jadhav and Dr. Suman K. Mitra, for their guidance
and support during my undergraduate studies at the Dhirubhai Ambani Institute of
Information and Communication Technology, Gandhinagar. They introduced me to the
wonderful world of research and helped me attain my goal of pursuing a doctoral degree.
Lastly and most importantly, I am thankful to my family, for their unconditional and
unflinching love and support. I will be eternally grateful to my mother, Shashi, my father,
Kailash and my sister, Richa, for all that they have done for me.
TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 PROBLEM DESCRIPTION
    1.1 Introduction
    1.2 Facial Relighting
    1.3 Pose Change
    1.4 Face Recognition
    1.5 The Interconnections
    1.6 Organization

2 FACIAL RELIGHTING AND POSE CHANGE WITH MULTIPLE IMAGES
    2.1 Introduction
    2.2 Related Work
    2.3 Overview
    2.4 Tensor Splines
        2.4.1 Spherical Functions Modeled as Tensors
        2.4.2 Tensor Splines
        2.4.3 Facial ABRDF Approximation Using Tensor Splines
    2.5 Mixture of Single-lobed Functions
    2.6 Recovering Shape from the ABRDF Field
        2.6.1 Rotation Estimation
        2.6.2 Surface Normal Computation
        2.6.3 Shape Recovery
        2.6.4 Novel Pose Relighting
    2.7 Experimental Results
        2.7.1 Relighting Faces
        2.7.2 Estimating Shape
        2.7.3 Face Recognition
    2.8 Conclusions

3 EIGENBUBBLES: THE ENHANCED ABRDF REPRESENTATION
    3.1 Introduction
    3.2 Eigenbubbles
    3.3 Experiments & Discussion
        3.3.1 Relighting
        3.3.2 Face Recognition
        3.3.3 ABRDF Field Compression
    3.4 Conclusions

4 FACE RELIGHTING AND POSE CHANGE WITH SINGLE IMAGE
    4.1 Introduction
    4.2 Related Work
    4.3 Overview
    4.4 The Reference ABRDF Field Model
    4.5 Model Fitting
    4.6 Experimental Results
        4.6.1 Relighting
        4.6.2 Pose Change
    4.7 Conclusions

5 FACE RECOGNITION
    5.1 Introduction
    5.2 Volterra Kernel Approximations
    5.3 Kernel Computation as Generalized Eigenvalue Problem
        5.3.1 First Order Approximation
        5.3.2 Second Order Approximation
    5.4 Training and Testing Algorithms
    5.5 Experiments
    5.6 Discussion and Conclusion

6 ENHANCING FACE RECOGNITION WITH RELIGHTING
    6.1 Introduction
    6.2 Gallery Set Augmentation
    6.3 Experimental Results
        6.3.1 Nearest Neighbor Classifier
        6.3.2 MERL Classifier
        6.3.3 Local Binary Pattern Classifier
        6.3.4 Naive Relighting
        6.3.5 Practical Relighting
        6.3.6 Results
    6.4 Conclusion

7 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

2-1 Requirements, assumptions and capabilities of existing methods
2-2 Face recognition error rates with Tensor Splines
3-1 Face recognition error rates with Eigenbubbles
3-2 ABRDF field compression
5-1 State-of-the-art methods for face recognition
5-2 Face recognition results on the Yale A dataset
5-3 Face recognition results on the CMU PIE dataset
5-4 Face recognition results on the Extended Yale B dataset
6-1 Effect of the gallery augmentation size on the recognition rates
6-2 Face recognition rates for the CMU PIE dataset
6-3 Face recognition rates for the MERL Dome dataset
LIST OF FIGURES

2-1 Lambertian model vs. Cartesian Tensors model
2-2 Symmetric and antisymmetric ABRDF approximations
2-3 ABRDF alignment
2-4 ABRDFs on a face
2-5 Novel synthesized images
2-6 Novel images in complex lighting
2-7 Shadows and specularities comparison
2-8 Impact of the input images
2-9 Per pixel intensity error comparison
2-10 Shapes recovered using 9 images
2-11 Pose variation
2-12 Shape comparison with the Robust Photometric Stereo
2-13 Simultaneous pose and illumination variation
3-1 Global Eigenbubbles
3-2 Local Eigenbubbles
3-3 ABRDFs on a face: Eigenbubbles
3-4 Global vs. Local Eigenbubbles
3-5 Images relighted with the Eigenbubbles
3-6 Quantitative comparison with Eigenbubbles
3-7 Shadows and specularities in relighted images
3-8 Extrapolated lighting conditions with Eigenbubbles
4-1 Overview of the ABRDF field fitting
4-2 Comparison with the ground truth images
4-3 Relighted images from the CMU PIE dataset
4-4 Relighted images from the CMU PIE dataset
4-5 Relighted images from the MERL Dome dataset
4-6 Relighted images from the MERL Dome dataset
4-7 Pose changed images using single input image
5-1 Structure of the A^1_i and A^2_i matrices
5-2 Training algorithm
5-3 Testing algorithm
6-1 ROC curve for the CMU PIE dataset with the MERL classifier
6-2 ROC curve for the MERL Dome dataset with the MERL classifier
6-3 ROC curve for the CMU PIE dataset with the Nearest Neighbor classifier
6-4 ROC curve for the MERL Dome dataset with the Nearest Neighbor classifier
6-5 ROC curve for the CMU PIE dataset with the LBP classifier
6-6 ROC curve for the MERL Dome dataset with the LBP classifier
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
RELIGHTING, POSE CHANGE AND RECOGNITION OF FACES
By
Ritwik Kumar
December 2009
Chair: Baba C. Vemuri
Cochair: Arunava Banerjee
Major: Computer Engineering
Relighting, pose change and recognition of faces from images are intimately connected
fundamental problems in the field of Computer Vision and Graphics. These problems are
particularly interesting and difficult when examined in the presence of constraints like a limited
number of input images, cast shadows and specularities. Though numerous solutions have
been proposed in the past, none effectively addresses these problems when considered in
the aforementioned constrained setting. In this dissertation, we present a set of techniques
that accomplish relighting, pose change and recognition of facial images in the presence of
specularities and shadows, using as few as one input image.
We start by presenting a method for relighting and pose change for facial images
using nine or more input images. We accomplish this by representing the Apparent
Bidirectional Reflectance Distribution Function (ABRDF) fields of human faces using
Tensor Splines. We then present a method for improving the quality of relighted images
by enhancing the ABRDFs with face specific subspace representations. Next, we present
a novel technique for estimating facial ABRDF fields for the difficult case of single input
image. Finally, we focus on the face recognition problem and present a novel face image
classification scheme as well as a framework for enhancing face recognition using relighting
methods outlined above.
All of the above mentioned techniques are supported by extensive experiments on the
Yale A, the CMU PIE, the Extended Yale B and the MERL Dome face image databases.
We show that our relighting, pose change and recognition systems outperform various
state-of-the-art methods in terms of image quality and recognition rates.
CHAPTER 1
PROBLEM DESCRIPTION
1.1 Introduction
Faces are the primary features that are used by humans to associate identities with
people. Due to this critical role of human faces, understanding and manipulating their
appearance has become immensely important in various modern applications which
deal with human-form representations. For instance, in a given video clip, a change
in the scene illumination requires that the appearances of the faces present also be modified
realistically – a problem commonly found in many video games with human characters.
Similar scenarios are also frequently encountered in the movie-making industry where
post-production manipulation of scenes with human faces, like pose or illumination
change, is required. In addition to the manipulation of facial images, understanding
what traits of a face are critical for its identity is also important, since it opens roads
for automatic face recognition. Automatic face recognition has numerous important
applications like access control, automatic image tagging and forensic database search.
The work presented in this dissertation addresses three fundamental problems
in facial image analysis: facial relighting, pose change and recognition. These have been
selected because these interconnected problems cover a substantial subset of the facial
image manipulation and understanding based applications described above. The exact
nature of each of these problems depends on the chosen assumptions. In a broad sense, we
will present solutions to the problem of relighting and pose change for both the multiple
image input and the single image input cases. We will also examine the problem of the
multiple image face recognition by presenting a novel face image classification method
and explore the single image face recognition problem in general as it relates to face
image relighting. Note that the single and multiple images here refer to the gallery or the
training set.
We begin by defining the problems of facial relighting, pose change and recognition
and then look at the interdependence among them. The specific flavors of these problems
that we work with are addressed in more detail in the respective chapters.
1.2 Facial Relighting
In its most general form, the problem of facial relighting involves generating facial
images of a subject under novel illumination given a few example facial images of that
subject. The example images may or may not come with any information about the
lighting used in the images. Further, the number of such example images may vary
from one to many. At the output end, the novel images may be required to have a point
source or arbitrarily complex lighting. The complexity of the problem also increases
according to the sought quality of the generated novel images. For instance, if shadows
and specularities are to be faithfully produced in the novel images, the problem becomes
more difficult than when these photo-effects are ignored.
1.3 Pose Change
In its most general form, the problem of pose change for a facial image involves
generating images of a subject’s face under novel poses, when one or more example images
of the same face are given in some other pose. The example images may or may not
come with information about the lighting conditions. The most common approach to this
problem involves recovering the 3D shape and texture of the face from the given image
or images. Once the shape and the texture of a face have been captured, images in novel
poses can be generated by projecting the shape along various directions.
1.4 Face Recognition
The above two problems are concerned with face images of a single subject, while
the problem of face recognition involves images across different subjects. This problem is
defined using three classes of images. The probe image is the image whose identity is to be
determined. The gallery is the set of images against which the identity of the probe image
is checked and the training set is the set of images which are used to learn the system
parameters. It should be noted that often the gallery and the training set refer to the
same set of images.
The complexity of the recognition problem is governed by both the probe and
the gallery images. The problem of recognizing the face is simpler if the probe is
obtained under predetermined pose and lighting conditions and it becomes harder as the
control over the probe image is relaxed and photo-effects like shadows and specularities
are allowed. The problem is again simpler if the gallery images are obtained under
predetermined conditions. The number of images in the gallery also impacts the difficulty
of the problem.
1.5 The Interconnections
The interdependence among these problems is best understood from the point of view
of the face recognition problem. It is possible, that the probe image is taken in a different
pose and/or illumination condition than the image(s) present in the gallery set. In such
cases, if either using the probe image or the image(s) from the gallery we can generate
novel images in new illumination conditions and poses, it can aid the recognition process
which involves matching the probe image against the gallery images.
The problems of relighting and pose change are also intimately connected. The
change in the observed image intensity values as the environment lighting changes is
known to be in part governed by the local shape of the object ([1], [2]). The global shape
can also affect photo-effects like shadows in an image.
1.6 Organization
Due to the tightly interdependent nature of the relighting and the pose change
problems, we discuss them together in Chapter 2, where we present a detailed description
of these problems, literature survey, our proposed solutions and the experimental results.
In Chapter 3 we discuss a method for enhancing the quality of relighted images using
Eigenbubbles. The more difficult case of the relighting and pose change problem, which works
with a single input image, is discussed in Chapter 4. Then we shift focus to the multiple
image face recognition problem and present a new classifier called Volterrafaces in Chapter
5. In Chapter 6 we detail our novel framework for single image face recognition that draws
upon the relighting scheme described in Chapter 4 to enhance recognition rates. Finally,
in Chapter 7 we conclude with a summary of the contributions made in this dissertation.
Part of the work presented in this dissertation has been published in [3], [4] and [5].
CHAPTER 2
FACIAL RELIGHTING AND POSE CHANGE WITH MULTIPLE IMAGES
2.1 Introduction
Precisely capturing appearance and shape of objects has engaged human imagination
ever since the conception of drawing and sculpting. With the invention of computers,
a part of this interest was translated into the search for automated ways of accurately
modeling and realistically rendering appearances and shapes. Among all the objects
explored via this medium, human faces have stood out for their obvious importance. In
recent times, the immense interest in facial image analysis has been fueled by applications
like face recognition (on account of recent world events), pose synthesis and face relighting
(driven in part by the entertainment industry), among others. This in turn has led to an
epitome of literature on this subject, encompassing various techniques for modeling and
rendering appearances and shapes of faces.
Our understanding of the process of image formation and the interaction of light and
the facial surface has come a long way since we started [6], with many impressive strides
along the way (e.g. [7], [8], [9]), but we are still some distance from an ideal solution. In
our view, an ideal solution to the problem of modeling and rendering appearances and
shapes of human faces should be able to generate extremely photo-realistic renderings of
a person’s face, given just one 2D image of the face, in any desired illumination condition
and pose, at a click of a button (real time). Furthermore, such a system should not require
any manual intervention and should not be fazed by the presence of common photo-effects
like shadows and specularities in the input. Lastly, such an ideal system should not require
expensive data collection tools and processes, e.g. 3D scanners, and should not assume
availability of meta-information about the imaging environment (e.g. lighting directions,
lighting wavelength etc.).
These general requirements have been singled out because the state-of-the-art is
largely comprised of systems which relax one or more of these conditions while satisfying
others. Common simplifying assumptions include applicability of the Lambertian
reflectance model (e.g. [7]), availability of 3D face model (e.g. [10]), manual initialization
(e.g. [11]), absence of cast shadows in input images (e.g. [12]), availability of large amounts
of data obtained from custom built rigs (e.g. [8]), etc. These assumptions are noted as
“simplifying” because – human faces are known to be neither exactly Lambertian nor
convex (and thus can have cast shadows), fitting a 3D model requires time consuming
large-scale optimization with manual selection of features for initialization, specialized
data acquisition can be costly and in most real applications only a few images of a face are
available.
The method we propose in this chapter moves the state-of-the-art closer to the
ideal solution by satisfying more of the above mentioned attributes simultaneously.
Our technique can produce photo-realistic renderings of human faces across arbitrary
illumination and pose using as few as 9 images (fixed pose, known illumination direction)
with a spatially varying non-Lambertian reflectance model. Unlike most techniques, our
method does not require input images to be free of cast shadows or specularities and can
reproduce these in the novel renderings. It does not require any manual initialization and
is a purely image based technique (no expensive 3D scans are needed). Furthermore, it is
capable of working with images obtained from standard benchmark datasets and does not
require specialized data acquisition.
Our technique is based on the Tensor Splines framework which can be used to
approximate any n-dimensional field of spherical functions (originally proposed in [13]).
In the case of faces, we, for the first time, use Tensor Splines to approximate the field of
Apparent Bidirectional Reflectance Distribution function (ABRDF) for a fixed viewing
direction. Unlike the BRDF, the ABRDF (also known as the reflectance function) at each
pixel captures the variation in intensity as a function of illumination and viewing direction
and is thus sensitive to the context of the pixel. Once the ABRDF field has been captured,
images of the face under the same pose but with arbitrary illumination can be generated
Figure 2-1. Lambertian model vs. Cartesian Tensors model. From the synthetic example of the first row and the real data below it, it can be noted that the Cartesian tensor can capture variations of intensity distributions more accurately than the Lambertian model.
by simply taking weighted combinations of the ABRDF field samples. Next, we estimate
the surface normal at each pixel by robustly combining the shape information from its
neighboring pixels. Towards this, we put forward an iterative algorithm which works by
registering neighborhood ABRDFs using an extremely efficient linear technique. With as
few as 1 or 2 iterations, we can recover the surface normal fields of most faces, which are
then numerically integrated to obtain the face surfaces. Novel pose with novel illumination
conditions can thus be rendered while seamlessly accounting for attached as well as cast
shadows.
2.2 Related Work
The sheer size of the facial shape-reflectance modeling literature allows its taxonomy
to be carried out along various possible lines. Here we have classified methods based on
various assumptions made by them while modeling the facial reflectance and the shape.
We have also summarized a few of the key methods along with their associated assumptions in
Table 2-1.
Table 2-1. Requirements, assumptions and capabilities of existing methods (✓ = yes, ✗ = no)

Method | Assumed surface BRDF | No. of input images | Relighted images presented | Shape/pose results presented | Cast shadows in input | Purely image based (no 3D scans) | Other assumptions, requirements and limitations
1999 MVIEW [14] | Lambertian | ≥3 | ✓ | ✓ | ✗ | ✓ | Near frontal illumination expected, ray tracing for cast shadows
1999 SIGGRAPH [15] | Non-Lamb. | 1 | ✓ | ✓ | ✗ | ✗ | No attached shadows, manual initialization to fit 3D model
2000 SIGGRAPH [8] | Non-Lamb. | ≥2000 | ✓ | ✓ | ✓ | ✓ | Custom rig for data collection, structured lighting for shape
2001 SIGGRAPH [9] | Lambertian | ≥3 | ✓ | ✗ | ✗ | ✗ | Distant and isotropic lighting, 3D scans needed as input
2001 SIGGRAPH [16] | Non-Lamb. | ≥50 | ✓ | ✓ | ✗ | ✗ | Custom rig for data acquisition, no specularity allowed in input
2001 PAMI [12] | Lambertian | ≥7 | ✓ | ✓ | ✗ | ✓ | Almost no attached shadows, symmetric faces, ray tracing
2001 PAMI [17] | Lambertian | 1 | ✓ | ✗ | ✗ | ✓ | Bootstrap set of images required, ideal class assumption
2001 IJCV [18] | Lambertian | 1 | ✓ | ✓ | ✗ | ✓ | No attached shadows, symmetric faces, piecewise constant albedo
2001 ICCV [19] | Non-Lamb. | ≥300 | ✗ | ✓ | ✓ | ✓ | Known lighting directions, lighting should doubly cover the directions
2003 EGSR [20] | Non-Lamb. | ≥12 | ✓ | ✓ | ✗ | ✓ | 3 sources/pixel, ad-hoc shadow detection, spatially constant BRDF
2003 CVPR [21] | Lambertian | 1 | ✓ | ✓ | ✗ | ✗ | Symmetric lighting, 3D model fitting, manual initialization
2003 PAMI [7] | Lambertian | 1 | ✓ | ✗ | ✗ | ✗ | Distant & isotropic lighting, 3D scans required, manual initialization
2004 PAMI [22] | Lambertian | ≥1 | ✗ | ✓ | ✗ | ✓ | Manual delineation of feature points for better recognition
2005 ICCV [23] | Non-Lamb. | 12 | ✓ | ✓ | ✗ | ✓ | Known lighting, HDR images expected, manual threshold selection
2005 PAMI [24] | Non-Lamb. | ≥8 | ✗ | ✓ | ✗ | ✓ | No shadows expected, reference object expected, symmetry of faces
2005 IAMF [25] | Lambertian | 1 | ✓ | ✗ | ✓ | ✗ | Shadowed pixel gets default albedo, universal 3D face model required
2006 PAMI [11] | Lambertian | 1 | ✓ | ✓ | ✗ | ✗ | 3D model fitting with manual initialization
2006 PAMI [26] | Non-Lamb. | ≥1 | ✓ | ✗ | ✗ | ✗ | Point lighting sources with known directions, object shape required
2007 CVPR [27] | Lambertian | ≥4 | ✗ | ✓ | ✓ | ✓ | 3 sources/pixel, known lighting, normals can't be on bisection planes
2007 ICCV [28] | Non-Lamb. | ≥32 | ✗ | ✓ | ✓ | ✓ | Point light sources, known directions, BRDF isotropic about normal
2007 ICCV [29] | Lambertian | 1 | ✓ | ✓ | ✗ | ✗ | Point sources with known directions, registered avg. 3D model required
2007 PAMI [30] | Lambertian | 1 | ✓ | ✓ | ✗ | ✓ | No attached shadows expected, symmetry of faces, bootstrap set required
2007 IJCV [31] | Lambertian | 15 | ✗ | ✓ | ✗ | ✓ | Distant and isotropic lighting, works for only convex objects
2008 CVPR [32] | Non-Lamb. | ≥102 | ✓ | ✓ | ✗ | ✓ | Point sources with known directions, BRDF isotropic about normal
Our Method | Non-Lamb. | ≥9 | ✓ | ✓ | ✓ | ✓ | Point sources with known directions
A large fraction of the existing techniques for facial image analysis work with the
Lambertian assumption for the reflectance functions. This translates to assuming that
the reflectance function at each point on the object’s surface has the same shape, that of
a half cosine function, which has been scaled by a constant – the albedo, and is oriented
along the surface normal at that location. One of the major reasons for the prevalence of
this model is its simplicity. Analysis has shown that under this assumption, if cast and
attached shadows are ignored, image of a convex object, in a fixed pose, lit by arbitrary
illumination lies in a 3-dimensional subspace [33]. When an ambient lighting component
is included, this subspace expands to become 4-dimensional [34] and when attached
shadows are taken into account, the subspace grows to become an infinite dimensional –
illumination cone [35].
Spherical harmonic analysis of the Lambertian kernel has shown that even though
the illumination cone is infinite dimensional, it can be approximated quite well by a lower
dimensional subspaces ([36], [9], [7]). In particular, these methods can produce impressive
results with 9 basis images, though they require the 3D shapes and the albedo fields as
input. These basis images can also be directly acquired using the “universal virtual”
lighting conditions [37]. More recently, this idea has been extended to 3D surfaces in [11]
building on the prior seminal work presented in [15] called Morphable Models. Morphable
Models can recover 3D shape of a face by fitting an average 3D facial model to a given 2D
image, accounting for necessary shape and texture adjustments. Morphable Models are
known to produce excellent results for across pose face recognition but cannot handle cast
shadows or specularities robustly. More importantly, they require manual delineation of
facial features to initialize a complicated non-linear optimization which can take a long
time to converge and can suffer from local minima. Using the idea of a low dimensional
subspace explored above, [22] represented the entire light-field using a low dimensional
eigen light-field.
It has been suggested that even though the time and cost of acquiring 3D data is
decreasing, the majority of face databases still remain 2D and hence it is more pragmatic
to work with the 2D images alone [38]. Methods that are purely image based and work
with the Lambertian assumption generally apply photometric stereo or shape from shading
to recover facial shape from the given images. For instance, results for simultaneous shape
recovery using photometric stereo and reflectance modeling were presented in [14] and [12].
Both of these methods work with multiple images and expect no cast shadows and very
little attached shadows in the images. Here the cast shadows in the relighted images are
rendered using ray tracing, which can be computationally expensive. Examples of methods
that recover shape from shading working under the Lambertian assumption can be found
in [18] and [39]. As these methods work with just one image, besides requiring the absence
of cast shadows in the input, they make additional assumptions like facial symmetry
(as in [18]) etc. An important point to note here is that the uncalibrated photometric
stereo or the shape from shading methods, that work with the Lambertian assumption
and orthographically projected images, also suffer from the Bas-Relief Ambiguity ([40]).
Resolving this requires additional assumptions like the symmetry of face, the nose and the
forehead being at the same height, known lighting directions, etc., and manual assistance.
Recently, shape recovery using the generalized photometric stereo was presented in
[31] which relaxes some of the assumptions made by the traditional photometric stereo
techniques. This method can recover shape from images taken under general unknown
lighting. On account of the Lambertian assumption, cast shadows are not entertained
in the input images and the shape of the object is assumed to be convex. Note that the
accurate recovery of shape using this method requires 15 to 60 images as input. Another
method for Lambertian shape recovery with multiple illuminants, but without ignoring
shadows, was presented in [27] where the graph cuts method was used to identify light
source visibility and information from shadow maps were used to recover the shape.
In contrast to most of the methods mentioned above are the techniques that seek
illumination invariant representations of the faces which can then be used to render
relighted images. Seminal work in this category was presented in [17], where the so called
“Quotient Images”, generated using ratio of albedo values, were used to generate images
under novel illumination conditions. More recently, use of invariants was invoked in
[21], where the radiance environment map was deduced using the ratio image technique
([41], [17]). Note that the shape recovery in [21], like the Morphable Models, requires
manual initialization. Forgoing the ratio technique, direct use of albedo as an illumination
invariant signature of face images was explored in [25], where using a universal 3D face
model, illumination normalized images of faces were generated. This method worked with
low resolution images and did not render high quality relighted images. More recently
an improvement was presented in [29] where the albedo estimation was made more
robust using the error statistics of surface normals and the known illumination direction.
This method requires a registered average 3D model of the face and does not allow cast
shadows in the input but unlike [25], also provides a facial shape estimate. Improving
upon the idea of the ideal class assumption ([17]), another generalized photometric stereo
technique was presented in [30]. Using a bootstrap set of facial images and exploiting the
subspace spanned by a set of basis objects with Lambertian surfaces, images with novel
pose and illumination were generated. Here the faces were assumed to be symmetric and
the input was assumed to be free of shadows.
Next, we look at techniques that do not make the Lambertian assumption. Seminal
work in this class of techniques was presented in [8], where using a custom built rig,
dense sampling of the illumination space for faces was obtained. In this work, the
facial shape was obtained using structured lighting and no assumption about the
surface BRDF was made. This completely data driven technique was able to produce
extremely photo-realistic images of the face in novel illuminations and poses. The specular
component was captured using polarized lighting and modified appropriately for pose
Figure 2-2. Symmetric and antisymmetric ABRDF approximations. (A) A semicircular function. (B) Approximation of (A) using symmetric functions. (C) Max of the function in (B) and zero. (D) Approximation of (A) using anti-symmetric functions. (E) Max of the function in (D) and zero. (F) An ABRDF approximated with symmetric functions leads to unnatural lighting. (G) An ABRDF approximated with an anti-symmetric function leads to more natural lighting.
variation. This method demonstrated that if a large number of images (> 2000) for each
subject can be obtained under various lighting configurations, the relighting and the pose
generation problems can be solved, but the cost of such a system can be extremely high.
Use of biquadratic polynomials to model texture was explored in [16]. This method
required a custom built rig and more than 50 specularity free images to recover the model
parameters. The shape of the object was not recovered in this method. Use of a large
number (≥ 300) of images to recover the shape without making any assumption about
the nature of the BRDF was revisited in [19]. This method required the input images to
Figure 2-3. ABRDF alignment. Neighboring ABRDFs A and B can be better aligned with each other than A and C. This would weigh the normal suggested by B higher than the normal suggested by C.
doubly cover the illumination directions which called for specialized data acquisition. No
attempt to capture the reflectance properties of the object was made in this work.
One of the first techniques that worked with standard face databases and did not
require custom data was presented in [20] where the more general Torrance-Sparrow
([2]) model for the BRDF was used. This method presented the relighting and the pose
variation results with 12 images as input but did not allow cast shadows. Further, this
method required each pixel to be lit by at least 3 light sources in order to work properly.
An important contribution in the field of example-based shape recovery was made by [42]
where objects of interest were imaged along with a reference object of known geometry
(e.g. sphere). Multiple (≥ 8) cast shadows free images were used as input. This work
was expanded to allow spatially varying BRDF by using multiple references in [24]. An
interesting extension of this work was presented in [23] where the shape of an object was
recovered using the assumption that the BRDF of any object is essentially composed of
the BRDFs of a few fundamental materials. This method used 12 High Dynamic Range
images with known illumination direction and required manual selection of a system
parameter threshold. Using a large amount of data obtained from a custom built rig, [38]
presented a technique for pose and illumination variation using single image, that used
the Morphable Models to recover the 3D shape and the higher-order SVD to recover the
illumination subspace. This method does not explicitly make the Lambertian assumption
but it requires manual initialization for the 3D model fitting.
When the 3D shapes of the objects are assumed available, [26] presented a technique
which, at times using just one image, can recover their spatially varying non-parametric
BRDF fields. For the case of the human face, this work presented results with 4 images
where the specular component was separately captured using polarized lighting. The
images were acquired from known illumination directions and no cast shadows were
allowed.
Recently, [28] presented a new method for photometric reconstruction of shape
assuming spatially varying but isotropic BRDFs. Given 32 or more images with known
illumination, this method recovers isocontours of the surface depth map from which the
shape can be recovered by imposing additional constraints. An extension of this work
was presented in [32] where the need for additional constraints to recover the shape from
the depth map isocontour was alleviated by assuming the surface to be composed of a
few fundamental materials and that the BRDF at each point can be approximated by a
bivariate function. Results presented in this work required 102 or more images. Another
interesting framework for photometric stereo using the Markov Random Field approach
was presented in [43].
Lastly, we note that in the cases when extremely high quality renderings are required
and cost-time constraints are relaxed, custom hardware is employed. For instance, highly
accurate measurements of material BRDFs were carried out using a gonioreflectometer
in [44] and various customized hardware components and software were used to render
face images in the movie “The Matrix Reloaded” [45]. In order to measure accurate skin
reflectance while accounting for sub-surface scattering, custom built devices were again
employed in [46] to render high quality facial images.
It can be noticed that most of the image-based techniques that do not make the
simplifying Lambertian assumption end up using a large amount of custom acquired data
or assuming some other parametric form for the BRDF (besides the other assumptions).
In this chapter we explore the possibility of acquiring the non-Lambertian reflectance and
shape with just nine images in a purely data driven fashion.
2.3 Overview
The technique that we propose in this chapter simultaneously captures both the shape
and the reflectance properties of a face. Unlike the majority of existing techniques that
work with the BRDFs, in order to seamlessly account for the specularities, the attached
shadows, the cast shadows and other photo-effects, we have chosen to work with the
ABRDFs, which are spherical functions of non-trivial shape. We estimate them using the
Cartesian tensors, which in practice, have enough flexibility to account for the variations
in the ABRDFs found across human faces. Further, in order to robustly estimate the
ABRDF field from only a few and often noisy samples, we draw upon the apparent smooth
variation of reflectance properties across the face and combine the Cartesian tensors with
B-Splines. This combination of the Cartesian tensors with B-Splines is called Tensor
Splines in this chapter.
Embedded in the ABRDFs at each pixel also lies the surface normal at this point.
To extract the normal from the ABRDF field riddled with cast shadows and specularities,
we invoke the homogeneity of the ABRDFs in local neighborhoods, and infer surface
normal at a pixel using the information from its immediate neighbors. More concretely, at
each pixel we align the ABRDF with its neighbors’ ABRDF using a linearized algorithm
for rotation recovery and take a weighted geodesic mean of the normals suggested by
the neighbors to obtain the surface normal. Our framework automatically discounts
possibly erroneous surface normal suggestions by weighting the suggestion from a neighbor
of substantially different shape lower than others. This process can be iterated and in
practice we find good solutions within 1 or 2 iterations.
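To make the weighted geodesic mean step concrete, the following sketch runs an illustrative Karcher-mean iteration on the unit sphere, assuming the neighbors' suggested normals and their alignment-based weights have already been computed; the function and variable names are ours, not the dissertation's implementation.

```python
import numpy as np

def weighted_geodesic_mean(normals, weights, iters=10, eps=1e-9):
    """Weighted Karcher mean of unit vectors on the sphere S^2.

    normals : (N, 3) unit surface-normal suggestions from the neighbors.
    weights : (N,) nonnegative weights from the ABRDF alignment quality.
    """
    w = weights / (weights.sum() + eps)
    mu = normals[int(np.argmax(w))].copy()      # start at the heaviest suggestion
    for _ in range(iters):
        # Log map of every suggestion into the tangent plane at mu.
        cos_t = np.clip(normals @ mu, -1.0, 1.0)
        theta = np.arccos(cos_t)                 # geodesic distances to mu
        perp = normals - cos_t[:, None] * mu     # components orthogonal to mu
        norms = np.linalg.norm(perp, axis=1)
        logs = np.zeros_like(normals)
        ok = norms > eps
        logs[ok] = (theta[ok] / norms[ok])[:, None] * perp[ok]
        # Weighted tangent-space average, mapped back with the exp map.
        step = (w[:, None] * logs).sum(axis=0)
        a = np.linalg.norm(step)
        if a < eps:
            break                                # converged
        mu = np.cos(a) * mu + np.sin(a) * step / a
    return mu
```

Consistent with the remark above, a few such iterations suffice in practice because the suggestions in a small neighborhood are already close on the sphere.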
Equipped with this mechanism to capture both reflectance properties and shapes
of the human faces, we can generate images of any face in novel poses and illumination
conditions.
Assumptions: Like all other techniques, our method also works with certain
assumptions. It requires at least 9 images of the face under point illuminations from
known directions in a fixed pose. Note that these assumptions have been used in the
past by various methods, for example, [23] worked with 12 images obtained from known
lighting directions in fixed pose. As the number of input images increases, the performance
of our method improves. We do not require the input images to be free of the attached or
the cast shadows. We also do not restrict the BRDF to be Lambertian ([12]) or isotropic
([28], [26]). Though global photo-effects like subsurface scattering and interreflection are
not explicitly modeled, Tensor Splines can capture them to some extent.
2.4 Tensor Splines
We seek a mathematical framework that can represent a field of spherical functions
accurately. If a dense enough sampling of the spherical function field is provided, this can
be accomplished to arbitrary accuracy, but the central problem we face is precisely the
scarcity of the data. To solve this problem for the case of human facial ABRDF fields, we
exploit clues from the specific nature of ABRDFs on human faces, e.g., smooth variation
of the ABRDF for the most part, presence of multiple lobes in the ABRDF, etc. Note that the
pose is assumed to be fixed and hence the term “ABRDF” is used to refer to a spherical
function of illumination direction.
Figure 2-4. Recovered ABRDFs for a human face. Complex shapes of the ABRDFs in various regions of the face can be readily noted.
2.4.1 Spherical Functions Modeled as Tensors
A spherical function in R3 can be thought of as a function of directions or unit
vectors, v = (v_1 v_2 v_3)^T. Such a function, T, when approximated using an nth order
Cartesian tensor [47] (a tensor in R^3; in the notation used here, this order is not the same
as the number of indices used to represent the tensor), is expressed as

T(\mathbf{v}) = \sum_{k+l+m=n} T_{klm}\, (v_1)^k (v_2)^l (v_3)^m    (2–1)

where T_{klm} are the real-valued tensor coefficients and k, l and m are non-negative integers.
This is a Cartesian tensor with all the n arguments set to be v. The expressive power of
such Cartesian tensors increases with their order. Geometrically, this translates to the presence
of more “lobes” on a higher order Cartesian tensor.
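As a quick concreteness check on Eq. 2–1, this minimal Python sketch (hypothetical helper names, not code from the dissertation) enumerates the monomial exponents of an nth order Cartesian tensor and evaluates T(v):

```python
def tensor_exponents(n):
    """All (k, l, m) with k + l + m = n: the monomial exponents of Eq. 2-1."""
    return [(k, l, n - k - l) for k in range(n + 1) for l in range(n + 1 - k)]

def eval_cartesian_tensor(coeffs, v, n):
    """Evaluate T(v) = sum_{k+l+m=n} T_klm v1^k v2^l v3^m for a unit vector v.

    coeffs is ordered like tensor_exponents(n).
    """
    v1, v2, v3 = v
    return sum(c * v1**k * v2**l * v3**m
               for c, (k, l, m) in zip(coeffs, tensor_exponents(n)))
```

Here len(tensor_exponents(n)) reproduces the coefficient counts quoted below: 3, 10 and 21 for orders 1, 3 and 5.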
Note that the Lambertian model is intricately connected to a special case of this
Cartesian tensor formulation. If v = (v_1 v_2 v_3)^T is the light source direction, n = (n_1 n_2 n_3)^T is the surface normal and \rho is the surface albedo, the Lambertian kernel is given by

\max(\rho\, \mathbf{n} \cdot \mathbf{v},\ 0) = \rho \cdot \max(n_1 v_1 + n_2 v_2 + n_3 v_3,\ 0) = \max\Big(\sum_{k+l+m=1} T_{klm}\, v_1^k v_2^l v_3^m,\ 0\Big)    (2–2)

with T_{100} = \rho n_1, T_{010} = \rho n_2 and T_{001} = \rho n_3. A comparison with Eq. 2–1 reveals that
the Lambertian kernel is exactly the positive half of the 1st order Cartesian tensor.
The 1st, 2nd, 3rd and 5th order Cartesian tensors have 3, 6, 10 and 21 unique
coefficients respectively. For even orders, the Cartesian tensors are symmetric, T(v) =
T(−v), while for odd orders they are anti-symmetric, T(v) = −T(−v). We must point
out that these definitions of symmetry and anti-symmetry are different from the standard
definitions based on switching of the arguments’ order. In this chapter, we would use the
definitions we provided above.
Finally, though the higher order tensors can be more expressive, they can be
perceived to be more sensitive to noise due to their ability to model high frequency
details. In contrast, the lower order tensors are incapable of modeling high frequency
information but arguably are more robust to noise. Since it is impossible to discriminate
between high frequency detail and noise in the data, it is reasonable to say that the higher
order tensors possess higher noise sensitivity. Thus, like in any other approximation task,
we must strike a balance between the high frequency data fidelity and the noise sensitivity.
2.4.2 Tensor Splines
When the task requires estimation of a p-dimensional field of multi-lobed spherical
functions from sparse and noisy data, given the high noise sensitivity of higher order
tensors, it is reasonable to enforce smoothness across the field of spherical functions. We
accomplish this by combining the Cartesian tensor basis at each pixel with the B-Spline
basis ([48]) across the lattice of the spherical functions.
We define a Tensor Spline as a B-spline of multilinear functions of any order. In a
Tensor Spline, the multilinear functions are weighted by the B-spline basis N_{i,k+1}, where

N_{i,1}(t) = \begin{cases} 1 & \text{if } t_i \le t < t_{i+1} \\ 0 & \text{otherwise} \end{cases}    (2–3)

and

N_{i,k}(t) = N_{i,k-1}(t)\,\frac{t - t_i}{t_{i+k-1} - t_i} + N_{i+1,k-1}(t)\,\frac{t_{i+k} - t}{t_{i+k} - t_{i+1}}.    (2–4)

The N_{i,k+1}(t) are polynomials of degree k, associated with n + k + 2 monotonically increasing numbers called "knots" (t_{-k}, t_{-k+1}, \ldots, t_{n+1}).
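A direct transcription of Eqs. 2–3 and 2–4 (the Cox–de Boor recursion) might look like the sketch below; the 0/0 terms that arise with repeated knots are taken as zero, a standard convention the equations leave implicit, and the function name is ours:

```python
def bspline_basis(i, k, t, knots):
    """N_{i,k}(t) from Eqs. 2-3 and 2-4; k is the order (degree + 1),
    knots is a monotonically increasing sequence indexed from 0."""
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + k - 1] - knots[i]
    if d1 > 0.0:
        left = bspline_basis(i, k - 1, t, knots) * (t - knots[i]) / d1
    d2 = knots[i + k] - knots[i + 1]
    if d2 > 0.0:
        right = bspline_basis(i + 1, k - 1, t, knots) * (knots[i + k] - t) / d2
    return left + right
```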
The Tensor Spline for a p-dimensional lattice of spherical functions, with kth degree
spline and nth order Cartesian tensor is defined as
S(\mathbf{t}, \mathbf{v}) = \sum_{(i_1 \ldots i_p) \in D} \Big( \prod_{i_a} N_{i_a,k+1}(t_{i_a}) \Big)\, T_{i_1 \ldots i_p}(\mathbf{v})    (2–5)

where t = (t_1 \ldots t_p) is the index into the spherical function lattice, v = (v_1 v_2 v_3)^T is a
unit vector, D is the p-dimensional spline control point lattice and T_{i_1 \ldots i_p}(v) is given by
Eq. 2–1. In Tensor Splines the usual B-Spline control points have been replaced by the
control tensors T_{i_1 \ldots i_p}(\mathbf{v}). The formulation presented in Eq. 2–5 is quite general as it can
be used to estimate a spherical function field defined over an arbitrary dimensional lattice,
with any desired degree of B-Spline smoothing.
2.4.3 Facial ABRDF Approximation Using Tensor Splines
Human faces are known to be neither exactly Lambertian nor convex, which leads
to photo-effects like specularities (oily forehead and nose tip) and cast shadows (around
protruding features like nose and lips) in facial images. These effects cause such a complex
variation in the intensity values at various pixels as the lighting direction changes that it
cannot be accurately captured by a single lobed function (like the Lambertian kernel).
This motivated us to explore the use of the higher order Tensor Splines to model
the ABRDFs. Note that here the lattice is 2-dimensional and the assumption of local
Figure 2-5. Images synthesized using Tensor Splines under novel illumination directions. The illumination direction is mentioned on each image as (azimuth, elevation). The nine images used as input were illuminated from the (-20,60), (0,45), (20,60), (-50,0), (0,0), (50,0), (-50,-40), (0,-35) and (50,40) directions.
homogeneity also holds to a reasonable degree. In order to ensure that the smoothness
is manifested only in a localized fashion, we have chosen to use bi-cubic B-Splines in the
ABRDF-specialized version of the Tensor Splines.
The ability of the Cartesian tensors to better model data with complex distributions
can be noted in Fig. 2-1 ([5]), where in the first row, we show that for the case of
synthetic circular data (shown by the green arrows), the Cartesian tensors can more
accurately approximate the data than the Lambertian cosine bumps. In the second row we
show real facial ABRDFs approximated by the Tensor Splines and the Lambertian model
from a shadow prone region of the face. It can be readily noted that the Tensor Splines
capture the variability in intensity values, as a function of illumination direction, more
accurately than the Lambertian reflectance model.
We must point out that as the order of the Cartesian tensors increases, so does the
amount of data samples required to estimate the unknown coefficients. When there are
only a few images available, in order to satisfy our desire to use the higher order tensors,
Figure 2-6. Images relighted with complex lighting. The first image of each subject is lit by a point source while the next two are lit by the Eucalyptus Grove and St. Peter's Basilica light probes respectively. Light probes are provided below the facial images.
we must choose between its odd (anti-symmetric) or even (symmetric) components. Note
that since most of the time we are interested in the ABRDFs’ behavior on the frontal
hemisphere, both symmetric and anti-symmetric versions provide the same representation
power. Their behavior only becomes pertinent when the illumination direction is exactly
perpendicular to the pose direction, and this is where the use of anti-symmetric versions is
advantageous.
This has been explained via a 2D example in Fig. 2-2 ([5]). Fig. 2-2A shows a
semicircular function where the blue circle in the figure is considered to be the zero value.
Fig. 2-2B and 2-2D show the same function approximated by an antipodally symmetric
function and an antipodally anti-symmetric function respectively. It can be noted that
for both the cases the approximation is quite accurate except near the angles 0◦ and 180◦.
When the original function (Fig. 2-2A) is such that it has a positive value at one of these
antipodal points and a near-zero value at the other, a symmetric function forces the value at
both of these crucial angles to be positive while the anti-symmetric function forces one to
be positive and the other to be negative. Now, if we assume that only the positive values
of the function are preserved we get the results as presented in Fig. 2-2C and 2-2E.
The behavior of most facial ABRDFs is similar to the function in Fig. 2-2A. This
is because if a pixel has high intensity value when lit from 0◦, most of the time it would
have a low intensity value when lit from 180◦ (due to attached and cast shadows), and
vice versa. Thus, if a symmetric function is used for approximating such an ABRDF,
it would cause non-negative values at both 0◦ and 180◦ and would lead to visually
significant artifacts (unnatural lighting) in the images (Fig. 2-2F). On the other hand,
in practice, use of an anti-symmetric function does not cause visually significant artifacts
(Fig. 2-2G). To summarize, even though both anti-symmetric and symmetric functions
introduce artifacts near the 0° and 180° directions, the artifacts created by an anti-symmetric
approximation are visually insignificant, and hence we have chosen to work with the
anti-symmetric components of the Cartesian tensors.
Two dimensional Tensor Splines with bi-cubic B-Splines and odd order tensors can be
written as
S(\mathbf{t}, \mathbf{v}) = \sum_{(i,j) \in D} N_{i,4}(t_x)\, N_{j,4}(t_y)\, T_{i,j}(\mathbf{v})    (2–6)
where vectors i, j, D, t and v have the same meaning as before and the tensor has an odd
order.
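Putting the two ingredients together, a pixel of the bi-cubic Tensor Spline of Eq. 2–6 can be evaluated as in the sketch below, which reuses the bspline_basis and eval_cartesian_tensor helpers sketched earlier (again illustrative names; control is assumed to hold the D × D lattice of control-tensor coefficients):

```python
def eval_tensor_spline_2d(control, knots_x, knots_y, tx, ty, v, n):
    """S(t, v) of Eq. 2-6 at lattice position t = (tx, ty) and direction v.

    control : (D, D, C) nested array of control-tensor coefficients, C being
              the number of exponents (k, l, m) with k + l + m = n.
    """
    D = len(control)
    val = 0.0
    for i in range(D):
        wi = bspline_basis(i, 4, tx, knots_x)    # bi-cubic: spline order 4
        if wi == 0.0:
            continue
        for j in range(D):
            wj = bspline_basis(j, 4, ty, knots_y)
            if wj != 0.0:
                val += wi * wj * eval_cartesian_tensor(control[i][j], v, n)
    return val
```

For relighting, one evaluates S at a new lighting direction v and, as described later in this section, sets any negative values to zero, as in the Lambertian model.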
The problem at hand is that given a set of Q face images (I_q, q = 1 \ldots Q) of a subject
in a fixed pose along with the associated lighting directions v_q = (v_{q1} v_{q2} v_{q3}), we want
to estimate the ABRDF field of the face using a bi-cubic Tensor Spline. We propose to
accomplish this by minimizing the following energy function which minimizes the L2
distance between the model and the given data,
E(T_{ijklm}) = \sum_{q=1}^{Q} \sum_{t_x,t_y} \Big( \sum_{(i,j)\in D} N_{i,4}(t_x)\, N_{j,4}(t_y)\, T_{i,j}(\mathbf{v}_q) - I_q(t_x,t_y) \Big)^2
             = \sum_{q=1}^{Q} \sum_{t_x,t_y} \Big( \sum_{(i,j)\in D} N_{i,4}(t_x)\, N_{j,4}(t_y) \sum_{k+l+m=n} T_{ijklm}\, v_{q1}^k v_{q2}^l v_{q3}^m - I_q(t_x,t_y) \Big)^2    (2–7)
where t_x, t_y run through the lattice of the given images, i, j are the indices into the D × D
spline control point lattice D, and the tensor order n is an odd integer. The minimization
of Eq. 2–7 is done with respect to the unknown tensor coefficients T_{ijklm} that correspond
to the control tensors T_{i,j}(\mathbf{v}).
If the image size is M × M, there are M^2 unknown ABRDF tensors, which are
interpolated from the control tensors (Eq. 2–6). We use a uniform D × D grid of control
tensors, which translates to 3D^2, 10D^2 and 21D^2 unknown control tensor coefficients for
1st, 3rd and 5th order tensors respectively. A value for D is chosen according to the desired
smoothness. For the cases when the number of unknowns per control tensor is one more
than the number of data constraints, we use an additional constraint which discourages
solutions with large norms. This is enforced by adding the term \lambda \sum_{ij} \sum_{klm} T_{ijklm}^2 to the
error function in Eq. 2–7, where λ is the regularization constant.
We recover the unknowns in Eq. 2–7 using the gradient descent method with the
control tensor coefficient field initialized using all-ones unit vectors. This technique can be
efficiently implemented because we have obtained the closed form for the derivative of the
objective function with respect to the unknown coefficients as
\partial E(T_{ijklm}) / \partial T_{ijklm} = 2 \sum_{q=1}^{Q} \sum_{t_x, t_y} \Big( \sum_{(i,j) \in D} N_{i,4}(t_x) N_{j,4}(t_y) T_{i,j}(v_q) - I_q(t_x, t_y) \Big) N_{i,4}(t_x) N_{j,4}(t_y) v_{q1}^k v_{q2}^l v_{q3}^m.    (2–8)
Once the coefficients have been recovered, images under a novel illumination direction
v can be synthesized by evaluating the ABRDF field in the direction v, where each
ABRDF is given by Eq. 2–5. Any negative values obtained in Eq. 2–5 are set to zero
(as in the Lambertian model). Furthermore, since the Tensor Spline is a continuous
function, the generated images can be readily up-sampled by evaluating Eq. 2–5 on a
denser sampling lattice.
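As an illustration of this relighting step, the sketch below evaluates a 3rd order Cartesian tensor ABRDF in a unit lighting direction and clamps negative responses to zero. The exponent ordering in EXPONENTS is our assumption and must match the coefficient layout used during estimation.

```python
# The 10 exponent triples (k, l, m) with k + l + m = 3; the ordering is an
# assumption and must match the layout of the estimated coefficients.
EXPONENTS = [(k, l, 3 - k - l) for k in range(4) for l in range(4 - k)]

def eval_tensor(coeffs, v):
    # Eq. 2-5: sum of T_ijklm * v1^k * v2^l * v3^m over k + l + m = 3.
    return sum(c * v[0] ** k * v[1] ** l * v[2] ** m
               for c, (k, l, m) in zip(coeffs, EXPONENTS))

def relight_pixel(coeffs, v):
    # Negative responses are set to zero, as in the Lambertian model.
    return max(eval_tensor(coeffs, v), 0.0)
```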
2.5 Mixture of Single-lobed Functions
In order to quantitatively validate whether the Tensor Splines provide a good enough
approximation of the ABRDF fields, we present a more expressive model here. This
validation model is more general in the sense that it can accommodate an arbitrarily
large number of lobes to approximate any spherical function. We define it using a
mixture of single-lobed spherical functions. Such a mixture is characterized by a kernel
function, k(\mu_i, v), and a set of mixing weights, w_i, associated with a set of unit
vectors \mu_i as follows
B(v) = \sum_i w_i k(\mu_i, v),    (2–9)
where v is the lighting direction and the vectors µi are uniformly distributed on the unit
sphere.
Of the various choices for single-lobed spherical functions that can be used as the
kernel function k(\mu, v), we picked k(\mu, v) = e^{-\mu \cdot v} - 1 for two reasons: it has a
single peak, and k(\mu, v) = 0 for all v such that v \cdot \mu = 0 (if the viewing and the
illumination directions are perpendicular we expect zero intensity). Note that these two
properties are also satisfied by the Lambertian kernel.
The task of estimating ABRDFs using this mixture model requires us to recover the
unknown weights such that the weighted combination yields a spherical function which
closely approximates the ABRDFs. Given a set of N facial images with the same fixed
pose and the associated lighting directions v_n, we can set up an N × M matrix A_{n,m}
by evaluating e^{-\mu \cdot v} - 1 for every v_n and \mu_i, where M is the number of \mu_i picked
in the model. The unknown weights (Eq. 2–9) for each pixel can then be estimated by
solving the over-determined system AW = B, where B is the N-dimensional vector of the
intensities at a fixed pixel in the N given images, and W is the vector of the unknown
weights. Since the ABRDF is a nonnegative function, we solve this system with the
positivity constraint using the non-negative least squares minimization algorithm
developed in [49].
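A minimal sketch of the per-pixel mixture fit, assuming the kernel sign convention used in the text and substituting SciPy's non-negative least squares solver for the algorithm of [49]:

```python
import numpy as np
from scipy.optimize import nnls

def fit_mixture_weights(intensities, light_dirs, mus):
    # intensities: (N,) values of one pixel across the N input images.
    # light_dirs:  (N, 3) unit lighting directions v_n.
    # mus:         (M, 3) unit vectors mu_i sampled uniformly on the sphere.
    # Kernel matrix A[n, i] = k(mu_i, v_n) = exp(-mu_i . v_n) - 1  (Eq. 2-9).
    A = np.exp(-(light_dirs @ mus.T)) - 1.0
    weights, _residual = nnls(A, intensities)   # enforces w_i >= 0
    return weights
```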
Note that this model would generally have a very large number of unknowns
(depending on the resolution chosen while picking the \mu_i), and would thus require a
large number of samples of the ABRDF field (images) for accurate recovery of the
ABRDFs. But since this model is only used as a tool to evaluate the Tensor Splines, this
is not considered a drawback.
2.6 Recovering Shape from the ABRDF Field
The facial ABRDF is in part characterized by the local surface normal, and hence it
should be possible to recover shape information from it. But unlike the various popular
parametric reflectance models, such as the Lambertian, Torrance-Sparrow ([2]) and
Phong ([1]) models, which explicitly assume a role for the surface normal in their
formulae, Tensor Splines make no such assumption. This allows spatially varying and
accurate approximation of the ABRDFs, but also makes the recovery of the surface
normals non-trivial.
To recover the surface normals from the Tensor Splines model we invoke the local
homogeneity of the ABRDF field. This assumption is physically sound because the
reflectance properties of a human face do not change drastically in small neighborhoods
(3 × 3 pixels), and mathematically robust because the Tensor Splines model ensures that
the coefficients vary smoothly across the ABRDF lattice. We assume that the ABRDFs
at two neighboring pixels have the same shape and differ only by a rotation R; thus, if
the surface normal at one of these pixels is known, the surface normal at the other pixel
can be derived by rotating it by R.
For a given internal pixel (x, y) in the image, there are eight immediate neighbors.
If the surface normal at (x, y) is inferred as described above, it receives eight
suggestions for possible surface normals (assuming that the surface normals of the
neighbors are known). Instead of picking one of the suggestions as its surface normal,
we take a weighted geodesic average of the suggested normals. The weights are set to be
inversely proportional to the registration error obtained during the rotation-alignment of
the ABRDF pairs. There are two main advantages of computing the surface normal in this
manner. Firstly, being an aggregate statistic, the geodesic mean is more robust to noise
than the individual suggestions. Secondly and more importantly, the weighted nature of
the mean ensures that suggestions, which originate from neighbors whose ABRDFs are
very different in shape than the ABRDF at (x, y), are automatically weighted less. This
property of the weighted mean is especially useful at locations in the image where the
homogeneity assumption breaks down, e.g. shadow edges.
This process is summarized in Fig. 2-3 ([5]), where the central ABRDF, (A), is shown
to align with higher accuracy to its left neighbor, ABRDF (B), than to its right
neighbor, ABRDF (C). For both cases, the before- and after-alignment configurations are
shown from two different viewpoints. As mentioned before, the misalignment error is
used to weight the normal suggestion from a neighbor, and hence the suggestion from
the left ABRDF, (B), is eventually weighted more than the suggestion from the ABRDF
on the right, (C).
Once the rotation matrices for all the pixels in the image have been computed, we
initialize all the normals with the directions in which the ABRDFs have their maxima.
This initialization is followed by weighted geodesic mean computations, which provide us
with a robust estimate of the surface normals. The process of mean computation is carried
out iteratively, but empirically we noticed that good results are obtained in all cases
with 1 or 2 iterations. Note that using the maxima directly as normal estimates produces
inaccurate results. We attribute this to the fact that, unlike some reflectance models (e.g.
the Lambertian), the Tensor Splines model does not enforce that the maximal response of
the ABRDF lies along the surface normal direction.
2.6.1 Rotation Estimation
Recovering the surface normal field using the steps described above requires
computation of the rotation matrices for each pair of neighboring ABRDFs in the image.
A simple but computationally intensive approach would be to search for the rotation
matrix using a gradient based constrained optimization technique. More concretely,
Figure 2-7. The 1st image is generated using the Tensor Splines model, the 2nd image is the ground truth and the 3rd is generated using the Lambertian model. The cast shadows and the specularities are rendered much more realistically by the Tensor Splines model than by the Lambertian model.
according to this scheme, two ABRDFs represented by their Cartesian tensor coefficients
w1 and w2, can be aligned by minimizing the following objective function
E(R) = \sum_{v \in S^2} \big( w_1^T B(v) - w_2^T B(R v) \big)^2,    (2–10)

such that

R^T R = I,    (2–11)
where the unit vectors v are obtained by some uniform sampling of the sphere, B is the
vector of Cartesian tensor basis defined in Eq. 2–1 and R is the sought rotation matrix.
This method for rotation matrix recovery would require the nonlinear optimization to
be run ∼ 8L^2 times for an image of size L × L pixels. Even for an average-sized image
this process can be quite intractable, and hence we propose the following more efficient
algorithm for rotation matrix recovery.
Let T_1(v) and T_2(v) be the two ABRDFs (Eq. 2–1) that need to be aligned via a
rotation. This implies that we seek a \delta v such that

T_1(v) = T_2(v + \delta v).    (2–12)

Since the ABRDFs are from neighboring pixels, we assume that the required \delta v is
small, and thus using a first-order Taylor expansion we get

T_1(v) = T_2(v) + \nabla T_2(v)^T \delta v.    (2–13)
Figure 2-8. Distribution of errors as the configuration of the input changes. The X-axis represents the azimuth and the Y-axis represents the elevation angles. Hotter colors show larger errors. The white dots represent the exact directions of illumination in the images used as input.
As we expect

L v = v + \delta v,    (2–14)

where L is the linear transformation containing the rotation matrix, we get

T_1(v) - T_2(v) + \nabla T_2(v)^T v = \nabla T_2(v)^T L v,    (2–15)

which leads to the linear system

A x = B,    (2–16)

where the ith row of A contains the vectorized entries of \nabla T_2(v_i) v_i^T, x contains
the vectorized entries of L, the ith entry of B is T_1(v_i) - T_2(v_i) + \nabla T_2(v_i)^T v_i,
and the v_i are unit vectors obtained from a uniform sampling of the sphere. The
embedded rotation matrix R can then be recovered from L using the QR decomposition.
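A sketch of this linearized alignment under stated assumptions: T1, T2 and gradT2 are caller-supplied callables (hypothetical names) evaluating the two ABRDFs and the gradient of the second, and the QR step follows the text.

```python
import numpy as np

def estimate_rotation(T1, T2, gradT2, dirs):
    # Build the linear system of Eq. 2-16 from uniformly sampled unit
    # vectors dirs (S, 3) and solve for L in the least-squares sense.
    A = np.zeros((len(dirs), 9))
    b = np.zeros(len(dirs))
    for i, v in enumerate(dirs):
        g = gradT2(v)                      # gradient of T2 at v (3-vector)
        A[i] = np.outer(g, v).ravel()      # vectorized grad(T2)(v_i) v_i^T
        b[i] = T1(v) - T2(v) + g @ v       # right-hand side of Eq. 2-15
    L = np.linalg.lstsq(A, b, rcond=None)[0].reshape(3, 3)
    Q, _ = np.linalg.qr(L)                 # extract the embedded rotation
    return Q
```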
2.6.2 Surface Normal Computation
As described earlier, the surface normal, n, at a pixel (x, y) with ABRDF T can be
computed by taking a weighted geodesic mean of the normals suggested by its neighboring
pixels. Let all of its P immediate neighbors be indexed 1 . . . P, with corresponding
ABRDFs T_p, normals n_p, and rotation matrices R_1, . . . , R_P computed using the
process described above. The normal at (x, y) is then given by

n = \arg\min_{\mu} \sum_{p=1}^{P} \frac{1}{\|R_p T_p - T\|^2} \, d(n_p, \mu)^2,    (2–17)
where d(·) is the geodesic distance defined on the space of unit normals, i.e., the arc
length. We seek a geodesic mean because the domain of unit normals is the unit sphere
and not the Euclidean space. This mean is also known as the weighted Karcher mean and
can be computed using the following iterative scheme:
\mu \leftarrow \exp_\mu(\varepsilon \nu)    (2–18)

\nu = \frac{1}{P} \sum_{p=1}^{P} \frac{1}{\|R_p T_p - T\|^2} \exp_\mu^{-1}(n_p)    (2–19)

where \exp, the exponential map, is given as

\exp_\mu(\varepsilon \nu) = \cos(|\varepsilon \nu|) \mu + \sin(|\varepsilon \nu|) \, \nu / |\nu|    (2–20)

and \exp_\mu^{-1}(n_p), the log map, is defined as

\exp_\mu^{-1}(n_p) = u \cos^{-1}(\langle \mu, n_p \rangle) / \sqrt{\langle u, u \rangle}    (2–21)

where

u = n_p - \langle n_p, \mu \rangle \mu,    (2–22)
and ε is the iteration step size. For more details on computing means on manifolds see [50]
and references therein.
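A sketch of the iterative scheme above, assuming a small fixed step size and the one-to-two iterations reported later in the chapter; the function and argument names are illustrative.

```python
import numpy as np

def weighted_karcher_mean(normals, weights, mu, eps=0.5, iters=2):
    # Weighted Karcher mean on the unit sphere (Eqs. 2-18 to 2-22).
    # normals: (P, 3) suggested unit normals n_p; weights: (P,) values
    # 1 / ||R_p T_p - T||^2; mu: initial unit-vector estimate.
    for _ in range(iters):
        nu = np.zeros(3)
        for n, w in zip(normals, weights):
            u = n - np.dot(n, mu) * mu                       # Eq. 2-22
            nrm = np.linalg.norm(u)
            if nrm > 1e-12:                                  # log map, Eq. 2-21
                nu += w * u * np.arccos(np.clip(np.dot(mu, n), -1.0, 1.0)) / nrm
        nu /= len(normals)
        step = eps * nu
        t = np.linalg.norm(step)
        if t < 1e-12:
            break
        mu = np.cos(t) * mu + np.sin(t) * step / t           # exp map, Eq. 2-20
        mu /= np.linalg.norm(mu)
    return mu
```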
2.6.3 Shape Recovery
Once the normal field has been computed, we use one of the standard techniques
([51]) to recover the surface. If z(x, y) defines the surface, the normal at a location (x, y)
is given by (zx zy −1)T where zx and zy denote the partial derivatives of the surface with
respect to x and y. If (nx ny nz)T denotes the surface normal at location (x, y), we have
Figure 2-9. Per pixel intensity error comparison. The blue color shows the errors on the 1st and 2nd subsets combined, which contain lighting directions φ smaller than 25◦, the green color shows the errors on the 3rd subset with 25◦ < φ < 50◦, and the red color shows the errors on the 4th subset with 50◦ < φ < 70◦.
the following relations

z_x = -n_x / n_z    (2–23)
z_y = -n_y / n_z.    (2–24)
Using the forward difference approximation of the partial derivatives, we obtain the
following two equations

n_z z(x+1, y) - n_z z(x, y) = n_x    (2–25)
n_z z(x, y+1) - n_z z(x, y) = n_y,    (2–26)
which provide a linear relation between the surface values at the grid points and the
known surface normals. The surface can thus be recovered by solving an over-determined
system of linear equations. At the boundary points, the above formulation is not valid
and the surface is recovered by solving the following equation, obtained by eliminating
n_z above:

n_x z(x, y) - n_x z(x, y+1) = n_y z(x+1, y) - n_y z(x, y).    (2–27)
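The interior relations translate directly into a sparse least-squares problem. The sketch below uses the sign that follows from Eqs. 2–23 and 2–24, SciPy's LSQR solver, and omits the boundary equation (Eq. 2–27) for brevity; it is an illustration, not the dissertation's implementation.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def integrate_normals(nx, ny, nz):
    # Recover z(x, y) from the (H, W) normal field via the forward-difference
    # relations derived from Eqs. 2-23 and 2-24, solved in a least-squares sense.
    H, W = nz.shape
    idx = lambda x, y: y * W + x
    A = lil_matrix((2 * H * W, H * W))
    b = np.zeros(2 * H * W)
    r = 0
    for y in range(H):
        for x in range(W):
            if x + 1 < W:    # n_z (z(x+1, y) - z(x, y)) = -n_x
                A[r, idx(x + 1, y)] = nz[y, x]
                A[r, idx(x, y)] = -nz[y, x]
                b[r] = -nx[y, x]
                r += 1
            if y + 1 < H:    # n_z (z(x, y+1) - z(x, y)) = -n_y
                A[r, idx(x, y + 1)] = nz[y, x]
                A[r, idx(x, y)] = -nz[y, x]
                b[r] = -ny[y, x]
                r += 1
    z = lsqr(A.tocsr()[:r], b[:r])[0]
    return z.reshape(H, W)
```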
2.6.4 Novel Pose Relighting
With the facial shape in hand, novel poses can be rendered by simply changing the
viewpoints. But generating novel illumination conditions in the novel pose is not trivial as
Figure 2-10. Shapes recovered using 9 images. For each of two subjects: A/E) 9 input images, B/F) quiver plot of the normal field, C/G) color encoded normals (R-x, G-z, B-y), and D/H) recovered shape in a novel pose.
the ABRDFs estimated from a different pose cannot be directly used. If the ABRDF field
was estimated in pose P1 and if we wish to generate an image with a novel illumination in
a new pose P2, we have to rotate the ABRDFs by the same rotation which is required to
change P1 to P2. Once the orientations of the ABRDFs have been rectified, images of the
face in the new pose with novel illumination can be generated by evaluating the rectified
ABRDF field in the desired directions.
We would like to point out that specularities are view dependent and, strictly speaking,
cannot be directly transferred from one pose to another. Most of the existing
Lambertian methods ignore this effect, but the few which deal with this problem handle
it either by explicitly obtaining the specular component using polarized lighting (e.g.
[26], [8]), which requires specialized data acquisition, or by assuming a parametric form
for the specular component of lighting (e.g. [20]).
Our Cartesian tensor representation for the ABRDFs does not discriminate against
specularities and estimates the ABRDF as faithfully as possible from the available
intensity values. Thus, it should be possible to recover and manipulate the specular component
separately, but at this stage, we have made the assumption that specularities do not
change drastically across facial poses. The validity of this assumption is supported by the
results presented in the next section.
2.7 Experimental Results
In order to evaluate the proposed methods for relighting and shape recovery, we
conducted several detailed experiments which are presented here. Since it has been shown
for the popular Lambertian model that the space of images with illumination variation
can be approximated quiet accurately using a 9 dimensional subspace ([7], [9]), we have
taken on the challenge of also working with just 9 image. Note that with 9 samples of
the ABRDF field and the solution norm minimization constraint, at most 10 unknown
coefficients can be recovered per pixel and hence our central results use bi-cubic 3rd order
Tensor Splines.
The experiments were carried out on the Extended Yale B [37] (38 subjects, in 9
poses and 64 illumination conditions) and the CMU PIE [52] (68 subjects in 13 poses and
43 illumination conditions) benchmark databases. Note that the CMU PIE has 21 usable
point source illuminated images while in the Extended Yale B all 64 illuminations are
point source.
2.7.1 Relighting Faces
We begin by noting that the Tensor Splines model can capture the non-trivial shape
of the facial ABRDFs. In Fig. 2-4 ([5]) we show the ABRDF field of a subject from the
Extended Yale B database estimated using 9 images. Three different regions of the face
have been shown in detail where complicated shapes of the ABRDF can be noticed.
The regions A and B have more complicated shapes because these ABRDFs have to
accommodate shadows. The spherical functions in the image have been color coded based
on their maximal value directions. The mapping of the directions to colors is provided in
the lower right corner.
Next we present results for relighting of faces in novel illumination directions. In
Fig. 2-5 ([5]) four different subjects lit in various novel point source illuminations are
depicted. For the first two rows, the illumination direction varies across the azimuth angle
while in the next two rows, the variation is in the elevation angle. It can be noticed that
our method can accurately interpolate as well as extrapolate from the images provided
as input. Further, difficult effects like the cast shadows and the specularities have been
photo-realistically rendered without using any additional ray tracing.
Starting with 9 images, our technique estimates the entire ABRDF field and thus
images lit in fairly complex lighting conditions can also be rendered. In Fig. 2-6 ([5]) we
present such results for two subjects from the CMU PIE database. Below each image is its
lighting condition. The first image of each subject is one of the nine input images used to
estimate their ABRDF fields. The next two images for each of the subjects are lit by the
light probes ([53]) named Eucalyptus Grove and St. Peter’s Basilica respectively. For these
color images, we estimated the ABRDF field for each color channel separately. The images
were relighted by taking a weighted combination of the point source lit images where the
weights were determined using the light probes. We used 2500 samples of the light probe
to render these images.
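A sketch of this weighted combination, assuming a caller-supplied point-source renderer and a pre-sampled probe; the normalization of the probe weights is our assumption.

```python
import numpy as np

def relight_with_probe(render_point_source, probe_dirs, probe_radiance):
    # render_point_source(v) -> (H, W) image lit from unit direction v.
    # probe_dirs: (S, 3) sampled probe directions (S ~ 2500 in the text);
    # probe_radiance: (S,) radiance of the light probe at each sample.
    weights = probe_radiance / probe_radiance.sum()
    return sum(w * render_point_source(v)
               for w, v in zip(weights, probe_dirs))
```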
In Fig. 2-7 ([5]) we provide a qualitative comparison between our method and the
Lambertian model. The first face image in Fig. 2-7 ([5]) is rendered using the Tensor
Splines model, the second is the ground truth and the third image is rendered using the
Lambertian model. It can be readily noted that the image obtained using our method is
closer to the ground truth than the one rendered using the Lambertian model. The arrows
show the locations of important differences – the cast shadows and the specularities.
The next two experiments try to quantitatively capture the performance of our
method. First, we explore the impact of the input data on the estimation when the tensor
order is fixed (3rd order in this case). For this we use the Extended Yale B dataset as it
provides the ground truth for 64 directions. To set a baseline, we estimated the ABRDF
field for 10 subjects using all the 64 images as input, rendered images in the same 64
directions and computed the total error with respect to the ground truth images. Next,
the errors were computed similarly for 3 other cases where only 9 images were used as
input to our method, but in different configurations. Two of the cases had images with
illumination directions uniformly distributed in front of the face while one had images with
the directions biased towards one side.
To visualize the distributions of the obtained errors, we color coded them, with hotter
colors denoting larger errors, and plotted them as continuous images in Fig. 2-8 ([5]).
The X axes of these images show the azimuth angles varying from −130◦ to 130◦ (from
the leftmost white dot to the rightmost) and the Y axes show the elevation angles varying
from −40◦ to 90◦ (from the topmost white dot to the bottommost). The white dots in
these images show the exact direction of illumination in the images used as input. It can
be readily noted that when all the 64 images are used as input, Fig. 2-8A, the error is the
Figure 2-11. Detailed pose variation, with the texture-less rendering in the upper right and the depth-map in the lower right. Accurate renderings can be noticed even in extreme poses.
least. For the 9 image cases, Fig. 2-8B and Fig. 2-8C, where the illumination directions
of the input images are somewhat uniformly distributed, the error is more than that in
Fig. 2-8A, but notably less than the case when the distribution is skewed in one direction,
Fig. 2-8D. Hence, as expected, our method performs better when the input images have
the lighting directions that are uniformly sampled from the sphere. Moreover, the errors
in all the cases are concentrated towards the extreme illumination angles and for the near
frontal illumination conditions, the performance is not particularly affected by the input
image distribution.
Next we present a quantitative comparison of our method with the Lambertian
model and the validation model presented in Section 2.5. A natural question that arises
is why an order 3 Cartesian tensor should be suitable for estimating the facial ABRDFs.
To answer this question, we computed the average intensity error per pixel over all 38
subjects in the 64 illumination directions of the Extended Yale B dataset using the
Lambertian model, 3rd order Tensor Splines, 5th order Tensor Splines and the mixture of
single lobed functions (Eq. 2–9). All 64 illumination directions were used for the mixture
model (on account of the large number of unknowns), while for the other three, only 9
images (configuration shown in Fig. 2-8B) were used. We set the µi values required for
the mixture model using a dense sampling (642 directions) of the unit sphere obtained by
the 4th-order tessellation of the icosahedron. We have presented results in Fig. 2-9 ([5])
split along the standard subsets (subset 4 in red, 3 in green and 1 + 2 in blue) of the
Extended Yale B database. As expected, the error for the subset with extreme lighting
(subset 4) is more than the other sets, for all methods. More importantly, even with a
considerably large amount of input data and a very flexible estimation model, the errors
obtained from the mixture model are quite similar to those obtained from the 3rd order
Tensor Splines model. This indicates that though a 3rd order Tensor Splines model can
only accommodate three lobes, for most facial ABRDFs this suffices. The 3rd order Tensor
Splines model outperforms the Lambertian model and even the 5th order Tensor Splines
model, which suggests possible over-fitting in the 5th order model.
2.7.2 Estimating Shape
All the results presented so far assumed a fixed pose, but using the technique
presented in Section 2.6 we can simultaneously vary the illumination and the pose of a
face. Fig. 2-10 ([5]) summarizes the results produced by our shape recovery algorithm
for one subject each from the Extended Yale B and the CMU PIE databases. The
first column shows the 9 input images, the second column shows the quiver plot of the
estimated normal field (zoom in to see details), the third column presents the surface
normal information in a color coded form (x components of the normal field are mapped
to the red channel, y components to the blue channel and z components to the green
channel) and the fourth column shows the recovered shape in a novel pose. For the case
of color images, shape estimation was carried out using only the luminance component.
In both the cases, occlusion of appropriate regions of the face due to pose change can be
noted from the images in the fourth column.
Figure 2-12. Shape comparison with the Robust Photometric Stereo on the Extended Yale B dataset
In Fig. 2-11 ([5]) we present more detailed results for pose variation with fixed
illumination. The 3 rows of images show a subject from the Extended Yale B in different
poses ranging from the right profile to the left profile, as we go from left to right, and
viewpoint varying from below the face to above the face, as we go from top to bottom.
Note that the ABRDF field for this subject was recovered using just 9 images under the
illumination configuration shown in Fig. 2-8B. The recovered shape for the same subject,
rendered with constant albedo and specularities, is also presented at the right end of the
figure. This allows finer details of the shape to be shown without any texture to bias the
observer. Finally, to the lower right of the figure is the height map for the same subject. It
can be noted that our shape recovery algorithm can produce good results without making
the simplifying Lambertian assumption.
We present face shapes estimated by our method and the Robust Photometric
Stereo [54] (9 input images for both methods) in Fig. 2-12 ([5]). Results for four
different subjects, both with and without texture, are presented. Based on the results,
the following conclusions can be drawn: First, since the Tensor Splines method imposes
local smoothness, the recovered shape lacks some minute details, like the mole on the
chin in case (a), as compared to the Robust Photometric Stereo. Second, since the
Tensor Splines method handles the cast shadows and the specularities more seamlessly
than the Robust Photometric Stereo, regions affected by the cast shadows and the
specularities, especially the nose, are better recovered by the Tensor Splines method.
This can be readily noted in cases (a), (c) and (d). Finally, the Tensor Splines method
seems to produce better global shape estimates. For instance, in case (a), the shape
recovered by the Robust Photometric Stereo is tilting backwards towards the top. In
case (c), the region around the mouth seems unnaturally warped in the Robust
Photometric Stereo result, while in case (d), the relative positioning of the nose and the
eyes seems more realistic in the
Tensor Splines results. In summary, these results demonstrate that the Tensor Splines
method may lack minute details but models facial features in a more photo-realistic
fashion than the Robust Photometric Stereo, when the input images have cast shadows
and specularities.
Finally, we present results when both the pose and the illumination conditions are
simultaneously varied. In Fig. 2-13 ([5]) one subject each from the CMU PIE and the
Extended Yale B databases are shown in various poses and illumination conditions. The
ABRDF fields for both the cases were recovered using 9 images, and the shape for the
color images were recovered using the luminance channel. With the change of pose, we
have retained the ABRDF field learnt using the frontal pose, but it can be noted that the
results are photo-realistic even when the specularities are not explicitly modified and
transferred.
2.7.3 Face Recognition
These results are presented here as an illustration that meaningful relighting can
enhance the face recognition results even when a simple classifier like the Nearest Neighbor
Figure 2-13. Simultaneous pose and illumination variation
classifier is used. We will discuss face recognition in more detail in Chapter 5 and
Chapter 6.
Face recognition is one of the most popular applications of facial image analysis. It
is generally defined as follows: given a database of facial images of various people, called
the gallery, identify the person in a novel test image, called the probe, as one of the people
present in the gallery. The degree of difficulty of this problem increases as the differences
between the probe and the gallery images increase. This difference could be due to the
illumination conditions, occlusion, expression, pose or any combination of these.
In recent times, illumination invariant face recognition has attracted particular
interest due to advances in our understanding of the reflectance modeling. Here we present
a comparative study of illumination invariant face recognition. When using the Tensor
Splines method, we assume that for each subject 9 gallery images with known illumination
directions are available. From these images we compute the ABRDF fields and generate
images with novel illumination for a dense sampling of directions. This step expands our
collection of 9 gallery images to any desired size. The probe image is then matched to all
Table 2-2. Face recognition error rates. N is the gallery set size.

Method                 N   Subset 1&2   Subset 3   Subset 4   Total
Correlation [55]       4   0.0          23.3       73.6       29.1
Eigenfaces [56]        6   0.0          25.8       75.7       30.4
Linear subspace [57]   7   0.0          0.0        15.0       4.7
Cones-attached [12]    7   0.0          0.0        8.6        2.7
Cones-cast [12]        7   0.0          0.0        0.0        0.0
9PL [37]               9   0.0          0.0        2.8        0.8
3D SH [11]             1   0.0          0.0        2.8        0.8
Harmonic (SFS) [58]    1   0.0          0.0        12.8       4.0
Tensor Splines         9   0.0          0.0        1.6        0.5
the images in the database and the subject with the closest matching (in L2 sense) gallery
image is assumed to be the correct identity of the probe image.
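A minimal sketch of this nearest-neighbor matching step; the array layout is our assumption.

```python
import numpy as np

def identify(probe, gallery_images, gallery_ids):
    # probe: (H, W) image; gallery_images: (G, H, W) relighted gallery;
    # gallery_ids: (G,) subject labels. Returns the closest identity in L2 sense.
    dists = np.linalg.norm(gallery_images - probe, axis=(1, 2))
    return gallery_ids[int(np.argmin(dists))]
```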
We have used the Extended Yale B data for this experiment primarily because most
of the existing methods have presented face recognition results on the same database. As
mentioned before, this database is divided into 4 subsets with the lighting getting more
and more extreme as we go from subset 1 to 4 and thus, the difficulty in classifying the
images from these subsets also increases. The obtained recognition error rates are reported
in Table 2-2. We have also presented results reported by existing methods and their
respective references are listed next to the name of the method. Results for the first seven
techniques are taken from [11] and the rest are taken from the respective references. Along
with the error rates, we have also listed the number of images required by each method
in the gallery set. For our method we used the nine images in the configuration shown in
Fig. 2-8B ([5]). It can be noted that even with the naive nearest neighbor classification
strategy our method produces near perfect results.
2.8 Conclusions
In this chapter we have presented a novel comprehensive system for capturing
the reflectance and the shape of human faces using Tensor Splines. Since our method
requires at least 9 input images with known illumination directions, we fall short of the
ideal solution described in the introduction, but show an improvement over the popular
Lambertian model. Accurate recovery of the ABRDF field from a single image with
cast shadows and the specularities with no lighting information remains a challenge. The
central problem in the single image case stems from the dearth of information to constrain
the space of all possible ABRDF fields. Use of strong prior information presents itself as a
potentially effective way to constrain the search space, but attempts so far (e.g. [15]) suffer
from the need for manual intervention and cumbersome computational requirements.
We will explore the use of Tensor Splines ABRDF fields as prior information to
meaningfully predict the ABRDF fields from single input images in Chapter 4. The use of
a shape prior can also potentially aid in shape recovery.
While relighting images in novel poses, we make the assumption that the ABRDF
field maintains the same specular information across poses. Though practically useful,
this is not fully valid. We have dealt with the specularities in a data-driven fashion, but
an attempt can be made to model the specularities explicitly, which we would like to
explore in the future. It should be noted that though the problem of detecting specularities
is relatively well studied, the problem of realistically predicting specularities in novel
poses without using specialized imaging tricks (like special filters) remains challenging.
Possible improvements can also be made to our model by incorporating non-uniform
smoothness as opposed to the current setup.
Besides the relighting and the pose change applications described in the chapter, our
technique can also be used for image up-sampling and compression. The former is possible
because the Tensor Splines representation creates a continuous field of the ABRDF
coefficients across the image, which can be sampled at a sub-pixel resolution. The latter
exploits the capability of the ABRDFs to represent images of a face under infinitely many
lighting directions using just a few coefficients per pixel.
In conclusion, the Tensor Splines framework for the analysis and modeling of the
illumination and the pose variation of facial images provides a useful alternative to the
Lambertian assumption. It also seems that the collective analysis of the shape and the
reflectance through the ABRDFs is promising as an alternative to separate facial BRDF
and shape analysis.
CHAPTER 3
EIGENBUBBLES: THE ENHANCED ABRDF REPRESENTATION
3.1 Introduction
Thus far we have presented a method for ABRDF field estimation using nine
or more samples provided as input. But since we are interested in a very specific class
of spherical functions, the human facial ABRDFs, we should be able to improve their
representation by including knowledge derived from observing and analyzing the
ABRDFs from various locations on various faces. To this end, we seek a few salient
spherical functions which define a natural subspace for representing the ABRDFs.
3.2 Eigenbubbles
An intrinsic subspace for the ABRDF representation is the one spanned by the
cluster centers obtained from k-means clustering of the ABRDFs. We can avoid the
initialization issues associated with clustering by using the fact that such a subspace
can also be obtained by the spectral expansion of the data covariance matrix [59]. The
basis elements of such a subspace are themselves spherical functions, and we call them
Eigenbubbles.
The only change that we make to the ABRDF representation from Chapter 2 is
to replace the Cartesian tensors in the Tensor Splines representation of the spherical
functions with the Spherical Harmonics basis. We denote the real spherical harmonic
basis functions as \Psi_l^m (order l, degree m), with l = 0, 1, 2, . . . and -l \le m \le l:

\Psi_l^m(\theta, \phi) = \sqrt{\frac{2l+1}{4\pi} \frac{(l-m)!}{(l+m)!}} \, P_l^{|m|}(\theta, \phi) \, \Phi_m(\theta, \phi),    (3–1)

where the P_l^{|m|} are the associated Legendre functions and \Phi_m(\theta, \phi) is defined as

\Phi_m(\theta, \phi) = \begin{cases} \sqrt{2} \cos(m\phi) & m > 0, \\ 1 & m = 0, \\ \sqrt{2} \sin(|m|\phi) & m < 0. \end{cases}    (3–2)
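A sketch of this basis evaluation using SciPy's associated Legendre function; here theta is the polar angle and phi the azimuth, and the use of |m| in the normalization factor is our assumption (the standard real-harmonic convention).

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def real_sph_harm(l, m, theta, phi):
    # Eqs. 3-1 and 3-2: real spherical harmonic basis Psi_l^m(theta, phi).
    norm = np.sqrt((2 * l + 1) / (4 * np.pi)
                   * factorial(l - abs(m)) / factorial(l + abs(m)))
    P = lpmv(abs(m), l, np.cos(theta))        # associated Legendre P_l^{|m|}
    if m > 0:
        Phi = np.sqrt(2.0) * np.cos(m * phi)
    elif m < 0:
        Phi = np.sqrt(2.0) * np.sin(abs(m) * phi)
    else:
        Phi = 1.0
    return norm * P * Phi
```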
We chose the Spherical Harmonics basis for this enhanced representation since
empirically we found that it leads to more accurate relighting after the ABRDF
enhancement than the Cartesian tensor basis. Note that the objective function for
ABRDF estimation, the spline based smoothness constraints and the numerical optimization
scheme remain exactly the same as before.
Now, given a bag of N ABRDFs \{\alpha_i\} (where the \alpha_i are the Spherical Harmonic
coefficient vectors of the ABRDFs), we define the mean ABRDF as

\bar{\alpha} = \sum_i \alpha_i / N.    (3–3)

The data covariance matrix can thus be defined as

C = \sum_i (\alpha_i - \bar{\alpha})(\alpha_i - \bar{\alpha})^T / N.    (3–4)
Note that even though the number of ABRDFs can be large, the covariance matrix
has dimensions 10 × 10 for the third order ABRDF estimation. Next we decompose the
square matrix C into its eigenvectors (V) and eigenvalues (U) as

C = V U V^T.    (3–5)

Arranged in decreasing order of the corresponding eigenvalues, the eigenvectors, in our
case the Eigenbubbles, define a low-dimensional subspace for the ABRDF representation.
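The Eigenbubble computation reduces to a small eigendecomposition; a minimal sketch, assuming 3rd order spherical-harmonic coefficient vectors of length 10:

```python
import numpy as np

def eigenbubbles(alphas):
    # alphas: (N, 10) spherical-harmonic coefficient vectors of the ABRDFs.
    mean = alphas.mean(axis=0)                       # Eq. 3-3
    centered = alphas - mean
    C = centered.T @ centered / len(alphas)          # Eq. 3-4: 10 x 10 covariance
    evals, evecs = np.linalg.eigh(C)                 # Eq. 3-5
    order = np.argsort(evals)[::-1]                  # decreasing eigenvalue
    return mean, evals[order], evecs[:, order]       # columns are Eigenbubbles
```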
At this point we define two kinds of Eigenbubbles - Local and Global. Global
Eigenbubbles are defined to be the ones obtained by putting the ABRDFs from all
different locations and different individuals into the initial bag of ABRDFs. On the other
hand, Local Eigenbubbles are those obtained separately for each pixel, by considering
only the ABRDFs lying at the same (corresponding) pixel locations, from all the given
faces. For this we only require a rough alignment of the faces (e.g. alignment of eyes)
since we assume that the ABRDFs in the same neighborhoods are similar. These two
Figure 3-1. Global Eigenbubbles learnt from the Extended Yale B face database. Eigenvalues increase from left to right in row one and then in row two.
definitions allow us to analyze the impact of using Eigenbubbles for the ABRDF
representation in the next section.
3.3 Experiments & Discussion
3.3.1 Relighting
In Fig. 3-1 the first 10 Global Eigenbubbles learnt from the Extended Yale B
face database are shown. It can be noted that the Eigenbubbles corresponding to
lower eigenvalues consist of a smaller number of blunt lobes, while the Eigenbubbles
corresponding to higher eigenvalues show a larger number of sharper peaks. This indicates
that the Eigenbubbles corresponding to higher eigenvalues encode higher frequency details
of the facial ABRDF fields. As mentioned before, the significance of these Eigenbubbles
lies in the fact that an ABRDF function at any location on any face can be represented
with high accuracy as a linear combination of these spherical functions. Next, we present
the first 10 Local Eigenbubbles in Fig. 3-2, at two different locations on a human face. It
can be noted that the shapes of these two sets of spherical functions are quite different
from each other, as they are learnt using the ABRDF functions from two different regions
of the face. The forehead shows much more uniform variation in intensity values as
compared to the shadow-infested nasal region.
Next we look at the actual ABRDF functions which are represented using the
Eigenbubble framework. In Fig. 3-3 the estimated ABRDF field has been shown
superimposed on a face. Larger images of individual ABRDFs represented using
Figure 3-2. Local Eigenbubbles learnt from two different locations on a face. Eigenvalues increase from left to right in row one and then in row two.
Figure 3-3. Estimated ABRDF functions at various locations on a human face.
Eigenbubbles show that the high fidelity representation of the intensity variation at pixels
affected by the cast shadows and the specularities requires more complicated spherical
functions than the half-cosine bump used in the Lambertian model.
In order to analyze and understand the representational power of the Global and the
Local Eigenbubbles, in the next set of experiments we qualitatively and quantitatively
examine the quality of the synthesized relighted images using the proposed techniques.
Figure 3-4. On the left (5 columns), synthesized images using the Global Eigenbubbles are shown. From top to bottom, each row uses 1, 2, 3, 5 and 10 Eigenbubbles respectively for its ABRDF field representation. The same setup is repeated on the right using the Local Eigenbubbles. The azimuth and the elevation angles of the illumination source are given at the bottom right corner of each image. All ABRDF fields were estimated using 9 input images illuminated from the (-20,60), (0,45), (20,60), (-50,0), (0,0), (50,0), (-50,-40), (0,-35) and (50,40) directions.
First we present synthesized images of a subject from the Extended Yale B dataset
while using varying number of subspace dimensions for its ABRDF field representation
in Fig. 3-4. The first 5 columns on the left show images synthesized using the Global
Eigenbubbles while the last five columns show images synthesized using the Local
Eigenbubbles. From top to bottom each row shows images generated from the ABRDF
field represented using 1, 2, 3, 5 and 10 eigenbubbles respectively. All the ABRDF fields
were estimated using 9 input images. These images have been synthesized using novel
illumination directions whose azimuthal and elevation angles are mentioned in the bottom
right corner of each image. From these images it can be noted, foremost, that as the
subspace dimension is increased, so does the visual quality of the images. Secondly, the
quality of the images produced using the Local Eigenbubbles, especially for low
dimensional subspaces, is better than when the Global Eigenbubbles are used. Thirdly,
for both types
of Eigenbubbles, visually high quality images are rendered for 5 and higher dimensional
representations. Accurate depiction of the shadows, both attached and cast, and the
specularities can be readily noted from the images.
To demonstrate the versatility of the Global Eigenbubbles for accurately representing
the facial ABRDF across different face types and databases, we present novel images
of various faces rendered using our technique in Fig. 3-5. When applied to each channel
independently, the ABRDF framework proposed here extends easily to color images,
as demonstrated in the first four rows, where subjects from the CMU PIE dataset are
rendered under novel illumination conditions. Global Eigenbubbles were learnt for each
channel separately. The bottom two rows show images of two subjects from the Extended
Yale B database. The Global Eigenbubbles used here were the same as those used in
Fig. 3-4. Using faces belonging to different races, we have demonstrated the capability of
the Global Eigenbubbles to represent the ABRDFs of surfaces which can have somewhat
different surface properties. Note that the shadows have been crisply generated and the
specularities have been meaningfully rendered.
Next we quantitatively examine the images generated using our method. For this
experiment we have chosen the Extended Yale B dataset as it provides images of each
subject taken in 64 point illumination conditions which can be used as the ground-truth
for evaluating our method. Besides the proposed enhanced ABRDF representation method
we have also included results from the Tensor Splines method and the Lambertian model,
examined in [60], in order to compare our results with these alternate methods. Using 9
images per subject, we synthesized images in all 64 illumination directions using just the
Spline Modulated Spherical Harmonics, the Local and Global Eigenbubbles, the Tensor
Splines model and the Lambertian model and then computed the pixel-wise intensity
errors for each using the ground-truth images. We have presented these results in Fig. 3-6.
It can be noted that the Lambertian model, with its limited representative power in the
presence of the cast shadows and the specularities performs the worst. As expected, the
Figure 3-5. The first four rows show color images of subjects from the CMU PIE database, while the last two rows show images of subjects from the Extended Yale B dataset. The (azimuth, elevation) angles of the illumination source are mentioned in the bottom right corner of each image. Input images for the CMU PIE subjects were lit from the (32,2), (0,-9), (32,-9), (0,3), (-32,-8), (-32,3), (38,6), (0,10), (-32,5) directions, while for the Extended Yale B subjects we used the same directions as in Fig. 3-4.
Figure 3-6. Per pixel intensity errors obtained using the proposed and the state-of-the-art methods. The Y-axis represents the error while the X-axis represents the dimension of the subspace used in the Eigenbubble representation. Note that the error rates do not vary along the X-axis for the Spline Modulated Spherical Harmonics, the Tensor Splines model and the Lambertian model.
Figure 3-7. Qualitative comparison between the proposed technique, the Tensor Splines model and the Lambertian model [60].
Local Eigenbubbles outperform the Global Eigenbubbles, but both of these techniques
outperform the Tensor Splines model. The improvement in the image quality as the
number of subspace dimensions is increased, as apparent in the images presented in Fig.
3-4, can be clearly noted in Fig. 3-6.
The observed superiority of the proposed technique over the state-of-the-art methods
in terms of relighting with faithful shadows and specularities reproduction can be visually
noted in Fig. 3-7. Here we have shown a representative image rendered using various
Figure 3-8. Novel images generated under extreme illumination directions. The central figure shows the illumination directions of the input images as blue circles and the illumination directions of the novel images as red squares.
methods along with the ground truth image. It can be noted that the Lambertian
model largely fails to model the specularities (green arrow) and the cast shadows (red
arrows). The Tensor Splines model does a better job than the Lambertian model but the
shadows and specularities are smudged on account of its assumed across-field-smoothness.
Our method, Eigenbubbles, on the other hand, produces results which accurately
depict the cast shadows and the specularities. Here we have used the
Local Eigenbubbles which do not suffer from the smoothing artifacts present in the Tensor
Splines model.
3.3.2 Face Recognition
Since our technique can produce novel images using only a few input images, it can
aid face recognition by augmenting the gallery set with new images of the given subjects.
We test the performance of our method in aiding face recognition using the Extended Yale
B database since most of the other methods also present results on this dataset. For each
Table 3-1. Face recognition error rates. N is the number of input images.

Method                            N   Set 1&2   Set 3   Set 4   Total
Correlation (PAMI 1993) [55]      4   0.0       23.3    73.6    29.1
Eigenfaces (CVPR 1994) [56]       6   0.0       25.8    75.7    30.4
Linear subspace (CVPR 2001) [57]  7   0.0       0.0     15.0    4.7
Cones-attached (PAMI 2001) [12]   7   0.0       0.0     8.6     2.7
Cones-cast (PAMI 2001) [12]       7   0.0       0.0     0.0     0.0
9PL (PAMI 2005) [37]              9   0.0       0.0     2.8     0.8
3D SH (PAMI 2006) [11]            1   0.0       0.0     2.8     0.8
Harmonic (SFS) (IJCV 2008) [58]   1   0.0       0.0     12.8    4.0
Tensor Splines                    9   0.0       0.0     1.6     0.5
Eigenbubbles                      9   0.0       0.0     0.9     0.29
subject, 9 images are used as the gallery set while the rest are used as probes. Using the
Global Eigenbubbles, we generate many more images of each subject by uniformly sampling
the illumination direction sphere. The final classification is performed using the Nearest
Neighbor classifier. The results obtained (split along the subsets) are presented in
Table 3-1. The illumination conditions get harsher from subset 1 to 4, and so does the
difficulty of the recognition task. It can be noted that our method produces extremely small error rates
comparable to the state-of-the-art. We must emphasize that in general, face recognition
systems use more sophisticated classifiers than the simple Nearest Neighbor classifier
used by us and thus instead of a face recognition system, our method should be seen as
a database augmentation method which can aid any other face classification method by
adding meaningful gallery images to the training set.
3.3.3 ABRDF Field Compression
A novel application of our method is the compression of the ABRDF fields. We
demonstrated earlier (Fig. 3-6) that the Global Eigenbubble based representation of the
ABRDF can generate high quality images even when only a 5-dimensional subspace is
used to represent the ABRDFs. In terms of space requirements, this translates to storing
only 5 coefficients per pixel and the 5 Global Eigenbubbles, in order to generate images
under any arbitrary illumination condition. If we consider only the 64 point source
lit images present in the Extended Yale B database, it takes about 2 MB of memory
Table 3-2. ABRDF field compression. D is the subspace dimension and the errors are the intensity errors per pixel.

D    Raw       Eigenbubbles   Reduction Ratio   Error/Pixel
1    2016 KB   127.6 KB       6.25 %            18.3
2    2016 KB   253.6 KB       12.50 %           15.0
3    2016 KB   379.6 KB       18.75 %           13.8
4    2016 KB   505.6 KB       25.00 %           13.9
5    2016 KB   631.6 KB       31.24 %           13.8
6    2016 KB   757.6 KB       37.51 %           14.2
7    2016 KB   883.6 KB       43.76 %           13.3
8    2016 KB   1009.6 KB      50.01 %           13.2
9    2016 KB   1135.6 KB      56.26 %           13.1
10   2016 KB   1261.6 KB      62.51 %           12.9
to store the raw 192 × 168 images for each subject, while using the Eigenbubble based
representation with 5 coefficients, it takes less than one third of that memory to
summarize all the ABRDFs. This advantage increases if we consider more than 64 images
since the Eigenbubble based ABRDF representation can generate as many images as
desired. It must be noted that our technique cannot be directly compared with the image
compression methods as the ABRDF fields can generate as many images as desired, while
the image compression techniques cannot. Detailed results for various subspace dimensions
are presented in Table 3-2.
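The raw-storage figure can be verified with a quick calculation, assuming 8-bit pixels for the raw images and 32-bit floats for the Eigenbubble coefficients:

```python
H, W, n_images = 192, 168, 64
raw_kb = H * W * n_images / 1024    # 2016 KB, matching the table
per_dim_kb = H * W * 4 / 1024       # 126 KB per subspace dimension,
                                    # consistent with the ~126 KB row steps
print(raw_kb, per_dim_kb)
```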
3.4 Conclusions
In this chapter we have shown that face-specific knowledge can be used to
enhance the quality of the relighted images. Using the Spherical Harmonics basis and
Principal Component Analysis, we demonstrated that the resulting high quality relighted
images can be used to improve face recognition as well as to effectively summarize large ABRDF
fields using only a fraction of the required memory.
CHAPTER 4
FACE RELIGHTING AND POSE CHANGE WITH SINGLE IMAGE
4.1 Introduction
In Chapter 2 a solution to the facial relighting and pose change problem with multiple
input images was presented; in this chapter we examine a related but harder problem,
relighting and pose change of faces using a single image as input. Most of the
applications associated with multiple-image relighting and pose change, like
post-production video scene editing or face recognition, naturally extend to the case with
a single input image as well.
In terms of ABRDF field estimation, the fundamental difference between the
multiple-image and single-image cases lies in the amount of face-specific background
knowledge required. In the case of multiple input images, since multiple samples of the
ABRDFs are available, interpolation and extrapolation using appropriate mathematical
functions can provide good approximations to the underlying ABRDF field. But in the
case of a single input image, only one sample of the ABRDF at each image pixel is
provided, and hence it becomes imperative to use background knowledge about facial
ABRDFs in order to hallucinate the complete ABRDF.
The above mentioned background knowledge can be derived from a number of
sources. One possible option is to use the sample values at the neighboring pixels in order
to better estimate the ABRDF at a given pixel. Another approach can exploit the known
ABRDF field of some reference face to generate the ABRDF field of the given input
image. This idea can be further extended to use multiple such reference ABRDF fields in
order to generate the ABRDF field of the given input. In this work we use a combination
of these possible techniques to estimate the ABRDF field of a given face.
The rest of this chapter is organized as follows: in Section 2 we begin with a survey of
a few relighting and pose change methods which are specifically meant to work with a
single input image. These may have some overlap with the techniques mentioned in Chapter 2
since the two considered problems are closely related. In Section 3 we present a general
overview of our technique and also explicitly bring out the underlying assumptions. In
Sections 4 and 5 we describe our technique in detail, and in Section 6 we present various
experimental results comparing our results with those from existing methods. Finally, in
Section 7 we conclude the discussion on single image relighting.
4.2 Related Work
Before we review the literature we must mention that the single image relighting
and pose change techniques in the literature are primarily driven by the face recognition
application. This is so because the other major applications, relighting in movies and
video games, require reasonably high quality relighted and pose-changed images, which
are, most of the time, difficult to generate using a single image as input. That said, a
few methods do produce relighting and pose change results using a single input image,
and so do we.
The subset of techniques that are most closely related to our technique consists of the
methods that involve fitting a canonical reflectance-shape face model to the given input
image. Using the single input image, the fitting process is expected to warp or modify
the canonical face model in such a way that the end result is the reflectance-shape model
for the given face, which can then be used for relighting and pose change. Morphable
Models [10] and Spherical Harmonics Morphable Models [11] are the prime examples of
such techniques. Both of these methods involve building shape and texture models using
a bootstrap set of images and 3D shapes, which are then fit onto the given input image
using some non-linear optimization techniques. The later of these uses the Spherical
Harmonics based lighting model instead of the Phong’s lighting model used in the former.
The upside of using these methods is that they can obtain the reflectance field as well as
the shape in one shot. On the downside, these methods require manual initialization of
the optimization process, which itself can be quite cumbersome and susceptible to local
minima.
Computationally more efficient techniques include methods like the Quotient Image
([17]) and the Practical Relighting ([61]). These methods use a bootstrap set of images
and/or shape, but assume them to be in a rough alignment with the input image. This
assumption alleviates the need for computationally expensive non-rigid alignment at
the cost of some loss in the results’ quality. Given a bootstrap set of 3 images of at
least one other face, the Quotient Image method computes the illumination neutral
"Quotient Image" from a given input image. This illumination neutral image can then
be combined with various lighting directions and intensities to produce various relighted
images. Though the Quotient Image method works with the Lambertian assumption,
surface normals are not required to be explicitly computed. On the other hand, the
Practical Relighting technique assumes a mean face shape and a mean face texture to
be applicable to all the faces. Using the bootstrap shape and texture models with the
Lambertian model, it iteratively recovers the albedo and the lighting in the input image.
Another interesting piece of work was presented in [52], which moves beyond the use of the
Lambertian model and uses statistical models in conjunction with the shape-from-shading
method to recover the shape and the reflectance of the face in the given input image.
4.3 Overview
Our method tries to strike a balance between computational efficiency and result
quality. It requires a 2D fitting of a reference ABRDF field to the given input
image(s) and, as the end result, provides the ABRDF field for the given input face. Our
method can work with either multiple input images or a single one, but here we focus on
its behavior with a single input image. Since our fitting procedure does not involve any
3D to 2D fitting, practical drawbacks encountered in the Morphable Models class of
techniques (e.g. [11], [10]), like the manual initialization and the cumbersome
optimization procedures, are avoided. Our method is composed of two parts: building a
reference reflectance model and fitting this model to the given input image(s). It is fully
automated, with the only requirement that at least one of the input images be somewhat
frontally lit. It also
assumes that if multiple images are provided as input, they are all aligned amongst
themselves. In addition to estimating the ABRDF field of the given face, our
method also recovers an estimate for the lighting condition in the given images.
Our reference reflectance model differs from the reference models used by existing
techniques in that, though it incorporates the 3D shape information, it is fully defined by a 2D
field of spherical functions. This allows the fitting of the model to be carried out without
any 3D to 2D projections. In the following sections we first describe our reference model
and then the fitting procedure in detail.
4.4 The Reference ABRDF Field Model
The first part of our technique involves building a reference ABRDF model, which
captures a sufficient amount of the variation seen in facial ABRDF fields, so that it can
be readily customized for any given input face. We begin by breaking down the facial
ABRDF field into two parts: one which is illumination dependent (referred to here as the
illumination model) and one which is illumination independent. In the computer vision
and graphics literature the latter is often referred to as the texture of the face.
Interestingly, which part of a given ABRDF field constitutes the texture and which makes
up the illumination model has not been well defined in the literature. Most often the
definitions of the texture and the illumination model depend on the assumed reflectance
model. For instance, in the Lambertian model, the albedo, a constant scaling factor at
each pixel, is commonly accepted as the texture, while the half-cosine term is considered
to be the illumination function.
We have chosen to use definitions of the texture and the illumination models which
are independent of any particular reflectance model. In a data driven fashion, we define
the texture at a pixel to be the mean value of the ABRDF, while the quotient function
obtained by dividing the ABRDF with the texture is defined to be the illumination
function (illumination model is a field of such illumination functions). In other words,
texture is taken to be the average of all possible intensity values obtained at a pixel by
varying the point source lighting direction. This scalar value is illumination neutral and in
practice generates a scalar field for a given face which looks illumination independent.
With these definitions, the ABRDF field of any given face can be easily factored into
a scalar field called texture and an illumination model. Thus, our reference ABRDF model
is composed of a reference texture model and a reference illumination model. Now in order
to build the reference texture and illumination models, we use the technique developed
in Chapter 2. Given N subjects with m images each, under known point source lighting,
their ABRDF fields can be built using the Tensor Splines model (Eigenbubbles can also be
used). The average of the m images, which uniformly sample the lighting directions,
is taken as an estimate of the texture, while the quotient function field obtained by
dividing each ABRDF by the texture value is taken as an estimate of the illumination
model. The texture images are then used to non-rigidly align all the N faces to a reference
face. The obtained deformation field is also used to align the illumination models of the
different subjects to the same reference face.
Using the 3rd order Tensor Splines representation, we obtain 10 coefficients per pixel
for the illumination model and a scalar value per pixel for the texture model. For each
subject, we string all the 10 coefficients at all the pixels into a single vector, L_s, and
similarly obtain the texture vector T_s. Further, we assume that both the texture
and the illumination model of any given face come from multivariate Gaussian
distributions. Given a bootstrap set of aligned facial textures (T_s) and illumination models
(L_s), the covariance matrices of the texture and the illumination model distributions can
be defined as C_T = \sum_s (T_s - \bar{T})(T_s - \bar{T})^T / N and
C_L = \sum_s (L_s - \bar{L})(L_s - \bar{L})^T / N respectively, where \bar{T} and \bar{L}
are the average texture and the average illumination models.
We obtain orthonormal bases for the covariance matrices C_T and C_L via
Principal Component Analysis, namely their eigenvectors t_j and l_i respectively. Note that the
eigenvectors here are ordered by decreasing eigenvalue. Hence, for any ABRDF
Figure 4-1. An overview of our technique.
field A, the texture T can be expressed as

T = \sum_j \beta_j t_j + \bar{T} (4–1)

while the illumination model L can be written as

L = \sum_i \alpha_i l_i + \bar{L}. (4–2)
The number of terms in both models can be chosen in accordance with the computational
and quality requirements. These two quantities can now be composed to recover the
ABRDF field. We refer to the set {t_j, \bar{T}, l_i, \bar{L}} as our reference ABRDF
field model.
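
To make this construction concrete, the following sketch shows one way the reference texture and illumination models could be assembled; the array names (images, abrdf) and the use of an SVD for the PCA step are illustrative assumptions, not our actual implementation.

    # A minimal sketch of building the reference model, assuming `images` is an
    # (N, m, P) array of m point-source-lit images per bootstrap subject (P pixels
    # each, already aligned) and `abrdf` is the (N, P, 10) array of 3rd order
    # Tensor Spline coefficients. All names here are hypothetical.
    import numpy as np

    def build_reference_model(images, abrdf, eps=1e-8):
        # Texture: mean over the uniformly sampled lighting directions.
        texture = images.mean(axis=1)                    # (N, P)
        # Illumination model: quotient of the ABRDF by the texture at each pixel.
        illum = abrdf / (texture[..., None] + eps)       # (N, P, 10)

        T = texture                                      # (N, P) texture vectors
        L = illum.reshape(len(illum), -1)                # (N, 10*P) vectors
        T_mean, L_mean = T.mean(axis=0), L.mean(axis=0)

        # PCA bases t_j and l_i via SVD of the centered data matrices
        # (equivalent to eigenvectors of C_T and C_L, ordered by eigenvalue).
        _, _, t_basis = np.linalg.svd(T - T_mean, full_matrices=False)
        _, _, l_basis = np.linalg.svd(L - L_mean, full_matrices=False)
        return T_mean, t_basis, L_mean, l_basis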
4.5 Model Fitting
Given k images of a face under point-source but unknown lighting conditions,
the problem now is to fit the reference model built in the previous section to the input
image(s). The unknowns include the non-rigid deformation required to align the reference
model and the input face, the lighting directions in the k images, the texture model
coefficients and the illumination model coefficients. We propose to recover these unknown
parameters by minimizing the following objective function

E_1(T_x, T_y, \alpha_i, \beta_j, \theta_k, \phi_k) = \sum_k \sum_{(x,y)} \| I_k(T_x(x,y), T_y(x,y)) - D(\sum_{i=1}^{n} \alpha_i l_i + \bar{L}, \theta_k, \phi_k, x, y) \cdot (\sum_{j=1}^{m} \beta_j t_j(x,y) + \bar{T}(x,y)) \|^2, (4–3)
where Tx and Ty are the x and y components of the non-rigid deformation applied to
the input images, θk and φk are the illumination directions of the k input images, αi are
the illumination coefficients and βj are the texture coefficients. The function D is an
abstraction of the process which takes the vector of coefficients and computes the spherical
illumination function at each pixel using the Tensor Splines basis. These functions are
then sampled at the location (x, y) in the direction (θ_k, φ_k) and scaled by the estimated
texture value at (x, y). The setup for the fitting procedure described above is
graphically depicted in Fig. 4-1.
In addition to the sum-of-differences objective function defined above, we constrain
the search space for the illumination model further by adding the following Tikhonov
regularizer to Eq. 4–3:

E_2(\alpha_i) = \lambda \sum_{i=1}^{n} \alpha_i^2, (4–4)
where λ is the regularization parameter. This constraint effectively keeps the estimated
illumination model from drifting too far from the mean illumination model and results in
artifact-free relighted images. The value of the parameter λ is set by the user based on the
desired relighted image quality.
We break down the process of recovering the unknowns into four steps. In the first
step, the input images are aligned with the reference ABRDF model. We use the 2D
Morphable Models [62] method to compute the non-rigid deformation parameters. The
inputs to this step are two images - the reference face image used to align the ABRDF
fields and the input image with somewhat frontal illumination. The outputs of this step are
the deformation parameters Tx and Ty which can be used to warp the input image(s) to
the reference model.
Next we compute the remaining unknowns by minimizing
E(αi, βj, θk, φk) = E1(Tx, Ty, αi, βj, θk, φk) + E2(αi) (4–5)
using a gradient descent based technique. The unknown illumination and texture
parameters are initialized with ones and the illumination directions are initialized with
zeros. In practice we have found that using MATLAB’s fminunc function with 200
iterations provides a good estimate of the unknown parameters.
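
As an illustration, the sketch below carries out this minimization with SciPy's quasi-Newton optimizer in place of MATLAB's fminunc; the renderer render, standing in for the function D composed with the texture scaling, is a hypothetical placeholder.

    # A hedged sketch of the coefficient recovery step. `warped` is the list of
    # pre-aligned input images and `render(alpha, beta, theta, phi)` is assumed
    # to synthesize an image from the model coefficients (the role of D and the
    # texture scaling in Eq. 4-3); both names are illustrative.
    import numpy as np
    from scipy.optimize import minimize

    def fit_model(warped, render, n_illum, n_tex, lam=0.1):
        k = len(warped)

        def objective(p):
            alpha, beta = p[:n_illum], p[n_illum:n_illum + n_tex]
            angles = p[n_illum + n_tex:].reshape(k, 2)   # (theta_k, phi_k) pairs
            data_term = sum(np.sum((img - render(alpha, beta, th, ph)) ** 2)
                            for img, (th, ph) in zip(warped, angles))
            return data_term + lam * np.sum(alpha ** 2)  # E1 + Tikhonov term E2

        # Coefficients initialized with ones, lighting directions with zeros.
        p0 = np.concatenate([np.ones(n_illum + n_tex), np.zeros(2 * k)])
        return minimize(objective, p0, method='BFGS', options={'maxiter': 200})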
Once the unknowns have been recovered, we have an estimate for the ABRDF field of
the input face but it is still aligned with the reference faces. To handle this, as the third
step in our model fitting process, we backward warp the estimated ABRDF field using the
deformation parameters computed earlier. But since the process described above involves
two registration steps, the resultant ABRDF field provides images that appear grainy.
In order to remove these interpolation artifacts, we have incorporated a final step in
the fitting process called quotient mapping. Towards this, we generate an image from the
computed ABRDF field with the same lighting direction as the near-frontal input image.
Note that the lighting direction for this image was computed as part of the optimization
procedure described above. Next we compute the quotient map by dividing the
near-frontal image by its synthesized estimate. This quotient map is then used to scale the
estimated ABRDF field, suppressing the artifacts introduced by interpolation and
extrapolation during the non-rigid alignments of the ABRDF field.
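
A minimal sketch of this quotient mapping step, under the assumption that synth is the image synthesized from the fitted ABRDF field using the estimated near-frontal lighting direction:

    # Scale every ABRDF coefficient by the per-pixel quotient of the observed
    # near-frontal image and its synthesized estimate; names are illustrative.
    import numpy as np

    def quotient_map_correction(abrdf, frontal, synth, eps=1e-8):
        q = frontal / (synth + eps)        # per-pixel quotient map
        return abrdf * q[..., None]        # suppress interpolation artifacts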
4.6 Experimental Results
In order to test the proposed model we conducted experiments using the Extended Yale B,
the CMU PIE and the MERL Dome face datasets. We used the Extended Yale B dataset
to build the reference illumination model since it contains a large number of good quality
point source lit images. The illumination model was obtained by taking the ABRDF fields
and normalizing the coefficient vectors. In the absence of a large number of texture images,
as defined above, we used 200 to 400 frontally lit images to build the texture model. We
Figure 4-2. Comparison of relighted images with ground truth images from the CMU PIE and MERL Dome databases.
used images from the CMU PIE and the MERL Dome databases as inputs to our
relighting method. Though our method can be used with multiple input images, here we
present results primarily for the single-input-image case since, as described before, it is
the more difficult and useful special case of the relighting and pose change problems.
4.6.1 Relighting
First, we compare the quality of the relighted images with the ground truth
images in Fig. 4-2. The top two rows show results for an input from the CMU PIE dataset
while the bottom two rows show results for an input from the MERL Dome dataset.
Relighted images are presented for various ground truth lighting directions. Note that the
closest relighted images were manually selected. We have also included a couple of images
under extreme illuminations, for which there were no ground truth images available. This
shows the importance of a good ABRDF reference model. Since the Extended Yale B
images had more extreme lighting examples than the CMU PIE or the MERL Dome
datasets, our method was able to predict appearances of these faces under more extreme
illumination than those included in the CMU PIE or the MERL Dome datasets.
Next, for two subjects each from CMU PIE (Fig. 4-3 and Fig. 4-4) and MERL Dome
(Fig. 4-5 and Fig. 4-6) datasets, we present the whole range of images generated by our
method as the elevation and azimuth angles of the point light source vary from −60◦
to +60◦. The gradual appearance of cast and attached shadows can be noted as the
illumination direction moves away from the frontal direction.
4.6.2 Pose Change
Using the procedure described in this Chapter, one can estimate the ABRDF field of
a face from as few as one input image. Since the estimated ABRDF field is of the same form
as those described in Chapter 2, we can use the shape recovery method described there to estimate
its shape too. In Fig. 4-7 we show novel poses rendered for three different faces from
the CMU PIE dataset using a single image as input.
4.7 Conclusions
In this chapter we have presented a novel scheme for single image relighting where we
use a non-Lambertian reference ABRDF field model and a 2D fitting procedure to obtain
the ABRDF field of a face from as few as one input image. From the relighted images
included in this chapter it can be noted that our method realistically reproduces complex
photo-effects like cast shadows and specularities in the synthesized images. Using
the procedure outlined before, we have also presented pose change results using a single
input image. Finally, one of the important applications of our single image relighting
framework, face recognition, is described in detail in Chapter 6.
Figure 4-3. Relighted images from the CMU PIE dataset. Left to right and top to bottom, the illumination angle varies from −60◦ to +60◦.
Figure 4-4. Relighted images from the CMU PIE dataset. Left to right and top to bottom, the illumination angle varies from −60◦ to +60◦.
Figure 4-5. Relighted images from the MERL DOME dataset. Left to right and top to bottom, the illumination angle varies from −60◦ to +60◦.
Figure 4-6. Relighted images from the MERL DOME dataset. Left to right and top to bottom, the illumination angle varies from −60◦ to +60◦.
Figure 4-7. Pose changed images of three subjects from the CMU PIE dataset. The single image used as the input is also shown in the first column.
CHAPTER 5
FACE RECOGNITION
5.1 Introduction
World events, especially in the last decade, have led to an increased interest in the
field of biometrics based person identification. Face recognition in particular, has attracted
prolific research in the computer vision and pattern recognition community. Even though
impressive strides have been made towards providing an ultimate solution to this problem,
significant and interesting problems still remain.
If we try to organize the body of literature in this field, a loose
dichotomy of approaches emerges. The first class of these tries to capture the physical
processes of image formation under various scene parameter variations like illumination
(Harmonic Image Exemplar [63], Generic ABRDF [3], Illumination Cone [64], Universal
Lighting [65]), pose (Shape + Spherical Harmonics Basis [66], Morphable Models [67]),
expression (Isometry-invariant Similarity [68], Geometry-Texture [69]) etc. In contrast,
the second class of approaches invokes mathematical and statistical tools to capture the
structure of the oft-invisible relations among the numbers that make up the face images.
These techniques explore the intrinsic data geometry assuming images to be either vectors
(e.g. Eigenfaces [70], Fisherfaces [71], Laplacianfaces [72], orthogonal Laplacianfaces
(OLAP) [73], Neighborhood Preserving Embedding [74], Marginal Fisher Analysis [75],
Laplacian Eigenmaps [76], Locally Linear Embedding [77], Locality Preserving Projections
[78], Kernel Locality Preserving Projections with Side Information (KLPPSI) [79],
MLASSO [80], Kernel Ridge Regression (KRR) [81]), or higher dimensional tensors (e.g.
Tensor Subspace Analysis [82], 2-Dimensional Linear Discriminant Analysis [83], Tensor
Marginal Fisher Analysis [75], Multi-Linear Discriminant Analysis [84], Tensorfaces [85],
Orthogonal Rank One Tensor Projection (ORO) [86], Tensor Average Neighborhood
Margin Maximization (TANMM) [87], Correlation Tensor Analysis (CTA) [88], Spectral
Regression [89], Regularized Discriminant Analysis [89], Smooth LDA [90]).
Figure 5-1. Structure of A^1_i and A^2_i for an image of size 5 × 5 and kernel of size 3 × 3. In the first row, 9 neighborhoods of the image I_i are highlighted. For the first order approximation, each of these neighborhoods becomes a row in A^1_i. For the second order case, we take all the second order combinations of pixel values in each neighborhood and use them as the first 81 (b^4) elements of a row in A^2_i. The remaining 9 (b^2) elements are simply the pixel values. Rows are numbered to show which neighborhood they correspond to.
A major advantage of the techniques in the first class comes from their being
generative in nature. This property allows these methods to accomplish tasks like face
relighting (e.g. [3],[63], [91]) or novel pose generation or complete 3D image reconstruction
(e.g. [66], [67]) in addition to recognition. At the same time, methods in the first class
tend to demand more side information from the data as compared to the second class of
methods (e.g. [3] requires illumination direction for the training set, [91] requires facial
feature points for initialization, etc.). The second class of methods is in a sense more
versatile, as such methods can be seamlessly applied to a variety of different image sets without any
significant requirement of side information.
The method that we propose in this chapter loosely falls into the second category
of techniques. We seek a mapping of face image patches such that in the range space,
discrimination among different classes is easier. We choose Volterra kernels to accomplish
this because they allow us to systematically build progressively better approximations to
such a mapping. Furthermore, Volterra kernels can be learnt in a data driven fashion,
which relieves us from being predisposed towards any fixed kernel form (e.g. Gaussian,
Figure 5-2. Training images from each class are stacked up and divided into equal sized patches. Corresponding patches from each class are then used to learn a higher order convolutional Volterra kernel by minimizing intraclass distance over interclass distance. We end up with one Volterra kernel per group of spatially corresponding patches. The size and the order of the kernel are held constant for a given training process. Note that the color images are only used for illustration; so far our implementation works with grayscale images.
Radial Basis Function etc). The face images in the range space are called Volterrafaces in
this chapter.
5.2 Volterra Kernel Approximations
From signal processing theory we know that a linear translation invariant (LTI)
functional \Im : H \to H, which maps the function x(t) to the function y(t), can be completely
described by a function h(t) as

\Im(x(t)) = y(t) = x(t) \otimes h(t) = \int_{-\infty}^{\infty} h(\tau) x(t - \tau) d\tau. (5–1)
Volterra series theory generalizes this concept and states that any non-linear translation
invariant functional \aleph : H \to H, which maps the function x(t) to the function y(t), can be
described by a sequence of functions h_n(t) as
\aleph(x(t)) = y(t) = \sum_{n=1}^{\infty} y_n(t) (5–2)

where

y_n(t) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h_n(\tau_1, \ldots, \tau_n) x(t - \tau_1) \cdots x(t - \tau_n) d\tau_1 \cdots d\tau_n. (5–3)
Here h_n(\tau_1, \ldots, \tau_n) are called the Volterra kernels of the functional. It must be noted that
the above equation generalizes seamlessly to 2-dimensional functions, I(u, v), such as
images. It can also be noted that eq. (5–1) is just a special case of the
more general eq. (5–3) in which only the first order term is taken into account.
Since we are interested in computing with this theory, we use the
following discrete form of eq. (5–3):

y_n(m) = \sum_{q_1=-\infty}^{\infty} \cdots \sum_{q_n=-\infty}^{\infty} h_n(q_1, \ldots, q_n) x(m - q_1) \cdots x(m - q_n). (5–4)
The infinite series form in eq. (5–4) does not lend itself well to practical implementation.
Further, for a given application, only the first few terms may suffice to give the desired
approximation of the functional. Thus we need a truncated form of the Volterra series, which
is denoted in this chapter as

\Im_p(x(m)) = \sum_{n=1}^{p} y_n(m) = x(m) \otimes_p h(m) (5–5)

where p denotes the maximal order of the terms taken into account for the approximation.
Note that in this truncated Volterra series representation, h(m) is a placeholder for all the
different orders of kernels.
In general, given a set of input functions I, we are interested in finding a functional
ℵ, such that ℵ(I) has some desired property. This desired property can be captured by
defining a goodness functional on the range space of ℵ. In cases when the explicit equation
relating the input set I to ℵ(I) is known, various techniques like harmonic input method,
direct expansion etc. ([92]) can be used to compute kernels of the unknown functional. In
absence of such an explicit relation, we propose that Volterra kernels be learnt from data
using the goodness functional. The translation invariance property of the Volterra kernels
ensures that if the images are translated by a fixed amount in the domain, the mapped images
are also translated by the same amount, and hence the Volterra kernel mapping is stable.
In this framework, the problem of pattern classification can be posed as follows.
Given a set of input data I = {gi} where i = 1 . . . N , a set of classes C = {ck} where
k = 1 . . . K and a mapping which associates each gi to a class ck, find a functional such
that in the range space, data ℵ(I) is easily classifiable. Here the goodness functional could
be a measure of separability of classes in the range space. Once the Volterra kernels have
been determined, a new data point can be classified using the learnt functional. ℵ(I)
can be approximated to appropriate accuracy based on computational efficiency and
classification accuracy constraints.
5.3 Kernel Computation as Generalized Eigenvalue Problem
For the specific task of image classification, we define the problem as follows. Given
a set of input images (2D functions) I, the training set, where each image belongs to a
particular class c_k ∈ C, compute the Volterra kernels for the unknown functional N which maps the
images in such a manner that the goodness functional O is minimized in the range space of
N. The functional O measures the departure from complete separability of the data in the range space.
In this chapter we seek a functional N that maps all images from the same class in a
manner such that the intraclass L2 distance is minimized while the interclass L2 distance
is maximized. Once N has been determined, a new image can be classified in the mapped
space using any method such as the Nearest Centroid or Nearest Neighbor classifier.
With this observation, we define the goodness functional O as
O(I) = \frac{\sum_{c_k \in C} \sum_{i,j \in c_k} \| N(I_i) - N(I_j) \|^2}{\sum_{c_k \in C} \sum_{m \in c_k, n \notin c_k} \| N(I_m) - N(I_n) \|^2} (5–6)
Figure 5-3. Testing involves dividing the test image according to the scheme used while training. Then each patch is mapped to the range space by the corresponding Volterra kernel. After the mapping, each patch is classified using a Nearest Neighbor classifier in the range space. After all patches have been individually classified, each patch from the test image casts a vote towards the parent image classification. The class with the maximum votes wins.
where the numerator measures the aggregate intraclass distance for all classes and
the denominator measures the aggregate distance of class ck from all other classes in C.
Equation (5–6) can be further expanded as
O(I) = \frac{\sum_{c_k \in C} \sum_{i,j \in c_k} \| I_i \otimes_p K - I_j \otimes_p K \|^2}{\sum_{c_k \in C} \sum_{m \in c_k, n \notin c_k} \| I_m \otimes_p K - I_n \otimes_p K \|^2} (5–7)
where K, like h(t) in eq. (5–5), is a placeholder for all different order convolution kernels.
At this juncture we make the linear nature of convolution explicit by converting the
convolution operation to multiplication. This conversion to explicit linear transformational
form can be done in many ways, but as the convolution kernel is the unknown in our
setup, we wish to keep it as a vector, and thus we transform the image I_i into a new
representation A^p_i such that
I_i \otimes_p K = A^p_i \cdot \mathbf{K} (5–8)

where \mathbf{K} is the vectorized form of the 2D masks represented by K.
The exact form of A^p_i depends on the order p of the convolutions. In Section 5.5 we
present results for up to the second order approximation, and thus the structure
of A^p_i is explained only up to second order; it should be noted, however, that the recognition
framework using Volterra kernels that we propose is very general, and the structure of A^p_i
for any order can be derived analogously.
5.3.1 First Order Approximation
For an image I_i of size m × n pixels and a first order kernel K_1 of size b × b, the
transformed matrix A^1_i has dimensions mn × b^2. It is built by taking the b × b neighborhood
at each pixel of I_i, vectorizing it, and stacking the resulting rows one on top of the other.
This procedure is illustrated for an image of size 5 × 5 and a kernel of size 3 × 3 in Fig. 5-1
([4]). Border pixels can be ignored, or taken into account during convolution by padding
the image with zeros, without affecting the performance significantly.
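
As an illustration, the following sketch builds this matrix for a grayscale image with zero padding at the border; the function name is ours, and the rows implement the cross-correlation form of the operator, which is equivalent here since the kernel itself is the quantity being learned.

    # A sketch of building A^1_i: every b x b neighborhood of the image becomes
    # one row, so that the convolution I (*) K equals A @ K for a vectorized K.
    import numpy as np

    def first_order_matrix(image, b=3):
        m, n = image.shape
        r = b // 2
        padded = np.pad(image, r, mode='constant')   # zero padding at the border
        rows = [padded[y:y + b, x:x + b].ravel()
                for y in range(m) for x in range(n)]
        return np.array(rows)                        # shape (m*n, b*b)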
Substituting the above defined representation for convolution in eq. (5–7), we obtain

O(I) = \frac{\sum_{c_k \in C} \sum_{i,j \in c_k} \| A^p_i \cdot K_1 - A^p_j \cdot K_1 \|^2}{\sum_{c_k \in C} \sum_{m \in c_k, n \notin c_k} \| A^p_m \cdot K_1 - A^p_n \cdot K_1 \|^2}. (5–9)
This can be written as

O(I) = \frac{K_1^T S_W K_1}{K_1^T S_B K_1} (5–10)
where

S_W = \sum_{c_k \in C} \sum_{i,j \in c_k} (A^p_i - A^p_j)^T (A^p_i - A^p_j) (5–11)

and

S_B = \sum_{c_k \in C} \sum_{m \in c_k, n \notin c_k} (A^p_m - A^p_n)^T (A^p_m - A^p_n). (5–12)
Here S_W and S_B are symmetric matrices of dimensions only b^2 × b^2. Seeking the
minimum of eq. (5–10) leads to a generalized eigenvalue problem: the minimum of O(I)
is given by the minimum eigenvalue of S_B^{-1} S_W, and it is attained when
K_1 equals the corresponding eigenvector.
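
The computation just described can be sketched as follows; the data layout mats (class label to list of A-matrices) is an assumption for illustration, and S_B is taken to be positive definite so that the generalized symmetric eigensolver applies.

    # Accumulate S_W and S_B from the A-matrices of the training patches and take
    # the eigenvector of the smallest generalized eigenvalue as the kernel.
    import numpy as np
    from scipy.linalg import eigh

    def learn_kernel(mats):
        d = next(iter(mats.values()))[0].shape[1]
        SW, SB = np.zeros((d, d)), np.zeros((d, d))
        for c, As in mats.items():
            for i, Ai in enumerate(As):
                for Aj in As[i + 1:]:                      # intraclass pairs
                    D = Ai - Aj
                    SW += D.T @ D
            others = [A for c2, As2 in mats.items() if c2 != c for A in As2]
            for Ai in As:
                for An in others:                          # interclass pairs
                    D = Ai - An
                    SB += D.T @ D
        # Smallest eigenvalue of the pencil S_W v = lambda S_B v.
        _, V = eigh(SW, SB)
        return V[:, 0]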
5.3.2 Second Order Approximation
The second order approximation of the sought functional contains two terms:

y(m) = \sum_{q_1=-\infty}^{\infty} h_1(q_1) x(m - q_1) + \sum_{q_1=-\infty}^{\infty} \sum_{q_2=-\infty}^{\infty} h_2(q_1, q_2) x(m - q_1) x(m - q_2). (5–13)
The first term in eq. (5–13) corresponds to a weighted sum of the first order terms,
x(m − q_1), while the second term corresponds to a weighted sum of the second order
terms, x(m − q_1)x(m − q_2). For an image I_i of size m × n pixels and kernels of size b × b,
the transformed matrix A^2_i for the second order approximation in eq. (5–8) has dimensions
mn × (b^4 + b^2), and the kernel vector that multiplies it, K_2, has dimensions (b^4 + b^2) × 1.
A^2_i is built by taking the b × b neighborhood at each pixel of I_i, generating all second
degree combinations from the neighborhood, vectorizing them, concatenating the first degree
terms, and stacking the resulting rows one on top of the other. K_2 is formed by concatenating
the vectorized second and first order kernels. The structure of A^2_i for a 5 × 5 image and 3 × 3 kernels is
illustrated in Fig. 5-1 ([4]). It must be noted that the problem is still linear in the variables
being solved for; in fact, this formulation ensures that regardless of the order of
approximation, the problem is linear in the coefficients of the Volterra convolution kernels.
With this definition of A^2_i we proceed as for the first order approximation to obtain
equations (5–9) and (5–10), with the difference being that the matrices S_B and S_W now
have dimensions (b^4 + b^2) × (b^4 + b^2). Here we must point out an important modification
to the structure of A^2_i which allows us to reduce the size of the matrices. The second order
convolution kernels in the Volterra series are required to be symmetric ([92]), and this
symmetry also manifests itself in the structure of A^2_i. By allowing only unique entries
in A^2_i we can reduce its dimensions to mn × (b^4 + 3b^2)/2 and the dimensions of the matrices
S_B and S_W to (b^4 + 3b^2)/2 × (b^4 + 3b^2)/2. Now, as in the first order approximation, the minimum of
O(I) is given by the minimum eigenvalue of S_B^{-1} S_W, and it is attained when K_2 equals
the corresponding eigenvector.
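
For concreteness, one row of A^2_i under this symmetry reduction could be assembled as in the sketch below; the function name is illustrative.

    # Unique second-degree products (upper triangle, exploiting kernel symmetry)
    # concatenated with the first-degree pixel values of one b x b neighborhood,
    # giving (b^4 + 3b^2)/2 entries per row.
    import numpy as np

    def second_order_row(neighborhood):
        v = neighborhood.ravel()                     # b^2 first order terms
        iu = np.triu_indices(len(v))                 # unique quadratic index pairs
        quad = (v[:, None] * v[None, :])[iu]         # b^2 (b^2 + 1) / 2 products
        return np.concatenate([quad, v])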
5.4 Training and Testing Algorithms
The mapping developed in the previous section can be made more expressive if the
image is divided into smaller patches and each patch is allowed to be independently
mapped to its own discriminative space. This allows us to develop separate kernels for
different face regions. Further, if each constituent patch casts a vote towards the parent
image classification, it brings in robustness to the overall image classification. Thus we
adopt a patch based framework for the face recognition task.
Training (Fig. 5-2 [4]) in the proposed framework involves learning a Volterra kernel
from the corresponding patches of the training images. Testing (Fig. 5-3 [4]) in our
framework involves two stages. The first stage classifies each patch using the mapping
framework described in the previous section and a nearest neighbor classifier. In the
second stage, each patch casts a vote towards the parent image classification and the class
with the maximum votes wins. In the case of a tie, we classify the image by a run-off among
the tied classes; as the number of ties is small, the simpler strategy of breaking ties with a
coin toss showed similar results. A sketch of this two-stage test procedure is given below.
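
The sketch assumes a data layout in which kernels[p] and train[p] hold the learned kernel and the labeled training A-matrices for patch index p; these names are illustrative.

    # Each patch is mapped to the range space by its own Volterra kernel,
    # classified by nearest neighbor there, and votes for the parent image.
    import numpy as np
    from collections import Counter

    def classify_image(test_patches, kernels, train):
        votes = []
        for p, A_test in enumerate(test_patches):
            y = A_test @ kernels[p]                       # map patch to range space
            dists = [(np.linalg.norm(y - A @ kernels[p]), label)
                     for A, label in train[p]]
            votes.append(min(dists)[1])                   # nearest neighbor vote
        return Counter(votes).most_common(1)[0][0]        # majority class wins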
Parameter Selection: Like any other learning algorithm, there are parameters
that need to be set before using this method. But unlike most state-of-the-art algorithms
where parameters are chosen from a large discrete or continuously varying domain
(e.g. initialization and λ in [80], parameter α ∈ [0, 1] in [90], dimensionality D in [93],
reduced dimensionality in TSA [82], 2D-LDA [83], energy threshold in [94], etc.), Volterra
Table 5-1. State-of-the-art methods with which we compare our technique, along with the training set sizes used in their experiments.

Method                              Yale A     CMU PIE       Ext. Yale B
2008 CVPR, An (KLPPSI) [79]         -          5 10 20 30    5 10 20 30
2008 CVPR, Pham (MLASSO) [80]       -          2 3 4         2 3 4
2008 CVPR, Shan (UVF) [93]          2 - 8      -             -
2008 TIP, Fu (CTA) [88]             -          5 10 15 20    5 10 20 30
2008 TIP, Fu (Lap, Eig, Fis) [88]   -          -             5 10 20 30
2007 CVPR, An (KRR) [81]            -          5 10 20 30    5 10 20 30
2007 CVPR, Hua (ORO) [86]           5          30            20
2007 CVPR, Cai (S-LDA) [90]         2 3 4 5    -             -
2007 CVPR, Wang (TANMM) [87]        2 3 4      5 10 20       -
2007 ICCV, Cai (SR, RDA) [89]       -          30 40         10 20 30 40
2006 TIP, Cai (OLAP) [73]           2 3 4 5    5 10 20 30    -
2006 TIP, Cai (Lap, Eig, Fis) [73]  2 3 4 5    5 10 20 30    -
discriminant analysis has a smaller discrete set of parameter choices. We use one of the
widely used ([90],[81]) and accepted methods, cross validation, for parameter selection.
Foremost is the selection of the patch size, and for this, starting with the whole face
image we define a quad-tree of sub-images. We progressively go down the tree stopping
at the level beyond which there is no improvement in the recognition rates in cross
validation. Empirically we found that a patch size of 8 × 8 pixels provides the best results
in all cases. Next, we allow patches to be overlapping or non-overlapping. The Volterra
kernel can be of size 3 × 3 or 5 × 5 pixels (anything bigger than this severely over-fits
a patch of size 8 × 8). Lastly, the order of the kernel can be quadratic or linear.
Here we have presented results for both quadratic and linear kernels. The rest of the
parameters were set using a 20-fold leave-one-out cross validation on the training set.
It can be noted from the results presented in the next section that the best parameter
configuration is fairly consistent not just within a particular database but also across
databases.
5.5 Experiments
In order to evaluate our technique and compare it with existing state-of-the-art
methods in learning based face recognition, we have identified 11 recent (CVPR 2008,
Table 5-2. Yale A: Recognition error rates for training set sizes varying from 2 to 9. The linear kernel size was 5 × 5 and the quadratic kernel size was 3 × 3; in both cases overlapping patches of size 8 × 8 were used, on images of size 64 × 64. For other methods, the best results as reported in the respective papers are quoted.

Train Set Size            2      3      4      5
S-LDA [90]                42.4   27.7   22.2   18.3
S-LDA [90] (updated)      37.5   25.5   19.3   14.7
UVF [93]                  27.11  17.38  11.71  8.16
TANMM [87]                44.69  29.57  18.44  -
OLAP [73]                 44.3   29.9   22.7   17.9
Eigenfaces [73]           56.5   51.1   47.8   45.2
Fisherfaces [73]          54.3   35.5   27.3   22.5
Laplacianfaces [73]       43.5   31.5   25.4   21.7
Volterrafaces (Linear)    15.70  12.33  9.47   6.11
Volterrafaces (Quad)      22.15  13.36  15.78  10.19

Train Set Size            6      7      8      9
S-LDA [90] (updated)      12.3   10.3   8.7    -
UVF [93]                  6.27   5.07   3.82   -
Volterrafaces (Linear)    5.78   3.96   2.61   1.43
Volterrafaces (Quad)      10.04  9.66   9.49   8.74
Table 5-3. CMU PIE: Recognition error rates for training set sizes varying from 2 to 40. The linear kernel size was 5 × 5 and the quadratic kernel size was 3 × 3; in both cases overlapping patches of size 8 × 8 were used, on images of size 32 × 32. For other methods, the best results as reported in the respective papers are quoted.

Train Set Size            5      10     20     30
KLPPSI [79]               27.88  12.32  5.48   3.62
KRR [81]                  26.4   13.1   5.97   4.02
ORO [86]                  -      -      -      6.4
TANMM [87]                26.98  17.22  5.68   -
SR [89]                   -      -      -      6.1
OLAP [73]                 21.4   11.4   6.51   4.83
Eigenfaces [73]           69.9   55.7   38.1   27.9
Fisherfaces [73]          31.5   22.4   15.4   7.77
Laplacianfaces [73]       30.8   21.1   14.1   7.13
Volterrafaces (Linear)    20.26  10.24  4.94   2.85
Volterrafaces (Quad)      25.29  11.94  5.45   4.60

Train Set Size            2      3      4      40
SR [89]                   -      -      -      5.2
MLASSO [80]               54.0   43.0   34.0   -
Volterrafaces (Linear)    43.0   36.30  23.98  2.37
Volterrafaces (Quad)      50.48  39.66  32.67  3.04
Table 5-4. Extended Yale B: Recognition error rates for training set sizes of 2 to 5, 10, 20, 30 and 40. Both the linear and the quadratic kernel were of size 3 × 3, and in both cases non-overlapping patches of size 8 × 8 were used, on images of size 32 × 32. For other methods, the best results as reported in the respective papers are quoted.

Train Set Size            5      10     20     30
ORO [86]                  -      -      -      9.0
SR [89]                   -      12.0   4.7    2.0
RDA [89]                  -      11.6   4.2    1.8
KLPPSI [79]               24.74  9.93   3.15   1.39
KRR [81]                  23.9   11.04  3.67   1.43
CTA [88]                  16.99  7.60   4.96   2.94
Eigenfaces [88]           54.73  36.06  31.22  27.71
Fisherfaces [88]          37.56  18.91  16.87  14.94
Laplacianfaces [88]       34.08  18.03  30.26  20.20
Volterrafaces (Linear)    6.35   2.67   0.90   0.42
Volterrafaces (Quad)      13.0   3.98   1.27   0.58

Train Set Size            2      3      4      40
MLASSO [80]               58.0   54.0   50.0   -
SR [89]                   -      -      -      1.0
RDA [89]                  -      -      -      0.9
Volterrafaces (Linear)    26.23  18.23  9.33   0.34
Volterrafaces (Quad)      40.81  20.47  14.42  0.43
CVPR 2007, ICCV 2007, TIP 2008, TIP 2006) publications which present the best results
on the benchmark databases using an experimental setup very similar to that in [73]. All of
these are embedding methods mentioned in the prologue, with the exception of [93], which
builds on the concept of Universal Visual Features (UVF). In our study, we present results
on the Yale A, CMU PIE and Extended Yale B benchmark face databases, partly because they
are among the most popular databases, which makes a comparative study easy. In addition
to the recent methods we also provide a comparison with the traditional baseline methods:
Eigenfaces, Fisherfaces and Laplacianfaces. Table 5-1 ([4]) lists these methods along with
the number of training images used by them on the above mentioned databases in their
experiments. We have presented our results for the whole range of training set sizes so
that comparison with the maximum number of techniques can be made.
For the Yale A 1 face database we used 11 images each of the 15 individuals with a
total of 165 images. For the CMU PIE database ([95]) we used all 170 images (except a few
corrupted images) from 5 near frontal poses (C05, C07, C09, C27, C29) for each of the 68
subjects. For the Extended Yale B database ([65]) we used 64 images each (frontal pose,
except for a few corrupted images) of the 38 individuals present in the database. Note that the
methods in Table. 5-1 ([4]) used the same subset of images. We obtained the data from
the website of the authors of [90] 2 . These images are manually aligned (two eyes were
aligned at the same position) and cropped to extract faces, with 256 gray value levels per
pixel.
The results (average recognition error rates) on the Yale A, CMU PIE and Extended
Yale B databases are presented in Table 5-2 ([4]), Table 5-3 ([4]) and Table 5-4 ([4])
respectively. Rows titled Train Set Size indicate the number of training images used, and
the rows below them list the rates reported by various state-of-the-art methods and by our method
(Volterrafaces). Each experiment was repeated 10 times for 10 random choices of training
set. All images other than the training set were used for testing. The specific experimental
setup used for Volterrafaces is mentioned below each table. We have reported results with
both linear and quadratic masks for the sake of completeness. Best results for a particular
training set size are highlighted in bold.
5.6 Discussion and Conclusion
It can be noted that our proposed method (Volterrafaces) consistently outperforms the
state-of-the-art and traditional methods on all three benchmark datasets. On all
three databases linear masks outperform the quadratic masks, but in most cases the quadratic
masks also provided better performance than the existing methods. For the competing
methods we have reported the error rates as mentioned in the original publications (listed
1 http://cvc.yale.edu/projects/yalefaces/yalefaces.html
2 http://www.cs.uiuc.edu/homes/dengcai2/Data/data.html
next to method names in tables). Since we want to present only the best results reported
by the competing methods, some of the entries in the tables are left empty because no
results for those training set sizes were reported in the original publications.
We have introduced the use of Volterra kernel approximations for image recognition
functionals in this chapter. The kernel learning is driven by training data, based on
a goodness functional defined in the range space of the recognition functional. It is
shown that for the goodness functional that tries to minimize intraclass distances
while maximizing interclass distances, kernel computation reduces to a generalized
eigenvalue problem, which translates to very efficient computation of kernels for any order
of approximation of the functional. The effectiveness of this technique for face recognition is
demonstrated by experiments on three benchmark databases, and the results are compared
to traditional as well as state-of-the-art techniques in discriminant analysis for faces.
From the results presented in this chapter it can be concluded that Volterra kernel
approximations show great promise for applications in image recognition tasks.
CHAPTER 6
ENHANCING FACE RECOGNITION WITH RELIGHTING
6.1 Introduction
In this dissertation we have presented various methods for generating relighted images
of a given face using one or more images. We have also examined the problem of face
recognition and built a novel classifier as a solution to this problem. In this chapter we
will examine the problem of face recognition in conjunction with relighting and explore the
possibility of using relighting as a tool to enhance face recognition. In particular, we
explore gallery set augmentation as a possible way to achieve higher recognition rates.
The literature is replete with various proposals (e.g. [17], [61], [52], [96]) on how
to marry relighting techniques with face recognition. We can broadly put the existing
techniques that have examined image relighting in conjunction with recognition into
two classes. The first includes techniques which use relighting as a way to obtain
illumination invariant representations of the gallery and probe images (e.g. [97], [98]), while
the second class includes the techniques which seek to improve recognition by augmenting
the gallery set with novel relighted images (e.g. [52]).
6.2 Gallery Set Augmentation
The method we propose here falls into the second of the above mentioned classes
of methods. Given a single input image of a face (initially the only image in the gallery
set), which is somewhat frontally lit (i.e. no strong cast or attached shadows), we use
the method we developed for single image relighting in Chapter 4 to generate multiple
relighted versions of the same image. These new images are added to the gallery set and
any given probe image is now matched against all of the images.
One would expect that a probe image which has very different lighting from the input
image would not be a good match for it, but it is likely that one of the newly added images
might be a closer match for it. Even in the case where the probe image has been lit from
multiple light sources, a linear combination of the new gallery images would be a closer
match to it than any single image. Here we focus on face recognition with point source lit
images, since the case with multiple light sources can simply be addressed by using an
appropriate classifier with the same set of point source lit augmented images. A sketch of
this augmented-gallery matching scheme appears below.
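
The sketch uses an L1 nearest neighbor rule; the relight function stands in for the single image relighting method of Chapter 4 and is a placeholder, as is the gallery data layout.

    # Each gallery subject contributes its original image plus relighted versions;
    # a probe is assigned to the subject of its closest gallery image (L1 norm).
    import numpy as np

    def recognize(probe, gallery, relight, angles=np.linspace(-60, 60, 8)):
        best_dist, best_subject = np.inf, None
        for subject, image in gallery.items():
            candidates = [image] + [relight(image, el, az)
                                    for el in angles for az in angles]
            d = min(np.abs(probe - c).sum() for c in candidates)
            if d < best_dist:
                best_dist, best_subject = d, subject
        return best_subject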
6.3 Experimental Results
In order to test the proposed recognition framework, we used the CMU PIE and
the MERL Dome datasets. Since gallery augmentation should help any recognition
method, we used three different recognition techniques to illustrate that relighting using
our method improves recognition. In our experiments, we looked at the improvement
achieved by augmentation over using a single gallery image. The ABRDF reference model
used for single image relighting was built using the Extended Yale B face database. We
also implemented a few competing methods for image relighting to compare their results
against those achieved by the proposed relighting method.
Here we describe the recognizers we used in our experiments.
6.3.1 Nearest Neighbor Classifier
One of the simplest ways to classify images is to look at the pixel-by-pixel distance
between the probe image and the gallery images. Of the various distance measures
possible, we have chosen the simple L1 norm based distance measure, since this method is
only used to set the baseline results.
6.3.2 MERL Classifier
The MERL classifier is a commercial face recognition system that uses Haar-like features to recognize
face images. Details for this classifier can be found in [99]. This classifier uses a large
set of pairs of images labeled as same or different to learn a set of good recognition
features and classifier parameters using Boosting. Once the classifier parameters and
the discriminating features have been learnt, given any pair of images, it assigns a
similarity score to the pair. This similarity score can then be used in conjunction with
a classification threshold to categorize a pair of images as depicting the same or different subjects.
6.3.3 Local Binary Pattern Classifier
Local Binary Patterns are a new class of features, proposed in [100], which have
been very successfully applied to face recognition. In contrast to the above mentioned
classifiers, this method uses a histogram of features and thus is not sensitive to the
absolute locations of the features.
To compare our results with other relighting methods, we also implemented the following
two relighting methods.
6.3.4 Naive Relighting
One of the simplest relighting methods is to darken a part of the image. This
gives the illusion that a part of the image is under shadow due to lighting. For this we
divide the image into rectangular segments and generate new images with pixels in these
rectangles progressively going darker from left to right, top to bottom and vice-versa. For
the darker segments of the image, we set the pixel values to be 30% of the original values.
This relighting method is used to answer the question of whether the more complicated
relighting schemes proposed here (and elsewhere) are better than naively darkening
segments of the images to simulate shadowing. A sketch of this baseline is given below.
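
A minimal sketch, assuming vertical segments darkened left to right (the other sweep directions are analogous):

    # Generate naively relighted images by progressively darkening vertical
    # segments to 30% of their original intensity.
    import numpy as np

    def naive_relight(image, n_segments=4):
        h, w = image.shape
        step = w // n_segments
        out = []
        for k in range(1, n_segments):
            img = image.astype(float)      # astype returns a fresh copy
            img[:, :k * step] *= 0.30      # darkened left portion
            out.append(img)
        return out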
6.3.5 Practical Relighting
This is one of the recently proposed state-of-the-art methods for single image
relighting which boasts a simple implementation with high performance. Starting with
a reference albedo and shape model, this method ([61]) estimates the lighting direction in
the input image. The lighting estimate and the shape model are then used to estimate a
face specific albedo field. With the albedo and the shape available, an image of the face under
any illumination can be rendered using the Lambertian model. Note that this method
assumes that all the faces have the same shape and requires only a rough alignment among
them.
Table 6-1. Effect of gallery augmentation size: face recognition rates for the CMU PIE dataset with the MERL classifier.

                        Augmentation Size
Relighting Method       256    128    64     32     16     8      4      1
Naive Relighting        92.5   92.2   91.8   93.1   92.8   90.4   90.9   90.1
Practical Relighting    92.7   92.2   92.3   91.6   90.7   90.2   91.3   90.1
Proposed Technique      98.4   98.4   98.1   97.4   97.9   96.6   94.8   90.1
6.3.6 Results
One of the important parameters in a gallery set augmentation scheme is the number of
new images to add to the gallery set. This can affect the performance in the following three
ways. First, as the number of augmented images increases, so does the classification time,
since the number of required comparisons also increases. Second, if more images are used
for augmentation, it is more likely that a close match will be found for the probe image.
Finally, having more images in the gallery increases the risk of encountering false positives.
Assuming that the classification time is not critical (since modern commercial classifiers
work quite fast), here we focus on the change in recognition rates and the impact on
false positives.
In the first set of results we show how the recognition rates change as the number of
images used to augment the gallery increases. Here we have used the MERL classifier with
our relighting on the CMU PIE dataset. The gallery and the probe sets are composed of
facial images in the frontal pose. These results are presented in Table 6-1. It can be noted
that across different relighting schemes, the recognition rates show improvement as the
number of augmented images is increased. For this experiment we synthesized 256 images
of each gallery subject by uniformly sampling the illumination direction space from +60◦
to −60◦ in both the elevation and the azimuth angles. Other augmentations were obtained
by uniformly spacing the images in the previously mentioned range.
For the rest of the experiments we fix the number of augmented images at 64.
This is because it provides a reasonable trade-off between the memory-time requirements
and the recognition rates. We present recognition rates for the CMU PIE database in
Table 6-2. Face recognition rates for the CMU PIE dataset.

                        Nearest Neighbor (L1)     MERL Classifier           Local Binary Patterns
Relighting Method       Augmented  Single Image   Augmented  Single Image   Augmented  Single Image
Naive Relighting        70.4       58.9           91.8       90.1           93.9       93.4
Practical Relighting    80.7       58.9           92.3       90.1           96.6       93.4
Proposed Technique      93.5       58.9           98.1       90.1           99.2       93.4
Table 6-3. Face recognition rates for the MERL Dome dataset.

                        Nearest Neighbor (L1)     MERL Classifier           Local Binary Patterns
Relighting Method       Augmented  Single Image   Augmented  Single Image   Augmented  Single Image
Naive Relighting        54.3       42.7           78.2       71.2           66.3       65.6
Practical Relighting    77.5       42.7           80.8       71.2           85.4       65.6
Proposed Technique      79.8       42.7           82.9       71.2           88.2       65.6
Table 6-2 and for the MERL DOME database in Table 6-3. It can be noted that across
the different databases and across the different classification schemes, our relighting
method provides higher recognition rates than the competing methods.
Rank-one recognition rates, as presented in Tables 6-2 and 6-3, though useful, do
not provide insight into the impact of gallery augmentation on the False Acceptance
Rates. We explore this aspect of the gallery set augmentation using the receiver operating
characteristic (ROC) curves which plot the False Rejection Rates against the False
Acceptance Rates as the recognition threshold is varied. For each probe image, a distance
is computed from every gallery class. Though this distance is computed differently
for different recognition strategies, it is a scalar value in all the cases. The recognition
threshold used to generate the ROC curves decides when to classify a probe image as
belonging to a class. If the distance between a probe image and a class is less than the
threshold value, the probe image is classified as belonging to that class, otherwise not.
Note that the classes here correspond to the subjects.
In Fig. 6-1 and Fig. 6-2 we present the Receiver Operating Characteristic (ROC)
curves for the MERL classifier, on the CMU PIE and the MERL Dome datasets
respectively. Curves for various relighting methods and the single image case are included
in the plots. There are three important figures of merit for such curves: the first is the Equal
Error Rate (EER), which is the point on the curve where the False Acceptance Rate
Figure 6-1. ROC Curve for the CMU PIE dataset with the MERL classifier.
equals the False Rejection Rate; the second is the area under the ROC curve; and the third is
the False Rejection Rate at 0.1% False Acceptance Rate. It can be noted that for all of
these merit criteria, our relighting outperforms the existing methods. Similar trends can
be noted in the curves for the Nearest Neighbor classifier in Fig. 6-3 and Fig. 6-4. ROCs
for the Local Binary Pattern classifier on the CMU PIE and the MERL Dome databases
are presented in Fig. 6-5 and Fig. 6-6 respectively.
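
For concreteness, the sketch below traces such a curve from genuine and impostor distance scores and reads off an approximate Equal Error Rate; the variable names are illustrative.

    # `genuine` holds probe-to-correct-class distances, `impostor` the
    # probe-to-wrong-class distances; sweep the threshold over all scores.
    import numpy as np

    def roc_curve(genuine, impostor):
        thresholds = np.sort(np.concatenate([genuine, impostor]))
        frr = np.array([(genuine > t).mean() for t in thresholds])
        far = np.array([(impostor <= t).mean() for t in thresholds])
        i = np.argmin(np.abs(far - frr))             # closest FAR/FRR crossing
        return far, frr, (far[i] + frr[i]) / 2       # approximate EER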
6.4 Conclusion
In this chapter we have demonstrated that relighting can be effectively used to
enhance recognition via gallery set augmentation. We have shown that across different
recognition schemes, relighting can benefit face recognition. Further, by comparing various
relighting methods to the proposed method, we have demonstrated that our relighting
method has significant advantages when it comes to enhancing recognition.
Figure 6-2. ROC Curve for the MERL Dome dataset with the MERL classifier.
Figure 6-3. ROC Curve for the CMU PIE dataset with the Nearest Neighbor classifier.
Figure 6-4. ROC Curve for the MERL Dome dataset with the Nearest Neighbor classifier.
Figure 6-5. ROC Curve for the CMU PIE dataset with the LBP classifier.
Figure 6-6. ROC Curve for the MERL Dome dataset with the LBP classifier.
CHAPTER 7
CONCLUSION
In this dissertation we have examined the three fundamental problems in facial image
analysis: relighting, pose change and recognition. For all three problems we have
presented novel solutions for both the multiple-image and single-image input cases.
Due to their tightly interdependent nature, we have treated the problems of relighting
and pose change together and presented a Tensor Splines based framework for multiple
image relighting and pose change. We have shown that our framework is superior to the
popular Lambertian model in terms of reproducing shadows and specularities in
the relighted images. Furthermore, we have shown that the face shape estimated using
our method has advantages over the popular Robust Photometric Stereo technique when the
input images are ridden with cast shadows. We have also presented a method for enhancing the
relighted image quality using Eigenbubbles.
For the more difficult problem of single image relighting and pose change, we
draw upon a bootstrap set of ABRDF fields. We define the reference ABRDF field
model in terms of the illumination field model and the texture model. Given a single
input image, using a non-linear optimization technique, we find those coefficients of
the illumination and the texture models, along with the illumination direction and the
non-rigid deformation, which best explain the input image. Once we have obtained the
ABRDF field, we follow the same scheme as devised for the multiple input images case to
generate images in novel poses.
In order to address the problem of face recognition with multiple input images,
we presented a novel classification scheme using the Volterra kernels which we call
Volterrafaces. We have shown that this method outperforms various state-of-the-art
methods in terms of recognition accuracy on three popular benchmark face databases.
We have addressed the problem of single image recognition in a broader sense and have
proposed a gallery set augmentation framework which can be used with most off-the-shelf
recognition methods to enhance recognition. Given a single gallery image, our technique
generates multiple relighted versions of it, after which any recognition technique that uses
multiple images can be applied. We have compared our technique with existing single image
relighting methods and have shown that the recognition rate enhancements attained by
our relighting method exceed those obtained by the existing relighting methods.
The techniques presented in this dissertation improve upon the state-of-the-art in
various ways, but numerous avenues are still open for exploration. A few of
the important threads for possible future exploration include removing the need for a
somewhat frontally lit input image for single image relighting and pose change, removing
the need for point light source lit images as input for both the multiple and single input image
cases, and reducing the number of system parameters in the Volterrafaces classification
scheme. The computational efficiency aspects of all the algorithms presented in this dissertation
can also be explored in the future, since they play an important role in real-life applications.
REFERENCES
[1] B. T. Phong, "Illumination for computer generated pictures," Communications of the ACM, vol. 18, no. 6, pp. 311–317, 1975.

[2] K. E. Torrance and E. M. Sparrow, "Theory for off-specular reflection from roughened surfaces," J. Optical Society of America, vol. 57, pp. 1105–1112, 1967.

[3] A. Barmpoutis, R. Kumar, B. C. Vemuri, and A. Banerjee, "Beyond the Lambertian assumption: A generative model for apparent BRDF fields of faces using anti-symmetric tensor splines," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2008.

[4] R. Kumar, A. Banerjee, and B. C. Vemuri, "Volterrafaces: Discriminant analysis using Volterra kernels," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2009.

[5] R. Kumar, A. Barmpoutis, A. Banerjee, and B. C. Vemuri, "Non-Lambertian reflectance modeling and shape recovery for faces using anti-symmetric tensor splines," Technical Report REP-2009-467, Dept. of CISE, Univ. of Florida, 2009.

[6] H. Chan and W. W. Bledsoe, "A man-machine facial recognition system - some preliminary results," Panoramic Research Inc., Tech. Rep., 1965.

[7] R. Basri and D. W. Jacobs, "Lambertian reflectance and linear subspaces," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 218–233, 2003.

[8] P. Debevec, T. Hawkins, C. Tchou, H. Duiker, W. Sarokin, and M. Sagar, "Acquiring the reflectance field of a human face," ACM Trans. on Graphics (Proc. SIGGRAPH), vol. 1, pp. 145–156, 2000.

[9] R. Ramamoorthi and P. Hanrahan, "A signal-processing framework for inverse rendering," ACM Trans. on Graphics (Proc. SIGGRAPH), vol. 1, pp. 117–128, 2001.

[10] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063–1074, Sept. 2003.

[11] L. Zhang and D. Samaras, "Face recognition from a single training image under arbitrary unknown lighting using spherical harmonics," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 351–363, March 2006.

[12] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, June 2001.
[13] A. Barmpoutis, B. C. Vemuri, and J. R. Forder, "Robust tensor splines for approximation of diffusion tensor MRI data," Proc. IEEE CS Workshop on Mathematical Methods in Biomedical Image Analysis, pp. 86–86, 2006.

[14] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "Illumination-based image synthesis: Creating novel images of human faces under differing pose & lighting," IEEE Workshop on Multi-View Modeling and Analysis of Visual Scenes, 1999.

[15] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," ACM Trans. on Graphics (Proc. SIGGRAPH), pp. 187–194, 1999.

[16] T. Malzbender, D. Gelb, and H. Wolters, "Polynomial texture maps," ACM Trans. on Graphics (Proc. SIGGRAPH), pp. 519–528, 2001.

[17] A. Shashua and T. Riklin-Raviv, "The quotient image: Class-based re-rendering and recognition with varying illuminations," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 129–139, Feb. 2001.

[18] W. Y. Zhao and R. Chellappa, "Symmetric shape-from-shading using self-ratio image," Int'l J. Computer Vision, vol. 45, no. 1, pp. 55–75, 2001.

[19] S. Magda, D. J. Kriegman, T. Zickler, and P. N. Belhumeur, "Beyond Lambert: Reconstructing surfaces with arbitrary BRDFs," Proc. IEEE Int'l Conf. Computer Vision, vol. 2, pp. 391–398, 7–14 July 2001.

[20] A. S. Georghiades, "Recovering 3-D shape and reflectance from a small number of photographs," Proc. Eurographics Workshop on Rendering, pp. 230–240, 2003.

[21] Z. Wen, Z. Liu, and T. S. Huang, "Face relighting with radiance environment maps," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, vol. 2, pp. II–158–65, 18–20 June 2003.

[22] R. Gross, I. Matthews, and S. Baker, "Appearance-based face recognition and light-fields," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 4, pp. 449–465, April 2004.

[23] D. B. Goldman, B. Curless, A. Hertzmann, and S. M. Seitz, "Shape and spatially-varying BRDFs from photometric stereo," Proc. IEEE Int'l Conf. Computer Vision, vol. 1, pp. 341–348, 17–21 Oct. 2005.

[24] A. Hertzmann and S. M. Seitz, "Example-based photometric stereo: Shape reconstruction with general, varying BRDFs," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1254–1264, Aug. 2005.

[25] K. C. Lee and B. Moghaddam, "A practical face relighting method for directional lighting normalization," Int'l Workshop on Analysis and Modelling of Faces and Gestures, 2005.
[26] T. Zickler, R. Ramamoorthi, S. Enrique, and P. N. Belhumeur, "Reflectance sharing: Predicting appearance from a sparse set of images of a known shape," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1287–1302, Aug. 2006.

[27] M. Chandraker, S. Agarwal, and D. Kriegman, "ShadowCuts: Photometric stereo with shadows," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2007.

[28] N. Alldrin and D. Kriegman, "Shape from varying illumination and viewpoint," Proc. IEEE Int'l Conf. Computer Vision, 2007.

[29] S. Biswas and G. Aggarwal, "Robust estimation of albedo for illumination-invariant matching & shape recovery," Proc. IEEE Int'l Conf. Computer Vision, 2007.

[30] S. K. Zhou, G. Aggarwal, R. Chellappa, and D. W. Jacobs, "Appearance characterization of linear Lambertian objects, generalized photometric stereo, and illumination-invariant face recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 2, pp. 230–245, Feb. 2007.

[31] R. Basri, D. Jacobs, I. Kemelmacher, and R. Basri, "Photometric stereo with general, unknown lighting," Int'l J. Computer Vision, vol. 72, no. 3, pp. 239–257, 2007.

[32] N. Alldrin, T. Zickler, and D. Kriegman, "Photometric stereo with non-parametric and spatially-varying reflectance," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 1–8, 23–28 June 2008.

[33] A. Shashua, "On photometric issues in 3D visual recognition from a single 2D image," Int'l J. Computer Vision, vol. 21, pp. 99–122, 1997.

[34] A. L. Yuille, D. Snow, R. Epstein, and P. N. Belhumeur, "Determining generative models of objects under varying illumination: Shape and albedo from multiple images using SVD and integrability," Int'l J. Computer Vision, vol. 35, no. 3, pp. 203–222, 1999.

[35] P. Belhumeur and D. Kriegman, "What is the set of images of an object under all possible illumination conditions?" Int'l J. Computer Vision, vol. 28, pp. 245–260, 1998.

[36] R. Ramamoorthi and P. Hanrahan, "The relationship between radiance and irradiance: Determining the illumination from images of a convex Lambertian object," J. Optical Society of America A, vol. 18, no. 10, pp. 2448–2459, 2001.

[37] K.-C. Lee, J. Ho, and D. J. Kriegman, "Acquiring linear subspaces for face recognition under variable lighting," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684–698, May 2005.
[38] J. Lee, B. Moghaddam, H. Pfister, and R. Machiraju, “A bilinear illumination model for robust face recognition,” Proc. IEEE Int’l Conf. Computer Vision, vol. 2, pp. 1177–1184, 17–21 Oct. 2005.
[39] J. P. O’Shea, M. S. Banks, and M. Agrawala, “The assumed light direction for perceiving shape from shading,” Proc. Symposium on Applied Perception in Graphics and Visualization, pp. 135–142, 2008.
[40] P. N. Belhumeur, D. J. Kriegman, and A. L. Yuille, “The bas-relief ambiguity,” Int’l J. Computer Vision, vol. 35, no. 1, pp. 33–44, 1999.
[41] Z. Liu, Y. Shan, and Z. Zhang, “Expressive expression mapping with ratio images,” ACM Trans. on Graphics (Proc. SIGGRAPH), pp. 271–276, 2001.
[42] A. Hertzmann and S. M. Seitz, “Shape and materials by example: A photometric stereo approach,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, vol. 1, pp. I–533–I–540, 18–20 June 2003.
[43] T.-P. Wu, K.-L. Tang, C.-K. Tang, and T.-T. Wong, “Dense photometric stereo: A Markov random field approach,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1830–1846, Nov. 2006.
[44] S. C. Foo, “A gonioreflectometer for measuring the bidirectional reflectance of material for use in illumination computation,” Master’s thesis, Cornell University, 1997.
[45] G. Borshukov and J. P. Lewis, “Realistic human face rendering for ‘The Matrix Reloaded’,” ACM SIGGRAPH 2003 Sketches & Applications, pp. 1–1, 2003.
[46] T. Weyrich, W. Matusik, H. Pfister, B. Bickel, C. Donner, C. Tu, J. McAndless, J. Lee, A. Ngan, H. W. Jensen, and M. Gross, “Analysis of human faces using a measurement-based skin reflectance model,” ACM Trans. on Graphics (Proc. SIGGRAPH), pp. 1013–1024, 2006.
[47] H. Jeffreys, Cartesian Tensors. Cambridge: The University Press, 1931.
[48] C. de Boor, “On calculating with B-splines,” J. Approximation Theory, vol. 6, pp. 50–62, 1972.
[49] C. Lawson and R. J. Hanson, Solving Least Squares Problems. Englewood Cliffs: Prentice-Hall, 1974.
[50] A. Barmpoutis, B. C. Vemuri, T. M. Shepherd, and J. R. Forder, “Tensor splines for interpolation and approximation of DT-MRI with applications to segmentation of isolated rat hippocampi,” IEEE Trans. Medical Imaging, vol. 26, no. 11, pp. 1537–1546, Nov. 2007.
[51] B. K. P. Horn, Robot Vision. New York: McGraw Hill, 1986.
[52] T. Sim and T. Kanade, “Combining models and exemplars for face recognition: An illuminating example,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition Workshop on Models versus Exemplars in Computer Vision, 2001.
[53] P. Debevec, “Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography,” ACM Trans. on Graphics (Proc. SIGGRAPH), pp. 189–198, 1998.
[54] R. Gross and V. Brajovic, “An image preprocessing algorithm for illumination invariant face recognition,” Int’l Conf. Audio- and Video-Based Biometric Person Authentication, pp. 10–18, 2003.
[55] R. Brunelli and T. Poggio, “Face recognition: Features versus templates,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1042–1052, Oct. 1993.
[56] P. W. Hallinan, “A low-dimensional representation of human faces for arbitrary lighting conditions,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 995–999, 21–23 June 1994.
[57] A. U. Batur and M. H. Hayes III, “Linear subspaces for illumination robust face recognition,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 296–301, 2001.
[58] W. A. Smith and E. R. Hancock, “Facial shape-from-shading and recognition using principal geodesic analysis and robust statistics,” Int’l J. Computer Vision, vol. 76, no. 1, pp. 71–91, 2008.
[59] C. Ding and X. He, “K-means clustering via principal component analysis,” Int’l Conf. Machine Learning, 2004.
[60] R. Basri and D. Jacobs, “Lambertian reflectance and linear subspaces,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 218–233, 2003.
[61] K.-C. Lee and B. Moghaddam, “A practical face relighting method for directional lighting normalization,” IEEE Int’l Workshop on Analysis and Modeling of Faces and Gestures, 2005.
[62] M. Jones and T. Poggio, “Multidimensional morphable models: A framework for representing and matching object classes,” Proc. IEEE Int’l Conf. Computer Vision, pp. 683–688, 1998.
[63] L. Zhang and D. Samaras, “Face recognition under variable lighting using harmonic image exemplars,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. I–19–I–25, 2003.
[64] A. S. Georghiades, P. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[65] K.-C. Lee, J. Ho, and D. J. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684–698, 2005.
[66] S. Wang, L. Zhang, and D. Samaras, “Face reconstruction across different poses and arbitrary illumination conditions,” Biometric Authentication Workshop, pp. 91–101, 2005.
[67] V. Blanz and T. Vetter, “Face recognition based on fitting a 3D morphable model,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063–1074, 2003.
[68] A. M. Bronstein, M. M. Bronstein, and R. Kimmel, “Robust expression-invariant face recognition from partially missing data,” European Conf. Computer Vision, pp. 396–408, 2006.
[69] X. Li, G. Mori, and H. Zhang, “Expression-invariant face recognition with expression classification,” Canadian Conf. Computer and Robot Vision, pp. 77–83, 2006.
[70] A. Pentland, B. Moghaddam, and T. Starner, “View-based and modular eigenfaces for face recognition,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 1994.
[71] P. N. Belhumeur, J. Hespanha, and D. J. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
[72] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang, “Face recognition using Laplacianfaces,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 3, 2005.
[73] D. Cai, X. He, J. Han, and H.-J. Zhang, “Orthogonal Laplacianfaces for face recognition,” IEEE Trans. Image Processing, vol. 15, no. 11, 2006.
[74] X. He, D. Cai, S. Yan, and H.-J. Zhang, “Neighborhood preserving embedding,” Proc. IEEE Int’l Conf. Computer Vision, pp. 1208–1213, 2005.
[75] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40–51, 2007.
[76] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” Advances in Neural Information Processing Systems, pp. 585–591, 2002.
[77] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, pp. 2323–2326, 2000.
[78] X. He, D. Cai, and P. Niyogi, “Locality preserving projections,” Advances in Neural Information Processing Systems, 2003.
[79] S. An, W. Liu, and S. Venkatesh, “Exploiting side information in locality preserving projection,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2008.
[80] D.-S. Pham and S. Venkatesh, “Robust learning of discriminative projection for multicategory classification on the Stiefel manifold,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2008.
[81] S. An, W. Liu, and S. Venkatesh, “Face recognition using kernel ridge regression,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2007.
[82] X. He, D. Cai, and P. Niyogi, “Tensor subspace analysis,” Advances in Neural Information Processing Systems, 2005.
[83] J. Ye, R. Janardan, and Q. Li, “Two-dimensional linear discriminant analysis,” Advances in Neural Information Processing Systems, 2004.
[84] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, and H.-J. Zhang, “Multilinear discriminant analysis for face recognition,” IEEE Trans. Image Processing, vol. 16, no. 1, 2007.
[85] M. Vasilescu and D. Terzopoulos, “Multilinear subspace analysis of image ensembles,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 212–220, 2003.
[86] G. Hua, P. Viola, and S. Drucker, “Face recognition using discriminatively trained orthogonal rank one tensor projections,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2007.
[87] F. Wang and C. Zhang, “Feature extraction by maximizing the average neighborhood margin,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2007.
[88] Y. Fu and T. S. Huang, “Image classification using correlation tensor analysis,” IEEE Trans. Image Processing, vol. 17, no. 2, 2008.
[89] D. Cai, X. He, and J. Han, “Spectral regression for efficient regularized subspace learning,” Proc. IEEE Int’l Conf. Computer Vision, 2007.
[90] D. Cai, X. He, Y. Hu, J. Han, and T. Huang, “Learning a spatially smooth subspace for face recognition,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2007.
[91] L. Zhang, S. Wang, and D. Samaras, “Face synthesis and recognition from a single image under arbitrary unknown lighting using a spherical harmonic basis morphable model,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 209–216, 2005.
[92] J. A. Cherry, “Introduction to Volterra methods,” Distortion Analysis of Weakly Nonlinear Filters Using Volterra Series, 1994.
[93] H. Shan and G. W. Cottrell, “Looking around the backyard helps to recognize faces and digits,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2008.
[94] S. Rana, W. Liu, M. Lazarescu, and S. Venkatesh, “Recognising faces in unseen modes: A tensor based approach,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2008.
[95] T. Sim, S. Baker, and M. Bsat, “The CMU pose, illumination, and expression (PIE) database,” Proc. IEEE Int’l Conf. Automatic Face and Gesture Recognition, 2002.
[96] C.-P. Chen and C.-S. Chen, “Lighting normalization with generic intrinsic illumination subspace for face recognition,” Proc. IEEE Int’l Conf. Computer Vision, 2005.
[97] D.-H. Liu, K.-M. Lam, and L.-S. Shen, “Illumination invariant face recognition,” Pattern Recognition, vol. 38, no. 10, pp. 1705–1716, 2005.
[98] E. N. Coleman and R. Jain, “Obtaining 3-dimensional shape of textured and specular surfaces using four-source photometry,” Computer Graphics and Image Processing, vol. 18, pp. 309–328, 1982.
[99] M. Jones and P. Viola, “Face recognition using boosted local features,” Technical Report TR2003-25, MERL, 2003.
[100] T. Ahonen, A. Hadid, and M. Pietikainen, “Face recognition with local binary patterns,” Proc. European Conf. Computer Vision, 2004.
BIOGRAPHICAL SKETCH
Ritwik Kumar received his Bachelor of Technology degree in Information and
Communication Technology from the Dhirubhai Ambani Institute of Information and
Communication Technology (DAIICT), Gandhinagar, India in 2005. Since 2005 he
has been a Ph.D. student at the Center for Vision, Graphics and Medical Imaging in the
Department of Computer and Information Science and Engineering at the University
of Florida, Gainesville, FL, USA. His research interests include machine learning, color
video analysis, face recognition and medical image analysis. He is a recipient of the
DAIICT President’s Gold Medal (2005) and the University of Florida Alumni Fellowship
(2005–2009). He received the best student paper award at the Twelfth International
Conference on Advanced Computing and Communications (ADCOM) in 2004.