RELIGHTING, POSE CHANGE AND RECOGNITION OF FACES
By
RITWIK KUMAR
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2009
© 2009 Ritwik Kumar
To my mother, father and sister
ACKNOWLEDGMENTS
I am extremely grateful to Dr. Baba C. Vemuri and Dr. Arunava Banerjee for their
guidance and support during my graduate studies. They have been a constant source
of inspiration and encouragement for me, the most important ingredients necessary for
research. I am also thankful to Dr. Jeffery Ho, Dr. Anand Rangarajan and Dr. Trevor
Park for being on my supervisory committee and providing extremely useful insights into
the work presented in this dissertation.
I would like to thank the Department of Computer and Information Science and
Engineering (CISE) and the University of Florida (UF) for giving me the opportunity to
pursue my graduate studies in a very constructive environment. I am especially thankful
to my UF Alumni Fellowship, the Dept. of CISE and the NIH grants NS46812 (to Dr.
Baba C. Vemuri), EB007082 (to Dr. Baba C. Vemuri) and EB004752 (to Dr. Paul R.
Carney and Dr. Thomas H. Mareci) for funding my doctoral studies and travels to various
conferences. During my graduate studies, I enjoyed my job as a teaching assistant and for
that I am grateful to Dr. Manuel Bermudez for being a terrific boss.
I am grateful to Dr. Tanveer Syeda-Mahmood for giving me the opportunity to work
with her wonderful research group at the IBM Almaden Research Center. I appreciate
the opportunity given to me by Dr. Michael Jones and Dr. Tim Marks to work at the
Mitsubishi Electric Research Laboratories (MERL). I am especially grateful to MERL for
providing data and software to generate some of the results included in this dissertation.
At the Center for Vision Graphics and Medical Imaging (CVGMI), I was very
fortunate to get to spend time in the company of many wonderful people. I deeply appreciate
Angelos Barmpoutis for his friendship and guidance during the course of my graduate
studies. Fei Wang has been a wonderful friend throughout my graduate studies and I am
grateful for all the mentoring and support he has provided. I must thank Ajit Rajwade,
Bing Jian and Santhosh Kodipaka for being extremely helpful and patient, as I endlessly
bothered them with my questions. I also appreciate the camaraderie of my lab-mates
Nicholas Lord, Adrian Peter, O’Neil Smith, Ozlem Subakan, Guang Cheng and Ting
Chen.
I am thankful to my long time friends Siddharth Chouksey and Vaibhav Garg for
being there when I needed them. I must also thank Kris, someone whom I knew very briefly
but will always remember as a good friend. Thanks to Karthik Gurumoorthy, Venkat
Ramaswamy, Seniha Esen Yuksel, Jason Yu-Tseh Chi and Amit Dhurandar, my wonderful
friends who are capable of making any occasion fun. I am especially grateful to Mujde
Erten, for being a patient, understanding and helpful friend.
I am thankful to Dr. Arvind Kudchadkar, Dr. R. N. Biswas, Dr. Prabhat Ranjan,
Dr. Naresh Jotwani, Dr. Ashish Jadhav and Dr. Suman K. Mitra, for their guidance
and support during my undergraduate studies at the Dhirubhai Ambani Institute of
Information and Communication Technology, Gandhinagar. They introduced me to the
wonderful world of research and helped me attain my goal of pursuing a doctoral degree.
Lastly and most importantly, I am thankful to my family, for their unconditional and
unflinching love and support. I will be eternally grateful to my mother, Shashi, my father,
Kailash and my sister, Richa, for all that they have done for me.
TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 PROBLEM DESCRIPTION
    1.1 Introduction
    1.2 Facial Relighting
    1.3 Pose Change
    1.4 Face Recognition
    1.5 The Interconnections
    1.6 Organization

2 FACIAL RELIGHTING AND POSE CHANGE WITH MULTIPLE IMAGES
    2.1 Introduction
    2.2 Related Work
    2.3 Overview
    2.4 Tensor Splines
        2.4.1 Spherical Functions Modeled as Tensors
        2.4.2 Tensor Splines
        2.4.3 Facial ABRDF Approximation Using Tensor Splines
    2.5 Mixture of Single-lobed Functions
    2.6 Recovering Shape from the ABRDF Field
        2.6.1 Rotation Estimation
        2.6.2 Surface Normal Computation
        2.6.3 Shape Recovery
        2.6.4 Novel Pose Relighting
    2.7 Experimental Results
        2.7.1 Relighting Faces
        2.7.2 Estimating Shape
        2.7.3 Face Recognition
    2.8 Conclusions

3 EIGENBUBBLES: THE ENHANCED ABRDF REPRESENTATION
    3.1 Introduction
    3.2 Eigenbubbles
    3.3 Experiments & Discussion
        3.3.1 Relighting
        3.3.2 Face Recognition
        3.3.3 ABRDF Field Compression
    3.4 Conclusions

4 FACE RELIGHTING AND POSE CHANGE WITH SINGLE IMAGE
    4.1 Introduction
    4.2 Related Work
    4.3 Overview
    4.4 The Reference ABRDF Field Model
    4.5 Model Fitting
    4.6 Experimental Results
        4.6.1 Relighting
        4.6.2 Pose Change
    4.7 Conclusions

5 FACE RECOGNITION
    5.1 Introduction
    5.2 Volterra Kernel Approximations
    5.3 Kernel Computation as Generalized Eigenvalue Problem
        5.3.1 First Order Approximation
        5.3.2 Second Order Approximation
    5.4 Training and Testing Algorithms
    5.5 Experiments
    5.6 Discussion and Conclusion

6 ENHANCING FACE RECOGNITION WITH RELIGHTING
    6.1 Introduction
    6.2 Gallery Set Augmentation
    6.3 Experimental Results
        6.3.1 Nearest Neighbor Classifier
        6.3.2 MERL Classifier
        6.3.3 Local Binary Pattern Classifier
        6.3.4 Naive Relighting
        6.3.5 Practical Relighting
        6.3.6 Results
    6.4 Conclusion

7 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

2-1 Requirements, assumptions and capabilities of existing methods
2-2 Face recognition error rates with Tensor Splines
3-1 Face recognition error rates with Eigenbubbles
3-2 ABRDF field compression
5-1 State-of-the-art methods for face recognition
5-2 Face recognition results on the Yale A dataset
5-3 Face recognition results on the CMU PIE dataset
5-4 Face recognition results on the Extended Yale B dataset
6-1 Effect of the gallery augmentation size on the recognition rates
6-2 Face recognition rates for the CMU PIE dataset
6-3 Face recognition rates for the MERL Dome dataset
LIST OF FIGURES

2-1 Lambertian model vs. Cartesian Tensors model
2-2 Symmetric and antisymmetric ABRDF approximations
2-3 ABRDF alignment
2-4 ABRDFs on a face
2-5 Novel synthesized images
2-6 Novel images in complex lighting
2-7 Shadows and specularities comparison
2-8 Impact of the input images
2-9 Per pixel intensity error comparison
2-10 Shapes recovered using 9 images
2-11 Pose variation
2-12 Shape comparison with the Robust Photometric Stereo
2-13 Simultaneous pose and illumination variation
3-1 Global Eigenbubbles
3-2 Local Eigenbubbles
3-3 ABRDFs on a face: Eigenbubbles
3-4 Global vs. Local Eigenbubbles
3-5 Images relighted with the Eigenbubbles
3-6 Quantitative comparison with Eigenbubbles
3-7 Shadows and specularities in relighted images
3-8 Extrapolated lighting conditions with Eigenbubbles
4-1 Overview of the ABRDF field fitting
4-2 Comparison with the ground truth images
4-3 Relighted images from the CMU PIE dataset
4-4 Relighted images from the CMU PIE dataset
4-5 Relighted images from the MERL Dome dataset
4-6 Relighted images from the MERL Dome dataset
4-7 Pose changed images using single input image
5-1 Structure of the A^1_i and A^2_i matrices
5-2 Training algorithm
5-3 Testing algorithm
6-1 ROC curve for the CMU PIE dataset with the MERL classifier
6-2 ROC curve for the MERL Dome dataset with the MERL classifier
6-3 ROC curve for the CMU PIE dataset with the Nearest Neighbor classifier
6-4 ROC curve for the MERL Dome dataset with the Nearest Neighbor classifier
6-5 ROC curve for the CMU PIE dataset with the LBP classifier
6-6 ROC curve for the MERL Dome dataset with the LBP classifier
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
RELIGHTING, POSE CHANGE AND RECOGNITION OF FACES
By
Ritwik Kumar
December 2009
Chair: Baba C. Vemuri
Cochair: Arunava Banerjee
Major: Computer Engineering
Relighting, pose change and recognition of faces from images are intimately connected
fundamental problems in the field of Computer Vision and Graphics. These problems are
particularly interesting and difficult when examined in the presence of constraints like a limited
number of input images, cast shadows and specularities. Though numerous solutions have
been proposed in the past, none effectively addresses these problems when considered in
the aforementioned constrained setting. In this dissertation, we present a set of techniques
that accomplish relighting, pose change and recognition of facial images in the presence of
specularities and shadows, using as few as one input image.
We start by presenting a method for relighting and pose change for facial images
using nine or more input images. We accomplish this by representing the Apparent
Bidirectional Reflectance Distribution Function (ABRDF) fields of human faces using
Tensor Splines. We then present a method for improving the quality of relighted images
by enhancing the ABRDFs with face specific subspace representations. Next, we present
a novel technique for estimating facial ABRDF fields for the difficult case of single input
image. Finally, we focus on the face recognition problem and present a novel face image
classification scheme as well as a framework for enhancing face recognition using relighting
methods outlined above.
All of the above mentioned techniques are supported by extensive experiments on the
Yale A, the CMU PIE, the Extended Yale B and the MERL Dome face image databases.
We show that our relighting, pose change and recognition systems outperform various
state-of-the-art methods in terms of image quality and recognition rates.
CHAPTER 1
PROBLEM DESCRIPTION
1.1 Introduction
Faces are the primary features that are used by humans to associate identities with
people. Due to this critical role of human faces, understanding and manipulating their
appearance has become immensely important in various modern applications which
deal with human-form representations. For instance, in a given video clip, a change
in the scene illumination requires that the appearances of the faces present also be modified
realistically – a problem commonly found in many video games with human characters.
Similar scenarios are also frequently encountered in the movie-making industry where
post-production manipulation of scenes with human faces, like pose or illumination
change, is required. In addition to the manipulation of facial images, understanding
what traits of a face are critical for its identity is also important, since it opens roads
for automatic face recognition. Automatic face recognition has numerous important
applications like access control, automatic image tagging and forensic database search.
The work presented in this dissertation addresses three fundamental problems
in facial image analysis: facial relighting, pose change and recognition. These have been
selected because these interconnected problems cover a substantial subset of the facial
image manipulation and understanding based applications described above. The exact
nature of each of these problems depends on the chosen assumptions. In a broad sense, we
will present solutions to the problem of relighting and pose change for both the multiple
image input and the single image input cases. We will also examine the problem of the
multiple image face recognition by presenting a novel face image classification method
and explore the single image face recognition problem in general as it relates to face
image relighting. Note that the single and multiple images here refer to the gallery or the
training set.
We begin by defining the problems of facial relighting, pose change and recognition
and then look at the interdependence among them. The specific flavors of these problems
that we work with are addressed in more detail in the respective chapters.
1.2 Facial Relighting
In its most general form, the problem of facial relighting involves generating facial
images of a subject under novel illumination given a few example facial images of that
subject. The example images may or may not come with any information about the
lighting used in the images. Further, the number of such example images may vary
from one to many. At the output end, the novel images may be required to have a point
source or arbitrarily complex lighting. The complexity of the problem also increases
according to the sought quality of the generated novel images. For instance, if shadows
and specularities are to be faithfully produced in the novel images, the problem becomes
more difficult than when these photo-effects are ignored.
1.3 Pose Change
In its most general form, the problem of pose change for a facial image involves
generating images of a subject’s face under novel poses, when one or more example images
of the same face are given in some other pose. The example images may or may not
come with information about the lighting conditions. The most common approach to this
problem involves recovering the 3D shape and texture of the face from the given image
or images. Once the shape and the texture of a face have been captured, images in novel
poses can be generated by projecting the shape along various directions.
1.4 Face Recognition
The above two problems are concerned with face images of a single subject, while
the problem of face recognition involves images across different subjects. This problem is
defined using three classes of images. The probe image is the image whose identity is to be
determined. The gallery is the set of images against which the identity of the probe image
is checked and the training set is the set of images which are used to learn the system
parameters. It should be noted that often the gallery and the training set refer to the
same set of images.
The complexity of the recognition problem is governed by both the probe and
the gallery images. The problem of recognizing the face is simpler if the probe is
obtained under predetermined pose and lighting conditions and it becomes harder as the
control over the probe image is relaxed and photo-effects like shadows and specularities
are allowed. The problem is again simpler if the gallery images are obtained under
predetermined conditions. The number of images in the gallery also impacts the difficulty
of the problem.
1.5 The Interconnections
The interdependence among these problems is best understood from the point of view
of the face recognition problem. It is possible, that the probe image is taken in a different
pose and/or illumination condition than the image(s) present in the gallery set. In such
cases, if either using the probe image or the image(s) from the gallery we can generate
novel images in new illumination conditions and poses, it can aid the recognition process
which involves matching the probe image against the gallery images.
The problems of relighting and pose change are also intimately connected. The
change in the observed image intensity values as the environment lighting changes is
known to be in part governed by the local shape of the object ([1], [2]). The global shape
can also affect photo-effects like shadows in an image.
1.6 Organization
Due to the tightly interdependent nature of the relighting and the pose change
problems, we discuss them together in Chapter 2, where we present a detailed description
of these problems, literature survey, our proposed solutions and the experimental results.
In Chapter 3 we discuss a method for enhancing the quality of relighted images using
Eigenbubbles. The more difficult case of the relighting and pose change problem, which works
with a single input image, is discussed in Chapter 4. Then we shift focus to the multiple
image face recognition problem and present a new classifier called Volterrafaces in Chapter
5. In Chapter 6 we detail our novel framework for single image face recognition that draws
upon the relighting scheme described in Chapter 4 to enhance recognition rates. Finally,
in Chapter 7 we conclude with a summary of the contributions made in this dissertation.
Part of the work presented in this dissertation has been published in [3], [4] and [5].
CHAPTER 2
FACIAL RELIGHTING AND POSE CHANGE WITH MULTIPLE IMAGES
2.1 Introduction
Precisely capturing appearance and shape of objects has engaged human imagination
ever since the conception of drawing and sculpting. With the invention of computers,
a part of this interest was translated into the search for automated ways of accurately
modeling and realistically rendering appearances and shapes. Among all the objects
explored via this medium, human faces have stood out for their obvious importance. In
recent times, the immense interest in facial image analysis has been fueled by applications
like face recognition (on account of recent world events), pose synthesis and face relighting
(driven in part by the entertainment industry), among others. This in turn has led to an
epitome of literature on this subject, encompassing various techniques for modeling and
rendering appearances and shapes of faces.
Our understanding of the process of image formation and the interaction of light and
the facial surface has come a long way since we started [6], with many impressive strides
along the way (e.g. [7], [8], [9]), but we are still some distance from an ideal solution. In
our view, an ideal solution to the problem of modeling and rendering appearances and
shapes of human faces should be able to generate extremely photo-realistic renderings of
a person’s face, given just one 2D image of the face, in any desired illumination condition
and pose, at a click of a button (real time). Furthermore, such a system should not require
any manual intervention and should not be fazed by the presence of common photo-effects
like shadows and specularities in the input. Lastly, such an ideal system should not require
expensive data collection tools and processes, e.g. 3D scanners, and should not assume
availability of meta-information about the imaging environment (e.g. lighting directions,
lighting wavelength etc.).
These general requirements have been singled out because the state-of-the-art is
largely comprised of systems which relax one or more of these conditions while satisfying
others. Common simplifying assumptions include applicability of the Lambertian
reflectance model (e.g. [7]), availability of 3D face model (e.g. [10]), manual initialization
(e.g. [11]), absence of cast shadows in input images (e.g. [12]), availability of large amounts
of data obtained from custom built rigs (e.g. [8]), etc. These assumptions are noted as
“simplifying” because – human faces are known to be neither exactly Lambertian nor
convex (and thus can have cast shadows), fitting a 3D model requires time consuming
large-scale optimization with manual selection of features for initialization, specialized
data acquisition can be costly and in most real applications only a few images of a face are
available.
The method we propose in this chapter moves the state-of-the-art closer to the
ideal solution by satisfying more of the above mentioned attributes simultaneously.
Our technique can produce photo-realistic renderings of human faces across arbitrary
illumination and pose using as few as 9 images (fixed pose, known illumination direction)
with a spatially varying non-Lambertian reflectance model. Unlike most techniques, our
method does not require input images to be free of cast shadows or specularities and can
reproduce these in the novel renderings. It does not require any manual initialization and
is a purely image based technique (no expensive 3D scans are needed). Furthermore, it is
capable of working with images obtained from standard benchmark datasets and does not
require specialized data acquisition.
Our technique is based on the Tensor Splines framework which can be used to
approximate any n-dimensional field of spherical functions (originally proposed in [13]).
In the case of faces, we, for the first time, use Tensor Splines to approximate the field of
Apparent Bidirectional Reflectance Distribution function (ABRDF) for a fixed viewing
direction. Unlike the BRDF, the ABRDF (also known as the reflectance function) at each
pixel captures the variation in intensity as a function of illumination and viewing direction
and is thus sensitive to the context of the pixel. Once the ABRDF field has been captured,
images of the face under the same pose but with arbitrary illumination can be generated
Figure 2-1. Lambertian model vs. Cartesian Tensors model. From the synthetic example of the first row and the real data below it, it can be noted that the Cartesian tensor can capture variations of intensity distributions more accurately than the Lambertian model.
by simply taking weighted combinations of the ABRDF field samples. Next, we estimate
the surface normal at each pixel by robustly combining the shape information from its
neighboring pixels. Towards this, we put forward an iterative algorithm which works by
registering neighborhood ABRDFs using an extremely efficient linear technique. With as
few as 1 or 2 iterations, we can recover the surface normal fields of most faces, which are
then numerically integrated to obtain the face surfaces. Novel pose with novel illumination
conditions can thus be rendered while seamlessly accounting for attached as well as cast
shadows.
2.2 Related Work
The sheer size of the facial shape-reflectance modeling literature allows its taxonomy
to be carried out along various possible lines. Here we have classified methods based on
various assumptions made by them while modeling the facial reflectance and the shape.
We have also summarized a few of the key methods along with their associated assumptions in
Table 2-1.
Table 2-1. Requirements, assumptions and capabilities of existing methods (✓ = yes, ✗ = no)

Method | Assumed surface BRDF | No. of input images | Relighted images presented | Shape/pose results presented | Cast shadows in input | Purely image based (no 3D scans) | Other assumptions, requirements and limitations
1999 MVIEW [14] | Lambertian | ≥3 | ✓ | ✓ | ✗ | ✓ | Near frontal illumination expected, ray tracing for cast shadows
1999 SIGGRAPH [15] | Non-Lamb. | 1 | ✓ | ✓ | ✗ | ✗ | No attached shadows, manual initialization to fit 3D model
2000 SIGGRAPH [8] | Non-Lamb. | ≥2000 | ✓ | ✓ | ✓ | ✓ | Custom rig for data collection, structured lighting for shape
2001 SIGGRAPH [9] | Lambertian | ≥3 | ✓ | ✗ | ✗ | ✗ | Distant and isotropic lighting, 3D scans needed as input
2001 SIGGRAPH [16] | Non-Lamb. | ≥50 | ✓ | ✓ | ✗ | ✗ | Custom rig for data acquisition, no specularity allowed in input
2001 PAMI [12] | Lambertian | ≥7 | ✓ | ✓ | ✗ | ✓ | Almost no attached shadows, symmetric faces, ray tracing
2001 PAMI [17] | Lambertian | 1 | ✓ | ✗ | ✗ | ✓ | Bootstrap set of images required, ideal class assumption
2001 IJCV [18] | Lambertian | 1 | ✓ | ✓ | ✗ | ✓ | No attached shadows, symmetric faces, piecewise constant albedo
2001 ICCV [19] | Non-Lamb. | ≥300 | ✗ | ✓ | ✓ | ✓ | Known lighting directions, lighting should doubly cover the directions
2003 EGSR [20] | Non-Lamb. | ≥12 | ✓ | ✓ | ✗ | ✓ | 3 sources/pixel, ad-hoc shadow detection, spatially constant BRDF
2003 CVPR [21] | Lambertian | 1 | ✓ | ✓ | ✗ | ✗ | Symmetric lighting, 3D model fitting, manual initialization
2003 PAMI [7] | Lambertian | 1 | ✓ | ✗ | ✗ | ✗ | Distant & isotropic lighting, 3D scans required, manual initialization
2004 PAMI [22] | Lambertian | ≥1 | ✗ | ✓ | ✗ | ✓ | Manual delineation of feature points for better recognition
2005 ICCV [23] | Non-Lamb. | 12 | ✓ | ✓ | ✗ | ✓ | Known lighting, HDR images expected, manual threshold selection
2005 PAMI [24] | Non-Lamb. | ≥8 | ✗ | ✓ | ✗ | ✓ | No shadows expected, reference object expected, symmetry of faces
2005 IAMF [25] | Lambertian | 1 | ✓ | ✗ | ✓ | ✗ | Shadowed pixel gets default albedo, universal 3D face model required
2006 PAMI [11] | Lambertian | 1 | ✓ | ✓ | ✗ | ✗ | 3D model fitting with manual initialization
2006 PAMI [26] | Non-Lamb. | ≥1 | ✓ | ✗ | ✗ | ✗ | Point lighting sources with known directions, object shape required
2007 CVPR [27] | Lambertian | ≥4 | ✗ | ✓ | ✓ | ✓ | 3 sources/pixel, known lighting, normals can't be on bisection planes
2007 ICCV [28] | Non-Lamb. | ≥32 | ✗ | ✓ | ✓ | ✓ | Point light sources, known directions, BRDF isotropic about normal
2007 ICCV [29] | Lambertian | 1 | ✓ | ✓ | ✗ | ✗ | Point sources with known directions, registered avg. 3D model required
2007 PAMI [30] | Lambertian | 1 | ✓ | ✓ | ✗ | ✓ | No attached shadows expected, symmetry of faces, bootstrap set required
2007 IJCV [31] | Lambertian | 15 | ✗ | ✓ | ✗ | ✓ | Distant and isotropic lighting, works for only convex objects
2008 CVPR [32] | Non-Lamb. | ≥102 | ✓ | ✓ | ✗ | ✓ | Point sources with known directions, BRDF isotropic about normal
Our Method | Non-Lamb. | ≥9 | ✓ | ✓ | ✓ | ✓ | Point sources with known directions
A large fraction of the existing techniques for facial image analysis work with the
Lambertian assumption for the reflectance functions. This translates to assuming that
the reflectance function at each point on the object’s surface has the same shape, that of
a half cosine function, which has been scaled by a constant – the albedo, and is oriented
along the surface normal at that location. One of the major reasons for the prevalence of
this model is its simplicity. Analysis has shown that under this assumption, if cast and
attached shadows are ignored, image of a convex object, in a fixed pose, lit by arbitrary
illumination lies in a 3-dimensional subspace [33]. When an ambient lighting component
is included, this subspace expands to become 4-dimensional [34] and when attached
shadows are taken into account, the subspace grows to become an infinite dimensional –
illumination cone [35].
Spherical harmonic analysis of the Lambertian kernel has shown that even though
the illumination cone is infinite dimensional, it can be approximated quite well by a lower
dimensional subspaces ([36], [9], [7]). In particular, these methods can produce impressive
results with 9 basis images, though they require the 3D shapes and the albedo fields as
input. These basis images can also be directly acquired using the “universal virtual”
lighting conditions [37]. More recently, this idea has been extended to 3D surfaces in [11]
building on the prior seminal work presented in [15] called Morphable Models. Morphable
Models can recover 3D shape of a face by fitting an average 3D facial model to a given 2D
image, accounting for necessary shape and texture adjustments. Morphable Models are
known to produce excellent results for across pose face recognition but cannot handle cast
shadows or specularities robustly. More importantly, they require manual delineation of
facial features to initialize a complicated non-linear optimization which can take a long
time to converge and can suffer from local minima. Using the idea of a low dimensional
subspace explored above, [22] represented the entire light-field using a low dimensional
eigen light-field.
It has been suggested that even though the time and cost of acquiring 3D data is
decreasing, the majority of face databases still remain 2D and hence it is more pragmatic
to work with the 2D images alone [38]. Methods that are purely image based and work
with the Lambertian assumption generally apply photometric stereo or shape from shading
to recover facial shape from the given images. For instance, results for simultaneous shape
recovery using photometric stereo and reflectance modeling were presented in [14] and [12].
Both of these methods work with multiple images and expect no cast shadows and very
little attached shadows in the images. Here the cast shadows in the relighted images are
rendered using ray tracing, which can be computationally expensive. Examples of methods
that recover shape from shading working under the Lambertian assumption can be found
in [18] and [39]. As these methods work with just one image, besides requiring the absence
of cast shadows in the input, they make additional assumptions like facial symmetry
(as in [18]) etc. An important point to note here is that the uncalibrated photometric
stereo or the shape from shading methods, that work with the Lambertian assumption
and orthographically projected images, also suffer from the Bas-Relief Ambiguity ([40]).
Resolving this requires additional assumptions like the symmetry of face, the nose and the
forehead being at the same height, known lighting directions, etc., and manual assistance.
Recently, shape recovery using the generalized photometric stereo was presented in
[31] which relaxes some of the assumptions made by the traditional photometric stereo
techniques. This method can recover shape from images taken under general unknown
lighting. On account of the Lambertian assumption, cast shadows are not entertained
in the input images and the shape of the object is assumed to be convex. Note that the
accurate recovery of shape using this method requires 15 to 60 images as input. Another
method for Lambertian shape recovery with multiple illuminants, but without ignoring
shadows, was presented in [27] where the graph cuts method was used to identify light
source visibility and information from shadow maps were used to recover the shape.
In contrast to most of the methods mentioned above are the techniques that seek
illumination invariant representations of the faces which can then be used to render
relighted images. Seminal work in this category was presented in [17], where the so called
“Quotient Images”, generated using ratio of albedo values, were used to generate images
under novel illumination conditions. More recently, use of invariants was invoked in
[21], where the radiance environment map was deduced using the ratio image technique
([41], [17]). Note that the shape recovery in [21], like the Morphable Models, requires
manual initialization. Forgoing the ratio technique, direct use of albedo as an illumination
invariant signature of face images was explored in [25], where using a universal 3D face
model, illumination normalized images of faces were generated. This method worked with
low resolution images and did not render high quality relighted images. More recently
an improvement was presented in [29] where the albedo estimation was made more
robust using the error statistics of surface normals and the known illumination direction.
This method requires a registered average 3D model of the face and does not allow cast
shadows in the input but unlike [25], also provides a facial shape estimate. Improving
upon the idea of the ideal class assumption ([17]), another generalized photometric stereo
technique was presented in [30]. Using a bootstrap set of facial images and exploiting the
subspace spanned by a set of basis objects with Lambertian surfaces, images with novel
pose and illumination were generated. Here the faces were assumed to be symmetric and
the input was assumed to be free of shadows.
Next, we look at techniques that do not make the Lambertian assumption. Seminal
work in this class of techniques was presented in [8], where using a custom built rig,
dense sampling of the illumination space for faces was obtained. In this work, the
facial shape was obtained using structured lighting and no assumption about the
surface BRDF was made. This completely data driven technique was able to produce
extremely photo-realistic images of the face in novel illuminations and poses. The specular
component was captured using polarized lighting and modified appropriately for pose
Figure 2-2. Symmetric and antisymmetric ABRDF approximations. (A) A semicircular function. (B) Approximation of (A) using symmetric functions. (C) Max of the function in (B) and zero. (D) Approximation of (A) using anti-symmetric functions. (E) Max of the function in (D) and zero. (F) An ABRDF approximated with symmetric functions leads to unnatural lighting. (G) An ABRDF approximated with an anti-symmetric function leads to more natural lighting.
variation. This method demonstrated that if a large number of images (> 2000) for each
subject can be obtained under various lighting configurations, the relighting and the pose
generation problems can be solved, but the cost of such a system can be extremely high.
Use of biquadratic polynomials to model texture was explored in [16]. This method
required a custom built rig and more than 50 specularity free images to recover the model
parameters. The shape of the object was not recovered in this method. Use of a large
number (≥ 300) of images to recover the shape without making any assumption about
the nature of the BRDF was revisited in [19]. This method required the input images to
Figure 2-3. ABRDF alignment. Neighboring ABRDFs A and B can be better aligned with each other than A and C. This would weigh the normal suggested by B higher than the normal suggested by C.
doubly cover the illumination directions which called for specialized data acquisition. No
attempt to capture the reflectance properties of the object was made in this work.
One of the first techniques that worked with standard face databases and did not
require custom data was presented in [20] where the more general Torrance-Sparrow
([2]) model for the BRDF was used. This method presented the relighting and the pose
variation results with 12 images as input but did not allow cast shadows. Further, this
method required each pixel to be lit by at least 3 light sources in order to work properly.
An important contribution in the field of example-based shape recovery was made by [42]
where objects of interest were imaged along with a reference object of known geometry
(e.g. sphere). Multiple (≥ 8) cast shadows free images were used as input. This work
was expanded to allow spatially varying BRDF by using multiple references in [24]. An
interesting extension of this work was presented in [23] where the shape of an object was
recovered using the assumption that the BRDF of any object is essentially composed of
the BRDFs of a few fundamental materials. This method used 12 High Dynamic Range
images with known illumination direction and required manual selection of a system
parameter threshold. Using a large amount of data obtained from a custom built rig, [38]
presented a technique for pose and illumination variation using single image, that used
the Morphable Models to recover the 3D shape and the higher-order SVD to recover the
illumination subspace. This method does not explicitly make the Lambertian assumption
but it requires manual initialization for the 3D model fitting.
When the 3D shapes of the objects are assumed available, [26] presented a technique
which, at times using just one image, can recover their spatially varying non-parametric
BRDF fields. For the case of the human face, this work presented results with 4 images
where the specular component was separately captured using polarized lighting. The
images were acquired from known illumination directions and no cast shadows were
allowed.
Recently, [28] presented a new method for photometric reconstruction of shape
assuming spatially varying but isotropic BRDFs. Given 32 or more images with known
illumination, this method recovers isocontours of the surface depth map from which the
shape can be recovered by imposing additional constraints. An extension of this work
was presented in [32] where the need for additional constraints to recover the shape from
the depth map isocontour was alleviated by assuming the surface to be composed of a
few fundamental materials and that the BRDF at each point can be approximated by a
bivariate function. Results presented in this work required 102 or more images. Another
interesting framework for photometric stereo using the Markov Random Field approach
was presented in [43].
Lastly, we note that in the cases when extremely high quality renderings are required
and cost-time constraints are relaxed, custom hardware is employed. For instance, highly
accurate measurements of material BRDFs were carried out using a gonioreflectometer
in [44] and various customized hardware components and software were used to render
face images in the movie “The Matrix Reloaded” [45]. In order to measure accurate skin
reflectance while accounting for sub-surface scattering, custom built devices were again
employed in [46] to render high quality facial images.
It can be noticed that most of the image-based techniques that do not make the
simplifying Lambertian assumption end up using a large amount of custom acquired data
or assuming some other parametric form for the BRDF (besides the other assumptions).
In this chapter we explore the possibility of acquiring the non-Lambertian reflectance and
shape with just nine images in a purely data driven fashion.
2.3 Overview
The technique that we propose in this chapter simultaneously captures both the shape
and the reflectance properties of a face. Unlike the majority of existing techniques that
work with the BRDFs, in order to seamlessly account for the specularities, the attached
shadows, the cast shadows and other photo-effects, we have chosen to work with the
ABRDFs, which are spherical functions of non-trivial shape. We estimate them using the
Cartesian tensors, which in practice, have enough flexibility to account for the variations
in the ABRDFs found across human faces. Further, in order to robustly estimate the
ABRDF field from only a few and often noisy samples, we draw upon the apparent smooth
variation of reflectance properties across the face and combine the Cartesian tensors with
B-Splines. This combination of the Cartesian tensors with B-Splines is called Tensor
Splines in this chapter.
Embedded in the ABRDFs at each pixel also lies the surface normal at this point.
To extract the normal from the ABRDF field riddled with cast shadows and specularities,
we invoke the homogeneity of the ABRDFs in local neighborhoods, and infer surface
normal at a pixel using the information from its immediate neighbors. More concretely, at
each pixel we align the ABRDF with its neighbors’ ABRDF using a linearized algorithm
for rotation recovery and take a weighted geodesic mean of the normals suggested by
the neighbors to obtain the surface normal. Our framework automatically discounts
possibly erroneous surface normal suggestions by weighting the suggestion from a neighbor
of substantially different shape lower than others. This process can be iterated and in
practice we find good solutions within 1 or 2 iterations.
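To make the weighted geodesic mean step concrete, the following sketch runs an illustrative Karcher-mean iteration on the unit sphere, assuming the neighbors' suggested normals and their alignment-based weights have already been computed; the function and variable names are ours, not the dissertation's implementation.

```python
import numpy as np

def weighted_geodesic_mean(normals, weights, iters=10, eps=1e-9):
    """Weighted Karcher mean of unit vectors on the sphere S^2.

    normals : (N, 3) unit surface-normal suggestions from the neighbors.
    weights : (N,) nonnegative weights from the ABRDF alignment quality.
    """
    w = weights / (weights.sum() + eps)
    mu = normals[int(np.argmax(w))].copy()      # start at the heaviest suggestion
    for _ in range(iters):
        # Log map of every suggestion into the tangent plane at mu.
        cos_t = np.clip(normals @ mu, -1.0, 1.0)
        theta = np.arccos(cos_t)                 # geodesic distances to mu
        perp = normals - cos_t[:, None] * mu     # components orthogonal to mu
        norms = np.linalg.norm(perp, axis=1)
        logs = np.zeros_like(normals)
        ok = norms > eps
        logs[ok] = (theta[ok] / norms[ok])[:, None] * perp[ok]
        # Weighted tangent-space average, mapped back with the exp map.
        step = (w[:, None] * logs).sum(axis=0)
        a = np.linalg.norm(step)
        if a < eps:
            break                                # converged
        mu = np.cos(a) * mu + np.sin(a) * step / a
    return mu
```

Consistent with the remark above, a few such iterations suffice in practice because the suggestions in a small neighborhood are already close on the sphere.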
Equipped with this mechanism to capture both reflectance properties and shapes
of the human faces, we can generate images of any face in novel poses and illumination
conditions.
Assumptions: Like all other techniques, our method also works with certain
assumptions. It requires at least 9 images of the face under point illuminations from
known directions in a fixed pose. Note that these assumptions have been used in the
past by various methods, for example, [23] worked with 12 images obtained from known
lighting directions in fixed pose. As the number of input images increases, the performance
of our method improves. We do not require the input images to be free of the attached or
the cast shadows. We also do not restrict the BRDF to be Lambertian ([12]) or isotropic
([28], [26]). Though global photo-effects like subsurface scattering and interreflection are
not explicitly modeled, Tensor Splines can capture them to some extent.
2.4 Tensor Splines
We seek a mathematical framework that can represent a field of spherical functions
accurately. If a dense enough sampling of the spherical function field is provided, this can
be accomplished to arbitrary accuracy, but the central problem we face is precisely the
scarcity of the data. To solve this problem for the case of human facial ABRDF fields, we
exploit clues from the specific nature of ABRDFs on human faces, e.g., smooth variation
of the ABRDF for the most part, presence of multiple lobes in the ABRDF, etc. Note that the
pose is assumed to be fixed and hence the term “ABRDF” is used to refer to a spherical
function of illumination direction.
Figure 2-4. Recovered ABRDFs for a human face. Complex shapes of the ABRDFs in various regions of the face can be readily noted.
2.4.1 Spherical Functions Modeled as Tensors
A spherical function in R3 can be thought of as a function of directions or unit
vectors, v = (v_1 v_2 v_3)^T. Such a function, T, when approximated using an nth order
Cartesian tensor [47] (a tensor in R^3; in the notation used here, this order is not the same
as the number of indices used to represent the tensor), is expressed as

T(\mathbf{v}) = \sum_{k+l+m=n} T_{klm}\, (v_1)^k (v_2)^l (v_3)^m    (2–1)

where T_{klm} are the real-valued tensor coefficients and k, l and m are non-negative integers.
This is a Cartesian tensor with all the n arguments set to be v. The expressive power of
such Cartesian tensors increases with their order. Geometrically, this translates to the presence
of more “lobes” on a higher order Cartesian tensor.
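As a quick concreteness check on Eq. 2–1, this minimal Python sketch (hypothetical helper names, not code from the dissertation) enumerates the monomial exponents of an nth order Cartesian tensor and evaluates T(v):

```python
def tensor_exponents(n):
    """All (k, l, m) with k + l + m = n: the monomial exponents of Eq. 2-1."""
    return [(k, l, n - k - l) for k in range(n + 1) for l in range(n + 1 - k)]

def eval_cartesian_tensor(coeffs, v, n):
    """Evaluate T(v) = sum_{k+l+m=n} T_klm v1^k v2^l v3^m for a unit vector v.

    coeffs is ordered like tensor_exponents(n).
    """
    v1, v2, v3 = v
    return sum(c * v1**k * v2**l * v3**m
               for c, (k, l, m) in zip(coeffs, tensor_exponents(n)))
```

Here len(tensor_exponents(n)) reproduces the coefficient counts quoted below: 3, 10 and 21 for orders 1, 3 and 5.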
Note that the Lambertian model is intricately connected to a special case of this
Cartesian tensor formulation. If v = (v_1 v_2 v_3)^T is the light source direction, n = (n_1 n_2 n_3)^T is the surface normal and \rho is the surface albedo, the Lambertian kernel is given by

\max(\rho\, \mathbf{n} \cdot \mathbf{v},\ 0) = \rho \cdot \max(n_1 v_1 + n_2 v_2 + n_3 v_3,\ 0) = \max\Big(\sum_{k+l+m=1} T_{klm}\, v_1^k v_2^l v_3^m,\ 0\Big)    (2–2)

with T_{100} = \rho n_1, T_{010} = \rho n_2 and T_{001} = \rho n_3. A comparison with Eq. 2–1 reveals that
the Lambertian kernel is exactly the positive half of the 1st order Cartesian tensor.
The 1st, 2nd, 3rd and 5th order Cartesian tensors have 3, 6, 10 and 21 unique
coefficients respectively. For even orders, the Cartesian tensors are symmetric, T(v) =
T(−v), while for odd orders they are anti-symmetric, T(v) = −T(−v). We must point
out that these definitions of symmetry and anti-symmetry are different from the standard
definitions based on switching of the arguments’ order. In this chapter, we would use the
definitions we provided above.
Finally, though the higher order tensors can be more expressive, they can be
perceived to be more sensitive to noise due to their ability to model high frequency
details. In contrast, the lower order tensors are incapable of modeling high frequency
information but arguably are more robust to noise. Since it is impossible to discriminate
between high frequency detail and noise in the data, it is reasonable to say that the higher
order tensors possess higher noise sensitivity. Thus, like in any other approximation task,
we must strike a balance between the high frequency data fidelity and the noise sensitivity.
2.4.2 Tensor Splines
When the task requires estimation of a p-dimensional field of multi-lobed spherical
functions from sparse and noisy data, given the high noise sensitivity of higher order
tensors, it is reasonable to enforce smoothness across the field of spherical functions. We
accomplish this by combining the Cartesian tensor basis at each pixel with the B-Spline
basis ([48]) across the lattice of the spherical functions.
We define a Tensor Spline as a B-spline of multilinear functions of any order. In a
Tensor Spline, the multilinear functions are weighted by the B-spline basis N_{i,k+1}, where

N_{i,1}(t) = \begin{cases} 1 & \text{if } t_i \le t < t_{i+1} \\ 0 & \text{otherwise} \end{cases}    (2–3)

and

N_{i,k}(t) = N_{i,k-1}(t)\,\frac{t - t_i}{t_{i+k-1} - t_i} + N_{i+1,k-1}(t)\,\frac{t_{i+k} - t}{t_{i+k} - t_{i+1}}.    (2–4)

The N_{i,k+1}(t) are polynomials of degree k, associated with n + k + 2 monotonically increasing numbers called "knots" (t_{-k}, t_{-k+1}, \ldots, t_{n+1}).
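A direct transcription of Eqs. 2–3 and 2–4 (the Cox–de Boor recursion) might look like the sketch below; the 0/0 terms that arise with repeated knots are taken as zero, a standard convention the equations leave implicit, and the function name is ours:

```python
def bspline_basis(i, k, t, knots):
    """N_{i,k}(t) from Eqs. 2-3 and 2-4; k is the order (degree + 1),
    knots is a monotonically increasing sequence indexed from 0."""
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + k - 1] - knots[i]
    if d1 > 0.0:
        left = bspline_basis(i, k - 1, t, knots) * (t - knots[i]) / d1
    d2 = knots[i + k] - knots[i + 1]
    if d2 > 0.0:
        right = bspline_basis(i + 1, k - 1, t, knots) * (knots[i + k] - t) / d2
    return left + right
```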
The Tensor Spline for a p-dimensional lattice of spherical functions, with kth degree
spline and nth order Cartesian tensor is defined as
S(\mathbf{t}, \mathbf{v}) = \sum_{(i_1 \ldots i_p) \in D} \Big( \prod_{i_a} N_{i_a,k+1}(t_{i_a}) \Big)\, T_{i_1 \ldots i_p}(\mathbf{v})    (2–5)

where t = (t_1 \ldots t_p) is the index into the spherical function lattice, v = (v_1 v_2 v_3)^T is a
unit vector, D is the p-dimensional spline control point lattice and T_{i_1 \ldots i_p}(v) is given by
Eq. 2–1. In Tensor Splines the usual B-Spline control points have been replaced by the
control tensors T_{i_1 \ldots i_p}(\mathbf{v}). The formulation presented in Eq. 2–5 is quite general as it can
be used to estimate a spherical function field defined over an arbitrary dimensional lattice,
with any desired degree of B-Spline smoothing.
2.4.3 Facial ABRDF Approximation Using Tensor Splines
Human faces are known to be neither exactly Lambertian nor convex, which leads
to photo-effects like specularities (oily forehead and nose tip) and cast shadows (around
protruding features like nose and lips) in facial images. These effects cause such a complex
variation in the intensity values at various pixels as the lighting direction changes that it
cannot be accurately captured by a single lobed function (like the Lambertian kernel).
This motivated us to explore the use of the higher order Tensor Splines to model
the ABRDFs. Note that here the lattice is 2-dimensional and the assumption of local
Figure 2-5. Images synthesized using Tensor Splines under novel illumination directions. The illumination direction is mentioned on each image as (azimuth, elevation). The nine images used as input were illuminated from the (-20,60), (0,45), (20,60), (-50,0), (0,0), (50,0), (-50,-40), (0,-35) and (50,40) directions.
homogeneity also holds to a reasonable degree. In order to ensure that the smoothness
is manifested only in a localized fashion, we have chosen to use bi-cubic B-Splines in the
ABRDF-specialized version of the Tensor Splines.
The ability of the Cartesian tensors to better model data with complex distributions
can be noted in Fig. 2-1 ([5]), where in the first row, we show that for the case of
synthetic circular data (shown by the green arrows), the Cartesian tensors can more
accurately approximate the data than the Lambertian cosine bumps. In the second row we
show real facial ABRDFs approximated by the Tensor Splines and the Lambertian model
from a shadow prone region of the face. It can be readily noted that the Tensor Splines
capture the variability in intensity values, as a function of illumination direction, more
accurately than the Lambertian reflectance model.
We must point out that as the order of the Cartesian tensors increases, so does the
amount of data samples required to estimate the unknown coefficients. When there are
only a few images available, in order to satisfy our desire to use the higher order tensors,
Figure 2-6. Images relighted with complex lighting. The first image of each subject is lit by a point source while the next two are lit by the Eucalyptus Grove and St. Peter's Basilica light probes respectively. Light probes are provided below the facial images.
we must choose between its odd (anti-symmetric) or even (symmetric) components. Note
that since most of the time we are interested in the ABRDFs’ behavior on the frontal
hemisphere, both symmetric and anti-symmetric versions provide the same representation
power. Their behavior only becomes pertinent when the illumination direction is exactly
perpendicular to the pose direction, and this is where the use of anti-symmetric versions is
advantageous.
This has been explained via a 2D example in Fig. 2-2 ([5]). Fig. 2-2A shows a
semicircular function where the blue circle in the figure is considered to be the zero value.
Fig. 2-2B and 2-2D show the same function approximated by an antipodally symmetric
function and an antipodally anti-symmetric function respectively. It can be noted that
for both the cases the approximation is quite accurate except near the angles 0◦ and 180◦.
When the original function (Fig. 2-2A) is such that it has a positive value at one of these
antipodal points and a near-zero value at the other, a symmetric function forces the value at
both of these crucial angles to be positive while the anti-symmetric function forces one to
be positive and the other to be negative. Now, if we assume that only the positive values
of the function are preserved we get the results as presented in Fig. 2-2C and 2-2E.
The behavior of most facial ABRDFs is similar to the function in Fig. 2-2A. This
is because if a pixel has high intensity value when lit from 0◦, most of the time it would
have a low intensity value when lit from 180◦ (due to attached and cast shadows), and
vice versa. Thus, if a symmetric function is used for approximating such an ABRDF,
it would cause non-negative values at both 0◦ and 180◦ and would lead to visually
significant artifacts (unnatural lighting) in the images (Fig. 2-2F). On the other hand,
in practice, use of an anti-symmetric function does not cause visually significant artifacts
(Fig. 2-2G). To summarize, even though both anti-symmetric and symmetric functions
introduce artifacts near the 0° and 180° directions, the artifacts created by an anti-symmetric
approximation are visually insignificant, and hence we have chosen to work with the
anti-symmetric components of the Cartesian tensors.
Two dimensional Tensor Splines with bi-cubic B-Splines and odd order tensors can be
written as
S(\mathbf{t}, \mathbf{v}) = \sum_{(i,j) \in D} N_{i,4}(t_x)\, N_{j,4}(t_y)\, T_{i,j}(\mathbf{v})    (2–6)
where vectors i, j, D, t and v have the same meaning as before and the tensor has an odd
order.
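Putting the two ingredients together, a pixel of the bi-cubic Tensor Spline of Eq. 2–6 can be evaluated as in the sketch below, which reuses the bspline_basis and eval_cartesian_tensor helpers sketched earlier (again illustrative names; control is assumed to hold the D × D lattice of control-tensor coefficients):

```python
def eval_tensor_spline_2d(control, knots_x, knots_y, tx, ty, v, n):
    """S(t, v) of Eq. 2-6 at lattice position t = (tx, ty) and direction v.

    control : (D, D, C) nested array of control-tensor coefficients, C being
              the number of exponents (k, l, m) with k + l + m = n.
    """
    D = len(control)
    val = 0.0
    for i in range(D):
        wi = bspline_basis(i, 4, tx, knots_x)    # bi-cubic: spline order 4
        if wi == 0.0:
            continue
        for j in range(D):
            wj = bspline_basis(j, 4, ty, knots_y)
            if wj != 0.0:
                val += wi * wj * eval_cartesian_tensor(control[i][j], v, n)
    return val
```

For relighting, one evaluates S at a new lighting direction v and, as described later in this section, sets any negative values to zero, as in the Lambertian model.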
The problem at hand is that given a set of Q face images (I_q, q = 1 \ldots Q) of a subject
in a fixed pose along with the associated lighting directions v_q = (v_{q1} v_{q2} v_{q3}), we want
to estimate the ABRDF field of the face using a bi-cubic Tensor Spline. We propose to
accomplish this by minimizing the following energy function which minimizes the L2
distance between the model and the given data,
E(T_{ijklm}) = \sum_{q=1}^{Q} \sum_{t_x,t_y} \Big( \sum_{(i,j)\in D} N_{i,4}(t_x)\, N_{j,4}(t_y)\, T_{i,j}(\mathbf{v}_q) - I_q(t_x,t_y) \Big)^2
             = \sum_{q=1}^{Q} \sum_{t_x,t_y} \Big( \sum_{(i,j)\in D} N_{i,4}(t_x)\, N_{j,4}(t_y) \sum_{k+l+m=n} T_{ijklm}\, v_{q1}^k v_{q2}^l v_{q3}^m - I_q(t_x,t_y) \Big)^2    (2–7)
where t_x, t_y run through the lattice of the given images, i, j are the indices into the D × D
spline control point lattice D, and the tensor order n is an odd integer. The minimization
of Eq. 2–7 is done with respect to the unknown tensor coefficients T_{ijklm} that correspond
to the control tensors T_{i,j}(\mathbf{v}).
If the image size is M × M, there are M^2 unknown ABRDF tensors, which are
interpolated from the control tensors (Eq. 2–6). We use a uniform D × D grid of control
tensors, which translates to 3D^2, 10D^2 and 21D^2 unknown control tensor coefficients for
1st, 3rd and 5th order tensors respectively. A value for D is chosen according to the desired
smoothness. For the cases when the number of unknowns per control tensor is one more
than the number of data constraints, we use an additional constraint which discourages
solutions with large norms. This is enforced by adding the term \lambda \sum_{ij} \sum_{klm} T_{ijklm}^2 to the
error function in Eq. 2–7, where λ is the regularization constant.
We recover the unknowns in Eq. 2–7 using the gradient descent method with the
control tensor coefficient field initialized using all-ones unit vectors. This technique can be
efficiently implemented because we have obtained the closed form for the derivative of the
objective function with respect to the unknown coefficients as
\partial E(T_{ijklm}) / \partial T_{ijklm} = 2 \sum_{q=1}^{Q} \sum_{t_x, t_y} \Big( \sum_{(i,j) \in D} N_{i,4}(t_x) N_{j,4}(t_y) T_{i,j}(v_q) - I_q(t_x, t_y) \Big) N_{i,4}(t_x) N_{j,4}(t_y) v_{q1}^k v_{q2}^l v_{q3}^m.    (2–8)
Once the coefficients have been recovered, images under a novel illumination direction
v can be synthesized by evaluating the ABRDF field in the direction v, where each
ABRDF is given by Eq. 2–5. Any negative values obtained in Eq. 2–5 are set to zero
(as in the Lambertian model). Furthermore, since the Tensor Spline is a continuous
function, the generated images can be readily up-sampled by evaluating Eq. 2–5 on a
denser sampling lattice.
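As an illustration of this relighting step, the sketch below evaluates a 3rd order Cartesian tensor ABRDF in a unit lighting direction and clamps negative responses to zero. The exponent ordering in EXPONENTS is our assumption and must match the coefficient layout used during estimation.

```python
# The 10 exponent triples (k, l, m) with k + l + m = 3; the ordering is an
# assumption and must match the layout of the estimated coefficients.
EXPONENTS = [(k, l, 3 - k - l) for k in range(4) for l in range(4 - k)]

def eval_tensor(coeffs, v):
    # Eq. 2-5: sum of T_ijklm * v1^k * v2^l * v3^m over k + l + m = 3.
    return sum(c * v[0] ** k * v[1] ** l * v[2] ** m
               for c, (k, l, m) in zip(coeffs, EXPONENTS))

def relight_pixel(coeffs, v):
    # Negative responses are set to zero, as in the Lambertian model.
    return max(eval_tensor(coeffs, v), 0.0)
```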
2.5 Mixture of Single-lobed Functions
In order to quantitatively validate whether the Tensor Splines provide a good enough
approximation of the ABRDF fields, we present a more expressive model here. This
validation model is more general in the sense that it can accommodate an arbitrarily
large number of lobes to approximate any spherical function. We define it using a
mixture of single-lobed spherical functions. Such a mixture is characterized by a kernel
function, k(\mu_i, v), and a set of mixing weights, w_i, associated with a set of unit
vectors \mu_i as follows
B(v) = \sum_i w_i k(\mu_i, v),    (2–9)
where v is the lighting direction and the vectors µi are uniformly distributed on the unit
sphere.
Of the various choices for single-lobed spherical functions that can be used as the
kernel function k(\mu, v), we picked k(\mu, v) = e^{-\mu \cdot v} - 1 for two reasons: it has a
single peak, and k(\mu, v) = 0 for all v such that v \cdot \mu = 0 (if the viewing and the
illumination directions are perpendicular we expect zero intensity). Note that these two
properties are also satisfied by the Lambertian kernel.
The task of estimating ABRDFs using this mixture model requires us to recover the
unknown weights such that the weighted combination yields a spherical function which
closely approximates the ABRDFs. Given a set of N facial images with the same fixed
pose and the associated lighting directions v_n, we can set up an N × M matrix A_{n,m}
by evaluating e^{-\mu \cdot v} - 1 for every v_n and \mu_i, where M is the number of \mu_i picked
in the model. The unknown weights (Eq. 2–9) for each pixel can then be estimated by
solving the over-determined system AW = B, where B is the N-dimensional vector of the
intensities at a fixed pixel in the N given images, and W is the vector of the unknown
weights. Since the ABRDF is a nonnegative function, we solve this system with the
positivity constraint using the non-negative least squares minimization algorithm
developed in [49].
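A minimal sketch of the per-pixel mixture fit, assuming the kernel sign convention used in the text and substituting SciPy's non-negative least squares solver for the algorithm of [49]:

```python
import numpy as np
from scipy.optimize import nnls

def fit_mixture_weights(intensities, light_dirs, mus):
    # intensities: (N,) values of one pixel across the N input images.
    # light_dirs:  (N, 3) unit lighting directions v_n.
    # mus:         (M, 3) unit vectors mu_i sampled uniformly on the sphere.
    # Kernel matrix A[n, i] = k(mu_i, v_n) = exp(-mu_i . v_n) - 1  (Eq. 2-9).
    A = np.exp(-(light_dirs @ mus.T)) - 1.0
    weights, _residual = nnls(A, intensities)   # enforces w_i >= 0
    return weights
```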
Note that this model would generally have a very large number of unknowns
(depending on the resolution chosen while picking the \mu_i), and would thus require a
large number of samples of the ABRDF field (images) for accurate recovery of the
ABRDFs. But since this model is only used as a tool to evaluate the Tensor Splines, this
is not considered a drawback.
2.6 Recovering Shape from the ABRDF Field
The facial ABRDF is in part characterized by the local surface normal, and hence it
should be possible to recover shape information from it. But unlike the various popular
parametric reflectance models, such as the Lambertian, Torrance-Sparrow ([2]) and
Phong ([1]) models, which explicitly assume a role for the surface normal in their
formulae, Tensor Splines make no such assumption. This allows spatially varying and
accurate approximation of the ABRDFs, but also makes the recovery of the surface
normals non-trivial.
To recover the surface normals from the Tensor Splines model we invoke the local
homogeneity of the ABRDF field. This assumption is physically sound because the
reflectance properties of a human face do not change drastically in small neighborhoods
(3 × 3 pixels), and mathematically robust because the Tensor Splines model ensures that
the coefficients vary smoothly across the ABRDF lattice. We assume that the ABRDFs
at two neighboring pixels have the same shape and differ only by a rotation R; thus, if
the surface normal at one of these pixels is known, the surface normal at the other pixel
can be derived by rotating it by R.
For a given internal pixel (x, y) in the image, there are eight immediate neighbors.
If the surface normal at (x, y) is inferred as described above, it receives eight
suggestions for possible surface normals (assuming that the surface normals of the
neighbors are known). Instead of picking one of the suggestions as its surface normal,
we take a weighted geodesic average of the suggested normals. The weights are set to be
inversely proportional to the registration error obtained during the rotation-alignment of
the ABRDF pairs. There are two main advantages of computing the surface normal in this
manner. Firstly, being an aggregate statistic, the geodesic mean is more robust to noise
than the individual suggestions. Secondly and more importantly, the weighted nature of
the mean ensures that suggestions, which originate from neighbors whose ABRDFs are
very different in shape than the ABRDF at (x, y), are automatically weighted less. This
property of the weighted mean is especially useful at locations in the image where the
homogeneity assumption breaks down, e.g. shadow edges.
This process is summarized in Fig. 2-3 ([5]), where the central ABRDF, (A), is shown
to align with higher accuracy to its left neighbor, ABRDF (B), than to its right
neighbor, ABRDF (C). For both cases, the before- and after-alignment configurations are
shown from two different viewpoints. As mentioned before, the misalignment error is
used to weight the normal suggestion from a neighbor, and hence the suggestion from
the left ABRDF, (B), is eventually weighted more than the suggestion from the ABRDF
on the right, (C).
Once the rotation matrices for all the pixels in the image have been computed, we
initialize all the normals with the directions in which the ABRDFs have their maxima.
This initialization is followed by weighted geodesic mean computations, which provide us
with a robust estimate of the surface normals. The process of mean computation is carried
out iteratively, but empirically we noticed that good results are obtained in all cases
with 1 or 2 iterations. Note that using the maxima directly as normal estimates produces
inaccurate results. We attribute this to the fact that, unlike some reflectance models (e.g.
the Lambertian), the Tensor Splines model does not enforce that the maximal response of
the ABRDF lies along the surface normal direction.
2.6.1 Rotation Estimation
Recovering the surface normal field using the steps described above requires
computation of the rotation matrices for each pair of neighboring ABRDFs in the image.
A simple but computationally intensive approach would be to search for the rotation
matrix using a gradient based constrained optimization technique. More concretely,
Figure 2-7. The 1st image is generated using the Tensor Splines model, the 2nd image is the ground truth and the 3rd is generated using the Lambertian model. The cast shadows and the specularities are rendered much more realistically by the Tensor Splines model than by the Lambertian model.
according to this scheme, two ABRDFs represented by their Cartesian tensor coefficients
w1 and w2, can be aligned by minimizing the following objective function
E(R) = \sum_{v \in S^2} \big( w_1^T B(v) - w_2^T B(R v) \big)^2,    (2–10)

such that

R^T R = I,    (2–11)
where the unit vectors v are obtained by some uniform sampling of the sphere, B is the
vector of Cartesian tensor basis defined in Eq. 2–1 and R is the sought rotation matrix.
This method for rotation matrix recovery would require the nonlinear optimization to
be run ∼ 8L^2 times for an image of size L × L pixels. Even for an average-sized image
this process can be quite intractable, and hence we propose the following more efficient
algorithm for rotation matrix recovery.
Let T_1(v) and T_2(v) be the two ABRDFs (Eq. 2–1) that need to be aligned via a
rotation. This implies that we seek a \delta v such that

T_1(v) = T_2(v + \delta v).    (2–12)

Since the ABRDFs are from neighboring pixels, we assume that the required \delta v is
small, and thus using a first-order Taylor expansion we get

T_1(v) = T_2(v) + \nabla T_2(v)^T \delta v.    (2–13)
Figure 2-8. Distribution of errors as the configuration of the input changes. The X-axis represents the azimuth and the Y-axis represents the elevation angles. Hotter colors show larger errors. The white dots represent the exact directions of illumination in the images used as input.
As we expect

L v = v + \delta v,    (2–14)

where L is the linear transformation containing the rotation matrix, we get

T_1(v) - T_2(v) + \nabla T_2(v)^T v = \nabla T_2(v)^T L v,    (2–15)

which leads to the linear system

A x = B,    (2–16)

where the ith row of A contains the vectorized entries of \nabla T_2(v_i) v_i^T, x contains
the vectorized entries of L, the ith entry of B is T_1(v_i) - T_2(v_i) + \nabla T_2(v_i)^T v_i,
and the v_i are unit vectors obtained from a uniform sampling of the sphere. The
embedded rotation matrix R can then be recovered from L using the QR decomposition.
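A sketch of this linearized alignment under stated assumptions: T1, T2 and gradT2 are caller-supplied callables (hypothetical names) evaluating the two ABRDFs and the gradient of the second, and the QR step follows the text.

```python
import numpy as np

def estimate_rotation(T1, T2, gradT2, dirs):
    # Build the linear system of Eq. 2-16 from uniformly sampled unit
    # vectors dirs (S, 3) and solve for L in the least-squares sense.
    A = np.zeros((len(dirs), 9))
    b = np.zeros(len(dirs))
    for i, v in enumerate(dirs):
        g = gradT2(v)                      # gradient of T2 at v (3-vector)
        A[i] = np.outer(g, v).ravel()      # vectorized grad(T2)(v_i) v_i^T
        b[i] = T1(v) - T2(v) + g @ v       # right-hand side of Eq. 2-15
    L = np.linalg.lstsq(A, b, rcond=None)[0].reshape(3, 3)
    Q, _ = np.linalg.qr(L)                 # extract the embedded rotation
    return Q
```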
2.6.2 Surface Normal Computation
As described earlier, the surface normal, n, at a pixel (x, y) with ABRDF T can be
computed by taking a weighted geodesic mean of the normals suggested by its neighboring
pixels. Let all of its P immediate neighbors be indexed 1 . . . P, with corresponding
ABRDFs T_p, normals n_p, and rotation matrices R_1, . . . , R_P computed using the
process described above. The normal at (x, y) is then given by

n = \arg\min_{\mu} \sum_{p=1}^{P} \frac{1}{\|R_p T_p - T\|^2} \, d(n_p, \mu)^2,    (2–17)
where d(·) is the geodesic distance defined on the space of unit normals, i.e., the arc
length. We seek a geodesic mean because the domain of unit normals is the unit sphere
and not the Euclidean space. This mean is also known as the weighted Karcher mean and
can be computed using the following iterative scheme:
\mu \leftarrow \exp_\mu(\varepsilon \nu)    (2–18)

\nu = \frac{1}{P} \sum_{p=1}^{P} \frac{1}{\|R_p T_p - T\|^2} \exp_\mu^{-1}(n_p)    (2–19)

where \exp, the exponential map, is given as

\exp_\mu(\varepsilon \nu) = \cos(|\varepsilon \nu|) \mu + \sin(|\varepsilon \nu|) \, \nu / |\nu|    (2–20)

and \exp_\mu^{-1}(n_p), the log map, is defined as

\exp_\mu^{-1}(n_p) = u \cos^{-1}(\langle \mu, n_p \rangle) / \sqrt{\langle u, u \rangle}    (2–21)

where

u = n_p - \langle n_p, \mu \rangle \mu,    (2–22)
and ε is the iteration step size. For more details on computing means on manifolds see [50]
and references therein.
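A sketch of the iterative scheme above, assuming a small fixed step size and the one-to-two iterations reported later in the chapter; the function and argument names are illustrative.

```python
import numpy as np

def weighted_karcher_mean(normals, weights, mu, eps=0.5, iters=2):
    # Weighted Karcher mean on the unit sphere (Eqs. 2-18 to 2-22).
    # normals: (P, 3) suggested unit normals n_p; weights: (P,) values
    # 1 / ||R_p T_p - T||^2; mu: initial unit-vector estimate.
    for _ in range(iters):
        nu = np.zeros(3)
        for n, w in zip(normals, weights):
            u = n - np.dot(n, mu) * mu                       # Eq. 2-22
            nrm = np.linalg.norm(u)
            if nrm > 1e-12:                                  # log map, Eq. 2-21
                nu += w * u * np.arccos(np.clip(np.dot(mu, n), -1.0, 1.0)) / nrm
        nu /= len(normals)
        step = eps * nu
        t = np.linalg.norm(step)
        if t < 1e-12:
            break
        mu = np.cos(t) * mu + np.sin(t) * step / t           # exp map, Eq. 2-20
        mu /= np.linalg.norm(mu)
    return mu
```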
2.6.3 Shape Recovery
Once the normal field has been computed, we use one of the standard techniques
([51]) to recover the surface. If z(x, y) defines the surface, the normal at a location (x, y)
is given by (zx zy −1)T where zx and zy denote the partial derivatives of the surface with
respect to x and y. If (nx ny nz)T denotes the surface normal at location (x, y), we have
Figure 2-9. Per pixel intensity error comparison. The blue color shows the errors on the 1st and 2nd subsets combined, which contain lighting directions φ smaller than 25◦, the green color shows the errors on the 3rd subset with 25◦ < φ < 50◦, and the red color shows the errors on the 4th subset with 50◦ < φ < 70◦.
the following relations

z_x = -n_x / n_z    (2–23)
z_y = -n_y / n_z.    (2–24)
Using the forward difference approximation of the partial derivatives, we obtain the
following two equations

n_z z(x+1, y) - n_z z(x, y) = n_x    (2–25)
n_z z(x, y+1) - n_z z(x, y) = n_y,    (2–26)
which provide a linear relation between the surface values at the grid points and the
known surface normals. The surface can thus be recovered by solving an over-determined
system of linear equations. At the boundary points, the above formulation is not valid
and the surface is recovered by solving the following equation, obtained by eliminating
n_z above:

n_x z(x, y) - n_x z(x, y+1) = n_y z(x+1, y) - n_y z(x, y).    (2–27)
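The interior relations translate directly into a sparse least-squares problem. The sketch below uses the sign that follows from Eqs. 2–23 and 2–24, SciPy's LSQR solver, and omits the boundary equation (Eq. 2–27) for brevity; it is an illustration, not the dissertation's implementation.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def integrate_normals(nx, ny, nz):
    # Recover z(x, y) from the (H, W) normal field via the forward-difference
    # relations derived from Eqs. 2-23 and 2-24, solved in a least-squares sense.
    H, W = nz.shape
    idx = lambda x, y: y * W + x
    A = lil_matrix((2 * H * W, H * W))
    b = np.zeros(2 * H * W)
    r = 0
    for y in range(H):
        for x in range(W):
            if x + 1 < W:    # n_z (z(x+1, y) - z(x, y)) = -n_x
                A[r, idx(x + 1, y)] = nz[y, x]
                A[r, idx(x, y)] = -nz[y, x]
                b[r] = -nx[y, x]
                r += 1
            if y + 1 < H:    # n_z (z(x, y+1) - z(x, y)) = -n_y
                A[r, idx(x, y + 1)] = nz[y, x]
                A[r, idx(x, y)] = -nz[y, x]
                b[r] = -ny[y, x]
                r += 1
    z = lsqr(A.tocsr()[:r], b[:r])[0]
    return z.reshape(H, W)
```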
2.6.4 Novel Pose Relighting
With the facial shape in hand, novel poses can be rendered by simply changing the
viewpoints. But generating novel illumination conditions in the novel pose is not trivial as
Figure 2-10. Shapes recovered using 9 images. For each of two subjects: A/E) 9 input images, B/F) quiver plot of the normal field, C/G) color encoded normals (R-x, G-z, B-y), and D/H) recovered shape in a novel pose.
the ABRDFs estimated from a different pose cannot be directly used. If the ABRDF field
was estimated in pose P1 and if we wish to generate an image with a novel illumination in
a new pose P2, we have to rotate the ABRDFs by the same rotation which is required to
change P1 to P2. Once the orientations of the ABRDFs have been rectified, images of the
face in the new pose with novel illumination can be generated by evaluating the rectified
ABRDF field in the desired directions.
We would like to point out that specularities are view dependent and, strictly speaking,
cannot be directly transferred from one pose to another. Most of the existing
Lambertian methods ignore this effect, but the few which deal with this problem handle
it either by explicitly obtaining the specular component using polarized lighting (e.g.
[26], [8]), which requires specialized data acquisition, or by assuming a parametric form
for the specular component of lighting (e.g. [20]).
Our Cartesian tensor representation for the ABRDFs does not discriminate against
specularities and estimates the ABRDF as faithfully as possible from the available
intensity values. Thus, it should be possible to recover and manipulate the specular component
separately, but at this stage, we have made the assumption that specularities do not
change drastically across facial poses. The validity of this assumption is supported by the
results presented in the next section.
2.7 Experimental Results
In order to evaluate the proposed methods for relighting and shape recovery, we
conducted several detailed experiments which are presented here. Since it has been shown
for the popular Lambertian model that the space of images with illumination variation
can be approximated quiet accurately using a 9 dimensional subspace ([7], [9]), we have
taken on the challenge of also working with just 9 image. Note that with 9 samples of
the ABRDF field and the solution norm minimization constraint, at most 10 unknown
coefficients can be recovered per pixel and hence our central results use bi-cubic 3rd order
Tensor Splines.
The experiments were carried out on the Extended Yale B [37] (38 subjects, in 9
poses and 64 illumination conditions) and the CMU PIE [52] (68 subjects in 13 poses and
43 illumination conditions) benchmark databases. Note that the CMU PIE has 21 usable
point source illuminated images while in the Extended Yale B all 64 illuminations are
point source.
2.7.1 Relighting Faces
We begin by noting that the Tensor Splines model can capture the non-trivial shape
of the facial ABRDFs. In Fig. 2-4 ([5]) we show the ABRDF field of a subject from the
Extended Yale B database estimated using 9 images. Three different regions of the face
have been shown in detail where complicated shapes of the ABRDF can be noticed.
The regions A and B have more complicated shapes because these ABRDFs have to
accommodate shadows. The spherical functions in the image have been color coded based
on their maximal value directions. The mapping of the directions to colors is provided in
the lower right corner.
Next we present results for relighting of faces in novel illumination directions. In
Fig. 2-5 ([5]) four different subjects lit in various novel point source illuminations are
depicted. For the first two rows, the illumination direction varies across the azimuth angle
while in the next two rows, the variation is in the elevation angle. It can be noticed that
our method can accurately interpolate as well as extrapolate from the images provided
as input. Further, difficult effects like the cast shadows and the specularities have been
photo-realistically rendered without using any additional ray tracing.
Starting with 9 images, our technique estimates the entire ABRDF field and thus
images lit in fairly complex lighting conditions can also be rendered. In Fig. 2-6 ([5]) we
present such results for two subjects from the CMU PIE database. Below each image is its
lighting condition. The first image of each subject is one of the nine input images used to
estimate their ABRDF fields. The next two images for each of the subjects are lit by the
light probes ([53]) named Eucalyptus Grove and St. Peter’s Basilica respectively. For these
color images, we estimated the ABRDF field for each color channel separately. The images
were relighted by taking a weighted combination of the point source lit images where the
weights were determined using the light probes. We used 2500 samples of the light probe
to render these images.
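A sketch of this weighted combination, assuming a caller-supplied point-source renderer and a pre-sampled probe; the normalization of the probe weights is our assumption.

```python
import numpy as np

def relight_with_probe(render_point_source, probe_dirs, probe_radiance):
    # render_point_source(v) -> (H, W) image lit from unit direction v.
    # probe_dirs: (S, 3) sampled probe directions (S ~ 2500 in the text);
    # probe_radiance: (S,) radiance of the light probe at each sample.
    weights = probe_radiance / probe_radiance.sum()
    return sum(w * render_point_source(v)
               for w, v in zip(weights, probe_dirs))
```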
In Fig. 2-7 ([5]) we provide a qualitative comparison between our method and the
Lambertian model. The first face image in Fig. 2-7 ([5]) is rendered using the Tensor
Splines model, the second is the ground truth and the third image is rendered using the
Lambertian model. It can be readily noted that the image obtained using our method is
closer to the ground truth than the one rendered using the Lambertian model. The arrows
show the locations of important differences – the cast shadows and the specularities.
The next two experiments try to quantitatively capture the performance of our
method. First, we explore the impact of the input data on the estimation when the tensor
order is fixed (3rd order in this case). For this we use the Extended Yale B dataset as it
provides the ground truth for 64 directions. To set a baseline, we estimated the ABRDF
field for 10 subjects using all the 64 images as input, rendered images in the same 64
directions and computed the total error with respect to the ground truth images. Next,
the errors were computed similarly for 3 other cases where only 9 images were used as
input to our method, but in different configurations. Two of the cases had images with
illumination directions uniformly distributed in front of the face while one had images with
the directions biased towards one side.
To visualize the distributions of the obtained errors, we color coded them, with hotter
colors denoting larger errors, and plotted them as continuous images in Fig. 2-8 ([5]).
The X axes of these images show the azimuth angles varying from −130◦ to 130◦ (from
the leftmost white dot to the rightmost) and the Y axes show the elevation angles varying
from −40◦ to 90◦ (from the topmost white dot to the bottommost). The white dots in
these images show the exact direction of illumination in the images used as input. It can
be readily noted that when all the 64 images are used as input, Fig. 2-8A, the error is the
Figure 2-11. Detailed pose variation, with the texture-less rendering in the upper right and the depth-map in the lower right. Accurate renderings can be noticed even in extreme poses.
least. For the 9 image cases, Fig. 2-8B and Fig. 2-8C, where the illumination directions
of the input images are somewhat uniformly distributed, the error is more than that in
Fig. 2-8A, but notably less than the case when the distribution is skewed in one direction,
Fig. 2-8D. Hence, as expected, our method performs better when the input images have
the lighting directions that are uniformly sampled from the sphere. Moreover, the errors
in all the cases are concentrated towards the extreme illumination angles and for the near
frontal illumination conditions, the performance is not particularly affected by the input
image distribution.
Next we present a quantitative comparison of our method with the Lambertian
model and the validation model presented in Section 2.5. A natural question that arises
is why an order 3 Cartesian tensor should be suitable for estimating the facial ABRDFs.
To answer this question, we computed the average intensity error per pixel over all 38
subjects in the 64 illumination directions of the Extended Yale B dataset using the
Lambertian model, 3rd order Tensor Splines, 5th order Tensor Splines and the mixture of
single lobed functions (Eq. 2–9). All 64 illumination directions were used for the mixture
model (on account of the large number of unknowns), while for the other three, only 9
images (configuration shown in Fig. 2-8B) were used. We set the µi values required for
the mixture model using a dense sampling (642 directions) of the unit sphere obtained by
the 4th-order tessellation of the icosahedron. We have presented results in Fig. 2-9 ([5])
split along the standard subsets (subset 4 in red, 3 in green and 1 + 2 in blue) of the
Extended Yale B database. As expected, the error for the subset with extreme lighting
(subset 4) is more than the other sets, for all methods. More importantly, even with a
considerably large amount of input data and a very flexible estimation model, the errors
obtained from the mixture model are quite similar to those obtained from the 3rd order
Tensor Splines model. This indicates that though a 3rd order Tensor Splines model can
only accommodate three lobes, for most facial ABRDFs this suffices. The 3rd order Tensor
Splines model outperforms the Lambertian model and even the 5th order Tensor Splines
model, which suggests possible over-fitting in the 5th order model.
2.7.2 Estimating Shape
All the results presented so far assumed a fixed pose, but using the technique
presented in Section 2.6 we can simultaneously vary the illumination and the pose of a
face. Fig. 2-10 ([5]) summarizes the results produced by our shape recovery algorithm
for one subject each from the Extended Yale B and the CMU PIE databases. The
first column shows the 9 input images, the second column shows the quiver plot of the
estimated normal field (zoom in to see details), the third column presents the surface
normal information in a color coded form (x components of the normal field are mapped
to the red channel, y components to the blue channel and z components to the green
channel) and the fourth column shows the recovered shape in a novel pose. For the case
of color images, shape estimation was carried out using only the luminance component.
In both the cases, occlusion of appropriate regions of the face due to pose change can be
noted from the images in the fourth column.
Figure 2-12. Shape comparison with the Robust Photometric Stereo on the Extended Yale B dataset
In Fig. 2-11 ([5]) we present more detailed results for pose variation with fixed
illumination. The 3 rows of images show a subject from the Extended Yale B in different
poses ranging from the right profile to the left profile, as we go from left to right, and
viewpoint varying from below the face to above the face, as we go from top to bottom.
Note that the ABRDF field for this subject was recovered using just 9 images under the
illumination configuration shown in Fig. 2-8B. The recovered shape for the same subject,
rendered with constant albedo and specularities, is also presented at the right end of the
figure. This allows finer details of the shape to be shown without any texture to bias the
observer. Finally, to the lower right of the figure is the height map for the same subject. It
can be noted that our shape recovery algorithm can produce good results without making
the simplifying Lambertian assumption.
We present face shapes estimated by our method and the Robust Photometric
Stereo [54] (9 input images for both methods) in Fig. 2-12 ([5]). Results for four
different subjects, both with and without texture, are presented. Based on the results,
the following conclusions can be drawn: First, since the Tensor Splines method imposes
local smoothness, the recovered shape lacks some minute details, like the mole on the
chin in case (a), as compared to the Robust Photometric Stereo. Second, since the
Tensor Splines method handles the cast shadows and the specularities more seamlessly
than the Robust Photometric Stereo, regions affected by the cast shadows and the
specularities, especially the nose, are better recovered by the Tensor Splines method.
This can be readily noted in cases (a), (c) and (d). Finally, the Tensor Splines method
seems to produce better global shape estimates. For instance, in case (a), the shape
recovered by the Robust Photometric Stereo is tilting backwards towards the top. In
case (c), the region around the mouth seems unnaturally warped in the Robust
Photometric Stereo result, while in case (d), the relative positioning of the nose and the
eyes seems more realistic in the
Tensor Splines results. In summary, these results demonstrate that the Tensor Splines
method may lack minute details but models facial features in a more photo-realistic
fashion than the Robust Photometric Stereo, when the input images have cast shadows
and specularities.
Finally, we present results when both the pose and the illumination conditions are
simultaneously varied. In Fig. 2-13 ([5]) one subject each from the CMU PIE and the
Extended Yale B databases are shown in various poses and illumination conditions. The
ABRDF fields for both the cases were recovered using 9 images, and the shape for the
color images were recovered using the luminance channel. With the change of pose, we
have retained the ABRDF field learnt using the frontal pose, but it can be noted that the
results are photo-realistic even when the specularities are not explicitly modified and
transferred.
2.7.3 Face Recognition
These results are presented here as an illustration that meaningful relighting can
enhance the face recognition results even when a simple classifier like the Nearest Neighbor
Figure 2-13. Simultaneous pose and illumination variation
classifier is used. We will discuss face recognition in more detail in Chapter 5 and
Chapter 6.
Face recognition is one of the most popular applications of facial image analysis. It
is generally defined as follows: given a database of facial images of various people, called
the gallery, identify the person in a novel test image, called the probe, as one of the people
present in the gallery. The degree of difficulty of this problem increases as the differences
between the probe and the gallery images increase. This difference could be due to the
illumination conditions, occlusion, expression, pose or any combination of these.
In recent times, illumination invariant face recognition has attracted particular
interest due to advances in our understanding of the reflectance modeling. Here we present
a comparative study of illumination invariant face recognition. When using the Tensor
Splines method, we assume that for each subject 9 gallery images with known illumination
directions are available. From these images we compute the ABRDF fields and generate
images with novel illumination for a dense sampling of directions. This step expands our
collection of 9 gallery images to any desired size. The probe image is then matched to all
Table 2-2. Face recognition error rates. N is the gallery set size.

Method                 N   Subset 1&2   Subset 3   Subset 4   Total
Correlation [55]       4   0.0          23.3       73.6       29.1
Eigenfaces [56]        6   0.0          25.8       75.7       30.4
Linear subspace [57]   7   0.0          0.0        15.0       4.7
Cones-attached [12]    7   0.0          0.0        8.6        2.7
Cones-cast [12]        7   0.0          0.0        0.0        0.0
9PL [37]               9   0.0          0.0        2.8        0.8
3D SH [11]             1   0.0          0.0        2.8        0.8
Harmonic (SFS) [58]    1   0.0          0.0        12.8       4.0
Tensor Splines         9   0.0          0.0        1.6        0.5
the images in the database and the subject with the closest matching (in L2 sense) gallery
image is assumed to be the correct identity of the probe image.
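A minimal sketch of this nearest-neighbor matching step; the array layout is our assumption.

```python
import numpy as np

def identify(probe, gallery_images, gallery_ids):
    # probe: (H, W) image; gallery_images: (G, H, W) relighted gallery;
    # gallery_ids: (G,) subject labels. Returns the closest identity in L2 sense.
    dists = np.linalg.norm(gallery_images - probe, axis=(1, 2))
    return gallery_ids[int(np.argmin(dists))]
```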
We have used the Extended Yale B data for this experiment primarily because most
of the existing methods have presented face recognition results on the same database. As
mentioned before, this database is divided into 4 subsets with the lighting getting more
and more extreme as we go from subset 1 to 4 and thus, the difficulty in classifying the
images from these subsets also increases. The obtained recognition error rates are reported
in Table 2-2. We have also presented results reported by existing methods and their
respective references are listed next to the name of the method. Results for the first seven
techniques are taken from [11] and the rest are taken from the respective references. Along
with the error rates, we have also listed the number of images required by each method
in the gallery set. For our method we used the nine images in the configuration shown in
Fig. 2-8B ([5]). It can be noted that even with the naive nearest neighbor classification
strategy our method produces near perfect results.
2.8 Conclusions
In this chapter we have presented a novel comprehensive system for capturing
the reflectance and the shape of human faces using Tensor Splines. Since our method
requires at least 9 input images with known illumination directions, we fall short of the
ideal solution described in the introduction, but show an improvement over the popular
Lambertian model. Accurate recovery of the ABRDF field from a single image with
cast shadows and the specularities with no lighting information remains a challenge. The
central problem in the single image case stems from the dearth of information to constrain
the space of all possible ABRDF fields. Use of strong prior information presents itself as a
potentially effective way to constrain the search space, but attempts so far (e.g. [15]) suffer
from the need for manual intervention and cumbersome computational requirements.
We will explore the use of Tensor Splines ABRDF fields as prior information to
meaningfully predict the ABRDF fields from single input images in Chapter 4. The use of
a shape prior can also potentially aid in shape recovery.
While relighting images in novel poses, we make the assumption that the ABRDF
field maintains the same specular information across poses. Though practically useful,
this is not fully valid. We have dealt with the specularities in a data-driven fashion, but
an attempt can be made to model the specularities explicitly, which we would like to
explore in the future. It should be noted that though the problem of detecting specularities
is relatively well studied, the problem of realistically predicting specularities in novel
poses without using specialized imaging tricks (like special filters) remains challenging.
Possible improvements can also be made to our model by incorporating non-uniform
smoothness as opposed to the current setup.
Besides the relighting and the pose change applications described in the chapter, our
technique can also be used for image up-sampling and compression. The former is possible
because the Tensor Splines representation creates a continuous field of the ABRDF
coefficients across the image, which can be sampled at a sub-pixel resolution. The latter
exploits the capability of the ABRDFs to represent images of a face under infinitely many
lighting directions using just a few coefficients per pixel.
In conclusion, the Tensor Splines framework for the analysis and modeling of the
illumination and the pose variation of facial images provides a useful alternative to the
Lambertian assumption. It also seems that the collective analysis of the shape and the
reflectance through the ABRDFs is promising as an alternative to separate facial BRDF
and shape analysis.
CHAPTER 3
EIGENBUBBLES: THE ENHANCED ABRDF REPRESENTATION
3.1 Introduction
Thus far we have presented a method for ABRDF field estimation using nine
or more samples provided as input. But since we are interested in a very specific class
of spherical functions, the human facial ABRDFs, we should be able to improve their
representation by including knowledge derived from observing and analyzing the
ABRDFs from various locations on various faces. To this end, we seek a few salient
spherical functions which define a natural subspace for representing the ABRDFs.
3.2 Eigenbubbles
An intrinsic subspace for the ABRDF representation is the one spanned by the
cluster centers obtained from k-means clustering of the ABRDFs. We can avoid the
initialization issues associated with clustering by using the fact that such a subspace
can also be obtained by the spectral expansion of the data covariance matrix [59]. The
basis elements of such a subspace are themselves spherical functions, and we call them
Eigenbubbles.
The only change that we make to the ABRDF representation from Chapter 2 is
to replace the Cartesian tensors in the Tensor Splines representation of the spherical
functions with the Spherical Harmonics basis. We denote the real spherical harmonic
basis functions as \Psi_l^m (order l, degree m), with l = 0, 1, 2, . . . and -l \le m \le l:

\Psi_l^m(\theta, \phi) = \sqrt{\frac{2l+1}{4\pi} \frac{(l-m)!}{(l+m)!}} \, P_l^{|m|}(\theta, \phi) \, \Phi_m(\theta, \phi),    (3–1)

where the P_l^{|m|} are the associated Legendre functions and \Phi_m(\theta, \phi) is defined as

\Phi_m(\theta, \phi) = \begin{cases} \sqrt{2} \cos(m\phi) & m > 0, \\ 1 & m = 0, \\ \sqrt{2} \sin(|m|\phi) & m < 0. \end{cases}    (3–2)
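A sketch of this basis evaluation using SciPy's associated Legendre function; here theta is the polar angle and phi the azimuth, and the use of |m| in the normalization factor is our assumption (the standard real-harmonic convention).

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def real_sph_harm(l, m, theta, phi):
    # Eqs. 3-1 and 3-2: real spherical harmonic basis Psi_l^m(theta, phi).
    norm = np.sqrt((2 * l + 1) / (4 * np.pi)
                   * factorial(l - abs(m)) / factorial(l + abs(m)))
    P = lpmv(abs(m), l, np.cos(theta))        # associated Legendre P_l^{|m|}
    if m > 0:
        Phi = np.sqrt(2.0) * np.cos(m * phi)
    elif m < 0:
        Phi = np.sqrt(2.0) * np.sin(abs(m) * phi)
    else:
        Phi = 1.0
    return norm * P * Phi
```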
We chose the Spherical Harmonics basis for this enhanced representation since
empirically we found that it leads to more accurate relighting after the ABRDF
enhancement than the Cartesian tensor basis. Note that the objective function for
ABRDF estimation, the spline based smoothness constraints and the numerical optimization
scheme remain exactly the same as before.
Now, given a bag of N ABRDFs \{\alpha_i\} (where the \alpha_i are the Spherical Harmonic
coefficient vectors of the ABRDFs), we define the mean ABRDF as

\bar{\alpha} = \sum_i \alpha_i / N.    (3–3)

The data covariance matrix can thus be defined as

C = \sum_i (\alpha_i - \bar{\alpha})(\alpha_i - \bar{\alpha})^T / N.    (3–4)
Note that even though the number of ABRDFs can be large, the covariance matrix
has dimensions 10 × 10 for the third order ABRDF estimation. Next we decompose the
square matrix C into its eigenvectors (V) and eigenvalues (U) as

C = V U V^T.    (3–5)

Arranged in decreasing order of the corresponding eigenvalues, the eigenvectors, in our
case the Eigenbubbles, define a low-dimensional subspace for the ABRDF representation.
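The Eigenbubble computation reduces to a small eigendecomposition; a minimal sketch, assuming 3rd order spherical-harmonic coefficient vectors of length 10:

```python
import numpy as np

def eigenbubbles(alphas):
    # alphas: (N, 10) spherical-harmonic coefficient vectors of the ABRDFs.
    mean = alphas.mean(axis=0)                       # Eq. 3-3
    centered = alphas - mean
    C = centered.T @ centered / len(alphas)          # Eq. 3-4: 10 x 10 covariance
    evals, evecs = np.linalg.eigh(C)                 # Eq. 3-5
    order = np.argsort(evals)[::-1]                  # decreasing eigenvalue
    return mean, evals[order], evecs[:, order]       # columns are Eigenbubbles
```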
At this point we define two kinds of Eigenbubbles - Local and Global. Global
Eigenbubbles are defined to be the ones obtained by putting the ABRDFs from all
different locations and different individuals into the initial bag of ABRDFs. On the other
hand, Local Eigenbubbles are those obtained separately for each pixel, by considering
only the ABRDFs lying at the same (corresponding) pixel locations, from all the given
faces. For this we only require a rough alignment of the faces (e.g. alignment of eyes)
since we assume that the ABRDFs in the same neighborhoods are similar. These two
Figure 3-1. Global Eigenbubbles learnt from the Extended Yale B face database. Eigenvalues increase from left to right in row one and then in row two.
definitions allow us to analyze the impact of using Eigenbubbles for the ABRDF
representation in the next section.
3.3 Experiments & Discussion
3.3.1 Relighting
In Fig. 3-1 the first 10 Global Eigenbubbles learnt from the Extended Yale B
face database are shown. It can be noted that the Eigenbubbles corresponding to
lower eigenvalues consist of a smaller number of blunt lobes, while the Eigenbubbles
corresponding to higher eigenvalues show a larger number of sharper peaks. This indicates
that the Eigenbubbles corresponding to higher eigenvalues encode higher frequency details
of the facial ABRDF fields. As mentioned before, the significance of these Eigenbubbles
lies in the fact that an ABRDF function at any location on any face can be represented
with high accuracy as a linear combination of these spherical functions. Next, we present
the first 10 Local Eigenbubbles in Fig. 3-2, at two different locations on a human face. It
can be noted that the shapes of these two sets of spherical functions are quite different
from each other, as they are learnt using the ABRDF functions from two different regions
of the face. The forehead shows much more uniform variation in intensity values as
compared to the shadow-infested nasal region.
Next we look at the actual ABRDF functions which are represented using the
Eigenbubble framework. In Fig. 3-3 the estimated ABRDF field has been shown
superimposed on a face. Larger images of individual ABRDFs represented using
Figure 3-2. Local Eigenbubbles learnt from two different locations on a face. Eigenvalues increase from left to right in row one and then in row two.
Figure 3-3. Estimated ABRDF functions at various locations on a human face.
Eigenbubbles show that the high fidelity representation of the intensity variation at pixels
affected by the cast shadows and the specularities requires more complicated spherical
functions than the half-cosine bump used in the Lambertian model.
In order to analyze and understand the representational power of the Global and the
Local Eigenbubbles, in the next set of experiments we qualitatively and quantitatively
examine the quality of the synthesized relighted images using the proposed techniques.
Figure 3-4. On the left (5 columns), synthesized images using the Global Eigenbubbles are shown. From top to bottom, each row uses 1, 2, 3, 5 and 10 Eigenbubbles respectively for its ABRDF field representation. The same setup is repeated on the right using the Local Eigenbubbles. The azimuth and the elevation angles of the illumination source are given at the bottom right corner of each image. All ABRDF fields were estimated using 9 input images illuminated from the (-20,60), (0,45), (20,60), (-50,0), (0,0), (50,0), (-50,-40), (0,-35) and (50,40) directions.
First we present synthesized images of a subject from the Extended Yale B dataset
while using varying number of subspace dimensions for its ABRDF field representation
in Fig. 3-4. The first 5 columns on the left show images synthesized using the Global
Eigenbubbles while the last five columns show images synthesized using the Local
Eigenbubbles. From top to bottom each row shows images generated from the ABRDF
field represented using 1, 2, 3, 5 and 10 eigenbubbles respectively. All the ABRDF fields
were estimated using 9 input images. These images have been synthesized using novel
illumination directions whose azimuthal and elevation angles are mentioned in the bottom
right corner of each image. From these images it can be noted, foremost, that as the
subspace dimension is increased, so does the visual quality of the images. Secondly, the
quality of the images produced using the Local Eigenbubbles, especially for low
dimensional subspaces, is better than when the Global Eigenbubbles are used. Thirdly,
for both types
of Eigenbubbles, visually high quality images are rendered for 5 and higher dimensional
representations. Accurate depiction of the shadows, both attached and cast, and the
specularities can be readily noted from the images.
To demonstrate the versatility of the Global Eigenbubbles for accurately representing
the facial ABRDF across different face types and databases, we present novel images
of various faces rendered using our technique in Fig. 3-5. When applied to each channel
independently, the ABRDF framework proposed here extends easily to color images,
as demonstrated in the first four rows, where subjects from the CMU PIE dataset are
rendered under novel illumination conditions. Global Eigenbubbles were learnt for each
channel separately. The bottom two rows show images of two subjects from the Extended
Yale B database. The Global Eigenbubbles used here were the same as those used in
Fig. 3-4. Using faces belonging to different races, we have demonstrated the capability of
the Global Eigenbubbles to represent the ABRDFs of surfaces which can have somewhat
different surface properties. Note that the shadows have been crisply generated and the
specularities have been meaningfully rendered.
Next we quantitatively examine the images generated using our method. For this
experiment we have chosen the Extended Yale B dataset as it provides images of each
subject taken in 64 point illumination conditions which can be used as the ground-truth
for evaluating our method. Besides the proposed enhanced ABRDF representation method
we have also included results from the Tensor Splines method and the Lambertian model,
examined in [60], in order to compare our results with these alternate methods. Using 9
images per subject, we synthesized images in all 64 illumination directions using just the
Spline Modulated Spherical Harmonics, the Local and Global Eigenbubbles, the Tensor
Splines model and the Lambertian model and then computed the pixel-wise intensity
errors for each using the ground-truth images. We have presented these results in Fig. 3-6.
It can be noted that the Lambertian model, with its limited representative power in the
presence of the cast shadows and the specularities performs the worst. As expected, the
Figure 3-5. The first four rows show color images of subjects from the CMU PIE database, while the last two rows show images of subjects from the Extended Yale B dataset. The (azimuth, elevation) angles of the illumination source are mentioned in the bottom right corner of each image. Input images for the CMU PIE subjects were lit from the (32,2), (0,-9), (32,-9), (0,3), (-32,-8), (-32,3), (38,6), (0,10), (-32,5) directions, while for the Extended Yale B subjects we used the same directions as in Fig. 3-4.
Figure 3-6. Per pixel intensity errors obtained using the proposed and the state-of-the-art methods. The Y-axis represents the error while the X-axis represents the dimension of the subspace used in the Eigenbubble representation. Note that the error rates do not vary along the X-axis for the Spline Modulated Spherical Harmonics, the Tensor Splines model and the Lambertian model.
Figure 3-7. Qualitative comparison between the proposed technique, the Tensor Splines model and the Lambertian model [60].
Local Eigenbubbles outperform the Global Eigenbubbles, but both of these techniques
outperform the Tensor Splines model. The improvement in the image quality as the
number of subspace dimensions is increased, as apparent in the images presented in Fig.
3-4, can be clearly noted in Fig. 3-6.
The observed superiority of the proposed technique over the state-of-the-art methods
in terms of relighting with faithful shadows and specularities reproduction can be visually
noted in Fig. 3-7. Here we have shown a representative image rendered using various
Figure 3-8. Novel images generated under extreme illumination directions. The central figure shows the illumination directions of the input images as blue circles and the illumination directions of the novel images as red squares.
methods along with the ground truth image. It can be noted that the Lambertian
model largely fails to model the specularities (green arrow) and the cast shadows (red
arrows). The Tensor Splines model does a better job than the Lambertian model but the
shadows and specularities are smudged on account of its assumed across-field-smoothness.
Our method, Eigenbubbles, on the other hand, produces results which accurately
depict the cast shadows and the specularities. Here we have used the
Local Eigenbubbles which do not suffer from the smoothing artifacts present in the Tensor
Splines model.
3.3.2 Face Recognition
Since our technique can produce novel images using only a few input images, it can
aid face recognition by augmenting the gallery set with new images of the given subjects.
We test the performance of our method in aiding face recognition using the Extended Yale
B database since most of the other methods also present results on this dataset. For each
Table 3-1. Face recognition error rates. N is the number of input images.

Method                            N   Set 1&2   Set 3   Set 4   Total
Correlation (PAMI 1993) [55]      4   0.0       23.3    73.6    29.1
Eigenfaces (CVPR 1994) [56]       6   0.0       25.8    75.7    30.4
Linear subspace (CVPR 2001) [57]  7   0.0       0.0     15.0    4.7
Cones-attached (PAMI 2001) [12]   7   0.0       0.0     8.6     2.7
Cones-cast (PAMI 2001) [12]       7   0.0       0.0     0.0     0.0
9PL (PAMI 2005) [37]              9   0.0       0.0     2.8     0.8
3D SH (PAMI 2006) [11]            1   0.0       0.0     2.8     0.8
Harmonic (SFS) (IJCV 2008) [58]   1   0.0       0.0     12.8    4.0
Tensor Splines                    9   0.0       0.0     1.6     0.5
Eigenbubbles                      9   0.0       0.0     0.9     0.29
subject, 9 images are used as the gallery set while the rest are used as probes. Using the
Global Eigenbubbles, we generate many more images of each subject by uniformly sampling
the illumination direction sphere. The final classification is performed using the Nearest
Neighbor classifier. The results obtained (split along the subsets) are presented in
Table 3-1. The illumination conditions get harsher from subset 1 to 4, and so does the
difficulty of the recognition task. It can be noted that our method produces extremely small error rates
comparable to the state-of-the-art. We must emphasize that in general, face recognition
systems use more sophisticated classifiers than the simple Nearest Neighbor classifier
used by us and thus instead of a face recognition system, our method should be seen as
a database augmentation method which can aid any other face classification method by
adding meaningful gallery images to the training set.
3.3.3 ABRDF Field Compression
A novel application of our method is the compression of the ABRDF fields. We
demonstrated earlier (Fig. 3-6) that the Global Eigenbubble based representation of the
ABRDF can generate high quality images even when only a 5-dimensional subspace is
used to represent the ABRDFs. In terms of space requirements, this translates to storing
only 5 coefficients per pixel and the 5 Global Eigenbubbles, in order to generate images
under any arbitrary illumination condition. If we consider only the 64 point source
lit images present in the Extended Yale B database, it takes about 2 MB of memory
Table 3-2. ABRDF field compression. D is the subspace dimension and the errors are the intensity errors per pixel.

D    Raw       Eigenbubbles   Reduction Ratio   Error/Pixel
1    2016 KB   127.6 KB       6.25 %            18.3
2    2016 KB   253.6 KB       12.50 %           15.0
3    2016 KB   379.6 KB       18.75 %           13.8
4    2016 KB   505.6 KB       25.00 %           13.9
5    2016 KB   631.6 KB       31.24 %           13.8
6    2016 KB   757.6 KB       37.51 %           14.2
7    2016 KB   883.6 KB       43.76 %           13.3
8    2016 KB   1009.6 KB      50.01 %           13.2
9    2016 KB   1135.6 KB      56.26 %           13.1
10   2016 KB   1261.6 KB      62.51 %           12.9
to store the raw 192 × 168 images for each subject, while using the Eigenbubble based
representation with 5 coefficients, it takes less than one third of that memory to
summarize all the ABRDFs. This advantage increases if we consider more than 64 images
since the Eigenbubble based ABRDF representation can generate as many images as
desired. It must be noted that our technique cannot be directly compared with the image
compression methods as the ABRDF fields can generate as many images as desired, while
the image compression techniques cannot. Detailed results for various subspace dimensions
are presented in Table 3-2.
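The raw-storage figure can be verified with a quick calculation, assuming 8-bit pixels for the raw images and 32-bit floats for the Eigenbubble coefficients:

```python
H, W, n_images = 192, 168, 64
raw_kb = H * W * n_images / 1024    # 2016 KB, matching the table
per_dim_kb = H * W * 4 / 1024       # 126 KB per subspace dimension,
                                    # consistent with the ~126 KB row steps
print(raw_kb, per_dim_kb)
```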
3.4 Conclusions
In this chapter we have shown that face-specific knowledge can be used to
enhance the quality of the relighted images. Using the Spherical Harmonics basis and
Principal Component Analysis, we demonstrated that the resulting high quality relighted
images can be used to improve face recognition as well as to effectively summarize large ABRDF
fields using only a fraction of the required memory.
CHAPTER 4
FACE RELIGHTING AND POSE CHANGE WITH SINGLE IMAGE
4.1 Introduction
In Chapter 2 a solution to the facial relighting and pose change problem with multiple
input images was presented; in this chapter we examine a related but harder problem,
relighting and pose change of faces using a single image as input. Most of the
applications associated with multiple-image relighting and pose change, like
post-production video scene editing or face recognition, naturally extend to the case with
a single input image as well.
In terms of ABRDF field estimation, the fundamental difference between the
multiple-image and single-image cases lies in the amount of face-specific background
knowledge required. In the case of multiple input images, since multiple samples of the
ABRDFs are available, interpolation and extrapolation using appropriate mathematical
functions can provide good approximations to the underlying ABRDF field. But in the
case of a single input image, only one sample of the ABRDF at each image pixel is
provided, and hence it becomes imperative to use background knowledge about facial
ABRDFs in order to hallucinate the complete ABRDF.
The above mentioned background knowledge can be derived from a number of
sources. One possible option is to use the sample values at the neighboring pixels in order
to better estimate the ABRDF at a given pixel. Another approach can exploit the known
ABRDF field of some reference face to generate the ABRDF field of the given input
image. This idea can be further extended to use multiple such reference ABRDF fields in
order to generate the ABRDF field of the given input. In this work we use a combination
of these possible techniques to estimate the ABRDF field of a given face.
The rest of this chapter is organized as follows: in Section 2 we begin with a survey of
a few relighting and pose change methods which are specifically meant to work with a
single input image. These may have some overlap with the techniques mentioned in Chapter 2
since the two considered problems are closely related. In Section 3 we present a general
overview of our technique and also explicitly bring out the underlying assumptions. In
Sections 4 and 5 we describe our technique in detail, and in Section 6 we present various
experimental results comparing our results with those from existing methods. Finally, in
Section 7 we conclude the discussion on single image relighting.
4.2 Related Work
Before we review the literature we must mention that the single image relighting
and pose change techniques in the literature are primarily driven by the face recognition
application. This is so because the other major applications, relighting in movies and
video games, require reasonably high quality relighted and pose-changed images, which
are, most of the time, difficult to generate using a single image as input. That said, a
few methods do produce relighting and pose change results using a single input image,
and so do we.
The subset of techniques that are most closely related to our technique consists of the
methods that involve fitting a canonical reflectance-shape face model to the given input
image. Using the single input image, the fitting process is expected to warp or modify
the canonical face model in such a way that the end result is the reflectance-shape model
for the given face, which can then be used for relighting and pose change. Morphable
Models [10] and Spherical Harmonics Morphable Models [11] are the prime examples of
such techniques. Both of these methods involve building shape and texture models using
a bootstrap set of images and 3D shapes, which are then fit onto the given input image
using some non-linear optimization techniques. The later of these uses the Spherical
Harmonics based lighting model instead of the Phong’s lighting model used in the former.
The upside of using these methods is that they can obtain the reflectance field as well as
the shape in one shot. On the downside, these methods require manual initialization of
the optimization process, which itself can be quite cumbersome and susceptible to local
minima.
Computationally more efficient techniques include methods like the Quotient Image
([17]) and the Practical Relighting ([61]). These methods use a bootstrap set of images
and/or shape, but assume them to be in a rough alignment with the input image. This
assumption alleviates the need for computationally expensive non-rigid alignment at
the cost of some loss in the results’ quality. Given a bootstrap set of 3 images of at
least one other face, the Quotient Image method computes the illumination neutral
"Quotient Image" from a given input image. This illumination neutral image can then
be combined with various lighting directions and intensities to produce various relighted
images. Though the Quotient Image method works with the Lambertian assumption,
surface normals are not required to be explicitly computed. On the other hand, the
Practical Relighting technique assumes a mean face shape and a mean face texture to
be applicable to all the faces. Using the bootstrap shape and texture models with the
Lambertian model, it iteratively recovers the albedo and the lighting in the input image.
Another interesting piece of work was presented in [52], which moves beyond the use of the
Lambertian model and uses statistical models in conjunction with the shape-from-shading
method to recover the shape and the reflectance of the face in the given input image.
4.3 Overview
Our method tries to strike a balance between computational efficiency and result
quality. It requires a 2D fitting of a reference ABRDF field to the given input
image(s) and, as the end result, provides the ABRDF field for the given input face. Our
method can work with either multiple input images or a single one, but here we focus on
its behavior with a single input image. Since our fitting procedure does not involve any
3D to 2D fitting, practical drawbacks encountered in the Morphable Models class of
techniques (e.g. [11], [10]), like the manual initialization and the cumbersome
optimization procedures, are avoided. Our method is composed of two parts: building a
reference reflectance model and fitting this model to the given input image(s). It is fully
automated, with the only requirement that at least one of the input images be somewhat
frontally lit. It also
assumes that if multiple images are provided as input, they are all aligned amongst
themselves. In addition to estimating the ABRDF field of the given face, our
method also recovers an estimate for the lighting condition in the given images.
Our reference reflectance model differs from the reference models used by existing
techniques in that, though it incorporates the 3D shape information, it is fully defined by a 2D
field of spherical functions. This allows the fitting of the model to be carried out without
any 3D to 2D projections. In the following sections we first describe our reference model
and then the fitting procedure in detail.
4.4 The Reference ABRDF Field Model
The first part of our technique involves building a reference ABRDF model, which
captures a sufficient amount of the variation seen in facial ABRDF fields, so that it can
be readily customized for any given input face. We begin by breaking down the facial
ABRDF field into two parts: one which is illumination dependent (referred to here as the
illumination model) and one which is illumination independent. In the computer vision
and graphics literature the latter is often referred to as the texture of the face.
Interestingly, which part of a given ABRDF field constitutes the texture and which makes
up the illumination model has not been well defined in the literature. Most often the
definitions of the texture and the illumination model depend on the assumed reflectance
model. For instance, in the Lambertian model, the albedo, a constant scaling factor at
each pixel, is commonly accepted as the texture, while the half-cosine term is considered
to be the illumination function.
We have chosen to use definitions of the texture and the illumination models which
are independent of any particular reflectance model. In a data driven fashion, we define
the texture at a pixel to be the mean value of the ABRDF, while the quotient function
obtained by dividing the ABRDF with the texture is defined to be the illumination
function (illumination model is a field of such illumination functions). In other words,
texture is taken to be the average of all possible intensity values obtained at a pixel by
varying the point source lighting direction. This scalar value is illumination neutral and in
practice generates a scalar field for a given face which looks illumination independent.
With these definitions, the ABRDF field of any given face can be easily factored into
a scalar field called texture and an illumination model. Thus, our reference ABRDF model
is composed of a reference texture model and a reference illumination model. Now in order
to build the reference texture and illumination models, we use the technique developed
in Chapter 2. Given N subjects with m images each, under known point source lighting,
their ABRDF fields can be built using the Tensor Splines model (Eigenbubbles can also be
used). The average of the m images, which uniformly sample the lighting directions,
is taken as an estimate of the texture, while the quotient function field obtained by
dividing each ABRDF by the texture value is taken as an estimate of the illumination
model. The texture images are then used to non-rigidly align all the N faces to a reference
face. The obtained deformation field is also used to align the illumination models of the
different subjects to the same reference face.
Using the 3rd order Tensor Splines representation, we obtain 10 coefficients per pixel
for the illumination model and a scalar value per pixel for the texture model. For each
subject, we string all the 10 coefficients at all the pixels into a single vector, L_s, and
similarly obtain the texture vector T_s. Further, we assume that both the texture
and the illumination model of any given face come from multivariate Gaussian
distributions. Given a bootstrap set of aligned facial textures (T_s) and illumination models
(L_s), the covariance matrices of the texture and the illumination model distributions can
be defined as C_T = \sum_s (T_s - \bar{T})(T_s - \bar{T})^T / N and
C_L = \sum_s (L_s - \bar{L})(L_s - \bar{L})^T / N respectively, where \bar{T} and \bar{L}
are the average texture and the average illumination models.
We obtain orthonormal bases for the covariance matrices C_T and C_L via
Principal Component Analysis, namely their eigenvectors t_j and l_i respectively. Note that the
eigenvectors here are ordered by decreasing eigenvalue. Hence, for any ABRDF
Figure 4-1. An overview of our technique.
field A, the texture T can be expressed as

T = \sum_j \beta_j t_j + \bar{T} (4–1)

while the illumination model L can be written as

L = \sum_i \alpha_i l_i + \bar{L}. (4–2)
The number of terms in both models can be chosen in accordance with the computational
and quality requirements. These two quantities can now be composed to recover the
ABRDF field. We refer to the set {t_j, \bar{T}, l_i, \bar{L}} as our reference ABRDF
field model.
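
To make this construction concrete, the following sketch shows one way the reference texture and illumination models could be assembled; the array names (images, abrdf) and the use of an SVD for the PCA step are illustrative assumptions, not our actual implementation.

    # A minimal sketch of building the reference model, assuming `images` is an
    # (N, m, P) array of m point-source-lit images per bootstrap subject (P pixels
    # each, already aligned) and `abrdf` is the (N, P, 10) array of 3rd order
    # Tensor Spline coefficients. All names here are hypothetical.
    import numpy as np

    def build_reference_model(images, abrdf, eps=1e-8):
        # Texture: mean over the uniformly sampled lighting directions.
        texture = images.mean(axis=1)                    # (N, P)
        # Illumination model: quotient of the ABRDF by the texture at each pixel.
        illum = abrdf / (texture[..., None] + eps)       # (N, P, 10)

        T = texture                                      # (N, P) texture vectors
        L = illum.reshape(len(illum), -1)                # (N, 10*P) vectors
        T_mean, L_mean = T.mean(axis=0), L.mean(axis=0)

        # PCA bases t_j and l_i via SVD of the centered data matrices
        # (equivalent to eigenvectors of C_T and C_L, ordered by eigenvalue).
        _, _, t_basis = np.linalg.svd(T - T_mean, full_matrices=False)
        _, _, l_basis = np.linalg.svd(L - L_mean, full_matrices=False)
        return T_mean, t_basis, L_mean, l_basis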
4.5 Model Fitting
Given k images of a face under point-source but unknown lighting conditions,
the problem now is to fit the reference model built in the previous section to the input
image(s). The unknowns include the non-rigid deformation required to align the reference
model and the input face, the lighting directions in the k images, the texture model
coefficients and the illumination model coefficients. We propose to recover these unknown
parameters by minimizing the following objective function

E_1(T_x, T_y, \alpha_i, \beta_j, \theta_k, \phi_k) = \sum_k \sum_{(x,y)} \| I_k(T_x(x,y), T_y(x,y)) - D(\sum_{i=1}^{n} \alpha_i l_i + \bar{L}, \theta_k, \phi_k, x, y) \cdot (\sum_{j=1}^{m} \beta_j t_j(x,y) + \bar{T}(x,y)) \|^2, (4–3)
where Tx and Ty are the x and y components of the non-rigid deformation applied to
the input images, θk and φk are the illumination directions of the k input images, αi are
the illumination coefficients and βj are the texture coefficients. The function D is an
abstraction of the process which takes the vector of coefficients and computes the spherical
illumination function at each pixel using the Tensor Splines basis. These functions are
then sampled at the location (x, y) in the direction (θ_k, φ_k) and scaled by the estimated
texture value at (x, y). The setup for the fitting procedure described above is
graphically depicted in Fig. 4-1.
In addition to the sum-of-differences objective function defined above, we constrain
the search space for the illumination model further by adding the following Tikhonov
regularizer to Eq. 4–3:

E_2(\alpha_i) = \lambda \sum_{i=1}^{n} \alpha_i^2, (4–4)
where λ is the regularization parameter. This constraint effectively keeps the estimated
illumination model from drifting too far from the mean illumination model and results in
artifact-free relighted images. The value of the parameter λ is set by the user based on the
desired relighted image quality.
We break down the process of recovering the unknowns into four steps. In the first
step, the input images are aligned with the reference ABRDF model. We use the 2D
Morphable Models [62] method to compute the non-rigid deformation parameters. The
inputs to this step are two images - the reference face image used to align the ABRDF
fields and the input image with somewhat frontal illumination. The outputs of this step are
the deformation parameters Tx and Ty which can be used to warp the input image(s) to
the reference model.
Next we compute the remaining unknowns by minimizing
E(αi, βj, θk, φk) = E1(Tx, Ty, αi, βj, θk, φk) + E2(αi) (4–5)
using a gradient descent based technique. The unknown illumination and texture
parameters are initialized with ones and the illumination directions are initialized with
zeros. In practice we have found that using MATLAB’s fminunc function with 200
iterations provides a good estimate of the unknown parameters.
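
As an illustration, the sketch below carries out this minimization with SciPy's quasi-Newton optimizer in place of MATLAB's fminunc; the renderer render, standing in for the function D composed with the texture scaling, is a hypothetical placeholder.

    # A hedged sketch of the coefficient recovery step. `warped` is the list of
    # pre-aligned input images and `render(alpha, beta, theta, phi)` is assumed
    # to synthesize an image from the model coefficients (the role of D and the
    # texture scaling in Eq. 4-3); both names are illustrative.
    import numpy as np
    from scipy.optimize import minimize

    def fit_model(warped, render, n_illum, n_tex, lam=0.1):
        k = len(warped)

        def objective(p):
            alpha, beta = p[:n_illum], p[n_illum:n_illum + n_tex]
            angles = p[n_illum + n_tex:].reshape(k, 2)   # (theta_k, phi_k) pairs
            data_term = sum(np.sum((img - render(alpha, beta, th, ph)) ** 2)
                            for img, (th, ph) in zip(warped, angles))
            return data_term + lam * np.sum(alpha ** 2)  # E1 + Tikhonov term E2

        # Coefficients initialized with ones, lighting directions with zeros.
        p0 = np.concatenate([np.ones(n_illum + n_tex), np.zeros(2 * k)])
        return minimize(objective, p0, method='BFGS', options={'maxiter': 200})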
Once the unknowns have been recovered, we have an estimate for the ABRDF field of
the input face but it is still aligned with the reference faces. To handle this, as the third
step in our model fitting process, we backward warp the estimated ABRDF field using the
deformation parameters computed earlier. But since the process described above involves
two registration steps, the resultant ABRDF field provides images that appear grainy.
In order to remove these interpolation artifacts, we have incorporated a final step in
the fitting process called quotient mapping. Towards this, we generate an image from the
computed ABRDF field with the same lighting direction as the near-frontal input image.
Note that the lighting direction for this image was computed as part of the optimization
procedure described above. Next we compute the quotient map by dividing the
near-frontal image by its synthesized estimate. This quotient map is then used to scale the
estimated ABRDF field, suppressing the artifacts introduced by interpolation and
extrapolation during the non-rigid alignments of the ABRDF field.
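
A minimal sketch of this quotient mapping step, under the assumption that synth is the image synthesized from the fitted ABRDF field using the estimated near-frontal lighting direction:

    # Scale every ABRDF coefficient by the per-pixel quotient of the observed
    # near-frontal image and its synthesized estimate; names are illustrative.
    import numpy as np

    def quotient_map_correction(abrdf, frontal, synth, eps=1e-8):
        q = frontal / (synth + eps)        # per-pixel quotient map
        return abrdf * q[..., None]        # suppress interpolation artifacts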
4.6 Experimental Results
In order to test the proposed model we conducted experiments using the Extended Yale B,
the CMU PIE and the MERL Dome face datasets. We used the Extended Yale B dataset
to build the reference illumination model since it contains a large number of good quality
point source lit images. The illumination model was obtained by taking the ABRDF fields
and normalizing the coefficient vectors. In the absence of a large number of texture images,
as defined above, we used 200 to 400 frontally lit images to build the texture model. We
Figure 4-2. Comparison of relighted images with ground truth images from the CMU PIE and MERL Dome databases.
used images from the CMU PIE and the MERL Dome databases as inputs to our
relighting method. Though our method can be used with multiple input images, here we
present results primarily for the single-input-image case since, as described before, it is
the more difficult and useful special case of the relighting and pose change problems.
4.6.1 Relighting
First, we compare the quality of the relighted images with the ground truth
images in Fig. 4-2. The top two rows show results for an input from the CMU PIE dataset
while the bottom two rows show results for an input from the MERL Dome dataset.
Relighted images are presented for various ground truth lighting directions. Note that the
closest relighted images were manually selected. We have also included a couple of images
under extreme illuminations, for which there were no ground truth images available. This
shows the importance of a good ABRDF reference model. Since the Extended Yale B
images had more extreme lighting examples than the CMU PIE or the MERL Dome
datasets, our method was able to predict appearances of these faces under more extreme
illumination than those included in the CMU PIE or the MERL Dome datasets.
Next, for two subjects each from CMU PIE (Fig. 4-3 and Fig. 4-4) and MERL Dome
(Fig. 4-5 and Fig. 4-6) datasets, we present the whole range of images generated by our
method as the elevation and azimuth angles of the point light source vary from −60◦
to +60◦. The gradual appearance of cast and attached shadows can be noted as the
illumination direction moves away from the frontal direction.
4.6.2 Pose Change
Using the procedure described in this Chapter, one can estimate the ABRDF field of
a face from as few as one input image. Since the estimated ABRDF field is of the same form
as those described in Chapter 2, we can use the shape recovery method described there to estimate
its shape too. In Fig. 4-7 we show novel poses rendered for three different faces from
the CMU PIE dataset using a single image as input.
4.7 Conclusions
In this chapter we have presented a novel scheme for single image relighting where we
use a non-Lambertian reference ABRDF field model and a 2D fitting procedure to obtain
the ABRDF field of a face from as few as one input image. From the relighted images
included in this chapter it can be noted that our method realistically reproduces complex
photo-effects like cast shadows and specularities in the synthesized images. Using
the procedure outlined before, we have also presented pose change results using a single
input image. Finally, one of the important applications of our single image relighting
framework, face recognition, is described in detail in Chapter 6.
Figure 4-3. Relighted images from the CMU PIE dataset. Left to right and top to bottom, the illumination angle varies from −60◦ to +60◦.
Figure 4-4. Relighted images from the CMU PIE dataset. Left to right and top to bottom, the illumination angle varies from −60◦ to +60◦.
Figure 4-5. Relighted images from the MERL DOME dataset. Left to right and top to bottom, the illumination angle varies from −60◦ to +60◦.
Figure 4-6. Relighted images from the MERL DOME dataset. Left to right and top to bottom, the illumination angle varies from −60◦ to +60◦.
Figure 4-7. Pose changed images of three subjects from the CMU PIE dataset. The single image used as the input is also shown in the first column.
CHAPTER 5
FACE RECOGNITION
5.1 Introduction
World events, especially in the last decade, have led to an increased interest in the
field of biometrics based person identification. Face recognition in particular, has attracted
prolific research in the computer vision and pattern recognition community. Even though
impressive strides have been made towards providing an ultimate solution to this problem,
significant and interesting problems still remain.
If we try to organize the body of literature in this field, a loose
dichotomy of approaches emerges. The first class of these tries to capture the physical
processes of image formation under various scene parameter variations like illumination
(Harmonic Image Exemplar [63], Generic ABRDF [3], Illumination Cone [64], Universal
Lighting [65]), pose (Shape + Spherical Harmonics Basis [66], Morphable Models [67]),
expression (Isometry-invariant Similarity [68], Geometry-Texture [69]) etc. In contrast,
the second class of approaches invokes mathematical and statistical tools to capture the
structure of the oft-invisible relations among the numbers that make up the face images.
These techniques explore the intrinsic data geometry assuming images to be either vectors
(e.g. Eigenfaces [70], Fisherfaces [71], Laplacianfaces [72], orthogonal Laplacianfaces
(OLAP) [73], Neighborhood Preserving Embedding [74], Marginal Fisher Analysis [75],
Laplacian Eigenmaps [76], Locally Linear Embedding [77], Locality Preserving Projections
[78], Kernel Locality Preserving Projections with Side Information (KLPPSI) [79],
MLASSO [80], Kernel Ridge Regression (KRR) [81]), or higher dimensional tensors (e.g.
Tensor Subspace Analysis [82], 2-Dimensional Linear Discriminant Analysis [83], Tensor
Marginal Fisher Analysis [75], Multi-Linear Discriminant Analysis [84], Tensorfaces [85],
Orthogonal Rank One Tensor Projection (ORO) [86], Tensor Average Neighborhood
Margin Maximization (TANMM) [87], Correlation Tensor Analysis (CTA) [88], Spectral
Regression [89], Regularized Discriminant Analysis [89], Smooth LDA [90]).
Figure 5-1. Structure of A^1_i and A^2_i for an image of size 5 × 5 and kernel of size 3 × 3. In the first row, 9 neighborhoods of the image I_i are highlighted. For the first order approximation, each of these neighborhoods becomes a row in A^1_i. For the second order case, we take all the second order combinations of pixel values in each neighborhood and use them as the first 81 (b^4) elements of a row in A^2_i. The remaining 9 (b^2) elements are simply the pixel values. Rows are numbered to show which neighborhood they correspond to.
A major advantage of the techniques in the first class comes from their being
generative in nature. This property allows these methods to accomplish tasks like face
relighting (e.g. [3],[63], [91]) or novel pose generation or complete 3D image reconstruction
(e.g. [66], [67]) in addition to recognition. At the same time, methods in the first class
tend to demand more side information from the data as compared to the second class of
methods (e.g. [3] requires illumination direction for the training set, [91] requires facial
feature points for initialization, etc.). The second class of methods is in a sense more
versatile, as such methods can be seamlessly applied to a variety of different image sets without any
significant requirement of side information.
The method that we propose in this chapter loosely falls into the second category
of techniques. We seek a mapping of face image patches such that in the range space,
discrimination among different classes is easier. We choose Volterra kernels to accomplish
this because they allow us to systematically build progressively better approximations to
such a mapping. Furthermore, Volterra kernels can be learnt in a data driven fashion,
which relieves us from being predisposed towards any fixed kernel form (e.g. Gaussian,
Figure 5-2. Training images from each class are stacked up and divided into equal sized patches. Corresponding patches from each class are then used to learn a higher order convolutional Volterra kernel by minimizing intraclass distance over interclass distance. We end up with one Volterra kernel per group of spatially corresponding patches. The size and the order of the kernel are held constant for a given training process. Note that the color images are only used for illustration; so far our implementation works with grayscale images.
Radial Basis Function etc). The face images in the range space are called Volterrafaces in
this chapter.
5.2 Volterra Kernel Approximations
From signal processing theory we know that a linear translation invariant (LTI)
functional \Im : H \to H, which maps the function x(t) to the function y(t), can be completely
described by a function h(t) as

\Im(x(t)) = y(t) = x(t) \otimes h(t) = \int_{-\infty}^{\infty} h(\tau) x(t - \tau) d\tau. (5–1)
Volterra series theory generalizes this concept and states that any non-linear translation
invariant functional \aleph : H \to H, which maps the function x(t) to the function y(t), can be
described by a sequence of functions h_n(t) as
\aleph(x(t)) = y(t) = \sum_{n=1}^{\infty} y_n(t) (5–2)

where

y_n(t) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h_n(\tau_1, \ldots, \tau_n) x(t - \tau_1) \cdots x(t - \tau_n) d\tau_1 \cdots d\tau_n. (5–3)
Here h_n(\tau_1, \ldots, \tau_n) are called the Volterra kernels of the functional. It must be noted that
the above equation generalizes seamlessly to 2-dimensional functions, I(u, v), such as
images. It can also be noted that eq. (5–1) is just a special case of the
more general eq. (5–3) in which only the first order term is taken into account.
Since we are interested in computing with this theory, we use the
following discrete form of eq. (5–3):

y_n(m) = \sum_{q_1=-\infty}^{\infty} \cdots \sum_{q_n=-\infty}^{\infty} h_n(q_1, \ldots, q_n) x(m - q_1) \cdots x(m - q_n). (5–4)
The infinite series form in eq. (5–4) does not lend itself well to practical implementation.
Further, for a given application, only the first few terms may suffice to give the desired
approximation of the functional. Thus we need a truncated form of the Volterra series, which
is denoted in this chapter as

\Im_p(x(m)) = \sum_{n=1}^{p} y_n(m) = x(m) \otimes_p h(m) (5–5)

where p denotes the maximal order of the terms taken into account for the approximation.
Note that in this truncated Volterra series representation, h(m) is a placeholder for all the
different orders of kernels.
In general, given a set of input functions I, we are interested in finding a functional
ℵ, such that ℵ(I) has some desired property. This desired property can be captured by
defining a goodness functional on the range space of ℵ. In cases when the explicit equation
relating the input set I to ℵ(I) is known, various techniques like harmonic input method,
direct expansion etc. ([92]) can be used to compute kernels of the unknown functional. In
absence of such an explicit relation, we propose that Volterra kernels be learnt from data
using the goodness functional. The translation invariance property of the Volterra kernels
ensures that if the images are translated by a fixed amount in the domain, the mapped images
are also translated by the same amount, and hence the Volterra kernel mapping is stable.
In this framework, the problem of pattern classification can be posed as follows.
Given a set of input data I = {gi} where i = 1 . . . N , a set of classes C = {ck} where
k = 1 . . . K and a mapping which associates each gi to a class ck, find a functional such
that in the range space, data ℵ(I) is easily classifiable. Here the goodness functional could
be a measure of separability of classes in the range space. Once the Volterra kernels have
been determined, a new data point can be classified using the learnt functional. ℵ(I)
can be approximated to appropriate accuracy based on computational efficiency and
classification accuracy constraints.
5.3 Kernel Computation as Generalized Eigenvalue Problem
For the specific task of image classification, we define the problem as follows. Given
a set of input images (2D functions) I, the training set, where each image belongs to a
particular class c_k ∈ C, compute the Volterra kernels for the unknown functional N which maps the
images in such a manner that the goodness functional O is minimized in the range space of
N. The functional O measures the departure from complete separability of the data in the range space.
In this chapter we seek a functional N that maps all images from the same class in a
manner such that the intraclass L2 distance is minimized while the interclass L2 distance
is maximized. Once N has been determined, a new image can be classified in the mapped
space using any method such as the Nearest Centroid or Nearest Neighbor classifier.
With this observation, we define the goodness functional O as
O(I) = \frac{\sum_{c_k \in C} \sum_{i,j \in c_k} \| N(I_i) - N(I_j) \|^2}{\sum_{c_k \in C} \sum_{m \in c_k, n \notin c_k} \| N(I_m) - N(I_n) \|^2} (5–6)
Figure 5-3. Testing involves dividing the test image according to the scheme used while training. Then each patch is mapped to the range space by the corresponding Volterra kernel. After the mapping, each patch is classified using a Nearest Neighbor classifier in the range space. After all patches have been individually classified, each patch from the test image casts a vote towards the parent image classification. The class with the maximum votes wins.
where the numerator measures the aggregate intraclass distance for all classes and
the denominator measures the aggregate distance of class ck from all other classes in C.
Equation (5–6) can be further expanded as
O(I) = \frac{\sum_{c_k \in C} \sum_{i,j \in c_k} \| I_i \otimes_p K - I_j \otimes_p K \|^2}{\sum_{c_k \in C} \sum_{m \in c_k, n \notin c_k} \| I_m \otimes_p K - I_n \otimes_p K \|^2} (5–7)
where K, like h(t) in eq. (5–5), is a placeholder for all different order convolution kernels.
At this juncture we make the linear nature of convolution explicit by converting the
convolution operation to multiplication. This conversion to explicit linear transformational
form can be done in many ways, but as the convolution kernel is the unknown in our
setup, we wish to keep it as a vector, and thus we transform the image I_i into a new
representation A^p_i such that
I_i \otimes_p K = A^p_i \cdot \mathbf{K} (5–8)

where \mathbf{K} is the vectorized form of the 2D masks represented by K.
The exact form of A^p_i depends on the order p of the convolutions. In Section 5.5 we
present results for up to the second order approximation, and thus the structure
of A^p_i is explained only up to second order; it should be noted, however, that the recognition
framework using Volterra kernels that we propose is very general, and the structure of A^p_i
for any order can be derived analogously.
5.3.1 First Order Approximation
For an image I_i of size m × n pixels and a first order kernel K_1 of size b × b, the
transformed matrix A^1_i has dimensions mn × b^2. It is built by taking the b × b neighborhood
at each pixel of I_i, vectorizing it, and stacking the resulting rows one on top of the other.
This procedure is illustrated for an image of size 5 × 5 and a kernel of size 3 × 3 in Fig. 5-1
([4]). Border pixels can be ignored, or taken into account during convolution by padding
the image with zeros, without affecting the performance significantly.
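
As an illustration, the following sketch builds this matrix for a grayscale image with zero padding at the border; the function name is ours, and the rows implement the cross-correlation form of the operator, which is equivalent here since the kernel itself is the quantity being learned.

    # A sketch of building A^1_i: every b x b neighborhood of the image becomes
    # one row, so that the convolution I (*) K equals A @ K for a vectorized K.
    import numpy as np

    def first_order_matrix(image, b=3):
        m, n = image.shape
        r = b // 2
        padded = np.pad(image, r, mode='constant')   # zero padding at the border
        rows = [padded[y:y + b, x:x + b].ravel()
                for y in range(m) for x in range(n)]
        return np.array(rows)                        # shape (m*n, b*b)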
Substituting the above defined representation for convolution in eq. (5–7), we obtain

O(I) = \frac{\sum_{c_k \in C} \sum_{i,j \in c_k} \| A^p_i \cdot K_1 - A^p_j \cdot K_1 \|^2}{\sum_{c_k \in C} \sum_{m \in c_k, n \notin c_k} \| A^p_m \cdot K_1 - A^p_n \cdot K_1 \|^2}. (5–9)
This can be written as

O(I) = \frac{K_1^T S_W K_1}{K_1^T S_B K_1} (5–10)
where

S_W = \sum_{c_k \in C} \sum_{i,j \in c_k} (A^p_i - A^p_j)^T (A^p_i - A^p_j) (5–11)

and

S_B = \sum_{c_k \in C} \sum_{m \in c_k, n \notin c_k} (A^p_m - A^p_n)^T (A^p_m - A^p_n). (5–12)
Here S_W and S_B are symmetric matrices of dimensions only b^2 × b^2. Seeking the
minimum of eq. (5–10) leads to a generalized eigenvalue problem: the minimum of O(I)
is given by the minimum eigenvalue of S_B^{-1} S_W, and it is attained when
K_1 equals the corresponding eigenvector.
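
The computation just described can be sketched as follows; the data layout mats (class label to list of A-matrices) is an assumption for illustration, and S_B is taken to be positive definite so that the generalized symmetric eigensolver applies.

    # Accumulate S_W and S_B from the A-matrices of the training patches and take
    # the eigenvector of the smallest generalized eigenvalue as the kernel.
    import numpy as np
    from scipy.linalg import eigh

    def learn_kernel(mats):
        d = next(iter(mats.values()))[0].shape[1]
        SW, SB = np.zeros((d, d)), np.zeros((d, d))
        for c, As in mats.items():
            for i, Ai in enumerate(As):
                for Aj in As[i + 1:]:                      # intraclass pairs
                    D = Ai - Aj
                    SW += D.T @ D
            others = [A for c2, As2 in mats.items() if c2 != c for A in As2]
            for Ai in As:
                for An in others:                          # interclass pairs
                    D = Ai - An
                    SB += D.T @ D
        # Smallest eigenvalue of the pencil S_W v = lambda S_B v.
        _, V = eigh(SW, SB)
        return V[:, 0]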
5.3.2 Second Order Approximation
The second order approximation of the sought functional contains two terms:

y(m) = \sum_{q_1=-\infty}^{\infty} h_1(q_1) x(m - q_1) + \sum_{q_1=-\infty}^{\infty} \sum_{q_2=-\infty}^{\infty} h_2(q_1, q_2) x(m - q_1) x(m - q_2). (5–13)
The first term in eq. (5–13) corresponds to a weighted sum of the first order terms,
x(m − q_1), while the second term corresponds to a weighted sum of the second order
terms, x(m − q_1)x(m − q_2). For an image I_i of size m × n pixels and kernels of size b × b,
the transformed matrix A^2_i for the second order approximation in eq. (5–8) has dimensions
mn × (b^4 + b^2), and the kernel vector that multiplies it, K_2, has dimensions (b^4 + b^2) × 1.
A^2_i is built by taking the b × b neighborhood at each pixel of I_i, generating all second
degree combinations from the neighborhood, vectorizing them, concatenating the first degree
terms, and stacking the resulting rows one on top of the other. K_2 is formed by concatenating
the vectorized second and first order kernels. The structure of A^2_i for a 5 × 5 image and 3 × 3 kernels is
illustrated in Fig. 5-1 ([4]). It must be noted that the problem is still linear in the variables
being solved for; in fact, this formulation ensures that regardless of the order of
approximation, the problem is linear in the coefficients of the Volterra convolution kernels.
With this definition of A^2_i we proceed as for the first order approximation to obtain
equations (5–9) and (5–10), with the difference being that the matrices S_B and S_W now
have dimensions (b^4 + b^2) × (b^4 + b^2). Here we must point out an important modification
to the structure of A^2_i which allows us to reduce the size of the matrices. The second order
convolution kernels in the Volterra series are required to be symmetric ([92]), and this
symmetry also manifests itself in the structure of A^2_i. By allowing only unique entries
in A^2_i we can reduce its dimensions to mn × (b^4 + 3b^2)/2 and the dimensions of the matrices
S_B and S_W to (b^4 + 3b^2)/2 × (b^4 + 3b^2)/2. Now, as in the first order approximation, the minimum of
O(I) is given by the minimum eigenvalue of S_B^{-1} S_W, and it is attained when K_2 equals
the corresponding eigenvector.
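
For concreteness, one row of A^2_i under this symmetry reduction could be assembled as in the sketch below; the function name is illustrative.

    # Unique second-degree products (upper triangle, exploiting kernel symmetry)
    # concatenated with the first-degree pixel values of one b x b neighborhood,
    # giving (b^4 + 3b^2)/2 entries per row.
    import numpy as np

    def second_order_row(neighborhood):
        v = neighborhood.ravel()                     # b^2 first order terms
        iu = np.triu_indices(len(v))                 # unique quadratic index pairs
        quad = (v[:, None] * v[None, :])[iu]         # b^2 (b^2 + 1) / 2 products
        return np.concatenate([quad, v])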
5.4 Training and Testing Algorithms
The mapping developed in the previous section can be made more expressive if the
image is divided into smaller patches and each patch is allowed to be independently
mapped to its own discriminative space. This allows us to develop separate kernels for
different face regions. Further, if each constituent patch casts a vote towards the parent
image classification, it brings in robustness to the overall image classification. Thus we
adopt a patch based framework for the face recognition task.
Training (Fig. 5-2 [4]) in the proposed framework involves learning a Volterra kernel
from the corresponding patches of the training images. Testing (Fig. 5-3 [4]) in our
framework involves two stages. The first stage classifies each patch using the mapping
framework described in the previous section and a nearest neighbor classifier. In the
second stage, each patch casts a vote towards the parent image classification and the class
with the maximum votes wins. In the case of a tie, we classify the image by a run-off among
the tied classes; as the number of ties is small, the simpler strategy of breaking ties with a
coin toss showed similar results. A sketch of this two-stage test procedure is given below.
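
The sketch assumes a data layout in which kernels[p] and train[p] hold the learned kernel and the labeled training A-matrices for patch index p; these names are illustrative.

    # Each patch is mapped to the range space by its own Volterra kernel,
    # classified by nearest neighbor there, and votes for the parent image.
    import numpy as np
    from collections import Counter

    def classify_image(test_patches, kernels, train):
        votes = []
        for p, A_test in enumerate(test_patches):
            y = A_test @ kernels[p]                       # map patch to range space
            dists = [(np.linalg.norm(y - A @ kernels[p]), label)
                     for A, label in train[p]]
            votes.append(min(dists)[1])                   # nearest neighbor vote
        return Counter(votes).most_common(1)[0][0]        # majority class wins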
Parameter Selection: Like any other learning algorithm, there are parameters
that need to be set before using this method. But unlike most state-of-the-art algorithms
where parameters are chosen from a large discrete or continuously varying domain
(e.g. initialization and λ in [80], parameter α ∈ [0, 1] in [90], dimensionality D in [93],
reduced dimensionality in TSA [82], 2D-LDA [83], energy threshold in [94], etc.), Volterra
Table 5-1. State-of-the-art methods with which we compare our technique, along with the training set sizes used in their experiments.

Method                              Yale A     CMU PIE       Ext. Yale B
2008 CVPR, An (KLPPSI) [79]         -          5 10 20 30    5 10 20 30
2008 CVPR, Pham (MLASSO) [80]       -          2 3 4         2 3 4
2008 CVPR, Shan (UVF) [93]          2 - 8      -             -
2008 TIP, Fu (CTA) [88]             -          5 10 15 20    5 10 20 30
2008 TIP, Fu (Lap, Eig, Fis) [88]   -          -             5 10 20 30
2007 CVPR, An (KRR) [81]            -          5 10 20 30    5 10 20 30
2007 CVPR, Hua (ORO) [86]           5          30            20
2007 CVPR, Cai (S-LDA) [90]         2 3 4 5    -             -
2007 CVPR, Wang (TANMM) [87]        2 3 4      5 10 20       -
2007 ICCV, Cai (SR, RDA) [89]       -          30 40         10 20 30 40
2006 TIP, Cai (OLAP) [73]           2 3 4 5    5 10 20 30    -
2006 TIP, Cai (Lap, Eig, Fis) [73]  2 3 4 5    5 10 20 30    -
discriminant analysis has a smaller discrete set of parameter choices. We use one of the
widely used ([90],[81]) and accepted methods, cross validation, for parameter selection.
Foremost is the selection of the patch size, and for this, starting with the whole face
image we define a quad-tree of sub-images. We progressively go down the tree stopping
at the level beyond which there is no improvement in the recognition rates in cross
validation. Empirically we found that a patch size of 8 × 8 pixels provides the best results
in all cases. Next, we allow patches to be overlapping or non-overlapping. The Volterra
kernel can be of size 3 × 3 or 5 × 5 pixels (anything bigger than this severely over-fits
a patch of size 8 × 8). Lastly, the order of the kernel can be quadratic or linear.
Here we have presented results for both quadratic and linear kernels. The rest of the
parameters were set using a 20-fold leave-one-out cross validation on the training set.
It can be noted from the results presented in the next section that the best parameter
configuration is fairly consistent not just within a particular database but also across
databases.
5.5 Experiments
In order to evaluate our technique and compare it with existing state-of-the-art
methods in learning based face recognition, we have identified 11 recent (CVPR 2008,
Table 5-2. Yale A: Recognition error rates for training set sizes varying from 2 to 9. The linear kernel size was 5 × 5 and the quadratic kernel size was 3 × 3; in both cases overlapping patches of size 8 × 8 were used, on images of size 64 × 64. For other methods, the best results as reported in the respective papers are quoted.

Train Set Size            2      3      4      5
S-LDA [90]                42.4   27.7   22.2   18.3
S-LDA [90] (updated)      37.5   25.5   19.3   14.7
UVF [93]                  27.11  17.38  11.71  8.16
TANMM [87]                44.69  29.57  18.44  -
OLAP [73]                 44.3   29.9   22.7   17.9
Eigenfaces [73]           56.5   51.1   47.8   45.2
Fisherfaces [73]          54.3   35.5   27.3   22.5
Laplacianfaces [73]       43.5   31.5   25.4   21.7
Volterrafaces (Linear)    15.70  12.33  9.47   6.11
Volterrafaces (Quad)      22.15  13.36  15.78  10.19

Train Set Size            6      7      8      9
S-LDA [90] (updated)      12.3   10.3   8.7    -
UVF [93]                  6.27   5.07   3.82   -
Volterrafaces (Linear)    5.78   3.96   2.61   1.43
Volterrafaces (Quad)      10.04  9.66   9.49   8.74
Table 5-3. CMU PIE: Recognition error rates for training set sizes varying from 2 to 40. The linear kernel size was 5 × 5 and the quadratic kernel size was 3 × 3; in both cases overlapping patches of size 8 × 8 were used, on images of size 32 × 32. For other methods, the best results as reported in the respective papers are quoted.

Train Set Size            5      10     20     30
KLPPSI [79]               27.88  12.32  5.48   3.62
KRR [81]                  26.4   13.1   5.97   4.02
ORO [86]                  -      -      -      6.4
TANMM [87]                26.98  17.22  5.68   -
SR [89]                   -      -      -      6.1
OLAP [73]                 21.4   11.4   6.51   4.83
Eigenfaces [73]           69.9   55.7   38.1   27.9
Fisherfaces [73]          31.5   22.4   15.4   7.77
Laplacianfaces [73]       30.8   21.1   14.1   7.13
Volterrafaces (Linear)    20.26  10.24  4.94   2.85
Volterrafaces (Quad)      25.29  11.94  5.45   4.60

Train Set Size            2      3      4      40
SR [89]                   -      -      -      5.2
MLASSO [80]               54.0   43.0   34.0   -
Volterrafaces (Linear)    43.0   36.30  23.98  2.37
Volterrafaces (Quad)      50.48  39.66  32.67  3.04
Table 5-4. Extended Yale B: Recognition error rates for training set sizes of 2 to 5, 10, 20, 30 and 40. Both the linear and the quadratic kernel were of size 3 × 3, and in both cases non-overlapping patches of size 8 × 8 were used, on images of size 32 × 32. For other methods, the best results as reported in the respective papers are quoted.

Train Set Size            5      10     20     30
ORO [86]                  -      -      -      9.0
SR [89]                   -      12.0   4.7    2.0
RDA [89]                  -      11.6   4.2    1.8
KLPPSI [79]               24.74  9.93   3.15   1.39
KRR [81]                  23.9   11.04  3.67   1.43
CTA [88]                  16.99  7.60   4.96   2.94
Eigenfaces [88]           54.73  36.06  31.22  27.71
Fisherfaces [88]          37.56  18.91  16.87  14.94
Laplacianfaces [88]       34.08  18.03  30.26  20.20
Volterrafaces (Linear)    6.35   2.67   0.90   0.42
Volterrafaces (Quad)      13.0   3.98   1.27   0.58

Train Set Size            2      3      4      40
MLASSO [80]               58.0   54.0   50.0   -
SR [89]                   -      -      -      1.0
RDA [89]                  -      -      -      0.9
Volterrafaces (Linear)    26.23  18.23  9.33   0.34
Volterrafaces (Quad)      40.81  20.47  14.42  0.43
CVPR 2007, ICCV 2007, TIP 2008, TIP 2006) publications which present the best results
on the benchmark databases using an experimental setup very similar to that in [73]. All of
these are embedding methods mentioned in the prologue, with the exception of [93], which
builds on the concept of Universal Visual Features (UVF). In our study, we present results
on the Yale A, CMU PIE and Extended Yale B benchmark face databases, partly because they
are among the most popular databases, which makes a comparative study easy. In addition
to the recent methods we also provide a comparison with the traditional baseline methods:
Eigenfaces, Fisherfaces and Laplacianfaces. Table 5-1 ([4]) lists these methods along with
the number of training images used by them on the above mentioned databases in their
experiments. We have presented our results for the whole range of training set sizes so
that comparison with the maximum number of techniques can be made.
For the Yale A 1 face database we used 11 images each of the 15 individuals with a
total of 165 images. For the CMU PIE database ([95]) we used all 170 images (except a few
corrupted images) from 5 near frontal poses (C05, C07, C09, C27, C29) for each of the 68
subjects. For the Extended Yale B database ([65]) we used 64 images each (frontal pose,
except for a few corrupted images) of the 38 individuals present in the database. Note that the
methods in Table. 5-1 ([4]) used the same subset of images. We obtained the data from
the website of the authors of [90] 2 . These images are manually aligned (two eyes were
aligned at the same position) and cropped to extract faces, with 256 gray value levels per
pixel.
The results (average recognition error rates) on the Yale A, CMU PIE and Extended
Yale B databases are presented in Table 5-2 ([4]), Table 5-3 ([4]) and Table 5-4 ([4])
respectively. Rows titled Train Set Size indicate the number of training images used, and
the rows below them list the rates reported by various state-of-the-art methods and by our method
(Volterrafaces). Each experiment was repeated 10 times for 10 random choices of training
set. All images other than the training set were used for testing. The specific experimental
setup used for Volterrafaces is mentioned below each table. We have reported results with
both linear and quadratic masks for the sake of completeness. Best results for a particular
training set size are highlighted in bold.
5.6 Discussion and Conclusion
It can be noted that our proposed method (Volterrafaces) consistently outperforms the
state-of-the-art and traditional methods on all three benchmark datasets. On all
three databases linear masks outperform the quadratic masks, but in most cases the quadratic
masks also provided better performance than the existing methods. For the competing
methods we have reported the error rates as mentioned in the original publications (listed
1 http://cvc.yale.edu/projects/yalefaces/yalefaces.html
2 http://www.cs.uiuc.edu/homes/dengcai2/Data/data.html
next to method names in tables). Since we want to present only the best results reported
by the competing methods, some of the entries in the tables are left empty because no
results for those training set sizes were reported in the original publications.
We have introduced the use of Volterra kernel approximations for image recognition
functionals in this chapter. The kernel learning is driven by training data, based on
a goodness functional defined in the range space of the recognition functional. It is
shown that for the goodness functional that tries to minimize intraclass distances
while maximizing interclass distances, kernel computation reduces to a generalized
eigenvalue problem, which translates to very efficient computation of kernels for any order
of approximation of the functional. The effectiveness of this technique for face recognition is
demonstrated by experiments on three benchmark databases, and the results are compared
to traditional as well as state-of-the-art techniques in discriminant analysis for faces.
From the results presented in this chapter it can be concluded that Volterra kernel
approximations show great promise for applications in image recognition tasks.
CHAPTER 6
ENHANCING FACE RECOGNITION WITH RELIGHTING
6.1 Introduction
In this dissertation we have presented various methods for generating relighted images
of a given face using one or more images. We have also examined the problem of face
recognition and built a novel classifier as a solution to this problem. In this chapter we
will examine the problem of face recognition in conjunction with relighting and explore the
possibility of using relighting as a tool to enhance face recognition. In particular, we
explore gallery set augmentation as a possible way to achieve higher recognition rates.
The literature is replete with various proposals (e.g. [17], [61], [52], [96]) on how
to marry relighting techniques with face recognition. We can broadly put the existing
techniques that have examined image relighting in conjunction with recognition into
two classes. The first includes techniques which use relighting as a way to obtain
illumination invariant representations of the gallery and probe images (e.g. [97], [98]), while
the second class includes the techniques which seek to improve recognition by augmenting
the gallery set with novel relighted images (e.g. [52]).
6.2 Gallery Set Augmentation
The method we propose here falls into the second of the above mentioned classes
of methods. Given a single input image of a face (initially the only image in the gallery
set), which is somewhat frontally lit (i.e. no strong cast or attached shadows), we use
the method we developed for single image relighting in Chapter 4 to generate multiple
relighted versions of the same image. These new images are added to the gallery set and
any given probe image is now matched against all of the images.
One would expect that a probe image which has very different lighting from the input
image would not be a good match for it, but it is likely that one of the newly added images
might be a closer match for it. Even in the case where the probe image has been lit from
multiple light sources, a linear combination of the new gallery images would be a closer
match to it than any single image. Here we focus on face recognition with point source lit
images, since the case with multiple light sources can simply be addressed by using an
appropriate classifier with the same set of point source lit augmented images. A sketch of
this augmented-gallery matching scheme appears below.
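
The sketch uses an L1 nearest neighbor rule; the relight function stands in for the single image relighting method of Chapter 4 and is a placeholder, as is the gallery data layout.

    # Each gallery subject contributes its original image plus relighted versions;
    # a probe is assigned to the subject of its closest gallery image (L1 norm).
    import numpy as np

    def recognize(probe, gallery, relight, angles=np.linspace(-60, 60, 8)):
        best_dist, best_subject = np.inf, None
        for subject, image in gallery.items():
            candidates = [image] + [relight(image, el, az)
                                    for el in angles for az in angles]
            d = min(np.abs(probe - c).sum() for c in candidates)
            if d < best_dist:
                best_dist, best_subject = d, subject
        return best_subject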
6.3 Experimental Results
In order to test the proposed recognition framework, we used the CMU PIE and
the MERL Dome datasets. Since gallery augmentation should help any recognition
method, we used three different recognition techniques to illustrate that relighting using
our method improves recognition. In our experiments, we looked at the improvement
achieved by augmentation over using a single gallery image. The ABRDF reference model
used for single image relighting was built using the Extended Yale B face database. We
also implemented a few competing methods for image relighting to compare their results
against those achieved by the proposed relighting method.
Here we describe the recognizers we used in our experiments.
6.3.1 Nearest Neighbor Classifier
One of the simplest ways to classify images is to look at the pixel-by-pixel distance
between the probe image and the gallery images. Of the various distance measures
possible, we have chosen the simple L1 norm based distance measure, since this method is
only used to set the baseline results.
6.3.2 MERL Classifier
The MERL classifier is a commercial face recognition system that uses Haar-like features to recognize
face images. Details for this classifier can be found in [99]. This classifier uses a large
set of pairs of images labeled as same or different to learn a set of good recognition
features and classifier parameters using Boosting. Once the classifier parameters and
the discriminating features have been learnt, given any pair of images, it assigns a
similarity score to the pair. This similarity score can then be used in conjunction with
a classification threshold to categorize a pair of images as depicting the same or different subjects.
6.3.3 Local Binary Pattern Classifier
Local Binary Patterns are a new class of features, proposed in [100], which have
been very successfully applied to face recognition. In contrast to the above mentioned
classifiers, this method uses a histogram of features and thus is not sensitive to the
absolute locations of the features.
To compare our results with other relighting methods, we also implemented the following
two relighting methods.
6.3.4 Naive Relighting
One of the simplest relighting methods is to darken a part of the image. This
gives the illusion that a part of the image is under shadow due to lighting. For this we
divide the image into rectangular segments and generate new images with pixels in these
rectangles progressively going darker from left to right, top to bottom and vice-versa. For
the darker segments of the image, we set the pixel values to be 30% of the original values.
This relighting method is used to answer the question of whether the more complicated
relighting schemes proposed here (and elsewhere) are better than naively darkening
segments of the images to simulate shadowing. A sketch of this baseline is given below.
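
A minimal sketch, assuming vertical segments darkened left to right (the other sweep directions are analogous):

    # Generate naively relighted images by progressively darkening vertical
    # segments to 30% of their original intensity.
    import numpy as np

    def naive_relight(image, n_segments=4):
        h, w = image.shape
        step = w // n_segments
        out = []
        for k in range(1, n_segments):
            img = image.astype(float)      # astype returns a fresh copy
            img[:, :k * step] *= 0.30      # darkened left portion
            out.append(img)
        return out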
6.3.5 Practical Relighting
This is one of the recently proposed state-of-the-art methods for single image
relighting which boasts a simple implementation with high performance. Starting with
a reference albedo and shape model, this method ([61]) estimates the lighting direction in
the input image. The lighting estimate and the shape model are then used to estimate a
face specific albedo field. With the albedo and the shape available, an image of the face under
any illumination can be rendered using the Lambertian model. Note that this method
assumes that all the faces have the same shape and requires only a rough alignment among
them.
Table 6-1. Effect of gallery augmentation size: face recognition rates for the CMU PIE dataset with the MERL classifier.

                        Augmentation Size
Relighting Method       256    128    64     32     16     8      4      1
Naive Relighting        92.5   92.2   91.8   93.1   92.8   90.4   90.9   90.1
Practical Relighting    92.7   92.2   92.3   91.6   90.7   90.2   91.3   90.1
Proposed Technique      98.4   98.4   98.1   97.4   97.9   96.6   94.8   90.1
6.3.6 Results
One of the important parameters in a gallery set augmentation scheme is the number of
new images to add to the gallery set. This can affect the performance in the following three
ways. First, as the number of augmented images increases, so does the classification time,
since the number of required comparisons also increases. Second, if more images are used
for augmentation, it is more likely that a close match will be found for the probe image.
Finally, having more images in the gallery increases the risk of encountering false positives.
Assuming that the classification time is not critical (since modern commercial classifiers
work quite fast), here we focus on the change in recognition rates and the impact on
false positives.
In the first set of results we show how the recognition rates change as the number of
images used to augment the gallery increases. Here we have used the MERL classifier with
our relighting on the CMU PIE dataset. The gallery and the probe sets are composed of
facial images in the frontal pose. These results are presented in Table 6-1. It can be noted
that across different relighting schemes, the recognition rates show improvement as the
number of augmented images is increased. For this experiment we synthesized 256 images
of each gallery subject by uniformly sampling the illumination direction space from +60◦
to −60◦ in both the elevation and the azimuth angles. Other augmentations were obtained
by uniformly spacing the images in the previously mentioned range.
For the rest of the experiments we fix the number of augmented images at 64.
This is because it provides a reasonable trade-off between the memory-time requirements
and the recognition rates. We present recognition rates for the CMU PIE database in
Table 6-2. Face recognition rates for the CMU PIE dataset.

                        Nearest Neighbor (L1)     MERL Classifier           Local Binary Patterns
Relighting Method       Augmented  Single Image   Augmented  Single Image   Augmented  Single Image
Naive Relighting        70.4       58.9           91.8       90.1           93.9       93.4
Practical Relighting    80.7       58.9           92.3       90.1           96.6       93.4
Proposed Technique      93.5       58.9           98.1       90.1           99.2       93.4
Table 6-3. Face recognition rates for the MERL Dome dataset.

                        Nearest Neighbor (L1)     MERL Classifier           Local Binary Patterns
Relighting Method       Augmented  Single Image   Augmented  Single Image   Augmented  Single Image
Naive Relighting        54.3       42.7           78.2       71.2           66.3       65.6
Practical Relighting    77.5       42.7           80.8       71.2           85.4       65.6
Proposed Technique      79.8       42.7           82.9       71.2           88.2       65.6
Table 6-2 and for the MERL DOME database in Table 6-3. It can be noted that across
the different databases and across the different classification schemes, our relighting
method provides higher recognition rates than the competing methods.
Rank-one recognition rates, as presented in Tables 6-2 and 6-3, though useful, do
not provide insight into the impact of gallery augmentation on the False Acceptance
Rates. We explore this aspect of the gallery set augmentation using the receiver operating
characteristic (ROC) curves which plot the False Rejection Rates against the False
Acceptance Rates as the recognition threshold is varied. For each probe image, a distance
is computed from every gallery class. Though this distance is computed differently
for different recognition strategies, it is a scalar value in all the cases. The recognition
threshold used to generate the ROC curves decides when to classify a probe image as
belonging to a class. If the distance between a probe image and a class is less than the
threshold value, the probe image is classified as belonging to that class, otherwise not.
Note that the classes here correspond to the subjects.
In Fig. 6-1 and Fig. 6-2 we present the Receiver Operating Characteristic (ROC)
curves for the MERL classifier, on the CMU PIE and the MERL Dome datasets
respectively. Curves for various relighting methods and the single image case are included
in the plots. There are three important figures of merit for such curves: the first is the Equal
Error Rate (EER), which is the point on the curve where the False Acceptance Rate
Figure 6-1. ROC Curve for the CMU PIE dataset with the MERL classifier.
equals the False Rejection Rate; the second is the area under the ROC curve; and the third is
the False Rejection Rate at 0.1% False Acceptance Rate. It can be noted that for all of
these merit criteria, our relighting outperforms the existing methods. Similar trends can
be noted in the curves for the Nearest Neighbor classifier in Fig. 6-3 and Fig. 6-4. ROCs
for the Local Binary Pattern classifier on the CMU PIE and the MERL Dome databases
are presented in Fig. 6-5 and Fig. 6-6 respectively.
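
For concreteness, the sketch below traces such a curve from genuine and impostor distance scores and reads off an approximate Equal Error Rate; the variable names are illustrative.

    # `genuine` holds probe-to-correct-class distances, `impostor` the
    # probe-to-wrong-class distances; sweep the threshold over all scores.
    import numpy as np

    def roc_curve(genuine, impostor):
        thresholds = np.sort(np.concatenate([genuine, impostor]))
        frr = np.array([(genuine > t).mean() for t in thresholds])
        far = np.array([(impostor <= t).mean() for t in thresholds])
        i = np.argmin(np.abs(far - frr))             # closest FAR/FRR crossing
        return far, frr, (far[i] + frr[i]) / 2       # approximate EER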
6.4 Conclusion
In this chapter we have demonstrated that relighting can be effectively used to
enhance recognition via gallery set augmentation. We have shown that across different
recognition schemes, relighting can benefit face recognition. Further, by comparing various
relighting methods to the proposed method, we have demonstrated that our relighting
method has significant advantages when it comes to enhancing recognition.
Figure 6-2. ROC Curve for the MERL Dome dataset with the MERL classifier.
Figure 6-3. ROC Curve for the CMU PIE dataset with the Nearest Neighbor classifier.
Figure 6-4. ROC Curve for the MERL Dome dataset with the Nearest Neighbor classifier.
Figure 6-5. ROC Curve for the CMU PIE dataset with the LBP classifier.
Figure 6-6. ROC Curve for the MERL Dome dataset with the LBP classifier.
CHAPTER 7
CONCLUSION
In this dissertation we have examined the three fundamental problems in facial image
analysis: relighting, pose change and recognition. For all three problems we have
presented novel solutions for both the multiple-image and single-image input cases.
Due to their tightly interdependent nature, we have treated the problems of relighting
and pose change together and presented a Tensor Splines based framework for multiple
image relighting and pose change. We have shown that our framework is superior to the
popular Lambertian model in terms of reproducing shadows and specularities in
the relighted images. Furthermore, we have shown that the face shape estimated using
our method has advantages over the popular Robust Photometric Stereo technique when the
input images are ridden with cast shadows. We have also presented a method for enhancing the
relighted image quality using Eigenbubbles.
For the more difficult problem of single image relighting and pose change, we
draw upon a bootstrap set of ABRDF fields. We define the reference ABRDF field
model in terms of the illumination field model and the texture model. Given a single
input image, using a non-linear optimization technique, we find those coefficients of
the illumination and the texture models, along with the illumination direction and the
non-rigid deformation, which best explain the input image. Once we have obtained the
ABRDF field, we follow the same scheme as devised for the multiple input images case to
generate images in novel poses.
In order to address the problem of face recognition with multiple input images,
we presented a novel classification scheme using the Volterra kernels which we call
Volterrafaces. We have shown that this method outperforms various state-of-the-art
methods in terms of recognition accuracy on three popular benchmark face databases.
We have addressed the problem of single image recognition in a broader sense and have
proposed a gallery set augmentation framework which can be used with most off-the-shelf
recognition methods to enhance recognition. Given a single gallery image, our technique
generates multiple relighted versions of it, after which any recognition technique that uses
multiple images can be applied. We have compared our technique with existing single image
relighting methods and have shown that the recognition rate enhancements attained by
our relighting method exceed those obtained by the existing relighting methods.
The techniques presented in this dissertation improve upon the state-of-the-art in
various ways, but numerous avenues are still open for exploration. A few of
the important threads for possible future exploration include removing the need for a
somewhat frontally lit input image for single image relighting and pose change, removing
the need for point light source lit images as input for both the multiple and single input image
cases, and reducing the number of system parameters in the Volterrafaces classification
scheme. The computational efficiency aspects of all the algorithms presented in this dissertation
can also be explored in the future, since they play an important role in real-life applications.
REFERENCES
[1] B. T. Phong, "Illumination for computer generated pictures," Communications of the ACM, vol. 18, no. 6, pp. 311–317, 1975.

[2] K. E. Torrance and E. M. Sparrow, "Theory for off-specular reflection from roughened surfaces," J. Optical Society of America, vol. 57, pp. 1105–1112, 1967.

[3] A. Barmpoutis, R. Kumar, B. C. Vemuri, and A. Banerjee, "Beyond the Lambertian assumption: A generative model for apparent BRDF fields of faces using anti-symmetric tensor splines," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2008.

[4] R. Kumar, A. Banerjee, and B. C. Vemuri, "Volterrafaces: Discriminant analysis using Volterra kernels," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2009.

[5] R. Kumar, A. Barmpoutis, A. Banerjee, and B. C. Vemuri, "Non-Lambertian reflectance modeling and shape recovery for faces using anti-symmetric tensor splines," Technical Report REP-2009-467, Dept. of CISE, Univ. of Florida, 2009.

[6] H. Chan and W. W. Bledsoe, "A man-machine facial recognition system - some preliminary results," Panoramic Research Inc., Tech. Rep., 1965.

[7] R. Basri and D. W. Jacobs, "Lambertian reflectance and linear subspaces," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 218–233, 2003.

[8] P. Debevec, T. Hawkins, C. Tchou, H. Duiker, W. Sarokin, and M. Sagar, "Acquiring the reflectance field of a human face," ACM Trans. on Graphics (Proc. SIGGRAPH), vol. 1, pp. 145–156, 2000.

[9] R. Ramamoorthi and P. Hanrahan, "A signal-processing framework for inverse rendering," ACM Trans. on Graphics (Proc. SIGGRAPH), vol. 1, pp. 117–128, 2001.

[10] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063–1074, Sept. 2003.

[11] L. Zhang and D. Samaras, "Face recognition from a single training image under arbitrary unknown lighting using spherical harmonics," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 351–363, March 2006.

[12] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, June 2001.
[13] A. Barmpoutis, B. C. Vemuri, and J. R. Forder, "Robust tensor splines for approximation of diffusion tensor MRI data," Proc. IEEE CS Workshop on Mathematical Methods in Biomedical Image Analysis, pp. 86–86, 2006.

[14] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "Illumination-based image synthesis: Creating novel images of human faces under differing pose & lighting," IEEE Workshop on Multi-View Modeling and Analysis of Visual Scenes, 1999.

[15] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," ACM Trans. on Graphics (Proc. SIGGRAPH), pp. 187–194, 1999.

[16] T. Malzbender, D. Gelb, and H. Wolters, "Polynomial texture maps," ACM Trans. on Graphics (Proc. SIGGRAPH), pp. 519–528, 2001.

[17] A. Shashua and T. Riklin-Raviv, "The quotient image: Class-based re-rendering and recognition with varying illuminations," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 129–139, Feb. 2001.

[18] W. Y. Zhao and R. Chellappa, "Symmetric shape-from-shading using self-ratio image," Int'l J. Computer Vision, vol. 45, no. 1, pp. 55–75, 2001.

[19] S. Magda, D. J. Kriegman, T. Zickler, and P. N. Belhumeur, "Beyond Lambert: Reconstructing surfaces with arbitrary BRDFs," Proc. IEEE Int'l Conf. Computer Vision, vol. 2, pp. 391–398, 7–14 July 2001.

[20] A. S. Georghiades, "Recovering 3-D shape and reflectance from a small number of photographs," Proc. Eurographics Workshop on Rendering, pp. 230–240, 2003.

[21] Z. Wen, Z. Liu, and T. S. Huang, "Face relighting with radiance environment maps," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, vol. 2, pp. II–158–65, 18–20 June 2003.

[22] R. Gross, I. Matthews, and S. Baker, "Appearance-based face recognition and light-fields," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 4, pp. 449–465, April 2004.

[23] D. B. Goldman, B. Curless, A. Hertzmann, and S. M. Seitz, "Shape and spatially-varying BRDFs from photometric stereo," Proc. IEEE Int'l Conf. Computer Vision, vol. 1, pp. 341–348, 17–21 Oct. 2005.

[24] A. Hertzmann and S. M. Seitz, "Example-based photometric stereo: Shape reconstruction with general, varying BRDFs," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1254–1264, Aug. 2005.

[25] K. C. Lee and B. Moghaddam, "A practical face relighting method for directional lighting normalization," Int'l Workshop on Analysis and Modelling of Faces and Gestures, 2005.
[26] T. Zickler, R. Ramamoorthi, S. Enrique, and P. N. Belhumeur, "Reflectance sharing: Predicting appearance from a sparse set of images of a known shape," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1287–1302, Aug. 2006.

[27] M. Chandraker, S. Agarwal, and D. Kriegman, "ShadowCuts: Photometric stereo with shadows," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2007.

[28] N. Alldrin and D. Kriegman, "Shape from varying illumination and viewpoint," Proc. IEEE Int'l Conf. Computer Vision, 2007.

[29] S. Biswas and G. Aggarwal, "Robust estimation of albedo for illumination-invariant matching & shape recovery," Proc. IEEE Int'l Conf. Computer Vision, 2007.

[30] S. K. Zhou, G. Aggarwal, R. Chellappa, and D. W. Jacobs, "Appearance characterization of linear Lambertian objects, generalized photometric stereo, and illumination-invariant face recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 2, pp. 230–245, Feb. 2007.

[31] R. Basri, D. Jacobs, I. Kemelmacher, and R. Basri, "Photometric stereo with general, unknown lighting," Int'l J. Computer Vision, vol. 72, no. 3, pp. 239–257, 2007.

[32] N. Alldrin, T. Zickler, and D. Kriegman, "Photometric stereo with non-parametric and spatially-varying reflectance," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 1–8, 23–28 June 2008.

[33] A. Shashua, "On photometric issues in 3D visual recognition from a single 2D image," Int'l J. Computer Vision, vol. 21, pp. 99–122, 1997.

[34] A. L. Yuille, D. Snow, R. Epstein, and P. N. Belhumeur, "Determining generative models of objects under varying illumination: Shape and albedo from multiple images using SVD and integrability," Int'l J. Computer Vision, vol. 35, no. 3, pp. 203–222, 1999.

[35] P. Belhumeur and D. Kriegman, "What is the set of images of an object under all possible illumination conditions?" Int'l J. Computer Vision, vol. 28, pp. 245–260, 1998.

[36] R. Ramamoorthi and P. Hanrahan, "The relationship between radiance and irradiance: Determining the illumination from images of a convex Lambertian object," J. Optical Society of America A, vol. 18, no. 10, pp. 2448–2459, 2001.

[37] K.-C. Lee, J. Ho, and D. J. Kriegman, "Acquiring linear subspaces for face recognition under variable lighting," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684–698, May 2005.
[38] J. Lee, B. Moghaddam, H. Pfister, and R. Machiraju, “A bilinear illumination model for robust face recognition,” Proc. IEEE Int’l Conf. Computer Vision, vol. 2, pp. 1177–1184, 17–21 Oct. 2005.
[39] J. P. O’Shea, M. S. Banks, and M. Agrawala, “The assumed light direction for perceiving shape from shading,” Proc. Symposium on Applied Perception in Graphics and Visualization, pp. 135–142, 2008.
[40] P. N. Belhumeur, D. J. Kriegman, and A. L. Yuille, “The bas-relief ambiguity,” Int’l J. Computer Vision, vol. 35, no. 1, pp. 33–44, 1999.
[41] Z. Liu, Y. Shan, and Z. Zhang, “Expressive expression mapping with ratio images,” ACM Trans. on Graphics (Proc. SIGGRAPH), pp. 271–276, 2001.
[42] A. Hertzmann and S. M. Seitz, “Shape and materials by example: A photometric stereo approach,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, vol. 1, pp. I–533–I–540, 18–20 June 2003.
[43] T.-P. Wu, K.-L. Tang, C.-K. Tang, and T.-T. Wong, “Dense photometric stereo: A Markov random field approach,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1830–1846, Nov. 2006.
[44] S. C. Foo, “A gonioreflectometer for measuring the bidirectional reflectance of material for use in illumination computation,” Master’s thesis, Cornell University, 1997.
[45] G. Borshukov and J. P. Lewis, “Realistic human face rendering for ‘The Matrix Reloaded’,” ACM SIGGRAPH 2003 Sketches & Applications, pp. 1–1, 2003.
[46] T. Weyrich, W. Matusik, H. Pfister, B. Bickel, C. Donner, C. Tu, J. McAndless, J. Lee, A. Ngan, H. W. Jensen, and M. Gross, “Analysis of human faces using a measurement-based skin reflectance model,” ACM Trans. on Graphics (Proc. SIGGRAPH), pp. 1013–1024, 2006.
[47] H. Jeffreys, Cartesian Tensors. Cambridge: The University Press, 1931.
[48] C. de Boor, “On calculating with B-splines,” J. Approximation Theory, vol. 6, pp. 50–62, 1972.
[49] C. Lawson and R. J. Hanson, Solving Least Squares Problems. Englewood Cliffs: Prentice-Hall, 1974.
[50] A. Barmpoutis, B. C. Vemuri, T. M. Shepherd, and J. R. Forder, “Tensor splines for interpolation and approximation of DT-MRI with applications to segmentation of isolated rat hippocampi,” IEEE Trans. Medical Imaging, vol. 26, no. 11, pp. 1537–1546, Nov. 2007.
[51] B. K. P. Horn, Robot Vision. New York: McGraw Hill, 1986.
[52] T. Sim and T. Kanade, “Combining models and exemplars for face recognition: An illuminating example,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition Workshop on Models versus Exemplars in Computer Vision, 2001.
[53] P. Debevec, “Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography,” ACM Trans. on Graphics (Proc. SIGGRAPH), pp. 189–198, 1998.
[54] R. Gross and V. Brajovic, “An image preprocessing algorithm for illumination invariant face recognition,” Int’l Conf. Audio- and Video-Based Biometric Person Authentication, pp. 10–18, 2003.
[55] R. Brunelli and T. Poggio, “Face recognition: Features versus templates,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1042–1052, Oct. 1993.
[56] P. W. Hallinan, “A low-dimensional representation of human faces for arbitrary lighting conditions,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 995–999, 21–23 June 1994.
[57] A. U. Batur and M. H. Hayes III, “Linear subspaces for illumination robust face recognition,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 296–301, 2001.
[58] W. A. Smith and E. R. Hancock, “Facial shape-from-shading and recognition using principal geodesic analysis and robust statistics,” Int’l J. Computer Vision, vol. 76, no. 1, pp. 71–91, 2008.
[59] C. Ding and X. He, “K-means clustering via principal component analysis,” Int’l Conf. Machine Learning, 2004.
[60] R. Basri and D. Jacobs, “Lambertian reflectance and linear subspaces,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 218–233, 2003.
[61] K.-C. Lee and B. Moghaddam, “A practical face relighting method for directional lighting normalization,” IEEE Int’l Workshop on Analysis and Modeling of Faces and Gestures, 2005.
[62] M. Jones and T. Poggio, “Multidimensional morphable models: A framework for representing and matching object classes,” Proc. IEEE Int’l Conf. Computer Vision, pp. 683–688, 1998.
[63] L. Zhang and D. Samaras, “Face recognition under variable lighting using harmonic image exemplars,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. I–19–I–25, 2003.
[64] A. S. Georghiades, P. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[65] K.-C. Lee, J. Ho, and D. J. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684–698, 2005.
[66] S. Wang, L. Zhang, and D. Samaras, “Face reconstruction across different poses and arbitrary illumination conditions,” Biometric Authentication Workshop, pp. 91–101, 2005.
[67] V. Blanz and T. Vetter, “Face recognition based on fitting a 3D morphable model,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063–1074, 2003.
[68] A. M. Bronstein, M. M. Bronstein, and R. Kimmel, “Robust expression-invariant face recognition from partially missing data,” European Conf. Computer Vision, pp. 396–408, 2006.
[69] X. Li, G. Mori, and H. Zhang, “Expression-invariant face recognition with expression classification,” Canadian Conf. Computer and Robot Vision, pp. 77–83, 2006.
[70] A. Pentland, B. Moghaddam, and T. Starner, “View-based and modular eigenfaces for face recognition,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 1994.
[71] P. N. Belhumeur, J. Hespanha, and D. J. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
[72] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang, “Face recognition using Laplacianfaces,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 3, 2005.
[73] D. Cai, X. He, J. Han, and H.-J. Zhang, “Orthogonal Laplacianfaces for face recognition,” IEEE Trans. Image Processing, vol. 15, no. 11, 2006.
[74] X. He, D. Cai, S. Yan, and H.-J. Zhang, “Neighborhood preserving embedding,” Proc. IEEE Int’l Conf. Computer Vision, pp. 1208–1213, 2005.
[75] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40–51, 2007.
[76] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” Advances in Neural Information Processing Systems, pp. 585–591, 2002.
[77] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, pp. 2323–2326, 2000.
[78] X. He, D. Cai, and P. Niyogi, “Locality preserving projections,” Advances in Neural Information Processing Systems, 2003.
[79] S. An, W. Liu, and S. Venkatesh, “Exploiting side information in locality preserving projection,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2008.
[80] D.-S. Pham and S. Venkatesh, “Robust learning of discriminative projection for multicategory classification on the Stiefel manifold,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2008.
[81] S. An, W. Liu, and S. Venkatesh, “Face recognition using kernel ridge regression,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2007.
[82] X. He, D. Cai, and P. Niyogi, “Tensor subspace analysis,” Advances in Neural Information Processing Systems, 2005.
[83] J. Ye, R. Janardan, and Q. Li, “Two-dimensional linear discriminant analysis,” Advances in Neural Information Processing Systems, 2004.
[84] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, and H.-J. Zhang, “Multilinear discriminant analysis for face recognition,” IEEE Trans. Image Processing, vol. 16, no. 1, 2007.
[85] M. Vasilescu and D. Terzopoulos, “Multilinear subspace analysis of image ensembles,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 212–220, 2003.
[86] G. Hua, P. Viola, and S. Drucker, “Face recognition using discriminatively trained orthogonal rank one tensor projections,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2007.
[87] F. Wang and C. Zhang, “Feature extraction by maximizing the average neighborhood margin,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2007.
[88] Y. Fu and T. S. Huang, “Image classification using correlation tensor analysis,” IEEE Trans. Image Processing, vol. 17, no. 2, 2008.
[89] D. Cai, X. He, and J. Han, “Spectral regression for efficient regularized subspace learning,” Proc. IEEE Int’l Conf. Computer Vision, 2007.
[90] D. Cai, X. He, Y. Hu, J. Han, and T. Huang, “Learning a spatially smooth subspace for face recognition,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2007.
[91] L. Zhang, S. Wang, and D. Samaras, “Face synthesis and recognition from a single image under arbitrary unknown lighting using a spherical harmonic basis morphable model,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 209–216, 2005.
[92] J. A. Cherry, “Introduction to Volterra methods,” Distortion Analysis of Weakly Nonlinear Filters Using Volterra Series, 1994.
[93] H. Shan and G. W. Cottrell, “Looking around the backyard helps to recognize faces and digits,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2008.
[94] S. Rana, W. Liu, M. Lazarescu, and S. Venkatesh, “Recognising faces in unseen modes: A tensor based approach,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2008.
[95] T. Sim, S. Baker, and M. Bsat, “The CMU pose, illumination, and expression (PIE) database,” Proc. IEEE Int’l Conf. Automatic Face and Gesture Recognition, 2002.
[96] C.-P. Chen and C.-S. Chen, “Lighting normalization with generic intrinsic illumination subspace for face recognition,” Proc. IEEE Int’l Conf. Computer Vision, 2005.
[97] D.-H. Liu, K.-M. Lam, and L.-S. Shen, “Illumination invariant face recognition,” Pattern Recognition, vol. 38, no. 10, pp. 1705–1716, 2005.
[98] E. N. Coleman and R. Jain, “Obtaining 3-dimensional shape of textured and specular surfaces using four-source photometry,” Computer Graphics and Image Processing, vol. 18, pp. 309–328, 1982.
[99] M. Jones and P. Viola, “Face recognition using boosted local features,” Technical Report TR2003-25, MERL, 2003.
[100] T. Ahonen, A. Hadid, and M. Pietikainen, “Face recognition with local binary patterns,” Proc. European Conf. Computer Vision, 2004.
BIOGRAPHICAL SKETCH
Ritwik Kumar received his Bachelor of Technology degree in Information and
Communication Technology from the Dhirubhai Ambani Institute of Information and
Communication Technology (DAIICT), Gandhinagar, India in 2005. Since 2005 he
has been a Ph.D. student at the Center for Vision, Graphics and Medical Imaging in the
Department of Computer and Information Science and Engineering at the University
of Florida, Gainesville, FL, USA. His research interests include machine learning, color
video analysis, face recognition and medical image analysis. He is a recipient of the
DAIICT President’s Gold Medal (2005) and the University of Florida Alumni Fellowship
(2005–2009). He received the best student paper award at the Twelfth International
Conference on Advanced Computing and Communications (ADCOM) in 2004.