PROJECT REPORT
On
Improvement of Viola Jones Face Detector using Pre-Processing
Submitted in partial fulfillment for the award of the degree of
BACHELOR OF TECHNOLOGY
in ELECTRONICS AND COMMUNICATION ENGINEERING
By
SHREYAS SESHADRI (10408611)
Under the guidance of
Mrs. G. REVATHI, M.E. (Assistant Professor (O.G.), School of Electronics and Communication Engineering)
FACULTY OF ENGINEERING AND TECHNOLOGY SRM UNIVERSITY, RAMAPURAM CAMPUS.
(Under section 3 of UGC Act, 1956) Chennai – 600089.
MAY, 2012
BONAFIDE CERTIFICATE
Certified that this project report titled Improvement of Viola-Jones Face Detector
using Pre-Processing is the bonafide work of SHREYAS SESHADRI (10408611), who
carried out the project under my supervision.
H.O.D Internal Guide
Date:
Internal Examiner External Examiner
Certificate from Germany to be added
Attendance certificate from Germany to be added
ACKNOWLEDGEMENT
I place on record my deep sense of gratitude to our beloved Chancellor, Dr.
T.R. PACHAMUTHU, for providing us with the requisite infrastructure throughout the
course.
I take the opportunity to extend my hearty thanks to our Chairman, SRM
University, Ramapuram, Mr. R. SHIVAKUMAR, for his constant encouragement.
I convey my sincere thanks to our Dean, Dr. K. ABDUL GHANI, and Vice
Principal, Dr. L. ANTONY MICHAEL RAJ, for their interest and support.
I take the privilege to extend my hearty thanks to the Head of the Department, Mrs.
T. BEULA CHITRA PRIYA, for her suggestions, support and encouragement towards the
completion of the project with perfection.
I thank my Internal Guide, Mrs. G.REVATHI, for her timely help and guidance
throughout the overall process of the project.
I would like to express my sincere thanks to all the staff members of the
Department of Electronics and Communication, who gave many suggestions from time to
time that made my project work better and more complete.
Finally, I am indebted to Prof. Dr.-Ing. BODO ROSENHAHN of Leibniz
University Hannover and his team, who made me feel absolutely at home at the university.
They were very helpful with technical inputs whenever required. But for their guidance
this project work would not have been possible.
ABSTRACT
This technical report describes my final-year Bachelor's project, undertaken at
Leibniz Universität Hannover under the supervision of Prof. Dr.-Ing. Bodo Rosenhahn
from November 2011 to March 2012.
Face detection has become an important feature in many mobile devices these
days and is fundamental to a variety of human-computer interface systems. The project
presented here proposes a method to improve the performance and computation time of the
popularly used Viola-Jones face detection framework. This is done using skin colour
detection and Canny edge detection algorithms as pre-processing steps.
The project involves the creation of an iOS 5 app which detects faces in a live
video feed. The same algorithm is implemented as a Mac OS project for testing. The
performance of the proposed method was tested and showed good preliminary results.
TABLE OF CONTENTS
CHAPTER TITLE PAGE
ABSTRACT vi
TABLE OF CONTENTS vii
LIST OF FIGURES x
LIST OF TABLES xi
LIST OF CHARTS xii
1 INTRODUCTION 1
1.1 FACE DETECTION 1
1.2 VIOLA-JONES 1
1.3 PRE-PROCESSING 1
1.4 APPLICATIONS AND USES 2
2 THEORY 3
2.1 MACHINE LEARNING 3
2.1.1 ARTIFICIAL NEURAL NETWORK 3
2.1.2 PERCEPTRON 4
2.1.3 ADABOOSTING 5
2.2 VIOLA-JONES 5
2.2.1 INTEGRAL IMAGE 6
2.2.2 ADABOOST LEARNING 8
2.2.3 CASCADE ARCHITECTURE 9
2.3 CANNY EDGE DETECTION 10
2.3.1 NOISE REDUCTION 10
2.3.2 INTENSITY GRADIENTS 11
2.3.3 NON-MAXIMUM SUPPRESSION 11
2.3.4 ADDITIONAL STEPS 11
2.3.5 EXAMPLES 12
2.4 SKIN COLOUR DETECTION 13
2.4.1 HSI MODEL 13
2.4.2 SKIN COLOUR 14
2.4.3 EXAMPLES 14
3 DETAILED EXPLANATION OF PROJECT 16
3.1 STRUCTURE OF PROJECT 16
3.1.1 iOS PROJECT 16
3.1.2 MAC PROJECT 17
3.2 TRAINING OF CLASSIFIERS 18
3.3 IMAGE PROCESSING 19
3.3.1 SKIN COLOUR DETECTION 19
3.3.2 RGB TO GREY 21
3.3.3 CANNY PRUNING 21
3.3.4 SLIDING WINDOW 23
3.3.5 PRE-PROCESSING 24
3.3.6 FACE DETECTION 25
4 EXPERIMENTAL ANALYSIS 27
4.1 THEORY 27
4.2 ANALYSIS 28
4.2.1 TEST SET OF IMAGES 28
4.2.2 DIFFERENT CASES 29
4.2.3 FPR AND TPR CALCULATION 30
4.2.4 ANALYSIS OF SPEED 31
4.3 EXPECTED RESULTS 31
5 SPECIFICATIONS 33
5.1 SOFTWARE 33
5.1.1 XCODE 33
5.1.2 MATLAB 33
5.1.3 LIBRARIES 33
5.2 HARDWARE 34
5.2.1 iMAC/ MACBOOK PRO 34
5.2.2 iPOD TOUCH 4/ iPAD 2 34
5.3 COMPUTERS USED IN TRAINING 34
5.4 GIT REPOSITORY 35
5.5 SIMD 36
6 RESULTS 37
6.1 PERFORMANCE COMPARISON 37
6.2 SPEED COMPARISON 38
6.3 EXAMPLE IMAGES 39
7 APPENDIX 43
7.1 SKIN COLOUR DETECTION 43
7.2 CANNY PRUNING 44
8 CONCLUSION 51
9 FUTURE ENHANCEMENTS 52
REFERENCES 53
WEBSITES 54
LIST OF FIGURES
FIGURE NO. FIGURE TITLE PAGE
2.1 ARTIFICIAL NEURAL NETWORK 3
2.2 PERCEPTRON 4
2.3 EXAMPLE OF RECTANGULAR FEATURE 6
2.4 VALUE OF INTEGRAL IMAGE AT POINT (x,y) 7
2.5 SUM OF PIXELS IN RECTANGLE 8
2.6 FIRST 2 FEATURES SELECTED BY ADABOOST 9
2.7 CASCADED ARCHITECTURE 9
2.8 EXAMPLES 12
2.9 HSI MODEL 13
2.10 EXAMPLES 15
3.1 APP RUNNING ON iPOD TOUCH 4 17
3.2 EXAMPLE 20
3.3 EXAMPLE 21
3.4 EXAMPLE 22
3.5 EXAMPLE OF SCALING 23
3.6 VARIANCE NORMALIZATION 25
4.1 CONFUSION MATRIX 27
4.2 TEST IMAGES 29
5.1 STORAGE LEVELS OF GIT REPOSITORY 35
5.2 SIMD 36
6.1 EXAMPLE - 1 39
6.2 EXAMPLE - 2 40
6.3 EXAMPLE - 3 41
6.4 EXAMPLE - 4 42
LIST OF TABLES
TABLE NO. TABLE NAME PAGE
3.1 NUMBER OF RECTANGULAR FEATURES PER STAGE 18
6.1 PERFORMANCE COMPARISON 37
LIST OF CHARTS
CHART NO. CHART NAME PAGE
6.1 SPEED-UP COMPARISON 38
CHAPTER - I
INTRODUCTION
1.1 FACE DETECTION
Face detection is the computer technology by which the locations and sizes
of human faces in arbitrary digital images are obtained. Face detection can be regarded as a
specific case of object detection. In object detection, the task is to find the locations and
sizes of all objects in an image that belong to a particular class. Examples include cars,
traffic lights and faces.
Face detection has become an important feature in most mobile devices
these days. It is fundamental to a variety of human-computer interface systems. Face
detection is used in biometrics, often as an initial step in a face recognition system. It is
also used in video surveillance, human-computer interaction and image database
management. Some recent mobile devices use face detection for autofocus.
1.2 VIOLA-JONES FACE DETECTION FRAMEWORK
The Viola-Jones face detection framework is one of the most widely used
methodologies for face detection. The method was proposed by Paul Viola and Michael J.
Jones in 2001 [10]. The framework is based on the computation of rectangular features,
where a feature is the difference between the pixel sums of two or more adjacent, equally
sized rectangles within a sub-window of an image. The problem with this method is that it
is computationally intensive when run on mobile devices. This is not a big problem for
dedicated devices such as digital cameras, which are designed to perform this task, but on
a general-purpose device such as a smartphone the method has to be improved.
1.3 PRE-PROCESSING
The most obvious choice for improvement is the use of pre-processing. The
idea behind pre-processing is that regions of the image which can easily be classified as
non-faces are eliminated before entering the actual Viola-Jones face detection framework.
This increases the computational efficiency as well as reduces the number of false
positives in the final output.
This project analyzes the performance of the Viola-Jones face detection
framework with skin colour detection and Canny edge detection as the pre-processing
steps. Skin colour detection is used to quickly reject sub-windows of the image which
contain too few skin-coloured pixels. Edge detection is used as a pre-processing step so
that sub-windows having too few or too many edges to be a face are easily eliminated.
1.4 APPLICATIONS AND USES
This is the age of smartphones: the smartphone functions as one integrated
tool for personal and business needs and is becoming an extended arm of the modern user.
Any improvement in an application relevant to smartphones is therefore immensely useful.
CHAPTER - II
THEORY
2.1 MACHINE LEARNING
Machine learning is a branch of artificial intelligence concerned with the
design and development of algorithms that allow computers to evolve behaviors based on
empirical data, i.e. data produced by observation or experiment [a]. A learner can capture
characteristics of interest from the data set. This data can be seen as examples that illustrate
relations between observed variables. The major focus of machine learning research is to
have computers automatically learn to recognize complex patterns and make intelligent
decisions based on data.
2.1.1 ARTIFICIAL NEURAL NETWORK
An Artificial Neural Network (ANN) is a mathematical or computational
model that is inspired by the way biological nervous systems, such as the brain, process
information. A neural network consists of an interconnected group of artificial neurons. An
ANN is configured for a specific application, such as pattern recognition or data
classification, through a learning process.
Conventional computers use an algorithmic approach i.e. they follow a set
of instructions in order to solve a problem. Unless the specific steps that the computer
needs to follow are known the computer cannot solve the problem. That restricts the
problem solving capability of conventional computers to problems that we already
understand and know how to solve. Neural networks on the other hand learn by example.
They cannot be programmed to perform a specific task. They are trained with a set of
specific examples of a particular problem, and learn how to solve that problem.
Fig 2.1- Artificial Neural Network (ANN)[l]
Fig. 2.1 shows an example of a simple ANN. Every ANN consists of the
input, output and hidden layers. Each of these layers has a set of units which represent
weights that manipulate the data in the calculations. ANNs can be classified as [p, c]:
• Feedforward ANNs, where information travels only in one direction, i.e. from the input
layer to the output layer. Fig. 2.1 is an example of a simple feedforward network.
• Feedback ANNs where there are signals traveling in both directions.
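As an illustration of the feedforward case, the sketch below passes an input vector through one hidden layer and one output unit. It is a toy Python example with assumed weights and a sigmoid activation, not a trained network and not code from this project.

```python
import math

# Minimal feedforward pass for a small network like Fig. 2.1:
# input layer -> one hidden layer -> one output unit.
# All weights and biases are assumed toy values.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Each unit computes sigmoid(w . inputs + b)."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [0.5, -1.0]                                           # input layer values
hidden = layer(x, [[1.0, -0.5], [0.3, 0.8]], [0.0, 0.1])  # two hidden units
output = layer(hidden, [[1.2, -0.7]], [0.0])              # one output unit
print(output)
```

In a feedback network, by contrast, the output would be fed into earlier layers again rather than flowing strictly left to right.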
2.1.2 PERCEPTRON
A perceptron is the simplest form of a feedforward network. It is a binary
classifier which maps its input x (a real-valued vector) to an output value f(x) (a single
binary value) as shown in Eq. (2.1)
Here w is a vector of real-valued weights, w.x is the dot product which
computes a weighted sum, and b is the 'bias', a constant term that does not depend on any
input value. The value of f(x) (0 or 1) is used to classify x as either a positive or a negative
instance. If b is negative, then the weighted combination of inputs must produce a positive
value greater than | b | in order to push the classifier neuron over the 0 threshold.
Fig. 2.2 - Perceptron [h]
Fig. 2.2 shows a simple perceptron with seven inputs, each of which is
multiplied by a corresponding weight; the weighted sum is passed through the function f(x)
to give an output y.
f(x) = 1 if w · x + b > 0, and 0 otherwise        (2.1)
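The decision rule of Eq. (2.1) can be sketched in a few lines. The weights and bias below are assumed toy values rather than learned ones; the snippet is only illustrative:

```python
# Perceptron decision rule of Eq. (2.1): output 1 if w.x + b > 0, else 0.

def perceptron(x, w, b):
    """Binary classification of input vector x with weights w and bias b."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else 0

# Example: a 3-input perceptron with assumed weights.
w = [0.5, -0.6, 0.2]
b = -0.1
print(perceptron([1.0, 0.0, 1.0], w, b))  # weighted sum 0.6 > 0, so 1
print(perceptron([0.0, 1.0, 0.0], w, b))  # weighted sum -0.7 <= 0, so 0
```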
2.1.3 ADABOOSTING
Boosting, a machine learning algorithm, refers to a general and provably
effective method of producing a very accurate prediction rule by combining rough and
moderately inaccurate rules of thumb [12]. It is the method of combining a set of weak
learners to create a single strong learner. A weak learner is defined to be a classifier which
is only slightly correlated with the true classification i.e. for a given problem the weak
learner may only classify the training data correctly 51% of the time. It is only slightly
better than random guessing. In contrast, a strong learner is a classifier that is arbitrarily
well-correlated with the true classification [10].
Kearns and Valiant [5,6] were the first to pose the question of whether a
“weak” learning algorithm which performs just slightly better than random guessing can be
“boosted” into an arbitrarily accurate “strong” learning algorithm. Schapire [8] came up
with the first provable polynomial-time boosting algorithm in 1989.
AdaBoost is short for Adaptive Boosting, formulated by Yoav Freund and
Robert E. Schapire in 1995 [12]. AdaBoost is adaptive in the sense that subsequent
classifiers built are tweaked in favor of those instances misclassified by previous
classifiers.
AdaBoost is an algorithm for constructing a strong classifier as a linear
combination of simple weak classifiers ht(x), as shown in Eq. (2.2)

H(x) = sign( Σ(t=1 to T) αt ht(x) )        (2.2)

Here ht(x) is the weak classifier, H(x) is the strong or final classifier and αt
is the weight or importance assigned to each weak classifier ht(x).
2.2 VIOLA-JONES FACE DETECTION FRAMEWORK
The Viola-Jones face detection framework is one of the most widely used
methodologies for face detection. It was proposed by Paul Viola and Michael J. Jones in
2001 [10] and is based on the computation of rectangular features. It achieves high
detection rates with relatively low computation time.
It has the following major innovations.
• Integral Image - a new image representation for fast computation of rectangular features
• AdaBoost learning algorithm - A set of classifiers built based on the Adaboosting
algorithm
• Cascaded Architecture - An efficient method for combining the classifiers to reduce
computation.
The following sections take an in-depth look at each of these.
2.2.1 INTEGRAL IMAGE
This face detection procedure classifies images based on the value of
simple rectangular features. Three kinds of features are used, as shown in Fig. 2.3.
Fig. 2.3 - Example of rectangular features [j]
In Fig. 2.3, features A and B are two-rectangle features, whose value is the
difference between the sums of the pixel values within two adjacent rectangular
regions (shown as grey and white). Feature C is a three-rectangle feature, which
computes the sum of pixel values within the two white (outer) rectangles subtracted from
the sum in the grey (central) rectangle. Feature D is a four-rectangle feature, which
computes the difference of pixel sums between diagonal pairs of rectangles.
These rectangle features can be computed very rapidly using an
intermediate representation of the image called the integral image. It is based on
the summed area table, an algorithm for quickly and efficiently generating the
sum of values in a rectangular subset of a grid. It was first introduced to the computer
graphics world in 1984, for texture mapping [1]. The value of the integral image at point
(x,y) contains the sum of the pixels above and to the left of (x,y), on the original image. It is
as shown in Eq. (2.3)
Here ii(x, y) is the integral image and i(x, y) is the original image. This can
be efficiently calculated using the pair of recursive equations shown in Eq. (2.4)
Here s(x, y) is the cumulative row sum, where s(x, −1) = 0, and ii(−1, y) =
0. Fig. 2.4 shows the value of the integral image at a point (x,y) as the sum of the pixels in
the shaded region.
Fig. 2.4 - Value of integral image at a point (x,y)[10]
Using the integral image any rectangular sum can be computed very easily
as shown in Fig 2.5. Here the sum of the pixels within rectangle D is to be calculated. The
value of the integral image at location 1 is the sum of the pixels in rectangle A. The value
at location 2 is A+B, at location 3 is A+C, and at location 4 is A+B+C+D. The sum within
D can be computed as 4 + 1 − (2 + 3).
ii(x, y) = Σ(x' ≤ x, y' ≤ y) i(x', y')        (2.3)
s(x, y) = s(x, y − 1) + i(x, y)
ii(x, y) = ii(x − 1, y) + s(x, y)        (2.4)
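The one-pass recurrence of Eq. (2.4) and the four-reference rectangle sum of Fig. 2.5 can be sketched as follows. This is a toy Python example on a 3 × 3 image with 0-indexed, inclusive rectangle corners; it is not the project's implementation:

```python
# Integral image (Eqs. 2.3/2.4) and the four-reference rectangle sum.

def integral_image(img):
    """ii[y][x] = sum of img over rows 0..y and columns 0..x, in one pass."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0  # s(x, y), the cumulative row sum
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the rectangle (x0, y0)..(x1, y1) inclusive,
    using four array references: 4 + 1 - (2 + 3)."""
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

Because the sum of any rectangle costs four references regardless of its size, a two-rectangle feature needs only six references, a three-rectangle feature eight, and a four-rectangle feature nine.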
2.2.2 ADABOOST LEARNING ALGORITHM
The weak classifier used in the Viola-Jones framework is shown in Eq.
(2.5), where a weak classifier h(x, f, p, θ) consists of a rectangular feature f, a threshold θ
and a polarity p indicating the direction of the inequality, and x is a 24 × 24 pixel sub-
window of the image.

h(x, f, p, θ) = 1 if p f(x) < p θ, and 0 otherwise        (2.5)
This framework is based on the AdaBoost learning algorithm and uses it to
both select the features and train the classifier. The learning algorithm is as follows [10]:
• Given example images (x1, y1), ..., (xn, yn), where yi = 0 for negative and yi = 1 for positive examples.
• Initialize weights w1,i = 1/2m for yi = 0 and w1,i = 1/2l for yi = 1, where m and l are the numbers of negative and positive examples respectively.
• For t = 1, ..., T:
a. Normalize the weights so that wt is a probability distribution.
b. Select the best weak classifier with respect to the weighted error, i.e. the one with minimum error εt.
c. Define ht(x) = h(x, ft, pt, θt), where ft, pt and θt are the minimizers of εt.
d. Update the weights as wt+1,i = wt,i βt^(1−ei), where ei = 0 if example xi is classified correctly, ei = 1 otherwise, and βt = εt/(1 − εt).
• The final strong classifier is C(x) = 1 if Σ(t=1..T) αt ht(x) ≥ (1/2) Σ(t=1..T) αt, and 0 otherwise, where αt = log(1/βt).
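To make the loop above concrete, the sketch below runs the same boosting steps with simple threshold "stumps" on 1-D samples standing in for the Haar-feature weak classifiers. It is an illustrative Python toy (uniform initial weights rather than the 1/2m, 1/2l split), not the project's training code:

```python
import math

# Toy AdaBoost: weak classifiers are stumps h(x) = 1 if p*x < p*theta else 0,
# mirroring the (feature, polarity, threshold) form used above.

def train_adaboost(xs, ys, T):
    n = len(xs)
    w = [1.0 / n] * n            # uniform initial weights (simplification)
    classifiers = []             # list of (theta, p, alpha)
    for _ in range(T):
        w = [wi / sum(w) for wi in w]          # a. normalize the weights
        best = None                            # b. minimum weighted error
        for theta in xs:
            for p in (1, -1):
                err = sum(wi for wi, x, y in zip(w, xs, ys)
                          if (1 if p * x < p * theta else 0) != y)
                if best is None or err < best[0]:
                    best = (err, theta, p)
        err, theta, p = best
        err = max(err, 1e-9)                   # guard against a perfect stump
        beta = err / (1.0 - err)
        alpha = math.log(1.0 / beta)
        classifiers.append((theta, p, alpha))
        # d. down-weight correctly classified examples by beta
        w = [wi * (beta if (1 if p * x < p * theta else 0) == y else 1.0)
             for wi, x, y in zip(w, xs, ys)]
    return classifiers

def strong_classify(classifiers, x):
    """C(x) = 1 if sum(alpha_t h_t(x)) >= 1/2 sum(alpha_t)."""
    s = sum(a for theta, p, a in classifiers if p * x < p * theta)
    return 1 if s >= 0.5 * sum(a for _, _, a in classifiers) else 0

xs = [1, 2, 3, 4]
ys = [0, 1, 1, 0]   # not separable by any single stump
clfs = train_adaboost(xs, ys, T=3)
print([strong_classify(clfs, x) for x in xs])  # [0, 1, 1, 0]
```

No single stump classifies this data correctly, but the weighted vote of three boosted stumps does, which is exactly the effect the cascade stages rely on.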
Fig. 2.5 - Sum of pixels in rectangle D is computed as integral image values at locations 4 + 1 − (2 + 3) [10]
The first two features selected by the AdaBoost algorithm are shown in
Fig. 2.6. The first is a two-rectangle feature which measures the difference in intensity
between the darker region of the eyes and the lighter region of the cheekbones. The second
is a three-rectangle feature which compares the intensities in the darker eye regions to the
intensity across the lighter bridge of the nose.
Fig. 2.6 - First two features selected by the AdaBoost learning algorithm
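In terms of raw pixel sums, these two feature types are simple differences of region sums. The NumPy sketch below is illustrative only: the region coordinates are placeholders, not the positions actually learned in [10].

```python
import numpy as np

# What the first two feature types measure, written as plain region sums over a
# 24 x 24 grayscale window. Regions are (y0, y1, x0, x1) slices; the coordinates
# used below are illustrative placeholders, not the learned feature positions.

def two_rect(window, top, bottom):
    """Two-rectangle feature: sum over one region minus sum over the other."""
    (y0, y1, x0, x1) = top
    (y2, y3, x2, x3) = bottom
    return window[y0:y1, x0:x1].sum() - window[y2:y3, x2:x3].sum()

def three_rect(window, left, centre, right):
    """Three-rectangle feature: the two outer sums minus the centre sum."""
    sums = [window[y0:y1, x0:x1].sum() for (y0, y1, x0, x1) in (left, centre, right)]
    return sums[0] + sums[2] - sums[1]

# Hypothetical usage on one sub-window (an eye-row band vs. a cheek band):
window = np.zeros((24, 24))
window[6:10, 4:20] = 255.0
eyes_vs_cheeks = two_rect(window, top=(6, 10, 4, 20), bottom=(10, 14, 4, 20))
```

In the actual detector these same differences are evaluated via integral-image lookups rather than explicit slicing, which is what makes the overcomplete feature set affordable.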
2.2.3 CASCADE ARCHITECTURE
The strong classifiers generated by the learning process can be
evaluated quickly. To further improve performance, however, the strong classifiers are
arranged in a cascade in order of complexity, i.e. the number and complexity of the features
increases from one stage to the next. If at any stage in the cascade a classifier rejects the
sub-window under inspection, no further processing is performed and the search continues
with the next sub-window. Fig. 2.7 shows a four-stage cascaded architecture, where the
sub-window under inspection has to pass through each of the four stages to be detected as a
face.
Fig. 2.7 - Cascaded architecture [r]
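The rejection logic of the cascade can be sketched as follows (illustrative Python; the stage functions below are stand-ins, not real trained classifiers):

```python
# Sketch of cascaded rejection (Fig. 2.7): a sub-window is declared a face only
# if every stage accepts it; the first rejection stops all further work.

def cascade_detect(subwindow, stages):
    """stages: callables mapping a sub-window to True (pass on) / False (reject)."""
    for stage in stages:
        if not stage(subwindow):
            return False        # rejected: no later (more expensive) stage runs
    return True                 # passed every stage: classified as a face

# Hypothetical usage: cheap stages first, more complex stages later.
stages = [
    lambda w: sum(w) > 0,       # stand-in for a small, fast early stage
    lambda w: max(w) > 1,       # stand-in for a larger, later stage
]
```

Because the vast majority of sub-windows are rejected by the first few cheap stages, the average cost per sub-window stays far below the cost of evaluating the full classifier everywhere.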
2.3 CANNY EDGE DETECTION
In general, the purpose of edge detection is to significantly reduce the
amount of data in an image while preserving its structural properties [o]. This is done as
an initial step in many image processing algorithms, so that the result can be used for
further processing. Several edge detection algorithms exist, but the Canny edge detector is
one of the most popular.
The Canny edge detector is an edge detection operator that uses a multi-
stage algorithm to detect a wide range of edges in images [d]. It was developed by John F.
Canny in 1986. His aim was to develop an algorithm that met the following criteria [3, 4]:
• Good Detection: The detection of real edges should be maximized while that of non-
edges should be minimized.
• Good Localization: The edges marked by the Canny edge detector should be as close as
possible to the real edges in the original image.
• Minimum number of responses: An edge should be detected only once, and image noise
should not be detected as edges.
The Canny edge detection algorithm operates in five separate steps, which are described in the following sections.
2.3.1 NOISE REDUCTION BY GAUSSIAN BLUR
All images taken from a camera contain some amount of noise. To prevent
single-pixel noise from being mistaken for edges, the image is first blurred by convolving
it with a Gaussian filter; this is called a Gaussian blur. The kernel of a Gaussian filter
with a standard deviation of σ = 1.4 is commonly used and is shown in Eq. (2.6):
B = \frac{1}{159} \begin{bmatrix} 2 & 4 & 5 & 4 & 2 \\ 4 & 9 & 12 & 9 & 4 \\ 5 & 12 & 15 & 12 & 5 \\ 4 & 9 & 12 & 9 & 4 \\ 2 & 4 & 5 & 4 & 2 \end{bmatrix} \qquad (2.6)
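As an illustration, the convolution with the kernel of Eq. (2.6) can be sketched as follows. This is a minimal sketch with hypothetical names, not the report's actual implementation; border handling by clamping is an assumption, since the report does not specify it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// 5x5 Gaussian kernel (sigma ~ 1.4) from Eq. (2.6); the weights sum to 159.
static const int kKernel[5][5] = {
    {2, 4, 5, 4, 2},
    {4, 9, 12, 9, 4},
    {5, 12, 15, 12, 5},
    {4, 9, 12, 9, 4},
    {2, 4, 5, 4, 2}};

// Blur a row-major grayscale image (w x h) by convolving with the kernel.
// Borders are handled by clamping coordinates to the image edge (assumption).
std::vector<uint8_t> gaussianBlur(const std::vector<uint8_t>& img, int w, int h) {
    std::vector<uint8_t> out(img.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int sum = 0;
            for (int ky = -2; ky <= 2; ++ky) {
                for (int kx = -2; kx <= 2; ++kx) {
                    int sx = std::min(std::max(x + kx, 0), w - 1);
                    int sy = std::min(std::max(y + ky, 0), h - 1);
                    sum += kKernel[ky + 2][kx + 2] * img[sy * w + sx];
                }
            }
            out[y * w + x] = static_cast<uint8_t>(sum / 159);  // normalize by kernel sum
        }
    }
    return out;
}
```

Because the weights sum to 159, a uniform image passes through the filter unchanged, which is a convenient sanity check.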
2.3.2 FINDING INTENSITY GRADIENTS
The Canny algorithm finds edges where there is a high variation in the
grayscale intensity of the image. These regions are found by calculating the gradient at
each pixel of the image. This can be done by applying various operators such as the
Roberts cross, Scharr or Prewitt operators, but the most commonly used is the Sobel
operator. It uses two 3×3 kernels, shown in Eq. (2.7), which are convolved with the
original image to calculate approximate derivatives in the x and y directions respectively.
The gradient magnitude, G, and the direction of the edge, θ, can then be
calculated as shown in Eq. (2.8), where Gx and Gy are the gradients in the x and y
directions respectively, derived from the Sobel operator. The edge direction θ is then
rounded to one of four angles representing vertical, horizontal and the two diagonals
(0, 45, 90 and 135 degrees, respectively).
K_{GX} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad K_{GY} = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} \qquad (2.7)

G = \sqrt{G_X^2 + G_Y^2}, \qquad \theta = \arctan\left( \frac{G_X}{G_Y} \right) \qquad (2.8)

2.3.3 NON-MAXIMUM SUPPRESSION
Each calculated gradient value is then checked to see whether it is a local
maximum in the gradient direction. For example, if the rounded gradient angle is zero
degrees (i.e. the edge is in the east-west direction), the point is considered to be on the
edge only if its gradient magnitude is greater than the magnitudes of its neighbours in the
north-south direction. Otherwise the gradient value is suppressed. This check is carried
out for every gradient value to obtain the final binary image consisting of what are called
thin edges.
2.3.4 ADDITIONAL STEPS
Additional steps such as double thresholding and edge tracking by
hysteresis can be carried out to remove insignificant edges. Double thresholding does this
by setting two thresholds on the gradient values: edges with a gradient value above the
upper threshold are marked as strong edges, while those between the two thresholds are
marked as weak edges. Edge tracking by hysteresis is then done to retain only those weak
edges that are connected to strong edges.
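The gradient computation of Eqs. (2.7) and (2.8) can be sketched in code as follows. This is an illustrative sketch; the function and variable names are hypothetical and not taken from the report's implementation. `atan2` is used as a quadrant-safe arctan(Gx/Gy), so that 0 degrees corresponds to an east-west edge, as in the text above.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sobel kernels from Eq. (2.7).
static const int kSobelX[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
static const int kSobelY[3][3] = {{1, 2, 1}, {0, 0, 0}, {-1, -2, -1}};

// Gradient magnitude G and edge direction theta (Eq. (2.8)) at an interior
// pixel (x, y) of a row-major grayscale image. The direction is rounded to
// 0, 45, 90 or 135 degrees (0 = east-west edge, 90 = north-south edge).
void sobelAt(const std::vector<int>& img, int w, int x, int y,
             double* magnitude, int* angle) {
    int gx = 0, gy = 0;
    for (int ky = -1; ky <= 1; ++ky)
        for (int kx = -1; kx <= 1; ++kx) {
            int p = img[(y + ky) * w + (x + kx)];
            gx += kSobelX[ky + 1][kx + 1] * p;
            gy += kSobelY[ky + 1][kx + 1] * p;
        }
    *magnitude = std::sqrt(double(gx) * gx + double(gy) * gy);
    // Quadrant-safe form of theta = arctan(Gx / Gy) from Eq. (2.8).
    double deg = std::atan2(double(gx), double(gy)) * (180.0 / 3.141592653589793);
    if (deg < 0) deg += 180.0;                    // fold into [0, 180]
    *angle = (int(deg + 22.5) / 45) * 45 % 180;   // nearest of 0/45/90/135
}
```

On a vertical step edge (dark left half, bright right half) the magnitude is large and the rounded direction comes out as 90 degrees, i.e. a north-south edge.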
2.3.5 EXAMPLE
Fig. 2.8 (a), (b) and (c) - Examples
Fig. 2.8 (d) - Examples
Fig 2.8 illustrates the Canny edge detection algorithm step by step.
Fig 2.8 (a) is a grayscale example image. Fig 2.8 (b) shows a Gaussian blur applied to the
original image. Fig 2.8 (c) shows the image after finding the gradients, with pixels having
a sufficiently high gradient turned white and the rest turned black. Fig 2.8 (d) shows the
final Canny edge detected image after applying non-maximum suppression.
2.4 SKIN COLOUR DETECTION
Skin colour detection is a popular method used for face detection. Human
skin has a characteristic colour that is easily recognized by humans. Colour detection also
allows for fast processing and is highly robust to geometric variations of the face [11]. So
it is only logical to create an algorithm that detects faces based on colour.
There are various models for classifying pixels as skin colour. This report
uses the Hue, Saturation and Intensity (HSI) model, which is discussed in the following
sections.
2.4.1 HSI MODEL
HSI (also called HSV) is a three-dimensional colour space that is very
different from RGB or CMY. Its three key components are defined as follows [2, f].
• Hue - It is an attribute associated with the dominant wavelength in a mixture of light
waves. It represents the human sensation according to which an area appears to be
similar to one, or a combination of two, of the perceived colours red, yellow, green and
blue.
• Saturation - It is the colourfulness of an area relative to its own brightness. It gives a
measure of how much a pure colour (hue) is diluted with white light.
• Value - It is the visual sensation due to which an area appears to emit more or less light.
Fig. 2.9 - Double cone representation of HSI Space [f]
Fig. 2.9 illustrates a common representation of the HSI space. The double
cone has one central axis representing value, V, which varies from 0 to 1. It has the value
0, representing black, at the lower apex of the double cone and 1, representing white, at
the upper apex. Along this axis lie all the gray values. If this double cone is viewed from
the top, it becomes a circle. Different colours, or hues, are arranged around this circle and
are identified by their angular location on it, with red at 0˚, green at 120˚ and blue at
240˚. Saturation is the distance perpendicular to the value axis: colours near the central
axis have low saturation and a washed-out look, while colours near the surface of the cone
have high saturation. Note that when saturation is 0, the hue is undefined.
2.4.2 SKIN COLOUR
Kjeldson and Kender [7] defined a model in HSV colour space to separate
skin colour regions from the background for a hand-gesture-based user interface. In order
to reduce the effects of lighting, the proposed model relies heavily on hue and saturation,
giving only minor importance to intensity.
In the model used in this report the intensity value is ignored and the
threshold values shown in Eq. (2.9) [n] are used to distinguish between skin coloured and
non-skin coloured pixels. Here 0˚ and 50˚ are the lower and upper thresholds of hue
respectively. Similarly, 0.23 and 0.68 are the lower and upper thresholds for saturation.
2.4.3 EXAMPLES
In Fig 2.10 (a) each pixel is checked to see whether it satisfies the threshold
values given in Eq. (2.9). If a particular pixel passes the threshold it is turned white, else
it is turned black, as shown in Fig 2.10 (b).

0 \le Hue_{skin} \le 50, \qquad 0.23 \le Saturation_{skin} \le 0.68 \qquad (2.9)
Fig 2.10 (a) and (b) - Example
CHAPTER - III
DETAILED EXPLANATION OF PROJECT
The main aim of the project is to quickly reject regions of an image which
cannot possibly be faces. This increases the speed of computation and reduces the number
of false positives in the output. It is done using skin colour detection and Canny edge
detection as pre-processing techniques. An iOS 5 App was created with the above-
mentioned face detection methodology. A detailed description of the project is given below.
3.1 STRUCTURE OF THE PROJECT
The project has two parts. One is the operation of the proposed face detector
on the iMac. The other is an App running this algorithm on the iPod Touch (4th
generation). Both of these are discussed in detail below.
3.1.1 iOS PROJECT
This part of the project is basically an App that takes live video information
from the camera of the mobile device, stores it and runs it through the face detection
algorithm. The algorithm returns the positions and dimensions of the faces. The App
then creates a transparent view on top of the current view, on which the boxes
surrounding the faces are drawn.
The image coming from the camera is in the standard BGRA format. The
first three letters (B, G and R) refer to the primary colour components, i.e. blue, green and
red. The A represents the alpha channel, which is used in computer graphics to blend
images with their background in order to create the appearance of partial or full
transparency. If the A value is at its minimum, that particular pixel is completely
transparent; at its maximum, the pixel is opaque.
The image has a resolution of 640x480 pixels. It is stored in memory as a
one-dimensional uint8_t array. The uint8_t type is similar to an unsigned char: it
occupies one byte of memory and holds values in the range 0-255. This means that each
pixel has four uint8_t values assigned to it, one for each colour component.
The App basically has three views. The topmost is a transparent view that
displays boxes around the detected faces. The other two views show the live video feed
and the current frame being processed, respectively. The user can switch between these
two using swipe (left/right) gestures. The App also has a toolbar with the following
options:
• Switch cameras - Switch between the front and back cameras of the device.
• Toggle face detection - Turn the face detection ON and OFF.
• Take a picture - Store the current frame in the saved photos library.
• Help option - Bring up a scroll view with options to choose between the different
types of pre-processing (see Section 4.2.2).
Fig 3.1- App running on iPod touch 4th generation
3.1.2 MAC PROJECT
The Mac project runs the same face detection algorithm as the iOS project
and is used for testing. This is useful as images can be displayed and verified from any
point in the program, which is not always possible on a mobile device.
The Mac project uses the CImg C++ image library for storing and
displaying the images [b]. The test images taken from the mobile device are stored and
run through the Mac algorithm. The CImg image library accepts images in the Windows
Bitmap format. The program then produces an output file containing the name of each
image with the coordinates and dimensions of the detected faces. It also saves images at
various stages of the processing.
3.2 TRAINING OF CLASSIFIERS
Initially 10,920 training images are taken, with 8,004 faces and 2,916 non-
faces. All the images with faces have fixed eye positions. Then the training algorithm
described in the Viola-Jones framework is used (see Section 2.2.2). Each stage is trained
until either the stage goal of a 99.5% true positive rate and a 60% true negative rate (for
definitions see Section 4.1) is fulfilled, or the stage limit of 100 classifiers is reached. The
stage goal means that 99.5% of the images with faces have to be correctly classified as
faces and 60% of the images without any faces have to be correctly rejected as non-faces.
Each cascade stage can have a maximum of 100 rectangular features; if these values
(TPR = 99.5% and TNR = 60%) are not reached by the 100th rectangular feature, the
stage is stopped there irrespective of the values of TPR and TNR.
This is followed by what is called bootstrapping. Here, after each stage,
the non-faces that were correctly rejected by that stage are removed from the negative
training set. Then a collection of 19,774 higher resolution scene images (containing no
faces) is used to randomly sample new non-face training images. Only those random
samples that are (falsely) not rejected by all previous stages are added to the negative
training set. In this way the following stage is focused on the weaknesses of the previous
stages. 10,000 of these samples are collected and added to the negative training set, so the
negative set normally grows during training. After the 7th stage the training set contained
the 8,004 faces and 15,129 non-faces. The number of rectangular features for each stage is
given in Table 3.1.
Table 3.1 - Number of rectangular features per stage.

Stage Number | Number of rectangular features
1 | 5
2 | 3
3 | 10
4 | 18
5 | 32
6 | 50
7 | 68
Total = 186
3.3 IMAGE PROCESSING
The 640x480 image is passed to the face detection algorithm, which detects
the position and dimensions of each face in the image. The algorithm works in the
following steps.
3.3.1 SKIN COLOUR DETECTION
Research on object detection based on skin colour classification has gained
popularity in recent years [9, 11]. However, object detection based solely on this method
has many shortcomings, as it is difficult to get perfect classification under different
lighting conditions. Moreover, when this method is employed for face detection, the
detector fails when there are other skin coloured regions like legs, arms, etc., or when the
background contains skin-colour-like pixels. Hence it is better to use it as a pre-processing
step.
This step is carried out using the algorithm described in the theory section.
The algorithm uses images in the HS colour model, but the image is represented in the
RGB model, so a conversion from RGB to HS is required. The steps followed are
discussed below.
• RGB to HS conversion -
The set of equations in Eq. (3.1) [n] shows the conversion from the RGB
model to the HS model. Here R, G and B represent the red, green and blue intensities of
each pixel respectively. As can be seen, hue is an angle denoted in degrees and saturation
is a value varying between 0 and 1.
M = \max(R,G,B), \qquad m = \min(R,G,B)

C_b = \frac{M-B}{M-m}, \qquad C_g = \frac{M-G}{M-m}, \qquad C_r = \frac{M-R}{M-m}

H' = \begin{cases} \text{undefined}, & \text{if } Saturation = 0 \\ C_b - C_g, & \text{if } M = R \\ 2 + C_r - C_b, & \text{if } M = G \\ 4 + C_g - C_r, & \text{if } M = B \end{cases}

Saturation = \frac{M-m}{M}, \qquad Hue = 60 \times H' \qquad (3.1)
• Thresholding - Each pixel in the HS image is then checked to see whether it is skin
coloured, using the threshold values given in Eq. (2.9). If a pixel passes the threshold it
is set to 255 (white), and to 0 (black) otherwise.
• Nearest neighbour scale down - This image is then scaled down to half its dimensions
using a nearest neighbour scale down algorithm. This is done to reduce the computation
time of the integral image. In this algorithm the value of the pixel at the new point is
approximated by the value of the pixel nearest to it in the original image. It does not
consider all the surrounding neighbours, resulting in a piecewise-constant interpolant.
The final image is stored separately in memory to be used later in pre-processing.
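The per-pixel conversion of Eq. (3.1) followed by the thresholding of Eq. (2.9) can be sketched as follows. This is a minimal sketch; the function name is hypothetical, and treating black and gray pixels (where hue is undefined) as non-skin is an assumption.

```cpp
#include <algorithm>
#include <cassert>

// Convert an RGB pixel (each component 0-255) to hue (degrees) and
// saturation (0-1) following Eq. (3.1), then test it against the skin
// thresholds of Eq. (2.9): 0 <= hue <= 50 and 0.23 <= saturation <= 0.68.
bool isSkinPixel(int r, int g, int b) {
    double M = std::max({r, g, b});
    double m = std::min({r, g, b});
    if (M == 0.0) return false;            // black pixel: saturation is 0
    double saturation = (M - m) / M;
    if (saturation == 0.0) return false;   // gray pixel: hue is undefined
    double cb = (M - b) / (M - m);
    double cg = (M - g) / (M - m);
    double cr = (M - r) / (M - m);
    double hp;                             // H' from Eq. (3.1)
    if (M == r)      hp = cb - cg;
    else if (M == g) hp = 2 + cr - cb;
    else             hp = 4 + cg - cr;
    double hue = 60.0 * hp;
    if (hue < 0) hue += 360.0;             // fold negative angles into [0, 360)
    return hue >= 0.0 && hue <= 50.0 &&
           saturation >= 0.23 && saturation <= 0.68;
}
```

For example, a typical skin tone such as (R, G, B) = (220, 170, 140) gives a hue of 22.5˚ and a saturation of about 0.36, which passes both thresholds, while saturated blue or neutral gray pixels are rejected.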
Fig 3.2 (a), (b) and (c) - Example
Fig 3.2(a) shows an image taken from the test set (see Section 4.2.1). The
image is first run through the skin colour detection algorithm, and all the pixels that are
classified as skin coloured are set to white while the rest are set to black. As can be seen,
some of the pixels on the reddish building are also misclassified as skin coloured. The
image is then reduced to half its dimensions. Note that the images shown here are
displayed at the same size.
3.3.2 RGB TO GREY CONVERSION
The face detection algorithm requires a greyscale image. Hence the original
image is converted as shown in Eq. (3.2). Each pixel in the greyscale image is the sum of
approximately 30% of the red value, 59% of the green value and 11% of the blue value.
This image too is stored separately in memory, and the original image can now be
released from memory. Fig 3.3 shows Fig 3.2(a) converted to greyscale using this
equation.
Fig 3.3 - Example
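The integer conversion of Eq. (3.2) can be sketched as a one-line function. This is an illustrative sketch with a hypothetical name; the weights 77, 151 and 28 sum to 256, so dividing by 256 (a right shift by 8) keeps the result in the 0-255 range.

```cpp
#include <cassert>
#include <cstdint>

// Integer grey conversion from Eq. (3.2):
// GreyVal = (R*77 + G*151 + B*28) / 256.
uint8_t toGrey(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint8_t>((r * 77 + g * 151 + b * 28) >> 8);
}
```

White maps to 255 and pure red to 76 (about 30% of 255), matching the approximate weights given above.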
3.3.3 CANNY PRUNING
Canny pruning is used as a pre-processing step to remove the regions in the
image that have too many or too few edges to be a face.
The steps followed by the Canny pruning algorithm are discussed below.
• Bilinear scale down - The greyscale image is initially scaled down to half its dimensions,
i.e. 320x240, using a bilinear scale down function. This is done to reduce the computation
time of the Canny edge detection algorithm and also that of the integral image
calculation. This algorithm finds the pixel intensity at a position in the scaled
image by taking the weighted average of the four nearest neighbours in the original
GreyVal = \frac{(R \times 77) + (G \times 151) + (B \times 28)}{256} \qquad (3.2)
image. The weights assigned to each neighbour are inversely proportional to the distance
between the new pixel and that neighbour.
• Histogram normalization - The scaled grey image is then contrast stretched so that it
uses the entire grayscale range, to improve the quality of the detected edges. This is
done using histogram equalization: the normalized histogram of the image is first
calculated and then stretched to cover the entire grayscale range, as described in
Eq. (3.3) [2], [m].
Here pn denotes the normalized histogram of the image f, and g denotes the
histogram equalized image. The summation term in Eq. (3.3) calculates the cumulative
distribution of the normalized histogram, where L is 256 (the range of the grayscale
values). The function floor() rounds down to the nearest integer, in order to avoid going
out of range.
• Canny edge detection - This image is now passed through the Canny edge detection
algorithm described in the theory section. The upper and lower thresholds used for the
gradient values are 60 and 30 respectively [q]. This means that edge tracking starts if
the gradient value is above the upper threshold and continues as long as the gradient
value stays above the lower threshold.
Further steps such as double thresholding and edge tracking by hysteresis are
skipped so as not to lose edge information. The final output of the algorithm is an
unsigned char image with the detected edges having the value 255 (white) and the rest of
the pixels the value 0 (black). This image is stored separately in memory.
Fig 3.4 (a), (b) and (c) - Examples
g_{i,j} = \operatorname{floor}\left( (L-1) \sum_{n=0}^{f_{i,j}} p_n \right) \qquad (3.3)
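The histogram equalization of Eq. (3.3) can be sketched in code as follows. This is an illustrative sketch with hypothetical names: the normalized histogram is accumulated into a cumulative distribution, and each pixel is remapped through it.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Histogram equalization following Eq. (3.3): g = floor((L-1) * CDF(f)),
// where the CDF is built from the normalized histogram p_n of the image.
std::vector<uint8_t> equalize(const std::vector<uint8_t>& img) {
    const int L = 256;
    std::vector<double> hist(L, 0.0);
    for (uint8_t v : img) hist[v] += 1.0 / img.size();   // normalized histogram p_n
    std::vector<double> cdf(L, 0.0);
    double run = 0.0;
    for (int n = 0; n < L; ++n) { run += hist[n]; cdf[n] = run; }
    std::vector<uint8_t> out(img.size());
    for (size_t i = 0; i < img.size(); ++i)
        out[i] = static_cast<uint8_t>(std::floor((L - 1) * cdf[img[i]]));
    return out;
}
```

For instance, a four-pixel image {0, 0, 255, 255} has a CDF of 0.5 at intensity 0 and 1.0 at 255, so it maps to {127, 127, 255, 255}.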
Fig 3.4(a) shows the image in Fig 3.3 shrunk using bilinear scaling. This is
then contrast stretched, as shown in Fig 3.4 (b). Finally, Fig 3.4 (c) shows the Canny edge
detected image.
3.3.4 SLIDING WINDOW
The face detection algorithm tries to detect a face in a rectangular sub-
window of the image. As the classifiers used for this project are trained with images of
dimension 110x128, the sub-window used also has size 110x128. In order to detect
larger faces the image is scaled down and a sub-window of the same size is used. The
scaling down is done by bilinear scaling for the grayscale image and by nearest
neighbour for the skin colour detected and Canny edge detected images.
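The nearest neighbour halving used for the two binary images can be sketched as follows (a minimal sketch with hypothetical names; picking the top-left pixel of each 2x2 block as the "nearest" sample is an assumption, since the report does not specify the sampling position).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Scale a row-major grayscale image down to half its width and height using
// nearest-neighbour sampling: each output pixel copies one source pixel,
// here the top-left pixel of the corresponding 2x2 block.
std::vector<uint8_t> halveNearest(const std::vector<uint8_t>& img, int w, int h) {
    int ow = w / 2, oh = h / 2;
    std::vector<uint8_t> out(ow * oh);
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x)
            out[y * ow + x] = img[(2 * y) * w + (2 * x)];
    return out;
}
```

Because no averaging is involved, a binary (0/255) image stays strictly binary after scaling, which is why nearest neighbour is preferred here over bilinear scaling.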
The scaling factor used is 1.25 and the scaling is repeated seven times. This
is illustrated in Fig 3.5, which shows the scaling of the grayscale image. The window
(shown in red at the top-left position) has a constant size in each image, but covers a large
area in the small image, thereby detecting big faces. Similarly, in the large image the
window covers a smaller area, detecting small faces.
Fig 3.5 - Example of scaling where window size remains constant
For each image size the sub-window starts from the top-left corner and
moves through the entire image with a step size of two pixels. The step size for the Canny
image and the skin colour detected image is hence one pixel, as they have half the
dimensions of the original image.
Hence, in effect, the smallest sub-window used to detect a face has
dimension 110x128 and the largest corresponding sub-window has dimension 413x480.
3.3.5 PRE-PROCESSING
Pre-processing is done before the actual Viola-Jones face detector so that a
significant number of windows which can easily be classified as non-faces are rejected
without much computation. This is done by counting the number of white pixels in the
Canny edge detected image and the skin colour detected image for each sub-window. This
white pixel count is then passed through a threshold to check whether it is within an
acceptable range for a face.
The threshold values for the skin colour detected image are chosen to be 45%
to 90%. This means that the number of pixels detected as skin coloured (white) has to be
within 45% to 90% of the total number of pixels in the current sub-window. Sub-windows
having less than 45% skin coloured pixels are not considered as faces, and sub-windows
having more than 90% skin coloured pixels are also regarded as non-faces, because faces
are not made up entirely of skin coloured pixels (regions such as the eyes and hair are not
skin coloured). Similarly, the threshold values for the Canny edge detected image are
16.7% to 22.7%: sub-windows having less than 16.7% white pixels are considered to have
too few edges to be a face, and those having more than 22.7% white pixels cannot be faces
as they have too many edges.
The white pixel count is done for each scale for the skin colour detected
and Canny edge detected image by calculating their integral images. The value of the
integral image at any point is the sum of all the pixels above and to the left of the point, as
discussed in the theory. Using this value the total sum of pixels within each sub-window
can be calculated using the integral image values at each of its vertices as shown in
Fig(2.5). This sum value is then divided by 255(value of each white pixel) to get the actual
number of white pixels in the sub-window.
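The white pixel count described above can be sketched with an integral image and a four-corner lookup. This is an illustrative sketch with hypothetical names, not the report's actual implementation; the extra leading row and column of zeros is a common convention that keeps the lookup branch-free.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Build an integral image: entry (x, y) holds the sum of all pixels above
// and to the left of (x, y), inclusive. The table has one extra row and
// column of zeros at the top and left.
std::vector<long> integralImage(const std::vector<uint8_t>& img, int w, int h) {
    std::vector<long> ii((w + 1) * (h + 1), 0);
    for (int y = 1; y <= h; ++y)
        for (int x = 1; x <= w; ++x)
            ii[y * (w + 1) + x] = img[(y - 1) * w + (x - 1)]
                                + ii[(y - 1) * (w + 1) + x]
                                + ii[y * (w + 1) + (x - 1)]
                                - ii[(y - 1) * (w + 1) + (x - 1)];
    return ii;
}

// Sum of the pixels inside the window [x, x+ww) x [y, y+wh) from just four
// lookups; dividing by 255 gives the white pixel count of a binary image.
long windowSum(const std::vector<long>& ii, int w, int x, int y, int ww, int wh) {
    int s = w + 1;
    return ii[(y + wh) * s + (x + ww)] - ii[y * s + (x + ww)]
         - ii[(y + wh) * s + x] + ii[y * s + x];
}
```

Each sub-window sum costs the same four lookups regardless of its size, which is what makes counting white pixels at every scale and position affordable.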
3.3.6 FACE DETECTION
The Viola-Jones face detection framework, as discussed in the theory, is used
here with the trained cascaded classifiers (see Section 3.2). All the sub-windows that pass
the pre-processing step are checked for faces. Initially the integral images of the grayscale
image and of its square are calculated.
The square is calculated for variance normalization, as suggested in the
original Viola-Jones face detection framework. This is done to reduce the effect of
different lighting conditions and is described in Eq. (3.4), where v is the variance of the
image sub-window, σ is the standard deviation, n is the number of pixels in the sub-
window, M is the mean and x is a pixel value within the sub-window.
The normalized value of a pixel x is calculated as shown in Eq. (3.5).
Fig 3.6 (a) shows 1,200 of the 1,520 images used for training and Fig 3.6 (b) shows the
same images variance normalized. As can be seen, this compensates for the different
lighting conditions found in the training images.
Fig 3.6 (a) and (b) - Variance normalization
v = \sigma^2 = \frac{1}{n}\sum x^2 - M^2 \qquad (3.4)

x_{normalized} = \frac{x - M}{v} \qquad (3.5)
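The variance of Eq. (3.4) can be sketched as follows. This is an illustrative sketch with a hypothetical name; in the actual detector the two sums would come from the integral images of the image and its square, whereas here they are computed directly for clarity.

```cpp
#include <cassert>
#include <vector>

// Variance of a sub-window following Eq. (3.4):
// v = sigma^2 = (1/n) * sum(x^2) - M^2, where M is the mean pixel value.
double windowVariance(const std::vector<double>& px) {
    double sum = 0.0, sumSq = 0.0;
    for (double x : px) {
        sum += x;
        sumSq += x * x;
    }
    double n = static_cast<double>(px.size());
    double mean = sum / n;
    return sumSq / n - mean * mean;
}
```

Only the two running sums are needed, which is exactly why the detector keeps an integral image of the squared pixels alongside the ordinary one.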
This value can be easily calculated with the help of the integral image, as
shown in Fig.(2.5). Now, the normalized values of the rectangular features within the sub-
window are compared to the threshold obtained from the classifier training. If they pass
this threshold they go on to the next stage of the cascaded architecture and the same
procedure is followed. If any stage rejects a given sub-window, it is classified as a non-face
and the process is carried out for the next sub-window.
All the sub-windows that pass through all the stages are classified as
containing a face. The position and size of these sub-windows are stored for drawing
rectangles around the faces. In the Mac project these values (the size and position of each
sub-window in which a face is detected) are stored in a separate document to be used
later for testing.
CHAPTER - IV
EXPERIMENTAL ANALYSIS
4.1 THEORY
The face detection classification model is a mapping of instances into two
classes: faces and non-faces. This is a two-class prediction problem, also called binary
classification, in which the outcomes are labeled either as positive (p), i.e. faces, or
negative (n), i.e. non-faces. There are four possible outcomes from any binary classifier:
• True positive (TP) - If the outcome of a prediction is that a given image is a face and it
actually is a face, the prediction is called a true positive.
• False positive (FP) - However, if the image is not a face but is predicted to be a face,
the prediction is said to be a false positive.
• True negative (TN) - A true negative occurs when the prediction is that the image is a
non-face and it actually is not a face.
• False negative (FN) - Similarly, a prediction is called a false negative when the
prediction is that the image is a non-face while it actually is a face.
This can be formulated as a confusion matrix as shown in Fig 4.1. A confusion matrix is a
specific table layout that allows visualization of the performance of an algorithm. Here the
values on the horizontal axis denote the actual values i.e. p for faces and n for non-faces.
Similarly the values on the vertical axis denote the outcome of the prediction i.e. p´ for
faces and n´ for non-faces.
Fig 4.1 - Confusion Matrix
There are certain values that can be obtained from the above confusion
matrix. These are useful for the analysis of the operation of the face detector and are as
follows:
• True positive rate (TPR) - It measures the performance of the face detector in
classifying positive instances correctly (true positives) among all positive samples
(faces) available during the test. It is also called sensitivity.
• False positive rate (FPR) - The false positive rate, on the other hand, measures how
many false positive results occur among all negative samples (non-faces) available
during the test.
These two values are defined in Eq. (4.1), where TP denotes true positives, P
the total number of actual positives, FN false negatives, FP false positives, TN true
negatives and N the total number of negatives. A further value, called specificity, is also
defined there.
4.2 ANALYSIS
The following sections describe the different cases that are to be analyzed
and the methodologies employed for the analysis
4.2.1 TEST SET OF IMAGES
To analyze the performance of the proposed face detector a test set of 29
images were taken under different lighting conditions and with different backgrounds. This
was done with an iPod Touch 4th generation using both the front and back cameras.
Ground truth values of each the test images were then taken. This refers to the eye
positions of all the faces present each test image and also the names of the images which
don't contain any face. These ground truth values are very important for analysis of
performance to check whether the detected windows actually do contain faces.
28
TPR = TP / P = TP / (TP + FN)
FPR = FP / N = FP / (FP + TN)     (4.1)
Specificity = 1 − FPR
The analysis of the test images is done using the Mac project. Some of the test images
taken are shown below in Fig 4.2 (a) to (d).
Fig 4.2 (a) to (d) - Test images
4.2.2 DIFFERENT CASES TO BE ANALYZED
The analysis of the face detector is done under the five different cases shown below.
• Without pre-processing - This is just the normal Viola Jones face detection framework
running the first seven stages of the cascaded architecture.
• With Skin colour detection as pre-processor - Here the skin colour detection algorithm is
used as a pre-processing step to reject a number of sub-windows before the actual Viola
Jones face detection framework is carried out.
• With Canny edge detection as pre-processor - The Canny edge detection algorithm is
used here to reject the sub-windows of the image that have too many or too few edges to
contain a face.
29
• With both Skin colour and Canny edge detection with an AND gate - Here both skin
colour detection and Canny edge detection algorithms are used. The sub-window is
passed to the Viola Jones face detection framework only if it passes through both of these
pre-processing steps.
• With both Skin colour and Canny edge detection with an OR gate - This is similar to the
previous case, but the sub-window is rejected by pre-processing only if both the skin
colour detection and the Canny edge detection algorithms reject it.
4.2.3 CALCULATION OF FPR AND TPR USING MATLAB
As mentioned earlier, all the test images are run through the Mac project,
under each of the above discussed cases. This returns a text document containing the
position and size of the sub-windows in which a face was detected. The calculation of FPR
and TPR of all the images under each case is done using a Matlab script. This algorithm
requires the output text document from the Mac project and the ground truth values as
inputs.
The Matlab script works by calculating the ideal image sub-window around
each face in all the images using the ground truth data. This is possible because all the
images used in training the classifiers have a fixed eye position. Hence, with the eye
positions from the ground truth data, the ideal size and position of the sub-window can be
calculated. The algorithm allows a factor of 1.5 variation in size and a 33% variation in
position from the calculated ideal sub-window. Hence all the acceptable sub-window
positions and sizes are known for each face. This data is compared with the output of the
Mac project to calculate the TPR and FPR values.
It is assumed that each face has to be detected only once, i.e. by at least one
sub-window, to have a TPR of 1.0. Also, for the calculation of FPR, the total number of
windows is used in the denominator rather than the total number of possible negatives.
This is because the total number of possible positives is negligibly small compared to
the total number of windows and does not have a significant effect on the FPR. This is
shown in Eq. (4.2), where Tot is the total number of windows, i.e. the sum of the
number of positives (sub-windows containing faces) and the number of negatives (sub-windows
containing no faces). As the number of positives is negligible, the total number
of negatives can be approximated by the total number of windows.
30
4.2.4 ANALYSIS OF SPEED
Each of the test images is loaded onto the iPod touch 4th generation and the
proposed face detection algorithm is run under each of the above described cases. The
time taken for the face detection algorithm to calculate the output values, i.e. the position
and size of each sub-window detected to contain a face, is measured. This is done using
functions from the C++ header file <sys/time.h>. Each of these time values is noted and compared.
4.3 EXPECTED RESULTS
Each case of the proposed face detection algorithm is expected to show the
following results.
• Without pre-processing - This is expected to have a number of false positives and hence
the FPR should have a high value. This is because only the first seven stages of the Viola
Jones face detection framework are used. However, it is expected to detect all the faces in
the image and thus have a TPR value close to 1.0. This algorithm is also expected to take the
longest time amongst the five cases when run on the iPod touch 4th generation.
• With Skin colour detection as pre-processor - Here the number of false positives is
expected to be reduced. Hence the FPR will be lower than in the first case. But this algorithm
won't work on dark, shadowy images where it is difficult to detect skin colour in the
faces. Hence some of the faces will be missed and the TPR value will be slightly
lower. This case is also expected to take much less time than the first method, as many of the
sub-windows will be rejected by pre-processing.
• With Canny edge detection as pre-processor - Similar to the previous case, the FPR will
be lower than in the first case but the TPR will be slightly less than 1.0. The processing
time will also be lower than in the first case due to pre-processing.
• With both Skin colour and Canny edge detection with an AND gate - Here the FPR will
be even lower than in the previous two cases as only very few sub-windows will pass the
pre-processing stage. The TPR value will also be significantly lower, as will the time taken for
processing on the iPod Touch 4th generation.
31
FPR = FP / N ≈ FP / Tot,  where Tot = P + N ≈ N     (4.2)
• With both Skin colour and Canny edge detection with an OR gate - The number of false
positives will be lower than the case without pre-processing but higher than the ones
using skin colour detection and Canny edge detection separately. The TPR will also be
closer to 1.0 than in those two cases. But the speed-up will be reduced, as the
number of sub-windows allowed to pass through the pre-processing stage will have
increased.
32
CHAPTER - V
SPECIFICATIONS
5.1 SOFTWARE USED FOR PROGRAMMING AND TESTING
5.1.1 XCODE
Xcode is a suite of tools developed by Apple Inc. It is used for developing
software for Mac OS X and iOS. It was first released in 2003. The Xcode suite also
includes most of Apple's developer documentation and the built-in Interface Builder, an
application used to construct graphical user interfaces. It supports C, C++, Objective-C,
Objective-C++, Java, AppleScript, Python and Ruby source code with a variety of
programming models. Xcode version 4.3 was used for the project. This was the main
application used for creating and running both the iOS 5 App and the Mac project used for
testing.
5.1.2 MATLAB
MATLAB (Matrix laboratory) is a numerical computing environment used
for various mathematical and engineering analysis. Developed by MathWorks, MATLAB
allows matrix manipulations, plotting of functions and data, implementation of algorithms,
etc. MATLAB version R2010a was used. This was used to obtain test and statistical
results for the true positive rate and false positive rate, in order to compare all five cases.
5.1.3 LIBRARIES
The languages used for the project were C, C++, Objective-C/C++. The
following additional libraries were used other than those found in the Apple's developer
documentation of Xcode.
• CImg Library - The CImg Library [b] is a small, open source, C++ toolkit for image
processing. It mainly consists of a single header file, CImg.h, providing a set of C++
classes and functions that can be used to load/save, manage/process and display generic
images. It was used for the Mac project for storing and displaying images at different
parts of the program.
33
• pugixml - It is a fast, light-weight C++ XML processing library [k]. It is used for both
the Mac and iOS projects for parsing the XML files which store the information of the
classifiers.
5.2 HARDWARE
5.2.1 iMAC/ MACBOOK PRO
Macbook Pro - 2.3 GHz Intel Core i5,
4 GB RAM (1333 MHz DDR3),
Mac OS X Lion 10.7.3
iMAC - 3.06 GHz Intel Core i3,
4 GB RAM (1333 MHz DDR3),
Mac OS X Lion 10.7.2
5.2.2 iPOD TOUCH 4th GENERATION/ iPAD 2
iPod Touch 4th generation - 1GHz ARM Cortex - A8 processor,
256 MB DRAM,
iOS 5
iPad 2 - 1GHz Dual core Apple A5 processor
512 MB DDR2 RAM
iOS 5
5.3 SPECIFICATION OF COMPUTERS USED IN TRAINING
The RRZN (Regionales Rechenzentrum für Niedersachsen) computer
cluster system offers massively parallel computing systems and computers with large main
memory. These computers are available for all employees and students of the Leibniz
University of Hannover. The "Tane"-Cluster of the RRZN is used for training. The
specifications of the RRZN cluster system are as shown
Number of nodes - 96
Processors per node - 12
Processor - Intel Xeon CPU X5670, 2.93GHz
34
Main memory per node - 48 GB
Local disk space - 80 GB per node
File systems - Lustre
Operating system - Scientific Linux
The training uses 10 nodes, i.e. 120 processors in total. A training round
selecting one classifier takes approximately 20 minutes for a training set of approximately
24000 images. In later stages the bootstrapping process can take up to 1 hour. So training
the classifiers takes approximately 100 hours in total.
5.4 GIT REPOSITORY
Git is a distributed version control system (DVCS) written in C [3]. It
creates a history for a collection of files and includes the functionality to revert them to
another state. The collection of files is usually called source code. In a DVCS all users
have a complete copy of the source code, including its complete history, called a local
repository. Each user can perform version control operations against this local copy, for
example revert to a previous version of the source code, merge the current version with
another created by a different user, etc.
If a user makes such changes to the source code, he/she can mark them as
relevant for version control by adding them to the index (cache) and then adding them to the
local repository (commit). Git maintains all versions. Therefore a user can revert to any
point in the source code history using Git.
Git synchronizes these local repositories with other (remote) repositories.
Owners of local repositories can synchronize changes via push (transferring changes to the
remote repository) or pull (getting changes from the remote repository). This is illustrated
in Fig. 5.1
Fig. 5.1 - Storage levels in a Git repository [e]
35
A Git repository was used to maintain a complete history of the program.
This was useful in order to revert to previous versions, include changes made by others
working on the same code etc.
5.5 SIMD
SIMD stands for single instruction multiple data. It is a class of parallel
computing. It is very useful for fast processing of similar instructions of a bulk of data.
This is as shown in Fig. 5.2
Fig 5.2 - SIMD[i]
For example, changing the brightness of an image. Here, the R, G and B
values of each pixel are read from memory, a value is added/subtracted to/from them, and
the resulting values are written back out to memory. A SIMD processor can improve this
process. Instead of a series of instructions saying "get this pixel, now get the next pixel", it
will have a single instruction that effectively says "get lots of pixels" . This takes much less
time than "getting" each pixel individually, as with traditional CPU design. Also all
operations performed on this block of pixels are performed in parallel.
ARM's NEON technology, used in many mobile devices (including iPods,
iPhones, etc) is a 64/128-bit hybrid SIMD architecture designed to accelerate the
performance of multimedia and signal processing applications, including video encoding
and decoding, audio encoding and decoding, 3D graphics, speech and image processing
[a]. This technique is used in the iOS project to accelerate a few simple functions such as
computing the square of the image and BGRA to greyscale conversion. This technique can be
extended to other parts of the algorithm, which could drastically improve performance.
36
CHAPTER - VI
RESULTS
6.1 PERFORMANCE COMPARISON
The true positive rates and false positive rates of all five cases are shown in
Table 6.1. The following conclusions can be drawn
• Without pre-processing - All the faces are detected in this case (TPR is 1.0), but there is
a high number of false positives.
• With Skin colour detection as pre-processor - The number of false positives is reduced by
almost a factor of 4.5. But this case is unfavourable as too many faces go undetected, i.e.
the TPR is too low.
• With Canny edge detection as pre-processor - Similar to the previous case, the FPR is
reduced by a large factor (about 3.5), but the TPR is still too low.
• With both Skin colour and Canny edge detection with an AND gate - The number of false
positives is the lowest among all the cases, but this case also fails to detect the most
faces. Hence this case too is unfavourable.
• With both Skin colour and Canny edge detection with an OR gate - This case has a
considerably lower FPR, reduced by almost a factor of 3. The TPR is also at an
acceptable level of around 85%.
Pre-processing                 TPR    FPR (x 10^-4)
Without Pre-processing         1.0    1.42
With Skin Colour detection     0.65   0.32
With Canny Edge detection      0.57   0.41
With Both using AND gate       0.34   0.16
With Both using OR gate        0.84   0.57
Table 6.1 - Performance comparison
37
From this analysis it was found that the case using both skin colour and
Canny edge detection with an OR gate yields the best results.
6.2 SPEED COMPARISON
The speed-up factors of all five cases relative to the case without pre-processing
are shown in Chart 6.1. The following conclusions can be drawn
• Without pre-processing - Value of 1.0
• With Skin colour detection as pre-processor - As expected, the speed-up factor is high,
as a large number of sub-windows get rejected by pre-processing.
• With Canny edge detection as pre-processor - The speed-up factor is also high. It is
not as high as in the previous case, as the Canny edge detection algorithm is more time
consuming than the skin colour detection.
• With both Skin colour and Canny edge detection with an AND gate - This case has the
highest speed-up factor as it rejects the maximum number of sub-windows. This can be
expected from the low value of TPR seen in the previous section.
• With both Skin colour and Canny edge detection with an OR gate - This case has a
considerably high speed-up factor of almost 2.3.
Chart 6.1 - Speed-up comparison
Without Pre-processing         1.0
With Skin colour detection     4.0
With Canny edge detection      2.86
With both using AND gate       4.07
With both using OR gate        2.28
38
6.3 EXAMPLE IMAGES
Fig. 6.1 (a) to (f) - (a) Original Image, (b) Without Pre-processing, (c) With Skin
colour detection, (d) With Canny edge detection, (e) With both using AND gate, (f)
With both using OR gate
39
Fig. 6.2 (a) to (f) - (a) Original Image, (b) Without Pre-processing, (c) With Skin
colour detection, (d) With Canny edge detection, (e) With both using AND gate, (f)
With both using OR gate
40
Fig. 6.3 (a) to (f) - (a) Original Image, (b) Without Pre-processing, (c) With Skin
colour detection, (d) With Canny edge detection, (e) With both using AND gate, (f)
With both using OR gate
41
Fig. 6.4 (a) to (f) - (a) Original Image, (b) Without Pre-processing, (c) With Skin
colour detection, (d) With Canny edge detection, (e) With both using AND gate, (f)
With both using OR gate
42
CHAPTER - VII
APPENDIX
7.1 SKIN COLOUR DETECTION

#include <iostream>
#include <stdio.h>
#include "SkinColourDetect.h"

void doSkinColourDetection(unsigned char *imageSkinColourDst, unsigned char *imageSkinColourSrc, int W, int H)
{
    for (int i = 0, k = 0; i < W*H*4; i += 4, k++)
    {
        unsigned char Value, tmp;
        double Saturation, Hue, Cr, Cg, Cb;
        unsigned char B = imageSkinColourSrc[i],     // Read blue value at coordinates (x,y)
                      G = imageSkinColourSrc[i+1],   // Read green value at coordinates (x,y)
                      R = imageSkinColourSrc[i+2];   // Read red value at coordinates (x,y)

        /////////////////// VALUE
        // find maximum
        if ((R >= G) && (R >= B))
            Value = R;
        else if ((G >= R) && (G >= B))
            Value = G;
        else
            Value = B;

        /////////////////// SATURATION
        // find minimum
        if ((R <= G) && (R <= B))
            tmp = R;
        else if ((G <= R) && (G <= B))
            tmp = G;
        else
            tmp = B;

        if (Value == 0)
            Saturation = 0;
        else
            Saturation = ((double)Value - (double)tmp) / (double)Value;

        /////////////////// HUE
        if (Saturation == 0)
            Hue = -1;
        else
        {
            Cr = ((double)Value - (double)R) / ((double)Value - (double)tmp);
            Cg = ((double)Value - (double)G) / ((double)Value - (double)tmp);
            Cb = ((double)Value - (double)B) / ((double)Value - (double)tmp);
            if (R == Value) Hue = Cb - Cg;
            if (G == Value) Hue = 2 + Cr - Cb;
            if (B == Value) Hue = 4 + Cg - Cr;
            Hue *= 60;
            if (Hue < 0) Hue += 360;
        }

        // DETECT SKIN COLOUR
        if ((Hue >= 0.0) && (Hue <= 50.0) && (Saturation >= 0.23) && (Saturation <= 0.68))
            imageSkinColourDst[k] = 1;
        else
            imageSkinColourDst[k] = 0;
    }
}
7.2 CANNY PRUNING

#include <iostream>
#include "CannyPruning.h"
#include "math.h"

int edgeDir[240][320];      // Stores the edge direction of each pixel
float gradient[240][320];   // Stores the gradient strength of each pixel

void doCanny(unsigned char *imgData, int W, int H)
{
    unsigned int row, col;      // Pixel's row and col positions
    unsigned long i;            // Index into the row-column vector
    int upperThreshold = 60;    // Gradient strength necessary to start an edge
    int lowerThreshold = 30;    // Minimum gradient strength to continue an edge
    int rowOffset;              // Row offset from the current pixel
    int colOffset;              // Col offset from the current pixel
    int rowTotal = 0;           // Row position of offset pixel
    int colTotal = 0;           // Col position of offset pixel
    int Gx;                     // Sum of Sobel mask products in the x direction
    int Gy;                     // Sum of Sobel mask products in the y direction
    float thisAngle;            // Gradient direction based on Gx and Gy
    int newAngle;               // Approximation of the gradient direction
    bool edgeEnd;               // Stores whether or not the edge reached the image border
    int GxMask[3][3];           // Sobel mask in the x direction
    int GyMask[3][3];           // Sobel mask in the y direction
    int gaussianMask[5][5];     // Gaussian smoothing mask
    int newPixel;               // Sum of pixel values for the Gaussian blur

    for (row = 0; row < H; row++)
        for (col = 0; col < W; col++)
            edgeDir[row][col] = 0;

    Hist_Eq(imgData, W, H);

    /* Declare Sobel masks */
    GxMask[0][0] = -1; GxMask[0][1] = 0; GxMask[0][2] = 1;
    GxMask[1][0] = -2; GxMask[1][1] = 0; GxMask[1][2] = 2;
    GxMask[2][0] = -1; GxMask[2][1] = 0; GxMask[2][2] = 1;

    GyMask[0][0] =  1; GyMask[0][1] =  2; GyMask[0][2] =  1;
    GyMask[1][0] =  0; GyMask[1][1] =  0; GyMask[1][2] =  0;
    GyMask[2][0] = -1; GyMask[2][1] = -2; GyMask[2][2] = -1;

    /* Declare Gaussian mask */
    gaussianMask[0][0] = 2; gaussianMask[0][1] = 4;  gaussianMask[0][2] = 5;  gaussianMask[0][3] = 4;  gaussianMask[0][4] = 2;
    gaussianMask[1][0] = 4; gaussianMask[1][1] = 9;  gaussianMask[1][2] = 12; gaussianMask[1][3] = 9;  gaussianMask[1][4] = 4;
    gaussianMask[2][0] = 5; gaussianMask[2][1] = 12; gaussianMask[2][2] = 15; gaussianMask[2][3] = 12; gaussianMask[2][4] = 5;
    gaussianMask[3][0] = 4; gaussianMask[3][1] = 9;  gaussianMask[3][2] = 12; gaussianMask[3][3] = 9;  gaussianMask[3][4] = 4;
    gaussianMask[4][0] = 2; gaussianMask[4][1] = 4;  gaussianMask[4][2] = 5;  gaussianMask[4][3] = 4;  gaussianMask[4][4] = 2;

    /* Gaussian blur */
    for (row = 2; row < H - 2; row++) {
        unsigned long val = row * W;
        for (col = 2; col < W - 2; col++) {
            newPixel = 0;
            for (rowOffset = -2; rowOffset <= 2; rowOffset++) {
                rowTotal = row + rowOffset;
                const unsigned char * const rowTemp = &imgData[rowTotal * W];
                for (colOffset = -2; colOffset <= 2; colOffset++) {
                    colTotal = col + colOffset;
                    newPixel += rowTemp[colTotal] * gaussianMask[2 + rowOffset][2 + colOffset];
                }
            }
            i = (unsigned long)(val + col);
            *(imgData + i) = newPixel / 159;
        }
    }

    /* Determine edge directions and gradient strengths */
    for (row = 1; row < H - 1; row++) {
        for (col = 1; col < W - 1; col++) {
            Gx = 0;
            Gy = 0;
            /* Sum of the Sobel masks times the nine surrounding pixels, in x and y */
            for (rowOffset = -1; rowOffset <= 1; rowOffset++) {
                rowTotal = row + rowOffset;
                const unsigned char * const rowTemp = &imgData[rowTotal * W];
                for (colOffset = -1; colOffset <= 1; colOffset++) {
                    colTotal = col + colOffset;
                    Gx += rowTemp[colTotal] * GxMask[rowOffset + 1][colOffset + 1];
                    Gy += rowTemp[colTotal] * GyMask[rowOffset + 1][colOffset + 1];
                }
            }
            gradient[row][col] = sqrt(pow(Gx, 2.0) + pow(Gy, 2.0));   // Gradient strength
            thisAngle = (atan2(Gx, Gy) / 3.14159) * 180.0;            // Actual direction of edge
            /* Convert actual edge direction to approximate value */
            if (((thisAngle < 22.5) && (thisAngle > -22.5)) || (thisAngle > 157.5) || (thisAngle < -157.5))
                newAngle = 0;
            if (((thisAngle > 22.5) && (thisAngle < 67.5)) || ((thisAngle < -112.5) && (thisAngle > -157.5)))
                newAngle = 45;
            if (((thisAngle > 67.5) && (thisAngle < 112.5)) || ((thisAngle < -67.5) && (thisAngle > -112.5)))
                newAngle = 90;
            if (((thisAngle > 112.5) && (thisAngle < 157.5)) || ((thisAngle < -22.5) && (thisAngle > -67.5)))
                newAngle = 135;
            edgeDir[row][col] = newAngle;   // Store the approximate edge direction of each pixel
        }
    }

    /* Trace along all the edges in the image */
    for (row = 1; row < H - 1; row++) {
        unsigned long val = row * W;
        for (col = 1; col < W - 1; col++) {
            edgeEnd = false;
            if (gradient[row][col] > upperThreshold) {   // Start an edge only from a strong gradient
                /* Switch based on current pixel's edge direction */
                switch (edgeDir[row][col]) {
                    case 0:
                        findEdge(0, 1, row, col, 0, lowerThreshold, imgData);
                        break;
                    case 45:
                        findEdge(1, 1, row, col, 45, lowerThreshold, imgData);
                        break;
                    case 90:
                        findEdge(1, 0, row, col, 90, lowerThreshold, imgData);
                        break;
                    case 135:
                        findEdge(1, -1, row, col, 135, lowerThreshold, imgData);
                        break;
                    default:
                        i = (unsigned long)(val + col);
                        *(imgData + i) = 0;
                        break;
                }
            }
            else {
                i = (unsigned long)(val + col);
                *(imgData + i) = 0;
            }
        }
    }

    /* Suppress any pixels not changed by the edge tracing */
    for (row = 0; row < H; row++) {
        unsigned long val = row * W;
        for (col = 0; col < W; col++) {
            i = (unsigned long)(val + col);
            // If a pixel's grey value is not 1 (edge) or 0, make it black
            if ((*(imgData + i) != 1) && (*(imgData + i) != 0))
                *(imgData + i) = 0;
        }
    }

    /* Non-maximum suppression */
    for (row = 1; row < H - 1; row++) {
        unsigned long val = row * W;
        for (col = 1; col < W - 1; col++) {
            i = (unsigned long)(val + col);
            if (*(imgData + i) == 1) {   // Check whether the current pixel is an edge
                /* Switch based on current pixel's edge direction */
                switch (edgeDir[row][col]) {
                    case 0:
                        suppressNonMax(1, 0, row, col, 0, lowerThreshold, imgData);
                        break;
                    case 45:
                        suppressNonMax(1, -1, row, col, 45, lowerThreshold, imgData);
                        break;
                    case 90:
                        suppressNonMax(0, 1, row, col, 90, lowerThreshold, imgData);
                        break;
                    case 135:
                        suppressNonMax(1, 1, row, col, 135, lowerThreshold, imgData);
                        break;
                    default:
                        break;
                }
            }
        }
    }
}
void findEdge(int rowShift, int colShift, int row, int col, int dir, int lowerThreshold, unsigned char *imgData)
{
    int W = 320;
    int H = 240;
    int newRow = 0;
    int newCol = 0;
    unsigned long i;
    bool edgeEnd = false;

    /* Find the row and column values for the next possible pixel on the edge;
       if the next pixel would be off the image, skip the while loop */
    if (colShift < 0) {
        if (col > 0)
            newCol = col + colShift;
        else
            edgeEnd = true;
    } else if (col < W - 1) {
        newCol = col + colShift;
    } else
        edgeEnd = true;

    if (rowShift < 0) {
        if (row > 0)
            newRow = row + rowShift;
        else
            edgeEnd = true;
    } else if (row < H - 1) {
        newRow = row + rowShift;
    } else
        edgeEnd = true;

    /* Follow the edge while the direction matches and the gradient stays strong enough */
    while ((edgeDir[newRow][newCol] == dir) && !edgeEnd && (gradient[newRow][newCol] > lowerThreshold)) {
        /* Set the new pixel as white to show it is an edge */
        i = (unsigned long)(newRow * W + newCol);
        *(imgData + i) = 1;
        if (colShift < 0) {
            if (newCol > 0)
                newCol = newCol + colShift;
            else
                edgeEnd = true;
        } else if (newCol < W - 1) {
            newCol = newCol + colShift;
        } else
            edgeEnd = true;
        if (rowShift < 0) {
            if (newRow > 0)
                newRow = newRow + rowShift;
            else
                edgeEnd = true;
        } else if (newRow < H - 1) {
            newRow = newRow + rowShift;
        } else
            edgeEnd = true;
    }
}
void suppressNonMax(int rowShift, int colShift, int row, int col, int dir, int lowerThreshold, unsigned char *imgData)
{
    int W = 320;
    int H = 240;
    int newRow = 0;
    int newCol = 0;
    unsigned long i;
    bool edgeEnd = false;
    float nonMax[320][3];   // Temporarily stores gradients and positions of pixels in parallel edges
    int pixelCount = 0;     // Stores the number of pixels in parallel edges
    int count;              // Loop counter
    float max[3];           // Maximum point in a wide edge

    /* If the next pixel would be off the image, skip the while loop */
    if (colShift < 0) {
        if (col > 0)
            newCol = col + colShift;
        else
            edgeEnd = true;
    } else if (col < W - 1) {
        newCol = col + colShift;
    } else
        edgeEnd = true;

    if (rowShift < 0) {
        if (row > 0)
            newRow = row + rowShift;
        else
            edgeEnd = true;
    } else if (row < H - 1) {
        newRow = row + rowShift;
    } else
        edgeEnd = true;

    i = (unsigned long)(newRow * W + newCol);
    /* Find non-maximum parallel edges tracing up */
    while ((edgeDir[newRow][newCol] == dir) && !edgeEnd && (*(imgData + i) == 1)) {
        if (colShift < 0) {
            if (newCol > 0)
                newCol = newCol + colShift;
            else
                edgeEnd = true;
        } else if (newCol < W - 1) {
            newCol = newCol + colShift;
        } else
            edgeEnd = true;
        if (rowShift < 0) {
            if (newRow > 0)
                newRow = newRow + rowShift;
            else
                edgeEnd = true;
        } else if (newRow < H - 1) {
            newRow = newRow + rowShift;
        } else
            edgeEnd = true;
        nonMax[pixelCount][0] = newRow;
        nonMax[pixelCount][1] = newCol;
        nonMax[pixelCount][2] = gradient[newRow][newCol];
        pixelCount++;
        i = (unsigned long)(newRow * W + newCol);
    }

    /* Find non-maximum parallel edges tracing down */
    edgeEnd = false;
    colShift *= -1;
    rowShift *= -1;
    if (colShift < 0) {
        if (col > 0)
            newCol = col + colShift;
        else
            edgeEnd = true;
    } else if (col < W - 1) {
        newCol = col + colShift;
    } else
        edgeEnd = true;
    if (rowShift < 0) {
        if (row > 0)
            newRow = row + rowShift;
        else
            edgeEnd = true;
    } else if (row < H - 1) {
        newRow = row + rowShift;
    } else
        edgeEnd = true;

    i = (unsigned long)(newRow * W + newCol);
    while ((edgeDir[newRow][newCol] == dir) && !edgeEnd && (*(imgData + i) == 1)) {
        if (colShift < 0) {
            if (newCol > 0)
                newCol = newCol + colShift;
            else
                edgeEnd = true;
        } else if (newCol < W - 1) {
            newCol = newCol + colShift;
        } else
            edgeEnd = true;
        if (rowShift < 0) {
            if (newRow > 0)
                newRow = newRow + rowShift;
            else
                edgeEnd = true;
        } else if (newRow < H - 1) {
            newRow = newRow + rowShift;
        } else
            edgeEnd = true;
        nonMax[pixelCount][0] = newRow;
        nonMax[pixelCount][1] = newCol;
        nonMax[pixelCount][2] = gradient[newRow][newCol];
        pixelCount++;
        i = (unsigned long)(newRow * W + newCol);
    }

    /* Suppress non-maximum edges */
    max[0] = 0;
    max[1] = 0;
    max[2] = 0;
    for (count = 0; count < pixelCount; count++) {
        if (nonMax[count][2] > max[2]) {
            max[0] = nonMax[count][0];
            max[1] = nonMax[count][1];
            max[2] = nonMax[count][2];
        }
    }
    for (count = 0; count < pixelCount; count++) {
        i = (unsigned long)(nonMax[count][0] * W + nonMax[count][1]);
        *(imgData + i) = 0;
    }
}

void Hist_Eq(unsigned char *img_data, int width, int height)
{
    unsigned long hist[256];
    double s_hist_eq[256] = {0.0}, sum_of_hist[256] = {0.0};
    long i, k, l, n;
    n = width * height;

    for (i = 0; i < 256; i++)
        hist[i] = 0;
    for (i = 0; i < n; i++) {
        l = img_data[i];
        hist[l]++;
    }
    for (i = 0; i < 256; i++)   // pdf of image
        s_hist_eq[i] = (double)hist[i] / (double)n;
    sum_of_hist[0] = s_hist_eq[0];
    for (i = 1; i < 256; i++)   // cdf of image
        sum_of_hist[i] = sum_of_hist[i - 1] + s_hist_eq[i];
    for (i = 0; i < n; i++) {
        k = img_data[i];
        img_data[i] = (unsigned char)round(sum_of_hist[k] * 255.0);
    }
}
50
CHAPTER - VIII
CONCLUSION
As can be clearly observed from the analysis, the case with both Skin colour and Canny
edge detection with an OR gate gives the best results. This gives a reduction of FPR by a
factor of almost 3 while maintaining a high true positive rate of 84%. This also gives a
speed-up by almost a factor of 2.3.
The value of TPR obtained is not a true indication of the efficiency of the face detector, as
the number of images used for testing was very low. Within this set, in fact, this
method failed to detect only four faces, all of which were taken under low lighting. One
example of an undetected face is shown in Fig. 3.2. As can be seen, this is a particularly
difficult case, so it is understandable why both skin colour and Canny edge detection do not
work satisfactorily here. If these aberrations are eliminated from the analysis, the
TPR obtained will be much higher.
It was a great pleasure working in the laboratory atmosphere at Leibniz University. It is an
ambience which is very conducive to thesis work. I am indebted to the guidance of Prof
Dr-Ing Bodo Rosenhahn and the valuable inputs given by my supervisors Mr. Arne Ehlers,
Mr Björn Scheuermann and Mr. Florian Baumann.
51
CHAPTER - IX
FUTURE ENHANCEMENTS
Some of the future developments that can be made to enhance the working of the project
are mentioned below
• Analysis for exact threshold for Canny and Skin colour detection - The threshold values
used for Canny edge detection and Skin colour detection (45% - 90% for Canny and
16.7% - 22.7% for skin colour detection) are taken just by testing out various threshold
values for the 29 test images. A proper analysis to calculate the exact threshold values for
white count within each sub-window should be done. This may drastically improve the
results.
• Speed up using Neon intrinsics - As mentioned before, SIMD processing on the ARM
processor can be used to greatly improve the speed of computation.
• Speed up using GPU - The graphics processing unit of the mobile device may be used to
further speed up the processing.
• Pre-processing used in training - The pre-processing techniques may be included in the
training of the features to create better classifiers.
52
REFERENCES
1. Crow. F, (1984), ‘Summed-area tables for texture mapping’, Proceedings of SIGGRAPH, Vol.18, No.3, pp.207–212.
2. Gonzalez and Woods, (2008), ‘Digital Image Processing’, Third edition, Prentice Hall, Noida.
3. John Canny, (1986), ‘A computational approach to edge detection’, Pattern Analysis and Machine Intelligence, IEEE Transactions on, Vol.8, No.6, pp.679–698.
4. Masoud Nosrati, Ronak Karimi, Mehdi Hariri, (2012), ‘Detecting circular shapes from areal images using median filter and CHT’, World Applied Programming, Vol.2, pp.49–54.
5. Michael Kearns and Leslie G. Valiant, (1994), ‘Cryptographic limitations on learning Boolean formulae and finite automata’, Journal of the Association for Computing Machinery, Vol.41, No.1, pp.67–95.
6. Michael Kearns and Leslie G. Valiant, (1988), ‘Learning Boolean formulae or finite automata is as hard as factoring’, Technical Report TR-14-88, Harvard University Aiken Computation Laboratory.
7. Rick Kjeldsen and John R. Kender, (1996), ‘Finding skin in color images’, in proceedings of 2nd International Conference on Automatic Face and Gesture Recognition 96, pp.312-317.
8. Robert E. Schapire, (1990), ‘The strength of weak learnability’, Machine Learning, Vol.5, No.2, pp.197–227.
9. Sanjay Kr. Singh and D. S. Chauhan and Mayank Vatsa and Richa Singh, (2003), ‘A Robust Skin Color Based Face Detection Algorithm’, Tamkang Journal of Science and Engineering, Vol.6, pp. 227-234.
10. Viola, P., Jones, M.J, (2004), ‘Robust real-time face detection’, International Journal of Computer Vision, Vol. 57, pp.137–154.
11. Vladimir Vezhnevets, Vassili Sazonov, and Alla Andreeva, (2003), ‘A survey on pixel-based skin color detection techniques’, Proc. Graphicon-2003, pp.85- 92.
12. Yoav Freund and Robert E. Schapire, (1999), ‘A short introduction to boosting’, Journal of Japanese Society for Artificial Intelligence, Vol.14, No.5, pp.771–780.
WEBSITES
a. http://blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores/
b. http://cimg.sourceforge.net/index.shtml
c. http://en.wikipedia.org/wiki/Artificial_neural_network
d. http://en.wikipedia.org/wiki/Canny_edge_detector
e. http://en.wikipedia.org/wiki/Git_(software)#Source_code_hosting
f. http://en.wikipedia.org/wiki/HSL_and_HSV
g. http://en.wikipedia.org/wiki/Machine_learning
h. http://en.wikipedia.org/wiki/Perceptron
i. http://en.wikipedia.org/wiki/SIMD
j. http://en.wikipedia.org/wiki/Viola–Jones_object_detection_framework
k. http://pugixml.org/
l. http://smig.usgs.gov/SMIG/features_0902/tualatin_ann.html
m. http://www.chasanc.com/index.php/Image-Processing/Histogram-Equalization-HE.html
n. http://www.chasanc.com/index.php/Image-Processing/Skin-Color-Detection-with-HSV-Lookup.html
o. http://www.classle.net/sites/default/files/text/36461/cannyedge_detection_0.pdf
p. http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
q. http://www.pages.drexel.edu/~nk752/cannyTut2.html
r. http://www.techradar.com/news/software/applications/how-face-detection-works-703173
s. http://www.vogella.de/articles/Git/article.html