ECE 627 – Computer Vision
Spring 2017
Lecture 9: Pattern Recognition and Classification Algorithms
Charis Theocharides
Associate Professor, Dept. of Electrical and Computer Engineering
University of Cyprus
Semester Project (DUE MAY 20th!)
• Motion Detection and Estimation / Optical Flow
• Active Contour Model / Snakes
• IP Camera Intruder Detection System for Surveillance
• Face Recognition on Mobile Phones (Android or iOS)
• 3D Reconstruction from Multiview Images
• Raspberry-Pi-based Drone Object Recognition and Identification
• Intel Compute Stick (as above)
• OpenCV Projects on Jetson TK1 (face recognition, car recognition)
• Kinect-based Motion Recognition
• Gesture Recognition on Leap Motion Sensor
• Aerial Object Detection of MOVING Objects
• Movable Object Tracking from a Movable Camera
• Goal-line Optical Technology for Sports Using Multiview Cameras and Real-time Reconstruction
• Pedestrian vs. Animal vs. Car/Truck Classification for Driver Assistance
• SLAM (Robotics)
• License Plate Recognition
• Road Sign Recognition
• Road-Line Detection and Tracking
• Face Expression Recognition
• Handwritten Character Recognition
• Top-view Object Detection (Cars, Buildings) from Google Maps Images (also maybe landmark recognition)
• Hand Gesture Recognition
• Food Recognition (see "On Filter Banks of Texture Features for Mobile Food Classification")
INDEPENDENT STUDY
• Each one of you will do a review
• Submit a 15-page report by the end of the semester – MAY 20th!
• The review should relate to the work you will do for the project
• Present your knowledge and review on a topic of your choice
• Suggested topics:
  – Object Recognition
  – Classification
  – Region Segmentation
  – Motion Detection
  – Gesture and Motion Recognition
  – Contours and Edges
  – Tools/Software (OpenCV)
  – Algorithms (Viola-Jones, SURF, etc.)
COURSE CONTENTS
• Introduction to Computer Vision
• Image Fundamentals: Cameras, Lenses and Optical Sensors, Data Acquisition and Representation, Radiometry & Reflectance
• Image Formation: Sources, Shading, Colour, Metadata
• Linear Filters & Edges, Lines, Textures, Pyramids
• Segmentation: Transforms, Contours, Feature Extraction
• Optical Flow, Silhouettes, Contours, Motion Vectors
• Motion - Continuous and Discrete
• Recognition Algorithms and Introduction to Computational Intelligence for Vision
• Template Matching and Recognition (Classifiers, Neural Nets, SVM, ...)
• Object Detection, Recognition, Tracking
• Epipolar Geometry, Multiple View Geometry and Stereo Matching, Calibration
• 3D Vision - Stereo/Multiview, Structured Light Approaches, Other 3D Approaches
• Embedded and Mobile Computer Vision - Concepts, Constraints, Approaches and Solutions in Emerging Applications
Object Detection – Object Identification – Object Recognition
• Object detection: Is there a face in the image? Where is a face? Where is Jane?
• Object identification: Who is it? Is it Jane or Erik?
What is pattern recognition?
• A pattern is an object, process or event that can be given a name.
• A pattern class (or category) is a set of patterns sharing common attributes and usually originating from the same source.
• During recognition (or classification) given objects are assigned to prescribed classes.
• A classifier is a machine which performs classification.
“The assignment of a physical object or event to one of several prespecified categories” -- Duda & Hart
Examples of applications
• Optical Character Recognition (OCR)
  – Handwritten: sorting letters by postal code, input device for PDAs.
  – Printed texts: reading machines for blind people, digitization of text documents.
• Biometrics
  – Face recognition, verification, retrieval.
  – Fingerprint recognition.
  – Speech recognition.
• Diagnostic systems
  – Medical diagnosis: X-ray, EKG analysis.
  – Machine diagnostics, waster detection.
• Military applications
  – Automated Target Recognition (ATR).
  – Image segmentation and analysis (recognition from aerial or satellite photographs).
Approaches
• Statistical PR: based on an underlying statistical model of patterns and pattern classes.
• Structural (or syntactic) PR: pattern classes represented by means of formal structures such as grammars, automata, strings, etc.
• Neural networks: the classifier is represented as a network of cells modeling neurons of the human brain (connectionist approach).
Overfitting and underfitting
Problem: how rich a class of classifiers q(x; θ) to use.
[Figure: three fits of the same data — underfitting / good fit / overfitting]
Problem of generalization: a small empirical risk R_emp does not imply a small true expected risk R.
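The gap between empirical and true risk can be seen in a toy experiment. Below is a minimal sketch (Python/NumPy; the data, noise level, and polynomial degrees are illustrative assumptions): a degree-1 fit underfits, a moderate degree fits well, and a very high degree drives the training error toward zero while fitting the noise.

```python
import numpy as np

# Toy 1-D regression data: noisy samples of a smooth function.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Fit polynomials of increasing degree and compare training error:
# low degree underfits, moderate degree fits well, high degree
# drives the empirical risk R_emp toward zero while overfitting.
for degree in (1, 4, 15):
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    emp_risk = np.mean((y - y_hat) ** 2)   # empirical risk R_emp
    print(f"degree {degree:2d}: training MSE = {emp_risk:.4f}")
```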
Basic concepts
Feature vector x = [x_1, x_2, ..., x_n]^T, x ∈ X
- A vector of observations (measurements).
- x is a point in the feature space X.
Hidden state y ∈ Y
- Cannot be directly measured.
- Patterns with equal hidden state belong to the same class.
Task: to design a classifier (decision rule) q: X → Y which decides about a hidden state based on an observation.
Example
Task: horse vs. jockey recognition.
Feature vector: x = [x_1, x_2]^T, where x_1 = height and x_2 = weight.
The set of hidden states is Y = {H, J}; the feature space is X = R^2.
Training examples: {(x_1, y_1), ..., (x_l, y_l)}
Linear classifier:
  q(x) = H if w·x + b ≥ 0
         J if w·x + b < 0
The decision boundary is w·x + b = 0.
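A minimal sketch of this linear decision rule in Python/NumPy follows; the weights w, bias b, and example measurements are made-up illustrative values, not trained parameters.

```python
import numpy as np

# Hypothetical parameters for the horse-vs-jockey example above;
# x = [height_cm, weight_kg]. Weight dominates the decision here.
w = np.array([0.0, 0.01])
b = -2.0

def q(x):
    """Linear decision rule: H if w.x + b >= 0, else J."""
    return "H" if np.dot(w, x) + b >= 0 else "J"

print(q(np.array([160.0, 500.0])))   # -> "H" (horse)
print(q(np.array([165.0, 55.0])))    # -> "J" (jockey)
```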
Components of PR system
[Diagram: Pattern → Sensors and preprocessing → Feature extraction → Classifier → Class assignment; a Teacher and a Learning algorithm adjust the classifier during training]
• Sensors and preprocessing.
• Feature extraction aims to create discriminative features good for classification.
• A classifier.
• A teacher provides information about the hidden state -- supervised learning.
• A learning algorithm sets up the PR system from training examples.
Feature extraction
Task: to extract features which are good for classification.
Good features:
• Objects from the same class have similar feature values.
• Objects from different classes have different feature values.
[Figure: "Good" features vs. "Bad" features]
Feature extraction methods
Feature extraction: a mapping φ = (φ_1, ..., φ_n) computes the feature vector x = [x_1, ..., x_n]^T from the raw measurements m = [m_1, ..., m_k]^T.
Feature selection: selects a subset [x_1, ..., x_n]^T of the measurements [m_1, m_2, m_3, ..., m_k]^T.
The problem can be expressed as optimization of the parameters θ of the feature extractor φ(θ).
Supervised methods: the objective function is a criterion of separability (discriminability) of labeled examples, e.g., linear discriminant analysis (LDA).
Unsupervised methods: a lower-dimensional representation which preserves important characteristics of the input data is sought, e.g., principal component analysis (PCA).
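As one concrete instance of an unsupervised method, here is a minimal PCA sketch in Python/NumPy (the data is random and purely illustrative): it projects k-dimensional measurements onto the n directions of largest variance.

```python
import numpy as np

# Minimal PCA sketch (unsupervised feature extraction): project
# k-dimensional measurements onto the n directions of largest variance.
def pca(M, n):
    """M: (num_samples, k) data matrix; returns (num_samples, n) features."""
    M_centered = M - M.mean(axis=0)
    # Rows of Vt are the principal directions (largest variance first).
    _, _, Vt = np.linalg.svd(M_centered, full_matrices=False)
    return M_centered @ Vt[:n].T

rng = np.random.default_rng(1)
M = rng.normal(size=(100, 5))   # illustrative measurements, k = 5
X = pca(M, n=2)                 # 2-D feature vectors
print(X.shape)                  # (100, 2)
```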
Classifier
A classifier partitions the feature space X into class-labeled regions such that
  X = X_1 ∪ X_2 ∪ ... ∪ X_|Y|  and  X_1 ∩ X_2 ∩ ... ∩ X_|Y| = ∅.
[Figure: a feature space partitioned into regions X_1, X_2, X_3]
Classification consists of determining to which region a feature vector x belongs.
Borders between decision regions are called decision boundaries.
Representation of classifier
A classifier is typically represented as a set of discriminant functions
  f_i: X → R,  i = 1, ..., |Y|.
The classifier assigns a feature vector x to the i-th class if
  f_i(x) > f_j(x)  for all j ≠ i.
[Diagram: feature vector x → discriminant functions f_1(x), f_2(x), ..., f_|Y|(x) → max → class identifier y]
Review
Fig.1 Basic components of a pattern recognition system
Steps
• Data acquisition and sensing
• Pre-processing
  – Removal of noise in data.
  – Isolation of patterns of interest from the background.
• Feature extraction
  – Finding a new representation in terms of features (better for further processing).
Steps
• Model learning and estimation
  – Learning a mapping between features and pattern groups.
• Classification
  – Using learned models to assign a pattern to a predefined category.
• Post-processing
  – Evaluation of confidence in decisions.
  – Exploitation of context to improve performance.
Table 1 : Examples of pattern recognition applications
Image Recognition: 2D Matched Filter
• Functionality
  – Reducing the effect of noise.
  – Computing the similarity of two objects (template matching for images).
• Functional block
[Block diagram: input image I(m,n) → 2D matched filter with impulse response H*(-m,-n), the reversed and conjugated template image → output image Y(m,n)]
2D Matched Filter: Template Matching
[Diagram: input image I(m,n) and template image H(m,n), rotated to H*(-m,-n), fed to the 2D matched filter; output images shown without and with normalization]
2D Matched Filter: Template Matching
• Drawbacks
  – Poor discriminative ability on template shape (ignores the structural relations of patterns).
  – Changes in rotation and magnification of template objects result in an enormous number of templates to test.
• Template matching is usually limited to smaller local features, which are more invariant to size and shape variations of an object.
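A minimal template-matching sketch using OpenCV's normalized cross-correlation follows; the file names are placeholders, and TM_CCORR_NORMED is just one of several available matching scores.

```python
import cv2

# Load the input image I(m,n) and template H(m,n); paths are placeholders.
image = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)

# Normalized cross-correlation: correlating with the template is
# equivalent to filtering with the rotated impulse response H*(-m,-n).
response = cv2.matchTemplate(image, template, cv2.TM_CCORR_NORMED)

# The peak of the response marks the best-matching location.
_, max_val, _, max_loc = cv2.minMaxLoc(response)
print(f"best match at {max_loc} with score {max_val:.3f}")
```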
Image Registration
• What is image registration?
  – Aligning images correctly so that systems have better performance.
• Misregistration between images:
  – Translational differences
  – Scale differences
  – Rotational differences
Bayes Statistical Classifiers
• Consideration
  – Randomness of patterns
• Decision criterion
Pattern x is labeled as class w_i if
  Σ_{k=1..W} L_ki p(x/w_k) P(w_k) < Σ_{q=1..W} L_qj p(x/w_q) P(w_q)  for all j ≠ i
where
  L_ij : misclassification loss function
  p(x/w_i) : p.d.f. that a particular pattern x comes from class w_i
  P(w_i) : probability of occurrence of class w_i
Bayes Statistical Classifiers
• Decision criterion:
Given that L_ij is a symmetric loss function:
  – Posterior probability decision rule:
      p(x/w_i) P(w_i) > p(x/w_j) P(w_j)  for all j ≠ i
  – Decision functions:
      d_j(x) = p(x/w_j) P(w_j) ∝ P(w_j/x)
Pattern x is classified to class j if d_j(x) yields the largest value.
Bayes Statistical Classifiers
• Advantages
  – Optimal in minimizing the total average loss in misclassification.
• Disadvantages
  – Both P(w_j) and p(x/w_j) must be known in advance; estimation is required.
  – Performance highly depends on the assumed distributions P(w_j) and p(x/w_j).
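A minimal sketch of this decision rule for the common special case of 1-D features, Gaussian class-conditional densities, and a symmetric 0/1 loss (all simplifying assumptions) follows; with 0/1 loss the rule reduces to maximizing d_j(x) = p(x/w_j) P(w_j).

```python
import numpy as np

# Bayes decision rule sketch: Gaussian class-conditional densities
# estimated from labeled 1-D samples; toy data, symmetric 0/1 loss.
def fit_gaussian(samples):
    return samples.mean(), samples.std()

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
class_a = rng.normal(0.0, 1.0, 200)     # samples from class w1
class_b = rng.normal(3.0, 1.0, 100)     # samples from class w2
priors = np.array([200, 100]) / 300.0   # P(w_j) from class frequencies
params = [fit_gaussian(class_a), fit_gaussian(class_b)]

def classify(x):
    # Maximize d_j(x) = p(x/w_j) * P(w_j) over the classes.
    d = [gaussian_pdf(x, mu, s) * P for (mu, s), P in zip(params, priors)]
    return int(np.argmax(d))

print(classify(0.5), classify(2.8))     # -> 0 1 (typically)
```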
Two Schools of Thought
1. Statistical Pattern Recognition
   The data is reduced to vectors of numbers and statistical techniques are used for the tasks to be performed.
2. Structural Pattern Recognition
   The data is converted to a discrete structure (such as a grammar or a graph) and the techniques are related to computer science subjects (such as parsing and graph matching).
Classification in Statistical PR
• A class is a set of objects having some important properties in common.
• A feature extractor is a program that inputs the data (image) and extracts features that can be used in classification.
• A classifier is a program that inputs the feature vector and assigns it to one of a set of designated classes or to the “reject” class.
With what kinds of classes do you work?
Feature Vector Representation
• X = [x1, x2, ..., xn], each xj a real number
• xj may be an object measurement
• xj may be a count of object parts
• Example object representation: [#holes, #strokes, moments, ...]
[Figure: possible features for character recognition]
Some Terminology
• Classes: a set of m known categories of objects
  (a) might have a known description for each
  (b) might have a set of samples for each
• Reject class: a generic class for objects not in any of the designated known classes
• Classifier: assigns an object to a class based on features
Discriminant functions
• Functions f(x, K) perform some computation on feature vector x
• Knowledge K from training or programming is used
• The final stage determines the class
Classification using nearest class mean
• Compute the Euclidean distance between feature vector X and the mean of each class.
• Choose closest class, if close enough (reject otherwise)
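A minimal nearest-class-mean sketch in Python/NumPy, with an optional reject threshold; the means, labels, and threshold value are illustrative assumptions.

```python
import numpy as np

# Nearest-class-mean classifier sketch with an optional reject option.
def nearest_mean(x, means, labels, reject_dist=None):
    dists = np.linalg.norm(means - x, axis=1)   # Euclidean distances
    i = int(np.argmin(dists))
    if reject_dist is not None and dists[i] > reject_dist:
        return "reject"
    return labels[i]

means = np.array([[0.0, 0.0], [5.0, 5.0]])      # per-class mean vectors
labels = ["class1", "class2"]
print(nearest_mean(np.array([4.2, 5.1]), means, labels))          # class2
print(nearest_mean(np.array([20.0, 20.0]), means, labels, 3.0))   # reject
```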
Nearest mean might yield poor results with complex structure
• Class 2 has two modes; where is its mean?
• But if the modes are detected, two subclass mean vectors can be used
Scaling coordinates by std dev
Receiver Operating Characteristic (ROC) curve
• Plots correct detection rate versus false alarm rate
• Generally, false alarms go up with attempts to detect higher percentages of known objects
Confusion matrix shows empirical performance
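A minimal sketch of building such a confusion matrix from labels (Python/NumPy; the label arrays are made up): rows index the true class, columns the predicted class, and the diagonal holds the correct decisions.

```python
import numpy as np

# Build a confusion matrix from true vs. predicted class labels.
def confusion_matrix(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1        # row = true class, column = predicted class
    return cm

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, 3)
print(cm)
print("accuracy:", np.trace(cm) / cm.sum())   # diagonal = correct decisions
```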
Classifiers often used in CV
• Decision Tree Classifiers
• Artificial Neural Net Classifiers
• Bayesian Classifiers and Bayesian Networks (Graphical Models)
• Support Vector Machines
Introduction – Neural Nets
• What are Neural Networks?
  – Neural networks are a paradigm of programming computers.
  – They are exceptionally good at performing pattern recognition and other tasks that are very difficult to program using conventional techniques.
  – Programs that employ neural nets are also capable of learning on their own and adapting to changing conditions.
Background
• An Artificial Neural Network (ANN) is an information processing paradigm inspired by biological nervous systems, such as the human brain’s information processing mechanism.
• The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. NNs, like people, learn by example.
• An NN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons. This is true of NNs as well.
How the Human Brain Learns
• In the human brain, a typical neuron collects signals from others through a host of fine structures called dendrites.
• The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches.
• At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurons.
Neural Networks
• Biological approach to AI
• Developed in 1943
• Comprised of one or more layers of neurons
• Several types; we’ll focus on feed-forward networks
[Figure: biological vs. artificial neurons; sources: http://faculty.washington.edu/chudler/color/pic1an.gif, http://research.yale.edu/ysm/images/78.2/articles-neural-neuron.jpg]
A Neuron
• Receives n inputs
• Multiplies each input by its weight
• Applies an activation function to the sum of results
• Outputs the result
[Figure: neuron model; source: http://www-cse.uta.edu/~cook/ai1/lectures/figures/neuron.jpg]
A Neuron Model
• When a neuron receives excitatory input that is sufficiently large compared with its inhibitory input, it sends a spike of electrical activity down its axon. Learning occurs by changing the effectiveness of the synapses so that the influence of one neuron on another changes.
• We construct these neural networks by first trying to deduce the essential features of neurons and their interconnections.
• We then typically program a computer to simulate these features.
Activation Functions
• Control when a unit is “active” or “inactive”
• Threshold function: outputs 1 when the input is positive and 0 otherwise
• Sigmoid function: σ(x) = 1 / (1 + e^-x)
• Hyperbolic tangent, etc.
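A minimal single-neuron sketch in Python/NumPy showing these activation functions; the weights, bias, and inputs are illustrative values.

```python
import numpy as np

# A single artificial neuron: weighted sum of inputs followed by an
# activation function. Weights, bias, and inputs are illustrative.
def threshold(s):
    return 1.0 if s > 0 else 0.0

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def neuron(x, w, bias, activation):
    return activation(np.dot(w, x) + bias)

x = np.array([0.5, -0.2, 0.1])
w = np.array([0.8, 0.4, -0.6])
print(neuron(x, w, 0.1, threshold))   # hard threshold output
print(neuron(x, w, 0.1, sigmoid))     # smooth sigmoid output
print(neuron(x, w, 0.1, np.tanh))     # hyperbolic tangent output
```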
Neural Network Layers
• Each layer receives its inputs from the previous layer and forwards its outputs to the next layer
[Figure: layered network; source: http://smig.usgs.gov/SMIG/features_0902/tualatin_ann.fig3.gif]
Pattern Recognition
• An important application of neural networks is pattern recognition. Pattern recognition can be implemented by using a feed-forward neural network that has been trained accordingly.
• During training, the network is trained to associate outputs with input patterns.
• When the network is used, it identifies the input pattern and tries to output the associated output pattern.
• The power of neural networks comes to life when a pattern that has no associated output is given as an input.
• In this case, the network gives the output that corresponds to a taught input pattern that is least different from the given pattern.
Pattern Recognition (cont.)
• Suppose a network is trained to recognize the patterns T and H. The associated patterns are all black and all white respectively as shown above.
Pattern Recognition (cont.)
Since the input pattern looks more like a ‘T’, when the network classifies it, it sees the input closely resembling ‘T’ and outputs the pattern that represents a ‘T’.
Pattern Recognition (cont.)
The input pattern here closely resembles ‘H’ with a slight difference. The network in this case classifies it as an ‘H’ and outputs the pattern representing an ‘H’.
Pattern Recognition (cont.)
• Here the top row is 2 errors away from a ‘T’ and 3 errors away from an ‘H’, so the top output is black.
• The middle row is 1 error away from both ‘T’ and ‘H’, so the output is random.
• The bottom row is 1 error away from ‘T’ and 2 away from ‘H’, therefore the output is black.
• Since the input resembles a ‘T’ more than an ‘H’, the output of the network is in favor of a ‘T’.
Learning by Back-Propagation: Illustration
[Illustration from "Artificial Neural Networks", Colin Fahey's Guide (Book CD)]
Computational Complexity
• Could lead to a very large number of calculations
[Diagram: input units → influence map layer 1 → hidden units → influence map layer 2 → output units]
Different types of Neural Networks
• Feed-forward networks
  – Feed-forward NNs allow signals to travel one way only, from input to output. There is no feedback (loops); i.e., the output of any layer does not affect that same layer.
  – Feed-forward NNs tend to be straightforward networks that associate inputs with outputs. They are extensively used in pattern recognition.
  – This type of organization is also referred to as bottom-up or top-down.
Continued
• Feedback networks
  – Feedback networks can have signals traveling in both directions by introducing loops in the network.
  – Feedback networks are dynamic; their 'state' changes continuously until they reach an equilibrium point.
  – They remain at the equilibrium point until the input changes and a new equilibrium needs to be found.
  – Feedback architectures are also referred to as interactive or recurrent, although the latter term is often used to denote feedback connections in single-layer organizations.
Diagram of an NN
Fig: A simple Neural Network
Network Layers
• Input Layer - The activity of the input units represents the raw information that is fed into the network.
• Hidden Layer - The activity of each hidden unit is determined by the activities of the input units and the weights on the connections between the input and the hidden units.
• Output Layer - The behavior of the output units depends on the activity of the hidden units and the weights between the hidden and output units.
Continued
• This simple type of network is interesting because the hidden units are free to construct their own representations of the input.
• The weights between the input and hidden units determine when each hidden unit is active, and so by modifying these weights, a hidden unit can choose what it represents.
Network Structure
• The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error.
• Two types of adaptive algorithms can be used:
  – start from a large network and successively remove some neurons and links until network performance degrades.
  – begin with a small network and introduce new neurons until performance is satisfactory.
Network Parameters
• How are the weights initialized?
• How many hidden layers and how many neurons?
• How many examples in the training set?
Weights
• In general, initial weights are randomly chosen, with typical values between -1.0 and 1.0 or -0.5 and 0.5.
• There are two types of NNs:
  – Fixed Networks - where the weights are fixed
  – Adaptive Networks - where the weights are changed to reduce prediction error
Size of Training Data
• Rule of thumb:
  – the number of training examples should be at least five to ten times the number of weights of the network.
• Other rule:
    N > |W| / (1 - a)
  where |W| = number of weights and a = expected accuracy on the test set.
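For example, a network with |W| = 100 weights and an expected test-set accuracy of a = 0.9 would, by this rule, need N > 100 / (1 - 0.9) = 1000 training examples.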
Training Basics
• The most basic method of training a neural network is trial and error.
• If the network isn't behaving the way it should, change the weighting of a random link by a random amount. If the accuracy of the network declines, undo the change and make a different one.
• It takes time, but the trial and error method does produce results.
Training: Backprop algorithm
• The Backprop algorithm searches for weight values that minimize the total error of the network over the set of training examples (training set).
• Backprop consists of the repeated application of the following two passes:
  – Forward pass: the network is activated on one example and the error of (each neuron of) the output layer is computed.
  – Backward pass: the network error is used for updating the weights. Starting at the output layer, the error is propagated backwards through the network, layer by layer, by recursively computing the local gradient of each neuron.
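A minimal backprop sketch in Python/NumPy: a 2-2-1 sigmoid network trained on XOR with plain gradient descent. The architecture, learning rate, and epoch count are illustrative assumptions, and squared error is used for simplicity.

```python
import numpy as np

# Minimal backprop sketch: a 2-2-1 sigmoid network trained on XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # target outputs

W1 = rng.normal(scale=0.5, size=(2, 2)); b1 = np.zeros(2)
W2 = rng.normal(scale=0.5, size=(2, 1)); b2 = np.zeros(1)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

for epoch in range(20000):
    # Forward pass: activate the network and compute the output error.
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    err = y - T                             # dE/dy for squared error
    # Backward pass: propagate the error backwards, layer by layer,
    # by recursively computing each neuron's local gradient.
    d2 = err * y * (1 - y)                  # local gradient, output layer
    d1 = (d2 @ W2.T) * h * (1 - h)          # local gradient, hidden layer
    W2 -= 0.5 * h.T @ d2; b2 -= 0.5 * d2.sum(axis=0)
    W1 -= 0.5 * X.T @ d1; b1 -= 0.5 * d1.sum(axis=0)

print(np.round(y.ravel(), 2))               # typically approaches [0, 1, 1, 0]
```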
Back Propagation
• Learning methodology: the back-propagation training algorithm
• Backprop adjusts the weights of the NN in order to minimize the network's total mean squared error.
[Diagram: forward step (network activation) and backward step (error propagation)]
The Learning Process (cont.)
• Every neural network possesses knowledge which is contained in the values of the connection weights.
• Modifying the knowledge stored in the network as a function of experience implies a learning rule for changing the values of the weights.
The Learning Process (cont.)
• Recall: adaptive networks are NNs that allow the change of weights in their connections.
• The learning methods can be classified in two categories:
  – Supervised Learning
  – Unsupervised Learning
Supervised Learning
• Supervised learning incorporates an external teacher, so that each output unit is told what its desired response to input signals ought to be.
• An important issue concerning supervised learning is the problem of error convergence, i.e., the minimization of error between the desired and computed unit values.
• The aim is to determine a set of weights which minimizes the error. One well-known method, common to many learning paradigms, is least mean square (LMS) convergence.
Supervised Learning
• In this sort of learning, the human teacher's experience is used to tell the NN which outputs are correct and which are not.
• This does not mean that a human teacher needs to be present at all times; only the correct classifications gathered from the human teacher on a domain need to be present.
• The network then learns from its errors, that is, it changes its weights to reduce its prediction error.
Unsupervised Learning
• Unsupervised learning uses no external teacher and is based upon only local information. It is also referred to as self-organization, in the sense that it self-organizes data presented to the network and detects their emergent collective properties.
• The network is then used to construct clusters of similar patterns.
• This is particularly useful in domains where instances are checked against previous scenarios, for example, detecting credit card fraud.
Neural Networks in Use
• Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs including:
  – sales forecasting
  – industrial process control
  – customer research
  – data validation
  – risk management
• ANNs are also used in the following specific paradigms: recognition of speakers in communications; diagnosis of hepatitis; undersea mine detection; texture analysis; three-dimensional object recognition; hand-written word recognition; and facial recognition.
Other (Linear) Classifiers
  f(x, w, b) = sign(w·x + b)
[Scatter plot: points labeled +1 and -1; a candidate line w·x + b = 0 separates the half-planes w·x + b > 0 and w·x + b < 0; x → f → y_est]
How would you classify this data?
Linear Classifiers
  f(x, w, b) = sign(w·x + b)
[Successive slides show different candidate separating lines for the same data]
How would you classify this data?
Any of these would be fine..
..but which is best?
Linear Classifiers
  f(x, w, b) = sign(w·x + b)
[A poorly chosen separating line leaves points misclassified to the +1 class]
How would you classify this data?
Classifier Margin
  f(x, w, b) = sign(w·x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
  f(x, w, b) = sign(w·x + b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
Linear SVM
Support vectors are those datapoints that the margin pushes up against.
1. Maximizing the margin is good.
2. It implies that only support vectors are important; other training examples are ignorable.
3. Empirically it works very, very well.
SVM applications• SVMs were originally proposed by Boser, Guyon and Vapnik in
1992 and gained increasing popularity in late 1990s.
• SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
• Most popular optimization algorithms for SVMs are SMO [Platt ’99] and SVMlight [Joachims’ 99], both use decomposition to hill-climb over a subset of αi’s at a time.
• Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually done in a try-and-see manner.
Support Vector Machines
• SVMs pick the best separating hyperplane according to some criterion, e.g., maximum margin.
• The training process is an optimisation.
• The training set is effectively reduced to a relatively small number of support vectors.
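A minimal sketch of training a linear SVM with scikit-learn on a toy separable problem (the data and C value are illustrative); after fitting, clf.support_vectors_ holds the small set of datapoints that define the margin.

```python
import numpy as np
from sklearn import svm

# Maximum-margin linear SVM on a toy two-class problem.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = [-1, -1, -1, 1, 1, 1]

clf = svm.SVC(kernel="linear", C=1.0)   # near-hard margin for separable data
clf.fit(X, y)

print(clf.support_vectors_)             # the datapoints defining the margin
print(clf.predict([[2.0, 2.0], [5.0, 5.0]]))   # -> [-1  1]
```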
Feature Spaces
• We may separate data by mapping to a higher-dimensional feature space
  – The feature space may even have an infinite number of dimensions!
• We need not explicitly construct the new feature space
• We may use Kernel functions to implicitly map to a new feature space
• Kernel fn:
• Kernel must be equivalent to an inner product in some feature space
( ) Rxx Î21,K
Example Kernels
  Linear:      x·z
  Polynomial:  (x·z)^P
  Gaussian:    exp(-||x - z||^2 / (2σ^2))
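These three kernels can be evaluated directly; a minimal Python/NumPy sketch follows (the vectors and the parameter values P and σ are illustrative).

```python
import numpy as np

# Each kernel value equals an inner product in some (possibly implicit)
# feature space; parameters P and sigma are illustrative choices.
def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, p=3):
    return np.dot(x, z) ** p

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.5])
print(linear_kernel(x, z), polynomial_kernel(x, z), gaussian_kernel(x, z))
```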
Perceptron Revisited: Linear Separators
• Binary classification can be viewed as the task of separating classes in feature space:
  f(x) = sign(w^T x + b)
[Figure: a line w^T x + b = 0 separating the half-planes w^T x + b > 0 and w^T x + b < 0]
• Which of the linear separators is optimal?
Best Linear Separator?
Find Closest Points in Convex Hulls
[Figure: closest points c and d of the two classes' convex hulls]
Plane Bisects Closest Points
  w^T x + b = 0,  with  w = d - c
Classification Margin
• The distance from an example x to the separator is  r = (w^T x + b) / ||w||
• Data closest to the hyperplane are support vectors.
• The margin ρ of the separator is the width of separation between classes.
Maximum Margin Classification
• Maximizing the margin is good according to intuition and theory.
• Implies that only support vectors are important; other training examples are ignorable.
Margins and Complexity
• A skinny margin is more flexible, thus more complex.
• A fat margin is less complex.
Nonlinear SVM - Overview
• SVM locates a separating hyperplane in the feature space and classifies points in that space.
• It does not need to represent the space explicitly; it simply defines a kernel function.
• The kernel function plays the role of the dot product in the feature space.
Properties of SVM
• Flexibility in choosing a similarity function
• Sparseness of solution when dealing with large data sets - only support vectors are used to specify the separating hyperplane
• Ability to handle large feature spaces - complexity does not depend on the dimensionality of the feature space
• Overfitting can be controlled by the soft margin approach
• Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
• Feature selection
SVM Applications
• SVM has been used successfully in many real-world problems:
  - text (and hypertext) categorization
  - image classification
  - bioinformatics (protein classification, cancer classification)
  - hand-written character recognition
Application 1: Cancer Classification
• High dimensional: p > 1000, n < 100
• Imbalanced: fewer positive samples
• Many irrelevant features
• Noisy
[Table: gene-expression data matrix with patients p-1 ... p-n as rows and genes g-1 ... g-p as columns]
Kernel with a regularizing diagonal term:  K[x, x] = k(x, x) + λ·n/N
FEATURE SELECTION
• In the linear case, w_i^2 gives the ranking of dimension i.
• SVM is sensitive to noisy (mis-labeled) data.
Weakness of SVM
• It is sensitive to noise
  - A relatively small number of mislabeled examples can dramatically decrease performance
• It only considers two classes
  - How to do multi-class classification with SVM?
  - Answer:
    1) With output arity m, learn m SVMs
       – SVM 1 learns "Output==1" vs "Output != 1"
       – SVM 2 learns "Output==2" vs "Output != 2"
       – ...
       – SVM m learns "Output==m" vs "Output != m"
    2) To predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region. (See the sketch below.)
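A minimal sketch of scheme 1) + 2) above using scikit-learn (the toy data, kernel choice, and cluster centers are illustrative assumptions): train one binary SVM per class and pick the class whose decision value is largest.

```python
import numpy as np
from sklearn import svm

# One-vs-rest multi-class SVM: m binary machines, "class k" vs. "rest".
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 20)

machines = []
for k in range(3):
    clf = svm.SVC(kernel="linear").fit(X, (y == k).astype(int))
    machines.append(clf)

def predict(x):
    # Pick the SVM whose prediction is furthest into the positive region.
    scores = [clf.decision_function([x])[0] for clf in machines]
    return int(np.argmax(scores))

print(predict(np.array([3.1, 2.9])))   # -> 1 (typically)
```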
Application 2: Text Categorization
• Task: the classification of natural text (or hypertext) documents into a fixed number of predefined categories based on their content.
  - email filtering, web searching, sorting documents by topic, etc.
• A document can be assigned to more than one category, so this can be viewed as a series of binary classification problems, one for each category.
Representation of Text
IR's vector space model (aka bag-of-words representation):
• A doc is represented by a vector indexed by a pre-fixed set or dictionary of terms
• Values of an entry can be binary or weights
• Normalization, stop words, word stems
• Doc x => φ(x)
Text Categorization using SVM
• The distance between two documents is φ(x)·φ(z)
• K(x, z) = φ(x)·φ(z) is a valid kernel, so SVM can be used with K(x, z) for discrimination.
• Why SVM?
  - High dimensional input space
  - Few irrelevant features (dense concept)
  - Sparse document vectors (sparse instances)
  - Text categorization problems are linearly separable
Some Issues
• Choice of kernel
  - Gaussian or polynomial kernel is the default
  - if ineffective, more elaborate kernels are needed
  - domain experts can give assistance in formulating appropriate similarity measures
• Choice of kernel parameters
  - e.g., σ in the Gaussian kernel
  - σ is the distance between the closest points with different classifications
  - in the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters
• Optimization criterion - hard margin vs. soft margin
  - a lengthy series of experiments in which various parameters are tested
Additional Resources
• An excellent tutorial on VC-dimension and Support Vector Machines:
  C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
• The VC/SRM/SVM Bible:
  Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.
• http://www.kernel-machines.org/
A Case Study on Face Detection and Recognition
Feature-based face matching
[Pipeline: face image (from face detection) → normalization → feature extraction → feature vector → classifier → decision maker → output results]
• You can extract various features
• You can use various classifiers
• You can use various decision makers
Normalization
Eye-location normalization: rotation normalization and scale normalization.
Normalized cross-correlation between object I and template T:
  C_N(y) = (I(y) - mean(I)) (T - mean(T)) / (σ_I σ_T)
averaged over objects.
Feature extraction
• Eyebrow thickness and vertical position at the eye center position
• A coarse description of the left eyebrow's arches
• Nose vertical position and width
• Mouth vertical position, width, height of upper and lower lips
• Eleven radii describing the chin shape
• Bigonial breadth (face width at nose position)
• Zygomatic breadth (face width halfway between nose tip and eyes)
[Figure: example of some geometrical features]
Classifier
Bayes classifier example: for a feature vector x, compute the distance
  D_j(x) = (x - m_j)^T Σ^{-1} (x - m_j),  j = 1, 2, ..., N
to each person's mean feature vector m_j, rank the distance values D_j(x), and output the results.
This is just one example of a classifier!
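A minimal Python/NumPy sketch of this distance-ranking stage; the mean vectors, covariance, and query are toy values, and Σ is assumed to be a covariance matrix shared across identities.

```python
import numpy as np

# Rank identities by Mahalanobis-like distance D_j(x), as above.
def distances(x, means, cov_inv):
    return np.array([(x - m).T @ cov_inv @ (x - m) for m in means])

means = np.array([[0.0, 0.0], [2.0, 1.0], [4.0, 4.0]])  # one mean per person
cov_inv = np.linalg.inv(np.array([[1.0, 0.2], [0.2, 1.0]]))

x = np.array([1.8, 1.1])          # query feature vector
D = distances(x, means, cov_inv)
print(np.argsort(D))              # identities ranked by increasing distance
```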
Template matching
[Pipeline: face image (from face detection) → normalization → produce a template → matching against a templates database → decision maker → output results]
• You have to create the database of templates for all people you want to recognize.
• Different templates are used in various regions of the normalized face.
• Various methods can be used to compress the information in each template.
Example-Based Learning Approach
Three parts:
• The image is divided into many possibly-overlapping windows,
  – each window pattern gets classified as either "a face" or "not a face" based on a set of local image measurements.
• For each new pattern to be classified, the system computes a set of difference measurements between the new pattern and the canonical face model.
• A trained classifier identifies the new pattern as "a face" or "not a face".
Example of a system using EBL
• Kanade et al. first proposed an NN-based approach in 1996.
• Although NN have received significant attention in many research areas, few applications were successful in face recognition.
Why?
Neural Nets
Neural network (NN)
• It's easy to train a neural network with samples which contain faces, but it is much harder to train a neural network with samples which do not.
• The number of "non-face" samples is just too large.
Neural network (NN)
• Neural network-based filter:
  – A small filter window is used to scan through all portions of the image,
  – and to detect whether a face exists in each window.
• Merging overlapping detections and arbitration: by setting a small threshold, many false detections can be eliminated.
Rowley and Kanade’s Approach!
Test results of using NN