SVM classifiers
Binary classification
Given training data (x_i, y_i) for i = 1 . . . N, with x_i ∈ R^d and y_i ∈ {−1, +1},
learn a classifier f(x) such that
f(x_i) ≥ 0 if y_i = +1, and f(x_i) < 0 if y_i = −1
i.e. y_i f(x_i) > 0 for a correct classification.
Linear separability
[Figure: two example point sets, one linearly separable and one not linearly separable]
Linear classifiers
A linear classifier has the form
f(x) = wᵀx + b
• In 2D the discriminant is a line
• w is the normal to the line, and b the bias
• w is known as the weight vector
Linear classifiers
A linear classifier has the form
f(x) = wᵀx + b
• In 3D the discriminant is a plane, and in nD it is a hyperplane
For a K-NN classifier it was necessary to βcarryβ the training data
For a linear classifier, the training data is used to learn w and is then discarded
Only w is needed for classifying new data
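This point can be made concrete: once w and b are learned, classifying a new point is a single dot product. A minimal sketch in NumPy, with illustrative (not learned) parameter values:

```python
import numpy as np

# Illustrative parameters; in practice w and b come from training
w = np.array([1.0, -2.0])   # weight vector (normal to the decision line)
b = 0.5                     # bias

def classify(x):
    """Return +1 or -1 according to the sign of f(x) = w^T x + b."""
    return 1 if w @ x + b >= 0 else -1

print(classify(np.array([3.0, 1.0])))   # f = 3 - 2 + 0.5 = 1.5, so +1
print(classify(np.array([0.0, 2.0])))   # f = 0 - 4 + 0.5 = -3.5, so -1
```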
Reminder: The Perceptron Classifier
Given linearly separable data x_i labelled into two categories y_i ∈ {−1, +1}, find a weight vector w such that the discriminant function
f(x_i) = wᵀx_i + b
separates the categories for i = 1, …, N
• How can we find this separating hyperplane?
The Perceptron Algorithm
Write the classifier as f(x_i) = w̃ᵀx̃_i + w₀ = wᵀx_i, where w = (w̃, w₀) and x_i = (x̃_i, 1)
• Initialize w = 0
• Cycle through the data points {x_i, y_i}
• If x_i is misclassified then w ← w + α y_i x_i
• Until all the data is correctly classified
For example in 2D
• Initialize w = 0
• Cycle through the data points {x_i, y_i}
• If x_i is misclassified then w ← w + α y_i x_i
• Until all the data is correctly classified
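The loop above translates directly into code. A sketch using the augmented vectors w = (w̃, w₀) and x_i = (x̃_i, 1); the training set here is a made-up separable toy example:

```python
import numpy as np

def perceptron(X, y, alpha=1.0, max_epochs=100):
    """Perceptron on augmented vectors: w = (w_tilde, w0), x_i = (x_tilde_i, 1)."""
    Xa = np.hstack([X, np.ones((len(X), 1))])   # append 1 so the bias is part of w
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:       # misclassified (or on the boundary)
                w += alpha * yi * xi     # push w towards the correct side
                mistakes += 1
        if mistakes == 0:                # all points correctly classified
            break
    return w

# Toy linearly separable data
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron(X, y)
```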
• If the data is linearly separable, then the algorithm will converge
• Convergence can be slow …
• Separating line close to training data
• We would prefer a larger margin for generalization
• Maximum margin solution: most stable under perturbations of the inputs
Support Vector Machine
SVM β sketch derivation
• Since wᵀx + b = 0 and c(wᵀx + b) = 0 (for c ≠ 0) define the same plane, we have the freedom to choose the normalization of w
• Choose the normalization such that wᵀx⁺ + b = +1 and wᵀx⁻ + b = −1 for the positive and negative support vectors respectively
• Then the margin is given by
(w / ‖w‖)ᵀ (x⁺ − x⁻) = wᵀ(x⁺ − x⁻) / ‖w‖ = 2 / ‖w‖
Support Vector Machine: linearly separable data
SVM β Optimization
• Learning the SVM can be formulated as an optimization:
max_w 2/‖w‖ subject to wᵀx_i + b ≥ +1 if y_i = +1, and wᵀx_i + b ≤ −1 if y_i = −1, for i = 1 . . . N
• Or equivalently
min_w ‖w‖² subject to y_i(wᵀx_i + b) ≥ 1 for i = 1 . . . N
• This is a quadratic optimization problem subject to linear constraints and there is a unique minimum
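Because it is a small quadratic program, the hard-margin problem can be solved directly with a general-purpose constrained optimizer. A sketch using SciPy's SLSQP on a made-up 2D toy set (dedicated QP/SVM solvers are used in practice); the margin 2/‖w‖ can then be read off the solution:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Parameter vector p = (w1, w2, b); minimize ||w||^2
objective = lambda p: p[0] ** 2 + p[1] ** 2

# One linear constraint  y_i (w^T x_i + b) - 1 >= 0  per training point
constraints = [{"type": "ineq",
                "fun": (lambda p, i=i: y[i] * (p[:2] @ X[i] + p[2]) - 1.0)}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
margin = 2.0 / np.linalg.norm(w)   # width of the margin for the learned plane
```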
Linear separability again: What is the best w?
• The points can be linearly separated but there is a very narrow margin
In general there is a trade-off between the margin and the number of mistakes on the training data
• But possibly the large margin solution is better, even though one constraint is violated
Introduce "slack" variables ξ_i ≥ 0
"Soft" margin solution
The optimization problem becomes
min_{w ∈ R^d, ξ_i ∈ R⁺} ‖w‖² + C Σ_i ξ_i
subject to
y_i(wᵀx_i + b) ≥ 1 − ξ_i for i = 1 . . . N
• Every constraint can be satisfied if ξ_i is sufficiently large
• C is the regularization parameter:
- small C allows constraints to be easily ignored → large margin
- large C makes constraints hard to ignore → narrow margin
- C = ∞ enforces all constraints: hard margin
• This is still a quadratic optimization problem and there is a unique minimum.
Note, there is only one parameter, C.
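The soft-margin objective is convex, so even a plain (sub)gradient descent on ‖w‖² + C Σ_i ξ_i, with ξ_i = max(0, 1 − y_i(wᵀx_i + b)), reaches the minimum. A NumPy sketch on a made-up toy set; the data, step size, and epoch count are all illustrative:

```python
import numpy as np

def soft_margin_svm(X, y, C=10.0, lr=0.01, epochs=2000):
    """Subgradient descent on  ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                                    # points with nonzero slack
        grad_w = 2 * w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = soft_margin_svm(X, y, C=10.0)   # large C: behaves like a hard margin here
```

Re-running with a small C lets the margin constraints be violated more cheaply, widening the margin, which is exactly the trade-off the slides describe.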
• Data is linearly separable
• But only with a narrow margin
C = ∞: hard margin
C = 10: soft margin
Application: Pedestrian detection in
Computer Vision
• Objective: detect (localize) standing humans in an image
(cf. face detection with a sliding-window classifier)
• reduces object detection to binary classification
• does an image window contain a person or not?
Detection problem → (binary) classification problem
Each window is separately classified
Training data
• 64x128 images of humans cropped from a varied set of personal photos
• Positive data – 1239 positive window examples (with reflections → 2478)
• Negative data – 1218 person-free training photos (giving 12180 patches)
Training
• A preliminary detector
• Trained with (2478) vs (12180) samples
• Retraining
• With the augmented data set
• initial 12180 + hard examples
• Hard examples
• The 1218 negative training photos are searched exhaustively for false positives
Feature: histogram of oriented gradients
(HOG)
Averaged examples
Algorithm
• Training (learning)
• Represent each example window by a HOG feature vector
• Train an SVM classifier
• Testing (detection)
• Sliding-window classifier
f(x) = wᵀx + b
x_i ∈ R^d, with d = 1024
Learned model
f(x) = wᵀx + b
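The detection stage can be sketched as a sliding-window loop: each window is converted to a feature vector and scored with the learned f(x) = wᵀx + b. The feature function below is a stand-in (a real detector would compute HOG), and all parameter values are illustrative:

```python
import numpy as np

def sliding_window_detections(image, w, b, win_h=128, win_w=64, stride=8,
                              feature_fn=None):
    """Score every win_h x win_w window of a grayscale image with f(x) = w^T x + b."""
    if feature_fn is None:
        # Stand-in feature: flattened, normalized pixels (a real detector uses HOG)
        feature_fn = lambda patch: patch.ravel() / 255.0
    H, W = image.shape
    detections = []
    for top in range(0, H - win_h + 1, stride):
        for left in range(0, W - win_w + 1, stride):
            x = feature_fn(image[top:top + win_h, left:left + win_w])
            score = w @ x + b
            if score >= 0:                      # classified as "person"
                detections.append((top, left, score))
    return detections
```

Each window is classified independently, exactly as in the slides; in practice the raw detections are then merged (e.g. by non-maximum suppression).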