Object Recognition with Deformable Models
Pedro F. Felzenszwalb
Department of Computer Science
University of Chicago
Joint work with: Dan Huttenlocher, Joshua Schwartz, David McAllester, Deva Ramanan.
Example Problems
Detecting non-rigid objects
PASCAL challenge
Segmenting cells
Medical image analysis
Detecting rigid objects
Deformable Models
• Significant challenge:
- Handling variation in appearance within object classes
- Non-rigid objects, generic categories, etc.
• Deformable models approach:
- Consider each object as a deformed version of a template
- Compact representation
- Leads to interesting modeling and algorithmic problems
Overview
• Part I: Pictorial Structures
- Deformable part models
- Highly efficient matching algorithms
• Part II: Deformable Shapes
- Triangulated polygons
- Hierarchical models
• Part III: The PASCAL Challenge
- Recognizing 20 object categories in realistic scenes
- Discriminatively trained, multiscale, deformable part models
Part I: Pictorial Structures
• Introduced by Fischler and Elschlager in 1973
• Part-based models:
- Each part represents local visual properties
- “Springs” capture spatial relationships
Matching model to image involves joint optimization of part locations
“stretch and fit”
Local Evidence + Global Decision
• Parts have a match quality at each image location
• Local evidence is noisy
- Parts are detected in the context of the whole model
[Figure: a part, a test image, and the part's match quality over the image]
Matching Problem
• Model is represented by a graph G = (V, E)
- V = {v_1, ..., v_n} are the parts
- (v_i, v_j) ∈ E indicates a connection between parts
• m_i(l_i) is a cost for placing part i at location l_i
• d_{ij}(l_i, l_j) is a deformation cost
• Optimal configuration for the object is L = (l_1, ..., l_n) minimizing
E(L) = \sum_{i=1}^{n} m_i(l_i) + \sum_{(v_i,v_j) \in E} d_{ij}(l_i, l_j)
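The energy above can be evaluated directly for a small model. The following is a minimal sketch, not the authors' implementation: the names `energy` and `brute_force`, the 0-based part indices, and the table layout `match_cost[i][l]` are assumptions made for illustration. The exhaustive search is only practical for tiny models, which is what motivates the faster algorithms on the following slides.

```python
from itertools import product

def energy(L, match_cost, edges, deform_cost):
    """E(L): sum of part match costs plus deformation costs over the edges."""
    unary = sum(match_cost[i][L[i]] for i in range(len(L)))
    pairwise = sum(deform_cost(i, j, L[i], L[j]) for (i, j) in edges)
    return unary + pairwise

def brute_force(match_cost, edges, deform_cost):
    """Try every configuration L (k**n of them) and return the best one."""
    n, k = len(match_cost), len(match_cost[0])
    best = min(product(range(k), repeat=n),
               key=lambda L: energy(L, match_cost, edges, deform_cost))
    return energy(best, match_cost, edges, deform_cost), best

# Toy model: 2 parts, 3 locations each, one spring with squared-distance cost.
toy_cost = [[3, 1, 4], [1, 5, 9]]
toy_edges = [(0, 1)]
toy_deform = lambda i, j, li, lj: (li - lj) ** 2
best_energy, best_L = brute_force(toy_cost, toy_edges, toy_deform)
```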
Matching Problem
• Assume n parts, k possible locations for each part
- There are k^n configurations L
• If graph is a tree we can use dynamic programming
- O(nk^2) algorithm
• If d_{ij}(l_i, l_j) = g(l_i - l_j) we can use min-convolutions
- O(nk) algorithm
- As fast as matching each part separately!
E(L) = \sum_{i=1}^{n} m_i(l_i) + \sum_{(v_i,v_j) \in E} d_{ij}(l_i, l_j)
Dynamic Programming on Trees
E(L) = \sum_{i=1}^{n} m_i(l_i) + \sum_{(v_i,v_j) \in E} d_{ij}(l_i, l_j)
• For each l_1 find best l_2:
- Best_2(l_1) = \min_{l_2} [m_2(l_2) + d_{12}(l_1, l_2)]
• “Delete” v_2 and solve problem with smaller model
• Keep removing leaves until there is a single part left
[Figure: model with part v_2 attached as a leaf to part v_1]
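The leaf-removal procedure can be sketched as a recursion over the tree. This is an illustrative sketch under assumptions: the function name `match_tree`, the `children` dictionary describing the tree, and the 0-based part indices are not from the slides. Each edge costs O(k^2) work, giving O(nk^2) overall.

```python
def match_tree(match_cost, children, deform_cost, root=0):
    """Minimize E(L) for a tree-structured model by dynamic programming."""
    k = len(match_cost[0])
    best_choice = {}  # best_choice[c][l_parent] = best location for child c

    def solve(v):
        # cost[l] = best cost of the subtree rooted at v when v sits at l
        cost = list(match_cost[v])
        for c in children.get(v, []):
            child_cost = solve(c)
            choice = []
            for lp in range(k):
                # Best_c(lp): best child location given the parent location
                lc = min(range(k),
                         key=lambda l: child_cost[l] + deform_cost(v, c, lp, l))
                choice.append(lc)
                cost[lp] += child_cost[lc] + deform_cost(v, c, lp, lc)
            best_choice[c] = choice
        return cost

    root_cost = solve(root)
    # Trace back from the root to recover the optimal configuration.
    L = {root: min(range(k), key=root_cost.__getitem__)}
    stack = [root]
    while stack:
        v = stack.pop()
        for c in children.get(v, []):
            L[c] = best_choice[c][L[v]]
            stack.append(c)
    return root_cost[L[root]], L

# Chain of three parts with two locations each and |l_i - l_j| springs.
chain_energy, chain_L = match_tree(
    [[0, 5], [5, 0], [0, 5]],
    {0: [1], 1: [2]},
    lambda i, j, li, lj: abs(li - lj),
)
```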
Min-Convolution Speedup
Best_2(l_1) = \min_{l_2} [m_2(l_2) + d_{12}(l_1, l_2)]
• Brute force: O(k^2), where k is the number of locations
• Suppose d_{12}(l_1, l_2) = g(l_1 - l_2):
- Best_2(l_1) = \min_{l_2} [m_2(l_2) + g(l_1 - l_2)]
• Min-convolution: O(k) if g is convex
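As one concrete case, the sketch below computes the min-convolution in linear time for the specific choice g(x) = x^2 on a one-dimensional grid, using the lower envelope of parabolas. The quadratic cost, the 1D grid, and the function name are assumptions for illustration; the slides only state the general condition on g. Costs are assumed to be finite.

```python
def min_convolution_quadratic(m):
    """out[p] = min over q of m[q] + (p - q)**2, in O(k) time."""
    k = len(m)
    v = [0] * k            # positions of parabolas on the lower envelope
    z = [0.0] * (k + 1)    # boundaries between consecutive envelope pieces
    z[0], z[1] = float("-inf"), float("inf")
    j = 0                  # index of the rightmost parabola on the envelope
    for q in range(1, k):
        while True:
            p = v[j]
            # Horizontal position where the parabolas rooted at p and q cross
            s = ((m[q] + q * q) - (m[p] + p * p)) / (2 * q - 2 * p)
            if s <= z[j]:
                j -= 1     # parabola at p is hidden by the new one; drop it
            else:
                break
        j += 1
        v[j] = q
        z[j] = s
        z[j + 1] = float("inf")
    out = [0] * k
    j = 0
    for p in range(k):
        while z[j + 1] < p:
            j += 1         # advance to the envelope piece that covers p
        out[p] = m[v[j]] + (p - v[j]) ** 2
    return out

best2 = min_convolution_quadratic([4, 1, 9, 0, 7])
```

Each parabola is added to and removed from the envelope at most once, which is why the total work is linear in k rather than quadratic.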
Finding Motorbikes
Model with 6 parts: 2 wheels, 2 headlights, front & back of seat
Human Pose Estimation
Human Tracking
Ramanan, Forsyth, Zisserman, "Tracking People by Learning their Appearance," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Jan. 2007
Part II: Deformable Shapes
• Shape is a fundamental cue for recognizing objects
• Many objects have no well defined parts
- We can capture their outlines using deformable models
Triangulated Polygons
• Polygonal templates
• Delaunay triangulation gives a natural decomposition of an object
• Consider deforming each triangle “independently”
Rabbit ear can be bent by changing the shape of a single triangle
Structure of Triangulated Polygons
There are 2 graphs associated with a triangulated polygon
If the polygon is simple (no holes):
- Dual graph is a tree
- Graphical structure of triangulation is a 2-tree
Deformable Matching
Matching to MRI data
Model
Consider piecewise affine maps from model to image (taking triangles to triangles)
Find globally optimal deformation using dynamic programming over 2-tree
Hierarchical Shape Model
• Shape-tree of curve from a to b:
- Select midpoint c, store relative location c | a,b.
- Left child is a shape-tree of sub-curve from a to c.
- Right child is a shape-tree of sub-curve from c to b.
[Figure: curve from a to b through sample points f, e, g, c, h, d, i, and its shape-tree]
c | a,b
e | a,c    d | c,b
f | a,e    g | e,c    h | c,d    i | d,b
Deformations
• Independently perturb relative locations stored in a shape-tree
- Local and global properties are preserved
- Reconstructed curve is perceptually similar to original
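The shape-tree construction and the perturbation of its stored relative locations can be sketched as follows. This is a minimal illustration under assumptions that are not in the slides: curve points are represented as complex numbers, the relative location c | a,b is encoded as (c - a) / (b - a), the perturbation is Gaussian noise, and the names `build_shape_tree` and `reconstruct` are hypothetical. The curve is assumed to be open with distinct sample points.

```python
import random

def build_shape_tree(pts, lo=0, hi=None):
    """Shape-tree of the sub-curve pts[lo..hi]; points are complex numbers."""
    if hi is None:
        hi = len(pts) - 1
    if hi - lo < 2:
        return None  # no interior sample point left
    mid = (lo + hi) // 2
    a, b, c = pts[lo], pts[hi], pts[mid]
    return {
        "rel": (c - a) / (b - a),  # c | a,b (assumed encoding)
        "left": build_shape_tree(pts, lo, mid),
        "right": build_shape_tree(pts, mid, hi),
    }

def reconstruct(tree, a, b, noise=0.0, rng=None):
    """Interior points of the curve from a to b, optionally perturbed."""
    if tree is None:
        return []
    rel = tree["rel"]
    if noise:
        # Independently perturb the relative location stored at this node
        rel += complex(rng.gauss(0, noise), rng.gauss(0, noise))
    c = a + rel * (b - a)
    return (reconstruct(tree["left"], a, c, noise, rng)
            + [c]
            + reconstruct(tree["right"], c, b, noise, rng))

curve = [0j, 1 + 1j, 2 + 0.5j, 3 + 2j, 4 + 0j]
tree = build_shape_tree(curve)
exact = reconstruct(tree, curve[0], curve[-1])
deformed = reconstruct(tree, curve[0], curve[-1],
                       noise=0.05, rng=random.Random(0))
```

With this encoding, reconstructing from different endpoints a and b yields a translated, rotated, and scaled copy of the curve, since only relative locations are stored.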
Matching
[Figure: model curve and its shape-tree, with node w = (e | a,c) and its children v and u, matched to a curve through points p, q, r]
Match(v, [p,q]) = w1
Match(u, [q,r]) = w2
Match(w, [p,r]) = w1 + w2 + dif((e|a,c), (q|p,r))
Similar to parsing with the CKY algorithm
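The recurrence can be sketched as a memoized search over intervals of the target curve, in the spirit of CKY parsing. Several choices here are assumptions rather than details from the slides: points are complex numbers, the relative location is encoded as (c - a) / (b - a), dif is the squared distance between relative locations, an empty subtree matches any interval at zero cost, and the name `shape_match_cost` is hypothetical.

```python
from functools import lru_cache

def rel(c, a, b):
    """Relative location c | a,b (assumed encoding)."""
    return (c - a) / (b - a)

def build_shape_tree(pts, lo, hi):
    """Nested tuples (relative location, left subtree, right subtree)."""
    if hi - lo < 2:
        return None
    mid = (lo + hi) // 2
    return (rel(pts[mid], pts[lo], pts[hi]),
            build_shape_tree(pts, lo, mid),
            build_shape_tree(pts, mid, hi))

def shape_match_cost(model_pts, image_pts):
    """Cost of the best match of the model shape-tree to the image curve."""
    tree = build_shape_tree(model_pts, 0, len(model_pts) - 1)

    @lru_cache(maxsize=None)
    def match(node, i, j):
        # Best cost of matching the subtree `node` to image_pts[i..j]
        if node is None:
            return 0.0
        r, left, right = node
        best = float("inf")
        for k in range(i + 1, j):  # candidate image midpoint q
            dif = abs(r - rel(image_pts[k], image_pts[i], image_pts[j])) ** 2
            best = min(best, match(left, i, k) + match(right, k, j) + dif)
        return best

    return match(tree, 0, len(image_pts) - 1)

model = [0j, 1 + 1j, 2 + 0.5j, 3 + 2j, 4 + 0j]
moved = [(2 + 1j) * p + (3 - 1j) for p in model]   # rotated, scaled, shifted
other = [0j, 1 - 1j, 2 + 0j, 3 - 2j, 4 + 0j]
same_cost = shape_match_cost(model, moved)
other_cost = shape_match_cost(model, other)
```

Here `same_cost` is zero up to rounding, because a rotated, scaled, and shifted copy has the same relative locations, while `other_cost` is strictly positive.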
Recognizing Leaves
15 species, 75 examples per species (25 training, 50 test)
Nearest neighbor classification accuracy (%):
Shape-tree       96.28
Inner distance   94.13
Shape context    88.12
Part III: PASCAL Challenge
• ~10,000 images, with ~25,000 target objects
- Objects from 20 categories (person, car, bicycle, cow, table...)
- Objects are annotated with labeled bounding boxes
Model Overview
Model has a root filter plus deformable parts
[Figure: a detection, with the model's root filter, part filters, and deformation models]
Histogram of Gradient (HOG) Features
• Image is partitioned into 8x8 pixel blocks
• In each block we compute a histogram of gradient orientations
- Invariant to changes in lighting, small deformations, etc.
• We compute features at different resolutions (pyramid)
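A gradient orientation histogram for a single block can be sketched as below. The number of orientation bins (9), the use of unsigned orientations, central differences, and magnitude-weighted voting are assumptions for illustration; the slides only specify the 8x8 block size. The normalization step that provides the lighting invariance is omitted to keep the sketch short.

```python
import math

def cell_histogram(img, top, left, cell=8, bins=9):
    """Histogram of gradient orientations for one cell of a grayscale image."""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(top, top + cell):
        for x in range(left, left + cell):
            # Central differences, clamped at the image border
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi  # unsigned, in [0, pi)
            b = min(int(angle / math.pi * bins), bins - 1)
            hist[b] += magnitude  # each pixel votes with its gradient magnitude
    return hist

# An 8x8 image whose brightness increases from left to right.
ramp = [[x for x in range(8)] for _ in range(8)]
ramp_hist = cell_histogram(ramp, 0, 0)
```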
Filters
• Filters are rectangular templates defining weights for features
• Score is dot product of filter and subwindow of HOG pyramid
[Figure: image pyramid and HOG feature pyramid, with a filter H placed over a subwindow W of the HOG pyramid]
Score of H at this location is H ⋅ W
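The filter score is a plain dot product between the filter weights and the features in the subwindow under it. The sketch below assumes one pyramid level stored as a nested list `features[y][x]` of per-cell feature vectors and a filter stored the same way; these layouts and the function names are assumptions for illustration.

```python
def filter_score(features, filt, top, left):
    """Dot product of a filter with the subwindow whose corner is (top, left)."""
    score = 0.0
    for dy, filt_row in enumerate(filt):
        for dx, weights in enumerate(filt_row):
            cell = features[top + dy][left + dx]
            score += sum(w * f for w, f in zip(weights, cell))
    return score

def score_map(features, filt):
    """Filter score at every placement that fits inside the feature map."""
    rows = len(features) - len(filt) + 1
    cols = len(features[0]) - len(filt[0]) + 1
    return [[filter_score(features, filt, y, x) for x in range(cols)]
            for y in range(rows)]

# 2x3 grid of cells with 2-dimensional features, and a 1x2 filter.
toy_features = [[[1, 0], [0, 1], [1, 1]],
                [[0, 0], [2, 0], [0, 2]]]
toy_filter = [[[1, 1], [1, -1]]]
toy_scores = score_map(toy_features, toy_filter)
```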
Object Hypothesis
[Figure: image pyramid and HOG feature pyramid]
Multiscale model captures features at two resolutions
Score is sum of filter scores plus deformation scores
Training
• Training data consists of images with labeled bounding boxes
• Need to learn the model structure, filters and deformation costs
Training
Connection With Linear Classifiers
• w is a model: concatenation of filters and deformation parameters
• x is a detection window
• z are filter placements
• Feature vector for (x, z): concatenation of features and part displacements
• Score of model is sum of filter scores plus deformation scores
- Bounding box in training data specifies that score should be high for some placement in a range
Latent SVMs
• Objective: regularization + hinge loss
• Linear in w if z is fixed
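The equations on these slides did not survive extraction. A commonly used form of the latent SVM score and training objective, consistent with the labels above, is sketched here. The symbols \Phi (feature vector), C (regularization trade-off), and y_i \in \{-1, +1\} (example labels) are assumptions introduced for illustration and may differ from the notation on the original slides.

```latex
% Score of window x: best placement z of the filters
f_w(x) = \max_{z} \; w \cdot \Phi(x, z)

% Training objective: regularization term plus hinge loss
\min_{w} \; \frac{1}{2} \lVert w \rVert^2
  + C \sum_{i} \max\bigl(0,\; 1 - y_i \, f_w(x_i)\bigr)
```

For a fixed z the score w \cdot \Phi(x, z) is linear in w, which is what connects the model to ordinary linear classifiers.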
Learned Models
Bottle
Car
Bicycle
Sofa
Example Results
More Results
Overall Results
• 9 systems competed in the 2007 challenge
• Out of 20 classes we get:
- First place in 10 classes
- Second place in 6 classes
• Some statistics:
- It takes ~2 seconds to evaluate a model in one image
- It takes ~3 hours to train a model
- MUCH faster than most systems
Component Analysis
[Figure: precision vs. recall curves, PASCAL2006 Person]
Root (0.18), Root+Latent (0.24), Parts+Latent (0.29), Root+Parts+Latent (0.34)
Summary
• Deformable models provide an elegant framework for object detection and recognition
- Efficient algorithms for matching models to images
- Applications: pose estimation, medical image analysis, object recognition, etc.
• We can learn models from partially labeled data
- Generalized standard ideas from machine learning
- Leads to state-of-the-art results in PASCAL challenge
• Future work: hierarchical models, grammars, 3D objects