Lecture 14: Introduction to Object Recognition & Bag‐of‐Words (BoW) Models

Professor Fei‐Fei Li, Stanford Vision Lab

8‐Nov‐111

What will we learn today?

• Introduction to object recognition
  – Representation
  – Learning
  – Recognition

• Bag of Words models (Problem Set 4 (Q2))
  – Basic representation
  – Different learning and recognition algorithms

What are the different visual recognition tasks?

Classification: Does this image contain a building? [yes/no]

Classification: Is this a beach?

Image Search

Organizing photo collections

Detection: Does this image contain a car? [where?]

Detection: Which object does this image contain? [where?]

[Figure labels: building, person, car]

Detection: Accurate localization (segmentation)

Detection: Estimating object semantic & geometric attributes

• Object: Person, back; 1‐2 meters away
• Object: Police car, side view, 4‐5 m away
• Object: Building, 45º pose, 8‐10 meters away; it has bricks

Applications of computer vision

• Surveillance
• Assistive technologies
• Security
• Assistive driving
• Computational photography

Categorization vs Single instance recognition

Does this image contain the Chicago Macy’s building?

Categorization vs Single instance recognition

Where is the crunchy nut?

Applications of computer vision

• Recognizing landmarks in mobile platforms

Activity or Event recognition

What are these people doing?

Visual Recognition

• Design algorithms that are capable of
  – Classifying images or videos
  – Detecting and localizing objects
  – Estimating semantic and geometrical attributes
  – Classifying human activities and events

Why is this challenging?

How many object categories are there?

Challenges: viewpoint variation

Michelangelo 1475-1564

Challenges: illumination

image credit: J. Koenderink

Challenges: scale

Challenges: deformation

Challenges: occlusion

Magritte, 1957

Challenges: background clutter

Kilmeny Niland. 1995

Challenges: intra‐class variation

Some early works on object categorization

• Turk and Pentland, 1991
• Belhumeur, Hespanha, & Kriegman, 1997
• Schneiderman & Kanade, 2004
• Viola and Jones, 2000
• Amit and Geman, 1999
• LeCun et al., 1998
• Belongie and Malik, 2002
• Agarwal and Roth, 2002
• Poggio et al., 1993

Basic issues

• Representation
  – How to represent an object category; which classification scheme?

• Learning
  – How to learn the classifier, given training data

• Recognition
  – How the classifier is to be used on novel data

Representation
– Building blocks: Sampling strategies
  • Interest operators
  • Multiple interest operators
  • Dense, uniformly
  • Randomly

Image credits: L. Fei‐Fei, E. Nowak, J. Sivic

Representation
– Appearance only or location and appearance

Representation

– Invariances
  • View point
  • Illumination
  • Occlusion
  • Scale
  • Deformation
  • Clutter
  • etc.

Representation

– To handle intra‐class variability, it is convenient to describe an object category using probabilistic models

– Object models: Generative vs Discriminative vs hybrid

Object categorization: the statistical viewpoint

$$p(\text{zebra} \mid \text{image}) \quad \text{vs.} \quad p(\text{no zebra} \mid \text{image})$$

• Bayes rule:

$$\underbrace{\frac{p(\text{zebra} \mid \text{image})}{p(\text{no zebra} \mid \text{image})}}_{\text{posterior ratio}} = \underbrace{\frac{p(\text{image} \mid \text{zebra})}{p(\text{image} \mid \text{no zebra})}}_{\text{likelihood ratio}} \cdot \underbrace{\frac{p(\text{zebra})}{p(\text{no zebra})}}_{\text{prior ratio}}$$

• Discriminative methods model posterior

• Generative methods model likelihood and prior
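As a small worked example of the Bayes rule ratio (posterior ratio = likelihood ratio × prior ratio), the sketch below plugs in made-up numbers. The function name and all numeric values are illustrative assumptions, not taken from the lecture.

```python
def posterior_ratio(lik_zebra, lik_no_zebra, prior_zebra):
    """Posterior odds p(zebra|image) / p(no zebra|image) via Bayes rule."""
    likelihood_ratio = lik_zebra / lik_no_zebra
    prior_ratio = prior_zebra / (1.0 - prior_zebra)
    return likelihood_ratio * prior_ratio


# Hypothetical numbers: the image is 20x more likely under "zebra",
# but zebras are assumed to appear in only 1% of images.
odds = posterior_ratio(lik_zebra=0.02, lik_no_zebra=0.001, prior_zebra=0.01)

# Decide "zebra" only when the posterior ratio exceeds 1.
decision = "zebra" if odds > 1.0 else "no zebra"
```

With these numbers the strong likelihood ratio is outweighed by the small prior, so the posterior ratio stays below 1.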

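Discriminative models

• Modeling the posterior ratio:

$$\frac{p(\text{zebra} \mid \text{image})}{p(\text{no zebra} \mid \text{image})}$$

[Figure: decision boundary separating zebra from non‐zebra examples]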

Discriminative models

• Nearest neighbor (10^6 examples): Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005...
• Neural networks: LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998…
• Support Vector Machines: Guyon, Vapnik, Heisele, Serre, Poggio…
• Latent SVM, Structural SVM: Felzenszwalb 00; Ramanan 03…
• Boosting: Viola, Jones 2001; Torralba et al. 2004; Opelt et al. 2006,…

Source: Vittorio Ferrari, Kristen Grauman, Antonio Torralba

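Generative models

• Modeling the likelihood ratio:

$$\frac{p(\text{image} \mid \text{zebra})}{p(\text{image} \mid \text{no zebra})}$$

[Figure: class‐conditional densities p(x|C1) and p(x|C2) plotted over x from 0 to 1]

[Table: for two example images, p(image | zebra) and p(image | no zebra) are rated High/Low for one image and Low/High for the other]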

Generative models

• Naïve Bayes classifier
  – Csurka, Bray, Dance & Fan, 2004

• Hierarchical Bayesian topic models (e.g. pLSA and LDA)
  – Object categorization: Sivic et al. 2005, Sudderth et al. 2005
  – Natural scene categorization: Fei‐Fei et al. 2005

• 2D part based models
  ‐ Constellation models: Weber et al 2000; Fergus et al 200
  ‐ Star models: ISM (Leibe et al 05)

• 3D part based models:
  ‐ multi‐aspects: Sun, et al, 2009

Basic issues

Learning

• Learning parameters: What are you maximizing? Likelihood (Gen.) or performance on train/validation set (Disc.)
• Level of supervision
  – Manual segmentation; bounding box; image labels; noisy labels
• Batch/incremental
• Training images:
  – Issue of overfitting
  – Negative images for discriminative methods
• Priors

Basic issues

Recognition
– Recognition task: classification, detection, etc.
– Search strategy: Sliding Windows (Viola, Jones 2001)
  • Simple
  • Computational complexity (x,y, S, , N of classes)
    ‐ BSW by Lampert et al 08
    ‐ Also, Alexe, et al 10
  • Localization
  • Objects are not boxes
  • Prone to false positives
    ‐ Non max suppression: Canny ’86 … Desai et al, 2009
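To make the sliding-window search and the non-max suppression step concrete, here is a minimal sketch. The greedy IoU-based suppression shown here is one common variant and is an assumption, not necessarily the method of the cited papers; the function names, the box format (x1, y1, x2, y2), and the 0.5 overlap threshold are all illustrative choices.

```python
def sliding_windows(width, height, sizes, stride):
    """Enumerate square windows over positions (x, y) and scales (sizes)."""
    for s in sizes:
        for y in range(0, height - s + 1, stride):
            for x in range(0, width - s + 1, stride):
                yield (x, y, x + s, y + s)


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two overlapping boxes and one far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
keep = non_max_suppression(boxes, scores)
```

The number of windows grows with the image size, the number of scales, and (since each window is scored per class) the number of classes, which is the computational cost the slide points at.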

Recognition
– Recognition task
– Search strategy
– Attributes
  • Category: car; Azimuth = 225º; Zenith = 30º
    (Savarese, 2007; Sun et al 2009; Liebelt et al., ’08, 10; Farhadi et al 09)
  • It has metal; it is glossy; has wheels
    (Farhadi et al 09; Lampert et al 09; Wang & Forsyth 09)

Recognition
– Recognition task
– Search strategy
– Attributes
– Context
  • Semantic: Torralba et al 03; Rabinovich et al 07; Gupta & Davis 08; Heitz & Koller 08; L‐J Li et al 08; Yao & Fei‐Fei 10
  • Geometric: Hoiem, et al 06; Gould et al 09; Bao, Sun, Savarese 10

Basic issues

Part 1: Bag‐of‐words models

This segment is based on the tutorial “Recognizing and Learning Object Categories: Year 2007”, by Prof L. Fei‐Fei, A. Torralba, and R. Fergus

Related works

• Early “bag of words” models: mostly texture recognition
  – Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003

• Hierarchical Bayesian models for documents (pLSA, LDA, etc.)
  – Hofmann 1999; Blei, Ng & Jordan, 2004; Teh, Jordan, Beal & Blei, 2004

• Object categorization
  – Csurka, Bray, Dance & Fan, 2004; Sivic, Russell, Efros, Freeman & Zisserman, 2005; Sudderth, Torralba, Freeman & Willsky, 2005

• Natural scene categorization
  – Vogel & Schiele, 2004; Fei‐Fei & Perona, 2005; Bosch, Zisserman & Munoz, 2006

Object Bag of ‘words’

Analogy to documents

Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.

sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel

China is forecasting a trade surplus of $90bn (£51bn) to $100bn this year, a threefold increase on 2004's $32bn. The Commerce Ministry said the surplus would be created by a predicted 30% jump in exports to $750bn, compared with a 18% rise in imports to $660bn. The figures are likely to further annoy the US, which has long argued that China's exports are unfairly helped by a deliberately undervalued yuan. Beijing agrees the surplus is too high, but says the yuan is only one factor. Bank of China governor Zhou Xiaochuan said the country also needed to do more to boost domestic demand so more goods stayed within the country. China increased the value of the yuan against the dollar by 2.1% in July and permitted it to trade within a narrow band, but the US wants the yuan to be allowed to trade freely. However, Beijing has made it clear that it will take its time and tread carefully before allowing the yuan to rise further in value.

China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, trade, value

Definition of “BoW”
– Independent features
– Histogram representation

[Figure: example categories (face, bike, violin), each shown as a histogram over the codewords dictionary]

Representation

[Diagram: feature detection & representation; image representation; category models (and/or) classifiers; category decision. The diagram also marks the learning and recognition stages.]

1. Feature detection and representation

• Regular grid
  – Vogel & Schiele, 2003
  – Fei‐Fei & Perona, 2005

• Interest point detector
  – Csurka, Bray, Dance & Fan, 2004
  – Fei‐Fei & Perona, 2005
  – Sivic, Russell, Efros, Freeman & Zisserman, 2005

• Other methods
  – Random sampling (Vidal‐Naquet & Ullman, 2002)
  – Segmentation based patches (Barnard, Duygulu, Forsyth, de Freitas, Blei, Jordan, 2003)

• Detect patches [Mikolajczyk and Schmid ’02] [Matas, Chum, Urban & Pajdla, ’02] [Sivic & Zisserman, ’03]
• Normalize patch
• Compute SIFT descriptor [Lowe ’99]

Slide credit: Josef Sivic

2. Codewords dictionary formation

Clustering/vector quantization

Cluster center = code word
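The clustering / vector quantization step can be sketched as below. The slide does not name a clustering algorithm; plain k-means is assumed here, and the function names, iteration count, seed, and toy 2-D "descriptors" are illustrative only (real SIFT descriptors are 128-dimensional).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_codebook(descriptors, k, iterations=20, seed=0):
    """Cluster descriptors with k-means; each cluster center is a codeword."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(descriptors, k)]
    for _ in range(iterations):
        # Assignment step: attach every descriptor to its nearest center.
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            nearest = min(range(k), key=lambda j: squared_distance(d, centers[j]))
            clusters[nearest].append(d)
        # Update step: move each center to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster becomes empty
                dim = len(members[0])
                centers[j] = [sum(m[t] for m in members) / len(members)
                              for t in range(dim)]
    return centers


# Toy descriptors forming two well-separated groups.
toy_descriptors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
codebook = sorted(kmeans_codebook(toy_descriptors, k=2))
```

On this toy data the two codewords end up at the means of the two groups.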

Fei-Fei et al. 2005

Image patch examples of codewords

Sivic et al. 2005

Visual vocabularies: Issues

• How to choose vocabulary size?
  – Too small: visual words not representative of all patches
  – Too large: quantization artifacts, overfitting

• Computational efficiency
  – Vocabulary trees (Nister & Stewenius, 2006)

3. Bag of words representation

• Nearest neighbors assignment to the codewords dictionary
• K‐D tree search strategy

[Figure: histogram over codewords, built from the codewords dictionary]
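A minimal sketch of the nearest-neighbor assignment that turns an image's descriptors into a bag-of-words histogram. It uses brute-force search rather than a k-d tree, and the function name, the normalization to frequencies, and the toy data are assumptions for illustration.

```python
def bow_histogram(descriptors, codebook):
    """Count how often each codeword is the nearest neighbor of a descriptor."""
    hist = [0] * len(codebook)
    for d in descriptors:
        nearest = min(
            range(len(codebook)),
            key=lambda j: sum((x - y) ** 2 for x, y in zip(d, codebook[j])),
        )
        hist[nearest] += 1
    total = sum(hist)
    # Normalize to frequencies so images with different patch counts compare.
    return [h / total for h in hist] if total else hist


# Toy example: a 2-word dictionary and four 2-D descriptors.
example_codebook = [(0, 0), (10, 10)]
example_descriptors = [(1, 0), (0, 1), (9, 9), (0, 0)]
example_hist = bow_histogram(example_descriptors, example_codebook)
```

Three of the four descriptors fall closest to the first codeword, so the histogram puts three quarters of its mass there.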

Representation

[Diagram: feature detection & representation → image representation]

Learning and Recognition

[Diagram: category decision]

1. Discriminative method:
   ‐ NN
   ‐ SVM

2. Generative method:
   ‐ graphical models

Discriminative classifiers

[Figure: model space containing category models for Class 1 … Class N; a query image is assigned to the winning class (pink)]

Nearest Neighbors classifier

• Assign label of nearest training data point to each test data point

[Figure: model space with a query image; winning class: pink]

K‐Nearest Neighbors classifier

• For a new point, find the k closest points from training data
• Labels of the k points “vote” to classify
• Works well provided there is lots of data and the distance function is good

[Figure: model space with a query image; winning class: pink]
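The k-nearest-neighbors rule can be sketched in a few lines. The L1 distance between histograms, the function names, and the toy histograms and labels are assumptions chosen for illustration; any other histogram distance could be swapped in.

```python
from collections import Counter


def l1(h1, h2):
    """L1 distance between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def knn_classify(query, train_hists, train_labels, k=3):
    """Label a query histogram by majority vote of its k nearest neighbors."""
    ranked = sorted(range(len(train_hists)), key=lambda i: l1(query, train_hists[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set of 2-bin histograms.
train_hists = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["face", "face", "bike", "bike", "bike"]
predicted = knn_classify([0.8, 0.2], train_hists, train_labels, k=3)
```

With k = 1 this reduces to the plain nearest-neighbor classifier.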

K‐Nearest Neighbors classifier

• Voronoi partitioning of feature space for 2‐category 2‐D and 3‐D data (from Duda et al.)
• For k dimensions: k‐D tree = space‐partitioning data structure for organizing points in a k‐dimensional space
• Enables efficient search
• Nice tutorial: http://www.cs.umd.edu/class/spring2002/cmsc420-0401/pbasic.pdf
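A compact sketch of a k-d tree with exact nearest-neighbor search, to show how space partitioning lets whole subtrees be skipped. The dictionary-based node layout, the median split, and the function names are illustrative assumptions; production code would normally rely on an existing library implementation.

```python
def build_kdtree(points, depth=0):
    """Recursively split points at the median along one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return (squared distance, point) of the nearest neighbor of query."""
    if node is None:
        return best
    d = sum((a - b) ** 2 for a, b in zip(node["point"], query))
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Only search the far side if the splitting plane is closer than the
    # best match found so far; otherwise the whole subtree is pruned.
    if diff ** 2 < best[0]:
        best = nearest(far, query, best)
    return best


sample_points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(sample_points)
best_match = nearest(tree, (9, 2))
```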

Functions for comparing histograms

• L1 distance:

$$D(h_1, h_2) = \sum_{i=1}^{N} |h_1(i) - h_2(i)|$$

• χ2 distance:

$$D(h_1, h_2) = \sum_{i=1}^{N} \frac{\big(h_1(i) - h_2(i)\big)^2}{h_1(i) + h_2(i)}$$

• Quadratic distance (cross‐bin):

$$D(h_1, h_2) = \sum_{i,j} A_{ij} \big(h_1(i) - h_2(j)\big)^2$$

Jan Puzicha, Yossi Rubner, Carlo Tomasi, Joachim M. Buhmann: Empirical Evaluation of Dissimilarity Measures for Color and Texture. ICCV 1999
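The three histogram distances (L1, χ2, and cross-bin quadratic) can be sketched directly from their definitions. The function names are illustrative, and skipping bins that are empty in both histograms in the χ2 distance is an assumption made here to avoid dividing by zero.

```python
def l1_distance(h1, h2):
    """Sum over bins of |h1(i) - h2(i)|."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def chi2_distance(h1, h2):
    """Sum over bins of (h1(i) - h2(i))^2 / (h1(i) + h2(i))."""
    # Bins that are empty in both histograms contribute zero (avoids 0/0).
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def quadratic_distance(h1, h2, A):
    """Cross-bin distance: sum over (i, j) of A[i][j] * (h1(i) - h2(j))^2."""
    n = len(h1)
    return sum(A[i][j] * (h1[i] - h2[j]) ** 2 for i in range(n) for j in range(n))


hist_a = [0.5, 0.5, 0.0]
hist_b = [0.25, 0.25, 0.5]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

d_l1 = l1_distance(hist_a, hist_b)
d_chi2 = chi2_distance(hist_a, hist_b)
d_quad = quadratic_distance(hist_a, hist_b, identity)
```

With an identity matrix A the quadratic distance only compares matching bins; off-diagonal entries of A are what let similar but different bins interact.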

1. Discriminative method:
   ‐ NN
   ‐ SVM

2. Generative method:
   ‐ graphical models

Discriminative classifiers (linear classifier)

[Figure: model space containing category models for Class 1 … Class N]

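Support vector machines

• Find hyperplane that maximizes the margin between the positive and negative examples

Support vectors: $\mathbf{x}_i \cdot \mathbf{w} + b = \pm 1$

Distance between point and hyperplane: $\dfrac{|\mathbf{x}_i \cdot \mathbf{w} + b|}{\|\mathbf{w}\|}$

Margin = $2 / \|\mathbf{w}\|$

Solution: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$

Classification function (decision boundary): $\mathbf{w} \cdot \mathbf{x} + b = \sum_i \alpha_i y_i \, \mathbf{x}_i \cdot \mathbf{x} + b$

[Figure: margin and support vectors]

Credit slide: S. Lazebnik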

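Support vector machines

• Classification

$$\mathbf{w} \cdot \mathbf{x} + b = \sum_i \alpha_i y_i \, \mathbf{x}_i \cdot \mathbf{x} + b$$

If $\mathbf{w} \cdot \mathbf{x} + b \ge 0$, the test point is assigned to the positive class; if $\mathbf{w} \cdot \mathbf{x} + b < 0$, to the negative class.

[Figure: margin and test point]

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998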
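A minimal sketch of how a trained linear SVM classifies a test point from its support vectors, using w · x + b = Σ α_i y_i (x_i · x) + b and the sign of the result. Training is not shown; the support vectors, multipliers α_i, labels y_i, and bias b below are hand-picked toy values (assumptions), chosen so that both support vectors satisfy x_i · w + b = ±1.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def svm_decision(x, support_vectors, alphas, labels, b):
    """Decision value: sum_i alpha_i * y_i * (x_i . x) + b."""
    return sum(a * y * dot(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b


def svm_classify(x, support_vectors, alphas, labels, b):
    """Return +1 for the positive class and -1 for the negative class."""
    return 1 if svm_decision(x, support_vectors, alphas, labels, b) >= 0 else -1


# Toy model: w = 0.25*(1,1) - 0.25*(-1,-1) = (0.5, 0.5), b = 0.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.25, 0.25]
labels = [1, -1]
bias = 0.0

value_on_sv = svm_decision((1.0, 1.0), support_vectors, alphas, labels, bias)
label_a = svm_classify((2.0, 0.0), support_vectors, alphas, labels, bias)
label_b = svm_classify((-3.0, 1.0), support_vectors, alphas, labels, bias)
```

In this toy model the margin 2 / ||w|| equals the distance between the two support vectors.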

Nonlinear SVMs

• Datasets that are linearly separable work out great
• But what if the dataset is just too hard?
• We can map it to a higher‐dimensional space

Slide credit: Andrew Moore

Nonlinear SVMs

• General idea: the original input space can always be mapped to some higher‐dimensional feature space where the training set is separable:

Φ: x → φ(x) (lifting transformation)

Slide credit: Andrew Moore

Nonlinear SVMs

• Nonlinear decision boundary in the original feature space:

$$\sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b$$

• The kernel K is the inner product of the lifted features φ(x):

$$K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$$

NOTE:
• It is not required to compute φ(x) explicitly
• The kernel must satisfy the “Mercer inequality” (Mercer's condition)

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
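To illustrate why φ(x) never has to be computed explicitly, the sketch below checks the identity K(x, z) = φ(x) · φ(z) for one specific kernel. The degree-2 polynomial kernel K(x, z) = (x · z)² on 2-D inputs and its lifting φ(x) = (x₁², √2·x₁x₂, x₂²) are chosen here as an example and are not taken from the lecture.

```python
import math


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit lifting of a 2-D point into a 3-D feature space."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def poly_kernel(x, z):
    """Degree-2 polynomial kernel, evaluated in the original 2-D space."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly_kernel(x, z)      # computed without lifting
lifted_value = dot(phi(x), phi(z))    # same quantity via explicit lifting
```

Both routes give the same number, but the kernel only needs a 2-D inner product, which is the saving that matters when the lifted space is very high-dimensional.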

Kernels for bags of features

• Histogram intersection kernel:

$$I(h_1, h_2) = \sum_{i=1}^{N} \min\big(h_1(i), h_2(i)\big)$$

• Generalized Gaussian kernel:

$$K(h_1, h_2) = \exp\!\left(-\frac{1}{A} D(h_1, h_2)\right)$$

• D can be Euclidean distance, χ2 distance etc…

J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, IJCV 2007
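A small sketch of the two kernels for bags of features. The generalized Gaussian kernel is written here as exp(−D(h1, h2) / A) with the χ2 distance as D; whether D enters squared or unsquared varies between presentations, so treat this form, the function names, and the choice A = 1 as assumptions.

```python
import math


def histogram_intersection(h1, h2):
    """Sum over bins of min(h1(i), h2(i))."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def chi2(h1, h2):
    """Chi-square distance; bins empty in both histograms are skipped."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def generalized_gaussian_kernel(h1, h2, distance=chi2, A=1.0):
    """exp(-D(h1, h2) / A) for a pluggable histogram distance D."""
    return math.exp(-distance(h1, h2) / A)


g1 = [0.5, 0.5, 0.0]
g2 = [0.25, 0.25, 0.5]
k_intersection = histogram_intersection(g1, g2)
k_gaussian = generalized_gaussian_kernel(g1, g2)
```

The scale A is a free parameter; the Zhang et al. study sets it from the mean distance between training images, which is not reproduced here.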

Pyramid match kernel

• Fast approximation of Earth Mover’s Distance
• Weighted sum of histogram intersections at multiple resolutions (linear in the number of features instead of cubic)

K. Grauman and T. Darrell. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features, ICCV 2005.

Spatial Pyramid Matching

Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. S. Lazebnik, C. Schmid, and J. Ponce. CVPR 2006

What about multi‐class SVMs?

• No “definitive” multi‐class SVM formulation
• In practice, we have to obtain a multi‐class SVM by combining multiple two‐class SVMs
• One vs. others
  – Training: learn an SVM for each class vs. the others
  – Testing: apply each SVM to test example and assign to it the class of the SVM that returns the highest decision value
• One vs. one
  – Training: learn an SVM for each pair of classes
  – Testing: each learned SVM “votes” for a class to assign to the test example

Credit slide: S. Lazebnik
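The two test-time combination rules can be sketched as follows. The binary SVMs themselves are not trained here; their outputs are passed in as plain values, and the function names, class names, and numbers are hypothetical.

```python
from collections import Counter


def one_vs_rest_predict(decision_values):
    """decision_values maps each class to the decision value of its
    class-vs-others SVM; the class with the highest value wins."""
    return max(decision_values, key=decision_values.get)


def one_vs_one_predict(pairwise_winners):
    """pairwise_winners holds one vote (a class label) per pairwise SVM;
    the class with the most votes wins."""
    return Counter(pairwise_winners).most_common(1)[0][0]


# Hypothetical outputs for a 3-class problem.
ovr_label = one_vs_rest_predict({"car": -0.3, "face": 1.2, "bike": 0.4})
# Votes from the (car, face), (car, bike) and (face, bike) classifiers.
ovo_label = one_vs_one_predict(["face", "car", "face"])
```

One vs. others needs one SVM per class, while one vs. one needs one per pair of classes, so the number of classifiers grows quadratically with the number of classes.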

SVMs: Pros and cons

• Pros
  – Many publicly available SVM packages: http://www.kernel-machines.org/software
  – Kernel‐based framework is very powerful, flexible
  – SVMs work very well in practice, even with very small training sample sizes

• Cons
  – No “direct” multi‐class SVM, must combine two‐class SVMs
  – Computation, memory
    • During training time, must compute matrix of kernel values for every pair of examples
    • Learning can take a very long time for large‐scale problems

Object recognition results

• ETH‐80 database of 8 object classes (Eichhorn and Chapelle 2004)
• Features:
  – Harris detector
  – PCA‐SIFT descriptor, d=10

Kernel                                   | Complexity | Recognition rate
Match [Wallraven et al.]                 |            | 84%
Bhattacharyya affinity [Kondor & Jebara] |            |
Pyramid match                            |            | 84%

Slide credit: Kristen Grauman

1. Discriminative method:
   ‐ NN
   ‐ SVM

2. Generative method:
   ‐ graphical models

Model the probability distribution that produces a given bag of features

Generative models

1. Naïve Bayes classifier
   – Csurka, Bray, Dance & Fan, 2004

2. Hierarchical Bayesian text models (pLSA and LDA)
   – Background: Hofmann 2001, Blei, Ng & Jordan, 2004
   – Object categorization: Sivic et al. 2005, Sudderth et al. 2005
   – Natural scene categorization: Fei‐Fei et al. 2005

Some notations

• w: a collection of all N codewords in the image, w = [w1, w2, …, wN]
• c: category of the image

The Naïve Bayes model

$$p(c \mid w) \propto p(c)\, p(w \mid c)$$

• p(c | w): posterior = probability that image I is of category c
• p(c): prior prob. of the object classes
• p(w | c): image likelihood given the class

Object class decision:

$$c^{*} = \arg\max_{c}\, p(c \mid w) \propto p(c)\, p(w \mid c) = p(c) \prod_{n=1}^{N} p(w_n \mid c)$$

• p(w_n | c): likelihood of the nth visual word given the class, estimated by empirical frequencies of code words in images from a given class

[Figure: graphical model]

[Figures from Csurka et al. 2004]
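A minimal sketch of the Naïve Bayes decision rule c* = argmax_c p(c) Π_n p(w_n | c), with the word likelihoods estimated from codeword frequencies per class. Working in the log domain and adding Laplace (add-one) smoothing are assumptions made here for numerical safety, not details from the slides; the function names and toy data are illustrative.

```python
import math
from collections import Counter


def train_naive_bayes(images, labels, vocab_size, smoothing=1.0):
    """Estimate log p(c) and log p(w | c) from codeword counts per class."""
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        docs = [img for img, y in zip(images, labels) if y == c]
        log_prior[c] = math.log(len(docs) / len(images))
        counts = Counter(w for img in docs for w in img)
        # Smoothing (an assumption) keeps unseen codewords from giving log(0).
        total = sum(counts.values()) + smoothing * vocab_size
        log_lik[c] = [math.log((counts[w] + smoothing) / total)
                      for w in range(vocab_size)]
    return log_prior, log_lik


def classify(words, log_prior, log_lik):
    """argmax over c of log p(c) + sum_n log p(w_n | c)."""
    return max(log_prior,
               key=lambda c: log_prior[c] + sum(log_lik[c][w] for w in words))


# Toy data: each image is a list of codeword indices from a 4-word dictionary.
train_images = [[0, 0, 1], [0, 1, 0], [2, 2, 3], [3, 2, 2]]
train_classes = ["face", "face", "bike", "bike"]
log_prior, log_lik = train_naive_bayes(train_images, train_classes, vocab_size=4)

pred_face = classify([0, 0, 2], log_prior, log_lik)
pred_bike = classify([2, 3, 3], log_prior, log_lik)
```

Summing log-probabilities instead of multiplying probabilities avoids underflow when an image contains hundreds of codewords.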

Other generative BoW models

• Hierarchical Bayesian topic models (e.g. pLSA and LDA)
  – Object categorization: Sivic et al. 2005, Sudderth et al. 2005
  – Natural scene categorization: Fei‐Fei et al. 2005

Generative vs discriminative

• Discriminative methods
  – Computationally efficient & fast

• Generative models
  – Convenient for weakly‐ or un‐supervised, incremental training
  – Prior information
  – Flexibility in modeling parameters

Weaknesses of the BoW models

• No rigorous geometric information of the object components
• It’s intuitive to most of us that objects are made of parts – no such information
• Not extensively tested yet for
  – View point invariance
  – Scale invariance
• Segmentation and localization unclear

What have we learned today?

• Introduction to object recognition
  – Representation
  – Learning
  – Recognition

• Bag of Words models (Problem Set 4 (Q2))
  – Basic representation
  – Different learning and recognition algorithms
