Learning and Inference in Vision: from Features to Scene Understanding

Page 1: Learning and Inference in Vision:  from Features to Scene Understanding

Learning and Inference in Vision: from Features to Scene Understanding

Jonathan Huang, Tomasz Malisiewicz

MLD Student Research Symposium, 2009

Page 2: Learning and Inference in Vision:  from Features to Scene Understanding

Road

Sky

Trees

Bridge

Sign

Car

Page 3: Learning and Inference in Vision:  from Features to Scene Understanding

Huge datasets

PASCAL Visual Object Classes (VOC) Challenge dataset

~15,000 annotated images, ~35,000 annotated object instances, 20 object classes with segmentations and bounding boxes

Page 4: Learning and Inference in Vision:  from Features to Scene Understanding

Huge datasets

 

LabelMe dataset

~11,845 static images, >100,000 labeled polygons

Page 5: Learning and Inference in Vision:  from Features to Scene Understanding

Outline

I. Recognizing single object classes (Jon)

II. Scene understanding with multiple classes (Tomasz)

Page 6: Learning and Inference in Vision:  from Features to Scene Understanding

Recognition task #1: Find all markers

Page 7: Learning and Inference in Vision:  from Features to Scene Understanding

Recognition task #2: Find all cats

Object recognition is often hard due to:

Geometric Variability

Page 8: Learning and Inference in Vision:  from Features to Scene Understanding

Variation within an object class

Page 9: Learning and Inference in Vision:  from Features to Scene Understanding

Viewpoint/Scale/Illumination Variability

Images from Flickr

Page 10: Learning and Inference in Vision:  from Features to Scene Understanding

From Pixels to Visual Features

[Diagram labels: Scene, Imaging, Pixels, Features, Inference, “car”. Stage labels: Low-level features, Higher-level inference.]

Page 11: Learning and Inference in Vision:  from Features to Scene Understanding

Local Visual Features

Images are high dimensional!

(640 width) × (480 height) = 307,200 pixels

Compute image statistics in a region (e.g., estimate the distribution of image gradient orientations)

Page 12: Learning and Inference in Vision:  from Features to Scene Understanding

Key ideas in feature design

Be invariant to stuff you don’t care about…

while not being too invariant

Page 13: Learning and Inference in Vision:  from Features to Scene Understanding

Object classification

Inference: What object class is this?

Learning: What does each object class look like?

Cow or Horse??

Let’s look at a simpler example first…

Page 14: Learning and Inference in Vision:  from Features to Scene Understanding

Document classification analogy

John Terry scored on a header to lift Chelsea to a 1-0 victory over Manchester United and extend the Blues’ Premier League lead to 5 points. Chelsea had been frustrated by Manchester United for 76 minutes, but took advantage of a free kick awarded when Darren Fletcher fouled Ashley Cole. Brian Ching scored six minutes into overtime and the Houston Dynamo advanced to Major League Soccer’s Western ...

In the Senate, where proposals differ substantially from the House-passed measure on issues like a government-run plan and how to pay for coverage, the bill is stalled while budget analysts assess its overall costs. The slim margin in the House — the bill passed with just two votes to spare, and 39 Democrats opposed it — suggests even greater challenges in the Senate, where the majority leader, ...

??? ???

Classify each document as sports or politics

Page 15: Learning and Inference in Vision:  from Features to Scene Understanding

Bag-of-words models for text classification

“Much of the meaning behind written language is preserved even when the ordering of the individual words is lost.” [El-Arini et al.,’09]

Bag of words (Sue Ann)
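
A minimal sketch of the bag-of-words idea, assuming a toy setting: a document is reduced to its word counts, so shuffling the words leaves the representation (and therefore the predicted label) unchanged. The topic vocabularies `SPORTS` and `POLITICS` and the counting "classifier" are hand-picked assumptions for illustration, not a learned model from the talk.

```python
from collections import Counter
import random
import re

def bag_of_words(text):
    """Lowercase, split into word tokens, and count them (order is discarded)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def score(bag, vocabulary):
    """Count how many tokens in the bag belong to a topic vocabulary."""
    return sum(count for word, count in bag.items() if word in vocabulary)

# Hypothetical topic vocabularies, chosen by hand for illustration only.
SPORTS = {"scored", "header", "victory", "league", "kick", "overtime"}
POLITICS = {"senate", "house", "bill", "votes", "democrats", "budget"}

def classify(text):
    """Pick the topic whose vocabulary matches more tokens (ties go to sports)."""
    bag = bag_of_words(text)
    return "sports" if score(bag, SPORTS) >= score(bag, POLITICS) else "politics"

doc = "John Terry scored on a header to lift Chelsea to a 1-0 victory"
words = doc.split()
random.Random(0).shuffle(words)
shuffled_doc = " ".join(words)

# Shuffling changes the word order but not the bag, so the label is unchanged.
assert bag_of_words(doc) == bag_of_words(shuffled_doc)
print(classify(doc), classify(shuffled_doc))
```

A real system would learn per-class word weights from labeled documents (for example with naive Bayes or a linear classifier) instead of using fixed word lists.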

Page 16: Learning and Inference in Vision:  from Features to Scene Understanding

Document classification analogy

but to on Darren awarded Fletcher advanced Ashley lift over to 1-0 scored advantage Major for lead 76 Chelsea Premier to Terry League John Houston the kick Chelsea took United points. free minutes fouled United been frustrated overtime Manchester six a when League a extend victory Ching 5 and to and Western Manchester Brian Cole. Dynamo Soccer’s by a minutes, Blues’ the had header into of scored ...

the margin how In on majority 39 costs. with measure slim overall — to like opposed suggests challenges pay even substantially stalled government run where the issues votes it the where bill for spare, from bill and a Senate, analysts coverage, in — the Democrats greater differ two proposals budget its House assess while Senate, to in just the leader and the plan passed the is House passed The ...

??? ???

Page 18: Learning and Inference in Vision:  from Features to Scene Understanding

 

Page 19: Learning and Inference in Vision:  from Features to Scene Understanding

Visual words (discretization)

Words are discrete; visual features are typically continuous…

Discretization via clustering/vector quantization
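
As a rough illustration of discretization by clustering, the sketch below builds a small codebook with plain k-means (Lloyd's algorithm) and then represents an "image" as a histogram of visual-word ids. The 2-D toy descriptors and the function names (`kmeans`, `visual_word_histogram`) are assumptions made for this example; real descriptors are much higher dimensional and codebooks much larger.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centers, point):
    """Index of the closest center: the 'visual word' id of the point."""
    return min(range(len(centers)), key=lambda k: squared_distance(centers[k], point))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centers (the codebook)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(centers, p)].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster goes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers

def visual_word_histogram(descriptors, codebook):
    """Quantize each descriptor to its nearest codeword and count occurrences."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest(codebook, d)] += 1
    return hist

# Toy 2-D "descriptors" drawn around two made-up centers, (0, 0) and (5, 5).
rng = random.Random(1)
training = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(50)] + \
           [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(50)]
codebook = kmeans(training, k=2)

image_descriptors = [(0.05, -0.02), (4.9, 5.1), (5.2, 4.8)]
histogram = visual_word_histogram(image_descriptors, codebook)
print(histogram)
```

The resulting histogram plays the same role for an image that the word-count vector plays for a document.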

Page 20: Learning and Inference in Vision:  from Features to Scene Understanding

Visual words

 

[Sivic et al., ‘05]

Page 21: Learning and Inference in Vision:  from Features to Scene Understanding

Object classification with bag of words

 

[Sivic et al., ‘05]

Page 22: Learning and Inference in Vision:  from Features to Scene Understanding

Object classification with bag of words

Performance on Caltech 101 dataset with linear SVM on bag-of-word vectors:

Faces

Airplanes Cars

[Csurka et al., ‘04]

Page 23: Learning and Inference in Vision:  from Features to Scene Understanding

Object Detection problem

Detection: Locate all the faces in this image.

Classification: Is this a face, or not a face?

Page 24: Learning and Inference in Vision:  from Features to Scene Understanding

Face detection via a series of classifications (a.k.a. sliding window brain damage)
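
The reduction from detection to classification can be sketched as a loop over window positions. This is a single-scale sketch under stated assumptions: `mean_brightness` is a stand-in for a real face/non-face classifier, and the image is a toy grid of numbers.

```python
def sliding_windows(width, height, window, stride):
    """Yield the top-left corner (x, y) of every window fully inside the image."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield x, y

def detect(image, window, stride, classify, threshold):
    """Run a classifier on every window; keep those scoring above threshold."""
    height, width = len(image), len(image[0])
    found = []
    for x, y in sliding_windows(width, height, window, stride):
        patch = [row[x:x + window] for row in image[y:y + window]]
        s = classify(patch)
        if s > threshold:
            found.append((x, y, s))
    return found

def mean_brightness(patch):
    """Stand-in 'classifier' (an assumption): average pixel value of the patch."""
    return sum(map(sum, patch)) / (len(patch) * len(patch[0]))

# Toy 6x6 image with a bright 2x2 blob whose top-left corner is at x=3, y=2.
image = [[0] * 6 for _ in range(6)]
for y in (2, 3):
    for x in (3, 4):
        image[y][x] = 1

detections = detect(image, window=2, stride=1, classify=mean_brightness, threshold=0.9)
print(detections)
```

The cost is what makes this approach painful: a 24x24 window at stride 1 over a 640x480 image already gives 617 × 457 = 281,969 classifications at a single scale, before searching over window sizes.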

Page 25: Learning and Inference in Vision:  from Features to Scene Understanding

Sliding window detection results

False Detection

Missed Faces

Page 26: Learning and Inference in Vision:  from Features to Scene Understanding

The need for… capturing spatial relationships

Page 27: Learning and Inference in Vision:  from Features to Scene Understanding

One Approach

Create a more descriptive (complicated) feature

Histograms of Oriented Gradients (HOG) features

Original Image

Estimated Image Gradients (gradient magnitudes, gradient orientations)

Subdivided Image cells

Histogrammed gradients in each cell

[Dalal & Triggs, ‘06]
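
The pipeline on this slide (estimate gradients, subdivide into cells, histogram the orientations in each cell) can be sketched as follows. This is a simplified illustration, not the published HOG descriptor: it assumes central differences with replicated borders, hard assignment to 9 unsigned orientation bins, and it omits bin interpolation and block normalization. The cell size and bin count are illustrative parameters.

```python
import math

def gradients(image):
    """Central-difference gradients; returns per-pixel magnitude and orientation."""
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Replicate border pixels so every pixel has two neighbours.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
            # Unsigned orientation, folded into [0, 180) degrees.
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang

def cell_histograms(image, cell=4, bins=9):
    """One magnitude-weighted orientation histogram per cell x cell block."""
    mag, ang = gradients(image)
    rows, cols = len(image) // cell, len(image[0]) // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    bin_width = 180.0 / bins
    for y in range(rows * cell):
        for x in range(cols * cell):
            b = int(ang[y][x] / bin_width) % bins  # guard against rounding to 180
            hists[y // cell][x // cell][b] += mag[y][x]
    return hists

# Toy 8x8 image with a vertical step edge: all gradient energy is horizontal.
image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
hists = cell_histograms(image)
print(hists[0][0])
```

Because the toy edge is vertical, every nonzero gradient points horizontally and all the weight lands in the first orientation bin of each cell.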

Page 28: Learning and Inference in Vision:  from Features to Scene Understanding

People Tracking with HOG features

better

Page 29: Learning and Inference in Vision:  from Features to Scene Understanding

Modeling Spatial Relationships with Deformable Part Based Models

      

Spring-based models: Parts prefer low-energy configurations

[Fischler & Elschlager, ’73], [Ramanan et al., ’07], [Felzenszwalb et al., ’05, ’09], [Kumar et al., ’09]

Page 30: Learning and Inference in Vision:  from Features to Scene Understanding

Parts Based Model

Vertices: Local Appearance

Edges: Spatial Relationship

Goal: Assign model parts to image regions preserving both local appearance and spatial relationships

Page 31: Learning and Inference in Vision:  from Features to Scene Understanding

Parts based models - Inference Problem

Inference problem: What is the best scoring assignment f?

[Equation not recovered from the slide; its two labeled terms are the local appearance term and the pairwise spatial relationship term]

Inference is NP-hard for general graphs

For trees, belief propagation gives an exact solution in polynomial time
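
To make the tree case concrete, here is a sketch of exact max-sum inference (belief propagation in its max form, i.e. dynamic programming) on a tree of parts. It assumes the score of an assignment is the sum of per-part appearance scores plus per-edge spatial scores, which matches the two labeled terms but not necessarily the exact formula on the slide. The part names, the 1-D locations, and the quadratic "spring" in `spatial` are toy assumptions.

```python
def best_assignment(children, root, locations, appearance, spatial):
    """Exact max-sum inference on a tree of parts.

    appearance(part, loc)                 -> local appearance score
    spatial(parent, p_loc, child, c_loc)  -> pairwise spatial score
    Returns (best total score, {part: location}).
    """
    message = {}  # message[child][parent_loc] = best score of child's subtree
    argbest = {}  # argbest[child][parent_loc] = child's best location

    def subtree_score(part, loc):
        """Score of placing part at loc plus the best of everything below it."""
        return appearance(part, loc) + sum(message[c][loc] for c in children.get(part, []))

    def upward(part):
        # Post-order pass: each child is summarized before its parent needs it.
        for c in children.get(part, []):
            upward(c)
            message[c], argbest[c] = {}, {}
            for p_loc in locations:
                scored = [(subtree_score(c, c_loc) + spatial(part, p_loc, c, c_loc), c_loc)
                          for c_loc in locations]
                message[c][p_loc], argbest[c][p_loc] = max(scored)

    upward(root)
    best_score, root_loc = max((subtree_score(root, loc), loc) for loc in locations)

    # Backtrack from the root to read off each part's location.
    assignment = {root: root_loc}
    stack = [root]
    while stack:
        part = stack.pop()
        for c in children.get(part, []):
            assignment[c] = argbest[c][assignment[part]]
            stack.append(c)
    return best_score, assignment

# Toy model: three parts on a 1-D strip of 5 locations.
locations = list(range(5))
children = {"torso": ["head", "legs"]}
appearance_table = {
    "torso": [0, 0, 3, 0, 0],
    "head":  [2, 0, 0, 1, 0],
    "legs":  [0, 0, 0, 2, 1],
}
offset = {"head": -1, "legs": 1}  # preferred displacement from the torso

def appearance(part, loc):
    return appearance_table[part][loc]

def spatial(parent, p_loc, child, c_loc):
    # Quadratic "spring": penalize deviation from the preferred offset.
    return -((c_loc - p_loc - offset[child]) ** 2)

best_score, assignment = best_assignment(children, "torso", locations, appearance, spatial)
print(best_score, assignment)
```

With L candidate locations and P parts, this costs on the order of P·L² score evaluations, compared with L^P for brute-force enumeration.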

Page 32: Learning and Inference in Vision:  from Features to Scene Understanding

Parts based models - Learning Problem

Learning linear models: Find weight vectors that best separate positive and negative examples.

Linear models: [equation not recovered from the slide; its two labeled terms are the local appearance term and the pairwise spatial relationship term]

E.g., a convex max-margin objective, s.t. positive examples on one side and negative examples on the other

[Kumar et al., ’09]
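
For reference, a generic convex max-margin objective of this kind is the standard soft-margin SVM, written below. This is the textbook form, not necessarily the exact objective shown on the slide or used by Kumar et al.; the symbols (weights w, bias b, slack variables ξ_i, trade-off constant C, feature map φ, and labels y_i = +1 for positive and −1 for negative examples) are assumptions introduced here.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_{i} \\
\text{s.t.}\quad & y_{i}\,\bigl(w^{\top}\phi(x_{i}) + b\bigr) \ge 1 - \xi_{i},
\qquad \xi_{i} \ge 0 \quad \text{for all } i
\end{aligned}
```

The constraints place positive examples on one side of the hyperplane and negative examples on the other, with the slack variables absorbing any violations.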

Page 33: Learning and Inference in Vision:  from Features to Scene Understanding

Person deformable part model

Root filter (8x8 resolution)

Part filter (4x4 resolution)

Quadratic spatial configuration model

[Felzenszwalb et al., ’09]

Page 34: Learning and Inference in Vision:  from Features to Scene Understanding

[Felzenszwalb et al., ’09]

Page 35: Learning and Inference in Vision:  from Features to Scene Understanding

 [Ramanan et al,’09]

Page 36: Learning and Inference in Vision:  from Features to Scene Understanding

Outline

I. Recognizing single object classes (Jon)

II. Scene understanding with multiple classes (Tomasz)

Page 37: Learning and Inference in Vision:  from Features to Scene Understanding

Part II: Scene Understanding with Multiple Classes

Goal: Predict Many Different Objects in a Single Image

Car

Fire Hydrant

Building

Fence

Sidewalk

Tree

Page 38: Learning and Inference in Vision:  from Features to Scene Understanding

Wait...

• What’s wrong with just learning a different sliding window classifier for each object type in the world?

Page 39: Learning and Inference in Vision:  from Features to Scene Understanding

The image as seen from an object detector’s point of view

Page 40: Learning and Inference in Vision:  from Features to Scene Understanding

Relationships between objects make recognition possible

Antonio Torralba. The Context Challenge. http://web.mit.edu/torralba/www/carsAndFacesInContext.html

Page 41: Learning and Inference in Vision:  from Features to Scene Understanding

Objects as the “Parts” of a Scene

Key Challenge in Scene Understanding: Modeling relationships between objects from different categories

Deformable Part Model

Scene Model

Page 42: Learning and Inference in Vision:  from Features to Scene Understanding

Fixed Extent “Things” vs Free-form “Stuff”

Building

Fence

Sidewalk

Car

Fire Hydrant

Tree

Things have a well-defined shape. A part of a car is not a car.

Stuff is free-form and mostly defined by color/texture. A part of a building is still a building.

Page 43: Learning and Inference in Vision:  from Features to Scene Understanding

3 Types of Scene Models

Pixel-based, Window-based, Segment-based

Page 44: Learning and Inference in Vision:  from Features to Scene Understanding

Pixel-based Scene Understanding

Unable to reason about instances

Only limited notion of context

TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. Shotton et al. ECCV 2006

Produces Segmentation

Works well on “stuff”

Page 45: Learning and Inference in Vision:  from Features to Scene Understanding

Pixel-wise Conditional Random Fields (TextonBoost)

• Inference

• y^* = argmax_y p(y|x)

• Training: Use boosting to learn unary potential

• Future Direction: Higher-Order Cliques
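
To illustrate what y^* = argmax_y p(y|x) means for a pixel-wise model, the sketch below assumes p(y|x) is proportional to exp(−E(y, x)), where the energy E sums a per-pixel unary cost and a Potts penalty on neighbouring pixels with different labels. The slide does not say which inference algorithm is used; iterated conditional modes (ICM) is swapped in here because it is one of the simplest approximate methods, and it only finds a local optimum. The unary costs are made-up numbers, not boosted classifier outputs.

```python
def icm(unary, smoothness, sweeps=5):
    """Iterated conditional modes on a 4-connected grid with a Potts pairwise term.

    unary[y][x][k] is the cost of giving pixel (x, y) label k (lower is better).
    Each neighbouring pair with different labels pays `smoothness`.
    """
    h, w, labels = len(unary), len(unary[0]), len(unary[0][0])
    # Start from the labeling that is best under the unary term alone.
    y_hat = [[min(range(labels), key=lambda k: unary[y][x][k]) for x in range(w)]
             for y in range(h)]
    for _ in range(sweeps):
        changed = False
        for y in range(h):
            for x in range(w):
                neighbours = [y_hat[j][i]
                              for i, j in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                              if 0 <= i < w and 0 <= j < h]

                def cost(k):
                    return unary[y][x][k] + smoothness * sum(n != k for n in neighbours)

                best = min(range(labels), key=cost)
                if best != y_hat[y][x]:
                    y_hat[y][x], changed = best, True
        if not changed:
            break
    return y_hat

# Toy 3x3 problem with two labels: every pixel clearly prefers label 0,
# except the centre, which weakly prefers label 1.
unary = [[[0.0, 1.0] for _ in range(3)] for _ in range(3)]
unary[1][1] = [0.6, 0.4]

unsmoothed = icm(unary, smoothness=0.0)
smoothed = icm(unary, smoothness=0.5)
print(unsmoothed)
print(smoothed)
```

Without the pairwise term the centre pixel keeps its weakly preferred label; with it, agreement with the four neighbours outweighs the weak unary preference and the isolated label is smoothed away.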

TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. Shotton et al. ECCV 2006

Page 46: Learning and Inference in Vision:  from Features to Scene Understanding

Window-based Scene Understanding

Often not possible to model “stuff” using windows.

Window assumption also questionable for some “things.”

Possible to model interactions between object instances.

Discriminative models for multi-class object layout. Desai et al. ICCV 2009

Object Recognition by Scene Alignment. Russell et al. NIPS 2007

Page 47: Learning and Inference in Vision:  from Features to Scene Understanding

Discriminative models for multi-class object layout

• Inference via Greedy Forward Search

• Training
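
A rough sketch of greedy forward search over candidate windows, assuming a layout score made of per-detection terms plus pairwise interaction terms: start with nothing selected and repeatedly add the detection that most increases the score, stopping when no addition helps. The detection names and all scores below are invented for illustration; this is not the exact model or procedure of Desai et al.

```python
def greedy_forward_search(candidates, unary, pairwise):
    """Greedily select detections that increase a layout score.

    unary(i): score of turning detection i on by itself
    pairwise(i, j): symmetric interaction score between two selected detections
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        def gain(i):
            # Change in total score if detection i is added to the current set.
            return unary(i) + sum(pairwise(i, j) for j in selected)

        best = max(remaining, key=gain)
        if gain(best) <= 0:
            break  # no single addition improves the score
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical detections: two overlapping car windows and a weak road window.
unary_scores = {"car_a": 2.0, "car_b": 1.5, "road": -0.2}
pair_scores = {
    frozenset(("car_a", "car_b")): -3.0,  # heavy overlap: mutually exclusive
    frozenset(("car_a", "road")): 1.0,    # cars and roads support each other
    frozenset(("car_b", "road")): 1.0,
}

def unary(i):
    return unary_scores[i]

def pairwise(i, j):
    return pair_scores.get(frozenset((i, j)), 0.0)

layout = greedy_forward_search(unary_scores, unary, pairwise)
print(layout)
```

In this toy run the weak road detection is accepted because of its positive interaction with the selected car, while the duplicate car window is rejected because of its overlap penalty.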

Page 48: Learning and Inference in Vision:  from Features to Scene Understanding

Window-based results

Page 49: Learning and Inference in Vision:  from Features to Scene Understanding

Region-Based Scene Understanding

Use Segmentation algorithm to extract stable regions

Use CRF to label those segments

Problem: Hard to get object-segments.

Problem: Inference difficult for fully connected models.

Page 50: Learning and Inference in Vision:  from Features to Scene Understanding

Region-Based CRF

• Training: Bag of Words with Nearest Neighbor classifier

• Maximum Likelihood training of pairwise potentials

Object Categorization using Co-Occurrence, Location and Appearance. Galleguillos et al. CVPR 2008.

Spatial Relations

Page 51: Learning and Inference in Vision:  from Features to Scene Understanding

Segmentation-Based Results

Input image | No context | w/ context

Object Categorization using Co-Occurrence, Location and Appearance. Galleguillos et al. CVPR 2008.

Page 52: Learning and Inference in Vision:  from Features to Scene Understanding

Model Granularity vs. Object Type

Object type \ Granularity     Pixels   Windows   Regions
Things (car, cow, person)     :-(      :-)       :-/
Stuff (road, sky, tree)       :-)      :-(       :-)

Page 53: Learning and Inference in Vision:  from Features to Scene Understanding

Scene Understanding Recap

• Rich object-object interactions are important for scene understanding.

• Different underlying assumptions (pixel vs. window vs. region) are better suited for different types of objects (“stuff” vs. “things”)

• Many of the techniques for single class object recognition (e.g., part based models) are relevant for scene understanding

Page 54: Learning and Inference in Vision:  from Features to Scene Understanding

Thanks!

Image Classification

Sliding Window based Object Detection

Modeling Spatial Relationships between parts

Modeling Spatial Relationships between objects

