Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags


Sung Ju Hwang and Kristen Grauman, University of Texas at Austin

Detecting tagged objects

Images tagged with keywords clearly tell us which objects to search for.

Tags: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore #24

Detecting tagged objects

Previous work using tagged images focuses on the noun ↔ object correspondence.

[Duygulu et al. 2002; Berg et al. 2004; Fergus et al. 2005; Li et al. 2009; Lavrenko et al. 2003; Monay & Gatica-Perez 2003; Barnard et al. 2004; Schroff et al. 2007; Gupta & Davis 2008; Vijayanarasimhan & Grauman 2008; …]

Based on tags alone, can you guess where and what size the mug will be in each image?

Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster

Our Idea

The list of human-provided tags gives useful cues beyond just which objects are present.

Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it → absence of larger objects; the mug is named first.
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster → presence of larger objects; the mug is named later.

Our Idea

We propose to learn the implicit localization cues provided by tag lists to improve object detection.

Approach overview

Training: Learn an object-specific connection between localization parameters and implicit tag features.
Training images with tag lists (e.g., "Mug, Eiffel"; "Desk, Mug, Office"; "Mug, Coffee") → implicit tag features → P(location, scale | tags)

Testing: Given a novel image (e.g., one tagged "Woman, Table, Mug, Ladder"), localize objects based on both tags (implicit tag features) and appearance (object detector).


Feature: Word presence/absence

Presence or absence of other objects affects the scene layout → record bag-of-words frequency.

W = [w_1, ..., w_N], where w_i = count of the i-th word.

Image 1 tags (small objects mentioned): Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Image 2 tags (large objects mentioned): Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster

         Mug  Pen  Post-it  Toothbrush  Key  Photo  Computer  Screen  Keyboard  Desk  Bookshelf  Poster
W(im1)    1    1      1         1        1     1       0         0        1       0       0         0
W(im2)    1    0      0         0        0     0       1         2        1       1       1         2

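For concreteness, here is a minimal Python sketch of this bag-of-words feature; the vocabulary, tag lists, and function name are illustrative, not from the paper.

```python
from collections import Counter

def word_feature(tags, vocabulary):
    """W = [w_1, ..., w_N], where w_i = count of the i-th vocabulary word."""
    counts = Counter(tags)
    return [counts.get(word, 0) for word in vocabulary]

vocabulary = ["Mug", "Pen", "Post-it", "Toothbrush", "Key", "Photo",
              "Computer", "Screen", "Keyboard", "Desk", "Bookshelf", "Poster"]

im1_tags = ["Mug", "Key", "Keyboard", "Toothbrush", "Pen", "Photo", "Post-it"]
im2_tags = ["Computer", "Poster", "Desk", "Bookshelf", "Screen",
            "Keyboard", "Screen", "Mug", "Poster"]

print(word_feature(im1_tags, vocabulary))  # [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
print(word_feature(im2_tags, vocabulary))  # [1, 0, 0, 0, 0, 0, 1, 2, 1, 1, 1, 2]
```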

Feature: Rank of tags

People tag the "important" objects earlier → record the rank of each tag compared to its typical rank.

R = [r_1, ..., r_N], where r_i = percentile rank of the i-th word.

Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it (Mug has a relatively high rank)
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster

         Mug  Computer  Screen  Keyboard  Desk  Bookshelf  Poster  Photo  Pen   Post-it  Toothbrush  Key
R(im1)  0.80     0        0       0.51     0        0        0     0.28   0.72   0.82        0       0.90
R(im2)  0.23    0.62     0.21     0.13    0.48     0.61     0.41    0      0      0          0        0

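A sketch of this feature under one plausible reading of "percentile rank" (the word's position in this tag list measured against the positions it typically receives in training lists); typical_ranks is an assumed precomputed mapping, and the paper's exact convention may differ.

```python
import bisect

def rank_feature(tags, vocabulary, typical_ranks):
    """R = [r_1, ..., r_N], where r_i = percentile rank of the i-th word.

    typical_ranks[word] lists the ranks the word received in training tag
    lists. A word tagged earlier than usual gets a high value; absent
    words get 0.
    """
    feature = []
    for word in vocabulary:
        if word not in tags:
            feature.append(0.0)
            continue
        rank = tags.index(word) + 1                  # 1-based position in this list
        history = sorted(typical_ranks[word])        # ranks seen in training
        earlier = bisect.bisect_left(history, rank)  # training ranks below this one
        feature.append(1.0 - earlier / len(history))
    return feature
```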

Feature: Proximity of tags

People tend to move their eyes to nearby objects after the first fixation → record the proximity of all tag pairs.

P = [p_11, p_12, ..., p_NN], where p_ij = inverse rank difference between the i-th and j-th words (0 if either is absent).

Image 1 tag order: 1) Mug 2) Key 3) Keyboard 4) Toothbrush 5) Pen 6) Photo 7) Post-it
Image 2 tag order: 1) Computer 2) Poster 3) Desk 4) Bookshelf 5) Screen 6) Keyboard 7) Screen 8) Mug 9) Poster

P(im1):
            Mug  Screen  Keyboard  Desk  Bookshelf
Mug          1     0       0.5      0       0
Screen             0        0       0       0
Keyboard                    1       0       0
Desk                                0       0
Bookshelf                                   0

P(im2):
            Mug  Screen  Keyboard  Desk  Bookshelf
Mug          1     1       0.5     0.2     0.25
Screen             1        1      0.33    0.5
Keyboard                    1      0.33    0.5
Desk                                1       1
Bookshelf                                   1

Objects named close together may be close to each other in the image.

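A sketch consistent with the example matrices above, taking p_ij = 1 / |rank_i − rank_j| and using the closest pair of occurrences for repeated tags (that tie-breaking choice is my assumption):

```python
def proximity_feature(tags, vocabulary):
    """P = [p_ij over word pairs], p_ij = inverse rank difference.

    p_ij is 1 on the diagonal for present words, 0 if either word is
    absent, and 1/|rank_i - rank_j| otherwise (closest occurrences).
    """
    ranks = {}
    for pos, tag in enumerate(tags, start=1):      # 1-based tag ranks
        ranks.setdefault(tag, []).append(pos)

    feature = []
    for i, wi in enumerate(vocabulary):
        for wj in vocabulary[i:]:                  # upper triangle incl. diagonal
            if wi not in ranks or wj not in ranks:
                feature.append(0.0)
            elif wi == wj:
                feature.append(1.0)
            else:
                diff = min(abs(a - b) for a in ranks[wi] for b in ranks[wj])
                feature.append(1.0 / diff)
    return feature
```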

Approach overview (revisited)

Training: Learn the object-specific connection between localization parameters and the implicit tag features, now P(location, scale | W, R, P).

Testing: Given a novel image, localize objects based on both tags and appearance.

Modeling P(X|T)

We need a PDF for the location and scale of the target object, given the tag feature:

P(X = (scale, x, y) | T = tag feature)

We model it directly using a mixture density network (MDN) [Bishop, 1994]:

input tag feature (Words, Rank, or Proximity) → neural network → mixture model parameters (α, µ, Σ for each component)
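A minimal PyTorch sketch of such an MDN, assuming K diagonal-covariance Gaussian components over X = (scale, x, y); the paper cites Bishop's MDN, but the architecture details here (hidden size, K, diagonal Σ) are my assumptions.

```python
import torch
import torch.nn as nn

class MDN(nn.Module):
    """Map a tag feature T to mixture parameters (alpha, mu, sigma) over X."""
    def __init__(self, in_dim, n_components=4, out_dim=3, hidden=64):
        super().__init__()
        self.K, self.D = n_components, out_dim
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.alpha = nn.Linear(hidden, self.K)               # mixing weights
        self.mu = nn.Linear(hidden, self.K * self.D)         # component means
        self.log_sigma = nn.Linear(hidden, self.K * self.D)  # log std devs

    def forward(self, t):
        h = self.trunk(t)
        alpha = torch.softmax(self.alpha(h), dim=-1)
        mu = self.mu(h).view(-1, self.K, self.D)
        sigma = self.log_sigma(h).view(-1, self.K, self.D).exp()
        return alpha, mu, sigma

def mdn_nll(alpha, mu, sigma, x):
    """Training loss: negative log-likelihood of ground-truth X = (scale, x, y)."""
    logp = torch.distributions.Normal(mu, sigma).log_prob(x.unsqueeze(1)).sum(-1)
    return -torch.logsumexp(torch.log(alpha) + logp, dim=-1).mean()
```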

Modeling P(X|T)

Example: Top 30 most likely localization parameters sampled for the object "car", given only the tags.

[Sampled boxes overlaid on images tagged, e.g., "Car, Wheel, Wheel, Light, Lamp"; "Window, House, House"; "Car, Car, Road, House, Lightpole"; "Car, Windows, Building, Man, Barrel, Car, Truck, Car, Boulder"]
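This kind of visualization could be produced by sampling hypotheses from the predicted mixture and keeping the most likely ones; a sketch, continuing the hypothetical MDN above for a single tag feature:

```python
import torch

def sample_top_k(alpha, mu, sigma, n_samples=1000, k=30):
    """Sample (scale, x, y) hypotheses from one mixture; keep the k most likely."""
    alpha, mu, sigma = alpha[0], mu[0], sigma[0]        # drop batch dim
    which = torch.distributions.Categorical(alpha).sample((n_samples,))
    samples = torch.distributions.Normal(mu[which], sigma[which]).sample()
    # Score every sample under the full mixture density.
    logp = torch.distributions.Normal(mu, sigma).log_prob(samples.unsqueeze(1)).sum(-1)
    scores = torch.logsumexp(torch.log(alpha) + logp, dim=-1)
    return samples[scores.topk(k).indices]
```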



Integrating with object detector

How to exploit this learned distribution P(X|T)? (A code sketch of both uses follows below.)

1) Use it to speed up the detection process (location priming):
(a) Sort all candidate windows according to P(X|T), from most likely to least likely.
(b) Run the detector only at the most probable locations and scales.

2) Use it to increase detection accuracy (modulate the detector output scores):
e.g., detector scores of 0.7, 0.8, 0.9 paired with tag-based predictions of 0.9, 0.3, 0.2 yield modulated scores of 0.63, 0.24, 0.18.
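A sketch of both uses, assuming a simple multiplicative combination of detector score and tag-based prior (the numbers above are consistent with a product, but the paper's exact combination rule, and the prior, detector_score, and budget names here, are assumptions):

```python
import numpy as np

def prime_and_modulate(windows, prior, detector_score, budget=0.3):
    """windows: candidate (x, y, scale) boxes; prior(w): tag-based P(X|T) at w;
    detector_score(w): appearance-based detector confidence at w."""
    # 1) Location priming: rank windows by the tag-based prior and run the
    #    (expensive) detector only on the top `budget` fraction.
    p = np.array([prior(w) for w in windows])
    order = np.argsort(-p)                         # most probable first
    keep = order[: max(1, int(budget * len(windows)))]

    # 2) Score modulation: combine detector output with the tag-based prior,
    #    e.g. 0.7 * 0.9 = 0.63 as in the example above.
    detections = [(windows[i], detector_score(windows[i]) * p[i]) for i in keep]
    return sorted(detections, key=lambda d: -d[1])
```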

Experiments: Datasets

                 LabelMe                              PASCAL VOC 2007
Images           Street and office scenes             Flickr images
Tag lists        Ordered, via the order labels        Obtained on Mechanical Turk
                 were added
Classes          5                                    20
Unique taggers   56                                   758
Tags / image     23                                   5.5
Detector         Dalal & Triggs' HOG detector         Felzenszwalb et al.'s LSVM detector

Experiments

We evaluate:
- Detection speed: we search fewer windows to achieve the same detection rate.
- Detection accuracy: we know which detection hypotheses to trust most.

We compare:
- Raw detector (HOG, LSVM)
- Raw detector + our tag features

We also show results using Gist [Torralba 2003] as context, for reference.

PASCAL: Performance evaluation

Accuracy: All 20 PASCAL Classes

[Precision-recall curves: LSVM (AP=33.69); LSVM+Tags (AP=36.79); LSVM+Gist (AP=36.28)]

Speed: a naïve sliding window search scans 70% of the candidate windows; we search only 30%.

PASCAL: Accuracy vs Gist per class

[Per-class accuracy comparison of LSVM+Tags vs. LSVM+Gist]

PASCAL: Example detections

[Bottle and car detections, LSVM+Tags (Ours) vs. LSVM alone, on images tagged e.g. "Lamp, Person, Bottle, Dog, Sofa, Painting, Table"; "Person, Table, Chair, Mirror, Tablecloth, Bowl, Bottle, Shelf, Painting, Food"; "Car, License Plate, Building"]

PASCAL: Example detections

[Dog and person detections, LSVM+Tags (Ours) vs. LSVM alone, on images tagged e.g. "Car, Door, Door, Gear, Steering Wheel, Seat, Seat, Person, Person, Camera"; "Dog, Floor, Hairclip"; "Dog, Dog, Dog, Person, Person, Ground, Bench, Scarf"; "Person, Microphone, Light"; "Horse, Person, Tree, House, Building, Ground, Hurdle, Fence"]

PASCAL: Example failure cases

[LSVM+Tags (Ours) vs. LSVM alone, on images tagged e.g. "Aeroplane, Sky, Building, Shadow"; "Person, Person, Pole, Building, Sidewalk, Grass, Road"; "Dog, Clothes, Rope, Rope, Plant, Ground, Shadow, String, Wall"; "Bottle, Glass, Wine, Table"]

Results: Observations

Often our implicit features predict:
- scale well for indoor objects
- position well for outdoor objects

Gist is usually better for y position, while our tag features are generally stronger for scale.

The method needs to have seen the target object in a variety of examples with different contexts; visual context and tag context are complementary.

Summary

We want to learn what is implied (beyond which objects are present) by how a human provides tags for an image.

Our approach translates existing insights about human viewing behavior (attention, importance, gaze, etc.) into enhanced object detection.

Novel tag cues enable an effective localization prior.

Significant gains with state-of-the-art detectors on two datasets.

Future work

- Joint multi-object detection
- From tags to natural language sentences
- Image retrieval applications
