Sung Ju Hwang and Kristen Grauman
University of Texas at Austin
Detecting tagged objects
Images tagged with keywords clearly tell us which objects to search for.
Example tags: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore, #24
Previous work using tagged images focuses on the noun ↔ object correspondence.
[Duygulu et al. 2002, Lavrenko et al. 2003, Monay & Gatica-Perez 2003, Barnard et al. 2004, Berg et al. 2004, Fergus et al. 2005, Schroff et al. 2007, Gupta & Davis 2008, Vijayanarasimhan & Grauman 2008, Li et al. 2009, …]
Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster
Based on tags alone, can you guess where and what size the mug will be in each image?
Our Idea
The list of human-provided tags gives useful cues beyond just which objects are present.
Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster → presence of larger objects; Mug is named later.
Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it → absence of larger objects; Mug is named first.
Our Idea
We propose to learn the implicit localization cues provided by tag lists to improve object detection.
Approach overview
Training: learn the object-specific connection between localization parameters and implicit tag features, from tagged training images (e.g., Mug, Eiffel; Desk, Mug, Office; Mug, Coffee) with tag lists such as "Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it" and "Computer, Poster, Desk, Screen, Mug, Poster".
Testing: given a novel image (e.g., tagged Woman, Table, Mug, Ladder), localize objects based on both tags and appearance, combining the object detector with P(location, scale | tags) computed from the implicit tag features.
Feature: Word presence/absence
Presence or absence of other objects affects the scene layout → record bag-of-words frequency:
W = [w_1, …, w_N], where w_i = count of the i-th word.
Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it (small objects mentioned)
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster (large objects mentioned)

         Mug  Pen  Post-it  Toothbrush  Key  Photo  Computer  Screen  Keyboard  Desk  Bookshelf  Poster
W(im1)    1    1     1          1        1     1       0        0        1        0       0         0
W(im2)    1    0     0          0        0     0       1        2        1        1       1         1
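The word presence/absence feature can be sketched in a few lines; the vocabulary order follows the example table above, and the helper name is ours:

```python
from collections import Counter

# Vocabulary ordered as in the example table above.
VOCAB = ["Mug", "Pen", "Post-it", "Toothbrush", "Key", "Photo",
         "Computer", "Screen", "Keyboard", "Desk", "Bookshelf", "Poster"]

def word_feature(tags, vocab=VOCAB):
    """W = [w_1, ..., w_N], where w_i = count of the i-th vocabulary word."""
    counts = Counter(tags)
    return [counts[w] for w in vocab]

# Image 1 from the table: only the small objects are tagged.
im1 = ["Mug", "Key", "Keyboard", "Toothbrush", "Pen", "Photo", "Post-it"]
word_feature(im1)  # → [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
```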
Feature: Rank of tags
People tag the "important" objects earlier → record the rank of each tag compared to its typical rank:
R = [r_1, …, r_N], where r_i = percentile rank of the i-th word.
Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster

         Mug  Computer  Screen  Keyboard  Desk  Bookshelf  Poster  Photo  Pen   Post-it  Toothbrush  Key
R(im1)  0.80     0        0       0.51      0       0         0     0.28  0.72    0.82       0       0.90
R(im2)  0.23    0.62     0.21     0.13    0.48     0.61      0.41     0     0       0        0         0

(Mug has a relatively high rank in image 1.)
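One way to realize the rank feature is to compare a word's position in the current tag list against the positions it received across training lists. The percentile convention and the `typical_ranks` data below are illustrative assumptions, not the paper's exact formulation:

```python
def rank_feature(tags, vocab, typical_ranks):
    """R = [r_1, ..., r_N]: r_i is the fraction of the word's typical
    (training-time) ranks that the observed rank beats or ties, so a word
    tagged earlier than usual gets a value near 1; absent words get 0."""
    feat = []
    for word in vocab:
        history = typical_ranks.get(word, [])
        if word not in tags or not history:
            feat.append(0.0)
            continue
        rank = tags.index(word) + 1  # 1-based position in this tag list
        feat.append(sum(r >= rank for r in history) / len(history))
    return feat

# Hypothetical rank history: "Mug" is usually tagged 5th-9th, so being
# tagged first is unusually early and yields the maximum value.
typical = {"Mug": [5, 6, 7, 8, 8, 9]}
rank_feature(["Mug", "Key", "Keyboard"], ["Mug"], typical)  # → [1.0]
```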
Feature: Proximity of tags
People tend to move their eyes to nearby objects after the first fixation → record the proximity of all tag pairs:
P = [p_ij], where p_ij = proximity of the i-th and j-th words, based on their rank difference (shown here as the reciprocal of the rank difference; 0 if either word is absent).
Image 1 tags (in order): 1) Mug, 2) Key, 3) Keyboard, 4) Toothbrush, 5) Pen, 6) Photo, 7) Post-it
Image 2 tags (in order): 1) Computer, 2) Poster, 3) Desk, 4) Bookshelf, 5) Screen, 6) Keyboard, 7) Screen, 8) Mug, 9) Poster

P(im1), upper triangle:
Mug:       Mug 1, Screen 0, Keyboard 0.5, Desk 0, Bookshelf 0
Screen:    Screen 0, Keyboard 0, Desk 0, Bookshelf 0
Keyboard:  Keyboard 1, Desk 0, Bookshelf 0
Desk:      Desk 0, Bookshelf 0
Bookshelf: Bookshelf 0

P(im2), upper triangle:
Mug:       Mug 1, Screen 1, Keyboard 0.5, Desk 0.2, Bookshelf 0.25
Screen:    Screen 1, Keyboard 1, Desk 0.33, Bookshelf 0.5
Keyboard:  Keyboard 1, Desk 0.33, Bookshelf 0.5
Desk:      Desk 1, Bookshelf 1
Bookshelf: Bookshelf 1
(High proximity in image 2 suggests Mug, Screen, and Keyboard may be close to each other in the scene.)
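The pairwise proximity feature can be sketched as follows; taking the closest pair of positions for a repeated tag (e.g., the two "Screen" entries in image 2) is our assumption for illustration:

```python
def proximity_feature(tags, vocab):
    """P = [p_ij]: reciprocal of the rank difference between words i and j
    (1 on the diagonal for present words, 0 if either word is absent).
    Repeated tags contribute their closest pair of positions."""
    positions = {}
    for pos, tag in enumerate(tags, start=1):
        positions.setdefault(tag, []).append(pos)
    n = len(vocab)
    P = [[0.0] * n for _ in range(n)]
    for i, wi in enumerate(vocab):
        for j, wj in enumerate(vocab):
            if wi not in positions or wj not in positions:
                continue
            if i == j:
                P[i][j] = 1.0
            else:
                diff = min(abs(a - b)
                           for a in positions[wi] for b in positions[wj])
                P[i][j] = 1.0 / diff
    return P

im2 = ["Computer", "Poster", "Desk", "Bookshelf", "Screen",
       "Keyboard", "Screen", "Mug", "Poster"]
vocab = ["Mug", "Screen", "Keyboard", "Desk", "Bookshelf"]
proximity_feature(im2, vocab)[0]  # Mug row → [1.0, 1.0, 0.5, 0.2, 0.25]
```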
Approach overview
Training: learn the object-specific connection between localization parameters and the implicit tag features W, R, and P.
Testing: given a novel image (e.g., tagged Woman, Table, Mug, Ladder), localize objects using both the object detector and P(location, scale | W, R, P).
Modeling P(X|T)
We need a PDF for the location and scale of the target object, given the tag feature:
P(X = (scale, x, y) | T = tag feature)
We model it directly using a mixture density network (MDN) [Bishop, 1994]:
input tag feature (Words, Rank, or Proximity) → neural network → mixture model parameters (α, µ, Σ) for each component.
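A minimal numpy sketch of the MDN forward pass, assuming one hidden layer, diagonal Gaussians, and placeholder weights (the real network is trained so the predicted mixture assigns high likelihood to the ground-truth (scale, x, y)):

```python
import numpy as np

def mdn_forward(t, W1, b1, W2, b2, n_comp=3, dim=3):
    """Map a tag feature t to mixture parameters (alpha_k, mu_k, sigma_k)
    for P(X | T), where X = (scale, x, y). Layer sizes are illustrative."""
    h = np.tanh(W1 @ t + b1)                  # hidden layer
    out = W2 @ h + b2                         # raw network outputs
    alpha = np.exp(out[:n_comp])
    alpha /= alpha.sum()                      # softmax: mixing weights
    mu = out[n_comp:n_comp * (1 + dim)].reshape(n_comp, dim)
    sigma = np.exp(out[n_comp * (1 + dim):]).reshape(n_comp, dim)
    return alpha, mu, sigma                   # per-component std devs

def mdn_pdf(x, alpha, mu, sigma):
    """Evaluate the predicted mixture-of-Gaussians density at x."""
    density = 0.0
    for a, m, s in zip(alpha, mu, sigma):
        norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * s ** 2))
        density += a * norm * np.exp(-0.5 * np.sum(((x - m) / s) ** 2))
    return density
```

With n_comp mixture components over a dim-dimensional X, the output layer needs n_comp · (1 + 2·dim) units (one weight, dim means, dim variances per component).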
Modeling P(X|T)
Example: top 30 most likely localization parameters sampled for the object "car", given only the tags. Example images tagged: Lamp, Car, Wheel, Wheel, Light, Window, House, House; Car, Car, Road, House, Lightpole; Car, Windows, Building, Man, Barrel, Car, Truck, Car, Boulder, Car.
Integrating with object detector
How do we exploit the learned distribution P(X|T)?
1) Use it to speed up the detection process (location priming):
(a) Sort all candidate windows according to P(X|T), from most likely to least likely.
(b) Run the detector only at the most probable locations and scales.
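The priming step might look like the following sketch, where `prior` is any callable scoring a candidate window (x, y, scale) under P(X|T); the names and the 30% keep fraction are illustrative:

```python
def prime_windows(windows, prior, keep_frac=0.3):
    """Sort candidate windows by the tag-based prior P(X|T) and keep only
    the top fraction, so the appearance detector scans far fewer windows."""
    ranked = sorted(windows, key=prior, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_frac))
    return ranked[:n_keep]

# Toy prior peaked at x = 5: only the most probable window survives.
windows = [(0, 0, 1.0), (5, 5, 1.0), (9, 9, 1.0)]
prime_windows(windows, prior=lambda w: -abs(w[0] - 5))  # → [(5, 5, 1.0)]
```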
Integrating with object detector
How do we exploit the learned distribution P(X|T)?
2) Use it to increase detection accuracy (modulate the detector output scores): combine each prediction from the object detector with the corresponding prediction based on tag features. For example, detector scores 0.7, 0.8, 0.9 modulated by tag-based predictions 0.9, 0.3, 0.2 become 0.63, 0.24, 0.18.
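The modulation in the example above amounts to multiplying each detector confidence by the tag-based prediction for that window (function and variable names are ours):

```python
def modulate_scores(detections, tag_score):
    """Multiply each detector confidence by the tag-based prediction
    P(X|T) for that window, re-ranking the detection hypotheses."""
    return [(win, conf * tag_score(win)) for win, conf in detections]

detections = [("w1", 0.7), ("w2", 0.8), ("w3", 0.9)]
tag_preds = {"w1": 0.9, "w2": 0.3, "w3": 0.2}
modulate_scores(detections, tag_preds.get)
# The strongest raw detection ("w3", 0.9) drops below "w1" after modulation.
```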
Experiments: Datasets

LabelMe:
- Street and office scenes
- Contains ordered tag lists via labels added
- 5 classes, 56 unique taggers, 23 tags / image
- Dalal & Triggs's HOG detector

PASCAL VOC 2007:
- Flickr images
- Tag lists obtained on Mechanical Turk
- 20 classes, 758 unique taggers, 5.5 tags / image
- Felzenszwalb et al.'s LSVM detector
Experiments
We evaluate detection speed and detection accuracy.
We compare the raw detector (HOG, LSVM) against the raw detector + our tag features.
We also show results using Gist [Torralba 2003] as context, for reference.
Speed: we search fewer windows to achieve the same detection rate.
Accuracy: we know which detection hypotheses to trust most.
PASCAL: Performance evaluation
Accuracy, all 20 PASCAL classes (precision vs. recall):
- LSVM (AP = 33.69)
- LSVM + Tags (AP = 36.79)
- LSVM + Gist (AP = 36.28)
Speed: a naïve sliding window search scans 70% of the windows; we search only 30%.
PASCAL: Accuracy vs Gist per class
PASCAL: Example detections
LSVM+Tags (ours) vs. LSVM alone. Example image tags (target objects Bottle and Car):
- Lamp, Person, Bottle, Dog, Sofa, Painting, Table
- Person, Table, Chair, Mirror, Tablecloth, Bowl, Bottle, Shelf, Painting, Food
- Car, License Plate, Building
PASCAL: Example detections (Dog, Person)
LSVM+Tags (ours) vs. LSVM alone. Example image tags:
- Car, Door, Door, Gear, Steering Wheel, Seat, Seat, Person, Person, Camera
- Dog, Floor, Hairclip
- Dog, Dog, Dog, Person, Person, Ground, Bench, Scarf
- Person, Microphone, Light
- Horse, Person, Tree, House, Building, Ground, Hurdle, Fence
PASCAL: Example failure cases
LSVM+Tags (ours) vs. LSVM alone. Example image tags:
- Aeroplane, Sky, Building, Shadow
- Person, Person, Pole, Building, Sidewalk, Grass, Road
- Dog, Clothes, Rope, Rope, Plant, Ground, Shadow, String, Wall
- Bottle, Glass, Wine, Table
Results: Observations
Often our implicit features predict:
- scale well for indoor objects
- position well for outdoor objects
Gist is usually better for y position, while our tags are generally stronger for scale.
We need to have learned about target objects from a variety of examples with different contexts.
Visual context and tag context are complementary.
Summary
We want to learn what is implied (beyond which objects are present) by how a human provides tags for an image.
Our approach translates existing insights about human viewing behavior (attention, importance, gaze, etc.) into enhanced object detection.
Novel tag cues enable an effective localization prior.
Significant gains with state-of-the-art detectors on two datasets.

Future work
- Joint multi-object detection
- From tags to natural language sentences
- Image retrieval applications