READING BETWEEN THE LINES: OBJECT LOCALIZATION USING IMPLICIT CUES FROM IMAGE TAGS
Sung Ju Hwang and Kristen Grauman, University of Texas at Austin. CVPR 2010.
Hwang & Grauman, CVPR 2010
Detecting tagged objects
Images tagged with keywords clearly tell us which objects to search for.
Example tags: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore #24
Detecting tagged objects
Previous work using tagged images focuses on the noun ↔ object correspondence.
[Duygulu et al. 2002, Berg et al. 2004, Fergus et al. 2005, Li et al. 2009, Lavrenko et al. 2003, Monay & Gatica-Perez 2003, Barnard et al. 2004, Schroff et al. 2007, Gupta & Davis 2008, Vijayanarasimhan & Grauman 2008, …]
Our Idea
Tags (im1): Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Tags (im2): Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster
Based on tags alone, can you guess where and what size the mug will be in each image?
The list of human-provided tags gives useful cues beyond just which objects are present.
In im1, the Mug is named first and larger objects are absent from the tag list; in im2, the Mug is named later and larger objects are present.
Our Idea
We propose to learn the implicit localization cues provided by tag lists to improve object detection.
Approach overview
Training: Learn an object-specific connection between localization parameters and implicit tag features (training tag lists such as "Woman, Table, Mug, Ladder", "Mug, Eiffel", "Desk, Mug, Office", "Mug, Coffee").
Testing: Given a novel image with tags (e.g., "Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it" or "Computer, Poster, Desk, Screen, Mug, Poster"), localize objects based on both tags and appearance, combining P(location, scale | tags) from the implicit tag features with the object detector.
Feature: Word presence/absence
Presence or absence of other objects affects the scene layout → record bag-of-words frequency.
W = [w_1, …, w_N], where w_i = count of the i-th word.

Tags (im1): Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Tags (im2): Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster

         Mug  Pen  Post-it  Toothbrush  Key  Photo  Computer  Screen  Keyboard  Desk  Bookshelf  Poster
W(im1)    1    1      1         1        1     1        0        0        1       0       0         0
W(im2)    1    0      0         0        0     0        1        2        1       1       1         1
Note the contrast: in im1 only small objects are mentioned, while in im2 large objects (Computer, Desk, Bookshelf) are mentioned as well.
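As a concrete sketch, the word presence/absence feature is just a bag-of-words count vector over the tag vocabulary. A minimal Python version, using the vocabulary and tag lists from the slide's example (note the code counts Poster twice for im2 because it appears twice in that tag list):

```python
from collections import Counter

def word_feature(tags, vocab):
    """Bag-of-words count vector: entry i = number of times
    the i-th vocabulary word appears in the image's tag list."""
    counts = Counter(tags)
    return [counts.get(w, 0) for w in vocab]

vocab = ["Mug", "Pen", "Post-it", "Toothbrush", "Key", "Photo",
         "Computer", "Screen", "Keyboard", "Desk", "Bookshelf", "Poster"]
im1 = ["Mug", "Key", "Keyboard", "Toothbrush", "Pen", "Photo", "Post-it"]
im2 = ["Computer", "Poster", "Desk", "Bookshelf", "Screen",
       "Keyboard", "Screen", "Mug", "Poster"]

W_im1 = word_feature(im1, vocab)  # [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
W_im2 = word_feature(im2, vocab)  # Screen appears twice -> count 2
```

Repeated tags naturally become counts greater than one, which is the only way the feature distinguishes "one screen" from "two screens".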
Feature: Rank of tags
People tag the "important" objects earlier → record the rank of each tag compared to its typical rank.
R = [r_1, …, r_N], where r_i = percentile rank of the i-th word.

         Mug  Computer  Screen  Keyboard  Desk  Bookshelf  Poster  Photo  Pen   Post-it  Toothbrush  Key
R(im1)  0.80     0        0       0.51     0       0         0     0.28   0.72   0.82        0       0.90
R(im2)  0.23    0.62     0.21     0.13    0.48    0.61      0.41    0      0      0          0        0
Note the relatively high rank of Mug in im1 (0.80), where it is named first, versus im2 (0.23), where it is named near the end.
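The rank feature can be sketched as below. Here `rank_history` (each word's ranks observed in training tag lists) and the percentile convention are illustrative assumptions; the slide does not spell out the exact normalization:

```python
def rank_feature(tags, vocab, rank_history):
    """For each vocabulary word present in the tag list, record the
    percentile of its rank in this list relative to the ranks it
    typically receives (rank_history[word] = ranks seen in training).
    Absent words get 0. Percentile here = fraction of historical ranks
    that are >= the current rank, so an earlier-than-usual mention
    yields a high value. The paper's exact normalization may differ."""
    feature = []
    for w in vocab:
        if w not in tags:
            feature.append(0.0)
            continue
        r = tags.index(w) + 1                  # rank in this list (1-based)
        hist = rank_history.get(w, [r])        # fall back to current rank
        pct = sum(1 for h in hist if h >= r) / len(hist)
        feature.append(pct)
    return feature

# Hypothetical history: Mug is usually tagged 1st, 5th, or 9th.
rank_history = {"Mug": [1, 5, 9]}
tags = ["Key", "Keyboard", "Toothbrush", "Pen", "Mug"]
R = rank_feature(tags, ["Mug", "Screen"], rank_history)
# Mug at rank 5 -> 2 of 3 historical ranks are >= 5 -> 2/3; Screen absent -> 0
```

The key property is relative, per-word normalization: an object tagged fifth is "early" for a usually-background word but "late" for a word that is usually tagged first.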
Feature: Proximity of tags
People tend to move their eyes to nearby objects after the first fixation → record the proximity of all tag pairs.
P = [p_11, p_12, …], where p_ij is based on the rank difference between words i and j (0 if either word is absent).

Tag order (im1): 1) Mug 2) Key 3) Keyboard 4) Toothbrush 5) Pen 6) Photo 7) Post-it
Tag order (im2): 1) Computer 2) Poster 3) Desk 4) Bookshelf 5) Screen 6) Keyboard 7) Screen 8) Mug 9) Poster

P(im1):       Mug  Screen  Keyboard  Desk  Bookshelf
  Mug          1     0       0.5      0       0
  Screen             0        0       0       0
  Keyboard                    1       0       0
  Desk                                0       0
  Bookshelf                                   0

P(im2):       Mug  Screen  Keyboard  Desk  Bookshelf
  Mug          1     1       0.5     0.2     0.25
  Screen             1        1      0.33    0.5
  Keyboard                    1      0.33    0.5
  Desk                                1       1
  Bookshelf                                   1
High proximity values in P(im2) (e.g., Mug and Screen) suggest those objects may be close to each other in the image.
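One plausible instantiation of the pairwise proximity feature, assuming p_ij is the reciprocal of the closest rank difference between the two words' mentions. This reproduces the P(im1) entries above (the paper's exact definition may differ in detail):

```python
def proximity_feature(tags, pairs):
    """For each (word_i, word_j) pair, record the reciprocal of the
    rank difference between their mentions in the tag list: 1 for a
    present word paired with itself, 0 if either word is absent.
    For words tagged multiple times, use the closest pair of mentions."""
    ranks = {}
    for r, t in enumerate(tags, start=1):
        ranks.setdefault(t, []).append(r)
    feat = []
    for a, b in pairs:
        if a not in ranks or b not in ranks:
            feat.append(0.0)
        elif a == b:
            feat.append(1.0)
        else:
            d = min(abs(ra - rb) for ra in ranks[a] for rb in ranks[b])
            feat.append(1.0 / d)
    return feat

im1 = ["Mug", "Key", "Keyboard", "Toothbrush", "Pen", "Photo", "Post-it"]
P = proximity_feature(im1, [("Mug", "Mug"), ("Mug", "Screen"),
                            ("Mug", "Keyboard")])
# -> [1.0, 0.0, 0.5], matching the Mug row of P(im1)
```

Because it is pairwise, the full feature has one entry per (ordered) word pair, so its dimensionality grows quadratically with the vocabulary size.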
Approach overview (recap)
Training: learn P(location, scale | W, R, P) from the implicit tag features. Testing: given a novel image's tags, combine this prediction with the object detector.
Modeling P(X|T)
We need a PDF for the location and scale of the target object, given the tag feature:
P(X = (scale, x, y) | T = tag feature)
We model it directly using a mixture density network (MDN) [Bishop, 1994]: a neural network maps the input tag feature (Words, Rank, or Proximity) to the parameters (α, µ, Σ) of each component of a Gaussian mixture model.
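A minimal NumPy sketch of the MDN idea, with untrained random weights standing in for what training would produce: the network maps a tag feature to mixing weights α, means µ, and diagonal variances of a K-component Gaussian mixture over X = (scale, x, y). All sizes and the one-hidden-layer architecture are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mdn_forward(t, params, K=3):
    """Map tag feature t to Gaussian-mixture parameters over X."""
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ t + b1)                     # hidden layer
    z = W2 @ h + b2                              # raw network outputs
    alpha = np.exp(z[:K]) / np.exp(z[:K]).sum()  # softmax -> mixing weights
    mu = z[K:K + 3 * K].reshape(K, 3)            # component means
    var = np.exp(z[K + 3 * K:]).reshape(K, 3)    # positive diag. variances
    return alpha, mu, var

def mixture_density(x, alpha, mu, var):
    """Evaluate P(X = x | T) under the diagonal-Gaussian mixture."""
    p = 0.0
    for a, m, v in zip(alpha, mu, var):
        norm = np.prod(1.0 / np.sqrt(2 * np.pi * v))
        p += a * norm * np.exp(-0.5 * np.sum((x - m) ** 2 / v))
    return p

D, H, K = 12, 16, 3           # tag-feature dim, hidden units, components
out = K + 3 * K + 3 * K       # alphas + means + variances
params = (rng.normal(size=(H, D)) * 0.1, np.zeros(H),
          rng.normal(size=(out, H)) * 0.1, np.zeros(out))

t = rng.random(D)             # a tag feature (e.g., W, R, or P)
alpha, mu, var = mdn_forward(t, params)
density = mixture_density(np.array([0.3, 0.5, 0.5]), alpha, mu, var)
```

Training would fit the network weights by maximizing the mixture log-likelihood of the ground-truth (scale, x, y) annotations; a multimodal mixture is what lets the model say "the car is either small and high in the frame, or large and low."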
Modeling P(X|T): Example
Top 30 most likely localization parameters sampled for the object "car", given only the tags.
Example tag lists: "Lamp, Car, Wheel, Wheel, Light"; "Window, House, House, Car, Car, Road, House, Lightpole"; "Car, Windows, Building, Man, Barrel, Car, Truck, Car"; "Boulder, Car".
Integrating with the object detector
How do we exploit the learned distribution P(X|T)?
1) Use it to speed up the detection process (location priming):
   (a) Sort all candidate windows according to P(X|T), from most likely to least likely.
   (b) Run the detector only at the most probable locations and scales.
2) Use it to increase detection accuracy (modulate the detector output scores):
   e.g., detector scores of 0.7, 0.8, 0.9 combined with tag-based predictions of 0.9, 0.3, 0.2 yield modulated scores of 0.63, 0.24, 0.18.
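Both uses can be sketched in a few lines. The helper names `prime_windows` and `modulate`, the 30% cutoff, and the multiplicative combination are illustrative assumptions (the product does reproduce the slide's example numbers):

```python
def prime_windows(windows, prior):
    """Location priming: sort candidate windows by the tag-based prior
    P(X|T) and keep only the most probable fraction, so the detector
    runs on far fewer windows."""
    ranked = sorted(windows, key=prior, reverse=True)
    return ranked[: max(1, int(0.3 * len(ranked)))]  # e.g., top 30%

def modulate(detector_scores, prior_scores):
    """Score modulation: combine each appearance-based detector score
    with the tag-based prediction for the same window (shown here as a
    simple product, matching the slide's example numbers)."""
    return [d * p for d, p in zip(detector_scores, prior_scores)]

# Slide's example: detector 0.7 / 0.8 / 0.9, tag-based 0.9 / 0.3 / 0.2
scores = modulate([0.7, 0.8, 0.9], [0.9, 0.3, 0.2])
# scores ≈ [0.63, 0.24, 0.18]: the strongest raw detection (0.9) is
# demoted because the tags say that location/scale is implausible.
```

Priming trades a small risk of missing the object for a large reduction in detector evaluations; modulation leaves the search untouched but reorders the hypotheses, which is what improves precision at a fixed recall.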
Experiments: Datasets
LabelMe: street and office scenes; ordered tag lists via the labels added; 5 classes; 56 unique taggers; 23 tags/image; Dalal & Triggs's HOG detector.
PASCAL VOC 2007: Flickr images; tag lists obtained on Mechanical Turk; 20 classes; 758 unique taggers; 5.5 tags/image; Felzenszwalb et al.'s LSVM detector.
Experiments
We evaluate detection speed and detection accuracy.
We compare the raw detector (HOG, LSVM) against the raw detector + our tag features.
We also show results using Gist [Torralba 2003] as context, for reference.
PASCAL: Performance evaluation

Accuracy (precision vs. recall, all 20 PASCAL classes):
- LSVM (AP = 33.69)
- LSVM+Tags (AP = 36.79)
- LSVM+Gist (AP = 36.28)
We know which detection hypotheses to trust most.

Speed (portion of windows searched vs. detection rate, all 20 LabelMe classes):
- Sliding (0.223)
- Sliding+Tags (0.098)
- Sliding+Gist (0.125)
We search fewer windows to achieve the same detection rate: the naïve sliding window search examines 70% of the windows, while we search only 30%.

[Precision-recall and windows-searched curves omitted.]
PASCAL: Accuracy vs Gist per class
[Bar chart omitted: per-class AP improvement of Tags vs. Gist for pottedplant, cat, sofa, boat, motorbike, train, car, chair, tvmonitor, and horse.]
PASCAL: Example detections (Bottle, Car)
LSVM+Tags (Ours) vs. LSVM alone.
Example tag lists: "Lamp, Person, Bottle, Dog, Sofa, Painting, Table"; "Person, Table, Chair, Mirror, Tablecloth, Bowl, Bottle, Shelf, Painting, Food"; "Car, License Plate, Building"; "Car, Door, Door, Gear, Steering Wheel, Seat, Seat, Person, Person, Camera".
PASCAL: Example detections (Dog, Person)
LSVM+Tags (Ours) vs. LSVM alone.
Example tag lists: "Dog, Floor, Hairclip"; "Dog, Dog, Dog, Person, Person, Ground, Bench, Scarf"; "Person, Microphone, Light"; "Horse, Person, Tree, House, Building, Ground, Hurdle, Fence".
PASCAL: Example failure cases
LSVM+Tags (Ours) vs. LSVM alone.
Example tag lists: "Aeroplane, Sky, Building, Shadow"; "Person, Person, Pole, Building, Sidewalk, Grass, Road"; "Dog, Clothes, Rope, Rope, Plant, Ground, Shadow, String, Wall"; "Bottle, Glass, Wine, Table".
Results: Observations
Often our implicit features predict:
- scale well for indoor objects
- position well for outdoor objects
Gist is usually better for y position, while our tag features are generally stronger for scale.
The method needs to have learned about the target objects from a variety of examples with different contexts.
Visual context and tag context are complementary.
Summary
- We want to learn what is implied (beyond which objects are present) by how a human provides tags for an image.
- Our approach translates existing insights about human viewing behavior (attention, importance, gaze, etc.) into enhanced object detection.
- The novel tag cues enable an effective localization prior.
- Significant gains with state-of-the-art detectors on two datasets.
Future work
- Joint multi-object detection
- From tags to natural language sentences
- Image retrieval applications