
Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags

READING BETWEEN THE LINES: OBJECT LOCALIZATION USING IMPLICIT CUES FROM IMAGE TAGS Sung Ju Hwang and Kristen Grauman University of Texas at Austin
Transcript
Page 1: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

READING BETWEEN THE LINES: OBJECT LOCALIZATION USING IMPLICIT CUES FROM IMAGE TAGS

Sung Ju Hwang and Kristen Grauman

University of Texas at Austin

Page 2: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Images tagged with keywords clearly tell us which object to search for.

Detecting tagged objects

Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore #24

Page 3: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags


Previous work using tagged images focuses on the noun ↔ object correspondence.

Duygulu et al. 2002

Fergus et al. 2005

Berg et al. 2004

Vijayanarasimhan & Grauman 2008

Detecting tagged objects

Images tagged with keywords clearly tell us which object to search for.

Page 4: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it

Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer


Can you guess where and what size the mug will appear in both images?

Main Idea

The list of tags on an image may give useful information beyond just what objects are present.

Page 5: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Main Idea

Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it

Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer

Image 1: mug is named first; absence of larger objects. Image 2: mug is named later in the list; presence of larger objects.

Tag as context

Page 6: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Feature: word presence/absence

Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it

Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer

The presence or absence of other objects, and how many of them there are, affects the scene layout.

The presence of smaller objects, such as a key, and the absence of larger objects hint that the image might be a close-up scene.

The presence of larger objects such as a desk and a bookshelf hints that the image depicts a typical office scene.

Page 7: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Feature: word presence/absence

Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it

Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer

Word   Mug  Computer  Screen  Keyboard  Desk  Bookshelf  Poster  Photo  Pen  Post-it  Toothbrush  Key
W1      1      0        0        1       0       0         0       1     1      1         1        1
W2      1      2        2        1       1       1         2       0     0      0         0        0

Blue: larger objects. Red: smaller objects.

Plain bag-of-words feature describing word frequency. Wi = word-frequency vector for image i.
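As an illustration, the word-count feature W for the two example tag lists might be computed as below. The vocabulary order and the function name are our own; the paper only specifies the feature itself.

```python
# Sketch: bag-of-words tag feature W, as described on this slide.
# Vocabulary order and helper names are illustrative, not from the paper.

VOCAB = ["mug", "computer", "screen", "keyboard", "desk", "bookshelf",
         "poster", "photo", "pen", "post-it", "toothbrush", "key"]

def word_count_feature(tags, vocab=VOCAB):
    """Return W: per-vocabulary-word frequency counts for one tag list."""
    tags = [t.lower() for t in tags]
    return [tags.count(word) for word in vocab]

# Tag lists for the two example images on the slide:
w1 = word_count_feature(["Mug", "Key", "Keyboard", "Toothbrush",
                         "Pen", "Photo", "Post-it"])
w2 = word_count_feature(["Computer", "Poster", "Desk", "Bookshelf", "Screen",
                         "Keyboard", "Screen", "Mug", "Poster", "Computer"])
```

Running this reproduces the W1 and W2 rows of the table above.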

Page 8: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Feature: tag rank

Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it

Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer

People tag the ‘important’ objects earlier.

If the object is tagged first, there is a high chance that it is the main object: large and centered.

If the object is tagged later, it might not be salient: it may be far from the center or small in scale.

Page 9: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Feature: tag rank

Blue: high relative rank (> 0.6)

Red: low relative rank (< 0.4)

Percentile of the absolute rank of the tag compared against its typical rank.

ri = percentile of the rank for tag i

Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it

Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer

Word   Mug   Computer  Screen  Keyboard  Desk  Bookshelf  Poster  Photo  Pen   Post-it  Toothbrush  Key
W1     0.80     0        0       0.51     0       0         0     0.28   0.72   0.82        0       0.90
W2     0.23    0.62     0.21     0.13    0.48    0.61      0.41    0      0      0          0        0

Green: medium relative rank (0.4 to 0.6)
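One way to read "percentile of the rank compared against its typical rank" is as the fraction of training occurrences where the word appeared later than it does here. A minimal sketch under that assumption (the training ranks below are made up; the paper's exact percentile definition may differ):

```python
# Sketch: relative tag-rank feature r_i. Compares a tag's rank in this
# image against the ranks the same word takes in training tag lists.
# Training data here is invented for illustration.

def rank_percentile(rank, typical_ranks):
    """Fraction of training occurrences where this word appeared at a
    later (larger) rank than in the current image. A value near 1 means
    the tag is unusually early, i.e. likely salient."""
    if not typical_ranks:
        return 0.0
    later = sum(1 for r in typical_ranks if r > rank)
    return later / len(typical_ranks)

# Hypothetical training ranks observed for the word "mug":
mug_training_ranks = [1, 2, 5, 8, 3, 9, 7, 4, 6, 10]

r_early = rank_percentile(1, mug_training_ranks)  # mug tagged first
r_late = rank_percentile(8, mug_training_ranks)   # mug tagged eighth
```

Tagging the mug first yields a high relative rank; tagging it eighth yields a low one, matching the Blue/Red legend above.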

Page 10: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Feature: proximity

1) Mug  2) Key  3) Keyboard  4) Toothbrush  5) Pen  6) Photo  7) Post-it

1) Computer  2) Poster  3) Desk  4) Bookshelf  5) Screen  6) Keyboard  7) Screen  8) Mug  9) Poster  10) Computer

People tend to move their eyes to nearby objects.

Objects that are close to each other in the tag list are likely to be close in the image


Page 11: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Feature: proximity

Encoded as the inverse of the average rank difference between tag words. Pi,j = inverse rank difference between tags i and j.

1) Mug  2) Key  3) Keyboard  4) Toothbrush  5) Pen  6) Photo  7) Post-it

1) Computer  2) Poster  3) Desk  4) Bookshelf  5) Screen  6) Keyboard  7) Screen  8) Mug  9) Poster  10) Computer


Image 1:
Word       Mug  Screen  Keyboard  Desk  Bookshelf
Mug         1     0       0.5      0       0
Screen            0        0       0       0
Keyboard                   1       0       0
Desk                               0       0
Bookshelf                                  0

Image 2:
Word       Mug  Screen  Keyboard  Desk  Bookshelf
Mug         1     1       0.5     0.2    0.25
Screen            1        1      0.33    0.5
Keyboard                   1      0.33    0.5
Desk                               1       1
Bookshelf                                  1

Blue: objects close to each other
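The proximity entries above can be sketched as follows. This is an illustrative implementation of "inverse of the average rank difference"; the paper's exact handling of repeated tags may differ.

```python
# Sketch: proximity feature P_{i,j} between two tag words, given the
# rank(s) at which each word occurs in the tag list. Illustrative only.

def proximity(ranks_i, ranks_j):
    """Inverse of the mean absolute rank difference over occurrence pairs
    (self-pairs excluded). Returns 0.0 if either word is absent."""
    if not ranks_i or not ranks_j:
        return 0.0
    diffs = [abs(a - b) for a in ranks_i for b in ranks_j if a != b]
    if not diffs:  # same single occurrence (diagonal entry)
        return 1.0
    return 1.0 / (sum(diffs) / len(diffs))

# Image 2 tag list: Computer, Poster, Desk, Bookshelf, Screen,
#                   Keyboard, Screen, Mug, Poster, Computer  (ranks 1..10)
p_desk_bookshelf = proximity([3], [4])  # adjacent in the list
p_mug_desk = proximity([8], [3])        # five positions apart
```

Adjacent tags (Desk, Bookshelf) give 1.0 and distant ones (Mug, Desk) give 0.2, matching the Image 2 table.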

Page 12: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Overview of the approach

Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it

Image

Tags

W = {1, 0, 2, … , 3}R = {0.9, 0.5, … , 0.2}P = {0.25, 0.33, … , 0.1}

Appearance Model

Implicit tag features

P(X|W), P(X|R), P(X|P)

P(X|A): sliding window detector

What?

Where?

Localization result

Priming the detector

Getting appearance-based prediction

Modeling P(X|T)

Page 13: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Overview of the approach

Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it

Image

Tags

W = {1, 0, 2, … , 3}R = {0.9, 0.5, … , 0.2}P = {0.25, 0.33, … , 0.1}

Appearance Model

Implicit tag features

P(X|W), P(X|R), P(X|P)

P(X|A): localization result

+ What?

Modulating the detector

Sliding window detector

Getting appearance-based prediction

Modeling P(X|T)


Page 14: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Approach: modeling P(X|T)

We model this conditional PDF P(X|T) directly, without calculating the joint distribution P(X,T), using a mixture density network (MDN).

We wish to know the conditional PDF of the location and scale of the target object, given the tag features: P(X|T), where X = (s, x, y) and T = tag features.

Lamp, Car, Wheel, Wheel, Light

Window, House, House, Car, Car, Road, House, Lightpole

Car, Windows, Building, Man, Barrel, Car, Truck, Car

Boulder, Car

Top 30 most likely positions for class car. Bounding boxes sampled according to P(X|T).
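Once an MDN has mapped the tag feature T to mixture parameters, evaluating P(X|T) at a candidate (s, x, y) is straightforward. The sketch below uses the standard MDN parameterization (softmax for mixing weights, exp for standard deviations); the numeric "network outputs" are invented for illustration.

```python
import math

# Sketch: evaluating P(X|T) from raw mixture-density-network outputs.
# An MDN maps the tag feature T to the parameters of a Gaussian mixture
# over X = (s, x, y). All numbers below are hypothetical.

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def mdn_density(x, alpha_logits, means, log_sigmas):
    """Density of a diagonal-Gaussian mixture evaluated at point x."""
    alphas = softmax(alpha_logits)   # mixing weights, sum to 1
    density = 0.0
    for alpha, mu, log_sigma in zip(alphas, means, log_sigmas):
        comp = 1.0
        for xd, md, ls in zip(x, mu, log_sigma):
            sigma = math.exp(ls)     # positive std dev via exp
            comp *= math.exp(-0.5 * ((xd - md) / sigma) ** 2) / (
                sigma * math.sqrt(2.0 * math.pi))
        density += alpha * comp
    return density

# Hypothetical network outputs for a two-component mixture:
d = mdn_density(x=(0.3, 0.5, 0.6),
                alpha_logits=[0.2, -0.1],
                means=[(0.3, 0.5, 0.6), (0.8, 0.2, 0.2)],
                log_sigmas=[(-1.0, -1.0, -1.0), (-0.5, -0.5, -0.5)])
```

Sampling bounding boxes from this mixture, as the figure caption describes, amounts to picking a component by its weight and drawing from that Gaussian.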

Page 15: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Approach: Priming the detector

Region to search

Ignored

Ignored

Most probable scale

Unlikely scale

Then how can we make use of this learned distribution P(X|T)?

1) Use it to speed the detection process
2) Use it to modulate the detection confidence score

1) Rank the detection results based on the learned P(X|T)
2) Search only the probable region and scale, following the rank
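The priming step can be sketched as: rank candidate windows by the tag prior and run the expensive appearance detector only on the top-ranked ones. The prior and detector below are toy stand-ins for the learned models, and the budget is arbitrary.

```python
# Sketch: priming a sliding-window detector with the learned prior P(X|T).
# prior() and detector_score() are toy stand-ins, not the paper's models.

def primed_detection(windows, prior, detector_score, budget):
    """Evaluate the detector only on the `budget` windows ranked highest
    by the tag prior, and return the best-scoring one."""
    ranked = sorted(windows, key=prior, reverse=True)
    return max(ranked[:budget], key=detector_score)

# Toy windows X = (scale, x, y) on a coarse grid:
windows = [(s / 10, x / 10, y / 10)
           for s in range(1, 10) for x in range(10) for y in range(10)]

# A prior preferring small, centered objects (as close-up tags would
# suggest), and a toy appearance score:
toy_prior = lambda w: -abs(w[0] - 0.2) - abs(w[1] - 0.5) - abs(w[2] - 0.5)
toy_detector = lambda w: -abs(w[0] - 0.3) - abs(w[1] - 0.4)

best = primed_detection(windows, toy_prior, toy_detector, budget=50)
```

Here the detector fires on 50 of 900 candidate windows; restricting the search this way is what yields the speedup reported later.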

Page 16: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Approach: Modulating the detector

Then how can we make use of this learned distribution P(X|T)?

1) Use it to speed the detection process
2) Use it to modulate the detection confidence score

P(X|A)

Detector

P(X|W), P(X|R), P(X|P)

Logistic regression classifier

We learn the weights for each prediction: P(X|A), P(X|W), P(X|R), and P(X|P)

Lamp, Car, Wheel, Wheel, Light

Image tags
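The modulation step combines the four predictions through a logistic regression. A minimal sketch follows; the weights and bias here are invented for illustration, whereas in the approach they are learned from training data.

```python
import math

# Sketch: modulating the detector confidence with logistic regression over
# the four predictions P(X|A), P(X|W), P(X|R), P(X|P). Weights are invented.

def combined_score(p_a, p_w, p_r, p_p, weights, bias):
    """Logistic-regression combination of appearance and tag-based predictions."""
    z = bias + sum(w * p for w, p in zip(weights, (p_a, p_w, p_r, p_p)))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights that trust appearance most but let the tag
# features down-weight implausible detections:
weights, bias = (3.0, 1.0, 1.0, 1.0), -3.0

strong = combined_score(0.9, 0.8, 0.7, 0.6, weights, bias)  # all cues agree
weak = combined_score(0.9, 0.1, 0.1, 0.1, weights, bias)    # tag cues disagree
```

A detection the appearance model likes but the tag features contradict ends up with a lower combined confidence, which is exactly the re-ranking illustrated on the next slides.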

Page 17: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Approach: Modulating the detector

Then how can we make use of this learned distribution P(X|T)?

1) Use it to speed the detection process
2) Use it to modulate the detection confidence score

Predictions based on the original detector score: 0.7, 0.8, 0.9

Page 18: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Approach: Modulating the detector

Then how can we make use of this learned distribution P(X|T)?

1) Use it to speed the detection process
2) Use it to modulate the detection confidence score

Predictions based on the original detector score: 0.7, 0.8, 0.9

Predictions based on the tag features: 0.3, 0.9, 0.2

Page 19: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Approach: Modulating the detector

Then how can we make use of this learned distribution P(X|T)?

1) Use it to speed the detection process
2) Use it to modulate the detection confidence score

Combined predictions: 0.63, 0.24, 0.18

Page 20: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Experiments

We compare the following two criteria:

Detection Speed: number of windows to search
Detection Accuracy: AUROC, AP

on three methods:

Appearance-only
Appearance + Gist
Appearance + tag features (ours)

Page 21: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Experiments: Dataset

LabelMe: contains the ordered tag lists. Used Dalal & Triggs's HOG detector.

PASCAL VOC 2007: contains images with high variance in composition. Tag lists were obtained from anonymous workers on Mechanical Turk. Used Felzenszwalb's LSVM detector.

Dataset                          LabelMe     PASCAL
Number of training/test images   3799/2553   5011/4953
Number of classes                5           20
Number of keywords               209         399
Number of taggers                56          758
Avg. number of tags per image    23          5.5

Page 22: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

LabelMe: Performance Evaluation

More accurate detection, because we know which hypotheses to trust most.

Modified version of the HOG detector by Dalal and Triggs.

Faster detection, because we know where to look first.

Page 23: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Results: LabelMe

Sky, Buildings, Person, Sidewalk, Car, Car, Road

Car, Window, Road, Window, Sky, Wheel, Sign

HOG  HOG+Gist  HOG+Tags

Gist and tags are likely to predict the same position but different scales. Most of the accuracy gain using the tag features comes from accurate scale prediction.

Page 24: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Results: LabelMe

Desk, Keyboard, Screen

Bookshelf, Desk, Keyboard, Screen

Mug, Keyboard, Screen, CD

HOG HOG+Gist HOG+Tags

Page 25: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

PASCAL VOC 2007: Performance Evaluation

Need to test fewer windows to achieve the same detection rate.

Modified Felzenszwalb's LSVM detector.

9.2% improvement in accuracy over all classes (Average Precision).

Page 26: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Per-class localization accuracy

Significant improvement on: Bird, Boat, Cat, Dog, Potted plant

Page 27: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

PASCAL VOC 2007 (examples)

Aeroplane

Building, Aeroplane, Smoke

Aeroplane, Aeroplane, Aeroplane, Aeroplane, Aeroplane

Lamp, Person, Bottle, Dog, Sofa, Painting, Table

Bottle

Person, Table, Chair, Mirror, Tablecloth, Bowl, Bottle, Shelf, Painting, Food

Ours

LSVM baseline

Page 28: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

PASCAL VOC 2007 (examples)

Dog

Dog, Floor, Hairclip

Dog, Dog, Dog, Person, Person, Ground, Bench, Scarf

Person

Person, Microphone, Light

Horse, Person, Tree, House, Building, Ground, Hurdle, Fence

Page 29: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

PASCAL VOC 2007 (Failure case)

Aeroplane, Sky, Building, Shadow

Person, Person, Pole, Building, Sidewalk, Grass, Road

Dog, Clothes, Rope, Rope, Plant, Ground, Shadow, String, Wall

Bottle, Glass, Wine, Table

Page 30: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Some Observations

We find that implicit features often predict:
- scale better for indoor objects
- position better for outdoor objects

We find Gist is usually better for y position, while tags are generally stronger for scale. This agrees with previous experiments using Gist.

In general, we need to have learned about the target objects from a variety of examples with different contexts.

Page 31: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Conclusion

We showed how to exploit the implicit information present in human tagging behavior to improve object localization performance in both speed and accuracy.

Page 32: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Future Work

Joint multi-object detection

From tags to natural language sentences

Image retrieval

Using WordNet to group words with similar meanings

Page 33: Reading Between the Lines:  Object Localization Using Implicit Cues from Image Tags

Conclusion

We showed how to exploit the implicit information present in human tagging behavior to improve object localization performance in both speed and accuracy.

