Visual Relationship Detection with Language Priors
Cewu Lu*, Ranjay Krishna*, Michael Bernstein, Li Fei-Fei

Stanford University

Model overview

Challenges

Diverse model predictions

Motivation

Visual module

Dataset

falling off, riding, pushing, next to, carrying

While objects are the core building blocks of an image, it is often the relationships between objects that determine the holistic interpretation.

person kick ball (action)

person on top of ramp (spatial)

person wear shirt (verb)

elephant taller than person (comparative)

motorcycle with wheel (preposition)

Model overview figure: INPUT is an image of a dog riding a surfboard; OUTPUT is shown for three tasks: phrase detection (dog - ride - surfboard), predicate detection (ride), and relationship detection (dog, surfboard, ride).

Quantitative results

Visual Relationship Detection

Zero-shot Relationship Detection

Content-based Image Search

1. Quadratic explosion of the label space: detect objects and predicates individually.

2. Long-tail distribution of relationships: leverage semantic word embeddings.

First, we use R-CNN to detect all the objects in an image. Next, we take pairs of detected objects and use our visual module to predict the relationship between them:
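A plausible form of this scoring function (reconstructed here from the definitions below; P_i and P_j, the detector's class confidences for the two boxes, are an assumption not spelled out in this transcript) is:

V(R_{\langle i,k,j \rangle}, \Theta \mid \langle O_1, O_2 \rangle) = P_i(O_1)\,\big(z_k^{\top}\,\mathrm{CNN}(O_1, O_2) + s_k\big)\,P_j(O_2)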

where O_1 and O_2 are the subject and object features. Θ is the parameter set {z_k, s_k}, learned to convert our CNN features into relationship likelihoods.

Language module

We score every relationship using the word2vec features of its subject and object:
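A plausible form of this scoring function (built from the definitions below; the concatenation of the two word embeddings is an assumption) is:

f(R_{\langle i,k,j \rangle}, W) = w_k^{\top}\,\big[\mathrm{word2vec}(t_i),\ \mathrm{word2vec}(t_j)\big] + b_k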

where t_i and t_j are the subject and object words. W is the set {{w_1, b_1}, ..., {w_K, b_K}}, where each row represents one of our K predicates.

We train W by ensuring that similar relationships are projected closer together:
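One plausible way to write this term (an assumption, using the d(R, R') and var() definitions below) keeps the squared score difference between two relationships proportional to their word2vec distance and minimizes the variance of that ratio over all pairs:

K(W) = \mathrm{var}\Big\{ \frac{\big(f(R, W) - f(R', W)\big)^2}{d(R, R')} \ :\ \forall\, R, R' \Big\}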

where d(R, R') is the sum of the cosine distances (in word2vec space) between the corresponding objects and predicates of the two relationships. var() is the variance function.

Finally, to ensure that f() outputs the likelihood of a relationship, we learn which relationships are more probable using a ranking loss function:
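A plausible form of this ranking loss (an assumption; R denotes a relationship that occurs more frequently in the training data than R') is:

L(W) = \sum_{\{R, R'\}} \max\big\{ f(R', W) - f(R, W) + 1,\ 0 \big\}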

So, our final objective function becomes:
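One plausible way to write the combined objective (an assumption; C(Θ, W) denotes a max-margin term that ranks each annotated relationship above other object pairs and predicates in the same image, and λ_1, λ_2 are weighting hyperparameters) is:

\min_{\Theta, W}\ C(\Theta, W) + \lambda_1\, L(W) + \lambda_2\, K(W)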

where λ_1 and λ_2 balance the language-module terms L(W) and K(W) against the visual-language term C(Θ, W).
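At test time, relationships are ranked by the product of the two modules' scores. A minimal sketch of that inference step, assuming hypothetical helpers (detector, union_features, word2vec) rather than the authors' released code:

import numpy as np

# A minimal inference sketch (not the authors' released code): score every
# ordered pair of detected objects against all K predicates by multiplying the
# visual-module score with the language-module score, and keep the best
# predicate for each pair. detector, union_features and word2vec are
# hypothetical callables supplied by the caller.
def predict_relationships(image, detector, union_features, word2vec,
                          Z, s, W, b, predicates):
    # Z: (K, D) weights on CNN features of the union box, s: (K,) biases
    # W: (K, 2E) weights on concatenated subject/object embeddings, b: (K,) biases
    objects = detector(image)  # list of dicts: {'label', 'box', 'conf'}
    results = []
    for o1 in objects:
        for o2 in objects:
            if o1 is o2:
                continue
            cnn_feat = union_features(image, o1['box'], o2['box'])      # (D,)
            txt_feat = np.concatenate([word2vec(o1['label']),
                                       word2vec(o2['label'])])          # (2E,)
            visual = o1['conf'] * (Z @ cnn_feat + s) * o2['conf']       # (K,)
            language = W @ txt_feat + b                                 # (K,)
            score = visual * language          # combined score per predicate
            k = int(np.argmax(score))
            results.append((o1['label'], predicates[k], o2['label'],
                            float(score[k])))
    # highest-scoring relationships first
    return sorted(results, key=lambda r: -r[-1])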

person wear glasses

person wear shirt

tower attach to building

person on skis

person use computer

wheel on motorcycle

person wear pants

person ride bicycle

person on skateboard

pole wear hat

horse wear hat

bicycle behind pole

Zero-shot predictions

person sit hydrant

Single image output: spatial, comparative, asymmetrical, verb and prepositional relationships
