
Visual Relationship Detection with Language Priors

Cewu Lu*, Ranjay Krishna*, Michael Bernstein, Li Fei-Fei

Stanford University

Model overview

Challenges

Diverse model predictions

Motivation

Visual module

Dataset

falling off, riding, pushing, next to, carrying

While objects are the core building blocks of an image, it is often the relationships between objects that determine the holistic interpretation.

person kick ball

person on top of ramp

person wear shirt

elephant taller than person

motorcycle with wheel

Predicate types: Action, Spatial, Comparative, Verb, Preposition

[Model overview figure] INPUT: an image. OUTPUT: phrase detection (dog ride surfboard), predicate detection (ride), and relationship detection (dog - ride - surfboard).

Quantitative results

Visual Relationship Detection

Zero-shot Relationship Detection

Content-based Image Search

1. Quadratic explosion of the label space: detect objects and predicates individually instead of learning a separate detector for every possible relationship.

2. Long-tail distribution of relationships: use semantic word embeddings so that frequent relationships can inform rare ones.

First, we use R-CNN to detect all the objects in an image. Next, we take pairs of detected objects and use our visual module to predict the likelihood of each relationship between them:

V(R<i,k,j>, Θ | <O1, O2>) = Pi(O1) (zk^T CNN(O1, O2) + sk) Pj(O2)

where O1 and O2 are the subject and object proposals with detection confidences Pi(O1) and Pj(O2), and CNN(O1, O2) is the CNN feature of their union box. Θ is the parameter set {zk, sk}, the parameters learnt to convert our CNN features to relationship likelihoods.
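A minimal sketch of this scoring in Python, assuming precomputed detector confidences and a CNN feature for the union box; the names (cnn_feat, Z, s) are illustrative, not taken from the poster:

import numpy as np

def visual_module_scores(cnn_feat, p_subj, p_obj, Z, s):
    """Score every predicate k for one (subject, object) box pair.

    cnn_feat : (D,) CNN feature of the union box of the two objects
    p_subj   : detector confidence for the subject box, Pi(O1)
    p_obj    : detector confidence for the object box, Pj(O2)
    Z        : (K, D) matrix whose rows are the per-predicate weights zk
    s        : (K,) vector of per-predicate biases sk
    """
    predicate_scores = Z @ cnn_feat + s       # zk^T CNN(O1, O2) + sk for each k
    return p_subj * predicate_scores * p_obj  # V(R_k, Theta | <O1, O2>) for each k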

Language module

We score every relationship using the word2vec features of its subject and object words:

f(R<i,k,j>, W) = wk^T [word2vec(ti), word2vec(tj)] + bk

where ti and tj are the subject and object words. W is the set {{w1, b1}, . . . , {wK, bK}}, where each {wk, bk} represents one of our K predicates.
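A sketch of this projection with hypothetical argument names; it simply applies the linear parameters of one predicate to the concatenated word embeddings:

import numpy as np

def language_module_score(vec_subj, vec_obj, w_k, b_k):
    """f(R, W) for predicate k: a linear projection of the concatenated
    word2vec embeddings of the subject and object words."""
    return float(w_k @ np.concatenate([vec_subj, vec_obj]) + b_k)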

We train W by ensuring that similar relationships are projected closer together:

K(W) = var( [f(R, W) - f(R', W)]^2 / d(R, R') ), taken over pairs of relationships R, R'

where d(R, R') is the sum of the cosine distances (in word2vec space) between the two objects and the predicates of the relationships, and var() is the variance function.
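A small sketch of this regularizer, assuming the language-module scores and a pairwise word2vec distance are already available; the names are illustrative:

import numpy as np

def similarity_regularizer(f_scores, pair_dist, pairs):
    """K(W): variance of [f(R) - f(R')]^2 / d(R, R') over sampled relationship pairs.

    f_scores  : (N,) language-module scores f(R, W) for N training relationships
    pair_dist : pair_dist(a, b) -> d(R_a, R_b), summed word2vec cosine distance
    pairs     : iterable of (a, b) index pairs of relationships to compare
    """
    ratios = [(f_scores[a] - f_scores[b]) ** 2 / pair_dist(a, b) for a, b in pairs]
    return float(np.var(ratios))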

Finally, to ensure that f() outputs the likelihood of a relationship, we learn which relationships are more probable than others using a ranking loss function:

L(W) = Σ max{ f(R', W) - f(R, W) + 1, 0 }, summed over pairs where R occurs more frequently than R' in the training data.
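A sketch of this hinge-style ranking loss, assuming index pairs (r, rp) in which relationship r is the more frequent one; names are hypothetical:

def occurrence_ranking_loss(f_scores, ordered_pairs):
    """L(W): hinge loss pushing a more frequent relationship R (index r) to score
    higher than a less frequent one R' (index rp)."""
    return sum(max(f_scores[rp] - f_scores[r] + 1.0, 0.0) for r, rp in ordered_pairs)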

So, our final objective function becomes:

min over Θ, W of C(Θ, W) + λ1 L(W) + λ2 K(W)

where C(Θ, W) is a rank loss over the training images that pushes the combined score V(R, Θ) f(R, W) of each ground-truth relationship above that of any other object pair and predicate, and λ1, λ2 weight the language-module terms.
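A small sketch of combining these terms, and of ranking predicates by the product V(R, Θ) f(R, W) at test time, as the rank loss above suggests; lam1 and lam2 are placeholder weights, not values from the paper:

import numpy as np

def total_objective(C, L_rank, K_var, lam1=1.0, lam2=1.0):
    """C(Theta, W) + lam1 * L(W) + lam2 * K(W)."""
    return C + lam1 * L_rank + lam2 * K_var

def predict_predicate(visual_scores, language_scores):
    """At test time, score each predicate by V(.) * f(.) and keep the best one."""
    return int(np.argmax(visual_scores * language_scores))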

person wear glasses

person wear shirt

tower attach to building

person on skis

person use computer

wheel on motorcycle

person wear pants

person ride bicycle

person on skateboard

pole wear hat

horse wear hat

bicycle behind pole

Zero-shot predictions

person sit hydrant

Single image output: spatial, comparative, asymmetrical, verb and prepositional relationships