Visual Relationship Detection with Language Priors
Cewu Lu*, Ranjay Krishna*, Michael Bernstein, Li Fei-Fei
Stanford University
Model overview
Challenges
Diverse model predictions
Motivation
Visual module
Dataset
falling off, riding, pushing, next to, carrying
While objects are the core building blocks of an image, it is often the relationships between objects that determine the holistic interpretation.
person kick ball
person on top of ramp
person wear shirt
elephant taller than person
motorcycle with wheel
Action, Spatial, Comparative, Verb, Preposition
[Figure: INPUT image and OUTPUT for three tasks (phrase detection, predicate detection, relationship detection); example labels: dog, surfboard, ride, dog - ride - surfboard.]
Quantitative results
Visual Relationship Detection
Zero-shot Relationship Detection
Content-based Image Search
1. Quadratic explosion of the label space: addressed by detecting objects and predicates individually.
2. Long-tail distribution of relationships: addressed with semantic word embeddings.
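To make the first challenge concrete, here is a minimal sketch that compares the number of labels needed when every relationship triple gets its own detector against the number needed when objects and predicates are detected individually. The vocabulary sizes are illustrative assumptions, not figures taken from this poster, and `label_space_sizes` is a hypothetical helper.

```python
def label_space_sizes(num_objects, num_predicates):
    """Return (joint, factored) label counts.

    joint:    one label per (subject, predicate, object) triple
    factored: one label per object category plus one per predicate
    """
    joint = num_objects * num_predicates * num_objects
    factored = num_objects + num_predicates
    return joint, factored


# Illustrative (assumed) vocabulary sizes.
joint, factored = label_space_sizes(num_objects=100, num_predicates=70)
print(f"joint triples: {joint}, factored detectors: {factored}")
```

The joint count grows with the square of the object vocabulary, while the factored count grows linearly, which is why individual detection scales better.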
First, we use RCNN to detect all the objects in an image. Next, we take pairs of objects and use our visual module to predict relationships between them:
where O1 and O2 are the subject and object features. Θ is the parameter set {zk, sk}: the parameters learnt to convert our CNN features to relationship likelihoods.
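The scoring equation itself did not survive in this text, so the following is only a sketch of one form consistent with the definitions above. It assumes the visual score is the product of the subject's detection likelihood, a linear predicate score on the CNN features of the object pair (using zk and sk, written `z_k` and `s_k` below), and the object's detection likelihood. The function names and toy numbers are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def visual_score(p_subject, p_object, pair_features, z_k, s_k):
    """Assumed visual score for one (subject, predicate k, object) triple.

    p_subject, p_object: detection likelihoods of the two objects
    pair_features:       CNN features of the object pair (toy vector here)
    z_k, s_k:            learnt weights and bias for predicate k
    """
    predicate_score = dot(z_k, pair_features) + s_k
    return p_subject * predicate_score * p_object


# Toy values, chosen only for illustration.
score = visual_score(
    p_subject=0.9,
    p_object=0.8,
    pair_features=[1.0, 2.0],
    z_k=[0.5, 0.25],
    s_k=0.0,
)
print(score)
```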
Language module
We score every relationship using its word2vec features:
where ti and tj are the subject and object words. W is the set {{w1, b1}, ..., {wk, bk}}, where each row represents one of our K predicates.
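As a rough illustration of the projection described above, the sketch below assumes that the score for predicate k is a linear function of the concatenated word vectors of the subject and object words. The two-dimensional vectors stand in for real word2vec embeddings, and `language_score` is a hypothetical name.

```python
def language_score(subject_vec, object_vec, w_k, b_k):
    """Assumed language score for predicate k.

    subject_vec, object_vec: word vectors of the subject and object words
    w_k, b_k:                the row of W belonging to predicate k
    """
    pair = list(subject_vec) + list(object_vec)  # concatenate the two words
    if len(pair) != len(w_k):
        raise ValueError("w_k must match the concatenated vector length")
    return sum(w * x for w, x in zip(w_k, pair)) + b_k


# Toy 2-d vectors standing in for word2vec embeddings (assumed values).
person = [1.0, 0.0]
bicycle = [0.0, 1.0]
print(language_score(person, bicycle, w_k=[0.5, 0.5, 0.5, 0.5], b_k=0.1))
```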
We train W by ensuring that similar relationships are projected closer together:
where d(R, R’) is the sum of the cosine distances (in word2vec space) between the two objects and the predicates of the relationships. var() is the variance function.
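The training criterion can be sketched from the definitions of d(R, R') and var() given above. The version below assumes that each pair of relationships contributes the ratio of its squared score difference to its word2vec distance, and that the quantity to minimise is the variance of those ratios. That exact form, the small `eps` guard against zero distances, and all names are assumptions.

```python
import math
from itertools import combinations
from statistics import pvariance


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def relationship_distance(r1, r2):
    """d(R, R'): summed cosine distances over (subject, predicate, object)."""
    return sum(cosine_distance(a, b) for a, b in zip(r1, r2))


def projection_variance(relationships, scores, eps=1e-8):
    """Variance of squared score gaps divided by d(R, R') (assumed form)."""
    ratios = []
    for i, j in combinations(range(len(relationships)), 2):
        d = relationship_distance(relationships[i], relationships[j])
        ratios.append((scores[i] - scores[j]) ** 2 / (d + eps))
    return pvariance(ratios)


# Each relationship is a (subject, predicate, object) triple of toy vectors.
r1 = ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
r2 = ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0])
r3 = ([1.0, 1.0], [1.0, 0.0], [1.0, 0.0])
print(projection_variance([r1, r2, r3], scores=[1.0, 0.0, 0.8]))
```

A low variance means that score gaps are roughly proportional to semantic distance, so similar relationships end up with similar scores.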
Finally, to ensure that f() outputs the likelihood of a relationship, we learn which relationships are more probable using a ranking loss function:
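One way to read this ranking loss is as a pairwise hinge: whenever relationship R occurs more often than R' in the training data, R should score higher than R' by some margin. The sketch below assumes that hinge form and a margin of 1. Neither is confirmed by the text here, and the counts and scores are invented for illustration.

```python
def ranking_loss(scores, counts, margin=1.0):
    """Pairwise hinge ranking loss over relationships (assumed form).

    scores: language-module score per relationship
    counts: training-set frequency per relationship
    """
    loss = 0.0
    for r, count_r in counts.items():
        for r_prime, count_r_prime in counts.items():
            if count_r > count_r_prime:
                # Penalise a rarer relationship scoring within `margin` of,
                # or above, a more frequent one.
                loss += max(scores[r_prime] - scores[r] + margin, 0.0)
    return loss


# Hypothetical frequencies and scores.
counts = {"person ride bicycle": 50, "pole wear hat": 1}
well_ranked = {"person ride bicycle": 2.0, "pole wear hat": 0.5}
badly_ranked = {"person ride bicycle": 0.5, "pole wear hat": 2.0}
print(ranking_loss(well_ranked, counts), ranking_loss(badly_ranked, counts))
```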
So, our final objective function becomes:
where
person wear glasses
person wear shirt
tower attach to building
person on skis
person use computer
wheel on motorcycle
person wear pants
person ride bicycle
person on skateboard
pole wear hat
horse wear hat
bicycle behind pole
Zero-shot predictions
person sit hydrant
Single image output
spatial, comparative, asymmetrical, verb and prepositional relationships