GuessWhat?! Cooperative Visual Dialog Agents
GuessWhat?! Visual object discovery through multi-modal dialogue [1]
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning [2]
[1] Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, Aaron Courville
[2] Abhishek Das, Satwik Kottur, Jose M. F. Moura, Stefan Lee, Dhruv Batra
Presented by Ruiyi (Roy) Zhang
February 16th, 2018
Main ideas (GuessWhat?!)
- Key contribution: the introduction of the GuessWhat?! dataset, based on the MS COCO dataset
- Define sub-tasks: the questioner, guesser and oracle tasks
- Establish initial baselines for the introduced tasks
Figure: Two example games in the dataset (images #203974 and #168019), each a sequence of questions answered Yes or No by the oracle.
Game 1 questions: "Is it a person?", "Is it an item being worn or held?", "Is it a snowboard?", "Is it the red one?", "Is it the one being held by the person in blue?"
Game 2 questions: "Is it a cow?", "Is it the big cow in the middle?", "Is the cow on the left?", "On the right?", "First cow near us?"
A GuessWhat?! game
- An image I ∈ R^{M×N} containing a set of K segmented objects {O_1, ..., O_K}.
- Each object O_k is assigned an object category c_k ∈ {1, ..., C}.
- The game further consists of a sequence of questions and answers D = {q_1, a_1, ..., q_J, a_J}, produced by the questioner and the oracle. Each answer a_j ∈ {Yes, No, N/A}.
- The oracle has access to the identity of the correct object O_correct; the prediction of the questioner is denoted O_predict.
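The game definition above can be written down as a small data structure. This is a minimal sketch, not the dataset's actual schema: the class names, the field names, and the (x, y, w, h) bounding box standing in for a segmented object are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# The three answers the oracle may give, as listed on the slide.
ANSWERS = ("Yes", "No", "N/A")


@dataclass
class SegmentedObject:
    category: int                              # c_k in {1, ..., C}
    bbox: Tuple[float, float, float, float]    # hypothetical (x, y, w, h) stand-in for the segmentation


@dataclass
class Game:
    image_size: Tuple[int, int]                # (M, N)
    objects: List[SegmentedObject]             # {O_1, ..., O_K}
    correct_index: int                         # index of O_correct, known to the oracle only
    dialogue: List[Tuple[str, str]] = field(default_factory=list)  # D = [(q_1, a_1), ...]

    def add_turn(self, question: str, answer: str) -> None:
        """Append one question-answer pair, rejecting answers outside {Yes, No, N/A}."""
        if answer not in ANSWERS:
            raise ValueError(f"answer must be one of {ANSWERS}, got {answer!r}")
        self.dialogue.append((question, answer))

    def is_success(self, predicted_index: int) -> bool:
        """The game is won when O_predict equals O_correct."""
        return predicted_index == self.correct_index
```

Keeping `correct_index` on the game object mirrors the asymmetry of the setup: an oracle implementation would read it, while a questioner or guesser implementation should only ever see `image_size`, `objects` and `dialogue`.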
The oracle task
Produce a yes/no answer for any object within an image, given a natural language question.
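[Figure: oracle model. The question ("Is it a vase?") is encoded word by word with an LSTM. VGG16 features of the context and of the object crop, the spatial information and the object category are the other inputs. An MLP outputs Yes / No / Not applicable.]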
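The slide names "spatial information" as an oracle input without saying how it is encoded. The sketch below assumes one plausible encoding: an 8-dimensional vector built from the object's bounding box, with coordinates rescaled to [-1, 1] and the origin at the image centre. The function names, the (x, y, w, h) box format and the plain concatenation in `oracle_input` are assumptions for illustration, not a statement of the authors' implementation.

```python
from typing import List, Sequence, Tuple


def spatial_features(bbox: Tuple[float, float, float, float],
                     image_size: Tuple[float, float]) -> List[float]:
    """Encode a box (x, y, w, h) in pixels as 8 numbers in normalized coordinates.

    Assumed layout: [x_min, y_min, x_max, y_max, x_center, y_center, width, height],
    where image coordinates are mapped to [-1, 1] with the origin at the image centre.
    """
    x, y, w, h = bbox
    img_w, img_h = image_size

    def norm(value: float, size: float) -> float:
        return 2.0 * value / size - 1.0

    x_min, y_min = norm(x, img_w), norm(y, img_h)
    x_max, y_max = norm(x + w, img_w), norm(y + h, img_h)
    x_center, y_center = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    return [x_min, y_min, x_max, y_max, x_center, y_center,
            x_max - x_min, y_max - y_min]


def oracle_input(question_encoding: Sequence[float],
                 category_embedding: Sequence[float],
                 spatial: Sequence[float]) -> List[float]:
    """Concatenate the inputs of a "Question + Category + Spatial" style model."""
    return list(question_encoding) + list(category_embedding) + list(spatial)
```

A box covering the whole image maps to the corners (-1, -1) and (1, 1), with centre (0, 0) and normalized width and height of 2.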
The guesser task
Guesser: Given an image I and a sequence of questions and answers D_J, predict the correct object O_correct from the set of all objects O.
Questioner: Given an image I and a sequence of T questions and answers D_{≤T}, produce a new question q_{T+1}.
The Guesser model:
[Figure: the dialogue ("Is it a vase? Yes. Is it partially visible? No. Is it in the left corner? No. Is it the turquoise and purple one? Yes.") is fed to an LSTM / HRED encoder; each of the objects obj1, ..., obj4 passes through an MLP; a softmax over the objects yields O_predict.]
Figure: Overview of the guesser model for an image with 4 segmented objects. The weights are shared among the MLPs.
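The final scoring step of the guesser can be sketched as follows. This is a minimal illustration that assumes the dialogue encoding and the per-object MLP outputs live in the same vector space and are compared by a dot product before the softmax; the slide only shows the MLPs and the softmax, so the dot product and the function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guess(dialogue_embedding: Sequence[float],
          object_embeddings: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Score every object against the dialogue and return (argmax index, probabilities).

    `object_embeddings` stands in for the outputs of the shared-weight MLPs,
    one vector per segmented object.
    """
    scores = [sum(d * o for d, o in zip(dialogue_embedding, emb))
              for emb in object_embeddings]
    probs = softmax(scores)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, probs
```

Because the same MLP embeds every object, the model handles images with any number of objects: only the length of `object_embeddings` changes.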
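The questioner task
Trained by maximizing the conditional log-likelihood:
\[
\log P(Q \mid A, I) = \log \prod_{j=1}^{J} P(q_j \mid q_{<j}, a_{<j}, I) = \log \prod_{j=1}^{J} \prod_{i=1}^{N_j} P(w_i^j \mid w_{<i}^j, q_{<j}, a_{<j}, I) \tag{1}
\]
where w_i^j is the i-th word of question q_j and N_j is the number of words in q_j.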
[Figure: HRED questioner diagram with one encoder per previous question, context states updated with the answers a1 (Yes) and a2 (No), VGG image features, and decoders emitting the question words w; example questions "Is it a vase?", "Is it partially visible?", "Is it in the left corner?".]
Figure: HRED model conditioned on the VGG features of the image. Example of the distribution over the third question given the first two questions, their answers and the image, P(q_3 | q_{<3}, a_{<3}, I).
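The log of a product of per-word probabilities in the questioner objective is simply a sum of per-word log-probabilities. The sketch below makes that concrete; the nested-list input format and the function name are assumptions, and the probabilities are taken as given rather than produced by an actual HRED model.

```python
import math
from typing import Sequence


def dialogue_log_likelihood(token_probs: Sequence[Sequence[float]]) -> float:
    """Return log P(Q | A, I) for one dialogue.

    token_probs[j][i] is the model probability of word i of question j,
    conditioned on the earlier words of that question, the earlier
    questions and answers, and the image.
    """
    return sum(math.log(p) for question in token_probs for p in question)
```

For example, a two-question dialogue with word probabilities [[0.5, 0.5], [0.25]] has likelihood 0.5 * 0.5 * 0.25 = 0.0625, so the function returns log(0.0625). Training maximizes this quantity summed over the dialogues in the training set.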
Oracle baseline results
Model                                    Train err   Val err   Test err
Dominant class (no)                      47.4%       46.2%     50.9%
Question                                 40.2%       41.7%     41.2%
Image                                    45.7%       46.7%     46.7%
Crop                                     40.9%       42.7%     43.0%
Question + Crop                          22.3%       29.1%     29.2%
Question + Image                         37.9%       40.2%     39.8%
Question + Category                      23.1%       25.8%     25.7%
Question + Spatial                       28.0%       31.2%     31.3%
Question + Category + Spatial            17.2%       21.1%     21.5%
Question + Category + Crop               20.4%       24.4%     24.7%
Question + Spatial + Crop                19.4%       26.0%     26.2%
Question + Category + Spatial + Crop     16.1%       21.7%     22.1%
Question + Spatial + Crop + Image        20.7%       27.7%     27.9%
Question + Category + Spatial + Image    19.2%       23.2%     23.5%
Table: Classification errors for the oracle baselines. The best performing model is "Question + Category + Spatial", which refers to the MLP that takes the question, the selected object class and its spatial features as input.
Guesser and questioner baseline results
Model       Train err   Val err   Test err
Human       9.0%        9.2%      9.2%
Random      82.9%       82.9%     82.9%
LSTM        27.9%       37.9%     38.7%
HRED        32.6%       38.2%     39.0%
LSTM+VGG    26.1%       38.5%     39.5%
HRED+VGG    27.4%       38.4%     39.6%
Table: Classification errors for the guesser baselines.

Model                       Error
Human generated dialogue    38.7%
QGen+GT                     53.2%
QGen+ORACLE                 66.0%
Random                      82.9%
Table: Test error for the questioner (QGen) based on VGG+HRED guesser model, measured as the error of the guesser model when fed the questions from the questioner.
Main ideas (Cooperative Visual Dialog Agents)
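[Figure: example game. Caption: "Two zebra are walking around their pen at the zoo."
Q1: "Any people in the shot?" A1: "No, there aren't any." Predicted image representation: [0.1, -1, 0.2, …, 0.5]
Q10: "Are they facing each other?" A10: "They aren't." Predicted image representation: [-0.5, 0.1, 0.7, …, 1]
Q-BOT's final guess: "I think we were talking about this image!"]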
- A cooperative image guessing game between two agents, Q-BOT and A-BOT, is proposed.
- The agents communicate through a natural language dialog, after which Q-BOT selects a particular unseen image from a lineup.
- The agents are modeled as deep neural networks and trained end-to-end with reinforcement learning.
Model Overview
[Figure: model overview. Q-BOT (question decoder, fact embedding, history encoder, feature regression network) asks "Are there any animals?"; A-BOT (question encoder, fact embedding, history encoder, answer decoder) replies "Yes, there are two elephants." Over the rounds of dialog, Q-BOT's predicted feature vector, e.g. [0.1, -2, 0, …, 0.57], is passed to the reward function.]
- Two agents: Q-BOT and A-BOT
- Environment: image
- Action:
  - Q-BOT: question q_t, e.g. "Are there any animals?"
  - A-BOT: answer a_t, e.g. "Yes, there are two elephants."
  - Q-BOT: image regression y_t ∈ R^{4096}
- State:
  - Q-BOT: s_t^Q = [c, q_1, a_1, ..., q_{t-1}, a_{t-1}]
  - A-BOT: s_t^A = [I, c, q_1, a_1, ..., q_{t-1}, a_{t-1}, q_t]
Model Overview
At each round t of dialog:
- Q-BOT generates a question q_t from its question decoder, conditioned on its state encoding S_{t-1}^Q.
- A-BOT encodes q_t, updates its state encoding S_t^A, and generates an answer a_t.
- Both agents encode the completed exchange as F_t^Q and F_t^A.
- Q-BOT updates its state encoding to S_t^Q, predicts an image representation y_t, and receives a reward.
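The per-round procedure can be sketched as a plain loop. This is only a control-flow illustration: `ask`, `answer` and `predict` are hypothetical callables standing in for Q-BOT's question decoder, A-BOT's encoder-decoder and Q-BOT's feature regression network, and the shared `history` list stands in for the separate state encodings each agent maintains. It also assumes that c in the state definitions denotes the image caption.

```python
from typing import Callable, List, Sequence, Tuple

Exchange = Tuple[str, str]  # one completed (question, answer) pair


def run_dialog(
    image: object,
    caption: str,
    num_rounds: int,
    ask: Callable[[str, List[Exchange]], str],
    answer: Callable[[object, str, List[Exchange], str], str],
    predict: Callable[[str, List[Exchange]], Sequence[float]],
) -> Tuple[List[Exchange], List[Sequence[float]]]:
    """Play `num_rounds` rounds and return the dialog plus Q-BOT's prediction after each round."""
    history: List[Exchange] = []
    predictions: List[Sequence[float]] = []
    for _ in range(num_rounds):
        # Q-BOT never sees the image: it conditions on the caption and past exchanges only.
        q_t = ask(caption, history)
        # A-BOT additionally sees the image and the current question.
        a_t = answer(image, caption, history, q_t)
        # Both agents record the completed exchange.
        history.append((q_t, a_t))
        # Q-BOT predicts an image representation y_t from its updated state.
        predictions.append(predict(caption, history))
    return history, predictions
```

The reward for round t would then be computed from consecutive entries of `predictions`, which is why the loop stores one prediction per round rather than only the last.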
Details
Joint Training with Policy Gradients
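Reward definition:
\[
r_t\big(\underbrace{s_t^Q}_{\text{state}},\ \underbrace{(q_t, a_t, y_t)}_{\text{action}}\big) = \underbrace{\ell(y_{t-1}, y^{gt})}_{\text{distance at } t-1} - \underbrace{\ell(y_t, y^{gt})}_{\text{distance at } t} \tag{2}
\]
Objective function (the reward is the reduction in distance to the ground truth, so the expected total reward is maximized):
\[
\max_{\theta_A, \theta_Q, \theta_f} J(\theta_A, \theta_Q, \theta_f) \triangleq \mathbb{E}_{\pi_Q, \pi_A}\Big[ \sum_{t=1}^{T} r_t\big(s_t^Q, (q_t, a_t, y_t)\big) \Big] \tag{3}
\]
Policy gradients:
\[
\nabla_{\theta_Q} J = \mathbb{E}_{\pi_Q, \pi_A}\big[ r_t(\cdot)\, \nabla_{\theta_Q} \log \pi_Q(q_t \mid s_{t-1}^Q) \big] \tag{4}
\]
\[
\nabla_{\theta_A} J = \mathbb{E}_{\pi_Q, \pi_A}\big[ r_t(\cdot)\, \nabla_{\theta_A} \log \pi_A(a_t \mid s_t^A) \big] \tag{5}
\]
The feature regression network (θ_f) receives gradient updates through ℓ(·, ·), provided ℓ is differentiable.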
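The reward in Eq. (2) is the change in distance between Q-BOT's prediction and the ground-truth image representation from one round to the next. A minimal sketch follows, assuming ℓ is the Euclidean distance and that an initial prediction y_0 exists before the first round; both are assumptions, since the slide only requires ℓ to be differentiable.

```python
import math
from typing import Callable, List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance, used here as an assumed choice of the distance ell."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rewards(predictions: Sequence[Sequence[float]],
            target: Sequence[float],
            distance: Callable[[Sequence[float], Sequence[float]], float] = l2_distance
            ) -> List[float]:
    """Per-round rewards r_1, ..., r_T for predictions [y_0, y_1, ..., y_T].

    r_t = distance(y_{t-1}, target) - distance(y_t, target), so a round that
    moves the prediction closer to the target earns a positive reward.
    """
    dists = [distance(y, target) for y in predictions]
    return [dists[t - 1] - dists[t] for t in range(1, len(dists))]
```

With predictions [0, 0], [3, 0], [3, 4] and target [3, 4], the distances are 5, 4 and 0, giving rewards 1 and 4. The rewards telescope: their sum equals the initial distance minus the final distance, so maximizing the total reward in Eq. (3) amounts to ending the dialog as close to the target as possible, while the per-round terms tell the policy gradients in Eqs. (4) and (5) which exchanges helped.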
Results of Q-BOT/A-BOT Interactions
Qualitative Retrieval Results