Visual7W Grounded Question Answering in Images


Visual7W: Grounded Question Answering in Images

Yuke Zhu, Oliver Groth, Michael Bernstein, Li Fei-Fei

Slides by Issey Masuda Mora, Computer Vision Reading Group (09/05/2016)

[arXiv] [web] [GitHub]

Context

Visual Question Answering

Goal: predict the answer to a given question about an image

Motivation

New Turing test? How to evaluate AI’s image understanding?

Visual7W

The 7W

● WHAT
● WHERE
● WHEN
● WHO
● WHY
● HOW
● WHICH

Questions: multiple choice, 4 candidates, only one correct

Grounding: image-text correspondences. Exploit the relation between image regions and nouns in the questions.

The new answer is...

Question-Answer types:

● Telling questions: the answer is text

● Pointing questions: a new QA type introduced in this work, where the answer is an image region

Related work

Common approach

Typical pipeline, illustrated with "Who is under the umbrella?": extract visual features from the image, embed the question, merge the two representations, and predict the answer ("Two women").
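A minimal sketch of this generic merge pipeline, written in PyTorch for illustration (the pooled CNN features, layer sizes, and element-wise fusion are assumptions for the sketch, not the exact architecture from the paper or the related work):

```python
# Generic "merge" VQA baseline: CNN image features + LSTM question encoding,
# fused and classified over a fixed answer vocabulary. Sizes are illustrative.
import torch
import torch.nn as nn

class MergeVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, img_dim=2048, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)            # question word embeddings
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)   # question encoder
        self.img_proj = nn.Linear(img_dim, hid_dim)                # project pooled CNN features
        self.classifier = nn.Linear(hid_dim, num_answers)          # scores over candidate answers

    def forward(self, img_feat, question_tokens):
        # img_feat: (B, img_dim) pooled features from a pretrained CNN backbone
        # question_tokens: (B, T) word indices of the question
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                # (B, hid_dim) question summary
        v = torch.relu(self.img_proj(img_feat))  # (B, hid_dim) image summary
        fused = q * v                            # element-wise merge of the two modalities
        return self.classifier(fused)            # answer logits
```

The key point is that the image and the question are encoded independently and only combined at the end; the attention-based model presented later refines this by focusing on local image regions.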

The Dataset

Visual7W Dataset

Characteristics:

● 47,300 images from the COCO dataset
● 327,939 QA pairs
● 561,459 object bounding boxes spread across 36,579 categories
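For illustration, one multiple-choice QA pair with its groundings might be represented roughly like this (field names, IDs, and box coordinates are invented for the example; the actual JSON schema is defined in the official dataset release):

```python
# Hypothetical shape of one multiple-choice QA pair with object groundings.
qa_example = {
    "image_id": 123456,                      # COCO image the question refers to
    "type": "telling",                       # "telling" (text answer) or "pointing" (region answer)
    "question": "Who is under the umbrella?",
    "answer": "Two women.",                  # the single correct candidate
    "multiple_choices": [                    # the three human-written wrong candidates
        "A man.", "A dog.", "Nobody."],
    "groundings": [                          # object mentions linked to bounding boxes
        {"name": "umbrella", "box": [140, 35, 210, 120]},   # [x, y, width, height]
        {"name": "women", "box": [120, 90, 260, 300]},
    ],
}
```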

Creating the Dataset

Procedure:

● Write QA pairs
● 3 AMT workers evaluate each pair as good or bad
● Only pairs with at least 2 good evaluations are kept
● Write the 3 wrong answers (given the correct one)
● Extract object names and draw a bounding box for each one

The Model

Attention-based model

Pointing questions model
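A rough sketch of the attention idea, assuming per-region CNN features and an LSTM encoding of the question (the dimensions, the additive attention scoring, and the dot-product candidate scoring are simplifying assumptions, not the paper's exact formulation):

```python
# Sketch: attend over image-region features with the question encoding, then
# score candidate answer regions for a pointing question. Dimensions are illustrative.
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    def __init__(self, region_dim=2048, hid_dim=512):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hid_dim)  # project region features
        self.att_score = nn.Linear(hid_dim, 1)              # scalar attention score per region

    def forward(self, region_feats, q_state):
        # region_feats: (B, R, region_dim) features of R image regions
        # q_state:      (B, hid_dim) LSTM encoding of the question
        r = self.region_proj(region_feats)                               # (B, R, hid_dim)
        scores = self.att_score(torch.tanh(r + q_state.unsqueeze(1)))    # (B, R, 1)
        alpha = torch.softmax(scores, dim=1)                             # attention over regions
        attended = (alpha * r).sum(dim=1)                                # (B, hid_dim) image summary
        return attended, alpha

def score_pointing_candidates(candidate_feats, q_state, region_proj):
    # candidate_feats: (B, 4, region_dim) features of the 4 candidate boxes
    # Pick the candidate whose projected feature best matches the question encoding.
    c = region_proj(candidate_feats)                        # (B, 4, hid_dim)
    logits = torch.bmm(c, q_state.unsqueeze(2)).squeeze(2)  # (B, 4) dot-product scores
    return logits
```

For pointing questions the four candidates are image regions rather than text, so the model can score each candidate box directly against the question encoding instead of producing a textual answer.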

Experiments & Results

Experiments

Different experiments have been conducted depending on the information given to the subject:

● Only the question
● Question + Image

Subjects/models:

● Human
● Logistic regression (see the sketch below)
● LSTM
● LSTM + attention model
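As a toy illustration of the simplest baseline in the list above, a question-only model can be built with bag-of-words features and logistic regression (this sketch and its tiny dataset are assumptions, not the paper's exact baseline setup):

```python
# Question-only baseline: predict the answer from the question text alone,
# using bag-of-words features and logistic regression. Purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

questions = ["Who is under the umbrella?", "What color is the car?"]  # toy training data
answers   = ["Two women.", "Red."]                                    # correct answer per question

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(questions)                  # bag-of-words question features
clf = LogisticRegression(max_iter=1000).fit(X, answers)  # answer treated as a class label

# At test time, score the candidate answers and pick the most probable one.
print(clf.predict(vectorizer.transform(["Who is holding the umbrella?"])))
```

Comparing this question-only setup with the question + image setup shows how much of the task can be solved from language priors alone, which is exactly what the experiment design above probes.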

Results

Conclusions

● A visual QA model has been presented
● Attention model to focus on local regions of the image
● Dataset created with groundings