1/101 A picture is worth 13.6 words, Hal Daumé III, [email protected]
A picture is worth 13.6 words
(on average)
Alex Berg
Amit Goyal
Tamara Berg
Jesse Dodge
Yejin Choi
Yiannis Aloimonos
Kota Yamaguchi
Alyssa Mensch
Karl Stratos
Meg Mitchell
Xufeng Han
Ching Lik Teo
Yezhou Yang
An on-paper experiment
Write a caption for this image, one sentence in length.
(In English.)
People write weird captions
Another dream car to add to the list, this one spotted in Hanbury St.
Shot out my car window while stuck in traffic because people in Cincinatti can't drive in the rain
1. A distorted photo of a man cutting up a large cut of meat in a garage.
2. A man smiling at the camera while carving up meat.
3. A man smiling while he cuts up a piece of meat.
4. A smiling man is standing next to a table dressing a piece of venison.
5. The man is smiling into the camera as he cuts meat.
Two complementary questions...
Image ⇒ Text? Text ⇒ Image?
“two women sitting brunette blonde on bench reading magazine”
“looking for castles in the clouds out my car window”
Understanding and Predicting Importance in Images (BBDDGHMMSSY, CVPR 2012)
Detecting Visual Text (DGHMMSYCDBB, NAACL 2012)
Corpus-Guided Sentence Generation of Natural Images (YTDA, EMNLP 2011)
Midge: Generating Image Descriptions from Computer Vision Detections (MDGYSHMBBD, EACL 2012)
Why do this?
Caption Generation
the sheep meandered along a desolate road in the highlands of Scotland through frozen grass
Visual Scene Construction
the small white cat is -17 inches above the hat. the tiny white illuminator is in front of the cat. it is night. the ground is red.
the 200 foot tall dragon is facing the 100 foot tall car. The ground is a checkerboard. the sky is pink
Coyne & Sproat, SIGGRAPH 2001. WordsEye: An Automatic Text-to-Scene Conversion System
Training Object Detectors from Text
“elephant in the beach”
“a person riding a horse” ≠ Person + Horse
Farhadi + Sadeghi, CVPR 2011. Recognition Using Visual Phrases
What is “visual text”?
● Photographer/viewer distinctions
● Amount of inference
● Temporal events
Kevin’s mom, so punxrawkin Kev’s black flag hat
Another dream car to add to the list, this one spotted in Hanbury St.
Tuckered out from playing in Nannie’s yard.
A phrase is visual if there is a piece of the image you can cut out, place in another image, and still use the same description.
Okay, so can we detect it?
● SBU Flickr data
● 3 NPs per caption, 70% visual
● 800 images: ≥3 annotations
● 48k images: 1 annotation
● People largely agree (74% whatever that means...)
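Okay, so can a computer detect it?
Features (Inside, Before and After): Word+stems, Bigrams, Spelling, Hypernyms
Example: Another dream car to add to the list...
Inside (“Another dream car”):
  Word+stems: another anoth, dream dream, car car
  Bigrams: another_dream, dream_car
  Spelling: Aa+ a+ a+
  Hypernyms: Vehicle … artifact … entity
After (“to add”):
  Word+stems: to to, add add
  Bigrams: to_add
  Spelling: a+ a+
≈67% AUC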
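The lexical features on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names (`word_shape`, `phrase_features`) and the feature-string format are hypothetical, stems and hypernyms are left out because they need a stemmer and a lexical database, and the classifier that produces the AUC is not shown.

```python
def word_shape(token):
    """Collapse a token into a coarse spelling pattern, e.g. 'Another' -> 'Aa+'."""
    classes = []
    for ch in token:
        if ch.isupper():
            classes.append("A")
        elif ch.islower():
            classes.append("a")
        elif ch.isdigit():
            classes.append("0")
        else:
            classes.append(ch)  # keep punctuation as-is
    shape = []
    i = 0
    while i < len(classes):
        j = i
        while j < len(classes) and classes[j] == classes[i]:
            j += 1
        # A run longer than one character is written as the class plus '+'.
        shape.append(classes[i] + ("+" if j - i > 1 else ""))
        i = j
    return "".join(shape)


def phrase_features(tokens, prefix):
    """Binary word, spelling and bigram features for one token window.

    `prefix` (hypothetical) records where the window sits relative to the
    noun phrase: 'inside', 'before' or 'after'.
    """
    feats = {}
    lowered = [t.lower() for t in tokens]
    for tok, low in zip(tokens, lowered):
        feats[f"{prefix}:word={low}"] = 1
        feats[f"{prefix}:shape={word_shape(tok)}"] = 1
    for left, right in zip(lowered, lowered[1:]):
        feats[f"{prefix}:bigram={left}_{right}"] = 1
    return feats


# The slide's example: noun phrase "Another dream car", followed by "to add".
features = {}
features.update(phrase_features(["Another", "dream", "car"], "inside"))
features.update(phrase_features(["to", "add"], "after"))
```

The resulting dictionary could be fed to any linear classifier; which learner the original work used is not stated on the slide.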
Bootstrapping visual terminology
● Start with some seeds
● Apply bootstrapping or label propagation
Seeds (V = visual, NV = non-visual):
Noun, V: car house tree horse animal man table bottle woman computer
Noun, NV: idea bravery deceit trust dedication anger humour luck inflation honesty
Adj, V: brown green wooden striped orange rectangular furry shiny rusty feathered
Adj, NV: public original whole righteous adjectives political personal intrinsic seeds individual
Attribute lists:
Color: purple blue maroon beige green
Material: plastic cotton wooden metallic silver
Shape: circular square round rectangular triangular
Size: small big tiny tall huge
Surface: coarse smooth furry fluffy rough
Direction: sideways north upward left down
Pattern: striped dotted checked plaid quilted
Quality: shiny rusty dirty burned glittery
Beauty: beautiful cute pretty gorgeous lovely
Age: young mature immature older senior
Ethnicity: french asian american greek hispanic
grayish, chestnut, emerald, rufous
oblong, hemispherical, quadrangular, convex
#A81C07
≈67% AUC → ≈71% AUC
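As a rough illustration of the "seeds plus label propagation" idea, the sketch below spreads +1 (visual) and -1 (non-visual) seed labels over a small word-similarity graph. Everything about the graph is an assumption made for the example: the edges, their weights, and the unlabeled word `civic` are invented, and the slides do not say how similarity was computed or which propagation variant was used.

```python
def propagate(edges, seeds, iterations=20):
    """Iterative label propagation with clamped seeds.

    edges: list of (word_a, word_b, weight) similarity links (toy data here).
    seeds: dict word -> +1.0 (visual) or -1.0 (non-visual).
    Returns a dict word -> score in [-1, 1].
    """
    # Build a symmetric adjacency map.
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    for word in seeds:
        graph.setdefault(word, {})

    scores = {word: seeds.get(word, 0.0) for word in graph}
    for _ in range(iterations):
        updated = {}
        for word, neighbors in graph.items():
            if word in seeds:
                updated[word] = seeds[word]  # seeds never change
            elif neighbors:
                total = sum(neighbors.values())
                updated[word] = sum(
                    w * scores[n] for n, w in neighbors.items()
                ) / total
            else:
                updated[word] = 0.0
        scores = updated
    return scores


# Seed adjectives taken from the slide; edges and weights are hypothetical.
seeds = {"brown": 1.0, "green": 1.0, "public": -1.0, "political": -1.0}
edges = [
    ("chestnut", "brown", 0.9),
    ("emerald", "green", 0.8),
    ("chestnut", "emerald", 0.3),
    ("civic", "public", 0.7),
    ("civic", "political", 0.6),
]
scores = propagate(edges, seeds)
```

Words whose score ends up clearly positive would be added to the visual list, which is how terms such as "chestnut" or "emerald" could be picked up from color seeds.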
But this doesn't use the images!!!
[Bar chart, axis from 50 to 95: Random, Model, Model+Lists, Human]
What I used to think vision did...
Adding in image features
Ecuador, amazon basin, near coca, rain forest, passion fruit flower
● Does a detector corresponding to this head noun exist?
● Did it fire?
● How many times did it fire?
● How confident was the “best” firing?
● What %age of pixels in the image are in that bounding box?
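The five questions above map naturally onto five numeric features per noun phrase. The sketch below is one possible encoding, not the original implementation: the `Detection` record, the feature names, and the choice to measure the pixel fraction on the highest-confidence box are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detector output: a label, a score and a pixel box."""
    label: str
    confidence: float
    box: tuple  # (x_min, y_min, x_max, y_max)


def vision_features(head_noun, detections, available_detectors,
                    image_width, image_height):
    """Answer the slide's five questions for one head noun."""
    feats = {
        "detector_exists": 0.0,
        "fired": 0.0,
        "num_firings": 0.0,
        "best_confidence": 0.0,
        "best_box_fraction": 0.0,
    }
    if head_noun not in available_detectors:
        return feats  # no recognizer for this noun: all features stay zero
    feats["detector_exists"] = 1.0

    firings = [d for d in detections if d.label == head_noun]
    if not firings:
        return feats
    best = max(firings, key=lambda d: d.confidence)
    x0, y0, x1, y1 = best.box
    box_area = max(0, x1 - x0) * max(0, y1 - y0)
    feats["fired"] = 1.0
    feats["num_firings"] = float(len(firings))
    feats["best_confidence"] = best.confidence
    feats["best_box_fraction"] = box_area / (image_width * image_height)
    return feats


# Toy example for the head noun "flower" in a 100x100 image.
detections = [
    Detection("flower", 0.8, (10, 10, 60, 60)),
    Detection("flower", 0.4, (0, 0, 20, 20)),
]
flower = vision_features("flower", detections, {"flower", "bird"}, 100, 100)
basin = vision_features("basin", detections, {"flower", "bird"}, 100, 100)
```

Head nouns without a matching detector simply get an all-zero vector, so these features can only help on the subset of phrases that have a recognizer.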
Results with vision features
[Bar chart, axis from 50 to 95: Random, Model, Model+Lists, +Vision, Human]
Features only available on about 11% of examples
8% improvement on phrases with recognizers
Detecting on a large scale...
bird, boat, bottle, bowl
What do people describe?
Given an image
1) Predict what people will describe
“two women sitting brunette blonde on bench reading magazine”
women ●, bench ●, magazine ●, grass, skirt, …
Predicting what will be described
What’s in this image?
man, baby, sling, ladder, fridge, table, watermelon, chair, boxes, cups, water bottle, wall, pacifier, beard, glasses, shirt, …
What do people describe?
“A bearded man is holding a child in a sling.”
“A bearded man stands while holding a small child in a green sheet.”
“A bearded man with a baby in a sling poses.”
“Man standing in kitchen with little girl in green sack.”
“Man with beard and baby”
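One simple way to turn the five captions into a training signal is to count how often each labeled object is mentioned. The sketch below does exact token matching only, which is an assumption made to keep it short: it will not link "child" or "little girl" to `baby`, or "bearded" to `beard`, so a real pipeline would need some synonym handling.

```python
import re

captions = [
    "A bearded man is holding a child in a sling.",
    "A bearded man stands while holding a small child in a green sheet.",
    "A bearded man with a baby in a sling poses.",
    "Man standing in kitchen with little girl in green sack.",
    "Man with beard and baby",
]
objects = ["man", "baby", "sling", "ladder", "fridge", "beard"]


def mention_rates(captions, objects):
    """Fraction of captions whose tokens contain each object word."""
    tokenized = [set(re.findall(r"[a-z]+", c.lower())) for c in captions]
    return {
        obj: sum(obj in tokens for tokens in tokenized) / len(tokenized)
        for obj in objects
    }


rates = mention_rates(captions, objects)
```

Objects that are present in the image but never mentioned (here `ladder` and `fridge`) get a rate of zero, which is exactly the contrast the prediction task is after.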
Description factors
What factors influence what someone will describe about an image?
Two kinds of factors
– Compositional
– Semantic
Compositional factors
Size/Saliency
Location
“A sail boat on the ocean.”
“Two men standing on beach.”
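Size and location can be read straight off an object's bounding box. The sketch below shows one plausible way to compute them; the feature names and the choice of "distance from image center" as the location measure are assumptions, and saliency is not modeled at all.

```python
import math


def compositional_features(box, image_width, image_height):
    """Relative size and center offset for a box (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    relative_size = ((x1 - x0) * (y1 - y0)) / (image_width * image_height)
    # Box center in normalized [0, 1] image coordinates.
    center_x = (x0 + x1) / 2 / image_width
    center_y = (y0 + y1) / 2 / image_height
    center_distance = math.hypot(center_x - 0.5, center_y - 0.5)
    return {"relative_size": relative_size, "center_distance": center_distance}


# A large centered object versus a small one in the corner (100x100 image).
centered = compositional_features((25, 25, 75, 75), 100, 100)
corner = compositional_features((0, 0, 10, 10), 100, 100)
```

Under the slide's hypothesis, the large centered object would be more likely to be described than the small one in the corner.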
Semantic factors
Object Type
Nameable Scene
Unusualness
“girl in the street”
“kitchen in house”
“elephant in the beach”
“A tree in water and a boy with a beard”
Generating captions
a) Detect objects and scenes from the input image;
b) Estimate the optimal sentence structure quadruplet T;
c) Generate a sentence from T.
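Steps (b) and (c) can be illustrated with a toy search. The sketch assumes T is a (noun, verb, scene, preposition) tuple, scores each candidate as a product of detector confidences and corpus-style conditional probabilities, and fills a fixed template. All numbers are invented, and the brute-force enumeration and the template stand in for whatever inference and realization the actual system uses.

```python
from itertools import product


def best_quadruplet(noun_scores, scene_scores, verb_given_noun, prep_given_scene):
    """Return the highest-scoring (noun, verb, scene, preposition) and its score."""
    best, best_score = None, 0.0
    for noun, scene in product(noun_scores, scene_scores):
        for verb, p_verb in verb_given_noun.get(noun, {}).items():
            for prep, p_prep in prep_given_scene.get(scene, {}).items():
                score = noun_scores[noun] * scene_scores[scene] * p_verb * p_prep
                if score > best_score:
                    best, best_score = (noun, verb, scene, prep), score
    return best, best_score


def realize(quadruplet):
    """Fill a fixed (hypothetical) sentence template from T."""
    noun, verb, scene, prep = quadruplet
    return f"The {noun} is {verb} {prep} the {scene}."


# Step (a) output: hypothetical detector confidences.
noun_scores = {"person": 0.7, "horse": 0.5}
scene_scores = {"field": 0.6, "street": 0.3}
# Hypothetical corpus statistics.
verb_given_noun = {"person": {"walking": 0.5, "riding": 0.3},
                   "horse": {"grazing": 0.6}}
prep_given_scene = {"field": {"in": 0.8}, "street": {"on": 0.7}}

T, score = best_quadruplet(noun_scores, scene_scores,
                           verb_given_noun, prep_given_scene)
sentence = realize(T)
```

The point of the corpus terms is that a weak vision signal can be overruled by what is plausible to say, which is the "corpus-guided" part of the approach.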
Using large corpora to compose natural captions
(why write your own material when you can just “steal” it?)
Composing captions
a) monkey playing in the tree canopy, Monte Verde in the rain forest
b) capuchin monkey in front of my window
c) monkey spotted in Apenheul Netherlands under the tree
d) a white-faced or capuchin in the tree in the garden
e) the monkey sitting in a tree, posing for his picture
Captioning with (some) evidence
Caption images where:
We assume some evidence for 1 object & Object detector is confident
Tag: “mare” + High detection score: Evidence for horse
Generation: Grab 'N Mash
Grab phrases based on image similarity between query and captioned database
– Object detection similarity: NPs, VPs
– Stuff detection similarity: PPs
– Scene similarity: PPs
Mash phrases: compose descriptions using simple rule-based concatenation
Getting NPs – Objects
Detect: fruit
Find matching fruit detections by color similarity
Tray of glace fruit in the market at Nice, France
Fresh fruit in the market
A box of oranges was just catching the sun, bringing out detail in the skin.
The street market in Santanyi, Mallorca is a must for the oranges and local crafts.
An orange tree in the backyard of the house.
mandarin oranges in glass bowl
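The "grab" step amounts to nearest-neighbor retrieval over detected regions. A minimal sketch follows, assuming each detection is summarized by a small color histogram and compared by cosine similarity; the three-bin histograms, their values, and the pre-extracted noun phrases are made up for the example, and the slides do not specify the actual color descriptor or distance.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def grab_phrases(query_hist, database, k=2):
    """Return the phrases of the k database detections closest in color."""
    ranked = sorted(database,
                    key=lambda item: cosine(query_hist, item["hist"]),
                    reverse=True)
    return [item["phrase"] for item in ranked[:k]]


# Hypothetical database of fruit detections with toy color histograms.
database = [
    {"phrase": "mandarin oranges", "hist": [0.75, 0.20, 0.05]},
    {"phrase": "Fresh fruit", "hist": [0.40, 0.40, 0.20]},
    {"phrase": "A box of oranges", "hist": [0.70, 0.20, 0.10]},
]
query_hist = [0.80, 0.15, 0.05]  # an orange-looking query detection
noun_phrases = grab_phrases(query_hist, database)
```

The same retrieval pattern would apply to VPs, stuff PPs and scene PPs, with the color histogram swapped for a pose, stuff or scene descriptor.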
Getting NPs – Objects
The muddy elephant; An elephant; small elephant; A very large and seemingly old elephant; musk male elephant; African elephant; the temple elephant
Fushia flower; a flower; a pink zinna flower; This beautiful flower; a roman pink flower; a tiny pink flower; pink bursting flowers; a perfectly pink gerbera daisy
a lonesome duck; a native new zealand duck; The duck; male Mallard duck; several other ducks; a so-called navigation duck; this duck; a duck; duck; mandarin duck
Getting VPs – objects
Detect: cow
Find matching cow detections by shape/pose similarity
theses cows live in the field behind my house
A cow eating flowers in the south of the Netherlands.
The cow was more interested in eating than looking at me with a camera!
While cycling north on Tremaine Road near Milton, this cow gazed across the road intently.
Getting PPs – stuff
Detect: grass
Find matching grass detections by color similarity
green manure in the veg field - Plaw Hatch
Found on hawthorn in boggy grass field
Sheep in a field spotted during a coastal drive from Tramore to Dungervan
I am happy in a field of green Maryland grass
Getting PPs – scenes
Extract scene descriptor
Find matching images by scene similarity
View from our B&B in this photo
Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere
I'm about to blow the building across the street over with my massive lung power.
Only in Paris will you find a bottle of wine on a table outside a bookstore
Composing captions
NP (object color): the sheep
VP (object pose): meandered along a desolate road
PP (scene): in the highlands of Scotland
PP (stuff): through frozen grass
Various composition patterns:
NP VP
NP PP_stuff
NP PP_scene
…
NP VP PP_scene PP_stuff
the sheep meandered along a desolate road in the highlands of Scotland through frozen grass
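The "mash" step is plain slot filling. The sketch below applies the composition patterns listed on the slide to the retrieved phrases; the `mash` helper and the dictionary layout are illustrative names, and the elided patterns ("…") are not filled in.

```python
# Patterns named on the slide; the elided ones are intentionally omitted.
PATTERNS = [
    ("NP", "VP"),
    ("NP", "PP_stuff"),
    ("NP", "PP_scene"),
    ("NP", "VP", "PP_scene", "PP_stuff"),
]


def mash(phrases, pattern):
    """Concatenate the phrases for a pattern, or None if a slot is missing."""
    if any(slot not in phrases for slot in pattern):
        return None
    return " ".join(phrases[slot] for slot in pattern)


phrases = {
    "NP": "the sheep",
    "VP": "meandered along a desolate road",
    "PP_scene": "in the highlands of Scotland",
    "PP_stuff": "through frozen grass",
}
captions = [c for c in (mash(phrases, p) for p in PATTERNS) if c is not None]
```

Because the phrases are glued together with no grammatical checking, a bad retrieval in any one slot goes straight into the output.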
Good results
cat enjoys hiding under the tree
A female Monarch butterfly was visiting the plant in my front yard in Devon 17/10/10
Stained glass window depicting Christ and numerous saints in Washington National Cathedral in the Eglise
A double-decker bus under some spreading shade trees
her flower girl dress designed by Mainbocher in the house
A duck was having a bath in the harbor at whitehaven, cumbria, england in the water near Camley St
Not so good results
Language issues:
A Moo cow tied up around the city eating grass in various places under the tree at the young tree
male tiger sighting in twelve months of a street
Vision issues:
The silhouetted building and cross stands under water around Loon Mountain
a girl walking by in a green field in the sun
Just plain silly:
dogs running pic, this time, racing through the sea at Fraisthorpe near Bridlington of Christmas tree in bed
bike was left here by an ancient civilization not as sophisticated as our own in the grass of granite
Open question...
➢ Can we do this without using pre-defined object/scene/etc. detectors?
➢ Build a representation of each image in the database
➢ Build a representation of the test image
➢ Find 10 most similar database images
➢ Merge their NL descriptions using text-to-text generation techniques
➢ Q: Where do these representations come from???
And why are we trying to do this...???
● Captioning the world for people with visual impairments
  ● But the captions we have are not really descriptive of the world
● Use vision to “ground out” language
  ● Is it turtles all the way down?
● That's how babies work!
  ● Sadly we don't have baby-esque robots yet
Why work on a task at all?
● A solution is of benefit to society
● The process focuses attention on phenomena that are worthy of study
● What is worthy of study? (IMO)
  ● Low-level linguistic phenomena that hide in the tail
  ● Human-like abilities to generalize from small data
  ● Very basic learning of correlations between different modalities (operant conditioning)
René Descartes (1596-1650)
What about 2nd language learning?
● Obvious problems
  ● Assumes knowledge of 1st language
  ● Assumes knowledge of the world
  ● Still don't have a robot...
● But we do have software with exercises for SLA
It's hard for people, too!
Aspects of computational 2ndLL
● Very specific linguistic variants
  ● Number, case, agreement, etc.
  ● Not enough to get the majority case
● Focus on subtle visual differences
Aspects of computational 2ndLL
● AI-style reasoning & one-shot learning
● “It's learnable” proof of concept:
What is needed to solve this?
● Linguistic model over character sequences (words not okay!) w/o any L-specific background
● Pre-trained (?) visual detectors for objects, poses and physical relationships (e.g., gaze)
● Ability to reason and generalize from a few examples
Thanks! Questions?
Alex Berg
Amit Goyal
Tamara Berg
Jesse Dodge
Yejin Choi
Yiannis Aloimonos
Kota Yamaguchi
Alyssa Mensch
Karl Stratos
Meg Mitchell
Xufeng Han
Ching Lik Teo
Yezhou Yang