Ambient Sound Provides Supervision for Visual Learning
Andrew Owens1, Jiajun Wu1, Josh H. McDermott1, William T. Freeman1,2, and Antonio Torralba1
1MIT & 2Google Research
ECCV 2016
Presented by An T. Nguyen
1
Introduction
Problem
- Learn an image representation without labels ...
- ... that is useful for a real task (e.g. object recognition).
Idea
- Set up a pretext task.
- To solve the pretext task, the model must learn a good representation.
Learn to predict a "natural signal" ...
- ... that is available for 'free'.
- This paper: sound.
- Others: camera motion (Agrawal et al., 2015; Jayaraman & Grauman, 2015).
2
Data
Yahoo Flickr Creative Commons 100 Million Dataset (Thomee et al., 2015)
- 360,000-video subset.
- Sample one image every 10 seconds.
- Extract 3.75 seconds of sound around each image.
- 1.8 million training examples.
3
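The sampling scheme above can be sketched as follows. The placement of the first frame and the clamping of audio windows at video boundaries are illustrative assumptions, not details from the paper:

```python
# Sketch of the data sampling: one image every 10 s, with a 3.75 s
# sound window centered on each sampled frame.

def sample_windows(video_duration_s, frame_interval_s=10.0, audio_window_s=3.75):
    """Return (frame_time, audio_start, audio_end) tuples for one video."""
    samples = []
    t = frame_interval_s / 2  # assumption: first frame mid-way through the first interval
    while t < video_duration_s:
        half = audio_window_s / 2
        start = max(0.0, t - half)                 # clamp at the start of the video
        end = min(video_duration_s, t + half)      # clamp at the end of the video
        samples.append((t, start, end))
        t += frame_interval_s
    return samples

# A 30 s video yields three (frame, audio window) training examples.
print(sample_windows(30.0))
```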
Examples 1 (flickr.com/photos/41894173046@N01/4530333858)
[Slide shows a video frame with its accompanying sound clip.]
4
Examples 2 (flickr.com/photos/42035325@N00/8029349128)
[Slide shows a video frame with its accompanying sound clip.]
5
Examples 3 (flickr.com/photos/zen/2479982751)
[Slide shows a video frame with its accompanying sound clip.]
6
Challenges
- Sound is sometimes indicative of the image.
- But sometimes it is not.
Sound-producing objects
- may be outside the image.
- do not always produce sound.
Video
- is edited.
- has noisy background sound.
Question: What representation can we learn?
7
Representing sound
Pre-processing
- Filter the waveform (mimicking the human ear).
- Compute statistics (e.g. the mean of each frequency channel).
- → sound texture: a 502-dimensional vector.
Two labeling models
1. Cluster the sound textures (k-means).
2. PCA, 30 projections, threshold → binary codes.
Given an image
1. Predict the sound cluster.
2. Predict the 30 binary codes (multi-label classification).
8
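A minimal numpy sketch of the two labeling schemes, run on stand-in random vectors in place of real 502-dim sound textures. The cluster count, projection count, and the zero threshold are illustrative choices; the paper's exact pipeline may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
textures = rng.standard_normal((300, 502))  # stand-in for real sound textures

def kmeans_labels(X, k=30, iters=10, seed=0):
    """Labeling model 1: one categorical label per clip via k-means."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        # squared distance from every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def pca_binary_codes(X, n_proj=30):
    """Labeling model 2: project onto top principal components, threshold."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[:n_proj].T       # 30 PCA projections per clip
    return (proj > 0).astype(int)   # threshold (here: at zero) -> binary codes

labels = kmeans_labels(textures)
codes = pca_binary_codes(textures)
print(labels.shape, codes.shape)
```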
Training
Convolutional Neural Network
- Similar to (Krizhevsky et al., 2012).
- Implemented in Caffe.
9
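The two prediction targets imply two standard losses. This is a minimal numpy sketch, not the paper's Caffe configuration: softmax cross-entropy for the sound-cluster label, and sigmoid cross-entropy averaged over the 30 independent binary codes.

```python
import numpy as np

def softmax_xent(logits, label):
    """Cross-entropy for the single sound-cluster label."""
    z = logits - logits.max()              # numerically stable log-softmax
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]

def sigmoid_xent(logits, codes):
    """Mean cross-entropy over the 30 independent binary codes."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(codes * np.log(p) + (1 - codes) * np.log(1 - p)).mean()

# Uniform (zero) logits give the expected baseline values:
print(softmax_xent(np.zeros(30), 3))            # log(30) ≈ 3.401
print(sigmoid_xent(np.zeros(30), np.ones(30)))  # log(2) ≈ 0.693
```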
Training
[Figure: training setup diagram.]
10
Visualizing neurons (in upper layers)
Method: for each neuron,
1. Find the images with the largest activation.
2. Find the locations with the largest contribution to the activation.
3. Highlight these regions.
4. Show them to humans on Amazon Mechanical Turk (AMT).
11
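Steps 1 and 2 above can be sketched as follows, assuming the layer's activations are available as an (images, units, H, W) array; all shapes and names here are illustrative, not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
acts = rng.random((100, 256, 13, 13))  # stand-in (images, units, H, W) activations

def top_images_and_locations(acts, unit, k=5):
    a = acts[:, unit]                      # (images, H, W) maps for this unit
    per_image = a.max(axis=(1, 2))         # step 1: peak activation per image
    top = np.argsort(per_image)[::-1][:k]  # indices of the k top images
    # step 2: the spatial location with the largest contribution per image
    locs = [np.unravel_index(a[i].argmax(), a[i].shape) for i in top]
    return top, locs

imgs, locs = top_images_and_locations(acts, unit=7)
print(imgs, locs)
```

The highlighted regions (step 3) would then be crops around these peak locations, scaled back to image coordinates.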
Visualizing neurons
[Figure: top activating images with highlighted regions for selected units.]
12
Visualizing neurons
[Figure: further unit visualizations.]
13
Detectors Histogram
[Figure: histogram of learned detector types for networks trained on Sound, Ego Motion, and Labeled Scenes (supervised).]
14
Observations
- Each method learns some kind of representation ...
- ... depending on the pretext task.
Representation learned from sound
- Objects with distinctive sounds.
- Complementary to other methods.
15
Object/Scene Recognition (1-vs-rest SVM)
Comparable performance to other methods.
[1] Agrawal et al. 2015; [4] Doersch et al. 2015; [20] Krähenbühl et al. 2016; [35] Wang & Gupta 2015
16
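The evaluation protocol (frozen features, one-vs-rest linear SVM) can be sketched like this. A tiny hinge-loss gradient descent stands in for a real SVM solver, and the features and labels below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))  # stand-in for frozen CNN features
y = (X[:, 0] + X[:, 1] > 0).astype(int) + 2 * (X[:, 2] > 0)  # 4 synthetic classes

def one_vs_rest_svm(X, y, n_classes, lr=0.01, reg=1e-3, epochs=50):
    """One linear hinge-loss classifier per class; argmax at test time."""
    W = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)   # this class vs the rest
        for _ in range(epochs):
            margins = t * (X @ W[c])
            mask = margins < 1.0          # violated hinge constraints
            grad = reg * W[c] - (t[mask][:, None] * X[mask]).sum(0) / len(X)
            W[c] -= lr * grad
    return W

W = one_vs_rest_svm(X, y, n_classes=4)
accuracy = ((X @ W.T).argmax(1) == y).mean()
print(accuracy)
```

In the paper's setting, X would instead hold features from an upper layer of the sound-pretrained network, and a standard SVM package would replace the hand-rolled solver.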
Object Detection (pretraining Fast R-CNN)
Similar performance to motion-based methods.
[1] Agrawal et al. 2015; [4] Doersch et al. 2015; [20] Krähenbühl et al. 2016; [35] Wang & Gupta 2015
17
Discussion
Sound
- is abundant.
- can be used to learn good representations.
- is complementary to visual information.
Future work
- Other sound representations.
- Which objects/scenes are detectable by sound?
18
Bonus: Visually Indicative Sound (Owens et al., 2016; vis.csail.mit.edu)
19