Page 1:

Self-Supervised Learning

Andrew Zisserman

Slides from: Carl Doersch, Ishan Misra, Andrew Owens, Carl Vondrick, Richard Zhang

Page 2:

1000 categories

• Training: 1000 images for each category

• Testing: 100k images

The ImageNet Challenge Story …

Page 3:

The ImageNet Challenge Story … strong supervision

Page 4:

The ImageNet Challenge Story … outcomes

Strong supervision:

• Features from networks trained on ImageNet can be used for other visual tasks, e.g. detection, segmentation, action recognition, fine-grained visual classification

• To some extent, any visual task can now be solved by:

1. Construct a large-scale dataset labelled for that task

2. Specify a training loss and neural network architecture

3. Train the network and deploy

• Are there alternatives to strong supervision for training? Self-supervised learning …

Page 5:

1. Expense of producing a new dataset for each new task

2. Some areas are supervision-starved, e.g. medical data, where it is hard to obtain annotation

3. Vast numbers of unlabelled images/videos are available but largely untapped

– Facebook: one billion images uploaded per day

– 300 hours of video are uploaded to YouTube every minute

4. How infants may learn …

Why Self-Supervision?

Page 6:

Self-Supervised Learning

The Scientist in the Crib: What Early Learning Tells Us About the Mind by Alison Gopnik, Andrew N. Meltzoff and Patricia K. Kuhl

The Development of Embodied Cognition: Six Lessons from Babies by Linda Smith and Michael Gasser

Page 7:

• A form of unsupervised learning where the data provides the supervision

• In general, withhold some part of the data, and task the network with predicting it

• The task defines a proxy loss, and the network is forced to learn what we really care about, e.g. a semantic representation, in order to solve it

What is Self-Supervision?

Page 8:

Diagram: randomly sample a patch and a second patch from the same image; pass each through a CNN (shared weights); a classifier predicts which of the 8 possible locations the second patch came from.

Example: relative positioning

Train network to predict relative position of two regions in the same image

Unsupervised visual representation learning by context prediction, Carl Doersch, Abhinav Gupta, Alexei A. Efros, ICCV 2015
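As a concrete illustration, here is a minimal sketch of the proxy task in PyTorch (the patch sizes, network, and dimensions are illustrative stand-ins, not those of the paper):

```python
import torch
import torch.nn as nn

# Toy sketch of relative positioning: a shared CNN embeds two patches from
# the same image; a classifier predicts which of the 8 neighbouring
# positions the second patch came from.

class PatchEmbed(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, x):
        return self.net(x)

class RelPosNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.embed = PatchEmbed(dim)             # shared weights for both patches
        self.classifier = nn.Linear(2 * dim, 8)  # 8 possible relative locations

    def forward(self, p1, p2):
        return self.classifier(torch.cat([self.embed(p1), self.embed(p2)], dim=1))

model = RelPosNet()
p1 = torch.randn(16, 3, 96, 96)      # centre patches (random stand-ins for crops)
p2 = torch.randn(16, 3, 96, 96)      # neighbouring patches, with gap/jitter applied
target = torch.randint(0, 8, (16,))  # which of the 8 positions p2 came from
loss = nn.CrossEntropyLoss()(model(p1, p2), target)
loss.backward()
```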

Page 9:

A B

Example: relative positioning

Unsupervised visual representation learning by context prediction, Carl Doersch, Abhinav Gupta, Alexei A. Efros, ICCV 2015

Page 10:

Semantics from a non-semantic task

Unsupervised visual representation learning by context prediction, Carl Doersch, Abhinav Gupta, Alexei A. Efros, ICCV 2015

Page 11:

What is learned?

Figure: nearest neighbours of an input patch using features from relative positioning, random initialization, and an ImageNet-trained AlexNet.

Page 12:

Self-supervised learning in three parts:

1. from images

2. from videos

3. from videos with sound

Outline

Page 13:

Part I

Self-Supervised Learning from Images

Page 14:

Diagram: randomly sample a patch and a second patch from the same image; pass each through a CNN (shared weights); a classifier predicts which of the 8 possible locations the second patch came from.

Recap: relative positioning

Train network to predict relative position of two regions in the same image

Unsupervised visual representation learning by context prediction, Carl Doersch, Abhinav Gupta, Alexei A. Efros, ICCV 2015

Page 15:

Evaluation: PASCAL VOC Detection

• 20 object classes (car, bicycle, person, horse …)

• Predict the bounding boxes of all objects of a given class in an image (if any)

Dog Horse Motorbike Person

Page 16:

Pre-train on relative-position task, w/o labels

[Girshick et al. 2014]

Evaluation: PASCAL VOC Detection

• Pre-train CNN using self-supervision (no labels)

• Train CNN for detection in R-CNN object category detection pipeline

R-CNN

Page 17:

Average Precision:

• No Pretraining: 45.6%

• Relative positioning: 51.1%

• ImageNet Labels: 56.8%

Evaluation: PASCAL VOC Detection

Page 18:

Avoiding Trivial Shortcuts

Include a gap

Jitter the patch locations

Page 19:

Position in Image

A Not-So “Trivial” Shortcut

Page 20:

Chromatic Aberration

Page 21:

Position in Image

A Not-So “Trivial” Shortcut

Solution?

Only use one of the colour channels

Page 22:

Image example II: colourization

Train network to predict pixel colour from a monochrome input.

Grayscale image: L channel. Predict the ab channels and concatenate (L, ab) to recover the colour image: a "free" supervisory signal.

Page 23:

Image example II: colourization

Train network to predict pixel colour from a monochrome input

Colorful Image Colorization, Zhang et al., ECCV 2016
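A toy sketch of the colourization objective (assuming per-pixel classification over quantised ab colour bins, as in Zhang et al.; the tiny network and the binning here are placeholders):

```python
import torch
import torch.nn as nn

# Toy sketch of the colourization loss: per-pixel classification over
# quantised ab colour bins from the L channel alone. (The 313-bin value is
# from Zhang et al.; the binning and network here are placeholders.)

NUM_BINS = 313

net = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, NUM_BINS, 1))                    # per-pixel logits over colour bins

L = torch.randn(8, 1, 64, 64)                      # grayscale input: the L channel
ab_bins = torch.randint(0, NUM_BINS, (8, 64, 64))  # quantised ab targets
loss = nn.CrossEntropyLoss()(net(L), ab_bins)      # "free" supervision from colour
loss.backward()
```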

Page 24:

- Exemplar Networks (Dosovitskiy et al., 2014)

- Perturb/distort image patches, e.g. by cropping and affine transformations

- Train to classify these exemplars as the same class

Image example III: exemplar networks
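A sketch of the exemplar objective, assuming torchvision transforms as the perturbations (the paper's augmentation family is richer):

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Sketch of exemplar training: each seed patch is its own class, and
# perturbed copies must be classified back to their seed. (The transforms
# below stand in for the paper's richer augmentation family.)

augment = T.Compose([
    T.RandomResizedCrop(32, scale=(0.5, 1.0)),
    T.RandomAffine(degrees=20, translate=(0.2, 0.2)),
    T.ColorJitter(0.4, 0.4, 0.4)])

num_exemplars = 100
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                    nn.Linear(256, num_exemplars))

seeds = torch.rand(num_exemplars, 3, 32, 32)  # one "class" per seed patch
idx = torch.randint(0, num_exemplars, (16,))  # which exemplar each sample is
batch = torch.stack([augment(seeds[i]) for i in idx])
loss = nn.CrossEntropyLoss()(net(batch), idx)
loss.backward()
```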

Page 25:

Egomotion: Agrawal et al., ICCV 2015; Jayaraman et al., ICCV 2015

Co-Occurrence: Isola et al., ICLR Workshop 2016

Context: Noroozi et al., 2016; Pathak et al., CVPR 2016

Autoencoders: Hinton & Salakhutdinov, Science 2006

Denoising Autoencoders: Vincent et al., ICML 2008

Split-brain auto-encoders: Zhang et al., CVPR 2017

Exemplar networks: Dosovitskiy et al., NIPS 2014

Page 26:

Multi-Task Self-Supervised Learning

Self-supervision task | ImageNet Classification top-5 accuracy | PASCAL VOC Detection mAP

Rel. Pos | 59.21 | 66.75

Colour | 62.48 | 65.47

Exemplar | 53.08 | 60.94

Rel. Pos + Colour | 66.64 | 68.75

Rel. Pos + Exemplar | 65.24 | 69.44

Rel. Pos + Colour + Exemplar | 68.65 | 69.48

ImageNet labels | 85.10 | 74.17

Procedure:

• ImageNet-frozen: self-supervised training, network fixed, classifier trained on features

• PASCAL: self-supervised pre-training, then train Faster-RCNN

• ImageNet labels: strong supervision

NB: all methods re-implemented on same backbone network (ResNet-101)

Multi-task self-supervised visual learning, C Doersch, A Zisserman, ICCV 2017
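A toy sketch of one way to combine the proxy tasks: sum their losses over a shared trunk. This is only a stand-in; the paper uses a ResNet-101 trunk, per-task heads, and the real per-task losses:

```python
import torch
import torch.nn as nn

# Toy sketch of multi-task self-supervision: one shared trunk, one head per
# proxy task, losses summed. (Illustrative only, e.g. colour prediction is
# per-pixel in the real task, not per-image as here.)

trunk = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
heads = nn.ModuleDict({
    "rel_pos": nn.Linear(64 * 2, 8),   # 8-way relative position of two patches
    "colour": nn.Linear(64, 313),      # colour-bin prediction (toy: per image)
    "exemplar": nn.Linear(64, 100),    # exemplar-class prediction
})

x1, x2 = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
f1, f2 = trunk(x1), trunk(x2)
ce = nn.CrossEntropyLoss()
loss = (ce(heads["rel_pos"](torch.cat([f1, f2], dim=1)), torch.randint(0, 8, (4,)))
        + ce(heads["colour"](f1), torch.randint(0, 313, (4,)))
        + ce(heads["exemplar"](f1), torch.randint(0, 100, (4,))))
loss.backward()
```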

Page 27:

Multi-Task Self-Supervised Learning

Multi-task self-supervised visual learning, C Doersch, A Zisserman, ICCV 2017

(Same table and procedure as the previous slide.)

Findings:

• Deeper network improves performance (ResNet vs AlexNet)

• Colour and Rel-Pos superior to Exemplar

• Gap between self-supervision and strong supervision closing

Page 28:

Image Transformations – 2018

Unsupervised representation learning by predicting image rotations, Spyros Gidaris, Praveer Singh, Nikos Komodakis, ICLR 2018

Which image has the correct rotation?

Page 29:

Image Transformations – 2018

Unsupervised representation learning by predicting image rotations, Spyros Gidaris, Praveer Singh, Nikos Komodakis, ICLR 2018

Page 30:

Image Transformations – 2018

Unsupervised representation learning by predicting image rotations, Spyros Gidaris, Praveer Singh, Nikos Komodakis, ICLR 2018

Page 31:

Image Transformations – 2018

• Uses AlexNet

• Closes gap between ImageNet and self-supervision

PASCAL VOC Detection mAP

Random 43.4

Rel. Pos. 51.1

Colour 46.9

Rotation 54.4

ImageNet Labels 56.8

Unsupervised representation learning by predicting image rotations, Spyros Gidaris, Praveer Singh, Nikos Komodakis, ICLR 2018
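A minimal sketch of the rotation task; the labels are free, since any image can be rotated by a known multiple of 90 degrees (the tiny network stands in for the AlexNet used in the paper):

```python
import torch
import torch.nn as nn

# Sketch of the rotation proxy task: rotate each image by 0/90/180/270
# degrees and train a 4-way classifier on the rotation label.

net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 4))

imgs = torch.randn(8, 3, 32, 32)
k = torch.randint(0, 4, (8,))                        # rotation label per image
rotated = torch.stack([torch.rot90(im, int(r), dims=(1, 2))
                       for im, r in zip(imgs, k)])   # rotate by k * 90 degrees
loss = nn.CrossEntropyLoss()(net(rotated), k)
loss.backward()
```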

Page 32:

Summary Point

• Self-Supervision:

– A form of unsupervised learning where the data provides the supervision

– In general, withhold some information about the data, and task the network with predicting it

– The task defines a proxy loss, and the network is forced to learn what we really care about, e.g. a semantic representation, in order to solve it

• Many self-supervised tasks for images

• Often complementary, and combining improves performance

• Closing gap with strong supervision from ImageNet label training

– ImageNet image classification, PASCAL VOC detection

• Deeper networks improve performance

Page 33:

Part II

Self-Supervised Learning from Videos

Page 34:

Video

A temporal sequence of frames

What can we use to define a proxy loss?

• Nearby (in time) frames are strongly correlated; frames further away may not be

• Temporal order of the frames

• Motion of objects (via optical flow)

• …

Page 35:

Three example tasks:

– Video sequence order

– Video direction

– Video tracking

Outline

Page 36:

Temporal structure in videos

Time

“Sequence” of data

Shuffle and Learn: Unsupervised Learning using Temporal Order Verification

Ishan Misra, C. Lawrence Zitnick and Martial Hebert ECCV 2016

Slide credit: Ishan Misra

Page 37:

Sequential Verification

• Is this a valid sequence?

Sun and Giles, 2001; Sun et al., 2001; Cleeremans, 1993; Reber, 1989. Arrow of Time: Pickup et al., 2014

Slide credit: Ishan Misra

Page 38:

Original video

Slide credit: Ishan Misra

Page 39:

Original video

Temporally correct order

Slide credit: Ishan Misra

Page 40:

Original video

Temporally correct order

Temporally incorrect order

Slide credit: Ishan Misra

Page 41:

Geometric View

Given a start and an end, can this point lie in between?

Images

Shuffle and Learn – I. Misra, L. Zitnick, M. Hebert – ECCV 2016. Slide credit: Ishan Misra

Page 42:

Dataset: UCF-101 Action Recognition

UCF101 - Soomro et al., 2012

Page 43:

Positive Tuples Negative Tuples

~900k tuples from the UCF-101 dataset (Soomro et al., 2012). Slide credit: Ishan Misra

Page 44:

Informative training tuples: sample frames in temporally correct order from high-motion windows of the original video (per-frame motion plotted over time).

Slide credit: Ishan Misra

Page 45:

Diagram: each frame of the input tuple passes through a shared network; the features are concatenated (fc8) and classified as a correct/incorrect tuple, trained with a cross-entropy loss.

Slide credit: Ishan Misra
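A sketch of the order-verification objective (shared frame network, concatenation, 2-way classification; the high-motion tuple selection described above is omitted and all sizes are illustrative):

```python
import torch
import torch.nn as nn

# Sketch of temporal-order verification: embed three frames with a shared
# network, concatenate, and classify the tuple as temporally correct or
# shuffled.

frame_net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(32 * 3, 2)   # correct vs incorrect order

def order_logits(f1, f2, f3):
    return head(torch.cat([frame_net(f) for f in (f1, f2, f3)], dim=1))

a, b, c = (torch.randn(8, 3, 64, 64) for _ in range(3))
logits = torch.cat([order_logits(a, b, c),    # frames in temporal order
                    order_logits(a, c, b)])   # shuffled tuple
labels = torch.cat([torch.ones(8, dtype=torch.long),
                    torch.zeros(8, dtype=torch.long)])
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()
```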

Page 46:

Nearest Neighbors of Query Frame (fc7 features): Query | ImageNet | Shuffle & Learn | Random

Slide credit: Ishan Misra

Page 47:

Finetuning setup

Diagram: self-supervised pre-training on correct/incorrect tuple classification, then finetune and test with action labels.

Slide credit: Ishan Misra

Page 48:

Results: Finetune on Action Recognition

Dataset: UCF101. Mean classification accuracy by initialization:

Random | 38.6

Shuffle & Learn | 50.2

ImageNet pre-trained | 67.1

Slide credit: Ishan Misra

Setup from - Simonyan & Zisserman, 2014

Page 49:

What does the network learn?

Given a start and an end, can this point lie in between?

Images

Shuffle and Learn – I. Misra, L. Zitnick, M. Hebert – ECCV 2016. Slide credit: Ishan Misra

Page 50:

Human Pose Estimation

• Keypoint estimation using FLIC and MPII Datasets

Slide credit: Ishan Misra

Page 51:

Initialization | FLIC Mean PCK | FLIC [email protected] | MPII Mean [email protected] | MPII [email protected]

Shuffle & Learn | 84.9 | 49.6 | 87.7 | 47.6

ImageNet pre-train | 85.8 | 51.3 | 85.1 | 47.2

FLIC: Sapp & Taskar, 2013. MPII: Andriluka et al., 2014. Setup from Toshev et al., 2013.

Slide credit: Ishan Misra

Human Pose Estimation

• Keypoint estimation using FLIC and MPII Datasets

Page 52:

More temporal structure in videos

Self-Supervised Video Representation Learning With Odd-One-Out Networks

Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould, ICCV 2017

Page 53:

More temporal structure in videos

Self-Supervised Video Representation Learning With Odd-One-Out Networks

Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould, ICCV 2017

Mean classification accuracy by initialization:

Random | 38.6

Shuffle and Learn | 50.2

Odd-One-Out | 60.3

ImageNet pre-trained | 67.1

Page 54:

• Important to select informative data in training

– Hard negatives and positives

– Otherwise, most data is too easy or has no information, and the network will not learn

– Often use heuristics for this, e.g. motion energy

• Consider how the network can possibly solve the task (without cheating)

– This determines what it must learn, e.g. human keypoints in `shuffle and learn'

• Choose the proxy task to encourage learning the features of interest

Summary: lessons so far

Page 55:

Self-Supervision using the Arrow of Time

Donglai Wei, Joseph Lim, Bill Freeman, Andrew Zisserman CVPR 2018

Page 56:

Learning the arrow of time

Supervision:

Positive training samples: video clips playing forwards

Negative training samples: video clips playing backwards

Task: predict if video playing forwards or backwards
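Since any clip can be played in reverse, the labels come for free. A toy sketch (a small 3D conv net over raw frames stands in for the flow-based T-CAM model described below):

```python
import torch
import torch.nn as nn

# Sketch of arrow-of-time training: reverse clips to generate negatives,
# then classify forwards vs backwards.

net = nn.Sequential(nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, 2))

clips = torch.randn(4, 3, 10, 32, 32)          # (batch, C, time, H, W)
reversed_clips = torch.flip(clips, dims=[2])   # play the clips backwards
x = torch.cat([clips, reversed_clips])
y = torch.cat([torch.ones(4, dtype=torch.long),    # forwards
               torch.zeros(4, dtype=torch.long)])  # backwards
loss = nn.CrossEntropyLoss()(net(x), y)
loss.backward()
```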

Page 57:

Strong cues

Semantic, face motion direction, ordering

Donglai Wei, Joseph Lim, Bill Freeman, Andrew Zisserman CVPR 2018

Page 58:

Strong cues

`Simple' physics:

• gravity

• entropy

• friction

• causality

Donglai Wei, Joseph Lim, Bill Freeman, Andrew Zisserman CVPR 2018

Page 59:

Weak or no cues

Symmetric in time, constant motion, repetitions

Donglai Wei, Joseph Lim, Bill Freeman, Andrew Zisserman CVPR 2018

Page 60:

Temporal Class-Activation Map Network

Diagram: input motion in two chunks (frames 1 … 10 and 11 … 20); output: forwards or backwards?

T-CAM Model:

Input: optical flow in two chunks

Final layer: global average pooling to allow class activation map (CAM)
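A sketch of the class-activation-map idea that global average pooling enables: the classifier weights can be projected back onto the final feature map to localise the decision (2D here for brevity; T-CAM applies the same idea to space-time flow features):

```python
import torch
import torch.nn as nn

# Sketch of a class activation map: global average pool the final feature
# map, classify, then project the classifier weights back onto the map.

conv = nn.Conv2d(3, 32, 3, padding=1)
fc = nn.Linear(32, 2)                # forwards vs backwards

x = torch.randn(1, 3, 16, 16)
fmap = torch.relu(conv(x))           # (1, 32, 16, 16) final feature map
logits = fc(fmap.mean(dim=(2, 3)))   # global average pool, then classify
cam = torch.einsum("kc,bchw->bkhw", fc.weight, fmap)  # per-class activation map
print(logits.shape, cam.shape)       # torch.Size([1, 2]) torch.Size([1, 2, 16, 16])
```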

Page 61:

The inevitable cheating …

Dataset: UCF-101 actions

Train/Test: 70%/30%

AoT Test accuracy: 98%

Chance accuracy: 50%

Cautionary tale: Chromatic aberration used as shortcut in Doersch C, Gupta A, Efros AA, Unsupervised visual representation learning by context prediction. ICCV 2015

Donglai Wei, Joseph Lim, Bill Freeman, Andrew Zisserman CVPR 2018

Page 62:

Cue I: black framing

When black stripe signals are zeroed out, test accuracy drops ~10%.

Black stripes are not "purely black" (plot: stripe intensity over time and height).

46% of videos have black framing

Donglai Wei, Joseph Lim, Bill Freeman, Andrew Zisserman CVPR 2018

Page 63:

cluster A (camera zoom-in)

cluster B (camera tilt-down)

K-means clustering on test clips with top scores

Cue II: cinematic conventions

73% of videos have camera motion

Page 64:

when camera motion is stabilized, test accuracy drops ~10%

Original vs camera-stabilized (black stripe removed)

Stabilize to remove camera motion/zoom

Page 65:

Datasets and Performance

Flickr 150K shots

• Obtained from the 1.74M shots used in Thomee et al. (2016) & Vondrick et al. (2016), after black stripe removal and stabilization

• Split 70:30 for train:test

Model accuracy on test set: 81%

Human accuracy on test set: 81%

Chance: 50%

Donglai Wei, Joseph Lim, Bill Freeman, Andrew Zisserman CVPR 2018

Page 66:

"Semantic" motions: input video with prediction heat map (backward / forward).

Page 67:

Evaluation: Action Classification

Procedure:

• Pre-train network

• Fine tune & test network on UCF101 human action classification benchmark

• * = Wang et al, Temporal Segment Networks, 2016 (also VGG-16 and flow, pre-trained on ImageNet)

Pre-train | Performance

T-CAM on AoT on Flickr 150k shots | 84.1

T-CAM on AoT on UCF-101 | 86.3

Flow network on ImageNet* | 85.7

Donglai Wei, Joseph Lim, Bill Freeman, Andrew Zisserman CVPR 2018

Page 68:

Tracking Emerges by Colorizing Videos

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy, ECCV 2018

Page 69:

Color is mostly temporally coherent

Page 70:

Temporal Coherence of Color

Panels: RGB | Color Channels | Quantized Color

Page 71:

Self-supervised Tracking

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018.

Task: given a color video, colorize all frames of a gray-scale version using a reference frame (panels: Reference Frame | Gray-scale Video).

Page 72:

What color is this?

Page 73:

Where to copy color from?

Page 74:

Semantic correspondence

Page 75:

Input Frame

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018.

Page 76:

Colorize by Pointing

Diagram: each pixel of the input frame (target colors) points into the reference frame (reference colors).

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018.
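A sketch of pointing, assuming per-pixel embeddings and quantised colour labels (random tensors stand in for the CNN features): each target pixel attends over the reference frame and copies a soft mixture of its colours.

```python
import torch
import torch.nn.functional as F

# Sketch of "colorize by pointing": each target-frame pixel attends over
# the reference-frame embeddings and copies a soft mixture of the
# reference colour labels.

B, N, D = 2, 14 * 14, 64        # batch, pixels per frame, embedding dim
NUM_COLOURS = 16                # quantised colour classes (toy value)

ref_feat = torch.randn(B, N, D, requires_grad=True)  # reference-frame embeddings
tgt_feat = torch.randn(B, N, D, requires_grad=True)  # gray-scale target embeddings
ref_colours = F.one_hot(torch.randint(0, NUM_COLOURS, (B, N)),
                        NUM_COLOURS).float()         # colours of reference pixels

attn = F.softmax(tgt_feat @ ref_feat.transpose(1, 2), dim=-1)  # (B, N, N) pointers
pred = attn @ ref_colours                       # copied colour distribution per pixel
target = torch.randint(0, NUM_COLOURS, (B, N))  # true quantised colours of the target
logp = torch.log(pred.clamp_min(1e-8))
loss = F.nll_loss(logp.reshape(-1, NUM_COLOURS), target.reshape(-1))
loss.backward()
```

At test time, the same attention weights can propagate a segmentation mask or keypoints instead of colours, which is how the tracking shown on the following slides emerges.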


Page 81:

Video Colorization

Panels: Reference Frame | Gray-scale Video | Predicted Color | Ground Truth

Train: Kinetics

Evaluate: DAVIS

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018.

Page 82:

Visualizing Embeddings

Project embedding to 3 dimensions and visualize as RGB

Train: Kinetics

Evaluate: DAVIS

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018.

Page 83:

Tracking Emerges!

Diagram: the reference mask on the reference frame is propagated to a predicted mask on the input frame.

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018.

Page 84:

Only the first frame is given. Colors indicate different instances.

Segment Tracking Results

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018.

Page 85:

Pose Tracking Results

Only the skeleton in the first frame is given.

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018.

Page 86:

Part III

Self-Supervised Learning from Videos with Sound

Page 87:

Audio-Visual Co-supervision

Sound and frames are:

• Semantically consistent

• Synchronized

Page 88:

Audio-Visual Co-supervision

Objective: use vision and sound to learn from each other

• Two types of proxy task:

1. Predict audio-visual correspondence

2. Predict audio-visual synchronization

Page 89:

Audio-Visual Co-supervision

Train a network to predict if image and audio clip correspond

Correspond?

“Objects that Sound”, Arandjelović and Zisserman, ICCV 2017 & ECCV 2018

Page 90:

Audio-Visual Correspondence

drum

guitar

Page 91:

drum

guitar

positive

Audio-Visual Correspondence


Page 93:

drum

guitar

negative

Audio-Visual Correspondence

Page 94:

Correspond? yes/no

Contrastive loss based on distance between vectors.

Distance between audio and visual vectors:

• Small: AV from the same place in a video (Positives)

• Large: AV from different videos (Negatives)

Train network from scratch

Diagram: a visual subnetwork embeds a single frame; an audio subnetwork embeds 1 s of audio.

Audio-Visual Embedding (AVE-Net)
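A sketch of this setup (tiny subnetworks, margin, and spectrogram shape are illustrative; the core is the contrastive loss on the distance between the two 128-D embeddings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of audio-visual correspondence with a distance-based contrastive
# loss. The tiny subnetworks, margin, and spectrogram shape are illustrative.

vision = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                       nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128))
audio = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128))

frames = torch.randn(8, 3, 64, 64)         # a single frame per clip
spectrograms = torch.randn(8, 1, 64, 64)   # 1 s of audio as a log-spectrogram
match = torch.randint(0, 2, (8,)).float()  # 1 = same place in a video, 0 = different

v = F.normalize(vision(frames), dim=1)       # 128-D visual embedding
a = F.normalize(audio(spectrograms), dim=1)  # 128-D audio embedding
d = (v - a).norm(dim=1)                      # distance between the embeddings
margin = 1.0
loss = (match * d.pow(2) + (1 - match) * F.relu(margin - d).pow(2)).mean()
loss.backward()
```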

Page 95:

Overview

What can be learnt by watching and listening to videos?

• Good representations

– Visual features

– Audio features

• Intra- and cross-modal retrieval

– Aligned audio and visual embeddings

• "What is making the sound?"

– Learn to localize objects that sound

“Objects that Sound”, Arandjelović and Zisserman, ICCV 2017 & ECCV 2018

Page 96:

• Andrew Owens …

– Owens, A., Jiajun, W., McDermott, J., Freeman, W., Torralba, A.: Ambient sound provides supervision for visual learning. ECCV 2016

– Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E., Freeman, W.: Visually indicated sounds. CVPR 2016

• Other MIT work:

– Aytar, Y., Vondrick, C., Torralba, A.: SoundNet: Learning sound representations from unlabeled video. NIPS 2016

• From the past:

– Kidron, E., Schechner, Y.Y., Elad, M.: Pixels that sound. CVPR 2005

– De Sa, V.: Learning classification from unlabelled data. NIPS 1994

Background: Audio-Visual

Page 97:

Dataset

- AudioSet (from YouTube), has labels

- 200k x 10s clips

- use musical instruments classes

- Correspondence accuracy on test set: 82% (chance: 50%)

“Objects that Sound”, Arandjelović and Zisserman, ICCV 2017 & ECCV 2018

Page 98:

Use audio and visual features

What can be learnt by watching and listening to videos?

• Good representations

– Visual features

– Audio features

• Intra- and cross-modal retrieval

– Aligned audio and visual embeddings

• "What is making the sound?"

– Learn to localize objects that sound

Diagram: visual subnetwork (single frame) and audio subnetwork (1 s), trained on correspond? yes/no.

“Objects that Sound”, Arandjelović and Zisserman, ICCV 2017 & ECCV 2018

Page 99:

Sound classification

• ESC-50 dataset

– Environmental sound classification

– Use the net to extract features

– Train linear SVM (a sketch of this protocol follows below)

Results: Audio features

“Objects that Sound”, Arandjelović and Zisserman, ICCV 2017 & ECCV 2018
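A toy sketch of this linear-probe protocol with scikit-learn (random features stand in for the frozen network's activations):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Toy sketch of the evaluation protocol: freeze the self-supervised network,
# extract features, and train only a linear classifier on the labelled
# benchmark.

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1600, 512))     # features of labelled training clips
test_feats = rng.normal(size=(400, 512))
train_labels = rng.integers(0, 50, size=1600)  # ESC-50 has 50 classes
test_labels = rng.integers(0, 50, size=400)

clf = LinearSVC(C=1.0).fit(train_feats, train_labels)  # only the SVM is trained
print("accuracy:", accuracy_score(test_labels, clf.predict(test_feats)))
```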

Page 100:

ImageNet classification

• Standard evaluation procedure for the unsupervised / self-supervised setting

– Use the net to extract visual features

– Linear classification on ImageNet

• On par with state-of-the-art self-supervised approaches

• The only method whose features haven't seen ImageNet images

– Probably never seen a 'Tibetan terrier'

– Video frames are quite different from images

Results: Vision features


Page 102:

Query on image, retrieve audio

Search in 200k video clips of AudioSet

Query frame

Top 10 ranked audio clips

“Objects that Sound”, Arandjelović and Zisserman, ICCV 2017 & ECCV 2018


Page 104:

Objects that Sound

AVE-Net: a single 128-D audio representation and a single 128-D visual embedding.

AVOL-Net: apply the visual ConvNet convolutionally to obtain a 14x14 spatial grid of 128-D visual representations; train with multiple instance learning.

“Objects that Sound”, Arandjelović and Zisserman, ICCV 2017 & ECCV 2018

Page 105:

Localizing objects with sound

Input: audio and video frame

Output: localization heatmap on frame

What would make this sound?

Note, no video (motion) information is used

“Objects that Sound”, Arandjelović and Zisserman, ICCV 2017 & ECCV 2018

Page 106:

To embed or not to embed?

Two designs for combining the audio and visual embeddings:

• Concatenation: features available

• Embedding: cross-modal alignment in embedding

Page 107:

Specialize to talking heads …

Objective: use faces and voice to learn from each other

• Two types of proxy task:

1. Predict audio-visual correspondence

2. Predict audio-visual synchronization


Page 109:

Lip-sync problem on TV

Page 110:

Face-Speech Synchronization

• Positive samples: in sync

• Negative samples: out of sync (introduce temporal offset)

Chung, Zisserman (2016) “Out of time: Automatic lip sync in the wild”

Page 111:

Sequence-sequence face-speech network

• The network is trained with contrastive loss to:

– Minimise distance between positive pairs

– Maximise distance between negative pairs

Contrastive loss

Chung, Zisserman (2016) “Out of time: Automatic lip sync in the wild”
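A sketch of how such training pairs can be formed (the two embedding networks are omitted and all sizes are illustrative; the point is the offset-based negatives and the contrastive loss):

```python
import torch
import torch.nn.functional as F

# Sketch of sync training pairs: positives align the audio window with the
# mouth-region frames; negatives shift the audio in time.

def contrastive(v, a, match, margin=1.0):
    d = (F.normalize(v, dim=1) - F.normalize(a, dim=1)).norm(dim=1)
    return (match * d.pow(2) + (1 - match) * F.relu(margin - d).pow(2)).mean()

video_emb = torch.randn(8, 256, requires_grad=True)  # embedding of a short frame window
audio_feat = torch.randn(8, 256, 40)                 # audio features over time

aligned = audio_feat[:, :, :20].mean(dim=2)          # audio window in sync
offset = int(torch.randint(5, 20, (1,)))             # random temporal offset
shifted = audio_feat[:, :, offset:offset + 20].mean(dim=2)  # out of sync

loss = (contrastive(video_emb, aligned, torch.ones(8))
        + contrastive(video_emb, shifted, torch.zeros(8)))
loss.backward()
```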

Page 112:

Face-Speech Synchronization

Chung, Zisserman (2016) “Out of time: Automatic lip sync in the wild”

Averaged sliding windows: the predicted offset value is >99% accurate, averaged over 100 frames.

Plots: distance vs offset for in-sync, off-sync, and non-speaker examples.

Page 113:

Application: Lip Synchronization

Page 114:

Application: Active speaker detection

Blue: speaker Red: non-speaker

Page 115:

Face-Speech Synchronization - summary

The network can be used for:

– Audio-to-video synchronisation

– Active speaker detection

– Voice-over rejection

– Visual features for lip reading

Page 116:

Audio-Visual Synchronization

Page 117:

Self-supervised Training

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features, Andrew Owens, Alyosha Efros, 2018

Page 118:

Misaligned Audio

Shifted audio track

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features, Andrew Owens, Alyosha Efros, 2018

Page 119:

Visualizing the location of sound sources

Diagram: a stack of 3D convolutions over the video and 1D convolutions over the audio; a 3D class activation map localizes the sound sources.

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features, Andrew Owens, Alyosha Efros, 2018

Page 120:

Page 121:

Summary: Audio-Visual Co-supervision

Objective: use vision and sound to learn from each other

• Two types of proxy task:

1. Predict audio-visual correspondence -> semantics

2. Predict audio-visual synchronization -> attention

• Lessons are applicable to any two related sequences, e.g. stereo video, RGB/D video streams, visual/infrared cameras …

Page 122:

• Self-Supervised Learning from images/video

– Enables learning without explicit supervision

– Learns visual representations – on par with ImageNet training

• Self-Supervised Learning from videos with sound

– Intra- and cross-modal retrieval

– Learn to localize sounds

– Tasks are not just a proxy: synchronization and attention, for example, are directly applicable

• Applicable to other domains with paired signals, e.g.

– face and voice

– Infrared/visible

– RGB/D

– Stereo streams …

Summary

