
Ehsan Amiri

Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles

Mehdi Noroozi and Paolo Favaro

Presented by: Ehsan Amiri

Outline

• Introduction

– Deep Learning in Visual Tasks

• Unsupervised Learning

• Self-supervised learning

• Transfer learning

• Related works

• The Jigsaw puzzles

– Motivation

• Proposed Method

– CFN Architecture

– Training the CFN

– Implementation

• Experiments

• Summary

Introduction: Deep Learning in Visual Tasks

• Supervised learning

– Use labeled data to train a parametric model

– Deep Convolutional Neural Networks (AlexNet)

– Requires manual labeling of data (costly)

Source: Krizhevsky et al. [1]



• Unsupervised learning

– Representation / Feature learning

• General-purpose priors (smoothness, temporal and spatial coherence, sparsity, sharing of factors, and other priors).

• A general criterion for designing a representation is not available.

• Solution: disentangling the factors of variations.

– Methods :

1. Probabilistic Methods

2. Direct Mapping Methods

3. Manifold learning Methods

4. Self-supervised learning Methods


• Unsupervised learning

– Probabilistic Methods

• Observed and latent variables

• Max P(latent | observed)

• Restricted Boltzmann Machine (RBM)

• Problem: intractable in the presence of multiple layers

– Direct Mapping Methods (autoencoders)

• Feature extraction function (encoder)

• Mapping from feature back to input (decoder)

• Minimizing the reconstruction error

– Manifold learning Methods

• Map smooth variations of factors to observations

• Problem: the nearest-neighbor computation scales quadratically with the number of samples, and a high density of samples is needed

– Self-supervised learning Methods
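As a rough illustration of the direct mapping (autoencoder) idea listed above, the sketch below builds a tiny linear autoencoder in NumPy and runs gradient descent on the reconstruction error. The layer sizes, learning rate, number of steps and function names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def reconstruction_error(x, w_enc, w_dec):
    """Mean squared error between inputs and their reconstructions."""
    features = x @ w_enc          # encoder: input -> feature
    x_hat = features @ w_dec      # decoder: feature -> input
    return float(np.mean((x - x_hat) ** 2))

def train_linear_autoencoder(x, n_features=2, lr=0.1, steps=500, seed=0):
    """Gradient descent on the reconstruction error of a linear autoencoder."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w_enc = rng.normal(scale=0.1, size=(d, n_features))
    w_dec = rng.normal(scale=0.1, size=(n_features, d))
    for _ in range(steps):
        features = x @ w_enc
        residual = features @ w_dec - x                 # shape (n, d)
        # Gradients of mean((x_hat - x)^2) with respect to both weight matrices.
        grad_dec = features.T @ residual * (2.0 / (n * d))
        grad_enc = x.T @ (residual @ w_dec.T) * (2.0 / (n * d))
        w_enc -= lr * grad_enc
        w_dec -= lr * grad_dec
    return w_enc, w_dec
```

Real autoencoders use nonlinear, multi-layer encoders and decoders; the linear case is used here only because its gradients can be written out by hand.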


• Self-supervised learning

– Exploit labels that are freely available in visual data

– Two types of labels:

• Easily accessible via non-visual signals (ego-motion, audio, text and so on)

• Obtained from the structure of the data (pixel arrangement)


• Transfer learning

[Figure: features learned on Task 1 (pre-training) are extracted and repurposed for Task 2 (fine-tuning)]

Related works

• Wang and Gupta

– Extract matched patches via tracking in videos.

– Bounding boxes (SURF)

– Tracking (KCF)

Source: Wang and Gupta [2]


• Wang and Gupta

– Siamese-triplet network

– Learns a metric that defines the similarity between patches

– Use the learned features in object detection (PASCAL VOC 2012) and surface normal estimation
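A minimal sketch of a triplet ranking loss of the kind such a network could be trained with, assuming cosine distance between feature vectors and a hinge with a margin. The margin value and the function names are illustrative assumptions, not details given on the slides.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_ranking_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that reaches zero once the tracked (positive) patch is closer
    to the anchor than the unrelated (negative) patch by at least `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

Here `anchor` and `positive` stand for features of two patches matched by tracking, and `negative` for the features of a patch taken from elsewhere.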


[2]


• Wang and Gupta

– Advantage: intraclass variability (such as illumination, occlusion, viewpoint, pose and clutter)

– Disadvantage: different instances of the same object are not necessarily clustered together semantically.


[2]


• Agrawal et al.

– Uses freely available egomotion as the supervisory signal

– Siamese network (on MNIST)

– Use the learned features in object recognition (ILSVRC-2012), scene recognition (SUN), intraclass keypoint matching (PASCAL VOC 2012) and visual odometry (SF)

Source: Agrawal et al. [3]


• Agrawal et al.

– Disadvantage :

• Intraclass variability is limited.

• Learned features focus on similarities (color and texture) rather than high-level structure.

Source: Agrawal et al. [3]


• Doersch et al.

– Convolutional network

– Classify the relative position of two patches from the same image

– Trained on ImageNet 2012
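To make this pretext task concrete, here is a minimal sketch of how one training pair could be sampled: a center patch, one of its eight neighbors in a 3 × 3 layout, and the neighbor's index as the class label. The patch size, the absence of gaps or jitter between patches, and the function name are simplifying assumptions.

```python
import numpy as np

# Offsets (row, col) of the eight neighbors around the center cell of a 3x3 grid.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]

def sample_patch_pair(image, patch=32, rng=None):
    """Return (center_patch, neighbor_patch, label) with label in 0..7."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    # Top-left corner of the center patch, leaving room for any neighbor.
    top = int(rng.integers(patch, h - 2 * patch + 1))
    left = int(rng.integers(patch, w - 2 * patch + 1))
    label = int(rng.integers(len(NEIGHBOR_OFFSETS)))
    d_row, d_col = NEIGHBOR_OFFSETS[label]
    center = image[top:top + patch, left:left + patch]
    n_top, n_left = top + d_row * patch, left + d_col * patch
    neighbor = image[n_top:n_top + patch, n_left:n_left + patch]
    return center, neighbor, label
```

The image must be at least three patches wide and tall; a classifier would then be trained to predict `label` from the two patches.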

Source: Doersch et al. [4]


• Doersch et al.

– Use the learned features in object detection (PASCAL VOC 2007) and visual data mining (PASCAL VOC 2011)

– Many ambiguities remain, because only two patches are observed.

Source: Doersch et al. [4]

The Jigsaw puzzles

• Appearance

– Introduced by John Spilsbury (1760)

– Associated with learning

– Hooper Visual Organization Test

• visual perception, construction and integration

Source : http://www.jigzone.com

The Jigsaw puzzles: Motivation

• Reassembly problem

– Visuospatial representation of objects

– Observing all tiles of the jigsaw puzzle intersects the ambiguities of the individual tiles and can reduce them to a singleton.

Source: Noroozi and Favaro

Method

• Solving the puzzle

– Convolutional Neural Network (CNN)

• Immediate (naive) solution:

– Stack the 9 tiles along the channel dimension: 9 × 3 = 27 input channels

– Increase the filter depth of the first AlexNet layer accordingly

– Problem: the CNN learns low-level texture statistics close to the tile boundaries

– Such cues require no understanding of the global object, which is what the features should capture

• Idea

– First compute features based only on the pixels of each tile

– Delay the computation of statistics across tiles
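The difference between the two designs can be sketched in terms of array shapes. In the snippet below, `tile_features` is a hypothetical stand-in for the shared per-tile network (just a random projection followed by a ReLU), so it illustrates the data flow only, not the actual CFN layers.

```python
import numpy as np

def stack_tiles(tiles):
    """Naive design: concatenate the 9 RGB tiles along the channel axis."""
    # tiles: (9, H, W, 3) -> (H, W, 27)
    return np.concatenate(list(tiles), axis=-1)

def tile_features(tile, weights):
    """Hypothetical stand-in for the shared per-tile network."""
    return np.maximum(tile.reshape(-1) @ weights, 0.0)

def context_free_features(tiles, weights):
    """Delayed-context design: the same function (same weights) is applied to
    each tile separately, and the results are only combined afterwards."""
    return np.concatenate([tile_features(t, weights) for t in tiles])
```

In the first design any filter can mix pixels from different tiles from the very first layer, whereas in the second no computation crosses tile boundaries until the per-tile features are concatenated.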


• Proposed Architecture

– Siamese-ennead convolutional neural network

– Context Free Network (CFN)

– Context is only handled in the last fully connected layers.

– Each row up to the fc6 layer uses the AlexNet architecture.

– Weights are shared across the rows up to and including fc6; fc7 receives the concatenated fc6 outputs

Source: Noroozi and Favaro


• Context Free Network (CFN)



• CFN vs. AlexNet Architecture

– Each row uses the same architecture as AlexNet

– The stride of the first layer is set to 2 instead of 4

– CFN is more compact than AlexNet

• Total: 27.5M vs. 61M parameters in AlexNet

• fc6 layer: ~2M vs. 37.5M parameters in AlexNet

• fc7 layer: 2M parameters more than the same layer in AlexNet
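The fully connected layer figures above can be checked with a little arithmetic. The layer shapes used below (a 4 × 4 × 256 pool5 output and 512 fc6 units per CFN row, against 6 × 6 × 256 and 4096 units in AlexNet, with a 4096-unit fc7 in both) are assumptions taken from the paper rather than from the slides, and biases are ignored.

```python
# Weight counts of the fully connected layers (biases ignored).
cfn_fc6 = 4 * 4 * 256 * 512          # per-row fc6, shared across the 9 rows
alexnet_fc6 = 6 * 6 * 256 * 4096
cfn_fc7 = 9 * 512 * 4096             # fc7 sees the 9 concatenated fc6 outputs
alexnet_fc7 = 4096 * 4096

print(f"CFN fc6:        {cfn_fc6 / 1e6:.1f}M")
print(f"AlexNet fc6:    {alexnet_fc6 / 1e6:.1f}M")
print(f"fc7 difference: {(cfn_fc7 - alexnet_fc7) / 1e6:.1f}M")
```

Under these assumptions the exact product for the AlexNet fc6 layer is about 37.7M, slightly above the rounded 37.5M quoted on the slide.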

Training the CFN

• Input data

– ImageNet (1.3M Images)

– Resize each input image so that either its height or its width is 256 pixels

– Crop a random region of 225 × 225 pixels

– Split it into a 3 × 3 grid of 75 × 75 pixel cells

– Extract a 64 × 64 tile from each cell with a random shift

– No color dropping or filling with noise
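The cropping and tiling steps above can be sketched as follows, assuming the image has already been resized so that both sides are at least 225 pixels. The function name and its parameters are illustrative, not taken from the slides.

```python
import numpy as np

def extract_tiles(image, crop=225, grid=3, tile=64, rng=None):
    """Random crop -> grid x grid cells -> one randomly shifted tile per cell."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    region = image[top:top + crop, left:left + crop]
    cell = crop // grid                      # 75 pixels for the default sizes
    tiles = []
    for row in range(grid):
        for col in range(grid):
            # Random shift of the tile inside its cell (0 to cell - tile pixels).
            dy = int(rng.integers(0, cell - tile + 1))
            dx = int(rng.integers(0, cell - tile + 1))
            y, x = row * cell + dy, col * cell + dx
            tiles.append(region[y:y + tile, x:x + tile])
    return np.stack(tiles)                   # (grid * grid, tile, tile, channels)
```

Because each 64 × 64 tile is smaller than its 75 × 75 cell, neighboring tiles are separated by a random gap, so the tiles do not share exact edges.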

[Figure: a 225 × 225 image crop and the 64 × 64 tiles extracted from it]

• Jigsaw Puzzle task

– Define a set of tile configurations (permutations)

– Rearrange the input according to one configuration.

– Use only a subset of 100 configurations instead of all 9! = 362,880 possible solutions.

– Select them based on Hamming distance (minimum, average or maximum)

– Generate them at each iteration via hash tables.

– Output: a vector of probability values, one per configuration
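One way to obtain a subset of permutations that are far apart in Hamming distance is a greedy selection, sketched below. This is one possible heuristic under stated assumptions (a random starting permutation, then repeatedly adding the permutation farthest from those already chosen); it is not necessarily the procedure used by the authors, and the function names are hypothetical.

```python
import itertools
import numpy as np

def select_permutations(n_tiles=9, count=100, seed=0):
    """Greedily pick `count` permutations that are far apart in Hamming distance."""
    rng = np.random.default_rng(seed)
    all_perms = np.array(list(itertools.permutations(range(n_tiles))), dtype=np.int8)
    chosen = [int(rng.integers(len(all_perms)))]
    # min_dist[k]: Hamming distance from permutation k to its nearest chosen one.
    min_dist = (all_perms != all_perms[chosen[0]]).sum(axis=1)
    while len(chosen) < count:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist = (all_perms != all_perms[nxt]).sum(axis=1)
        min_dist = np.minimum(min_dist, dist)
    return all_perms[chosen]

def average_hamming(perms):
    """Mean pairwise Hamming distance, normalized by the number of tiles."""
    diff = perms[:, None, :] != perms[None, :, :]
    n = len(perms)
    return diff.sum() / (n * (n - 1) * perms.shape[1])
```

Each selected row is then one class of the puzzle task: the tiles are shuffled according to that row, and the network is asked to output its index.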

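[Figure: the jigsaw puzzle task]

Original tile indices (3 × 3 grid): 1 2 3 / 4 5 6 / 7 8 9

Possible solutions (index, permutation) and the corresponding CFN output:

1 {1,2,3,4,5,6,7,8,9} → 0.0

2 {7,8,3,2,5,4,6,1,9} → 0.0

3 {8,7,3,6,5,1,4,2,9} → 1.0

...

100 {6,1,3,7,5,2,8,4,9} → 0.0

Input to the CFN (tiles shuffled with configuration 3): 8 7 3 / 6 5 1 / 4 2 9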


• Jigsaw Puzzle Task

– Output: a conditional probability density function (PDF) of the spatial arrangement of the scene parts

$$p(S \mid A_1, A_2, \ldots, A_9) = p(S \mid F_1, F_2, \ldots, F_9) \prod_{i=1}^{9} p(F_i \mid A_i)$$

– $S$: configuration of the tiles

– $A_i$: appearance of the i-th part of the object

– $F_i$: intermediate feature representation

– Goal: train the CFN so that the features $F_i$ have semantic attributes that identify the relative position between parts

– High-dimensional PDF


• Jigsaw Puzzle Task

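– Output: a conditional probability density function (PDF) of the spatial arrangement of the scene parts

$$p(S \mid A_1, A_2, \ldots, A_9) = p(S \mid F_1, F_2, \ldots, F_9) \prod_{i=1}^{9} p(F_i \mid A_i)$$

– Problem: if the CFN learns to associate each appearance $A_i$ with an absolute position, the features $F_i$ will have no semantic meaning.

– Strategy: feed several puzzles of the same image

• (on average 69 of the 100 configurations)

– If $S = \{L_1, L_2, \ldots, L_9\}$, where $L_i$ is the position of the i-th tile, then

$$p(L_1, L_2, \ldots, L_9 \mid F_1, F_2, \ldots, F_9) = \prod_{i=1}^{9} p(L_i \mid F_i)$$

– About 90M jigsaw puzzles (1.3M images × 69 configurations on average)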

Implementation

• Jigsaw Puzzle task

– Stochastic gradient descent

– Without batch normalization

– Titan X GPU

– Converges after 350K iterations

– Base learning rate: 0.01

– 59.5 hours in total (about 2.5 days)

CFN Filter activations

• Visualization of top 16 activations

• 6 significant hand-picked channels

• 20 randomly sampled 64 × 64 patches from the ImageNet validation set

[Figure: conv1 filters] Source: Noroozi and Favaro


Conv1 activations

• Different types of textures

Conv2 activations

• Different types of textures



Conv3 activations

• Face Detector

Conv4 activations

• Part Detector



Conv5 activations

• Other Part Detectors

• Scene Part Detectors


Results

• Experiment 1: Transfer learning from the classification task to jigsaw puzzles

– Goal: show the relation between object classification and jigsaw puzzle solving.

– Transfer features from a pre-trained AlexNet to solve jigsaw puzzles

– Use a locking scheme (the transferred layers are kept fixed and the remaining layers are trained)

– Semantic training is helpful.


• Experiment 2: Object classification

– Question: from which layer should one extract the features?

– The last layers of AlexNet are specific to the task, while the first layers are general-purpose [5].

– Repurpose the CFN, [2] and [4] for classification on ImageNet 2012.

– Use a locking scheme

– Reference: maximum accuracy of 57.4% for AlexNet

• Experiment 3: Object detection

– Use CFN features for object detection with Fast R-CNN [6].

– Baseline: Fast R-CNN with AlexNet trained on ImageNet as pre-training weights (56.5% mAP)

– Fill the fully connected layers in Fast R-CNN with Gaussian random weights (mean: 0.1, std: 0.001).

– Step strategy for the learning rate

• Base learning rate: 0.001

• Step: 5K

• Max. iterations: 150K

– Evaluate all the methods on PASCAL VOC 2007

• Experiment 3: Object detection

– CFN pre-trained on ImageNet (CFN-Sup): 56.3% mAP

– CFN pre-trained with jigsaw puzzles

• CFN-4: based on a 2 × 2 tile grid

• CFN-9: based on a 3 × 3 tile grid

– CFN-9 (min): average Hamming distance 0.45

– CFN-9 (middle): average Hamming distance 0.67

– CFN-9 (max): average Hamming distance 0.88

[Table: detection results; * marks results obtained using R-CNN] Source: Noroozi and Favaro

• Experiment 4: Image retrieval

– Find the nearest neighbors (NN) of pool5 features

• Queries: bounding boxes from the PASCAL VOC 2007 test set

• Retrieval entries: bounding boxes from the trainval set

– Discard bounding boxes that contain fewer than 10K pixels

– Rank the images

• by the inner product between the normalized features of a query image and the normalized features of the retrieval set

– Show the top 4 matches
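The ranking step above can be sketched in a few lines of NumPy. The sketch assumes that "normalized" means L2-normalized feature vectors, so the inner product is the cosine similarity; the function name is illustrative.

```python
import numpy as np

def top_matches(query, retrieval_set, k=4):
    """Indices of the k retrieval entries most similar to the query."""
    q = query / np.linalg.norm(query)
    r = retrieval_set / np.linalg.norm(retrieval_set, axis=1, keepdims=True)
    scores = r @ q                       # inner products of normalized features
    return np.argsort(-scores)[:k]       # highest score first
```

Here `query` would be the flattened pool5 feature vector of one query box and `retrieval_set` a matrix with one such vector per row.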



– Qualitative evaluation

[Figure: retrieval results; columns: Query, AlexNet, CFN, Doersch et al. [4]] Source: Noroozi and Favaro


– Qualitative evaluation

[Figure: retrieval results; columns: Query, Wang and Gupta [2], AlexNet [1] with random weights]


– Quantitative evaluation


Summary

• Context Free Network (CFN)

– Features are transferable between jigsaw puzzle reassembly and detection/classification tasks (the tasks are compatible).

– Requires no manual labeling.

• Shorter convergence time (2.5 days) than Doersch et al. (4 weeks).

• In object classification

– On ImageNet 2012, without fine-tuning: 38.1% (best among the unsupervised methods compared)

• In object detection

– On PASCAL VOC 2007: 51.8% mAP (the performance of the learned features is close to that of the supervised AlexNet, 56.5% mAP)

[Figure: transfer learning pipeline. Learning the features: solving the jigsaw puzzle (pretext task). Utilizing the features: the extracted features are transferred to classification / object detection.]

References

[1] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097-1105 (2012)

[2] Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. ICCV (2015)

[3] Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. The IEEE International Conference on Computer Vision (ICCV) (December 2015)

[4] Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. ICCV (2015)

[5] Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? NIPS, pp. 3320-3328 (2014)

[6] Girshick, R.: Fast R-CNN. In: The IEEE International Conference on Computer Vision (ICCV) (December 2015)

Thank you

Q&A
