
Clustering-style Self-Supervised Learning

Clustering-style Self-Supervised Learning. Mathilde Caron, FAIR Paris & Inria Grenoble. June 20th, 2021. CVPR 2021 Tutorial: Leave Those Nets Alone: Advances in Self-Supervised Learning
Page 1: Clustering-style Self-Supervised Learning · BYOL SwAV 45.1 46.3 47.8 46.8 Method. 65 DINO applied per-frame to a video supervised. 66 Different attention heads focus on different

Clustering-style Self-Supervised Learning

Mathilde Caron - FAIR Paris & Inria Grenoble
June 20th, 2021

CVPR 2021 Tutorial: Leave Those Nets Alone: Advances in Self-Supervised Learning


Self-Supervised Learning (SSL)

Designing a learning task that does not rely on human annotations

Example: Colorization (Zhang et al. | 2016)


Designing SSL tasks is an active research area

2014 2015 2016 2017 2018 2019 2020 2021

Dosovitskiy et al. (Exemplar CNN)
Doersch et al. (context pred.)
Wang et al. (video)
Agrawal et al. (motion)
Jayaraman et al. (motion)
Pathak et al. (inpainting)
Noroozi et al. (jigsaw)
Zhang et al. (colorization)
Larsson et al. (colorization)
Owens et al. (sound)
Zhang et al. (split-brain)
Bojanowski & Joulin (NAT)
Doersch et al. (multi-task)
Pathak et al. (motion & segment)
Yang et al. (clusters)
Donahue et al. (BiGAN)
Dumoulin et al. (BiGAN)
Wu et al. (NPID)
Gidaris et al. (rotnet)
Caron et al. (DeepCluster)
Caron et al. (DeeperCluster)
Simon et al. (artifacts)
Jayaraman (shapecodes)
van den Oord et al. (CPC)
Huang et al. (kNN)
Tian et al. (CMC)
Donahue & Simonyan (BigBiGAN)
Hénaff et al. (CPC)
Bachman et al. (amdim)
Chen et al. (SS-GAN)
Asano et al. (sela)
Minderer et al. (adversarial)
Chen et al. (SimCLR)
He et al. (MoCo)
Misra & van der Maaten (PIRL)
Grill et al. (BYOL)
Caron et al. (SwAV)
Gidaris et al. (bag of words)
Chen et al. (simsiam)
Patacchiola & Storkey (rel. reason.)
Wang et al. (invariance prop)
Li et al. (PCL)
Tian et al. (InfoMin)

Starting my PhD !


Supervised pre-training: labels classification

Training images + labels Neural network Classification

mountain dog tower


Backprop

Supervised pre-training: labels classification

Training images + labels Neural network Classification

mountain dog tower


Backprop

We do not have labels !

Training images + labels Neural network Classification

??? ??? ???


Can we replace labels with clustering ?


DeepCluster: Deep Clustering for Unsupervised Learning of Visual Features

Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze
ECCV 2018
github.com/facebookresearch/deepcluster


dataset

DeepCluster

feature space


feature space

dataset

DeepCluster

k-means clustering

backprop

pseudo-label = cluster assignment
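
The clustering half of this loop can be sketched as follows. This is a minimal illustration in plain Python, assuming 2-D toy features and a hypothetical helper `kmeans_pseudo_labels` (neither comes from the slides). The actual DeepCluster code clusters ConvNet features and then trains the network by backprop to predict the resulting pseudo-labels.

```python
def kmeans_pseudo_labels(features, k, iters=10):
    """Run plain k-means and return one cluster index (pseudo-label) per feature."""
    # Assumption for this sketch: initialise centroids with the first k features.
    centroids = [list(f) for f in features[:k]]
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: each feature goes to its nearest centroid.
        for i, f in enumerate(features):
            dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [f for f, lab in zip(features, labels) if lab == j]
            if members:  # keep the old centroid if the cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy "feature space": two well-separated groups of points.
features = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
            [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
pseudo_labels = kmeans_pseudo_labels(features, k=2)
# These pseudo-labels would then serve as classification targets for backprop.
```

The cluster indices themselves are arbitrary; only the grouping matters, which is why they are called pseudo-labels.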


feature space

dataset

Invariance to cropping

k-means clustering

backprop

pseudo-label = cluster assignment

random crop


How to Evaluate Self-Supervised Learning ?

Use learned representations for downstream tasks


How to Evaluate Self-Supervised Learning ?

Example: Object detection on Pascal VOC07 dataset

Object detector: Fast R-CNN (Girshick. | 2015)

dog

• Random
• Supervised
• Self-Supervised
• DeepCluster

house


Results on Object Detection on Pascal VOC07

[Bar chart; y-axis: mAP (higher is better)]

(2018)


DeepCluster also produces… clusters!

Clustering evaluation

Clustering visualization


Limitations of DeepCluster

• Does not scale (depends on the dataset size)

[Timeline: epoch 1 to epoch 10. Epoch 1 uses the first clustering; a new clustering is computed at each of epochs 2 to 10.]

The clusters (i.e. pseudo-labels) are refined during training


Limitations of DeepCluster

• Does not scale (depends on the dataset size)

epoch 1 epoch 2

new clustering

first clustering

Huge dataset: we can afford only 2 epochs!

Problem: clusters are refined only once…


Limitations of DeepCluster

• Does not scale (depends on the dataset size)

epoch 1

first clustering

Even bigger dataset: we never see an image twice

Problem: the clusters are never refined!


Limitations of DeepCluster

feature space

• Does not scale (depends on the dataset size)

• Do we really need k-means ?

centroid

centroid

centroid


Limitations of DeepCluster

• Does not scale (depends on the dataset size)

• Do we really need k-means ?

feature space

pseudo-labels

centroid

centroid

centroid


Limitations of DeepCluster

feature space

• Does not scale (depends on the dataset size)

• Do we really need k-means ?

Of no use !

Of no use !

Of no use !

pseudo-labels


Limitations of DeepCluster

Collapse of feature space:

• Does not scale (depends on the dataset size)

• Do we really need k-means ?

• Tricks to avoid collapse


Limitations of DeepCluster

• Does not scale (depends on the dataset size)

• Do we really need k-means ?

• Tricks to avoid collapse

Collapse of feature space:


Limitations of DeepCluster

• Does not scale (depends on the dataset size)

• Do we really need k-means ?

• Tricks to avoid collapse

• Importance of random cropping is only implicit

How to overcome these limitations ?


SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin
NeurIPS 2020
github.com/facebookresearch/swav


Pseudo-labels in SwAV

feature space


Pseudo-labels in SwAV

feature space


Pseudo-labels in SwAV

feature space

similar to

similar to

similar to

Most similar

All we need is a score for each cluster !

We can directly use the neural network to output scores !


Pseudo-labels in SwAV

output 1

output 2

output 3

neural network output

Constraint: the total score for each output must be the same

SELA - Asano et al. ICLR 2020
UIC - Chen et al. ECCV 2020


Pseudo-labels in SwAV

output 1

output 2

output 3

Constraint: the total score for each output must be the same

………

neural network output


Pseudo-labels in SwAV

output 1

output 2

output 3

………

Constraint: the total score for each output must be the same

Sinkhorn adjusts the scores !

neural network output
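
One way to enforce the "same total score for each output" constraint is a few Sinkhorn-style iterations that alternately rescale the columns (outputs) and the rows (samples) of the minibatch score matrix. The sketch below is a plain-Python illustration, not the SwAV implementation. The function name `sinkhorn`, the parameters `epsilon` and `n_iters`, and all values are assumptions chosen for the toy example.

```python
import math

def sinkhorn(scores, epsilon, n_iters):
    """Turn minibatch scores (samples x outputs) into balanced soft pseudo-labels."""
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for numerical stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every output receives the same total mass, 1/k.
        col_sums = [sum(row[j] for row in q) for j in range(k)]
        q = [[row[j] / (k * col_sums[j]) for j in range(k)] for row in q]
        # Row step: every sample carries the same total mass, 1/n.
        row_sums = [sum(row) for row in q]
        q = [[v / (n * rs) for v in row] for row, rs in zip(q, row_sums)]
    # Rescale so that each row is a probability distribution over the outputs.
    return [[v * n for v in row] for row in q]

# Toy minibatch in which the raw scores favour output 1 for every sample.
scores = [[0.9, 0.1, 0.2], [0.8, 0.3, 0.1], [0.7, 0.2, 0.4],
          [0.9, 0.4, 0.3], [0.6, 0.5, 0.1], [0.8, 0.1, 0.6]]
q = sinkhorn(scores, epsilon=0.5, n_iters=50)
column_totals = [sum(row[j] for row in q) for j in range(3)]
```

After the adjustment each row still sums to one, while the three column totals are (approximately) equal, so no single output can absorb the whole minibatch.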



Pseudo-labels in SwAV

output 1

output 2

output 3

………

neural network output

Recap

• We don't need k-means

• Explicit constraints to prevent collapse

• Scalable: minibatch only !

epoch 1 epoch 2

pseudo-label at each minibatch


SwAV: the full picture

one minibatch

Sinkhorn adjustment


SwAV: the full picture

Pseudo-labels

one minibatch

backprop

Sinkhorn adjustment

Classification loss

SimCLR - Chen et al. 2020
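
The classification loss in this picture can be sketched as a "swapped" cross-entropy: the pseudo-label computed from one view of an image supervises the scores predicted from the other view. The snippet is a plain-Python illustration for a single image. The helper names, the temperature value and the toy numbers are assumptions, not taken from the slides.

```python
import math

def softmax(scores, temperature=0.1):
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs):
    """Cross-entropy between a (soft) pseudo-label and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))

def swapped_loss(scores_a, scores_b, label_a, label_b):
    """View A must predict the pseudo-label of view B, and vice versa."""
    return (cross_entropy(label_b, softmax(scores_a))
            + cross_entropy(label_a, softmax(scores_b)))

# Scores of two views of the same image; both favour the first output.
scores_a, scores_b = [0.9, 0.1, 0.2], [0.8, 0.3, 0.1]
loss_agree = swapped_loss(scores_a, scores_b, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
loss_disagree = swapped_loss(scores_a, scores_b, [0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
```

The loss is small when the prediction from one view matches the pseudo-label of the other view, and large otherwise.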


Multi-crop

one minibatch

backprop

Sinkhorn adjustment

Classification loss


Multi-crop

Global crops

Jigsaw - Noroozi & Favaro 2016
PIRL - Misra et al. 2020


Multi-crop

Global crops

Local crops

Local crops predict the pseudo-labels of the global crops

Local-to-global matching
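
A rough sketch of multi-crop and of the local-to-global matching, in plain Python. The crop boxes are `(left, top, right, bottom)` tuples. The scale ranges, the function names and the image size are illustrative assumptions; only the "2 global + 4 local crops" setting comes from the slides.

```python
import random

def sample_crop(width, height, scale_range, rng):
    """Sample a crop box covering roughly `scale` of the image area."""
    scale = rng.uniform(*scale_range)
    crop_w = max(1, int(width * scale ** 0.5))
    crop_h = max(1, int(height * scale ** 0.5))
    left = rng.randint(0, width - crop_w)
    top = rng.randint(0, height - crop_h)
    return (left, top, left + crop_w, top + crop_h)

def multi_crop(width, height, n_global=2, n_local=4, seed=0):
    rng = random.Random(seed)
    # Assumed scale ranges: global crops are large, local crops are small.
    global_crops = [sample_crop(width, height, (0.4, 1.0), rng) for _ in range(n_global)]
    local_crops = [sample_crop(width, height, (0.05, 0.4), rng) for _ in range(n_local)]
    return global_crops, local_crops

def matching_pairs(n_global, n_local):
    """(predictor, target) index pairs: every crop predicts each other global crop."""
    n_total = n_global + n_local  # indices 0..n_global-1 are the global crops
    return [(p, t) for t in range(n_global) for p in range(n_total) if p != t]

global_crops, local_crops = multi_crop(224, 224)
pairs = matching_pairs(n_global=2, n_local=4)
```

Pseudo-labels are only computed for the global crops; every other crop, local or global, is trained to predict them.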


Linear benchmark on ImageNet

2 crops

* networks all trained for 400 epochs

2 global crops + 4 local crops (multi-crop)


Linear benchmark on ImageNet

2 crops

2 global crops + 4 local crops (multi-crop)

* networks all trained for 400 epochs

+6%


SwAV vs Supervised Pretraining

We evaluate representations on different downstream tasks.


SwAV vs Supervised Pretraining

Classification - Linear

Object Detection – Full finetuning


Great milestone for SSL in 2020

SSL outperforms supervised pre-training in transfer learning

Excellent performance on ImageNet
e.g. SimCLR-v2 (Chen et al.) and BYOL (Grill et al.) > 79% top-1 !!


Great milestone for SSL but…

Recent SSL methods are very similar to each other (SimSiam, Chen & He 2020)

→ performance saturation

Let us seek progress in an orthogonal direction !

ViT - Dosovitskiy et al. 2020
DeiT - Touvron et al. 2020


Can we improve SSL by using Vision Transformers ?


DINO: Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
Under review
github.com/facebookresearch/dino


ConvNets & Vision Transformers

ConvNets are the de facto architecture for images.

Recently, Vision Transformers (Dosovitskiy et al. 2020) have emerged as an alternative to ConvNets.


SwAV: ConvNet VS ViT


From SwAV to DINO

minibatch

backprop

Sinkhorn score adjustment

Mean Teacher - Tarvainen et al. 2017
MoCo - He et al. CVPR 2020
BYOL - Grill et al. NeurIPS 2020


From SwAV to DINO

minibatch

backprop

Sinkhorn score adjustment

EMA

Mean Teacher - Tarvainen et al. 2017
MoCo - He et al. CVPR 2020
BYOL - Grill et al. NeurIPS 2020


DINO: Self-Distillation with No Labels

minibatch

backprop

EMA

Teacher

Student
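
The EMA arrow between student and teacher can be read as an exponential moving average of the weights: the student is trained by backprop, and the teacher slowly follows it. A minimal plain-Python sketch, where parameters are flat lists of floats and the momentum value is an assumption chosen for illustration:

```python
def ema_update(teacher, student, momentum):
    """Move each teacher parameter a small step towards the student parameter."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]
student = [1.0, -2.0]  # in practice these change at every backprop step
for _ in range(3):
    # No gradient flows into the teacher; it is only updated through the EMA.
    teacher = ema_update(teacher, student, momentum=0.9)
```

With a momentum close to one the teacher changes slowly, which gives the student a stable target.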


Collapse to one unique dimension


Centering

Center = Average Score


Centering


Centering

Collapse to uniform assignment


Centering + Sharpening
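
A small plain-Python sketch of how the two operations can be combined on the teacher scores. Centering subtracts the average score per output, which counters the collapse to one dimension. Sharpening uses a low softmax temperature, which counters the collapse to a uniform assignment. The function names, temperatures, momentum and toy scores below are assumptions for illustration only.

```python
import math

def softmax(scores, temperature):
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_targets(batch_scores, center, temperature):
    """Subtract the center from the scores, then apply a softmax with temperature."""
    return [softmax([s - c for s, c in zip(scores, center)], temperature)
            for scores in batch_scores]

def update_center(center, batch_scores, momentum):
    """Center = moving average of the teacher scores over minibatches."""
    n = len(batch_scores)
    batch_mean = [sum(col) / n for col in zip(*batch_scores)]
    return [momentum * c + (1.0 - momentum) * m for c, m in zip(center, batch_mean)]

# Toy teacher scores in which the first output dominates for every sample.
batch = [[2.0, 0.1, 0.0], [2.1, 0.0, 0.3], [1.9, 0.3, 0.1]]
argmax = lambda probs: probs.index(max(probs))

no_center = teacher_targets(batch, [0.0, 0.0, 0.0], temperature=0.04)
center = update_center([0.0, 0.0, 0.0], batch, momentum=0.0)  # plain batch mean
centered = teacher_targets(batch, center, temperature=0.04)
centered_soft = teacher_targets(batch, center, temperature=1.0)
```

Without centering all three samples pick the same output. With centering the dominant dimension is removed, and the low temperature keeps each target peaked instead of near-uniform.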


DINO: ConvNet VS ViT


DINO: ConvNet VS ViT

+7%


DINO + ViT: excellent K-NN performance

[Plot: top-1 k-NN accuracy on ImageNet (y-axis) vs. throughput in img/sec (x-axis)]

DINO + ViT

Previous works
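
For reference, k-NN evaluation needs no training on top of the frozen features: a query image gets the label that is most common among its nearest training features. The plain-Python sketch below uses cosine similarity and a simple majority vote. The toy features, labels and `k` are assumptions, and the vote is a simplification of what an actual evaluation protocol may use.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_predict(query, train_features, train_labels, k=3):
    """Label a query by majority vote among its k most similar training features."""
    ranked = sorted(range(len(train_features)),
                    key=lambda i: cosine(query, train_features[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy frozen features with their labels.
train_features = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
                  [0.1, 1.0], [0.2, 0.9], [0.0, 1.0]]
train_labels = ["dog", "dog", "dog", "tower", "tower", "tower"]
prediction = knn_predict([0.8, 0.3], train_features, train_labels)
```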


Application to copy detection

Query

DINO: Average Precision 85.5%

Supervised ViT: Average Precision 76.4%

Multigrain architecture: Average Precision 82.5%


DINO & ViT: Recap

• DINO trains to high performance with ViTs

• k-NN performance ++ → applications to copy detection and image retrieval

• Interpretability


Self-Attention visualizations

• We look at the self-attention of the [CLS] token of the last block
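
As a rough illustration of what is being visualized, the sketch below computes, for one attention head, the attention weights of a [CLS] query over the patch keys and reshapes them into the patch grid. It is a simplified plain-Python example: the softmax is taken over the patches only, and the vectors, grid size and function names are made-up assumptions rather than values from a trained ViT.

```python
import math

def cls_attention(q_cls, patch_keys):
    """Scaled dot-product attention weights of the [CLS] query over patch keys."""
    d = len(q_cls)
    logits = [sum(a * b for a, b in zip(q_cls, key)) / math.sqrt(d)
              for key in patch_keys]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_grid(weights, n_cols):
    """Reshape the flat list of patch weights into rows of the patch grid."""
    return [weights[i:i + n_cols] for i in range(0, len(weights), n_cols)]

# Toy example: a 2x2 grid of patches with 2-D keys.
q_cls = [1.0, 0.0]
patch_keys = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0], [-1.0, 0.0]]
weights = cls_attention(q_cls, patch_keys)
attention_map = to_grid(weights, n_cols=2)
```

Displaying `attention_map` as an image, upsampled to the input resolution, gives the kind of map shown on these slides.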


Self-Attention visualizations

• We look at the self-attention of the [CLS] token of the last block

Method                 mIoU with GT
DINO                   45.9
Supervised             27.3

We train ViT with other SSL methods:

Method                 mIoU with GT
DINO w/o multicrop     45.1
MoCo-v2                46.3
BYOL                   47.8
SwAV                   46.8


DINO applied per-frame to a video

supervised


Different attention heads focus on different parts


Application to video object segmentation on DAVIS17


Application to video object segmentation on DAVIS17

Best SSL (dense), Jabri et al. 2020
DINO 16x16 patches
DINO 8x8 patches


Thank You

