
DisCo: Remedy Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning

Yuting Gao¹* Jia-Xin Zhuang¹,²* Ke Li¹ Hao Cheng¹ Xiaowei Guo¹ Feiyue Huang¹ Rongrong Ji³ Xing Sun¹†

¹Tencent Youtu Lab  ²Sun Yat-sen University  ³Xiamen University

{yutinggao,scorpioguo,garyhuang}@tencent.com
{lincolnz9511,tristanli.sh,louischeng369,winfred.sun}@gmail.com, rrji@xmu.edu.cn

arXiv:2104.09124v3 [cs.CV] 9 Aug 2021

Abstract

While self-supervised representation learning (SSL) has received widespread attention from the community, recent research argues that its performance suffers a cliff fall when the model size decreases. Current methods mainly rely on contrastive learning to train the network, and in this work we propose a simple yet effective Distilled Contrastive Learning (DisCo) to ease this issue by a large margin. Specifically, we find that the final embedding obtained by mainstream SSL methods contains the most fruitful information, and propose to distill the final embedding to maximally transmit a teacher's knowledge to a lightweight model by constraining the last embedding of the student to be consistent with that of the teacher. In addition, in our experiments we find a phenomenon termed the Distilling Bottleneck and propose to enlarge the embedding dimension to alleviate this problem. Our method does not introduce any extra parameters to lightweight models during deployment. Experimental results demonstrate that our method achieves state-of-the-art performance on all lightweight models. In particular, when ResNet-101/ResNet-50 is used as the teacher to teach EfficientNet-B0, the linear evaluation result of EfficientNet-B0 on ImageNet is very close to that of ResNet-101/ResNet-50, while the number of parameters of EfficientNet-B0 is only 9.4%/16.3% of ResNet-101/ResNet-50. Code is available at https://github.com/Yuting-Gao/DisCo-pytorch.

1. Introduction

Deep learning has achieved great success in computer vision tasks, including image classification, object detection,

* The first two authors contributed equally. This work was done when Jia-Xin Zhuang was an intern at Tencent Youtu Lab.

† Corresponding Author, winfred.sun@gmail.com

Figure 1. ImageNet top-1 linear evaluation accuracy (%) vs. number of parameters (millions) for different network architectures (Eff-B0, Mob-V3, Eff-B1, R-18, R-34; legend: DisCo (Ours), SEED, MoCo-V2, Teacher (R-50)). Our method significantly exceeds the result of using MoCo-V2 directly and also surpasses the state-of-the-art SEED by a large margin. In particular, the result of EfficientNet-B0 is quite close to the teacher ResNet-50, while the number of parameters of EfficientNet-B0 is only 16.3% of ResNet-50. The improvement brought by DisCo is relative to the MoCo-V2 baseline.

and semantic segmentation. Such success relies heavily on manually labeled datasets, which are time-consuming and expensive to obtain. Therefore, more and more researchers have begun to explore how to make better use of off-the-shelf unlabeled data. Among them, Self-supervised Learning (SSL) is an effective way to exploit the information contained in the data itself by using proxy signals as supervision. Usually, after pre-training the network on massive unlabeled data with self-supervised methods and fine-tuning on downstream tasks, the performance of downstream tasks is significantly improved. Hence, SSL has attracted widespread attention from the community, and many methods have been proposed [27, 32, 13, 35, 8, 9, 20, 10, 19].


Among them, methods based on contrastive learning [8, 9, 20, 10, 19] are becoming the mainstream due to their superior results. These methods constantly refresh the state of the art with relatively large networks but are unsatisfying on some lightweight models. For example, the number of parameters of MobileNet-v3-Large/ResNet-152 is 5.2M/57.4M [25, 22], and the corresponding linear evaluation top-1 accuracy on ImageNet [38] using MoCo-V2 is 36.2%/74.1%. Compared to their fully supervised counterparts, 75.2%/78.57%, the result of MobileNet-v3-Large is far from satisfactory. Meanwhile, in real scenarios, sometimes only lightweight models can be deployed due to limited hardware resources. Therefore, improving the ability of self-supervised representation learning on small models is of great significance.

Knowledge distillation [24] is an effective way to transfer the knowledge learned by a large model (teacher) to a small model (student). Recently, some self-supervised representation learning methods use knowledge distillation to improve the efficacy of small models. SimCLR-V2 [9] uses logits in the fine-tuning stage to transfer the knowledge in a task-specific way. CompRess [1] and SEED [18] mimic the similarity score distribution between a teacher and a student model over a dynamically maintained queue. Though distillation has been shown to be effective, two factors affect the result prominently, i.e., which knowledge is most needed by the student and how to deliver it. In this work, we propose new insights on these two aspects.

In the current mainstream contrastive learning based SSL methods, a multi-layer perceptron (MLP) is added after the encoder to obtain a low-dimensional embedding. The training loss and the accuracy evaluation are both performed on this embedding. We thus hypothesize that this final embedding contains the most fruitful knowledge and should be regarded as the first choice for knowledge transfer. To achieve this, we propose a simple yet effective Distilled Contrastive Learning (DisCo) to transfer this knowledge from large models to lightweight models in the pre-training stage. Specifically, DisCo takes the MLP embedding obtained by the teacher as the knowledge and injects it into the student by constraining the corresponding embedding of the student to be consistent with that of the teacher using an MSE loss. In addition, we find that a budgeted dimension of the hidden layer in the MLP of the student may cause a knowledge transmission bottleneck. We term this phenomenon the Distilling Bottleneck and propose to enlarge the embedding dimension to alleviate it. This simple yet effective operation relates to the capability of model generalization in the setting of self-supervised learning from the Information Bottleneck [42] perspective. It is worth noting that our method only introduces a small number of additional parameters in the pre-training phase, while during the fine-tuning and deployment stages there is no extra computational burden since the MLP layer is removed.

Figure 2. The framework of the proposed method DisCo. One image is first transformed into two views by two drastic data augmentation operations. In addition to the original contrastive SSL part, a self-supervised pre-trained teacher is introduced, and the final embeddings obtained by the learnable student and the frozen teacher are required to be consistent for each view. (Best viewed in color.)

Experimental results demonstrate that DisCo can effectively transfer the knowledge in the teacher to the student, making the representations extracted by the student more generalized. Our approach is simple, and incorporating it into existing contrastive based SSL methods can bring significant gains.

In summary, our main contributions are twofold:

• We propose a simple yet effective self-supervised distillation method to boost the representation abilities of lightweight models.

• We achieve state-of-the-art SSL results on lightweight models. In particular, the linear evaluation result of EfficientNet-B0 [40] on ImageNet is quite close to that of ResNet-101/ResNet-50, while the number of parameters of EfficientNet-B0 is only 9.4%/16.3% of ResNet-101/ResNet-50.

2. Related Work

2.1. Self-supervised Learning

Self-supervised learning (SSL) is a generic framework that learns high-level semantic patterns from the data itself without any human-annotated tags. Current methods mainly


rely on three paradigms, i.e., pretext-task based, contrastive based, and clustering based.

Pretext tasks. Approaches based on the pretext paradigm focus on designing more effective surrogate tasks, including Exemplar-CNN [15], which identifies whether patches are cropped from the same image; Rotation [27], which predicts the rotation degree of the input image; Jigsaw [32], which places shuffled patches back to their original positions; and Context Encoder [35], which recovers the missing part of the input image conditioned on its surroundings.

Contrastive based. Contrastive based approaches have shown impressive performance on self-supervised representation learning; they enforce different views of the same input to be closer in feature space [11, 9, 8, 23, 20, 10, 19, 45, 46, 50]. SimCLR [8, 9] shows that self-supervised learning can be boosted by applying strong data augmentation, training with a larger batch of negative samples, and adding a projection head (MLP) after the global average pooling. However, SimCLR relies on a very large batch size to achieve comparable performance and cannot be applied widely in many real-world scenarios. MoCo [20, 10] treats contrastive learning as dictionary look-up, using a memory bank to maintain consistent representations of negative samples; thus MoCo achieves superior performance without a large batch size, which is more feasible to implement. BYOL [19] introduces a predictor in one branch of the network to break the symmetry and avoid trivial solutions. DINO [6] applies contrastive learning to vision transformers.

Clustering based. Clustering is one of the most promising approaches for unsupervised representation learning. DeepCluster [3] uses k-means assignments to generate pseudo-labels to iteratively group the features and update the weights of the network. DeeperCluster [4] scales to large uncurated datasets to capture complementary statistics. Different from previous works that maximize the mutual information between pseudo labels and input data, SeLa [2] casts the pseudo-label assignment as an instance of optimal transport. SwAV [5] maps representations to prototype vectors, which are assigned online, and is capable of scaling to larger datasets.

Although the mainstream methods SimCLR-V2, MoCo-V2, BYOL, and SwAV belong to different self-supervised categories, they have four things in common: 1) two views for one image, 2) two encoders for feature extraction, 3) two projection heads to map the representations into a lower-dimensional space, and 4) the two low-dimensional embeddings are regarded as a pair of positive samples, which can be considered a contrast process. However, all of these methods suffer a performance cliff fall on lightweight models, which is what we try to remedy in this work.

2.2. Knowledge Distillation

Knowledge distillation (KD) tries to transfer the knowledge from a larger teacher model to a smaller student model. According to the form of knowledge, it can be classified into three categories: logits-based, feature-based, and relation-based.

Logits-based. Logits refer to the output of the network classifier. KD [24] proposes to make the student mimic the logits of the teacher by minimizing the KL-divergence of the class distributions.

Feature-based. Feature-based methods directly transfer the knowledge from the intermediate layers of the teacher to the student. FitNets [37] regards the intermediate representations learned by the teacher as hints and transfers the knowledge to a thinner and deeper student by minimizing the mean square error between the representations. AT [49] proposes to use the spatial attention of the teacher as the knowledge and lets the student attend to the areas that the teacher is concerned about. SemCKD [7] adaptively selects the more appropriate representation pairs of the teacher and student.

Relation-based. Relation-based approaches explore the relationships between data instances instead of the output of a single instance. RKD [34] transfers the mutual relationships of the input data within one batch from the teacher to the student with distance-wise and angle-wise distillation losses. IRG [31] proposes to use the instance relationship graph to further express the relational knowledge.

2.3. SSL meets KD

Recently, some works combine self-supervised learning and knowledge distillation. CRD [41] introduces a contrastive loss to transfer pair-wise relationships across different modalities. SSKD [48] lets the student mimic transformed data and self-supervision tasks to transfer richer knowledge. The above-mentioned works take self-supervision as an auxiliary task to further boost knowledge distillation under a fully supervised setting. CompRess [1] and SEED [18] employ knowledge distillation as a means to improve the self-supervised visual representation learning capability of small models; they utilize the negative sample queue in MoCo [20] to constrain the student's distribution of a positive sample over negative samples to be consistent with that of the teacher. However, CompRess and SEED heavily rely on the MoCo framework, which means that a memory bank always has to be maintained during the distillation process. Our method also aims to boost the self-supervised representation learning ability of lightweight models by distilling; however, we do not restrict the self-supervised framework and are thus more flexible. Furthermore, our method surpasses SEED by a large margin on all lightweight models under the same setting.


3. Method

In this section, we introduce the proposed Distilled Contrastive Learning (DisCo) on lightweight models. We first give some preliminaries on contrastive-based SSL, then introduce the overall architecture of DisCo and how DisCo transfers the knowledge from the teacher to the student. Finally, we present how DisCo can be combined with existing contrastive-based SSL methods.

3.1. Preliminary on Contrastive-Based SSL

In Figure 3, we show the framework diagrams of four mainstream contrastive-based SSL methods; they have some common characteristics, as listed below.

Figure 3. Diagrams of four mainstream SSL methods: (a) SimCLR-V2, (b) MoCo-V2, (c) BYOL, (d) SwAV.

Two views: one input image x is transformed into two views v and v′ by two drastic data augmentation operations.

Two encoders: the two augmented views are input to two encoders of the same structure; one is a learnable base encoder s(·), and the other, m(·), is updated according to the base encoder, either shared or momentum-updated. The encoder can use any network architecture, such as the commonly used ResNet. For an input image, the representation obtained after the encoder and global average pooling is denoted as Z, and its dimension is D.

Projection head: both encoders are followed by a small projection head p(·) that maps the representation Z to a low-dimensional embedding E. It consists of several linear layers and can be formulated as E = p(Z) = W^(n) · · · σ(W^(1) Z), where W^(i) is the weight matrix of the i-th linear layer, n ≥ 1 is the number of layers, and σ is the ReLU non-linearity. The importance of the projection head has been addressed in SimCLR-V2 [9] and MoCo-V2 [10]. Following the setting of MoCo-V2, we define the default configuration of the projection head as two linear layers with dimensions D and 128.
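As a concrete illustration, here is a minimal PyTorch-style sketch of such a projection head; the helper name build_projection_head is ours, and the default D → D → 128 configuration follows the MoCo-V2 setting described above.

```python
import torch.nn as nn

def build_projection_head(in_dim: int, hidden_dim: int, out_dim: int = 128) -> nn.Module:
    """Two-layer projection head p(.) mapping the pooled representation Z (dim D)
    to a low-dimensional embedding E, with a ReLU between the linear layers."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

# Default MoCo-V2-style configuration (D -> D -> 128), e.g. for ResNet-50 (D = 2048):
head = build_projection_head(in_dim=2048, hidden_dim=2048, out_dim=128)
```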

Loss function: after obtaining the final embeddings of the two views, they are regarded as a pair of positive samples to calculate the loss.

3.2. Overall Architecture

The framework of the proposed DisCo is shown in Figure 2, which consists of three encoders, each followed by a projection head. The first encoder s(·) is the student that we want to learn, the second is the other encoder m(·) of the mainstream self-supervised method, and the third is the pre-trained large teacher t(·).

For each input image x, it is first transformed into two views v and v′ by two drastic data augmentation operations. On the one hand, v is input to s(·) and t(·), generating two representations Z_s = s(v) and Z_t = t(v); after the projection heads, these two representations are mapped to low-dimensional embeddings E_s = p_s(Z_s) and E_t = p_t(Z_t), respectively. On the other hand, v′ is input to s(·), m(·), and t(·) simultaneously; after encoding and projecting, three low-dimensional vectors E′_s = p_s(s(v′)), E′_m = p_m(m(v′)), and E′_t = p_t(t(v′)) are obtained. E′_m and E_s are the embeddings of two different views, which are regarded as a pair of positive samples and are pulled together as in existing SSL methods. E_s and E_t, and E′_s and E′_t, are two pairs of student/teacher embeddings of the same view, and each pair is constrained to be consistent during the distilling procedure, which is introduced in detail in Section 3.3.
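To make the data flow explicit, the following PyTorch-style sketch computes the five embeddings used by DisCo for one image; the function name disco_forward and the argument layout are ours, and the momentum/teacher branches are run without gradients as described above.

```python
import torch

def disco_forward(student, student_head, mean_encoder, mean_head,
                  teacher, teacher_head, v, v_prime):
    """Compute the embeddings of the two views v and v'.
    E_s/E_sp come from the learnable student, E_mp from the mean (momentum/shared)
    encoder of the underlying SSL method, and E_t/E_tp from the frozen teacher."""
    E_s = student_head(student(v))                # E_s  = p_s(s(v))
    E_sp = student_head(student(v_prime))         # E'_s = p_s(s(v'))
    with torch.no_grad():                         # no gradients flow into these branches
        E_mp = mean_head(mean_encoder(v_prime))   # E'_m = p_m(m(v'))
        E_t = teacher_head(teacher(v))            # E_t  = p_t(t(v))
        E_tp = teacher_head(teacher(v_prime))     # E'_t = p_t(t(v'))
    return E_s, E_sp, E_mp, E_t, E_tp
```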

3.3. Distilling Procedure

In most contrastive-based SSL methods, the calculation of the loss function and the evaluation of accuracy are both performed on the final embedding vector E. We therefore hypothesize that the last embedding E contains the most fruitful knowledge and should be delivered first when distilling.

For a self-supervised pre-trained teacher model, we distill the knowledge in the last embedding into the student; that is, for views v and v′, the embedding vectors output by the frozen teacher and the learnable student should be consistent. Specifically, we use a consistency regularization term to pull the embedding vector E_s closer to E_t, and E′_s closer to E′_t. Formally,

L_dis = ||E_s − E_t||² + ||E′_s − E′_t||²    (1)
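A minimal sketch of this consistency regularization, assuming the student and teacher embeddings share the same dimension (128 here); note that F.mse_loss averages over elements, which matches the squared L2 norm of Eq. (1) up to a constant factor.

```python
import torch.nn.functional as F

def disco_distillation_loss(E_s, E_t, E_sp, E_tp):
    """Consistency regularization of Eq. (1): pull the student embeddings of both
    views towards the corresponding frozen-teacher embeddings with an MSE loss."""
    return F.mse_loss(E_s, E_t.detach()) + F.mse_loss(E_sp, E_tp.detach())
```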

In order to verify that the knowledge contained in the embedding E is the most meaningful, we experimented


with several other commonly used distillation schemes in Section 4.4. The experimental results show that the knowledge we transmit, and the way it is transferred, are indeed more effective.

Distilling Bottleneck. In our distillation experiments, we found an interesting phenomenon. When the encoder of the student network is ResNet-18 or ResNet-34 and the default MLP configuration is adopted, that is, the dimension of the embedding output by the encoder is projected from D to D and then to 128, the results of DisCo are not satisfactory. We assume that this degradation is caused by the dimension of the hidden layer in the MLP being too small, and term this phenomenon the Distilling Bottleneck. In Figure 4, we exhibit the default configuration of the projection head of ResNet-18/34, EfficientNet-B0/B1, MobileNet-v3-Large, and ResNet-50/101/152. It can be seen that the dimension of the hidden layer of ResNet-18/34 is small compared to the other networks.

Figure 4. Default MLP (projection head) of multiple networks: ResNet-18/34: 512 → 512 → 128; MobileNet-v3 and EfficientNet-B0/B1: 1280 → 1280 → 128; ResNet-50/101/152: 2048 → 2048 → 128.

In order to alleviate the Distilling Bottleneck problem, we propose to expand the dimension of the hidden layer in the MLP; a sketch of this change is given below. It is worth noting that this operation only introduces a small number of additional parameters at the self-supervised distillation stage, and the MLP is directly discarded during fine-tuning and deployment, so no extra computational burden is introduced. We experimentally verify in Section 4.7 that such a simple operation brings significant gains.
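A minimal sketch of the widened student head, assuming the hidden dimension is enlarged to the teacher's 2048; the function name is ours and the concrete dimensions follow Figure 4.

```python
import torch.nn as nn

def build_student_head(feat_dim: int, hidden_dim: int = 2048, out_dim: int = 128) -> nn.Module:
    """Student projection head with an enlarged hidden layer to avoid the Distilling
    Bottleneck: feat_dim -> hidden_dim -> out_dim instead of feat_dim -> feat_dim -> out_dim."""
    return nn.Sequential(
        nn.Linear(feat_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

# e.g. ResNet-18 (feat_dim = 512): 512 -> 2048 -> 128 instead of the default 512 -> 512 -> 128
resnet18_head = build_student_head(feat_dim=512)
```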

This operation can be further explained from the Information Bottleneck (IB) [42] perspective. IB is utilized in [39, 12] to understand how deep networks work by visualizing the mutual information I(X;T) and I(T;Y) in the information plane, where I(X;T) is the mutual information between the input and the output and I(T;Y) is the mutual information between the output and the label. The training of deep networks can be described by two phases in the information plane: first a fitting phase, where the network memorizes the information of the input, resulting in the growth of I(X;T) and I(T;Y); then a compression phase, where the network removes irrelevant information of the input for better generalization, resulting in the decrease of I(X;T) (see Figure 7 in [12]). Generally, in the compression phase, I(X;T) reflects the model's capability of generalization, while I(T;Y) reflects the model's capability of fitting the label [12]. We visualize, in the information plane, the compression phase of our model with different hidden-layer dimensions in the pre-training distillation stage on one downstream transfer classification task. The results in Figure 7 show two interesting phenomena:

(i) Models with different dimensions of the hidden layer have very similar I(T;Y), suggesting that they have nearly equal capability of fitting the labels.

(ii) The model with the larger hidden-layer dimension has smaller I(X;T), suggesting a stronger capability of generalization.

These phenomena show that the MLP indeed relates to the capability of model generalization in the setting of self-supervised transfer learning.
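For reference, mutual information in such information-plane analyses is typically estimated by discretizing activations; the sketch below is a generic 1-D binning estimator (our simplification, not the exact procedure of [39, 12], which bins whole activation vectors).

```python
import numpy as np

def mutual_information_binned(x: np.ndarray, t: np.ndarray, bins: int = 30) -> float:
    """Rough I(X;T) estimate (in nats) for two 1-D variables via a joint histogram."""
    joint, _, _ = np.histogram2d(x, t, bins=bins)
    p_xt = joint / joint.sum()              # joint distribution P(X, T)
    p_x = p_xt.sum(axis=1, keepdims=True)   # marginal P(X)
    p_t = p_xt.sum(axis=0, keepdims=True)   # marginal P(T)
    nz = p_xt > 0                           # avoid log(0)
    return float(np.sum(p_xt[nz] * np.log(p_xt[nz] / (p_x @ p_t)[nz])))
```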

3.4. Overall Objective Function

The overall objective function is defined as follows:

L = L_dis + λ L_co    (2)

where L_dis comes from the distillation part, L_co can be the contrastive loss function of any SSL method, and λ is a hyper-parameter that controls the weights of the distillation loss and the contrastive loss. In our experiments, λ is set to 1. Due to the simplicity of its implementation, we use MoCo-V2 as the testbed in the experiments; a training-step sketch is given below.
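Putting the pieces together, a sketch of one training step under Eq. (2), reusing the hypothetical disco_forward and disco_distillation_loss helpers from the earlier sketches; contrastive_loss_fn stands in for the loss of whichever SSL testbed is used (e.g. MoCo-V2's InfoNCE), and the momentum-encoder update of that testbed is assumed to happen elsewhere.

```python
def disco_training_step(v, v_prime, nets, optimizer, contrastive_loss_fn, lam=1.0):
    """One optimization step for L = L_dis + lambda * L_co (lambda = 1 in the paper)."""
    E_s, E_sp, E_mp, E_t, E_tp = disco_forward(
        nets["student"], nets["student_head"],
        nets["mean_encoder"], nets["mean_head"],
        nets["teacher"], nets["teacher_head"],
        v, v_prime,
    )
    loss_co = contrastive_loss_fn(E_s, E_mp)                  # SSL loss on the positive pair (E_s, E'_m)
    loss_dis = disco_distillation_loss(E_s, E_t, E_sp, E_tp)  # Eq. (1)
    loss = loss_dis + lam * loss_co                           # Eq. (2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```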

4. Experiments

4.1. Settings

Dataset. All the self-supervised pre-training experiments are conducted on ImageNet ILSVRC-2012 [38]. For downstream classification tasks, experiments are carried out on Cifar10 [29] and Cifar100 [29]. For downstream detection tasks, experiments are conducted on PASCAL VOC [17] and MS-COCO [30], with train+val/test and train2017/val2017 for training/testing, respectively. For downstream segmentation tasks, the proposed method is verified on MS-COCO.

Teacher Encoders. Six large encoders are used as teachers: ResNet-50 (22.4M), ResNet-101 (40.5M), ResNet-152 (55.4M), ResNet-50×2 (55.5M), ViT-small (22M) [14, 43], and XCiT-small (44.3M) [16], where X (Y) denotes that encoder X has Y million parameters, not counting the linear layer.

Student Encoders. Five widely used small yet effective convolutional neural networks and two vision transformer encoders are used as students: EfficientNet-B0 (4.0M), MobileNet-v3-Large (4.2M), EfficientNet-B1 (6.4M), ResNet-18 (10.7M), ResNet-34 (20.4M), ViT-tiny (5M), and XCiT-tiny (26.1M).

Teacher Pre-training Setting. ResNet-50/101/152 are pre-trained using MoCo-V2 with default hyper-parameters. Following SEED, ResNet-50 and ResNet-101 are trained


Table 1. ImageNet test accuracy (%) using linear classification on different student architectures. ♦ denotes that the teacher/student models are pre-trained with MoCo-V2 (our implementation), and † means the teacher is pre-trained by SwAV (an open-source model). When using R-50×2 as the teacher, SEED distills for 800 epochs while DisCo distills for 200 epochs. The value in parentheses with ↑ (a green subscript in the original) represents the improvement compared to the MoCo-V2 baseline.

| Method | Teacher (T-1) | Eff-b0 (T-1 / T-5) | Eff-b1 (T-1 / T-5) | Mob-v3 (T-1 / T-5) | R-18 (T-1 / T-5) | R-34 (T-1 / T-5) |
|---|---|---|---|---|---|---|
| Supervised | – | 77.1 / 93.3 | 79.2 / 94.4 | 75.2 / – | 72.1 / – | 75.0 / – |
| Self-supervised: | | | | | | |
| MoCo-V2 (Baseline) ♦ | – | 46.8 / 72.2 | 48.4 / 73.8 | 36.2 / 62.1 | 52.2 / 77.6 | 56.8 / 81.4 |
| SSL Distillation: | | | | | | |
| SEED [18] | R-50 (67.4) | 61.3 / 82.7 | 61.4 / 83.1 | 55.2 / 80.3 | 57.6 / 81.8 | 58.5 / 82.6 |
| DisCo (ours) | R-50 (67.4) ♦ | 66.5 (19.7↑) / 87.6 (15.4↑) | 66.6 (18.2↑) / 87.5 (13.7↑) | 64.4 (28.2↑) / 86.2 (24.1↑) | 60.6 (8.4↑) / 83.7 (6.1↑) | 62.5 (5.7↑) / 85.4 (4.0↑) |
| SEED [18] | R-101 (70.3) | 63.0 / 83.8 | 63.4 / 84.6 | 59.9 / 83.5 | 58.9 / 82.5 | 61.6 / 84.9 |
| DisCo (ours) | R-101 (69.1) ♦ | 68.9 (22.1↑) / 88.9 (16.7↑) | 69.0 (20.6↑) / 89.1 (15.3↑) | 65.7 (29.5↑) / 86.7 (24.6↑) | 62.3 (10.1↑) / 85.1 (7.5↑) | 64.4 (7.6↑) / 86.5 (5.1↑) |
| SEED [18] | R-152 (74.2) | 65.3 / 86.0 | 67.3 / 86.9 | 61.4 / 84.6 | 59.5 / 83.3 | 62.7 / 85.8 |
| DisCo (ours) | R-152 (74.1) ♦ | 67.8 (21.0↑) / 87.0 (14.8↑) | 73.1 (24.7↑) / 91.2 (17.4↑) | 63.7 (27.5↑) / 84.9 (22.8↑) | 65.5 (13.3↑) / 86.7 (9.1↑) | 68.1 (11.3↑) / 88.6 (7.2↑) |
| SEED [18] | R-50×2 (77.3 †) | 67.6 / 87.4 | 68.0 / 87.6 | 68.2 / 88.2 | 63.0 / 84.9 | 65.7 / 86.8 |
| DisCo (ours) | R-50×2 (77.3) † | 69.1 (22.3↑) / 88.9 (17.7↑) | 64.0 (15.6↑) / 84.6 (10.8↑) | 58.9 (22.7↑) / 81.4 (19.3↑) | 65.2 (13.0↑) / 86.8 (9.2↑) | 67.6 (10.8↑) / 88.6 (7.2↑) |

for 200 epochs, and ResNet-152 is trained for 400 epochs. ResNet-50×2 is pre-trained by SwAV, which is an open-source model¹, and trained for 800 epochs.

Self-supervised Distillation Setting. The projection head of all the student networks has two linear layers with dimensions 2048 and 128. The configuration of the learning rate and optimizer is the same as MoCo-V2, and unless otherwise stated, the model is trained for 200 epochs. Furthermore, during the distillation stage, the teacher is frozen.

Student Fine-tuning Setting. For linear evaluation on ImageNet, the student is fine-tuned for 100 epochs. The initial learning rate is 3 for EfficientNet-B0/EfficientNet-B1/MobileNet-v3-Large and 30 for ResNet-18/ResNet-34. For linear evaluation on Cifar10 and Cifar100, the initial learning rate is 3 and all the models are fine-tuned for 100 epochs. SGD is adopted as the optimizer, and the learning rate is decreased by 10 at 60 and 80 epochs for both linear evaluations. For downstream detection and segmentation tasks, following [20, 10, 18], all parameters are fine-tuned. For the detection task on VOC, the initial learning rate is 0.1 with 200 warm-up iterations and decays by 10 at 18k and 22.2k steps. The detector is trained for 48k steps with a batch size of 32 on 8 V100 GPUs. Following [18], the scales of images are randomly sampled from [400, 800] during training and set to 800 at inference. For the detection and instance segmentation tasks on COCO, the model is trained for 180k iterations with an initial learning rate of 0.11, and the scales of images are randomly sampled from [600, 800] during training.

¹ https://github.com/facebookresearch/swav
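For concreteness, a minimal PyTorch-style sketch of the linear evaluation protocol described above (frozen encoder, a single linear classifier, SGD with the learning rate decreased by 10 at 60 and 80 epochs); class and variable names are ours, and feat_dim depends on the student architecture.

```python
import torch
import torch.nn as nn

class LinearEval(nn.Module):
    """Linear evaluation head: the distillation MLP is discarded, the pre-trained
    encoder is frozen, and only a linear classifier on the pooled features is trained."""
    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int = 1000):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze the backbone
            p.requires_grad = False
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            z = self.encoder(x)               # pooled representation Z
        return self.classifier(z)

# e.g. a ResNet student (initial learning rate 30, as stated above):
# model = LinearEval(resnet18_encoder, feat_dim=512)
# optimizer = torch.optim.SGD(model.classifier.parameters(), lr=30.0, momentum=0.9)
# scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)
```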

4.2. Linear Evaluation

We conduct linear evaluation on ImageNet to validate the effectiveness of our method. As shown in Table 1, student models distilled by DisCo outperform their counterparts pre-trained by MoCo-V2 (Baseline) by a large margin. Besides, DisCo surpasses the state-of-the-art SEED over various student models with teachers ResNet-50/101/152 under the same setting, especially on MobileNet-v3-Large distilled by ResNet-50, with a difference of 9.2% in top-1 accuracy. When using R-50×2 as the teacher, SEED distills for 800 epochs while DisCo still distills for 200 epochs, yet the results of EfficientNet-B0, ResNet-18, and ResNet-34 using DisCo still exceed those of SEED. The performance on EfficientNet-B1 and MobileNet-v3-Large is closely related to the number of distillation epochs: when EfficientNet-B1 is distilled for 290 epochs, its top-1 accuracy becomes 70.4%, which surpasses SEED, and when MobileNet-v3-Large is distilled for 340 epochs, its top-1 accuracy becomes 64%. We believe that if DisCo distills for 800 epochs, the results will improve further. Moreover, since CompRess uses a better teacher, which is trained for 600 more epochs, and distills for 400 more epochs than SEED and ours, a comparison would not be fair; thus we do not report its result in the table. In addition, when DisCo uses a larger model as the teacher, the student is further improved; for instance, using ResNet-152 instead of ResNet-50 as the teacher, ResNet-34 is improved from 62.5% to 68.1%. It is worth noting that when using ResNet-101/ResNet-50 as the teacher, the linear evaluation result of EfficientNet-B0 is very close to that of the teacher, 68.9% vs. 69.1% and 66.5% vs. 67.4%, while the number of parameters of EfficientNet-B0


Figure 5. ImageNet top-1 accuracy (%) of semi-supervised linear evaluation with 1%, 10%, and 100% training data, plotted against the number of teacher network parameters (millions), for (a) EfficientNet-B0, (b) MobileNet-v3-Large, and (c) ResNet-18, with teachers R-50, R-101, and R-152. Points where the number of teacher network parameters is 0 are the results of the MoCo-V2 baseline without distillation.

Figure 6. Top-1 accuracy of students (R-18 and Eff-b0) transferred to (a) Cifar10 and (b) Cifar100, without and with distillation from different teachers (R-50, R-101, R-152); legend: MoCo-V2, SEED, Ours.

is only 9.4%/16.3% of that of ResNet-101/ResNet-50.

4.3. Semi-supervised Linear Evaluation

Following previous works [26, 28, 33], we evaluate our method under the semi-supervised setting. Two subsets of the ImageNet training data, sampled at 1% and 10% (about 12.8 and 128 images per class, respectively) [8], are used for fine-tuning the student models. As shown in Figure 5, student models distilled by DisCo outperform the baseline under any amount of labeled data. Furthermore, DisCo is consistent under different fractions of annotations, that is, students always benefit from larger models as teachers. More labels help improve the final performance of the student model, which is expected.

4.4. Comparison against Other Distillation Methods

In order to verify the effectiveness of the proposed method, we compare it with three widely used distillation schemes, namely 1) attention transfer, denoted by AT [49]; 2) relational knowledge distillation, denoted by RKD [34]; and 3) knowledge distillation, denoted by KD [24]. AT and RKD are feature-based and relation-based, respectively, and can be utilized during the self-supervised pre-training stage. KD is a logits-based method, which can only be used at the supervised fine-tuning stage. The comparison results are shown in Table 2. Single-Knowledge means using one of these approaches individually; it can be seen that all distillation approaches bring improvement over the baseline, but the gain from DisCo is the most significant, which indicates that the knowledge DisCo chooses to transfer, and the way it is transmitted, are indeed more effective. We also try to transfer multiple kinds of knowledge from teacher to student by combining DisCo with the other distillation schemes. Integrating DisCo with AT/RKD/KD can boost the performance considerably, which further proves the effectiveness of DisCo.

Table 2. Linear evaluation top-1 accuracy (%) on ImageNet compared with different distillation methods.

| Method | Eff-b0 | Eff-b1 | Mob-v3 | R-18 | R-34 |
|---|---|---|---|---|---|
| Baseline: MoCo-V2 [11] | 46.8 | 48.4 | 36.2 | 52.2 | 56.8 |
| Single-Knowledge: AT [49] | 57.1 | 58.2 | 51.0 | 56.2 | 60.2 |
| RKD [34] | 48.3 | 50.3 | 36.9 | 56.4 | 58.7 |
| KD [24] | 46.5 | 48.5 | 37.3 | 51.5 | 58.8 |
| DisCo (ours) | 66.5 | 66.6 | 64.4 | 60.6 | 62.5 |
| Multi-Knowledge: AT + DisCo | 66.7 | 66.3 | 64.1 | 60.2 | 62.3 |
| RKD + DisCo | 66.8 | 66.5 | 64.4 | 60.6 | 62.3 |
| KD + DisCo | 65.8 | 65.9 | 65.2 | 60.6 | 65.9 |

4.5. Transfer to Cifar10/Cifar100

In order to analyze the generalization of the representations obtained by DisCo, we further conduct linear evaluation on Cifar10 and Cifar100 with ResNet-18/EfficientNet-B0 as


Table 3. Object detection and instance segmentation results with ResNet-34 as the backbone; bounding-box AP (AP^bb) and mask AP (AP^mk) are evaluated on VOC07 test and COCO val2017. ‡ means our implementation. The value in parentheses with ↑ (a green subscript in the original) represents the improvement compared to the MoCo-V2 baseline.

| S | T | Method | VOC det: AP^bb / AP^bb_50 / AP^bb_75 | COCO det: AP^bb / AP^bb_50 / AP^bb_75 | COCO seg: AP^mk / AP^mk_50 / AP^mk_75 |
|---|---|---|---|---|---|
| R-34 | × | MoCo-V2 ‡ | 53.6 / 79.1 / 58.7 | 38.1 / 56.8 / 40.7 | 33.0 / 53.2 / 35.3 |
| R-34 | R-50 | SEED [18] | 53.7 / 79.4 / 59.2 | 38.4 / 57.0 / 41.0 | 33.3 / 53.2 / 35.3 |
| R-34 | R-50 | DisCo (ours) | 56.5 (2.9↑) / 80.6 (1.5↑) / 62.5 (3.8↑) | 40.0 (1.9↑) / 59.1 (2.3↑) / 43.4 (2.7↑) | 34.9 (1.9↑) / 56.3 (3.1↑) / 37.1 (1.8↑) |
| R-34 | R-101 | SEED [18] | 54.1 / 79.8 / 59.1 | 38.5 / 57.3 / 41.4 | 33.6 / 54.1 / 35.6 |
| R-34 | R-101 | DisCo (ours) | 56.1 (2.5↑) / 80.3 (1.2↑) / 61.8 (3.1↑) | 40.0 (1.9↑) / 59.1 (2.3↑) / 43.2 (2.5↑) | 34.7 (1.9↑) / 55.9 (2.7↑) / 37.4 (1.8↑) |
| R-34 | R-152 | SEED [18] | 54.4 / 80.1 / 59.9 | 38.4 / 57.0 / 41.0 | 33.3 / 53.7 / 35.3 |
| R-34 | R-152 | DisCo (ours) | 56.6 (3.0↑) / 80.8 (1.7↑) / 63.4 (5.7↑) | 39.4 (1.3↑) / 58.7 (1.9↑) / 42.7 (2.0↑) | 34.4 (1.4↑) / 55.4 (2.2↑) / 36.7 (1.4↑) |

the student and ResNet-50/ResNet-101/ResNet-152 as the teacher. Since the image resolution of the Cifar datasets is 32×32, all images are resized to 224×224 with bicubic re-sampling before being fed into the model, following [11, 18]. The results are shown in Figure 6: the proposed DisCo surpasses the MoCo-V2 baseline by a large margin with different student and teacher architectures on both Cifar10 and Cifar100. In addition, our method also shows a significant improvement compared to the state-of-the-art approach SEED. It is worth noting that as the teacher becomes better, the improvement brought by DisCo becomes more obvious.

4.6. Transfer to Detection and Segmentation

We also conduct experiments on detection and segmentation tasks for generalization analysis. A C4-based Faster R-CNN [36] is used for object detection on VOC, and Mask R-CNN [21] is used for object detection and instance segmentation on COCO. In this part of the experiments, following [10, 18], all the parameters of the student network are learnable, and the implementation is based on detectron2 [47] for convenience. The results of using ResNet-34 as the student with teachers ResNet-50/ResNet-101/ResNet-152 are shown in Table 3. On object detection, our method brings an obvious improvement on both the VOC and COCO datasets. Furthermore, as SEED [18] notes, the improvement on COCO is relatively minor compared to VOC, since the COCO training set has 118k images while VOC has only 16.5k training images; thus the gain brought by weight initialization is relatively small. On the instance segmentation task, DisCo also shows superiority.

4.7. Distilling Bottleneck Analysis

In this section, we analyze the Distilling Bottleneck phenomenon; we use ResNet-50 as the teacher for simplicity.

Distilling Bottleneck Phenomenon. In the self-supervised distillation stage, we first tried to distill small models with the default MLP configuration of MoCo-V2 using DisCo; the results are shown in Table 4, denoted by DisCo*. It is worth noting that the dimensions of the hidden layer in DisCo* are exactly the same as in SEED. Compared to SEED, DisCo* shows superior results on EfficientNet-B0, EfficientNet-B1, and MobileNet-v3-Large, and comparable results on ResNet-18 and ResNet-34. We then expand the dimension of the hidden layer in the MLP of the student to be consistent with that of the teacher, that is, 2048-D; the results are further improved, as recorded in the third row. In particular, this expansion operation brings 3.5% and 3.6% gains for ResNet-18 and ResNet-34, respectively.

Table 4. Linear evaluation top-1 accuracy (%) on ImageNet. DisCo* means that the dimension of the hidden layer in the MLP is consistent with that of SEED.

| Method | Eff-b0 | Eff-b1 | Mob-v3 | R-18 | R-34 |
|---|---|---|---|---|---|
| SEED [18] | 61.3 | 61.4 | 55.2 | 57.6 | 58.5 |
| DisCo* | 65.6 | 65.8 | 63.8 | 57.1 | 58.9 |
| DisCo | 66.5 (0.9↑) | 66.6 (0.8↑) | 64.4 (0.6↑) | 60.6 (3.5↑) | 62.5 (3.6↑) |

Theoretical Analysis from the IB Perspective. In Figure 7, on the downstream Cifar10 classification task, we visualize in the information plane the compression phase of ResNet-18 and ResNet-34 with different hidden dimensions, distilled by the same teacher. When we enlarge the hidden dimension in the MLP of ResNet-18 and ResNet-34 from 512-D to 2048-D, the value of I(X;T) becomes smaller while I(T;Y) is basically unchanged, which suggests that enlarging the hidden dimension makes the student model more generalized in the setting of self-supervised transfer learning.

The Effectiveness of the Dimension. Here we further explore the impact of more dimensions; the results are shown in Figure 9. It can be seen that as the dimension in-


Figure 7. Mutual information paths (I(X;T) vs. I(T;Y)) from transition points to convergence points in the compression phase of training, for (a) ResNet-18 and (b) ResNet-34 with MLP hidden dimensions 512d and 2048d. T denotes transition points and C (X%) denotes convergence points with X% top-1 accuracy on Cifar10. Points with similar I(T;Y) but smaller I(X;T) are better generalized.

creases, the top-1 accuracy also increases, but once the dimension is already large, the growth trend slows down. The performance trends of EfficientNet-B1 and MobileNet-v3-Large are close to that of EfficientNet-B0, so for the sake of clarity we do not exhibit them in the figure.

4.8. More SSL Methods

Variants of the testbed. In this part, in order to demonstrate the versatility of our method, we further try two other SSL methods as the testbed: SwAV and DINO.

For SwAV, the teacher is backboned by ResNet-50, and the results are shown in Table 5. For models with very few parameters, EfficientNet-B0 and MobileNet-v3-Large, the pre-training results with SwAV alone are also very poor; when DisCo is utilized, the efficacy is significantly improved.

Table 5. Linear evaluation top-1 accuracy (%) on ImageNet with SwAV as the testbed. The teacher of DisCo is an open-source ResNet-50 model pre-trained using SwAV with top-1 accuracy 75.3%.

| Method | Eff-b0 | Mob-v3 | R-34 |
|---|---|---|---|
| SwAV [5] | 46.8 | 19.4 | 63.3 |
| SwAV + DisCo | 62.4 | 55.7 | 63.3 |

For DINO, we use two vision transformer architectures, ViT and XCiT; the results are shown in Table 6. No matter which SSL method is adopted and which architecture is used for the network, DisCo brings significant gains. It is worth noting that XCiT-tiny has 26M parameters, which is much larger than ViT-tiny (5M), but DisCo can still narrow the gap with the teacher.

Variants of teacher pre-training method. In order to verify that our method is not picky about the pre-training approach adopted by the teacher, we use three ResNet-50 networks pre-trained with different SSL methods as the teacher under the MoCo-V2 testbed. It can be observed

Table 6. Linear evaluation top-1 accuracy (%) on ImageNet with DINO as the testbed.

| Teacher model | Teacher acc. | ViT-tiny | XCiT-tiny |
|---|---|---|---|
| – | – | 55 | 67 |
| ViT-small [14, 43] | 77 | 68.4 (13.4↑) | – |
| XCiT-small [16] | 77.8 | – | 71.1 (4.1↑) |

Table 7. Linear evaluation top-1 accuracy (%) on ImageNet with variants of teacher pre-training methods. All the teachers are ResNet-50, and the first row is the student trained by MoCo-V2 directly without distillation, which is the baseline.

| Teacher method | Teacher acc. | Eff-b0 | Eff-b1 | Mob-v3 | R-18 | R-34 |
|---|---|---|---|---|---|---|
| – | – | 46.8 | 48.4 | 36.2 | 52.2 | 56.8 |
| MoCo-V2 [10] | 67.4 | 66.5 | 66.6 | 64.4 | 60.6 | 62.5 |
| SeLa-V2 [2] | 71.8 | 62.2 | 68.2 | 66.2 | 64.1 | 65.3 |
| SwAV [5] | 75.3 | 70.0 | 72.1 | 65.0 | 65.1 | 67.5 |

from Table 7 that when using different pre-trained ResNet-50 models as teachers, DisCo significantly boosts all the results of the small models. Furthermore, as the teachers improve with different and stronger pre-training methods, the results of the student can be further improved.

4.9. Visualization Analysis

In Figure 8, we visualize the learned representations of EfficientNet-B0/ResNet-50 pre-trained by MoCo-V2 and of EfficientNet-B0 distilled by ResNet-50 using DisCo. For clarity, we randomly select 10 classes from the ImageNet test set and map the learned representations to a two-dimensional space by t-SNE [44]. ResNet-50 forms more separated clusters than EfficientNet-B0 when using MoCo-V2 alone, and after using ResNet-50 to teach EfficientNet-B0 with DisCo, EfficientNet-B0 behaves very much like the teacher.
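A small sketch of how such a visualization can be produced with scikit-learn's t-SNE and matplotlib; the function name and plotting details are ours, and features is assumed to be an N×D array of encoder outputs for the selected classes.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str) -> None:
    """Project learned representations to 2-D with t-SNE and color points by class."""
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=5)
    plt.title(title)
    plt.xlabel("Dim 1")
    plt.ylabel("Dim 2")
    plt.show()
```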

5. Conclusion

In this paper, we propose Distilled Contrastive Learning (DisCo) to remedy self-supervised learning on lightweight models. The proposed method constrains the final embedding of the lightweight student to be consistent with that of the teacher to maximally transmit the teacher's knowledge. DisCo is not limited to specific contrastive learning methods and can bring the performance of the student very close to that of the teacher.

References

[1] Soroush Abbasi Koohpayegani, Ajinkya Tejankar, and Hamed Pirsiavash. CompRess: Self-supervised learning by compressing representations. In NeurIPS, pages 12980–12992, 2020.


Figure 8. Clustering results on the ImageNet test set, visualized by t-SNE (Dim 1 vs. Dim 2): (a) Eff-b0 (MoCo-V2), (b) R-50 (MoCo-V2), (c) Eff-b0 (DisCo, distilled by ResNet-50). Different colors represent different classes.

Figure 9. Linear evaluation top-1 accuracy on ImageNet of students (EfficientNet-B0, ResNet-18, ResNet-34) with different dimensions of the projection-head hidden layer (512, 1024, 1280, 2048) during distillation; the marked point represents the original dimension used in MoCo-V2.

[2] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.

[3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, pages 132–149, 2018.

[4] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, pages 2959–2968, 2019.

[5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, pages 9912–9924, 2020.

[6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. 2021.

[7] Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. Cross-layer distillation with semantic calibration. 2020.

[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.

[9] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, pages 22243–22255, 2020.

[10] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. In CVPR, pages 9729–9738, 2020.

[11] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.

[12] Hao Cheng, Dongze Lian, Shenghua Gao, and Yanlin Geng. Evaluating capability of deep neural networks for image classification via information plane. In ECCV, pages 168–182, 2018.

[13] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.

[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[15] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. Volume 38, pages 1734–1747, 2015.

[16] Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. XCiT: Cross-covariance image transformers. 2021.

[17] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge. Volume 88, pages 303–338, 2010.


[18] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. SEED: Self-supervised distillation for visual representation. In ICLR, 2021.

[19] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, pages 21271–21284, 2020.

[20] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.

[21] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[23] Olivier Hénaff. Data-efficient image recognition with contrastive predictive coding. In ICML, pages 4182–4192, 2020.

[24] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPSW, 2015.

[25] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In ICCV, pages 1314–1324, 2019.

[26] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In CVPR, pages 1920–1929, 2019.

[27] Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.

[28] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? In CVPR, pages 2661–2671, 2019.

[29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Citeseer, 2009.

[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.

[31] Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. Knowledge distillation via instance relationship graph. In CVPR, pages 7096–7104, 2019.

[32] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84, 2016.

[33] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. 2018.

[34] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, pages 3967–3976, 2019.

[35] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.

[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Volume 39, pages 1137–1149, 2015.

[37] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2014.

[38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. Volume 115, pages 211–252, 2015.

[39] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. 2017.

[40] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114, 2019.

[41] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In ICLR, 2020.

[42] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. 2000.

[43] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.

[44] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Volume 9, 2008.

[45] Jinpeng Wang, Yuting Gao, Ke Li, Xinyang Jiang, Xiaowei Guo, Rongrong Ji, and Xing Sun. Enhancing unsupervised video representation learning by decoupling the scene and the motion. 2020.

[46] Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J. Ma, and Xing Sun. Removing the background by adding the background: Towards background robust self-supervised video representation learning. 2020.

[47] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[48] Guodong Xu, Ziwei Liu, Xiaoxiao Li, and Chen Change Loy. Knowledge distillation meets self-supervision. In ECCV, pages 588–604, 2020.

[49] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.

[50] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. 2021.


Page 2: 12 1 arXiv:2104.09124v3 [cs.CV] 9 Aug 2021

Among them methods [8 9 20 10 19] based on con-trastive learning are becoming the mainstream due to theirsuperior results These methods constantly refresh theresults with relatively large networks but are unsatisfy-ing on some lightweight models For example the num-ber of parameters of MobileNet-v3-LargeResNet-152 is52M574M [25 22] and the corresponding linear evalu-ation top-1 accuracy on ImageNet [38] using MoCo-V2 is362741 Compared to their fully supervised counter-parts 7527857 the results of MobileNet-v3-Large isfar from satisfying Meanwhile in real scenarios sometimesonly lightweight models can be deployed due to the lim-ited hardware resources Therefore improving the abilityof self-supervised representation learning on small modelsis of great significance

Knowledge distillation [24] is an effective way to trans-fer the knowledge learned by a large model (teacher) to asmall model (student) Recently some self-supervised rep-resentation learning methods use knowledge distillation toimprove the efficacy of small models SimCLR-V2 [9] useslogits in the fine-tuning stage to transfer the knowledge in atask-specific way CompRess [1] and SEED [18] mimic thesimilarity score distribution between a teacher and a studentmodel over a dynamically maintained queue Though distil-lation has been shown to be effective two factors affect theresult prominently ie which knowledge is most needed bythe student and how to deliver it In this work we proposenew insights towards these two aspects

In the current mainstream contrastive learning based SSLmethods a multi-layer perceptron (MLP) is added after theencoder to obtain a low-dimensional embedding Train-ing loss and the accuracy evaluation are both performed onsuch embedding We thus hypothesize that this final em-bedding contains the most fruitful knowledge and shouldbe regarded as the first choice for knowledge transferringTo achieve this we propose a simple yet effective DistilledContrastive Learning (DisCo) to transfer this knowledgefrom large models to lightweight models in the pre-trainingstage Specifically DisCo takes the MLP embedding ob-tained by the teacher as the knowledge and injects it into thestudent by constraining the corresponding embedding of thestudent to be consistent with that of the teacher using MSEloss In addition we find that a budgeted dimension of thehidden layer in the MLP of the student may cause a knowl-edge transmission bottleneck We term this phenomenonas Distilling Bottleneck and present to enlarge the embed-ding dimension to alleviate this problem This simple yeteffective operation relates to the capability of model gener-alization in the setting of self-supervised learning from theInformation BottleNeck [42] perspective It is worth notingthat our method only introduces a small number of addi-tional parameters in the pre-training phase but during thefine-tuning and deployment stage there is no extra compu-

tational burden since the MLP layer is removed

vprime

Zm Zprime s

MLP

Pretrained Frozen

Image

View

Representation Gradients

v

Zs Gradients Zt

Similarity

Contrastive Learning

Eprime m Es Eprime s

Zprime t

MLP

Et Eprime t

Consistency Regularization

Embedding

MLP

MeanStudent Student TeacherEncoder

x

Figure 2 The framework of the proposed method DisCo Oneimage is first transformed into two views by two drastic data aug-mentation operations In addition to the original constrastive SSLpart a self-supervised pre-trained teacher is introduced and thefinal embeddings obtained by the learnable student and the frozenteacher are required to be consistent for each view (Best viewedin color)

Experimental results demonstrate that DisCo can effec-tively transfer the knowledge in the teacher to the studentmaking the representations extracted by the student moregeneralized Our approach is simple and incorporate it intoexisting contrastive based SSL methods can bring signifi-cant gains

In summary our main contributions are twofold

bull We propose a simple yet effective self-supervised dis-tillation method to boost the representation abilities oflightweight models

bull We achieve state-of-the-art SSL results on lightweightmodels Particularly the linear evaluation results ofEfficientNet-B0 [40] on ImageNet is quite close toResNet-101ResNet-50 while the number of parame-ters of EfficientNet-B0 is only 94163 of ResNet-101ResNet-50

2. Related Work

2.1. Self-supervised Learning

Self-supervised learning (SSL) is a generic framework that learns high semantic patterns from the data itself without any tags from human beings. Current methods mainly rely on three paradigms, i.e., pretext tasks, contrastive based, and clustering based.

Pretext tasks. Approaches based on the pretext paradigm focus on designing more effective surrogate tasks, including Exemplar-CNN [15], which identifies whether patches are cropped from the same image; Rotation [27], which predicts the rotation degree of the input image; Jigsaw [32], which places the shuffled patches back to their original positions; and Context encoder [35], which recovers the missing part of the input image conditioned on its surrounding.

Contrastive based. Contrastive based approaches have shown impressive performance on self-supervised representation learning; they enforce different views of the same input to be closer in feature space [11, 9, 8, 23, 20, 10, 19, 45, 46, 50]. SimCLR [8, 9] indicates that self-supervised learning can be boosted by applying strong data augmentation, training with a larger batch size of negative samples, and adding a projection head (MLP) after the global average pooling. However, SimCLR relies on a very large batch size to achieve comparable performance and cannot be applied widely to many real-world scenarios. MoCo [20, 10] considers contrastive learning as a look-up dictionary, using a memory bank to maintain consistent representations of negative samples. Thus, MoCo can achieve superior performance without a large batch size, which is more feasible to implement. BYOL [19] introduces a predictor to one branch of the network to break the symmetry and avoid the trivial solution. DINO [6] applies contrastive learning to vision transformers.

Clustering based. Clustering is one of the most promising approaches for unsupervised representation learning. DeepCluster [3] uses k-means assignments to generate pseudo-labels to iteratively group the features and update the weights of the network. DeeperCluster [4] scales to large uncurated datasets to capture complementary statistics. Different from previous works that maximize the mutual information between pseudo labels and input data, SeLa [2] casts the pseudo-label assignment as an instance of optimal transport. SwAV [5] maps representations to prototype vectors, which are assigned online, and is capable of scaling to larger datasets.

Although the mainstream methods SimCLR-V2, MoCo-V2, BYOL, and SwAV belong to different self-supervised categories, they have four things in common: 1) two views for one image, 2) two encoders for feature extraction, 3) two projection heads that map the representations into a lower dimensional space, and 4) the two low-dimensional embeddings are regarded as a pair of positive samples, which can be considered a contrast process. However, all of these methods suffer a performance cliff fall on lightweight models, which is what we try to remedy in this work.

2.2. Knowledge Distillation

Knowledge distillation (KD) tries to transfer the knowledge from a larger teacher model to a smaller student model. According to the form of knowledge, it can be classified into three categories: logits-based, feature-based, and relation-based.

Logits-based. Logits refers to the output of the network classifier. KD [24] proposes to make the student mimic the logits of the teacher by minimizing the KL-divergence of the class distributions.

Feature-based. Feature-based methods directly transfer the knowledge from the intermediate layers of the teacher to the student. FitNets [37] regards the intermediate representations learned by the teacher as hints and transfers the knowledge to a thinner and deeper student by minimizing the mean square error between the representations. AT [49] proposes to use the spatial attention of the teacher as the knowledge and lets the student pay attention to the areas that the teacher is concerned about. SemCKD [7] adaptively selects the more appropriate representation pairs of the teacher and student.

Relation-based. Relation-based approaches explore the relationship between data instead of the output of a single instance. RKD [34] transfers the mutual relationship of the input data within one batch with distance-wise and angle-wise distillation losses from the teacher to the student. IRG [31] proposes to use a relationship graph to further express the relational knowledge.
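As a concrete reference for the logits-based category, the softened-KL objective of KD [24] can be written in a few lines; the sketch below is illustrative, and the temperature value is an assumed hyper-parameter rather than one taken from any cited paper.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Logits-based KD: KL divergence between softened class distributions.

    `temperature` is an illustrative choice; KD [24] treats it as a hyper-parameter.
    """
    # Soften both distributions with the same temperature.
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # KL(teacher || student); the T^2 factor keeps gradient magnitudes comparable.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```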

2.3. SSL meets KD

Recently, some works combine self-supervised learning and knowledge distillation. CRD [41] introduces a contrastive loss to transfer pair-wise relationships across different modalities. SSKD [48] lets the student mimic transformed data and self-supervision tasks to transfer richer knowledge. The above-mentioned works take self-supervision as an auxiliary task to further boost the process of knowledge distillation under a fully supervised setting. CompRess [1] and SEED [18] employ knowledge distillation as a means to improve the self-supervised visual representation learning capability of small models; they utilize the negative sample queue in MoCo [20] to constrain the distribution of the positive sample over negative samples of the student to be consistent with that of the teacher. However, CompRess and SEED heavily rely on the MoCo framework, which means that a memory bank always has to be preserved during the distillation process. Our method also aims to boost the self-supervised representation learning ability of lightweight models by distilling; however, we do not restrict the self-supervised framework and are thus more flexible. Furthermore, our method surpasses SEED by a large margin on all lightweight models under the same setting.


3. Method

In this section, we introduce the proposed Distilled Contrastive Learning (DisCo) on lightweight models. We first give some preliminaries on contrastive based SSL, then introduce the overall architecture of DisCo and how DisCo transfers the knowledge from the teacher to the student. Finally, we present how DisCo can be combined with the existing contrastive based SSL methods.

3.1. Preliminary on Contrastive Based SSL

In Figure 3, we show the framework diagrams of four mainstream contrastive based SSL methods; they have some common characteristics, as listed below.

Figure 3. Diagrams of four mainstream SSL methods: (a) SimCLR-V2, (b) MoCo-V2, (c) BYOL, (d) SwAV.

Two views: one input image x is transformed into two views v and v′ by two drastic data augmentation operations.

Two encoders: the two augmented views are input to two encoders of the same structure; one is a learnable base encoder s(·), and the other, m(·), is updated according to the base encoder, either shared or momentum updated. The encoder here can use any network architecture, such as the commonly used ResNet. For an input image, the extracted representation obtained after the encoder and global average pooling is denoted as Z, and its dimension is D.

Projection head: both encoders are followed by a small projection head p(·) that maps the representation Z to a low-dimensional embedding E; it consists of several linear layers. This procedure can be formulated as E = p(Z) = W^(n) σ(· · · σ(W^(1) Z)), where W^(i) is the weight parameter of the i-th linear layer, n is the number of layers (greater than or equal to 1), and σ is the non-linear ReLU function. The importance of the projection head has been addressed in SimCLR-V2 [9] and MoCo-V2 [10]. Following the setting of MoCo-V2, we define the default configuration of the projection head as two linear layers with dimensions D and 128.
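A minimal sketch of such a two-layer projection head is given below; the class name and the configurable hidden dimension are our own illustrative choices, not the authors' exact implementation.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Two-layer MLP p(.) mapping a D-dim representation Z to a 128-dim embedding E.

    Following the MoCo-V2 style head described in the text:
    Linear(D, hidden) -> ReLU -> Linear(hidden, 128). With hidden = D this is the
    default configuration (D -> D -> 128).
    """
    def __init__(self, in_dim, hidden_dim=None, out_dim=128):
        super().__init__()
        hidden_dim = hidden_dim or in_dim  # default: D -> D -> 128
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, z):
        return self.mlp(z)
```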

Loss function: after obtaining the final embeddings of these two views, they are regarded as a pair of positive samples to calculate the loss.
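For reference, one common instantiation of this positive-pair loss is the InfoNCE objective used by MoCo-style methods over a queue of negatives; the sketch below is illustrative (function, argument names, and the temperature value are our own assumptions), not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, key, queue, temperature=0.2):
    """InfoNCE over one positive pair (query, key) and a queue of negatives.

    query: (N, 128) embeddings from the base encoder.
    key:   (N, 128) embeddings of the other view, treated as positives (no gradient).
    queue: (K, 128) negative embeddings, e.g. a MoCo-style memory bank.
    """
    query = F.normalize(query, dim=1)
    key = F.normalize(key, dim=1).detach()
    negatives = F.normalize(queue, dim=1).detach()

    l_pos = torch.einsum("nc,nc->n", query, key).unsqueeze(-1)  # (N, 1)
    l_neg = torch.einsum("nc,kc->nk", query, negatives)         # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive logit sits at index 0 for every sample.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```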

3.2. Overall Architecture

The framework of the proposed DisCo is shown in Figure 2, which consists of three encoders, each followed by a projection head. The first encoder s(·) is the student that we want to learn, the second is the other encoder m(·) in the mainstream self-supervised method, and the third is the pre-trained large teacher t(·).

For each input image x, it is first transformed into two views v and v′ by two drastic data augmentation operations. On the one hand, v is input to s(·) and t(·), generating two representations Z_s = s(v) and Z_t = t(v); after the projection head, these two representations are mapped to low-dimensional embeddings E_s = p_s(Z_s) and E_t = p_t(Z_t), respectively. On the other hand, v′ is input to s(·), m(·), and t(·) simultaneously; after encoding and projecting, three low-dimensional vectors E′_s = p_s(s(v′)), E′_m = p_m(m(v′)), and E′_t = p_t(t(v′)) are obtained. E′_m and E_s are the embeddings of two different views, which are regarded as a pair of positive samples and are pulled together in the existing SSL methods. (E_s, E_t) and (E′_s, E′_t) are two pairs of embeddings of the student and the teacher of the same view, and each pair is constrained to be consistent during the distilling procedure, which will be introduced in detail in Section 3.3.
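To make the notation concrete, the sketch below shows how the five embeddings could be produced in one forward pass; all function and argument names are ours, and the encoders and heads are assumed to be defined elsewhere (e.g. with the ProjectionHead sketch above).

```python
import torch

@torch.no_grad()
def teacher_embed(teacher, teacher_head, view):
    # The teacher is frozen during distillation, so no gradients are tracked.
    return teacher_head(teacher(view))

def disco_forward(student, student_head, mean_encoder, mean_head,
                  teacher, teacher_head, v, v_prime):
    """Produce the embeddings DisCo uses for one image and its two views.

    Returns (E_s, E_t) for view v, (E'_s, E'_t) for view v', and E'_m from the
    momentum/mean branch, which forms the usual SSL positive pair with E_s.
    """
    e_s = student_head(student(v))               # E_s  = p_s(s(v))
    e_s_prime = student_head(student(v_prime))   # E'_s = p_s(s(v'))
    with torch.no_grad():
        e_m_prime = mean_head(mean_encoder(v_prime))  # E'_m = p_m(m(v'))
    e_t = teacher_embed(teacher, teacher_head, v)             # E_t  = p_t(t(v))
    e_t_prime = teacher_embed(teacher, teacher_head, v_prime)  # E'_t = p_t(t(v'))
    return e_s, e_t, e_s_prime, e_t_prime, e_m_prime
```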

3.3. Distilling Procedure

In most contrastive based SSL methods, the calculation of the loss function and the evaluation of accuracy are both performed at the final embedding vector E. Therefore, we hypothesize that the last embedding E contains the most fruitful knowledge and should be considered to be delivered first when distilling.

For a self-supervised pre-trained teacher model, we distill the knowledge in the last embedding into the student; that is, for view v and view v′, the embedding vectors output by the frozen teacher and the learnable student should be consistent. Specifically, we use a consistency regularization term to pull the embedding vector E_s closer to E_t, and E′_s closer to E′_t. Formally,

L_dis = ||E_s − E_t||_2 + ||E′_s − E′_t||_2.    (1)
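Eq. (1) maps directly to a few lines of code; the sketch below realizes the consistency term with an MSE criterion, matching the MSE constraint described earlier, and treats the norm-versus-squared-norm choice as an implementation detail.

```python
import torch.nn.functional as F

def disco_consistency_loss(e_s, e_t, e_s_prime, e_t_prime):
    """Consistency regularization of Eq. (1): pull the student's embeddings toward
    the frozen teacher's embeddings for both views.

    The teacher embeddings are detached, since the teacher is frozen anyway.
    """
    loss_v = F.mse_loss(e_s, e_t.detach())
    loss_v_prime = F.mse_loss(e_s_prime, e_t_prime.detach())
    return loss_v + loss_v_prime
```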

In order to verify that the knowledge contained in the embedding E is the most meaningful, we experimented with several other commonly used distillation schemes in Section 4.4. The experimental results prove that the knowledge we transmit and the way it is transferred are indeed more effective.

Distilling Bottleneck. In our distillation experiments, we found an interesting phenomenon. When the encoder of the student network is ResNet-18 or ResNet-34 and the default MLP configuration is adopted, that is, the dimension of the embedding output by the encoder is projected from D to D and then to 128, the results of DisCo are not satisfactory. We assume that this degradation is caused by the fact that the dimension of the hidden layer in the MLP is too small, and term this phenomenon the Distilling Bottleneck. In Figure 4, we exhibit the default configuration of the projection head of ResNet-18/34, EfficientNet-B0/B1, MobileNet-v3-Large, and ResNet-50/101/152. It can be seen that the dimension of the hidden layer of ResNet-18/34 is too small compared to the other networks.

Figure 4. Default MLP of multiple networks: ResNet-18/34 project 512 → 512 → 128, MobileNet-v3 and EfficientNet-B0/B1 project 1280 → 1280 → 128, and ResNet-50/101/152 project 2048 → 2048 → 128.

In order to alleviate the Distilling Bottleneck problem, we propose to expand the dimension of the hidden layer in the MLP. It is worth noting that this operation only introduces a small number of additional parameters at the self-supervised distillation stage, and the MLP is directly discarded during fine-tuning and deployment, which means no extra computational burden is brought. We experimentally verify that such a simple operation can bring significant gains in Section 4.7.
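Reusing the ProjectionHead sketch from Section 3.1, the remedy amounts to widening the hidden layer of the student's head; the concrete numbers below follow the ResNet-18 (D = 512) and 2048/128 configuration described in the text, and are only a sketch of the idea.

```python
# ProjectionHead is the illustrative class from the Section 3.1 sketch.

# Default student head for ResNet-18 (D = 512): hidden dim equals D.
default_head = ProjectionHead(in_dim=512)                    # 512 -> 512 -> 128

# DisCo's remedy for the Distilling Bottleneck: widen the hidden layer so it
# matches the teacher's (e.g. ResNet-50's 2048), keeping the 128-dim output.
widened_head = ProjectionHead(in_dim=512, hidden_dim=2048)   # 512 -> 2048 -> 128
```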

This operation can be further explained from the Information Bottleneck (IB) [42] perspective. IB is utilized in [39, 12] to understand how deep networks work by visualizing mutual information (I(X;T) and I(T;Y)) in the information plane, where I(X;T) is the mutual information between input and output and I(T;Y) is the mutual information between output and label. The training of deep networks can be described by two phases in the information plane: the first fitting phase, where the network memorizes the information of the input, resulting in the growth of I(X;T) and I(T;Y); and the subsequent compression phase, where the network removes irrelevant information of the input for better generalization, resulting in the decrease of I(X;T) (see Figure 7 in [12]). Generally, in the compression phase, I(X;T) can represent the model's capability of generalization, while I(T;Y) can represent the model's capability of fitting the label [12]. We visualize the compression phase of our model with different dimensions of the hidden layer in the pre-training distillation stage in the information plane, on one downstream transferring classification task. The results in Figure 7 show two interesting phenomena:

i. Models with different dimensions of the hidden layer have very similar I(T;Y), suggesting that the models have nearly equal capability of fitting the labels.

ii. The model with a larger dimension in the hidden layer has smaller I(X;T), suggesting a stronger capability of generalization.

These phenomena show that the MLP indeed relates to the capability of model generalization in the setting of self-supervised transfer learning.
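For readers who want to reproduce this kind of information-plane analysis, a rough way to estimate the two quantities is to discretize the embedding T, in the spirit of [39]; the binning scheme and function names below are our own illustrative assumptions, not the exact procedure used for Figure 7.

```python
import numpy as np

def discrete_entropy(codes):
    """Entropy (in nats) of discretized activation patterns (one row per sample)."""
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def information_plane_point(t, labels, n_bins=30):
    """Rough (I(X;T), I(T;Y)) estimate for embeddings t of shape (N, d) and integer labels.

    With a deterministic encoder, H(T|X) ~ 0 after binning, so I(X;T) ~ H(T_binned);
    I(T;Y) = H(T) - H(T|Y) is estimated by conditioning the same bins on the label.
    """
    bins = np.digitize(t, np.linspace(t.min(), t.max(), n_bins))
    h_t = discrete_entropy(bins)
    h_t_given_y = 0.0
    for y in np.unique(labels):
        mask = labels == y
        h_t_given_y += mask.mean() * discrete_entropy(bins[mask])
    return h_t, h_t - h_t_given_y   # (I(X;T), I(T;Y)) up to binning error
```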

3.4. Overall Objective Function

The overall objective function is defined as follows:

L = L_dis + λ L_co,    (2)

where L_dis comes from the distillation part, L_co can be the contrastive loss function of any SSL method, and λ is a hyper-parameter that controls the weights of the distillation loss and the contrastive loss. In our experiments, λ is set to 1. Due to the simplicity of the implementation, we use MoCo-V2 as the testbed in the experiments.
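Putting the pieces together, Eq. (2) is simply a weighted sum; the snippet below shows how it could be combined with the earlier sketches under the MoCo-V2 testbed (all names are ours, and the usage lines are left as comments since they depend on objects defined elsewhere).

```python
def disco_total_loss(contrastive_loss, distill_loss, lam=1.0):
    """Overall objective of Eq. (2): L = L_dis + lambda * L_co, with lambda = 1."""
    return distill_loss + lam * contrastive_loss

# Illustrative training step, assuming the encoders, heads, views (v, v_prime)
# and the negative queue are defined as in the earlier sketches:
#   e_s, e_t, e_s_p, e_t_p, e_m_p = disco_forward(...)
#   l_co  = info_nce_loss(e_s, e_m_p, queue)                 # contrastive part (MoCo-V2)
#   l_dis = disco_consistency_loss(e_s, e_t, e_s_p, e_t_p)   # distillation part
#   disco_total_loss(l_co, l_dis).backward()
```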

4. Experiments

4.1. Settings

Dataset. All the self-supervised pre-training experiments are conducted on ImageNet ILSVRC-2012 [38]. For downstream classification tasks, experiments are carried out on Cifar10 [29] and Cifar100 [29]. For downstream detection tasks, experiments are conducted on PASCAL VOC [17] and MS-COCO [30], with train+val/test and train2017/val2017 for training/testing, respectively. For downstream segmentation tasks, the proposed method is verified on MS-COCO.

Teacher Encoders. Six large encoders are used as teacher: ResNet-50 (22.4M), ResNet-101 (40.5M), ResNet-152 (55.4M), ResNet-50×2 (55.5M), ViT-small (22M) [14, 43], and XCiT-small (44.3M) [16], where X (Y) denotes that the encoder has Y million parameters, and Y does not take the linear layer into consideration.

Student Encoders. Five widely used small yet effective convolutional neural networks and two vision transformer encoders are used as student: EfficientNet-B0 (4.0M), MobileNet-v3-Large (4.2M), EfficientNet-B1 (6.4M), ResNet-18 (10.7M), ResNet-34 (20.4M), ViT-tiny (5M), and XCiT-tiny (26.1M).

Teacher Pre-training Setting. ResNet-50/101/152 are pre-trained using MoCo-V2 with default hyper-parameters. Following SEED, ResNet-50 and ResNet-101 are trained for 200 epochs and ResNet-152 is trained for 400 epochs. ResNet-50×2 is pre-trained by SwAV, which is an open-source model¹, and trained for 800 epochs.


Table 1. ImageNet test accuracy (%) using linear classification on different student architectures. ♦ denotes that the teacher/student models are pre-trained with MoCo-V2, which is our implementation, and † means the teacher is pre-trained by SwAV, which is an open-source model. When using R-50×2 as the teacher, SEED distills 800 epochs while DisCo distills 200 epochs. Numbers in parentheses represent the improvement compared to the MoCo-V2 baseline.

Method              | Teacher (T-1)  | Eff-b0 T-1 / T-5            | Eff-b1 T-1 / T-5            | Mob-v3 T-1 / T-5            | R-18 T-1 / T-5             | R-34 T-1 / T-5
Supervised          | -              | 77.1 / 93.3                 | 79.2 / 94.4                 | 75.2 / -                    | 72.1 / -                   | 75.0 / -
Self-supervised
MoCo-V2 (Baseline)♦ | -              | 46.8 / 72.2                 | 48.4 / 73.8                 | 36.2 / 62.1                 | 52.2 / 77.6                | 56.8 / 81.4
SSL Distillation
SEED [18]           | R-50 (67.4)    | 61.3 / 82.7                 | 61.4 / 83.1                 | 55.2 / 80.3                 | 57.6 / 81.8                | 58.5 / 82.6
DisCo (ours)        | R-50 (67.4)♦   | 66.5 (19.7↑) / 87.6 (15.4↑) | 66.6 (18.2↑) / 87.5 (13.7↑) | 64.4 (28.2↑) / 86.2 (24.1↑) | 60.6 (8.4↑) / 83.7 (6.1↑)  | 62.5 (5.7↑) / 85.4 (4.0↑)
SEED [18]           | R-101 (70.3)   | 63.0 / 83.8                 | 63.4 / 84.6                 | 59.9 / 83.5                 | 58.9 / 82.5                | 61.6 / 84.9
DisCo (ours)        | R-101 (69.1)♦  | 68.9 (22.1↑) / 88.9 (16.7↑) | 69.0 (20.6↑) / 89.1 (15.3↑) | 65.7 (29.5↑) / 86.7 (24.6↑) | 62.3 (10.1↑) / 85.1 (7.5↑) | 64.4 (7.6↑) / 86.5 (5.1↑)
SEED [18]           | R-152 (74.2)   | 65.3 / 86.0                 | 67.3 / 86.9                 | 61.4 / 84.6                 | 59.5 / 83.3                | 62.7 / 85.8
DisCo (ours)        | R-152 (74.1)♦  | 67.8 (21.0↑) / 87.0 (14.8↑) | 73.1 (24.7↑) / 91.2 (17.4↑) | 63.7 (27.5↑) / 84.9 (22.8↑) | 65.5 (13.3↑) / 86.7 (9.1↑) | 68.1 (11.3↑) / 88.6 (7.2↑)
SEED [18]           | R-50×2 (77.3†) | 67.6 / 87.4                 | 68.0 / 87.6                 | 68.2 / 88.2                 | 63.0 / 84.9                | 65.7 / 86.8
DisCo (ours)        | R-50×2 (77.3)† | 69.1 (22.3↑) / 88.9 (17.7↑) | 64.0 (15.6↑) / 84.6 (10.8↑) | 58.9 (22.7↑) / 81.4 (19.3↑) | 65.2 (13.0↑) / 86.8 (9.2↑) | 67.6 (10.8↑) / 88.6 (7.2↑)

Self-supervised Distillation Setting. The projection head of all the student networks has two linear layers with dimensions 2048 and 128. The configuration of the learning rate and optimizer is the same as MoCo-V2, and unless stated otherwise, the model is trained for 200 epochs. Furthermore, during the distillation stage, the teacher is frozen.

Student Fine-tuning Setting. For linear evaluation on ImageNet, the student is fine-tuned for 100 epochs. The initial learning rate is 3 for EfficientNet-B0/EfficientNet-B1/MobileNet-v3-Large and 30 for ResNet-18/ResNet-34. For linear evaluation on Cifar10 and Cifar100, the initial learning rate is 3 and all the models are fine-tuned for 100 epochs. SGD is adopted as the optimizer, and the learning rate is decreased by 10 at 60 and 80 epochs for both linear evaluations. For downstream detection and segmentation tasks, following [20, 10, 18], all parameters are fine-tuned. For the detection task on VOC, the initial learning rate is 0.1 with 200 warm-up iterations and decays by 10 at 18k and 22.2k steps. The detector is trained for 48k steps with a batch size of 32 on 8 V100 GPUs. Following [18], the scales of images are randomly sampled from [400, 800] during training and fixed to 800 at inference. For the detection and instance segmentation tasks on COCO, the model is trained for 180k iterations with an initial learning rate of 0.11, and the scales of images are randomly sampled from [600, 800] during training.

¹ https://github.com/facebookresearch/swav
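As a concrete illustration of the linear-evaluation protocol described above, the sketch below freezes the pre-trained encoder and trains only a linear classifier with SGD, dropping the learning rate by 10× at epochs 60 and 80; the momentum and weight-decay values are our own assumptions, and the function name is ours.

```python
import torch
import torch.nn as nn

def build_linear_eval(backbone, feat_dim, num_classes=1000, base_lr=30.0):
    """Linear evaluation: freeze the pre-trained encoder, train only a linear head.

    base_lr = 30 corresponds to the ResNet-18/34 setting in the text (3 for the
    EfficientNet/MobileNet students); 100 epochs with x0.1 decay at epochs 60 and 80.
    Momentum and weight decay are assumed values, not taken from the paper.
    """
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()

    classifier = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=0.0)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[60, 80], gamma=0.1)
    return classifier, optimizer, scheduler
```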

4.2. Linear Evaluation

We conduct linear evaluation on ImageNet to validate the effectiveness of our method. As shown in Table 1, student models distilled by DisCo outperform the counterparts pre-trained by MoCo-V2 (Baseline) by a large margin. Besides, DisCo surpasses the state-of-the-art SEED over various student models with teacher ResNet-50/101/152 under the same setting, especially on MobileNet-v3-Large distilled by ResNet-50, with a difference of 9.2% at top-1 accuracy. When using R-50×2 as the teacher, SEED distills 800 epochs while DisCo still distills 200 epochs, but the results of EfficientNet-B0, ResNet-18, and ResNet-34 using DisCo also exceed those of SEED. The performance on EfficientNet-B1 and MobileNet-v3-Large is closely related to the number of distillation epochs. For example, when EfficientNet-B1 is distilled for 290 epochs, the top-1 accuracy becomes 70.4%, which surpasses SEED, and when MobileNet-v3-Large is distilled for 340 epochs, the top-1 accuracy becomes 64%. We believe that when DisCo distills for 800 epochs, the results will be further improved. Moreover, since CompRess uses a better teacher, which is trained 600 epochs longer and distilled 400 epochs longer than SEED and ours, it is not fair to compare, thus we do not report its result in the table. In addition, when DisCo uses a larger model as the teacher, the student is further improved. For instance, using ResNet-152 instead of ResNet-50 as the teacher, ResNet-34 is improved from 62.5% to 68.1%. It is worth noting that when using ResNet-101/ResNet-50 as the teacher, the linear evaluation result of EfficientNet-B0 is very close to the teacher, 68.9% vs. 69.1% and 66.5% vs. 67.4%, while the number of parameters of EfficientNet-B0 is only 9.4%/16.3% of ResNet-101/ResNet-50.


Figure 5. ImageNet top-1 accuracy (%) of semi-supervised linear evaluation with 1%, 10%, and 100% training data, for (a) EfficientNet-B0, (b) MobileNet-v3-Large, and (c) ResNet-18, plotted against the number of parameters of the teacher (millions). Points where the number of teacher network parameters is 0 are the results of the MoCo-V2 baseline without distillation.

Figure 6. Top-1 accuracy of students (ResNet-18 and EfficientNet-B0) transferred to (a) Cifar10 and (b) Cifar100, without and with distillation from different teachers (R-50, R-101, R-152), comparing MoCo-V2, SEED, and ours.

4.3. Semi-supervised Linear Evaluation

Following previous works [26, 28, 33], we evaluate our method under the semi-supervised setting. Two subsets sampled from 1% and 10% of the ImageNet training data (~12.8 and ~128 images per class, respectively) [8] are used for fine-tuning the student models. As shown in Figure 5, student models distilled by DisCo outperform the baseline under any amount of labeled data. Furthermore, DisCo also shows consistency under different fractions of annotations, that is, students always benefit from larger models as teachers. More labels are helpful to improve the final performance of the student model, which is expected.

4.4. Comparison against other Distillation

In order to verify the effectiveness of the proposed method, we compare with three widely used distillation schemes, namely 1) attention transfer, denoted by AT [49], 2) relational knowledge distillation, denoted by RKD [34], and 3) knowledge distillation, denoted by KD [24]. AT and RKD are feature-based and relation-based, respectively, and can be utilized during the self-supervised pre-training stage. KD is a logits-based method, which can only be used at the supervised fine-tuning stage. The comparison results are shown in Table 2. Single-Knowledge means using one of these approaches individually, and it can be seen that all distillation approaches bring improvement over the baseline, but the gain from DisCo is the most significant, which indicates that the knowledge DisCo chooses to transfer and the way of transmission are indeed more effective. We also try to transfer multi-knowledge from teacher to student by combining DisCo with other distillation schemes. It can be seen that integrating DisCo with AT/RKD/KD boosts the performance a lot, which further proves the effectiveness of DisCo.

Table 2. Linear evaluation top-1 accuracy (%) on ImageNet compared with different distillation methods.

Method          | Eff-b0 | Eff-b1 | Mob-v3 | R-18 | R-34
Baseline
MoCo-V2 [11]    | 46.8   | 48.4   | 36.2   | 52.2 | 56.8
Single-Knowledge
AT [49]         | 57.1   | 58.2   | 51.0   | 56.2 | 60.2
RKD [34]        | 48.3   | 50.3   | 36.9   | 56.4 | 58.7
KD [24]         | 46.5   | 48.5   | 37.3   | 51.5 | 58.8
DisCo (ours)    | 66.5   | 66.6   | 64.4   | 60.6 | 62.5
Multi-Knowledge
AT + DisCo      | 66.7   | 66.3   | 64.1   | 60.2 | 62.3
RKD + DisCo     | 66.8   | 66.5   | 64.4   | 60.6 | 62.3
KD + DisCo      | 65.8   | 65.9   | 65.2   | 60.6 | 65.9

4.5. Transfer to Cifar10/Cifar100

In order to analyze the generalization of the representations obtained by DisCo, we further conduct linear evaluation on Cifar10 and Cifar100, with ResNet-18/EfficientNet-B0 as student and ResNet-50/ResNet-101/ResNet-152 as teacher.


Table 3. Object detection and instance segmentation results with ResNet-34 as backbone; bounding-box AP (AP^bb) and mask AP (AP^mk) are evaluated on VOC07 test and COCO val2017. ‡ means our implementation. Numbers in parentheses represent the improvement compared to the MoCo-V2 baseline.

S    | T     | Method       | VOC det.: AP^bb / AP^bb_50 / AP^bb_75     | COCO det.: AP^bb / AP^bb_50 / AP^bb_75   | COCO seg.: AP^mk / AP^mk_50 / AP^mk_75
R-34 | ×     | MoCo-V2‡     | 53.6 / 79.1 / 58.7                        | 38.1 / 56.8 / 40.7                       | 33.0 / 53.2 / 35.3
R-34 | R-50  | SEED [18]    | 53.7 / 79.4 / 59.2                        | 38.4 / 57.0 / 41.0                       | 33.3 / 53.2 / 35.3
R-34 | R-50  | DisCo (ours) | 56.5 (2.9↑) / 80.6 (1.5↑) / 62.5 (3.8↑)   | 40.0 (1.9↑) / 59.1 (2.3↑) / 43.4 (2.7↑)  | 34.9 (1.9↑) / 56.3 (3.1↑) / 37.1 (1.8↑)
R-34 | R-101 | SEED [18]    | 54.1 / 79.8 / 59.1                        | 38.5 / 57.3 / 41.4                       | 33.6 / 54.1 / 35.6
R-34 | R-101 | DisCo (ours) | 56.1 (2.5↑) / 80.3 (1.2↑) / 61.8 (3.1↑)   | 40.0 (1.9↑) / 59.1 (2.3↑) / 43.2 (2.5↑)  | 34.7 (1.9↑) / 55.9 (2.7↑) / 37.4 (1.8↑)
R-34 | R-152 | SEED [18]    | 54.4 / 80.1 / 59.9                        | 38.4 / 57.0 / 41.0                       | 33.3 / 53.7 / 35.3
R-34 | R-152 | DisCo (ours) | 56.6 (3.0↑) / 80.8 (1.7↑) / 63.4 (5.7↑)   | 39.4 (1.3↑) / 58.7 (1.9↑) / 42.7 (2.0↑)  | 34.4 (1.4↑) / 55.4 (2.2↑) / 36.7 (1.4↑)

Since the image resolution of the Cifar datasets is 32×32, all images are resized to 224×224 with bicubic re-sampling before being fed into the model, following [11, 18]. The results are shown in Figure 6. It can be seen that the proposed DisCo surpasses the MoCo-V2 baseline by a large margin with different student and teacher architectures on both Cifar10 and Cifar100. In addition, our method also shows a significant improvement compared to the state-of-the-art approach SEED. It is worth noting that as the teacher becomes better, the improvement brought by DisCo becomes more obvious.
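The 32×32 → 224×224 bicubic resizing mentioned above can be expressed as a standard torchvision transform; the normalization statistics below are the usual ImageNet values, assumed here for illustration rather than taken from the paper.

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# Cifar images are 32x32; they are resized to 224x224 with bicubic re-sampling
# before being fed to the students. The mean/std values are the standard
# ImageNet statistics (an assumption, not stated in the paper).
cifar_eval_transform = transforms.Compose([
    transforms.Resize(224, interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```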

4.6. Transfer to Detection and Segmentation

We also conduct experiments on detection and segmentation tasks for generalization analysis. C4-based Faster R-CNN [36] is used for object detection on VOC, and Mask R-CNN [21] is used for object detection and instance segmentation on COCO. In this part of the experiments, following [10, 18], all the parameters of the student network are learnable, and the implementation is based on detectron2 [47] for convenience. The results of using ResNet-34 as the student with teacher ResNet-50/ResNet-101/ResNet-152 are shown in Table 3. On object detection, our method brings an obvious improvement on both the VOC and COCO datasets. Furthermore, as SEED [18] claimed, the improvement on COCO is relatively minor compared to VOC, since the COCO training dataset has 118k images while VOC has only 16.5k training images, thus the gain brought by weight initialization is relatively small. On the instance segmentation task, DisCo also shows superiority.

4.7. Distilling BottleNeck Analysis

In this section, we analyze the Distilling BottleNeck phenomenon; we use ResNet-50 as the teacher for simplicity.

Distilling BottleNeck Phenomenon. In the self-supervised distillation stage, we first tried to distill small models with the default MLP configuration of MoCo-V2 using DisCo, and the results are shown in Table 4, denoted by DisCo*. It is worth noting that the dimensions of the hidden layer in DisCo* are exactly the same as in SEED. It can be seen that, compared to SEED, DisCo* shows superior results on EfficientNet-B0, EfficientNet-B1, and MobileNet-v3-Large, and comparable results on ResNet-18 and ResNet-34. Then we expand the dimension of the hidden layer in the MLP of the student to be consistent with that of the teacher, that is, 2048-D; it can be seen that the results are further improved, as recorded in the third row. In particular, this expansion operation brings 3.5% and 3.6% gains for ResNet-18 and ResNet-34, respectively.

Table 4. Linear evaluation top-1 accuracy (%) on ImageNet. DisCo* means that the dimension of the hidden layer in the MLP is consistent with that of SEED.

Method     | Eff-b0      | Eff-b1      | Mob-v3      | R-18        | R-34
SEED [18]  | 61.3        | 61.4        | 55.2        | 57.6        | 58.5
DisCo*     | 65.6        | 65.8        | 63.8        | 57.1        | 58.9
DisCo      | 66.5 (0.9↑) | 66.6 (0.8↑) | 64.4 (0.6↑) | 60.6 (3.5↑) | 62.5 (3.6↑)

Theoretical Analysis from the IB perspective. In Figure 7, on the downstream Cifar10 classification task, we visualize the compression phase of ResNet-18 and ResNet-34 with different hidden dimensions, distilled by the same teacher, in the information plane. It can be seen that when we adjust the hidden dimension in the MLP of ResNet-18 and ResNet-34 from 512-D to 2048-D, the value of I(X;T) becomes smaller while I(T;Y) is basically unchanged, which suggests that enlarging the hidden dimension can make the student model more generalized in the setting of self-supervised transfer learning.

The effectiveness of the dimension. Here we further explore the impact of more dimensions, and the results are shown in Figure 9.


Figure 7. Mutual information paths from transition points to convergence points in the compression phase of training, for (a) ResNet-18 and (b) ResNet-34 with MLP hidden dimensions of 512-d and 2048-d. T denotes transition points and C (X%) denotes convergence points with X% top-1 accuracy on Cifar10 (ResNet-18: 85.8% vs. 89.1%; ResNet-34: 83.4% vs. 86.4%). Points with similar I(T;Y) but smaller I(X;T) are better generalized.

It can be seen that as the dimension increases, the top-1 accuracy also increases, but when the dimension is already large, the growth trend slows down. The performance trends of EfficientNet-B1 and MobileNet-v3-Large are close to that of EfficientNet-B0, so for the sake of clarity we do not exhibit them in the figure.

4.8. More SSL Methods

Variants of the testbed. In this part, in order to demonstrate the versatility of our method, we further try two other SSL methods as the testbed: SwAV and DINO.

For SwAV, the teacher is backboned by ResNet-50, and the results are shown in Table 5. It can be seen that for models with very few parameters, EfficientNet-B0 and MobileNet-v3-Large, the pre-training results with SwAV are also very poor. When DisCo is utilized, the efficacy is significantly improved.

Table 5. Linear evaluation top-1 accuracy (%) on ImageNet with SwAV as the testbed. The teacher of DisCo is an online open-source ResNet-50 model pre-trained using SwAV with top-1 accuracy 75.3%.

Method        | Eff-b0 | Mob-v3 | R-34
SwAV [5]      | 46.8   | 19.4   | 63.3
SwAV + DisCo  | 62.4   | 55.7   | 63.3

For DINO, we use two vision transformer architectures, ViT and XCiT, and the results are shown in Table 6. It can be seen that no matter what SSL method is adopted and what architecture is used in the network, DisCo brings significant gains. It is worth noting that XCiT-tiny has 26M parameters, which is much larger than ViT-tiny (5M), but DisCo can still narrow the gap with the teacher.

Variants of teacher pre-training method. In order to verify that our method is not picky about the pre-training approach that the teacher adopts, we use three ResNet-50 networks pre-trained with different SSL methods as the teacher under the testbed of MoCo-V2.

Table 6. Linear evaluation top-1 accuracy (%) on ImageNet with DINO as the testbed.

Teacher: Model (Acc)     | Student: ViT-tiny | Student: XCiT-tiny
-                        | 55                | 67
ViT-small [14, 43] (77)  | 68.4 (13.4↑)      | -
XCiT-small [16] (77.8)   | -                 | 71.1 (4.1↑)

Table 7. Linear evaluation top-1 accuracy (%) on ImageNet with variants of teacher pre-training methods. All the teachers are ResNet-50, and the first row is the student trained by MoCo-V2 directly without distillation, which is the baseline.

Teacher: Method (Acc) | Eff-b0 | Eff-b1 | Mob-v3 | R-18 | R-34
-                     | 46.8   | 48.4   | 36.2   | 52.2 | 56.8
MoCo-V2 [10] (67.4)   | 66.5   | 66.6   | 64.4   | 60.6 | 62.5
SeLa-V2 [2] (71.8)    | 62.2   | 68.2   | 66.2   | 64.1 | 65.3
SwAV [5] (75.3)       | 70.0   | 72.1   | 65.0   | 65.1 | 67.5

It can be observed from Table 7 that when using different pre-trained ResNet-50 models as teachers, DisCo significantly boosts all the results of the small models. Furthermore, as the teachers improve through different and stronger pre-training methods, the results of the student can be further improved.

4.9. Visualization Analysis

In Figure 8, we visualize the learned representations of EfficientNet-B0/ResNet-50 pre-trained by MoCo-V2 and of EfficientNet-B0 distilled by ResNet-50 using DisCo. For clarity, we randomly select 10 classes from the ImageNet test set and map the learned representations to a two-dimensional space by t-SNE [44]. It can be observed that ResNet-50 forms more separated clusters than EfficientNet-B0 when using MoCo-V2 alone, and after using ResNet-50 to teach EfficientNet-B0 with DisCo, EfficientNet-B0 performs very much like the teacher.
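A minimal sketch of this visualization step is given below: representations are collected from a frozen encoder for the sampled classes and projected to 2-D with scikit-learn's t-SNE. The function name and the data loader are our own assumptions.

```python
import numpy as np
import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def tsne_of_representations(encoder, loader, device="cuda"):
    """Collect encoder representations and project them to 2-D with t-SNE [44]."""
    encoder.eval().to(device)
    feats, labels = [], []
    for images, targets in loader:      # loader restricted to the 10 sampled classes
        z = encoder(images.to(device))  # (N, D) global-average-pooled features
        feats.append(z.cpu().numpy())
        labels.append(targets.numpy())
    feats = np.concatenate(feats)
    labels = np.concatenate(labels)
    points = TSNE(n_components=2, init="pca").fit_transform(feats)  # (N, 2)
    return points, labels
```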

5. Conclusion

In this paper, we propose Distilled Contrastive Learning (DisCo) to remedy self-supervised learning on lightweight models. The proposed method constrains the final embedding of the lightweight student to be consistent with that of the teacher to maximally transmit the teacher's knowledge. DisCo is not limited to specific contrastive learning methods and can bring the performance of the student very close to that of the teacher.

References

[1] Soroush Abbasi Koohpayegani, Ajinkya Tejankar, and Hamed Pirsiavash. CompRess: Self-supervised learning by compressing representations. In NeurIPS, pages 12980–12992, 2020. 2, 3


Figure 8. Clustering results on the ImageNet test set: (a) Eff-b0 (MoCo-V2), (b) R-50 (MoCo-V2), (c) Eff-b0 (DisCo, distilled by ResNet-50). Different colors represent different classes.

Figure 9. Linear evaluation top-1 accuracy on ImageNet of the student projection head with different dimensions of the hidden layer during distillation, for EfficientNet-B0, ResNet-18, and ResNet-34; the first point represents the original dimension used in MoCo-V2.

[2] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020. 3, 9
[3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, pages 132–149, 2018. 3
[4] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, pages 2959–2968, 2019. 3
[5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, pages 9912–9924, 2020. 3, 9
[6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. 2021. 3
[7] Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. Cross-layer distillation with semantic calibration. 2020. 3
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020. 1, 2, 3, 7
[9] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, pages 22243–22255, 2020. 1, 2, 3, 4
[10] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. In CVPR, pages 9729–9738, 2020. 1, 2, 3, 4, 6, 8, 9
[11] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020. 3, 7, 8
[12] Hao Cheng, Dongze Lian, Shenghua Gao, and Yanlin Geng. Evaluating capability of deep neural networks for image classification via information plane. In ECCV, pages 168–182, 2018. 5
[13] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015. 1
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 5, 9
[15] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. volume 38, pages 1734–1747, 2015. 3
[16] Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. XCiT: Cross-covariance image transformers. 2021. 5, 9
[17] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes challenge. volume 88, pages 303–338, 2010. 5
[18] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. SEED: Self-supervised distillation for visual representation. In ICLR, 2021. 2, 3, 6, 8
[19] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, pages 21271–21284, 2020. 1, 2, 3
[20] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020. 1, 2, 3, 6
[21] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017. 8
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016. 2
[23] Olivier Hénaff. Data-efficient image recognition with contrastive predictive coding. In ICML, pages 4182–4192, 2020. 3
[24] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPSW, 2015. 2, 3, 7
[25] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In ICCV, pages 1314–1324, 2019. 2
[26] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In CVPR, pages 1920–1929, 2019. 7
[27] Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In ICLR, 2018. 1, 3
[28] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? In CVPR, pages 2661–2671, 2019. 7
[29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Citeseer, 2009. 5
[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014. 5
[31] Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. Knowledge distillation via instance relationship graph. In CVPR, pages 7096–7104, 2019. 3
[32] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84, 2016. 1, 3
[33] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. 2018. 7
[34] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, pages 3967–3976, 2019. 3, 7
[35] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016. 1, 3
[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. volume 39, pages 1137–1149, 2015. 8
[37] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2014. 3
[38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. volume 115, pages 211–252, 2015. 2, 5
[39] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. 2017. 5
[40] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114, 2019. 2
[41] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In ICLR, 2020. 3
[42] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. 2000. 2, 5
[43] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021. 5, 9
[44] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. volume 9, 2008. 9
[45] Jinpeng Wang, Yuting Gao, Ke Li, Xinyang Jiang, Xiaowei Guo, Rongrong Ji, and Xing Sun. Enhancing unsupervised video representation learning by decoupling the scene and the motion. 2020. 3
[46] Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J. Ma, and Xing Sun. Removing the background by adding the background: Towards background robust self-supervised video representation learning. 2020. 3
[47] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019. 8
[48] Guodong Xu, Ziwei Liu, Xiaoxiao Li, and Chen Change Loy. Knowledge distillation meets self-supervision. In ECCV, pages 588–604, 2020. 3
[49] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017. 3, 7
[50] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. 2021. 3

11

Page 3: 12 1 arXiv:2104.09124v3 [cs.CV] 9 Aug 2021

rely on three paradigms ie pretext tasks contrastive basedand clustering based

Pretext tasks Approaches based on pretext paradigm fo-cus on designing more effective surrogate tasks includ-ing Exampler-CNN [15] that identifies whether patches arecropped from the same image Rotation [27] that predictsthe rotation degree of the input image Jigsaw [32] thatplaces the shuffled patches back to the original position andContext encoder [35] that recovers the missing part of theinput image conditioned on its surrounding

Contrastive based Contrastive based approaches haveshown impressive performance on self-supervised represen-tation learning which enforce different views of the sameinput to be closer in feature space [11 9 8 23 20 10 1945 46 50] SimCLR [8 9] indicates that self-supervisedlearning can be boosted by applying strong data augmen-tation training with larger batch size of negative samplesand adding projection head (MLP) after the global aver-age pooling However SimCLR relies on very large batchsize to achieve comparable performance and cannot be ap-plied widely to many real-world scenarios MoCo [20 10]considers contrastive learning as a look-up dictionary us-ing a memory bank to maintain consistent representationsof negative samples Thus MoCo can achieve superior per-formance without large batch size which is more feasible toimplement BYOL [19] introduces a predictor to one branchof the network to break the symmetry and avoid the trivialsolution DINO [6] applies contrastive learning to visiontransformers

Clustering based Clustering is one of the most promisingapproaches for unsupervised representation learning Deep-Cluster [3] uses k-means assignments to generate pseudo-labels to iteratively group the features and update the weightof the network DeeperCluster [4] scales to large uncu-rated datasets to capture complementary statistics Differ-ent from previous works to maximize the mutual informa-tion between pseudo labels and input data SeLa [2] cast thepseudo-label assignment as an instance of optimal transportSwAV [5] formulates to map representations to prototypevectors which is assigned online and is capable to scale tolarger datasets

Although the mainstream methods SimCLR-V2 MoCo-V2 BYOL and SwAV belong to different self-supervisedcategories they have four things in common 1) two viewsfor one image 2) two encoders for feature extraction 3)two projection heads to map the representations into a lowerdimension space and 4) the two low-dimensional embed-dings are regarded to be a pair of positive samples whichcan be considered as a contrast process However all ofthese methods suffer a performance cliff fall on lightweightmodels which is what we try to remedy in this work

22 Knowledge Distillation

Knowledge distillation (KD) tries to transfer the knowl-edge from a larger teacher model to a smaller student modelAccording to the form of knowledge it can be classified intothree categories logits-based feature-based and relation-basedLogits-based Logits refers to the output of the networkclassifier KD [24] proposes to make the student mimic thelogits of the teacher by minimizing the KL-divergence ofthe class distributionFeature-based Feature-based methods directly transfer theknowledge from the intermediate layers of the teacher tostudent FitNets [37] regards the intermediate representa-tions learned by the teacher as hints and transfers the knowl-edge to a thinner and deeper student through minimizingthe mean square error between the representations AT[49] proposes to use the spatial attention of the teacher asthe knowledge and let the student pay attention to the areathat the teacher is concerned about SemCKD [7] adap-tively selects the more appropriate representation pairs ofthe teacher and studentRelation-based Relation-based approaches explore the re-lationship between data instead of the output of a single in-stance RKD [34] transfers the mutual relationship of theinput data within one batch with distance-wise and angle-wise distillation loss from the teacher to the student IRG[31] proposes to use the relationship the graph to furtherexpress the relational knowledge

23 SSL meets KD

Recently some works combine self-supervised learn-ing and knowledge distillation CRD [41] introduces acontrastive loss to transfer pair-wise relationship acrossdifferent modalities SSKD [48] lets the student mimictransformed data and self-supervision tasks to transferricher knowledge The above-mentioned works take self-supervision as an auxiliary task to further boost the pro-cess of knowledge distillation under fully supervised set-ting CompRess[1] and SEED [18] tried to employ knowl-edge distillation as a means to improve the self-supervisedvisual representation learning capability of small modelswhich utilize the negative sample queue in MoCo [20] toconstrain the distribution of positive sample over negativesamples of the student to be consistent with that of theteacher However CompRess and SEED heavily rely onMoCo framework which means that a memory bank al-ways has to be preserved during the distillation processOur method also aims to boost the self-supervised represen-tation learning ability on lightweight models by distillinghowever we do not restrict the self-supervised frameworkand are thus more flexible Furthermore our method sur-pass SEED with a large margin on all lightweight modelsunder the same setting

3

3 MethodIn this section we introduce the proposed Distilled Con-

trastive Learning (DisCo) on lightweight models We firstgive some preliminaries on contrastive based SSL and thenintroduce the overall architecture of DisCo and how DisCotransfers the knowledge from the teacher to the student Fi-nally we present how DisCo can be combined with the ex-isting contrastive based SSL methods

31 Preliminary on Contrastive Based SSL

In Figure 3 we show the framework diagrams of fourmainstream contrastive based SSL methods they have somecommon characteristics as listed below

Z

ImageView

Representation

Similarity

Contrastive Loss

E Eprime Embedding

Share

Gradients

vprime vx

MLP

Momentum update

MemoryMLP

Similarity

Contrastive Loss

E Eprime

vprime vx

Momentum update

Z Zprime Gradients

L2 LossE Eprime

vprime vx

MLP MLP

Share

Z Zprime Gradients

E Eprime

vprime vx

MLP

Gradients

ImageView

Representation

Embedding

(a) SimCLR-V2 (b) MoCo-V2

(c) BYOL (d) SwAV

Prototype

Fitness

Cross-Entropy Loss

Encoder Encoder

MLP MLP

Encoder Encoder

Encoder Encoder Encoder Encoder

MLP

MLP

ZZprime Gradients Zprime Gradients

Figure 3 Diagrams of four mainstream SSL methods

Two views one input image x is transformed into twoviews v and v

primeby two drastic data augmentation operations

Two encoders two augmented views are input to twoencoders of the same structure one is a learnable base en-coder s(middot) and the other m(middot) is updated according to thebase encoder either shared or momentum updated The en-coder here can use any network architecture such as thecommonly used ResNet For an input image the extractedrepresentation obtained after the encoder and global averagepooling is denoted as Z and its dimension is D

Projection head both the encoders are followed by asmall projection head p(middot) that maps the representation Zto a low-dimensional embedding E which contains sev-eral linear layers This procedure can be formulated asE = p(Z) =W(n) middot middot middot (σ(W(1)Z)) where W is the weightparameter of the linear layer n is the number of layers

which is greater than or equal to 1 and σ is the non-linearfunction ReLU The importance the of projection head hasbeen addressed in SimCLR-V2 [9] and MoCo-V2 [10] Fol-lowing the setting of MoCo-V2 we define the default con-figuration of the projection head as two linear layers withthe dimension being D and 128

Loss function after obtaining the final embeddings ofthese two views they are regarded as a pair of positive sam-ples to calculate the loss

32 Overall Architecture

The framework of the proposed DisCo is shown in Fig-ure 2 which consists of three encoders following the pro-jection head The first encoder s(middot) is the student that wewant to learn the second is the other encoder m(middot) in themainstream self-supervised method and the third is the pre-trained large teacher t(middot)

For each input image x it is first transformed into twoviews v and vprime by two drastic data augmentation opera-tions On the one hand v is input to s(middot) and t(middot) gen-erating two representations Zs = s(v) Zt = t(v) thenafter the projection head these two representations aremapped to low-dimensional embeddings Es = ps(Zs)Et = pt(Zt) respectively On the other hand vprime is inputto s(middot) m(middot) and t(middot) simultaneously after encoding andprojecting three low-dimensional vectors E

prime

s = ps(s(vprime))

Eprime

m = pm(m(vprime)) and Eprime

t = pt(t(vprime)) are obtained

Eprime

m and Es are the embeddings of two different viewswhich are regarded as a pair of positive samples and arepulled together in the existing SSL methods Es and EtE

prime

s and Eprime

t are two pairs of embeddings of the student andthe teacher of the same view and each pair is constrained tobe consistent during the distilling procedure which will beintroduced in detail in section 33

33 Distilling Procedure

In most contrastive based SSL methods the calculationof loss function and the evaluation of accuracy are both per-formed at the final embedding vector E Therefore we hy-pothesize that the last embedding E contains the most fruit-ful knowledge and should be considered to be delivered firstwhen distilling

For a self-supervised pre-trained teacher model we dis-till the knowledge in the last embedding into the studentthat is for view v and view vprime the embedding vector outputby the frozen teacher and the learnable student should beconsistent Specifically we use a consistency regularizationterm to pull the embedding vector Es closer to Et and E

prime

s

closer to Eprime

t Formally

Ldis = ||Es minus Et||2 + ||Eprime

s minus Eprime

t||2

(1)

In order to verify that the knowledge contained in theembedding E is the most meaningful we experimented

4

with several other commonly used distillation schemes insection 44 The experimental results prove that the knowl-edge we transmitted and the way it is transferred are indeedmore effectiveDistilling Bottleneck In our distillation experiment wefound an interesting phenomenon When the encoder ofthe student network is ResNet-18 or ResNet-34 and the de-fault MLP configuration is adopted that is the dimensionof embedding output by the encoder is projected from D toD and then to 128 the results of DisCo is not satisfactoryWe assume that this degradation is caused by the fact thatthe dimension of the hidden layer in the MLP is too smalland term this phenomenon as Distilling Bottleneck In Fig-ure 4 we exhibit the default configuration of the projectionhead of ResNet-1834 EfficientNet-B0B1 MobileNet-v3-Large and ResNet-50101152 It can be seen that the di-mension of the hidden layer of ResNet-1834 is too smallcompared to other networks

512

ResNet-1834 ResNet-50101152

1280

MLP512

128

1280

128

2048

2048

128

Mob-v3 Eff-b0b1

Figure 4 Default MLP of multiple networks

In order to alleviate the Distilling Bottleneck problemwe present to expand the dimension of the hidden layerin MLP It is worth noting that this operation only intro-duces a small number of additional parameters at the self-supervised distillation stage and the MLP will be directlydiscarded during fine-tuning and deployment which meansno extra computational burden is brought We experimen-tally verified that such a simple operation can bring signifi-cant gains in section 47

This operation can be further explained from the Infor-mation Bottleneck (IB) [42] perspective IB is utilized in[39 12] to understand how deep networks work by visu-alizing mutual information (I(XT ) and I(T Y )) in theinformation plane where I(XT ) is the mutual informa-tion between input and output and I(T Y ) is the mutualinformation between output and label The training of deepnetworks can be described by two-phases in the informationplane the first fitting phase where the network memorizesthe information of input resulting in the growth of I(XT )and I(T Y ) the subsequent compression phase where thenetwork removes irrelevant information of input for bettergeneralization resulting in the decrease of I(XT ) (SeeFigure 7 in [12]) Generally in the compression phaseI(XT ) can present the modelrsquos capability of generaliza-tion while I(T Y ) can present the modelrsquos capability of fit-

ting the label [12] We visualize the compression phase ofour model with different dimensions of the hidden layer inthe pre-training distillation stage in the information plane onone downstream transferring classification task The resultsin Figure 7 shows two interesting phenomenons

i Models with different dimensions of the hidden layerhave very similar I(T Y ) suggesting that models have thenearly equal capability of fitting the labels

ii The Model with larger dimension in the hidden layerhas smaller I(XT ) suggesting a stronger capability ofgeneralization

These phenomenons show that MLP indeed relates tothe capability of model generalization in the setting of self-supervised transfer learning

34 Overall Objective Function

The overall objective function is defined as follows

L = Ldis + λLco (2)

where Ldis comes from distillation part Lco can be thecontrastive loss function of any SSL method and λ is ahyper-parameter that controls the weights of the distillationloss and contrastive loss In our experiments λ is set to 1Due to the simplicity of the implementation we use MoCo-V2 as the testbed in the experiments

4 Experiments41 Settings

Dataset All the self-supervised pre-training experimentsare conducted on ImageNet ILSVRC-2012 [38] Fordownstream classification tasks experiments are carriedout on Cifar10 [29] and Cifar100 [29] For downstreamdetection tasks experiments are conducted on PASCALVOC [17] and MS-COCO [30] with train+valtest andtrain2017val2017 for trainingtesting respectively Fordownstream segmentation tasks the proposed method isverified on MS-COCOTeacher Encoders Six large encoders are used asteacher ResNet-50 (224M) ResNet-101 (405M) ResNet-152 (554M) ResNet-502 (555M) ViT-small(22M)[1443] and XCiT-small(443M)[16] where X(Y) denotes thatthe encoder has Y millions of parameters and the Y doesnot take linear layer into considerationStudent Encoders Five widely used small yet effectiveconvolution neural networks and two vision transformerencoders are used as student EfficientNet-B0 (40M)MobileNet-v3-Large (42M) EfficientNet-B1 (64M) andResNet-18 (107M) ResNet-34 (204M) ViT-tiny(5M) andXCiT-tiny(261M)Teacher Pre-training Setting ResNet-50101152 arepre-trained using MoCo-V2 with default hyper-parametersFollowing SEED ResNet-50 and ResNet-101 are trained

5

Table 1 ImageNet test accuracy () using linear classification on different student architectures diams denotes the teacherstudent modelsare pre-trained with MoCo-V2 which is our implementation and daggermeans the teacher is pre-trained by SwAV which is an open-sourcemodel When using R502 as the teacher SEED distills 800 epochs while DisCo distills 200 epochs Subscript in green represents theimprovement compared to MoCo-V2 baseline

Method | Teacher (T-1) | Eff-b0 (T-1 / T-5) | Eff-b1 (T-1 / T-5) | Mob-v3 (T-1 / T-5) | R-18 (T-1 / T-5) | R-34 (T-1 / T-5)
Supervised | - | 77.1 / 93.3 | 79.2 / 94.4 | 75.2 / - | 72.1 / - | 75.0 / -
Self-supervised:
MoCo-V2 (Baseline)♦ | - | 46.8 / 72.2 | 48.4 / 73.8 | 36.2 / 62.1 | 52.2 / 77.6 | 56.8 / 81.4
SSL Distillation:
SEED [18] | R-50 (67.4) | 61.3 / 82.7 | 61.4 / 83.1 | 55.2 / 80.3 | 57.6 / 81.8 | 58.5 / 82.6
DisCo (ours) | R-50 (67.4)♦ | 66.5 (+19.7) / 87.6 (+15.4) | 66.6 (+18.2) / 87.5 (+13.7) | 64.4 (+28.2) / 86.2 (+24.1) | 60.6 (+8.4) / 83.7 (+6.1) | 62.5 (+5.7) / 85.4 (+4.0)
SEED [18] | R-101 (70.3) | 63.0 / 83.8 | 63.4 / 84.6 | 59.9 / 83.5 | 58.9 / 82.5 | 61.6 / 84.9
DisCo (ours) | R-101 (69.1)♦ | 68.9 (+22.1) / 88.9 (+16.7) | 69.0 (+20.6) / 89.1 (+15.3) | 65.7 (+29.5) / 86.7 (+24.6) | 62.3 (+10.1) / 85.1 (+7.5) | 64.4 (+7.6) / 86.5 (+5.1)
SEED [18] | R-152 (74.2) | 65.3 / 86.0 | 67.3 / 86.9 | 61.4 / 84.6 | 59.5 / 83.3 | 62.7 / 85.8
DisCo (ours) | R-152 (74.1)♦ | 67.8 (+21.0) / 87.0 (+14.8) | 73.1 (+24.7) / 91.2 (+17.4) | 63.7 (+27.5) / 84.9 (+22.8) | 65.5 (+13.3) / 86.7 (+9.1) | 68.1 (+11.3) / 88.6 (+7.2)
SEED [18] | R-50×2 (77.3†) | 67.6 / 87.4 | 68.0 / 87.6 | 68.2 / 88.2 | 63.0 / 84.9 | 65.7 / 86.8
DisCo (ours) | R-50×2 (77.3)† | 69.1 (+22.3) / 88.9 (+17.7) | 64.0 (+15.6) / 84.6 (+10.8) | 58.9 (+22.7) / 81.4 (+19.3) | 65.2 (+13.0) / 86.8 (+9.2) | 67.6 (+10.8) / 88.6 (+7.2)

Self-supervised Distillation Setting. The projection head of all the student networks has two linear layers, with dimensions 2048 and 128. The configuration of the learning rate and optimizer is the same as in MoCo-V2 and, unless stated otherwise, the model is trained for 200 epochs. Furthermore, during the distillation stage the teacher is frozen.

Student Fine-tuning Setting. For linear evaluation on ImageNet, the student is fine-tuned for 100 epochs. The initial learning rate is 3 for EfficientNet-B0/EfficientNet-B1/MobileNet-v3-Large and 30 for ResNet-18/ResNet-34. For linear evaluation on Cifar10 and Cifar100, the initial learning rate is 3 and all the models are fine-tuned for 100 epochs. SGD is adopted as the optimizer, and the learning rate is decreased by 10 at 60 and 80 epochs for both linear evaluations. For downstream detection and segmentation tasks, following [20, 10, 18], all parameters are fine-tuned. For the detection task on VOC, the initial learning rate is 0.1 with 200 warm-up iterations and decays by 10 at 18k and 22.2k steps; the detector is trained for 48k steps with a batch size of 32 on 8 V100 GPUs. Following [18], the scales of images are randomly sampled from [400, 800] during training and set to 800 at inference. For the detection and instance segmentation tasks on COCO, the model is trained for 180k iterations with an initial learning rate of 0.11, and the scales of images are randomly sampled from [600, 800] during training.

¹ https://github.com/facebookresearch/swav
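As a reference for the linear-evaluation schedule described above, a minimal PyTorch sketch follows. The checkpoint path, SGD momentum value, and data loader are assumptions; only the frozen-feature linear probe with the step decay at epochs 60 and 80 is shown.

```python
import torch
import torch.nn as nn
import torchvision

# Freeze the pre-trained student encoder and train only a linear classifier.
backbone = torchvision.models.resnet18(weights=None)
backbone.fc = nn.Identity()  # expose the 512-d pooled features
# backbone.load_state_dict(torch.load("disco_resnet18_student.pth"), strict=False)  # hypothetical checkpoint

backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

classifier = nn.Linear(512, 1000)  # ImageNet linear probe
optimizer = torch.optim.SGD(classifier.parameters(), lr=30.0, momentum=0.9)  # lr=3.0 for Eff-b0/b1/Mob-v3
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)

def train_one_epoch(loader):
    for images, labels in loader:
        with torch.no_grad():
            feats = backbone(images)  # frozen features
        loss = nn.functional.cross_entropy(classifier(feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# for epoch in range(100): train_one_epoch(train_loader); scheduler.step()
```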

4.2. Linear Evaluation

We conduct linear evaluation on ImageNet to validate the effectiveness of our method. As shown in Table 1, student models distilled by DisCo outperform the counterparts pre-trained by MoCo-V2 (Baseline) by a large margin. Besides, DisCo surpasses the state-of-the-art SEED over various student models with teacher ResNet-50/101/152 under the same setting, especially on MobileNet-v3-Large distilled by ResNet-50, with a difference of 9.2% in top-1 accuracy. When using R-50×2 as the teacher, SEED distills for 800 epochs while DisCo still distills for 200 epochs, yet the results of EfficientNet-B0, ResNet-18 and ResNet-34 using DisCo also exceed those of SEED. The performance on EfficientNet-B1 and MobileNet-v3-Large is closely related to the number of distillation epochs: for example, when EfficientNet-B1 is distilled for 290 epochs, the top-1 accuracy becomes 70.4%, which surpasses SEED, and when MobileNet-v3-Large is distilled for 340 epochs, the top-1 accuracy becomes 64.0%. We believe that when DisCo distills for 800 epochs, the results will be further improved. Moreover, since CompRess uses a better teacher, which is trained for 600 epochs longer, and distills for 400 epochs longer than SEED and ours, a comparison would not be fair, so we do not report its result in the table. In addition, when DisCo uses a larger model as the teacher, the student is further improved: for instance, using ResNet-152 instead of ResNet-50 as the teacher, ResNet-34 is improved from 62.5% to 68.1%. It is worth noting that when using ResNet-101/ResNet-50 as the teacher, the linear evaluation result of EfficientNet-B0 is very close to the teacher, 68.9% vs. 69.1% and 66.5% vs. 67.4%, while the number of parameters of EfficientNet-B0 is only 9.4%/16.3% of that of ResNet-101/ResNet-50.


Figure 5. ImageNet top-1 accuracy (%) of semi-supervised linear evaluation with 1%, 10% and 100% of the training data, for students (a) EfficientNet-B0, (b) MobileNet-v3-Large and (c) ResNet-18 distilled from teachers R-50, R-101 and R-152 (x-axis: number of parameters of the teacher, in millions). Points where the number of teacher network parameters is 0 are the results of the MoCo-V2 baseline without distillation.

Figure 6. Top-1 accuracy (%) of students (ResNet-18 and EfficientNet-B0) transferred to (a) Cifar10 and (b) Cifar100 without and with distillation from different teachers (R-50, R-101, R-152), comparing MoCo-V2, SEED and ours.


4.3. Semi-supervised Linear Evaluation

Following previous works [26, 28, 33], we evaluate our method under the semi-supervised setting. Two sampled subsets of the ImageNet training data, 1% and 10% (∼12.8 and ∼128 images per class, respectively) [8], are used for fine-tuning the student models. As shown in Figure 5, student models distilled by DisCo outperform the baseline under any amount of labeled data. Furthermore, DisCo also shows consistency under different fractions of annotations, that is, students always benefit from larger models as teachers. More labels are helpful to improve the final performance of the student model, which is expected.
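Such label-fraction subsets can be built by per-class subsampling of the ImageNet training list; a small sketch is given below. The function and variable names are placeholders, and this is independent of the official 1%/10% split files released with [8].

```python
import random
from collections import defaultdict

def sample_fraction(samples, fraction, seed=0):
    """samples: list of (image_path, class_id) pairs.
    Returns a per-class subsample; fraction=0.01 keeps roughly 13 images
    per ImageNet class, fraction=0.10 roughly 128."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, cls in samples:
        by_class[cls].append(path)
    subset = []
    for cls, paths in by_class.items():
        k = max(1, round(len(paths) * fraction))
        subset += [(p, cls) for p in rng.sample(paths, k)]
    return subset

# subset_1pct = sample_fraction(train_samples, 0.01)   # ~1% of labels
# subset_10pct = sample_fraction(train_samples, 0.10)  # ~10% of labels
```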

4.4. Comparison against other Distillation

In order to verify the effectiveness of the proposed method, we compare with three widely used distillation schemes, namely: 1) attention transfer, denoted by AT [49]; 2) relational knowledge distillation, denoted by RKD [34]; and 3) knowledge distillation, denoted by KD [24]. AT and RKD are feature-based and relation-based, respectively, and can be utilized during the self-supervised pre-training stage; KD is a logits-based method which can only be used at the supervised fine-tuning stage. The comparison results are shown in Table 2. Single-Knowledge means using one of these approaches individually, and it can be seen that all distillation approaches bring improvement over the baseline, but the gain from DisCo is the most significant, which indicates that the knowledge DisCo chooses to transfer and the way it is transmitted are indeed more effective. We then also try to transfer multi-knowledge from teacher to student by combining DisCo with the other distillation schemes. Integrating DisCo with AT/RKD/KD can boost the performance a lot, which further proves the effectiveness of DisCo.

Table 2. Linear evaluation top-1 accuracy (%) on ImageNet compared with different distillation methods.

Method | Eff-b0 | Eff-b1 | Mob-v3 | R-18 | R-34
Baseline:
MoCo-V2 [11] | 46.8 | 48.4 | 36.2 | 52.2 | 56.8
Single-Knowledge:
AT [49] | 57.1 | 58.2 | 51.0 | 56.2 | 60.2
RKD [34] | 48.3 | 50.3 | 36.9 | 56.4 | 58.7
KD [24] | 46.5 | 48.5 | 37.3 | 51.5 | 58.8
DisCo (ours) | 66.5 | 66.6 | 64.4 | 60.6 | 62.5
Multi-Knowledge:
AT + DisCo | 66.7 | 66.3 | 64.1 | 60.2 | 62.3
RKD + DisCo | 66.8 | 66.5 | 64.4 | 60.6 | 62.3
KD + DisCo | 65.8 | 65.9 | 65.2 | 60.6 | 65.9
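For reference, a generic sketch of the logits-based KD term [24] used at the supervised fine-tuning stage (the term combined with DisCo in the "KD + DisCo" row) is shown below; the temperature and weighting values are illustrative, not the paper's exact hyper-parameters.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard knowledge distillation [24]: soften both logit distributions with
    temperature T and mix the KL term with the usual cross-entropy on labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```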

4.5. Transfer to Cifar10/Cifar100

In order to analyze the generalization of the representations obtained by DisCo, we further conduct linear evaluation on Cifar10 and Cifar100 with ResNet-18/EfficientNet-B0 as student and ResNet-50/ResNet-101/ResNet-152 as teacher.


Table 3. Object detection and instance segmentation results with ResNet-34 as backbone. Bounding-box AP (AP^bb) and mask AP (AP^mk) are evaluated on VOC07 test and COCO val2017. ‡ means our implementation. The values in parentheses (↑) represent the improvement compared to the MoCo-V2 baseline.

S | T | Method | VOC detection (AP^bb / AP^bb_50 / AP^bb_75) | COCO detection (AP^bb / AP^bb_50 / AP^bb_75) | COCO instance segmentation (AP^mk / AP^mk_50 / AP^mk_75)
R-34 | - | MoCo-V2‡ | 53.6 / 79.1 / 58.7 | 38.1 / 56.8 / 40.7 | 33.0 / 53.2 / 35.3
R-34 | R-50 | SEED [18] | 53.7 / 79.4 / 59.2 | 38.4 / 57.0 / 41.0 | 33.3 / 53.2 / 35.3
R-34 | R-50 | DisCo (ours) | 56.5 (+2.9) / 80.6 (+1.5) / 62.5 (+3.8) | 40.0 (+1.9) / 59.1 (+2.3) / 43.4 (+2.7) | 34.9 (+1.9) / 56.3 (+3.1) / 37.1 (+1.8)
R-34 | R-101 | SEED [18] | 54.1 / 79.8 / 59.1 | 38.5 / 57.3 / 41.4 | 33.6 / 54.1 / 35.6
R-34 | R-101 | DisCo (ours) | 56.1 (+2.5) / 80.3 (+1.2) / 61.8 (+3.1) | 40.0 (+1.9) / 59.1 (+2.3) / 43.2 (+2.5) | 34.7 (+1.9) / 55.9 (+2.7) / 37.4 (+1.8)
R-34 | R-152 | SEED [18] | 54.4 / 80.1 / 59.9 | 38.4 / 57.0 / 41.0 | 33.3 / 53.7 / 35.3
R-34 | R-152 | DisCo (ours) | 56.6 (+3.0) / 80.8 (+1.7) / 63.4 (+5.7) | 39.4 (+1.3) / 58.7 (+1.9) / 42.7 (+2.0) | 34.4 (+1.4) / 55.4 (+2.2) / 36.7 (+1.4)

Since the image resolution of the Cifar datasets is 32×32, all the images are resized to 224×224 with bicubic re-sampling before being fed into the model, following [11, 18]. The results are shown in Figure 6. It can be seen that the proposed DisCo surpasses the MoCo-V2 baseline by a large margin with different student and teacher architectures on both Cifar10 and Cifar100. In addition, our method also shows a significant improvement compared with the state-of-the-art approach SEED. It is worth noting that as the teacher becomes better, the improvement brought by DisCo becomes more obvious.
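The corresponding preprocessing is straightforward with torchvision; a small sketch follows (the normalization statistics are the usual ImageNet values and are an assumption here, not stated in the paper).

```python
from torchvision import transforms

# Upsample 32x32 Cifar images to the ImageNet input resolution with bicubic
# resampling before feeding them to the frozen encoder for linear evaluation.
cifar_eval_transform = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```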

4.6. Transfer to Detection and Segmentation

We also conduct experiments on detection and segmentation tasks for generalization analysis. A C4-based Faster R-CNN [36] is used for object detection on VOC, and Mask R-CNN [21] is used for object detection and instance segmentation on COCO. In this part of the experiments, following [10, 18], all the parameters of the student network are learnable, and the implementation is based on detectron2 [47] for convenience. The results of using ResNet-34 as the student with teacher ResNet-50/ResNet-101/ResNet-152 are shown in Table 3. On object detection, our method brings an obvious improvement on both the VOC and COCO datasets. Furthermore, as SEED [18] notes, the improvement on COCO is relatively minor compared to VOC, since the COCO training set has 118k images while VOC has only 16.5k training images; the gain brought by weight initialization is therefore relatively small. On the instance segmentation task, DisCo also shows superiority.
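In practice, this means exporting the distilled student's encoder weights and using them to initialize the detector backbone. A minimal, framework-agnostic sketch is below; the checkpoint layout, key prefixes, and file names are assumptions, and the detectron2-specific config and weight-conversion step is omitted.

```python
import torch

# Extract the distilled student's backbone weights so they can initialize a
# Faster/Mask R-CNN backbone for the transfer experiments.
ckpt = torch.load("disco_resnet34_student.pth", map_location="cpu")  # hypothetical path
state = ckpt.get("state_dict", ckpt)

backbone_only = {}
for k, v in state.items():
    k = k.replace("module.", "")            # strip DataParallel prefix if present
    if k.startswith("encoder_q."):          # MoCo-style query-encoder prefix (assumed)
        k = k[len("encoder_q."):]
    if k.startswith("fc.") or k.startswith("mlp."):
        continue                            # drop the projection head; only the encoder transfers
    backbone_only[k] = v

torch.save({"model": backbone_only}, "disco_resnet34_backbone.pth")
```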

4.7. Distilling BottleNeck Analysis

In this section, we analyze the Distilling BottleNeck phenomenon; we use ResNet-50 as the teacher for simplicity.

Distilling BottleNeck Phenomenon. In the self-supervised distillation stage, we first tried to distill small models with the default MLP configuration of MoCo-V2 using DisCo; the results are shown in Table 4, denoted by DisCo*. It is worth noting that the hidden-layer dimensions in DisCo* are exactly the same as in SEED. It can be seen that, compared to SEED, DisCo* shows superior results on EfficientNet-B0, EfficientNet-B1 and MobileNet-v3-Large, and comparable results on ResNet-18 and ResNet-34. We then expand the dimension of the hidden layer in the MLP of the student to be consistent with that of the teacher, that is, 2048-D; the results are further improved, as recorded in the third row. In particular, this expansion brings 3.5% and 3.6% gains for ResNet-18 and ResNet-34, respectively.

Table 4. Linear evaluation top-1 accuracy (%) on ImageNet. DisCo* means that the dimension of the hidden layer in the MLP is consistent with that of SEED.

Method | Eff-b0 | Eff-b1 | Mob-v3 | R-18 | R-34
SEED [18] | 61.3 | 61.4 | 55.2 | 57.6 | 58.5
DisCo* | 65.6 | 65.8 | 63.8 | 57.1 | 58.9
DisCo | 66.5 (+0.9) | 66.6 (+0.8) | 64.4 (+0.6) | 60.6 (+3.5) | 62.5 (+3.6)
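A sketch of the student projection head with a configurable hidden width is shown below. The two-linear-layer structure with a ReLU and a 128-D output follows the description in the paper; the class name and the example widths are ours.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Two-layer MLP head used during distillation: in_dim -> hidden_dim -> 128.
    Widening hidden_dim (e.g. from 512 to 2048 for ResNet-18/34) alleviates the
    Distilling BottleNeck; the head is discarded at fine-tuning/deployment."""

    def __init__(self, in_dim: int, hidden_dim: int = 2048, out_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, z):
        return self.mlp(z)

default_head = ProjectionHead(in_dim=512, hidden_dim=512)   # default ResNet-18/34 head (bottlenecked)
widened_head = ProjectionHead(in_dim=512, hidden_dim=2048)  # widened to match the teacher
```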

Theoretical Analysis from the IB perspective. In Figure 7, on the downstream Cifar10 classification task, we visualize in the information plane the compression phase of ResNet-18 and ResNet-34 with different hidden dimensions, distilled by the same teacher. It can be seen that when we enlarge the hidden dimension in the MLP of ResNet-18 and ResNet-34 from 512-D to 2048-D, the value of I(X;T) becomes smaller while I(T;Y) is basically unchanged, which suggests that enlarging the hidden dimension can make the student model more generalized in the setting of self-supervised transfer learning.

The effectiveness of the dimension. Here we further explore the impact of more dimensions, and the results are shown in Figure 9.


Figure 7. Mutual information paths from transition points to convergence points in the compression phase of training, plotted in the I(X;T)–I(T;Y) plane for (a) ResNet-18 and (b) ResNet-34 with MLP hidden dimensions of 512-D and 2048-D. T denotes transition points and C(X) denotes convergence points with X% top-1 accuracy on Cifar10 (convergence accuracies: 85.8% and 89.1% for ResNet-18, 83.4% and 86.4% for ResNet-34). Points with similar I(T;Y) but smaller I(X;T) are better generalized.

It can be seen that as the dimension increases, the top-1 accuracy also increases, but when the dimension is already large, the growth trend slows down. The performance trend of EfficientNet-B1 and MobileNet-v3-Large is close to that of EfficientNet-B0, so for the sake of clarity we do not exhibit it in the figure.

4.8. More SSL Methods

Variants of the testbed. In this part, in order to demonstrate the versatility of our method, we further try two other SSL methods as the testbed: SwAV and DINO.

For SwAV, the teacher is backboned by ResNet-50 and the results are shown in Table 5. It can be seen that for models with very few parameters, EfficientNet-B0 and MobileNet-v3-Large, the pre-training results with SwAV alone are also very poor; when DisCo is utilized, the efficacy is significantly improved.

Table 5. Linear evaluation top-1 accuracy (%) on ImageNet with SwAV as the testbed. The teacher of DisCo is an online open-source ResNet-50 model pre-trained using SwAV with top-1 accuracy 75.3%.

Method | Eff-b0 | Mob-v3 | R-34
SwAV [5] | 46.8 | 19.4 | 63.3
SwAV + DisCo | 62.4 | 55.7 | 63.3

For DINO, we use two vision transformer architectures, ViT and XCiT, and the results are shown in Table 6. It can be seen that no matter which SSL method is adopted and which architecture is used in the network, DisCo brings significant gains. It is worth noting that XCiT-tiny has 26M parameters, which is much larger than ViT-tiny (5M), yet DisCo can still narrow the gap with the teacher.

Variants of the teacher pre-training method. In order to verify that our method is not picky about the pre-training approach adopted by the teacher, we use three ResNet-50 networks pre-trained with different SSL methods as the teacher, under the testbed of MoCo-V2.

Table 6. Linear evaluation top-1 accuracy (%) on ImageNet with DINO as the testbed.

Teacher (Model, T-1 acc.) | Student ViT-tiny | Student XCiT-tiny
- (no distillation) | 55.0 | 67.0
ViT-small [14, 43] (77.0) | 68.4 (+13.4) | -
XCiT-small [16] (77.8) | - | 71.1 (+4.1)

Table 7. Linear evaluation top-1 accuracy (%) on ImageNet with variants of teacher pre-training methods. All the teachers are ResNet-50, and the first row is the student trained by MoCo-V2 directly without distillation, which is the baseline.

Teacher (Method, T-1 acc.) | Eff-b0 | Eff-b1 | Mob-v3 | R-18 | R-34
- (no distillation) | 46.8 | 48.4 | 36.2 | 52.2 | 56.8
MoCo-V2 [10] (67.4) | 66.5 | 66.6 | 64.4 | 60.6 | 62.5
SeLa-V2 [2] (71.8) | 62.2 | 68.2 | 66.2 | 64.1 | 65.3
SwAV [5] (75.3) | 70.0 | 72.1 | 65.0 | 65.1 | 67.5

It can be observed from Table 7 that when using different pre-trained ResNet-50 models as teachers, DisCo significantly boosts the results of all the small models. Furthermore, as the teacher improves with different and stronger pre-training methods, the results of the student can be further improved.

4.9. Visualization Analysis

In Figure 8, we visualize the learned representations of EfficientNet-B0/ResNet-50 pre-trained by MoCo-V2 and of EfficientNet-B0 distilled by ResNet-50 using DisCo. For clarity, we randomly select 10 classes from the ImageNet test set and map the learned representations to a two-dimensional space by t-SNE [44]. It can be observed that ResNet-50 forms more separated clusters than EfficientNet-B0 when using MoCo-V2 alone, and that after using ResNet-50 to teach EfficientNet-B0 with DisCo, EfficientNet-B0 behaves very much like the teacher.
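The projection step itself can be reproduced with scikit-learn's t-SNE; a small sketch is below. The feature extraction is assumed to have been done beforehand, and the file names are placeholders.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# features: (N, D) embeddings of images from 10 randomly chosen ImageNet classes,
# extracted from the frozen encoder; labels: (N,) integer class ids.
features = np.load("effb0_disco_features.npy")  # hypothetical file names
labels = np.load("effb0_disco_labels.npy")

coords = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)

plt.figure(figsize=(5, 5))
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=5)
plt.xlabel("Dim 1")
plt.ylabel("Dim 2")
plt.show()
```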

5. Conclusion

In this paper, we propose Distilled Contrastive Learning (DisCo) to remedy self-supervised learning on lightweight models. The proposed method constrains the final embedding of the lightweight student to be consistent with that of the teacher in order to maximally transmit the teacher's knowledge. DisCo is not limited to specific contrastive learning methods and can bring the performance of the student very close to that of the teacher.

References

[1] Soroush Abbasi Koohpayegani, Ajinkya Tejankar, and Hamed Pirsiavash. CompRess: Self-supervised learning by compressing representations. In NeurIPS, pages 12980–12992, 2020.


Figure 8. Clustering results (t-SNE) on the ImageNet test set for (a) Eff-b0 (MoCo-V2), (b) R-50 (MoCo-V2) and (c) Eff-b0 (DisCo, distilled by ResNet-50). Different colors represent different classes.

Figure 9. Linear evaluation top-1 accuracy (%) on ImageNet of students (EfficientNet-B0, ResNet-18, ResNet-34) with different hidden-layer dimensions of the projection head during distillation; the marked points represent the original dimension used in MoCo-V2.

[2] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.

[3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, pages 132–149, 2018.

[4] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, pages 2959–2968, 2019.

[5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, pages 9912–9924, 2020.

[6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. 2021.

[7] Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. Cross-layer distillation with semantic calibration. 2020.

[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.

[9] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, pages 22243–22255, 2020.

[10] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. In CVPR, pages 9729–9738, 2020.

[11] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.

[12] Hao Cheng, Dongze Lian, Shenghua Gao, and Yanlin Geng. Evaluating capability of deep neural networks for image classification via information plane. In ECCV, pages 168–182, 2018.

[13] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.

[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[15] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. Volume 38, pages 1734–1747, 2015.

[16] Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. XCiT: Cross-covariance image transformers. 2021.

[17] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge. Volume 88, pages 303–338, 2010.


[18] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. SEED: Self-supervised distillation for visual representation. In ICLR, 2021.

[19] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, pages 21271–21284, 2020.

[20] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.

[21] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[23] Olivier Hénaff. Data-efficient image recognition with contrastive predictive coding. In ICML, pages 4182–4192, 2020.

[24] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPS Workshop, 2015.

[25] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In ICCV, pages 1314–1324, 2019.

[26] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In CVPR, pages 1920–1929, 2019.

[27] Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.

[28] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? In CVPR, pages 2661–2671, 2019.

[29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Citeseer, 2009.

[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.

[31] Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. Knowledge distillation via instance relationship graph. In CVPR, pages 7096–7104, 2019.

[32] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84, 2016.

[33] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. 2018.

[34] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, pages 3967–3976, 2019.

[35] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.

[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Volume 39, pages 1137–1149, 2015.

[37] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2014.

[38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. Volume 115, pages 211–252, 2015.

[39] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. 2017.

[40] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114, 2019.

[41] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In ICLR, 2020.

[42] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. 2000.

[43] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357, 2021.

[44] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Volume 9, 2008.

[45] Jinpeng Wang, Yuting Gao, Ke Li, Xinyang Jiang, Xiaowei Guo, Rongrong Ji, and Xing Sun. Enhancing unsupervised video representation learning by decoupling the scene and the motion. 2020.

[46] Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J. Ma, and Xing Sun. Removing the background by adding the background: Towards background robust self-supervised video representation learning. 2020.

[47] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[48] Guodong Xu, Ziwei Liu, Xiaoxiao Li, and Chen Change Loy. Knowledge distillation meets self-supervision. In ECCV, pages 588–604, 2020.

[49] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.

[50] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. 2021.


11

Page 5: 12 1 arXiv:2104.09124v3 [cs.CV] 9 Aug 2021

with several other commonly used distillation schemes in section 4.4. The experimental results prove that the knowledge we transmit, and the way it is transferred, are indeed more effective.

Distilling Bottleneck. In our distillation experiments we found an interesting phenomenon: when the encoder of the student network is ResNet-18 or ResNet-34 and the default MLP configuration is adopted, that is, the embedding output by the encoder is projected from D to D and then to 128, the results of DisCo are not satisfactory. We assume that this degradation is caused by the dimension of the hidden layer in the MLP being too small, and term this phenomenon the Distilling Bottleneck. In Figure 4 we show the default configuration of the projection head of ResNet-18/34, EfficientNet-B0/B1, MobileNet-v3-Large, and ResNet-50/101/152. It can be seen that the dimension of the hidden layer of ResNet-18/34 is small compared to the other networks.

Figure 4. Default MLP (projection head) configurations of multiple networks: ResNet-18/34: 512 → 512 → 128; MobileNet-v3 and EfficientNet-B0/B1: 1280 → 1280 → 128; ResNet-50/101/152: 2048 → 2048 → 128.

In order to alleviate the Distilling Bottleneck problem, we propose to expand the dimension of the hidden layer in the MLP. It is worth noting that this operation only introduces a small number of additional parameters at the self-supervised distillation stage, and the MLP is discarded during fine-tuning and deployment, so no extra computational burden is introduced. We experimentally verify in section 4.7 that this simple operation brings significant gains.
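To make the widening concrete, below is a minimal PyTorch-style sketch of a two-layer student projection head; the 512-d input corresponds to the ResNet-18/34 case in Figure 4, and the module layout is an assumption modeled on a MoCo-V2-style MLP head rather than the authors' released code.

```python
import torch.nn as nn

def projection_head(in_dim: int = 512, hidden_dim: int = 2048, out_dim: int = 128) -> nn.Sequential:
    """Two-layer MLP projection head, used only during distillation and discarded afterwards."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

# Default ResNet-18/34 head (512 -> 512 -> 128) vs. the widened head (512 -> 2048 -> 128).
default_head = projection_head(512, 512, 128)
widened_head = projection_head(512, 2048, 128)
```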

This operation can be further explained from the Information Bottleneck (IB) [42] perspective. IB is utilized in [39, 12] to understand how deep networks work by visualizing mutual information (I(X;T) and I(T;Y)) in the information plane, where I(X;T) is the mutual information between input and output and I(T;Y) is the mutual information between output and label. The training of deep networks can be described by two phases in the information plane: the first fitting phase, where the network memorizes the information of the input, resulting in the growth of I(X;T) and I(T;Y); and the subsequent compression phase, where the network removes irrelevant information of the input for better generalization, resulting in a decrease of I(X;T) (see Figure 7 in [12]). Generally, in the compression phase, I(X;T) reflects the model's capability of generalization, while I(T;Y) reflects the model's capability of fitting the label [12]. We visualize the compression phase of our model with different dimensions of the hidden layer in the pre-training distillation stage in the information plane, on one downstream transfer classification task. The results in Figure 7 show two interesting phenomena:

i) Models with different dimensions of the hidden layer have very similar I(T;Y), suggesting that the models have nearly equal capability of fitting the labels.

ii) The model with the larger hidden-layer dimension has smaller I(X;T), suggesting a stronger capability of generalization.

These phenomena show that the MLP indeed relates to the capability of model generalization in the setting of self-supervised transfer learning.
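For reference, I(X;T) and I(T;Y) above are the usual mutual informations between the input X, an intermediate representation T, and the label Y; the definitions below are the standard ones, not notation introduced by this paper.

$$
I(X;T) = \sum_{x,t} p(x,t)\,\log\frac{p(x,t)}{p(x)\,p(t)}, \qquad
I(T;Y) = \sum_{t,y} p(t,y)\,\log\frac{p(t,y)}{p(t)\,p(y)}
$$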

3.4 Overall Objective Function

The overall objective function is defined as follows:

$\mathcal{L} = \mathcal{L}_{dis} + \lambda \mathcal{L}_{co}$   (2)

where $\mathcal{L}_{dis}$ comes from the distillation part, $\mathcal{L}_{co}$ can be the contrastive loss function of any SSL method, and $\lambda$ is a hyper-parameter that controls the relative weight of the distillation loss and the contrastive loss. In our experiments $\lambda$ is set to 1. Due to the simplicity of its implementation, we use MoCo-V2 as the testbed in the experiments.
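As an illustration of Eq. (2), the sketch below combines a distillation term computed on the final embeddings with a MoCo-V2-style InfoNCE term. The exact form of the distillation loss (a cosine-consistency between the student's and the frozen teacher's final embeddings) and all function and argument names are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def disco_style_objective(student_emb, teacher_emb, query, key, queue, tau=0.2, lam=1.0):
    """Sketch of L = L_dis + lambda * L_co (Eq. 2).

    student_emb / teacher_emb: final projected embeddings of the same image from the
    student and the frozen teacher, shape (N, D).
    query / key / queue: MoCo-V2-style contrastive inputs for the student,
    shapes (N, D), (N, D), and (D, K); the queue is assumed to hold normalized keys.
    """
    # Distillation term: pull the student's final embedding toward the teacher's.
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb.detach(), dim=1)   # teacher is frozen
    l_dis = (2.0 - 2.0 * (s * t).sum(dim=1)).mean()

    # Contrastive term: InfoNCE over one positive key and a queue of negatives.
    q = F.normalize(query, dim=1)
    k = F.normalize(key, dim=1).detach()
    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)   # (N, 1)
    l_neg = torch.einsum("nc,ck->nk", q, queue)            # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    l_co = F.cross_entropy(logits, labels)

    return l_dis + lam * l_co   # lambda = 1 in the paper's experiments
```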

4 Experiments

4.1 Settings

Dataset. All the self-supervised pre-training experiments are conducted on ImageNet ILSVRC-2012 [38]. For downstream classification tasks, experiments are carried out on Cifar10 [29] and Cifar100 [29]. For downstream detection tasks, experiments are conducted on PASCAL VOC [17] and MS-COCO [30], with train+val/test and train2017/val2017 for training/testing, respectively. For downstream segmentation tasks, the proposed method is verified on MS-COCO.

Teacher Encoders. Six large encoders are used as teachers: ResNet-50 (22.4M), ResNet-101 (40.5M), ResNet-152 (55.4M), ResNet-50×2 (55.5M), ViT-small (22M) [14, 43], and XCiT-small (44.3M) [16], where X (Y) denotes that encoder X has Y million parameters; Y does not take the linear layer into consideration.

Student Encoders. Five widely used small yet effective convolutional neural networks and two vision transformer encoders are used as students: EfficientNet-B0 (4.0M), MobileNet-v3-Large (4.2M), EfficientNet-B1 (6.4M), ResNet-18 (10.7M), ResNet-34 (20.4M), ViT-tiny (5M), and XCiT-tiny (26.1M).

Teacher Pre-training Setting. ResNet-50/101/152 are pre-trained using MoCo-V2 with default hyper-parameters.


Table 1. ImageNet test accuracy (%) using linear classification on different student architectures. ♦ denotes that the teacher/student models are pre-trained with MoCo-V2 (our implementation), and † means the teacher is pre-trained by SwAV (an open-source model). When using R-50×2 as the teacher, SEED distills for 800 epochs while DisCo distills for 200 epochs. Values in parentheses represent the improvement compared to the MoCo-V2 baseline.

Method | Teacher | Eff-b0 (T-1 / T-5) | Eff-b1 (T-1 / T-5) | Mob-v3 (T-1 / T-5) | R-18 (T-1 / T-5) | R-34 (T-1 / T-5)
Supervised | - | 77.1 / 93.3 | 79.2 / 94.4 | 75.2 / - | 72.1 / - | 75.0 / -
Self-supervised:
MoCo-V2 (Baseline)♦ | - | 46.8 / 72.2 | 48.4 / 73.8 | 36.2 / 62.1 | 52.2 / 77.6 | 56.8 / 81.4
SSL Distillation:
SEED [18] | R-50 (67.4) | 61.3 / 82.7 | 61.4 / 83.1 | 55.2 / 80.3 | 57.6 / 81.8 | 58.5 / 82.6
DisCo (ours) | R-50 (67.4)♦ | 66.5 (+19.7) / 87.6 (+15.4) | 66.6 (+18.2) / 87.5 (+13.7) | 64.4 (+28.2) / 86.2 (+24.1) | 60.6 (+8.4) / 83.7 (+6.1) | 62.5 (+5.7) / 85.4 (+4.0)
SEED [18] | R-101 (70.3) | 63.0 / 83.8 | 63.4 / 84.6 | 59.9 / 83.5 | 58.9 / 82.5 | 61.6 / 84.9
DisCo (ours) | R-101 (69.1)♦ | 68.9 (+22.1) / 88.9 (+16.7) | 69.0 (+20.6) / 89.1 (+15.3) | 65.7 (+29.5) / 86.7 (+24.6) | 62.3 (+10.1) / 85.1 (+7.5) | 64.4 (+7.6) / 86.5 (+5.1)
SEED [18] | R-152 (74.2) | 65.3 / 86.0 | 67.3 / 86.9 | 61.4 / 84.6 | 59.5 / 83.3 | 62.7 / 85.8
DisCo (ours) | R-152 (74.1)♦ | 67.8 (+21.0) / 87.0 (+14.8) | 73.1 (+24.7) / 91.2 (+17.4) | 63.7 (+27.5) / 84.9 (+22.8) | 65.5 (+13.3) / 86.7 (+9.1) | 68.1 (+11.3) / 88.6 (+7.2)
SEED [18] | R-50×2 (77.3†) | 67.6 / 87.4 | 68.0 / 87.6 | 68.2 / 88.2 | 63.0 / 84.9 | 65.7 / 86.8
DisCo (ours) | R-50×2 (77.3)† | 69.1 (+22.3) / 88.9 (+17.7) | 64.0 (+15.6) / 84.6 (+10.8) | 58.9 (+22.7) / 81.4 (+19.3) | 65.2 (+13.0) / 86.8 (+9.2) | 67.6 (+10.8) / 88.6 (+7.2)

Following SEED, ResNet-50 and ResNet-101 are trained for 200 epochs and ResNet-152 is trained for 400 epochs. ResNet-50×2 is pre-trained by SwAV (an open-source model¹) for 800 epochs.

Self-supervised Distillation Setting. The projection head of all the student networks has two linear layers, with dimensions 2048 and 128. The configuration of learning rate and optimizer is the same as in MoCo-V2, and unless otherwise stated the model is trained for 200 epochs. Furthermore, during the distillation stage the teacher is frozen.

Student Fine-tuning Setting. For linear evaluation on ImageNet, the student is fine-tuned for 100 epochs. The initial learning rate is 3 for EfficientNet-B0/EfficientNet-B1/MobileNet-v3-Large and 30 for ResNet-18/ResNet-34. For linear evaluation on Cifar10 and Cifar100, the initial learning rate is 3 and all models are fine-tuned for 100 epochs. SGD is adopted as the optimizer, and the learning rate is decreased by 10x at 60 and 80 epochs for both linear evaluations. For downstream detection and segmentation tasks, following [20, 10, 18], all parameters are fine-tuned. For the detection task on VOC, the initial learning rate is 0.1 with 200 warm-up iterations, decaying by 10x at 18k and 22.2k steps. The detector is trained for 48k steps with a batch size of 32 on 8 V100 GPUs. Following [18], the scales of images are randomly sampled from [400, 800] during training and set to 800 at inference. For the detection and instance segmentation tasks on COCO, the model is trained for 180k iterations with an initial learning rate of 0.11, and the scales of images are randomly sampled from [600, 800] during training.

¹ https://github.com/facebookresearch/swav
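To illustrate the linear-evaluation protocol described above, here is a minimal PyTorch sketch that freezes a pre-trained student encoder and trains only a linear classifier with SGD, decaying the learning rate by 10x at epochs 60 and 80; the function and argument names are placeholders, and the SGD momentum value is an assumption, not a setting reported by the paper.

```python
import torch
import torch.nn as nn

def build_linear_eval(encoder: nn.Module, feat_dim: int, num_classes: int = 1000,
                      base_lr: float = 30.0):
    """Freeze the encoder and attach a trainable linear classifier (linear evaluation).

    base_lr is 30 for ResNet-18/34 students and 3 for the other students, per section 4.1.
    """
    for p in encoder.parameters():
        p.requires_grad = False        # backbone stays frozen during linear evaluation
    classifier = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=base_lr, momentum=0.9)
    # Decrease the learning rate by 10x at epochs 60 and 80.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)
    return classifier, optimizer, scheduler
```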

4.2 Linear Evaluation

We conduct linear evaluation on ImageNet to validate the effectiveness of our method. As shown in Table 1, student models distilled by DisCo outperform their counterparts pre-trained by MoCo-V2 (Baseline) by a large margin. Besides, DisCo surpasses the state-of-the-art SEED over various student models with teacher ResNet-50/101/152 under the same setting, especially on MobileNet-v3-Large distilled by ResNet-50, with a difference of 9.2% in top-1 accuracy. When using R-50×2 as the teacher, SEED distills for 800 epochs while DisCo still distills for 200 epochs, yet the results of EfficientNet-B0, ResNet-18, and ResNet-34 using DisCo still exceed those of SEED. The performance on EfficientNet-B1 and MobileNet-v3-Large is closely related to the number of distillation epochs: for example, when EfficientNet-B1 is distilled for 290 epochs the top-1 accuracy becomes 70.4%, which surpasses SEED, and when MobileNet-v3-Large is distilled for 340 epochs the top-1 accuracy becomes 64%. We believe that when DisCo distills for 800 epochs the results will improve further. Moreover, since CompRess uses a better teacher, which is trained for 600 more epochs and distilled for 400 more epochs than SEED and ours, a comparison would not be fair, so we do not report its result in the table. In addition, when DisCo uses a larger model as the teacher, the student is further improved; for instance, using ResNet-152 instead of ResNet-50 as the teacher, ResNet-34 is improved from 62.5% to 68.1%.


Figure 5. ImageNet top-1 accuracy (%) of semi-supervised linear evaluation with 1%, 10%, and 100% of the training data, plotted against the number of parameters of the teacher (millions; teachers R-50, R-101, R-152). Panels: (a) EfficientNet-B0, (b) MobileNet-v3-Large, (c) ResNet-18. Points where the number of teacher network parameters is 0 are the results of the MoCo-V2 baseline without distillation.

Figure 6. Top-1 accuracy of students (R-18 and Eff-b0) transferred to (a) Cifar10 and (b) Cifar100, without distillation (MoCo-V2) and with distillation (SEED, Ours) from different teachers (R-50, R-101, R-152).

It is worth noting that when using ResNet-101/ResNet-50 as the teacher, the linear evaluation result of EfficientNet-B0 is very close to that of the teacher (68.9% vs. 69.1% and 66.5% vs. 67.4%), while the number of parameters of EfficientNet-B0 is only 9.4%/16.3% of that of ResNet-101/ResNet-50.

4.3 Semi-supervised Linear Evaluation

Following previous works [26, 28, 33], we evaluate our method under the semi-supervised setting. Two sampled subsets of the ImageNet training data, 1% and 10% (~12.8 and ~128 images per class, respectively) [8], are used for fine-tuning the student models. As shown in Figure 5, student models distilled by DisCo outperform the baseline under any amount of labeled data. Furthermore, DisCo is consistent under different fractions of annotations, that is, students always benefit from larger models as teachers. More labels help improve the final performance of the student model, which is expected.

4.4 Comparison against other Distillation

In order to verify the effectiveness of the proposed method, we compare with three widely used distillation schemes, namely 1) attention transfer, denoted AT [49]; 2) relational knowledge distillation, denoted RKD [34]; and 3) knowledge distillation, denoted KD [24]. AT and RKD are feature-based and relation-based respectively, and can be utilized during the self-supervised pre-training stage; KD is a logits-based method which can only be used at the supervised fine-tuning stage. The comparison results are shown in Table 2. Single-Knowledge means using one of these approaches individually; all distillation approaches bring improvement over the baseline, but the gain from DisCo is the most significant, which indicates that the knowledge DisCo chooses to transfer, and the way it is transferred, are indeed more effective. We then also transfer multi-knowledge from teacher to student by combining DisCo with the other distillation schemes. Integrating DisCo with AT/RKD/KD boosts the performance considerably, which further demonstrates the effectiveness of DisCo.

Table 2. Linear evaluation top-1 accuracy (%) on ImageNet compared with different distillation methods.

Method | Eff-b0 | Eff-b1 | Mob-v3 | R-18 | R-34
Baseline:
MoCo-V2 [11] | 46.8 | 48.4 | 36.2 | 52.2 | 56.8
Single-Knowledge:
AT [49] | 57.1 | 58.2 | 51.0 | 56.2 | 60.2
RKD [34] | 48.3 | 50.3 | 36.9 | 56.4 | 58.7
KD [24] | 46.5 | 48.5 | 37.3 | 51.5 | 58.8
DisCo (ours) | 66.5 | 66.6 | 64.4 | 60.6 | 62.5
Multi-Knowledge:
AT + DisCo | 66.7 | 66.3 | 64.1 | 60.2 | 62.3
RKD + DisCo | 66.8 | 66.5 | 64.4 | 60.6 | 62.3
KD + DisCo | 65.8 | 65.9 | 65.2 | 60.6 | 65.9

4.5 Transfer to Cifar10/Cifar100

In order to analyze the generalization of the representations obtained by DisCo, we further conduct linear evaluation on Cifar10 and Cifar100.


Table 3. Object detection and instance segmentation results with ResNet-34 as the student (S) backbone. Bounding-box AP (AP^bb) and mask AP (AP^mk) are evaluated on VOC07 test and COCO val2017. ‡ means our implementation. Values in parentheses represent the improvement compared to the MoCo-V2 baseline.

T | Method | VOC AP^bb | VOC AP^bb_50 | VOC AP^bb_75 | COCO AP^bb | COCO AP^bb_50 | COCO AP^bb_75 | COCO AP^mk | COCO AP^mk_50 | COCO AP^mk_75
- | MoCo-V2‡ | 53.6 | 79.1 | 58.7 | 38.1 | 56.8 | 40.7 | 33.0 | 53.2 | 35.3
R-50 | SEED [18] | 53.7 | 79.4 | 59.2 | 38.4 | 57.0 | 41.0 | 33.3 | 53.2 | 35.3
R-50 | DisCo (ours) | 56.5 (+2.9) | 80.6 (+1.5) | 62.5 (+3.8) | 40.0 (+1.9) | 59.1 (+2.3) | 43.4 (+2.7) | 34.9 (+1.9) | 56.3 (+3.1) | 37.1 (+1.8)
R-101 | SEED [18] | 54.1 | 79.8 | 59.1 | 38.5 | 57.3 | 41.4 | 33.6 | 54.1 | 35.6
R-101 | DisCo (ours) | 56.1 (+2.5) | 80.3 (+1.2) | 61.8 (+3.1) | 40.0 (+1.9) | 59.1 (+2.3) | 43.2 (+2.5) | 34.7 (+1.9) | 55.9 (+2.7) | 37.4 (+1.8)
R-152 | SEED [18] | 54.4 | 80.1 | 59.9 | 38.4 | 57.0 | 41.0 | 33.3 | 53.7 | 35.3
R-152 | DisCo (ours) | 56.6 (+3.0) | 80.8 (+1.7) | 63.4 (+5.7) | 39.4 (+1.3) | 58.7 (+1.9) | 42.7 (+2.0) | 34.4 (+1.4) | 55.4 (+2.2) | 36.7 (+1.4)

ResNet-18/EfficientNet-B0 are used as students and ResNet-50/ResNet-101/ResNet-152 as teachers. Since the image resolution of the Cifar datasets is 32×32, all images are resized to 224×224 with bicubic re-sampling before being fed into the model, following [11, 18]. The results are shown in Figure 6: the proposed DisCo surpasses the MoCo-V2 baseline by a large margin with different student and teacher architectures on both Cifar10 and Cifar100. In addition, our method also shows a significant improvement over the state-of-the-art approach SEED. It is worth noting that as the teacher becomes better, the improvement brought by DisCo becomes more obvious.
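A minimal torchvision sketch of the resizing described above; the normalization statistics are assumed to be the standard ImageNet ones, which the paper does not specify here.

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# Upsample 32x32 Cifar images to 224x224 with bicubic re-sampling before evaluation.
cifar_eval_transform = transforms.Compose([
    transforms.Resize(224, interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```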

4.6 Transfer to Detection and Segmentation

We also conduct experiments on detection and segmentation tasks for generalization analysis. A C4-based Faster R-CNN [36] is used for object detection on VOC, and Mask R-CNN [21] is used for object detection and instance segmentation on COCO. In this part of the experiments, following [10, 18], all the parameters of the student network are learnable, and the implementation is based on detectron2 [47] for convenience. The results of using ResNet-34 as the student with teachers ResNet-50/ResNet-101/ResNet-152 are shown in Table 3. On object detection, our method brings an obvious improvement on both the VOC and COCO datasets. Furthermore, as SEED [18] notes, the improvement on COCO is relatively minor compared to VOC: the COCO training set has 118k images while VOC has only 16.5k training images, so the gain brought by weight initialization is relatively small. On the instance segmentation task, DisCo also shows superiority.
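For orientation, the sketch below expresses the VOC fine-tuning schedule from section 4.1 as a detectron2 config. It is only an approximation: the model-zoo R50-C4 VOC config is used as a stand-in (detectron2 does not ship a ResNet-34 C4 backbone, which is what the paper actually uses), and `pretrain.pkl` is a hypothetical path to the self-supervised weights converted to detectron2 format.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
# Stand-in base config: Faster R-CNN C4 on Pascal VOC (the paper uses a ResNet-34 backbone).
cfg.merge_from_file(model_zoo.get_config_file("PascalVOC-Detection/faster_rcnn_R_50_C4.yaml"))
cfg.MODEL.WEIGHTS = "pretrain.pkl"   # hypothetical path to converted self-supervised weights

# Schedule from section 4.1: lr 0.1, 200 warm-up iterations, decay at 18k/22.2k, 48k steps, batch 32.
cfg.SOLVER.BASE_LR = 0.1
cfg.SOLVER.WARMUP_ITERS = 200
cfg.SOLVER.STEPS = (18000, 22200)
cfg.SOLVER.MAX_ITER = 48000
cfg.SOLVER.IMS_PER_BATCH = 32

# Multi-scale training in [400, 800]; single scale 800 at inference.
cfg.INPUT.MIN_SIZE_TRAIN = (400, 800)
cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING = "range"
cfg.INPUT.MIN_SIZE_TEST = 800

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```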

4.7 Distilling BottleNeck Analysis

In this section we analyze the Distilling BottleNeck phenomenon; we use ResNet-50 as the teacher for simplicity.

Distilling BottleNeck Phenomenon. In the self-supervised distillation stage, we first tried to distill small models with the default MLP configuration of MoCo-V2 using DisCo; the results are shown in Table 4, denoted DisCo*. It is worth noting that the dimensions of the hidden layer in DisCo* are exactly the same as in SEED. Compared to SEED, DisCo* shows superior results on EfficientNet-B0, EfficientNet-B1, and MobileNet-v3-Large, and comparable results on ResNet-18 and ResNet-34. We then expand the dimension of the hidden layer in the MLP of the student to be consistent with that of the teacher, i.e., 2048-D; the results improve further, as recorded in the third row. In particular, this expansion brings gains of 3.5% and 3.6% for ResNet-18 and ResNet-34, respectively.

Table 4. Linear evaluation top-1 accuracy (%) on ImageNet. DisCo* means that the dimension of the hidden layer in the MLP is consistent with that of SEED.

Method | Eff-b0 | Eff-b1 | Mob-v3 | R-18 | R-34
SEED [18] | 61.3 | 61.4 | 55.2 | 57.6 | 58.5
DisCo* | 65.6 | 65.8 | 63.8 | 57.1 | 58.9
DisCo | 66.5 (+0.9) | 66.6 (+0.8) | 64.4 (+0.6) | 60.6 (+3.5) | 62.5 (+3.6)

Theoretical Analysis from the IB perspective. In Figure 7, on the downstream Cifar10 classification task, we visualize the compression phase of ResNet-18 and ResNet-34 with different hidden dimensions, distilled by the same teacher, in the information plane. When we adjust the hidden dimension in the MLP of ResNet-18 and ResNet-34 from 512-D to 2048-D, the value of I(X;T) becomes smaller while I(T;Y) is basically unchanged, which suggests that enlarging the hidden dimension makes the student model more generalizable in the setting of self-supervised transfer learning.

The effectiveness of the dimension. Here we further explore the impact of more dimensions, and the results are shown in Figure 9.


Figure 7. Mutual information paths from transition points to convergence points in the compression phase of training, in the I(X;T)–I(T;Y) plane, for MLP hidden dimensions of 512d and 2048d. T denotes transition points and C(X) denotes convergence points with X% top-1 accuracy on Cifar10: (a) ResNet-18, C(85.8) vs. C(89.1); (b) ResNet-34, C(83.4) vs. C(86.4). Points with similar I(T;Y) but smaller I(X;T) are better generalized.

It can be seen that as the dimension increases, the top-1 accuracy also increases, but when the dimension is already large the growth trend slows down. The performance trend of EfficientNet-B1 and MobileNet-v3-Large is close to that of EfficientNet-B0, so for the sake of clarity we do not show them in the figure.

4.8 More SSL Methods

Variants of the testbed. In this part, in order to demonstrate the versatility of our method, we further try two other SSL methods as the testbed: SwAV and DINO.

For SwAV, the teacher is backboned by ResNet-50 and the results are shown in Table 5. For models with very few parameters, EfficientNet-B0 and MobileNet-v3-Large, the pre-training results with SwAV alone are very poor; when DisCo is utilized, the efficacy is significantly improved.

Table 5. Linear evaluation top-1 accuracy (%) on ImageNet with SwAV as the testbed. The teacher of DisCo is an online open-source ResNet-50 model pre-trained using SwAV with top-1 accuracy 75.3%.

Method | Eff-b0 | Mob-v3 | R-34
SwAV [5] | 46.8 | 19.4 | 63.3
SwAV + DisCo | 62.4 | 55.7 | 63.3

For DINO, we use two vision transformer architectures, ViT and XCiT; the results are shown in Table 6. No matter which SSL method is adopted and which network architecture is used, DisCo brings significant gains. It is worth noting that XCiT-tiny has 26M parameters, which is much larger than ViT-tiny (5M), but DisCo can still narrow the gap with the teacher.

Variants of the teacher pre-training method. In order to verify that our method is not picky about the pre-training approach adopted by the teacher, we use three ResNet-50 networks pre-trained with different SSL methods as teachers, under the MoCo-V2 testbed.

Table 6. Linear evaluation top-1 accuracy (%) on ImageNet with DINO as the testbed.

Teacher (Model / Acc) | ViT-tiny | XCiT-tiny
- / - | 55 | 67
ViT-small [14, 43] / 77 | 68.4 (+13.4) | -
XCiT-small [16] / 77.8 | - | 71.1 (+4.1)

Table 7. Linear evaluation top-1 accuracy (%) on ImageNet with variants of teacher pre-training methods. All the teachers are ResNet-50, and the first row is the student trained by MoCo-V2 directly without distillation (the baseline).

Teacher Method | Teacher Acc | Eff-b0 | Eff-b1 | Mob-v3 | R-18 | R-34
- | - | 46.8 | 48.4 | 36.2 | 52.2 | 56.8
MoCo-V2 [10] | 67.4 | 66.5 | 66.6 | 64.4 | 60.6 | 62.5
SeLa-V2 [2] | 71.8 | 62.2 | 68.2 | 66.2 | 64.1 | 65.3
SwAV [5] | 75.3 | 70.0 | 72.1 | 65.0 | 65.1 | 67.5

It can be observed from Table 7 that when using different pre-trained ResNet-50 models as teachers, DisCo significantly boosts all the results of the small models. Furthermore, as the teacher improves through different and stronger pre-training methods, the results of the student improve further.

4.9 Visualization Analysis

In Figure 8, we visualize the learned representations of EfficientNet-B0/ResNet-50 pre-trained by MoCo-V2 and of EfficientNet-B0 distilled by ResNet-50 using DisCo. For clarity, we randomly select 10 classes from the ImageNet test set and map the learned representations to a two-dimensional space by t-SNE [44]. ResNet-50 forms more separated clusters than EfficientNet-B0 when using MoCo-V2 alone, and after using ResNet-50 to teach EfficientNet-B0 with DisCo, EfficientNet-B0 behaves very much like the teacher.
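A minimal sketch of this kind of visualization, assuming the embeddings of the selected classes have already been extracted into NumPy arrays; the perplexity and plotting details are illustrative choices, not the paper's exact setup.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings: np.ndarray, labels: np.ndarray, out_path: str = "tsne.png") -> None:
    """Project (N, D) embeddings of a handful of classes to 2-D with t-SNE and scatter-plot them."""
    coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(embeddings)
    plt.figure(figsize=(5, 5))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=4)
    plt.xlabel("Dim 1")
    plt.ylabel("Dim 2")
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
```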

5 Conclusion

In this paper we propose Distilled Contrastive Learning (DisCo) to remedy self-supervised learning on lightweight models. The proposed method constrains the final embedding of the lightweight student to be consistent with that of the teacher in order to maximally transmit the teacher's knowledge. DisCo is not limited to specific contrastive learning methods, and it can bring the performance of the student very close to that of the teacher.

Figure 8. Clustering results on the ImageNet test set, visualized by t-SNE (Dim 1 vs. Dim 2). Different colors represent different classes. Panels: (a) Eff-b0 (MoCo-V2), (b) R-50 (MoCo-V2), (c) Eff-b0 (DisCo, distilled by ResNet-50).

Figure 9. Linear evaluation top-1 accuracy on ImageNet of the student with different dimensions of the hidden layer in the projection head (512, 1024, 1280, 2048) during distillation, for EfficientNet-B0, ResNet-18, and ResNet-34; the leftmost point is the original dimension used in MoCo-V2.

References

[1] Soroush Abbasi Koohpayegani, Ajinkya Tejankar, and Hamed Pirsiavash. CompRess: Self-supervised learning by compressing representations. In NeurIPS, pages 12980–12992, 2020.

[2] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.
[3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, pages 132–149, 2018.
[4] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, pages 2959–2968, 2019.
[5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, pages 9912–9924, 2020.
[6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. 2021.
[7] Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. Cross-layer distillation with semantic calibration. 2020.
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.
[9] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, pages 22243–22255, 2020.
[10] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. In CVPR, pages 9729–9738, 2020.
[11] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.
[12] Hao Cheng, Dongze Lian, Shenghua Gao, and Yanlin Geng. Evaluating capability of deep neural networks for image classification via information plane. In ECCV, pages 168–182, 2018.
[13] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[15] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. Volume 38, pages 1734–1747, 2015.
[16] Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. XCiT: Cross-covariance image transformers. 2021.
[17] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge. Volume 88, pages 303–338, 2010.
[18] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. SEED: Self-supervised distillation for visual representation. In ICLR, 2021.
[19] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, pages 21271–21284, 2020.
[20] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.
[21] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[23] Olivier Hénaff. Data-efficient image recognition with contrastive predictive coding. In ICML, pages 4182–4192, 2020.
[24] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPSW, 2015.
[25] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In ICCV, pages 1314–1324, 2019.
[26] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In CVPR, pages 1920–1929, 2019.
[27] Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[28] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? In CVPR, pages 2661–2671, 2019.
[29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Citeseer, 2009.
[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[31] Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. Knowledge distillation via instance relationship graph. In CVPR, pages 7096–7104, 2019.
[32] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84, 2016.
[33] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. 2018.
[34] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, pages 3967–3976, 2019.
[35] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Volume 39, pages 1137–1149, 2015.
[37] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2014.
[38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. Volume 115, pages 211–252, 2015.
[39] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. 2017.
[40] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114, 2019.
[41] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In ICLR, 2020.
[42] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. 2000.
[43] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
[44] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Volume 9, 2008.
[45] Jinpeng Wang, Yuting Gao, Ke Li, Xinyang Jiang, Xiaowei Guo, Rongrong Ji, and Xing Sun. Enhancing unsupervised video representation learning by decoupling the scene and the motion. 2020.
[46] Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J. Ma, and Xing Sun. Removing the background by adding the background: Towards background robust self-supervised video representation learning. 2020.
[47] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[48] Guodong Xu, Ziwei Liu, Xiaoxiao Li, and Chen Change Loy. Knowledge distillation meets self-supervision. In ECCV, pages 588–604, 2020.
[49] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
[50] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. 2021.


Page 6: 12 1 arXiv:2104.09124v3 [cs.CV] 9 Aug 2021

Table 1 ImageNet test accuracy () using linear classification on different student architectures diams denotes the teacherstudent modelsare pre-trained with MoCo-V2 which is our implementation and daggermeans the teacher is pre-trained by SwAV which is an open-sourcemodel When using R502 as the teacher SEED distills 800 epochs while DisCo distills 200 epochs Subscript in green represents theimprovement compared to MoCo-V2 baseline

Method TS Eff-b0 Eff-b1 Mob-v3 R-18 R-34

T-1 T-5 T-1 T-5 T-1 T-5 T-1 T-5 T-1 T-5Supervised 771 933 792 944 752 - 721 - 750 -

Self-supervisedMoCo-V2 (Baseline)diams 468 722 484 738 362 621 522 776 568 814

SSL DistillationSEED[18] R-50 (674) 613 827 614 831 552 803 576 818 585 826

DisCo (ours) R-50 (674)diams 665(197uarr)

876(154uarr)

666(182uarr)

875(137uarr)

644(282uarr)

862(241uarr)

606(84uarr)

837(61uarr)

625(57uarr)

854(40uarr)

SEED [18] R-101 (703) 630 838 634 846 599 835 589 825 616 849DisCo (ours) R-101 (691)diams 689

(221uarr)889

(167uarr)690

(206uarr)891

(153uarr)657

(295uarr)867

(246uarr)623

(101uarr)851

(75uarr)644

(76uarr)865

(51uarr)SEED [18] R-152 (742) 653 860 673 869 614 846 595 833 627 858

DisCo (ours) R-152 (741)diams 678(210uarr)

870(148uarr)

731(247uarr)

912(174uarr)

637(275uarr)

849(228uarr)

655(133uarr)

867(91uarr)

681(113uarr)

886(72uarr)

SEED [18] R502 (773dagger) 676 874 680 876 682 882 630 849 657 868DisCo (ours) R502 (773)dagger 691

(223uarr)889

(177uarr)640

(156uarr)846

(108uarr)589

(227uarr)814

(193uarr)652(13uarr)

868(92uarr)

676(108uarr)

886(72uarr)

for 200 epochs and ResNet-152 is trained for 400 epochsResNet-502 is pre-trained by SwAV which is an open-source model 1 and trained for 800 epochsSelf-supervised Distillation Setting The projection headof all the student networks has two linear layers with the di-mension being 2048 and 128 The configuration of learningrate and optimizer is set the same as MoCo-V2 and withouta specific statement the model is trained for 200 epochsFurthermore during the distillation stage the teacher isfrozenStudent Fine-tuning Setting For linear evaluation onImageNet the student is fine-tuned for 100 epochs Ini-tial learning rate is 3 for EfficientNet-B0EfficientNet-B1MobileNet-v3-Large and 30 for ResNet-18ResNet-34For linear evaluation on Cifar10 and Cifar100 the initiallearning rate is 3 and all the models are fine-tuned for 100epochs SGD is adopted as the optimizer and the learningrate is decreased by 10 at 60 and 80 epochs for both lin-ear evaluation For downstream detection and segmentationtasks following [20 10 18] all parameters are fine-tunedFor the detection task on VOC initial learning rate is 01with 200 warm-up iterations and decays by 10 at 18k 222ksteps The detector is trained for 48k steps with a batch sizeof 32 on 8 V100 GPUs Following [18] the scales of imagesare randomly sampled from [400 800] during the trainingand is 800 at the inference For the detection and instancesegmentation tasks on COCO the model is trained for 180kiterations with the initial learning rate 011 and the scales ofimages are randomly sampled from [600 800] during thetraining

1httpsgithubcomfacebookresearchswav

42 Linear Evaluation

We conduct linear evaluation on ImageNet to validate theeffectiveness of our method As shown in Table 1 studentmodels distilled by DisCo outperform the counterparts pre-trained by MoCo-V2 (Baseline) with a large margin Be-sides DisCo surpasses the state-of-the-art SEED over var-ious student models with teacher ResNet-50101152 un-der the same setting especially on MobileNet-v3-Largedistilled by ResNet-50 with a difference of 92 at top-1accuracy When using R502 as the teacher SEED dis-tills 800 epochs while DisCo still distills 200 epochs butthe results of EfficientNet-B0 ResNet-18 and ResNet-34using DisCo also exceed that of SEED The performanceon EfficientNet-B1 and MobileNet-v3-Large is closely re-lated to the epochs of distillation For example whenEfficientNet-B1 is distilled for 290 epochs the top-1 ac-curacy becomes 704 which surpasses SEED and whenMobileNet-v3-Large is distilled for 340 epochs the top-1accuracy becomes 64 We believe that when DisCo dis-tills 800 epochs the results will be further improved More-over since CompRess uses a better teacher which trained600 epochs longer and distills 400 epochs longer than SEEDand ours itrsquos not fair to compare thus we do not report theresult in the table In addition when DisCo uses a largermodel as the teacher the student will be further improvedFor instance using ResNet-152 instead of ResNet-50 asthe teacher ResNet-34 is improved from 625 to 681Itrsquos worth noting when using ResNet-101ResNet-50 as theteacher the linear evaluation result of EfficientNet-B0 isvery close to the teacher 689 vs 691 and 665 vs674 while the number of parameters of EfficientNet-B0

6

0 10 20 30 40 50 60 70

30

40

50

60

70

Top-

1 Ac

cura

cy(

)

R-50R-101

R-152+166

+241

+149

+202

+265

+174

+265

+254

+214

(a) EfficientNet-B0

100101

0 10 20 30 40 50 60

30

40

50

60

70

R-50R-101 R-152

+136

+99

+222

+167

+119

+235

+177

+130

+203

(b) MobileNet-v3-Large

100101

0 10 20 30 40 50 60

30

40

50

60

70

R-50R-101

R-152+141

+86

+184

+169

+103

+201

+220

+131

+233(c) ResNet-18

100101

Numbers of parameters of Teacher (Millions)

Figure 5 ImageNet top-1 accuracy () of semi-supervised linear evaluation with 1 10 and 100 training data Points where thenumber of teacher network parameters are 0 are the results of the MoCo-V2 baseline without distillation

R-18 Eff-b070

75

80

85

90

95

Top-

1 Ac

cura

cy(

)

R-50R-101 R-152

R-50 R-101 R-152

(a) Cifar10

R-18 Eff-b040

50

60

70

80

Top-

1 Ac

cura

cy(

)

R-50R-101 R-152

R-50 R-101R-152

(b) Cifar100

MoCo-V2SEEDOurs

Figure 6 Top-1 accuracy of students transferred to Cifar10Cifar100 without and with distillation from different teachers

is only 94163 of ResNet-101ResNet-50

43 Semi-supervised Linear Evaluation

Following previous works [26 28 33] we evaluate ourmethod under the semi-supervised setting Two 1 and10 sampled subsets of ImageNet training data (˜128 and˜128 images per class respectively) [8] are used for fine-tuning the student models As is shown in Figure 5 stu-dent models distilled by DisCo outperform baseline underany amount of labeled data Furthermore DisCo also showsthe consistency under different fractions of annotations thatis students always benefit from larger models as teachersMore labels will be helpful to improve the final performanceof the student model which is expected

44 Comparison against other Distillation

In order to verify the effectiveness of the proposedmethod we compare with three widely used distillationschemes namely 1) Attention transfer denoted by AT[49] 2) Relational knowledge distillation denoted by RKD[34] 3) Knowledge distillation denoted by KD [24] ATand RKD are feature-based and relation-based respec-tively which can be utilized during the self-supervised pre-training stage KD is a logits-based method which can onlybe used at the supervised fine-tuning stage The compari-son results are shown in Table 2 Singe-Knowledge meansusing one of these approaches individually and it can beseen that all distillation approaches can bring improvement

to the baseline but the gain from DisCo is the most signif-icant which indicates the knowledge that DisCo chosen totransfer and the way of transmission is indeed more effec-tive Then we also try to transfer multi-knowledge fromteacher to student by combining DisCo with other distil-lation schemes It can be seen that integrating DisCo withATRKDKD can boost the performance a lot which furtherproves the effectiveness of DisCo

Table 2 Linear evaluation top-1 accuracy () on ImageNet com-pared with different distillation methods

Method Eff-b0 Eff-b1 Mob-v3 R-18 R-34Baseline

MoCo-V2 [11] 468 484 362 522 568Single-Knowledge

AT [49] 571 582 510 562 602RKD [34] 483 503 369 564 587KD [24] 465 485 373 515 588

DisCo (ours) 665 666 644 606 625Multi-Knowledge

AT + DisCo 667 663 641 602 623RKD + DisCo 668 665 644 606 623KD + DisCo 658 659 652 606 659

45 Transfer to Cifar10Cifar100

In order to analyze the generalization of representationsobtained by DisCo we further conduct linear evaluation onCifar10 and Cifar100 with ResNet-18EfficientNet-B0 as

7

Table 3 Object detection and instance segmentation results with ResNet-34 as backbone bounding-box AP (AP bb) and mask AP(APmk) are evaluated on VOC07 test and COCO val2017 Daggermeans our implementation Subscript in green represents the improve-ment compared to MoCo-V2 baseline

S T MethodObject Detection Instance Segmentation

VOC COCO COCOAP bb AP bb

50 AP bb75 AP bb AP bb

50 AP bb75 APmk APmk

50 APmk75

R-34

times MoCo-V2Dagger 536 791 587 381 568 407 330 532 353

R-50SEED [18] 537 794 592 384 570 410 333 532 353

DisCo (ours) 565(29uarr)

806(15uarr)

625(38uarr)

400(19uarr)

591(23uarr)

434(27uarr)

349(19uarr)

563(31uarr)

371(18uarr)

R-101SEED [18] 541 798 591 385 573 414 336 541 356

DisCo (ours) 561(25uarr)

803(12uarr)

618(31uarr)

400(19uarr)

591(23uarr)

432(25uarr)

347(19uarr)

559(27uarr)

374(18uarr)

R-152SEED [18] 544 801 599 384 570 410 333 537 353

DisCo (ours) 566(30uarr)

808(17uarr)

634(57uarr)

394(13uarr)

587(19uarr)

427(20uarr)

344(14uarr)

554(22uarr)

367(14uarr)

student and ResNet-50ResNet101ResNet152 as teacherSince the image resolution of Cifar dataset is 32times32 all theimages are resized to 224 times 224 with bicubic re-samplingbefore feeding into the model following [11 18] The re-sults are shown in Figure 6 it can be seen that the pro-posed DisCo surpasses the MoCo-V2 baseline by a largemargin with different student and teacher architectures onboth Cifar10 and Cifar100 In addition our method alsohas a significant improvement compared the-state-of-art ap-proach SEED It is worth noting that as the teacher becomesbetter the improvement brought by DisCo is more obvious

46 Transfer to Detection and Segmentation

We also conduct experiments on detection and segmen-tation tasks for generalization analysis C4 based FasterR-CNN [36] are used for objection detection on VOC andMask R-CNN [21] are used for objection detection and in-stance segmentation on COCO In this part of experimentfollowing [10 18] all the parameters of the student networkare learnable and the implementation is based on detec-tron2 [47] for convenience The results of using ResNet-34as the student with teacher ResNet-50ResNet-101ResNet-152 are shown in Table 3 On the object detection ourmethod can bring obvious improvement on both VOC andCOCO datasets Furthermore as SEED [18] claimed theimprovement on COCO is relatively minor compared toVOC since COCO training dataset has 118k images whileVOC has only 165k training images thus the gain broughtby weight initialization is relatively small On the instancesegmentation task DisCo also shows superiority

47 Distilling BottleNeck Analysis

In this section we analyze the Distilling BottleNeck Phe-nomenon and we use ResNet-50 as the teacher for simplic-ityDistilling BottleNeck Phenomenon In the self-supervised

distillation stage we first tried to distill small models withthe default MLP configuration of MoCo-V2 using DisCoand the results are shown in Table 4 denoted by DisColowastIt is worth noting that the dimensions of the hidden layerin DisColowast are exactly as same as SEED It can be seenthat compared to SEED DisColowast shows superior results onEfficientNet-B0 EfficientNet-B1 and MobileNet-v3-Largeand has comparable results on ResNet-18 and ResNet-34Then we expand the dimension of the hidden layer in theMLP of the student to be consistent with that of the teacherthat is 2048D it can be seen that the results can be furtherimproved which is recorded in the third row In particularthis expansion operation brings 35 and 36 gains forResNet-18 and ResNet-34 respectively

Table 4 Linear evaluation top-1 accuracy () on ImageNetDisCo means that the dimension of hidden layer in the MLP isconsistent with that of SEED

Method Eff-b0 Eff-b1 Mob-v3 R-18 R-34SEED [18] 613 614 552 576 585

DisCo 656 658 638 571 589DisCo 665(09uarr) 666(08uarr) 644(06uarr) 606(35uarr) 625(36uarr)

Theoretical Analysis from IB perspective In Figure 7on the downstream Cifar10 classification task we visualizethe compression phase of ResNet-18 and ResNet-34 withdifferent hidden dimensions distilled by the same teacherin the information plane It can be seen that when we ad-just the hidden dimension in the MLP of ResNet-18 andResNet-34 from 512D to 2048D the value of I(XT ) be-comes smaller while I(T Y ) is basically unchanged whichsuggests that enlarging the hidden dimension can makethe student model more generalized in the setting of self-supervised transfer learningThe effectiveness of the dimension Here we further ex-plore the impact of more dimensions and the results areshown in Figure 9 It can be seen that as the dimension in-

8

80 85 90 950

1

2

3

4

5

I (T

Y) C(858) T

C (891) T

(a) ResNet-18

80 85 90 95

C (834) T

C (864) T

(b) ResNet-34

MLP (512d) MLP (2048d)I (XT)

Figure 7 Mutual information paths from transition points to con-vergence points in the compression phase of training T denotestransition points and C(X) denotes convergent points with Xtop-1 accuracy on Cifar10 Points with similar I(TY) but smallerI(XT) are better generalized

creases the top-1 accuracy also increases but when the di-mension is already large the growth trend will slow downThe performance trend of EfficientNet-B1 and MobileNet-v3-Large is close to that of EfficientNet-B0 so for the sakeof clarity we do not exhibit it in the figure

48 More SSL Methods

Variants of the testbed In this part in order to demon-strate the versatility of our method we further tried othertwo SSL methods as the testbed SwAV and DINO

For SwAV the teacher is backboned by ResNet-50 andthe results are shown in Table 5 it can be seen that formodels with very few parameters EfficientNet-B0 andMobileNet-v3-Large the pre-training results with SwAVare also very poor When DisCo is utilized the efficacyis significantly improved

Table 5 Linear evaluation top-1 accuracy () on ImageNet withSwAV as the testbed The teacher of DisCo is an online opensource ResNet-50 model pre-trained using SwAV with top-1 accu-racy 753

Method Eff-b0 Mob-v3 R-34SwAV [5] 468 194 633

SwAV + DisCo 624 557 633

For DINO use two vision transformer architectures ViTand XCiT and the results are shown in Table 6 It can beseen that no matter what SSL method is adopted and whatarchitecture is used in the network DisCo can bring sig-nificant gains It is worth noting that XCiT-tiny has 26Mparameters which is much larger than ViT-tiny(5M) butDisCo can still narrow the gap with the teacherVariants of teacher pre-training method In order toverify that our method is not picky about the pre-trainingapproach that the teacher adopted we use three ResNet-50 networks pre-trained with different SSL methods as theteacher under the testbed of MoCo-V2 It can be observed

Table 6 Linear evaluation top-1 accuracy () on ImageNet withDINO as testbed

Teacher StudentModel Acc ViT-tiny XCiT-tiny

- - 55 67ViT-small [14 43] 77 684(134uarr) -XCiT-small [16] 778 - 711(41uarr)

Table 7 Linear evaluation top-1 accuracy () on ImageNet withvariants of teacher pre-training methods All the teachers areResNet-50 and the first row is student trained by MoCo-V2 di-rectly without distillation which is the baseline

Teacher StudentMethod Acc Eff-b0 Eff-b1 Mob-v3 R-18 R-34

- - 468 484 362 522 568MoCo-V2 [10] 674 665 666 644 606 625

SeLa-V2 [2] 718 622 682 662 641 653SwAV [5] 753 700 721 650 651 675

from Table 7 that when using different pre-trained ResNet-50 as teachers DisCo can significantly boost all the resultsof small models Furthermore with the improvement of theteachers using different and stronger pre-training methodsthe results of the student can be further improved

49 Visualization Analysis

In Figure 8 we visualize the learned representations ofEfficientNet-B0ResNet-50 pretrained by MoCo-V2 andEfficientNet-B0 distilled by ResNet-50 using DisCo Forclarity we randomly select 10 classes from ImageNet testset and map the learned representations to two dimensionalspace by t-SNE [44] It can be observed that ResNet-50forms more separated clusters than EfficientNet-B0 whenusing MoCo-V2 alone and after using ResNet-50 to teachEfficientNet-B0 with DisCo EfficientNet-B0 performs verymuch like the teacher

5 Conclusion

In this paper we propose Distilled Contrastive Learning(DisCo) to remedy self-supervised learning on lightweightmodels The proposed method constraints the final embed-ding of the lightweight student to be consistent with that ofthe teacher to maximally transmit the teacherrsquos knowledgeDisCo is not limited to specific contrastive learning meth-ods and can remedy the effect of the student to be very closeto the teacher

References[1] Soroush Abbasi Koohpayegani Ajinkya Tejankar and

Hamed Pirsiavash Compress Self-supervised learning bycompressing representations In NeurIPS pages 12980ndash12992 2020 2 3

9

30 20 10 0 10 20 30 40Dim 1

30

20

10

0

10

20

Dim

2

(a) Eff-b0 (MoCo-V2)

30 20 10 0 10 20 30 40Dim 1

(b) R-50 (MoCo-V2)

30 20 10 0 10 20 30 40Dim 1

(c) Eff-b0 (DisCo distilled by ResNet-50)

Figure 8 Clustering results on the ImageNet test set Different colors represent different classes

0 512 1024 1280 2048Dimension of hidden layer

5254565860626466

Top-

1 Ac

cura

cy(

)

+46+38

+32

+67

+62

+57

+74+65

+63

+81+74

+72

EfficientNet-B0ResNet-18ResNet-34

Figure 9 Linear evaluation top-1 accuracy on ImageNet of studentprojection head with different dimensions of hidden layer duringdistillation represents the original dimension used in MoCo-V2

[2] Yuki Markus Asano Christian Rupprecht and AndreaVedaldi Self-labelling via simultaneous clustering and rep-resentation learning In ICLR 2020 3 9

[3] Mathilde Caron Piotr Bojanowski Armand Joulin andMatthijs Douze Deep clustering for unsupervised learningof visual features In ECCV pages 132ndash149 2018 3

[4] Mathilde Caron Piotr Bojanowski Julien Mairal and Ar-mand Joulin Unsupervised pre-training of image featureson non-curated data In ICCV pages 2959ndash2968 2019 3

[5] Mathilde Caron Ishan Misra Julien Mairal Priya Goyal Pi-otr Bojanowski and Armand Joulin Unsupervised learn-ing of visual features by contrasting cluster assignments InNeurIPS pages 9912ndash9924 2020 3 9

[6] Mathilde Caron Hugo Touvron Ishan Misra Herve JegouJulien Mairal Piotr Bojanowski and Armand Joulin Emerg-ing properties in self-supervised vision transformers 20213

[7] Defang Chen Jian-Ping Mei Yuan Zhang Can Wang ZheWang Yan Feng and Chun Chen Cross-layer distillationwith semantic calibration 2020 3



Figure 5: ImageNet top-1 accuracy (%) of semi-supervised linear evaluation with 1%, 10%, and 100% of the training data for (a) EfficientNet-B0, (b) MobileNet-v3-Large, and (c) ResNet-18 students. The x-axis is the number of teacher network parameters (millions); points where the number of teacher parameters is 0 are the results of the MoCo-V2 baseline without distillation.

Figure 6: Top-1 accuracy of students (ResNet-18, EfficientNet-B0) transferred to (a) Cifar10 and (b) Cifar100, without and with distillation from different teachers (R-50/R-101/R-152), comparing MoCo-V2, SEED, and ours.

The number of parameters of EfficientNet-B0 is only 9.4%/16.3% of that of ResNet-101/ResNet-50.

4.3 Semi-supervised Linear Evaluation

Following previous works [26, 28, 33], we evaluate our method under the semi-supervised setting. Two sampled subsets of the ImageNet training data, containing 1% and 10% of the labels (~12.8 and ~128 images per class, respectively) [8], are used for fine-tuning the student models. As shown in Figure 5, student models distilled by DisCo outperform the baseline under any amount of labeled data. Furthermore, DisCo is consistent across different fractions of annotations: students always benefit from larger models as teachers. As expected, more labels further improve the final performance of the student model.
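For concreteness, the sketch below shows one way such class-balanced 1%/10% label subsets could be drawn for fine-tuning; the helper name, the random seed, and the torchvision-style `.targets` attribute used in the commented example are illustrative assumptions rather than the authors' released protocol.

```python
import numpy as np
from torch.utils.data import Subset

def label_fraction_indices(labels, fraction, seed=0):
    """Return indices of a class-balanced subset covering ~`fraction` of each class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picked = []
    for c in np.unique(labels):
        cls_idx = np.flatnonzero(labels == c)
        k = max(1, int(round(fraction * len(cls_idx))))
        picked.extend(rng.choice(cls_idx, size=k, replace=False))
    return np.sort(np.asarray(picked))

# Hypothetical usage with a torchvision-style ImageNet dataset exposing `.targets`:
# subset_1pct  = Subset(imagenet_train, label_fraction_indices(imagenet_train.targets, 0.01))
# subset_10pct = Subset(imagenet_train, label_fraction_indices(imagenet_train.targets, 0.10))
```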

4.4 Comparison against Other Distillation Methods

To verify the effectiveness of the proposed method, we compare it with three widely used distillation schemes: 1) attention transfer, denoted AT [49]; 2) relational knowledge distillation, denoted RKD [34]; and 3) knowledge distillation, denoted KD [24]. AT and RKD are feature-based and relation-based, respectively, and can be applied during the self-supervised pre-training stage; KD is logits-based and can only be used at the supervised fine-tuning stage. The comparison results are shown in Table 2. Single-Knowledge means using one of these approaches individually: all distillation approaches improve over the baseline, but the gain from DisCo is the most significant, which indicates that the knowledge DisCo chooses to transfer, and the way it is transmitted, are indeed more effective. We then transfer multi-knowledge from teacher to student by combining DisCo with the other distillation schemes. Integrating DisCo with AT/RKD/KD boosts performance considerably, which further demonstrates the effectiveness of DisCo.
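As a rough sketch of how such multi-knowledge training could be wired together, the snippet below adds distillation terms on top of a contrastive loss: a mean-squared-error consistency between normalized student and teacher embeddings stands in for the DisCo objective, and the attention-transfer term and weighting factors are illustrative assumptions, not the exact losses used in the paper.

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    # AT-style spatial attention: mean of squared activations over channels, L2-normalized per sample.
    return F.normalize(feat.pow(2).mean(dim=1).flatten(1), dim=1)

def multi_knowledge_loss(contrastive_loss, s_embed, t_embed,
                         s_feat=None, t_feat=None,
                         lambda_embed=1.0, lambda_at=0.0):
    """Contrastive objective + embedding-consistency term, with an optional AT term."""
    # Consistency between the final projected embeddings of student and teacher
    # (the teacher is frozen, so its outputs are detached).
    loss = contrastive_loss + lambda_embed * F.mse_loss(
        F.normalize(s_embed, dim=1), F.normalize(t_embed.detach(), dim=1))
    # Optional attention-transfer term on intermediate feature maps of matching spatial size.
    if s_feat is not None and t_feat is not None and lambda_at > 0:
        loss = loss + lambda_at * F.mse_loss(attention_map(s_feat),
                                             attention_map(t_feat.detach()))
    return loss
```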

Table 2: Linear evaluation top-1 accuracy (%) on ImageNet compared with different distillation methods.

Method            Eff-b0  Eff-b1  Mob-v3  R-18   R-34
Baseline
  MoCo-V2 [11]    46.8    48.4    36.2    52.2   56.8
Single-Knowledge
  AT [49]         57.1    58.2    51.0    56.2   60.2
  RKD [34]        48.3    50.3    36.9    56.4   58.7
  KD [24]         46.5    48.5    37.3    51.5   58.8
  DisCo (ours)    66.5    66.6    64.4    60.6   62.5
Multi-Knowledge
  AT + DisCo      66.7    66.3    64.1    60.2   62.3
  RKD + DisCo     66.8    66.5    64.4    60.6   62.3
  KD + DisCo      65.8    65.9    65.2    60.6   65.9

4.5 Transfer to Cifar10/Cifar100

To analyze the generalization of the representations obtained by DisCo, we further conduct linear evaluation on Cifar10 and Cifar100 with ResNet-18/EfficientNet-B0 as the student and ResNet-50/ResNet-101/ResNet-152 as the teacher.


Table 3: Object detection and instance segmentation results with ResNet-34 as the backbone. Bounding-box AP (AP^bb) and mask AP (AP^mk) are evaluated on VOC07 test and COCO val2017. † denotes our implementation. Numbers in parentheses with ↑ indicate the improvement over the MoCo-V2 baseline.

S     T      Method         Object Detection (VOC)              Object Detection (COCO)             Instance Segmentation (COCO)
                            AP^bb      AP^bb_50   AP^bb_75      AP^bb      AP^bb_50   AP^bb_75      AP^mk      AP^mk_50   AP^mk_75
R-34  --     MoCo-V2†       53.6       79.1       58.7          38.1       56.8       40.7          33.0       53.2       35.3
R-34  R-50   SEED [18]      53.7       79.4       59.2          38.4       57.0       41.0          33.3       53.2       35.3
R-34  R-50   DisCo (ours)   56.5(2.9↑) 80.6(1.5↑) 62.5(3.8↑)    40.0(1.9↑) 59.1(2.3↑) 43.4(2.7↑)    34.9(1.9↑) 56.3(3.1↑) 37.1(1.8↑)
R-34  R-101  SEED [18]      54.1       79.8       59.1          38.5       57.3       41.4          33.6       54.1       35.6
R-34  R-101  DisCo (ours)   56.1(2.5↑) 80.3(1.2↑) 61.8(3.1↑)    40.0(1.9↑) 59.1(2.3↑) 43.2(2.5↑)    34.7(1.9↑) 55.9(2.7↑) 37.4(1.8↑)
R-34  R-152  SEED [18]      54.4       80.1       59.9          38.4       57.0       41.0          33.3       53.7       35.3
R-34  R-152  DisCo (ours)   56.6(3.0↑) 80.8(1.7↑) 63.4(5.7↑)    39.4(1.3↑) 58.7(1.9↑) 42.7(2.0↑)    34.4(1.4↑) 55.4(2.2↑) 36.7(1.4↑)

Since the image resolution of the Cifar datasets is 32×32, all images are resized to 224×224 with bicubic re-sampling before being fed into the model, following [11, 18]. The results are shown in Figure 6: the proposed DisCo surpasses the MoCo-V2 baseline by a large margin with different student and teacher architectures on both Cifar10 and Cifar100. In addition, our method also improves significantly over the state-of-the-art approach SEED. It is worth noting that the better the teacher, the larger the improvement brought by DisCo.
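A minimal sketch of this transfer protocol, assuming a torchvision pipeline: Cifar images are upsampled to 224×224 with bicubic interpolation as described above, the distilled backbone is frozen, and only a linear classifier is trained on top. The backbone is assumed to output a flat feature vector of size `feat_dim`; the augmentation and any optimizer choices are placeholders, not the paper's exact recipe.

```python
import torch.nn as nn
from torchvision import datasets, transforms

# Resize 32x32 Cifar images to 224x224 with bicubic re-sampling before feeding the model.
train_tf = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
cifar10_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=train_tf)

def linear_probe(backbone, feat_dim, num_classes=10):
    """Freeze the distilled backbone and attach a trainable linear classifier."""
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()
    return nn.Sequential(backbone, nn.Linear(feat_dim, num_classes))
```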

4.6 Transfer to Detection and Segmentation

We also conduct experiments on detection and segmentation tasks for generalization analysis. A C4-based Faster R-CNN [36] is used for object detection on VOC, and Mask R-CNN [21] is used for object detection and instance segmentation on COCO. In this set of experiments, following [10, 18], all parameters of the student network are learnable, and the implementation is based on detectron2 [47] for convenience. The results of using ResNet-34 as the student with ResNet-50/ResNet-101/ResNet-152 teachers are shown in Table 3. On object detection, our method brings an obvious improvement on both the VOC and COCO datasets. Furthermore, as SEED [18] notes, the improvement on COCO is relatively minor compared to VOC: the COCO training set has 118k images while VOC has only 16.5k training images, so the gain brought by weight initialization is relatively small. On the instance segmentation task, DisCo also shows its superiority.
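The sketch below indicates how such fine-tuning might be launched with detectron2 [47]; the R-50 C4 Mask R-CNN config is used as a stand-in (a ResNet-34 backbone would need its own config), and the checkpoint path and solver settings are assumptions for illustration rather than the paper's exact setup.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
# C4-based Mask R-CNN config; shown with the R-50 backbone as a stand-in for ResNet-34.
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_C4_1x.yaml"))
# Initialize the backbone from the distilled self-supervised checkpoint,
# converted to detectron2's weight format (hypothetical path).
cfg.MODEL.WEIGHTS = "disco_student_converted.pkl"
cfg.DATASETS.TRAIN = ("coco_2017_train",)
cfg.DATASETS.TEST = ("coco_2017_val",)
cfg.SOLVER.IMS_PER_BATCH = 16

trainer = DefaultTrainer(cfg)  # all parameters stay learnable during fine-tuning
trainer.resume_or_load(resume=False)
# trainer.train()
```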

4.7 Distilling BottleNeck Analysis

In this section we analyze the Distilling BottleNeck phenomenon; for simplicity, we use ResNet-50 as the teacher.

Distilling BottleNeck phenomenon. In the self-supervised distillation stage, we first tried to distill small models with the default MLP configuration of MoCo-V2 using DisCo; the results are shown in Table 4, denoted by DisCo*. Note that the hidden-layer dimensions in DisCo* are exactly the same as in SEED. Compared to SEED, DisCo* shows superior results on EfficientNet-B0, EfficientNet-B1, and MobileNet-v3-Large, and comparable results on ResNet-18 and ResNet-34. We then expand the dimension of the hidden layer in the MLP of the student to be consistent with that of the teacher, i.e., 2048-D; the results are further improved, as recorded in the third row. In particular, this expansion brings 3.5% and 3.6% gains for ResNet-18 and ResNet-34, respectively.
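To make the expansion concrete, here is a minimal sketch of a MoCo-V2-style two-layer projection head in which only the hidden width is changed; the 512-D default for ResNet-18 and the widened 2048-D variant mirror the setting described above, while the module itself is an illustrative re-implementation, not the released code.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Two-layer MLP head: encoder feature -> hidden layer -> 128-D contrastive embedding."""
    def __init__(self, in_dim, hidden_dim, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.mlp(x)

# Default student head (ResNet-18 feature dim 512) vs. the widened variant whose
# hidden layer matches the 2048-D hidden layer of the ResNet-50 teacher.
head_default = ProjectionHead(in_dim=512, hidden_dim=512)
head_widened = ProjectionHead(in_dim=512, hidden_dim=2048)
```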

Table 4: Linear evaluation top-1 accuracy (%) on ImageNet. DisCo* means that the dimension of the hidden layer in the MLP is consistent with that of SEED.

Method      Eff-b0      Eff-b1      Mob-v3      R-18        R-34
SEED [18]   61.3        61.4        55.2        57.6        58.5
DisCo*      65.6        65.8        63.8        57.1        58.9
DisCo       66.5(0.9↑)  66.6(0.8↑)  64.4(0.6↑)  60.6(3.5↑)  62.5(3.6↑)

Theoretical analysis from the IB perspective. In Figure 7, on the downstream Cifar10 classification task, we visualize the compression phase of ResNet-18 and ResNet-34 with different hidden dimensions, distilled by the same teacher, in the information plane. When we enlarge the hidden dimension in the MLP of ResNet-18 and ResNet-34 from 512-D to 2048-D, the value of I(X;T) becomes smaller while I(T;Y) is basically unchanged, which suggests that enlarging the hidden dimension makes the student model generalize better in the setting of self-supervised transfer learning.
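For reference, the information-plane reading above follows the standard information bottleneck (IB) objective of [42]; a compact statement is given below, where X is the input, Y the target, T the learned representation, and β the usual trade-off coefficient. The Lagrangian form is the textbook formulation, not something specific to DisCo.

```latex
% Information Bottleneck objective [42]: compress X into T while keeping T predictive of Y.
\min_{p(t \mid x)} \; \mathcal{L}_{\mathrm{IB}} \;=\; I(X;T) \;-\; \beta\, I(T;Y), \qquad \beta > 0 .
% In Figure 7, a smaller I(X;T) at a similar I(T;Y) therefore corresponds to a more
% compressed representation, which is read here as better generalization.
```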


Figure 7: Mutual information paths from transition points to convergence points in the compression phase of training, plotted in the information plane (I(X;T) vs. I(T;Y)) for (a) ResNet-18 and (b) ResNet-34 with 512-D and 2048-D MLP hidden layers. T denotes transition points and C(X) denotes convergence points with X% top-1 accuracy on Cifar10. Points with similar I(T;Y) but smaller I(X;T) generalize better.

The effectiveness of the dimension. We further explore the impact of additional hidden dimensions; the results are shown in Figure 9. As the dimension increases, the top-1 accuracy also increases, but once the dimension is already large, the growth slows down. The performance trends of EfficientNet-B1 and MobileNet-v3-Large are close to that of EfficientNet-B0, so for clarity we do not show them in the figure.

4.8 More SSL Methods

Variants of the testbed. To demonstrate the versatility of our method, we further tried two other SSL methods as the testbed: SwAV and DINO.

For SwAV, the teacher is backboned by ResNet-50 and the results are shown in Table 5. For models with very few parameters, i.e., EfficientNet-B0 and MobileNet-v3-Large, the pre-training results with SwAV alone are also very poor; when DisCo is applied, the accuracy is significantly improved.

Table 5: Linear evaluation top-1 accuracy (%) on ImageNet with SwAV as the testbed. The teacher of DisCo is an open-source ResNet-50 model pre-trained with SwAV, with a top-1 accuracy of 75.3%.

Method         Eff-b0  Mob-v3  R-34
SwAV [5]       46.8    19.4    63.3
SwAV + DisCo   62.4    55.7    63.3

For DINO, we use two vision transformer architectures, ViT and XCiT; the results are shown in Table 6. No matter which SSL method is adopted and which architecture is used, DisCo brings significant gains. It is worth noting that XCiT-tiny has 26M parameters, much more than ViT-tiny (5M), yet DisCo can still narrow the gap with the teacher.

Variants of the teacher pre-training method. To verify that our method is not picky about the pre-training approach adopted by the teacher, we use three ResNet-50 networks pre-trained with different SSL methods as teachers under the MoCo-V2 testbed.

Table 6: Linear evaluation top-1 accuracy (%) on ImageNet with DINO as the testbed.

Teacher                         Student
Model               Acc         ViT-tiny      XCiT-tiny
--                  --          55.0          67.0
ViT-small [14, 43]  77.0        68.4(13.4↑)   --
XCiT-small [16]     77.8        --            71.1(4.1↑)

Table 7: Linear evaluation top-1 accuracy (%) on ImageNet with variants of the teacher pre-training method. All teachers are ResNet-50; the first row is the student trained directly by MoCo-V2 without distillation, which is the baseline.

Teacher                     Student
Method          Acc         Eff-b0  Eff-b1  Mob-v3  R-18   R-34
--              --          46.8    48.4    36.2    52.2   56.8
MoCo-V2 [10]    67.4        66.5    66.6    64.4    60.6   62.5
SeLa-V2 [2]     71.8        62.2    68.2    66.2    64.1   65.3
SwAV [5]        75.3        70.0    72.1    65.0    65.1   67.5

It can be observed from Table 7 that, when using different pre-trained ResNet-50 models as teachers, DisCo significantly boosts all the results of the small models. Furthermore, as the teacher improves with different and stronger pre-training methods, the results of the student improve further.

4.9 Visualization Analysis

In Figure 8, we visualize the learned representations of EfficientNet-B0/ResNet-50 pre-trained by MoCo-V2 and of EfficientNet-B0 distilled from ResNet-50 using DisCo. For clarity, we randomly select 10 classes from the ImageNet test set and map the learned representations to a two-dimensional space with t-SNE [44]. ResNet-50 forms more separated clusters than EfficientNet-B0 when using MoCo-V2 alone; after using ResNet-50 to teach EfficientNet-B0 with DisCo, EfficientNet-B0 behaves very much like the teacher.
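A compact sketch of this visualization step, assuming the embeddings of the 10 selected classes have already been extracted into NumPy arrays; the perplexity, initialization, and plotting details are arbitrary choices rather than the paper's exact settings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings, labels, seed=0):
    """Project high-dimensional embeddings to 2-D with t-SNE and color points by class."""
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=seed).fit_transform(embeddings)
    for c in np.unique(labels):
        mask = labels == c
        plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=str(c))
    plt.legend(markerscale=3)
    plt.show()

# Hypothetical usage with extracted student embeddings and their class ids:
# plot_tsne(student_embeddings, class_ids)
```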

5 Conclusion

In this paper, we propose Distilled Contrastive Learning (DisCo) to remedy self-supervised learning on lightweight models. The proposed method constrains the final embedding of the lightweight student to be consistent with that of the teacher, so as to maximally transmit the teacher's knowledge. DisCo is not limited to specific contrastive learning methods and can bring the performance of the student very close to that of the teacher.

Figure 8: Clustering results on the ImageNet test set (t-SNE, Dim 1 vs. Dim 2): (a) Eff-b0 (MoCo-V2), (b) R-50 (MoCo-V2), (c) Eff-b0 (DisCo, distilled by ResNet-50). Different colors represent different classes.

Figure 9: Linear evaluation top-1 accuracy on ImageNet of students with different dimensions of the projection-head hidden layer during distillation (0, 512, 1024, 1280, 2048), for EfficientNet-B0, ResNet-18, and ResNet-34; the first point denotes the original dimension used in MoCo-V2.

References

[1] Soroush Abbasi Koohpayegani, Ajinkya Tejankar, and Hamed Pirsiavash. CompRess: Self-supervised learning by compressing representations. In NeurIPS, pages 12980–12992, 2020.

[2] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.

[3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, pages 132–149, 2018.

[4] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, pages 2959–2968, 2019.

[5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, pages 9912–9924, 2020.

[6] Mathilde Caron, Hugo Touvron, Ishan Misra, Herve Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. 2021.

[7] Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. Cross-layer distillation with semantic calibration. 2020.

[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.

[9] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, pages 22243–22255, 2020.

[10] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. In CVPR, pages 9729–9738, 2020.

[11] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.

[12] Hao Cheng, Dongze Lian, Shenghua Gao, and Yanlin Geng. Evaluating capability of deep neural networks for image classification via information plane. In ECCV, pages 168–182, 2018.

[13] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.

[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[15] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. volume 38, pages 1734–1747, 2015.

[16] Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. XCiT: Cross-covariance image transformers. 2021.

[17] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes challenge. volume 88, pages 303–338, 2010.

[18] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. SEED: Self-supervised distillation for visual representation. In ICLR, 2021.

[19] Jean-Bastien Grill, Florian Strub, Florent Altche, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, pages 21271–21284, 2020.

[20] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.

[21] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[23] Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In ICML, pages 4182–4192, 2020.

[24] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPS Workshop, 2015.

[25] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In ICCV, pages 1314–1324, 2019.

[26] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In CVPR, pages 1920–1929, 2019.

[27] Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.

[28] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? In CVPR, pages 2661–2671, 2019.

[29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Citeseer, 2009.

[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.

[31] Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. Knowledge distillation via instance relationship graph. In CVPR, pages 7096–7104, 2019.

[32] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84, 2016.

[33] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. 2018.

[34] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, pages 3967–3976, 2019.

[35] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.

[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. volume 39, pages 1137–1149, 2015.

[37] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2014.

[38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. volume 115, pages 211–252, 2015.

[39] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. 2017.

[40] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114, 2019.

[41] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In ICLR, 2020.

[42] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. 2000.

[43] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357. PMLR, 2021.

[44] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. volume 9, 2008.

[45] Jinpeng Wang, Yuting Gao, Ke Li, Xinyang Jiang, Xiaowei Guo, Rongrong Ji, and Xing Sun. Enhancing unsupervised video representation learning by decoupling the scene and the motion. 2020.

[46] Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J. Ma, and Xing Sun. Removing the background by adding the background: Towards background robust self-supervised video representation learning. 2020.

[47] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[48] Guodong Xu, Ziwei Liu, Xiaoxiao Li, and Chen Change Loy. Knowledge distillation meets self-supervision. In ECCV, pages 588–604, 2020.

[49] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.

[50] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stephane Deny. Barlow twins: Self-supervised learning via redundancy reduction. 2021.

Page 8: 12 1 arXiv:2104.09124v3 [cs.CV] 9 Aug 2021

Table 3 Object detection and instance segmentation results with ResNet-34 as backbone bounding-box AP (AP bb) and mask AP(APmk) are evaluated on VOC07 test and COCO val2017 Daggermeans our implementation Subscript in green represents the improve-ment compared to MoCo-V2 baseline

S T MethodObject Detection Instance Segmentation

VOC COCO COCOAP bb AP bb

50 AP bb75 AP bb AP bb

50 AP bb75 APmk APmk

50 APmk75

R-34

times MoCo-V2Dagger 536 791 587 381 568 407 330 532 353

R-50SEED [18] 537 794 592 384 570 410 333 532 353

DisCo (ours) 565(29uarr)

806(15uarr)

625(38uarr)

400(19uarr)

591(23uarr)

434(27uarr)

349(19uarr)

563(31uarr)

371(18uarr)

R-101SEED [18] 541 798 591 385 573 414 336 541 356

DisCo (ours) 561(25uarr)

803(12uarr)

618(31uarr)

400(19uarr)

591(23uarr)

432(25uarr)

347(19uarr)

559(27uarr)

374(18uarr)

R-152SEED [18] 544 801 599 384 570 410 333 537 353

DisCo (ours) 566(30uarr)

808(17uarr)

634(57uarr)

394(13uarr)

587(19uarr)

427(20uarr)

344(14uarr)

554(22uarr)

367(14uarr)

student and ResNet-50ResNet101ResNet152 as teacherSince the image resolution of Cifar dataset is 32times32 all theimages are resized to 224 times 224 with bicubic re-samplingbefore feeding into the model following [11 18] The re-sults are shown in Figure 6 it can be seen that the pro-posed DisCo surpasses the MoCo-V2 baseline by a largemargin with different student and teacher architectures onboth Cifar10 and Cifar100 In addition our method alsohas a significant improvement compared the-state-of-art ap-proach SEED It is worth noting that as the teacher becomesbetter the improvement brought by DisCo is more obvious

46 Transfer to Detection and Segmentation

We also conduct experiments on detection and segmen-tation tasks for generalization analysis C4 based FasterR-CNN [36] are used for objection detection on VOC andMask R-CNN [21] are used for objection detection and in-stance segmentation on COCO In this part of experimentfollowing [10 18] all the parameters of the student networkare learnable and the implementation is based on detec-tron2 [47] for convenience The results of using ResNet-34as the student with teacher ResNet-50ResNet-101ResNet-152 are shown in Table 3 On the object detection ourmethod can bring obvious improvement on both VOC andCOCO datasets Furthermore as SEED [18] claimed theimprovement on COCO is relatively minor compared toVOC since COCO training dataset has 118k images whileVOC has only 165k training images thus the gain broughtby weight initialization is relatively small On the instancesegmentation task DisCo also shows superiority

47 Distilling BottleNeck Analysis

In this section we analyze the Distilling BottleNeck Phe-nomenon and we use ResNet-50 as the teacher for simplic-ityDistilling BottleNeck Phenomenon In the self-supervised

distillation stage we first tried to distill small models withthe default MLP configuration of MoCo-V2 using DisCoand the results are shown in Table 4 denoted by DisColowastIt is worth noting that the dimensions of the hidden layerin DisColowast are exactly as same as SEED It can be seenthat compared to SEED DisColowast shows superior results onEfficientNet-B0 EfficientNet-B1 and MobileNet-v3-Largeand has comparable results on ResNet-18 and ResNet-34Then we expand the dimension of the hidden layer in theMLP of the student to be consistent with that of the teacherthat is 2048D it can be seen that the results can be furtherimproved which is recorded in the third row In particularthis expansion operation brings 35 and 36 gains forResNet-18 and ResNet-34 respectively

Table 4 Linear evaluation top-1 accuracy () on ImageNetDisCo means that the dimension of hidden layer in the MLP isconsistent with that of SEED

Method Eff-b0 Eff-b1 Mob-v3 R-18 R-34SEED [18] 613 614 552 576 585

DisCo 656 658 638 571 589DisCo 665(09uarr) 666(08uarr) 644(06uarr) 606(35uarr) 625(36uarr)

Theoretical Analysis from IB perspective In Figure 7on the downstream Cifar10 classification task we visualizethe compression phase of ResNet-18 and ResNet-34 withdifferent hidden dimensions distilled by the same teacherin the information plane It can be seen that when we ad-just the hidden dimension in the MLP of ResNet-18 andResNet-34 from 512D to 2048D the value of I(XT ) be-comes smaller while I(T Y ) is basically unchanged whichsuggests that enlarging the hidden dimension can makethe student model more generalized in the setting of self-supervised transfer learningThe effectiveness of the dimension Here we further ex-plore the impact of more dimensions and the results areshown in Figure 9 It can be seen that as the dimension in-

8

80 85 90 950

1

2

3

4

5

I (T

Y) C(858) T

C (891) T

(a) ResNet-18

80 85 90 95

C (834) T

C (864) T

(b) ResNet-34

MLP (512d) MLP (2048d)I (XT)

Figure 7 Mutual information paths from transition points to con-vergence points in the compression phase of training T denotestransition points and C(X) denotes convergent points with Xtop-1 accuracy on Cifar10 Points with similar I(TY) but smallerI(XT) are better generalized

creases the top-1 accuracy also increases but when the di-mension is already large the growth trend will slow downThe performance trend of EfficientNet-B1 and MobileNet-v3-Large is close to that of EfficientNet-B0 so for the sakeof clarity we do not exhibit it in the figure

48 More SSL Methods

Variants of the testbed In this part in order to demon-strate the versatility of our method we further tried othertwo SSL methods as the testbed SwAV and DINO

For SwAV the teacher is backboned by ResNet-50 andthe results are shown in Table 5 it can be seen that formodels with very few parameters EfficientNet-B0 andMobileNet-v3-Large the pre-training results with SwAVare also very poor When DisCo is utilized the efficacyis significantly improved

Table 5 Linear evaluation top-1 accuracy () on ImageNet withSwAV as the testbed The teacher of DisCo is an online opensource ResNet-50 model pre-trained using SwAV with top-1 accu-racy 753

Method Eff-b0 Mob-v3 R-34SwAV [5] 468 194 633

SwAV + DisCo 624 557 633

For DINO use two vision transformer architectures ViTand XCiT and the results are shown in Table 6 It can beseen that no matter what SSL method is adopted and whatarchitecture is used in the network DisCo can bring sig-nificant gains It is worth noting that XCiT-tiny has 26Mparameters which is much larger than ViT-tiny(5M) butDisCo can still narrow the gap with the teacherVariants of teacher pre-training method In order toverify that our method is not picky about the pre-trainingapproach that the teacher adopted we use three ResNet-50 networks pre-trained with different SSL methods as theteacher under the testbed of MoCo-V2 It can be observed

Table 6 Linear evaluation top-1 accuracy () on ImageNet withDINO as testbed

Teacher StudentModel Acc ViT-tiny XCiT-tiny

- - 55 67ViT-small [14 43] 77 684(134uarr) -XCiT-small [16] 778 - 711(41uarr)

Table 7 Linear evaluation top-1 accuracy () on ImageNet withvariants of teacher pre-training methods All the teachers areResNet-50 and the first row is student trained by MoCo-V2 di-rectly without distillation which is the baseline

Teacher StudentMethod Acc Eff-b0 Eff-b1 Mob-v3 R-18 R-34

- - 468 484 362 522 568MoCo-V2 [10] 674 665 666 644 606 625

SeLa-V2 [2] 718 622 682 662 641 653SwAV [5] 753 700 721 650 651 675

from Table 7 that when using different pre-trained ResNet-50 as teachers DisCo can significantly boost all the resultsof small models Furthermore with the improvement of theteachers using different and stronger pre-training methodsthe results of the student can be further improved

49 Visualization Analysis

In Figure 8 we visualize the learned representations ofEfficientNet-B0ResNet-50 pretrained by MoCo-V2 andEfficientNet-B0 distilled by ResNet-50 using DisCo Forclarity we randomly select 10 classes from ImageNet testset and map the learned representations to two dimensionalspace by t-SNE [44] It can be observed that ResNet-50forms more separated clusters than EfficientNet-B0 whenusing MoCo-V2 alone and after using ResNet-50 to teachEfficientNet-B0 with DisCo EfficientNet-B0 performs verymuch like the teacher

5 Conclusion

In this paper we propose Distilled Contrastive Learning(DisCo) to remedy self-supervised learning on lightweightmodels The proposed method constraints the final embed-ding of the lightweight student to be consistent with that ofthe teacher to maximally transmit the teacherrsquos knowledgeDisCo is not limited to specific contrastive learning meth-ods and can remedy the effect of the student to be very closeto the teacher

References[1] Soroush Abbasi Koohpayegani Ajinkya Tejankar and

Hamed Pirsiavash Compress Self-supervised learning bycompressing representations In NeurIPS pages 12980ndash12992 2020 2 3

9

30 20 10 0 10 20 30 40Dim 1

30

20

10

0

10

20

Dim

2

(a) Eff-b0 (MoCo-V2)

30 20 10 0 10 20 30 40Dim 1

(b) R-50 (MoCo-V2)

30 20 10 0 10 20 30 40Dim 1

(c) Eff-b0 (DisCo distilled by ResNet-50)

Figure 8 Clustering results on the ImageNet test set Different colors represent different classes

0 512 1024 1280 2048Dimension of hidden layer

5254565860626466

Top-

1 Ac

cura

cy(

)

+46+38

+32

+67

+62

+57

+74+65

+63

+81+74

+72

EfficientNet-B0ResNet-18ResNet-34

Figure 9 Linear evaluation top-1 accuracy on ImageNet of studentprojection head with different dimensions of hidden layer duringdistillation represents the original dimension used in MoCo-V2

[2] Yuki Markus Asano Christian Rupprecht and AndreaVedaldi Self-labelling via simultaneous clustering and rep-resentation learning In ICLR 2020 3 9

[3] Mathilde Caron Piotr Bojanowski Armand Joulin andMatthijs Douze Deep clustering for unsupervised learningof visual features In ECCV pages 132ndash149 2018 3

[4] Mathilde Caron Piotr Bojanowski Julien Mairal and Ar-mand Joulin Unsupervised pre-training of image featureson non-curated data In ICCV pages 2959ndash2968 2019 3

[5] Mathilde Caron Ishan Misra Julien Mairal Priya Goyal Pi-otr Bojanowski and Armand Joulin Unsupervised learn-ing of visual features by contrasting cluster assignments InNeurIPS pages 9912ndash9924 2020 3 9

[6] Mathilde Caron Hugo Touvron Ishan Misra Herve JegouJulien Mairal Piotr Bojanowski and Armand Joulin Emerg-ing properties in self-supervised vision transformers 20213

[7] Defang Chen Jian-Ping Mei Yuan Zhang Can Wang ZheWang Yan Feng and Chun Chen Cross-layer distillationwith semantic calibration 2020 3

[8] Ting Chen Simon Kornblith Mohammad Norouzi and Ge-offrey Hinton A simple framework for contrastive learningof visual representations In ICML pages 1597ndash1607 20201 2 3 7

[9] Ting Chen Simon Kornblith Kevin Swersky MohammadNorouzi and Geoffrey Hinton Big self-supervised mod-els are strong semi-supervised learners In NeurIPS pages22243ndash22255 2020 1 2 3 4

[10] Xinlei Chen Haoqi Fan Ross Girshick and Kaiming HeImproved baselines with momentum contrastive learning InCVPR pages 9729ndash9738 2020 1 2 3 4 6 8 9

[11] Xinlei Chen and Kaiming He Exploring simple siamese rep-resentation learning In arXiv preprint arXiv2011105662020 3 7 8

[12] Hao Cheng Dongze Lian Shenghua Gao and Yanlin GengEvaluating capability of deep neural networks for imageclassification via information plane In ECCV pages 168ndash182 2018 5

[13] Carl Doersch Abhinav Gupta and Alexei A Efros Unsuper-vised visual representation learning by context prediction InICCV pages 1422ndash1430 2015 1

[14] Alexey Dosovitskiy Lucas Beyer Alexander KolesnikovDirk Weissenborn Xiaohua Zhai Thomas UnterthinerMostafa Dehghani Matthias Minderer Georg Heigold Syl-vain Gelly et al An image is worth 16x16 words Trans-formers for image recognition at scale arXiv preprintarXiv201011929 2020 5 9

[15] Alexey Dosovitskiy Philipp Fischer Jost Tobias Springen-berg Martin Riedmiller and Thomas Brox Discriminativeunsupervised feature learning with exemplar convolutionalneural networks volume 38 pages 1734ndash1747 2015 3

[16] Alaaeldin El-Nouby Hugo Touvron Mathilde Caron PiotrBojanowski Matthijs Douze Armand Joulin Ivan LaptevNatalia Neverova Gabriel Synnaeve Jakob Verbeek et alXcit Cross-covariance image transformers 2021 5 9

[17] Mark Everingham Luc Van Gool Christopher KI WilliamsJohn Winn and Andrew Zisserman The pascal visual objectclasses challenge volume 88 pages 303ndash338 2010 5

10

[18] Zhiyuan Fang Jianfeng Wang Lijuan Wang Lei ZhangYezhou Yang and Zicheng Liu Seed Self-supervised dis-tillation for visual representation In ICLR 2021 2 3 68

[19] Jean-Bastien Grill Florian Strub Florent Altche CorentinTallec Pierre Richemond Elena Buchatskaya Carl DoerschBernardo Avila Pires Zhaohan Guo Mohammad Ghesh-laghi Azar Bilal Piot koray kavukcuoglu Remi Munos andMichal Valko Bootstrap your own latent A new approachto self-supervised learning In NeurIPS pages 21271ndash212842020 1 2 3

[20] Kaiming He Haoqi Fan Yuxin Wu Saining Xie and RossGirshick Momentum contrast for unsupervised visual rep-resentation learning In CVPR pages 9729ndash9738 2020 12 3 6

[21] Kaiming He Georgia Gkioxari Piotr Dollar and Ross Gir-shick Mask r-cnn In ICCV pages 2961ndash2969 2017 8

[22] Kaiming He Xiangyu Zhang Shaoqing Ren and Jian SunDeep residual learning for image recognition In CVPRpages 770ndash778 2016 2

[23] Olivier Henaff Data-efficient image recognition with con-trastive predictive coding In ICML pages 4182ndash4192 20203

[24] Geoffrey Hinton Oriol Vinyals and Jeff Dean Distilling theknowledge in a neural network In NeurIPSW 2015 2 3 7

[25] Andrew Howard Mark Sandler Grace Chu Liang-ChiehChen Bo Chen Mingxing Tan Weijun Wang Yukun ZhuRuoming Pang Vijay Vasudevan et al Searching for mo-bilenetv3 In ICCV pages 1314ndash1324 2019 2

[26] Alexander Kolesnikov Xiaohua Zhai and Lucas Beyer Re-visiting self-supervised visual representation learning InCVPR pages 1920ndash1929 2019 7

[27] Nikos Komodakis and Spyros Gidaris Unsupervised repre-sentation learning by predicting image rotations In ICLR2018 1 3

[28] Simon Kornblith Jonathon Shlens and Quoc V Le Do bet-ter imagenet models transfer better In CVPR pages 2661ndash2671 2019 7

[29] Alex Krizhevsky and Geoffrey Hinton Learning multiplelayers of features from tiny images Citeseer 2009 5

[30] Tsung-Yi Lin Michael Maire Serge Belongie James HaysPietro Perona Deva Ramanan Piotr Dollar and C LawrenceZitnick Microsoft coco Common objects in context InECCV pages 740ndash755 2014 5

[31] Yufan Liu Jiajiong Cao Bing Li Chunfeng Yuan WeimingHu Yangxi Li and Yunqiang Duan Knowledge distillationvia instance relationship graph In CVPR pages 7096ndash71042019 3

[32] Mehdi Noroozi and Paolo Favaro Unsupervised learning ofvisual representations by solving jigsaw puzzles In ECCVpages 69ndash84 2016 1 3

[33] Aaron van den Oord Yazhe Li and Oriol Vinyals Repre-sentation learning with contrastive predictive coding 20187

[34] Wonpyo Park Dongju Kim Yan Lu and Minsu Cho Re-lational knowledge distillation In CVPR pages 3967ndash39762019 3 7

[35] Deepak Pathak Philipp Krahenbuhl Jeff Donahue TrevorDarrell and Alexei A Efros Context encoders Featurelearning by inpainting In CVPR pages 2536ndash2544 20161 3

[36] Shaoqing Ren Kaiming He Ross Girshick and Jian SunFaster r-cnn Towards real-time object detection with regionproposal networks volume 39 pages 1137ndash1149 2015 8

[37] Adriana Romero Nicolas Ballas Samira Ebrahimi KahouAntoine Chassang Carlo Gatta and Yoshua Bengio FitnetsHints for thin deep nets In ICLR 2014 3

[38] Olga Russakovsky Jia Deng Hao Su Jonathan Krause San-jeev Satheesh Sean Ma Zhiheng Huang Andrej KarpathyAditya Khosla Michael Bernstein Alexander C Berg andLi Fei-Fei ImageNet Large Scale Visual Recognition Chal-lenge volume 115 pages 211ndash252 2015 2 5

[39] Ravid Shwartz-Ziv and Naftali Tishby Opening the blackbox of deep neural networks via information 2017 5

[40] Mingxing Tan and Quoc Le Efficientnet Rethinking modelscaling for convolutional neural networks In ICML pages6105ndash6114 2019 2

[41] Yonglong Tian Dilip Krishnan and Phillip Isola Con-trastive representation distillation In ICLR 2020 3

[42] Naftali Tishby Fernando C Pereira and William Bialek Theinformation bottleneck method 2000 2 5

[43] Hugo Touvron Matthieu Cord Matthijs Douze FranciscoMassa Alexandre Sablayrolles and Herve Jegou Trainingdata-efficient image transformers amp distillation through at-tention In International Conference on Machine Learningpages 10347ndash10357 PMLR 2021 5 9

[44] Laurens Van der Maaten and Geoffrey Hinton Visualizingdata using t-sne volume 9 2008 9

[45] Jinpeng Wang Yuting Gao Ke Li Xinyang Jiang XiaoweiGuo Rongrong Ji and Xing Sun Enhancing unsupervisedvideo representation learning by decoupling the scene andthe motion 2020 3

[46] Jinpeng Wang Yuting Gao Ke Li Yiqi Lin Andy J Ma andXing Sun Removing the background by adding the back-ground Towards background robust self-supervised videorepresentation learning 2020 3

[47] Yuxin Wu Alexander Kirillov Francisco Massa Wan-YenLo and Ross Girshick Detectron2 httpsgithubcomfacebookresearchdetectron2 2019 8

[48] Guodong Xu Ziwei Liu Xiaoxiao Li and Chen Change LoyKnowledge distillation meets self-supervision In ECCVpages 588ndash604 2020 3

[49] Sergey Zagoruyko and Nikos Komodakis Paying more at-tention to attention Improving the performance of convolu-tional neural networks via attention transfer In ICLR 20173 7

[50] Jure Zbontar Li Jing Ishan Misra Yann LeCun andStephane Deny Barlow twins Self-supervised learning viaredundancy reduction 2021 3

11

Page 9: 12 1 arXiv:2104.09124v3 [cs.CV] 9 Aug 2021

80 85 90 950

1

2

3

4

5

I (T

Y) C(858) T

C (891) T

(a) ResNet-18

80 85 90 95

C (834) T

C (864) T

(b) ResNet-34

MLP (512d) MLP (2048d)I (XT)

Figure 7 Mutual information paths from transition points to con-vergence points in the compression phase of training T denotestransition points and C(X) denotes convergent points with Xtop-1 accuracy on Cifar10 Points with similar I(TY) but smallerI(XT) are better generalized

creases the top-1 accuracy also increases but when the di-mension is already large the growth trend will slow downThe performance trend of EfficientNet-B1 and MobileNet-v3-Large is close to that of EfficientNet-B0 so for the sakeof clarity we do not exhibit it in the figure

48 More SSL Methods

Variants of the testbed In this part in order to demon-strate the versatility of our method we further tried othertwo SSL methods as the testbed SwAV and DINO

For SwAV the teacher is backboned by ResNet-50 andthe results are shown in Table 5 it can be seen that formodels with very few parameters EfficientNet-B0 andMobileNet-v3-Large the pre-training results with SwAVare also very poor When DisCo is utilized the efficacyis significantly improved

Table 5 Linear evaluation top-1 accuracy () on ImageNet withSwAV as the testbed The teacher of DisCo is an online opensource ResNet-50 model pre-trained using SwAV with top-1 accu-racy 753

Method Eff-b0 Mob-v3 R-34SwAV [5] 468 194 633

SwAV + DisCo 624 557 633

For DINO use two vision transformer architectures ViTand XCiT and the results are shown in Table 6 It can beseen that no matter what SSL method is adopted and whatarchitecture is used in the network DisCo can bring sig-nificant gains It is worth noting that XCiT-tiny has 26Mparameters which is much larger than ViT-tiny(5M) butDisCo can still narrow the gap with the teacherVariants of teacher pre-training method In order toverify that our method is not picky about the pre-trainingapproach that the teacher adopted we use three ResNet-50 networks pre-trained with different SSL methods as theteacher under the testbed of MoCo-V2 It can be observed

Table 6 Linear evaluation top-1 accuracy () on ImageNet withDINO as testbed

Teacher StudentModel Acc ViT-tiny XCiT-tiny

- - 55 67ViT-small [14 43] 77 684(134uarr) -XCiT-small [16] 778 - 711(41uarr)

Table 7 Linear evaluation top-1 accuracy () on ImageNet withvariants of teacher pre-training methods All the teachers areResNet-50 and the first row is student trained by MoCo-V2 di-rectly without distillation which is the baseline

Teacher StudentMethod Acc Eff-b0 Eff-b1 Mob-v3 R-18 R-34

- - 468 484 362 522 568MoCo-V2 [10] 674 665 666 644 606 625

SeLa-V2 [2] 718 622 682 662 641 653SwAV [5] 753 700 721 650 651 675

from Table 7 that when using different pre-trained ResNet-50 as teachers DisCo can significantly boost all the resultsof small models Furthermore with the improvement of theteachers using different and stronger pre-training methodsthe results of the student can be further improved

49 Visualization Analysis

In Figure 8 we visualize the learned representations ofEfficientNet-B0ResNet-50 pretrained by MoCo-V2 andEfficientNet-B0 distilled by ResNet-50 using DisCo Forclarity we randomly select 10 classes from ImageNet testset and map the learned representations to two dimensionalspace by t-SNE [44] It can be observed that ResNet-50forms more separated clusters than EfficientNet-B0 whenusing MoCo-V2 alone and after using ResNet-50 to teachEfficientNet-B0 with DisCo EfficientNet-B0 performs verymuch like the teacher

5 Conclusion

In this paper we propose Distilled Contrastive Learning(DisCo) to remedy self-supervised learning on lightweightmodels The proposed method constraints the final embed-ding of the lightweight student to be consistent with that ofthe teacher to maximally transmit the teacherrsquos knowledgeDisCo is not limited to specific contrastive learning meth-ods and can remedy the effect of the student to be very closeto the teacher

References[1] Soroush Abbasi Koohpayegani Ajinkya Tejankar and

Hamed Pirsiavash Compress Self-supervised learning bycompressing representations In NeurIPS pages 12980ndash12992 2020 2 3

9

30 20 10 0 10 20 30 40Dim 1

30

20

10

0

10

20

Dim

2

(a) Eff-b0 (MoCo-V2)

30 20 10 0 10 20 30 40Dim 1

(b) R-50 (MoCo-V2)

30 20 10 0 10 20 30 40Dim 1

(c) Eff-b0 (DisCo distilled by ResNet-50)

Figure 8 Clustering results on the ImageNet test set Different colors represent different classes

0 512 1024 1280 2048Dimension of hidden layer

5254565860626466

Top-

1 Ac

cura

cy(

)

+46+38

+32

+67

+62

+57

+74+65

+63

+81+74

+72

EfficientNet-B0ResNet-18ResNet-34

Figure 9 Linear evaluation top-1 accuracy on ImageNet of studentprojection head with different dimensions of hidden layer duringdistillation represents the original dimension used in MoCo-V2

[2] Yuki Markus Asano Christian Rupprecht and AndreaVedaldi Self-labelling via simultaneous clustering and rep-resentation learning In ICLR 2020 3 9

[3] Mathilde Caron Piotr Bojanowski Armand Joulin andMatthijs Douze Deep clustering for unsupervised learningof visual features In ECCV pages 132ndash149 2018 3

[4] Mathilde Caron Piotr Bojanowski Julien Mairal and Ar-mand Joulin Unsupervised pre-training of image featureson non-curated data In ICCV pages 2959ndash2968 2019 3

[5] Mathilde Caron Ishan Misra Julien Mairal Priya Goyal Pi-otr Bojanowski and Armand Joulin Unsupervised learn-ing of visual features by contrasting cluster assignments InNeurIPS pages 9912ndash9924 2020 3 9

[6] Mathilde Caron Hugo Touvron Ishan Misra Herve JegouJulien Mairal Piotr Bojanowski and Armand Joulin Emerg-ing properties in self-supervised vision transformers 20213

[7] Defang Chen Jian-Ping Mei Yuan Zhang Can Wang ZheWang Yan Feng and Chun Chen Cross-layer distillationwith semantic calibration 2020 3

[8] Ting Chen Simon Kornblith Mohammad Norouzi and Ge-offrey Hinton A simple framework for contrastive learningof visual representations In ICML pages 1597ndash1607 20201 2 3 7

[9] Ting Chen Simon Kornblith Kevin Swersky MohammadNorouzi and Geoffrey Hinton Big self-supervised mod-els are strong semi-supervised learners In NeurIPS pages22243ndash22255 2020 1 2 3 4

[10] Xinlei Chen Haoqi Fan Ross Girshick and Kaiming HeImproved baselines with momentum contrastive learning InCVPR pages 9729ndash9738 2020 1 2 3 4 6 8 9

[11] Xinlei Chen and Kaiming He Exploring simple siamese rep-resentation learning In arXiv preprint arXiv2011105662020 3 7 8

[12] Hao Cheng Dongze Lian Shenghua Gao and Yanlin GengEvaluating capability of deep neural networks for imageclassification via information plane In ECCV pages 168ndash182 2018 5

[13] Carl Doersch Abhinav Gupta and Alexei A Efros Unsuper-vised visual representation learning by context prediction InICCV pages 1422ndash1430 2015 1

[14] Alexey Dosovitskiy Lucas Beyer Alexander KolesnikovDirk Weissenborn Xiaohua Zhai Thomas UnterthinerMostafa Dehghani Matthias Minderer Georg Heigold Syl-vain Gelly et al An image is worth 16x16 words Trans-formers for image recognition at scale arXiv preprintarXiv201011929 2020 5 9

[15] Alexey Dosovitskiy Philipp Fischer Jost Tobias Springen-berg Martin Riedmiller and Thomas Brox Discriminativeunsupervised feature learning with exemplar convolutionalneural networks volume 38 pages 1734ndash1747 2015 3

[16] Alaaeldin El-Nouby Hugo Touvron Mathilde Caron PiotrBojanowski Matthijs Douze Armand Joulin Ivan LaptevNatalia Neverova Gabriel Synnaeve Jakob Verbeek et alXcit Cross-covariance image transformers 2021 5 9

[17] Mark Everingham Luc Van Gool Christopher KI WilliamsJohn Winn and Andrew Zisserman The pascal visual objectclasses challenge volume 88 pages 303ndash338 2010 5

10

[18] Zhiyuan Fang Jianfeng Wang Lijuan Wang Lei ZhangYezhou Yang and Zicheng Liu Seed Self-supervised dis-tillation for visual representation In ICLR 2021 2 3 68

[19] Jean-Bastien Grill Florian Strub Florent Altche CorentinTallec Pierre Richemond Elena Buchatskaya Carl DoerschBernardo Avila Pires Zhaohan Guo Mohammad Ghesh-laghi Azar Bilal Piot koray kavukcuoglu Remi Munos andMichal Valko Bootstrap your own latent A new approachto self-supervised learning In NeurIPS pages 21271ndash212842020 1 2 3

[20] Kaiming He Haoqi Fan Yuxin Wu Saining Xie and RossGirshick Momentum contrast for unsupervised visual rep-resentation learning In CVPR pages 9729ndash9738 2020 12 3 6

[21] Kaiming He Georgia Gkioxari Piotr Dollar and Ross Gir-shick Mask r-cnn In ICCV pages 2961ndash2969 2017 8

[22] Kaiming He Xiangyu Zhang Shaoqing Ren and Jian SunDeep residual learning for image recognition In CVPRpages 770ndash778 2016 2

[23] Olivier Henaff Data-efficient image recognition with con-trastive predictive coding In ICML pages 4182ndash4192 20203

[24] Geoffrey Hinton Oriol Vinyals and Jeff Dean Distilling theknowledge in a neural network In NeurIPSW 2015 2 3 7

[25] Andrew Howard Mark Sandler Grace Chu Liang-ChiehChen Bo Chen Mingxing Tan Weijun Wang Yukun ZhuRuoming Pang Vijay Vasudevan et al Searching for mo-bilenetv3 In ICCV pages 1314ndash1324 2019 2

[26] Alexander Kolesnikov Xiaohua Zhai and Lucas Beyer Re-visiting self-supervised visual representation learning InCVPR pages 1920ndash1929 2019 7

[27] Nikos Komodakis and Spyros Gidaris Unsupervised repre-sentation learning by predicting image rotations In ICLR2018 1 3

[28] Simon Kornblith Jonathon Shlens and Quoc V Le Do bet-ter imagenet models transfer better In CVPR pages 2661ndash2671 2019 7

[29] Alex Krizhevsky and Geoffrey Hinton Learning multiplelayers of features from tiny images Citeseer 2009 5

[30] Tsung-Yi Lin Michael Maire Serge Belongie James HaysPietro Perona Deva Ramanan Piotr Dollar and C LawrenceZitnick Microsoft coco Common objects in context InECCV pages 740ndash755 2014 5

[31] Yufan Liu Jiajiong Cao Bing Li Chunfeng Yuan WeimingHu Yangxi Li and Yunqiang Duan Knowledge distillationvia instance relationship graph In CVPR pages 7096ndash71042019 3

[32] Mehdi Noroozi and Paolo Favaro Unsupervised learning ofvisual representations by solving jigsaw puzzles In ECCVpages 69ndash84 2016 1 3

[33] Aaron van den Oord Yazhe Li and Oriol Vinyals Repre-sentation learning with contrastive predictive coding 20187

[34] Wonpyo Park Dongju Kim Yan Lu and Minsu Cho Re-lational knowledge distillation In CVPR pages 3967ndash39762019 3 7

[35] Deepak Pathak Philipp Krahenbuhl Jeff Donahue TrevorDarrell and Alexei A Efros Context encoders Featurelearning by inpainting In CVPR pages 2536ndash2544 20161 3

[36] Shaoqing Ren Kaiming He Ross Girshick and Jian SunFaster r-cnn Towards real-time object detection with regionproposal networks volume 39 pages 1137ndash1149 2015 8

[37] Adriana Romero Nicolas Ballas Samira Ebrahimi KahouAntoine Chassang Carlo Gatta and Yoshua Bengio FitnetsHints for thin deep nets In ICLR 2014 3

[38] Olga Russakovsky Jia Deng Hao Su Jonathan Krause San-jeev Satheesh Sean Ma Zhiheng Huang Andrej KarpathyAditya Khosla Michael Bernstein Alexander C Berg andLi Fei-Fei ImageNet Large Scale Visual Recognition Chal-lenge volume 115 pages 211ndash252 2015 2 5

[39] Ravid Shwartz-Ziv and Naftali Tishby Opening the blackbox of deep neural networks via information 2017 5

[40] Mingxing Tan and Quoc Le Efficientnet Rethinking modelscaling for convolutional neural networks In ICML pages6105ndash6114 2019 2

[41] Yonglong Tian Dilip Krishnan and Phillip Isola Con-trastive representation distillation In ICLR 2020 3

[42] Naftali Tishby Fernando C Pereira and William Bialek Theinformation bottleneck method 2000 2 5

[43] Hugo Touvron Matthieu Cord Matthijs Douze FranciscoMassa Alexandre Sablayrolles and Herve Jegou Trainingdata-efficient image transformers amp distillation through at-tention In International Conference on Machine Learningpages 10347ndash10357 PMLR 2021 5 9

[44] Laurens Van der Maaten and Geoffrey Hinton Visualizingdata using t-sne volume 9 2008 9

[45] Jinpeng Wang Yuting Gao Ke Li Xinyang Jiang XiaoweiGuo Rongrong Ji and Xing Sun Enhancing unsupervisedvideo representation learning by decoupling the scene andthe motion 2020 3

[46] Jinpeng Wang Yuting Gao Ke Li Yiqi Lin Andy J Ma andXing Sun Removing the background by adding the back-ground Towards background robust self-supervised videorepresentation learning 2020 3

[47] Yuxin Wu Alexander Kirillov Francisco Massa Wan-YenLo and Ross Girshick Detectron2 httpsgithubcomfacebookresearchdetectron2 2019 8

[48] Guodong Xu Ziwei Liu Xiaoxiao Li and Chen Change LoyKnowledge distillation meets self-supervision In ECCVpages 588ndash604 2020 3

[49] Sergey Zagoruyko and Nikos Komodakis Paying more at-tention to attention Improving the performance of convolu-tional neural networks via attention transfer In ICLR 20173 7

[50] Jure Zbontar Li Jing Ishan Misra Yann LeCun andStephane Deny Barlow twins Self-supervised learning viaredundancy reduction 2021 3

11

Page 10: 12 1 arXiv:2104.09124v3 [cs.CV] 9 Aug 2021

30 20 10 0 10 20 30 40Dim 1

30

20

10

0

10

20

Dim

2

(a) Eff-b0 (MoCo-V2)

30 20 10 0 10 20 30 40Dim 1

(b) R-50 (MoCo-V2)

30 20 10 0 10 20 30 40Dim 1

(c) Eff-b0 (DisCo distilled by ResNet-50)

Figure 8 Clustering results on the ImageNet test set Different colors represent different classes

0 512 1024 1280 2048Dimension of hidden layer

5254565860626466

Top-

1 Ac

cura

cy(

)

+46+38

+32

+67

+62

+57

+74+65

+63

+81+74

+72

EfficientNet-B0ResNet-18ResNet-34

Figure 9 Linear evaluation top-1 accuracy on ImageNet of studentprojection head with different dimensions of hidden layer duringdistillation represents the original dimension used in MoCo-V2

[2] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.

[3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, pages 132–149, 2018.

[4] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, pages 2959–2968, 2019.

[5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, pages 9912–9924, 2020.

[6] Mathilde Caron, Hugo Touvron, Ishan Misra, Herve Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. 2021.

[7] Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. Cross-layer distillation with semantic calibration. 2020.

[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.

[9] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, pages 22243–22255, 2020.

[10] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. In CVPR, pages 9729–9738, 2020.

[11] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.

[12] Hao Cheng, Dongze Lian, Shenghua Gao, and Yanlin Geng. Evaluating capability of deep neural networks for image classification via information plane. In ECCV, pages 168–182, 2018.

[13] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.

[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[15] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. Volume 38, pages 1734–1747, 2015.

[16] Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. XCiT: Cross-covariance image transformers. 2021.

[17] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge. Volume 88, pages 303–338, 2010.

[18] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. SEED: Self-supervised distillation for visual representation. In ICLR, 2021.

[19] Jean-Bastien Grill, Florian Strub, Florent Altche, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, pages 21271–21284, 2020.

[20] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.

[21] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[23] Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In ICML, pages 4182–4192, 2020.

[24] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPSW, 2015.

[25] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In ICCV, pages 1314–1324, 2019.

[26] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In CVPR, pages 1920–1929, 2019.

[27] Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.

[28] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? In CVPR, pages 2661–2671, 2019.

[29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Citeseer, 2009.

[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.

[31] Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. Knowledge distillation via instance relationship graph. In CVPR, pages 7096–7104, 2019.

[32] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84, 2016.

[33] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. 2018.

[34] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, pages 3967–3976, 2019.

[35] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.

[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Volume 39, pages 1137–1149, 2015.

[37] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2014.

[38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. Volume 115, pages 211–252, 2015.

[39] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. 2017.

[40] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114, 2019.

[41] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In ICLR, 2020.

[42] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. 2000.

[43] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.

[44] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Volume 9, 2008.

[45] Jinpeng Wang, Yuting Gao, Ke Li, Xinyang Jiang, Xiaowei Guo, Rongrong Ji, and Xing Sun. Enhancing unsupervised video representation learning by decoupling the scene and the motion. 2020.

[46] Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J. Ma, and Xing Sun. Removing the background by adding the background: Towards background robust self-supervised video representation learning. 2020.

[47] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[48] Guodong Xu, Ziwei Liu, Xiaoxiao Li, and Chen Change Loy. Knowledge distillation meets self-supervision. In ECCV, pages 588–604, 2020.

[49] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.

[50] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stephane Deny. Barlow twins: Self-supervised learning via redundancy reduction. 2021.
