Interpreting Adversarially Trained Convolutional Neural Networks
Tianyuan Zhang 1 Zhanxing Zhu 2 3 4
Abstract
We attempt to interpret how adversarially trained convolutional neural networks (AT-CNNs) recognize objects. We design systematic approaches to interpret AT-CNNs in both qualitative and quantitative ways and compare them with normally trained models. Surprisingly, we find that adversarial training alleviates the texture bias of standard CNNs when trained on object recognition tasks, and helps CNNs learn a more shape-biased representation. We validate our hypothesis from two aspects. First, we compare the salience maps of AT-CNNs and standard CNNs on clean images and images under different transformations. The comparison visually shows that the predictions of the two types of CNNs are sensitive to dramatically different types of features. Second, to achieve quantitative verification, we construct additional test datasets that destroy either textures or shapes, such as style-transferred versions of clean data, saturated images, and patch-shuffled ones, and then evaluate the classification accuracy of AT-CNNs and normal CNNs on these datasets. Our findings shed some light on why AT-CNNs are more robust than normally trained ones and contribute to a better understanding of adversarial training over CNNs from an interpretation perspective.
1. Introduction
Convolutional neural networks (CNNs) have achieved great success in a variety of visual recognition tasks (Krizhevsky et al., 2012; Girshick et al., 2014; Long et al., 2015) with their stacked local connections. A crucial issue is to understand what is being learned after training over thousands or even millions of images. This involves interpreting CNNs.
1 School of EECS, Peking University, China. 2 School of Mathematical Sciences, Peking University, China. 3 Center for Data Science, Peking University. 4 Beijing Institute of Big Data Research. Correspondence to: Zhanxing Zhu <zhanxing.zhu@pku.edu.cn>.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
Along this line, some recent works showed that standard CNNs trained on ImageNet make their predictions rely on local textures rather than the long-range dependencies encoded in the shape of objects (Geirhos et al., 2019; Brendel & Bethge, 2019; Ballester & de Araujo, 2016). Consequently, this texture bias prevents the trained CNNs from generalizing well to images with distorted textures but preserved shape information. Geirhos et al. (2019) also showed that training on a combination of Stylized-ImageNet and ImageNet can alleviate the texture bias of standard CNNs. This naturally raises an intriguing question:
Are there any other trained CNNs that are more biased towards shapes?
Recently, normally trained neural networks were found to be easily fooled by maliciously perturbed examples, i.e., adversarial examples (Goodfellow et al., 2014; Kurakin et al., 2016). To defend against adversarial examples, adversarial training was proposed; that is, instead of minimizing the loss function over clean examples, it minimizes an almost worst-case loss over slightly perturbed examples (Madry et al., 2018). We refer to these adversarially trained networks as AT-CNNs. They have been extensively shown to enhance robustness, i.e., to improve the classification accuracy on adversarial examples. This leads to a second question:
What is learned by adversarially trained CNNs that makes them more robust?
In this work, to explore the answers to the above questions, we systematically design various experiments to interpret AT-CNNs and compare them with normally trained models. We find that AT-CNNs are better at capturing long-range correlations such as shapes, and are less biased towards textures than normally trained CNNs on popular object recognition datasets. This finding partially explains why AT-CNNs tend to be more robust than standard CNNs.
We validate our hypothesis from two aspects. First, we compare the salience maps of AT-CNNs and standard CNNs on clean images and those under different transformations. The comparison visually shows that the predictions of the two CNNs are sensitive to dramatically different types of features. Second, we construct additional test datasets that destroy either textures or shapes, such as the style-transferred version of clean data, saturated images, and patch-shuffled images, and then evaluate the classification accuracy of AT-CNNs and normal CNNs on these datasets. These carefully designed experiments provide a quantitative comparison between the two CNNs and demonstrate their biases when making predictions.
arXiv:1905.09797v1 [cs.LG] 23 May 2019
To the best of our knowledge, we are the first to conduct a systematic investigation on interpreting adversarially trained CNNs, both visually and quantitatively. Our findings shed some light on why AT-CNNs are more robust than normally trained ones and also contribute to a better understanding of adversarial training over CNNs from an interpretation perspective.¹
The remainder of the paper is structured as follows. We introduce background knowledge on adversarial training and salience methods in Section 2. The methods for interpreting AT-CNNs are described in Section 3. We then present the experimental results that support our findings in Section 4. Related work and discussions are presented in Section 5. Section 6 concludes the paper.
2. Preliminary
2.1. Adversarial training
This training method was first proposed by Goodfellow et al. (2014) and is so far the most successful approach for building models that are robust to adversarial examples (Madry et al., 2018; Sinha et al., 2018; Athalye et al., 2018; Zhang et al., 2019b;a). It can be formulated as solving a robust optimization problem (Shaham et al., 2015):
$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \max_{\delta \in S} \ell\big(f(x+\delta;\theta),\, y\big) \right] \tag{1}$$
where $f(x;\theta)$ represents the neural network parameterized by weights $\theta$, the input-output pair $(x, y)$ is sampled from the training set $\mathcal{D}$, $\delta$ denotes the adversarial perturbation, and $\ell(\cdot,\cdot)$ is the chosen loss function, e.g., cross-entropy loss. $S$ denotes a norm constraint, such as $\ell_\infty$ or $\ell_2$.
The inner maximization is approximated by adversarial examples generated by various attack methods. Training against a projected gradient descent (PGD; Madry et al., 2018) adversary leads to state-of-the-art white-box robustness. We use PGD-based adversarial training with bounded $\ell_\infty$ and $\ell_2$ norm constraints. We also investigate FGSM-based (Goodfellow et al., 2014) adversarial training.
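To make the inner maximization in Eq. (1) concrete, the following is a minimal NumPy sketch of an $\ell_\infty$-bounded PGD attack. It is only an illustration under stated assumptions: the classifier is a toy linear softmax model (standing in for a CNN so that the input gradient is available in closed form), and the names `loss_and_input_grad` and `pgd_linf` as well as the values of `eps`, `step`, and `n_steps` are hypothetical choices, not the settings used in the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def loss_and_input_grad(W, b, x, y):
    """Cross-entropy loss of a toy linear softmax classifier (assumption:
    stand-in for f(x; theta)) and its gradient with respect to the input x."""
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return -np.log(p[y]), W.T @ (p - onehot)

def pgd_linf(W, b, x, y, eps=8 / 255, step=2 / 255, n_steps=10, rng=None):
    """Approximate max over ||delta||_inf <= eps of the loss by projected
    signed-gradient ascent, keeping x + delta inside the pixel range [0, 1]."""
    rng = np.random.default_rng(0) if rng is None else rng
    delta = rng.uniform(-eps, eps, size=x.shape)      # random start in the ball
    delta = np.clip(x + delta, 0.0, 1.0) - x
    for _ in range(n_steps):
        _, grad = loss_and_input_grad(W, b, x + delta, y)
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)  # ascent + projection
        delta = np.clip(x + delta, 0.0, 1.0) - x                  # stay a valid image
    return x + delta
```

In adversarial training, the outer minimization would then take a gradient step on the model parameters using the loss evaluated at the returned perturbed input rather than at the clean input.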
2.2. Salience maps
Given a trained neural network, visualizing salience maps aims at assigning a sensitivity value, sometimes also called "attribution", to show the sensitivity of the output to each pixel of an input image. Salience methods can mainly be divided into (Ancona et al., 2018) perturbation-based methods (Zeiler & Fergus, 2014; Zintgraf et al., 2017) and gradient-based methods (Erhan et al., 2009; Simonyan et al., 2013; Shrikumar et al., 2017; Sundararajan et al., 2017; Selvaraju et al., 2017; Zhou et al., 2016; Smilkov et al., 2017; Bach et al., 2015). Recently, Adebayo et al. (2018) carried out a systematic test of many gradient-based salience methods, and only variants of Grad and GradCAM (Selvaraju et al., 2017) passed the proposed sanity checks. We thus choose Grad and its smoothed version, SmoothGrad (Smilkov et al., 2017), for visualization.
¹ Our code is available at https://github.com/PKUAI26/AT-CNN
Formally, let $x \in \mathbb{R}^d$ denote the input image. A trained network is a function $f: \mathbb{R}^d \to \mathbb{R}^K$, where $K$ is the total number of classes. Let $S_c$ denote the class activation function for each class $c$. We seek to obtain a salience map $E \in \mathbb{R}^d$. The Grad explanation is the gradient of the class activation with respect to the input image $x$:
$$E = \frac{\partial S_c(x)}{\partial x} \tag{2}$$
SmoothGrad (Smilkov et al., 2017) was proposed to alleviate noise in gradient explanations by averaging over the gradients of noisy copies of an input. Thus, for an input $x$, the smoothed variant of Grad, SmoothGrad, can be written as
$$E = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial S_c(x_i)}{\partial x_i} \tag{3}$$
where $x_i = x + g_i$, and $g_i$ are noise vectors drawn i.i.d. from a Gaussian distribution $\mathcal{N}(0, \sigma^2)$. In all our experiments, we set $n = 100$ and the noise level $\sigma / (x_{\max} - x_{\min}) = 0.1$. We choose $S_c(x) = \log p_c(x)$, where $p_c(x)$ is the probability of class $c$ assigned by a classifier to input $x$.
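Eqs. (2) and (3) can be sketched in a few lines of NumPy. The example below is an illustration only: it assumes a toy linear softmax classifier so that the gradient of $S_c(x) = \log p_c(x)$ has the closed form $W_c - W^\top p(x)$, whereas for a CNN this gradient would come from automatic differentiation. The function names are hypothetical; the defaults $n = 100$ and noise level 0.1 follow the text.

```python
import numpy as np

def log_prob(W, b, x, c):
    """S_c(x) = log p_c(x) for a toy linear softmax classifier (assumption)."""
    z = W @ x + b
    m = z.max()
    return z[c] - (m + np.log(np.exp(z - m).sum()))

def grad_map(W, b, x, c):
    """Grad explanation, Eq. (2): dS_c/dx = W_c - W^T p(x) for this toy model."""
    z = W @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()
    return W[c] - W.T @ p

def smoothgrad_map(W, b, x, c, n=100, noise_level=0.1, rng=None):
    """SmoothGrad, Eq. (3): average Grad over n Gaussian-perturbed copies of x,
    with sigma = noise_level * (x_max - x_min)."""
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = noise_level * (x.max() - x.min())
    total = np.zeros_like(x, dtype=np.float64)
    for _ in range(n):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        total += grad_map(W, b, noisy, c)
    return total / n
```

With `noise_level=0` the average collapses to the plain Grad map, which is a convenient sanity check for an implementation.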
3. Methods
In this section, we elaborate our method for interpreting adversarially trained CNNs and comparing them with normally trained ones. Three image datasets are considered: Tiny ImageNet,² Caltech-256 (Griffin et al., 2007), and CIFAR-10.
We first visualize the salience maps of AT-CNNs and normal CNNs to demonstrate that models trained in these two different ways are sensitive to different kinds of features. Besides this qualitative comparison, we also test the two kinds of CNNs on differently transformed datasets to distinguish their preferred features.
3.1. Visualizing the salience maps
A straightforward way of investigating the difference between AT-CNNs and CNNs is to visualize which group of pixels the network outputs are most sensitive to. Salience maps generated by Grad and its smoothed variant SmoothGrad are good candidates to show what features a model is sensitive to. We compare the salience maps of AT-CNNs and CNNs on clean images and on images under texture-preserving and shape-preserving distortions. Extensive results can be seen in Section 4.1.
² https://tiny-imagenet.herokuapp.com/
Figure 1. Visualization of three transformations. Original images are from Caltech-256. From left to right: (a) original, (b) stylized, (c) saturation level 8, (d) saturation level 1024, (e) 2 × 2 patch-shuffling, (f) 4 × 4 patch-shuffling.
As pointed out by Smilkov et al. (2017), sensitivity maps based on the Grad method are often visually noisy, highlighting some pixels that, to a human eye, seem randomly selected. SmoothGrad in Eq. (3), on the other hand, can reduce visual noise by averaging the gradient over Gaussian-perturbed images. Thus, we mainly report the salience maps produced by SmoothGrad, and the Grad visualization results are provided in the appendix. Note that the two visualization methods lead to a consistent conclusion on the difference between the two trained CNNs.
3.2. Generalization on shape/texture preserving distortions
Besides visual inspection of sensitivity maps, we propose to measure the sensitivity of AT-CNNs and CNNs to different features by evaluating the performance degradation under several distortions that preserve either shapes or textures. Intuitively, if a model relies heavily on textures, its performance will degrade severely if we destroy most of the textures while preserving other information, such as shapes and other features. However, a perfect disentanglement of texture, shape, and other feature information is impossible (Gatys et al., 2015). In this work, we mainly construct three kinds of image transformations to achieve shape or texture distortion: style transfer, saturation, and patch-shuffling. Some image samples are shown in Figure 1. We also add three Fourier-filtered test sets in the appendix. We now describe each of these transformations and their properties.
Note that we conduct normal training or adversarial training on the original training sets and then evaluate generalization on the transformed data. We never use the transformed datasets during training.
Stylizing. Geirhos et al. (2019) utilized style transfer (Huang & Belongie, 2017) to generate images with conflicting shape and texture information to demonstrate the texture bias of ImageNet-trained standard CNNs. Following the same rationale, we utilize style transfer to destroy most of the textures while preserving the global shape structures in images, and build a stylized test dataset. Therefore, with similar generalization error, models that capture shapes better should also perform better on stylized test images than those biased towards textures. Style-transferred image samples are shown in Figure 1(b).
Saturation. Similar to Ding et al. (2019), we denote the saturation of the image $x$ by $x^{(p)}$, where $p$ indicates the saturation level, ranging from 0 to $\infty$. When $p = 2$, the saturation operation does not change the image. When $p \ge 2$, increasing the saturation level pushes the pixel values towards binarized ones, and $p = \infty$ leads to pure binarization. Specifically, for each pixel of image $x$ with value $v \in [0, 1]$, its corresponding saturated pixel of $x^{(p)}$ is defined as
$$\frac{\operatorname{sign}(2v - 1)\,|2v - 1|^{2/p}}{2} + \frac{1}{2}.$$
One can observe from Figure 1(c) and (d) that increasing the saturation level gradually destroys some texture information while preserving most of the contour structures.
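The saturation formula above is a pointwise map and can be sketched directly in NumPy. The function name `saturate` is a hypothetical choice, and the sketch assumes pixel values already scaled to $[0, 1]$ and a strictly positive saturation level (the limiting case $p = 0$ is not handled).

```python
import numpy as np

def saturate(x, p):
    """Apply sign(2v - 1) * |2v - 1|**(2 / p) / 2 + 1/2 to every pixel v in [0, 1].

    p = 2 leaves the image unchanged, large p pushes pixels towards 0 or 1,
    and small p pushes pixels towards 1/2. p may be np.inf (pure binarization).
    """
    if p <= 0:
        raise ValueError("this sketch only supports p > 0")
    x = np.asarray(x, dtype=np.float64)
    centered = 2.0 * x - 1.0                 # map [0, 1] to [-1, 1]
    exponent = 2.0 / p                       # equals 0.0 when p is infinite
    return np.sign(centered) * np.abs(centered) ** exponent / 2.0 + 0.5
```

For instance, a pixel with value 0.25 stays at 0.25 for $p = 2$, moves to roughly 0.0007 for $p = 1024$, and moves to roughly 0.498 for $p = 0.25$.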
Patch-Shuffling. To destroy long-range shape information, we split images into $k \times k$ small patches and randomly rearrange the order of these patches, with $k \in \{2, 4, 8\}$. Favorably, this operation preserves most of the texture information and destroys most of the shape information. Patch-shuffled image samples are shown in Figure 1(e) and (f). Note that as $k$ increases, more information of the original image is lost, especially for images with low resolution.
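A minimal NumPy sketch of the patch-shuffling operation is given below. The function name `patch_shuffle` is hypothetical, and the sketch assumes an image array of shape (H, W) or (H, W, C) whose height and width are divisible by $k$; how other sizes are handled is not specified in the text, so the sketch simply rejects them.

```python
import numpy as np

def patch_shuffle(img, k, rng=None):
    """Split img into a k x k grid of patches and reassemble them in random order."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = img.shape[:2]
    if h % k or w % k:
        raise ValueError("height and width must be divisible by k in this sketch")
    ph, pw = h // k, w // k
    # Collect the k*k patches in row-major order.
    patches = [img[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(k) for j in range(k)]
    order = rng.permutation(k * k)           # random new position for each patch
    rows = [np.concatenate([patches[order[i * k + j]] for j in range(k)], axis=1)
            for i in range(k)]
    return np.concatenate(rows, axis=0)
```

The pixels inside each patch are untouched, so local texture statistics are preserved while the global arrangement, and hence the object shape, is destroyed.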
Table 1. Accuracy and robustness of all the trained models. Robustness is measured against the PGD attack with bounded $\ell_\infty$ norm. Details are listed in the appendix. Note that the underfitting CNNs have generalization performance similar to that of some of the AT-CNNs on clean images.

             CIFAR-10                Tiny ImageNet           Caltech-256
Model        Accuracy  Robustness    Accuracy  Robustness    Accuracy  Robustness
PGD-inf 8    86.27     44.81         54.42     14.25         66.41     31.16
PGD-inf 4    89.17     30.85         61.85     6.87          72.22     20.10
PGD-inf 2    91.4      39.11         67.06     1.66          76.51     7.51
PGD-inf 1    93.40     7.53          69.42     0.18          79.11     1.70
PGD-L2 12    85.79     34.61         53.44     14.80         65.54     31.36
PGD-L2 8     88.01     26.88         58.21     10.03         69.75     26.19
PGD-L2 4     90.77     13.19         64.24     3.61          74.12     14.33
FGSM 8       84.90     34.25         66.21     0.01          70.88     20.02
FGSM 4       88.13     25.08         63.43     0.13          73.91     15.16
Normal       94.52     0             72.02     0.01          83.32     0
Underfit     86.79     0             60.05     0.01          69.04     0
4. Experiments and analysis
Experiment setup. We describe the experiment setup used to evaluate the performance of AT-CNNs and standard CNNs on data distributions manipulated by the above-mentioned operations. We conduct experiments on three datasets: CIFAR-10, Tiny ImageNet, and Caltech-256 (Griffin et al., 2007). Note that we do not create style-transferred and patch-shuffled test sets for CIFAR-10 due to its limited resolution.
When training on CIFAR-10, we use the ResNet-18 model (He et al., 2016a;b). For data augmentation, we perform zero padding with width 4, horizontal flip, and random crop.
Tiny ImageNet has 200 classes of objects. Each class has 500 training images, 50 validation images, and 50 test images. All images from Tiny ImageNet are of size 64 × 64. We re-scale them to 224 × 224 and perform random horizontal flip and per-image standardization as data augmentation.
Caltech-256 (Griffin et al., 2007) consists of 257 object categories containing a total of 30,607 images. The resolution of images from Caltech-256 is much higher than that of the above two datasets. We manually split 20% of the images as the test set. We perform re-scaling and random cropping following He et al. (2016a). For both Tiny ImageNet and Caltech-256, we use the ResNet-18 model as the network architecture.
Compared models, their generalization and robustness. For all three datasets above, we train three types of AT-CNNs, which mainly differ in the way adversarial examples are generated: FGSM, PGD with bounded $\ell_\infty$ norm, and PGD with bounded $\ell_2$ norm. For each attack method, we train several models under different attack strengths. Details are listed in the appendix. To understand whether the difference in performance degradation between AT-CNNs and standard CNNs is due to the poor generalization (Schmidt et al., 2018; Tsipras et al., 2018) of adversarial training, we also compare the AT-CNNs with an underfitting CNN (trained on clean data) with generalization performance similar to that of AT-CNNs. We train 11 models on each dataset. Their generalization performance on clean data and their robustness measured by the PGD attack are shown in Table 1.
4.1. Visualization results
To investigate what features of an input image AT-CNNs and normal CNNs are most sensitive to, we generate sensitivity maps using SmoothGrad (Smilkov et al., 2017) on clean images, saturated images, and stylized images. The visualization results are presented in Figure 2.
We can easily observe that the salience maps of AT-CNNs are much sparser and mainly focus on the contours of each object on all kinds of images, including the clean, saturated, and stylized ones. In contrast, the sensitivity maps of standard CNNs are noisier and less biased towards the shapes of objects. This is consistent with the findings in Geirhos et al. (2019).
In particular, in the second row of Figure 2, the sensitivity maps of normal CNNs for the "dog" class are still noisy even when the input saturated images are nearly binarized. On the other hand, after adversarial training, the models successfully capture the shape information of the object, providing a more interpretable prediction.
For the stylized images shown in the third row of Figure 2, even with dramatically changed textures after style transfer, AT-CNNs are still able to focus on the shapes of the original object, while standard CNNs totally fail.
Figure 2. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models on images under saturation and stylizing: (a) images from Caltech-256, (b) images from Tiny ImageNet. From top to bottom: original, saturation 1024, and stylizing. For each group of images, from left to right: original image, sensitivity maps of standard CNN, underfitting CNN, and PGD-$\ell_\infty$ AT-CNN.
Due to limited space, we provide more visualization results (including the sensitivity maps generated by the Grad method) in the appendix.
4.2. Generalization performance on transformed data
In this part, we show the generalization performance of AT-CNNs and normal CNNs on image datasets under shape-preserving or texture-preserving distortions. This helps us understand, in a quantitative way, how differently the two types of models are biased.
For all experimental results below, besides the top-1 accuracy, we also report an "accuracy on correctly classified images". This accuracy is measured by first selecting the images from the clean test set that are correctly classified, and then measuring the accuracy on the transformed versions of these correctly classified images.
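The metric described above can be computed from three aligned arrays: the ground-truth labels, the predictions on clean images, and the predictions on the transformed images. The following NumPy sketch illustrates this; the function name and the choice to return NaN when no clean image is correctly classified are assumptions made for the example.

```python
import numpy as np

def accuracy_on_correctly_classified(labels, clean_preds, transformed_preds):
    """Accuracy on transformed images, restricted to the test images that the
    model classifies correctly in their clean form."""
    labels = np.asarray(labels)
    clean_preds = np.asarray(clean_preds)
    transformed_preds = np.asarray(transformed_preds)
    kept = clean_preds == labels             # images correct on the clean test set
    if not kept.any():
        return float("nan")                  # assumption: undefined if nothing is kept
    return float((transformed_preds[kept] == labels[kept]).mean())

# Toy usage: 4 test images, 3 of them correct on clean data, 2 of those 3
# still correct after the transformation.
labels = [0, 1, 2, 1]
clean_preds = [0, 1, 0, 1]
transformed_preds = [0, 2, 2, 1]
score = accuracy_on_correctly_classified(labels, clean_preds, transformed_preds)
```

Because the denominator only counts images that were correct before the transformation, the metric isolates the degradation caused by the transformation from differences in clean accuracy between models.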
4.2.1. STYLIZING
Following Geirhos et al. (2019), we generate stylized versions of the test sets for Caltech-256 and Tiny ImageNet.
We report the "accuracy on correctly classified images" of all the trained models on the stylized test sets in Table 2. Compared with standard CNNs, AT-CNNs achieve higher accuracy on stylized images, whose textures are dramatically changed, despite having lower accuracy on the original test images. The comparison quantitatively shows that AT-CNNs tend to be more invariant with respect to local textures.
4.2.2. SATURATION
We use the saturation operation to manipulate the images and show how increasing saturation levels affects the accuracy of models trained in different ways.
In Figure 4, we visualize images with varying saturation levels. It can be easily observed that increasing the saturation level makes images more "binarized": some textures are wiped out, while edges become sharper and shape information is preserved. When the saturation level is smaller than 2 (i.e., that of the clean image), it pushes all pixels towards 1/2 and nearly all the information is lost; $p = 0$ leads to a totally gray image with a constant pixel value.
We measure the "accuracy on correctly classified images" for all the trained models and show the results in Figure 5. We can observe that as the saturation level increases, more texture information is lost. Favorably, adversarially trained models exhibit much less sensitivity to this texture loss and still obtain high classification accuracy. The results indicate that AT-CNNs are more robust to "saturation" or "binarizing" operations, which suggests that the predictions of AT-CNNs rely less on texture and more on shapes. Results on CIFAR-10 tell the same story and are presented in the appendix due to limited space.
Additionally, in our experiments, for each adversarial training approach, either PGD- or FGSM-based, AT-CNNs with higher robustness towards the PGD adversary are more invariant to increasing saturation levels and texture loss. On the other hand, adversarial training with higher robustness typically hurts generalization on the clean dataset. Our finding also supports the claim that "robustness may be at odds with accuracy" (Tsipras et al., 2018).
Figure 3. Visualization of images from the style-transferred test set. Applying AdaIN (Huang & Belongie, 2017) style transfer distorts the local textures of the original images while the global shape structure is retained. The first row shows images from Caltech-256 and the second row shows images from Tiny ImageNet.
Table 2. "Accuracy on correctly classified images" for different models on the stylized test sets. The columns named "Caltech-256" and "TinyImageNet" show the generalization of different models on the clean test set.

Model                  Caltech-256  Stylized Caltech-256  TinyImageNet  Stylized TinyImageNet
Standard               83.32        16.83                 72.02         7.25
Underfit               69.04        9.75                  60.35         7.16
PGD-$\ell_\infty$ 8    66.41        19.75                 54.42         18.81
PGD-$\ell_\infty$ 4    72.22        21.10                 61.85         20.51
PGD-$\ell_\infty$ 2    76.51        21.89                 67.06         19.25
PGD-$\ell_\infty$ 1    79.11        22.07                 69.42         18.31
PGD-$\ell_2$ 12        65.24        20.14                 53.44         19.33
PGD-$\ell_2$ 8         69.75        21.62                 58.21         20.42
PGD-$\ell_2$ 4         74.12        22.53                 64.24         21.05
FGSM 8                 70.88        21.23                 66.21         15.07
FGSM 4                 73.91        21.99                 63.43         20.22
Figure 4. Illustration of how varying saturation changes the appearance of the image. From left to right, saturation level: 0.25, 0.5, 1, 2 (original image), 4, 8, 16, 64, 1024. Increasing the saturation level pushes pixels towards 0 or 1, which preserves most of the shape while wiping out most of the textures. Decreasing the saturation level pushes all pixels towards 1/2.
[Plots: (a) Caltech-256, (b) Tiny ImageNet. x-axis: saturation level, from $2^{-2}$ to $2^{10}$, with the clean image marked; y-axis: accuracy on correctly classified images, from 0 to 100. Curves: PGD AT with inf norm, PGD AT with l2 norm, FGSM AT, Standard Training, Underfitting.]
Figure 5. "Accuracy on correctly classified images" for different models on saturated Caltech-256 and Tiny ImageNet with respect to different saturation levels. Note that in the plot there are several curves with the same color and line type for each adversarial training method (PGD- and FGSM-based); among them, those trained with larger perturbations achieve better robustness in most cases. Detailed results are listed in the appendix.
[Bar charts: probability (0 to 1.0) of "cake" assigned by each model, one chart per panel.]

Model        (a) Original  (b) Patch-Shuffle 2  (c) Patch-Shuffle 4  (d) Patch-Shuffle 8
Normal       0.750         0.550                0.541                0.005
Underfit     0.952         0.877                0.913                0.305
PGD-inf 8    0.738         0.012                0.002                0.002
PGD-L2 12    0.769         0.043                0.012                0.002
FGSM 8       0.932         0.028                0.012                0.003

Figure 6. Visualization of the patch-shuffling transformation: (a) original image, (b) Patch-Shuffle 2, (c) Patch-Shuffle 4, (d) Patch-Shuffle 8. The first row shows the probability of "cake" assigned by different models.
[Plots: (a) Caltech-256, (b) Tiny ImageNet. x-axis: Patch-Shuffle setting (clean, patch-shuffle 2, patch-shuffle 4, patch-shuffle 8); y-axis: accuracy on correctly classified images, from 0 to 100. Curves: PGD AT with inf norm, PGD AT with l2 norm, FGSM AT, Standard Training, Underfitting.]
Figure 7. "Accuracy on correctly classified images" for different models on patch-shuffled Tiny ImageNet and Caltech-256 with different splitting numbers. Detailed results are listed in the appendix.
When decreasing the saturation level, all models show a similar degree of performance degradation, indicating that AT-CNNs are not robust to all kinds of image distortions. They tend to be more robust to certain fixed types of distortions. We leave further investigation of this issue as future work.
4.2.3. PATCH-SHUFFLING
The stylizing and saturation operations aim at changing or removing the texture information of the original images while preserving the shape and edge features. To test the different biases of AT-CNNs and standard CNNs the other way around, we shatter the shape and edge information by splitting the images into $k \times k$ patches and then randomly shuffling them. This operation still maintains the local textures if $k$ is not too large.
Figure 6 shows one example of patch-shuffled images under different numbers of splits. The first row shows the probabilities assigned by different models to the ground truth class of the original image. After random shuffling, the shape and edge features are destroyed dramatically: the prediction probability of the adversarially trained CNNs drops significantly, while the normal CNNs still maintain high confidence in the ground truth class. This reveals that AT-CNNs are more biased towards shapes and edges than normally trained ones.
Moreover, Figure 7 depicts the "accuracy on correctly classified images" for all the models measured on the "Patch-shuffled" test sets with an increasing number of splitting pieces. AT-CNNs, especially those trained against a stronger attack, are more sensitive to "Patch-shuffling" operations in most of our experiments.
Note that under the "Patch-shuffle 8" operation, all models have similar "accuracy on correctly classified images", which is largely due to the severe information loss. Also note that this accuracy of all models on Tiny ImageNet, shown in Figure 7(b), is much lower than that on Caltech-256 in Figure 7(a). That is, under "Patch-shuffle 1", the normally trained CNN has an accuracy of 84.76% on Caltech-256 but only 66.73% on Tiny ImageNet. This mainly originates from the limited resolution of Tiny ImageNet, since the "Patch-Shuffle" operation destroys more useful features on low-resolution images than on higher-resolution ones.
5. Related work and discussion
Interpreting AT-CNNs. Recently, there have been some relevant findings indicating that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-$\varepsilon$ adversarial examples against AT-CNNs, it can be observed that the adversarial examples capture salient data characteristics of a different class and appear semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we have conducted a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformations that can largely change the textures while preserving shape information (i.e., stylizing and saturation) or shatter the shape/edge features while keeping the local textures (i.e., patch-shuffling). Evaluating the generalization of AT-CNNs over these designed datasets provides a quantitative way to verify and interpret their strong shape bias compared with normal CNNs.
Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness towards adversarial examples is correlated with the capability of capturing long-range features like shapes or contours. This naturally raises the question of whether other models that capture more global features or have more texture invariance could be more robust to adversarial examples, even without adversarial training. This might provide some insights for designing new network architectures or new strategies for enhancing the bias towards long-range features. Some recent works partially answer this question. Xie et al. (2018) enhanced standard CNNs with non-local blocks inspired by (Wang et al., 2018; Vaswani et al., 2017), which capture long-range dependencies in a data-dependent manner; when combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features. With this design, they achieved improved black-box robustness.
Adversarial training with other types of attacks. In this work, we mainly interpret AT-CNNs based on norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may depend heavily on the type of adversary. Models trained against a spatially-transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have robustness towards the PGD attack similar to that of standard models, and their salience maps are still quite different, as shown in Figure 8. Also, the average distance between salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks, $x_{adv} = G(x; w)$ parameterized by $w$, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs based on these generalized types of attacks as future work.
Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models. From left to right: original image, sensitivity maps of standard CNN, PGD-$\ell_\infty$ AT-CNN, and ST-AT-CNN.
6. Conclusion
From both qualitative and quantitative perspectives, we have conducted a systematic study on interpreting adversarially trained convolutional neural networks. By constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized, and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two models on the three constructed datasets: stylized, saturated, and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while normally trained CNNs behave the other way around.
Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.
Acknowledgement
This work is supported by the National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI), and the Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.
References
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.
Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
Ballester, P. and de Araujo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.
Brendel, W. and Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.
Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.
Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.
Jo, J. and Bengio, Y. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.
Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.
Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
A. Experiment Setup
A.1. Models

• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units.

• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture using the code from PyTorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them using ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch.

We evaluate the robustness of all our models using an $\ell_\infty$ projected gradient descent adversary with $\epsilon = 8/255$, step size = 2, and 40 iterations.
A.2. Adversarial Training

We perform 9 types of adversarial training on each dataset. 7 of the 9 kinds of adversarial training are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).

A.2.1. TRAIN AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY

We list the values of $\epsilon$ used for adversarial training on each dataset and $\ell_p$-norm. In all settings, PGD runs for 20 iterations.

• $\ell_\infty$-norm bounded adversary: For all three datasets, pixel values range from 0 to 1. We train 4 adversarially trained CNNs with $\epsilon \in \{1/255, 2/255, 4/255, 8/255\}$; these four models are denoted as PGD-inf: 1, 2, 4, 8 respectively, with step sizes 1/255, 1/255, 2/255, 4/255.

• $\ell_2$-norm bounded adversary: For Caltech-256 & Tiny ImageNet, the input size for our model is 224 × 224. We train three adversarially trained CNNs with $\epsilon \in \{4, 8, 12\}$, and these three models are denoted as PGD-l2: 4, 8, 12 respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have $\epsilon \in \{4/10, 8/10, 12/10\}$, but they are denoted in the same way and have the same step sizes as those for Caltech-256 & Tiny ImageNet.

A.2.2. TRAIN AGAINST AN FGSM ADVERSARY

The values of $\epsilon$ for these two adversarially trained CNNs are $\epsilon \in \{4, 8\}$, and they are denoted as FGSM: 4, 8 respectively.
B. Style-transferred test set
Following Geirhos et al. (2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of $\alpha = 1.0$ to every test image, with the style of a randomly selected painting from Kaggle's Painter by Numbers dataset³. We used the source code provided by Geirhos et al. (2019).
C. Experiments on Fourier-filtered datasets
Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.

C.1. Fourier filtering setup

Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set.

• The low frequency filtered version: We use a radial mask in the Fourier domain to set higher frequency modes to zero (low-pass filtering).

• The high frequency filtered version: We use a radial mask in the Fourier domain to preserve only the higher frequency modes (high-pass filtering).

• The random filtered version: We use a random mask in the Fourier domain to set each mode to 0 with probability p uniformly. The random mask is generated on the fly during the test.
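The three filters above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: it uses a naive 2D discrete Fourier transform on a tiny grayscale image, and the helper names (`dft2`, `radial_mask`, `random_mask`, `apply_mask`), the radius convention, and the choice to keep only the real part after the inverse transform are all assumptions made here.

```python
import cmath
import math
import random


def dft2(img, inverse=False):
    """Naive 2D DFT (or inverse DFT) of a small 2D list; O(N^4), for illustration only."""
    n, m = len(img), len(img[0])
    sign = 1 if inverse else -1
    out = [[0j] * m for _ in range(n)]
    for u in range(n):
        for v in range(m):
            s = 0j
            for y in range(n):
                for x in range(m):
                    angle = 2 * math.pi * (u * y / n + v * x / m)
                    s += img[y][x] * cmath.exp(sign * 1j * angle)
            out[u][v] = s / (n * m) if inverse else s
    return out


def radial_mask(n, m, radius, keep_low):
    """1.0 where a mode is kept. Distance is measured from the zero-frequency mode
    with wrap-around, so the mask is symmetric and the filtered image stays real."""
    mask = []
    for u in range(n):
        row = []
        for v in range(m):
            dist = math.hypot(min(u, n - u), min(v, m - v))
            is_low = dist <= radius
            row.append(1.0 if is_low == keep_low else 0.0)
        mask.append(row)
    return mask


def random_mask(n, m, p, rng):
    """Set each Fourier mode to zero independently with probability p."""
    return [[0.0 if rng.random() < p else 1.0 for _ in range(m)] for _ in range(n)]


def apply_mask(img, mask):
    """Filter an image by masking its spectrum; the real part is kept (an assumption)."""
    spectrum = dft2(img)
    masked = [[s * k for s, k in zip(srow, krow)] for srow, krow in zip(spectrum, mask)]
    return [[z.real for z in row] for row in dft2(masked, inverse=True)]
```

Because the low-pass and high-pass masks are complementary, the two filtered images sum back to the original, which is a convenient sanity check for any real implementation (which would normally use an FFT routine rather than this naive transform).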
C.2. Results

We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that the predictions of AT-CNNs depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are closer to low-frequency information.
D. Detailed results
We list the detailed results of our quantitative experiments here. Tables 4, 5, and 6 show the results of each model on the test sets with different saturation levels. Tables 7 and 8 list the results of each model on the test sets after different patch-shuffling operations.

³ https://www.kaggle.com/c/painter-by-numbers/

Table 3. "Accuracy on correctly classified images" for different models on three Fourier-filtered Caltech-256 test sets.

DATA SET    LOW FREQUENCY FILTERED    HIGH FREQUENCY FILTERED    RANDOM FILTERED
STANDARD    15.8                      16.5                       73.5
UNDERFIT    14.5                      17.6                       62.2
PGD-l∞      71.1                      3.6                        73.4
E. Additional Figures
We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps generated by Grad and SmoothGrad in Figure 10.
Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
UNDERFIT           31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞ 8           32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞ 4           31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞ 2           32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞ 1           32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2 12          31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2 8           32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2 4           32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM 8             29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM 4             32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78
Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           7.24    25.88   72.52   72.73   25.38   8.24    2.62    1.93
UNDERFIT           7.34    25.44   69.80   60.67   18.01   6.72    3.16    2.65
PGD-l∞ 8           11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-l∞ 4           12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-l∞ 2           12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-l∞ 1           11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-l2 12          11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-l2 8           12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-l2 4           12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM 8             12.59   32.66   70.55   81.53   41.83   17.52   7.29    5.82
FGSM 4             12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14
Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
UNDERFIT           21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-l∞ 8           26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-l∞ 4           27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-l∞ 2           28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-l∞ 1           27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-l2 12          25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-l2 8           27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-l2 4           27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM 8             25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM 4             26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82
Table 7. "Accuracy on correctly classified images" for different models on the patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Caltech-256.

DATA SET     2 × 2    4 × 4    8 × 8
STANDARD     84.76    51.50    10.84
UNDERFIT     75.59    33.41    6.03
PGD-l∞ 8     58.13    20.14    7.70
PGD-l∞ 4     68.54    26.45    8.18
PGD-l∞ 2     74.25    30.77    9.00
PGD-l∞ 1     78.11    35.03    8.42
PGD-l2 12    58.25    21.03    7.85
PGD-l2 8     63.36    22.19    8.48
PGD-l2 4     69.65    28.21    7.72
FGSM 8       64.48    22.94    8.07
FGSM 4       70.50    28.41    6.03
Table 8. "Accuracy on correctly classified images" for different models on the patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Tiny ImageNet.

DATA SET     2 × 2    4 × 4    8 × 8
STANDARD     66.73    24.87    4.48
UNDERFIT     59.22    23.62    4.38
PGD-l∞ 8     41.08    16.05    6.83
PGD-l∞ 4     49.54    18.23    6.30
PGD-l∞ 2     55.96    19.95    5.61
PGD-l∞ 1     60.19    23.24    6.08
PGD-l2 12    42.23    16.95    7.66
PGD-l2 8     47.67    16.28    6.50
PGD-l2 4     51.94    17.79    5.89
FGSM 8       57.42    20.70    4.73
FGSM 4       50.68    16.84    5.98
Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4.
Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4. It is easily observed that sensitivity maps generated from Grad are noisier than those from its smoothed variant SmoothGrad, especially for standard CNNs and underfitting CNNs.
shuffled images, then evaluate the classification accuracy of AT-CNNs and normal CNNs on these datasets. These carefully designed experiments provide a quantitative comparison between the two types of CNNs and demonstrate their biases when making predictions.

To the best of our knowledge, we are the first to conduct a systematic investigation on interpreting adversarially trained CNNs, both visually and quantitatively. Our findings shed some light on why AT-CNNs are more robust than normally trained ones, and also contribute to a better understanding of adversarial training over CNNs from an interpretation perspective.¹

The remainder of the paper is structured as follows. We introduce background knowledge on adversarial training and salience methods in Section 2. The methods for interpreting AT-CNNs are described in Section 3. Then we present the experimental results that support our findings in Section 4. Related work and discussions are presented in Section 5. Section 6 concludes the paper.
2. Preliminary
2.1. Adversarial training

Adversarial training was first proposed by Goodfellow et al. (2014) and is so far the most successful approach for building models that are robust to adversarial examples (Madry et al., 2018; Sinha et al., 2018; Athalye et al., 2018; Zhang et al., 2019b;a). It can be formulated as solving a robust optimization problem (Shaham et al., 2015):
\[
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\delta \in S} \ell\big(f(x + \delta; \theta), y\big) \right], \tag{1}
\]
where $f(x; \theta)$ represents the neural network parameterized by weights $\theta$, the input-output pair $(x, y)$ is sampled from the training set $\mathcal{D}$, $\delta$ denotes the adversarial perturbation, and $\ell(\cdot, \cdot)$ is the chosen loss function, e.g., cross-entropy loss. $S$ denotes a certain norm constraint, such as $\ell_\infty$ or $\ell_2$.

The inner maximization is approximated by adversarial examples generated by various attack methods. Training against a projected gradient descent (PGD; Madry et al. (2018)) adversary leads to state-of-the-art white-box robustness. We use PGD-based adversarial training with bounded $\ell_\infty$ and $\ell_2$ norm constraints. We also investigate FGSM-based (Goodfellow et al., 2014) adversarial training.
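To make the inner maximization of Eq. (1) concrete, the following is a minimal sketch of an $\ell_\infty$-bounded PGD attack on a toy logistic-regression model. It is not the authors' training code: the model, the function names (`loss_and_grad`, `pgd_linf`), and the omission of a random start are simplifying assumptions, chosen so the input gradient can be written in closed form without a deep learning framework.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(x, y, w, b):
    """Cross-entropy loss of a toy logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    tiny = 1e-12  # avoids log(0)
    loss = -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))
    grad = [(p - y) * wi for wi in w]
    return loss, grad


def pgd_linf(x, y, w, b, epsilon, step_size, n_steps):
    """Approximate max over ||delta||_inf <= epsilon of the loss by projected gradient ascent."""
    delta = [0.0] * len(x)
    for _ in range(n_steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        _, grad = loss_and_grad(x_adv, y, w, b)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            d = delta[i] + step_size * sign          # signed gradient ascent step
            d = max(-epsilon, min(epsilon, d))       # project onto the l_inf ball
            d = max(0.0, min(1.0, x[i] + d)) - x[i]  # keep pixels inside [0, 1]
            delta[i] = d
    return [xi + di for xi, di in zip(x, delta)]
```

In adversarial training, the outer minimization would then update the model parameters on the loss evaluated at the returned adversarial input instead of the clean one.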
2.2. Salience maps

Given a trained neural network, visualizing salience maps aims at assigning a sensitivity value, sometimes also called "attribution", to show the sensitivity of the output to each pixel of an input image. Following Ancona et al. (2018), salience methods can mainly be divided into perturbation-based methods (Zeiler & Fergus, 2014; Zintgraf et al., 2017) and gradient-based methods (Erhan et al., 2009; Simonyan et al., 2013; Shrikumar et al., 2017; Sundararajan et al., 2017; Selvaraju et al., 2017; Zhou et al., 2016; Smilkov et al., 2017; Bach et al., 2015). Recently, Adebayo et al. (2018) carried out a systematic test of many gradient-based salience methods, and only variants of Grad and GradCAM (Selvaraju et al., 2017) passed the proposed sanity checks. We thus choose Grad and its smoothed version SmoothGrad (Smilkov et al., 2017) for visualization.

¹ Our code is available at https://github.com/PKUAI26/AT-CNN

Formally, let $x \in \mathbb{R}^d$ denote the input image. A trained network is a function $f: \mathbb{R}^d \to \mathbb{R}^K$, where $K$ is the total number of classes. Let $S_c$ denote the class activation function for each class $c$. We seek to obtain a salience map $E \in \mathbb{R}^d$. The Grad explanation is the gradient of the class activation with respect to the input image $x$:
\[
E = \frac{\partial S_c(x)}{\partial x}. \tag{2}
\]
SmoothGrad (Smilkov et al., 2017) was proposed to alleviate noise in the gradient explanation by averaging the gradients of noisy copies of an input. Thus, for an input $x$, the smoothed variant of Grad, SmoothGrad, can be written as
\[
E = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial S_c(x_i)}{\partial x_i}, \tag{3}
\]
where $x_i = x + g_i$, and $g_i$ are noise vectors drawn i.i.d. from a Gaussian distribution $\mathcal{N}(0, \sigma^2)$. In all our experiments, we set $n = 100$ and the noise level $\sigma / (x_{\max} - x_{\min}) = 0.1$. We choose $S_c(x) = \log p_c(x)$, where $p_c(x)$ is the probability of class $c$ assigned by a classifier to input $x$.
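As an illustration of Eqs. (2) and (3), the sketch below computes Grad and SmoothGrad for a toy linear softmax classifier, where the gradient of $S_c(x) = \log p_c(x)$ has the closed form $W_c - \sum_k p_k(x) W_k$. The classifier, the function names (`log_prob_grad`, `smoothgrad`), and the fixed random seed are assumptions for this example only; for a real CNN the gradient would come from automatic differentiation.

```python
import math
import random


def log_prob_grad(x, W, c):
    """Grad (Eq. 2): gradient of S_c(x) = log p_c(x) w.r.t. x for logits W @ x."""
    logits = [sum(wk_i * xi for wk_i, xi in zip(wk, x)) for wk in W]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # shifted for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return [
        W[c][i] - sum(probs[k] * W[k][i] for k in range(len(W)))
        for i in range(len(x))
    ]


def smoothgrad(x, W, c, n=100, noise_level=0.1, rng=None):
    """SmoothGrad (Eq. 3): average Grad over n Gaussian-perturbed copies of x.

    noise_level plays the role of sigma / (x_max - x_min).
    """
    rng = rng or random.Random(0)
    sigma = noise_level * (max(x) - min(x))
    total = [0.0] * len(x)
    for _ in range(n):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        grad = log_prob_grad(noisy, W, c)
        total = [t + g for t, g in zip(total, grad)]
    return [t / n for t in total]
```

With `noise_level=0.0` the average collapses to plain Grad, which gives a simple consistency check between the two explanations.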
3. Methods
In this section, we elaborate on our method for interpreting adversarially trained CNNs and comparing them with normally trained ones. Three image datasets are considered: Tiny ImageNet², Caltech-256 (Griffin et al., 2007), and CIFAR-10.

We first visualize the salience maps of AT-CNNs and normal CNNs to demonstrate that the two models, trained in different ways, are sensitive to different kinds of features. Besides this qualitative comparison, we also test the two kinds of CNNs on differently transformed datasets to distinguish their preferred features.

3.1. Visualizing the salience maps
A straightforward way of investigating the difference between AT-CNNs and CNNs is to visualize which group of pixels the network outputs are most sensitive to. Salience maps generated by Grad and its smoothed variant SmoothGrad are good candidates to show what features a model is sensitive to. We compare the salience maps of AT-CNNs and CNNs on clean images and on images under texture-preserving and shape-preserving distortions. Extensive results can be seen in Section 4.1.

² https://tiny-imagenet.herokuapp.com/

Figure 1. Visualization of three transformations. Original images are from Caltech-256. From left to right: (a) original, (b) stylized, (c) saturation level 8, (d) saturation level 1024, (e) 2 × 2 patch-shuffling, (f) 4 × 4 patch-shuffling.
As pointed out by Smilkov et al. (2017), sensitivity maps based on the Grad method are often visually noisy, highlighting some pixels that, to a human eye, seem randomly selected. SmoothGrad in Eq. (3), on the other hand, can reduce visual noise by averaging the gradient over Gaussian-perturbed images. Thus, we mainly report the salience maps produced by SmoothGrad, and the Grad visualization results are provided in the appendix. Note that the two visualization methods lead to a consistent conclusion on the difference between the two types of trained CNNs.

3.2. Generalization on shape/texture preserving distortions

Besides visual inspection of sensitivity maps, we propose to measure the sensitivity of AT-CNNs and CNNs to different features by evaluating the performance degradation under several distortions that preserve either shapes or textures. Intuitively, if a model relies heavily on textures, its performance will degrade severely if we destroy most of the textures while preserving other information, such as shapes and other features. However, a perfect disentanglement of texture, shape, and other feature information is impossible (Gatys et al., 2015). In this work, we mainly construct three kinds of image transformations to achieve shape or texture distortion: style-transfer, saturating, and patch-shuffling operations. Some image samples are shown in Figure 1. We also report results on three Fourier-filtered test sets in the appendix. We now describe each of these transformations and their properties.
Note that we conduct normal training or adversarial training on the original training sets, and then evaluate the generalizability of the models over the transformed data. During training, we never use the transformed datasets.
Stylizing. Geirhos et al. (2019) utilized style transfer (Huang & Belongie, 2017) to generate images with conflicting shape and texture information to demonstrate the texture bias of ImageNet-trained standard CNNs. Following the same rationale, we utilize style transfer to destroy most of the textures while preserving the global shape structures in images, and build a stylized test dataset. Therefore, with similar generalization error, models that capture shapes better should also perform better on stylized test images than those biased towards textures. The style-transferred image samples are shown in Figure 1(b).
Saturation. Similar to Ding et al. (2019), we denote the saturated version of the image $x$ by $x^{p}$, where $p$ indicates the saturation level, ranging from 0 to $\infty$. When $p = 2$, the saturation operation does not change the image. When $p \geq 2$, increasing the saturation level pushes the pixel values towards binarized ones, and $p = \infty$ leads to pure binarization. Specifically, for each pixel of image $x$ with value $v \in [0, 1]$, the corresponding saturated pixel of $x^{p}$ is defined as

\[
\operatorname{sign}(2v - 1)\,\frac{|2v - 1|^{2/p}}{2} + \frac{1}{2}.
\]

One can observe from Figure 1(c) and (d) that increasing the saturation level gradually destroys some texture information while preserving most of the contour structures.
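The saturation formula above translates directly into code. The following is a small sketch for images stored as nested lists of values in [0, 1]; the function names (`saturate_pixel`, `saturate_image`) and the explicit handling of $p = \infty$ are choices made for this example rather than details taken from the paper.

```python
def saturate_pixel(v, p):
    """Apply sign(2v - 1) * |2v - 1|^(2/p) / 2 + 1/2 to one pixel value v in [0, 1]."""
    centered = 2.0 * v - 1.0
    sign = (centered > 0) - (centered < 0)
    if p == float("inf"):
        # Limit case: every non-mid-gray pixel is pushed to 0 or 1.
        magnitude = 1.0 if centered != 0 else 0.0
    else:
        magnitude = abs(centered) ** (2.0 / p)
    return sign * magnitude / 2.0 + 0.5


def saturate_image(image, p):
    """Apply the saturation operation pixel-wise to a 2D list of values in [0, 1]."""
    return [[saturate_pixel(v, p) for v in row] for row in image]
```

For $p = 2$ the exponent is 1 and the function reduces to the identity, while large $p$ drives values towards 0 or 1, matching the behaviour described in the text.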
Patch-Shuffling. To destroy long-range shape information, we split images into $k \times k$ small patches and randomly rearrange the order of these patches, with $k \in \{2, 4, 8\}$. Favorably, this operation preserves most of the texture information and destroys most of the shape information. The patch-shuffled image samples are shown in Figure 1(e) and (f). Note that as $k$ increases, more information of the original image is lost, especially for images with low resolution.
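A minimal sketch of the patch-shuffling operation is given below for a single-channel image stored as a list of rows. The function name `patch_shuffle`, the seeded random generator, and the requirement that the image size be divisible by k are assumptions of this example.

```python
import random


def patch_shuffle(image, k, rng=None):
    """Split an H x W image into a k x k grid of patches and permute them at random."""
    rng = rng or random.Random(0)
    h, w = len(image), len(image[0])
    if h % k or w % k:
        raise ValueError("image height and width must be divisible by k")
    ph, pw = h // k, w // k
    # Collect patches in row-major order; each patch is a list of ph rows of pw pixels.
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(k)
        for c in range(k)
    ]
    rng.shuffle(patches)
    # Reassemble the shuffled patches into a full image.
    out = []
    for r in range(k):
        for i in range(ph):
            row = []
            for c in range(k):
                row.extend(patches[r * k + c][i])
            out.append(row)
    return out
```

Every pixel value survives the operation and each patch stays internally intact, so local texture statistics are preserved while the global arrangement is destroyed.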
Table 1. Accuracy and robustness of all the trained models. Robustness is measured against the PGD attack with bounded $\ell_\infty$ norm. Details are listed in the appendix. Note that underfitting CNNs have similar generalization performance to some of the AT-CNNs on clean images.

             CIFAR10                 TinyImageNet            Caltech 256
             Accuracy   Robustness   Accuracy   Robustness   Accuracy   Robustness
PGD-inf 8    86.27      44.81        54.42      14.25        66.41      31.16
PGD-inf 4    89.17      30.85        61.85      6.87         72.22      20.10
PGD-inf 2    91.4       39.11        67.06      1.66         76.51      7.51
PGD-inf 1    93.40      7.53         69.42      0.18         79.11      1.70
PGD-L2 12    85.79      34.61        53.44      14.80        65.54      31.36
PGD-L2 8     88.01      26.88        58.21      10.03        69.75      26.19
PGD-L2 4     90.77      13.19        64.24      3.61         74.12      14.33
FGSM 8       84.90      34.25        66.21      0.01         70.88      20.02
FGSM 4       88.13      25.08        63.43      0.13         73.91      15.16
Normal       94.52      0            72.02      0.01         83.32      0
Underfit     86.79      0            60.05      0.01         69.04      0
4. Experiments and analysis
Experiments setup. We describe the experiment setup used to evaluate the performance of AT-CNNs and standard CNNs on data distributions manipulated by the above-mentioned operations. We conduct experiments on three datasets: CIFAR-10, Tiny ImageNet, and Caltech-256 (Griffin et al., 2007). Note that we do not create the style-transferred and patch-shuffled test sets for CIFAR-10 due to its limited resolution.

When training on CIFAR-10, we use the ResNet-18 model (He et al., 2016a;b). For data augmentation, we perform zero padding with width 4, horizontal flip, and random crop.
Tiny ImageNet has 200 classes of objects Each class has500 training images 50 validation images and 50 test im-ages All images from Tiny ImageNet are of size 64times 64We re-scale them to 224times224 and perform random horizon-tal flip and per-image standardization as data augmentation
Caltech-256 (Griffin et al 2007) consists of 257 objectcategories containing a total of 30607 images Resolutionof images from Caltech is much higher compared with theabove two datasets We manually split 20 of images asthe test set We perform re-scaling and random croppingfollowing (He et al 2016a) For both Tiny ImageNet andCaltech-256 we use ResNet-18 model as the network archi-tecture
Compared models, their generalization and robustness. For all three datasets above, we train three types of AT-CNNs, which mainly differ in the way adversarial examples are generated: FGSM, PGD with bounded $\ell_\infty$ norm, and PGD with bounded $\ell_2$ norm. For each attack method, we train several models under different attack strengths. Details are listed in the appendix. To understand whether the difference in performance degradation between AT-CNNs and standard CNNs is due to the poor generalization (Schmidt et al., 2018; Tsipras et al., 2018) of adversarial training, we also compare the AT-CNNs with an underfitting CNN (trained over clean data) whose generalization performance is similar to that of the AT-CNNs. We train 11 models on each dataset. Their generalization performance on clean data and their robustness measured by PGD attack are shown in Table 1.
4.1 Visualization results

To investigate which features of an input image AT-CNNs and normal CNNs are most sensitive to, we generate sensitivity maps using SmoothGrad (Smilkov et al., 2017) on clean images, saturated images, and stylized images. The visualization results are presented in Figure 2.
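SmoothGrad (Smilkov et al., 2017) averages a gradient-based sensitivity map over several Gaussian-perturbed copies of the input. The sketch below shows only that averaging step in plain Python. It assumes a hypothetical callable `grad_fn` that returns the gradient of the class score with respect to a flattened input; the sample count and noise level are illustrative defaults, not the settings used in the paper.

```python
import random


def smoothgrad(grad_fn, x, n_samples=50, sigma=0.1, rng=None):
    """Average grad_fn over n_samples Gaussian-perturbed copies of x."""
    if rng is None:
        rng = random.Random()
    total = [0.0] * len(x)
    for _ in range(n_samples):
        # Perturb every input coordinate with independent N(0, sigma^2) noise.
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        grad = grad_fn(noisy)  # hypothetical: d(score)/d(input) at the noisy point
        total = [t + g for t, g in zip(total, grad)]
    return [t / n_samples for t in total]
```

In practice `grad_fn` would wrap a backward pass through the trained network; for a model whose score is linear in the input, the result equals the plain gradient, since the noise has no effect on a constant gradient.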
We can easily observe that the salience maps of AT-CNNs are much sparser and mainly focus on the contours of each object on all kinds of images, including the clean, saturated, and stylized ones. In contrast, sensitivity maps of standard CNNs are noisier and less biased towards the shapes of objects. This is consistent with the findings in (Geirhos et al., 2019).

In particular, in the second row of Figure 2, sensitivity maps of normal CNNs for the "dog" class are still noisy even when the input saturated images are nearly binarized. On the other hand, after adversarial training, the models successfully capture the shape information of the object, providing a more interpretable prediction.
For the stylized images shown in the third row of Figure 2, even with dramatically changed textures after style transfer, AT-CNNs are still able to focus on the shapes of the original object, while standard CNNs totally fail.

(a) Images from Caltech-256 (b) Images from Tiny ImageNet
Figure 2. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models on images under saturation and stylizing. From top to bottom: Original, Saturation 1024, and Stylizing. For each group of images, from left to right: original image, sensitivity maps of standard CNN, underfitting CNN, and PGD-$\ell_\infty$ AT-CNN.
Due to the limited space, we provide more visualization results (including the sensitivity maps generated by the Grad method) in the appendix.
4.2 Generalization performance on transformed data

In this part, we mainly show the generalization performance of AT-CNNs and normal CNNs on distorted image datasets that preserve either shape or texture. This helps us understand, in a quantitative way, how differently the two types of models are biased.

For all experimental results below, besides the top-1 accuracy, we also report an "accuracy on correctly classified images". This accuracy is measured by first selecting the images from the clean test set that are correctly classified, and then measuring the accuracy on the transformed versions of these correctly classified images.
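The "accuracy on correctly classified images" metric can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the three parallel lists (ground-truth labels, predictions on clean images, predictions on the transformed versions of the same images) are hypothetical, and the result is returned as a percentage.

```python
def accuracy_on_correctly_classified(labels, clean_preds, transformed_preds):
    """Accuracy (in %) on transformed images, restricted to images whose
    clean version the model classified correctly."""
    # Step 1: keep only the indices that are correct on the clean test set.
    kept = [i for i, (y, p) in enumerate(zip(labels, clean_preds)) if y == p]
    if not kept:
        raise ValueError("no clean image was classified correctly")
    # Step 2: measure accuracy on the transformed versions of those images.
    hits = sum(1 for i in kept if transformed_preds[i] == labels[i])
    return 100.0 * hits / len(kept)
```

Conditioning on clean correctness means a model with lower clean accuracy is not penalized twice, so the metric compares how much each model degrades under a transformation.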
4.2.1 STYLIZING

Following Geirhos et al. (2019), we generate stylized versions of the test sets for Caltech-256 and Tiny ImageNet.

We report the "accuracy on correctly classified images" of all the trained models on the stylized test sets in Table 2. Compared with standard CNNs, AT-CNNs have a lower accuracy on the original test images but achieve a higher accuracy on the stylized ones, whose textures are dramatically changed. The comparison quantitatively shows that AT-CNNs tend to be more invariant with respect to local textures.
4.2.2 SATURATION

We use the saturation operation to manipulate the images and show how increasing saturation levels affects the accuracy of models trained in different ways.

In Figure 4, we visualize images with varying saturation levels. It can be easily observed that increasing the saturation level makes images more "binarized": some textures are wiped out, while edges become sharper and shape information is preserved. When the saturation level is smaller than 2 (i.e., that of the clean image), the operation pushes all pixels towards 1/2 and nearly all the information is lost; p = 0 leads to a totally gray image with constant pixel value.

We measure the "accuracy on correctly classified images" for all the trained models and show the results in Figure 5. We can observe that, with increasing saturation level, more texture information is lost. Favorably, adversarially trained models exhibit much less sensitivity to this texture loss, still obtaining a high classification accuracy. The results indicate that AT-CNNs are more robust to "saturation" or "binarizing" operations, which suggests that the prediction capability of AT-CNNs relies less on texture and more on shapes. Results on CIFAR-10 tell the same story; due to the limited space, they are presented in the appendix.

Additionally, in our experiments, for each adversarial training approach (either PGD or FGSM based), AT-CNNs with higher robustness towards the PGD adversary are more invariant to increasing saturation level and texture loss. On the other hand, adversarial training with higher robustness typically hurts generalization on the clean dataset. Our finding also supports the claim that "robustness may be at odds with accuracy" (Tsipras et al., 2018).
Figure 3. Visualization of images from the style-transferred test set. Applying AdaIN (Huang & Belongie, 2017) style transfer distorts local textures of the original images, while the global shape structure is retained. The first row shows images from Caltech-256, and the second row shows images from Tiny ImageNet.
Table 2. "Accuracy on correctly classified images" for different models on the stylized test set. The columns named "Caltech-256" and "TinyImageNet" show the generalization of different models on the clean test set.

DATASET      CALTECH-256   STYLIZED CALTECH-256   TINYIMAGENET   STYLIZED TINYIMAGENET
STANDARD     83.32         16.83                  72.02          7.25
UNDERFIT     69.04         9.75                   60.35          7.16
PGD-l∞ 8     66.41         19.75                  54.42          18.81
PGD-l∞ 4     72.22         21.10                  61.85          20.51
PGD-l∞ 2     76.51         21.89                  67.06          19.25
PGD-l∞ 1     79.11         22.07                  69.42          18.31
PGD-l2 12    65.24         20.14                  53.44          19.33
PGD-l2 8     69.75         21.62                  58.21          20.42
PGD-l2 4     74.12         22.53                  64.24          21.05
FGSM 8       70.88         21.23                  66.21          15.07
FGSM 4       73.91         21.99                  63.43          20.22
Figure 4. Illustration of how varying saturation changes the appearance of the image. From left to right, saturation level: 0.25, 0.5, 1, 2 (original image), 4, 8, 16, 64, 1024. Increasing the saturation level pushes pixels towards 0 or 1, which preserves most of the shape while wiping out most of the textures. Decreasing the saturation level pushes all pixels to 1/2.
[Figure 5 plots: (a) Caltech-256, (b) Tiny ImageNet. x-axis: Saturation Level (ticks 2^-2, 2^0, 2^2, 2^4, 2^6, 2^8, 2^10, with the clean image marked); y-axis: Accuracy on correctly classified images (0 to 100). Curves: PGD AT with inf norm, PGD AT with l2 norm, FGSM AT, Standard Training, Underfitting.]
Figure 5. "Accuracy on correctly classified images" for different models on saturated Caltech-256 and Tiny ImageNet with respect to different saturation levels. Note that the plot contains several curves with the same color and line type for each adversarial training method (PGD and FGSM-based); among these, the models trained with larger perturbations achieve better robustness in most cases. Detailed results are listed in the appendix.
[Figure 6 bar charts: probability (0.0 to 1.0) assigned to the class "cake" by each model, for each panel.]

                      Normal   Underfit   PGD-inf 8   PGD-L2 12   FGSM 8
(a) Original Image    0.750    0.952      0.738       0.769       0.932
(b) Patch-Shuffle 2   0.550    0.877      0.012       0.043       0.028
(c) Patch-Shuffle 4   0.541    0.913      0.002       0.012       0.012
(d) Patch-Shuffle 8   0.005    0.305      0.002       0.002       0.003

Figure 6. Visualization of the patch-shuffling transformation. The first row shows the probability of "cake" assigned by different models.
[Figure 7 plots: (a) Caltech-256, (b) Tiny ImageNet. x-axis: Patch-Shuffle (clean, patch-shuffle 2, patch-shuffle 4, patch-shuffle 8); y-axis: Accuracy on correctly classified images (0 to 100). Curves: PGD AT with inf norm, PGD AT with l2 norm, FGSM AT, Standard Training, Underfitting.]
Figure 7. "Accuracy on correctly classified images" for different models on patch-shuffled Tiny ImageNet and Caltech-256 with different splitting numbers. Detailed results are listed in the appendix.
When decreasing the saturation level, all models show a similar degree of performance degradation, indicating that AT-CNNs are not robust to all kinds of image distortions. They tend to be more robust only to certain types of distortions. We leave further investigation of this issue as future work.
4.2.3 PATCH-SHUFFLING

The stylizing and saturation operations aim at changing or removing the texture information of the original images while preserving the features of shapes and edges. In order to test the different biases of AT-CNNs and standard CNNs the other way around, we shatter the shape and edge information by splitting the images into $k \times k$ patches and then randomly shuffling them. This operation still maintains the local textures if $k$ is not too large.

Figure 6 shows one example of patch-shuffled images under different numbers of splits. The first row shows the probabilities assigned by different models to the ground-truth class of the original image. After random shuffling, the shape and edge features are destroyed dramatically; the prediction probability of the adversarially trained CNNs drops significantly, while the normal CNNs still maintain a high confidence in the ground-truth class. This reveals that AT-CNNs are more biased towards shapes and edges than normally trained ones.

Moreover, Figure 7 depicts the "accuracy on correctly classified images" for all the models, measured on the "Patch-shuffled" test set with an increasing number of splitting pieces. AT-CNNs, especially those trained against a stronger attack, are more sensitive to "Patch-shuffling" operations in most of our experiments.
Note that under the "Patch-shuffle 8" operation, all models have a similar "accuracy on correctly classified images", which is largely due to the severe information loss. Also note that this accuracy of all models on Tiny ImageNet, shown in Figure 7(b), is much lower than that on Caltech-256 in Figure 7(a). That is, under "Patch-shuffle 1", the normally trained CNN has an accuracy of 84.76% on Caltech-256, but only 66.73% on Tiny ImageNet. This mainly originates from the limited resolution of Tiny ImageNet, since the "Patch-Shuffle" operation destroys more useful features on low-resolution images than on those with higher resolution.
5 Related work and discussion
Interpreting AT-CNNs. Recently, there have been some relevant findings indicating that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-ε adversarial examples against AT-CNNs, it can be observed that the adversarial examples capture salient data characteristics of a different class and appear semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we have conducted a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformation that can either largely change the textures while preserving shape information (i.e., stylizing and saturation) or shatter the shape/edge features while keeping the local textures (i.e., patch-shuffling). Evaluating the generalization of AT-CNNs over these designed datasets provides a quantitative way to verify and interpret their strong shape bias compared with normal CNNs.

Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness towards adversarial examples is correlated with the capability of capturing long-range features such as shapes or contours. This naturally raises the question of whether any other models that capture more global features, or that have more texture invariance, could be more robust to adversarial examples even without adversarial training. This might provide some insights for designing new network architectures or new strategies for enhancing the bias towards long-range features. Some recent works turn out to partially answer this question. Xie et al. (2018) enhanced standard CNNs with non-local blocks inspired by (Wang et al., 2018; Vaswani et al., 2017), which capture long-range dependencies in a data-dependent manner; when combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features. With this design, they achieved improved black-box robustness.
Adversarial training with other types of attacks. In this work, we mainly interpret AT-CNNs based on norm-constrained perturbations over the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may depend strongly on the type of adversary. Models trained against a spatially-transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have robustness towards the PGD attack similar to that of standard models, and their salience maps are still quite different, as shown in Figure 8. Also, the average distance between salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks, $x_{adv} = G(x; w)$ parameterized by $w$, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs based on these generalized types of attacks as future work.
Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models. From left to right: original image, sensitivity maps of standard CNN, PGD-$\ell_\infty$ AT-CNN, and ST-AT-CNN.
6 Conclusion
From both qualitative and quantitative perspectives, we have carried out a systematic study on interpreting adversarially trained convolutional neural networks. By constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized, and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two types of models on the three constructed datasets: stylized, saturated, and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while for normally trained CNNs it is the other way around.

Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.
Acknowledgement
This work is supported by National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI), and Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.
References
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.
Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
Ballester, P. and de Araujo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.
Brendel, W. and Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Ieee, 2009.
Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.
Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.
Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.
Jo, J. and Bengio, Y. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.
Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.
Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
A Experiment Setup
A.1 Models

• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units.

• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture using the code from pytorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them using ImageNet (Deng et al., 2009) pre-trained weights provided by pytorch.

We evaluate the robustness of all our models using an $\ell_\infty$ projected gradient descent adversary with ε = 8/255, step size = 2, and 40 iterations.
A.2 Adversarial Training

We perform 9 types of adversarial training on each dataset. 7 of the 9 kinds of adversarial training are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).

A.2.1 TRAIN AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY

We list the value of ε used for adversarial training for each dataset and $\ell_p$-norm. In all settings, PGD runs 20 iterations.
• $\ell_\infty$-norm bounded adversary: For all three datasets, pixel values range from 0 to 1. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf 1, 2, 4, 8, respectively, and their step sizes are 1/255, 1/255, 2/255, 4/255.

• $\ell_2$-norm bounded adversary: For Caltech-256 & Tiny ImageNet, the input size for our model is 224 × 224. We train three adversarially trained CNNs with ε ∈ {4, 8, 12}; these three models are denoted as PGD-l2 4, 8, 12, respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and have the same step sizes as for Caltech-256 & Tiny ImageNet.
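The $\ell_\infty$-bounded PGD adversary described above can be sketched as an iterated signed-gradient step followed by a projection. This is a minimal illustration in plain Python, not the authors' implementation: `grad_fn` is a hypothetical callable returning the gradient of the loss with respect to a flattened input, the random start used in some PGD variants is omitted, and the step size of 2/255 in the usage lines is an assumption about units.

```python
def sign(g):
    """Return -1, 0, or 1 according to the sign of g."""
    return (g > 0) - (g < 0)


def pgd_linf(grad_fn, x, eps, step_size, n_iter):
    """l_inf PGD: repeat a signed-gradient ascent step, then project back
    into the eps-ball around x and into the valid pixel range [0, 1]."""
    x_adv = list(x)
    for _ in range(n_iter):
        grad = grad_fn(x_adv)  # hypothetical: d(loss)/d(input) at x_adv
        stepped = [xa + step_size * sign(g) for xa, g in zip(x_adv, grad)]
        x_adv = [
            min(max(s, x0 - eps, 0.0), x0 + eps, 1.0)
            for s, x0 in zip(stepped, x)
        ]
    return x_adv


# Usage with a toy constant gradient (illustrative values only).
example = pgd_linf(lambda z: [1.0, -1.0, 0.0], [0.5, 0.5, 0.99],
                   eps=8 / 255, step_size=2 / 255, n_iter=40)
```

With `n_iter=1` and `step_size=eps`, the same routine reduces to a single FGSM-style step.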
A.2.2 TRAIN AGAINST A FGSM ADVERSARY

The values of ε for these two adversarially trained CNNs are ε ∈ {4, 8}, and the models are denoted as FGSM 4, 8, respectively.
B Style-transferred test set
Following (Geirhos et al., 2019), we construct the stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017), with a stylization coefficient of α = 1.0, to every test image, using the style of a randomly selected painting from Kaggle's Painter by Numbers dataset³. We used the source code provided by (Geirhos et al., 2019).
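For intuition about the role of the stylization coefficient α, the core AdaIN operation of Huang & Belongie (2017) re-normalizes content features to the mean and standard deviation of the style features, and α interpolates between the original and the re-normalized features. The sketch below is an assumption-laden illustration on a single feature channel given as a flat list: the encoder and decoder networks of the real pipeline are omitted, and the function names and the small `eps` stabilizer are hypothetical.

```python
import statistics


def adain(content, style, eps=1e-5):
    """Shift and scale one content channel to match the style channel's
    mean and standard deviation."""
    mu_c, mu_s = statistics.fmean(content), statistics.fmean(style)
    sd_c, sd_s = statistics.pstdev(content), statistics.pstdev(style)
    return [sd_s * (c - mu_c) / (sd_c + eps) + mu_s for c in content]


def stylize_features(content, style, alpha=1.0):
    """Interpolate between content features (alpha = 0) and their
    AdaIN-transformed version (alpha = 1)."""
    target = adain(content, style)
    return [(1 - alpha) * c + alpha * t for c, t in zip(content, target)]
```

With α = 1.0, as in the setup above, the decoder would receive the fully re-normalized features, which is the strongest stylization this interpolation allows.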
C Experiments on Fourier-filtered datasets
(Jo & Bengio, 2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of the differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.
C.1 Fourier filtering setup

Following (Jo & Bengio, 2017), we construct three types of Fourier-filtered versions of the test set:

• The low frequency filtered version: We use a radial mask in the Fourier domain to set higher frequency modes to zero (low-pass filtering).

• The high frequency filtered version: We use a radial mask in the Fourier domain to preserve only the higher frequency modes (high-pass filtering).

• The random filtered version: We use a random mask in the Fourier domain to set each mode to 0 with probability p, uniformly. The random mask is generated on the fly during the test.
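The radial low-pass and high-pass filters can be sketched with a naive discrete Fourier transform, which is slow but keeps the example dependency-free. This is only an illustration under stated assumptions: the function names are hypothetical, the image is a small square single-channel nested list, and the cut-off `radius` is a free parameter whose value is not given in the text.

```python
import cmath
import math


def dft2(img, inverse=False):
    """Naive 2-D DFT of an n x n array (O(n^4)); only for tiny inputs."""
    n = len(img)
    direction = 1 if inverse else -1
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0j
            for y in range(n):
                for x in range(n):
                    angle = direction * 2.0 * math.pi * (u * y + v * x) / n
                    acc += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = acc / (n * n) if inverse else acc
    return out


def radial_filter(img, radius, keep="low"):
    """Zero out Fourier modes outside (keep='low') or inside (keep='high')
    a disc of the given radius, then transform back to pixel space."""
    n = len(img)
    spec = dft2(img)
    for u in range(n):
        for v in range(n):
            # Wrap indices so that n - 1 counts as frequency magnitude 1, etc.
            fu, fv = min(u, n - u), min(v, n - v)
            is_low = math.hypot(fu, fv) <= radius
            if is_low != (keep == "low"):
                spec[u][v] = 0j
    return [[z.real for z in row] for row in dft2(spec, inverse=True)]
```

Because the two masks are complementary, the low-pass and high-pass outputs sum back to the original image. The random variant would replace the radial test with an independent draw per mode.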
C.2 Results

We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that the predictions of AT-CNNs depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are closer to low-frequency information.
Table 3. "Accuracy on correctly classified images" for different models on three Fourier-filtered Caltech-256 test sets.

DATA SET     THE LOW FREQUENCY FILTERED VERSION   THE HIGH FREQUENCY FILTERED VERSION   THE RANDOM FILTERED VERSION
STANDARD     15.8                                 16.5                                  73.5
UNDERFIT     14.5                                 17.6                                  62.2
PGD-l∞       71.1                                 3.6                                   73.4

D Detailed results
We list the detailed results of our quantitative experiments here. Tables 4, 5, and 6 show the results of each model on test sets with different saturation levels. Tables 7 and 8 list the results of each model on test sets after different patch-shuffling operations.

³ https://www.kaggle.com/c/painter-by-numbers
E Additional Figures
We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps obtained using Grad and SmoothGrad in Figure 10.
Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
UNDERFIT           31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞ 8           32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞ 4           31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞ 2           32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞ 1           32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2 12          31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2 8           32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2 4           32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM 8             29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM 4             32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78
Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. AT-CNNs are clearly much more robust to increasing saturation levels on Tiny ImageNet.
SATURATION LEVEL  0.25   0.5    1      4      8      16     64     1024
STANDARD          7.24   25.88  72.52  72.73  25.38  8.24   2.62   1.93
UNDERFIT          7.34   25.44  69.80  60.67  18.01  6.72   3.16   2.65
PGD-l∞ 8          11.07  29.08  67.11  74.53  49.8   40.16  35.44  33.96
PGD-l∞ 4          12.44  33.53  72.94  75.75  46.38  32.12  24.92  22.65
PGD-l∞ 2          12.09  34.85  75.77  76.15  41.35  25.20  16.93  14.52
PGD-l∞ 1          11.30  35.03  76.85  78.63  40.48  21.37  12.70  10.81
PGD-l2 12         11.30  29.48  66.94  75.22  52.26  42.11  37.20  35.85
PGD-l2 8          12.42  32.78  71.94  75.15  47.92  35.66  29.55  27.90
PGD-l2 4          12.63  34.10  74.06  77.32  45.00  28.73  20.16  18.04
FGSM 8            12.59  32.66  70.55  81.53  41.83  17.52  7.29   5.82
FGSM 4            12.63  34.10  74.06  75.05  42.91  29.09  22.15  20.14
Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. AT-CNNs are clearly much more robust to increasing saturation levels on CIFAR-10.
SATURATION LEVEL  0.25   0.5    1      4      8      16     64     1024
STANDARD          27.36  55.95  91.03  93.12  69.98  48.30  34.39  31.06
UNDERFIT          21.43  50.28  87.71  89.89  66.09  43.35  29.10  26.13
PGD-l∞ 8          26.05  46.96  80.97  89.16  75.46  69.08  58.98  64.64
PGD-l∞ 4          27.22  49.81  84.16  89.79  73.89  65.35  59.99  58.47
PGD-l∞ 2          28.32  53.12  86.93  91.37  74.02  62.82  55.25  52.60
PGD-l∞ 1          27.18  53.59  88.54  91.77  72.67  58.39  47.25  41.75
PGD-l2 12         25.99  46.92  81.72  88.44  73.92  66.03  60.98  59.41
PGD-l2 8          27.75  50.29  83.76  80.92  73.17  64.83  58.64  46.94
PGD-l2 4          27.26  51.17  85.78  90.08  73.12  61.50  52.04  48.79
FGSM 8            25.50  46.11  81.72  87.67  74.22  67.12  62.51  61.32
FGSM 4            26.39  58.93  84.30  89.02  73.47  64.43  58.80  56.82
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to
Table 7. "Accuracy on correctly classified images" for different models on the patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Caltech-256.
DATA SET   2 × 2  4 × 4  8 × 8
STANDARD   84.76  51.50  10.84
UNDERFIT   75.59  33.41  6.03
PGD-l∞ 8   58.13  20.14  7.70
PGD-l∞ 4   68.54  26.45  8.18
PGD-l∞ 2   74.25  30.77  9.00
PGD-l∞ 1   78.11  35.03  8.42
PGD-l2 12  58.25  21.03  7.85
PGD-l2 8   63.36  22.19  8.48
PGD-l2 4   69.65  28.21  7.72
FGSM 8     64.48  22.94  8.07
FGSM 4     70.50  28.41  6.03
Table 8. "Accuracy on correctly classified images" for different models on the patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Tiny ImageNet.
DATA SET   2 × 2  4 × 4  8 × 8
STANDARD   66.73  24.87  4.48
UNDERFIT   59.22  23.62  4.38
PGD-l∞ 8   41.08  16.05  6.83
PGD-l∞ 4   49.54  18.23  6.30
PGD-l∞ 2   55.96  19.95  5.61
PGD-l∞ 1   60.19  23.24  6.08
PGD-l2 12  42.23  16.95  7.66
PGD-l2 8   47.67  16.28  6.50
PGD-l2 4   51.94  17.79  5.89
FGSM 8     57.42  20.70  4.73
FGSM 4     50.68  16.84  5.98
Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1; PGD-L2: 12, 8, 4; and FGSM: 8, 4.
Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1; PGD-L2: 12, 8, 4; and FGSM: 8, 4. The sensitivity maps generated from Grad are clearly noisier than those from its smoothed variant SmoothGrad, especially for standard CNNs and underfitting CNNs.
adversarial attacks. In International Conference on Learning Representations, 2018.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.
Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.
Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
(a) Original (b) Stylized (c) Saturated 8 (d) Saturated 1024 (e) Patch-shuffle 2 (f) Patch-shuffle 4
Figure 1. Visualization of three transformations. Original images are from Caltech-256. From left to right: original, stylized, saturation level 8, saturation level 1024, 2 × 2 patch-shuffling, 4 × 4 patch-shuffling.
pixels the network outputs are most sensitive to. Salience maps generated by Grad and its smoothed variant SmoothGrad are good candidates to show what features a model is sensitive to. We compare the salience maps of AT-CNNs and standard CNNs on clean images and on images under texture-preserving and shape-preserving distortions. Extensive results can be seen in Section 4.1.
As pointed out by Smilkov et al. (2017), sensitivity maps based on the Grad method are often visually noisy, highlighting some pixels that, to a human eye, seem randomly selected. SmoothGrad in Eq. (3), on the other hand, reduces visual noise by averaging the gradient over Gaussian-perturbed images. Thus, we mainly report the salience maps produced by SmoothGrad, and the Grad visualization results are provided in the appendix. Note that the two visualization methods lead to a consistent conclusion on the difference between the two types of trained CNNs.
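As a rough illustration of the averaging step described above, the following Python sketch computes a SmoothGrad-style map for an arbitrary gradient function. The names `smoothgrad`, `grad_fn`, and `toy_grad`, as well as the noise level and sample count, are illustrative assumptions rather than the paper's settings; in an actual experiment the gradient would come from the network through automatic differentiation.

```python
import random


def smoothgrad(grad_fn, x, sigma=0.1, n_samples=50, seed=0):
    """Average grad_fn over Gaussian-perturbed copies of the flat input x.

    grad_fn: hypothetical callable mapping a list of floats to the gradient
             of a class score with respect to that input.
    sigma:   standard deviation of the added Gaussian noise (assumed value).
    """
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        # Perturb every input coordinate independently.
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        grad = grad_fn(noisy)
        total = [t + g for t, g in zip(total, grad)]
    return [t / n_samples for t in total]


def toy_grad(x):
    """Gradient of the toy score S(x) = sum_i x_i^2, used only for the demo."""
    return [2.0 * xi for xi in x]


if __name__ == "__main__":
    print(smoothgrad(toy_grad, [0.5, -1.0], sigma=0.1, n_samples=200))
```

With `sigma=0.0` the function reduces to the plain Grad map, which makes the relation between the two visualizations easy to check on small inputs.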
3.2. Generalization on shape/texture preserving distortions
Besides visual inspection of sensitivity maps, we propose to measure the sensitivity of AT-CNNs and CNNs to different features by evaluating the performance degradation under several distortions that preserve either shapes or textures. Intuitively, if a model relies heavily on textures, its performance will degrade severely when we destroy most of the textures while preserving other information, such as shapes and other features. However, a perfect disentanglement of texture, shape, and other feature information is impossible (Gatys et al., 2015). In this work, we mainly construct three kinds of image transformations to achieve the shape or texture distortion: style transfer, saturation, and patch-shuffling. Some of the image samples are shown in Figure 1. We also add three Fourier-filtered test sets in the appendix. We now describe each of these transformations and their properties.
Note that we conduct normal training or adversarial training on the original training sets and then evaluate generalization on the transformed data. During training, we never use the transformed datasets.
Stylizing. Geirhos et al. (2019) utilized style transfer (Huang & Belongie, 2017) to generate images with conflicting shape and texture information to demonstrate the texture bias of ImageNet-trained standard CNNs. Following the same rationale, we utilize style transfer to destroy most of the textures while preserving the global shape structures in images, and build a stylized test dataset. Therefore, with similar generalization error, models that capture shapes better should also perform better on stylized test images than those biased towards textures. The style-transferred image samples are shown in Figure 1(b).
Saturation. Similar to Ding et al. (2019), we denote the saturation of the image $x$ by $x^{p}$, where $p$ indicates the saturation level, ranging from $0$ to $\infty$. When $p = 2$, the saturation operation does not change the image. When $p \ge 2$, increasing the saturation level pushes the pixel values towards binarized ones, and $p = \infty$ leads to pure binarization. Specifically, for each pixel of image $x$ with value $v \in [0, 1]$, its corresponding saturated pixel of $x^{p}$ is defined as $\operatorname{sign}(2v - 1)\,|2v - 1|^{2/p}/2 + 1/2$. One can observe from Figure 1(c) and (d) that increasing the saturation level gradually destroys some texture information while preserving most parts of the contour structures.
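The per-pixel saturation formula above can be sketched in a few lines of Python. The function name `saturate` and the handling of the boundary cases are assumptions made for illustration; the sketch only covers $p > 0$, including $p = \infty$.

```python
import math


def saturate(v, p):
    """Saturated value of a pixel v in [0, 1] at saturation level p > 0.

    Implements sign(2v - 1) * |2v - 1|**(2 / p) / 2 + 1/2.
    p = 2 leaves the pixel unchanged; p = math.inf binarizes it.
    """
    if p <= 0:
        raise ValueError("this sketch only handles p > 0")
    centered = 2.0 * v - 1.0
    if centered == 0.0:
        # sign(0) = 0, so a mid-gray pixel stays at 1/2 for every p.
        return 0.5
    exponent = 2.0 / p  # evaluates to 0.0 when p is math.inf
    magnitude = abs(centered) ** exponent
    return math.copysign(magnitude, centered) / 2.0 + 0.5


if __name__ == "__main__":
    for level in (0.25, 2, 8, 1024, math.inf):
        print(level, saturate(0.8, level))
```

Applying `saturate` to every channel value of an image scaled to [0, 1] yields the saturated image; levels below 2 pull values towards 1/2 and levels above 2 push them towards 0 or 1.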
Patch-Shuffling. To destroy long-range shape information, we split images into $k \times k$ small patches and randomly rearrange the order of these patches, with $k \in \{2, 4, 8\}$. Favorably, this operation preserves most of the texture information and destroys most of the shape information. The patch-shuffled image samples are shown in Figure 1(e) and (f). Note that as $k$ increases, more information of the original image is lost, especially for images with low resolution.
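A minimal sketch of the patch-shuffling operation, assuming the image is stored as a list of rows and that both sides are divisible by $k$. The function name `patch_shuffle` and the `seed` parameter are hypothetical and introduced here only for the example.

```python
import random


def patch_shuffle(image, k, seed=None):
    """Split a 2-D image (list of rows) into k x k patches and permute them.

    Each element of a row may be a scalar or a per-pixel tuple of channels.
    """
    height, width = len(image), len(image[0])
    if height % k or width % k:
        raise ValueError("image sides must be divisible by k")
    ph, pw = height // k, width // k

    # Cut the patches out in row-major order.
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(k)
        for c in range(k)
    ]
    random.Random(seed).shuffle(patches)

    # Paste the permuted patches back onto the k x k grid.
    shuffled = []
    for r in range(k):
        for y in range(ph):
            row = []
            for c in range(k):
                row.extend(patches[r * k + c][y])
            shuffled.append(row)
    return shuffled


if __name__ == "__main__":
    demo = [[4 * r + c for c in range(4)] for r in range(4)]
    for row in patch_shuffle(demo, 2, seed=0):
        print(row)
```

The content inside each patch is untouched, so local texture statistics survive while the global arrangement of the patches is randomized.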
Table 1. Accuracy and robustness of all the trained models. Robustness is measured against the PGD attack with bounded l∞ norm. Details are listed in the appendix. Note that underfitting CNNs have similar generalization performance to some of the AT-CNNs on clean images.
            CIFAR10               TinyImageNet          Caltech 256
            Accuracy  Robustness  Accuracy  Robustness  Accuracy  Robustness
PGD-inf 8   86.27     44.81       54.42     14.25       66.41     31.16
PGD-inf 4   89.17     30.85       61.85     6.87        72.22     20.10
PGD-inf 2   91.4      39.11       67.06     1.66        76.51     7.51
PGD-inf 1   93.40     7.53        69.42     0.18        79.11     1.70
PGD-L2 12   85.79     34.61       53.44     14.80       65.54     31.36
PGD-L2 8    88.01     26.88       58.21     10.03       69.75     26.19
PGD-L2 4    90.77     13.19       64.24     3.61        74.12     14.33
FGSM 8      84.90     34.25       66.21     0.01        70.88     20.02
FGSM 4      88.13     25.08       63.43     0.13        73.91     15.16
Normal      94.52     0           72.02     0.01        83.32     0
Underfit    86.79     0           60.05     0.01        69.04     0
4. Experiments and analysis
Experiment setup. We describe the experiment setup used to evaluate the performance of AT-CNNs and standard CNNs on data distributions manipulated by the above-mentioned operations. We conduct experiments on three datasets: CIFAR-10, Tiny ImageNet, and Caltech-256 (Griffin et al., 2007). Note that we do not create the style-transferred and patch-shuffled test sets for CIFAR-10 due to its limited resolution.
When training on CIFAR-10, we use the ResNet-18 model (He et al., 2016a;b). For data augmentation, we perform zero padding with width 4, horizontal flip, and random crop.
Tiny ImageNet has 200 classes of objects. Each class has 500 training images, 50 validation images, and 50 test images. All images from Tiny ImageNet are of size 64 × 64. We re-scale them to 224 × 224 and perform random horizontal flip and per-image standardization as data augmentation.
Caltech-256 (Griffin et al., 2007) consists of 257 object categories containing a total of 30,607 images. The resolution of images from Caltech-256 is much higher than that of the above two datasets. We manually split off 20% of the images as the test set. We perform re-scaling and random cropping following He et al. (2016a). For both Tiny ImageNet and Caltech-256, we use the ResNet-18 model as the network architecture.
Compared models, their generalization and robustness. For all three datasets above, we train three types of AT-CNNs, which mainly differ in the way adversarial examples are generated: FGSM, PGD with bounded l∞ norm, and PGD with bounded l2 norm. For each attack method, we train several models under different attack strengths. Details are listed in the appendix. To understand whether the difference in performance degradation between AT-CNNs and standard CNNs is due to the poor generalization (Schmidt et al., 2018; Tsipras et al., 2018) of adversarial training, we also compare the AT-CNNs with an underfitting CNN (trained on clean data) with generalization performance similar to that of the AT-CNNs. We train 11 models on each dataset. Their generalization performance on clean data and their robustness measured by PGD attack are shown in Table 1.
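To make the l∞-bounded attacks named above more concrete, here is a small Python sketch of the commonly used projected gradient step (Madry et al., 2018); a single step with `step_size` equal to `eps` gives an FGSM-style perturbation. The function name `pgd_linf`, the loss-gradient callable `grad_fn`, and the step sizes are assumptions for illustration and do not reproduce the paper's training configuration, which is given in its appendix. The random start often used with PGD is omitted for brevity.

```python
def pgd_linf(x, grad_fn, eps, step_size, n_steps):
    """Projected gradient ascent on a loss inside an l_inf ball around x.

    x:       flat list of pixel values in [0, 1].
    grad_fn: hypothetical callable returning the loss gradient at a point.
    """
    x_adv = list(x)
    for _ in range(n_steps):
        grad = grad_fn(x_adv)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            moved = x_adv[i] + step_size * sign
            # Project back into the eps-ball around the clean pixel ...
            moved = min(max(moved, x[i] - eps), x[i] + eps)
            # ... and into the valid pixel range.
            x_adv[i] = min(max(moved, 0.0), 1.0)
    return x_adv


def toy_loss_grad(x):
    """Gradient of a toy linear loss w . x, used only for the demo."""
    return [1.0, -2.0, 0.0]


if __name__ == "__main__":
    clean = [0.5, 0.5, 0.5]
    print(pgd_linf(clean, toy_loss_grad, eps=0.1, step_size=0.04, n_steps=5))
```

In adversarial training, examples produced this way replace or augment the clean mini-batch before each parameter update.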
4.1. Visualization results
To investigate what features of an input image AT-CNNs and normal CNNs are most sensitive to, we generate sensitivity maps using SmoothGrad (Smilkov et al., 2017) on clean images, saturated images, and stylized images. The visualization results are presented in Figure 2.
We can easily observe that the salience maps of AT-CNNs are much sparser and mainly focus on the contours of each object on all kinds of images, including the clean, saturated, and stylized ones. In contrast, the sensitivity maps of standard CNNs are noisier and less biased towards the shapes of objects. This is consistent with the findings in Geirhos et al. (2019).
In particular, in the second row of Figure 2, the sensitivity maps of normal CNNs for the "dog" class are still noisy even when the input saturated images are nearly binarized. On the other hand, after adversarial training, the models successfully capture the shape information of the object, providing a more interpretable prediction.
For the stylized images shown in the third row of Figure 2, even with dramatically changed textures after style transfer, AT-CNNs are still able to focus on the shapes of the original object
(a) Images from Caltech-256 (b) Images from Tiny ImageNet
Figure 2. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models on images under saturation and stylizing. From top to bottom: original, saturation 1024, and stylizing. For each group of images, from left to right: original image, sensitivity maps of standard CNN, underfitting CNN, and PGD-l∞ AT-CNN.
while standard CNNs totally fail.
Due to limited space, we provide more visualization results (including the sensitivity maps generated by the Grad method) in the appendix.
4.2. Generalization performance on transformed data
In this part, we mainly show the generalization performance of AT-CNNs and normal CNNs on shape-preserving or texture-preserving distorted image datasets. This helps us understand, in a quantitative way, how differently the two types of models are biased.
For all experimental results below, besides the top-1 accuracy, we also report an "accuracy on correctly classified images". This accuracy is measured by first selecting the images from the clean test set that are correctly classified, and then measuring the accuracy on the transformed versions of these correctly classified images.
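The metric just defined can be sketched as follows. The function name `accuracy_on_correctly_classified` and the toy prediction lists are hypothetical; the sketch assumes that predictions on clean and transformed images are index-aligned with the labels.

```python
def accuracy_on_correctly_classified(clean_preds, transformed_preds, labels):
    """Accuracy (in percent) on transformed images, restricted to the images
    whose clean version the model classifies correctly."""
    kept = [i for i, (p, y) in enumerate(zip(clean_preds, labels)) if p == y]
    if not kept:
        raise ValueError("no clean image was classified correctly")
    hits = sum(transformed_preds[i] == labels[i] for i in kept)
    return 100.0 * hits / len(kept)


if __name__ == "__main__":
    labels = [0, 1, 2, 3]
    clean_preds = [0, 1, 2, 0]        # the last clean image is misclassified
    transformed_preds = [0, 1, 0, 3]  # its transformed version is ignored
    print(accuracy_on_correctly_classified(clean_preds, transformed_preds, labels))
```

Conditioning on clean-image correctness separates the effect of a transformation from differences in baseline accuracy between models.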
4.2.1. STYLIZING
Following Geirhos et al. (2019), we generate stylized versions of the test sets for Caltech-256 and Tiny ImageNet.
We report the "accuracy on correctly classified images" of all the trained models on the stylized test sets in Table 2. Compared with standard CNNs, AT-CNNs achieve higher accuracy on stylized images, whose textures are dramatically changed, despite having lower accuracy on the original test images. The comparison quantitatively shows that AT-CNNs tend to be more invariant with respect to local textures.
4.2.2. SATURATION
We use the saturation operation to manipulate the images and show how increasing saturation levels affects the accuracy of models trained in different ways.
In Figure 4, we visualize images with varying saturation levels. It can be easily observed that increasing the saturation level makes images more "binarized": some textures are wiped out, while sharper edges are produced and shape information is preserved. When the saturation level is smaller than 2 (the level of the clean image), the operation pushes all pixels towards 1/2 so that nearly all the information is lost, and p = 0 leads to a totally gray image with constant pixel value.
We measure the "accuracy on correctly classified images" for all the trained models and show the results in Figure 5. We can observe that with increasing saturation level, more texture information is lost. Favorably, adversarially trained models exhibit much less sensitivity to this texture loss, still obtaining high classification accuracy. The results indicate that AT-CNNs are more robust to "saturation" or "binarizing" operations, which suggests that the prediction capability of AT-CNNs relies less on texture and more on shapes. Results on CIFAR-10 tell the same story and are presented in the appendix due to limited space.
Additionally, in our experiments, for each adversarial training approach, either PGD or FGSM based, AT-CNNs with higher robustness towards the PGD adversary are more invariant to increasing saturation levels and texture loss. On the other hand, adversarial training with higher robustness typically harms generalization on the clean dataset. Our finding also supports the claim that "robustness may be at odds with accuracy" (Tsipras et al., 2018).
Figure 3. Visualization of images from the style-transferred test set. Applying AdaIN (Huang & Belongie, 2017) style transfer distorts the local textures of the original images while the global shape structure is retained. The first row shows images from Caltech-256, and the second row shows images from Tiny ImageNet.
Table 2. "Accuracy on correctly classified images" for different models on the stylized test sets. The columns named "Caltech-256" and "TinyImageNet" show the generalization of different models on the clean test set.
DATASET    CALTECH-256  STYLIZED CALTECH-256  TINYIMAGENET  STYLIZED TINYIMAGENET
STANDARD   83.32        16.83                 72.02         7.25
UNDERFIT   69.04        9.75                  60.35         7.16
PGD-l∞ 8   66.41        19.75                 54.42         18.81
PGD-l∞ 4   72.22        21.10                 61.85         20.51
PGD-l∞ 2   76.51        21.89                 67.06         19.25
PGD-l∞ 1   79.11        22.07                 69.42         18.31
PGD-l2 12  65.24        20.14                 53.44         19.33
PGD-l2 8   69.75        21.62                 58.21         20.42
PGD-l2 4   74.12        22.53                 64.24         21.05
FGSM 8     70.88        21.23                 66.21         15.07
FGSM 4     73.91        21.99                 63.43         20.22
Figure 4. Illustration of how varying saturation changes the appearance of the image. From left to right, saturation level: 0.25, 0.5, 1, 2 (original image), 4, 8, 16, 64, 1024. Increasing the saturation level pushes pixels towards 0 or 1, which preserves most of the shape while wiping out most of the textures. Decreasing the saturation level pushes all pixels towards 1/2.
[Figure 5 plots: (a) Caltech-256, (b) Tiny ImageNet. x-axis: saturation level (2^-2 to 2^10, with the clean image marked); y-axis: accuracy on correctly classified images (0 to 100); curves: PGD AT with inf norm, PGD AT with l2 norm, FGSM AT, standard training, underfitting.]
Figure 5. "Accuracy on correctly classified images" for different models on saturated Caltech-256 and Tiny ImageNet with respect to different saturation levels. Note that in the plot several curves share the same color and line type for each adversarial training method (PGD and FGSM-based); among them, those trained with larger perturbations achieve better robustness in most cases. Detailed results are listed in the appendix.
[Figure 6 panels: (a) Original Image, (b) Patch-Shuffle 2, (c) Patch-Shuffle 4, (d) Patch-Shuffle 8. Each panel has a bar chart (y-axis 0.0 to 1.0) with the probability assigned by five models, as recovered from the bar labels:]
MODEL      ORIGINAL  PATCH-SHUFFLE 2  PATCH-SHUFFLE 4  PATCH-SHUFFLE 8
Normal     0.750     0.550            0.541            0.005
Underfit   0.952     0.877            0.913            0.305
PGD-inf 8  0.738     0.012            0.002            0.002
PGD-L2 12  0.769     0.043            0.012            0.002
FGSM 8     0.932     0.028            0.012            0.003
Figure 6. Visualization of the patch-shuffling transformation. The first row shows the probability of "cake" assigned by different models.
[Figure 7 plots: (a) Caltech-256, (b) Tiny ImageNet. x-axis: patch-shuffle setting (clean, patch-shuffle 2, patch-shuffle 4, patch-shuffle 8); y-axis: accuracy on correctly classified images (0 to 100); curves: PGD AT with inf norm, PGD AT with l2 norm, FGSM AT, standard training, underfitting.]
Figure 7. "Accuracy on correctly classified images" for different models on patch-shuffled Tiny ImageNet and Caltech-256 with different splitting numbers. Detailed results are listed in the appendix.
When decreasing the saturation level, all models show a similar degree of performance degradation, indicating that AT-CNNs are not robust to all kinds of image distortions. They tend to be more robust to particular types of distortions. We leave further investigation of this issue as future work.
4.2.3. PATCH-SHUFFLING
The stylizing and saturation operations aim at changing or removing the texture information of the original images while preserving the features of shapes and edges. In order to test the different biases of AT-CNNs and standard CNNs the other way around, we shatter the shape and edge information by splitting the images into k × k patches and then randomly shuffling them. This operation still maintains the local textures if k is not too large.
Figure 6 shows one example of patch-shuffled images under different numbers of splits. The first row shows the probabilities assigned by different models to the ground-truth class of the original image. After random shuffling, the shape and edge features are destroyed dramatically: the prediction probability of the adversarially trained CNNs drops significantly, while the normal CNNs still maintain high confidence in the ground-truth class. This reveals that AT-CNNs are more biased towards shapes and edges than normally trained ones.
Moreover, Figure 7 depicts the "accuracy on correctly classified images" for all the models, measured on the patch-shuffled test sets with an increasing number of splitting pieces. AT-CNNs, especially those trained against a stronger attack, are more sensitive to patch-shuffling operations in most of our experiments.
Note that under the "Patch-shuffle 8" operation, all models have similar "accuracy on correctly classified images", which is largely due to the severe information loss. Also note that this accuracy for all models on Tiny ImageNet, shown in Figure 7(b), is much lower than that on Caltech-256 in Figure 7(a). For example, under "Patch-shuffle 2", the normally trained CNN has an accuracy of 84.76% on Caltech-256 but only 66.73% on Tiny ImageNet. This mainly originates from the limited resolution of Tiny ImageNet, since the patch-shuffle operation destroys more useful features on low-resolution images than on those with higher resolution.
5. Related work and discussion
Interpreting AT-CNNs. Recently, there have been some relevant findings indicating that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-ε adversarial examples against AT-CNNs, it can be observed that the adversarial examples capture salient data characteristics of a different class and appear semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we have conducted a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformation that either largely change the textures while preserving shape information (i.e., stylizing and saturation) or shatter the shape/edge features while keeping the local textures (i.e., patch-shuffling). Evaluating the generalization of AT-CNNs on these designed datasets provides a quantitative way to verify and interpret their strong shape bias compared with normal CNNs.
Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness towards adversarial examples is correlated with the capability of capturing long-range features such as shapes or contours. This naturally raises the question of whether any other models that capture more global features or have more texture invariance could be more robust to adversarial examples, even without adversarial training. This might provide some insights for designing new network architectures or new strategies for enhancing the bias towards long-range features. Some recent works partially answer this question. Xie et al. (2018) enhanced standard CNNs with non-local blocks inspired by Wang et al. (2018) and Vaswani et al. (2017), which capture long-range dependencies in a data-dependent manner; when combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features. With this design, they achieved improved black-box robustness.
Adversarial training with other types of attacks. In this work, we mainly interpret AT-CNNs based on norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may depend heavily on the type of adversary. Models trained against a spatially transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have robustness towards the PGD attack similar to that of standard models, and their salience maps are still quite different, as shown in Figure 8. Also, the average distance between salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks, $x_{adv} = G(x; w)$ parameterized by $w$, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs based on these generalized types of attacks as future work.
Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models. From left to right: original image, sensitivity maps of standard CNN, PGD-l∞ AT-CNN, and ST-AT-CNN.
6. Conclusion
From both qualitative and quantitative perspectives, we have carried out a systematic study on interpreting adversarially trained convolutional neural networks. By constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized, and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two types of models on the three constructed datasets: stylized, saturated, and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while normally trained CNNs behave the other way around.
Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.
Acknowledgement
This work is supported by National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI), and Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.
References
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.
Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
Ballester, P. and de Araujo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.
Brendel, W. and Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Ieee, 2009.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.
Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.
Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
A. Experiment Setup
A.1. Models
• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units.
• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture using the code from PyTorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them using ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch.
We evaluate the robustness of all our models using an l∞ projected gradient descent adversary with ε = 8/255, step size = 2, and 40 iterations.
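As a rough illustration of the l∞ PGD adversary used for evaluation, a minimal NumPy sketch is given below. The function name `pgd_linf`, the gradient callable `grad_fn`, the random start, and the interpretation of the step size as 2/255 on a [0, 1] pixel scale are all assumptions made for this sketch; it is not the code used in our experiments.

```python
import numpy as np


def pgd_linf(x, grad_fn, eps=8 / 255, step=2 / 255, iters=40, rng=None):
    """Sketch of an l-inf PGD attack on an input x with pixels in [0, 1].

    grad_fn(x_adv) is a hypothetical callable returning the gradient of the
    classification loss with respect to x_adv (same shape as x).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Random start inside the eps-ball (assumed; some variants start at x).
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(iters):
        # Ascent step along the sign of the loss gradient.
        x_adv = x_adv + step * np.sign(grad_fn(x_adv))
        # Project back onto the eps-ball around x, then onto valid pixels.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

In practice `grad_fn` would wrap a backward pass through the network; here it is left abstract so that the projection logic can be checked in isolation.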
A.2. Adversarial Training
We perform 9 types of adversarial training on each of the datasets. 7 of the 9 kinds of adversarial training are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).
A.2.1. TRAIN AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY
We list the value of ε used for adversarial training on each dataset and for each lp-norm. In all settings, PGD runs 20 iterations.
• l∞-norm bounded adversary: For all three datasets, pixel values range over [0, 1]. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf: 1, 2, 4, 8 respectively, with step sizes 1/255, 1/255, 2/255, 4/255.
• l2-norm bounded adversary: For Caltech-256 & Tiny ImageNet, where the input size for our model is 224 × 224, we train three adversarially trained CNNs with ε ∈ {4, 8, 12}; these three models are denoted as PGD-l2: 4, 8, 12 respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and have the same step sizes as those for Caltech-256 & Tiny ImageNet.
A.2.2. TRAIN AGAINST AN FGSM ADVERSARY
The values of ε for these two adversarially trained CNNs are ε ∈ {4, 8}, and the models are denoted as FGSM: 4, 8 respectively.
B. Style-transferred test set
Following Geirhos et al. (2019), we construct a stylized test set for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, using the style of a randomly selected painting from Kaggle’s Painter by Numbers dataset (footnote 3). We used the source code provided by Geirhos et al. (2019).
C. Experiments on Fourier-filtered datasets
Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.
C.1. Fourier filtering setup
Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set:
• The low frequency filtered version: We use a radial mask in the Fourier domain to set higher frequency modes to zero (low-pass filtering).
• The high frequency filtered version: We use a radial mask in the Fourier domain to preserve only the higher frequency modes (high-pass filtering).
• The random filtered version: We use a random mask in the Fourier domain to set each mode to 0 with probability p uniformly. The random mask is generated on the fly during the test.
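The three filters listed above can be sketched with NumPy's FFT routines as follows. This is only an illustrative sketch: the helper names (`radial_mask`, `fourier_filter`, `low_pass`, `high_pass`, `random_filter`), the `radius` parameter, and the choice to keep the real part of the inverse transform are assumptions, not details taken from Jo & Bengio (2017) or from our implementation.

```python
import numpy as np


def radial_mask(h, w, radius):
    """True for frequency modes within `radius` of the centred DC component."""
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    return dist <= radius


def fourier_filter(img, mask):
    """Keep only the Fourier modes selected by a 2-D mask (centred layout)."""
    spec = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    if img.ndim == 3:
        mask = mask[..., None]  # broadcast the mask over colour channels
    out = np.fft.ifft2(np.fft.ifftshift(spec * mask, axes=(0, 1)), axes=(0, 1))
    return out.real  # assumed: discard any residual imaginary part


def low_pass(img, radius):
    # Zero out the higher-frequency modes.
    return fourier_filter(img, radial_mask(*img.shape[:2], radius))


def high_pass(img, radius):
    # Preserve only the higher-frequency modes.
    return fourier_filter(img, ~radial_mask(*img.shape[:2], radius))


def random_filter(img, p, rng):
    # Each mode is set to zero independently with probability p.
    keep = rng.random(img.shape[:2]) >= p
    return fourier_filter(img, keep)
```

Since the low-pass and high-pass masks are complementary for a given radius, the two filtered images sum back to the original image, which gives a quick sanity check.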
C.2. Results
We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that AT-CNNs make their predictions depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are closer to low-frequency information.
Table 3. “Accuracy on correctly classified images” for different models on three Fourier-filtered Caltech-256 test sets.
DATA SET    LOW FREQUENCY FILTERED   HIGH FREQUENCY FILTERED   RANDOM FILTERED
STANDARD    15.8                     16.5                      73.5
UNDERFIT    14.5                     17.6                      62.2
PGD-l∞      71.1                      3.6                      73.4
D. Detailed results
We list the detailed results of our quantitative experiments here. Tables 4, 5, and 6 show the results of each model on test sets with different saturation levels. Tables 7 and 8 list the results of each model on test sets after different patch-shuffling operations.
3 https://www.kaggle.com/c/painter-by-numbers
E. Additional Figures
We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps generated using Grad and SmoothGrad in Figure 10.
Table 4. “Accuracy on correctly classified images” for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.
SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
UNDERFIT           31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞ 8           32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞ 4           31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞ 2           32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞ 1           32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2 12          31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2 8           32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2 4           32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM 8             29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM 4             32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78
Table 5. “Accuracy on correctly classified images” for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.
SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD            7.24   25.88   72.52   72.73   25.38    8.24    2.62    1.93
UNDERFIT            7.34   25.44   69.80   60.67   18.01    6.72    3.16    2.65
PGD-l∞ 8           11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-l∞ 4           12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-l∞ 2           12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-l∞ 1           11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-l2 12          11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-l2 8           12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-l2 4           12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM 8             12.59   32.66   70.55   81.53   41.83   17.52    7.29    5.82
FGSM 4             12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14
Table 6. “Accuracy on correctly classified images” for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.
SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
UNDERFIT           21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-l∞ 8           26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-l∞ 4           27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-l∞ 2           28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-l∞ 1           27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-l2 12          25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-l2 8           27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-l2 4           27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM 8             25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM 4             26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82
Table 7. “Accuracy on correctly classified images” for different models on the patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Caltech-256.
DATA SET     2 × 2   4 × 4   8 × 8
STANDARD     84.76   51.50   10.84
UNDERFIT     75.59   33.41    6.03
PGD-l∞ 8     58.13   20.14    7.70
PGD-l∞ 4     68.54   26.45    8.18
PGD-l∞ 2     74.25   30.77    9.00
PGD-l∞ 1     78.11   35.03    8.42
PGD-l2 12    58.25   21.03    7.85
PGD-l2 8     63.36   22.19    8.48
PGD-l2 4     69.65   28.21    7.72
FGSM 8       64.48   22.94    8.07
FGSM 4       70.50   28.41    6.03
Table 8. “Accuracy on correctly classified images” for different models on the patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Tiny ImageNet.
DATA SET     2 × 2   4 × 4   8 × 8
STANDARD     66.73   24.87   4.48
UNDERFIT     59.22   23.62   4.38
PGD-l∞ 8     41.08   16.05   6.83
PGD-l∞ 4     49.54   18.23   6.30
PGD-l∞ 2     55.96   19.95   5.61
PGD-l∞ 1     60.19   23.24   6.08
PGD-l2 12    42.23   16.95   7.66
PGD-l2 8     47.67   16.28   6.50
PGD-l2 4     51.94   17.79   5.89
FGSM 8       57.42   20.70   4.73
FGSM 4       50.68   16.84   5.98
Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1; PGD-L2: 12, 8, 4; and FGSM: 8, 4.
Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1; PGD-L2: 12, 8, 4; and FGSM: 8, 4. It is easily observed that sensitivity maps generated from Grad are noisier than those from its smoothed variant SmoothGrad, especially for standard CNNs and underfitting CNNs.
Table 1. Accuracy and robustness of all the trained models. Robustness is measured against the PGD attack with bounded l∞ norm. Details are listed in the appendix. Note that underfitting CNNs have similar generalization performance to some of the AT-CNNs on clean images.
             CIFAR10                 TinyImageNet            Caltech 256
             Accuracy   Robustness   Accuracy   Robustness   Accuracy   Robustness
PGD-inf 8    86.27      44.81        54.42      14.25        66.41      31.16
PGD-inf 4    89.17      30.85        61.85       6.87        72.22      20.10
PGD-inf 2    91.4       39.11        67.06       1.66        76.51       7.51
PGD-inf 1    93.40       7.53        69.42       0.18        79.11       1.70
PGD-L2 12    85.79      34.61        53.44      14.80        65.54      31.36
PGD-L2 8     88.01      26.88        58.21      10.03        69.75      26.19
PGD-L2 4     90.77      13.19        64.24       3.61        74.12      14.33
FGSM 8       84.90      34.25        66.21       0.01        70.88      20.02
FGSM 4       88.13      25.08        63.43       0.13        73.91      15.16
Normal       94.52       0           72.02       0.01        83.32       0
Underfit     86.79       0           60.05       0.01        69.04       0
4. Experiments and analysis
Experiments setup. We describe the experiment setup used to evaluate the performance of AT-CNNs and standard CNNs on data distributions manipulated by the above-mentioned operations. We conduct experiments on three datasets: CIFAR-10, Tiny ImageNet, and Caltech-256 (Griffin et al., 2007). Note that we do not create the style-transferred and patch-shuffled test sets for CIFAR-10 due to its limited resolution.
When training on CIFAR-10, we use the ResNet-18 model (He et al., 2016a;b). For data augmentation, we perform zero padding with width 4, horizontal flip, and random crop.
Tiny ImageNet has 200 classes of objects. Each class has 500 training images, 50 validation images, and 50 test images. All images from Tiny ImageNet are of size 64 × 64. We re-scale them to 224 × 224 and perform random horizontal flip and per-image standardization as data augmentation.
Caltech-256 (Griffin et al., 2007) consists of 257 object categories containing a total of 30,607 images. The resolution of images from Caltech-256 is much higher than that of the above two datasets. We manually split off 20% of the images as the test set. We perform re-scaling and random cropping following He et al. (2016a). For both Tiny ImageNet and Caltech-256, we use the ResNet-18 model as the network architecture.
Compared models, their generalization and robustness. For all three datasets above, we train three types of AT-CNNs, which mainly differ in the way adversarial examples are generated: FGSM, PGD with bounded l∞ norm, and PGD with bounded l2 norm. For each attack method, we train several models under different attack strengths. Details are listed in the appendix. To understand whether the difference in performance degradation between AT-CNNs and standard CNNs is due to the poor generalization (Schmidt et al., 2018; Tsipras et al., 2018) of adversarial training, we also compare the AT-CNNs with an underfitting CNN (trained on clean data) whose generalization performance is similar to that of the AT-CNNs. We train 11 models on each dataset. Their generalization performance on clean data and their robustness measured by PGD attack are shown in Table 1.
4.1. Visualization results
To investigate what features of an input image AT-CNNs and normal CNNs are most sensitive to, we generate sensitivity maps using SmoothGrad (Smilkov et al., 2017) on clean images, saturated images, and stylized images. The visualization results are presented in Figure 2.
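For readers unfamiliar with SmoothGrad, the idea is to average the input gradient over several noisy copies of the image. A minimal sketch is shown below; the function name `smoothgrad`, the gradient callable `grad_fn`, and the default values of `n_samples` and `noise_level` are assumptions chosen for illustration and are not the settings used to produce our figures.

```python
import numpy as np


def smoothgrad(x, grad_fn, n_samples=50, noise_level=0.1, rng=None):
    """Average grad_fn over Gaussian-perturbed copies of the input x.

    grad_fn(x) is a hypothetical callable returning the gradient of the
    class score with respect to the input (same shape as x).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Noise standard deviation as a fraction of the input's value range.
    sigma = noise_level * (x.max() - x.min())
    total = np.zeros_like(x, dtype=np.float64)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        total += grad_fn(noisy)
    return total / n_samples
```

Setting `noise_level=0` reduces this to the plain gradient ("Grad") map, which is a convenient way to compare the two methods with the same code path.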
We can easily observe that the salience maps of AT-CNNs are much sparser and mainly focus on the contours of each object on all kinds of images, including the clean, saturated, and stylized ones. In contrast, sensitivity maps of standard CNNs are noisier and less biased towards the shapes of objects. This is consistent with the findings in Geirhos et al. (2019).
In particular, in the second row of Figure 2, sensitivity maps of normal CNNs for the “dog” class are still noisy even when the saturated input images are nearly binarized. On the other hand, after adversarial training, the models successfully capture the shape information of the object, providing a more interpretable prediction.
For stylized images, shown in the third row of Figure 2, even with dramatically changed textures after style transfer, AT-CNNs are still able to focus on the shapes of the original object, while standard CNNs totally fail.
Figure 2. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models on images under saturation and stylizing. (a) Images from Caltech-256. (b) Images from Tiny ImageNet. From top to bottom: Original, Saturation 1024, and Stylizing. For each group of images, from left to right: original image, sensitivity maps of standard CNN, underfitting CNN, and PGD-l∞ AT-CNN.
Due to the limited space, we provide more visualization results (including the sensitivity maps generated by the Grad method) in the appendix.
4.2. Generalization performance on transformed data
In this part, we mainly show the generalization performance of AT-CNNs and normal CNNs on either shape-preserving or texture-preserving distorted image datasets. This helps us understand, in a quantitative way, how differently the two types of models are biased.
For all experimental results below, besides the top-1 accuracy, we also report an “accuracy on correctly classified images”. This accuracy is measured by first selecting the images from the clean test set that are correctly classified, and then measuring the accuracy on the transformed versions of these correctly classified images.
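The metric described above can be written down in a few lines. The sketch below assumes that predictions and labels are available as integer arrays; the function name `accuracy_on_correctly_classified` and its argument names are hypothetical.

```python
import numpy as np


def accuracy_on_correctly_classified(clean_pred, transformed_pred, labels):
    """Accuracy on transformed images, restricted to images whose clean
    version is classified correctly."""
    clean_pred = np.asarray(clean_pred)
    transformed_pred = np.asarray(transformed_pred)
    labels = np.asarray(labels)
    # Step 1: select images that are correct on the clean test set.
    correct_clean = clean_pred == labels
    if not correct_clean.any():
        return float("nan")  # undefined when nothing is classified correctly
    # Step 2: accuracy of the transformed versions of those images only.
    hits = transformed_pred[correct_clean] == labels[correct_clean]
    return float(hits.mean())
```

By construction, this value is 1.0 when the transformation is the identity, so it isolates the degradation caused by the transformation from differences in clean accuracy between models.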
4.2.1. STYLIZING
Following Geirhos et al. (2019), we generate a stylized version of the test set for Caltech-256 and Tiny ImageNet.
We report the “accuracy on correctly classified images” of all the trained models on the stylized test set in Table 2. Compared with standard CNNs, AT-CNNs achieve higher accuracy on stylized images, whose textures are dramatically changed, despite having lower accuracy on the original test images. The comparison quantitatively shows that AT-CNNs tend to be more invariant with respect to local textures.
4.2.2. SATURATION
We use the saturation operation to manipulate the images and show how increasing saturation levels affects the accuracy of models trained in different ways.
In Figure 4, we visualize images with varying saturation levels. It can be easily observed that increasing the saturation level makes images more “binarized”: some textures are wiped out, while edges become sharper and shape information is preserved. When the saturation level p is smaller than 2 (p = 2 gives the clean image), the operation pushes all the pixels towards 1/2 and nearly all the information is lost; p = 0 leads to a totally gray image with constant pixel value.
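To make the behaviour described above concrete, the sketch below implements one mapping that is consistent with it: identity at p = 2, pixels pushed to 0 or 1 as p grows, and towards 1/2 as p shrinks. The exact formula is an assumption of this sketch, since this section does not restate the definition, and the function name `saturate` is hypothetical.

```python
import numpy as np


def saturate(x, p):
    """Assumed saturation mapping for pixels x in [0, 1] and level p > 0.

    Computes sign(2x - 1) * |2x - 1|**(2 / p) / 2 + 1 / 2, so that p = 2
    leaves the image unchanged.
    """
    x = np.asarray(x, dtype=np.float64)
    centred = 2.0 * x - 1.0  # map [0, 1] to [-1, 1]
    return 0.5 * np.sign(centred) * np.abs(centred) ** (2.0 / p) + 0.5
```

Under this assumed form, p = 0 itself is only reached as a limit (the exponent 2/p diverges), which matches the description of a constant gray image.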
We measure the “accuracy on correctly classified images” for all the trained models and show the results in Figure 5. We can observe that as the saturation level increases, more texture information is lost. Favorably, adversarially trained models are much less sensitive to this texture loss and still obtain high classification accuracy. The results indicate that AT-CNNs are more robust to “saturation” or “binarizing” operations, which suggests that the prediction capability of AT-CNNs relies less on texture and more on shapes. Results on CIFAR-10 tell the same story; they are presented in the appendix due to the limited space.
Additionally, in our experiments, for each adversarial training approach, either PGD-based or FGSM-based, AT-CNNs with higher robustness towards the PGD adversary are more invariant to increasing saturation levels and texture loss. On the other hand, adversarial training with higher robustness typically harms generalization on the clean dataset. Our finding also supports the claim that “robustness may be at odds with accuracy” (Tsipras et al., 2018).
Figure 3. Visualization of images from the style-transferred test set. Applying AdaIN (Huang & Belongie, 2017) style transfer distorts local textures of the original images, while the global shape structure is retained. The first row shows images from Caltech-256, and the second row shows images from Tiny ImageNet.
Table 2. “Accuracy on correctly classified images” for different models on the stylized test set. The columns named “Caltech-256” and “TinyImageNet” show the generalization of different models on the clean test set.
DATASET      CALTECH-256   STYLIZED CALTECH-256   TINYIMAGENET   STYLIZED TINYIMAGENET
STANDARD     83.32         16.83                  72.02           7.25
UNDERFIT     69.04          9.75                  60.35           7.16
PGD-l∞ 8     66.41         19.75                  54.42          18.81
PGD-l∞ 4     72.22         21.10                  61.85          20.51
PGD-l∞ 2     76.51         21.89                  67.06          19.25
PGD-l∞ 1     79.11         22.07                  69.42          18.31
PGD-l2 12    65.24         20.14                  53.44          19.33
PGD-l2 8     69.75         21.62                  58.21          20.42
PGD-l2 4     74.12         22.53                  64.24          21.05
FGSM 8       70.88         21.23                  66.21          15.07
FGSM 4       73.91         21.99                  63.43          20.22
Figure 4. Illustration of how varying saturation changes the appearance of the image. From left to right, saturation level: 0.25, 0.5, 1, 2 (original image), 4, 8, 16, 64, 1024. Increasing the saturation level pushes pixels towards 0 or 1, which preserves most of the shape while wiping out most of the textures. Decreasing the saturation level pushes all pixels to 1/2.
Figure 5. “Accuracy on correctly classified images” for different models on saturated Caltech-256 and Tiny ImageNet with respect to different saturation levels. Panels: (a) Caltech-256, (b) Tiny ImageNet. x-axis: saturation level (2^-2 to 2^10, with the clean image marked); y-axis: accuracy on correctly classified images (0 to 100). Curves: PGD AT with inf norm, PGD AT with l2 norm, FGSM AT, standard training, underfitting. Note that the plot contains several curves with the same color and line type for each adversarial training method (PGD-based and FGSM-based); among these, the models trained with larger perturbations achieve better robustness in most cases. Detailed results are listed in the appendix.
Figure 6. Visualization of the patch-shuffling transformation. Panels: (a) Original Image, (b) Patch-Shuffle 2, (c) Patch-Shuffle 4, (d) Patch-Shuffle 8. The first row shows the probability of “cake” assigned by different models (bar charts; x-axis: Normal, Underfit, PGD-inf 8, PGD-L2 12, FGSM 8; y-axis: probability from 0.0 to 1.0). Bar value labels, per panel: (a) 0.750, 0.952, 0.738, 0.769, 0.932; (b) 0.550, 0.877, 0.012, 0.043, 0.028; (c) 0.541, 0.913, 0.002, 0.012, 0.012; (d) 0.005, 0.305, 0.002, 0.002, 0.003.
Figure 7. “Accuracy on correctly classified images” for different models on patch-shuffled Tiny ImageNet and Caltech-256 with different splitting numbers. Panels: (a) Caltech-256, (b) Tiny ImageNet. x-axis: patch-shuffle setting (clean, patch-shuffle 2, patch-shuffle 4, patch-shuffle 8); y-axis: accuracy on correctly classified images (0 to 100). Curves: PGD AT with inf norm, PGD AT with l2 norm, FGSM AT, standard training, underfitting. Detailed results are listed in the appendix.
When decreasing the saturation level, all models show a similar degree of performance degradation, indicating that AT-CNNs are not robust to all kinds of image distortions. They tend to be more robust to particular types of distortions. We leave further investigation of this issue as future work.
4.2.3 PATCH-SHUFFLING
Stylizing and saturation aim to change or remove the texture information of the original images while preserving the features of shapes and edges. To test the different biases of AT-CNNs and standard CNNs the other way around, we shatter the shape and edge information by splitting each image into k × k patches and then randomly shuffling them. This operation still maintains the local textures if k is not too large.
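The patch-shuffling operation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `patch_shuffle`, the use of NumPy, the H × W × C array layout, and the requirement that the image size be divisible by k are all assumptions made for this example.

```python
import numpy as np


def patch_shuffle(image, k, rng=None):
    """Split an H x W x C image into a k x k grid and randomly permute the patches."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h % k or w % k:
        raise ValueError("image height and width must be divisible by k")
    ph, pw = h // k, w // k
    # Collect the k * k patches in row-major order.
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(k) for j in range(k)]
    order = rng.permutation(k * k)
    # Reassemble the grid row by row from the permuted patches.
    rows = [np.concatenate([patches[order[i * k + j]] for j in range(k)], axis=1)
            for i in range(k)]
    return np.concatenate(rows, axis=0)
```

Each patch is moved as a whole, so pixel statistics inside a patch (the local texture) are untouched, while the arrangement of patches (the global shape) is randomized. With k = 1 the image is returned unchanged.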
Figure 6 shows one example of patch-shuffled images under different numbers of splits. The first row shows the probabilities assigned by different models to the ground-truth class of the original image. After random shuffling, the shape and edge features are largely destroyed: the prediction probability of the adversarially trained CNNs drops significantly, while the normally trained CNNs still maintain high confidence in the ground-truth class. This suggests that AT-CNNs are more biased towards shapes and edges than normally trained ones.
Moreover, Figure 7 depicts the "accuracy on correctly classified images" for all the models measured on the patch-shuffled test sets with an increasing number of splitting pieces. AT-CNNs, especially those trained against a stronger attack, are more sensitive to patch-shuffling operations in most of our experiments.
Note that under the "Patch-shuffle 8" operation all models have similar "accuracy on correctly classified images", which is largely due to the severe information loss. Also note that this accuracy for all models on Tiny ImageNet, shown in Figure 7(b), is much lower than that on Caltech-256 in Figure 7(a). For example, under "Patch-shuffle 2" the normally trained CNN has an accuracy of 84.76% on Caltech-256 but only 66.73% on Tiny ImageNet. This mainly originates from the limited resolution of Tiny ImageNet, since the patch-shuffle operation destroys more useful features on low-resolution images than on higher-resolution ones.
5 Related work and discussion
Interpreting AT-CNNs. Recently, there have been some relevant findings indicating that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-ε adversarial examples against AT-CNNs, it can be observed that the adversarial examples capture salient data characteristics of a different class and appear semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we have conducted a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformation that either largely change the textures while preserving shape information (i.e., stylizing and saturation) or shatter the shape/edge features while keeping the local textures (i.e., patch-shuffling). Evaluating the generalization of AT-CNNs over these designed datasets provides a quantitative way to verify and interpret their strong shape bias compared with normal CNNs.
Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness towards adversarial examples is correlated with the capability of capturing long-range features such as shapes or contours. This naturally raises the question of whether other models that capture more global features, or that have more texture invariance, could be more robust to adversarial examples even without adversarial training. This might provide some insights for designing new network architectures or new strategies for enhancing the bias towards long-range features. Some recent works partially answer this question. Xie et al. (2018) enhanced standard CNNs with non-local blocks inspired by Wang et al. (2018) and Vaswani et al. (2017), which capture long-range dependencies in a data-dependent manner; when combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features. With this design, they achieved improved black-box robustness.
Adversarial training with other types of attacks. In this work, we mainly interpret AT-CNNs based on norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may depend heavily on the type of adversary. Models trained against a spatially transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have similar robustness towards PGD attacks to standard models, and their salience maps are still quite different, as shown in Figure 8. Also, the average distance between salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks, x_adv = G(x; w), parameterized by w, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs based on these generalized types of attacks as future work.
Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models. From left to right: original image, sensitivity maps of standard CNN, PGD-l∞ AT-CNN, and ST-AT-CNN.
6 Conclusion
From both qualitative and quantitative perspectives, we have carried out a systematic study on interpreting adversarially trained convolutional neural networks. By constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on the clean, stylized and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two types of models on the three constructed datasets: stylized, saturated and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while the normally trained CNNs behave the other way around.
Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.
Acknowledgement
This work is supported by National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI) and Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.
References
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.
Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
Ballester, P. and de Araujo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.
Brendel, W. and Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.
Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.
Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.
Jo, J. and Bengio, Y. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.
Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.
Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
A Experiment Setup
A.1 Models
• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units.
• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture using the code from PyTorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them using ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch.
We evaluate the robustness of all our models using an l∞ projected gradient descent adversary with ε = 8/255, step size = 2 and number of iterations as 40.
A.2 Adversarial Training
We perform 9 types of adversarial training on each of the datasets. 7 of the 9 kinds of adversarial training are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).
A.2.1 TRAIN AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY
We list the values of ε used for adversarial training for each dataset and lp-norm. In all settings, PGD runs 20 iterations.
• l∞-norm bounded adversary: For all three datasets, pixel values range over [0, 1]. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf: 1, 2, 4, 8 respectively, with step sizes 1/255, 1/255, 2/255, 4/255.
• l2-norm bounded adversary: For Caltech-256 & Tiny ImageNet, the input size for our model is 224 × 224. We train three adversarially trained CNNs with ε ∈ {4, 8, 12}; these three models are denoted as PGD-l2: 4, 8, 12 respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and have the same step sizes as those for Caltech-256 & Tiny ImageNet.
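As a rough illustration of the l∞-bounded PGD adversary configured above, the sketch below runs the attack against a toy logistic-regression classifier, whose input gradient is available in closed form. The toy model, the helper names `loss_grad` and `pgd_linf`, and the random start are assumptions made for this example; the experiments in this paper use ResNet-18 models in PyTorch. The defaults mirror the PGD-inf 8 setting (ε = 8/255, step size 4/255, 20 iterations).

```python
import numpy as np


def loss_grad(x, y, w, b):
    """Input gradient of the logistic loss for a toy linear classifier (assumed model)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability of class 1
    return (p - y) * w


def pgd_linf(x, y, w, b, eps=8 / 255, step=4 / 255, iters=20,
             random_start=True, rng=None):
    """l_inf-bounded PGD: repeated signed-gradient ascent followed by projection."""
    rng = np.random.default_rng() if rng is None else rng
    x_adv = x.copy()
    if random_start:  # random start inside the eps-ball (an assumption of this sketch)
        x_adv = x_adv + rng.uniform(-eps, eps, size=x.shape)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(loss_grad(x_adv, y, w, b))  # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay in the valid pixel range
    return x_adv
```

Calling `pgd_linf(x, y, w, b, step=eps, iters=1, random_start=False)` reduces the loop to a single signed-gradient step of size ε, which is the FGSM-style adversary of Goodfellow et al. (2014).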
A.2.2 TRAIN AGAINST A FGSM ADVERSARY
The values of ε for these two adversarially trained CNNs are ε ∈ {4, 8}, and they are denoted as FGSM: 4, 8 respectively.
B Style-transferred test set
Following Geirhos et al. (2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, using the style of a randomly selected painting from Kaggle's Painter by Numbers dataset³. We used the source code provided by Geirhos et al. (2019).
C Experiments on Fourier-filtered datasets
Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.
C.1 Fourier filtering setup
Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set:
• The low frequency filtered version: We use a radial mask in the Fourier domain to set higher frequency modes to zero (low-pass filtering).
• The high frequency filtered version: We use a radial mask in the Fourier domain to preserve only the higher frequency modes (high-pass filtering).
• The random filtered version: We use a random mask in the Fourier domain to set each mode to 0 with probability p uniformly. The random mask is generated on the fly during the test.
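The three filters listed above can be sketched with NumPy's FFT routines. This is an illustrative sketch for a single grayscale image, not the code used for the experiments: the function names, the `radius` parameter of the radial mask, and taking the real part of the inverse transform are assumptions of this example.

```python
import numpy as np


def radial_mask(h, w, radius):
    """True for modes within `radius` of the zero frequency (centered spectrum layout)."""
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    return dist <= radius


def fourier_filter(image, mask):
    """Multiply the centered 2-D spectrum by `mask` and transform back."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    # A mask that is not symmetric can leave a small imaginary part; keep the real part.
    return filtered.real


def low_pass(image, radius):
    """Keep only the modes inside the radial mask."""
    return fourier_filter(image, radial_mask(*image.shape, radius))


def high_pass(image, radius):
    """Keep only the modes outside the radial mask."""
    return fourier_filter(image, ~radial_mask(*image.shape, radius))


def random_filter(image, p, rng=None):
    """Set each mode to zero independently with probability p."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(image.shape) >= p
    return fourier_filter(image, keep)
```

Because the low-pass and high-pass masks are complementary, `low_pass(img, r) + high_pass(img, r)` recovers the original image, which is a convenient sanity check. For color images the same filter could be applied to each channel separately.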
C.2 Results
We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that AT-CNNs make their predictions depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are more low-frequency.
Table 3. "Accuracy on correctly classified images" for different models on three Fourier-filtered Caltech-256 test sets.
MODEL       LOW-FREQUENCY FILTERED   HIGH-FREQUENCY FILTERED   RANDOM FILTERED
STANDARD    15.8                     16.5                      73.5
UNDERFIT    14.5                     17.6                      62.2
PGD-l∞      71.1                     3.6                       73.4
D Detailed results
We list the detailed results for our quantitative experiments here. Tables 4, 5 and 6 show the results of each model on test sets with different saturation levels. Tables 7 and 8 list all the results of each model on test sets after different patch-shuffling operations.
³ https://www.kaggle.com/c/painter-by-numbers
E Additional Figures
We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps generated using Grad and SmoothGrad in Figure 10.
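SmoothGrad (Smilkov et al., 2017) averages the gradient sensitivity map over noisy copies of the input, which is why its maps tend to look less noisy than those of plain Grad. A minimal sketch, assuming a user-supplied `grad_fn` that returns the gradient of the class score with respect to the input; the function name, the sample count and the noise level are illustrative choices, not the settings used for Figures 9 and 10.

```python
import numpy as np


def smoothgrad(grad_fn, x, n_samples=50, noise_std=0.1, rng=None):
    """Average grad_fn over Gaussian-perturbed copies of x."""
    rng = np.random.default_rng() if rng is None else rng
    total = np.zeros_like(x, dtype=np.float64)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, noise_std, size=x.shape)
        total += grad_fn(noisy)  # plain gradient ("Grad") at the noisy input
    return total / n_samples
```

With `noise_std=0.0` every sample equals the input, so the result reduces to the plain Grad map `grad_fn(x)`.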
Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.
SATURATION LEVEL  0.25   0.5    1      4      8      16     64     1024
STANDARD          28.62  57.45  85.20  90.13  65.37  42.37  23.45  20.03
UNDERFIT          31.84  63.36  90.96  84.51  57.51  38.58  26.00  23.08
PGD-l∞ 8          32.84  53.47  82.72  86.45  70.33  61.09  53.76  51.91
PGD-l∞ 4          31.99  57.74  85.18  87.95  70.33  58.38  48.16  45.45
PGD-l∞ 2          32.99  60.75  87.75  89.35  68.78  51.99  40.69  37.83
PGD-l∞ 1          32.67  61.85  89.36  90.18  69.07  50.05  37.98  34.80
PGD-l2 12         31.38  53.07  82.10  83.89  67.06  58.51  52.45  50.75
PGD-l2 8          32.82  56.65  85.01  86.09  68.90  58.75  51.59  49.30
PGD-l2 4          32.82  58.77  86.30  86.36  67.94  53.68  44.43  41.98
FGSM 8            29.53  55.46  85.10  86.65  69.01  55.64  45.92  43.42
FGSM 4            32.68  59.37  87.22  87.90  66.71  51.13  41.66  38.78
Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.
SATURATION LEVEL  0.25   0.5    1      4      8      16     64     1024
STANDARD          7.24   25.88  72.52  72.73  25.38  8.24   2.62   1.93
UNDERFIT          7.34   25.44  69.80  60.67  18.01  6.72   3.16   2.65
PGD-l∞ 8          11.07  29.08  67.11  74.53  49.8   40.16  35.44  33.96
PGD-l∞ 4          12.44  33.53  72.94  75.75  46.38  32.12  24.92  22.65
PGD-l∞ 2          12.09  34.85  75.77  76.15  41.35  25.20  16.93  14.52
PGD-l∞ 1          11.30  35.03  76.85  78.63  40.48  21.37  12.70  10.81
PGD-l2 12         11.30  29.48  66.94  75.22  52.26  42.11  37.20  35.85
PGD-l2 8          12.42  32.78  71.94  75.15  47.92  35.66  29.55  27.90
PGD-l2 4          12.63  34.10  74.06  77.32  45.00  28.73  20.16  18.04
FGSM 8            12.59  32.66  70.55  81.53  41.83  17.52  7.29   5.82
FGSM 4            12.63  34.10  74.06  75.05  42.91  29.09  22.15  20.14
Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.
SATURATION LEVEL  0.25   0.5    1      4      8      16     64     1024
STANDARD          27.36  55.95  91.03  93.12  69.98  48.30  34.39  31.06
UNDERFIT          21.43  50.28  87.71  89.89  66.09  43.35  29.10  26.13
PGD-l∞ 8          26.05  46.96  80.97  89.16  75.46  69.08  58.98  64.64
PGD-l∞ 4          27.22  49.81  84.16  89.79  73.89  65.35  59.99  58.47
PGD-l∞ 2          28.32  53.12  86.93  91.37  74.02  62.82  55.25  52.60
PGD-l∞ 1          27.18  53.59  88.54  91.77  72.67  58.39  47.25  41.75
PGD-l2 12         25.99  46.92  81.72  88.44  73.92  66.03  60.98  59.41
PGD-l2 8          27.75  50.29  83.76  80.92  73.17  64.83  58.64  46.94
PGD-l2 4          27.26  51.17  85.78  90.08  73.12  61.50  52.04  48.79
FGSM 8            25.50  46.11  81.72  87.67  74.22  67.12  62.51  61.32
FGSM 4            26.39  58.93  84.30  89.02  73.47  64.43  58.80  56.82
Table 7. "Accuracy on correctly classified images" for different models on the patch-shuffled Caltech-256 test set. The results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Caltech-256.
MODEL       2 × 2   4 × 4   8 × 8
STANDARD    84.76   51.50   10.84
UNDERFIT    75.59   33.41   6.03
PGD-l∞ 8    58.13   20.14   7.70
PGD-l∞ 4    68.54   26.45   8.18
PGD-l∞ 2    74.25   30.77   9.00
PGD-l∞ 1    78.11   35.03   8.42
PGD-l2 12   58.25   21.03   7.85
PGD-l2 8    63.36   22.19   8.48
PGD-l2 4    69.65   28.21   7.72
FGSM 8      64.48   22.94   8.07
FGSM 4      70.50   28.41   6.03
Table 8. "Accuracy on correctly classified images" for different models on the patch-shuffled Tiny ImageNet test set. The results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Tiny ImageNet.
MODEL       2 × 2   4 × 4   8 × 8
STANDARD    66.73   24.87   4.48
UNDERFIT    59.22   23.62   4.38
PGD-l∞ 8    41.08   16.05   6.83
PGD-l∞ 4    49.54   18.23   6.30
PGD-l∞ 2    55.96   19.95   5.61
PGD-l∞ 1    60.19   23.24   6.08
PGD-l2 12   42.23   16.95   7.66
PGD-l2 8    47.67   16.28   6.50
PGD-l2 4    51.94   17.79   5.89
FGSM 8      57.42   20.70   4.73
FGSM 4      50.68   16.84   5.98
Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: Standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1; PGD-L2: 12, 8, 4; and FGSM: 8, 4.
Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: Standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1; PGD-L2: 12, 8, 4; and FGSM: 8, 4. It is easily observed that sensitivity maps generated from Grad are noisier than those of its smoothed variant SmoothGrad, especially for Standard CNNs and underfitting CNNs.
(a) Images from Caltech-256 (b) Images from Tiny ImageNet
Figure 2. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models on images under saturation and stylizing. From top to bottom: Original, Saturation 1024 and Stylizing. For each group of images, from left to right: original image, sensitivity maps of standard CNN, underfitting CNN and PGD-l∞ AT-CNN.
while standard CNNs totally fail.
Due to the limited space, we provide more visualization results (including the sensitivity maps generated by the Grad method) in the appendix.
4.2 Generalization performance on transformed data
In this part, we mainly show the generalization performance of AT-CNNs and normal CNNs on distorted image datasets that preserve either shape or texture. This helps us understand, in a quantitative way, how differently the two types of models are biased.
For all experimental results below, besides the top-1 accuracy, we also report an "accuracy on correctly classified images". This accuracy is measured by first selecting the images from the clean test set that are correctly classified, then measuring the accuracy on the transformed versions of these correctly classified images.
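The metric defined above can be written down directly. The sketch below assumes that predictions and labels are given as plain sequences of class indices; the function name and argument names are illustrative, not taken from the authors' code.

```python
def accuracy_on_correctly_classified(clean_preds, transformed_preds, labels):
    """Accuracy on transformed images, restricted to images that are correct when clean."""
    # Indices of test images the model classifies correctly before any transformation.
    kept = [i for i, (pred, label) in enumerate(zip(clean_preds, labels)) if pred == label]
    if not kept:
        raise ValueError("no clean image was classified correctly")
    hits = sum(transformed_preds[i] == labels[i] for i in kept)
    return hits / len(kept)
```

For example, with labels `[0, 1, 2, 3]`, clean predictions `[0, 1, 2, 0]` and transformed predictions `[0, 1, 0, 3]`, three images are kept and two of them remain correct, so the metric is 2/3. The fourth image is ignored even though its transformed prediction happens to be right. Restricting to the kept images makes models with different clean accuracy comparable, since every model starts from 100% on its own subset.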
4.2.1 STYLIZING
Following Geirhos et al. (2019), we generate stylized versions of the test sets for Caltech-256 and Tiny ImageNet.
We report the "accuracy on correctly classified images" of all the trained models on the stylized test sets in Table 2. Compared with standard CNNs, AT-CNNs, though with a lower accuracy on the original test images, achieve higher accuracy on the stylized ones, whose textures are dramatically changed. The comparison quantitatively shows that AT-CNNs tend to be more invariant with respect to local textures.
4.2.2 SATURATION
We use the saturation operation to manipulate the images and show how increasing saturation levels affect the accuracy of models trained in different ways.
In Figure 4, we visualize images with varying saturation levels. It can be easily observed that increasing the saturation level pushes images to be more "binarized": some textures are wiped out, but sharper edges are produced and shape information is preserved. When the saturation level is smaller than 2 (the level of the clean image), it pushes all the pixels towards 1/2 and nearly all the information is lost; p = 0 leads to a totally gray image with constant pixel value.
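The behaviour described above is consistent with a saturation transform of the form x_p = sign(2x − 1) · |2x − 1|^(2/p) · 0.5 + 0.5 for pixel values x in [0, 1]. This formula is an assumption here, since this section describes the operation only qualitatively; the sketch below shows that it reproduces the stated properties (p = 2 leaves the image unchanged, small p pushes pixels towards 1/2, large p binarizes). The function name `saturate` and the special case for p = 0 are illustrative.

```python
import numpy as np


def saturate(x, p):
    """Assumed saturation transform for pixel values x in [0, 1] at saturation level p."""
    x = np.asarray(x, dtype=np.float64)
    if p == 0:
        # Limiting case described in the text: a constant gray image.
        return np.full_like(x, 0.5)
    centered = 2.0 * x - 1.0  # map [0, 1] to [-1, 1]
    # Exponent 2/p: equal to 1 at p = 2 (identity), close to 0 for large p (binarization).
    return 0.5 * np.sign(centered) * np.abs(centered) ** (2.0 / p) + 0.5
```

For instance, a pixel value of 0.9 stays at 0.9 for p = 2, moves to roughly 0.58 for p = 0.25, and approaches 1 for p = 1024.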
We measure the “accuracy on correctly classified images” for all the trained models and show the results in Figure 5. We can observe that, as the saturation level increases, more texture information is lost. Favorably, adversarially trained models exhibit much less sensitivity to this texture loss and still obtain a high classification accuracy. The results indicate that AT-CNNs are more robust to “saturation” or “binarizing” operations, which suggests that the prediction capability of AT-CNNs relies less on texture and more on shapes. Results on CIFAR-10 tell the same story and are presented in the appendix due to limited space.
Additionally, in our experiments, for each adversarial training approach (either PGD or FGSM), AT-CNNs with higher robustness towards the PGD adversary are more invariant to increasing saturation levels and texture loss. On the other hand, adversarial training with higher robustness typically hurts generalization on the clean dataset. Our finding also supports the claim that “robustness may be at odds with accuracy” (Tsipras et al., 2018).
Figure 3. Visualization of images from the style-transferred test set. Applying AdaIN (Huang & Belongie, 2017) style transfer distorts local textures of the original images while the global shape structure is retained. The first row shows images from Caltech-256 and the second row shows images from Tiny ImageNet.
Table 2. “Accuracy on correctly classified images” for different models on the stylized test sets. The columns named “Caltech-256” and “TinyImageNet” show the generalization of different models on the clean test sets.
DATASET      CALTECH-256   STYLIZED CALTECH-256   TINYIMAGENET   STYLIZED TINYIMAGENET
STANDARD     83.32         16.83                  72.02          7.25
UNDERFIT     69.04         9.75                   60.35          7.16
PGD-l∞: 8    66.41         19.75                  54.42          18.81
PGD-l∞: 4    72.22         21.10                  61.85          20.51
PGD-l∞: 2    76.51         21.89                  67.06          19.25
PGD-l∞: 1    79.11         22.07                  69.42          18.31
PGD-l2: 12   65.24         20.14                  53.44          19.33
PGD-l2: 8    69.75         21.62                  58.21          20.42
PGD-l2: 4    74.12         22.53                  64.24          21.05
FGSM: 8      70.88         21.23                  66.21          15.07
FGSM: 4      73.91         21.99                  63.43          20.22
Figure 4. Illustration of how varying saturation changes the appearance of the image. From left to right, saturation level: 0.25, 0.5, 1, 2 (original image), 4, 8, 16, 64, 1024. Increasing the saturation level pushes pixels towards 0 or 1, which preserves most of the shape while wiping out most of the textures. Decreasing the saturation level pushes all pixels towards 1/2.
[Figure 5 plots: (a) Caltech-256, (b) Tiny ImageNet. x-axis: Saturation Level (2^-2 to 2^10, with the clean image marked); y-axis: Accuracy on correctly classified images (0 to 100). Curves: PGD AT with inf norm, PGD AT with l2 norm, FGSM AT, Standard Training, Underfitting.]
Figure 5. “Accuracy on correctly classified images” for different models on saturated Caltech-256 and Tiny ImageNet with respect to different saturation levels. Note that in the plot there are several curves with the same color and line type for each adversarial training method (PGD- and FGSM-based); among them, those trained with larger perturbations achieve better robustness in most cases. Detailed results are listed in the appendix.
[Figure 6 panels: (a) Original Image, (b) Patch-Shuffle 2, (c) Patch-Shuffle 4, (d) Patch-Shuffle 8. Bar charts (y-axis 0.0 to 1.0) of the probability of “cake” for Normal, Underfit, PGD-inf 8, PGD-L2 12, and FGSM 8, in that order:
(a) 0.750, 0.952, 0.738, 0.769, 0.932
(b) 0.550, 0.877, 0.012, 0.043, 0.028
(c) 0.541, 0.913, 0.002, 0.012, 0.012
(d) 0.005, 0.305, 0.002, 0.002, 0.003]
Figure 6. Visualization of the patch-shuffling transformation. The first row shows the probability of “cake” assigned by different models.
[Figure 7 plots: (a) Caltech-256, (b) Tiny ImageNet. x-axis: Patch-Shuffle (clean, patch-shuffle 2, patch-shuffle 4, patch-shuffle 8); y-axis: Accuracy on correctly classified images (0 to 100). Curves: PGD AT with inf norm, PGD AT with l2 norm, FGSM AT, Standard Training, Underfitting.]
Figure 7. “Accuracy on correctly classified images” for different models on patch-shuffled Tiny ImageNet and Caltech-256 with different splitting numbers. Detailed results are listed in the appendix.
When decreasing the saturation level, all models show a similar degree of performance degradation, indicating that AT-CNNs are not robust to all kinds of image distortions. They tend to be more robust to certain fixed types of distortions. We leave further investigation of this issue as future work.
4.2.3. PATCH-SHUFFLING
Stylizing and saturation operations aim at changing or removing the texture information of the original images while preserving the features of shapes and edges. In order to test the different biases of AT-CNNs and standard CNNs the other way around, we shatter the shape and edge information by splitting the images into k × k patches and then randomly shuffling them. This operation still maintains the local textures if k is not too large.
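The patch-shuffling operation can be sketched as follows. This is a minimal illustration assuming an image array of shape (H, W) or (H, W, C) whose height and width are divisible by k; the function name is hypothetical.

```python
import numpy as np

def patch_shuffle(image, k, rng=None):
    """Split an image into k x k patches and reassemble them in random order."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h % k or w % k:
        raise ValueError("image height and width must be divisible by k")
    ph, pw = h // k, w // k
    # Collect the k*k patches in row-major order.
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(k) for j in range(k)]
    order = rng.permutation(k * k)
    # Rebuild the image row by row from the permuted patches.
    rows = [np.concatenate([patches[order[i * k + j]] for j in range(k)], axis=1)
            for i in range(k)]
    return np.concatenate(rows, axis=0)

image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
shuffled = patch_shuffle(image, 2, np.random.default_rng(0))
```

Each patch is moved as a whole, so pixel statistics inside a patch (local texture) are untouched while the global layout is destroyed.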
Figure 6 shows one example of patch-shuffled images under different numbers of splits. The first row shows the probabilities assigned by different models to the ground truth class of the original image. After random shuffling, the shape and edge features are destroyed dramatically; the prediction probability of the adversarially trained CNNs drops significantly, while the normal CNNs still maintain a high confidence in the ground truth class. This reveals that AT-CNNs are more biased towards shapes and edges than normally trained ones.
Moreover, Figure 7 depicts the “accuracy on correctly classified images” for all the models measured on the “Patch-shuffled” test sets with an increasing number of splitting pieces. AT-CNNs, especially those trained against a stronger attack, are more sensitive to “Patch-shuffling” operations in most of our experiments.
Note that under the “Patch-shuffle 8” operation, all models have similar “accuracy on correctly classified images”, which is largely due to the severe information loss. Also note that this accuracy of all models on Tiny ImageNet, shown in Figure 7(b), is much lower than that on Caltech-256, shown in Figure 7(a). For example, under “Patch-shuffle 2”, the normally trained CNN has an accuracy of 84.76% on Caltech-256 but only 66.73% on Tiny ImageNet. This mainly originates from the limited resolution of Tiny ImageNet, since the “Patch-Shuffle” operation on low-resolution images destroys more useful features than on higher-resolution images.
5. Related work and discussion
Interpreting AT-CNNs. Recently, there have been some relevant findings indicating that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-ε adversarial examples against AT-CNNs, it can be observed that the adversarial examples capture salient data characteristics of a different class and appear semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we have conducted a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformations that either largely change the textures while preserving shape information (i.e., stylizing and saturation) or shatter the shape/edge features while keeping the local textures (i.e., patch-shuffling). Evaluating the generalization of AT-CNNs over these designed datasets provides a quantitative way to verify and interpret their strong shape bias compared with normal CNNs.
Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness towards adversarial examples is correlated with the capability of capturing long-range features like shapes or contours. This naturally raises the question of whether other models that capture more global features or have more texture invariance could be more robust to adversarial examples, even without adversarial training. This might provide some insights for designing new network architectures or new strategies for enhancing the bias towards long-range features. Some recent works partially answer this question. Xie et al. (2018) enhanced standard CNNs with non-local blocks inspired by Wang et al. (2018) and Vaswani et al. (2017), which capture long-range dependencies in a data-dependent manner; when combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features. With this design, they achieved improved black-box robustness.
Adversarial training with other types of attacks. In this work, we mainly interpret AT-CNNs based on norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may depend heavily on the type of adversary. Models trained against a spatially transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have robustness towards the PGD attack similar to that of standard models, and their salience maps are still quite different, as shown in Figure 8. Also, the average distance between salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks, $x_{adv} = G(x; w)$ parameterized by $w$, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs based on these generalized types of attacks as future work.
Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models. From left to right: original image, sensitivity maps of standard CNN, PGD-l∞ AT-CNN, and ST-AT-CNN.
6. Conclusion
From both qualitative and quantitative perspectives, we have conducted a systematic study on interpreting adversarially trained convolutional neural networks. By constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized, and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two types of models on the three constructed datasets: stylized, saturated, and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while the opposite holds for normally trained CNNs.
Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.
Acknowledgement
This work is supported by the National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI), and the Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.
References
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.
Ancona, M., Ceolini, E., Öztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
Ballester, P. and de Araujo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.
Brendel, W. and Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.
Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.
Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.
Jo, J. and Bengio, Y. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.
Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.
Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
A. Experiment Setup
A.1. Models
• CIFAR-10. We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units.
• Caltech-256 & Tiny ImageNet. We use a ResNet-18 architecture using the code from PyTorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them using ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch.
We evaluate the robustness of all our models using an l∞ projected gradient descent adversary with ε = 8/255, step size = 2, and 40 iterations.
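A minimal sketch of an l∞ PGD adversary, written against a toy linear softmax classifier so that the input gradient is available in closed form. The classifier, the function names, and the reading of the step size as 2/255 are assumptions for illustration; the paper's evaluation attacks ResNet-18 models instead.

```python
import numpy as np

def softmax_loss_and_grad(W, b, x, y):
    """Cross-entropy loss of a linear softmax classifier and its gradient w.r.t. x."""
    logits = W @ x + b
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    onehot = np.zeros_like(probs)
    onehot[y] = 1.0
    return -np.log(probs[y]), W.T @ (probs - onehot)

def pgd_linf(W, b, x, y, eps=8 / 255, step=2 / 255, iters=40, rng=None):
    """l_inf PGD: signed gradient ascent on the loss, projected onto the eps-ball.

    If rng is given, the attack starts from a uniform random point in the ball
    (an optional variant, not stated in the text above).
    """
    x_adv = x.copy()
    if rng is not None:
        x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(iters):
        _, grad = softmax_loss_and_grad(W, b, x_adv, y)
        x_adv = x_adv + step * np.sign(grad)      # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the l_inf ball around x
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep valid pixel values
    return x_adv

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 10)), np.zeros(3)  # hypothetical toy classifier
x, y = rng.uniform(0.1, 0.9, size=10), 1      # hypothetical input and label
x_adv = pgd_linf(W, b, x, y)
loss_clean, _ = softmax_loss_and_grad(W, b, x, y)
loss_adv, _ = softmax_loss_and_grad(W, b, x_adv, y)
```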
A.2. Adversarial Training
We perform 9 types of adversarial training on each dataset. Seven of the 9 kinds of adversarial training are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).
A.2.1. TRAIN AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY
We list the values of ε for adversarial training for each dataset and lp-norm. In all settings, PGD runs 20 iterations.
• l∞-norm bounded adversary. For all three datasets, pixel values range from 0 to 1. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf: 1, 2, 4, 8, respectively, with step sizes 1/255, 1/255, 2/255, 4/255.
• l2-norm bounded adversary. For Caltech-256 & Tiny ImageNet, where the input size for our model is 224 × 224, we train three adversarially trained CNNs with ε ∈ {4, 8, 12}; these three models are denoted as PGD-l2: 4, 8, 12, respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and have the same step sizes as those for Caltech-256 & Tiny ImageNet.
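For the l2-bounded case, the update differs from the l∞ case in two places: the step follows the l2-normalized gradient, and the projection rescales the perturbation onto the l2 ball. A minimal sketch of one such step, with hypothetical function names and toy values (the gradient is passed in, since computing it depends on the model):

```python
import numpy as np

def project_l2(delta, eps):
    """Rescale a perturbation so that its l2 norm is at most eps."""
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta = delta * (eps / norm)
    return delta

def pgd_l2_step(x, x_adv, grad, eps, step):
    """One l2 PGD step: move along the normalized gradient, then project."""
    grad_norm = np.linalg.norm(grad)
    if grad_norm > 0:
        x_adv = x_adv + step * grad / grad_norm
    delta = project_l2(x_adv - x, eps)
    # Clipping to [0, 1] only moves pixels back towards x, so the l2 bound still holds.
    return np.clip(x + delta, 0.0, 1.0)

x = np.full((3, 4, 4), 0.5)  # toy image
grad = np.ones_like(x)       # toy loss gradient
x_adv = pgd_l2_step(x, x.copy(), grad, eps=0.5, step=1.0)
```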
A.2.2. TRAIN AGAINST AN FGSM ADVERSARY
The values of ε for these two adversarially trained CNNs are ε ∈ {4, 8}, and they are denoted as FGSM: 4, 8, respectively.
B. Style-transferred test set
Following Geirhos et al. (2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, using the style of a randomly selected painting from Kaggle's Painter by Numbers dataset (see footnote 3). We used the source code provided by Geirhos et al. (2019).
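The core AdaIN operation and the role of the stylization coefficient α can be sketched on raw feature arrays. This is only an illustration under stated assumptions: in the full method of Huang & Belongie (2017) the features come from a pre-trained encoder and the result is passed through a learned decoder, both omitted here, and the function names are hypothetical.

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Align the per-channel mean and std of content features (C, H, W)
    with those of the style features."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean

def stylized_features(content_feat, style_feat, alpha=1.0):
    """Blend AdaIN output with the content features.

    alpha = 1.0 gives full stylization, alpha = 0.0 returns the content features.
    """
    target = adain(content_feat, style_feat)
    return alpha * target + (1.0 - alpha) * content_feat

rng = np.random.default_rng(0)
content = rng.normal(size=(4, 8, 8))           # stand-in for encoder features
style = 3.0 * rng.normal(size=(4, 8, 8)) + 2.0
full = stylized_features(content, style, alpha=1.0)
none = stylized_features(content, style, alpha=0.0)
```

With α = 1.0, as used above, the channel statistics of the output match those of the style features, which is what changes local texture while the spatial arrangement of the content features is kept.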
C. Experiments on Fourier-filtered datasets
Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.
C.1. Fourier filtering setup
Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set.
• The low frequency filtered version. We use a radial mask in the Fourier domain to set higher frequency modes to zero (low-pass filtering).
• The high frequency filtered version. We use a radial mask in the Fourier domain to preserve only the higher frequency modes (high-pass filtering).
• The random filtered version. We use a random mask in the Fourier domain to set each mode to 0 with probability p uniformly. The random mask is generated on the fly during the test.
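The three filters above can be sketched with NumPy's FFT routines. This is a minimal illustration for a single 2-D channel; the function names and the radius values are hypothetical, and the exact mask radii and probability p used in the experiments are not specified here. For color images the same filter would be applied per channel.

```python
import numpy as np

def radial_mask(h, w, radius):
    """Boolean (h, w) mask, True within `radius` of the centered zero frequency."""
    yy, xx = np.ogrid[:h, :w]
    cy, cx = h // 2, w // 2  # position of the zero frequency after fftshift
    return (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2

def fourier_filter(image, mask):
    """Keep only the Fourier modes selected by `mask` (centered layout)."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real  # drop the imaginary part left by a non-symmetric mask

def low_pass(image, radius):
    return fourier_filter(image, radial_mask(*image.shape, radius))

def high_pass(image, radius):
    return fourier_filter(image, ~radial_mask(*image.shape, radius))

def random_filter(image, p, rng=None):
    """Set each Fourier mode to zero independently with probability p."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(image.shape) >= p
    return fourier_filter(image, keep)

rng = np.random.default_rng(0)
img = rng.random((16, 16))
low, high = low_pass(img, 4), high_pass(img, 4)
noisy = random_filter(img, 0.5, rng)
```

Because the low-pass and high-pass masks are complementary, the two filtered images sum back to the original.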
C.2. Results
We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that AT-CNNs make their predictions depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are more low-frequency.
3 https://www.kaggle.com/c/painter-by-numbers
Table 3. “Accuracy on correctly classified images” for different models on three Fourier-filtered Caltech-256 test sets.
DATA SET    LOW FREQUENCY FILTERED   HIGH FREQUENCY FILTERED   RANDOM FILTERED
STANDARD    15.8                     16.5                      73.5
UNDERFIT    14.5                     17.6                      62.2
PGD-l∞      71.1                     3.6                       73.4
D. Detailed results
We list the detailed results for our quantitative experiments here. Tables 4, 5, and 6 show the results of each model on test sets with different saturation levels. Tables 7 and 8 list the results of each model on test sets after different patch-shuffling operations.
E. Additional Figures
We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps obtained using Grad and SmoothGrad in Figure 10.
Table 4. “Accuracy on correctly classified images” for different models on the saturated Caltech-256 test set. AT-CNNs are much more robust to increasing saturation levels on Caltech-256.
SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
UNDERFIT           31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞: 8          32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞: 4          31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞: 2          32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞: 1          32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2: 12         31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2: 8          32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2: 4          32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM: 8            29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM: 4            32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78
Table 5. “Accuracy on correctly classified images” for different models on the saturated Tiny ImageNet test set. AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.
SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           7.24    25.88   72.52   72.73   25.38   8.24    2.62    1.93
UNDERFIT           7.34    25.44   69.80   60.67   18.01   6.72    3.16    2.65
PGD-l∞: 8          11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-l∞: 4          12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-l∞: 2          12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-l∞: 1          11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-l2: 12         11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-l2: 8          12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-l2: 4          12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM: 8            12.59   32.66   70.55   81.53   41.83   17.52   7.29    5.82
FGSM: 4            12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14
Table 6. “Accuracy on correctly classified images” for different models on the saturated CIFAR-10 test set. AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.
SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
UNDERFIT           21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-l∞: 8          26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-l∞: 4          27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-l∞: 2          28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-l∞: 1          27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-l2: 12         25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-l2: 8          27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-l2: 4          27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM: 8            25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM: 4            26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82
Table 7. “Accuracy on correctly classified images” for different models on the patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Caltech-256.
DATA SET     2 × 2   4 × 4   8 × 8
STANDARD     84.76   51.50   10.84
UNDERFIT     75.59   33.41   6.03
PGD-l∞: 8    58.13   20.14   7.70
PGD-l∞: 4    68.54   26.45   8.18
PGD-l∞: 2    74.25   30.77   9.00
PGD-l∞: 1    78.11   35.03   8.42
PGD-l2: 12   58.25   21.03   7.85
PGD-l2: 8    63.36   22.19   8.48
PGD-l2: 4    69.65   28.21   7.72
FGSM: 8      64.48   22.94   8.07
FGSM: 4      70.50   28.41   6.03
Table 8. “Accuracy on correctly classified images” for different models on the patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Tiny ImageNet.
DATA SET     2 × 2   4 × 4   8 × 8
STANDARD     66.73   24.87   4.48
UNDERFIT     59.22   23.62   4.38
PGD-l∞: 8    41.08   16.05   6.83
PGD-l∞: 4    49.54   18.23   6.30
PGD-l∞: 2    55.96   19.95   5.61
PGD-l∞: 1    60.19   23.24   6.08
PGD-l2: 12   42.23   16.95   7.66
PGD-l2: 8    47.67   16.28   6.50
PGD-l2: 4    51.94   17.79   5.89
FGSM: 8      57.42   20.70   4.73
FGSM: 4      50.68   16.84   5.98
Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4.
Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4. It is easily observed that sensitivity maps generated from Grad are noisier than those from its smoothed variant SmoothGrad, especially for standard CNNs and underfitting CNNs.
Figure 3. Visualization of images from the style-transferred test set. Applying AdaIN (Huang & Belongie, 2017) style transfer distorts the local textures of the original images while retaining the global shape structure. The first row shows images from Caltech-256 and the second row shows images from Tiny ImageNet.
Table 2. “Accuracy on correctly classified images” (%) for different models on the stylized test sets. The columns named “Caltech-256” and “TinyImageNet” show the generalization of different models on the clean test sets.

DATASET      CALTECH-256   STYLIZED CALTECH-256   TINYIMAGENET   STYLIZED TINYIMAGENET
STANDARD     83.32         16.83                  72.02          7.25
UNDERFIT     69.04         9.75                   60.35          7.16
PGD-l∞ 8     66.41         19.75                  54.42          18.81
PGD-l∞ 4     72.22         21.10                  61.85          20.51
PGD-l∞ 2     76.51         21.89                  67.06          19.25
PGD-l∞ 1     79.11         22.07                  69.42          18.31
PGD-l2 12    65.24         20.14                  53.44          19.33
PGD-l2 8     69.75         21.62                  58.21          20.42
PGD-l2 4     74.12         22.53                  64.24          21.05
FGSM 8       70.88         21.23                  66.21          15.07
FGSM 4       73.91         21.99                  63.43          20.22
Figure 4. Illustration of how varying saturation changes the appearance of the image. From left to right, saturation level: 0.25, 0.5, 1, 2 (original image), 4, 8, 16, 64, 1024. Increasing the saturation level pushes pixels towards 0 or 1, which preserves most of the shape while wiping out most of the textures. Decreasing the saturation level pushes all pixels towards 1/2.
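The behaviour described in the Figure 4 caption can be illustrated with a short sketch. The exact transform is not restated in this part of the paper, so the formula below is an assumption: it is the saturation transform of Ding et al. (2019), x_p = sign(2x − 1)|2x − 1|^(2/p) / 2 + 1/2, which reproduces the stated limits (p = 2 is the identity, large p pushes pixels to 0 or 1, small p pushes them to 1/2). The function name `saturate` is hypothetical.

```python
import numpy as np

def saturate(image, p):
    """Apply saturation level p to an image with pixel values in [0, 1].

    Assumed formula (not restated in this section):
    x_p = sign(2x - 1) * |2x - 1|**(2 / p) / 2 + 1 / 2
    """
    centered = 2.0 * image - 1.0                       # map [0, 1] to [-1, 1]
    scaled = np.sign(centered) * np.abs(centered) ** (2.0 / p)
    return scaled / 2.0 + 0.5                          # map back to [0, 1]

x = np.array([0.1, 0.4, 0.5, 0.9])
unchanged = saturate(x, 2)       # p = 2 leaves the image unchanged
binarized = saturate(x, 1024)    # large p pushes pixels towards 0 or 1
washed_out = saturate(x, 0.25)   # small p pushes pixels towards 1/2
```

Applied to every test image at each level listed in the caption, this would yield one saturated test set per level.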
[Figure 5 plots: (a) Caltech-256, (b) Tiny ImageNet. x-axis: saturation level, from 2^-2 to 2^10, with the clean image marked; y-axis: accuracy on correctly classified images, from 0 to 100. Curves: PGD AT with l∞ norm, PGD AT with l2 norm, FGSM AT, standard training, underfitting.]

Figure 5. “Accuracy on correctly classified images” for different models on saturated Caltech-256 and Tiny ImageNet with respect to different saturation levels. Note that the plots show several curves with the same color and line type for each adversarial training method (PGD- and FGSM-based); among these, the models trained with larger perturbations achieve better robustness in most cases. Detailed results are listed in the appendix.
[Figure 6 panels: (a) Original Image, (b) Patch-Shuffle 2, (c) Patch-Shuffle 4, (d) Patch-Shuffle 8. Bar charts, y-axis from 0.0 to 1.0, with the following values:]

MODEL        (a) ORIGINAL   (b) PATCH-SHUFFLE 2   (c) PATCH-SHUFFLE 4   (d) PATCH-SHUFFLE 8
Normal       0.750          0.550                 0.541                 0.005
Underfit     0.952          0.877                 0.913                 0.305
PGD-inf 8    0.738          0.012                 0.002                 0.002
PGD-L2 12    0.769          0.043                 0.012                 0.002
FGSM 8       0.932          0.028                 0.012                 0.003

Figure 6. Visualization of the patch-shuffling transformation. The first row shows the probability of “cake” assigned by different models.
[Figure 7 plots: (a) Caltech-256, (b) Tiny ImageNet. x-axis: patch-shuffle setting (clean, patch-shuffle 2, patch-shuffle 4, patch-shuffle 8); y-axis: accuracy on correctly classified images, from 0 to 100. Curves: PGD AT with l∞ norm, PGD AT with l2 norm, FGSM AT, standard training, underfitting.]

Figure 7. “Accuracy on correctly classified images” for different models on patch-shuffled Caltech-256 and Tiny ImageNet with different splitting numbers. Detailed results are listed in the appendix.
When the saturation level is decreased, all models suffer a similar degree of performance degradation, indicating that AT-CNNs are not robust to all kinds of image distortions. They tend to be more robust only to certain types of distortions. We leave further investigation of this issue as future work.
4.2.3. PATCH-SHUFFLING
Stylizing and saturation operations aim at changing or removing the texture information of the original images while preserving the features of shapes and edges. To test the different biases of AT-CNNs and standard CNNs the other way around, we shatter the shape and edge information by splitting each image into k × k patches and then randomly shuffling them. This operation still maintains the local textures if k is not too large.
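A minimal sketch of the patch-shuffling operation described above, assuming an image stored as an (H, W, C) NumPy array whose height and width are divisible by k. The function name `patch_shuffle` and the `rng` parameter are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def patch_shuffle(image, k, rng=None):
    """Split an (H, W, C) image into k x k patches and randomly permute them."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    if h % k or w % k:
        raise ValueError("image height and width must be divisible by k")
    ph, pw = h // k, w // k
    # (k, ph, k, pw, C) -> (k, k, ph, pw, C) -> (k*k, ph, pw, C): one entry per patch
    patches = image.reshape(k, ph, k, pw, c).swapaxes(1, 2).reshape(k * k, ph, pw, c)
    patches = patches[rng.permutation(k * k)]
    # Reassemble the shuffled patches into an (H, W, C) image
    return patches.reshape(k, k, ph, pw, c).swapaxes(1, 2).reshape(h, w, c)

img = np.arange(16).reshape(4, 4, 1)
shuffled = patch_shuffle(img, k=2, rng=np.random.default_rng(0))
```

Pixels inside each patch keep their relative positions, so local texture survives while the global layout is scrambled.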
Figure 6 shows one example of patch-shuffled images under different numbers of splits. The first row shows the probabilities assigned by different models to the ground-truth class of the original image. After random shuffling, the shape and edge features are destroyed dramatically: the prediction probability of the adversarially trained CNNs drops significantly, while the normal CNNs still maintain high confidence in the ground-truth class. This reveals that AT-CNNs are more biased towards shapes and edges than normally trained ones.
Moreover, Figure 7 depicts the “accuracy on correctly classified images” for all the models, measured on the “Patch-shuffled” test sets with an increasing number of splits. AT-CNNs, especially those trained against a stronger attack, are more sensitive to “Patch-shuffling” operations in most of our experiments.
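The metric quoted here is not redefined in this part of the paper. As a hedged sketch, assume it means the accuracy on transformed images computed only over those test images that the same model classifies correctly in their clean form, which would make every model score 100 on the clean set. The function name and arguments below are hypothetical.

```python
import numpy as np

def accuracy_on_correctly_classified(clean_pred, transformed_pred, labels):
    """Accuracy (%) on transformed images, restricted to images whose clean
    version the model already classifies correctly (assumed definition)."""
    clean_pred = np.asarray(clean_pred)
    transformed_pred = np.asarray(transformed_pred)
    labels = np.asarray(labels)
    kept = clean_pred == labels            # images classified correctly when clean
    if not kept.any():
        return float("nan")                # metric undefined without any kept image
    return 100.0 * float(np.mean(transformed_pred[kept] == labels[kept]))

labels = [0, 1, 2, 3]
clean_pred = [0, 1, 2, 0]          # the last image is misclassified when clean
transformed_pred = [0, 1, 0, 3]    # predictions after, e.g., patch-shuffling
score = accuracy_on_correctly_classified(clean_pred, transformed_pred, labels)
```

Under this reading, the metric separates the effect of the transformation from differences in clean accuracy between models.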
Note that under the “Patch-shuffle 8” operation, all models have similar “accuracy on correctly classified images”, which is largely due to the severe information loss. Also note that this accuracy for all models on Tiny ImageNet, shown in Figure 7(b), is much lower than that on Caltech-256 in Figure 7(a). For example, under “Patch-shuffle 2”, the normally trained CNN has an accuracy of 84.76% on Caltech-256 but only 66.73% on Tiny ImageNet. This mainly stems from the limited resolution of Tiny ImageNet, since the “Patch-Shuffle” operation destroys more useful features on low-resolution images than on higher-resolution ones.
5. Related work and discussion

Interpreting AT-CNNs. Recently, some relevant findings have indicated that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-ε adversarial examples against AT-CNNs, it can be observed that the adversarial examples capture salient data characteristics of a different class and appear semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we have conducted a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformations that either largely change the textures while preserving shape information (i.e., stylizing and saturation) or shatter the shape and edge features while keeping the local textures (i.e., patch-shuffling). Evaluating the generalization of AT-CNNs on these designed datasets provides a quantitative way to verify and interpret their strong shape bias compared with normal CNNs.
Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness to adversarial examples is correlated with the capability of capturing long-range features such as shapes or contours. This naturally raises the question of whether other models that capture more global features, or that have more texture invariance, could be more robust to adversarial examples even without adversarial training. This might provide some insights for designing new network architectures or new strategies for enhancing the bias towards long-range features. Some recent works partially answer this question. Xie et al. (2018) enhanced standard CNNs with non-local blocks inspired by Wang et al. (2018) and Vaswani et al. (2017), which capture long-range dependencies in a data-dependent manner; when combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features. With this design, they achieved improved black-box robustness.
Adversarial training with other types of attacks. In this work, we mainly interpret AT-CNNs based on norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may depend heavily on the type of adversary. Models trained against a spatially transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have robustness to PGD attacks similar to that of standard models, and their salience maps are still quite different, as shown in Figure 8. Also, the average distance between salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks, x_adv = G(x; w), parameterized by w, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs based on these generalized types of attacks as future work.
Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models. From left to right: original image, sensitivity maps of the standard CNN, the PGD-l∞ AT-CNN, and the ST-AT-CNN.
6. Conclusion

From both qualitative and quantitative perspectives, we have conducted a systematic study on interpreting adversarially trained convolutional neural networks. By constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized, and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two types of models on the three constructed datasets: stylized, saturated, and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while normally trained CNNs behave the other way around.

Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.
Acknowledgement

This work is supported by National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI), and Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.
References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.

Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

Ballester, P. and de Araujo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.

Brendel, W. and Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.

Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.

Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.

Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.

Jo, J. and Bengio, Y. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.

Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.

Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.

Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.

Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.

Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.

Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.

Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.

Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
A. Experiment Setup

A.1. Models

• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units.

• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture using the code from PyTorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them using the ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch.

We evaluate the robustness of all our models using an l∞ projected gradient descent adversary with ε = 8/255, step size = 2, and 40 iterations.
A.2. Adversarial Training

We perform 9 types of adversarial training on each dataset. Seven of the 9 are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).

A.2.1. TRAIN AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY

We list the values of ε used for adversarial training for each dataset and lp-norm. In all settings, PGD runs for 20 iterations.

• l∞-norm bounded adversary. For all three datasets, pixel values range over [0, 1]. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf 1, 2, 4, 8, respectively, with step sizes of 1/255, 1/255, 2/255, 4/255.

• l2-norm bounded adversary. For Caltech-256 & Tiny ImageNet, where the input size for our model is 224 × 224, we train three adversarially trained CNNs with ε ∈ {4, 8, 12}; these three models are denoted as PGD-l2 4, 8, 12, respectively. The step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and have the same step sizes as for Caltech-256 & Tiny ImageNet.
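The l∞-bounded PGD adversary described in this section can be sketched as follows. This is a minimal illustration and not the authors' implementation: it starts from the clean input (no random start), and it uses a toy logistic-regression "model" with an analytic input gradient as a stand-in for a ResNet-18. The names `pgd_linf`, `loss_grad`, and the weights `w` are hypothetical; ε = 8/255, step size 4/255, and 20 iterations follow the PGD-inf 8 setting above.

```python
import numpy as np

def pgd_linf(x, y, loss_grad, eps, step_size, n_iter):
    """l_inf PGD: signed-gradient ascent steps, each projected onto the eps-ball around x."""
    x_adv = x.copy()                                   # start from the clean input
    for _ in range(n_iter):
        x_adv = x_adv + step_size * np.sign(loss_grad(x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)       # project onto the l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)               # keep pixel values in [0, 1]
    return x_adv

# Toy stand-in for a network: logistic regression with hypothetical weights w.
w = np.array([1.0, -2.0, 0.5])

def loss_grad(x, y):
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * w

x = np.array([0.2, 0.7, 0.5])
x_adv = pgd_linf(x, y=1.0, loss_grad=loss_grad, eps=8 / 255, step_size=4 / 255, n_iter=20)
```

In adversarial training, each training batch would be replaced by such perturbed inputs before the usual parameter update; an l2-bounded variant would normalize the gradient and project onto an l2 ball instead.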
A.2.2. TRAIN AGAINST A FGSM ADVERSARY

The values of ε for these two adversarially trained CNNs are ε ∈ {4, 8}, and they are denoted as FGSM 4 and FGSM 8, respectively.
B. Style-transferred test set

Following Geirhos et al. (2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, using the style of a randomly selected painting from Kaggle's Painter by Numbers dataset (see footnote 3). We used the source code provided by Geirhos et al. (2019).
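For intuition, the feature-space operation at the core of AdaIN, and the role of the stylization coefficient α, can be sketched as below. This is only an illustration under stated assumptions: the real pipeline applies the operation to features from a pretrained encoder and maps the result back to an image with a trained decoder, both omitted here, and the names `adain` and `stylize_features` are hypothetical.

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Match the per-channel mean and std of content features (C, H, W) to the style features."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps   # eps avoids division by zero
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    return s_std * (content_feat - c_mean) / c_std + s_mean

def stylize_features(content_feat, style_feat, alpha=1.0):
    """Interpolate between the content features (alpha = 0) and the AdaIN output (alpha = 1)."""
    return alpha * adain(content_feat, style_feat) + (1.0 - alpha) * content_feat

rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(3, 8, 8))   # stand-in for encoder features
style = rng.normal(2.0, 3.0, size=(3, 8, 8))
stylized = stylize_features(content, style, alpha=1.0)
```

With α = 1.0 the channel statistics are fully replaced by those of the style image, which is what removes the original texture cues while the spatial arrangement of the content features is kept.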
C. Experiments on Fourier-filtered datasets

Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.

C.1. Fourier filtering setup

Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set:

• The low frequency filtered version. We use a radial mask in the Fourier domain to set higher frequency modes to zero (low-pass filtering).

• The high frequency filtered version. We use a radial mask in the Fourier domain to preserve only the higher frequency modes (high-pass filtering).

• The random filtered version. We use a random mask in the Fourier domain to set each mode to 0 with probability p, uniformly. The random mask is generated on the fly during the test.
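The three filters listed above can be sketched with NumPy's FFT routines. This is a hedged illustration for a single-channel image; the mask radius, the probability p, and the choice to keep only the real part of the inverse transform are assumptions, since the section does not specify them. The function name `fourier_filter` is hypothetical.

```python
import numpy as np

def fourier_filter(image, mode, radius=None, p=None, rng=None):
    """Filter a 2-D grayscale image (H, W) in the Fourier domain."""
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))    # zero frequency moved to the centre
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    if mode == "low":
        mask = dist <= radius                          # keep only low-frequency modes
    elif mode == "high":
        mask = dist > radius                           # keep only high-frequency modes
    elif mode == "random":
        rng = np.random.default_rng() if rng is None else rng
        mask = rng.random((h, w)) >= p                 # each mode is zeroed with probability p
    else:
        raise ValueError(f"unknown mode: {mode}")
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real                               # assumption: discard any imaginary part

rng = np.random.default_rng(0)
img = rng.random((8, 8))
low = fourier_filter(img, "low", radius=2)
high = fourier_filter(img, "high", radius=2)
noisy = fourier_filter(img, "random", p=0.5, rng=rng)
```

Because the low-pass and high-pass masks are complementary for the same radius, the two filtered images sum back to the original, which is a convenient sanity check.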
C.2. Results

We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets derived from Caltech-256; the results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. These results indicate that AT-CNNs base their predictions more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are more low-frequency.
Table 3. “Accuracy on correctly classified images” (%) for different models on three Fourier-filtered Caltech-256 test sets.

DATA SET    LOW FREQUENCY FILTERED VERSION   HIGH FREQUENCY FILTERED VERSION   RANDOM FILTERED VERSION
STANDARD    15.8                             16.5                              73.5
UNDERFIT    14.5                             17.6                              62.2
PGD-l∞      71.1                             3.6                               73.4

D. Detailed results

We list the detailed results of our quantitative experiments here. Tables 4, 5, and 6 show the results of each model on test sets with different saturation levels. Tables 7 and 8 list the results of each model on test sets after different patch-shuffling operations.

3 https://www.kaggle.com/c/painter-by-numbers
E. Additional Figures

We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps generated using Grad and SmoothGrad in Figure 10.
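To make the difference between the two methods concrete, the sketch below computes a plain gradient map (Grad) and a SmoothGrad map, which averages gradients over Gaussian-perturbed copies of the input (Smilkov et al., 2017). The toy score function, the noise level, and the sample count are assumptions chosen only for illustration; `smoothgrad` and `grad_fn` are hypothetical names.

```python
import numpy as np

def smoothgrad(x, grad_fn, noise_std, n_samples=50, rng=None):
    """Average the input gradient over n_samples Gaussian-perturbed copies of x."""
    rng = np.random.default_rng() if rng is None else rng
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        noisy_x = x + rng.normal(0.0, noise_std, size=x.shape)
        total += grad_fn(noisy_x)
    return total / n_samples

def grad_fn(x):
    """Toy class score S(x) = sum(x**2), whose input gradient is 2x."""
    return 2.0 * x

x = np.array([[0.1, 0.5], [0.9, 0.3]])
grad_map = grad_fn(x)                                            # Grad
smooth_map = smoothgrad(x, grad_fn, noise_std=0.1, n_samples=50,
                        rng=np.random.default_rng(0))            # SmoothGrad
```

For a real network, `grad_fn` would return the gradient of the class score with respect to the input image, typically obtained by automatic differentiation.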
Table 4. “Accuracy on correctly classified images” (%) for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
UNDERFIT           31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞ 8           32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞ 4           31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞ 2           32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞ 1           32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2 12          31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2 8           32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2 4           32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM 8             29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM 4             32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78
Table 5. “Accuracy on correctly classified images” (%) for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           7.24    25.88   72.52   72.73   25.38   8.24    2.62    1.93
UNDERFIT           7.34    25.44   69.80   60.67   18.01   6.72    3.16    2.65
PGD-l∞ 8           11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-l∞ 4           12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-l∞ 2           12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-l∞ 1           11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-l2 12          11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-l2 8           12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-l2 4           12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM 8             12.59   32.66   70.55   81.53   41.83   17.52   7.29    5.82
FGSM 4             12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14
Table 6. “Accuracy on correctly classified images” (%) for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
UNDERFIT           21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-l∞ 8           26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-l∞ 4           27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-l∞ 2           28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-l∞ 1           27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-l2 12          25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-l2 8           27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-l2 4           27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM 8             25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM 4             26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82
Table 7. “Accuracy on correctly classified images” (%) for different models on the Patch-shuffled Caltech-256 test set. The results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Caltech-256.

DATA SET     2 × 2   4 × 4   8 × 8
STANDARD     84.76   51.50   10.84
UNDERFIT     75.59   33.41   6.03
PGD-l∞ 8     58.13   20.14   7.70
PGD-l∞ 4     68.54   26.45   8.18
PGD-l∞ 2     74.25   30.77   9.00
PGD-l∞ 1     78.11   35.03   8.42
PGD-l2 12    58.25   21.03   7.85
PGD-l2 8     63.36   22.19   8.48
PGD-l2 4     69.65   28.21   7.72
FGSM 8       64.48   22.94   8.07
FGSM 4       70.50   28.41   6.03
Table 8. “Accuracy on correctly classified images” (%) for different models on the Patch-shuffled Tiny ImageNet test set. The results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Tiny ImageNet.

DATA SET     2 × 2   4 × 4   8 × 8
STANDARD     66.73   24.87   4.48
UNDERFIT     59.22   23.62   4.38
PGD-l∞ 8     41.08   16.05   6.83
PGD-l∞ 4     49.54   18.23   6.30
PGD-l∞ 2     55.96   19.95   5.61
PGD-l∞ 1     60.19   23.24   6.08
PGD-l2 12    42.23   16.95   7.66
PGD-l2 8     47.67   16.28   6.50
PGD-l2 4     51.94   17.79   5.89
FGSM 8       57.42   20.70   4.73
FGSM 4       50.68   16.84   5.98
Figure 9 Visualization of Salience maps generated from SmoothGrad (Smilkov et al 2017) for all 11 models From left to rightStandard CNNs underfitting CNNs PGD-inf 8 4 2 1 PGD-L2 12 8 4 and FGSM 8 4
Interpreting Adversarially Trained Convolutional Neural Networks
Figure 10 Visualization of Salience maps generated from Grad for all 11 models From left to right Standard CNNs underfitting CNNsPGD-inf 8 4 2 1 PGD-L2 12 8 4 and FGSM 8 4 Itrsquos easily observed that sensitivity maps generated from Grad are more noisycompared with its smoothed variant SmoothGrad especially for Standard CNNs and underfitting CNNs
adversarial attacks In International Conference on Learn-ing Representations 2018
Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017
Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018
Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017
Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015
Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017
Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013
Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018
Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017
Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018
Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017
Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018
Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017
Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018
Interpreting Adversarially Trained Convolutional Neural Networks
Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018
Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018
Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014
Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a
Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b
Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016
Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017
Figure 6. Visualization of the patch-shuffling transformation. Panels: (a) Original Image, (b) Patch-Shuffle 2, (c) Patch-Shuffle 4, (d) Patch-Shuffle 8. The first row shows the probability of "cake" assigned by different models (Normal, Underfit, PGD-inf 8, PGD-L2 12, FGSM 8); the vertical axis ranges from 0.0 to 1.0.
Figure 7. "Accuracy on correctly classified images" for different models on patch-shuffled Tiny ImageNet and Caltech-256 with different splitting numbers. Panels: (a) Caltech-256, (b) Tiny ImageNet. Horizontal axis: clean, patch-shuffle 2, patch-shuffle 4, patch-shuffle 8. Vertical axis: accuracy on correctly classified images (0 to 100). Legend: PGD AT with l∞ norm, PGD AT with l2 norm, FGSM AT, standard training, underfitting. Detailed results are listed in the appendix.
When decreasing the saturation level, all models show a similar degree of performance degradation, indicating that AT-CNNs are not robust to all kinds of image distortions. They tend to be more robust only to certain fixed types of distortion. We leave further investigation of this issue to future work.
4.2.3. PATCH-SHUFFLING
Stylizing and saturation aim at changing or removing the texture information of the original images while preserving the shape and edge features. To test the different biases of AT-CNNs and standard CNNs the other way around, we shatter the shape and edge information by splitting each image into k × k patches and then randomly shuffling them. This operation still maintains the local textures if k is not too large.
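The patch-shuffling operation can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `patch_shuffle`, the H × W × C array layout, and the requirement that the image sides be divisible by k are choices made here, not details taken from the paper.

```python
import numpy as np

def patch_shuffle(image, k, rng=None):
    """Split an H x W x C image into k x k patches and randomly permute them."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h % k or w % k:
        # Assumption of this sketch: sides must divide evenly into k patches.
        raise ValueError("image height and width must be divisible by k")
    ph, pw = h // k, w // k
    # Cut the image into a flat list of k*k patches in row-major order.
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(k) for j in range(k)]
    order = rng.permutation(k * k)
    # Reassemble the permuted patches row by row.
    rows = [np.concatenate([patches[order[i * k + j]] for j in range(k)], axis=1)
            for i in range(k)]
    return np.concatenate(rows, axis=0)
```

Each patch keeps its pixels intact, so local texture statistics survive while the global arrangement (shape, contours) is scrambled; with k = 1 the image is returned unchanged.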
Figure 6 shows one example of patch-shuffled images under different numbers of splits. The first row shows the probabilities assigned by different models to the ground-truth class of the original image. After random shuffling, the shape and edge features are destroyed dramatically: the prediction probability of the adversarially trained CNNs drops significantly, while the normal CNNs still maintain high confidence in the ground-truth class. This reveals that AT-CNNs are more biased towards shapes and edges than normally trained ones.
Moreover, Figure 7 depicts the "accuracy on correctly classified images" for all the models, measured on the patch-shuffled test sets with an increasing number of splitting pieces. AT-CNNs, especially those trained against a stronger attack, are more sensitive to patch-shuffling operations in most of our experiments.
Note that under the "Patch-shuffle 8" operation, all models have similar "accuracy on correctly classified images", which is largely due to the severe information loss. Also note that this accuracy of all models on Tiny ImageNet, shown in Figure 7(b), is much lower than that on Caltech-256 in Figure 7(a). For example, under "Patch-shuffle 2", the normally trained CNN has an accuracy of 84.76% on Caltech-256 but only 66.73% on Tiny ImageNet. This mainly originates from the limited resolution of Tiny ImageNet, since the "Patch-Shuffle" operation destroys more useful features on low-resolution images than on higher-resolution ones.
5. Related work and discussion
Interpreting AT-CNNs. Recently, there have been some relevant findings indicating that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-ε adversarial examples against AT-CNNs, it can be observed that the adversarial examples capture salient data characteristics of a different class and appear semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we have conducted a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformation that either largely change the textures while preserving shape information (i.e., stylizing and saturation) or shatter the shape/edge features while keeping the local textures (i.e., patch-shuffling). Evaluating the generalization of AT-CNNs over these designed datasets provides a quantitative way to verify and interpret their strong shape bias compared with normal CNNs.
Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness towards adversarial examples is correlated with the capability of capturing long-range features such as shapes or contours. This naturally raises the question of whether other models that capture more global features, or that have more texture invariance, could be more robust to adversarial examples even without adversarial training. This might provide some insights for designing new network architectures or new strategies for enhancing the bias towards long-range features. Some recent works turn out to partially answer this question. Xie et al. (2018) enhanced standard CNNs with non-local blocks inspired by Wang et al. (2018) and Vaswani et al. (2017), which capture long-range dependencies in a data-dependent manner; when combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features. With this design, they achieved improved black-box robustness.
Adversarial training with other types of attacks. In this work, we mainly interpret AT-CNNs based on norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may depend heavily on the type of adversary. Models trained against a spatially transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have robustness towards the PGD attack similar to that of standard models, and their salience maps are still quite different, as shown in Figure 8. Also, the average distance between salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks, x_adv = G(x; w), parameterized by w, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs based on these generalized types of attacks as future work.
Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models. From left to right: original image, sensitivity maps of standard CNN, PGD-l∞ AT-CNN, and ST-AT-CNN.
6. Conclusion
From both qualitative and quantitative perspectives, we have conducted a systematic study on interpreting adversarially trained convolutional neural networks. By constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized, and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two types of models on the three constructed datasets: stylized, saturated, and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while normally trained CNNs behave the other way around.
Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.
Acknowledgement
This work is supported by the National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI), and the Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.
References
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.

Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

Ballester, P. and de Araujo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.

Brendel, W. and Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.

Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.

Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.

Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.

Jo, J. and Bengio, Y. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.

Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.

Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.

Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.

Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.

Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.

Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.

Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.

Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
A. Experiment Setup
A.1. Models
• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units.
• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture using the code from PyTorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them using ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch.
We evaluate the robustness of all our models using an l∞ projected gradient descent adversary with ε = 8/255, step size = 2, and 40 iterations.
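As a rough illustration of such an l∞ PGD adversary, the sketch below attacks a toy linear softmax classifier whose input gradient is available in closed form. Several things here are assumptions rather than the paper's setup: the linear model (the paper uses ResNet-18), the helper names `softmax`, `loss_and_grad`, and `pgd_linf`, the reading of "step size = 2" as 2/255 on a [0, 1] pixel scale, and the omission of a random start.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss_and_grad(x, y, W, b):
    """Cross-entropy loss of a linear softmax model and its gradient w.r.t. x."""
    p = softmax(x @ W + b)
    n = x.shape[0]
    loss = -np.log(p[np.arange(n), y] + 1e-12)
    p[np.arange(n), y] -= 1.0               # dL/dlogits = p - onehot(y)
    return loss, p @ W.T                    # chain rule back to the input

def pgd_linf(x, y, W, b, eps=8 / 255, step=2 / 255, iters=40):
    """Iterated signed-gradient ascent, projected onto the l_inf ball around x."""
    x_adv = x.copy()                        # no random start in this sketch
    for _ in range(iters):
        _, g = loss_and_grad(x_adv, y, W, b)
        x_adv = x_adv + step * np.sign(g)           # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)            # keep a valid pixel range
    return x_adv
```

For a deep network the only change in structure would be obtaining the input gradient by automatic differentiation instead of the closed form used here.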
A.2. Adversarial Training
We perform 9 types of adversarial training on each dataset. Seven of the nine are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other two are against an FGSM adversary (Goodfellow et al., 2014).
A.2.1. TRAIN AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY
We list the value of ε for adversarial training on each dataset and lp-norm. In all settings, PGD runs for 20 iterations.
• l∞-norm bounded adversary: For all three datasets, pixel values range over [0, 1]. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf: 1, 2, 4, 8, respectively, with step sizes 1/255, 1/255, 2/255, 4/255.
• l2-norm bounded adversary: For Caltech-256 & Tiny ImageNet, where the input size for our model is 224 × 224, we train three adversarially trained CNNs with ε ∈ {4, 8, 12}; these three models are denoted as PGD-l2: 4, 8, 12, respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and have the same step sizes as on Caltech-256 & Tiny ImageNet.
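The l2-bounded case differs from the l∞ case only in the update and projection steps. The following is a minimal sketch of one such update for a batch of flattened inputs; the function name `l2_step_and_project`, the unit-norm gradient normalization, and the [0, 1] clipping are common conventions assumed here, not details confirmed by the text.

```python
import numpy as np

def l2_step_and_project(x_adv, x, grad, eps, step):
    """One PGD update under an l2 constraint for inputs of shape (n, d)."""
    # Move along the gradient direction, normalized to unit l2 norm per example.
    g_norm = np.linalg.norm(grad, axis=1, keepdims=True)
    x_adv = x_adv + step * grad / np.maximum(g_norm, 1e-12)
    # Project the perturbation back onto the l2 ball of radius eps around x.
    delta = x_adv - x
    d_norm = np.linalg.norm(delta, axis=1, keepdims=True)
    delta = delta * np.minimum(1.0, eps / np.maximum(d_norm, 1e-12))
    # Clipping to [0, 1] can only shrink the perturbation when x is in [0, 1].
    return np.clip(x + delta, 0.0, 1.0)
```

Because the l2 norm of a perturbation grows with the number of pixels, a fixed ε means a smaller per-pixel budget on 224 × 224 inputs than on 32 × 32 ones, which is one plausible reading of why the CIFAR-10 values of ε are scaled down.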
A.2.2. TRAIN AGAINST AN FGSM ADVERSARY
The two adversarially trained CNNs use ε ∈ {4, 8} and are denoted as FGSM: 4, 8, respectively.
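For reference, FGSM (Goodfellow et al., 2014) is a single signed-gradient step. In the sketch below, the function name `fgsm` is hypothetical, the loss gradient with respect to the input is assumed to be supplied by the caller, and ε is assumed to be expressed on the same [0, 1] pixel scale as the image (for example 4/255 or 8/255).

```python
import numpy as np

def fgsm(x, grad, eps):
    """Single-step l_inf attack: shift every pixel by eps along the gradient sign."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)
```

Pixels with a zero gradient are left unchanged, and the final clip keeps the result a valid image.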
B. Style-transferred test set
Following Geirhos et al. (2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, using the style of a randomly selected painting from Kaggle's Painter by Numbers dataset (see footnote 3). We used the source code provided by Geirhos et al. (2019).
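The core of AdaIN is a per-channel renormalization of content features to the mean and standard deviation of the style features, blended with the original features by the stylization coefficient α. The sketch below shows only this feature-space operation on arrays of shape (C, H, W); the encoder and decoder networks of the full style-transfer pipeline are not reproduced, and the function name `adain` and the small constant `eps` are choices made for this illustration.

```python
import numpy as np

def adain(content_feat, style_feat, alpha=1.0, eps=1e-5):
    """Adaptive instance normalization on feature maps of shape (C, H, W)."""
    # Per-channel statistics over the spatial dimensions.
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    # Whiten the content statistics, then re-color with the style statistics.
    stylized = s_std * (content_feat - c_mean) / c_std + s_mean
    # alpha = 1.0 is full stylization; alpha = 0.0 returns the content features.
    return alpha * stylized + (1.0 - alpha) * content_feat
```

With α = 1.0, the output features carry the style image's channel statistics entirely, which is the setting that most aggressively replaces texture while leaving spatial layout in place.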
C. Experiments on Fourier-filtered datasets
Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.
C.1. Fourier filtering setup
Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set:
• The low-frequency filtered version: We use a radial mask in the Fourier domain to set higher frequency modes to zero (low-pass filtering).
• The high-frequency filtered version: We use a radial mask in the Fourier domain to preserve only the higher frequency modes (high-pass filtering).
• The random filtered version: We use a random mask in the Fourier domain to set each mode to 0 with probability p uniformly. The random mask is generated on the fly during the test.
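The three filters listed above can be sketched with NumPy's FFT routines. The helper names, the `radius` cut-off parameter, and the choice to take the real part of the inverse transform (needed because a random mask is generally not conjugate-symmetric) are assumptions of this illustration; the text does not specify the mask radius or the value of p.

```python
import numpy as np

def radial_mask(h, w, radius):
    """Boolean mask, True for centered frequency modes within `radius` of DC."""
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    return dist <= radius

def fourier_filter(image, mask):
    """Apply a centered frequency mask to each channel of an H x W x C image."""
    spec = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    spec = spec * mask[..., None]            # zero out the masked modes
    out = np.fft.ifft2(np.fft.ifftshift(spec, axes=(0, 1)), axes=(0, 1))
    return out.real                          # discard any imaginary residue

def low_pass(image, radius):
    return fourier_filter(image, radial_mask(*image.shape[:2], radius))

def high_pass(image, radius):
    return fourier_filter(image, ~radial_mask(*image.shape[:2], radius))

def random_filter(image, p, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(image.shape[:2]) >= p  # each mode is zeroed with prob. p
    return fourier_filter(image, keep)
```

Since the low-pass and high-pass masks are complementary, the two filtered images sum back to the original, which is a convenient sanity check when choosing a radius.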
C.2. Results
We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that the predictions of AT-CNNs depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are more low-frequency.
Table 3. "Accuracy on correctly classified images" for different models on three Fourier-filtered Caltech-256 test sets.

DATA SET    LOW-FREQUENCY FILTERED    HIGH-FREQUENCY FILTERED    RANDOM FILTERED
STANDARD    15.8                      16.5                       73.5
UNDERFIT    14.5                      17.6                       62.2
PGD-l∞      71.1                      3.6                        73.4

D. Detailed results
We list the detailed results of our quantitative experiments here. Tables 4, 5, and 6 show the results of each model on test sets with different saturation levels. Tables 7 and 8 list the results of each model on test sets after different patch-shuffling operations.

3 https://www.kaggle.com/c/painter-by-numbers
E. Additional Figures
We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps generated with Grad and SmoothGrad in Figure 10.
Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
UNDERFIT           31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞: 8          32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞: 4          31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞: 2          32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞: 1          32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2: 12         31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2: 8          32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2: 4          32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM: 8            29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM: 4            32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78
Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           7.24    25.88   72.52   72.73   25.38   8.24    2.62    1.93
UNDERFIT           7.34    25.44   69.80   60.67   18.01   6.72    3.16    2.65
PGD-l∞: 8          11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-l∞: 4          12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-l∞: 2          12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-l∞: 1          11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-l2: 12         11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-l2: 8          12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-l2: 4          12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM: 8            12.59   32.66   70.55   81.53   41.83   17.52   7.29    5.82
FGSM: 4            12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14
Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
UNDERFIT           21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-l∞: 8          26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-l∞: 4          27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-l∞: 2          28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-l∞: 1          27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-l2: 12         25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-l2: 8          27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-l2: 4          27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM: 8            25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM: 4            26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82
Table 7. "Accuracy on correctly classified images" for different models on the patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Caltech-256.

DATA SET      2 × 2   4 × 4   8 × 8
STANDARD      84.76   51.50   10.84
UNDERFIT      75.59   33.41   6.03
PGD-l∞: 8     58.13   20.14   7.70
PGD-l∞: 4     68.54   26.45   8.18
PGD-l∞: 2     74.25   30.77   9.00
PGD-l∞: 1     78.11   35.03   8.42
PGD-l2: 12    58.25   21.03   7.85
PGD-l2: 8     63.36   22.19   8.48
PGD-l2: 4     69.65   28.21   7.72
FGSM: 8       64.48   22.94   8.07
FGSM: 4       70.50   28.41   6.03
Table 8. "Accuracy on correctly classified images" for different models on the patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Tiny ImageNet.

DATA SET      2 × 2   4 × 4   8 × 8
STANDARD      66.73   24.87   4.48
UNDERFIT      59.22   23.62   4.38
PGD-l∞: 8     41.08   16.05   6.83
PGD-l∞: 4     49.54   18.23   6.30
PGD-l∞: 2     55.96   19.95   5.61
PGD-l∞: 1     60.19   23.24   6.08
PGD-l2: 12    42.23   16.95   7.66
PGD-l2: 8     47.67   16.28   6.50
PGD-l2: 4     51.94   17.79   5.89
FGSM: 8       57.42   20.70   4.73
FGSM: 4       50.68   16.84   5.98
Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4.

Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4. It is easily observed that sensitivity maps generated from Grad are noisier than those from its smoothed variant SmoothGrad, especially for standard CNNs and underfitting CNNs.
Interpreting Adversarially Trained Convolutional Neural Networks
Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018
Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018
Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014
Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a
Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b
Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016
Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017
Interpreting Adversarially Trained Convolutional Neural Networks
of 84.76% on Caltech-256 while only 66.73% on Tiny ImageNet. This mainly stems from the limited resolution of Tiny ImageNet, since the "Patch-Shuffle" operation destroys more useful features on low-resolution images than on higher-resolution ones.
5. Related work and discussion

Interpreting AT-CNNs. Recently, there have been some relevant findings indicating that AT-CNNs learn fundamentally different feature representations than standard classifiers. Tsipras et al. (2018) showed that sensitivity maps of AT-CNNs in the input space align well with human perception. Additionally, by visualizing large-ε adversarial examples against AT-CNNs, it can be observed that the adversarial examples capture salient data characteristics of a different class and appear semantically similar to images of that class. Dong et al. (2017) leveraged adversarial training to produce a more interpretable representation by visualizing active neurons. Compared with Tsipras et al. (2018) and Dong et al. (2017), we have conducted a more systematic investigation for interpreting AT-CNNs. We construct three types of image transformation that either largely change the textures while preserving shape information (i.e., stylizing and saturation) or shatter the shape and edge features while keeping the local textures (i.e., patch-shuffling). Evaluating the generalization of AT-CNNs over these designed datasets provides a quantitative way to verify and interpret their strong shape bias compared with normal CNNs.
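As an illustration of the patch-shuffling transformation mentioned above, the following is a minimal NumPy sketch, not the authors' code. It assumes the image height and width are divisible by the grid size k; the function name `patch_shuffle` and its arguments are hypothetical.

```python
import numpy as np

def patch_shuffle(image, k, rng=None):
    """Split an (H, W[, C]) image into a k x k grid of patches and permute them.

    Hypothetical helper: assumes H and W are divisible by k.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h % k or w % k:
        raise ValueError("image height and width must be divisible by k")
    ph, pw = h // k, w // k
    # Collect the k*k patches in row-major order.
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(k) for j in range(k)]
    order = rng.permutation(k * k)
    # Rebuild each grid row from the permuted patches, then stack the rows.
    rows = [np.concatenate([patches[order[i * k + j]] for j in range(k)], axis=1)
            for i in range(k)]
    return np.concatenate(rows, axis=0)
```

Local textures inside each patch are untouched, while the global arrangement (shapes and contours spanning several patches) is destroyed; larger k breaks up shape information more aggressively.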
Insights for defending against adversarial examples. Based on our investigation of AT-CNNs, we find that robustness towards adversarial examples is correlated with the capability of capturing long-range features like shapes or contours. This naturally raises the question of whether other models that capture more global features, or that have more texture invariance, could be more robust to adversarial examples even without adversarial training. This might provide some insights for designing new network architectures or new strategies for enhancing the bias towards long-range features. Some recent works partially answer this question. Xie et al. (2018) enhanced standard CNNs with non-local blocks inspired by Wang et al. (2018) and Vaswani et al. (2017), which capture long-range dependencies in a data-dependent manner; when combined with adversarial training, their networks achieved state-of-the-art adversarial robustness on ImageNet. Luo et al. (2018) destroyed some of the local connections of standard CNNs by randomly selecting a set of neurons and removing them from the network before training, thus forcing the CNNs to focus less on local texture features. With this design, they achieved improved black-box robustness.
Adversarial training with other types of attacks. In this work, we mainly interpret AT-CNNs based on norm-constrained perturbations of the original images. It is worth noting that the difference between normally trained and adversarially trained CNNs may depend heavily on the type of adversary. Models trained against a spatially transformed adversary (Xiao et al., 2018), denoted as ST-AT-CNNs, have robustness towards the PGD attack similar to that of standard models, and their salience maps are still quite different, as shown in Figure 8. Also, the average distance between salience maps is close to that of the standard CNN, which is much higher than that of the PGD-AT-CNN. There exists a variety of generalized types of attacks, x_adv = G(x; w), parameterized by w, such as spatially transformed (Xiao et al., 2018) and GAN-based adversarial examples (Song et al., 2018). We leave interpreting AT-CNNs based on these generalized types of attacks as future work.
Figure 8. Sensitivity maps based on SmoothGrad (Smilkov et al., 2017) of three models. From left to right: original image, sensitivity maps of standard CNN, PGD-l∞ AT-CNN, and ST-AT-CNN.
6. Conclusion

From both qualitative and quantitative perspectives, we have carried out a systematic study on interpreting adversarially trained convolutional neural networks. By constructing distorted test sets that preserve either shapes or local textures, we compare the sensitivity maps of AT-CNNs and normal CNNs on clean, stylized, and saturated images, which visually demonstrates that AT-CNNs are more biased towards global structures such as shapes and edges. More importantly, we evaluate the generalization performance of the two models on the three constructed datasets: stylized, saturated, and patch-shuffled ones. The results clearly indicate that AT-CNNs are less sensitive to texture distortion and focus more on shape information, while normally trained CNNs behave the other way around.
Understanding what a model has learned is an essential topic in both machine learning and computer vision. The strategies we propose can also be extended to interpret other neural networks, such as models for object detection and semantic segmentation.
Acknowledgement

This work is supported by National Natural Science Foundation of China (No. 61806009), Beijing Natural Science Foundation (No. 4184090), Beijing Academy of Artificial Intelligence (BAAI), and Intelligent Manufacturing Action Plan of Industrial Solid Foundation Program (No. JCKY2018204C004). We also appreciate insightful discussions with Dinghuai Zhang and Dr. Lei Wu.
References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.

Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

Ballester, P. and de Araujo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.

Brendel, W. and Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.

Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.

Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.

Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.

Jo, J. and Bengio, Y. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.

Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.

Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.

Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.

Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.

Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.

Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.

Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.

Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
A. Experiment Setup

A.1. Models

• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units.

• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture using the code from pytorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them using ImageNet (Deng et al., 2009) pre-trained weights provided by pytorch.

We evaluate the robustness of all our models using an l∞ projected gradient descent adversary with ε = 8/255, step size = 2, and number of iterations as 40.
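To make the evaluation adversary concrete, here is a minimal NumPy sketch of an l∞ PGD attack on a toy logistic model with an analytic input gradient. It is an illustration under stated assumptions, not the authors' implementation: the step size is assumed to be 2/255 on the [0, 1] pixel scale, the random start is an assumption, and `pgd_linf`, `loss`, and `loss_grad` are hypothetical names.

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=8 / 255, step=2 / 255, iters=40, rng=None):
    """l_inf PGD: signed-gradient ascent on the loss, projected back to the eps-ball.

    grad_fn(x) must return the gradient of the loss with respect to the input.
    The random start and the 2/255 step size are assumptions of this sketch.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(grad_fn(x_adv))  # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)        # project onto the l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                # keep a valid pixel range
    return x_adv

# Toy differentiable "model": logistic loss of a fixed linear classifier.
w = np.array([1.0, -2.0, 0.5, 3.0])
y = 1.0

def loss(x):
    return np.log1p(np.exp(-y * (w @ x)))

def loss_grad(x):
    return -y * w / (1.0 + np.exp(y * (w @ x)))

x_clean = np.full(4, 0.5)
x_attacked = pgd_linf(x_clean, loss_grad)
```

With a real network, `grad_fn` would be replaced by a backward pass through the model; the projection and clipping steps stay the same.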
A.2. Adversarial Training

We perform 9 types of adversarial training on each of the datasets. Seven of the 9 kinds of adversarial training are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).
A.2.1. TRAIN AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY

We list the value of ε for adversarial training for each dataset and lp-norm. In all settings, PGD runs 20 iterations.

• l∞-norm bounded adversary: For all three datasets, pixel values range over [0, 1]. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf: 1, 2, 4, 8, respectively, with step sizes 1/255, 1/255, 2/255, 4/255.

• l2-norm bounded adversary: For Caltech-256 & Tiny ImageNet, where the input size for our model is 224 × 224, we train three adversarially trained CNNs with ε ∈ {4, 8, 12}; these three models are denoted as PGD-l2: 4, 8, 12, respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and have the same step sizes as on Caltech-256 & Tiny ImageNet.
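The l2-bounded case differs from the l∞ case in the update direction and the projection. The sketch below shows one common formulation (normalized-gradient step, then rescaling onto the l2 ball); it is an assumption that this matches the authors' training code, and the toy loss, `project_l2`, and `pgd_l2` are hypothetical names with toy values for ε and the step size.

```python
import numpy as np

def project_l2(delta, eps):
    """Rescale a perturbation so that its l2 norm is at most eps."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

def pgd_l2(x, grad_fn, eps, step, iters=20):
    """l2 PGD: step along the normalized gradient, then project onto the eps-ball."""
    x_adv = x.astype(float).copy()
    for _ in range(iters):
        g = grad_fn(x_adv)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            break  # no ascent direction available
        x_adv = x_adv + step * g / g_norm
        x_adv = x + project_l2(x_adv - x, eps)
        # Clipping to [0, 1] cannot increase the distance to x, since x is in [0, 1].
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

# Toy loss: squared distance to a fixed target; ascent pushes x away from it.
target = np.full(4, 0.2)

def toy_loss(x):
    return 0.5 * np.sum((x - target) ** 2)

def toy_grad(x):
    return x - target

x0 = np.full(4, 0.5)
x_l2 = pgd_l2(x0, toy_grad, eps=0.4, step=0.1)
```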
A.2.2. TRAIN AGAINST A FGSM ADVERSARY

The values of ε for these two adversarially trained CNNs are ε ∈ {4, 8}, and they are denoted as FGSM: 4, 8, respectively.
B. Style-transferred test set

Following Geirhos et al. (2019), we construct a stylized test set for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, using the style of a randomly selected painting from Kaggle's Painter by Numbers dataset (footnote 3). We used the source code provided by Geirhos et al. (2019).
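For readers unfamiliar with AdaIN, the core operation and the role of the stylization coefficient α can be sketched in NumPy as follows. This shows only the feature-space step of Huang & Belongie (2017); the pretrained encoder and decoder that surround it are omitted, and the function names and the `eps` stabilizer are assumptions of this sketch.

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization on feature maps of shape (C, H, W).

    Shifts and scales each content channel so that its mean and standard
    deviation match those of the corresponding style channel.
    """
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean

def stylize_features(content_feat, style_feat, alpha=1.0):
    """Interpolate between content features (alpha=0) and AdaIN output (alpha=1)."""
    return alpha * adain(content_feat, style_feat) + (1.0 - alpha) * content_feat
```

With α = 1.0 the decoder receives the fully restyled features, which is the setting that replaces textures most aggressively while leaving the spatial layout of the content features intact.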
C. Experiments on Fourier-filtered datasets

Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.

C.1. Fourier filtering setup

Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set.

• The low frequency filtered version: We use a radial mask in the Fourier domain to set higher frequency modes to zero (low-pass filtering).

• The high frequency filtered version: We use a radial mask in the Fourier domain to preserve only the higher frequency modes (high-pass filtering).

• The random filtered version: We use a random mask in the Fourier domain to set each mode to 0 with probability p uniformly. The random mask is generated on the fly during the test.
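The three filters above can be sketched with NumPy's FFT routines as follows. This is an illustrative sketch for a single-channel image, not the authors' code: the mask radius, the choice to keep only the real part after the inverse transform, and all function names are assumptions.

```python
import numpy as np

def radial_mask(h, w, radius):
    """Boolean mask that is True within `radius` of the centred zero frequency."""
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    return dist <= radius

def fourier_filter(image, mask):
    """Multiply the centred 2-D spectrum of a grayscale image by `mask`."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return np.real(filtered)  # assumption: discard any residual imaginary part

def low_pass(image, radius):
    """Keep only modes inside the radial mask."""
    return fourier_filter(image, radial_mask(*image.shape, radius))

def high_pass(image, radius):
    """Keep only modes outside the radial mask."""
    return fourier_filter(image, ~radial_mask(*image.shape, radius))

def random_filter(image, p, rng):
    """Zero each Fourier mode independently with probability p."""
    keep = rng.random(image.shape) >= p
    return fourier_filter(image, keep)
```

For colour images, the same functions could be applied to each channel separately. Because the low-pass and high-pass masks are complementary, the two filtered images sum back to the original.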
C.2. Results

We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that the predictions of AT-CNNs depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are more low-frequency.
Table 3. "Accuracy on correctly classified images" for different models on three Fourier-filtered Caltech-256 test sets.

Model       Low-frequency filtered   High-frequency filtered   Random filtered
Standard    15.8                     16.5                      73.5
Underfit    14.5                     17.6                      62.2
PGD-l∞      71.1                     3.6                       73.4

D. Detailed results

We list the detailed results for our quantitative experiments here. Tables 4, 5, and 6 show the results of each model on test sets with different saturation levels. Tables 7 and 8 list the results of each model on test sets after different patch-shuffling operations.

3 https://www.kaggle.com/c/painter-by-numbers
E. Additional Figures

We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps obtained using Grad and SmoothGrad in Figure 10.
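The difference between Grad and SmoothGrad can be summarized in a few lines: Grad is the input gradient of the class score, while SmoothGrad (Smilkov et al., 2017) averages that gradient over Gaussian-perturbed copies of the input. The sketch below uses a toy score with an analytic gradient; the sample count, noise level, and function names are assumptions, not the settings used for the figures.

```python
import numpy as np

def smoothgrad(x, grad_fn, n_samples=50, noise_std=0.1, rng=None):
    """Average grad_fn over n_samples noisy copies of x (SmoothGrad).

    grad_fn(x) must return the gradient of the class score w.r.t. the input.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, noise_std, size=x.shape)
        total += grad_fn(noisy)
    return total / n_samples

# Toy class score s(x) = sum(w * x**2) with analytic gradient 2 * w * x.
w_toy = np.array([1.0, -2.0, 0.5, 3.0])

def score_grad(x):
    return 2.0 * w_toy * x

x_in = np.array([0.2, 0.4, 0.6, 0.8])
grad_map = score_grad(x_in)               # plain "Grad" sensitivity map
smooth_map = smoothgrad(x_in, score_grad)  # SmoothGrad sensitivity map
```

For a deep network, `grad_fn` would be a backward pass for the predicted class; averaging over noisy inputs suppresses the high-frequency fluctuations of the raw gradient.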
Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

Saturation level   0.25    0.5     1       4       8       16      64      1024
Standard           28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
Underfit           31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞: 8          32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞: 4          31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞: 2          32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞: 1          32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2: 12         31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2: 8          32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2: 4          32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM: 8            29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM: 4            32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78

Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

Saturation level   0.25    0.5     1       4       8       16      64      1024
Standard           7.24    25.88   72.52   72.73   25.38   8.24    2.62    1.93
Underfit           7.34    25.44   69.80   60.67   18.01   6.72    3.16    2.65
PGD-l∞: 8          11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-l∞: 4          12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-l∞: 2          12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-l∞: 1          11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-l2: 12         11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-l2: 8          12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-l2: 4          12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM: 8            12.59   32.66   70.55   81.53   41.83   17.52   7.29    5.82
FGSM: 4            12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14

Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

Saturation level   0.25    0.5     1       4       8       16      64      1024
Standard           27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
Underfit           21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-l∞: 8          26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-l∞: 4          27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-l∞: 2          28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-l∞: 1          27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-l2: 12         25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-l2: 8          27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-l2: 4          27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM: 8            25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM: 4            26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82
Table 7. "Accuracy on correctly classified images" for different models on the Patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Caltech-256.

Model        2 × 2   4 × 4   8 × 8
Standard     84.76   51.50   10.84
Underfit     75.59   33.41   6.03
PGD-l∞: 8    58.13   20.14   7.70
PGD-l∞: 4    68.54   26.45   8.18
PGD-l∞: 2    74.25   30.77   9.00
PGD-l∞: 1    78.11   35.03   8.42
PGD-l2: 12   58.25   21.03   7.85
PGD-l2: 8    63.36   22.19   8.48
PGD-l2: 4    69.65   28.21   7.72
FGSM: 8      64.48   22.94   8.07
FGSM: 4      70.50   28.41   6.03

Table 8. "Accuracy on correctly classified images" for different models on the Patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Tiny ImageNet.

Model        2 × 2   4 × 4   8 × 8
Standard     66.73   24.87   4.48
Underfit     59.22   23.62   4.38
PGD-l∞: 8    41.08   16.05   6.83
PGD-l∞: 4    49.54   18.23   6.30
PGD-l∞: 2    55.96   19.95   5.61
PGD-l∞: 1    60.19   23.24   6.08
PGD-l2: 12   42.23   16.95   7.66
PGD-l2: 8    47.67   16.28   6.50
PGD-l2: 4    51.94   17.79   5.89
FGSM: 8      57.42   20.70   4.73
FGSM: 4      50.68   16.84   5.98
Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4.

Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4. It's easily observed that sensitivity maps generated from Grad are noisier than those from its smoothed variant SmoothGrad, especially for standard CNNs and underfitting CNNs.
adversarial attacks In International Conference on Learn-ing Representations 2018
Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017
Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018
Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017
Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015
Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017
Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013
Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018
Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017
Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018
Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017
Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018
Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017
Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018
Interpreting Adversarially Trained Convolutional Neural Networks
Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018
Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018
Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014
Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a
Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b
Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016
Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017
Interpreting Adversarially Trained Convolutional Neural Networks
ing Action Plan of Industrial Solid Foundation Program(NoJCKY2018204C004) We also appreciate insightfuldiscussions with Dinghuai Zhang and Dr Lei Wu
References
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.
Ancona, M., Ceolini, E., Oztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
Ballester, P. and de Araujo, R. M. On the performance of GoogLeNet and AlexNet applied to sketches. In AAAI, pp. 1124–1128, 2016.
Brendel, W. and Bethge, M. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In International Conference on Learning Representations, 2019.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.
Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, 2014.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.
Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.
Jo, J. and Bengio, Y. Measuring the tendency of CNNs to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.
Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.
Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.
Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
A. Experiment Setup
A.1. Models

• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture. It has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units.
• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture using the code from PyTorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them using ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch.

We evaluate the robustness of all our models using an l∞ projected gradient descent adversary with ε = 8/255, step size = 2, and 40 iterations.
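As an illustration of the evaluation adversary described above, the following is a minimal NumPy sketch of an untargeted l∞ PGD attack, not the authors' implementation. Several pieces are assumptions made for the example: the stated step size of 2 is read as 2/255 on the [0, 1] pixel scale, the attack starts from a random point inside the ε-ball, and a toy linear scorer with a logistic loss (`w`, `loss`, `loss_grad`) stands in for the network.

```python
import numpy as np

def pgd_linf(x, y, grad_fn, eps=8 / 255, step=2 / 255, iters=40, seed=0):
    """Untargeted l_inf PGD: ascend the loss, then project back into the eps-ball."""
    rng = np.random.default_rng(seed)
    # Random start inside the eps-ball (assumed; the source does not specify it).
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(grad_fn(x_adv, y))  # signed-gradient ascent
        x_adv = np.clip(x_adv, x - eps, x + eps)           # project onto the l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                   # stay in the pixel range
    return x_adv

# Hypothetical stand-in for a network: linear scorer, label y in {-1, +1}.
w = np.linspace(-1.0, 1.0, 16)

def loss(x, y):
    return np.log1p(np.exp(-y * (w @ x)))

def loss_grad(x, y):
    # Gradient of the logistic loss with respect to the input x.
    return -y * w / (1.0 + np.exp(y * (w @ x)))

x_clean = np.full(16, 0.5)
x_adv = pgd_linf(x_clean, 1, loss_grad)
```

With a real model, `grad_fn` would be replaced by the gradient of the classification loss with respect to the input, obtained by automatic differentiation.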
A.2. Adversarial Training

We perform 9 types of adversarial training on each dataset. 7 of the 9 are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).

A.2.1. TRAIN AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY

We list the value of ε for adversarial training on each dataset and lp-norm. In all settings, PGD runs 20 iterations.

• l∞-norm bounded adversary: For all three datasets, pixel values range over [0, 1]. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted as PGD-inf: 1, 2, 4, 8, respectively, with step sizes 1/255, 1/255, 2/255, 4/255.
• l2-norm bounded adversary: For Caltech-256 & Tiny ImageNet, where the input size for our model is 224 × 224, we train three adversarially trained CNNs with ε ∈ {4, 8, 12}; these three models are denoted as PGD-l2: 4, 8, 12, respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and have the same step sizes as those for Caltech-256 & Tiny ImageNet.
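The settings listed above can be collected into a small table, together with the projection step that distinguishes the two norms. This is a sketch under stated assumptions: the ε and step-size values are read as fractions of 255 where the text gives them that way, the entries are those for 224 × 224 inputs, and `PGD_TRAINING_SETTINGS` and `project` are hypothetical names introduced here.

```python
import numpy as np

# name -> (norm, eps, step size), for the Caltech-256 / Tiny ImageNet setting.
PGD_TRAINING_SETTINGS = {
    "PGD-inf: 1": ("linf", 1 / 255, 1 / 255),
    "PGD-inf: 2": ("linf", 2 / 255, 1 / 255),
    "PGD-inf: 4": ("linf", 4 / 255, 2 / 255),
    "PGD-inf: 8": ("linf", 8 / 255, 4 / 255),
    "PGD-l2: 4": ("l2", 4.0, 2 / 255),
    "PGD-l2: 8": ("l2", 8.0, 4 / 255),
    "PGD-l2: 12": ("l2", 12.0, 6 / 255),
}
PGD_TRAINING_ITERATIONS = 20  # "In all settings, PGD runs 20 iterations."

def project(delta, norm, eps):
    """Map a perturbation back into the eps-ball of the given norm."""
    if norm == "linf":
        # Clip each coordinate independently.
        return np.clip(delta, -eps, eps)
    if norm == "l2":
        # Rescale the whole perturbation if its Euclidean length exceeds eps.
        length = np.linalg.norm(delta)
        return delta if length <= eps else delta * (eps / length)
    raise ValueError(f"unknown norm: {norm}")
```

The l∞ projection acts per pixel, whereas the l2 projection constrains the total perturbation energy, which is why the l2 budgets are scaled down for the smaller 32 × 32 CIFAR-10 images.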
A.2.2. TRAIN AGAINST A FGSM ADVERSARY

The values of ε for these two adversarially trained CNNs are ε ∈ {4, 8}, and the models are denoted as FGSM: 4, 8, respectively.
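For reference, FGSM (Goodfellow et al., 2014) perturbs the input by a single signed-gradient step of size ε. The sketch below is illustrative only; it assumes that the ε values 4 and 8 are expressed on the 0 to 255 pixel scale (so they become 4/255 and 8/255 for images in [0, 1]), and the gradient vector is a made-up example rather than one from a trained model.

```python
import numpy as np

def fgsm(x, grad, eps):
    """One signed-gradient step of size eps, clipped to the valid pixel range."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# Hypothetical input and loss gradient, for illustration only.
x = np.full(4, 0.5)
grad = np.array([0.3, -2.0, 0.0, 1.5])
x_fgsm = fgsm(x, grad, eps=8 / 255)  # assumed scale: eps = 8 means 8/255
```

Only the sign of each gradient component matters, so every pixel with a non-zero gradient moves by exactly ε.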
B. Style-transferred test set
Following Geirhos et al. (2019), we construct a stylized test set for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, using the style of a randomly selected painting from Kaggle's Painter by Numbers dataset³. We used the source code provided by Geirhos et al. (2019).
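The core operation of AdaIN (Huang & Belongie, 2017) aligns the per-channel mean and standard deviation of the content features with those of the style features, and the stylization coefficient α interpolates between the original and the re-normalized features. The following NumPy sketch shows only this feature-space step; the encoder and decoder networks of the full method are omitted, and the array shapes and the function name `adain` are assumptions for the example.

```python
import numpy as np

def adain(content, style, alpha=1.0, eps=1e-5):
    """Adaptive instance normalization on (C, H, W) feature maps.

    alpha = 1.0 gives full stylization; alpha = 0.0 returns the content features.
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    # Whiten the content statistics per channel, then apply the style statistics.
    stylized = s_std * (content - c_mean) / c_std + s_mean
    return alpha * stylized + (1.0 - alpha) * content
```

In the full pipeline, `content` and `style` would be encoder activations of the test image and the painting, and the blended features would be passed through a decoder to produce the stylized image.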
C. Experiments on Fourier-filtered datasets
Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.

C.1. Fourier filtering setup

Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set:

• The low-frequency filtered version: We use a radial mask in the Fourier domain to set higher frequency modes to zero (low-pass filtering).
• The high-frequency filtered version: We use a radial mask in the Fourier domain to preserve only the higher frequency modes (high-pass filtering).
• The random filtered version: We use a random mask in the Fourier domain to set each mode to 0 with probability p, uniformly. The random mask is generated on the fly during the test.
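The three filters above can be sketched with NumPy's FFT routines. This is an illustrative sketch for a single-channel image, not the code used for the experiments: the mask radius and the probability p are not specified in this appendix, so `radius` and `p` are free parameters here, and the random mask is applied without enforcing conjugate symmetry, so only the real part of the inverse transform is kept.

```python
import numpy as np

def radial_mask(h, w, radius):
    """Boolean disc around the spectrum centre (low frequencies after fftshift)."""
    yy, xx = np.ogrid[:h, :w]
    return (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2

def fourier_filter(img, keep):
    """Zero every Fourier mode where `keep` is False and transform back."""
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * keep)))

def low_pass(img, radius):
    return fourier_filter(img, radial_mask(*img.shape, radius))

def high_pass(img, radius):
    return fourier_filter(img, ~radial_mask(*img.shape, radius))

def random_filter(img, p, rng):
    # Each mode is dropped independently with probability p.
    keep = rng.random(img.shape) >= p
    return fourier_filter(img, keep)
```

Because the low-pass and high-pass masks are complementary, the two filtered images sum back to the original, which gives a quick sanity check of the implementation. For colour images, the same mask would be applied to each channel.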
C.2. Results

We measure generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that the predictions of AT-CNNs depend more on low-frequency information. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are more low-frequency.
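The metric "accuracy on correctly classified images" is defined in the main text rather than in this appendix. The sketch below assumes one plausible reading: for a given model, keep only the test images that the model classifies correctly before any transformation, and report the accuracy on the transformed versions of those images. The function name and the list-based interface are hypothetical.

```python
def accuracy_on_correctly_classified(clean_preds, transformed_preds, labels):
    """Percentage of images still correct after a transformation,
    among those that were classified correctly on the clean test set."""
    kept = [i for i, (p, y) in enumerate(zip(clean_preds, labels)) if p == y]
    if not kept:
        raise ValueError("no correctly classified clean images to evaluate on")
    hits = sum(transformed_preds[i] == labels[i] for i in kept)
    return 100.0 * hits / len(kept)

# Toy example: 3 of 4 clean predictions are correct; 2 of those 3 survive.
example_score = accuracy_on_correctly_classified(
    clean_preds=[0, 1, 2, 0],
    transformed_preds=[0, 9, 2, 3],
    labels=[0, 1, 2, 3],
)
```

Under this reading the metric is 100% on the untransformed test set by construction, so lower values directly measure how much a transformation hurts a given model.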
Table 3. "Accuracy on correctly classified images" for different models on three Fourier-filtered Caltech-256 test sets.

DATA SET    LOW-FREQUENCY FILTERED    HIGH-FREQUENCY FILTERED    RANDOM FILTERED
STANDARD    15.8                      16.5                       73.5
UNDERFIT    14.5                      17.6                       62.2
PGD-l∞      71.1                      3.6                        73.4

D. Detailed results
We list the detailed results for our quantitative experiments here. Tables 4, 5, and 6 show the results of each model on the test set with different saturation levels. Tables 7 and 8 list the results of each model on the test set after different patch-shuffling operations.

³ https://www.kaggle.com/c/painter-by-numbers
E. Additional Figures
We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps generated using Grad and SmoothGrad in Figure 10.
Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
UNDERFIT           31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞ 8           32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞ 4           31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞ 2           32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞ 1           32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2 12          31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2 8           32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2 4           32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM 8             29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM 4             32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78

Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           7.24    25.88   72.52   72.73   25.38   8.24    2.62    1.93
UNDERFIT           7.34    25.44   69.80   60.67   18.01   6.72    3.16    2.65
PGD-l∞ 8           11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-l∞ 4           12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-l∞ 2           12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-l∞ 1           11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-l2 12          11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-l2 8           12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-l2 4           12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM 8             12.59   32.66   70.55   81.53   41.83   17.52   7.29    5.82
FGSM 4             12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14

Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

SATURATION LEVEL   0.25    0.5     1       4       8       16      64      1024
STANDARD           27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
UNDERFIT           21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-l∞ 8           26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-l∞ 4           27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-l∞ 2           28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-l∞ 1           27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-l2 12          25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-l2 8           27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-l2 4           27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM 8             25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM 4             26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82
Table 7. "Accuracy on correctly classified images" for different models on the patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Caltech-256.

DATA SET     2 × 2    4 × 4    8 × 8
STANDARD     84.76    51.50    10.84
UNDERFIT     75.59    33.41    6.03
PGD-l∞ 8     58.13    20.14    7.70
PGD-l∞ 4     68.54    26.45    8.18
PGD-l∞ 2     74.25    30.77    9.00
PGD-l∞ 1     78.11    35.03    8.42
PGD-l2 12    58.25    21.03    7.85
PGD-l2 8     63.36    22.19    8.48
PGD-l2 4     69.65    28.21    7.72
FGSM 8       64.48    22.94    8.07
FGSM 4       70.50    28.41    6.03

Table 8. "Accuracy on correctly classified images" for different models on the patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to patch-shuffle operations on Tiny ImageNet.

DATA SET     2 × 2    4 × 4    8 × 8
STANDARD     66.73    24.87    4.48
UNDERFIT     59.22    23.62    4.38
PGD-l∞ 8     41.08    16.05    6.83
PGD-l∞ 4     49.54    18.23    6.30
PGD-l∞ 2     55.96    19.95    5.61
PGD-l∞ 1     60.19    23.24    6.08
PGD-l2 12    42.23    16.95    7.66
PGD-l2 8     47.67    16.28    6.50
PGD-l2 4     51.94    17.79    5.89
FGSM 8       57.42    20.70    4.73
FGSM 4       50.68    16.84    5.98
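The patch-shuffling operation behind Tables 7 and 8 is described in the main text; as a reading aid, the sketch below assumes it splits each image into a k × k grid of equal patches and rearranges the patches at random, which preserves local texture while destroying global shape. The function name, the (H, W, C) layout, and the requirement that k divides the image size are assumptions of this example.

```python
import numpy as np

def patch_shuffle(img, k, rng):
    """Split an (H, W, C) image into a k x k grid of patches and permute them."""
    h, w, c = img.shape
    if h % k or w % k:
        raise ValueError("image height and width must be divisible by k")
    ph, pw = h // k, w // k
    # (k, ph, k, pw, C) -> (k, k, ph, pw, C) -> (k*k, ph, pw, C)
    patches = img.reshape(k, ph, k, pw, c).swapaxes(1, 2).reshape(k * k, ph, pw, c)
    patches = patches[rng.permutation(k * k)]
    # Reassemble the shuffled patches into an image of the original shape.
    return patches.reshape(k, k, ph, pw, c).swapaxes(1, 2).reshape(h, w, c)
```

For example, `patch_shuffle(img, 4, np.random.default_rng(0))` would correspond to the 4 × 4 column under this assumption. Every pixel value is kept, only its position changes.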
Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: Standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4.
Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: Standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4. It is easily observed that sensitivity maps generated from Grad are noisier than those from its smoothed variant SmoothGrad, especially for Standard CNNs and underfitting CNNs.
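The difference between the two methods compared in Figures 9 and 10 can be sketched as follows: Grad uses the input gradient directly, while SmoothGrad (Smilkov et al., 2017) averages the gradient over several copies of the input perturbed with Gaussian noise. This is a minimal sketch, assuming an absolute noise standard deviation and a user-supplied `grad_fn` that returns the gradient of the class score with respect to the input; neither the sample count nor the noise level used for the figures is stated in this appendix.

```python
import numpy as np

def grad_map(x, grad_fn):
    """Plain gradient sensitivity map (Grad)."""
    return grad_fn(x)

def smoothgrad_map(x, grad_fn, n_samples=50, noise_std=0.1, seed=0):
    """SmoothGrad: mean gradient over Gaussian-perturbed copies of the input."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, noise_std, size=x.shape)
        total += grad_fn(noisy)
    return total / n_samples
```

For a locally linear score the two maps coincide; the averaging only matters where the gradient fluctuates rapidly around the input, which is the situation in which Grad maps look noisy.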
adversarial attacks In International Conference on Learn-ing Representations 2018
Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017
Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018
Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017
Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015
Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017
Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013
Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018
Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017
Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018
Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017
Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018
Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017
Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018
Interpreting Adversarially Trained Convolutional Neural Networks
Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018
Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018
Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014
Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a
Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b
Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016
Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017
Interpreting Adversarially Trained Convolutional Neural Networks
Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017
Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018
Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017
Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015
Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017
Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013
Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018
Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017
Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018
Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017
Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018
Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017
Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018
Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018
Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018
Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014
Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a
Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b
Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016
Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017
Interpreting Adversarially Trained Convolutional Neural Networks
A Experiment SetupA1 Models
bull CIFAR-10 We train a standard ResNet-18 (He et al2016a) architecture it has 4 groups of residual layerswith filter sizes (64 128 256 512) and 2 residual units
bull Caltech-256 amp Tiny ImageNet We use a ResNet-18architecture using the code from pytorch(Paszke et al2017) Note that for models on Caltech-256 amp TinyImageNet we initialize them using ImageNet(Denget al 2009) pre-trained weighs provided by pytorch
We evaluate the robustness of all our models using a linfinprojected gradient descent adversary with ε = 8255 stepsize = 2 and number of iterations as 40
A2 Adversarial Training
We perform 9 types of adversarial training on each of thedataset 7 of the 9 kinds of adversarial training are againsta projected gradient descent (PGD) adversary(Madry et al2018) the other 2 are against FGSM adversary(Goodfellowet al 2014)
A21 TRAIN AGAINST A PROJECTED GRADIENTDESCENT (PGD) ADVERSARY
We list value of ε for adversarial training of each dataset andlp-norm In all settings PGD runs 20 iterations
bull linfin-norm bounded adversary For all of thethree data set pixel vaules range from 0 1 wetrain 4 adversarially trained CNNs with ε isin1255 2255 4255 8255 these four models aredenoted as PGD-inf1 2 4 8 respectively and stepssize as 1255 1255 2255 4255
bull l2-norm bounded adversary For Caltech-256 ampTiny ImageNet the input size for our model is 224times224 we train three adversarially trained CNNs withε isin 4 8 12 and these four models are denoted asPGD-l2 4 8 12 respectively Step sizes for thesethree models are 2255 4255 6255 For CIFAR-10where images are of size 32times 32 the three adversari-ally trained CNNs have ε isin 410 810 1210 butthey are denoted in the same way and have the samestep size as that in Caltech-256 amp Tiny ImageNet
A22 TRAIN AGAINST A FGSM ADVERSARY
ε for these two adversarially trained CNNs are ε isin4 8 and they are denoted as FGSM 4 8 respectively
B Style-transferred test setFollowing (Geirhos et al 2019) we construct stylized testset for Caltech-256 and Tiny ImageNet by applying theAdaIn style transfer(Huang amp Belongie 2017) with a styl-ization coefficient of α = 10 to every test image withthe style of a randomly selected painting from 3KagglersquosPainter by numbers dataset we used source code providedby(Geirhos et al 2019)
C Experiments on Fourier-filtered datasets(Jo amp Bengio 2017) showed deep neural networks tendto learn surface statistical regularities as opposed to high-level abstractions Following them we test the performanceof different trained CNNs on the high-pass and low-passfiltered dataset to show their tendencies
C1 Fourier filtering setup
Following (Jo amp Bengio 2017) We construct three types ofFourier filtered version of test set
bull The low frequency filtered version We use a radialmask in the Fourier domain to set higher frequencymodes to zero(low-pass filtering)
bull The high frequency filtered version We use a radialmask in the Fourier domain to preserve only the higherfrequency modes(high-pass filtering)
bull The random filtered version We use a random maskin the Fourier domain to set each mode to 0 with prob-ability p uniformly The random mask is generated onthe fly during the test
C2 Results
We measure generalization performance (accuracy on cor-rectly classified images) of each model on these three fil-tered datasets from Caltech-256 results are listed in Ta-ble 3 AT-CNNs performs better on Low-pass filtered datasetand worse on High-pass filtered dataset Results indicatethat AT-CNNs make their predictions depend more on low-frequency information This finding is consistent with ourconclusions since local features such as textures are oftenconsidered as high-frequency information and shapes andcontours are more like low-frequency
D Detailed resultsWe the detailed results for our quantitative experimentshere Table 5 4 6 show the results of each models on
3httpswwwkagglecomcpainter-by-numbers
Interpreting Adversarially Trained Convolutional Neural Networks
Table 3 ldquoAccuracy on correctly classified imagesrdquo for different models on three Fourier-filtered Caltech-256 test setsDATA SET THE LOW FREQUENCY FILTERED VERSION THE HIGH FREQUENCY FILTERED VERSION THE RANDOM FILTERED VERSION
STANDARD 158 165 735UNDERFIT 145 176 622PGD-linfin 711 36 734
test set with different saturation levels Table 8 7 list allthe results of each models on test set after different path-shuffling operations
E Additional FiguresWe show additional sensitive maps in Figure 9 We alsocompare the sensitive maps using Grad and SmoothGradin Figure 10
ReferencesAdebayo J Gilmer J Muelly M Goodfellow I Hardt
M and Kim B Sanity checks for saliency maps InAdvances in Neural Information Processing Systems pp9525ndash9536 2018
Ancona, M., Ceolini, E., Öztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
Ballester, P. and de Araujo, R. M. On the performance of googlenet and alexnet applied to sketches. In AAAI, pp. 1124–1128, 2016.
Brendel, W. and Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations, 2019.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
Ding, G. W., Lui, K. Y.-C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. In International Conference on Learning Representations, 2019.
Dong, Y., Su, H., Zhu, J., and Bao, F. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.
Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519. IEEE, 2017.
Jo, J. and Bengio, Y. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561, 2017.
Table 4. "Accuracy on correctly classified images" for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.

SATURATION LEVEL    0.25    0.5     1       4       8       16      64      1024
STANDARD            28.62   57.45   85.20   90.13   65.37   42.37   23.45   20.03
UNDERFIT            31.84   63.36   90.96   84.51   57.51   38.58   26.00   23.08
PGD-l∞: 8           32.84   53.47   82.72   86.45   70.33   61.09   53.76   51.91
PGD-l∞: 4           31.99   57.74   85.18   87.95   70.33   58.38   48.16   45.45
PGD-l∞: 2           32.99   60.75   87.75   89.35   68.78   51.99   40.69   37.83
PGD-l∞: 1           32.67   61.85   89.36   90.18   69.07   50.05   37.98   34.80
PGD-l2: 12          31.38   53.07   82.10   83.89   67.06   58.51   52.45   50.75
PGD-l2: 8           32.82   56.65   85.01   86.09   68.90   58.75   51.59   49.30
PGD-l2: 4           32.82   58.77   86.30   86.36   67.94   53.68   44.43   41.98
FGSM: 8             29.53   55.46   85.10   86.65   69.01   55.64   45.92   43.42
FGSM: 4             32.68   59.37   87.22   87.90   66.71   51.13   41.66   38.78

Table 5. "Accuracy on correctly classified images" for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.

SATURATION LEVEL    0.25    0.5     1       4       8       16      64      1024
STANDARD            7.24    25.88   72.52   72.73   25.38   8.24    2.62    1.93
UNDERFIT            7.34    25.44   69.80   60.67   18.01   6.72    3.16    2.65
PGD-l∞: 8           11.07   29.08   67.11   74.53   49.8    40.16   35.44   33.96
PGD-l∞: 4           12.44   33.53   72.94   75.75   46.38   32.12   24.92   22.65
PGD-l∞: 2           12.09   34.85   75.77   76.15   41.35   25.20   16.93   14.52
PGD-l∞: 1           11.30   35.03   76.85   78.63   40.48   21.37   12.70   10.81
PGD-l2: 12          11.30   29.48   66.94   75.22   52.26   42.11   37.20   35.85
PGD-l2: 8           12.42   32.78   71.94   75.15   47.92   35.66   29.55   27.90
PGD-l2: 4           12.63   34.10   74.06   77.32   45.00   28.73   20.16   18.04
FGSM: 8             12.59   32.66   70.55   81.53   41.83   17.52   7.29    5.82
FGSM: 4             12.63   34.10   74.06   75.05   42.91   29.09   22.15   20.14

Table 6. "Accuracy on correctly classified images" for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.

SATURATION LEVEL    0.25    0.5     1       4       8       16      64      1024
STANDARD            27.36   55.95   91.03   93.12   69.98   48.30   34.39   31.06
UNDERFIT            21.43   50.28   87.71   89.89   66.09   43.35   29.10   26.13
PGD-l∞: 8           26.05   46.96   80.97   89.16   75.46   69.08   58.98   64.64
PGD-l∞: 4           27.22   49.81   84.16   89.79   73.89   65.35   59.99   58.47
PGD-l∞: 2           28.32   53.12   86.93   91.37   74.02   62.82   55.25   52.60
PGD-l∞: 1           27.18   53.59   88.54   91.77   72.67   58.39   47.25   41.75
PGD-l2: 12          25.99   46.92   81.72   88.44   73.92   66.03   60.98   59.41
PGD-l2: 8           27.75   50.29   83.76   80.92   73.17   64.83   58.64   46.94
PGD-l2: 4           27.26   51.17   85.78   90.08   73.12   61.50   52.04   48.79
FGSM: 8             25.50   46.11   81.72   87.67   74.22   67.12   62.51   61.32
FGSM: 4             26.39   58.93   84.30   89.02   73.47   64.43   58.80   56.82
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
Luo, T., Cai, T., Zhang, M., Chen, S., and Wang, L. Random mask: Towards robust convolutional neural networks. 2018.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

Table 7. "Accuracy on correctly classified images" for different models on the Patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Caltech-256.

DATA SET            2 × 2   4 × 4   8 × 8
STANDARD            84.76   51.50   10.84
UNDERFIT            75.59   33.41   6.03
PGD-l∞: 8           58.13   20.14   7.70
PGD-l∞: 4           68.54   26.45   8.18
PGD-l∞: 2           74.25   30.77   9.00
PGD-l∞: 1           78.11   35.03   8.42
PGD-l2: 12          58.25   21.03   7.85
PGD-l2: 8           63.36   22.19   8.48
PGD-l2: 4           69.65   28.21   7.72
FGSM: 8             64.48   22.94   8.07
FGSM: 4             70.50   28.41   6.03

Table 8. "Accuracy on correctly classified images" for different models on the Patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Tiny ImageNet.

DATA SET            2 × 2   4 × 4   8 × 8
STANDARD            66.73   24.87   4.48
UNDERFIT            59.22   23.62   4.38
PGD-l∞: 8           41.08   16.05   6.83
PGD-l∞: 4           49.54   18.23   6.30
PGD-l∞: 2           55.96   19.95   5.61
PGD-l∞: 1           60.19   23.24   6.08
PGD-l2: 12          42.23   16.95   7.66
PGD-l2: 8           47.67   16.28   6.50
PGD-l2: 4           51.94   17.79   5.89
FGSM: 8             57.42   20.70   4.73
FGSM: 4             50.68   16.84   5.98

Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: Standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4.

Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: Standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4. It is easily observed that sensitivity maps generated from Grad are noisier than those from its smoothed variant SmoothGrad, especially for Standard CNNs and underfitting CNNs.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.
Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.
Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
A. Experiment Setup
A.1. Models

• CIFAR-10: We train a standard ResNet-18 (He et al., 2016a) architecture; it has 4 groups of residual layers with filter sizes (64, 128, 256, 512) and 2 residual units.

• Caltech-256 & Tiny ImageNet: We use a ResNet-18 architecture, using the code from PyTorch (Paszke et al., 2017). Note that for models on Caltech-256 & Tiny ImageNet, we initialize them using ImageNet (Deng et al., 2009) pre-trained weights provided by PyTorch.

We evaluate the robustness of all our models using an l∞ projected gradient descent adversary with ε = 8/255, step size = 2, and 40 iterations.
A.2. Adversarial Training

We perform 9 types of adversarial training on each of the datasets. 7 of the 9 kinds of adversarial training are against a projected gradient descent (PGD) adversary (Madry et al., 2018); the other 2 are against an FGSM adversary (Goodfellow et al., 2014).
A.2.1. TRAIN AGAINST A PROJECTED GRADIENT DESCENT (PGD) ADVERSARY

We list the value of ε used for adversarial training on each dataset and for each lp-norm. In all settings, PGD runs 20 iterations.

• l∞-norm bounded adversary: For all three datasets, pixel values range from 0 to 1. We train 4 adversarially trained CNNs with ε ∈ {1/255, 2/255, 4/255, 8/255}; these four models are denoted PGD-inf: 1, 2, 4, 8 respectively, with step sizes 1/255, 1/255, 2/255, 4/255.

• l2-norm bounded adversary: For Caltech-256 & Tiny ImageNet, the input size for our model is 224 × 224. We train three adversarially trained CNNs with ε ∈ {4, 8, 12}; these three models are denoted PGD-l2: 4, 8, 12 respectively. Step sizes for these three models are 2/255, 4/255, 6/255. For CIFAR-10, where images are of size 32 × 32, the three adversarially trained CNNs have ε ∈ {4/10, 8/10, 12/10}, but they are denoted in the same way and have the same step sizes as for Caltech-256 & Tiny ImageNet.
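As an illustration of the l∞-bounded PGD adversary described above, the following is a minimal sketch in plain Python. It is not the training code used for the experiments: the toy linear classifier with logistic loss, the helper names (`logistic_loss`, `loss_gradient`, `pgd_linf`), and the random start inside the ε-ball are all assumptions made for the example. Only the ε, step size, and iteration count are taken from the PGD-inf: 8 setting listed above.

```python
import math
import random

def logistic_loss(x, w, b, y):
    """Loss log(1 + exp(-y * (w.x + b))) of a toy linear classifier, y in {-1, +1}."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return math.log1p(math.exp(-margin))

def loss_gradient(x, w, b, y):
    """Analytic gradient of logistic_loss with respect to the input x."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]

def pgd_linf(x, w, b, y, eps, step, iters, rng):
    """l_inf PGD: random start, signed-gradient ascent, projection after each step."""
    # Random start inside the eps-ball (an assumption; not stated in the text).
    x_adv = [min(1.0, max(0.0, xi + rng.uniform(-eps, eps))) for xi in x]
    for _ in range(iters):
        grad = loss_gradient(x_adv, w, b, y)
        # Ascent step along the sign of the gradient.
        x_adv = [xa + step * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Project onto the eps-ball around x, then onto the pixel range [0, 1].
        x_adv = [min(xi + eps, max(xi - eps, xa)) for xi, xa in zip(x, x_adv)]
        x_adv = [min(1.0, max(0.0, xa)) for xa in x_adv]
    return x_adv

# Hypothetical 4-pixel "image" and weights; eps = 8/255, step 4/255, 20 iterations.
x = [0.2, 0.5, 0.9, 0.0]
w, b, y = [1.0, -2.0, 0.5, 3.0], 0.1, 1
x_adv = pgd_linf(x, w, b, y, eps=8 / 255, step=4 / 255, iters=20, rng=random.Random(0))
```

With a deep network the gradient would come from automatic differentiation rather than a closed form. An l2-bounded adversary would instead normalize the gradient and project onto an l2 ball, and a single step of size ε without a random start corresponds to FGSM.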
A.2.2. TRAIN AGAINST A FGSM ADVERSARY

The values of ε for these two adversarially trained CNNs are ε ∈ {4, 8}, and they are denoted FGSM: 4, 8 respectively.
B. Style-transferred test set
Following Geirhos et al. (2019), we construct stylized test sets for Caltech-256 and Tiny ImageNet by applying AdaIN style transfer (Huang & Belongie, 2017) with a stylization coefficient of α = 1.0 to every test image, using the style of a randomly selected painting from Kaggle's Painter by Numbers dataset (footnote 3). We used the source code provided by Geirhos et al. (2019).
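To make the role of the stylization coefficient α concrete, here is a small sketch of the core AdaIN operation of Huang & Belongie (2017): content features are renormalized to the mean and standard deviation of the style features, and α interpolates between the content features and the restyled ones. This is only the statistic-matching step on a single flattened channel. The encoder and decoder networks of the real pipeline are omitted, and the function names and the `eps` stabilizer are assumptions for the example, not the code of Geirhos et al. (2019).

```python
import math

def mean_std(values, eps=1e-5):
    """Mean and (stabilized) standard deviation of one feature channel."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var + eps)

def adain_channel(content, style, alpha=1.0):
    """AdaIN on one channel: match style statistics, then blend with alpha."""
    c_mean, c_std = mean_std(content)
    s_mean, s_std = mean_std(style)
    stylized = [s_std * (c - c_mean) / c_std + s_mean for c in content]
    # alpha = 1.0 keeps only the restyled features; alpha = 0.0 returns the content.
    return [alpha * t + (1.0 - alpha) * c for t, c in zip(stylized, content)]

# Hypothetical feature values for one channel of a content and a style image.
content = [0.0, 1.0, 2.0, 3.0]
style = [10.0, 10.5, 11.0, 11.5, 12.0]
restyled = adain_channel(content, style, alpha=1.0)
```

With α = 1.0, as used here, the output channel carries the style statistics entirely, which is what removes the original texture cues while the spatial layout of the content features is preserved.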
C. Experiments on Fourier-filtered datasets
Jo & Bengio (2017) showed that deep neural networks tend to learn surface statistical regularities as opposed to high-level abstractions. Following them, we test the performance of differently trained CNNs on high-pass and low-pass filtered datasets to show their tendencies.
C.1. Fourier filtering setup

Following Jo & Bengio (2017), we construct three types of Fourier-filtered versions of the test set:

• The low frequency filtered version: We use a radial mask in the Fourier domain to set higher frequency modes to zero (low-pass filtering).

• The high frequency filtered version: We use a radial mask in the Fourier domain to preserve only the higher frequency modes (high-pass filtering).

• The random filtered version: We use a random mask in the Fourier domain to set each mode to 0 with probability p, uniformly. The random mask is generated on the fly during the test.
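The three filters above can be sketched as follows. This is a dependency-free illustration, not the authors' implementation: it uses a naive O(n⁴) discrete Fourier transform that is only practical for tiny single-channel images (an FFT library would be used in practice), and the function names, the mask radius, and the choice to take the real part after random masking are assumptions for the example.

```python
import cmath
import math
import random

def dft2(img, inverse=False):
    """Naive 2-D DFT of a square list-of-lists (for small demos only)."""
    n = len(img)
    sgn = 1 if inverse else -1
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0j
            for x in range(n):
                for y in range(n):
                    s += img[x][y] * cmath.exp(sgn * 2j * math.pi * (u * x + v * y) / n)
            out[u][v] = s / (n * n) if inverse else s
    return out

def radial_mask(n, radius, keep_low=True):
    """Mask that keeps modes inside (low-pass) or outside (high-pass) a radius."""
    mask = []
    for u in range(n):
        row = []
        for v in range(n):
            fu, fv = min(u, n - u), min(v, n - v)  # wrapped distance from the DC mode
            inside = math.hypot(fu, fv) <= radius
            row.append(1.0 if inside == keep_low else 0.0)
        mask.append(row)
    return mask

def random_mask(n, p, rng):
    """Mask that zeroes each Fourier mode independently with probability p."""
    return [[0.0 if rng.random() < p else 1.0 for _ in range(n)] for _ in range(n)]

def fourier_filter(img, mask):
    """Apply a mask in the Fourier domain and return the real part of the result."""
    n = len(img)
    spec = dft2(img)
    masked = [[spec[u][v] * mask[u][v] for v in range(n)] for u in range(n)]
    # A random mask is not conjugate-symmetric, so only the real part is kept.
    return [[z.real for z in row] for row in dft2(masked, inverse=True)]

# Hypothetical 4 x 4 grayscale image with pixel values in [0, 1].
img = [[0.1, 0.9, 0.1, 0.9], [0.2, 0.8, 0.2, 0.8],
       [0.3, 0.7, 0.3, 0.7], [0.4, 0.6, 0.4, 0.6]]
low = fourier_filter(img, radial_mask(4, 1.0, keep_low=True))
high = fourier_filter(img, radial_mask(4, 1.0, keep_low=False))
rand = fourier_filter(img, random_mask(4, 0.5, random.Random(0)))
```

Because the low-pass and high-pass masks are complementary for a given radius, the two filtered images sum back to the original, which is a convenient sanity check when choosing the radius.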
C.2. Results

We measure the generalization performance (accuracy on correctly classified images) of each model on these three filtered datasets derived from Caltech-256; results are listed in Table 3. AT-CNNs perform better on the low-pass filtered dataset and worse on the high-pass filtered dataset. The results indicate that AT-CNNs rely more on low-frequency information when making predictions. This finding is consistent with our conclusions, since local features such as textures are often considered high-frequency information, while shapes and contours are closer to low-frequency information.
Table 3. "Accuracy on correctly classified images" for different models on three Fourier-filtered Caltech-256 test sets.

DATA SET            THE LOW FREQUENCY FILTERED VERSION   THE HIGH FREQUENCY FILTERED VERSION   THE RANDOM FILTERED VERSION
STANDARD            15.8                                 16.5                                  73.5
UNDERFIT            14.5                                 17.6                                  62.2
PGD-l∞              71.1                                 3.6                                   73.4

3 https://www.kaggle.com/c/painter-by-numbers

D. Detailed results
We list the detailed results for our quantitative experiments here. Tables 4, 5 and 6 show the results of each model on test sets with different saturation levels. Tables 7 and 8 list the results of each model on test sets after different patch-shuffling operations.
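For readers who want to reproduce the patch-shuffling operation behind Tables 7 and 8, the following is a minimal sketch. It assumes that "k × k" means splitting a square image into k × k equally sized patches and permuting them uniformly at random; the function name and the list-of-rows image representation are hypothetical, chosen to keep the example dependency-free.

```python
import random

def patch_shuffle(img, k, rng):
    """Split a square image (list of rows) into k x k patches and permute them."""
    n = len(img)
    assert n % k == 0, "image side must be divisible by k"
    p = n // k  # side length of one patch
    # Collect patches in row-major order; each patch is a list of p row slices.
    patches = [
        [row[c * p:(c + 1) * p] for row in img[r * p:(r + 1) * p]]
        for r in range(k) for c in range(k)
    ]
    rng.shuffle(patches)
    # Write the permuted patches back into a fresh image.
    out = [[None] * n for _ in range(n)]
    for idx, patch in enumerate(patches):
        r, c = divmod(idx, k)
        for i in range(p):
            out[r * p + i][c * p:(c + 1) * p] = patch[i]
    return out

# Hypothetical 4 x 4 image whose pixel values encode their original position.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
shuffled = patch_shuffle(img, 2, random.Random(0))
```

The operation keeps the content of every patch intact, so local texture statistics survive while the global shape is destroyed, which is why it is used to probe shape bias.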
E. Additional Figures
We show additional sensitivity maps in Figure 9. We also compare the sensitivity maps produced by Grad and SmoothGrad in Figure 10.
Ancona M Ceolini E Oztireli C and Gross M To-wards better understanding of gradient-based attributionmethods for deep neural networks In 6th InternationalConference on Learning Representations (ICLR 2018)2018
Athalye A Carlini N and Wagner D Obfuscatedgradients give a false sense of security Circumvent-ing defenses to adversarial examples arXiv preprintarXiv180200420 2018
Bach S Binder A Montavon G Klauschen F MullerK-R and Samek W On pixel-wise explanations fornon-linear classifier decisions by layer-wise relevancepropagation PloS one 10(7)e0130140 2015
Ballester P and de Araujo R M On the performance ofgooglenet and alexnet applied to sketches In AAAI pp1124ndash1128 2016
Brendel W and Bethge M Approximating cnns withbag-of-local-features models works surprisingly well onimagenet In International Conference on Learning Rep-resentations 2019
Deng J Dong W Socher R Li L-J Li K and Fei-FeiL Imagenet A large-scale hierarchical image databaseIn Computer Vision and Pattern Recognition 2009 CVPR2009 IEEE Conference on pp 248ndash255 Ieee 2009
Ding G W Lui K Y-C Jin X Wang L and Huang ROn the sensitivity of adversarial robustness to input data
distributions In International Conference on LearningRepresentations 2019
Dong Y Su H Zhu J and Bao F Towards interpretabledeep neural networks by leveraging adversarial examplesarXiv preprint arXiv170805493 2017
Erhan D Bengio Y Courville A and Vincent P Visual-izing higher-layer features of a deep network Universityof Montreal 1341(3)1 2009
Gatys L A Ecker A S and Bethge M A neural algo-rithm of artistic style arXiv preprint arXiv1508065762015
Geirhos R Rubisch P Michaelis C Bethge M Wich-mann F A and Brendel W Imagenet-trained cnns arebiased towards texture increasing shape bias improvesaccuracy and robustness In International Conference onLearning Representations 2019
Girshick R Donahue J Darrell T and Malik J Rich fea-ture hierarchies for accurate object detection and semanticsegmentation In Proceedings of the IEEE conference oncomputer vision and pattern recognition pp 580ndash5872014
Goodfellow I J Shlens J and Szegedy C Explain-ing and harnessing adversarial examples arXiv preprintarXiv14126572 2014
Griffin G Holub A and Perona P Caltech-256 objectcategory dataset 2007
He K Zhang X Ren S and Sun J Deep residual learn-ing for image recognition In Proceedings of the IEEEconference on computer vision and pattern recognitionpp 770ndash778 2016a
He K Zhang X Ren S and Sun J Identity mappingsin deep residual networks In European conference oncomputer vision pp 630ndash645 Springer 2016b
Huang X and Belongie S Arbitrary style transfer in real-time with adaptive instance normalization In 2017 IEEEInternational Conference on Computer Vision (ICCV) pp1510ndash1519 IEEE 2017
Jo J and Bengio Y Measuring the tendency of cnnsto learn surface statistical regularities arXiv preprintarXiv171111561 2017
Interpreting Adversarially Trained Convolutional Neural Networks
Table 4 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Caltech-256 test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Caltech-256
SATURAION LEVEL 025 05 1 4 8 16 64 1024
STANDARD 2862 5745 8520 9013 6537 4237 2345 2003UNDERFIT 3184 6336 9096 8451 5751 3858 2600 2308PGD-linfin 8 3284 5347 8272 8645 7033 6109 5376 5191PGD-linfin 4 3199 5774 8518 8795 7033 5838 4816 4545PGD-linfin 2 3299 6075 8775 8935 6878 5199 4069 3783PGD-linfin 1 3267 6185 8936 9018 6907 5005 3798 3480PGD-l2 12 3138 5307 8210 8389 6706 5851 5245 5075PGD-l2 8 3282 5665 8501 8609 6890 5875 5159 4930PGD-l2 4 3282 5877 8630 8636 6794 5368 4443 4198FGSM 8 2953 5546 8510 8665 6901 5564 4592 4342FGSM 4 3268 5937 8722 8790 6671 5113 4166 3878
Table 5 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Tiny ImageNet test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Tiny ImageNet
SATURAION LEVEL 025 05 1 4 8 16 64 1024
STANDARD 724 2588 7252 7273 2538 824 262 193UNDERFIT 734 2544 6980 6067 1801 672 316 265PGD-linfin 8 1107 2908 6711 7453 498 4016 3544 3396PGD-linfin 4 1244 3353 7294 7575 4638 3212 2492 2265PGD-linfin 2 1209 3485 7577 7615 4135 2520 1693 1452PGD-linfin 1 1130 3503 7685 7863 4048 2137 1270 1081PGD-l2 12 1130 2948 6694 7522 5226 4211 3720 3585PGD-l2 8 1242 3278 7194 7515 4792 3566 2955 2790PGD-l2 4 1263 3410 7406 7732 4500 2873 2016 1804FGSM 8 1259 3266 7055 8153 4183 1752 729 582FGSM 4 1263 3410 7406 7505 4291 2909 2215 2014
Table 6 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated CIFAR-10 test set It is easily observed AT-CNNs aremuch more robust to increasing saturation levels on CIFAR-10
SATURAION LEVEL 025 05 1 4 8 16 64 1024
STANDARD 2736 5595 9103 9312 6998 4830 3439 3106UNDERFIT 2143 5028 8771 8989 6609 4335 2910 2613PGD-linfin 8 2605 4696 8097 8916 7546 6908 5898 6464PGD-linfin 4 2722 4981 8416 8979 7389 6535 5999 5847PGD-linfin 2 2832 5312 8693 9137 7402 6282 5525 5260PGD-linfin 1 2718 5359 8854 9177 7267 5839 4725 4175PGD-l2 12 2599 4692 8172 8844 7392 6603 6098 5941PGD-l2 8 2775 5029 8376 8092 7317 6483 5864 4694PGD-l2 4 2726 5117 8578 9008 7312 6150 5204 4879FGSM 8 2550 4611 8172 8767 7422 6712 6251 6132FGSM 4 2639 5893 8430 8902 7347 6443 5880 5682
Krizhevsky A Sutskever I and Hinton G E Imagenetclassification with deep convolutional neural networksIn Advances in neural information processing systemspp 1097ndash1105 2012
Kurakin A Goodfellow I and Bengio S Adversar-ial examples in the physical world arXiv preprintarXiv160702533 2016
Long J Shelhamer E and Darrell T Fully convolutional
networks for semantic segmentation In Proceedingsof the IEEE conference on computer vision and patternrecognition pp 3431ndash3440 2015
Luo T Cai T Zhang M Chen S and Wang L Randommask Towards robust convolutional neural networks2018
Madry A Makelov A Schmidt L Tsipras D andVladu A Towards deep learning models resistant to
Interpreting Adversarially Trained Convolutional Neural Networks
Table 7 ldquoAccuracy on correctly classified imagesrdquo for different models on Patch-shuffled Caltech-256 test set Results indicates thatAT-CNNs are more sensitive to Patch-shuffle operations on Caltech-256
DATA SET 2times 2 4times 4 8times 8T
STANDARD 8476 5150 1084UNDERFIT 7559 3341 603PGD-linfin 8 5813 2014 770PGD-linfin 4 6854 2645 818PGD-linfin 2 7425 3077 900PGD-linfin 1 7811 3503 842PGD-l2 12 5825 2103 785PGD-l2 8 6336 2219 848PGD-l2 4 6965 2821 772FGSM 8 6448 2294 807FGSM 4 7050 2841 603
Table 8 ldquoAccuracy on correctly classified imagesrdquo for different models on Patch-shuffled Tiny ImageNet test set Results indicates thatAT-CNNs are more sensitive to Patch-shuffle operations on Tiny ImageNet
DATA SET 2times 2 4times 4 8times 8T
STANDARD 6673 2487 448UNDERFIT 5922 2362 438PGD-linfin 8 4108 1605 683PGD-linfin 4 4954 1823 630PGD-linfin 2 5596 1995 561PGD-linfin 1 6019 2324 608PGD-l2 12 4223 1695 766PGD-l2 8 4767 1628 650PGD-l2 4 5194 1779 589FGSM 8 5742 2070 473FGSM 4 5068 1684 598
Figure 9 Visualization of Salience maps generated from SmoothGrad (Smilkov et al 2017) for all 11 models From left to rightStandard CNNs underfitting CNNs PGD-inf 8 4 2 1 PGD-L2 12 8 4 and FGSM 8 4
Interpreting Adversarially Trained Convolutional Neural Networks
Figure 10 Visualization of Salience maps generated from Grad for all 11 models From left to right Standard CNNs underfitting CNNsPGD-inf 8 4 2 1 PGD-L2 12 8 4 and FGSM 8 4 Itrsquos easily observed that sensitivity maps generated from Grad are more noisycompared with its smoothed variant SmoothGrad especially for Standard CNNs and underfitting CNNs
adversarial attacks In International Conference on Learn-ing Representations 2018
Paszke A Gross S Chintala S Chanan G Yang EDeVito Z Lin Z Desmaison A Antiga L and LererA Automatic differentiation in pytorch 2017
Schmidt L Santurkar S Tsipras D Talwar K andMadry A Adversarially robust generalization requiresmore data arXiv preprint arXiv180411285 2018
Selvaraju R R Cogswell M Das A Vedantam RParikh D and Batra D Grad-cam Visual explanationsfrom deep networks via gradient-based localization In2017 IEEE International Conference on Computer Vision(ICCV) pp 618ndash626 IEEE 2017
Shaham U Yamada Y and Negahban S Understandingadversarial training Increasing local stability of neu-ral nets through robust optimization arXiv preprintarXiv151105432 2015
Shrikumar A Greenside P and Kundaje A Learningimportant features through propagating activation differ-ences arXiv preprint arXiv170402685 2017
Simonyan K Vedaldi A and Zisserman A Deep in-side convolutional networks Visualising image clas-sification models and saliency maps arXiv preprintarXiv13126034 2013
Sinha A Namkoong H and Duchi J Certifiable distribu-tional robustness with principled adversarial training InInternational Conference on Learning Representations2018
Smilkov D Thorat N Kim B Viegas F and Watten-berg M Smoothgrad removing noise by adding noisearXiv preprint arXiv170603825 2017
Song Y Shu R Kushman N and Ermon S Constructingunrestricted adversarial examples with generative modelsIn Advances in Neural Information Processing Systemspp 8322ndash8333 2018
Sundararajan M Taly A and Yan Q Axiomatic attribu-tion for deep networks arXiv preprint arXiv1703013652017
Tsipras D Santurkar S Engstrom L Turner A andMadry A Robustness may be at odds with accuracy2018
Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser Ł and Polosukhin I Atten-tion is all you need In Advances in Neural InformationProcessing Systems pp 5998ndash6008 2017
Wang X Girshick R Gupta A and He K Non-localneural networks In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 7794ndash7803 2018
Interpreting Adversarially Trained Convolutional Neural Networks
Xiao C Zhu J-Y Li B He W Liu M and Song DSpatially transformed adversarial examples In Interna-tional Conference on Learning Representations 2018
Xie C Wu Y van der Maaten L Yuille A and He KFeature denoising for improving adversarial robustnessarXiv preprint arXiv181203411 2018
Zeiler M D and Fergus R Visualizing and understand-ing convolutional networks In European conference oncomputer vision pp 818ndash833 Springer 2014
Zhang D Zhang T Lu Y Zhu Z and Dong B Youonly propagate once Painless adversarial training usingmaximal principle arXiv preprint arXiv1905008772019a
Zhang H Yu Y Jiao J Xing E P Ghaoui L E and Jor-dan M I Theoretically principled trade-off between ro-bustness and accuracy arXiv preprint arXiv1901085732019b
Zhou B Khosla A Lapedriza A Oliva A and TorralbaA Learning deep features for discriminative localizationIn Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pp 2921ndash2929 2016
Zintgraf L M Cohen T S Adel T and Welling MVisualizing deep neural network decisions Predictiondifference analysis arXiv preprint arXiv1702045952017
Interpreting Adversarially Trained Convolutional Neural Networks
Table 3 ldquoAccuracy on correctly classified imagesrdquo for different models on three Fourier-filtered Caltech-256 test setsDATA SET THE LOW FREQUENCY FILTERED VERSION THE HIGH FREQUENCY FILTERED VERSION THE RANDOM FILTERED VERSION
STANDARD 158 165 735UNDERFIT 145 176 622PGD-linfin 711 36 734
test set with different saturation levels Table 8 7 list allthe results of each models on test set after different path-shuffling operations
E Additional FiguresWe show additional sensitive maps in Figure 9 We alsocompare the sensitive maps using Grad and SmoothGradin Figure 10
ReferencesAdebayo J Gilmer J Muelly M Goodfellow I Hardt
M and Kim B Sanity checks for saliency maps InAdvances in Neural Information Processing Systems pp9525ndash9536 2018
Ancona M Ceolini E Oztireli C and Gross M To-wards better understanding of gradient-based attributionmethods for deep neural networks In 6th InternationalConference on Learning Representations (ICLR 2018)2018
Athalye A Carlini N and Wagner D Obfuscatedgradients give a false sense of security Circumvent-ing defenses to adversarial examples arXiv preprintarXiv180200420 2018
Bach S Binder A Montavon G Klauschen F MullerK-R and Samek W On pixel-wise explanations fornon-linear classifier decisions by layer-wise relevancepropagation PloS one 10(7)e0130140 2015
Ballester P and de Araujo R M On the performance ofgooglenet and alexnet applied to sketches In AAAI pp1124ndash1128 2016
Brendel W and Bethge M Approximating cnns withbag-of-local-features models works surprisingly well onimagenet In International Conference on Learning Rep-resentations 2019
Deng J Dong W Socher R Li L-J Li K and Fei-FeiL Imagenet A large-scale hierarchical image databaseIn Computer Vision and Pattern Recognition 2009 CVPR2009 IEEE Conference on pp 248ndash255 Ieee 2009
Ding G W Lui K Y-C Jin X Wang L and Huang ROn the sensitivity of adversarial robustness to input data
distributions In International Conference on LearningRepresentations 2019
Dong Y Su H Zhu J and Bao F Towards interpretabledeep neural networks by leveraging adversarial examplesarXiv preprint arXiv170805493 2017
Erhan D Bengio Y Courville A and Vincent P Visual-izing higher-layer features of a deep network Universityof Montreal 1341(3)1 2009
Gatys L A Ecker A S and Bethge M A neural algo-rithm of artistic style arXiv preprint arXiv1508065762015
Geirhos R Rubisch P Michaelis C Bethge M Wich-mann F A and Brendel W Imagenet-trained cnns arebiased towards texture increasing shape bias improvesaccuracy and robustness In International Conference onLearning Representations 2019
Girshick R Donahue J Darrell T and Malik J Rich fea-ture hierarchies for accurate object detection and semanticsegmentation In Proceedings of the IEEE conference oncomputer vision and pattern recognition pp 580ndash5872014
Goodfellow I J Shlens J and Szegedy C Explain-ing and harnessing adversarial examples arXiv preprintarXiv14126572 2014
Griffin G Holub A and Perona P Caltech-256 objectcategory dataset 2007
He K Zhang X Ren S and Sun J Deep residual learn-ing for image recognition In Proceedings of the IEEEconference on computer vision and pattern recognitionpp 770ndash778 2016a
He K Zhang X Ren S and Sun J Identity mappingsin deep residual networks In European conference oncomputer vision pp 630ndash645 Springer 2016b
Huang X and Belongie S Arbitrary style transfer in real-time with adaptive instance normalization In 2017 IEEEInternational Conference on Computer Vision (ICCV) pp1510ndash1519 IEEE 2017
Jo J and Bengio Y Measuring the tendency of cnnsto learn surface statistical regularities arXiv preprintarXiv171111561 2017
Interpreting Adversarially Trained Convolutional Neural Networks
Table 4 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Caltech-256 test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Caltech-256
SATURAION LEVEL 025 05 1 4 8 16 64 1024
STANDARD 2862 5745 8520 9013 6537 4237 2345 2003UNDERFIT 3184 6336 9096 8451 5751 3858 2600 2308PGD-linfin 8 3284 5347 8272 8645 7033 6109 5376 5191PGD-linfin 4 3199 5774 8518 8795 7033 5838 4816 4545PGD-linfin 2 3299 6075 8775 8935 6878 5199 4069 3783PGD-linfin 1 3267 6185 8936 9018 6907 5005 3798 3480PGD-l2 12 3138 5307 8210 8389 6706 5851 5245 5075PGD-l2 8 3282 5665 8501 8609 6890 5875 5159 4930PGD-l2 4 3282 5877 8630 8636 6794 5368 4443 4198FGSM 8 2953 5546 8510 8665 6901 5564 4592 4342FGSM 4 3268 5937 8722 8790 6671 5113 4166 3878
Table 5 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated Tiny ImageNet test set It is easily observed AT-CNNsare much more robust to increasing saturation levels on Tiny ImageNet
SATURAION LEVEL 025 05 1 4 8 16 64 1024
STANDARD 724 2588 7252 7273 2538 824 262 193UNDERFIT 734 2544 6980 6067 1801 672 316 265PGD-linfin 8 1107 2908 6711 7453 498 4016 3544 3396PGD-linfin 4 1244 3353 7294 7575 4638 3212 2492 2265PGD-linfin 2 1209 3485 7577 7615 4135 2520 1693 1452PGD-linfin 1 1130 3503 7685 7863 4048 2137 1270 1081PGD-l2 12 1130 2948 6694 7522 5226 4211 3720 3585PGD-l2 8 1242 3278 7194 7515 4792 3566 2955 2790PGD-l2 4 1263 3410 7406 7732 4500 2873 2016 1804FGSM 8 1259 3266 7055 8153 4183 1752 729 582FGSM 4 1263 3410 7406 7505 4291 2909 2215 2014
Table 6 ldquoAccuracy on correctly classified imagesrdquo for different models on saturated CIFAR-10 test set It is easily observed AT-CNNs aremuch more robust to increasing saturation levels on CIFAR-10
SATURAION LEVEL 025 05 1 4 8 16 64 1024
STANDARD 2736 5595 9103 9312 6998 4830 3439 3106UNDERFIT 2143 5028 8771 8989 6609 4335 2910 2613PGD-linfin 8 2605 4696 8097 8916 7546 6908 5898 6464PGD-linfin 4 2722 4981 8416 8979 7389 6535 5999 5847PGD-linfin 2 2832 5312 8693 9137 7402 6282 5525 5260PGD-linfin 1 2718 5359 8854 9177 7267 5839 4725 4175PGD-l2 12 2599 4692 8172 8844 7392 6603 6098 5941PGD-l2 8 2775 5029 8376 8092 7317 6483 5864 4694PGD-l2 4 2726 5117 8578 9008 7312 6150 5204 4879FGSM 8 2550 4611 8172 8767 7422 6712 6251 6132FGSM 4 2639 5893 8430 8902 7347 6443 5880 5682
Krizhevsky A Sutskever I and Hinton G E Imagenetclassification with deep convolutional neural networksIn Advances in neural information processing systemspp 1097ndash1105 2012
Kurakin A Goodfellow I and Bengio S Adversar-ial examples in the physical world arXiv preprintarXiv160702533 2016
Long J Shelhamer E and Darrell T Fully convolutional
networks for semantic segmentation In Proceedingsof the IEEE conference on computer vision and patternrecognition pp 3431ndash3440 2015
Luo T Cai T Zhang M Chen S and Wang L Randommask Towards robust convolutional neural networks2018
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Table 7. “Accuracy on correctly classified images” for different models on the Patch-shuffled Caltech-256 test set. Results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Caltech-256.
DATA SET          2 × 2  4 × 4  8 × 8
STANDARD          84.76  51.50  10.84
UNDERFIT          75.59  33.41   6.03
PGD-l∞ 8          58.13  20.14   7.70
PGD-l∞ 4          68.54  26.45   8.18
PGD-l∞ 2          74.25  30.77   9.00
PGD-l∞ 1          78.11  35.03   8.42
PGD-l2 12         58.25  21.03   7.85
PGD-l2 8          63.36  22.19   8.48
PGD-l2 4          69.65  28.21   7.72
FGSM 8            64.48  22.94   8.07
FGSM 4            70.50  28.41   6.03
Table 8. “Accuracy on correctly classified images” for different models on the Patch-shuffled Tiny ImageNet test set. Results indicate that AT-CNNs are more sensitive to Patch-shuffle operations on Tiny ImageNet.
DATA SET          2 × 2  4 × 4  8 × 8
STANDARD          66.73  24.87   4.48
UNDERFIT          59.22  23.62   4.38
PGD-l∞ 8          41.08  16.05   6.83
PGD-l∞ 4          49.54  18.23   6.30
PGD-l∞ 2          55.96  19.95   5.61
PGD-l∞ 1          60.19  23.24   6.08
PGD-l2 12         42.23  16.95   7.66
PGD-l2 8          47.67  16.28   6.50
PGD-l2 4          51.94  17.79   5.89
FGSM 8            57.42  20.70   4.73
FGSM 4            50.68  16.84   5.98
Figure 9. Visualization of salience maps generated from SmoothGrad (Smilkov et al., 2017) for all 11 models. From left to right: Standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4.
Figure 10. Visualization of salience maps generated from Grad for all 11 models. From left to right: Standard CNNs, underfitting CNNs, PGD-inf: 8, 4, 2, 1, PGD-L2: 12, 8, 4, and FGSM: 8, 4. It is easily observed that sensitivity maps generated from Grad are noisier than those from its smoothed variant SmoothGrad, especially for Standard CNNs and underfitting CNNs.
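The difference between the Grad and SmoothGrad maps in Figures 9 and 10 can be illustrated with a small sketch. Grad uses the gradient of the class score at the input itself, while SmoothGrad (Smilkov et al., 2017) averages that gradient over Gaussian-perturbed copies of the input. The function `grad_fn`, the noise scale `sigma`, the sample count `n_samples`, and the toy score below are assumptions made for illustration and are not the settings used for the figures.

```python
import random
from typing import Callable, List

Vector = List[float]


def smoothgrad(
    grad_fn: Callable[[Vector], Vector],
    x: Vector,
    sigma: float = 0.1,      # assumed noise scale, illustrative only
    n_samples: int = 50,     # assumed number of noisy copies
    seed: int = 0,
) -> Vector:
    """Average grad_fn over n_samples copies of x perturbed by N(0, sigma^2) noise."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        g = grad_fn(noisy)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]


def toy_grad(x: Vector) -> Vector:
    """Gradient of the hypothetical score s(x) = sum(x_i ** 2), standing in for a CNN."""
    return [2.0 * xi for xi in x]


pixel_values = [0.2, 0.5, 0.9]
grad_map = toy_grad(pixel_values)                  # plain Grad
smooth_map = smoothgrad(toy_grad, pixel_values)    # SmoothGrad
```

For a real network, `grad_fn` would be obtained from automatic differentiation of the predicted class score with respect to the input pixels; the averaging step stays the same.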
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, 2017.
Shaham, U., Yamada, Y., and Negahban, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Sinha, A., Namkoong, H., and Duchi, J. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
Smilkov, D., Thorat, N., Kim, B., Viegas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Song, Y., Shu, R., Kushman, N., and Ermon, S. Constructing unrestricted adversarial examples with generative models. In Advances in Neural Information Processing Systems, pp. 8322–8333, 2018.
Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. 2018.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Xie, C., Wu, Y., van der Maaten, L., Yuille, A., and He, K. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.
Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. Visualizing deep neural network decisions: Prediction difference analysis. arXiv preprint arXiv:1702.04595, 2017.
Table 4. “Accuracy on correctly classified images” for different models on the saturated Caltech-256 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Caltech-256.
SATURATION LEVEL   0.25    0.5      1      4      8     16     64   1024
STANDARD          28.62  57.45  85.20  90.13  65.37  42.37  23.45  20.03
UNDERFIT          31.84  63.36  90.96  84.51  57.51  38.58  26.00  23.08
PGD-l∞ 8          32.84  53.47  82.72  86.45  70.33  61.09  53.76  51.91
PGD-l∞ 4          31.99  57.74  85.18  87.95  70.33  58.38  48.16  45.45
PGD-l∞ 2          32.99  60.75  87.75  89.35  68.78  51.99  40.69  37.83
PGD-l∞ 1          32.67  61.85  89.36  90.18  69.07  50.05  37.98  34.80
PGD-l2 12         31.38  53.07  82.10  83.89  67.06  58.51  52.45  50.75
PGD-l2 8          32.82  56.65  85.01  86.09  68.90  58.75  51.59  49.30
PGD-l2 4          32.82  58.77  86.30  86.36  67.94  53.68  44.43  41.98
FGSM 8            29.53  55.46  85.10  86.65  69.01  55.64  45.92  43.42
FGSM 4            32.68  59.37  87.22  87.90  66.71  51.13  41.66  38.78
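The metric named in these captions can be sketched as follows. This assumes that “accuracy on correctly classified images” means: keep only the test images a model classifies correctly in their clean form, then report the percentage of those that are still classified correctly after the transformation. The function and argument names (`predict`, `transform`) are hypothetical placeholders for a trained model and a saturation or Patch-shuffle operation.

```python
from typing import Callable, Sequence, TypeVar

Image = TypeVar("Image")


def accuracy_on_correctly_classified(
    predict: Callable[[Image], int],
    transform: Callable[[Image], Image],
    images: Sequence[Image],
    labels: Sequence[int],
) -> float:
    """Percent of clean-correct images that stay correct after `transform`.

    Assumed reading of the metric; the selection is done per model.
    """
    # Step 1: keep only the images the model gets right without any transformation.
    kept = [(x, y) for x, y in zip(images, labels) if predict(x) == y]
    if not kept:
        raise ValueError("the model classifies no clean image correctly")
    # Step 2: re-evaluate the same images after the transformation.
    still_correct = sum(1 for x, y in kept if predict(transform(x)) == y)
    return 100.0 * still_correct / len(kept)


# Toy usage: a threshold "classifier" on scalar inputs and a shift as the transformation.
toy_predict = lambda x: int(x > 0)
toy_shift = lambda x: x - 1.5
score = accuracy_on_correctly_classified(
    toy_predict, toy_shift, images=[1.0, 2.0, -1.0, 3.0], labels=[1, 1, 1, 0]
)
```

Restricting the evaluation set this way keeps differences in clean test accuracy between models from leaking into the comparison.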
Table 5. “Accuracy on correctly classified images” for different models on the saturated Tiny ImageNet test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on Tiny ImageNet.
SATURATION LEVEL   0.25    0.5      1      4      8     16     64   1024
STANDARD           7.24  25.88  72.52  72.73  25.38   8.24   2.62   1.93
UNDERFIT           7.34  25.44  69.80  60.67  18.01   6.72   3.16   2.65
PGD-l∞ 8          11.07  29.08  67.11  74.53   49.8  40.16  35.44  33.96
PGD-l∞ 4          12.44  33.53  72.94  75.75  46.38  32.12  24.92  22.65
PGD-l∞ 2          12.09  34.85  75.77  76.15  41.35  25.20  16.93  14.52
PGD-l∞ 1          11.30  35.03  76.85  78.63  40.48  21.37  12.70  10.81
PGD-l2 12         11.30  29.48  66.94  75.22  52.26  42.11  37.20  35.85
PGD-l2 8          12.42  32.78  71.94  75.15  47.92  35.66  29.55  27.90
PGD-l2 4          12.63  34.10  74.06  77.32  45.00  28.73  20.16  18.04
FGSM 8            12.59  32.66  70.55  81.53  41.83  17.52   7.29   5.82
FGSM 4            12.63  34.10  74.06  75.05  42.91  29.09  22.15  20.14
Table 6. “Accuracy on correctly classified images” for different models on the saturated CIFAR-10 test set. It is easily observed that AT-CNNs are much more robust to increasing saturation levels on CIFAR-10.
SATURATION LEVEL   0.25    0.5      1      4      8     16     64   1024
STANDARD          27.36  55.95  91.03  93.12  69.98  48.30  34.39  31.06
UNDERFIT          21.43  50.28  87.71  89.89  66.09  43.35  29.10  26.13
PGD-l∞ 8          26.05  46.96  80.97  89.16  75.46  69.08  58.98  64.64
PGD-l∞ 4          27.22  49.81  84.16  89.79  73.89  65.35  59.99  58.47
PGD-l∞ 2          28.32  53.12  86.93  91.37  74.02  62.82  55.25  52.60
PGD-l∞ 1          27.18  53.59  88.54  91.77  72.67  58.39  47.25  41.75
PGD-l2 12         25.99  46.92  81.72  88.44  73.92  66.03  60.98  59.41
PGD-l2 8          27.75  50.29  83.76  80.92  73.17  64.83  58.64  46.94
PGD-l2 4          27.26  51.17  85.78  90.08  73.12  61.50  52.04  48.79
FGSM 8            25.50  46.11  81.72  87.67  74.22  67.12  62.51  61.32
FGSM 4            26.39  58.93  84.30  89.02  73.47  64.43  58.80  56.82
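To make the saturation levels in the table headers concrete, the sketch below assumes the saturation operation has the per-pixel form sign(2v − 1) · |2v − 1|^(2/p) / 2 + 1/2 for a pixel value v in [0, 1] and level p. Under that assumption p = 2 leaves the image unchanged (which would explain why level 2 is absent from the tables), large p pushes pixels toward 0 or 1, and small p pulls them toward mid-gray. The function names are hypothetical.

```python
import math
from typing import List


def saturate_pixel(v: float, p: float) -> float:
    """Saturate one pixel value v in [0, 1] at level p > 0 (assumed formula)."""
    centered = 2.0 * v - 1.0                  # map [0, 1] to [-1, 1]
    magnitude = abs(centered) ** (2.0 / p)    # p > 2 amplifies, p < 2 shrinks
    return 0.5 * math.copysign(magnitude, centered) + 0.5


def saturate_image(image: List[List[float]], p: float) -> List[List[float]]:
    """Apply saturate_pixel to every entry of a 2-D list of pixel values."""
    return [[saturate_pixel(v, p) for v in row] for row in image]


toy_image = [[0.25, 0.50], [0.75, 1.00]]
unchanged = saturate_image(toy_image, p=2)        # identity under the assumed formula
near_binary = saturate_image(toy_image, p=1024)   # values pushed toward 0 or 1
washed_out = saturate_image(toy_image, p=0.25)    # values pulled toward 0.5
```

Because heavy saturation removes most texture while leaving contours largely intact, the rightmost columns of the tables act as a probe for how much a model relies on shape.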