
Ptolemy: Architecture Support for Robust Deep Learning

Yiming Gan* (University of Rochester), Yuxian Qiu* (Shanghai Jiao Tong University), Jingwen Leng (Shanghai Jiao Tong University), Minyi Guo (Shanghai Jiao Tong University), Yuhao Zhu (University of Rochester)

Abstract—Deep learning is vulnerable to adversarial attacks, where carefully-crafted input perturbations could mislead a well-trained Deep Neural Network (DNN) to produce incorrect results. Adversarial attacks jeopardize the safety, security, and privacy of DNN-enabled systems. Today's countermeasures to adversarial attacks either do not have the capability to detect adversarial samples at inference time, or introduce prohibitively high overhead to be practical at inference time.

We propose Ptolemy, an algorithm-architecture co-designed system that detects adversarial attacks at inference time with low overhead and high accuracy. We exploit the synergies between DNN inference and imperative program execution: an input to a DNN uniquely activates a set of neurons that contribute significantly to the inference output, analogous to the sequence of basic blocks exercised by an input in a conventional program. Critically, we observe that adversarial samples tend to activate paths distinctive from those of benign inputs. Leveraging this insight, we propose an adversarial detection framework, which uses canary paths generated from offline profiling to detect adversarial samples at runtime. The Ptolemy compiler, along with the co-designed hardware, enables efficient execution by exploiting the unique algorithmic characteristics. Extensive evaluations show that Ptolemy achieves higher or similar adversarial detection accuracy than today's mechanisms with much lower (as low as 2%) runtime overhead.

Keywords—DNN; Robustness; Deep learning; Adversarial Attack; Adversarial Samples; Defense

Artifact—https://github.com/Ptolemy-DL/Ptolemy

I. Introduction

Deep Neural Networks (DNN) are not robust. Small perturbations to inputs could easily "fool" DNNs to produce incorrect results. By manipulating the inputs, a range of so-called adversarial attacks have been demonstrated to mislead DNNs to mis-predict [50], [13], [36], [63], [48], [28], leading to potentially severe consequences. For instance, physically putting a sticker on a stop sign could lead a well-trained object recognition DNN to misclassify the stop sign as a yield sign [36]. Beyond mission-critical scenarios such as autonomous driving, the robustness issue also obstructs the deployment of DNNs in privacy/security-sensitive domains such as biometric authentication [51], [60].

We take a first step toward architectural support for robust deep learning. For a robustness scheme to be effective in practice, it not only has to accurately detect adversarial inputs, but must also do so efficiently at inference time so that proper measures can be taken. This paper proposes Ptolemy, an algorithm-architecture co-design system that detects adversarial attacks at inference time with low overhead and high accuracy. This enables applications to reject incorrect results produced by adversarial attacks during inference. Fig. 1 provides an overview of the system.

*Equal contribution

Existing countermeasures to adversarial attacks are unable to detect adversarial samples at inference time [12], [25]. Fundamentally, they treat DNN inferences as black boxes, ignoring their runtime behaviors. To enable efficient online adversarial detection, this paper takes a different approach. We exploit the fact that each input to a DNN uniquely exercises an activation path—a collection of neurons that contribute significantly to the inference output, analogous to the sequence of basic blocks exercised by an input in a conventional program. Analyzing "hot" activation paths in DNNs, our key observation is that inputs that lead to the same inference class tend to exercise a group of paths that are distinctive from those of other inference classes.

We propose a general algorithmic framework that exploits the runtime path behaviors for efficient online adversarial detection. The detection framework constructs a canary class path offline for each inference class by profiling the training data. At runtime, it builds the activation path for an input, and detects the input as an adversary if the activation path is different from the canary path associated with the predicted class. The general algorithm framework exposes a myriad of design knobs affecting the critical trade-off between detection accuracy and compute cost, such as how a path is formulated and when the path is constructed. To widen the applicability of our detection framework, Ptolemy provides a high-level programming interface, which allows programmers to calibrate the algorithmic knobs to explore the accuracy-cost trade-off that best suits an application's needs.

Ptolemy provides an efficient execution substrate. The key to the execution efficiency is the Ptolemy compiler, which hides and reduces the detection overhead by exploiting the unique parallelisms and redundancies exposed by the detection algorithms. We show that with aggressive compile-time optimizations and a well-defined ISA, detection algorithms can be implemented on top of existing DNN accelerators with a set of basic, yet principled, hardware extensions, further widening the applicability of Ptolemy.


Fig. 1: Ptolemy system overview (algorithm framework with knobs, programming interface, compiler optimizations, ISA, and hardware architecture).


Ptolemy enables highly accurate adversarial detection with low performance overhead. Compared to today's defense mechanisms that introduce over 10× performance overhead, we demonstrate a system that achieves higher accuracy with only 2% performance overhead. Ptolemy defends against not only existing attacks, but also adaptive attacks that are specifically designed to defeat our defense [11]. We also demonstrate the Ptolemy framework's flexibility by presenting a range of algorithm variants that offer different accuracy-efficiency trade-offs. For instance, Ptolemy could trade 10% performance overhead for 0.03 higher detection accuracy.

The Ptolemy artifact, including the pre-trained models, offline-generated class paths, code to generate adaptive and non-adaptive attacks, and the detection implementation, is available at https://github.com/Ptolemy-DL/Ptolemy. In summary, Ptolemy provides a generic framework for low-overhead, high-accuracy online defense against adversarial attacks with the following contributions:

• We propose a novel static-dynamic collaborative approach for adversarial detection by exploiting the unique program execution characteristics of DNN inferences that have been largely ignored before.

• We present a general algorithmic framework, along with a high-level programming interface, that allows programmers to explore key algorithm design knobs to navigate the accuracy-efficiency trade-off space.

• We demonstrate that with a carefully-designed ISA, compiler optimizations could enable efficient detection by exploiting the unique parallelisms and redundancies exposed by our detection algorithm framework.

• We present programmable hardware that achieves low-latency online adversarial defense with principled extensions to existing DNN accelerators.

II. Background

Adversarial Attacks DNNs are not robust to adversarial attacks, where DNNs mis-predict under slightly perturbed inputs [13], [36], [50], [45]. Fig. 2 shows one such example, where the two slightly different images are both perceived as stop signs by human eyes, but the second image is mis-predicted by a DNN model as a yield sign. The perturbations could be the result of carefully engineered attacks, but could also be an artifact of normal data acquisition such as noisy sensor capture and image compression/resizing [64].

Fig. 2: Adversarial example using the FGSM [22] attack (legitimate sample, adversarial sample, and perturbation).

Formally, given a DNN C and an input x, an input x′ is defined as an adversarial sample if it is close to x yet makes C*(x) = C(x) ≠ C(x′), where C*(x) is the correct class of x. Different adversarial samples differ in their measures of the distance between x and x′. The distance could be small, where the input perturbations are imperceptible to humans (as in the example above), but could also be large, where the perturbations are visible to humans but still "fool" a DNN. For instance, physically putting a sticker on a stop sign could mislead a DNN to misclassify the stop sign as a yield sign [36], [38]. Ptolemy targets the general robustness issue that introduces mis-predictions through input perturbations—small or large, inadvertent or malicious. For simplicity, we refer to all of them as adversarial attacks throughout this paper.

An adversarial attack is a black-box attack if it does not assume knowledge of the attacked model; white-box attacks, in contrast, assume full knowledge of the model. Orthogonally, adaptive attacks have complete knowledge of the defense's inner workings, i.e., they are specifically designed to attempt to defeat a defense, while non-adaptive attacks do not [65], [11], [12]. We show that our detection scheme can defend against a range of different attacks, including the strongest form of attack: white-box adaptive attacks.

Countermeasures We aim to enable fast and accurate systems that can detect adversarial examples at inference time such that proper measures can be taken. Today's defense mechanisms largely fall under two categories, neither of which meets this goal. The first class of defenses improves the robustness of DNN models at training time (e.g., adversarial retraining) [66], [72] by incorporating adversarial examples into the training data. However, re-training is not suitable at inference time and requires access to the training data. Another class of defenses uses redundancies to defend against adversarial attacks [64], [54], similar to the multi-module redundancy used in classic fault-tolerant systems [59]. This scheme, however, introduces high overhead, limiting its applicability at inference time.


Fig. 3: Extracting important neurons from a fully-connected layer (left) and a convolution layer (middle), and constructing the activation path from important neurons across layers (right). Activation paths are input-specific. This figure illustrates backward extraction using cumulative thresholds. Forward extraction would start from the first layer rather than from the last layer. Absolute thresholding would select important neurons based on absolute partial sums rather than cumulative partial sums.


III. Algorithmic Framework

This section introduces the Ptolemy algorithm framework, which enables adversarial attack detection at inference time with high accuracy and low latency. Ptolemy provides a set of principled design knobs to allow programmers to customize the accuracy vs. efficiency trade-off.

We first describe the intuition and key concepts behind our algorithm framework (Sec. III-A). We then introduce the algorithm framework, and show that a basic algorithm under the framework introduces excessive compute and memory cost (Sec. III-B). We further introduce key algorithmic knobs that enable different algorithm variants to offer different accuracy-efficiency trade-offs (Sec. III-C). Finally, we introduce a high-level programming interface to flexibly express detection algorithms within our framework (Sec. III-D).

III-A. Intuition and Key Concepts

Intuition Each input to a DNN activates a sequence of neurons. We find that inputs that are correctly predicted as the same class tend to activate a unique set of neurons distinctive from that of other inputs. This is a manifestation of recent work on class-level model sparsity [52], [69], which shows that a small, but distinctive, portion of the network contributes to each predicted class. Taking this perspective, the way adversarial samples alter the inference result can be thought of as activating a sequence of neurons different from the canonical sequence associated with the predicted output. Analyzing dynamic paths in DNN inferences thus allows us to detect adversaries.

A sequence of activated neurons is analogous to a sequence of basic blocks exercised by an input to a conventional program. The frequently exercised basic block sequences, i.e., "hot paths" [7], [20], [15], can be used to improve performance in classic profile-guided optimizations and dynamic compilers [57], [56], [19]. Our approach shares a similar idea, where we treat a DNN as an imperative program, and leverage its runtime paths (sequences of neurons) to guide adversarial sample detection. Conventional countermeasures largely ignore the program execution behaviors of DNN inferences.

Important Neurons The premise of our detection algorithm framework is the notion of important neurons, which denote a set of neurons that contribute significantly to the inference output. Important neurons are extracted in a backward fashion. The last layer Ln has only one important neuron, which is the neuron n that corresponds to the predicted class. At the second-to-last layer Ln−1, the important neurons are the minimal set of neurons in the input feature map that contribute at least a fraction θ (0 ≤ θ ≤ 1) of n's value. Here, θ controls the coverage of important neurons. To extract the important neurons of layer Ln−1, we simply rank the partial sums used to calculate n, and choose the minimal number of neurons whose partial sums collectively contribute at least θ×n.

The left panel in Fig. 3 shows an example using a fully-connected layer. Assume θ = 0.6 and the second neuron in the output feature map (0.46) is the important neuron identified in the next layer. The fourth (1.0) and the fifth (0.1) neurons in the input feature map are identified as the important neurons in the current layer, because they contribute the two largest partial sums, and their cumulative partial sum (0.3) contributes more than 60% of the important neuron's value in the output feature map. The same process can be extended to convolution layers. The middle panel in Fig. 3 shows an example. For the important neuron in the output feature map, we first find its receptive field in the input feature map, and then identify the minimal set of neurons in the receptive field whose cumulative partial sums contribute at least θ×n.
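The extraction rule above is easy to state in code. Below is a minimal NumPy sketch of cumulative-threshold extraction for one fully-connected output neuron, reproducing the worked example from Fig. 3; the function name important_inputs_fc and the array layout are our illustrative assumptions, not part of Ptolemy's interface.

import numpy as np

def important_inputs_fc(x, w_col, theta=0.6):
    # Partial sums feeding the important output neuron n.
    psums = x * w_col
    order = np.argsort(psums)[::-1]   # rank partial sums, largest first
    target = theta * psums.sum()      # psums.sum() is the value of n
    picked, running = [], 0.0
    for i in order:                   # minimal set covering theta * n
        picked.append(int(i))
        running += psums[i]
        if running >= target:
            break
    return picked

# Worked example from Fig. 3 (left): theta = 0.6, output neuron value 0.46.
x     = np.array([0.2, 0.3, 0.4, 1.0, 0.1])   # input feature map
w_col = np.array([0.1, 0.2, 0.2, 0.09, 2.1])  # kernel column feeding n
print(important_inputs_fc(x, w_col))          # -> [4, 3]: the 0.1 and 1.0 neurons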

This process is repeated backward from the last layer to the first layer, as shown in the right panel in Fig. 3. The important neurons identified at layer Li are used to determine the important neurons at layer Li−1.


Fig. 4: Adversarial detection algorithm framework. It provides a range of knobs for path extraction (extraction direction: forward vs. backward; thresholding mechanism: cumulative vs. absolute; selective extraction: start/termination layer), which dominates the runtime overhead. Note that the path extraction methods in the offline and online phases must match.

Fig. 5: Class path similarity (θ = 0.5). (a) AlexNet @ ImageNet. (b) ResNet18 @ CIFAR-10.


From Neurons to Paths The collection of important neurons across all the layers under a given input constitutes an activation path of that input, similar to how a sequence of basic blocks constitutes a path/trace in a program. We represent a path using a bitmask, where each bit m_{i,j} indicates whether the neuron (input feature map element) at layer i, position j is an important neuron.
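As a small sketch of this representation (our own helper, not Ptolemy's API), the whole-network path is simply the concatenation of the per-layer masks:

import numpy as np

def activation_path(layer_masks):
    # Concatenate the per-layer bitmasks m_{i,j} into one bit vector
    # covering every neuron position in the network.
    return np.concatenate([m.ravel() for m in layer_masks])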

From individual activation paths, we introduce the concept of a class path for a class c, which aggregates (bitwise OR) the activation paths of different inputs that are correctly predicted as class c. That is, Pc = ⋃_{x ∈ xc} P(x), where P(x) denotes the activation path of input x, xc denotes the set of all the correctly predicted inputs of class c, ⋃ denotes bitwise OR, and Pc denotes the class path of class c. We observe that Pc starts to saturate around 100 images, and including more images from the training dataset does not result in all bits becoming 1. We do not manually stop filling the bits.
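A minimal sketch of the aggregation, assuming paths are stored as NumPy boolean vectors (the names are ours): new training samples are folded in with a single OR, which is what makes incremental profiling cheap.

import numpy as np

def class_path(paths):
    # P_c = bitwise OR over the activation paths of all correctly
    # predicted inputs of class c; OR-ing in one more path updates
    # P_c incrementally without regenerating it from scratch.
    pc = np.zeros_like(paths[0])
    for p in paths:
        pc |= p
    return pc

paths = [np.array([1, 0, 1, 0], bool), np.array([0, 0, 1, 1], bool)]
print(class_path(paths))   # -> [ True False  True  True]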

Critically, class paths are significantly different from each other. Fig. 5a shows the path similarity in AlexNet [35] across 10 randomly-sampled classes from ImageNet [17]. Fig. 5b shows the path similarity in ResNet18 [61] across the 10 classes in CIFAR-10 [34]. All the results are obtained on the training set. The average inter-class path similarity is only 36.2% (max 38.2%, 90th percentile 36.6%) for AlexNet on ImageNet and 61.2% (max 65.1%, 90th percentile 63.4%) for ResNet18 on CIFAR-10, suggesting that class paths are distinctive. In an attempt to normalize the dataset, we also perform the same experiment with ResNet50 on ImageNet. The average inter-class path similarity is 37.6% (max 40.9%, 90th percentile 39.1%), similar to that of AlexNet on ImageNet.

The class path similarity is much higher on CIFAR-10 than on ImageNet. This is because ImageNet has 1,000 classes that cover a wide range of objects, while CIFAR-10 has only 10 classes, which are similar to each other (e.g., cat vs. dog). The randomly picked 10 classes in ImageNet are thus more likely to differ from each other than the 10 classes in CIFAR-10. Across all the 1,000 classes in ImageNet, the maximum inter-class path similarity is still only 0.44, suggesting that our random sampling of ImageNet is representative.

III-B. Detection Framework and Cost Analysis

We leverage the clear distinction across different class paths to detect adversarial inputs. If an input x is predicted as class c while its activation path P(x) does not resemble the class path Pc, we hypothesize that the input is an adversary.

Framework Fig. 4 shows an overview of the algorithm framework, which requires static-dynamic collaboration. The static component profiles the training data to extract activation paths P(x) for each correctly predicted sample x, and generates the class path Pc for each class c as described before. The class paths are stored offline and reused over time. Critically, our profiling method can easily integrate new training samples, whose activation paths are simply aggregated (OR-ed) with the existing class paths without having to re-generate the entire class paths from scratch.

At inference time, the dynamic component extracts the path for a given input. Note that activation paths are extracted only after the entire DNN inference finishes, because the identification of important neurons starts from the predicted class in the last layer and propagates backward. We will show other variants in Sec. III-C that relax this restriction.

Given the activation path P(x) of an input x and the canary class path Pc, where c is the predicted class of x, a classification module then decides whether x is an adversary based on the similarity between P(x) and Pc. While a range of similarity metrics and algorithms could be used, we propose a lightweight algorithm that is extremely efficient to compute while providing high accuracy. Specifically, we first estimate the similarity S between P(x) and Pc as S = ‖P(x) & Pc‖1 / ‖P(x)‖1, where ‖P‖1 denotes the number of 1s in the vector P, and & denotes bitwise AND. S is fed into a learned classifier, for which we use the lightweight random forest method [39], for the final classification.


The classification module is lightweight, contributing less than 0.1% of the total detection cost.
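To make the classification step concrete, here is a small sketch with NumPy and scikit-learn; the synthetic similarity values and the use of RandomForestClassifier as the learned classifier are our illustrative stand-ins, not Ptolemy's released code.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def path_similarity(px, pc):
    # S = ||P(x) & P_c||_1 / ||P(x)||_1 for boolean bitmask vectors.
    return np.count_nonzero(px & pc) / np.count_nonzero(px)

# Offline: fit the classifier on similarities of known benign (label 0)
# and adversarial (label 1) samples; the values here are synthetic.
sims   = np.array([[0.92], [0.88], [0.95], [0.41], [0.35], [0.50]])
labels = np.array([0, 0, 0, 1, 1, 1])
clf = RandomForestClassifier(n_estimators=100).fit(sims, labels)

# Online: flag the input if its path diverges from the canary class path.
px = np.array([1, 1, 0, 1, 0], dtype=bool)   # activation path P(x)
pc = np.array([1, 1, 1, 1, 0], dtype=bool)   # class path of predicted class
is_adversary = bool(clf.predict([[path_similarity(px, pc)]])[0])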

Cost Analysis The algorithm described above is able to achieve accuracy higher than state-of-the-art methods (see Sec. VII). However, runtime extraction of activation paths also introduces significant memory and compute costs.

The memory cost is significant because every single partial sum generated during inference must be stored in memory before the path extraction process begins. The detection algorithm introduces 9× to 420× memory overhead, which is a lower bound of the actual memory traffic overhead in real systems because the massive partial sums will not be buffered completely on-chip. Storing partial sums will also stall the compute units and increase latency.

Path extraction also introduces compute overhead due to sorting and accumulating partial sums. Using AlexNet as an example, at θ = 0.9 the compute overhead could be as high as 30%. At first glance, it might be surprising that the compute overhead is "only" 30%. Further investigation shows that the percentage of important neurons in a network is generally below 5% even with θ = 0.9. Thus, the expensive sorting and accumulation operations are applied to only a small portion of partial sums. Note that the compute cost shown here leads to much higher latency overhead in reality because, while inference is massively parallel, sorting and accumulating are much less so. A pure software implementation of the detection algorithm introduces 15.4× and 50.7× overhead over inference on AlexNet and ResNet50, respectively.

III-C. Algorithmic Knobs and Variants

To trade a little accuracy loss for significant efficiency gains, we introduce three algorithmic knobs that control how activation paths are extracted, which dominates the runtime performance/energy overhead. The result is a set of algorithm variants that follow the same algorithm framework described in Fig. 4, but differ in how the paths are extracted.

Hiding Detection Cost: Extraction Direction

The cost introduced by the basic detection algorithm directly increases the inference latency because path extraction and inference must be serialized. We identify a key algorithmic knob that provides the opportunity to hide the compute cost of detection by overlapping detection with inference.

The key to the new algorithm is to extract important neurons in a forward rather than a backward manner. Recall that in the original backward extraction process, we use the important neurons in layer Li's output (which is equivalent to layer Li+1's input) to identify the important neurons in layer Li's input. In our new forward extraction process, as soon as layer Li finishes inference, we first determine the important neurons in its output by simply ranking output neurons according to their numerical values and selecting the largest ones, instead of waiting until after the extraction of layer Li+1. In this way, the extraction of important neurons at layer Li and the inference of layer Li+1 can be overlapped.
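A rough sketch of per-layer forward extraction, under our reading of the ranking step and assuming non-negative (post-ReLU) activations; forward_extract is a hypothetical name:

import numpy as np

def forward_extract(ofmap, theta=0.6):
    # As soon as a layer finishes, rank its output neurons by value and
    # keep the smallest set of the largest activations covering theta
    # of the layer's total output; no need to wait for the next layer.
    flat = ofmap.ravel()
    order = np.argsort(flat)[::-1]
    cum = np.cumsum(flat[order])
    k = int(np.searchsorted(cum, theta * flat.sum())) + 1
    mask = np.zeros(flat.size, dtype=bool)
    mask[order[:k]] = True            # this layer's slice of the path
    return mask.reshape(ofmap.shape)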

Reducing Detection Cost: Thresholding Mechanism

 1  def AdversaryDetection(model, input, θ, φ):
 2      output = Inference(model, input)
 3      N = model.num_layers
 4      // Selective extraction only in the last three layers
 5      for L in range(N-3, N):
 6          if L != N-1:
 7              // Forward extraction using absolute thresholds
 8              ImptN[L] = ExtractImptNeurons(1, 1, φ, L)
 9          else:
10              // Forward extraction using cumulative thresholds
11              ImptN[L] = ExtractImptNeurons(1, 0, θ, L)
12          dynPath.concat(GenMask(ImptN[L]))
13      classPath = LoadClassPath(argmax(output))
14      is_adversary = Classify(classPath, dynPath)
15      return is_adversary

Fig. 6: An adversarial detection algorithm expressed using the programming interface.

The forward extraction process hides the extraction behind inference, but does not reduce the detection cost, which could significantly increase the energy overhead.

To reduce the detection cost, we propose to extract important neurons using absolute thresholds rather than cumulative thresholds. Whenever a partial sum is generated during inference, it is compared against an absolute threshold φ. A single-bit mask is stored to memory based on the comparison result. Later, during path extraction, the masks (as opposed to partial sums) are loaded to determine important neurons. Thresholding can be specified at each layer, and can be applied to both extraction directions.

Using absolute thresholds significantly reduces both the compute and memory costs (Sec. VII-C), because comparing partial sums against a threshold is much cheaper than sorting and accumulating them, and writing single-bit masks rather than partial sums significantly reduces the memory accesses.
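The mechanism reduces to a single comparison per partial sum; a toy sketch (the function name is ours):

import numpy as np

def absolute_threshold_masks(psums, phi):
    # Compare each partial sum against phi as it is produced and keep
    # only a single-bit mask; the values themselves are never stored.
    return psums > phi

print(absolute_threshold_masks(np.array([0.02, 0.30, 0.08, 0.21]), phi=0.1))
# -> [False  True False  True]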

Reducing Detection Cost: Selective Extraction

An orthogonal way to reduce the cost is to skip important neurons from certain layers altogether. In many networks, later layers have a more significant impact on the inference output than earlier layers [53]. Thus, one could extract important neurons from just the last few layers to further reduce the cost (Sec. VII-F). When combined with forward extraction, this is equivalent to starting extraction later ("late-start"); when combined with backward extraction, this is equivalent to terminating extraction earlier ("early-termination"). This knob specifies the start/termination layer.

Summary The Ptolemy framework provides three different knobs to explore the accuracy-efficiency trade-off. While the extraction direction applies to the entire network and hides the detection cost behind the inference cost, the thresholding mechanism and the extracted layers are specified at the layer level to reduce the detection cost.

III-D. Programming Interface

Ptolemy provides a (Python-based) programming interface that allows programmers to express the range of algorithmic design knobs described above. Our programming interface is designed with two principles in mind, which we will explain using an actual detection algorithm expressed using the programming interface, shown in Fig. 6.


TABLE I: Summary of Ptolemy instructions. Operands in the first three instruction classes are registers to simplify encoding. Each instruction uses a 24-bit encoding: bits 23-20 hold the opcode, and the operand fields occupy successive 4-bit fields from bit 19 down; remaining bits are unused.

Class             | Name       | Opcode (23-20) | Operand fields (19-0)
Inference         | inf        | 0000           | Input addr., Weight addr., Output addr.
                  | infsp      | 0001           | Input addr., Weight addr., Output addr., First partial sum addr.
                  | csps       | 0010           | Output neuron ID, Layer ID, First partial sum addr.
Path Construction | sort       | 0011           | Unsorted seq. start addr., Seq. length, Sorted seq. start addr.
                  | acum       | 0100           | Input addr., Output addr., Cumulative threshold
                  | genmasks   | 0101           | Input addr., Output addr.
                  | findneuron | 0110           | Layer ID, Neuron position, Target neuron addr.
                  | findrf     | 0111           | Neuron addr., Receptive field addr.
Classification    | cls        | 1000           | Class path addr., Activation path addr., Result
Others            | Omitted for simplicity (mov, dec, jne, etc.)


Decoupled Inference/Detection The Ptolemy programming interface decouples inference from detection, which allows programmers to focus on expressing the functionality of the detection algorithm while leaving optimizations to the compiler and runtime. For instance, while the inference code (Line 2) and the path extraction code (Lines 3–15) are expressed sequentially in the program, our compiler will recognize that the program uses the forward extraction algorithm (Lines 8 and 11), and thus will automatically pipeline inference with important neuron extraction across layers (see Sec. IV-B).

Per-Layer Extraction Granularity Our programming interface provides the flexibility to specify the important neuron extraction method for each layer, leveraging the three knobs described above to explore the efficiency-accuracy trade-off space. We will demonstrate its effectiveness in Sec. VII-F. For instance, in Fig. 6 the programmer selectively extracts important neurons only for the last three layers (Line 5). In addition, only the last layer uses the cumulative threshold to extract important neurons (Line 11), which is more accurate but requires more computation than using absolute thresholds, the method used by the other two layers (Line 8). Note that we do not allow backward extraction and forward extraction to be combined in one network, to avoid discrepancies at the layer where they would join.

IV. ISA and Compiler Optimizations

This section describes how Ptolemy efficiently maps detection algorithms expressed in the high-level programming interface to the hardware architecture. To that end, we first introduce the software-hardware interface, i.e., the Instruction Set Architecture (ISA) (Sec. IV-A), followed by the compiler optimizations (Sec. IV-B).

IV-A. Instruction Set Architecture

Ptolemy provides a custom CISC-like ISA to allow efficient mapping from high-level detection algorithms to the hardware architecture. The design principles of the ISA are two-fold. First, it abstracts away hardware implementation details; the semantics are closer to high-level DNN programmers, and the instructions are decomposed into micro-instructions controlled by an FSM. Second, it exposes opportunities for the compiler and hardware to exploit parallelism.

The Ptolemy ISA contains four types of instructions: Inference, Path Construction, Classification, and Others. They are high-level instructions in the CISC style that perform complex operations. We use a 24-bit fixed-length encoding, and provide 16 general-purpose registers. Table I summarizes the instructions. We highlight key design decisions.

• Inference These instructions dictate the inference process. In addition to supporting usual inference (inf), Ptolemy also provides an instruction that stores the partial sums to memory (infsp) during inference for backward extraction. Each inference instruction operates on one layer to match the per-layer extraction semantics in the high-level programming interface. Finally, the ISA also provides a special instruction that calculates and stores all the partial sums for a given output feature map element (csps), which is used by the compiler for memory optimizations.

• Path Construction This class of instructions is used to construct the activation path dynamically at runtime for any given input. To construct a path, the ISA provides instructions to identify important neurons (sorting: sort; accumulating: acum) and to generate the masks from the identified important neurons to form an activation path (genmasks). There are also instructions to calculate neuron addresses, which are convenient for finding the start address of the receptive field of a given neuron (findrf) and for locating a neuron given its position in the network (findneuron).

• Classification The classification instruction (cls) is used to classify an input as either adversarial or benign.

• Others The ISA provides a set of control-flow instructions (e.g., jne), arithmetic instructions (e.g., dec), and scalar data movement instructions (e.g., mov).

Example Lst. 1 shows sample code that uses cumulative thresholds to extract important neurons. Through a loop, it iteratively finds a receptive field (findrf), sorts the partial sums in the receptive field (sort), and uses the sorted partial sums to identify important neurons whose cumulative partial sums exceed the threshold (acum).


.set rfsize 0x200
.set thrd 0x08
    mov r3, rfsize
    mov r5, thrd
<start>
    [update r7 & r2 for next output neuron]
    findneuron r2, r7, r4
    mul r5, (r4)
    findrf r4, r1
    sort r1, r3, r6
    acum r6, r1, r5
    dec r11
    jne <start>

Listing 1: Generating important neurons using a cumulative threshold. .set is a directive setting compiler-calculated constants. [code] indicates code omitted for simplicity.

It highlights an important design decision of the Ptolemy ISA: all the detection-related instructions use register operands. This design simplifies instruction encoding with little performance impact. For instance, the findrf instruction requires the receptive field size as an operand, which can be statically calculated by the compiler given the DNN model configuration. Since the receptive field size could be arbitrarily large and thus does not always fit in a reasonable, fixed-length encoding, a mov instruction is used to move the statically calculated immediate value to a register (r3), which is later used in the sort instruction. While a more complex instruction encoding that limits the range of immediate operands could eliminate this mov instruction, the performance overhead introduced by this mov instruction is negligible compared to the heavy-duty sort and acum instructions.

IV-B. Code Generation and Optimization

The compiler maximizes performance by exploiting unique parallelisms and redundancies inherent to the detection algorithms. This is achieved through statically scheduling instructions, which minimizes runtime overhead and hardware complexity. Static scheduling is possible because the compute and memory access behaviors of both DNN inference and detection are known at compile time.

Layer-Level Pipelining A key characteristic of algorithms that use the forward extraction method is that inference and extraction of different layers can be overlapped. While the high-level programming interface decouples inference (Inference) and extraction (ExtractImptNeurons), and expresses them sequentially, our compiler will reorder instructions to enable automatic pipelining at runtime, in a way similar to the classic software-pipelining technique [5].

Fig. 7a shows an example. We use pseudo-code to remove unnecessary details. <extraction for j> indicates the code block for extracting important neurons at layer j, and inf(j) indicates inference at layer j. By simply reordering instructions, inference of layer j+1 and extraction of layer j, which are independent, can be pipelined.

(a) Overlapping inference with extraction across layers in forward extraction; L is the total number of DNN layers. Before scheduling:

for j = 1 to L {
  inf(j)
  <extraction for j>
}

After scheduling:

inf(1)
for j = 1 to L-1 {
  inf(j+1)
  <extraction for j>
}
<extraction for L>

(b) Neuron-level pipelining in important neuron extraction; N denotes the number of important neurons in the current layer's output. Before scheduling:

for i = 1 to N {
  sort(i)
  acum(i)
}

After scheduling:

sort(1)
for i = 1 to N-1 {
  sort(i+1)
  acum(i)
}
acum(N)

Fig. 7: Pseudo-code of instruction scheduling examples. The code in (b) is the extraction block simplified in (a).

At the hardware level, once inf(j) is issued to execute on the DNN accelerator, <extraction for j> can be issued and executed immediately on our hardware extension (Sec. V-B).

Note that our software pipelining technique does not fully hide the instruction latency to guarantee that a new instruction can be dispatched every cycle. Both inference and the extraction code block take tens of millions of cycles. Fully hiding latencies requires expensive optimizations from the classic compiler literature [67], [27]. We find that our simple static instruction reordering is able to largely overlap inference with extraction, leading to very low performance overhead. A side effect of not fully hiding the instruction latencies is that our hardware still needs logic to check dependencies and stall the pipeline if necessary. But the hardware remains in-order, without the expensive out-of-order instruction scheduling logic.

Neuron-Level Pipelining Similar to layer-level pipelining, our compiler will also automatically pipeline the extraction of different important neurons within a layer. Fig. 7b shows an example, where cumulative thresholds are used. The two steps needed to extract important neurons, sorting all the partial sums (sort) and accumulating the partial sums until the threshold is reached (acum), have data dependencies. The compiler overlaps the extraction across different important neurons (iterations), improving hardware utilization and performance.

Trading off Compute for Memory Algorithms that use cumulative thresholds have high memory cost because all the partial sums must be stored to memory (Fig. 5). However, if a receptive field does not correspond to an important neuron in the output feature map, its partial sums will never be used. We observe that fewer than 5% of the stored partial sums are later used to extract important neurons.

We propose to use redundant computation to reduce the memory overhead. Instead of storing all the partial sums during inference, we re-compute the partial sums during the extraction process, only for the receptive fields that are known to correspond to important neurons in the output feature map.


Fig. 8: Ptolemy architecture overview (DNN accelerator; path constructor with sort/merge, accumulate, and mask generation units; controller; and off-chip DRAM).

The compiler implements this by generating csps instructionsto re-compute partial sums.
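A sketch of the idea for a single convolution output neuron (the function name and 2-D layout are our assumptions): a csps-style re-computation touches only the receptive fields that matter.

import numpy as np

def recompute_psums(ifmap, kernel, rf_origin):
    # Re-derive the partial sums of one receptive field at extraction
    # time instead of having stored them all during inference; only
    # receptive fields of important output neurons are re-computed.
    r, c = rf_origin
    kh, kw = kernel.shape
    rf = ifmap[r:r + kh, c:c + kw]
    return (rf * kernel).ravel()   # partial sums feeding one output neuron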

V. Architecture Support

This section introduces the Ptolemy hardware architecture. Following an overview (Sec. V-A), we describe the designs of the major hardware components (Sec. V-B – Sec. V-D).

V-A. Overview

Our architecture builds on top of a conventional DNN accelerator. Fig. 8 provides an overview of the Ptolemy architecture, which consists of an augmented DNN accelerator, a Path Constructor that builds the activation path for an input, and a Controller that dispatches instructions, runs the state machines that control the hardware blocks, and executes the final classifier. An off-chip memory stores all the data structures that are needed for inference and detection. Both the DNN accelerator and the Path Constructor use double-buffered on-chip SRAMs to capture data reuse and to overlap DMA transfers with computation. The controller's SRAM stores the compiled detection program and the activation/class paths for classification.

V-B. Enhanced DNN Accelerator

Ptolemy can be integrated into general DNN accelerator designs. Without loss of generality, we assume a TPU-like systolic array design [31]. Each PE consists of two 16-bit input registers, a 16-bit fixed-point MAC unit with a 32-bit accumulator register, and trivial control logic.

Ptolemy minimally extends each MAC unit. Fig. 9a shows the simple MAC unit augmentations (shaded). Specifically, algorithms that use absolute thresholds compare each partial sum with the threshold and store the single-bit mask to the SRAM; algorithms that use cumulative thresholds require each partial sum to be stored to the SRAM. Note that with the re-computation optimization, partial sums are recomputed at extraction time only for important neurons instead of being stored during inference.

To avoid the SRAM becoming a scalability bottleneck, the partial sums and the masks are double-buffered in the SRAM and double-buffered to the DRAM through a DMA. Later, the partial sums and/or masks are double-buffered back to the SRAM, similar to how feature maps and kernels are accessed. The extra DRAM space required to store partial sums is small, as we will show in Sec. VII-A.

Fig. 9: Microarchitecture details. (a) Enhanced MAC unit. (b) Sorting logic. MAC and sorting constitute 99.9% of the operations in our detection algorithm.

The additional DRAM traffic incurred by storing and reading partial sums is negligible (<0.1%) compared to the original DRAM traffic, since each partial sum is read and stored only once.

The PE array is used both for the usual inference and for re-computing partial sums as instructed by the csps instruction (Sec. IV-B). During re-computation, only the first row in the PE array is active, because only a selected few elements in the output feature maps need to be re-computed.

V-C. Path Constructor

The goal of the path constructor is to extract important neurons and to construct activation paths. Algorithms that use cumulative thresholds require sorting partial sums in receptive fields. Since receptive fields in modern DNNs are usually large (tens of thousands of elements), sorting all the elements on one piece of hardware could become a latency bottleneck as the sequence length increases. Our design splits a long sequence into multiple subsequences, which are sorted in parallel and merged together. Fig. 9b shows the sort unit organization. The sort unit uses the classic sorting network [32], and the merge unit uses a standard merge tree, both of which have efficient hardware implementations [46], [16], [33].
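Functionally, the split-sort-merge organization behaves like the sketch below, with heapq.merge standing in for the hardware merge tree and num_units a made-up parameter:

import heapq
import numpy as np

def split_sort_merge(psums, num_units=2):
    # Split a long partial-sum sequence into subsequences, sort each
    # one independently (in parallel in hardware), then merge the
    # sorted runs, largest values first.
    runs = [sorted(chunk, reverse=True)
            for chunk in np.array_split(np.asarray(psums), num_units)]
    return [float(v) for v in heapq.merge(*runs, reverse=True)]

print(split_sort_merge([0.2, 5.1, 0.7, 3.3, 1.9, 0.4]))
# -> [5.1, 3.3, 1.9, 0.7, 0.4, 0.2]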

The path constructor uses lightweight mask generation hardware, which generates the important neuron masks for each layer, from which the entire activation path (a bit vector) is constructed. The path constructor also integrates hardware that calculates the similarity between an activation path and a canary class path, which is a highly bit-parallel operation. The SRAM in the path constructor is separate from the SRAM used by the DNN accelerator to avoid resource contention, and is also double-buffered.

V-D. Controller

We assume a micro-controller unit (MCU) in the baseline hardware, as is common in today's DNN-based Systems-on-a-Chip (SoCs) [2]. We piggyback two key tasks on the MCU: dispatching instructions and executing the final classifier to detect adversaries. Both are lightweight tasks that can be executed efficiently on an MCU without extra hardware.

Dispatching Instructions Thanks to the simple ISA encoding (Table I), the compiled programs can be interpreted on the MCU (i.e., software decoding) efficiently while avoiding extra hardware cost. The overhead of interpreting the code is negligible compared to the total execution time.


The programs are very small in size. The largest one, which uses cumulative thresholds and backward extraction, is about 30 static instructions (below 100 bytes).

Classification The similarity between an activation path and the canary class path calculated by the path constructor is fed into a random forest (RF) for the final classification (Sec. III-B). Our particular RF implementation uses 100 decision trees, each of which has an average depth of 12. In total, the RF consumes about 2,000 operations on AlexNet (five orders of magnitude fewer than inference), and could execute on an MCU in microseconds.

VI. Evaluation Methodology

This section explains the basic hardware and software setup (Sec. VI-A) and the evaluation plan (Sec. VI-B).

VI-A. Experimental Setup

Hardware Implementation We develop an RTL implementation using Synopsys synthesis and Cadence layout tools with Silvaco's Open-Cell 15nm technology [1]. The on-chip SRAM is generated using an ARM memory compiler, and the off-chip DRAM is modeled after four Micron 16 Gb LPDDR3-1600 channels. We assume an ARM Cortex-M4-like micro-controller (MCU) as the controller in the hardware (Sec. V-D). The synthesis and memory estimation results are used to drive a cycle-level simulator for performance and energy analyses.

Networks and Datasets We evaluate Ptolemy using two networks: 1) ResNet18 [61] on the CIFAR-100 dataset [34] with 100 different classes and 50,000 training images, and 2) AlexNet [35] on the ImageNet dataset [17] with 1,000 different classes and 1 million training images. The networks and datasets we evaluate are at the high end of the benchmark scale evaluated by today's countermeasure mechanisms [12], [25], [43], which mostly use much smaller datasets and networks (e.g., MNIST, CIFAR-10) [37], [47] that are less effective in exercising the capability of our system. The test sets are evenly split between adversarial and benign inputs, following the common setup of adversarial attack research.

The clean AlexNet without attacks has an accuracy of 55.13% on ImageNet; ResNet18 has an accuracy of 94.49% and 75.87% on CIFAR-10 and CIFAR-100, respectively.

Attacks We evaluate Ptolemy against a wide range of attacks. We first evaluate using five common non-adaptive attacks: BIM [36], CWL2 [14], DeepFool [45], FGSM [49], and JSMA [49], which comprehensively cover all three types of input perturbation measures (L0, L2, and L∞) [4].

We also specifically construct attacks that attempt to defeat our detection mechanism (a.k.a. adaptive attacks [12]). In particular, we assume an adversary that has complete knowledge of Ptolemy's detection algorithms and the attacked model, and thereby generates adversarial samples by incorporating path similarities into the loss function.

Metrics We use the standard "area under curve" (AUC) accuracy metric (between 0 and 1) for adversarial detection [29], which captures the interaction between the true positive rate and the false positive rate. Unless otherwise noted, we report the average accuracy across all attacks. We confirm that the accuracy trend is similar across attacks.

VI-B. Evaluation Plan

Our evaluation is designed to demonstrate that 1) Ptolemy achieves similar or higher accuracy than today's detection mechanisms with a much lower performance penalty, and 2) the general framework allows for a large accuracy-efficiency trade-off. To that end, we develop and evaluate four algorithm variants using our programming model. All the compiler optimizations (Sec. IV-B) are enabled when applicable.

• BwCu: Backward extraction with cumulative thresholds.
• BwAb: Backward extraction with absolute thresholds.
• FwAb: Forward extraction with absolute thresholds.
• Hybrid: Hybrid algorithm where BwAb is used on the first half of the network and BwCu is used on the rest.

Baselines We compare against three state-of-the-art adversarial detection mechanisms: EP [52], CDRP [69], and DeepFense [54]. Both EP and CDRP leverage class-level sparsity. CDRP requires retraining and thus is not able to detect adversaries at inference time. Note that we evaluate Ptolemy using the exact same attacks used in the above papers.

DeepFense represents a class of detection mechanisms that use modular redundancy. DeepFense employs multiple latent models as redundancies. We directly use the accuracy results reported in their papers. Note that DeepFense is evaluated using ResNet18 on CIFAR-10, on which we perform additional experiments for a fair comparison.

VII. Evaluation

We first show that the area and DRAM space overheads introduced by Ptolemy's hardware extensions are small (Sec. VII-A). We show that Ptolemy provides more accurate detection (Sec. VII-B) with lower latency and energy overhead than prior work (Sec. VII-C – Sec. VII-D). We show that Ptolemy is robust against adaptive attacks that are specifically designed to defeat it (Sec. VII-E). Ptolemy provides a large accuracy-efficiency trade-off space (Sec. VII-F). We further study the sensitivity and scalability of Ptolemy (Sec. VII-G). Finally, we report additional results on several other models (Sec. VII-H).

VII-A. Overhead Analysis

Area Overhead The baseline DNN accelerator incorporates a 20×20 MAC array operating at 250 MHz. The accelerator has an SRAM size of 1.5 MB, which is banked at a 64 KB granularity. Ptolemy augments the baseline hardware with a 32 KB SRAM banked at a 2 KB granularity for storing partial sums/masks, and a 64 KB SRAM used by the path constructor, which includes two 16-element sort units, one 16-way merge tree, and an accumulation unit. This accelerator is used in evaluating both Ptolemy and all our baselines.

On top of the baseline DNN accelerator, Ptolemy introduces a total area overhead of 5.2% (0.08 mm²), of which 3.9% is contributed by the additional SRAM. The rest of the area overhead is attributed to the MAC unit augmentation (0.4%) and other logic (0.9%).


Fig. 10: Accuracy comparisons with EP and CDRP. Error bars indicate the max and min accuracies across all the attacks. (a) AlexNet on ImageNet. (b) ResNet18 on CIFAR-100.


DRAM Space Under BwAb and FwAb, AlexNet and ResNet18 require 1.6 MB and 2.2 MB of extra DRAM space, respectively. To show scalability, we also evaluated VGG19, which is 13× larger than ResNet18 and requires only 18.5 MB of extra DRAM space. With the recompute optimization, AlexNet, ResNet18, and VGG19 require only an extra 12.8 MB, 17.6 MB, and 148.0 MB in DRAM, respectively, under BwCu. The additional DRAM traffic is less than 0.1% (Sec. V-B).

VII-B. Accuracy

Ptolemy’s accuracy varies with the choice of θ and φ, whichcontrol the coverage of important neurons. Using BwCu asan example, Table II shows how its accuracy changes as θvaries from 0.1 to 0.9. As θ initially increases from 0.1 to0.5 the accuracy also increases, because a higher θ capturesmore important neurons. However, as θ increases to 0.9, theaccuracy slightly drops. This is because a high θ value causesdifferent class paths to overlap and become less distinguish-able. Meanwhile, the latency and energy consumption increasealmost proportionally as θ increases. We thus use θ = 0.5 forthe rest of our evaluation. The trend with respect to φ is similar,but is omitted due to limited space.

TABLE II: Sensitivity of accuracy, latency, and energy of BwCu as θ varies. Latency and energy are normalized to inference.

  θ     Accuracy   Latency   Energy
  0.1   0.86        4.7×      2.9×
  0.5   0.94       12.3×      7.7×
  0.9   0.91       25.7×     15.6×
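To make θ's role concrete, the following is a minimal sketch of cumulative-threshold selection in the style of BwCu, assuming a NumPy array of non-negative partial sums feeding one output neuron; the function name and data layout are our own illustration, not the artifact's API.

```python
import numpy as np

def select_important_neurons_cumulative(partial_sums, theta=0.5):
    """Pick the smallest set of input neurons whose partial sums
    account for at least a theta fraction of the output activation
    (illustrative sketch). Returns the selected neuron indices."""
    order = np.argsort(partial_sums)[::-1]        # sort descending
    cumulative = np.cumsum(partial_sums[order])
    target = theta * partial_sums.sum()
    # number of top neurons needed to reach the cumulative threshold
    k = int(np.searchsorted(cumulative, target)) + 1
    return order[:k]

# A higher theta captures more neurons, mirroring Table II's trend.
sums = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
print(select_important_neurons_cumulative(sums, theta=0.5))  # -> [1 3]
```

The sort and accumulation in this scheme are exactly the operations the path constructor's sort units, merge tree, and accumulation unit accelerate in hardware (Sec. VII-A).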

Ptolemy variants achieve similar or better accuracy than existing defense mechanisms. Fig. 10 shows the accuracy comparison. On AlexNet across all attacks (Fig. 10a), the three backward extraction-based variants (BwCu, BwAb, and Hybrid) outperform EP and CDRP by up to 0.02 and 0.1, respectively. FwAb uses forward extraction and has 0.03 lower accuracy than EP (0.06 higher than CDRP), indicating the accuracy benefits of backward extraction. On ResNet18 (Fig. 10b), Ptolemy consistently achieves higher (0.14 – 0.16) accuracy than CDRP, and has similar or higher accuracy than EP (at most 0.01 accuracy loss).

Note that adversarial attacks generated by CWL2 have low confidence in the rank-1 class, and the confidence of the rank-1 class is similar to that of the rank-2 class. Thus, evaluating CWL2 lets us understand Ptolemy's robustness against adversarial attacks launched by "low-confidence" images. On ImageNet against CWL2, Ptolemy's accuracy is 0.95, while the baselines are 0.94 (EP) and 0.85 (CDRP); on CIFAR-10, Ptolemy's accuracy is 0.96 while DeepFense's is 0.93.

[Fig. 11: Latency and energy comparisons with EP. (a) AlexNet on ImageNet; (b) ResNet18 on CIFAR-100. Axes: latency overhead and energy overhead for BwCu, BwAb, FwAb, Hybrid, and EP.]

VII-C. Latency and Energy

Ptolemy achieves low performance and energy overheads relative to usual DNN inference. Fig. 11a and Fig. 11b show the latency and energy consumption of the four Ptolemy variants normalized to DNN inference, respectively. For comparison purposes, we also show the latency and energy of EP. We do not show the results of CDRP because CDRP requires retraining and is not suitable for online detection.

Although it has the highest accuracy, BwCu also has the highest latency and energy overhead due to the expensive partial sum sorting and accumulation operations during extraction, which are serialized with inference. On AlexNet, BwCu introduces a 12.3× latency overhead and increases the energy by 7.7×. The corresponding results on ResNet18 are 195.4× and 105.9×, respectively. The overhead on ResNet18 (18 layers) is higher than on AlexNet (8 layers) because, as the network becomes deeper, the number of important neurons increases, which in turn increases the extraction time.

The overhead of BwCu is similar to EP's, while BwAb, FwAb, and Hybrid all achieve much lower latency and energy overheads. BwAb uses absolute thresholds to avoid sorting and storing partial sums. BwAb reduces the latency and energy overheads on AlexNet to only 1.2× and 1.1×, respectively, and to 3.2× and 2.0× on ResNet18, respectively.
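For contrast with the cumulative scheme sketched in Sec. VII-B, a minimal sketch of absolute-threshold selection in the spirit of BwAb, again with an illustrative function name and NumPy layout of our own; a single pass suffices, with no sorting or partial-sum storage.

```python
import numpy as np

def select_important_neurons_absolute(partial_sums, phi=0.1):
    """Keep every neuron whose partial sum exceeds a phi fraction of
    the total; one pass, no sorting (illustrative sketch)."""
    return np.nonzero(partial_sums > phi * partial_sums.sum())[0]

sums = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
print(select_important_neurons_absolute(sums, phi=0.1))  # -> [1 3 4]
```

Eliminating the sort is what removes the serialized extraction cost that dominates BwCu.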

FwAb further reduces the latency overhead to only 2.1% and 2.1× on the two networks, respectively, by using forward extraction to overlap extraction with inference. The latency overhead on ResNet18 is higher because ResNet18 is deeper with a higher important-neuron density (explained above), leading to a longer extraction latency that is harder to hide behind the inference latency. FwAb does not reduce the energy overhead significantly compared to BwAb, because it hides, rather than reduces, the amount of compute.

Finally, Hybrid provides a design point that balances efficiency with accuracy by combining cumulative thresholds and absolute thresholds. It leads to a 1.7× latency overhead and a 1.4× energy overhead on AlexNet; the overheads are 47.3× and 36.1× on ResNet18, respectively.

[Fig. 12: DeepFense comparison. (a) Accuracy; (b) latency and energy, for the Ptolemy variants (BwCu, BwAb, FwAb, Hybrid) and the DeepFense variants (DFL, DFM, DFH).]

VII-D. DeepFense Comparison

We compare against the three default DeepFense variants, which differ in the number of redundant networks: 1 in DFL, 8 in DFM, and 16 in DFH. DeepFense was originally implemented on FPGAs/GPUs; we perform a best-effort reimplementation on our hardware substrate for a fair comparison.

Fig. 12a shows the accuracy comparison between Ptolemy and DeepFense using ResNet18 on CIFAR-10. All Ptolemy variants achieve significantly higher detection accuracy than DeepFense. Specifically, FwAb, which has the lowest accuracy among all Ptolemy variants, outperforms DFH, the most accurate setup of DeepFense, by 0.11 on average.

Fig. 12b shows the latency and energy of the Ptolemy and DeepFense variants normalized to usual inference. While delivering higher accuracy, BwAb and FwAb are also faster and consume less energy than all three DeepFense variants. For instance, FwAb reduces latency and energy overheads by 89.0% and 59.0%, respectively, compared with DFL, the lightest version of DeepFense. The better efficiency of Ptolemy over DeepFense indicates the effectiveness of exploiting the runtime behaviors of DNN inferences.

VII-E. Defending Against Adaptive Attacks

Adaptive attacks refer to attacks that have complete knowledge of how a defense mechanism works and attempt to defeat that specific defense [11], [65]. We perform a best-effort construction of adaptive attacks against Ptolemy, and show that Ptolemy can effectively defend against them.

Constructing the Attacks  To attempt to defeat Ptolemy, we force an adversarial sample to have the same activation path as a benign input. However, since our path construction requires ranking/thresholding, which are non-differentiable, we opt for a differentiable approximation, a common practice in adversarial ML [6], [65]. We experiment with several heuristics, and find that the most effective one is to force all the activations of an adversary to be the same as those of a benign input, i.e., a sufficient but not necessary condition.

[Fig. 13: Detection accuracy of Ptolemy on various adaptive attacks (AT) compared to the five existing attacks (BIM, CWL2, DeepFool, FGSM, JSMA), for BwCu and FwAb.]

Specifically, to generate an adversarial sample from an input x that has a true class c, we first randomly choose a benign input x_t of target class t from the training dataset, where c ≠ t. We then add noise δx to x to generate x_a such that x_a's activations are as close to those of x_t as possible. This is achieved by minimizing the L2 loss ∑_i ‖z_i(x + δx) − z_i(x_t)‖₂² as the objective function, where z_i(·) denotes the activations at layer i. To strengthen the attack, we choose five different x_t of different classes to generate five different x_a, and select the x_a with the smallest loss. We use projected gradient descent (PGD) [42] as the optimization method.
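Below is a minimal PyTorch sketch of this construction, assuming an nn.Sequential-like model; the helper, step size, and step count are our own illustrative choices, not the exact attack code.

```python
import torch

def layer_activations(model, x):
    # Collect the activation after every layer; assumes an
    # nn.Sequential-like container (an illustrative simplification).
    feats, h = [], x
    for layer in model:
        h = layer(h)
        feats.append(h)
    return feats

def adaptive_attack(model, x, x_t, n_last_layers, steps=200, alpha=5e-3):
    # PGD minimizing sum_i ||z_i(x + delta) - z_i(x_t)||_2^2 over the
    # last n layers (the AT_n attack in the text); the only projection
    # is onto the valid pixel range, so the attack is unbounded.
    with torch.no_grad():
        z_t = layer_activations(model, x_t)[-n_last_layers:]
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        z_a = layer_activations(model, x + delta)[-n_last_layers:]
        loss = sum(((za - zt) ** 2).sum() for za, zt in zip(z_a, z_t))
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad      # plain gradient-descent step
            delta.grad.zero_()
            delta.clamp_(min=-x, max=1 - x)  # keep x + delta in [0, 1]
    return (x + delta).detach()
```

As described above, the full attack generates five such x_a from different targets x_t and keeps the one with the smallest loss.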

Results  Ptolemy detects these adaptive adversarial samples, even though they are generated specifically to "fool" Ptolemy by having activation paths that are similar to their benign counterparts. Using AlexNet on ImageNet as an example, Fig. 13 shows the detection accuracy of BwCu and FwAb on the adaptive attacks (AT). ATn denotes that the activations of the last n layers are considered in the loss function when generating adversarial samples. Since AlexNet has 8 layers, AT8 is the strongest adaptive attack. The detection accuracies on existing attacks are shown for comparison.

Overall, the detection accuracy decreases as more layers are considered in generating the adaptive attacks, i.e., the attacks become more effective. When only the first three layers are considered by the adaptive attack, the adversaries are more easily detected by Ptolemy than existing attacks. The detection accuracies on adaptive attacks are lower than those on non-adaptive attacks, confirming that adaptive attacks are more effective, matching the intuition [11].

Validating and Analyzing the Attacks  Our adaptive attack does not bound the perturbation, i.e., it is an unbounded attack. Following the guideline in Carlini et al. [11] that "The correct metric for evaluating unbounded attacks is the distortion required to generate an adversarial example, not the success rate (which should always be 100%)", we verify the validity of our adaptive attack in two ways. First, we verify that the constructed attacks do reach a 100% success rate; the average distortion, measured in Mean Square Error (MSE), is 0.007, and the maximum MSE is 0.035.

Second, we show how the detection accuracy of Ptolemy is impacted by the distortion introduced in the adaptive adversarial examples. The data is shown in Fig. 14, where every ⟨x, y⟩ point denotes the average detection accuracy (y) for all the adaptive attacks whose distortion (MSE) is lower than or equal to a certain value (x). We find that overall the detection accuracy drops slightly as the distortion increases (an expected trend), although the trend is not strong, likely because the absolute distortion is too low (a desirable property) to demonstrate a strong correlation with accuracy. We do verify that when the distortion is large enough to completely transform an image from one class to another, the detection accuracy drops to 0; but at that point the input could not be considered an adversarial attack, since the transformed image does not look like the original image.

[Fig. 14: Detection accuracy of adaptive adversarial inputs under different distortions. X-axis: distortion/perturbation (×10⁻³ MSE); y-axis: accuracy.]

[Fig. 15: Detection accuracy of adaptive attacks under different path similarities. X-axis: path similarity; y-axis: accuracy.]

We also investigate how the detection accuracy is impacted by the path similarities between the original class and the target class. We show the results in Fig. 15, where every ⟨x, y⟩ point denotes the average detection accuracy (y) for all the adaptive adversarial inputs whose path similarity between the original class and the target class is lower than or equal to a certain value (x). While the path similarity between the original class and the target class has a wide range (0.0 – 0.34), the detection accuracy does not correlate strongly with the path similarity. This is a desirable property, as it suggests that Ptolemy is not more vulnerable when the attacker simply targets a similar class when generating the attacks.
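This section does not spell out the path-similarity metric; a natural choice, assumed here purely for illustration, is the Jaccard similarity between two binary activation-path masks.

```python
import numpy as np

def path_similarity(path_a, path_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two binary masks
    over the same set of neurons. An assumed metric, for illustration."""
    a, b = path_a.astype(bool), path_b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0
```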

Discussion  The way we construct the adaptive attack is by approximating the hard path objective (i.e., forcing an adversarial sample to have the same activation path as a benign input) with a differentiable objective that constrains the individual activations. This relaxation lets us formulate adversarial attack generation as an optimization problem that can be solved using effective optimization methods (e.g., PGD). If one were to force a hard constraint on the activation path, the objective function would not be differentiable.

In that case, a naive approach to generating adaptive attacks would be to exhaustively search all possible perturbations. But without guidance, such a search would be prohibitively expensive (e.g., (256³)^40,000 candidates for an 8-bit color depth, 200×200-resolution RGB image). We did try the exhaustive search method in a limited form, which generated results with so much perturbation that the resulting images do not look like the original images at all.

An interesting direction would be to investigate intelligent search heuristics (e.g., simulated annealing) to find perturbations that meet the hard path constraint while fooling Ptolemy. We leave this to future work.

VII-F. Early-Termination and Late-Start

The Ptolemy framework allows programmers to flexibly select which layers to extract important neurons from (Sec. III-C). To trade accuracy for performance, programmers could start extracting important neurons later in forward extraction algorithms (as illustrated in Fig. 6), or terminate extraction earlier in backward extraction algorithms, as the sketch below illustrates.
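As an illustration of how these knobs might look to a programmer (the interface below is hypothetical, not the Ptolemy artifact's API), extraction simply skips layers outside a configured window.

```python
def extract_path(layer_activations, start_layer=0, term_layer=None,
                 select=None):
    # Extract important-neuron masks only for layers inside the
    # [start_layer, term_layer) window; layers outside are skipped to
    # trade accuracy for latency/energy. Hypothetical interface.
    term_layer = len(layer_activations) if term_layer is None else term_layer
    select = select or (lambda acts: acts > 0.1 * acts.sum())
    return {i: select(acts)
            for i, acts in enumerate(layer_activations)
            if start_layer <= i < term_layer}
```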

Early-Termination  We use BwCu to showcase the trade-off that early-termination in backward extraction offers. For simplicity, we show only the results on AlexNet; ResNet18 has similar trends. Fig. 16a shows how accuracy (y-axis) varies as the termination layer (x-axis) varies from 8 (the last layer) to 1 (the first layer). As AlexNet has 8 layers in total, terminating at layer 8 means extracting important neurons from only one layer. As extraction terminates later (further to the right on the x-axis), more important neurons are captured and thus the accuracy increases. The accuracy increase eventually plateaus beyond layer 6, indicating marginal return on investment for extracting more layers.

[Fig. 16: Accuracy, latency, and energy consumption under different termination layers in BwCu. (a) Accuracy vs. termination layer (8 to 1); (b) normalized latency and energy vs. termination layer.]

[Fig. 17: Accuracy, latency, and energy consumption under different start layers in FwAb. (a) Accuracy vs. start layer (8 to 1); (b) normalized latency and energy vs. start layer.]

Fig. 16b shows how the latency and energy consumption vary with the termination layer. With virtually the same accuracy, extracting all the layers (i.e., terminating at layer 1) leads to 11.2× higher latency and 6.6× more energy compared to extracting only 3 layers (i.e., terminating after layer 6), which introduces only 1.1× latency and 1.1× energy overhead over normal inference.

Late-Start  We use FwAb as an example to demonstrate the trade-off that late-start provides to forward extraction-based methods. Fig. 17a and Fig. 17b show how the accuracy and the latency/energy vary with the start layer, respectively.

Similar to early-termination, the accuracy increases as more layers are extracted, i.e., as extraction starts earlier (further to the right). Interestingly, starting later does not help reduce the latency. This is because the extraction latency is largely hidden behind the inference latency. However, starting later does reduce the energy consumption, by 8.4%, because less work is done.

VII-G. Sensitivity and Scalability Studies

We show how Ptolemy's performance varies with different hardware resource provisions in the path constructor. We report only the results of BwCu on AlexNet due to limited space. Fig. 18a shows how the latency and energy consumption (normalized to DNN inference) vary with the merge tree length (the number of partially sorted sequences that are merged simultaneously). As the merge tree length increases, the latency reduces (from 31.0× to 12.3×), but the power consumption stays virtually the same. This is because a 16-length merge tree contributes only 2% of the total power.
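In software terms, the merge tree performs a k-way merge of partially sorted runs; a minimal sketch using Python's heapq, purely as an analogy for the hardware structure, is shown below.

```python
import heapq

def merge_partial_sums(sorted_runs):
    """k-way merge of descending, partially sorted partial-sum runs,
    as the hardware merge tree does in parallel (software analogy)."""
    # heapq.merge merges ascending runs; negate for descending order.
    for v in heapq.merge(*[[-x for x in run] for run in sorted_runs]):
        yield -v

runs = [[9, 5, 1], [8, 4], [7, 6, 2]]  # each run already sorted descending
print(list(merge_partial_sums(runs)))   # [9, 8, 7, 6, 5, 4, 2, 1]
```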

[Fig. 18: Performance varies with hardware resources. (a) Normalized latency and power vs. merge tree length (4–32); (b) normalized latency and power vs. number of sort units (2–16).]

Fig. 18b shows how the latency and power consumption vary with the number of sort units. We find that the latency decreases only marginally with more sort units, because sorting is memory-bound and thus adding compute units has a marginal impact. The power consumption, however, increases significantly, because the sort unit contributes significantly (33.4%) to the overall power in our design.

While our original DNN accelerator uses 16-bit precision, we also evaluate our system under an 8-bit design. The area overhead increases from 5.2% to 5.5%. For AlexNet, the 8-bit design has a 2.1% latency overhead and a 33.0% energy overhead using FwAb, comparable with the 2.1% and 16.0% overheads of the original design. We also increase the MAC array size from 20×20 to 32×32. The area overhead increases from 5.2% to 6.4%. AlexNet has a 4.4% latency overhead and a 16.4% energy overhead using FwAb, comparable with the 2.1% and 16.0% in the original design.

VII-H. Large Model Evaluation

On VGG16 [55] and Inception-V4 [62], the average inter-class path similarity on ImageNet is only 41.5% and 28.8%, respectively, indicating that important neurons exist and class paths are unique in these models.

We also applied our detection scheme to DenseNet [30], and achieved 100% detection accuracy with a 0% false positive rate (FPR), higher than the previous best accuracy of 96% with a 3.8% FPR [41]. We use detection accuracy and false positive rate instead of AUC in order to directly compare with the referenced method. We also evaluated ResNet50 on ImageNet using BwCu. The accuracy is 0.900, which is more accurate than EP [52] (0.898).

VIII. Related Work and Discussion

Different mechanisms to counter adversarial attacks have been explored. One major class boosts DNN robustness at training time through adversarial retraining [9], [22], [44], [23], which incorporates adversarial samples into the training data. However, adversarial retraining does not provide detection capability at inference time. It also requires access to the retraining data, which Ptolemy does not. Ptolemy can also be integrated with adversarial retraining.

Detection mechanisms have also been extensively explored, ranging from using modular redundancies (e.g., input transformation [10], [24], [64], multiple models [54], and weight randomization [18], [70]) to cascading a dedicated DNN to detect adversaries [41], [40], [21], [43]. Wang et al. [68] propose to spatially share the DNN accelerator resources between the original network and the detection network. Ptolemy differs from them in two ways. First, we show that, by using the path as an explicit representation of the input, Ptolemy can use a simple random forest classifier to detect adversarial inputs rather than complicated DNNs (see the sketch below). Coupled with other performance optimizations, Ptolemy provides very low (2%) overhead to enable detection at inference-time, while others introduce several-fold higher overheads. Second, Ptolemy provides an algorithm design framework that allows programmers to make trade-offs between detection efficiency and accuracy.
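A minimal sketch of this detection step as we understand it: the extracted activation path is flattened into a binary feature vector and fed to an off-the-shelf random forest [39]; the feature layout, stand-in data, and hyper-parameters below are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# paths: (n_samples, n_neurons) binary path masks from extraction;
# labels: 1 for adversarial, 0 for benign (offline profiling set).
rng = np.random.default_rng(0)
paths = rng.integers(0, 2, size=(1000, 4096))   # stand-in features
labels = rng.integers(0, 2, size=1000)          # stand-in labels

detector = RandomForestClassifier(n_estimators=100).fit(paths, labels)

def is_adversarial(path_mask):
    """Classify one extracted activation path at inference time."""
    return bool(detector.predict(path_mask.reshape(1, -1))[0])
```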

Carlini et al. [11] provide a checklist of best practices in evaluating defense mechanisms against adversarial attacks. This paper exercises the following red teaming:
• Stated the threat model: attackers know everything (model, inputs, defense).
• Performed adaptive attacks (Sec. VII-E).
• Reported clean model accuracy (Sec. VI-A).
• Performed basic sanity checks (iterative attacks perform better than single-step attacks; increasing the perturbation budget strictly increases attack success rate; with "high" distortion, model accuracy reaches random guessing).
• Analyzed success vs. distortion (perturbation) for our adaptive attack (Sec. VII-E).
• Showed that adaptive attacks are better (harder to detect) than non-adaptive ones (Fig. 13).
• Showed attack hyper-parameters with the released code.
• Applied both non-adaptive attacks (covering all three types of input perturbation measures: L0, L2, and L∞) and adaptive attacks (Sec. VI-A).
• For non-differentiable components (in adaptive attacks), applied differentiable techniques (Sec. VII-E).
• Verified that the attacks have converged under the selected hyper-parameters.

IX. Conclusion

Deep-learning-driven applications are cultivating Software 2.0, an exciting software paradigm that is not robust to input perturbations. The robustness issue is further exacerbated by the lack of explainability in deep learning. Adversarial attacks exploit the robustness vulnerability, and represent one important instance of AI safety as AI techniques penetrate mission-critical systems [71], [73].

Ptolemy enables efficient and accurate adversarial detection at inference-time. The key is to exploit the program execution behaviors of DNN inference that were largely ignored before. We demonstrate a careful co-design of an algorithmic framework, compiler optimizations, and hardware architecture. The concepts of important neurons and activation paths complement existing explainable ML efforts [3], [26], [8], [58], and could shed new light on interpreting DNNs.

X. Acknowledgement

We thank the anonymous reviewers from ISCA 2020 and MICRO 2020 and the shepherd from MICRO for their valuable feedback and/or guidance. Jingwen Leng and Minyi Guo are the corresponding authors of the paper.


References

[1] "15nm Open-Cell Library." [Online]. Available: http://www.si2.org/open-cell-library/
[2] "NVIDIA Reveals Xavier SOC Details," https://bit.ly/2qq0TWp. [Online]. Available: https://www.forbes.com/sites/moorinsights/2018/08/24/nvidia-reveals-xavier-soc-details/amp/
[3] "2016–2019 Progress Report: Advancing Artificial Intelligence R&D," https://www.whitehouse.gov/wp-content/uploads/2019/11/AI-Research-and-Development-Progress-Report-2016-2019.pdf, 2019.
[4] N. Akhtar and A. Mian, "Threat of adversarial attacks on deep learning in computer vision: A survey," IEEE Access, vol. 6, pp. 14410–14430, 2018.
[5] V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan, "Software pipelining," ACM Computing Surveys (CSUR), vol. 27, no. 3, pp. 367–432, 1995.
[6] A. Athalye and N. Carlini, "On the robustness of the CVPR 2018 white-box adversarial example defenses," arXiv preprint arXiv:1804.03286, 2018.
[7] T. Ball and J. R. Larus, "Efficient path profiling," in Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 1996, pp. 46–57.
[8] P. Biecek, "DALEX: Explainers for complex predictive models in R," The Journal of Machine Learning Research, vol. 19, no. 1, pp. 3245–3249, 2018.
[9] J. Bradshaw, A. G. d. G. Matthews, and Z. Ghahramani, "Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks," arXiv preprint arXiv:1707.02476, 2017.
[10] J. Buckman, A. Roy, C. Raffel, and I. Goodfellow, "Thermometer encoding: One hot way to resist adversarial examples," 2018.
[11] N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, A. Madry, and A. Kurakin, "On evaluating adversarial robustness," arXiv preprint arXiv:1902.06705, 2019.
[12] N. Carlini and D. Wagner, "Adversarial examples are not easily detected: Bypassing ten detection methods," in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017, pp. 3–14.
[13] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 39–57.
[14] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 39–57.
[15] P. P. Chang and W. Hwu, "Trace selection for compiling large C application programs to microcode," in Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture. IEEE Computer Society Press, 1988, pp. 21–29.
[16] R. Chen, S. Siriyal, and V. Prasanna, "Energy and memory efficient mapping of bitonic sorting on FPGA," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 240–249.
[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[18] G. S. Dhillon, K. Azizzadenesheli, Z. C. Lipton, J. Bernstein, J. Kossaifi, A. Khanna, and A. Anandkumar, "Stochastic activation pruning for robust adversarial defense," arXiv preprint arXiv:1803.01442, 2018.
[19] R. J. Donovan, R. R. Roediger, and W. J. Schmidt, "Profile driven optimization of frequently executed paths with inlining of code fragment (one or more lines of code from a child procedure to a parent procedure)," Jun. 6, 2000, US Patent 6,072,951.
[20] J. A. Fisher, "Trace scheduling: A technique for global microcode compaction," IEEE Transactions on Computers, no. 7, pp. 478–490, 1981.
[21] Z. Gong, W. Wang, and W.-S. Ku, "Adversarial and clean data are not twins," arXiv preprint arXiv:1704.04960, 2017.
[22] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[23] S. Gu and L. Rigazio, "Towards deep neural network architectures robust to adversarial examples," arXiv preprint arXiv:1412.5068, 2014.
[24] C. Guo, M. Rana, M. Cisse, and L. Van Der Maaten, "Countering adversarial images using input transformations," arXiv preprint arXiv:1711.00117, 2017.
[25] W. He, J. Wei, X. Chen, N. Carlini, and D. Song, "Adversarial example defense: Ensembles of weak defenses are not strong," in 11th USENIX Workshop on Offensive Technologies (WOOT 17), 2017.
[26] A. Holzinger, M. Plass, K. Holzinger, G. C. Crisan, C.-M. Pintea, and V. Palade, "A glass-box interactive machine learning approach for solving NP-hard problems with the human-in-the-loop," arXiv preprint arXiv:1708.01104, 2017.
[27] K. Hoste and L. Eeckhout, "COLE: Compiler optimization level exploration," in Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, 2008, pp. 165–174.
[28] X. Hu, L. Liang, S. Li, L. Deng, P. Zuo, Y. Ji, X. Xie, Y. Ding, C. Liu, T. Sherwood et al., "DeepSniffer: A DNN model extraction framework based on learning architectural hints," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 385–399.
[29] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.
[30] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer, "DenseNet: Implementing efficient ConvNet descriptor pyramids," arXiv preprint arXiv:1404.1869, 2014.
[31] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., "In-datacenter performance analysis of a tensor processing unit," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1–12.
[32] D. E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley Professional, 2014.
[33] D. Koch and J. Torresen, "FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting," in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 2011, pp. 45–54.
[34] A. Krizhevsky, G. Hinton et al., "Learning multiple layers of features from tiny images," Citeseer, Tech. Rep., 2009.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[36] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial examples in the physical world," arXiv preprint arXiv:1607.02533, 2016.
[37] Y. LeCun, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/, 1998.
[38] J. Li, F. Schmidt, and Z. Kolter, "Adversarial camera stickers: A physical camera-based attack on deep learning systems," in International Conference on Machine Learning, 2019, pp. 3896–3904.
[39] A. Liaw, M. Wiener et al., "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[40] J. Lu, T. Issaranon, and D. Forsyth, "SafetyNet: Detecting and rejecting adversarial examples robustly," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 446–454.
[41] S. Ma and Y. Liu, "NIC: Detecting adversarial samples with neural network invariant checking," in Proceedings of the 26th Network and Distributed System Security Symposium (NDSS 2019), 2019.
[42] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," arXiv preprint arXiv:1706.06083, 2017.
[43] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff, "On detecting adversarial perturbations," arXiv preprint arXiv:1702.04267, 2017.
[44] T. Miyato, A. M. Dai, and I. Goodfellow, "Adversarial training methods for semi-supervised text classification," arXiv preprint arXiv:1605.07725, 2016.
[45] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: A simple and accurate method to fool deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2574–2582.
[46] R. Mueller, J. Teubner, and G. Alonso, "Sorting networks on FPGAs," The VLDB Journal, vol. 21, no. 1, pp. 1–23, 2012.
[47] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," 2011.
[48] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.
[49] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in 2016 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 2016, pp. 372–387.
[50] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, "Distillation as a defense to adversarial perturbations against deep neural networks," in 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 2016, pp. 582–597.
[51] O. M. Parkhi, A. Vedaldi, A. Zisserman et al., "Deep face recognition," in BMVC, vol. 1, no. 3, 2015, p. 6.
[52] Y. Qiu, J. Leng, C. Guo, Q. Chen, C. Li, M. Guo, and Y. Zhu, "Adversarial defense through network profiling based path extraction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4777–4786.
[53] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein, "SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability," in Advances in Neural Information Processing Systems, 2017, pp. 6076–6085.
[54] B. D. Rouhani, M. Samragh, M. Javaheripi, T. Javidi, and F. Koushanfar, "DeepFense: Online accelerated defense against adversarial deep learning," in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2018, pp. 1–8.
[55] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[56] J. Smith and R. Nair, Virtual Machines: Versatile Platforms for Systems and Processes. Elsevier, 2005.
[57] M. D. Smith, "Overcoming the challenges to feedback-directed optimization (keynote talk)," in ACM SIGPLAN Notices, vol. 35, no. 7. ACM, 2000, pp. 1–11.
[58] K. Sokol and P. A. Flach, "Glass-Box: Explaining AI decisions with counterfactual statements through conversation with a voice-enabled virtual assistant," in IJCAI, 2018, pp. 5868–5870.
[59] D. J. Sorin, "Fault tolerant computer architecture," Synthesis Lectures on Computer Architecture, vol. 4, no. 1, pp. 1–104, 2009.
[60] Y. Sun, D. Liang, X. Wang, and X. Tang, "DeepID3: Face recognition with very deep neural networks," arXiv preprint arXiv:1502.00873, 2015.
[61] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[62] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[63] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[64] D. D. Thang and T. Matsui, "Image transformation can make neural networks more robust against adversarial examples," arXiv preprint arXiv:1901.03037, 2019.
[65] F. Tramer, N. Carlini, W. Brendel, and A. Madry, "On adaptive attacks to adversarial example defenses," arXiv preprint arXiv:2002.08347, 2020.
[66] F. Tramer, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, "Ensemble adversarial training: Attacks and defenses," arXiv preprint arXiv:1705.07204, 2017.
[67] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. I. August, "Compiler optimization-space exploration," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization. IEEE Computer Society, 2003, pp. 204–215.
[68] X. Wang, R. Hou, B. Zhao, F. Yuan, J. Zhang, D. Meng, and X. Qian, "DNNGuard: An elastic heterogeneous DNN accelerator architecture against adversarial attacks," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 19–34.
[69] Y. Wang, H. Su, B. Zhang, and X. Hu, "Interpret neural networks by identifying critical data routing paths," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[70] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille, "Mitigating adversarial effects through randomization," arXiv preprint arXiv:1711.01991, 2017.
[71] H. Zhao, Y. Zhang, P. Meng, H. Shi, L. E. Li, T. Lou, and J. Zhao, "Towards safety-aware computing system design in autonomous vehicles," arXiv preprint arXiv:1905.08453, 2019.
[72] S. Zheng, Y. Song, T. Leung, and I. Goodfellow, "Improving the robustness of deep neural networks via stability training," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4480–4488.
[73] Y. Zhu, V. J. Reddi, R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks, "Cognitive computing safety: The new horizon for reliability / the design and evolution of deep learning workloads," IEEE Micro, vol. 37, no. 1, pp. 15–21, 2017.

