
Universal Adversarial Audio Perturbations

Sajjad Abdoli, Luiz G. Hafemann, Jerome Rony, Ismail Ben Ayed, Patrick Cardinal and Alessandro L. Koerich

Abstract: We demonstrate the existence of universal adversarial perturbations, which can fool a family of audio classification architectures, for both targeted and untargeted attack scenarios. We propose two methods for finding such perturbations. The first method is based on an iterative, greedy approach that is well-known in computer vision: it aggregates small perturbations to the input so as to push it to the decision boundary. The second method, which is the main contribution of this work, is a novel penalty formulation, which finds targeted and untargeted universal adversarial perturbations. Unlike the greedy approach, the penalty method minimizes an appropriate objective function on a batch of samples. Therefore, it produces more successful attacks when the number of training samples is limited. Moreover, we provide a proof that the proposed penalty method theoretically converges to a solution that corresponds to universal adversarial perturbations. We also demonstrate that it is possible to provide successful attacks using the penalty method when only one sample from the target dataset is available to the attacker. Experimental results on attacking various 1D convolutional neural network architectures have shown attack success rates higher than 85.0% and 83.1% for targeted and untargeted attacks, respectively, using the proposed penalty method.

Index Terms: adversarial perturbation, deep learning, audio processing, audio classification.


1 INTRODUCTION

Deep learning models have been achieving state-of-the-art performance in various problems, notably in image recognition [1], natural language processing [2], [3] and speech processing [4], [5]. However, recent studies have demonstrated that such deep models are vulnerable to adversarial attacks [6], [7], [8], [9], [10], [11]. Adversarial examples are carefully perturbed input examples that can fool a machine learning model at test time [6], [12], posing security and reliability concerns for such models. The threat of such attacks has been mainly addressed for computer vision tasks [9], [13]. For instance, Moosavi-Dezfooli et al. [14] have shown the existence of universal adversarial perturbations, which, when added to an input image, cause the input to be misclassified with high probability. For these universal attacks, the generated vector is independent of the input examples.

End-to-end audio classification systems have been gaining more attention recently [15], [16], [17], [18]. In such systems, the input to the classifier is the audio waveform. Moreover, there have been some studies that embed traditional signal processing techniques into the layers of convolutional neural networks (CNNs) [19], [20], [21]. For such audio classification systems, the effect of adversarial attacks is not widely addressed [22], [23]. Creating attacks to threaten audio classification systems is challenging, due mainly to the signal variability in the time domain [22].

In this paper, we demonstrate the existence of universal adversarial perturbations, which can fool a family of audio classification architectures, for both targeted and untargeted attack scenarios. We propose two methods for finding such perturbations. The first method is based on the greedy-approach principle proposed by Moosavi-Dezfooli et al. [14], which finds the minimum perturbation that sends examples to the decision boundary. The second method, which is the main contribution of this work, is a novel penalty formulation, which finds targeted and untargeted universal adversarial perturbations. Unlike the greedy approach, the penalty method minimizes an appropriate objective function on a batch of samples. Therefore, it produces more successful attacks than the previous method when the number of training samples is limited. We also show that, using this method, it is possible to provide successful attacks when only one sample from the target dataset is available to the attacker. Moreover, we provide a proof that the proposed penalty method theoretically converges to a solution that corresponds to universal adversarial perturbations. Both methods are evaluated on a family of audio classifiers based on deep models for environmental sound classification and speech recognition. The experimental results have shown that both proposed methods can attack deep models, which are used as target models, with a high success rate.

• Authors are with Ecole de Technologie Superieure (ETS), Universite du Quebec, Montreal, QC, Canada. E-mail: {ismail.benayed, patrick.cardinal, alessandro.koerich}@etsmtl.ca

This paper is organized as follows: Section 2 presents an overview of adversarial attacks on deep learning models. Section 3 presents the proposed methods to craft universal audio adversarial perturbations. Section 4 presents the datasets and the target models used to evaluate the proposed methods; the experimental results on two benchmarking datasets are presented in Section 4.1. The conclusion and perspectives for future work are presented in the last section.

2 ADVERSARIAL MACHINE LEARNING

Research on adversarial attacks has attracted considerable attention recently, due to its impact on the reliability of shallow and deep learning models for computer vision tasks [8]. For a given example $x$, an attack can find a small perturbation $\delta$, often imperceptible to a human observer, so that an example $\tilde{x} = x + \delta$ is misclassified by a machine learning model [6], [24], [25]. The attacker's goal may cover a wide range of threats like privacy violation, availability violation and integrity violation [12]. Moreover, the attacker's goal may be specific (targeted), with inputs misclassified as a specific class, or generic (untargeted), where the attacker simply wants to have an example classified differently from its true class [12]. Attacks generated by targeting a specific classifier are often transferable to other classifiers (i.e. they also induce misclassification on them) [26], even if they have different architectures and do not use the same training set [6]. It has also been shown that such attacks can be applied in the physical world. For instance, Kurakin et al. [27] showed that printed adversarial examples were still misclassified after being captured by a camera; Athalye et al. [28] presented a method to generate adversarial examples that are robust under translation and scale transformations, showing that such attacks induce misclassification even when the examples correspond to different viewpoints.

arXiv:1908.03173v5 [cs.LG] 17 Nov 2020

Research on adversarial attacks on audio classification systems is quite recent. Some studies focused mainly on providing inaudible and hidden targeted attacks on the systems. In such attacks, new audio examples are synthesized, instead of adding perturbations to actual inputs [29], [30]. Other works focus on untargeted attacks on speech and music classification systems [31], [32]. Du et al. [33] proposed a method based on particle swarm optimization for targeted and untargeted attacks. They evaluated their attacks on a range of applications such as speech command recognition, speaker recognition, sound event detection and music genre classification. Alzantot et al. [34] proposed a similar approach that uses a genetic algorithm for crafting targeted black-box attacks on a speech command classification system [35], which achieved a success rate of 87%. Carlini et al. [22] proposed a targeted attack on the speech processing system DeepSpeech [15], which is a state-of-the-art speech-to-text transcription neural network. A penalty method is used in such attacks, which achieved a success rate of 100%.

Most studies consider that the attacker can directly manipulate the input of the classifiers. Crafting audio attacks that can work in the physical world (i.e. played over-the-air) presents challenges such as being robust to background noise and reverberations. Yakura et al. [36] showed that over-the-air attacks are possible, at least for a dataset of short phrases. Qin et al. [37] also reported successful adversarial attacks on automatic speech recognition systems, which remain effective after applying realistic simulated environmental distortions.

Universal adversarial perturbations are an effective type of attack [14] because the additive noise is independent of the input examples, but once added to any example, it may fool a deep model into misclassifying that example. Moosavi-Dezfooli et al. [14] proposed a greedy algorithm to provide such perturbations for untargeted attacks on images. The perturbation is generated by aggregating atomic perturbation vectors, which send the data points to the decision boundary of the classifier. Recently, Moosavi-Dezfooli et al. [38] provided a formal relationship between the geometry of the decision boundary and robustness to universal perturbations. They have also shown the strong vulnerability of state-of-the-art deep models to universal perturbations. Metzen et al. [39] generalized this idea to provide attacks against semantic image segmentation models. Recently, Behjati et al. [40] generalized this idea to attack a text classifier in both targeted and non-targeted scenarios. In a different approach, Hayes et al. [41] proposed a generative model for providing such universal perturbations. Recently, Neekhara et al. [42] were also inspired by this idea to generate universal adversarial perturbations which cause mis-transcription of audio signals by automatic transcription systems. They reported success rates up to 88.24% on the hold-out set of the pre-trained Mozilla DeepSpeech model [15]. Nonetheless, their method is designed only for untargeted attacks. For audio classification systems, the impact of such perturbations can be very strong if these perturbations can be played over the air even without knowing what the test examples would look like.

In this work we show the existence of universal adversarial perturbations (UAPs) for attacking several audio processing models in both targeted and untargeted scenarios. The previous studies on UAPs for attacking audio processing systems, however, have focused only on untargeted attack scenarios. We show that the iterative method proposed by Moosavi-Dezfooli et al. [14] can be generalized for attacking audio models. As a part of this algorithm, we show that the decoupled direction and norm (DDN) attack [43], which was originally proposed for targeting image processing models, can also be used in an iterative method for targeting audio models. Besides, we also propose a novel penalty optimization formulation for generating UAPs and we address new challenges such as generating perturbations when a single audio example is available to the attacker, as well as the transferability of UAPs in both targeted and untargeted scenarios. It is shown that the proposed methods generate UAPs that generalize well across data points and that are to some extent model-agnostic (doubly universal), i.e. the UAPs generalize across different models, especially in the untargeted attack scenario.

3 UNIVERSAL ADVERSARIAL AUDIO PERTURBATIONS

In this section, we formalize the problem of crafting universal audio adversarial perturbations and propose two methods for finding such perturbations. The first method is based on the greedy-approach principle proposed by Moosavi-Dezfooli et al. [14], which finds the minimum perturbation that sends examples to the decision boundary of the classifier or inside the boundary of the target class for untargeted and targeted perturbations, respectively. The second method, which is the main contribution of this work, is a penalty formulation, which finds a universal perturbation vector that minimizes an objective function.

Let $\mu$ be the distribution of audio samples in $\mathbb{R}^d$ and $k(x) = \arg\max_y P(y \mid x, \theta)$ be a classifier that predicts the class of the audio sample $x$, where $y$ is the predicted label of $x$ and $\theta$ denotes the parameters of the classifier. Our goal is to find a vector $v$ that, once added to the audio samples, can fool the classifier for most of the samples. This vector is called universal as it is a fixed perturbation that is independent of the audio samples and, therefore, it can be added to any sample in order to fool a classifier. The problem can be defined such that $k(x + v) \neq k(x)$ for an untargeted attack and, for a targeted attack, $k(x + v) = y_t$, where $y_t$ denotes the target class. In this context, the universal perturbation is a vector with a sufficiently small $\ell_p$ norm, where $p \in [1, \infty)$, which satisfies two constraints [14]: $\|v\|_p \leq \xi$ and $P_{x \sim \mu}(k(x + v) \neq k(x)) \geq 1 - \delta$, where $\xi$ controls the magnitude of the perturbation and $\delta$ controls the desired fooling rate. For a targeted attack, the second constraint is defined as $P_{x \sim \mu}(k(x + v) = y_t) \geq 1 - \delta$.

3.1 Iterative Greedy Algorithm

Let $X = \{x_1, \ldots, x_m\}$ be a set of $m$ audio files sampled from the distribution $\mu$. The greedy algorithm proposed by Moosavi-Dezfooli et al. [14] gradually crafts adversarial perturbations in an iterative manner. For untargeted attacks, at each iteration, the algorithm finds the minimal perturbation $\Delta v_i$ that pushes an example $x_i$ to the decision boundary, and adds the current perturbation to the universal perturbation. In this study, a targeted version of the algorithm is also proposed such that the universal perturbation added to the example must push it toward the decision boundary of the target class. In more detail, at each iteration of the algorithm, if the universal perturbation already makes the model misclassify the example, the algorithm ignores it; otherwise, an extra $\Delta v_i$ is found and aggregated to the universal perturbation by solving the minimization problem with the following constraints for untargeted and targeted attacks, respectively:

$$\Delta v_i \leftarrow \arg\min_{r} \|r\|_2 \quad \text{s.t.} \quad k(x_i + v + r) \neq k(x_i), \quad \text{or} \quad k(x_i + v + r) = y_t. \tag{1}$$

In order to find $\Delta v_i$ for each sample of the dataset, any attack that provides a perturbation that misclassifies the sample, such as the Carlini and Wagner $\ell_2$ attack [8] or the DDN attack [43], can be used. Moosavi-Dezfooli et al. [44] used DeepFool to find such a vector.

In order to satisfy the first constraint ($\|v\|_p \leq \xi$), the universal perturbation is projected onto the $\ell_p$ ball of radius $\xi$ centered at 0. The projection function $P_{p,\xi}$ is formulated as:

$$P_{p,\xi}(v) = \arg\min_{v'} \|v - v'\|_2 \quad \text{s.t.} \quad \|v'\|_p \leq \xi. \tag{2}$$
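As an illustration of Eq. (2), the following is a minimal NumPy sketch of the projection for the two norms that appear in the experiments of this paper ($p = \infty$ for the iterative method and $p = 2$ for the SpchCMD model). The function name `project_lp_ball` and the sample vector are assumptions made for this example only; they are not taken from the authors' code.

```python
import numpy as np

def project_lp_ball(v, xi, p):
    """Project v onto the l_p ball of radius xi centred at 0 (p = 2 or inf)."""
    v = np.asarray(v, dtype=float)
    if p == 2:
        norm = np.linalg.norm(v)
        # Inside the ball: unchanged. Outside: rescale onto the sphere.
        return v if norm <= xi else v * (xi / norm)
    if p == np.inf:
        # The Euclidean projection onto the l_inf ball is element-wise clipping.
        return np.clip(v, -xi, xi)
    raise ValueError("this sketch only handles p = 2 and p = inf")

v = np.array([3.0, -4.0])                      # hypothetical perturbation, ||v||_2 = 5
v2 = project_lp_ball(v, xi=1.0, p=2)           # rescaled to unit l_2 norm
vinf = project_lp_ball(v, xi=1.0, p=np.inf)    # each entry clipped to [-1, 1]
```

For $p = \infty$ the projection reduces to clipping every component of the perturbation to $[-\xi, \xi]$, so the peak amplitude of the universal perturbation can never exceed $\xi$.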

The termination criterion for the algorithm is defined such that the attack success rate (ASR) on the perturbed training set reaches a threshold $1 - \delta$. In this protocol, the algorithm stops for untargeted perturbations when:

$$\mathrm{ASR}(X, v) := \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{k(x_i + v) \neq k(x_i)\} \geq 1 - \delta, \tag{3}$$

where $\mathbb{1}\{\cdot\}$ is the true-or-false indicator function. For a targeted attack, we replace the inequality $k(x_i + v) \neq k(x_i)$ by $k(x_i + v) = y_t$ in Eq. (3). The problem with the iterative greedy formulation is that the constraint is only defined on the universal perturbation ($\|v\|_p \leq \xi$). Therefore, adding the universal perturbation to the data points results in an audio signal that is out of a specific range such as [0, 1] for almost all audio samples, even when a small value is selected for $\xi$. In order to solve this problem, we may clip the value of each resulting data point to a valid range.
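The untargeted greedy loop described in this subsection can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_classify` and `toy_min_perturbation` are hypothetical stand-ins for the target model and for the inner attack of Eq. (1) (DDN or DeepFool in the paper), the projection is the $p = \infty$ case of Eq. (2), and perturbed samples are clipped to [0, 1] as discussed above.

```python
import numpy as np

def asr(classify, X, v):
    """Eq. (3), untargeted: share of samples whose label changes under v."""
    X_adv = np.clip(X + v, 0.0, 1.0)            # keep audio in the valid range
    return float(np.mean(classify(X_adv) != classify(X)))

def greedy_uap(classify, min_perturbation, X, xi, delta=0.1, max_passes=10):
    """Untargeted greedy loop: aggregate per-sample perturbations into v."""
    v = np.zeros(X.shape[1])
    for _ in range(max_passes):
        for x in X:
            x_adv = np.clip(x + v, 0.0, 1.0)
            fooled = classify(x_adv[None, :])[0] != classify(x[None, :])[0]
            if not fooled:
                r = min_perturbation(x_adv)     # Eq. (1), solved by any attack
                v = np.clip(v + r, -xi, xi)     # Eq. (2) for p = inf
        if asr(classify, X, v) >= 1.0 - delta:  # stopping rule of Eq. (3)
            break
    return v

# Hypothetical toy model: class 1 when the mean amplitude exceeds 0.5.
def toy_classify(X):
    return (X.mean(axis=1) > 0.5).astype(int)

# Hypothetical closed-form "attack" for the toy model (class-0 inputs only):
# the smallest constant shift that raises the mean just above 0.5.
def toy_min_perturbation(x):
    return np.full_like(x, 0.5 - x.mean() + 0.01)

X = np.array([[0.3, 0.3], [0.4, 0.4], [0.45, 0.45]])
v = greedy_uap(toy_classify, toy_min_perturbation, X, xi=0.3)
```

In this toy run the first sample alone determines the perturbation (a constant shift of about 0.21), which already fools the two remaining samples, so the loop stops after one pass.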

3.2 Penalty Method

The proposed penalty method minimizes an appropriate objective function on a batch of samples from a dataset for finding universal adversarial perturbations. In the case of noise perception in audio systems, the level of noise perception can be measured using a realistic metric such as the sound pressure level (SPL). Therefore, the SPL is used instead of the $\ell_p$ norm. In this paper, such a measure is used in one of the objective functions of the optimization problem, where one of the goals is to minimize the SPL of the perturbation, which is measured in decibels (dB) [22]. The problem of crafting a perturbation in a targeted attack can be reformulated as the following constrained optimization problem:

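$$\begin{aligned} &\text{minimize} && \mathrm{SPL}(v) \\ &\text{s.t.} && y_t = \arg\max_y P(y \mid x_i + v, \theta) \\ &\text{and} && 0 \leq x_i + v \leq 1 \quad \forall i \end{aligned} \tag{4}$$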

For untargeted attacks we use $y_l \neq \arg\max_y P(y \mid x_i + v, \theta)$, where $y_l$ is the legitimate class.

Unlike the iterative greedy formulation, in the proposed penalty method the second constraint is defined on the sum of the data points and the universal perturbation ($x_i + v$) to keep the perturbed example in a valid range. Since the DDN attack, which is used in the iterative greedy algorithm, requires the data to be in the range [0, 1], we impose the same constraint on the penalty method for a fair comparison between methods. This box constraint should be valid for all audio samples.

In Eq. (4), the pressure level of an audio waveform (noise) can be computed as:

$$\mathrm{SPL}(v) = 20 \log_{10} P(v), \tag{5}$$

where $P(v)$ is the root mean square (RMS) of the perturbation signal $v$ of length $N$, which is given by:

$$P(v) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} v_n^2}, \tag{6}$$

where $v_n$ denotes the $n$-th component of the array $v$.

The optimization problem introduced in Eq. (4) can be solved by a gradient-based algorithm, which however does not enforce the box constraint. Therefore, we need to introduce a new parameter $w$, which is defined in Eq. (7), to ensure that the box constraint is satisfied. This change of variable is inspired by the $\ell_2$ attack of Carlini and Wagner [8].

$$w_i = \frac{1}{2}\left(\tanh(x'_i + v') + 1\right), \tag{7}$$

where $x'_i = \operatorname{arctanh}((2x_i - 1)(1 - \varepsilon))$ and $v' = \operatorname{arctanh}((2v - 1)(1 - \varepsilon))$ are the audio example $x_i$ and the perturbation vector $v$ represented in the tanh space, respectively, and $\varepsilon$ is a small constant that depends on the extreme values of the transformed signal and ensures that $x'_i$ and $v'$ do not take infinite values. For instance, $\varepsilon = 10^{-7}$ is a suitable value for the datasets used in Section 4. The audio example $x_i$ must be transformed to tanh space and then Eq. (7) can be used to transform the perturbed data to the valid range of [0, 1]. Since $-1 \leq \tanh(x'_i + v') \leq 1$, then $0 \leq w_i \leq 1$ and the solution will be valid according to the box constraint. Refer to Appendix B for details. As a result of this transformation, the produced perturbation vector is also in tanh space, and $v$ can be written as:

$$v = \frac{\tanh(v') + 1 - \varepsilon}{2 - 2\varepsilon}. \tag{8}$$

In order to solve the optimization problem defined in Eq. (4), we rewrite the variable $v'$ as follows:

$$v' = \frac{1}{2} \ln\left(\frac{w_i}{1 - w_i}\right) - x'_i. \tag{9}$$
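The change of variable in Eqs. (7) and (9) can be checked numerically. The sketch below, which assumes NumPy and uses illustrative function names and sample values that do not come from the paper, maps a signal to tanh space, applies an arbitrary tanh-space perturbation, and then recovers that perturbation from $w_i$.

```python
import numpy as np

EPS = 1e-7  # value of epsilon quoted in the text

def to_tanh_space(x, eps=EPS):
    """x' = arctanh((2x - 1)(1 - eps)) for x in [0, 1]."""
    return np.arctanh((2.0 * x - 1.0) * (1.0 - eps))

def perturbed_signal(x_t, v_t):
    """Eq. (7): w = (tanh(x' + v') + 1) / 2, which always lies in [0, 1]."""
    return 0.5 * (np.tanh(x_t + v_t) + 1.0)

def recover_v_tanh(w, x_t):
    """Eq. (9): v' = 0.5 * ln(w / (1 - w)) - x'."""
    return 0.5 * np.log(w / (1.0 - w)) - x_t

x = np.array([0.0, 0.25, 0.5, 0.9, 1.0])      # samples, including both extremes
v_t = np.array([5.0, -0.3, 0.1, 2.0, -5.0])   # arbitrary tanh-space perturbation
x_t = to_tanh_space(x)                         # finite thanks to the (1 - eps) factor
w = perturbed_signal(x_t, v_t)                 # satisfies the box constraint
v_back = recover_v_tanh(w, x_t)                # matches v_t up to rounding
```

Without the $(1 - \varepsilon)$ factor, the samples equal to 0 or 1 would map to $\mp\infty$; with it, every entry of `x_t` stays finite and the box constraint holds for any real-valued $v'$.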

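The details of expressing $v'$ as a function of $w_i$ are shown in Appendix A. Therefore, we propose a penalty method that optimizes the following objective function:

$$\begin{aligned} \min_{w_i} L(w_i, t) &= \mathrm{SPL}\left(\frac{1}{2} \ln\left(\frac{w_i}{1 - w_i}\right) - x'_i\right) + c \cdot G(w_i, t), \\ G(w_i, t) &= \max\left\{\max_{j \neq t}\{f(w_i)_j\} - f(w_i)_t,\ -\kappa\right\} \end{aligned} \tag{10}$$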

where $t$ is the target class, $f(w_i)_j$ is the output of the pre-softmax layer (logit) of a neural network for class $j$, $c$ is a positive constant known as the "penalty coefficient" and $\kappa$ controls the confidence level of sample misclassification. This formulation enables the attacker to control the confidence level of the attack. For untargeted attacks, we modify the objective function of Eq. (10) as:

$$\begin{aligned} \min_{w_i} L(w_i, y_l) &= \mathrm{SPL}\left(\frac{1}{2} \ln\left(\frac{w_i}{1 - w_i}\right) - x'_i\right) + c \cdot G(w_i, y_l), \\ G(w_i, y_l) &= \max\left\{f(w_i)_{y_l} - \max_{j \neq y_l}\{f(w_i)_j\},\ -\kappa\right\} \end{aligned} \tag{11}$$

where $y_l$ is the legitimate label for the $i$-th sample of the batch. $G(w_i, t)$ is the hinge loss penalty function, which, for a targeted attack and $\kappa = 0$, must satisfy:

$$\begin{aligned} G(w_i, t) &= 0 \quad \text{if } y_t = \arg\max_y P(y \mid w_i, \theta), \\ G(w_i, t) &> 0 \quad \text{if } y_t \neq \arg\max_y P(y \mid w_i, \theta). \end{aligned} \tag{12}$$

The same properties of the penalty function are also valid for untargeted perturbations. This penalty function is convex and has subgradients; therefore, a gradient-based optimization algorithm, such as the Adam algorithm [45], can be used to minimize the finite-sum loss defined in Eqs. (10) and (11). Several other optimization algorithms, namely AdaGrad [46], standard gradient descent, gradient descent with Nesterov momentum [47] and RMSProp [48], have also been evaluated; they produce relatively similar solutions, but Adam converges in fewer iterations. Algorithm 1 presents the pseudo-code of the proposed penalty method.
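The hinge penalty $G$ of Eqs. (10) and (11) only needs the logits of one perturbed sample. The following NumPy sketch is an illustration, not the authors' code: the function name `hinge_penalty` and the example logits are assumptions. With $\kappa = 0$ it reproduces the two cases of Eq. (12).

```python
import numpy as np

def hinge_penalty(logits, label, kappa=0.0, targeted=True):
    """G from Eq. (10) (targeted) or Eq. (11) (untargeted) for one sample."""
    logits = np.asarray(logits, dtype=float)
    others = np.delete(logits, label)           # logits of all classes j != label
    if targeted:
        margin = others.max() - logits[label]   # <= 0 once the target class wins
    else:
        margin = logits[label] - others.max()   # <= 0 once another class wins
    return float(max(margin, -kappa))

logits = [2.0, 0.5, 1.0]                        # hypothetical pre-softmax outputs
g_hit = hinge_penalty(logits, 0)                # target 0 is already the arg max
g_miss = hinge_penalty(logits, 1)               # target 1 is not predicted yet
g_untargeted = hinge_penalty(logits, 0, targeted=False)  # true class 0 still wins
g_confident = hinge_penalty(logits, 0, kappa=10.0)       # keeps pushing the margin
```

With $\kappa > 0$ the penalty stays negative rather than zero after the attack succeeds, so the optimizer keeps widening the logit margin until it reaches $\kappa$; this is how $\kappa$ controls the confidence of the misclassification.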

Theorem 1: Let $\{v^k\}$, $k = 1, \ldots, \infty$ be the sequence generated by the proposed penalty method in Algorithm 1 for $k$ iterations. Let $v$ be the limit point of $\{v^k\}$. Then any limit point of the sequence is a solution to the original optimization problem defined in Eq. (4).$^{1}$

Proof: According to Eqs. (8) and (9), $v^k$ can be defined as:

$$v'^{k} = \frac{1}{2} \ln\left(\frac{w_i^k}{1 - w_i^k}\right) - x'_i, \qquad v^k = \frac{\tanh(v'^{k}) + 1 - \varepsilon}{2 - 2\varepsilon}. \tag{13}$$

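Before proving Theorem 1, a useful lemma is presented and proved.

Lemma 1: Let $v^*$ be the optimal value of the original constrained problem defined in Eq. (4). Then $\mathrm{SPL}(v^*) \geq L(w_i^k, t) \geq \mathrm{SPL}(v^k)\ \forall k$.

Proof of Lemma 1:

$$\begin{aligned} \mathrm{SPL}(v^*) &= \mathrm{SPL}(v^*) + c \cdot G(w_i^*, t) && (\because G(w_i^*, t) = 0) \\ &\geq \mathrm{SPL}(v^k) + c \cdot G(w_i^k, t) && (\because c > 0,\ G(w_i^k, t) \geq 0,\ w_i^k \text{ minimizes } L(w_i^k, t)) \\ &\geq \mathrm{SPL}(v^k) \end{aligned}$$

$\therefore \mathrm{SPL}(v^*) \geq L(w_i^k, t) \geq \mathrm{SPL}(v^k)\ \forall k$.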

Proof of Theorem 1. SPL is a monotonically increasing and continuous function. Also, $G$ is a hinge function, which is continuous. $L$ is the sum of two continuous functions. Therefore, it is also a continuous function. The limit point of $\{v^k\}$ is defined as $v = \lim_{k \to \infty} v^k$ and, since SPL is a continuous function, $\mathrm{SPL}(v) = \lim_{k \to \infty} \mathrm{SPL}(v^k)$. We can conclude that:

$$\begin{aligned} L^* &= \lim_{k \to \infty} L(w_i^k, t) \leq \mathrm{SPL}(v^*) && (\because \text{Lemma 1}) \\ L^* &= \lim_{k \to \infty} \mathrm{SPL}(v^k) + \lim_{k \to \infty} c \cdot G(w_i^k, t) \leq \mathrm{SPL}(v^*) \\ L^* &= \mathrm{SPL}(v) + \lim_{k \to \infty} c \cdot G(w_i^k, t) \leq \mathrm{SPL}(v^*). \end{aligned}$$

If $v^k$ is a feasible point for the constrained optimization problem defined in Eq. (4), then, from the definition of function $G(\cdot)$, one can conclude that $\lim_{k \to \infty} c \cdot G(w_i^k, t) = 0$. Then:

$$L^* = \mathrm{SPL}(v) \leq \mathrm{SPL}(v^*)$$

$\therefore v$ is a solution of the problem defined in Eq. (4).

4 EXPERIMENTAL PROTOCOL AND RESULTS

The proposed UAP is evaluated on two audio tasks: environmental sound classification and speech command recognition. For environmental sound classification, we have used the full audio recordings of the UrbanSound8k dataset [49] downsampled to 16 kHz for training and evaluating the models as well as for generating adversarial perturbations. This dataset consists of 7.3 hours of audio recordings split into 8,732 audio clips of up to slightly more than three seconds. Therefore, each audio sample is represented by a 50,999-dimensional array. The audio clips were categorized into 10 classes. The dataset was split into training (80%), validation (10%) and test (10%) sets. For generating perturbations, 1,000 samples of the training set were randomly selected, and for the penalty-based method a mini-batch size of 100 samples is used. The perturbations were evaluated on the whole test set (874 samples).

1. Theorem 1 applies to the context of convex optimization. The neural network is defined as a functional constraint on the optimization problem defined in Eq. (4). Since neural networks are not convex, a feasible solution to the optimization problem may not be unique and it is not guaranteed to be a global optimum. Moreover, Theorem 1 is proved based on the assumption defined in Eq. (12), i.e. $\kappa = 0$.

Algorithm 1: Penalty method for universal adversarial audio perturbations.

Input: Data points $X = \{x_1, \ldots, x_m\}$ with corresponding legitimate labels $Y$, desired fooling rate on perturbed samples $\delta$, and target class $t$ (for targeted attacks)
Output: Universal perturbation signal $v'$

initialize $v' \leftarrow 0$
while $\mathrm{ASR}(X, v') \leq 1 - \delta$ do
    Sample a mini-batch of size $S$ from $(X, Y)$
    $g \leftarrow 0$
    for $i \leftarrow 1$ to $S$ do
        Transform the audio signal to tanh space: $x'_i = \operatorname{arctanh}((2x_i - 1)(1 - \varepsilon))$
        Compute the transformation of the perturbed signal for each sample $i$ from the mini-batch: $w_i = \frac{1}{2}(\tanh(x'_i + v') + 1)$
        Compute the gradient of the objective function, i.e., Eq. (10) or Eq. (11), w.r.t. $w_i$:
        if targeted attack then
            $g \leftarrow g + \partial L(w_i, t) / \partial w_i$
        else
            $g \leftarrow g + \partial L(w_i, y_i) / \partial w_i$
    Compute update $\Delta v'$ using $g$ according to the Adam update rule [45]
    apply update: $v' \leftarrow v' + \Delta v'$
return $v$

For speech command recognition, we have used the Speech Command dataset, which consists of 61.83 hours of audio sampled at 16 kHz [50] and categorized into 32 classes. The training set consists of 17.8 hours of speech command recordings split into 64,271 audio clips of one second. The test set consists of 158,537 audio samples corresponding to 44.03 hours of speech commands. This dataset was used as the benchmark dataset for the TensorFlow Speech Recognition (TFSR) challenge in 2017$^{2}$. This challenge was based on the principles of open science, so the data, source code, and evaluation methods of over 1,300 participant teams are publicly available for commercial and non-commercial usage. As shown in this study, this makes it quite straightforward for the adversary to attack the model.

2. https://www.kaggle.com/c/tensorflow-speech-recognition-challenge

Signal-to-noise ratio (SNR) is used as a metric to measure the level of noise with respect to the original signal. This metric, which is also measured in dB, is used for measuring the level of the perturbation of the signal after adding the universal perturbation. This measure is also used in previous works for evaluating the quality of the generated adversarial audio attacks [32], [33], and it is defined as:

$$\mathrm{SNR}(x, v) = 20 \log_{10} \frac{P(x)}{P(v)}, \tag{14}$$

where $P(\cdot)$ is the RMS of the signal defined in Eq. (6). A high SNR indicates that a low level of noise is added to the audio sample by the universal adversarial perturbation. Additionally, the relative loudness of the perturbation with respect to the audio sample, measured in dB, is also applied:

$$l_{dB_x}(v) = l_{dB}(v) - l_{dB}(x), \tag{15}$$

where $l_{dB}(x) = \max_n(20 \log_{10}(x_n))$ and $x_n$ denotes the $n$-th component of the array $x$. This measure is similar to the $\ell_\infty$ norm in the image domain. Smaller values indicate quieter distortions. This measure is also applied in recent studies for quality assessment of adversarial attacks against audio processing models [22], [42], [53].
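The three dB measures used in this paper, Eqs. (5), (14) and (15), can be computed in a few lines. The sketch below is illustrative only: the function names and the test signals are assumptions, and taking the absolute value inside `peak_db` is an added assumption so that negative samples do not reach the logarithm (Eq. (15) is written without it).

```python
import numpy as np

def rms(s):
    """Eq. (6): root mean square of a signal."""
    return float(np.sqrt(np.mean(np.square(s))))

def spl_db(v):
    """Eq. (5): sound pressure level of the perturbation in dB."""
    return 20.0 * np.log10(rms(v))

def snr_db(x, v):
    """Eq. (14): signal-to-noise ratio in dB."""
    return 20.0 * np.log10(rms(x) / rms(v))

def peak_db(s):
    """l_dB: loudest sample in dB (absolute value is an assumption here)."""
    return 20.0 * np.log10(np.max(np.abs(s)))

def relative_loudness_db(x, v):
    """Eq. (15): loudness of v relative to x; more negative means quieter."""
    return peak_db(v) - peak_db(x)

x = np.array([0.5, -0.5, 0.5, -0.5])        # hypothetical audio sample
v = np.array([0.05, 0.05, -0.05, -0.05])    # hypothetical perturbation
snr = snr_db(x, v)                           # 20 dB: x is ten times stronger
rel = relative_loudness_db(x, v)             # -20 dB
```

As a sanity check on the scale, a perturbation whose peak is 0.12 measured against a full-scale sample (peak 1.0) gives $20 \log_{10}(0.12) \approx -18.4$ dB, which appears to correspond to the value that repeats in the targeted iterative rows of Table 1.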

We have chosen a family of diverse end-to-end architectures as our target models. This selection is based on choosing architectures which might learn representations directly from the audio signal. We briefly describe the architecture of each model as follows. 1D CNN Rand, 1D CNN Gamma, ENVnet-V2, SincNet and SincNet+VGG19 are used only for environmental sound classification, while the SpchCMD model is used for speech command recognition. A detailed description of the architectures can be found in the supplementary material.

1D CNN Rand [52]: This model consists of five one-dimensional convolutional layers (CLs). The output of the CLs is used as input to two fully connected (FC) layers followed by an output layer with a softmax activation function. The weights of all of the layers are initialized randomly. This model was proposed by Abdoli et al. [52] for environmental sound classification.

1D CNN Gamma [52]: This model is similar to 1D CNN Rand except that it employs a Gammatone filter-bank in its first layer. Furthermore, this layer is kept frozen during the training process. Gammatone filters are used to decompose the input signal into appropriate frequency bands.

ENVnet-V2 [51]: The architecture for sound recognition was slightly modified to make it compatible with the input size of the downsampled audio samples of the UrbanSound8k dataset. This architecture uses the raw audio signal as input, and it extracts short-time frequency features by using two one-dimensional CLs followed by a pooling layer (PL). It then swaps axes and convolves the features in time and in frequency domain using five two-dimensional CLs. Two FC layers and an output layer with a softmax activation function complete the network.

SincNet [19]: The end-to-end architecture for sound processing extracts meaningful features from the audio signal at its first layer. In this model, several sinc functions are used as band-pass filters and only low and high cutoff frequencies are learned from audio. After that, two one-dimensional CLs are applied. Two FC layers followed by an output layer with softmax activation are used for classification.

SincNet+VGG19 [19]: This model uses sinc filters to extract features from the raw audio signal as in SincNet [19]. After a one-dimensional maxpooling layer, the output is stacked along the time axis to form a 2D representation. This time-frequency representation is used as the input to a VGG19 network [54] followed by an FC layer and an output layer with softmax activation for classification. This time-frequency representation resembles a spectrogram representation of the audio signal.

[Fig. 1: six line plots, panels (a) to (f). Horizontal axis in every panel: confidence value (0 to 80). Vertical axes: mean ASR (%) in (a) and (d), mean SNR (dB) in (b) and (e), mean ldBx(v) in (c) and (f). Each panel shows one curve for untargeted and one for targeted attacks.]

Fig. 1. Effect of different confidence values on the mean values of ASR, SNR and ldBx(v) for targeted and untargeted attacks. Top row (a, b, c): ENVnet-V2 [51] used as target model. Bottom row (d, e, f): 1D CNN Gamma [52] used as target model.

SpchCMD: This architecture was proposed by the winner of the TFSR challenge. The model is based on a CNN, which uses one-dimensional CLs and several depth-wise CLs for extracting useful information from several chunks of raw audio. The model also uses an attention layer and a global average PL. According to the challenge rules, the model must handle silence signals and also unknown samples from other proposed classes in the Speech Command dataset [50]. Therefore, the softmax layer has 32 outputs.

For the iterative method, several parameters must be chosen. In order to find the minimal perturbation $\Delta v_i$, we used the DDN $\ell_2$ attack [43]. This attack is designed to efficiently find small perturbations that fool the model. The difference with DeepFool [44] is that it can be used for both untargeted and targeted attacks, extending the iterative method to the targeted scenario. DDN was used with a budget of 50 steps and an initial norm of 0.2. Results are reported for $p = \infty$ and we set $\xi$ to 0.2 and 0.12 for untargeted and targeted attack scenarios, respectively. These values were chosen to craft perturbations in which the norm is much lower than the norm of the audio samples in the dataset.

For evaluating the penalty method we set the penalty coefficient $c$ to 0.2 and 0.15 for untargeted and targeted attack scenarios, respectively. The confidence value $\kappa$ is set to 40 and 10 for crafting untargeted and targeted perturbations, respectively. For both methods, based on our initial experiments, we found that these values are appropriate to produce fine quality perturbed samples within a reasonable number of iterations. Fig. 1 shows the effect of different confidence values on mean ASR, mean SNR and mean ldBx(v) for targeted and untargeted attacks on the ENVnet-V2 [51] and 1D CNN Gamma [52] models for the test set. For this experiment, 1,000 and 500 training samples are used for untargeted and targeted attack scenarios, respectively. For targeted attacks, the target class is "Gun shot". Fig. 1 shows that the ASR increases as the confidence value increases. However, the SNR decreases correspondingly. For untargeted attacks, increasing the confidence value has a more detrimental effect on the mean ldBx(v) than in the targeted scenario. For both iterative and penalty methods, we set the desired fooling rate on perturbed training samples to $\delta = 0.1$. Both algorithms terminate when they either achieve the desired fooling rate or reach 100 iterations. Both algorithms have been trained and tested using a TITAN Xp GPU.

4.1 ResultsTable 1 shows the results of the iterative and penaltymethods against the five environmental sound classifica-tion models considered in this study. We evaluate bothtargeted and untargeted attacks in terms of mean ASR onthe training and test sets, as well as mean SNR and meanldBx(v) of the perturbed samples of the test set. For craftingthe perturbation vector 1,000 randomly chosen samples ofthe training set is used. For both untargeted and targetedattacks, the penalty method produces the highest ASR forall target models. Both methods produce relatively similarmean SNR on the test set. For targeted scenario, the penaltymethod produces better results in terms of mean ldBx(v) forattacking SincNet and SincNet+VGG19 models. The itera-tive method, however, works slightly better for attackingENVnet-V2 model. For the rest of the models, the differenceis negligible. The projection operation on `∞ ball used inthe iterative method is like clipping the perturbation vector.Since the maximum amplitude of the perturbation vectorsurpasses the limit (ξ= 0.12) while producing the perturba-tion, the projection operation causes the method to achievethe same mean ldBx(v) for all methods in this attacking sce-nario. For the untargeted scenario, the penalty method alsoproduces quieter perturbations in terms of mean ldBx(v) forattacking 1D CNN Gamma, SincNet and SincNet+VGG19models. The crafted universal perturbations produced fromthe training set by the penalty method generalize better thanthose produced by the iterative method for both attackingscenarios. We also observe a relatively low difference inASR between the training and test sets. The detailed resultsof the targeted attack scenario for each model is reported


TABLE 1
Mean values of ASR, SNR and l_dBx(v) on training and test sets for untargeted and targeted perturbations. Higher ASRs are in boldface.

                           Targeted attack                          Untargeted attack
                           Training  Test                           Training  Test
Method     Model           ASR       ASR     SNR (dB)  l_dBx(v)     ASR       ASR     SNR (dB)  l_dBx(v)
Iterative  1D CNN Rand     0.926     0.672   25.321    -18.416      0.911     0.412   25.244    -14.258
Iterative  1D CNN Gamma    0.945     0.795   23.587    -18.416      0.904     0.737   22.922    -13.979
Iterative  ENVnet-V2       0.916     0.767   22.734    -18.416      0.910     0.669   24.960    -13.979
Iterative  SincNet         1.000     0.899   28.668    -21.044      0.915     0.886   24.025    -18.959
Iterative  SincNet+VGG19   0.985     0.872   26.203    -18.416      0.924     0.838   26.362    -14.007
Penalty    1D CNN Rand     0.917     0.854   23.468    -16.762      0.900     0.876   20.350    -13.371
Penalty    1D CNN Gamma    0.913     0.888   22.835    -17.375      0.901     0.858   20.551    -18.133
Penalty    ENVnet-V2       0.922     0.877   21.832    -14.198      0.900     0.831   18.727    -10.490
Penalty    SincNet         0.962     0.971   30.411    -31.916      0.900     0.919   29.972    -27.494
Penalty    SincNet+VGG19   0.916     0.898   26.736    -21.059      0.902     0.865   23.555    -17.759

TABLE 2
Mean values of ASR, SNR and l_dBx(v) on training and test sets for untargeted and targeted perturbations for targeting the speech recognition model (SpchCMD). Higher ASRs are in boldface.

           Targeted attack                          Untargeted attack
           Training  Test                           Training  Test
Method     ASR       ASR     SNR (dB)  l_dBx(v)     ASR       ASR     SNR (dB)  l_dBx(v)   Model Acc.
Iterative  0.902     0.855   27.437    -18.924      0.901     0.834   28.524    -18.418    0.192
Penalty    0.903     0.850   26.728    -24.575      0.910     0.875   26.716    -21.462    0.191

in the supplementary material. For a better assessment of the perturbations produced by both proposed methods, several randomly chosen examples of perturbed audio samples are presented in the supplementary material.
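The clipping behaviour of the ℓ∞ projection discussed for Table 1 can be illustrated with a short sketch. It assumes that the perturbation level is peak-based, 20·log10(max|v|), and that the clean signal has unit peak amplitude; both are assumptions made here for illustration, and the helper names are hypothetical. Under these assumptions, a perturbation whose peak is clipped to ξ = 0.12 has a level of about -18.416 dB, which matches the value that recurs for the iterative method in Table 1.

```python
import numpy as np

def project_linf(v, xi):
    """Project v onto the l-infinity ball of radius xi (element-wise clipping)."""
    return np.clip(v, -xi, xi)

def peak_level_db(v):
    """Peak-based level in dB (assumed definition: 20*log10 of the largest magnitude)."""
    return 20.0 * np.log10(np.max(np.abs(v)))

# A perturbation whose amplitude exceeds the limit xi = 0.12 in some entries.
v = np.array([0.30, -0.05, 0.12, -0.20])
v_clipped = project_linf(v, xi=0.12)

print(v_clipped)                 # entries beyond +/-0.12 are saturated at the bound
print(peak_level_db(v_clipped))  # about -18.416 dB
```

Whenever the unconstrained perturbation exceeds ξ somewhere, the clipped peak equals ξ exactly, so the peak-based level no longer depends on the model being attacked.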

Table 2 shows the results achieved by the iterative and penalty methods for targeting the SpchCMD model. To find the universal perturbation vector, 3,000 samples from the training set were randomly selected. All samples of the test set were used to evaluate the effectiveness of the perturbation vector. For targeting this model, the perturbation generated by the penalty method is also projected onto the ℓ2 ball according to Eq. (2) with radius ξ = 6. Based on preliminary experiments for this task, this projection produces slightly better results in terms of mean SNR and mean l_dBx(v). For the targeted attack scenario, both methods produce similar results in terms of ASR and SNR; however, the penalty method generates higher-quality perturbation vectors in terms of mean l_dBx(v). For untargeted attacks, the penalty method produces better results in terms of mean ASR and mean l_dBx(v), and both methods produce similar perturbations in terms of SNR. Roughly speaking, SNRs higher than 20 dB are considered acceptable. As mentioned by Yang et al. [53], perturbed audio samples with l_dBx(v) ranging from -15 dB to -45 dB are tolerable to human ears. From Tables 1 and 2, the mean SNR and l_dBx(v) of the perturbed samples are mostly in this range. Neekhara et al. [42] also reported relatively similar results (between -29.82 dB and -41.86 dB) for attacking a speech-to-text model in an untargeted scenario. The model predictions after adding the perturbations were submitted to the evaluation system of the TFSR challenge to evaluate the impact of the perturbations on the model accuracy. Table 2 shows that the accuracy of the winning speech recognition model (Model Acc.) drops to nearly random guessing for the perturbations produced by both proposed algorithms. The success rates reported in Tables 1 and 2 are also statistically significant; refer to the supplementary material for details.
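The ℓ2 projection and the two quality measures used in this discussion can be sketched as follows. This is a minimal illustration under stated assumptions: the projection is the standard rescaling onto a ball of radius ξ, the SNR is the power ratio in dB, and l_dBx(v) is taken as the difference of peak levels. The exact definitions are given earlier in the paper (Eq. (2) and the evaluation metrics) and may differ in detail; the function names are hypothetical.

```python
import numpy as np

def project_l2(v, xi):
    """Project v onto the l2 ball of radius xi (rescale only if the norm exceeds xi)."""
    norm = np.linalg.norm(v)
    if norm <= xi:
        return v
    return v * (xi / norm)

def snr_db(x, v):
    """Signal-to-noise ratio in dB: clean-signal power over perturbation power (assumed form)."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(v ** 2))

def relative_loudness_db(x, v):
    """l_dBx(v) = l_dB(v) - l_dB(x), with l_dB(s) = 20*log10(max|s|) (assumed form)."""
    return 20.0 * np.log10(np.max(np.abs(v))) - 20.0 * np.log10(np.max(np.abs(x)))

x = np.ones(4)              # toy "clean" signal
v = 0.1 * np.ones(4)        # toy perturbation, ten times smaller in amplitude
print(snr_db(x, v))                 # 20 dB for a 10x amplitude ratio
print(relative_loudness_db(x, v))   # -20 dB for the same ratio
print(project_l2(np.array([3.0, 4.0]), xi=1.0))  # rescaled to unit norm
```

With these definitions, a larger SNR and a more negative l_dBx(v) both indicate a quieter perturbation, which is how the two columns of Tables 1 and 2 are read in the text.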

We now consider the influence of the number of training data points on the quality of the universal perturbations. Fig. 2 shows the ASR and mean SNR achieved on the test set with different numbers of data points, considering two target models. The untargeted attack is evaluated on SincNet and the targeted attack is evaluated on the 1D CNN Gamma model. For targeted attacks, the target class is "Gun shot". For both targeted and untargeted scenarios, the penalty method produces a better ASR when the perturbations are crafted with a lower number of data points. For the untargeted scenario, the iterative method produces perturbations with a slightly better mean SNR than those produced by the penalty method. However, this difference is perceptually negligible. When the number of data points is limited, the penalty method also produces better results in terms of mean l_dBx(v). However, the iterative method produces slightly better results in terms of mean l_dBx(v) when more data points are available (e.g. more than 50 samples). For the targeted attack scenario, when the number of data points is limited (e.g. lower than 100), the iterative method also produces perturbations with a slightly better mean SNR. However, when the number of data points increases, the penalty method produces better perturbations in terms of SNR than those produced by the iterative method. For such a scenario, the penalty method also produces quieter perturbations in terms of mean l_dBx(v). The greedy algorithm


[Figure 2: six line plots comparing the Penalty and Iterative methods. Horizontal axis of every panel: number of data points (logarithmic scale, 10^1 to 10^3). Vertical axes: (a, d) mean ASR (%), (b, e) mean SNR (dB), (c, f) mean l_dBx(v).]

Fig. 2. Effect of the number of data points on the mean values of ASR, SNR and l_dBx(v) for the test set. Top row (a, b, c): Targeted attack on the 1D CNN Gamma model [52]. Bottom row (d, e, f): Untargeted attack on the SincNet model [19].

TABLE 3
Results of the untargeted attack in terms of mean values of ASR, SNR and l_dBx(v) on the test set. The perturbation is generated by a single randomly selected sample of each class. The proposed penalty method is used to generate attacks against the 1D CNN Gamma model [52].

Available class  AI       CA       CH       DO       DR       EN       GU       JA       SI       ST       Mean
ASR              0.641    0.722    0.730    0.732    0.720    0.561    0.688    0.796    0.677    0.713    0.698
SNR              17.718   18.317   19.597   18.813   16.699   19.039   17.908   20.046   18.014   18.743   18.489
l_dBx(v)         -15.130  -16.574  -16.322  -16.052  -15.048  -16.963  -15.672  -15.510  -15.088  -16.086  -15.844

used in the iterative method is designed to generate perturbations with the least possible power level. At each iteration, if the current perturbation vector already causes an example to be misclassified, that example is ignored in the subsequent iterations. Therefore, the attacker cannot trade a slightly higher SPL of the universal perturbation for a higher ASR, especially when the number of samples is limited. In the penalty method, however, the algorithm can generate more successful universal perturbations at the expense of a negligibly higher power level, as long as the gradient-based algorithm can minimize the objective functions defined in Eqs. (10) and (11). Moreover, in every iteration of the penalty method, the algorithm exploits all available data to minimize the objective function for generating the universal perturbation.

Another advantage of the proposed penalty method is that it is also able to generate perturbations when only a single audio example is available to the attacker. To this end, a single audio sample of each class is randomly selected from the training set, and the objective functions defined in Eqs. (10) and (11) are minimized iteratively using Algorithm 1 for targeted and untargeted attacks, respectively. For this experiment, we set the penalty coefficient c=0.2 and the confidence value κ=90 for both untargeted and targeted attack scenarios. The algorithm is executed for 19 iterations and the perturbation produced is used to perturb all audio samples of the test set, which are then used to fool the 1D CNN Gamma model.
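To make the roles of the penalty coefficient c and the confidence value κ concrete, the following sketch evaluates a hinge-style penalty objective on a batch of logits. Eqs. (10) and (11) are not reproduced in this section, so the forms below are an assumption in the spirit of margin-based attack losses, not the paper's exact objective; the function names and the toy logits are hypothetical.

```python
import numpy as np

def targeted_hinge(logits, target, kappa):
    """Per-sample max(max_{j != t} z_j - z_t, -kappa): small when class t wins by a margin."""
    logits = np.asarray(logits, dtype=float)
    z_t = logits[:, target]
    others = np.delete(logits, target, axis=1)
    return np.maximum(others.max(axis=1) - z_t, -kappa)

def untargeted_hinge(logits, labels, kappa):
    """Per-sample max(z_y - max_{j != y} z_j, -kappa): small when the true class y loses."""
    logits = np.asarray(logits, dtype=float)
    rows = np.arange(logits.shape[0])
    z_y = logits[rows, labels]
    masked = logits.copy()
    masked[rows, labels] = -np.inf  # exclude the true class from the max
    return np.maximum(z_y - masked.max(axis=1), -kappa)

def penalty_objective(loudness_db, hinge_values, c):
    """Assumed objective: perturbation loudness plus c times the batch-mean hinge term."""
    return loudness_db + c * np.mean(hinge_values)

# Toy logits of the perturbed batch (2 samples, 3 classes).
logits = np.array([[1.0, 5.0, 2.0],
                   [4.0, 0.0, 3.0]])
hinge = targeted_hinge(logits, target=1, kappa=90.0)
print(hinge)                                      # negative where class 1 already wins
print(penalty_objective(-20.0, hinge, c=0.2))     # loudness value here is illustrative
```

Because the hinge term is averaged over the whole batch, every available sample contributes to each update, which is the property the text contrasts with the greedy per-sample procedure.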

Table 3 shows the results of the untargeted attack in terms of ASR and SNR on the perturbed samples of the test set. Fig. 3 shows the results of the targeted attack in terms of SNR, ASR and mean l_dBx(v), considering all classes. In this case, the perturbation is also generated from a single randomly selected audio sample of each class in order to attack the model for a specific target class. Mean ASRs of 0.698 and 0.602 are achieved for the untargeted and targeted attack scenarios, respectively. Moreover, mean SNRs of 18.489 dB and 19.690 dB are achieved for the untargeted and targeted attack scenarios, respectively. Mean l_dBx(v) values of -15.844 dB and -15.571 dB are also achieved for the untargeted and targeted attack scenarios, respectively. Similar results were also obtained for all other target models.

Finally, Table 4 shows the transferability of perturbed audio samples of the test set of the UrbanSound8k dataset generated by one model to other models. Generating transferable adversarial perturbations among the proposed models is a challenging task [23], [55]. Abdullah et al. [23] have shown that transferability is challenging for most of the current audio attacks. This problem is even more challenging for gradient-based optimization attacks. Liu et al. [56] introduced an ensemble-based approach that seems to be a promising research direction to produce more transferable audio examples.
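The way a transferability table such as Table 4 is filled in can be sketched as follows: cell (i, j) holds the success rate of the perturbation crafted for model i when it is evaluated on model j. The sketch is a toy illustration; the models are hypothetical threshold classifiers, and the untargeted success criterion used here (prediction differs from the reference label) is an assumption.

```python
import numpy as np

def transfer_matrix(models, perturbations, X, y):
    """ASR of perturbation i (crafted for model i) when evaluated on model j; diagonal left as NaN."""
    n = len(models)
    asr = np.full((n, n), np.nan)
    for i, v in enumerate(perturbations):
        for j, predict in enumerate(models):
            if i == j:
                continue  # the diagonal is reported as N/A
            asr[i, j] = np.mean(predict(X + v) != y)
    return asr

# Two hypothetical toy "models" mapping a batch of inputs to class labels.
models = [
    lambda X: (X.mean(axis=1) > 0).astype(int),
    lambda X: (X[:, 0] > 0).astype(int),
]
X = np.array([[1.0, 1.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, 1, 0])
perturbations = [np.array([-1.5, -1.5]), np.array([-3.0, 0.0])]

print(transfer_matrix(models, perturbations, X, y))
```

Reading a row of the resulting matrix shows how well one model's perturbation carries over to the other models, which is how the rows of Table 4 are interpreted.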

5 CONCLUSION

In this study, we proposed an iterative method and a penalty method for generating targeted and untargeted universal adversarial audio perturbations. Both methods were used


[Figure 3: three heat maps. Rows: available class (the class of the single sample used to craft the perturbation). Columns: target class. The last row and the last column (AVG) hold averages.]

(a) ASR

Avail. \ Target  AI      CA      CH      DO      DR      EN      GU      JA      SI      ST      AVG
AI               0.563   0.473   0.554   0.673   0.799   0.722   0.514   0.888   0.637   0.546   0.637
CA               0.629   0.582   0.53    0.643   0.833   0.746   0.497   0.86    0.578   0.608   0.651
CH               0.646   0.549   0.479   0.559   0.836   0.763   0.207   0.836   0.523   0.589   0.599
DO               0.64    0.519   0.564   0.698   0.847   0.751   0.465   0.894   0.606   0.643   0.663
DR               0.645   0.28    0.506   0.667   0.789   0.745   0.509   0.867   0.606   0.617   0.623
EN               0.608   0.557   0.497   0.683   0.83    0.715   0.57    0.894   0.522   0.638   0.651
GU               0.612   0.338   0.529   0.588   0.827   0.722   0.457   0.824   0.486   0.562   0.594
JA               0.58    0.649   0.347   0.55    0.848   0.751   0.67    0.834   0.592   0.571   0.639
SI               0.554   0.389   0.439   0.553   0.809   0.73    0.392   0.839   0.406   0.514   0.562
ST               0.669   0.339   0.529   0.696   0.84    0.81    0.39    0.902   0.656   0.611   0.644
AVG              0.615   0.468   0.497   0.631   0.826   0.745   0.467   0.864   0.561   0.59    0.626

(b) SNR

Avail. \ Target  AI      CA      CH      DO      DR      EN      GU      JA      SI      ST      AVG
AI               20.43   20.48   20.33   20.00   19.64   19.99   20.52   19.62   19.93   20.24   20.12
CA               19.89   19.11   20.11   19.49   19.37   20.07   20.15   19.97   20.19   20.10   19.84
CH               19.51   19.56   20.09   19.48   19.32   19.46   20.39   19.39   19.59   19.38   19.62
DO               20.11   20.38   20.71   20.34   19.87   20.39   20.64   20.00   20.54   20.30   20.33
DR               19.82   20.16   20.05   19.36   19.43   19.66   20.31   19.11   19.67   19.68   19.72
EN               19.66   19.97   20.03   19.58   19.51   20.15   20.31   19.13   20.07   19.67   19.81
GU               19.71   19.68   19.97   19.23   18.96   19.06   19.60   19.39   19.99   19.28   19.49
JA               19.42   19.31   19.41   19.15   19.12   19.05   19.82   19.14   19.11   19.18   19.27
SI               18.98   19.71   19.20   18.77   18.70   18.68   20.47   18.66   19.04   19.07   19.13
ST               19.49   20.33   19.69   19.33   19.18   19.16   20.43   18.91   19.43   19.71   19.57
AVG              19.70   19.87   19.96   19.47   19.31   19.57   20.26   19.33   19.76   19.66   19.69

(c) Mean dB

Avail. \ Target  AI       CA       CH       DO       DR       EN       GU       JA       SI       ST       AVG
AI               -16.822  -15.002  -17.225  -15.621  -15.5    -15.632  -14.959  -15.423  -15.235  -16.388  -15.781
CA               -15.11   -15.268  -15.893  -15.331  -15.044  -15.013  -14.859  -15.065  -15.253  -15.228  -15.206
CH               -14.827  -14.785  -17.521  -15.755  -15.506  -15.184  -15.492  -14.933  -14.941  -15.786  -15.737
DO               -15.738  -15.243  -16.971  -16.745  -15.641  -15.445  -14.96   -15.169  -15.328  -15.083  -15.632
DR               -14.955  -15.608  -16.384  -15.531  -16.644  -15.28   -14.955  -15.903  -14.672  -15.18   -15.511
EN               -15.676  -14.938  -17.259  -15.275  -15.528  -16.418  -14.738  -15.362  -15.288  -15.524  -15.601
GU               -14.977  -14.912  -16.437  -15.935  -15.32   -15.104  -14.982  -15.475  -14.85   -15.479  -15.347
JA               -14.953  -14.615  -17.138  -16.013  -15.258  -16.011  -14.824  -17.579  -14.978  -15.999  -15.737
SI               -15.235  -15.073  -17.09   -15.982  -15.45   -14.696  -14.918  -14.782  -16.328  -15.41   -15.496
ST               -15.51   -15.706  -16.745  -15.357  -15.76   -15.257  -15.0    -15.074  -15.028  -17.19   -15.663
AVG              -15.38   -15.115  -16.866  -15.755  -15.565  -15.404  -14.969  -15.476  -15.19   -15.727  -15.571

Fig. 3. Targeted mean values of ASR (a), SNR (b) and l_dBx(v) (c) on the test set. The proposed penalty method is used for crafting the perturbation. The perturbation is generated by a single randomly selected sample of each class in order to attack the 1D CNN Gamma model [52] for a specific target class.

TABLE 4
Transferability of adversarial samples generated by both iterative and penalty methods between pairs of the models for untargeted and targeted attacking scenarios. The cell (i, j) indicates the ASR of the adversarial sample generated for model i (row) evaluated over model j (column).

Panel (A): Untargeted attack

Method     Model           1D CNN Rand  1D CNN Gamma  ENVnet-V2  SincNet  SincNet+VGG19
Iterative  1D CNN Rand     N/A          0.371         0.175      0.300    0.200
Iterative  1D CNN Gamma    0.340        N/A           0.245      0.627    0.530
Iterative  ENVnet-V2       0.269        0.339         N/A        0.304    0.310
Iterative  SincNet         0.164        0.207         0.152      N/A      0.237
Iterative  SincNet+VGG19   0.176        0.191         0.134      0.228    N/A
Penalty    1D CNN Rand     N/A          0.558         0.342      0.675    0.388
Penalty    1D CNN Gamma    0.423        N/A           0.295      0.698    0.672
Penalty    ENVnet-V2       0.390        0.503         N/A        0.490    0.479
Penalty    SincNet         0.142        0.170         0.103      N/A      0.185
Penalty    SincNet+VGG19   0.253        0.192         0.194      0.340    N/A

Panel (B): Targeted attack

Method     Model           1D CNN Rand  1D CNN Gamma  ENVnet-V2  SincNet  SincNet+VGG19
Iterative  1D CNN Rand     N/A          0.352         0.186      0.534    0.232
Iterative  1D CNN Gamma    0.285        N/A           0.210      0.476    0.423
Iterative  ENVnet-V2       0.292        0.340         N/A        0.327    0.356
Iterative  SincNet         0.126        0.131         0.105      N/A      0.215
Iterative  SincNet+VGG19   0.185        0.149         0.172      0.237    N/A
Penalty    1D CNN Rand     N/A          0.119         0.192      0.130    0.109
Penalty    1D CNN Gamma    0.130        N/A           0.117      0.251    0.134
Penalty    ENVnet-V2       0.127        0.153         N/A        0.146    0.159
Penalty    SincNet         0.103        0.106         0.103      N/A      0.129
Penalty    SincNet+VGG19   0.102        0.113         0.099      0.142    N/A

for attacking a diverse family of end-to-end audio classifiers based on deep models, and the experimental results have shown that both methods can fool all models with a high success rate. We also proved that the proposed penalty method converges to an optimal solution.

Although the perturbations produced by both methods can target the models with a high success rate, in listening tests humans can detect the additive noise, although they still recognize the adversarial audio samples as their original labels. Developing effective universal adversarial audio examples using the principles of psychoacoustics and auditory masking [23], [37] to reduce the level of noise introduced by the adversarial perturbation is our current goal. Moreover, designing appropriate audio intelligibility metrics to assess the quality of perturbed samples is also an interesting direction.

Different from image-based universal adversarial perturbations [14], the universal audio perturbations crafted in this study do not perform well in the physical world (played over-the-air). Combining the methods proposed in this paper with recent studies on generating adversarial attacks that are robust to several transformations of the audio signal [36], [37] may be a promising way to craft robust physical attacks against audio systems. Moreover, proposing an ensemble-based method for crafting more transferable attacks is also a promising direction for future work [56].

Finally, proposing a defensive mechanism [53], [57], [58], [59] against such universal perturbations is also an important issue. Using methods like adversarial training [60], [61] against such attacks is also considered as one of our future research directions. Moreover, using the proposed methods for targeting sequence-to-sequence models (e.g. speech-to-text) is another interesting research direction.

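APPENDIX A
SOLVING w_i FOR v′

$$w_i = \frac{1}{2}\left(\tanh(x'_i + v') + 1\right)$$

Substituting $x'_i + v'$ by $b_i$:

$$w_i = \frac{1}{2}\left(\tanh(b_i) + 1\right)$$

By using the definition of the tanh function:

$$\tanh(b_i) = \frac{-e^{-b_i} + e^{b_i}}{e^{-b_i} + e^{b_i}},$$

we have:

$$2w_i - 1 = \frac{-e^{-b_i} + e^{b_i}}{e^{-b_i} + e^{b_i}}.$$

Substituting $e^{b_i}$ by $u_i$:

$$
\begin{aligned}
2w_i - 1 &= \frac{-u_i^{-1} + u_i}{u_i^{-1} + u_i} \\
\Rightarrow\; 2w_i - 1 &= \frac{-1 + u_i^2}{1 + u_i^2} \\
\Rightarrow\; 2w_i\left(1 + u_i^2\right) - \left(1 + u_i^2\right) &= -1 + u_i^2 \\
\Rightarrow\; 2w_i + 2w_i u_i^2 - u_i^2 &= u_i^2 \\
\Rightarrow\; 2w_i &= 2u_i^2 - 2w_i u_i^2 \\
\Rightarrow\; w_i &= u_i^2\,(1 - w_i) \\
\Rightarrow\; \frac{w_i}{1 - w_i} &= u_i^2
\end{aligned}
$$

$u_i$ has two solutions:

$$u_i = \pm\sqrt{\frac{w_i}{1 - w_i}}$$

Considering $e^{b_i} = u_i$, for the first solution we have:

$$
\begin{aligned}
e^{b_i} &= \sqrt{\frac{w_i}{1 - w_i}} \\
\Rightarrow\; e^{b_i} &= \left(\frac{w_i}{1 - w_i}\right)^{\frac{1}{2}} \\
\Rightarrow\; \ln\left(e^{b_i}\right) &= \ln\left(\frac{w_i}{1 - w_i}\right)^{\frac{1}{2}} \\
\Rightarrow\; b_i &= \frac{1}{2}\ln\left(\frac{w_i}{1 - w_i}\right).
\end{aligned}
$$

By substituting back $b_i$ by $x'_i + v'$, we have:

$$x'_i + v' = \frac{1}{2}\ln\left(\frac{w_i}{1 - w_i}\right) \;\Rightarrow\; v' = \frac{1}{2}\ln\left(\frac{w_i}{1 - w_i}\right) - x'_i.$$

For the second solution we have:

$$e^{b_i} = -\sqrt{\frac{w_i}{1 - w_i}} \;\Rightarrow\; b_i = \ln\left(-\sqrt{\frac{w_i}{1 - w_i}}\right)$$

However, since $0 \le w_i \le 1$, the argument of the logarithm is negative, and the natural logarithm of a negative number is undefined. The second solution for $u_i$ is therefore invalid.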
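The closed form derived above can be checked numerically with a short round trip. The sketch below is only a sanity check with arbitrary illustrative values; the function names are hypothetical.

```python
import numpy as np

def w_from_v(x_prime, v_prime):
    """Forward map: w_i = 0.5 * (tanh(x'_i + v') + 1)."""
    return 0.5 * (np.tanh(x_prime + v_prime) + 1.0)

def v_from_w(w, x_prime):
    """Inverse from the derivation: v' = 0.5 * ln(w_i / (1 - w_i)) - x'_i."""
    return 0.5 * np.log(w / (1.0 - w)) - x_prime

# Arbitrary illustrative values for the transformed signal and perturbation.
x_prime = np.array([-0.8, 0.1, 0.5])
v_prime = np.array([0.3, -0.2, 0.05])

w = w_from_v(x_prime, v_prime)
recovered = v_from_w(w, x_prime)
print(np.allclose(recovered, v_prime))
```

The forward map always yields values strictly between 0 and 1 for finite inputs, so the logarithm in the inverse is well defined on its outputs.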

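APPENDIX B
SOLVING v′ FOR v

$$v' = \operatorname{arctanh}\left((2v - 1)(1 - \epsilon)\right).$$

Substituting $(2v - 1)(1 - \epsilon)$ by $s$:

$$
\begin{aligned}
v' &= \operatorname{arctanh}(s) \\
\Rightarrow\; \tanh(v') &= \tanh(\operatorname{arctanh}(s)) \\
\Rightarrow\; \tanh(v') &= s.
\end{aligned}
$$

By substituting back $s$ by $(2v - 1)(1 - \epsilon)$ we have:

$$
\begin{aligned}
\tanh(v') &= (2v - 1)(1 - \epsilon) \\
\Rightarrow\; \tanh(v') &= 2v - 2v\epsilon - 1 + \epsilon \\
\Rightarrow\; \tanh(v') &= v(2 - 2\epsilon) - 1 + \epsilon \\
\Rightarrow\; \frac{\tanh(v') + 1 - \epsilon}{2 - 2\epsilon} &= v
\end{aligned}
$$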
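As with Appendix A, the result can be verified with a numerical round trip. The value of ε used below is an illustrative assumption (the derivation only requires ε to keep the arctanh argument strictly inside (-1, 1)), and the function names are hypothetical.

```python
import numpy as np

EPS = 1e-6  # illustrative value; assumed small and positive

def to_tanh_space(v, eps=EPS):
    """v' = arctanh((2v - 1) * (1 - eps)) for v in [0, 1]."""
    return np.arctanh((2.0 * v - 1.0) * (1.0 - eps))

def from_tanh_space(v_prime, eps=EPS):
    """Inverse from the derivation: v = (tanh(v') + 1 - eps) / (2 - 2*eps)."""
    return (np.tanh(v_prime) + 1.0 - eps) / (2.0 - 2.0 * eps)

v = np.array([0.0, 0.25, 0.5, 0.9, 1.0])
v_back = from_tanh_space(to_tanh_space(v))
print(np.allclose(v_back, v))
```

The factor (1 - ε) is what keeps the endpoints v = 0 and v = 1 finite in the transformed space, so the round trip also works at the boundaries.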

APPENDIX C
STATISTICAL TEST

Since the ASRs reported in this paper are the proportions of samples successfully attacked by the two methods (iterative and penalty), the hypothesis test for two proportions can be applied here [62]. We compare the ASRs obtained by the two proposed methods (iterative and penalty) presented in Table 1 and Table 2, and let p_l and p_h be the ASR of the method with the lower ASR and the ASR of the method with the higher ASR, respectively. The test statistic [62] is given by:

$$Z = \frac{p_l - p_h}{\sqrt{2p(1 - p)/m}}, \qquad p = \frac{X_l + X_h}{2m},$$

where X_l and X_h are the numbers of samples successfully attacked by the method with the lower ASR and by the method with the higher ASR, respectively, and m is the number of samples in the test set. The intention is to show that the ASRs of the two methods differ, i.e., we want to establish that p_l < p_h, so:

• H0 : p_l = p_h (null hypothesis)
• Ha : p_l < p_h (alternative hypothesis).

The null hypothesis is rejected if Z < −z_α, where z_α is the critical value of the standard normal distribution for the level of significance α. If the condition holds, we reject H0 and accept Ha. Table 5 shows p_l and p_h for all of the target models for each attacking scenario as well as the corresponding Z values. The ASR of the method that produces the higher ASR is taken as p_h, and vice versa. From the standard normal distribution function [62], we know that z_0.057 = 1.58. Since all of the Z values in Table 5 are below -1.58 (Z < −z_0.057), we can reject the null hypothesis at a confidence level of at least 94.3% (1 − α) for all of the target models. In other words, the method corresponding to p_h is more effective at the 94.3% confidence level. For most of the models, the attacking method with the higher ASR is more effective at a much higher confidence level. We conclude that the results are statistically significant.
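The test statistic can be computed in a few lines. The sketch below follows the formula of this appendix, using X = p·m so that the pooled proportion is the mean of the two ASRs; the sample size and the two ASRs in the example are hypothetical and are not taken from the paper.

```python
import math

def two_proportion_z(p_l, p_h, m):
    """Z = (p_l - p_h) / sqrt(2 p (1 - p) / m), with pooled p = (X_l + X_h) / (2m)."""
    p = (p_l + p_h) / 2.0  # equals (X_l + X_h) / (2m) when X = p * m
    return (p_l - p_h) / math.sqrt(2.0 * p * (1.0 - p) / m)

def lower_tail_p_value(z):
    """P(Z <= z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

# Hypothetical example: ASRs of 0.80 and 0.85 measured on m = 1000 test samples.
z = two_proportion_z(0.80, 0.85, m=1000)
print(round(z, 3))                       # about -2.942
print(lower_tail_p_value(z) < 0.057)     # True: H0 is rejected at alpha = 0.057
```

Since Z scales with the square root of m, the same difference in ASR becomes more significant on a larger test set, which is consistent with the large magnitudes of Z reported for SpchCMD in Table 5.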

TABLE 5
Statistical test of the two attacking methods. p_l and p_h for all of the target models for each attacking scenario as well as the corresponding Z values. p_l and p_h are derived from the ASRs reported in Table 1 and Table 2.

Attacking sc.  Model           p_l     p_h     Z
Targeted       1D CNN Rand     0.672   0.854   -8.946
Targeted       1D CNN Gamma    0.795   0.888   -5.323
Targeted       ENVnet-V2       0.767   0.877   -6.011
Targeted       SincNet         0.899   0.971   -6.105
Targeted       SincNet+VGG19   0.872   0.898   -1.703
Targeted       SpchCMD         0.850   0.855   -3.969
Untargeted     1D CNN Rand     0.412   0.876   -20.257
Untargeted     1D CNN Gamma    0.737   0.858   -6.294
Untargeted     ENVnet-V2       0.669   0.831   -7.820
Untargeted     SincNet         0.886   0.919   -2.325
Untargeted     SincNet+VGG19   0.838   0.865   -1.587
Untargeted     SpchCMD         0.834   0.875   -32.737

APPENDIX D
DETAILED TARGETED ATTACK RESULTS

Tables 6 to 11 show the detailed mean values of ASR, SNR and l_dBx(v) on the target models in the targeted attack scenario for training and test sets. For Tables 6 to 10, the measures are reported for each specific target class of UrbanSound8k [49], and for Table 11, the measures are reported for each specific target class of the speech commands dataset [50]. The target classes of the UrbanSound8k dataset [49] are: Air conditioner (AI), Car horn (CA), Children playing (CH), Dog bark (DO), Drilling (DR), Engine idling (EN), Gun shot (GU), Jackhammer (JA), Siren (SI), Street music (ST).

APPENDIX E
TARGET MODELS

In this study six types of models are targeted. For training all models, categorical cross-entropy is used as the loss function. Adadelta [63] is used for optimizing the parameters of the models proposed for environmental sound classification. For training the speech command classification model, RMSprop is used as the optimization method. In this section, we present the complete description of the models.

E.1 1D CNN Rand

Table 12 shows the configuration of 1D CNN Rand [52]. This model consists of five one-dimensional CLs. The number of kernels of each CL is 16, 32, 64, 128 and 256. The size of the feature maps of each CL is 64, 32, 16, 8 and 4. The first, second and fifth CLs are followed by a one-dimensional max-pooling layer of size eight, eight and four, respectively. The output of the second pooling layer is used as input to two FC layers, and drop-out with a probability of 0.5 is applied to both layers [64]. ReLU is used as the activation function for all of the layers. The numbers of neurons of the FC layers are 128 and 64. In order to reduce over-fitting, batch normalization is applied after the activation function of each convolution layer [65]. The output of the last FC layer is used as the input to a softmax layer with ten neurons for classification.

E.2 1D CNN Gamma

This model is similar to 1D CNN Rand except that a gammatone filter-bank is used to initialize the filters of the first layer of this model [52]. Table 13 shows the configuration of this model. The gammatone filters are kept frozen and they are not trained during the backpropagation process. Sixty-four filters are used to decompose the input signal into appropriate frequency bands. This filter-bank covers the frequency range from 100 Hz to 8 kHz. After this layer, batch normalization is also applied [65].
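A gammatone filter-bank of the kind described above can be sketched as follows. The number of filters (64) and the frequency range (100 Hz to 8 kHz) come from the text; the sampling rate, filter length, filter order, ERB-scale spacing and the Glasberg-Moore bandwidth constants are assumptions chosen for illustration, and may not match the initialization used in [52].

```python
import numpy as np

def erb(f_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg-Moore approximation)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_spaced_frequencies(f_low, f_high, n_filters):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    to_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    from_rate = lambda e: (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    return from_rate(np.linspace(to_rate(f_low), to_rate(f_high), n_filters))

def gammatone_bank(n_filters=64, f_low=100.0, f_high=8000.0,
                   fs=16000, length=512, order=4):
    """Return (centre frequencies, filters) with one peak-normalized impulse response per row."""
    t = np.arange(length) / fs
    centers = erb_spaced_frequencies(f_low, f_high, n_filters)
    bank = np.empty((n_filters, length))
    for k, fc in enumerate(centers):
        b = 1.019 * erb(fc)  # bandwidth parameter of the gammatone envelope
        g = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)
        bank[k] = g / np.max(np.abs(g))
    return centers, bank

centers, bank = gammatone_bank()
print(bank.shape)                 # (64, 512)
print(centers[0], centers[-1])    # close to 100 Hz and 8000 Hz
```

In a network, each row of `bank` would initialize one kernel of the first convolutional layer, which is then excluded from gradient updates to keep the filters frozen.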

E.3 ENVnet-V2

Table 14 shows the architecture of ENVnet-V2 [51]. This model extracts short-time frequency features from audio waveforms by using two one-dimensional CLs with 32 and 64 filters, respectively, followed by a one-dimensional max-pooling layer. The model then swaps axes and convolves in time and frequency domain features using two two-dimensional CLs, each with 32 filters. After the CLs, a two-dimensional max-pooling layer is used. After that, two other two-dimensional CLs followed by a max-pooling layer are used, and finally another two-dimensional CL with 128 filters is used. After using two FC layers with 4,096 neurons, a softmax layer is applied for classification. Drop-out with a probability of 0.5 is applied to the FC layers [64]. ReLU is used as the activation function for all of the layers.

E.4 SincNet

Table 15 shows the architecture of SincNet [19]. In this model, 80 sinc functions are used as band-pass filters for decomposing the audio signal into appropriate frequency bands. After that, two one-dimensional CLs with 80 and 60 filters are applied. Layer normalization [66] is used after each CL. After each CL, max-pooling is used. Two FC layers followed by a softmax layer are used for classification. Drop-out with a probability of 0.5 is applied to the FC layers [64]. Batch normalization [65] is used after the FC layers. In this model, all hidden layers use the leaky-ReLU [67] non-linearity.
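The idea of a sinc-based band-pass filter can be sketched as the difference of two windowed low-pass sinc filters, parameterized only by its two cut-off frequencies. The filter length, Hamming window, sampling rate and cut-off values below are illustrative assumptions and are not taken from Table 15.

```python
import numpy as np

def sinc_bandpass(f1_hz, f2_hz, fs=16000, length=251):
    """Band-pass FIR filter: low-pass at f2 minus low-pass at f1, Hamming-windowed."""
    n = np.arange(length) - (length - 1) / 2.0   # symmetric time index
    f1, f2 = f1_hz / fs, f2_hz / fs              # cut-offs in cycles per sample
    # np.sinc(x) is the normalized sinc, sin(pi x) / (pi x).
    h = 2.0 * f2 * np.sinc(2.0 * f2 * n) - 2.0 * f1 * np.sinc(2.0 * f1 * n)
    return h * np.hamming(length)

h = sinc_bandpass(1000.0, 2000.0)
magnitude = np.abs(np.fft.rfft(h, n=4096))   # bin k corresponds to k * fs / 4096 Hz
print(magnitude[384])    # response at 1500 Hz (inside the pass band), close to 1
print(magnitude[1024])   # response at 4000 Hz (outside the pass band), close to 0
```

Only the two cut-off frequencies per filter would need to be learned in such a layer, which is far fewer parameters than a free convolution kernel of the same length.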


TABLE 6
Mean values of ASR, SNR and l_dBx(v) for targeting each label of the UrbanSound8k [49] dataset. The target model is 1D CNN Rand. Higher ASRs are in boldface.

                               Target classes
Method     Measure             AI       CA       CH       DO       DR       EN       GU       JA       SI       ST
Iterative  ASR training set    0.962    0.900    0.924    0.920    0.941    0.937    0.919    0.907    0.936    0.909
Iterative  ASR test set        0.636    0.741    0.602    0.666    0.716    0.656    0.808    0.662    0.640    0.593
Iterative  SNR (dB) test set   25.396   24.234   25.751   25.256   26.494   25.715   24.689   25.399   25.623   24.651
Iterative  l_dBx(v)            -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416
Penalty    ASR training set    0.907    0.917    0.916    0.919    0.906    0.932    0.929    0.921    0.906    0.917
Penalty    ASR test set        0.846    0.872    0.807    0.832    0.860    0.890    0.905    0.871    0.822    0.834
Penalty    SNR (dB) test set   24.170   22.492   23.239   22.532   24.371   23.647   24.688   23.445   23.172   22.920
Penalty    l_dBx(v)            -18.848  -15.553  -15.421  -14.980  -19.406  -18.046  -16.627  -16.752  -16.394  -15.590

TABLE 7
Mean values of ASR, SNR and l_dBx(v) for targeting each label of the UrbanSound8k [49] dataset. The target model is 1D CNN Gamma. Higher ASRs are in boldface.

                               Target classes
Method     Measure             AI       CA       CH       DO       DR       EN       GU       JA       SI       ST
Iterative  ASR training set    0.905    0.938    0.952    0.941    0.968    0.936    0.959    0.974    0.941    0.934
Iterative  ASR test set        0.745    0.874    0.744    0.810    0.815    0.746    0.887    0.795    0.772    0.761
Iterative  SNR (dB) test set   21.766   23.091   23.384   24.829   25.339   22.620   24.734   24.530   22.519   23.058
Iterative  l_dBx(v)            -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416
Penalty    ASR training set    0.916    0.928    0.923    0.912    0.909    0.902    0.926    0.910    0.903    0.900
Penalty    ASR test set        0.871    0.920    0.895    0.891    0.879    0.886    0.900    0.899    0.883    0.855
Penalty    SNR (dB) test set   21.351   21.730   22.392   22.648   24.257   22.141   23.380   24.663   21.401   22.902
Penalty    l_dBx(v)            -15.560  -15.113  -16.543  -16.950  -21.505  -17.312  -15.828  -21.380  -16.329  -17.229

TABLE 8
Mean values of ASR, SNR and l_dBx(v) for targeting each label of the UrbanSound8k [49] dataset. The target model is ENVnet-V2. Higher ASRs are in boldface.

                               Target classes
Method     Measure             AI       CA       CH       DO       DR       EN       GU       JA       SI       ST
Iterative  ASR training set    0.939    0.928    0.921    0.925    0.882    0.923    0.902    0.929    0.901    0.906
Iterative  ASR test set        0.797    0.835    0.697    0.767    0.764    0.763    0.804    0.744    0.747    0.754
Iterative  SNR (dB) test set   22.852   22.932   22.371   23.183   22.702   23.387   20.868   23.355   23.049   22.637
Iterative  l_dBx(v)            -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416
Penalty    ASR training set    0.926    0.935    0.916    0.928    0.923    0.904    0.929    0.904    0.936    0.919
Penalty    ASR test set        0.873    0.902    0.860    0.873    0.867    0.888    0.895    0.879    0.866    0.863
Penalty    SNR (dB) test set   22.143   21.208   22.241   21.601   22.251   21.917   20.798   22.367   21.971   21.818
Penalty    l_dBx(v)            -13.571  -14.281  -13.874  -12.977  -15.444  -14.541  -12.176  -15.025  -15.190  -14.896

E.5 SincNet+VGG19

Table 16 shows the specification of this architecture. This model uses 227 sinc filters to extract features from the raw audio signal, as introduced in SincNet [19]. After applying a one-dimensional max-pooling layer of size 218 with a stride of one, and layer normalization [66], the output is stacked along the time axis to form a 2D representation. This time-frequency representation is used as input to a VGG19 [54] network followed by an FC layer and a softmax layer for classification. The parameters of the VGG19 are the same as described in [54] and they are not changed in this study. The output of the VGG19 is used as input to a softmax layer with ten neurons for classification.

E.6 SpchCMD

Table 17 shows the architecture of the SpchCMD model. The model receives raw audio input of size 16,000. The model generates 40 patches of overlapped audio chunks of size 800. After that, a one-dimensional convolution layer with 64 filters is used. Eleven depth-wise one-dimensional convolution layers with an appropriate number of feature maps are then used to extract suitable information from the representations of the last layer. ReLU is used as the activation function for all convolution layers. Batch normalization is also used after each convolution layer. After an attention layer and a global average pooling layer, an FC layer and a softmax layer are used to classify the input samples into appropriate classes. This model is trained based on the recipe available on the GitHub page of the winner of the challenge (footnote 3). Batches of 384 samples from the dataset are used, where up to 40% of each batch are samples from the test set annotated by pseudo-labeling. An ensemble of the three best-performing models from the initial experiments on the test set is used for the labeling, and a label is used only where the three models predict the same label for the sample. Up

3. https://github.com/see--/speech recognition


TABLE 9
Mean values of ASR, SNR and l_dBx(v) for targeting each label of the UrbanSound8k [49] dataset. The target model is SincNet. Higher ASRs are in boldface.

                               Target classes
Method     Measure             AI        CA       CH       DO       DR       EN       GU       JA       SI       ST
Iterative  ASR training set    1.000     0.998    1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000
Iterative  ASR test set        0.754     0.955    0.931    0.854    0.916    0.919    0.959    0.878    0.905    0.920
Iterative  SNR (dB) test set   28.852    25.245   29.940   30.968   29.004   28.910   26.639   26.966   30.356   29.800
Iterative  Mean l_dBx(v)       -20.1006  -18.416  -23.405  -24.369  -19.411  -20.230  -18.416  -18.416  -24.250  -23.427
Penalty    ASR training set    0.935     0.974    0.964    0.948    0.985    0.994    0.924    0.970    0.966    0.961
Penalty    ASR test set        0.941     0.990    0.974    0.957    0.987    0.994    0.943    0.975    0.978    0.975
Penalty    SNR (dB) test set   33.329    25.554   32.919   32.652   28.420   28.579   28.437   28.946   32.663   32.616
Penalty    Mean l_dBx(v)       -35.161   -25.740  -35.174  -35.117  -29.397  -29.583  -29.214  -29.408  -35.149  -35.219

TABLE 10
Mean values of ASR, SNR and l_dBx(v) for targeting each label of the UrbanSound8k [49] dataset. The target model is SincNet+VGG19. Higher ASRs are in boldface.

                               Target classes
Method     Measure             AI       CA       CH       DO       DR       EN       GU       JA       SI       ST
Iterative  ASR training set    0.990    0.992    0.993    0.994    0.995    0.972    0.925    0.996    0.999    0.994
Iterative  ASR test set        0.876    0.918    0.855    0.887    0.863    0.831    0.899    0.852    0.864    0.879
Iterative  SNR (dB) test set   25.571   26.434   27.734   25.674   28.890   24.276   22.930   26.621   27.292   26.607
Iterative  l_dBx(v)            -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416  -18.416
Penalty    ASR training set    0.902    0.908    0.903    0.922    0.921    0.903    0.920    0.922    0.924    0.935
Penalty    ASR test set        0.899    0.889    0.898    0.897    0.897    0.881    0.882    0.907    0.919    0.914
Penalty    SNR (dB) test set   26.417   27.191   28.784   27.085   28.448   24.626   22.411   27.330   27.276   27.787
Penalty    l_dBx(v)            -20.024  -21.601  -23.594  -21.293  -23.944  -18.213  -14.776  -22.494  -22.045  -22.607

to 50% of the samples are also augmented by time shifting, with a minimum and maximum range of (-2000, 0); moreover, up to 15% of the samples are silent signals.
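The time-shift augmentation mentioned above can be sketched as follows. The shift range (-2000, 0) comes from the text; interpreting the range in samples, filling the vacated positions with zeros, and the function names are assumptions made for illustration, and the original training recipe may handle these details differently.

```python
import numpy as np

def time_shift(x, shift):
    """Shift x by `shift` samples (negative = earlier in time), zero-filling the gap."""
    out = np.zeros_like(x)
    if shift < 0:
        out[:shift] = x[-shift:]
    elif shift > 0:
        out[shift:] = x[:-shift]
    else:
        out[:] = x
    return out

def random_time_shift(x, rng, min_shift=-2000, max_shift=0):
    """Draw an integer shift uniformly from [min_shift, max_shift] and apply it."""
    shift = int(rng.integers(min_shift, max_shift + 1))
    return time_shift(x, shift)

rng = np.random.default_rng(0)
x = np.arange(1, 6)
print(time_shift(x, -2))                  # first two samples dropped, zeros appended
print(random_time_shift(np.ones(16000), rng).shape)
```

The output always has the same length as the input, so shifted samples can be mixed freely with unshifted ones in a batch.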

APPENDIX F
AUDIO EXAMPLES

Several randomly chosen examples of perturbed audio samples of the UrbanSound8k [49] and Speech Commands [50] datasets are also presented. The audio samples are perturbed using the two methods presented in this study. Targeted and untargeted perturbations are considered. Table 18 shows a list of the audio samples of the UrbanSound8k dataset [49] and Table 19 shows the list of audio samples of the Speech Commands [50] dataset. The methodology of crafting the samples, the target models, the SNR of the perturbed samples, the l_dBx(v) of the perturbed samples, the class detected by each model, as well as the true class of the samples are also presented.

ACKNOWLEDGMENTS

We thank Dr. Rachel Bouserhal for her insightful feedback. This work was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). This work was also supported by the NVIDIA GPU Grant Program.

REFERENCES

[1] Z. Zhao, P. Zheng, S. Xu, and X. Wu, “Object detection with deep learning: A review,” IEEE Trans Neural Netw and Learn Syst, vol. 30, no. 11, pp. 3212–3232, 2019.

[2] M. Yuan, B. Van Durme, and J. L. Ying, “Multilingual anchoring: Interactive topic modeling and alignment across languages,” in Adv in Neural Inf Proc Syst, 2018, pp. 8653–8663.

[3] Z. Yang, Z. Hu, C. Dyer, E. P. Xing, and T. Berg-Kirkpatrick, “Unsupervised text style transfer using language models as discriminators,” in Adv in Neural Inf Proc Syst, 2018, pp. 7287–7298.

[4] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Adv in Neural Inf Proc Syst, 2018, pp. 4480–4490.

[5] Y.-A. Chung, W.-H. Weng, S. Tong, and J. Glass, “Unsupervised cross-modal alignment of speech and text embedding spaces,” in Adv in Neural Inf Proc Syst, 2018, pp. 7354–7364.

[6] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in 2nd Intl Conf Learn Repres, 2014.

[7] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and Harnessing Adversarial Examples,” in Intl Conf Learn Repres, 2015.

[8] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in IEEE Symp Secur Privacy, 2017, pp. 39–57.

[9] N. Akhtar and A. Mian, “Threat of adversarial attacks on deep learning in computer vision: A survey,” IEEE Access, vol. 6, pp. 14410–14430, 2018.

[10] K. Grosse, T. A. Trost, M. Mosbach, M. Backes, and D. Klakow, “Adversarial initialization – when your network performs the way i want,” arXiv preprint 1902.03020, 2019.

[11] K. M. Koerich, M. Esmaeilpour, S. Abdoli, A. S. Britto Jr., and A. L. Koerich, “Cross-representation transferability of adversarial attacks: From spectrograms to audio waveforms,” in Intl J Conf on Neural Netw, 2020, pp. 1–7.

[12] B. Biggio and F. Roli, “Wild patterns: Ten years after the rise of adversarial machine learning,” Patt Recog, vol. 84, pp. 317–331, 2018.

[13] T. Orekondy, B. Schiele, and M. Fritz, “Knockoff nets: Stealing functionality of black-box models,” in IEEE Conf Comp Vis Patt Recog, 2019, pp. 4954–4963.

TABLE 11
Mean values of ASR, SNR and ldBx(v) for targeting each label of the Speech Commands dataset [50]. The target model is SpchCMD. Higher ASRs are in boldface.

              Iterative                                           Penalty
Target class  ASR (train)  ASR (test)  SNR (dB, test)  ldBx(v)    ASR (train)  ASR (test)  SNR (dB, test)  ldBx(v)
silence       0.903        0.875       21.559          -18.416    0.900        0.859       26.707          -19.767
unknown       0.902        0.893       27.747          -18.416    0.901        0.910       26.693          -19.913
sheila        0.903        0.850       31.240          -20.546    0.902        0.849       26.716          -24.018
nine          0.901        0.868       26.656          -18.416    0.905        0.867       26.763          -20.482
stop          0.908        0.859       32.116          -22.206    0.901        0.858       26.714          -22.100
bed           0.902        0.857       25.345          -18.416    0.906        0.835       26.706          -20.514
four          0.900        0.871       26.726          -18.416    0.903        0.838       26.722          -22.881
six           0.902        0.829       29.381          -19.350    0.901        0.856       26.728          -21.749
down          0.904        0.845       30.795          -22.887    0.904        0.864       26.741          -21.484
bird          0.906        0.861       25.127          -18.416    0.900        0.821       26.744          -20.034
marvin        0.901        0.841       26.984          -18.416    0.901        0.861       26.729          -20.996
cat           0.902        0.840       28.068          -18.529    0.901        0.830       26.716          -23.751
off           0.901        0.843       25.640          -18.416    0.901        0.832       26.765          -20.333
right         0.902        0.863       27.637          -18.418    0.909        0.837       26.720          -23.642
seven         0.903        0.838       29.073          -20.292    0.901        0.854       26.713          -20.663
eight         0.902        0.872       25.217          -18.416    0.900        0.883       26.758          -20.971
up            0.902        0.849       25.775          -18.416    0.901        0.829       26.729          -17.679
three         0.900        0.871       27.901          -19.338    0.913        0.867       26.750          -22.893
happy         0.900        0.835       27.404          -18.416    0.900        0.855       26.726          -19.322
go            0.901        0.857       28.173          -18.416    0.907        0.848       26.738          -22.805
zero          0.900        0.864       29.630          -19.459    0.914        0.837       26.721          -23.252
on            0.903        0.863       27.074          -18.416    0.901        0.857       26.761          -19.994
wow           0.900        0.855       26.043          -18.416    0.903        0.823       26.712          -22.219
dog           0.903        0.857       26.052          -18.416    0.902        0.825       26.727          -21.555
yes           0.905        0.854       28.267          -18.421    0.907        0.843       26.727          -22.725
five          0.904        0.840       26.428          -18.416    0.901        0.850       26.735          -20.931
one           0.904        0.870       27.789          -18.417    0.903        0.857       26.716          -23.301
tree          0.901        0.838       28.529          -18.450    0.903        0.849       26.722          -23.257
house         0.905        0.860       26.800          -18.416    0.902        0.860       26.728          -20.861
two           0.904        0.840       28.596          -18.840    0.901        0.849       26.722          -20.510
left          0.900        0.856       26.782          -18.416    0.903        0.851       26.730          -20.723
no            0.900        0.835       26.203          -18.416    0.902        0.842       26.769          -22.056
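To make the ASR columns of Table 11 concrete, the sketch below shows one way an attack success rate could be computed from model predictions. It assumes that, for a targeted attack, a sample counts as a success when the perturbed input is classified as the target class, and that, for an untargeted attack, it counts as a success when the prediction differs from the prediction on the clean input. The function name and the toy labels are hypothetical and are not taken from the paper.

```python
def attack_success_rate(clean_preds, adv_preds, target=None):
    """Fraction of samples on which the attack succeeds.

    Assumed convention: with a target class, success means the perturbed
    sample is classified as `target`; without one, success means the
    prediction changed with respect to the clean sample.
    """
    if not clean_preds or len(clean_preds) != len(adv_preds):
        raise ValueError("prediction lists must be non-empty and of equal length")
    if target is not None:
        hits = sum(1 for a in adv_preds if a == target)
    else:
        hits = sum(1 for c, a in zip(clean_preds, adv_preds) if c != a)
    return hits / len(adv_preds)

# Toy predictions (illustrative only, not results from the paper).
clean = ["no", "up", "five", "down"]
perturbed = ["bed", "bed", "six", "down"]

print(attack_success_rate(clean, perturbed, target="bed"))
print(attack_success_rate(clean, perturbed))
```

Under this convention the same set of perturbed predictions can yield different targeted and untargeted rates, which is why the two attack scenarios are reported separately.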

TABLE 12
1D CNN Rand architecture.

Layer         Ksize  Stride  # of filters  Data shape
InputLayer    -      -       -             (50,999, 1)
Conv1D        64     2       16            (25,468, 16)
MaxPooling1D  8      8       16            (3,183, 16)
Conv1D        32     2       32            (1,576, 32)
MaxPooling1D  8      8       32            (197, 32)
Conv1D        16     2       64            (91, 64)
Conv1D        8      2       128           (42, 128)
Conv1D        4      2       256           (20, 256)
MaxPooling1D  4      4       128           (5, 256)
FC            -      -       128           (128)
FC            -      -       64            (64)
FC            -      -       10            (10)
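The temporal lengths in the Data shape column of Table 12 can be reproduced with the usual output-length formula for an unpadded layer, floor((n - k) / s) + 1. The sketch below applies it to the convolution and pooling layers of the 1D CNN Rand architecture; the absence of padding is an assumption inferred from the listed shapes rather than something stated in the table, and the helper name is hypothetical.

```python
def out_len(n, ksize, stride):
    """Output length of an unpadded ("valid") 1D convolution or pooling layer."""
    return (n - ksize) // stride + 1

# (layer, kernel size, stride) for the conv/pool layers listed in Table 12.
layers = [
    ("Conv1D", 64, 2),
    ("MaxPooling1D", 8, 8),
    ("Conv1D", 32, 2),
    ("MaxPooling1D", 8, 8),
    ("Conv1D", 16, 2),
    ("Conv1D", 8, 2),
    ("Conv1D", 4, 2),
    ("MaxPooling1D", 4, 4),
]

n = 50999  # input length from the InputLayer row
lengths = []
for name, ksize, stride in layers:
    n = out_len(n, ksize, stride)
    lengths.append(n)
    print(f"{name:<13} k={ksize:<3} s={stride:<2} -> {n}")
```

The resulting lengths (25468, 3183, 1576, 197, 91, 42, 20, 5) match the first dimension of each Data shape entry in Table 12.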

TABLE 13
1D CNN Gamma architecture.

Layer         Ksize  Stride  # of filters  Data shape
InputLayer    -      -       -             (50,999, 1)
Conv1D        512    1       64            (50,488, 64)
MaxPooling1D  8      8       64            (6,311, 64)
Conv1D        32     2       32            (3,140, 32)
MaxPooling1D  8      8       32            (392, 32)
Conv1D        16     2       64            (189, 64)
Conv1D        8      2       128           (91, 128)
Conv1D        4      2       256           (44, 256)
MaxPooling1D  4      4       128           (11, 256)
FC            -      -       128           (128)
FC            -      -       64            (64)
FC            -      -       10            (10)

TABLE 14
ENVnet-V2 architecture.

Layer         Ksize  Stride  # of filters  Data shape
InputLayer    -      -       -             (50,999, 1)
Conv1D        64     2       32            (25,468, 32)
Conv1D        16     2       64            (12,727, 64)
MaxPooling1D  64     64      64            (198, 64)
swapaxes      -      -       -             (64, 198, 1)
Conv2D        (8,8)  (1,1)   32            (57, 191, 32)
Conv2D        (8,8)  (1,1)   32            (50, 184, 32)
MaxPooling2D  (5,3)  (5,3)   32            (10, 61, 32)
Conv2D        (1,4)  (1,1)   64            (10, 58, 64)
Conv2D        (1,4)  (1,1)   64            (10, 55, 64)
MaxPooling2D  (1,2)  (1,2)   64            (10, 27, 64)
Conv2D        (1,2)  (1,1)   128           (10, 26, 128)
FC            -      -       4,096         (4,096)
FC            -      -       4,096         (4,096)
FC            -      -       10            (10)

TABLE 15
SincNet architecture.

Layer         Ksize  Stride  # of filters  Data shape
InputLayer    -      -       -             (50,999, 1)
SincConv1D    251    1       80            (50,749, 80)
MaxPooling1D  3      1       80            (16,916, 80)
Conv1D        5      1       60            (16,912, 60)
MaxPooling1D  3      1       60            (5,637, 60)
Conv1D        5      1       60            (5,633, 60)
FC            -      -       128           (128)
FC            -      -       64            (64)
FC            -      -       10            (10)

TABLE 16
SincNet+VGG19 architecture.

Layer         Ksize  Stride  # of filters  Data shape
InputLayer    -      -       -             (50,999, 1)
SincConv1D    251    1       227           (50,749, 1)
MaxPooling1D  218    1       227           (232, 1)
Reshape       -      -       -             (232, 227, 1)
VGG19 [54]    -      -       -             (4096)
FC            -      -       10            (10)

TABLE 17
SpchCMD architecture.

Layer             Ksize  Stride  # of filters  Data shape
InputLayer        -      -       -             (16,000)
Time slice stack  40     20      -             (800, 40)
Conv1D            3      2       64            (399, 64)
DepthwiseConv1D   3      1       128           (397, 128)
DepthwiseConv1D   3      1       192           (199, 192)
DepthwiseConv1D   3      1       192           (197, 192)
DepthwiseConv1D   3      1       256           (99, 256)
DepthwiseConv1D   3      1       256           (97, 256)
DepthwiseConv1D   3      1       320           (49, 320)
DepthwiseConv1D   3      1       320           (47, 320)
DepthwiseConv1D   3      1       384           (24, 384)
DepthwiseConv1D   3      1       384           (22, 384)
DepthwiseConv1D   3      1       448           (11, 448)
DepthwiseConv1D   3      1       448           (9, 448)
AttentionLayer    -      -       448           (9, 448)
GlobalAvgPooling  -      -       448           (448)
FC                -      -       32            (32)

[14] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in IEEE Conf Comp Vis Patt Recog, 2017, pp. 1765–1773.

[15] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint 1412.5567, 2014.

[16] Y. Hoshen, R. J. Weiss, and K. W. Wilson, “Speech acoustic modeling from raw multichannel waveforms,” in IEEE Intl Conf on Acoust Speech Signal Process, 2015, pp. 4624–4628.

[17] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” in 9th ISCA Speech Synth Workshop, 2016, p. 125.

[18] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, “Learning the speech front-end with raw waveform CLDNNs,” in 16th Annual Conf Intl Speech Comm Assoc, 2015, pp. 1–5.

[19] M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform with SincNet,” in IEEE Spoken Lang Techn Works, 2018, pp. 1021–1028.

[20] N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve, and E. Dupoux, “Learning filterbanks from raw speech for phone recognition,” in IEEE Intl Conf on Acoust Speech Signal Process, 2018, pp. 5509–5513.

[21] N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert, and E. Dupoux, “End-to-end speech recognition from the raw waveform,” in 19th Annual Conf Intl Speech Comm Assoc, 2018, pp. 781–785.

[22] N. Carlini and D. Wagner, “Audio adversarial examples: Targeted attacks on speech-to-text,” in IEEE Secur Privacy Worksh, 2018, pp. 1–7.

[23] H. Abdullah, K. Warren, V. Bindschaedler, N. Papernot, and P. Traynor, “The faults in our asrs: An overview of attacks against automatic speech recognition and speaker identification systems,” arXiv preprint 2007.06622, 2020.

[24] J. Peck, J. Roels, B. Goossens, and Y. Saeys, “Lower bounds on the robustness to adversarial perturbations,” in Adv in Neural Inf Proc Syst, 2017, pp. 804–813.


TABLE 18
Examples of perturbed audio samples from the UrbanSound8k dataset [49]: class detected by the target model, true class, target model, method used to craft the sample, attack type, and SNR and ldBx(v) of the perturbed sample. N/A: not applicable.

Sample             Detected Class    True Class        Target Model   Method     Targeted/Untargeted  SNR (dB)  ldBx(v)
DR 0 org.wav       Drilling          Drilling          SincNet        N/A        N/A                  N/A       N/A
DR 0 pert itr.wav  Gun shot          Drilling          SincNet        Iterative  Targeted             27.025    -18.416
DR 0 pert pen.wav  Gun shot          Drilling          SincNet        Penalty    Targeted             28.040    -29.264
SI 0 org.wav       Siren             Siren             SincNet        N/A        N/A                  N/A       N/A
SI 0 pert itr.wav  Air conditioner   Siren             SincNet        Iterative  Targeted             29.244    -20.101
SI 0 pert pen.wav  Air conditioner   Siren             SincNet        Penalty    Targeted             33.766    -35.153
CH 0 org.wav       Children playing  Children playing  SincNet        N/A        N/A                  N/A       N/A
CH 0 pert itr.wav  Dog bark          Children playing  SincNet        Iterative  Targeted             31.784    -24.369
CH 0 pert pen.wav  Dog bark          Children playing  SincNet        Penalty    Targeted             33.207    -35.154
ST 0 org.wav       Street music      Street music      SincNet        N/A        N/A                  N/A       N/A
ST 0 pert itr.wav  Air conditioner   Street music      SincNet        Iterative  Targeted             28.994    -20.101
ST 0 pert pen.wav  Air conditioner   Street music      SincNet        Penalty    Targeted             33.539    -35.126
DO 0 org.wav       Dog bark          Dog bark          SincNet        N/A        N/A                  N/A       N/A
DO 0 pert itr.wav  Drilling          Dog bark          SincNet        Iterative  Targeted             28.861    -19.410
DO 0 pert pen.wav  Drilling          Dog bark          SincNet        Penalty    Targeted             28.040    -29.306
EN 0 org.wav       Engine idling     Engine idling     SincNet+VGG19  N/A        N/A                  N/A       N/A
EN 0 pert itr.wav  Drilling          Engine idling     SincNet+VGG19  Iterative  Untargeted           26.378    -14.005
EN 0 pert pen.wav  Children playing  Engine idling     SincNet+VGG19  Penalty    Untargeted           23.444    -17.651
CA 0 org.wav       Car horn          Car horn          SincNet+VGG19  N/A        N/A                  N/A       N/A
CA 0 pert itr.wav  Drilling          Car horn          SincNet+VGG19  Iterative  Untargeted           26.175    -14.005
CA 0 pert pen      Street music      Car horn          SincNet+VGG19  Penalty    Untargeted           23.341    -17.588
AI 0 org.wav       Air conditioner   Air conditioner   SincNet+VGG19  N/A        N/A                  N/A       N/A
AI 0 pert itr.wav  Jackhammer        Air conditioner   SincNet+VGG19  Iterative  Untargeted           25.759    -14.005
AI 0 pert pen.wav  Children playing  Air conditioner   SincNet+VGG19  Penalty    Untargeted           23.073    -17.535
SI 1 org.wav       Siren             Siren             SincNet+VGG19  N/A        N/A                  N/A       N/A
SI 1 pert itr.wav  Jackhammer        Siren             SincNet+VGG19  Iterative  Untargeted           26.334    -14.005
SI 1 pert pen.wav  Children playing  Siren             SincNet+VGG19  Penalty    Untargeted           23.423    -17.907
DR 1 org.wav       Drilling          Drilling          SincNet+VGG19  N/A        N/A                  N/A       N/A
DR 1 pert itr.wav  Jackhammer        Drilling          SincNet+VGG19  Iterative  Untargeted           27.040    -14.005
DR 1 pert pen.wav  Children playing  Drilling          SincNet+VGG19  Penalty    Untargeted           24.555    -17.688

TABLE 19
Examples of perturbed audio samples from the Speech Commands dataset [50]: class detected by the SpchCMD model, true class, method used to craft the sample, attack type, and SNR and ldBx(v) of the perturbed sample. N/A: not applicable.

Sample                Detected Class  True Class  Method     Targeted/Untargeted  SNR (dB)  ldBx(v)
No 0 org.wav          no              no          N/A        N/A                  N/A       N/A
No 0 pert pen.wav     bed             no          Penalty    Targeted             29.190    -20.958
No 0 pert itr.wav     bed             no          Iterative  Targeted             27.323    -18.416
Up 0 org.wav          up              up          N/A        N/A                  N/A       N/A
Up 0 pert pen.wav     unknown         up          Penalty    Targeted             24.773    -20.319
Up 0 pert itr.wav     unknown         up          Iterative  Targeted             25.385    -18.416
Five 0 org.wav        five            five        N/A        N/A                  N/A       N/A
Five 0 pert pen.wav   silence         five        Penalty    Targeted             25.475    -20.218
Five 0 pert itr.wav   silence         five        Iterative  Targeted             20.293    -18.416
Down 0 org.wav        down            down        N/A        N/A                  N/A       N/A
Down 0 pert pen.wav   dog             down        Penalty    Targeted             31.342    -23.030
Down 0 pert itr.wav   dog             down        Iterative  Targeted             29.130    -18.416
One 0 org.wav         one             one         N/A        N/A                  N/A       N/A
One 0 pert pen.wav    seven           one         Penalty    Targeted             27.276    -21.299
One 0 pert itr.wav    seven           one         Iterative  Targeted             29.351    -20.292
Right 0 org.wav       right           right       N/A        N/A                  N/A       N/A
Right 0 pert pen.wav  five            right       Penalty    Untargeted           25.784    -24.836
Right 0 pert itr.wav  six             right       Iterative  Untargeted           27.580    -18.416
On 0 org.wav          on              on          N/A        N/A                  N/A       N/A
On 0 pert pen.wav     sheila          on          Penalty    Untargeted           25.126    -25.037
On 0 pert itr.wav     stop            on          Iterative  Untargeted           26.609    -18.416
Eight 0 org.wav       eight           eight       N/A        N/A                  N/A       N/A
Eight 0 pert pen.wav  sheila          eight       Penalty    Untargeted           25.921    -24.822
Eight 0 pert itr.wav  six             eight       Iterative  Untargeted           27.764    -18.416
Two 0 org.wav         two             two         N/A        N/A                  N/A       N/A
Two 0 pert pen.wav    sheila          two         Penalty    Targeted             26.627    -24.813
Two 0 pert itr.wav    sheila          two         Iterative  Untargeted           28.200    -18.416
No 1 org.wav          no              no          N/A        N/A                  N/A       N/A
No 1 pert pen.wav     sheila          no          Penalty    Untargeted           27.316    -24.840
No 1 pert itr.wav     sheila          no          Iterative  Untargeted           28.845    -18.416


[25] A. Shafahi, W. R. Huang, C. Studer, S. Feizi, and T. Goldstein, “Are adversarial examples inevitable?” in Intl Conf Learn Repres, 2019.

[26] A. Fawzi, H. Fawzi, and O. Fawzi, “Adversarial vulnerability for any classifier,” in Adv in Neural Inf Proc Syst, 2018, pp. 1178–1187.

[27] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” in Intl Conf on Learn Repres, 2017.

[28] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok, “Synthesizing Robust Adversarial Examples,” in Intl Conf Mach Learn, 2018, pp. 284–293.

[29] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou, “Hidden voice commands,” in 25th Secur Sympos, 2016, pp. 513–530.

[30] G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang, and W. Xu, “Dolphinattack: Inaudible voice commands,” in ACM SIGSAC Conf Comp Comm Secur, 2017, pp. 103–117.

[31] Y. Gong and C. Poellabauer, “Crafting adversarial examples for speech paralinguistics applications,” arXiv preprint 1711.03280, 2017.

[32] C. Kereliuk, B. L. Sturm, and J. Larsen, “Deep learning and music adversaries,” IEEE Trans Multim, vol. 17, no. 11, pp. 2059–2071, 2015.

[33] T. Du, S. Ji, J. Li, Q. Gu, T. Wang, and R. Beyah, “Sirenattack: Generating adversarial audio for end-to-end acoustic systems,” in 15th ACM Asia Conf Comp Comm Secur, 2020, pp. 357–369.

[34] M. Alzantot, B. Balaji, and M. Srivastava, “Did you hear that? adversarial examples against automatic speech recognition,” arXiv preprint 1801.00554, 2018.

[35] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in 16th Annual Conf Intl Speech Comm Assoc, 2015, pp. 1478–1482.

[36] H. Yakura and J. Sakuma, “Robust audio adversarial example for a physical attack,” in 28th Intl J Conf Artif Intell, S. Kraus, Ed., 2019, pp. 5334–5341.

[37] Y. Qin, N. Carlini, G. W. Cottrell, I. J. Goodfellow, and C. Raffel, “Imperceptible, robust, and targeted adversarial examples for automatic speech recognition,” in 36th Intl Conf Mach Learn, 2019, pp. 5231–5240.

[38] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, P. Frossard, and S. Soatto, “Robustness of classifiers to universal perturbations: A geometric perspective,” in Intl Conf on Learn Repres, 2018.

[39] J. H. Metzen, M. C. Kumar, T. Brox, and V. Fischer, “Universal adversarial perturbations against semantic image segmentation,” in IEEE Intl Conf on Comp Vis, 2017, pp. 2774–2783.

[40] M. Behjati, S.-M. Moosavi-Dezfooli, M. S. Baghshah, and P. Frossard, “Universal adversarial attacks on text classifiers,” in IEEE Intl Conf on Acoust Speech Signal Process, 2019, pp. 7345–7349.

[41] J. Hayes and G. Danezis, “Learning universal adversarial perturbations with generative models,” in IEEE Secur Priv Worksh, 2018, pp. 43–49.

[42] P. Neekhara, S. Hussain, P. Pandey, S. Dubnov, J. J. McAuley, and F. Koushanfar, “Universal adversarial perturbations for speech recognition systems,” in 20th Annual Conf Intl Speech Comm Assoc, 2019, pp. 481–485.

[43] J. Rony, L. G. Hafemann, L. S. Oliveira, I. B. Ayed, R. Sabourin, and E. Granger, “Decoupling direction and norm for efficient gradient-based L2 adversarial attacks and defenses,” in IEEE Conf Comp Vis Patt Recogn, 2019, pp. 4322–4330.

[44] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: a simple and accurate method to fool deep neural networks,” in IEEE Conf Comp Vis Patt Recog, 2016, pp. 2574–2582.

[45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint 1412.6980, 2014.

[46] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” J Mach Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.

[47] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in Intl Conf Mach Learn, 2013, pp. 1139–1147.

[48] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[49] J. Salamon, C. Jacoby, and J. Bello, “A dataset and taxonomy for urban sound research,” in 22nd ACM Intl Conf Multim, New York, NY, USA, 2014, pp. 1041–1044.

[50] P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint 1804.03209, 2018.

[51] Y. Tokozume, Y. Ushiku, and T. Harada, “Learning from between-class examples for deep sound recognition,” in 6th Intl Conf Learn Repres, 2018.

[52] S. Abdoli, P. Cardinal, and A. L. Koerich, “End-to-end environmental sound classification using a 1D convolutional neural network,” Expert Systems with Applic, vol. 136, pp. 252–263, 2019.

[53] Z. Yang, B. Li, P.-Y. Chen, and D. Song, “Characterizing audio adversarial examples using temporal dependency,” in 7th Intl Conf Learn Repres, 2019.

[54] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in 3rd Intl Conf Learn Repres, 2015.

[55] V. Subramanian, E. Benetos, N. Xu, S. McDonald, and M. Sandler, “Adversarial attacks in sound event classification,” arXiv preprint 1907.02477, 2019.

[56] Y. Liu, X. Chen, C. Liu, and D. Song, “Delving into transferable adversarial examples and black-box attacks,” in Intl Conf on Learn Repres, 2017.

[57] M. Esmaeilpour, P. Cardinal, and A. L. Koerich, “Class-conditional defense GAN against end-to-end speech attacks,” in arXiv preprint 2010.11352, 2020, pp. 1–5.

[58] Q. Zeng, J. Su, C. Fu, G. Kayas, L. Luo, X. Du, C. C. Tan, and J. Wu, “A multiversion programming inspired approach to detecting audio adversarial examples,” in 49th Annual IEEE/IFIP Intl Conf Depend Syst and Netw, 2019, pp. 39–51.

[59] M. Esmaeilpour, P. Cardinal, and A. L. Koerich, “A robust approach for securing audio classification against adversarial attacks,” IEEE Trans on Inf Forensics and Security, vol. 15, pp. 2147–2159, 2019.

[60] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in Intl Conf on Learn Repres, 2018.

[61] S. Sun, C.-F. Yeh, M. Ostendorf, M.-Y. Hwang, and L. Xie, “Training augmentation with adversarial examples for robust speech recognition,” in 19th Annual Conf Intl Speech Comm Assoc, 2018, pp. 2404–2408.

[62] R. Johnson, I. Miller, and J. Freund, Miller and Freund’s Probability and Statistics for Engineers, Global Edition, ser. Global Edition. Pearson Education Limited, 2017. [Online]. Available: https://books.google.ca/books?id=GoiPAQAACAAJ

[63] M. D. Zeiler, “Adadelta: An adaptive learning rate method,” 2012.

[64] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” J Mach Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[65] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in 32nd Intl Conf Mach Learn, vol. 37, 2015, pp. 448–456.

[66] J. Lei Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint 1607.06450, 2016.

[67] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc ICML, vol. 30, no. 1, 2013, p. 3.

Sajjad Abdoli received his Master’s degree in computer engineering from QIAU, Qazvin, Iran in 2017. He is currently a Ph.D. candidate at Ecole de Technologie Superieure (ETS), Universite du Quebec, Montreal, QC, Canada. His research interests include audio and speech processing, music information retrieval and developing adversarial attacks on machine learning systems.

Luiz Gustavo Hafemann received his B.S. and M.Sc. degrees in Computer Science from the Federal University of Parana, Brazil, in 2008 and 2014, respectively. He received his Ph.D. degree in Systems Engineering in 2019 from the Ecole de Technologie Superieure, Canada. He is currently a researcher at Sportlogiq, applying computer vision models for sports analytics. His interests include meta-learning, adversarial machine learning and group activity recognition.

Jerome Rony received his M.A.Sc. degree in Systems Engineering in 2019 from the Ecole de Technologie Superieure (ETS), Universite du Quebec, Montreal, QC, Canada. He is currently a Ph.D. student at ETS whose research interests include computer vision, adversarial examples and robust machine learning.

Ismail Ben Ayed received the Ph.D. degree (with the highest honor) in computer vision from the Institut National de la Recherche Scientifique (INRS-EMT), Montreal, QC, in 2007. He is currently Associate Professor at the Ecole de Technologie Superieure (ETS), University of Quebec, where he holds a research chair on Artificial Intelligence in Medical Imaging. Before joining the ETS, he worked for 8 years as a research scientist at GE Healthcare, London, ON, conducting research in medical image analysis. His research interests are in computer vision, optimization, machine learning and their potential applications in medical image analysis.

Patrick Cardinal received the B.Eng. degree in electrical engineering in 2000 from Ecole de Technologie Superieure (ETS), the M.Sc. from McGill University in 2003 and the Ph.D. from ETS in 2013. From 2000 to 2013, he was involved in several projects related to speech processing, especially in the development of a closed-captioning system for live television shows based on automatic speech recognition. After his postdoc at MIT, he joined ETS as a professor. His research interests cover several aspects of speech processing for real life and medical applications.

Alessandro Lameiras Koerich is an Associate Professor in the Dept. of Software and IT Engineering of the Ecole de Technologie Superieure (ETS). He received the B.Eng. degree in electrical engineering from the Federal University of Santa Catarina, Brazil, in 1995, the M.Sc. in electrical engineering from the University of Campinas, Brazil, in 1997, and the Ph.D. in engineering from the ETS, in 2002. His current research interests include computer vision, machine learning and music information retrieval.
