Motivation Method Experiment
Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers
Ji Gao1, Jack Lanchantin1, Mary Lou Soffa1, Yanjun Qi1
1University of Virginia
http://trustworthymachinelearning.org/
@ 1st Deep Learning and Security Workshop, 2018
Outline
1 Motivation
White box vs. black box
2 Method
Word scorer
Word transformer
3 Experiment
4 Conclusions
Example of black-box classification systems
Google Perspective API
Target scenario
An example of DeepWordBug
Goal: Flip the prediction of a sentiment analyzer
Algorithm (Our Methods)
Challenges of language tasks (Our Method)
Adversarial examples
Suppose a deep learning classifier F (·) : X → Y and an original sample x. An adversarial example x′ in an untargeted attack satisfies:

x′ = x + ∆x, ||∆x||p < ε, x′ ∈ X, F (x) ≠ F (x′)
When X is symbolic:
How to perturb x?
No metric for measuring ∆x
Our setting (Our Method)
∆x = Edit distance(x, x′)
DeepWordBug (Our Methods)
1. Scoring - Find important words to change
2. Transformation - Generate some modification on words of top importance.
∆x = Edit distance(x, x′)
   = ∑_{i ∈ Selected words} Edit distance(xi, x′i)
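A minimal Python sketch of this decomposition (the tokenized word lists and the set of selected indices are assumed given):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute ca -> cb
        prev = cur
    return prev[-1]

def total_perturbation(x, x_adv, selected):
    """Delta-x as the sum of per-word edit distances over the selected words."""
    return sum(edit_distance(x[i], x_adv[i]) for i in selected)
```

For example, edit_distance("Team", "Texm") is 1, so changing a single inner character of one selected word keeps the total perturbation at 1.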
Step 1: Scoring function (Our Methods)
Goal: Select important words
The proposed scoring functions have the following properties:
1 Correctly reflect the importance of words
2 Black-box
3 Efficient to calculate
Temporal Head Score
Temporal Tail Score
Combined score
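The three scorers can be sketched as below; this is a minimal illustration assuming the DeepWordBug definitions (the head score is the prefix difference, the tail score the suffix difference), where F stands for the black-box model's score for the current class on a word sequence and lam is a weighting hyperparameter:

```python
def temporal_head(F, words, i):
    # effect of word i when reading the prefix left to right:
    # THS(x_i) = F(x_1 .. x_i) - F(x_1 .. x_{i-1})
    return F(words[:i + 1]) - F(words[:i])

def temporal_tail(F, words, i):
    # mirror image over the suffix:
    # TTS(x_i) = F(x_i .. x_n) - F(x_{i+1} .. x_n)
    return F(words[i:]) - F(words[i + 1:])

def combined_score(F, words, i, lam=1.0):
    # weighted sum of head and tail scores
    return temporal_head(F, words, i) + lam * temporal_tail(F, words, i)
```

Each call only queries F on word subsequences, so the scorer never needs gradients, keeping the attack black-box.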
Step 2: Ranking and transformation
Calculate the scoring function for all words in the input once.
Rank all the words according to the scores.
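The ranking step can be sketched as one pass of scoring followed by a sort; score_fn stands for any of the scoring functions above and F for the black-box model:

```python
def rank_words(F, words, score_fn, top_k=5):
    # score every word once, then return the indices of the
    # top-k highest-scoring words as the targets to transform
    scores = [score_fn(F, words, i) for i in range(len(words))]
    return sorted(range(len(words)), key=scores.__getitem__, reverse=True)[:top_k]
```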
Step 3: Word Transformer (Our Methods)
Original Substitution Swapping Deletion Insertion
Team → Texm Taem Tem Tezam
Artist → Arxist Artsit Artst Articst
Computer → Computnr Comptuer Compter Comnputer
Aim I: Machine-learning based classifier views generated words as "unknown".
Aim II: Control the edit distance of the modification
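The four character-level edits in the table can be sketched as follows; positions and letters are chosen at random, and keeping the first and last letters fixed mirrors the examples above:

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def transform(word, mode, rng=random):
    """Apply one DeepWordBug character edit at a random inner position."""
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)          # inner index
    if mode == "substitute":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    if mode == "delete":
        return word[:i] + word[i + 1:]
    if mode == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if mode == "swap":                           # swap two adjacent inner chars
        if len(word) < 4:
            return word
        i = rng.randrange(1, len(word) - 2)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    raise ValueError(f"unknown mode: {mode}")
```

Each edit changes the word by at most edit distance 2 (1 for substitute, delete, and insert), which keeps the total perturbation small.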
Summary (Our Methods)
Dataset
Dataset                 #Training   #Testing  #Classes  Task
AG's News                 120,000      7,600         4  News Categorization
Amazon Review Full      3,000,000    650,000         5  Sentiment Analysis
Amazon Review Polarity  3,600,000    400,000         2  Sentiment Analysis
DBPedia                   560,000     70,000        14  Ontology Classification
Yahoo! Answers          1,400,000     60,000        10  Topic Classification
Yelp Review Full          650,000     50,000         5  Sentiment Analysis
Yelp Review Polarity      560,000     38,000         2  Sentiment Analysis
Enron Spam Email           26,972      6,744         2  Spam E-mail Detection
Methods in comparison
Random (Baseline): Random selection of words. Similar to (Papernot et al., 2016).
Gradient (Baseline): White-box method. Judges the importance of a word by the magnitude of its gradient (Samanta & Mehta, 2017).
DeepWordBug (Our method): Uses 3 different scoring functions: Temporal Head, Temporal Tail, and Combined.
Main result: Effectiveness of adversarial samples (average)
[Bar chart: Relative Performance Decrease (%) under each attack]
Random: 6.82%
Gradient: 16.36%
Replace-1: 63.02%
Temporal Head: 44.40%
Temporal Tail: 68.05%
Combined: 64.38%
(Replace-1, Temporal Head, Temporal Tail, and Combined are DeepWordBug variants)
Question: Are the generated adversarial samples transferable to other models?
Adversarial samples generated on one model can be successfully transferred between models, reducing the model accuracy from around 90% to 20-50%.
Question: How do different transformer functions work?
Varying the transformation function has only a small effect on the attack performance.
Question: How strong are the generated adversarial samples?
The generated adversarial samples successfully make the machine learning model believe a wrong answer with 0.9 probability.
Defense: by Adversarial training
[Line chart: accuracy over 10 rounds of adversarial training]
Round:                   0     1     2     3     4     5     6     7     8     9    10
Accuracy (%):         88.5  85.9  87.3  87.6  87.5  87.4  87.4  87.6  87.5  86.8  87.0
Adversarial acc. (%): 11.9  30.2  45.0  52.4  57.1  58.8  58.8  59.9  60.5  61.6  62.7
Retrain the model with adversarial samples.
Accuracy on raw inputs slightly decreases;
Accuracy on the adversarial samples rapidly increases from around 12% (before training) to 62% (after training)
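The retraining procedure can be sketched as a simple loop; model.fit and attack here are hypothetical stand-ins for the real training step and the attack:

```python
def adversarial_training(model, data, attack, rounds=10):
    # each round: train, attack the current model, and fold the
    # adversarial samples (with their original labels) back into training
    train = list(data)
    for _ in range(rounds):
        model.fit(train)
        adv = [(attack(model, x), y) for x, y in data]
        train = list(data) + adv
    return model
```

Because the attack is re-run against the freshly trained model each round, the training set tracks the adversarial samples the current model is weakest against.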
Defense: by an autocorrector?
Transformation  Original  Attack   Defend with Autocorrector
Swap            88.45%    14.77%   77.34%
Substitute      88.45%    12.28%   74.85%
Remove          88.45%    14.06%   62.43%
Insert          88.45%    12.28%   82.07%
Substitute-2    88.45%    11.90%   54.54%
Remove-2        88.45%    14.25%   33.67%
While a spellchecker reduces the effectiveness of the adversarial samples, stronger attacks such as removing 2 characters in every selected word can still reduce the model accuracy to 34%.
Related Works
Papernot et al. (2016). Iteratively:
Pick words randomly
Apply a gradient-based algorithm directly on the word embedding
Project to the nearest word

Samanta & Mehta (2017). Iteratively:
Pick important words using the gradient
Generate linguistics-based modifications on the words

Summary: white-box and costly
Conclusion
Black-box: DeepWordBug generates adversarial samples in a pure black-box manner.
Performance: Reduces the performance of state-of-the-art deep learning models by up to 80%.
Transferability: Adversarial samples generated on one model can be successfully transferred to other models, reducing the target model accuracy from around 90% to 20-50%.
References

Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
Papernot, Nicolas, et al. "Crafting adversarial input sequences for recurrent neural networks." Military Communications Conference (MILCOM), IEEE, 2016.
Samanta, Suranjana, and Sameep Mehta. "Towards Crafting Text Adversarial Samples." arXiv preprint arXiv:1707.02812 (2017).
Zhang, Xiang, Junbo Zhao, and Yann LeCun. "Character-level convolutional networks for text classification." Advances in Neural Information Processing Systems. 2015.
Rayner, Keith, Sarah J. White, and S. P. Liversedge. "Raeding wrods with jubmled lettres: There is a cost." (2006).
Why is the Word Transformer Effective?
The transformation does not guarantee that the original word becomes "unknown", but the failure chance is very small.
Suppose the longest word in the dictionary has length l; there are 27^l possible letter sequences of length ≤ l.
Let l = 8 and |D| = 20000. The chance that a changed word is not "unknown" is roughly 20000 / 27^8 ≈ 0.00000007.
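A one-line check of the collision probability above:

```python
# probability that a random letter sequence of length <= 8 happens to be
# one of the |D| = 20,000 dictionary words, out of ~27^8 candidate sequences
l, D = 8, 20_000
p = D / 27 ** l
print(f"{p:.1e}")  # ~7.1e-08, i.e. roughly 0.00000007
```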
Why current scoring functions?
For a single step, the Replace-1 score gives the best approximation.
However, it is not globally optimal.
Example:
Here, Temporal Tail gives a better result than Replace-1.