CS573 Data Privacy and Security
Adversarial Machine Learning
Li Xiong
Machine Learning Under Adversarial Settings
• Data privacy/confidentiality attacks
• membership attacks, model inversion attacks
• Model integrity attacks
• Training time: data poisoning attacks
• Inference time: evasion attacks and adversarial examples
Adversarial Machine Learning
• Inference time: evasion attacks and adversarial examples
• Background
• Attacks
• Defenses
• Training time: data poisoning attacks
• Attacks
• Defenses
• Crowdsourcing applications
Adversarial Attacks and Defense Competition
NIPS 2017
Fereshteh Razmi
Spring 2018
Adversarial Machine Learning (Tutorial)
Yevgeniy (Eugene) Vorobeychik¹ and Bo Li²
¹ Assistant Professor, Computer Science & Biomedical Informatics; Director, Computational Economics Research Laboratory, Vanderbilt University
² Postdoctoral Research Associate, UC Berkeley
Adversarial ML applications
● Machine learning for adversarial applications
○ Fraud detection
○ Malware detection
○ Intrusion detection
○ Spam detection
● What do all of these have in common?
○ Detect bad “things” (actors, actions, objects)
Bad actors
● Bad actors (who do bad things) have objectives
○ the main one is not getting detected
○ they can change their behavior to avoid detection
● This gives rise to evasion attacks
○ Attacks on ML where malicious objects are deliberately transformed to evade detection (i.e., to avoid being predicted as malicious by the ML model)
EVASION ATTACKS
• An adversary who previously chose an instance x (which would be classified as malicious) now chooses another instance x' that is classified as benign
[Figure: decision boundary separating the benign and malicious regions]
EXAMPLE OF EVASION
From: [email protected]
"Cheap mortgage now!!!"
Feature weights: cheap = 1.0, mortgage = 1.5
Total score = 2.5 > 1.0 (threshold) → classified as Spam
EXAMPLE OF EVASION
From: [email protected]
"Cheap mortgage now!!! Joy Oregon"
Feature weights: cheap = 1.0, mortgage = 1.5, Joy = -1.0, Oregon = -1.0
Total score = 0.5 < 1.0 (threshold) → classified as OK
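The two examples above boil down to a linear score compared against a threshold. A minimal sketch of that scoring rule in Python; the word weights and the 1.0 threshold come from the slides, while the function name and tokenization are illustrative:

```python
# Linear spam score: sum the weights of known words and compare to a threshold.
WEIGHTS = {"cheap": 1.0, "mortgage": 1.5, "joy": -1.0, "oregon": -1.0}

def spam_score(text, threshold=1.0):
    score = sum(WEIGHTS.get(word.lower().strip("!.,"), 0.0) for word in text.split())
    return score, ("Spam" if score > threshold else "OK")

print(spam_score("Cheap mortgage now!!!"))             # (2.5, 'Spam')
print(spam_score("Cheap mortgage now!!! Joy Oregon"))  # (0.5, 'OK')
```

Adding innocuous negatively weighted words ("Joy", "Oregon") drags the score below the threshold without changing the spam payload, which is exactly the evasion shown above.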
ADVERSARIAL EXAMPLES
[Figure: an image classified as "panda", after adding small adversarial noise, is classified as "gibbon"]
Adversarial Examples
Clean Example:
● Naturally occurring example (e.g., from the ImageNet dataset)
Adversarial Example:
● Modified example
● Fools the classifier into misclassifying it (can be targeted or untargeted)
● Unnoticeable to a human
Evasion Attacks
● Malicious input
● Fool a (binary) classifier into misclassifying it as benign (evade detection)
ADVERSARIAL EXAMPLES
Figure by Qiuchen Zhang
Common Attack Scenarios
Type of outcome:
1. Non-targeted: predict ANY incorrect label
2. Targeted: change the prediction to some SPECIFIC target class
Adversary's knowledge of the model:
1. White box
2. Black box with probing
3. Black box without probing
How data is fed:
1. Digital attack:
a. Direct access to the digital representation
b. Precise control
2. Physical attack:
a. Operates in the physical world
b. Can change the camera angle, ambient light, etc.
c. Input is obtained through a sensor (e.g., a camera or microphone)
Adversarial Machine Learning
• Inference time: evasion attacks and adversarial examples
• Background
• Attacks
• White box attacks
• Optimization based methods: L-BFGS, C&W
• Fast/approximate methods: FGSM, I-FGSM
• Black/gray box attacks
• Defenses
• Competition methods
• Training time: data poisoning attacks
L-BFGS
Szegedy et al. (2014b)
● First method for finding adversarial examples for neural networks
● x_adv: the closest image to x that is classified as y' by f
● Find the perturbation δ with box-constrained L-BFGS
● Gives the smallest possible attack perturbation
● Drawbacks:
a. can be defeated merely by degrading the image quality (e.g., rounding to an 8-bit representation)
b. quite slow
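For reference, a hedged restatement of the optimization these bullets describe, following Szegedy et al.'s formulation as I understand it (the constant c is found by line search):

$$\min_{\delta}\; c\,\|\delta\|_2 + \mathrm{loss}_f(x+\delta,\, y') \quad \text{subject to} \quad x+\delta \in [0,1]^m$$

solved with box-constrained L-BFGS, keeping the smallest δ for which f(x+δ) = y'.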
Carlini and Wagner (C&W) (2017)
● Follow-up to the L-BFGS attack
● Dealt with the box constraints through a change of variables: x_adv = 0.5(tanh(w) + 1)
● κ determines the confidence level
● Used the Adam optimizer
Loss function (in terms of x' = 0.5(tanh(w) + 1)):
$$\min_w\; \|x' - x\|_2^2 + c \cdot g(x'), \qquad g(x') = \max\Big(\max_{i \neq t} Z(x')_i - Z(x')_t,\; -\kappa\Big)$$
where Z(·) are the logits, t is the target class, and κ sets the confidence.
Adversarial Machine Learning
• Inference time: evasion attacks and adversarial examples
• Background
• Attacks
• White box attacks
• Optimization based methods: L-BFGS, C&W
• Fast/approximate methods: FGSM, I-FGSM
• Black/gray box attacks
• Defenses
• Competition methods
• Training time: data poisoning attacks
Explaining and Harnessing Adversarial Examples
Ian J. Goodfellow, Jonathon Shlens and Christian Szegedy
Google Inc., Mountain View, CA
Linear explanation of adversarial examples
Fast Gradient Sign Method (FGSM)
● Linear perturbation of non-linear models
● Fast (one step) but not very precise
● Uses the infinity norm
[Figure: linear perturbation of non-linear models; image from the reference paper]
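A minimal sketch of the one-step FGSM update in PyTorch (an assumed framework; the model handle and the ε value are illustrative, not from the slides):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One-step FGSM: x_adv = x + eps * sign(grad_x loss(model(x), y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)       # loss on the true label
    loss.backward()
    x_adv = x + eps * x.grad.sign()           # step along the sign of the gradient (L-infinity)
    return x_adv.clamp(0, 1).detach()         # keep pixels in a valid range
```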
Iterative Attacks (I-FGSM)
● L-BFGS: high success rate, high computational cost
● FGSM: low success rate, low computational cost (rapid progress in a single step)
● Solution: an iterative method with a small number of iterations
● Also usable for targeted attacks
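A minimal iterative version of the FGSM sketch above, again assuming PyTorch; the step size α, iteration count, and ε are illustrative:

```python
import torch
import torch.nn.functional as F

def ifgsm(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """I-FGSM: repeat small FGSM steps, projecting back into the eps-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # L-infinity projection
        x_adv = x_adv.clamp(0, 1)                              # valid pixel range
    return x_adv
```

For a targeted attack, one would instead take steps that decrease the loss with respect to the target label.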
Adversarial Machine Learning
• Inference time: evasion attacks and adversarial examples
• Background
• Attacks
• White box attacks
• Optimization based methods: L-BFGS, C&W
• Fast/approximate methods: FGSM, I-FGSM
• Black/gray box attacks
• Defenses
• Competition methods
• Training time: data poisoning attacks
Other White-box Attacks
● Madry et al.'s attack
○ Start I-FGSM from a random point inside the ε-ball
● Adversarial Transformation Networks
● Non-differentiable systems
○ cannot calculate gradients
○ use transferability with a trained substitute (black box)
Black-box Attacks
● Transferability: an x_adv that fools one model is often able to fool other models
● The fraction of transferable x_adv lies between 0% and 100%, depending on
○ the source model
○ the target model
○ the dataset
● Luck, or a genuinely high transfer rate?
● With probes: train a copy of the model (a substitute) and attack that
● Fully black box
○ Ensemble: if an x_adv fools every model in an ensemble, it is more likely to generalize to the target
Adversarial Machine Learning
• Inference time: evasion attacks and adversarial examples
• Background
• Attacks
• White box attacks
• Optimization based methods: L-BFGS, C&W
• Fast/approximate methods: FGSM, I-FGSM
• Black/gray box attacks
• Defenses
• Adversarial training
• Detector/reformer based
• Competition methods
• Training time: data poisoning attacks
Defenses
● Image preprocessing / denoising
○ Compression
○ Median filter (reduces precision)
○ Fails against white-box attacks
● Gradient masking
○ Most white-box attacks use the gradients of the model
○ The defender makes the gradients useless: non-differentiable operations, or zero gradients almost everywhere
○ Vulnerable to black-box attacks: substitute models have similar decision boundaries
● Detection-based
○ Refuse to classify adversarial examples
○ May decrease accuracy on clean data (e.g., shallow RBF networks)
○ Automated reforming / denoising
● Adversarial training
○ Train on both clean and adversarial examples
○ Drawbacks:
▪ Tends to overfit to the specific attack used to add the noise
▪ If trained on examples generated under some max-norm constraint, it cannot resist larger perturbations
Adversarial Training
[Diagrams summarized:]
● Regular training: clean image → classifier → logits1; cross-entropy loss between logits1 and the true label.
● Adversarial training: both the clean image and an adversarial image go through the classifier, giving logits1 and logits2; the loss combines the cross-entropy of each with the true label.
● Ensemble adversarial training: the adversarial image is generated with FGSM against a randomly chosen external model (ResNet-18, VGG-16, or VGG-19), then fed through the same clean + adversarial training pipeline.
● Ensemble adversarial training + Adversarial Logits Pairing (ALP): in addition to the two cross-entropy terms, an MSE loss pairs logits1 and logits2 so the clean and adversarial logits stay close.
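A minimal sketch of the combined training loss described by these diagrams, assuming PyTorch; the model handle, the make_adv attack function (e.g., FGSM on a randomly chosen source model), and the loss weights are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def adv_training_loss(model, x_clean, y, make_adv, w_adv=1.0, w_alp=0.5):
    """Cross-entropy on clean and adversarial logits, plus an ALP (MSE) pairing term."""
    x_adv = make_adv(x_clean, y)                  # adversarial image (e.g., FGSM on an ensemble member)
    logits_clean = model(x_clean)                 # "logits1" in the diagrams
    logits_adv = model(x_adv)                     # "logits2" in the diagrams
    loss = F.cross_entropy(logits_clean, y)                       # clean cross-entropy
    loss = loss + w_adv * F.cross_entropy(logits_adv, y)          # adversarial cross-entropy
    loss = loss + w_alp * F.mse_loss(logits_adv, logits_clean)    # adversarial logits pairing
    return loss
```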
Adversarial Machine Learning
• Inference time: evasion attacks and adversarial examples
• Background
• Attacks
• White box attacks
• Optimization based methods: L-BFGS, C&W
• Fast/approximate methods: FGSM, I-FGSM
• Black/gray box attacks
• Defenses
• Adversarial training
• Detector/reformer based
• Competition methods
• Training time: data poisoning attacks
MagNet:
a Two-Pronged Defense against
Adversarial Examples*
*Dongyu Meng (ShanghaiTech University), Hao Chen (University of California, Davis)
ACM CCS 2017
Adversarial vs. Normal Examples
● Normal examples lie on a manifold of normal data
● Why adversarial examples are misclassified:
1. Far from the boundary (of the manifold): the classifier has no option to reject them
2. Close to the boundary: the classifier generalizes poorly there
Existing Defense Methods
1. Adversarial Training
Build a robust classifier
Train on both adversarial and normal examples
Must decide what attack to train on
2. Detecting Adversarial Examples
Separate classification network (detector)
Train on both adversarial and normal examples
Must decide what attack to train on
3. Defensive Distillation
Train the classifier in a specific way that makes it hard to attack
Complex to retrain; not protected against the Carlini attack
MagNet
● Does not retrain the classifier
● Uses only normal examples (so it can generalize across attacks)
MagNet Design
Detector
Decides whether the input is adversarial
Detector based on reconstruction error
Train an autoencoder ae on normal examples only
Reconstruction error E(x) = ||x - ae(x)||
E(x) is high on adversarial examples
Define a threshold on E
Not effective when E is small
Detector based on probability divergence
Uses the autoencoder and the classifier's softmax layer
f(x) ≈ f(ae(x)) for normal x, but f(x') ≠ f(ae(x')) for adversarial x'
The softmax may saturate → add a temperature T
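A minimal sketch of the reconstruction-error detector, assuming PyTorch and an autoencoder ae already trained on normal examples; the norm choice and the threshold-selection step are illustrative assumptions, not the paper's code:

```python
import torch

def reconstruction_error(ae, x):
    """E(x): norm of the difference between x and its autoencoder reconstruction."""
    return torch.norm((x - ae(x)).flatten(1), p=1, dim=1)

def fit_threshold(ae, x_clean, false_positive_rate=0.01):
    """Pick a threshold so that only a small fraction of clean inputs is flagged."""
    errors = reconstruction_error(ae, x_clean)
    return torch.quantile(errors, 1.0 - false_positive_rate)

def detect(ae, x, threshold):
    """Flag inputs whose reconstruction error exceeds the threshold as adversarial."""
    return reconstruction_error(ae, x) > threshold
```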
Reformer
Autoencoder-based
Train:
▪ Use normal examples only
▪ Minimize the reconstruction error
Test:
▪ Normal input: comes from the same data-generating process as the training set, so it is reconstructed faithfully
▪ Adversarial input: the autoencoder approximates it and moves it closer to the normal manifold
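A minimal sketch of how the detector and reformer could be combined into the two-pronged pipeline, assuming PyTorch modules ae (the autoencoder) and clf (the target classifier) plus a threshold fitted as above; names and structure are illustrative, not the paper's implementation:

```python
import torch

def magnet_predict(ae, clf, x, threshold):
    """Reject inputs with a large reconstruction error; classify the reformed rest."""
    err = torch.norm((x - ae(x)).flatten(1), p=1, dim=1)   # detector: reconstruction error
    keep = err <= threshold                                 # mask of inputs accepted as normal
    logits = clf(ae(x[keep]))                               # reformer: classify the reconstruction
    return logits, keep
```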
Blackbox Attack

Setup               MNIST                          CIFAR-10
Reformers           1                              1
Detectors           2, reconstruction-error based  1 reconstruction-error based,
                                                   2 probability-divergence based (T = 10, 40)

Accuracy            MNIST                          CIFAR-10
Target classifier   99.4%                          90.6%
MagNet on normal    99.1%                          86.8%
MagNet on           > 99%                          > 75%
adversarial         (except Carlini L0)            (> 99% on half of the attacks)
Blackbox Attack
* Datasets: MNIST, CIFAR-10
* What is the impact of the reformer and the detector?
* Black-box attack
* Carlini's L2 attack with different confidence levels
* The higher the confidence, the harder the attack
Adversarial Machine Learning
• Inference time: evasion attacks and adversarial examples
• Background
• Attacks
• White box attacks
• Optimization based methods: L-BFGS, C&W
• Fast/approximate methods: FGSM, I-FGSM
• Black/gray box attacks
• Defenses
• Competition
• Training time: data poisoning attacks
NIPS competition
Evaluation
● ImageNet
● Time limit: a batch of 100 images must be processed in under 500 s
● Development dataset vs. a secret final dataset
● 4 rounds (3 optional, just for test purposes)
● All defense methods are tested against all attacks
● Attack score: number of times the attack fools the defense methods
● Defense score: number of correctly classified examples
Tasks
1. Non-targeted Adversarial Attack (blackbox)
2. Targeted Adversarial Attack (blackbox)
3. Defense against Adversarial Attacks: Robust
classifier
Results
91 non-targeted attack, 65 targeted attack, and 107 defense submissions
Track          Baseline score   Top-scoring submission (TsAIL)   Worst score
Defense        0.772            0.9531                           0.1184
Non-targeted   0.346            0.781                            0.136
Targeted       0.199            0.402                            0
1st place in both attack tracks (TsAIL)
● Momentum iterative attack
● Trade-off between attack strength and transferability (FGSM vs. I-FGSM)
● Uses momentum to stabilize the update directions and escape poor local optima (avoid overfitting)
● Uses an ensemble (fusing the logits) for better transferability (for black-box attacks)
● For targeted attacks:
○ Standard deviation instead of the L1 norm
○ Value set {-2, -1, 0, 1, 2}: a larger search space
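A sketch of the momentum iterative update behind this attack, as I recall it from the published method (μ is the momentum decay factor, α the step size; treat the exact normalization as an assumption):

$$g_{t+1} = \mu\, g_t + \frac{\nabla_x J(x^{adv}_t, y)}{\|\nabla_x J(x^{adv}_t, y)\|_1}, \qquad x^{adv}_{t+1} = \mathrm{Clip}_{x,\epsilon}\!\big(x^{adv}_t + \alpha \cdot \mathrm{sign}(g_{t+1})\big)$$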
2nd place in both attack tracks (Sangxia)
● I-FGSM based
● Ensemble of models f_i for transferability
● Image augmentation (rotate, scale, ... by a parameter θ) for transferability: T_θ
● Only include adversarially trained models in later iterations
● Targeted attacks:
○ Minimize the loss with respect to y_target
○ No augmentation
3rd place in targeted attack (FatFingers)
● Dynamic iterative ensemble attack
● Uses many NNs in the ensemble (hit rate is zero)
● Dynamically excludes some NNs in later iterations
○ Pre-assign the number of iterations
○ Remove a network once its loss is small enough
1st place in Defense (TsAIL)
● Denoiser-based
● Input: an adversarial example; output: the adversarial noise
● Loss: difference between the target model's representations at the l-th layer for the original and the adversarial image
● Variants: the topmost convolutional layer, or the logits layer
● Submitted: DUNET (denoising U-Net) + CNN
● Trained on FGSM and I-FGSM examples
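As I understand the guided-denoiser idea, the training loss compares the target model's layer-l activations on the clean image with those on the denoised adversarial image (the choice of norm here is an assumption):

$$L = \big\| f_l(x) - f_l\big(D(x^{adv})\big) \big\|_1$$

where D is the denoiser output (the adversarial image minus its predicted noise) and f_l is the target model's l-th layer, taken either at the topmost convolutional layer or at the logits.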
2nd place in Defense (iyswim)
● Randomization
1. Random resizing layer
2. Random padding layer
3. Adversarially trained model
● Low-level transformations may destroy the adversarial perturbation
● Advantages:
○ No additional training
○ Few additional computations
○ Compatible with other defense methods
● Thanks to the randomization, it can also be applied against white-box attacks
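A minimal sketch of the random resize + random pad preprocessing, assuming PyTorch; the output size and the resize range are illustrative assumptions (and the input is assumed smaller than the output size), not the submission's exact values:

```python
import random
import torch
import torch.nn.functional as F

def random_resize_pad(x, out_size=331):
    """Randomly resize the batch, then randomly zero-pad it to a fixed output size."""
    _, _, h, _ = x.shape
    new_size = random.randint(h, out_size - 1)                    # random resizing layer
    x = F.interpolate(x, size=(new_size, new_size), mode="nearest")
    pad = out_size - new_size
    left, top = random.randint(0, pad), random.randint(0, pad)    # random padding layer
    return F.pad(x, (left, pad - left, top, pad - top), value=0.0)

# At inference time, feed random_resize_pad(x) into the (adversarially trained) model,
# so the exact perturbation the attacker optimized no longer lines up with the input.
```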
Adversarial Machine Learning
• Inference time: evasion attacks and adversarial examples
• Background
• Attacks
• Defenses
• Competition
• Training time: data poisoning attacks