
CS573 Data Privacy and Security

Adversarial Machine Learning

Li Xiong

Machine Learning Under Adversarial Settings

• Data privacy/confidentiality attacks

• membership inference attacks, model inversion attacks

• Model integrity attacks

• Training time: data poisoning attacks

• Inference time: evasion attacks and adversarial examples

Adversarial Machine Learning

• Inference time: evasion attacks and adversarial examples

• Background

• Attacks

• Defenses

• Training time: data poisoning attacks

• Attacks

• Defenses

• Crowdsourcing applications

Adversarial Attacks and Defenses Competition (NIPS 2017)

Fereshteh Razmi

Spring 2018

Yevgeniy (Eugene) Vorobeychik 1

Bo Li 2

ADVERSARIAL MACHINE LEARNING (TUTORIAL)

1 Assistant Professor, Computer Science & Biomedical Informatics; Director, Computational Economics Research Laboratory; Vanderbilt University

2 Postdoctoral Research Associate, UC Berkeley

Adversarial ML applications

● Machine learning for adversarial applications

○ Fraud detection

○ Malware detection

○ Intrusion detection

○ Spam detection

● What do all of these have in common?


○ Detect bad “things” (actors, actions, objects)

Bad actors

● Bad actors (who do bad things) have objectives

○ the main one is not getting detected

○ they can change their behavior to avoid detection

● This gives rise to evasion attacks

○ Attacks on ML, where malicious objects are deliberately transformed to evade detection

(prediction by ML that these are malicious)

EVASION ATTACKS

• Adversary who previously chose instance x (which would be classified as malicious) now chooses another instance x’ which is classified as benign

[Figure: decision boundary with a benign and a malicious region; x is moved across the boundary to x', which falls on the benign side]

EXAMPLE OF EVASION

From: spammer@example.com
Cheap mortgage now!!!

Feature weights: cheap = 1.0, mortgage = 1.5

Total score = 2.5 > 1.0 (threshold) → Spam

EXAMPLE OF EVASION

From: spammer@example.com
Cheap mortgage now!!! Joy Oregon

Feature weights: cheap = 1.0, mortgage = 1.5, Joy = -1.0, Oregon = -1.0

Total score = 0.5 < 1.0 (threshold) → OK
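To make the arithmetic concrete, here is a minimal Python sketch of the linear spam score; the weights and threshold come from the slide, while the word-based featurizer and function names are illustrative assumptions.

```python
# Minimal sketch of the linear spam score above; weights and threshold follow the slide.

WEIGHTS = {"cheap": 1.0, "mortgage": 1.5, "joy": -1.0, "oregon": -1.0}
THRESHOLD = 1.0

def spam_score(text: str) -> float:
    """Sum the weights of known words appearing in the message body."""
    words = set(text.lower().replace("!", " ").split())
    return sum(w for token, w in WEIGHTS.items() if token in words)

def is_spam(text: str) -> bool:
    return spam_score(text) > THRESHOLD

print(spam_score("Cheap mortgage now!!!"), is_spam("Cheap mortgage now!!!"))
# -> 2.5 True  (flagged as spam)
print(spam_score("Cheap mortgage now!!! Joy Oregon"), is_spam("Cheap mortgage now!!! Joy Oregon"))
# -> 0.5 False (evades the filter)
```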

ADVERSARIAL EXAMPLES

[Figure: image classified as "panda"; adding small adversarial noise yields an image classified as "gibbon"]

Adversarial Examples

Clean Example:

● Naturally occurring example (e.g., from the ImageNet dataset)

Adversarial Example:

● Modified example

● Fools the classifier into misclassifying it (can be targeted or untargeted)

● Unnoticeable to humans


Evasion Attacks

● Malicious input

● Fools the (binary) classifier into misclassifying it as benign (evades detection)

ADVERSARIAL EXAMPLES

Figure by Qiuchen Zhang

Common Attack Scenarios

Type of outcome:

1. Non-targeted: predict ANY incorrect label

2. Targeted: Change prediction to some

SPECIFIC TARGET class

Adversary knowledge on the model:

1. White box

2. Black box with probing

3. Black box without probing

Feed data:

1. Digital attack:

a. Direct access to the digital representation

b. Precise control

2. Physical attack:

a. Physical world

b. Change camera angle, ambient light

c. Input obtained by a sensor (e.g., camera or microphone)


Adversarial Machine Learning

• Inference time: evasion attacks and adversarial examples

• Background

• Attacks

• White box attacks

• Optimization based methods: L-BFGS, C&W

• Fast/approximate methods: FGSM, I-FGSM

• Black/gray box attacks

• Defenses

• Competition methods

• Training time: data poisoning attacks

L-BFGS

Szegedy et al. (2014b)

● First method to find adversarial examples for neural networks

● x_adv: the closest image to x that is classified as y′ by f

● Find δ with box-constrained L-BFGS

● Smallest possible attack perturbation

● Drawbacks:

a. can be defeated merely by degrading the image quality (e.g., rounding to an 8-bit representation)

b. quite slow
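A sketch of the optimization Szegedy et al. solve, in the notation used here (J is the classifier's loss, y′ the target label, δ the perturbation; the exact norm and the search over c follow the usual description of the method):

$$
\min_{\delta}\; c\,\|\delta\| + J\big(x+\delta,\; y'\big)
\quad \text{s.t.} \quad x+\delta \in [0,1]^{n}
$$

solved with box-constrained L-BFGS; c is chosen (e.g., by line search) as the smallest constant for which the minimizer is actually classified as y′.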


Carlini and Wagner (C&W) (2017)

● Followed L-BFGS work

● Dealt with box constraints by a change of variables: x_adv = 0.5(tanh(w) + 1)

● κ: determines the confidence level

● Used Adam optimizer


[Figure: annotated C&W loss function, showing the 0.5(tanh(w) + 1) substitution and the confidence parameter κ]
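A sketch of the C&W L2 objective under this change of variables (Z denotes the target model's logits, t the target class, κ the confidence, c a constant typically found by binary search):

$$
\min_{w}\; \big\|x' - x\big\|_2^2 \;+\; c\cdot \max\!\Big(\max_{i\neq t} Z(x')_i - Z(x')_t,\; -\kappa\Big),
\qquad x' = \tfrac{1}{2}\big(\tanh(w)+1\big)
$$

minimized with Adam; a larger κ yields higher-confidence (and typically more transferable) adversarial examples.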

Adversarial Machine Learning

• Inference time: evasion attacks and adversarial examples

• Background

• Attacks

• White box attacks

• Optimization based methods: L-BFGS, C&W

• Fast/approximate methods: FGSM, I-FGSM

• Black/gray box attacks

• Defenses

• Competition methods

• Training time: data poisoning attacks

Explaining and Harnessing Adversarial Examples

Ian J. Goodfellow, Jonathon Shlens and Christian Szegedy

Google Inc., Mountain View, CA

Linear explanation of adversarial examples


Fast Gradient Sign Method (FGSM)

● Linear perturbation of non-linear models

● Fast (one step) but not too precise

● Using infinity norm

[Figure: linear perturbation of non-linear models; image from the reference paper]
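A minimal FGSM sketch in PyTorch; the update follows x_adv = x + ε · sign(∇_x J(θ, x, y)), while the choice of ε and the assumption that pixels lie in [0, 1] are illustrative:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One-step FGSM: x_adv = x + eps * sign(grad_x J(model(x), y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()        # step in the direction that increases the loss
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in the valid range [0, 1]
```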


Iterative Attacks (I-FGSM)

● L-BFGS: high success, high computational cost

● FGSM: low success, low computational cost

(FGSM makes rapid progress in a single step)

● Solution: iterative method (a small number of iterations)

● Targeted attacks
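A sketch of the iterative update, with step size α and a clip operator that keeps the iterate inside the ε-ball around x and inside the valid pixel range (the usual formulation in the literature):

$$
x^{adv}_{0} = x,\qquad
x^{adv}_{t+1} = \mathrm{Clip}_{x,\epsilon}\!\Big\{\, x^{adv}_{t} + \alpha\,\mathrm{sign}\big(\nabla_{x} J(x^{adv}_{t}, y)\big) \Big\}
$$

For targeted attacks, the step instead descends the loss taken with respect to the target label.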

Adversarial Machine Learning

• Inference time: evasion attacks and adversarial examples

• Background

• Attacks

• White box attacks

• Optimization based methods: L-BFGS, C&W

• Fast/approximate methods: FGSM, I-FGSM

• Black/gray box attacks

• Defenses

• Competition methods

• Training time: data poisoning attacks

Other White-box Attacks

● Madry et al.'s attack

○ Start I-FGSM from a random point inside the ε-ball

● Adversarial Transformation Networks

● Non-differentiable systems

○ cannot calculate gradients

○ use transferability: train a substitute and attack it (black box)
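A sketch of the random start Madry et al. use before running the iterative attack (an ℓ∞ ball of radius ε is assumed):

$$
x^{adv}_{0} = x + u,\qquad u \sim \mathcal{U}(-\epsilon, \epsilon)^{n}
$$

followed by the I-FGSM update above, projecting back into the ε-ball (and the valid pixel range) after every step.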

Black-box Attacks

● Transferability: an x_adv that fools one model is often able to fool other models

● The fraction of transferable x_adv lies between 0% and 100%, depending on:

○ Source model

○ Target model

○ Dataset

● Luck or high transfer rate …?

● Probes: copy of the model (substitute)

● Fully black box

○ Ensemble: if x_adv fools every model in an ensemble, it is more likely to generalize to the unseen target
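A minimal sketch of fusing logits over an ensemble of source models to improve transferability; the single FGSM-style step, uniform weights, and the model list are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def ensemble_fgsm(models, x, y, eps=8 / 255, weights=None):
    """FGSM-style step against the fused (averaged) logits of several source models.
    An x_adv that fools every source model tends to transfer better to an unseen target."""
    weights = weights or [1.0 / len(models)] * len(models)
    x = x.clone().detach().requires_grad_(True)
    fused_logits = sum(w * m(x) for w, m in zip(weights, models))  # fuse the logits
    loss = F.cross_entropy(fused_logits, y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()
```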


Adversarial Machine Learning

• Inference time: evasion attacks and adversarial examples

• Background

• Attacks

• White box attacks

• Optimization based methods: L-BFGS, C&W

• Fast/approximate methods: FGSM, I-FGSM

• Black/gray box attacks

• Defenses

• Adversarial training

• Detector/reformer based

• Competition methods

• Training time: data poisoning attacks

Defenses

● Image preprocessing/denoising

a. Compression

b. Median filter (reduce precision)

● Fails against white-box attacks

● Gradient Masking

● Most white-box attacks use gradients of the model

● Defender: makes gradients useless

a. Non-differentiable

b. Zero gradients in most places

● Vulnerable to black-box attacks: the substitute model has similar decision boundaries

● Detection-based

● Refuse to classify adversarial examples

● May decrease accuracy on clean data (e.g., shallow RBF networks)

● Automated reforming/denoising

● Adversarial Training

● Train on both clean and adversarial examples

● Drawback:

○ Tends to overfit to the specific attack used to generate the adversarial noise

○ If trained on adversarial examples generated under a max-norm constraint, it cannot resist larger perturbations


Adversarial Training

[Diagrams]

Regular training: clean image → classifier → logits1 → cross-entropy loss with the true label.

Adversarial training: clean image and adversarial image → classifier → logits1 and logits2 → cross-entropy loss with the true label on both, combined into one loss.

Ensemble adversarial training: the adversarial image is generated by FGSM against a randomly chosen pretrained model (VGG16, VGG19, ResNet18); clean and adversarial images → classifier → cross-entropy losses as above.

Ensemble adversarial training + adversarial logits pairing (ALP): as above, plus an ALP loss (MSE) pairing logits1 and logits2.
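A minimal PyTorch training-step sketch matching the diagrams above: cross-entropy on both the clean and the FGSM-generated adversarial batch, plus an optional MSE logits-pairing (ALP) term; the ε, loss weights, and helper structure are assumptions:

```python
import torch
import torch.nn.functional as F

def adv_training_step(model, optimizer, x, y, eps=8 / 255, alp_weight=0.0):
    """One adversarial-training step; set alp_weight > 0 to add logits pairing (ALP)."""
    # Generate an FGSM adversarial copy of the batch (same update as the FGSM sketch above).
    x_req = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_req), y).backward()
    x_adv = (x_req + eps * x_req.grad.sign()).clamp(0.0, 1.0).detach()

    optimizer.zero_grad()
    logits_clean = model(x)      # "logits1" in the diagram
    logits_adv = model(x_adv)    # "logits2" in the diagram
    loss = F.cross_entropy(logits_clean, y) + F.cross_entropy(logits_adv, y)
    if alp_weight > 0:           # adversarial logits pairing term (MSE)
        loss = loss + alp_weight * F.mse_loss(logits_adv, logits_clean)
    loss.backward()
    optimizer.step()
    return loss.item()
```

For ensemble adversarial training, the adversarial batch would instead be generated by FGSM against a randomly chosen pretrained source model, as in the diagram.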

Adversarial Machine Learning

• Inference time: evasion attacks and adversarial examples

• Background

• Attacks

• White box attacks

• Optimization based methods: L-BFGS, C&W

• Fast/approximate methods: FGSM, I-FGSM

• Black/gray box attacks

• Defenses

• Adversarial training

• Detector/reformer based

• Competition methods

• Training time: data poisoning attacks

MagNet: a Two-Pronged Defense against Adversarial Examples*

*Dongyu Meng (ShanghaiTech University), Hao Chen (University of California, Davis)

ACM CCS 2017

Adversarial vs. Normal Examples

Manifold of normal examples. An adversarial example is misclassified when it is:

1. Far from the boundary of the manifold (the classifier has no option to reject it)

2. Close to the boundary (the classifier is poorly generalized there)


1. Adversarial Training

Build a robust classifier

Train on both Adv. and Norm.

What attack to train on

2. Detecting Adv. Examples

Separate classification network (detector)

Train on both Adv. and Norm.

What attack to train on

3. Defensive Distillation

Train classifier in a specific way

Hard for attacks

Complex to retrain, Not protected against Carlini attack

Existing Defense Methods



MagNet

● Doesn’t retrain classifier

● Uses only normal examples (can generalize across attacks)


MagNet Design


Detector

Decides whether the input is adversarial

Detector based on reconstruction error:

Trains an autoencoder on normal examples

Reconstruction error E(x) is high on adversarial examples

Defines a threshold on E(x)

Not effective when E(x) is small

Detector based on probability divergence:

Uses the autoencoder and the classifier's softmax layer

f(x) ≈ f(ae(x)) for normal x, but f(x′) ≠ f(ae(x′)) for adversarial x′

Softmax may saturate → add a temperature T
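A minimal sketch of the two detector criteria; the autoencoder ae, classifier f, thresholds, L1 reconstruction error, and the KL-based divergence (standing in for the paper's Jensen-Shannon divergence) are assumptions in the spirit of the slide:

```python
import torch
import torch.nn.functional as F

def reconstruction_error(ae, x):
    """E(x) = ||x - ae(x)||_1, averaged per example."""
    return (x - ae(x)).abs().flatten(1).mean(dim=1)

def divergence(f, ae, x, T=10.0):
    """Divergence between temperature-softened predictions on x and on ae(x)."""
    p_log = F.log_softmax(f(x) / T, dim=1)
    q = F.softmax(f(ae(x)) / T, dim=1)
    return F.kl_div(p_log, q, reduction="none").sum(dim=1)

def is_adversarial(ae, f, x, err_thresh, div_thresh, T=10.0):
    """Flag inputs whose reconstruction error or divergence exceeds thresholds
    chosen on held-out normal examples (e.g., at a small false-positive rate)."""
    return (reconstruction_error(ae, x) > err_thresh) | (divergence(f, ae, x, T) > div_thresh)
```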


Reformer

Autoencoder-based

Train:

▪ Use Normal examples

▪ Minimize Reconstruction Error

Test:

▪ Normal input: comes from the same data-generating process as the training set, so it is reconstructed accurately

▪ Adversarial input: the autoencoder approximates it and moves it closer to the manifold of normal examples
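At test time the reformer is just a forward pass through the same kind of autoencoder before classification; a minimal sketch, with ae and f as above:

```python
def reformed_predict(f, ae, x):
    """Reformer: push the input toward the manifold of normal examples, then classify.
    Normal inputs pass through almost unchanged; adversarial inputs are pulled
    back toward the training distribution before reaching the classifier f."""
    return f(ae(x)).argmax(dim=1)
```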


Blackbox Attack: Results (MNIST / CIFAR-10)

Setup:
Reformers: 1 (MNIST), 1 (CIFAR-10)
Detectors: MNIST: 2, reconstruction-error-based; CIFAR-10: 1 reconstruction-error-based, 2 probability-divergence-based (T = 10, 40)

Accuracy:
Target classifier: 99.4% (MNIST), 90.6% (CIFAR-10)
MagNet on normal examples: 99.1% (MNIST), 86.8% (CIFAR-10)
MagNet on adversarial examples: > 99%, except Carlini L0 (MNIST); > 75%, and > 99% on half of the attacks (CIFAR-10)



Blackbox Attack

* MNIST, CIFAR-10

* Impact of the reformer and the detector?

* Blackbox attack

* Carlini's L2 attack with different confidence levels

* The higher the confidence, the harder the attack


Adversarial Machine Learning

• Inference time: evasion attacks and adversarial examples

• Background

• Attacks

• White box attacks

• Optimization based methods: L-BFGS, C&W

• Fast/approximate methods: FGSM, I-FGSM

• Black/gray box attacks

• Defenses

• Competition

• Training time: data poisoning attacks

NIPS competition

Evaluation

● ImageNet

● Time limit: a batch of 100 images must be processed in less than 500 s

● Development dataset vs. Secret final dataset

● 4 rounds (3 optional, just for test purposes)

● Test all Defense methods against all attacks

● Attack score: number of times the attack fools the defense methods

● Defense score: number of correctly classified examples

Tasks

1. Non-targeted Adversarial Attack (blackbox)

2. Targeted Adversarial Attack (blackbox)

3. Defense against Adversarial Attacks: robust classifier


Results

91 non-targeted attack, 65 targeted attack, and 107 defense submissions


Baseline score vs. top-scoring submission (TsAIL) vs. worst score:

Defense:       baseline 0.772,  top 0.9531,  worst 0.1184
Non-targeted:  baseline 0.346,  top 0.781,   worst 0.136
Targeted:      baseline 0.199,  top 0.402,   worst 0

1st place in both attack tracks (TsAIL)

● Momentum iterative attack (MI-FGSM)

● Attack strength vs. transferability (FGSM vs. I-FGSM)

● Uses momentum to stabilize update directions and escape from poor local optima (avoid overfitting)

● Use ensemble (fuse logits) to be more transferable (for black box attacks)

● For targeted attacks

○ Standard deviation instead of L1 norm

○ Set {-2,-1,0,1,2}, larger search space
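A sketch of the standard momentum (MI-FGSM) update behind this attack, with decay factor μ; for the ensemble version, J is computed on the fused logits, and the targeted variant in the slide normalizes by the standard deviation instead of the L1 norm:

$$
g_{t+1} = \mu\, g_{t} + \frac{\nabla_{x} J(x^{adv}_{t}, y)}{\big\|\nabla_{x} J(x^{adv}_{t}, y)\big\|_{1}},
\qquad
x^{adv}_{t+1} = \mathrm{Clip}_{x,\epsilon}\!\Big\{\, x^{adv}_{t} + \alpha\,\mathrm{sign}(g_{t+1}) \Big\}
$$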


2nd place in Both attack tracks (Sangxia)

● I-FGSM based

● Ensemble of models for transferability: f_i

● Image augmentation (rotate, scale, …, parameterized by θ) for transferability: T_θ

● Only include adversarially trained models in later iterations

● Targeted attacks:

○ Minimize the loss with respect to y_target

○ No augmentation


3rd place in targeted attack (FatFingers)

● Dynamic iterative ensemble attack

● Uses many NNs in the ensemble (hit rate is zero)

● Dynamically excludes some NNs in later iterations

○ Preassign the number of iterations

○ Remove a network if its loss function is small enough

1st place in Defense (TsAIL)

● Denoiser-based

● Input: adversarial example; output: adversarial noise

● Loss function: distance between the representations at the l-th layer of the target model for the original and the adversarial (denoised) image

● l is either the topmost convolutional layer or the logits layer

● Submitted: DUNET (denoising U-Net) + CNN

● Trained on FGSM and I-FGSM
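A sketch of the high-level-representation-guided loss described above, where D is the denoiser predicting the adversarial noise, f_l the target model's activation at layer l (topmost convolutional layer or logits), and x̂ the denoised image:

$$
\hat{x} = x^{adv} - D(x^{adv}),\qquad
L = \big\| f_{l}(\hat{x}) - f_{l}(x) \big\|_{1}
$$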


2nd place in Defense (iyswim)

● Randomization

1. Random resizing layer

2. Random padding layer

3. Adversarially trained model

● Low-level transformations may destroy the perturbation

● Advantages:

○ No additional training

○ Few additional computations

○ Compatible with other defense methods

● Randomization can also be applied against white-box attacks
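A minimal sketch of the random resize + random padding preprocessing; the size range, nearest-neighbor resizing, and padding scheme are illustrative, and the adversarially trained classifier is assumed to be given:

```python
import random
import torch.nn.functional as F

def randomized_predict(model, x, min_size=299, max_size=331):
    """Randomly resize, then randomly zero-pad to a fixed size, then classify.
    The random low-level transformation tends to destroy the carefully tuned
    adversarial perturbation without any retraining of the model."""
    new_size = random.randint(min_size, max_size - 1)
    x = F.interpolate(x, size=new_size, mode="nearest")          # random resizing layer
    pad = max_size - new_size
    left, top = random.randint(0, pad), random.randint(0, pad)
    x = F.pad(x, (left, pad - left, top, pad - top), value=0.0)  # random padding layer
    return model(x).argmax(dim=1)
```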

Adversarial Machine Learning

• Inference time: evasion attacks and adversarial examples

• Background

• Attacks

• Defenses

• Competition

• Training time: data poisoning attacks