Adversarial Attack

An adversarial attack is a technique that exploits vulnerabilities in machine learning models, particularly deep neural networks, by feeding them carefully crafted inputs called adversarial examples. These examples are designed to deceive the model into making incorrect predictions or classifications. The goal is to steer the model toward an outcome the attacker wants, typically while the adversarial input remains indistinguishable from a clean one to a human observer.

Adversarial attacks are typically designed for a specific target model. No single method works against every model, because architectures, training processes, and defense mechanisms vary widely. Some attack methods are nonetheless broadly applicable, especially across models with similar architectures or characteristics; gradient-based attacks such as the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) are examples.
Attackers often need some knowledge of the target model, such as its architecture, its parameters, or access to its gradients, to craft effective adversarial examples.

Attacks can also happen at the training stage, e.g. data poisoning.

Taxonomy

Taxonomy of Adversarial Attack Threat Model

By Adversarial Falsification:

False positive attacks generate a negative sample that is misclassified as a positive one (Type I Error). In an image classification task, a false positive can be an adversarial image that is unrecognizable to humans but that a DNN assigns to a class with a high confidence score.

False negative attacks generate a positive sample which is misclassified as a negative one (Type II Error). Also called ML evasion. This error is shown in most adversarial images, where a human can recognize the image, but the neural networks cannot identify it.

By Knowledge:

White-box

Black-box

decision-based adversarial attack

Most adversarial example attacks are white-box attacks. However, they can be transferred to attack black-box services, owing to the transferability of adversarial examples described by Papernot et al.

Because adversarial examples are transferable (cf. Transferability of Adversarial Examples), an attacker who has the same training data as the original network can train a mirror network of the black-box original and then attack the mirror network with white-box techniques. If the attack on the mirror network succeeds, it will likely succeed on the original.

By Specificity: (For binary classification, targeted attacks are equivalent to non-targeted attacks.)

Targeted attack: misguide DNNs to a specific class.

Non-targeted attack: no specific class is required of the neural network output. The adversarial output class can be any class except the original one.

By Frequency:

One-time attacks optimize the adversarial example in a single step.

Iterative attacks update the adversarial example over multiple steps.


Taxonomy of Perturbation

Scope:

Individual attacks generate different perturbations for each clean input.
Universal attacks create a single perturbation for the whole data set. This perturbation can be applied to every clean input.

Limitation

optimization vs constraint

measurement

p-norm distance: $\|x\|_{p} = \left(\sum_{i=1}^{N} |x_{i}|^{p}\right)^{\frac{1}{p}}$
Psychometric perceptual adversarial similarity score (PASS)
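
The p-norm measurement can be illustrated with a short sketch. It computes the size of a perturbation under a few common choices of p; the input vectors and the helper name `p_norm` are made up for illustration, and the infinity norm is handled as the usual limiting case (largest absolute component).

```python
def p_norm(x, p):
    """Return the p-norm of a vector for p >= 1, or the max norm for p = inf."""
    if p == float("inf"):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


# Hypothetical clean input and adversarial counterpart (values in [0, 1]).
clean = [0.20, 0.50, 0.90, 0.10]
adversarial = [0.20, 0.53, 0.86, 0.10]

# The perturbation is the element-wise difference.
eta = [a - c for a, c in zip(adversarial, clean)]

l1 = p_norm(eta, 1)                # total absolute change
l2 = p_norm(eta, 2)                # Euclidean size of the change
linf = p_norm(eta, float("inf"))   # largest change to any single component
```

For this perturbation the three values are approximately 0.07, 0.05, and 0.04.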

White-Box Methods of Adversarial Attack

Fast Gradient Sign Method (FGSM)

Check Fast Gradient Sign Method (FGSM)
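
For orientation, FGSM takes a single step of size $ε$ in the direction of the sign of the loss gradient with respect to the input: $x^{'} = x + ε \cdot sign(\nabla_{x} loss)$. Below is a minimal sketch, assuming a tiny logistic-regression model whose weights `w`, bias `b`, and input `x` are invented for illustration; the gradient is written analytically instead of coming from an autodiff library.

```python
import math


def predict_proba(x, w, b):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient(x, y, w, b):
    """Gradient of the cross-entropy loss w.r.t. the input: (p - y) * w."""
    p = predict_proba(x, w, b)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(x, y, w, b, eps):
    """One signed-gradient step of size eps, clipped to the valid range [0, 1]."""
    grad = loss_gradient(x, y, w, b)
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]


# Hypothetical model and input; the true label is 1.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.6, 0.2, 0.3], 1

x_adv = fgsm(x, y, w, b, eps=0.2)
p_clean = predict_proba(x, w, b)      # above 0.5: classified as 1
p_adv = predict_proba(x_adv, w, b)    # below 0.5: classified as 0
```

This is a one-time, non-targeted attack in the taxonomy above: one gradient evaluation, and any class other than the original counts as success.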

PGD Attack

Check Projected Gradient Descent (PGD)
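
PGD can be viewed as an iterative signed-gradient attack: take a small step of size `alpha`, then project the result back onto the allowed set, here the intersection of an $L_{\infty}$ ball of radius `eps` around the clean input and the box $[0, 1]^{n}$. The sketch below assumes the same kind of toy logistic-regression model (all weights and inputs are invented) and omits the random start that many implementations use.

```python
import math


def predict_proba(x, w, b):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient(x, y, w, b):
    """Gradient of the cross-entropy loss w.r.t. the input: (p - y) * w."""
    p = predict_proba(x, w, b)
    return [(p - y) * wi for wi in w]


def project(x_adv, x, eps):
    """Clip into the L-infinity ball of radius eps around x, then into [0, 1]."""
    out = []
    for xa, xi in zip(x_adv, x):
        xa = min(xi + eps, max(xi - eps, xa))
        out.append(min(1.0, max(0.0, xa)))
    return out


def pgd(x, y, w, b, eps, alpha, steps):
    """Repeat: signed-gradient ascent step on the loss, then projection."""
    x_adv = list(x)
    for _ in range(steps):
        grad = loss_gradient(x_adv, y, w, b)
        stepped = [xa + alpha * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        x_adv = project(stepped, x, eps)
    return x_adv


# Hypothetical model and input; the true label is 1.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.6, 0.2, 0.3], 1

x_adv = pgd(x, y, w, b, eps=0.2, alpha=0.05, steps=10)
p_adv = predict_proba(x_adv, w, b)
```

On this linear toy model the gradient sign never changes, so the iterates simply walk to the corner of the ball; on a deep network the gradient is recomputed at each step and the path can bend.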

Optimization-based Methods

find $η$ such that $\|η\|$ is minimized, and $f(x + η) = t$ and $x + η \in [0, 1]^{n}$.

L-BFGS

This problem can be very difficult to solve, however, so Szegedy et al. instead solve the following problem:

minimize $c \cdot \|x - x^{'}\|_{2}^{2} + loss_{F, l}(x^{'})$ such that $x^{'} \in [0, 1]^{n}$

The original attack algorithm uses L-BFGS to solve the optimization problem.

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, et al., “Intriguing properties of neural networks”, ICLR, 2013.

Line search is performed to find the constant c>0 that yields an adversarial example of minimum distance: in other words, we repeatedly solve this optimization problem for multiple values of c, adaptively updating c using bisection search or any other method for one-dimensional optimization.
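
The search over c can be sketched with bisection. Because c multiplies the distance term, a larger c favors staying close to x, so the sketch looks for the largest c whose solution is still adversarial. Everything concrete here is an assumption made for illustration: a one-dimensional input, a hypothetical model that predicts the target class when $x^{'} > 0.5$, a hinge loss chosen so the inner problem has a closed-form minimizer (standing in for the L-BFGS solve), and the premise that success is monotone in c.

```python
def solve_inner(x, c, margin=0.1):
    """Closed-form minimizer of c*(x_adv - x)**2 + max(0, 0.5 + margin - x_adv)
    over [0, 1]. Stands in for the L-BFGS solve on this toy 1-D problem."""
    kink = 0.5 + margin
    if x >= kink:
        return x  # loss is already zero, no perturbation needed
    x_adv = min(kink, x + 1.0 / (2.0 * c))
    return min(1.0, max(0.0, x_adv))


def is_adversarial(x_adv):
    """Hypothetical model: predicts the target class when the input exceeds 0.5."""
    return x_adv > 0.5


def search_c(x, c_lo=1e-3, c_hi=100.0, steps=40):
    """Bisection for the largest c whose solution is still adversarial.
    Assumes small c succeeds and large c fails (monotone in c)."""
    best = None
    for _ in range(steps):
        c = 0.5 * (c_lo + c_hi)
        x_adv = solve_inner(x, c)
        if is_adversarial(x_adv):
            best = (c, x_adv)
            c_lo = c   # success: weight the distance term more heavily
        else:
            c_hi = c   # failure: relax the distance term
    return best


best_c, best_x_adv = search_c(x=0.2)
```

On this toy problem the search converges to c near 1/0.6, where the adversarial point sits just past the decision boundary at 0.5, which is the minimum-distance adversarial example for x = 0.2.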

Carlini-Wagner Attack (CW)

Nicholas Carlini, David Wagner. Towards Evaluating the Robustness of Neural Networks. In: Proceedings of the IEEE Symposium on Security and Privacy (S&P 2017), San Jose, CA, USA, May 22-26, 2017: 39-57

Here $f(x + η) = t$ is a hard constraint, and can be relaxed by selecting an objective function $obj$ such that if $obj(x + η) \leq 0$ then $f(x + η) = t$.

Examples of objective function:

$obj(x^{'}) = loss_{t}(x^{'}) - 1 = -\log_{2}(p(x^{'})_{t}) - 1$
$obj(x^{'}) = \max(0, 0.5 - p(x^{'})_{t})$, where $p(x^{'})_{t}$ is the probability of class t on input $x^{'}$
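
Both example objectives are non-positive exactly when the probability of class t is at least 0.5, which (ties aside) makes t the predicted class. A minimal sketch that evaluates them on softmax outputs follows; the logits and the function names `obj_log` and `obj_hinge` are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def obj_log(probs, t):
    """-log2(p_t) - 1: non-positive iff p_t >= 0.5."""
    return -math.log2(probs[t]) - 1.0


def obj_hinge(probs, t):
    """max(0, 0.5 - p_t): zero iff p_t >= 0.5."""
    return max(0.0, 0.5 - probs[t])


t = 1  # target class
probs_hit = softmax([0.5, 3.0, 0.1])    # class 1 holds most of the mass
probs_miss = softmax([2.0, 1.0, 0.1])   # class 0 is predicted instead

hit_values = (obj_log(probs_hit, t), obj_hinge(probs_hit, t))      # both <= 0
miss_values = (obj_log(probs_miss, t), obj_hinge(probs_miss, t))   # both > 0
```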

The problem can now be formulated as:

find $η$ such that $\|η\|_{\infty} + c \times obj(x + η)$ is minimized, and $x + η \in [0, 1]^{n}$.

Note that the norm $\|η\|_{\infty}$ is problematic for optimization. We can replace it with a proxy function like $\sum_{i} \max(0, |η_{i}| - τ)$, where $τ$ is a bound that is decreased at every iteration.
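
The proxy penalizes only the components whose magnitude exceeds $τ$, so every offending component receives a gradient, not just the single largest one. A small sketch follows; the perturbation values and the halving schedule for $τ$ are assumptions chosen for illustration, not the schedule of any particular published attack.

```python
def linf_proxy(eta, tau):
    """Sum of the amounts by which each |eta_i| exceeds the bound tau."""
    return sum(max(0.0, abs(v) - tau) for v in eta)


# Hypothetical perturbation.
eta = [0.03, -0.10, 0.06]

penalties = []
tau = 0.08
for _ in range(3):
    penalties.append(linf_proxy(eta, tau))
    tau *= 0.5  # hypothetical schedule: tighten the bound each iteration
```

With the perturbation held fixed, the penalties grow as $τ$ shrinks (roughly 0.02, 0.08, 0.13 here); in an actual attack the optimizer would shrink the perturbation in response.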

The box constraints $x + η \in [0, 1]^{n}$ can be dealt with using Projected Gradient Descent (PGD).

Diffing Networks

The goal is to find a differencing input, i.e. an input on which two neural networks trained to learn the same function disagree.

Simply: while $f_{1}(x) = f_{2}(x)$: $x = x + ϵ \times \frac{\partial (f_{1}(x) - f_{2}(x))}{\partial x}$
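
The loop can be sketched on two toy linear classifiers. The assumptions are stated loudly: agreement is checked on the predicted labels, the gradient is taken of the difference between the two models' real-valued scores (logits), and the weights, starting point, and step size are invented. For linear models that gradient is just the difference of the weight vectors.

```python
def logit(x, net):
    """Real-valued score of a linear model given as (weights, bias)."""
    w, b = net
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def label(x, net):
    """Predicted class: 1 if the score is positive, else 0."""
    return int(logit(x, net) > 0.0)


def logit_gradient(net):
    """For a linear model the gradient of the score w.r.t. x is the weights."""
    w, _ = net
    return list(w)


def find_differencing_input(x, net1, net2, eps=0.05, max_steps=100):
    """Ascend f1(x) - f2(x) until the two predicted labels differ."""
    x = list(x)
    for _ in range(max_steps):
        if label(x, net1) != label(x, net2):
            return x
        g1, g2 = logit_gradient(net1), logit_gradient(net2)
        x = [xi + eps * (a - b) for xi, a, b in zip(x, g1, g2)]
    return None  # no disagreement found within the step budget


# Two hypothetical models that agree on x0 (both predict class 0).
net1 = ([2.0, -1.0], -0.2)
net2 = ([1.0, -1.0], -0.2)
x0 = [0.12, 0.5]

x_diff = find_differencing_input(x0, net1, net2)
```

Here the search moves only the first coordinate, since that is where the two weight vectors differ, and stops at roughly [0.37, 0.5], where the first model predicts class 1 and the second still predicts class 0.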

Black-Box Methods of Adversarial Attack

decision-based adversarial attack
score-based adversarial attack
