Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features (derived from patterns in the data distribution) that are highly predictive, yet brittle and (thus) incomprehensible to humans. After capturing these features within a theoretical framework, we establish their widespread existence in standard datasets. Finally, we present a simple setting where we can rigorously tie the phenomena we observe in practice to a misalignment between the (human-specified) notion of robustness and the inherent geometry of the data.

1 Introduction

The pervasive brittleness of deep neural networks [Sze+14; Eng+19b; HD19; Ath+18] has attracted significant attention in recent years. Particularly worrisome is the phenomenon of adversarial examples [Big+13; Sze+14], imperceptibly perturbed natural inputs that induce erroneous predictions in state-of-the-art classifiers. Previous work has proposed a variety of explanations for this phenomenon, ranging from theoretical models [Sch+18; BPR18] to arguments based on concentration of measure in high dimensions [Gil+18; MDM18; Sha+19a]. These theories, however, are often unable to fully capture behaviors we observe in practice (we discuss this further in Section 5).

More broadly, previous work in the field tends to view adversarial examples as aberrations arising either from the high dimensional nature of the input space or statistical fluctuations in the training data [GSS15; Gil+18]. From this point of view, it is natural to treat adversarial robustness as a goal that can be disentangled and pursued independently from maximizing accuracy [Mad+18; SHS19; Sug+19], either through improved standard regularization methods [TG16] or pre/post-processing of network inputs/outputs [Ues+18; CW17a; He+17].

In this work, we propose a new perspective on the phenomenon of adversarial examples. In contrast to the previous models, we cast adversarial vulnerability as a fundamental consequence of the dominant supervised learning paradigm. Specifically, we claim that:

Adversarial vulnerability is a direct result of sensitivity to well-generalizing features in the data.

Recall that we usually train classifiers to solely maximize (distributional) accuracy. Consequently, classifiers tend to use any available signal to do so, even those that look incomprehensible to humans. After all, the presence of “a tail” or “ears” is no more natural to a classifier than any other equally predictive feature. In fact, we find that standard ML datasets do admit highly predictive yet imperceptible features. We posit that our models learn to rely on these “non-robust” features, leading to adversarial perturbations that exploit this dependence.²

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our hypothesis also suggests an explanation for adversarial transferability: the phenomenon that perturbations computed for one model often transfer to other, independently trained models. Since any two models are likely to learn similar non-robust features, perturbations that manipulate such features will apply to both. Finally, this perspective establishes adversarial vulnerability as a human-centric phenomenon, since, from the standard supervised learning point of view, non-robust features can be as important as robust ones. It also suggests that approaches aiming to enhance the interpretability of a given model by enforcing “priors” for its explanation [MV15; OMS17; Smi+17] actually hide features that are “meaningful” and predictive to standard models. As such, producing human-meaningful explanations that remain faithful to underlying models cannot be pursued independently from the training of the models themselves.

To corroborate our theory, we show that it is possible to disentangle robust from non-robust features in standard image classification datasets. Specifically, given a training dataset, we construct:

1. A “robustified” version for robust classification (Figure 1a).³ We are able to effectively remove non-robust features from a dataset. Concretely, we create a training set (semantically similar to the original) on which standard training yields good robust accuracy on the original, unmodified test set. This finding establishes that adversarial vulnerability is not necessarily tied to the standard training framework, but is also a property of the dataset.

2. A “non-robust” version for standard classification (Figure 1b).³ We are also able to construct a training dataset for which the inputs are nearly identical to the originals, but all appear incorrectly labeled. In fact, the inputs in the new training set are associated to their labels only through small adversarial perturbations (and hence utilize only non-robust features). Despite the lack of any predictive human-visible information, training on this dataset yields good accuracy on the original, unmodified test set. This demonstrates that adversarial perturbations can arise from flipping features in the data that are useful for classification of correct inputs (hence not being purely aberrations).

Finally, we present a concrete classification task where the connection between adversarial examples and non-robust features can be studied rigorously. This task consists of separating Gaussian distributions, and is loosely based on the model presented in Tsipras et al. [Tsi+19], while expanding upon it in a few ways. First, adversarial vulnerability in our setting can be precisely quantified as a difference between the intrinsic data geometry and that of the adversary’s perturbation set. Second, robust training yields a classifier which utilizes a geometry corresponding to a combination of these two. Lastly, the gradients of standard models can be significantly misaligned with the inter-class direction, capturing a phenomenon that has been observed in practice [Tsi+19].

2 The Robust Features Model

We begin by developing a framework, loosely based on the setting of Tsipras et al. [Tsi+19], that enables us to rigorously refer to “robust” and “non-robust” features. In particular, we present a set of definitions which allow us to formally describe our setup, theoretical results, and empirical evidence.

Setup. We study binary classification, where input-label pairs (x, y) ∈ X × {±1} are sampled from a distribution D; the goal is to learn a classifier C : X → {±1} predicting y given x. We define a feature to be a function mapping from the input space X to real numbers, with the set of all features thus being $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$. For convenience, we assume that the features in $\mathcal{F}$ are shifted/scaled to be mean-zero and unit-variance (i.e., so that $\mathbb{E}_{(x,y)\sim\mathcal{D}}[f(x)] = 0$ and $\mathbb{E}_{(x,y)\sim\mathcal{D}}[f(x)^2] = 1$), making the following definitions scale-invariant. Note that this definition captures what we abstractly think of as features (e.g., a function capturing how “furry” an image is).

² It is worth emphasizing that while our findings demonstrate that adversarial vulnerability does arise from non-robust features, they do not preclude the possibility of adversarial vulnerability also arising from other phenomena [Nak19a]. Nevertheless, the mere existence of useful non-robust features suffices to establish that without explicitly preventing models from utilizing these features, adversarial vulnerability will persist.

³ The corresponding datasets for CIFAR-10 are publicly available at http://git.io/adv-datasets.


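[Figure 1 diagram. (a) From a training image (e.g., “frog”) we derive a robust dataset and a non-robust dataset. Training on the robust dataset gives good standard accuracy and good robust accuracy on the unmodified test set; training on the non-robust dataset gives good standard accuracy but bad robust accuracy. (b) A training image of a dog (robust features: dog; non-robust features: dog) is perturbed into an adversarial example towards “cat” (max P(cat)) and relabeled as cat (robust features: dog; non-robust features: cat). Training on such images gives good accuracy when evaluated on the original test set.]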

Figure 1: A conceptual diagram of the experiments of Section 3: (a) we disentangle features into robust and non-robust (Section 3.1), (b) we construct a dataset which appears mislabeled to humans (via adversarial examples) but results in good accuracy on the original test set (Section 3.2).

Useful, robust, and non-robust features. We now define the key concepts required for formulating our framework. To this end, we categorize features in the following manner:

• ρ-useful features: For a given distribution D, we call a feature f ρ-useful (ρ > 0) if it is correlated with the true label in expectation, that is, if

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}[y \cdot f(x)] \geq \rho. \tag{1}$$

We then define ρ_D(f) as the largest ρ for which feature f is ρ-useful under distribution D. (Note that if a feature f is negatively correlated with the label, then −f is useful instead.) Crucially, a linear classifier trained on ρ-useful features can attain non-trivial performance.

• γ-robustly useful features: Suppose we have a ρ-useful feature f (ρ_D(f) > 0). We refer to f as a robust feature (formally a γ-robustly useful feature for γ > 0) if, under adversarial perturbation (for some specified set of valid perturbations ∆), f remains γ-useful. Formally, we require that

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\inf_{\delta \in \Delta(x)} y \cdot f(x + \delta)\right] \geq \gamma. \tag{2}$$

• Useful, non-robust features: A useful, non-robust feature is a feature which is ρ-useful for some ρ bounded away from zero, but is not a γ-robust feature for any γ ≥ 0. These features help with classification in the standard setting, but may hinder accuracy in the adversarial setting, as the correlation with the label can be flipped.
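To make the three definitions concrete, the following is a minimal sketch for linear features under an ℓ2-bounded adversary. The synthetic two-coordinate data, the helper names, and the budget eps = 0.25 are all assumptions chosen for illustration; none of them come from our experiments. For a linear feature f(x) = w·x + b, the infimum in (2) has the closed form y·f(x) − ε‖w‖₂, which the code uses directly.

```python
import numpy as np

def usefulness(w, b, X, y):
    """Empirical estimate of E[y * f(x)] for the linear feature f(x) = w.x + b (eq. 1)."""
    return float(np.mean(y * (X @ w + b)))

def robust_usefulness(w, b, X, y, eps):
    """Empirical estimate of E[inf_{||delta||_2 <= eps} y * f(x + delta)] (eq. 2).

    For a linear feature the infimum is attained at delta = -eps * y * w / ||w||_2,
    so it equals y * f(x) - eps * ||w||_2.
    """
    return float(np.mean(y * (X @ w + b) - eps * np.linalg.norm(w)))

def coordinate_feature(X, i):
    """Linear feature reading coordinate i, normalized to mean 0 and variance 1."""
    w = np.zeros(X.shape[1])
    w[i] = 1.0 / X[:, i].std()
    b = -X[:, i].mean() * w[i]
    return w, b

# Synthetic data (an assumption): both coordinates predict y, at very different scales.
rng = np.random.default_rng(0)
n = 20000
y = rng.choice([-1.0, 1.0], size=n)
X = np.column_stack([
    2.0 * y + rng.normal(size=n),           # large-scale coordinate
    0.05 * y + 0.05 * rng.normal(size=n),   # tiny-scale coordinate
])

eps = 0.25  # hypothetical perturbation budget
report = {}
for name, i in [("f_large", 0), ("f_tiny", 1)]:
    w, b = coordinate_feature(X, i)
    report[name] = (usefulness(w, b, X, y), robust_usefulness(w, b, X, y, eps))
    print(name, "usefulness=%.3f robust usefulness=%.3f" % report[name])
```

In this toy setting both features are ρ-useful, but only the large-scale one stays positively correlated with the label under perturbation; the tiny-scale one is useful and non-robust.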

Classification. In our framework, a classifier C = (F, w, b) comprises a set of features $F \subseteq \mathcal{F}$, a weight vector w, and a scalar bias b. For an input x, the classifier predicts the label y as

$$C(x) = \operatorname{sgn}\left(b + \sum_{f \in F} w_f \cdot f(x)\right).$$

For convenience, we denote the set of features learned by a classifier C as $F_C$.

Standard Training. Training a classifier is performed by minimizing a loss function $L_\theta(x, y)$ over input-label pairs (x, y) from the training set (via empirical risk minimization (ERM)), where the loss decreases with the correlation between the weighted combination of the features and the label. When minimizing classification loss, no distinction exists between robust and non-robust features: the only distinguishing factor of a feature is its ρ-usefulness. Furthermore, the classifier will utilize any ρ-useful feature in F to decrease the loss of the classifier.

Robust training. In the presence of an adversary, any useful but non-robust features can be made anti-correlated with the true label, leading to adversarial vulnerability. Therefore, ERM is no longer sufficient to train classifiers that are robust, and we need to explicitly account for the effect of the adversary on the classifier. To do so, we use an adversarial loss function that can discern between robust and non-robust features [Mad+18]:

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\delta \in \Delta(x)} L_\theta(x + \delta, y)\right], \tag{3}$$

for an appropriately defined set of perturbations ∆. Since the adversary can exploit non-robust features to degrade classification accuracy, minimizing this adversarial loss [GSS15; Mad+18] can be viewed as explicitly preventing the classifier from relying on non-robust features.
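As an illustration of how the adversarial loss (3) separates the two kinds of features, the sketch below evaluates it for a linear classifier with logistic loss and an ℓ2 ball, a case where the inner maximization has a closed form (every margin drops by ε‖w‖₂). The data, the two hand-picked weight vectors, and eps = 0.3 are assumptions for illustration; this is not the training procedure used in our experiments.

```python
import numpy as np

def adversarial_loss(w, b, X, y, eps):
    """Mean of max_{||delta||_2 <= eps} log(1 + exp(-y * (w.(x + delta) + b))).

    The logistic loss decreases in the margin, so the worst-case delta lowers
    every margin by eps * ||w||_2. Setting eps = 0 gives the standard loss.
    """
    margins = y * (X @ w + b) - eps * np.linalg.norm(w)
    return float(np.mean(np.logaddexp(0.0, -margins)))

# Synthetic data (an assumption): coordinate 0 is noisy but large-scale,
# coordinate 1 is very predictive but tiny-scale.
rng = np.random.default_rng(0)
n = 20000
y = rng.choice([-1.0, 1.0], size=n)
X = np.column_stack([
    1.0 * y + 1.0 * rng.normal(size=n),
    0.3 * y + 0.1 * rng.normal(size=n),
])

eps = 0.3                      # hypothetical budget, equal to the tiny coordinate's signal
w_r = np.array([2.0, 0.0])     # hypothetical classifier using only the large-scale coordinate
w_nr = np.array([0.0, 10.0])   # hypothetical classifier using only the tiny-scale coordinate

std_r, std_nr = adversarial_loss(w_r, 0.0, X, y, 0.0), adversarial_loss(w_nr, 0.0, X, y, 0.0)
adv_r, adv_nr = adversarial_loss(w_r, 0.0, X, y, eps), adversarial_loss(w_nr, 0.0, X, y, eps)
print("standard loss:    robust-feature clf %.3f, non-robust-feature clf %.3f" % (std_r, std_nr))
print("adversarial loss: robust-feature clf %.3f, non-robust-feature clf %.3f" % (adv_r, adv_nr))
```

Under the standard loss the classifier relying on the tiny-scale coordinate looks better; under the adversarial loss the ranking flips, which is the sense in which minimizing (3) discourages reliance on non-robust features.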

Remark. We want to note that even though this framework enables us to describe and predict the outcome of our experiments, it does not capture the notion of non-robust features exactly as we intuitively might think of them. For instance, in principle, our theoretical framework would allow for useful non-robust features to arise as combinations of useful robust features and useless non-robust features [Goh19b]. These types of constructions, however, are actually precluded by our experimental results (for instance, the classifiers trained in Section 3 would not generalize). This shows that our experimental findings capture a stronger, more fine-grained statement than our formal definitions are able to express. We view bridging this gap as an interesting direction for future work.

3 Finding Robust (and Non-Robust) Features

The central premise of our proposed framework is that there exist both robust and non-robust features that constitute useful signals for standard classification. We now provide evidence in support of this hypothesis by disentangling these two sets of features (see conceptual description in Figure 1).

On one hand, we will construct a “robustified” dataset, consisting of samples that primarily contain robust features. Using such a dataset, we are able to train robust classifiers (with respect to the standard test set) using standard (i.e., non-robust) training. This demonstrates that robustness can arise by removing certain features from the dataset (as, overall, the new dataset contains less information about the original training set). Moreover, it provides evidence that adversarial vulnerability is caused by non-robust features and is not inherently tied to the standard training framework.

On the other hand, we will construct datasets where the input-label association is based purely on non-robust features (and thus the resulting dataset appears completely mislabeled to humans). We show that this dataset suffices to train a classifier with good performance on the standard test set. This indicates that natural models use non-robust features to make predictions, even in the presence of robust features. These features alone are sufficient for non-trivial generalization to natural images, indicating that they are indeed predictive, rather than artifacts of finite-sample overfitting.

3.1 Disentangling robust and non-robust features

Recall that the features a classifier learns to rely on are based purely on how useful these features are for (standard) generalization. Thus, under our conceptual framework, if we can ensure that only robust features are useful, standard training should result in a robust classifier. Unfortunately, we cannot directly manipulate the features of very complex, high-dimensional datasets. Instead, we will leverage a robust model and modify our dataset to contain only the features relevant to that model.

Conceptually, given a robust (i.e., adversarially trained [Mad+18]) model C, we aim to construct a distribution D̂R for which features used by C are as useful as they were on the original distribution D while ensuring that the rest of the features are not useful. In terms of our formal framework:

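$$\mathbb{E}_{(x,y)\sim\widehat{\mathcal{D}}_R}[f(x) \cdot y] = \begin{cases} \mathbb{E}_{(x,y)\sim\mathcal{D}}[f(x) \cdot y] & \text{if } f \in F_C \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$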

where FC again represents the set of features utilized by C.

We will construct a training set for D̂R via a one-to-one mapping x ↦ x_r from the original training set for D. In the case of a deep neural network, FC corresponds to exactly the set of activations in the penultimate layer (since these correspond to inputs to a linear classifier). To ensure that features used by the model are equally useful under both training sets, we (approximately) enforce all features in FC to have similar values for both x and x_r through the following optimization:

$$\min_{x_r} \|g(x_r) - g(x)\|_2, \tag{5}$$

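[Figure 2 panels. (a) Image grid with columns “airplane”, “ship”, “dog”, “frog”, “truck” and rows D, D̂NR, D̂R. (b) Bar chart of test accuracy on D (%), showing standard accuracy and adversarial accuracy (ε = 0.25) for: standard training using D, adversarial training using D, standard training using D̂R, and standard training using D̂NR.]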

Figure 2: (a): Random samples from our variants of the CIFAR-10 [Kri09] training set: the original training set; the robust training set D̂R, restricted to features used by a robust model; and the non- robust training set D̂NR, restricted to features relevant to a standard model (labels appear incorrect to humans). (b): Standard and robust accuracy on the CIFAR-10 test set (D) for models trained with: (i) standard training (on D) ; (ii) standard training on D̂NR; (iii) adversarial training (on D); and (iv) standard training on D̂R. Models trained on D̂R and D̂NR reflect the original models used to create them: notably, standard training on D̂R yields nontrivial robust accuracy. Results for Restricted-ImageNet [Tsi+19] are in D.8 Figure 12.

where x is the original input and g is the mapping from x to the representation layer. We optimize this objective using (normalized) gradient descent (see details in Appendix C).

Since we don’t have access to features outside FC, there is no way to ensure that the expectation in (4) is zero for all f ∉ FC. To approximate this condition, we choose the starting point of gradient descent for the optimization in (5) to be an input x0 which is drawn from D independently of the label of x (we also explore sampling x0 from noise in Appendix D.1). This choice ensures that any features present in that input will not be useful, since they are not correlated with the label in expectation over x0. The underlying assumption here is that, when performing the optimization in (5), features that are not being directly optimized (i.e., features outside FC) are not affected. We provide pseudocode for the construction in Figure 5 (Appendix C).
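The construction can be sketched in a few lines. The snippet below is an illustrative toy, not the pseudocode of Figure 5: the representation map g is a fixed random tanh layer standing in for the penultimate layer of a robust model, and the step size, iteration count, and data are assumptions. It runs normalized gradient descent on the objective in (5), starting from an input x0 drawn independently of the label of x, and pairs the result with the original label.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 8
W = rng.normal(size=(k, d)) / np.sqrt(d)  # hypothetical stand-in for a robust model

def g(x):
    """Toy representation layer: a fixed smooth map, NOT a trained network."""
    return np.tanh(W @ x)

def rep_distance(xr, x):
    """The objective of (5): ||g(xr) - g(x)||_2."""
    return float(np.linalg.norm(g(xr) - g(x)))

def robustify(x, x0, steps=300, lr=0.1):
    """Normalized gradient descent on the objective of (5), started from x0."""
    target = g(x)
    xr = x0.copy()
    best, best_dist = xr.copy(), rep_distance(xr, x)
    for _ in range(steps):
        h = np.tanh(W @ xr)
        # Gradient of the squared distance w.r.t. xr (chain rule through tanh);
        # after normalization it points the same way as the gradient of the distance.
        grad = 2.0 * W.T @ ((1.0 - h ** 2) * (h - target))
        norm = np.linalg.norm(grad)
        if norm < 1e-12:
            break
        xr = xr - lr * grad / norm
        dist = rep_distance(xr, x)
        if dist < best_dist:  # keep the best iterate seen so far
            best, best_dist = xr.copy(), dist
    return best

# Toy "training set" (an assumption).
X = rng.normal(size=(20, d))
y = rng.choice([-1, 1], size=20)

robust_set = []
for x, label in zip(X, y):
    x0 = X[rng.integers(len(X))]  # start point chosen independently of the label of x
    robust_set.append((robustify(x, x0), label))
```

Keeping the best iterate is a convenience of this sketch; with a fixed normalized step the last iterate can oscillate slightly around the minimum.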

Given the new training set for D̂R (a few random samples are visualized in Figure 2a), we train a classifier using standard (non-robust) training. We then test this classifier on the original test set (i.e. D). The results (Figure 2b) indicate that the classifier learned using the new dataset attains good accuracy in both standard and adversarial settings (see additional evaluation in Appendix D.2).⁴

As a control, we repeat this methodology using a standard (non-robust) model for C in our construction of the dataset. Sample images from the resulting “non-robust dataset” D̂NR are shown in Figure 2a; they tend to resemble the source image of the optimization x0 more than the target image x. We find that training on this dataset leads to good standard accuracy, yet yields almost no robustness (Figure 2b). We also verify that this procedure is not simply a matter of encoding the weights of the original model: we get the same results for both D̂R and D̂NR if we train with different architectures than that of the original models.

Overall, our findings corroborate the hypothesis that adversarial examples can arise from (non-robust) features of the data itself. By filtering out non-robust features from the dataset (e.g. by restricting the set of available features to those used by a robust model), one can train a significantly more robust model using standard training.

3.2 Non-robust features suffice for standard classification

The results of the previous section show that by restricting the dataset to only contain features that are used by a robust model, standard training results in classifiers that are significantly more robust.

⁴ In an attempt to explain the gap in accuracy between the model trained on D̂R and the original robust classifier C, we test distributional shift, by reporting results on the “robustified” test set in Appendix D.3.

This suggests that when training on the standard dataset, non-robust features take on a large role in the resulting learned classifier. Here we will show that this is not merely incidental. In particular, we demonstrate that non-robust features alone suffice for standard generalization, i.e., a model trained solely on non-robust features can generalize to the standard test set.

To show this, we construct a dataset where the only features that are useful for classification are non-robust features (or in terms of our formal model from Section 2, all features f that are ρ-useful are non-robust). To accomplish this, we modify each input-label pair (x,y) as follows. We select a target class t either (a) uniformly at random (hence features become uncorrelated with the labels) or (b) deterministically according to the source class (e.g. permuting the labels). Then, we add a small adversarial perturbation to x to cause it to be classified as t by a standard model:

$$x_{adv} = \arg\min_{\|x' - x\| \leq \varepsilon} L_C(x', t), \tag{6}$$

where $L_C$ is the loss under a standard (non-robust) classifier C and ε is a small constant. The resulting inputs are indistinguishable from the originals (Appendix D Figure 9); to a human observer, it thus appears that the label t assigned to the modified input is simply incorrect. The resulting input-label pairs (x_adv, t) make up the new training set (pseudocode in Appendix C Figure 6).
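A minimal sketch of this relabeling procedure follows. It is not the pseudocode of Figure 6: the “standard model” is a random linear softmax classifier, the attack is a plain ℓ2 projected gradient descent, and all names, dimensions, and the budget eps = 0.5 are assumptions for illustration. The deterministic variant uses one particular permutation, t = (y + 1) mod K.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_pgd(x, t, W, b, eps, step=0.1, iters=100):
    """Approximately solve (6) for a linear softmax model with cross-entropy loss:
    min over ||x' - x||_2 <= eps of CE(softmax(W x' + b), t)."""
    x_adv = x.copy()
    onehot = np.eye(W.shape[0])[t]
    for _ in range(iters):
        p = softmax(W @ x_adv + b)
        grad = W.T @ (p - onehot)            # gradient of the cross-entropy w.r.t. the input
        norm = np.linalg.norm(grad)
        if norm < 1e-12:
            break
        x_adv = x_adv - step * grad / norm   # descend: make class t more likely
        delta = x_adv - x
        dn = np.linalg.norm(delta)
        if dn > eps:                         # project back onto the eps-ball around x
            x_adv = x + delta * (eps / dn)
    return x_adv

def relabeled_dataset(X, y, W, b, eps, num_classes, mode, rng):
    """Build the pairs (x_adv, t); mode is 'rand' (option a) or 'det' (option b)."""
    pairs = []
    for x, label in zip(X, y):
        if mode == "rand":
            t = int(rng.integers(num_classes))   # (a) target chosen uniformly at random
        else:
            t = int((label + 1) % num_classes)   # (b) fixed permutation of the labels
        pairs.append((targeted_pgd(x, t, W, b, eps), t))
    return pairs

rng = np.random.default_rng(0)
K, d, eps = 3, 10, 0.5
W, b = rng.normal(size=(K, d)), np.zeros(K)   # hypothetical linear "standard model"
X = rng.normal(size=(5, d))
y = np.argmax(X @ W.T + b, axis=1)            # toy labels: the model's own predictions
det_set = relabeled_dataset(X, y, W, b, eps, K, "det", rng)
rand_set = relabeled_dataset(X, y, W, b, eps, K, "rand", rng)
```

Each perturbed input stays within ε of its original while the model's probability for the target class t increases, so the new label is tied to the input only through the perturbation.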

Now, since ‖xadv − x‖ is small, by definition the robust features of xadv are still correlated with class y (and not t) in expectation over the dataset. After all, humans still recognize the original class. On the other hand, since every xadv is strongly classified as t by a standard classifier, it must be that some of the non-robust features are now strongly correlated with t (in expectation).

In the case where t is chosen at random, the robust features are originally uncorrelated with the label t (in expectation), and after the perturbation can be only slightly correlated (hence being significantly less useful for classification than before).⁵ Formally, we aim to construct a dataset D̂rand where

$$\mathbb{E}_{(x,y)\sim\widehat{\mathcal{D}}_{rand}}[y \cdot f(x)] \begin{cases} > 0 & \text{if } f \text{ non-robustly useful under } \mathcal{D}, \\ \simeq 0 & \text{otherwise.} \end{cases} \tag{7}$$

In contrast, when t is chosen deterministically based on y, the robust features actually point away from the assigned label t. In particular, all of the inputs labeled with class t exhibit non-robust features correlated with t, but robust features correlated with the original class y. Thus, robust features on the original training set provide significant predictive power on the training set, but will actually hurt generalization on the standard test set. Formally, our goal is to construct D̂det such that

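$$\mathbb{E}_{(x,y)\sim\widehat{\mathcal{D}}_{det}}[y \cdot f(x)] \begin{cases} > 0 & \text{if } f \text{ non-robustly useful under } \mathcal{D}, \\ < 0 & \text{if } f \text{ robustly useful under } \mathcal{D}, \\ \in \mathbb{R} & \text{otherwise } (f \text{ not useful under } \mathcal{D}). \end{cases} \tag{8}$$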

We find that standard training on these datasets actually generalizes to the original test set, as shown in Table 1. This indicates that non-robust features are indeed useful for classification in the standard setting. Remarkably, even training on D̂det (where all the robust features are correlated with the wrong class) results in a well-generalizing classifier. This indicates that non-robust features can be picked up by models during standard training, even in the presence of predictive robust features.⁶

3.3 Transferability can arise from non-robust features

One of the most intriguing properties of adversarial examples is that they transfer across models with different architectures and independently sampled training sets [Sze+14; PMG16; CRP19]. Here, we show that this phenomenon can in fact be viewed as a natural consequence of the existence of non-robust features. Recall that, according to our main thesis, adversarial examples can arise as a result of perturbing well-generalizing, yet brittle features. Given that such features are inherent to the data distribution, different classifiers trained on independent samples from that distribution are likely to utilize similar non-robust features. Consequently, perturbations constructed by exploiting non-robust features learned by one classifier will transfer to other classifiers utilizing similar features.

⁵ Goh [Goh19a] provides an approach to quantifying this “robust feature leakage” and finds that one can obtain a (small) amount of test accuracy by leveraging robust feature leakage on D̂rand.

⁶ Additional results and analysis are in App. D.5, D.6, and D.7.

In order to illustrate and corroborate this hypothesis, we train five different architectures on the dataset generated in Section 3.2 (adversarial examples with deterministic labels) for a standard ResNet-50 [He+16]. Our hypothesis would suggest that architectures which learn better from this training set (in terms of performance on the standard test set) are more likely to learn similar non-robust features to the original classifier. Indeed, we find that the test accuracy of each architecture is predictive of how often adversarial examples transfer from the original model to standard classifiers with that architecture (Figure 3). In a similar vein, Nakkiran [Nak19a] constructs a set of adversarial perturbations that is explicitly non-transferable and finds that these perturbations cannot be used to learn a good classifier. These findings thus corroborate our hypothesis that adversarial transferability arises when models learn similar brittle features of the underlying dataset.

[Figure 3 plot. Scatter plot of transfer success rate (%) against test accuracy (%; trained on D_{y+1}, x-axis from 25 to 50), with one point per architecture: VGG-16, Inception-v3, ResNet-18, DenseNet, ResNet-50.]

Figure 3: Transfer rate of adversarial examples from a ResNet-50 to different architectures alongside test set performance of these architectures when trained on the dataset generated in Section 3.2. Architectures more susceptible to transfer attacks also performed better on the standard test set, supporting our hypothesis that adversarial transferability arises from using similar non-robust features.

Source Dataset    CIFAR-10    ImageNet_R
D                 95.3%       96.6%
D̂rand             63.3%       87.9%
D̂det              43.7%       64.4%

Table 1: Test accuracy (on D) of classifiers trained on the D, D̂rand, and D̂det training sets created using a standard (non-robust) model. For both D̂rand and D̂det, only non-robust features correspond to useful features on both the train set and D. These datasets are constructed using adversarial perturbations of x towards a class t (random for D̂rand and deterministic for D̂det); the resulting images are relabeled as t.

4 A Theoretical Framework for Studying (Non)-Robust Features

The experiments from the previous section demonstrate that the conceptual framework of robust and non-robust features is strongly predictive of the empirical behavior of state-of-the-art models on real-world datasets. In order to further strengthen our understanding of the phenomenon, we instantiate the framework in a concrete setting that allows us to theoretically study various properties of the corresponding model. Our model is similar to that of Tsipras et al. [Tsi+19] in the sense that it contains a dichotomy between robust and non-robust features, extending upon it in a few ways: a) the adversarial vulnerability can be explicitly expressed as a difference between the inherent data metric and the ℓ2 metric, b) robust learning corresponds exactly to learning a combination of these two metrics, c) the gradients of robust models align better with the adversary’s metric.

Setup. We study a simple problem of maximum likelihood classification between two Gaussian distributions. In particular, given samples (x,y) sampled from D according to

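$$y \overset{\text{u.a.r.}}{\sim} \{-1, +1\}, \qquad x \sim \mathcal{N}(y \cdot \mu_*, \Sigma_*), \tag{9}$$

our goal is to learn parameters Θ = (µ, Σ) such that

$$\Theta = \arg\min_{\mu, \Sigma} \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(x;\, y \cdot \mu, \Sigma)\right], \tag{10}$$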

where ℓ(x; µ, Σ) represents the Gaussian negative log-likelihood (NLL) function. Intuitively, we find the parameters µ, Σ which maximize the likelihood of the sampled data under the given model. Classification can be accomplished via a likelihood test: given an unlabeled sample x, we predict y as the label with the smaller NLL (i.e., the larger likelihood),

$$y = \arg\min_{y} \ell(x;\, y \cdot \mu, \Sigma) = \operatorname{sign}\left(x^\top \Sigma^{-1} \mu\right).$$
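The setup in (9) and (10) and the likelihood test can be sketched numerically. The particular values of µ∗ and Σ∗ below are assumptions chosen for illustration. The snippet samples from the model, fits the maximum likelihood parameters in closed form, and checks that picking the label with the smaller NLL coincides with the linear rule sign(xᵀΣ⁻¹µ).

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star = np.array([1.0, 0.5])                     # hypothetical true mean
Sigma_star = np.array([[2.0, 0.3], [0.3, 0.2]])    # hypothetical true covariance

def sample(n):
    """Draw (x, y) as in (9): y uniform on {-1, +1}, x ~ N(y * mu_star, Sigma_star)."""
    y = rng.choice([-1.0, 1.0], size=n)
    noise = rng.multivariate_normal(np.zeros(2), Sigma_star, size=n)
    return y[:, None] * mu_star + noise, y

def fit_mle(x, y):
    """Closed-form minimizer of (10): since y^2 = 1, mu = mean(y * x)."""
    mu = (y[:, None] * x).mean(axis=0)
    r = x - y[:, None] * mu
    return mu, r.T @ r / len(x)

def nll(x, mean, Sigma):
    """Gaussian negative log-likelihood of each row of x."""
    r = x - mean
    maha = np.einsum("ni,ij,nj->n", r, np.linalg.inv(Sigma), r)
    return 0.5 * (maha + np.linalg.slogdet(Sigma)[1] + x.shape[1] * np.log(2 * np.pi))

def predict(x, mu, Sigma):
    """Likelihood test in its linear form: sign(x^T Sigma^{-1} mu)."""
    return np.sign(x @ np.linalg.solve(Sigma, mu))

x_tr, y_tr = sample(5000)
mu_hat, Sigma_hat = fit_mle(x_tr, y_tr)

x_te, y_te = sample(5000)
pred = predict(x_te, mu_hat, Sigma_hat)
by_nll = np.where(nll(x_te, mu_hat, Sigma_hat) <= nll(x_te, -mu_hat, Sigma_hat), 1.0, -1.0)
agreement = float(np.mean(pred == by_nll))
accuracy = float(np.mean(pred == y_te))
print("agreement with NLL comparison:", agreement, "test accuracy:", accuracy)
```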

In turn, the robust analogue of this problem arises from replacing ℓ(x; y·µ, Σ) with the NLL under adversarial perturbation. The resulting robust parameters Θr can be written as

$$\Theta_r = \arg\min_{\mu, \Sigma} \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\|\delta\|_2 \leq \varepsilon} \ell(x + \delta;\, y \cdot \mu, \Sigma)\right]. \tag{11}$$

A detailed analysis appears in Appendix E; here we present a high-level overview of the results.

(1) Vulnerability from metric misalignment (non-robust features). Note that in this model, one can rigorously refer to an inner product (and thus a metric) induced by the features. In particular, one can view the learned parameters of a Gaussian Θ = (µ, Σ) as defining an inner product over the input space given by $\langle x, y \rangle_\Theta = (x - \mu)^\top \Sigma^{-1} (y - \mu)$. This in turn induces the Mahalanobis distance, which represents how a change in the input affects the features of the classifier. This metric is not necessarily aligned with the metric in which the adversary is constrained, the ℓ2-norm. Actually, we show that adversarial vulnerability arises exactly as a misalignment of these two metrics.

Theorem 1 (Adversarial vulnerability from misalignment). Consider an adversary whose perturbation is determined by the “Lagrangian penalty” form of (11), i.e.

$$\max_{\delta}\; \ell(x + \delta;\, y \cdot \mu, \Sigma) - C \cdot \|\delta\|_2,$$

where $C \geq \frac{1}{\sigma_{\min}(\Sigma_*)}$ is a constant trading off NLL minimization and the adversarial constraint (the bound on C ensures the problem is concave). Then, the adversarial loss $L_{adv}$ incurred by (µ, Σ) satisfies

$$L_{adv}(\Theta) - L(\Theta) = \operatorname{tr}\left[\left(I + (C \cdot \Sigma_* - I)^{-1}\right)^2\right] - d,$$

and, for a fixed tr(Σ∗) = k, the above is minimized by $\Sigma_* = \frac{k}{d} I$.

In fact, note that such a misalignment corresponds precisely to the existence of non-robust features— “small” changes in the adversary’s metric along certain directions can cause large changes under the notion of distance established by the parameters (illustrated in Figure 4).
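The expression in Theorem 1 is easy to evaluate numerically. The sketch below computes the gap for an isotropic and an anisotropic covariance with the same trace; the matrices and the value C = 4 are assumptions for illustration, and the code requires C strictly above 1/σ_min(Σ∗) so that the inverse exists.

```python
import numpy as np

def vulnerability_gap(Sigma_star, C):
    """Evaluate tr[(I + (C * Sigma_star - I)^{-1})^2] - d from Theorem 1."""
    d = Sigma_star.shape[0]
    if C * np.linalg.eigvalsh(Sigma_star).min() <= 1.0:
        raise ValueError("need C > 1 / sigma_min(Sigma_star) for the inverse to exist")
    I = np.eye(d)
    M = I + np.linalg.inv(C * Sigma_star - I)
    return float(np.trace(M @ M) - d)

C = 4.0                                    # hypothetical trade-off constant
iso = 2.0 * np.eye(2)                      # trace 4, aligned with the l2 metric
aniso = np.diag([3.5, 0.5])                # trace 4, one low-variance direction

theta = 0.7                                # rotating the anisotropic case changes nothing
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
aniso_rot = R @ aniso @ R.T

gap_iso = vulnerability_gap(iso, C)
gap_aniso = vulnerability_gap(aniso, C)
gap_rot = vulnerability_gap(aniso_rot, C)
print("isotropic gap:", gap_iso, "anisotropic gap:", gap_aniso)
```

In this example the anisotropic covariance incurs a much larger gap, driven almost entirely by its low-variance direction, which matches the reading of that direction as a non-robust feature.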

(2) Robust Learning. The (non-robust) maximum likelihood estimate is Θ = Θ∗, and thus the vulnerability for the standard MLE depends entirely on the data distribution. The following theorem characterizes the behaviour of the learned parameters in the robust problem (we study a slight relaxation of (11) that becomes exact exponentially fast as d → ∞, see Appendix E.3.3). In fact, we can prove (Section E.3.4) that performing (sub)gradient descent on the inner maximization (known as adversarial training [GSS15; Mad+18]) yields exactly Θr. We find that as the perturbation budget ε increases, the metric induced by the classifier mixes ℓ2 and the metric induced by the data features.

Theorem 2 (Robustly Learned Parameters). Just as in the non-robust case, µr = µ∗, i.e. the true mean is learned. For the robust covariance Σr, there exists an ε0 > 0, such that for any ε ∈ [0, ε0),

$$\Sigma_r = \frac{1}{2}\Sigma_* + \frac{1}{\lambda} \cdot I + \sqrt{\frac{1}{\lambda} \cdot \Sigma_* + \frac{1}{4}\Sigma_*^2}, \quad \text{where} \quad \Omega\left(\frac{1 + \varepsilon^{1/2}}{\varepsilon^{1/2} + \varepsilon^{3/2}}\right) \leq \lambda \leq O\left(\frac{1 + \varepsilon^{1/2}}{\varepsilon^{1/2}}\right).$$

The effect of robust optimization under an ℓ2-constrained adversary is visualized in Figure 4. As ε grows, the learned covariance becomes more aligned with identity. For instance, we can see that the classifier learns to be less sensitive in certain directions, despite their usefulness for classification.
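The closed form in Theorem 2 can be evaluated directly. In the sketch below, λ is treated as a free input, since the theorem only bounds it in terms of ε (smaller λ corresponds to a larger budget); the covariance Σ∗ and the λ values are assumptions for illustration. The condition number of Σr is used as a simple measure of how far it is from a multiple of the identity.

```python
import numpy as np

def sqrtm_psd(A):
    """Square root of a symmetric positive semi-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def robust_covariance(Sigma_star, lam):
    """Sigma_r = Sigma*/2 + I/lam + sqrt(Sigma*/lam + Sigma*^2/4), as in Theorem 2."""
    d = Sigma_star.shape[0]
    inner = Sigma_star / lam + 0.25 * Sigma_star @ Sigma_star
    return 0.5 * Sigma_star + np.eye(d) / lam + sqrtm_psd(inner)

Sigma_star = np.array([[2.0, 0.3], [0.3, 0.2]])   # hypothetical data covariance

lams = [10.0, 1.0, 0.1]                           # decreasing lam ~ increasing budget
conds = [float(np.linalg.cond(robust_covariance(Sigma_star, lam))) for lam in lams]
for lam, c in zip(lams, conds):
    print("lam=%5.1f  condition number of Sigma_r = %.3f" % (lam, c))

# As lam grows without bound (eps -> 0), Sigma_r approaches Sigma_star.
limit = robust_covariance(Sigma_star, 1e9)
```

For this example the condition number shrinks towards 1 as λ decreases, i.e., the learned covariance blends with the identity.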

(3) Gradient Interpretability. Tsipras et al. [Tsi+19] observe that gradients of robust models tend to look more semantically meaningful. It turns out that under our model, this behaviour arises as a natural consequence of Theorem 2. In particular, we show that the resulting robustly learned parameters cause the gradient of the linear classifier and the vector connecting the means of the two distributions to better align (in a worst-case sense) under the ℓ2 inner product.

Theorem 3 (Gradient alignment). Let f(x) and fr(x) be monotonic classifiers based on the linear separator induced by standard and ℓ2-robust maximum likelihood classification, respectively. The maximum angle formed between the gradient of the classifier (wrt input) and the vector connecting the classes can be smaller for the robust model:

$$\min_{\mu} \frac{\langle \mu, \nabla_x f_r(x) \rangle}{\|\mu\| \cdot \|\nabla_x f_r(x)\|} > \min_{\mu} \frac{\langle \mu, \nabla_x f(x) \rangle}{\|\mu\| \cdot \|\nabla_x f(x)\|}.$$

Figure 4 illustrates this phenomenon in the two-dimensional case, where ℓ2-robustness causes the gradient direction to become increasingly aligned with the vector between the means (µ).
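The worst-case alignment in Theorem 3 can be probed numerically in two dimensions. The sketch below assumes that the input gradient of a monotonic classifier built on the separator xᵀΣ⁻¹µ is a positive multiple of Σ⁻¹µ, and scans a grid of directions µ; the covariance Σ∗ and the value λ = 1 (treated as a free parameter, as Theorem 2 only bounds it) are assumptions for illustration.

```python
import numpy as np

def robust_covariance(Sigma_star, lam):
    """Sigma_r from Theorem 2, computed in the eigenbasis of Sigma_star."""
    vals, vecs = np.linalg.eigh(Sigma_star)
    r_vals = 0.5 * vals + 1.0 / lam + np.sqrt(vals / lam + 0.25 * vals ** 2)
    return (vecs * r_vals) @ vecs.T

def worst_case_cosine(Sigma, n_dirs=3600):
    """Minimum over unit vectors mu (2-D grid) of cos(angle(mu, Sigma^{-1} mu))."""
    angles = np.linspace(0.0, np.pi, n_dirs, endpoint=False)
    mus = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    grads = np.linalg.solve(Sigma, mus.T).T       # row i is Sigma^{-1} mu_i
    cos = np.sum(mus * grads, axis=1) / (
        np.linalg.norm(mus, axis=1) * np.linalg.norm(grads, axis=1)
    )
    return float(cos.min())

Sigma_star = np.array([[2.0, 0.3], [0.3, 0.2]])   # hypothetical data covariance
Sigma_r = robust_covariance(Sigma_star, lam=1.0)  # hypothetical lam

cos_standard = worst_case_cosine(Sigma_star)
cos_robust = worst_case_cosine(Sigma_r)
print("worst-case cosine, standard: %.3f, robust: %.3f" % (cos_standard, cos_robust))
```

A larger worst-case cosine means a smaller maximum angle between the gradient and the inter-class direction, which is what the theorem states for the robust parameters.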

[Figure 4 plots. Four panels over the axes “Feature x1” (−20 to 20) and “Feature x2” (−10 to 10), titled “Maximum likelihood estimate”, “True Parameters (ε = 0)”, “Robust parameters, ε = 1.0”, and “Robust parameters, ε = 10.0”. Legend entries: ℓ2 unit ball; Σ⁻¹-induced metric unit ball; samples from the Gaussian distributions.]

Figure 4: An empirical demonstration of the effect illustrated by Theorem 2—as the adversarial perturbation budget ε is increased, the learned mean µ remains constant, but the learned covariance “blends” with the identity matrix, effectively adding uncertainty onto the non-robust feature.

Discussion. Our analysis suggests that rather than offering quantitative classification benefits, a natural way to view the role of robust optimization is as enforcing a prior over the features learned by the classifier. In particular, training with an ℓ2-bounded adversary prevents the classifier from relying heavily on features which induce a metric dissimilar to the ℓ2 metric. The strength of the adversary then allows for a trade-off between the enforced prior, and the data-dependent features.

Robustness and accuracy. Note that in the setting described so far, robustness can be at odds with accuracy since robust training prevents us from learning the most accurate classifier (a similar conclusion is drawn in [Tsi+19]). However, we note that there are very similar settings where non-robust features manifest themselves in the same way, yet a classifier with perfect robustness and accuracy is still attainable. Concretely, consider the distributions pictured in Figure 14 in Appendix D.10. It is straightforward to show that while there are many perfectly accurate classifiers, any standard loss function will learn an accurate yet non-robust classifier. Only when robust training is employed does the classifier learn a perfectly accurate and perfectly robust decision boundary.

5 Related Work

Several models for explaining adversarial examples have been proposed in prior work, utilizing ideas ranging from finite-sample overfitting to high-dimensional statistical phenomena [Gil+18; FFF18; For+19; TG16; Sha+19a; MDM18; Sha+19b; GSS15; BPR18]. The key differentiating aspect of our model is that adversarial perturbations arise as well-generalizing, yet brittle, features, rather than statistical anomalies. In particular, adversarial vulnerability does not stem from using a specific model class or a specific training method, since standard training on the “robustified” data distribution of Section 3.1 leads to robust models. At the same time, as shown in Section 3.2, these non-robust features are sufficient to learn a good standard classifier. We discuss the connection between our model and others in detail in Appendix A and additional related work in Appendix B.

6 Conclusion

In this work, we cast the phenomenon of adversarial examples as a natural consequence of the presence of highly predictive but non-robust features in standard ML datasets. We provide support for this hypothesis by explicitly disentangling robust and non-robust features in standard datasets, as well as showing that non-robust features alone are sufficient for good generalization. Finally, we study these phenomena in more detail in a theoretical setting where we can rigorously study adversarial vulnerability, robust training, and gradient alignment.

Our findings prompt us to view adversarial examples as a fundamentally human phenomenon. In particular, we should not be surprised that classifiers exploit highly predictive features that happen to be non-robust under a human-selected notion of similarity, given such features exist in real-world datasets. In the same manner, from the perspective of interpretability, as long as models rely on these non-robust features, we cannot expect to have model explanations that are both human-meaningful and faithful to the models themselves. Overall, attaining models that are robust and interpretable will require explicitly encoding human priors into the training process.

Acknowledgements

We thank Preetum Nakkiran for suggesting the experiment of Appendix D.9 (i.e. replicating Figure 3 with targeted attacks). We are also grateful to the authors of Engstrom et al. [Eng+19a] (Chris Olah, Dan Hendrycks, Justin Gilmer, Reiichiro Nakano, Preetum Nakkiran, Gabriel Goh, Eric Wallace) for their insights and efforts replicating, extending, and discussing our experimental results.

Work supported in part by the NSF grants CCF-1553428, CCF-1563880, CNS-1413920, CNS-1815221, IIS-1447786, IIS-1607189, the Microsoft Corporation, the Intel Corporation, the MIT-IBM Watson AI Lab research grant, and an Analog Devices Fellowship.

References

[ACW18] Anish Athalye, Nicholas Carlini, and David A. Wagner. “Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples”. In: International Conference on Machine Learning (ICML). 2018.

[Ath+18] Anish Athalye et al. “Synthesizing Robust Adversarial Examples”. In: International Conference on Machine Learning (ICML). 2018.

[BCN06] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. “Model compression”. In: International Conference on Knowledge Discovery and Data Mining (KDD). 2006.

[Big+13] Battista Biggio et al. “Evasion attacks against machine learning at test time”. In: Joint European conference on machine learning and knowledge discovery in databases (ECML-KDD). 2013.

[BPR18] Sébastien Bubeck, Eric Price, and Ilya Razenshteyn. “Adversarial examples from computational constraints”. In: arXiv preprint arXiv:1805.10204. 2018.

[Car+19] Nicholas Carlini et al. “On Evaluating Adversarial Robustness”. In: arXiv preprint arXiv:1902.06705. 2019.

[CRK19] Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. “Certified adversarial robustness via randomized smoothing”. In: arXiv preprint arXiv:1902.02918. 2019.

[CRP19] Zachary Charles, Harrison Rosenberg, and Dimitris Papailiopoulos. “A Geometric Perspective on the Transferability of Adversarial Directions”. In: International Conference on Artificial Intelligence and Statistics (AISTATS). 2019.

[CW17a] Nicholas Carlini and David Wagner. “Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods”. In: Workshop on Artificial Intelligence and Security (AISec). 2017.

[CW17b] Nicholas Carlini and David Wagner. “Towards evaluating the robustness of neural networks”. In: Symposium on Security and Privacy (SP). 2017.

[Dan67] John M. Danskin. The Theory of Max-Min and its Application to Weapons Allocation Problems. 1967.

[Das+19] Constantinos Daskalakis et al. “Efficient Statistics, in High Dimensions, from Truncated Samples”. In: Foundations of Computer Science (FOCS). 2019.

[Din+19] Gavin Weiguang Ding et al. “On the Sensitivity of Adversarial Robustness to Input Data Distributions”. In: International Conference on Learning Representations. 2019.

[Eng+19a] Logan Engstrom et al. “A Discussion of ’Adversarial Examples Are Not Bugs, They Are Features’”. In: Distill (2019). https://distill.pub/2019/advex-bugs-discussion. DOI: 10.23915/distill.00019.

[Eng+19b] Logan Engstrom et al. “A Rotation and a Translation Suffice: Fooling CNNs with Simple Transformations”. In: International Conference on Machine Learning (ICML). 2019.

[FFF18] Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. “Adversarial vulnerability for any classifier”. In: Advances in Neural Information Processing Systems (NeurIPS). 2018.

[FMF16] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. “Robustness of classifiers: from adversarial to random noise”. In: Advances in Neural Information Processing Systems. 2016.

[For+19] Nic Ford et al. “Adversarial Examples Are a Natural Consequence of Test Error in Noise”. In: arXiv preprint arXiv:1901.10513. 2019.

[Fur+18] Tommaso Furlanello et al. “Born-Again Neural Networks”. In: International Conference on Machine Learning (ICML). 2018.

[Gei+19] Robert Geirhos et al. “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.” In: International Conference on Learning Representations. 2019.

[Gil+18] Justin Gilmer et al. “Adversarial spheres”. In: Workshop of International Conference on Learning Representations (ICLR). 2018.

[Goh19a] Gabriel Goh. “A Discussion of ’Adversarial Examples Are Not Bugs, They Are Features’: Robust Feature Leakage”. In: Distill (2019). https://distill.pub/2019/advex-bugs-discussion/response-2. DOI: 10.23915/distill.00019.2.

[Goh19b] Gabriel Goh. “A Discussion of ’Adversarial Examples Are Not Bugs, They Are Features’: Two Examples of Useful, Non-Robust Features”. In: Distill (2019). https://distill.pub/2019/advex-bugs-discussion/response-3. DOI: 10.23915/distill.00019.3.

[GSS15] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. “Explaining and Harnessing Adversarial Examples”. In: International Conference on Learning Representations (ICLR). 2015.

[HD19] Dan Hendrycks and Thomas G. Dietterich. “Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations”. In: International Conference on Learning Representations (ICLR). 2019.

[He+16] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: Conference on Computer Vision and Pattern Recognition (CVPR). 2016.

[He+17] Warren He et al. “Adversarial example defense: Ensembles of weak defenses are not strong”. In: USENIX Workshop on Offensive Technologies (WOOT). 2017.

[HVD14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge in a Neural Network”. In: Neural Information Processing Systems (NeurIPS) Deep Learning Workshop. 2014.

[JLT18] Saumya Jetley, Nicholas Lord, and Philip Torr. “With friends like these, who needs adversaries?” In: Advances in Neural Information Processing Systems (NeurIPS). 2018.

[Kri09] Alex Krizhevsky. “Learning Multiple Layers of Features from Tiny Images”. In: Technical report. 2009.

[KSJ19] Beomsu Kim, Junghoon Seo, and Taegyun Jeon. “Bridging Adversarial Robustness and Gradient Interpretability”. In: International Conference on Learning Representations Workshop on Safe Machine Learning (ICLR SafeML). 2019.

[Lec+19] Mathias Lecuyer et al. “Certified robustness to adversarial examples with differential privacy”. In: Symposium on Security and Privacy (SP). 2019.

[Liu+17] Yanpei Liu et al. “Delving into Transferable Adversarial Examples and Black-box Attacks”. In: International Conference on Learning Representations (ICLR). 2017.

[LM00] Beatrice Laurent and Pascal Massart. “Adaptive estimation of a quadratic functional by model selection”. In: Annals of Statistics. 2000.

[Mad+18] Aleksander Madry et al. “Towards deep learning models resistant to adversarial attacks”. In: International Conference on Learning Representations (ICLR). 2018.

[MDM18] Saeed Mahloujifar, Dimitrios I Diochnos, and Mohammad Mahmoody. “The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure”. In: AAAI Conference on Artificial Intelligence (AAAI). 2018.

[Moo+17] Seyed-Mohsen Moosavi-Dezfooli et al. “Universal adversarial perturbations”. In: Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

[MV15] Aravindh Mahendran and Andrea Vedaldi. “Understanding deep image representations by inverting them”. In: Computer Vision and Pattern Recognition (CVPR). 2015.

[Nak19a] Preetum Nakkiran. “A Discussion of ’Adversarial Examples Are Not Bugs, They Are Features’: Adversarial Examples are Just Bugs, Too”. In: Distill (2019). https://distill.pub/2019/advex-bugs-discussion/response-5. DOI: 10.23915/distill.00019.5.

[Nak19b] Preetum Nakkiran. “Adversarial robustness may be at odds with simplicity”. In: arXiv preprint arXiv:1901.00532. 2019.

[OMS17] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. “Feature Visualization”. In: Distill. 2017.

[Pap+17] Nicolas Papernot et al. “Practical black-box attacks against machine learning”. In: Asia Conference on Computer and Communications Security. 2017.

[PMG16] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. “Transferability in Machine Learning: from Phenomena to Black-box Attacks using Adversarial Samples”. In: arXiv preprint arXiv:1605.07277. 2016.

[Rec+19] Benjamin Recht et al. “Do CIFAR-10 Classifiers Generalize to CIFAR-10?” In: International Conference on Machine Learning (ICML). 2019.

[RSL18] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. “Certified defenses against adversarial examples”. In: International Conference on Learning Representations (ICLR). 2018.

[Rus+15] Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision (IJCV). 2015.

[Sch+18] Ludwig Schmidt et al. “Adversarially Robust Generalization Requires More Data”. In: Advances in Neural Information Processing Systems (NeurIPS). 2018.

[Sha+19a] Ali Shafahi et al. “Are adversarial examples inevitable?” In: International Conference on Learning Representations (ICLR). 2019.

[Sha+19b] Adi Shamir et al. “A Simple Explanation for the Existence of Adversarial Examples with Small Hamming Distance”. In: arXiv preprint arXiv:1901.10861. 2019.

[SHS19] David Stutz, Matthias Hein, and Bernt Schiele. “Disentangling Adversarial Robustness and Generalization”. In: Computer Vision and Pattern Recognition (CVPR). 2019.

[Smi+17] D. Smilkov et al. “SmoothGrad: removing noise by adding noise”. In: ICML workshop on visualization for deep learning. 2017.

[Sug+19] Arun Sai Suggala et al. “Revisiting Adversarial Risk”. In: Conference on Artificial Intelligence and Statistics (AISTATS). 2019.

[Sze+14] Christian Szegedy et al. “Intriguing properties of neural networks”. In: International Conference on Learning Representations (ICLR). 2014.

[TG16] Thomas Tanay and Lewis Griffin. “A Boundary Tilting Perspective on the Phenomenon of Adversarial Examples”. In: arXiv preprint arXiv:1608.07690. 2016.

[Tra+17] Florian Tramer et al. “The Space of Transferable Adversarial Examples”. In: arXiv preprint arXiv:1704.03453. 2017.

[Tsi+19] Dimitris Tsipras et al. “Robustness May Be at Odds with Accuracy”. In: International Conference on Learning Representations (ICLR). 2019.

[Ues+18] Jonathan Uesato et al. “Adversarial Risk and the Dangers of Evaluating Against Weak Attacks”. In: International Conference on Machine Learning (ICML). 2018.

[Wan+18] Tongzhou Wang et al. “Dataset Distillation”. In: arXiv preprint arXiv:1811.10959. 2018.

[WK18] Eric Wong and J Zico Kolter. “Provable defenses against adversarial examples via the convex outer adversarial polytope”. In: International Conference on Machine Learning (ICML). 2018.

[Xia+19] Kai Y. Xiao et al. “Training for Faster Adversarial Robustness Verification via Inducing ReLU Stability”. In: International Conference on Learning Representations (ICLR). 2019.

[Zou+18] Haosheng Zou et al. “Geometric Universality of Adversarial Examples in Deep Learning”. In: Geometry in Machine Learning ICML Workshop (GIML). 2018.