post--l2-regularization's Issues

Toy Problem

Thanks for the quick and already detailed comments, this is great!

I'm opening a new issue here to comment on issues #2 and #5. If I'm correct, both issues raise similar points: the introductory toy problem is useful but sometimes hard to follow. This might discourage some readers (especially since it is the first section of the article), so it should be simplified and/or made more intuitive. I agree with these concerns. And since the classes of images considered are arbitrary, lots of alternatives are possible.

Before I discuss this further, I should clarify the different constraints that led me to this particular choice of classes:

  1. There should be a clear feature (linearly separable) to distinguish between the two classes.
    → white/black images.
  2. There should be some intra-class variability (to make the problem more realistic).
    → random values in the intervals [-1, -0.1] and [0.1, 1] (I didn't want the two classes to intersect at 0).
  3. There should be at least one (but natural data tends to have more than one) flat direction of variation along which the classification boundary can tilt.
    → null half image.
  4. The class definitions should be valid whether the dimensionality of the problem is 2 or 200 (to show that dimensionality does not influence the phenomenon).
    → when the number of dimensions is 2, there is one random pixel and one null pixel. More dimensions lead to a random half image and a null half image.
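
To make these constraints concrete, here is a rough numpy sketch of the class definitions (the variable names are ad hoc, and this is not necessarily the exact code behind the figures):

```python
import numpy as np

def make_toy_images(n_per_class, h, seed=0):
    """Toy images of dimension 2*h: a 'random' half and a 'null' half.

    h = 1 gives 2-pixel images; h = 50 gives 100-pixel images.
    """
    rng = np.random.default_rng(seed)
    # Constraint 2: intra-class variability, with the classes separated around 0.
    dark = rng.uniform(-1.0, -0.1, size=(n_per_class, h))   # class -1
    light = rng.uniform(0.1, 1.0, size=(n_per_class, h))    # class +1
    # Constraint 3: a flat direction of variation (the null half image).
    null_half = np.zeros((n_per_class, h))
    X = np.vstack([np.hstack([dark, null_half]),
                   np.hstack([light, null_half])])
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    return X, y

X_2d, y_2d = make_toy_images(100, h=1)       # one random pixel, one null pixel
X_100d, y_100d = make_toy_images(100, h=50)  # random half image, null half image
```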

Among the modifications suggested in issues #2 and #5, some are at least partially in conflict with the constraints just described:

  • For instance, I like the idea (in issue #5) of introducing a hierarchy in the features and making them compose visually into a single object (center square/outer background), but classes defined in this way aren't compatible with 2-pixel images (what I like about the 2-pixel case is that the entire image space collapses into the deviation plane, which helps connect the two concepts).
  • Conversely, the suggestion (in issue #2) to consider only the 2-dimensional case might be too restrictive in my view: many people might find it difficult to regard 2-dimensional vectors as images (which might make the toy problem even less convincing for them). The advantage of the current class definitions is that the case with h = 1 (which is simpler) is defined in exactly the same way as the case with h = 50 (which feels more like real images), which helps connect the two.

Concerning the image representation used, I think it is important to show the actual images (so that readers can see the adversarial examples) and their projections in the deviation plane (so that readers can understand why the phenomenon occurs). I like the idea of using arrows to visualize the vectors (suggested in issue #5), but my concern is that introducing a third representation for the images might be slightly redundant, and confusing to some readers.

On the other hand, a number of the modifications suggested in issues #2, #3 and #5 are compatible with the current class definitions and can help make the toy problem more accessible:

  • Rotating the images 90° to make the comparison between original and mirror images easier.
  • Using more diagrams to help the reader pause.
  • Making the title of the section more explicit (“a toy problem”) and acknowledging concerns with the toy problem earlier.

I'll keep this issue in mind, and try to think of other ways to simplify the setup of the toy problem.

Abstract

General Comments

At the moment, your article opens with a short section giving a flavor of the content of the article:

image

I think this is a really good direction, but it is trying to be too many things at once right now. In particular, it feels like it's playing a balancing act between being a visual abstract (which I think would try to be more focused and brief) and an introduction (which would provide more context).

In doing so, it falls into a bit of an uncanny valley: It doesn't quite have enough context to really get what's going on if you come in knowing about adversarial examples but not the ideas in this article. Readers will suspend the expectation of really knowing what's going on for a cool "teaser" at the top of an article, but once it starts to feel like a full section they get confused.

My inclination is to tighten this into a visual abstract / teaser / hero, and use the background section as an introduction (which it already is and does quite well).

Specific advice:

1. Figure out the angle / core point

There are a lot of interesting things about your article:

  • A geometric way of thinking about adversarial examples
  • Adversarial examples can occur in two dimensions.
  • Conflict between minimizing training error and reducing adversarial distance.
  • L2 regularization can help control this balance.
  • Finding this helps for both linear models and simple neural networks.
  • ...

I think you want to pick out a couple that are interesting, hook the reader, and give the flavor of your article.

2. Focus

Then you want to boil it down into a few sentences and an interactive. At the very dense end, you could do something like this:

image

But I'd avoid trying to squeeze in that much text.

3. Include a bit more context if possible.

This is in tension with other things I've said, but I think you could give a bit more context. That said, I think you'll also make it a lot less important by making it shorter and cueing the reader that they don't have to understand everything that's going on here.

4. Make it interactive, not a video.

The interactive at the top is kind of a video right now. You hit play and it goes. If possible, it would be better for the reader to be able to fiddle directly with the regularization parameter.

Examples

Here's one example of what a tight visual abstract might look like. It cuts a lot of nice stuff, but you have lots of space to talk about the content you didn't mention here later in the article. The goal at this point is to interest the reader.

image

Or with a different focus:

image

Review Report 2 - Anonymous Reviewer B

The following peer review was solicited as part of the Distill review process.

The reviewer chose to remain anonymous. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer for taking the time to give such a thorough review of this article. Thoughtful and invested reviewers are essential to the success of the Distill project.

Conflicts of Interest: Reviewer disclosed no conflicts of interest.


The present submission proposes an explanation for adversarial examples based on insufficient regularization of the model. It is argued that a lack of regularization leads to a decision boundary which is skewed and hence vulnerable to adversarial examples.

In my opinion, this article suffers from a few issues. The three main issues are:

  1. It makes claims (at least suggestively) about adversarial examples that I think are wrong.
  2. Its coverage of the related work is not very good.
  3. It uses language (such as "A new angle" and "We challenge this intuition") that seems to claim substantially more novelty than there actually is. I don't mind an expository paper that is not novel, but I do mind over-claiming.

In addition, I felt the writing could be improved, but I assume that other reviewers can comment on that in more depth.

Perhaps the most serious issue is issue 1. At a high level, I am not convinced that L2 regularization / a tilted decision boundary are the primary reason for the existence of adversarial examples. While this is a natural explanation to consider, it does not match my own empirical experience; furthermore, my sense is that this explanation has occurred to others in the field as well but has been de-emphasized due to not accounting for all the facts on the ground. (I wish that I had good citations covering this, but I do not know if/where this has been discussed in detail---however, there are various empirical papers showing that L2 regularization/weight decay does not work very well in comparison to adversarial training and other techniques).

More concretely, three claims that I think are wrong are the following (a brief explanation of why I think it is wrong is given after each point):

  1. that adversarial examples are primarily due to tilting of the decision boundary --- in high dimensions, every decision boundary (tilted or not) might have trouble avoiding adversarial examples
  2. that weight decay for deep networks confers robustness to adversarial examples --- weight decay seems to be too crude an instrument, and conveys only limited robustness
  3. that "the fact that [neural networks] are often vulnerable to linear attacks of small magnitude suggests that they are strongly under-regularized" --- many authors have found that neural networks actually underfit the data during adversarial training and need additional capacity to successfully fit adversarial perturbations

Small Diagrams + Text

One nice trick that might be helpful in this article -- especially in the "Preamble" section, due to its two dimensional nature -- is combining some of the text with small diagrams.

Below, I've included a few examples I threw together really quickly. Please don't take them too seriously. They're just there to stimulate discussion. You could do the illustrations in lots of different ways! And you should probably sharpen the text a bit if you want to do this -- this kind of illustration works best if you can use sharp concise prose to accompany the diagrams.

Example: Effect of angle on adversarial examples

The effect of angle on adversarial examples (in a few different styles):

image

image

image

Example: Construction of adversarial examples

Not sure that this is really necessary as a standalone diagram but it gives an idea:

image

Benefits

  • The integrated diagrams can make a point easier to understand.

  • It can make the reader more comfortable pausing to think. Mathematical texts often require you to pause and think about an equation. Many readers -- especially those who haven't studied advanced math/CS/etc -- aren't comfortable with this and will just keep going. A diagram gives them "permission" to think.

  • It can emphasize points, making them stick in people's heads.

Review Report 1 -- Reviewer A

The following peer review was solicited as part of the Distill review process.

The reviewer chose to remain anonymous. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer for taking the time to give such a thorough review of this article. Thoughtful and invested reviewers are essential to the success of the Distill project.

Conflicts of Interest: Reviewer disclosed no conflicts of interest.


Overall Review:
I thought the plots and intuition given were nice; below I have some comments and clarifying questions. I'd be surprised if this is the first work to give a relationship between weight decay and distance from the data to the margin, but I'm not personally aware of prior work that does this. Have the authors done a literature review to see if this relationship exists in prior work? I believe this is worth publishing on Distill but am not confident in this assessment.

Detailed Comments:

  • In the first plot (and rest of paper) is adversarial distance d_adv the average distance between the training data and the boundary? Or just a single data point?

“First, it challenges conventional wisdom on generalization in machine learning.”

  • Why do adversarial examples challenge the fundamentals of generalization? Generalization deals with probability or the average behavior of a model under the data distribution. Adversarial examples deal with the worst case behavior of a model.

  • I would cite the recent Madry et al. paper on adversarial training: https://arxiv.org/abs/1706.06083

“Here, we challenge this intuition and argue instead that adversarial examples exist when the classification boundary lies close to the data manifold—independently of the image space dimension.”

  • I don’t see why this challenges the earlier intuition; the two perspectives are complementary.

“According to the new perspective, adversarial examples exist when the classification boundary approaches the data manifold in image space.”

  • One can make the same statement in the higher dimensional data space. (I like the idea of a toy image space; it just seems to me like the classification boundary approaches the data manifold in data space if and only if it approaches it in image space.)

  • How are the adversarial examples generated for the LeNet trained with weight decay (Projected gradient descent? Single step fast gradient sign method?)

  • I believe similar results for MNIST have been demonstrated with models trained with an RBF kernel (I think Ian Goodfellow has done work on this).

Toy Problem Section

I spent some time redesigning the figures for the toy problem section. I didn't get through all of them, but addressed the first few. Thought I'd post to get your thoughts before I continue. There are quite a few changes. Happy to discuss or describe in more detail if needed.

image

Jitters while sliding

Some of your text output elements should have a fixed width so that the layout of the elements doesn't jitter around when you slide through values of different lengths.

image

image

Initial thoughts on "Preamble" Section

I wanted to leave some initial thoughts on the preamble section. I think it's really interesting, but that you could make it much sharper.

Toy Problem Setup

I wonder if you could make your toy problem a bit simpler? While it's fairly straightforward, there is a bit of effort in setting up the idea of these images where half are random noise and half constant, and how to manipulate them... (For example, one way you could simplify it would be to just do it all with two-dimensional vectors -- so I and J would be two intervals on the x-axis of a plane, e_theta just (cos(T), sin(T)), and so on. That's just one possibility to throw something out there!)
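Just to make that suggestion concrete, here's a very rough numpy sketch of what I mean (the names I, J and e_theta come from the description above; everything else is made up, so don't take the details too seriously):

```python
import numpy as np

# Two classes as intervals on the x-axis of the plane, and a classifier whose
# unit weight vector e_theta = (cos(T), sin(T)) tilts away from the x-axis.
rng = np.random.default_rng(0)
I = rng.uniform(-1.0, -0.1, size=50)             # class -1 lives in [-1, -0.1]
J = rng.uniform(0.1, 1.0, size=50)               # class +1 lives in [0.1, 1]
X = np.column_stack([np.concatenate([I, J]),     # x-coordinate carries the signal
                     np.zeros(100)])             # y-coordinate is the flat direction

T = np.deg2rad(80)                               # a strongly tilted boundary
e_theta = np.array([np.cos(T), np.sin(T)])
distances = np.abs(X @ e_theta)                  # distance of each point to the boundary
print(distances.min())                           # shrinks like cos(T) as T grows
```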

Of course, the slight complexity of the present example is fine if it is serving a good cause. But I'd encourage you to ask: is it paying rent? That is, is there an upside that makes the increased complexity worth it? For example, you may want to use your present version because you want something that can be seen as both high-dimensional and low-dimensional. Or because you want to be closer to your later examples. Or something else entirely.

Section Title

"Preamble" as a title doesn't communicate much of what the section is about. I'd consider something a bit more descriptive like "A Toy Example." This helps set the readers expectations and helps a reader quickly skimming over the article understand the narrative arc.

Acknowledge Example Concerns

One concern I had while reading through this section is that the example seems unrealistic. Why would the angle be so unnatural? You address this later, but a sentence acknowledging these concerns in this section might help: "The decision boundary being tilted like this may seem unrealistic, but we'll see in the next section that it's actually a very common phenomenon."

Different ways of visualizing a two-dimensional vector

Just some brainstorming about different ways to visualize a two-dimensional vector in the first section.

One thing that tripped me up a little bit was when you had two image swatches side by side, but were then trying to compare them with another pair horizontally:

image

It made it a little difficult to cross compare the different dimensions. For instance, if you want to compare the second swatch across all examples you visually have to skip over the left swatch. If you stack them vertically, it makes that comparison a lot easier:

image

Another idea would be to use arrows to visualize the vectors, which frees you up to use color to visualize the categorization. Here blue and pink show how the vector was categorized:

image

Another thing that tripped me up a little was the fact that there was no visual hierarchy to the two dimensions. Because one of the dimensions was the one you "care" about, that one could be bigger to communicate that. Also, when both dimensions were equal-sized swatches, they didn't seem to compose as much into a single object visually. Here's another rough idea. X is how light or dark the outer background color is, Y is how pink or blue the center square is. If a little bit of blue is added to the same background color, it can be misclassified.

image

Response to Reviewers B and C

Thanks to Reviewers B and C for their comments and the time spent reviewing our submission.

Our response below is organized in two parts. We start by clarifying our goal and the way we approached it, before addressing the specific points raised by each reviewer.

Approach

Problem

The adversarial example phenomenon has attracted considerable attention and many elaborate attempts have been made at solving it – most of them leading to disappointing results. We believe that it is useful in this context to step back, focus on a simpler problem, and then progressively build up from there. Linear classification in particular appears as a sensible first step.

The existence of adversarial examples in linear classification has been known for several years, and the current dominant explanation is that they are a property of the dot product in high dimension: “adversarial examples can be explained as a property of high-dimensional dot products” [1]. This explanation has had a significant influence on the field and is still often mentioned when introducing the phenomenon (e.g. [2,3,4,5]). Yet we believe that it presents a number of limitations.

First, the formal argument is not entirely convincing: small perturbations do not provoke changes in activation that grow linearly with the dimensionality of the problem when they are considered relative to the activations themselves. Second, a number of results are not predicted by the linear explanation.

  1. High-dimensionality is not necessary for the phenomenon to occur

A 2-dimensional problem can suffer from adversarial examples, as shown by our toy problem:

[figure: adversarial examples in a 2-dimensional version of the toy problem]

  2. High-dimensionality is not sufficient for the phenomenon to occur

Some high-dimensional problems do not suffer from adversarial examples. Again, our toy problem can illustrate this, if we consider that the images are 100 pixels wide and 100 pixels high (for instance) instead of being 2-dimensional:

[figure: absence of adversarial examples in a high-dimensional version of the toy problem]

  3. Varying the dimensionality does not influence the phenomenon

More generally, varying the dimensionality of the problem does not actually influence the phenomenon.

Consider for instance the classification of 3 vs 7 MNIST digits with a linear SVM (from our arXiv paper). We do this on the standard version of the dataset and on a version where each image has been linearly interpolated to a size of 200×200 pixels (for the two datasets, we also perturbed each image with some noise to add some variability).

[figure: adversarial examples for linear SVMs on standard and upscaled MNIST 3 vs 7]

Increasing the image resolution has no influence on the perceptual magnitude of the adversarial perturbations, even if the dimensionality of the problem has been multiplied by more than 50.

  4. Varying the regularization level does influence the phenomenon

However, varying the level of regularization does influence the phenomenon. This observation was for instance made by Andrej Karpathy in this blog post:

[figure: linear classifier templates with low and high regularization, from Karpathy's blog post]

a “linear classifier with lower regularization (which leads to more noisy class weights) is easier to fool [left]. Higher regularization produces more diffuse filters and is harder to fool [right]”

This result is not readily explicable by the linear explanation of [1].

Results

To resolve the previous misconceptions and explain the phenomenon of adversarial examples in linear classification, we introduce a number of ideas – some of which we thought were novel and worth sharing.

For instance, we show that:

  • The level of regularization used controls the scaling of the loss function.
  • Varying the scaling of the loss function balances two objectives: the minimization of the error distance and the maximization of the adversarial distance.
  • The adversarial distance has a simple geometric interpretation: d_adv = ½ ||j − i|| cos(θ).

Or, in short:
L2 regularization controls the angle between the learned classifier and the nearest centroid classifier, resulting in a simple picture of the phenomenon of adversarial examples in linear classification.
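
To illustrate the geometric claim, here is a small standalone numpy check (under simplifying assumptions: symmetric classes and a boundary passing through the midpoint of the two centroids; this is not the code used for the figures in the article):

```python
import numpy as np

def adversarial_distance(w, b, x):
    """L2 distance from a point x to the hyperplane w.x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

def tilting_angle(w, i, j):
    """Angle theta between the weight vector w and the centroid difference j - i."""
    z = j - i
    cos = (w @ z) / (np.linalg.norm(w) * np.linalg.norm(z))
    return np.arccos(np.clip(cos, -1.0, 1.0))

rng = np.random.default_rng(0)
i, j = rng.normal(size=10), rng.normal(size=10)   # two class centroids
w = (j - i) + rng.normal(size=10)                 # tilted, but still positively correlated with j - i
b = -w @ (i + j) / 2                              # boundary through the midpoint of the centroids
theta = tilting_angle(w, i, j)

print(adversarial_distance(w, b, j))              # distance of centroid j to the boundary
print(0.5 * np.linalg.norm(j - i) * np.cos(theta))  # ½ ||j - i|| cos(theta): the two values match
```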

Limits

In the second part of the article, we apply our new insights to non-linear classification. We observe that weight decay still acts on the scaling of the loss function and can therefore be interpreted as a form of adversarial training. We test this hypothesis on a very simple problem (LeNet on MNIST) and show that weight decay indeed has a significant influence on the robustness of our model.

Admittedly, the phenomenon is likely to be more complicated with deeper networks on more sophisticated datasets (more non-linearities and other forms of regularization at play). But we still think that our discussion of LeNet on MNIST constitutes a small step towards a better understanding of the phenomenon: it shows that our analysis does not completely break down as soon as we introduce some non-linearities, and it shows that L2 weight decay plays a more significant role than previously suspected (at least in this simple setup). Our hope is that this result will encourage further investigations of the relation between regularization and adversarial examples in deep networks.

Specific criticisms

Reviewer B

I. It makes claims (at least suggestively) about adversarial examples that I think are wrong.

  1. that adversarial examples are primarily due to tilting of the decision boundary --- in high dimensions, every decision boundary (tilted or not) might have trouble avoiding adversarial examples

Maybe at a certain point, the problem really becomes a semantic one: what do we choose to call an adversarial example? In their seminal paper, Szegedy et al. [6] defined adversarial examples as the result of applying “an imperceptible non-random perturbation to a test image”. Adversarial perturbations are also typically difficult to interpret (as mentioned briefly in Goodfellow et al. [1]: “this perturbation is not readily recognizable to a human observer as having anything to do with the relationship between 3s and 7s.”).

These two conditions are met in linear classification when the boundary is strongly tilted:

[figure: adversarial perturbations for a strongly tilted boundary]

But not when the boundary is not tilted (i.e. for the nearest centroid classifier). In that case, the perturbations become highly visible, and easy to interpret (as a difference of centroids):

[figure: perturbations for the nearest centroid classifier (boundary not tilted)]

The first case is counter-intuitive and necessitates an explanation. The second case is hardly surprising. In my opinion, the images in the second case should not be called “adversarial examples” but should instead be considered as “fooling images”: non-digit images which are recognized as digits with high confidence (a phenomenon more akin to the one discussed by Nguyen et al. [7]). If we make this distinction, then we can reasonably claim that in linear classification, “adversarial examples are primarily due to the tilting of the decision boundary”.

  2. that weight decay for deep networks confers robustness to adversarial examples – weight decay seems to be too crude an instrument, and conveys only limited robustness

We agree that weight decay is a relatively crude instrument, and we tried to be transparent about the fact that, although we do believe that weight decay constitutes an effective regularizer against adversarial examples for LeNet on MNIST, this result is unlikely to generalize completely to state-of-the-art networks on more sophisticated datasets.

The text may still give the impression that we make unreasonable claims and we will try to improve this aspect further in our revisions.

  1. that "the fact that [neural networks] are often vulnerable to linear attacks of small magnitude suggests that they are strongly under-regularized" --- many authors have found that neural networks actually underfit the data during adversarial training and need additional capacity to successfully fit adversarial perturbations.

This is an interesting remark. This observation does indeed suggest that neural networks present some symptoms of underfitting. Yet, they also clearly show some symptoms of overfitting, as emphasized for instance by the result of Zhang et al. [8]: neural networks often converge to zero training error, even on a random labelling of the data. Perhaps these two views are compatible: neural networks may need additional capacity to successfully fit adversarial perturbations, but they may also need additional regularization to help use the additional capacity in a meaningful way.

II. Its coverage of the related work is not very good.

Our limited coverage of related work was mainly due to space considerations but I would be happy to expand further. I spent the month of November writing a literature review for my MPhil to PhD transfer report, and I've tried to keep the same writing style as for the Distill post. Some parts of it could potentially be polished and turned into a section or added as an appendix.

III. It uses language (such as "A new angle" and "We challenge this intuition") that seems to claim substantially more novelty than there actually is. I don't mind an expository paper that is not novel, but I do mind over-claiming.

I understand your concern and I do agree that over-claiming is generally harmful and should be avoided.
However, I thought that some of our ideas were indeed novel. For instance, I don't think it has been observed before that in linear classification, L2 regularization controls the angle between the learned classifier and the nearest centroid classifier (hence the phrase: “a new angle”).

Reviewer C

It's hard to tell what the overall goal of this piece is: a pedagogical explanation of the topic, or a new results paper arguing that people were wrong to reject weight decay as a defense against adversarial examples in the past?

The overall goal of the piece is to provide an explanation of the adversarial example phenomenon in linear classification (summarized in conclusion: “our main goal here was to provide a clear and intuitive picture of the phenomenon in the linear case, hopefully constituting a solid base from which to move forward.”)

As emphasized before, we do not consider this piece to be purely pedagogical: clarity is important to us, but we also introduce a number of new ideas. In particular, we show that in linear classification, L2 regularization controls the angle between the learned classifier and the nearest centroid classifier.

Whether it's meant to be pedagogical or advocacy, any discussion of weight decay should probably be in the context of label smoothing (advocated for adversarial examples by David Warde-Farley) and entropy regularization (advocated for adversarial examples by Takeru Miyato and Ekin Cubuk in separate papers).

Thank you for the references. I will try to add a comparison between these works and ours.

Just wrong: "In practice, using an appropriate level of regularization helps avoid overfitting and constitutes a simple form of adversarial training."
→ weight decay isn't the same as adversarial training. The phrase "adversarial training" originates in "Explaining and Harnessing Adversarial Examples" by Goodfellow et al. Section 5 of that paper compares adversarial training to weight decay and shows how they are different things.

It is true that weight decay and adversarial training are not the same thing, but they share some similarities. In particular, both of them can be seen as a way of attributing penalties to correctly classified images during training (by moving them across the boundary with adversarial training, and by rescaling the loss function with weight decay). This is why we call weight decay “a form of adversarial training” and use phrases such as “the type of first-order adversarial training that L2 regularization implements”.
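
The analogy can be made concrete for a linear model with the hinge loss (a small illustrative sketch of the idea, not a claim of equivalence between the two losses):

```python
import numpy as np

def hinge(margins):
    return np.maximum(0.0, 1.0 - margins)

def l2_regularized_loss(w, b, X, y, lam):
    """Weight decay shrinks ||w||, which rescales the margins y * (w.x + b):
    correctly classified points near the boundary re-enter the hinge and are penalized."""
    margins = y * (X @ w + b)
    return hinge(margins).mean() + lam * (w @ w)

def first_order_adversarial_loss(w, b, X, y, eps):
    """A worst-case L2 perturbation of norm eps reduces every margin by eps * ||w||,
    again penalizing correctly classified points lying within eps of the boundary."""
    margins = y * (X @ w + b) - eps * np.linalg.norm(w)
    return hinge(margins).mean()
```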

The first plot has no axis labels and I just don't get what's going on. Why do all the data points move around when I change the regularization? I would expect the decision boundary to move, not the data points.

Thank you for your question. This first plot is very important in my view and I realize now that I may have failed to explain it clearly. I am planning to do a number of modifications to improve this.

Let me try to explain it again here.

Consider the problem of classifying 2s versus 3s MNIST digits.

  • z is the weight vector of the nearest centroid classifier.
  • (w, b) is an SVM model for a given level of L2 regularization.

There exists a plane containing z and w: we call it the tilting plane of w.
We can find a vector n such that (z,n) is an orthonormal basis of the tilting plane of w by using the Gram-Schmidt process: n = normalize(w – (w.z) z).

We can then project the training data in (z,n) and we obtain something that looks like this:

[figure: training data projected into the tilting plane]

The horizontal direction passes through the two centroids and the vertical direction is chosen such that w belongs to the plane (the hyperplane boundary simply appears as a line). Note also that since (z,n) is an orthonormal basis, the distances in this plane are actual pixel distances.
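
In code, the projection can be sketched as follows (using the notation above, with X holding the training images as rows; this is a simplified sketch rather than the exact implementation):

```python
import numpy as np

def tilting_plane_coordinates(X, z, w):
    """Project the rows of X onto the orthonormal basis (z, n) of the tilting plane of w."""
    z = z / np.linalg.norm(z)
    n = w - (w @ z) * z          # Gram-Schmidt: remove the component of w along z
    n = n / np.linalg.norm(n)
    # Since (z, n) is orthonormal, these coordinates are actual pixel distances.
    return X @ z, X @ n
```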

Now, we obtain the first animation (and the two related ones from the section “Example: SVM on MNIST”) by repeating this process 81 times with the regularization parameter lambda varying between 10^-1 and 10^7 (the exponent increasing by steps of 0.1). Remarkably, the tilting angle between z and w varies monotonically with lambda.

To understand why the data points appear to be moving around when lambda varies, one needs to imagine the tilting plane rotating around z in the n-dimensional input space (thus showing a different section of the n-dimensional training data for each value of lambda).

This idea can be illustrated with the following simplified scenario:
z is the weight vector of the nearest centroid classifier.
w1 is the weight vector of an SVM model trained with high regularization (lambda = 10^5).
w2 is the weight vector of an SVM model trained with low regularization (lambda = 10^-1).
w_theta rotates from w1 to w2.

Using the Gram-Schmidt process again, we find the vectors e1 and e2 such that (z,e1,e2) forms an orthonormal basis of the 3D subspace containing z, w1 and w2 (and by definition, w_theta):
e1 = normalize(w1 – (w1.z) z)
e2 = normalize(w2 – (w2.z) z – (w2.e1) e1)

We then project the training data in (z,e1,e2) and consider the boundaries defined by w1 and w2 (in light grey) and the boundary defined by w_theta (in orange). Below, we observe the space from a viewpoint that is orthogonal to z and w_theta for five different values of theta:

[figure: projections of the 3D data for five values of theta]

Although the 3D data is static, the points appear to be moving around because the tilting plane and the viewpoint are rotating around z (we see how the adversarial distance decreases as w_theta tilts from w1 to w2).

In the first animation, the situation is more complex because the 81 defined weight vectors span a subspace that is more than 3-dimensional. This subspace can no longer be visualized, but the projections of the training data into the tilting plane still can.

Szegedy et al 2013 also experimented with using weight decay to resist adversarial examples and found that it didn't work. That isn't discussed in this article.

I am not sure which experiment you are referring to specifically.

For a linear classifier, Szegedy et al. actually observed a direct relation between the value of the regularization parameter lambda and the average minimum distortion:
FC(10^-4) → 0.062
FC(10^-2) → 0.1
FC(1) → 0.14
which seems to be consistent with our results. We expect lower regularization levels to lead to even smaller average minimum distortions (the values of lambda reported here are not directly comparable to ours).

The experiments advocating for weight decay don't really report clear metrics compared to other methods in the literature (I don't see anything like error rate for a fixed epsilon).

There are two conceivable ways of evaluating the robustness of a model to adversarial perturbations. As suggested above, most authors fix the size of the perturbation (epsilon) and report an error rate. Here we choose instead to fix the confidence level (median value of 0.95) and report the size of the perturbation (we find this better suited to the visual evaluation task that we focus on). Arguably, both approaches have advantages and disadvantages.
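
In pseudo-code, the two protocols look like this (the attack helpers and the scikit-learn-style predict method are placeholders, not a specific implementation):

```python
import numpy as np

def error_rate_at_fixed_eps(model, attack, X, y, eps):
    """Protocol used by most authors: fix the perturbation size eps,
    report the error rate on the perturbed images."""
    X_adv = np.stack([attack(model, x, t, eps) for x, t in zip(X, y)])
    return np.mean(model.predict(X_adv) != y)

def median_distortion_at_fixed_confidence(model, min_perturbation, X, y, conf=0.95):
    """Protocol used here: fix the confidence level reached by the attack,
    report the median L2 norm of the smallest perturbation achieving it."""
    norms = [np.linalg.norm(min_perturbation(model, x, t, conf))
             for x, t in zip(X, y)]
    return np.median(norms)
```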

One more thought: they really should use SVHN instead of MNIST. MNIST is basically solved for norm-constrained adversarial examples now. There's a trivial solution where you just threshold each pixel at 0.5 into a binary value, and the latest training algorithms are able to discover this solution. SVHN is not a lot more computationally demanding than MNIST but doesn't have this trivial solution.

We do have some results with a Network in Network architecture trained on SVHN. Overall, they suggest that weight decay does play a role: the minimum distortion tends to be higher and more meaningful for the network trained with higher weight decay.

weight decay = 0, test error = 8.1%

[figure: adversarial examples for the NiN trained without weight decay]

weight decay = 0.005, test error = 7.1%

[figure: adversarial examples for the NiN trained with weight decay 0.005]

However, it is difficult to know exactly what is going on there:

  1. NiNs are typically trained with additional explicit regularizers (such as Dropout or Batch norm) which may interfere with our study of weight decay.
  2. Even without other explicit regularizers, stochastic gradient descent may act as an implicit regularizer [8]. Its dynamics are complex and can be influenced by several parameters (learning rate? momentum? batch size? weight initializations? early stopping? etc.).

For these reasons, LeNet on MNIST appeared as a simpler model to study as a first step.

In fact, what puzzles me most about the results with the NiN on SVHN is that even without weight decay, the adversarial perturbations tend to be much larger than those affecting models trained on ImageNet. In future work, I am planning to study in more detail under what conditions neural networks become more vulnerable to adversarial perturbations.

[1] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
[2] Kereliuk, Corey, Bob L. Sturm, and Jan Larsen. "Deep learning and music adversaries." IEEE Transactions on Multimedia 17.11 (2015): 2059-2071.
[3] Warde-Farley, David, and Ian Goodfellow. "11 Adversarial Perturbations of Deep Neural Networks." Perturbations, Optimization, and Statistics (2016): 311.
[4] Nayebi, Aran, and Surya Ganguli. "Biologically inspired protection of deep networks from adversarial attacks." arXiv preprint arXiv:1703.09202 (2017).
[5] Anonymous. "Thermometer Encoding: One Hot Way To Resist Adversarial Examples." International Conference on Learning Representations (2018). Under review.
[6] Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).
[7] Nguyen, Anh, Jason Yosinski, and Jeff Clune. "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[8] Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).

Review Report 3 - Anonymous Reviewer C

The following peer review was solicited as part of the Distill review process.

The reviewer chose to remain anonymous. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer for taking the time to give such a thorough review of this article. Thoughtful and invested reviewers are essential to the success of the Distill project.

Conflicts of Interest: Reviewer disclosed no conflicts of interest.


High level:

  • I'm skeptical whether this work is interesting enough for Distill. It is based on work that has been available on the web for over a year and has attracted little interest. If this was a conference reviewing system I think this paper would be rejected for low interest / low novelty at the least.
  • It's hard to tell what the overall goal of this piece is: a pedagogical explanation of the topic, or a new results paper arguing that people were wrong to reject weight decay as a defense against adversarial examples in the past?
  • If this is meant to be a pedagogy piece, it probably shouldn't argue so hard for weight decay in particular.
  • If this is meant to be a pedagogy piece, I actually find the plots pretty confusing, even though I'm already familiar with the topic. I can reverse engineer them, but it takes effort.
  • If this is meant to be a new results piece advocating weight decay as a new method, the evaluation should be more rigorous (the article itself calls the results on that front "inconclusive").
  • Whether it's meant to be pedagogical or advocacy, any discussion of weight decay should probably be in the context of label smoothing (advocated for adversarial examples by David Warde-Farley) and entropy regularization (advocated for adversarial examples by Takeru Miyato and Ekin Cubuk in separate papers). These are other simple regularization methods that many people agree are helpful for adversarial examples in practice, they have simple interpretations for linear models, and those interpretations are more meaningful for deep nets (because they come closer to regularizing in function space rather than parameter space; parameter space weight decay of a deep net is quite different from parameter space weight decay of a linear model, but function space regularization of both model classes is similar).

I guess if this were a regular journal I would probably recommend "reject" or "resubmit with major revision" after deciding whether it's a pedagogy piece or a weight decay advocacy paper.

General suggestions:

  • Maybe give the figures names, so people can refer to them specifically?

Just wrong:
"In practice, using an appropriate level of regularization helps avoid overfitting and constitutes a simple form of adversarial training."
-> weight decay isn't the same as adversarial training. The phrase "adversarial training" originates in "Explaining and Harnessing Adversarial Examples" by Goodfellow et al. Section 5 of that paper compares adversarial training to weight decay and shows how they are different things.

Medium size problems:

  • The first plot has no axis labels and I just don't get what's going on. Why do all the data points move around when I change the regularization? I would expect the decision boundary to move, not the data points.
  • The first plot doesn't tell me the amount of weight decay I'm getting as I move the slider.
  • It seems unfair to quote 'Goodfellow et al. [2] have observed that adversarial training is "somewhat similar to L1 regularization" in the linear case' without also discussing the differences they report.
  • Szegedy et al 2013 also experimented with using weight decay to resist adversarial examples and found that it didn't work. That isn't discussed in this article.
  • The experiments advocating for weight decay don't really report clear metrics compared to other methods in the literature (I don't see anything like error rate for a fixed epsilon).

Nits:
"In linear models and small neural nets, L2 regularization can be
understood as a balancing mechanism between two objectives: minimizing
the training error
errtrainerrtrain and maximizing the average distance between the data
and the boundary dadvdadv." -> maybe unpack this into a few more
sentences. When I first read it, I thought the boundary was called
d_adv. It took me a few reads to get that the d_adv was the distance.

Typos:
" strongly titled." -> "strongly tilted"

One more thought: they really should use SVHN instead of MNIST. MNIST is basically solved for norm-constrained adversarial examples now. There's a trivial solution where you just threshold each pixel at 0.5 into a binary value, and the latest training algorithms are able to discover this solution. SVHN is not a lot more computationally demanding than MNIST but doesn't have this trivial solution.
