Comments (9)

fchollet commented:

Benchmarking on MNIST gives me better results with glorot_uniform compared to glorot_normal. glorot_uniform also appears to perform about as well as lecun_uniform.

glorot_uniform:

Train on 37800 samples, validate on 4200 samples
Epoch 0
loss: 0.0257 - acc.: 0.7500 - val. loss: 0.0123 - val. acc.: 0.9348
Epoch 1
loss: 0.0092 - acc.: 1.0000 - val. loss: 0.0081 - val. acc.: 0.9512
Epoch 2
loss: 0.0112 - acc.: 0.8750 - val. loss: 0.0070 - val. acc.: 0.9590
Epoch 3
loss: 0.0031 - acc.: 1.0000 - val. loss: 0.0061 - val. acc.: 0.9631
Epoch 4
loss: 0.0029 - acc.: 1.0000 - val. loss: 0.0054 - val. acc.: 0.9664
Epoch 5
loss: 0.0027 - acc.: 1.0000 - val. loss: 0.0051 - val. acc.: 0.9674
Epoch 6
loss: 0.0047 - acc.: 1.0000 - val. loss: 0.0050 - val. acc.: 0.9657
Epoch 7
loss: 0.0012 - acc.: 1.0000 - val. loss: 0.0050 - val. acc.: 0.9679
Epoch 8
loss: 0.0119 - acc.: 0.8750 - val. loss: 0.0048 - val. acc.: 0.9700
Epoch 9
loss: 0.0011 - acc.: 1.0000 - val. loss: 0.0045 - val. acc.: 0.9712

glorot_normal:

Train on 37800 samples, validate on 4200 samples
Epoch 0
loss: 0.0208 - acc.: 0.8750 - val. loss: 0.0127 - val. acc.: 0.9367
Epoch 1
loss: 0.0113 - acc.: 1.0000 - val. loss: 0.0088 - val. acc.: 0.9490
Epoch 2
loss: 0.0045 - acc.: 1.0000 - val. loss: 0.0076 - val. acc.: 0.9548
Epoch 3
loss: 0.0245 - acc.: 0.7500 - val. loss: 0.0070 - val. acc.: 0.9598
Epoch 4
loss: 0.0090 - acc.: 0.8750 - val. loss: 0.0062 - val. acc.: 0.9643
Epoch 5
loss: 0.0032 - acc.: 1.0000 - val. loss: 0.0057 - val. acc.: 0.9660
Epoch 6
loss: 0.0009 - acc.: 1.0000 - val. loss: 0.0058 - val. acc.: 0.9650
Epoch 7
loss: 0.0032 - acc.: 1.0000 - val. loss: 0.0057 - val. acc.: 0.9643
Epoch 8
loss: 0.0155 - acc.: 0.8750 - val. loss: 0.0053 - val. acc.: 0.9679
Epoch 9
loss: 0.0053 - acc.: 1.0000 - val. loss: 0.0052 - val. acc.: 0.9679

Code is at https://www.kaggle.com/users/123235/fchollet/digit-recognizer/simple-deep-mlp-with-keras
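For reference, a minimal sketch of the same comparison written against the current Keras API follows (this is not the linked Kaggle script; the layer sizes, optimizer, and epoch count are assumptions chosen to mirror the logs above):

# Hypothetical re-creation of the benchmark: one MLP trained twice on MNIST,
# once per initializer, then compared on validation accuracy.
import numpy as np
from tensorflow import keras

(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

def build_model(init):
    # Same architecture for both runs; only the kernel initializer changes.
    return keras.Sequential([
        keras.Input(shape=(784,)),
        keras.layers.Dense(128, activation="relu", kernel_initializer=init),
        keras.layers.Dense(10, activation="softmax", kernel_initializer=init),
    ])

for init in ("glorot_uniform", "glorot_normal"):
    model = build_model(init)
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(x_train, y_train, epochs=10, batch_size=128,
                        validation_split=0.1, verbose=0)
    print(init, "final val. acc.:", history.history["val_accuracy"][-1])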

untom commented:

The main assumption behind the Glorot initialization is that the variance of the gradients should be the same in each layer. In eq. 12 of the paper you can see that to achieve this, the variance of the weights in each layer should be 2 / (fan_in + fan_out). You could get there by initializing your weights either directly from a normal with sigma^2 = 2 / (fan_in + fan_out), or from the uniform distribution given in eq. 16 of the paper (i.e., the one with sqrt(6) in the numerator). If you calculate the variance of the latter, you will see that it is again 2 / (fan_in + fan_out).

So in terms of the Glorot paper, Normal(0, 2/(fan_in + fan_out)) achieves the same thing as Uniform(-sqrt(6)/sqrt(fan_in + fan_out), sqrt(6)/sqrt(fan_in + fan_out)), namely that the variance of the gradients is initially approximately the same in each layer.

So the only question remaining is whether one should use a normal or a uniform distribution. This is a personal choice. People around Bengio's group have always preferred initializing from a uniform distribution (hence they used that in the paper), while e.g. Hinton always advocates using a normal distribution. I personally think that using a normal is the more natural thing, since at the end of training, the weight distribution always looks approximately Gaussian anyway (unless you use e.g. L1 regularization), no matter what you started with. So my reasoning is that with a normal, you at least have the correct prior. Which is why my patch to keras used the normal, as well.

To be honest, I do not think it makes much of a difference. It certainly does not make a difference in terms of how Glorot derived his initialization.
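As a quick numerical sanity check of that equivalence (a NumPy sketch added for illustration, not code from the thread), sampling from both distributions gives the same variance:

import numpy as np

fan_in, fan_out = 784, 256
target_var = 2.0 / (fan_in + fan_out)        # Glorot's target weight variance
limit = np.sqrt(6.0 / (fan_in + fan_out))    # bound from eq. 16 of the paper

rng = np.random.default_rng(0)
w_normal = rng.normal(0.0, np.sqrt(target_var), size=1_000_000)
w_uniform = rng.uniform(-limit, limit, size=1_000_000)

# Var[U(-a, a)] = a^2 / 3 = 2 / (fan_in + fan_out), so both match the target.
print(target_var, w_normal.var(), w_uniform.var())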

fchollet commented:

After some research, uniform seems to be more common across other libraries.

I don't really buy the Gaussian argument. A random normal array is less likely to be a good approximation of another random normal array than a constant or a small-scale random uniform array. Yes, the distribution of values will be the same in aggregate, but the mean absolute error per weight will be larger.

The point of a good initialization is that your weights 1) make learning possible (avoiding pathological cases with no gradient or exploding gradients), and 2) are as close as possible to the final learned weights. Normal distributions seem less likely to satisfy 2) than uniform distributions.

I will add glorot_uniform and he_uniform, to match what other DL libraries are doing. I think we should also make glorot_uniform the default initialization for the layers in which uniform is currently used as default.
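For concreteness, here is a minimal sketch of what glorot_uniform and he_uniform compute (assumed NumPy re-implementations for illustration, not the actual Keras code):

import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    # Uniform(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)).
    rng = rng if rng is not None else np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_uniform(fan_in, fan_out, rng=None):
    # Uniform(-limit, limit) with limit = sqrt(6 / fan_in), after He et al. 2015.
    rng = rng if rng is not None else np.random.default_rng()
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(784, 128)  # e.g. the first hidden layer of an MNIST MLP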

fchollet commented:

Maybe we need both. I've seen this scheme around in various forms (sometimes normal, sometimes uniform) and under various names (Caffe seems to call it xavier?). And it's been around since long before Glorot 2010, tbh.

@untom thoughts on this?

capybaralet commented:

Where was it used before that paper?

fchollet commented:

scale / sqrt(nb_units) has been around since the late 80s, with scale values generally between 1 and 2. Sometimes you'd use nb_units = fan_in, sometimes nb_units = fan_out, sometimes the sum of the two. For the latter you'd use higher scale values, for balance. It's not rocket science.

It hasn't necessarily been written down very often (earliest might have been LeCun98), but it has been part of the old toolbox of NN "tricks" for a long time.
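As a rough sketch of that old rule of thumb (the function name and the choice of nb_units = fan_in here are assumptions for illustration only):

import numpy as np

def classic_init(fan_in, fan_out, scale=1.0, rng=None):
    # Old-school heuristic: weights with standard deviation scale / sqrt(nb_units),
    # here using nb_units = fan_in; scale typically somewhere between 1 and 2.
    rng = rng if rng is not None else np.random.default_rng()
    return rng.normal(0.0, scale / np.sqrt(fan_in), size=(fan_in, fan_out))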

capybaralet commented:

So the reasoning I have heard for initializing from a uniform instead of a normal is that you don't get outliers. Of course, a normal distribution shouldn't have extreme outliers, but it still has outliers. I'm not sure why that would matter, but it seems somewhat intuitive to me.

The argument that it ends up being Gaussian doesn't seem strong to me... you don't know which weights should be large before training, and if it ends up Gaussian anyway, then why does it matter?

Can you give a reference for Hinton advocating normals?

wrt the code, I just think it would be good to actually use the initialization from the paper you reference, so people aren't surprised.

untom commented:

That's an interesting reason. But thinking about it, are outliers in the weight sizes necessarily a bad thing? As long as the gradients don't explode or collapse overall (which is what Glorot's variance derivation is about), you're still okay, aren't you? And as long as learning can proceed, your weights will be adjusted.

The argument that it ends up being Gaussian doesn't seem strong to me... you don't know which weights should be large before training, and if it ends up Gaussian anyway, then why does it matter?

The whole point of a "good initialization" is having one that somehow aids learning. You could always argue "if you end up with the same solution in the end, what does it matter how I initialize?". Like I said, picking a prior that matches the posterior just makes sense to me, but I'll admit the argument isn't strong. However, initializing each unit from a Gaussian also allows each unit to focus more strongly on a given combination of inputs (since fewer of the weights will be large, and the combination of large weights will likely be different for each unit). Of course you'll end up with some units that have nonsensical combinations, and those will take longer to recover than if they had been initialized uniformly.

Every paper from Hinton's group that mentions how they initialized weights uses Gaussians. Off the top of my head, a specific example from Hinton himself is his RBM training guide (https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf, section 8). But if you go through papers from his group, several mention Gaussian initialization and none mention uniform distributions.

In any case, like I said, I don't think it will make much difference in the end. So if you think uniform is more user-friendly, then maybe that's the best thing to do. (Personally, I found the normal easier to understand because it doesn't hide the 2/(fan_in + fan_out), which was Glorot's main result.) But yeah, if people expect a uniform, then of course it's confusing.

untom commented:

Interesting, thanks for the tests! :)
