Comments (9)
Benchmarking on MNIST gives me better results with glorot_uniform compared to glorot_normal. glorot_uniform also appears to perform about as well as lecun_uniform.
glorot_uniform:
Train on 37800 samples, validate on 4200 samples
Epoch 0
loss: 0.0257 - acc.: 0.7500 - val. loss: 0.0123 - val. acc.: 0.9348
Epoch 1
loss: 0.0092 - acc.: 1.0000 - val. loss: 0.0081 - val. acc.: 0.9512
Epoch 2
loss: 0.0112 - acc.: 0.8750 - val. loss: 0.0070 - val. acc.: 0.9590
Epoch 3
loss: 0.0031 - acc.: 1.0000 - val. loss: 0.0061 - val. acc.: 0.9631
Epoch 4
loss: 0.0029 - acc.: 1.0000 - val. loss: 0.0054 - val. acc.: 0.9664
Epoch 5
loss: 0.0027 - acc.: 1.0000 - val. loss: 0.0051 - val. acc.: 0.9674
Epoch 6
loss: 0.0047 - acc.: 1.0000 - val. loss: 0.0050 - val. acc.: 0.9657
Epoch 7
loss: 0.0012 - acc.: 1.0000 - val. loss: 0.0050 - val. acc.: 0.9679
Epoch 8
loss: 0.0119 - acc.: 0.8750 - val. loss: 0.0048 - val. acc.: 0.9700
Epoch 9
loss: 0.0011 - acc.: 1.0000 - val. loss: 0.0045 - val. acc.: 0.9712
glorot_normal:
Train on 37800 samples, validate on 4200 samples
Epoch 0
loss: 0.0208 - acc.: 0.8750 - val. loss: 0.0127 - val. acc.: 0.9367
Epoch 1
loss: 0.0113 - acc.: 1.0000 - val. loss: 0.0088 - val. acc.: 0.9490
Epoch 2
loss: 0.0045 - acc.: 1.0000 - val. loss: 0.0076 - val. acc.: 0.9548
Epoch 3
loss: 0.0245 - acc.: 0.7500 - val. loss: 0.0070 - val. acc.: 0.9598
Epoch 4
loss: 0.0090 - acc.: 0.8750 - val. loss: 0.0062 - val. acc.: 0.9643
Epoch 5
loss: 0.0032 - acc.: 1.0000 - val. loss: 0.0057 - val. acc.: 0.9660
Epoch 6
loss: 0.0009 - acc.: 1.0000 - val. loss: 0.0058 - val. acc.: 0.9650
Epoch 7
loss: 0.0032 - acc.: 1.0000 - val. loss: 0.0057 - val. acc.: 0.9643
Epoch 8
loss: 0.0155 - acc.: 0.8750 - val. loss: 0.0053 - val. acc.: 0.9679
Epoch 9
loss: 0.0053 - acc.: 1.0000 - val. loss: 0.0052 - val. acc.: 0.9679
Code is at https://www.kaggle.com/users/123235/fchollet/digit-recognizer/simple-deep-mlp-with-keras
from keras.
The main assumption behind the Glorot initialization is that the variance of the gradients should be the same in each layer. In eq. 12 of the paper you can see that to achieve this, the variance of the weights should be 2 / (fan-in + fan-out). You can get this either by initializing directly from a normal with sigma^2 = 2 / (fan-in + fan-out), or from the uniform distribution given in eq. 16 of the paper (i.e., the one with sqrt(6) in the numerator). If you calculate the variance of the latter, you will see that it is again 2/(fi+fo).
So in terms of the Glorot paper, Normal(0, 2/(fi+fo)) achieves the same thing as Uniform(-sqrt(6)/sqrt(fi+fo), sqrt(6)/sqrt(fi+fo)), namely that the variance of the gradients is initially approximately the same in each layer.
So the only question remaining is whether one should use a normal or a uniform distribution. This is a personal choice. People around Bengio's group have always preferred initializing from a uniform distribution (hence they used that in the paper), while e.g. Hinton always advocates using a normal distribution. I personally think that using a normal is the more natural thing, since at the end of training, the weight distribution always looks approximately Gaussian anyway (unless you use e.g. L1 regularization), no matter what you started with. So my reasoning is that with a normal, you at least have the correct prior. Which is why my patch to keras used the normal, as well.
To be honest, I do not think it makes much of a difference. It certainly does not make a difference in terms of how Glorot derived his initialization.
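A quick numerical check of that equivalence (a sketch with NumPy; the layer sizes here are made-up examples):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 128  # made-up layer sizes

# Target variance from eq. 12: 2 / (fan_in + fan_out)
target_var = 2.0 / (fan_in + fan_out)

# Normal parameterization: sigma^2 = 2 / (fan_in + fan_out)
w_normal = rng.normal(0.0, np.sqrt(target_var), size=1_000_000)

# Uniform parameterization from eq. 16: limit = sqrt(6) / sqrt(fan_in + fan_out)
limit = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
w_uniform = rng.uniform(-limit, limit, size=1_000_000)

# Var(Uniform(-a, a)) = a^2 / 3 = 2 / (fan_in + fan_out), same as the normal
print(target_var)
print(w_normal.var())   # close to target_var
print(w_uniform.var())  # close to target_var
```

Both samples come out with essentially the same empirical variance, which is all the derivation asks for.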
After some research, uniform seems to be more common across other libraries.
I don't really buy the gaussian argument. A random normal array is less likely to be a good approximation of another random normal array than a constant or random uniform (small scale) array. Yes, the distribution of values will be the same in aggregate, but the mean absolute error per weight will be larger.
The point of a good initialization is one where your weights 1) make learning possible (avoid pathological cases with no gradient or exploding gradients), and 2) are the closest possible to the final learned weights. Normal distributions seem less likely to fit 2) compared to uniform distributions.
I will add glorot_uniform and he_uniform, to match what other DL libraries are doing. I think we should also make glorot_uniform the default initialization for the layers in which uniform is currently used as default.
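For reference, a minimal sketch of what those two initializers compute for a 2-D weight matrix (not the actual Keras code; the function names and NumPy-based signatures here are illustrative):

```python
import numpy as np

def glorot_uniform(shape, rng=None):
    """Uniform(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    rng = rng or np.random.default_rng()
    fan_in, fan_out = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)

def he_uniform(shape, rng=None):
    """Uniform(-limit, limit) with limit = sqrt(6 / fan_in), for ReLU nets."""
    rng = rng or np.random.default_rng()
    fan_in, _ = shape
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=shape)

W = glorot_uniform((784, 128), rng=np.random.default_rng(0))
print(W.var())  # sample variance should be close to 2 / (784 + 128)
```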
Maybe we need both. I've seen this scheme around in various forms (sometimes normal, sometimes uniform) and under various names (Caffe seems to call it xavier?). And it's been around far prior to Glorot 2010, tbh.
@untom thoughts on this?
where was it used before that paper?
scale / sqrt(nb_units) has been around since the late 80s, with scale values generally between 1 and 2. Sometimes you'd use nb_units = fan_in, sometimes nb_units = fan_out, sometimes the sum of the two. For the latter you'd use high scale values, for balance. It's not rocket science.
It hasn't necessarily been written down very often (earliest might have been LeCun98), but it has been part of the old toolbox of NN "tricks" for a long time.
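That old heuristic could be sketched like this (illustrative only; the function name and `nb_units` argument are made up here):

```python
import numpy as np

def classic_init(shape, scale=1.0, nb_units="fan_in", rng=None):
    # scale / sqrt(nb_units), the pre-Glorot rule of thumb:
    # nb_units can be fan_in, fan_out, or their sum (use a higher
    # scale with the sum, for balance); scale is typically 1 to 2.
    rng = rng or np.random.default_rng()
    fan_in, fan_out = shape
    n = {"fan_in": fan_in, "fan_out": fan_out,
         "sum": fan_in + fan_out}[nb_units]
    return rng.normal(0.0, scale / np.sqrt(n), size=shape)

W = classic_init((784, 128), scale=1.0, nb_units="fan_in",
                 rng=np.random.default_rng(0))
print(W.std())  # close to 1 / sqrt(784) ~= 0.0357
```

With nb_units = fan_in + fan_out and scale = sqrt(2), this reduces to exactly the Glorot normal variant discussed above.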
So the reasoning I have heard for initializing from uniform instead of normal is that you don't have outliers. Of course, a normal dist shouldn't have extreme outliers, but it still has outliers. I'm not sure why that would matter, but it seems somewhat intuitive to me.
The argument that it ends up being Gaussian doesn't seem strong to me... you don't know which weights should be large before training, and if it ends up Gaussian anyways, then why does it matter?
Can you give a reference for Hinton advocating normals?
wrt the code, I just think it would be good to actually use the initialization from the paper you reference, so people aren't surprised.
That's an interesting reason. But thinking about it, are outliers in the weight sizes necessarily a bad thing? As long as overall the gradients don't explode/collapse (which is what the variance derivation of Glorot is about) you're still okay, aren't you? And as long as learning can proceed, your weights will be adjusted.
The argument that it ends up being Gaussian doesn't seem strong to me... you don't know which weights should be large before training, and if it ends up Gaussian anyways, then why does it matter?
The whole point of a "good initialization" is having one that somehow aids learning. You could always argue "if you end up with the same solution in the end, what does it matter how I initialize?". Like I said, picking a prior that matches the posterior just makes sense to me, but I'll admit the argument isn't strong. However, having each unit initialized by a Gaussian also allows each unit to focus more strongly on a given combination of inputs (since fewer of the weights will be large, and likely the combination of large weights will be different for each unit). Of course you'll end up with units that have nonsensical combinations, and those will take longer to recover than if they had been initialized uniformly.
Every paper from Hinton's group that mentions how they initialized weights uses Gaussians. Off the top of my head, a specific example from Hinton himself comes from his RBM training guide (https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf, section 8). But if you go through papers from his group, several mention Gaussian initialization; none mention uniform distributions.
In any case, like I said, I don't think it will make much difference in the end. So if you think uniform is more user-friendly, then maybe that's the best thing to do. (Personally, I found the normal easier to understand because it doesn't hide the 2/(fi+fo), which was Glorot's main result.) But yeah, if people expect a uniform, then of course it's confusing.
Interesting, thanks for the tests! :)