Comments (6)
Gradients can easily exceed 65504 when predicting large images. I have already switched to my own implementation of loss scaling that allows values below 1, and that works great. I am just posting the issue here in case the Keras team wants to fix it for others.
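A minimal sketch of what such a dynamic loss scaler could look like, in plain NumPy. The class name, the growth interval, and the update rule below are illustrative assumptions, not the poster's actual implementation or Keras code; the key difference from Keras is simply that the backoff step has no floor at 1:

```python
import numpy as np

class DynamicLossScale:
    """Dynamic loss scaling sketch that, unlike Keras' LossScaleOptimizer,
    lets the scale drop below 1."""

    def __init__(self, initial_scale=2.0 ** 15, growth_interval=2000):
        self.scale = initial_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads):
        """Halve the scale on nonfinite gradients; double it after
        growth_interval consecutive finite steps. Returns whether the
        current step should be applied."""
        if all(np.isfinite(g).all() for g in grads):
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= 2.0
                self._good_steps = 0
            return True   # gradients are finite: apply this step
        self.scale /= 2.0  # no max(scale, 1) clamp: scale may go below 1
        self._good_steps = 0
        return False       # gradients overflowed: skip this step
```

In use, the loss is multiplied by `scale` before backpropagation and the resulting gradients are divided by `scale` before the optimizer applies them; `update` is called on the unscaled gradients each step.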
from tf-keras.
@danijar Can you please share simple standalone code to reproduce the issue? Thanks!
I don't have a simple reproducer, sorry. But it should be easy to see that 1 is an arbitrary lower bound that breaks down for models with large gradients (e.g. many skip connections, or an image reconstruction loss that sums over thousands of pixels).
@reedwm could you take a look here? Thanks!
This is similar to tensorflow/tensorflow#38357, except that that issue involved an older version of the API. The same problem remains, though: the loss scale cannot go below 1.
As mentioned in the other issue, if gradients are still nonfinite when the loss scale reaches 1, every step will be skipped and training will not progress. This behavior is not great, but I cannot think of a good solution. Ideally we would raise an error, but that could incur a performance penalty by transferring the loss scale to the CPU, and there isn't a clear place in Keras where we would check whether the loss scale is 1.
We could also allow the loss scale to go below 1. However, LossScaleOptimizer would then no longer fulfill its sole purpose of preventing underflow, since a loss scale below 1 increases the chance of underflow. Allowing this would still be worth it if it helped mixed precision training, but I suspect it would not. For gradients to still overflow with a loss scale of 1, a gradient value would have to exceed 65504, which is a very large value. I don't know of any models with gradients that large, so I suspect that if the loss scale reaches 1, the NaNs or Infs are being generated by something other than gradient overflow.
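For reference, 65504 is the largest finite value representable in IEEE 754 half precision, which is why a gradient past that limit turns into Inf under float16. This can be checked directly with NumPy:

```python
import numpy as np

# 65504 is the largest finite float16 value.
print(np.finfo(np.float16).max)  # 65504.0

# A value past the limit overflows to inf when cast down to float16,
# which is exactly the nonfinite gradient the loss-scale update detects.
overflowed = np.float16(np.float32(70000.0))
print(np.isinf(overflowed))  # True

# Values within range survive the cast as finite numbers.
print(np.isfinite(np.float16(60000.0)))  # True
```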
@danijar, there is a workaround in tensorflow/tensorflow#38357 (comment). I suspect allowing the loss scale to go below 1 will not fix the model, however. You can alternatively try setting some layers to float32 by passing dtype="float32" and seeing if that helps. You can try setting all or most layers to float32, then start switching layers back to mixed precision and checking whether the loss scale reaches 1.
/CC @nluehr
@reedwm and others, it seems like this issue has not been resolved yet. Could we add a boolean parameter prevent_overflow to LossScaleOptimizer which disables the lower bound when it is set to True?
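A hypothetical sketch of what the proposed flag could change in the loss-scale backoff step. The name prevent_overflow comes from the comment above, but the function and its signature are illustrative, not an existing or proposed Keras API:

```python
def backoff(scale, prevent_overflow=False):
    """Halve the loss scale after a step with nonfinite gradients.

    With prevent_overflow=True, the lower bound of 1 is removed so the
    scale can keep shrinking for models whose unscaled gradients exceed
    the float16 maximum of 65504.
    """
    new_scale = scale / 2.0
    if prevent_overflow:
        return new_scale           # no floor: scale may drop below 1
    return max(new_scale, 1.0)     # current behavior: clamp at 1
```

With the default, repeated overflows stall at a scale of 1 and every step is skipped; with the flag enabled, the scale keeps halving until the scaled gradients become finite again.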