Comments (13)
While this is an interesting application of device placement for larger models, the cost is in training time.
Your moving-average weights are on the CPU, whereas the gradients of every parameter are on the GPU. With your device blocks, you are effectively shuttling gradients from the GPU to the CPU, performing the op there, and then shuttling the result back onto the GPU.
This has several issues:
- Shuttling gradients between GPU and CPU for large models with millions of parameters happens once per batch. This costs too much time.
- The CPU must already perform multiple tasks: multiprocess data loading (if the images come from ImageNet or an external source in general), batching, and shuffling. Now it must also incur the cost of synchronizing gradients and performing CPU ops on large matrices. This will bottleneck the IO pipeline, and since IO is generally the major bottleneck anyway, this is highly inefficient.
This is fine when one is willing to pay the price in compute time for larger models, but it is not feasible in the general case.
EDIT:
Why not then just force the entire optimizer onto the CPU device, if you incur the cost of device shuttling anyway? That way, at least the CPU ops can be streamlined.
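A minimal sketch of that idea, assuming the TF 2.x eager API (the buffer and variable names are illustrative, not the actual keras-adabound code): variables created inside a CPU device scope live in host RAM, so an optimizer's moment buffers can be pinned there while the model itself stays on the GPU.

```python
import tensorflow as tf

# Keep an Adam-style moment buffer in host RAM by creating it inside a
# CPU device scope. On a GPU machine the model variables would stay on
# the GPU; here `grad` merely stands in for a GPU-resident gradient.
with tf.device('/CPU:0'):
    m = tf.Variable(tf.zeros([4]))  # first-moment buffer, pinned to CPU

grad = tf.constant([1.0, 2.0, 3.0, 4.0])
beta1 = 0.9
# The update op runs on the CPU; TF copies `grad` host-side implicitly.
m.assign(beta1 * m + (1.0 - beta1) * grad)
```

The implicit cross-device copy this triggers on every batch is exactly the shuttling cost discussed above.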
from keras-adabound.
In addition, gradient checkpointing already works along a similar line of thought, recomputing intermediate activations on demand rather than preserving them in GPU RAM. You could look into that if memory is the bottleneck and time is not a consideration.
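As a rough illustration of that trade-off (a toy NumPy sketch, not any real framework's checkpointing API): the forward pass stores only every k-th activation, and the backward pass recomputes the missing ones from the nearest stored checkpoint, trading extra compute for lower memory.

```python
import numpy as np

def grad_full(x, n_layers):
    """Ordinary backprop through a chain of tanh layers: every
    activation is kept in memory (O(n_layers) storage)."""
    acts = [x]
    for _ in range(n_layers):
        acts.append(np.tanh(acts[-1]))
    g = np.ones_like(x)
    for i in range(n_layers, 0, -1):
        g = g * (1.0 - acts[i] ** 2)  # d tanh(a)/da = 1 - tanh(a)^2
    return g

def grad_checkpointed(x, n_layers, every=2):
    """Same gradient, but the forward pass stores only every `every`-th
    activation; the rest are recomputed during the backward pass."""
    ckpts = {0: x}
    a = x
    for i in range(1, n_layers + 1):
        a = np.tanh(a)
        if i % every == 0:
            ckpts[i] = a
    g = np.ones_like(x)
    for i in range(n_layers, 0, -1):
        j = max(k for k in ckpts if k <= i - 1)  # nearest stored activation
        h = ckpts[j]
        for _ in range(i - 1 - j):               # recompute up to layer i-1
            h = np.tanh(h)
        g = g * (1.0 - np.tanh(h) ** 2)
    return g
```

Both functions return the same gradient; the checkpointed one just pays for it with repeated forward work instead of stored activations.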
I already tested it and applied it in my DeepFaceLab project (deepfakes).
- 8 batch size at 128x128 is the maximum face model (~500MB model files) for my 6 GB card with tf_cpu_mode=0.
- 4 batch size at 256x256 (~1000MB model files) with tf_cpu_mode=1, ~10% slower. But that is because the model is bigger.
- 8 batch size at 256x256 (~1000MB model files) with tf_cpu_mode=2, ~30% slower.
So this approach brings deepfakes into a new era.
If you don't like it, just close it :)
I just wanted to share the find.
Trying AdaBound right after Adam:
same lr, but final_lr = lr * 100.
History of the last 5k iters:
NSFW pic
interesting :)
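For context on what final_lr does, here is a stdlib sketch of the bound schedule from the AdaBound paper (the function name and the gamma default are illustrative, not the keras-adabound API): each per-coordinate step size is clipped into a moving interval, and both bounds converge to final_lr, so training transitions from Adam-like to SGD-like updates.

```python
import math

def adabound_step(lr, v_hat, t, final_lr=0.1, gamma=1e-3, eps=1e-8):
    """Clip an Adam-style step size into AdaBound's moving bounds.
    Both bounds converge to final_lr as t grows, so late in training
    every coordinate takes roughly the same (SGD-like) step."""
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))
    raw = lr / (math.sqrt(v_hat) + eps)  # unclipped Adam step size
    return min(max(raw, lower), upper)
```

Early on the interval is very wide (effectively plain Adam); choosing final_lr = lr * 100, as above, sets where the step sizes end up once the bounds tighten.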
I believe you will find similar results by simply placing the entire optimizer on the CPU, but I'm glad you found a good alternative. I'll probably review the PR on keras-contrib sometime, if it gets merged.
I must ask you to remove the image, though.
> This is fine when one is willing to pay the price in compute time for larger models, but it is not feasible in the general case.
Batch size is a very important parameter for GAN networks. So by moving the optimizer's weights out of VRAM, we can train with a higher batch size, sacrificing 10-20% of time per iteration.
Also, I cannot feel any noticeable performance loss on my Coffee Lake machine with 32GB of 2400MHz RAM.
Sure, if one can disregard the additional training time, then your approach is fine. I won't be merging it into this repository, since I keep 1:1 equivalence with Keras proper.
Btw, a slight question: why not place the
if self.amsbound:
denom = (K.sqrt(vhat_t) + self.epsilon)
else:
denom = (K.sqrt(v_t) + self.epsilon)
# Compute the bounds
step_size_p = step_size * K.ones_like(denom)
inside the CPU block as well? That would offer even more memory savings, since you would not need a K.ones_like() on the GPU then.
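A minimal sketch of the suggestion, assuming the TF 2.x eager API (the values stand in for the optimizer's real tensors): ops created inside a CPU device scope allocate their results on the CPU, so moving the denom computation and the ones-like buffer into that block keeps them out of GPU memory.

```python
import tensorflow as tf

# Ops created inside a device scope allocate their outputs there; on a
# GPU machine this keeps `denom` and the ones-like buffer in host RAM.
with tf.device('/CPU:0'):
    v_t = tf.constant([4.0, 9.0])          # stand-in for the optimizer's v_t
    denom = tf.sqrt(v_t) + 1e-8            # mirrors K.sqrt(v_t) + epsilon
    step_size_p = 0.001 * tf.ones_like(denom)
```

The trade-off is the same as before: the GPU-side update that eventually consumes these tensors has to copy them back across the bus.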
Should be tested; thanks for the tip.
> I must ask you to remove the image, though.
And why?
Does your religion not allow looking at the faces of women? :0
Informed consent. Someone casually browsing should not be shown random NSFW content unless it is behind a link that clearly states the content is NSFW, so that they are implicitly responsible for viewing it at their own discretion.
I did not know that a bunch of ordinary women's faces is not safe for work.