Comments (13)
While this is an interesting application of device placement for larger models, the cost is in training time.
Your moving-average weights are on the CPU, whereas the gradients of every parameter are on the GPU. With your device blocks, you are effectively shuttling gradients from the GPU to the CPU, performing the op there, and then shuttling the result back onto the GPU.
This has several issues:
- Shuttling gradients between GPU and CPU for large models with millions of parameters happens once per batch. This costs too much time.
- The CPU must already perform multiple tasks: multiprocess data loading (if the images come from ImageNet or an external source in general), batching, and shuffling. Now it must also incur the cost of synchronizing gradients and performing CPU ops on large matrices. This will bottleneck the IO pipeline, and since IO is generally the major bottleneck anyway, this is highly inefficient.
This is fine when one is willing to pay the price in compute time for larger models, but it is not feasible in the general case.
EDIT:
Why not then just force the entire optimizer onto the CPU device, if you incur the cost of device shuttling anyway? That way, at least the CPU ops can be streamlined.
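A minimal sketch of that idea, assuming the TF 2.x eager API (the buffer and variable names are illustrative, not the actual keras-adabound code): variables created inside a CPU device scope live in host RAM, so an optimizer's moment buffers can be pinned there while the model itself stays on the GPU.

```python
import tensorflow as tf

# Keep an Adam-style moment buffer in host RAM by creating it inside a
# CPU device scope. On a GPU machine the model variables would stay on
# the GPU; here `grad` merely stands in for a GPU-resident gradient.
with tf.device('/CPU:0'):
    m = tf.Variable(tf.zeros([4]))  # first-moment buffer, pinned to CPU

grad = tf.constant([1.0, 2.0, 3.0, 4.0])
beta1 = 0.9
# The update op runs on the CPU; TF copies `grad` host-side implicitly.
m.assign(beta1 * m + (1.0 - beta1) * grad)
```

The implicit cross-device copy this triggers on every batch is exactly the shuttling cost discussed above.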
from keras-adabound.
In addition, gradient checkpointing already works along a similar line of thought, recomputing intermediate activations on demand rather than preserving them in GPU RAM. You could look into that if memory is the bottleneck and time is not a consideration.
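As a rough illustration of that trade-off (a toy NumPy sketch, not any real framework's checkpointing API): the forward pass stores only every k-th activation, and the backward pass recomputes the missing ones from the nearest stored checkpoint, trading extra compute for lower memory.

```python
import numpy as np

def grad_full(x, n_layers):
    """Ordinary backprop through a chain of tanh layers: every
    activation is kept in memory (O(n_layers) storage)."""
    acts = [x]
    for _ in range(n_layers):
        acts.append(np.tanh(acts[-1]))
    g = np.ones_like(x)
    for i in range(n_layers, 0, -1):
        g = g * (1.0 - acts[i] ** 2)  # d tanh(a)/da = 1 - tanh(a)^2
    return g

def grad_checkpointed(x, n_layers, every=2):
    """Same gradient, but the forward pass stores only every `every`-th
    activation; the rest are recomputed during the backward pass."""
    ckpts = {0: x}
    a = x
    for i in range(1, n_layers + 1):
        a = np.tanh(a)
        if i % every == 0:
            ckpts[i] = a
    g = np.ones_like(x)
    for i in range(n_layers, 0, -1):
        j = max(k for k in ckpts if k <= i - 1)  # nearest stored activation
        h = ckpts[j]
        for _ in range(i - 1 - j):               # recompute up to layer i-1
            h = np.tanh(h)
        g = g * (1.0 - np.tanh(h) ** 2)
    return g
```

Both functions return the same gradient; the checkpointed one just pays for it with repeated forward work instead of stored activations.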
I already tested it and applied it in my DeepFaceLab project (deepfakes).
- 8 batch size at 128x128 is the maximum face model (~500MB model files) for my 6 GB card with tf_cpu_mode=0.
- 4 batch size at 256x256 (~1000MB model files) with tf_cpu_mode=1, ~10% slower. But that is because the model is bigger.
- 8 batch size at 256x256 (~1000MB model files) with tf_cpu_mode=2, ~30% slower.
So this approach brings deepfakes into a new era.
If you don't like it, just close it :)
I just wanted to share the find.
Trying AdaBound right after Adam:
same lr, but final_lr = lr * 100.
History of the last 5k iters:
NSFW pic
interesting :)
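For context on what final_lr does, here is a stdlib sketch of the bound schedule from the AdaBound paper (the function name and the gamma default are illustrative, not the keras-adabound API): each per-coordinate step size is clipped into a moving interval, and both bounds converge to final_lr, so training transitions from Adam-like to SGD-like updates.

```python
import math

def adabound_step(lr, v_hat, t, final_lr=0.1, gamma=1e-3, eps=1e-8):
    """Clip an Adam-style step size into AdaBound's moving bounds.
    Both bounds converge to final_lr as t grows, so late in training
    every coordinate takes roughly the same (SGD-like) step."""
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))
    raw = lr / (math.sqrt(v_hat) + eps)  # unclipped Adam step size
    return min(max(raw, lower), upper)
```

Early on the interval is very wide (effectively plain Adam); choosing final_lr = lr * 100, as above, sets where the step sizes end up once the bounds tighten.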
I believe you will find similar results by simply placing the entire optimizer on the CPU, but I'm glad you found a good alternative. I'll probably review the PR on keras-contrib sometime, if it gets merged.
I must ask you to remove the image, though.
> This is fine when one is willing to pay the price in compute time for larger models, but it is not feasible in the general case.
Batch size is a very important parameter for GAN networks. So by moving the optimizer's weights out of VRAM, we can train with a higher batch size, sacrificing 10-20% of time per iteration.
Also, I cannot feel any noticeable performance loss on my Coffee Lake machine with 32GB of 2400MHz RAM.
Sure, if one can disregard the additional training time, then your approach is fine. I won't be merging it into this repository, since I keep 1:1 equivalence with Keras proper.
Btw, a slight question: why not place the
if self.amsbound:
denom = (K.sqrt(vhat_t) + self.epsilon)
else:
denom = (K.sqrt(v_t) + self.epsilon)
# Compute the bounds
step_size_p = step_size * K.ones_like(denom)
inside the CPU block as well? That would offer even more memory savings, since you would not need a K.ones_like() on the GPU then.
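A minimal sketch of the suggestion, assuming the TF 2.x eager API (the values stand in for the optimizer's real tensors): ops created inside a CPU device scope allocate their results on the CPU, so moving the denom computation and the ones-like buffer into that block keeps them out of GPU memory.

```python
import tensorflow as tf

# Ops created inside a device scope allocate their outputs there; on a
# GPU machine this keeps `denom` and the ones-like buffer in host RAM.
with tf.device('/CPU:0'):
    v_t = tf.constant([4.0, 9.0])          # stand-in for the optimizer's v_t
    denom = tf.sqrt(v_t) + 1e-8            # mirrors K.sqrt(v_t) + epsilon
    step_size_p = 0.001 * tf.ones_like(denom)
```

The trade-off is the same as before: the GPU-side update that eventually consumes these tensors has to copy them back across the bus.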
Should be tested; thanks for the tip.
> I must ask you to remove the image, though.
And why?
Does your religion not allow looking at the faces of women? :0
Informed consent. Someone casually browsing should not be shown random NSFW content unless it is behind a link that clearly states the content is NSFW, so that they are implicitly responsible for viewing it at their own discretion.
I did not know that a bunch of ordinary women's faces is not safe for work.