
Comments (10)

White-Link commented on August 24, 2024

Hi!

We trained our model on Moving MNIST with one GPU. For example, using an Nvidia Titan V with 12GB of VRAM should lead to approximately 6 iterations per second with the help of Apex, giving a training time of about two days.

The numbers that you report are indeed prohibitive. Regarding Apex, your recent GPU should allow substantial acceleration. In any case, even without using Apex, training the model on Moving MNIST should be a matter of days.

It seems like the model actually runs on CPU rather than GPU on your device, as it matches the performance that we observe when launching our code on CPU. Did you use the option --device followed by the index of the GPU (e.g., 0)?
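As a minimal, hypothetical sketch of what such a `--device` option typically does (the actual argument handling in the srvp codebase may differ), the point is that without an explicit GPU index, tensors silently stay on CPU:

```python
import torch

# Hypothetical helper illustrating a --device option; not srvp's actual code.
def get_device(device_index=None):
    if device_index is not None and torch.cuda.is_available():
        return torch.device(f"cuda:{device_index}")
    # No explicit device index: everything silently runs on CPU.
    return torch.device("cpu")

device = get_device(0)  # e.g. --device 0
batch = torch.zeros(4, 3).to(device)
```

On a machine without CUDA (or without the index), `get_device` falls back to CPU, which matches the slowdown described above.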

from srvp.

White-Link commented on August 24, 2024

No worries! For your follow-up questions:

  • Training time: I am not sure that efficiency is linked to memory consumption for a given batch size. From what I can see, the Nvidia T4 GPU may be less powerful than the Nvidia Titan V in terms of computational power, explaining the higher training time on your side. You might want to check whether using Apex does indeed increase the number of iterations per second, to be sure that you encounter no technical issues.
  • Gradient overflow: This Apex-internal warning can periodically appear during training. As long as the loss scale does not dramatically decrease to near zero, this should not be a problem; it only skips the current optimization step. It results from Apex periodically trying to change the scale it applies to the loss, and reverting the change when it notices gradient issues. This is explained here (that documentation is for PyTorch's amp package, which should behave similarly to Apex).
  • Choice of parameters: Intuitively, the size ny of the dynamic variable y (and the size nz of the random variable z, which we choose to be the same size as y) should correspond to the underlying dimension of the dynamic problem, excluding static information such as visual features, which are captured by the content variable w. For instance, on Moving MNIST, there should be at least 4 dimensions per digit, as the model should at least track each digit's position and speed. Since there may be other hidden variables to account for, especially in non-synthetic videos, you may want to choose ny large enough to capture as much dynamic information as possible, while staying at the order of magnitude given by your intuition. Note that this is only an intuition, and only a thorough hyperparameter search can confirm it.
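The skip-and-rescale behavior described in the gradient overflow point above can be sketched with PyTorch's native amp, which as noted behaves similarly to Apex. This is a minimal CPU-runnable illustration, not srvp's actual training loop:

```python
import torch

# Sketch of dynamic loss scaling with torch.cuda.amp.GradScaler: on overflow,
# step() skips the optimizer update and update() lowers the scale; after a
# stretch of stable iterations, the scale is raised again.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

x, y = torch.randn(16, 8), torch.randn(16, 1)
for _ in range(3):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()  # gradients are computed on the scaled loss
    scaler.step(optimizer)         # skipped if inf/nan gradients are detected
    scaler.update()                # adjusts the scale for the next iteration
```

With `enabled=False` (e.g. on CPU), the scaler is a transparent no-op, so the same loop works with or without a GPU.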


zengzhen commented on August 24, 2024

Thank you for the response! Yes, I missed the device option, I was taking it for granted...should have double-checked.

I have a few follow-up questions:

  • I am now using the device option and Apex, and I'm getting around 80 hours as the estimated training time. Could this just be because I have much less VRAM (as shown below)?

[screenshot]

  • I also tried training on the Bouncing Balls dataset with Apex enabled, and I see "Gradient overflow" at every iteration, even after training for many iterations, as shown below:

[screenshot]

Based on your experience, is this normal during training (even after many iterations) or can this lead to issues? If it is an issue, what would you suggest to deal with it?

  • Regarding the choices of parameters ny & nz for different datasets, can you share some insights on how you made the choices during your experiments?

Thank you very much!


zengzhen commented on August 24, 2024

Thank you! Yes, Apex does increase the training speed, so I think the longer training time on my side is because the Nvidia T4 GPU is less powerful.

I've trained several times on deterministic Moving MNIST, and there is a persistent issue I run into: after around 60k~80k iterations on AWS, there is suddenly a huge read throughput (as shown below; I trained from scratch twice in this plot, and the sudden rise in read throughput occurs in both training runs).
[screenshot]

This is an issue because it uses up all the burst balance on AWS (my AWS instance does have a reasonably large amount of burst balance available before this happens), and as a result the instance becomes extremely slow, such that it takes forever to train the remaining iterations.

What do you think could be causing this persistent sudden rise of the read throughput? Did this occur on your side? Do you provide a way to load a pre-trained model and resume training from there, as an easy workaround?

Thank you!


White-Link commented on August 24, 2024

We did not experience this issue as the program should not be very read-intensive past the data loading phase.

Do you have a more precise estimation of the number of iterations reached when this problem occurs? Could you please provide the command you used to launch training, including all options?


zengzhen commented on August 24, 2024

Yeah, I also think it should not be very read-intensive, so I wonder what was happening when the burst balance dropped like crazy.

So I ran it 3 times; the number of iterations at which this occurs was not fixed: it was around 64k, 78k, and 83k. The particular command that I used for training was

python train.py --ny 20 --nz 20 --beta_z 2 --nt_cond 5 --nt_inf 5 --dataset smmnist --deterministic --nc 1 --seq_len 15 --lr_scheduling_burnin 800000 --lr_scheduling_n_iter 100000 --save_path /home/ubuntu/workspace/srvp/models/mmnist_train/ --data_dir /home/ubuntu/MMNIST_dataset/ --chkpt_interval 30000 --device 0 --apex_amp

and I installed the dependencies using the requirements.txt


White-Link commented on August 24, 2024

Unfortunately we were not able to observe this problem on our side. Do you know whether it occurs only with Moving MNIST, or with other datasets as well?

You could try to lower the frequency of validation (with --val_interval) and checkpointing (with --chkpt_interval) just in case. You could even remove validation steps as we do not perform model selection on Moving MNIST. These two operations are not supposed to be read-intensive, but it might be worth trying to check if there is any issue.

You could also increase the batch size to check whether this problem occurs sooner in this setting, in order to investigate potential issues from the data loaders.


zengzhen commented on August 24, 2024

Thanks for the suggestions!

I found out that in the function def evaluate(...) in train.py, there is a memory leak due to x.cpu() in line:

all_mse = torch.mean(F.mse_loss(all_x, x.cpu().expand_as(all_x), reduction='none'), dim=[4, 5])

and I fixed it by either explicitly assigning x.cpu() to a variable that is then used, or just leaving x on the GPU and performing the evaluation with GPU tensors (instead of CPU tensors as before). Before the fix, every time the model was evaluated during training, the RAM consumption increased significantly, eventually running out of memory. In addition, I no longer run into the burst balance issue on AWS (I'm not sure how the two issues are correlated, but it appears that they are).
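The first variant of the fix can be sketched in isolation as follows; the tensor shapes below are made up for illustration, and only the explicit x_cpu assignment reflects the actual change:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (n_samples, seq, batch, channels, height, width)
all_x = torch.randn(3, 5, 2, 1, 8, 8)  # model samples, on CPU
x = torch.randn(5, 2, 1, 8, 8)         # ground truth (on GPU in practice)

x_cpu = x.cpu()  # keep an explicit reference instead of calling x.cpu() inline
all_mse = torch.mean(
    F.mse_loss(all_x, x_cpu.expand_as(all_x), reduction='none'),
    dim=[4, 5],
)
```

Here `expand_as` broadcasts the ground truth over the leading samples dimension without copying memory, and `all_mse` keeps one MSE value per sample, timestep, batch element, and channel.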

Would you suggest I open a pull request to fix this issue?
Thanks!


White-Link commented on August 24, 2024

Alright! I am not sure this is the intended behavior of PyTorch, but we welcome any bug fix. You can submit a pull request that simply assigns a new variable containing x.cpu().

It is possible that the burst balance issue was actually caused by the program swapping when out of memory.

Thank you for reporting this bug!


White-Link commented on August 24, 2024

Hi! We integrated your suggestion to assign a new variable containing x.cpu() in the validation function in our most recent commits. We hope this solves your memory leak problem.

We'll thus close this issue for now. Thanks again for your interest and bug report, and let us know if you have any other questions!

