
Comments (11)

dbolya commented on August 18, 2024

Hmm weird, I just booted up a p3.2xlarge instance to test and I can't reproduce this.

Running the command
python train.py --config=yolact_base_config
(i.e., with the default batch size of 8) I get an ETA of 4 days.

Running the command
python train.py --config=yolact_base_config --batch_size=12 --num_workers=8
results in an ETA of 5 days (also note that the training parameters are optimized for a batch size of 8, so you should use the first command).

Are you doing anything special that would slow it down? If not, what are your PyTorch and CUDA versions?

The timer column is the time it took for one training iteration, while the ETA averages out over a large number of iterations.
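As a minimal sketch of that distinction (not the repo's actual code; the window size and timings below are made up): the timer reports the latest single-iteration time, while an ETA built on a moving average smooths over many iterations.

```python
from collections import deque

def eta_seconds(iter_times, iters_left, window=1000):
    """ETA from the average of the last `window` per-iteration timings."""
    recent = deque(iter_times, maxlen=window)  # keep only the most recent timings
    avg = sum(recent) / len(recent)            # smoothed seconds per iteration
    return avg * iters_left

# Made-up per-iteration timings (seconds): the last value alone would be the
# "timer column"; the average is what drives the ETA.
times = [0.4, 0.5, 0.6, 0.5]
print(eta_seconds(times, 100))  # ~50 seconds remaining
```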

And to view training performance over time, log your console output to a file, pull my latest commit, and from the project's root directory run
python -m "scripts.plot_loss" <my_log_file>
to plot the loss over time.

To plot validation mAP over time, use
python -m "scripts.plot_loss" <my_log_file> val
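Since the plotting scripts read a saved log, one way to capture the console output (a generic Unix sketch — `tee` is standard; the log file name is arbitrary) is:

```shell
# Duplicate stdout into a file while still watching it live. A stand-in
# command is used here; in practice pipe the real training run, e.g.:
#   python train.py --config=yolact_base_config | tee train.log
echo "sample training output" | tee train.log
cat train.log
# Afterwards, from the repo root:
#   python -m "scripts.plot_loss" train.log
#   python -m "scripts.plot_loss" train.log val
```

If the piped output appears with a delay, running Python with `-u` disables stdout buffering so the log stays current.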

from yolact.

jodusan commented on August 18, 2024

Thanks for the quick reply! I'm using my own dataset; does that matter? (It should just resize the images and proceed the same way, right?) I generated COCO-style annotations and only added a new config, nothing else. The ETA exploded.


dbolya commented on August 18, 2024

For debugging purposes, can you try training on COCO to see what the ETA is? There might be an issue with how you set up your dataset / config.


harryb-kyutech commented on August 18, 2024

@dbolya do you have a script that plots both the training loss and mAP AFTER training has finished (i.e., perhaps during evaluation)? It would be a great help, as it's urgently needed. Thank you nonetheless for the great work on YOLACT.


dbolya commented on August 18, 2024

@harrybolingot Those scripts plot those things after training, but you need the log of stdout during training or else there's nowhere to get the info from. Or are you asking for the final training loss?


harryb-kyutech commented on August 18, 2024

> @harrybolingot Those scripts plot those things after training, but you need the log of stdout during training or else there's nowhere to get the info from. Or are you asking for the final training loss?

Yeah, my problem is that I wasn't able to save the stdout during training. I stopped and restarted training from time to time, since going from 0 to 100k iterations took 36 hours, and during that time I didn't realize I should have been logging the stdout to plot the training loss. I want to be able to plot the training loss from start to end now that training has finished :(


dbolya commented on August 18, 2024

Yeah, sorry, that's not possible because that data isn't saved anywhere. And oof, that training time is really bad. Is that expected given your hardware, or do you think something's wrong?


harryb-kyutech commented on August 18, 2024

Our lab has four GTX 1080 Ti GPUs, and I used all of them during training. I'm not sure if this is the standard training speed for this algorithm on this GPU setup. I also have no idea whether something's wrong, as the ETA was fairly accurate during my first run of your algorithm on my custom dataset.


dbolya commented on August 18, 2024

Does your custom dataset have huge images or something? On COCO with a single 1080 Ti, the expected training time is ~5 days, so 100k iterations in 36 hours works out to 3 days for 200k iterations (800k iters / 4 GPUs), which is not even a 2x speed-up. If you haven't already, you can check out #8 for some tips on multiple GPUs.
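A quick back-of-the-envelope check of that arithmetic, using the numbers quoted in this thread:

```python
# 800k single-GPU iterations shrink to 200k when the batch is split over
# 4 GPUs, and "100k iterations in 36 hours" fixes the observed pace.
total_iters_1gpu = 800_000
gpus = 4
iters_on_4gpus = total_iters_1gpu // gpus        # 200_000 iterations
total_hours = iters_on_4gpus * 36 / 100_000      # at 36 h per 100k iters
eta_days = total_hours / 24
print(eta_days)  # 3.0 -- vs ~5 days on one GPU, i.e. under a 2x speed-up
```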

Though I must confess, since I didn't use multiple GPUs while developing YOLACT, my code doesn't support multiple GPUs very well. Have you tried training on a single GPU and comparing the ETA (after the first 1k iterations though because it's not accurate until then)?

Btw the ETA is based on the current speed of training, so that's why it's accurate (as long as the training speed is consistent). It being accurate doesn't necessarily mean everything's working properly.


dbolya commented on August 18, 2024

Since this has been open so long, I'm going to close it. Feel free to reopen if you have any updates.


maskrcnnuser commented on August 18, 2024

@dbolya - I am training with my own set of 20 images for training and 10 for validation; I purposely kept the set small at first to test out the system. However, my output says an ETA of 58 days!! My setup is as follows.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 426.23 Driver Version: 426.23 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 TCC | 00000001:00:00.0 Off | 0 |
| N/A 80C P0 143W / 149W | 10404MiB / 11448MiB | 96% Default |
+-------------------------------+----------------------+----------------------+

Timer varies between 8.5 and 11.6 seconds.

TensorFlow: 2.0.0
PyTorch: 1.6.0

Am I doing something wrong?

