
Comments (11)

dbolya commented on August 18, 2024

Hmm weird, I just booted up a p3.2xlarge instance to test and I can't reproduce this.

Running the command
python train.py --config=yolact_base_config
(i.e., with the default batch size of 8) I get an ETA of 4 days.

Running the command
python train.py --config=yolact_base_config --batch_size=12 --num_workers=8
results in an ETA of 5 days (also note that the training parameters are optimized for a batch size of 8, so you should use the first command).

Are you doing anything special that would slow it down? If not, what are your PyTorch and CUDA versions?

The timer column is the time it took for one training iteration, while the ETA averages out over a large number of iterations.
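As a minimal sketch of that distinction (not the repo's actual code; the window size and timings below are made up): the timer reports the latest single-iteration time, while an ETA built on a moving average smooths over many iterations.

```python
from collections import deque

def eta_seconds(iter_times, iters_left, window=1000):
    """ETA from the average of the last `window` per-iteration timings."""
    recent = deque(iter_times, maxlen=window)  # keep only the most recent timings
    avg = sum(recent) / len(recent)            # smoothed seconds per iteration
    return avg * iters_left

# Made-up per-iteration timings (seconds): the last value alone would be the
# "timer column"; the average is what drives the ETA.
times = [0.4, 0.5, 0.6, 0.5]
print(eta_seconds(times, 100))  # ~50 seconds remaining
```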

And to view training performance over time, log your console output to a file, pull my latest commit, and from the project's root directory run
python -m "scripts.plot_loss" <my_log_file>
to plot the loss over time.

To plot validation mAP over time, use
python -m "scripts.plot_loss" <my_log_file> val
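Since the plotting scripts read a saved log, one way to capture the console output (a generic Unix sketch — `tee` is standard; the log file name is arbitrary) is:

```shell
# Duplicate stdout into a file while still watching it live. A stand-in
# command is used here; in practice pipe the real training run, e.g.:
#   python train.py --config=yolact_base_config | tee train.log
echo "sample training output" | tee train.log
cat train.log
# Afterwards, from the repo root:
#   python -m "scripts.plot_loss" train.log
#   python -m "scripts.plot_loss" train.log val
```

If the piped output appears with a delay, running Python with `-u` disables stdout buffering so the log stays current.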

from yolact.

jodusan commented on August 18, 2024

Thanks for the quick reply! I'm using my own dataset; does that matter? (It should just resize the images and proceed the same way, right?) I generated COCO-style annotations and only added a new config, nothing else. The ETA exploded.


dbolya commented on August 18, 2024

For debugging purposes, can you try training on COCO to see what the ETA is? There might be an issue with how you set up your dataset / config.


harryb-kyutech commented on August 18, 2024

@dbolya do you have a script that plots both the training loss and mAP AFTER training has finished (i.e., perhaps during evaluation)? It would be a great help, as it's urgently needed. Thank you nonetheless for the great work on YOLACT.


dbolya commented on August 18, 2024

@harrybolingot Those scripts plot those things after training, but you need the log of stdout during training or else there's nowhere to get the info from. Or are you asking for the final training loss?


harryb-kyutech commented on August 18, 2024

> @harrybolingot Those scripts plot those things after training, but you need the log of stdout during training or else there's nowhere to get the info from. Or are you asking for the final training loss?

Yeah, my problem is that I wasn't able to save the stdout during training. I stopped and restarted training from time to time, since going from 0 to 100k iterations took 36 hours, and during that time I didn't realize I should have been logging the stdout to plot the training loss. I want to be able to plot the training loss from start to end now that training has finished :(


dbolya commented on August 18, 2024

Yeah, sorry, that's not possible because that data isn't saved anywhere. And oof, that training time is really bad. Is that expected given your hardware, or do you think something's wrong?


harryb-kyutech commented on August 18, 2024

Our lab has four GTX 1080 Ti GPUs, and I used all of them during training. I'm not sure if this is the standard training speed for this algorithm on this GPU setup. I also have no idea whether something's wrong, as the ETA was fairly accurate during my first run of your algorithm on my custom dataset.


dbolya commented on August 18, 2024

Does your custom dataset have huge images or something? On COCO with a single 1080 Ti, the expected training time is ~5 days, so 100k iterations in 36 hours works out to 3 days for 200k iterations (800k iters / 4 GPUs), which is not even a 2x speed-up. If you haven't already, you can check out #8 for some tips on multiple GPUs.
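A quick back-of-the-envelope check of that arithmetic, using the numbers quoted in this thread:

```python
# 800k single-GPU iterations shrink to 200k when the batch is split over
# 4 GPUs, and "100k iterations in 36 hours" fixes the observed pace.
total_iters_1gpu = 800_000
gpus = 4
iters_on_4gpus = total_iters_1gpu // gpus        # 200_000 iterations
total_hours = iters_on_4gpus * 36 / 100_000      # at 36 h per 100k iters
eta_days = total_hours / 24
print(eta_days)  # 3.0 -- vs ~5 days on one GPU, i.e. under a 2x speed-up
```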

Though I must confess, since I didn't use multiple GPUs while developing YOLACT, my code doesn't support multiple GPUs very well. Have you tried training on a single GPU and comparing the ETA (after the first 1k iterations though because it's not accurate until then)?

Btw the ETA is based on the current speed of training, so that's why it's accurate (as long as the training speed is consistent). It being accurate doesn't necessarily mean everything's working properly.


dbolya commented on August 18, 2024

Since this has been open so long, I'm going to close it. Feel free to reopen if you have any updates.


maskrcnnuser commented on August 18, 2024

@dbolya - I am training with my own set of 20 images for training and 10 for validation; I purposely kept the set small at first to test out the system. However, my output says an ETA of 58 days!! My setup is as follows.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 426.23 Driver Version: 426.23 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 TCC | 00000001:00:00.0 Off | 0 |
| N/A 80C P0 143W / 149W | 10404MiB / 11448MiB | 96% Default |
+-------------------------------+----------------------+----------------------+

Timer varies between 8.5 and 11.6 seconds.

TensorFlow: 2.0.0
PyTorch: 1.6.0

Am I doing something wrong?

