Comments (11)
Hmm weird, I just booted up a p3.2xlarge instance to test and I can't reproduce this.
Running the command
python train.py --config=yolact_base_config
(i.e., with the default batch size of 8) I get an ETA of 4 days.
Running the command
python train.py --config=yolact_base_config --batch_size=12 --num_workers=8
results in an ETA of 5 days (also note that the training parameters are optimized for a batch size of 8, so you should use the first command).
Are you doing anything special that would slow it down? If not, what are your PyTorch and CUDA versions?
The timer column is the time it took for one training iteration, while the ETA averages out over a large number of iterations.
And to view training performance over time, log your console output to a file, pull my latest commit, and from the project's root directory run
python -m "scripts.plot_loss" <my_log_file>
to plot the loss over time.
To plot validation mAP over time, use
python -m "scripts.plot_loss" <my_log_file> val
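For anyone unsure how to log the console output to a file, here is a minimal sketch of one way to do it from Python. The helper name `run_and_log` and the file name `train.log` are made up for illustration and are not part of YOLACT; a shell redirect or `tee` would work just as well.

```python
import subprocess
import sys


def run_and_log(cmd, log_path):
    """Run cmd, echoing its stdout to the console and appending it to log_path.

    Hypothetical helper for illustration only; not part of YOLACT.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,  # capture stdout; stderr still goes to the console
            text=True,
            bufsize=1,               # line-buffered on our side of the pipe
        )
        for line in proc.stdout:
            sys.stdout.write(line)   # keep the usual console view
            log.write(line)
            log.flush()              # so the log survives an interrupted run
        return proc.wait()


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. For real training, swap in
    # something like ["python", "-u", "train.py", "--config=yolact_base_config"];
    # -u asks the child Python not to buffer its output when it is piped.
    demo_cmd = [sys.executable, "-c", "print('iteration 0 done')"]
    run_and_log(demo_cmd, "train.log")
```

Because the file is opened in append mode, stopping and resuming training keeps adding to the same log, which can then be passed as `<my_log_file>` to the plotting commands above.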
from yolact.
Thanks for a quick reply! I am using my own dataset, does that matter? (It should just resize images and proceed the same right?) I have generated coco-style annotations and have only added a new config, nothing else. Error exploded.
For debugging purposes, can you try training on COCO to see what the ETA is? There might be an issue with how you set up your dataset / config.
@dbolya do you have a script that plots both the training loss and mAP AFTER the training has finished (i.e. perhaps during evaluation)? Would be of great help as it is urgently needed. Thank you nonetheless for the great work on YOLACT.
@harrybolingot Those scripts plot those things after training, but you need the log of stdout during training or else there's nowhere to get the info from. Or are you asking for the final training loss?
Yeah my problem is that I was not able to save the stdout during training. I've cut the training from time to time as training from 0-100k iterations took 36 hours. During this time I didn't realize that I should have logged the stdout to plot the training loss. I want to be able to plot the training loss from start to end having finished the training :(
Yeah sorry that's not possible cause that data's not saved anywhere. And oof that training time is really bad. Is that expected given your hardware or do you think something's wrong?
Our lab has four GTX 1080 Ti GPUs. I used all of them during training. I'm not sure if this is the standard training speed for this algorithm on this GPU setup. I also have no idea if something's wrong as the ETA was fairly accurate during my first run of your algorithm on my custom dataset.
Does your custom dataset have huge images or something? On COCO with a single 1080 Ti, training should take ~5 days. At 100k iterations per 36 hours, the 200k iterations you need (800k iters / 4 GPUs) take 3 days, which is not even a 2x speedup. If you haven't already, you can check out #8 to see some tips for multiple GPUs.
Though I must confess, since I didn't use multiple GPUs while developing YOLACT, my code doesn't support multiple GPUs very well. Have you tried training on a single GPU and comparing the ETA (after the first 1k iterations though because it's not accurate until then)?
Btw the ETA is based on the current speed of training, so that's why it's accurate (as long as the training speed is consistent). It being accurate doesn't necessarily mean everything's working properly.
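To make the numbers above concrete, here is a rough sketch of the arithmetic. The figures (800k iterations, ~5 days on one 1080 Ti, 100k iterations in 36 hours on four GPUs) are the ones quoted in this thread. The `eta_seconds` function is a hypothetical illustration of an ETA based on recent iteration times, not necessarily how train.py computes it.

```python
def eta_seconds(recent_iter_times, iterations_left):
    """Estimate remaining time as iterations left x mean recent iteration time.

    Hypothetical illustration; the real training script may average differently.
    """
    avg = sum(recent_iter_times) / len(recent_iter_times)
    return iterations_left * avg


# Figures quoted in this thread.
single_gpu_days = 5.0             # ~5 days on COCO with one 1080 Ti
total_iters = 800_000             # iterations for the single-GPU schedule
num_gpus = 4
measured_hours_per_100k = 36.0    # reported speed on the 4-GPU machine

# Following the reasoning above: 4 GPUs need a quarter of the iterations.
iters_needed = total_iters // num_gpus                                  # 200k
multi_gpu_days = iters_needed / 100_000 * measured_hours_per_100k / 24
speedup = single_gpu_days / multi_gpu_days

print(f"4-GPU estimate: {multi_gpu_days:.1f} days, speedup {speedup:.2f}x")
# prints "4-GPU estimate: 3.0 days, speedup 1.67x"
```

The same formula shows why an accurate ETA says nothing about whether training is healthy: it only multiplies the remaining iterations by the current speed, whatever that speed happens to be.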
Since this has been open so long, I'm going to close it. Feel free to reopen if you have any updates.
@dbolya - I am training using my set of 20 images for training and 10 for validation - I purposefully kept the set small in the beginning to test out the system. However, my output says ETA 58 days!! My config is as follows.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 426.23       Driver Version: 426.23       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           TCC  | 00000001:00:00.0 Off |                    0 |
| N/A   80C    P0   143W / 149W |  10404MiB / 11448MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
Timer varies between 8.5 and 11.6 seconds.
Tensorflow: 2.0.0
Pytorch: 1.6.0
Am I doing something wrong?