Comments (14)
Can your dataloaders keep up? That is, are the GPUs at (almost) full load all the time?
The reported training time is very rough (we used a mix of hardware at different times). We will retrain and give a better estimate in the next revision of the paper.
In any case, s0 should take much less than 100h even with 2x 1080Ti. The most probable cause is a dataloader bottleneck.
from stcn.
I think that's highly likely, because my GPU utilization is sometimes far below 100%.
Can you give me some suggestions for solving it? Should I increase OMP_NUM_THREADS (in the command) or num_workers (PyTorch dataloader)?
Thank you!
I find that a larger num_workers really speeds things up in my case.
I think the general wisdom is to use higher OMP_NUM_THREADS and num_workers when you have more free CPU cores available.
That's great to hear.
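One way to check the advice above on a given machine is to time how long the dataloader takes to produce a batch at several num_workers settings. The following is only a minimal sketch, not code from the STCN repository: `seconds_per_batch`, `sweep_num_workers`, and `make_loader` are hypothetical names, and the helper works on any iterable so it can be tried without a GPU.

```python
import time
from typing import Callable, Dict, Iterable, Sequence


def seconds_per_batch(loader: Iterable, max_batches: int = 50) -> float:
    """Average wall-clock seconds spent fetching one batch from `loader`."""
    start = time.perf_counter()
    fetched = 0
    for _ in loader:
        fetched += 1
        if fetched >= max_batches:
            break
    elapsed = time.perf_counter() - start
    # Guard against an empty loader to avoid dividing by zero.
    return elapsed / max(fetched, 1)


def sweep_num_workers(
    make_loader: Callable[[int], Iterable],
    worker_counts: Sequence[int],
    max_batches: int = 50,
) -> Dict[int, float]:
    """Time the loader built by `make_loader(n)` for each n in `worker_counts`."""
    return {n: seconds_per_batch(make_loader(n), max_batches) for n in worker_counts}


# Hypothetical usage with PyTorch (assumes `dataset` is already defined):
#   from torch.utils.data import DataLoader
#   make_loader = lambda n: DataLoader(dataset, batch_size=16, shuffle=True, num_workers=n)
#   print(sweep_num_workers(make_loader, [4, 8, 16]))
```

This measures loading alone, so it shows whether the dataloader can outpace the per-iteration [time] reported in the training log. OMP_NUM_THREADS is set in the launch command, so testing different values means rerunning the sweep under each setting.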
OK. More num_workers does help, and OMP_NUM_THREADS=4, your original setting, is the fastest in my case (1, 8, and 16 are all slower).
Thank you for your great work and quick reply!
BTW you can try adding the --benchmark flag.
Thank you. I tried it, but it was not really effective.
Also, a larger num_workers only brings a 10% speed improvement.
Could you tell me what [time] value you see in the log line "retrain_s0 - It ******* [TRAIN] [time ]: ?"? In my case, it is a little over 1.0.
Sorry, the speed improvement should be about 25% (num_workers=16, 1x 3090, --nproc_per_node=1, bs=16).
Log: retrain_s0 - It 51300 [TRAIN] [time ]: 1.0771173
With 1x 3090 I am getting around 0.7 for [time].
2x 2080Ti should be faster than 1x 3090.
Hmm, it's actually 0.7 around the start of training and stabilizes around 0.5.
I compared 2x 2080Ti with 1x 3090, and the result is that 1x 3090 is a little faster than 2x 2080Ti.
If [time] is around 0.5, s0 needs about 45 hours, right?
In my case, s3 does take 30 hours. So I want to confirm: for "Regular training without BL30K takes around 30 hours" (in the paper), does the 30 hours refer to s3 or to s0+s3?
It refers to s0+s3. I guess hardware infrastructure affects the training speed a lot.
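The estimate discussed above ([time] of around 0.5 implying roughly 45 hours for s0) is simple arithmetic: total time is the number of iterations times the seconds per iteration. The sketch below assumes that [time] in the log is average wall-clock seconds per iteration, which this thread does not confirm. The function names are hypothetical, and the iteration count is not taken from the training config; it is only what the 45-hour figure would imply.

```python
def estimate_hours(iterations: float, seconds_per_iteration: float) -> float:
    """Total training time in hours for a given iteration count and speed."""
    return iterations * seconds_per_iteration / 3600.0


def implied_iterations(hours: float, seconds_per_iteration: float) -> float:
    """Iteration count implied by a total time and a per-iteration speed."""
    return hours * 3600.0 / seconds_per_iteration


# Iterations implied by "45 hours at [time] = 0.5" (an inference, not a config value).
s0_iterations = implied_iterations(45, 0.5)

# The same number of iterations at the slower [time] = 1.0771173 from the log above.
slow_hours = estimate_hours(s0_iterations, 1.0771173)
print(f"{s0_iterations:.0f} iterations, about {slow_hours:.1f} h at the slower speed")
```

Under these assumptions the slower setup comes out at roughly 97 hours, which is in line with the figure of about 100h for s0 mentioned near the top of the thread.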
OK. Thank you!
May I ask what the training time for stage 0 was after you increased num_workers? (num_workers=16, 1x 3090, --nproc_per_node=1, bs=16). @BWYWTB