Comments (5)
@zydou I mean that some of the CUDA kernels used by cuDNN or torch's C implementation can be non-deterministic. One reason is that floating-point addition is not associative. You can try in Python: 0.7 + 0.2 + 0.1 == 0.7 + 0.1 + 0.2 will print False. This implies that a reduce op run with multiple threads / processes is non-deterministic.
When the batch size is 1, I suspect there is no need to call the reduce op, which leads to the same result.
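The non-associativity mentioned above can be reproduced in any Python REPL; the two groupings round differently, so a parallel reduce kernel that changes the accumulation order can change the result in the last bits:

```python
# Floating-point addition is not associative: the two groupings below
# round differently, so a reduce that changes accumulation order can
# change the result.
lhs = 0.7 + 0.2 + 0.1   # evaluated as (0.7 + 0.2) + 0.1
rhs = 0.7 + 0.1 + 0.2   # evaluated as (0.7 + 0.1) + 0.2
print(lhs == rhs)       # False
print(lhs)              # 0.9999999999999999
print(rhs)              # 1.0
```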
from open-reid.
@zydou Thank you very much for the thorough investigation! I think your modification is correct. I suspect the reason the final performance still differs is that GPU computation is inherently non-deterministic. Could you please try running the experiment on a single CPU core?
@zydou You could run with the argument -j 0, which will use a single thread.
I have tried it myself. When using the GPU, I found that the losses of the first several iterations are the same across different trials, but they diverge afterwards, leading to different final results. For example, the first trial could be
Epoch: [0][1/27] Time 2.252 (2.252) Data 0.029 (0.029) Loss 5.377 (5.377) Prec 0.00% (0.00%)
Epoch: [0][2/27] Time 0.268 (1.260) Data 0.022 (0.026) Loss 5.382 (5.379) Prec 0.00% (0.00%)
Epoch: [0][3/27] Time 0.224 (0.915) Data 0.020 (0.024) Loss 5.432 (5.397) Prec 0.00% (0.00%)
Epoch: [0][4/27] Time 0.259 (0.751) Data 0.020 (0.023) Loss 5.431 (5.405) Prec 0.00% (0.00%)
Epoch: [0][5/27] Time 0.260 (0.652) Data 0.020 (0.022) Loss 5.464 (5.417) Prec 0.00% (0.00%)
Epoch: [0][6/27] Time 0.258 (0.587) Data 0.020 (0.022) Loss 5.553 (5.440) Prec 0.00% (0.00%)
While the second trial is
Epoch: [0][1/27] Time 2.229 (2.229) Data 0.029 (0.029) Loss 5.377 (5.377) Prec 0.00% (0.00%)
Epoch: [0][2/27] Time 0.273 (1.251) Data 0.022 (0.026) Loss 5.382 (5.379) Prec 0.00% (0.00%)
Epoch: [0][3/27] Time 0.219 (0.907) Data 0.020 (0.024) Loss 5.432 (5.397) Prec 0.00% (0.00%)
Epoch: [0][4/27] Time 0.261 (0.745) Data 0.020 (0.023) Loss 5.431 (5.405) Prec 0.00% (0.00%)
Epoch: [0][5/27] Time 0.259 (0.648) Data 0.020 (0.022) Loss 5.463 (5.417) Prec 0.00% (0.00%)
Epoch: [0][6/27] Time 0.259 (0.583) Data 0.020 (0.022) Loss 5.557 (5.440) Prec 0.00% (0.00%)
But if using the CPU (you may need to remove the .cuda() and DataParallel calls in the code), it will always lead to the same results. This verifies that GPU computation is inherently non-deterministic.
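The order effect behind this can be sketched without a GPU. The snippet below (an illustration, not open-reid code) reduces the same float32 numbers two ways: a strict left-to-right loop, like a single CPU thread, versus NumPy's pairwise tree summation, which is closer to the grouping a parallel GPU reduce kernel uses:

```python
import numpy as np

# Same float32 data, two reduction orders.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

seq = np.float32(0.0)
for v in x:          # strictly left-to-right accumulation
    seq = seq + v

tree = x.sum()       # NumPy reduces via pairwise (tree) summation

print(seq == tree)   # typically False: rounding differs with the order
```

Neither answer is "wrong"; both are within rounding error of the exact sum, which is why the losses above agree for the first iterations and only drift apart as the differences accumulate through training.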
@Cysu Hi, Tong Xiao. Thanks for your reply! I did a few more experiments below (using the numpy.random version of transform.py in all experiments):
- On CPU: Following your suggestion, I removed the .cuda() and DataParallel calls from the code and ran python examples/softmax_loss.py -d viper -b 64 -j 2 -a resnet50 --logs-dir logs/softmax-loss/viper-resnet50. I then got the same results each time.
- On GPU: When running on the GPU (without removing .cuda() and DataParallel), the results differ as discussed above. But when setting the batch size to 1, it also leads to the same results each time. For example: python examples/softmax_loss.py -d viper -b 1 -j 2 -a resnet50 --logs-dir logs/softmax-loss/viper-resnet50
So I don't agree that GPU computation is inherently non-deterministic. But I can't explain why this happens. Do you know the reason? Thanks a lot!
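The batch-size-1 observation is actually consistent with the reduce-order explanation given earlier in the thread: with a single sample there is only one term to reduce, so every accumulation order produces bit-identical output. A small sketch (hypothetical values, not the actual losses):

```python
import itertools

def reduce_left_to_right(vals):
    # One fixed accumulation order over a tuple of floats.
    total = 0.0
    for v in vals:
        total += v
    return total

# A "batch" of one sample: every reduction order is the same single term.
single = {reduce_left_to_right(p) for p in itertools.permutations([0.7])}
print(single)       # exactly one value

# A batch of three: different orders can round differently.
triple = {reduce_left_to_right(p) for p in itertools.permutations([0.7, 0.2, 0.1])}
print(triple)       # more than one distinct value
```

So the GPU is not random in itself; it is the unspecified reduction order in parallel kernels that makes batched runs diverge, and batch size 1 removes that ambiguity.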
@Cysu Thanks a lot!
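For later readers: PyTorch exposes flags that reduce this run-to-run variance. A minimal sketch, assuming a reasonably recent PyTorch (the helper name is ours; torch.manual_seed and the cuDNN flags are real PyTorch APIs):

```python
import torch

def make_deterministic(seed=0):
    # Illustrative helper, not part of open-reid.
    torch.manual_seed(seed)               # seed the CPU RNG
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # seed every GPU's RNG
    # Ask cuDNN for deterministic kernels and disable the autotuner,
    # which can otherwise pick different algorithms on each run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

make_deterministic(42)
```

Note that even with these flags, a few CUDA ops (e.g. some atomicAdd-based scatter/index kernels) remain non-deterministic, so this narrows but may not fully close the gap discussed in this thread.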
from open-reid.