Comments (5)
@zydou I mean that some of the CUDA kernels used by cuDNN or torch's C implementation can be non-deterministic. One reason is that floating-point addition is not associative. You can try in Python: 0.7 + 0.2 + 0.1 == 0.7 + 0.1 + 0.2 will print False. This implies that a reduce op run with multiple threads / processes is non-deterministic.
When the batch size is 1, I suspect there is no need to call the reduce op, which leads to the same result.
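The non-associativity mentioned above can be reproduced in any Python REPL; the two groupings round differently, so a parallel reduce kernel that changes the accumulation order can change the result in the last bits:

```python
# Floating-point addition is not associative: the two groupings below
# round differently, so a reduce that changes accumulation order can
# change the result.
lhs = 0.7 + 0.2 + 0.1   # evaluated as (0.7 + 0.2) + 0.1
rhs = 0.7 + 0.1 + 0.2   # evaluated as (0.7 + 0.1) + 0.2
print(lhs == rhs)       # False
print(lhs)              # 0.9999999999999999
print(rhs)              # 1.0
```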
from open-reid.
@zydou Thank you very much for the thorough investigation! I think your modification is correct. I suspect the reason the final performance still differs is that GPU computation is inherently non-deterministic. Could you please try running the experiment on a single CPU core?
@zydou You could run with the argument -j 0, which will use a single thread.
I have tried it myself. When using the GPU, I found that the losses of the first several iterations are the same across different trials, but they diverge afterwards, leading to different final results. For example, the first trial could be
Epoch: [0][1/27] Time 2.252 (2.252) Data 0.029 (0.029) Loss 5.377 (5.377) Prec 0.00% (0.00%)
Epoch: [0][2/27] Time 0.268 (1.260) Data 0.022 (0.026) Loss 5.382 (5.379) Prec 0.00% (0.00%)
Epoch: [0][3/27] Time 0.224 (0.915) Data 0.020 (0.024) Loss 5.432 (5.397) Prec 0.00% (0.00%)
Epoch: [0][4/27] Time 0.259 (0.751) Data 0.020 (0.023) Loss 5.431 (5.405) Prec 0.00% (0.00%)
Epoch: [0][5/27] Time 0.260 (0.652) Data 0.020 (0.022) Loss 5.464 (5.417) Prec 0.00% (0.00%)
Epoch: [0][6/27] Time 0.258 (0.587) Data 0.020 (0.022) Loss 5.553 (5.440) Prec 0.00% (0.00%)
While the second trial is
Epoch: [0][1/27] Time 2.229 (2.229) Data 0.029 (0.029) Loss 5.377 (5.377) Prec 0.00% (0.00%)
Epoch: [0][2/27] Time 0.273 (1.251) Data 0.022 (0.026) Loss 5.382 (5.379) Prec 0.00% (0.00%)
Epoch: [0][3/27] Time 0.219 (0.907) Data 0.020 (0.024) Loss 5.432 (5.397) Prec 0.00% (0.00%)
Epoch: [0][4/27] Time 0.261 (0.745) Data 0.020 (0.023) Loss 5.431 (5.405) Prec 0.00% (0.00%)
Epoch: [0][5/27] Time 0.259 (0.648) Data 0.020 (0.022) Loss 5.463 (5.417) Prec 0.00% (0.00%)
Epoch: [0][6/27] Time 0.259 (0.583) Data 0.020 (0.022) Loss 5.557 (5.440) Prec 0.00% (0.00%)
But if using the CPU (you may need to remove the .cuda() and DataParallel calls in the code), it will always lead to the same results. This verifies that GPU computation is inherently non-deterministic.
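The order effect behind this can be sketched without a GPU. The snippet below (an illustration, not open-reid code) reduces the same float32 numbers two ways: a strict left-to-right loop, like a single CPU thread, versus NumPy's pairwise tree summation, which is closer to the grouping a parallel GPU reduce kernel uses:

```python
import numpy as np

# Same float32 data, two reduction orders.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

seq = np.float32(0.0)
for v in x:          # strictly left-to-right accumulation
    seq = seq + v

tree = x.sum()       # NumPy reduces via pairwise (tree) summation

print(seq == tree)   # typically False: rounding differs with the order
```

Neither answer is "wrong"; both are within rounding error of the exact sum, which is why the losses above agree for the first iterations and only drift apart as the differences accumulate through training.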
@Cysu Hi, Tong Xiao. Thanks for your reply! I did a few more experiments below (using the numpy.random version of transform.py in all experiments):
- On CPU: Following your suggestion, I removed the .cuda() and DataParallel calls from the code and ran python examples/softmax_loss.py -d viper -b 64 -j 2 -a resnet50 --logs-dir logs/softmax-loss/viper-resnet50. I then got the same results each time.
- On GPU: When running on the GPU (without removing .cuda() and DataParallel), the results differ as discussed above. But when setting the batch size to 1, it also leads to the same results each time. For example: python examples/softmax_loss.py -d viper -b 1 -j 2 -a resnet50 --logs-dir logs/softmax-loss/viper-resnet50
So I don't agree that GPU computation is inherently non-deterministic. But I can't explain why this happens. Do you know the reason? Thanks a lot!
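The batch-size-1 observation is actually consistent with the reduce-order explanation given earlier in the thread: with a single sample there is only one term to reduce, so every accumulation order produces bit-identical output. A small sketch (hypothetical values, not the actual losses):

```python
import itertools

def reduce_left_to_right(vals):
    # One fixed accumulation order over a tuple of floats.
    total = 0.0
    for v in vals:
        total += v
    return total

# A "batch" of one sample: every reduction order is the same single term.
single = {reduce_left_to_right(p) for p in itertools.permutations([0.7])}
print(single)       # exactly one value

# A batch of three: different orders can round differently.
triple = {reduce_left_to_right(p) for p in itertools.permutations([0.7, 0.2, 0.1])}
print(triple)       # more than one distinct value
```

So the GPU is not random in itself; it is the unspecified reduction order in parallel kernels that makes batched runs diverge, and batch size 1 removes that ambiguity.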
@Cysu Thanks a lot!
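For later readers: PyTorch exposes flags that reduce this run-to-run variance. A minimal sketch, assuming a reasonably recent PyTorch (the helper name is ours; torch.manual_seed and the cuDNN flags are real PyTorch APIs):

```python
import torch

def make_deterministic(seed=0):
    # Illustrative helper, not part of open-reid.
    torch.manual_seed(seed)               # seed the CPU RNG
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # seed every GPU's RNG
    # Ask cuDNN for deterministic kernels and disable the autotuner,
    # which can otherwise pick different algorithms on each run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

make_deterministic(42)
```

Note that even with these flags, a few CUDA ops (e.g. some atomicAdd-based scatter/index kernels) remain non-deterministic, so this narrows but may not fully close the gap discussed in this thread.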
from open-reid.