Comments (9)

ds2268 commented on June 9, 2024

I have also tried the parameters from the paper (batch size 2048, lr=3e-8, etc.). The fine-tuning still explodes (the loss quickly drops to ~0 and then goes to NaN).

```
[12-07 18:37:04] (nstream_imagenet/main.py, line 174)=> [ep0 it  3/626]    L: 0.6937    Acc: 0.00    lr: 3.1e-05~3.8e-04    Remain: 3:26:47
[12-07 18:40:10] (nstream_imagenet/main.py, line 174)=> [ep0 it313/626]    L: 0.0078    Acc: 0.00    lr: 5.5e-04~6.7e-03    Remain: 0:04:24
[12-07 18:43:23] (nstream_imagenet/main.py, line 174)=> [ep0 it625/626]    L: 0.0059    Acc: 9.72    lr: 1.1e-03~1.3e-02    Remain: 0:00:00
[12-07 18:44:04] (nstream_imagenet/main.py, line  84)=> [ep0/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 500.25s,   Ev cost: 23.38,    Remain: 1 day, 17:32:55,    Finish @ 12-09 05:16
[12-07 18:44:06] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(1)]
[12-07 18:44:13] (nstream_imagenet/main.py, line 174)=> [ep1 it  3/626]    L: 0.0059    Acc: 15.62    lr: 1.1e-03~1.3e-02    Remain: 0:18:02
[12-07 18:47:18] (nstream_imagenet/main.py, line 174)=> [ep1 it313/626]    L: 0.0055    Acc: 21.09    lr: 1.6e-03~1.9e-02    Remain: 0:03:11
[12-07 18:50:15] (nstream_imagenet/main.py, line 174)=> [ep1 it625/626]    L: 0.0056    Acc: 23.61    lr: 2.1e-03~2.6e-02    Remain: 0:00:00
[12-07 18:50:15] (nstream_imagenet/main.py, line  84)=> [ep1/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 370.16s,   Ev cost: -,    Remain: 1 day, 6:38:28,    Finish @ 12-08 18:28
[12-07 18:50:17] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(2)]
[12-07 18:50:28] (nstream_imagenet/main.py, line 174)=> [ep2 it  3/626]    L: 0.0055    Acc: 23.44    lr: 2.1e-03~2.6e-02    Remain: 0:29:35
[12-07 18:53:36] (nstream_imagenet/main.py, line 174)=> [ep2 it313/626]    L: 0.0071    Acc: 13.28    lr: 2.6e-03~3.2e-02    Remain: 0:03:18
[12-07 18:56:33] (nstream_imagenet/main.py, line 174)=> [ep2 it625/626]    L: 0.0069    Acc: 5.56    lr: 3.2e-03~3.9e-02    Remain: 0:00:00
[12-07 18:56:33] (nstream_imagenet/main.py, line  84)=> [ep2/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 376.92s,   Ev cost: -,    Remain: 1 day, 7:05:45,    Finish @ 12-08 19:02
[12-07 18:56:34] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(3)]
[12-07 18:56:48] (nstream_imagenet/main.py, line 174)=> [ep3 it  3/626]    L: 0.0077    Acc: 0.78    lr: 3.2e-03~3.9e-02    Remain: 0:34:59
[12-07 18:59:55] (nstream_imagenet/main.py, line 174)=> [ep3 it313/626]    L: 62.9384    Acc: 0.00    lr: 3.7e-03~4.5e-02    Remain: 0:03:20
[12-07 19:02:52] (nstream_imagenet/main.py, line 174)=> [ep3 it625/626]    L: 317.5974    Acc: 0.00    lr: 4.2e-03~5.1e-02    Remain: 0:00:00
[12-07 19:02:52] (nstream_imagenet/main.py, line  84)=> [ep3/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 378.86s,   Ev cost: -,    Remain: 1 day, 7:09:03,    Finish @ 12-08 19:11
[12-07 19:03:08] (nstream_imagenet/main.py, line 174)=> [ep4 it  3/626]    L: 267.8481    Acc: 0.00    lr: 4.2e-03~5.1e-02    Remain: 0:38:13
[12-07 19:06:16] (nstream_imagenet/main.py, line 174)=> [ep4 it313/626]    L: 352016.5938    Acc: 0.00    lr: 4.7e-03~5.8e-02    Remain: 0:03:21
[12-07 19:09:15] (nstream_imagenet/main.py, line 174)=> [ep4 it625/626]    L: 3266225152.0000    Acc: 0.00    lr: 5.3e-03~6.4e-02    Remain: 0:00:00
[12-07 19:09:15] (nstream_imagenet/main.py, line  84)=> [ep4/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 382.58s,   Ev cost: -,    Remain: 1 day, 7:21:01,    Finish @ 12-08 19:30
[12-07 19:09:31] (nstream_imagenet/main.py, line 174)=> [ep5 it  3/626]    L: 3494824192.0000    Acc: 0.00    lr: 5.3e-03~6.4e-02    Remain: 0:38:32
[12-07 19:12:40] (nstream_imagenet/main.py, line 174)=> [ep5 it313/626]    L: nan    Acc: 1.56    lr: 5.3e-03~6.4e-02    Remain: 0:03:22
[12-07 19:15:39] (nstream_imagenet/main.py, line 174)=> [ep5 it625/626]    L: nan    Acc: 0.00    lr: 5.3e-03~6.4e-02    Remain: 0:00:00
```

keyu-tian commented on June 9, 2024

Hi @ds2268, the 800-epoch pre-training looks normal. The fine-tuning loss before the explosion (~5e-3, close to zero) is also expected, since we use BCE loss instead of CE. (ps: we never observed any loss explosion in any of our fine-tuning experiments)
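For intuition, here is a minimal sketch (not SparK's actual training code) of the scale difference between the two losses: BCE-with-logits is averaged over all 1000 class logits, most of which are easy negatives. The very first logged loss above (0.6937) is almost exactly ln 2, which is what BCE gives for near-zero logits, and it then shrinks toward ~5e-3 as the negatives are pushed down:

```python
# Minimal sketch (not SparK's code): BCE vs. CE loss scale on ImageNet-1k.
import torch
import torch.nn.functional as F

batch, num_classes = 8, 1000
logits = torch.zeros(batch, num_classes)          # near-initialization outputs
labels = torch.randint(0, num_classes, (batch,))
targets = F.one_hot(labels, num_classes).float()  # BCE needs one-hot floats

ce = F.cross_entropy(logits, labels)                       # ln(1000) ~ 6.908
bce = F.binary_cross_entropy_with_logits(logits, targets)  # ln(2)    ~ 0.693
print(ce.item(), bce.item())
```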

Have you used mixed precision?

I also found that the default batch size should be 2048; maybe you can try that as well.
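For reference, a typical torch.cuda.amp fine-tuning step looks like the sketch below (hypothetical, not SparK's actual loop). fp16 overflow is a classic source of sudden NaNs, which is why mixed precision is worth ruling out; GradScaler also skips the optimizer step whenever it detects inf/NaN gradients:

```python
# Hypothetical AMP fine-tuning step; names are illustrative, not SparK's API.
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, images, targets, optimizer, criterion):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():              # forward/loss in fp16
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()                # scale to avoid fp16 underflow
    scaler.unscale_(optimizer)                   # so clipping sees true grads
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    scaler.step(optimizer)                       # skipped on inf/NaN grads
    scaler.update()
    return loss.item()
```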

ds2268 commented on June 9, 2024

I have tried the batch-size-2048 config from the paper, with no success. I don't think the downstream ImageNet code uses mixed precision; I could only find the apex libs in the downstream mmdet code.

keyu-tian commented on June 9, 2024

Could you try running with timm==0.5.4?

ds2268 commented on June 9, 2024

I am already running with:

timm 0.5.44
torch 1.12.0
torchvision 0.13.1

ds2268 commented on June 9, 2024

Looks like the issue with ResNet-50 is related to #27

keyu-tian commented on June 9, 2024

Honestly, I have no idea what the problem with the fine-tuning code is (yes, #27 looks similar). Maybe you can try again with base_lr < 0.002; I will run this too.
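For context on what base_lr means here: many ImageNet recipes scale the actual learning rate linearly with the global batch size (SparK's exact rule may differ; downstream_imagenet/args.py is authoritative), and layer-wise lr decay then spreads it into the "lr: a~b" ranges seen in the log:

```python
# Hypothetical illustration of the common linear-scaling rule
# (lr = base_lr * batch / 256, Goyal et al. 2017); SparK's exact rule
# may differ -- check downstream_imagenet/args.py.
def scaled_lr(base_lr: float, global_batch_size: int) -> float:
    return base_lr * global_batch_size / 256

print(scaled_lr(0.002, 2048))  # 0.016 peak lr; layer-wise decay yields a range
```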

ds2268 commented on June 9, 2024

@keyu-tian, I have now pre-trained a ConvNeXt-S model (800 epochs) and performed ImageNet fine-tuning:

[image: ConvNeXt-S ImageNet fine-tuning progress]

It's not finished yet (140 of 200 epochs), but fine-tuning looks stable on ConvNeXt-S. The reported result for ConvNeXt-S is 84.1; I will probably not reach that by epoch 200, but that is likely because I pre-trained for only 800 epochs.

[image: reported ConvNeXt-S results]

So the problem really is just the fine-tuning stability of ResNet-50.

keyu-tian commented on June 9, 2024

@ds2268 thanks for the verification. So it should be LAMB or BCE that is causing the problem.

Currently I don't have enough GPUs or time to debug further. You can start with ConvNeXt, try a smaller fine-tuning learning rate for ResNet-50, or try ResNet-101.

ps: it is always recommended to use the default hyperparameters in downstream_imagenet/args.py, not those from the paper (which may be outdated) or elsewhere.
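If someone wants to isolate the culprit, a simple A/B harness (hypothetical, not part of the repo) is to fine-tune a few hundred iterations with each optimizer/loss pair and see which combination first reproduces the blow-up:

```python
# Hypothetical ablation harness: LAMB vs. AdamW, BCE vs. CE, on ResNet-50.
import torch
import torch.nn as nn
import torchvision
from timm.optim import Lamb  # assumes timm 0.5.x, which ships a Lamb impl

def make_variant(opt_name: str, loss_name: str):
    model = torchvision.models.resnet50(num_classes=1000)
    opt = {"lamb":  Lamb(model.parameters(), lr=2e-3),
           "adamw": torch.optim.AdamW(model.parameters(), lr=2e-3)}[opt_name]
    # note: BCEWithLogitsLoss needs one-hot float targets, CE takes indices
    crit = {"bce": nn.BCEWithLogitsLoss(), "ce": nn.CrossEntropyLoss()}[loss_name]
    return model, opt, crit

for opt_name, loss_name in [("lamb", "bce"), ("adamw", "bce"), ("lamb", "ce")]:
    model, opt, crit = make_variant(opt_name, loss_name)
    # ...run a few hundred fine-tuning iterations and watch for NaNs...
```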
