Comments (9)

ds2268 commented on June 9, 2024

I have also tried the parameters from the paper (batch size 2048, lr=3e-8, etc.). The fine-tuning still explodes (the loss quickly drops to ~0 and then goes to NaN).

```
[12-07 18:37:04] (nstream_imagenet/main.py, line 174)=> [ep0 it  3/626]    L: 0.6937    Acc: 0.00    lr: 3.1e-05~3.8e-04    Remain: 3:26:47
[12-07 18:40:10] (nstream_imagenet/main.py, line 174)=> [ep0 it313/626]    L: 0.0078    Acc: 0.00    lr: 5.5e-04~6.7e-03    Remain: 0:04:24
[12-07 18:43:23] (nstream_imagenet/main.py, line 174)=> [ep0 it625/626]    L: 0.0059    Acc: 9.72    lr: 1.1e-03~1.3e-02    Remain: 0:00:00
[12-07 18:44:04] (nstream_imagenet/main.py, line  84)=> [ep0/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 500.25s,   Ev cost: 23.38,    Remain: 1 day, 17:32:55,    Finish @ 12-09 05:16
[12-07 18:44:06] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(1)]
[12-07 18:44:13] (nstream_imagenet/main.py, line 174)=> [ep1 it  3/626]    L: 0.0059    Acc: 15.62    lr: 1.1e-03~1.3e-02    Remain: 0:18:02
[12-07 18:47:18] (nstream_imagenet/main.py, line 174)=> [ep1 it313/626]    L: 0.0055    Acc: 21.09    lr: 1.6e-03~1.9e-02    Remain: 0:03:11
[12-07 18:50:15] (nstream_imagenet/main.py, line 174)=> [ep1 it625/626]    L: 0.0056    Acc: 23.61    lr: 2.1e-03~2.6e-02    Remain: 0:00:00
[12-07 18:50:15] (nstream_imagenet/main.py, line  84)=> [ep1/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 370.16s,   Ev cost: -,    Remain: 1 day, 6:38:28,    Finish @ 12-08 18:28
[12-07 18:50:17] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(2)]
[12-07 18:50:28] (nstream_imagenet/main.py, line 174)=> [ep2 it  3/626]    L: 0.0055    Acc: 23.44    lr: 2.1e-03~2.6e-02    Remain: 0:29:35
[12-07 18:53:36] (nstream_imagenet/main.py, line 174)=> [ep2 it313/626]    L: 0.0071    Acc: 13.28    lr: 2.6e-03~3.2e-02    Remain: 0:03:18
[12-07 18:56:33] (nstream_imagenet/main.py, line 174)=> [ep2 it625/626]    L: 0.0069    Acc: 5.56    lr: 3.2e-03~3.9e-02    Remain: 0:00:00
[12-07 18:56:33] (nstream_imagenet/main.py, line  84)=> [ep2/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 376.92s,   Ev cost: -,    Remain: 1 day, 7:05:45,    Finish @ 12-08 19:02
[12-07 18:56:34] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(3)]
[12-07 18:56:48] (nstream_imagenet/main.py, line 174)=> [ep3 it  3/626]    L: 0.0077    Acc: 0.78    lr: 3.2e-03~3.9e-02    Remain: 0:34:59
[12-07 18:59:55] (nstream_imagenet/main.py, line 174)=> [ep3 it313/626]    L: 62.9384    Acc: 0.00    lr: 3.7e-03~4.5e-02    Remain: 0:03:20
[12-07 19:02:52] (nstream_imagenet/main.py, line 174)=> [ep3 it625/626]    L: 317.5974    Acc: 0.00    lr: 4.2e-03~5.1e-02    Remain: 0:00:00
[12-07 19:02:52] (nstream_imagenet/main.py, line  84)=> [ep3/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 378.86s,   Ev cost: -,    Remain: 1 day, 7:09:03,    Finish @ 12-08 19:11
[12-07 19:03:08] (nstream_imagenet/main.py, line 174)=> [ep4 it  3/626]    L: 267.8481    Acc: 0.00    lr: 4.2e-03~5.1e-02    Remain: 0:38:13
[12-07 19:06:16] (nstream_imagenet/main.py, line 174)=> [ep4 it313/626]    L: 352016.5938    Acc: 0.00    lr: 4.7e-03~5.8e-02    Remain: 0:03:21
[12-07 19:09:15] (nstream_imagenet/main.py, line 174)=> [ep4 it625/626]    L: 3266225152.0000    Acc: 0.00    lr: 5.3e-03~6.4e-02    Remain: 0:00:00
[12-07 19:09:15] (nstream_imagenet/main.py, line  84)=> [ep4/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 382.58s,   Ev cost: -,    Remain: 1 day, 7:21:01,    Finish @ 12-08 19:30
[12-07 19:09:31] (nstream_imagenet/main.py, line 174)=> [ep5 it  3/626]    L: 3494824192.0000    Acc: 0.00    lr: 5.3e-03~6.4e-02    Remain: 0:38:32
[12-07 19:12:40] (nstream_imagenet/main.py, line 174)=> [ep5 it313/626]    L: nan    Acc: 1.56    lr: 5.3e-03~6.4e-02    Remain: 0:03:22
[12-07 19:15:39] (nstream_imagenet/main.py, line 174)=> [ep5 it625/626]    L: nan    Acc: 0.00    lr: 5.3e-03~6.4e-02    Remain: 0:00:00
```

keyu-tian commented on June 9, 2024

Hi @ds2268, the 800-epoch pre-training looks normal. The fine-tuning loss before the explosion (~5e-3, close to zero) is also expected, since we use BCE loss instead of CE. (ps: we never observed any loss explosion in any of our fine-tuning experiments)
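For intuition, here is a minimal sketch (not SparK's actual training code) of the scale difference between the two losses: BCE-with-logits is averaged over all 1000 class logits, most of which are easy negatives. The very first logged loss above (0.6937) is almost exactly ln 2, which is what BCE gives for near-zero logits, and it then shrinks toward ~5e-3 as the negatives are pushed down:

```python
# Minimal sketch (not SparK's code): BCE vs. CE loss scale on ImageNet-1k.
import torch
import torch.nn.functional as F

batch, num_classes = 8, 1000
logits = torch.zeros(batch, num_classes)          # near-initialization outputs
labels = torch.randint(0, num_classes, (batch,))
targets = F.one_hot(labels, num_classes).float()  # BCE needs one-hot floats

ce = F.cross_entropy(logits, labels)                       # ln(1000) ~ 6.908
bce = F.binary_cross_entropy_with_logits(logits, targets)  # ln(2)    ~ 0.693
print(ce.item(), bce.item())
```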

Have you used mixed precision?

I also found that the default batch size should be 2048; maybe you can try that as well.
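For reference, a typical torch.cuda.amp fine-tuning step looks like the sketch below (hypothetical, not SparK's actual loop). fp16 overflow is a classic source of sudden NaNs, which is why mixed precision is worth ruling out; GradScaler also skips the optimizer step whenever it detects inf/NaN gradients:

```python
# Hypothetical AMP fine-tuning step; names are illustrative, not SparK's API.
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, images, targets, optimizer, criterion):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():              # forward/loss in fp16
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()                # scale to avoid fp16 underflow
    scaler.unscale_(optimizer)                   # so clipping sees true grads
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    scaler.step(optimizer)                       # skipped on inf/NaN grads
    scaler.update()
    return loss.item()
```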

ds2268 commented on June 9, 2024

I have tried the batch-size-2048 config from the paper, with no success. I don't think the downstream ImageNet code uses mixed precision; I could only find the apex libs in the downstream mmdet code.

keyu-tian commented on June 9, 2024

Could you try running with timm==0.5.4?

ds2268 commented on June 9, 2024

I am already running with:

timm 0.5.44
torch 1.12.0
torchvision 0.13.1

ds2268 commented on June 9, 2024

Looks like the issue with ResNet-50 is related to #27

keyu-tian commented on June 9, 2024

Honestly, I have no idea what the problem with the fine-tuning code is (yes, #27 looks similar). Maybe you can try again with base_lr < 0.002; I will run this too.
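For context on what base_lr means here: many ImageNet recipes scale the actual learning rate linearly with the global batch size (SparK's exact rule may differ; downstream_imagenet/args.py is authoritative), and layer-wise lr decay then spreads it into the "lr: a~b" ranges seen in the log:

```python
# Hypothetical illustration of the common linear-scaling rule
# (lr = base_lr * batch / 256, Goyal et al. 2017); SparK's exact rule
# may differ -- check downstream_imagenet/args.py.
def scaled_lr(base_lr: float, global_batch_size: int) -> float:
    return base_lr * global_batch_size / 256

print(scaled_lr(0.002, 2048))  # 0.016 peak lr; layer-wise decay yields a range
```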

ds2268 commented on June 9, 2024

@keyu-tian, I have now pre-trained a ConvNeXt-S model (800 epochs) and performed ImageNet fine-tuning:

[image: ConvNeXt-S ImageNet fine-tuning progress]

It's not finished yet (140 of 200 epochs), but fine-tuning looks stable on ConvNeXt-S. The reported result for ConvNeXt-S is 84.1; I will probably not reach that by epoch 200, but that is likely because I pre-trained for only 800 epochs.

[image: reported ConvNeXt-S results]

So the problem really is just the fine-tuning stability of ResNet-50.

keyu-tian commented on June 9, 2024

@ds2268 thanks for the verification. So it should be LAMB or BCE that is causing the problem.

Currently I don't have enough GPUs or time to debug further. You can start with ConvNeXt, try a smaller fine-tuning learning rate for ResNet-50, or try ResNet-101.

ps: it is always recommended to use the default hyperparameters in downstream_imagenet/args.py, not those from the paper (which may be outdated) or elsewhere.
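If someone wants to isolate the culprit, a simple A/B harness (hypothetical, not part of the repo) is to fine-tune a few hundred iterations with each optimizer/loss pair and see which combination first reproduces the blow-up:

```python
# Hypothetical ablation harness: LAMB vs. AdamW, BCE vs. CE, on ResNet-50.
import torch
import torch.nn as nn
import torchvision
from timm.optim import Lamb  # assumes timm 0.5.x, which ships a Lamb impl

def make_variant(opt_name: str, loss_name: str):
    model = torchvision.models.resnet50(num_classes=1000)
    opt = {"lamb":  Lamb(model.parameters(), lr=2e-3),
           "adamw": torch.optim.AdamW(model.parameters(), lr=2e-3)}[opt_name]
    # note: BCEWithLogitsLoss needs one-hot float targets, CE takes indices
    crit = {"bce": nn.BCEWithLogitsLoss(), "ce": nn.CrossEntropyLoss()}[loss_name]
    return model, opt, crit

for opt_name, loss_name in [("lamb", "bce"), ("adamw", "bce"), ("lamb", "ce")]:
    model, opt, crit = make_variant(opt_name, loss_name)
    # ...run a few hundred fine-tuning iterations and watch for NaNs...
```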
