Comments (9)

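GuoxiaWang commented on May 18, 2024

@geoexploring

I took a look, and the model is not learning anything at all. I can't see an obvious cause yet. I see you are training on a single GPU, so you can troubleshoot with the following steps.
(1) Change sample_ratio: 0.1 to sample_ratio: 1.0 and try again, to rule out PartialFC as the cause first.

Also, which Paddle version are you using? I recommend the stable release 2.2.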

from plsc.

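geoexploring commented on May 18, 2024

@GuoxiaWang, thank you for the prompt reply.

I am training on Baidu AI Studio, and the Paddle version is paddlepaddle-gpu 2.2.2.post101.

Following your suggestion, I changed sample_ratio to 1.0, but the following still happens:

When the LR becomes too small, the loss turns into nan. In particular, whenever the LR is reduced to one tenth of its previous value (for example, from 0.025 to 0.0025, or from 0.1 to 0.01), the loss becomes nan;
The results of evaluation on the validation set show the same behavior.

In addition, here is the error output: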

Training: 2022-03-24 15:02:19,865 - loss nan, lr: 0.010000, epoch: 11, step: 18000, eta: 1.72 hours, throughput: 433.87 imgs/sec
testing verification..
[[ 0.  0.  0. ...  0.  0.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]
 ...
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]]
Traceback (most recent call last):
  File "/home/aistudio/PLSC/train_aio.py", line 304, in <module>
    train(args)
  File "/home/aistudio/PLSC/dynamic/train_aio.py", line 228, in train
    best_metric = callback_verification(global_step, backbone)
  File "/home/aistudio/PLSC/dynamic/utils/verification_aio.py", line 211, in __call__
    best_metric = self.ver_test(backbone, num_update)
  File "/home/aistudio/PLSC/dynamic/utils/verification_aio.py", line 143, in ver_test
    nfolds=10)
  File "<decorator-gen-287>", line 2, in test
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 351, in _decorate_function
    return func(*args, **kwargs)
  File "/home/aistudio/PLSC/dynamic/utils/verification_aio.py", line 89, in test
    embeddings = sklearn.preprocessing.normalize(embeddings)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 1905, in normalize
    estimator='the normalize function', dtype=FLOAT_DTYPES)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 721, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/sklearn/utils/validation.py", line 106, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Thank you!
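
A note on the traceback above: the ValueError is raised by sklearn.preprocessing.normalize, which rejects input containing NaN or infinity, so the failure only reports that the backbone already produced non-finite embeddings. A minimal sketch of a check that could run before normalization is shown below. The helper name find_non_finite_rows and the sample values are hypothetical and not part of PLSC; the sketch uses plain Python lists so it runs without extra dependencies.

```python
import math


def find_non_finite_rows(embeddings):
    """Return the indices of rows that contain NaN or +/-inf."""
    bad_rows = []
    for i, row in enumerate(embeddings):
        if any(not math.isfinite(x) for x in row):
            bad_rows.append(i)
    return bad_rows


# Hypothetical embeddings that mimic the printed matrix:
# some all-zero rows, one normal row, and some non-finite rows.
embeddings = [
    [0.0, 0.0, 0.0],
    [0.1, -0.2, 0.3],
    [float("nan"), float("nan"), float("nan")],
    [1.0, float("inf"), 0.0],
]

bad = find_non_finite_rows(embeddings)
if bad:
    print(f"{len(bad)} of {len(embeddings)} embedding rows are non-finite: {bad}")
```

All-zero rows pass this check because zero is finite, so they would need to be inspected separately.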

GuoxiaWang commented on May 18, 2024

Is your dataset accessible? I could use your data and your config to try to reproduce the problem.

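geoexploring commented on May 18, 2024

@GuoxiaWang, thank you!

This involves company business data, and the way the data is organized is quite cumbersome, so it could take up a lot of your time. I will look into it further myself.

Many thanks!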

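GuoxiaWang commented on May 18, 2024

@geoexploring You could first try a public dataset with your config. If the public dataset also shows the problem, then there is a bug in the code. If the public dataset works fine, then the problem is in your data processing.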

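geoexploring commented on May 18, 2024

@GuoxiaWang, thank you!

Our dataset belongs to a different type of problem, and there is no public dataset for it yet. Thanks for the suggestion. I will check again whether anything is wrong with the network architecture or the data loading.

Thank you!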

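geoexploring commented on May 18, 2024

@GuoxiaWang

I noticed something new: when training with FP16, the loss no longer turns into nan midway through training as described above. However, the following messages appear frequently: Found inf or nan of distributed parameter, dtype is paddle.float16; Found inf or nan, current scale is: 13824.0, decrease to: 13824.0*0.5. The loss also decreases much more slowly than with FP32. What is the reason for this?

Thank you!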

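GuoxiaWang commented on May 18, 2024

@geoexploring

Found inf or nan of distributed parameter, dtype is paddle.float16; Found inf or nan, current scale is: 13824.0, decrease to: 13824.0*0.5

This is expected; I print it on purpose. FP16 training uses a mechanism called loss scaling. The message above means that the gradients became nan/inf during computation in the model-parallel FC layer. When that happens, the update for the current step is skipped and the loss scaling factor is halved, then training continues with the next step. Once 2000 steps have passed without nan/inf, the loss scaling factor is doubled again.

As for the loss decreasing much more slowly than with FP32, I think you should finish training first and then check. If the accuracy on the validation set at the end of training is reasonable, then the run is fine.
In my experience, the loss with FP16 training is usually somewhat higher than with FP32, because FP16 has lower precision than FP32.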
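
The loss scaling behavior described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the PLSC or Paddle implementation: the class name DynamicLossScaler and its parameters are hypothetical, and it only models the bookkeeping of the scale (skip and halve on nan/inf, double after a run of clean steps), not the actual gradient scaling or optimizer update.

```python
import math


class DynamicLossScaler:
    """Hypothetical sketch of dynamic loss scaling bookkeeping."""

    def __init__(self, init_scale=13824.0, growth_interval=2000, factor=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval  # clean steps needed before growing
        self.factor = factor
        self.good_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied, False if skipped."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow detected: skip this step and shrink the scale.
            self.scale /= self.factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # Enough consecutive clean steps: grow the scale again.
            self.scale *= self.factor
            self.good_steps = 0
        return True


# Short demo with growth_interval=3 instead of 2000 to keep it brief.
scaler = DynamicLossScaler(init_scale=13824.0, growth_interval=3)
applied = scaler.update([0.5, float("inf")])   # overflow: step skipped
print(applied, scaler.scale)                   # False 6912.0
for _ in range(3):
    scaler.update([0.1, -0.2])                 # three clean steps
print(scaler.scale)                            # 13824.0
```

Frequent skip messages under this scheme mean that many steps apply no update at all, which is one possible reason for the loss decreasing more slowly.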

geoexploring commented on May 18, 2024

@GuoxiaWang, thank you for the quick reply!

That is indeed a nice design. I will email you about my other questions. Thank you!
