Comments (9)

jazzly avatar jazzly commented on September 21, 2024

The versions I am using are:

  • paddlepaddle: 2.6.1
  • paddlenlp: 2.8.0

from paddlenlp.

w5688414 avatar w5688414 commented on September 21, 2024

Take a look at your checkpoint/checkpoint-170 directory and check whether the tokenizer was actually saved there. A simple workaround is to drop this argument:

load_best_model_at_end
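
The directory check suggested above can be sketched in a few lines of Python. This is only an illustration, not PaddleNLP code: the helper name `find_tokenizer_files` and the candidate file names are assumptions, since the files a tokenizer writes depend on the model.

```python
from pathlib import Path

# Hypothetical candidate names; the real tokenizer files depend on the model.
TOKENIZER_FILES = ("tokenizer_config.json", "vocab.txt", "special_tokens_map.json")


def find_tokenizer_files(checkpoint_dir, candidates=TOKENIZER_FILES):
    """Return the candidate tokenizer files present in checkpoint_dir.

    Returns None when the directory itself does not exist.
    """
    path = Path(checkpoint_dir)
    if not path.is_dir():
        return None
    return [name for name in candidates if (path / name).is_file()]


if __name__ == "__main__":
    found = find_tokenizer_files("checkpoint/checkpoint-170")
    if found is None:
        print("checkpoint directory does not exist")
    elif not found:
        print("directory exists, but no candidate tokenizer files were found")
    else:
        print("tokenizer files:", found)
```

A `None` result means the directory is missing entirely, while an empty list means it exists but contains none of the candidate files.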

jazzly avatar jazzly commented on September 21, 2024

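> Take a look at your checkpoint/checkpoint-170 directory and check whether the tokenizer was actually saved there. A simple workaround is to drop this argument:
>
> load_best_model_at_end

The problem is that load_best_model_at_end is required in order to use early_stopping. By the time this error is raised, directories such as checkpoint-170 no longer exist. Looking at the worklog, I found that training had in fact already finished. My guess is that, because multiple processes were started, every process tries to run load_best_model_at_end, so only one of them succeeds and the rest presumably all fail.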

python3 -m paddle.distributed.launch --nproc_per_node=24

Is this the correct way to start multiple processes in CPU mode?

w5688414 avatar w5688414 commented on September 21, 2024

Training on CPU is not recommended because it is inefficient. For distributed training on GPU, see the documentation:

https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/launch_cn.html#launch

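--nproc_per_node: the number of processes to start on each node. For GPU training, it should be less than or equal to the number of GPUs in the system. For example, --nproc_per_node=8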
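
The constraint on --nproc_per_node can be checked before launching. The sketch below is a hypothetical helper, not part of Paddle: `check_nproc_per_node` is an invented name, and the device count is passed in explicitly so the function makes no assumption about how it is obtained.

```python
def check_nproc_per_node(requested, device_count):
    """Return requested if it is a valid per-node process count.

    Raises ValueError when requested is below 1 or exceeds device_count.
    """
    if requested < 1:
        raise ValueError("nproc_per_node must be at least 1")
    if requested > device_count:
        raise ValueError(
            f"nproc_per_node={requested} exceeds the {device_count} available devices"
        )
    return requested
```

For example, `check_nproc_per_node(8, 8)` returns 8, while `check_nproc_per_node(24, 8)` raises `ValueError`.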

jazzly avatar jazzly commented on September 21, 2024

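> Training on CPU is not recommended because it is inefficient. For distributed training on GPU, see the documentation:
>
> https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/launch_cn.html#launch
>
> --nproc_per_node: the number of processes to start on each node. For GPU training, it should be less than or equal to the number of GPUs in the system. For example, --nproc_per_node=8

I don't have a GPU available at the moment, so I am testing on CPU. The example task finishes training in just under 4 hours on 24 CPU cores, which is still usable. My question is: in CPU mode, if paddle.distributed.launch is not the way to go, how should multi-threaded or multi-process training be enabled correctly?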

w5688414 avatar w5688414 commented on September 21, 2024

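You can open an issue on the framework repository for this. CPU training is not a common scenario and is probably not supported. For distributed training, see the documentation: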

https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/06_distributed_training/index_cn.html

jazzly avatar jazzly commented on September 21, 2024

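> You can open an issue on the framework repository for this. CPU training is not a common scenario and is probably not supported. For distributed training, see the documentation:
>
> https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/06_distributed_training/index_cn.html

OK, understood. Thanks.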

github-actions avatar github-actions commented on September 21, 2024

This issue is stale because it has been open for 60 days with no activity.

github-actions avatar github-actions commented on September 21, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
