Comments (9)
The versions in use are:
- paddlepaddle: 2.6.1
- paddlenlp: 2.8.0
from paddlenlp.
Take a look at your checkpoint/checkpoint-170 directory and check whether the tokenizer was saved there. A simple workaround is to remove the parameter:
load_best_model_at_end
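The directory check suggested above can be sketched as follows. This is only a minimal sketch: the file names in TOKENIZER_FILES are assumptions about what a saved tokenizer typically contains, not a list confirmed in this thread, and missing_tokenizer_files is a hypothetical helper.

```python
from pathlib import Path

# Assumed file names; adjust to whatever your tokenizer actually writes.
TOKENIZER_FILES = ("tokenizer_config.json", "vocab.txt")


def missing_tokenizer_files(checkpoint_dir, expected=TOKENIZER_FILES):
    """Return the expected tokenizer files that are absent from checkpoint_dir.

    If the directory itself does not exist, every expected file is reported.
    """
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return list(expected)
    return [name for name in expected if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_tokenizer_files("checkpoint/checkpoint-170")
    if missing:
        print("missing tokenizer files:", ", ".join(missing))
    else:
        print("tokenizer files found")
```

An empty result means every expected file is present. A full list means either the directory is gone or nothing was saved in it.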
from paddlenlp.
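Here is the situation: to use early_stopping, load_best_model_at_end is required. By the time this error is raised, directories such as checkpoint-170 no longer exist. Looking at the worklog, I found that training had actually finished. My guess is that this is caused by running multiple processes: every process tries to perform load_best_model_at_end, so only one of them succeeds and the others presumably all fail.
python3 -m paddle.distributed.launch --nproc_per_node=24
Is this the correct way to start multiple processes in CPU mode?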
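One way to test the guess above is to log, from every process, its rank and whether the checkpoint directory still exists just before the best model is loaded. The sketch below is an illustration, not PaddleNLP's implementation: it assumes the launcher exports the rank in the PADDLE_TRAINER_ID environment variable (falling back to 0 when it is unset), and describe_checkpoint is a hypothetical helper.

```python
import os
from pathlib import Path


def process_rank(environ=os.environ):
    """Rank of this process; assumes the launcher sets PADDLE_TRAINER_ID."""
    return int(environ.get("PADDLE_TRAINER_ID", "0"))


def describe_checkpoint(checkpoint_dir, environ=os.environ):
    """Build a one-line report: which rank looked, and what it found."""
    state = "exists" if Path(checkpoint_dir).is_dir() else "missing"
    return f"rank {process_rank(environ)}: {checkpoint_dir} {state}"


if __name__ == "__main__":
    print(describe_checkpoint("checkpoint/checkpoint-170"))
```

If some ranks report the directory as missing while one reports that it exists, that would be consistent with the processes racing on the same checkpoint, though it would not prove it.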
from paddlenlp.
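Training on CPU is not recommended because it is inefficient. For distributed training on GPU, see the documentation:
https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/launch_cn.html#launch
--nproc_per_node: the number of processes launched on each node. For GPU training it should be less than or equal to the number of GPUs in the system, for example --nproc_per_node=8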
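The rule above can be sketched as a small helper that caps the requested process count at the number of available GPUs before building the launch command. launch_command and train.py are hypothetical names used for illustration, and the GPU count is passed in by the caller rather than detected.

```python
def launch_command(script, requested_procs, gpu_count):
    """Build a paddle.distributed.launch command line.

    The process count is capped at gpu_count, following the rule that
    --nproc_per_node should not exceed the number of GPUs on the node.
    """
    if requested_procs < 1 or gpu_count < 1:
        raise ValueError("requested_procs and gpu_count must be at least 1")
    nproc = min(requested_procs, gpu_count)
    return [
        "python3", "-m", "paddle.distributed.launch",
        f"--nproc_per_node={nproc}",
        script,
    ]


if __name__ == "__main__":
    # train.py is a placeholder script name.
    print(" ".join(launch_command("train.py", requested_procs=24, gpu_count=8)))
```

With 24 requested processes and 8 GPUs, the helper emits --nproc_per_node=8, matching the example value in the documentation excerpt.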
from paddlenlp.
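I have no GPU at hand for now, so I am testing on CPU. The example task takes a little under 4 hours to train on 24 CPU cores, which is still usable. What I mean is: in CPU mode, if paddle.distributed.launch is not used, what is the correct way to enable multi-threaded or multi-process training?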
from paddlenlp.
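You can open an issue under the framework repository for this. CPU training is not a common scenario and is probably not supported. For distributed training, see the documentation:
https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/06_distributed_training/index_cn.html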
from paddlenlp.
OK, understood. Thanks.
from paddlenlp.
This issue is stale because it has been open for 60 days with no activity.
from paddlenlp.
This issue was closed because it has been inactive for 14 days since being marked as stale.
from paddlenlp.