Comments (6)
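This is not an error. The process is most likely waiting for 10.10.11.51 to respond.
What command did you run on the two machines? Could you paste it here?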
from plsc.
TRAINER_IP_LIST=10.10.11.50,10.10.11.51
CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --ips=$TRAINER_IP_LIST --gpus=$CUDA_VISIBLE_DEVICES tools/train.py \
    --config_file configs/ms1mv3_r50.py \
    --is_static False \
    --backbone FresResNet50 \
    --classifier LargeScaleClassifier \
    --embedding_size 512 \
    --model_parallel True \
    --dropout 0.0 \
    --sample_ratio 0.1 \
    --loss ArcFace \
    --batch_size 128 \
    --dataset MS1M_v3 \
    --num_classes 93431 \
    --data_dir MS1M_v3/ \
    --label_file MS1M_v3/label.txt \
    --is_bin False \
    --log_interval_step 100 \
    --validation_interval_step 2000 \
    --fp16 True \
    --use_dynamic_loss_scaling True \
    --init_loss_scaling 27648.0 \
    --num_workers 8 \
    --train_unit 'epoch' \
    --warmup_num 0 \
    --train_num 25 \
    --decay_boundaries "10,16,22" \
    --output MS1M_v3_arcface_dynamic_0.1_NHWC_FP16
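Are these two machines in the same cluster environment? Have you run multi-machine training jobs on them before? The command looks fine. Could it be a network connectivity problem? Are these IP addresses the actual addresses in your environment?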
The network is reachable. I have not run multi-machine training jobs before.
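A host that answers ping can still have the trainers' TCP ports blocked between the two machines, so a port-level check can help rule the network in or out. The following is a minimal sketch under that assumption; the helper name `can_connect` is hypothetical, and the port to test is not given anywhere in this thread, so use whichever port your own launch log reports.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Run it on each machine against the other machine's IP, for example `can_connect("10.10.11.51", port)` from 10.10.11.50. A `False` result only points to a network problem if a process is actually listening on that port on the remote side at the time of the check.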
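Are you sure you ran the launch command above on each of the two machines?
For multi-machine training, the launch command has to be run on every machine.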
Oh, I see. I'll give it a try.
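Because the identical launch command has to be started on every node, one way to avoid missing a machine is to build the command once and wrap it per host. The sketch below only constructs the command lists. Actually running them over `ssh` assumes passwordless SSH and the same working directory on both hosts, neither of which is stated in this thread. The helper names and the `work_dir` parameter are hypothetical; only `--ips`, `--gpus` and `tools/train.py` come from the command posted above.

```python
import shlex


def build_launch_command(ips, gpus, train_args):
    """Build the paddle.distributed.launch command that every node must run."""
    cmd = [
        "python", "-m", "paddle.distributed.launch",
        "--ips=" + ",".join(ips),
        "--gpus=" + ",".join(str(g) for g in gpus),
        "tools/train.py",
    ]
    # Append the training script's own arguments as "--key value" pairs.
    for key, value in train_args.items():
        cmd += ["--" + key, str(value)]
    return cmd


def per_host_ssh_commands(ips, gpus, train_args, work_dir):
    """Return {ip: ssh argv}; every host receives exactly the same launch command."""
    launch = " ".join(
        shlex.quote(part) for part in build_launch_command(ips, gpus, train_args)
    )
    # work_dir is a hypothetical path to the PLSC checkout on each host.
    remote = "cd " + shlex.quote(work_dir) + " && " + launch
    return {ip: ["ssh", ip, remote] for ip in ips}
```

Each returned list can be passed to `subprocess.Popen`. Start them concurrently rather than one after another, since a node launched alone simply waits for its peer, which is the apparent hang discussed at the top of this thread.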