
PaddleClas's Issues

Delimiter for the data list file

It seems the delimiter of the data list file cannot be configured. The file names in my dataset contain spaces, so being able to use | as the delimiter would be very convenient.
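For reference, a minimal sketch of what a configurable delimiter could look like when parsing such a list file (this is an illustration, not the PaddleClas reader implementation):

```python
# Sketch: read a label list whose delimiter is configurable, so file names that
# contain spaces (e.g. "val/my picture 01.jpg|7") still split correctly.
def read_label_list(list_path, delimiter="|"):
    samples = []
    with open(list_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # rsplit keeps any delimiter characters inside the file name intact
            path, label = line.rsplit(delimiter, 1)
            samples.append((path, int(label)))
    return samples
```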

Error when downloading the pretrained models, please help take a look

!python tools/download.py -a ResNet50_vd -p ./pretrained -d True
!python tools/download.py -a ResNet50_vd_ssld -p ./pretrained -d True
!python tools/download.py -a MobileNetV3_large_x1_0 -p ./pretrained -d True

Traceback (most recent call last):
File "tools/download.py", line 17, in <module>
from ppcls import model_zoo
ModuleNotFoundError: No module named 'ppcls'

(The same traceback is printed for each of the three commands.)
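The traceback suggests that the ppcls package (the repository root) is not on the Python path when tools/download.py runs. A minimal sketch of a workaround, assuming the commands are launched from the PaddleClas repository root:

```python
# Workaround sketch (assumption: the current directory is the PaddleClas repo
# root, which contains the "ppcls" package). Putting it on sys.path has the
# same effect as `export PYTHONPATH=$PWD:$PYTHONPATH` before running the tool.
import os
import sys

sys.path.insert(0, os.getcwd())

from ppcls import model_zoo  # should now import without error
```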

How to deploy on PaddleHub

Hi, I fine-tuned with SSLD, training on top of the ResNet50_vd_ssld pretrained model, and then generated an inference model. I would like to deploy it with PaddleHub. Is there a quick, drop-in way to deploy, i.e. replacing the inference model under the original module and then starting the service directly?

train from scratch

Hi, I want to train on my own data from scratch with PaddleClas. In train_list.txt, besides the image path, should the location coordinates be given as center point plus width/height, or as top-left/bottom-right corners?
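For what it's worth, a classification list file normally needs only an image path and an integer class label per line, with no box coordinates. A small sketch of generating such a list (the directory layout and label mapping here are assumptions):

```python
# Sketch: write a classification train_list.txt with one "<path> <label>" pair
# per line (no bounding-box coordinates are involved in image classification).
import os

def write_train_list(image_dir, label_map, out_path="train_list.txt"):
    with open(out_path, "w") as f:
        for name in sorted(os.listdir(image_dir)):
            if name in label_map:
                f.write("{} {}\n".format(os.path.join(image_dir, name), label_map[name]))

# write_train_list("dataset/my_data/train", {"cat_001.jpg": 0, "dog_001.jpg": 1})
```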

After running infer, the results for the same image differ

Hi, when running inference with the infer script I ran into the following problem:
1st infer: class id: 1, probability: 0.9075
2nd infer: class id: 1, probability: 0.9048
3rd infer: class id: 1, probability: 0.9069

This is my run script:
export PYTHONPATH=$PWD:$PYTHONPATH
export CUDA_VISIBLE_DEVICES=0
#--model=EfficientNetB0 --pretrained_model=output/EfficientNetB0_val/best_model_in_epoch_124/ppcls --output_paht=./convert
python tools/infer/infer.py \
    --image_file=./tools/img.jpg \
    --model=EfficientNetB0 \
    --pretrained_model=output/EfficientNetB0_val/best_model_in_epoch_124/ppcls

The only change I made is that during resize I removed the resize_short mode and resized the image directly to 288.

Has anyone run into this who could help answer? Thanks~~

Make the concept of "place" clear

The concept of "place" is confusing when someone tries to set the available GPU places by specifying CUDA_VISIBLE_DEVICES.

When using the Fleet interface, only FLAGS_selected_gpus works,

so we have to obtain the GPU count with

gpu_num = paddle.fluid.core.get_cuda_device_count() if (
        'PADDLE_TRAINERS_NUM') and (
            'PADDLE_TRAINER_ID'
    ) not in env else int(env.get('PADDLE_TRAINERS_NUM', 0))
  • remove this switch

Model inference questions

A question for the experts:
1. The command below seems to run inference on only a single image. How can I run inference on a whole folder, similar to specifying an infer_dir in PaddleDetection? (see the sketch below)
python tools/infer/predict.py \
    -m <path to the model file> \
    -p <path to the params file> \
    -i <path to the image> \
    --use_gpu=1 \
    --use_tensorrt=True
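Until an infer_dir-style option exists, one stopgap (a sketch only; the folder path is hypothetical) is to loop over the images and call predict.py once per file with the same flags as above:

```python
# Sketch: run predict.py once per image in a folder. Slow, because the
# predictor is re-created for every call, but it only reuses the flags shown above.
import glob
import subprocess

for image_path in sorted(glob.glob("dataset/test_images/*.jpg")):
    subprocess.run([
        "python", "tools/infer/predict.py",
        "-m", "path/to/model",    # model file path
        "-p", "path/to/params",   # params file path
        "-i", image_path,
        "--use_gpu=1",
    ], check=True)
```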

2. On Windows, how do I set the environment variable? The command I used on AI Studio (below) is not recognized by the Windows terminal:
export PYTHONPATH=$PWD:$PYTHONPATH

Training loss spikes and then becomes NaN

I am training a classification model with MobileNetV3_large_x1_0. During the second epoch the loss suddenly increases and then becomes NaN. Why does this happen? Does anyone have experience with this?

Imbalanced training data in PaddleClas

Hi, if the training data is imbalanced and the class distribution is skewed, does PaddleClas currently offer a corresponding solution? Thanks.

Where are the log files?

The FAQ says: "After launching, logs are written in real time to mylog/workerlog.*, where you can follow them."
But why can't I find the mylog folder after running? Also, how can I visualize the training process?

Mixed Precision Training

Mixed precision training is available in PaddleCV/image_classification but not in this repo. According to the release notes of PaddlePaddle 1.7, AMP interfaces have been added.
Based on this, I think it would be convenient to implement it here.

Mixed precision training is critical for fast training on V100. Please consider adding it. Thank you!
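As a rough sketch of what the wiring could look like, based on the fluid.contrib mixed-precision decorator that, to my understanding, ships with PaddlePaddle 1.7 (the optimizer construction below is illustrative, not the PaddleClas code):

```python
# Sketch: wrap the optimizer with the AMP decorator so the forward/backward
# pass runs in FP16 with dynamic loss scaling (API assumed from paddle.fluid.contrib).
import paddle.fluid as fluid
from paddle.fluid.contrib.mixed_precision import decorate

def build_optimizer(learning_rate, use_amp=True):
    optimizer = fluid.optimizer.Momentum(
        learning_rate=learning_rate,
        momentum=0.9,
        regularization=fluid.regularizer.L2Decay(1e-4))
    if use_amp:
        optimizer = decorate(optimizer, use_dynamic_loss_scaling=True)
    return optimizer
```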

export_model: "Tensor not initialized yet when Tensor::type() is called" error

Following the tutorial, the model export step is:
python tools/export_model.py \
    --model=MobileNetV3_large_x1_0 \
    --pretrained_model=./output/MobileNetV3_large_x1_0/best_model_in_epoch_7/ \
    --output_path=./convert/

The error is shown below. Could someone with experience take a look?

Python Call Stacks (More useful to users):

File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2525, in append_op
attrs=kwargs.get("attrs", None))
File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 343, in save_vars
'save_to_memory': save_to_memory
File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 295, in save_vars
filename=filename)
File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 641, in save_persistables
filename=filename)
File "/root/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 1246, in save_inference_model
save_persistables(executor, save_dirname, main_program, params_filename)
File "tools/export_model.py", line 74, in main
params_filename='params')
File "tools/export_model.py", line 78, in
main()


Error Message Summary:

Error: Tensor not initialized yet when Tensor::type() is called.
[Hint: holder_ should not be null.] at (/paddle/paddle/fluid/framework/tensor.h:140)
[operator < save_combine > error]

Res2Net-200 model naming error under Python 2

When building the 200-layer Res2Net model, Python 2 raises:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 4: invalid start byte

Because the number of blocks exceeds the 26 English letters, the parameter naming in the code goes wrong:
conv_name = "res" + str(block+2) + chr(97+i)

The earlier branch in the code should also cover Res2Net-200:
if layers in [101, 152, 200] and block == 2:
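A sketch of the branch the report is pointing at (illustrative only; it switches deep variants to a numeric suffix so the name never needs a letter beyond "z"):

```python
# Sketch: for very deep variants use "b" + str(i) instead of chr(97 + i),
# which produces non-ASCII bytes once a stage holds more than 26 blocks.
def conv_block_name(layers, block, i):
    if layers in [101, 152, 200] and block == 2:
        if i == 0:
            return "res" + str(block + 2) + "a"
        return "res" + str(block + 2) + "b" + str(i)
    return "res" + str(block + 2) + chr(97 + i)
```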

se+hrnet

I want to add an attention mechanism on top of HRNet, so I chose SE+HRNet. In a previous issue I was told that SE+HRNet needs a pretrained model that already includes SE, and that directly loading a pretrained model without SE gives rather low accuracy.
My questions:
1. Is there an SE+HRNet pretrained model?
2. If not, how should I train it to get a good result? Are there any practical suggestions?
3. Is there another attention mechanism that is easier to train than SE+HRNet when no pretrained model is available?

Inference with the 100,000-class pretrained model

UnavailableError: Load operator fail to open file pretrained/ResNet50_vd_10w_pretrained/fc_0.w_0, please check whether the model file is complete or damaged.
[Hint: Expected static_cast(fin) == true, but received static_cast(fin):0 != true:1.] at (/paddle/paddle/fluid/operators/load_op.h:41)
[operator < load > error]

received rank:2 != label_dims.size():3

报错: File "tools/train.py", line 124, in
main(args)


Error Message Summary:

InvalidArgumentError: If Attr(soft_label) == true, Input(X) and Input(Label) shall have the same dimensions. But received: the dimensions of Input(X) is [2],the shape of Input(X) is [-1, 2], the dimensions of Input(Label) is [3], the shape ofInput(Label) is [-1, 1, 2]
[Hint: Expected rank == label_dims.size(), but received rank:2 != label_dims.size():3.] at (D:\1.8.1\paddle\paddle\fluid\operators\cross_entropy_op.cc:63)
[operator < cross_entropy > error]
INFO 2020-05-23 18:17:34,812 utils.py:272] terminate all the procs
ERROR 2020-05-23 18:17:34,812 utils.py:416] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2020-05-23 18:17:34,813 utils.py:272] terminate all the procs

The images are 512*512 PNG with 8-bit depth, and the classes are 1, 2, 3. What do rank and label_dims.size() in this error refer to?

When inferring an image: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

aistudio@jupyter-305239-473669:~/work/PaddleClas$ python tools/infer/predict.py -m output_ca/ResNet50_vd/last/model -p output_ca/ResNet50_vd/last/params -i ./test0.jpg --use_gpu=1
Traceback (most recent call last):
File "tools/infer/predict.py", line 160, in
main()
File "tools/infer/predict.py", line 121, in main
inputs = preprocess(args.image_file, operators)
File "tools/infer/predict.py", line 88, in preprocess
data = open(fname).read()
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

What is the problem?
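The byte 0xff is the first byte of a JPEG file, so the likely cause is that predict.py opens the image in text mode and tries to decode it as UTF-8. A minimal sketch of the fix in preprocess, assuming fname is the image path used above:

```python
# Sketch: read the image file in binary mode instead of the default text mode.
with open(fname, "rb") as f:
    data = f.read()
```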

export_model: error when converting the model

export CUDA_VISIBLE_DEVICES=0
python -m paddle.distributed.launch \
    --selected_gpus="0" \
    tools/train.py \
    -c ./configs/quick_start/ResNet50_vd.yaml

After training the model with the command above, I converted it with export_model:
python tools/export_model.py --model=ResNet50_vd --pretrained_model=output/ResNet50_vd/19/ --output_path=inference/ResNet50_vd --class_dim=102

The error:
2020-05-09 14:36:17,701-WARNING: output/ResNet50_vd/19/.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
2020-05-09 14:36:17,701-WARNING: output/ResNet50_vd/19/.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
2020-05-09 14:36:17,703-WARNING: variable file [ output/ResNet50_vd/19/ppcls.pdopt output/ResNet50_vd/19/ppcls.pdparams output/ResNet50_vd/19/ppcls.pdmodel ] not used
2020-05-09 14:36:17,703-WARNING: variable file [ output/ResNet50_vd/19/ppcls.pdopt output/ResNet50_vd/19/ppcls.pdparams output/ResNet50_vd/19/ppcls.pdmodel ] not used
/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py:804: UserWarning: There are no operators in the program to be executed. If you pass Program manually, please use fluid.program_guard to ensure the current Program is being used.
warnings.warn(error_info)
/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py:782: UserWarning: The following exception is not an EOF exception.
"The following exception is not an EOF exception.")
Traceback (most recent call last):
File "tools/export_model.py", line 78, in
main()
File "tools/export_model.py", line 74, in main
params_filename='params')
File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 1245, in save_inference_model
save_persistables(executor, save_dirname, main_program, params_filename)
File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 640, in save_persistables
filename=filename)
File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 295, in save_vars
filename=filename)
File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 350, in save_vars
executor.run(save_program)
File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 783, in run
six.reraise(*sys.exc_info())
File "/home/lishi/anaconda3/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 778, in run
use_program_cache=use_program_cache)
File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 831, in _run_impl
use_program_cache=use_program_cache)
File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 905, in _run_program
fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int)
2 paddle::framework::Tensor::type() const
3 paddle::operators::SaveCombineOpKernel<paddle::platform::CPUDeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
4 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CPUPlace, false, 0ul, paddle::operators::SaveCombineOpKernel<paddle::platform::CPUDeviceContext, float>, paddle::operators::SaveCombineOpKernel<paddle::platform::CPUDeviceContext, double>, paddle::operators::SaveCombineOpKernel<paddle::platform::CPUDeviceContext, int>, paddle::operators::SaveCombineOpKernel<paddle::platform::CPUDeviceContext, long> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
5 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
6 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
7 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
8 paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
9 paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool, bool)


Python Call Stacks (More useful to users):

File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2525, in append_op
attrs=kwargs.get("attrs", None))
File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 343, in save_vars
'save_to_memory': save_to_memory
File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 295, in save_vars
filename=filename)
File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 640, in save_persistables
filename=filename)
File "/home/lishi/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 1245, in save_inference_model
save_persistables(executor, save_dirname, main_program, params_filename)
File "tools/export_model.py", line 74, in main
params_filename='params')
File "tools/export_model.py", line 78, in
main()


Error Message Summary:

Error: Tensor not initialized yet when Tensor::type() is called.
[Hint: holder_ should not be null.] at (/paddle/paddle/fluid/framework/tensor.h:140)
[operator < save_combine > error]

run PaddleClas infer.py ERROR

my infer.sh:
export PYTHONPATH=$PWD:$PYTHONPATH

python -m paddle.distributed.launch \
    --selected_gpus="0" \
    tools/infer/infer.py -i "dataset/FGVC2020_SSFGRC/test/26.jpg" \
    -m "SENet154_vd" \
    -p "output/expr20_SENet154_vd_train_bestv1_25971.txt_val2000_val2750_78.84"

ERROR:
Traceback (most recent call last):
File "tools/infer/infer.py", line 121, in
main()
File "tools/infer/infer.py", line 113, in main
return_numpy=False)
File "/home/daibing/software/anaconda2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 790, in run
six.reraise(*sys.exc_info())
File "/home/daibing/software/anaconda2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 785, in run
use_program_cache=use_program_cache)
File "/home/daibing/software/anaconda2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 838, in _run_impl
use_program_cache=use_program_cache)
File "/home/daibing/software/anaconda2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 909, in _run_program
self._feed_data(program, feed, feed_var_name, scope)
File "/home/daibing/software/anaconda2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 591, in _feed_data
check_feed_shape_type(var, cur_feed)
File "/home/daibing/software/anaconda2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 230, in check_feed_shape_type
(var.name, len(var.shape), var.shape, feed_shape))
ValueError: The fed Variable u'image' should have dimensions = 4, shape = (-1L, 3L, 224L, 224L), but received fed shape [3L, 224L, 224L] on each device
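The feed is a single CHW image of shape (3, 224, 224), while the program expects a 4-D batch of shape (-1, 3, 224, 224). A sketch of the usual fix is to add a batch axis before feeding (the variable name img is an assumption about the preprocessing code):

```python
# Sketch: prepend a batch dimension so the fed array matches (-1, 3, 224, 224).
import numpy as np

img = np.expand_dims(img, axis=0)  # (3, 224, 224) -> (1, 3, 224, 224)
```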

When --use_tensorrt=True is added, there is a further error:

Error: Pass tensorrt_subgraph_pass has not been registered at (/paddle/paddle/fluid/framework/ir/pass.h:170)

Training from scratch with my own dataset

Hi, I am training from scratch with my own dataset (only one class of object), but during training top1 and top2 are always 1.0000 (the same during eval), as shown in the attached screenshot.
The config file is resnet50_vd.yaml, and I changed the number of classes to 2 in it. How should the config be changed in this case? One more question: how can a classification model trained with PaddleClas be used for object detection in PaddleDetection? Thanks!

HWC->CHW function redundancy

In operators.py, it seems that to_np, order, and channel_first are not necessary;
we already have a ToCHWImage function.
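For reference, the HWC to CHW conversion is a single transpose, which is all a ToCHWImage-style operator needs to do (a sketch, not the repository code):

```python
# Sketch: convert an H x W x C image (e.g. decoded by OpenCV) to C x H x W.
import numpy as np

def to_chw(img):
    return np.ascontiguousarray(img.transpose((2, 0, 1)))
```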

ValueError: Operator "gen_nccl_id" has not been registered.

E:\projects\PaddleClas-master>python -m paddle.distributed.launch --selected_gpus='0' tools/train.py -c configs/quick_start/ResNet50_vd_finetune_my.yaml
----------- Configuration Arguments -----------
cluster_node_ips: 127.0.0.1
log_dir: None
node_ip: 127.0.0.1
print_config: True
selected_gpus: '0'
started_port: 6170
training_script: tools/train.py
training_script_args: ['-c', 'configs/quick_start/ResNet50_vd_finetune_my.yaml']
use_paddlecloud: False

trainers_endpoints: 127.0.0.1:6170 , node_id: 0 , current_node_ip: 127.0.0.1 , num_nodes: 1 , node_ips: ['127.0.0.1'] , nranks: 1
2020-05-13 23:57:14 INFO:

== PaddleClas is powered by PaddlePaddle ! ==

== ==
== For more info please go to the following website. ==
== ==
== https://github.com/PaddlePaddle/PaddleClas ==

2020-05-13 23:57:14 INFO: ARCHITECTURE :
2020-05-13 23:57:14 INFO: name : ResNet50_vd
2020-05-13 23:57:14 INFO: ------------------------------------------------------------
2020-05-13 23:57:14 INFO: LEARNING_RATE :
2020-05-13 23:57:14 INFO: function : Cosine
2020-05-13 23:57:14 INFO: params :
2020-05-13 23:57:14 INFO: lr : 0.00375
2020-05-13 23:57:14 INFO: ------------------------------------------------------------
2020-05-13 23:57:14 INFO: OPTIMIZER :
2020-05-13 23:57:14 INFO: function : Momentum
2020-05-13 23:57:14 INFO: params :
2020-05-13 23:57:14 INFO: momentum : 0.9
2020-05-13 23:57:14 INFO: regularizer :
2020-05-13 23:57:14 INFO: factor : 1e-06
2020-05-13 23:57:14 INFO: function : L2
2020-05-13 23:57:14 INFO: ------------------------------------------------------------
2020-05-13 23:57:14 INFO: TRAIN :
2020-05-13 23:57:14 INFO: batch_size : 32
2020-05-13 23:57:14 INFO: data_dir : G:/ai_data/paddle/0513/
2020-05-13 23:57:14 INFO: file_list : G:/ai_data/paddle/0513train.list
2020-05-13 23:57:14 INFO: num_workers : 4
2020-05-13 23:57:14 INFO: shuffle_seed : 0
2020-05-13 23:57:14 INFO: transforms :
2020-05-13 23:57:14 INFO: DecodeImage :
2020-05-13 23:57:14 INFO: channel_first : False
2020-05-13 23:57:14 INFO: to_np : False
2020-05-13 23:57:14 INFO: to_rgb : True
2020-05-13 23:57:14 INFO: RandCropImage :
2020-05-13 23:57:14 INFO: size : 224
2020-05-13 23:57:14 INFO: RandFlipImage :
2020-05-13 23:57:14 INFO: flip_code : 1
2020-05-13 23:57:14 INFO: NormalizeImage :
2020-05-13 23:57:14 INFO: mean : [0.485, 0.456, 0.406]
2020-05-13 23:57:14 INFO: order :
2020-05-13 23:57:14 INFO: scale : 1./255.
2020-05-13 23:57:14 INFO: std : [0.229, 0.224, 0.225]
2020-05-13 23:57:14 INFO: ToCHWImage : None
2020-05-13 23:57:14 INFO: ------------------------------------------------------------
2020-05-13 23:57:14 INFO: VALID :
2020-05-13 23:57:14 INFO: batch_size : 20
2020-05-13 23:57:14 INFO: data_dir : G:/ai_data/paddle/0513/
2020-05-13 23:57:14 INFO: file_list : G:/ai_data/paddle/0513test.list
2020-05-13 23:57:14 INFO: num_workers : 4
2020-05-13 23:57:14 INFO: shuffle_seed : 0
2020-05-13 23:57:14 INFO: transforms :
2020-05-13 23:57:14 INFO: DecodeImage :
2020-05-13 23:57:14 INFO: channel_first : False
2020-05-13 23:57:14 INFO: to_np : False
2020-05-13 23:57:14 INFO: to_rgb : True
2020-05-13 23:57:14 INFO: ResizeImage :
2020-05-13 23:57:14 INFO: resize_short : 256
2020-05-13 23:57:14 INFO: CropImage :
2020-05-13 23:57:14 INFO: size : 224
2020-05-13 23:57:14 INFO: NormalizeImage :
2020-05-13 23:57:14 INFO: mean : [0.485, 0.456, 0.406]
2020-05-13 23:57:14 INFO: order :
2020-05-13 23:57:14 INFO: scale : 1.0/255.0
2020-05-13 23:57:14 INFO: std : [0.229, 0.224, 0.225]
2020-05-13 23:57:14 INFO: ToCHWImage : None
2020-05-13 23:57:14 INFO: ------------------------------------------------------------
2020-05-13 23:57:14 INFO: classes_num : 3
2020-05-13 23:57:14 INFO: epochs : 20
2020-05-13 23:57:14 INFO: image_shape : [3, 224, 224]
2020-05-13 23:57:14 INFO: mode : train
2020-05-13 23:57:14 INFO: model_save_dir : E:/projects/PaddleClas-master/output/
2020-05-13 23:57:14 INFO: pretrained_model : E:/projects/PaddleClas-master/ResNet50_vd_pretrained
2020-05-13 23:57:14 INFO: save_interval : 1
2020-05-13 23:57:14 INFO: topk : 5
2020-05-13 23:57:14 INFO: total_images : 795
2020-05-13 23:57:14 INFO: valid_interval : 1
2020-05-13 23:57:14 INFO: validate : True

API is deprecated since 2.0.0 Please use FleetAPI instead.
WIKI: https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler

Traceback (most recent call last):
File "tools/train.py", line 124, in
main(args)
File "tools/train.py", line 69, in main
config, train_prog, startup_prog, is_train=True)
File "E:\projects\PaddleClas-master\tools\program.py", line 341, in build
optimizer.minimize(fetchs['loss'][0])
File "C:\python\tf\lib\site-packages\paddle\fluid\incubate\fleet\collective_init_.py", line 424, in minimize
fleet.main_program = self.try_to_compile(startup_program, main_program)
File "C:\python\tf\lib\site-packages\paddle\fluid\incubate\fleet\collective_init
.py", line 358, in _try_to_compile
self.transpile(startup_program, main_program)
File "C:\python\tf\lib\site-packages\paddle\fluid\incubate\fleet\collective_init
.py", line 285, in _transpile
current_endpoint=current_endpoint)
File "C:\python\tf\lib\site-packages\paddle\fluid\transpiler\distribute_transpiler.py", line 625, in transpile
wait_port=self.config.wait_port)
File "C:\python\tf\lib\site-packages\paddle\fluid\transpiler\distribute_transpiler.py", line 397, in _transpile_nccl2
self.config.hierarchical_allreduce_inter_nranks
File "C:\python\tf\lib\site-packages\paddle\fluid\framework.py", line 2525, in append_op
attrs=kwargs.get("attrs", None))
File "C:\python\tf\lib\site-packages\paddle\fluid\framework.py", line 1797, in init
proto = OpProtoHolder.instance().get_op_proto(type)
File "C:\python\tf\lib\site-packages\paddle\fluid\framework.py", line 1679, in get_op_proto
raise ValueError("Operator "%s" has not been registered." % type)
ValueError: Operator "gen_nccl_id" has not been registered.
2020-05-13 15:57:16,981-ERROR: ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
ERROR 2020-05-13 15:57:16,981 launch.py:284] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.

What is causing this problem?

Problem with the multi_process reader

Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/usr/local/lib/python3.5/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/usr/lib/python3.5/bdb.py", line 48, in trace_dispatch
return self.dispatch_line(frame)
File "/usr/lib/python3.5/bdb.py", line 67, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit

/home/pd_source/cla/ppcls/data/reader.py(191)reader()
-> for line in full_lines:
(Pdb)
/home/pd_source/cla/ppcls/data/reader.py(191)reader()
-> for line in full_lines:
Process Process-2:
(Pdb)
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/usr/local/lib/python3.5/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/usr/lib/python3.5/bdb.py", line 48, in trace_dispatch
return self.dispatch_line(frame)
File "/usr/lib/python3.5/bdb.py", line 67, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/usr/local/lib/python3.5/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/usr/lib/python3.5/bdb.py", line 48, in trace_dispatch
return self.dispatch_line(frame)
File "/usr/lib/python3.5/bdb.py", line 67, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
/home/pd_source/cla/ppcls/data/reader.py(191)reader()
-> for line in full_lines:
(Pdb)
Process Process-3:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/usr/local/lib/python3.5/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/usr/lib/python3.5/bdb.py", line 48, in trace_dispatch
return self.dispatch_line(frame)
File "/usr/lib/python3.5/bdb.py", line 67, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
2020-05-27 14:43:10 WARNING: Your reader has raised an exception!
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/dist-packages/paddle/fluid/reader.py", line 1156, in thread_main
six.reraise(*sys.exc_info())
File "/usr/local/lib/python3.5/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/paddle/fluid/reader.py", line 1136, in thread_main
for tensors in self._tensor_reader():
File "/usr/local/lib/python3.5/dist-packages/paddle/fluid/reader.py", line 1206, in tensor_reader_impl
for slots in paddle_reader():
File "/usr/local/lib/python3.5/dist-packages/paddle/fluid/data_feeder.py", line 506, in reader_creator
for item in reader():
File "/home/pd_source/cla/ppcls/data/reader.py", line 267, in wrapper
for idx, sample in enumerate(reader()):
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 572, in queue_reader
raise ValueError("multiprocess reader raises an exception")
ValueError: multiprocess reader raises an exception

/home/pd_source/cla/ppcls/data/reader.py(191)reader()
-> for line in full_lines:
(Pdb)
Process Process-4:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/usr/local/lib/python3.5/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/usr/lib/python3.5/bdb.py", line 48, in trace_dispatch
return self.dispatch_line(frame)
File "/usr/lib/python3.5/bdb.py", line 67, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
/home/pd_source/cla/ppcls/data/reader.py(191)reader()
-> for line in full_lines:
(Pdb)
Process Process-2:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/usr/local/lib/python3.5/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/usr/lib/python3.5/bdb.py", line 48, in trace_dispatch
return self.dispatch_line(frame)
File "/usr/lib/python3.5/bdb.py", line 67, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
Traceback (most recent call last):
File "./jaits_utils/task_tools.py", line 494, in inner
func(jif,*args, **kwargs)
File "cla/jaits_train.py", line 215, in main
epoch_id, 'train')
File "/home/pd_source/cla/program.py", line 413, in run
for idx, batch in enumerate(dataloader()):
File "/usr/local/lib/python3.5/dist-packages/paddle/fluid/reader.py", line 1102, in next
return self._reader.read_next()
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int)
2 paddle::operators::reader::BlockingQueue<std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> > >::Receive(std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >*)
3 paddle::operators::reader::PyReader::ReadNext(std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >*)
4 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, unsigned long> >::_M_invoke(std::_Any_data const&)
5 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
6 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const


Error Message Summary:

Error: Blocking queue is killed because the data reader raises an exception
[Hint: Expected killed_ != true, but received killed_:1 == true:1.] at (/paddle/paddle/fluid/operators/reader/blocking_queue.h:141)

2020-05-27 14:43:10 INFO: SO:exception-Traceback (most recent call last):
File "./jaits_utils/task_tools.py", line 494, in inner
func(jif,*args, **kwargs)
File "cla/jaits_train.py", line 215, in main
epoch_id, 'train')
File "/home/pd_source/cla/program.py", line 413, in run
for idx, batch in enumerate(dataloader()):
File "/usr/local/lib/python3.5/dist-packages/paddle/fluid/reader.py", line 1102, in next
return self._reader.read_next()
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int)
2 paddle::operators::reader::BlockingQueue<std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> > >::Receive(std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >*)
3 paddle::operators::reader::PyReader::ReadNext(std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >*)
4 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, unsigned long> >::_M_invoke(std::_Any_data const&)
5 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
6 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const


Error Message Summary:

Error: Blocking queue is killed because the data reader raises an exception
[Hint: Expected killed_ != true, but received killed_:1 == true:1.] at (/paddle/paddle/fluid/operators/reader/blocking_queue.h:141)

/home/pd_source/cla/ppcls/data/reader.py(191)reader()
-> for line in full_lines:
(Pdb)
Process Process-3:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/usr/local/lib/python3.5/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/usr/lib/python3.5/bdb.py", line 48, in trace_dispatch
return self.dispatch_line(frame)
File "/usr/lib/python3.5/bdb.py", line 67, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
2020-05-27 14:43:10 WARNING: Your reader has raised an exception!
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/dist-packages/paddle/fluid/reader.py", line 1156, in thread_main
six.reraise(*sys.exc_info())
File "/usr/local/lib/python3.5/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/paddle/fluid/reader.py", line 1136, in thread_main
for tensors in self._tensor_reader():
File "/usr/local/lib/python3.5/dist-packages/paddle/fluid/reader.py", line 1206, in tensor_reader_impl
for slots in paddle_reader():
File "/usr/local/lib/python3.5/dist-packages/paddle/fluid/data_feeder.py", line 506, in reader_creator
for item in reader():
File "/home/pd_source/cla/ppcls/data/reader.py", line 267, in wrapper
for idx, sample in enumerate(reader()):
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 572, in queue_reader
raise ValueError("multiprocess reader raises an exception")
ValueError: multiprocess reader raises an exception

/home/pd_source/cla/ppcls/data/reader.py(191)reader()
-> for line in full_lines:
(Pdb)
Process Process-4:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/usr/local/lib/python3.5/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
for sample in reader():
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/home/pd_source/cla/ppcls/data/reader.py", line 191, in reader
for line in full_lines:
File "/usr/lib/python3.5/bdb.py", line 48, in trace_dispatch
return self.dispatch_line(frame)
File "/usr/lib/python3.5/bdb.py", line 67, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
Traceback (most recent call last):
File "./jaits_utils/task_tools.py", line 494, in inner
func(jif,*args, **kwargs)
File "cla/jaits_train.py", line 215, in main
epoch_id, 'train')
File "/home/pd_source/cla/program.py", line 413, in run
for idx, batch in enumerate(dataloader()):
File "/usr/local/lib/python3.5/dist-packages/paddle/fluid/reader.py", line 1102, in next
return self._reader.read_next()
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int)
2 paddle::operators::reader::BlockingQueue<std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> > >::Receive(std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >*)
3 paddle::operators::reader::PyReader::ReadNext(std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >*)
4 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, unsigned long> >::_M_invoke(std::_Any_data const&)
5 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
6 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const


Error Message Summary:

Error: Blocking queue is killed because the data reader raises an exception
[Hint: Expected killed_ != true, but received killed_:1 == true:1.] at (/paddle/paddle/fluid/operators/reader/blocking_queue.h:141)

2020-05-27 14:43:10 INFO: SO:exception-Traceback (most recent call last):
File "./jaits_utils/task_tools.py", line 494, in inner
func(jif,*args, **kwargs)
File "cla/jaits_train.py", line 215, in main
epoch_id, 'train')
File "/home/pd_source/cla/program.py", line 413, in run
for idx, batch in enumerate(dataloader()):
File "/usr/local/lib/python3.5/dist-packages/paddle/fluid/reader.py", line 1102, in next
return self._reader.read_next()
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int)
2 paddle::operators::reader::BlockingQueue<std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> > >::Receive(std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >*)
3 paddle::operators::reader::PyReader::ReadNext(std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >*)
4 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, unsigned long> >::_M_invoke(std::_Any_data const&)
5 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
6 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const


Error Message Summary:

Error: Blocking queue is killed because the data reader raises an exception
[Hint: Expected killed_ != true, but received killed_:1 == true:1.] at (/paddle/paddle/fluid/operators/reader/blocking_queue.h:141)

Error when running the demo

Paddle environment: 1.7.2, CUDA 9.0, cuDNN 7.5
When using the command /home/vis/duyuting/app/anaconda3/bin/python -m paddle.distributed.launch --selected_gpus="0" tools/train.py -c ./configs/quick_start/ResNet50_vd.yaml, the error is:
Error: Failed to find dynamic library: libnccl.so ( /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /home/vis/duyuting/app/nccl_2.5.6-1+cuda10.0_x86_64/lib/libnccl.so) )
Please specify its path correctly using following ways:
Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS.
For instance, issue command: export LD_LIBRARY_PATH=...
Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled. at (/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:177)
[operator < gen_nccl_id > error] 看起来是nccl问题
去官网下载了cuda9版本的nccl报错:
Error: An error occurred here. There is no accurate error hint for this error yet. We are continuously in the process of increasing hint for this kind of error check. It would be helpful if you could inform us of how this conversion went by opening a github issue. And we will resolve it with high priority.

  • New issue link: https://github.com/PaddlePaddle/Paddle/issues/new
  • Recommended issue content: all error stack information
    [unhandled system error] at (/paddle/paddle/fluid/operators/distributed_ops/gen_nccl_id_op.cc:162)
    [operator < gen_nccl_id > error]
If I do not use the distributed command and instead run /home/vis/duyuting/app/anaconda3/bin/python tools/train.py -c ./configs/quick_start/ResNet50_vd.yaml, the error is: Traceback (most recent call last):
    File "tools/train.py", line 133, in
    main(args)
    File "tools/train.py", line 59, in main
    fleet.init(role)
    File "/home/vis/duyuting/app/anaconda3/lib/python3.7/site-packages/paddle/fluid/incubate/fleet/base/fleet_base.py", line 202, in init
    self._role_maker.generate_role()
    File "/home/vis/duyuting/app/anaconda3/lib/python3.7/site-packages/paddle/fluid/incubate/fleet/base/role_maker.py", line 500, in generate_role
    assert self._worker_endpoints is not None, "can't find PADDLE_TRAINER_ENDPOINTS"
Can this library really not run on a single GPU????

Why Larger Batch Size Slows Training

I am training WRN-28-10 on CIFAR10 using PaddleClas. When the batch size exceeds 128, a larger batch size makes training slower. A detailed comparison is shown below.

| Batch Size | Time (Per Epoch) |
|-----------:|-----------------:|
| 32 | 82.2s |
| 64 | 72.8s |
| 128 | 68.5s |
| 256 | 74.1s |
| 512 | 86.4s |
| 1024 | 110.5s |

The time of the 2nd epoch is reported, so warm-up time is not counted. Experiments showed that the results were consistent.

This behavior is strange and unexpected. Could you help me to find the reason?

Code to reproduce is here.

Thank you very much!

Model inference error

Hi, on AI Studio I converted my trained model to an inference model, but inference then fails with an error:
!export PYTHONPATH=./:$PYTHONPATH && python tools/infer/predict.py \
    -m=./inference/ResNet50_vd/model \
    -p=./inference/ResNet50_vd/params \
    -i=./dataset/flowers102/jpg/image_02275.jpg \
    --use_gpu=1 \
    --use_tensorrt=True

The error message is as follows:

Traceback (most recent call last):
File "tools/infer/predict.py", line 156, in
main()
File "tools/infer/predict.py", line 110, in main
predictor = create_predictor(args)
File "tools/infer/predict.py", line 66, in create_predictor
predictor = create_paddle_predictor(config)
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::framework::ir::PassRegistry::Get(std::string const&) const
3 paddle::inference::analysis::IRPassManager::CreatePasses(paddle::inference::analysis::Argument*, std::vector<std::string, std::allocator<std::string> > const&)
4 paddle::inference::analysis::IRPassManager::IRPassManager(paddle::inference::analysis::Argument*)
5 paddle::inference::analysis::IrAnalysisPass::RunImpl(paddle::inference::analysis::Argument*)
6 paddle::inference::analysis::Analyzer::RunAnalysis(paddle::inference::analysis::Argument*)
7 paddle::AnalysisPredictor::OptimizeInferenceProgram()
8 paddle::AnalysisPredictor::PrepareProgram(std::shared_ptr<paddle::framework::ProgramDesc> const&)
9 paddle::AnalysisPredictor::Init(std::shared_ptr<paddle::framework::Scope> const&, std::shared_ptr<paddle::framework::ProgramDesc> const&)
10 std::unique_ptr<paddle::PaddlePredictor, std::default_delete<paddle::PaddlePredictor> > paddle::CreatePaddlePredictor<paddle::AnalysisConfig, (paddle::PaddleEngineKind)2>(paddle::AnalysisConfig const&)
11 std::unique_ptr<paddle::PaddlePredictor, std::default_delete<paddle::PaddlePredictor> > paddle::CreatePaddlePredictor<paddle::AnalysisConfig>(paddle::AnalysisConfig const&)


Error Message Summary:

Error: Pass tensorrt_subgraph_pass has not been registered at (/paddle/paddle/fluid/framework/ir/pass.h:201)

How can this be resolved?

ResNet50_vd latency

Hi, on a V100 I measured ResNet50_vd at close to 24 ms. How did you measure the reported time of under 5 ms?

Incorrect setting of `is_test` in EfficientNet

is_test is not set correctly in EfficientNet, which leads to drop_connect being applied at test time. This can easily be reproduced by running inference repeatedly on the same image, as in the following.
The predicted probabilities were different between different runs.

The cause appears to be the following.
is_test defaults to False in EfficientNet and is not being set to True in either infer.py or predict.py.

Moreover, the duplicated definition of is_test in both __init__ and net leads to confusion.
In fact, _drop_connect uses self.is_test, and the is_test passed to the methods is not used.

It would be better to fix it.
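A sketch of how the gating could look, using the standard drop-connect formulation (the function and argument names follow the issue, not necessarily the exact PaddleClas code):

```python
# Sketch: drop-connect must be a no-op at inference time, otherwise repeated
# runs on the same image give different probabilities.
import paddle.fluid as fluid

def drop_connect(inputs, prob, is_test):
    if is_test:
        return inputs
    keep_prob = 1.0 - prob
    random_tensor = keep_prob + fluid.layers.uniform_random_batch_size_like(
        inputs, [-1, 1, 1, 1], min=0.0, max=1.0)
    binary_tensor = fluid.layers.floor(random_tensor)
    return inputs / keep_prob * binary_tensor
```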

resnet_vd training fails: there is no is_test argument

Paddle version: 1.7.1
config: ResNet50_vd.yaml
After running training, it fails with an error (see screenshot).
ResNet50_vd's __init__ indeed has no is_test argument, but program.create_model does pass this argument in (see screenshot).
Is this a version problem on my side?

Error when training with the model fine-tuning command

The command, using Baidu's ResNet50_vd_10w pretrained model:
set CUDA_VISIBLE_DEVICES=0
python -m paddle.distributed.launch --selected_gpus="0" tools/train.py -c ./configs/quick_start/ResNet50_vd_10w_finetune.yaml

The error:

Traceback (most recent call last):
File "tools/train.py", line 150, in
main(args)
File "tools/train.py", line 75, in main
config, train_prog, startup_prog, is_train=True)
File "F:\pythonproject\PaddleClas\PaddleClas\tools\program.py", line 363, in build
optimizer.minimize(fetchs['loss'][0])
File "F:\Anaconda3\lib\site-packages\paddle\fluid\incubate\fleet\collective_init_.py", line 652, in minimize
fleet.main_program = self.try_to_compile(startup_program, main_program)
File "F:\Anaconda3\lib\site-packages\paddle\fluid\incubate\fleet\collective_init
.py", line 562, in _try_to_compile
self.transpile(startup_program, main_program)
File "F:\Anaconda3\lib\site-packages\paddle\fluid\incubate\fleet\collective_init
.py", line 489, in _transpile
current_endpoint=current_endpoint)
File "F:\Anaconda3\lib\site-packages\paddle\fluid\transpiler\distribute_transpiler.py", line 625, in transpile
wait_port=self.config.wait_port)
File "F:\Anaconda3\lib\site-packages\paddle\fluid\transpiler\distribute_transpiler.py", line 397, in _transpile_nccl2
self.config.hierarchical_allreduce_inter_nranks
File "F:\Anaconda3\lib\site-packages\paddle\fluid\framework.py", line 2610, in append_op
attrs=kwargs.get("attrs", None))
File "F:\Anaconda3\lib\site-packages\paddle\fluid\framework.py", line 1870, in init
proto = OpProtoHolder.instance().get_op_proto(type)
File "F:\Anaconda3\lib\site-packages\paddle\fluid\framework.py", line 1751, in get_op_proto
raise ValueError("Operator "%s" has not been registered." % type)
ValueError: Operator "gen_nccl_id" has not been registered.
INFO 2020-06-22 11:29:30,706 utils.py:272] terminate all the procs
ERROR 2020-06-22 11:29:30,706 utils.py:416] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2020-06-22 11:29:30,706 utils.py:272] terminate all the procs

The ResNet50_vd_10w_finetune.yaml config file is as follows:
mode: 'train'
ARCHITECTURE:
    name: 'ResNet50_vd'
pretrained_model: "F:/pythonproject/PaddleClas/PaddleClas/ResNet50_vd_10w_pretrained/ResNet50_vd_10w_pretrained"
model_save_dir: "./output/"
classes_num: 5
total_images: 11745
save_interval: 1
validate: True
valid_interval: 1
epochs: 20
topk: 2
image_shape: [3, 224, 224]

LEARNING_RATE:
    function: 'Cosine'
    params:
        lr: 0.00375

OPTIMIZER:
    function: 'Momentum'
    params:
        momentum: 0.9
    regularizer:
        function: 'L2'
        factor: 0.000001

TRAIN:
    batch_size: 32
    num_workers: 4
    file_list: "F:/pythonproject\PaddleClas/PaddleClas/dataset/driver/train_list.txt"
    data_dir: "F:/pythonproject\PaddleClas/PaddleClas/dataset/driver/"
    shuffle_seed: 0
    transforms:
        - DecodeImage:
            to_rgb: True
            to_np: False
            channel_first: False
        - RandCropImage:
            size: 224
        - RandFlipImage:
            flip_code: 1
        - NormalizeImage:
            scale: 1./255.
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - ToCHWImage:

VALID:
    batch_size: 20
    num_workers: 4
    file_list: "F:/pythonproject\PaddleClas/PaddleClas/dataset/driver/val_list.txt"
    data_dir: "F:/pythonproject\PaddleClas/PaddleClas/dataset/driver/"
    shuffle_seed: 0
    transforms:
        - DecodeImage:
            to_rgb: True
            to_np: False
            channel_first: False
        - ResizeImage:
            resize_short: 256
        - CropImage:
            size: 224
        - NormalizeImage:
            scale: 1.0/255.0
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - ToCHWImage:

Would like to know the exact deduplication procedure

Thanks a lot for this great project!!! I have a question about how the dataset was deduplicated, since my dataset also needs deduplication. I only know how to find feature points with SIFT, but different image pairs match different numbers of feature points, so how do I compute a similarity percentage between two images and then set a threshold for deduplication? (see the sketch below)
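One possible heuristic, sketched under the assumption that OpenCV's SIFT is available (this is not necessarily the procedure PaddleClas used): match descriptors with Lowe's ratio test and normalise the number of good matches by the smaller keypoint count.

```python
# Sketch: similarity score in [0, 1] from SIFT matches with Lowe's ratio test.
import cv2

def sift_similarity(path_a, path_b, ratio=0.75):
    sift = cv2.SIFT_create()  # cv2.xfeatures2d.SIFT_create() on older OpenCV builds
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    if not kp_a or not kp_b:
        return 0.0
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    return len(good) / float(min(len(kp_a), len(kp_b)))

# Images whose score exceeds a threshold tuned on a hand-checked sample
# (e.g. 0.3) could then be treated as near-duplicates.
```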

How to write the training command on Windows 10 x64

My laptop runs Windows 10 x64 with an NVIDIA GeForce GTX 1650 GPU.

Following the example, I wrote the training command as: python -m paddle.distributed.launch --selected_gpus="0" tools/train.py -c ./configs/quick_start/ResNet50_vd.yaml

The result is the message that "gen_nccl_id" has not been registered. The QQ group says Windows does not support multi-GPU training. Given my setup, how should I write the training command?

Support status of the dygraph version

As the title says: hi developers, can the dygraph version of this repo currently run correctly? And which parts are not yet aligned with the static-graph version in terms of development progress?

Add unittest in PaddleClas

Now that CI is already in place, the unit tests can be organized like this:

|—— ppcls
|
|—— test
|————|———— test_reader.py
|————|———— test_imaug.py
|————|———— test_download.py
|————|———— test_compress.py
|————|———— test_model.py
|————|———— test_speed.py
|————|———— test_finetune.py
|————|———— test_eval.py
|————|———— test_train.py
|————|———— test_infer.py
|————|———— test_performance.py (IMPORTANT)
|————|———— test_export.py
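As a sketch of what one of these files could contain (the file name comes from the tree above; the asserted behaviour is only illustrative):

```python
# test_reader.py - minimal unittest skeleton (illustrative).
import unittest


class TestReader(unittest.TestCase):
    def test_sample_format(self):
        # stand-in for checking that the reader yields (image_path, label) pairs
        sample = ("dataset/flowers102/jpg/image_00001.jpg", 0)
        self.assertIsInstance(sample[0], str)
        self.assertIsInstance(sample[1], int)


if __name__ == "__main__":
    unittest.main()
```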
