
[User question] SR-GNN training and inference speed below expectations (paddlerec, OPEN)

paddlepaddle commented on August 28, 2024

[User question] SR-GNN training and inference speed below expectations

from paddlerec.

Comments (14)

ucasiggcas commented on August 28, 2024

Traceback (most recent call last):
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/utils/envs.py", line 221, in lazy_instance_by_fliename
    globals(), locals(), package.split("."))
  File "models/recall/gnn/model.py", line 23, in <module>
    from paddlerec.core.metrics import RecallK
ImportError: cannot import name 'RecallK' from 'paddlerec.core.metrics' (/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/metrics/__init__.py)
Catch Exception:cannot import name 'RecallK' from 'paddlerec.core.metrics' (/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/metrics/__init__.py)
Traceback (most recent call last):
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 246, in run
    self.context_process(self._context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 207, in context_process
    self._status_processor[context['status']](context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/general_trainer.py", line 90, in network
    network_class.build_network(context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/framework/network.py", line 64, in build_network
    model_path, "Model")(context["env"])
TypeError: 'NoneType' object is not callable
Catch Exception:'NoneType' object is not callable

--------------------------------
PaddleRec Error Message Summary:
--------------------------------

Exit PaddleRec. catch exception in precoss status: [network_pass], except: 'NoneType' object is not callable
TypeError

ucasiggcas commented on August 28, 2024

PaddleRec: Runner single_cpu_train Begin
Executor Mode: train
processor_register begin
Running SingleInstance.
Running SingleNetwork.
Warning:please make sure there are no hidden files in the dataset folder and check these hidden files:[]
need_split_files: False
QueueDataset can not support PY3, change to DataLoader
Traceback (most recent call last):
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 256, in run
    self.context_process(self._context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 217, in context_process
    self._status_processor[context['status']](context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/general_trainer.py", line 90, in network
    network_class.build_network(context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/framework/network.py", line 80, in build_network
    model._data_loader)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainers/framework/dataset.py", line 60, in get_dataloader
    reader_class_name=reader_class_name)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/utils/dataloader_instance.py", line 96, in dataloader_by_name
    return gen_batch_reader()
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/utils/dataloader_instance.py", line 93, in gen_batch_reader
    return reader.generate_batch_from_trainfiles(files)
  File "models/recall/gnn/reader.py", line 135, in generate_batch_from_trainfiles
    self.input = self.base_read(files)
  File "models/recall/gnn/reader.py", line 35, in base_read
    for line in fin:
  File "/home/xulm1/anaconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Catch Exception:'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

--------------------------------
PaddleRec Error Message Summary:
--------------------------------

Exit PaddleRec. catch exception in precoss status: [network_pass], except: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
UnicodeDecodeError
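One possible reading of the traceback above, which the thread does not confirm: byte 0x80 at position 0 is how every pickle file written with protocol 2 or later begins, so the reader may be opening a binary pickle in text mode. A minimal sketch of that failure mode, using a throwaway file rather than the real dataset (the file name and sample data are made up for illustration):

```python
import os
import pickle
import tempfile


def starts_like_pickle(path):
    """Return True if the first byte is 0x80, the PROTO opcode of pickle protocol >= 2."""
    with open(path, "rb") as f:
        return f.read(1) == b"\x80"


def demo():
    """Write a small pickle, then read it both as text and as binary."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")  # hypothetical file, not the real dataset
        with open(path, "wb") as f:
            pickle.dump([[1, 2, 3], [4, 5]], f, protocol=2)

        looks_binary = starts_like_pickle(path)

        # Text-mode iteration, as in `for line in fin:`, fails on the first byte.
        try:
            with open(path, encoding="utf-8") as f:
                for _ in f:
                    pass
            text_ok = True
        except UnicodeDecodeError:
            text_ok = False

        # Binary mode plus pickle.load recovers the data.
        with open(path, "rb") as f:
            data = pickle.load(f)
    return looks_binary, text_ok, data
```

If `starts_like_pickle` returns True for the training file, the data format and the reader probably do not match (for example, preprocessing output from a different version of the model).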

ucasiggcas commented on August 28, 2024

I ran the command below; the second traceback is the result after reinstalling.
$ python -m paddlerec.run -m models/recall/gnn/config.yaml

ucasiggcas commented on August 28, 2024

What I don't understand is why RecallCnt keeps growing. There aren't that many items in total.

ucasiggcas commented on August 28, 2024

2020-09-15 15:14:05,122-INFO: 	[Train],  epoch: 0,  batch: 1, time_each_interval: 29.89s, LOSS: [10.532445], InsCnt: [10000.], RecallCnt: [73.], Acc(Recall@20): [0.0073]
2020-09-15 15:14:18,110-INFO: 	[Train],  epoch: 0,  batch: 2, time_each_interval: 12.99s, LOSS: [10.150826], InsCnt: [15000.], RecallCnt: [266.], Acc(Recall@20): [0.01773333]
2020-09-15 15:14:30,812-INFO: 	[Train],  epoch: 0,  batch: 3, time_each_interval: 12.70s, LOSS: [9.429095], InsCnt: [20000.], RecallCnt: [459.], Acc(Recall@20): [0.02295]
2020-09-15 15:14:42,839-INFO: 	[Train],  epoch: 0,  batch: 4, time_each_interval: 12.03s, LOSS: [8.945746], InsCnt: [25000.], RecallCnt: [814.], Acc(Recall@20): [0.03256]
2020-09-15 15:14:54,804-INFO: 	[Train],  epoch: 0,  batch: 5, time_each_interval: 11.96s, LOSS: [8.617248], InsCnt: [30000.], RecallCnt: [1152.], Acc(Recall@20): [0.0384]
2020-09-15 15:15:06,927-INFO: 	[Train],  epoch: 0,  batch: 6, time_each_interval: 12.12s, LOSS: [8.601961], InsCnt: [35000.], RecallCnt: [1509.], Acc(Recall@20): [0.04311429]
2020-09-15 15:15:18,632-INFO: 	[Train],  epoch: 0,  batch: 7, time_each_interval: 11.70s, LOSS: [8.352413], InsCnt: [40000.], RecallCnt: [1921.], Acc(Recall@20): [0.048025]
2020-09-15 15:15:30,354-INFO: 	[Train],  epoch: 0,  batch: 8, time_each_interval: 11.72s, LOSS: [8.464729], InsCnt: [45000.], RecallCnt: [2270.], Acc(Recall@20): [0.05044444]
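On the growing RecallCnt: the numbers in the log lines above are consistent with InsCnt and RecallCnt being running totals over the epoch rather than per-batch values, with Acc(Recall@20) equal to RecallCnt / InsCnt. This is inferred from the logged values only, not from the PaddleRec source. A small check, with the rows copied from the log:

```python
# (InsCnt, RecallCnt, Acc) copied from the eight log lines above.
ROWS = [
    (10000, 73, 0.0073),
    (15000, 266, 0.01773333),
    (20000, 459, 0.02295),
    (25000, 814, 0.03256),
    (30000, 1152, 0.0384),
    (35000, 1509, 0.04311429),
    (40000, 1921, 0.048025),
    (45000, 2270, 0.05044444),
]


def cumulative_ratio_matches(rows, tol=1e-6):
    """True if every logged accuracy equals RecallCnt / InsCnt within tol."""
    return all(abs(recall / ins - acc) < tol for ins, recall, acc in rows)


def per_batch_hits(rows):
    """Difference of consecutive RecallCnt values, i.e. hits added by each new batch."""
    return [b[1] - a[1] for a, b in zip(rows, rows[1:])]
```

Under that reading, each batch of 5000 instances adds a few hundred hits, so the counter counts correctly recalled instances, not distinct items, and it can exceed the number of items.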

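With 1 million training rows, 30,000+ items, batch_size=5000, and about 12 s per batch, one epoch takes 1,000,000 / 5000 * 12 s = 2400 s (40 min), while the TF version needs less than 10 min on the same amount of data. This needs to be improved.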
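The estimate can be written as a small helper. The 10-million-row figure below is only an extrapolation that assumes the same 12 s per batch, which has not been measured:

```python
import math


def epoch_seconds(num_rows, batch_size, seconds_per_batch):
    """Estimated wall-clock time of one epoch: number of batches times time per batch."""
    batches = math.ceil(num_rows / batch_size)
    return batches * seconds_per_batch


# 1M rows, as reported above: 200 batches * 12 s
one_million = epoch_seconds(1_000_000, 5000, 12.0)

# 10M rows, assuming (not measured) the same per-batch time
ten_million = epoch_seconds(10_000_000, 5000, 12.0)
```

Here `one_million` is 2400 s (40 min) and `ten_million` is 24000 s (about 6.7 h), so the reported speed is at least 4x slower than the sub-10-minute TF run.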

ucasiggcas commented on August 28, 2024

And this is without the 10-million-row training set. What am I supposed to do? Large datasets are still out of reach.

ucasiggcas commented on August 28, 2024

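How is inference done?
How do I get the list of items recommended to each user?
Does the data have to be saved to disk first (train and test)
and then read back in?
That's cumbersome. Can't training start right after the data is processed, as one pipeline?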

ucasiggcas commented on August 28, 2024

models/recall/gnn/data/config.txt
187993
7806633
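How do I get the two numbers in this file into config.yaml with a script?
It's a hassle. For scheduled training I can't check the file every so often and edit the config by hand.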
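One way to script this, as a rough sketch using only the standard library: read the integers from config.txt and rewrite the matching `key: value` lines in the YAML text. The key names `item_count` and `sample_count` are hypothetical placeholders; which config.yaml fields the two numbers actually belong to is not stated in this thread and needs to be checked against the model's config.

```python
import re


def read_counts(config_txt):
    """Parse whitespace-separated integers, e.g. the two lines of config.txt."""
    return [int(token) for token in config_txt.split()]


def set_yaml_scalar(yaml_text, key, value):
    """Replace the value of a single `key: value` line, keeping its indentation.

    Plain text substitution, not a YAML parser: it expects the key to
    appear exactly once and to hold a scalar on the same line.
    """
    pattern = re.compile(
        r"^([ \t]*" + re.escape(key) + r"[ \t]*:[ \t]*).*$", re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), yaml_text)
    if count != 1:
        raise KeyError("expected exactly one %r line, found %d" % (key, count))
    return new_text


# Hypothetical config fragment; the key names are placeholders.
yaml_text = (
    "hyper_parameters:\n"
    "  item_count: 0\n"
    "  sample_count: 0\n"
    "  learning_rate: 0.001\n"
)
counts = read_counts("187993\n7806633\n")
patched = set_yaml_scalar(yaml_text, "item_count", counts[0])
patched = set_yaml_scalar(patched, "sample_count", counts[1])
```

In a scheduled job, the same two functions would run on the file contents after preprocessing and before `python -m paddlerec.run`, writing `patched` back to config.yaml.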

ucasiggcas commented on August 28, 2024

Also, what if I need to change values in config.yaml? This format is cumbersome.
I think it would be better to take parameters directly through argparse.
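To make the suggestion concrete, here is a sketch of what such a wrapper could look like. The `--set KEY=VALUE` option is hypothetical and is not an existing PaddleRec flag; only `-m` mirrors the command used earlier in this thread.

```python
import argparse


def parse_overrides(argv):
    """Parse a config path plus repeated --set KEY=VALUE overrides."""
    parser = argparse.ArgumentParser(description="Config overrides (sketch)")
    parser.add_argument("-m", "--model", required=True, help="path to config.yaml")
    parser.add_argument(
        "--set",
        dest="overrides",
        action="append",
        default=[],
        metavar="KEY=VALUE",
        help="hypothetical option: override one config value, repeatable",
    )
    args = parser.parse_args(argv)

    overrides = {}
    for item in args.overrides:
        key, sep, value = item.partition("=")
        if not sep or not key:
            parser.error("expected KEY=VALUE, got %r" % item)
        overrides[key] = value  # values stay strings; type conversion is left to the caller
    return args.model, overrides
```

A wrapper script could apply the returned overrides to a temporary copy of the config and then launch training with that copy.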

ucasiggcas commented on August 28, 2024

/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py:789: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 256, in run
    self.context_process(self._context)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle_rec-0.1.0-py3.7.egg/paddlerec/core/trainer.py", line 217, in context_process
    self._status_processor[context['status']](context)
  File "core/trainers/general_trainer.py", line 113, in startup
    startup_class.startup(context)
  File "/data1/xulm1/PaddleRec/core/trainers/framework/startup.py", line 237, in startup
    context["exe"].run(startup_prog)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 790, in run
    six.reraise(*sys.exc_info())
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/six.py", line 696, in reraise
    raise value
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 785, in run
    use_program_cache=use_program_cache)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 838, in _run_impl
    use_program_cache=use_program_cache)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 912, in _run_program
    fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet: 

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
2   paddle::platform::DeviceContextPool::Get(paddle::platform::Place const&)
3   paddle::framework::GarbageCollector::GarbageCollector(paddle::platform::Place const&, unsigned long)
4   paddle::framework::UnsafeFastGPUGarbageCollector::UnsafeFastGPUGarbageCollector(paddle::platform::CUDAPlace const&, unsigned long)
5   paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
6   paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool, bool)

----------------------
Error Message Summary:
----------------------
Error: Place CUDAPlace(2) is not supported, Please check that your paddle compiles with WITH_GPU option or check that your train process hold the correct gpu_id if you use Executor at (/paddle/paddle/fluid/platform/device_context.cc:67)
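The error message names two things to check: whether Paddle was built with GPU support (`fluid.is_compiled_with_cuda()` reports this) and whether the process holds the GPU id it asks for. For the second, one possible cause (an assumption, not confirmed in this thread) is `CUDA_VISIBLE_DEVICES`: when it is set, the process sees the listed devices renumbered from 0, so a logical id of 2 is only valid if at least three devices are listed. A standard-library sketch of that rule, handling integer ids only:

```python
def visible_physical_ids(env_value):
    """Physical GPU ids exposed by a CUDA_VISIBLE_DEVICES value.

    Returns None when the variable is unset (all devices visible).
    Parsing stops at the first entry that is not a plain integer;
    UUID-style entries are not handled in this sketch.
    """
    if env_value is None:
        return None
    ids = []
    for token in env_value.split(","):
        token = token.strip()
        if not token.isdigit():
            break
        ids.append(int(token))
    return ids


def logical_id_is_valid(env_value, logical_id, physical_count):
    """True if the process can address `logical_id` under this setting."""
    ids = visible_physical_ids(env_value)
    if ids is None:
        return 0 <= logical_id < physical_count
    return 0 <= logical_id < len(ids)
```

For example, with `CUDA_VISIBLE_DEVICES=2` on a 4-GPU machine, logical id 0 maps to physical GPU 2 and logical id 2 does not exist. Passing `os.environ.get("CUDA_VISIBLE_DEVICES")` as `env_value` checks the current process.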

ucasiggcas commented on August 28, 2024

But in fact GPU 2 can be used:

>>> import paddle.fluid as fluid
>>> fluid.CUDAPlace(2)
<paddle.fluid.core_avx.CUDAPlace object at 0x7fcf4e938c30>
>>>

ucasiggcas commented on August 28, 2024

Using GPU 1 for both train and infer, with the GPU explicitly set to 1:

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 1. Cannot allocate 7.003248GB memory on GPU 1, available memory is only 2.751526GB.

Please check whether there is any other process using GPU 1.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model. 

 at (/paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:69)

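This suggests that the memory occupied during training is not released after training finishes.
Next I'll try training on GPU 1 and inference on GPU 0.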

ucasiggcas commented on August 28, 2024

Still not working. I don't know what I changed that I shouldn't have. This is exhausting.

----------------------
Error Message Summary:
----------------------
Error: Place CUDAPlace(0) is not supported, Please check that your paddle compiles with WITH_GPU option or check that your train process hold the correct gpu_id if you use Executor at (/paddle/paddle/fluid/platform/device_context.cc:67)

EnforceNotMet

This is still some way from being usable in practice.

ucasiggcas commented on August 28, 2024
