Comments (9)
Has fate-client been updated to version 2.1?
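Thanks, updating it resolved the problem above. I have now run into two new issues.
- The FedKSeed results I get on my server differ somewhat from the loss shown in the ipynb file, although zeroth-order optimization does indeed perform worse than Adam.
- For "submit federated task" in the ipynb, I am using the standalone deployment and hit the following error: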
ValueError: query job is failed, response={'code': 1001, 'message': 'No found job: job_id[202404070456598629540],role[guest],party_id[9999]'}
Is there anything I need to do before submitting? I have not modified the code in the ipynb.
Thank you for your reply!
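One thing worth checking with a "No found job" error is whether the role and party id the server looked up match the ones the notebook configured. The sketch below only parses the message text shown above and compares it with a dictionary of configured parties. The helper names and the example `configured_parties` values are hypothetical, and whether a mismatch is the actual cause here is an assumption.

```python
import re

# Pattern for the message text seen in the ValueError above.
NO_JOB_RE = re.compile(
    r"No found job: job_id\[(?P<job_id>\d+)\],"
    r"role\[(?P<role>\w+)\],party_id\[(?P<party_id>\d+)\]"
)


def parse_no_job_message(message):
    """Return job_id, role and party_id from a 'No found job' message, or None."""
    match = NO_JOB_RE.search(message)
    return match.groupdict() if match else None


def mismatched_party(message, configured_parties):
    """Return (role, configured_id, looked_up_id) when the ids differ, else None."""
    info = parse_no_job_message(message)
    if info is None:
        return None
    configured = configured_parties.get(info["role"])
    if configured is not None and configured != info["party_id"]:
        return (info["role"], configured, info["party_id"])
    return None


message = (
    "No found job: job_id[202404070456598629540],role[guest],party_id[9999]"
)
# Hypothetical notebook configuration, used only for illustration.
configured_parties = {"guest": "10000", "arbiter": "10000"}

parsed = parse_no_job_message(message)
mismatch = mismatched_party(message, configured_parties)
print(parsed)
print(mismatch)
```

If `mismatch` is not `None`, the party id in the pipeline script and the one the server expects are worth aligning before resubmitting.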
Hello. After I changed guest from guest = '10000' to guest = '9999', the task does start running. But after a few rounds it fails with the same error. Could you give me some guidance on this?
Could you provide the relevant logs?
Below is the fate_flow_sql.log for the run with GPT-2 and the Dolly data:
[INFO] [2024-05-02 02:28:28,417] [202405020228282821400] [52:140389814081280] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616908416, "f_status" = 'waiting' WHERE (((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'guest')) AND ("t_job"."f_party_id" = '9999')) AND ("t_job"."f_status" = 'ready'))
[INFO] [2024-05-02 02:28:28,437] [202405020228282821400] [52:140389814081280] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616908436, "f_status" = 'waiting' WHERE (((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'arbiter')) AND ("t_job"."f_party_id" = '10000')) AND ("t_job"."f_status" = 'ready'))
[INFO] [2024-05-02 02:28:28,450] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_job" SET "f_update_time" = 1714616908449, "f_status" = 'waiting' WHERE (("t_schedule_job"."f_job_id" = '202405020228282821400') AND ("t_schedule_job"."f_status" = 'ready'))
[INFO] [2024-05-02 02:28:30,154] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616910154, "f_status" = 'running' WHERE (((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'guest')) AND ("t_job"."f_party_id" = '9999')) AND ("t_job"."f_status" = 'waiting'))
[INFO] [2024-05-02 02:28:30,162] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616910162, "f_start_time" = 1714616910141 WHERE ((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'guest')) AND ("t_job"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:30,186] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616910185, "f_status" = 'running' WHERE (((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'arbiter')) AND ("t_job"."f_party_id" = '10000')) AND ("t_job"."f_status" = 'waiting'))
[INFO] [2024-05-02 02:28:30,194] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616910194, "f_start_time" = 1714616910174 WHERE ((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'arbiter')) AND ("t_job"."f_party_id" = '10000'))
[INFO] [2024-05-02 02:28:30,204] [202405020228282821400] [52:140391466657536] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_job" SET "f_update_time" = 1714616910204, "f_status" = 'running' WHERE (("t_schedule_job"."f_job_id" = '202405020228282821400') AND ("t_schedule_job"."f_status" = 'waiting'))
[INFO] [2024-05-02 02:28:30,256] [202405020228282821400] [52:140391466657536] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_task_status" SET "f_update_time" = 1714616910256, "f_status" = 'running' WHERE (((("t_schedule_task_status"."f_job_id" = '202405020228282821400') AND ("t_schedule_task_status"."f_task_id" = '202405020228282821400_reader_0')) AND ("t_schedule_task_status"."f_task_version" = 0)) AND ("t_schedule_task_status"."f_status" = 'waiting'))
[INFO] [2024-05-02 02:28:30,282] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616910281, "f_status" = 'running' WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_reader_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999')) AND ("t_task"."f_status" = 'waiting'))
[INFO] [2024-05-02 02:28:30,399] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616910399, "f_worker_id" = 'aa2d6440082b11ef91f40242ac110004', "f_cmd" = '["/data/projects/fate/env/python/venv/bin/python", "/data/projects/fate/fate_flow/python/fate_flow/manager/worker/fate_flow_executor.py", "component", "entrypoint", "--env-name", "FATE_TASK_CONFIG"]', "f_run_ip" = '127.0.0.1', "f_run_port" = 9380, "f_run_pid" = 87938, "f_start_time" = 1714616910392 WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_reader_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:30,413] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616910413, "f_party_status" = 'running' WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_reader_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999')) AND ("t_task"."f_party_status" = 'waiting'))
[INFO] [2024-05-02 02:28:30,428] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_task" SET "f_update_time" = 1714616910428, "f_status" = 'running' WHERE (((((("t_schedule_task"."f_job_id" = '202405020228282821400') AND ("t_schedule_task"."f_task_id" = '202405020228282821400_reader_0')) AND ("t_schedule_task"."f_task_version" = 0)) AND ("t_schedule_task"."f_role" = 'guest')) AND ("t_schedule_task"."f_party_id" = '9999')) AND ("t_schedule_task"."f_status" = 'waiting'))
[INFO] [2024-05-02 02:28:33,366] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616913366, "f_party_status" = 'success', "f_end_time" = 1714616913366, "f_elapsed" = 2974 WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_reader_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999')) AND ("t_task"."f_party_status" = 'running'))
[INFO] [2024-05-02 02:28:33,389] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_task" SET "f_update_time" = 1714616913389, "f_status" = 'success' WHERE (((((("t_schedule_task"."f_job_id" = '202405020228282821400') AND ("t_schedule_task"."f_task_id" = '202405020228282821400_reader_0')) AND ("t_schedule_task"."f_task_version" = 0)) AND ("t_schedule_task"."f_role" = 'guest')) AND ("t_schedule_task"."f_party_id" = '9999')) AND ("t_schedule_task"."f_status" = 'running'))
[INFO] [2024-05-02 02:28:34,535] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616914535, "f_status" = 'success', "f_end_time" = 1714616914535, "f_elapsed" = 4143 WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_reader_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999')) AND ("t_task"."f_status" = 'running'))
[INFO] [2024-05-02 02:28:34,546] [202405020228282821400] [52:140391466657536] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_task_status" SET "f_update_time" = 1714616914545, "f_status" = 'success' WHERE (((("t_schedule_task_status"."f_job_id" = '202405020228282821400') AND ("t_schedule_task_status"."f_task_id" = '202405020228282821400_reader_0')) AND ("t_schedule_task_status"."f_task_version" = 0)) AND ("t_schedule_task_status"."f_status" = 'running'))
[INFO] [2024-05-02 02:28:34,627] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616914627, "f_end_time" = 1714616914627, "f_elapsed" = 4235 WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_reader_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:34,659] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616914659, "f_kill_status" = 1 WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_reader_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:34,720] [202405020228282821400] [52:140391466657536] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_task_status" SET "f_update_time" = 1714616914720, "f_status" = 'running' WHERE (((("t_schedule_task_status"."f_job_id" = '202405020228282821400') AND ("t_schedule_task_status"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_schedule_task_status"."f_task_version" = 0)) AND ("t_schedule_task_status"."f_status" = 'waiting'))
[INFO] [2024-05-02 02:28:34,747] [202405020228282821400] [52:140389822473984] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616914746, "f_status" = 'running' WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999')) AND ("t_task"."f_status" = 'waiting'))
[INFO] [2024-05-02 02:28:34,775] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616914774, "f_status" = 'running' WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'arbiter')) AND ("t_task"."f_party_id" = '10000')) AND ("t_task"."f_status" = 'waiting'))
[INFO] [2024-05-02 02:28:34,904] [202405020228282821400] [52:140389822473984] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616914904, "f_worker_id" = 'acdd9c14082b11ef8eff0242ac110004', "f_cmd" = '["/data/projects/fate/env/python/venv/bin/python", "/data/projects/fate/fate_flow/python/fate_flow/manager/worker/fate_flow_executor.py", "component", "entrypoint", "--env-name", "FATE_TASK_CONFIG"]', "f_run_ip" = '127.0.0.1', "f_run_port" = 9380, "f_run_pid" = 88049, "f_start_time" = 1714616914901 WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:34,913] [202405020228282821400] [52:140389822473984] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616914913, "f_party_status" = 'running' WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999')) AND ("t_task"."f_party_status" = 'waiting'))
[INFO] [2024-05-02 02:28:34,928] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_task" SET "f_update_time" = 1714616914928, "f_status" = 'running' WHERE (((((("t_schedule_task"."f_job_id" = '202405020228282821400') AND ("t_schedule_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_schedule_task"."f_task_version" = 0)) AND ("t_schedule_task"."f_role" = 'guest')) AND ("t_schedule_task"."f_party_id" = '9999')) AND ("t_schedule_task"."f_status" = 'waiting'))
[INFO] [2024-05-02 02:28:35,034] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616915034, "f_worker_id" = 'acf06448082b11efae840242ac110004', "f_cmd" = '["/data/projects/fate/env/python/venv/bin/python", "/data/projects/fate/fate_flow/python/fate_flow/manager/worker/fate_flow_executor.py", "component", "entrypoint", "--env-name", "FATE_TASK_CONFIG"]', "f_run_ip" = '127.0.0.1', "f_run_port" = 9380, "f_run_pid" = 88089, "f_start_time" = 1714616915027 WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'arbiter')) AND ("t_task"."f_party_id" = '10000'))
[INFO] [2024-05-02 02:28:35,047] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616915047, "f_party_status" = 'running' WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'arbiter')) AND ("t_task"."f_party_id" = '10000')) AND ("t_task"."f_party_status" = 'waiting'))
[INFO] [2024-05-02 02:28:35,065] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_task" SET "f_update_time" = 1714616915065, "f_status" = 'running' WHERE (((((("t_schedule_task"."f_job_id" = '202405020228282821400') AND ("t_schedule_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_schedule_task"."f_task_version" = 0)) AND ("t_schedule_task"."f_role" = 'arbiter')) AND ("t_schedule_task"."f_party_id" = '10000')) AND ("t_schedule_task"."f_status" = 'waiting'))
[INFO] [2024-05-02 02:28:35,092] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616915091, "f_progress" = 50 WHERE (((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'guest')) AND ("t_job"."f_party_id" = '9999')) AND ("t_job"."f_progress" <= 50))
[INFO] [2024-05-02 02:28:35,112] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616915111, "f_progress" = 50 WHERE (((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'arbiter')) AND ("t_job"."f_party_id" = '10000')) AND ("t_job"."f_progress" <= 50))
[INFO] [2024-05-02 02:28:35,130] [202405020228282821400] [52:140391466657536] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_job" SET "f_update_time" = 1714616915130, "f_progress" = 50 WHERE (("t_schedule_job"."f_job_id" = '202405020228282821400') AND ("t_schedule_job"."f_progress" <= 50))
[INFO] [2024-05-02 02:28:48,968] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616928968, "f_party_status" = 'failed', "f_end_time" = 1714616928968, "f_elapsed" = 14067 WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999')) AND ("t_task"."f_party_status" = 'running'))
[INFO] [2024-05-02 02:28:48,991] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_task" SET "f_update_time" = 1714616928991, "f_status" = 'failed' WHERE (((((("t_schedule_task"."f_job_id" = '202405020228282821400') AND ("t_schedule_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_schedule_task"."f_task_version" = 0)) AND ("t_schedule_task"."f_role" = 'guest')) AND ("t_schedule_task"."f_party_id" = '9999')) AND ("t_schedule_task"."f_status" = 'running'))
[INFO] [2024-05-02 02:28:49,003] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616929002, "f_error_report" = 'Traceback (most recent call last):
File "/data/projects/fate/fate/python/fate/components/entrypoint/cli/component/execute_cli.py", line 147, in execute_component_from_config
component.execute(ctx, role, **execution_io.get_kwargs())
File "/data/projects/fate/fate/python/fate/components/core/component_desc/component.py", line 101, in execute
return self.callback(ctx, role, **kwargs)
File "/data/projects/fate/fate/python/fate/components/components/homo_nn.py", line 61, in train
train_procedure(
File "/data/projects/fate/fate/python/fate/components/components/nn/component_utils.py", line 155, in train_procedure
runner.train(train_data, validate_data_, output_dir, saved_model_path)
File "/data/projects/fate/fate/python/fate/components/components/nn/runner/homo_default_runner.py", line 270, in train
trainer.train()
File "/data/projects/fate/fate/python/fate_llm/fedkseed/fedkseed.py", line 123, in train
direction_derivative_history = self.train_once(
File "/data/projects/fate/fate/python/fate_llm/fedkseed/fedkseed.py", line 154, in train_once
trainer.train()
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1928, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/accelerate/data_loader.py", line 452, in __iter__
current_batch = next(dataloader_iter)
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/data/data_collator.py", line 45, in __call__
return self.torch_call(features)
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/data/data_collator.py", line 761, in torch_call
batch = pad_without_fast_tokenizer_warning(
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/data/data_collator.py", line 66, in pad_without_fast_tokenizer_warning
padded = tokenizer.pad(*pad_args, **pad_kwargs)
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3286, in pad
padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2734, in _get_padding_truncation_strategies
raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token
(tokenizer.pad_token = tokenizer.eos_token e.g.)
or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'})
.
' WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:49,395] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616929395, "f_status" = 'failed', "f_end_time" = 1714616929395, "f_elapsed" = 14494 WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999')) AND ("t_task"."f_status" = 'running'))
[INFO] [2024-05-02 02:28:49,422] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616929422, "f_status" = 'failed', "f_end_time" = 1714616929422, "f_elapsed" = 14395 WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'arbiter')) AND ("t_task"."f_party_id" = '10000')) AND ("t_task"."f_status" = 'running'))
[INFO] [2024-05-02 02:28:49,434] [202405020228282821400] [52:140391466657536] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_task_status" SET "f_update_time" = 1714616929433, "f_status" = 'failed' WHERE (((("t_schedule_task_status"."f_job_id" = '202405020228282821400') AND ("t_schedule_task_status"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_schedule_task_status"."f_task_version" = 0)) AND ("t_schedule_task_status"."f_status" = 'running'))
[INFO] [2024-05-02 02:28:49,520] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616929520, "f_end_time" = 1714616929520, "f_elapsed" = 14619 WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:49,544] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616929544, "f_kill_status" = 1 WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:49,627] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616929627, "f_party_status" = 'failed', "f_end_time" = 1714616929627, "f_elapsed" = 14600 WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'arbiter')) AND ("t_task"."f_party_id" = '10000')) AND ("t_task"."f_party_status" = 'running'))
[INFO] [2024-05-02 02:28:49,643] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_task" SET "f_update_time" = 1714616929643, "f_status" = 'failed' WHERE (((((("t_schedule_task"."f_job_id" = '202405020228282821400') AND ("t_schedule_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_schedule_task"."f_task_version" = 0)) AND ("t_schedule_task"."f_role" = 'arbiter')) AND ("t_schedule_task"."f_party_id" = '10000')) AND ("t_schedule_task"."f_status" = 'running'))
[INFO] [2024-05-02 02:28:49,654] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616929654, "f_kill_status" = 1 WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'arbiter')) AND ("t_task"."f_party_id" = '10000'))
[INFO] [2024-05-02 02:28:49,671] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616929671, "f_progress" = 100 WHERE (((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'guest')) AND ("t_job"."f_party_id" = '9999')) AND ("t_job"."f_progress" <= 100))
[INFO] [2024-05-02 02:28:49,689] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616929689, "f_progress" = 100 WHERE (((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'arbiter')) AND ("t_job"."f_party_id" = '10000')) AND ("t_job"."f_progress" <= 100))
[INFO] [2024-05-02 02:28:49,707] [202405020228282821400] [52:140391466657536] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_job" SET "f_update_time" = 1714616929706, "f_progress" = 100 WHERE (("t_schedule_job"."f_job_id" = '202405020228282821400') AND ("t_schedule_job"."f_progress" <= 100))
[INFO] [2024-05-02 02:28:49,723] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616929723, "f_status" = 'failed', "f_end_time" = 1714616929723, "f_elapsed" = 19582 WHERE (((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'guest')) AND ("t_job"."f_party_id" = '9999')) AND ("t_job"."f_status" = 'running'))
[INFO] [2024-05-02 02:28:49,732] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616929731, "f_tag" = 'job_end', "f_end_time" = 1714616929731, "f_elapsed" = 19590 WHERE ((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'guest')) AND ("t_job"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:49,758] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616929758, "f_status" = 'failed', "f_end_time" = 1714616929758, "f_elapsed" = 19584 WHERE (((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'arbiter')) AND ("t_job"."f_party_id" = '10000')) AND ("t_job"."f_status" = 'running'))
[INFO] [2024-05-02 02:28:49,768] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616929768, "f_tag" = 'job_end', "f_end_time" = 1714616929768, "f_elapsed" = 19594 WHERE ((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'arbiter')) AND ("t_job"."f_party_id" = '10000'))
[INFO] [2024-05-02 02:28:49,792] [202405020228282821400] [52:140391466657536] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_job" SET "f_update_time" = 1714616929792, "f_status" = 'failed' WHERE ((("t_schedule_job"."f_job_id" = '202405020228282821400') AND ("t_schedule_job"."f_rerun_signal" = 0)) AND ("t_schedule_job"."f_status" = 'running'))
[INFO] [2024-05-02 02:28:49,800] [202405020228282821400] [52:140391466657536] - [base_saver.execute_update] [line:223]: UPDATE "t_schedule_job" SET "f_update_time" = 1714616929800, "f_tag" = 'job_end' WHERE ("t_schedule_job"."f_job_id" = '202405020228282821400')
[INFO] [2024-05-02 02:28:49,886] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616929886, "f_end_time" = 1714616929886, "f_elapsed" = 14985 WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:49,911] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616929911, "f_kill_status" = 1 WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:49,916] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616929916, "f_end_time" = 1714616929916, "f_elapsed" = 19775 WHERE ((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'guest')) AND ("t_job"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:49,920] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616929920, "f_tag" = 'job_end', "f_end_time" = 1714616929920, "f_elapsed" = 19779 WHERE ((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'guest')) AND ("t_job"."f_party_id" = '9999'))
[INFO] [2024-05-02 02:28:50,002] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616930002, "f_end_time" = 1714616930002, "f_elapsed" = 14975 WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'arbiter')) AND ("t_task"."f_party_id" = '10000'))
[INFO] [2024-05-02 02:28:50,023] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616930023, "f_kill_status" = 1 WHERE ((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'arbiter')) AND ("t_task"."f_party_id" = '10000'))
[INFO] [2024-05-02 02:28:50,028] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616930028, "f_end_time" = 1714616930028, "f_elapsed" = 19854 WHERE ((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'arbiter')) AND ("t_job"."f_party_id" = '10000'))
[INFO] [2024-05-02 02:28:50,033] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616930033, "f_tag" = 'job_end', "f_end_time" = 1714616930033, "f_elapsed" = 19859 WHERE ((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'arbiter')) AND ("t_job"."f_party_id" = '10000'))
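A log like this is easier to read once it is filtered down to the updates that set a status to 'failed'. The following is a minimal sketch that assumes the record layout seen in the excerpt above (an `[INFO] [timestamp] [job_id]` header followed by an `UPDATE ... SET ... WHERE ...` statement). The function name `failed_updates` is hypothetical, and the sample lines are copied from the log above.

```python
import re

# Header of each record: [LEVEL] [timestamp] [job_id] ...
RECORD_RE = re.compile(r"^\[(?P<level>\w+)\] \[(?P<ts>[^\]]+)\] \[(?P<job_id>\d+)\]")
TABLE_RE = re.compile(r'UPDATE "(?P<table>\w+)" SET')
# Only f_status and f_party_status are considered, not f_kill_status.
SET_STATUS_RE = re.compile(r'"f_(?:party_)?status" = \'(?P<status>\w+)\'')
TASK_RE = re.compile(r'"f_task_id" = \'(?P<task_id>[^\']+)\'')
ROLE_RE = re.compile(r'"f_role" = \'(?P<role>\w+)\'')


def failed_updates(lines):
    """Return (timestamp, table, task_id, role) for each update that sets a status to 'failed'."""
    results = []
    for line in lines:
        header = RECORD_RE.match(line)
        if header is None:
            continue  # traceback continuation lines have no record header
        # Look for the new status only in the SET clause, not in the WHERE filter.
        set_part, _, where_part = line.partition(" WHERE ")
        status = SET_STATUS_RE.search(set_part)
        if status is None or status.group("status") != "failed":
            continue
        table = TABLE_RE.search(set_part)
        task = TASK_RE.search(where_part)
        role = ROLE_RE.search(where_part)
        results.append((
            header.group("ts"),
            table.group("table") if table else None,
            task.group("task_id") if task else None,
            role.group("role") if role else None,
        ))
    return results


SAMPLE = '''
[INFO] [2024-05-02 02:28:35,092] [202405020228282821400] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_job" SET "f_update_time" = 1714616915091, "f_progress" = 50 WHERE (((("t_job"."f_job_id" = '202405020228282821400') AND ("t_job"."f_role" = 'guest')) AND ("t_job"."f_party_id" = '9999')) AND ("t_job"."f_progress" <= 50))
[INFO] [2024-05-02 02:28:48,968] [202405020228282821400] [52:140389830866688] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714616928968, "f_party_status" = 'failed', "f_end_time" = 1714616928968, "f_elapsed" = 14067 WHERE (((((("t_task"."f_job_id" = '202405020228282821400') AND ("t_task"."f_task_id" = '202405020228282821400_nn_0')) AND ("t_task"."f_task_version" = 0)) AND ("t_task"."f_role" = 'guest')) AND ("t_task"."f_party_id" = '9999')) AND ("t_task"."f_party_status" = 'running'))
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token
'''.strip().splitlines()

failures = failed_updates(SAMPLE)
for entry in failures:
    print(entry)
```

In this log the first failure is the guest side of the nn_0 task, and the `f_error_report` written right after it carries the tokenizer padding error.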
Here is my exact code:
import time
from fate_client.pipeline.components.fate.reader import Reader
from fate_client.pipeline import FateFlowPipeline
from fate_client.pipeline.components.fate.homo_nn import HomoNN, get_config_of_seq2seq_runner
from fate_client.pipeline.components.fate.nn.algo_params import TrainingArguments, FedAVGArguments
from fate_client.pipeline.components.fate.nn.loader import LLMModelLoader, LLMDatasetLoader, LLMDataFuncLoader

guest = '10000'
host = '10000'
arbiter = '10000'
epochs = 0.01
batch_size = 1
lr = 1e-5

pipeline = FateFlowPipeline().set_parties(guest=guest, arbiter=arbiter)
pipeline.bind_local_path(path="/data/projects/fate/examples/data/dolly", namespace="experiment",
                         name="dolly")
time.sleep(5)

reader_0 = Reader("reader_0", runtime_parties=dict(guest=guest, host=host))
reader_0.guest.task_parameters(
    namespace="experiment",
    name="dolly"
)
reader_0.hosts[0].task_parameters(
    namespace="experiment",
    name="dolly"
)

tokenizer_params = dict(
    pretrained_model_name_or_path="gpt2",
    trust_remote_code=True,
)
conf = get_config_of_seq2seq_runner(
    algo='fedkseed',
    model=LLMModelLoader(
        "hf_model",
        "HFAutoModelForCausalLM",
        # pretrained_model_name_or_path="datajuicer/LLaMA-1B-dj-refine-150B",
        pretrained_model_name_or_path="gpt2",
        trust_remote_code=True
    ),
    dataset=LLMDatasetLoader(
        "hf_dataset",
        "Dolly15K",
        split="train",
        tokenizer_params=tokenizer_params,
        tokenizer_apply_params=dict(
            truncation=True,
            max_length=1024,
        )),
    data_collator=LLMDataFuncLoader(
        "cust_func.cust_data_collator",
        "get_seq2seq_tokenizer",
        tokenizer_params=tokenizer_params,
    ),
    training_args=TrainingArguments(
        num_train_epochs=0.01,
        per_device_train_batch_size=batch_size,
        remove_unused_columns=True,
        learning_rate=lr,
        fp16=False,
        use_cpu=False,
        disable_tqdm=False,
        use_mps_device=True,
    ),
    fed_args=FedAVGArguments(),
    task_type='causal_lm',
    save_trainable_weights_only=True,
)
conf["fed_args_conf"] = {}

homo_nn_0 = HomoNN(
    'nn_0',
    runner_conf=conf,
    train_data=reader_0.outputs["output_data"],
    runner_module="fedkseed_runner",
    runner_class="FedKSeedRunner",
)

pipeline.add_tasks([reader_0, homo_nn_0])
pipeline.conf.set("task", dict(engine_run={"cores": 1}))
pipeline.compile()
pipeline.fit()
from fate.
I changed tokenizer_params as follows, which resolved the problem above:
tokenizer_params = dict(
    pretrained_model_name_or_path="gpt2",
    trust_remote_code=True,
    pad_token="<|endoftext|>"  # add a pad_token
)
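For context on why a pad token matters: the GPT-2 tokenizer ships without one, so there is nothing to pad with when sequences of unequal length are batched, which is presumably what the change above works around. The sketch below is plain Python for illustration only, not FATE or transformers code. `pad_batch` is a hypothetical helper, the input ids are arbitrary, and 50256 is the id of `<|endoftext|>` in the GPT-2 vocabulary.

```python
from typing import List, Tuple

GPT2_EOS_ID = 50256  # id of "<|endoftext|>" in the GPT-2 vocabulary


def pad_batch(
    sequences: List[List[int]], pad_id: int = GPT2_EOS_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token-id lists to a common length and build attention masks.

    Hypothetical helper for illustration; a real data collator does more.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Padded positions reuse the end-of-text id and are masked out with 0.
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[101, 102, 103], [101]])
```

Because the padded positions carry a 0 in the attention mask, reusing the end-of-text token as padding does not change what the model attends to.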
from fate.
But I now run into the new problem below. Could you help explain it?
[INFO] [2024-05-02 03:36:43,059] [202405020336083952670] [52:140389839259392] - [base_saver.execute_update] [line:223]: UPDATE "t_task" SET "f_update_time" = 1714621003059, "f_error_report" = 'Traceback (most recent call last):
File "/data/projects/fate/fate/python/fate/components/entrypoint/cli/component/execute_cli.py", line 147, in execute_component_from_config
component.execute(ctx, role, **execution_io.get_kwargs())
File "/data/projects/fate/fate/python/fate/components/core/component_desc/component.py", line 101, in execute
return self.callback(ctx, role, **kwargs)
File "/data/projects/fate/fate/python/fate/components/components/homo_nn.py", line 61, in train
train_procedure(
File "/data/projects/fate/fate/python/fate/components/components/nn/component_utils.py", line 155, in train_procedure
runner.train(train_data, validate_data_, output_dir, saved_model_path)
File "/data/projects/fate/fate/python/fate/components/components/nn/runner/homo_default_runner.py", line 270, in train
trainer.train()
File "/data/projects/fate/fate/python/fate_llm/fedkseed/fedkseed.py", line 123, in train
direction_derivative_history = self.train_once(
File "/data/projects/fate/fate/python/fate_llm/fedkseed/fedkseed.py", line 154, in train_once
trainer.train()
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/data/projects/fate/fate/python/fate_llm/fedkseed/trainer.py", line 96, in training_step
loss = self._kseed_optimizer.kseed_zeroth_order_step(closure=closure)
File "/data/projects/fate/fate/python/fate_llm/fedkseed/optimizer.py", line 228, in kseed_zeroth_order_step
directional_derivative_value, loss_right, loss_left = self.zeroth_order_step(seed, closure)
File "/data/projects/fate/fate/python/fate_llm/fedkseed/optimizer.py", line 129, in zeroth_order_step
loss_right = closure()
File "/data/projects/fate/fate/python/fate_llm/fedkseed/trainer.py", line 90, in closure
return self.compute_loss(model, inputs, return_outputs=False).detach()
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2925, in compute_loss
outputs = model(**inputs)
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 175, in forward
inputs, module_kwargs = self.scatter(inputs, kwargs, self.device_ids)
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 197, in scatter
return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 74, in scatter_kwargs
scattered_kwargs = scatter(kwargs, target_gpus, dim) if kwargs else []
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 60, in scatter
res = scatter_map(inputs)
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 51, in scatter_map
return [type(obj)(i) for i in zip(*map(scatter_map, obj.items()))]
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 47, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 43, in scatter_map
return Scatter.apply(target_gpus, None, dim, obj)
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 96, in forward
outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 187, in scatter
return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: peer mapping resources exhausted
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
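A note on the traceback above: it passes through torch/nn/parallel/data_parallel.py, which suggests the Hugging Face Trainer wrapped the model in DataParallel because more than one GPU was visible. "peer mapping resources exhausted" is commonly reported when a single process tries to set up peer access across many GPUs. One possible workaround, offered as an untested sketch and not a confirmed fix for this issue, is to expose only one GPU to the training process. `restrict_visible_gpus` is a hypothetical helper. In a FATE deployment the training task runs in a worker process started by FATE Flow, so it is an assumption that the variable would need to be set in the environment that launches that service.

```python
import os


def restrict_visible_gpus(device_ids: str = "0") -> str:
    """Expose only the listed GPUs to CUDA for this process and its children.

    Hypothetical helper. It must run before torch initializes CUDA,
    otherwise the setting has no effect on the current process.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return os.environ["CUDA_VISIBLE_DEVICES"]


# With a single visible device there is nothing to scatter across,
# so DataParallel would not be used.
visible = restrict_visible_gpus("0")
```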
from fate.