(python3.8) wzp@vastai-NF5468M6:~/code/LLMData/open_source/data-juicer$ python tools/process_data.py --config configs/demo/process.yaml --executor_type ray
<class 'list'>
<class 'list'>
2023-11-30 15:32:52 | WARNING | data_juicer.config.config:329 - Cache management of datasets is disabled.
2023-11-30 15:32:52 | WARNING | data_juicer.config.config:340 - Set temp directory to store temp files to [None].
2023-11-30 15:32:52 | INFO | data_juicer.config.config:442 - Back up the input config file [/home/wzp/code/LLMData/open_source/data-juicer/configs/demo/process.yaml] into the work_dir [./outputs/demo-process]
2023-11-30 15:32:52 | INFO | data_juicer.config.config:463 - Configuration table:
โโโโโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ key โ values โ
โโโโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโก
โ config โ [Path_fr(configs/demo/process.yaml, cwd=/home/wzp/code/LLMData/open_source/data-juicer)] โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ hpo_config โ None โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ project_name โ 'demo-process' โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ executor_type โ 'ray' โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ dataset_path โ 'demos/data/demo-dataset.jsonl' โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ export_path โ './outputs/demo-process/demo-processed.jsonl' โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ export_shard_size โ 0 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ export_in_parallel โ False โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ np โ 4 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ text_keys โ 'text' โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ image_key โ 'images' โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ image_special_token โ '<__dj__image>' โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ eoc_special_token โ '<|__dj__eoc|>' โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ suffixes โ [] โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ use_cache โ False โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ ds_cache_dir โ PosixPath('/home/wzp/.cache/huggingface/datasets') โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ cache_compress โ None โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ use_checkpoint โ False โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ temp_dir โ None โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ open_tracer โ False โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ op_list_to_trace โ [] โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ trace_num โ 10 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ op_fusion โ False โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ process โ [{'language_id_score_filter': {'image_key': 'images', โ
โ โ 'lang': 'zh', โ
โ โ 'min_score': 0.8, โ
โ โ 'text_key': 'text'}}] โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ save_stats_in_one_file โ True โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ ray_address โ 'auto' โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ work_dir โ './outputs/demo-process' โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ timestamp โ '20231130153252' โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ dataset_dir โ '/home/wzp/code/LLMData/open_source/data-juicer/demos/data' โ
โโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ add_suffix โ False โ
โโโโโโโโโโโโโโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
2023-11-30 15:32:53 | INFO | data_juicer.core.ray_executor:35 - Initing Ray ...
2023-11-30 15:32:53,326 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.23.4.252:6379...
2023-11-30 15:32:53,333 INFO worker.py:1642 -- Connected to Ray cluster.
2023-11-30 15:32:53 | INFO | data_juicer.core.ray_executor:47 - Loading dataset with Ray...
2023-11-30 15:32:54,324 INFO read_api.py:406 -- To satisfy the requested parallelism of 192, each read task output is split into 192 smaller blocks.
2023-11-30 15:32:54 | INFO | data_juicer.core.ray_executor:51 - Preparing process operators...
2023-11-30 15:32:54 | INFO | data_juicer.utils.model_utils:87 - Loading fasttext language identification model...
2023-11-30 15:32:54 | INFO | data_juicer.core.ray_executor:59 - columns ['text', 'meta']
2023-11-30 15:32:54 | INFO | data_juicer.core.ray_executor:62 - Processing data...
2023-11-30 15:32:54,702 INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON->SplitBlocks(192)] -> TaskPoolMapOperator[MapBatches(process_batch)->Map(compute_stats)->Filter(process)]
2023-11-30 15:32:54,702 INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-11-30 15:32:54,702 INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48497) 2023-11-30 15:33:00.312 | ERROR | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later.
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48497) 2023-11-30 15:33:00.363 | ERROR | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later.
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48497) 2023-11-30 15:33:01.362 | ERROR | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later.
--- Logging error in Loguru Handler #1 ---
Record was: {'elapsed': datetime.timedelta(seconds=13, microseconds=527897), 'exception': (type=<class 'ray.exceptions.RayTaskError(ValueError)'>, value=RayTaskError(ValueError)(ValueError('Model not loaded. Please retry later.')), traceback=<traceback object at 0x7f298c1324c0>), 'extra': {}, 'file': (name='process_data.py', path='tools/process_data.py'), 'function': '<module>', 'level': (name='ERROR', no=40, icon='โ'), 'line': 19, 'message': "An error has been caught in function '<module>', process 'MainProcess' (48135), thread 'MainThread' (139830410995520):", 'module': 'process_data', 'name': '__main__', 'process': (id=48135, name='MainProcess'), 'thread': (id=139830410995520, name='MainThread'), 'time': datetime(2023, 11, 30, 15, 33, 2, 531776, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800), 'CST'))}
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 345, in ray._raylet.StreamingObjectRefGenerator._next_sync
File "python/ray/_raylet.pyx", line 4533, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
File "python/ray/_raylet.pyx", line 443, in ray._raylet.check_status
ray.exceptions.ObjectRefStreamEndOfStreamError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 80, in on_waitable_ready
meta = ray.get(next(self._streaming_gen))
File "python/ray/_raylet.pyx", line 300, in ray._raylet.StreamingObjectRefGenerator.__next__
File "python/ray/_raylet.pyx", line 351, in ray._raylet.StreamingObjectRefGenerator._next_sync
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/loguru/_logger.py", line 1277, in catch_wrapper
return function(*args, **kwargs)
File "tools/process_data.py", line 15, in main
executor.run()
File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/ray_executor.py", line 83, in run
logger.info(f'Op [{op_name}] Done. Left '
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 2498, in count
[get_num_rows.remote(block) for block in self.get_internal_block_refs()]
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 4799, in get_internal_block_refs
blocks = self._plan.execute().get_blocks()
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/plan.py", line 591, in execute
blocks = execute_to_legacy_block_list(
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 119, in execute_to_legacy_block_list
block_list = _bundles_to_block_list(bundles)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 357, in _bundles_to_block_list
for ref_bundle in bundles:
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
return self.get_next()
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 129, in get_next
raise item
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 187, in run
while self._scheduling_loop_step(self._topology) and not self._shutdown:
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 235, in _scheduling_loop_step
process_completed_tasks(topology)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 333, in process_completed_tasks
active_tasks[ref].on_waitable_ready()
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 88, in on_waitable_ready
ex = ray.get(block_ref)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/worker.py", line 2547, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::MapBatches(process_batch)->Map(compute_stats)->Filter(process)() (pid=48715, ip=10.23.4.252)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 405, in _map_task
for b_out in map_transformer.apply_transform(iter(blocks), ctx):
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 345, in __call__
for data in iter:
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 171, in __call__
yield from self._row_fn(input, ctx)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 256, in transform_fn
for row in rows:
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 223, in __call__
for block in blocks:
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 345, in __call__
for data in iter:
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 171, in __call__
yield from self._row_fn(input, ctx)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 233, in transform_fn
out_row = fn(row)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 119, in fn
return op_fn(item, *fn_args, **fn_kwargs)
File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 53, in compute_stats
raise ValueError(err_msg)
ValueError: Model not loaded. Please retry later.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/loguru/_handler.py", line 204, in emit
self._queue.put(str_record)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/multiprocessing/queues.py", line 362, in put
obj = _ForkingPickler.dumps(obj)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'ray.exceptions.RayTaskError(ValueError)'>: attribute lookup RayTaskError(ValueError) on ray.exceptions failed
--- End of logging error ---
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48713) 2023-11-30 15:33:02.374 | ERROR | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later. [repeated 9x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)