
modelscope / data-juicer


A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️ 🍸 🍹 🍷

License: Apache License 2.0

Python 99.80% Shell 0.15% Dockerfile 0.05%
data-analysis data-science dataset large-language-models llm nlp chinese data-visualization opendata gpt

data-juicer's Introduction



Introduction

ModelScope is built upon the notion of "Model-as-a-Service" (MaaS). It seeks to bring together the most advanced machine learning models from the AI community and to streamline the process of leveraging AI models in real-world applications. The core ModelScope library open-sourced in this repository provides the interfaces and implementations that allow developers to perform model inference, training and evaluation.

In particular, with rich layers of API abstraction, the ModelScope library offers a unified experience for exploring state-of-the-art models across domains such as CV, NLP, Speech, Multi-Modality, and Scientific Computation. Model contributors from different areas can integrate models into the ModelScope ecosystem through the layered APIs, allowing easy and unified access to their models. Once integrated, model inference, fine-tuning, and evaluation can be done with only a few lines of code. At the same time, the library stays flexible, so that different components of a model application can be customized wherever necessary.

Apart from harboring implementations of a wide range of different models, ModelScope library also enables the necessary interactions with ModelScope backend services, particularly with the Model-Hub and Dataset-Hub. Such interactions facilitate management of various entities (models and datasets) to be performed seamlessly under-the-hood, including entity lookup, version control, cache management, and many others.

Models and Online Accessibility

Hundreds of models (700+ and counting) are publicly available on ModelScope, covering the latest developments in areas such as NLP, CV, Audio, Multi-modality, and AI for Science. Many of these models represent the SOTA in their specific fields and made their open-source debut on ModelScope. Users can visit ModelScope (modelscope.cn) and try these models online with just a few clicks. An immediate developer experience is also available through the ModelScope Notebook, which is backed by a ready-to-use CPU/GPU development environment in the cloud, only one click away on ModelScope.



Some representative examples include:

NLP:

Multi-Modal:

CV:

Audio:

AI for Science:

Note: Most models on ModelScope are public and can be downloaded without registering an account on the ModelScope website (www.modelscope.cn). Please refer to the instructions for model download to download models with the API provided by the modelscope library or with git.

QuickTour

We provide a unified interface for inference using pipeline, and for fine-tuning and evaluation using Trainer, across different tasks.

For any given task with any type of input (image, text, audio, video...), an inference pipeline can be implemented with only a few lines of code. The pipeline automatically loads the underlying model to produce the inference result, as exemplified below:

>>> from modelscope.pipelines import pipeline
>>> word_segmentation = pipeline('word-segmentation',model='damo/nlp_structbert_word-segmentation_chinese-base')
>>> word_segmentation('今天天气不错，适合出去游玩')
{'output': '今天 天气 不错 ， 适合 出去 游玩'}

Given an image, portrait matting (a.k.a. background removal) can be accomplished with the following code snippet:

>>> import cv2
>>> from modelscope.pipelines import pipeline

>>> portrait_matting = pipeline('portrait-matting')
>>> result = portrait_matting('https://modelscope.oss-cn-beijing.aliyuncs.com/test/images/image_matting.png')
>>> cv2.imwrite('result.png', result['output_img'])

The output image with the background removed is written to result.png.

Fine-tuning and evaluation can also be done with a few more lines of code to set up the training dataset and trainer. The heavy lifting of training and evaluating a model is encapsulated in the implementation of the trainer.train() and trainer.evaluate() interfaces.

For example, the GPT-3 base model (1.3B) can be fine-tuned with the chinese-poetry dataset, resulting in a model that can be used for Chinese poetry generation.

>>> from modelscope.metainfo import Trainers
>>> from modelscope.msdatasets import MsDataset
>>> from modelscope.trainers import build_trainer

>>> train_dataset = MsDataset.load('chinese-poetry-collection', split='train').remap_columns({'text1': 'src_txt'})
>>> eval_dataset = MsDataset.load('chinese-poetry-collection', split='test').remap_columns({'text1': 'src_txt'})
>>> max_epochs = 10
>>> tmp_dir = './gpt3_poetry'

>>> kwargs = dict(
...     model='damo/nlp_gpt3_text-generation_1.3B',
...     train_dataset=train_dataset,
...     eval_dataset=eval_dataset,
...     max_epochs=max_epochs,
...     work_dir=tmp_dir)

>>> trainer = build_trainer(name=Trainers.gpt3_trainer, default_args=kwargs)
>>> trainer.train()

Why should I use ModelScope library

  1. A unified and concise user interface is abstracted for different tasks and different models. Model inference and training can be implemented in as few as 3 and 10 lines of code, respectively. This makes it convenient for users to explore models in different fields in the ModelScope community. All models integrated into ModelScope are ready to use, which makes it easy to get started with AI in both educational and industrial settings.

  2. ModelScope offers a model-centric development and application experience. It streamlines support for model training, inference, export and deployment, and helps users build their own MLOps on top of the ModelScope ecosystem.

  3. For the model inference and training process, a modular design is in place, together with a wealth of functional module implementations, so that users can conveniently customize their own inference, training and other processes.

  4. For distributed model training, especially for large models, ModelScope provides rich training strategy support, including data parallel, model parallel, hybrid parallel and so on.

Installation

Docker

The ModelScope library currently supports popular deep learning frameworks for model training and inference, including PyTorch, TensorFlow and ONNX. All releases are tested and run on Python 3.7+, PyTorch 1.8+, and TensorFlow 1.15 or TensorFlow 2.0+.

To allow out-of-the-box usage of all the models on ModelScope, official docker images are provided for all releases. With a docker image, developers can skip environment installation and configuration entirely and use the library directly. Currently, the latest versions of the CPU image and GPU image can be obtained from:

CPU docker image

# py37
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-py37-torch1.11.0-tf1.15.5-1.6.1

# py38
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-py38-torch2.0.1-tf2.13.0-1.9.5

GPU docker image

# py37
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.3.0-py37-torch1.11.0-tf1.15.5-1.6.1

# py38
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.8.0-py38-torch2.0.1-tf2.13.0-1.9.5

Setup Local Python Environment

One can also set up a local ModelScope environment using pip and conda. ModelScope supports Python 3.7 and above. We suggest Anaconda for creating a local Python environment:

conda create -n modelscope python=3.8
conda activate modelscope

PyTorch or TensorFlow can be installed separately according to each model's requirements.

  • Install pytorch doc
  • Install tensorflow doc

After installing the necessary machine-learning framework, you can install modelscope library as follows:

If you only want to play around with the modelscope framework, or try out model/dataset download, you can install the core modelscope components:

pip install modelscope

If you want to use multi-modal models:

pip install modelscope[multi-modal]

If you want to use nlp models:

pip install modelscope[nlp] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

If you want to use cv models:

pip install modelscope[cv] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

If you want to use audio models:

pip install modelscope[audio] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

If you want to use science models:

pip install modelscope[science] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Notes:

  1. Currently, some audio-task models only support Python 3.7 and TensorFlow 1.15.4 in Linux environments. Most other models can be installed and used on Windows and Mac (x86).

  2. Some models in the audio field use the third-party library SoundFile for wav file processing. On Linux, users need to manually install libsndfile, which SoundFile depends on (doc link). On Windows and macOS, it is installed automatically without user intervention. For example, on Ubuntu, you can use the following commands:

    sudo apt-get update
    sudo apt-get install libsndfile1
  3. Some models in computer vision need mmcv-full. You can refer to the mmcv installation guide; a minimal installation is as follows:

    pip uninstall mmcv # if you have installed mmcv, uninstall it
    pip install -U openmim
    mim install mmcv-full

Learn More

We provide additional documentation, including:

License

This project is licensed under the Apache License (Version 2.0).

data-juicer's People

Contributors

alibaba-oss, beachwang, cathy0908, chenhesen, chg0901, co63oc, drcege, garyzhang99, hylcool, jongsky, lingzhq, liuyanyi, pan-x-c, shiweijiezero, xieyxclack, xuruidong, yxdyc, zhenqincn, zhijianma


data-juicer's Issues

[Bug]: Unable to install simhash-py

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

Ubuntu

Installation Method

from source

Data-Juicer Version

No response

Python Version

3.9

Describe the bug

/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/Cython/Compiler/Main.py:381: FutureWarning: Cython directive 'language_level' not set, using '3str' for now (Py3). This has changed from earlier releases! File: /tmp/pip-install-642i4337/simhash-py_707a16c4f0d24a878223a92ec6376dbd/simhash/simhash.pyx
tree = Parsing.p_module(s, pxd, full_module_name)

Error compiling Cython file:

...
import hashlib
import struct

from simhash cimport compute as c_compute
^

simhash/simhash.pyx:4:0: 'simhash/compute.pxd' not found

Error compiling Cython file:

...
import hashlib
import struct

from simhash cimport compute as c_compute
from simhash cimport find_all as c_find_all
^

simhash/simhash.pyx:5:0: 'simhash/find_all.pxd' not found

Error compiling Cython file:

...
Find the set of all matches within the provided vector of hashes.

  The provided hashes are manipulated in place, but upon completion are
  restored to their original state.
  '''
  cdef matches_t results_set = c_find_all(hashes, number_of_blocks, different_bits)
       ^

simhash/simhash.pyx:26:9: 'matches_t' is not a type identifier

Error compiling Cython file:

...

  The provided hashes are manipulated in place, but upon completion are
  restored to their original state.
  '''
  cdef matches_t results_set = c_find_all(hashes, number_of_blocks, different_bits)
  cdef vector[match_t] results_vector
       ^

simhash/simhash.pyx:27:9: 'vector' is not a type identifier

Error compiling Cython file:

...

  The provided hashes are manipulated in place, but upon completion are
  restored to their original state.
  '''
  cdef matches_t results_set = c_find_all(hashes, number_of_blocks, different_bits)
  cdef vector[match_t] results_vector
       ^

simhash/simhash.pyx:27:9: 'vector' is not a type identifier

Error compiling Cython file:

...
# Unpacks the binary bytes in digest into a Python integer
return struct.unpack('>Q', digest)[0] & 0xFFFFFFFFFFFFFFFF

def compute(hashes):
'''Compute the simhash of a vector of hashes.'''
return c_compute(hashes)
^

simhash/simhash.pyx:17:11: 'c_compute' is not a constant, variable or function identifier

Error compiling Cython file:

...
Find the set of all matches within the provided vector of hashes.

  The provided hashes are manipulated in place, but upon completion are
  restored to their original state.
  '''
  cdef matches_t results_set = c_find_all(hashes, number_of_blocks, different_bits)
                               ^

simhash/simhash.pyx:26:33: 'c_find_all' is not a constant, variable or function identifier
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/tmp/pip-install-642i4337/simhash-py_707a16c4f0d24a878223a92ec6376dbd/setup.py", line 38, in
setup(
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 148, in setup
return run_commands(dist)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 163, in run_commands
dist.run_commands()
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 967, in run_commands
self.run_command(cmd)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/wheel/bdist_wheel.py", line 364, in run
self.run_command("build")
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 448, in build_extensions
self._build_extensions_serial()
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 473, in _build_extensions_serial
self.build_extension(ext)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/Cython/Distutils/build_ext.py", line 130, in build_extension
new_ext = cythonize(
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/Cython/Build/Dependencies.py", line 1154, in cythonize
cythonize_one(*args)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/Cython/Build/Dependencies.py", line 1321, in cythonize_one
raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: simhash/simhash.pyx
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
full command: /home/kemove/anaconda3/envs/sakura/bin/python -u -c '
exec(compile('"'"''"'"''"'"'

This is -- a caller that pip uses to run setup.py

- It imports setuptools before invoking setup.py, to enable projects that directly

import from distutils.core to work with newer packaging standards.

- It provides a clear error message when setuptools is not installed.

- It sets sys.argv[0] to the underlying setup.py, when invoking setup.py so

setuptools doesn'"'"'t think the script is -c. This avoids the following warning:

manifest_maker: standard file '"'"'-c'"'"' not found".

- It generates a shim setup.py, for handling setup.cfg-only projects.

import os, sys, tokenize

try:
import setuptools
except ImportError as error:
print(
"ERROR: Can not execute setup.py since setuptools is not available in "
"the build environment.",
file=sys.stderr,
)
sys.exit(1)

file = %r
sys.argv[0] = file

if os.path.exists(file):
filename = file
with tokenize.open(file) as f:
setup_py_code = f.read()
else:
filename = ""
setup_py_code = "from setuptools import setup; setup()"

exec(compile(setup_py_code, filename, "exec"))
'"'"''"'"''"'"' % ('"'"'/tmp/pip-install-642i4337/simhash-py_707a16c4f0d24a878223a92ec6376dbd/setup.py'"'"',), "", "exec"))' bdist_wheel -d /tmp/pip-wheel-oyoenh4d
cwd: /tmp/pip-install-642i4337/simhash-py_707a16c4f0d24a878223a92ec6376dbd/
Building wheel for simhash-py (setup.py) ... error
ERROR: Failed building wheel for simhash-py
Running setup.py clean for simhash-py
Running command python setup.py clean
Building from Cython
/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/dist.py:723: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
warnings.warn(
running clean
removing 'build/lib.linux-x86_64-3.9' (and everything under it)
'build/bdist.linux-x86_64' does not exist -- can't clean it
'build/scripts-3.9' does not exist -- can't clean it
Failed to build simhash-py
ERROR: Could not build wheels for simhash-py, which is required to install pyproject.toml-based projects
WARNING: There was an error checking the latest version of pip.

To Reproduce

cd data-juicer
pip install -v -e .

Configs

No response

Logs

No response

Screenshots

No response

Additional

No response

Load ModelScope datasets

Search before continuing

  • I have searched the Data-Juicer issues and found no similar feature requests.

Description

Load ModelScope datasets

Use case

No response

Additional

No response

Are you willing to submit a PR for this feature?

  • Yes I'd like to help by submitting a PR!

[Bug]: Error when using the document_simhash_deduplicator operator: NameError: name 'fingerprint_warnings' is not defined

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

Ubuntu

Installation Method

docker image build from Dockerfile by myself

Data-Juicer Version

v0.1.2

Python Version

3.8.18

Describe the bug

(1) Key error messages
An error occurred during Op [document_simhash_deduplicator]

_raise PicklingError(_pickle.PicklingError: Can't pickle <cyfunction _Pyx_CFunc_size__t____hash__t____hash__t___to_py..wrap at 0x7ff22bad4e80>: it's not found as cfunc.to_py.wrap

NameError: name 'fingerprint_warnings' is not defined
(2) Screenshots of the error messages
image
image

To Reproduce

1. Pulled the latest code and built the docker image from the Dockerfile myself (docker image name: data-juicer:v0.1.2s).
2. Started the docker container as follows:
docker run -dit \
  --name dj \
  -v ~/.cache/:/root/.cache/ \
  data-juicer:v0.1.2s /bin/bash

docker cp -a dj:/data-juicer/ /data/llm-data/data-juicer

docker stop dj && docker rm dj

docker run -dit \
  --name dj \
  -p 8501:8501 \
  -v /data/llm-data/data-juicer:/data-juicer \
  -v /data/llm-data/cache/:/root/.cache/ \
  data-juicer:v0.1.2s /bin/bash
3. docker exec -it dj /bin/bash
4. cd /data-juicer/demos/process_cft_zh_data
streamlit run app.py
5. Accessed ip:8501 in a browser.
6. Clicked the "start to process data" button.

Configs

No response

Logs

No response

Screenshots

No response

Additional

Issue 83 describes a similar problem. Downloading the latest code and reinstalling the datasets==2.11.0 and dill==0.3.4 packages did not resolve it.

Token counting in token_num_filter

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

https://github.com/alibaba/data-juicer/blob/main/data_juicer/ops/filter/token_num_filter.py#L49
The filter is based on EleutherAI/pythia-6.9b-deduped.
This model is English-based. When token_num_filter tokenizes Chinese text, is the resulting token count somewhat problematic?

image
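A rough way to sanity-check such counts, as a sketch only: assuming the tokenizer is a byte-level BPE trained mostly on English text (an assumption about EleutherAI/pythia-6.9b-deduped, not something verified in this thread), a single Chinese character may be split into several byte-level tokens. In that case the token count can exceed the character count, and is at most the UTF-8 byte count. The helper below (char_and_byte_counts is a hypothetical name) only computes those two reference numbers with the standard library; it does not call the real tokenizer.

```python
from typing import Tuple


def char_and_byte_counts(text: str) -> Tuple[int, int]:
    """Return (number of Unicode characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


# Illustrative samples, not taken from the issue.
english = "nice weather today"
chinese = "今天天气不错"

for sample in (english, chinese):
    chars, utf8_bytes = char_and_byte_counts(sample)
    # ASCII text uses one byte per character; common Chinese characters use three.
    print(f"{sample!r}: {chars} chars, {utf8_bytes} UTF-8 bytes")
```

If the token count reported by token_num_filter for Chinese text sits well above the character count, that would be consistent with this explanation, though only running the actual tokenizer can confirm it.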

Additional

No response

[Bug]: Command-line arguments

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

pip

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

https://github.com/alibaba/data-juicer/blob/main/docs/DeveloperGuide_ZH.md#丰富的配置源和类型提示

When updating parameters from the command line, if both process-operator parameters and non-process-operator parameters are given, the following line raises an error:
https://github.com/alibaba/data-juicer/blob/main/data_juicer/config/config.py#L270
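To make the intended behavior concrete, here is a minimal sketch of what a dotted override such as --language_id_score_filter.min_score=0.9 is expected to do to the process list. It is not Data-Juicer's actual config code: apply_op_override is a hypothetical helper, and the operator arguments in the sample list (lang, the initial min_score) are illustrative values rather than the contents of configs/demo/analyser.yaml.

```python
import copy
from typing import Any, Dict, List, Optional

# One entry per operator, mirroring a YAML `process` list:
# [{"op_name": {"arg": value, ...}}, ...]
ProcessList = List[Dict[str, Optional[Dict[str, Any]]]]


def apply_op_override(process: ProcessList, dotted_key: str, value: Any) -> ProcessList:
    """Return a copy of `process` with `<op_name>.<arg_name>` set to `value`."""
    op_name, sep, arg_name = dotted_key.partition(".")
    if not sep or not arg_name:
        raise ValueError(f"expected '<op_name>.<arg_name>', got {dotted_key!r}")

    updated = copy.deepcopy(process)  # leave the caller's list untouched
    for op in updated:
        if op_name in op:
            if op[op_name] is None:  # operator listed without arguments
                op[op_name] = {}
            op[op_name][arg_name] = value
            return updated
    raise KeyError(f"operator {op_name!r} is not in the process list")


# Illustrative process list (assumed values).
process = [
    {"language_id_score_filter": {"lang": "zh", "min_score": 0.8}},
    {"text_length_filter": None},
]
print(apply_op_override(process, "language_id_score_filter.min_score", 0.9))
```

Non-operator options such as --project_name would be handled separately as plain top-level keys; the bug reported here concerns mixing the two kinds of option in one invocation.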

To Reproduce

python tools/analyze_data.py --config configs/demo/analyser.yaml --project_name l666 --language_id_score_filter.min_score=0.9

Configs

The default configs/demo/analyser.yaml

Logs

No response

Screenshots

image

Additional

No response

Why can't I pass parameters to my operator?

I created a mapper operator, declared the parameters to be passed in the operator's init, and registered the operator in the Mapper's init. When the values are hard-coded in the operator, the operator works and produces correct results. When the parameters are passed through the config, an error is reported saying that the config is written incorrectly. What is going on?

[Bug]: Cache location problem when the formatter calls datasets.load_dataset

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

Ubuntu 20.04.5 LTS x86_64

Installation Method

source

Data-Juicer Version

v0.1.2

Python Version

3.10

Describe the bug

After reading the data-juicer source code, I found that when LocalFormatter.load_dataset calls datasets.load_dataset, the HF_DATASETS_CACHE environment variable set on the system does not take effect. When loading local json files, the cache is still written to the default ~/.cache/huggingface/datasets.

To Reproduce

  1. Run command
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml
  2. Check .cache folder
ls ~/.cache/huggingface/datasets
  3. Show result
--> ls ~/.cache/huggingface/datasets
xxx-960372e22b98db9d_0.0.0_fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e.lock  json

Configs

# global parameters
project_name: 'Data-Juicer-recipes-alpaca-cot-en'
dataset_path: '../data/raw_data/raw_data_en.jsonl' # path to your dataset directory or file
export_path: '../data/refine_data/dataset.jsonl'

np: 100 # number of subprocess to process your dataset
open_tracer: true

# process schedule
# a list of several process operators with their arguments
process:
  - document_deduplicator: # 104636705
      lowercase: true
      ignore_non_character: true

  - alphanumeric_filter: # 104636381
      tokenization: false
      min_ratio: 0.1

  - character_repetition_filter: # 104630030
      rep_len: 10
      max_ratio: 0.6

  - flagged_words_filter: # 104576967
      lang: en
      tokenization: true
      max_ratio: 0.017

  - maximum_line_length_filter: # 104575811
      min_len: 20

  - text_length_filter: # 104573711
      min_len: 30

  - document_simhash_deduplicator: # 72855345
      tokenization: space
      window_size: 3
      lowercase: true
      ignore_pattern: '\p{P}'
      num_blocks: 9
      hamming_distance: 7

Logs

None

Screenshots

None

Additional

Code location: data-juicer/data_juicer/format/formatter.py

Expected solution:

  1. Allow arguments to be passed to load_dataset dynamically from the config: read extra keys and values from the yaml and pass them via **kwargs into LocalFormatter.load_dataset, so that the cache_dir parameter can be specified.

If possible, I can contribute a PR as a community developer.
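The proposal above could look roughly like the following sketch. It only builds the keyword arguments and does not touch Data-Juicer itself: build_load_kwargs and the load_dataset_kwargs config key are hypothetical names introduced for illustration. The only real API relied on is that datasets.load_dataset accepts a cache_dir argument.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Default location mentioned in the bug description.
DEFAULT_CACHE = os.path.join("~", ".cache", "huggingface", "datasets")


def build_load_kwargs(cfg: Mapping[str, Any],
                      env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Collect extra keyword arguments for a dataset loader.

    Priority for cache_dir: explicit config value, then the
    HF_DATASETS_CACHE environment variable, then the default path.
    """
    env = os.environ if env is None else env
    # `load_dataset_kwargs` is a hypothetical config key, not an existing option.
    kwargs = dict(cfg.get("load_dataset_kwargs") or {})
    if "cache_dir" not in kwargs:
        kwargs["cache_dir"] = env.get("HF_DATASETS_CACHE", DEFAULT_CACHE)
    kwargs["cache_dir"] = os.path.expanduser(kwargs["cache_dir"])
    return kwargs


cfg = {"load_dataset_kwargs": {"cache_dir": "/data/llm-data/cache"}}
print(build_load_kwargs(cfg))
```

A formatter could then forward the result, for example as load_dataset(..., **build_load_kwargs(cfg)), so that either the yaml entry or the environment variable decides where the cache lives.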

[MM] mmc42dj & dj2mmc4 tools

Multimodal datasets come in diverse formats, so we provide several format conversion tools that convert original datasets to the Data-Juicer intermediate format and back. Data-Juicer accepts datasets in the Data-Juicer format and processes them.

As a first step, we add tools for converting MMC4-like datasets to the Data-Juicer format and back. Other formats will be supported in the future.

[Bug]: nlpaug could generate an indefinite number of augmented samples.

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

all

Installation Method

from source

Data-Juicer Version

latest

Python Version

3.8

Describe the bug

During the FT-Ranker competition, a user ran nlpaug_en_mapper to augment the dataset, but an error occurred during processing:

image

The error shows that, for some samples, nlpaug generated no augmented samples. The alignment of all fields other than text should be changed to work the way it does in nlpcda_zh_mapper.
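The kind of alignment meant here can be sketched as follows. This is not the code of nlpaug_en_mapper or nlpcda_zh_mapper: align_augmented_fields is a hypothetical helper, and the field names in the example are made up. The idea is simply that every non-text field is repeated as many times as there are output texts, whatever number of augmented texts came back (including zero).

```python
from typing import Any, Dict, List


def align_augmented_fields(sample: Dict[str, Any],
                           augmented_texts: List[str],
                           text_key: str = "text",
                           keep_original: bool = True) -> Dict[str, List[Any]]:
    """Build a column-wise batch whose columns all have the same length."""
    texts = ([sample[text_key]] if keep_original else []) + list(augmented_texts)
    batch: Dict[str, List[Any]] = {text_key: texts}
    for key, value in sample.items():
        if key != text_key:
            # Repeat by actual output count, never by the requested count.
            # Note: the same object is referenced in every row (no deep copy).
            batch[key] = [value] * len(texts)
    return batch


sample = {"text": "hello world", "meta": {"source": "demo"}}
print(align_augmented_fields(sample, []))                  # augmenter returned nothing
print(align_augmented_fields(sample, ["hi world", "hello earth"]))
```

Sizing the other columns from the length of the produced text list, rather than from the configured number of augmentations, keeps the batch rectangular even when the augmenter returns fewer samples than requested.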

To Reproduce

Run nlpaug_en_mapper on FT-Ranker Competition dataset raw_data_en.jsonl.

Configs

Logs

Screenshots

Additional

Tianchi FT-Data Ranker 1B track: evaluating the trained model on the test data fails with RuntimeError: Error(s) in loading state_dict for FalconForCausalLM

After LoRA fine-tuning the falcon-rw-1b model, running the trained model on the test data raises an error:

`(dj_comp) root@xht-ddc311903c-03ffb4f2:~/competition_kit/lm-evaluation-harness# bash ./examples/challenge-1B-stage1.sh \

dev
/root/output/1b
/root/competition_kit/data/challenge-data
/root/output/result_1b
[MODEL] /root/output/1b
[DATA] /root/competition_kit/data/challenge-data/dev
[OUT] /root/output/result_1b/dev
[TASK] challenge_mc: 25-shot
Using device 'cuda:0'
/root/miniconda3/envs/dj_comp/lib/python3.10/site-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
Traceback (most recent call last):
File "/root/competition_kit/lm-evaluation-harness/main.py", line 103, in
main()
File "/root/competition_kit/lm-evaluation-harness/main.py", line 70, in main
results = evaluator.simple_evaluate(
File "/root/competition_kit/lm-evaluation-harness/lm_eval/utils.py", line 243, in _wrapper
return fn(*args, **kwargs)
File "/root/competition_kit/lm-evaluation-harness/lm_eval/evaluator.py", line 80, in simple_evaluate
lm = lm_eval.models.get_model(model).create_from_arg_string(
File "/root/competition_kit/lm-evaluation-harness/lm_eval/base.py", line 115, in create_from_arg_string
return cls(**args, **args2)
File "/root/competition_kit/lm-evaluation-harness/lm_eval/models/gpt2.py", line 85, in __init__
self.model = transformers.AutoModelForCausalLM.from_pretrained(
File "/root/miniconda3/envs/dj_comp/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
return model_class.from_pretrained(
File "/root/miniconda3/envs/dj_comp/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3187, in from_pretrained
) = cls._load_pretrained_model(
File "/root/miniconda3/envs/dj_comp/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3636, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for FalconForCausalLM:
size mismatch for transformer.word_embeddings.weight: copying a param with shape torch.Size([50258, 2048]) from checkpoint, the shape in current model is torch.Size([0, 2048]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([50258, 2048]) from checkpoint, the shape in current model is torch.Size([0, 2048]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.
`

  • Is this caused by a problem in my training that then breaks evaluation, or did something go wrong only at evaluation time? The [0, 2048] shape looks very strange.

I searched Google and Baidu extensively. Most results discuss mismatches in the output of the connection layer, but I do not understand how a 0 can appear. Adding the suggested ignore_mismatched_sizes=True does not solve the problem either.
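A small diagnostic sketch for the error above. The `[0, 2048]` shape belongs to the freshly constructed model, whose embedding is built from the `vocab_size` stored in the saved `config.json`. One possible cause (an assumption, not confirmed in this thread) is that the saved config carries `vocab_size: 0`. The helper name `vocab_size_matches` is hypothetical; it only reads the JSON file and compares the value with the row count reported for the checkpoint.

```python
import json
from pathlib import Path


def vocab_size_matches(model_dir: str, checkpoint_rows: int) -> bool:
    """Compare config.json's vocab_size with the checkpoint's embedding rows."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    vocab_size = config.get("vocab_size")
    print(f"config vocab_size={vocab_size}, checkpoint rows={checkpoint_rows}")
    return vocab_size == checkpoint_rows
```

For the log above this would be called as `vocab_size_matches("/root/output/1b", 50258)`. If it returns False, the config and the saved weights disagree.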

With 31 GB of data, the process is killed while saving the data during deduplication

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

1111

Additional

1111111

How to use this package offline

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

I need to use this package in an offline container. The package itself is already installed, but at run time it needs to download many things such as model weights. Is there an effective way to use this package in an offline environment as well?

Additional

No response

Lexical diversity ModelScope demo reports an error

After following the lexical-diversity ModelScope link on the page https://modelscope.cn/studios/Data-Juicer/data_visulization_diversity/summary, the ModelScope page reports an error:

ImportError: cannot import name 'prepare_diversity_model' from 'data_juicer.analysis.diversity_analysis' (/opt/conda/lib/python3.8/site-packages/data_juicer/analysis/diversity_analysis.py)
Traceback:
File "/opt/conda/lib/python3.8/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
exec(code, module.__dict__)
File "/home/studio_service/studio_file/PROJECT/app.py", line 9, in
from data_juicer.analysis.diversity_analysis import (DiversityAnalysis,

After accidentally deleting analysis/overal.csv and rerunning the analysis, it keeps reporting that the file cannot be found and does not regenerate it automatically. How can this be solved?

After accidentally deleting the analysis/overal.csv file and its directory and rerunning the analysis, it keeps reporting that the file cannot be found and does not regenerate it automatically. How can this be solved?
The use_cache attribute is already set to false.

Is Auto-HPO fully automated end to end, or does it require manual intervention?

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

I read the paper and found that it describes an Auto-HPO feature, but I could not find the related code. Is this feature currently fully automated end to end? Or does one adjust the yaml file to generate different data, evaluate it with https://modelscope.cn/studios/Data-Juicer/auto_evaluation_helm/summary, observe the results, adjust the data generation strategy, and repeat? Or is there some other approach?

Additional

No response

[Bug]: error when profiling with scalene

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

pip

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

To Reproduce

pip install -U scalene
scalene tools/process_data.py --config configs/demo/process.yaml

Configs

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: 'demos/data/demo-dataset.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset
use_cache: false
export_path: './outputs/demo-process/demo-processed.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'

Logs

2023-11-29 11:13:15 | INFO | data_juicer.core.executor:107 - Processing data...
2023-11-29 11:13:15 | ERROR | data_juicer.core.executor:165 - An error occurred during Op [language_id_score_filter].
Traceback (most recent call last):
File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/executor.py", line 131, in run
dataset = dataset.add_column(name=Fields.stats,
File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 255, in add_column
return NestedDataset(super().add_column(*args, **kargs))
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/fingerprint.py", line 511, in wrapper
out = func(dataset, *args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 5446, in add_column
dataset = self.flatten_indices() if self._indices is not None else self
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/fingerprint.py", line 511, in wrapper
out = func(dataset, *args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3592, in flatten_indices
return self.map(
File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3004, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3392, in _map_single
buf_writer, writer, tmp_file = init_buffer_and_writer()
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3326, in init_buffer_and_writer
tmp_file = tempfile.NamedTemporaryFile("wb", dir=os.path.dirname(cache_file_name), delete=False)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/tempfile.py", line 541, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp1pvg2dc5/tmpbf_w1ds6'

Screenshots

image

Additional

No response

[Bug]: date format changed from input to output

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

Ubuntu

Installation Method

from source

Data-Juicer Version

latest

Python Version

3.8

Describe the bug

I have a jsonl dataset to process. Each record has a time key whose value originally looks like '2023-10-13 16:06:31'. After processing the data with the command python tools/process_data.py --config configs/demo/process.yaml, the time in the output jsonl is changed to 1678, an integer. This may be caused by datasets.to_json, which has a parameter called date_format. When I set it to 'iso', the output becomes '1970-01-01T00:00:01.698', so the bug affects not only the format but also the value.
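One possible workaround is sketched below, assuming the loader has already parsed the time strings into datetime objects before export. The helper `stringify_datetimes` is hypothetical and not part of Data-Juicer. It converts datetime values back into strings in the original layout, so the exporter only ever sees plain text.

```python
from datetime import datetime
from typing import Any, Dict


def stringify_datetimes(record: Dict[str, Any],
                        fmt: str = "%Y-%m-%d %H:%M:%S") -> Dict[str, Any]:
    """Return a copy of the record with datetime values formatted as strings."""
    return {
        key: value.strftime(fmt) if isinstance(value, datetime) else value
        for key, value in record.items()
    }
```

It could be applied to each record just before it is written out. Whether this is enough depends on where in the pipeline the value is actually altered, which this issue does not establish.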

To Reproduce

  1. Prepare a jsonl dataset with a time field.
  2. Run python tools/process_data.py --config configs/demo/process.yaml
  3. The time format and value in the output are changed.

Configs

project_name: 'all'                                         # project name for distinguish your configs
dataset_path: '/path/to/dataset/0.jsonl'                     
export_path: '/path/to/result/result.jsonl'              
export_shard_size: 0                                      
export_in_parallel: false                                  
np: 4                                                       # number of subprocess to process your dataset
text_keys: '内容'
suffixes: ['.jsonl']                                                # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
use_cache: true                                             # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: null                                        
use_checkpoint: false                              
open_tracer: false                                          # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: []                                        # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10                                               # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: false                                            # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: null                                        # The compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.

# for distributed processing
executor_type: default                                      # Type of executor, support "default" or "ray" for now.
ray_address: auto                                           # The address of the Ray cluster.

# only for data analysis
save_stats_in_one_file: false                               # whether to store all stats result into one file

# process schedule: a list of several process operators with their arguments
process:
  - clean_email_mapper:                                     # remove emails from text.
  - clean_html_mapper:                                      # remove html formats from text.
  - clean_ip_mapper:                                        # remove ip addresses from text.
  - clean_links_mapper:                                     # remove web links from text.
  - clean_copyright_mapper:                                 # remove copyright comments.
  - punctuation_normalization_mapper:                       # normalize unicode punctuations to English punctuations.
  - whitespace_normalization_mapper:                        # normalize different kinds of whitespaces to English whitespace.

Logs

No response

Screenshots

No response

Additional

No response

[feature] release of data-juicer model checkpoints in huggingface format

Our reference models pre-trained with Data-Juicer are in Megatron format. We provide their download links as follows:

Plz @zhijianma paste the below links later.

We are now converting them into Hugging Face format and will upload them to ModelScope and the Hugging Face Hub.

[MM] llava2dj & dj2llava tools

Multimodal datasets come in diverse formats, so we provide several format conversion tools that convert original datasets to the Data-Juicer intermediate format and back. Data-Juicer accepts and processes datasets in the Data-Juicer format.

As a first step, we add tools that convert LLaVA-like datasets to the Data-Juicer format and back again. Other formats will be supported in the future.

[MM] image_deduplicator

A new Deduplicator, image_deduplicator, will be supported. It will remove duplicate images in multimodal samples, possibly based on the imagededup library.

TBD: when a sample contains multiple images, should only the duplicate images be removed, or the whole sample?
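To make the open question concrete, here is a rough sketch of both policies. It detects exact duplicates only, by hashing raw image bytes with MD5, which is a stand-in and not the perceptual hashing that imagededup provides. The function name, the `images` key, and the `drop_whole_sample` flag are assumptions for illustration.

```python
import hashlib
from typing import Dict, List


def deduplicate_images(samples: List[Dict],
                       image_bytes: Dict[str, bytes],
                       image_key: str = "images",
                       drop_whole_sample: bool = False) -> List[Dict]:
    """Remove images (or whole samples) whose bytes were already seen."""
    seen = set()
    kept = []
    for sample in samples:
        new_digests, unique, has_dup = set(), [], False
        for path in sample[image_key]:
            digest = hashlib.md5(image_bytes[path]).hexdigest()
            if digest in seen or digest in new_digests:
                has_dup = True
            else:
                new_digests.add(digest)
                unique.append(path)
        if has_dup and drop_whole_sample:
            # Policy B: discard the sample and do not register its images.
            continue
        # Policy A: keep the sample with only its first-seen images.
        seen |= new_digests
        kept.append({**sample, image_key: unique})
    return kept
```

Policy A can leave a sample with fewer images than its text refers to, while policy B loses more data. That trade-off is what the TBD above has to settle.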

[MM] image_shape_filter

A new Filter image_shape_filter will be supported. For a multimodal sample that contains images, if shapes (w, h) of images are out of a specific range, this sample will be filtered out.
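The rule above can be sketched as a plain predicate. The parameter names are hypothetical, and the sketch assumes the (w, h) pairs have already been read from the image files. It also assumes a sample is kept only if all of its images are in range, which the description does not specify.

```python
from typing import List, Tuple


def keep_by_shape(shapes: List[Tuple[int, int]],
                  min_width: int, max_width: int,
                  min_height: int, max_height: int) -> bool:
    """Return True if every (w, h) lies inside the allowed ranges."""
    # A sample without images passes trivially, since all() of nothing is True.
    return all(min_width <= w <= max_width and min_height <= h <= max_height
               for w, h in shapes)
```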

[MM] clip_similarity_filter

A new Filter clip_similarity_filter will be supported. For a multimodal sample that contains images and texts, if the CLIP similarities between image-text pairs are out of a specific range, this sample will be filtered out.
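A minimal sketch of the thresholding step, assuming image and text embeddings have already been produced by a CLIP model (not shown here). The function names and default thresholds are illustrative assumptions. The similarity used is plain cosine similarity, and a sample is kept only if every image-text pair falls inside the range.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keep_by_clip_similarity(image_embeddings: List[Sequence[float]],
                            text_embedding: Sequence[float],
                            min_score: float = 0.1,
                            max_score: float = 1.0) -> bool:
    """Keep the sample only if all image-text similarities are in range."""
    return all(min_score <= cosine_similarity(e, text_embedding) <= max_score
               for e in image_embeddings)
```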

Deduplicating about 50 GB of Chinese corpus hangs

I used the project to run operators such as deduplication (minhash) on a Chinese corpus of about 40 million samples, and the run hangs at the deduplication stage.
The detailed config is as follows:

project_name: 'CC100-zh'
dataset_path: xxx.jsonl
export_path: xxx-processed.jsonl

np: 50
open_tracer: true
text_keys: 'text'
process:
  - perplexity_filter:
      lang: zh
      max_ppl: 2500
  - document_minhash_deduplicator:                          
      tokenization: character                                    
      window_size: 5                                        
      num_permutations: 256                                   
      jaccard_threshold: 0.7                                 
      num_bands: null                                        
      num_rows_per_band: null                                 
      lowercase: true                                         
      ignore_pattern: null                                    
  - text_length_filter:
      min_len: 200
      max_len: 65589
  - character_repetition_filter:
      rep_len: 10
      max_ratio: 0.3
  - word_repetition_filter:
      lang: zh
      tokenization: true
      rep_len: 10
      max_ratio: 0.279

Also, are there detailed timing statistics for processing corpora of a similar scale?
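For readers unfamiliar with the parameters in the config above, here is a rough pure-Python sketch of MinHash over character 5-grams. It is not Data-Juicer's implementation: it uses salted BLAKE2b hashes in place of random permutations and leaves out the LSH banding step (num_bands, num_rows_per_band). It only illustrates why the cost grows with num_permutations and document length.

```python
import hashlib
from typing import List, Set


def shingles(text: str, window_size: int = 5, lowercase: bool = True) -> Set[str]:
    """Character n-grams of the text (the whole text if it is shorter)."""
    if lowercase:
        text = text.lower()
    if len(text) <= window_size:
        return {text}
    return {text[i:i + window_size] for i in range(len(text) - window_size + 1)}


def minhash_signature(items: Set[str], num_permutations: int = 256) -> List[int]:
    """One minimum per salted hash function; the salt stands in for a permutation."""
    signature = []
    for seed in range(num_permutations):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                salt=salt).digest(), "little")
            for item in items))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents would be treated as near-duplicates when the estimate exceeds jaccard_threshold (0.7 in the config above). Each document needs num_permutations hashes per shingle, so 256 permutations over tens of millions of documents is a substantial amount of work before any candidate pairs are compared.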

How can the cleaned data be configured to exclude the "__dj__stats__" field?

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

While using the tool, I found that the cleaned data contains not only the fields of the original data but also a new "__dj__stats__" field, which holds a series of attribute values related to the operators used during cleaning. This does make it easier to understand why a sample was kept during cleaning.

I now want to clean data in multiple stages, for example cleaning within each dataset in the first stage and across datasets in the second stage. With this multi-stage design, when I run preprocessing again on data that has already been preprocessed, an error is raised:

pyarrow.lib.ArrowInvalid: Unable to merge: Field __dj__stats__ has incompatible types: struct<alnum_ratio: double, avg_line_length: double, char_rep_ratio: double, flagged_words_ratio: double, max_line_length: int64, num_words: int64, perplexity: double, special_char_ratio: double, text_len: int64, word_rep_ratio: double> vs struct<alnum_ratio: double>
  0%|          | 0/1 [00:00<?, ?it/s]

The problem appears to be caused by the format of the "__dj__stats__" field.
I could not find a way to configure the cleaned data to exclude the "__dj__stats__" field. Judging from the source code, including the "__dj__stats__" field by default appears to be by design?
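Independently of any built-in option, one workaround is to strip the field from the first-stage output before feeding it into the second stage. The sketch below works on JSON Lines text using only the standard library. The helper name `drop_stats` is hypothetical; the field name is the one shown in the error message above.

```python
import json
from typing import Iterable, Iterator

STATS_KEY = "__dj__stats__"


def drop_stats(lines: Iterable[str], stats_key: str = STATS_KEY) -> Iterator[str]:
    """Yield each JSON line again with the stats field removed."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        record.pop(stats_key, None)
        yield json.dumps(record, ensure_ascii=False)
```

The input lines can come from an open file handle, and the yielded strings can be written to a new file with a newline after each one. Since every record then lacks the struct, the type mismatch in the error above should no longer arise when the files are merged.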

Additional

No response

[Bug]: operators such as clean_links_mapper and clean_ip_mapper cause OOM

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

Ubuntu20.04

Installation Method

from source

Data-Juicer Version

latest

Python Version

3.10.12

Describe the bug

I am cleaning English web data, about 150 GB in size, with the following operators enabled:

  - clean_email_mapper:                                     # remove emails from text.
  - clean_html_mapper:                                      # remove html formats from text.
  - clean_ip_mapper:                                        # remove ip addresses from text.
  - clean_links_mapper:                                     # remove web links from text.

The cleaning process uses up to 1.4 TB of memory, which leads to OOM.

To Reproduce

The detailed config file is as follows:

project_name: 'xxx'
dataset_path: 'xxx'
export_path: 'xxx'

np: 128  
open_tracer: true
text_keys: 'text'
trace_num: 0x3f3f3f3f

process:
  - words_num_filter:                                     
      lang: en                                              
      tokenization: false                                    
      min_num: 100                                            
  - clean_email_mapper:                                     
  - clean_html_mapper:                                      
  - clean_ip_mapper:                                       
  - clean_links_mapper:                                     
  - perplexity_filter:
      lang: en
      max_ppl: 2500 
  - document_minhash_deduplicator:                          
      tokenization: space                                    
      window_size: 5                                          
      num_permutations: 256                                   
      jaccard_threshold: 0.7                                 
      num_bands: null                                        
      num_rows_per_band: null                                 
      lowercase: false                                         
      ignore_pattern: null                                    
  - word_repetition_filter:
      lang: en
      tokenization: true
      rep_len: 10
      max_ratio: 0.1  

Configs

No response

Logs

No response

Screenshots

No response

Additional

No response

tokenization parameter in StopWordsFilter ops

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

https://github.com/alibaba/data-juicer/blob/main/data_juicer/ops/filter/stopwords_filter.py#L85

  • The tokenization parameter seems to have no effect, because only one way of counting stopwords is implemented, namely sentencepiece tokenization (the other path takes the words from an already tokenized state).

  • Moreover, when the condition on line #L83 does not hold and tokenization is set to False, line #L86 may raise an error.
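The behaviour the reporter seems to expect can be sketched as follows. This is not the code at the linked lines. The function and its parameters are hypothetical and only illustrate a fallback to whitespace splitting when no tokenizer is supplied, so that tokenization set to False cannot fail for lack of a tokenizer.

```python
from typing import Callable, List, Optional, Set


def stopwords_ratio(text: str,
                    stopwords: Set[str],
                    tokenizer: Optional[Callable[[str], List[str]]] = None,
                    words: Optional[List[str]] = None) -> float:
    """Fraction of words that are stopwords."""
    if words is None:
        # Reuse pre-tokenized words if given, else tokenize, else split on spaces.
        words = tokenizer(text) if tokenizer is not None else text.split()
    words = [w.strip().lower() for w in words if w.strip()]
    if not words:
        return 0.0
    return sum(w in stopwords for w in words) / len(words)
```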

Additional

No response

[MM] image_size_filter

A new Filter image_size_filter will be supported. For a multimodal sample that contains images, if the byte sizes of the images are out of a specific range, this sample will be filtered out.
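A short sketch of the byte-size check, assuming sizes are taken from files on disk with `os.path.getsize`. The function names and the rule that all images must be in range are assumptions for illustration.

```python
import os
from typing import List


def image_byte_sizes(paths: List[str]) -> List[int]:
    """Size in bytes of each image file on disk."""
    return [os.path.getsize(path) for path in paths]


def keep_by_byte_size(sizes: List[int], min_bytes: int, max_bytes: int) -> bool:
    """Keep the sample only if every image size lies in [min_bytes, max_bytes]."""
    return all(min_bytes <= size <= max_bytes for size in sizes)
```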

pip install py-data-juicer fails

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

pip install py-data-juicer fails

` Cython.Compiler.Errors.CompileError: simhash/simhash.pyx
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for simhash-py
Running setup.py clean for simhash-py
Failed to build kenlm simhash-py
ERROR: Could not build wheels for kenlm, simhash-py, which is required to install pyproject.toml-based projects`

For the complete error output, see the linked full error log.

Additional

  • Windows 11. gcc, cmake, and the Visual Studio 2022 MSVC140/143 build tools are already installed. I am not sure whether the kenlm and simhash-py errors are caused by incorrect gcc and cmake versions.
  • gcc (x86_64-win32-seh-rev3, Built by MinGW-W64 project) 12.1.0
    Copyright (C) 2022 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
  • cmake version 3.28.0-rc5

dedup across different dataset

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

Hi, thanks for the great work. I'm wondering whether Data-Juicer applies dedup across different datasets when reproducing RedPajama. For example, would duplicates between CommonCrawl-2023-06 and CommonCrawl-2022-05 be removed?

Additional

No response

[Bug]: RAY error

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

pip

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

An error is raised when the language_id_score_filter operator is run with Ray.

To Reproduce

# ok
python tools/process_data.py --config configs/demo/process.yaml

# error
python tools/process_data.py --config configs/demo/process.yaml --executor_type ray

# ok, change op to - alphanumeric_filter:
python tools/process_data.py --config configs/demo/process.yaml --executor_type ray

Configs

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: 'demos/data/demo-dataset.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset
use_cache: false
export_path: './outputs/demo-process/demo-processed.jsonl'
save_stats_in_one_file: true
# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'
  # - alphanumeric_filter:

Logs

(python3.8) wzp@vastai-NF5468M6:~/code/LLMData/open_source/data-juicer$ python tools/process_data.py --config configs/demo/process.yaml --executor_type ray
<class 'list'>
<class 'list'>
2023-11-30 15:32:52 | WARNING  | data_juicer.config.config:329 - Cache management of datasets is disabled.
2023-11-30 15:32:52 | WARNING  | data_juicer.config.config:340 - Set temp directory to store temp files to [None].
2023-11-30 15:32:52 | INFO     | data_juicer.config.config:442 - Back up the input config file [/home/wzp/code/LLMData/open_source/data-juicer/configs/demo/process.yaml] into the work_dir [./outputs/demo-process]
2023-11-30 15:32:52 | INFO     | data_juicer.config.config:463 - Configuration table:
╒════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════╕
│ key                    │ values                                                                                   │
╞════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════╡
│ config                 │ [Path_fr(configs/demo/process.yaml, cwd=/home/wzp/code/LLMData/open_source/data-juicer)] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ hpo_config             │ None                                                                                     │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ project_name           │ 'demo-process'                                                                           │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ executor_type          │ 'ray'                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_path           │ 'demos/data/demo-dataset.jsonl'                                                          │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ export_path            │ './outputs/demo-process/demo-processed.jsonl'                                            │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ export_shard_size      │ 0                                                                                        │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ export_in_parallel     │ False                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ np                     │ 4                                                                                        │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ text_keys              │ 'text'                                                                                   │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ image_key              │ 'images'                                                                                 │
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ image_special_token    โ”‚ '<__dj__image>'                                                                          โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ eoc_special_token      โ”‚ '<|__dj__eoc|>'                                                                          โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ suffixes               โ”‚ []                                                                                       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ use_cache              โ”‚ False                                                                                    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ ds_cache_dir           โ”‚ PosixPath('/home/wzp/.cache/huggingface/datasets')                                       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ cache_compress         โ”‚ None                                                                                     โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ use_checkpoint         โ”‚ False                                                                                    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ temp_dir               โ”‚ None                                                                                     โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ open_tracer            โ”‚ False                                                                                    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ op_list_to_trace       โ”‚ []                                                                                       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ trace_num              โ”‚ 10                                                                                       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ op_fusion              โ”‚ False                                                                                    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ process                โ”‚ [{'language_id_score_filter': {'image_key': 'images',                                    โ”‚
โ”‚                        โ”‚                                'lang': 'zh',                                             โ”‚
โ”‚                        โ”‚                                'min_score': 0.8,                                         โ”‚
โ”‚                        โ”‚                                'text_key': 'text'}}]                                     โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ save_stats_in_one_file โ”‚ True                                                                                     โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ ray_address            โ”‚ 'auto'                                                                                   โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ work_dir               โ”‚ './outputs/demo-process'                                                                 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ timestamp              โ”‚ '20231130153252'                                                                         โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ dataset_dir            โ”‚ '/home/wzp/code/LLMData/open_source/data-juicer/demos/data'                              โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ add_suffix             โ”‚ False                                                                                    โ”‚
โ•˜โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•›
2023-11-30 15:32:53 | INFO     | data_juicer.core.ray_executor:35 - Initing Ray ...
2023-11-30 15:32:53,326 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.23.4.252:6379...
2023-11-30 15:32:53,333 INFO worker.py:1642 -- Connected to Ray cluster.
2023-11-30 15:32:53 | INFO     | data_juicer.core.ray_executor:47 - Loading dataset with Ray...
2023-11-30 15:32:54,324 INFO read_api.py:406 -- To satisfy the requested parallelism of 192, each read task output is split into 192 smaller blocks.
2023-11-30 15:32:54 | INFO     | data_juicer.core.ray_executor:51 - Preparing process operators...
2023-11-30 15:32:54 | INFO     | data_juicer.utils.model_utils:87 - Loading fasttext language identification model...
2023-11-30 15:32:54 | INFO     | data_juicer.core.ray_executor:59 - columns ['text', 'meta']
2023-11-30 15:32:54 | INFO     | data_juicer.core.ray_executor:62 - Processing data...
2023-11-30 15:32:54,702 INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON->SplitBlocks(192)] -> TaskPoolMapOperator[MapBatches(process_batch)->Map(compute_stats)->Filter(process)]
2023-11-30 15:32:54,702 INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-11-30 15:32:54,702 INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48497) 2023-11-30 15:33:00.312 | ERROR    | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later.                                                                              
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48497) 2023-11-30 15:33:00.363 | ERROR    | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later.                                                                              
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48497) 2023-11-30 15:33:01.362 | ERROR    | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later.                                                                              
--- Logging error in Loguru Handler #1 ---                                                                                                                                                                                                                                                            
Record was: {'elapsed': datetime.timedelta(seconds=13, microseconds=527897), 'exception': (type=<class 'ray.exceptions.RayTaskError(ValueError)'>, value=RayTaskError(ValueError)(ValueError('Model not loaded. Please retry later.')), traceback=<traceback object at 0x7f298c1324c0>), 'extra': {}, 'file': (name='process_data.py', path='tools/process_data.py'), 'function': '<module>', 'level': (name='ERROR', no=40, icon='❌'), 'line': 19, 'message': "An error has been caught in function '<module>', process 'MainProcess' (48135), thread 'MainThread' (139830410995520):", 'module': 'process_data', 'name': '__main__', 'process': (id=48135, name='MainProcess'), 'thread': (id=139830410995520, name='MainThread'), 'time': datetime(2023, 11, 30, 15, 33, 2, 531776, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800), 'CST'))}
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 345, in ray._raylet.StreamingObjectRefGenerator._next_sync
  File "python/ray/_raylet.pyx", line 4533, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
  File "python/ray/_raylet.pyx", line 443, in ray._raylet.check_status
ray.exceptions.ObjectRefStreamEndOfStreamError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 80, in on_waitable_ready
    meta = ray.get(next(self._streaming_gen))
  File "python/ray/_raylet.pyx", line 300, in ray._raylet.StreamingObjectRefGenerator.__next__
  File "python/ray/_raylet.pyx", line 351, in ray._raylet.StreamingObjectRefGenerator._next_sync
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/loguru/_logger.py", line 1277, in catch_wrapper
    return function(*args, **kwargs)
  File "tools/process_data.py", line 15, in main
    executor.run()
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/ray_executor.py", line 83, in run
    logger.info(f'Op [{op_name}] Done. Left '
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 2498, in count
    [get_num_rows.remote(block) for block in self.get_internal_block_refs()]
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 4799, in get_internal_block_refs
    blocks = self._plan.execute().get_blocks()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/plan.py", line 591, in execute
    blocks = execute_to_legacy_block_list(
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 119, in execute_to_legacy_block_list
    block_list = _bundles_to_block_list(bundles)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 357, in _bundles_to_block_list
    for ref_bundle in bundles:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
    return self.get_next()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 129, in get_next
    raise item
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 187, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 235, in _scheduling_loop_step
    process_completed_tasks(topology)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 333, in process_completed_tasks
    active_tasks[ref].on_waitable_ready()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 88, in on_waitable_ready
    ex = ray.get(block_ref)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::MapBatches(process_batch)->Map(compute_stats)->Filter(process)() (pid=48715, ip=10.23.4.252)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 405, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 345, in __call__
    for data in iter:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 171, in __call__
    yield from self._row_fn(input, ctx)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 256, in transform_fn
    for row in rows:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 223, in __call__
    for block in blocks:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 345, in __call__
    for data in iter:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 171, in __call__
    yield from self._row_fn(input, ctx)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 233, in transform_fn
    out_row = fn(row)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 119, in fn
    return op_fn(item, *fn_args, **fn_kwargs)
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 53, in compute_stats
    raise ValueError(err_msg)
ValueError: Model not loaded. Please retry later.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/loguru/_handler.py", line 204, in emit
    self._queue.put(str_record)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/multiprocessing/queues.py", line 362, in put
    obj = _ForkingPickler.dumps(obj)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'ray.exceptions.RayTaskError(ValueError)'>: attribute lookup RayTaskError(ValueError) on ray.exceptions failed
--- End of logging error ---
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48713) 2023-11-30 15:33:02.374 | ERROR    | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later. [repeated 9x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
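
The final `PicklingError` in the log above comes from the logging handler trying to pickle an exception whose class (`RayTaskError(ValueError)`) was created at runtime and so cannot be looked up by name. The following is a minimal sketch of that failure mode and one possible workaround, flattening the exception into a built-in type before it is handed to a queue-based logger. It uses only the standard library; `make_dynamic_error`, `is_picklable` and `to_plain_error` are hypothetical helpers for illustration, not Ray, Loguru or Data-Juicer APIs.

```python
import pickle


def make_dynamic_error(cause: Exception) -> Exception:
    """Build an exception whose class is created at runtime.

    This is only a stand-in for the dynamically generated
    RayTaskError(ValueError) class named in the log.
    """
    dynamic_cls = type(f"TaskError({type(cause).__name__})", (type(cause),), {})
    return dynamic_cls(*cause.args)


def is_picklable(obj) -> bool:
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def to_plain_error(exc: Exception) -> RuntimeError:
    """Flatten any exception into a built-in RuntimeError that keeps the message."""
    return RuntimeError(f"{type(exc).__name__}: {exc}")


err = make_dynamic_error(ValueError("Model not loaded. Please retry later."))
plain = to_plain_error(err)
print("dynamic error picklable:", is_picklable(err))
print("flattened error picklable:", is_picklable(plain))
```

This only addresses the secondary logging failure. The root cause reported by the workers remains `Model not loaded. Please retry later.`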

Screenshots

No response

Additional

No response

[MM] image_aspect_ratio_filter

A new Filter image_aspect_ratio_filter is supported. It will filter out multimodal samples that contain images with unexpected aspect ratios.
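
As a rough illustration of the idea, the sketch below keeps a sample only when every image in it has a width-to-height ratio inside a configured range. It is not the operator's actual implementation: the function names, the `min_ratio` and `max_ratio` parameters and their defaults are assumptions made for this example, and image sizes are passed as plain `(width, height)` tuples to avoid any image-library dependency.

```python
from typing import Iterable, Tuple


def aspect_ratio(width: int, height: int) -> float:
    """Width divided by height; rejects non-positive dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    return width / height


def keep_sample(image_sizes: Iterable[Tuple[int, int]],
                min_ratio: float = 0.333,
                max_ratio: float = 3.0) -> bool:
    """Keep the sample only if all of its images fall inside [min_ratio, max_ratio].

    A sample without images is kept, since it has no offending image.
    """
    return all(min_ratio <= aspect_ratio(w, h) <= max_ratio
               for w, h in image_sizes)


print(keep_sample([(640, 480)]))               # a single 4:3 image
print(keep_sample([(640, 480), (4000, 100)]))  # the second image is a 40:1 banner
```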

export_path can not be a folder

Search before continuing

  • I have searched the Data-Juicer issues and found no similar feature requests.

Description

  • dataset_path is a folder path containing multiple JSON files.
  • After processing, I would like export_path to also be a folder, containing the corresponding JSON files.

Use case

No response

Additional

No response

Are you willing to submit a PR for this feature?

  • Yes I'd like to help by submitting a PR!

[Bug]: NameError: name 'fingerprint_warnings' is not defined TypeError: cannot pickle 'OpenCC' object

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

pip

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

image

To Reproduce

# configs/demo/process.yaml

# global parameters
project_name: 'demo-process'
dataset_path: './demos/data/demo-dataset.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset

export_path: './outputs/demo-process/demo-processed.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
  # - language_id_score_filter:
  #     lang: 'zh'
  - chinese_convert_mapper:
        mode: 's2t'
python tools/process_data.py --config configs/demo/process.yaml

The error is the one shown in the screenshot above.

Testing with the following standalone script works fine:

from data_juicer.ops.mapper.chinese_convert_mapper import ChineseConvertMapper

text = {"text": "我在馬路邊嫲緑變"}

op = ChineseConvertMapper(mode='t2s')

aa = op.process(text)
print(aa)
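
The standalone script runs in a single process, whereas the config above uses `np: 4`, so the operator has to be sent to worker processes. A common way around a `cannot pickle` error of this kind is to pass only picklable settings (such as the mode string) to the workers and build the converter lazily inside each one. The sketch below shows that pattern with a placeholder converter; `build_converter`, `get_converter` and `convert_sample` are hypothetical names, and the placeholder does not perform any real Chinese conversion.

```python
import pickle
import threading
from typing import Callable, Dict

# Per-process cache: each worker builds its own converter on first use.
_CONVERTERS: Dict[str, Callable[[str], str]] = {}


def build_converter(mode: str) -> Callable[[str], str]:
    """Hypothetical stand-in for creating an OpenCC converter.

    The returned closure holds a lock, so like the real object
    it cannot be pickled and sent to another process.
    """
    lock = threading.Lock()

    def convert(text: str) -> str:
        with lock:
            return f"[{mode}] {text}"  # placeholder, not a real conversion

    return convert


def get_converter(mode: str) -> Callable[[str], str]:
    """Return the cached converter for this mode, creating it if needed."""
    if mode not in _CONVERTERS:
        _CONVERTERS[mode] = build_converter(mode)
    return _CONVERTERS[mode]


def convert_sample(sample: dict, mode: str) -> dict:
    """Worker function: its arguments (a dict and a string) are picklable."""
    return {**sample, "text": get_converter(mode)(sample["text"])}


pickle.dumps(({"text": "abc"}, "t2s"))  # the arguments pickle without error
print(convert_sample({"text": "abc"}, "t2s"))
```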

Configs

No response

Logs

No response

Screenshots

No response

Additional

No response

FT-Data Ranker-1b OOM finetuning on single GPU

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

Using the code provided by the competition, I fine-tuned falcon-rw-1b with DeepSpeed on a single GPU (RTX 3090, 24 GB). I tried adjusting the training parameters and the DeepSpeed configuration, but every attempt ended in OOM.

Additional

  • Script
#!/bin/bash

set -e 
export CUDA_DEVICE_MAX_CONNECTIONS=1

if [ -z $XDG_CACHE_HOME ]; then
    export XDG_CACHE_HOME=$HOME/.cache
fi

if [[ $# -ne 3 ]]; then
    echo "Three arguments required! " >&2
    exit 2
fi

# Model Path
# e.g /home/model/baichuan2-7b/
model_path=${1} #/path/to/your/model/
tokenizer=${model_path}

# Data Path
# e.g /home/data/train.jsonl
data_path=${2} # /path/to/your/dataset.jsonl

# Output Path
# e.g ${WORK_DIR}/checkpoints/baichuan2-7b/
output_path=${3} #/path/to/your/output/

mkdir -p ${output_path}/

WORK_DIR=$(echo `cd $(dirname $0); pwd | xargs dirname`)
cd ${WORK_DIR}

# Deepspeed
# ds_config_file=${WORK_DIR}/train_scripts/deepspeed_configs/ds_config_stage3.json
ds_config_file=${WORK_DIR}/train_scripts/deepspeed_configs/ds_config_stage3_offload-para.json

# Train Parameter
bs_per_gpu=1
num_nodes=1
# nproc_per_node=`nvidia-smi | grep MiB | wc -l`
nproc_per_node=1
master_port=50000

# grad_acc=`expr 256 / ${bs_per_gpu} / ${num_nodes} / ${nproc_per_node}`
grad_acc=`expr 32 / ${bs_per_gpu} / ${num_nodes} / ${nproc_per_node}`
deepspeed --num_gpus ${nproc_per_node} --num_nodes ${num_nodes} --master_port ${master_port} train.py \
    --model_name_or_path ${model_path} \
    --tokenizer ${tokenizer} \
    --data_path ${data_path} \
    --output_dir ${output_path} \
    --per_device_train_batch_size ${bs_per_gpu} \
    --gradient_accumulation_steps ${grad_acc} \
    --lang en \
    --bf16 True \
    --gradient_checkpointing_enable True \
    --num_train_epochs 3 \
    --model_max_length 1024 \
    --learning_rate 2.5e-5 \
    --weight_decay 0 \
    --warmup_ratio 0.03 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --save_steps -1 \
    --save_total_limit 999 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --deepspeed ${ds_config_file} | tee ${output_path}/training_log.txt
  • Log
ORI NUMBER: 23237, AFTER FILETER: 22564, DROP NUMBER: 673
Total 22564 samples [ 6.48M tokens] in training!
  0%|          | 0/705 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/Projects/ft_data_ranker_1b/competition_kit/lm-training/train.py", line 465, in <module>
    train()
  File "/home/ubuntu/Projects/ft_data_ranker_1b/competition_kit/lm-training/train.py", line 457, in train
    trainer.train()
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/transformers/trainer.py", line 1971, in _inner_training_loop
    self.optimizer.step()
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
    self.optimizer.step(closure)
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/adamw.py", line 184, in step
    adamw(
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/adamw.py", line 335, in adamw
    func(
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/adamw.py", line 599, in _multi_tensor_adamw
    exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacty of 23.69 GiB of which 36.38 MiB is free. Including non-PyTorch memory, this process has 23.64 GiB memory in use. Of the allocated memory 22.50 GiB is allocated by PyTorch, and 105.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
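
A back-of-the-envelope estimate helps explain why a roughly 1B-parameter model can exhaust a 24 GB card when the optimizer state stays on the GPU, as the traceback ending in `torch/optim/adamw.py` suggests. The sketch below assumes about 1.3 billion parameters (an assumption; check the model card) and the usual mixed-precision Adam accounting of 16 bytes per parameter: 2 for bf16 weights, 2 for bf16 gradients and 12 for fp32 master weights plus the two Adam moments. Activations, CUDA context and fragmentation come on top, and what is actually resident on the GPU depends on the ZeRO offload settings in the DeepSpeed config file, which is not shown in this report.

```python
GIB = 1024 ** 3


def training_state_gib(num_params: float,
                       weight_bytes: int = 2,
                       grad_bytes: int = 2,
                       optimizer_bytes: int = 12) -> float:
    """Memory for weights, gradients and Adam state, in GiB (no activations)."""
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return num_params * bytes_per_param / GIB


# Assumed parameter count for illustration only.
estimate = training_state_gib(1.3e9)
print(f"about {estimate:.1f} GiB before activations, versus 23.69 GiB reported by the GPU")
```

Under these assumptions, moving the optimizer state to CPU memory would remove the largest term (12 of the 16 bytes per parameter) from the GPU.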

[Bug]: alphanumeric_filter, char.isalnum()

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

pip

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

https://github.com/alibaba/data-juicer/blob/main/data_juicer/ops/filter/alphanumeric_filter.py#L75

            alnum_count = sum(
                map(lambda char: 1
                    if char.isalnum() else 0, sample[self.text_key]))

Python 3 strings are Unicode by default, so '汉字'.isalnum() returns True. encode() defaults to UTF-8, and once a Chinese character is encoded to UTF-8 bytes, isalnum() no longer returns True for it:

            alnum_count = sum(
                map(lambda char: 1
                    if char.encode().isalnum() else 0, sample[self.text_key]))
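
The difference between the two snippets above can be checked in isolation. The following is a minimal sketch, not Data-Juicer's implementation: the function names `alnum_ratio_unicode` and `alnum_ratio_ascii` are hypothetical, and the ratio is assumed to be the alphanumeric character count divided by the text length.

```python
def alnum_ratio_unicode(text: str) -> float:
    """Ratio based on str.isalnum(), which accepts any Unicode letter or digit."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ch.isalnum()) / len(text)


def alnum_ratio_ascii(text: str) -> float:
    """Ratio based on bytes.isalnum(), which accepts only ASCII letters and digits.

    Each character is encoded to UTF-8 first; a non-ASCII character becomes a
    multi-byte sequence that bytes.isalnum() rejects.
    """
    if not text:
        return 0.0
    return sum(1 for ch in text if ch.encode().isalnum()) / len(text)


# "\u6c49\u5b57" is the two-character Chinese string used in the report.
chinese = "\u6c49\u5b57"
mixed = "ab" + chinese

print(alnum_ratio_unicode(chinese), alnum_ratio_ascii(chinese))
print(alnum_ratio_unicode(mixed), alnum_ratio_ascii(mixed))
```

Note that the bytes-based check also excludes every other non-ASCII letter and digit (accented Latin letters, full-width digits, and so on), so whether it is the right fix depends on whether "alphanumeric" is meant to be ASCII-only.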

To Reproduce

python tools/analyze_data.py --config configs/demo/analyser.yaml

Configs

project_name: 'demo-analyser'
dataset_path: 'demos/data/demo-dataset.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocesses used to process your dataset
text_keys: 'text'
export_path: './outputs/demo-analyser/demo-analyser-result.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
    - alphanumeric_filter:

Logs

No response

Screenshots

image

The fifth row is entirely Chinese, so its alnum_ratio (the ratio of alphanumeric characters) should be 0.

Additional

No response
