Describe the bug
The csv files written from tokenised transformers dataset objects contain truncated array values if the sequence is too long.
To Reproduce
Please provide a minimal reproducible example with all steps to reproduce the behaviour before submitting an issue:
The fields input_tokens, token_type_ids and attention_mask are truncated if the feature is too long. This affects the output csv file only.
# sample run on arbitrary file with very long item
create_dataset_bio <infile_path_1> <infile_path_2> <tokeniser>
# sample output csv file
some_seq,<very very long sequence>,1,"[10 ... 20]","[0 ... 0]","[1 ... 1]"
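The space-separated `[10 ... 20]` form in the sample output looks like NumPy's summarised array text. The sketch below assumes that the array columns reach the csv writer as NumPy arrays and are converted to text there; this has not been confirmed against the ziran source. The array names are hypothetical stand-ins for tokenised features.

```python
import numpy as np

# Hypothetical stand-ins for tokenised features; not taken from ziran itself.
short_ids = np.arange(10)
long_ids = np.arange(5000)

# NumPy summarises arrays with more elements than its print threshold
# (1000 by default) when they are converted to text.
short_text = str(short_ids)
long_text = str(long_ids)

print(short_text)
print(long_text)

# Only the long array is expected to contain the "..." placeholder.
print("short truncated:", "..." in short_text)
print("long truncated:", "..." in long_text)
```

If this assumption holds, any feature longer than the print threshold would lose its middle values in the csv output while shorter ones stay intact.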
Please make sure to include environment info including python and dependency versions. You can access this with pip freeze or conda list as needed.
# this was installed with conda install -c tyronechen ziran
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_kmp_llvm conda-forge
_py-xgboost-mutex 2.0 cpu_0 conda-forge
abseil-cpp 20210324.2 h9c3ff4c_0 conda-forge
aiohttp 3.8.4 py39h72bdee0_0 conda-forge
aiohttp-cors 0.7.0 py_0 conda-forge
aioredis 1.3.1 py_0 conda-forge
aiosignal 1.3.1 pyhd8ed1ab_0 conda-forge
alsa-lib 1.2.8 h166bdaf_0 conda-forge
arrow-cpp 8.0.0 py39heccc63a_1_cpu conda-forge
async-timeout 4.0.2 pyhd8ed1ab_0 conda-forge
attr 2.5.1 h166bdaf_1 conda-forge
attrs 22.2.0 pyh71513ae_0 conda-forge
aws-c-cal 0.5.11 h95a6274_0 conda-forge
aws-c-common 0.6.2 h7f98852_0 conda-forge
aws-c-event-stream 0.2.7 h3541f99_13 conda-forge
aws-c-io 0.10.5 hfb6a706_0 conda-forge
aws-checksums 0.1.11 ha31a3da_7 conda-forge
aws-sdk-cpp 1.8.186 hecaee15_4 conda-forge
backcall 0.2.0 pyh9f0ad1d_0 conda-forge
backports 1.0 pyhd8ed1ab_3 conda-forge
backports.functools_lru_cache 1.6.4 pyhd8ed1ab_0 conda-forge
blessed 1.19.1 pyhe4f9e05_2 conda-forge
brotli 1.0.9 h166bdaf_8 conda-forge
brotli-bin 1.0.9 h166bdaf_8 conda-forge
brotlipy 0.7.0 py39hb9d737c_1005 conda-forge
bz2file 0.98 py_0 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.18.1 h7f98852_0 conda-forge
ca-certificates 2022.12.7 ha878542_0 conda-forge
cachetools 5.3.0 pyhd8ed1ab_0 conda-forge
cairo 1.16.0 ha61ee94_1014 conda-forge
captum 0.6.0 pyhd8ed1ab_0 conda-forge
certifi 2022.12.7 pyhd8ed1ab_0 conda-forge
cffi 1.15.1 py39he91dace_3 conda-forge
charset-normalizer 2.1.1 pyhd8ed1ab_0 conda-forge
click 8.0.4 py39hf3d152e_0 conda-forge
cloudpickle 2.2.1 pyhd8ed1ab_0 conda-forge
colorama 0.4.6 pyhd8ed1ab_0 conda-forge
colorful 0.5.4 pyhd8ed1ab_0 conda-forge
cryptography 39.0.0 py39hd598818_0 conda-forge
cudatoolkit 11.8.0 h37601d7_11 conda-forge
cudnn 8.4.1.50 hed8a83a_0 conda-forge
cycler 0.11.0 pyhd8ed1ab_0 conda-forge
dataclasses 0.8 pyhc8e2a94_3 conda-forge
datasets 2.10.1 pyhd8ed1ab_0 conda-forge
dbus 1.13.6 h5008d03_3 conda-forge
decorator 5.1.1 pyhd8ed1ab_0 conda-forge
dill 0.3.6 pyhd8ed1ab_1 conda-forge
distlib 0.3.6 pyhd8ed1ab_0 conda-forge
docker-pycreds 0.4.0 py_0 conda-forge
expat 2.5.0 h27087fc_0 conda-forge
fftw 3.3.10 nompi_hf0379b8_106 conda-forge
filelock 3.10.0 pyhd8ed1ab_0 conda-forge
font-ttf-dejavu-sans-mono 2.37 hab24e00_0 conda-forge
font-ttf-inconsolata 3.000 h77eed37_0 conda-forge
font-ttf-source-code-pro 2.038 h77eed37_0 conda-forge
font-ttf-ubuntu 0.83 hab24e00_0 conda-forge
fontconfig 2.14.2 h14ed4e7_0 conda-forge
fonts-conda-ecosystem 1 0 conda-forge
fonts-conda-forge 1 0 conda-forge
fonttools 4.39.2 py39h72bdee0_0 conda-forge
freetype 2.12.1 hca18f0e_1 conda-forge
frozenlist 1.3.3 py39hb9d737c_0 conda-forge
fsspec 2023.3.0 pyhd8ed1ab_1 conda-forge
future 0.18.3 pyhd8ed1ab_0 conda-forge
gensim 4.2.0 py39h1832856_0 conda-forge
gettext 0.21.1 h27087fc_0 conda-forge
gflags 2.2.2 he1b5a44_1004 conda-forge
gitdb 4.0.10 pyhd8ed1ab_0 conda-forge
gitpython 3.1.31 pyhd8ed1ab_0 conda-forge
glib 2.74.1 h6239696_1 conda-forge
glib-tools 2.74.1 h6239696_1 conda-forge
glog 0.6.0 h6f12383_0 conda-forge
google-api-core 2.10.0 pyhd8ed1ab_0 conda-forge
google-auth 2.16.2 pyh1a96a4e_0 conda-forge
googleapis-common-protos 1.57.0 py39hf3d152e_0 conda-forge
gpustat 1.0.0 pyhd8ed1ab_0 conda-forge
graphite2 1.3.13 h58526e2_1001 conda-forge
grpc-cpp 1.43.2 h9e046d8_3 conda-forge
grpcio 1.43.0 py39hff7568b_0 conda-forge
gst-plugins-base 1.21.3 h4243ec0_1 conda-forge
gstreamer 1.21.3 h25f0c4b_1 conda-forge
gstreamer-orc 0.4.33 h166bdaf_0 conda-forge
harfbuzz 6.0.0 h8e241bc_0 conda-forge
hiredis 2.0.0 py39hb9d737c_3 conda-forge
huggingface_hub 0.13.2 pyhd8ed1ab_0 conda-forge
hyperopt 0.2.7 pyhd8ed1ab_0 conda-forge
icu 70.1 h27087fc_0 conda-forge
idna 3.4 pyhd8ed1ab_0 conda-forge
importlib-metadata 6.0.0 pyha770c72_0 conda-forge
importlib_metadata 6.0.0 hd8ed1ab_0 conda-forge
importlib_resources 5.12.0 pyhd8ed1ab_0 conda-forge
ipython 7.33.0 py39hf3d152e_0 conda-forge
jack 1.9.22 h11f4161_0 conda-forge
jedi 0.18.2 pyhd8ed1ab_0 conda-forge
joblib 1.2.0 pyhd8ed1ab_0 conda-forge
jpeg 9e h0b41bf4_3 conda-forge
jsonschema 4.17.3 pyhd8ed1ab_0 conda-forge
keyutils 1.6.1 h166bdaf_0 conda-forge
kiwisolver 1.4.4 py39hf939315_1 conda-forge
krb5 1.20.1 hf9c8cef_0 conda-forge
lame 3.100 h166bdaf_1003 conda-forge
lcms2 2.15 hfd0df8a_0 conda-forge
ld_impl_linux-64 2.40 h41732ed_0 conda-forge
lerc 4.0.0 h27087fc_0 conda-forge
libblas 3.9.0 12_linux64_mkl conda-forge
libbrotlicommon 1.0.9 h166bdaf_8 conda-forge
libbrotlidec 1.0.9 h166bdaf_8 conda-forge
libbrotlienc 1.0.9 h166bdaf_8 conda-forge
libcap 2.66 ha37c62d_0 conda-forge
libcblas 3.9.0 12_linux64_mkl conda-forge
libclang 15.0.7 default_had23c3d_1 conda-forge
libclang13 15.0.7 default_h3e3d535_1 conda-forge
libcrc32c 1.1.2 h9c3ff4c_0 conda-forge
libcups 2.3.3 h36d4200_3 conda-forge
libcurl 7.87.0 h6312ad2_0 conda-forge
libdb 6.2.32 h9c3ff4c_0 conda-forge
libdeflate 1.17 h0b41bf4_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 h516909a_1 conda-forge
libevent 2.1.10 h9b69904_4 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libflac 1.4.2 h27087fc_0 conda-forge
libgcc-ng 12.2.0 h65d4601_19 conda-forge
libgcrypt 1.10.1 h166bdaf_0 conda-forge
libgfortran-ng 12.2.0 h69a702a_19 conda-forge
libgfortran5 12.2.0 h337968e_19 conda-forge
libglib 2.74.1 h606061b_1 conda-forge
libgoogle-cloud 1.36.0 h6945097_0 conda-forge
libgpg-error 1.46 h620e276_0 conda-forge
libhwloc 2.9.0 hd6dc26d_0 conda-forge
libiconv 1.17 h166bdaf_0 conda-forge
liblapack 3.9.0 12_linux64_mkl conda-forge
libllvm15 15.0.7 hadd5161_1 conda-forge
libnghttp2 1.51.0 hdcd2b5c_0 conda-forge
libnsl 2.0.0 h7f98852_0 conda-forge
libogg 1.3.4 h7f98852_1 conda-forge
libopus 1.3.1 h7f98852_1 conda-forge
libpng 1.6.39 h753d276_0 conda-forge
libpq 15.1 h2baec63_3 conda-forge
libprotobuf 3.19.4 h780b84a_0 conda-forge
libsndfile 1.2.0 hb75c966_0 conda-forge
libsqlite 3.40.0 h753d276_0 conda-forge
libssh2 1.10.0 haa6b8db_3 conda-forge
libstdcxx-ng 12.2.0 h46fd767_19 conda-forge
libsystemd0 252 h2a991cd_0 conda-forge
libthrift 0.16.0 h491838f_2 conda-forge
libtiff 4.5.0 h6adf6a1_2 conda-forge
libtool 2.4.7 h27087fc_0 conda-forge
libudev1 253 h0b41bf4_0 conda-forge
libunwind 1.6.2 h9c3ff4c_0 conda-forge
libutf8proc 2.8.0 h166bdaf_0 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
libvorbis 1.3.7 h9c3ff4c_0 conda-forge
libwebp-base 1.3.0 h0b41bf4_0 conda-forge
libxcb 1.13 h7f98852_1004 conda-forge
libxgboost 1.7.1 cpu_ha3b9936_0 conda-forge
libxkbcommon 1.5.0 h79f4944_1 conda-forge
libxml2 2.10.3 hca2bb57_3 conda-forge
libzlib 1.2.13 h166bdaf_4 conda-forge
llvm-openmp 15.0.7 h0cdce71_0 conda-forge
lz4-c 1.9.3 h9c3ff4c_1 conda-forge
magma 2.5.4 hc72dce7_4 conda-forge
matplotlib 3.5.2 py39hf3d152e_1 conda-forge
matplotlib-base 3.5.2 py39h700656a_1 conda-forge
matplotlib-inline 0.1.6 pyhd8ed1ab_0 conda-forge
mkl 2021.4.0 h8d4b97c_729 conda-forge
mpg123 1.31.2 hcb278e6_0 conda-forge
mpmath 1.3.0 pyhd8ed1ab_0 conda-forge
msgpack-python 1.0.5 py39h4b4f3f3_0 conda-forge
multidict 6.0.4 py39h72bdee0_0 conda-forge
multiprocess 0.70.14 py39hb9d737c_3 conda-forge
munkres 1.1.4 pyh9f0ad1d_0 conda-forge
mysql-common 8.0.32 h14678bc_0 conda-forge
mysql-libs 8.0.32 h54cf53e_0 conda-forge
nccl 2.14.3.1 h0800d71_0 conda-forge
ncurses 6.3 h27087fc_1 conda-forge
networkx 3.0 pyhd8ed1ab_0 conda-forge
ninja 1.11.1 h924138e_0 conda-forge
nipals 0.5.5 pypi_0 pypi
nspr 4.35 h27087fc_0 conda-forge
nss 3.89 he45b914_0 conda-forge
numpy 1.24.2 py39h7360e5f_0 conda-forge
nvidia-ml-py 11.495.46 pyhd8ed1ab_0 conda-forge
opencensus 0.11.2 pyhd8ed1ab_0 conda-forge
opencensus-context 0.1.3 py39hf3d152e_1 conda-forge
openjpeg 2.5.0 hfec8fc6_2 conda-forge
openssl 1.1.1t h0b41bf4_0 conda-forge
orc 1.7.3 h1be678f_0 conda-forge
packaging 23.0 pyhd8ed1ab_0 conda-forge
pandas 1.4.2 py39h1832856_2 conda-forge
parquet-cpp 1.5.1 2 conda-forge
parso 0.8.3 pyhd8ed1ab_0 conda-forge
pathtools 0.1.2 py_1 conda-forge
patsy 0.5.3 pyhd8ed1ab_0 conda-forge
pcre2 10.40 hc3806b6_0 conda-forge
pexpect 4.8.0 pyh1a96a4e_2 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pillow 9.4.0 py39h2320bf1_1 conda-forge
pip 23.0.1 pyhd8ed1ab_0 conda-forge
pixman 0.40.0 h36c2ea0_0 conda-forge
pkgutil-resolve-name 1.3.10 pyhd8ed1ab_0 conda-forge
platformdirs 3.1.1 pyhd8ed1ab_0 conda-forge
ply 3.11 py_1 conda-forge
pooch 1.7.0 pyhd8ed1ab_0 conda-forge
powerlaw 1.4.6 pyh9f0ad1d_1 conda-forge
prometheus_client 0.13.1 pyhd8ed1ab_0 conda-forge
promise 2.3 py39hf3d152e_7 conda-forge
prompt-toolkit 3.0.38 pyha770c72_0 conda-forge
protobuf 3.19.4 py39he80948d_0 conda-forge
psutil 5.9.4 py39hb9d737c_0 conda-forge
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pulseaudio 16.1 h4ab2085_1 conda-forge
py-spy 0.3.14 h87a5ac0_0 conda-forge
py-xgboost 1.7.1 cpu_py39h4655687_0 conda-forge
py4j 0.10.9.7 pyhd8ed1ab_0 conda-forge
pyarrow 8.0.0 py39h42d110c_1_cpu conda-forge
pyasn1 0.4.8 py_0 conda-forge
pyasn1-modules 0.2.7 py_0 conda-forge
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pygments 2.14.0 pyhd8ed1ab_0 conda-forge
pyopenssl 23.0.0 pyhd8ed1ab_0 conda-forge
pyparsing 3.0.9 pyhd8ed1ab_0 conda-forge
pyqt 5.15.7 py39h5c7b992_3 conda-forge
pyqt5-sip 12.11.0 py39h227be39_3 conda-forge
pyrsistent 0.19.3 py39h72bdee0_0 conda-forge
pysocks 1.7.1 pyha2e5f31_6 conda-forge
python 3.9.15 h47a2c10_0_cpython conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python-xxhash 3.2.0 py39h72bdee0_0 conda-forge
python_abi 3.9 3_cp39 conda-forge
pytorch 1.10.0 cuda112py39h3ad47f5_1 conda-forge
pytz 2022.7.1 pyhd8ed1ab_0 conda-forge
pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge
pyyaml 6.0 py39hb9d737c_5 conda-forge
qt-main 5.15.6 h18908ee_6 conda-forge
ray-core 1.13.0 py39hecbb631_2 conda-forge
ray-default 1.13.0 py39hf3d152e_2 conda-forge
re2 2022.02.01 h9c3ff4c_0 conda-forge
readline 8.1.2 h0f457ee_0 conda-forge
regex 2022.10.31 py39hb9d737c_0 conda-forge
requests 2.28.2 pyhd8ed1ab_0 conda-forge
responses 0.18.0 pyhd8ed1ab_0 conda-forge
rsa 4.9 pyhd8ed1ab_0 conda-forge
s2n 1.0.10 h9b69904_0 conda-forge
sacremoses 0.0.53 pyhd8ed1ab_0 conda-forge
scikit-learn 1.1.1 py39h4037b75_0 conda-forge
scipy 1.10.1 py39h7360e5f_0 conda-forge
screed 1.0.5 pyhd8ed1ab_1 conda-forge
seaborn 0.11.2 hd8ed1ab_0 conda-forge
seaborn-base 0.11.2 pyhd8ed1ab_0 conda-forge
sentencepiece 0.1.96 py39hf939315_1 conda-forge
sentry-sdk 1.17.0 pyhd8ed1ab_0 conda-forge
setproctitle 1.2.2 py39hb9d737c_2 conda-forge
setuptools 67.6.0 pyhd8ed1ab_0 conda-forge
shortuuid 1.0.11 pyhd8ed1ab_0 conda-forge
sip 6.7.7 py39h227be39_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
sleef 3.5.1 h9b69904_2 conda-forge
smart_open 6.3.0 pyhd8ed1ab_1 conda-forge
smmap 3.0.5 pyh44b312d_0 conda-forge
snappy 1.1.10 h9fff704_0 conda-forge
statsmodels 0.13.5 py39h2ae25f5_2 conda-forge
tabulate 0.9.0 pyhd8ed1ab_1 conda-forge
tbb 2021.8.0 hf52228f_0 conda-forge
threadpoolctl 3.1.0 pyh8a188c0_0 conda-forge
tk 8.6.12 h27826a3_0 conda-forge
tokenizers 0.12.1 py39h3045328_1 conda-forge
toml 0.10.2 pyhd8ed1ab_0 conda-forge
tornado 6.2 py39hb9d737c_1 conda-forge
tqdm 4.64.0 pyhd8ed1ab_0 conda-forge
traitlets 5.9.0 pyhd8ed1ab_0 conda-forge
transformers 4.23.1 pyhd8ed1ab_0 conda-forge
transformers-interpret 0.8.1 pyhd8ed1ab_0 conda-forge
typing-extensions 4.5.0 hd8ed1ab_0 conda-forge
typing_extensions 4.5.0 pyha770c72_0 conda-forge
tzdata 2022g h191b570_0 conda-forge
unicodedata2 15.0.0 py39hb9d737c_0 conda-forge
urllib3 1.26.15 pyhd8ed1ab_0 conda-forge
virtualenv 20.21.0 pyhd8ed1ab_0 conda-forge
wandb 0.13.4 pyhd8ed1ab_0 conda-forge
wcwidth 0.2.6 pyhd8ed1ab_0 conda-forge
weightwatcher 0.6.4 py_0 tyronechen
wheel 0.40.0 pyhd8ed1ab_0 conda-forge
xcb-util 0.4.0 h516909a_0 conda-forge
xcb-util-image 0.4.0 h166bdaf_0 conda-forge
xcb-util-keysyms 0.4.0 h516909a_0 conda-forge
xcb-util-renderutil 0.3.9 h166bdaf_0 conda-forge
xcb-util-wm 0.4.1 h516909a_0 conda-forge
xgboost 1.7.1 cpu_py39h4655687_0 conda-forge
xkeyboard-config 2.38 h0b41bf4_0 conda-forge
xorg-kbproto 1.0.7 h7f98852_1002 conda-forge
xorg-libice 1.0.10 h7f98852_0 conda-forge
xorg-libsm 1.2.3 hd9c2040_1000 conda-forge
xorg-libx11 1.8.4 h0b41bf4_0 conda-forge
xorg-libxau 1.0.9 h7f98852_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xorg-libxext 1.3.4 h0b41bf4_2 conda-forge
xorg-libxrender 0.9.10 h7f98852_1003 conda-forge
xorg-renderproto 0.11.1 h7f98852_1002 conda-forge
xorg-xextproto 7.3.0 h0b41bf4_1003 conda-forge
xorg-xproto 7.0.31 h7f98852_1007 conda-forge
xxhash 0.8.1 h0b41bf4_0 conda-forge
xz 5.2.6 h166bdaf_0 conda-forge
yaml 0.2.5 h7f98852_2 conda-forge
yarl 1.8.2 py39hb9d737c_0 conda-forge
yellowbrick 1.3.post1 pyhd8ed1ab_1 conda-forge
zipp 3.15.0 pyhd8ed1ab_0 conda-forge
ziran 1.0.9 0 tyronechen
zlib 1.2.13 h166bdaf_4 conda-forge
zstd 1.5.2 h3eb15da_6 conda-forge
Expected behavior
A clear and concise description of what you expected to happen.
csv files should not have truncated array values.
Suggested fix
If known.
Temporary fix: Use parquet and json files as input for training since these are unaffected.
Long term fix: Increase the array size limit for printing in pandas and/or numpy.
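A minimal sketch of two ways the long term fix could look, assuming the truncation comes from NumPy's print threshold. Neither is taken from ziran, and `long_ids` is a hypothetical feature. The first option lifts the threshold while the text is produced; the second avoids array printing altogether by serialising a plain list.

```python
import json
import sys

import numpy as np

long_ids = np.arange(5000)  # hypothetical long tokenised feature

# Option 1: lift the summarisation threshold only while the text is produced.
with np.printoptions(threshold=sys.maxsize):
    full_text = str(long_ids)

# Option 2: serialise through a plain list, which is never summarised
# and can be parsed back without a custom reader.
as_json = json.dumps(long_ids.tolist())
restored = np.array(json.loads(as_json))

print("option 1 truncated:", "..." in full_text)
print("option 2 round trip ok:", np.array_equal(restored, long_ids))
```

Option 2 may be the more robust choice, since the written value no longer depends on global print settings and stays machine-readable.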
Screenshots
If applicable, add screenshots to help explain your problem.
Not applicable