Giter Club home page Giter Club logo

Comments (2)

scampion avatar scampion commented on June 2, 2024

I downloaded the jsonl file and extract it manually.
The issue seems to be related to pyarrow.json

python3 -q -X faulthandler -c "from datasets import load_dataset; load_dataset('json', data_files='/Users/scampion/Downloads/1998-09.jsonl')"
Generating train split: 0 examples [00:00, ? examples/s]Fatal Python error: Segmentation fault

Thread 0x00007000000c1000 (most recent call first):

Thread 0x00007000024df000 (most recent call first):
File "/usr/local/Cellar/[email protected]/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 331 in wait
File "/usr/local/Cellar/[email protected]/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 629 in wait
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
File "/usr/local/Cellar/[email protected]/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/usr/local/Cellar/[email protected]/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007ff845c66640 (most recent call first):
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/packaged_modules/json/json.py", line 122 in _generate_tables
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/builder.py", line 1995 in _prepare_split_single
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/builder.py", line 1882 in _prepare_split
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/builder.py", line 1122 in _download_and_prepare
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/builder.py", line 1027 in download_and_prepare
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/load.py", line 2609 in load_dataset
File "", line 1 in

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, charset_normalizer.md, yaml._yaml, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json (total: 72)
[1] 56678 segmentation fault python3 -q -X faulthandler -c
/usr/local/Cellar/[email protected]/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
(venv_test)

from datasets.

scampion avatar scampion commented on June 2, 2024

The error comes from data where one line contains "null"

from datasets.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.