Comments (2)
I downloaded the jsonl file and extracted it manually.
The issue seems to be related to pyarrow.json:
python3 -q -X faulthandler -c "from datasets import load_dataset; load_dataset('json', data_files='/Users/scampion/Downloads/1998-09.jsonl')"
Generating train split: 0 examples [00:00, ? examples/s]Fatal Python error: Segmentation fault
Thread 0x00007000000c1000 (most recent call first):
Thread 0x00007000024df000 (most recent call first):
File "/usr/local/Cellar/python@3.11/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 331 in wait
File "/usr/local/Cellar/python@3.11/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 629 in wait
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
File "/usr/local/Cellar/python@3.11/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/usr/local/Cellar/python@3.11/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007ff845c66640 (most recent call first):
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/packaged_modules/json/json.py", line 122 in _generate_tables
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/builder.py", line 1995 in _prepare_split_single
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/builder.py", line 1882 in _prepare_split
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/builder.py", line 1122 in _download_and_prepare
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/builder.py", line 1027 in download_and_prepare
File "/Users/scampion/src/test/venv_test/lib/python3.11/site-packages/datasets/load.py", line 2609 in load_dataset
File "<string>", line 1 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, charset_normalizer.md, yaml._yaml, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json (total: 72)
[1] 56678 segmentation fault python3 -q -X faulthandler -c
/usr/local/Cellar/python@3.11/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
(venv_test)
from datasets.
The error comes from the data: one line of the file contains "null".
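A possible way to check for this before loading is to scan the file with the standard library and flag any line that does not parse as a JSON object, such as a bare null. This is only a sketch: the helper names `find_non_object_lines` and `write_clean_copy` are made up for illustration, and it assumes the file is UTF-8 with one JSON value per line.

```python
import json


def find_non_object_lines(path):
    """Return (line_number, text) pairs for lines that are not JSON objects.

    Hypothetical helper: a bare `null`, a number, a list, or invalid JSON
    are all reported, since each row is expected to be an object.
    """
    bad = []
    with open(path, encoding="utf-8") as fh:
        for lineno, raw in enumerate(fh, start=1):
            text = raw.strip()
            if not text:
                continue  # blank lines are ignored, not reported
            try:
                value = json.loads(text)
            except json.JSONDecodeError:
                bad.append((lineno, text))
                continue
            if not isinstance(value, dict):
                bad.append((lineno, text))
    return bad


def write_clean_copy(src, dst):
    """Copy src to dst, keeping only lines that parse as JSON objects.

    Returns the number of lines written.
    """
    bad_linenos = {lineno for lineno, _ in find_non_object_lines(src)}
    kept = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for lineno, raw in enumerate(fin, start=1):
            if lineno in bad_linenos or not raw.strip():
                continue
            fout.write(raw if raw.endswith("\n") else raw + "\n")
            kept += 1
    return kept
```

Running `find_non_object_lines` on the extracted file should list the offending line numbers, and the copy produced by `write_clean_copy` can then be passed to `load_dataset('json', data_files=...)` in place of the original. Whether that avoids the segmentation fault in this particular environment has not been verified here.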