Hi there! First of all, thank you for the tooling, it's incredibly powerful. I have been using json2parquet to process some intricate .jsonl files. It works well for small files, but not for larger ones: on my EC2 instance with 16 GB of RAM, json2parquet sometimes stalls and never finishes writing a Parquet file, or the process gets killed.
**Steps to reproduce**
I put together a simplified example that should be easy to reproduce, with a schema, a file that works, and a file that doesn't. Unfortunately, the problem only shows up with large files.
```bash
curl -OJ https://felixh-shareables.s3.us-west-2.amazonaws.com/schema.schema
curl -OJ https://felixh-shareables.s3.us-west-2.amazonaws.com/json2parquet-small.jsonl.gz # 80 MB compressed, 2 GB uncompressed
curl -OJ https://felixh-shareables.s3.us-west-2.amazonaws.com/json2parquet-large.jsonl.gz # 1.1 GB compressed, 37 GB uncompressed
```
```bash
gunzip -c json2parquet-small.jsonl.gz | json2parquet -s schema.schema /dev/stdin json2parquet-small.parquet
# Success: I can read the output into DuckDB and confirm it works.

gunzip -c json2parquet-large.jsonl.gz | json2parquet -s schema.schema /dev/stdin json2parquet-large.parquet
# Error: Killed

gunzip -c json2parquet-large.jsonl.gz | json2parquet -s schema.schema /dev/stdin /dev/stdout >| myout.parquet
# Hangs; see the htop results below and the pv probe after this block.

# Breaking the steps apart:
gunzip json2parquet-large.jsonl.gz
json2parquet json2parquet-large.jsonl json2parquet-large.parquet
# Hangs for a while and creates a 0-byte Parquet file. After a long time, htop
# looks like the image below, and at some point the process gets killed.
```
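While the piped runs above hang, the stream can be probed with `pv` (assuming it is installed) to see whether json2parquet is still consuming input:

```bash
# pv prints the throughput of the decompressed stream between gunzip and
# json2parquet; if it drops to ~0 B/s, json2parquet has stopped reading.
gunzip -c json2parquet-large.jsonl.gz | pv | json2parquet -s schema.schema /dev/stdin json2parquet-large.parquet
```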
**Results from htop**
![image](https://private-user-images.githubusercontent.com/50616313/274773555-18b40063-2487-4889-8b74-cd91c4633931.png)
I have let these processes hang for an hour. The behavior is the same regardless of whether I request compressed output.
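For example, a run like the following (using the `-c gzip` form from my update below) hangs in exactly the same way:

```bash
# Same pipeline with compressed output requested; behaves identically.
gunzip -c json2parquet-large.jsonl.gz | json2parquet -s schema.schema -c gzip /dev/stdin json2parquet-large.parquet
```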
Is there anything I am missing in how I am using the tooling? The only things I can think of are `--data-page-size-limit`, `--write-batch-size`, and `--max-row-group-size`. If you have time, please let me know what could be going wrong. Thanks!
**Update:**
```bash
json2parquet json2parquet-large.jsonl json2parquet-large.parquet -s schema.schema --data-page-size-limit 1024 --write-batch-size 1024 --max-row-group-size 1024 -c gzip
```
The command above works; however, piping from the compressed file still fails:
```bash
gunzip -c 1gb.jsonl.gz | json2parquet -s schema.schema -c gzip --data-page-size-limit 1024 --write-batch-size 1024 --max-row-group-size 1024 /dev/stdin myfile.parquet.gz
```
I would prefer not to unpack the files first, as they can be very large; this one is already 37 GB uncompressed.
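As a stopgap I can decompress to disk (keeping the archive with `gunzip -k`) and convert from there, as in the update above, but that needs roughly 37 GB of scratch space per file, which is exactly what I am trying to avoid:

```bash
# Works (per the update above), but requires ~37 GB of free disk per file.
gunzip -k json2parquet-large.jsonl.gz
json2parquet json2parquet-large.jsonl json2parquet-large.parquet -s schema.schema \
  --data-page-size-limit 1024 --write-batch-size 1024 --max-row-group-size 1024 -c gzip
```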
**Update II:**
OK, I think I figured it out...
```bash
gunzip -c 1gb.jsonl.gz | json2parquet -s schema.schema -c gzip --data-page-size-limit 1024 --write-batch-size 1024 --max-row-group-size 1024 --max-read-records 1 /dev/stdin myfile.parquet.gz
```
Here, `--max-read-records 1` seems to do the trick. It looks like json2parquet still tries to infer the schema even though I am passing one with `-s`. In that case it has to buffer the entire unzipped stream in memory before it can start writing, which can never succeed here, since the 37 GB uncompressed file is far larger than the 16 GB of available memory.
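A quick way to sanity-check this theory (my reading of the behavior, not verified against the code) is to compare peak memory with and without the flag on the smaller file, assuming GNU time is installed:

```bash
# "Maximum resident set size" in the GNU time report is the peak memory. If
# inference buffers the whole stream, the first run's peak should track the
# input size while the second run's should stay flat.
/usr/bin/time -v bash -c \
  'gunzip -c json2parquet-small.jsonl.gz | json2parquet -s schema.schema /dev/stdin out-a.parquet'
/usr/bin/time -v bash -c \
  'gunzip -c json2parquet-small.jsonl.gz | json2parquet -s schema.schema --max-read-records 1 /dev/stdin out-b.parquet'
```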