Hi there! First of all, thank you for the tooling, it's incredibly powerful. I have been using json2parquet to process some intricate .jsonl files. It works well for small files, but not for larger ones: on my EC2 instance with 16 GB of RAM, json2parquet sometimes stalls and never finishes writing a Parquet file, or the process gets killed.
**Steps to reproduce**
I put together a simplified example that should be easy to reproduce, with a schema, a file that works, and a file that doesn't. Unfortunately, the problem only shows up with large files.
```bash
curl -OJ https://felixh-shareables.s3.us-west-2.amazonaws.com/schema.schema
curl -OJ https://felixh-shareables.s3.us-west-2.amazonaws.com/json2parquet-small.jsonl.gz # 80 MB compressed, 2 GB uncompressed
curl -OJ https://felixh-shareables.s3.us-west-2.amazonaws.com/json2parquet-large.jsonl.gz # 1.1 GB compressed, 37 GB uncompressed
```
```bash
gunzip -c json2parquet-small.jsonl.gz | json2parquet -s schema.schema /dev/stdin json2parquet-small.parquet
# Success: I can read the output into DuckDB and confirm it works.

gunzip -c json2parquet-large.jsonl.gz | json2parquet -s schema.schema /dev/stdin json2parquet-large.parquet
# Error: Killed

gunzip -c json2parquet-large.jsonl.gz | json2parquet -s schema.schema /dev/stdin /dev/stdout >| myout.parquet
# Hangs; see the htop results below and the pv probe after this block.

# Breaking the steps apart:
gunzip json2parquet-large.jsonl.gz
json2parquet json2parquet-large.jsonl json2parquet-large.parquet
# Hangs for a while and creates a 0-byte Parquet file. After a long time, htop
# looks like the image below, and at some point the process gets killed.
```
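While the piped runs above hang, the stream can be probed with `pv` (assuming it is installed) to see whether json2parquet is still consuming input:

```bash
# pv prints the throughput of the decompressed stream between gunzip and
# json2parquet; if it drops to ~0 B/s, json2parquet has stopped reading.
gunzip -c json2parquet-large.jsonl.gz | pv | json2parquet -s schema.schema /dev/stdin json2parquet-large.parquet
```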
**Results from htop**
![image](https://private-user-images.githubusercontent.com/50616313/274773555-18b40063-2487-4889-8b74-cd91c4633931.png)
I have let these processes hang for an hour. The behavior is the same regardless of whether I request compressed output.
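For example, a run like the following (using the `-c gzip` form from my update below) hangs in exactly the same way:

```bash
# Same pipeline with compressed output requested; behaves identically.
gunzip -c json2parquet-large.jsonl.gz | json2parquet -s schema.schema -c gzip /dev/stdin json2parquet-large.parquet
```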
Is there anything I am missing in how I am using the tooling? The only things I can think of are `--data-page-size-limit`, `--write-batch-size`, and `--max-row-group-size`. If you have time, please let me know what could be going wrong. Thanks!
**Update:**
```bash
json2parquet json2parquet-large.jsonl json2parquet-large.parquet -s schema.schema --data-page-size-limit 1024 --write-batch-size 1024 --max-row-group-size 1024 -c gzip
```
The command above works; however, piping from the compressed file still fails:
```bash
gunzip -c 1gb.jsonl.gz | json2parquet -s schema.schema -c gzip --data-page-size-limit 1024 --write-batch-size 1024 --max-row-group-size 1024 /dev/stdin myfile.parquet.gz
```
I would prefer not to unpack the files first, as they can be very large; this one is already 37 GB uncompressed.
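As a stopgap I can decompress to disk (keeping the archive with `gunzip -k`) and convert from there, as in the update above, but that needs roughly 37 GB of scratch space per file, which is exactly what I am trying to avoid:

```bash
# Works (per the update above), but requires ~37 GB of free disk per file.
gunzip -k json2parquet-large.jsonl.gz
json2parquet json2parquet-large.jsonl json2parquet-large.parquet -s schema.schema \
  --data-page-size-limit 1024 --write-batch-size 1024 --max-row-group-size 1024 -c gzip
```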
**Update II:**
OK, I think I figured it out...
```bash
gunzip -c 1gb.jsonl.gz | json2parquet -s schema.schema -c gzip --data-page-size-limit 1024 --write-batch-size 1024 --max-row-group-size 1024 --max-read-records 1 /dev/stdin myfile.parquet.gz
```
Here, `--max-read-records 1` seems to do the trick. It looks like json2parquet still tries to infer the schema even though I am passing one with `-s`. In that case it has to buffer the entire unzipped stream in memory before it can start writing, which can never succeed here, since the 37 GB uncompressed file is far larger than the 16 GB of available memory.
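A quick way to sanity-check this theory (my reading of the behavior, not verified against the code) is to compare peak memory with and without the flag on the smaller file, assuming GNU time is installed:

```bash
# "Maximum resident set size" in the GNU time report is the peak memory. If
# inference buffers the whole stream, the first run's peak should track the
# input size while the second run's should stay flat.
/usr/bin/time -v bash -c \
  'gunzip -c json2parquet-small.jsonl.gz | json2parquet -s schema.schema /dev/stdin out-a.parquet'
/usr/bin/time -v bash -c \
  'gunzip -c json2parquet-small.jsonl.gz | json2parquet -s schema.schema --max-read-records 1 /dev/stdin out-b.parquet'
```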