
arrow-tools's Introduction

Arrow CLI Tools


A collection of handy CLI tools to convert CSV and JSON to Apache Arrow and Parquet.

This repository contains five projects, including csv2arrow, csv2parquet, and json2parquet.

For usage examples, see the csv2parquet examples.

Homebrew formulas are pushed to domoritz/homebrew-tap for every release.

arrow-tools's People

Contributors

corneliusroemer, dependabot[bot], domoritz, jupiter, lsh


arrow-tools's Issues

Consolidated binary releases

Hi @domoritz

As others have mentioned, these tools are really powerful. Thanks for the great work. I'd like to add these to the scoop repositories. Scoop is a very convenient (best, IMHO) Windows package manager. I could add each of these tools as a separate package, but because their version numbers seem to be in sync, it would be ideal if there was a consolidated .zip file in the Releases that contains these tools.

Would you consider adding a single .zip file to the releases for upcoming versions? If you're OK with the principle, I could even try a new GitHub workflow myself, although I have limited experience with them.
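For reference, producing the consolidated artifact could be as simple as the sketch below; the exact binary set (beyond the tools named on this page) and how it would be wired into the release workflow are assumptions:

# bundle the per-target release binaries into a single archive
zip arrow-tools-x86_64-pc-windows-msvc.zip csv2arrow.exe csv2parquet.exe json2parquet.exe   # ...plus the remaining binaries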

Reading from stdin doesn't work

I tried the new releases but got an error.

> csv2arrow data/simple.csv -n
Schema:

{
  "fields": [
    {
      "name": "a",
      "data_type": "Int64",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "b",
      "data_type": "Boolean",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}
> cat data/simple.csv | csv2arrow /dev/stdin -n
Error: SchemaError("Error inferring schema: Io error: Seeking outside of buffer, please report to https://github.com/domoritz/arrow-tools/issues/new")

I am on macOS.

When creating/inferring schema only, do not buffer stdin

It's safest to infer the schema on the entire dataset.

When the dataset is larger than RAM, this is currently not possible via stdin as the implementation in #10 and #13 stores everything that's used for inference in memory.

In practice, one could stream the dataset via stdin twice: first time to get the schema, second time to convert.

This needs some internal changes to not buffer when options are set to infer schema only.
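If schema-only runs stopped buffering, that two-pass flow could look roughly like the sketch below. It assumes the JSON printed by the -n dry run can be saved (minus the leading "Schema:" banner) into a file that --schema-file accepts, and that a small --max-read-records keeps the second pass from buffering for inference; neither is confirmed here.

# pass 1: stream the whole dataset once, only to infer and print the schema (-n is a dry run, nothing is written)
gunzip -c large.csv.gz | csv2parquet -n /dev/stdin out.parquet
# save the printed JSON (without the "Schema:" line) as schema.json

# pass 2: stream the dataset again and convert with the saved schema
gunzip -c large.csv.gz | csv2parquet --schema-file schema.json --max-read-records 1 /dev/stdin out.parquet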

Add example to documentation

Hi there!

I was streaming very large files from curl, converting them to parquet and sending them back to s3 via Linux piping (without storing anything on disk) and I had some trouble figuring out how to pipe input through these tools.

For example, I tried cat file | json2parquet >> out and got an error, and cat file | json2parquet - myout.parquet also gave an error.

Eventually I figured out that what you need to do is cat file | json2parquet /dev/stdin /dev/stdout | gzip -c >> myparquet.parquet.gz. I figured this out by doing a deep dive into previous commits and issues like #3. An example would have been a great time saver!
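For the end-to-end pipeline described above (stream from curl, convert, upload to S3 without touching disk), a sketch could look like the following; the URL, bucket name, and the use of aws s3 cp - to upload from stdin are illustrative assumptions rather than anything from this issue:

# stream JSON from an HTTP source, convert on the fly, and pipe the Parquet bytes straight to S3
curl -sS https://example.com/data.jsonl | json2parquet /dev/stdin /dev/stdout | aws s3 cp - s3://my-bucket/data.parquet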

In any case, I was thinking that adding a small use-case example to the documentation would be useful for people who, like me, are not so familiar with piping in Linux. It took me 30 minutes to figure out, although looking back it is just elementary Linux knowledge.

Happy to contribute the changes myself if the maintainers are on board!

Cheers,
Felix

How to deal with the tsv file

Thanks for the nice tools. Could you please give me some suggestions on how to deal with TSV files?
I ran the command below:

csv2parquet --header true -d '\t'  -p exprMatrix.tsv exprMatrix.parquet

The error message is:
error: invalid value '\t' for '--delimiter ': too many characters in string
For more information, try '--help'.

So, how do I use tab as the delimiter?
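A hedged fix, assuming a bash or zsh shell: ANSI-C quoting passes a single literal tab character instead of the two-character string \t, which is what --delimiter rejects.

# $'\t' expands to an actual tab before csv2parquet sees it
csv2parquet --header true -d $'\t' -p exprMatrix.tsv exprMatrix.parquet

In a plain POSIX shell, -d "$(printf '\t')" achieves the same thing.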

Timestamp mapping

Not sure I am doing this right, but I am trying to convert a CSV containing some timestamps to a Parquet file.

Sample CSV

072e4a64-2ffb-437c-9458-4953abaa7a20,1,2023-01-18 23:05:10,104,-1,0
072e4a64-2ffb-437c-9458-4953abaa7a20,2,2023-01-18 23:05:10,104,-1,0
072e4a64-2ffb-437c-9458-4953abaa7a20,4,2023-01-18 23:05:10,104,-1,0
  1. First, the schema is generated with the csv2parquet --max-read-records 5 -p option. It correctly infers the timestamp field
    {
      "name": "ts",
      "data_type": {
        "Timestamp": [
          "Second",
          null
        ]
      },
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
  2. Then I do the actual conversion:

csv2parquet --header false --schema-file mt_status.json /dev/stdin mt_status.parquet

  3. Then I try to open the table using duckdb, and I can see all the records, but the timestamp field shows as Int64:
┌──────────────────────────────────────┬───────┬────────────┬──────────┬────────┬───────────┐
│                 guid                 │  st   │     ts     │ tsmillis │ result │ synthetic │
│               varchar                │ int16 │   int64    │  int16   │ int16  │   int16   │
├──────────────────────────────────────┼───────┼────────────┼──────────┼────────┼───────────┤
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     1 │ 1674083110 │      104 │     -1 │         0 │
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     2 │ 1674083110 │      104 │     -1 │         0 │
│ 072e4a64-2ffb-437c-9458-4953abaa7a20 │     4 │ 1674083110 │      104 │     -1 │         0 │
  4. And the Parquet schema also shows the field as an Int64:

│ mt_status.parquet │ ts │ INT64 │ │ REQUIRED │ │ │ │ │ │ │
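(A hedged aside, not something resolved in this issue: since the stored values are epoch seconds, the Int64 column can at least be converted on read in DuckDB, whose to_timestamp() interprets a number as Unix epoch seconds.)

echo "SELECT to_timestamp(ts) AS ts, guid, st, tsmillis, result, synthetic FROM 'mt_status.parquet';" | duckdb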

Any hints?
Thanks

Convert jsons inside a folder

Hello,

Is there a way to convert all JSON files in a folder to Parquet? Instead of entering the JSON files one by one, I would like to select all files in the folder at once:

json2parquet c:\test\*.json test.PARQUET
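json2parquet takes a single input file per invocation, so one hedged workaround is a shell loop over the folder (a Unix-style sketch; the same idea works in PowerShell on Windows). Each JSON file becomes its own Parquet file; merging them into one test.PARQUET afterwards would need another tool.

# convert every .json file in the folder to a .parquet file next to it
for f in /path/to/test/*.json; do
  json2parquet "$f" "${f%.json}.parquet"
done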

json2parquet Parse json get error infer schema

I created a small jsonl file called data.json. It has the three lines from the README.md:

data.json

{ "a": 42, "b": true }
{ "a": 12, "b": false }
{ "a": 7, "b": true }
json2parquet ./data.json ./data.parquet

Error: General("Error inferring schema: Json error: Not valid JSON: expected value at line 1 column 1")

Also, could you provide an example or point to docs on how to create the Arrow JSON schema file? Thanks.
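On the schema question: a hedged example of what a schema file for this data could look like, mirroring the JSON that the tools print with -n earlier on this page (whether --schema-file accepts exactly this serialization is an assumption):

{
  "fields": [
    { "name": "a", "data_type": "Int64", "nullable": true, "dict_id": 0, "dict_is_ordered": false, "metadata": {} },
    { "name": "b", "data_type": "Boolean", "nullable": true, "dict_id": 0, "dict_is_ordered": false, "metadata": {} }
  ],
  "metadata": {}
}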

csv2parquet failure

For the latest release (v0.18.0) installed via cargo, and when building from the latest master (last commit June 1, 2024), I can't get csv2parquet to generate data in the Parquet file. Here's a trivial example:

simple.csv:
a,b
1,a
2,b
3,c
4,d

Running csv2parquet simple.csv simple.parquet results in an output file that has the schema but no data. Running csv2parquet -n simple.csv simple.parquet does autodetect a schema and print it out correctly. Using pqrs (installed via cargo) to inspect the file with pqrs schema simple.parquet shows the schema, but there is no actual data in the Parquet file. The same pattern happens with the real, large CSV files we were experimenting with.
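To double-check that the file really contains zero rows rather than this being a display quirk, pqrs (already used above) can dump the contents; the cat subcommand is assumed to be available in the installed version.

# prints every record in the file; empty output would confirm the data is missing
pqrs cat simple.parquet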

What am I doing wrong? Or is this a real bug in the release? I see the same thing when I check out the v0.17.0 release commit hash. This is all on amazonlinux2 (which I know is an older platform that is hard to support).

I also tried this with ubuntu 24.04 LTS and got the same behavior.

json2parquet hangs and gets killed when writing large files

Hi there! First of all, thank you for the tooling; it's incredibly powerful. I have been using json2parquet to process some intricate .jsonl files. I have had a good time with small files, but not with the larger ones. On my EC2 instance with 16 GB of RAM, json2parquet sometimes stalls and never finishes writing a Parquet file, or the process gets killed.

Steps to reproduce
I created a very simplified example that should be easy to reproduce, with a schema, a file that works, and a file that doesn't. Unfortunately, it only happens with large files.

curl -OJ https://felixh-shareables.s3.us-west-2.amazonaws.com/schema.schema
curl -OJ https://felixh-shareables.s3.us-west-2.amazonaws.com/json2parquet-small.jsonl.gz # 80 MB compressed. 2GB uncompressed
curl -OJ https://felixh-shareables.s3.us-west-2.amazonaws.com/json2parquet-large.jsonl.gz # 1.1 GB compressed, 37 GB uncompressed

gunzip -c json2parquet-small.jsonl.gz | json2parquet -s schema.schema /dev/stdin json2parquet-small.parquet
# success - I can read the output to duckdb and confirm it works. 

gunzip -c json2parquet-large.jsonl.gz | json2parquet -s schema.schema /dev/stdin json2parquet-large.parquet
# Error. Killed
gunzip -c json2parquet-large.jsonl.gz | json2parquet -s schema.schema /dev/stdin /dev/stdout >| myout.parquet
# Hangs, see results from htop

# breaking up the steps
gunzip json2parquet-large.jsonl.gz
json2parquet json2parquet-large.jsonl json2parquet-large.parquet
# here, it hangs for a while. It creates a parquet file of 0 bytes. 
# After a long time, htop looks like in the image below. And then at some point the process gets killed.

Results from htop: (screenshot attached to the issue)

I have let these processes hang for an hour.

This happens regardless of whether I request compressed output or not.

Is there anything I am missing in how I am using the tooling? The only thing I can think of is --data-page-size-limit, --write-batch-size, and --max-row-group-size. If you have time, please let me know what could be going wrong. Thanks!

Update:

json2parquet json2parquet-large.jsonl json2parquet-large.parquet -s schema.schema --data-page-size-limit 1024 --write-batch-size 1024 --max-row-group-size 1024 -c gzip

The command above works; however, piping from the compressed file still fails:

gunzip -c 1gb.jsonl.gz | json2parquet -s schema.schema -c gzip --data-page-size-limit 1024 --write-batch-size 1024 --max-row-group-size 1024 /dev/stdin myfile.parquet.gz

I would prefer not to unpack the files in memory, as they can be very large; this one is already 37 GB.

Update II:
OK I think I figured it out...

gunzip -c 1gb.jsonl.gz | json2parquet -s schema.schema -c gzip --data-page-size-limit 1024 --write-batch-size 1024 --max-row-group-size 1024 --max-read-records 1  /dev/stdin myfile.parquet.gz

Here, --max-read-records 1 seems to do the trick. It seems that, although I am passing a schema, it still tries to infer one, so it has to buffer the entire unzipped stream in memory, which will never succeed since the 37 GB uncompressed file is far larger than the 16 GB of memory.
