
csv2parquet's Introduction

I like all things data. I've worked with data my whole career, from things that would fit in a Tweet all the way up to multi-petabyte data warehouses running in the cloud.

Currently, I'm a founder at SyncWith. We want anyone to be able to get their data into Google Sheets and Looker Studio.

As a hobby, I'm dabbling with making small datasets much more accessible via SQLite and Datasette plugins.

You can find me on the web at cldellow.com or as @cldellow on Hachyderm.

csv2parquet's People

Contributors

cldellow, dazzag24

csv2parquet's Issues

Append to existing parquet file?

Hi,

Is it possible to run csv2parquet so that it appends to an existing parquet file? I have a large number of CSV files with a fixed schema that I'd like to convert into a single parquet file.

Thanks

support excluding/including columns

Sometimes a CSV has extra columns that you don't really care about. It would be nice to have a facility to skip those. By default, all columns should be included. Once you use any --include-XXX flag, the default flips to all columns being excluded. It is an error to mix --include-XXX and --exclude-XXX.

eg:

csv2parquet --exclude-0 would include all columns except the first.

csv2parquet --exclude-age would include all columns except the one called age. If #4 is done, it refers to the final name after transformations.

csv2parquet --include-0 would include only the first column.

csv2parquet --include-age --include-name would include only the columns named age and name.
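The selection rules above could be sketched roughly as follows. This is only an illustration, not csv2parquet's implementation: the function name `select_columns` and the rule that an all-digit selector is a 0-based index (so a column literally named `0` cannot be selected by name) are assumptions.

```python
def select_columns(header, includes=(), excludes=()):
    """Return a keep-mask (list of bool), one entry per column in `header`.

    `includes` / `excludes` hold the XXX part of hypothetical
    --include-XXX / --exclude-XXX flags. Assumption: a selector made only
    of digits is a 0-based column index, anything else is a column name.
    """
    if includes and excludes:
        raise ValueError("cannot mix --include-XXX and --exclude-XXX")

    def resolve(selector):
        # Map a selector to a column index, failing loudly on a bad one.
        if selector.isdigit():
            index = int(selector)
            if index >= len(header):
                raise ValueError(f"no column at index {index}")
            return index
        if selector not in header:
            raise ValueError(f"no column named {selector!r}")
        return header.index(selector)

    if includes:
        # Any include flag flips the default to "everything excluded".
        chosen = {resolve(s) for s in includes}
        return [i in chosen for i in range(len(header))]

    # No include flags: keep everything except the excluded columns.
    dropped = {resolve(s) for s in excludes}
    return [i not in dropped for i in range(len(header))]
```

For a header of `["name", "age"]`, `select_columns(header, excludes=["0"])` keeps only `age`, and passing both `includes` and `excludes` raises a `ValueError`.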

support explicitly providing names via `--rename-N foo` or `--rename-NAME foo`

Renaming is entirely optional; by default, use the name from the first row. eg, if you have a CSV like:

name,age
Colin,32

then csv2parquet --rename-1 age_in_years should create a parquet where the age column is called age_in_years

Similarly, csv2parquet --rename-age age_in_years should work. When renaming based on name, we should always use the original set of columns. eg chaining things like csv2parquet --rename-age age_in_years --rename-age_in_years age-in-years should fail.
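One way to get the "always use the original set of columns" behaviour is to resolve every selector against the untouched header and write the new names into a copy. The sketch below assumes a hypothetical `apply_renames` helper and, as in the example above, treats an all-digit selector as a 0-based index.

```python
def apply_renames(header, renames):
    """Return a copy of `header` with `renames` applied.

    `renames` maps a selector (a 0-based index given as a string, or an
    original column name) to the new name. Selectors are always resolved
    against the original header, so a chained rename fails.
    """
    result = list(header)
    for selector, new_name in renames.items():
        if selector.isdigit() and int(selector) < len(header):
            index = int(selector)
        elif selector in header:
            index = header.index(selector)
        else:
            # Covers the chained case: 'age_in_years' is not an original name.
            raise ValueError(f"no column matching {selector!r}")
        result[index] = new_name
    return result
```

With the CSV above, `apply_renames(["name", "age"], {"1": "age_in_years"})` and `apply_renames(["name", "age"], {"age": "age_in_years"})` both give `["name", "age_in_years"]`, while adding a second rename keyed on `age_in_years` raises a `ValueError`.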

sanitize column names

eg the csv at http://www.elections.ca/fin/oda/od_cntrbtn_audt_e.zip results in a parquet file that sqlite-parquet-vtable can't read. (separately: should figure out if it's the embedded space or the forward slash and probably file a bug on sqlite-parquet-vtable)

Instead, let's sanitize column names to ensure they can be used as identifiers without quoting. Lower case it, replace anything other than [a-z0-9] with _ then collapse runs of underscore to a single character.

Provide --raw-names to indicate that the user doesn't wish for this sanitization to happen.

Don't sanitize explicitly provided column names, eg via --name-0 "weird name"
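The sanitization rule described above (lower-case, replace anything outside [a-z0-9] with an underscore, collapse runs of underscores) is small enough to sketch directly. The function name is hypothetical, and the sketch does not deal with results that start with a digit or that collide with another column's sanitized name, both of which a real implementation would probably need to consider.

```python
import re


def sanitize_column_name(name):
    """Make `name` usable as an unquoted identifier, per the rule above."""
    lowered = name.lower()
    # Replace every character outside [a-z0-9] with an underscore.
    replaced = re.sub(r"[^a-z0-9]", "_", lowered)
    # Collapse runs of underscores into a single underscore.
    return re.sub(r"_+", "_", replaced)
```

For instance, a header with an embedded space and a forward slash such as `First/Last Name` becomes `first_last_name`.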

parallelize ingestion

Unsure if this is possible due to the GIL, but can we get a speed boost from having multiple threads/processes reading the files, parsing a row group's worth of data and sending it to the main process to add to the arrow structures?

Even allowing for the IPC overhead, CSV parsing may be so inefficient that this is worth doing. Try to get a rough estimate of the benefit we'd get before spending too much time on this - maybe using the Elections Canada donor set and the Census set.

The cost of type conversions may also be relevant to consider (eg parsing a string is pretty cheap), so maybe visit this after we have types

upgrade to pyarrow 0.10

I think this drags in a version of parquet-cpp with lz4 support, which would be nice to expose

AttributeError: module 'pyarrow' has no attribute 'Column'

Thanks for this useful tool!

Not sure if this is an error or I am doing something wrong?

Created a new virtualenv
pipenv shell

Installed requirements
pipenv install --skip-lock pyarrow csv2parquet

Attempted to convert simple csv using simplest invocation as I want all rows and columns

> csv2parquet a.csv
Traceback (most recent call last):
  File "/home/xxx/.local/share/virtualenvs/csv2parquet-tyfa8dfH/bin/csv2parquet", line 8, in <module>
    sys.exit(main())
  File "/home/xxx/.local/share/virtualenvs/csv2parquet-tyfa8dfH/lib/python3.6/site-packages/csv2parquet/csv2parquet.py", line 247, in main
    main_with_args(convert, sys.argv[1:])
  File "/home/xxx/.local/share/virtualenvs/csv2parquet-tyfa8dfH/lib/python3.6/site-packages/csv2parquet/csv2parquet.py", line 244, in main_with_args
    args.type)
  File "/home/xxx/.local/share/virtualenvs/csv2parquet-tyfa8dfH/lib/python3.6/site-packages/csv2parquet/csv2parquet.py", line 157, in convert
    for x in range(len(fields)) if keep[x]]
  File "/home/xxx/.local/share/virtualenvs/csv2parquet-tyfa8dfH/lib/python3.6/site-packages/csv2parquet/csv2parquet.py", line 157, in <listcomp>
    for x in range(len(fields)) if keep[x]]
AttributeError: module 'pyarrow' has no attribute 'Column'

python --version
Python 3.6.8

more Pipfile

[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]

[packages]
pyarrow = "*"
csv2parquet = "*"

[requires]
python_version = "3.6"

support explicitly providing types via `--type-N x` or `--type-NAME x`

Providing types is entirely optional; by default, every column is a string. eg, if you have a CSV like:

name,age
Colin,32

then csv2parquet --type-1 int8 should create a parquet where the age column is an int.

Use the names from https://arrow.apache.org/docs/python/api.html#type-and-schema-factory-functions

Sometimes CSVs are messy - permit appending a ? to a type to indicate that values that can't be converted should be ignored. Print out a sampling of ignored values after running, eg given:

name,age
Colin,32
Jenn,unknown

then csv2parquet --type-1 int8? should succeed, but warn that an entry was ignored on line 3.

To make this easier for a human to work with, it should also be possible to refer to the column by its name, eg csv2parquet --type-age int8 should work. Note that if #4 is done, the final name should be the one that is used to look up the column.
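The ? suffix could be handled along these lines. This is a sketch under several assumptions: the helper names are made up, only a handful of type names are mapped (to plain Python parsers rather than Arrow types), integer range limits such as int8's are not checked, and an ignored value is stored as None.

```python
def parse_type_spec(spec):
    """Split a spec such as 'int8?' into (type_name, lenient)."""
    if spec.endswith("?"):
        return spec[:-1], True
    return spec, False


def convert_column(values, spec, first_line=2):
    """Convert the string `values` of one column according to `spec`.

    Returns (converted, ignored), where `ignored` lists (line, raw_value)
    pairs for entries that could not be converted. Line numbers assume the
    header is on line 1, so data starts on line 2.
    """
    type_name, lenient = parse_type_spec(spec)
    # Assumption: a tiny stand-in for the real Arrow type factories.
    parsers = {"int8": int, "int64": int, "float64": float, "string": str}
    parse = parsers[type_name]

    converted, ignored = [], []
    for line, raw in enumerate(values, start=first_line):
        try:
            converted.append(parse(raw))
        except ValueError:
            if not lenient:
                raise  # strict type: a bad value is an error
            converted.append(None)  # lenient type: keep a null instead
            ignored.append((line, raw))
    return converted, ignored
```

For the age column in the second CSV above, `convert_column(["32", "unknown"], "int8?")` returns the converted values plus a record that `unknown` on line 3 was ignored, which is what the warning would be built from; with plain `int8` the same input raises a `ValueError`.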
