apache / arrow-datafusion-python
Apache DataFusion Python Bindings
Home Page: https://datafusion.apache.org/python
License: Apache License 2.0
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We only build Python wheels for macOS on the x86_64 architecture, so we cannot support newer Macs with the M1 chip or later.
Describe the solution you'd like
Build wheels for Mac arm architecture
Describe alternatives you've considered
None
Additional context
None
arrow-rs now has zero-copy conversion to PyArrow for RecordBatchReaders.
We could use this to implement the execute_stream and execute_stream_partitioned methods for DataFrames, similar to the Rust DataFrame.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We do not currently build the Python source distribution manually.
Describe the solution you'd like
Update GitHub workflow to create Python source distribution.
Describe alternatives you've considered
Keep doing it manually
Additional context
None
Describe the bug
See https://github.com/apache/arrow-datafusion-python/actions/runs/3566140311
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.
DataFusion now has a configuration mechanism based on key-value pairs (see https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/config.rs), and it would be good to expose this via the Python bindings.
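The mechanism is essentially a typed key-value store with defaults. A toy Python sketch of the shape such bindings might take (the class name and config keys here are illustrative assumptions, not the actual API):

```python
class ConfigOptions:
    """Toy key-value configuration store in the style of DataFusion's config.rs."""

    def __init__(self):
        # Keys with defaults; these specific keys are examples only.
        self._options = {
            "datafusion.execution.batch_size": 8192,
            "datafusion.execution.coalesce_batches": True,
        }

    def get(self, key):
        return self._options.get(key)

    def set(self, key, value):
        # Reject unknown keys rather than silently creating them.
        if key not in self._options:
            raise KeyError(f"unknown config key: {key}")
        self._options[key] = value


config = ConfigOptions()
config.set("datafusion.execution.batch_size", 4096)
print(config.get("datafusion.execution.batch_size"))  # 4096
```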
I want to add some documentation.
The Rust implementation of DataFrame has write_csv, write_parquet, and write_json, and we need to expose those in Python.
This is not simply a case of exposing the Rust version, because Rust has some logic that replaces approx_median with approx_percentile(expr, 0.5) instead.
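The rewrite is valid because the median is by definition the 50th percentile; a quick stdlib illustration of the equivalence (not DataFusion code):

```python
import statistics

data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]

# statistics.quantiles with n=100 returns the 1st..99th percentiles;
# index 49 is the 50th percentile, which matches the median exactly
# under the "inclusive" method.
p50 = statistics.quantiles(data, n=100, method="inclusive")[49]

print(statistics.median(data), p50)  # 3.0 3.0
```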
We need to wait for apache/datafusion#3063 to be fixed then it will be simple
The manylinux build is failing, and I suspect it is due to an older version of the Maturin Docker image and Rust toolchain being used that is incompatible with Arrow 18.0.0. I'd like to test upgrading to Maturin 0.13.1.
Because of the history of this project, we are missing the usual ASF infrastructure such as RAT tests to ensure that all files contain an ASF header.
I don't know if we should consider rewriting the history so that we fork from DataFusion just before the Python bindings were removed, delete everything else, and re-apply the changes that were made in this repo, or just copy the parts we need from DataFusion.
The Python bindings currently only expose a subset of functionality, and we want to expose as much as possible.
Here is a list of all available Rust methods. Note that there may be reasons why we don't want to expose some of these.
Describe the bug
ctx.read_csv(...) returns an empty dataset
To Reproduce
```python
import datafusion

ctx = datafusion.SessionContext()
blasted = ctx.read_csv(path="/localpath/a.tsv", has_header=True, delimiter="\t", schema_infer_max_records=100)
blasted.show()
```
Expected behavior
A few rows from the CSV should be printed out.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
To enable compression, I would like to customize parquet WriterProperties when using DataFrame.write_parquet.
Describe the solution you'd like
Expose parquet WriterProperties from Python.
Describe alternatives you've considered
Being able to set compression on DataFrame.write_parquet would be sufficient.
Describe the bug
I am trying to install from TestPyPI on an M1 Mac.
```
pip3 install -i https://test.pypi.org/simple/ datafusion==0.7.0
Looking in indexes: https://test.pypi.org/simple/
Collecting datafusion==0.7.0
  Using cached https://test-files.pythonhosted.org/packages/a4/4f/5c588562ec6ab1651659ff35e34c197a7c1eaa7663360f2ea9d7d777547d/datafusion-0.7.0.tar.gz (150 kB)
  Installing build dependencies ... error
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [3 lines of output]
      Looking in indexes: https://test.pypi.org/simple/
      ERROR: Could not find a version that satisfies the requirement maturin<0.14,>=0.11 (from versions: none)
      ERROR: No matching distribution found for maturin<0.14,>=0.11
      [end of output]
```
To Reproduce
pip3 install -i https://test.pypi.org/simple/ datafusion==0.7.0
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.
The documentation cannot be built and some sample code in the documentation cannot be run.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Development in arrow-datafusion-python is currently limited to pip environments. While nothing is wrong with that, lots of people use Conda these days. The problem is that if a developer/maintainer/user has Conda installed, they cannot run maturin develop with both a Python virtual environment and Conda on the path; this confuses maturin and leads to issues.
I think a simple solution is to also include an environment.yml file for Conda developers to use instead of relying on pip. This addition is easy enough, and there is tooling available for generating pip requirements files from conda environment files, which means keeping those environments in sync shouldn't be an issue.
Describe the solution you'd like
Introduce an environment.yml file for creating an arrow-datafusion-python environment, plus supporting documentation. Bonus for maintainer documentation about keeping the pip and conda environment files in sync.
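A minimal sketch of what such a file could contain (the package list and versions are illustrative assumptions, not a vetted environment):

```yaml
# environment.yml (sketch) -- create with: conda env create -f environment.yml
name: arrow-datafusion-python
channels:
  - conda-forge
dependencies:
  - python=3.10
  - maturin
  - pyarrow
  - pytest
  - pip
```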
Describe alternatives you've considered
Open an issue with Maturin to address this. The workaround of having a conda environment file seems worth it, however.
Additional context
None
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
arrow-datafusion recently accepted the donation of arrow-contrib/datafusion-substrait into its main branch. Now that the two repos are formally joined, we should add Python bindings for those new bits. There are effectively three places to write the bindings: producer, consumer, and serializer.
Describe the solution you'd like
Python bindings to interact with the arrow-datafusion substrait module.
Describe alternatives you've considered
None
Additional context
None
Bug description
From pandas I am writing a parquet file (using gzip compression) and am looking to query this file using datafusion. The same file exported as .csv works fine using this library, but the parquet version does not return anything.
Steps to reproduce
```python
import datafusion
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [3, 4, 1, 2, 3], 'col4': [3, 4, 4, 5, 6]})
df.to_csv('df.csv', compression=None)
df.to_parquet('df.pq', compression=None)

ctx = datafusion.SessionContext()
ctx.register_csv(name="example_csv", path="df.csv")
ctx.register_parquet(name="example_pq", path="df.pq")

# test csv
df = ctx.sql("SELECT * FROM example_csv")
result = df.collect()
res = result[0]

# test parquet
df = ctx.sql("SELECT * FROM example_pq")
result = df.collect()
res = result[0]
```
Expected behavior
The same result from both approaches
Additional context
It also seems that gzip-compressed files are not working; I am not sure why. Please consider the following example:
```python
df.to_csv('df.csv.gz')
ctx.register_csv(name="example_csv_gz", path="df.csv.gz")

# test csv
df = ctx.sql("SELECT * FROM example_csv_gz")
result = df.collect()
res = result[0]
```
Hi, I might have missed something, but there's no link on the GitHub repo to the Python documentation, and some quick googling points to https://arrow.apache.org/datafusion/python/index.html which returns "Not Found: The requested URL was not found on this server."
The main GitHub page points to arrow.apache.org/datafusion, but there's no Python section there.
I hope I am not wasting your time by missing something obvious.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
CI runs Python linters and fails the build if formatting is incorrect.
Describe the solution you'd like
Add script that I can run locally to format the Python code
Describe alternatives you've considered
None
Additional context
None
We should update the example in the README to show how to do this as well
Apologies if this isn't the correct venue for discussing this, but I want to discuss versioning for arrow-datafusion-python. With the arrow-datafusion 16.0.0 release in motion, I was thinking about the versioning for an upcoming arrow-datafusion-python release. If the pattern continues, the next version would be 0.8.0. This was confusing to me when first starting with arrow-datafusion-python, since I was expecting the version of the Python bindings to match the version of arrow-datafusion I was using, which isn't the case. So my proposal is: why don't we keep the release versions for arrow-datafusion-python in sync with arrow-datafusion and have the next release of arrow-datafusion-python be 16.0.0?
I believe this helps make it much clearer to the end user which underlying version of DataFusion they are using. If there are bugs in arrow-datafusion-python itself that warrant a release, then we could just append a post-release fix version to the number.
Curious about others' thoughts on this. I know software versioning can almost be a religious discussion at times, and this is just a very coarse layout.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Support for reading compressed CSV/JSON files was recently added to DataFusion; we should add supporting bindings here so that the Python bindings can take advantage of that as well.
Describe the solution you'd like
Update context.rs::register_csv/json(...) to take advantage of the new CsvReadOptions, which allows specifying a file compression type.
Describe alternatives you've considered
None
Additional context
None
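Conceptually, the reader only needs to wrap the byte stream in a decompressor before parsing. A stdlib illustration of the idea (this is not the DataFusion implementation):

```python
import csv
import gzip
import io

# Write a small gzip-compressed CSV, then read it back by layering the
# csv parser over a gzip decompression stream.
raw = "a,b\n1,2\n3,4\n"
buf = io.BytesIO()
with gzip.open(buf, "wt", newline="") as f:
    f.write(raw)

buf.seek(0)
with gzip.open(buf, "rt", newline="") as f:
    rows = list(csv.reader(f))

print(rows)  # [['a', 'b'], ['1', '2'], ['3', '4']]
```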
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We should give more PMC members permission to manage this project on PyPI.
Describe the solution you'd like
PMC members should sign up for an account on the following sites and then post their usernames on this issue.
Describe alternatives you've considered
Additional context
Allow for using a Python fsspec filesystem as an ObjectStore, which would allow DataFusion Python to support any filesystem supported by fsspec. I have this working for an internal project.
Would this contribution be valuable to this project? fsspec supports many different filesystems and cloud providers, so it could make using DataFusion easier where native Rust ObjectStores aren't available yet.
Supported built-in fsspec filesystems
Third-party implementations
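To sketch the idea: an fsspec-backed ObjectStore mainly needs to translate ranged reads onto the filesystem's file API. A toy illustration with an in-memory stand-in (neither fsspec itself nor the real ObjectStore trait is used here; all names are illustrative):

```python
import io


class FsspecLikeStore:
    """Toy wrapper showing the ranged-read call an ObjectStore shim makes.

    `fs` only needs an fsspec-style open(path, mode) method.
    """

    def __init__(self, fs):
        self.fs = fs

    def get_range(self, path, start, length):
        # Seek to the requested offset and read exactly `length` bytes,
        # mirroring ObjectStore's ranged-get semantics.
        with self.fs.open(path, "rb") as f:
            f.seek(start)
            return f.read(length)


class InMemoryFs:
    """Stand-in for an fsspec filesystem, backed by a dict of bytes."""

    def __init__(self, files):
        self.files = files

    def open(self, path, mode="rb"):
        return io.BytesIO(self.files[path])


fs = InMemoryFs({"bucket/data.bin": b"hello world"})
store = FsspecLikeStore(fs)
print(store.get_range("bucket/data.bin", 6, 5))  # b'world'
```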
https://github.com/apache/arrow-datafusion/blob/master/datafusion/expr/src/built_in_function.rs
https://github.com/apache/arrow-datafusion/blob/master/datafusion/expr/src/aggregate_function.rs
https://github.com/apache/arrow-datafusion/blob/master/datafusion/expr/src/window_function.rs
https://github.com/apache/arrow-datafusion-python/blob/master/src/functions.rs
Describe the bug
I am unable to display the contents of delta tables stored locally
To Reproduce
```toml
[tool.poetry.dependencies]
python = "^3.10"
datafusion = "^0.7.0"
deltalake = "^0.6.4"
```
then run the following code:
```python
import pyarrow as pa
import pyarrow.dataset as ds
from deltalake import DeltaTable
import datafusion

ctx = datafusion.SessionContext()
delta_table = DeltaTable("/local_delta_path/")
pa_dataset = delta_table.to_pyarrow_dataset()  # original snippet used `dt`, which is undefined
ctx.register_dataset("pa_dataset", pa_dataset)
tmp = ctx.sql("SELECT * FROM pa_dataset limit 10")
tmp.show()
```
When executed in a notebook in VS Code, this script can run for more than 20 minutes, and I am unable to interrupt the execution.
Expected behavior
Top rows displayed
Describe the bug
For example https://github.com/martin-g/arrow-datafusion-python/actions/runs/3591220901 produced:
```
generate-license
Node.js 12 actions are deprecated. For more information see: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/. Please update the following actions to use Node.js 16: actions/checkout@v2, actions-rs/toolchain@v1, actions/upload-artifact@v2
generate-license
The `set-output` command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/
Manylinux
Node.js 12 actions are deprecated. For more information see: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/. Please update the following actions to use Node.js 16: actions/checkout@v2, actions/download-artifact@v2, actions/upload-artifact@v2
Manylinux
The `set-output` command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/
...
```
To Reproduce
Check the Summary of any workflow in GitHub Actions.
Expected behavior
0 or close to 0 warnings.
Additional context
Describe the bug
The test works for me locally but fails in CI:
```
df = <datafusion.DataFrame object at 0x7ffbd80ccc70>

def test_window_lead(df):
    df = df.select(
        column("a"),
        f.alias(
            f.window(
                "lead", [column("b")], order_by=[f.order_by(column("b"))]
            ),
            "a_next",
        ),
    )

    table = pa.Table.from_batches(df.collect())

    expected = {"a": [1, 2, 3], "a_next": [5, 6, None]}
>   assert table.to_pydict() == expected
E   AssertionError: assert {'a': [3, 1, ... [None, 5, 6]} == {'a': [1, 2, ... [5, 6, None]}
E   Differing items:
E   {'a_next': [None, 5, 6]} != {'a_next': [5, 6, None]}
E   {'a': [3, 1, 2]} != {'a': [1, 2, 3]}
E   Full diff:
E   - {'a': [1, 2, 3], 'a_next': [5, 6, None]}
E   ? --- ------
E   + {'a': [3, 1, 2], 'a_next': [None, 5, 6]}
E   ? +++ ++++++
```
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.
The Stack Overflow question https://stackoverflow.com/questions/73015779/how-to-persist-datafusion-dataframe-in-memory asks how to persist a DataFrame in memory. It would be nice to have an easy way to do this from the Python API.
DataFusion has support for MemTable which could be used to achieve this.
Describe the bug
```
_____________________ ERROR collecting test session _____________________
venv/lib/python3.8/site-packages/_pytest/config/__init__.py:607: in _importconftest
    mod = import_path(conftestpath, mode=importmode, root=rootpath)
venv/lib/python3.8/site-packages/_pytest/pathlib.py:556: in import_path
    raise ImportPathMismatchError(module_name, module_file, path)
E   _pytest.pathlib.ImportPathMismatchError: ('datafusion.tests.conftest', '/home/andy/git/apache/arrow-datafusion-python/datafusion/tests/conftest.py', PosixPath('/home/andy/git/apache/arrow-datafusion-python/target/package/datafusion-python-0.7.0/datafusion/tests/conftest.py'))
```
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.
These methods already exist in Rust and need exposing in Python
Implement PyArrow Dataset TableProvider to support reading from a PyArrow Dataset with filter and projection push down.
Describe the bug
I'm working on getting datafusion added to db-benchmark (#147). While putting the benchmarks together, I came across an error in the join benchmark that I wasn't expecting. Specifically, the error is:
```
Traceback (most recent call last):
  File "datafusion/join-datafusion.py", line 72, in <module>
    df = ctx.create_dataframe([ans])
Exception: DataFusion error: Plan("Schema contains duplicate qualified field name 'ce9f0daee780e4f2796b9953bd267457c.id1'")
```
The test code that produced that is here:
```python
question = "small inner on int"  # q1
gc.collect()

t_start = timeit.default_timer()
ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
shape = ans_shape(ans)
print(shape)
t = timeit.default_timer() - t_start

t_start = timeit.default_timer()
df = ctx.create_dataframe([ans])
chk = df.aggregate([], [f.sum(col("v1"))]).collect()[0].column(0)[0]
chkt = timeit.default_timer() - t_start

m = memory_usage()
write_log(task=task, data=data_name, in_rows=x_data.num_rows, question=question, out_rows=shape[0], out_cols=shape[1], solution=solution, version=ver, git=git, fun=fun, run=1, time_sec=t, mem_gb=m, cache=cache, chk=make_chk([chk]), chk_time_sec=chkt, on_disk=on_disk)
del ans
gc.collect()
```
If I update the SQL to:
```sql
SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1
```
I get:
```
Traceback (most recent call last):
  File "datafusion/join-datafusion.py", line 73, in <module>
    df = ctx.create_dataframe([ans])
Exception: DataFusion error: Plan("Schema contains duplicate qualified field name 'cb53bcf8886f449c3bd2651571df185d4.id4'")
```
To me this looks like a bug, as I think I should be able to write the query without having to alias the overlapping columns (when I alias the overlapping columns, it works). For example, below is the equivalent Spark query:
```sql
select * from x join small using (id1)
```
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
I should be able to run either of the following:
```python
ans = ctx.sql("SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1").collect()
df = ctx.create_dataframe([ans])

ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
df = ctx.create_dataframe([ans])
```
Additional context
Add any other context about the problem here.
In the How to develop section (https://github.com/apache/arrow-datafusion-python#how-to-develop) there is a reference to the old site datafusion-contrib/datafusion-python.git.
Describe the bug
Executing maturin build or maturin develop hangs on Linux ARM64 (openEuler 22.03 LTS).
To Reproduce
Steps to reproduce the behavior:
maturin develop
Expected behavior
The build should pass successfully.
Additional context
N/A
Describe the bug
The directory should be arrow-datafusion-python-0.7.0, not arrow-datafusion-0.7.0:
```
A arrow/arrow-datafusion-0.7.0/apache-arrow-datafusion-python-0.7.0.tar.gz
A arrow/arrow-datafusion-0.7.0/apache-arrow-datafusion-python-0.7.0.tar.gz.asc
A arrow/arrow-datafusion-0.7.0/apache-arrow-datafusion-python-0.7.0.tar.gz.sha256
A arrow/arrow-datafusion-0.7.0/apache-arrow-datafusion-python-0.7.0.tar.gz.sha512
```
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Hi @andygrove,
First of all I wanted to thank you for creating and open-sourcing datafusion - I have been following the project for some time and find it super exciting to see the next-gen data engineering infrastructure being built on top of Arrow!
I have no experience with Rust but quite some with writing applications in Python & PySpark so I thought I could contribute to the Python language bindings.
Recently I saw @francis-du added lots of functions to the Python package with #73 - thanks for the significant effort in improving the package!
To improve the test coverage, I've opened pull request #129 with a few unit tests for the built-in functions. Hope the tests are helpful, look forward to getting some feedback.
Describe the solution you'd like
Improve test coverage by adding more unit tests.
Describe alternatives you've considered
n/a
Additional context
n/a
Currently the build is failing due to the rust-toolchain file that is checked in:
```
error: failed to select a version for the requirement `parquet = "^18.0.0"`
candidate versions found which didn't match: 15.0.0, 14.0.0, 13.0.0, ...
```
Removing the rust-toolchain file fixes the issue. Do we need this file checked in? The original repo does not have it checked in.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We should release a new version based on DataFusion 16/17/18.
Describe the solution you'd like
Describe alternatives you've considered
None
Additional context
None
I would like to be able to query data on S3 with datafusion-python and the new object_store crate. I haven't had the chance to play with that crate yet, but once I do I would like to get the necessary functionality added here.