apache / arrow-datafusion-python
Apache DataFusion Python Bindings
Home Page: https://datafusion.apache.org/python
License: Apache License 2.0
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We only build Python wheels for macOS on the x86_64 architecture, so we cannot support newer Macs with the M1 chip or later.
Describe the solution you'd like
Build wheels for Mac arm architecture
Describe alternatives you've considered
None
Additional context
None
arrow-rs now has zero-copy conversion to PyArrow for RecordBatchReaders.
We could use this to implement the execute_stream and execute_stream_partitioned methods for DataFrames, similar to the Rust DataFrame.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We do not currently build the Python source distribution manually.
Describe the solution you'd like
Update GitHub workflow to create Python source distribution.
Describe alternatives you've considered
Keep doing it manually
Additional context
None
Describe the bug
See https://github.com/apache/arrow-datafusion-python/actions/runs/3566140311
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.
DataFusion now has a configuration mechanism based on key-value pairs (see https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/config.rs), and it would be good to expose this via the Python bindings.
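The mechanism is essentially a typed key-value store with defaults. A toy Python sketch of the shape such bindings might take (the class name and config keys here are illustrative assumptions, not the actual API):

```python
class ConfigOptions:
    """Toy key-value configuration store in the style of DataFusion's config.rs."""

    def __init__(self):
        # Keys with defaults; these specific keys are examples only.
        self._options = {
            "datafusion.execution.batch_size": 8192,
            "datafusion.execution.coalesce_batches": True,
        }

    def get(self, key):
        return self._options.get(key)

    def set(self, key, value):
        # Reject unknown keys rather than silently creating them.
        if key not in self._options:
            raise KeyError(f"unknown config key: {key}")
        self._options[key] = value


config = ConfigOptions()
config.set("datafusion.execution.batch_size", 4096)
print(config.get("datafusion.execution.batch_size"))  # 4096
```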
I want to add some documentation.
The Rust implementation of DataFrame has write_csv, write_parquet, and write_json, and we need to expose those in Python.
This is not simply a case of exposing the Rust version, because Rust has some logic that replaces approx_median with approx_percentile(expr, 0.5) instead.
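The rewrite is valid because the median is by definition the 50th percentile; a quick stdlib illustration of the equivalence (not DataFusion code):

```python
import statistics

data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]

# statistics.quantiles with n=100 returns the 1st..99th percentiles;
# index 49 is the 50th percentile, which matches the median exactly
# under the "inclusive" method.
p50 = statistics.quantiles(data, n=100, method="inclusive")[49]

print(statistics.median(data), p50)  # 3.0 3.0
```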
We need to wait for apache/datafusion#3063 to be fixed then it will be simple
The manylinux build is failing, and I suspect it is due to an older version of the Maturin Docker image and Rust toolchain being used that is incompatible with Arrow 18.0.0. I'd like to test upgrading to Maturin 0.13.1.
Because of the history of this project, we are missing the usual ASF infrastructure such as RAT tests to ensure that all files contain an ASF header.
I don't know if we should consider rewriting the history so that we fork from DataFusion just before the Python bindings were removed, delete everything else, and re-apply the changes that were made in this repo, or just copy the parts we need from DataFusion.
The Python bindings currently only expose a subset of functionality, and we want to expose as much as possible.
Here is a list of all available Rust methods. Note that there may be reasons why we don't want to expose some of these.
Describe the bug
ctx.read_csv(...) returns an empty dataset
To Reproduce
```python
import datafusion

ctx = datafusion.SessionContext()
blasted = ctx.read_csv(path="/localpath/a.tsv", has_header=True, delimiter="\t", schema_infer_max_records=100)
blasted.show()
```
Expected behavior
A few rows from the CSV should be printed out.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
To enable compression, I would like to customize parquet WriterProperties when using DataFrame.write_parquet.
Describe the solution you'd like
Expose parquet WriterProperties from Python.
Describe alternatives you've considered
Being able to set compression on DataFrame.write_parquet would be sufficient.
Describe the bug
I am trying to install from TestPyPI on an M1 Mac.
```
pip3 install -i https://test.pypi.org/simple/ datafusion==0.7.0
Looking in indexes: https://test.pypi.org/simple/
Collecting datafusion==0.7.0
  Using cached https://test-files.pythonhosted.org/packages/a4/4f/5c588562ec6ab1651659ff35e34c197a7c1eaa7663360f2ea9d7d777547d/datafusion-0.7.0.tar.gz (150 kB)
  Installing build dependencies ... error
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [3 lines of output]
      Looking in indexes: https://test.pypi.org/simple/
      ERROR: Could not find a version that satisfies the requirement maturin<0.14,>=0.11 (from versions: none)
      ERROR: No matching distribution found for maturin<0.14,>=0.11
      [end of output]
```
To Reproduce
pip3 install -i https://test.pypi.org/simple/ datafusion==0.7.0
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.
The documentation cannot be built and some sample code in the documentation cannot be run.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Development in arrow-datafusion-python is currently limited to pip environments. While nothing is wrong with that, lots of people use Conda these days. The problem is that if a developer/maintainer/user has Conda installed, they cannot run maturin develop with both a Python virtual environment and Conda on the path; this confuses maturin and leads to issues.
I think a simple solution is to also include an environment.yml file for Conda developers to use instead of relying on pip. This addition is easy enough, and there is tooling available for generating pip requirements files from conda environment files, which means keeping those environments in sync shouldn't be an issue.
Describe the solution you'd like
Introduce an environment.yml file for creating an arrow-datafusion-python environment, plus supporting documentation. Bonus for maintainer documentation about keeping the pip and conda environment files in sync.
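A minimal sketch of what such a file could contain (the package list and versions are illustrative assumptions, not a vetted environment):

```yaml
# environment.yml (sketch) -- create with: conda env create -f environment.yml
name: arrow-datafusion-python
channels:
  - conda-forge
dependencies:
  - python=3.10
  - maturin
  - pyarrow
  - pytest
  - pip
```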
Describe alternatives you've considered
Open an issue with Maturin to address this. The workaround of having a conda environment file seems worth it, however.
Additional context
None
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
arrow-datafusion recently accepted the donation of arrow-contrib/datafusion-substrait into its main branch. Now that the two repos are formally joined, we should add Python bindings for those new bits. There are effectively three places to write the bindings: producer, consumer, and serializer.
Describe the solution you'd like
Python bindings to interact with the arrow-datafusion substrait module.
Describe alternatives you've considered
None
Additional context
None
Bug description
From pandas I am writing a parquet file (using gzip compression) and am looking to query this file using datafusion. The same file exported as .csv works fine using this library, but the parquet version does not return anything.
Steps to reproduce
```python
import datafusion
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [3, 4, 1, 2, 3], 'col4': [3, 4, 4, 5, 6]})
df.to_csv('df.csv', compression=None)
df.to_parquet('df.pq', compression=None)

ctx = datafusion.SessionContext()
ctx.register_csv(name="example_csv", path="df.csv")
ctx.register_parquet(name="example_pq", path="df.pq")

# test csv
df = ctx.sql("SELECT * FROM example_csv")
result = df.collect()
res = result[0]

# test parquet
df = ctx.sql("SELECT * FROM example_pq")
result = df.collect()
res = result[0]
```
Expected behavior
The same result from both approaches
Additional context
It also seems that gzip-compressed files are not working; I am not sure why. Please consider the following example:
```python
df.to_csv('df.csv.gz')
ctx.register_csv(name="example_csv_gz", path="df.csv.gz")

# test csv
df = ctx.sql("SELECT * FROM example_csv_gz")
result = df.collect()
res = result[0]
```
Hi, I might have missed something, but there's no link on the GitHub repo to the Python documentation, and some quick googling points to https://arrow.apache.org/datafusion/python/index.html which returns "Not Found: The requested URL was not found on this server."
The main GitHub page points to arrow.apache.org/datafusion, but there's no Python section there.
I hope I am not wasting your time by missing something obvious.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
CI runs Python linters and fails the build if formatting is incorrect.
Describe the solution you'd like
Add script that I can run locally to format the Python code
Describe alternatives you've considered
None
Additional context
None
We should update the example in the README to show how to do this as well
Apologies if this isn't the correct venue for discussing this, but I want to discuss versioning for arrow-datafusion-python. With the arrow-datafusion 16.0.0 release in motion, I was thinking about the versioning for an upcoming arrow-datafusion-python release. If the pattern continues, the next version would be 0.8.0. This was confusing to me when first starting with arrow-datafusion-python, since I was expecting the version of the Python bindings to match the version of arrow-datafusion I was using, which isn't the case. So my proposal is: why don't we keep the release versions for arrow-datafusion-python in sync with arrow-datafusion and have the next release of arrow-datafusion-python be 16.0.0?
I believe this helps make it much clearer to the end user which underlying version of DataFusion they are using. If there are bugs in arrow-datafusion-python itself that warrant a release, then we could just append a post-release fix version to the number.
Curious about others' thoughts on this. I know software versioning can almost be a religious discussion at times, and this is just a very coarse layout.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Support for reading compressed CSV/JSON files was recently added to DataFusion; we should add supporting bindings here so that the Python bindings can take advantage of that as well.
Describe the solution you'd like
Update context.rs::register_csv/json(...) to take advantage of the new CsvReadOptions, which allows specifying a file compression type.
Describe alternatives you've considered
None
Additional context
None
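Conceptually, the reader only needs to wrap the byte stream in a decompressor before parsing. A stdlib illustration of the idea (this is not the DataFusion implementation):

```python
import csv
import gzip
import io

# Write a small gzip-compressed CSV, then read it back by layering the
# csv parser over a gzip decompression stream.
raw = "a,b\n1,2\n3,4\n"
buf = io.BytesIO()
with gzip.open(buf, "wt", newline="") as f:
    f.write(raw)

buf.seek(0)
with gzip.open(buf, "rt", newline="") as f:
    rows = list(csv.reader(f))

print(rows)  # [['a', 'b'], ['1', '2'], ['3', '4']]
```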
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We should give more PMC members permission to manage this project on PyPI.
Describe the solution you'd like
PMC members should sign up for an account on the following sites and then post their usernames on this issue.
Describe alternatives you've considered
Additional context
Allow for using a Python fsspec filesystem as an ObjectStore, which would allow DataFusion Python to support any filesystem supported by fsspec. I have this working for an internal project.
Would this contribution be valuable to this project? fsspec supports many different filesystems and cloud providers, so it could make using DataFusion easier where native Rust ObjectStores aren't available yet.
Supported built-in fsspec filesystems
Third-party implementations
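To sketch the idea: an fsspec-backed ObjectStore mainly needs to translate ranged reads onto the filesystem's file API. A toy illustration with an in-memory stand-in (neither fsspec itself nor the real ObjectStore trait is used here; all names are illustrative):

```python
import io


class FsspecLikeStore:
    """Toy wrapper showing the ranged-read call an ObjectStore shim makes.

    `fs` only needs an fsspec-style open(path, mode) method.
    """

    def __init__(self, fs):
        self.fs = fs

    def get_range(self, path, start, length):
        # Seek to the requested offset and read exactly `length` bytes,
        # mirroring ObjectStore's ranged-get semantics.
        with self.fs.open(path, "rb") as f:
            f.seek(start)
            return f.read(length)


class InMemoryFs:
    """Stand-in for an fsspec filesystem, backed by a dict of bytes."""

    def __init__(self, files):
        self.files = files

    def open(self, path, mode="rb"):
        return io.BytesIO(self.files[path])


fs = InMemoryFs({"bucket/data.bin": b"hello world"})
store = FsspecLikeStore(fs)
print(store.get_range("bucket/data.bin", 6, 5))  # b'world'
```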
https://github.com/apache/arrow-datafusion/blob/master/datafusion/expr/src/built_in_function.rs
https://github.com/apache/arrow-datafusion/blob/master/datafusion/expr/src/aggregate_function.rs
https://github.com/apache/arrow-datafusion/blob/master/datafusion/expr/src/window_function.rs
https://github.com/apache/arrow-datafusion-python/blob/master/src/functions.rs
Describe the bug
I am unable to display the contents of delta tables stored locally
To Reproduce
```toml
[tool.poetry.dependencies]
python = "^3.10"
datafusion = "^0.7.0"
deltalake = "^0.6.4"
```
then run the following code:
```python
import pyarrow as pa
import pyarrow.dataset as ds
from deltalake import DeltaTable
import datafusion

ctx = datafusion.SessionContext()
delta_table = DeltaTable("/local_delta_path/")
pa_dataset = delta_table.to_pyarrow_dataset()  # original snippet used `dt`, which is undefined
ctx.register_dataset("pa_dataset", pa_dataset)
tmp = ctx.sql("SELECT * FROM pa_dataset limit 10")
tmp.show()
```
When executed in a notebook in VS Code, this script can run for more than 20 minutes, and I am unable to interrupt the execution.
Expected behavior
Top rows displayed
Describe the bug
For example https://github.com/martin-g/arrow-datafusion-python/actions/runs/3591220901 produced:
```
generate-license
Node.js 12 actions are deprecated. For more information see: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/. Please update the following actions to use Node.js 16: actions/checkout@v2, actions-rs/toolchain@v1, actions/upload-artifact@v2
generate-license
The `set-output` command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/
Manylinux
Node.js 12 actions are deprecated. For more information see: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/. Please update the following actions to use Node.js 16: actions/checkout@v2, actions/download-artifact@v2, actions/upload-artifact@v2
Manylinux
The `set-output` command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/
...
```
To Reproduce
Check the Summary of any workflow in GitHub Actions.
Expected behavior
0 or close to 0 warnings.
Additional context
Describe the bug
The test works for me locally but fails in CI:
```
df = <datafusion.DataFrame object at 0x7ffbd80ccc70>

def test_window_lead(df):
    df = df.select(
        column("a"),
        f.alias(
            f.window(
                "lead", [column("b")], order_by=[f.order_by(column("b"))]
            ),
            "a_next",
        ),
    )

    table = pa.Table.from_batches(df.collect())

    expected = {"a": [1, 2, 3], "a_next": [5, 6, None]}
>   assert table.to_pydict() == expected
E   AssertionError: assert {'a': [3, 1, ... [None, 5, 6]} == {'a': [1, 2, ... [5, 6, None]}
E   Differing items:
E   {'a_next': [None, 5, 6]} != {'a_next': [5, 6, None]}
E   {'a': [3, 1, 2]} != {'a': [1, 2, 3]}
E   Full diff:
E   - {'a': [1, 2, 3], 'a_next': [5, 6, None]}
E   ? --- ------
E   + {'a': [3, 1, 2], 'a_next': [None, 5, 6]}
E   ? +++ ++++++
```
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.
The Stack Overflow question https://stackoverflow.com/questions/73015779/how-to-persist-datafusion-dataframe-in-memory asks how to persist a DataFrame in memory. It would be nice to have an easy way to do this from the Python API.
DataFusion has support for MemTable which could be used to achieve this.
Describe the bug
```
_____________________ ERROR collecting test session _____________________
venv/lib/python3.8/site-packages/_pytest/config/__init__.py:607: in _importconftest
    mod = import_path(conftestpath, mode=importmode, root=rootpath)
venv/lib/python3.8/site-packages/_pytest/pathlib.py:556: in import_path
    raise ImportPathMismatchError(module_name, module_file, path)
E   _pytest.pathlib.ImportPathMismatchError: ('datafusion.tests.conftest', '/home/andy/git/apache/arrow-datafusion-python/datafusion/tests/conftest.py', PosixPath('/home/andy/git/apache/arrow-datafusion-python/target/package/datafusion-python-0.7.0/datafusion/tests/conftest.py'))
```
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.
These methods already exist in Rust and need exposing in Python
Implement PyArrow Dataset TableProvider to support reading from a PyArrow Dataset with filter and projection push down.
Describe the bug
I'm working on getting datafusion added to db-benchmark (#147). While putting the benchmarks together, I came across an error in the join benchmark that I wasn't expecting. Specifically, the error is:
```
Traceback (most recent call last):
  File "datafusion/join-datafusion.py", line 72, in <module>
    df = ctx.create_dataframe([ans])
Exception: DataFusion error: Plan("Schema contains duplicate qualified field name 'ce9f0daee780e4f2796b9953bd267457c.id1'")
```
The test code that produced that is here:
```python
question = "small inner on int"  # q1
gc.collect()

t_start = timeit.default_timer()
ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
shape = ans_shape(ans)
print(shape)
t = timeit.default_timer() - t_start

t_start = timeit.default_timer()
df = ctx.create_dataframe([ans])
chk = df.aggregate([], [f.sum(col("v1"))]).collect()[0].column(0)[0]
chkt = timeit.default_timer() - t_start

m = memory_usage()
write_log(task=task, data=data_name, in_rows=x_data.num_rows, question=question, out_rows=shape[0], out_cols=shape[1], solution=solution, version=ver, git=git, fun=fun, run=1, time_sec=t, mem_gb=m, cache=cache, chk=make_chk([chk]), chk_time_sec=chkt, on_disk=on_disk)
del ans
gc.collect()
```
If I update the SQL to:
```sql
SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1
```
I get:
```
Traceback (most recent call last):
  File "datafusion/join-datafusion.py", line 73, in <module>
    df = ctx.create_dataframe([ans])
Exception: DataFusion error: Plan("Schema contains duplicate qualified field name 'cb53bcf8886f449c3bd2651571df185d4.id4'")
```
To me this looks like a bug, as I think I should be able to write the query without having to alias the overlapping columns (when I alias the overlapping columns, it works). For example, below is the equivalent Spark query:
```sql
select * from x join small using (id1)
```
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
I should be able to run either of the following:
```python
ans = ctx.sql("SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1").collect()
df = ctx.create_dataframe([ans])

ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
df = ctx.create_dataframe([ans])
```
Additional context
Add any other context about the problem here.
In the How to develop section (https://github.com/apache/arrow-datafusion-python#how-to-develop) there is a reference to the old site datafusion-contrib/datafusion-python.git.
Describe the bug
Executing maturin build or maturin develop hangs on Linux ARM64 (openEuler 22.03 LTS).
To Reproduce
Steps to reproduce the behavior:
maturin develop
Expected behavior
The build should pass successfully.
Additional context
N/A
Describe the bug
The directory should be arrow-datafusion-python-0.7.0, not arrow-datafusion-0.7.0:
```
A arrow/arrow-datafusion-0.7.0/apache-arrow-datafusion-python-0.7.0.tar.gz
A arrow/arrow-datafusion-0.7.0/apache-arrow-datafusion-python-0.7.0.tar.gz.asc
A arrow/arrow-datafusion-0.7.0/apache-arrow-datafusion-python-0.7.0.tar.gz.sha256
A arrow/arrow-datafusion-0.7.0/apache-arrow-datafusion-python-0.7.0.tar.gz.sha512
```
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Hi @andygrove,
First of all I wanted to thank you for creating and open-sourcing datafusion - I have been following the project for some time and find it super exciting to see the next-gen data engineering infrastructure being built on top of Arrow!
I have no experience with Rust but quite some with writing applications in Python & PySpark so I thought I could contribute to the Python language bindings.
Recently I saw @francis-du added lots of functions to the Python package with #73 - thanks for the significant effort in improving the package!
To improve the test coverage, I've opened pull request #129 with a few unit tests for the built-in functions. Hope the tests are helpful, look forward to getting some feedback.
Describe the solution you'd like
Improve test coverage by adding more unit tests.
Describe alternatives you've considered
n/a
Additional context
n/a
Currently the build is failing due to the rust-toolchain file that is checked in:
```
error: failed to select a version for the requirement `parquet = "^18.0.0"`
candidate versions found which didn't match: 15.0.0, 14.0.0, 13.0.0, ...
```
Removing the rust-toolchain file fixes the issue. Do we need this file checked in? The original repo does not have it checked in.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We should release a new version based on DataFusion 16/17/18.
Describe the solution you'd like
Describe alternatives you've considered
None
Additional context
None
I would like to be able to query data on S3 with datafusion-python and the new object_store crate. I haven't had the chance to play with that crate yet, but once I do I would like to get the necessary functionality added here.