Comments (8)
Hmm, this seems like quite a problematic limitation of the pandas cursor: it means you can't really use pyathena for data pipelines, since you can't read data from Athena and then write it back to S3. I've found a workaround by being explicit about which S3 file system I want when writing:
import s3fs

s3 = s3fs.S3FileSystem()
with s3.open(f"{s3_bucket}/my_prefix/file.csv", "w") as file_in:
    my_df.to_csv(file_in, index=False)
But would you be willing to consider addressing this issue, either by using s3fs, by not registering your S3 implementation globally with fsspec (not sure if this is possible), or by extending your implementation to support writing?
from pyathena.
https://github.com/laughingman7743/PyAthena/blob/master/pyathena/filesystem/s3.py#L403-L416
I do not currently support writing DataFrames to S3 with PandasCursor.
Thanks for the swift response. I totally understand that your package doesn't support writing to S3. The issue is that once we import PandasCursor, it interferes with how pandas DataFrames are written to S3. Notice in my example that my_df.to_csv(output_path) has nothing to do with PyAthena. It seems from the stack trace that importing PandasCursor interferes with how pandas interacts with s3fs and fsspec, which means we can't use pyathena and then write the resulting DataFrame back to S3 using pandas.
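To illustrate why an import can have this effect: pandas hands any non-local URL to fsspec's protocol registry, so whichever class is registered for a protocol at call time is what to_csv ends up using. A minimal sketch using fsspec's built-in "memory://" protocol, which goes through the same registry lookup as "s3://" but needs no AWS access:

```python
import pandas as pd

# pandas delegates non-local URLs to fsspec's protocol registry.
# "memory://" is fsspec's built-in in-memory filesystem; it is resolved
# through the same lookup that "s3://" is, which is why replacing the
# registered "s3" implementation changes what a plain df.to_csv("s3://...")
# does, even though that call never mentions pyathena.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_csv("memory://demo.csv", index=False)
round_trip = pd.read_csv("memory://demo.csv")
```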
https://github.com/laughingman7743/PyAthena/blob/master/pyathena/pandas/__init__.py#L4-L5
Importing PandasCursor executes this code, which overwrites the s3fs implementation registered with fsspec.
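For anyone unfamiliar with the mechanism: fsspec keeps a single process-wide registry mapping protocol names to filesystem classes, and register_implementation(..., clobber=True) silently replaces whatever was registered before. A minimal sketch with a made-up "demo" protocol (FirstFS and SecondFS are illustrative names, not pyathena's classes):

```python
import fsspec
from fsspec import AbstractFileSystem

class FirstFS(AbstractFileSystem):
    protocol = "demo"

class SecondFS(AbstractFileSystem):
    protocol = "demo"

# First registration: "demo://" URLs now resolve to FirstFS.
fsspec.register_implementation("demo", FirstFS)

# clobber=True silently replaces the earlier registration for the whole
# process, the same way importing pyathena.pandas replaces the "s3"
# implementation that s3fs would otherwise provide.
fsspec.register_implementation("demo", SecondFS, clobber=True)

winner = fsspec.get_filesystem_class("demo")
```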
I used to use S3fs, but stopped using it because S3fs depends on aiobotocore, and aiobotocore sets only certain versions of botocore as dependencies, making it difficult to install dependencies.
https://github.com/aio-libs/aiobotocore/blob/master/setup.py#L10
I welcome pull requests, you know.
I used to use S3fs, but stopped using it because S3fs depends on aiobotocore, and aiobotocore sets only certain versions of botocore as dependencies, making it difficult to install dependencies.
Yes, that's fair enough; it's a problem we've also been encountering in our projects.
I welcome pull requests, you know.
Touché. Given that you don't want to include s3fs as a dependency, do you think the best fix is to implement an S3 writer in this package? That seems like a slightly odd thing for this package to be implementing. The alternative would be to avoid registering your S3 implementation globally with fsspec. I'll admit I haven't looked at the fsspec code at all and don't know whether there is a way to implement things without the global registration.
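For what it's worth, until either fix lands there is a stopgap on the user side: re-register s3fs for the "s3" protocol after importing PandasCursor. This is only a sketch, assuming s3fs is installed; passing the class path as a string defers the s3fs import until the filesystem is first used:

```python
import fsspec

# Stopgap sketch: after `import pyathena.pandas` has replaced the "s3"
# registration, point fsspec back at s3fs. The string form is only looked
# up (and s3fs actually imported) the first time an s3:// URL is opened.
fsspec.register_implementation("s3", "s3fs.S3FileSystem", clobber=True)
```

Caveat: if an S3 filesystem class has already been resolved and cached in the registry earlier in the process, this string re-registration may not take effect for that cached entry.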
I am not sure what is odd about it.
If you want to save the DataFrame as an Athena table, you can also use the to_sql method.
https://github.com/laughingman7743/PyAthena#to-sql
As a workaround, you can convert query results to pandas yourself instead of using the as_pandas function or PandasCursor. E.g., this succeeds:
from pyathena import connect
import pandas as pd

with connect(s3_staging_dir="s3://...") as conn:
    with conn.cursor() as cursor:
        cursor.execute("select * from my_table limit 10")
        df = pd.DataFrame(cursor, columns=[c[0] for c in cursor.description])

# later...
df.to_csv("s3://...")
I'd suggest it would be preferable for this library not to support pandas conversion at all, rather than to support it in a way that breaks typical pandas usage in other contexts.