Comments (8)

louis-vines commented on June 12, 2024

Hmm, this seems like quite a problematic requirement for using the pandas cursor. It means you can't really use PyAthena for data pipelines, because you can't read data from Athena and then write back to S3. I've thought of a workaround: being explicit about which S3 filesystem I want when writing to S3:

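import s3fs

# Open the file through an explicit s3fs instance rather than the "s3://" URL lookup.
s3 = s3fs.S3FileSystem()

with s3.open(f"{s3_bucket}/my_prefix/file.csv", "w") as file_out:
    my_df.to_csv(file_out, index=False)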

But would addressing this issue be something you are willing to think about? Either by using s3fs, by not registering your S3 implementation globally with fsspec (not sure if this is possible), or by extending your implementation to support writing?

from pyathena.

laughingman7743 commented on June 12, 2024

https://github.com/laughingman7743/PyAthena/blob/master/pyathena/filesystem/s3.py#L403-L416
I do not currently support writing DataFrames to S3 with PandasCursor.

louis-vines commented on June 12, 2024

Thanks for the swift response. Totally understand that your package doesn't support writing to S3. The issue is that once we import PandasCursor, it interferes with how pandas DataFrames are written to S3. Notice in my example that my_df.to_csv(output_path) has nothing to do with PyAthena. It seems to me from the stack trace that importing PandasCursor interferes with how pandas interacts with s3fs and fsspec. This is an issue because it means we can't use PyAthena and then write the resulting DataFrame back to S3 using pandas.

laughingman7743 commented on June 12, 2024

https://github.com/laughingman7743/PyAthena/blob/master/pyathena/pandas/__init__.py#L4-L5
Importing PandasCursor runs this code, which overwrites the s3fs implementation.
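
To illustrate why an import alone can change behavior elsewhere, here is a minimal toy model of a process-wide protocol registry. It is not fsspec's or PyAthena's actual code: `REGISTRY`, `register`, `open_url`, `ReadWriteFS` and `ReadOnlyFS` are hypothetical names, and the `clobber` flag only mimics the idea of replacing an existing entry.

```python
# Toy model of a global protocol registry (hypothetical names, not fsspec's code).
REGISTRY = {}


def register(protocol, cls, clobber=False):
    """Map a protocol name to a filesystem class for the whole process."""
    if protocol in REGISTRY and not clobber:
        raise ValueError(f"{protocol!r} is already registered")
    REGISTRY[protocol] = cls


class ReadWriteFS:
    """Stands in for a full-featured implementation such as s3fs."""

    def open(self, path, mode="r"):
        return f"opened {path} in mode {mode!r}"


class ReadOnlyFS:
    """Stands in for an implementation that only supports reading."""

    def open(self, path, mode="r"):
        if "w" in mode:
            raise NotImplementedError("writing is not supported")
        return f"opened {path} in mode {mode!r}"


def open_url(url, mode="r"):
    """Resolve the protocol through the registry, as a generic caller would."""
    protocol, _, path = url.partition("://")
    return REGISTRY[protocol]().open(path, mode)


register("s3", ReadWriteFS)
before = open_url("s3://bucket/file.csv", "w")  # writing works here

# An import-time side effect in another package replaces the entry.
register("s3", ReadOnlyFS, clobber=True)
try:
    open_url("s3://bucket/file.csv", "w")
    after = "write succeeded"
except NotImplementedError as exc:
    after = f"write failed: {exc}"
```

The caller's code is identical before and after; only the registry entry changed, which is the kind of effect described in this thread.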

laughingman7743 commented on June 12, 2024

I used to use s3fs, but stopped because s3fs depends on aiobotocore, and aiobotocore only allows certain versions of botocore as dependencies, which makes dependencies difficult to install.
https://github.com/aio-libs/aiobotocore/blob/master/setup.py#L10
I welcome pull requests, you know.

louis-vines commented on June 12, 2024

> I used to use S3fs, but stopped using it because S3fs depends on aiobotocore, and aiobotocore sets only certain versions of botocore as dependencies, making it difficult to install dependencies.

Yes this is fair enough - it's also a problem we've been encountering in projects.

> I welcome pull requests, you know.

Touché. Given that you don't want to include s3fs as a dependency, do you think the best fix is to implement an S3 writer in this package? That seems a slightly odd thing to be implementing in this package. The alternative would be to avoid registering your S3 implementation globally with fsspec. I'll admit I've not looked at the fsspec code at all and don't know whether there is a way to implement things without the global registration.
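
A rough sketch of that second option, assuming a library can instantiate its own filesystem class directly instead of registering it for a protocol. All names here (`REGISTRY`, `DefaultS3`, `LibraryS3`, `LibraryCursor`) are hypothetical, and this is not how PyAthena or fsspec are implemented; it only shows the shape of the idea.

```python
# Hypothetical sketch: a library keeps a private filesystem instance
# and never touches the process-wide registry.
REGISTRY = {}


class DefaultS3:
    """Stands in for whatever implementation the user already relies on."""

    def write(self, path, data):
        return f"wrote {len(data)} bytes to {path}"


class LibraryS3:
    """Stands in for a library's read-only implementation."""

    def read(self, path):
        return f"contents of {path}"


class LibraryCursor:
    """Uses its own filesystem object for internal reads only."""

    def __init__(self):
        self._fs = LibraryS3()  # private; not registered globally

    def fetch(self, path):
        return self._fs.read(path)


REGISTRY["s3"] = DefaultS3  # the user's environment before the library is used

cursor = LibraryCursor()
result = cursor.fetch("bucket/query-output.csv")

# The global entry is untouched, so generic writes still work.
written = REGISTRY["s3"]().write("bucket/file.csv", b"a,b\n1,2\n")
```

Whether this is feasible in practice depends on how the library hands paths to pandas, which I have not checked.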

laughingman7743 commented on June 12, 2024

I am not sure what is odd about it.
If you want to save the DataFrame as an Athena table, you can also use the to_sql method.
https://github.com/laughingman7743/PyAthena#to-sql

graydenshand commented on June 12, 2024

As a workaround, you can convert query results to pandas yourself instead of using the as_pandas function or PandasCursor.

For example, this succeeds:

from pyathena import connect
import pandas as pd

with connect(s3_staging_dir="s3://...") as conn:
    with conn.cursor() as cursor:
        cursor.execute("select * from my_table limit 10")
        df = pd.DataFrame(cursor, columns=[c[0] for c in cursor.description])
        # later...
        df.to_csv("s3://...")

I'd suggest it would be preferable for this library not to support pandas conversion at all, rather than to support it in a way that breaks typical pandas usage in other contexts.
