Comments (11)

hendrikmakait commented on September 24, 2024

We constantly re-create the expressions underneath the collection, creating completely new collections in the optimiser. Ensuring that this propagates seems non-trivial if the information is only available on the collection.

Are we ever interested in maintaining the correct value of .attrs for any intermediate expression or only in its value on the root collection? In the latter case, attaching it to the collection and fixing it during the optimization step feels manageable.
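
A minimal sketch of the second option (keep attrs on the root collection and re-attach it after optimization). The `Expr`, `Collection`, and `rewrite` names are hypothetical stand-ins written for this illustration; they are not dask's or dask-expr's actual classes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Expr:
    """Toy immutable expression node (hypothetical, not dask-expr's Expr)."""
    op: str
    inputs: tuple = ()


def rewrite(expr: Expr) -> Expr:
    """Stand-in for an optimizer pass: rebuilds every node from scratch."""
    return Expr(expr.op, tuple(rewrite(i) for i in expr.inputs))


@dataclass
class Collection:
    """Toy user-facing wrapper; metadata lives here, not on the expression."""
    expr: Expr
    attrs: dict = field(default_factory=dict)

    def optimize(self) -> "Collection":
        # The optimizer produces a brand-new expression tree ...
        new_expr = rewrite(self.expr)
        # ... so the root collection's attrs are copied over explicitly.
        return Collection(new_expr, attrs=dict(self.attrs))


df = Collection(Expr("fillna", (Expr("from_pandas"),)), attrs={"units": "m"})
opt = df.optimize()
```

Only the root collection carries metadata in this sketch; intermediate expressions know nothing about attrs, which is exactly the limitation the question above is asking about.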

from dask.

LucaMarconato commented on September 24, 2024

I also experienced a problem with pandas and .attrs: copied DataFrame objects were not containing a copy of .attrs. I initially thought it was a geopandas problem (geopandas/geopandas#2920), but then it turned out to be from pandas (pandas-dev/pandas#54134). Fortunately, the bug above is now fixed. Which other problems did you find?

The use case for .attrs in the library we developed, https://github.com/scverse/spatialdata/, is to store metadata associated with GeoDataFrame and DataFrame objects (both lazy and non-lazy). The metadata mostly contains JSON-like information that describes how various spatial objects are aligned together.

phofl commented on September 24, 2024

Thanks for the report.

Could you add a bit more context about what you are trying to achieve with attrs? attrs doesn't really work in pandas either, and support is spotty at best.

LucaMarconato commented on September 24, 2024

Do you have any update on this? 😊 Some users reported problems with installation due to having pinned an older version of Dask in our config so it's an important issue for us. Thank you for your time.

phofl commented on September 24, 2024

This is non-trivial to add and the semantics aren't completely clear either. This can't live on the collection level, it has to be on the expression level since we are constantly recreating the underlying expression, which makes it non-trivial. That said, the following is unclear to me:

df = dd.from_pandas(...)
df.attrs = "foo"
df = df.fillna(100)
df = df["a"]
df.attrs = "bar"

dask-expr will reorder the query and push the projection in front of the fillna. So which attrs value should take precedence here?
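
To make the reordering concrete, here is a toy sketch of a projection-pushdown rewrite. The nested-tuple representation and the `push_projection` function are hypothetical and exist only for this illustration; they do not reflect how dask-expr represents or rewrites expressions internally.

```python
def push_projection(expr):
    """Move a column projection below a fillna when it sits directly above it.

    Nodes are nested tuples (hypothetical format for this sketch):
      ("source", name)
      ("fillna", value, child)
      ("project", column, child)
    """
    if expr[0] == "project" and expr[2][0] == "fillna":
        _, col, (_, value, child) = expr
        # Select the column first, then fill missing values in it only.
        return ("fillna", value, push_projection(("project", col, child)))
    return expr


# Corresponds to df.fillna(100)["a"] as written by the user.
query = ("project", "a", ("fillna", 100, ("source", "df")))
optimized = push_projection(query)
# optimized == ("fillna", 100, ("project", "a", ("source", "df")))
```

Before the rewrite the projection is the root node; afterwards the fillna is. If attrs were stored per expression node, "foo" and "bar" would end up attached to nodes whose order has been swapped, which is the ambiguity raised above.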

Contributions are very welcome. I won't have much time to think about this, though.

LucaMarconato commented on September 24, 2024

Thanks for the answer. From my understanding (and for the purpose of our use case), the df.attrs slot should not be treated as lazy but should always be executed immediately. This was the behavior implemented before Dask 2024.5.1. In particular, in your example the computational graph would never contain nodes related to modifying .attrs.
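
A rough sketch of what "not lazy" could mean here, using a hypothetical `LazyFrame` class written for this illustration (it is not a dask class). Operations are only recorded, while attrs is a plain dict that changes immediately and never shows up among the recorded operations. Carrying attrs over to the result of `fillna` is an assumption of the sketch, not settled semantics.

```python
class LazyFrame:
    """Hypothetical lazy collection: records operations, computes nothing."""

    def __init__(self, ops=(), attrs=None):
        self.ops = tuple(ops)            # the recorded (lazy) operations
        self.attrs = dict(attrs or {})   # eager metadata, outside the graph

    def fillna(self, value):
        # A new collection with one more recorded op; attrs is copied over
        # (an assumption made for this sketch).
        return LazyFrame(self.ops + (("fillna", value),), self.attrs)


df = LazyFrame(ops=[("from_pandas",)])
df.attrs["note"] = "foo"      # takes effect right away, records no operation
filled = df.fillna(100)
```

After these lines, `df.ops` still contains only the `from_pandas` step, so setting attrs added nothing to the graph.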

I think making a PR for this should be quick; if you agree with these semantics, I could give it a try.

phofl commented on September 24, 2024

That would be fine, but this still doesn't cover what should happen after df.optimize() is called, which is the tricky part here.

LucaMarconato commented on September 24, 2024

I will run some experiments and get back to you.

fjetter commented on September 24, 2024

This can't live on the collection level, it has to be on the expression level

can you elaborate? My initial gut reaction is that this should only live on the collection level and not on expr.

--

pandas-dev/pandas#52166 reads like there are a lot of questions around this feature and I wouldn't be surprised if some of this is subject to change.

phofl commented on September 24, 2024

can you elaborate? My initial gut reaction is that this should only live on the collection level and not on expr.

We constantly re-create the expressions underneath the collection, creating completely new collections in the optimiser. Ensuring that this propagates seems non-trivial if the information is only available on the collection.

fjetter commented on September 24, 2024

To summarize a bit of an offline conversation, since I got confused about some of the earlier comments:

  • attrs is currently a poorly defined API in pandas whose semantics are not always clear (some examples are discussed in pandas-dev/pandas#52166, for instance around copy-on-write)
  • This lack of specification makes it very hard for us to implement this. While we could attach high-level metadata to a collection, we could not rely on this metadata being there on an intermediate layer. This is pretty much impossible right now with how the optimizer works
  • This may be a problem for libraries that rely on this to define/control behavior

Note: this is not related to 2024.5.1 but was introduced in 2024.3.0, when we enabled query optimization by default.