Comments (11)
> We constantly re-create the expressions underneath the collection, creating completely new collections in the optimiser. Ensuring that this propagates seems non-trivial if the information is only available on the collection.
Are we ever interested in maintaining the correct value of `.attrs` for any intermediate expression, or only in its value on the root collection? In the latter case, attaching it to the collection and fixing it up during the optimization step feels manageable.
from dask.
I also experienced a problem with pandas and `.attrs`: copied `DataFrame` objects did not contain a copy of `.attrs`. (I initially thought it was a geopandas problem, geopandas/geopandas#2920, but it turned out to come from pandas, pandas-dev/pandas#54134.) Fortunately that bug is now fixed. Which other problems did you find?

The use case for `.attrs` in the library we develop, https://github.com/scverse/spatialdata/, is to store metadata associated with `GeoDataFrame` and `DataFrame` objects (both lazy and non-lazy). The metadata mostly contains JSON-like information that describes how various spatial objects are aligned with each other.
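A minimal sketch of this use case with plain pandas (the key name `coordinate_system` and its contents are illustrative, not part of any library's schema):

```python
import pandas as pd

# Attach JSON-like metadata to a DataFrame via the .attrs dict
df = pd.DataFrame({"x": [1.0, None], "y": [3.0, 4.0]})
df.attrs["coordinate_system"] = {"axes": ["x", "y"], "unit": "micrometer"}

# .copy() carries the metadata along (the copy bug referenced above,
# pandas-dev/pandas#54134, is fixed in recent pandas versions)
copied = df.copy()
print(copied.attrs)
```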
Thanks for the report.
Could you add a bit more context about what you are trying to achieve with `.attrs`? `.attrs` doesn't really work in pandas either, and support is spotty at best.
Do you have any update on this? 😊 Some users reported installation problems because an older version of Dask was pinned in our config, so this is an important issue for us. Thank you for your time.
This is non-trivial to add, and the semantics aren't completely clear either. This can't live on the collection level; it has to be on the expression level, since we are constantly recreating the underlying expression, which makes it non-trivial. That said, the following is unclear to me:

```python
df = dd.from_pandas(...)
df.attrs = "foo"
df = df.fillna(100)
df = df["a"]
df.attrs = "bar"
```

dask-expr will reorder this query and push the projection in front of the `fillna`. So which `.attrs` should take precedence here?
Contributions are very welcome. I won't have much time to think about this myself, though.
Thanks for the answer. From my understanding (and for the purposes of our use case), the `df.attrs` slot should not be treated as lazy but should always be updated eagerly. This was the behavior implemented before Dask 2024.5.1. In particular, in your example the computational graph would never contain nodes related to modifying `.attrs`.

I think making a PR for this should be quick; if you agree with these semantics I could give it a try.
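A hypothetical sketch of these eager-`.attrs` semantics (illustrative only, not dask's actual implementation): assignments to `.attrs` take effect immediately on the Python object and never become graph nodes, and each derived object starts from a copy of its parent's attrs.

```python
# Illustrative wrapper, not dask's API: .attrs is an eager, plain dict.
class EagerAttrsFrame:
    def __init__(self, expr, attrs=None):
        self.expr = expr                  # placeholder for a lazy expression
        self.attrs = dict(attrs or {})    # copied eagerly, never lazy

    def fillna(self, value):
        # each operation creates a new object carrying a copy of attrs
        return EagerAttrsFrame(("fillna", self.expr, value), self.attrs)

    def __getitem__(self, column):
        return EagerAttrsFrame(("project", self.expr, column), self.attrs)

df = EagerAttrsFrame(("from_pandas",))
df.attrs["tag"] = "foo"
df = df.fillna(100)["a"]
df.attrs["tag"] = "bar"   # the last eager assignment wins: {"tag": "bar"}
```

Under these semantics the reordering question above disappears: whatever value was assigned last before computing is the one observed, regardless of how the query is rewritten.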
That would be fine, but this still doesn't cover what should happen after `df.optimize()` is called, which is the tricky part here.
I will run some experiments and get back to you.
> This can't live on the collection level, it has to be on the expression level

Can you elaborate? My initial gut reaction is that this should only live on the collection level and not on the expression.
--
pandas-dev/pandas#52166 reads like there are a lot of open questions around this feature, and I wouldn't be surprised if some of this is subject to change.
> Can you elaborate? My initial gut reaction is that this should only live on the collection level and not on the expression.
We constantly re-create the expressions underneath the collection, creating completely new collections in the optimiser. Ensuring that this propagates seems non-trivial if the information is only available on the collection.
To summarize a bit of an offline conversation, since I got confused by some of the earlier comments:

- `.attrs` is currently a poorly defined API in pandas whose semantics are not always clear (pandas-dev/pandas#52166 collects some examples, for instance around copy-on-write).
- This lack of specification makes it very hard for us to implement this. While we could attach high-level metadata to a collection, we could not rely on this metadata being present on an intermediate layer; that is pretty much impossible right now with how the optimizer works.
- This may be a problem for libraries that rely on `.attrs` to define/control behavior.

Note: this is not related to 2024.5.1 but was introduced in 2024.3.0, when we enabled query optimization by default.
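The optimizer problem described above can be sketched in a few lines (all names here are illustrative, not dask's API): the optimizer rebuilds the expression and returns a brand-new collection, so anything stored only on the old collection object is silently lost unless it is copied across explicitly.

```python
# Hypothetical collection wrapper, not dask's actual classes.
class Collection:
    def __init__(self, expr):
        self.expr = expr
        self.attrs = {}

def optimize(collection):
    # rewrites the expression and wraps it in a fresh Collection;
    # nothing here copies .attrs across
    return Collection(("optimized", collection.expr))

c = Collection(("read_parquet", "data.parquet"))
c.attrs["source"] = "sensor-1"
print(optimize(c).attrs)   # metadata dropped: {}
```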