Comments (11)
This is dask/dask-expr#932
@phofl is this expected when enabling copy on write?
from dask.
FWIW, the setting we're enabling here is something that will be enabled by default (if not enforced) in pandas 3.0, which is soon to be released. I suspect that this is desired behavior, and that you'll need to create a copy with to_numpy(copy=True).
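To illustrate the suggestion: with copy-on-write, the array returned by to_numpy() can be a read-only view of the DataFrame's data, while to_numpy(copy=True) always hands back a fresh, writable array. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Under copy-on-write, to_numpy() may return a read-only view of the
# DataFrame's data, so mutating the result in place can raise.
# to_numpy(copy=True) always returns a fresh, writable array.
arr = df["a"].to_numpy(copy=True)
assert arr.flags.writeable
arr[0] = 10

# The copy is detached: the original DataFrame is untouched.
assert df.loc[0, "a"] == 1
```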
I agree that we're going to eventually need to support pandas with copy-on-write, and I am also working on a patch that just works with copy-on-write enabled. But:
- I expect we're going to have to support pandas 2 for a while longer (we have optional dependencies and dependents pinning pandas pretty low)
- I think this is an unintuitive side effect of importing dask.dataframe.
Ran into a deeper issue where numcodecs doesn't like being passed a read-only buffer, so I think we'll end up needing to pin dask for a bit on our end.
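Until numcodecs accepts read-only buffers, one pragmatic workaround is to hand the codec a writable copy. Note that np.ascontiguousarray is not sufficient here: for an already-contiguous input it returns the same read-only array without copying, so an explicit .copy() is needed. A sketch of the distinction:

```python
import numpy as np

arr = np.arange(4.0)
arr.flags.writeable = False  # what a CoW-backed to_numpy() can hand you

# np.ascontiguousarray is NOT enough: for an already-contiguous input
# it returns the same read-only array, with no copy made.
assert not np.ascontiguousarray(arr).flags.writeable

# An explicit copy is always writable and safe to pass to code that
# requires a writable buffer (e.g. the affected numcodecs releases).
safe = arr.copy()
assert safe.flags.writeable
```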
I understand the problem. We'll likely want to continue using COW in dask, but I can offer to add a toggle to control this, e.g.
import dask
dask.config.set({"dataframe.copy-on-write": False})
import dask.dataframe
Would this be a feasible workaround for you?
dask.config.set({"dataframe.copy-on-write": False})
Would that do something much different than me using pd.set_option?
I do think it makes sense that you want to opt-in to this. I was just hoping to be able to address this during the pandas 3.0 release candidate period when our canary would pick it up.
I also don't think we'll need to do any configuration if we can figure out the numcodecs issue, but unfortunately I don't know enough Cython to figure out what incantation it wants.
Would that do something much different than me using pd.set_option?
We want to enable this by default for dask users. The config option would be a way to opt-out of this opinionated choice.
I also don't think we'll need to do any configuration if we can figure out the numcodecs issue, but unfortunately I don't know enough Cython to figure out what incantation it wants.
Their Cython code can't deal with read-only arrays (we had similar issues in pandas); a PR that addressed this is here:
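For context, the failure mode and the usual Cython-side fix can be seen at the Python level as well: a function typed `double[:]` requests a writable buffer and rejects read-only input, while `const double[:]` only requests read access. This is a sketch of that general pattern, not of the linked PR specifically:

```python
import numpy as np

arr = np.arange(4.0)
arr.flags.writeable = False  # what a CoW-backed to_numpy() can return

# A Cython signature like `f(double[:] buf)` requests a *writable*
# buffer and fails on this array; `f(const double[:] buf)` requests
# read access only and succeeds.  The same distinction is visible in
# plain Python: writing raises, while a read-only view is fine.
try:
    arr[0] = 1.0
    raised = False
except ValueError:
    raised = True
assert raised

view = memoryview(arr)  # read access acquires the buffer without error
assert view.readonly
```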
@phofl, thank you for the pointer! zarr-developers/numcodecs#515
No worries, the PR looks good and should address those issues.
We want to enable this by default for dask users. The config option would be a way to opt-out of this opinionated choice.
Please reconsider that philosophy. Package-wide settings are intended for users and applications, not for libraries, and import in Python should be free of side effects.
You can make your own APIs return only COW pd.DataFrames, as that's part of your API, but you should make sure an import dask.dataframe doesn't modify how pd.DataFrames behave in a different part of a user's codebase that doesn't use Dask at all.