Comments (11)

fjetter commented on September 23, 2024

This is dask/dask-expr#932

@phofl is this expected when enabling copy on write?

from dask.

fjetter commented on September 23, 2024

FWIW, the setting we're enabling here will be enabled by default (if not enforced) in pandas 3.0, which is soon to be released. I suspect that this is the desired behavior and that you'll need to create a copy with to_numpy(copy=True).
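A minimal sketch of the to_numpy(copy=True) suggestion above. The helper name writable_values is hypothetical and used only for illustration; the only pandas call relied on is Series.to_numpy with its copy parameter. Under copy-on-write the array returned by a plain to_numpy() may be read-only, whereas an explicit copy is always writable and independent of the Series.

```python
import pandas as pd


def writable_values(series: pd.Series):
    """Return a NumPy array that is independent of the Series (hypothetical helper)."""
    # copy=True forces a fresh array even where to_numpy() could return a view.
    return series.to_numpy(copy=True)


s = pd.Series([1.0, 2.0, 3.0])
arr = writable_values(s)
arr[0] = 10.0  # the in-place write lands on the copy only

print(arr.flags.writeable)  # True
print(s.iloc[0])            # 1.0, the Series is unchanged
```

The cost is one extra allocation per call, so this would normally be applied only where the array is actually mutated.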

ivirshup commented on September 23, 2024

I agree that we're eventually going to need to support pandas with copy-on-write. I am also working on a patch that makes our code work under copy-on-write by doing this.

But:

  • I expect we're going to have to support pandas 2 for a while longer (we have optional dependencies and dependents pinning pandas pretty low)
  • I think this is an unintuitive side effect of importing dask.dataframe.

ivirshup commented on September 23, 2024

Ran into a deeper issue where numcodecs doesn't like being passed a read-only buffer, so I think we'll end up needing to pin dask for a bit on our end.
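To illustrate the read-only buffer problem described above, here is a hedged sketch of a caller-side workaround: copy only when the array is not writable. The helper name ensure_writable is hypothetical, and the read-only array is simulated with setflags rather than obtained from pandas or passed to numcodecs.

```python
import numpy as np


def ensure_writable(arr: np.ndarray) -> np.ndarray:
    """Return arr unchanged if it is writable, otherwise a writable copy."""
    return arr if arr.flags.writeable else arr.copy()


readonly = np.arange(4)
readonly.setflags(write=False)  # mimic an array handed out under copy-on-write

buf = ensure_writable(readonly)
buf[0] = 99  # succeeds because buf is a fresh, writable copy

print(buf.flags.writeable)  # True
print(readonly[0])          # 0, the original is untouched
```

This avoids a copy when the input is already writable, at the price of copying read-only inputs before they reach code that insists on a writable buffer.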

fjetter commented on September 23, 2024

I understand the problem. We'll likely want to continue using COW in dask, but I can offer to add a toggle to control this, e.g.:

import dask
dask.config.set({"dataframe.copy-on-write": False})
import dask.dataframe

Would this be a feasible workaround for you?

ivirshup commented on September 23, 2024

> dask.config.set({"dataframe.copy-on-write": False})

Would that do something much different than me using pd.set_option?


I do think it makes sense that you want to opt in to this. I was just hoping to be able to address this during the pandas 3.0 release candidate period, when our canary would pick it up.

I also don't think we'll need to do any configuration if we can figure out the numcodecs issue, but unfortunately I don't know enough cython to figure out what incantation it wants.

fjetter commented on September 23, 2024

> Would that do something much different than me using pd.set_option?

We want to enable this by default for dask users. The config option would be a way to opt-out of this opinionated choice.

phofl commented on September 23, 2024

> I also don't think we'll need to do any configuration if we can figure out the numcodecs issue, but unfortunately I don't know enough cython to figure out what incantation it wants.

Their Cython code can't deal with read-only arrays (we had similar issues in pandas). A PR that addressed this in pandas is here:

pandas-dev/pandas#53703

ivirshup commented on September 23, 2024

@phofl, thank you for the pointer! zarr-developers/numcodecs#515

phofl commented on September 23, 2024

No worries, the PR looks good and should address those issues.

flying-sheep commented on September 23, 2024

> We want to enable this by default for dask users. The config option would be a way to opt-out of this opinionated choice.

Please reconsider that philosophy. Package-wide settings are intended for users and applications, not for libraries. An import in Python should be free of side effects.

You can make your APIs only return COW pd.DataFrames, as that’s part of your API, but you should make sure that an import dask.dataframe doesn’t modify how pd.DataFrames behave in a different part of a user’s codebase that doesn’t use Dask at all.
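One way to picture the point above about keeping settings out of global state is pandas' own option_context, which applies an option only inside a with block and restores the previous value on exit. The sketch below is an assumption-laden illustration, not how dask behaves: the function summarize is hypothetical, and display.max_rows stands in for a copy-on-write setting only because it is an option that is safe to toggle in any pandas version.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> str:
    """Hypothetical library function that needs a non-default pandas option."""
    # The option is changed only for the duration of this block,
    # so importing or calling the library leaves global state alone.
    with pd.option_context("display.max_rows", 5):
        return repr(df)


before = pd.get_option("display.max_rows")
text = summarize(pd.DataFrame({"a": range(100)}))
after = pd.get_option("display.max_rows")

print(before == after)  # True, the caller's setting is restored
```

The design choice is that the library pays for its preference at each call site instead of changing behavior for unrelated code in the same process.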
