Comments (17)

phofl commented on July 30, 2024

Sorry about that, we switched the link: https://docs.dask.org/en/stable/dataframe-api.html

from dask.

fjetter commented on July 30, 2024

There are some general tips for minimal bug reports here: https://matthewrocklin.com/minimal-bug-reports

Generally, start with randomized data and simplify the example step by step. I don't think you'll be leaking any corporate code if you hand over pure dask code. It is tedious work that takes time, but there is no automatic tool that can generate a reproducer for you.


For the specific case where things are hanging, it may be possible to generate a py-spy profile and share that, or a screenshot of it.

Note that this leaks function names, file paths, etc., and some people consider this sensitive information.

py-spy record -s -t -o stuck-dask-expr.json --format=speedscope -- python my_reproducer.py

Note: on macOS this has to run with sudo.

You can abort this after a couple of seconds with Ctrl + C and share the profile. This will at least tell us which part of the code is slow or hanging. There are a couple of code paths that are non-linear if certain things are not cached, and the profile might help us spot them. On top of this, the output of pprint or visualize on the expression could be useful.

This does not really replace the need for a reproducer, since it is guesswork on our end and the time we can invest in an issue report like this is not that large.

If you provide us with a minimal reproducer, we'll likely have a same-day fix for you.

thomas-fred commented on July 30, 2024

Hello, I'm wondering if I've stumbled into a bug or a change in usage when updating from 2024.2.0 to 2024.3.0. With the former, the code below works and indices is bound to a list of index values. With the latter, I get an empty list.

import dask.dataframe

df = dask.dataframe.read_parquet(<file_path>)
indices = list(set(df.index))

The strange thing is that with 2024.3.0, the result still seems to know about the index values; look at the repr:

(Pdb) df.index.compute()
Empty DataFrame
Columns: []
Index: [2007345N18298, <truncated>, 2022255N15324]

But when I actually try to access those values:

(Pdb) df.index.compute().values
array([], shape=(10, 0), dtype=float64)

Here's the data I'm trying to read as zipped parquet: test_data.zip

I installed both dask versions using micromamba 1.5.7.

jackguac commented on July 30, 2024

Using drop(col, axis=1) drops not only col but also other columns whose names overlap with it (in the example below, dropping n1_new also removes n1). Everything is fine when using drop(columns=col), and I think it was also fine on 2024.2.1. Here's an example:

import dask.dataframe as dd
import pandas as pd

# Create a range of dates
dates = pd.date_range(start='2024-01-01', periods=9, freq='D')

# Create a Pandas DataFrame with a series of numbers including NaN values and a datetime index
pdf = pd.DataFrame({
    'n1': [1, 1, 2, 2, 2, 2, 2, 2, 2],
    'n2': [1, pd.NA, 1, 5, pd.NA, 7, 8, 9, 10],
}, index=dates)

# Convert the Pandas DataFrame to a Dask DataFrame
ddf = dd.from_pandas(pdf, npartitions=2)

copy = ddf.copy().rename(columns={"n2": "n1_new"})
ddf = ddf.merge(copy, on=["n1"], how="left")

# THIS ONE REMOVES BOTH n1_new and n1:
ddf = ddf.drop("n1_new", axis=1) 
# THIS ONE (CORRECTLY) REMOVES JUST n1_new: 
# ddf = ddf.drop(columns=["n1_new"])

ddf.head()
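For comparison, plain pandas treats both spellings identically and removes only the named column, which is presumably the behaviour Dask is expected to match. A minimal sketch with made-up values and the same column names as above:

```python
import pandas as pd

# Same column names as in the Dask example, with arbitrary values.
pdf = pd.DataFrame({"n1": [1, 2], "n2": [3, 4], "n1_new": [5, 6]})

# Positional label plus axis=1.
by_axis = pdf.drop("n1_new", axis=1)

# Keyword form.
by_keyword = pdf.drop(columns=["n1_new"])

print(list(by_axis.columns))     # ['n1', 'n2']
print(list(by_keyword.columns))  # ['n1', 'n2']
```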

fjetter commented on July 30, 2024

@thomas-fred I would consider it best practice to have an example prepared that does not involve sharing the actual parquet file regardless of how small it is. Parquet has been known to be vulnerable to arbitrary code execution. While the known issues have been fixed (see https://www.cve.org/CVERecord?id=CVE-2023-47248) it still requires a little trust to load a file from an otherwise unknown source.

I tried reproducing what you're describing with:

import dask
import dask.dataframe as dd
import pandas as pd
pdf = pd.DataFrame({"foo": range(5)}, index=range(50, 55))
pdf.to_parquet("test.parquet")
dd.read_parquet("test.parquet").index.compute()

and encountered this error: dask/dask-expr#993

Can you attempt to recreate your issue this way?

fjetter commented on July 30, 2024

@jackguac I have a fix for this up here: dask/dask-expr#992

benrutter commented on July 30, 2024

General feedback: I'm really excited about dask getting a query optimiser, but it seems like there are still a few sharp edges.

We (my work) have a library with a bunch of data pipeline functions, some of which are fairly involved, and all of which use dask.

I trialled bumping up the dask version today and our test suite hit 2 (previously passing) failures before hanging indefinitely.

I've raised a couple of bugs for the issues I can recreate. I will keep digging into the specifics and raise any bugs I see (and put in a PR if I track down the specific cause). In the meantime, we're setting query planning to false and will check in again with the next dask version.

Hope this is useful feedback btw, very excited about dask expressions becoming a thing.

thomas-fred commented on July 30, 2024

Hello @fjetter, thanks for getting back so promptly. I didn't know about arbitrary code execution issues with parquet files. I knew that my example wasn't ideal, but I also wanted to post something before I stopped work for the weekend. Anyway, I have played with your example and can reproduce your ZeroDivisionError.

I think the specific problem I ran into concerns accessing the values of a string index:

import dask.dataframe as dd
import pandas as pd
pdf = pd.DataFrame({"foo": range(3)}, index=["a", "b", "c"])
pdf.to_parquet("test.pq")
df = dd.read_parquet("test.pq")
print(f"{df.index.compute()=}")
print(f"{df.index.compute().values=}")

phofl commented on July 30, 2024

Hi @thomas-fred, thanks for providing the reproducer. We have a fix here: dask/dask-expr#1000

avriiil commented on July 30, 2024

hey folks :) I'm getting a 404 error when trying to access the docs for dask-expr as listed in the changelog. Is there somewhere else I should be looking for API reference?

phofl commented on July 30, 2024

@kritikakshirsagar03 Could you add a bit more context on what you want to tell us with this?

zmbc commented on July 30, 2024

xref dask/dask-expr#1060 -- looks like a fix is already in the works!

frbelotto commented on July 30, 2024

How do I disable query planning? I am facing some weird issues (my code freezes with no errors for hours)

avriiil commented on July 30, 2024

@frbelotto - you can disable query planning with

import dask
dask.config.set({'dataframe.query-planning': False})

see also: https://docs.dask.org/en/stable/changelog.html#query-planning
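One detail worth spelling out: as far as I understand the changelog, the option is read when dask.dataframe is first imported, so it has to be set before that import. A minimal sketch that sets the option and reads it back to confirm it took effect:

```python
import dask

# Set this before the first `import dask.dataframe` anywhere in the process;
# setting it afterwards is too late to switch implementations.
dask.config.set({"dataframe.query-planning": False})

# Read the value back to confirm the setting is active.
print(dask.config.get("dataframe.query-planning"))
```

Through Dask's usual environment-variable mapping, exporting DASK_DATAFRAME__QUERY_PLANNING=False before starting Python should have the same effect. Later Dask releases may no longer ship the legacy implementation, so check the changelog for the version in use.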

fjetter commented on July 30, 2024

Quoting @frbelotto: "How do I disable query planning? I am facing some weird issues (my code freezes with no errors for hours)"

It would be very helpful if you could post a reproducer. We've encountered things like this in the past, and the fixes are typically easy if we have a reproducer available.

manschoe commented on July 30, 2024

Quoting @frbelotto: "How do I disable query planning? I am facing some weird issues (my code freezes with no errors for hours)"

We have the same issue on version 2024.5.1. Nothing popped up during testing, but once users started using the newest version, things suddenly started hanging. Suspiciously, it is not easy to reproduce, as sometimes it just starts working again after some time. We have now rolled back to version 2024.2.1, from before query planning was introduced, because turning off query planning did not help either. What could we do to figure this out?

benrutter commented on July 30, 2024

Just picking up from @manschoe, but it would be really handy if anyone knowledgeable on dask-expr could put together a guide on tracking down issues!

I've come across quite a few issues when trying to upgrade where I'm seeing unexpected failures or hangs that appear to relate to dask-expr, but the nature of a query optimiser makes creating a good reproducer especially tricky. I can't share corporate code or datasets, but without the exact examples, it's hard to find exactly what has happened within dask-expr to cause the unexpected behaviour.

I think it's a little bit in the nature of optimisers, but I'm sure it's partly inexperience on my part with knowing how to hunt stuff down.

Would appreciate any pointers or guides! I'd love to help track down bugs, but don't really want to submit anything that's too vague.
