Comments (17)
Sorry about that, we switched the link: https://docs.dask.org/en/stable/dataframe-api.html
There are some general tips for minimal bug reports here: https://matthewrocklin.com/minimal-bug-reports
Generally, start with randomized data and simplify the example step by step. I don't think you'll be leaking any corporate code if you hand over pure dask code. It's tedious work that takes time, but there is no automatic tool that can generate a reproducer for you.
For the specific case where things are hanging, it may be possible to generate a py-spy profile and share that, or a screenshot of it. Note that this obviously leaks function names, file paths, etc., and some people consider this sensitive information.
py-spy record -s -t -o stuck-dask-expr.svg --format=speedscope -- python my_reproducer.py
Note: on macOS this has to run with sudo. You can abort this after a couple of seconds with Ctrl + C and share the profile. This will at least tell us which part of the code is slow or hanging. There are a couple of code paths that are non-linear if certain things are not cached, and this might help us. On top of this, a pprint or visualize could be useful.
This does not really replace the need for a reproducer, since otherwise it is really guesswork on our end, and the time we can invest in an issue report like this is not that large.
If you provide us with a minimal reproducer, we'll likely have a same-day fix for you.
Hello, I'm wondering if I've stumbled into a bug/change in usage when updating from 2024.2.0 to 2024.3.0. With the former, the code below works, and indicies is bound to a list of index values. With the latter, I get an empty list.
df = dask.dataframe.read_parquet(<file_path>)
indicies = list(set(df.index))
The strange thing is that with 2024.3.0, the following seems to know about the index values; look at the repr string!
(Pdb) df.index.compute()
Empty DataFrame
Columns: []
Index: [2007345N18298, <truncated>, 2022255N15324]
But when you actually try to access those values ...
(Pdb) df.index.compute().values
array([], shape=(10, 0), dtype=float64)
Here's the data I'm trying to read as zipped parquet: test_data.zip
I installed both dask versions using micromamba 1.5.7.
Using drop(col, axis=1) drops any column that shares the prefix col, but everything is fine when using drop(columns=col), and I think it was also fine on 2024.2.1. Here's an example:
import dask.dataframe as dd
import pandas as pd
# Create a range of dates
dates = pd.date_range(start='2024-01-01', periods=9, freq='D')
# Create a Pandas DataFrame with a series of numbers including NaN values and a datetime index
pdf = pd.DataFrame({
'n1': [1, 1, 2, 2, 2, 2, 2, 2, 2],
'n2': [1, pd.NA, 1, 5, pd.NA, 7, 8, 9, 10],
}, index=dates)
# Convert the Pandas DataFrame to a Dask DataFrame
ddf = dd.from_pandas(pdf, npartitions=2)
copy = ddf.copy().rename(columns={"n2": "n1_new"})
ddf = ddf.merge(copy, on=["n1"], how="left")
# THIS ONE REMOVES BOTH n1_new and n1:
ddf = ddf.drop("n1_new", axis=1)
# THIS ONE (CORRECTLY) REMOVES JUST n1_new:
# ddf = ddf.drop(columns=["n1_new"])
ddf.head()
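For comparison, in plain pandas the two spellings are equivalent and drop exactly the named column, which is why the dask behaviour above looks like a regression. A quick pandas-only check (the tiny two-column frame is hypothetical):

```python
import pandas as pd

pdf = pd.DataFrame({"n1": [1, 2], "n1_new": [3, 4]})

# Both spellings should drop only the named column, leaving n1 intact
a = pdf.drop("n1_new", axis=1)
b = pdf.drop(columns=["n1_new"])
print(list(a.columns), list(b.columns))  # -> ['n1'] ['n1']
```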
@thomas-fred I would consider it best practice to have an example prepared that does not involve sharing the actual parquet file regardless of how small it is. Parquet has been known to be vulnerable to arbitrary code execution. While the known issues have been fixed (see https://www.cve.org/CVERecord?id=CVE-2023-47248) it still requires a little trust to load a file from an otherwise unknown source.
I tried reproducing what you're describing as follows:
import dask
import dask.dataframe as dd
import pandas as pd
pdf = pd.DataFrame({"foo": range(5)}, index=range(50, 55))
pdf.to_parquet("test.parquet")
dd.read_parquet("test.parquet").index.compute()
and encountered this error: dask/dask-expr#993
Can you attempt to recreate your issue this way?
@jackguac I have a fix for this up here dask/dask-expr#992
General feedback: I'm really excited about dask getting a query optimiser, but it seems like there are a few sharp edges still.
We (my work) have a library with a bunch of data pipeline functions, some of which are fairly involved, and all of which use dask.
I trialled bumping up the dask version today and our test suite hit two previously passing tests failing before hanging indefinitely.
I've raised a couple of bugs for the issues I can recreate; I'll keep digging into the specifics and raise any bugs I see (and put in a PR if I track down the specific cause). In the meantime, we're setting query planning to false and will check in again with the next dask version.
Hope this is useful feedback btw; very excited about dask expressions becoming a thing.
Hello @fjetter, thanks for getting back so promptly. I didn't know about arbitrary code execution issues with parquet files. I knew that my example wasn't ideal, but I also wanted to post something before I stopped work for the weekend. Anyway, I have played with your example and can reproduce your ZeroDivisionError.
I think the specific problem I ran into concerns accessing the values of a string index:
import dask.dataframe as dd
import pandas as pd
pdf = pd.DataFrame({"foo": range(3)}, index=["a", "b", "c"])
pdf.to_parquet("test.pq")
df = dd.read_parquet("test.pq")
print(f"{df.index.compute()=}")
print(f"{df.index.compute().values=}")
Hi @thomas-fred, thanks for providing the reproducer. We have a fix here: dask/dask-expr#1000
Hey folks :) I'm getting a 404 error when trying to access the docs for dask-expr as listed in the changelog. Is there somewhere else I should be looking for the API reference?
@kritikakshirsagar03 Could you add a bit more context about what you want to tell us with this?
xref dask/dask-expr#1060 -- looks like a fix is already in the works!
How do I disable query planning? I am facing some weird issues (my code freezes with no errors for hours)
@frbelotto - you can disable query planning with
import dask
dask.config.set({'dataframe.query-planning': False})
see also: https://docs.dask.org/en/stable/changelog.html#query-planning
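One caveat (based on how the flag works: it is read when dask.dataframe is first imported): set it before that import, or use the DASK_DATAFRAME__QUERY_PLANNING environment variable instead. A minimal sketch:

```python
import dask

# Must run before the first `import dask.dataframe`;
# the flag is read at import time, so setting it later has no effect.
dask.config.set({"dataframe.query-planning": False})

print(dask.config.get("dataframe.query-planning"))  # -> False
```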
> How do I disable query planning? I am facing some weird issues (my code freezes with no errors for hours)
It would be very helpful if you could post a reproducer. We've encountered things like this in the past and the fixes are typically easy if we have a reproducer available
> How do I disable query planning? I am facing some weird issues (my code freezes with no errors for hours)
We have the same issue on version 2024.5.1: during testing nothing popped up, and then after users started using the newest version, things suddenly started hanging. Suspiciously, it is not easily reproducible, as sometimes it just starts working after some time. We have now rolled back to version 2024.2.1, from before query planning, since turning off query planning did not help. What could we do to figure this out?
Just picking up from @manschoe, but it would be really handy if anyone knowledgeable on dask-expr could put together a guide on tracking down issues!
I've come across quite a few issues when trying to upgrade, where I'm seeing unexpected failures or hangs that appear to relate to dask-expr, but the nature of a query optimiser makes creating a good reproducer especially tricky. I can't share corporate code or datasets, but without the exact examples it's hard to find exactly what's happened within dask-expr to cause the unexpected behaviour.
I think that's a little bit in the nature of optimisers, but I'm sure it's partly inexperience on my part with knowing how to hunt stuff down.
Would appreciate any pointers or guides! I'd love to help track down bugs, but don't really want to submit anything that's too vague.
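One cheap triage step, sketched below: run the same test suite with query planning toggled via dask's environment-variable config (the tests/ path and pytest are assumptions about your setup). If a failure or hang flips with the flag, it's worth reporting against dask-expr:

```shell
# Toggle query planning without touching code; dask maps this env var
# onto the dataframe.query-planning config option.
DASK_DATAFRAME__QUERY_PLANNING=False python -m pytest tests/
DASK_DATAFRAME__QUERY_PLANNING=True python -m pytest tests/
```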