Comments (17)
Sorry about that, we switched the link: https://docs.dask.org/en/stable/dataframe-api.html
There are some general tips for minimal bug reports here: https://matthewrocklin.com/minimal-bug-reports
Generally, start with randomized data and simplify the example step by step. I don't think you'll be leaking any corporate code if you hand over pure dask code. It's tedious work that takes time, but there is no automatic tool that can generate a reproducer for you.
For the specific case where things are hanging, it may be possible to generate a py-spy profile and share that, or a screenshot of it. Note that this obviously leaks function names, file paths, etc., and some people consider this sensitive information.
py-spy record -s -t -o stuck-dask-expr.svg --format=speedscope -- python my_reproducer.py
Note: on macOS this has to run with sudo. You can abort this after a couple of seconds with Ctrl + C and share the profile. This will at least tell us which part of the code is slow or hanging. There are a couple of code paths that are non-linear if certain things are not cached, and this might help us. On top of this, a pprint or visualize could be useful.
This does not really replace the need for a reproducer, since otherwise it is really guesswork on our end, and the time we can invest in an issue report like this is not that large.
If you provide us with a minimal reproducer, we'll likely have a same-day fix for you.
Hello, I'm wondering if I've stumbled into a bug/change in usage when updating from 2024.2.0 to 2024.3.0. With the former, the code below works, and indicies is bound to a list of index values. With the latter, I get an empty list.
df = dask.dataframe.read_parquet(<file_path>)
indicies = list(set(df.index))
The strange thing is that with 2024.3.0, the following seems to know about the index values; look at the repr string!
(Pdb) df.index.compute()
Empty DataFrame
Columns: []
Index: [2007345N18298, <truncated>, 2022255N15324]
But when you actually try to access those values ...
(Pdb) df.index.compute().values
array([], shape=(10, 0), dtype=float64)
Here's the data I'm trying to read as zipped parquet: test_data.zip
I installed both dask versions using micromamba 1.5.7.
Using drop(col, axis=1) drops any column that shares the prefix col, but everything is fine when using drop(columns=col), and I think it was also fine on 2024.2.1. Here's an example:
import dask.dataframe as dd
import pandas as pd
# Create a range of dates
dates = pd.date_range(start='2024-01-01', periods=9, freq='D')
# Create a Pandas DataFrame with a series of numbers including NaN values and a datetime index
pdf = pd.DataFrame({
'n1': [1, 1, 2, 2, 2, 2, 2, 2, 2],
'n2': [1, pd.NA, 1, 5, pd.NA, 7, 8, 9, 10],
}, index=dates)
# Convert the Pandas DataFrame to a Dask DataFrame
ddf = dd.from_pandas(pdf, npartitions=2)
copy = ddf.copy().rename(columns={"n2": "n1_new"})
ddf = ddf.merge(copy, on=["n1"], how="left")
# THIS ONE REMOVES BOTH n1_new and n1:
ddf = ddf.drop("n1_new", axis=1)
# THIS ONE (CORRECTLY) REMOVES JUST n1_new:
# ddf = ddf.drop(columns=["n1_new"])
ddf.head()
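For comparison, in plain pandas the two spellings are equivalent and drop exactly the named column, which is why the dask behaviour above looks like a regression. A quick pandas-only check (the tiny two-column frame is hypothetical):

```python
import pandas as pd

pdf = pd.DataFrame({"n1": [1, 2], "n1_new": [3, 4]})

# Both spellings should drop only the named column, leaving n1 intact
a = pdf.drop("n1_new", axis=1)
b = pdf.drop(columns=["n1_new"])
print(list(a.columns), list(b.columns))  # -> ['n1'] ['n1']
```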
@thomas-fred I would consider it best practice to have an example prepared that does not involve sharing the actual parquet file regardless of how small it is. Parquet has been known to be vulnerable to arbitrary code execution. While the known issues have been fixed (see https://www.cve.org/CVERecord?id=CVE-2023-47248) it still requires a little trust to load a file from an otherwise unknown source.
I tried reproducing what you're describing as follows:
import dask
import dask.dataframe as dd
import pandas as pd
pdf = pd.DataFrame({"foo": range(5)}, index=range(50, 55))
pdf.to_parquet("test.parquet")
dd.read_parquet("test.parquet").index.compute()
and encountered this error: dask/dask-expr#993
Can you attempt to recreate your issue this way?
@jackguac I have a fix for this up here dask/dask-expr#992
General feedback: I'm really excited about dask getting a query optimiser, but it seems like there are a few sharp edges still.
We (my work) have a library with a bunch of data pipeline functions, some of which are fairly involved, and all of which use dask.
I trialled bumping up the dask version today and our test suite hit two previously passing tests failing before hanging indefinitely.
I've raised a couple of bugs for the issues I can recreate; I'll keep digging into the specifics and raise any bugs I see (and put in a PR if I track down the specific cause). In the meantime, we're setting query planning to false and will check in again with the next dask version.
Hope this is useful feedback btw; very excited about dask expressions becoming a thing.
Hello @fjetter, thanks for getting back so promptly. I didn't know about arbitrary code execution issues with parquet files. I knew that my example wasn't ideal, but I also wanted to post something before I stopped work for the weekend. Anyway, I have played with your example and can reproduce your ZeroDivisionError.
I think the specific problem I ran into concerns accessing the values of a string index:
import dask.dataframe as dd
import pandas as pd
pdf = pd.DataFrame({"foo": range(3)}, index=["a", "b", "c"])
pdf.to_parquet("test.pq")
df = dd.read_parquet("test.pq")
print(f"{df.index.compute()=}")
print(f"{df.index.compute().values=}")
Hi @thomas-fred, thanks for providing the reproducer. We have a fix here: dask/dask-expr#1000
Hey folks :) I'm getting a 404 error when trying to access the docs for dask-expr as listed in the changelog. Is there somewhere else I should be looking for the API reference?
@kritikakshirsagar03 Could you add a bit more context about what you want to tell us with this?
xref dask/dask-expr#1060 -- looks like a fix is already in the works!
How do I disable query planning? I am facing some weird issues (my code freezes with no errors for hours)
@frbelotto - you can disable query planning with
import dask
dask.config.set({'dataframe.query-planning': False})
see also: https://docs.dask.org/en/stable/changelog.html#query-planning
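One caveat (based on how the flag works: it is read when dask.dataframe is first imported): set it before that import, or use the DASK_DATAFRAME__QUERY_PLANNING environment variable instead. A minimal sketch:

```python
import dask

# Must run before the first `import dask.dataframe`;
# the flag is read at import time, so setting it later has no effect.
dask.config.set({"dataframe.query-planning": False})

print(dask.config.get("dataframe.query-planning"))  # -> False
```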
> How do I disable query planning? I am facing some weird issues (my code freezes with no errors for hours)
It would be very helpful if you could post a reproducer. We've encountered things like this in the past and the fixes are typically easy if we have a reproducer available
> How do I disable query planning? I am facing some weird issues (my code freezes with no errors for hours)
We have the same issue on version 2024.5.1: during testing nothing popped up, and then after users started using the newest version, things suddenly started hanging. Suspiciously, it is not easily reproducible, as sometimes it just starts working after some time. We have now rolled back to version 2024.2.1, from before query planning, since turning off query planning did not help. What could we do to figure this out?
Just picking up from @manschoe, but it would be really handy if anyone knowledgeable on dask-expr could put together a guide on tracking down issues!
I've come across quite a few issues when trying to upgrade, where I'm seeing unexpected failures or hangs that appear to relate to dask-expr, but the nature of a query optimiser makes creating a good reproducer especially tricky. I can't share corporate code or datasets, but without the exact examples it's hard to find exactly what's happened within dask-expr to cause the unexpected behaviour.
I think that's a little bit in the nature of optimisers, but I'm sure it's partly inexperience on my part with knowing how to hunt stuff down.
Would appreciate any pointers or guides! I'd love to help track down bugs, but don't really want to submit anything that's too vague.
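One cheap triage step, sketched below: run the same test suite with query planning toggled via dask's environment-variable config (the tests/ path and pytest are assumptions about your setup). If a failure or hang flips with the flag, it's worth reporting against dask-expr:

```shell
# Toggle query planning without touching code; dask maps this env var
# onto the dataframe.query-planning config option.
DASK_DATAFRAME__QUERY_PLANNING=False python -m pytest tests/
DASK_DATAFRAME__QUERY_PLANNING=True python -m pytest tests/
```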