Comments (3)
@fjetter offline you expressed concern for dask-expr optimization performance. I'm observing a 50~150ms slowdown for the full TPCH queries.
IMHO it's negligible.
Runtime for graph definition + optimization:
Note that this includes fetching the metadata of the input dataframe from S3, which I suspect accounts for the lion's share of both the mean time and the variance (this is just an intuition; I didn't collect numerical evidence for it).
end-to-end runtime on the Coiled cluster:
The other TPCH queries show similar behaviour.
from dask.
All PRs are now only waiting for review
Summary of changes
- tokenize() is now deterministic, within the same interpreter, in most cases
- in the rare edge cases where it is not, you can trust `tokenize(..., ensure_deterministic=True)` to raise robustly
- there should be no expectation of determinism across interpreter restarts, hosts, OSs, or dependency versions
The issue of key collision on the scheduler (same key, but different run_spec and possibly different dependencies), which is chiefly caused by #9888, has been mitigated:
Legend
✔️ produces correct output
📛 cluster crashes or hangs on AssertionError
😕 task completes successfully, but output is wrong
run_spec | Task output | Dependencies | Old task status | 2024.2.0 | 2024.2.1 | Use case of #9888? |
---|---|---|---|---|---|---|
same | same | same | * | ✔️ | ✔️ | no |
differs | same | same | pending | ✔️ | yes | |
differs | same | new task has fewer | pending | ✔️ | yes | |
differs | same | new task has more | pending | 📛 | yes | |
differs | same | same | memory | ✔️ | ✔️ | yes |
differs | same | new task has fewer | memory | ✔️ | yes | |
differs | same | new task has more | memory | ✔️ | ✔️ | yes |
differs | same | * | released | ✔️ | yes | |
differs | differs | same | pending | 😕 | no | |
differs | differs | new task has fewer | pending | 😕 | no | |
differs | differs | new task has more | pending | 📛 | no | |
differs | differs | same | memory | 😕 | 😕[1] | no |
differs | differs | new task has fewer | memory | 😕 | no | |
differs | differs | new task has more | memory | 😕 | 😕[1] | no |
differs | differs | * | released | 😕 | no |
[1] this is not great and could deserve a follow-up
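To make the rows of the table concrete, here is a toy sketch of what "same key, but different run_spec and possibly different dependencies" means. It is not the distributed scheduler's code: `ToyScheduler`, `submit`, and the returned labels are hypothetical names, and keeping the old definition on collision is an arbitrary choice made for the illustration.

```python
# Toy illustration only; not the distributed scheduler's implementation.
# ToyScheduler, submit and the returned labels are hypothetical names.


class ToyScheduler:
    def __init__(self):
        # key -> (run_spec, dependencies)
        self.tasks = {}

    def submit(self, key, run_spec, dependencies=()):
        """Register a task and report how it relates to any existing one."""
        new = (run_spec, frozenset(dependencies))
        old = self.tasks.get(key)
        if old is None:
            self.tasks[key] = new
            return "new"
        if old == new:
            # First row of the table: nothing differs.
            return "same"
        # Same key, different definition. This toy keeps the old definition
        # and only reports which parts differ (an assumption of the sketch).
        parts = []
        if old[0] != new[0]:
            parts.append("run_spec differs")
        if new[1] < old[1]:
            parts.append("new task has fewer dependencies")
        elif new[1] > old[1]:
            parts.append("new task has more dependencies")
        elif new[1] != old[1]:
            parts.append("dependencies differ")
        return "collision: " + ", ".join(parts)


s = ToyScheduler()
first = s.submit("x", ("inc", 1), dependencies={"a"})
again = s.submit("x", ("inc", 1), dependencies={"a"})
clash = s.submit("x", ("inc", 2), dependencies={"a", "b"})
print(first, again, clash, sep="\n")
```

The sketch deliberately ignores the old task's status (pending, memory, released), which the table shows to be significant in the real scheduler.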