Comments (9)
Thanks for chasing this down! I agree this should raise; I opened #11174 to address this.
What pyarrow version did you use to create the dataset, and which one are you using to read it?
@fjetter, thanks for the quick reply. We installed pyarrow 14.0.2.
The important observation is that the TypeError disappears if I take only part of the dataset as follows:
ddf.loc[:100000]
However, disabling dask-expr still leads to an error:
ValueError: Length of values (0) does not match length of index (100001)
(I've added a note to the description)
@fjetter, in the case of TypeError: cannot concatenate object of type '<class 'tuple'>'; only Series and DataFrame objs are valid, I'm getting the following values here:
https://github.com/dask/dask/blob/main/dask/dataframe/backends.py#L688
I added the following debug code on all nodes:
from dask.distributed import print
print(dfs3[0], dfs3[1], dfs3, join)
out = pd.concat(dfs3, join=join, sort=False)
dfs3[0] is a DataFrame.
dfs3[1] is a tuple with the following value:
('repartition-split-100000000-fe2ba693bab4c1021c4766fe26e0d5dc', 4)
dfs3 is a list that looks as follows:
[ .. .[48081 rows x 4 columns], ('repartition-split-100000000-fe2ba693bab4c1021c4766fe26e0d5dc', 4), ('repartition-split-100000000-fe2ba693bab4c1021c4766fe26e0d5dc', 5), ('repartition-split-100000000-fe2ba693bab4c1021c4766fe26e0d5dc', 6), ..., ('repartition-split-100000000-fe2ba693bab4c1021c4766fe26e0d5dc', 63)]
(the list continues with tuples carrying the same key for partitions 7 through 62, rejoined here from the wrapped output)
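The TypeError above can be reproduced in isolation: pd.concat rejects any list element that is not a Series or DataFrame, which is exactly what happens when an unresolved dask task reference (a ("key", partition) tuple) reaches the concat step. A minimal sketch (the key string here is just a placeholder, not a real task key):

```python
import pandas as pd

# A real partition plus an unresolved task reference, as seen in dfs3 above.
df = pd.DataFrame({"a": [1, 2]})
pieces = [df, ("repartition-split-placeholder-key", 4)]

try:
    pd.concat(pieces, join="outer", sort=False)
except TypeError as e:
    print(e)  # cannot concatenate object of type '<class 'tuple'>'; ...
```

This suggests the bug is upstream of concat: the graph hands concat a task key that was never materialized into a DataFrame.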
@fjetter, in the case of ValueError: Length of values (0) does not match length of index (12295809), I'm getting the following values here:
https://github.com/dask/dask/blob/main/dask/dataframe/backends.py#L523-L525
I added the following debug code on all nodes:
@hash_object_dispatch.register((pd.DataFrame, pd.Series, pd.Index))
def hash_object_pandas(
    obj, index=True, encoding="utf8", hash_key=None, categorize=True
):
    from dask.distributed import print
    print(obj)
    print(index)
    print(encoding)
    print(hash_key)
    print(categorize)
    return pd.util.hash_pandas_object(
        obj, index=index, encoding=encoding, hash_key=hash_key, categorize=categorize
    )
Here is the output:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[12295809 rows x 0 columns]
False
utf8
None
True
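The output shows a DataFrame with zero columns but a populated index, which is enough on its own to trigger the ValueError: with index=False, pandas hashes only the columns, producing zero hash values that cannot be aligned with a non-empty index. A minimal sketch using plain pandas:

```python
import pandas as pd

# A 0-column DataFrame with a populated index -- the shape that reaches
# hash_object_pandas above when the shuffle column does not exist.
df = pd.DataFrame(index=range(5))

try:
    pd.util.hash_pandas_object(df, index=False)
except ValueError as e:
    print(e)  # Length of values (0) does not match length of index (5)
```

So the error message is a symptom: the real problem is that an empty column selection was silently carried all the way to the hashing step.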
@fjetter, the problem is that I'm shuffling using a non-existing column. Dask should provide a proper error message in this case. I've added reproduction steps. Please see:
- Use any parquet data
- Try to shuffle using a non-existing column and export the data:
ddf = (dd
       .read_parquet('gs://.../....parquet')
       .shuffle(on='does_not_exist', npartitions=64)
       .repartition(partition_size='100MB')
       .to_parquet('data_test.parquet'))
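Until a fix like #11174 raises eagerly inside dask, a caller-side guard along these lines catches the mistake before any computation starts. This is a sketch; validate_shuffle_on is a hypothetical helper, not part of dask's API:

```python
def validate_shuffle_on(columns, on):
    """Return the shuffle keys absent from `columns` (hypothetical helper).

    `columns` is any iterable of column names (e.g. ddf.columns);
    `on` is a column name or list of names, as accepted by shuffle().
    """
    keys = [on] if isinstance(on, str) else list(on)
    return [k for k in keys if k not in set(columns)]

# Usage sketch: check before calling ddf.shuffle(on=..., npartitions=...).
missing = validate_shuffle_on(["x", "y"], "does_not_exist")
if missing:
    print(f"Cannot shuffle on missing column(s): {missing}")
```

Raising a KeyError at this point would replace both late failures (the concat TypeError and the hashing ValueError) with a clear, early message.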
@fjetter, the problem isn't strictly related to dask-expr. A different problem appears even when dask-expr is off via dask.config.set({"dataframe.query-planning": False}), as described in the ticket's description.
Indeed, I transferred the issue to dask/dask.
@fjetter , after spending some time testing the solution, I've found that there are two separate problems:
Problem 1
The following problem still appears when dask-expr is enabled, and I don't have a reproduction step yet.
TypeError: cannot concatenate object of type '<class 'tuple'>'; only Series and DataFrame objs are valid
Please see more details here: #11160 (comment)
Problem 2
The following problem is gone; it appears only when dask-expr is disabled and an incorrect on=<column> is provided to the shuffle() method:
ValueError: Length of values (0) does not match length of index (100001)
Please see #11160 (comment)