Comments (5)
@FlorisCalkoen thanks for the report!
I can reproduce the error, but I don't think it is related to the pyarrow version (I get the error with pyarrow 10 and 11 as well). It is more likely related to the dask/distributed version.
My current understanding is that this error comes from the new shuffle implementation in distributed (https://blog.coiled.io/blog/shuffling-large-data-at-constant-memory.html, starting with dask 2023.2.1), which now uses Arrow IPC to serialize the data and send it between workers. But converting a geopandas.GeoDataFrame to a pyarrow.Table doesn't work out of the box, because Arrow doesn't know what to do with the geometry column.
And I can confirm this by specifying to use the older task-based shuffling:
ddf.spatial_shuffle(shuffle="tasks")
That works without error for me.
from dask-geopandas.
We should of course ensure this works with the new P2P shuffle as well, as that brings many benefits.
I have to look a bit closer into it, but essentially we have to make the following work:
In [7]: rivers = gpd.read_file(geodatasets.get_path("eea large_rivers")).to_crs(4326)
In [8]: import pyarrow as pa
In [9]: pa.Table.from_pandas(rivers)
...
ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column geometry with type geometry')
This is something that could be fixed on the GeoPandas side by defining an arrow extension type (to control how the geometry column gets converted to arrow and back). However, I am not fully sure how dask/distributed could know we want back a GeoDataFrame and not a DataFrame (something to try out).
Or dask/distributed needs to give us some way to register a method that overrides this default conversion, similar to what we did for dask's to_parquet by registering a pyarrow_schema_dispatch (82da8f1).
cc @hendrikmakait, you were interested in how P2P works with dask-geopandas. It doesn't at the moment :).
With the following patch to geopandas, the above example works:
--- a/geopandas/array.py
+++ b/geopandas/array.py
@@ -1257,7 +1257,10 @@ class GeometryArray(ExtensionArray):
         # GH 1413
         if isinstance(scalars, BaseGeometry):
             scalars = [scalars]
-        return from_shapely(scalars)
+        try:
+            return from_shapely(scalars)
+        except TypeError:
+            return from_wkb(scalars)
 
     def _values_for_factorize(self):
         # type: () -> Tuple[np.ndarray, Any]
@@ -1454,6 +1457,11 @@ class GeometryArray(ExtensionArray):
         """
         return to_shapely(self)
 
+    def __arrow_array__(self, type=None):
+        # convert the underlying array values to a pyarrow Array
+        import pyarrow
+        return pyarrow.array(to_wkb(self), type=type)
+
     def _binop(self, other, op):
         def convert_values(param):
             if not _is_scalar_geometry(param) and (
Explanation:
- Adding __arrow_array__ ensures that the conversion to an Arrow table works automatically (which is done in distributed's P2P shuffle: it calls pa.Table.from_pandas on each chunk).
- When only adding __arrow_array__, the conversion back from Arrow to pandas fails: distributed will try to cast the pyarrow->pandas converted DataFrame to the original dtypes (df.astype(meta.dtypes, copy=False) at https://github.com/dask/distributed/blob/9beab9a06a7777cb8d6bb2d90ae961b69de2e532/distributed/shuffle/_arrow.py#L71-L72). This fails because geopandas currently doesn't support doing Series[binary WKB values].astype("geometry") (while we probably should?).
- Overriding _from_sequence to call from_wkb if from_shapely fails is a quick workaround to get .astype("geometry") working.
While the points I raised are things we should address in geopandas anyway at some point (although there are open questions about which default representation to use when converting to Arrow), there are also solutions in dask and distributed itself. dask added a dispatch method for pyarrow<->pandas conversion (dask/dask#10312), which we can implement. I think that should also fix this issue once that dispatch method is used in distributed (a WIP PR for this is at dask/distributed#7743).