
Comments (5)

jorisvandenbossche commented on July 29, 2024

@FlorisCalkoen thanks for the report!

I can reproduce the error, but I am not sure it is related to the pyarrow version (I get it with pyarrow 10 and 11 as well). It may rather be related to the dask/distributed version.

My current understanding is that this error comes from the new shuffle implementation in distributed (https://blog.coiled.io/blog/shuffling-large-data-at-constant-memory.html, starting with dask 2023.2.1), which now uses Arrow IPC to serialize the data and send it between workers. But converting a geopandas.GeoDataFrame to a pyarrow.Table doesn't work out of the box, because Arrow doesn't know what to do with the geometry column.

I can confirm this by explicitly requesting the older task-based shuffle:

ddf.spatial_shuffle(shuffle="tasks")

That works without error for me.

from dask-geopandas.

jorisvandenbossche commented on July 29, 2024

We should of course ensure this works with the new P2P shuffle as well, as that brings many benefits.
I have to look a bit closer into it, but essentially we have to make the following work:

In [7]: rivers = gpd.read_file(geodatasets.get_path("eea large_rivers")).to_crs(4326)

In [8]: import pyarrow as pa

In [9]: pa.Table.from_pandas(rivers)
...
ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column geometry with type geometry')

This is something that could be fixed on the GeoPandas side by defining an Arrow extension type (to control how the geometry column gets converted to Arrow and back). However, I am not fully sure how dask/distributed could know we want a GeoDataFrame back and not a DataFrame (something to try out).

Alternatively, dask/distributed needs to give us some way to register a method that overrides this default conversion, similar to what we did for dask's to_parquet, where we registered a pyarrow_schema_dispatch (82da8f1).

martinfleis commented on July 29, 2024

cc @hendrikmakait, you were interested in how P2P works with dask-geopandas. It doesn't at the moment :).

jorisvandenbossche commented on July 29, 2024

With the following patch to geopandas, the above example works:

--- a/geopandas/array.py
+++ b/geopandas/array.py
@@ -1257,7 +1257,10 @@ class GeometryArray(ExtensionArray):
         # GH 1413
         if isinstance(scalars, BaseGeometry):
             scalars = [scalars]
-        return from_shapely(scalars)
+        try:
+            return from_shapely(scalars)
+        except TypeError:
+            return from_wkb(scalars)
 
     def _values_for_factorize(self):
         # type: () -> Tuple[np.ndarray, Any]
@@ -1454,6 +1457,11 @@ class GeometryArray(ExtensionArray):
         """
         return to_shapely(self)
 
+    def __arrow_array__(self, type=None):
+        # convert the underlying array values to a pyarrow Array
+        import pyarrow
+        return pyarrow.array(to_wkb(self), type=type)
+
     def _binop(self, other, op):
         def convert_values(param):
             if not _is_scalar_geometry(param) and (

Explanation:

  • Adding __arrow_array__ ensures that the conversion to an Arrow table works automatically (distributed's P2P shuffle does this by calling pa.Table.from_pandas on each chunk).
  • With only __arrow_array__ added, the conversion back from Arrow to pandas fails: distributed tries to cast the DataFrame converted from pyarrow back to the original dtypes (df.astype(meta.dtypes, copy=False) at https://github.com/dask/distributed/blob/9beab9a06a7777cb8d6bb2d90ae961b69de2e532/distributed/shuffle/_arrow.py#L71-L72). This fails because geopandas currently doesn't support Series[binary WKB values].astype("geometry") (although it probably should).
  • Overriding _from_sequence to call from_wkb if from_shapely fails is a quick workaround to get .astype("geometry") working.

jorisvandenbossche commented on July 29, 2024

The points I raised are things we should address in geopandas at some point anyway (although there are open questions about which default representation to use when converting to Arrow). There are also solutions in dask and distributed themselves: dask added a dispatch method for pyarrow<->pandas conversion (dask/dask#10312), which we can implement. I think that should also fix this issue once distributed uses that dispatch method (a WIP PR for this is at dask/distributed#7743).
