developmentseed / lonboard


Python library for fast, interactive geospatial vector data visualization in Jupyter.

Home Page: https://developmentseed.org/lonboard/latest/

License: MIT License

JavaScript 0.14% Python 78.27% TypeScript 21.39% CSS 0.20%

anywidget apache-arrow deck-gl geoarrow geopandas jupyter jupyter-widget longboard map-visualization data-visualization

lonboard's Introduction

Lonboard

A Python library for fast, interactive geospatial vector data visualization in Jupyter.

Building on cutting-edge technologies like GeoArrow and GeoParquet in conjunction with GPU-based map rendering, Lonboard aims to enable visualizing large geospatial datasets interactively through a simple interface.

3 million points rendered from a GeoPandas GeoDataFrame in JupyterLab. Example notebook.

Install

To install Lonboard using pip:

pip install lonboard

Lonboard is on conda-forge and can be installed using conda, mamba, or pixi. To install Lonboard using conda:

conda install -c conda-forge lonboard

To install from source, refer to the developer documentation.

Get Started

For the simplest rendering, pass geospatial data into the top-level viz function.

import geopandas as gpd
from lonboard import viz

gdf = gpd.GeoDataFrame(...)
viz(gdf)

Under the hood, this delegates to a ScatterplotLayer, PathLayer, or PolygonLayer. Refer to the documentation and examples for more control over rendering.

Documentation

Refer to the documentation at developmentseed.org/lonboard.

Why the name?

This is a new binding to the deck.gl geospatial data visualization library. A "deck" is the part of a skateboard you ride on. What's a fast, geospatial skateboard? A lonboard.

lonboard's Issues

Allow pandas series to FloatAccessor

TraitError: The 'get_radius' trait of a ScatterplotLayer instance expected a float value or numpy ndarray or pyarrow array representing an array of floats, not the Series ...

Validate accessors have same array length as main table

See https://traitlets.readthedocs.io/en/stable/using_traitlets.html#custom-cross-validation

Support array of hex strings in ColorAccessor trait

numpy array of strings that all start with #
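As a rough illustration of what such support could look like, the sketch below converts a numpy array of "#rrggbb" strings into an (N, 3) uint8 array. The function name hex_to_rgb_array is hypothetical and is not part of lonboard; only six-digit hex strings are handled here.

```python
import numpy as np


def hex_to_rgb_array(hex_colors):
    """Convert strings like "#ff8800" into an (N, 3) uint8 array.

    Hypothetical helper for illustration; not part of lonboard.
    """
    out = np.empty((len(hex_colors), 3), dtype=np.uint8)
    for i, color in enumerate(hex_colors):
        if not (isinstance(color, str) and color.startswith("#") and len(color) == 7):
            raise ValueError(f"expected '#rrggbb', got {color!r} at index {i}")
        # Parse the red, green, and blue pairs of hex digits
        out[i] = [int(color[j : j + 2], 16) for j in (1, 3, 5)]
    return out


colors = hex_to_rgb_array(np.array(["#ff0000", "#00ff00", "#0000ff"]))
```

The loop is not vectorized, so for very large inputs a lookup over the unique color strings would likely be faster.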

Private code to generate test data for geoarrow/deck.gl-layers

There's so much helper code here to create geoarrow-formatted data and validate other attributes that it would be nice to have a private method to export test data for the JS lib.

Manually created geoarrow table support in ScatterplotLayer

I was able to load 20 million polygons in lonboard. It was amazing! Now I am trying to figure out how to load 60 million points without having to use GeoPandas, but I keep hitting code paths that expect an interleaved list, that go back to numpy, or that expect a bytes-like object from C.

Here is roughly what I am trying:

import gzip

import geoarrow.pyarrow as ga
import lonboard
import pyarrow as pa
import pyarrow.csv as pv

with gzip.open("/Users/x/data/points_s2_level_4_gzip/397_buildings.csv.gz") as fp:
    table = pv.read_csv(fp)

# x is longitude and y is latitude
points = ga.point().from_geobuffers(None, table["longitude"], y=table["latitude"])

# Tag the geometry column with the GeoArrow extension name
geoarrow_schema = pa.schema([pa.field("geometry", points.type, metadata={b"ARROW:extension:name": b"geoarrow.point"})])

point_table = pa.Table.from_arrays([points], schema=geoarrow_schema)
point_table.schema.field("geometry").metadata.get(b"ARROW:extension:name")
map_ = lonboard.ScatterplotLayer(table=point_table)

allow geopandas input to `init` without having to call from_geopandas

Tooltip

It would be great to have a tooltip that shows the row of data when hovered or clicked. This relies on geoarrow/deck.gl-layers#30

OSM attribution in corner of map

The current map is missing any attribution; this should be fixed.

Type hints for vectorized accessor callbacks

Recall that you can use a Protocol with a __call__ method to define the API for a function callback. So for accessors like get_fill_color, you should define this protocol to take in a GeoDataFrame and return an NDArray[np.uint8].

You should also have runtime checks to verify the correct data format.
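A minimal sketch of that idea follows. The names ColorCallback and apply_color_callback are hypothetical, and the table argument is typed as Any so the example runs without geopandas; in practice it would be a GeoDataFrame.

```python
from typing import Any, Protocol

import numpy as np
from numpy.typing import NDArray


class ColorCallback(Protocol):
    """Hypothetical callback API: takes a table, returns one RGB(A) row per feature."""

    def __call__(self, table: Any) -> NDArray[np.uint8]: ...


def apply_color_callback(callback: ColorCallback, table: Any) -> NDArray[np.uint8]:
    """Call the callback and verify the result at runtime."""
    colors = np.asarray(callback(table))
    if colors.dtype != np.uint8:
        raise TypeError(f"expected dtype uint8, got {colors.dtype}")
    if colors.ndim != 2 or colors.shape[0] != len(table) or colors.shape[1] not in (3, 4):
        raise ValueError(f"expected shape ({len(table)}, 3 or 4), got {colors.shape}")
    return colors


def all_red(table: Any) -> NDArray[np.uint8]:
    # One red RGB row per feature in the table
    return np.tile(np.array([255, 0, 0], dtype=np.uint8), (len(table), 1))


rows = [{"id": 1}, {"id": 2}]  # stand-in for a GeoDataFrame
colors = apply_color_callback(all_red, rows)
```

A static type checker can verify that all_red matches ColorCallback, while the runtime checks catch callbacks that return the wrong dtype or row count.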

Docs: perf advice: Don't create new map objects

mutate existing map objects whenever possible. Every time you create a new map object from scratch, you have to download all that new data to your browser.

Better accessor documentation

Group examples by amount of data downloaded

e.g. some examples should be illustrative and just use a very small data download. other examples should show off performance, and thus require large datasets (and maybe a large filter of an even larger dataset) but should be grouped in such a way to make it very clear.

docs note about datashader vs lonboard

One note about the difference between datashader and my deck.gl-based visualization... It looked like datashader was re-rendering in a specific area when Joris zoomed in and panned around. So in that sense datashader is "minimizing rendering" based on the viewport. My deck.gl-based renderer does not minimize rendering... When I'm rendering 3 million points, all 3 million of those are loaded onto the user's GPU at once. So in that sense it's not "infinitely scalable", it just uses your hardware better than any previous library.

When you zoom in with datashader, it'll re-rasterize based on a new aggregation with the current viewport, and can do that up to your RAM size. So datashader is limited by your RAM (I think; maybe it supports larger-than-RAM data) while lonboard is limited by your GPU RAM.

Example with sidecar

https://github.com/jupyter-widgets/jupyterlab-sidecar

More informative validation error messages

e.g.

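import numpy as np
from ipywidgets import Widget
from lonboard.traits import ColorAccessor  # import path may differ between versions

class ColorAccessorWidget(Widget):
    color = ColorAccessor()

def test_color_accessor_validation():
    # np.array([1, 2, 3]) gets a default integer dtype such as int64, not uint8
    color_arr = np.array([1, 2, 3]).reshape(-1, 3)
    ColorAccessorWidget(color=color_arr)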

raises

TraitError: The 'color' trait of a ColorAccessorWidget instance expected a tuple or list representing an RGB(A) color or numpy ndarray or pyarrow FixedSizeList representing an array of RGB(A) colors, not the ndarray array([[1, 2, 3]]).

The issue here is that the input array has a dtype of np.int64 instead of np.uint8, but the error isn't displaying properly

React issues

Goals:

Render multiple dataframes/layers on a single map
Enable updates of Python properties like get_fill_color to propagate to the map

Attempt:

Python side

On the Python side I attempted to write a sort of "container widget" (ref manzt/anywidget#194), where each Layer object is a Python jupyter widget, and where the Map object collects each of the underlying Widgets. It's useful to have each Layer be its own Widget, because that enables event handling on each layer.

import traitlets
from anywidget import AnyWidget  # high-level widget helper that subclasses Widget
from ipywidgets import Widget, widget_serialization  # base widget class and child-widget serializers

class BaseLayer(Widget):
    """Base class for our layer types"""
    ...

class PointLayer(BaseLayer):
    ...

class LineStringLayer(BaseLayer):
    ...

class PolygonLayer(BaseLayer):
    ...

class Map(AnyWidget):
    _esm = "path to esm JS bundle"
    _css = "optional path to CSS styling"

    # list of instances of classes that subclass from BaseLayer
    layers = traitlets.List(trait=traitlets.Instance(BaseLayer)).tag(
        sync=True, **widget_serialization
    )

Then a user will create a variety of layers and instantiate a Map object:

import geopandas as gpd

point_data = gpd.GeoDataFrame(...)
polygon_data = gpd.GeoDataFrame(...)

point_layer = PointLayer(point_data)
polygon_layer = PolygonLayer(polygon_data)

map_ = Map(layers=[point_layer, polygon_layer])  # trailing underscore avoids shadowing the built-in map
map_ # putting this last in a cell "prints" the object..., which in this case renders the map

The goal with this setup is to let a user run

point_layer.fill_color = [255, 0, 0]

and the points on the map turn red.

In order for this to happen, the JS side needs to be able to receive these events and re-render

JS Side

When you render the map object, its data is synced with the JS side and the App component is rendered:

lonboard/src/index.tsx

Line 142 in 8cd1a19

function App() {

The data from the Python side is available on the model object on the JS side. This can be accessed either via useModel or via anywidget's helper useModelState. useModelState is a small shim around useState and useEffect to keep track of the state of that value and propagate updates when the model announces that a field has changed.

The crux of the issue is that if you just use model.get(), you can access the initial value, but you never know when the value has been updated. Using useModelState from the "top level" works well, but only lets you access the attributes on the top level model.

My current code appeared to work well, but uses a synchronous private attribute that didn't appear to work in colab. And then switching to the async function didn't seem to work for that. See #34 for a description of this issue.

So the goal is to define the JS object in such a way that we hook into a model's event handlers so that we know when its change:<field> events fire and can update the map accordingly.

Support for rendering inside VSCode

Hi, I'm really looking forward to playing with lonboard but I can't get viz to render. I've tried the two example notebooks... They render in your Binder and Colab links, but when I try them in local Jupyter, nothing gets displayed but a blank white bar in the cell output. The objects returned from viz seem valid -- I can see coordinates etc when I print them as a string. I assume there might be a widget support issue or something like that. I'm using conda to create envs on a Windows machine. Tried various Python 3 versions, tried upgrading/downgrading Jupyter-related packages but no luck so far. Also tried upgrading & downgrading pyogrio, lonboard, pyarrow...

Binder badge for example notebooks

Can create a binder badge to the repo from this page: https://mybinder.org/; more env docs here: https://mybinder.readthedocs.io/en/latest/introduction.html

Integrate with ipywidgets ColorPicker widget

https://ipywidgets.readthedocs.io/en/stable/reference/ipywidgets.html#ipywidgets.widgets.widget_color.ColorPicker

Just need to be able to parse the hex to RGB

Polygon winding order

The deck.gl SolidPolygonLayer has a render option _windingOrder which says

This prop is only effective with _normalize: false. It specifies the winding order of rings in the polygon data, one of:

'CW': outer-ring is clockwise, and holes are counter-clockwise

'CCW': outer-ring is counter-clockwise, and holes are clockwise

The proper value depends on the source of your data. Most geometry formats enforce a specific winding order. Incorrectly set winding order will cause an extruded polygon's surfaces to be flipped, affecting culling and the lighting effect.

Thus, this is probably not the highest priority, given that it only happens with extruded polygons, but should be fixed eventually.

In GEOS, polygon winding order is unspecified, so we'd need to check/force it manually. There's no vectorized shapely function to do this, ref shapely/shapely#1366. So the options are either:

non-vectorized shapely implementation of orient. This would be unacceptably slow.
geoarrow-based orient implementation in python. This would be ideal but not likely to be imminently implemented.
JS-based orientation implementation. This either brings in a wasm implementation (not ideal here) or implements a custom JS function on geoarrow arrays (preferred, but the tooling for geoarrow in pure JS isn't there yet)

So the end goal here is:

Implement a fast winding order algorithm in rust on geoarrow memory to do in python.
Implement an orientation checking function in pure JS on geoarrow memory (maybe in geoarrow/deck.gl-layers for now)
Set the geoarrow winding order flag when winding order is checked/validated in python so it doesn't get done again in js.

Ref geoarrow/deck.gl-layers#36
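For reference, the orientation check itself is small. The sketch below uses the shoelace formula on plain Python coordinate lists; it is a non-vectorized illustration of the logic only, not the fast geoarrow-based implementation described above, and the function names are hypothetical.

```python
def signed_area(ring):
    """Shoelace formula: positive for counter-clockwise rings (x right, y up)."""
    area = 0.0
    # Pair each vertex with the next one, wrapping around to the first
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        area += x1 * y2 - x2 * y1
    return area / 2.0


def is_ccw(ring):
    return signed_area(ring) > 0


def orient_polygon(rings, exterior_ccw=True):
    """Return rings with the exterior in the requested direction and holes opposite.

    rings[0] is the exterior ring; any remaining rings are holes.
    """
    oriented = []
    for i, ring in enumerate(rings):
        want_ccw = exterior_ccw if i == 0 else not exterior_ccw
        oriented.append(ring if is_ccw(ring) == want_ccw else ring[::-1])
    return oriented


exterior = [(0, 0), (0, 4), (4, 4), (4, 0)]  # clockwise
hole = [(1, 1), (1, 2), (2, 2), (2, 1)]  # clockwise
fixed = orient_polygon([exterior, hole], exterior_ccw=True)
```

With exterior_ccw=True this corresponds to the 'CCW' convention quoted above: the outer ring ends up counter-clockwise and holes clockwise.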

Check for CRS84 in `from_geopandas`

maybe print a warning and then reproject automatically?

store pyarrow table in traitlet on widget?

Refer to pydeck binary serialization. Might be possible to store the Table object directly on the widget, with a custom "to_json" which creates {"data": memoryview(feather_buffer)} or similar

https://github.com/visgl/deck.gl/blob/master/bindings/pydeck/pydeck/widget/widget.py#L62C75-L62C75
https://github.com/visgl/deck.gl/blob/master/bindings/pydeck/pydeck/data_utils/binary_transfer.py

compute_view crashes on empty geometries

compute_view crashes when empty points exist (giving an infinite bounding box)

Maybe give a warning that null points exist, and then filter them out for creating a bbox?
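One possible shape for that fix is sketched below. It assumes empty points show up as NaN coordinates in an (N, 2) array; the function name finite_bbox is hypothetical and this is not the actual compute_view code.

```python
import warnings

import numpy as np


def finite_bbox(coords):
    """Return (minx, miny, maxx, maxy) over finite points, or None if there are none.

    Assumes empty points are encoded as NaN coordinates.
    """
    coords = np.asarray(coords, dtype=np.float64)
    finite = np.isfinite(coords).all(axis=1)
    if not finite.all():
        skipped = int((~finite).sum())
        warnings.warn(f"Ignoring {skipped} empty or non-finite point(s) when computing the bounding box.")
    valid = coords[finite]
    if valid.shape[0] == 0:
        return None
    minx, miny = valid.min(axis=0)
    maxx, maxy = valid.max(axis=0)
    return float(minx), float(miny), float(maxx), float(maxy)
```

Returning None when no finite points remain lets the caller fall back to a default view instead of an infinite bounding box.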

Make non-public modules start with underscore

Should be more careful to signify what's public and what's not

Separate into multiple widgets/layers?

The rendering API/options will be different based on the type of layer. Should you have a PointWidget, LineStringWidget, PolygonWidget, and then have .get_fill_color as an autocompletion-able attribute on only the PolygonWidget? And have like create_widget(gdf) as a top-level API that creates the table and then switches to create one of the widgets?

Advanced docs: explain high level of how to use js layers using data on layer

Integrate `mapclassify`

From the geopandas.plot docstring

Name of a choropleth classification scheme (requires mapclassify). A mapclassify.MapClassifier object will be used under the hood. Supported are all schemes provided by mapclassify (e.g. ‘BoxPlot’, ‘EqualInterval’, ‘FisherJenks’, ‘FisherJenksSampled’, ‘HeadTailBreaks’, ‘JenksCaspall’, ‘JenksCaspallForced’, ‘JenksCaspallSampled’, ‘MaxP’, ‘MaximumBreaks’, ‘NaturalBreaks’, ‘Quantiles’, ‘Percentiles’, ‘StdMean’, ‘UserDefined’). Arguments can be passed in classification_kwds.

Let widget fill available height

When using with jupyter sidecar, the map can be placed on the right side of the notebook screen:

Right now the div containing the deck.gl widget is hard-coded to 500px:

lonboard/src/scatterplot-layer.tsx

Line 80 in d4a05a3

This means that when used with sidecar, if the screen is more than 500px tall, it'll have a weird empty space at the bottom.

Ideally we want to let the widget fill all available height in its containing div, but I can't figure out how to do that. When I switch to, say,

style={{ display: "flex", flexFlow: "column", flexGrow: 1, overflow: "auto" }}

it creates a div with zero height:

cc @vgeorge

Allow pyarrow objects as attributes (e.g. for colors) as well as numpy

Select data by bounding box

A user draws a bounding box to select an array of feature indices that fall within the bounding box. The features are highlighted on the map and selected in the geodataframe.

Useful for exploratory data analysis.

Kyle to add details.

Docs: Performance characteristics & advice

discuss impact of being on a remote server
ultimately dependent on the user's GPU for rendering
In contrast to datashader, doesn't minimize the amount of data being rendered; just does it more effectively
Use arrow data types in pandas
Exclude columns from dataframe before passing into layer

Sync view state between Python and JS

Right now we include an _initial_view_state that lets Python set the initial view state.

deck.gl allows you to pass in an initialViewState param which then lets deck manage the internal view state. Or you can manage the view state independently from deck, which you update with onViewStateChange and pass into deck's viewState parameter.

Set the state from Python but allow the JS side to vary independently (otherwise you couldn't pan)
Debounce messages from JS -> Python so they don't clog the web socket
Don't debounce setting the view state from onViewStateChange (because we don't want to slow the deck updates)

The existing implementation of useModelState (in anywidget/react) is:

export function useModelState(key) {
  let model = useModel();
  let [value, setValue] = React.useState(model.get(key));
  React.useEffect(() => {
    let callback = () => setValue(model.get(key));
    model.on(`change:${key}`, callback);
    return () => model.off(`change:${key}`, callback);
  }, [model, key]);
  return [
    value,
    (value) => {
      model.set(key, value);
      model.save_changes();
    },
  ];
}

We probably want something like useModelStateDebounced which returns a callback that immediately calls model.set(key, value) but debounces for model.save_changes().

Note to self: https://www.joshwcomeau.com/snippets/javascript/debounce/ for implementation of debounce + note to use useMemo in react. It's unclear if we do want useMemo because we seemingly do want to re-render the react component on every view state change, because deck is reactive and won't re-render the full map

Align class names with deck.gl

I'm thinking it's better to start aligned with deck.gl and then change names in the future if we find it easier... 🤷‍♂️

So that means starting with e.g. the ScatterplotLayer instead of the PointLayer. We can also link to the ScatterplotLayer docs for this and it'll be hopefully more clear that we're exposing the same api as upstream

fail validation for 3d coords until supported in js

colormap helpers

provide at least a helper that takes in values 0-1 and maps them into the user-provided colormap

maybe have different clamping options, just like the GPU. either discrete which rounds to the nearest 1/256 color integer, or continuous which takes the ideal color in between the two nearest choices
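A minimal sketch of such a helper, assuming the colormap is given as an (M, 3) array of RGB values and the input values are 1-D. The name apply_colormap and its mode parameter are hypothetical.

```python
import numpy as np


def apply_colormap(values, colormap, mode="discrete"):
    """Map values in [0, 1] to uint8 RGB colors from an (M, 3) colormap.

    mode="discrete" picks the nearest colormap entry;
    mode="continuous" linearly interpolates between the two nearest entries.
    """
    values = np.clip(np.asarray(values, dtype=np.float64), 0.0, 1.0)
    colormap = np.asarray(colormap, dtype=np.float64)
    # Fractional position of each value along the colormap
    position = values * (len(colormap) - 1)
    if mode == "discrete":
        return colormap[np.rint(position).astype(np.intp)].astype(np.uint8)
    if mode == "continuous":
        lower = np.floor(position).astype(np.intp)
        upper = np.minimum(lower + 1, len(colormap) - 1)
        frac = (position - lower)[:, None]
        blended = colormap[lower] * (1 - frac) + colormap[upper] * frac
        return np.rint(blended).astype(np.uint8)
    raise ValueError(f"unknown mode: {mode!r}")


grayscale = [[0, 0, 0], [255, 255, 255]]
discrete = apply_colormap([0.0, 0.25, 1.0], grayscale, mode="discrete")
continuous = apply_colormap([0.0, 0.25, 1.0], grayscale, mode="continuous")
```

Values outside [0, 1] are clamped to the ends of the colormap rather than rejected.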

Fix tsconfig to allow jsx

Tests for accessor/table validation/serialization

Per-environment warnings

E.g. it's easy to check if you're in Colab, and then print a warning when rendering more than, say, 1M coordinates, since Colab tends to get unstable at that size.
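A sketch of what such a check might look like. The 1M threshold is the ballpark figure mentioned above, not a measured limit, and the function names are hypothetical. The modules argument exists only so the check can be exercised outside Colab.

```python
import sys
import warnings

# Ballpark threshold from the discussion above; not a measured limit
COLAB_COORD_WARNING_THRESHOLD = 1_000_000


def running_in_colab(modules=None):
    """Heuristic: the google.colab module is loaded in Colab kernels."""
    modules = sys.modules if modules is None else modules
    return "google.colab" in modules


def warn_if_large(num_coords, modules=None):
    """Warn (and return True) when rendering many coordinates in Colab."""
    if running_in_colab(modules) and num_coords > COLAB_COORD_WARNING_THRESHOLD:
        warnings.warn(
            f"Rendering {num_coords:,} coordinates in Colab may be unstable "
            f"above roughly {COLAB_COORD_WARNING_THRESHOLD:,} coordinates."
        )
        return True
    return False
```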

Validate polygons with holes

Try to deduplicate `@traitlets.validate`

It accepts multiple names, so maybe it would be possible to have a single validator and register it with

@traitlets.validate("get_radius", "get_fill_color", ...)

instead of having a separate one for each one
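As a small sketch of the pattern, the class below registers one validator for two traits. The Layer class and its plain Float traits are stand-ins for illustration, not lonboard's actual layer or accessor traits.

```python
import traitlets


class Layer(traitlets.HasTraits):
    """Stand-in class; real layers would use accessor traits instead of Float."""

    get_radius = traitlets.Float(1.0)
    get_line_width = traitlets.Float(1.0)

    # A single validator registered for several trait names at once
    @traitlets.validate("get_radius", "get_line_width")
    def _validate_non_negative(self, proposal):
        value = proposal["value"]
        if value < 0:
            name = proposal["trait"].name
            raise traitlets.TraitError(f"{name} must be non-negative, got {value}")
        return value


layer = Layer()
layer.get_radius = 3.0
```

The proposal carries the trait being validated, so the shared validator can still report which attribute failed.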

Auto-downcast numeric attribute types in `from_geopandas`

Check for float, signed int, unsigned int data types, and call pd.to_numeric(downcast=...).

It would be nice to check if this works with pyarrow-based data types as well.

This should be a kwarg, maybe named auto_downcast: bool = True?
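A rough sketch of the downcasting step on a plain pandas DataFrame. The helper name auto_downcast is hypothetical, and behavior with pyarrow-backed dtypes is not covered here.

```python
import pandas as pd
from pandas.api import types as ptypes


def auto_downcast(df):
    """Return a copy with numeric columns downcast to the smallest fitting dtype."""
    out = df.copy()
    for name in out.select_dtypes(include="number").columns:
        column = out[name]
        if ptypes.is_float_dtype(column):
            kind = "float"
        elif ptypes.is_unsigned_integer_dtype(column):
            kind = "unsigned"
        elif ptypes.is_signed_integer_dtype(column):
            kind = "signed"
        else:
            continue  # leave other numeric kinds (e.g. complex) untouched
        out[name] = pd.to_numeric(column, downcast=kind)
    return out


df = pd.DataFrame(
    {
        "count": [1, 2, 3],
        "height": [0.5, 1.5, 2.5],
        "ids": pd.Series([1, 2, 3], dtype="uint64"),
        "name": ["a", "b", "c"],
    }
)
small = auto_downcast(df)
```

Smaller numeric dtypes reduce the amount of data that has to be serialized and sent to the browser.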

Change default execution language in notebook to `python3`

Switch dataframe to be stored on widget as geodataframe?

Instead of storing the buffer on the widget, you could instead store a more structured object and customize the ipywidgets serialization.

Note that this will mean that the widget depends on geopandas instead of just interfacing with geopandas, so it's probably not desired.

Probably the best middle ground is to store the GeoArrow table representation (as a pyarrow.Table) on the widget

Advanced docs: per-coordinate accessors

Not quite supported yet in deck.gl-layers

Data compression over the wire

Right now data is transferred from Python to JS fully uncompressed:

lonboard/lonboard/layer.py

Line 68 in 6a64c6f

feather.write_feather(table, bio, compression="uncompressed")

Uncompressed data is fine for local kernels, where Python and the browser are on the same machine, but not ideal for remote kernels, like JupyterHub or Colab, where Python is on a remote server and data has to be downloaded before it can be rendered on a map.

Data Compression options

There are a few options for data compression:

Uncompressed
Apply a simple compression like gzip to the entire table buffer. This is simple to implement on both the Python and JS sides, but is quite slow
Apply compression in the Arrow IPC format. This file format supports only "light compression" (LZ4 or ZSTD) and doesn't do any other encoding like delta encoding for smaller file size. The downside is that reading compressed IPC files is not currently supported by Arrow JS.
Use Parquet. This has the most efficient compression, but it has the downside of requiring a WebAssembly-based parser on the JS side. Adding the Wasm could make the build setup more difficult.

Different settings for local/remote?

Another question is whether it's possible to have different compression defaults based on whether the Python session is local or remote. Ideally a local Python kernel could use no compression while a remote Python kernel could use the most efficient compression.

The problem is that because Python-Jupyter follows a server-client model, I don't know of a good way to know from Python whether the attached client is running locally or remotely. There could be some heuristics like checking if "google.colab" is in sys.modules, but that's only valid in the Colab case.

So it seems like the best default would be fast, moderate-size compression, and then have a parameter to let the user choose either no compression or slow, small-file-size compression.

Unscientific benchmarks

Unscientific benchmarks using the utah dataset of 1 million buildings (7M coords):

Compression Type	File size	Write time
Feather (uncompressed)	144 MB	17 ms
gzip full-buffer compression	64 MB	13 s
Feather (ZSTD)	80 MB	200 ms
Feather (LZ4)	97 MB	147 ms
Parquet (Snappy)	82 MB	444 ms
Parquet (gzip)	60 MB	4.5 s
Parquet (brotli)	45 MB	3.7 s
Parquet (ZSTD)	74 MB	466 ms
Parquet (ZSTD level 22)	41.6 MB	11 s
Parquet (ZSTD level 18)	41.6 MB	9.8 s
Parquet (ZSTD level 16)	48.3 MB	5.7 s
Parquet (ZSTD level 14)	49.8 MB	2.7 s
Parquet (ZSTD level 12)	49.8 MB	1.9 s
Parquet (ZSTD level 10)	49.8 MB	1.7 s
Parquet (ZSTD level 8)	50.3 MB	1.4 s
Parquet (ZSTD level 7)	50.3 MB	1.25 s
Parquet (ZSTD level 6)	51.4 MB	1.2 s
Parquet (ZSTD level 4)	57.8 MB	800 ms
Parquet (ZSTD level 2)	69.1 MB	560 ms

Given this, ZSTD around level ~7 seems to have a very good combination of write speed and file size, and likely makes sense as a default.

Serialize the Arrow table to a base64 string
Create an HTML file with data, layer parameters, etc
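The two steps above could be sketched roughly as follows. The template, the build_html name, and the JS variable names are all hypothetical, and the JS code that would decode the buffer and render the map is not shown.

```python
import base64
import json

# Hypothetical page skeleton; the rendering code on the JS side is omitted
HTML_TEMPLATE = """<!DOCTYPE html>
<html>
  <body>
    <div id="map"></div>
    <script>
      const layerParams = {params};
      const tableBase64 = "{data}";
    </script>
  </body>
</html>
"""


def build_html(table_bytes, layer_params):
    """Embed a serialized Arrow table and layer parameters in an HTML page."""
    encoded = base64.b64encode(table_bytes).decode("ascii")
    return HTML_TEMPLATE.format(params=json.dumps(layer_params), data=encoded)


html = build_html(b"hello", {"get_radius": 5})
```

Base64 inflates the payload by about a third, so this approach is best suited to small and medium datasets.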

Test on polygons with holes

Ref geoarrow/deck.gl-layers#37

Sync the clicked index back to Python

It would be great, besides a tooltip to display on the JS side, to sync the index of the object that was clicked. Then the user can do gdf.iloc[map_.clicked_index] to retrieve the specific row

Note that this can probably be an array of indices?