
Terracotta



Terracotta is a pure Python tile server that runs as a WSGI app on a dedicated webserver or as a serverless app on AWS Lambda. It is built on a modern Python stack, powered by awesome open-source software such as Flask, Zappa, and Rasterio.

Read the docs | Try the demo | Explore the API | Satlas, powered by Terracotta | Docker Image

Why Terracotta?

  • It is trivial to get going. Got a folder full of cloud-optimized GeoTIFFs in different projections that you want to have a look at in your browser? terracotta serve -r {name}.tif and terracotta connect localhost:5000 get you there.
  • We make minimal assumptions about your data, so you stay in charge. Keep using the tools you know and love to create and organize your data; Terracotta serves it exactly as it is.
  • Serverless deployment is a first-priority use case, so you don’t have to worry about maintaining or scaling your architecture.
  • Terracotta instances are self-documenting. Everything the frontend needs to know about your data is accessible from only a handful of API endpoints.

The Terracotta workflow

1. Optimize raster files

$ ls -lh
total 1.4G
-rw-r--r-- 1 dimh 1049089 231M Aug 29 16:45 S2A_20160724_135032_27XVB_B02.tif
-rw-r--r-- 1 dimh 1049089 231M Aug 29 16:45 S2A_20160724_135032_27XVB_B03.tif
-rw-r--r-- 1 dimh 1049089 231M Aug 29 16:46 S2A_20160724_135032_27XVB_B04.tif
-rw-r--r-- 1 dimh 1049089 231M Aug 29 16:56 S2A_20170831_171901_25XEL_B02.tif
-rw-r--r-- 1 dimh 1049089 231M Aug 29 16:57 S2A_20170831_171901_25XEL_B03.tif
-rw-r--r-- 1 dimh 1049089 231M Aug 29 16:57 S2A_20170831_171901_25XEL_B04.tif

$ terracotta optimize-rasters *.tif -o optimized/

Optimizing rasters: 100%|██████████████████████████| [05:16<00:00, file=S2A_20170831_...25XEL_B04.tif]

2. Create a database from file name pattern

$ terracotta ingest optimized/S2A_{date}_{}_{tile}_{band}.tif -o greenland.sqlite
Ingesting raster files: 100%|███████████████████████████████████████████| 6/6 [00:49<00:00,  8.54s/it]
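
Under the hood, each {name} placeholder in such a pattern becomes a database key. A rough, stdlib-only sketch of how a pattern like this could be translated into a regex (an illustration, not Terracotta's actual implementation):

```python
import re

def pattern_to_regex(pattern: str):
    """Translate 'S2A_{date}_{}_{tile}_{band}.tif' into a compiled regex.

    Named placeholders become named capture groups; anonymous {} match
    but capture nothing. Values are assumed not to contain '_' or '.'.
    """
    out, pos = [], 0
    for m in re.finditer(r"\{([^}]*)\}", pattern):
        out.append(re.escape(pattern[pos:m.start()]))  # literal text between placeholders
        name = m.group(1)
        out.append(rf"(?P<{name}>[^_.]+)" if name else r"[^_.]+")
        pos = m.end()
    out.append(re.escape(pattern[pos:]))
    return re.compile("".join(out) + "$")

rx = pattern_to_regex("S2A_{date}_{}_{tile}_{band}.tif")
keys = rx.match("S2A_20160724_135032_27XVB_B02.tif").groupdict()
# keys == {'date': '20160724', 'tile': '27XVB', 'band': 'B02'}
```

Every file matching the pattern contributes one dataset row, keyed by the extracted values.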

3. Serve it up

$ terracotta serve -d greenland.sqlite
 * Serving Flask app "terracotta.server" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://localhost:5000/ (Press CTRL+C to quit)

4. Explore the running server

Manually

You can use any HTTP-capable client, such as curl.

$ curl "localhost:5000/datasets?tile=25XEL"
{"page":0,"limit":100,"datasets":[{"date":"20170831","tile":"25XEL","band":"B02"},{"date":"20170831","tile":"25XEL","band":"B03"},{"date":"20170831","tile":"25XEL","band":"B04"}]}

Modern browsers (e.g. Chrome or Firefox) will render the JSON as a tree.

Interactively

Terracotta also includes a web client. You can start the client (assuming the server is running at http://localhost:5000) using

$ terracotta connect localhost:5000
 * Serving Flask app "terracotta.client" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5100/ (Press CTRL+C to quit)

Then open the client page (http://127.0.0.1:5100/ in this case) in your browser.

(Screenshot: the Terracotta web client)

Development

We gladly accept bug reports and pull requests via GitHub. For your code to be useful, make sure that it is covered by tests and that it satisfies our linting practices (via mypy and flake8).

To run the tests, just install the necessary dependencies via

$ pip install -e .[test]

Then, you can run

$ pytest

from the root of the repository.

Contributors

atanas-balevsky, bertearazvan, bradh, brianpojo56, chapmanjacobd, charalamm, danmindru, denizyil, dependabot[bot], dionhaefner, ecomodeller, hummeltech, j08lue, jeroenderks, kiksekage, mrpgraae, nickeopti, panakouris, pietertolsma, serj90, tomalrussell, vlro, xanderazuaje, yuhangch


terracotta's Issues

daskify metadata computation

We have a project where we want to use Terracotta to serve up some huge watermasks.
There's no way we can load an entire file into memory and do the computations: a 32 GB machine fails when computing the metadata. This is of course no problem for serving the files, as they are cloud-optimized.

However, the metadata computation when creating the database still assumes that the entire file fits into memory and then some. So we should use Dask to chunk the computations when sizes exceed the memory limit.

To speed up the common case (where files fit into memory) we could do this only when a MemoryError is thrown. Or we could set a memory limit that we think is reasonable and always chunk the files such that we never exceed that and then maybe decrease it if we hit a MemoryError. Thoughts?
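
Whatever the trigger, the chunked path boils down to folding statistics over raster windows instead of the whole array. A minimal NumPy sketch (min/max only; percentiles would need an approximate, e.g. histogram-based, approach):

```python
import numpy as np

def streaming_min_max(blocks):
    """Fold min/max over an iterable of array blocks (e.g. raster windows),
    so the full raster never has to fit into memory at once."""
    lo, hi = np.inf, -np.inf
    for block in blocks:
        lo = min(lo, float(block.min()))
        hi = max(hi, float(block.max()))
    return lo, hi

data = np.arange(12, dtype="float64").reshape(3, 4)
# iterate over row blocks instead of reducing `data` in one go
lo, hi = streaming_min_max(data[i:i + 1] for i in range(3))
# lo == 0.0, hi == 11.0
```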

False-color support

It should be possible to choose the mapping from band to RGB in a multi-band raster.

Presumably, the best method would be for the client to pass the mapping as HTTP query parameters.
Additionally, we could have a mapping from band name (e.g. NIR) to band number in raster. This mapping could be specified in the dataset configuration. The client could then specify the false color mapping as something like ?r=nir&g=blue&b=green.
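
A sketch of how such query parameters could be resolved against a per-dataset band-name mapping (the mapping, parameter names, and syntax here are all hypothetical):

```python
from urllib.parse import parse_qs

# Hypothetical per-dataset mapping from band name to band index in the raster
BAND_NAMES = {"blue": 1, "green": 2, "red": 3, "nir": 4}

def resolve_rgb(query_string: str):
    """Turn 'r=nir&g=blue&b=green' into the band indices to read."""
    params = parse_qs(query_string)
    try:
        return tuple(BAND_NAMES[params[channel][0]] for channel in ("r", "g", "b"))
    except KeyError as exc:
        raise ValueError(f"unknown band or missing channel: {exc}") from exc

resolve_rgb("r=nir&g=blue&b=green")  # -> (4, 1, 2)
```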

This issue is dependent on / related to #12.

Figure out how to handle previews

Possibilities:

  • Store in the database as base64-encoded binary blob. Might lead to significantly larger databases though.
  • Store a file path in the database. But a path to where? This would require additional user input.
  • Generate previews on the fly through /rgb or /singleband with a low zoom level, or add another API endpoint that reads a whole dataset (as opposed to an XYZ tile).

Figure out how to handle categorical data

Challenges:

  • values must be mapped to colors consistently
  • stretching does not make sense
  • legend must be able to return categories
  • whether a dataset is categorical or not must be known at ingestion time
  • or is there a way to provide most of this while keeping terracotta agnostic of categories?
  • how big of a use case is categorical data in the real world™️?
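
For illustration, consistent colorization without stretching could look like this (the category table is hypothetical):

```python
# Hypothetical category -> RGB mapping, stored alongside the dataset metadata
# so values map to colors consistently across requests
CATEGORY_COLORS = {
    0: (0, 0, 0),        # nodata
    1: (34, 139, 34),    # forest
    2: (30, 144, 255),   # water
}

def colorize(values, colors, fallback=(255, 0, 255)):
    """Map categorical pixel values to RGB triples; no contrast stretching."""
    return [colors.get(v, fallback) for v in values]

colorize([1, 2, 7], CATEGORY_COLORS)
# -> [(34, 139, 34), (30, 144, 255), (255, 0, 255)]
```

The same table could back a /legend-style endpoint, addressing the "legend must return categories" point above.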

Add pagination for bulk requests

Returning too many rows overloads both frontend and backend. This is usually solved by introducing page and limit parameters to iterate through results.

Steps to implement:

  • Add page parameter to /datasets schema
  • Add global query limit setting
  • Add LIMIT and OFFSET clauses to SQL queries
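
The steps above can be sketched against an in-memory SQLite table (the schema and column names are illustrative, not Terracotta's actual database layout):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (date TEXT, tile TEXT, band TEXT)")
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?)",
    [("20170831", "25XEL", band) for band in ("B02", "B03", "B04", "B08", "B11")],
)

def get_page(page: int, limit: int):
    # LIMIT/OFFSET implements the page & limit parameters from the request
    return conn.execute(
        "SELECT band FROM datasets ORDER BY band LIMIT ? OFFSET ?",
        (limit, page * limit),
    ).fetchall()

get_page(page=1, limit=2)  # -> [('B04',), ('B08',)]
```

The global query limit setting would simply cap the largest `limit` a client may request.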

Out of memory when serving large rasters

I used an overview to compute metadata, to get around the issue in #49. When I serve the data in Terracotta, I sometimes see this:

[2018-08-28 14:41:43,573] ERROR in app: Exception on /singleband/20171231/3/3/5.png [GET]
Traceback (most recent call last):
  File "/home/phgr/.conda/envs/terracotta/lib/python3.6/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/phgr/.conda/envs/terracotta/lib/python3.6/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/phgr/.conda/envs/terracotta/lib/python3.6/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/phgr/.conda/envs/terracotta/lib/python3.6/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/phgr/.conda/envs/terracotta/lib/python3.6/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/phgr/.conda/envs/terracotta/lib/python3.6/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/phgr/terracotta/terracotta/api/flask_api.py", line 52, in inner
    return fun(*args, **kwargs)
  File "/home/phgr/terracotta/terracotta/api/singleband.py", line 83, in get_singleband
    parsed_keys, tile_xyz, **options
  File "/home/phgr/terracotta/terracotta/handlers/singleband.py", line 35, in singleband
    tilesize=tile_size)
  File "/home/phgr/terracotta/terracotta/xyz.py", line 27, in get_tile_data
    return driver.get_raster_tile(keys, bounds=target_bounds, tilesize=tilesize, nodata=nodata)
  File "/home/phgr/terracotta/terracotta/drivers/base.py", line 274, in get_raster_tile
    nodata=nodata
  File "/home/phgr/.conda/envs/terracotta/lib/python3.6/site-packages/cachetools/__init__.py", line 87, in wrapper
    v = method(self, *args, **kwargs)
  File "/home/phgr/terracotta/terracotta/drivers/base.py", line 27, in inner
    return fun(self, *args, **kwargs)
  File "/home/phgr/terracotta/terracotta/drivers/base.py", line 208, in _get_raster_tile
    src.crs, target_crs, src.width, src.height, *src.bounds
  File "/home/phgr/.conda/envs/terracotta/lib/python3.6/site-packages/rasterio/env.py", line 363, in wrapper
    return f(*args, **kwds)
  File "/home/phgr/.conda/envs/terracotta/lib/python3.6/site-packages/rasterio/warp.py", line 418, in calculate_default_transform
    src_crs, dst_crs, width, height, left, bottom, right, top, gcps)
  File "rasterio/_warp.pyx", line 646, in rasterio._warp._calculate_default_transform
  File "rasterio/_io.pyx", line 1664, in rasterio._io.InMemoryRaster.__cinit__
  File "rasterio/_err.pyx", line 188, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OutOfMemoryError: memdataset.cpp, 1545: cannot allocate 5816105575 bytes

Which becomes a 500 response. It happens when I zoom out a bit, which might indicate that this could be a problem with loading the overviews. The innermost (highest res) overview is 43846x33163, which corresponds to a size of 1.45 GB (the raster is uint8), so the attempted allocation of 5.8 GB looks like a cast to some 32-bit size dtype of the innermost overview.
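
The arithmetic behind that hypothesis can be checked directly:

```python
width, height = 43846, 33163     # innermost overview, from the report above
uint8_bytes = width * height     # 1 byte per pixel
float32_bytes = uint8_bytes * 4  # hypothetical cast to a 32-bit dtype
# uint8_bytes   == 1_454_064_898  (~1.45 GB, matching the report)
# float32_bytes == 5_816_259_592  (~5.8 GB, close to the failed
#                                  5,816,105,575-byte allocation)
```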

Add parallel preprocessing capabilities

Preprocessing is pretty slow on large rasters. Processing several blocks in parallel could mitigate that. Alternatively, we can process multiple files in parallel (e.g. in optimize_rasters).
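
A sketch of the per-file variant using concurrent.futures (optimize_one is a stand-in for the real optimization step; rasterio/GDAL release the GIL for most raster I/O, so threads can already help here):

```python
from concurrent.futures import ThreadPoolExecutor

def optimize_one(path: str) -> str:
    """Stand-in for optimizing a single raster (tiling, overviews, compression)."""
    return path.replace(".tif", ".optimized.tif")

def optimize_all(paths, workers=4):
    # One file per worker thread; block-level parallelism within a single
    # file would instead need windowed reads (e.g. rasterio windows).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(optimize_one, paths))
```

For CPU-bound recompression a ProcessPoolExecutor with the same interface would be the natural swap-in.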

Introduce multiqueries for dataset lookup

To make it easier to scale to large data collections, we should support queries in /datasets such as

/datasets?year=[2016,2018]

which would return all datasets from 2016 and 2018.

Another consideration could be range-based queries, but that would require the introduction of per-key datatypes, which is something I'd like to avoid for now.
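
A sketch of parsing the proposed syntax into a parametrized SQL filter (purely illustrative):

```python
import json
from urllib.parse import parse_qs

def build_filter(query_string: str):
    """Turn 'year=[2016,2018]' into ('year IN (?, ?)', [2016, 2018])."""
    clauses, params = [], []
    for key, values in parse_qs(query_string).items():
        parsed = json.loads(values[0])  # '[2016,2018]' -> [2016, 2018]
        if not isinstance(parsed, list):
            parsed = [parsed]           # plain 'year=2016' still works
        placeholders = ", ".join("?" * len(parsed))
        clauses.append(f"{key} IN ({placeholders})")
        params.extend(parsed)
    return " AND ".join(clauses), params

build_filter("year=[2016,2018]")
# -> ('year IN (?, ?)', [2016, 2018])
```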

Code Review

Great job so far! Here are the things I stumbled upon:

Documentation

  • Be consistent: timestep vs timestamp
  • I don't think we need to explain the layout of the option files; an example is sufficient.
  • I think it is tremendously helpful to see example responses of the API calls early on.

Configuration

  • Why split path and regex? Just have path_regex.
  • Not sure about the yes/no syntax for boolean settings. How does e.g. Apache or Nginx handle that?

CLI

  • Config path could be a positional argument
  • Please wrap the config path in os.path.expanduser for us poor Windows souls
  • 💡: Accept rasters from the command line to quickly serve up anything: terracotta *.tif (then open a leaflet map in the browser, with the data already added as a layer, for the ultimate wow effect 😄)

API

  • I don't think the API queries should include terracotta. You would either run this as a Flask app on its own port, or configure the proxy in your webserver.
  • Using a non-timestep API endpoint for a timestepped dataset causes an uncaught exception (500 server error); it should return "Bad Request" or similar

I'll have a look at the actual code and do some profiling later. I'll update this issue with my findings.

Let users supply key names before deployment?

Currently, key names are read from the database. Alternatively, we could require users to supply both a database and the associated keys.

Pro:

  • API spec can include key names, and becomes fully OpenAPI compliant
  • API endpoints can fail immediately (without database lookup) if request supplies the wrong keys
  • One less database lookup per request, cleaner code in driver (one less table in database)

Con:

  • Either no guaranteed consistency between keys and database structure (if keys are directly supplied by the user) or requires database connection from deploy machine (if read from DB during deployment)
  • Need to introduce a factory for every API route and request schema

Contrast parameters

The client needs to be able to adjust the contrast of the images through query parameters.

For grayscale / single-band this can be easily done, by passing contrast_min and contrast_max query parameters to the existing contrast_stretch function.

For RGB / false-color images, this could be significantly more complicated.
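
For the single-band case, a minimal sketch of what such a function could do (an illustration, not Terracotta's actual contrast_stretch):

```python
import numpy as np

def contrast_stretch(data, vmin, vmax):
    """Linearly map [vmin, vmax] to [0, 255], clipping values outside the range."""
    scaled = (data.astype("float64") - vmin) / (vmax - vmin)
    return (np.clip(scaled, 0, 1) * 255).astype("uint8")

tile = np.array([0.0, 50.0, 100.0, 200.0])
contrast_stretch(tile, vmin=50, vmax=150)
# -> array([  0,   0, 127, 255], dtype=uint8)
```

The contrast_min/contrast_max query parameters would simply feed vmin and vmax.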

API Spec

We should settle on a stable API that we can document for the front-end developers, as soon as possible.

Add TTL for database retrieval

In certain cases (empty images), most time is spent retrieving remote databases, even when the hashes match. I propose caching remote databases with cachetools.TTLCache, so Terracotta only needs to check for a database update every 10 minutes or so.
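
For illustration, the core of the idea with only the standard library (cachetools.TTLCache provides the same behavior out of the box):

```python
import time

class TTLCache:
    """Minimal time-based cache: entries older than `ttl` seconds are refetched."""

    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl, self.clock, self._store = ttl, clock, {}

    def get_or_fetch(self, key, fetch):
        entry = self._store.get(key)
        if entry is None or self.clock() - entry[0] > self.ttl:
            entry = (self.clock(), fetch())  # expired or missing: refetch
            self._store[key] = entry
        return entry[1]

calls = []
cache = TTLCache(ttl=600)  # re-check the remote database every 10 minutes
db = cache.get_or_fetch("remote.sqlite", lambda: calls.append(1) or "db")
db = cache.get_or_fetch("remote.sqlite", lambda: calls.append(1) or "db")
# the fetch ran only once: len(calls) == 1
```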

Revisit up- and downsampling

  • Do we really need two separate options?
  • At which zoom level should the breakpoint between up- and downsampling occur?

GDAL errors for very low zoom levels

If a dataset collapses to only a handful of pixels, GDAL fails to read it. We should check for that case beforehand and just return an empty image.
