From: <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-

Use windowed writing in to_raster about rioxarray HOT 16 CLOSED

corteva commented on May 28, 2024

Use windowed writing in to_raster

from rioxarray.

Comments (16)

djhoese commented on May 28, 2024 1

Looks like they do a lot of the same stuff we did in the trollimage package. The main difference is that we allow you to not compute the result right away. This lets you delay the data writing until later which is needed by some of the stuff we do in Satpy. I should probably look at how xarray is handling things nowadays with handling open file objects (and more importantly serializing them).

from rioxarray.

snowman2 commented on May 28, 2024

Currently dask is not a dependency to rioxarray.

If I understand correctly, it seems this is how you are achieving this in trollimage here: https://github.com/pytroll/trollimage/blob/63fa32f2d40bb65ebc39c4be1fb1baf8f163db98/trollimage/xrimage.py#L371

With the use of RIOFile: https://github.com/pytroll/trollimage/blob/63fa32f2d40bb65ebc39c4be1fb1baf8f163db98/trollimage/xrimage.py#L63-L138

I am wondering if this could be an option to toggle on and will only work if you have dask installed.

from rioxarray.

djhoese commented on May 28, 2024

Yes. Dask's store function expects a destination that supports numpy-like array indexing. Our RIOFile helps with that. I'm sure core xarray has some helpful things for this now, they didn't when we first created RIOFile.

There is also the complication of when to open/close the file. This is one of the uglier parts of how Satpy/trollimage does things. Dask has no way of having a task that opens a file, runs a series of tasks (saving array chunks), and then closing the file when they are all done. There are related issues on dask's github but I don't have time to find them right now.

Since trollimage requires dask (because Satpy uses dask) we haven't had to deal with the difficulty of handling numpy and dask arrays in the most efficient way possible.

from rioxarray.

snowman2 commented on May 28, 2024

Interesting. I will have to dig into it later.

from rioxarray.

snowman2 commented on May 28, 2024

Reading on the subject leads me to think that using the blockxsize and blockysize in GDAL/rasterio in connection with the dask chunks would be a good idea for reading from/writing to rasters.

If it is a dask array, the chunks property is available and the blockxsize and blockysize should be set based on that property.
If it is not a dask array, the chunk method could be used to convert it into a dask array based on the blockxsize and blockysize properties.

from rioxarray.

djhoese commented on May 28, 2024

In point 1, doesn't blockxsize and blockysize assume tiling? What if I wanted a striped geotiff but was using dask arrays with 2D chunks?

For point 2, what "chunk" method are you referring to?

from rioxarray.

snowman2 commented on May 28, 2024

For point 1, it does assume tiling. With my current understanding I was thinking about how to optimize, but I am not entirely sure at the moment if it is a good idea or not. How often do you use a striped raster?

For point 2, when writing with dask chunks and you didn't start with a dask array: http://xarray.pydata.org/en/stable/generated/xarray.DataArray.chunk.html

from rioxarray.

snowman2 commented on May 28, 2024

Also, these could be defaults that could be overridden.

from rioxarray.

djhoese commented on May 28, 2024

Point 1, ok. Since the default for gdal/rasterio is striped rasters that's what I typically do but I give users the option to switch to tiled if they'd like. I'd be ok with more libraries changing the default to tiled but I'd be curious about gdal's reasoning for not doing tiling by default. This would likely remove the issues I mentioned with GDAL's internal caching. The only thing I don't 100% agree with here is that the chunk size be used directly for block size. I think it could be used, but in most cases (in my experience) the dask array chunk size is much larger than "usual" tiled geotiffs (256x256, 512x512, etc); especially if you consider cloud optimized geotiffs as the "preferred" version of a geotiff.

Point 2, ah cool. I didn't know that existed.

from rioxarray.

snowman2 commented on May 28, 2024

The only thing I don't 100% agree with here is that the chunk size be used directly for block size.

That is definitely a valid concern. In these cases, we would have to have a check for that. The block sizes for a geotiff for overviews are a power-of-two value between 64 and 4096 (https://gdal.org/drivers/raster/gtiff.html#overviews), but I am not sure what it is for normal tiffs. I have read here that the TIFF spec states that tile width (and this goes for height, I presume) must be a multiple of 16.

So, it seems you can get away with larger block sizes as long as they are in the power of 2. But, this does limit the allowed dask chunk sizes when writing to a file. Ideally the dask chunk size and the block size would be the same. Thoughts?

from rioxarray.

snowman2 commented on May 28, 2024

Another option for optimizing writing would be to call the chunk function beforehand to better match the chunk sizes. But, it would be a good idea to profile that one.

from rioxarray.

djhoese commented on May 28, 2024

Agree with everything you said. Perhaps if this method is defaulting to tiled geotiffs then it could have default block sizes of 256x256 (like GDAL does) and it could rechunk the data before writing it. If you tiled=False then no rechunking is needed.

Brain dump/side note: I wonder if rechunking for a striped geotiff to stripe size chunks would improve geotiff writing performance and final geotiff size. I'll have to try that in satpy some time.

from rioxarray.

snowman2 commented on May 28, 2024

Sounds good. Looks like we have enough here to tinker with some ideas later. It will be interesting to hear back about the stripsize tests as well.

from rioxarray.

snowman2 commented on May 28, 2024

This looks interesting: https://github.com/dymaxionlabs/dask-rasterio. Hasn't been updated in a while and only works with poetry. But, seems like a good additional reference.

from rioxarray.

snowman2 commented on May 28, 2024

One small difference I saw is the use of lock=True:
https://github.com/dymaxionlabs/dask-rasterio/blob/8dd7fdece7ad094a41908c0ae6b4fe6ca49cf5e1/dask_rasterio/write.py#L35

Reference:
https://docs.dask.org/en/latest/array-api.html#dask.array.store

from rioxarray.

snowman2 commented on May 28, 2024

Was planning to tinker around with this and noticed the GPL license in trollimage, As such, dask-rasterio is going to have to be the reference point for this since it has a BSD-2 Clause license.

from rioxarray.

Use windowed writing in to_raster about rioxarray HOT 16 CLOSED

Comments (16)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent