Comments (16)

djhoese commented on May 28, 2024

Looks like they do a lot of the same stuff we did in the trollimage package. The main difference is that we allow you to not compute the result right away. This lets you delay writing the data until later, which is needed by some of the stuff we do in Satpy. I should probably look at how xarray handles open file objects nowadays (and, more importantly, how it serializes them).

from rioxarray.

snowman2 commented on May 28, 2024

Currently dask is not a dependency of rioxarray.

If I understand correctly, it seems this is how you are achieving this in trollimage here: https://github.com/pytroll/trollimage/blob/63fa32f2d40bb65ebc39c4be1fb1baf8f163db98/trollimage/xrimage.py#L371

With the use of RIOFile: https://github.com/pytroll/trollimage/blob/63fa32f2d40bb65ebc39c4be1fb1baf8f163db98/trollimage/xrimage.py#L63-L138

I am wondering if this could be an option to toggle on and will only work if you have dask installed.

djhoese commented on May 28, 2024

Yes. Dask's store function expects a destination that supports numpy-like array indexing. Our RIOFile helps with that. I'm sure core xarray has some helpful things for this now, they didn't when we first created RIOFile.
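To make the "numpy-like array indexing" requirement concrete, here is a minimal sketch. `ArrayTarget` is a hypothetical stand-in, not the actual RIOFile: it only shows that the destination needs a `__setitem__` that accepts a tuple of slices. A rasterio-backed version would presumably translate those slices into a write window instead of writing to an in-memory array.

```python
import dask.array as da
import numpy as np


class ArrayTarget:
    """Hypothetical stand-in for a RIOFile-like writer."""

    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = dtype
        self.data = np.zeros(shape, dtype=dtype)

    def __setitem__(self, key, chunk):
        # key is a tuple of slices saying where this chunk belongs.
        # A file-backed target would convert it to a write window here.
        self.data[key] = chunk


source = da.from_array(np.arange(16).reshape(4, 4), chunks=(2, 2))
target = ArrayTarget(source.shape, source.dtype)

# Each 2x2 chunk is handed to target.__setitem__ separately.
da.store(source, target, lock=True)
```

After the call, `target.data` holds the same values as the source array.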

There is also the complication of when to open/close the file. This is one of the uglier parts of how Satpy/trollimage does things. Dask has no way of having a task that opens a file, runs a series of tasks (saving array chunks), and then closing the file when they are all done. There are related issues on dask's github but I don't have time to find them right now.
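One way to work around the open/close problem, sketched under the assumption that the calling code is allowed to own the file's lifetime: build the write graph with `compute=False`, then open the target, compute, and close it outside of dask. `FakeRasterTarget` and its `opened()` method are made up for illustration and are not how trollimage or rioxarray actually do it.

```python
from contextlib import contextmanager

import dask
import dask.array as da
import numpy as np


class FakeRasterTarget:
    """Hypothetical target that refuses writes unless it is open."""

    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = dtype
        self.data = np.zeros(shape, dtype=dtype)
        self.is_open = False

    @contextmanager
    def opened(self):
        self.is_open = True
        try:
            yield self
        finally:
            # Runs after all chunk writes have finished (or failed).
            self.is_open = False

    def __setitem__(self, key, chunk):
        if not self.is_open:
            raise RuntimeError("target is closed")
        self.data[key] = chunk


source = da.ones((4, 6), chunks=(2, 3))
target = FakeRasterTarget(source.shape, source.dtype)

# Build the write graph lazily; nothing is written yet.
pending = da.store(source, target, compute=False)

# Open, run every chunk write, then close, all from the calling code.
with target.opened():
    dask.compute(pending)
```

The downside is the one described above: the open and close steps are not part of the dask graph, so whoever triggers the computation has to remember to wrap it.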

Since trollimage requires dask (because Satpy uses dask) we haven't had to deal with the difficulty of handling numpy and dask arrays in the most efficient way possible.

snowman2 commented on May 28, 2024

Interesting. I will have to dig into it later.

snowman2 commented on May 28, 2024

Reading on the subject leads me to think that using the blockxsize and blockysize in GDAL/rasterio in connection with the dask chunks would be a good idea for reading from/writing to rasters.

  1. If it is a dask array, the chunks property is available and the blockxsize and blockysize should be set based on that property.
  2. If it is not a dask array, the chunk method could be used to convert it into a dask array based on the blockxsize and blockysize properties.
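A rough sketch of point 1, assuming a 2D (y, x) array and dask-style `chunks` (a tuple of per-axis chunk sizes). The helper name `block_size_from_chunks` is made up; `tiled`, `blockxsize` and `blockysize` are the GTiff creation options that rasterio accepts as keyword arguments.

```python
def block_size_from_chunks(chunks):
    """Return (blockysize, blockxsize) from dask-style chunks of a 2D array.

    ``chunks`` looks like ((512, 512, 76), (256, 256)). The first chunk
    along each axis is used because trailing chunks may be smaller.
    """
    if len(chunks) != 2:
        raise ValueError("expected chunks for a 2D (y, x) array")
    y_chunks, x_chunks = chunks
    return y_chunks[0], x_chunks[0]


chunks = ((512, 512, 76), (256, 256))
blockysize, blockxsize = block_size_from_chunks(chunks)

# Keyword arguments that could be forwarded when opening a GTiff for writing.
creation_options = {
    "tiled": True,
    "blockxsize": blockxsize,
    "blockysize": blockysize,
}
```

This ignores whether the resulting sizes are valid tile sizes for the driver, so it is only a starting point.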

djhoese commented on May 28, 2024

In point 1, don't blockxsize and blockysize assume tiling? What if I wanted a striped geotiff but was using dask arrays with 2D chunks?

For point 2, what "chunk" method are you referring to?

snowman2 commented on May 28, 2024

For point 1, it does assume tiling. With my current understanding I was thinking about how to optimize, but I am not entirely sure at the moment if it is a good idea or not. How often do you use a striped raster?

For point 2, when writing with dask chunks and you didn't start with a dask array: http://xarray.pydata.org/en/stable/generated/xarray.DataArray.chunk.html

snowman2 commented on May 28, 2024

Also, these could be defaults that could be overridden.

djhoese commented on May 28, 2024

Point 1, ok. Since the default for gdal/rasterio is striped rasters that's what I typically do but I give users the option to switch to tiled if they'd like. I'd be ok with more libraries changing the default to tiled but I'd be curious about gdal's reasoning for not doing tiling by default. This would likely remove the issues I mentioned with GDAL's internal caching. The only thing I don't 100% agree with here is that the chunk size be used directly for block size. I think it could be used, but in most cases (in my experience) the dask array chunk size is much larger than "usual" tiled geotiffs (256x256, 512x512, etc); especially if you consider cloud optimized geotiffs as the "preferred" version of a geotiff.

Point 2, ah cool. I didn't know that existed.

snowman2 commented on May 28, 2024

"The only thing I don't 100% agree with here is that the chunk size be used directly for block size."

That is definitely a valid concern. In these cases, we would have to have a check for that. The block size for geotiff overviews must be a power-of-two value between 64 and 4096 (https://gdal.org/drivers/raster/gtiff.html#overviews), but I am not sure what the limits are for the main image. I have read that the TIFF spec states that tile width (and this goes for height, I presume) must be a multiple of 16.

So, it seems you can get away with larger block sizes as long as they are powers of 2. But this does limit the allowed dask chunk sizes when writing to a file. Ideally the dask chunk size and the block size would be the same. Thoughts?
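The check mentioned above could look something like this sketch. It rounds a chunk size down to a power of two and clamps it to 64..4096. Note the assumption: that range is only documented for overviews, so applying it to the main image is a guess on my part, and `power_of_two_block_size` is a made-up name.

```python
def power_of_two_block_size(chunk_size, minimum=64, maximum=4096):
    """Round chunk_size down to a power of two, clamped to [minimum, maximum]."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    # Largest power of two that is <= chunk_size.
    size = 1 << (chunk_size.bit_length() - 1)
    return max(minimum, min(size, maximum))


sizes = [power_of_two_block_size(n) for n in (10, 256, 1000, 5000)]
```

Any power of two from 16 upwards is also a multiple of 16, so the results satisfy the TIFF tile-size rule as well.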

snowman2 commented on May 28, 2024

Another option for optimizing writing would be to call the chunk function beforehand to better match the chunk sizes. But, it would be a good idea to profile that one.

djhoese commented on May 28, 2024

Agree with everything you said. Perhaps if this method defaults to tiled geotiffs, then it could have default block sizes of 256x256 (like GDAL does) and it could rechunk the data before writing it. If you pass tiled=False then no rechunking is needed.
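Roughly what I have in mind, as a sketch only. `prepare_for_write` and its `tiled`/`blocksize` parameters are hypothetical names, and whether rechunking actually pays off would need profiling.

```python
import dask.array as da


def prepare_for_write(data, tiled=True, blocksize=256):
    """Rechunk a 2D dask array to the tile size, or leave it alone if striped."""
    if not tiled:
        return data
    return data.rechunk((blocksize, blocksize))


data = da.zeros((1000, 600), chunks=(500, 300))
tiled_data = prepare_for_write(data)
striped_data = prepare_for_write(data, tiled=False)
```

Here `tiled_data.chunks` comes out as 256-sized chunks with smaller remainders at the edges, while `striped_data` is the original array untouched.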

Brain dump/side note: I wonder if rechunking for a striped geotiff to stripe size chunks would improve geotiff writing performance and final geotiff size. I'll have to try that in satpy some time.

snowman2 commented on May 28, 2024

Sounds good. Looks like we have enough here to tinker with some ideas later. It will be interesting to hear back about the stripsize tests as well.

snowman2 commented on May 28, 2024

This looks interesting: https://github.com/dymaxionlabs/dask-rasterio. Hasn't been updated in a while and only works with poetry. But, seems like a good additional reference.

snowman2 commented on May 28, 2024

One small difference I saw is the use of lock=True:
https://github.com/dymaxionlabs/dask-rasterio/blob/8dd7fdece7ad094a41908c0ae6b4fe6ca49cf5e1/dask_rasterio/write.py#L35

Reference:
https://docs.dask.org/en/latest/array-api.html#dask.array.store

snowman2 commented on May 28, 2024

I was planning to tinker around with this and noticed the GPL license in trollimage. As such, dask-rasterio is going to have to be the reference point for this since it has a BSD 2-Clause license.
