Comments (16)
Looks like they do a lot of the same stuff we did in the trollimage package. The main difference is that we allow you to not compute the result right away. This lets you delay the data writing until later which is needed by some of the stuff we do in Satpy. I should probably look at how xarray is handling things nowadays with handling open file objects (and more importantly serializing them).
from rioxarray.
Currently dask is not a dependency of rioxarray.
If I understand correctly, it seems this is how you are achieving this in trollimage here: https://github.com/pytroll/trollimage/blob/63fa32f2d40bb65ebc39c4be1fb1baf8f163db98/trollimage/xrimage.py#L371
With the use of RIOFile: https://github.com/pytroll/trollimage/blob/63fa32f2d40bb65ebc39c4be1fb1baf8f163db98/trollimage/xrimage.py#L63-L138
I am wondering if this could be an option to toggle on that would only work if you have dask installed.
from rioxarray.
Yes. Dask's store function expects a destination that supports numpy-like array indexing. Our RIOFile helps with that. I'm sure core xarray has some helpful things for this now; they didn't when we first created RIOFile.
There is also the complication of when to open/close the file. This is one of the uglier parts of how Satpy/trollimage does things. Dask has no way of having a task that opens a file, runs a series of tasks (saving array chunks), and then closes the file when they are all done. There are related issues on dask's GitHub but I don't have time to find them right now.
Since trollimage requires dask (because Satpy uses dask) we haven't had to deal with the difficulty of handling numpy and dask arrays in the most efficient way possible.
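A minimal sketch of the pattern described above, assuming only dask and numpy. InMemoryTarget is a hypothetical stand-in for a file wrapper such as RIOFile (it is not trollimage's implementation): it exposes `__setitem__` so that `dask.array.store` can write chunks into it, and the write is built with `compute=False` so the target can be opened and closed around the actual computation.

```python
import dask.array as da
import numpy as np


class InMemoryTarget:
    """Hypothetical stand-in for a raster file wrapper (e.g. RIOFile)."""

    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = dtype
        self.buffer = None
        self.closed = False

    def __enter__(self):
        # A real wrapper would open the raster for writing here.
        self.buffer = np.zeros(self.shape, dtype=self.dtype)
        return self

    def __exit__(self, exc_type, exc, tb):
        # A real wrapper would flush and close the file here.
        self.closed = True
        return False

    def __setitem__(self, key, chunk):
        # dask.array.store assigns one chunk at a time via numpy-style slices.
        self.buffer[key] = chunk


source = np.arange(64).reshape(8, 8)
data = da.from_array(source, chunks=(4, 4))
target = InMemoryTarget(data.shape, data.dtype)

# Build the write lazily; nothing is written until compute() is called.
delayed_write = da.store(data, target, lock=True, compute=False)

# Open the target, run all chunk writes, then close it.
with target:
    delayed_write.compute()
```

The open/close step lives outside the dask graph here, which is one way around the limitation mentioned above; it assumes a scheduler that shares memory with the caller (the default threaded scheduler).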
from rioxarray.
Interesting. I will have to dig into it later.
from rioxarray.
Reading on the subject leads me to think that using the blockxsize and blockysize in GDAL/rasterio in connection with the dask chunks would be a good idea for reading from/writing to rasters.
- If it is a dask array, the chunks property is available and the blockxsize and blockysize should be set based on that property.
- If it is not a dask array, the chunk method could be used to convert it into a dask array based on the blockxsize and blockysize properties.
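As a rough illustration of the first point, the block sizes could be read off the dask chunks like this. This is a sketch only: block_sizes_from_chunks is a hypothetical helper, and it assumes the last two dimensions are (y, x) and that the first chunk along each is representative.

```python
import dask.array as da


def block_sizes_from_chunks(data):
    """Return (blockxsize, blockysize) from the first chunk of the last two dims."""
    y_chunks, x_chunks = data.chunks[-2], data.chunks[-1]
    return x_chunks[0], y_chunks[0]


# Example array laid out as (band, y, x).
data = da.zeros((3, 2048, 1024), chunks=(1, 512, 256))
blockxsize, blockysize = block_sizes_from_chunks(data)

# Options that could then be passed on when writing a tiled GeoTIFF.
creation_options = {
    "tiled": True,
    "blockxsize": blockxsize,
    "blockysize": blockysize,
}
```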
from rioxarray.
In point 1, doesn't blockxsize and blockysize assume tiling? What if I wanted a striped geotiff but was using dask arrays with 2D chunks?
For point 2, what "chunk" method are you referring to?
from rioxarray.
For point 1, it does assume tiling. With my current understanding I was thinking about how to optimize, but I am not entirely sure at the moment if it is a good idea or not. How often do you use a striped raster?
For point 2, when writing with dask chunks and you didn't start with a dask array: http://xarray.pydata.org/en/stable/generated/xarray.DataArray.chunk.html
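For example, a numpy-backed DataArray could be converted like this (a sketch; the 256x256 block size and the "y"/"x" dimension names are assumptions for illustration, and dask must be installed for chunk to work):

```python
import numpy as np
import xarray as xr

# Assumed block sizes; in practice these would come from the raster profile.
blockxsize, blockysize = 256, 256

arr = xr.DataArray(np.zeros((1024, 1024), dtype="uint8"), dims=("y", "x"))

# chunk() wraps the numpy data in a dask array with the requested chunk sizes.
chunked = arr.chunk({"y": blockysize, "x": blockxsize})
```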
from rioxarray.
Also, these could be defaults that could be overridden.
from rioxarray.
Point 1, ok. Since the default for gdal/rasterio is striped rasters that's what I typically do but I give users the option to switch to tiled if they'd like. I'd be ok with more libraries changing the default to tiled but I'd be curious about gdal's reasoning for not doing tiling by default. This would likely remove the issues I mentioned with GDAL's internal caching. The only thing I don't 100% agree with here is that the chunk size be used directly for block size. I think it could be used, but in most cases (in my experience) the dask array chunk size is much larger than "usual" tiled geotiffs (256x256, 512x512, etc); especially if you consider cloud optimized geotiffs as the "preferred" version of a geotiff.
Point 2, ah cool. I didn't know that existed.
from rioxarray.
> The only thing I don't 100% agree with here is that the chunk size be used directly for block size.
That is definitely a valid concern. In these cases, we would have to have a check for that. The block sizes for a geotiff for overviews are a power-of-two value between 64 and 4096 (https://gdal.org/drivers/raster/gtiff.html#overviews), but I am not sure what it is for normal tiffs. I have read that the TIFF spec states that tile width (and this goes for height, I presume) must be a multiple of 16.
So, it seems you can get away with larger block sizes as long as they are a power of 2. But, this does limit the allowed dask chunk sizes when writing to a file. Ideally the dask chunk size and the block size would be the same. Thoughts?
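The check could look something like this. It is a sketch that relies only on the multiple-of-16 rule quoted above; both function names are hypothetical.

```python
def is_valid_tile_size(size):
    """True if size is a positive multiple of 16."""
    return size > 0 and size % 16 == 0


def clamp_to_tile_size(chunk_size, multiple=16):
    """Largest multiple of `multiple` not exceeding chunk_size (at least one multiple)."""
    return max(multiple, (chunk_size // multiple) * multiple)
```

A dask chunk of 250 pixels would then map to a block size of 240, while 512 would be used as-is.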
from rioxarray.
Another option for optimizing writing would be to call the chunk function beforehand to better match the chunk sizes. But, it would be a good idea to profile that one.
from rioxarray.
Agree with everything you said. Perhaps if this method is defaulting to tiled geotiffs then it could have default block sizes of 256x256 (like GDAL does) and it could rechunk the data before writing it. If you set tiled=False then no rechunking is needed.
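Roughly what I have in mind, as a sketch: rechunk_for_write is a hypothetical helper, the last two dimensions are assumed to be (y, x), and whether the rechunk actually pays off would still need profiling.

```python
import dask.array as da


def rechunk_for_write(data, tiled=True, blocksize=256):
    """Rechunk the last two dims to blocksize when writing a tiled geotiff."""
    if not tiled:
        # Striped output: leave the chunks as they are.
        return data
    return data.rechunk({data.ndim - 2: blocksize, data.ndim - 1: blocksize})


data = da.zeros((1000, 1000), chunks=(500, 500))
tiled_data = rechunk_for_write(data)
striped_data = rechunk_for_write(data, tiled=False)
```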
Brain dump/side note: I wonder if rechunking for a striped geotiff to stripe size chunks would improve geotiff writing performance and final geotiff size. I'll have to try that in satpy some time.
from rioxarray.
Sounds good. Looks like we have enough here to tinker with some ideas later. It will be interesting to hear back about the stripsize tests as well.
from rioxarray.
This looks interesting: https://github.com/dymaxionlabs/dask-rasterio. Hasn't been updated in a while and only works with poetry. But, seems like a good additional reference.
from rioxarray.
One small difference I saw is the use of lock=True: https://github.com/dymaxionlabs/dask-rasterio/blob/8dd7fdece7ad094a41908c0ae6b4fe6ca49cf5e1/dask_rasterio/write.py#L35
Reference: https://docs.dask.org/en/latest/array-api.html#dask.array.store
from rioxarray.
Was planning to tinker around with this and noticed the GPL license in trollimage. As such, dask-rasterio is going to have to be the reference point for this since it has a BSD-2 Clause license.
from rioxarray.