kerchunk's People

Contributors

abarciauskas-bgse, agoodm, ajelenak, dcherian, dougiesquire, emfdavid, florianziemen, gadomski, hectorbaril, ianthomas23, ivirshup, jkellndorfer, jrbourbeau, keewis, keltonhalbert, lmmx, lsterzinger, martindurant, maxrjones, mpiannucci, norlandrhagen, peterm790, pl-marasco, raybellwaves, rbanick, richardscottoz, rsignell-usgs, tinaok, tomaugspurger, zdgriffith

kerchunk's Issues

Avoiding metadata bloat caused by many long URLs

I woke up this morning wondering whether it would be possible to allow a variable to be defined in the reference file spec. I'm worried about long S3 (or other) URLs bloating the metadata.

If we could do something like:

prefix001 = 'first/superlongurl/that/keeps/going/on/and/on/for/ever/to/some/dir'
prefix002 = 'second/superlongurl/that/keeps/going/on/and/on/for/ever/to/some/dir'

"key1": ["s3://$prefix001/data001.nc", 10000, 100],
"key2": ["s3://$prefix001/data001.nc", 10100, 100],
"key3": ["s3://$prefix002/data001.nc", 10000, 100],
"key4": ["s3://$prefix002/data001.nc", 10100, 100]

we could reduce the bloat considerably.
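A minimal sketch of how this could look using the V1 spec's "templates" section (the same mechanism shown in the examples further down this page); the names p1/p2 and the dict layout are illustrative, not normative:

refs = {
    "version": 1,
    "templates": {
        # each long prefix is written once...
        "p1": "s3://first/superlongurl/that/keeps/going/on/and/on/for/ever/to/some/dir",
        "p2": "s3://second/superlongurl/that/keeps/going/on/and/on/for/ever/to/some/dir",
    },
    "refs": {
        # ...and substituted into every chunk reference
        "key1": ["{{p1}}/data001.nc", 10000, 100],
        "key2": ["{{p1}}/data001.nc", 10100, 100],
        "key3": ["{{p2}}/data001.nc", 10000, 100],
        "key4": ["{{p2}}/data001.nc", 10100, 100],
    },
}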

fletcher32 checksum

All datasets with fletcher32 == True are rejected by the reference maker. This is the checksum filter supported by HDF5.
Would it be possible to add it to numcodecs? @ajelenak
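For illustration, a minimal sketch of what a numcodecs-style codec for this filter might look like; the class name and behaviour are assumptions for discussion, not an existing numcodecs API. The HDF5 fletcher32 filter appends a 4-byte checksum to each chunk, so a read-only decoder can simply strip (or verify) that trailer:

import numpy as np
from numcodecs.abc import Codec

class Fletcher32Sketch(Codec):
    # Hypothetical codec: drop the 4-byte Fletcher-32 trailer that the HDF5
    # fletcher32 filter appends to every chunk (checksum verification omitted).
    codec_id = "hypothetical_fletcher32"

    def encode(self, buf):
        raise NotImplementedError("read-only sketch")

    def decode(self, buf, out=None):
        data = np.frombuffer(buf, dtype="uint8")[:-4]  # strip checksum bytes
        if out is not None:
            out.view("uint8")[: len(data)] = data
            return out
        return data.tobytes()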

rename to "kerchunk"

We need a better name for this project, in time for my PyData talk next week.
In doing the migration, I would include an alias Python package fsspec_reference_maker for a while, introduce a deprecation warning later, and eventually remove it. We could even keep releasing fsspec-reference-maker on PyPI for a while.

Also, there will be an fsspec github org, so this renamed project could be its first component.
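A minimal sketch of the alias-package idea (assumed layout, not an actual migration plan): the fsspec_reference_maker top-level module could simply re-export kerchunk and emit a warning.

# hypothetical fsspec_reference_maker/__init__.py after the rename
import warnings

from kerchunk import *  # noqa: F401,F403  re-export the renamed package

warnings.warn(
    "fsspec_reference_maker has been renamed to kerchunk; please update your imports",
    DeprecationWarning,
)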

Support RAMS Model Output

So I'm going to start putting together a way to get reference maker to work with the output from the model I use (https://github.com/RAMSmodel/rams). This is probably a very fringe use case, but if I get this working on my research data, I can spend a bit more time working on reference maker and justify it as research work 😉

There are a few things that might make this tricky.

phony dims

I added a patch to add phony dims if they weren't present already #45 (which fixed #41) but this seems to break on this dataset. Namely, the RAMS output has different dimensions for different variables. This wouldn't be a problem if 2D variables had dimension order ('x', 'y') and 3D had ('x', 'y', 'z'), meaning that via #45 'x' would always be assigned phony_dim_0, etc.

But the reverse is true for RAMS -- 2D variables are ordered ('x', 'y') and 3D variables are ordered ('z', 'x', 'y'), so in the current method phony_dim_0 is assigned to 'y' for a 2D variable and would be assigned to 'z' for a 3D variable. This wonky ordering has caused many headaches regarding compatibility with other software.

Reading the file header with ncdump, it's clear that netcdf is able to figure this out no problem:

netcdf ccn0-newnudge-A-2017-05-12-030000-g1 {
dimensions:
        phony_dim_0 = 96 ;
        phony_dim_1 = 96 ;
        phony_dim_2 = 200 ;
        phony_dim_3 = 2 ;
        phony_dim_4 = 1 ;
        phony_dim_5 = 5 ;
variables:
        float ACCPA(phony_dim_0, phony_dim_1) ;
        float ACCPD(phony_dim_0, phony_dim_1) ;
        float ACCPG(phony_dim_0, phony_dim_1) ;
        float ACCPH(phony_dim_0, phony_dim_1) ;
        float ACCPP(phony_dim_0, phony_dim_1) ;
        float ACCPR(phony_dim_0, phony_dim_1) ;
        float ACCPS(phony_dim_0, phony_dim_1) ;
        float AGGREGATET(phony_dim_2, phony_dim_0, phony_dim_1) ;
        float AGGRPRISSNOWT(phony_dim_2, phony_dim_0, phony_dim_1) ;
        float AGGRSELFPRIST(phony_dim_2, phony_dim_0, phony_dim_1) ;
        float AGGRSELFSNOWT(phony_dim_2, phony_dim_0, phony_dim_1) ;

And so I'm curious if there's a good way to replicate this behaviour in reference maker.
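One possible heuristic (a sketch only, not what reference maker currently does): assign phony dimension names per unique (axis length, repeat index) pair, so every axis of a given length maps to the same name regardless of its position in a variable's shape. On the ncdump listing above this reproduces phony_dim_0/phony_dim_1 for the 2D fields and phony_dim_2, phony_dim_0, phony_dim_1 for the 3D ones:

import h5py

def assign_phony_dims(path):
    # (axis length, repeat index within one variable) -> phony dim name
    dim_names = {}
    dims_for_var = {}
    with h5py.File(path, "r") as f:
        def visitor(name, obj):
            if isinstance(obj, h5py.Dataset):
                seen = {}   # repeats of the same length inside this variable
                names = []
                for length in obj.shape:
                    idx = seen.get(length, 0)
                    seen[length] = idx + 1
                    key = (length, idx)
                    if key not in dim_names:
                        dim_names[key] = f"phony_dim_{len(dim_names)}"
                    names.append(dim_names[key])
                dims_for_var[name] = tuple(names)
        f.visititems(visitor)
    return dim_names, dims_for_var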

Time var/dimension

There is no time variable or dimension in RAMS output; the only place the date/time is recorded is in the filename. I have created a hacky solution by running a file list through a glob pattern match and returning a list of np.datetime64 -- https://github.com/lsterzinger/ramslibs/blob/4ec5af45e31a217399e9175f1c9c434cf71942ad/ramslibs/data_tools.py#L247-L282 .

I have created another function to add this to a correctly named dimension in an xarray dataset via https://github.com/lsterzinger/ramslibs/blob/4ec5af45e31a217399e9175f1c9c434cf71942ad/ramslibs/data_tools.py#L285-L346
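For reference, a standalone sketch of the filename-to-time idea (the regex assumes RAMS names like ccn0-newnudge-A-2017-05-12-030000-g1.h5 and would need adjusting for other schemes):

import re
import numpy as np

def times_from_filenames(filenames):
    # pull "A-YYYY-MM-DD-HHMMSS" out of each RAMS filename
    pattern = re.compile(r"A-(\d{4})-(\d{2})-(\d{2})-(\d{2})(\d{2})(\d{2})")
    times = []
    for fn in filenames:
        y, mo, d, h, mi, s = pattern.search(fn).groups()
        times.append(np.datetime64(f"{y}-{mo}-{d}T{h}:{mi}:{s}"))
    return times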

Current error with fsspec-reference-maker

I have included a small sample data file, data.zip (initial t=0 file, so most variables are 0). When I try to run it through reference maker, I get the following:

from fsspec_reference_maker.hdf import SingleHdf5ToZarr
import fsspec
import xarray as xr

f = './ccn0-newnudge-A-2017-05-12-030000-g1.h5'

with fsspec.open(f, 'rb') as inf:
    ref = SingleHdf5ToZarr(inf, f, inline_threshold=300).translate()

fs = fsspec.filesystem('reference', fo=ref, remote_protocol='file')
ds = xr.open_dataset(fs.get_mapper(""), engine='zarr')
ValueError: conflicting sizes for dimension 'phony_dim_0': length 200 on 'AGGREGATET' and length 96 on {'phony_dim_0': 'ACCPA', 'phony_dim_1': 'ACCPA'}

Which is due to the ordering of phony dims in #45

Include storage options in references?

Currently, we include only the URLs of references in the generated JSON files. With fsspec/filesystem_spec#867 , this becomes more complicated, as we can have multiple remote filesystems, each with their own set of optional storage_options.

Since the notebook cells instantiating a ReferenceFileSystem were already getting rather large (although you can encode these in an Intake entry), perhaps we could encode these options directly into the references? On the downside, this makes it less convenient to relocate the original data by overriding URL templates - but I don't know whether that feature was being used in practice.
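Purely as a strawman, one hypothetical shape for carrying per-protocol options inside the references themselves; none of these field names exist in the current spec:

refs = {
    "version": 1,
    "remote_options": {                 # hypothetical section
        "s3": {"anon": True},
        "https": {"block_size": 2**20},
    },
    "refs": {
        "data/0.0": ["s3://bucket/file1.nc", 1000, 100],
        "data/1.0": ["https://server/file2.nc", 1000, 100],
    },
}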

Issue kerchunking NWM medium range netcdf via HTTPS

I'm trying to kerchunk the medium range netcdf files from the National Water Model, which are only accessible via HTTPS from NOMADS (not S3).

I can successfully process a few files, and then it fails with a cryptic error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_1913/3841065069.py in <module>
      1 for url in urls:
----> 2     gen_json(url)

/tmp/ipykernel_1913/476567446.py in gen_json(u)
      6         h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
      7         with fs2.open(outf, 'wb') as f:
----> 8             f.write(ujson.dumps(h5chunks.translate()).encode());

/home/conda/store/61285a2505cb2471f92bbdbcc3eccd36f27fdbd1be0fd3e864dac20b9c422482-pangeo/lib/python3.9/site-packages/kerchunk/hdf.py in translate(self)
     71         lggr.debug('Translation begins')
     72         self._transfer_attrs(self._h5f, self._zroot)
---> 73         self._h5f.visititems(self._translator)
     74         if self.inline > 0:
     75             self._do_inline(self.inline)

/home/conda/store/61285a2505cb2471f92bbdbcc3eccd36f27fdbd1be0fd3e864dac20b9c422482-pangeo/lib/python3.9/site-packages/h5py/_hl/group.py in visititems(self, func)
    610                 name = self._d(name)
    611                 return func(name, self[name])
--> 612             return h5o.visit(self.id, proxy)
    613 
    614     @with_phil

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5o.pyx in h5py.h5o.visit()

RuntimeError: Object visitation failed (wrong fractal heap header signature)

Do you think this is on the provider side or on our end?

Here's a notebook that should reproduce the issue:
https://nbviewer.org/gist/rsignell-usgs/2af84b60301a9c7c9a46aaef2b84d2fa

MultiZarrToZarr creating 5D references for 4D data

@martindurant, getting all NaN when trying to access data from the consolidated JSON file for 10 NetCDF files with variables that look like:

ds.water_u.shape
(10, 1, 1420, 1800)

Although the single JSON files work fine, the consolidated JSON is not working because it's creating 5D references for the 4D data.

This is particularly confusing because if we use the same xr.concat arguments on the remote NetCDF files, it works fine.

Reproducible Notebook example

Incorrect times after concatenating along `time`

This is a failing test case for concatenating multiple HDF5 files along a time coordinate

import numpy as np
import xarray as xr
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr  # fsspec_reference_maker.hdf in older releases

def test_times(tmpdir):
    lat = xr.DataArray(np.linspace(-90, 90, 10), dims=["lat"], name="lat")
    lon = xr.DataArray(np.linspace(-90, 90, 10), dims=["lon"], name="lon")
    time_attrs = {'axis': 'T', 'long_name': 'time', 'standard_name': 'time'}
    time1 = xr.DataArray(
        np.arange(-631108800000000000, -630158390000000000, 86400000000000).view("datetime64[ns]"),
        dims=["time"], name="time", attrs=time_attrs
    )

    x1 = xr.DataArray(
        np.zeros((12, 10, 10)),
        dims=["time", "lat", "lon"],
        coords={"time": time1, "lat": lat, "lon": lon},
        name="prcp",
    )
    url = str(tmpdir.join("x1.nc"))
    x1.to_netcdf(url, engine="h5netcdf")

    with fsspec.open(url) as f:
        h5chunks = SingleHdf5ToZarr(f, url)
        test_dict = h5chunks.translate()

    m = fsspec.get_mapper(
        "reference://",
        fo=test_dict,
    )
    result = xr.open_dataset(m, engine="zarr", backend_kwargs=dict(consolidated=False))
    expected = x1.to_dataset()
    xr.testing.assert_equal(result, expected)
>       xr.testing.assert_equal(result, expected)
E       AssertionError: Left and right Dataset objects are not equal
E
E       Differing coordinates:
E       L * time     (time) datetime64[ns] NaT ... 1950-01-12T12:00:00
E       R * time     (time) datetime64[ns] 1950-01-01T12:00:00 ... 1950-01-12T12:00:00
E       Differing data variables:
E       L   prcp     (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
E       R   prcp     (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0

This is probably #70. Looking into it today.

(edited to remove unnecessary combine; the problem starts in hdf)

Losing time coordinate

I'm trying to make references for OISST, and the time returned from combined refs is off.

Test generation script
import json

import fsspec
from fsspec_reference_maker.hdf import SingleHdf5ToZarr
from fsspec_reference_maker.combine import MultiZarrToZarr

# import xarray as xr

# from oisst_clim_daily.utils.getCurrentDownloads import (
#     getCurrentDownloads,
#     netcdf_links,
#     date_from_url,
# )

# this_year, this_month, past_year, past_month = getCurrentDownloads()

# urls = netcdf_links(this_year, this_month) + netcdf_links(past_year, past_month)

# print(urls)


urls = [
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210801.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210802.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210803.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210804.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210805.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210806.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210807.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210808.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210809.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210810.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210811.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210812.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210813.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210814.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210815.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210816.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210817.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210818.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210819_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210820_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210821_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210822_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210823_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210824_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210825_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210826_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210827_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210828_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210829_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210830_preliminary.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202108/oisst-avhrr-v02r01.20210831_preliminary.nc",
]

refs = []

for url in urls:
    with fsspec.open(url) as inf:
        h5chunks = SingleHdf5ToZarr(inf, url, inline_threshold=100)
        refs.append(h5chunks.translate())

with open("first.json", "w") as f:
    json.dump(refs[0], f)

mzz = MultiZarrToZarr(refs, remote_protocol="http", xarray_concat_args={"dim": "time"})
combined = mzz.translate(template_count=None)

with open("combined.json", "w") as f:
    json.dump(combined, f)

Trying to load the combined month's worth of data returns dates starting in 2065.

fs = fsspec.filesystem("reference", fo="./combined.json", skip_intance_cache=True, remote_protocol="http")
m = fs.get_mapper("")
ds_combined = xr.open_dataset(m, engine="zarr")
ds_combined

[screenshot: ds_combined repr with times starting in 2065]

Loading the first reference gets the right date.

fs = fsspec.filesystem("reference", fo="./first.json", skip_intance_cache=True, remote_protocol="http")
m = fs.get_mapper("")
ds_first = xr.open_dataset(m, engine="zarr")
ds_first

[screenshot: ds_first repr showing the correct date]

While I'm testing with preliminary data, it does not appear to change the results if I don't include those files.

Package versions
➜ pip list
Package                Version
---------------------- -----------
aiohttp                3.7.4.post0
alembic                1.7.1
aniso8601              7.0.0
argh                   0.26.2
asciitree              0.3.3
async-timeout          3.0.1
attrs                  21.2.0
beautifulsoup4         4.9.3
bleach                 4.1.0
bokeh                  2.3.3
brotlipy               0.7.0
cached-property        1.5.2
certifi                2021.5.30
cffi                   1.14.6
chardet                4.0.0
charset-normalizer     2.0.0
click                  7.1.2
cloudpickle            1.6.0
colorama               0.4.4
coloredlogs            14.0
croniter               1.0.15
cryptography           3.4.7
cytoolz                0.11.0
dagit                  0.12.8
dagster                0.12.8
dagster-graphql        0.12.8
dagster-postgres       0.12.8
dask                   2021.8.1
defusedxml             0.7.1
distributed            2021.8.1
docstring-parser       0.10
entrypoints            0.3
fasteners              0.16
Flask                  1.1.2
Flask-Cors             3.0.10
Flask-GraphQL          2.0.1
Flask-Sockets          0.2.1
fsspec                 2021.8.1
fsspec-reference-maker 0.0.2
geojson                2.5.0
gevent                 21.8.0
gevent-websocket       0.10.1
gql                    2.0.0
graphene               2.1.9
graphql-core           2.3.2
graphql-relay          2.0.1
graphql-server-core    1.2.0
graphql-ws             0.3.1
greenlet               1.1.1
grpcio                 1.38.1
grpcio-health-checking 1.38.1
h5netcdf               0.11.0
h5py                   3.4.0
HeapDict               1.0.1
humanfriendly          9.2
idna                   3.1
importlib-metadata     4.8.1
importlib-resources    5.2.2
ipython-genutils       0.2.0
itsdangerous           2.0.1
Jinja2                 2.11.3
jsonschema             3.2.0
jupyter-core           4.7.1
locket                 0.2.0
Mako                   1.1.5
MarkupSafe             2.0.1
mistune                0.8.4
monotonic              1.5
msgpack                1.0.2
multidict              5.1.0
nbconvert              5.6.0
nbformat               5.1.3
numcodecs              0.9.0
numpy                  1.21.2
olefile                0.46
packaging              21.0
pandas                 1.3.2
pandocfilters          1.4.2
partd                  1.2.0
pendulum               2.1.2
Pillow                 8.3.1
pip                    21.2.4
promise                2.3
protobuf               3.17.2
psutil                 5.8.0
psycopg2               2.9.1
psycopg2-binary        2.9.1
pycparser              2.20
Pygments               2.10.0
pyOpenSSL              20.0.1
pyparsing              2.4.7
pyrsistent             0.17.3
PySocks                1.7.1
python-dateutil        2.8.2
pytz                   2021.1
pytzdata               2020.1
PyYAML                 5.4.1
requests               2.26.0
Rx                     1.6.1
sentry-sdk             1.3.1
setuptools             57.4.0
Shapely                1.7.1
six                    1.16.0
sortedcontainers       2.4.0
soupsieve              2.0.1
SQLAlchemy             1.4.23
tabulate               0.8.9
tblib                  1.7.0
testpath               0.5.0
toolz                  0.11.1
toposort               1.6
tornado                6.1
tqdm                   4.62.2
traitlets              5.1.0
typing-compat          0.1.0
typing-extensions      3.10.0.0
tzlocal                1.5.1
ujson                  4.0.2
urllib3                1.26.6
watchdog               2.1.5
webencodings           0.5.1
Werkzeug               1.0.1
wheel                  0.37.0
xarray                 0.19.0
yarl                   1.6.3
zarr                   2.9.5
zict                   2.0.0
zipp                   3.5.0
zope.event             4.5.0
zope.interface         5.4.0

NWM data out of date

Current references for, e.g., nwm-forecast refer to paths like noaa-nwm-pds/nwm.20210701 , but (as of today) the oldest folder in the bucket is noaa-nwm-pds/nwm.20211128

cc @rsignell-usgs

Combining ReferenceFileSystem, dask, and async requests for HDF data

https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935
This blog post is great! There is huge potential in using unmodified HDF files in object storage and obtaining better read efficiency with fsspec and sidecar metadata.

Often existing HDF data has small chunks (the example in the blog post suggests an HDF chunk size of 10MB, but then uses dask with dask chunks specified as 30MB to speed up computations). It's not uncommon to come across files written with 1MB chunks either! But here are screenshots of the two cases from the blog post:

1. Dask chunks = HDF chunks: [screenshot from the blog post]

2. Dask chunks > HDF chunks: [screenshot from the blog post]

It's unclear to me whether fsspec async requests come into play by default in the case where dask chunks > HDF chunks, as suggested in this forum post (https://discourse.pangeo.io/t/understanding-async/1098/6). It seems like there would be a major performance boost in that case.

Perhaps @rsignell-usgs , @martindurant , @ajelenak , or @rabernat can clarify?

Coordinate somehow with Climate Forecast Aggregation conventions

The folks at CEDA have come out with a new standard that aims to solve the same problem as kerchunk for the many-netcdf use case:

The approach is different and more limited in scope. In particular, I'm not sure if that spec makes it possible to map individual compressed chunk locations.

Is there any interest in discussing with them and trying to achieve some degree of interoperability?

Update the required dependencies

What are the required dependencies of kerchunk? Right now https://github.com/fsspec/kerchunk/blob/6b57f0620c9e1e55206523ac54dadf60830b0935/requirements.txt lists only ujson and fsspec.

Working with HDF5 files (https://github.com/fsspec/kerchunk/blob/main/kerchunk/hdf.py) seems to require h5py, zarr (and numcodecs), and numpy.

combine requires those plus xarray and dask (since it uses chunks={}).

So I'd recommend adding the following requirements (sketched as optional extras after the list):

  • h5py
  • zarr
  • numcodecs
  • numpy
  • xarray
  • dask[array]
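A sketch of how that split might look as optional extras in a setup.py (packaging layout assumed for illustration, not the project's actual configuration):

from setuptools import setup

setup(
    name="kerchunk",
    install_requires=["fsspec", "ujson"],               # always needed
    extras_require={
        "hdf": ["h5py", "zarr", "numcodecs", "numpy"],  # kerchunk.hdf
        "combine": ["xarray", "dask[array]"],           # kerchunk.combine uses chunks={}
    },
)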

ReferenceFileSystem generation for Zarr

MultiZarrToZarr expects a collection of Zarr ReferenceFileSystem JSON paths. What steps are required to generate ReferenceFileSystem JSON from a Zarr directory or zip store, and is this something for kerchunk to support? Thanks!
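A rough sketch of what such a generator could do (assumed behaviour, not an existing kerchunk function): list the store's keys with fsspec, inline the small metadata files, and emit whole-object [url] references for the chunk keys.

import fsspec
from fsspec.core import url_to_fs

def zarr_to_references(url, storage_options=None):
    fs, path = url_to_fs(url, **(storage_options or {}))
    refs = {}
    for key in fs.find(path):
        rel = key[len(path):].lstrip("/")
        if rel.rsplit("/", 1)[-1] in (".zarray", ".zgroup", ".zattrs", ".zmetadata"):
            refs[rel] = fs.cat_file(key).decode()      # inline metadata as text
        else:
            refs[rel] = [fs.unstrip_protocol(key)]     # whole-object reference
    return refs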

Warnings rather than exits

In #38 I added error messages that were more specific so that we could more easily identify problems with various flavors of HDF5.

What I would really like is to turn these RuntimeErrors into warnings so we can finish processing files instead of exiting. IMHO, the same should happen for HDF5 data types that cannot be processed, e.g. variable-length strings...

One thing that might work is writing out a -1 for the size of the dataset so that someone could look at the JSON to find errors. Of course, there could be better approaches...
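To make the suggestion concrete, a tiny sketch of the warn-and-continue pattern; translate_one here stands in for whatever per-dataset translation the visitor performs and is hypothetical:

import warnings

def translate_safely(name, h5obj, translate_one):
    try:
        return translate_one(name, h5obj)
    except RuntimeError as exc:
        warnings.warn(f"Skipping {name}: {exc}")
        return {"size": -1}   # sentinel so the problem is visible in the JSON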

[Proposal] Allow for generated `[url]` reference (not only `[url, offset, length]`)

Proposal

Currently the gen field of V1 requires both offset and length and can only generate references of type [url, offset, length]. I think it would be useful to also generate the [url] mapping, e.g.:

  "gen": (optional, zero or more items) [
    "key": (required) jinja-str,
    "url": (required) jinja-str,
    "offset": (optional) jinja-str, // if provided, must also provide length
    "length": (optional) jinja-str,
    "dimensions": (required, one or more arbitrary keys) {
      "variable_name": (required) 
        {"start": (optional) int, "stop": (required) int, "step": (optional) int}
        OR
        [int, ...]
    }
  ],

Motivation

As an example, the DZI tile format is a pyramidal image format where each tile is a separate JPEG object with key {level}/{x}_{y}.jpeg. I have written a napari plugin to translate this format to Zarr on demand, but I experimented today with exporting this translation in some cases as an fsspec reference.

Using the gen field as described above let me express this translation very compactly:

{
  "version": "1.0",
  "templates": {
    "u": "https://raw.githubusercontent.com/visgl/deck.gl-data/master/website/image-tiles/moon.image/moon.image_files"
  },
  "gen": [
    {
      "key": "15/{{j}}.{{i}}",
      "url": "{{u}}/15/{{i}}_{{j}}.jpeg",
      "dimensions": {
        "i": { "stop": 46 },
        "j": { "stop": 46 }
      }
   }
   ...

but ultimately I had to expand the keys when writing the reference because this isn't spec compliant with V1.

{
 "version": 1,
 "templates": {
  "u": "https://raw.githubusercontent.com/visgl/deck.gl-data/master/website/image-tiles/moon.image/moon.image_files"
 },
 "refs": {
  "15/0.0": [
   "{{u}}/15/0_0.jpeg"
  ],
  "15/1.0": [
   "{{u}}/15/0_1.jpeg"
  ],
  "15/2.0": [
   "{{u}}/15/0_2.jpeg"
  ...

The former was ~3.2kb and the latter ~125kb.

Example

I generated a reference for this multiscale image, converted to DZI as a deck.gl Example.

import dask.array as da
import fsspec  # 0.9.0
import napari
import zarr
from imagecodecs.numcodecs import register_codecs

from napari_dzi_zarr import DZIStore

# add image codecs to zarr registry
register_codecs()

# write the reference
# url = 'https://raw.githubusercontent.com/visgl/deck.gl-data/master/website/image-tiles/moon.image/moon.image.dzi'
# DZIStore(url).write_fsspec(filename='moon-dzi.json')

# open references
store = fsspec.get_mapper(
    "reference://",
    fo='https://gist.githubusercontent.com/manzt/68fed819ebfbcc006e919f2c6c296972/raw/d27af0abf0d89e91652f24a64312807146772f3b/moon-dzi.json',
    target_protocol="http",
)

grp = zarr.open(store)
pyramid = [
    da.from_zarr(store, component=d["path"])
    for d in grp.attrs["multiscales"][0]["datasets"]
]

viewer = napari.Viewer()
viewer.add_image(pyramid)
napari.run()

[screenshot: napari viewer displaying the image pyramid]

Note: edge tiles in DZI aren't consistently sized, so I only translate the inner tiles in the spec (meaning zarr client will fill-zero edge tiles when fetching these data)

This isn't central to my day to day work, but I assume this feature could help address the use case in zarr-developers/zarr-python#631, where separate objects can be mapped to Zarr in a predictable way.

NWM2.1 reanalysis question

Hi, I'm new to kerchunk. I followed the example (see below) to convert the NWM S3 files to Zarr.
It worked well for files before 2007.
After 2007, I got an error saying the file signature was not found:

File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 96, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

It turns out after 2007 NWM2.1 reanalysis data are not stored as h5 files (they are netCDF files). I wonder if you know of a way of getting around that.

Thanks,
Alex

#my code starts here =============================================
from kerchunk.hdf import SingleHdf5ToZarr 
from kerchunk.combine import MultiZarrToZarr
import fsspec

def testme():
    #u='s3://noaa-nwm-retrospective-2-1-pds/forcing/1996/199602182000.LDASIN_DOMAIN1' #this works
    u='s3://noaa-nwm-retrospective-2-1-pds/forcing/2007/2007010100.LDASIN_DOMAIN1'   # but this does not work
    so = dict(
        mode="rb", anon=True, default_fill_cache=False, default_cache_type="none"
    )
    with fsspec.open(u, **so) as inf:
        print (u)
        h5chunks = SingleHdf5ToZarr(inf, u, inline_threshold=300)

if __name__ == '__main__':
    testme()

Could we handle geotiff?

Was chatting with @jkellndorfer and currently a lot of folks who work with geotiff use VRT to create virtual datasets (mosaics, time stacks, etc).

Could we do the same with fsspec-reference-maker, thereby allowing folks to read from large virtual collections and avoiding GDAL?

Seems that would be very cool.

Creating fsspec mapper from ReferenceFileSystem using templates is slow

Hello,

I recently added the ability to create ReferenceFileSystem JSON files for TIFF files to the tifffile library.

I noticed that using fsspec.get_mapper with ReferenceFileSystem version 1 and templates is much slower than version 0 without templates. For the moderately sized file in the example below, the difference is a few seconds. For files containing millions of chunks, the time to read version 1 files with templates is minutes. Am I missing something?

Related to the efficiency of ReferenceFileSystem: the JSON files can get very large, more than 2 GB in some cases. Maybe a binary/database format (sqlite?) or a git repo/bundle could be more efficient than JSON?

Example:

import fsspec
import zarr
import tifffile
from imagecodecs.numcodecs import register_codecs

# http://openslide.cs.cmu.edu/download/openslide-testdata/Aperio/CMU-1.svs
# a multiscales dataset containing 24813 JPEG compressed chunks in TIFF
filename = 'CMU-1.svs'
url = 'http://localhost:8080/scratch/'

register_codecs()  # required for decoding JPEG chunks

# create fsspec ReferenceFileSystem, version 0 and 1
with tifffile.imread(filename, aszarr=True) as store:

    with tifffile.Timer('store.write_fsspec v0'):
        store.write_fsspec(f'{filename}.v0.json', url, version=0)

    with tifffile.Timer('store.write_fsspec v1'):
        store.write_fsspec(f'{filename}.v1.json', url, version=1)

# create fsspec mappers from ReferenceFileSystem
with tifffile.Timer('fsspec.get_mapper v0'):
    mapper_v0 = fsspec.get_mapper(
        'reference://', fo=f'{url}/{filename}.v0.json', target_protocol='http'
    )

with tifffile.Timer('fsspec.get_mapper v1'):
    # this is slow
    mapper_v1 = fsspec.get_mapper(
        'reference://', fo=f'{url}/{filename}.v1.json', target_protocol='http'
    )

# create zarr Group from mappers
with tifffile.Timer('zarr.open v0'):
    data_v0 = zarr.open(mapper_v0, mode='r')

with tifffile.Timer('zarr.open v1'):
    data_v1 = zarr.open(mapper_v1, mode='r')

# visualize zarr Group in napari
data = data_v1

if isinstance(data, zarr.Group):
    data = [
        data[int(dataset['path'])]
        for dataset in data.attrs['multiscales'][0]['datasets']
    ]

import napari

with napari.gui_qt():
    viewer = napari.Viewer()
    viewer.add_image(data, rgb=True, contrast_limits=[0, 255])

Output:

store.write_fsspec v0 0.358685 s
store.write_fsspec v1 0.344329 s
fsspec.get_mapper v0 2.670161 s
fsspec.get_mapper v1 9.135032 s
zarr.open v0 0.002924 s
zarr.open v1 0.001974 s

CMU-1.svs.v0.json:

{
 ".zgroup": "{\n \"zarr_format\": 2\n}",
 ".zattrs": "{\n \"multiscales\": [\n  {\n   \"datasets\": [\n    {\n     \"path\": \"0\"\n    },\n    {\n     \"path\": \"1\"\n    },\n    {\n     \"path\": \"2\"\n    }\n   ],\n   \"metadata\": {},\n   \"name\": \"Baseline\",\n   \"version\": \"0.1\"\n  }\n ]\n}",
 "0/.zarray": "{\n \"chunks\": [\n  256,\n  256,\n  3\n ],\n \"compressor\": {\n  \"bitspersample\": 8,\n  \"colorspace_data\": 2,\n  \"colorspace_jpeg\": 2,\n  \"header\": null,\n  \"id\": \"imagecodecs_jpeg\",\n  \"tables\": \"/9j/2wBDABsSFBcUERsXFhceHBsgKEIrKCUlKFE6PTBCYFVlZF9VXVtqeJmBanGQc1tdhbWGkJ6jq62rZ4C8ybqmx5moq6T/xAAfAAABBQEBAQEBAQAAAAAAAAAAAQIDBAUGBwgJCgv/xAC1EAACAQMDAgQDBQUEBAAAAX0BAgMABBEFEiExQQYTUWEHInEUMoGRoQgjQrHBFVLR8CQzYnKCCQoWFxgZGiUmJygpKjQ1Njc4OTpDREVGR0hJSlNUVVZXWFlaY2RlZmdoaWpzdHV2d3h5eoOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4eLj5OXm5+jp6vHy8/T19vf4+fr/2Q==\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  32914,\n  46000,\n  3\n ],\n \"zarr_format\": 2\n}",
 "1/.zarray": "{\n \"chunks\": [\n  256,\n  256,\n  3\n ],\n \"compressor\": {\n  \"bitspersample\": 8,\n  \"colorspace_data\": 2,\n  \"colorspace_jpeg\": 2,\n  \"header\": null,\n  \"id\": \"imagecodecs_jpeg\",\n  \"tables\": \"/9j/2wBDAAsICAoIBwsKCQoNDAsNERwSEQ8PESIZGhQcKSQrKigkJyctMkA3LTA9MCcnOEw5PUNFSElIKzZPVU5GVEBHSEX/xAAfAAABBQEBAQEBAQAAAAAAAAAAAQIDBAUGBwgJCgv/xAC1EAACAQMDAgQDBQUEBAAAAX0BAgMABBEFEiExQQYTUWEHInEUMoGRoQgjQrHBFVLR8CQzYnKCCQoWFxgZGiUmJygpKjQ1Njc4OTpDREVGR0hJSlNUVVZXWFlaY2RlZmdoaWpzdHV2d3h5eoOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4eLj5OXm5+jp6vHy8/T19vf4+fr/2Q==\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  8228,\n  11500,\n  3\n ],\n \"zarr_format\": 2\n}",
 "2/.zarray": "{\n \"chunks\": [\n  256,\n  256,\n  3\n ],\n \"compressor\": {\n  \"bitspersample\": 8,\n  \"colorspace_data\": 2,\n  \"colorspace_jpeg\": 2,\n  \"header\": null,\n  \"id\": \"imagecodecs_jpeg\",\n  \"tables\": \"/9j/2wBDAAYEBAUEBAYFBQUGBgYHCQ4JCQgICRINDQoOFRIWFhUSFBQXGiEcFxgfGRQUHScdHyIjJSUlFhwpLCgkKyEkJST/xAAfAAABBQEBAQEBAQAAAAAAAAAAAQIDBAUGBwgJCgv/xAC1EAACAQMDAgQDBQUEBAAAAX0BAgMABBEFEiExQQYTUWEHInEUMoGRoQgjQrHBFVLR8CQzYnKCCQoWFxgZGiUmJygpKjQ1Njc4OTpDREVGR0hJSlNUVVZXWFlaY2RlZmdoaWpzdHV2d3h5eoOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4eLj5OXm5+jp6vHy8/T19vf4+fr/2Q==\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  2057,\n  2875,\n  3\n ],\n \"zarr_format\": 2\n}",
 "0/0.0.0": ["http://localhost:8080/scratch/CMU-1.svs", 16, 2383],
 "0/0.1.0": ["http://localhost:8080/scratch/CMU-1.svs", 347589, 2432],
 "0/0.2.0": ["http://localhost:8080/scratch/CMU-1.svs", 700119, 2687],
<snip>
 "2/8.10.0": ["http://localhost:8080/scratch/CMU-1.svs", 177079453, 4246],
 "2/8.11.0": ["http://localhost:8080/scratch/CMU-1.svs", 177083699, 4193]
}

CMU-1.svs.v1.json:

{
 "version": 1,
 "templates": {
  "u": "http://localhost:8080/scratch/CMU-1.svs"
 },
 "gen": [],
 "refs": {
  ".zgroup": "{\n \"zarr_format\": 2\n}",
  ".zattrs": "{\n \"multiscales\": [\n  {\n   \"datasets\": [\n    {\n     \"path\": \"0\"\n    },\n    {\n     \"path\": \"1\"\n    },\n    {\n     \"path\": \"2\"\n    }\n   ],\n   \"metadata\": {},\n   \"name\": \"Baseline\",\n   \"version\": \"0.1\"\n  }\n ]\n}",
  "0/.zarray": "{\n \"chunks\": [\n  256,\n  256,\n  3\n ],\n \"compressor\": {\n  \"bitspersample\": 8,\n  \"colorspace_data\": 2,\n  \"colorspace_jpeg\": 2,\n  \"header\": null,\n  \"id\": \"imagecodecs_jpeg\",\n  \"tables\": \"/9j/2wBDABsSFBcUERsXFhceHBsgKEIrKCUlKFE6PTBCYFVlZF9VXVtqeJmBanGQc1tdhbWGkJ6jq62rZ4C8ybqmx5moq6T/xAAfAAABBQEBAQEBAQAAAAAAAAAAAQIDBAUGBwgJCgv/xAC1EAACAQMDAgQDBQUEBAAAAX0BAgMABBEFEiExQQYTUWEHInEUMoGRoQgjQrHBFVLR8CQzYnKCCQoWFxgZGiUmJygpKjQ1Njc4OTpDREVGR0hJSlNUVVZXWFlaY2RlZmdoaWpzdHV2d3h5eoOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4eLj5OXm5+jp6vHy8/T19vf4+fr/2Q==\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  32914,\n  46000,\n  3\n ],\n \"zarr_format\": 2\n}",
  "1/.zarray": "{\n \"chunks\": [\n  256,\n  256,\n  3\n ],\n \"compressor\": {\n  \"bitspersample\": 8,\n  \"colorspace_data\": 2,\n  \"colorspace_jpeg\": 2,\n  \"header\": null,\n  \"id\": \"imagecodecs_jpeg\",\n  \"tables\": \"/9j/2wBDAAsICAoIBwsKCQoNDAsNERwSEQ8PESIZGhQcKSQrKigkJyctMkA3LTA9MCcnOEw5PUNFSElIKzZPVU5GVEBHSEX/xAAfAAABBQEBAQEBAQAAAAAAAAAAAQIDBAUGBwgJCgv/xAC1EAACAQMDAgQDBQUEBAAAAX0BAgMABBEFEiExQQYTUWEHInEUMoGRoQgjQrHBFVLR8CQzYnKCCQoWFxgZGiUmJygpKjQ1Njc4OTpDREVGR0hJSlNUVVZXWFlaY2RlZmdoaWpzdHV2d3h5eoOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4eLj5OXm5+jp6vHy8/T19vf4+fr/2Q==\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  8228,\n  11500,\n  3\n ],\n \"zarr_format\": 2\n}",
  "2/.zarray": "{\n \"chunks\": [\n  256,\n  256,\n  3\n ],\n \"compressor\": {\n  \"bitspersample\": 8,\n  \"colorspace_data\": 2,\n  \"colorspace_jpeg\": 2,\n  \"header\": null,\n  \"id\": \"imagecodecs_jpeg\",\n  \"tables\": \"/9j/2wBDAAYEBAUEBAYFBQUGBgYHCQ4JCQgICRINDQoOFRIWFhUSFBQXGiEcFxgfGRQUHScdHyIjJSUlFhwpLCgkKyEkJST/xAAfAAABBQEBAQEBAQAAAAAAAAAAAQIDBAUGBwgJCgv/xAC1EAACAQMDAgQDBQUEBAAAAX0BAgMABBEFEiExQQYTUWEHInEUMoGRoQgjQrHBFVLR8CQzYnKCCQoWFxgZGiUmJygpKjQ1Njc4OTpDREVGR0hJSlNUVVZXWFlaY2RlZmdoaWpzdHV2d3h5eoOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4eLj5OXm5+jp6vHy8/T19vf4+fr/2Q==\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  2057,\n  2875,\n  3\n ],\n \"zarr_format\": 2\n}",
  "0/0.0.0": ["{{u}}", 16, 2383],
  "0/0.1.0": ["{{u}}", 347589, 2432],
  "0/0.2.0": ["{{u}}", 700119, 2687],
<snip>
  "2/8.10.0": ["{{u}}", 177079453, 4246],
  "2/8.11.0": ["{{u}}", 177083699, 4193]
 }
}

Formats to consider

The following formats can currently be processed:

  • HDF5
  • tiff
  • grib2

Please comment about other file formats that might be interesting to this project. Have they been translated to large zarr datasets in the past? Do they have internal byte-ranges containing encoded C-array-buffers?

Possible candidates I happen to have worked with:

  • dicom and nifti (medical imaging). The former commonly uses jpeg encoding, should be fine with imagecodecs
  • FITS (astro) image/array extensions. Would only work where the whole file is not gzip-compressed.

Add support for H5Z_FILTER_SHUFFLE for GZIP/ZLIB compressed data

I have started testing out this approach on some of the Australian Integrated Marine Observing System datasets that are stored in NetCDF format on S3 (eg. s3://imos-data/IMOS/SRS/OC/gridded/aqua/P1D/2014/08/A.P1D.20140801T053000Z.aust.K_490.nc )

This dataset utilises gzip compression with the H5Z_FILTER_SHUFFLE filter applied. I have looked over the upstream code dependencies and would appreciate some advice on the pathway forward.

Currently the numcodecs gzip and zlib Codecs do not support a shuffle option. The code for the shuffle is pretty straightforward, but will likely be very slow in Python if not compiled: https://github.com/HDFGroup/hsds/blob/03890edfa735cc77da3bc06f6cf5de5bd40d1e23/hsds/util/storUtil.py#L43
numcodecs uses Cython for compiled code rather than numba.
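For context, a pure-numpy sketch of undoing the HDF5 byte-shuffle (slow, and it assumes the buffer length is an exact multiple of the element size): the shuffled chunk stores the i-th byte of every element contiguously, so decoding is a reshape and transpose.

import numpy as np

def unshuffle(buf, elementsize):
    # shuffled layout: elementsize rows of nelems bytes; transpose to undo
    arr = np.frombuffer(buf, dtype="uint8")
    nelems = arr.size // elementsize
    return arr.reshape(elementsize, nelems).T.reshape(-1).tobytes()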

I am keen to help get this sorted and suggest one possible way forward could be:

I'd appreciate advice from some of the more experienced devs here (@martindurant, @ajelenak and @rabernat) on whether you think this is a reasonable way forward.

Thanks for everyone's effort on this!

missing object_codec for object array

I am trying to create references for this netCDF file, which is a collection of ~30 xarray Datasets. Each Dataset has 8 HDF5 datasets. This file was written with xarray.

I make it through the root group and the first dataset, then I get this error on the second dataset in the group:
Exception has occurred: ValueError
missing object_codec for object array
File "/Users/metadatagamechanger/GitRepositories/fsspec-reference-maker/fsspec_reference_maker/hdf.py", line 177, in _translator
za = self._zroot.create_dataset(h5obj.name, shape=h5obj.shape,
File "/Users/metadatagamechanger/GitRepositories/fsspec-reference-maker/fsspec_reference_maker/hdf.py", line 71, in translate
self._h5f.visititems(self._translator)
File "/Users/metadatagamechanger/Library/Mobile Documents/comappleCloudDocs/Documents/MetadataGameChanger/ProjectsAndPlans/UNAVCO/CommonContainer/referenceFileSystem/makeReferences.py", line 10, in
h5t = h5chunks.translate()

On line: za = self._zroot.create_dataset(h5obj.name, shape=h5obj.shape,
dtype=h5obj.dtype,
chunks=h5obj.chunks or False,
fill_value=h5obj.fillvalue,
compression=compression,
filters=filters,
overwrite=True)

The json so far ends with the reference to the data in the first dataset. The h5ls for the dataset the error comes from is:
/AMC2/Solution Dataset {7025/7025}
Attribute: DIMENSION_LIST {1}
Type: variable length of
object reference
Data: (DATASET-1:529616)
Attribute: _Netcdf4Dimid scalar
Type: native int
Data: 2
Location: 1:932878
Links: 1
Storage: 56200 logical bytes, 112400 allocated bytes, 50.00% utilization
Type: variable-length null-terminated UTF-8 string
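For anyone hitting the same message, a minimal illustration of what zarr is complaining about: object-dtype arrays (here the variable-length UTF-8 strings shown in the h5ls output) need an explicit object_codec, which the translator does not currently pass:

import zarr
import numcodecs

g = zarr.open("example.zarr", mode="w")
g.create_dataset("ok", shape=(10,), dtype=object,
                 object_codec=numcodecs.VLenUTF8())    # works
# g.create_dataset("fails", shape=(10,), dtype=object) # ValueError: missing object_codec for object array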

Creating reference JSONS from netCDF file on Azure Blob hangs

I'm trying to use ReferenceMaker on files stored on Azure Blob storage after successfully doing the same for the same files on AWS S3. I'm using the following code:

import fsspec
from fsspec_reference_maker.hdf import SingleHdf5ToZarr
import json

url = 'az://noaa-goes16/ABI-L2-MCMIPF/2020/002/00/OR_ABI-L2-MCMIPF-M6_G16_s20200020000216_e20200020009524_c20200020010031.nc'

so = dict(
         mode="rb", anon=True, default_fill_cache=False, default_cache_type="none",
)

with fsspec.open(url, **so, account_name='goes') as inf:
        h5chunks = SingleHdf5ToZarr(inf, url)
        with open("test.json", 'w') as outf: 
            outf.write(json.dumps(h5chunks.translate()))

But this seems to hang forever (or at the very least take longer than 20 minutes for a single file, after which I cancelled). This might be an issue with fsspec itself, as just opening a file with fsspec and xarray takes 10x as long on Azure as on AWS S3. See my related issue: fsspec/filesystem_spec#681

release?

It would be great to have a release soon so we can more easily install into pangeo forge containers.

xarray dimensions error when opening hdf5 reference with groups

Here is a zip of the data file and a reference JSON to the same file in Azure.

Opening the attached file works locally with xarray, provided the group is specified:

xr.open_dataset(
    "./VNP14A1.A2020001.h08v04.001.2020003132203.h5",
    group="HDFEOS/GRIDS/VNP14A1_Grid/Data Fields"
)

But when opening a reference to this file on Azure (test.json)

fs1 = fsspec.filesystem('reference', fo='test.json', 
                        remote_protocol='az', remote_options={
                            'account_name' : 'modissa'
                        })

ds = xr.open_dataset(fs1.get_mapper("HDFEOS/GRIDS/VNP14A1_Grid/Data Fields"), engine='zarr')

Yields the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-a8b2cd409cf2> in <module>
      4                         })
      5 
----> 6 ds = xr.open_dataset(fs1.get_mapper("HDFEOS/GRIDS/VNP14A1_Grid/Data Fields"), engine='zarr')

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    494 
    495     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 496     backend_ds = backend.open_dataset(
    497         filename_or_obj,
    498         drop_variables=drop_variables,

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, consolidate_on_close, chunk_store, storage_options, lock)
    745         store_entrypoint = StoreBackendEntrypoint()
    746         with close_on_error(store):
--> 747             ds = store_entrypoint.open_dataset(
    748                 store,
    749                 mask_and_scale=mask_and_scale,

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/store.py in open_dataset(self, store, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
     20         decode_timedelta=None,
     21     ):
---> 22         vars, attrs = store.load()
     23         encoding = store.get_encoding()
     24 

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/common.py in load(self)
    122         """
    123         variables = FrozenDict(
--> 124             (_decode_variable_name(k), v) for k, v in self.get_variables().items()
    125         )
    126         attributes = FrozenDict(self.get_attrs())

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/zarr.py in get_variables(self)
    376 
    377     def get_variables(self):
--> 378         return FrozenDict(
    379             (k, self.open_store_variable(k, v)) for k, v in self.ds.arrays()
    380         )

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/utils.py in FrozenDict(*args, **kwargs)
    444 
    445 def FrozenDict(*args, **kwargs) -> Frozen:
--> 446     return Frozen(dict(*args, **kwargs))
    447 
    448 

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/zarr.py in <genexpr>(.0)
    377     def get_variables(self):
    378         return FrozenDict(
--> 379             (k, self.open_store_variable(k, v)) for k, v in self.ds.arrays()
    380         )
    381 

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/backends/zarr.py in open_store_variable(self, name, zarr_array)
    373             attributes["_FillValue"] = zarr_array.fill_value
    374 
--> 375         return Variable(dimensions, data, attributes, encoding)
    376 
    377     def get_variables(self):

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/variable.py in __init__(self, dims, data, attrs, encoding, fastpath)
    313         """
    314         self._data = as_compatible_data(data, fastpath=fastpath)
--> 315         self._dims = self._parse_dimensions(dims)
    316         self._attrs = None
    317         self._encoding = None

/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/variable.py in _parse_dimensions(self, dims)
    572         dims = tuple(dims)
    573         if len(dims) != self.ndim:
--> 574             raise ValueError(
    575                 f"dimensions {dims} must have the same length as the "
    576                 f"number of data dimensions, ndim={self.ndim}"

ValueError: dimensions () must have the same length as the number of data dimensions, ndim=2

Variable encoding lost in MultiZarrToZarr

I have been playing around with fsspec-reference-maker in developing a tutorial example for pangeo-forge/pangeo-forge-recipes#174.

I have discovered a problem related to how encoding is handled. Basically, when two variables in different files have different encoding, the MultiZarrToZarr reference filesystem assumes that all variables have the same encoding. However, the raw data have not been changed, so the incorrect encoding is applied.

This may seem like a niche case, but it is actually very common for timeseries netCDF files to have their time dimension encoded as an int with units days since X, where X is different in each file.

The following example is a simple standalone reproducer. The key point is that the two original datasets have the same raw data in the time coord, but different attributes.

import os
import json

import xarray as xr
import fsspec
from fsspec_reference_maker.combine import MultiZarrToZarr
from fsspec_reference_maker.hdf import SingleHdf5ToZarr

ds1 = xr.DataArray(
    [0, 0, 0],
    name='foo',
    dims=['time'],
    coords={
        'time': (
            ['time'],
            [1, 2, 3], 
            {'units': 'days since 1900-01-01', 'calendar': 'noleap'}
        )
    }
).to_dataset()
ds1.to_netcdf('ds1.nc', mode='w')

ds2 = xr.DataArray(
    [1, 1, 1],
    name='foo',
    dims=['time'],
    coords={
        'time': (
            ['time'],
            [1, 2, 3], 
            {'units': 'days since 1900-01-04', 'calendar': 'noleap'}
        )
    }
).to_dataset()
ds2.to_netcdf('ds2.nc', mode='w')

fnames = ['ds1.nc', 'ds2.nc']
ds_concat =xr.open_mfdataset(fnames).load()


for fname in fnames:
    json_fname = os.path.basename(fname + ".json")
    url = 'file://' + os.path.abspath(fname)
    with fsspec.open(fname) as f:
        h5chunks = SingleHdf5ToZarr(f, url, inline_threshold=300)
        chunks = h5chunks.translate()
    with open(json_fname, mode='wt') as f_out:
        json.dump(chunks, f_out)

dsets = []
for fname in fnames:
    ref_url = 'file://' + os.path.abspath(fname + ".json")
    m = fsspec.get_mapper(
        "reference://",
        fo=ref_url,
        target_protocol="file",
        remote_protocol="s3",
        skip_instance_cache=True,
    )
    dsets.append(xr.open_dataset(m, engine='zarr', backend_kwargs={'consolidated': False}))

ds_concat_ref = xr.concat(dsets, dim='time')

json_files = ['file://' + os.path.abspath(fname + ".json") for fname in fnames]
mzz = MultiZarrToZarr(
    json_files,
    xarray_concat_args={'dim': 'time'},
    remote_protocol='file'
)
out = mzz.translate(None)
with open('combined.json', 'wt') as fp:
    json.dump(out, fp)

m_combined = fsspec.get_mapper(
    "reference://",
    fo='combined.json',
    target_protocol="file",
    remote_protocol="file",
    skip_instance_cache=True,
)
ds_combined = xr.open_dataset(m_combined, engine='zarr', backend_kwargs={'consolidated': False})

print(ds_concat.time)
print(ds_concat_ref.time)
print(ds_combined.time)

xr.testing.assert_equal(ds_concat.time, ds_concat_ref.time)
xr.testing.assert_equal(ds_concat.time, ds_combined.time)
AssertionError: Left and right DataArray objects are not equal

Differing values:
L
    array([cftime.DatetimeNoLeap(1900, 1, 2, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 3, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 4, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 5, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 6, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 7, 0, 0, 0, 0, has_year_zero=True)],
          dtype=object)
R
    array([cftime.DatetimeNoLeap(1900, 1, 3, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 4, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 5, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 3, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 4, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 5, 0, 0, 0, 0, has_year_zero=True)],
          dtype=object)
Differing coordinates:
L * time     (time) object 1900-01-02 00:00:00 ... 1900-01-07 00:00:00
R * time     (time) object 1900-01-03 00:00:00 ... 1900-01-05 00:00:00

Digging deeper, we can see that each individual SingleHdf5ToZarr reference dataset preserves the correct encoding. But the combined dataset has a new encoding (determined by xarray) which doesn't match the actual encoding values on disk.

print(dsets[0].time.encoding)
print(dsets[1].time.encoding)
print(ds_combined.time.encoding)
{'chunks': (3,), 'preferred_chunks': {'time': 3}, 'compressor': None, 'filters': None, '_FillValue': -9223372036854775806, 'units': 'days since 1900-01-01', 'calendar': 'noleap', 'dtype': dtype('int64')}
{'chunks': (3,), 'preferred_chunks': {'time': 3}, 'compressor': None, 'filters': None, '_FillValue': -9223372036854775806, 'units': 'days since 1900-01-04', 'calendar': 'noleap', 'dtype': dtype('int64')}
{'chunks': (6,), 'preferred_chunks': {'time': 6}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': 0, 'units': 'days since 1900-01-02 00:00:00.000000', 'calendar': 'noleap', 'dtype': dtype('int64')}

This is a pretty tricky situation. The whole point of the reference filesystem is to avoid copying the data. But there is no way for an xarray dataset to use different encoding for different parts of the same array. So the best we could do in this case would be to raise an error when encoding is inconsistent. Alternatively, for small data (< inline_threshold), we could consider reencoding data with uniform encoding, like xarray does when it writes data to disk.
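A sketch of the "raise an error" option (helper name and checked keys chosen for illustration): compare the relevant encoding attributes of the coordinate across the single-file reference datasets before combining.

def check_consistent_encoding(dsets, var="time", keys=("units", "calendar")):
    first = {k: dsets[0][var].encoding.get(k) for k in keys}
    for ds in dsets[1:]:
        enc = {k: ds[var].encoding.get(k) for k in keys}
        if enc != first:
            raise ValueError(f"Inconsistent encoding for {var!r}: {first} vs {enc}")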

MultiZarrToZarr assumes equal file sizes

At https://github.com/intake/fsspec-reference-maker/blob/main/fsspec_reference_maker/combine.py#L275-L278 MultiZarrToZarr finds which axis has been concatenated along by looking for dimensions in ds (two files concatenated) that are twice the length of those in ds0.

This fails when the concatenated dims aren't of equal length across files; e.g. for daily data with one year per file, one file might have 365 values and the next 366 if it covers a leap year.

Possible alternatives (see the sketch after this list) could be:

  • Find dimensions where ds.dims[k] > ds0.dims[k]
  • Get the concat dims from self.concat_kwargs['dim']
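A sketch of the first alternative (function name is illustrative): detect the concatenated dimension(s) by comparing sizes rather than assuming an exact doubling.

def find_concat_dims(ds0, ds):
    # dims that grew between the single-file dataset and the combined one
    return [k for k in ds.dims if k in ds0.dims and ds.dims[k] > ds0.dims[k]]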

Performance issue with combined zarr sources

I am trying to use kerchunk to access a distributed collection of zarr arrays, parts of a large dataset split by the 'time' dimension.
Each zarr array is accessible through its own NFS-mounted directory on the machine where the code is running.
Here is the entire test code with some output in the comments:

from kerchunk.zarr import single_zarr
from kerchunk.combine import MultiZarrToZarr
import fsspec
import zarr
import xarray

prefix = '/home/idies/workspace/poseidon/'
suffix = '/llc4320_tests/10dayhourly/velocities'
combined_json = 'file:///home/idies/workspace/combined.json'

refs = []
for node in range(1, 11):
    for dir in range(1, 4):
        path = prefix + f'data{node:02}_{dir:02}' + suffix
        refs.append(single_zarr(path))
        print(path)

# /home/idies/workspace/poseidon/data01_01/llc4320_tests/10dayhourly/velocities
# /home/idies/workspace/poseidon/data01_02/llc4320_tests/10dayhourly/velocities
# /home/idies/workspace/poseidon/data01_03/llc4320_tests/10dayhourly/velocities
# /home/idies/workspace/poseidon/data02_01/llc4320_tests/10dayhourly/velocities
# ...

mz2z = MultiZarrToZarr(refs, remote_protocol='file', xarray_concat_args={'dim': 'time'})
mz2z.translate(combined_json)

mapper = fsspec.get_mapper('reference://', fo=combined_json)

# Testing...
ds1 = zarr.open(mapper, mode='r')
%timeit ds1['U'][0,0,0,:,:]
# 401 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ds2 = xarray.open_dataset(mapper, engine='zarr')
%timeit ds2['U'][0,0,0,:,:].values
# 402 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ds3 = zarr.open('/home/idies/workspace/poseidon/data01_01/llc4320_tests/10dayhourly/velocities', mode='r')
%timeit ds3['U'][0,0,0,:,:]
# 126 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The three read functions at the bottom are getting a slice that fits in a single zarr chunk of one source array, so in principle it can be read directly using zarr library functions, as shown with ds3. But the equivalent read operation on the kerchunk combined array takes much more time (see ds1 and ds2). It also takes more time than using a combined dataset produced by xarray.open_mfdataset (not shown here), which adds some performance overhead for using Dask, and is generally quite slow compared to pure zarr.open or xarray.open_dataset.

Is kerchunk expected to add that much overhead even when reading a single chunk from only one of the sources, or am I doing something wrong here?

Possible future use of offset references in indexing ROOT files

Reposting from my email to @martindurant:

This fsspec-reference-maker, which indexes byte positions in HDF5 files and serves the HDF5 files as Zarr data—such a thing could be useful for serving particle physics data. The ROOT files contain arrays at byte positions that can be pre-scanned and put in some sort of database. I did a test of this once with SkyHook, a UCSC research project (part of Ceph). Their preferred format for these byte positions was FlatBuffers.

https://github.com/diana-hep/uproot-skyhook

I'd be interested in revisiting this project (and possibly involve people who are more closely related to physics data management here). The complications are that (a) the data are compressed with zlib, lzma, lz4, or zstandard and (b) most data have nested structures, so it would be a transform-to-Awkward like the one you recently developed. The second point might force this to wait for interpretation-aware Zarr extensions. Issue (c) might be that some ROOT files can have astonishingly small chunks, like 10's of kilobytes sometimes, which requires a very compact handling of the byte position metadata. (They're not predictably distributed, either: it has to be a list of numbers.)

Martin's response:

Yes, we may well get to supporting binary storage for a set of offset chunks - I wasn't initially thinking of the millions of keys case. JSON is nice, though, because everyone will support it. So it may well be that, as well as a bunch of scanners for various file formats, there will also be a range of storage options, possibly including zarr itself.

a) the set of compressors doesn't sound like a problem, they are all supported by zarr (https://numcodecs.readthedocs.io/en/latest/index.html#contents ) and should work on any non-python implementations too
b) right: if zarr is your machine, then any structure not natively understood by zarr would need to be in an extension (v3) or informal convention (v2)
c) we can face that when we get to it. For POC, JSON should work. The ROOT stack is sufficiently deep and complex that if you can expose some examples via zarr/fsspec alone, that might be attractive, and then we can face the likely bottlenecks later. I quite like the idea of using zarr to store the offsets, if it's the intended loader anyway. The fsspec implementation takes either a JSON file or a dict/mapping, so you can already provide something dict-like which generates offsets or stores them in a compact way.
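
To illustrate that last point, here is a rough sketch (my own, not an existing API) of a dict-like object backed by compact numpy offset arrays rather than millions of JSON entries; the key naming ("branch/{i}") is made up, and whether the current ReferenceFileSystem consumes such a mapping lazily or copies it into a plain dict depends on the fsspec version:

from collections.abc import Mapping

import numpy as np


class CompactRefs(Mapping):
    """Behaves like {key: [url, offset, length]} without materialising the keys."""

    def __init__(self, url, offsets, lengths, prefix="branch"):
        self.url = url
        self.offsets = np.asarray(offsets, dtype="int64")
        self.lengths = np.asarray(lengths, dtype="int64")
        self.prefix = prefix

    def __getitem__(self, key):
        try:
            i = int(key.rsplit("/", 1)[-1])
            return [self.url, int(self.offsets[i]), int(self.lengths[i])]
        except (ValueError, IndexError):
            raise KeyError(key)

    def __iter__(self):
        return (f"{self.prefix}/{i}" for i in range(len(self.offsets)))

    def __len__(self):
        return len(self.offsets)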

PS: this conversation could be copied to the fsspec-reference-maker-repo too

Context: Awkward Array can be expressed in Zarr, but a conversion is needed. @martindurant did this here: https://github.com/martindurant/awkward_extras/tree/main/awkward_zarr

Trying out additional example files

This is really neat, and I'm excited to try things out with some additional HDF files!

I realize the goal is to flesh out the specification and this is not a general conversion tool yet, but it seems like working with more HDF files out in the wild might bring things to light.

Some initial questions/suggestions:

  1. How to write out the .zchunkstore? It seems things are currently set up just to output logging info (a minimal sketch of writing the references to JSON follows this list)
  2. Add chunk info output to logger (maybe dtype and MB too?) lggr.debug(f'_ARRAY_CHUNKS = {h5obj.chunks}')
  3. Could be useful to first check input file is valid H5. this is easy for a local file, not sure about remote:
        if not h5py.is_hdf5(f):
            raise ValueError('Not an hdf5 file')
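
On question 1, a minimal sketch of dumping the generated references to a JSON file; it uses the SingleHdf5ToZarr class name from later versions of this package (the API in the version this issue was filed against may differ), and the input URL is hypothetical:

import json

import fsspec
from fsspec_reference_maker.hdf import SingleHdf5ToZarr

url = "s3://bucket/path/file.h5"  # hypothetical input file
with fsspec.open(url, mode="rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

with open("file.h5.json", "w") as out:
    json.dump(refs, out)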

What isn't supported?
https://github.com/intake/fsspec-reference-maker/blob/bf41138add53b0201e583aa40840cd4fa5fb907b/fsspec_reference_maker/hdf.py#L103-L106

The first file I tried to generate a .zchunkstore for ran into the above; code and traceback below:

def ATL06_remote():
    return hdf2zarr.run(
        's3://its-live-data.jpl.nasa.gov/icesat2/alt06/rel003/ATL06_20181230162257_00340206_003_01.h5',
        mode='rb', anon=False, requester_pays=True,
        default_fill_cache=False, default_cache_type='none'
    )
DEBUG:h5-to-zarr:translator:Group: /gt1l/land_ice_segments
DEBUG:h5-to-zarr:translator:Dataset: /gt1l/land_ice_segments/atl06_quality_summary
Traceback (most recent call last):
  File "h5py/h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "/Users/scott/miniconda3/envs/fsspec-ref/lib/python3.8/site-packages/h5py/_hl/group.py", line 591, in proxy
    return func(name, self[name])
  File "/Users/scott/GitHub/fsspec-reference-maker/fsspec_reference_maker/hdf.py", line 105, in translator
    raise RuntimeError(
RuntimeError: /gt1l/land_ice_segments/atl06_quality_summary uses unsupported HDF5 filters

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./test.py", line 45, in <module>
    ATL06_remote()
  File "./test.py", line 15, in ATL06_remote
    return hdf2zarr.run(
  File "/Users/scott/GitHub/fsspec-reference-maker/fsspec_reference_maker/hdf.py", line 273, in run
    return h5chunks.translate()
  File "/Users/scott/GitHub/fsspec-reference-maker/fsspec_reference_maker/hdf.py", line 54, in translate
    self._h5f.visititems(self.translator)
  File "/Users/scott/miniconda3/envs/fsspec-ref/lib/python3.8/site-packages/h5py/_hl/group.py", line 592, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
SystemError: <built-in function visit> returned a result with an error set

Successful example of using MultiZarrToZarr

Not so much an issue, but just some examples of applying this to a large 30 year Sea Surface Temperature (SST) dataset (10576 x 4500 x 6000), chunked in (1, 2250, 3000).

I tested combining the single-time reference files into months and years, and found that, due to the template parsing overhead, monthly aggregates worked best as a compromise between granularity and parsing time. I also tested using zarr or json for the combined references and the difference is negligible: it comes down to storage space vs having a single container for the references. Possibly zipped json is the way to go?

Ultimately, with many small workers and the ReferenceFileSystem, extractions across time took ~15s, compared to many hours using THREDDS or netCDF directly.

This notebook demonstrates creating aggregated references at monthly and yearly timescales:
https://github.com/pbranson/pixeldrill/blob/main/notebooks/create_agg_index.ipynb
The chunking layout changes sporadically in 2016, so there is some bespoke code to detect these changes and stack the pieces together when found. When the full timeseries is recomposed, a simple sort by time re-orders the data.

This notebook demonstrates using 180 single core workers with 1GB memory per worker to load the data:
https://github.com/pbranson/pixeldrill/blob/main/notebooks/access_via_aggindex.ipynb

Thanks for everyone's work on this library!

Support for concat of files with unequal length in concat dim

@jbusecke and I are wondering if
https://github.com/intake/fsspec-reference-maker/blob/89d32322fdd3ba631a4fe846475a8857f10ad872/fsspec_reference_maker/combine.py#L313-L316

should instead check that v / ds0.dims[k] > 1.

We were working with the Pangeo Forge HDFReference recipe class today, and it seemed that this change was needed to concatenate a list of files with varying lengths in the time dimension.

Thoughts, @martindurant ?

cc @rabernat

inconsistent fill value on hdf5 files

For some reason I have files (model output) where the hdf5 fillvalue property is different from the _FillValue attribute:

f = h5py.File("<file>", mode="r")
f["latitude"].fillvalue != f["latitude"].attrs["_FillValue"]  # significantly different

While xarray's netcdf4 engine will ignore fillvalue and use only _FillValue (?), kerchunk will copy both over, translating fillvalue to zarr's fill_value:

kerchunk/kerchunk/hdf.py

Lines 186 to 192 in 1c6add1

za = self._zroot.create_dataset(h5obj.name, shape=h5obj.shape,
                                dtype=h5obj.dtype,
                                chunks=h5obj.chunks or False,
                                fill_value=h5obj.fillvalue,
                                compression=compression,
                                filters=filters,
                                overwrite=True)

However, if I open the resulting file the zarr engine will either complain about multiple fill values or the opened dataset won't have missing values (i.e. xr.testing.assert_identical(ds_nc, ds_zarr) fails but xr.testing.assert_identical(ds_nc, ds_zarr.where(ds_nc.notnull())) doesn't). Not sure why it doesn't always raise, though.

Am I missing something? If not, is there a way to tell kerchunk to use only _FillValue?
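
In the meantime, a post-processing sketch (not an existing kerchunk option) that rewrites the generated references so the zarr fill_value matches the _FillValue attribute; it assumes translate() returned either a flat reference dict or the versioned layout with a "refs" mapping:

import ujson


def prefer_fillvalue_attr(refs):
    store = refs.get("refs", refs)  # handle both flat and versioned layouts
    for key, value in list(store.items()):
        if not key.endswith("/.zarray"):
            continue
        zattrs_key = key[: -len(".zarray")] + ".zattrs"
        if zattrs_key not in store:
            continue
        zarray = ujson.loads(value)
        zattrs = ujson.loads(store[zattrs_key])
        if "_FillValue" in zattrs:
            zarray["fill_value"] = zattrs["_FillValue"]
            store[key] = ujson.dumps(zarray)
    return refs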

planned improvements

  • combine to support direct writing to remote
  • support multiple filesystems in ReferenceFileSystem (will be great for multi-resolution, where we can put downsampled images somewhere other than originals) fsspec/filesystem_spec#867
  • more hooks for custom functions to manage zarr objects and/or post-process output JSON (#122 )
  • combine2 and multi-axis concat? (parts of #83, #122)
  • consolidate reads in ReferenceFS, where possible, using code made for fsspec.parquet

--

two offsets or offset + size?

In most of original zarr issue discussion (zarr-developers/zarr-python#556), offsets were specified with offset, size. In this spec, you are doing start, stop. It seems like offset, size will probably be slightly smaller as a text representation. I'm curious how you made that choice?
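
For clarity, the same 100-byte chunk written both ways (my illustration, not taken from the spec):

offset_size = ["s3://bucket/path/file.nc", 1000, 100]   # offset + size
start_stop  = ["s3://bucket/path/file.nc", 1000, 1100]  # start + stop (exclusive)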

invalid jinja templates for a bigger number of files

When trying to open a combined JSON file for more than 1908 netcdf files with more than 5 variables (or when setting template_count to 0 in MultiZarrToZarr.translate), fsspec raises a jinja error. A script to reproduce this and the actual traceback are below:

script to generate, convert and combine netcdf files
import pathlib

import dask
import dask.diagnostics
import fsspec
import numpy as np
import ujson
import xarray as xr

from fsspec_reference_maker.combine import MultiZarrToZarr
from fsspec_reference_maker.hdf import SingleHdf5ToZarr


def gen_json(u, root):
    so = {"mode": "rb"}  # add storage options for other filesystem implementations
    with fsspec.open(u, **so) as inf:
        h5chunks = SingleHdf5ToZarr(inf, u, inline_threshold=300)

        with open(root / f"{pathlib.Path(u).name}.json", "wb") as outf:
            outf.write(ujson.dumps(h5chunks.translate()).encode())


nc_root = pathlib.Path("nc")
nc_root.mkdir(exist_ok=True)


def create_file(n):
    ds = xr.Dataset(
        data_vars={
            "a": (("x", "y"), np.ones((100, 1))),
            "b": (("y", "z"), np.zeros((1, 200))),
        },
        coords={
            "y": [n],
        },
    ).chunk({"x": 10})
    ds.to_netcdf(nc_root / f"{n:04d}.nc")


with dask.diagnostics.ProgressBar():
    _ = dask.compute(*[dask.delayed(create_file)(i) for i in range(1909)])

paths = sorted(nc_root.iterdir())

# create a directory to store the individual reference files
separate_root = pathlib.Path("separate")
separate_root.mkdir(exist_ok=True)

# generate the individual reference files
with dask.diagnostics.ProgressBar():
    _ = dask.compute(*[dask.delayed(gen_json)(str(p), separate_root) for p in paths])


def combine_json(paths, outpath):
    mzz = MultiZarrToZarr(
        paths,
        remote_protocol="file",
        xarray_open_kwargs={
            "decode_cf": False,
            "mask_and_scale": False,
            "decode_times": False,
            "decode_timedelta": False,
            "use_cftime": False,
            "decode_coords": False,
        },
        xarray_concat_args={
            "combine_attrs": "override",
            "data_vars": "minimal",
            "coords": "minimal",
            "compat": "override",
            # "concat_dim": "time",
            "dim": "time",
        },
    )
    mzz.translate(outpath, template_count=1)


json_paths = sorted(separate_root.iterdir())

combined_root = pathlib.Path("combined")
combined_root.mkdir(exist_ok=True)

path_1908 = combined_root / "1908.json"
path_1909 = combined_root / "1909.json"

print("combining ... ", end='')
combine_json(json_paths[:1908], path_1908)
combine_json(json_paths, path_1909)
print("done")

mapper_1908 = fsspec.get_mapper(
    "reference://", fo=f"file://{path_1908.absolute()}", remote_protocol="file"
)
print("1908 files:", mapper_1908)
mapper_1909 = fsspec.get_mapper(
    "reference://", fo=f"file://{path_1909.absolute()}", remote_protocol="file"
)
print("1909 files:", mapper_1909)
traceback
Traceback (most recent call last):
  File ".../test.py", line 86, in <module>
    mapper_1909 = fsspec.get_mapper("reference://", fo=f"file://{path_1909.absolute()}", remote_protocol="file")
  File ".../fsspec/fsspec/mapping.py", line 228, in get_mapper
    fs, urlpath = url_to_fs(url, **kwargs)
  File ".../fsspec/fsspec/core.py", line 399, in url_to_fs
    fs = cls(**options)
  File ".../fsspec/fsspec/spec.py", line 76, in __call__
    obj = super().__call__(*args, **kwargs)
  File ".../fsspec/fsspec/implementations/reference.py", line 142, in __init__
    self._process_references(text, template_overrides)
  File ".../fsspec/fsspec/implementations/reference.py", line 259, in _process_references
    self._process_references1(references, template_overrides=template_overrides)
  File ".../fsspec/fsspec/implementations/reference.py", line 300, in _process_references1
    u = _render_jinja(u)
  File ".../fsspec/fsspec/implementations/reference.py", line 283, in _render_jinja
    return jinja2.Template(u).render(**self.templates)
  File "~/.conda/envs/fsspec/lib/python3.9/site-packages/jinja2/environment.py", line 1208, in __new__
    return env.from_string(source, template_class=cls)
  File "~/.conda/envs/fsspec/lib/python3.9/site-packages/jinja2/environment.py", line 1092, in from_string
    return cls.from_code(self, self.compile(source), gs, None)
  File "~/.conda/envs/fsspec/lib/python3.9/site-packages/jinja2/environment.py", line 757, in compile
    self.handle_exception(source=source_hint)
  File "~/.conda/envs/fsspec/lib/python3.9/site-packages/jinja2/environment.py", line 925, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<unknown>", line 1, in template
jinja2.exceptions.TemplateSyntaxError: expected token 'end of print statement', got 'integer'

A colleague traced this to the generation of the template ids in https://github.com/intake/fsspec-reference-maker/blob/fda8a5318e307c23ef656257c0bc78262bcb9a6f/fsspec_reference_maker/combine.py#L131-L138. My conclusion is that jinja requires those ids to be valid python identifiers, but the 1909th template gets an id of 01 (there are also single-digit ids earlier, but those are fine, apparently).

Side note: I don't understand why it uses itertools.combinations(..., i) and not itertools.product(..., repeat=i), which would allow more values with fewer characters

To fix this here I'd either filter out invalid identifiers:

for tup in itertools.combinations(string.ascii_letters + string.digits, i): 
    if tup[0].isdigit():
        continue
    yield "".join(tup)

or always prefix the id with a character (this results in a bigger file size, though):

for tup in itertools.combinations(string.ascii_letters + string.digits, i): 
    yield "i" + "".join(tup)

but I may be missing a better way. In any case, both fix my issue.

If any of these sound good we can send in a PR.

cc @tinaok @HectorBaril

ideas from H5coro

Two key things we may wish to implement:

  • building references on-the-fly, such that the first access to a dataset is slow, but then subsequent accesses use a cached references index; but the user call doesn't change

  • merging fetches when contiguous: since we know the full set of start-end offsets we are about to launch for each given URL, we could reduce the number of calls to the remote API and potentially save time.
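
For the second point, a minimal sketch (assuming simple (offset, length) pairs for one URL, not tied to kerchunk's internals) of coalescing contiguous or near-contiguous ranges before issuing requests:

def merge_ranges(ranges, max_gap=0):
    """ranges: iterable of (offset, length); returns a merged, sorted list."""
    out = []
    for offset, length in sorted(ranges):
        if out and offset <= out[-1][0] + out[-1][1] + max_gap:
            prev_off, prev_len = out[-1]
            end = max(prev_off + prev_len, offset + length)
            out[-1] = (prev_off, end - prev_off)
        else:
            out.append((offset, length))
    return out


# merge_ranges([(0, 100), (100, 50), (400, 10)]) -> [(0, 150), (400, 10)]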

Templating and deterministic maps

In yesterday's meeting, we discussed the widespread desire for this spec to handle deterministic, evenly spaced-out chunks within a file, rather than explicitly enumerating each chunk and its offsets. We would need some sort of support for templates and symbolic expressions. For a 1D file, it could be something like

"{i}", ["s3://bucket/path/file.nc", "3200 * {i}", 3200]

or

"{i}", ["s3://bucket/path/file.nc", "{itemsize} * {i}", 3200]

where {itemsize} would get filled with 3200

For ND it would be harder because you need to know the array shape:

"{j}.{i}", ["s3://bucket/path/file.nc", "{itemsize} * ({j} * Nx + {i})", 3200]

I can't think of a clever way around that.
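
To make the problem concrete, here is a hypothetical expansion of the ND template into explicit references, assuming the chunk-grid shape (Ny, Nx) is known from somewhere like the .zarray metadata; it is exactly that shape dependency that the template syntax would need to express:

itemsize = 3200
Ny, Nx = 4, 5  # example chunk grid, known only from the array metadata

refs = {
    f"{j}.{i}": ["s3://bucket/path/file.nc", itemsize * (j * Nx + i), itemsize]
    for j in range(Ny)
    for i in range(Nx)
}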

Ideas? @manzt? @joshmoore?

"Failed to open Zarr store with consolidated metadata" xarray error with MultiZarrToZarr

@rsignell-usgs and I ran into this when working on a notebook together yesterday, xarray seems to have a new warning when it can't open a zarr store with consolidated metadata.

When running MultiZarrToZarr.translate(), I get

/home/conda/store/578732b5a39dadc3cadc71a29a33e58ded03e8a9d5c888edac6151d80eda2868-pangeo/lib/python3.8/site-packages/fsspec_reference_maker/combine.py:170: 
RuntimeWarning: Failed to open Zarr store with consolidated metadata, falling back to try reading
non-consolidated metadata. This is typically much slower for opening a dataset. 
To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  dss = [xr.open_dataset(m, engine="zarr", chunks={}, **self.xr_kwargs)

I'm not sure if we're trying to consolidate metadata or not. If we are, then something isn't working. If we're not, then we might consider adding consolidated=False to https://github.com/intake/fsspec-reference-maker/blob/e96969d938572fb883b2acf761506dc11de45a4c/fsspec_reference_maker/combine.py#L170 to suppress the warning for now.
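
For reference, a sketch of what that tweak could look like (not committed code; the iterable name is hypothetical, and recent xarray versions also accept consolidated=False directly as a keyword):

dss = [
    xr.open_dataset(m, engine="zarr", chunks={},
                    backend_kwargs={"consolidated": False},
                    **self.xr_kwargs)
    for m in mappers  # hypothetical name for the list of reference mappers
]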

Minimum working code sample and json files:

import xarray as xr
import fsspec
from glob import glob
from fsspec_reference_maker.combine import MultiZarrToZarr

jsons = sorted(glob("./jsons/*"))

mzz = MultiZarrToZarr(
    jsons,
    remote_protocol='s3',
    remote_options={'anon' : 'True'},   #JSON files  
    xarray_open_kwargs={
        'decode_cf' : False,
        'mask_and_scale' : False,
        'decode_times' : False,
        'use_cftime' : False,
        'drop_variables': ['reference_time', 'crs'],
        'decode_coords' : False
    },
    xarray_concat_args={
#          "data_vars": "minimal",
#          "coords": "minimal",
#          "compat": "override",
        "join": "override",
        "combine_attrs": "override",
        "dim": "time"
    }
)

mzz.translate("test.json")

Explain target_protocol and remote_protocol

I'm trying to understand fsspec-reference-maker better.

Consider the following code from the pangeo-forge hdf-reference tutorial:

m = fsspec.get_mapper(
    "reference://",
    fo=ref_url,
    target_protocol="file",
    remote_protocol="s3",
    skip_instance_cache=True,
)

Please help me understand the need for target_protocol and remote_protocol

target_protocol

AFAICT, this is the name of the protocol needed to open ref_url. Is this always needed? What if the protocol is already in ref_url, e.g. ref_url = "https://..."? What if the two are inconsistent?

Why don't we just infer the target_protocol from ref_url?

remote_protocol

AFAICT, this is the protocol used for opening the underlying referenced data files. But that is already encoded in the reference file! Here, the beginning of the file at ref_url is

{"version": 1, "templates": {"a": "s3://esgf-world/CMIP6/OMIP/NOAA-GFDL/GFDL-CM4/omip1/r1i1p1f1/Omon/thetao/gr/v20180701/thetao_Omon_GFDL-CM4_omip1_r1i1p1f1_gr_170801-172712.nc", "b": "s3://esgf-world/CMIP6/OMIP/NOAA-GFDL/GFDL-CM4/omip1/r1i1p1f1/Omon/thetao/gr/v20180701/thetao_Omon_GFDL-CM4_omip1_r1i1p1f1_gr_172801-174712.nc", "c": "s3://esgf-world/CMIP6/OMIP/NOAA-GFDL/GFDL-CM4/omip1/r1i1p1f1/Omon/thetao/gr/v20180701/thetao_Omon_GFDL-CM4_omip1_r1i1p1f1_gr_174801-176712.nc", "d": "s3://esgf-world/CMIP6/OMIP/NOAA-GFDL/GFDL-CM4/om ...

All of those s3://s mean that the data should be read with s3 protocol. So why do we also need to specify remote_protocol? What if I put remote_protocol='gcs' but the actual references are s3? Wouldn't this cause problems?


In summary, it feels to me like both these specifiers are redundant and therefore a source of potential bugs. But I'm sure I'm missing something.

In any case, the documentation on these options could be improved.

versioning the spec

Once we have a complete spec of some sort, we should give it a version, which also is saved into the prescriptions, so that we can do backward-incompatible changes later

Contributing a CSV module [RE: dask dataframe `read_csv`]

Following our discussion in dask #8045 I’m seeking to contributing a CSV module to fsspec-reference-maker, and wanted to kick off a feature proposal issue here to clarify some aspects.

Hope this is a decent start, but WIP and comments appreciated (I need to review the role of fsspec in dask more closely).

Firstly, as described in #7, this library is geared towards evenly spaced chunks. The README of this repo gives an example of this with a single variable, or “dimension”, i (the number of chunks), which is used in the example spec along with a fixed length (1000) to produce range(0, i * 1000, 1000).

This of course matches how dask reads CSVs as range(0, size, blocksize), but the whole point of my intervention here is to make them no longer evenly spaced deterministic chunks.

The word ‘deterministic’ here seems to mean both:

  • “based on the byte positions rather than byte content at those positions”, as well as
  • “uniquely identifiable, not changing upon recalculation”.

The alternative I am proposing for CSVs to dask only fits the 2nd of these criteria (as it will involve checking bytes, which may result in changes to the offsets, thus chunks may not stay evenly spaced).

Also in #7 there is mention of “callbacks”, and I’m uncertain whether this is the route I should take to achieving this adjustment to the offsets (or something entirely different I should ignore).

I am inclined to copy the conclusions of that issue, that perhaps it is easiest to begin by aiming to produce the explicit ‘version 0’ spec rather than the templated ‘version 1’ until how this works is clearer to me.

As a less significant problem for me, the docs here refer to jinja2 rendered strings, but I don’t see that library as a requirement anywhere here, so I’m not sure how that works (perhaps it is a future feature, I’m noting this library is a recent/future-facing effort).

Here’s my first attempt at an idea of how this would look (filenames vaguely copying the format of the WIT dataset as “base_0_ordinal_count of base_1_cardinal_total”):

  • The 2nd and 3rd items of these values are meant to show where a previous routine has calculated the offset adjustments (10, 5, 20, 0, 50, 25), which are added to the evenly spaced offsets (1000+10, 2000+5, etc.); the lengths are then the differences between consecutive adjusted offsets (1010-0 = 1010, 2005-1010 = 995, etc.)

  • I’m fairly sure that the 3rd item in the gen keys should indicate the length of the chunk that ends at that offset but please correct me if I’m wrong.

This then gives the spec of a ‘filesystem for partitions within a file’, addressable by filepath plus ‘virtual’ partition index:

{
  "key0": "data",
  "gen_key0": ["/path/to/csv_0_of_1.csv.gz/partition_0", 1010, 1010],
  "gen_key1": ["/path/to/csv_0_of_1.csv.gz/partition_1", 2005, 995],
  "gen_key2": ["/path/to/csv_0_of_1.csv.gz/partition_2", 3020, 1015],
  "gen_key3": ["/path/to/csv_0_of_1.csv.gz/partition_3", 4000, 980],
  "gen_key4": ["/path/to/csv_0_of_1.csv.gz/partition_4", 5050, 1050],
  "gen_key5": ["/path/to/csv_0_of_1.csv.gz/partition_5", 6025, 975]
}
  • I’m not sure what to put in the ‘key’ entries so have removed the ones from the example spec (please let me know if this is unadvisable, and if you have an idea of what should go there instead)
  • I presume the one that is currently bearing the bytes b”data” should be storing something important to identify the CSV, but I can’t determine what that is on my first attempt
    • My understanding is that this will be fed into the OpenFile object as the fs argument, so it should store things relevant to that. Perhaps path? I’m very unsure how this should look though, and suspect if I guess I’d only end up putting irrelevant info in that’ll already be passed in.
  • For simplicity I’m considering 2 files here, each with 3 offsets (i.e. 4 chunks: the offset starting at 0 is always going to be assumed to be valid: if it’s not then that’s a corrupt CSV, not the problem I’m seeking to solve here)

As for the matter of identifying the offset adjustments (10, 5, 20, 0, 50, 25) I expect the fastest way to do so is

  • initialise separator_skip_count = separator_skip_offset = 0 at each offset mark (1000, 2000, etc.)
  • try pandas.read_csv(nrows=1)
    • catch failure; increment separator_skip_count += 1 if it fails (repeat)
  • finally [upon success]
    • break out of the loop
    • use the tell minus the offset to give the ‘offset adjustment’ (assign separator_skip_offset)
      • left as 0 for no adjustment (if separator_skip_count == 0), or a positive integer

The separator_skip_count indicating the number of newlines that were skipped after the offset+length to find the ‘genuine’ row-delimiting offset seems redundant to store, but useful while writing/debugging this algorithm.

  • I say that, but I don’t know: perhaps it’d be inexpensive to recalculate the actual byte offsets from the number of newlines to skip after the offset, rather than store that offset? (Not clear to me yet)

Only the separator_skip_offset needs to be stored: summed with the offset, in the 2nd item of the values (1010, 2005, etc.)
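
A simplified sketch of that scan, which just reads forward to the next newline instead of retrying pandas.read_csv(nrows=1) at each naive offset (it assumes newline-delimited rows and a seekable, uncompressed file; the function name is my own):

import fsspec


def adjusted_partitions(path, blocksize, storage_options=None):
    """Return (offset, length) pairs whose boundaries fall on row breaks."""
    offsets = []
    with fsspec.open(path, mode="rb", **(storage_options or {})) as f:
        f.seek(0, 2)
        size = f.tell()
        for naive in range(blocksize, size, blocksize):
            f.seek(naive)
            skipped = f.readline()  # remainder of the row we landed in
            offsets.append(naive + len(skipped))
    bounds = [0] + [o for o in offsets if o < size] + [size]
    return [(start, stop - start) for start, stop in zip(bounds, bounds[1:])]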

I think at the point that the separator_skip_offset is calculated, the ‘version 1’ spec could be computed, to reduce to the above ‘version 0’ spec, as something like:

{
    "version": 1,
    "templates": {
        "u": "/path/to/csv_0_of_1.csv.gz",
        "f": "partition_{{c}}"
    },
    "gen": [
        {
            "key": "gen_key{{i}}",
            "url": "{{u}}/{{f(c=i)}}",
            "offset": "{{(i + 1) * 1000}}",
            "length": "1000",
            "dimensions": 
              {
                "i": {"stop":  5}
              }
        }   
    ],
    "refs": {
      "key0": "data",
    }
}
  • I may be misunderstanding something by putting the filename within the template rather than as a variable
  • should gen_key0 be partition0 (etc) ? or should this stay as gen_key0 to make clear that it’s generated?
  • I can’t figure out where the array specifying the separator_skip_offset should go (if I put it in the gen.dimensions key it’ll become a Cartesian product, whereas I want to ‘zip’ it against the i range…)
  • Should I change the gen.url key from “url” to something else, since it’s expected to refer to a file path not a web resource?

Without understanding how to incorporate the offset adjustments into this template, I don’t think I can write the ‘version 1’ spec at this time, but I hope we might be able to figure it out here.

C++ / pybind11 mapping for AbstractBufferedFile?

I maintain a library that exposes a C++ codebase as a python module.
I'm fighting with trying to figure out how to pass the python AbstractBufferedFile into the C++ code with pybind11.
Do you know of any projects where someone might have wrestled with this problem?
I have an ugly solution that partially works but I'm looking for something better.
Thanks!

MultiZarrToZarr speed improvements

Opening an issue so we can discuss processing time for MultiZarrToZarr. I'm running with the following code:

mzz = MultiZarrToZarr(
    json_list,
    remote_protocol='az',
    remote_options={
       'account_name' : 'goeseuwest'
    },    
    xarray_open_kwargs={
        'decode_cf' : False,
        'mask_and_scale' : False,
        'decode_times' : False,
        'use_cftime' : False,
        'decode_coords' : False,
    },
    xarray_concat_args={
        "data_vars": "minimal",
        "coords": "minimal",
        "compat": "override",
        "join": "override",
        "combine_attrs": "override",
        "dim": "t"

    }
)

mzz.translate('combined.json')

Running mzz.translate() takes upwards of 40 minutes on a list of 144 reference jsons, which were generated with inline_threshold=300. It looks like a good chunk of the time is divided between to_zarr() and split().

(screenshot of the snakeviz profile)

I've attached the profile file here as well, which can be visualized with snakeviz ./multizarr_profile.

multizarr_profile.zip
