
arco-era5's Introduction

Google Research

This repository contains code released by Google Research.

All datasets in this repository are released under the CC BY 4.0 International license, which can be found here: https://creativecommons.org/licenses/by/4.0/legalcode. All source files in this repository are released under the Apache 2.0 license, the text of which can be found in the LICENSE file.


Because the repo is large, we recommend you download only the subdirectory of interest:

SUBDIR=foo
svn export https://github.com/google-research/google-research/trunk/$SUBDIR

If you'd like to submit a pull request, you'll need to clone the repository; we recommend making a shallow clone (without history).

git clone git@github.com:google-research/google-research.git --depth=1

Disclaimer: This is not an official Google product.


arco-era5's People

Contributors

alxmrs, dabhicusp, darshansp19, mlshapiro, rajveer43, wundersooner


arco-era5's Issues

NaNs in 6-hourly analysis-ready dataset for 2m temperature

Hi there! I've come across NaNs in the 2m_temperature variable in the 6-hourly analysis-ready dataset -- MWE below -- does this reproduce for you?

Three strange observations:

  • I've tried the 1-hourly dataset and couldn't see any NaNs in any of the 24 hours for this date (2016-06-25).
  • No NaNs in neighbouring dates.
  • No NaNs in some other variables (u- and v-wind) in the same 6-h dataset.

import xarray as xr

source = "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2"
era5_zarr = xr.open_zarr(source, consolidated=True, chunks={"time": 48})
era5_zarr["2m_temperature"].sel(time="2015-06-28").load()

Returns

<xarray.DataArray '2m_temperature' (time: 4, latitude: 721, longitude: 1440)>
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],
       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],
       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],
       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32)
Coordinates:
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * longitude  (longitude) float32 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
  * time       (time) datetime64[ns] 2015-06-28 ... 2015-06-28T18:00:00
Attributes:
    long_name:   2 metre temperature
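
For anyone triaging this, a quick way to scan a month of the 6-hourly store for other affected timestamps (a minimal sketch, not part of the original report):

import xarray as xr

source = "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2"
ds = xr.open_zarr(source, consolidated=True, chunks={"time": 48})

# True for each timestamp in June 2015 that contains at least one NaN anywhere on the grid.
has_nan = (
    ds["2m_temperature"]
    .sel(time="2015-06")
    .isnull()
    .any(dim=["latitude", "longitude"])
    .compute()
)
print(has_nan.time.values[has_nan.values])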

Daily-averaged analysis-ready ERA5 data

Thanks for this fantastic initiative, it's great to be able to access ERA5 data straight into RAM with just a few lines of code.

Currently, only hourly instantaneous ERA5 fields are provided. However, it's common in environmental ML research to use daily averages (especially when compute is restricted or for proof-of-concept work). Are there any plans to provide a daily-averaged version of the analysis-ready data? This would reduce network traffic and presumably cut download time by roughly 24x, which would be very useful. Cheers!
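
In the meantime, daily means can be computed on the fly from the hourly store, at the cost of streaming all 24 hourly fields; a minimal sketch using the 0.25° analysis-ready dataset:

import xarray as xr

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2",
    chunks={"time": 48},
    consolidated=True,
)

# Daily-mean 2 m temperature for one month; every hourly field still has to be downloaded.
t2m_daily = ds["2m_temperature"].sel(time="2020-01").resample(time="1D").mean().compute()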

Explanation for chunking

Hey, I'm not a cloud expert, but I'm wondering about the rationale for the chunking you've chosen. I see that the Zarr stores have rather large chunk sizes. For example, the model-level variables have chunk size dask.array<chunksize=(48, 137, 410240)>, which works out to about 10 GB. My understanding was that a good chunk size for object storage is on the order of MBs. Wouldn't it make sense to use chunking of (1, 1, 410240), for example?
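
For reference, the ~10 GB figure follows directly from the chunk shape and dtype (a quick check, assuming uncompressed float32 values):

# 48 times × 137 model levels × 410240 horizontal points × 4 bytes per float32
chunk_bytes = 48 * 137 * 410240 * 4
print(chunk_bytes / 1e9)  # ≈ 10.8 GB (≈ 10.0 GiB) before compression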

Add TISR 212 and LSM 172

It would be great to see whether this dataset could be used to train machine learning models such as GraphCast. There are a couple of variables missing, though.
[Screenshot of the GraphCast input-variable table, taken from https://arxiv.org/pdf/2212.12794.pdf]
I think the two missing fields are incident solar radiation (TISR, param 212) and the land-sea mask (LSM, param 172). Any interest in including these, or even in making all of the ERA5 variables available?

Rechunk Pressure level data in lat lon dataset

I have been using the lat/lon dataset at gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3. It has been extremely helpful for setting up different projects. I am wondering if it would be possible to rechunk the pressure-level data. Currently, all pressure levels are in a single chunk, so if we want to subsample a few levels we end up fetching the entire chunk, which can significantly reduce effective bandwidth. Given that this is object storage, we could ideally use much smaller chunks and have each chunk be just the lat/lon grid. What do you think?

How to increase the speed of saving ERA5 chunks data?

Hi, everyone. It is really convenient to access ERA5 data from cloud storage. However, it's very slow to save the processed data in netCDF format: it has taken 40 minutes so far and still has not finished. How can I solve this problem and speed up saving the ERA5 data? This is my code.

import xarray as xr

reanalysis = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2',
    chunks={'time': 48},
    consolidated=True)

## Hourly 2 m temperature over China, 1961-2020 (lazy selection; ~119 GB of data)
data_CN = reanalysis["2m_temperature"].loc["1961-01-01":'2020-12-31', 60:0, 70:130]

## Time averages (note: resample(time="M") gives monthly, not daily, means); ~159 MB
data_CN_daily = data_CN.resample(time="M").mean()

data_CN_daily = data_CN_daily.compute()    ## This step takes a long time (50+ minutes)

data_CN_daily.to_netcdf("data_cn.nc")
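
Not an official recommendation, but one thing that often helps here is to start a local Dask cluster so the chunk downloads and the reduction run in parallel across worker processes, and to let to_netcdf drive the computation instead of calling .compute() first; a rough sketch:

from dask.distributed import Client
import xarray as xr

client = Client()  # local cluster; parallelizes chunk fetches and the mean reduction

reanalysis = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2',
    chunks={'time': 48},
    consolidated=True)

data_CN = reanalysis["2m_temperature"].loc["1961-01-01":'2020-12-31', 60:0, 70:130]

# Writes the monthly means chunk by chunk instead of materializing everything in memory first.
data_CN.resample(time="M").mean().to_netcdf("data_cn.nc")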

Inconsistent variable naming for 'z' and 'zs' in data repository and storage bucket.

As per ECMWF's official documentation, this variable should be named z, not zs (see https://codes.ecmwf.int/grib/param-db/?id=129). However, both in the repository (https://github.com/google-research/arco-era5/blob/main/raw/era5_ml_zs.cfg) and within the storage bucket where the data lives (gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/**), the variable is currently named zs. To ensure consistency, this should be updated.

Store data in uint16

Hey,

Totally amazing project. I wanted to ask about compression/uint16 storage. When I download lat/lon ERA5 data from the CDS API, the data is compressed and stored as uint16 along with scale factors to convert back to float32. The dataset here appears to be stored as full float32. Is there a reason not to store it in the original uint16? It saves a tremendous amount of bandwidth, and when using ERA5 in ML workflows this can make a big difference.
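
For context, the packing used by the CDS NetCDF downloads is the standard CF scale-and-offset convention, which xarray can apply through the encoding dict when writing; a minimal sketch with illustrative (not the project's) values:

import numpy as np
import xarray as xr

t2m = xr.DataArray(
    np.random.uniform(220.0, 320.0, size=(2, 4, 8)).astype("float32"),
    dims=("time", "latitude", "longitude"),
    name="2m_temperature",
)

# Pack float32 values into 16-bit integers; readers reverse this on decode as
# float = stored_int * scale_factor + add_offset.
encoding = {
    "2m_temperature": {
        "dtype": "int16",
        "scale_factor": 0.01,
        "add_offset": 270.0,
        "_FillValue": -32767,
    }
}
t2m.to_dataset().to_netcdf("t2m_packed.nc", encoding=encoding)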

Snow depth latitude flipped for select dates

It seems that for a select few dates the snow_depth variable is flipped (i.e., the latitudes are reversed) in this dataset: gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/. I only checked values at 6-hour intervals from 1979 to 2021, but I found the fields are flipped at these timestamps:

Flipped data at 1981-03-16T00:00:00.000000000
Flipped data at 1981-03-16T06:00:00.000000000
Flipped data at 1981-03-16T12:00:00.000000000
Flipped data at 1981-03-16T18:00:00.000000000
Flipped data at 1982-04-06T00:00:00.000000000
Flipped data at 1982-04-06T06:00:00.000000000
Flipped data at 1982-04-06T12:00:00.000000000
Flipped data at 1982-04-06T18:00:00.000000000
Flipped data at 1985-12-11T00:00:00.000000000
Flipped data at 1985-12-11T06:00:00.000000000
Flipped data at 1985-12-11T12:00:00.000000000
Flipped data at 1985-12-11T18:00:00.000000000
Flipped data at 1987-11-30T00:00:00.000000000
Flipped data at 1987-11-30T06:00:00.000000000
Flipped data at 1987-11-30T12:00:00.000000000
Flipped data at 1987-11-30T18:00:00.000000000
Flipped data at 1990-03-05T00:00:00.000000000
Flipped data at 1990-03-05T06:00:00.000000000
Flipped data at 1990-03-05T12:00:00.000000000
Flipped data at 1990-03-05T18:00:00.000000000
Flipped data at 1990-04-02T00:00:00.000000000
Flipped data at 1990-04-02T06:00:00.000000000
Flipped data at 1990-04-02T12:00:00.000000000
Flipped data at 1990-04-02T18:00:00.000000000
Flipped data at 1990-08-12T00:00:00.000000000
Flipped data at 1990-08-12T06:00:00.000000000
Flipped data at 1990-08-12T12:00:00.000000000
Flipped data at 1990-08-12T18:00:00.000000000
Flipped data at 1997-05-15T00:00:00.000000000
Flipped data at 1997-05-15T06:00:00.000000000
Flipped data at 1997-05-15T12:00:00.000000000
Flipped data at 1997-05-15T18:00:00.000000000
Flipped data at 2002-03-17T00:00:00.000000000
Flipped data at 2002-03-17T06:00:00.000000000
Flipped data at 2002-03-17T12:00:00.000000000
Flipped data at 2002-03-17T18:00:00.000000000
Flipped data at 2003-11-26T00:00:00.000000000
Flipped data at 2003-11-26T06:00:00.000000000
Flipped data at 2003-11-26T12:00:00.000000000
Flipped data at 2003-11-26T18:00:00.000000000
Flipped data at 2004-02-10T00:00:00.000000000
Flipped data at 2004-02-10T06:00:00.000000000
Flipped data at 2004-02-10T12:00:00.000000000
Flipped data at 2004-02-10T18:00:00.000000000
Flipped data at 2006-04-12T00:00:00.000000000
Flipped data at 2006-04-12T06:00:00.000000000
Flipped data at 2006-04-12T12:00:00.000000000
Flipped data at 2006-04-12T18:00:00.000000000
Flipped data at 2007-06-19T00:00:00.000000000
Flipped data at 2007-06-19T06:00:00.000000000
Flipped data at 2007-06-19T12:00:00.000000000
Flipped data at 2007-06-19T18:00:00.000000000
Flipped data at 2009-03-05T00:00:00.000000000
Flipped data at 2009-03-05T06:00:00.000000000
Flipped data at 2009-03-05T12:00:00.000000000
Flipped data at 2009-03-05T18:00:00.000000000
Flipped data at 2013-11-11T00:00:00.000000000
Flipped data at 2013-11-11T06:00:00.000000000
Flipped data at 2013-11-11T12:00:00.000000000
Flipped data at 2013-11-11T18:00:00.000000000
Flipped data at 2014-05-11T00:00:00.000000000
Flipped data at 2014-05-11T06:00:00.000000000
Flipped data at 2014-05-11T12:00:00.000000000
Flipped data at 2014-05-11T18:00:00.000000000
Flipped data at 2017-03-17T00:00:00.000000000
Flipped data at 2017-03-17T06:00:00.000000000
Flipped data at 2017-03-17T12:00:00.000000000
Flipped data at 2017-03-17T18:00:00.000000000
Flipped data at 2020-05-19T00:00:00.000000000
Flipped data at 2020-05-19T06:00:00.000000000
Flipped data at 2020-05-19T12:00:00.000000000
Flipped data at 2020-05-19T18:00:00.000000000
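
Until the store is repaired, affected timestamps can be flipped back on the client side; a minimal sketch, assuming the list above is what needs fixing:

import xarray as xr

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/",
    chunks={"time": 48},
    consolidated=True,
)

# Re-flip one affected field along latitude so it matches the rest of the record.
snap = ds["snow_depth"].sel(time="1981-03-16T00:00:00").load()
repaired = snap.copy(data=snap.values[::-1, :])  # values reversed; latitude coordinate unchanged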

Could surface solar radiation be added?

Hi,
Really great work! It's nice being able to stream in all this ERA5 data. I was wondering if some more surface solar radiation variables could be added. The reason is to help with renewable energy generation forecasting, specifically PV/solar generation, and training models to forecast it.

Add hooks to open datasets with Google Colab

  • For the existing notebooks, add an "open with Colab" button at the top. The trick here is to set up the notebook's dependencies ahead of time so everything "just works".
  • Add a series of "open with Colab" buttons to the main README to help users start using the data immediately.

Automatically ingest raw ERA5 data as soon as it's available from ECMWF.

Ideally, we'd like to make sure that this repository has new raw data from ECMWF as soon as it's available. For at least a few datasets, Copernicus makes new data available on a daily cadence.

Implementation Notes

Let's modify our existing weather-dl scripts to ingest new data from CDS on a cron via GitHub Actions. On every run, we should extend the config to the current date. This may require modifying weather-dl's config parser a bit first (google/weather-tools#267).

Blocked from accessing data

Hey,

I was using ARCO ERA5 to generate a training dataset for neural networks. I was pulling data last night, and after pulling ~200 GB I started getting the error below. I'm not super familiar with Google Cloud Storage, but are there limits on how much data can be pulled from ARCO ERA5?

    raise HttpError({"code": status, "message": msg})  # text-like
gcsfs.retry.HttpError: <html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><title>Sorry...</title><style> body { font-family: verdana, arial, sans-serif; background-color: #fff; color: #000; }</style></head><body><div><table><tr><td><b><font face=sans-serif size=10><font color=#4285f4>G</font><font color=#ea4335>o</font><font color=#fbbc05>o</font><font color=#4285f4>g</font><font color=#34a853>l</font><font color=#ea4335>e</font></font></b></td><td style="text-align: left; vertical-align: bottom; padding-bottom: 15px; width: 50%"><div style="border-bottom: 1px solid #dfdfdf;">Sorry...</div></td></tr></table></div><div style="margin-left: 4em;"><h1>We're sorry...</h1><p>... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.</p></div><div style="margin-left: 4em;">See <a href="https://support.google.com/websearch/answer/86640">Google Help</a> for more information.<br/><br/></div><div style="text-align: center; border-top: 1px solid #dfdfdf;"><a href="https://www.google.com">Google Home</a></div></body></html>, 429
[2023-12-15 10:08:15,139][gcsfs][ERROR] - _request out of retries on exception: <html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><title>Sorry...</title><style> body { font-family: verdana, arial, sans-serif; background-color: #fff; color: #000; }</style></head><body><div><table><tr><td><b><font face=sans-serif size=10><font color=#4285f4>G</font><font color=#ea4335>o</font><font color=#fbbc05>o</font><font color=#4285f4>g</font><font color=#34a853>l</font><font color=#ea4335>e</font></font></b></td><td style="text-align: left; vertical-align: bottom; padding-bottom: 15px; width: 50%"><div style="border-bottom: 1px solid #dfdfdf;">Sorry...</div></td></tr></table></div><div style="margin-left: 4em;"><h1>We're sorry...</h1><p>... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.</p></div><div style="margin-left: 4em;">See <a href="https://support.google.com/websearch/answer/86640">Google Help</a> for more information.<br/><br/></div><div style="text-align: center; border-top: 1px solid #dfdfdf;"><a href="https://www.google.com">Google Home</a></div></body></html>, 429

Minimal example that gives the error for me,

import fsspec
import xarray as xr
from pathlib import Path

# Load ARCO ERA5 lazily from the public bucket
fs = fsspec.filesystem('gs')
arco_filename = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'
arco_mapper = fs.get_mapper(arco_filename)
arco_era5 = xr.open_zarr(arco_mapper, consolidated=True)

save_dir = Path('./')
var_name = "10m_u_component_of_wind"
zarr_path = save_dir / f"{var_name}.zarr"

# Save (build the lazy write graph without computing yet)
delayed_obj = arco_era5[var_name].to_zarr(zarr_path, consolidated=True, compute=False)

# Wait for save to finish
delayed_obj.compute()

*Fixed typo in example
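
Not an official answer, but one way to make a large pull like this more robust to 429 throttling is to write the variable in smaller batches (for example, one year at a time), so an interrupted transfer can be resumed rather than restarted; a rough sketch:

import xarray as xr
from pathlib import Path

arco_era5 = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    consolidated=True,
)

var_name = "10m_u_component_of_wind"
zarr_path = Path(f"{var_name}.zarr")

# Write one year at a time; if throttled, re-run starting from the last completed year.
for i, year in enumerate(range(1979, 2023)):
    batch = arco_era5[[var_name]].sel(time=str(year))
    if i == 0:
        batch.to_zarr(zarr_path, mode="w", consolidated=True)
    else:
        batch.to_zarr(zarr_path, append_dim="time", consolidated=True)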

Ingest all variables for single-level reanalysis.

Add a new single-level reanalysis config to ingest the remaining variables from ECMWF within the same time range. We can then create a new XArray-Beam pipeline to append these variables to the existing cloud-optimized (CO) Zarr dataset.

Variables are listed here: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview

We can create more DL configs like this one: https://github.com/google-research/arco-era5/blob/main/raw/era5_sfc_rad.cfg

See https://weather-tools.readthedocs.io/ or https://github.com/google-research/arco-era5/tree/main/raw for instructions on how to use weather-dl.

Design analysis ready Zarr to allow updating with preliminary ERA5 data.

For phase 2, we'd like to produce surface and atmospheric Zarrs that can be updated with preliminary data. Specifically, we intend to backfill the raw data covering 1959 to 1978. It's possible that in the future, ECMWF will produce an even earlier backfill. As I understand it, the standard structure of Zarr only allows appending to the end.
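
For reference, the append path that xarray exposes today looks like the sketch below (store paths are hypothetical), which is why prepending earlier years is the awkward case:

import xarray as xr

# Hypothetical stores, for illustration only.
new_chunk = xr.open_zarr("gs://example-bucket/preliminary-update.zarr")
new_chunk.to_zarr("gs://example-bucket/analysis-ready.zarr", append_dim="time")

# This only extends the *end* of the time dimension; there is no "prepend_dim",
# so inserting 1959-1978 before the existing start would mean rewriting or
# re-indexing the existing time axis.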

The aim for this issue would be to devise a means of avoiding recomputing our Zarr datasets whenever we want to include new data at earlier times.

For overall pipeline structure, I have the following sketch in mind:

  • For each epoch of preliminary data we ingest from ECMWF, we manually produce a new cloud optimized dataset. This can use the scripts in this project that we've already developed.
  • For recent data, we use Pangeo components to update the cloud-optimized datasets (e.g. pangeo-forge/pangeo-forge-recipes#447). These download and append steps can be automated to a regular cadence (monthly or quarterly).
  • For the analysis ready version, we create XArray-Beam pipelines to transform the Zarr datasets.
  • After a new cloud-optimized preliminary dataset is produced, we invoke the same XArray-Beam pipelines to update the analysis ready versions.

@shoyer @rabernat: Do either of you have any thoughts on how we could structure our Zarr to these ends?

Lat/lon gridded data does not have monotonically increasing latitudes

First of all THANK YOU so much for this effort! Having ERA5 data available in an ARCO format is truly a game changer!

I noticed a small issue: the latitudes of the lat/lon gridded data (1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2/) are in decreasing order

import xarray as xr

ar_full_37_1h = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2/',
).isel(time=0)
ar_full_37_1h

which makes selecting a region in xarray slightly counterintuitive:

ar_full_37_1h.sel(latitude=slice(-50, 50))

returns no latitude indices,

while

ar_full_37_1h.sel(latitude=slice(50, -50))

gives the desired subset.

If you end up reprocessing the data at some point, I wonder if something like xarray's ds.sortby('latitude') (or equivalent) could be added to the pipeline.
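
In the meantime, the same reordering can be applied on the client side; a small sketch:

import xarray as xr

ar_full_37_1h = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2/',
).isel(time=0)

# Reindex to ascending latitudes so slice(-50, 50) behaves as expected.
ascending = ar_full_37_1h.sortby('latitude')
subset = ascending.sel(latitude=slice(-50, 50))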

chunking schema for analysis-ready?

Hey guys! Love to see this effort.

Quick question about the chunking schema you're planning for an analysis-ready corpus. Will you keep the native ERA5 chunking, i.e. {'time':1}? Are you chunking variables together?

With h2ox, we've chunked hourly ERA5-Land in 4-year blocks with some moderate spatial aggregation, e.g. 5 degrees. Our main use case is quick/easy retrieval of time series. It would be good to know what you're thinking for the chunking schema here!

Where should I be cautious? What are the limitations of the dataset?

Answer the question in the FAQ.

Notes:

  • Surface variables at the coastline!
  • Land-sea mask. See: the Bay Area, coastal China (Shanghai), Tokyo, coastal India, or Jakarta.
  • Areas of rapid topography change, e.g. precipitation over the Tibetan Plateau.
    • Link out to the papers floating around about this known issue.
  • We include it to be complete, but we don't recommend using it blindly.

Jupyter Notebook Improvements

Here is a list of feedback on our walkthrough notebooks from our team, in no particular order:

  • Add a warning in the Surface-Walkthrough that regridding takes a long time
  • In the model-level notebook, add a brief description of weather events happening on each datestring
  • Add extra instructions on how to authenticate in the details view, in case users hit permissions issues (i.e. application-default login).
  • In each notebook, document what a chunk means, and where the 48 number comes from.
  • Add labels & comments to figures
  • Improve plotting & visualization, especially better plotting projections
  • Delete extra cells in the notebooks
  • ML notebook copy: list all the variables in the same order as XArray prints them, e.g. have the variable short name be a prefix for the description
  • Add a note to the ML notebook on what the dimensions mean, esp. vertical levels
  • Add a "looking ahead" or conclusion section to the end of the ML notebook
  • Make sure the checked-in version of the ML notebook includes plots.

time unit error when opening single-level-forecast.zarr-v2

Hi,
If I try to open the single-level-forecast.zarr-v2 with:

sl_forecasts = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/co/single-level-forecast.zarr-v2/', 
    chunks={'time': 48},
    consolidated=True,
)

I get the error:
ValueError: Failed to decode variable 'valid_time': unable to decode time units 'hours since 1900-01-01 06:00:00' with "calendar 'proleptic_gregorian'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.
But cftime is already installed.

Also, the time unit seems strange: hours since 1900-01-01T06:00:00.000000; it is probably days since a more recent date.

So adding decode_times=False and doing

import pandas as pd

units, reference_date = sl_forecasts.time.attrs['units'].split('since')
sl_forecasts['time'] = pd.date_range(start=reference_date, periods=sl_forecasts.sizes['time'], freq='H')

would be wrong.
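
To see what is actually stored, the raw values and attributes can be inspected without decoding; a minimal diagnostic sketch:

import xarray as xr

raw = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/co/single-level-forecast.zarr-v2/',
    chunks={'time': 48},
    consolidated=True,
    decode_times=False,
)

# Stored units/calendar and the first few undecoded offsets.
print(raw['valid_time'].attrs)
print(raw['valid_time'].values[:5])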

Any idea how we can open the dataset with the correct time?

Thanks.
