
arco-era5's Introduction

Google Research

This repository contains code released by Google Research.

All datasets in this repository are released under the CC BY 4.0 International license, which can be found here: https://creativecommons.org/licenses/by/4.0/legalcode. All source files in this repository are released under the Apache 2.0 license, the text of which can be found in the LICENSE file.


Because the repo is large, we recommend you download only the subdirectory of interest:

SUBDIR=foo
svn export https://github.com/google-research/google-research/trunk/$SUBDIR

If you'd like to submit a pull request, you'll need to clone the repository; we recommend making a shallow clone (without history).

git clone git@github.com:google-research/google-research.git --depth=1

Disclaimer: This is not an official Google product.


arco-era5's People

Contributors

alxmrs, dabhicusp, darshansp19, mlshapiro, rajveer43, wundersooner


arco-era5's Issues

NaNs in 6-hourly analysis-ready dataset for 2m temperature

Hi there! I've come across NaNs in the 2m_temperature variable in the 6-hourly analysis-ready dataset -- MWE below -- does this reproduce for you?

Three strange observations:

  • I've tried the 1-hourly dataset and couldn't see any NaNs in any of the 24 hours for this date (2016-06-25).
  • No NaNs in neighbouring dates.
  • No NaNs in some other variables (u- and v-wind) in the same 6-h dataset.

import xarray as xr

source = "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2"
era5_zarr = xr.open_zarr(source, consolidated=True, chunks={"time": 48})
era5_zarr["2m_temperature"].sel(time="2015-06-28").load()

Returns

<xarray.DataArray '2m_temperature' (time: 4, latitude: 721, longitude: 1440)>
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],
       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],
       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],
       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32)
Coordinates:
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * longitude  (longitude) float32 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
  * time       (time) datetime64[ns] 2015-06-28 ... 2015-06-28T18:00:00
Attributes:
    long_name:   2 metre temperature
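
For anyone triaging this, a quick way to scan a month of the 6-hourly store for other affected timestamps (a minimal sketch, not part of the original report):

import xarray as xr

source = "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2"
ds = xr.open_zarr(source, consolidated=True, chunks={"time": 48})

# True for each timestamp in June 2015 that contains at least one NaN anywhere on the grid.
has_nan = (
    ds["2m_temperature"]
    .sel(time="2015-06")
    .isnull()
    .any(dim=["latitude", "longitude"])
    .compute()
)
print(has_nan.time.values[has_nan.values])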

Daily-averaged analysis-ready ERA5 data

Thanks for this fantastic initiative, it's great to be able to access ERA5 data straight into RAM with just a few lines of code.

Currently, only hourly instantaneous ERA5 fields are provided. However, it's common in environmental ML research to use daily averages (especially when compute is restricted or for proof-of-concept work). Are there any plans to provide a daily-averaged version of the analysis-ready data? This would reduce network traffic and presumably cut download time by roughly 24x, which would be very useful. Cheers!
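
In the meantime, daily means can be computed on the fly from the hourly store, at the cost of streaming all 24 hourly fields; a minimal sketch using the 0.25° analysis-ready dataset:

import xarray as xr

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2",
    chunks={"time": 48},
    consolidated=True,
)

# Daily-mean 2 m temperature for one month; every hourly field still has to be downloaded.
t2m_daily = ds["2m_temperature"].sel(time="2020-01").resample(time="1D").mean().compute()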

Explanation for chunking

Hey, I'm not a cloud expert, but I'm wondering about the rationale for the chunking you've chosen. I see that the Zarr stores have rather large chunk sizes. For example, the model-level variables have chunk size dask.array<chunksize=(48, 137, 410240)>, which works out to about 10 GB. My understanding was that a good chunk size for object storage is on the order of MBs. Wouldn't it make sense to use chunking of (1, 1, 410240), for example?
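
For reference, the ~10 GB figure follows directly from the chunk shape and dtype (a quick check, assuming uncompressed float32 values):

# 48 times × 137 model levels × 410240 horizontal points × 4 bytes per float32
chunk_bytes = 48 * 137 * 410240 * 4
print(chunk_bytes / 1e9)  # ≈ 10.8 GB (≈ 10.0 GiB) before compression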

Add TISR 212 and LSM 172

It would be great to see whether this dataset could be used to train machine learning models such as GraphCast. There are a couple of variables missing, though.
[Screenshot of the GraphCast input-variable table, taken from https://arxiv.org/pdf/2212.12794.pdf]
I think the two missing fields are incident solar radiation (TISR, param 212) and the land-sea mask (LSM, param 172). Any interest in including these, or even in making all of the ERA5 variables available?

Rechunk Pressure level data in lat lon dataset

I have been using the lat/lon dataset at gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3. It has been extremely helpful for setting up different projects. I am wondering if it would be possible to rechunk the pressure-level data. Currently, all pressure levels are in a single chunk, so if we want to subsample a few levels we end up fetching the entire chunk, which can significantly reduce effective bandwidth. Given that this is object storage, we could ideally use much smaller chunks and have each chunk be just the lat/lon grid. What do you think?

How to increase the speed of saving ERA5 chunks data?

Hi, everyone. It is really convenient to access ERA5 data from cloud storage. However, it's very slow to save the processed data in netCDF format: it has taken 40 minutes so far and still has not finished. How can I solve this problem and speed up saving the ERA5 data? This is my code.

import xarray as xr

reanalysis = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2',
    chunks={'time': 48},
    consolidated=True)

## Hourly 2 m temperature over China, 1961-2020 (lazy selection; ~119 GB of data)
data_CN = reanalysis["2m_temperature"].loc["1961-01-01":'2020-12-31', 60:0, 70:130]

## Time averages (note: resample(time="M") gives monthly, not daily, means); ~159 MB
data_CN_daily = data_CN.resample(time="M").mean()

data_CN_daily = data_CN_daily.compute()    ## This step takes a long time (50+ minutes)

data_CN_daily.to_netcdf("data_cn.nc")
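
Not an official recommendation, but one thing that often helps here is to start a local Dask cluster so the chunk downloads and the reduction run in parallel across worker processes, and to let to_netcdf drive the computation instead of calling .compute() first; a rough sketch:

from dask.distributed import Client
import xarray as xr

client = Client()  # local cluster; parallelizes chunk fetches and the mean reduction

reanalysis = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2',
    chunks={'time': 48},
    consolidated=True)

data_CN = reanalysis["2m_temperature"].loc["1961-01-01":'2020-12-31', 60:0, 70:130]

# Writes the monthly means chunk by chunk instead of materializing everything in memory first.
data_CN.resample(time="M").mean().to_netcdf("data_cn.nc")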

Inconsistent variable naming for 'z' and 'zs' in data repository and storage bucket.

As per ECMWF's official documentation, this variable should be named z, not zs (see https://codes.ecmwf.int/grib/param-db/?id=129). However, both in the repository (https://github.com/google-research/arco-era5/blob/main/raw/era5_ml_zs.cfg) and within the storage bucket where the data lives (gs://gcp-public-data-arco-era5/raw/ERA5GRIB/HRES/Month/**), the variable is currently named zs. To ensure consistency, this should be updated.

Store data in uint16

Hey,

Totally amazing project. I wanted to ask about compression/uint16 storage. When I download lat/lon ERA5 data from the CDS API, the data is compressed and stored as uint16 along with scale factors to convert back to float32. The dataset here appears to be stored as full float32. Is there a reason not to store it in the original uint16? It saves a tremendous amount of bandwidth, and when using ERA5 in ML workflows this can make a big difference.
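
For context, the packing used by the CDS NetCDF downloads is the standard CF scale-and-offset convention, which xarray can apply through the encoding dict when writing; a minimal sketch with illustrative (not the project's) values:

import numpy as np
import xarray as xr

t2m = xr.DataArray(
    np.random.uniform(220.0, 320.0, size=(2, 4, 8)).astype("float32"),
    dims=("time", "latitude", "longitude"),
    name="2m_temperature",
)

# Pack float32 values into 16-bit integers; readers reverse this on decode as
# float = stored_int * scale_factor + add_offset.
encoding = {
    "2m_temperature": {
        "dtype": "int16",
        "scale_factor": 0.01,
        "add_offset": 270.0,
        "_FillValue": -32767,
    }
}
t2m.to_dataset().to_netcdf("t2m_packed.nc", encoding=encoding)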

Snow depth latitude flipped for select dates

It seems that for a select few dates the snow_depth variable is flipped (i.e., the latitudes are reversed) in this dataset: gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/. I only checked values at 6-hour intervals from 1979 to 2021, but I found the fields are flipped at these timestamps:

Flipped data at 1981-03-16T00:00:00.000000000
Flipped data at 1981-03-16T06:00:00.000000000
Flipped data at 1981-03-16T12:00:00.000000000
Flipped data at 1981-03-16T18:00:00.000000000
Flipped data at 1982-04-06T00:00:00.000000000
Flipped data at 1982-04-06T06:00:00.000000000
Flipped data at 1982-04-06T12:00:00.000000000
Flipped data at 1982-04-06T18:00:00.000000000
Flipped data at 1985-12-11T00:00:00.000000000
Flipped data at 1985-12-11T06:00:00.000000000
Flipped data at 1985-12-11T12:00:00.000000000
Flipped data at 1985-12-11T18:00:00.000000000
Flipped data at 1987-11-30T00:00:00.000000000
Flipped data at 1987-11-30T06:00:00.000000000
Flipped data at 1987-11-30T12:00:00.000000000
Flipped data at 1987-11-30T18:00:00.000000000
Flipped data at 1990-03-05T00:00:00.000000000
Flipped data at 1990-03-05T06:00:00.000000000
Flipped data at 1990-03-05T12:00:00.000000000
Flipped data at 1990-03-05T18:00:00.000000000
Flipped data at 1990-04-02T00:00:00.000000000
Flipped data at 1990-04-02T06:00:00.000000000
Flipped data at 1990-04-02T12:00:00.000000000
Flipped data at 1990-04-02T18:00:00.000000000
Flipped data at 1990-08-12T00:00:00.000000000
Flipped data at 1990-08-12T06:00:00.000000000
Flipped data at 1990-08-12T12:00:00.000000000
Flipped data at 1990-08-12T18:00:00.000000000
Flipped data at 1997-05-15T00:00:00.000000000
Flipped data at 1997-05-15T06:00:00.000000000
Flipped data at 1997-05-15T12:00:00.000000000
Flipped data at 1997-05-15T18:00:00.000000000
Flipped data at 2002-03-17T00:00:00.000000000
Flipped data at 2002-03-17T06:00:00.000000000
Flipped data at 2002-03-17T12:00:00.000000000
Flipped data at 2002-03-17T18:00:00.000000000
Flipped data at 2003-11-26T00:00:00.000000000
Flipped data at 2003-11-26T06:00:00.000000000
Flipped data at 2003-11-26T12:00:00.000000000
Flipped data at 2003-11-26T18:00:00.000000000
Flipped data at 2004-02-10T00:00:00.000000000
Flipped data at 2004-02-10T06:00:00.000000000
Flipped data at 2004-02-10T12:00:00.000000000
Flipped data at 2004-02-10T18:00:00.000000000
Flipped data at 2006-04-12T00:00:00.000000000
Flipped data at 2006-04-12T06:00:00.000000000
Flipped data at 2006-04-12T12:00:00.000000000
Flipped data at 2006-04-12T18:00:00.000000000
Flipped data at 2007-06-19T00:00:00.000000000
Flipped data at 2007-06-19T06:00:00.000000000
Flipped data at 2007-06-19T12:00:00.000000000
Flipped data at 2007-06-19T18:00:00.000000000
Flipped data at 2009-03-05T00:00:00.000000000
Flipped data at 2009-03-05T06:00:00.000000000
Flipped data at 2009-03-05T12:00:00.000000000
Flipped data at 2009-03-05T18:00:00.000000000
Flipped data at 2013-11-11T00:00:00.000000000
Flipped data at 2013-11-11T06:00:00.000000000
Flipped data at 2013-11-11T12:00:00.000000000
Flipped data at 2013-11-11T18:00:00.000000000
Flipped data at 2014-05-11T00:00:00.000000000
Flipped data at 2014-05-11T06:00:00.000000000
Flipped data at 2014-05-11T12:00:00.000000000
Flipped data at 2014-05-11T18:00:00.000000000
Flipped data at 2017-03-17T00:00:00.000000000
Flipped data at 2017-03-17T06:00:00.000000000
Flipped data at 2017-03-17T12:00:00.000000000
Flipped data at 2017-03-17T18:00:00.000000000
Flipped data at 2020-05-19T00:00:00.000000000
Flipped data at 2020-05-19T06:00:00.000000000
Flipped data at 2020-05-19T12:00:00.000000000
Flipped data at 2020-05-19T18:00:00.000000000
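
Until the store is repaired, affected timestamps can be flipped back on the client side; a minimal sketch, assuming the list above is what needs fixing:

import xarray as xr

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/",
    chunks={"time": 48},
    consolidated=True,
)

# Re-flip one affected field along latitude so it matches the rest of the record.
snap = ds["snow_depth"].sel(time="1981-03-16T00:00:00").load()
repaired = snap.copy(data=snap.values[::-1, :])  # values reversed; latitude coordinate unchanged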

Could surface solar radiation be added?

Hi,
Really great work! It's nice being able to stream in all this ERA5 data. I was wondering if some more surface solar radiation variables could be added. The reason is to help with renewable energy generation forecasting, specifically PV/solar generation, and training models to forecast it.

Add hooks to open datasets with Google Colab

  • For the existing notebooks, add an "open with Colab" button at the top. The trick here is to set up the notebook's dependencies ahead of time so everything "just works".
  • Add a series of "open with Colab" buttons to the main README to help users start using the data immediately.

Automatically ingest raw ERA5 data as soon as it's available from ECMWF.

Ideally, we'd like to make sure that this repository has new raw data from ECMWF as soon as it's available. For at least a few datasets, Copernicus makes new data available on a daily cadence.

Implementation Notes

Let's modify our existing weather-dl scripts to ingest new data from CDS on a cron via GitHub Actions. On every run, we should extend the config to the current date. This may require modifying weather-dl's config parser a bit first (google/weather-tools#267).

Blocked from accessing data

Hey,

I was using ARCO ERA5 to generate a training dataset for neural networks. I was pulling data last night, and after pulling ~200 GB I started getting the error below. I'm not super familiar with Google Cloud Storage, but are there limits on how much data can be pulled from ARCO ERA5?

    raise HttpError({"code": status, "message": msg})  # text-like
gcsfs.retry.HttpError: <html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><title>Sorry...</title><style> body { font-family: verdana, arial, sans-serif; background-color: #fff; color: #000; }</style></head><body><div><table><tr><td><b><font face=sans-serif size=10><font color=#4285f4>G</font><font color=#ea4335>o</font><font color=#fbbc05>o</font><font color=#4285f4>g</font><font color=#34a853>l</font><font color=#ea4335>e</font></font></b></td><td style="text-align: left; vertical-align: bottom; padding-bottom: 15px; width: 50%"><div style="border-bottom: 1px solid #dfdfdf;">Sorry...</div></td></tr></table></div><div style="margin-left: 4em;"><h1>We're sorry...</h1><p>... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.</p></div><div style="margin-left: 4em;">See <a href="https://support.google.com/websearch/answer/86640">Google Help</a> for more information.<br/><br/></div><div style="text-align: center; border-top: 1px solid #dfdfdf;"><a href="https://www.google.com">Google Home</a></div></body></html>, 429
[2023-12-15 10:08:15,139][gcsfs][ERROR] - _request out of retries on exception: <html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><title>Sorry...</title><style> body { font-family: verdana, arial, sans-serif; background-color: #fff; color: #000; }</style></head><body><div><table><tr><td><b><font face=sans-serif size=10><font color=#4285f4>G</font><font color=#ea4335>o</font><font color=#fbbc05>o</font><font color=#4285f4>g</font><font color=#34a853>l</font><font color=#ea4335>e</font></font></b></td><td style="text-align: left; vertical-align: bottom; padding-bottom: 15px; width: 50%"><div style="border-bottom: 1px solid #dfdfdf;">Sorry...</div></td></tr></table></div><div style="margin-left: 4em;"><h1>We're sorry...</h1><p>... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.</p></div><div style="margin-left: 4em;">See <a href="https://support.google.com/websearch/answer/86640">Google Help</a> for more information.<br/><br/></div><div style="text-align: center; border-top: 1px solid #dfdfdf;"><a href="https://www.google.com">Google Home</a></div></body></html>, 429

Minimal example that gives the error for me,

import fsspec
import xarray as xr
from pathlib import Path

# Load ARCO ERA5 lazily from the public bucket
fs = fsspec.filesystem('gs')
arco_filename = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'
arco_mapper = fs.get_mapper(arco_filename)
arco_era5 = xr.open_zarr(arco_mapper, consolidated=True)

save_dir = Path('./')
var_name = "10m_u_component_of_wind"
zarr_path = save_dir / f"{var_name}.zarr"

# Save (build the lazy write graph without computing yet)
delayed_obj = arco_era5[var_name].to_zarr(zarr_path, consolidated=True, compute=False)

# Wait for save to finish
delayed_obj.compute()

*Fixed typo in example
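
Not an official answer, but one way to make a large pull like this more robust to 429 throttling is to write the variable in smaller batches (for example, one year at a time), so an interrupted transfer can be resumed rather than restarted; a rough sketch:

import xarray as xr
from pathlib import Path

arco_era5 = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    consolidated=True,
)

var_name = "10m_u_component_of_wind"
zarr_path = Path(f"{var_name}.zarr")

# Write one year at a time; if throttled, re-run starting from the last completed year.
for i, year in enumerate(range(1979, 2023)):
    batch = arco_era5[[var_name]].sel(time=str(year))
    if i == 0:
        batch.to_zarr(zarr_path, mode="w", consolidated=True)
    else:
        batch.to_zarr(zarr_path, append_dim="time", consolidated=True)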

Ingest all variables for single-level reanalysis.

Add a new single-level reanalysis config to ingest the remaining variables from ECMWF within the same time range. We can then create a new XArray-Beam pipeline to append these variables to the existing cloud-optimized (CO) Zarr dataset.

Variables are listed here: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview

We can create more DL configs like this one: https://github.com/google-research/arco-era5/blob/main/raw/era5_sfc_rad.cfg

See https://weather-tools.readthedocs.io/ or https://github.com/google-research/arco-era5/tree/main/raw for instructions on how to use weather-dl.

Design analysis ready Zarr to allow updating with preliminary ERA5 data.

For phase 2, we'd like to produce surface and atmospheric Zarrs that can be updated with preliminary data. Specifically, we intend to backfill the raw data covering 1959 to 1978. It's possible that in the future, ECMWF will produce an even earlier backfill. As I understand it, the standard structure of Zarr only allows appending to the end.
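
For reference, the append path that xarray exposes today looks like the sketch below (store paths are hypothetical), which is why prepending earlier years is the awkward case:

import xarray as xr

# Hypothetical stores, for illustration only.
new_chunk = xr.open_zarr("gs://example-bucket/preliminary-update.zarr")
new_chunk.to_zarr("gs://example-bucket/analysis-ready.zarr", append_dim="time")

# This only extends the *end* of the time dimension; there is no "prepend_dim",
# so inserting 1959-1978 before the existing start would mean rewriting or
# re-indexing the existing time axis.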

The aim for this issue would be to devise a means of avoiding recomputing our Zarr datasets whenever we want to include new data at earlier times.

For overall pipeline structure, I have the following sketch in mind:

  • For each epoch of preliminary data we ingest from ECMWF, we manually produce a new cloud optimized dataset. This can use the scripts in this project that we've already developed.
  • For recent data, we use Pangeo components to update the cloud-optimized datasets (e.g. pangeo-forge/pangeo-forge-recipes#447). These download and append steps can be automated to a regular cadence (monthly or quarterly).
  • For the analysis ready version, we create XArray-Beam pipelines to transform the Zarr datasets.
  • After a new cloud-optimized preliminary dataset is produced, we invoke the same XArray-Beam pipelines to update the analysis ready versions.

@shoyer @rabernat: Do either of you have any thoughts on how we could structure our Zarr to these ends?

Lat/lon gridded data does not have monotonically increasing latitudes

First of all THANK YOU so much for this effort! Having ERA5 data available in an ARCO format is truly a game changer!

I noticed a small issue: the latitudes of the lat/lon gridded data (1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2/) are in decreasing order

import xarray as xr

ar_full_37_1h = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2/',
).isel(time=0)
ar_full_37_1h

which makes selecting a region in xarray slightly counterintuitive:

ar_full_37_1h.sel(latitude=slice(-50, 50))

returns no latitude indices,

while

ar_full_37_1h.sel(latitude=slice(50, -50))

gives the desired subset.

If you end up reprocessing the data at some point, I wonder if something like xarray's ds.sortby('latitude') (or equivalent) could be added to the pipeline.
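
In the meantime, the same reordering can be applied on the client side; a small sketch:

import xarray as xr

ar_full_37_1h = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2/',
).isel(time=0)

# Reindex to ascending latitudes so slice(-50, 50) behaves as expected.
ascending = ar_full_37_1h.sortby('latitude')
subset = ascending.sel(latitude=slice(-50, 50))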

chunking schema for analysis-ready?

Hey guys! Love to see this effort.

Quick question about the chunking schema you're planning for an analysis-ready corpus. Will you keep the native ERA5 chunking, i.e. {'time':1}? Are you chunking variables together?

With h2ox, we've chunked hourly ERA5-Land in 4-year blocks with some moderate spatial aggregation, e.g. 5 degrees. Our main use case is quick/easy retrieval of time series. It would be good to know what you're thinking for the chunking schema here!

Where should I be cautious? What are the limitations of the dataset?

Answer the question in the FAQ.

Notes:

  • Surface variables at the coastline!
  • Land-sea mask. See: the Bay Area, coastal China (Shanghai), Tokyo, coastal India, or Jakarta.
  • Areas of rapid topography change, e.g. precipitation over the Tibetan Plateau.
    • Link out to the papers floating around about this known issue.
  • We include it to be complete, but we don't recommend using it blindly.

Jupyter Notebook Improvements

Here is a list of feedback on our walkthrough notebooks from our team, in no particular order:

  • Add a warning in the Surface-Walkthrough that regridding takes a long time
  • In the model-level notebook, add a brief description of weather events happening on each datestring
  • Add extra instructions on how to authenticate in the details view, in case users hit permissions issues (i.e. application-default login).
  • In each notebook, document what a chunk means, and where the 48 number comes from.
  • Add labels & comments to figures
  • Improve plotting & visualization, especially better plotting projections
  • Delete extra cells in the notebooks
  • ML notebook copy: list all the variables in the same order as XArray prints them, e.g. have the variable short name be a prefix for the description
  • Add a note to the ML notebook on what the dimensions mean, esp. vertical levels
  • Add a "looking ahead" or conclusion section to the end of the ML notebook
  • Make sure the checked-in version of the ML notebook includes plots.

time unit error when opening single-level-forecast.zarr-v2

Hi,
If I try to open the single-level-forecast.zarr-v2 with:

sl_forecasts = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/co/single-level-forecast.zarr-v2/', 
    chunks={'time': 48},
    consolidated=True,
)

I get the error:
ValueError: Failed to decode variable 'valid_time': unable to decode time units 'hours since 1900-01-01 06:00:00' with "calendar 'proleptic_gregorian'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.
But cftime is already installed.

Also, the time unit seems strange: hours since 1900-01-01T06:00:00.000000; it is probably days since a more recent date.

So adding decode_times=False and doing

import pandas as pd

units, reference_date = sl_forecasts.time.attrs['units'].split('since')
sl_forecasts['time'] = pd.date_range(start=reference_date, periods=sl_forecasts.sizes['time'], freq='H')

would be wrong.
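
To see what is actually stored, the raw values and attributes can be inspected without decoding; a minimal diagnostic sketch:

import xarray as xr

raw = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/co/single-level-forecast.zarr-v2/',
    chunks={'time': 48},
    consolidated=True,
    decode_times=False,
)

# Stored units/calendar and the first few undecoded offsets.
print(raw['valid_time'].attrs)
print(raw['valid_time'].values[:5])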

Any idea how we can open the dataset with the correct time?

Thanks.
