pangeo-smithy's Introduction

Pangeo Forge

The idea of Pangeo Forge is to copy the very successful pattern of Conda Forge for crowdsourcing the curation of an analysis-ready data library.

In Conda Forge, a maintainer contributes a recipe which is used to generate a conda package from a source code tarball. Behind the scenes, CI downloads the source code, builds the package, and uploads it to a repository.

In Pangeo Forge, a maintainer contributes a recipe which is used to generate an analysis-ready, cloud-based copy of a dataset in a cloud-optimized format like Zarr. Behind the scenes, CI downloads the original files from their source (e.g. FTP, HTTP, or OPeNDAP), combines them using xarray, writes out the Zarr store, and uploads it to cloud storage.
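
Conceptually, the CI step for a single recipe reduces to something like the sketch below. This is a minimal illustration, not the actual Pangeo Forge implementation: the URLs, chunk sizes, and storage target are placeholders, and credential handling is ignored.

import fsspec
import xarray as xr

# Placeholder inputs; a real recipe would generate these from its metadata
source_urls = [
    "https://example.com/data/sst-19810901.nc",
    "https://example.com/data/sst-19810902.nc",
]
target = "gcs://example-bucket/sst.zarr"

# 1. Download the original files to a local cache
local_paths = []
for url in source_urls:
    local_path = url.split("/")[-1]
    with fsspec.open(url, "rb") as remote, open(local_path, "wb") as local:
        local.write(remote.read())
    local_paths.append(local_path)

# 2. Open and combine the files into a single dataset with xarray
ds = xr.open_mfdataset(local_paths, combine="by_coords")

# 3. Rechunk and write a consolidated Zarr store to cloud storage
ds = ds.chunk({"time": 5})
ds.to_zarr(fsspec.get_mapper(target), consolidated=True)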

There are many details to work out. It is useful to think about what ingredients might go into a recipe.

Sample meta.yaml recipe:

dataset:
  # should be a unique name and valid python identifier
  name: noaa_oisst_avhrr_only
  # how should we version datasets?
  version: 0.1.0

# where do we find the files
source:
  format: netcdf
  # example: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/access/avhrr-only/198109/avhrr-only-v2.19810901.nc
  url_pattern: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/access/avhrr-only/{{ yyyymm }}/avhrr-only-v2.{{ yyyymmdd }}.nc
  # mirror pandas syntax
  # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html
  date_range:
    start: 1981-09-01
    # the end date can be extended as new data become available
    end: 2019-08-11
    freq: D

# tell xarray how to open and combine the files
combine:
  # options passed to xarray.open_dataset
  open_options:
    decode_cf: False
  combine_method: combine_by_coords
  # options passed to combine method
  combine_options:
    compat: no_conflicts

# where / how to save the data
output:
  format: zarr
  chunks:
    time: 5
  consolidated: True
  target:
    urlpath: gcs://pangeo-noaa/
    # how to deal with credentials?
    credentials: ???

about:
  home: https://www.ncdc.noaa.gov/oisst
  # how to deal with licensing?
  license: ???
  summary: 'NOAA Optimally Interpolated Sea Surface Temperature (AVHRR Only)'
  description: |
    The NOAA 1/4° daily Optimum Interpolation Sea Surface Temperature (or daily OISST) is an analysis constructed by combining observations from different platforms (satellites, ships, buoys) on a regular global grid. A spatially complete SST map is produced by interpolating to fill in gaps.
    The methodology includes bias adjustment of satellite and ship observations (referenced to buoys) to compensate for platform differences and sensor biases. This proved critical during the Mt. Pinatubo eruption in 1991, when the widespread presence of volcanic aerosols resulted in infrared satellite temperatures that were much cooler than actual ocean temperatures (Reynolds 1993).
  support_email: [email protected]
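
The source section implies a straightforward expansion step: substitute every date produced by the pandas-style date_range into url_pattern to obtain the list of files to fetch. A minimal sketch of that expansion, using plain Python string formatting in place of the {{ ... }} template syntax shown above:

import pandas as pd

url_pattern = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
    "/access/avhrr-only/{yyyymm}/avhrr-only-v2.{yyyymmdd}.nc"
)

# Mirror the recipe's date_range block with pandas.date_range
dates = pd.date_range(start="1981-09-01", end="2019-08-11", freq="D")

# Fill the template once per date
urls = [
    url_pattern.format(yyyymm=d.strftime("%Y%m"), yyyymmdd=d.strftime("%Y%m%d"))
    for d in dates
]

print(urls[0])
# https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/access/avhrr-only/198109/avhrr-only-v2.19810901.nc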

pangeo-smithy's People

Contributors

jhamman, rabernat

pangeo-smithy's Issues

Tracking flow status after execution

Our current setup in our demo feedstock (pangeo-forge/terraclimate-feedstock@a9229cb) uses GitHub Actions to run a recipe's Prefect flow. The flow run step (prefect run flow ...) works, but it exits as soon as the flow is delivered to the Prefect agent. This leads to a situation where CI reports a successful deployment before the flow has actually executed.

So we need some way to track the flow's deployment status in addition to the CI execution. Several ideas were discussed on today's call.
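
One possible direction is to have the CI job poll the Prefect backend until the flow run reaches a terminal state, and only then report success or failure. A minimal sketch, assuming Prefect Core's Client.get_flow_run_info API and a flow-run ID captured from the deployment step (the ID below is a placeholder):

import sys
import time

from prefect import Client

client = Client()
flow_run_id = "00000000-0000-0000-0000-000000000000"  # placeholder: capture this from the deploy step

# Poll until the flow run reaches a terminal state, then propagate
# success or failure back to CI via the exit code.
while True:
    state = client.get_flow_run_info(flow_run_id).state
    if state.is_finished():
        break
    time.sleep(30)

sys.exit(0 if state.is_successful() else 1)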

Runtime for Pipelines

Right now, the pipelines are tightly coupled to their "runtime," by which I mean the environment context that varies from one deployment location to another.

Example: https://github.com/pangeo-forge/terraclimate-feedstock/blob/master/recipe/pipeline.py

We want to refactor this so that the runtime is specified separately from the pipeline.

Things attached to the runtime:

  • Prefect stuff:
    • execution environment (e.g. DaskKubernetesEnvironment, recipe/job.yaml, recipe/worker_pod.yaml)
    • flow storage (e.g. Docker)
    • flow registration (everything in the __main__ block)
  • Data storage stuff:
    • cache location
  • Credentials
    • docker
    • cloud storage
    • prefect token

What am I missing?
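
To make the separation concrete, here is one hypothetical shape for a runtime object; the names and fields are illustrative only and simply group the items listed above:

from dataclasses import dataclass

@dataclass
class Runtime:
    """Hypothetical container for everything that varies by deployment
    location, kept separate from the pipeline definition itself."""

    # Prefect stuff
    execution_environment: object   # e.g. a DaskKubernetesEnvironment
    flow_storage: object            # e.g. Docker storage
    # Data storage stuff
    cache_location: str             # e.g. "gcs://some-bucket/cache/"
    # Credentials
    docker_credentials: dict
    cloud_storage_credentials: dict
    prefect_token: str

def register(flow, runtime: Runtime):
    """Attach the runtime-specific pieces to a location-agnostic flow,
    replacing the registration logic that currently lives in __main__."""
    flow.environment = runtime.execution_environment
    flow.storage = runtime.flow_storage
    flow.register()

A pipeline.py would then define only the flow itself, and each deployment location would supply its own Runtime.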

Intake catalog?

It seems to me that it would be useful to automatically include produced artefacts in a catalog as part of the pipeline. Thoughts?
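
For example, the pipeline's final step could append an entry for the new Zarr store to an Intake catalog. A minimal sketch, assuming the zarr driver from the intake-xarray plugin; the path and metadata are placeholders:

import yaml

# Placeholder: the Zarr store produced by the pipeline
target = "gcs://pangeo-noaa/noaa_oisst_avhrr_only.zarr"

entry = {
    "sources": {
        "noaa_oisst_avhrr_only": {
            "description": "NOAA Optimally Interpolated Sea Surface Temperature (AVHRR Only)",
            "driver": "zarr",  # provided by intake-xarray
            "args": {
                "urlpath": target,
                "consolidated": True,
            },
        }
    }
}

# Write a catalog file that intake.open_catalog can read; a real pipeline
# would merge into an existing catalog rather than overwrite it.
with open("catalog.yaml", "w") as f:
    yaml.safe_dump(entry, f)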
