pangeo-smithy's Introduction

Pangeo Forge

The idea of Pangeo Forge is to copy the very successful pattern of Conda Forge for crowdsourcing the curation of an analysis-ready data library.

In Conda Forge, a maintainer contributes a recipe which is used to generate a conda package from a source code tarball. Behind the scenes, CI downloads the source code, builds the package, and uploads it to a repository.

In Pangeo Forge, a maintainer contributes a recipe which is used to generate an analysis-ready, cloud-based copy of a dataset in a cloud-optimized format like Zarr. Behind the scenes, CI downloads the original files from their source (e.g. FTP, HTTP, or OPeNDAP), combines them using xarray, writes out the Zarr store, and uploads it to cloud storage.
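
Conceptually, the CI step for a single recipe reduces to something like the sketch below. This is a minimal illustration, not the actual Pangeo Forge implementation: the URLs, chunk sizes, and storage target are placeholders, and credential handling is ignored.

import fsspec
import xarray as xr

# Placeholder inputs; a real recipe would generate these from its metadata
source_urls = [
    "https://example.com/data/sst-19810901.nc",
    "https://example.com/data/sst-19810902.nc",
]
target = "gcs://example-bucket/sst.zarr"

# 1. Download the original files to a local cache
local_paths = []
for url in source_urls:
    local_path = url.split("/")[-1]
    with fsspec.open(url, "rb") as remote, open(local_path, "wb") as local:
        local.write(remote.read())
    local_paths.append(local_path)

# 2. Open and combine the files into a single dataset with xarray
ds = xr.open_mfdataset(local_paths, combine="by_coords")

# 3. Rechunk and write a consolidated Zarr store to cloud storage
ds = ds.chunk({"time": 5})
ds.to_zarr(fsspec.get_mapper(target), consolidated=True)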

There are many details to work out. It is useful to think about what ingredients might go into a recipe.

Sample meta.yaml recipe:

dataset:
  # should be a unique name and valid python identifier
  name: noaa_oisst_avhrr_only
  # how should we version datasets?
  version: 0.1.0

# where do we find the files
source:
  format: netcdf
  # example: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/access/avhrr-only/198109/avhrr-only-v2.19810901.nc
  url_pattern: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/access/avhrr-only/{{ yyyymm }}/avhrr-only-v2.{{ yyyymmdd }}.nc
  # mirror pandas syntax
  # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html
  date_range:
    start: 1981-09-01
    # the end date can be extended as new data become available
    end: 2019-08-11
    freq: D

# tell xarray how to open and combine the files
combine:
  # options passed to xarray.open_dataset
  open_options:
    decode_cf: False
  combine_method: combine_by_coords
  # options passed to combine method
  combine_options:
    compat: no_conflicts

# where / how to save the data
output:
  format: zarr
  chunks:
    time: 5
  consolidated: True
  target:
    urlpath: gcs://pangeo-noaa/
    # how to deal with credentials?
    credentials: ???

about:
  home: https://www.ncdc.noaa.gov/oisst
  # how to deal with licensing?
  license: ???
  summary: 'NOAA Optimally Interpolated Sea Surface Temperature (AVHRR Only)'
  description: |
    The NOAA 1/4° daily Optimum Interpolation Sea Surface Temperature (or daily OISST) is an analysis constructed by combining observations from different platforms (satellites, ships, buoys) on a regular global grid. A spatially complete SST map is produced by interpolating to fill in gaps.
    The methodology includes bias adjustment of satellite and ship observations (referenced to buoys) to compensate for platform differences and sensor biases. This proved critical during the Mt. Pinatubo eruption in 1991, when the widespread presence of volcanic aerosols resulted in infrared satellite temperatures that were much cooler than actual ocean temperatures (Reynolds 1993).
  support_email: [email protected]
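
The source section implies a straightforward expansion step: substitute every date produced by the pandas-style date_range into url_pattern to obtain the list of files to fetch. A minimal sketch of that expansion, using plain Python string formatting in place of the {{ ... }} template syntax shown above:

import pandas as pd

url_pattern = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation"
    "/access/avhrr-only/{yyyymm}/avhrr-only-v2.{yyyymmdd}.nc"
)

# Mirror the recipe's date_range block with pandas.date_range
dates = pd.date_range(start="1981-09-01", end="2019-08-11", freq="D")

# Fill the template once per date
urls = [
    url_pattern.format(yyyymm=d.strftime("%Y%m"), yyyymmdd=d.strftime("%Y%m%d"))
    for d in dates
]

print(urls[0])
# https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/access/avhrr-only/198109/avhrr-only-v2.19810901.nc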

pangeo-smithy's People

Contributors

jhamman, rabernat

pangeo-smithy's Issues

Tracking flow status after execution

Our current setup in our demo feedstock (pangeo-forge/terraclimate-feedstock@a9229cb) uses GitHub Actions to run a recipe's Prefect flow. The flow run step (prefect run flow ...) works, but it exits as soon as the flow is delivered to the Prefect agent. This leads to a situation where CI reports a successful deployment before the flow has actually executed.

So we need some way to track the flow's deployment status in addition to the CI execution. Several ideas were discussed on today's call.
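
One possible direction is to have the CI job poll the Prefect backend until the flow run reaches a terminal state, and only then report success or failure. A minimal sketch, assuming Prefect Core's Client.get_flow_run_info API and a flow-run ID captured from the deployment step (the ID below is a placeholder):

import sys
import time

from prefect import Client

client = Client()
flow_run_id = "00000000-0000-0000-0000-000000000000"  # placeholder: capture this from the deploy step

# Poll until the flow run reaches a terminal state, then propagate
# success or failure back to CI via the exit code.
while True:
    state = client.get_flow_run_info(flow_run_id).state
    if state.is_finished():
        break
    time.sleep(30)

sys.exit(0 if state.is_successful() else 1)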

Runtime for Pipelines

Right now, the pipelines are tightly coupled to their "runtime," by which I mean the environment context that varies from one deployment location to another.

Example: https://github.com/pangeo-forge/terraclimate-feedstock/blob/master/recipe/pipeline.py

We want to refactor this so that the runtime is specified separately from the pipeline.

Things attached to the runtime:

  • Prefect stuff:
    • execution environment (e.g. DaskKubernetesEnvironment, recipe/job.yaml, recipe/worker_pod.yaml)
    • flow storage (e.g. Docker)
    • flow registration (everything in the __main__ block)
  • Data storage stuff:
    • cache location
  • Credentials
    • docker
    • cloud storage
    • prefect token

What am I missing?
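
To make the separation concrete, here is one hypothetical shape for a runtime object; the names and fields are illustrative only and simply group the items listed above:

from dataclasses import dataclass

@dataclass
class Runtime:
    """Hypothetical container for everything that varies by deployment
    location, kept separate from the pipeline definition itself."""

    # Prefect stuff
    execution_environment: object   # e.g. a DaskKubernetesEnvironment
    flow_storage: object            # e.g. Docker storage
    # Data storage stuff
    cache_location: str             # e.g. "gcs://some-bucket/cache/"
    # Credentials
    docker_credentials: dict
    cloud_storage_credentials: dict
    prefect_token: str

def register(flow, runtime: Runtime):
    """Attach the runtime-specific pieces to a location-agnostic flow,
    replacing the registration logic that currently lives in __main__."""
    flow.environment = runtime.execution_environment
    flow.storage = runtime.flow_storage
    flow.register()

A pipeline.py would then define only the flow itself, and each deployment location would supply its own Runtime.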

Intake catalog?

It seems to me that it would be useful to automatically include produced artefacts in a catalog as part of the pipeline. Thoughts?
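
For example, the pipeline's final step could append an entry for the new Zarr store to an Intake catalog. A minimal sketch, assuming the zarr driver from the intake-xarray plugin; the path and metadata are placeholders:

import yaml

# Placeholder: the Zarr store produced by the pipeline
target = "gcs://pangeo-noaa/noaa_oisst_avhrr_only.zarr"

entry = {
    "sources": {
        "noaa_oisst_avhrr_only": {
            "description": "NOAA Optimally Interpolated Sea Surface Temperature (AVHRR Only)",
            "driver": "zarr",  # provided by intake-xarray
            "args": {
                "urlpath": target,
                "consolidated": True,
            },
        }
    }
}

# Write a catalog file that intake.open_catalog can read; a real pipeline
# would merge into an existing catalog rather than overwrite it.
with open("catalog.yaml", "w") as f:
    yaml.safe_dump(entry, f)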
