score-hv

Summary

Python package used to harvest metrics from reanalysis data. This repository is a standalone Python package that can be used as part of a larger workflow (e.g., with the score-db and score-monitoring repositories).

Setup and installation

The repository can be downloaded using git:

git clone https://github.com/NOAA-PSL/score-hv.git

For testing and development, we recommend creating a new Python environment (e.g., using mamba as shown below, or other options such as conda). To install the required dependencies into a new environment using the micromamba command-line interface, run the following after installing mamba/micromamba:

micromamba create -f environment.yml; micromamba activate score-hv-default-env

Depending on your use case, you can install score-hv with pip using one of three methods:

pip install . # default installation into active environment

pip install -e . # editable installation into active environment, useful for development

pip install -t [TARGET_DIR] --upgrade . # target installation into TARGET_DIR, useful for deploying for cylc workflows (see https://cylc.github.io/cylc-doc/stable/html/tutorial/runtime/introduction.html#id3)

Verify the installation by running the unit test suite. There are no expected test failures.

pytest tests

Harvesting metric data with score-hv

score-hv takes in either a YAML file or a Python dictionary via harvester_base.py, which specifies the harvester to call, the input data files, and other inputs to the harvester (such as which variables and statistics to harvest). Example input dictionaries for each harvester are provided in the Available Harvesters section below. Calls can be made directly from the command line or by importing the score-hv module and calling harvester_base.harvest([harvester_config filename / config dictionary]).

For example, the following dictionary could be used to request the global, gridcell-area-weighted mean, variance, minimum, and maximum of the temporally weighted (in this case daily) tmp2m data from the given netCDF files.

Example input dictionary:

VALID_CONFIG_DICT = {'harvester_name': 'daily_bfg',
                     'filenames': [
                         '/filepath/tmp2m_bfg_2023032100_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032106_fhr06_control.nc',
                         '/filepath/tmp2m_bfg_2023032106_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032112_fhr06_control.nc',
                         '/filepath/tmp2m_bfg_2023032112_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032118_fhr06_control.nc',
                         '/filepath/tmp2m_bfg_2023032118_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032200_fhr06_control.nc'],
                     'statistic': ['mean', 'variance', 'minimum', 'maximum'],
                     'variable': ['tmp2m']}
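
With the dictionary above, a minimal usage sketch might look like the following (assuming the package is importable as score_hv and the file paths point at real data):

from score_hv import harvester_base

# harvest from a config dictionary...
harvested_data = harvester_base.harvest(VALID_CONFIG_DICT)

# ...or, equivalently, from a YAML file with the same keys
# harvested_data = harvester_base.harvest('harvester_config.yaml')

for item in harvested_data:
    print(item.variable, item.value, item.units)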

A request dictionary must provide the harvester_name and filenames. Supported harvester_name(s) are provided below, and each harvester may have additional input options or requirements.

Available Harvesters

inc_logs

increment descriptive statistics from log files

Expected file format: log output file

Returns the following named tuple:

HarvestedData = namedtuple('HarvestedData', ['logfile',
                                             'cycletime',
                                             'statistic',
                                             'variable',
                                             'value', 
                                             'units'])

Available statistics

VALID_STATISTICS = ('mean', 'RMS')

Available variables

VALID_VARIABLES = ['pt_inc', 's_inc', 'u_inc', 'v_inc', 'SSH', 'Salinity',
                   'Temperature', 'Speed of Currents', 'o3mr_inc', 'sphum_inc',
                   'T_inc', 'delp_inc', 'delz_inc']

Example dictionary input

from datetime import datetime

VALID_CONFIG_DICT = {'harvester_name': 'inc_logs',
                     'filename': '/filepath/calc_atm_inc.out',
                     'cycletime': datetime(2019, 3, 21, 0),
                     'statistic': ['mean', 'RMS'],
                     'variable': ['o3mr_inc', 'sphum_inc', 'T_inc', 'u_inc', 'v_inc',
                                  'delp_inc', 'delz_inc']}

daily_bfg

The daily_bfg harvester pulls data from the background forecast data files. It calculates daily statistics from the provided files.

A netCDF file containing gridcell area weights is required; this file is included in the data folder of the GitHub repository.

The daily_bfg harvester returns the following named tuple:

HarvestedData = namedtuple('HarvestedData', ['filenames',
                                             'statistics',
                                             'variable',
                                             'value',
                                             'units',
                                             'mediantime',
                                             'longname',
                                             'surface_mask', 
                                             'region'])

The example filenames are listed in the dictionary example below.

Available Harvester statistics:

A list of statistics. Valid statistics are 'mean', 'variance', 'minimum', and 'maximum'.

Available Harvester variables:

VALID_VARIABLES = (
                   'lhtfl_ave',    # surface latent heat flux (W/m**2)
                   'shtfl_ave',    # surface sensible heat flux (W/m**2)
                   'dlwrf_ave',    # surface downward longwave flux (W/m**2)
                   'dswrf_ave',    # averaged surface downward shortwave flux (W/m**2)
                   'ulwrf_ave',    # surface upward longwave flux (W/m**2)
                   'uswrf_ave',    # averaged surface upward shortwave flux (W/m**2)
                   'netrf_avetoa', # top of atmosphere net radiative flux (SW and LW) (W/m**2)
                   'netef_ave',    # surface energy balance (W/m**2)
                   'prateb_ave',   # surface precip rate (mm weq. s^-1)
                   'soil4',        # liquid soil moisture at layer-4 (?)
                   'soilm',        # total column soil moisture content (mm weq.)
                   'soilt4',       # soil temperature unknown layer 4 (K)
                   'tg3',          # deep soil temperature (K)
                   'tmp2m',        # 2m (surface air) temperature (K)
                   'ulwrf_avetoa', # top of atmosphere upward longwave flux (W m^-2)
                   )

The variable netrf_avetoa is calculated from:

   dswrf_avetoa: averaged top of atmosphere downward shortwave flux
   uswrf_avetoa: averaged top of atmosphere upward shortwave flux
   ulwrf_avetoa: top of atmosphere upward longwave flux

   These variables are found in the bfg control files.
   netrf_avetoa = dswrf_avetoa - uswrf_avetoa - ulwrf_avetoa

The variable netef_ave is calculated from:

   dswrf_ave : averaged surface downward shortwave flux
   dlwrf_ave : surface downward longwave flux
   ulwrf_ave : surface upward longwave flux
   uswrf_ave : averaged surface upward shortwave flux
   shtfl_ave : surface sensible heat flux
   lhtfl_ave : surface latent heat flux

   These variables are found in the bfg control files.
   netef_ave = dswrf_ave + dlwrf_ave - ulwrf_ave - uswrf_ave - shtfl_ave - lhtfl_ave
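
As a concrete sketch of these two derivations (assuming the component fluxes have been read from the bfg files into an xarray Dataset; this helper is illustrative and not part of the package):

import xarray as xr

def compute_net_fluxes(ds: xr.Dataset):
    """Derive netrf_avetoa and netef_ave (W/m**2) from bfg flux fields."""
    # top of atmosphere net radiative flux (SW and LW)
    netrf_avetoa = ds['dswrf_avetoa'] - ds['uswrf_avetoa'] - ds['ulwrf_avetoa']
    # surface energy balance
    netef_ave = (ds['dswrf_ave'] + ds['dlwrf_ave']
                 - ds['ulwrf_ave'] - ds['uswrf_ave']
                 - ds['shtfl_ave'] - ds['lhtfl_ave'])
    return netrf_avetoa, netef_ave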

Returned Results

Value: The value entry of the harvested tuple contains the calculated value of the statistic requested by the user.

Units: The units entry of the harvested tuple contains the units associated with the requested variable from the BFG netCDF file. If no units are given in the file, a value of None is returned.

Mediantime: The mediantime entry of the harvested tuple is calculated from the endpoints of the variable time stamps in the BFG netCDF file.

Longname: The longname entry of the harvested tuple is taken from the variable's long name in the BFG netCDF file.

Region: This entry of the harvested tuple is a nested dictionary. Each key of the region dictionary is a region name given by the user (a required key word); each value is a dictionary with optional 'latitude_range': (min_latitude, max_latitude) and 'longitude_range': (east_lon, west_lon) entries. The following nested dictionaries for region are accepted:

          'user name of region': {'latitude_range': (min_lat, max_lat)}
          The user has not specified a longitude range, so the default is applied.
          The default longitude_range is (360, 0).
          NOTE: The longitude values on the bfg files are grid_xt: 0 to 359.7656 by 0.234375 degrees_E, circular.

          'user name of region': {'longitude_range': (min_lon, max_lon)}
          The user has not specified a latitude_range, so the default is applied.
          The default latitude_range is (-90, 90).
          NOTE: The latitude values on the bfg files are grid_yt: 89.82071 to -89.82071 degrees_N.

          examples: 'region': {
                               'conus': {'latitude_range': (24.0, 49.0), 'longitude_range': (294.0, 235.0)},
                               'western_hemis': {'longitude_range': (200, 360)},
                               'southern_hemis': {'latitude_range': (20, -90)}
                              }

The daily_bfg.py file returns the following for each variable and statistic requested.

HarvestedData(self.config.harvest_filenames,
              statistic,
              variable,
              np.float32(value),
              units,
              dt.fromisoformat(median_cftime.isoformat()),
              longname,
              user_regions)

Example Input Dictionary

Example input dictionary for calling the daily_bfg harvester:

VALID_CONFIG_DICT = {'harvester_name': 'daily_bfg',
                     'filenames': [
                         '/filepath/tmp2m_bfg_2023032100_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032106_fhr06_control.nc',
                         '/filepath/tmp2m_bfg_2023032106_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032112_fhr06_control.nc',
                         '/filepath/tmp2m_bfg_2023032112_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032118_fhr06_control.nc',
                         '/filepath/tmp2m_bfg_2023032118_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032200_fhr06_control.nc'],
                     'statistic': ['mean', 'variance', 'minimum', 'maximum'],
                     'variable': ['tmp2m']}
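
Building on the region format described under Returned Results, a hedged sketch of a daily_bfg request restricted to a user-defined region might look like this (the 'conus' name and its bounds are taken from the region examples above):

REGIONAL_CONFIG_DICT = {'harvester_name': 'daily_bfg',
                        'filenames': ['/filepath/tmp2m_bfg_2023032100_fhr09_control.nc'],
                        'statistic': ['mean'],
                        'variable': ['tmp2m'],
                        'region': {'conus': {'latitude_range': (24.0, 49.0),
                                             'longitude_range': (294.0, 235.0)}}}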

obs_info_log

observation information for pressure, specific humidity, temperature, height, wind components, precipitable H2O, and relative humidity

Expected file format: text file

File format generated from NCEPlibs cmpbqm command output

Required dictionary inputs: 'variable'

Available variables

Valid 'variable' options: 'Temperature', 'Pressure', 'Specific Humidity', 'Relative Humidity', 'Height', 'Wind Components', 'Precipitable H2O'

Example input dictionary

VALID_CONFIG_DICT_TEMP = {
    'harvester_name': 'obs_info_log',
    'filename': '/filepath/log_cmpbqm.txt',
    'variable': 'TEMPERATURE'
}


score-hv's Issues

utilities file

need a utilities file for common functions (e.g., get_gridcell_area_data_path())

Shift test data hosting to AWS

The test data files in /score-hv/tests/data should instead be hosted on the AWS server along with the files listed in this script: get_unit_test_data.sh. The test data and the cached versions of it will need to be removed once that is set up. This should make cloning the repo smaller and faster.

manage stack on monitoring cluster

need a better method to maintain and update packages on the monitoring cluster

current warning output from score-hv unit tests:

=============================================== warnings summary ================================================
tests/test_harvester_daily_bfg_prateb.py::test_variable_names
  /contrib/home/builder/UFS-RNR-stack/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.26.1)
    warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

README needs work

Need to add setup and usage information (perhaps from a sample unit test) to the README for better usability. It would also be good to document the harvester registry and its template for adding new harvesters.

replay analysis increments

add feature to harvest vertically resolved analysis increments for replay data (from the fv3_increment6.nc files)

Dependency handling

Right now, the following dependencies are required to build and test this repo from a clean conda environment:

  • pip
  • xarray
  • dask
  • pytest

The following are installed via the setup.cfg script:

  • netCDF4
  • numpy
  • pandas
  • pyyaml
  • scipy

To make things easier for users, setup.cfg should be updated to include the first section of packages, or a new environment.yml file should be created to contain all of the necessary packages.

consistency in string formatting

update the use of the old style of string formatting, e.g.,

msg = ("'%s' is not a supported "
             "variable to harvest from the background forecast data. "
             "Please reconfigure the input dictionary using only the "
             "following variables: %r" % (var, VALID_VARIABLES))

to use "f strings", e.g.,

msg = (f"{var} is not a supported "
             "variable to harvest from the replay analysis " 
             "increments (fv3_increment6.nc). "
             "Please reconfigure the input dictionary using only the "
             f"following variables: {VALID_VARIABLES}")

update setup.cfg

several fields in the setup.cfg file are outdated, providing bad information for new users

Pip install

When using pip to build and install the package, you have to use editable mode, which installs the package in /home//score-hv/src. If a standard pip install is used, the package is installed in the conda site-packages directory (..../envs//lib/python3.12/site-packages). This file path difference means that the get_gridcell_area_data_path() function in daily_bfg.py retrieves the wrong data path.

Potential solutions:

  1. Update get_gridcell_area_data_path() to pull in the correct filepath for either install method (sketched below).
  2. Recommend users install with pip in editable mode until the package is more stable. Before the package is released, update get_gridcell_area_data_path() to work with standard mode only.
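
One possible direction for option 1, sketched with importlib.resources (the data filename below is a hypothetical placeholder, and this is not the repository's current implementation):

from importlib import resources

def get_gridcell_area_data_path():
    # Resolve the weights file relative to the installed score_hv package so
    # that editable and standard pip installs agree on the location.
    # 'gridcell_area_weights.nc' is a hypothetical placeholder filename.
    return str(resources.files('score_hv') / 'data' / 'gridcell_area_weights.nc')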

Create original harvester for innovation statistics.

The harvester app will take the following form.

  1. The harvester will be kicked off by either a dict or a yaml file (just like eva).
  2. The config file needs to at least contain the name of the registered harvester that the user desires.
  3. Each new harvester will have 2 basic sections: (a) a config handler, and (b) a file reader and parser (see the sketch after this list).
  4. The config will tell the parser exactly where to find the file to be harvested and what data it should parse from that file.
     The config will also tell the parser portion of the harvester exactly what format to output the results in (i.e., a list of tuples or a pandas dataframe).
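
A minimal sketch of that two-part structure (all class and method names here are hypothetical illustrations, not the repository's actual interfaces):

from dataclasses import dataclass, field

@dataclass
class InnovationStatsConfig:
    """(a) Config handler: holds and validates the user's dict/yaml request."""
    harvester_name: str
    filename: str
    variables: list = field(default_factory=list)

class InnovationStatsHarvester:
    """(b) File reader and parser, driven by the validated config."""
    def __init__(self, config: InnovationStatsConfig):
        self.config = config

    def get_data(self):
        """Open config.filename, parse the requested variables, and return
        the results in the configured output format (e.g., a list of
        named tuples or a pandas dataframe)."""
        results = []
        # ... read and parse self.config.filename here ...
        return results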

coding exercise: parsing of the prepbufr log files

Context

One of the things we need to do is to inventory our data collections. One type of tool that we develop is the harvester, which parses files into a set of tuples that can be passed upstream for insertion into a database. Attached to this issue is an example of a log file that inventories the number of observations in one of the files. The numbers are aggregated by

  • variable type (e.g., each of the 7 sections in the log file with headers like {'PRESSURE', 'SPECIFIC HUMIDITY', 'TEMPERATURE', 'HEIGHT', 'WIND COMPONENTS', 'PRECIPITABLE H2O', 'RELATIVE HUMIDITY'}). Notice that there are 7 sections but only 6 tables, since 'RELATIVE HUMIDITY' has no data entries.
  • observation platform: integer index in the first column of each table (heading 'typ')
  • number of observations in columns 2-11. Numbers of observations are binned into the total number and the number of observations that pass certain quality control procedures. We are only interested in the total number (second column, heading 'tot') and high quality obs (third column, heading '0-3').

Problem description

Add a new harvester in this directory that will parse the attached log file into a list of tuples. The harvester should be controlled by a yaml file with the following information:

  • Location of the log file on disk.
  • Which of the variables should be parsed: {'PRESSURE', 'SPECIFIC HUMIDITY', 'TEMPERATURE', 'HEIGHT', 'WIND COMPONENTS', 'PRECIPITABLE H2O', 'RELATIVE HUMIDITY'}

The harvester needs to inherit from the harvester base class. An example of a harvester implementation for a different problem is provided here.

The named tuples for the output should have the following fields:

HarvestedData = namedtuple(
    'HarvestedData',
    [
        'file_name',          # name of the log file
        'cycletime',          # parsed from 'DATA  VALID AT' on the first line of the log file
        'variable',           # name of the variable passed in the yaml file
        'instrument_type',    # first column of the table in the log file
        'number_obs',         # second column of the table in the log file
        'number_obs_qc_0to3', # third column of the table in the log file
    ],
)

Delivery of results

The preferred way for you to deliver the results is to (1) clone the repo to your computer and (2) send us back a tarball of your local checkout. While not elegant, this will ensure that your contribution remains private. You are welcome to ask questions through email.

Simplifications to the problem

You might notice that the full repo has :

Ideally, you will deliver your results as another pytest in the test directory. However, we appreciate that we are asking you to do this work on your own time, so you might choose to deliver the results in some other way that demonstrates how you think about this parsing problem and how you structure your work in a sequence of commits.

test_log_cmpbqm.txt
