score-hv

Summary

Python package used to harvest metrics from reanalysis data. This repository is a standalone Python package that can be used as part of a larger workflow (e.g., with the score-db and score-monitoring repositories).

Setup and installation

The repository can be downloaded using git:

git clone https://github.com/NOAA-PSL/score-hv.git

For testing and development, we recommend creating a new Python environment (e.g., using mamba as shown below, or other options such as conda). To install the required dependencies into a new environment using the micromamba command-line interface, run the following after installing mamba/micromamba:

micromamba create -f environment.yml; micromamba activate score-hv-default-env

Depending on your use case, you can install score-hv with pip using one of three methods:

pip install . # default installation into active environment

pip install -e . # editable installation into active environment, useful for development

pip install -t [TARGET_DIR] --upgrade . # target installation into TARGET_DIR, useful for deploying for cylc workflows (see https://cylc.github.io/cylc-doc/stable/html/tutorial/runtime/introduction.html#id3)

Verify the installation by running the unit test suite. There are no expected test failures.

pytest tests

Harvesting metric data with score-hv

score-hv takes in either a YAML file or a Python dictionary via harvester_base.py, which specifies the harvester to call, the input data files, and other inputs to the harvester (such as which variables and statistics to harvest). Example input dictionaries for each harvester are provided in the Available Harvesters section below. Calls can be made directly from the command line or by importing the score-hv module and calling harvester_base.harvest([harvester_config filename / config dictionary]).

For example, the following dictionary could be used to request the global, gridcell-area-weighted mean, variance, minimum, and maximum of the temporally weighted (in this case daily) tmp2m data from the given netCDF files.

Example input dictionary:

VALID_CONFIG_DICT = {'harvester_name': 'daily_bfg',
                     'filenames': [
                         '/filepath/tmp2m_bfg_2023032100_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032106_fhr06_control.nc',
                         '/filepath/tmp2m_bfg_2023032106_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032112_fhr06_control.nc',
                         '/filepath/tmp2m_bfg_2023032112_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032118_fhr06_control.nc',
                         '/filepath/tmp2m_bfg_2023032118_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032200_fhr06_control.nc'],
                     'statistic': ['mean', 'variance', 'minimum', 'maximum'],
                     'variable': ['tmp2m']}
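
With the dictionary above, a minimal usage sketch might look like the following (assuming the package is importable as score_hv and the file paths point at real data):

from score_hv import harvester_base

# harvest from a config dictionary...
harvested_data = harvester_base.harvest(VALID_CONFIG_DICT)

# ...or, equivalently, from a YAML file with the same keys
# harvested_data = harvester_base.harvest('harvester_config.yaml')

for item in harvested_data:
    print(item.variable, item.value, item.units)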

A request dictionary must provide the harvester_name and filenames. Supported harvester_name(s) are provided below, and each harvester may have additional input options or requirements.

Available Harvesters

inc_logs

increment descriptive statistics from log files

Expected file format: log output file

Returns the following named tuple:

HarvestedData = namedtuple('HarvestedData', ['logfile',
                                             'cycletime',
                                             'statistic',
                                             'variable',
                                             'value', 
                                             'units'])

Available statistics

VALID_STATISTICS = ('mean', 'RMS')

Available variables

VALID_VARIABLES = ['pt_inc', 's_inc', 'u_inc', 'v_inc', 'SSH', 'Salinity',
                   'Temperature', 'Speed of Currents', 'o3mr_inc', 'sphum_inc',
                   'T_inc', 'delp_inc', 'delz_inc']

Example dictionary input

from datetime import datetime

VALID_CONFIG_DICT = {'harvester_name': 'inc_logs',
                     'filename': '/filepath/calc_atm_inc.out',
                     'cycletime': datetime(2019, 3, 21, 0),
                     'statistic': ['mean', 'RMS'],
                     'variable': ['o3mr_inc', 'sphum_inc', 'T_inc', 'u_inc', 'v_inc',
                                  'delp_inc', 'delz_inc']}

daily_bfg

The daily_bfg harvester pulls data from the background forecast data files. It calculates daily statistics from the provided files.

A netCDF file containing gridcell area weights is required; this file is included in the data folder of the GitHub repository.

The daily_bfg harvester returns the following named tuple:

HarvestedData = namedtuple('HarvestedData', ['filenames',
                                             'statistics',
                                             'variable',
                                             'value',
                                             'units',
                                             'mediantime',
                                             'longname',
                                             'surface_mask', 
                                             'region'])

The example filenames are listed in the dictionary example below.

Available Harvester statistics:

A list of statistics. Valid statistics are 'mean', 'variance', 'minimum', and 'maximum'.

Available Harvester variables:

VALID_VARIABLES = (
                   'lhtfl_ave',    # surface latent heat flux (W/m**2)
                   'shtfl_ave',    # surface sensible heat flux (W/m**2)
                   'dlwrf_ave',    # surface downward longwave flux (W/m**2)
                   'dswrf_ave',    # averaged surface downward shortwave flux (W/m**2)
                   'ulwrf_ave',    # surface upward longwave flux (W/m**2)
                   'uswrf_ave',    # averaged surface upward shortwave flux (W/m**2)
                   'netrf_avetoa', # top of atmosphere net radiative flux (SW and LW) (W/m**2)
                   'netef_ave',    # surface energy balance (W/m**2)
                   'prateb_ave',   # surface precip rate (mm weq. s^-1)
                   'soil4',        # liquid soil moisture at layer-4 (?)
                   'soilm',        # total column soil moisture content (mm weq.)
                   'soilt4',       # soil temperature unknown layer 4 (K)
                   'tg3',          # deep soil temperature (K)
                   'tmp2m',        # 2m (surface air) temperature (K)
                   'ulwrf_avetoa', # top of atmosphere upward longwave flux (W m^-2)
                   )

The variable netrf_avetoa is calculated from:

   dswrf_avetoa: averaged top of atmosphere downward shortwave flux
   uswrf_avetoa: averaged top of atmosphere upward shortwave flux
   ulwrf_avetoa: top of atmosphere upward longwave flux

   These variables are found in the bfg control files.
   netrf_avetoa = dswrf_avetoa - uswrf_avetoa - ulwrf_avetoa

The variable netef_ave is calculated from:

   dswrf_ave : averaged surface downward shortwave flux
   dlwrf_ave : surface downward longwave flux
   ulwrf_ave : surface upward longwave flux
   uswrf_ave : averaged surface upward shortwave flux
   shtfl_ave : surface sensible heat flux
   lhtfl_ave : surface latent heat flux

   These variables are found in the bfg control files.
   netef_ave = dswrf_ave + dlwrf_ave - ulwrf_ave - uswrf_ave - shtfl_ave - lhtfl_ave
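
As a concrete sketch of these two derivations (assuming the component fluxes have been read from the bfg files into an xarray Dataset; this helper is illustrative and not part of the package):

import xarray as xr

def compute_net_fluxes(ds: xr.Dataset):
    """Derive netrf_avetoa and netef_ave (W/m**2) from bfg flux fields."""
    # top of atmosphere net radiative flux (SW and LW)
    netrf_avetoa = ds['dswrf_avetoa'] - ds['uswrf_avetoa'] - ds['ulwrf_avetoa']
    # surface energy balance
    netef_ave = (ds['dswrf_ave'] + ds['dlwrf_ave']
                 - ds['ulwrf_ave'] - ds['uswrf_ave']
                 - ds['shtfl_ave'] - ds['lhtfl_ave'])
    return netrf_avetoa, netef_ave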

Returned Results

Value: The value entry of the harvested tuple contains the calculated value of the statistic requested by the user.

Units: The units entry of the harvested tuple contains the units associated with the requested variable from the BFG netCDF file. If no units are given in the file, a value of None is returned.

Mediantime: The mediantime entry of the harvested tuple is calculated from the endpoints of the variable time stamps in the BFG netCDF file.

Longname: The longname entry of the harvested tuple is taken from the variable's long name in the BFG netCDF file.

Region: This entry of the harvested tuple is a nested dictionary. Each key of the region dictionary is a region name given by the user (a required key word); each value is a dictionary with optional 'latitude_range': (min_latitude, max_latitude) and 'longitude_range': (east_lon, west_lon) entries. The following nested dictionaries for region are accepted:

          'user name of region': {'latitude_range': (min_lat, max_lat)}
          The user has not specified a longitude range, so the default is applied.
          The default longitude_range is (360, 0).
          NOTE: The longitude values on the bfg files are grid_xt: 0 to 359.7656 by 0.234375 degrees_E, circular.

          'user name of region': {'longitude_range': (min_lon, max_lon)}
          The user has not specified a latitude_range, so the default is applied.
          The default latitude_range is (-90, 90).
          NOTE: The latitude values on the bfg files are grid_yt: 89.82071 to -89.82071 degrees_N.

          examples: 'region': {
                               'conus': {'latitude_range': (24.0, 49.0), 'longitude_range': (294.0, 235.0)},
                               'western_hemis': {'longitude_range': (200, 360)},
                               'southern_hemis': {'latitude_range': (20, -90)}
                              }

The daily_bfg.py file returns the following for each variable and statistic requested.

HarvestedData(self.config.harvest_filenames,
              statistic,
              variable,
              np.float32(value),
              units,
              dt.fromisoformat(median_cftime.isoformat()),
              longname,
              user_regions)

Example Input Dictionary

Example input dictionary for calling the daily_bfg harvester:

VALID_CONFIG_DICT = {'harvester_name': 'daily_bfg',
                     'filenames': [
                         '/filepath/tmp2m_bfg_2023032100_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032106_fhr06_control.nc',
                         '/filepath/tmp2m_bfg_2023032106_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032112_fhr06_control.nc',
                         '/filepath/tmp2m_bfg_2023032112_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032118_fhr06_control.nc',
                         '/filepath/tmp2m_bfg_2023032118_fhr09_control.nc',
                         '/filepath/tmp2m_bfg_2023032200_fhr06_control.nc'],
                     'statistic': ['mean', 'variance', 'minimum', 'maximum'],
                     'variable': ['tmp2m']}
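
Building on the region format described under Returned Results, a hedged sketch of a daily_bfg request restricted to a user-defined region might look like this (the 'conus' name and its bounds are taken from the region examples above):

REGIONAL_CONFIG_DICT = {'harvester_name': 'daily_bfg',
                        'filenames': ['/filepath/tmp2m_bfg_2023032100_fhr09_control.nc'],
                        'statistic': ['mean'],
                        'variable': ['tmp2m'],
                        'region': {'conus': {'latitude_range': (24.0, 49.0),
                                             'longitude_range': (294.0, 235.0)}}}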

obs_info_log

observation information for pressure, specific humidity, temperature, height, wind components, precipitable H2O, and relative humidity

Expected file format: text file

File format generated from NCEPlibs cmpbqm command output

Required dictionary inputs: 'variable'

Available variables

Valid 'variable' options: 'Temperature', 'Pressure', 'Specific Humidity', 'Relative Humidity', 'Height', 'Wind Components', 'Precipitable H2O'

Example input dictionary

VALID_CONFIG_DICT_TEMP = {
    'harvester_name': 'obs_info_log',
    'filename': '/filepath/log_cmpbqm.txt',
    'variable': 'TEMPERATURE'
}


score-hv's Issues

utilities file

need a utilities file for common functions (e.g., get_gridcell_area_data_path())

Shift test data hosting to AWS

The test data files in /score-hv/tests/data should instead be hosted on the AWS server along with the files listed in this script: get_unit_test_data.sh. The test data and the cached versions of it will need to be removed once that is set up. This should make cloning the repo smaller and faster.

manage stack on monitoring cluster

need a better method to maintain and update packages on the monitoring cluster

current warning output from score-hv unit tests:

=============================================== warnings summary ================================================
tests/test_harvester_daily_bfg_prateb.py::test_variable_names
  /contrib/home/builder/UFS-RNR-stack/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.26.1)
    warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

README needs work

Need to add setup and usage information (perhaps from a sample unit test) to the README for better usability. It would also be good to document the harvester registry and its template for adding new harvesters.

replay analysis increments

add feature to harvest vertically resolved analysis increments for replay data (from the fv3_increment6.nc files)

Dependency handling

Right now, the following dependencies are required to build and test this repo from a clean conda environment:

  • pip
  • xarray
  • dask
  • pytest

The following are installed via the setup.cfg script:

  • netCDF4
  • numpy
  • pandas
  • pyyaml
  • scipy

To make things easier for users, setup.cfg should be updated to include the first section of packages, or a new environment.yml file should be created to contain all of the necessary packages.

consistency in string formatting

update the use of the old style of string formatting, e.g.,

msg = ("'%s' is not a supported "
             "variable to harvest from the background forecast data. "
             "Please reconfigure the input dictionary using only the "
             "following variables: %r" % (var, VALID_VARIABLES))

to use "f strings", e.g.,

msg = (f"{var} is not a supported "
             "variable to harvest from the replay analysis " 
             "increments (fv3_increment6.nc). "
             "Please reconfigure the input dictionary using only the "
             f"following variables: {VALID_VARIABLES}")

update setup.cfg

several fields in the setup.cfg file are outdated, providing bad information for new users

Pip install

When using pip to build and install the package, you have to use editable mode, which installs the package in /home//score-hv/src. If a standard pip install is used, the package is installed in the conda site-packages directory (..../envs//lib/python3.12/site-packages). This file path difference means that the get_gridcell_area_data_path() function in daily_bfg.py retrieves the wrong data path.

Potential solutions:

  1. Update get_gridcell_area_data_path() to pull in the correct filepath for either install method (sketched below).
  2. Recommend users install with pip in editable mode until the package is more stable. Before the package is released, update get_gridcell_area_data_path() to work with standard mode only.
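
One possible direction for option 1, sketched with importlib.resources (the data filename below is a hypothetical placeholder, and this is not the repository's current implementation):

from importlib import resources

def get_gridcell_area_data_path():
    # Resolve the weights file relative to the installed score_hv package so
    # that editable and standard pip installs agree on the location.
    # 'gridcell_area_weights.nc' is a hypothetical placeholder filename.
    return str(resources.files('score_hv') / 'data' / 'gridcell_area_weights.nc')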

Create original harvester for innovation statistics.

The harvester app will take the following form.

  1. The harvester will be kicked off by either a dict or a yaml file (just like eva).
  2. The config file needs to at least contain the name of the registered harvester that the user desires.
  3. Each new harvester will have 2 basic sections: (a) a config handler, and (b) a file reader and parser (see the sketch after this list).
  4. The config will tell the parser exactly where to find the file to be harvested and what data it should parse from that file.
     The config will also tell the parser portion of the harvester exactly what format to output the results in (i.e., a list of tuples or a pandas dataframe).
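
A minimal sketch of that two-part structure (all class and method names here are hypothetical illustrations, not the repository's actual interfaces):

from dataclasses import dataclass, field

@dataclass
class InnovationStatsConfig:
    """(a) Config handler: holds and validates the user's dict/yaml request."""
    harvester_name: str
    filename: str
    variables: list = field(default_factory=list)

class InnovationStatsHarvester:
    """(b) File reader and parser, driven by the validated config."""
    def __init__(self, config: InnovationStatsConfig):
        self.config = config

    def get_data(self):
        """Open config.filename, parse the requested variables, and return
        the results in the configured output format (e.g., a list of
        named tuples or a pandas dataframe)."""
        results = []
        # ... read and parse self.config.filename here ...
        return results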

coding exercise: parsing of the prepbufr log files

Context

One of the things we need to do is to inventory our data collections. One type of tool that we develop is the harvester, which parses files into a set of tuples that can be passed upstream for insertion into a database. Attached to this issue is an example of a log file that inventories the number of observations in one of the files. The numbers are aggregated by

  • variable type (e.g., each of the 7 sections in the log file with headers like {'PRESSURE', 'SPECIFIC HUMIDITY', 'TEMPERATURE', 'HEIGHT', 'WIND COMPONENTS', 'PRECIPITABLE H2O', 'RELATIVE HUMIDITY'}). Notice that there are 7 sections but only 6 tables, since 'RELATIVE HUMIDITY' has no data entries.
  • observation platform: integer index in the first column of each table (heading 'typ')
  • number of observations in columns 2-11. Numbers of observations are binned into the total number and the number of observations that pass certain quality control procedures. We are only interested in the total number (second column, heading 'tot') and high quality obs (third column, heading '0-3').

Problem description

Add a new harvester in this directory that will parse the attached log file into a list of tuples. The harvester should be controlled by a yaml file with the following information:

  • Location of the log file on disk.
  • Which of the variables should be parsed: {'PRESSURE', 'SPECIFIC HUMIDITY', 'TEMPERATURE', 'HEIGHT', 'WIND COMPONENTS', 'PRECIPITABLE H2O', 'RELATIVE HUMIDITY'}

The harvester needs to inherit from the harvester base class. An example of a harvester implementation for a different problem is provided here.

The named tuples for the output should have the following fields:

HarvestedData = namedtuple(
    'HarvestedData',
    [
        'file_name',          # name of the log file
        'cycletime',          # parsed from 'DATA  VALID AT' on the first line of the log file
        'variable',           # name of the variable passed in the yaml file
        'instrument_type',    # first column of the table in the log file
        'number_obs',         # second column of the table in the log file
        'number_obs_qc_0to3', # third column of the table in the log file
    ],
)

Delivery of results

The preferred way for you to deliver the results is to (1) clone the repo to your computer and (2) send us back a tarball of your local checkout. While not elegant, this will ensure that your contribution remains private. You are welcome to ask questions through email.

Simplifications to the problem

You might notice that the full repo has :

Ideally, you will deliver your results as another pytest in the test directory. However, we appreciate that we are asking you to do this work on your own time, so you might choose to deliver the results in some other way that demonstrates how you think about this parsing problem and how you structure your work in a sequence of commits.

test_log_cmpbqm.txt
