
pywatershed

About

Welcome to the pywatershed repository!

Pywatershed is a Python package for simulating hydrologic processes, motivated by the need to modernize important legacy hydrologic models at the USGS, particularly the Precipitation-Runoff Modeling System (PRMS; Markstrom et al., 2015) and its role in GSFLOW (Markstrom et al., 2008). The goal of modernization is to make these legacy models more flexible as process representations, to support testing of alternative hydrologic process conceptualizations, and to facilitate the incorporation of cutting-edge modeling techniques and data sources. Pywatershed is a place for experimentation with software design, process representation, and data fusion in the context of well-established hydrologic process modeling.

For more information on the goals and status of pywatershed, please see the pywatershed docs.

Installation

pywatershed uses Python 3.9 or 3.10.

The pywatershed package is available on PyPI, but installation of all dependency sets (lint, test, optional, doc, and all) may not be reliable on all platforms.

The pywatershed package is also available on conda-forge. This installation is the quickest way to get up and running but provides only the minimal set of dependencies (it does not include Jupyter or all packages needed to run the example notebooks, and it is not suitable for development purposes).

We recommend the following installation procedures to get fully functional environments for running pywatershed and its example notebooks. We strongly recommend using Mamba to first install dependencies from the environment_w_jupyter.yml file in the repository before installing pywatershed itself. Mamba will be much faster than Anaconda (though the conda command could also be used).

If you wish to use the stable release, you will use main in place of <branch> in the following commands. If you want to follow development, you'll use develop instead.

Without using git (directly), you may:

curl -L -O https://raw.githubusercontent.com/EC-USGS/pywatershed/<branch>/environment_w_jupyter.yml
mamba env create -f environment_w_jupyter.yml
conda activate pws
pip install git+https://github.com/EC-USGS/pywatershed.git@<branch>

Or to use git and to be able to develop:

git clone https://github.com/EC-USGS/pywatershed.git
cd pywatershed
mamba env create -f environment_w_jupyter.yml
conda activate pws
pip install -e .

(If you want to name the environment something other than the default pws, use the command mamba env update --name your_env_name --file environment_w_jupyter.yml --prune; you will also need to activate this environment by name.)

We install from environment_w_jupyter.yml to provide all known dependencies, including those for running the example notebooks. (The environment.yml does not contain Jupyter or JupyterLab because these interfere with installation on WholeTale; see the Getting Started section below.)

Getting started / Example notebooks

Please note that you can browse the API reference, developer info, and index in the pywatershed docs. But the best way to get started with pywatershed is to dive into the example notebooks.

For introductory example notebooks, look in the examples/ directory in the repository. Numbered starting at 00, these are meant to be completed in order. Notebook outputs are not saved in GitHub, but you can run these notebooks locally or using WholeTale (an NSF-funded project supporting logins from many institutions; free, but sign-up or log-in required), where the pywatershed environment is all ready to go:

WholeTale

WholeTale will give you a JupyterLab running in the root of this repository. You can navigate to examples/ and then open and run the notebooks of your choice. The develop container may require the user to update the repository (git pull origin) to stay current with development.

Non-numbered notebooks in examples/ cover additional topics. These notebooks are not yet covered by testing and you may encounter some issues. In examples/developer/ there are notebooks of interest to developers who may want to learn about running the software tests.

Community engagement

We value your feedback! Please use discussions or issues on GitHub. For more in-depth contributions, please start by reading over the pywatershed DEVELOPER.md and CONTRIBUTING.md guidelines.

Thank you for your interest.

Disclaimer

This information is preliminary or provisional and is subject to revision. It is being provided to meet the need for timely best science. The information has not received final approval by the U.S. Geological Survey (USGS) and is provided on the condition that neither the USGS nor the U.S. Government shall be held liable for any damages resulting from the authorized or unauthorized use of the information.

From: https://www2.usgs.gov/fsp/fsp_disclaimers.asp#5

This software is in the public domain because it contains materials that originally came from the U.S. Geological Survey, an agency of the United States Department of Interior. For more information, see the official USGS copyright policy

Although this software program has been used by the USGS, no warranty, expressed or implied, is made by the USGS or the U.S. Government as to the accuracy and functioning of the program and related program material nor shall the fact of distribution constitute any such warranty, and no responsibility is assumed by the USGS in connection therewith. This software is provided "AS IS."

pywatershed's Issues

Limit variables written to NetCDF output.

Since there is an advantage to using PRMS control and parameter files for existing PRMS users, the output variables specified in the control file should be used to define the variables written to NetCDF, unless overridden using args in .initialize_netcdf(*args, output_vars=["rain",]).
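
Illustrative only: a minimal sketch of the proposal, reusing the loading and model calls that appear elsewhere on this page; the output_vars argument is the suggestion above, not an existing API.

import pywatershed as pws

params = pws.parameters.PrmsParameters.load_from_json("parameters.json")
control = pws.Control.load("control.test")
model = pws.Model([pws.PRMSGroundwater], control=control, parameters=params)

# Default: write only the output variables named in the PRMS control file.
# The proposed output_vars argument overrides that selection.
model.initialize_netcdf(output_dir="output", output_vars=["rain"])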

Run NHM processes with separate parameters

As part of introducing new processes from non-PRMS models, we need to first make PRMS/NHM processes work as we'd want the others to work. The following steps will (more or less) take us there

  • Run processes separately, with their separated parameters
  • Processes come with their data when passed to Model(): run the NHM passing parameters for each process (hypothetical sketch below)
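
A hypothetical sketch of the second item; the per-process parameter mapping is not an existing API, and the file names are illustrative.

import pywatershed as pws

control = pws.Control.load("control.test")
canopy_params = pws.parameters.PrmsParameters.load_from_json("canopy.json")
snow_params = pws.parameters.PrmsParameters.load_from_json("snow.json")

# each process receives its own separated parameter object (hypothetical)
model = pws.Model(
    [pws.PRMSCanopy, pws.PRMSSnow],
    control=control,
    parameters={
        pws.PRMSCanopy: canopy_params,
        pws.PRMSSnow: snow_params,
    },
)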

Reorganize environment.yml files

Since pyproject.toml has been set up to install required, lint, test, and optional dependencies, a single development environment.yml should be created.

  • The revised environment.yml should have all of the dependencies in pyproject.toml
  • The revised environment.yml should be in the root repo directory.
  • All of the other environment.yml files should be eliminated.
  • README.md should be updated to reflect changes.
  • Should evaluate if *.txt equivalents of environment.yml are needed. Delete if not needed.

Prevent parameters from being edited

Currently reused parameter instances cause errors in the channel network.

My ethos is that parameters should not be edited in the dark, hidden corners of the code. They should be edited as preprocessing so that they can be provenanced with the results.

My preferred approach to this problem is currently to freeze/immutabilize the parameter data (minimal sketch below). Two approaches:
https://adamj.eu/tech/2022/01/05/how-to-make-immutable-dict-in-python/
https://pypi.org/project/frozendict/
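
A minimal sketch of the freeze/immutabilize idea, assuming parameter data are numpy arrays held in a dict (names are illustrative, not the pywatershed internals):

from types import MappingProxyType

import numpy as np

def freeze_parameters(param_dict):
    # read-only arrays: in-place edits now raise ValueError
    for arr in param_dict.values():
        arr.flags.writeable = False
    # read-only mapping: item assignment now raises TypeError
    return MappingProxyType(param_dict)

params = freeze_parameters({"soil_rechr_max": np.array([1.0, 2.0])})
# params["soil_rechr_max"] = np.zeros(2)  -> TypeError
# params["soil_rechr_max"][0] = 9.9       -> ValueError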

Control metadata

@jmccreight After meeting last week I wanted to document what we discussed and some of my thoughts. Hopefully we can use this to discuss and document possible changes to the control metadata.

Control

Legacy xml tags and attributes

Metadata was initially created from the published control variable tables and scraping the PRMS5 code.

Example of xml structure

<control>
    <control_param name="basinOutON_OFF" version="5.0" deprecated="6.0">
        <default>0</default>
        <force_default>0</force_default>
        <type>1</type>
        <numvals>1</numvals>
        <desc>Switch to specify whether or not basin summary output files are generated</desc>
        <values type="flag">
            <value name="0">off</value>
            <value name="1">on</value>
        </values>
        <related_variables>
            <variable name="basinOutVars"/>
            <variable name="basinOut_freq"/>
        </related_variables>
    </control_param>
	  ...
</control>

Each control variable has:

| Key | Description | XML Element | XML Attribute | Allowed values |
|-----|-------------|-------------|---------------|----------------|
| name | Name of the control variable | control_param | name | |
| version | Version of PRMS where variable was introduced | control_param | version | |
| deprecated | Optional version of PRMS when variable no longer supported | control_param | deprecated | |
| default | Default value to use when one has not been specified | default | | |
| force_default | Optional boolean indicating the default value should always be used | force_default | | 0, 1 |
| type | Data type of the control variable values | type | | 1, 2, 3, 4 |
| numvals | Number of values allowed for variable | numvals | | |
| desc | Description of variable | desc | | |
| values | Optional block for denoting accepted values of variable | values | | |
| values->type | Type of valid values | values | type | flag, module, interval, method, parameter |
| value | Description of valid value in values block | value | | |
| value->name | A valid value | value | name | |
| value->version | Version when a value option was added | value | version | |
| related_variables | A group of variables associated with the current variable | related_variables | | |
| variable->name | Name of control variable associated with the current variable | variable | name | |

type description

1: int32
2: float32
3: float64
4: string

values->type description

flag - Either boolean compatible values (1/0 = true/false, on/off, yes/no) or multiple values indicating behavior(s) to trigger
interval - values indicating the time-series interval to use (e.g. 0 = daily)
method - values indicating an action to perform or module to act on
module - values representing modules to use in simulation code
parameter - values denoting one or more parameters to affect

Challenges

  • no way to programmatically figure out if a control variable value is a scalar or list
    • could create context key which would have values of scalar, array, etc
      • remove numvals?
  • version information is currently incomplete in the XML file
  • how to handle start_time and end_time, which are really datetime values but are stored in the control file as an array of integers (see the sketch below)
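
A short sketch of that last challenge, assuming the usual PRMS ordering of the integers (year, month, day, hour, minute, second):

import datetime

# start_time as stored in a control file: an array of six integers
start_time_ints = [1979, 1, 1, 0, 0, 0]
start_time = datetime.datetime(*start_time_ints)  # datetime(1979, 1, 1, 0, 0)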

pynhm

Legacy XML structures are converted to a dictionary and written to YAML.

YAML example

basinOutON_OFF:
  default: 0
  desc: Switch to specify whether or not basin summary output files are generated
    (0=no; 1=yes)
  numvals: 1
  related_variables:
  - basinOutVars
  - basinOut_freq
  type: 1
  values:
    0: 'off'
    1: 'on'
  values_type: flag

Suggested changes

  • missing keys
    • version, deprecated, force_default
  • version usage could be changed to also indicate a range of versions. If we did this then the deprecated key would not be necessary.
    • e.g. version=">=5.0.0 & <5.2.0"
  • add context key with possible values scalar, array, etc
  • suggested name changes
    • type -> datatype
    • values -> valid_values
    • values_type -> valid_values_type
  • change datatype (aka type) from integer values to strings:
    • 1 -> int32
    • 2 -> float32
    • 3 -> float64
    • 4 -> string
    • datetime (new; no legacy integer code)
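
For concreteness, applying these suggested changes to the YAML example above might give something like the following (a sketch for discussion, not a settled schema; the version range follows the XML example's version="5.0" deprecated="6.0", and numvals is dropped per the "remove numvals?" question):

basinOutON_OFF:
  context: scalar
  datatype: int32
  default: 0
  desc: Switch to specify whether or not basin summary output files are generated
    (0=no; 1=yes)
  force_default: 0
  related_variables:
  - basinOutVars
  - basinOut_freq
  valid_values:
    0: 'off'
    1: 'on'
  valid_values_type: flag
  version: ">=5.0 & <6.0"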

Can't get gis files in PRMS legacy notebook example

Should these files be in the develop distribution? Can't find them.
ImportError: cannot import name 'gis_files' from 'helpers' (C:\Users\jdickins\AppData\Local\mambaforge\envs\pywatershed\lib\site-packages\helpers_init_.py)

implement GSFLOW/PRMS cascades

  • [x] Add sagehen unstructured HRUs #288
  • [ ] Add sagehen gridded HRUs
  • [ ] Adopt GSFLOW 2.3 as basis for regression tests. Use git submodules?
  • [ ] Use recent flow accumulation work in pygsflow for cascades

examples/01_multi-process_models.ipynb

Excited to see this work, ultimately to use with MF6 api. Trying to get oriented but I'm running into an issue that I do not comprehend. Any suggestions?

(screenshot of the error omitted)

Overhaul tests against PRMS output

For "unit"/process tests in pywatershed against PRMS output, we used a variety of different approaches.

  • test either/both "in memory" and "output files"
  • define common utils that replicate and enforce np.testing.assert_allclose(); replicate because it is desirable to have access to the failures in debug for diagnostics (sketch after this list)
  • Apply common testing approach to all PRMS processes: solar, atmosphere, canopy, snow, soil, runoff, groundwater, and channel.
  • Maximize the public variables tested for each pws Process tested
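
A minimal sketch of the common-utility idea, mirroring np.testing.assert_allclose() but keeping the failures available for debugging (names are illustrative, not the pywatershed test API):

import numpy as np

def allclose_report(actual, desired, rtol=1e-7, atol=0.0):
    # like np.testing.assert_allclose, but return failures instead of raising
    close = np.isclose(actual, desired, rtol=rtol, atol=atol)
    return bool(close.all()), ~close

pws_vals = np.array([1.0, 2.0, 3.0])
prms_vals = np.array([1.0, 2.0, 3.1])
ok, failed = allclose_report(pws_vals, prms_vals)
if not ok:
    # failing indices remain in scope for inspection in a debugger
    print("mismatch at indices:", np.nonzero(failed))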

examples/02_prms_legacy_models.ipynb

I've been running the example 02_prms_legacy_models.ipynb notebook from a pywatershed installation on my Windows 10 machine (not WholeTale). This isn't necessarily a showstopper for me, but I want to report this error:


AssertionError Traceback (most recent call last)
Cell In[16], line 7
1 submodel = pws.Model(
2 submodel_processes,
3 control=control,
4 parameters=params,
5 )
----> 7 pws.analysis.ModelGraph(
8 submodel,
9 hide_variables=not show_params,
10 show_params=show_params,
11 process_colors=palette,
12 ).SVG(verbose=True, dpi=48)

File D:\PYWATERSHED\pywatershed\pywatershed\analysis\model_graph.py:120, in ModelGraph.SVG(self, verbose, dpi)
118 if self.graph is None:
119 self.build_graph()
--> 120 self.graph.write_svg(tmp_file, prog=["dot", f"-Gdpi={dpi}"])
121 if verbose:
122 print(f"Displaying SVG written to temp file: {tmp_file}")

File ~\AppData\Local\mambaforge\envs\pws\lib\site-packages\pydot.py:1743, in Dot.__init__.<locals>.new_method(path, f, prog, encoding)
1739 def new_method(
1740 path, f=frmt, prog=self.prog,
1741 encoding=None):
1742 """Refer to docstring of method write."""
-> 1743 self.write(
1744 path, format=f, prog=prog,
1745 encoding=encoding)

File ~\AppData\Local\mambaforge\envs\pws\lib\site-packages\pydot.py:1828, in Dot.write(self, path, prog, format, encoding)
1826 f.write(s)
1827 else:
-> 1828 s = self.create(prog, format, encoding=encoding)
1829 with io.open(path, mode='wb') as f:
1830 f.write(s)

File ~\AppData\Local\mambaforge\envs\pws\lib\site-packages\pydot.py:1956, in Dot.create(self, prog, format, encoding)
1944 message = (
1945 '"{prog}" with args {arguments} returned code: {code}\n\n'
1946 'stdout, stderr:\n {out}\n{err}\n'
(...)
1952 err=stderr_data,
1953 )
1954 print(message)
-> 1956 assert process.returncode == 0, (
1957 '"{prog}" with args {arguments} returned code: {code}'.format(
1958 prog=prog,
1959 arguments=arguments,
1960 code=process.returncode,
1961 )
1962 )
1964 return stdout_data

AssertionError: "dot" with args ['-Tsvg', '-Gdpi=48', 'C:\Users\markstro\AppData\Local\Temp\1\tmprshir33o'] returned code: 3221225477

Improve NetCDF performance

Currently NetCDF output can add significant wall clock time, especially when separate_files=False. Performance needs to be improved.

Selected output variable netcdf files not writing to output folder

When writing output variable netcdf files, I am trying to specify the output vars explicitly (using control.options), instead of writing all output var files.

control.options = control.options | {
    "input_dir" : work_dir,
    "budget_type" : None,
    "verbose" : False,
    "calc_method" : 'numba',
    "netcdf_output_var_names": ['recharge']
}

But it seems not to write "recharge" in this example.

I can still get all vars to write using:

multi_proc_model.initialize_netcdf(
    output_dir=out_dir,
    separate_files=True,
)

But, this takes a very very very long time. I'd rather fix the first block (runs faster). I'll paste the whole run script below.

Thanks
Eddie

import pathlib as pl
import time
import pywatershed
 
work_dir = pl.Path("./")
out_dir = work_dir / "output"
out_dir.mkdir(parents=True, exist_ok=True)
 
params = pywatershed.parameters.PrmsParameters.load_from_json(work_dir / "parameters.json")  
control = pywatershed.Control.load(work_dir / "control.test")

control.options = control.options | {
    "input_dir" : work_dir,
    "budget_type" : None,
    "verbose" : False,
    "calc_method" : 'numba',
    "netcdf_output_var_names": ['recharge']
}
multi_proc_model = pywatershed.Model(
    [pywatershed.PRMSSolarGeometry,
    pywatershed.PRMSAtmosphere,
    pywatershed.PRMSCanopy,
    pywatershed.PRMSSnow,
    pywatershed.PRMSRunoff,
    pywatershed.PRMSSoilzone,
    pywatershed.PRMSGroundwater,
    pywatershed.PRMSChannel],
    control=control,
    parameters=params  
)
multi_proc_model.initialize_netcdf(
    output_dir=out_dir,
    separate_files=True,
)
sttime = time.time()
multi_proc_model.run(finalize=True)
print(f'That took {time.time()-sttime:.3f} looong seconds')

Reorganize and thin notebooks

Notebooks exist in multiple places in the repo (evaluation, examples). Also it is unclear if all of these are necessary. In a recent PR (#164) several notebooks were out of date with recent commits.

Notebooks should also be run as part of CI to ensure they can be run with the current version of pywatershed.

multiple forcing variables in a single ncf file

I'm trying to run something based on examples/02_prms_legacy_models.ipynb. I already have the variables prcp, tmax, and tmin in a single NetCDF file that was written by Parker with his Bandit extraction code, so I don't need to convert ASCII CBH files into NCF files. I've been looking in the pywatershed code trying to figure out how to specify multiple variables that reside in one .nc file, but I can't figure it out. Is it possible to do this, or do I need to split them into three files with one variable in each?
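
Not a maintainer answer, just a workaround sketch in case splitting turns out to be required: xarray can write one file per variable (file names are illustrative).

import xarray as xr

ds = xr.open_dataset("cbh.nc")  # contains prcp, tmax, and tmin
for var in ["prcp", "tmax", "tmin"]:
    ds[var].to_netcdf(f"{var}.nc")  # e.g. prcp.nc with a single variable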

improve test data generation

  • Fix stale description
  • Provide a way for autotest to tell if the test data exist or are stale
  • Streamline the test data creation, preferably from autotest rather than from test_data/generate

Project purpose in README?

Coming to this repository with a curious eye as an outsider, it's not clear what's going on or why. Some kind of description of what this is in the README would be useful.

improve nc4 support in data_model

Currently many data_model and DatasetDict tests are not carried out using netCDF4, but only xarray. This is because certain variable types (particularly time) need extra code (see the sketch below). I explored how xarray solves this (using netCDF4 as a backend), but it is not obvious.

Along these lines, test_prms_param_separate.py does not separate parameters using netCDF4. It would be good to have that included.
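
For reference, this is the kind of extra code a time variable needs with plain netCDF4, which xarray performs automatically (file and variable names are illustrative):

import netCDF4 as nc4

ds = nc4.Dataset("file.nc")
time_var = ds.variables["time"]
calendar = getattr(time_var, "calendar", "standard")
times = nc4.num2date(time_var[:], units=time_var.units, calendar=calendar)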

release 1.0.0

This release focuses on pywatershed reproducing PRMS 5.2.1.
This is an extended description repeating the milestone and allowing for additional discussion.

Issues: Milestone 1.0.0 includes

USGS internal review:

https://www.usgs.gov/products/software/software-management/types-software-review

  • domain review
  • code review
  • admin review

Release documentation

  • What are we comparing?
    • NHM physics
  • PRMS 5.2.1 in the pywatershed repo
    • Both mixed and double precision compilations
    • Enhanced output for comparing with and for driving pywatershed components
    • Snow mass balance fix
    • Anything else?
  • Pywatershed testing procedures description
    • Driven with output of double precision PRMS 5.2.1
    • Tolerances of individual processes
    • Domains: hru_1, drb_2yr, ucb_2yr, ?CONUS?
  • CONUS domain full NHM config
    • Release notebook script for running the comparisons
    • Summary statistics
      • Violin plots: bias, rmse, rrmse, KGE, R^2?, NSE
  • Reproduce legacy results of PRMS?
    • How to?
  • Caveats: snow
  • Future work
    • GSFLOW functionality: cascades, coupling to mf6 via bmi

Bug with soil_moist_prev

In PRMSSoilzone, the variable soil_moist is diagnostic; it should not have a tracked prior state soil_moist_prev, because it is not computed from one. Confusingly, I had moved soil_moist_prev out of .advance(), but this caused an issue with PRMSRunoff not getting the correct soil_moist_prev. The fix is to remove soil_moist_prev from Soilzone and have PRMSRunoff take the prognostic inputs soil_lower_prev and soil_rechr_prev.
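
A sketch of the relationship motivating the fix, assuming (as in PRMS) that soil_moist is the sum of the two prognostic stores:

import numpy as np

soil_lower = np.array([0.5, 1.0])
soil_rechr = np.array([0.2, 0.3])

soil_lower_prev = soil_lower.copy()  # prognostic: tracked across timesteps
soil_rechr_prev = soil_rechr.copy()  # prognostic: tracked across timesteps
soil_moist = soil_lower + soil_rechr  # diagnostic: recomputed, needs no _prev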

bools as ints

These variables are really booleans, but we are treating them as ints to be consistent with PRMS. This was happening in a translation step in meta.py that I'm removing, so I'm changing the types to int in metadata/variables.yaml. These should eventually be changed to boolean in the metadata and handled as such. This list may not be comprehensive.

diff --git a/pynhm/static/metadata/variables.yaml b/pynhm/static/metadata/variables.yaml
index 17e075c..fdd19b3 100644
--- a/pynhm/static/metadata/variables.yaml
+++ b/pynhm/static/metadata/variables.yaml
@@ -1636,7 +1636,7 @@ iasw:
     on curve and maximum (1) or is on the defined curve (0)
   dimensions:
     0: nhru
-  type: bool
+  type: int32
   units: none
 imperv_evap:
   desc: Evaporation from impervious area for each HRU
@@ -1896,7 +1896,7 @@ lst:
     the albedo curve (1) (albset_snm or albset_sna) otherwise (0)
   dimensions:
     0: nhru
-  type: bool
+  type: int32
   units: none
 lwrad_net:
   desc: Net long-wave radiation for each HRU
@@ -2109,7 +2109,7 @@ pptmix_nopack:
     present on an HRU (1), otherwise (0)
   dimensions:
     0: nhru
-  type: bool
+  type: int32
   units: none
 precip:
   desc: Precipitation at each measurement station

'PRMSSoilzone' object has no attribute '_adjust_parameters'

I'm running my NHM CONUS input files through the latest pywatershed develop version. I get the exception "AttributeError: 'PRMSSoilzone' object has no attribute '_adjust_parameters'" at pywatershed\hydrology\prms_soilzone.py:319 in _set_initial_conditions

I forgot to say that when I ran the much smaller Willamette model I didn't have this problem.

Control object features

There are a variety of features that it would be nice for the Control class to implement:

  • .load() should be renamed or even subclassed, as it is PRMS specific. Moreover, the vast majority of PRMS control variables are ignored, and a warning should be issued when they are present. A flag could be used to silence these warnings (e.g. ignore_legacy_vars). These ignored variables should not be retained in the Control object.
  • what are the required fields? input_dir? Throw an error when required fields are not present; attempt not to use defaults.
  • a to_yaml() method (see the sketch below)
  • should find a way to write-protect control, either in Model or in Process.
  • both __str__ and __repr__ (these would be used by .to_yaml())
  • determine how config should be edited in memory (prior to write protection): just edit config directly, or establish methods?
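
A hypothetical sketch of the requested to_yaml() method (not an existing pywatershed API; assumes the options live in a plain dict):

import yaml

class Control:
    def __init__(self, options: dict):
        self.options = options

    def to_yaml(self, path):
        # write the in-memory control options to a YAML file
        with open(path, "w") as f:
            yaml.safe_dump(self.options, f, default_flow_style=False)

    def __repr__(self):  # could also serve as the basis for __str__
        return f"Control(options={self.options!r})"

Control({"input_dir": ".", "budget_type": None}).to_yaml("control.yaml")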

Fortran component build

Currently, numpy.distutils.misc_util.Configuration is used to build Fortran components. This approach is deprecated. When running

python -m build

The following is reported:

`numpy.distutils` is deprecated since NumPy 1.23.0, as a result
  of the deprecation of `distutils` itself. It will be removed for
  Python >= 3.12. For older Python versions it will remain present.
  It is recommended to use `setuptools < 60.0` for those Python versions.
  For more details, see:
https://numpy.org/devdocs/reference/distutils_status_migration.html 

TODO:

  • Find a solution that relies on setuptools.

Also, since it seems we are getting ~equivalent performance using numba (see the sketch after this list), maybe we should

  1. only worry about fortran for development installation from the repo or
  2. consider dropping fortran options to simplify pywatershed
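
For context on item 2, this is the general style of numba-jitted kernel that makes the Fortran build optional; illustrative only, not actual pywatershed source.

import numpy as np
from numba import njit

@njit
def linear_reservoir(inflow, storage, k):
    # toy storage-routing loop, compiled to machine code by numba
    outflow = np.empty_like(inflow)
    for i in range(inflow.shape[0]):
        storage += inflow[i]
        outflow[i] = k * storage
        storage -= outflow[i]
    return outflow

q = linear_reservoir(np.ones(10), 0.0, 0.1)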

asv benchmarking to use mamba

This will save some significant time.
However

  1. the current release of asv doesn't include mamba support
  2. how to use mamba with asv is also very opaque; I could not get it to work on their master branch
