euroargodev / argodmqc_owc Goto Github PK

View Code? Open in Web Editor NEW

9.0 11.0 3.0 154.36 MB

Argo float salinity calibration software

License: Apache License 2.0

HTML 55.77% Python 44.23%

argo python oceanography quality-control argo-floats dmqc

argodmqc_owc's People

Contributors

Stargazers

Watchers

Forkers

gitter-badger gmaze

argodmqc_owc's Issues

cal2dec not robust to leap years !

The cal2dec function reads:

def cal2dec(pa_month, pa_day, pa_hour=0, pa_minute=0):
    """ Converts a calendar date (month, day, hour, minute) to a decimal date (float)

        Parameters
        ----------
        pa_month: Month in the year (where 0 is Janurary and 11 is Decemeber)
        pa_day: Day in the month
        pa_hour: Hour in the day
        pa_minute: Minute in the hour

        Returns
        -------
        decimalised version of the date
    """

    ln_cumulative_months = np.cumsum([0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])

    try:
        dec_date = float(
            ln_cumulative_months[pa_month] + pa_day - 1 + pa_hour / 24 + pa_minute / 60 / 24
        )
        if dec_date > 366:
            raise ValueError("Day is out of scope of the year")
        return dec_date

    except IndexError:
        raise ValueError('Month is out of scope')

and the decimal year is then computed in change_dates with:

day = year + (cal2dec(month - 1, day, hour, minute) / 365)

But I'm not sure this function is robust to leap years.
For instance here we can find this implementation:

from datetime import datetime as dt
import time

def toYearFraction(date):
    def sinceEpoch(date): # returns seconds since epoch
        return time.mktime(date.timetuple())
    s = sinceEpoch

    year = date.year
    startOfThisYear = dt(year=year, month=1, day=1)
    startOfNextYear = dt(year=year+1, month=1, day=1)

    yearElapsed = s(date) - s(startOfThisYear)
    yearDuration = s(startOfNextYear) - s(startOfThisYear)
    fraction = yearElapsed/yearDuration

    return date.year + fraction

which takes the year into account, contrary to cal2dec.

So the difference will read:

date = pd.to_datetime('20200301')
print(toYearFraction(date)) 
# 2020.1639344262296

dec = cal2dec(int(date.strftime("%m"))-1, int(date.strftime("%d")), int(date.strftime("%H")), int(date.strftime("%M")))
int(date.strftime("%Y")) + dec / 365 
# 2020.1616438356164

This may not be very significant for the calibration, but this is simply not right and we should fix this.

Setting warning in configurations to prevent entry of swapped values

I have noticed that currently in Matlab and Python version of OWC there is no warning/throw an error to prevent DMQC operator to enter swapped values in small and large: Lon, Lat, Age parts of the Objective Mapping Parameters section. I run Python and Matlab OWC software with swapped values and in both cases, codes are running and producing map_.mat file.

Should we introduce in the OWC code some kind of e.g warning/throw an error, to prevent users to accidentally enter swapped values?

Python crash when no good climatology data has been found

I run the analysis for very long float including 269 profiles. At profile 265, I received an error message that there has been no good climatology data. That is kind of error that we expect. However, Python code is not generating map_.mat file for the profiles that were sucesfully analysed before crash.
It would be worth to set in the code for lack of coog climtology data python should skip that profile and continue analysis for another one. If there is no other good data code should stop analysis and produce the map_.mat file for those profile that climatology is sufficient.

Error message:

UPDATE_SALINITY_MAPPING: Working on profile 265 Warning: found 1 all NaNs in your dataset. These water columns have been removed Traceback (most recent call last): File "L:/users/argo/kamwal/OWC_Python_Test_1/owc-software-python/owc_calibration.py", line 73, in <module> update_salinity_mapping("/", FLOAT_NAME, USER_CONFIG) File "L:\users\argo\kamwal\OWC_Python_Test_1\owc-software-python\ow_calibration\update_salinity_mapping\update_salinity_mapping.py", line 351, in update_salinity_mapping hist_interp_sal, hist_interp_pres = interp_climatology(best_hist_sal, File "L:\users\argo\kamwal\OWC_Python_Test_1\owc-software-python\ow_calibration\interp_climatology\interp_climatology.py", line 78, in interp_climatology raise ValueError("No good climatological data has been found") ValueError: No good climatological data has been found

WMO: 4901464
Reference data: CTD_for_DMQC_2019v01.m
WMO boxes: wmo_boxes_ctd.mat
Objective mapping parameters:

`% Objective Mapping Parameters
%
% max number of historical casts used in objective mapping
CONFIG_MAX_CASTS=300
% 1=use PV constraint, 0=don't use PV constraint, in objective mapping
MAP_USE_PV=0
% 1=use SAF separation criteria, 0=don't use SAF separation criteria, in objective mapping
MAP_USE_SAF=0

% spatial decorrelation scales, in degrees
MAPSCALE_LONGITUDE_LARGE=8
MAPSCALE_LONGITUDE_SMALL=4
MAPSCALE_LATITUDE_LARGE=4
MAPSCALE_LATITUDE_SMALL=2
% cross-isobath scales, dimensionless, see BS(2005)
MAPSCALE_PHI_LARGE=0.1
MAPSCALE_PHI_SMALL=0.02
% temporal decorrelation scale, in years
MAPSCALE_AGE=5
MAPSCALE_AGE_LARGE=20
% exclude the top xxx dbar of the water column
MAP_P_EXCLUDE=100
% only use historical data that are within +/- yyy dbar from float data
MAP_P_DELTA=200`

Reduce amount done per function

It goes without saying that this is low priority, for now.

Some functions, such as update_salinity mapping, are doing far too much leg work for a single function. Because of this, they require numerous local variable definitions, and require a huge amount branches (nested for and if statements).

Not only does this make them somewhat difficult to test (as the testing suite to hit all possible branches would be huge), it makes them difficult to maintain for anyone approaching the code without a huge amount of background knowledge.

I propose a reasonably simple, but potentially time consuming refactor whereby we split these routines into even smaller chunks, separating out data manipulation areas (ie converting from 0 < x < 360 latitude to -180 < x < 180) and sections after "if" and "for" statements into their own easily testable functions, if they aren't already.

This will

Allow testing to be even more thorough and effective
Make code reading and navigating far simpler
Make changes easier to implement in the long term
Allow for more accurate and precise documentation

Refactoring

Let's talk about refactoring ideas

Configuration:

This is the most important feature to properly control what's being done, so it requires a nice and flexible UI

Better distinguish Analysis parameters (eg: use_pv) from Local configuration settings (eg: path and folder/file names)
Getter, setter for configuration items
Loader, saver from/to files (using txt files compatible with the Matlab software)
Usable in context ("with") or for sessions
Consistency/validity checks (e.g. #28 )

Data Mapping:

This is the most time consuming step, so it requires optimization to improve perf

Reduce the size of the update_salinity_mapping function (500 hundreds line !) by identifying recurrent patterns and affecting inner loop work to specific functions
Consider using xarray to make management of group of variables much easier. Together with a more readable code, this should help in recurrent task like modification of variable batches.
update_salinity_mapping has 2 main loop levels: on profile and on vertical levels. If inner loop work is delegated to functions and if data structure is clarified (with dictionaries or even better: xarray.DataSet), make these loops work in parallel will be much easier and a game changer in terms of performances.

Data fetching:

This is a key component of the software, fetching float but more importantly reference data.

Refactor all functions dedicated to data fetching to a single module (already started in #26 but surely not finished)
Consider using a cache filesystem to avoid repeating previous fetch
When this will be possible, consider fetching data from web-API: always up to date !

Code design:

Refactor code so that functions are really self contained: eg in update_salinity_mapping the longitude values wrapping between -180/180 for get_topo_grid is done outside of get_topo_grid, adding 4 lines and 1 variable to the code. It it this function inner responsibility to check for longitude values, must not be done outside.
Start a documentation, getting started with sphinx autodoc. Will make refactoring easier to be able to navigate the code.
This is linked to documentation: gives clear explanation of configuration parameters and where they are used in the code. This will also help users to set up the software properly.

SAF and PV code

Most of the code for the Potential Vorticity has actually been merged in (though I do need to double check this). However, we still need to merge in the code that allows an operator to filter out data around the South Antarctic Front.

Refactor function options management

I noticed that some of the pyowc functions do not use the same options order policy !
For instance:

# Calculate mapped values needed for the analysis
owc.calibration.update_salinity_mapping("", USER_CONFIG, FLOAT_NAME)

while we have:

# Set the calseries parameters for analysis and line fitting
owc.configuration.set_calseries("", FLOAT_NAME, USER_CONFIG)

see how USER_CONFIG comes first and then second ...

when refactoring the code, this must be addressed.

Initial alpha test release

Hi,
I've rushed a little bit yesterday when migrating the repo
I think we should tag a release 0.1.0 with the initial state, the version that is being tested by expert users.
So I propose to strip the code from my commits yesterday, and tag a release with this initial state
It should be easier to work with for debugging the testers
Are you ok @edsmall-bodc @matdon17 ?

Change the author_email in setup.py

Currently, in the setup.py code including the general information about the software the author email is [email protected]
Ed Small is not working in BODC and it would be good to change this email to more general such as [email protected] or any other

Migrating repo from NOC to EUROARGODEV

In order to migrate the repository from NOC to github servers at EUROARGODEV:

Requirements:

Check if the Github import tool is effective. @gmaze tested it with success at https://github.com/gmaze/owc-software-python. All repo history and contributions are imported properly (see contributions here and commits here).
Check if the CI workflow for NOC can be managed with github actions (or other service like Travis). @gmaze tested with success an implementation of NOC workflow in a Github actions. Online testing is run with success (See here for instance) under Linux and Mac OS and with python versions 3.6, 3.7 and 3.8 (which seems to be more than what was done at NOC). However, it is unsuccessful under Windows OS, which should be fixed later.

Migration procedure:

Run the Github import tool
Check history of commits, branch merging, etc ...
Move "merge_request_templates" to .github folder
Add a Github action workflow for CI, following @gmaze template here.
Link repo to https://codecov.io/gh/euroargodev and set-up code coverage (which is excellent here !) reporting
Link repo to https://requires.io/ and set-up requirements checks
Add readme and badges
Set-up Access rights

@edsmall-bodc, @matdon17 : I'll be happy to do this if you want me to.

Python vs Matlab speed differences

I found relatively large differences (~40 seconds) in speed per profile in the update salinity mapping calulations. This potentially could be related with a capability of my computer or there is something that could be improved in Python code.
Did anyone have a simmilar Python code behaviour?

WMO num: 1901847
WMO boxes: wmo_boxes_ctd.mat
Refence data: CTD_for_DMQC_2019V01
Code evaliation times for Matlab and Python:
Matlab 1901847.docx
Python 1901847.docx

Config_file:

`% ===============================
%
% Objective Mapping Parameters
%

% max number of historical casts used in objective mapping
CONFIG_MAX_CASTS=300

% 1=use PV constraint, 0=don't use PV constraint, in objective mapping
MAP_USE_PV=0

% 1=use SAF separation criteria, 0=don't use SAF separation criteria, in objective mapping
MAP_USE_SAF=0

% spatial decorrelation scales, in degrees
MAPSCALE_LONGITUDE_LARGE=8
MAPSCALE_LONGITUDE_SMALL=4
MAPSCALE_LATITUDE_LARGE=4
MAPSCALE_LATITUDE_SMALL=2

% cross-isobath scales, dimensionless, see BS(2005)
MAPSCALE_PHI_LARGE=0.1
MAPSCALE_PHI_SMALL=0.02

% temporal decorrelation scale, in years
MAPSCALE_AGE=5
MAPSCALE_AGE_LARGE=20

% exclude the top xxx dbar of the water column
MAP_P_EXCLUDE=100

% only use historical data that are within +/- yyy dbar from float data
MAP_P_DELTA=200`

Set the amount of theta levels plotted in configuration.py

Currently, the amount of displayed plots (from 1 to 10) can be set in dashboard.py function. The default option is 2.

def plot_diagnostics(float_dir, float_name, config, levels=2)

Maybe instead of going inside this function and looking for the line where to set the number of displayed theta levels, it would be worth to have that setup in the configuration.py ?

Data not appending correctly

Data used as an example in master branchare different than the source data in my source file.
Below I am including the source file matrix:

3901960.zip

Missing coastline in trajectory plot

I have found that for many floats Python code is not displying coastline contours of continents in trajectory plot.

Left plot -Python verion; Right plot- Matlab version

WMO: 3900560; 3901531; 3901985; 4901464; 6901121
Reference data: CTD_for_DMQC_2019V01
WMO boxes: wmo_boxes_ctd.mat

Cal series:
breaks = [] max_breaks = 4 # 0 for linear trend, -1 for offset calseries = np.ones((1, no_profiles)).flatten() # example for splitting time series at profile 33 # calseries = np.concatenate((np.ones((1, 33)), 2 * np.ones(1,no_profiles - 33))) calib_profile_no = profile_no use_percent_gt = 0.5 use_theta_lt = 0 use_theta_gt = 0 use_pres_lt = 0 use_pres_gt = 0

Software LICENSE

Which license should we use ?

This can help: https://choosealicense.com/

#25

New function to create: float_source/<WMO>.mat file

At this point, the python software still relies on Matlab or user own function to create the crucial float_source/<WMO>.mat file file.
For users who want to try or use the Python software, we need to provide a function that read the netcdf file and produce that OWC specific file !

This should not be too complicated, I see that the file the test "3901960.mat" file has the following arrays:

__header__ : b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Mon Feb 25 13:18:03 2019'
__version__ : 1.0
__globals__ : []
PROFILE_NO : [[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
  25 26 27 35 36 37 38 39 40 41 42]]
LAT : (1, 35)
LONG : (1, 35)
DATES : (1, 35)
PRES : (205, 35)
TEMP : (205, 35)
PTMP : (205, 35)
SAL : (205, 35)

Library to read Matlab files

I see that pyowc is using 2 libraries to read Matlab mat files:

pymatreader read_mat
scipy.io loadmat

The former is used only once, the latter basically everywhere else.

I see from #52 that pymatreader was added to fix #30 , so in fact we can't get rid of this dependency

So why using 2 dependencies for the same task ?

Should pyowc use only 1 reader ? pymatreader only

ps: This is not considering the fact that all external data should be accessible in other formats than Matlab (eg: #53)

Managing calibration and other software options and parameters

As pointed out by @edsmall-bodc in #23 :

Currently the user has to define many things for the OWC toolbox to run:
For the update salinity mapping, they need to define:

Which data set they want to use from each WMO box (argo/CTD/bottle data) [though we have discussed how the WMO boxes could be phased out!]
The float name they want to analyse
The maximum number of historical casts they want to use during analysis
Whether or not they would like to use potential vorticity
Whether or not they would like to use special constraints around the subantarctic front
The latitude and longitude scales used for decorrelation (4 values)
The cross isobath scales (2 values)
The age scales (2 values)
The exclusion depth
The age exclusion parameter
Many of these parameters have a default value, but I know operators change them regularly

That brings variables for salinity mapping to 15, which is somewhat tricky to manage in a usable way. Either we will need to group some of these variables together (into classes or dictionaries?), OR we can split the update salinity mapping routine up into separate routines that can be called?

We may want to consider a set of options (as a dictionary) used throughout the package that users can set or change within context and in their script.

Most of the time, users will only need to modify a subset of the parameters.
This is a mechanism used in matplotlib (they also use files) and xarray that I implemented in http://www.github.com/euroargodev/argopy as well

It is very easy to manage and I think is rather elegant and intuitive in terms of user experience:

Optional parameters can be set 3 ways:

at the beginning of a script, notebook:

argopy.set_options(src='erddap')

in a temporary context:

with argopy.set_options(src='erddap'):
    loader = ArgoDataFetcher().profile(6902746, 34)

I love this one, because it will make very easy for users to test different parameters in a single script

or with options when calling methods:

loader = ArgoDataFetcher(src='erddap').profile(6902746, 34)

Allow Python code to display all plots in PyCharm

As an additional improvement that would be nice to have it so allow the Python code to automatic display all generated plots in the Pycharm. At the moment, to display and generate another plot in PyCharm I need to manually close the previously displayed plot.

Refactor find_besthist

Currently, the find_besthist function requires 18 arguments to run. This is far too many, looks a bit messy, and can make using the function rather difficult.

I suggest taking some of the arguments out and placing them in a class, especially where the same class can be used for differing data.

Create a class for containing float data. This class can contain the latitudes, longitudes, z vales, and dates of different data points. We can either create two objects (one being the float, and the second being all the historical points), or we can create an array of objects, where each object is a single data point. This class could also contain some functions that are exclusively used on the data (i.e. calculating potential vorticity).

We could also create a class for calculating the ellipse that inherits from the other class to remove even more arguments.

I also think we should just pass in the struct containing all the configuration variables into the function, rather than dropping them in one by one.

Doing all the above will reduce the argument count from 18 to 3. I would be much more manageable.

Problem in selecting Argo referenced data

Durring running the OWC analysis using Argo referenced data only, I encountered the below error:

` UPDATE_SALINITY_MAPPING: Working on profile 3
Traceback (most recent call last):
File "L:\users\argo\kamwal\OWC_Python_SSH\owc-software-python\ow_calibration\get_region\get_region_data.py", line 172, in get_region_data
grid_long = wrap_longitude(grid_long)
File "L:\users\argo\kamwal\OWC_Python_SSH\owc-software-python\ow_calibration\get_region\data_functions\wrap_longitude.py", line 27, in wrap_longitude
neg_long = np.argwhere(grid_long < 0)
TypeError: '<' not supported between instances of 'list' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "L:/users/argo/kamwal/OWC_Python_SSH/owc-software-python/owc_calibration.py", line 73, in
update_salinity_mapping("/", FLOAT_NAME, USER_CONFIG)
File "L:\users\argo\kamwal\OWC_Python_SSH\owc-software-python\ow_calibration\update_salinity_mapping\update_salinity_mapping.py", line 330, in update_salinity_mapping
best_hist_lat, best_hist_long, best_hist_dates] = get_region_data(wmo_numbers,
File "L:\users\argo\kamwal\OWC_Python_SSH\owc-software-python\ow_calibration\get_region\get_region_data.py", line 198, in get_region_data
raise Exception("NO DATA FOUND")
Exception: NO DATA FOUND

Process finished with exit code 1

`
In running the same analysis for (1) ctd+ argo and (2) CTD only, there was not such error.

FLOAT SOURCE NUMBER: 3901960
REFERENCE DATA: ARGO_for_DMQC_2020V01.mat
WMO boxes: wmo_boxes_argo.mat
OBJECTIVE MAPPING PARAMETERS:

` # ===============================
#
# Objective Mapping Parameters
#

    # max number of historical casts used in objective mapping
    'CONFIG_MAX_CASTS': 300,

    # 1=use PV constraint, 0=don't use PV constraint, in objective mapping
    'MAP_USE_PV': 0,

    # 1=use SAF separation criteria, 0=don't use SAF separation criteria, in objective mapping
    'MAP_USE_SAF': 0,

    # spatial decorrelation scales, in degrees
    'MAPSCALE_LONGITUDE_LARGE': 8,
    'MAPSCALE_LONGITUDE_SMALL': 2,
    'MAPSCALE_LATITUDE_LARGE': 4,
    'MAPSCALE_LATITUDE_SMALL': 1,

    # cross-isobath scales, dimensionless, see BS(2005)
    'MAPSCALE_PHI_LARGE': 0.1,
    'MAPSCALE_PHI_SMALL': 0.02,

    # temporal decorrelation scale, in years
    'MAPSCALE_AGE_LARGE': 20,
    'MAPSCALE_AGE_SMALL': 10,

    # exclude the top xxx dbar of the water column
    'MAP_P_EXCLUDE': 100,

    # only use historical data that are within +/- yyy dbar from float data
    'MAP_P_DELTA': 200`

Add calibrated sal_anom_plot

In the plots-diagnostics function of python code there is available only uncalibrated version of sal_anom_plot. The task is to include the code in plot_diagnostics and dashboard with calibrated version of this plot.

Fetching data from the Argo reference dataset

It's probably still too early, but argopy is able to fetch and return data from the Argo reference dataset !

from argopy import DataFetcher as ArgoDataFetcher
loader = ArgoDataFetcher(src='erddap', ds='ref')
ds1 = loader.region([-85, -45, 10, 20, 0, 1000, ‘2012-01’, ‘2012-02’]).to_xarray()

we could develop an access point that would work with owc, ie using distances instead of a rectangular space/time domain

Originally posted by @gmaze in #30 (comment)

float_source matlab files incorrect

While working on euroargodev/argopy/issues/141 and euroargodev/argopy/pull/142 I found that the following pyowc float_source matlab files:

are incorrect !

They appear to have been generated with an old version of the Matlab routine create_float_source.m.
The last version uses another method to sub-select vertical pressure levels (floor instead of round),
It can be seen here: euroargodev/dm_floats@7497874#diff-a7f153565065c547918a043959d1efd54b042a8d75edbf7fcef1c8bd6f76ec72

So I guess these files must be updated
The most recent version of the create_float_source.m routine is here:
https://github.com/euroargodev/dm_floats/blob/master/src/ow_source/create_float_source.m

Publishing on Pypi

For standard users of the software, they should not need to clone the repository to be able to run it, but rather simply install it using pip or conda.
This implies 2 things:

a proper setup.py and packaging
a serious refactoring of the code to manage options and configuration parameters outside of the software code (to be honest, I'd like this to be my 1st contribution to the code)

This is also to let you known that when this will be ready to publish euroargodev python softwares on pypi, I created a euroargodev account at https://test.pypi.org/user/euroargodev/ and https://www.pypi.org/user/euroargodev/
(don't be fulled by the picture, this is retrieved automatically from my email but will change soon, so that it is the EA logo).

Ellipse should sub-sample data for random selection

When selecting historical data for comparison, 1/3 of this data is selected randomly. The rest are selected by how well the remaining historical data matches the current float profile (space and time).

The randomly selected data are specifically selected to avoid choosing only data from one area because of strong correlations. We want to ensure we use strongly correlated data AND enough data to cover the area generally.

However, if we select 1/3 of our data points randomly, we are sill at risk of selecting a poor spatial/temporal distribution (though this probability is very small). We can increase our chances of getting a good distribution by splitting up the ellipse into N parts and randomly selecting data in each of these parts.

We need to discuss how we should go about splitting up the ellipse into chunks. We should try and do this dynamically, so that DMQC operators can decide themselves how many areas the ellipse should be split into.

Also need to decide how to allocate the amount of selection for each segment. Eg, what if we want 20 random data points from each section, but one section only contains 5 data points? How do we allocate the work to pick up the slack?

I've added some diagrams to visualise what we are trying to achieve.

Yellow dot is the current profile.
Green dots are selected data
Red dots are data that fall outside our spatial/temporal parameters.