
l2ss-py


Harmony service for subsetting L2 data. l2ss-py supports:

  • Spatial subsetting
    • Bounding box
    • Shapefile subsetting
    • GeoJSON subsetting
  • Temporal subsetting
  • Variable subsetting

If you would like to contribute to l2ss-py, refer to the contribution document.

Initial setup, with poetry

  1. Install poetry by following its official installation instructions.
  2. Install l2ss-py, with its dependencies, by running the following from the repository directory:
poetry install

Note: l2ss-py can be installed as above and run without any dependency on harmony. However, to additionally test the harmony adapter layer, extra dependencies can be installed with poetry install -E harmony.

How to test l2ss-py locally

Unit tests

There are comprehensive unit tests for l2ss-py. The tests can be run as follows:

poetry run pytest -m "not aws and not integration" tests/

You can generate coverage reports as follows:

poetry run pytest --junitxml=build/reports/pytest.xml --cov=podaac/ --cov-report=html -m "not aws and not integration" tests/

Note: The majority of the tests execute core functionality of l2ss-py without ever interacting with the harmony python modules. The test_subset_harmony tests, however, are explicitly for testing the harmony adapter layer and do require the harmony optional dependencies be installed, as described above with the -E harmony argument.

l2ss-py script

You can run l2ss-py on a single granule without using Harmony. To do so, the l2ss-py package must be installed in your current Python interpreter.

$ l2ss-py --help                                                                                                                    
usage: run_subsetter.py [-h] [--bbox BBOX BBOX BBOX BBOX]
                        [--variables VARIABLES [VARIABLES ...]]
                        [--min-time MIN_TIME] [--max-time MAX_TIME] [--cut]
                        input_file output_file

Run l2ss-py

positional arguments:
  input_file            File to subset
  output_file           Output file

optional arguments:
  -h, --help            show this help message and exit
  --bbox BBOX BBOX BBOX BBOX
                        Bounding box in the form min_lon min_lat max_lon
                        max_lat
  --variables VARIABLES [VARIABLES ...]
                        Variables, only include if variable subset is desired.
                        Should be a space separated list of variable names
                        e.g. sst wind_dir sst_error ...
  --min-time MIN_TIME   Min time. Should be ISO-8601 format. Only include if
                        temporal subset is desired.
  --max-time MAX_TIME   Max time. Should be ISO-8601 format. Only include if
                        temporal subset is desired.
  --cut                 If provided, scanline will be cut
  --shapefile SHAPEFILE
                        Path to either shapefile or geojson file used to subset the provided input granule

For example:

l2ss-py /path/to/input.nc /path/to/output.nc --bbox -50 -10 50 10 --variables wind_speed wind_dir ice_age time --min-time '2015-07-02T09:00:00' --max-time '2015-07-02T10:00:00' --cut

In addition to providing a bounding box, spatial subsetting can be achieved by passing in a shapefile or a GeoJSON file.

poetry run l2ss-py /path/to/input.nc /path/to/output.nc --shapefile /path/to/test.shp

or

poetry run l2ss-py /path/to/input.nc /path/to/output.nc --shapefile /path/to/test.geojson
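When many granules need to be subset with the same options, the command line can be assembled in Python. The following is a minimal sketch, not part of l2ss-py: build_l2ss_args is a hypothetical helper, and the only flags it uses are the ones listed in the help text above.

```python
import shlex


def build_l2ss_args(input_file, output_file, bbox=None, variables=None,
                    min_time=None, max_time=None, cut=False, shapefile=None):
    """Return an argument list for the l2ss-py CLI (hypothetical helper)."""
    args = ["l2ss-py", input_file, output_file]
    if bbox is not None:
        # Order follows the help text: min_lon min_lat max_lon max_lat
        args += ["--bbox", *(str(v) for v in bbox)]
    if variables:
        args += ["--variables", *variables]
    if min_time:
        args += ["--min-time", min_time]
    if max_time:
        args += ["--max-time", max_time]
    if cut:
        args.append("--cut")
    if shapefile:
        args += ["--shapefile", shapefile]
    return args


args = build_l2ss_args("/path/to/input.nc", "/path/to/output.nc",
                       bbox=(-50, -10, 50, 10),
                       variables=["wind_speed", "wind_dir"], cut=True)
# Print the command as a single shell-quoted string for inspection.
print(shlex.join(args))
```

The resulting list can be handed to subprocess.run(args, check=True) once l2ss-py is installed in the active environment.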

Running Harmony locally

To fully test l2ss-py with Harmony, you can run Harmony locally. This requires that the data exist in UAT Earthdata Cloud.

  1. Set up a local Harmony instance by following the Harmony setup instructions.
  2. Add the concept ID for your data to services.yml.
  3. Execute a local Harmony l2ss-py request. For example:
    localhost:3000/YOUR_COLLECTION_ID/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?format=application%2Fx-netcdf4&subset=lat(-10%3A10)&subset=lon(-10%3A10)&maxResults=2
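The request URL can also be assembled programmatically, which avoids hand-encoding the query string. This is a minimal sketch using only the Python standard library; the function name build_harmony_url and the http:// scheme are assumptions made for illustration, while the path and parameters are copied from the example request.

```python
from urllib.parse import urlencode


def build_harmony_url(base, collection_id, lat, lon, max_results=None,
                      fmt="application/x-netcdf4"):
    """Build an OGC API Coverages rangeset URL for a lat/lon subset."""
    path = (f"{base}/{collection_id}/ogc-api-coverages/1.0.0"
            "/collections/all/coverage/rangeset")
    params = [
        ("format", fmt),
        ("subset", f"lat({lat[0]}:{lat[1]})"),
        ("subset", f"lon({lon[0]}:{lon[1]})"),
    ]
    if max_results is not None:
        params.append(("maxResults", max_results))
    # Keep parentheses literal; ':' and '/' are percent-encoded.
    return f"{path}?{urlencode(params, safe='()')}"


url = build_harmony_url("http://localhost:3000", "YOUR_COLLECTION_ID",
                        lat=(-10, 10), lon=(-10, 10), max_results=2)
print(url)
```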
    


l2ss-py's Issues

Empty subset result should maintain dimensions

When a subset is requested that is outside of the bounds of the input granule, an empty granule is returned. This empty granule is a shell of the input granule and contains all of the same variable attributes, global attributes, etc., but has empty data.

One issue we are seeing, however, is that the empty granule doesn't maintain its dimensions. Those dimensions should be maintained so that the granule can be used with Harmony concatenation.

For example:

➜ ncdump -h ascat_20211220_231800_metopb_48038_eps_o_250_3202_ovw.l2.nc4
netcdf ascat_20211220_231800_metopb_48038_eps_o_250_3202_ovw.l2 {
dimensions:
	time = UNLIMITED ; // (0 currently)
	lat = UNLIMITED ; // (0 currently)
	lon = UNLIMITED ; // (0 currently)
	wind_speed = UNLIMITED ; // (0 currently)
	wind_dir = UNLIMITED ; // (0 currently)
variables:
	double time(time) ;
		time:_FillValue = 1.00000001504747e+30 ;
		time:missing_value = 1.e+30 ;
		time:valid_min = 0 ;
		time:valid_max = 2147483647 ;
		time:standard_name = "time" ;
		time:long_name = "time" ;
		time:coordinates = "lat lon" ;
	double lat(lat) ;
		lat:_FillValue = 1.00000001504747e+30 ;
		lat:missing_value = 1.e+30 ;
		lat:valid_min = -9000000 ;
		lat:valid_max = 9000000 ;
		lat:standard_name = "latitude" ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_north" ;
	double lon(lon) ;
		lon:_FillValue = 1.00000001504747e+30 ;
		lon:missing_value = 1.e+30 ;
		lon:valid_min = 0 ;
		lon:valid_max = 36000000 ;
		lon:standard_name = "longitude" ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
	double wind_speed(wind_speed) ;
		wind_speed:_FillValue = 1.00000001504747e+30 ;
		wind_speed:missing_value = 1.e+30 ;
		wind_speed:valid_min = 0s ;
		wind_speed:valid_max = 5000s ;
		wind_speed:standard_name = "wind_speed" ;
		wind_speed:long_name = "wind speed at 10 m" ;
		wind_speed:units = "m s-1" ;
		wind_speed:coordinates = "lat lon" ;
...
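A regression check for this could compare the dimension names of each variable before and after subsetting. The sketch below uses plain dictionaries as stand-ins for metadata read from the two granules. The function name is hypothetical, and the input layout (NUMROWS, NUMCELLS) is an assumption for illustration; only the output layout is taken from the ncdump listing.

```python
def changed_dimensions(input_dims, output_dims):
    """Return {variable: (input_dims, output_dims)} where they differ."""
    return {
        name: (dims, output_dims.get(name))
        for name, dims in input_dims.items()
        if output_dims.get(name) != dims
    }


# Assumed input layout, for illustration only.
input_dims = {"lat": ("NUMROWS", "NUMCELLS"),
              "wind_speed": ("NUMROWS", "NUMCELLS")}
# Layout reported for the empty output granule.
output_dims = {"lat": ("lat",), "wind_speed": ("wind_speed",)}

for name, (before, after) in changed_dimensions(input_dims, output_dims).items():
    print(f"{name}: {before} -> {after}")
```

An empty result from changed_dimensions would indicate that the empty granule kept the dimensions of its input.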

Installation instructions

We should add local installation instructions to the README. Lots of people are new to Poetry and may not know how to properly install this on their local system to test the CLI.

Subset outside provided bounds

The following request:

https://harmony.earthdata.nasa.gov/C1996881636-POCLOUD/ogc-api-coverages/1.0.0/collections/l2p_flags/coverage/rangeset?granuleid=G2274421530-POCLOUD&subset=lat(57.338435000000004:69.9247)&subset=lon(97.24319999999999:156.88401)

results in latitude values outside the provided bounds. After subsetting, I would expect the latitude values to be within the bounds 57.338435000000004/69.9247, but instead they are in bounds 56.7/70.6, which is the same as the original granule.

We should investigate why this is happening -- it may be expected behavior or it may be a bug. If expected behavior, we should update our regression tests to not fail on this case.
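One hypothesis worth testing, offered as an assumption and not as a diagnosis: if the subset is taken as a rectangular index window around every in-range cell of a skewed swath, the window can still contain cells outside the requested bounds. The sketch below uses made-up latitude values and a hypothetical helper to show the effect.

```python
def bounding_window(grid, lo, hi):
    """Smallest row/column index window containing every in-range cell."""
    rows = [r for r, row in enumerate(grid)
            if any(lo <= v <= hi for v in row)]
    cols = [c for c in range(len(grid[0]))
            if any(lo <= row[c] <= hi for row in grid)]
    return (min(rows), max(rows)), (min(cols), max(cols))


# Made-up latitudes on a skewed swath grid (rows are scan lines).
lat = [
    [56.7, 57.0, 57.5],
    [57.2, 58.0, 59.0],
    [60.0, 65.0, 69.0],
    [69.5, 70.2, 70.6],
]
lo, hi = 57.338435, 69.9247

(r0, r1), (c0, c1) = bounding_window(lat, lo, hi)
window = [row[c0:c1 + 1] for row in lat[r0:r1 + 1]]
values = [v for row in window for v in row]
# Extremes of the rectangular window, which may exceed [lo, hi].
print(min(values), max(values))
```

In this toy grid every row and column holds at least one in-range cell, so the window spans the whole grid and its extremes match the original data. Whether l2ss-py behaves this way for the granule in question would need to be confirmed against the real output.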

OMI non-variable subsetting

Lat and Lon are in a different group than the data field variables. If no variables are given for a subset, the data_vars not in the geofields group should still be included in the subset. An extra else branch is needed to handle this.
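The intended branching can be sketched as follows. This is an illustration under assumptions, not the l2ss-py implementation: select_variables and the variable names in the example are hypothetical.

```python
def select_variables(all_variables, requested, coordinate_variables):
    """Pick the variables to keep in a subset.

    requested is a list of names, or None/empty when no variable
    subset was asked for.
    """
    if requested:
        # Variable subset: keep what was asked for plus the coordinates.
        keep = set(requested) | set(coordinate_variables)
    else:
        # No variable subset: keep everything, including data variables
        # that live outside the group holding lat/lon.
        keep = set(all_variables)
    return [name for name in all_variables if name in keep]


# Hypothetical flattened names for a grouped granule.
all_vars = ["__geo__Latitude", "__geo__Longitude", "__data__Radiance"]
coords = ["__geo__Latitude", "__geo__Longitude"]

print(select_variables(all_vars, None, coords))
print(select_variables(all_vars, ["__data__Radiance"], coords))
```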

SMAP_JPL_L2B_NRT_SSS_CAP_V5 time calculation issues

The subsetter is struggling with the time variable in SMAP_JPL_L2B_NRT_SSS_CAP_V5. The following time variable exists:

	float row_time(phony_dim_1) ;
		row_time:long_name = "Approximate observation time for each row" ;
		row_time:units = "UTC seconds of day" ;
		row_time:valid_max = 86400.f ;
		row_time:valid_min = 0.f ;

However, this represents the seconds within a day. The day/month/year metadata doesn't appear to be present in any variable in this dataset. This information might need to be pulled from global metadata; we should contact the data engineer to clarify this before implementation.
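If a calendar date can be obtained from somewhere outside row_time, combining the two is straightforward. The sketch below assumes such a date is available; where it would come from is exactly the open question above, and both the helper name and the date used here are made up.

```python
from datetime import date, datetime, timedelta, timezone


def row_times_to_datetimes(granule_date, seconds_of_day):
    """Combine a calendar date with 'UTC seconds of day' values."""
    midnight = datetime(granule_date.year, granule_date.month,
                        granule_date.day, tzinfo=timezone.utc)
    return [midnight + timedelta(seconds=float(s)) for s in seconds_of_day]


# The date is a placeholder; row_time itself carries no date information.
times = row_times_to_datetimes(date(2021, 1, 15), [0.0, 3600.5, 86399.0])
print(times[1].isoformat())
```

A granule that crosses midnight would need extra handling, since its seconds-of-day values would wrap back toward 0.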

Variable subsetting allows '/'

Variable subsetting currently occurs after group flattening and requires a '__' between groups. Harmony variable subsetting uses a '/' between groups. The fix will be a simple character replace.
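A minimal sketch of that replacement is shown here. The helper name and the example path are hypothetical, and whether incoming names carry a leading '/' is an assumption.

```python
GROUP_DELIM = "__"


def to_flattened_name(harmony_name):
    """Convert a '/'-separated variable path to the '__' form."""
    return harmony_name.replace("/", GROUP_DELIM)


print(to_flattened_name("/group/subgroup/variable"))
```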

Error while determining coordinate variables

Two new PODAAC datasets AHI_H08-STAR-L2P-v2.70 and ABI_G17-STAR-L2P-v2.71 fail when trying to determine coordinate variables. This is because the coordinate variables for both of these datasets are nj/ni instead of lon/lat.

The way in which coordinate variables are determined should be improved. Perhaps the CF-compliant coordinates field could be utilized. Alternatively, the xarray auto-coord detection feature could be utilized.
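The CF-based option could look roughly like this. It is a sketch under assumptions: the helper names and the example variables are hypothetical, and attributes are represented as plain dictionaries instead of being read from a granule. It relies only on the CF convention that the coordinates attribute is a space-separated list of variable names and that latitude and longitude carry the standard_name values "latitude" and "longitude".

```python
def find_coordinate_names(variables):
    """Collect names listed in any variable's CF 'coordinates' attribute.

    variables maps variable name -> dict of attributes.
    """
    found = []
    for attrs in variables.values():
        for name in attrs.get("coordinates", "").split():
            if name in variables and name not in found:
                found.append(name)
    return found


def classify_lat_lon(variables, coord_names):
    """Split coordinate names into (lat, lon) lists using standard_name."""
    lat = [n for n in coord_names
           if variables[n].get("standard_name") == "latitude"]
    lon = [n for n in coord_names
           if variables[n].get("standard_name") == "longitude"]
    return lat, lon


# Hypothetical variables and attributes.
variables = {
    "sst": {"coordinates": "geo_lon geo_lat"},
    "geo_lat": {"standard_name": "latitude"},
    "geo_lon": {"standard_name": "longitude"},
}
coords = find_coordinate_names(variables)
print(coords, classify_lat_lon(variables, coords))
```

This approach does not depend on the coordinate variables having particular names, which is the failure described above.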

VIIRS L2P SST latitude outside user specified bounds

Request:
https://harmony.earthdata.nasa.gov/C1996881636-POCLOUD/ogc-api-coverages/1.0.0/collections/l2p_flags/coverage/rangeset?granuleid=G2325969674-POCLOUD&subset=lat(51.2426775:62.351549999999996)&subset=lon(55.92425:103.7432125)

The latitude max is 63, which is outside the user specified bounds.
Cut = True in the global metadata in the output .nc of this harmony request.

This may not be an issue, but we should understand why it fails sometimes.

Jenkins job error:

Failed to run Nightly Build for L2SS Notebook on the following collections
OPS: C1996881636-POCLOUD <-- FAILING

l2ss-py TASKTOP TEST 1

Created by James in GH. This should sync to a new bug in Jira with the l2ss-py "component".

Add support for granules where time variable represents lines

An example of this would be SWOT_SIMULATED_L2_KARIN_SSH_ECCO_LLC4320_CALVAL_V1 (C2158344213-POCLOUD), which has the following structure:

>>> ds.time
<xarray.DataArray 'time' (num_lines: 9866)>
array(['2011-11-13T07:43:01.639593984', '2011-11-13T07:43:01.954040000',
       '2011-11-13T07:43:02.268488000', ..., '2011-11-13T08:34:27.435267968',
       '2011-11-13T08:34:27.751033984', '2011-11-13T08:34:28.066801024'],
      dtype='datetime64[ns]')
Coordinates:
    latitude_nadir   (num_lines) float64 ...
    longitude_nadir  (num_lines) float64 ...
Dimensions without coordinates: num_lines
Attributes:
    long_name:           time in UTC
    standard_name:       time
    leap_second:         YYYY-MM-DDThh:mm:ssZ
    comment:             Time of measurement in seconds in the UTC time scale...
    tai_utc_difference:  34.0
>>> ds.coords
Coordinates:
    latitude         (num_lines, num_pixels) float64 ...
    longitude        (num_lines, num_pixels) float64 ...
    latitude_nadir   (num_lines) float64 ...
    longitude_nadir  (num_lines) float64 ...

Expanded Time

In TROPOMI products that have both "time" and "delta_time", delta_time has the same shape as lat/lon. In the current get_time_variable method, if a variable name contains "time", squeeze() is applied to that variable, but squeeze() is not applied to the lat/lon variables.

Dimension assumptions

In the get_indexers_from_nd(cond, cut) method, the indexers assume that the cond.dims for a bounding box are at positions 0 and 1, but TROPOMI products have time at position 0. The method also assumes that the cond array input has 2 dimensions, while TROPOMI products have 3.

Collection associations can be overwritten if made while release branch is opened

This happened recently. The timeline was:

  1. 5/25 Release branch 1.5.0 created
  2. 5/27 UMM-S collection association for CYGNSS_NOAA_L2_SWSP_25KM_V1.2 was made, and committed to develop branch (via automated workflow)
  3. 5/31 1.5.0 released. Upon merge to main branch, collection associations in CMR are overwritten with values in ops_collection.txt

This means the association made on 5/27 was overwritten after release. A solution might be:

a. Pull collection association changes into release branch before release OR
b. Update CMR umm updater to not overwrite existing UMM-S collection associations

The preferred solution would probably be (b), but (a) works as a workaround for now.
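Option (b) amounts to taking the union of the associations already in CMR and those listed in the file, instead of replacing one with the other. The sketch below shows only that merge step; the helper name and the concept IDs are made up, and fetching from or writing to CMR is left out.

```python
def merge_associations(existing_in_cmr, from_file):
    """Union of associations already in CMR and those listed in the file."""
    return sorted(set(existing_in_cmr) | set(from_file))


# Placeholder concept IDs, for illustration only.
existing = ["C0000000001-EXAMPLE", "C0000000003-EXAMPLE"]
listed = ["C0000000001-EXAMPLE", "C0000000002-EXAMPLE"]

print(merge_associations(existing, listed))
```

With a union, an association that was removed from the file on purpose would also survive, so intentional removals would need a separate path.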

Variable dimensions

Variable dimensions are assumed to be the same across all variables in a group in the recombine_group_dataset method.

Root level variable names

The root level of grouped files sometimes has variables. We need to access these variables and add the '__' group delimiter.

As a Harmony service, I want to use coordinate variable information to identify coordinates when available

If a collection contains UMM-V associations where the UMM-V is identified as VariableType: Coordinate, that information will be passed to l2ss-py as part of the dataOperation message.

l2ss-py should be improved to use this (optional) information if it exists in order to more accurately identify which variables in the collection should be used as coordinates.

Relates to: https://bugs.earthdata.nasa.gov/browse/TRT-159

Acceptance Criteria:

  • L2SS updates subsetting processing logic to utilize UMM-Var information.
  • Subsetted outputs from L2SS remain unchanged compared to outputs from hardcoded coordinate metadata handling.

Empty subset for grouped dataset results in error

An empty subset, meaning a subset that contains no data, of a grouped dataset results in an error. This is related to issue #36 but only for grouped datasets.

Example query that will lead to an error:

https://harmony.earthdata.nasa.gov/C2158348264-POCLOUD/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?subset=lon(-10:10)&subset=lat(-10:10)&maxResults=2

This leads to the following error:

WorkItem [29846] failed with error: ghcr.io/podaac/l2ss-py:1.3.0: 'NoneType' object is not subscriptable

The stack trace reveals the failure is on the following line:

return np.array([[
    min(lon[0][0][0] for lon in zip(spatial_bounds)),
    max(lon[0][0][1] for lon in zip(spatial_bounds))
], [
    min(lat[0][1][0] for lat in zip(spatial_bounds)),
    max(lat[0][1][1] for lat in zip(spatial_bounds))
]])

This is happening for two reasons:

  1. Harmony might include a granule in a subset that doesn't actually contain data in the bbox because its outer extent does contain the bbox.
  2. get_spatial_bounds returns None if there is no data available to populate new spatial extent.

This can be fixed by handling the None result of get_spatial_bounds explicitly instead of parsing it as if it were an actual bound.
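One way to handle the None case is to drop it before taking the minima and maxima. The sketch below is plain Python and is not the l2ss-py code. merge_spatial_bounds is a hypothetical name, and the layout of each entry ([[min_lon, max_lon], [min_lat, max_lat]]) is inferred from the indexing in the failing line, so treat it as an assumption.

```python
def merge_spatial_bounds(spatial_bounds):
    """Combine per-group bounds, skipping groups that returned None.

    Each entry is None or [[min_lon, max_lon], [min_lat, max_lat]].
    Returns None when no group produced bounds (an empty subset).
    """
    valid = [b for b in spatial_bounds if b is not None]
    if not valid:
        return None
    return [
        [min(b[0][0] for b in valid), max(b[0][1] for b in valid)],
        [min(b[1][0] for b in valid), max(b[1][1] for b in valid)],
    ]


print(merge_spatial_bounds([None, [[-5.0, 3.0], [1.0, 9.0]],
                            [[-2.0, 7.0], [0.5, 4.0]]]))
print(merge_spatial_bounds([None, None]))
```

Callers then need to treat a None return value as "empty subset" and skip any step that updates the spatial extent.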

Service request failed with an unknown error

From @jjmcnelis

I’m getting an error back from Harmony for my MODIS-A L2 SST subset requests: https://harmony.earthdata.nasa.gov/jobs/0ed1a0b0-6f2d-47cb-b286-c340c52297e0. I submitted identical requests several times last year between March and December, so I know this dataset and request to be valid/compatible with Harmony. Also, I think this error/response message is the same as the one that Celia was receiving for the SWOT L2 SSH requests that we looked at together last week.

{
  "username": "jmcnelis",
  "status": "failed",
  "message": "WorkItem [44006] failed with error: ghcr.io/podaac/l2ss-py:1.3.0: Service request failed with an unknown error",
  "progress": 0,
  "createdAt": "2022-02-24T16:17:47.903Z",
  "updatedAt": "2022-02-24T16:18:04.155Z",
  "links": [
    {
      "href": "https://harmony.earthdata.nasa.gov/service-results/harmony-prod-staging/public/podaac/l2-subsetter/1da09702-4bdc-4475-8d6d-d68d6606fd2d/20190101031001-JPL-L2P_GHRSST-SSTskin-MODIS_A-N-v02.0-fv01.0_subsetted.nc4",
      "title": "20190101031001-JPL-L2P_GHRSST-SSTskin-MODIS_A-N-v02.0-fv01.0_subsetted.nc4",
      "type": "application/x-netcdf4",
      "rel": "data",
      "bbox": [
        -13.3,
        37.7,
        -12.8,
        38.5
      ],
      "temporal": {
        "start": "2019-01-01T03:10:01.000Z",
        "end": "2019-01-01T03:15:00.000Z"
      }
    },
    {
      "href": "https://harmony.earthdata.nasa.gov/jobs/0ed1a0b0-6f2d-47cb-b286-c340c52297e0?page=1&limit=2000",
      "title": "The current page",
      "type": "application/json",
      "rel": "self"
    }
  ],
  "request": "https://harmony.earthdata.nasa.gov/C1940473819-POCLOUD/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?subset=lat(37.707%3A38.484)&subset=lon(-13.265%3A-12.812)&subset=time(%222019-01-01%22%3A%222019-01-31%22)",
  "numInputGranules": 103,
  "jobID": "0ed1a0b0-6f2d-47cb-b286-c340c52297e0"
}

SMAP subsets are stuck at 0%

Improve shapefile subsetting performance

Investigate possible performance improvements to the shapefile subsetting capability. This capability is currently being brute-forced, but there might be a more elegant solution. Because this is L2 data, many of the existing capabilities for doing this may not work, such as using rioxarray.

The following library may be useful in achieving this: https://github.com/xarray-contrib/xoak
Also, see the following discussion on this exact topic: corteva/rioxarray#202

Validate service works with Harmony turbo

In support of the Harmony Turbo (soon to just be called “Harmony”) switch, we’re planning to have all current services testable in UAT using the “turbo=true” feature toggle by the start of Sprint 3 and are going to want all teams to either confirm we’re okay to switch requests to use Turbo by default or to send us a strongly-worded cease and desist before Sprint 5.

Time variable has inconsistent types for collection ASCATB-L2-25km

See https://harmony.earthdata.nasa.gov/C2075141559-POCLOUD/ogc-api-coverages/1.0.0/collections/wind_speed,wind_dir/coverage/rangeset?forceAsync=true&subset=lat(-90%3A0)&subset=lon(-180%3A0)&subset=time(%222021-12-20T00%3A00%3A00%22%3A%222021-12-20T23%3A59%3A59.999999%22)&concatenate=true

The time variable seems to change from the original float32 to float64 when subsetted. This causes issues when concatenating granules that are/aren't subsetted because the types need to match but don't.

We should update l2ss-py to ensure the original type is always retained when doing a subset.

name:time dim_order:('NUMROWS', 'NUMCELLS') fill_value:1.0000000150474662e+30 datatype:float64 group_path:/
/time
name:time dim_order:('NUMROWS', 'NUMCELLS') fill_value:1.0000000150474662e+30 datatype:float32 group_path:/

OMI Temporal Subsetting

xarray's time decoding does not decode the OMI time variable, which is stored as TAI93. The fix is to convert the requested timestamps to TAI93 and then compare.
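The conversion could be sketched as below. It assumes TAI93 means seconds elapsed since 1993-01-01T00:00:00 UTC and it ignores leap seconds added since then, so results can differ from true TAI93 by a few seconds. The helper names are hypothetical.

```python
from datetime import datetime, timezone

# Assumed TAI93 epoch.
TAI93_EPOCH = datetime(1993, 1, 1, tzinfo=timezone.utc)


def to_tai93_seconds(timestamp):
    """Seconds since the assumed TAI93 epoch, ignoring leap seconds."""
    return (timestamp - TAI93_EPOCH).total_seconds()


def in_time_range(tai93_values, min_time, max_time):
    """Flag which TAI93 values fall within [min_time, max_time]."""
    lo, hi = to_tai93_seconds(min_time), to_tai93_seconds(max_time)
    return [lo <= v <= hi for v in tai93_values]


start = datetime(1993, 1, 2, tzinfo=timezone.utc)
end = datetime(1993, 1, 3, tzinfo=timezone.utc)
print(in_time_range([0.0, 90000.0, 200000.0], start, end))
```

If second-level accuracy matters near the range boundaries, a leap-second-aware conversion would be needed in place of this simple subtraction.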

OCO3 Subsetting

OCO3 granules fail to subset when they contain both Latitude and latitude variables.
