cf-convention / cf-2 Goto Github PK


Group Repo


cf-2's Issues

Support for Groups in CF-2.0

Use case

While developing the file format guidelines for the upcoming Sentinel 5-precursor ESA earth observation mission, I ran into some limitations of the CF-1.6 conventions.

The number of output fields in our data products is large. To help our users distinguish the main output fields from the support data, we want to use groups. The main data contains for instance a total ozone column, its precision and the main geolocation. A simple quality indicator is included as well. This should suffice for basic usage. For us (retrieval algorithm developers) and other advanced users more details are needed, such as detailed processing flags, intermediate results, column values of trace gases that are fitted in addition to the main parameter, model parameters to translate a slant column to a vertical column, the slant columns themselves, pixel corners, etc. We don't want to bother most users with these details, and have therefore put these variables in another group.

The problem

CF-1.6 does not currently support this: references from a variable in one group to a variable in another group are not allowed. I will give two possible solutions here as a starting point, and we will see where we end up. I have chosen one of these options as a stop-gap measure, but we are (within reason) flexible enough to support either of them.

Basic requirement for the solution

Variables that are linked to main variables, for instance via the 'ancillary_variables' attribute, but also in the 'bounds', 'coordinates' and probably other attributes as well, must use the same dimensions.

Reference structure

+ /PRODUCT
| /PRODUCT/scanline(scanline) (DIM)
| /PRODUCT/ground_pixel(ground_pixel) (DIM)
| /PRODUCT/corners(corners) (DIM)
| /PRODUCT/latitude(scanline, ground_pixel)
| /PRODUCT/longitude(scanline, ground_pixel)
| /PRODUCT/ozone_column(scanline, ground_pixel)
+ /PRODUCT/SUPPORT_DATA/processing_flags(scanline, ground_pixel)
| /PRODUCT/SUPPORT_DATA/latitude_bounds(scanline, ground_pixel, corners)
| /PRODUCT/SUPPORT_DATA/longitude_bounds(scanline, ground_pixel, corners)

Possible solution 1: Follow the scoping rules for dimensions

Follow the scoping rules for dimensions, and search the entire scope in which the dimensions of the main variable can be used. The netCDF-4 C++ interface provides nice options for this, although more convenient support may be added to that interface later on.

In the example, the /PRODUCT/latitude variable has an attribute bounds with value latitude_bounds, while the /PRODUCT/SUPPORT_DATA/processing_flags variable has an attribute coordinates with value latitude longitude.

To find the actual variables, the application first finds the dimensions (using std::set<NcDim> netCDF::NcGroup::getDims(), with netCDF::NcGroup::ParentsAndCurrent as the search scope), then moves to the group where the dimension is defined (NcGroup NcDim::getParentGroup()), and finally finds the named variable within the scope of the dimension (using NcVar netCDF::NcGroup::getVar() with NcGroup::Location::ChildrenAndCurrent as the search scope). Other interfaces may make it harder to implement this pattern, but I think that is only a temporary limitation.
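The lookup described above can be sketched in Python without any netCDF library. The `Group` class and the `defining_group`, `find_in_scope` and `resolve` helpers are hypothetical stand-ins invented for this illustration, not part of any real API, and the sketch assumes that looking at the first dimension of the referring variable is enough to fix the search scope.

```python
class Group:
    """Hypothetical in-memory stand-in for a netCDF-4 group (not a real API)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.dims = set()      # dimension names defined in this group
        self.variables = {}    # variable name -> tuple of dimension names
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        return "" if self.parent is None else f"{self.parent.path()}/{self.name}"


def defining_group(group, dim_name):
    """Walk from `group` towards the root until the dimension is found."""
    while group is not None:
        if dim_name in group.dims:
            return group
        group = group.parent
    raise KeyError(dim_name)


def find_in_scope(group, var_name):
    """Search `group` and all of its descendants for a variable by name."""
    if var_name in group.variables:
        return f"{group.path()}/{var_name}"
    for child in group.children:
        found = find_in_scope(child, var_name)
        if found is not None:
            return found
    return None


def resolve(group, var_name, ref_name):
    """Resolve `ref_name` as referenced from variable `var_name` in `group`."""
    # Assumption: the first dimension of the referring variable fixes the scope.
    first_dim = group.variables[var_name][0]
    return find_in_scope(defining_group(group, first_dim), ref_name)


# Rebuild the reference structure from the issue.
root = Group("")
product = Group("PRODUCT", root)
product.dims = {"scanline", "ground_pixel", "corners"}
product.variables = {
    "latitude": ("scanline", "ground_pixel"),
    "longitude": ("scanline", "ground_pixel"),
    "ozone_column": ("scanline", "ground_pixel"),
}
support = Group("SUPPORT_DATA", product)
support.variables = {
    "processing_flags": ("scanline", "ground_pixel"),
    "latitude_bounds": ("scanline", "ground_pixel", "corners"),
    "longitude_bounds": ("scanline", "ground_pixel", "corners"),
}

bounds_path = resolve(product, "latitude", "latitude_bounds")
coord_path = resolve(support, "processing_flags", "latitude")
```

Here `bounds_path` points into the child group and `coord_path` points into the parent group, which are the two directions of reference the example needs.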

Note that this places other restrictions on the file, such as the inability to use the same name for a variable in two different groups within the same dimension search scope. I'm not sure this is a restriction at all, but it is something to keep in mind.

Possible solution 2: Use HDF-5 paths to point to linked variables.

This solution is more explicit, and uses HDF-5 paths to explicitly point to the location of a linked variable.

In the example, the /PRODUCT/latitude variable has an attribute bounds with value SUPPORT_DATA/latitude_bounds, while the /PRODUCT/SUPPORT_DATA/processing_flags variable has an attribute coordinates with value /PRODUCT/latitude /PRODUCT/longitude.

To find the actual variables, some string manipulations are needed to find the group names, and then finding the variables is probably fairly straightforward.

Note that this solution uses the fact that the / character is used as a path separator in HDF-5 (and can therefore not occur in a variable- or group-name). This method puts a restriction on group names in that these should not contain spaces, as the lists of variables are space separated. A similar restriction is already in place (implicitly) on variable names in CF-1.6.
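The string manipulation involved can be sketched with Python's standard `posixpath` module. The `resolve_references` helper is a hypothetical name, and the sketch assumes that a relative path is resolved against the group that contains the referring variable, as the example above suggests.

```python
import posixpath


def resolve_references(group_path, attribute_value):
    """Turn a space-separated reference list into absolute variable paths."""
    resolved = []
    for ref in attribute_value.split():
        # posixpath.join discards group_path when ref is already absolute.
        resolved.append(posixpath.normpath(posixpath.join(group_path, ref)))
    return resolved


# 'bounds' attribute of /PRODUCT/latitude (relative path)
bounds = resolve_references("/PRODUCT", "SUPPORT_DATA/latitude_bounds")

# 'coordinates' attribute of /PRODUCT/SUPPORT_DATA/processing_flags (absolute paths)
coords = resolve_references(
    "/PRODUCT/SUPPORT_DATA", "/PRODUCT/latitude /PRODUCT/longitude"
)
```

Because the list is split on whitespace, a group or variable name containing a space would be cut into two references, which is the restriction noted above.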

General note on variable names.

Within the S5P project we have put a restriction on 'element' names (groups, variables, attributes). NetCDF-4 allows an element name like "χ²" (\u03C7\u00B2). This is probably very good for human readability, but accessing the field from a program or script (non-interactively) is probably pretty hard. To get the string into this text file I went into an interactive python3 shell, and asked it to print("\u03C7\u00B2"), and those numbers were obtained from a website. Other computer systems may offer more convenient access.

We use the following restrictions:

The names of NetCDF-4 elements must match the regular expression: [a-zA-Z_][a-zA-Z0-9_]*. This means that the name of a NetCDF-4 element can be used as a variable name in most programming languages.
The names of NetCDF-4 elements use underscores to separate parts within a name. An exception to this rule is formed by attributes whose name is specified by an external standard or recommendation, such as the CF metadata conventions.
The names of variables are all lower case, with the exception of chemical species and abbreviations.
The names of groups are all upper case.
It is recommended to limit the names of elements to 40 characters or less.
Element names that differ only in capitalization are not allowed.
It is strongly recommended to ensure that names of variables are unique within a file.
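Some of these restrictions are easy to check mechanically. The following is a minimal sketch covering only three of them (the regular expression, the 40-character recommendation and the capitalization clash); `check_names` and its messages are hypothetical, not part of any S5P tooling.

```python
import re

NAME_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
MAX_LENGTH = 40  # recommended limit, not a hard rule


def check_names(names):
    """Return a list of (name, problem) pairs for a collection of element names."""
    problems = []
    seen = {}  # lower-cased name -> first spelling encountered
    for name in names:
        if NAME_PATTERN.fullmatch(name) is None:
            problems.append((name, "does not match the pattern"))
        if len(name) > MAX_LENGTH:
            problems.append((name, "longer than 40 characters"))
        folded = name.lower()
        if folded in seen and seen[folded] != name:
            problems.append((name, f"differs only in case from {seen[folded]}"))
        seen.setdefault(folded, name)
    return problems


report = check_names(["ozone_column", "Ozone_Column", "\u03C7\u00B2", "2nd_column"])
```

In this run `ozone_column` passes, while the other three names are each reported once.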

The first limitation is for instance nice when using the (HDF-5) pytables interface for python, as it allows simple dot-notation to access variables in a file, but requires that all elements are valid python variable names. Adding a similar interface to the python netCDF4 package is on my (far too long) todo list.

Notes

See summary below. The variable name restrictions now have their own issue, #5.

Adding Team members

@rsignell-usgs
@ngalbraith
@rockdoc
@BobSimons
@graybeal
@daf
@SiggyF

Invitations are sent via email and can be accepted at https://github.com/cf-convention

Use VLEN datatypes for Discrete Sampling Geometries

The section on discrete sampling geometries uses various methods to (efficiently) store datasets of varying length in linear arrays. NetCDF-4 introduces a VLEN data type, which seems to be quite elegant for this type of data. I suggest that we look into this.
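To make the contrast concrete, here is a minimal sketch in plain Python. One of the existing storage methods keeps all values in a single flat array plus a count per feature; a variable-length type would instead let each feature carry its own row. The `ragged_to_rows` helper and the sample numbers are made up for illustration, and no netCDF library is involved.

```python
def ragged_to_rows(values, row_sizes):
    """Split a flat array into per-feature rows using a count per feature."""
    if sum(row_sizes) != len(values):
        raise ValueError("row sizes do not add up to the number of values")
    rows, start = [], 0
    for size in row_sizes:
        rows.append(values[start:start + size])
        start += size
    return rows


# Three features with 3, 1 and 2 samples, stored back to back.
flat = [10.0, 10.5, 11.0, 20.0, 30.0, 30.5]
rows = ragged_to_rows(flat, [3, 1, 2])
```

With a VLEN type the `rows` form could be stored directly, so a reader would not need the separate count array to find where each feature starts.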

Support for Simple Features in Discrete Sampling Geometries

As a watershed modeler working on a model discretized to watershed polygons, I want to store the polygons and model state variables for a model run or collection of model runs, so my model archive can be shared with others in a fully described single file.

This is envisioned as a minor extension to the discrete sampling geometry station data type. In the station data type, the station is at a lat/lon. In this, the 'station' would be a polyline or a linear ring defining a polygon.

Suggest we rename the repo to CF-2

I suggest we rename the repo to CF-2, to avoid the connotation of only a specific version.

Should this repo be archived?

If this notification makes it to anyone who objects -- please object. Otherwise, I will archive it the next time I decide to clean up github stuff.

String, char, unsigned integers, and character encodings.

From: John Caron

Background:

In the classic model, data using the "byte" data type are interpreted as signed when converting. However, the byte data type is sometimes used for unsigned data. Unidata introduced the "_Unsigned" attribute to allow the user to specify this. Not all libraries look for this attribute.
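What a reader has to do when a byte variable actually holds unsigned data can be sketched in a few lines of Python. The `as_unsigned` helper is a hypothetical name; it only illustrates the reinterpretation that a library honouring "_Unsigned" would apply.

```python
def as_unsigned(signed_byte):
    """Reinterpret a signed 8-bit value (-128..127) as unsigned (0..255)."""
    if not -128 <= signed_byte <= 127:
        raise ValueError("not a signed 8-bit value")
    # Masking with 0xFF keeps the same bit pattern but reads it as unsigned.
    return signed_byte & 0xFF


stored = [0, 127, -128, -1]
decoded = [as_unsigned(v) for v in stored]
```

A library that ignores the attribute hands back the `stored` values unchanged, so the upper half of the unsigned range shows up as negative numbers.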

Sometimes the "char" data type is intended to mean unsigned byte data. More typically it is used for encoding text data, but the character encoding is undefined. Probably "printable ASCII" is a reasonable assumption. Char data are fixed length arrays only, and one must specify the length using a global, shared dimension, which is unneeded and clutters the dimension namespace.

The NetCDF-4 enhanced model adds Strings and unsigned integer types, so we have the opportunity to clarify. Lots of work on character encodings has been done in the last 20 years with Unicode, and we should leverage that. UTF8 is a variable length encoding of Unicode that has ASCII as a subset, allows any language to be encoded, and has become the dominant encoding on the web. NetCDF libraries assume Strings are UTF8 encoded. If your text is ASCII, you are using UTF8 already.
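Both properties of UTF8 mentioned here, ASCII as a subset and variable length, can be checked in a Python shell. This is only an illustration using the standard string methods; the sample strings are arbitrary.

```python
ascii_text = "total ozone column"
# ASCII text produces exactly the same bytes under both encodings.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

name = "\u03C7\u00B2"  # Greek small letter chi followed by superscript two
encoded = name.encode("utf-8")
# Number of characters versus number of bytes after encoding.
lengths = (len(name), len(encoded))
```

The second half shows why a fixed-length char array is awkward for such text: the number of bytes is not the number of characters.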

Also see:

CDL Data Types

Developing Conventions for NetCDF-4 : Use of Strings

Proposal:

Use the unsigned or signed integer data types when your data is unsigned or signed, respectively.
Do not use the _Unsigned attribute.
Use the String data type for text data, encoded in UTF-8. Any language (aka character set) is allowable.
The char data type is deprecated. If you must use it, use it only for ASCII text data.

Limitation on object names

Note: This remark was originally part of issue #4. Split off to separate the discussions.

Use case

In issue #4 references between variables are discussed. These occur in space separated lists, and as such they put a limitation on the allowable characters in object names.

The problem

NetCDF-4 allows an object name like "χ²" (\u03C7\u00B2). This is probably very good for human readability, but accessing the field from a program or script (non-interactively) is probably pretty hard.

To get the string into this text file I went into an interactive python3 shell, and asked it to print("\u03C7\u00B2"), and those numbers were obtained from a website. Other computer systems may offer more convenient access.
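For completeness, the code points can also be recovered from within Python itself when the characters are already available in a string, using only the standard library. This is a small sketch; the variable names are arbitrary.

```python
import unicodedata

name = "\u03C7\u00B2"

# Rebuild the escape sequence for each character from its code point.
escapes = "".join(f"\\u{ord(ch):04X}" for ch in name)

# Look up the official Unicode character names.
labels = [unicodedata.name(ch) for ch in name]
```

This helps when reading such a name back from a file, but it does not make the name any easier to type, which is the point of the restriction.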

Terminology

For some of the terminology refer to “CDM Object Names” and the discussion summary of issue #4.

In particular, an “object name” refers to the short name of either a group, variable, dimension, attribute, type definition or field in a compound datatype.
