cf-convention / cf-2
While developing the file format guidelines for the upcoming Sentinel 5-precursor ESA earth observation mission, I ran into some limitations of the CF-1.6 conventions.
The number of output fields in our data products is large. To help our users distinguish the main output fields from the support data, we want to use groups. The main data contains for instance a total ozone column, its precision and the main geolocation. A simple quality indicator is included as well. This should suffice for basic usage. For us (retrieval algorithm developers) and other advanced users more details are needed, such as detailed processing flags, intermediate results, column values of trace gases that are fitted in addition to the main parameter, model parameters to translate a slant column to a vertical column, the slant columns themselves, pixel corners, etc. We don't want to bother most users with these details, and have therefore put these variables in another group.
The current CF-1.6 does not support this: references from a variable in one group to a variable in another group are not possible. I will give a few solutions here as a starting point, and we will see where we end up. I have chosen one of these options as a stop-gap measure, but we are (within reason) flexible enough to support any of them.
Variables that are linked to main variables (for instance via the 'ancillary_variables' attribute, but also via 'bounds', 'coordinates', and probably other attributes as well) must use the same dimensions.
+ /PRODUCT
| /PRODUCT/scanline(scanline) (DIM)
| /PRODUCT/ground_pixel(ground_pixel) (DIM)
| /PRODUCT/corners(corners) (DIM)
| /PRODUCT/latitude(scanline, ground_pixel)
| /PRODUCT/longitude(scanline, ground_pixel)
| /PRODUCT/ozone_column(scanline, ground_pixel)
+ /PRODUCT/SUPPORT_DATA/processing_flags(scanline, ground_pixel)
| /PRODUCT/SUPPORT_DATA/latitude_bounds(scanline, ground_pixel, corners)
| /PRODUCT/SUPPORT_DATA/longitude_bounds(scanline, ground_pixel, corners)
Follow the scoping rules for dimensions, and search the whole scope in which the dimensions of the main variable can be used. The netCDF-4 C++ interface provides nice options for this, although more convenient support may be added to that interface later on.
In the example, the /PRODUCT/latitude variable has an attribute 'bounds' with value 'latitude_bounds', while the /PRODUCT/SUPPORT_DATA/processing_flags variable has an attribute 'coordinates' with value 'latitude longitude'.
To find the actual variables, the application first finds the dimensions (using std::set<NcDim> netCDF::NcGroup::getDims() with netCDF::NcGroup::ParentsAndCurrent as the search scope), then starts from the group where the dimension is defined (NcGroup NcDim::getParentGroup()), and finally finds the named variable within the scope of the dimension (using NcVar netCDF::NcGroup::getVar() with NcGroup::Location::ChildrenAndCurrent as the search scope). Other interfaces may make it harder to implement this pattern, but I think that is only a temporary limitation.
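The search pattern described above can be sketched in plain Python. This is not the netCDF API: groups are modelled as simple objects with made-up class and function names, purely to illustrate the "dimension scope first, then variable" lookup order.

```python
# Sketch of dimension-scope variable lookup (illustrative only,
# not any real netCDF interface).

class Group:
    """A toy stand-in for a netCDF-4 group."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.dims = set()       # dimension names defined in this group
        self.vars = set()       # variable names defined in this group
        self.subgroups = []

    def child(self, name):
        g = Group(name, parent=self)
        self.subgroups.append(g)
        return g

def find_dim_group(group, dim):
    """Search the current group and its parents for a dimension."""
    while group is not None:
        if dim in group.dims:
            return group
        group = group.parent
    return None

def find_var_group(group, name):
    """Search the given group and all of its children for a variable;
    return the group that defines it."""
    if name in group.vars:
        return group
    for sub in group.subgroups:
        found = find_var_group(sub, name)
        if found is not None:
            return found
    return None

def resolve(start_group, dim, var_name):
    """Resolve a referenced variable: find where the dimension is
    defined, then search downward from that group for the variable."""
    dim_group = find_dim_group(start_group, dim)
    if dim_group is None:
        return None
    return find_var_group(dim_group, var_name)
```

With the example file above, resolving 'latitude_bounds' from /PRODUCT would first find the 'scanline' dimension in /PRODUCT and then locate the variable in the /PRODUCT/SUPPORT_DATA child group.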
Note that this places other restrictions on the file, such as the inability to use the same name for a variable in two different groups within the same dimension search scope. I am not sure whether this is a real restriction, but it is something to keep in mind.
This solution is more explicit: it uses HDF-5 paths to point directly to the location of a linked variable.
In the example, the /PRODUCT/latitude variable has an attribute 'bounds' with value 'SUPPORT_DATA/latitude_bounds', while the /PRODUCT/SUPPORT_DATA/processing_flags variable has an attribute 'coordinates' with value '/PRODUCT/latitude /PRODUCT/longitude'.
To find the actual variables, some string manipulations are needed to find the group names, and then finding the variables is probably fairly straightforward.
Note that this solution relies on the fact that the '/' character is used as a path separator in HDF-5 (and can therefore not occur in a variable or group name). This method puts a restriction on group names: they must not contain spaces, as the lists of variables are space separated. A similar restriction is already in place (implicitly) on variable names in CF-1.6.
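The string manipulation involved is modest. The following sketch (plain Python, not part of any netCDF library; the function names are made up) resolves both the relative and the absolute forms of reference used in this option:

```python
# Sketch of resolving path-style variable references (option 2).
# A value like "SUPPORT_DATA/latitude_bounds" is relative to the
# group of the referring variable; "/PRODUCT/latitude" is absolute
# from the root of the file.

def resolve_path(current_group_path, reference):
    """Return the absolute path of one referenced variable.

    current_group_path -- absolute path of the group holding the
                          referring variable, e.g. "/PRODUCT"
    reference          -- one entry from a space-separated attribute
                          value, e.g. "SUPPORT_DATA/latitude_bounds"
    """
    if reference.startswith("/"):
        return reference                      # already absolute
    # relative reference: append to the current group's path
    return current_group_path.rstrip("/") + "/" + reference

def resolve_references(current_group_path, attribute_value):
    """Split a space-separated attribute value and resolve each entry."""
    return [resolve_path(current_group_path, ref)
            for ref in attribute_value.split()]
```

For example, resolving the 'coordinates' value of /PRODUCT/SUPPORT_DATA/processing_flags yields the two absolute paths /PRODUCT/latitude and /PRODUCT/longitude unchanged, while the relative 'bounds' value of /PRODUCT/latitude expands to /PRODUCT/SUPPORT_DATA/latitude_bounds.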
Within the S5P project we have put a restriction on 'element' names (groups, variables, attributes). NetCDF-4 allows an element name like "χ²" (\u03C7\u00B2). This is probably very good for human readability, but accessing such a field from a program or script (non-interactively) is probably quite hard. To get the string into this text file I went into an interactive python3 shell and asked it to print("\u03C7\u00B2"); the code points themselves were obtained from a website. Other computer systems may offer more convenient access.
We use the following restriction: element names must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*. This means that the name of a NetCDF-4 element can be used as a variable name in most programming languages. This restriction is, for instance, convenient when using the (HDF-5) pytables interface for Python, as it allows simple dot-notation access to variables in a file, but requires that all elements are valid Python variable names. Adding a similar interface to the Python netCDF4 package is on my (far too long) todo list.
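A check against this restriction is a one-liner in Python (a sketch; the function name is made up):

```python
import re

# Proposed S5P element-name restriction: a letter or underscore,
# followed by letters, digits, or underscores.
NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*\Z")

def is_valid_name(name):
    """True if the name satisfies the proposed element-name rule."""
    return NAME_RE.match(name) is not None
```

Names like "ozone_column" pass, while "χ²", "2theta", and anything containing a space are rejected.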
See summary below. The variable name restrictions now have their own issue #5.
@rsignell-usgs
@ngalbraith
@rockdoc
@BobSimons
@graybeal
@daf
@SiggyF
Invitations are sent via email and can be accepted at https://github.com/cf-convention
The section on discrete sampling geometries uses various methods to (efficiently) store datasets of varying length in linear arrays. NetCDF-4 introduces a VLEN data type, which seems to be quite elegant for this type of data. I suggest that we look into this.
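As a rough illustration of the idea, a VLEN type could hold one ragged series per station directly, instead of the packed linear arrays of the current DSG chapter. The following CDL is only a sketch (all names are hypothetical), assuming the NetCDF-4 user-defined type syntax accepted by ncgen:

```
netcdf dsg_vlen_sketch {
types:
  float(*) profile_t ;          // variable-length list of floats
dimensions:
  station = 3 ;
variables:
  float lat(station) ;
  float lon(station) ;
  profile_t temperature(station) ;  // one ragged profile per station
}
```

Each element of 'temperature' can then have a different length, with no count or index variables needed.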
As a watershed modeler working on a model discretized to watershed polygons, I want to store the polygons and model state variables for a model run or collection of model runs, so my model archive can be shared with others in a fully described single file.
This is envisioned as a minor extension to the discrete sampling geometry station data type. In the station data type, the station is at a lat/lon. In this, the 'station' would be a polyline or a linear ring defining a polygon.
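One possible encoding, borrowing the contiguous ragged array idea from the DSG chapter, is sketched below in CDL. All names here are hypothetical, purely to show the shape of the data:

```
netcdf polygon_sketch {
dimensions:
  station = 2 ;   // one 'station' per watershed polygon
  node = 7 ;      // total boundary nodes over all polygons
variables:
  int node_count(station) ;   // nodes used by each polygon
  float node_lat(node) ;      // polygon boundary nodes, contiguous
  float node_lon(node) ;
  float storage(station) ;    // an example model state variable
}
```

The 'station' would then carry the per-polygon model state, while the node arrays define the linear ring of each polygon.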
I suggest we rename the repo to CF-2, to avoid the connotation of only a specific version.
If this notification makes it to anyone who objects -- please object. Otherwise, I will archive the next time I decide to clean up GitHub stuff.
From: John Caron
Background:
In the classic model, data using the "byte" data type are interpreted as signed when converting. However, the byte data type is sometimes used for unsigned data. Unidata introduced the "_Unsigned" attribute to allow the user to specify this. Not all libraries look for this attribute.
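The ambiguity is easy to demonstrate. The sketch below (plain Python, not a netCDF reader) shows the same 8 bits widened as signed versus unsigned, mimicking a reader that honours, or ignores, the "_Unsigned" attribute:

```python
# One byte with all bits set, interpreted two ways.
raw = b"\xff"

as_signed = int.from_bytes(raw, byteorder="big", signed=True)
as_unsigned = int.from_bytes(raw, byteorder="big", signed=False)

print(as_signed, as_unsigned)  # -1 255
```

A library that does not check "_Unsigned" will silently deliver -1 where the producer meant 255.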
Sometimes the "char" data type is intended to mean unsigned byte data. More typically it is used for encoding text data, but the character encoding is undefined. Probably "printable ASCII" is a reasonable assumption. Char data are fixed length arrays only, and one must specify the length using a global, shared dimension, which is unneeded and clutters the dimension namespace.
The NetCDF-4 enhanced model adds Strings and unsigned integer types, so we have the opportunity to clarify. Lots of work on character encodings has been done in the last 20 years with Unicode, and we should leverage that. UTF-8 is a variable-length encoding of Unicode that has ASCII as a subset, allows any language to be encoded, and has become the dominant encoding on the web. NetCDF libraries assume Strings are UTF-8 encoded. If your text is ASCII, you are using UTF-8 already.
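The "ASCII is a subset of UTF-8" point is easy to verify, for instance in Python:

```python
# Pure-ASCII text encodes to byte-for-byte identical output
# under ASCII and UTF-8.
ascii_name = "ozone_column"
assert ascii_name.encode("utf-8") == ascii_name.encode("ascii")

# Non-ASCII characters simply take more bytes per character.
chi_squared = "\u03C7\u00B2"               # the string "χ²"
utf8_bytes = chi_squared.encode("utf-8")   # b'\xcf\x87\xc2\xb2'
assert utf8_bytes.decode("utf-8") == chi_squared
```

The two-character name "χ²" occupies four bytes in UTF-8, but round-trips without loss.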
Also see:
Developing Conventions for NetCDF-4 : Use of Strings
Proposal:
Note: This remark was originally part of issue #4. It was split off to separate the discussions.
In issue #4 references between variables are discussed. These occur in space separated lists, and as such they put a limitation on the allowable characters in object names.
NetCDF-4 allows an object name like "χ²" (\u03C7\u00B2). This is probably very good for human readability, but accessing such a field from a program or script (non-interactively) is probably quite hard.
To get the string into this text file I went into an interactive python3 shell and asked it to print("\u03C7\u00B2"); the code points themselves were obtained from a website. Other computer systems may offer more convenient access.
For some of the terminology refer to “CDM Object Names” and the discussion summary of issue #4.
In particular, an “object name” refers to the short name of either a group, variable, dimension, attribute, type definition or field in a compound datatype.
The short names of NetCDF-4 objects must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*. This means that the name of a NetCDF-4 object can be used as a variable name in most programming languages.
If there is a need to attach a name to a variable that requires the use of other characters, then an attribute can be used. A possible solution would be an attribute with the name "local_name_zh_CN" for a Chinese localisation. The end of the attribute name follows ISO 639-1/ISO 639-2.
Now all we need is a code for “mathematical localisation”.
CF-1.6 says: "The dimensions of a variable must all have different names."
Why is a dimension name (that is, a named dimension) necessary?
Isn't the 'coordinates' attribute on a dataset sufficient?
I hope CF-2.0 drops this requirement, given that backward compatibility (with COARDS) is not an issue.