rubisco-sfa / ilamb-data Goto Github PK

A collection of scripts used to format ILAMB data and community portal to make contributions

Python 100.00%

ilamb-data's Introduction

ILAMB-Data

This repository stores the scripts used to download observational data from various sources and format it into a CF-compliant netCDF4 file which can be used for model benchmarking via ILAMB.

Please note that the repository contains no data. If you need to download our observational data, see the ilamb-fetch tutorial. This collection of scripts is to:

archive how we have produced the model comparison datasets with ILAMB
expose the details of our formatting choices for transparency
provide the community a path to contributing new datasets as well as pointing out errors in the current collection

Contributing

If you have a suggestion or issue with the observational data ILAMB uses, we encourage you to use the issue tracker associated with this repository rather than that of the ILAMB codebase. This is because the ILAMB codebase is meant to be a general framework for model-data intercomparison and is ignorant of observational data sources. Here are a few ways you can contribute to this work:

Debugging

If you notice an irregularity/bug/error with a dataset in our collection:

Raise an issue with the dataset name included in the title (e.g., "Netcdf read error in wang2021.nc") for record keeping and discussion
Tag the issue with bug
(Optional) Fork the ILAMB-Data repo and fix the erroneous convert.py
(Optional) Submit a pull request for our review

Suggesting Datasets

If you know of a dataset that would be a great addition to ILAMB:

Raise an issue with the proposed dataset name included in the title (e.g., New Global Forest Watch cSoil dataset).
Tag the issue with new dataset.
Provide us with details of the dataset as well as some reasoning for the recommendation; consider including hyperlinks to papers, websites, etc.
(Optional) Fork the ILAMB-Data repo, create a new directory named after the dataset (e.g., GFW), and create a convert file to preprocesses and formats the data for ILAMB.
(Optional) Submit a pull request with the new directory and convert script for our review.

See below for specific guidelines on adding new datasets

Dataset Formatting Guidelines

We appreciate the community interest in improving ILAMB. We believe that more quality observational constraints will lead to a better Earth system model ecosystem, so we are always interested in new observational data. We ask that you follow this procedure for adding new datasets:

Before encoding the dataset, search the open and closed issues in the issue tracker. We may already have someone assigned to work on this and do not want to waste your effort. Or, we have considered adding the dataset and reasoned against it after discussion.
If no open or closed issue is found, raise a new issue with the new dataset name in the title, and be sure to add the new dataset tag.
Create a new directory to work in. We generally name it after the folks/project who made the dataset; name it whatever you like.
Write the conversion (e.g., convert.py) file inside the folder you created, which (optionally) downloads the dataset; loads the dataset; formats it into a netcdf that follows updated CF Conventions; and, if a gridded dataset, it's helpful to resample to 0.5 degrees (EPSG:4326). Lastly, try to format variable names and units accoding to the accepted MIP variables for easier model comparison.
Submit a pull request for us to review the script and outputted dataset.

You may use any language you wish to encode the dataset, but we strongly encourage the use of python3. You can find examples in this repository to use as a guide. The GFW convert.py is a recent gridded data example, and this Ameriflux convert.py is a recent point dataset example. See this tutorial for help, and feel free to ask questions in the issue you've created for the dataset.

Once you have formatted the dataset, we recommend running it against a collection of models, along with other relevant observational datasets using ILAMB. There are tutorials to help you do this. This will allow the community to evaluate the new addition and decide if or how it should be included in the curated collection.
After you have these results, consider attending one of our conference calls. Here, you can present the results of the intercomparison, and the group can discuss.

ilamb-data's People

Contributors

Stargazers

Watchers

Forkers

nocollier mmu2019 wujianzhao

ilamb-data's Issues

Carbon accumulation potential from regrowth

We have a lack of land user and land cover change data sets in ILAMB. We should consider various data sets available and the kind of different metrics required to use them. One data set that looks interesting is this global 1km x 1km 30-year forest regrowth potential carbon accumulation data set available at https://data.globalforestwatch.org/documents/f950ea7878e143258a495daddea90cc0/about

The related paper is at https://www.nature.com/articles/s41586-020-2686-x

This data set was identified in an NGEE Tropics All Hands Meeting breakout session focused on land use on November 15, 2023.

IPCC Regions

There are new regions being used in AR6 that we should add to ILAMB.
Inne Vanderkelen (@Ivanderkelen) provided a script which encodes them in an ILAMB-ready fashion (thanks!). They have a paper as well describing their definition:

https://essd.copernicus.org/preprints/essd-2019-258/

The regions are part of a work supported at this github repository:

https://github.com/SantanderMetGroup/ATLAS/tree/master/reference-regions

Flavio Runoff Sensitivities

confirm biomass units as carbon versus biomass?

Hi All -- I was just staring at the various biomass datasets and am struck by the large difference in slope between the ESA Biomass_cci product and others when plotted against each other. One tricky thing is that sometime biomass is reported in carbon units and sometimes in total biomass, with a factor of two difference. It strikes me that the offset would possibly disappear if the ESA product was actually in units of total biomass rather than carbon. Looking at the variable description at https://catalogue.ceda.ac.uk/uuid/5f331c418e9f4935b8eb1b836f8a91b8 strikes me that that is possibly the case. Is that how ILAMB is currently interpreting the data, or is it converting from biomass to carbon already?

Conserving Land-Atmosphere Synthesis Suite (CLASS) v1.1

Gab Abramowitz's group has created a globally gridded dataset which simultaneously balances water and energy while also providing estimates of uncertainty based on agreement with site measurements. We are actively working on adding this dataset to ILAMB and also adapting the methodology to make use of the uncertainty measurements.

This issue is meant to represent current progress and provide a location for further comment. We have:

Code which automatically downloads and formats the variables.
A branch in the main ILAMB repository which uses the uncertainty estimates and compares to the old methodology. This is currently in development.
A webpage where we have the current methodology across the collection of CMIP6 models.

There are a number of open questions to address:

There is a dataset for the water storage variable that currently is not included. We need a mechanism for skipping the bias score in this new codebase as the storage variable is an anomaly which has a mean of zero.
There is also a ground heat flux variable provided. I am not sure what variable this maps to in the CMIP table or if models provide it to ESGF.
We are currently including notions of uncertainty in the scoring of bias and RMSE. Could we develop other areas of the score (phase, spatial distribution, interannual variability) to also make use of the uncertainty?

New harmonized global above and belowground biomass carbon density data

A new synthesized dataset provides separate maps of global above and belowground biomass carbon density estimates across a wide range of vegetation types in 2010 at 300m resolution with quantified uncertainty. The relevant paper, published April 6, 2020, is:

Spawn, Seth A., Clare C. Sullivan, Tyler J. Lark, and Holly K. Gibbs (2020), Harmonized global maps of above and belowground biomass carbon density in the year 2010, Sci Data, 7, 112, doi:10.1038/s41597-020-0444-4

The authors compare these new estimates with the IPCC Tier-1 live C reported by Ruesh and Gibbs (2008) for year 2000, but they caution against using these differences as an indication of biomass change across that period since the products were produced by different methods.

Discussion Points for 2023/03/08 Meeting

The new methodology using regional quantiles to normal is now fully implemented for bias and RMSE. Results are up here.
- This is not yet the default methodology, but can be run by adding commandline options to ILAMB. Will make it default in an upcoming 2.7 release once stable
- CTSM was interested in looking at the new methodology on new runs, waiting for feedback
- Currently seeing some stability problems, but OLCF has had issues lately
- Many of the new biomass datasets are not in the CMIP5v6 because they start beyond 2005
Running tests on the CMIP6 collection with the added biomass datasets (scale factor to account for total/carbon and total/AGB included).
Kuang's methane is now in ILAMB-Data, the CMIP variable is wetlandCH4. However, it sounds like the model variable is the CH4 out of the welands, averaged over the grid cell, not the wetland area of the grid cell. Kuang's product is the mean methane from wetland areas. A few CMIP6 models do have the variable. #22
Gretchen and I have a plan to get her student's CO2 metric into ILAMB, should be a simple addition, talk forthcoming on a science friday
Preparing capability to participate in the upcoming ESS-CI ELM / ATS / ILAMB hackathon. Adding a model result that is a collection of observation points (hydrogram) from a high resolution regional model.
With Elias, we have been working on evaluating/developing AI stomatal conductance models and seeing how they perform inside the photosynthesis solve as implemented in MAAT. Still a work in progress but we are trying to evaluate science improvements as well as computational effects. Current progress.

Add ACCESS-ESM to the CMIP6 comparison results

Hi,

I would like to add the ACCESS-ESM model to the CMIP6 results shown here.
Where can I find the ilamb configuration files used to run the comparison?

Thanks for your help,

Romain

Wrong Jung et al. reference for FLUXCOM datasets

It appears that the preferred references for FLUXCOM datasets are:

Jung, M., S. Koirala, U. Weber, K. Ichii, F. Gans, Gustau-Camps-Valls, D. Papale, C. Schwalm, G. Tramontana, and M. Reichstein (2019), The FLUXCOM ensemble of global land-atmosphere energy fluxes, Scientific Data, 6(1), 74, doi:10.1038/s41597-019-0076-8.

Tramontana, G., M. Jung, C.R. Schwalm, K. Ichii, G. Camps-Valls, B. Raduly, M. Reichstein, M.A. Arain, A. Cescatti, G. Kiely, L. Merbold, P. Serrano-Ortiz, S. Sickert, S. Wolf, and D. Papale (2016), Predicting carbon dioxide and energy fluxes across global FLUXNET sites with regression algorithms, Biogeosciences, 13, 4291-4313, doi:10.5194/bg-13-4291-2016.

Our data information includes the correct Tramontana et al. (2016) reference, but points to a less-useful Jung et al. (2017) reference instead of the Jung et al. (2019) reference above.

Our FLUXCOM datasets should be regenerated with the correct Jung et al. (2019) reference, replacing Jung et al. (2017).

ESA Biomass Climate Change Initiative (Biomass_cci)

Global datasets of forest above-ground biomass for the years 2010, 2017 and 2018, v2

From description:

The maps are derived from a combination of data, depending on the year, from the Copernicus Sentinel-1 mission, Envisat’s ASAR instrument and JAXA’s Advanced Land Observing Satellite (ALOS-1 and ALOS-2), along with additional information from Earth observation sources.

more info here: https://climate.esa.int/en/news-events/maps-improve-forest-biomass-estimates/

data available at: http://dx.doi.org/10.5285/84403d09cef3485883158f4df2989b0c

GCP nbp

The current version of nbp that we use from GCP is 2015 and has no uncertainties encoded. We should update this dataset and also encode the uncertainties.

GEOCARBON biomass

Global.Carbon as far as I know has no reference beyond Saatchi's paper for the Tropical dataset. We should drop it and re-encode GEOCARBON:

https://www.wur.nl/en/Research-Results/Chair-groups/Environmental-Sciences/Laboratory-of-Geo-information-Science-and-Remote-Sensing/Research/Integrated-land-monitoring/Forest_Biomass.htm

This product is partially based on Saatchi's pan-tropical dataset and also on a Boreal biomass map from MPI.

Add Terrestrial Coupling Index

Computed as:

CI(x) = covar( SM(t,x), SHFLX(t,x) ) / sigma(SM(t,x)

Daily averaging period matters in this case. I am curious why not normalize completely by also dividing through by sigma_SHFLX? Then you have Pearson's correlation coefficient.

We have dabbled with daily data, but nothing that has made it into something we compute regularly. It is more of an organizational problem and nothing we cannot remedy. Seasonal averaging is just not added (we haven't used it) and nothing we cannot do quickly.

I would recommend that we make a Confrontation for this. I have been long working on a new version of ILAMB based on xarray. It has just taken a long time to re-implement everything. One approach I am now using is to add new things to ILAMB in the new xarray way when possible and then create a backward compatibility wrapper to the old version. I can do this pretty quickly.

Fluxnet data is a pain to work with. The following snippet combines a few columns from the monthly data into a single dataframe and then xarray dataset with arrays that are of dimension (time, ndata). It doesn't have all the metadata yet, but you may find it useful.

DFS = []
include = ["TIMESTAMP", "SWC_F_MDS_1", "TA_F_MDS", "GPP_NT_VUT_MEAN"]
for fname in glob.glob("*FULLSET_MM*"):
    read = pd.read_excel if fname.endswith(".xlsx") else pd.read_csv
    cols = read(fname, nrows=0).columns
    if set(include).difference(cols):
        continue
    site = fname.split("_")[1]
    df = read(fname, na_values=-9999)[include]
    df = df.dropna()
    df["TIMESTAMP"] = pd.to_datetime(df["TIMESTAMP"], format="%Y%d")
    df["SITE"] = site
    df = df.set_index(["TIMESTAMP", "SITE"])
    DFS.append(df)

df = pd.concat(DFS)
ds = df.to_xarray()
ds.to_netcdf("Fluxnet2015.nc")

Snow off date

Received this possible addition to ILAMB from @wwieder.

https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1711

I am going to start logging dataset suggestions here as 'enhancement' issues so we can discuss and also not lose them. This particular one would require special case analysis code which I am fine to write if we can agree on what makes sense in the comparison.

Saatchi Global Biomass

https://www.science.org/doi/full/10.1126/sciadv.abe9829

Scoring Changes to NSIDC Permafrost Map

Our current methodology scores the intersecting area, normalized by the reference and model extent separately. To score the permafrost extent in the reference, but not in the model (so-called 'missed' area), we write:

score_missed = intersection / reference
             = intersection / ( intersection + missed )

As missed -> 0 then score_missed -> 1 and as intersection -> 0 then score_missed -> 0 (missed itself would be large in this case avoiding division by 0). We score the model extent that is not in the reference (so-called 'excess') in the same manner, but normalizing by the model area instead.

One problem @dlawrenncar noticed, is that a lot of a model's missed area could be discontinuous permafrost (see the CMIP5v6 comparison). While it is in one sense reasonable to expect a model to predict permafrost when 50-90% of the grid cell is covered (a tricky concept in our models), in another sense a model missing continuous permafrost is a much more serious problem. It would be helpful if we could adapt our scores to reflect this.

Permafrost Extent Venn Diagram

Note: The sliver marked 'not land in the reference' is removed from the extent calculations. This is area that usually is along coastlines where due to the coarser resolution of models ends up in the ocean of the reference data.

Add WECANN GPP as an alternate dataset

The WECANN data provides an alternate global dataset for GPP and turbulent fluxes, which would be nice to have in ILAMB.

Alemohammad, S. H., B. Fang, A. G. Konings, F. Aires, J. K. Green, J. Kolassa, D. Miralles, C. Prigent, and P. Gentine (2017), Water, Energy, and Carbon with Artificial Neural Networks (WECANN): A statistically based estimate of global surface turbulent fluxes and gross primary productivity using solar-induced fluorescence, Biogeosciences, 14(18), 4101–4124, doi:10.5194/bg-14-4101-2017.

Data available at https://avdc.gsfc.nasa.gov/pub/data/project/WECANN/

GIMMS lai

https://essd.copernicus.org/preprints/essd-2023-68/essd-2023-68.pdf

Discussion Points for Group Meeting 06/01/2022

Enumerating points of discussion for a meeting on 06/01/2022:

New datasets are added (CLASS, Davies-Barnard, ESACCI, WECANN, WangMao) see ILAMB results.
Should we compare the biomass datasets to 0.8 * cVeg * treeFrac instead of just cVeg. See more discussion in #19. The 0.8 factor comes from an estimation that the below to above ground mass is roughly 20%. Is there a proper reference for that number? The other thing that becomes difficult is that not all models give treeFrac. I found that the BCC, NorESM, MIROC for CMIP6 do not. So then, how do we communicate that for some models you are seeing a comparison to cVeg and some this other factor? Maybe we can come up with a method.
Should the CLASS versions of hfls and mrro deprecate DOLCE and LORA, respectively? As I understand, they are built on the same methodology with the exception that the CLASS versions are modified with the additional constraint that the budget is closed.
Note that we have found that the CLASS net surface radiation is almost universally low relative to other datasets and the CMIP6 models we run against.
When Yaoping and Jiafu presented their soil moisture ILAMB addition a while back, we concluded that we wanted to see the analysis with respect to more models. We have the current results for ELM uploaded here because we also wanted to think about some of the new scoring aspects that have been introduced. Any update on progress? @ypwong22

Add Critical Soil Moisture

As part of a push to develop ILAMBv3, I had implemented the critical soil moisture (CSM) metric. See code and results. We would like to port it to current ILAMB. The current version just implemented the calculations and thus is more of a model characteristic that is compared to others. Ideally we would like to compare this to reference data. Paul's cheatsheet has a good discussion of what kinds of observations could be used.

To do:

MODIS lai

Reported at AGU 2019 by Christian Seiler, the MODIS lai we have encoded is a single year replicated to extend from 2000 thru 2005. I am given to understand that our original intent was to have this be more of a representative year.

If this is still the desired intent, the dataset should be formatted as a climatology. ILAMB can understand annual cycles and compute an annual cycle across a span of years indicated in the file.

However, is there a good reason we shouldn't just compare MODIS lai at the years encoded in the data?

Discussion Points for 2022/08/10 Group Meeting

Progress on soil moisture in high latitudes? @jiafumao
I have reviewed @ypwong22 pull request which includes some changes to the ILAMB library. We need to step through this carefully as there is some logic which assumes things like noleap calendars which won't work on other models.
Are our biomass datasets consistently reporting carbon and not total biomass? See #35 which @ckoven brings up. It seems the new ESACCI product is total which would explain the factor of ~2 difference in the comparison plots. We should check our other products, volunteers?
I have re-encoded the NCSCD data to produce two products cSoil and cSoilAbove1m. There is a difference in this and the NCSCDV22 product that we encoded long ago, seems to be from interpolation. Is the spatial variability so high that we shouldn't regrid?
There are other datasets available as per this ESSD paper. Should we chase these down and encode?

Biomass dynamics by tree size

It would be good to get some benchmarks into ILAMB that assess how well vegetation demographic models are capturing tree demography. As a starting point, Piponiot et al. 2022 have data on aboveground biomass, aboveground woody productivity, and aboveground woody mortality by DBH (diameter at breast height) class for 25 plots across tropical and temperate forests. Although it a relatively small set of sites, it could act as a template for additional datasets.

Paper - https://nph.onlinelibrary.wiley.com/doi/10.1111/nph.17995

Supporting Information Dataset 2 - https://nph.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1111%2Fnph.17995&file=nph17995-sup-0002-DatasetS2.xlsx

Global Forest Age Dataset (GFAD)

Would be cool to add the Global Forest Age Dataset (GFAD) to ILAMB, to include for disturbance-resolving models. Data available at: https://doi.pangaea.de/10.1594/PANGAEA.897392

Umakant's soil carbon

https://datadryad.org/stash/dataset/doi:10.7941/D1GD1H

New/Replacement Harmonized World Soils Database version 2.0

A new version of HWSD v.2 was released in 2023, with a minor update to 2.01 in September 2023. It is described at https://gaez.fao.org/pages/hwsd

The report for the data is available at https://www.fao.org/3/cc3823en/cc3823en.pdf

LandTrendr Landsat-based 30m annual AGB dataset

Moving this from Slack to here

Dataset location
ftp://islay.ceoas.oregonstate.edu/cms

[sserbin@modex LandTrendr_AGB_data]$ ncdump -h biomassfiaald_1984_median.nc
netcdf biomassfiaald_1984_median {
dimensions:
    x = 153809 ;
    y = 96684 ;
variables:
    char albers_conical_equal_area ;
        albers_conical_equal_area:grid_mapping_name = "albers_conical_equal_area" ;
        albers_conical_equal_area:false_easting = 0. ;
        albers_conical_equal_area:false_northing = 0. ;
        albers_conical_equal_area:latitude_of_projection_origin = 23. ;
        albers_conical_equal_area:longitude_of_central_meridian = -96. ;
        albers_conical_equal_area:standard_parallel = 29.5, 45.5 ;
        albers_conical_equal_area:long_name = "CRS definition" ;
        albers_conical_equal_area:longitude_of_prime_meridian = 0. ;
        albers_conical_equal_area:semi_major_axis = 6378137. ;
        albers_conical_equal_area:inverse_flattening = 298.257222101004 ;
        albers_conical_equal_area:spatial_ref = "PROJCS[\"NAD83_Conus_Albers\",GEOGCS[\"NAD83\",DATUM[\"North_American_Datum_1983\",SPHEROID[\"GRS 1980\",6378137,298.2572221010042,AUTHORITY[\"EPSG\",\"7019\"]],AUTHORITY[\"EPSG\",\"6269\"]],PRIMEM[\"Greenwich\",0],UNIT[\"degree\",0.0174532925199433],AUTHORITY[\"EPSG\",\"4269\"]],PROJECTION[\"Albers_Conic_Equal_Area\"],PARAMETER[\"standard_parallel_1\",29.5],PARAMETER[\"standard_parallel_2\",45.5],PARAMETER[\"latitude_of_center\",23],PARAMETER[\"longitude_of_center\",-96],PARAMETER[\"false_easting\",0],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]]]" ;
        albers_conical_equal_area:GeoTransform = "-2356065 30 0 3172575 0 -30 " ;
    double x(x) ;
        x:standard_name = "projection_x_coordinate" ;
        x:long_name = "x coordinate of projection" ;
        x:units = "m" ;
    double y(y) ;
        y:standard_name = "projection_y_coordinate" ;
        y:long_name = "y coordinate of projection" ;
        y:units = "m" ;
    short Band1(y, x) ;
        Band1:long_name = "GDAL Band Number 1" ;
        Band1:_FillValue = -9999s ;
        Band1:grid_mapping = "albers_conical_equal_area" ;

// global attributes:
        :GDAL_AREA_OR_POINT = "Area" ;
        :Conventions = "CF-1.5" ;
        :GDAL = "GDAL 2.3.1, released 2018/06/22" ;
        :history = "Thu Jan 24 15:49:02 2019: GDAL CreateCopy( biomassfiaald_1984_median.nc, ... )" ;
}

Converting from GeoTiff to netCDF
gdal_translate -of netcdf biomassfiaald_1984_median.tif biomassfiaald_1984_median.nc

These are large files, and we may want to instead start by extracting individual point time-series

My comment from Slack

"At the CMS meeting I brought up that their data would be good for helping to benchmark models but that they 1) tend not to provide the data in a manner easy to use as benchmarks, 2) the data isn't available, 3) they need uncertainties. I was thinking this would be a good example case because it has all three. I was thinking 1) we could do what you say, I can extract data points from grid cells at the AMF sites to allow us to evaluate the fluxes AND test against aggregate annual biomass patters between model and data, and 2) I can aggregate up to a coarser scale to use as a gridded benchmark. We can use this to illustrate how CMS can use ILAMB more; I am thinking this could actually be a good proposal. Happy to discuss another time but I may tinker with creating single-point .nc timeseries that ILAMB could use. But one key thing between this and the site benchmarking is adding in the handling of model/data uncertainties in the analysis, something we have discussed before and you have been working on"

ESA Active Layer Thickness

As mentioned on our last dev call, here is the data reference. @dlawrenncar @ckoven

https://catalogue.ceda.ac.uk/uuid/67a3f8c8dc914ef99f7f08eb0d997e23

Soil Carbon from SoilGrids

It was suggested that we look at SoilGrids to see how it could be made into a cSoil product for ILAMB. It is part of this comparison paper, the data is linked to therein (250m resolution).

Discussion Points for 2022/07/27 Group Meeting

Any response from Gab about CLASS low net radiation? Should we drop DOLCE and LORA? @dlawrenncar (see email response below)
For the higher dimensional soil moisture product, @ypwong22 is working on adding models. She has some more CMIP6 models, we are working on getting the results uploaded where we can discuss them.
Had an email about what we call the Global.Carbon biomass product. The reference we provide discusses tropical biomass, but this is a global dataset. If you see here though, Global.Carbon and Tropical are different. I thought we had dropped this dataset? We need to restore GEOCARBON and Saatchi's Global dataset now has a reference.
I have an initial work on an adaptation of Umakant's soil carbon data. He has several layers of data, but two I think are particularly useful to us: cSoilAbove1m and cSoil which is the soil carbon above 3m. Do we compare separately to both? Umakant's estimates are high relative to the other products we have and even correlate badly with respect to the NCSCDV22. How could we check that my coarsening strategy is reasonable?
See Issue #30 about the GBAF datasets. A while ago we dropped them for the FLUXCOM product. I assumed that this was a name change but am I wrong? I grabbed the neural net product, but there are others generated with other techniques. I am not inclined to encode each of their products.
I have pushed on our new scoring methodology, score = | 1 - error / bad_biome_error |, clipped to be on [0,1]. I have a comparison of the new vs old scores if the quantile used to define the bad_biome_error is changed. I also produced a CMIP5v6 comparison. The 98th quantile is a defensible choice, but leads to scores which use very little of the [0,1] range. For that reason it seems better to use a lower quantile, but which one and what principle do we use to justify the choice?
Any updates to our project board? Progress made on any dataset? New suggestions? Anyone want to take one on?
Last meeting, we had some unresolved discussion about the surface soil moisture from WangMao. In particular, CESM2 and NorESM are quite wet in high latitudes relative to the data product and we wondered if this effect is real, especially given the caution provided in Dirmeyer2016 to combine soil moisture data carefully. Here is the dataset publication Wang2021. @jiafumao

Restore the pr vs mrro ratio over basins comparison

Mingquan's original work is cached here. Perhaps in place of scatter diagrams of model/obs ratios, we could color basins in a map like we do with GRACE?

Wang2021 Surface Soil Moisture

We could extract the top surface of the WangMao mrro dataset and compare it to mrsos in the current ILAMB methodology. See their publication.

New eddy covariance (EC) data collection for high latitudes

A fairly recent collection of EC site data, called the ABCflux database, has been synthesized and published:

Virkkala A-M, SM Natali, BM Rogers, JD Watt, K Savage et al. 2022. The ABCflux database: Arctic-boreal CO2 flux observations and ancillary information aggregated to monthly time steps across terrestrial ecosystems. Earth System Science Data 14: 179-208. https://doi.org/10.5194/essd-14-179-2022.

This collection is of particular interest for site-level evaluation in projects like DOE's NGEE Arctic and other projects with a high latitudes focus. We should incorporate this database into ILAMB.

CERES Albedo has wrong unit

Shouldn't this line:

https://github.com/mmu2019/Datasets/blob/master/read-albedo-ceres.py#L164

be 1?

Data issue of `DATA/albedo/CERESed4.1/albedo.nc`

Hi, Rubisco team:
we get ILAMB-DATA by ilamb-fetch and found dataset DATA/albedo/CERESed4.1/albedo.nc has corrupted. More specifically, the image we plot is like this:

Try to convert the naming standard of ERA5 to CMIP

Recently we try to use EAR5 data as our observational dataset to do some confrontation with our own model results. But I found that the way ERA5 used to name there variables is different than what ilamb using now(this is the ERA5 intro: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation). I just what to ask if anyone have the experience to use ERA5 data or know how to convert the naming standard of ERA%5 to what we using now. Thanks

Gridded SOC maps for CONUS and Mexico

Soil Organic Carbon Across Mexico and the Conterminous United States (1991–2010)

https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019GB006219

Migrate ILAMB v3 reformatted datasets into this repository

Mingquan Mu (@mmu2019) has worked to reformat and update many of the datasets we have used in ILAMB for a long time. These should be slowly migrated to this repository and included in our curated collection. I recommend creating a pull request with one code added such that we can discuss changes to make the rest of the imports smoother.

FLUXNET-CH4 eddy covariance dataset

Chang, K.-Y., Riley, W., Collier, N., McNicol, G., Fluet-Chouinard, E., Knox, S., Jackson, R., Poulter, B., Saunois, M., Zhu, Q.: Multi-model ensemble does not fill the gaps in sparse wetland methane observations, In review, Global Biogeochemical Cycles.

Test

Davies-Barnard Biological Nitrogen Fixation

It has been suggested that we implement T's Biological Nitrogen Fixation dataset.

https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019GB006387

NASA Gridded SWE and snow depth

Suggestion from Will Weider @wwieder

Data citation:
P. Broxton, X. Zeng, N. Dawson, Daily 4 km Gridded SWE and Snow Depth from Assimilated In-Situ and Modeled Data over the Conterminous US, Version 1. Boulder, Colorado USA. NASA National Snow and Ice Data Center Distributed Active Archive Center. https://doi.org/10.5067/0GGPB220EX6A. NASA National Snow and Ice Data Center Distributed Active Archive Center

X. Zeng, P. Broxton, N. Dawson, Snowpack Change From 1982 to 2016 Over Conterminous United States. Geophysical Research Letters 45, 12,940-912,947 (2018).

The data only covers CONUS, but is relatively high resolution (4km) and at a daily interval. For direct use in the current ILAMB methodology, we should coarsen to monthly. Alternatively we could write custom analysis to make use of the higher temporal information. If the appropriate variable for snow thickness is sisnthick, then there are several models in ESGF with daily historical data. I think the same variable on a monthly scale is known as snd and there are models with this data.

Lai

Check that we fixed MODIS, I think not, also, check if we can encode their std.

CLASS lost its reference information

In my haste to fix some formatting, I neglected to port the reference information again.

Soil Moisture Datasets

Global

SMAP Level 4 (Reichle et al. 2022. https://nsidc.org/data/spl4smgp/versions/7 and https://nsidc.org/sites/default/files/documents/technical-reference/smap_l4v7_sm_assessmtreport.pdf)
ERA5 (https://cds.climate.copernicus.eu/cdsapp#/dataset/reanalysis-era5-single-levels-monthly-means?tab=overview, June 2021, Copernicus Climate Change Service, 2017)
MERRA2 (https://disc.gsfc.nasa.gov/datasets?project=MERRA-2, GMAO 2015a, b, c)
ESA-CCI (https://esa-soilmoisture-cci.org/, Dorigo et al., 2017; Gruber et al., 2019; Preimesberger et al., 2021)
Wang2021 (Wang, Y., Mao, J., et al. 2021. https://doi.org/10.5194/essd-13-4385-2021. https://doi.org/10.5194/essd-13-4385-2021 )
Fang et al 2021 (A global 1-km downscaled SMAP soil moisture product based on thermal inertia theory. DOI: 10.1002/vzj2.20182

Continental

US dataset (Ning, Fisher) https://doi.org/10.1016/j.jhydrol.2023.130010

New Northern Hemisphere Permafrost Map

I found a new permafrost map in a June 2019 paper, titled “Northern Hemisphere permafrost map based on TTOP modelling for 2000–2016 at 1 km² scale”, at https://doi.org/10.1016/j.earscirev.2019.04.023

The data are from AWI at https://doi.pangaea.de/10.1594/PANGAEA.888600?format=html#download

We should evaluate the utility of using this product, and start by comparing it with the existing map we have in ILAMB.

Contiguous United States forest biomass from Oregon State

On behalf of @mmu2019:

I have contiguous United States forest biomass from Oregon State. This data
is annual data with time series of 34 years from 1984 until 2017, 30-meter
spatial resolution. I think someone knows more about this data, but I can't
be sure if it is worth adding this data in our ILAMB system.

missing GBAF datasets

The ILAMB data repo does not have GBAF latent and sensible heat flux datasets.

Licensing information in datasets metadata

Hi All,

First of all, thanks for putting the ILAMB-data together, that's a great ressource.

I am working on some use cases for the ACCESS / CABLE community here in Australia.
One of my TO-Dos is to make ILAMB and the ILAMB datasets available at NCI.

I looked at the datasets and could not find information on licensing in the metadata.
Is it recorded somewhere?

Decision on datasets to include

We have formatted several new datasets but have not included them into our 'canon'. For those datasets which have alternatives already in our collection, I have a set of pictorial comparisons. For many of these, I also have a test ILAMB run against CESM2.

Another question: what does it mean to be in our 'canon'? I support a few ILAMB configuration files, but the main one we use is cmip.cfg, the configuration of the CMIP5v6 comparison. However, for that comparison we dropped the burned area dataset as so few models include this and CanSISE for reasons that I do not totally recall. So there is a danger that we 'lose' datasets. I propose that we support 2 configurations:

ilamb.cfg where we put all datasets that we have prepared that we deem are useful for confrontation land models.
cmip.cfg which is a subset that includes datasets we wish to present in a CMIP5/6 comparison.

This would also give us a criteria for which we decide to include datasets. Another issue is that we need to agree on weights to assign each dataset based on our rubric.

See Project Board.