xagg's Introduction

XAGG

XML Aggregator for CMIPx data

Dependencies:

python3, scipy, joblib, scandir, cdscan (part of cdms2)

Setup:

  • Create anaconda environment with dependencies (creates a Py3 environment by default)
conda create -y -n xagg -c conda-forge -c cdat/label/v8.2.1 "libnetcdf=*=mpi_openmpi_*" "mesalib=18.3.1" "python=3.7" cdat cdms2 joblib scandir scipy
  • Download local tables database
cd tools
./updateTables.sh
cd ..

Run software:

  • Execute test cases
./xagg.py -f mon -v tas --outputDirectory ~/tmp > log.txt

Or

./xagg.py -f mon -v tas -e hist-CO2 --outputDirectory ~/tmp --updatePaths False > log.txt
  • Execute complete scan
./xagg.py --outputDirectory ~/tmp > log.txt

xagg's People

Contributors

durack1, pochedls

Forkers

bonfils2

xagg's Issues

Update ocean vars scanned

Adding some new variables:

mlotstmax {Omon}: Maximum Ocean Mixed Layer Thickness Defined by Sigma T [mon: Temporal Maximum, Global field (single level) [XY-na] [tmax]] (2)
mlotstmin {Omon}: Minimum Ocean Mixed Layer Thickness Defined by Sigma T [mon: Temporal mean, Global field (single level) [XY-na] [tmin]] (2)
mlotstsq {Omon}: Square of Ocean Mixed Layer Thickness Defined by Sigma T [mon: Temporal mean, Global field (single level) [XY-na] [amse-tmn]] (5)
omldamax {Oday}: Mean Daily Maximum Ocean Mixed Layer Thickness Defined by Mixing Scheme [day: Temporal Maximum, Global field (single level) [XY-na] [tmax]] (1)
sosga {Odec}: Global Average Sea Surface Salinity [dec: Temporal mean, Global mean/constant [na-na] [amse-tmn]] (1)

http://clipc-services.ceda.ac.uk/dreq/index/CMORvar.html

HighResMIP

Hi @durack1 @pochedls @gleckler1,

I see HighResMIP data is available in the PCMDI scratch at /p/css03/esgf_publish/CMIP6/HighResMIP, and I am wondering if xml aggregation for HighResMIP can be added, if it is easy enough. @gleckler1 and I are going to help Cheng analyze the diurnal cycle of pr for HighResMIP, and I think it would be very helpful if we could have xmls for those data.

Thanks for your consideration.

Missing ocean fixed fields (CNRM-CM6-1 piControl e.g.)

While moving data around, a wrinkle with fixed fields was uncovered.

Data that is locally available:

-bash-4.2$ ls -al ~/css03/esgf_publish/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/
piControl/r1i1p1f2/Ofx/areacello/gn/v20180814/
total 8448
drwxrwsr-x 2 1682 climatew    4096 Aug 31  2018 .
drwxrwsr-x 3 1682 climatew    4096 Aug 22  2018 ..
-rw-rwSr-- 1 1682 climatew 8620506 Aug 14  2018 areacello_Ofx_CNRM-CM6-1_piControl_r1i1p1f2_gn.nc

Is not indexed by the xmls:

-bash-4.2$ ls ~/xclim/CMIP6/fx/areacello/*CNRM-CM6-1*
ls: cannot access ~/xclim/CMIP6/fx/areacello/*CNRM-CM6-1*: No such file or directory

I note there are unusual permissions (-rw-rwSr--) on this file and directory, but I can ncdump the file, so I am wondering what is going wrong.

There is a parallel issue with deptho data that is available (e.g. GFDL-ESM4.historical.r1i1p1f1.Ofx.deptho) but this will need to be resolved elsewhere.

Ping @durack1 @pochedls @painter1 @jetesdal @gleckler1

Request adding new variables to scan

I'd like to ask for adding new monthly variables to the scan list if it is possible: od550aer, od550so4. Please let me know if you need further information about this request. Thank you!

Xue Zheng

str type error

(xagg) duro@ocean:[xagg]:[master]:[11621]> ./xagg.py --outputDirectory ~tmp > log.txt
Traceback (most recent call last):
  File "./xagg.py", line 137, in <module>
    fx.createLookupDictionary(diskPaths, outfile=cmipMetaFile)
  File "~git/xagg/fx.py", line 621, in createLookupDictionary
    key = mip_era + '.' + cmipTable + '.' + variable
TypeError: can only concatenate list (not "str") to list
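The traceback points at one of the key components (most likely cmipTable) arriving as a list rather than a string. A minimal sketch of a defensive fix (the real code in fx.py may differ; make_key is an illustrative stand-in):

```python
# Hypothetical reconstruction of the failing key construction in
# fx.createLookupDictionary: if cmipTable arrives as a single-element
# list (e.g. from a table lookup), "+" raises the TypeError shown above.
def make_key(mip_era, cmipTable, variable):
    # Defensive fix: unwrap single-element lists before joining.
    parts = [p[0] if isinstance(p, list) and len(p) == 1 else p
             for p in (mip_era, cmipTable, variable)]
    return ".".join(parts)
```

The longer-term fix is of course to normalize the type at the point where the list is produced.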

One XML points to limited time period

Just one of the many XMLs in /p/user_pub/xclim/CMIP6/CMIP/historical/atmos/mon/zg/ points to data from the limited time period 1910-1979: CMIP6.CMIP.historical.NCC.NorESM2-LM.r1i1p1f1.mon.zg.atmos.glb-p19-gn.v20190815.0000000.0.xml. All other XMLs in this directory point to output that extends into this century. See highlighted output from my script in the attached MS Word document.
coveyGrimTranscript.docx

CESM CFMIP files missing

Some CESM data that is included in runSettings and for which we have data is not producing xml files.

It appears this is related to an issue with cdscan (re)-documented here for which there is a fix.

MRI amip 3hr datasets with missing files

MRI models have missing files in the below folders:

/p/css03/esgf_publish/cmip5/output1/MRI/MRI-AGCM3-2H/amip/3hr/atmos/3hr/r1i1p1/v20110914/pr

/p/css03/esgf_publish/cmip5/output1/MRI/MRI-AGCM3-2S/amip/3hr/atmos/3hr/r1i1p1/v20111121/pr

/p/css03/esgf_publish/cmip5/output1/MRI/MRI-CGCM3/amip/3hr/atmos/3hr/r1i1p1/v20120119/pr

The MRI models have their output chunked into small files; maybe that causes problems when the data is replicated? @painter1

Bad latitude coordinates in two (out of 18) CMIP6 historical runs?

Fourier analysis of 500hPa geopotential heights for two of the CMIP6 models gives results that differ so much from the others that one suspects an error in reporting latitude coordinate values. See nearby graphic. Data in question are from the following PCMDI XML files:
geopotentialWaveAmplitudesCMIP6.pdf

  • /p/user_pub/xclim/CMIP6/CMIP/historical/atmos/mon/zg/CMIP6.CMIP.historical.BCC.BCC-CSM2-MR.r1i1p1f1.mon.zg.atmos.glb-p19-gn.v20181126.0000000.0.xml
  • /p/user_pub/xclim/CMIP6/CMIP/historical/atmos/mon/zg/CMIP6.CMIP.historical.CAMS.CAMS-CSM1-0.r1i1p1f1.mon.zg.atmos.glb-p19-gn.v20190708.0000000.0.xml

According to the WCRP CMIP6 website, the institutions providing this data are:

Curt Covey <@yahoo.com>

DCPP Runs

I disabled DCPP scanning since it was taking >4 hours to stat these directories. I think we need to parallelize these to a finer scale.

Note that we were already parallelizing across institute directories (i.e., CMIP6/DCPP/* was split into 9 scans: BCC, CCCma, CNRM-CERFACS, EC-Earth-Consortium, IPSL, MIROC, MPI-M, NCAR, NCC). I think the NCAR directory itself was taking ~4 hours, which means we need to parallelize on a finer level.
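One way to split the work per model rather than per institute, sketched with illustrative names (scan_one stands in for the real per-directory scan, and the directory layout is assumed to be activity/institute/model):

```python
import os

def subdirs(path):
    """Immediate subdirectories of path."""
    return sorted(e.path for e in os.scandir(path) if e.is_dir())

def finer_scan_units(activity_dir):
    """Split an activity directory (e.g. CMIP6/DCPP) into per-model scan
    units (institute/model) instead of per-institute units, so one large
    institute such as NCAR no longer dominates the wall time."""
    units = []
    for institute in subdirs(activity_dir):
        models = subdirs(institute)
        units.extend(models if models else [institute])
    return units

# Each unit can then be scanned in parallel with the joblib dependency:
# Parallel(n_jobs=16)(delayed(scan_one)(u) for u in finer_scan_units(root))
```

The same idea extends another level (to experiments) if individual models are still too slow.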

Job not completing

The nightly cron job has not completed since 8/17. The run lock was in place [now removed]. The log shows that it finished scanning:

Start scans
Sat Aug 17 00:48:29 2019

Sat Aug 17 00:48:29 2019: 831/831 (100.0%) /p/css03/esgf_publish/CMIP6/DCPP/

But then doesn't show the subsequent steps (or an error). Typically the log shows the following (once scans are completed):

Write statistics to database
Fri Aug 16 01:56:25 2019

Finished run
Fri Aug 16 01:56:42 2019

The next step, writing the scan results to the database, is the first one missing. It is unlikely this happened (the last write time on the .db file is 00:43).

3hr prw for CMIP5/6 amip and historical

@pochedls Hi Steve, would you please consider adding 3hr prw? I'm not sure how many models have this variable as output, but it will be useful for convection onset analysis. Thank you!

variable: prw
frequency: 3hr
experiments: amip and historical
mip_era: CMIP5 and CMIP6

Monthly Database Snapshots

xagg is currently set to create a zipped backup of the sqlite3 database on a nightly basis. This is starting to take up a lot of space (62GB).

I will plan on storing the most recent 30 days and monthly backups before that (e.g., on the first of the month or the closest backup to that).
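The retention policy above could be sketched as a small helper that only decides which backups to keep (dates and the 30-day cutoff follow the plan; the function name is illustrative):

```python
from datetime import date, timedelta

def backups_to_keep(backup_dates, today):
    """Retention sketch: keep every backup from the last 30 days, plus
    the earliest backup in each calendar month before that (approximating
    'the first of the month or the closest backup to it')."""
    cutoff = today - timedelta(days=30)
    keep = {d for d in backup_dates if d >= cutoff}
    monthly = {}
    for d in sorted(backup_dates):
        if d < cutoff:
            # First (earliest) backup seen in each older month wins.
            monthly.setdefault((d.year, d.month), d)
    keep.update(monthly.values())
    return keep
```

Everything not returned by the helper is eligible for deletion.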

Deal with sidepot data

The "sidepot" data is an accumulating resource that needs to deal with both CMIP5 and CMIP6 (and CMIP3) data; some tweaks will enable mip_era to be identified.
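A heuristic sketch of the mip_era identification (the tokens and fallbacks here are assumptions, not the actual tweaks): look for an era token in the path, then fall back on the member-id style, since rNiNpNfN implies CMIP6 and rNiNpN implies CMIP5.

```python
import os
import re

def infer_mip_era(path):
    """Sketch: guess the mip_era of a 'sidepot' path. Heuristic only."""
    parts = [p.lower() for p in path.split(os.sep)]
    for era in ("cmip6", "cmip5", "cmip3"):
        if era in parts:
            return era.upper()
    # Fall back on the member-id convention embedded in CMOR filenames.
    if re.search(r"r\d+i\d+p\d+f\d+", path):
        return "CMIP6"
    if re.search(r"r\d+i\d+p\d+", path):
        return "CMIP5"
    return None
```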

specifying correct naming of netcdf files in search

Some of the files published by NCC are in the format
variable_Amon_experiment_modelname_etc and some are variable_Amon_modelname_experiment_etc. The latter is the recommended format.

In response to this issue, I am wondering if it would be a good idea for cdscan to specifically search for filenames written correctly, i.e., specifying

filename = var + '_Amon_' + model + '_' + exp + '_' + ripf + '_gn*.nc'

in the search.
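A sketch of what such a validity check could look like. Since model and experiment names have the same shape, the sketch distinguishes them with a known-experiment list; the list here is partial and illustrative, and a real check would use the CMIP6 controlled vocabularies:

```python
import re

# Recommended ordering: variable_table_source_experiment_member_grid[_timerange].nc
CMIP6_NAME = re.compile(
    r"^(?P<var>[A-Za-z0-9]+)_(?P<table>[A-Za-z0-9]+)_(?P<model>[A-Za-z0-9-]+)"
    r"_(?P<exp>[A-Za-z0-9-]+)_(?P<ripf>r\d+i\d+p\d+f\d+)_(?P<grid>g[a-z0-9]+)"
    r"(_(?P<trange>[0-9-]+))?\.nc$"
)

# Partial, illustrative experiment set (a real check would use the CMIP6 CVs).
KNOWN_EXPERIMENTS = {"historical", "amip", "piControl", "ssp126", "ssp245", "ssp585"}

def is_well_formed(fname, experiments=KNOWN_EXPERIMENTS):
    """True when the filename follows the recommended component ordering."""
    m = CMIP6_NAME.match(fname)
    return bool(m) and m.group("exp") in experiments
```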

Permissions wash

There is a need to ensure that file and directory permissions are correctly set after each run. These are 775/r-x for directories and 774/r-- for files.

-bash-4.2$ ls -al ../CMIP6/CMIP/historical/ocean/mon/thetao/CMIP6.CMIP.historical.AS-RCEC.TaiESM1.r1i1p1f1.mon.thetao.ocean.glb-l-gn.v20200320.0000000.0.xml
-rwxrwxr-- 1 poche xclimw 55846 May  1 05:43 ../CMIP6/CMIP/historical/ocean/mon/thetao/CMIP6.CMIP.historical.AS-RCEC.TaiESM1.r1i1p1f1.mon.thetao.ocean.glb-l-gn.v20200320.0000000.0.xml
...
-bash-4.2$ ls -al ../CMIP6/CMIP/historical/ocean/mon
total 3264
drwxrwxr-x 31 poche xclimw  4096 Jan 16 02:57 .
drwxrwxr-x  4 poche xclimw  4096 Jun  5  2019 ..
drwxrwxr-x  2 poche xclimw 32768 May 25 04:02 agessc
drwxrwxr-x  2 poche xclimw 32768 May  6 04:07 cfc11
drwxrwxr-x  2 poche xclimw 65536 Jun  1 04:05 evs
drwxrwxr-x  2 poche xclimw 32768 Jun  1 04:05 ficeberg

https://superuser.com/questions/91935/how-to-recursively-chmod-all-directories-except-files
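A minimal sketch of the wash in Python, assuming the 775/774 targets above (the linked find/chmod approach is equivalent):

```python
import os

def permissions_wash(root, dir_mode=0o775, file_mode=0o774):
    """Recursively apply the target modes after a run: 775 on
    directories, 774 on files."""
    os.chmod(root, dir_mode)
    for dirpath, dirnames, filenames in os.walk(root):
        for d in dirnames:
            os.chmod(os.path.join(dirpath, d), dir_mode)
        for f in filenames:
            os.chmod(os.path.join(dirpath, f), file_mode)
```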

Issue scanning files with time units as days since 0000-00-00

This issue stems from CDAT/cdms#334.

One way to avoid this issue is to retrieve the file time units and if they are relative to the year 0 we could invoke cdscan with the flag: -e "time.units='days since 1900-01-01'"

We probably do not want to check the units on every file, but maybe could check units on files with particular error messages and rerun cdscan with the -e flag if they are tied to year zero.
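The units check could live in a small helper that is only applied after a matching cdscan error message (a sketch; the override string follows the workaround above, and the helper name is illustrative):

```python
def extra_cdscan_args(time_units):
    """Return the extra cdscan flags needed when a file's time units are
    anchored at year 0, which cdms2 cannot handle (see CDAT/cdms#334)."""
    if time_units.startswith(("days since 0000", "days since 0-")):
        return ["-e", "time.units='days since 1900-01-01'"]
    return []
```

The returned list can be appended to the usual cdscan invocation when re-running a failed scan.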

Update to include CMIP3 mapping

The CMIP3 data needs to be mapped to the CMIP6 specs so that CMIP3, 5 and 6 can all be addressed using the same directory and filename logic. To achieve this, the CMIP3 tables that were used for CMOR1 will need to be uncovered so a mapping can occur.

I'll obtain a copy of these tables from @taylor13 (which are summarized at https://pcmdi.llnl.gov/ipcc/standard_output.html), update them into an appropriately named github repo and use these as input for the mapping

Duplicate ScenarioMIP data?

@pochedls I was just starting to take a look at the various MIP datasets, and stumbled upon

(base) bash-4.2$ ls -al ../xclim/CMIP6/CMIP/
total 0
drwxrwxr-x 15 pochedls xclimw 4096 Oct  4  2019 .
drwxrwxr-x 12 pochedls xclimw 4096 Jun  1 17:48 ..
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 1pctCO2
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 abrupt-4xCO2
drwxrwxr-x  8 pochedls xclimw 4096 Sep 13  2019 amip
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 esm-hist
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 esm-piControl
drwxrwxr-x  9 pochedls xclimw 4096 Jul 13 04:04 esm-piControl-spinup
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 historical
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 piControl
drwxrwxr-x  9 pochedls xclimw 4096 Jul 13 04:04 piControl-spinup
drwxrwxr-x  5 pochedls xclimw 4096 Oct  4  2019 ssp126
drwxrwxr-x  5 pochedls xclimw 4096 Oct  4  2019 ssp245
drwxrwxr-x  5 pochedls xclimw 4096 Jul 30 05:35 ssp370
drwxrwxr-x  4 pochedls xclimw 4096 Oct  4  2019 ssp585
(base) bash-4.2$ ls -al ../xclim/CMIP6/ScenarioMIP/
total 0
drwxrwxr-x 10 pochedls xclimw 4096 May 14  2019 .
drwxrwxr-x 12 pochedls xclimw 4096 Jun  1 17:48 ..
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 ssp119
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 ssp126
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 ssp245
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 ssp370
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 ssp434
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 ssp460
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 ssp534-over
drwxrwxr-x  9 pochedls xclimw 4096 Sep 13  2019 ssp585

Is the duplication of ssp126 - > ssp585 intentional?

Deal with /badx directories

The function that finds directories eligible for scanning ignores directories that have a sub-directory. Unfortunately some CMIP data is kept in directories that also include folders of the form /badx/ (where x is a number). These directories are skipped over for scanning (because they have the badx sub-directory).

For example, the directory /p/css03/cmip5_css01/data/cmip5/output1/LASG-IAP/FGOALS-s2/abrupt4xCO2/mon/atmos/Amon/r1i1p1/v1/hur/ has the contents:
bad0/
bad1/
hur_Amon_FGOALS-s2_abrupt4xCO2_r1i1p1_185001-199912.nc

The path is passed over for scanning because of the two sub-directories. We should modify the scantree code to detect these situations and return the eligible path.
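A sketch of the proposed scantree change, assuming a simple os.walk-based traversal (the real scantree code may differ): prune badN subdirectories before deciding whether a directory is a scannable leaf.

```python
import os
import re

BAD_DIR = re.compile(r"^bad\d+$")

def eligible_dirs(root):
    """Yield directories that contain .nc files, ignoring badN
    subdirectories so that a parent holding both badN folders and data
    files is still returned as a scannable path."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Drop badN subdirectories from traversal entirely.
        dirnames[:] = [d for d in dirnames if not BAD_DIR.match(d)]
        if any(f.endswith(".nc") for f in filenames):
            yield dirpath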

Case study of deprecation (E3SM-1-0 ocean data)

There is a known issue with the E3SM-1-x ocean data, such that a grid mask issue propagated (for more info see here).

It will be useful to watch the status change as the currently available, erroneous data is deprecated and replaced by new data to be published soon. I note some of the file tracking_id entries may be useful test cases if we wish to query the "latest" status of existing data available locally in the LLNL archives.

Some examples (from CMIP6.CMIP.historical.*):

Filename tracking_id collection PID
tos_Omon_E3SM-1-0_historical_r3i1p1f1_gr_185001-185412.nc hdl:21.14100/7483b96d-a173-4155-9590-e4f7c4c5fccc hdl:21.14100/ba92826d-cb6b-315c-ab20-104f2f20258c
so_Omon_E3SM-1-0_historical_r5i1p1f1_gr_201001-201412.nc hdl:21.14100/1f217b12-0541-48d1-9c2e-c272af18e4a3 hdl:21.14100/ec772211-6e75-30af-b573-4554e625cb69
thetao_Omon_E3SM-1-0_historical_r2i1p1f1_gr_189501-189912.nc hdl:21.14100/6807c1a3-d001-4fb7-8e88-df8c5dde68b2 hdl:21.14100/2effbc41-fbd0-3bf0-883c-b4bc12fa89e8

Each of these files currently shows no known issues on the ES-DOC PID lookup page, see https://errata.es-doc.org/static/pid.html

Each of these datasets will likely be deprecated as the issue linked above is addressed, new data replaces the files above and the issue is closed. Using the ES-DOC API, it would be possible to query the "status" of files using their tracking_id/PID so that you know you are always using the ESGF federated latest rather than the latest local version of the CMIPx data.

For reference, all current errata is listed at https://errata.es-doc.org/static/index.html?project=CMIP6

@mzelinka @taylor13 @gleckler1 @lee1043 just FYI

Flag / delete retracted data

The xml files we use currently link to every possible dataset we can find, including retracted data. We should flag or delete these files in some way so that the user can avoid them.

A web-based API would be incredibly slow unless we can create some kind of bulk request for retracted datasets.

@painter1 / @durack1 - do you have ideas about how we can identify and flag / delete retracted data? Are there local databases that contain this information?
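Assuming the scan database can map each xml to the tracking_ids it aggregates, the flagging step itself is simple; the names here are hypothetical, and the retracted id set would come from a bulk ES-DOC errata query rather than anything implemented here:

```python
def flag_retracted(xml_paths, retracted_ids, tracked_ids):
    """Sketch: compute the target name for each xml file. tracked_ids
    maps an xml path to the tracking_ids of the files it aggregates;
    retracted_ids is a set of retracted tracking_ids. Flagged files get
    a '.retracted' suffix so users can filter them out."""
    out = []
    for path in xml_paths:
        if set(tracked_ids.get(path, ())) & retracted_ids:
            out.append(path + ".retracted")
        else:
            out.append(path)
    return out
```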

Add monthly aqua experiments to process list

@pochedls Hi Steve, could you please consider adding monthly aqua planet simulations? It will be useful for a hierarchy comparison with other AMIP and fully-coupled simulations. Thank you!

frequency: Amon, CFmon
mip_era: CMIP5 and CMIP6
experiments [CMIP5/CMIP6]: aquaControl/aqua-control, aqua4K/aqua-p4K, aqua4xCO2/aqua-4xCO2

NPROC limits being hit

This error is not a bug in software, but rather in the env and config of a user and machine, so dropping it here in case it happens again.

(xagg) duro@ocean:[xagg]:[master]:[11288]> ./xagg.py --outputDirectory ~/tmp > log.txt
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable                                
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 1032773 max                                          
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable                                
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 1032773 max
...

This requires an edit to the (RHEL6) /etc/security/limits.conf file, increasing the nproc limit above the default of 1024

#<domain>      <type>  <item>         <value>
...
#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
...
#@student        -       maxlogins       4
duro             soft    nproc           4096   # <-- CHANGED THIS

The above requires an environment refresh, so log out and back in to pick up the new shell limits; if no new login session is created, new terminals inherit the limits of existing ones.
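A startup guard in xagg could fail fast with a clearer message than the OpenBLAS errors above (a sketch; Linux-specific, and the 4096 floor simply mirrors the limits.conf change):

```python
import resource

def check_nproc(minimum=4096):
    """Raise early when the soft RLIMIT_NPROC is below what OpenBLAS's
    thread startup needs, instead of failing mid-run."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if soft != resource.RLIM_INFINITY and soft < minimum:
        raise RuntimeError(
            "RLIMIT_NPROC soft limit is %d; raise it to at least %d in "
            "/etc/security/limits.conf and start a new login session"
            % (soft, minimum))
    return soft
```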

Default perms check

The default group is currently climate; we should do a wash so that xclim becomes the group for all subdirectories and files.

All machines will need to inherit the groups and user/uid tables

omitted GISS-E2-1-G f2 data?

I was surprised to see several directories missing XML files, all the f2 directories below. I note that these were listed the last time I ran my code.

source directories:

(base) bash-4.2$ ls -1 ~/CMIP6/ScenarioMIP/NASA-GISS/GISS-E2-1-G/ssp245/
r101i1p1f1
r10i1p1f2
r10i1p5f1
r10i1p5f2
r1i1p1f2
r1i1p3f1
r1i1p5f1
r1i1p5f2
r2i1p1f2
r2i1p3f1
r2i1p5f1
r2i1p5f2
r3i1p1f2
r3i1p3f1
r3i1p5f1
r3i1p5f2
r4i1p1f2
r4i1p3f1
r4i1p5f1
r4i1p5f2
r5i1p1f2
r5i1p3f1
r5i1p5f1
r5i1p5f2
r6i1p1f2
r6i1p5f1
r6i1p5f2
r7i1p1f2
r7i1p5f1
r7i1p5f2
r8i1p1f2
r8i1p5f1
r8i1p5f2
r9i1p1f2
r9i1p5f1
r9i1p5f2

And XML matches:

(base) bash-4.2$ ls -1 *GISS-E2-1-G*
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G-CC.r102i1p1f1.mon.mrro.land.glb-2d-gn.v20220115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r101i1p1f1.mon.mrro.land.glb-2d-gn.v20220115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r10i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r1i1p3f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r1i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r2i1p3f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r2i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r3i1p3f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r3i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r4i1p3f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r4i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r5i1p3f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r5i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r6i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r7i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r8i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r9i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml

Fall cleaning

We have various versions of xml libraries lying around.

It's time to wipe them out to avoid confusing people:

rm -rf ...

  • /work/cmip-dyn/
  • /work/cmip5-dyn/
  • /work/cmip5-test/
  • /work/cmip5/
