geoschem / gchp
The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
Home Page: https://gchp.readthedocs.io
License: Other
Hi everyone,
I mentioned this on Slack last week, but plotting GCHP full-global data with pcolormesh() can be difficult because you'll get horizontal streaks for grid boxes that cross the antimeridian. Below are examples.
TLDR: Use cartopy version 0.19 or greater if you want to plot GCHP data for the entire globe.
Set up the figure:
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import xarray as xr

ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_global()
ax.coastlines()
Plot a face that doesn't cross the antimeridian (looks good)
ds = xr.open_dataset('GCHP.SpeciesConc.nc4')
plt.pcolormesh(
ds.lons.isel(nf=4).values,
ds.lats.isel(nf=4).values,
ds.SpeciesConc_NO2.isel(nf=4, lev=0, time=0).values,
vmax=8e-9
)
Now, plot a face that does cross the antimeridian (results in horizontal streaking)
plt.pcolormesh(
ds.lons.isel(nf=3).values,
ds.lats.isel(nf=3).values,
ds.SpeciesConc_NO2.isel(nf=3, lev=0, time=0).values,
vmax=8e-9
)
This is illustrated a bit better with stretched-grids.
Again, set up the figure
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_global()
ax.coastlines()
Plot a face that doesn't cross the antimeridian
ds = xr.open_dataset('GCHP.SpeciesConc.nc4')
plt.pcolormesh(
ds.lons.isel(nf=0).values,
ds.lats.isel(nf=0).values,
ds.SpeciesConc_NO2.isel(nf=0, lev=0, time=0).values,
vmax=8e-9
)
Next, plot a face that does cross the antimeridian
plt.pcolormesh(
ds.lons.isel(nf=1).values,
ds.lats.isel(nf=1).values,
ds.SpeciesConc_NO2.isel(nf=1, lev=0, time=0).values,
vmax=8e-9
)
The PlateCarree projection is a 2D space that is unaware of wrapping at the antimeridian.
This was fixed in SciTools/cartopy#1622 (thanks @htonchia and @greglucas), merged on August 19, 2020. IIUC this fix will be released in cartopy version 0.19.
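For cartopy versions before 0.19, a common workaround (the same idea GCPy uses internally) is to mask the grid boxes that straddle the antimeridian before calling pcolormesh(), so those cells are simply skipped instead of drawn as streaks. A minimal numpy sketch, assuming per-face center longitudes in degrees (the function name and tolerance are illustrative, not part of GCHP or GCPy):

```python
import numpy as np

def mask_antimeridian(lon, data, tol=2.0):
    """Mask grid boxes whose center longitude is within `tol` degrees
    of the antimeridian (180 deg), so pcolormesh skips them instead of
    drawing streaks across the map."""
    lon = np.asarray(lon) % 360.0          # normalize to [0, 360)
    near_am = np.abs(lon - 180.0) < tol    # cells near the antimeridian
    return np.ma.masked_where(near_am, data)

# Tiny demo: a 1-D strip of cells crossing the antimeridian
lon = np.array([176.0, 179.0, 181.0, 184.0])
data = np.array([1.0, 2.0, 3.0, 4.0])
masked = mask_antimeridian(lon, data)
print(masked.mask.tolist())  # [False, True, True, False]
```

Passing `masked` instead of the raw array to pcolormesh() leaves a thin gap at the antimeridian rather than streaks.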
Is ExtData/SPC_RESTARTS or ExtData/GEOSCHEM_RESTARTS the proper place for our restarts? Currently, ./createRunDir.sh
links to restarts in SPC_RESTARTS, but those don't exist on ComputeCanada. I assume ./createRunDir.sh
needs to be updated--is that right?
Re: #36
GCHPctm should not be built with OpenMP. However, MAPL is built with OpenMP and possibly other components as well. As a result, I ran into a compile issue with ifort19 for 13.0.0-alpha.8 due to an improper openMP directive (OMP serial). The fix for that issue went into 13.0.0-alpha.9, but the larger issue that GCHPctm is being compiled with OpenMP remains. I believe the issue has to do with the settings in the CMake files within MAPL and this needs to be further looked at.
On Friday, July 24, the master branch of GCHPctm will be deleted and the new branch main will take its place. At the same time, main will be moved up to the latest alpha pre-release version. If you have not already, take a look at the GCHPctm Releases page to view the GCHPctm pre-release versions available and what they contain.
If you are currently using the master branch you do not need to change anything. If you decide to update versions, simply do the following:
git fetch
git checkout main
If you have a fork of GCHPctm and have a second remote connected to the upstream (geoschem/gchpctm), you can do the following:
git fetch upstream (or whatever your upstream remote is called)
The main branch will then be available to checkout or merge:
git checkout upstream/main
To delete the stale branches from any of your remotes, do the following:
git fetch {remotename} --prune
It has been a known issue for a long time that GCHP does not give exactly the same final result between a long single run and the identical run split up into shorter durations. This has been true for both the transport tracer and the full chemistry simulations.
This is especially problematic for GCHP because currently the only way to output monthly mean diagnostics is to break up a run into 1-month run segments. A monthly mean capability was supposed to be included within MAPL for the 13.0.0 release but that update is not yet ready in a MAPL release. Since we output monthly means in GCHP 1-year benchmarks I have been looking more closely at this issue to find fixes before we do the 13.0.0 benchmark.
Recent updates that are going into GEOS-Chem 13.0.0 correct this problem for transport tracers. Bug fixes in the GEOS-Chem and HEMCO submodules resolved the issue and the simulation now gives zero diffs regardless of how the run is split up. See the following posts on GitHub for more information on these updates:
Differences persist in the full chemistry simulation and I am actively looking into them.
Kevin Bowman suggests that CO2-only mode can be a good use case for GCHP-on-cloud, as it requires much less I/O which is the major bottleneck on AWS cloud.
I haven't used the CO2 mode before and would like to learn more about its current status.
Pinging @sdeastham in particular, who should have more experience with the CO2 mode.
The difficulty to build GCHP (despite the large improvement over early versions) is preventing user adoption and eating a lot of engineer time (e.g. on debugging makefiles). In particular, it is time-consuming to diagnose compiler/MPI-specific problems, as there are so many combinations of them.
So far this problem is being treated passively -- we stick to the very few combinations we know are working (notably ifort + OpenMPI3). Other combinations (gfortran, other MPIs) are handled case-by-case, typically when a user hits bugs on a specific system.
The sustainable way (which will save a lot of engineer time in the long term) is to deal with this problem actively -- we should explore all common combinations using the build matrices offered by most Continuous Integration (CI) services.
The components of the build matrix include:
By having a continuous build at every commit / every minor releases, we will be able to:
This also helps users find the "shortest path" to solving their specific error. An example question is: "My build is failing on Ubuntu + gfortran + mpich; which component should I change to fix the problem?" By looking at the matrix, you could see that (for example) changing the MPI leads to a correct build.
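As a sketch of the scale involved, the build matrix can be enumerated programmatically. The specific compiler, MPI, and OS versions below are illustrative placeholders, not a vetted support list:

```python
from itertools import product

# Hypothetical matrix axes -- placeholders, not a tested list
compilers = ["ifort19", "gfortran8", "gfortran9"]
mpis = ["openmpi3", "openmpi4", "mpich3", "intelmpi19"]
oses = ["ubuntu18.04", "centos7"]

# Every (compiler, MPI, OS) combination a CI service would build
matrix = list(product(compilers, mpis, oses))
print(len(matrix))  # 3 * 4 * 2 = 24 build configurations
for compiler, mpi, os_name in matrix[:2]:
    print(f"build: {compiler} + {mpi} on {os_name}")
```

Even this modest set of axes yields 24 configurations, which is why covering them by hand, case-by-case, does not scale.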
A simple CI (on Travis) for GC-classic is geoschem/geos-chem#11 However, the memory & compute limit on Travis probably won't allow building GCHP. Other potentially better options are:
Tutorial-like pages:
Existing models for reference:
HEMCO will be a separate sub-project from GEOS-Chem in 13.0.0-alpha.4. Ideally it would also be used as an ESMF gridded component. I am aiming to implement this for 13.0.0-alpha.5.
Ask your question here
Hi all,
I want to calculate lifetime of NOx and I need the chemical production and loss rates.
I want to add Prod_NO and Loss_NO in History.rc, but it didn't work.
As the wiki says, some quantities in the ProdLoss collection are not applicable to certain simulations.
So I wonder whether GCHP can output Prod_NO, Loss_NO, Prod_NO2, and Loss_NO2?
Thanks, and I look forward to your reply!
Hongjian
All HEMCO diagnostics are output with level 1 corresponding to the top of the atmosphere. This is the opposite of all GEOS-Chem diagnostics. It would be ideal to have a consistent level order across diagnostics to avoid confusion. This update would be for use in GCHP only; levels should remain as they are if using GEOS.
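A sketch of the kind of fix this entails on the output side, assuming a diagnostic array dimensioned (lev, lat, lon) with level 1 at the top of the atmosphere (the function name is illustrative):

```python
import numpy as np

def flip_levels(arr, lev_axis=0):
    """Reverse the vertical axis so level 1 becomes the surface
    (GEOS-Chem convention) instead of top of atmosphere (the current
    HEMCO diagnostic convention)."""
    return np.flip(arr, axis=lev_axis)

# 3-level x 2 x 2 toy field: index 0 is top of atmosphere before the flip
hemco = np.arange(12).reshape(3, 2, 2)
gc_order = flip_levels(hemco)
print(gc_order[0, 0, 0])  # 8: the former bottom level is now first
```

The flip is its own inverse, so applying it twice recovers the original order, which is also why mixing up the convention silently corrupts comparisons rather than raising errors.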
ESMF 8.0.1 is now available. The official release page is here. This version is supposed to be backwards compatible so no changes should be necessary for running with GCHPctm. It includes updates that improve performance, but whether GCHPctm performance is improved is yet to be determined. Notably, with this release ESMF is now on GitHub.
GCHPctm needs to be tested with the new version of ESMF, and the documentation for ESMF download needs to be updated, both on the wiki and in the GitHub README.
Hi, I'm trying to get the GCHPctm wrapper working in the default standard simulation with MERRA2, using GEOS-Chem 12.8.2. I was able to build the geos executable following the "getting started" portion of the GCHPctm GitHub page, but I am running into some problems with running the default (6-core, 1-node, 1-hour) test simulation. I am running interactively using the gchp.local.run script available in the runScriptSamples directory.
More information:
-I am running on the NCAR Cheyenne system, which uses a PBS scheduling system
-My interactive session uses 1 node with 36 cores/node
-I am using openmpi 4.0.3 and gfortran 8.3.0 compiler
-I built ESMF with ESMF_COMM=openmpi
I receive this output when running the default:
WARNING: NX and NY are set such that NX x NY/6 has side ratio >= 2.5. Consider adjusting resources in runConfig.sh to be more square. This will avoid negative effects due to excessive communication between cores.
Compute resources:
NX : 1 GCHP.rc
NY : 30 GCHP.rc
CoresPerNode : 30 HISTORY.rc
Cubed-sphere resolution:
GCHP.IM_WORLD : 24 GCHP.rc
GCHP.IM : 24 GCHP.rc
GCHP.JM : 144 GCHP.rc
IM : 24 GCHP.rc
JM : 144 GCHP.rc
npx : 24 fvcore_layout.rc
npy : 24 fvcore_layout.rc
GCHP.GRIDNAME : PE24x144-CF GCHP.rc
Initial restart file:
GIGCchem_INTERNAL_RESTART_FILE : +initial_GEOSChem_rst.c24_standard.nc GCHP.rc
Simulation start, end, duration:
BEG_DATE : 20160701 000000 CAP.rc
END_DATE : 20160701 010000 CAP.rc
JOB_SGMT : 00000000 010000 CAP.rc
Checkpoint (restart) frequency:
RECORD_FREQUENCY : 100000000 GCHP.rc
RECORD_REF_DATE : 20160701 GCHP.rc
RECORD_REF_TIME : 000000 GCHP.rc
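The NX/NY warning above follows from simple decomposition arithmetic. A sketch of how I understand MAPL splits each cubed-sphere face across ranks (illustrative only):

```python
# Decomposition arithmetic behind the NX/NY side-ratio warning,
# using the values from GCHP.rc above. My understanding of MAPL's
# layout; illustrative only.
NX, NY, IM = 1, 30, 24
ranks_per_face_y = NY // 6            # each cubed-sphere face gets NY/6 rows of ranks
subdomain_x = IM / NX                 # 24 columns of cells per rank
subdomain_y = IM / ranks_per_face_y   # 4.8 rows of cells per rank
ratio = subdomain_x / subdomain_y     # long, thin subdomains
print(ratio)                          # 5.0 >= 2.5, hence the warning
```

Long, thin subdomains have more halo cells per compute cell, which is the "excessive communication between cores" the warning refers to.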
The run eventually crashes after reading the HEMCO_Config.rc file and gives this error:
FATAL from PE 1: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE 3: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE 5: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE 0: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE 2: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE 4: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE 0: mpp_domains_define.inc: not all the pe_end are in the pelist
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[r1i2n17:64901] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[r1i2n17:64901] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[r1i2n17:64901] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[r1i2n17:64901] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[r1i2n17:64901] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[r1i2n17:64901] 5 more processes have sent help message help-mpi-api.txt / mpi-abort
[r1i2n17:64901] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I am quite new to this, so I am not sure what is going on and would appreciate any advice. Further, are there potential issues I might run into by attempting to use GEOS-Chem 12.8.2 with GCHPctm? I am happy to provide more info if needed, as well. The full output log is attached.
gchp.pdf
Default values in GCHPctm 13.0.0-alpha.2 (and all previous GCHP versions) write a MAPL internal state checkpoint file during the first timestep. At high resolutions this adds significantly to the run-time. We should have the ability to turn off writing this first checkpoint file. It shouldn't be needed since we already have a restart file and the output checkpoint file is written separately at the end of the run.
Multi-word compiler arguments should be wrapped in "SHELL:word1 [word2 [...]]" to prevent CMake from splitting the words up when it removes duplicate command-line arguments. I did this for Release flags a while back but I didn't do it for Debug flags, which still needs to be done. This will break the Debug build type until it is fixed.
Hi everyone,
In the GCSC meeting the other day, it was suggested I test mass conservation in stretched-grid simulations. Could someone help me understand how I could do this? I see there's a geosfp_2x25_masscons run directory for GC-Classic—is there a way to mirror this in GCHPctm?
Thanks in advance!
There is currently no way to calculate radiative forcing with GCHP. This is in spite of the presence of a version of RRTMG in the GEOS-Chem code.
GEOS-Chem Classic has diagnostic collection AdvFluxVert to save vertical fluxes in tpcore. There is no equivalent diagnostic in GCHP. Since tpcore is in FV3 we should be able to add in an equivalent diagnostic by creating a new MAPL export in the DYNAMICS grid comp in GCHP.
The MAPL debug print option gives lots of information during the information-collection stage of MAPL ExtData, such as parsing ExtData.rc and finding files with the right times. However, the regridding part of ExtData is very murky, and GCHP seemingly stalls for a while without any printing at all during this phase. If the run times out due to an issue not caught by error handling, it is hard to know where it went wrong.
This feature request is really for GEOS-ESM/MAPL but we can put it in with GCHP in mind and then submit it as a PR to go to the upstream MAPL.
I noticed some GCHPctm simulations are crashing on some nodes on compute1 with the following error
...
At line 548 of file /my-projects/sgv/line-3/GCHPctm/src/MAPL/MAPL_Base/MAPL_MemUtils.F90
Fortran runtime error: Integer overflow while reading item 1
...
The line where the overflow occurs is https://github.com/geoschem/MAPL/blob/10e7a0bc8d0d79eb90a3742980fa6f7f073a87e3/MAPL_Base/MAPL_MemUtils.F90#L548. The problem is that memtot is a 32-bit signed integer, and this is happening on nodes that report >3TB of memory in /proc/meminfo.
Changing memtot to a 64-bit integer should fix this. I'll do it when I get a chance.
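The arithmetic checks out: /proc/meminfo reports memory in kB, and 3 TB expressed in kB already exceeds the 32-bit signed integer range:

```python
# /proc/meminfo reports MemTotal in kB
int32_max = 2**31 - 1            # 2_147_483_647
memtotal_kb = 3 * 1024**3        # 3 TB in kB = 3_221_225_472
print(memtotal_kb > int32_max)   # True: reading this into a 32-bit
                                 # signed integer overflows
# A 64-bit integer has ample headroom
print(memtotal_kb < 2**63 - 1)   # True
```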
Changing a line in MAPL_ExtDataGridComp.F90 results in recompiling all of MAPL_Base and FVdycore when rebuilding with ifort. Only MAPL_ExtDataGridComp.F90 is rebuilt, however, when building with gfortran. This makes developing and debugging with gfortran superior to ifort. It would be great if we could eventually have the same functionality with ifort as well.
Hi everyone,
I'm trying to run a 30-core 1-day trial simulation with the 13.0.0-alpha.9 version, but the run ended after ~1 simulation hour with forrtl: error (73): floating divide by zero. The full log files are attached below.
163214_print_out.log
163214_error.log
More information:
ESMF_COMM=intelmpi
I'm not sure how to troubleshoot this issue. I tried to cmake the source code with -DCMAKE_BUILD_TYPE=Debug (with the fix in #35) and rerun the simulation, but it gives a really large error log file, so I'm not attaching it here. The first few lines of the error log are:
forrtl: error (63): output conversion error, unit -5, file Internal Formatted Write
Image PC Routine Line Source
geos 00000000094A364E Unknown Unknown Unknown
geos 00000000094F8D62 Unknown Unknown Unknown
geos 00000000094F6232 Unknown Unknown Unknown
geos 000000000226CC73 advcore_gridcompm 261 AdvCore_GridCompMod.F90
geos 0000000007F00A0D Unknown Unknown Unknown
geos 0000000007F0470B Unknown Unknown Unknown
geos 00000000083BF095 Unknown Unknown Unknown
geos 0000000007F0219A Unknown Unknown Unknown
geos 0000000007F01D4E Unknown Unknown Unknown
geos 0000000007F01A85 Unknown Unknown Unknown
geos 0000000007EE1304 Unknown Unknown Unknown
geos 0000000006827DDA mapl_genericmod_m 4545 MAPL_Generic.F90
geos 0000000006829035 mapl_genericmod_m 4580 MAPL_Generic.F90
geos 0000000000425200 gchp_gridcompmod_ 138 GCHP_GridCompMod.F90
geos 0000000007F00A0D Unknown Unknown Unknown
geos 0000000007F0470B Unknown Unknown Unknown
geos 00000000083BF095 Unknown Unknown Unknown
geos 0000000007F0219A Unknown Unknown Unknown
geos 0000000007F01D4E Unknown Unknown Unknown
geos 0000000007F01A85 Unknown Unknown Unknown
geos 0000000007EE1304 Unknown Unknown Unknown
geos 0000000006827DDA mapl_genericmod_m 4545 MAPL_Generic.F90
geos 0000000006A52D6C mapl_capgridcompm 482 MAPL_CapGridComp.F90
geos 0000000007F00B39 Unknown Unknown Unknown
geos 0000000007F0470B Unknown Unknown Unknown
geos 00000000083BF095 Unknown Unknown Unknown
geos 0000000007F0219A Unknown Unknown Unknown
geos 000000000844804D Unknown Unknown Unknown
geos 0000000007EE2A0F Unknown Unknown Unknown
geos 0000000006A67F42 mapl_capgridcompm 848 MAPL_CapGridComp.F90
geos 0000000006A39B5E mapl_capmod_mp_ru 321 MAPL_Cap.F90
geos 0000000006A370A7 mapl_capmod_mp_ru 198 MAPL_Cap.F90
geos 0000000006A344ED mapl_capmod_mp_ru 157 MAPL_Cap.F90
geos 0000000006A32B5F mapl_capmod_mp_ru 131 MAPL_Cap.F90
geos 00000000004242FF MAIN__ 29 GCHPctm.F90
geos 000000000042125E Unknown Unknown Unknown
geos 000000000042125E Unknown Unknown Unknown
libc-2.17.so 00002AFBC9F34505 __libc_start_main Unknown Unknown
geos 0000000000421169 Unknown Unknown Unknown
I also noticed something weird towards the start of the run:
MAPL: No configure file specified for logging layer. Using defaults.
SHMEM: NumCores per Node = 6
SHMEM: NumNodes in use = 1
SHMEM: Total PEs = 6
SHMEM: NumNodes in use = 1
Previous versions (12.8.2) usually show this instead:
In MAPL_Shmem:
NumCores per Node = 6
NumNodes in use = 1
Total PEs = 6
In MAPL_InitializeShmem (NodeRootsComm):
NumNodes in use = 1
but I'm not sure if that matters.
It would be useful if there was a way to generate gridspec files for the model grid. I see some references to gridspec files in ExtData output, which makes me think MAPL might have some support for generating gridspec files, and ESMF supports gridspec inputs. For stretched-grids (and normal cubed-spheres) the grid-box corner coordinates are useful for plotting model output, and a way to generate gridspec files for the model grid seems like the proper and cleanest way to provide these coordinates. This would also facilitate the use of ESMF's offline regridders, since ESMF_RegridWeightGen and ESMF_Regrid can take grid definitions for cubed-sphere and stretched-grid grids in the gridspec file format. I can inquire about this in the next MAPL call.
It would be nice if there was an option for geos, like
./geos --generate_gridspec
that generated the gridspec file.
Note that right now I have a custom script for generating a NetCDF file with corner coordinates, but I think it would be best to avoid solutions like this if possible.
It has been puzzling me that ~40% of GCHP simulation time is spent in MPI_Barrier, as shown by the IPM profiler (https://github.com/nerscadmin/IPM).
For example, here is the IPM profiling result of a 7-day c180 benchmark on 288 cores (version 12.3.2, runs on AWS):
##IPMv2.0.6########################################################
#
# command : ./geos
# start : Thu Sep 26 03:23:50 2019 host : ip-172-31-0-86
# stop : Thu Sep 26 08:47:40 2019 wallclock : 19430.50
# mpi_tasks : 288 on 8 nodes %comm : 48.44
# mem [GB] : 905.40 gflop/sec : 0.00
#
# : [total] <avg> min max
# wallclock : 5595949.52 19430.38 19430.11 19430.50
# MPI : 2710587.40 9411.76 7384.92 11767.68
# %wall :
# MPI : 48.44 38.01 60.56
# #calls :
# MPI : 6023502175 20914938 16331252 21806384
# mem [GB] : 905.40 3.14 3.04 11.42
#
# [time] [count] <%wall>
# MPI_Barrier 2280341.77 11672064 40.75
# MPI_Bcast 242708.94 48370752 4.34
# MPI_Allreduce 89036.54 49428288 1.59
# MPI_Wait 73996.00 2953775418 1.32
# MPI_Scatterv 21071.88 185184 0.38
# MPI_Isend 2117.89 1476895338 0.04
# MPI_Gatherv 969.00 5689728 0.02
# MPI_Irecv 266.05 1476880080 0.00
# MPI_Comm_create 30.24 576 0.00
# MPI_Recv 30.11 15258 0.00
# MPI_Comm_split 17.11 8064 0.00
# MPI_Allgather 1.17 864 0.00
# MPI_Reduce 0.45 1728 0.00
# MPI_Comm_rank 0.24 503073 0.00
# MPI_Comm_size 0.02 74600 0.00
# MPI_Comm_free 0.01 296 0.00
# MPI_Comm_group 0.00 576 0.00
# MPI_Init 0.00 288 0.00
#
###################################################################
Here the total wall time is 19430.50 seconds; the MPI_Barrier time is 2280341.77 / 288 = 7918 seconds (IPM prints the total time across all ranks), accounting for ~40% of the total time. The fraction of MPI_Barrier is reduced to ~30% at 576 cores and ~20% at 1152 cores, but this is still much larger than a normal value (being blocked 20%~40% of the time seems a bit ridiculous).
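The per-rank arithmetic above can be reproduced directly from the IPM totals:

```python
# Reproduce the per-rank MPI_Barrier time from the IPM report above.
total_barrier = 2280341.77   # MPI_Barrier time summed over all ranks [s]
ranks = 288
wallclock = 19430.50         # wall time of the run [s]

per_rank = total_barrier / ranks
print(round(per_rank))                 # 7918 seconds per rank
print(round(per_rank / wallclock, 2))  # 0.41 -> ~40% of wall time
```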
Full log:
I wrote a Python script to parse IPM results (https://github.com/JiaweiZhuang/ipm_util) so I can easily analyze & visualize MPI time.
Averaging over all ranks, MPI_Barrier takes much longer than any other MPI call.
Break into per-rank time:
(similar to the "Communication balance by task" plot in IPM's default HTLM report)
Same data but on individual panels:
Full notebook: https://gist.github.com/JiaweiZhuang/587a17fbb2b757182c5e49dcd3d1f8a9
The notebook reads these IPM XML log files: gchp_ipm_logs.zip
I originally expected that the MPI_Barrier time comes from old MAPL's serial I/O, where other ranks wait for the master rank to read data from disk. However, I/O can't explain the problem, because:
- The I/O time is far too short to account for the MPI_Barrier time (7918 seconds). There must be other components with load imbalance causing this long blocking, maybe advection or gas-phase chemistry (say, different spatial regions requiring different numbers of inner solver steps?).
- All ranks, including the master, show roughly uniform MPI_Barrier time (7900 ± 790 seconds) -- this cannot come from blocking I/O, where the master process should have near-zero barrier time.
A typical serial & blocking I/O pattern would look like this toy MPI_Barrier example, with the core body:
call MPI_Barrier( MPI_COMM_WORLD, ierror)
if (rank .eq. 0) then
call SLEEP(3) ! delaying everyone else
end if
call MPI_Barrier( MPI_COMM_WORLD, ierror)
in which case rank 0 will have zero MPI_Barrier time, while the other ranks will each have 3 seconds of MPI_Barrier time. IPM shows very accurate results on this toy program.
As a comparison, WRF doesn't have such a long MPI_Barrier, shown by this WRF profiling result
An intriguing observation is that the MPI_Barrier time decreases with the number of cores, while other MPI calls (especially MPI_Bcast) generally take longer with more cores (due to increased communication, obviously).
One hypothesis is that the MPI_Barrier time comes from the load imbalance of photochemistry, as the KPP solver time varies a lot between day and night. Since the chemistry component scales almost perfectly with core count, its time drops quickly with more cores.
We can locate those time-consuming MPI_Barrier calls by instrumenting code regions with MPI_Pcontrol(...), as shown in the IPM User Guide.
If such load imbalance does come from the chemical solver, then people should pay extra attention to the slowest spatial regions when trying to speed up chemistry solvers. This might be a GCHP-only problem; GC-Classic would be OK if OpenMP dynamic thread scheduling is used.
cc. @yantosca @lizziel something worth investigating if you want to profile the code.
OpenMPI (and other MPI implementations) can make use of OpenUCX, which provides some low-level functionality. @mathomp4 discovered that MPI built with OpenUCX v1.6 will give incomplete traceback information when throwing an error due to floating point exceptions, at least when using some Intel compilers (openucx/ucx#5611). This can be resolved by instead using OpenUCX v1.8.1. GCHPctm successfully compiles with OpenUCX v1.8.1 and OpenMPI v4.0.4, so this should be the recommended (open source) software stack for GCHPctm.
This discussion will pick up where issue #43 (formerly of GCHP repository) left off. GCHP 13.0.0 includes a continuous integration pipeline via Azure but currently only builds the model. It also only builds with a single configuration of compiler flags.
Having test runs on Azure is challenging due to the size of the GCHP input data but there may be work-arounds to get some simple form of testing implemented. @LiamBindle has also suggested outsourcing automated tests to his local cluster where the input data is available and memory/storage constraints are not an issue. @msulprizio is developing integration testing for GEOS-Chem which could also fulfill some of the GCHPctm automated testing needs.
This discussion is intended to be a forum for people to weigh in on GCHPctm testing needs and help develop a feasible plan to implement over the course of the GCHP 13 series.
Hi everyone,
I'm just submitting this for the archive of issues on GitHub.
ESMF_COMM=mpiuni
Yesterday I tried running the default 6-core 1-node 1-hour GCHP simulation and it crashed almost immediately. This happened with GCHPctm 13.0.0-alpha.1, but it could happen with any version that uses MAPL 2.0+. Below is the full output. The important parts to pick out are:
Failed run output:
In MAPL_Shmem:
NumCores per Node = 6
NumNodes in use = 1
Total PEs = 6
In MAPL_InitializeShmem (NodeRootsComm):
NumNodes in use = 1
Integer*4 Resource Parameter: HEARTBEAT_DT:600
Integer*4 Resource Parameter: HEARTBEAT_DT:600
Integer*4 Resource Parameter: HEARTBEAT_DT:600
Integer*4 Resource Parameter: HEARTBEAT_DT:600
Integer*4 Resource Parameter: HEARTBEAT_DT:600
Integer*4 Resource Parameter: HEARTBEAT_DT:600
NOT using buffer I/O for file: cap_restart
NOT using buffer I/O for file: cap_restart
NOT using buffer I/O for file: cap_restart
NOT using buffer I/O for file: cap_restart
NOT using buffer I/O for file: cap_restart
NOT using buffer I/O for file: cap_restart
pe=00001 FAIL at line=00250 MAPL_CapGridComp.F90 <something impossible happened>
pe=00001 FAIL at line=00826 MAPL_CapGridComp.F90 <status=1>
pe=00001 FAIL at line=00427 MAPL_Cap.F90 <status=1>
pe=00001 FAIL at line=00303 MAPL_Cap.F90 <status=1>
pe=00001 FAIL at line=00151 MAPL_Cap.F90 <status=1>
pe=00001 FAIL at line=00129 MAPL_Cap.F90 <status=1>
pe=00001 FAIL at line=00029 GEOSChem.F90 <status=1>
Abort(262146) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 1
pe=00002 FAIL at line=00250 MAPL_CapGridComp.F90 <something impossible happened>
pe=00002 FAIL at line=00826 MAPL_CapGridComp.F90 <status=1>
pe=00002 FAIL at line=00427 MAPL_Cap.F90 <status=1>
pe=00002 FAIL at line=00303 MAPL_Cap.F90 <status=1>
pe=00002 FAIL at line=00151 MAPL_Cap.F90 <status=1>
pe=00002 FAIL at line=00129 MAPL_Cap.F90 <status=1>
pe=00002 FAIL at line=00029 GEOSChem.F90 <status=1>
pe=00003 FAIL at line=00250 MAPL_CapGridComp.F90 <something impossible happened>
pe=00003 FAIL at line=00826 MAPL_CapGridComp.F90 <status=1>
pe=00003 FAIL at line=00427 MAPL_Cap.F90 <status=1>
pe=00003 FAIL at line=00303 MAPL_Cap.F90 <status=1>
pe=00003 FAIL at line=00151 MAPL_Cap.F90 <status=1>
pe=00003 FAIL at line=00129 MAPL_Cap.F90 <status=1>
pe=00003 FAIL at line=00029 GEOSChem.F90 <status=1>
pe=00000 FAIL at line=00250 MAPL_CapGridComp.F90 <something impossible happened>
pe=00000 FAIL at line=00826 MAPL_CapGridComp.F90 <status=1>
pe=00000 FAIL at line=00427 MAPL_Cap.F90 <status=1>
pe=00000 FAIL at line=00303 MAPL_Cap.F90 <status=1>
pe=00000 FAIL at line=00151 MAPL_Cap.F90 <status=1>
pe=00000 FAIL at line=00129 MAPL_Cap.F90 <status=1>
pe=00000 FAIL at line=00029 GEOSChem.F90 <status=1>
Abort(262146) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 2
Abort(262146) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 3
pe=00004 FAIL at line=00250 MAPL_CapGridComp.F90 <something impossible happened>
pe=00004 FAIL at line=00826 MAPL_CapGridComp.F90 <status=1>
pe=00004 FAIL at line=00427 MAPL_Cap.F90 <status=1>
pe=00004 FAIL at line=00303 MAPL_Cap.F90 <status=1>
pe=00004 FAIL at line=00151 MAPL_Cap.F90 <status=1>
pe=00004 FAIL at line=00129 MAPL_Cap.F90 <status=1>
pe=00004 FAIL at line=00029 GEOSChem.F90 <status=1>
Abort(262146) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 4
pe=00005 FAIL at line=00250 MAPL_CapGridComp.F90 <something impossible happened>
pe=00005 FAIL at line=00826 MAPL_CapGridComp.F90 <status=1>
pe=00005 FAIL at line=00427 MAPL_Cap.F90 <status=1>
pe=00005 FAIL at line=00303 MAPL_Cap.F90 <status=1>
pe=00005 FAIL at line=00151 MAPL_Cap.F90 <status=1>
pe=00005 FAIL at line=00129 MAPL_Cap.F90 <status=1>
pe=00005 FAIL at line=00029 GEOSChem.F90 <status=1>
Abort(262146) on node 5 (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 5
Abort(262146) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 0
The issue was that ESMF was built with ESMF_COMM=mpiuni. This appears to have happened because the spack install spec wasn't quite right, but I didn't build ESMF myself so I can't be sure.
The build-time value of ESMF_COMM is written to esmf.mk beside your ESMF libraries. You can see it with the following command:
grep 'ESMF_COMM' $(spack location -i esmf)/lib/esmf.mk
or
grep 'ESMF_COMM' /path/to/ESMF/libraries/esmf.mk
Rebuild ESMF and make sure ESMF_COMM is set to the appropriate MPI flavor.
Hello,
To get more familiar with the multi-run option, I am trying to split a 3-hour simulation into 3 jobs (GCHP version 12.9.3), following the respective wiki instructions. However, after the first run, the simulation crashes with the error: cap_restart did not update to different date. Checking the cap_restart file, I can see that it is empty.
Maybe the problem is that the date of the first job is not written to the cap_restart file, and thus the second job cannot start running? Which part of the code writes the cap_restart file?
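For reference, cap_restart in a MAPL-based run directory is a one-line text file of the form "YYYYMMDD HHMMSS", written by MAPL at the end of a successful segment. A multi-run wrapper can guard against exactly this failure mode with a small check; check_cap_restart below is a hypothetical helper, not part of GCHP:

```python
from pathlib import Path

def check_cap_restart(path, previous):
    """Return the new 'YYYYMMDD HHMMSS' stamp, or raise if the file is
    empty or did not advance past `previous`. Hypothetical helper for a
    multi-run driver script."""
    text = Path(path).read_text().strip()
    if not text:
        raise RuntimeError("cap_restart is empty; previous segment failed")
    if text == previous:
        raise RuntimeError("cap_restart did not update to different date")
    return text

# Demo with a temporary file standing in for the run directory's cap_restart
import os
import tempfile
with tempfile.NamedTemporaryFile("w", suffix="cap_restart", delete=False) as f:
    f.write("20080601 000000\n")
stamp = check_cap_restart(f.name, previous="20080501 000000")
print(stamp)  # 20080601 000000
os.unlink(f.name)
```

An empty cap_restart after segment 1, as described above, means the first job never completed its MAPL finalize step, so the root cause is in the first run rather than in the multi-run scripting.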
Compilation commands
Run commands
./gchp.multirun.sh
There are some errors in the slurm file of the first job (attached), and an error in the multirun.log file:
Error: cap_restart did not update to different date
input.geos: input.geos.txt
HEMCO_Config.rc: HEMCO_Config.rc.txt
HEMCO.log: HEMCO.log.txt
ExtData.rc: ExtData.rc.txt
HISTORY.rc: HISTORY.rc.txt
GCHP.rc: GCHP.rc.txt
GCHP compile log file: compile.log.txt
CAP.rc: CAP.rc.txt
runConfig.sh: runConfig.sh.txt
GCHP run log file: multirun.log.txt
slurm.out or any other error messages from your scheduler: slurm-512145.err.txt
slurm.error: slurm-512145.out.txt
gchp.multirun.run: gchp.multirun.run.txt
gchp.multitun.sh: gchp.multirun.sh.txt
I would appreciate it if you could provide some help to solve my problem. Thank you in advance.
Regards,
Maria Tsivlidou
When I do cmake with the latest main branch, I get a compile error, which I traced back to src/MAPL/MAPL_cfio_r4/CMakeFiles/MAPL_cfio_r4.dir/flags.make. The Fortran_FLAGS variable includes -check bounds uninit, but for ifort these options should be comma-separated, i.e. -check bounds,uninit.
Hello,
I am Maria Tsivlidou, a PhD student in Laboratoire d'Aerologie in Toulouse, France supervised by Bastien Sauvage and Brice Barret. I am trying to restart a GCHP simulation using an existing restart file, and I would like to kindly ask your assistance for an error I have.
I used GCHP version 12.9.3 to produce a successful simulation with start date 20080501 and end date 20080601. At the end of the run the checkpoint file was created (gcchem_internal_checkpoint.restart.20080601_000000.nc4.txt). Now I am trying to restart the run since 20080601, but it is not working.
Compilation commands
1. make clean_all
2. make build_all
Run commands
The error message in the gchp.log file is: Mem/Swap Used (MB) at GIGCenvMAPL_GenericInitialize= 6.8260E+03 0.0000E+00
ERROR: Timer TOTAL needs to be set first
ERROR: Timer INITIALIZE needs to be set first
ERROR: Timer TOTAL needs to be set first
ERROR: Timer INITIALIZE needs to be set first
ERROR: Timer TOTAL needs to be set first
ERROR: Timer GenInitTot needs to be set first
ERROR: Timer --GenInitMine needs to be set first
Also, there are several errors in the slurm.out.txt file attached below.
- GEOS-Chem version: GCHP 12.9.3
- Compiler version: ifort 18.0.2 20180210
- netCDF version: netcdf/4.7.3-openmpi
- netCDF-Fortran version (if applicable): __
- Did you run on a computational cluster, on the AWS cloud: No
- If you ran on the AWS cloud, please specify the Amazon Machine Image (AMI) ID: __
- Are you using GEOS-Chem "out of the box" (i.e. unmodified): No
- If you have modified GEOS-Chem, please list what was changed: __
- lastbuild: __
- input.geos: input.geos.txt
- HEMCO_Config.rc: HEMCO_Config.rc.txt
- GEOS-Chem "Classic" log file: gchp.log.txt
- HEMCO.log: HEMCO.log.txt
- slurm.out or any other error messages from your scheduler: slurm-510683.out.txt
- runConfig: runConfig.sh.txt
sbatch script: gchp_nuwa.run.txt
checkpoint_restart file: gcchem_internal_checkpoint.restart.20080601_000000.nc4.txt
I had the same error even when I tried to restart a 6-month simulation that was interrupted after 3 months, using the last restart file that was created.
This is a quick note. For the 13.0.0 release, we should try to have some documentation explaining how to modify GCHPctm (updating submodules on your own fork, checking out branches on your own fork, etc.) now that it has submodules.
To plot data on a curvilinear grid with routines like matplotlib's pcolormesh(), the coordinates of grid-box edges are necessary. Currently the diagnostics don't include edge coordinates, and there's no easy way to get them.
GCPy calculates edge coordinates itself (privately), and uses those to plot cubed-sphere data. (See here).
else:
    # Cubed-sphere single level
    ax.coastlines()
    try:
        if masked_data == None:
            masked_data = np.ma.masked_where(
                np.abs(grid["lon"] - 180) < 2,
                plot_vals.data.reshape(6, res, res)
            )
    except ValueError:
        # Comparison of numpy arrays throws errors
        pass
    [minlon, maxlon, minlat, maxlat] = extent
    # Catch issue with plots extending into both the western and eastern hemisphere
    if np.max(grid["lon_b"] > 180):
        grid["lon_b"] = ((grid["lon_b"] + 180) % 360) - 180
    for j in range(6):
        plot = ax.pcolormesh(
            grid["lon_b"][j, :, :],
            grid["lat_b"][j, :, :],
            masked_data[j, :, :],
            transform=proj,
            cmap=comap,
            norm=norm
        )
In this snippet, the grid dict contains arrays with edge coordinates. The grid dict is generated by call_make_grid(). But these coordinates are not available to users.
Below are some ideas that come to mind:
I look forward to the thoughts of others. Is there a solution to this that I'm not aware of?
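For context on why edges are needed: pcolormesh() with flat shading expects edge arrays with one more point per dimension than the data. Below is a minimal sketch of recovering 1-D edges from cell centers. This is my own illustration, not GCPy's API; the function name and the uniform-spacing assumption are mine.

```python
import numpy as np

def centers_to_edges(centers):
    """Estimate N+1 edge coordinates from N cell centers (1-D).

    Interior edges are midpoints between neighboring centers; the two
    outermost edges are extrapolated half a cell beyond the end centers.
    Assumes locally uniform spacing.
    """
    centers = np.asarray(centers, dtype=float)
    interior = 0.5 * (centers[:-1] + centers[1:])
    first = centers[0] - 0.5 * (centers[1] - centers[0])
    last = centers[-1] + 0.5 * (centers[-1] - centers[-2])
    return np.concatenate([[first], interior, [last]])

# A 4-cell row of 1-degree boxes centered at 0.5..3.5 yields edges 0..4.
print(centers_to_edges([0.5, 1.5, 2.5, 3.5]))  # [0. 1. 2. 3. 4.]
```

Exporting edge coordinates in the diagnostics themselves would make this kind of reconstruction unnecessary, which is part of the motivation here.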
Writing the initial restart file can be very slow with IntelMPI for big simulations/high core counts.
Last week I was running a C360 simulation on 900 cores, and it got stuck writing the first gcchem_internal_checkpoint. It was writing so slowly that it would have taken >1 day.
Related: GEOS-ESM/MAPL#548
Setting the following environment variable fixed it for me:
export I_MPI_ADJUST_GATHERV=3
Set this environment variable to select the desired algorithm(s) for the collective operation under particular conditions. Each collective operation has its own environment variable and algorithms.
Environment Variables, Collective Operations, and Algorithms
Environment Variable | Collective Operation | Algorithms |
---|---|---|
I_MPI_ADJUST_GATHERV | MPI_Gatherv | 1. Linear; 2. Topology aware linear; 3. Knomial |
Hello, I have been running GCHPctm at c24 with no issue using GEOS-Chem 12.8.2, but I recently wanted to change my simulation resolution to c180 and am running into problems.
Some info:
- Running on NCAR Casper environment
- Running on 8 nodes, 288 cores total (but this error persists if I request different numbers of nodes/cores)
- Using OpenMPI 4.0.3 and the gfortran 8.3.0 compiler
Regardless of the resources I request from my cluster, the simulation crashes at this point in the log file:
MAPL ExtData initialization complete
Mem/Swap Used (MB) at MAPL_Cap:TimeLoop = 1.919E+05 3.103E+03
Calling MAPL ExtData Run_
ExtData Run_: READ_LOOP
ExtData Run_: ---PopulateBundle
ExtData Run_: ---CreateCFIO
ExtData Run_: ---prefetch
I don't receive an error message from the code, but I see this message in my SLURM file and find core files in my run directory:
[casper15:68225:0:68225] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffff2d2f75588)
==== backtrace (tid: 68225) ====
What might be causing this?
For the transport tracers simulation GCHP still needs to go through the GEOS-Chem classic code to get information about the tracers used in a given simulation. Ideally GCHP would instead use the NASA GMAO tracer gridded component called TR_GridComp used in the GEOS system.
Currently it is not its own repository but is part of https://github.com/GEOS-ESM/GEOSchem_GridComp. For now it is best to simply copy it into GCHPctm as a start. See discussion with GMAO on this here: GEOS-ESM/GEOSchem_GridComp#51.
todo
Right now BaseTarget gets compiler flags transitively from MAPL. We should revisit this before release because the current list of compiler flags is a big list of oftentimes repeated flags.
Currently GCHP does not obey MAPL conventions regarding which way is "up" in exported data. It appears that - when writing out - data acquired through the GEOS-Chem diagnostic arrays are produced "inverted" (level 1 = surface, even though HISTORY is meant to output with level 1 = TOA) but all other data are produced "right way up" (level 1 = TOA). I've opened a request in the GEOS-ESM/MAPL repo to make the "positive" attribute of vertical data be something that is communicated explicitly in imports and exports (GEOS-ESM/MAPL#284), which would resolve this issue. However, this will require that we modify GCHP to comply with MAPL's standards.
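For reference, converting between the two conventions is just a reversal along the level axis. A minimal numpy sketch follows; the array shape is hypothetical and this is not GCHP's actual I/O code:

```python
import numpy as np

# Hypothetical diagnostic array with 72 levels: index 0 = surface
# (the "inverted" convention described above).
surface_up = np.arange(72 * 2 * 3).reshape(72, 2, 3)

# Reverse the level axis so index 0 = top of atmosphere (TOA), as
# HISTORY expects; only the ordering changes, not the data values.
toa_up = surface_up[::-1, :, :]

assert toa_up[0, 0, 0] == surface_up[-1, 0, 0]
```

The flip itself is trivial; the hard part this issue describes is knowing (and communicating via metadata such as a "positive" attribute) which convention a given export actually uses.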
GCHP crashes and says there is an error reading in lightning NOx when I try to restart the multirun set of simulations. GCHP also crashes on a leap day with a MAPL error. I don't know if the two errors are related.
GCHP should read in lightning NOx and proceed with the simulation, and it shouldn't crash in the first place when getting to the leap day.
GCHP crashes and says there is an error reading in lightning NOx when I try to restart a multirun set of simulations, and crashes on a leap day.
Start a single-run simulation on 20160229 000000 (or 20160207 000000, or seemingly any time in February 2016).
Or attempt to restart a multirun simulation set that previously crashed by using an existing cap_restart and the last restart file from the multirun (restarting on February 1).
For the leap day simulations I've now had multiple simulations crash for the month of February when getting to 00:00 on Feb 29. See the log file below for an example.
Compilation commands
I used cmake and ifort 18, with the standard environment used by Lizzie Lundgren and with RRTMG on.
Run commands
I used the gchp.run script.
For the lightning NOx crash the .out file says:
ExtData could not find bracketing data from file template ./HcoDir/OFFLINE_LIGHTNING/v2020-03/GEOSFP/%y4/FLASH_CTH_GEOSFP_0.25x0.3125_%y4_%m2.nc4 for side L
The .err files for both types of crashes have lots of MAPL errors and MPI abort errors. See the relevant log files listed below.
HEMCO.log didn't have anything specific for either of the two errors.
See here on Cannon for all the above files: /n/holyscratch01/jacob_lab/jmoch/geoE_rdirs/GCHP_13.0.0_geoE_off_vtest3
The relevant log files are slurm-7116496.out and slurm-7116496.err for the initial crash, and slurm-7180457.out and slurm-7180457.err for the crash when I try to restart it and get a lightning NOx error.
I believe the GCHP adjoint code is close to a state where I can submit a pull request. However, I have had to fork the geos-chem, MAPL, FVdycore, and HEMCO submodule repositories and make changes to them in addition to the GCHP code. Is that multiple pull requests then? How can they be coordinated? More generally, is there a guide about coding and testing requirements before submitting the request?
The build with the ESMF 8.0.0 official release is failing because use ESMF doesn't define the ESMF_CS_Arguments type, which is used in MAPL_Base/MAPL_CubedSphereGridFactory.F90.
I'm looking into this...
As of Intel MPI 2019 Update 8 and libfabric 1.10.0, there is a bug related to registering memory that causes a crash in GCHP when using certain fabric providers. This was originally identified as an issue when using the EFA provider on AWS EC2, but has also been encountered on systems that use the Verbs provider. This issue may be fixed in libfabric 1.11.0. For users who cannot update the libfabric version on their system, a temporary solution is to put the line export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 in gchp.env. This bug is not relevant to users of other MPI implementations such as OpenMPI.
An update to restrict the mass balance operation to the troposphere, submitted by Sebastian Eastham (@sdeastham ) in 12.1.1, has not yet migrated from old GCHP to new GCHPctm. The old GCHP commit is 7a4589c. This update needs to be manually applied to the new FV submodule in GCHPctm, ideally before the 13.0.0 release.
Hi everyone, I was thinking today about how to add the ability to vertically flip metfields (for running from native metfields), and I came up with two ideas:
Any thoughts? Does anyone have a different idea in mind?
GCHP crashes when I try to use the ".grid_label" and ".conservative" fields for any collection.
GCHP should run successfully and regrid the output from native cubed sphere to the lat-lon grid.
GCHP crashes and has errors pointing to MAPL (e.g. MAPL_HistoryGridComp.F90, MAPL_Generic.F90, etc.)
I used cmake and ifort 18, with the standard environment used by Lizzie Lundgren.
Run commands
I used the gchp.run script (single run)
Nothing says "add text here", but a lot of messages say "needs informative message":
pe=00000 FAIL at line=01064 MAPL_HistoryGridComp.F90 <needs informative message>
pe=00000 FAIL at line=01829 MAPL_Generic.F90 <needs informative message>
pe=00000 FAIL at line=00614 MAPL_CapGridComp.F90 <status=1>
pe=00000 FAIL at line=00559 MAPL_CapGridComp.F90 <status=1>
pe=00001 FAIL at line=01064 MAPL_HistoryGridComp.F90 <needs informative message>
pe=00001 FAIL at line=01829 MAPL_Generic.F90 <needs informative message>
pe=00001 FAIL at line=00614 MAPL_CapGridComp.F90 <status=1>
pe=00001 FAIL at line=00559 MAPL_CapGridComp.F90 <status=1>
pe=00001 FAIL at line=00849 MAPL_CapGridComp.F90 <status=1>
pe=00001 FAIL at line=00322 MAPL_Cap.F90 <status=1>
pe=00001 FAIL at line=00198 MAPL_Cap.F90 <status=1>
pe=00001 FAIL at line=00157 MAPL_Cap.F90 <status=1>
... (the same sequence of FAIL messages repeats for pe=00002, pe=00003, pe=00005, etc.)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD
with errorcode 262146.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
In: PMI_Abort(262146, N/A)
(the same MPI_ABORT block repeats for rank 21)
...
see more in the log file
see here on Cannon for all the above files: /n/holyscratch01/jacob_lab/jmoch/geoE_rdirs/GCHP_13.0.0_geoE_off_vtest2
the log file relevant is: slurm-6797176.out
The ESMF interface target needs to link:
libpnetcdfc.a
libxerces-c-3.2.a
I am working off of version 12.5 in the old repo, but I found the source of the problem and that code is still in the gchp_ctm repo, so I thought I'd drop this bug report here. The symptoms may be different in gchp_ctm, but I'll post what happens to my runs here.
Since the HEMCO.log file is the first persistent output file opened, it gets a unit number of 11 from findFreeLUN. Then, at some point after EMISSIONS_INIT is finished, MAPL_cfio/ESMF_CFIOMod.F90::ESMF_CFIOFileOpen is called and these lines are executed:
858 open(11, file=fileName)
859 read(11, '(a)') dset
860 close(11)
This closes my HEMCO.log file handle. The next time HCO_MSG gets called, it opens a new default file name, which is fort.11, and the HEMCO log output continues in that file. It's obviously not critical, but it is annoying and should be easy to fix. For my code, because I couldn't access inquireMod from the MAPL_io folder, I've just added a loop searching for a free LUN before that call and replaced the 11 with that LUN variable. I don't know what the most copacetic fix is for the new repo.
I'm opening this on behalf of Isaiah Sauvageau from Drexel University. He writes:
Hello,
I am attempting to build and run GCHP, but I am having trouble initializing a build directory. The detailed description of my problem is shared through a paper linked here (https://paper.dropbox.com/doc/GCHP-build-Initialization--A_84vqolsYb3Vs14zP4J02SdAQ-OTU3d7EZVuYVxafmeAbWe). Any assistance would be greatly appreciated. If there is more information required, I am happy to share.
Thank you,
Isaiah Sauvageau
Pronouns: He/Him/His
PhD Candidate Environmental Engineering
Drexel University | College of Engineering
4/8/2021 Update: Table now reflects versions to be included in 13.1
5/14 Update: Advection libraries will also be updated in 13.1
GCHP 13.0 includes upgrades to all GMAO libraries relative to what is used in GCHP 12. However, most of these libraries have already had several additional version releases. We should upgrade again for GCHP 13.1.
Below is a table of each GCHPctm submodule that is a fork from GMAO, either GEOS-ESM or Goddard-Fortran-Ecosystem. Let me know if you think I missed one of the repos used in GCHPctm. The target versions are the latest available, but if there is a newer version at the time this work is done then we take that for the merge. Only take tagged versions that are on the upstream main branch.
Repository versions in 13.0 and targets for 13.1
Repository | Version in 13.0.0 | Target for 13.1 | Notes |
---|---|---|---|
FMS | geos/orphan/v1.0.3 | geos/2019.01.02+noaff.6 | Our version is an orphan branch so special handling is needed for the upgrade. |
MAPL | v2.2.7 | v2.6.3 | Working on bug fix for v2.6.4. |
GMAO_Shared | v1.1.6 | v1.3.8 | We do not use most of the content in this repository. Update the skip list in CMakeLists.txt as needed. |
fvdycore | geos/v1.1.2 | v1.1.6 | Beware this is a submodule within a submodule. |
FVdycoreCubed_GridComp | v1.1.3 | v1.2.12 | An internal benchmark is essential when updating this library due to potential changes in offline advection which is not thoroughly tested at GMAO. |
ESMA_cmake | v3.0.6 | v3.0.6 | No version change in 13.1. |
ecbuild | geos/v1.0.5 | geos/v1.0.6 | If there is a new version to upgrade to, beware this is a submodule within a submodule. |
gFTL-shared | v1.0.7 | v1.2.0 | We are using a fork of this repo but perhaps do not need to. |
gFTL | v1.2.5 | v1.3.1 | We are not using a fork of this repo. |
pFlogger | v1.4.2 | v1.5.0 | We do not yet harness the full power of this library, but should in the future. |
yaFyaml | v0.4.0 | v0.5.0 | We should be able to use this to read the GEOS-Chem species database, but it has not yet been tried. |
pFUnit | v4.1.9 | v4.2.0 | We do not currently build this library, but it is included in GCHPctm for potential future use. |
Related to this, another goal I have is to use Goddard-Fortran-Ecosystem/GFE which Tom put together at my request to bundle the Goddard-Fortran-Ecosystem repos together. We generally do not change these libraries so could avoid using forks potentially. I have permissions to make branches on the upstream as needed. GFE could sit as a submodule within GCHPctm/src, replacing gFTL-shared, pFlogger, pFUnit, and yaFyaml in that directory, which would be much cleaner.
When I create a GCHPctm run directory, I've noticed that the CodeDir symbolic link points to the wrong directory. For example, after cloning GCHPctm and checking out the submodules:
cd GCHPctm/run
./createRunDir.sh
... then follow all the prompts to create your desired run dir type ...
... then cd into the rundir you just created...
ls -l CodeDir
The output I got was:
/n/holyscratch01/jacob_lab/ryantosca/GCHP/GCHPctm/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/run/
which is pointing to the run creation directory. But this should point to the top-level GCHPctm folder, i.e.:
/n/holyscratch01/jacob_lab/ryantosca/GCHP/GCHPctm
Manually unlink the CodeDir and reset it to the top-level GCHPctm folder, i.e.
unlink CodeDir
ln -s /n/holyscratch01/jacob_lab/ryantosca/GCHP/GCHPctm CodeDir
then compile as shown in the README.md
This is documentation of an issue I ran into with gchp_ctm (3f06a1b) with ESMF 8 and Intel 19 (this was on CentOS 7). When I compiled gchp_ctm I got the following link error:
Scanning dependencies of target geos
[100%] Building Fortran object src/CMakeFiles/geos.dir/GEOSChem.F90.o
[100%] Linking Fortran executable geos
ld: geos: hidden symbol `__intel_cpu_features_init_x' in /opt/intel/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin/libirc.a(cpu_feature_disp.o) is referenced by DSO
ld: final link failed: Bad value
make[3]: *** [src/geos] Error 1
make[2]: *** [src/CMakeFiles/geos.dir/all] Error 2
make[1]: *** [src/CMakeFiles/geos.dir/rule] Error 2
This can be fixed by adding -lintlc to ESMF's link libraries. I guess this library is the dynamic version of libirc.a, which contains Intel-specific optimizations (according to here).
This issue can be fixed with the following patch to ESMA_CMake:
diff --git a/FindESMF.cmake b/FindESMF.cmake
index df05906..f10027e 100755
--- a/FindESMF.cmake
+++ b/FindESMF.cmake
@@ -86,7 +86,7 @@ find_package(NetCDF REQUIRED)
find_package(MPI REQUIRED)
execute_process (COMMAND ${CMAKE_CXX_COMPILER} --print-file-name=libstdc++.so OUTPUT_VARIABLE stdcxx OUTPUT_STRIP_TRAILING_WHITESPACE)
execute_process (COMMAND ${CMAKE_CXX_COMPILER} --print-file-name=libgcc.a OUTPUT_VARIABLE libgcc OUTPUT_STRIP_TRAILING_WHITESPACE)
-set(ESMF_LIBRARIES ${ESMF_LIBRARY} ${NETCDF_LIBRARIES} ${MPI_Fortran_LIBRARIES} ${MPI_CXX_LIBRARIES} rt ${stdcxx} ${libgcc})
+set(ESMF_LIBRARIES ${ESMF_LIBRARY} ${NETCDF_LIBRARIES} ${MPI_Fortran_LIBRARIES} ${MPI_CXX_LIBRARIES} rt -lintlc ${stdcxx} ${libgcc})
set(ESMF_INCLUDE_DIRS ${ESMF_HEADERS_DIR} ${ESMF_MOD_DIR})
# Make an imported target for ESMF
To me, issues like this seem to be a symptom of a larger issue which is how to determine transitive usage requirements from dependencies that aren't built with CMake.