
gchp's People

Contributors

bettycroft, branfosj, jourdan-he, laestrada, liambindle, lizziel, msulprizio, sdeastham, williamdowns, yantosca, yidant


gchp's Issues

[DISCUSSION] Spurious pcolormesh wrapping is fixed

Hi everyone,

I mentioned this on Slack last week, but plotting full-globe GCHP data with pcolormesh() can be difficult because you'll get horizontal streaks for grid boxes that cross the antimeridian. Below are examples.

TLDR: Use cartopy version 0.19 or greater if you want to plot GCHP data for the entire globe.

Plotting CS data

Set up the figure

import matplotlib.pyplot as plt
import cartopy.crs as ccrs

ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_global()
ax.coastlines()


Plot a face that doesn't cross the antimeridian (looks good)

import xarray as xr

ds = xr.open_dataset('GCHP.SpeciesConc.nc4')
plt.pcolormesh(
    ds.lons.isel(nf=4).values, 
    ds.lats.isel(nf=4).values, 
    ds.SpeciesConc_NO2.isel(nf=4, lev=0, time=0).values,  
    vmax=8e-9
)

Now, plot a face that does cross the antimeridian (results in horizontal streaking)

plt.pcolormesh(
    ds.lons.isel(nf=3).values, 
    ds.lats.isel(nf=3).values, 
    ds.SpeciesConc_NO2.isel(nf=3, lev=0, time=0).values,  
    vmax=8e-9
)


Plotting SG data

This is illustrated a bit better with stretched-grids.

Again, set up the figure

ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_global()
ax.coastlines()


Plot a face that doesn't cross the antimeridian

ds = xr.open_dataset('GCHP.SpeciesConc.nc4')
plt.pcolormesh(
    ds.lons.isel(nf=0).values, 
    ds.lats.isel(nf=0).values, 
    ds.SpeciesConc_NO2.isel(nf=0, lev=0, time=0).values, 
    vmax=8e-9
)


Next, plot a face that does cross the AM

plt.pcolormesh(
    ds.lons.isel(nf=1).values, 
    ds.lats.isel(nf=1).values, 
    ds.SpeciesConc_NO2.isel(nf=1, lev=0, time=0).values, 
    vmax=8e-9
)


The issue and workarounds

The PlateCarree projection is a 2D space that is unaware of wrapping at the antimeridian. Workarounds include:

  • Don't plot data that's close to the antimeridian (e.g., GCPy doesn't plot grid boxes that are within 2 deg of the AM; see here and the sketch below)
  • Manually draw polygons in a gnomonic projection for grid boxes that cross the AM (this is what I've preferred doing because it properly wraps the data around the AM)
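
For reference, here is a minimal sketch of the first workaround (masking grid boxes near the antimeridian before calling pcolormesh), in the spirit of what GCPy does. The cell-center longitude array lons_c is an assumption (it isn't defined in the snippets above), and the 2-degree cutoff is arbitrary:

import numpy as np

# Mask grid boxes whose centers fall within 2 degrees of the antimeridian so
# pcolormesh skips them. Assumes lons_c holds cell-center longitudes in
# [-180, 180] with the same shape as the data; ds and plt come from the
# snippets above.
def mask_near_antimeridian(data, lons_c, cutoff_deg=2.0):
    near_am = np.abs(np.abs(lons_c) - 180.0) < cutoff_deg
    return np.ma.masked_where(near_am, data)

data = ds.SpeciesConc_NO2.isel(nf=3, lev=0, time=0).values
masked = mask_near_antimeridian(data, lons_c)
plt.pcolormesh(
    ds.lons.isel(nf=3).values,
    ds.lats.isel(nf=3).values,
    masked,
    vmax=8e-9
)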

Related issues

A recent fix

Fixed in SciTools/cartopy#1622 (thanks @htonchia and @greglucas). The fix was merged on August 19, 2020. IIUC it will be released in cartopy version 0.19.
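
A quick way to check which cartopy version you have installed (assuming cartopy is importable in your plotting environment):

import cartopy
print(cartopy.__version__)  # want 0.19 or greater for this fix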

[BUG/ISSUE] SPC_RESTART or GEOSCHEM_RESTARTS?

Describe the bug:

Is ExtData/SPC_RESTARTS or ExtData/GEOSCHEM_RESTARTS the proper place for our restarts? Currently, ./createRunDir.sh links to restarts in SPC_RESTARTS, but those don't exist on ComputeCanada. I assume ./createRunDir.sh needs to be updated--is that right?

[BUG] Parts of GCHPctm built with OpenMP

GCHPctm should not be built with OpenMP. However, MAPL is built with OpenMP, and possibly other components as well. As a result, I ran into a compile issue with ifort 19 for 13.0.0-alpha.8 due to an improper OpenMP directive (OMP serial). The fix for that issue went into 13.0.0-alpha.9, but the larger issue that GCHPctm is being compiled with OpenMP remains. I believe the issue has to do with the settings in the CMake files within MAPL, and this needs to be looked at further.

[ANNOUNCEMENT] Master branch to be replaced with main

On Friday July 24, the master branch of GCHPctm will be deleted and the new branch main will take its place. At the same time, main will be moved up to the latest alpha pre-release version. If you have not already, take a look at the GCHPctm Releases page to view the GCHPctm pre-release versions available and what they contain.

If you are currently using the master branch you do not need to change anything. If you decide to update versions, simply do the following:

git fetch
git checkout main

If you have a fork of GCHPctm and have a second remote connected to the upstream (geoschem/gchpctm), you can do the following:
git fetch upstream (or whatever your upstream remote is called)

The main branch will then be available to checkout or merge:
git checkout upstream/main

To delete the stale branches from any of your remotes, do the following:
git fetch {remotename} --prune

[BUG/ISSUE] Differences in output when splitting a run into consecutive shorter runs

It has been a known issue for a long time that GCHP does not give exactly the same final result between a long single run and the identical run split up into shorter durations. This has been true for both the transport tracer and the full chemistry simulations.

This is especially problematic for GCHP because currently the only way to output monthly mean diagnostics is to break up a run into 1-month run segments. A monthly mean capability was supposed to be included within MAPL for the 13.0.0 release but that update is not yet ready in a MAPL release. Since we output monthly means in GCHP 1-year benchmarks I have been looking more closely at this issue to find fixes before we do the 13.0.0 benchmark.

Recent updates that are going into GEOS-Chem 13.0.0 correct this problem for transport tracers. Bug fixes in the GEOS-Chem and HEMCO submodules resolved the issue and the simulation now gives zero diffs regardless of how the run is split up. See the following posts on GitHub for more information on these updates:

Differences persist in the full chemistry simulation and I am actively looking into them.

[DISCUSSION] The status of CO2-only mode?

Kevin Bowman suggests that CO2-only mode can be a good use case for GCHP-on-cloud, as it requires much less I/O, which is the major bottleneck on the AWS cloud.

I haven't used the CO2 mode before and would like to learn more about its current status:

  • Is it up-to-date with the standard version? Can it be turned on just within the standard code repo, or is it only available in some frozen copy that is potentially several versions behind? I'm asking this to estimate the amount of technical work involved, as we've applied a few fixes to GCHP for compatibility with the AWS environment.
  • What's the input data requirement?
  • What's the typical breakdown of timing? (For full-chem it is roughly 20% transport + 60% GIGC + 20% I/O)
  • Any user manual available? (checked http://wiki.seas.harvard.edu/geos-chem/index.php/GEOS-Chem_HP but didn't find related info)
  • Documented scientific use case or paper?

Tagging @sdeastham in particular, who should have more experience with the CO2 mode.

[DISCUSSION] GCHP needs continuous integration (with a build matrix)

Problem

The difficulty of building GCHP (despite the large improvement over early versions) is preventing user adoption and eating a lot of engineering time (e.g. on debugging makefiles). In particular, it is time-consuming to diagnose compiler/MPI-specific problems, as there are so many combinations of them.

So far this problem is being treated passively -- we stick to the few combinations we know work (notably ifort + OpenMPI 3). Other combinations (gfortran, other MPIs) are handled case by case, typically when a user hits a bug on a specific system.

Suggestions

The sustainable way (which will save a lot of engineering time in the long term) is to deal with this problem actively -- we should explore all common combinations using the build matrix offered by most continuous integration (CI) services.

The components of the build matrix include:

  • OS: CentOS, Ubuntu, ...
    (for issues like #17)
  • Compiler: ifort + icc, ifort + gcc, gfortran + gcc, different versions of gfortran, ...
    (for issues like #15 and many other compiler problems)
  • MPI: OpenMPI , MVAPICH, MPICH, Intel MPI, ...
    (for issues like #9 #35 and many other MPI problems)

By having a continuous build at every commit / every minor release, we will be able to:

  • know which combination works and which doesn't
  • avoid breaking the combinations that already work
  • try to make more combinations work correctly

This also helps users find the "shortest path" to solving their specific error. An example question is "my build is failing on Ubuntu + gfortran + mpich; which component should I change to fix the problem?" By looking at the matrix, you can see that (for example) changing the MPI implementation can lead to a correct build.

Where to start

A simple CI (on Travis) for GC-Classic is geoschem/geos-chem#11. However, the memory and compute limits on Travis probably won't allow building GCHP. Other potentially better options are:

  • Azure pipelines (free)
  • GitHub actions (free)
  • AWS CodeBuild and CodePipeline (costs money, but allows more compute; could potentially grab input data from S3 to also run the model, since GCHP has several run-time bugs that cannot be detected at compile time)

Tutorial-like pages:

Existing models for reference:

  • CLIMA is the only Earth science model I am aware of that has continuous integration (on Azure pipelines).
  • Trilinos uses a trilinos-autotester bot to run tests on PRs. I guess it runs on an on-premises cluster.

[FEATURE REQUEST] HEMCO as gridded component

HEMCO will be a separate sub-project from GEOS-Chem in 13.0.0-alpha.4. Ideally it would also be used as an ESMF gridded component. I am aiming to implement this for 13.0.0-alpha.5.

[QUESTION] Can I output chemical production and loss rates of NO and NO2 in History.rc

Hi all,

I want to calculate the lifetime of NOx, so I need the chemical production and loss rates.
I tried adding Prod_NO and Loss_NO to History.rc, but it didn't work.
As the wiki says, some quantities in the ProdLoss collection are not applicable to certain simulations.
So I wonder whether GCHP can output Prod_NO, Loss_NO, Prod_NO2, and Loss_NO2?

Thanks, and I look forward to your reply!
Hongjian

[FEATURE REQUEST] Make HEMCO level ordering the same as other diagnostics

All HEMCO diagnostics are output with level 1 corresponding to the top of the atmosphere. This is the opposite of all GEOS-Chem diagnostics. It would be ideal to have consistency in level order across diagnostics to avoid confusion. This update would be for use in GCHP only. Levels should remain as they are if using GEOS.
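
In the meantime, a possible user-side workaround is to flip the HEMCO diagnostic files after the fact. A minimal sketch, assuming the output is netCDF with a vertical dimension named lev (the file and dimension names here are assumptions; adjust them to match your files):

import xarray as xr

# Reverse the vertical dimension so that level 1 corresponds to the surface,
# matching the other GEOS-Chem diagnostics.
ds = xr.open_dataset('HEMCO_diagnostics.201607010000.nc')
flipped = ds.isel(lev=slice(None, None, -1))
flipped.to_netcdf('HEMCO_diagnostics.201607010000.flipped.nc')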

[FEATURE REQUEST] Test with ESMF 8.0.1 and update documentation for download

ESMF 8.0.1 is now available. The official release page is here. This version is supposed to be backwards compatible, so no changes should be necessary for running with GCHPctm. It includes updates that improve performance, but whether GCHPctm performance is improved is yet to be determined. Notably, with this release ESMF is now on GitHub.

GCHPctm needs to be tested with the new version of ESMF, and the documentation for ESMF download needs to be updated, both on the wiki and in the GitHub README.

Issues running GCHPctm with GEOS-Chem 12.8.2

Hi, I'm trying to get the GCHPctm wrapper working in the default standard simulation with MERRA2, using GEOS-Chem 12.8.2. I was able to build the geos executable following the "getting started" portion of the GCHPctm GitHub page, but I am running into some problems with running the default (6-core, 1-node, 1-hour) test simulation. I am running interactively using the gchp.local.run script available in the runScriptSamples directory.

More information:
- I am running on the NCAR Cheyenne system, which uses a PBS scheduling system
- My interactive session uses 1 node with 36 cores/node
- I am using openmpi 4.0.3 and the gfortran 8.3.0 compiler
- I built ESMF with ESMF_COMM=openmpi

I receive this output when running the default:

WARNING: NX and NY are set such that NX x NY/6 has side ratio >= 2.5. Consider adjusting resources in runConfig.sh to be more square. This will avoid negative effects due to excessive communication between cores.
Compute resources:
NX                             : 1                    GCHP.rc             
NY                             : 30                   GCHP.rc             
CoresPerNode                   : 30                   HISTORY.rc          
 
Cubed-sphere resolution:
GCHP.IM_WORLD                  : 24                   GCHP.rc             
GCHP.IM                        : 24                   GCHP.rc             
GCHP.JM                        : 144                  GCHP.rc             
IM                             : 24                   GCHP.rc             
JM                             : 144                  GCHP.rc             
npx                            : 24                   fvcore_layout.rc    
npy                            : 24                   fvcore_layout.rc    
GCHP.GRIDNAME                  : PE24x144-CF          GCHP.rc             
 
Initial reestart file:
GIGCchem_INTERNAL_RESTART_FILE : +initial_GEOSChem_rst.c24_standard.nc GCHP.rc             
 
Simulation start, end, duration:
BEG_DATE                       : 20160701 000000      CAP.rc              
END_DATE                       : 20160701 010000      CAP.rc              
JOB_SGMT                       : 00000000 010000      CAP.rc              
 
Checkpoint (restart) frequency:
RECORD_FREQUENCY               : 100000000            GCHP.rc             
RECORD_REF_DATE                : 20160701             GCHP.rc             
RECORD_REF_TIME                : 000000               GCHP.rc             

The run eventually crashes after reading the HEMCO_Config.rc file and gives this error:

FATAL from PE     1: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE     3: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE     5: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE     0: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE     2: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE     4: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE     0: mpp_domains_define.inc: not all the pe_end are in the pelist
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[r1i2n17:64901] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[r1i2n17:64901] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[r1i2n17:64901] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[r1i2n17:64901] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[r1i2n17:64901] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
[r1i2n17:64901] 5 more processes have sent help message help-mpi-api.txt / mpi-abort
[r1i2n17:64901] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I am quite new to this, so I am not sure what is going on and would appreciate any advice. Further, are there potential issues I might run into by attempting to use GEOS-Chem 12.8.2 with GCHPctm? I am happy to provide more info if needed, as well. The full output log is attached.
gchp.pdf

[FEATURE REQUEST] Turn off writing initial checkpoint file

Default values in GCHPctm 13.0.0-alpha.2 (and all previous GCHP versions) write a MAPL internal state checkpoint file during the first timestep. At high resolutions this adds significantly to the run-time. We should have the ability to turn off writing this first checkpoint file. It shouldn't be needed since we already have a restart file and the output checkpoint file is written separately at the end of the run.

[QUESTION] How do I run a mass conservation test?

Hi everyone,

In the GCSC meeting the other day, it was suggested I test mass conservation in stretched-grid simulations. Could someone help me understand how I could do this? I see there's a geosfp_2x25_masscons run directory for GC-Classic—is there a way to mirror this in GCHPctm?

Thanks in advance!
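
Not an official answer, but one way to sanity-check mass conservation from existing GCHP output is sketched below. The collection/variable names (SpeciesConc collection, Met_AD in StateMet, a species called SpeciesConc_PassiveTracer) and file patterns are assumptions; adjust them to whatever you actually archive:

import xarray as xr

MW_AIR = 28.9644e-3  # kg mol-1, dry air

# open_mfdataset needs dask; alternatively open files one at a time
conc = xr.open_mfdataset('OutputDir/GCHP.SpeciesConc.*.nc4')
met = xr.open_mfdataset('OutputDir/GCHP.StateMet.*.nc4')

# moles of tracer per grid box = (mol tracer / mol dry air) * (kg dry air) / (kg air per mol air)
moles = conc['SpeciesConc_PassiveTracer'] * met['Met_AD'] / MW_AIR

# Global total vs. time; for a passive tracer with no sources or sinks this
# should stay constant to near machine precision
total = moles.sum(dim=('lev', 'nf', 'Ydim', 'Xdim'))
drift = (total - total.isel(time=0)) / total.isel(time=0)
print(drift.values)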

[FEATURE REQUEST] RRTMG

There is currently no way to calculate radiative forcing with GCHP, despite the presence of a version of RRTMG in the GEOS-Chem code.

[FEATURE REQUEST] Vertical advective flux diagnostic

GEOS-Chem Classic has the diagnostic collection AdvFluxVert to save vertical fluxes in tpcore. There is no equivalent diagnostic in GCHP. Since tpcore is in FV3, we should be able to add an equivalent diagnostic by creating a new MAPL export in the DYNAMICS grid comp in GCHP.

[FEATURE REQUEST] Debug printing during regridding portion of ExtData

The MAPL debug print option gives lots of information during the information collection stage of MAPL ExtData, such as parsing ExtData.rc and finding files with the right times. However, the regridding part of ExtData is very murky and GCHP seemingly stalls for a while without any printing at all during this phase. If the run times out due to an issue not caught with error handling then it is hard to know where it went wrong.

This feature request is really for GEOS-ESM/MAPL but we can put it in with GCHP in mind and then submit it as a PR to go to the upstream MAPL.

[BUG/ISSUE] MAPL crashes on systems with >2.147 TB of memory

I noticed some GCHPctm simulations are crashing on some nodes on compute1 with the following error

...
At line 548 of file /my-projects/sgv/line-3/GCHPctm/src/MAPL/MAPL_Base/MAPL_MemUtils.F90
Fortran runtime error: Integer overflow while reading item 1
...

The line where the overflow occurs is https://github.com/geoschem/MAPL/blob/10e7a0bc8d0d79eb90a3742980fa6f7f073a87e3/MAPL_Base/MAPL_MemUtils.F90#L548 because memtot is a 32-bit signed integer, and this is happening on nodes that report >3 TB of memory in /proc/meminfo. Since /proc/meminfo reports memory in kB, a 32-bit signed integer overflows at 2^31 - 1 kB, i.e. roughly 2.147 TB (hence the threshold in the title).

Changing memtot to a 64-bit integer should fix this. I'll do this when I get a chance.

[FEATURE REQUEST] Rebuild minimum file set with ifort like currently done in gfortran

Changing a line in MAPL_ExtDataGridComp.F90 results in recompiling all of MAPL_Base and FVdycore when rebuilding with ifort. Only MAPL_ExtDataGridComp.F90 is rebuilt, however, when building with gfortran. This makes developing and debugging with gfortran superior to ifort. It would be great if we could eventually have the same functionality with ifort as well.

[BUG/ISSUE] Trial run with 13.0.0-alpha.9 version crashes after ~1 simulation hour and gives floating divide by zero error.

Hi everyone,

I'm trying to run a 30-core 1-day trial simulation with the 13.0.0-alpha.9 version, but the run ended after ~1 simulation hour and exited with forrtl: error (73): floating divide by zero. The full log files are attached below.
163214_print_out.log
163214_error.log

More information:

  • Intel MPI with the Intel 18 compiler
  • ESMF 8.0.0 public release built with ESMF_COMM=intelmpi

I'm not sure how to troubleshoot this issue. I tried building the source code with -DCMAKE_BUILD_TYPE=Debug (with the fix in #35) and rerunning the simulation, but it produces a really large error log file, so I'm not attaching it here. The first few lines of the error log are:

forrtl: error (63): output conversion error, unit -5, file Internal Formatted Write
Image              PC                Routine            Line        Source
geos               00000000094A364E  Unknown               Unknown  Unknown
geos               00000000094F8D62  Unknown               Unknown  Unknown
geos               00000000094F6232  Unknown               Unknown  Unknown
geos               000000000226CC73  advcore_gridcompm         261  AdvCore_GridCompMod.F90
geos               0000000007F00A0D  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               0000000007F01D4E  Unknown               Unknown  Unknown
geos               0000000007F01A85  Unknown               Unknown  Unknown
geos               0000000007EE1304  Unknown               Unknown  Unknown
geos               0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos               0000000006829035  mapl_genericmod_m        4580  MAPL_Generic.F90
geos               0000000000425200  gchp_gridcompmod_         138  GCHP_GridCompMod.F90
geos               0000000007F00A0D  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               0000000007F01D4E  Unknown               Unknown  Unknown
geos               0000000007F01A85  Unknown               Unknown  Unknown
geos               0000000007EE1304  Unknown               Unknown  Unknown
geos               0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos               0000000006A52D6C  mapl_capgridcompm         482  MAPL_CapGridComp.F90
geos               0000000007F00B39  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               000000000844804D  Unknown               Unknown  Unknown
geos               0000000007EE2A0F  Unknown               Unknown  Unknown
geos               0000000006A67F42  mapl_capgridcompm         848  MAPL_CapGridComp.F90
geos               0000000006A39B5E  mapl_capmod_mp_ru         321  MAPL_Cap.F90
geos               0000000006A370A7  mapl_capmod_mp_ru         198  MAPL_Cap.F90
geos               0000000006A344ED  mapl_capmod_mp_ru         157  MAPL_Cap.F90
geos               0000000006A32B5F  mapl_capmod_mp_ru         131  MAPL_Cap.F90
geos               00000000004242FF  MAIN__                     29  GCHPctm.F90
geos               000000000042125E  Unknown               Unknown  Unknown
geos               000000000042125E  Unknown               Unknown  Unknown
libc-2.17.so       00002AFBC9F34505  __libc_start_main     Unknown  Unknown
geos               0000000000421169  Unknown               Unknown  Unknown

I also noticed something weird towards the start of the run:

      MAPL: No configure file specified for logging layer.  Using defaults. 
     SHMEM: NumCores per Node = 6
     SHMEM: NumNodes in use   = 1
     SHMEM: Total PEs         = 6
     SHMEM: NumNodes in use  = 1

Previous versions (12.8.2) usually show this instead:

 In MAPL_Shmem:
     NumCores per Node =            6
     NumNodes in use   =            1
     Total PEs         =            6


 In MAPL_InitializeShmem (NodeRootsComm):
     NumNodes in use   =            1

but I'm not sure if that matters.

[FEATURE REQUEST] Way to generate gridspec files for the model grid

It would be useful if there was a way to generate gridspec files for the model grid. I see some references to gridspec files in ExtData output, which makes me think MAPL might have some support for generating gridspec files, and ESMF supports gridspec inputs. For stretched-grids (and normal cubed-spheres) the grid-box corner coordinates are useful for plotting model output, and a way to generate gridspec files for the model grid seems like the proper and cleanest way to provide these coordinates. This would also facilitate the use of ESMF's offline regridders, since ESMF_RegridWeightGen and ESMF_Regrid can take grid definitions for cubed-sphere and stretched-grid grids in the gridspec file format. I can inquire about this in the next MAPL call.

It would be nice if there was an option for geos like

./geos --generate_gridspec

that generated the gridspec file.

Note that right now I have a custom script for generating a NetCDF file with corner coordinates, but I think it would be best to avoid solutions like this if possible.

[DISCUSSION] ~40% of total time spent on MPI_Barrier due to load imbalance of chemical solver

Problem

It has been puzzling me that ~40% of GCHP simulation time is spent on MPI_Barrier, as shown by the IPM profiler (https://github.com/nerscadmin/IPM).

For example, here is the IPM profiling result of a 7-day c180 benchmark on 288 cores (version 12.3.2, runs on AWS):

##IPMv2.0.6########################################################
#
# command   : ./geos                    
# start     : Thu Sep 26 03:23:50 2019   host      : ip-172-31-0-86  
# stop      : Thu Sep 26 08:47:40 2019   wallclock : 19430.50
# mpi_tasks : 288 on 8 nodes             %comm     : 48.44
# mem [GB]  : 905.40                     gflop/sec : 0.00
#
#           :       [total]        <avg>          min          max
# wallclock :    5595949.52     19430.38     19430.11     19430.50 
# MPI       :    2710587.40      9411.76      7384.92     11767.68 
# %wall     :
#   MPI     :                      48.44        38.01        60.56 
# #calls    :
#   MPI     :    6023502175     20914938     16331252     21806384
# mem [GB]  :        905.40         3.14         3.04        11.42 
#
#                             [time]        [count]        <%wall>
# MPI_Barrier             2280341.77       11672064          40.75
# MPI_Bcast                242708.94       48370752           4.34
# MPI_Allreduce             89036.54       49428288           1.59
# MPI_Wait                  73996.00     2953775418           1.32
# MPI_Scatterv              21071.88         185184           0.38
# MPI_Isend                  2117.89     1476895338           0.04
# MPI_Gatherv                 969.00        5689728           0.02
# MPI_Irecv                   266.05     1476880080           0.00
# MPI_Comm_create              30.24            576           0.00
# MPI_Recv                     30.11          15258           0.00
# MPI_Comm_split               17.11           8064           0.00
# MPI_Allgather                 1.17            864           0.00
# MPI_Reduce                    0.45           1728           0.00
# MPI_Comm_rank                 0.24         503073           0.00
# MPI_Comm_size                 0.02          74600           0.00
# MPI_Comm_free                 0.01            296           0.00
# MPI_Comm_group                0.00            576           0.00
# MPI_Init                      0.00            288           0.00
#
###################################################################

Here the total wall time is 19430.50 seconds; the MPI_Barrier time is 2280341.77 / 288 = 7918 seconds (IPM prints the total time across all ranks), accounting for ~40% of the total time. The fraction of MPI_Barrier is reduced to ~30% at 576 cores and ~20% at 1152 cores, but this is still much larger than a normal value (being blocked 20-40% of the time seems a bit ridiculous).

Full log:

Visualize profiling results

I wrote a Python script to parse IPM results (https://github.com/JiaweiZhuang/ipm_util) so I can easily analyze & visualize MPI time.

Averaging over all ranks, MPI_Barrier takes much longer than any other MPI call.

Break into per-rank time (similar to the "Communication balance by task" plot in IPM's default HTML report):

[figure: area_plot]

Same data but on individual panels:

[figure: line_facet_plot]

Full notebook: https://gist.github.com/JiaweiZhuang/587a17fbb2b757182c5e49dcd3d1f8a9
The notebook reads these IPM XML log files: gchp_ipm_logs.zip

Possible explanations

I originally expected that MPI_Barrier comes from old MAPL's serial I/O, where other ranks are waiting for the master rank to read data from disk. However, I/O can't explain the problem because:

  • Time spent on EXTDATA is only 1154 seconds, much smaller than the MPI_Barrier time (7918 seconds). There must be other components that have load imbalance and cause this long blocking, maybe advection or gas-phase chemistry (say, different spatial regions requiring different numbers of inner solver steps?)
  • All ranks spend roughly the same time on MPI_Barrier (7900 ± 790 seconds) -- this cannot come from blocking I/O, where the master process should have near-zero barrier time.

A typical serial, blocking I/O pattern would look like this toy MPI_Barrier example, whose core body is:

call MPI_Barrier(  MPI_COMM_WORLD, ierror)
if (rank .eq. 0) then
    call SLEEP(3)  ! delaying everyone else
end if
call MPI_Barrier(  MPI_COMM_WORLD, ierror)

in which case rank 0 will have zero MPI_Barrier time, while other ranks will have 3 seconds of MPI_Barrier time. IPM shows very accurate results on this toy program.

As a comparison, WRF doesn't have such a long MPI_Barrier time, as shown by this WRF profiling result.

More clues

An intriguing observation is that MPI_Barrier time decreases with the number of cores, while other MPI calls (especially MPI_Bcast) generally take longer with more cores (due to increased communication, obviously).


One hypothesis is that MPI_Barrier comes from the load imbalance of photochemistry, as the KPP solver time varies a lot between day and night. Since the chemistry component scales almost perfectly with core count, its time drops quickly with more cores.

Suggestions

We can locate those time-consuming MPI_Barrier calls by either:

  • marking code regions with MPI_Pcontrol(...), as shown in the IPM User Guide, or
  • using a heavyweight HPC profiler like HPCToolkit or TAU to collect the full call stack.

If such load imbalance does come from the chemical solver, then people should pay extra attention to the slowest spatial regions when trying to speed up the chemistry solvers. This might be a GCHP-only problem; GC-Classic would be OK if OpenMP dynamic thread scheduling is used.

cc @yantosca @lizziel: something worth investigating if you want to profile the code.

[DISCUSSION] OpenUCX v1.6 gives incomplete traceback information

OpenMPI (and other MPI implementations) can make use of OpenUCX, which provides some low-level functionality. @mathomp4 discovered that MPI built with OpenUCX v1.6 will give incomplete traceback information when throwing an error due to floating point exceptions, at least when using some Intel compilers (openucx/ucx#5611). This can be resolved by instead using OpenUCX v1.8.1. GCHPctm successfully compiles with OpenUCX v1.8.1 and OpenMPI v4.0.4, so this should be the recommended (open source) software stack for GCHPctm.

[DISCUSSION] GCHP needs expanded CI or other automated test pipeline

This discussion will pick up where issue #43 (formerly of GCHP repository) left off. GCHP 13.0.0 includes a continuous integration pipeline via Azure but currently only builds the model. It also only builds with a single configuration of compiler flags.

Having test runs on Azure is challenging due to the size of the GCHP input data, but there may be workarounds to get some simple form of testing implemented. @LiamBindle has also suggested outsourcing automated tests to his local cluster where the input data is available and memory/storage constraints are not an issue. @msulprizio is developing integration testing for GEOS-Chem which could also fulfill some of the GCHPctm automated testing needs.

This discussion is intended to be a forum for people to weigh in on GCHPctm testing needs and help develop a feasible plan to implement over the course of the GCHP 13 series.

Default GCHP run crashes almost immediately in MAPL_CapGridComp.F90

Hi everyone,

I'm just submitting this for the archive of issues on GitHub.

Relevant information

  • ESMF was built with Spack
  • Using Intel MPI with Intel 19 compilers
  • ESMF was unintentionally built with ESMF_COMM=mpiuni

What happened

Yesterday I tried running the default 6-core 1-node 1-hour GCHP simulation and it crashed almost immediately. This happened with GCHPctm 13.0.0-alpha.1, but this could happen with any version that uses MAPL 2.0+. Below is the full output. The important parts to pick out are:

  1. It failed almost immediately (very little output).
  2. The "Abort(XXXXXX) on node Y" lines report GCHP is running on different nodes despite this being a 6-core single node simulation.
  3. GCHP crashed after the assertion on line 250 of MAPL_CapGridComp.F90 failed (permalink here)

Failed run output:

 In MAPL_Shmem:
     NumCores per Node =            6
     NumNodes in use   =            1
     Total PEs         =            6
 In MAPL_InitializeShmem (NodeRootsComm):
     NumNodes in use   =            1
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 Integer*4 Resource Parameter: HEARTBEAT_DT:600
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
 NOT using buffer I/O for file: cap_restart
pe=00001 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00001 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00001 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00029    GEOSChem.F90                             <status=1>
Abort(262146) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 1
pe=00002 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00002 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00002 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00029    GEOSChem.F90                             <status=1>
pe=00003 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00003 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00003 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00029    GEOSChem.F90                             <status=1>
pe=00000 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00000 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00000 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00029    GEOSChem.F90                             <status=1>
Abort(262146) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 2
Abort(262146) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 3
pe=00004 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00004 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00004 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00004 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00004 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00004 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00004 FAIL at line=00029    GEOSChem.F90                             <status=1>
Abort(262146) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 4
pe=00005 FAIL at line=00250    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00005 FAIL at line=00826    MAPL_CapGridComp.F90                     <status=1>
pe=00005 FAIL at line=00427    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00303    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00151    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00029    GEOSChem.F90                             <status=1>
Abort(262146) on node 5 (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 5
Abort(262146) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 262146) - process 0

The Problem

The issue was that ESMF was built with ESMF_COMM=mpiuni. This appears to have happened because the spack install spec wasn't quite right, but I didn't build ESMF myself so I can't be sure.

How do I check which ESMF_COMM my ESMF was built with?

The build-time value of ESMF_COMM is written to esmf.mk beside your ESMF libraries. You can see it with one of the following commands:

grep 'ESMF_COMM' $(spack location -i esmf)/lib/esmf.mk

or

grep 'ESMF_COMM' /path/to/ESMF/libraries/esmf.mk

Solution

Rebuild ESMF and make sure ESMF_COMM is set to the appropriate MPI flavor.

[BUG/ISSUE] GCHP 12.9.3 multirun option fails after 1st run [error: cap_restart did not update to different date]

Hello,
To get more familiar with the multi-run option, I am trying to split a 3-hour simulation into 3 jobs (GCHP version 12.9.3), following the respective wiki instructions. However, after the first run, the simulation crashes with the error: cap_restart did not update to different date. Checking the cap_restart file, I can see that the file is empty.
Maybe the problem is that the date of the first job is not written to the cap_restart file and thus the second job cannot start running? Which part of the code is writing the cap_restart file?

Compilation commands

  1. make clean_all
  2. make build_all

Run commands
./gchp.multirun.sh

Error messages

There are some errors in the slurm file of the first job (attached), and an error in the multirun.log file:

Error: cap_restart did not update to different date

Required information:

Your GCHP version and runtime environment:

  • GCHPctm version (can be last commit hash): GHCP 12.9.3
  • MPI type and version: openmpi/icc/3.0.2
  • Fortran compiler type and version: ifort 18.0.2 20180210
  • netCDF version: netcdf/4.7.3-openmpi
  • Are you using GCHP "out of the box" (i.e. unmodified): No
    • If you have modified GCHP, please list what was changed: __

Input and log files to attach

I would appreciate it if you could provide some help to solve my problem. Thank you in advance.
Regards,
Maria Tsivlidou

[BUG] -DCMAKE_BUILD_TYPE=Debug Fortran_FLAGS missing comma

When I run cmake with the latest main branch, I get a compile error which I traced back to src/MAPL/MAPL_cfio_r4/CMakeFiles/MAPL_cfio_r4.dir/flags.make. The Fortran_FLAGS variable includes -check bounds uninit, but for ifort these options should be comma separated, i.e. -check bounds,uninit.

[BUG/ISSUE] Error when trying to restart interrupted simulation using existing restart file GCHP 12.9.3

Hello,
I am Maria Tsivlidou, a PhD student at the Laboratoire d'Aerologie in Toulouse, France, supervised by Bastien Sauvage and Brice Barret. I am trying to restart a GCHP simulation using an existing restart file, and I would like to kindly ask for your assistance with an error I am getting.

Describe the bug:

I used GCHP version 12.9.3 to produce a successful simulation with start date 20080501 and end date 20080601. At the end of the run the checkpoint file was created (gcchem_internal_checkpoint.restart.20080601_000000.nc4.txt). Now I am trying to restart the run from 20080601, but it is not working.
Compilation commands
1.  make clean_all
2.  make build_all

Run commands

  1. sbatch gchp_nuwa.run

Error messages

The error message in the gchp.log file is:

 Mem/Swap Used (MB) at GIGCenvMAPL_GenericInitialize= 6.8260E+03 0.0000E+00
 ERROR: Timer TOTAL needs to be set first
 ERROR: Timer INITIALIZE needs to be set first
 ERROR: Timer TOTAL needs to be set first
 ERROR: Timer INITIALIZE needs to be set first
 ERROR: Timer TOTAL needs to be set first
 ERROR: Timer GenInitTot needs to be set first
 ERROR: Timer --GenInitMine needs to be set first

Also, there are several errors in the slurm.out.txt file attached below. 

Your GEOS-Chem version and runtime environment:

 - GEOS-Chem version:  GCHP 12.9.3
 - Compiler version: ifort 18.0.2 20180210
 - netCDF version: netcdf/4.7.3-openmpi
 - netCDF-Fortran version (if applicable): __
 - Did you run on a computational cluster, on the AWS cloud: No
   - If you ran on the AWS cloud, please specify the Amazon Machine Image (AMI) ID: __
 - Are you using GEOS-Chem "out of the box" (i.e. unmodified): No 
   - If you have modified GEOS-Chem, please list what was changed: __

Input and log files to attach

 - lastbuild: __
 - input.geos: input.geos.txt

 - HEMCO_Config.rc: HEMCO_Config.rc.txt

 - GEOS-Chem "Classic" log file: gchp.log.txt

 - HEMCO.log: HEMCO.log.txt

 - slurm.out or any other error messages from your scheduler: slurm-510683.out.txt

 - runConfig: runConfig.sh.txt

Additional context

I had the same error even when I tried to restart a 6-month simulation that was interrupted after 3 months, using the last restart file that was created. 

[DISCUSSION] Grid-box corner coordinates in (or appended to) diagnostics

To plot data on a curvilinear grid with routines like matplotlib's pcolormesh(), the coordinates of grid-box edges are necessary. Currently the diagnostics don't include edge coordinates, and there's no easy way to get them.

GCPy calculates edge coordinates internally

GCPy calculates edge coordinates itself (privately), and uses those to plot cubed-sphere data. (See here).

    else:
        #Cubed-sphere single level
        ax.coastlines()
        try:
            if masked_data == None:
                masked_data = np.ma.masked_where(np.abs(grid["lon"] - 180) < 2, plot_vals.data.reshape(6, res, res))
        except ValueError:
            #Comparison of numpy arrays throws errors
            pass
        [minlon,maxlon,minlat,maxlat] = extent
        #Catch issue with plots extending into both the western and eastern hemisphere
        if np.max(grid["lon_b"] > 180):
            grid["lon_b"] = (((grid["lon_b"]+180)%360)-180)
        for j in range(6):
            plot = ax.pcolormesh(
                grid["lon_b"][j, :, :],
                grid["lat_b"][j, :, :],
                masked_data[j, :, :],
                transform=proj,
                cmap=comap,
                norm=norm
            )

In this snippet, the grid dict contains arrays with the edge coordinates. The grid dict is generated by call_make_grid(). But these coordinates are not readily available to users.

Potential solutions

Below are some ideas that come to mind:

  1. Add a command-line tool to GCPy that appends the edge coordinates (shape: nf, YEdim, XEdim; where YEdim, XEdim are csres+1 in size) to a given file (see the sketch after this list)
  2. Perhaps there's a way to add this information to the diagnostics with MAPL?
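
To make option 1 concrete, here is a minimal sketch of what such a tool could do, assuming you already have corner-coordinate arrays (e.g. from GCPy's grid utilities or a custom script). The variable names lons_b/lats_b and the corner dimension names are assumptions, not an established convention:

import xarray as xr

def append_corners(diag_file, lons_b, lats_b, out_file):
    # lons_b, lats_b: numpy arrays of shape (nf, csres+1, csres+1)
    ds = xr.open_dataset(diag_file)
    corner_dims = ('nf', 'YCdim', 'XCdim')
    ds['lons_b'] = xr.DataArray(lons_b, dims=corner_dims,
                                attrs={'long_name': 'grid-box corner longitudes',
                                       'units': 'degrees_east'})
    ds['lats_b'] = xr.DataArray(lats_b, dims=corner_dims,
                                attrs={'long_name': 'grid-box corner latitudes',
                                       'units': 'degrees_north'})
    ds.to_netcdf(out_file)

# e.g. append_corners('GCHP.SpeciesConc.nc4', lons_b, lats_b, 'GCHP.SpeciesConc.with_corners.nc4')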

I look forward to the thoughts of others. Is there a solution to this that I'm not aware of?

[BUG/ISSUE] Writing initial restart file is very slow with IntelMPI for big/high core count simulations

Writing the initial restart file can be very slow with IntelMPI for big simulations/high core counts.

Last week I was running a C360 simulation on 900 cores, and it got stuck writing the first gcchem_internal_checkpoint. It was writing it so slow that it would have taken >1 day.

Related: GEOS-ESM/MAPL#548

Solution

Setting the following environment variable fixed it for me:

export I_MPI_ADJUST_GATHERV=3

See also: https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-windows/top/environment-variable-reference/i-mpi-adjust-family-environment-variables.html

Set this environment variable to select the desired algorithm(s) for the collective operation under particular conditions. Each collective operation has its own environment variable and algorithms.

Environment Variables, Collective Operations, and Algorithms

Environment variable: I_MPI_ADJUST_GATHERV
Collective operation: MPI_Gatherv
Algorithms: 1. Linear, 2. Topology aware linear, 3. Knomial

[BUG/ISSUE] Segmentation fault when running c180 simulation

Hello, I have been running GCHPctm at c24 with no issue using GEOS-Chem 12.8.2, but I recently wanted to change my simulation resolution to c180 and am running into problems.

Some info:
- Running on the NCAR Casper environment
- I am running on 8 nodes, 288 cores total (but this error persists if I request different numbers of nodes/cores)
- I am using openmpi 4.0.3 and the gfortran 8.3.0 compiler

Regardless of the resources I request from my cluster, the simulation crashes at this point in the log file:

 MAPL ExtData initialization complete
Mem/Swap Used (MB) at MAPL_Cap:TimeLoop
=  1.919E+05  3.103E+03
 Calling MAPL ExtData Run_
 ExtData Run_: READ_LOOP
 ExtData Run_: ---PopulateBundle
 ExtData Run_: ---CreateCFIO
 ExtData Run_: ---prefetch

I don't receive an error message from the code, but I see this message in my SLURM file and find core files in my run directory:

[casper15:68225:0:68225] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffff2d2f75588)
==== backtrace (tid:  68225) ====

What might be causing this?

[FEATURE REQUEST] GEOS tracers gridded component

For the transport tracers simulation GCHP still needs to go through the GEOS-Chem classic code to get information about the tracers used in a given simulation. Ideally GCHP would instead use the NASA GMAO tracer gridded component called TR_GridComp used in the GEOS system.

Currently it is not its own repository but is part of https://github.com/GEOS-ESM/GEOSchem_GridComp. For now it is best to simply copy it into GCHPctm as a start. See discussion with GMAO on this here: GEOS-ESM/GEOSchem_GridComp#51.

Comply with MAPL "positive" standards in export definitions

Currently GCHP does not obey MAPL conventions regarding which way is "up" in exported data. It appears that, when writing out, data acquired through the GEOS-Chem diagnostic arrays are produced "inverted" (level 1 = surface, even though HISTORY is meant to output with level 1 = TOA), but all other data are produced "right way up" (level 1 = TOA). I've opened a request in the GEOS-ESM/MAPL repo to make the "positive" attribute of vertical data something that is communicated explicitly in imports and exports (GEOS-ESM/MAPL#284), which would resolve this issue. However, this will require that we modify GCHP to comply with MAPL's standards.

[BUG/ISSUE] GCHP crash on reading in lightning NOx when trying to start a simulation in February 2016 and crash when trying to do a leap day

Describe the bug:

GCHP crashes and says there is an error reading in lightning NOx when I try to restart the multirun set of simulations. GCHP also crashes on a leap day with a MAPL error. I don't know if the two errors are related.

Expected behavior:

GCHP reads in lightning NOx, proceeds with the simulation, and doesn't crash in the first place when getting to the leap day.

Actual behavior:

GCHP crashes and says there is an error reading in lightning NOx when I try to restart a multirun set of simulations, and crashes on a leap day.

Steps to reproduce the bug:

Start a single-run simulation on 20160229 000000 (or 20160207 000000, or seemingly any time in February 2016).

Or attempt to re-start a multirun simulation set that previously crashed by using an existing cap_restart and the last restart file from the multirun (restarting on February 1).

For the leap day simulations, I've now had multiple simulations crash for the month of February when getting to 00:00 on Feb 29. See the log file below for an example.

Compilation commands
I used cmake and ifort 18, with the standard environment used by Lizzie Lundgren, and with RRTMG on.

Run commands
I used the gchp.run script.

Error messages

For the lightning NOx crash the .out file says:
ExtData could not find bracketing data from file template
./HcoDir/OFFLINE_LIGHTNING/v2020-03/GEOSFP/%y4/FLASH_CTH_GEOSFP_0.25x0.3125_%y4
_%m2.nc4 for side L

The .err files for both types of crashes have lots of MAPL errors and MPI abort errors. See the relevant log files listed below.

HEMCO.log didn't have anything specific for either of the two errors.

Required information:

Your GCHP version and runtime environment:

  • GCHPctm version (can be last commit hash): ____ dev.gchp_13.0.0
  • MPI type and version: __
  • Fortran compiler type and version: ___ifort 18
  • netCDF version: __
  • Are you using GCHP "out of the box" (i.e. unmodified): __
    • If you have modified GCHP, please list what was changed: __ __ I added full column diagnostics for RRTMG and read in some stratospheric aerosol properties. These all work fine for the first month, but the simulations keep crashing when starting Feb 29 2016. And then when I try to restart them on Feb 1 2016, I get the lightning NOx error.

Input and log files to attach

  • runConfig.sh: __
  • input.geos: __
  • HEMCO_Config.rc: __
  • ExtData.rc: __
  • HISTORY.rc: __
  • GCHP compile log file: __
  • GCHP run log file: __
  • HEMCO.log: __
  • slurm.out or any other error messages from your scheduler: __
  • Any other error messages: __

See here on Cannon for all of the above files: /n/holyscratch01/jacob_lab/jmoch/geoE_rdirs/GCHP_13.0.0_geoE_off_vtest3
The relevant log files are slurm-7116496.out and slurm-7116496.err for the initial crash, and slurm-7180457.out and slurm-7180457.out for the crash when I try to restart it and get the lightning NOx error.

Additional context

[DISCUSSION] How to submit a pull request with submodules

I believe the GCHP adjoint code is close to a state where I can submit a pull request. However, I have had to fork the geos-chem, MAPL, FVdycore, and HEMCO submodule repositories and make changes to them in addition to the GCHP code. Is that multiple pull requests then? How can they be coordinated? More generally, is there a guide about coding and testing requirements before submitting the request?

[BUG/ISSUE] Crash when using Intel MPI with certain fabric providers

As of Intel MPI 2019 Update 8 and libfabric 1.10.0, there is a bug related to registering memory that causes a crash in GCHP when using certain fabric providers. This was originally identified as an issue when using the EFA provider on AWS EC2, but has also been encountered on systems that use the Verbs provider. This issue may be fixed in libfabric 1.11.0. For users who cannot update the libfabric version on their system, a temporary solution is to put the line export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 in gchp.env. This bug is not relevant to users of other MPI providers such as OpenMPI.

[DISCUSSION] How to implement vertical flipping of imports from ExtData

Hi everyone, I was thinking about how to add the ability to vertically flip metfields today (for running from native metfields), and I came up with two ideas:

  1. Define the ExtData variables that need flipping in a config file. Flip those variables in GCHPctmEnv_GridComp. This is only "clean" if GCHPctmEnv_GridComp can have an IMPORT and EXPORT with the same name—does anyone know if that's possible?
  2. Add a new derived export function (see here) to perform vertical flipping.

Any thoughts? Does anyone have a different idea in mind?

[BUG/ISSUE] GCHP crash in dev/gchp_13.0.0 when trying to use the .grid_label field in HISTORY.rc

Describe the bug:

GCHP crashes when I try to use the ".grid_label" and ".conservative" fields for any collection.

Expected behavior:

GCHP runs successfully and regrids the output from the native cubed-sphere grid to a lat-lon grid.

Actual behavior:

GCHP crashes with errors pointing to MAPL (e.g. MAPL_HistoryGridComp.F90, MAPL_Generic.F90, etc.).

Steps to reproduce the bug:

I used cmake and ifort 18, with the standard environment used by Lizzie Lundgren.

Run commands

I used the gchp.run script (single run)

Error messages

Nothing says "add text here" but a lot of messages say "need informative message"
 
pe=00000 FAIL at line=01064    MAPL_HistoryGridComp.F90                 <needs informative message>
pe=00000 FAIL at line=01829    MAPL_Generic.F90                         <needs informative message>
pe=00000 FAIL at line=00614    MAPL_CapGridComp.F90                     <status=1>
pe=00000 FAIL at line=00559    MAPL_CapGridComp.F90                     <status=1>
pe=00001 FAIL at line=01064    MAPL_HistoryGridComp.F90                 <needs informative message>
pe=00001 FAIL at line=01829    MAPL_Generic.F90                         <needs informative message>
pe=00001 FAIL at line=00614    MAPL_CapGridComp.F90                     <status=1>
pe=00001 FAIL at line=00559    MAPL_CapGridComp.F90                     <status=1>
pe=00001 FAIL at line=00849    MAPL_CapGridComp.F90                     <status=1>
pe=00001 FAIL at line=00322    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00198    MAPL_Cap.F90                             <status=1>
pe=00001 FAIL at line=00157    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=01064    MAPL_HistoryGridComp.F90                 <needs informative message>
pe=00002 FAIL at line=01829    MAPL_Generic.F90                         <needs informative message>
pe=00002 FAIL at line=00614    MAPL_CapGridComp.F90                     <status=1>
pe=00002 FAIL at line=00559    MAPL_CapGridComp.F90                     <status=1>
pe=00002 FAIL at line=00849    MAPL_CapGridComp.F90                     <status=1>
pe=00002 FAIL at line=00322    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00198    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00157    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00131    MAPL_Cap.F90                             <status=1>
pe=00002 FAIL at line=00029    GCHPctm.F90                              <status=1>
pe=00003 FAIL at line=01064    MAPL_HistoryGridComp.F90                 <needs informative message>
pe=00003 FAIL at line=01829    MAPL_Generic.F90                         <needs informative message>
pe=00003 FAIL at line=00614    MAPL_CapGridComp.F90                     <status=1>
pe=00003 FAIL at line=00559    MAPL_CapGridComp.F90                     <status=1>
pe=00003 FAIL at line=00849    MAPL_CapGridComp.F90                     <status=1>
pe=00003 FAIL at line=00322    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00198    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00157    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00131    MAPL_Cap.F90                             <status=1>
pe=00003 FAIL at line=00029    GCHPctm.F90                              <status=1>
pe=00005 FAIL at line=01064    MAPL_HistoryGridComp.F90                 <needs informative message>
pe=00005 FAIL at line=01829    MAPL_Generic.F90                         <needs informative message>
pe=00005 FAIL at line=00614    MAPL_CapGridComp.F90                     <status=1>
pe=00005 FAIL at line=00559    MAPL_CapGridComp.F90                     <status=1>
pe=00005 FAIL at line=00849    MAPL_CapGridComp.F90                     <status=1>
pe=00005 FAIL at line=00322    MAPL_Cap.F90                             <status=1>
pe=00005 FAIL at line=00198    MAPL_Cap.F90                             <status=1>

...

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD
with errorcode 262146.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
In: PMI_Abort(262146, N/A)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 21 in communicator MPI_COMM_WORLD
with errorcode 262146.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
In: PMI_Abort(262146, N/A)
--------------------------------------------------------------------------
...
see more in the log file

Required information:

Your GCHP version and runtime environment:

  • GCHPctm version (can be last commit hash): __ dev.gchp_13.0.0
  • MPI type and version: __
  • Fortran compiler type and version: __ ifort 18
  • netCDF version: __
  • Are you using GCHP "out of the box" (i.e. unmodified): __ I added full column diagnostics for RRTMG, but these work if I don't try regridding the output
    • If you have modified GCHP, please list what was changed: __ see above

Input and log files to attach

  • runConfig.sh: __
  • input.geos: __
  • HEMCO_Config.rc: __
  • ExtData.rc: __
  • HISTORY.rc: __
  • GCHP compile log file: __
  • GCHP run log file: __
  • HEMCO.log: __
  • slurm.out or any other error messages from your scheduler: __
  • Any other error messages: __

See this directory on Cannon for all of the above files: /n/holyscratch01/jacob_lab/jmoch/geoE_rdirs/GCHP_13.0.0_geoE_off_vtest2
The relevant log file is: slurm-6797176.out

Additional context

gchp.log

HEMCO.log file gets closed, reopened as fort.11

I am working off of version 12.5 in the old repo, but I tracked down the source of the problem and that code is still present in the gchp_ctm repo, so I thought I'd drop this bug report here. The symptoms may be different in gchp_ctm, but I'll post what happens in my runs.

Since HEMCO.log is the first persistent output file opened, it gets a unit number of 11 from findFreeLUN. Then, at some point after EMISSIONS_INIT is finished, MAPL_cfio/ESMF_CFIOMod.F90::ESMF_CFIOFileOpen is called and these lines are executed:

    858       open(11, file=fileName)
    859       read(11, '(a)') dset
    860       close(11)

This closes my HEMCO.log file handle. The next time HCO_MSG is called, it opens a file with the default name for unit 11, which is fort.11, and the HEMCO log output continues in that file. It's obviously not critical, but it is annoying and should be easy to fix. For my code, because I couldn't access inquireMod from the MAPL_io folder, I just added a loop that searches for a free LUN before that call and replaced the hard-coded 11 with that LUN variable (see the sketch below). I don't know what the most copacetic fix is for the new repo.
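Below is a minimal sketch of that kind of workaround, not the actual MAPL code. It assumes it is placed inside ESMF_CFIOFileOpen, where fileName and dset are already declared; the names newLun and isOpen are illustrative only:

    integer :: newLun
    logical :: isOpen

    ! Search for a logical unit that is not already open,
    ! instead of assuming that unit 11 is free
    do newLun = 11, 99
       inquire( unit=newLun, opened=isOpen )
       if ( .not. isOpen ) exit
    end do

    open ( newLun, file=fileName )
    read ( newLun, '(a)' ) dset
    close( newLun )

On compilers that support Fortran 2008, open( newunit=newLun, file=fileName ) would select a free unit automatically and avoid the search loop altogether.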

[BUG/ISSUE] Trouble initializing build directory with alpha.9

I'm opening this on behalf of Isaiah Sauvageau from Drexel University. He writes:

Hello,

I am attempting to build and run GCHP, but I am having trouble initializing a build directory. The detailed description of my problem is shared through a paper linked here (https://paper.dropbox.com/doc/GCHP-build-Initialization--A_84vqolsYb3Vs14zP4J02SdAQ-OTU3d7EZVuYVxafmeAbWe). Any assistance would be greatly appreciated. If there is more information required, I am happy to share.

Thank you,

Isaiah Sauvageau
Pronouns: He/Him/His
PhD Candidate Environmental Engineering
Drexel University | College of Engineering

[FEATURE REQUEST] Update GMAO submodule versions in GCHP 13.1

4/8/2021 Update: Table now reflects versions to be included in 13.1
5/14 Update: Advection libraries will also be updated in 13.1

GCHP 13.0 includes upgrades to all GMAO libraries relative to what is used in GCHP 12. However, most of these libraries have already had several additional version releases. We should upgrade again for GCHP 13.1.

Below is a table of each GCHPctm submodule that is a fork of a GMAO repository, from either the GEOS-ESM or Goddard-Fortran-Ecosystem organizations. Let me know if you think I missed any of the repos used in GCHPctm. The target versions are the latest available now; if a newer version exists at the time this work is done, we will take that for the merge instead. Only take tagged versions that are on the upstream main branch.

Repository versions in 13.0 and targets for 13.1

Repository             | Version in 13.0.0  | Target for 13.1         | Notes
FMS                    | geos/orphan/v1.0.3 | geos/2019.01.02+noaff.6 | Our version is an orphan branch, so special handling is needed for the upgrade.
MAPL                   | v2.2.7             | v2.6.3                  | Working on a bug fix for v2.6.4.
GMAO_Shared            | v1.1.6             | v1.3.8                  | We do not use most of the content in this repository. Update the skip list in CMakeLists.txt as needed.
fvdycore               | geos/v1.1.2        | v1.1.6                  | Beware this is a submodule within a submodule.
FVdycoreCubed_GridComp | v1.1.3             | v1.2.12                 | An internal benchmark is essential when updating this library due to potential changes in offline advection, which is not thoroughly tested at GMAO.
ESMA_cmake             | v3.0.6             | v3.0.6                  | No version change in 13.1.
ecbuild                | geos/v1.0.5        | geos/v1.0.6             | If there is a newer version to upgrade to, beware this is a submodule within a submodule.
gFTL-shared            | v1.0.7             | v1.2.0                  | We are using a fork of this repo but perhaps do not need to.
gFTL                   | v1.2.5             | v1.3.1                  | We are not using a fork of this repo.
pFlogger               | v1.4.2             | v1.5.0                  | We do not yet harness the full power of this library, but should in the future.
yaFyaml                | v0.4.0             | v0.5.0                  | We should be able to use this to read the GEOS-Chem species database, but it has not yet been tried.
pFUnit                 | v4.1.9             | v4.2.0                  | We do not currently build this library, but it is included in GCHPctm for potential future use.

Related to this, another goal I have is to use Goddard-Fortran-Ecosystem/GFE, which Tom put together at my request to bundle the Goddard-Fortran-Ecosystem repos. We generally do not change these libraries, so we could potentially avoid using forks; I have permission to make branches on the upstream as needed. GFE could sit as a submodule within GCHPctm/src, replacing gFTL-shared, pFlogger, pFUnit, and yaFyaml in that directory, which would be much cleaner.

[BUG/ISSUE] CodeDir is incorrectly set when creating GCHPctm run directories

Describe the bug

When I create a GCHPctm run directory, I've noticed that the CodeDir symbolic link points to the wrong directory. For example, after cloning GCHPctm and checking out the submodules:

cd GCHPctm/run
./createRunDir.sh 
... then follow all the prompts to create your desired run dir type ...
... then cd into the rundir you just created...
ls -l CodeDir

The output I got was:

/n/holyscratch01/jacob_lab/ryantosca/GCHP/GCHPctm/src/GCHP_GridComp/GEOSChem_GridComp/geos-chem/run/

which is pointing to the run creation directory. But this should point to the top-level GCHPctm folder, i.e.:

/n/holyscratch01/jacob_lab/ryantosca/GCHP/GCHPctm

Quick fix

Manually unlink CodeDir and reset it to the top-level GCHPctm folder, i.e.

unlink CodeDir
ln -s /n/holyscratch01/jacob_lab/ryantosca/GCHP/GCHPctm CodeDir

then compile as shown in the README.md

Error linking ESMF 8 with Intel 19 compiler

This documents an issue I ran into with gchp_ctm (3f06a1b) using ESMF 8 and Intel 19 on CentOS 7. When I compiled gchp_ctm, I got the following link error:

Scanning dependencies of target geos
[100%] Building Fortran object src/CMakeFiles/geos.dir/GEOSChem.F90.o
[100%] Linking Fortran executable geos
ld: geos: hidden symbol `__intel_cpu_features_init_x' in /opt/intel/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin/libirc.a(cpu_feature_disp.o) is referenced by DSO
ld: final link failed: Bad value
make[3]: *** [src/geos] Error 1
make[2]: *** [src/CMakeFiles/geos.dir/all] Error 2
make[1]: *** [src/CMakeFiles/geos.dir/rule] Error 2

This can be fixed by adding -lintlc to ESMF's link libraries. I believe this library is the dynamic version of libirc.a, which provides Intel-specific optimizations (according to here).

This issue can be fixed with the following patch to ESMA_cmake:

diff --git a/FindESMF.cmake b/FindESMF.cmake
index df05906..f10027e 100755
--- a/FindESMF.cmake
+++ b/FindESMF.cmake
@@ -86,7 +86,7 @@ find_package(NetCDF REQUIRED)
 find_package(MPI REQUIRED)
 execute_process (COMMAND ${CMAKE_CXX_COMPILER} --print-file-name=libstdc++.so OUTPUT_VARIABLE stdcxx OUTPUT_STRIP_TRAILING_WHITESPACE)
 execute_process (COMMAND ${CMAKE_CXX_COMPILER} --print-file-name=libgcc.a OUTPUT_VARIABLE libgcc OUTPUT_STRIP_TRAILING_WHITESPACE)
-set(ESMF_LIBRARIES ${ESMF_LIBRARY} ${NETCDF_LIBRARIES} ${MPI_Fortran_LIBRARIES} ${MPI_CXX_LIBRARIES} rt ${stdcxx} ${libgcc})
+set(ESMF_LIBRARIES ${ESMF_LIBRARY} ${NETCDF_LIBRARIES} ${MPI_Fortran_LIBRARIES} ${MPI_CXX_LIBRARIES} rt -lintlc ${stdcxx} ${libgcc})
 set(ESMF_INCLUDE_DIRS ${ESMF_HEADERS_DIR} ${ESMF_MOD_DIR})
 
 # Make an imported target for ESMF

To me, issues like this are a symptom of a larger problem: how to determine transitive usage requirements from dependencies that aren't built with CMake.
