e3sm-project / e3sm
Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
Home Page: https://docs.e3sm.org/E3SM
License: Other
The ACME v0.1 code stopped working on Mira earlier this week. Identical code and case was working last week. Assigning to Jayesh since you're listed as the Mira POC.
Reported to ALCF support on Jan 30.
The Perl configuration script fails with:
Can't locate XML/LibXML.pm in @INC (@INC contains:
/gpfs/mira-home/taylorm/codes/acme-v0.1/scripts/b1850c5_m1a/Tools
/gpfs/mira-home/taylorm/codes/acme-v0.1/scripts/b1850c5_m1a/Tools/Tools
/usr/local/lib64/perl5 /usr/local/share/perl5
/usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl
/usr/lib64/perl5 /usr/share/perl5 .) at
/gpfs/mira-home/taylorm/codes/acme-v0.1/scripts/b1850c5_m1a/Tools/ConfigCase.pm line 101.
BEGIN failed--compilation aborted at
/gpfs/mira-home/taylorm/codes/acme-v0.1/scripts/b1850c5_m1a/Tools/ConfigCase.pm line 101.
Compilation failed in require at ./Tools/xml2env line 122.
This missing library, "XML/LibXML.pm", does exist in /usr/lib64/perl5,
which is one of the paths shown in the error log.
I've identified what may be a serious bug in HOMME. (It is obviously causing errors in diagnostic output, but the issue could have much wider consequences.) There is a simple workaround however, so we can avoid this particular problem in our development and production runs.
The problem showed up on Mira, where it is useful to have more threads in the atmosphere dynamics than there are grid cells, so some threads will be idle during the dynamics. (This has been the case on other systems in the past, but Mira is high thread count friendly, so we bumped into this scenario here most recently. I have verified identical behavior on Titan.)
The global variable NThreads is set to the thread count specified in env_mach_pes.xml (I assume). In one particular example (ne30_g16, FAMIPC5, 900 processes, 8 threads per process):
Main:NThreads = 8
Main:n_domains= 6
This breaks the logic in the max and min functions in reduction_mod.F90, where threads are let into the critical section until a counter exceeds NThreads, at which point thread 0 calls an MPI_Allreduce. In this case, only 6 threads are active (have assigned work), so the calls to min and max overlap: e.g., 6 data from a call to min and the next two data from a call to max are "reduced" together before the call to the Allreduce. (I'm surprised that this does not cause a hang, but print statements have verified this behavior.)
The max/min operators are the only ones with this particular logic, but there is also code in, for example, decompose in domain_mod.F90 where
integer :: beg(0:ndomains)
...
domain%start=beg(ipe)
domain%end =beg(ipe+1)-1
and ipe is 0:NThreads-1. So this looks to be reading from random memory for thread 7, and could cause problems if thread 7 decides that it has work to do? A HOMME expert will have to comment.
In any case, it LOOKS like we are safe if we do not overdecompose the model. I can add an error abort in case we do that by accident until this is fixed. Comments?
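The failure mode above can be modeled in a single-threaded sketch. The names here (Reducer, contribute) are illustrative, not the actual HOMME code: the shared buffer flushes only when the contribution counter reaches the configured NThreads, so with only 6 active threads the tail of the next reduction is folded into the previous one's buffer.

```python
NTHREADS = 8  # configured thread count (env_mach_pes.xml); only 6 have work

class Reducer:
    """Toy model of a counter-gated critical-section reduction."""
    def __init__(self):
        self.count = 0
        self.buf = None
        self.results = []

    def contribute(self, value, op):
        # each *active* thread folds its value into the shared buffer ...
        self.buf = value if self.buf is None else op(self.buf, value)
        self.count += 1
        # ... but the flush waits for the full configured thread count,
        # which idle threads never help reach
        if self.count == NTHREADS:
            self.results.append(self.buf)
            self.count, self.buf = 0, None

r = Reducer()
for v in [5, 3, 9, 1, 7, 4]:       # a min() reduction by 6 active threads
    r.contribute(v, min)
for v in [10, 2, 8, 6, 12, 11]:    # the next reduction is a max()
    r.contribute(v, max)

print(r.results)  # [10] -- neither min(...) == 1 nor max(...) == 12
```

The first flushed value mixes six min contributions with two max contributions, matching the "overlapping" behavior described above.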
/home/jgfouca/ACME_Climate/models/atm/cam/src/chemistry/modal_aero/modal_aero_convproc.F90:212.7:
use abortutils, only: endrun
1
Fatal Error: Can't open module file 'abortutils.mod' for reading at (1): No such file or directory
We have found a mixture of tabs and spaces used for indentation in various source files in ACME. The problem with tabs is that everyone sets their tab stops differently according to their particular practice/preference/religion. This means that a single file can render sensibly in one person's text editor and completely nonsensically on another.
The solution is to either pound away on your space bar, or to tell your text editor to expand a tab into a fixed number of spaces. Optimally, we could communicate this aesthetic to people's editors via special files placed in the source tree.
More broadly, we should probably attempt to assemble some least-common-denominator form of a style guide for ACME code (possibly in connection with CESM/NCAR if we can agree on simple guidelines) so that we can avoid these issues in the future.
But for the moment, this issue exists only to a) bring this to our attention, and b) decide how to eradicate all the existing tabs.
NOTE: I don't think we have a proper label for this kind of issue yet. Perhaps we can discuss this as well.
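As a first step toward (b), an inventory of offending files helps. A minimal sketch, assuming tab-indented lines are the target and that Fortran sources use the .F90/.f90 extensions (both assumptions, not project policy):

```python
import os

def files_with_tab_indentation(root, exts=(".F90", ".f90")):
    """Walk a source tree and list files containing tab-indented lines."""
    hits = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="ignore") as f:
                # flag any line whose indentation starts with a tab
                if any(line.startswith("\t") for line in f):
                    hits.append(path)
    return hits
```

Each flagged file could then be cleaned with the editor's expand-tabs feature (or str.expandtabs) at an agreed tab stop.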
for
./create_newcase -case ne30_B1850C5L45BGC_pgi -mach titan -compiler pgi -compset B1850C5L45BGC -res ne30_g16
builds with the PGI compiler are dying with
pgf90-Fatal-/opt/pgi/14.7.0/linux86-64/14.7/bin/pgf901 TERMINATED by signal 11
gmake: *** [ActiveLayerMod.o] Error 127
and
pgf90-Fatal-/opt/pgi/14.7.0/linux86-64/14.7/bin/pgf901 TERMINATED by signal 11
gmake: *** [dynCNDVMod.o] Error 127
and
pgf90-Fatal-/opt/pgi/14.7.0/linux86-64/14.7/bin/pgf901 TERMINATED by signal 11
gmake: *** [lnd2glcMod.o] Error 127
The same ActiveLayer error occurred whether using pgi/14.2.0 or pgi/14.7.0. I tried essentially gutting the routine in this file, and the error persisted, so it has something to do with the modules being used?
In support of the Atmosphere Group I have been trying to find a layout that works with COSP for an ne120 resolution on Mira. My understanding is that this works fine for ne30. However, I have found exactly one pe layout so far that allows this case to work for ne120 (within my current search space):
7200 processes, No OpenMP, MAX_TASKS_PER_NODE=4
If I turn on threading it dies during the first COSP calculation (during timestep 8), apparently from a memory issue. Moreover, it dies even with as few as 2 threads per process, e.g.
7200 processes, 2 threads per task, MAX_TASKS_PER_NODE=8
(lots of core files, with error message of the form
***FAULT Encountered unhandled signal 0x0000000b (11) (SIGSEGV)
Generated by interrupt..................0x00000008 (Data TLB Miss Exception DEAR=0x0000001f09ab1e80 ESR=0x0000000000800000)
)
FYI - without COSP, we run with 7200 processes, 16 threads per task, MAX_TASKS_PER_NODE=64.
On Titan, I had to increase the thread stack size to 128 MB to get COSP for ne120 to work. I have tried this on Mira, even going as high as 512MB per thread, but it has not worked. (I also tried 384, 192, and the default 96.)
I also tried just increasing the number of MPI tasks without using OpenMP e.g.
14400 processes, No OpenMP, MAX_TASKS_PER_NODE=8
and this dies in CICE initialization, also due to memory issues:
in cice log:
(shr_strdata_init) calling shr_dmodel_mapSet for remap
in cesm log:
0: "/gpfs/mira-home/worley/ACME/master/ACME/models/csm_share/shr/shr_map_mod.F90", line 1300: 1525-108 Error encountered while attempting to allocate a data object. The program will stop.
This latter problem appears to be a function solely of the number of ATM MPI processes, not CICE (which I varied) or how many processes are assigned to each node.
So, any Mira experts with any advice? About thread stack size? About how to increase the per-process memory limits (so that CICE initialization does not fail if I use 14400 processes)?
The current working layout is wasteful and too slow. I am guessing that this is not a COSP bug given that ne30 works fine, just a memory issue?
In the process of looking for the reason of the "recursive I/O error" documented in Issue #27, I found the following entry in the NCAR CAM ChangeLog.
Tag name: cam5_3_39
Originator(s): santos
Date: 2014/05/19
One-line Summary: SC-WACCM development, reduce bare add_default calls.
...
M models/atm/cam/src/physics/cam/physics_types.F90
- Add missing deallocation of state%cid.
Despite the One-line Summary, this missing deallocate seems to be a bug that is not specific to SC-WACCM development, occurring in physics_state_dealloc.
This:
deallocate(state%cid, stat=ierr)
if ( ierr /= 0 ) call endrun('physics_state_dealloc error: deallocation error for state%cid')
(right before
end subroutine physics_state_dealloc
in physics_types.F90)
probably should be added to ACME CAM.
Bug identified in CAM ChangeLog (cam5_3_88)
bug fix to ice_nucleate to check qi for non-zero value to prevent divide by zero
Bug identified in CAM ChangeLog in cam5_3_88
CLD_CAL_LIQ, CLD_CAL_ICE, CLD_CAL_UN, CLD_CAL_TMP, CLD_CAL_TMPLIQ, CLD_CAL_TMPICE, and CLD_CAL_TNPUN for COSP1.4 should all be dimensioned the same as the original CALIPSO variable CLD_CAL.
CLD_CAL uses nht_cosp as a vertical size and cosp_ht as the mdim variable. These new output fields are all being saved on the COSP grid - not the CAM grid.
For the land model, I'm developing a solver that uses PETSc. Presently, after a case is created, I manually modify the Macros file and add PETSc relevant information. I noticed that for CISM there already exists a 'USE_TRILINOS' entry in env_build.xml and am wondering if someone from s/w team could provide help in adding 'USE_PETSC' or suggest alternate workflow.
Thanks,
Gautam
The prescribed aerosol code uses the Fortran intrinsic random number generator. There is an algorithm to generate seeds for this random number generator each time it is called. The intention was to get uniformly distributed random numbers, but during our experiments we found that the random numbers are not uniformly distributed (due to incorrect seeding).
To fix this problem we are now using the KISSVEC random number generator, which already exists in CAM. KISSVEC is initialized with 4 predefined seeds and keeps producing uniformly distributed random numbers each time it is invoked. Our tests reveal that using KISSVEC solves the problem we were having with the skewed random number distributions.
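For reference, a KISS-style generator in the spirit of Marsaglia's widely published KISS99 recipe; this is an illustrative Python sketch, not the actual KISSVEC code in CAM:

```python
M32 = 0xFFFFFFFF  # 32-bit mask

def kiss_stream(x, y, z, w):
    """Yield uniform deviates in [0, 1) from four integer seeds."""
    while True:
        # linear congruential component
        x = (69069 * x + 1327217885) & M32
        # 3-shift shift-register component (seed must be nonzero)
        y ^= (y << 13) & M32
        y ^= y >> 17
        y ^= (y << 5) & M32
        # two multiply-with-carry components
        z = (18000 * (z & 0xFFFF) + (z >> 16)) & M32
        w = (30903 * (w & 0xFFFF) + (w >> 16)) & M32
        yield (((x + y + ((z << 16) & M32) + w) & M32)) / 2**32

gen = kiss_stream(123, 362436069, 521288629, 916191069)
sample = [next(gen) for _ in range(3)]  # deterministic for fixed seeds
```

The key property for the fix above is that the state is carried forward between calls, so repeated invocations draw from one well-distributed stream instead of being re-seeded each time.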
I get the following error (I1850CLM45CN compset, pgi compiler on titan):
PGF90-F-0000-Internal compiler error. normalize_forall_array: non-conformable 4449 (/lustre/atlas1/cli112/proj-shared/zdr/models/ACME/models/lnd/clm/src/biogeophys/WaterStateType.F90: 122)
PGF90/x86-64 Linux 14.10-0: compilation aborted
Can others confirm?
A fix for a similar issue is suggested here:
https://wiki.ucar.edu/display/ccsm/Fortran+Compiler+Bug+List
In fact, defining the dummy allocatable array works on my local branch, but I wanted to identify the issue here first before proposing the fix.
All the test cases in acme_developer test suite were failing on PNNL's
Cascade machine.
Commit b078b5d
implements a workaround in which the BFBFLAG is set to TRUE on cascade, allowing (some) test cases to pass. Other tests are still failing.
I tracked down a seg. fault when running with CLM4.5 in a B case. Checking the head of the CESM CLM2 trunk, this bug has already been fixed there. I'll try their fix, but an FYI that the version of CLM that we grabbed has bugs. Not unexpected, but all the same ...
Since the default queue on Cetus (the only queue with unrestricted access) has a time limit of 1 hr, jobs on Cetus are currently launched with a 0 time limit (ignoring the weights generated by scripts). Ideally, jobs with a 0 time limit should get converted to the max time limit (1 hr), but due to a bug in qsub this does not seem to work as expected every time.
The ALCF admins have also requested that we refrain from submitting 0 time limit jobs until the qsub bug is fixed (they are not able to enforce job queuing policies due to the bug). So we need to cap all jobs on Cetus at a time limit of 59m.
Recent experiments with CLM4_5 (both I and B cases) using pgi/14.10 on Titan are failing with "urban net longwave radiation error: no convergence". Identical experiments using the Intel compiler (with all of the recent Intel-specific fixes) do not exhibit this problem.
I tracked this down to computation with NaNs. The field in question is initialized to NaN, but, at first glance, it appears to later be set to a non-NaN value. I am still investigating, but may hand this off to someone if I run out of time.
The NAG compiler fails to compile PIO in every test case in the acme_developer test suite for master (c26945e).
Error messages:
https://gist.github.com/singhbalwinder/cd21494e3bbff14cc5dd
These errors are related to type mismatch between two data types in subroutine calls.
I traced down an out-of-memory error on Titan when using the Intel compiler to a "merge" instruction in init_moc_ts_transport_arrays in diags_on_lat_aux_grid.F90 in ocn/pop2/source. Checking the NCAR ChangeLog, this line has been replaced with a "where":
Tag Creator: mlevy
Developers: jedwards
Tag Date: 3 Nov 2014
Tag Name: pop2/trunk_tags/cesm_pop_2_1_20141103
Tag Summary: Bugfix - memory leak in POP diagnostics (change a "merge" command
to a "where" statement)
Files Modified:
M source/diags_on_lat_aux_grid.F90
I also noticed another bug fix that we might care about:
Tag Creator: andre
Developers: andre
Tag Date: 25 Nov 2014
Tag Name: pop2/trunk_tags/cesm_pop_2_1_20141126
Tag Summary: intel15 debug builds mistakenly treat unallocated memory
on all non-io processors as an error when calling
write_nstd_netcdf, even though only the ioprocesser
accesses the memory.
Files Modified:
M source/diags_on_lat_aux_grid.F90
These are simple, hopefully bit-for-bit, changes. Since they are required for the Intel compiler on Titan, I will put in a pull request when I get the chance, unless someone does so first.
This relates to Jira task PG-71 .
Bill Sacks pointed out the bug and bugfix identified as Bug-2168 by CESM.
Today this builds successfully:
./create_newcase -case F1850C5.ne30 -mach titan -compset F1850C5 -res ne30_g16 -project cli115 -compiler pgi
while this does not
./create_newcase -case F1850.ne30 -mach titan -compset F1850 -res ne30_g16 -project cli115 -compiler pgi
The error message is
PGF90-F-0004-Unable to open MODULE file modal_aero_convproc.mod (/autofs/na3_home1/worley/ACPI/SVN/ACME/master/ACME/models/atm/cam/src/physics/cam/physpkg.F90: 1747)
and, in fact, I do not see a module file called modal_aero_convproc being built. (It is not that it fails to build - it simply has not been built before it is referenced.)
This is CAM4, so perhaps we do not care about it. Atmosphere Group should advise, and also verify on other systems, though this seems to be a build logic issue not a system-specific issue.
Note: Problem was first identified by @bmayerornl (for a T31_g37 FVCAM build). So this issue occurs for multiple dycores and multiple resolutions.
When I run MPAS-Ocean coupled to a data atmosphere, the shortwave forcing is incorrect. The easiest way to see this is to set the ocn/atm coupling period to one day. Then the shortwave to the ocean should be constant over a day, and nearly constant along a latitude line, within cloudiness. But the shortwave is just on a portion of the earth. Also note scale of x2o colorbar, which changes from hour 1 to 12. Here I had the coupler output history files.
Note I used the pull request branch for MPAS-O, but these coupler fields should be the same for POP. However, I did not check that.
atmosphere to coupler, hour 1, a2x_Faxa_swnet
coupler to ocean, hour 1, x2oacc_Foxx_swnet
atmosphere to coupler, hour 12, a2x_Faxa_swnet
coupler to ocean, hour 12, x2oacc_Foxx_swnet
(note color bar scale, and location is the same as hour 1!)
Here is how I made these images. I ran acme on mustang at LANL:
git checkout -b dj_mpaso_addmodel origin/douglasjacobsen/mpas-o/add-model
git submodule update
set ACME_CASE = a04t
create_newcase -case $CASE_ROOT/$ACME_CASE -compset CMPASO-IAF -mach mustang -res T62_mpas120
cd $CASE_ROOT/$ACME_CASE
./cesm_setup
${ACME_CASE}.build
vi env_run.xml
change the following:
<entry id="NCPL_BASE_PERIOD" value="day" />
<entry id="HIST_OPTION" value="nhour" />
<entry id="HIST_N" value="1" />
run job. Then get the coupler history files:
mu-fe1.lanl.gov> ncdump a04t.cpl.hi.0001-01-01-10800.nc -h | grep -i swnet
double a2x_Faxa_swnet(time, a2x_ny, a2x_nx) ;
a2x_Faxa_swnet:_FillValue = 1.e+30 ;
a2x_Faxa_swnet:units = "W m-2" ;
a2x_Faxa_swnet:long_name = "Net shortwave radiation" ;
a2x_Faxa_swnet:standard_name = "surface_net_shortwave_flux" ;
a2x_Faxa_swnet:internal_dname = "a2x_ax" ;
double x2oacc_Foxx_swnet(time, x2oacc_ny, x2oacc_nx) ;
x2oacc_Foxx_swnet:_FillValue = 1.e+30 ;
x2oacc_Foxx_swnet:units = "W m-2" ;
x2oacc_Foxx_swnet:long_name = "Net shortwave radiation" ;
x2oacc_Foxx_swnet:standard_name = "surface_net_shortwave_flux" ;
x2oacc_Foxx_swnet:internal_dname = "x2oacc_ox" ;
extract the needed variables only:
ncks -v a2x_Faxa_swnet,x2oacc_Foxx_swnet,domo_lon,domo_lat a04t.cpl.hi.0001-01-01-10800.nc a04t.cpl.hi.0001-01-01-10800_short.nc
ncks -v a2x_Faxa_swnet,x2oacc_Foxx_swnet,domo_lon,domo_lat a04t.cpl.hi.0001-01-01-43200.nc a04t.cpl.hi.0001-01-01-43200_short.nc
in ferret:
yes? use a04t.cpl.hi.0001-01-01-10800_short.nc
yes? show data
currently SET data sets:
1> ./a04t.cpl.hi.0001-01-01-10800_short.nc (default)
name title I J K L M N
A2X_FAXA_SWNET
Net shortwave radiation 1:192 1:94 ... 1:1 ... ...
DOMO_LAT 1:28574 1:1 ... 1:1 ... ...
DOMO_LON 1:28574 1:1 ... 1:1 ... ...
X2OACC_FOXX_SWNET
Net shortwave radiation 1:28574 1:1 ... 1:1 ... ...
yes? go basemap x=0:360 Y=90s:90n 20
yes? go polymark polygon/over/nolab/key DOMO_LON DOMO_LAT X2OACC_FOXX_SWNET circle 0.25
yes? shade A2X_FAXA_SWNET
The t_stopf() call for ecosysdyn is misplaced in clm_driver.F90, leading to inaccurate performance data for some science cases. This issue was identified and fixed by Pat Worley in pull request #90, which also fixed issue #81. This bug was eliminated in CESM CLM tag clm4_5_1_r097 as a part of ED refactorization and was also not identified as a fixed bug in that tag. ACME has chosen not to incorporate that ED refactorization into the V1 model development, but expects to use it for V2 model development.
If the user does not specify "-sharedlibroot" while creating and running tests (./create_test ... -sharedlibroot ...) on blues the tests fail to build correctly.
CESM BUILDEXE SCRIPT STARTING
COMPILER is intel
The solution is to define CESMSCRATCHROOT in the blues configuration.
There is a minor bug in namelist_definition.xml. In the xml entry below, trop_mam4 should be listed as a valid value (but it is currently missing). I have confirmed this with @singhbalwinder , who added the MAM4 code in ACME.
Name of the CAM chemistry package. N.B. this variable may not be set by
the user. It is set by build-namelist via information in the configure
cache file to be consistent with how CAM was built.
Default: set by build-namelist
This bug was reported to CSEG and fixed in clm4_5_1_r109.
Additional information about this bug is available at
http://bugs.cgd.ucar.edu/show_bug.cgi?id=2183
The CLM compset I_2000_CLM45_CN_CROP (resolution f19_g16) fails when writing a history file. Based on the log file, the error may be due to faulty reading of one of the initialization files. The errors include "NetCDF: Invalid dimension ID or name" and "NetCDF: Variable not found". This is not a supported configuration of the model, so the datasets may not have been updated.
Email from Pat Worley, October 12 2014:
The Cray compiler is complaining about an expression in ecosys_mod.F90 in the version of POP2 in master (and in the aces4bgc branch). I guess that this is working correctly for the compilers that aren't complaining, but the Cray compiler does have a point:
" A subscript must be a scalar integer expression."
WORK1 = DTRACER_MODULE(:,:,dic_ind) + DTRACER_MODULE(:,:,doc_ind) &
+ DTRACER_MODULE(:,:,zooC_ind) &
+ sum(DTRACER_MODULE(:,:,autotrophs(:)%C_ind), dim=3)
Here C_ind (in the third line) is declared to be a floating point value.
This has been fixed in the POP2 trunk on the CESM repository. The fix involves a significantly rewritten ecosys_params.F90 and the introduction of a new file, ecosys_share.F90. Something simple may be sufficient, but I wanted to point this out.
Were these modifications in collaboration with LANL? Do we care about this?
Note also that the Cray compiler does not like some of the (legacy) vectorization directives in vmix_kpp.F90. Some are not recognized (just warnings) and some are but with different expectations, resulting in errors.
FYI.
Pat
Some ACME developer tests on Mira currently use 512 nodes and the walltimes for the tests are between 1.5 hrs and 6 hrs. On Cetus the tests use the same number of nodes but the walltime is capped at 1 hr. We need to modify the scripts (config_pes.xml?) so that the developer tests run on a smaller set of nodes (can we bunch up the whole test suite into a single job with 512 nodes?) and request realistic times.
[jayesh@miralac1 scripts (master=)]$ qstat -u jayesh
JobID User WallTime Nodes State Location
428838 jayesh 03:00:00 512 queued None
428849 jayesh 03:00:00 512 queued None
428852 jayesh 01:30:00 512 running MIR-44800-77B31-512
428853 jayesh 01:30:00 512 running MIR-04880-37BB1-512
428854 jayesh 01:30:00 512 running MIR-04800-37B31-512
428855 jayesh 01:30:00 512 queued None
428856 jayesh 01:30:00 512 queued None
428859 jayesh 06:00:00 1 queued None
428860 jayesh 01:30:00 512 queued None
428861 jayesh 06:00:00 1 queued None
428863 jayesh 01:30:00 512 queued None
In order to guarantee collection of performance data for production runs, the env_run.xml parameter SAVE_TIMING is set to TRUE by default. This is unnecessary for functionality testing, and it fills up the performance data archive with provenance and performance data that will never be looked at. This is only an issue on Edison, Hopper, Mira, and Titan at the moment, but the intent is to add the performance data archiving capability to the development systems as well.
I propose adding an xmlchange to the test scripts to set SAVE_TIMING to FALSE except for tests meant to examine performance. I've put this into the issue list because I'd prefer that the keepers of the testing scripts weigh in on this and also implement the change(s).
Note that I DO NOT want the default changed to FALSE, as this defeats the goal of making performance data collection the norm for ACME runs.
When running compset ICLM45BGC, ACME built with the Intel compiler on Titan dies with a seg. fault in VOCEmissionMod.F90 (line 199). Checking more recent tags of CLM I found the following:
Tag name: clm4_5_1_r091
Originator(s): muszala (Stefan Muszala)
Date: Mon Oct 27 09:48:56 MDT 2014
One-line Summary: update externals. fix bug so CLM runs with Intel 14x.
Purpose of changes: Update externals. Fix bug in VOCEmissionMod.F90 that prevented
CLM from running with Intel 14x on yellowstone. Bring in workaround for bug 1730 from
Sacks.
-- remove duplicate assignment of 0_r8 to meg_out(imeg)%flux_out
M models/lnd/clm/src/biogeochem/VOCEmissionMod.F90
After making this change (commenting out the offending line), this ICLM45BGC case ran successfully. This change should be backported into our version of CLM4.5.
Note also the reference to another bug workaround:
-- Sacks' workaround for bug 1730
M models/lnd/clm/src/main/histFileMod.F90
M models/lnd/clm/src/main/ncdio_pio.F90
M models/lnd/clm/src/main/ncdio_pio.F90.in
I do not know what bug 1730 is (could not find it, since it has since been closed?), nor whether it is relevant to us.
(added to issue list at @jnjohnsonlbl 's request):
The version of env_mach_specific in master for Edison (which is identical to the most recent one from NCAR) is generating some warning messages. They seem innocuous, but we might want to look into whether this needs to be updated:
during setup, build, and submit:
Unloading of cray-mpich module was not able to restore CRAY_LD_LIBRARY_PATH.
The following will resolve this issue:
'setenv CRAY_LD_LIBRARY_PATH /opt/cray/libsci/13.0.1/INTEL/140/x86_64/lib'
when job starts running:
cray-mpich/7.0.4(30):ERROR:150: Module 'cray-mpich/7.0.4' conflicts with the currently loaded module(s) 'cray-mpich/6.3.1'
cray-mpich/7.0.4(30):ERROR:102: Tcl command execution failed: conflict cray-mpich
At some point the COSP Make logic (in the Makefile in the Machines directory) was updated in ACME, and it is incompatible with the rest of the model.
Originator: cacraig
Date: Jul 17, 2014
Model: Machines
Version: Machines_140717
One-line: CAM's abortutils is now cam_abortutils
M Makefile
- cosp now depends on cam_abortutils
Tag name: cam5_3_44
Originator(s): cacraig
Date: 07/16/14
One-line Summary: Update externals to cesm1_3_beta11
Purpose of changes:
Tag name: cam5_3_36
Originator(s): cacraig, goldy
Date: May 5, 2014
So we do not have cam_abortutils, and the current build logic for COSP requires it (the build fails without it). It may be enough to modify the Makefile in scripts/ccsm_utils/Machines. I'll give this a try, but it may be better to update to a slightly more recent version of CAM (or import just this change), especially since the ChangeLog message indicates that this is required when using more recent versions of CLM?
Hi,
I'm having trouble cloning this git repository on rhea:
zender@rhea-login1g:~$ git clone https://github.com/ACME-Climate/DiagnosticsWorkflow.git
Initialized empty Git repository in /autofs/nccs-svm1_home1/zender/DiagnosticsWorkflow/.git/
error: The requested URL returned error: 403 Forbidden while accessing https://github.com/ACME-Climate/DiagnosticsWorkflow.git/info/refs
fatal: HTTP request failed
This same cloning command works fine on my home machine, but not on rhea.
Any help appreciated. Thanks!
Charlie
The c2l subroutines are used for averaging variables defined at the 'column' level up to the 'landunit' level. The subroutines c2l_1d and c2l_2d incorrectly use 'pft' level quantities. This bug was identified by NCAR as bug 2077 and fixed in the clm4_5_r095 tag.
In the check_exactrestart.pl script, which is used to compare two files and verify exact restartability between them, there is the potential for an infinite loop.
The loop that causes this issue is:
https://github.com/ACME-Climate/ACME/blob/master/scripts/ccsm_utils/Tools/check_exactrestart.pl#L112
When the second file passed into the script does not contain the comm_diag token, this loop becomes infinite and burns all of the wallclock time for a test without generating a pass (the test remains in the RUN state).
The comm_diag token is missing from a file when the run fails. So, if the restart run fails while performing an ERS test, this script causes an infinite loop.
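The script itself is Perl; this Python sketch (with a hypothetical find_token helper) only illustrates the missing guard: a token search over a file must treat end-of-file as a failure condition rather than looping forever.

```python
def find_token(path, token="comm_diag"):
    """Return the first line containing token, or None if it is absent."""
    with open(path) as f:
        for line in f:          # iterating the file handle stops at EOF
            if token in line:
                return line.rstrip("\n")
    return None                  # token absent (e.g. failed run): report it
```

With this shape, a failed restart run whose log lacks comm_diag produces an immediate comparison failure instead of leaving the test stuck in RUN.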
In an ne30 B1850C5 test case, the run will die on Titan with a "recursive I/O call" error message when using an executable built with the Intel compiler. I tracked this down to a call to shr_sys_flush in physics_update in physics_types.F90 in atm/cam/src/physics/cam . Commenting out this line eliminates the error, and this one test run then completes normally. I do not know if this is indicative of a deeper issue, however. I will try to verify with developer's test suite once Titan returns from its OS upgrade.
This is related to Jira task PG-71.
Regional output is selected via the fincllonlat namelist entries in the ACME atmosphere model. However, this entry is not currently supported for the target SE dycore. This bug was reported by @danielerosa.
A bug for MG1 was reported in cam5_3_91 @ NCAR
There is a bug in micro_mg_cam.F90 when calling cnst_add incorrectly with is_convtran1=.false.
The bugfix is answer-changing.
JIRA task
https://acme-climate.atlassian.net/browse/AG-300
Implemented the bugfix for f27fac7 (where the bug was introduced) on github branch
kaizhangpnl/atm/bugfix_MG1_introb
and for the current master on branch
kaizhangpnl/atm/bugfix_MG1
The first branch is what Jin-Ho recommended (#207 (comment)). However, that page also says "Bugfix-doc: Unlike new feature development, bug fixes are typically never started from the HEAD of master". So it seems that we have an unusual case, and I am not sure whether it still makes sense to follow that path. Anyway, both options are provided to the integrator.
The 5-day simulation before and after the bugfix (with kaizhangpnl/atm/bugfix_MG1) can be found on cascade:
/dtemp/zhan524/csmruns/ACME_master_bugfix_mg1_nag (current master)
/dtemp/zhan524/csmruns/ACME_master_bugfix_mg1_nag_fix (bugfixed master)
Added namelist variable "apply_mg1_bugfix" to switch on/off the bugfix
modified: models/atm/cam/bld/namelist_files/namelist_defaults_cam.xml
modified: models/atm/cam/bld/namelist_files/namelist_definition.xml
modified: models/atm/cam/src/physics/cam/micro_mg_cam.F90
modified: models/atm/cam/src/physics/cam/phys_control.F90
When hist_dov2xy=.false. is specified in user_nl_clm, the code crashes. This was reported as bug-1730 and fixed in clm4_5_1_r091 by NCAR.
@bbye reported that the ACME code crashes for the following case:
create_newcase -case cropharv_mtest -mach edison -res f19_g16 -compset I_2000_CLM45_CN_CROP -project ccsm1
When running ERS.f45_g37.B1850C5 on Titan using the Intel compiler, the job dies with a malloc assertion error in PIO. I found a comment on DiscussCESM that CESM requires a modification to the FV version of spmd_dyn.F90 in order to work with the 14 series of the Intel compiler (which is what is available on Titan).
Using this updated version (or compiling the current version with -O1) eliminates the problem.
I'll check in this update when I get the chance; however, this is only necessary because our developer test suite includes FV, which we will never be using in production. It seems a waste of time to have to debug FV in order to get the test suite to work.
The following prebeta tests fail on Mira (in addition to expected prebeta test failures) after the CLM 4.5 merge (master #d9d6f4c),
ERS_Ld7.f09_g16.B1850CNCHM.mira_ibm (RUN - still fails, as of Feb 17, 2015, #c5fa8d4)
ERS_IOP_Ld3.f19_f19.F1850PDC5.mira_ibm (SFAIL - still fails, as of Feb 17, 2015, #c5fa8d4)
These tests have been fixed by the changes discussed below,
ERI.ne30_g16.B1850C5CN.mira_ibm
PFS.ne30_g16.B1850C5CN.mira_ibm
PFS.ne30_g16.B1850C5L45BGC.mira_ibm
PFS.ne30_ne30.F1850C5L45BGC.mira_ibm
SMS_D.ne30_g16.FCN.mira_ibm
The bug was identified by NCAR as Bug 2090 and was fixed in drvseq5_1_02.
A build of a B1850C5CN_ne30_g16 case on cetus fails on next (#607557d). It looks like the variable emis_scale was defined twice (only compiled conditionally). The error message is shown below:
Case:
./create_newcase -case B1850C5CN_ne30_g16_test_build01 -compset B1850C5CN -res ne30_g16 -mach cetus
Build Error:
mpixlf2003_r ... /gpfs/mira-home/jayesh/acme/ACME_merge02/models/atm/cam/src/chemistry/modal_aero/seasalt_model.F90
"/gpfs/mira-home/jayesh/acme/ACME_merge02/models/atm/cam/src/chemistry/modal_aero/seasalt_model.F90", line 85.28: 1514-004 (S) Name given for constant with PARAMETER attribute was defined elsewhere with conflicting attributes. Name is ignored.
"/gpfs/mira-home/jayesh/acme/ACME_merge02/models/atm/cam/src/chemistry/modal_aero/seasalt_model.F90", line 85.28: 1514-071 (W) Identifier emis_scale was previously defined with same type.
** seasalt_model === End of Compilation 1 ===
1501-511 Compilation failed for file seasalt_model.F90.
Building with the latest master with clm4_5 fails with the error:
"./lnd/clm/src/main/subgridAveMod.F90", line 790.36: 1516-036 (S) Entity co has undefined type.
The line in question is
sumwt(l) = sumwt(l) + co%wtlunit(c)
It appears that this should be
sumwt(l) = sumwt(l) + col%wtlunit(c)
A land developer should verify (and submit a fix).
The version of cray-netcdf specified in env_mach_specific.titan for mpi-serial needs to be updated. It is currently 4.3.0, which no longer exists. The default (and most recent) is now 4.3.2, which is already being used for the MPI branch.
Using the following command to create and run the acme_developer tests on a Mac produced a build problem in a test case:
cd ACME/scripts
export NETCDF=/usr/local
./create_test -testid t01 -xml_category acme_developer -xml_mach mac -xml_compiler gnu
The following link error occurs during the build of the "glimmer-cism" library:
mpif90 -o /Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/cesm.exe ccsm_comp_mod.o ccsm_driver.o component_mod.o component_type_mod.o cpl_comp_esmf.o cplcomp_exchange_mod.o prep_aoflux_mod.o prep_atm_mod.o prep_glc_mod.o prep_ice_mod.o prep_lnd_mod.o prep_ocn_mod.o prep_rof_mod.o prep_wav_mod.o seq_diag_mct.o seq_domain_mct.o seq_flux_mct.o seq_frac_mct.o seq_hist_mod.o seq_io_mod.o seq_map_esmf.o seq_map_mod.o seq_map_type_mod.o seq_rest_mod.o t_driver_timers_mod.o -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -latm -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -lice -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -llnd -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -locn -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -lrof -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -lglc -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -lwav -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/glc/lib/ -lglimmercismfortran -L/Users/jeff/projects/acme/scratch/sharedlibroot.t01/gnu/openmpi/nodebug/nothreads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share -lcsm_share -L/Users/jeff/projects/acme/scratch/sharedlibroot.t01/gnu/openmpi/nodebug/nothreads/lib -lpio -lgptl -lmct -lmpeu -L/usr/local/lib -lnetcdff -lnetcdf -framework Accelerate -all_load
duplicate symbol _main in:
ccsm_driver.o
/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/glc/lib//libglimmercismfortran.a(dlapqc.f.o)
ld: 1 duplicate symbol for architecture x86_64
collect2: error: ld returned 1 exit status
This is because dlapqc.f, a source file in the SLAP library bundled with libglimmer-solve, contains a Fortran program, not library functions. The fix is to exclude this source from the library in models/glc/cism/glimmer-cism/CMakeLists.txt. I will submit a pull request to make this change.
Alternatively, an interested party on the Land-Ice team could add the following line to the list of removed Fortran sources in models/glc/cism/glimmer-cism/CMakeLists.txt (around line 263 in the master branch):
${GLIMMER_SOURCE_DIR}/libglimmer-solve/SLAP/dlapqc.f
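In CMake terms, the exclusion could look like the following sketch. The list variable name FORTRAN_SOURCE_FILES is illustrative only; the actual variable holding the library sources in glimmer-cism's CMakeLists.txt should be used.

```cmake
# Sketch: drop the standalone SLAP test program from the library
# sources so its PROGRAM main does not collide with ccsm_driver's.
# (FORTRAN_SOURCE_FILES is a placeholder for the real list name.)
list(REMOVE_ITEM FORTRAN_SOURCE_FILES
     ${GLIMMER_SOURCE_DIR}/libglimmer-solve/SLAP/dlapqc.f)
```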
Issue discovered by Qi Tang. In CAM, all allowable namelist variables should appear in the atm_in namelist with their default values (or with non-default values if the user specifies them in "user_nl_cam"). By default, "zmconv_tau" is missing. If not changed by the user, it still uses its correct default value, 3600 s, but it does not appear in the namelist.
This is a minor issue: to determine the value of zmconv_tau used by a simulation, one currently must check the log files, since it won't appear in atm_in unless the user set it explicitly for that simulation.
Fix: add an "add_default()" entry for zmconv_tau in cam/bld/build_namelist, and add a default value of 3600d0 in cam/bld/namelist_files/namelist_defaults.xml.
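A sketch of the two edits described above; the exact formatting should follow the neighboring entries in each file.

```
# In cam/bld/build_namelist (Perl), alongside the other ZM entries:
add_default($nl, 'zmconv_tau');

# In cam/bld/namelist_files/namelist_defaults.xml:
<zmconv_tau>3600d0</zmconv_tau>
```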
One caveat: will reading "3600d0" from the namelist be BFB with the current default of 3600._r8 set in the Fortran? If they differ at machine precision, our regression suite will need rebaselining, and hence this commit should be isolated and not combined with other changes. Hopefully this is not a problem, since 3600 can be represented exactly in IEEE floating point.
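The exact-representability claim is easy to verify: 3600 is an integer far below 2**53, so an IEEE 754 double holds it exactly, and any correct parser of "3600d0" must produce the same bits as the Fortran literal 3600._r8. A quick Python check:

```python
import struct

# 3600 is an integer well below 2**53, so it is exactly
# representable as an IEEE 754 double: round-tripping through the
# 8-byte binary encoding loses nothing.
bits = struct.pack('<d', 3600.0)
assert struct.unpack('<d', bits)[0] == 3600.0

# The value has no fractional part, confirming no rounding occurred.
assert 3600.0.is_integer()
assert 3600 < 2**53
```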
[AG-189] Minghua Zhang (SoMAS, Stony Brook University) found a small problem in the calculation of solar insolation in CAM, which he presented at the AMWG meeting. The issue is that the insolation is calculated at the beginning of a time step and held constant over the length of a radiation time step. This produces a small but discernible variation in the solar constant from one latitude and longitude to another. The correct method is to average the insolation over the time step. [reported by @philrasch ]
Here is the image showing the difference:
Figure caption: Annual-mean FSDT, FSNTC, FSNT, FSNSC, FSNS for (left column) a 1-hour radiation time step with the revised algorithm, (middle column) the original algorithm minus the revised algorithm for a 3-hour radiation time step, (right column) the original algorithm minus the revised algorithm for a 1-hour radiation time step. Units: W/m2
File affected: shr_orb_mod.F90 (function shr_orb_cosz)
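The effect can be illustrated with a toy model (not the E3SM code): the cosine of the solar zenith angle at the start of a radiation step differs from its average over the step, and the gap grows with step length. A minimal Python sketch, assuming for simplicity a fixed solar declination of zero (equinox conditions); all names here are illustrative.

```python
import math

def cosz(t_hours, lat_rad, decl_rad=0.0):
    """Cosine of solar zenith angle in a simple equinox model
    (declination fixed at 0 by default); t_hours is local solar time."""
    hour_angle = math.pi * (t_hours - 12.0) / 12.0
    return max(0.0, math.sin(lat_rad) * math.sin(decl_rad)
               + math.cos(lat_rad) * math.cos(decl_rad) * math.cos(hour_angle))

def avg_cosz(t0, dt, lat_rad, n=100):
    """Average cosz over [t0, t0 + dt] by midpoint sampling."""
    return sum(cosz(t0 + (i + 0.5) * dt / n, lat_rad) for i in range(n)) / n

lat = math.radians(45.0)
point = cosz(9.0, lat)           # start-of-step value, held fixed by the old scheme
mean = avg_cosz(9.0, 3.0, lat)   # average over a 3-hour radiation step
# Holding the start-of-step value constant (point = 0.5 here) biases the
# insolation low relative to the step mean (about 0.64), which is the
# kind of error the revised time-averaging algorithm removes.
```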