e3sm-project / e3sm
Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
Home Page: https://docs.e3sm.org/E3SM
License: Other
The ACME v0.1 code stopped working on Mira earlier this week. Identical code and case was working last week. Assigning to Jayesh since you're listed as the Mira POC.
Reported to ALCF support on Jan 30.
The Perl configuration script fails with:
Can't locate XML/LibXML.pm in @INC (@INC contains:
/gpfs/mira-home/taylorm/codes/acme-v0.1/scripts/b1850c5_m1a/Tools
/gpfs/mira-home/taylorm/codes/acme-v0.1/scripts/b1850c5_m1a/Tools/Tools
/usr/local/lib64/perl5 /usr/local/share/perl5
/usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl
/usr/lib64/perl5 /usr/share/perl5 .) at
/gpfs/mira-home/taylorm/codes/acme-v0.1/scripts/b1850c5_m1a/Tools/ConfigCase.pm line 101.
BEGIN failed--compilation aborted at
/gpfs/mira-home/taylorm/codes/acme-v0.1/scripts/b1850c5_m1a/Tools/ConfigCase.pm line 101.
Compilation failed in require at ./Tools/xml2env line 122.
This missing library, "XML/LibXML.pm", does exist in /usr/lib64/perl5,
which is one of the paths shown in the error log.
I've identified what may be a serious bug in HOMME. (It is obviously causing errors in diagnostic output, but the issue could have much wider consequences.) There is a simple workaround however, so we can avoid this particular problem in our development and production runs.
The problem showed up on Mira, where it is useful to have more threads in the atmosphere dynamics than there are grid cells, so some threads will be idle during the dynamics. (This has been the case on other systems in the past, but Mira is high thread count friendly, so we bumped into this scenario here most recently. I have verified identical behavior on Titan.)
The global variable NThreads is set to the thread count specified in env_mach_pes.xml (I assume). In one particular example (ne30_g16, FAMIPC5, 900 processes, 8 threads per process):
Main:NThreads = 8
Main:n_domains= 6
This breaks the logic in the max and min functions in reduction_mod.F90, where threads are let into the critical section until a counter exceeds NThreads, at which point thread 0 calls an MPI_Allreduce. In this case, only 6 threads are active (have assigned work), so the calls to min and max overlap: e.g., 6 data from a call to min and the next two data from a call to max are "reduced" together before the call to the Allreduce. (I'm surprised that this does not cause a hang, but print statements have verified this behavior.)
The max/min operators are the only ones with this particular logic, but there is also code in, for example, decompose in domain_mod.F90 where
integer :: beg(0:ndomains)
...
domain%start=beg(ipe)
domain%end =beg(ipe+1)-1
and ipe is 0:NThreads-1. So this looks to be reading from random memory for thread 7, and could cause problems if thread 7 decides that it has work to do? A HOMME expert will have to comment.
In any case, it LOOKS like we are safe if we do not overdecompose the model. I can add an error abort in case we do that by accident until this is fixed. Comments?
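The failure mode above can be modeled in a single-threaded sketch. The names here (Reducer, contribute) are illustrative, not the actual HOMME code: the shared buffer flushes only when the contribution counter reaches the configured NThreads, so with only 6 active threads the tail of the next reduction is folded into the previous one's buffer.

```python
NTHREADS = 8  # configured thread count (env_mach_pes.xml); only 6 have work

class Reducer:
    """Toy model of a counter-gated critical-section reduction."""
    def __init__(self):
        self.count = 0
        self.buf = None
        self.results = []

    def contribute(self, value, op):
        # each *active* thread folds its value into the shared buffer ...
        self.buf = value if self.buf is None else op(self.buf, value)
        self.count += 1
        # ... but the flush waits for the full configured thread count,
        # which idle threads never help reach
        if self.count == NTHREADS:
            self.results.append(self.buf)
            self.count, self.buf = 0, None

r = Reducer()
for v in [5, 3, 9, 1, 7, 4]:       # a min() reduction by 6 active threads
    r.contribute(v, min)
for v in [10, 2, 8, 6, 12, 11]:    # the next reduction is a max()
    r.contribute(v, max)

print(r.results)  # [10] -- neither min(...) == 1 nor max(...) == 12
```

The first flushed value mixes six min contributions with two max contributions, matching the "overlapping" behavior described above.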
/home/jgfouca/ACME_Climate/models/atm/cam/src/chemistry/modal_aero/modal_aero_convproc.F90:212.7:
use abortutils, only: endrun
1
Fatal Error: Can't open module file 'abortutils.mod' for reading at (1): No such file or directory
We have found a mixture of tabs and spaces used for indentation in various source files in ACME. The problem with tabs is that everyone sets their tab stops differently according to their particular practice/preference/religion. This means that a single file can render sensibly in one person's text editor and completely nonsensically on another.
The solution is to either pound away on your space bar, or to tell your text editor to expand a tab into a fixed number of spaces. Optimally, we could communicate this aesthetic to people's editors via special files placed in the source tree.
More broadly, we should probably attempt to assemble some least-common-denominator form of a style guide for ACME code (possibly in connection with CESM/NCAR if we can agree on simple guidelines) so that we can avoid these issues in the future.
But for the moment, this issue exists only to a) bring this to our attention, and b) decide how to eradicate all the existing tabs.
NOTE: I don't think we have a proper label for this kind of issue yet. Perhaps we can discuss this as well.
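As a first step toward (b), an inventory of offending files helps. A minimal sketch, assuming tab-indented lines are the target and that Fortran sources use the .F90/.f90 extensions (both assumptions, not project policy):

```python
import os

def files_with_tab_indentation(root, exts=(".F90", ".f90")):
    """Walk a source tree and list files containing tab-indented lines."""
    hits = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="ignore") as f:
                # flag any line whose indentation starts with a tab
                if any(line.startswith("\t") for line in f):
                    hits.append(path)
    return hits
```

Each flagged file could then be cleaned with the editor's expand-tabs feature (or str.expandtabs) at an agreed tab stop.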
for
./create_newcase -case ne30_B1850C5L45BGC_pgi -mach titan -compiler pgi -compset B1850C5L45BGC -res ne30_g16
builds with the PGI compiler are dying with
pgf90-Fatal-/opt/pgi/14.7.0/linux86-64/14.7/bin/pgf901 TERMINATED by signal 11
gmake: *** [ActiveLayerMod.o] Error 127
and
pgf90-Fatal-/opt/pgi/14.7.0/linux86-64/14.7/bin/pgf901 TERMINATED by signal 11
gmake: *** [dynCNDVMod.o] Error 127
and
pgf90-Fatal-/opt/pgi/14.7.0/linux86-64/14.7/bin/pgf901 TERMINATED by signal 11
gmake: *** [lnd2glcMod.o] Error 127
The same ActiveLayer error occurred whether using pgi/14.2.0 or pgi/14.7.0. I tried essentially gutting the routine in this file, and the error persisted, so it has something to do with the modules being used?
In support of the Atmosphere Group I have been trying to find a layout that works with COSP for an ne120 resolution on Mira. My understanding is that this works fine for ne30. However, I have found exactly one pe layout so far that allows this case to work for ne120 (within my current search space):
7200 processes, No OpenMP, MAX_TASKS_PER_NODE=4
If I turn on threading it dies during the first COSP calculation (during timestep 8), apparently from a memory issue. Moreover, it dies even with as few as 2 threads per process, e.g.
7200 processes, 2 threads per task, MAX_TASKS_PER_NODE=8
(lots of core files, with error message of the form
***FAULT Encountered unhandled signal 0x0000000b (11) (SIGSEGV)
Generated by interrupt..................0x00000008 (Data TLB Miss Exception DEAR=0x0000001f09ab1e80 ESR=0x0000000000800000)
)
FYI - without COSP, we run with 7200 processes, 16 threads per task, MAX_TASKS_PER_NODE=64.
On Titan, I had to increase the thread stack size to 128 MB to get COSP for ne120 to work. I have tried this on Mira, even going as high as 512MB per thread, but it has not worked. (I also tried 384, 192, and the default 96.)
I also tried just increasing the number of MPI tasks without using OpenMP e.g.
14400 processes, No OpenMP, MAX_TASKS_PER_NODE=8
and this dies in CICE initialization, also due to memory issues:
in cice log:
(shr_strdata_init) calling shr_dmodel_mapSet for remap
in cesm log:
0: "/gpfs/mira-home/worley/ACME/master/ACME/models/csm_share/shr/shr_map_mod.F90", line 1300: 1525-108 Error encountered while attempting to allocate a data object. The program will stop.
This latter problem appears to be a function solely of the number of ATM MPI processes, not CICE (which I varied) or how many processes are assigned to each node.
So, any Mira experts with any advice? About thread stack size? About how to increase the per-process memory limits (so that CICE initialization does not fail if I use 14400 processes)?
The current working layout is wasteful and too slow. I am guessing that this is not a COSP bug given that ne30 works fine, just a memory issue?
In the process of looking for the reason of the "recursive I/O error" documented in Issue #27, I found the following entry in the NCAR CAM ChangeLog.
Tag name: cam5_3_39
Originator(s): santos
Date: 2014/05/19
One-line Summary: SC-WACCM development, reduce bare add_default calls.
...
M models/atm/cam/src/physics/cam/physics_types.F90
- Add missing deallocation of state%cid.
Despite the One-line Summary, this missing deallocate seems to be a bug that is not specific to SC-WACCM development, occurring in physics_state_dealloc.
This:
deallocate(state%cid, stat=ierr)
if ( ierr /= 0 ) call endrun('physics_state_dealloc error: deallocation error for state%cid')
(right before
end subroutine physics_state_dealloc
in physics_types.F90)
probably should be added to ACME CAM.
Bug identified in CAM ChangeLog (cam5_3_88)
bug fix to ice_nucleate to check qi for non-zero value to prevent divide by zero
Bug identified in CAM ChangeLog in cam5_3_88
CLD_CAL_LIQ, CLD_CAL_ICE, CLD_CAL_UN, CLD_CAL_TMP, CLD_CAL_TMPLIQ, CLD_CAL_TMPICE, and CLD_CAL_TNPUN for COSP1.4 should all be dimensioned the same as the original CALIPSO variable CLD_CAL.
CLD_CAL uses nht_cosp as a vertical size and cosp_ht as the mdim variable. These new output fields are all being saved on the COSP grid - not the CAM grid.
For the land model, I'm developing a solver that uses PETSc. Presently, after a case is created, I manually modify the Macros file and add PETSc relevant information. I noticed that for CISM there already exists a 'USE_TRILINOS' entry in env_build.xml and am wondering if someone from s/w team could provide help in adding 'USE_PETSC' or suggest alternate workflow.
Thanks,
Gautam
The prescribed aerosol code uses the Fortran intrinsic random number generator. There is an algorithm to generate seeds for this random number generator each time it is called. The intention was to get uniformly distributed random numbers, but during our experiments we found that the random numbers are not uniformly distributed (due to incorrect seeding).
To fix this problem we are now using the KISSVEC random number generator, which already exists in CAM. KISSVEC is initialized with 4 predefined seeds and keeps producing uniformly distributed random numbers each time it is invoked. Our tests reveal that using KISSVEC solves the problem we were having with the skewed random number distributions.
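For reference, a KISS-style generator in the spirit of Marsaglia's widely published KISS99 recipe; this is an illustrative Python sketch, not the actual KISSVEC code in CAM:

```python
M32 = 0xFFFFFFFF  # 32-bit mask

def kiss_stream(x, y, z, w):
    """Yield uniform deviates in [0, 1) from four integer seeds."""
    while True:
        # linear congruential component
        x = (69069 * x + 1327217885) & M32
        # 3-shift shift-register component (seed must be nonzero)
        y ^= (y << 13) & M32
        y ^= y >> 17
        y ^= (y << 5) & M32
        # two multiply-with-carry components
        z = (18000 * (z & 0xFFFF) + (z >> 16)) & M32
        w = (30903 * (w & 0xFFFF) + (w >> 16)) & M32
        yield (((x + y + ((z << 16) & M32) + w) & M32)) / 2**32

gen = kiss_stream(123, 362436069, 521288629, 916191069)
sample = [next(gen) for _ in range(3)]  # deterministic for fixed seeds
```

The key property for the fix above is that the state is carried forward between calls, so repeated invocations draw from one well-distributed stream instead of being re-seeded each time.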
I get the following error (I1850CLM45CN compset, pgi compiler on titan):
PGF90-F-0000-Internal compiler error. normalize_forall_array: non-conformable 4449 (/lustre/atlas1/cli112/proj-shared/zdr/models/ACME/models/lnd/clm/src/biogeophys/WaterStateType.F90: 122)
PGF90/x86-64 Linux 14.10-0: compilation aborted
Can others confirm?
A fix for a similar issue is suggested here:
https://wiki.ucar.edu/display/ccsm/Fortran+Compiler+Bug+List
In fact, defining the dummy allocatable array works on my local branch, but I wanted to identify the issue here first before proposing the fix.
All the test cases in acme_developer test suite were failing on PNNL's
Cascade machine.
Commit b078b5d
implements a workaround in which the BFBFLAG is set to TRUE on cascade, allowing (some) test cases to pass. Other tests are still failing.
I tracked down a seg. fault when running with CLM4.5 in a B case. Checking the head of the CESM CLM2 trunk, this bug has already been fixed there. I'll try their fix, but an FYI that the version of CLM that we grabbed has bugs. Not unexpected, but all the same ...
Since the default queue on Cetus (the only queue with unrestricted access) has a time limit of 1 hr, jobs on Cetus are currently launched with a 0 time limit (ignoring the weights generated by scripts). Ideally, jobs with a 0 time limit should get converted to the max time limit (1 hr), but due to a bug in qsub this does not seem to work as expected every time.
The ALCF admins have also requested that we refrain from submitting 0 time limit jobs until the qsub bug is fixed (they are not able to enforce job queuing policies due to the bug). So we need to cap all jobs on Cetus at a time limit of 59m.
Recent experiments with CLM4_5 (both I and B cases) using pgi/14.10 on Titan are failing with "urban net longwave radiation error: no convergence". Identical experiments using the Intel compiler (with all of the recent Intel-specific fixes) do not exhibit this problem.
I tracked this down to computation with NaNs. The field in question is initialized to NaN, but, at first glance, it appears to later be set to a non-NaN value. I am still investigating, but may hand this off to someone if I run out of time.
The NAG compiler fails to compile PIO in every test case in the acme_developer test suite for master (c26945e).
Error messages:
https://gist.github.com/singhbalwinder/cd21494e3bbff14cc5dd
These errors are related to type mismatch between two data types in subroutine calls.
I traced down an out-of-memory error on Titan when using the Intel compiler to a "merge" instruction in init_moc_ts_transport_arrays in diags_on_lat_aux_grid.F90 in ocn/pop2/source. Checking the NCAR ChangeLog, this line has been replaced with a "where":
Tag Creator: mlevy
Developers: jedwards
Tag Date: 3 Nov 2014
Tag Name: pop2/trunk_tags/cesm_pop_2_1_20141103
Tag Summary: Bugfix - memory leak in POP diagnostics (change a "merge" command
to a "where" statement)
Files Modified:
M source/diags_on_lat_aux_grid.F90
I also noticed another bug fix that we might care about:
Tag Creator: andre
Developers: andre
Tag Date: 25 Nov 2014
Tag Name: pop2/trunk_tags/cesm_pop_2_1_20141126
Tag Summary: intel15 debug builds mistakenly treat unallocated memory
on all non-io processors as an error when calling
write_nstd_netcdf, even though only the ioprocesser
accesses the memory.
Files Modified:
M source/diags_on_lat_aux_grid.F90
These are simple, hopefully bit-for-bit, changes. Since they are required for the Intel compiler on Titan, I will put in a pull request when I get the chance, unless someone does so first.
This relates to Jira task PG-71 .
Bill Sacks pointed out the bug and bugfix identified as Bug-2168 by CESM.
Today this builds successfully:
./create_newcase -case F1850C5.ne30 -mach titan -compset F1850C5 -res ne30_g16 -project cli115 -compiler pgi
while this does not
./create_newcase -case F1850.ne30 -mach titan -compset F1850 -res ne30_g16 -project cli115 -compiler pgi
The error message is
PGF90-F-0004-Unable to open MODULE file modal_aero_convproc.mod (/autofs/na3_home1/worley/ACPI/SVN/ACME/master/ACME/models/atm/cam/src/physics/cam/physpkg.F90: 1747)
and, in fact, I do not see a module file called modal_aero_convproc being built. (It is not that it fails to build - it simply has not been built before it is referenced.)
This is CAM4, so perhaps we do not care about it. Atmosphere Group should advise, and also verify on other systems, though this seems to be a build logic issue not a system-specific issue.
Note: Problem was first identified by @bmayerornl (for a T31_g37 FVCAM build). So this issue occurs for multiple dycores and multiple resolutions.
When I run MPAS-Ocean coupled to a data atmosphere, the shortwave forcing is incorrect. The easiest way to see this is to set the ocn/atm coupling period to one day. Then the shortwave to the ocean should be constant over a day, and nearly constant along a latitude line, within cloudiness. But the shortwave is just on a portion of the earth. Also note scale of x2o colorbar, which changes from hour 1 to 12. Here I had the coupler output history files.
Note I used the pull request branch for MPAS-O, but these coupler fields should be the same for POP. However, I did not check that.
atmosphere to coupler, hour 1, a2x_Faxa_swnet
coupler to ocean, hour 1, x2oacc_Foxx_swnet
atmosphere to coupler, hour 12, a2x_Faxa_swnet
coupler to ocean, hour 12, x2oacc_Foxx_swnet
(note color bar scale, and location is the same as hour 1!)
Here is how I made these images. I ran acme on mustang at LANL:
git checkout -b dj_mpaso_addmodel origin/douglasjacobsen/mpas-o/add-model
git submodule update
set ACME_CASE = a04t
create_newcase -case $CASE_ROOT/$ACME_CASE -compset CMPASO-IAF -mach mustang -res T62_mpas120
cd $CASE_ROOT/$ACME_CASE
./cesm_setup
${ACME_CASE}.build
vi env_run.xml
change the following:
<entry id="NCPL_BASE_PERIOD" value="day" />
<entry id="HIST_OPTION" value="nhour" />
<entry id="HIST_N" value="1" />
run job. Then get the coupler history files:
mu-fe1.lanl.gov> ncdump a04t.cpl.hi.0001-01-01-10800.nc -h | grep -i swnet
double a2x_Faxa_swnet(time, a2x_ny, a2x_nx) ;
a2x_Faxa_swnet:_FillValue = 1.e+30 ;
a2x_Faxa_swnet:units = "W m-2" ;
a2x_Faxa_swnet:long_name = "Net shortwave radiation" ;
a2x_Faxa_swnet:standard_name = "surface_net_shortwave_flux" ;
a2x_Faxa_swnet:internal_dname = "a2x_ax" ;
double x2oacc_Foxx_swnet(time, x2oacc_ny, x2oacc_nx) ;
x2oacc_Foxx_swnet:_FillValue = 1.e+30 ;
x2oacc_Foxx_swnet:units = "W m-2" ;
x2oacc_Foxx_swnet:long_name = "Net shortwave radiation" ;
x2oacc_Foxx_swnet:standard_name = "surface_net_shortwave_flux" ;
x2oacc_Foxx_swnet:internal_dname = "x2oacc_ox" ;
extract the needed variables only:
ncks -v a2x_Faxa_swnet,x2oacc_Foxx_swnet,domo_lon,domo_lat a04t.cpl.hi.0001-01-01-10800.nc a04t.cpl.hi.0001-01-01-10800_short.nc
ncks -v a2x_Faxa_swnet,x2oacc_Foxx_swnet,domo_lon,domo_lat a04t.cpl.hi.0001-01-01-43200.nc a04t.cpl.hi.0001-01-01-43200_short.nc
in ferret:
yes? use a04t.cpl.hi.0001-01-01-10800_short.nc
yes? show data
currently SET data sets:
1> ./a04t.cpl.hi.0001-01-01-10800_short.nc (default)
name title I J K L M N
A2X_FAXA_SWNET
Net shortwave radiation 1:192 1:94 ... 1:1 ... ...
DOMO_LAT 1:28574 1:1 ... 1:1 ... ...
DOMO_LON 1:28574 1:1 ... 1:1 ... ...
X2OACC_FOXX_SWNET
Net shortwave radiation 1:28574 1:1 ... 1:1 ... ...
yes? go basemap x=0:360 Y=90s:90n 20
yes? go polymark polygon/over/nolab/key DOMO_LON DOMO_LAT X2OACC_FOXX_SWNET circle 0.25
yes? shade A2X_FAXA_SWNET
The t_stopf() call for ecosysdyn is misplaced in clm_driver.F90, leading to inaccurate performance data for some science cases. This issue was identified and fixed by Pat Worley in pull request #90, which also fixed issue #81. This bug was eliminated in CESM CLM tag clm4_5_1_r097 as a part of ED refactorization and was also not identified as a fixed bug in that tag. ACME has chosen not to incorporate that ED refactorization into the V1 model development, but expects to use it for V2 model development.
If the user does not specify "-sharedlibroot" while creating and running tests (./create_test ... -sharedlibroot ...) on blues the tests fail to build correctly.
CESM BUILDEXE SCRIPT STARTING
COMPILER is intel
The solution is to define CESMSCRATCHROOT in the blues configuration.
There is a minor bug in namelist_definition.xml. In the xml entry below, trop_mam4 should be listed as a valid value (but it is currently missing). I have confirmed this with @singhbalwinder , who added the MAM4 code in ACME.
Name of the CAM chemistry package. N.B. this variable may not be set by
the user. It is set by build-namelist via information in the configure
cache file to be consistent with how CAM was built.
Default: set by build-namelist
This bug was reported to CSEG and fixed in clm4_5_1_r109.
Additional information about this bug is available at
http://bugs.cgd.ucar.edu/show_bug.cgi?id=2183
The CLM compset I_2000_CLM45_CN_CROP (resolution f19_g16) fails when writing a history file. Based on the log file, the error may be due to faulty reading of one of the initialization files. The errors include "NetCDF: Invalid dimension ID or name" and "NetCDF: Variable not found". This is not a supported configuration of the model, so the datasets may not have been updated.
Email from Pat Worley, October 12 2014:
The Cray compiler is complaining about an expression in ecosys_mod.F90 in the version of POP2 in master (and in the aces4bgc branch). I guess that this is working correctly for the compilers that aren't complaining, but the Cray compiler does have a point:
" A subscript must be a scalar integer expression."
WORK1 = DTRACER_MODULE(:,:,dic_ind) + DTRACER_MODULE(:,:,doc_ind) &
+ DTRACER_MODULE(:,:,zooC_ind) &
+ sum(DTRACER_MODULE(:,:,autotrophs(:)%C_ind), dim=3)
Here C_ind (in the third line) is declared to be a floating point value.
This has been fixed in the POP2 trunk on the CESM repository. The fix involves a significantly rewritten ecosys_params.F90 and the introduction of a new file, ecosys_share.F90. Something simple may be sufficient, but I wanted to point this out.
Were these modifications in collaboration with LANL? Do we care about this?
Note also that the Cray compiler does not like some of the (legacy) vectorization directives in vmix_kpp.F90. Some are not recognized (just warnings) and some are but with different expectations, resulting in errors.
FYI.
Pat
Some ACME developer tests on Mira currently use 512 nodes and the walltimes for the tests are between 1.5 hrs and 6 hrs. On Cetus the tests use the same number of nodes but the walltime is capped at 1 hr. We need to modify the scripts (config_pes.xml?) so that the developer tests run on a smaller set of nodes (can we bunch up the whole test suite into a single job with 512 nodes?) and request realistic times.
[jayesh@miralac1 scripts (master=)]$ qstat -u jayesh
JobID User WallTime Nodes State Location
428838 jayesh 03:00:00 512 queued None
428849 jayesh 03:00:00 512 queued None
428852 jayesh 01:30:00 512 running MIR-44800-77B31-512
428853 jayesh 01:30:00 512 running MIR-04880-37BB1-512
428854 jayesh 01:30:00 512 running MIR-04800-37B31-512
428855 jayesh 01:30:00 512 queued None
428856 jayesh 01:30:00 512 queued None
428859 jayesh 06:00:00 1 queued None
428860 jayesh 01:30:00 512 queued None
428861 jayesh 06:00:00 1 queued None
428863 jayesh 01:30:00 512 queued None
In order to guarantee collection of performance data for production runs, the env_run.xml parameter SAVE_TIMING is set to TRUE by default. This is unnecessary for functionality testing, and it fills up the performance data archive with provenance and performance data that will never be looked at. This is only an issue on Edison, Hopper, Mira, and Titan at the moment, but the intent is to add the performance data archiving capability to the development systems as well.
I propose adding an xmlchange to the test scripts to set SAVE_TIMING to FALSE except for tests meant to examine performance. I've put this into the issue list because I'd prefer that the keepers of the testing scripts weigh in on this and also implement the change(s).
Note that I DO NOT want the default changed to FALSE, as this defeats the goal of making performance data collection the norm for ACME runs.
When running compset ICLM45BGC, ACME built with the Intel compiler on Titan dies with a seg. fault in VOCEmissionMod.F90 (line 199). Checking more recent tags of CLM I found the following:
Tag name: clm4_5_1_r091
Originator(s): muszala (Stefan Muszala)
Date: Mon Oct 27 09:48:56 MDT 2014
One-line Summary: update externals. fix bug so CLM runs with Intel 14x.
Purpose of changes: Update externals. Fix bug in VOCEmissionMod.F90 that prevented
CLM from running with Intel 14x on yellowstone. Bring in workaround for bug 1730 from
Sacks.
-- remove duplicate assignment of 0_r8 to meg_out(imeg)%flux_out
M models/lnd/clm/src/biogeochem/VOCEmissionMod.F90
After making this change (commenting out the offending line), this ICLM45BGC case ran successfully. This change should be backported into our version of CLM4.5.
Note also the reference to another bug workaround:
-- Sacks' workaround for bug 1730
M models/lnd/clm/src/main/histFileMod.F90
M models/lnd/clm/src/main/ncdio_pio.F90
M models/lnd/clm/src/main/ncdio_pio.F90.in
I do not know what bug 1730 is (could not find it, since it has since been closed?), nor whether it is relevant to us.
(added to issue list at @jnjohnsonlbl 's request):
The version of env_mach_specific in master for Edison (which is identical to the most recent one from NCAR) is generating some warning messages. They seem innocuous, but we might want to look into whether this needs to be updated:
during setup, build, and submit:
Unloading of cray-mpich module was not able to restore CRAY_LD_LIBRARY_PATH.
The following will resolve this issue:
'setenv CRAY_LD_LIBRARY_PATH /opt/cray/libsci/13.0.1/INTEL/140/x86_64/lib'
when job starts running:
cray-mpich/7.0.4(30):ERROR:150: Module 'cray-mpich/7.0.4' conflicts with the currently loaded module(s) 'cray-mpich/6.3.1'
cray-mpich/7.0.4(30):ERROR:102: Tcl command execution failed: conflict cray-mpich
At some point the COSP Make logic (in the Makefile in the Machines directory) was updated in ACME, and it is incompatible with the rest of the model.
Originator: cacraig
Date: Jul 17, 2014
Model: Machines
Version: Machines_140717
One-line: CAM's abortutils is now cam_abortutils
M Makefile
- cosp now depends on cam_abortutils
Tag name: cam5_3_44
Originator(s): cacraig
Date: 07/16/14
One-line Summary: Update externals to cesm1_3_beta11
Purpose of changes:
Tag name: cam5_3_36
Originator(s): cacraig, goldy
Date: May 5, 2014
So we do not have cam_abortutils, and the current build logic for COSP requires it (the build fails without it). It may be enough to modify the Makefile in scripts/ccsm_utils/Machines. I'll give this a try, but it may be better to update to a slightly more recent version of CAM (or import just this change), especially since the ChangeLog message indicates that this is required when using more recent versions of CLM?
Hi,
I'm having trouble cloning this git repository on rhea:
zender@rhea-login1g:~$ git clone https://github.com/ACME-Climate/DiagnosticsWorkflow.git
Initialized empty Git repository in /autofs/nccs-svm1_home1/zender/DiagnosticsWorkflow/.git/
error: The requested URL returned error: 403 Forbidden while accessing https://github.com/ACME-Climate/DiagnosticsWorkflow.git/info/refs
fatal: HTTP request failed
This same cloning command works fine on my home machine, but not on rhea.
Any help appreciated. Thanks!
Charlie
The c2l subroutines are used for averaging variables defined at the 'column' level up to the 'landunit' level. The subroutines c2l_1d and c2l_2d incorrectly use 'pft' level quantities. This bug was identified by NCAR as bug 2077 and fixed in the clm4_5_r095 tag.
In the check_exactrestart.pl script, which is used to compare two files and verify exact restartability between them, there is the potential for an infinite loop.
The loop that causes this issue is:
https://github.com/ACME-Climate/ACME/blob/master/scripts/ccsm_utils/Tools/check_exactrestart.pl#L112
When the second file passed into the script does not contain the comm_diag token, this loop becomes infinite and burns all of the wallclock time for a test without generating a pass (the test remains in the RUN state).
The comm_diag token is missing from a file when the run fails. So, if the restart run fails while performing an ERS test, this script causes an infinite loop.
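The script itself is Perl; this Python sketch (with a hypothetical find_token helper) only illustrates the missing guard: a token search over a file must treat end-of-file as a failure condition rather than looping forever.

```python
def find_token(path, token="comm_diag"):
    """Return the first line containing token, or None if it is absent."""
    with open(path) as f:
        for line in f:          # iterating the file handle stops at EOF
            if token in line:
                return line.rstrip("\n")
    return None                  # token absent (e.g. failed run): report it
```

With this shape, a failed restart run whose log lacks comm_diag produces an immediate comparison failure instead of leaving the test stuck in RUN.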
In an ne30 B1850C5 test case, the run will die on Titan with a "recursive I/O call" error message when using an executable built with the Intel compiler. I tracked this down to a call to shr_sys_flush in physics_update in physics_types.F90 in atm/cam/src/physics/cam . Commenting out this line eliminates the error, and this one test run then completes normally. I do not know if this is indicative of a deeper issue, however. I will try to verify with developer's test suite once Titan returns from its OS upgrade.
This is related to Jira task PG-71.
Regional output is selected via the fincllonlat namelist entries in the ACME atmosphere model. However, this entry is not currently supported for the target SE dycore. This bug was reported by @danielerosa.
A bug for MG1 was reported in cam5_3_91 @ NCAR
There is a bug in micro_mg_cam.F90 when calling cnst_add incorrectly with is_convtran1=.false.
The bugfix is answer-changing.
JIRA task
https://acme-climate.atlassian.net/browse/AG-300
Implemented the bugfix for f27fac7 (where the bug was introduced) on github branch
kaizhangpnl/atm/bugfix_MG1_introb
and for the current master on branch
kaizhangpnl/atm/bugfix_MG1
The first branch is what Jin-Ho recommended (#207 (comment)). However, that page also says "Bugfix-doc: Unlike new feature development, bug fixes are typically never started from the HEAD of master". So it seems that we have an unusual case, and I am not sure whether it still makes sense to follow that path. Anyway, both options are provided to the integrator.
The 5-day simulation before and after the bugfix (with kaizhangpnl/atm/bugfix_MG1) can be found on cascade:
/dtemp/zhan524/csmruns/ACME_master_bugfix_mg1_nag (current master)
/dtemp/zhan524/csmruns/ACME_master_bugfix_mg1_nag_fix (bugfixed master)
Added namelist variable "apply_mg1_bugfix" to switch on/off the bugfix
modified: models/atm/cam/bld/namelist_files/namelist_defaults_cam.xml
modified: models/atm/cam/bld/namelist_files/namelist_definition.xml
modified: models/atm/cam/src/physics/cam/micro_mg_cam.F90
modified: models/atm/cam/src/physics/cam/phys_control.F90
When hist_dov2xy=.false. is specified in user_nl_clm, the code crashes. This was reported as bug-1730 and fixed in clm4_5_1_r091 by NCAR.
@bbye reported that the ACME code crashes for the following case:
create_newcase -case cropharv_mtest -mach edison -res f19_g16 -compset I_2000_CLM45_CN_CROP -project ccsm1
When running ERS.f45_g37.B1850C5 on Titan using the Intel compiler, the job dies with a malloc assertion error in PIO. I found a comment on DiscussCESM that CESM requires a modification to the FV version of spmd_dyn.F90 in order to work with the 14 series of the Intel compiler (which is what is available on Titan).
Using this updated version (or compiling the current version with -O1) eliminates the problem.
I'll check in this update when I get the chance; however, this is only necessary because our developer test suite includes FV, which we will never be using in production. It seems a waste of time to have to debug FV in order to get the test suite to work.
The following prebeta tests fail on Mira (in addition to expected prebeta test failures) after the CLM 4.5 merge (master #d9d6f4c),
ERS_Ld7.f09_g16.B1850CNCHM.mira_ibm (RUN - still fails, as of Feb 17, 2015, #c5fa8d4)
ERS_IOP_Ld3.f19_f19.F1850PDC5.mira_ibm (SFAIL - still fails, as of Feb 17, 2015, #c5fa8d4)
These tests have been fixed by the changes discussed below,
ERI.ne30_g16.B1850C5CN.mira_ibm
PFS.ne30_g16.B1850C5CN.mira_ibm
PFS.ne30_g16.B1850C5L45BGC.mira_ibm
PFS.ne30_ne30.F1850C5L45BGC.mira_ibm
SMS_D.ne30_g16.FCN.mira_ibm
The bug was identified by NCAR as Bug 2090 and was fixed in drvseq5_1_02.
A build of a B1850C5CN_ne30_g16 case on cetus fails on next (#607557d). It looks like the variable emis_scale was defined twice (only compiled conditionally). The error message is shown below:
Case:
./create_newcase -case B1850C5CN_ne30_g16_test_build01 -compset B1850C5CN -res ne30_g16 -mach cetus
Build Error:
mpixlf2003_r ... /gpfs/mira-home/jayesh/acme/ACME_merge02/models/atm/cam/src/chemistry/modal_aero/seasalt_model.F90
"/gpfs/mira-home/jayesh/acme/ACME_merge02/models/atm/cam/src/chemistry/modal_aero/seasalt_model.F90", line 85.28: 1514-004 (S) Name given for constant with PARAMETER attribute was defined elsewhere with conflicting attributes. Name is ignored.
"/gpfs/mira-home/jayesh/acme/ACME_merge02/models/atm/cam/src/chemistry/modal_aero/seasalt_model.F90", line 85.28: 1514-071 (W) Identifier emis_scale was previously defined with same type.
** seasalt_model === End of Compilation 1 ===
1501-511 Compilation failed for file seasalt_model.F90.
Building with the latest master with clm4_5 fails with the error:
"./lnd/clm/src/main/subgridAveMod.F90", line 790.36: 1516-036 (S) Entity co has undefined type.
The line in question is
sumwt(l) = sumwt(l) + co%wtlunit(c)
It appears that this should be
sumwt(l) = sumwt(l) + col%wtlunit(c)
A land developer should verify (and submit a fix).
The version of cray-netcdf specified in env_mach_specific.titan for mpi-serial needs to be updated. It is currently 4.3.0, which no longer exists. The default (and most recent) is now 4.3.2, which is already being used for the MPI branch.
Using the following command to create and run the acme_developer tests on a Mac produced a build problem in a test case:
cd ACME/scripts
export NETCDF=/usr/local
./create_test -testid t01 -xml_category acme_developer -xml_mach mac -xml_compiler gnu
The following link error occurs during the build of the "glimmer-cism" library:
mpif90 -o /Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/cesm.exe ccsm_comp_mod.o ccsm_driver.o component_mod.o component_type_mod.o cpl_comp_esmf.o cplcomp_exchange_mod.o prep_aoflux_mod.o prep_atm_mod.o prep_glc_mod.o prep_ice_mod.o prep_lnd_mod.o prep_ocn_mod.o prep_rof_mod.o prep_wav_mod.o seq_diag_mct.o seq_domain_mct.o seq_flux_mct.o seq_frac_mct.o seq_hist_mod.o seq_io_mod.o seq_map_esmf.o seq_map_mod.o seq_map_type_mod.o seq_rest_mod.o t_driver_timers_mod.o -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -latm -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -lice -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -llnd -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -locn -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -lrof -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -lglc -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/lib/ -lwav -L/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/glc/lib/ -lglimmercismfortran -L/Users/jeff/projects/acme/scratch/sharedlibroot.t01/gnu/openmpi/nodebug/nothreads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share -lcsm_share -L/Users/jeff/projects/acme/scratch/sharedlibroot.t01/gnu/openmpi/nodebug/nothreads/lib -lpio -lgptl -lmct -lmpeu -L/usr/local/lib -lnetcdff -lnetcdf -framework Accelerate -all_load
duplicate symbol _main in:
ccsm_driver.o
/Users/jeff/projects/acme/scratch/ERS_Ly21.f09_g16.TG.mac_gnu.t01/bld/glc/lib//libglimmercismfortran.a(dlapqc.f.o)
ld: 1 duplicate symbol for architecture x86_64
collect2: error: ld returned 1 exit status
This is because dlapqc.f, a source file in the SLAP library bundled with libglimmer-solve, contains a Fortran program, not library functions. The fix is to exclude this source from the library in models/glc/cism/glimmer-cism/CMakeLists.txt. I will submit a pull request to make this change.
Alternatively, an interested party on the Land-Ice team could add the following line to the list of removed Fortran sources in models/glc/cism/glimmer-cism/CMakeLists.txt (around line 263 in the master branch):
${GLIMMER_SOURCE_DIR}/libglimmer-solve/SLAP/dlapqc.f
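In CMake terms, the exclusion could look like the following sketch. The list variable name FORTRAN_SOURCE_FILES is illustrative only; the actual variable holding the library sources in glimmer-cism's CMakeLists.txt should be used.

```cmake
# Sketch: drop the standalone SLAP test program from the library
# sources so its PROGRAM main does not collide with ccsm_driver's.
# (FORTRAN_SOURCE_FILES is a placeholder for the real list name.)
list(REMOVE_ITEM FORTRAN_SOURCE_FILES
     ${GLIMMER_SOURCE_DIR}/libglimmer-solve/SLAP/dlapqc.f)
```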
Issue discovered by Qi Tang. In CAM, all allowable namelist variables should appear in the atm_in namelist with their default values (or with non-default values if the user specifies them in "user_nl_cam"). By default, "zmconv_tau" is missing. If not changed by the user, it still uses its correct default value, 3600 s, but it does not appear in the namelist.
This is a minor issue: to determine the value of zmconv_tau used by a simulation, one currently must check the log files, since it won't appear in atm_in unless the user set it explicitly for that simulation.
Fix: add an "add_default()" entry for zmconv_tau in cam/bld/build_namelist, and add a default value of 3600d0 in cam/bld/namelist_files/namelist_defaults.xml.
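A sketch of the two edits described above; the exact formatting should follow the neighboring entries in each file.

```
# In cam/bld/build_namelist (Perl), alongside the other ZM entries:
add_default($nl, 'zmconv_tau');

# In cam/bld/namelist_files/namelist_defaults.xml:
<zmconv_tau>3600d0</zmconv_tau>
```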
One caveat: will reading "3600d0" from the namelist be BFB with the current default of 3600._r8 set in the Fortran? If they differ at machine precision, our regression suite will need rebaselining, and hence this commit should be isolated and not combined with other changes. Hopefully this is not a problem, since 3600 can be represented exactly in IEEE floating point.
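The exact-representability claim is easy to verify: 3600 is an integer far below 2**53, so an IEEE 754 double holds it exactly, and any correct parser of "3600d0" must produce the same bits as the Fortran literal 3600._r8. A quick Python check:

```python
import struct

# 3600 is an integer well below 2**53, so it is exactly
# representable as an IEEE 754 double: round-tripping through the
# 8-byte binary encoding loses nothing.
bits = struct.pack('<d', 3600.0)
assert struct.unpack('<d', bits)[0] == 3600.0

# The value has no fractional part, confirming no rounding occurred.
assert 3600.0.is_integer()
assert 3600 < 2**53
```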
[AG-189] Minghua Zhang (SoMAS, Stony Brook University) found a small problem in the calculation of solar insolation in CAM, which he presented at the AMWG meeting. The issue is that the insolation is calculated at the beginning of a time step and held constant over the length of a radiation time step. This produces a small but discernible variation in the solar constant from one latitude and longitude to another. The correct method is to average the insolation over the time step. [reported by @philrasch ]
Here is the image showing the difference:
Figure caption: Annual-mean FSDT, FSNTC, FSNT, FSNSC, FSNS for (left column) a 1-hour radiation time step with the revised algorithm, (middle column) the original algorithm minus the revised algorithm for a 3-hour radiation time step, (right column) the original algorithm minus the revised algorithm for a 1-hour radiation time step. Units: W/m2
File affected: shr_orb_mod.F90 (function shr_orb_cosz)
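The effect can be illustrated with a toy model (not the E3SM code): the cosine of the solar zenith angle at the start of a radiation step differs from its average over the step, and the gap grows with step length. A minimal Python sketch, assuming for simplicity a fixed solar declination of zero (equinox conditions); all names here are illustrative.

```python
import math

def cosz(t_hours, lat_rad, decl_rad=0.0):
    """Cosine of solar zenith angle in a simple equinox model
    (declination fixed at 0 by default); t_hours is local solar time."""
    hour_angle = math.pi * (t_hours - 12.0) / 12.0
    return max(0.0, math.sin(lat_rad) * math.sin(decl_rad)
               + math.cos(lat_rad) * math.cos(decl_rad) * math.cos(hour_angle))

def avg_cosz(t0, dt, lat_rad, n=100):
    """Average cosz over [t0, t0 + dt] by midpoint sampling."""
    return sum(cosz(t0 + (i + 0.5) * dt / n, lat_rad) for i in range(n)) / n

lat = math.radians(45.0)
point = cosz(9.0, lat)           # start-of-step value, held fixed by the old scheme
mean = avg_cosz(9.0, 3.0, lat)   # average over a 3-hour radiation step
# Holding the start-of-step value constant (point = 0.5 here) biases the
# insolation low relative to the step mean (about 0.64), which is the
# kind of error the revised time-averaging algorithm removes.
```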