icebergs's Issues

icebergs cannot restart from distributed restart files

After commit 0348755, which fixed the issue with a large number of bergs (PR #51), the icebergs cannot restart from distributed restart files; such an attempt causes the following crash:

Rank 64 [Mon Jun 19 11:01:23 2017] [c0-0c2s14n1] Fatal error in MPI_Allreduce: Message truncated, error stack:
MPI_Allreduce(1007).................: MPI_Allreduce(sbuf=0x7ffffffe70f0, rbuf=0x7ffffffe7070, count=1, dtype=0x4c000430, MPI_SUM, comm=0x84000004) failed
MPIR_Allreduce_impl(850)............:
MPIR_CRAY_Allreduce(346)............:
MPIR_Allreduce_intra(485)...........:
MPIC_Sendrecv(533)..................:
MPIDI_CH3U_Request_unpack_uebuf(618): Message truncated; 8 bytes received but buffer size is 4
MPIR_Allreduce_intra(485)...........:
MPIDI_CH3U_Receive_data_found(144)..: Message from rank 0 and tag 14 truncated; 8 bytes received but buffer size is 4

forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
fms_MOM6_SIS2_com  000000000180C124  mpp_mod_mp_mpp_su          21  mpp_sum_mpi.h
fms_MOM6_SIS2_com  000000000069F429  ice_bergs_framewo        4678  icebergs_framework.F90
fms_MOM6_SIS2_com  00000000006BE71C  ice_bergs_io_mp_r         715  icebergs_io.F90
fms_MOM6_SIS2_com  000000000067B58F  ice_bergs_mp_iceb         132  icebergs.F90
fms_MOM6_SIS2_com  000000000049B534  ice_model_mod_mp_        2462  ice_model.F90
fms_MOM6_SIS2_com  000000000040BA51  coupler_main_IP_c        1668  coupler_main.F90

This can be reproduced by trying to restart the experiment ice_ocean_SIS2/SIS2_icebergs with a SIS_layout other than the default 1,1.
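
The "Message truncated; 8 bytes received but buffer size is 4" text is the signature of PEs calling the reduction with mismatched data sizes, which would be consistent with some PEs handing mpp_sum a default 4-byte integer while others hand it an 8-byte one. A minimal standalone sketch of that failure class (illustrative kinds only, not the actual mpp_sum code; whether it aborts with exactly this message depends on the MPI library) is:

    program allreduce_mismatch_demo
      ! Hedged sketch: rank 0 reduces an 8-byte integer while every other rank
      ! reduces a 4-byte one, so the receive buffers disagree in size and the
      ! MPI library reports a truncated message, as in the traceback above.
      use mpi
      implicit none
      integer :: ierr, rank
      integer(kind=8) :: s8(1), r8(1)
      integer(kind=4) :: s4(1), r4(1)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      if (rank == 0) then
        s8(1) = 1_8
        call MPI_Allreduce(s8, r8, 1, MPI_INTEGER8, MPI_SUM, MPI_COMM_WORLD, ierr)
      else
        s4(1) = 1_4
        call MPI_Allreduce(s4, r4, 1, MPI_INTEGER4, MPI_SUM, MPI_COMM_WORLD, ierr)
      end if
      call MPI_Finalize(ierr)
    end program allreduce_mismatch_demo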

model crashes when trying to restart from non-combined icebergs restart files

Here's what happens when I try to restart icebergs without combining the restart files (experiment MOM6_GOLD_SIS2_bergs):

Rank 16 [Wed Dec 30 14:49:00 2015] [c0-0c0s1n2] Fatal error in MPI_Allreduce: Message truncated, error stack:
MPI_Allreduce(888)......................: MPI_Allreduce(sbuf=0x7ffffffe8dc4, rbuf=0x7ffffffe7e60, count=1, dtype=0x4c000430, MPI_MAX, comm=0x84000002) failed
MPIR_Allreduce_impl(739)................: 
MPIR_Allreduce_intra(223)...............: 
MPIR_Bcast_impl(1320)...................: 
MPIR_Bcast_intra(1154)..................: 
MPIR_Bcast_binomial(148)................: 
MPIDI_CH3_PktHandler_EagerShortSend(350): Message from rank 0 and tag 2 truncated; 8 bytes received but buffer size is 4
Internal Error: invalid error code 60e50e (Ring ids do not match) in MPIR_Allreduce_impl:739
Rank 32 [Wed Dec 30 14:49:00 2015] [c0-0c0s1n3] Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(888)......: MPI_Allreduce(sbuf=0x7ffffffe8dc4, rbuf=0x7ffffffe7e60, count=1, dtype=0x4c000430, MPI_MAX, comm=0x84000004) failed
MPIR_Allreduce_impl(739): 
Rank 48 [Wed Dec 30 14:49:00 2015] [c0-0c0s1n3] Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(888).......: MPI_Allreduce(sbuf=0x7ffffffe6530, rbuf=0x7ffffffe6320, count=1, dtype=0x4c000831, MPI_SUM, comm=0x84000002) failed
MPIR_Allreduce_impl(739).: 
MPIR_Allreduce_intra(223): 
MPIR_Bcast_impl(1320)....: 
MPIR_Bcast_intra(1154)...: 
MPIR_Bcast_binomial(157).: message sizes do not match across processes in the collective
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source             
fms_MOM6_SIS2_com  00000000060F7051  mpp_mod_mp_mpp_su          21  mpp_sum_mpi.h
fms_MOM6_SIS2_com  000000000611A34D  mpp_mod_mp_mpp_su          15  mpp_sum.inc
fms_MOM6_SIS2_com  00000000060FBA06  mpp_mod_mp_mpp_ch          16  mpp_chksum_int.h
fms_MOM6_SIS2_com  0000000006107BF0  mpp_mod_mp_mpp_ch          13  mpp_chksum.h
fms_MOM6_SIS2_com  0000000005E18EBB  mpp_io_mod_mp_mpp          85  mpp_read_compressed.h
fms_MOM6_SIS2_com  0000000004F1F7BE  fms_io_mod_mp_rea        5153  fms_io.F90
fms_MOM6_SIS2_com  0000000004F1DD04  fms_io_mod_mp_rea        5111  fms_io.F90
fms_MOM6_SIS2_com  0000000004DD9E0C  ice_bergs_io_mp_r         693  icebergs_io.F90
fms_MOM6_SIS2_com  0000000004BF7CFD  ice_bergs_mp_iceb         103  icebergs.F90
fms_MOM6_SIS2_com  0000000000981922  ice_model_mod_mp_        4085  ice_model.F90
fms_MOM6_SIS2_com  0000000000416B65  coupler_main_IP_c        1540  coupler_main.F90
fms_MOM6_SIS2_com  0000000000400B4B  MAIN__                    520  coupler_main.F90

other pes at:

fms_MOM6_SIS2_com  00000000060F47DE  mpp_mod_mp_mpp_ma          14  mpp_reduce_mpi.h
fms_MOM6_SIS2_com  0000000004DAE124  ice_bergs_framewo        2744  icebergs_framework.F90
fms_MOM6_SIS2_com  0000000004BF7D29  ice_bergs_mp_iceb         105  icebergs.F90
fms_MOM6_SIS2_com  0000000000981922  ice_model_mod_mp_        4085  ice_model.F90
fms_MOM6_SIS2_com  0000000000416B65  coupler_main_IP_c        1540  coupler_main.F90
fms_MOM6_SIS2_com  0000000000400B4B  MAIN__                    520  coupler_main.F90

I think the traceback is trying to say there is a race condition somewhere.

When I comment out the following line in shared/mpp/include/mpp_read_compressed.h, so that compute_chksum stays false and the code does not go through the checksum routines, the crash goes away:

if (ANY(field%checksum /= default_field%checksum) ) compute_chksum = .TRUE.

workdir: /lustre/f1/Niki.Zadeh/work/ulm_201510_mom6_2015.12.17/MOM6_GOLD_SIS2_bergs_2x0m1d_64pe.o1451489785

icebergs cannot restart from distributed restart files

When we do not combine the iceberg restart files at the end of the first day of a 2x1-day restart experiment, the model crashes with:

FATAL from PE     0: diamonds, read_restart_bergs: Iceberg copied twice

forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
fms_MOM6_SIS2_com  0000000001FEF3E5  Unknown               Unknown  Unknown
fms_MOM6_SIS2_com  0000000001A54B06  mpp_mod_mp_mpp_er          50  mpp_util_mpi.inc
fms_MOM6_SIS2_com  00000000006B3E4D  ice_bergs_io_mp_r         875  icebergs_io.F90
fms_MOM6_SIS2_com  0000000000677586  ice_bergs_mp_iceb         116  icebergs.F90
fms_MOM6_SIS2_com  00000000004BE6DD  ice_model_mod_mp_        2316  ice_model.F90
fms_MOM6_SIS2_com  0000000000408FA5  coupler_main_IP_c        1574  coupler_main.F90
fms_MOM6_SIS2_com  0000000000401351  MAIN__                    525  coupler_main.F90

This used to work both before and after the new IO was implemented by Jeff.

new icebergs io crashes in debug mode

I tried to run in debug mode (a model that runs fine in prod mode) and it crashes every time with the following traceback:

forrtl: severe (408): fort: (2): Subscript #1 of the array SBUF has value 1 which is greater than the upper bound of 0

Image              PC                Routine            Line        Source             
fms_MOM6_SIS2_com  0000000005CD3C2B  mpp_mod_mp_mpp_ga          77  mpp_gather.h
fms_MOM6_SIS2_com  00000000059D7718  mpp_io_mod_mp_mpp          30  mpp_write_unlimited_axis.h
fms_MOM6_SIS2_com  0000000004A7BB5A  fms_io_mod_mp_sav        2516  fms_io.F90
fms_MOM6_SIS2_com  0000000004A64FE3  fms_io_mod_mp_sav        2114  fms_io.F90
fms_MOM6_SIS2_com  0000000003A72342  ice_bergs_io_mp_w         211  icebergs_io.F90
fms_MOM6_SIS2_com  0000000003999418  ice_bergs_mp_iceb        2152  icebergs.F90
fms_MOM6_SIS2_com  00000000033A26B2  ice_type_mod_mp_i         967  ice_type.F90
fms_MOM6_SIS2_com  0000000003202118  ice_model_mod_mp_        4071  ice_model.F90
fms_MOM6_SIS2_com  000000000040B603  coupler_main_IP_c        1662  coupler_main.F90
fms_MOM6_SIS2_com  0000000000404CEF  MAIN__                    887  coupler_main.F90
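
The upper bound of 0 points at a PE that owns no icebergs handing a zero-length send buffer to mpp_gather and then indexing it anyway. A tiny standalone sketch of that error class (hypothetical, not the actual mpp_gather code) trips the same runtime check when compiled with bounds checking (e.g. -check bounds):

    program empty_sbuf_demo
      implicit none
      real, allocatable :: sbuf(:)
      allocate(sbuf(0))   ! a PE with zero bergs ends up with a zero-length buffer
      ! Unconditionally touching element 1 of a zero-length array triggers
      ! "Subscript #1 of the array SBUF has value 1 which is greater than the
      ! upper bound of 0" under bounds checking.
      print *, sbuf(1)
    end program empty_sbuf_demo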

new icebergs io produces a restart file even for io tiles that have no icebergs

We get an icebergs.res.nc* restart file for every io tile, including the ones that have no icebergs.
Hence some restart files have
i = UNLIMITED ; // (0 currently)
and the general method of ncrcat-ing them together at the end of the run (in the postprocess section of the xml) crashes:

gaea6: /lustre/f1/Niki.Zadeh/work/ulm_accumGTfluxes_mom6_2015.05.27/OM4_SIS2_baseline_1x0m10d_2997pe.o7085905/RESTART % ncrcat icebergs.res.nc.0000 icebergs.res.nc.0001 icebergs.res.nc
ncrcat: ERROR nco_lmt_sct_mk() reports record variable exists and is size zero, i.e., has no records yet.
ncrcat: HINT: Perform record-oriented operations only after file has valid records.
ncrcat: cnt < 0 in nco_lmt_sct_mk()
/bin/sh: line 1: 17130 Segmentation fault      $b_op $b_argv 2> $b_logfile.out

We need to either:

  1. Avoid writing a restart file on an empty tile (a sketch follows this list), or
  2. Write a combine tool for iceberg restarts that can handle empty files.
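
A minimal sketch of option 1, using hypothetical names (the real write path is in icebergs_io.F90 and fms_io's save_restart interface differs):

    subroutine maybe_write_berg_restart(nbergs_on_io_tile)
      ! Hedged sketch: skip the restart write on io tiles that hold no bergs,
      ! so no zero-record icebergs.res.nc.* file is produced.
      implicit none
      integer, intent(in) :: nbergs_on_io_tile
      if (nbergs_on_io_tile == 0) return
      ! ... the existing save_restart call from icebergs_io.F90 would go here ...
    end subroutine maybe_write_berg_restart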

We have to fix this; otherwise we cannot run models with the new icebergs:

/lustre/f1/Niki.Zadeh/ulm_accumGTfluxes_mom6_2015.05.27_0/OM4_SIS2_baseline/ncrc2.intel-prod/stdout/run/OM4_SIS2_baseline_1x0m10d_2997pe.o7085905

CM4 highres crash in icebergs_framework with forrtl: severe (41): insufficient virtual memory

CM4 highres model crashes on gaea with the following message and traceback on day 19 of the model run (probably because the model is trying to produce the first berg).

forrtl: severe (41): insufficient virtual memory
Image              PC                Routine            Line        Source
fms_cm4_sis2_comp  00000000055555DE  Unknown               Unknown  Unknown
fms_cm4_sis2_comp  0000000000E5D225  ice_bergs_framewo        2119  icebergs_framework.F90
fms_cm4_sis2_comp  0000000000E9E87A  ice_bergs_io_mp_w        1296  icebergs_io.F90
fms_cm4_sis2_comp  0000000000E19F8E  ice_bergs_mp_iceb        3371  icebergs.F90
fms_cm4_sis2_comp  00000000004C305F  sis_dyn_trans_mp_         285  SIS_dyn_trans.F90
fms_cm4_sis2_comp  000000000046FF0E  ice_model_mod_mp_         232  ice_model.F90
fms_cm4_sis2_comp  000000000046FB8F  ice_model_mod_mp_         157  ice_model.F90
fms_cm4_sis2_comp  0000000000403FB5  MAIN__                   1034  coupler_main.F90

Line 2119 is an allocate statement

    allocate(new%data(width,new_size))

I put a print statement before it, which shows width=6, new_size=1080275147!
Clearly new_size is getting a garbage value, which crashes the run.
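
For scale, assuming new%data holds 8-byte reals (an assumption), that single allocate requests roughly 48 GiB on one PE, which by itself explains the insufficient-virtual-memory abort:

    program alloc_size_check
      implicit none
      integer(kind=8), parameter :: width = 6_8, new_size = 1080275147_8
      integer(kind=8), parameter :: bytes_per_element = 8_8   ! assumption: 8-byte reals
      ! 6 * 1080275147 * 8 bytes is about 48 GiB for a single allocate.
      print '(a,f7.1,a)', 'requested allocation: ', &
            real(width*new_size*bytes_per_element)/2.0**30, ' GiB'
    end program alloc_size_check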

This is happening on gaea. The model ran with the same executable for many years, but for some reason it started to crash like this a few weeks prior to the CLE7 update.
It still runs fine on Orion.

Model run hangs at the end when trying to write icebergs trajectories

A 1-year-long run of an ocean-ice model hangs at the end of the run. stdout indicates that the last thing the model was doing was writing the iceberg trajectories:

diamonds, bergs_chksum: write_restart berg chksum=             583222154 chksum2=             706327505 chksum3=           -1682259823 chksum4=           -1682259823 chksum5=            -865960067 #=                 15170
diamonds, grd_chksum2:    # of bergs/cell chksum=                     0 chksum2=                     0 min= 0.000000000E+00 max= 0.000000000E+00 mean= 0.000000000E+00 rms= 0.000000000E+00 sd= 0.000000000E+00
diamonds, grd_chksum3:   write stored_ice chksum=             -81009900 chksum2=             552945600 min= 0.000000000E+00 max= 7.399908186E+11 mean= 1.057744475E+11 rms= 1.701813200E+11 sd= 1.333170955E+11
diamonds, grd_chksum2:  write stored_heat chksum=                     0 chksum2=                     0 min= 0.000000000E+00 max= 0.000000000E+00 mean= 0.000000000E+00 rms= 0.000000000E+00 sd= 0.000000000E+00

The above was on 128 cores interactive on gaea.

Note that this happens for the longer runs that have a lot of icebergs at the end (this one has 15170). Since the trajectories stay in memory and get written to file at the end of the run, this may indicate an issue with the I/O buffer.
Is there a way to increase that buffer?

Ever growing iceberg trajectory files

If there is an iceberg_trajectories.nc file in the output directory, the iceberg code will append to it, regardless of whether the model has been started from a restart file (in which case this is appropriate) or if it is a new run (in which case it may not be). In my testing directories, this has led to long and growing iceberg_trajectories.nc files. We should consider starting these files afresh for new runs, the way that we do for the MOM6 ocean.stats and SIS2 seaice.stats files.

ncrcat fails to concat icebergs restarts

Using dev/master of the icebergs project, I get restart files that cannot be combined, and hence the model cannot restart.

% cd /lustre/f1/Niki.Zadeh/work/ulm_201505_mom6_2015.07.20/OM4_SIS2_baseline_1x0m10d_2997pe1.o7143211/output.stager/lustre/f1/Niki.Zadeh/ulm_201505_mom6_2015.07.20_0/OM4_SIS2_baseline/ncrc2.intel-prod/1x0m10d_2997pe1/restart/19000111

% ncrcat icebergs.res.nc.0000 icebergs.res.nc.0001 icebergs.res.nc.0002 icebergs.res.nc.0003 icebergs.res.nc.0004 icebergs.res.nc.0005 icebergs.res.nc.0006 icebergs.res.nc.0007 icebergs.res.nc.0008 icebergs.res.nc.0009 icebergs.res.nc.0010 icebergs.res.nc.0011 icebergs.res.nc.0012 icebergs.res.nc.0013 icebergs.res.nc.0014 icebergs.res.nc.0015 icebergs.res.nc.0016 icebergs.res.nc.0017 icebergs.res.nc.0018 icebergs.res.nc.0019 icebergs.res.nc.0020 icebergs.res.nc.0021 icebergs.res.nc.0022 icebergs.res.nc.0023 icebergs.res.nc.0024 icebergs.res.nc.0025 icebergs.res.nc.0026 icebergs.res.nc.0028 icebergs.res.nc.0029 icebergs.res.nc.0030 icebergs.res.nc.0031 icebergs.res.nc.0032 icebergs.res.nc.0033 icebergs.res.nc.0034 icebergs.res.nc.0035 icebergs.res.nc.0036 icebergs.res.nc.0037 icebergs.res.nc.0038 icebergs.res.nc.0039 icebergs.res.nc.0040 icebergs.res.nc.0041 icebergs.res.nc.0042 icebergs.res.nc.0043 icebergs.res.nc.0044 icebergs.res.nc
ncrcat: ERROR nco_lmt_sct_mk() reports record variable exists and is size zero, i.e., has no records yet.
ncrcat: HINT: Perform record-oriented operations only after file has valid records.
ncrcat: cnt < 0 in nco_lmt_sct_mk()

icebergs are present but grid checksums are zero

CM4 model stdout has:

diamonds, bergs_chksum: read_restart bergs chksum=            1503455964 chksum2=            -564881530 chksum3=           -1273480993 chksum4=           -1273480993 chksum5=             322483758 #=                 88278
diamonds, grd_chksum2:    # of bergs/cell chksum=                     0 chksum2=                     0 min= 0.000000000E+00 max= 0.000000000E+00 mean= 0.000000000E+00 rms= 0.000000000E+00 sd= 0.000000000E+00
diamonds: Bergs found with creation dates after model date! Adjusting berg dates by        1 years
diamonds, bergs_chksum: before adjusting s chksum=            1503455964 chksum2=            -564881530 chksum3=           -1273480993 chksum4=           -1273480993 chksum5=             322483758 #=                 88278
diamonds, grd_chksum2:    # of bergs/cell chksum=                     0 chksum2=                     0 min= 0.000000000E+00 max= 0.000000000E+00 mean= 0.000000000E+00 rms= 0.000000000E+00 sd= 0.000000000E+00
diamonds, bergs_chksum: after adjusting st chksum=           -1254997837 chksum2=             254665629 chksum3=           -1181833352 chksum4=           -1181833352 chksum5=             198276612 #=                 88278
diamonds, grd_chksum2:    # of bergs/cell chksum=                     0 chksum2=                     0 min= 0.000000000E+00 max= 0.000000000E+00 mean= 0.000000000E+00 rms= 0.0

Why are the grd_chksum2 values all zero?

Moreover, what do the following NOTEs mean, and should something be done to address them?

NOTE from PE     0: During mpp_io(mpp_read_compressed_2d) int field ine found fill. Icebergs, or code using defaults can safely ignore.  If manually overriding compressed restart fills, confirm this is what you want.
NOTE from PE     0: During mpp_io(mpp_read_compressed_2d) int field jne found fill. Icebergs, or code using defaults can safely ignore.  If manually overriding compressed restart fills, confirm this is what you want.
NOTE from PE     0: During mpp_io(mpp_read_compressed_2d) int field start_year found fill. Icebergs, or code using defaults can safely ignore.  If manually overriding compressed restart fills, confirm this is what you want.

CM4 comes down with duplicate bergs with "0" ids

The safety checks after reading a restart are being triggered because iceberg_num (the id) = 0 for at least one berg, which is meant to be an impossible value.

Duplicated berg across PEs with id= 0 0 seen 574 times pe= 0 17392
Duplicated berg across PEs with id= 0 0 seen 574 times pe= 1 17392
Duplicated berg across PEs with id= 0 0 seen 574 times pe= 11 17392
Duplicated berg across PEs with id= 0 0 seen 574 times pe= 7 17392
...

icebergs.res.nc has valid values for iceberg_num but after reading the restart we have bad (0) values.

To complete CM4 runs I'm commenting out the checks.

dev/master/2016.08.23 icebergs and calving restarts differ across a restart

dev/master/2016.08.23 (new icebergs code)
Restart experiments show that icebergs.res.nc and calving.res.nc differ between 1x30d and 2x15d regression tests. The differences are limited to these two files; the answers match otherwise.
E.g.,

/// /lustre/f1/Niki.Zadeh/verona_mom6_2016.08.23_sis2fix/CM4_c96L32_am4g10r14_2000_OMp5_H5_ndiff_meke_MLE30d_ePBLe/ncrc3.intel15-prod-openmp/archive/1x1m0d_216x2a_842x1o/restart/00010201.tar                                                                                                                                                  
\\\ /lustre/f1/Niki.Zadeh/verona_mom6_2016.08.23_sis2fix/CM4_c96L32_am4g10r14_2000_OMp5_H5_ndiff_meke_MLE30d_ePBLe/ncrc3.intel15-prod-openmp/archive/2x0m15d_216x2a_842x1o/restart/00010201.tar                                                                                                                                                 

      Comparing calving.res.nc...                                                                                                                                       
DIFFER : VARIABLE : iceberg_counter_grd : POSITION : 0 0 1 209 : VALUES : 8 <> 9      
      Comparing icebergs.res.nc...                                                                                                                                      
DIFFER : VARIABLE : iceberg_num : ATTRIBUTE : checksum : VALUES : 68FA72DF <> B39AC75E                                                                              


   nccmp  -d first/icebergs.res.nc second/icebergs.res.nc
DIFFER : VARIABLE : iceberg_num : POSITION : 0 : VALUES : 456485 <> 871205

  nccmp -d first/calving.res.nc  second/calving.res.nc
DIFFER : VARIABLE : iceberg_counter_grd : POSITION : 0 0 1 209 : VALUES : 8 <> 9

Icebergs restarting with wrong year

There is a bug in the iceberg code, in the file icebergs_framework.F90, in the subroutine offset_berg_dates.
This routine tries to see if there are any icebergs which were calved in the future, and if so, it subtracts a date offset.

In order to see if there are future bergs, the code compares the number computed in yearday with the number 367. However, 367 should be 373 (since 12*31 + 1 = 373).

As a result of this bug, many icebergs are given incorrect start dates even when the code is running well.

This bug should be fixed by either
(i) changing 367 to 373, or
(ii) removing this date offset completely, since the new version of the code will have unique iceberg ids.

(Interestingly, this bug only applies to icebergs calved after Christmas)
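
A quick check of why the cutoff matters, assuming yearday is computed roughly as (month-1)*31 + day + fraction-of-day (which is what the 12*31 + 1 = 373 bound implies; the exact formula lives in icebergs_framework.F90):

    program yearday_threshold_demo
      implicit none
      integer :: mon, day
      ! Under the assumed formula the largest possible value is
      ! 11*31 + 31 + 1 = 373 (the +1 allows a full fractional day), so a
      ! threshold of 367 misclassifies bergs calved in the last days of December.
      mon = 12;  day = 27
      print '(a,i4)', 'approx yearday for Dec 27: ', (mon-1)*31 + day   ! 368 > 367
      print '(a,i4)', 'maximum possible yearday : ', 11*31 + 31 + 1     ! 373
    end program yearday_threshold_demo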

License updated to GNU Lesser General Public License, version 3

As of September 8, 2017, commit 6595cdd, the license shipped with "icebergs" has been updated to version 3 of the GNU Lesser General Public License. Permission to change the license was obtained from all contributors who had contributed under the previous license (as recorded by git and on GitHub).

If you have a fork of NOAA-GFDL/icebergs predating the change in license, you are not obliged to change the license. The new license only applies to versions of icebergs newer than September 8, 2017 which have the new license file. You can keep the old license by not updating to our latest code. However, we ask that you accept the new license by updating to the latest code on the dev/master branch. We will not accept pull requests for contributions unless they are made under the new license.

In a nutshell: we replaced the GPLv3 license with the LGPLv3 license that allows NOAA-GFDL/icebergs to be used within a coupled model that might not be using a compatible license. If you have questions, please reach out to us on this thread or by email.

Answers change with ice model layout

I can get different answers if I change the ice model layout even with the same total core count, e.g. 144,2 vs 36,8. I think it has to do with the domain size getting too small and halo updates needing to reach across two domains to update.
A check on the domain size at initialization should mitigate this (a minimal sketch of such a check follows).
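
A minimal sketch of such a check, with hypothetical names (isc/iec/jsc/jec-style compute-domain indices and the halo width would come from the ice domain setup):

    subroutine check_iceberg_domain(isc, iec, jsc, jec, halo)
      ! Hedged sketch: abort at initialization if any PE's compute domain is
      ! narrower than the halo, since halo updates would then need to reach
      ! across more than one neighbouring domain.
      use iso_fortran_env, only: error_unit
      implicit none
      integer, intent(in) :: isc, iec, jsc, jec, halo
      if (min(iec - isc + 1, jec - jsc + 1) < halo) then
        write(error_unit,'(a,i4)') 'iceberg compute domain is smaller than halo width ', halo
        error stop 1
      end if
    end subroutine check_iceberg_domain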

This may be related to issue #5 although I see this with the same core count.

icebergs.res.nc unreadable using NETCDF=4

When compiling with NETCDF=4, the iceberg restart files are written successfully but are actually unreadable. ncdump returns:

ncdump: MOM6-examples/ice_ocean_SIS2/SIS2_bergs_cgrid/RESTART/icebergs.res.nc: NetCDF: HDF error

The model is unable to read the iceberg files. @wfcooke first spotted a related/similar problem and has volunteered to provide an XML to reproduce the problem if needed.

new icebergs io crashes when using ulm_201505 shared code with FATAL from PE 34: mpp_chksum_int.h was called with real mask_val, and mask_val can not be safely cast to int type of var (nonzero high bits).

After upgrading the shared code from ulm to the ulm_201505 patch, the same restart regression test crashes as follows:

mpp_read_compressed chksum: mass_scaling = 9550000000000000
NOTE from PE     0: mpp_read_compressed chksum: mass_scaling failed!
FATAL from PE    34: mpp_chksum_int.h was called with real mask_val, and mask_val can not be safely cast to int type of var (nonzero high bits).

fms_MOM6_SIS2_com  000000000157787D  mpp_mod_mp_mpp_er          52  mpp_util_mpi.inc
fms_MOM6_SIS2_com  000000000153CBBB  mpp_mod_mp_mpp_ch          43  mpp_chksum_int.h
fms_MOM6_SIS2_com  000000000141AD22  mpp_io_mod_mp_mpp          78  mpp_read_compressed.h
fms_MOM6_SIS2_com  00000000011181CF  fms_io_mod_mp_rea        4950  fms_io.F90
fms_MOM6_SIS2_com  0000000000D72335  ice_bergs_io_mp_r         604  icebergs_io.F90
fms_MOM6_SIS2_com  0000000000D3FF4F  ice_bergs_mp_iceb         103  icebergs.F90
fms_MOM6_SIS2_com  000000000050A631  ice_model_mod_mp_        4034  ice_model.F90
fms_MOM6_SIS2_com  000000000040A017  coupler_main_IP_c        1402  coupler_main.F90
fms_MOM6_SIS2_com  0000000000400D6A  MAIN__                    385  coupler_main.F90

Note that this is in the 2nd leg of the restart test, after a restart attempt.

Why does fms_io think the icebergs.res.nc file is a compressed file?

Holland and Jenkins melt parameterisation issues

Hello,

I found some issues in the MOM6 ice shelf melt parameterisation module (a missing von Karman constant in the definition of ZETA_N, a buoyancy iteration loop that doesn't progress, and a salt iteration overriding the false-position method) which were explained and addressed in PR NOAA-GFDL/MOM6#395. I believe the same bugs are present in the iceberg melt code:

Missing von Karman constant in ZETA_N: https://github.com/NOAA-GFDL/icebergs/blob/dev/gfdl/src/icebergs.F90#L3514
Buoyancy iteration loop doesn't update: https://github.com/NOAA-GFDL/icebergs/blob/dev/gfdl/src/icebergs.F90#L3697
Salt iteration overrides false position method a line earlier: https://github.com/NOAA-GFDL/icebergs/blob/dev/gfdl/src/icebergs.F90#L3755

CM4 does not reproduce across a change in ice_layout, unless icebergs are off

This is a very old issue which was first seen in ESM2 years ago.

The CM4 coupled model (using SIS2 and its old icebergs module) does not produce the same answers when ice_layout is changed. When I turn off the icebergs the answers are bitwise identical across ice_layout change.

This is with repro mode and with make_exchange_reproduce=.true., but I think neither has an effect here.

I believe this issue persists if I swap SIS2 with SIS1. There is no reason for it to go away with the new icebergs module either.

Here are the two configs that do not reproduce (ALL restart files differ) unless I turn off the bergs.
They differ only in ice_layout: 72,4 vs 96,3.

 else if ( "$npes" == "2560" ) then
  set atmos_npes = "288"
  set atmos_nthreads = "2"
  set nxblocks = "4" ; set nyblocks = "2" 
  set fv_layout    =   "4,12";  set fv_io_layout    =  "1,4"
  set land_layout  =   "4,12";  set land_io_layout  =  "1,4"
  set ice_layout   =   "72,4";  set ice_io_layout   =  "1,4"
  set ocn_layout   =   "36,72"; set ocn_io_layout   =  "1,4"; set ocn_mask_table = "mask_table.622.36x72"
  set ocean_npes = "1970"

else if ( "$npes" == "2561" ) then
  set atmos_npes = "288"
  set atmos_nthreads = "2"
  set nxblocks = "4" ; set nyblocks = "2" 
  set fv_layout    =   "4,12";  set fv_io_layout    =  "1,4"
  set land_layout  =   "4,12";  set land_io_layout  =  "1,4"
  set ice_layout   =   "96,3";  set ice_io_layout   =  "1,3"
  set ocn_layout   =   "36,72"; set ocn_io_layout   =  "1,4"; set ocn_mask_table = "mask_table.622.36x72"
  set ocean_npes = "1970"

The experiments I tried are:

CM4_c96L32_am4g5r2_2000_sis2 which has the issue

/// /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2/ncrc2.intel-repro-openmp/archive/1x0m10d_2560pe/restart/00010111.tar
\\\ /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2/ncrc2.intel-repro-openmp/archive/1x0m10d_2561pe/restart/00010111.tar
DIFFER : ALL
    CROSSOVER   FAILED: CM4_c96L32_am4g5r2_2000_sis2

CM4_c96L32_am4g5r2_2000_sis2_nobergs which does not have the issue

/// /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_nobergs/ncrc2.intel-repro-openmp/archive/1x0m10d_2560pe/restart/00010111.tar
\\\ /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_nobergs/ncrc2.intel-repro-openmp/archive/1x0m10d_2561pe/restart/00010111.tar

    CROSSOVER   PASSED: CM4_c96L32_am4g5r2_2000_sis2_nobergs

Out of bounds index in save_restart (fms_io)

It looks like restarts do get written in production mode, but there is clearly a bug somewhere, since we get an out-of-bounds error with a debug executable:

forrtl: severe (408): fort: (2): Subscript #1 of the array SBUF has value 1 which is greater than the upper bound of 0

Image PC Routine Line Source
MOM6 00000000055B6FAA Unknown Unknown Unknown
MOM6 00000000055B5B25 Unknown Unknown Unknown
MOM6 0000000005572786 Unknown Unknown Unknown
MOM6 000000000550EE95 Unknown Unknown Unknown
MOM6 000000000550F2E9 Unknown Unknown Unknown
MOM6 0000000004EDA0D1 mpp_mod_mp_mpp_ga 77 mpp_gather.h
MOM6 0000000004BDD30C mpp_io_mod_mp_mpp 30 mpp_write_unlimited_axis.h
MOM6 0000000003D210AA fms_io_mod_mp_sav 2516 fms_io.F90
MOM6 0000000003D0A533 fms_io_mod_mp_sav 2114 fms_io.F90
MOM6 0000000000D7E44A ice_bergs_io_mp_w 211 icebergs_io.F90
MOM6 0000000003517354 ice_bergs_mp_iceb 2152 icebergs.F90
MOM6 000000000338FC72 ice_type_mod_mp_i 967 ice_type.F90
MOM6 000000000168E1D0 ice_model_mod_mp_ 4071 ice_model.F90
MOM6 0000000001930DB7 coupler_main_IP_c 1662 coupler_main.F90
MOM6 000000000192A4A3 MAIN__ 887 coupler_main.F90
MOM6 00000000004006AC Unknown Unknown Unknown
MOM6 00000000055C58D4 Unknown Unknown Unknown
MOM6 000000000040057D Unknown Unknown Unknown
[NID 00252] 2015-07-16 13:59:37 Apid 105491249: initiated application termination

speed_limit is not being applied

The parameter speed_limit was introduced to handle coupled instabilities and was meant to limit the speed of a berg if it became excessive. The code limits the local variables uveln and vveln, but the routine returns the variables axn and bxn, which need to be recalculated from the modified uveln and vveln.
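
A minimal sketch of the first half of the fix, with hypothetical names (the formulas for axn and bxn are in icebergs.F90 and are not reproduced here): clamp the updated velocities first, so that anything derived from them afterwards sees the limited values.

    subroutine limit_berg_speed(uveln, vveln, speed_limit)
      ! Hedged sketch: cap the magnitude of the updated berg velocity at
      ! speed_limit.  The calling routine must then recompute the axn/bxn it
      ! returns from these limited uveln/vveln rather than the raw values.
      implicit none
      real, intent(inout) :: uveln, vveln
      real, intent(in)    :: speed_limit
      real :: speed, rescale
      speed = sqrt(uveln*uveln + vveln*vveln)
      if (speed_limit > 0.0 .and. speed > speed_limit) then
        rescale = speed_limit / speed
        uveln = uveln * rescale
        vveln = vveln * rescale
      end if
    end subroutine limit_berg_speed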
