
Comments (36)

casparvl commented on September 22, 2024

Yes, it's part of https://github.com/EESSI/software-layer. Your timing is pretty good: I very recently made a PR to our docs to explain how to use it to replicate build failures. The PR isn't merged yet, but it's markdown, so you can simply view a rendered version in my feature branch. Links won't work in there, but I guess you can find your way around if need be, though I think this one markdown doc should cover it all.

from fftw3.

lrbison commented on September 22, 2024

@casparvl @boegel

I followed this issue here from the EESSI repo. I'm trying to reproduce it, but I haven't been able to so far. I've tried GCC 13.2.0 with Open MPI 4.1.6 and Open MPI 4.1.5. I'm running on an AWS hpc7g instance (Ubuntu 22.04). After being unable to reproduce directly from the FFTW source, I tried the following EasyBuild command:

eb -dfr --from-pr 18884 --prefix=/fsx/eb --disable-cleanup-builddir

which is based on trying to reproduce https://gist.github.com/boegel/d97b974b8780c93753f1bf7462367082.

After the build, I can run make check in the builddir, but none of the runs reproduces the crash. Do you have any other suggestions on how to reproduce it?

lrbison commented on September 22, 2024

One observation I have is that all the failures I've seen reported are from mpi-bench. It is true that mpirun may do slightly different things when it detects that it is running as part of a Slurm job. Can you provide any detail about how the Slurm job is allocated or launched?

casparvl commented on September 22, 2024

I'm not sure of the exact job characteristics for the test build reported in https://gist.github.com/boegel/d97b974b8780c93753f1bf7462367082

For the builds done in EESSI, I also couldn't tell you exactly what resources were requested in the job. But this is run in a container, and then in a shell in which the only SLURM-related job variable that is set is SLURM_JOB_ID. So I'm not sure there is much for mpirun to pick up on here to figure out that it actually is in a SLURM environment... Of course, SLURM can do things like set cgroups, which potentially affect how things run, but I couldn't tell you if that is done on this cluster. All node allocations here are exclusive, so I don't think a cgroup would do much anyway (as it would encompass the entire VM).
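
To check which SLURM-related variables are visible in a given shell (for example inside the build container), something like the following could be used. This is only a minimal sketch: the function name `slurm_variables` is made up for illustration, and it says nothing about which variables mpirun actually consults.

```python
import os


def slurm_variables(env=None):
    """Return the SLURM_* variables found in env, sorted by name.

    env defaults to the environment of the current process.
    """
    env = os.environ if env is None else env
    return {name: env[name] for name in sorted(env) if name.startswith("SLURM_")}
```

Printing `slurm_variables()` in both the bot's shell and an interactive session would show whether the two environments differ in what they expose.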

I did notice that I had fewer failures when I did the building interactively (though still in a job environment, it was an interactive SLURM job), as mentioned here. That seems to confirm that the environment somehow has an effect, but... I couldn't really say what. This is a hard one :(

casparvl commented on September 22, 2024

Hm, I suddenly realize one difference between our bot building for EESSI, and your typical interactive environment: the bot not only builds in the container, it builds in a writeable overlay in the container. That tends to be a bit sluggish in terms of I/O. I'm wondering if that can somehow affect how these tests run. It's a bit far-fetched, and I wouldn't be able to explain the mechanism that makes it fail, but it would explain why my own interactive attempts showed a much higher success rate.

lrbison commented on September 22, 2024

Hm, I wonder how many CPUs were allocated to that container. I saw it was configured to allow oversubscription, so I guess there is probably only 1 CPU core, which is different from my testing...
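
One way to check this from inside the container would be to ask how many CPUs the current process is allowed to run on. The snippet below is a rough sketch, assuming a Linux environment where `os.sched_getaffinity` is available; the helper name `usable_cpus` is made up for illustration.

```python
import os


def usable_cpus():
    """Return the number of CPUs this process may be scheduled on."""
    if hasattr(os, "sched_getaffinity"):
        # Respects CPU affinity masks, e.g. those set through a cpuset cgroup.
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without affinity support.
    return os.cpu_count() or 1
```

A result smaller than the node's core count would suggest that the container or job restricts the cores the tests can use.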

boegel commented on September 22, 2024

Our build nodes in AWS have 16 cores (*.4xlarge instances); using a single core would be way too slow.

Not sure what @casparvl used for testing interactively

lrbison commented on September 22, 2024

Is there a way for me to get access to that build container so I may try it myself?

casparvl commented on September 22, 2024

Btw, I've tried to reproduce it once again, since we now have a new build cluster (based on Magic Castle instead of Cluster in the Cloud). I've only tried interactively (basically following the docs I just shared), and I cannot for the life of me replicate our own issue. As mentioned in the original issue, interactively I had much higher success rates (9/10 times, more or less), but I've now run

perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi "mpirun -np 4 `pwd`/mpi-bench"

at least 20 times without failures now.
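
Repeating a test command many times and tallying the failures can be scripted. The following is a minimal sketch, assuming the test command exits with a non-zero status when a check fails; the function name `count_failures` is made up for illustration.

```python
import subprocess


def count_failures(argv, runs):
    """Run the command given as an argument list `runs` times.

    Returns how many of those runs ended with a non-zero exit status
    (a crash by signal shows up as a negative return code, so it counts too).
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

The command is passed as a list of arguments rather than a shell string, so anything the shell would normally expand (such as a backtick substitution) has to be expanded before it is put in the list.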

I'd love to see if the error still occurs when the bot builds it (as there it was consistently failing before), but my initial attempt failed for other reasons (basically, the bot cannot reinstall anything that already exists in the EESSI software stack - if you try, it'll fail on trying to change permissions on a read only file). I'll check with others if there is something I can do to work around this, so that I can actually trigger a rebuild with the bot.

lrbison commented on September 22, 2024

Yeah, I ran it over 200 times without failure on my cluster. Thank you for the pointers in that doc PR. I'll use that to try and trigger it again.

boegel commented on September 22, 2024

@casparvl Should I temporarily revive a node in our old CitC Slurm cluster, to check if the problem was somehow specific to that environment?

lrbison commented on September 22, 2024

@casparvl I haven't had the time to reproduce within a container. Are we still seeing the testing failures occur or is it not happening on the newer build cluster?

boegel commented on September 22, 2024

I am still seeing this problem on our build cluster when doing a test installation (in an interactive session) of FFTW.MPI/3.3.10-gompi-2023a for the new EESSI repository software.eessi.io.

A first attempt resulted in a segfault:

[aarch64-neoverse-v1-node2:2475846] Signal: Segmentation fault (11)
[aarch64-neoverse-v1-node2:2475846] Signal code:  (-6)
[aarch64-neoverse-v1-node2:2475846] Failing at address: 0xea670025c746
[aarch64-neoverse-v1-node2:2475846] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x4000042507a0]
[aarch64-neoverse-v1-node2:2475846] [ 1] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/libopen-pal.so.40(opal_convertor_generic_simple_position+0x10)[0x400004815b10]
[aarch64-neoverse-v1-node2:2475846] [ 2] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/libopen-pal.so.40(opal_convertor_set_position_nocheck+0x120)[0x40000480dc60]
[aarch64-neoverse-v1-node2:2475846] [ 3] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_request_progress_frag+0x360)[0x4000067f3be0]
[aarch64-neoverse-v1-node2:2475846] [ 4] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_component_progress+0x424)[0x4000060b6ec4]
[aarch64-neoverse-v1-node2:2475846] [ 5] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/libopen-pal.so.40(opal_progress+0x3c)[0x4000047fc99c]
[aarch64-neoverse-v1-node2:2475846] [ 6] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x2e4)[0x4000067ec3a4]
[aarch64-neoverse-v1-node2:2475846] [ 7] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/libmpi.so.40(MPI_Sendrecv+0x188)[0x4000044a1228]
[aarch64-neoverse-v1-node2:2475846] [ 8] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/libfftw3f_mpi.so.3(+0xa424)[0x40000426a424]
[aarch64-neoverse-v1-node2:2475846] [ 9] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/libfftw3f_mpi.so.3(+0xa480)[0x40000426a480]
[aarch64-neoverse-v1-node2:2475846] [10] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/libfftw3f_mpi.so.3(+0xc2e8)[0x40000426c2e8]
[aarch64-neoverse-v1-node2:2475846] [11] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x40586c]
[aarch64-neoverse-v1-node2:2475846] [12] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x409144]
[aarch64-neoverse-v1-node2:2475846] [13] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x40aed4]
[aarch64-neoverse-v1-node2:2475846] [14] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x409434]
[aarch64-neoverse-v1-node2:2475846] [15] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x4083c0]
[aarch64-neoverse-v1-node2:2475846] [16] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x408430]
[aarch64-neoverse-v1-node2:2475846] [17] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x40619c]
[aarch64-neoverse-v1-node2:2475846] [18] /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6(+0x26a7c)[0x400004586a7c]
[aarch64-neoverse-v1-node2:2475846] [19] /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6(__libc_start_main+0x98)[0x400004586b4c]
[aarch64-neoverse-v1-node2:2475846] [20] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x402ef0]
[aarch64-neoverse-v1-node2:2475846] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node aarch64-neoverse-v1-node2 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
FAILED mpirun -np 2 /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/mpi-bench:  --verify 'ofc]12x11x4x7' --verify 'ifc]12x11x4x7' --verify 'ok7hx3o01x13e11' --verify 'ik7hx3o01x13e11' --verify 'obr6x2x8' --verify 'ibr6x2x8' --verify 'ofr6x2x8' --verify 'ifr6x2x8' --verify 'obc6x2x8' --verify 'ibc6x2x8' --verify 'ofc6x2x8' --verify 'ifc6x2x8' --verify 'ok]7o00x3o10' --verify 'ik]7o00x3o10' --verify 'ofr]5x6x12x10v1' --verify 'ifr]5x6x12x10v1' --verify 'obc]5x6x12x10v1' --verify 'ibc]5x6x12x10v1' --verify 'ofc]5x6x12x10v1' --verify 'ifc]5x6x12x10v1' --verify 'ok[3e11x13e11x9e10x9e00' --verify 'ik[3e11x13e11x9e10x9e00' --verify 'obr9x9' --verify 'ibr9x9' --verify 'ofr9x9' --verify 'ifr9x9' --verify 'obc9x9' --verify 'ibc9x9' --verify 'ofc9x9' --verify 'ifc9x9' --verify 'obrd11x24' --verify 'ibrd11x24' --verify 'ofrd11x24' --verify 'ifrd11x24' --verify 'obcd11x24' --verify 'ibcd11x24' --verify 'ofcd11x24' --verify 'ifcd11x24' --verify 'ok]8bx5o00x7o00x9e00' --verify 'ik]8bx5o00x7o00x9e00' --verify 'obc936' --verify 'ibc936' --verify 'ofc936' --verify 'ifc936'
make[3]: *** [Makefile:997: check-local] Error 1
make[3]: Leaving directory '/tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi'
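
To narrow down which of the problems in a FAILED line triggers the crash, each one could be re-run on its own. As a starting point, here is a small sketch that pulls the problem strings out of such a line. It assumes the line has the format shown above (each problem single-quoted after a `--verify` flag); the function name `verify_problems` is made up for illustration.

```python
import shlex


def verify_problems(failed_line):
    """Return the problem strings that follow each --verify flag."""
    # shlex handles the single quotes around each problem string.
    tokens = shlex.split(failed_line)
    return [
        tokens[i + 1]
        for i, token in enumerate(tokens[:-1])
        if token == "--verify"
    ]
```

Each returned string could then be passed back to mpi-bench with a single `--verify` to test that problem in isolation.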

A second attempt showed a relative error again:

--------------------------------------------------------------
     MPI FFTW transforms passed 10 tests, 1 CPU
--------------------------------------------------------------
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi "mpirun -np 2 `pwd`/mpi-bench"
obr[18x8x20 3.69362e-34 6.89042e-34 5.06167e-34
ibr[18x8x20 2.97137e-34 9.18722e-34 4.77233e-34
obc[18x8x20 3.25632e-34 5.74201e-34 7.15416e-34
ibc[18x8x20 3.65554e-34 6.89042e-34 6.48955e-34
ofc[18x8x20 3.19732e-34 5.74201e-34 6.1247e-34
ifc[18x8x20 3.37891e-34 5.74201e-34 8.06618e-34
ofr]3x13x9v6 2.8078e-34 3.83781e-34 8.48571e-34
ifr]3x13x9v6 2.98203e-34 3.83781e-34 7.858e-34
obc]3x13x9v6 3.76241e-34 5.75672e-34 7.44837e-34
Found relative error 5.965609e-01 (time shift)
       0   4.921836295807  -1.026863218297     4.921836295807  -1.026863218297
       1  -2.116798599036   2.941852671019    -2.116798599036   2.941852671019
       2   1.771648568109  -0.438286594686     1.771648568109  -0.438286594686
       3  -4.602865281776   5.484918179038    -4.602865281776   5.484918179038
       4   2.819387086928   0.816109936207     2.819387086928   0.816109936207
       5  -6.414801972466   5.098682093116    -6.414801972466   5.098682093116
       6  -9.366154153178  -8.548834590260    -9.366154153178  -8.548834590260
       7  -5.574288760734   4.865193642565    -5.574288760734   4.865193642565
       8 -14.824759940401   5.035876685740   -14.824759940401   5.035876685740
       9 -14.634486341483   1.743297726353   -14.634486341483   1.743297726353
      10  -0.602576641742  -1.289954675842    -0.602576641742  -1.289954675842
      11   3.105024350062   4.204442622313     3.105024350062   4.204442622313
      12   9.179168922129   2.120173302885     9.179168922129   2.120173302885
      13  -6.635818177405  -0.070873458827    -6.635818177405  -0.070873458827
      14  -7.664759877207  -1.432949782471    -7.664759877207  -1.432949782471
      15   2.136338652683  -3.130528874653     2.136338652683  -3.130528874653
      16  -8.538098299824   5.591890241715    -8.538098299824   5.591890241715
      17  -0.991271869558  -3.379819003153    -0.991271869558  -3.379819003153
      18  -5.661447610867   7.529683912859    -5.661447610867   7.529683912859
      19  -6.086167949355  -1.670238032124    -6.086167949355  -1.670238032124
      20  -6.469068064222   7.690184841734    -6.469068064222   7.690184841734
      21   1.465582751707   7.789424354982     1.465582751707   7.789424354982
      22  -4.932751249830  -0.964580902292    -4.932751249830  -0.964580902292
      23  -4.495483109168   3.138270992002    -4.495483109168   3.138270992002
      24  -4.298238335069  -2.009670396150    -4.298238335069  -2.009670396150
      25  -5.616046746225  -1.630859337171    -5.616046746225  -1.630859337171
      26  -0.721988139199  -3.380289724460    -0.721988139199  -3.380289724460
      27   6.817499183174   3.754929401943     6.817499183174   3.754929401943
      28  -0.030920191076   4.600357644276    -0.030920191076   4.600357644276
      29   0.839410098370  -1.908666344239     0.839410098370  -1.908666344239
      30   3.385789170379   4.595090032781     3.385789170379   4.595090032781
      31   4.379259724979   0.784057635193     4.379259724979   0.784057635193
      32  11.841737046195  -6.148986050574    11.841737046195  -6.148986050574
      33  -4.188145406309   4.506890617698    -4.188145406309   4.506890617698
      34   2.987555638465   9.441583497205     2.987555638465   9.441583497205
      35  -8.098881460448  -0.524743787520    -8.098881460448  -0.524743787520
      36  -1.600749567878  -6.044191420031    -1.600749567878  -6.044191420031
      37   0.163953738123   2.046467146682     0.163953738123   2.046467146682
      38  -5.113238538613  -3.363399510184    -5.113238538613  -3.363399510184
      39  -2.872798422536  -8.040245973957    -2.872798422536  -8.040245973957
      40 -10.392736117901  -2.761172631260   -10.392736117901  -2.761172631260
      41   4.039519771041   8.003816207053     4.039519771041   8.003816207053
      42   1.790990423870  -8.383785422669     1.790990423870  -8.383785422669
      43  -9.165783259172  -2.186625587455    -9.165783259172  -2.186625587455
      44   5.007541953864   5.543722867012     5.007541953864   5.543722867012
      45   2.365419650732   2.977801310135     2.365419650732   2.977801310135
      46  -3.377120254702   4.906540019430    -3.377120254702   4.906540019430
      47  -0.010783860068   4.273408211548    -0.010783860068   4.273408211548
      48  -6.894392286266   6.830078049229    -6.894392286266   6.830078049229
      49  -3.254264449347   6.744977714739    -3.254264449347   6.744977714739
      50  -8.471641489793   5.603488318600    -8.471641489793   5.603488318600
      51   0.084130029380   1.367262769771     0.084130029380   1.367262769771
      52   1.482642504437  -4.602524328752     1.482642504437  -4.602524328752
      53   0.788628072835  -9.891756192852     0.788628072835  -9.891756192852
      54  -2.633046303010  11.214109607678    -2.633046303010  11.214109607678
      55  -3.192499246401   3.363355364265    -3.192499246401   3.363355364265
      56  -1.598444209258   3.573016880938    -1.598444209258   3.573016880938
      57   5.522584641083   0.912730173997     5.522584641083   0.912730173997
      58  -2.850571159892  -3.538531368267    -2.850571159892  -3.538531368267
      59   0.289119554985   1.226480324376     0.289119554985   1.226480324376
      60  -1.310174923968  -3.091891051678    -1.310174923968  -3.091891051678
      61  -2.749495846212  -9.372017422996    -2.749495846212  -9.372017422996
      62   3.279899011670   4.859168417630     3.279899011670   4.859168417630
      63   2.379547285718   1.774931614389     2.379547285718   1.774931614389
      64   4.662292029542   2.025644366541     4.662292029542   2.025644366541
      65  -6.175223059442  -1.891888996868    -6.175223059442  -1.891888996868
      66   1.731642745422  14.247081701735     1.731642745422  14.247081701735
      67 -10.929576224104  -8.727780396180   -10.929576224104  -8.727780396180
      68   5.844513943309  -1.235652769240     5.844513943309  -1.235652769240
      69   4.853189951788   0.397500732336     4.853189951788   0.397500732336
      70   1.645686104377   1.838816934461     1.645686104377   1.838816934461
      71  -1.387808178933  -6.069222393915    -1.387808178933  -6.069222393915
      72  -8.640352779734   7.623552803539    -8.640352779734   7.623552803539
      73  -2.621092502218   6.557474990141    -2.621092502218   6.557474990141
      74  -2.460425638794   0.126130793461    -2.460425638794   0.126130793461
      75  -3.642105748754  -3.042790015208    -3.642105748754  -3.042790015208
      76   0.903895069572   5.573680347688     0.903895069572   5.573680347688
      77  -3.850746636008  -0.664540783961    -3.850746636008  -0.664540783961
      78   2.670783169330   1.168453854800     2.670783169330   1.168453854800
      79   0.863490161325   2.800910717379     0.863490161325   2.800910717379
      80 -10.408734415051  -0.623237951468   -10.408734415051  -0.623237951468
      81  -6.746215176255 -10.162136743830    -6.746215176255 -10.162136743830
      82   6.010383700192   2.700168967362     6.010383700192   2.700168967362
      83   7.250381313471   2.507195619411     7.250381313471   2.507195619411
      84   5.728973913944  -2.066599007246     5.728973913944  -2.066599007246
      85 -10.049824910825   5.688927229637   -10.049824910825   5.688927229637
      86   2.592017899133  -1.850191728792     2.592017899133  -1.850191728792
      87  10.779025866591  -1.076683736319    10.779025866591  -1.076683736319
      88  -4.383388756630   1.650480826796    -4.383388756630   1.650480826796
      89  -0.055685598972  -3.774783473873    -0.055685598972  -3.774783473873
      90   6.628995072655   1.367150047102     6.628995072655   1.367150047102
      91  -0.810232261568  -2.976939725877    -0.810232261568  -2.976939725877
      92  -0.207344369538  -4.505328272435    -0.207344369538  -4.505328272435
      93   5.262364487884   5.245089127649     5.262364487884   5.245089127649
      94  -0.879545465455  -6.694733840184    -0.879545465455  -6.694733840184
      95  -0.807449055017  -5.586509120899    -0.807449055017  -5.586509120899
      96   4.706214482159   1.081938739490     4.706214482159   1.081938739490
      97  -1.981403259786   7.529674456958    -1.981403259786   7.529674456958
      98   2.203956996302  -4.983523613820     2.203956996302  -4.983523613820
      99  -2.296628834421   2.179234813172    -2.296628834421   2.179234813172
     100   9.173485452525   3.228133868069     9.173485452525   3.228133868069
     101  -6.386943659435   6.926987789753    -6.386943659435   6.926987789753
     102   3.076153055928   1.493617153748     3.076153055928   1.493617153748
     103  10.054141435677  13.326661925432    10.054141435677  13.326661925432
     104   8.463391787584  -5.877325613584     8.463391787584  -5.877325613584
     105  -0.696625001947  -3.802301741098    -0.696625001947  -3.802301741098
     106  -8.196977873692  -2.069536940407    -8.196977873692  -2.069536940407
     107   2.948666032147   2.516823938344     2.948666032147   2.516823938344
     108  -7.976790406507  -8.442930303150    -7.976790406507  -8.442930303150
     109  -2.921418292350   0.328394194535    -2.921418292350   0.328394194535
     110   2.105361692243  -1.048071016627     2.105361692243  -1.048071016627
     111  -0.122956865261  -3.178104995804    -0.122956865261  -3.178104995804
     112   1.377690789409  -1.577444205340     1.377690789409  -1.577444205340
     113  -4.004584148861  -4.382890836537    -4.004584148861  -4.382890836537
     114   0.011427712451   6.324099444670     0.011427712451   6.324099444670
     115   5.826045729088 -14.340030576439     5.826045729088 -14.340030576439
     116   7.577427495586   2.873642967239     7.577427495586   2.873642967239
     117  -1.210172393913   3.087617904153    -1.210172393913   3.087617904153
     118   4.129688769436  -0.269191081687     4.129688769436  -0.269191081687
     119   2.498805623692   8.629698093887     2.498805623692   8.629698093887
     120   0.180001563022   1.905778234978     0.180001563022   1.905778234978
     121   7.007577095520  -8.896514053123     7.007577095520  -8.896514053123
     122   6.566401034660  -3.159194023820     6.566401034660  -3.159194023820
     123  -7.616361041524  -6.592271202720    -7.616361041524  -6.592271202720
     124  -7.030945328309  -2.404710690963    -7.030945328309  -2.404710690963
     125  -4.795666771461   7.565990037469    -4.795666771461   7.565990037469
     126  -2.375104348185   1.918133142771    -2.375104348185   1.918133142771
     127   4.793627396078 -11.569053139350     4.793627396078 -11.569053139350
     128   0.825614651653  -5.877317639277     0.825614651653  -5.877317639277
     129   6.404638041792   7.660923814373     6.404638041792   7.660923814373
     130  -5.608845937279   6.189883435798    -5.608845937279   6.189883435798
     131  -2.052132858903  -3.021527799608    -2.052132858903  -3.021527799608
     132  -3.547036584342  -8.799408090505    -3.547036584342  -8.799408090505
     133  -0.668169395838   0.242810562341    -0.668169395838   0.242810562341
     134   6.968865621898   6.811579013049     6.968865621898   6.811579013049
     135  -4.777970484256   4.042227001858    -4.777970484256   4.042227001858
     136  -8.001526926080  -6.737608100204    -8.001526926080  -6.737608100204
     137  -0.028276084813   2.602238255100    -0.028276084813   2.602238255100
     138  -0.512308956568  -6.981404730691    -0.512308956568  -6.981404730691
     139   9.387188053728 -13.669568093222     9.387188053728 -13.669568093222
     140   4.861551650830   4.744755456001     4.861551650830   4.744755456001
     141  -1.458785307411   4.663376600332    -1.458785307411   4.663376600332
     142  -3.825099657621  -4.135819803719    -3.825099657621  -4.135819803719
     143   7.337295848097  -5.254042209712     7.337295848097  -5.254042209712
     144  -0.313260555864   6.771687218297    -0.313260555864   6.771687218297
     145  -3.163795468737   8.593709314445    -3.163795468737   8.593709314445
     146  -1.637608385520  -3.916686625097    -1.637608385520  -3.916686625097
     147   2.893356680624   3.492613129989     2.893356680624   3.492613129989
     148  -0.241462371122   7.603996304141    -0.241462371122   7.603996304141
     149   4.792674968811   6.244544979428     4.792674968811   6.244544979428
     150  -4.187404522818   3.480699468993    -4.187404522818   3.480699468993
     151   2.275412058088   8.711606271295     2.275412058088   8.711606271295
     152   9.309440618908   8.500678323888     9.309440618908   8.500678323888
     153  -5.146801960557   0.480271780127    -5.146801960557   0.480271780127
     154  -0.342934280885   4.006492082219    -0.342934280885   4.006492082219
     155  -0.520225001067  -2.871435828872    -0.520225001067  -2.871435828872
     156  -3.872971943304   1.447235114939    -3.872971943304   1.447235114939
     157  -6.260170736857  -4.000013983045    -6.260170736857  -4.000013983045
     158  -1.793247919295   4.904867267000    -1.793247919295   4.904867267000
     159  -5.476491940734   2.221240632587    -5.476491940734   2.221240632587
     160  -6.926551145538   4.990927999485    -6.926551145538   4.990927999485
     161   8.742622092574  10.567091128674     8.742622092574  10.567091128674
     162   1.485449402871   3.314914414563     1.485449402871   3.314914414563
     163  -6.730468872131  -5.788026037934    -6.730468872131  -5.788026037934
     164  -2.981885794878  -2.587215880092    -2.981885794878  -2.587215880092
     165   1.447396351206 -12.814694105896     1.447396351206 -12.814694105896
     166  -2.474143000457  -8.199676906604    -2.474143000457  -8.199676906604
     167  -6.968727036826   6.621321359661    -6.968727036826   6.621321359661
     168  -3.257964523801   0.484386452538    -3.257964523801   0.484386452538
     169   2.319015390451  -1.703639037599     2.319015390451  -1.703639037599
     170  -1.645353274574  11.946438535003    -1.645353274574  11.946438535003
     171  -0.711343655735  -4.829312331723    -0.711343655735  -4.829312331723
     172  -0.462013339680   3.395796127960    -0.462013339680   3.395796127960
     173  -1.879403680530  -1.220043545876    -1.879403680530  -1.220043545876
     174   1.907603137772   4.707561015705     1.907603137772   4.707561015705
     175   4.690694650819  -3.134057632254     4.690694650819  -3.134057632254
     176  -0.731397825734  10.216171123903    -0.731397825734  10.216171123903
     177   1.727112370787   1.537556202680     1.727112370787   1.537556202680
     178   9.804231130535  -3.050822838002     9.804231130535  -3.050822838002
     179  -3.521642259704   8.644200067602    -3.521642259704   8.644200067602
     180   1.847171292586   6.297594444781     1.847171292586   6.297594444781
     181  -2.944580056826  -8.904668383923    -2.944580056826  -8.904668383923
     182  -0.479878773202   9.252293550971    -0.479878773202   9.252293550971
     183   8.105438096502  -0.100680885472     8.105438096502  -0.100680885472
     184  -3.261705711112   5.625865138249    -3.261705711112   5.625865138249
     185  -9.001340472449   3.481531232669    -9.001340472449   3.481531232669
     186  -9.922858428321   8.928064172077    -9.922858428321   8.928064172077
     187  -0.262071419781  -2.637186613527    -0.262071419781  -2.637186613527
     188  13.976902439634   2.365843139075    13.976902439634   2.365843139075
     189  -1.200796504034  -4.514856210494    -1.200796504034  -4.514856210494
     190  10.243971312066   4.464830376249    10.243971312066   4.464830376249
     191  -2.874371721067  -6.435933215796    -2.874371721067  -6.435933215796
     192   1.013001932314   2.999699060836     1.013001932314   2.999699060836
     193  -0.993840710862  -6.386582096375    -0.993840710862  -6.386582096375
     194   3.561437964884   7.779957565555     3.561437964884   7.779957565555
     195   9.312380923566  -6.079419786231     9.312380923566  -6.079419786231
     196   0.417492073520   0.675369898888     0.417492073520   0.675369898888
     197  -5.373267320387  -5.228378193047    -5.373267320387  -5.228378193047
     198   2.811480320243  -1.530828750353     2.811480320243  -1.530828750353
     199  -3.810636424898  -4.270965066499    -3.810636424898  -4.270965066499
     200  -1.929116070223  -2.795097831046    -1.929116070223  -2.795097831046
     201  -4.910461489544   3.949953732577    -4.910461489544   3.949953732577
     202  -1.110838593410   0.859180227354    -1.110838593410   0.859180227354
     203  -2.647010599309  10.090425689658    -2.647010599309  10.090425689658
     204   0.055618930342  10.953225553089     0.055618930342  10.953225553089
     205   7.677359001006   1.191345729669     7.677359001006   1.191345729669
     206  -1.498323690654   0.356861042000    -1.498323690654   0.356861042000
     207   2.097613279533   1.809878602708     2.097613279533   1.809878602708
     208 -10.433389542881  -2.767883226818   -10.433389542881  -2.767883226818
     209   4.485006007605  -2.861710075652     4.485006007605  -2.861710075652
     210 -11.299061334429  -4.240819220427   -11.299061334429  -4.240819220427
     211   0.889330359867  -6.122606788728     0.889330359867  -6.122606788728
     212   1.644972522082  -4.130609805857     1.644972522082  -4.130609805857
     213   3.119752911839   5.520336783880     3.119752911839   5.520336783880
     214   6.451529263230  -6.991195712115     6.451529263230  -6.991195712115
     215   1.950360060868  -5.530643072460     1.950360060868  -5.530643072460
     216   6.040150340031   4.344206024582     6.040150340031   4.344206024582
     217   1.752640417864 -13.456400163587     1.752640417864 -13.456400163587
     218  13.891823564455  -4.615650120662    13.891823564455  -4.615650120662
     219   3.353087607440  -5.568825085630     3.353087607440  -5.568825085630
     220  -0.238755291286  -0.122203225111    -0.238755291286  -0.122203225111
     221  -4.994487942039  -2.421765879674    -4.994487942039  -2.421765879674
     222  -8.659719429484   2.921445084397    -8.659719429484   2.921445084397
     223   1.322825765261   6.523414416538     1.322825765261   6.523414416538
     224   0.383609830312 -11.798272908431     0.383609830312 -11.798272908431
     225  -4.959900682847  -4.419719391506    -4.959900682847  -4.419719391506
     226  -1.407603649500  -2.756941605224    -1.407603649500  -2.756941605224
     227  -7.044264525785   0.083191244366    -7.044264525785   0.083191244366
     228   5.027093393519   5.195264035163     5.027093393519   5.195264035163
     229   1.563992212574  -0.701216248220     1.563992212574  -0.701216248220
     230   0.306554234674   4.476987321667     0.306554234674   4.476987321667
     231   1.226269348284  -2.296913229853     1.226269348284  -2.296913229853
     232  -4.098996468141  -6.855165091528    -4.098996468141  -6.855165091528
     233  -8.845292917687  -0.923422749681    -8.845292917687  -0.923422749681
     234  -4.250287799692   3.557076157786    -4.250287799692   3.557076157786
     235   0.469057774787   8.279657163755     0.469057774787   8.279657163755
     236   4.340048272752  -0.232303117938     4.340048272752  -0.232303117938
     237   1.752288340162  -4.554038855546     1.752288340162  -4.554038855546
     238  -2.786461997863   0.349152549109    -2.786461997863   0.349152549109
     239  -9.048613296502   4.902932369427    -9.048613296502   4.902932369427
     240   1.292868067079  10.372646253328     1.292868067079  10.372646253328
     251   6.531174970235   3.643867565208     6.531174970235   3.643867565208
     252  -4.541992135905   0.814485798927    -4.541992135905   0.814485798927
     253  10.547201095289  13.176243470534    10.547201095289  13.176243470534
     254   1.680946638121  13.004362273317     1.680946638121  13.004362273317
     255   7.244027605296  -1.038411963768     7.244027605296  -1.038411963768
     262  -5.925648177938  -1.268203314222    -5.925648177938  -1.268203314222
     267   1.353136474995  -2.122072383697     1.353136474995  -2.122072383697
     268   3.486826915908   1.572000562698     3.486826915908   1.572000562698
     269  -4.426443526401   2.623133044118    -4.426443526401   2.623133044118
     270   8.632661103766 -11.643280600455     8.632661103766 -11.643280600455
     271   3.232947482898   8.184951094877     3.232947482898   8.184951094877
     272  -0.058647847283  -6.334711265114    -0.058647847283  -6.334711265114
     273  -0.941586491285   8.349265532221    -0.941586491285   8.349265532221
     274  -6.447305295794  -5.049955925120    -6.447305295794  -5.049955925120
     275 -11.945598200550  -4.015966059585   -11.945598200550  -4.015966059585
     276   0.362306726308  14.450774594960     0.362306726308  14.450774594960
     277 -13.833931179943  -7.361791104432   -13.833931179943  -7.361791104432
     278   0.768431424357   6.017412350709     0.768431424357   6.017412350709
     279   0.030233743672  -2.307785598212     0.030233743672  -2.307785598212
     280  10.014906601641   3.051751435802    10.014906601641   3.051751435802
     281   5.112072879529  -4.132941863717     5.112072879529  -4.132941863717
     282  -0.438617802708  10.276119662869    -0.438617802708  10.276119662869
     283  -3.027137527217  -0.561076703303    -3.027137527217  -0.561076703303
     284   3.926003026828  -4.086725429315     3.926003026828  -4.086725429315
     285   0.786845785234  -1.530531963474     0.786845785234  -1.530531963474
     286  -2.893235031611  -8.453773261229    -2.893235031611  -8.453773261229
     287  11.596087554883  -4.013957133276    11.596087554883  -4.013957133276
     288  -4.988489747276  11.688234628619    -4.988489747276  11.688234628619
     289  -5.099846866775   3.149676053203    -5.099846866775   3.149676053203
     290   3.993544832699  -2.176510514608     3.993544832699  -2.176510514608
     291   1.791994775922   2.679198098395     1.791994775922   2.679198098395
     292   6.229541027538   7.197596224506     6.229541027538   7.197596224506
     293  -2.690450075242   6.678106532908    -2.690450075242   6.678106532908
     294   7.028412388425  10.238169492735     7.028412388425  10.238169492735
     295  -4.703505231104  -6.328634949054    -4.703505231104  -6.328634949054
     296 -14.073077800312  -6.540533668748   -14.073077800312  -6.540533668748
     297  -2.359761010290   4.669844938190    -2.359761010290   4.669844938190
     298  -3.973951647153  -7.985259797914    -3.973951647153  -7.985259797914
     299   4.741028202046  -0.901953828990     4.741028202046  -0.901953828990
FAILED mpirun -np 2 /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/mpi-bench:  --verify 'obr[18x8x20' --verify 'ibr[18x8x20' --verify 'obc[18x8x20' --verify 'ibc[18x8x20' --verify 'ofc[18x8x20' --verify 'ifc[18x8x20' --verify 'ofr]3x13x9v6' --verify 'ifr]3x13x9v6' --verify 'obc]3x13x9v6' --verify 'ibc]3x13x9v6' --verify 'ofc]3x13x9v6' --verify 'ifc]3x13x9v6' --verify 'okd[10e00x7e01x4e00v11' --verify 'ikd[10e00x7e01x4e00v11' --verify 'obrd[10x11x3x10' --verify 'ibrd[10x11x3x10' --verify 'obcd[10x11x3x10' --verify 'ibcd[10x11x3x10' --verify 'ofcd[10x11x3x10' --verify 'ifcd[10x11x3x10' --verify 'okd]9o10x10e00x10e00x10b' --verify 'ikd]9o10x10e00x10e00x10b' --verify 'ofrd]3x12x6v6' --verify 'ifrd]3x12x6v6' --verify 'obcd]3x12x6v6' --verify 'ibcd]3x12x6v6' --verify 'ofcd]3x12x6v6' --verify 'ifcd]3x12x6v6' --verify 'okd6bx2e00v9' --verify 'ikd6bx2e00v9' --verify 'obr5x2x6v2' --verify 'ibr5x2x6v2' --verify 'ofr5x2x6v2' --verify 'ifr5x2x6v2' --verify 'obc5x2x6v2' --verify 'ibc5x2x6v2' --verify 'ofc5x2x6v2' --verify 'ifc5x2x6v2' --verify 'ofr]12x5x10v3' --verify 'ifr]12x5x10v3' --verify 'obc]12x5x10v3' --verify 'ibc]12x5x10v3' --verify 'ofc]12x5x10v3' --verify 'ifc]12x5x10v3'
make[3]: *** [Makefile:997: check-local] Error 1
make[3]: Leaving directory '/tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi'

from fftw3.

lrbison avatar lrbison commented on September 22, 2024

I tried to replicate this over the weekend. @casparvl's documentation was extremely helpful, thank you! I tried to debug this PR: https://github.com/EESSI/software-layer/pull/374/files

git clone https://github.com/EESSI/software-layer.git
cd software-layer
git remote add casparvl https://github.com/casparvl/software-layer
git fetch casparvl
git checkout casparvl/fftw_test
./eessi_container.sh --access rw --save /fsx/essi-fftw1

And then within the easybuild container did this in a loop:

eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot

It ran 374 times over the weekend without failure on an hpc7g.16xlarge (64 cores).

@casparvl, it sounded like you suspected that a writable overlay could cause more sluggish I/O. I'm not familiar enough with the EESSI container, but I think I have enabled that with --access rw, correct?

Do either of you have other ideas for me to change? I suppose I can switch to a c7g.4xlarge....

from fftw3.

lrbison avatar lrbison commented on September 22, 2024

I was able to compile and successfully run on c7g.4xlarge as well, with no issues there either.

from fftw3.

lrbison avatar lrbison commented on September 22, 2024

@casparvl Do you have other ideas on how I can try to reproduce? I'm not sure if it matters, but my attempt was on Ubuntu 2004, and the container was started using: ./eessi_container.sh --access rw --save /fsx/lrbison/essi-fftw1 where the mount is hosted on an FSx for Lustre file system.

My repeated testing consisted of repeated calls to eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot rather than repeatedly starting the container.

from fftw3.

casparvl avatar casparvl commented on September 22, 2024

Sorry for failing to come back to you on this. I'll try again myself as well. I just did one install, which was indeed successful. The second time, I ran into the same error as @boegel had the 2nd time around:

Error log:
      MPI FFTW transforms passed 10 tests, 3 CPUs
--------------------------------------------------------------
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi "mpirun -np 4 `pwd`/mpi-bench"
Executing "mpirun -np 4 /tmp/eessi-debug.n0muoZ0cuh/easybuild/build/FFTWMPI/3.3.10/gompi-2022a/fftw-3.3.10/mpi/mpi-bench --verbose=1   --verify 'ofc10x10x3' --verify 'ifc10x10x3' --verify 'ok]16bx11o11v6' --verify 'ik]16bx11o11v6' --verify 'ofr]12x13x8' --verify 'ifr]12x13x8' --verify 'obc]12x13x8' --verify 'ibc]12x13x8' --verify 'ofc]12x13x8' --verify 'ifc]12x13x8' --verify 'okd]12o01x30e00' --verify 'ikd]12o01x30e00' --verify 'ofrd]3x6x3x4' --verify 'ifrd]3x6x3x4' --verify 'obcd]3x6x3x4' --verify 'ibcd]3x6x3x4' --verify 'ofcd]3x6x3x4' --verify 'ifcd]3x6x3x4' --verify 'okd[8o11x9e10x10o00x10e01' --verify 'ikd[8o11x9e10x10o00x10e01' --verify 'obrd12x12x5v2' --verify 'ibrd12x12x5v2' --verify 'ofrd12x12x5v2' --verify 'ifrd12x12x5v2' --verify 'obcd12x12x5v2' --verify 'ibcd12x12x5v2' --verify 'ofcd12x12x5v2' --verify 'ifcd12x12x5v2' --verify 'ok[13e11x52o00' --verify 'ik[13e11x52o00' --verify 'obrd[8x7v2' --verify 'ibrd[8x7v2' --verify 'obcd[8x7v2' --verify 'ibcd[8x7v2' --verify 'ofcd[8x7v2' --verify 'ifcd[8x7v2' --verify 'obr12x3x2x8' --verify 'ibr12x3x2x8' --verify 'ofr12x3x2x8' --verify 'ifr12x3x2x8' --verify 'obc12x3x2x8' --verify 'ibc12x3x2x8' --verify 'ofc12x3x2x8' --verify 'ifc12x3x2x8'"
ofc10x10x3 1.95174e-07 3.30362e-07 1.86409e-07
ifc10x10x3 1.7346e-07 3.30362e-07 2.59827e-07
ok]16bx11o11v6 1.73834e-07 1.48147e-06 1.88905e-07
ik]16bx11o11v6 2.28489e-07 1.60348e-06 1.94972e-07
ofr]12x13x8 2.74646e-07 4.3193e-07 1.84938e-07
ifr]12x13x8 1.88937e-07 4.3193e-07 1.63803e-07
obc]12x13x8 2.10673e-07 4.3193e-07 2.28376e-07
ibc]12x13x8 1.97341e-07 4.3193e-07 2.27807e-07
ofc]12x13x8 2.19374e-07 5.39912e-07 2.17205e-07
ifc]12x13x8 2.08943e-07 4.3193e-07 2.19416e-07
okd]12o01x30e00 2.51417e-07 4.47886e-06 1.94862e-07
ikd]12o01x30e00 2.48254e-07 5.89166e-06 3.59064e-07
ofrd]3x6x3x4 1.82793e-07 2.59557e-07 1.48863e-07
ifrd]3x6x3x4 1.75387e-07 2.59557e-07 1.83453e-07
obcd]3x6x3x4 1.89722e-07 3.24447e-07 1.87965e-07
ibcd]3x6x3x4 1.94751e-07 3.24447e-07 1.69235e-07
ofcd]3x6x3x4 1.69961e-07 3.24447e-07 1.56861e-07
ifcd]3x6x3x4 1.82658e-07 3.24447e-07 1.69306e-07
Found relative error 2.900030e+35 (time shift)
       0 -164.457138061523   -164.457199096680
       1 -225.637115478516   -225.637100219727
       2 -20.902750015259   -20.902732849121
       3 172.414703369141   172.414733886719
       4  -4.662590026855    -4.662593841553
       5   7.010725498199     7.010738372803
       6 -89.267349243164   -89.267364501953
       7 326.806823730469   326.806823730469
       8 -19.448410034180   -19.448524475098
       9  69.001441955566    69.001434326172
      10 -104.643005371094   -104.643020629883
      11 -26.874126434326   -26.874076843262
      12 -24.399785995483   -24.399751663208
      13 -141.903198242188   -141.903198242188
      14 -90.872367858887   -90.872360229492
      15 -44.611225128174   -44.611217498779
      16  41.871009826660    41.871009826660
      17 176.062194824219   176.062194824219
      18 -90.186141967773   -90.186141967773
      19  -4.998665332794    -4.998687744141

Running it a third time, it completed successfully again.

The only thing you don't mention explicitly is whether you also followed the steps of activating the prefix environment & EESSI pilot stack, as described on https://www.eessi.io/docs/adding_software/debugging_failed_builds/ , and whether you sourced the configure_easybuild script. Did you do that?

If you didn't, I guess that means you've built the full software stack from the ground up. If that's the case, and if that works, then I guess the conclusion is that something is fishy with one of the FFTW.MPI dependencies we pick up from the EESSI pilot stack (and for which you would have done a fresh build). That's useful information, because it would show that using the dependencies from EESSI somehow triggers this issue. It would also mean you could try those steps as well (i.e. start the prefix environment, start the EESSI pilot stack, source the configure_easybuild script) and see if you can replicate the issue that way. That would unambiguously prove that the issue is somewhere in the dependencies that we already have in the stack.

Just for reference, this is a snippet of my history from the point I start the container, to having run the eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot command once:

EESSI_CVMFS_REPO=/cvmfs/pilot.eessi-hpc.org/
EESSI_PILOT_VERSION=2023.06
source ${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/init/bash
export WORKDIR=$(mktemp --directory --tmpdir=/tmp  -t eessi-debug.XXXXXXXXXX)
source configure_easybuild
module load EasyBuild/4.8.1
eb --show-config
eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot

The result of eb --show-config is:

[EESSI pilot 2023.06] $ eb --show-config
#
# Current EasyBuild configuration
# (C: command line argument, D: default value, E: environment variable, F: configuration file)
#
buildpath            (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/build
containerpath        (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/containers
debug                (E) = True
experimental         (E) = True
filter-deps          (E) = Autoconf, Automake, Autotools, binutils, bzip2, DBus, flex, gettext, gperf, help2man, intltool, libreadline, libtool, Lua, M4, makeinfo, ncurses, util-linux, XZ, zlib, Yasm
filter-env-vars      (E) = LD_LIBRARY_PATH
hooks                (E) = /home/casparvl/debug_PR374/software-layer/eb_hooks.py
ignore-osdeps        (E) = True
installpath          (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_n1/testing
module-extensions    (E) = True
packagepath          (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/packages
prefix               (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild
read-only-installdir (E) = True
repositorypath       (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/ebfiles_repo
robot-paths          (D) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/4.8.1/easybuild/easyconfigs
rpath                (E) = True
sourcepath           (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/sources:
sysroot              (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64
trace                (E) = True
zip-logs             (E) = bzip2

Curious to hear if you ran using the EESSI pilot stack for dependencies. Maybe you can also share your eb --show-config output.

from fftw3.

casparvl avatar casparvl commented on September 22, 2024

I'm also still puzzled by the randomness of this issue. I'd love to better understand why the failure of these tests is random. Is the input randomly generated? Is the algorithm simply non-deterministic (e.g. because of a non-deterministic order in reduction operations or something of that nature)? I'd love to understand if that 'randomness' could somehow be affected by the environment, as initially I seem to have seen many more failures in a job environment than interactively... But I'm not sure if any of you has such intricate knowledge of what these particular tests do :)
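
One general reason a reduction can give different results from run to run is that floating-point addition is not associative, so the order in which partial sums are combined matters. The following is a minimal sketch of that effect using awk (double precision) and made-up values; it is not taken from FFTW or Open MPI. It only illustrates rounding-level differences, which would not by themselves explain a relative error as large as 2.900030e+35.

```shell
# Illustration only: the same three numbers summed in two different orders.
# The values are hypothetical and chosen to make the rounding visible.
sum_order_demo() {
    awk 'BEGIN {
        a = 1e16; b = -1e16; c = 1
        printf "(a + b) + c = %g\n", (a + b) + c   # b cancels a first, c survives
        printf "a + (b + c) = %g\n", a + (b + c)   # c is rounded away inside b + c
    }'
}

sum_order_demo
# prints:
#   (a + b) + c = 1
#   a + (b + c) = 0
```

If the number of ranks or the reduction algorithm changes how partial results are grouped, differences of this kind can appear in the last few digits.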

from fftw3.

lrbison avatar lrbison commented on September 22, 2024

Yes, I'm afraid I can't speak for the fftw developers here, perhaps @matteo-frigo could help answer the question about what ../tests/check.pl is checking, and if the failures are catastrophic or simply small precision errors?
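
To get a first impression of how large the reported errors are, the "Found relative error" lines can be pulled out of a saved test log. This is a rough sketch, assuming the log format shown in the error log earlier in this thread; the function name classify_errors and the tolerance passed to it are hypothetical choices for illustration, not something check.pl or mpi-bench defines.

```shell
# Hypothetical helper: print each reported relative error and label it
# "large" or "small" relative to a caller-chosen tolerance.
# Usage: classify_errors TOLERANCE [LOGFILE...]   (reads stdin if no file)
classify_errors() {
    tol="$1"; shift
    awk -v tol="$tol" '/Found relative error/ {
        # Field 4 is the number in "Found relative error <value> (...)".
        print $4, ((($4 + 0) > (tol + 0)) ? "large" : "small")
    }' "$@"
}
```

For example, `classify_errors 1e-3 build.log` (with build.log standing in for wherever the test output was saved) would label the 2.900030e+35 error from the log above as large.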

from fftw3.

lrbison avatar lrbison commented on September 22, 2024

@casparvl

My complete steps are here:

git clone https://github.com/EESSI/software-layer.git
cd software-layer
git remote add casparvl https://github.com/casparvl/software-layer
git fetch casparvl
git checkout casparvl/fftw_test
./eessi_container.sh --access rw --save /fsx/lrbison/essi-fftw1
Apptainer> echo ${EESSI_CVMFS_REPO}; echo ${EESSI_PILOT_VERSION}
/cvmfs/pilot.eessi-hpc.org
2023.06

export EESSI_OS_TYPE=linux  # We only support Linux for now
export EESSI_CPU_FAMILY=$(uname -m)
${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/compat/${EESSI_OS_TYPE}/${EESSI_CPU_FAMILY}/startprefix
#...(wait a bit)
export EESSI_CVMFS_REPO=/cvmfs/pilot.eessi-hpc.org
export EESSI_PILOT_VERSION=2023.06
source ${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/init/bash

export WORKDIR=/tmp/try1
source configure_easybuild
module load EasyBuild/4.8.1
eb --show-config

eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot

Sadly I didn't save my easybuild output; let me re-create it. I am curious: when you "retry", do you retry from eb --easystack... or do you retry from ./eessi_container.sh ...?

from fftw3.

casparvl avatar casparvl commented on September 22, 2024

Ok, so you also built on top of the dependencies that were already provided from the EESSI side. Then I really don't see any differences, other than (potentially) things in the environment... Strange!

> I am curious, when you "retry" do you retry from eb --easystack... or do you retry from ./eessi_container.sh ...?

Like you, I retried from eb --easystack .... So, I get different results, even without restarting the container...

Also interesting: I've tried a 4th time, and now I get a hanging process. I.e. I see two lt-mpi-bench processes using ~100% CPU, and they have been doing so for 66 minutes straight. They normally complete much faster. MPI deadlock...?

from fftw3.

lrbison avatar lrbison commented on September 22, 2024

I would love a backtrace of both of those processes!

from fftw3.

casparvl avatar casparvl commented on September 22, 2024

Great idea... but unfortunately my allocation ended 2 minutes after I noticed the hang :( I'm pretty sure I had process hangs before as well, when I ran into this issue originally. I'll try to run it a couple more times tonight, see if I can trigger it again and get a backtrace...
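
For the next time the hang shows up, one way to capture backtraces without an interactive session is to attach gdb in batch mode to each stuck process. This is a sketch only: it assumes gdb and pgrep are available inside the container, that attaching to the processes is permitted there, and that the process name is lt-mpi-bench as seen above. dump_backtraces is a hypothetical helper name.

```shell
# Hypothetical helper: print a backtrace of all threads for every running
# process whose name exactly matches the given name.
dump_backtraces() {
    name="$1"
    for pid in $(pgrep -x "$name"); do
        echo "== backtrace for PID $pid =="
        # -batch makes gdb run the -ex command, then detach and exit.
        gdb -p "$pid" -batch -ex "thread apply all bt"
    done
}
```

Something like `dump_backtraces lt-mpi-bench > backtraces.txt 2>&1` would then save the output before the allocation ends.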

from fftw3.

casparvl avatar casparvl commented on September 22, 2024

Hm, while trying to reproduce my hang (which I didn't succeed in yet), I noticed something: the automatic initialization script from EESSI thinks this node is a neoverse_n1. I seem to remember some chatter about this architecture not being detected properly, but thought we fixed that - maybe not. Anyway, it will build against dependencies optimized on neoverse_n1. I'm pretty sure our build bot overrides this automatic CPU architecture detection, but maybe @boegel can confirm... It would at least point to one difference between what our bot does, and what I get interactively.

Anyway, for now, I'll override myself with export EESSI_SOFTWARE_SUBDIR_OVERRIDE=aarch64/neoverse_v1 before sourcing the init script. See where that takes me in terms of build failures, hangs, etc.

from fftw3.

casparvl avatar casparvl commented on September 22, 2024

Interesting: now that I use the right dependencies (due to export EESSI_SOFTWARE_SUBDIR_OVERRIDE=aarch64/neoverse_v1), the failures are suddenly consistent instead of occasional. Maybe you could give that a try as well: set it after running startprefix, but before sourcing the initialization script. Also, at this point, you may unset EESSI_SILENT. That will cause the init script to print which architecture is selected (it should respect your override, but it's good to check).

I've run it about 10-15 times now. Each time, it fails with a numerical error like the one above. Now, finally, I've managed to reproduce the hanging 2 processes. Here's the backtrace:

(gdb) bt full
#0  0x000040002c61c604 in opal_timer_linux_get_cycles_sys_timer ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libopen-pal.so.40
No symbol table info available.
#1  0x000040002c5ccaec in opal_progress_events.isra ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libopen-pal.so.40
No symbol table info available.
#2  0x000040002c5ccc88 in opal_progress () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libopen-pal.so.40
No symbol table info available.
#3  0x000040002c22babc in ompi_request_default_wait () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#4  0x000040002c27e284 in ompi_coll_base_sendrecv_actual ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#5  0x000040002c27f40c in ompi_coll_base_allreduce_intra_recursivedoubling ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#6  0x000040002c27fad4 in ompi_coll_base_allreduce_intra_ring ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#7  0x000040002ea861cc in ompi_coll_tuned_allreduce_intra_dec_fixed ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/openmpi/mca_coll_tuned.so
No symbol table info available.
#8  0x000040002c23b4e8 in PMPI_Allreduce () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#9  0x000040002c0161d0 in fftwf_mpi_any_true ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#10 0x000040002c067648 in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#11 0x000040002c06781c in fftwf_mkplan_d ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#12 0x000040002c01ef0c in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#13 0x000040002c0670e8 in search0 () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#14 0x000040002c0673a4 in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#15 0x000040002c06781c in fftwf_mkplan_d ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#16 0x000040002c01e49c in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#17 0x000040002c0670e8 in search0 () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#18 0x000040002c0673a4 in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#19 0x000040002c0e83ac in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#20 0x000040002c0e85a0 in fftwf_mkapiplan ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#21 0x000040002c017aac in fftwf_mpi_plan_guru_r2r ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#22 0x000040002c017bcc in fftwf_mpi_plan_many_r2r ()
   from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#23 0x0000000000404928 in mkplan ()
No symbol table info available.
#24 0x0000000000405778 in setup ()
No symbol table info available.
#25 0x00000000004085e0 in verify ()
No symbol table info available.
#26 0x0000000000406498 in bench_main ()
No symbol table info available.
#27 0x000040002c346a7c in ?? () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6
No symbol table info available.
#28 0x000040002c346b4c in __libc_start_main () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6
No symbol table info available.
#29 0x0000000000402f30 in _start ()
No symbol table info available.

from fftw3.

boegel avatar boegel commented on September 22, 2024

> Hm, while trying to reproduce my hang (which I didn't succeed in yet), I noticed something: the automatic initialization script from EESSI thinks this node is a neoverse_n1. I seem to remember some chatter about this architecture not being detected properly, but thought we fixed that - maybe not. Anyway, it will build against dependencies optimized on neoverse_n1. I'm pretty sure our build bot overrides this automatic CPU architecture detection, but maybe @boegel can confirm... It would at least point to one difference between what our bot does, and what I get interactively.

Our bot indeed overrides the CPU auto-detection during building, because archspec is sometimes a bit too pedantic (see for example archspec/archspec-json#38).

In software.eessi.io we've switched to our own pure bash archdetect mechanism, which is less pedantic, but that's not used during build either: the build bot just sets $EESSI_SOFTWARE_SUBDIR_OVERRIDE based on its configuration.

from fftw3.

lrbison avatar lrbison commented on September 22, 2024

Seems like we (you) are making progress! I tried to add your override. Here is my eb config:

buildpath            (E) = /tmp/try1/easybuild/build
containerpath        (E) = /tmp/try1/easybuild/containers
debug                (E) = True
experimental         (E) = True
filter-deps          (E) = Autoconf, Automake, Autotools, binutils, bzip2, DBus, flex, gettext, gperf, help2man, intltool, libreadline, libtool, Lua, M4, makeinfo, ncurses, util-linux, XZ, zlib, Yasm
filter-env-vars      (E) = LD_LIBRARY_PATH
hooks                (E) = /tmp/software-layer/eb_hooks.py
ignore-osdeps        (E) = True
installpath          (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing
module-extensions    (E) = True
packagepath          (E) = /tmp/try1/easybuild/packages
prefix               (E) = /tmp/try1/easybuild
read-only-installdir (E) = True
repositorypath       (E) = /tmp/try1/easybuild/ebfiles_repo
robot-paths          (D) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/EasyBuild/4.8.1/easybuild/easyconfigs
rpath                (E) = True
sourcepath           (E) = /tmp/try1/easybuild/sources:
sysroot              (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64
trace                (E) = True
zip-logs             (E) = bzip2

But I still don't get failures during testing.

I do think allreduce has the potential to be non-deterministic; however, I'm unsure whether the ompi_coll_base_allreduce_intra_ring implementation is deterministic or not.

I wonder, is there a way for me to continually run the test without rebuilding each time?

from fftw3.

casparvl avatar casparvl commented on September 22, 2024

It is possible. What you could do is stop the EasyBuild installation after a certain point using the --stop argument. You can do that by editing the yaml file so that it looks like this at the end:

  - FFTW.MPI-3.3.10-gompi-2022a.eb:
      options:
        rebuild: True
        stop: 'build'

This should stop it after the build step (and before the test step). Then, you'd want to run

eb FFTW.MPI-3.3.10-gompi-2022a.eb --dump-env-script

This will dump a script FFTW.MPI-3.3.10-gompi-2022a.env that you can source to get the same environment that EasyBuild has during the build. Then, check one of your prior builds (done before you added the 'stop' in the yaml file) to see what command was executed by EasyBuild as its test step and in which directory. The logs are pretty verbose, so it may be a bit of a puzzle to find, but at least it shows all that information.

Then, source that FFTW.MPI-3.3.10-gompi-2022a.env and go to the directory in which EasyBuild normally runs its test step. (Use the equivalent directory if your tempdir differs between your stopped build and the prior build whose logs you inspected; the prefix might look a little different.) There, run the command that EasyBuild ran as its 'test step'. That last command, you should be able to put in a loop.
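
A loop along these lines could be used for that. It is a minimal sketch: run_until_failure is a hypothetical helper, and the command passed to it has to be replaced by whatever test command the EasyBuild log shows.

```shell
# Hypothetical helper: run a command up to MAX_RUNS times and stop at the
# first failure, reporting which iteration failed.
# Usage: run_until_failure MAX_RUNS COMMAND [ARGS...]
run_until_failure() {
    max_runs="$1"; shift
    i=1
    while [ "$i" -le "$max_runs" ]; do
        if ! "$@"; then
            echo "run $i failed"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $max_runs runs passed"
}
```

If the test step turns out to be make check, for example, `run_until_failure 100 timeout 30m make check` would run it up to 100 times. Wrapping the command in timeout (GNU coreutils) is an assumption added here so that a hang also counts as a failure instead of blocking the loop.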

from fftw3.

casparvl avatar casparvl commented on September 22, 2024

By the way, your installpath from the eb --show-config shows that you are indeed using the neoverse_v1 copy of the software stack (which should be the case since you use the override), so that's good.
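
That check can also be scripted. A small sketch, assuming the installpath line has the format shown in the eb --show-config output above; installpath_uses_subdir is a hypothetical helper name:

```shell
# Hypothetical helper: succeed only if the installpath reported by
# `eb --show-config` (read from stdin) contains the expected CPU subdirectory.
installpath_uses_subdir() {
    awk -v want="$1" '
        $1 == "installpath" { found = 1; ok = (index($0, "/" want "/") > 0) }
        END { exit !(found && ok) }
    '
}
```

It could be used as `eb --show-config | installpath_uses_subdir aarch64/neoverse_v1 && echo "override in effect"`.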

I'm absolutely puzzled by why things are different for you than for us. Short of seeing if we could have you test things on our cluster, I don't know what else to try for you to reproduce the failure... :/ If that's something you would be up for, see if you can reach out to @boegel on the EESSI Slack in a DM (join here if you're not yet on that channel); he might be able to arrange it for you.

@boegel maybe you could also do the reverse: spin up a regular VM outside of our Magic Castle setup and see if you can reproduce the issue there? If not, it must be related to our cluster setup somehow...

Also a heads up: I'm going to be on quite a long leave, so won't be able to respond for the next month or so. Again, maybe @boegel can follow up if needed :)

from fftw3.

lrbison avatar lrbison commented on September 22, 2024

Thank you for the testing insight and the slack invite. Enjoy the break. I'll talk to @boegel on slack and see what he thinks is a reasonable next step.

from fftw3.

boegel avatar boegel commented on September 22, 2024

@lrbison When would you like to follow up on this?

from fftw3.

lrbison avatar lrbison commented on September 22, 2024

I talked offline with Kenneth.

In the meantime, my pattern-matching neurons fired:

Both #334 (comment) and https://gitlab.com/eessi/support/-/issues/24#note_1734228961 have something in common:

Both are in mca_btl_smcuda_component_progress from the smcuda module, but I recall smcuda should really only be engaged when CUDA/ROCm/{accelerator} memory is used; otherwise we should be using the SM BTL. I'll follow up on that.

Another similarity is that, although the fftw backtrace is just from a sendrecv, the hang was stopped during allreduce, and both the OpenFOAM and FFTW cases were doing ompi_coll_base_allreduce_intra_recursivedoubling. However, my gut tells me it's not the reduction at fault but rather the progress engine (partially because I know for a fact we are testing that allreduce function daily without issue).

from fftw3.

lrbison avatar lrbison commented on September 22, 2024

Moving the rest of this discussion to https://gitlab.com/eessi/support/-/issues/41

from fftw3.

lrbison avatar lrbison commented on September 22, 2024

The root cause was open-mpi/ompi#12270, fixed in open-mpi/ompi#12338, so this issue can be closed.

from fftw3.

rdolbeau avatar rdolbeau commented on September 22, 2024

For Neoverse V1 users, if you can also try and report on the release-for-testing in #315 it would be useful to get SVE support upstream.

from fftw3.

rdolbeau avatar rdolbeau commented on September 22, 2024

Closing as requested.

from fftw3.
